POS Tagger
What is POS tagging
The process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
Source: Wikipedia
However, for the purposes of natural language processing, a more fine-grained part-of-speech tagging is required, with tags such as noun-plural, verb-gerund, noun-proper, noun-proper-plural, etc.
FIN POS tagger scored 96.28% accuracy on the Penn Treebank test and is considered one of the fastest POS taggers that score above 95%, processing 132K tokens in 38 seconds. It's also safe to say that this is the most accurate and fastest POS tagger ever written in JavaScript.
Annotation Specifications
For a list of all possible tags please refer to the specification.
Accuracy and performance
In short:
When smoothing is enabled: 96.28% accuracy (processing 132K tokens in 38 seconds).
When smoothing is disabled: 94.38% accuracy (processing 132K tokens in 3 seconds).
As of 25 Jan 2017, this library scored 96.28% on the Penn Treebank test (0.3% away from being a state-of-the-art tagger).
I think it's safe to say that this is the most accurate POS tagger written in JavaScript: the only other JS library I know of is pos-js, which scored 87.8% when I tested it on the same treebank, though it was faster than my implementation with smoothing enabled.
However, if performance is what you're after rather than accuracy, you have the option to disable smoothing in this library. This cuts processing time from 38 seconds to 3 seconds on the same test, making this library even faster than pos-js while still being far more accurate (94.38%).
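For reference, the figures above can be read as token-level accuracy, meaning the share of tokens whose predicted tag matches the gold tag in the treebank (this reading is an assumption, though it is the customary one for this test). A minimal sketch of that measurement and of the throughput the quoted timings imply follows. The function `tagAccuracy` and its arguments are hypothetical and not part of this library, and 132K is taken as 132,000 tokens.

```javascript
// Hypothetical helper, not part of the library's API.
// Returns the fraction of positions where the predicted tag equals the gold tag.
function tagAccuracy(predicted, gold) {
  if (predicted.length !== gold.length) {
    throw new Error("predicted and gold must have the same length");
  }
  if (gold.length === 0) return 0;
  let correct = 0;
  for (let i = 0; i < gold.length; i++) {
    if (predicted[i] === gold[i]) correct++;
  }
  return correct / gold.length;
}

// Throughput implied by the timings quoted above (132K taken as 132,000 tokens).
const tokens = 132000;
const tokensPerSecondSmoothed = tokens / 38;  // roughly 3,474 tokens per second
const tokensPerSecondUnsmoothed = tokens / 3; // 44,000 tokens per second
```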
Under the hood
Tagging an array of tokens (i.e. a lexed sentence) is done in two steps:
Initial tagging
Smoothing
Initial Tagging
Initial tagging is the process of annotating each token regardless of the context in which this token is used. This is done through:
Matching the token against a lexicon
Matching the token against a list of given names
Matching the token against a list of cities
Checking if the token is a contraction
Analyzing compound words (like: geo-location)
Matching the token suffix against known rules
Matching the token against a list of informal and slang words
Removing repetitive characters and then matching the token against the rules above
Matching the token against a regular expression that detects proper nouns
If none of the rules above gives a result, then the tag defaults to singular noun (NN) or plural noun (NNS) based on inflection.
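The lookup chain above can be pictured as an ordered list of matchers tried one after another, with the noun default as the last resort. The following is a minimal sketch only: the tiny `lexicon`, `givenNames`, and `suffixRules` tables and the function `initialTag` are made up for illustration and are not this library's data or API, and only some of the steps are shown.

```javascript
// Illustrative data only; a real lexicon and rule set would be far larger.
const lexicon = new Map([
  ["the", "DT"], ["run", "VB"], ["we", "PRP"], ["every", "DT"], ["so", "RB"],
]);
const givenNames = new Set(["alice", "john"]);
const suffixRules = [
  { suffix: "ing", tag: "VBG" },
  { suffix: "ly", tag: "RB" },
  { suffix: "ed", tag: "VBD" },
];

// Each matcher returns a tag, or undefined to let the next matcher try.
const matchers = [
  // 1. Lexicon lookup.
  (token) => lexicon.get(token.toLowerCase()),
  // 2. Given names are tagged as proper nouns.
  (token) => (givenNames.has(token.toLowerCase()) ? "NNP" : undefined),
  // 3. Suffix rules.
  (token) => {
    const lower = token.toLowerCase();
    const rule = suffixRules.find((r) => lower.endsWith(r.suffix));
    return rule ? rule.tag : undefined;
  },
  // 4. Collapse runs of three or more repeated characters, then retry the lexicon.
  (token) => lexicon.get(token.toLowerCase().replace(/(.)\1{2,}/g, "$1")),
  // 5. A capitalized word is assumed to be a proper noun.
  (token) => (/^[A-Z][a-z]+$/.test(token) ? "NNP" : undefined),
];

function initialTag(token) {
  for (const match of matchers) {
    const tag = match(token);
    if (tag) return tag;
  }
  // Default: plural noun if the token ends in a single "s", else singular noun.
  return /[^s]s$/i.test(token) ? "NNS" : "NN";
}
```

With this toy data, `initialTag("quickly")` falls through to the suffix rules, `initialTag("sooo")` is resolved after the repeated characters are collapsed, and `initialTag("tables")` reaches the plural-noun default. Note that the context is never consulted, so `run` always gets the same tag here.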
Smoothing
Smoothing is the process of correcting the initial given tag based on the context in which the token is used.
For example, most English verbs have the same spelling for both the past tense and the past participle (an ed suffix). Another case is the word run in these two sentences: "the run lasted 5 minutes" and "we run every day".
So to disambiguate the token's POS tag, we must check the relevant tags that come before or after this token. This is done through a set of rules extracted through machine learning, plus a few manual rules.
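One way to picture this step is as a list of contextual rules, each of which rewrites a tag when a neighbouring tag matches a pattern. The sketch below is an illustration under stated assumptions, not the library's actual rule set or API: the three rules, the rule format, and the `smooth` function are hypothetical, only the preceding tag is inspected, and the initial tags are hard-coded.

```javascript
// Hypothetical contextual rules: rewrite `from` as `to` when the previous tag is in `prev`.
const contextRules = [
  { from: "VB", to: "NN", prev: ["DT"] },          // "the run": noun after a determiner
  { from: "VB", to: "VBP", prev: ["PRP"] },        // "we run": present-tense verb after a pronoun
  { from: "VBD", to: "VBN", prev: ["VBZ", "VBP"] }, // "has walked": past participle after an auxiliary
];

// Returns a corrected copy of the initial tags; the input array is not modified.
function smooth(initialTags) {
  const result = initialTags.slice();
  for (let i = 1; i < result.length; i++) {
    for (const rule of contextRules) {
      if (result[i] === rule.from && rule.prev.includes(result[i - 1])) {
        result[i] = rule.to;
        break; // apply at most one rule per token
      }
    }
  }
  return result;
}

// "the run lasted 5 minutes": run goes from VB to NN
const nounReading = smooth(["DT", "VB", "VBD", "CD", "NNS"]);
// "we run every day": run goes from VB to VBP
const verbReading = smooth(["PRP", "VB", "DT", "NN"]);
```

The same initial tag for `run` ends up as a noun in the first sentence and a verb in the second, purely because of the tag that precedes it.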