fin NLP

POS Tagger


Last updated 6 years ago


What is POS tagging

Part-of-speech (POS) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

  • Source: Wikipedia

However, natural language processing requires more fine-grained part-of-speech tags, such as noun-plural, verb-gerund, noun-proper, and noun-proper-plural.
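As a small illustration of what fine-grained tags look like, the sketch below maps the four categories named above to their Penn Treebank tags and renders a hand-tagged sentence in the usual word/TAG notation. The sentence and the variable names are made up for this example; nothing here is output from, or part of, the library.

```javascript
// Penn Treebank tags for the fine-grained categories named above.
const FINE_GRAINED_TAGS = {
  "noun-plural": "NNS",
  "verb-gerund": "VBG",
  "noun-proper": "NNP",
  "noun-proper-plural": "NNPS",
};

// A hand-tagged sentence (illustrative only, not produced by any library).
const tagged = [
  ["Tourists", "NNS"],
  ["love", "VBP"],
  ["visiting", "VBG"],
  ["Paris", "NNP"],
];

// Render in the conventional word/TAG notation.
const rendered = tagged.map(([word, tag]) => `${word}/${tag}`).join(" ");
console.log(rendered); // Tourists/NNS love/VBP visiting/VBG Paris/NNP
```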

The Fin POS tagger scored 96.28% accuracy on the Penn Treebank test and is considered one of the fastest POS taggers that score above 95%, processing 132K tokens in 38 seconds. It is also, to my knowledge, the most accurate POS tagger written in JavaScript and, with smoothing disabled, the fastest.

Annotation Specifications

For a list of all possible tags, please refer to the specification.

Accuracy and performance

In short:

  • When smoothing is enabled: 96.28% accuracy (processing 132K tokens in 38 seconds).

  • When smoothing is disabled: 94.38% accuracy (processing 132K tokens in 3 seconds).
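To put the two figures side by side, the short calculation below derives throughput, speedup, and the accuracy cost of disabling smoothing. It assumes "132K" means 132,000 tokens, since the exact token count is not given.

```javascript
// Figures quoted above: 132K tokens in 38 s (smoothing on) and 3 s (smoothing off).
const TOKENS = 132000; // assumption: "132K" taken as 132,000

const runs = {
  smoothingOn: { seconds: 38, accuracy: 96.28 },
  smoothingOff: { seconds: 3, accuracy: 94.38 },
};

const tokensPerSecond = (run) => TOKENS / run.seconds;
const speedup = runs.smoothingOn.seconds / runs.smoothingOff.seconds;
const accuracyCost = runs.smoothingOn.accuracy - runs.smoothingOff.accuracy;

console.log(Math.round(tokensPerSecond(runs.smoothingOn)));  // 3474 tokens per second
console.log(Math.round(tokensPerSecond(runs.smoothingOff))); // 44000 tokens per second
console.log(speedup.toFixed(1));      // 12.7 (times faster without smoothing)
console.log(accuracyCost.toFixed(2)); // 1.90 (percentage points of accuracy lost)
```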

As of 25 Jan 2017, this library scored 96.28% on the Penn Treebank test (0.3% away from being a state-of-the-art tagger).

I think it's safe to say that this is the most accurate JavaScript POS tagger: the only other JS library I know of is pos-js, which scored 87.8% when I tested it on the same treebank, though it was faster than my implementation with smoothing enabled.

However, if speed is what you're after rather than accuracy, you have the option to disable smoothing. This makes tagging considerably faster (3 seconds instead of 38 for the same 132K tokens), making this library even faster than pos-js while still being far more accurate (94.38%).

Under the hood

Tagging an array of tokens (i.e. a lexed sentence) is done in two steps:

  1. Initial tagging

  2. Smoothing

Initial Tagging

Initial tagging is the process of annotating each token regardless of the context in which it is used. This is done through:

  • Matching the token against a lexicon

  • Matching the token against a list of given names

  • Matching the token against a list of cities

  • Checking if the token is a contraction

  • Analyzing compound words (like: geo-location)

  • Matching the token suffix against known rules

  • Matching the token against a list of informal and slang words

  • Removing repetitive characters and then matching the token against the rules above

  • Matching the token against a regular expression that detects proper nouns

If none of the rules above gives a result, then the tag defaults to singular noun (NN) or plural noun (NNS) based on inflection.

Smoothing

Smoothing is the process of correcting the initial tag based on the context in which the token is used.

For example, most English verbs have the same spelling for both the past tense and the past participle (an -ed suffix). Another case is the word "run" in these two sentences:

  • the run lasted 5 minutes

  • we run every day

So, to disambiguate the token's POS tag, we must check the relevant tags that come before or after this token. This is done through a set of rules extracted through machine learning, plus a few manual rules.
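The two steps described in this section can be sketched roughly as follows. This is a toy illustration, not the library's code or API: the lexicon, suffix rules, smoothing rules, and every function name (`initialTag`, `smooth`, `tagTokens`) are hypothetical. The smoothing rules here are hand-written, whereas the real ones are mostly extracted through machine learning. Several initial-tagging steps (given names, cities, contractions, compound words, slang) are left out for brevity.

```javascript
// Toy data: every entry below is hypothetical, not taken from the library.
const LEXICON = new Map(Object.entries({
  the: "DT", we: "PRP", have: "VBP", run: "VB", lasted: "VBD",
  every: "DT", day: "NN", minutes: "NNS", cool: "JJ", "5": "CD",
}));
const SUFFIX_RULES = [[/ing$/, "VBG"], [/ed$/, "VBD"], [/ly$/, "RB"]];
const PROPER_NOUN = /^[A-Z][a-z]+$/;

// Lexicon lookup first, then suffix rules; null when neither applies.
function lookup(token) {
  const word = token.toLowerCase();
  if (LEXICON.has(word)) return LEXICON.get(word);
  const rule = SUFFIX_RULES.find(([pattern]) => pattern.test(word));
  return rule ? rule[1] : null;
}

// Step 1: context-free tagging of a single token.
function initialTag(token) {
  const direct = lookup(token);
  if (direct) return direct;
  // Collapse runs of 3+ repeated characters ("cooool" -> "cool", then "col") and retry.
  for (const keep of ["$1$1", "$1"]) {
    const squeezed = token.replace(/(.)\1{2,}/g, keep);
    const found = squeezed !== token ? lookup(squeezed) : null;
    if (found) return found;
  }
  if (PROPER_NOUN.test(token)) return "NNP";
  // Fallback: plural or singular noun, using a crude inflection check.
  return token.toLowerCase().endsWith("s") ? "NNS" : "NN";
}

// Step 2: hand-written contextual corrections (stand-ins for learned rules).
const SMOOTHING_RULES = [
  // A base-form verb right after a determiner is probably a noun: "the run".
  (tags, i) => (tags[i] === "VB" && tags[i - 1] === "DT" ? "NN" : null),
  // A base-form verb right after a personal pronoun is present tense: "we run".
  (tags, i) => (tags[i] === "VB" && tags[i - 1] === "PRP" ? "VBP" : null),
  // An -ed form right after have/has/had is a past participle: "have walked".
  (tags, i, tokens) =>
    tags[i] === "VBD" && i > 0 && /^(have|has|had)$/i.test(tokens[i - 1]) ? "VBN" : null,
];

function smooth(tokens, initialTags) {
  const tags = [...initialTags];
  for (let i = 0; i < tags.length; i++) {
    for (const rule of SMOOTHING_RULES) {
      const corrected = rule(tags, i, tokens);
      if (corrected) { tags[i] = corrected; break; }
    }
  }
  return tags;
}

// Tag a lexed sentence; smoothing can be switched off to trade accuracy for speed.
function tagTokens(tokens, { smoothing = true } = {}) {
  const initial = tokens.map(initialTag);
  return smoothing ? smooth(tokens, initial) : initial;
}

console.log(tagTokens(["the", "run", "lasted", "5", "minutes"]).join(" ")); // DT NN VBD CD NNS
console.log(tagTokens(["we", "run", "every", "day"]).join(" "));            // PRP VBP DT NN
console.log(tagTokens(["the", "run", "lasted", "5", "minutes"], { smoothing: false }).join(" ")); // DT VB VBD CD NNS
```

In this sketch the word "run" receives the same initial tag (VB) in both sentences, and only the contextual rules separate the noun reading from the verb reading. That is the role smoothing plays, and skipping it is cheaper but less accurate.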