# POS Tagger

### What is POS tagging <a href="#what-is-pos-tagging" id="what-is-pos-tagging"></a>

> The process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
>
> * *Source:* [*Wikipedia*](https://en.wikipedia.org/wiki/Part-of-speech_tagging)

However, for the purposes of natural language processing, a more fine-grained part-of-speech tagging is required, with tags such as *noun-plural*, *verb-gerund*, *noun-proper*, and *noun-proper-plural*.
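To make the difference between coarse and fine-grained tags concrete, here is a minimal sketch that maps fine-grained tags back to the school-level categories. It assumes Penn Treebank-style tag names (`NNS`, `VBG`, `NNP`, ...); the `COARSE` table and `coarseTag` helper are hypothetical and not part of the library.

```javascript
// Assumption: Penn Treebank-style tag names. See the specification
// for the tag set this library actually emits.
// COARSE and coarseTag are illustrative names, not library APIs.
const COARSE = new Map([
  ["NN", "noun"],    // noun, singular
  ["NNS", "noun"],   // noun, plural
  ["NNP", "noun"],   // proper noun, singular
  ["NNPS", "noun"],  // proper noun, plural
  ["VB", "verb"],    // verb, base form
  ["VBD", "verb"],   // verb, past tense
  ["VBG", "verb"],   // verb, gerund or present participle
  ["VBN", "verb"],   // verb, past participle
]);

// Collapse a fine-grained tag into its coarse category.
function coarseTag(tag) {
  return COARSE.get(tag) || "other";
}
```

Four different noun tags collapse into a single "noun" category here, which is exactly the information a fine-grained tagger preserves and a coarse one loses.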

> The FIN POS tagger scored `96.28%` accuracy on the Penn Treebank test, and it is considered one of the fastest POS taggers that score above 95%, processing 132K tokens in 38 seconds. To the author's knowledge, it is also the most accurate POS tagger written in JavaScript and, with smoothing disabled, the fastest.

### Annotation Specifications <a href="#annotation-specifications" id="annotation-specifications"></a>

For a list of all possible tags please refer to the [specification](https://app.gitbook.com/s/-LYkhCKGabZ1yLtwXI9o/specifications/dependency-parsing-annotations.html).

### Accuracy and performance <a href="#accuracy-and-performance" id="accuracy-and-performance"></a>

**In short:**

* When smoothing is enabled: `96.28%` accuracy (processing 132K tokens in 38 seconds).
* When smoothing is disabled: `94.38%` accuracy (processing 132K tokens in 3 seconds).

As of 25 Jan 2017, this library scored `96.28%` at the [Penn Treebank](http://www.cis.upenn.edu/~treebank/) test (0.3% away from being a [state of the art tagger](https://goo.gl/M0rzzb)).

Since it is written in JavaScript, I think it's safe to say that this is the most accurate JavaScript POS tagger: the only other JS library I know of is [pos-js](https://github.com/neopunisher/pos-js), which scored `87.8%` when I tested it on the same treebank, though it was faster than my implementation with smoothing enabled.

However, if performance is what you're after rather than accuracy, you have the option to disable smoothing. This cuts the processing time for the same 132K tokens from 38 seconds to 3 seconds, making this library even faster than pos-js while remaining far more accurate (**94.38%**).
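As a rough illustration of how figures like these can be derived, the sketch below computes token-level accuracy against gold tags and a tokens-per-second throughput. The `accuracy` and `tokensPerSecond` helpers are hypothetical, and this is not the benchmark script used to produce the numbers above.

```javascript
// Illustrative helpers only; not the library's benchmark code.

// Fraction of tokens whose predicted tag equals the gold (reference) tag.
function accuracy(predicted, gold) {
  if (predicted.length !== gold.length) {
    throw new Error("predicted and gold must have the same length");
  }
  if (gold.length === 0) return 0;
  let correct = 0;
  for (let i = 0; i < gold.length; i++) {
    if (predicted[i] === gold[i]) correct++;
  }
  return correct / gold.length;
}

// Throughput in tokens per second.
function tokensPerSecond(tokenCount, seconds) {
  return tokenCount / seconds;
}

// Throughput implied by the figures quoted in this section.
const withSmoothing = tokensPerSecond(132000, 38);   // about 3,474 tokens/s
const withoutSmoothing = tokensPerSecond(132000, 3); // 44,000 tokens/s
```

By these figures, disabling smoothing trades roughly 1.9 percentage points of accuracy for a speed-up of more than 12x.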

### Under the hood <a href="#under-the-hood" id="under-the-hood"></a>

Tagging an array of tokens (i.e. a lexed sentence) is done in two steps:

1. Initial tagging
2. Smoothing

#### Initial Tagging <a href="#initial-tagging" id="initial-tagging"></a>

Initial tagging is the process of annotating each token regardless of the context in which this token is used. This is done through:

* Matching the token against a [lexicon](https://github.com/FinNLP/en-lexicon)
* Matching the token against a [list of given names](https://github.com/FinNLP/humannames)
* Matching the token against a [list of cities](https://github.com/FinNLP/cities-list)
* Checking if the token is a [contraction](https://github.com/FinNLP/en-pos/blob/master/lib/tagging/contractions.js)
* Analyzing [compound words](https://github.com/FinNLP/en-pos/blob/master/lib/tagging/complex_words.js) (like: `geo-location`)
* Matching the token's suffix against [known rules](https://github.com/FinNLP/en-pos/blob/master/lib/tagging/suffixes.js)
* Matching the token against a [list of informal and slang words](https://github.com/FinNLP/en-pos/blob/master/lib/tagging/slang.js)
* [Removing repetitive characters](https://github.com/FinNLP/en-pos/blob/master/lib/tagging/repetitive.js) and then matching the token against the rules above
* Matching the token against [a regular expression that detects proper nouns](https://github.com/FinNLP/en-pos/blob/master/lib/tagging/potential_proper.js)
* If none of the rules above gives a result, then the tag defaults to singular noun (`NN`) or plural noun (`NNS`) based on [inflection](https://github.com/FinNLP/en-inflectors).
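The cascade above can be sketched as a chain of fallbacks. This is a simplified illustration rather than the library's implementation: it covers only some of the steps (lexicon, suffix rules, repeated characters, proper-noun pattern, noun default), and the `LEXICON`, `SUFFIX_RULES`, `lookup`, and `initialTag` names, the toy data, and the naive plural check are all assumptions made for the example.

```javascript
// Hypothetical toy data; the real lexicon and suffix rules are far larger.
const LEXICON = new Map([
  ["the", "DT"], ["we", "PRP"], ["run", "VB"], ["day", "NN"], ["cool", "JJ"],
]);
const SUFFIX_RULES = [
  [/ing$/, "VBG"],
  [/ly$/, "RB"],
  [/ed$/, "VBD"],
];

// Try the lexicon first, then the suffix rules; null if nothing matches.
function lookup(token) {
  const lower = token.toLowerCase();
  if (LEXICON.has(lower)) return LEXICON.get(lower);
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(lower)) return tag;
  }
  return null;
}

// Context-free tagging of a single token.
function initialTag(token) {
  let tag = lookup(token);
  if (tag) return tag;

  // Collapse runs of 3+ identical characters ("cooool" -> "cool"), then retry.
  const squeezed = token.replace(/(.)\1{2,}/g, "$1$1");
  if (squeezed !== token) {
    tag = lookup(squeezed);
    if (tag) return tag;
  }

  // Capitalized unknown word: treat as a potential proper noun.
  if (/^[A-Z][a-z]+$/.test(token)) return "NNP";

  // Default to noun. The trailing-"s" check is a naive stand-in for a
  // real inflection library.
  return /[^s]s$/i.test(token) ? "NNS" : "NN";
}
```

For example, `initialTag("cooool")` falls through the lexicon and suffix rules, is squeezed to `cool`, and then resolves to `JJ` from the toy lexicon. Because no context is consulted, `run` always receives `VB` here, whichever sentence it appears in.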

#### Smoothing <a href="#smoothing" id="smoothing"></a>

Smoothing is the process of correcting the initial given tag based on the context in which the token is used.

For example, most English verbs have the same spelling for both the **past** tense and the **past participle** (an `ed` suffix). Another case is the word `run` in these two sentences: `the run lasted 5 minutes` and `we run every day`. It is a noun in the first sentence and a verb in the second.

So to disambiguate the token's POS tag, we must check the relevant tags that come before or after this token. This process is done through a set of rules extracted through machine learning, plus a few manual rules.
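A minimal sketch of context-based correction follows. The three rules are hand-written for illustration and are not the rules the library learned; the `smooth` function, the `HAVE_FORMS` set, and the Penn Treebank-style tag names are assumptions.

```javascript
// Illustrative, hand-written context rules; not the library's learned rules.
const HAVE_FORMS = new Set(["have", "has", "had", "having"]);

// Takes the tokens and their initial tags, returns a corrected copy of the tags.
function smooth(tokens, tags) {
  if (tokens.length !== tags.length) {
    throw new Error("tokens and tags must have the same length");
  }
  const out = tags.slice();
  for (let i = 1; i < tokens.length; i++) {
    const prevTag = out[i - 1];
    const prevWord = tokens[i - 1].toLowerCase();

    if (out[i] === "VB" && prevTag === "DT") {
      // A "verb" right after a determiner is likely a noun: "the run".
      out[i] = "NN";
    } else if (out[i] === "VB" && prevTag === "PRP") {
      // A base-form verb after a pronoun is likely present tense: "we run".
      out[i] = "VBP";
    } else if (out[i] === "VBD" && HAVE_FORMS.has(prevWord)) {
      // An "-ed" form after a form of "have" is a past participle: "have walked".
      out[i] = "VBN";
    }
  }
  return out;
}
```

With these rules, `smooth(["the", "run", "lasted"], ["DT", "VB", "VBD"])` retags `run` as `NN`, while `smooth(["we", "run", "every", "day"], ["PRP", "VB", "DT", "NN"])` retags it as `VBP`. This sketch only looks at the preceding token; rules that inspect the following token would work the same way.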

