Processing
When you pass a string to Fin, it will go through a number of processors:
First, it will be intercepted by any preprocessors you define. Preprocessors are simply functions that take the input string passed to Fin.Run and return another string, which will be used for all further analysis (see the sketch after these steps).
Then it will be lexed (tokenized): it will be split into sentences and each sentence will be split further into tokens.
Then each token will be POS tagged: annotated with its relevant part-of-speech tag, like noun, verb, etc.
Then each sentence will be processed to resolve its dependency tree.
Finally, postprocessors will take the result, do whatever they're supposed to do with it, and return a similar object.
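As a rough sketch of these two extension points (the functions below are hypothetical; only their shapes come from the description above, string in/string out for a preprocessor and result object in/result object out for a postprocessor):

```js
// Hypothetical preprocessor: strip URLs and normalize whitespace before
// any further analysis. Takes the string passed to Fin.Run, returns another string.
function stripUrls(input) {
  return input.replace(/https?:\/\/\S+/g, "").replace(/\s+/g, " ").trim();
}

// Hypothetical postprocessor: takes the processing result object and
// returns a similar object, here with an added sentence count.
function withSentenceCount(result) {
  return { ...result, sentenceCount: result.sentences.length };
}
```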
Imagine that you're mining data to get an idea of the most loved/hated cars. The brands you're looking for are BMW, Lexus, and Chevrolet, and the keywords that dictate the sentiment towards these cars are:
Positive Sentiment:
love: 1
adore: 1
perfect: 1
amazing: 1
Negative Sentiment:
broken: -1
hate: -1
waste: -1
old: -1
Now using the tokens, you can detect exactly where the above keywords are mentioned, and using the dependency tree you can see exactly what those keywords are describing. If a positive keyword describes a BMW then you add a point to the overall BMW score. If a negative keyword describes a BMW then you subtract one point from the overall BMW score.
So you can imagine that after mining megabytes of data you'll end up with a score for each car brand.
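Here is a minimal sketch of that scoring loop. It is not the library's own code; it only assumes that each sentence result exposes the tokens, deps, and parent fields described later in this article, and it simplifies things by checking only a keyword's direct parent in the dependency tree:

```js
const brands = ["bmw", "lexus", "chevrolet"];
const weights = {
  love: 1, adore: 1, perfect: 1, amazing: 1,
  broken: -1, hate: -1, waste: -1, old: -1
};

// Add one sentence's keyword hits to the running per-brand scores.
function scoreSentence(sentenceResult, scores) {
  const tokens = sentenceResult.tokens.map(t => t.toLowerCase());
  sentenceResult.deps.forEach((dep, i) => {
    const weight = weights[tokens[i]];
    if (!weight) return;             // this token is not a sentiment keyword
    const head = tokens[dep.parent]; // the word the keyword attaches to
    if (head && brands.includes(head)) {
      scores[head] = (scores[head] || 0) + weight;
    }
  });
}

// Usage: accumulate over every sentence of every processed document.
const scores = {};
// result.sentences.forEach(s => scoreSentence(s, scores));
```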
In real-world cases, annotating tokens with just noun, verb, object, or subject might not suffice, so the processors do a little more than that. Verbs can actually be VBG for the gerund form, VBZ for the third-person form, VBD for the past form, VBP for the present form, etc., while nouns can be NN for singular nouns, NNS for plural nouns, NNP for proper nouns, etc.
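If you only need the coarse word class, one option is to group the detailed tags by their prefix, since all verb tags start with VB and all noun tags start with NN. A small sketch:

```js
// Collapse detailed tags into coarse classes by prefix.
function coarseClass(tag) {
  if (tag.startsWith("VB")) return "verb"; // VB, VBD, VBG, VBN, VBP, VBZ
  if (tag.startsWith("NN")) return "noun"; // NN, NNS, NNP, NNPS
  return tag;                              // leave other tags unchanged
}

console.log(coarseClass("VBD")); // "verb"
console.log(coarseClass("NNP")); // "noun"
```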
For a complete list of annotations head to the specification articles.
As you may have seen from the usage example, the processing result object has 3 main keys:
raw: a string that is exactly the raw input, without any processing or preprocessing.
intercepted: the string that results from the preprocessors working on the raw input.
sentences: an array of objects. Each element in this array represents a sentence result and has the following keys:
sentence: a string representing this single sentence.
tokens: an array of tokens (words), e.g. ["I","had","burgers","."]
tags: the POS annotations of the tokens, e.g. ["PRP","VBD","NNS","."]
deps: an array of objects; each object describes the relationship of one token to other tokens and has the following keys:
label: the relationship of this token to its parent, e.g. for a direct object you would see DOBJ.
type: the type of this phrase; a verbal phrase would be VP while a nominal phrase would be NP. If no special phrase type is detected, it defaults to the token's tag.
parent: a number that refers to the index of the token's parent. If it's -1 then this token is the ROOT of the sentence (also called the master or governor).
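To make that shape concrete, here is a small sketch (not the library's own example) that walks a result object with these keys and prints each token with its tag, dependency label, and parent:

```js
// Print every token with its POS tag, dependency label, and head token.
function printDependencies(result) {
  result.sentences.forEach(s => {
    console.log("Sentence:", s.sentence);
    s.deps.forEach((dep, i) => {
      const head = dep.parent === -1 ? "ROOT" : s.tokens[dep.parent];
      console.log(`  ${s.tokens[i]} (${s.tags[i]}) --${dep.label}--> ${head}`);
    });
  });
}
```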
In the next few articles we'll go through each of the processors and how to interpret the data they give us.