Processing

Introduction

When you pass a string to Fin, it will go through a number of processors:

  1. First, it will be intercepted by any preprocessors you define. Preprocessors are just functions that take the input string to Fin.Run and return another string, which will be used for all further analysis.

  2. Then it will be lexed (tokenized): it will be split into sentences and each sentence will be split further into tokens.

  3. Then each token will be POS tagged: annotated with its relevant part-of-speech annotation, like noun, verb, etc.

  4. Then each sentence will be processed to resolve its dependency tree.

  5. Finally, postprocessors will take the result, do whatever they're supposed to do with it, and return a similar object (a short sketch of this flow follows the list).
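
As a rough sketch of this flow in code: a preprocessor is a plain string-to-string function, and Fin.Run drives the rest of the pipeline. Since the exact registration hook is covered in the articles that follow, the preprocessor is applied manually here for illustration:

// A preprocessor is just a function: string in, string out.
// This illustrative one strips hashtags before analysis.
const stripHashtags = (input) => input.replace(/#\w+/g, "");

// Fin.Run lexes the string, POS tags each token, resolves each
// sentence's dependency tree, and hands the result to any
// postprocessors before returning it.
const result = Fin.Run(stripHashtags("I love my #new BMW"));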

Example

How can the above information be useful?

Imagine that you're mining data to get an idea about the most loved/hated cars. The brands you're looking for are BMW, Lexus, and Chevrolet, and the keywords that signal the sentiment towards these cars are:

  • Positive Sentiment:

    • love: 1

    • adore: 1

    • perfect: 1

    • amazing: 1

  • Negative Sentiment:

    • broken: -1

    • hate: -1

    • waste: -1

    • old: -1

Now, using the tokens, you can detect exactly where the above keywords are mentioned, and using the dependency tree you can see exactly what those keywords are describing. If a positive keyword describes a BMW, you add a point to the overall BMW score. If a negative keyword describes a BMW, you subtract one point from the overall BMW score.

So you can imagine that after mining megabytes of data, you'll end up with a score for each car brand.
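
Here is a minimal sketch of that scoring logic, assuming the result shape documented later in this article (parallel tokens and deps arrays, with parent holding a token index). Real text needs a more careful walk of the dependency tree; this version only checks direct links between a keyword and a brand:

const BRANDS = ["bmw", "lexus", "chevrolet"];
const SENTIMENT = {
    love: 1, adore: 1, perfect: 1, amazing: 1,
    broken: -1, hate: -1, waste: -1, old: -1
};

function scoreSentence({ tokens, deps }, scores) {
    tokens.forEach((token, i) => {
        const weight = SENTIMENT[token.toLowerCase()];
        if (weight === undefined) return;
        tokens.forEach((other, j) => {
            const brand = other.toLowerCase();
            if (!BRANDS.includes(brand)) return;
            // The keyword may attach to the brand ("the old BMW"),
            // or the brand may attach to the keyword ("I love BMW").
            if (deps[i].parent === j || deps[j].parent === i) {
                scores[brand] = (scores[brand] || 0) + weight;
            }
        });
    });
}

// `result` is assumed to be the output of Fin.Run on the mined text.
const scores = {};
result.sentences.forEach((s) => scoreSentence(s, scores));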

Real world annotations

In real-world cases, annotating tokens with just noun, verb, object, and subject might not suffice. The processors do a little more than that.

Verbs can actually be VBG for the gerund form, VBZ for the third-person singular present form, VBD for the past form, VBP for the non-third-person present form, and so on, while nouns can be NN for singular nouns, NNS for plural nouns, NNP for proper nouns, and so on.
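
For example, with these finer-grained (Penn Treebank-style) tags, a short sentence comes out like this:

tokens: ["She", "loves", "driving", "old", "cars"]
tags:   ["PRP", "VBZ", "VBG", "JJ", "NNS"]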

For a complete list of annotations, head to the specification articles.

The Processing result

As you may have seen from the usage example, the processing result object has 3 main keys:

  • raw: the raw input string, exactly as given, without any processing or preprocessing.

  • intercepted: the string produced by the preprocessors from the raw input.

  • sentences: an array of objects, where each element represents one sentence's result. Each object has the following keys:

    • sentence: a string holding this single sentence.

    • tokens: an array of tokens (words), e.g. ["I","had","burgers","."]

    • tags: POS annotations of the tokens, e.g. ["PRP","VBD","NNS","."]

    • deps: an array of objects, one per token; each object describes that token's relationship to the other tokens (see the traversal sketch after this list). Each object has the following keys:

      • label: the relationship of this token to its parent, e.g. for a direct object you would see DOBJ.

      • type: the type of the phrase, e.g. a verbal phrase would be VP while a nominal phrase would be NP. If no special phrase type is detected, it defaults to the token's tag.

      • parent: a number referring to the index of the token's parent. If it's -1, this token is the ROOT of the sentence (also called the master or governor).
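
As a quick sketch of how these keys fit together, the loop below resolves each token's parent index back to the parent token (result is assumed to be a processing result shaped like the full example that follows):

result.sentences.forEach(({ tokens, deps }) => {
    deps.forEach((dep, i) => {
        // parent === -1 marks the ROOT of the sentence.
        const head = dep.parent === -1 ? "ROOT" : tokens[dep.parent];
        console.log(tokens[i] + " --" + dep.label + "--> " + head);
    });
});

// For the full example below, this prints:
// this --NSUBJ--> is
// is --ROOT--> ROOT
// some --MWE--> text
// text --ATTR--> is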

Full example

{
    raw: "this is some text",
    intercepted: "this is some text",
    sentences: [{
        sentence: "this is some text",
        tokens: ["this", "is", "some", "text"],
        tags: ["DT", "VBZ", "DT", "NN"],
        deps: [
            { label: "NSUBJ", type: "NP", parent: 1 },
            { label: "ROOT", type: "VP", parent: -1 },
            { label: "MWE", type: "NP", parent: 3 },
            { label: "ATTR", type: "NP", parent: 1 }
        ]
    }]
}

Next articles

In the next few articles, we'll go through each of the processors and how to interpret the data they give us.
