Pre and Post Processing

The Problem

The processing steps can only be so smart; they cannot possibly handle every real-world case. This is why Fin has been designed to be extensible. Let's take another real-world problem:

If Fin receives this sentence:

Rick & Morty is a good show

It will be able to do all the processing correctly. However, if it receives an encoded version:

Rick &amp; Morty is a good show

Things won't be so accurate. The &amp; is an encoded ampersand (&), known as an HTML entity. HTML entities can be expected in web content such as social media posts and comments.

If we run the above example, Rick &amp; Morty is a good show, in Fin:

  • & will be considered a coordinating conjunction.

  • amp will be considered a noun.

  • ; will be considered mid-sentence punctuation.

This is obviously wrong, and it will lead to inaccurate POS tagging, and thus inaccurate dependency parsing.
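To see why the entity breaks apart this way, here is a minimal sketch of a naive tokenizer. It is not Fin's actual lexer; `naiveTokenize` is a hypothetical helper that only illustrates how splitting on word characters and punctuation turns one entity into three tokens.

```typescript
// Hypothetical helper for illustration only; Fin's real lexer is more sophisticated.
function naiveTokenize(input: string): string[] {
  // Match either a run of word characters or a single non-space punctuation character.
  return input.match(/\w+|[^\w\s]/g) ?? [];
}

// The single entity "&amp;" is split into "&", "amp" and ";".
const tokens = naiveTokenize("Rick &amp; Morty");
console.log(tokens);
```

Each of the three fragments then receives its own tag, which is how one encoded character ends up distorting the rest of the analysis.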

The Solution

To solve this problem (and other similar problems), we need to use preprocessors. A preprocessor is an intercepting function: it receives the input string before any processing happens, transforms it (here, by decoding it), and returns the transformed version.

import * as Fin from "finnlp";
// Decode the HTML entity &amp; back into a plain ampersand before processing
Fin.preProcessors.push((str) => str.replace(/&amp;/gi, "&"));

The interceptor we defined above will take the string and replace all occurrences of &amp; with &.
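Web input usually contains more than one kind of entity. The following is a minimal sketch of a broader decoder, assuming only a handful of common entities need handling; `ENTITIES` and `decodeEntities` are hypothetical names, not part of Fin.

```typescript
// Hypothetical lookup table covering a few common HTML entities (not exhaustive).
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
};

// Hypothetical helper: decodes the entities listed above in a single pass,
// so text produced by one replacement is never decoded a second time.
function decodeEntities(input: string): string {
  return input.replace(
    /&(?:amp|lt|gt|quot|#39);/gi,
    (match) => ENTITIES[match.toLowerCase()] ?? match
  );
}

console.log(decodeEntities("Rick &amp; Morty is a good show"));
```

Because the function takes a string and returns a string, it has the same shape as the interceptor above and could presumably be registered the same way, with `Fin.preProcessors.push(decodeEntities)`.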

Postprocessors

Much like how preprocessors intercept the input string, postprocessors intercept the result object before it is returned to the caller.
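The flow through both kinds of interceptors can be sketched as a small standalone pipeline. Everything here is an assumption made for illustration: the `Result` shape, the `analyze` stand-in, and the `run` function are hypothetical and do not reflect Fin's internals or the real structure of its result object.

```typescript
// Hypothetical result shape; Fin's real result object carries much more information.
interface Result {
  tokens: string[];
}

type PreProcessor = (input: string) => string;
type PostProcessor = (result: Result) => Result;

const preProcessors: PreProcessor[] = [];
const postProcessors: PostProcessor[] = [];

// Stand-in for the real processing step: a plain whitespace tokenizer.
function analyze(input: string): Result {
  return { tokens: input.split(/\s+/).filter((t) => t.length > 0) };
}

function run(input: string): Result {
  // 1. Each preprocessor rewrites the input string in turn.
  const prepared = preProcessors.reduce((text, fn) => fn(text), input);
  // 2. The main processing step runs on the cleaned-up string.
  const result = analyze(prepared);
  // 3. Each postprocessor rewrites the result object before it reaches the caller.
  return postProcessors.reduce((res, fn) => fn(res), result);
}

preProcessors.push((text) => text.replace(/&amp;/gi, "&"));
postProcessors.push((res) => ({ ...res, tokens: res.tokens.map((t) => t.toLowerCase()) }));

console.log(run("Rick &amp; Morty").tokens);
```

Keeping both lists as plain arrays of functions means interceptors run in the order they were pushed, so the order of registration matters when one depends on another.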
