# Lexer

## Lexer <a href="#lexer" id="lexer"></a>

### What is lexing <a href="#what-is-lexing" id="what-is-lexing"></a>

Lexing is the process of splitting a string of characters into meaningful sections. The lexing library that **Fin** uses operates on two levels: first it identifies sentences and converts the input text into an array of sentences; then, within each sentence, it identifies tokens (words, punctuation, etc.) and converts the sentence into an array of tokens.

![Lexing in Natural Language Processing](/files/-LYkpC-9Usr4GSNn7oe_)

> **HINT** The lexer library is `~99%` compliant with the Penn Treebank corpus, which arguably makes it the most accurate natural language processing lexer implemented in JavaScript.

So technically, the lexer is what converts a string of text into an array of sentences and further into an array of tokens.

### Example <a href="#example" id="example"></a>

```javascript

var input = "O'reilly Media (formerly O'reilly Associates), is a 49%-owned company. I didn't address any of the emails. Mr. T.J., an employee is at E!. He said: \"$4,000 was the profit.\" We met T.J. around 08:30 in the morning.";
// This is quite a complex paragraph of text.
// We should learn a lot by running it through the lexer.

var processed = Fin.Run(input);

```

Let's look at the result of lexing:

* `processed.sentences.map(x=>x.sentence)`: gives the array of sentences

  ```javascript
  [
    "O'reilly Media (formerly O'reilly Associates), is a 49%-owned company.",
    "I didn't address any of the emails.",
    "Mr. T.J., an employee is at E!.",
    "He said: \"$4,000 was the profit.\"",
    "We met T.J. around 08:30 in the morning."
  ]
  ```
* `processed.sentences.map(x=>x.tokens)`: gives the token-level lexing result

  ```javascript
  [
    ["O'reilly","Media","(","formerly","O'reilly","Associates",")",",","is","a","49%-owned","company","."],
    ["I","did","n't","address","any","of","the","emails","."],
    ["Mr.","T.J.",",","an","employee","is","at","E!","."],
    ["He","said",":","\"","$4,000","was","the","profit",".","\""],
    ["We","met","T.J.","around","08:30","in","the","morning","."]
  ]
  ```
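
The two levels can be traversed together. The sketch below works on a hand-written object that mimics the shape shown above (each entry with a `sentence` string and a `tokens` array); it is a stand-in for the value returned by the library, not real library output, and `summarize` is a hypothetical helper introduced only for illustration.

```javascript
// Hand-written stand-in for the result shape shown above
// (assumption: each entry exposes `sentence` and `tokens`).
const processed = {
  sentences: [
    {
      sentence: "I didn't address any of the emails.",
      tokens: ["I", "did", "n't", "address", "any", "of", "the", "emails", "."]
    },
    {
      sentence: "We met T.J. around 08:30 in the morning.",
      tokens: ["We", "met", "T.J.", "around", "08:30", "in", "the", "morning", "."]
    }
  ]
};

// Hypothetical helper: pair every sentence with its token count.
function summarize(result) {
  return result.sentences.map(s => ({
    sentence: s.sentence,
    tokenCount: s.tokens.length
  }));
}

const summary = summarize(processed);
console.log(summary);
```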

### How it works <a href="#how-it-works" id="how-it-works"></a>

#### Separating sentences <a href="#separating-sentences" id="separating-sentences"></a>

The lexer identifies a sentence end by looking for punctuation marks that are usually found at the end of the sentence.

* **Full Stop**: `.`
* **Exclamation Mark**: `!`
* **Question Mark**: `?`
* **Ellipsis**: `…` or `...`

The above punctuation marks are considered separators between sentences.

> **NOTE** The lexer actually does more than that. For example, it checks whether a full stop follows an abbreviation, as in `Morty Jr. had fun last night with Mr. Barney.`, and it also detects sentences where the full stop sits inside parentheses or quotation marks, as in `I felt I'm "losing my mind." It was obvious.`
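
To make the idea concrete, here is a deliberately simplified sketch of sentence splitting. It is not the library's implementation: `splitSentences` and the five-entry `ABBREVIATIONS` set are hypothetical names used only for illustration, and the real lexer handles many more cases (for example initials such as `T.J.`).

```javascript
// Tiny illustrative abbreviation list (assumption: NOT the library's dictionary).
const ABBREVIATIONS = new Set(["mr", "mrs", "dr", "jr", "sr"]);

// Hypothetical helper: split text on ., !, ?, … or ... unless the
// full stop directly follows a known abbreviation.
function splitSentences(text) {
  const sentences = [];
  let start = 0;
  // A terminator, optional closing quote/bracket, then whitespace or end of text.
  const boundary = /(\.\.\.|[.!?…])(["')\]]*)(\s+|$)/g;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[1].length + match[2].length;
    if (match[1] === ".") {
      // Inspect the word right before the full stop.
      const before = text.slice(start, match.index);
      const lastWord = before.split(/\s+/).pop().toLowerCase();
      if (ABBREVIATIONS.has(lastWord)) continue; // abbreviation, not a sentence end
    }
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  // Keep any trailing text that has no terminator.
  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

const demoSentences = splitSentences(
  'Morty Jr. had fun last night with Mr. Barney. I felt I\'m "losing my mind." It was obvious.'
);
console.log(demoSentences);
```

Keeping the closing quote or bracket attached to the terminator is what lets `"losing my mind."` end a sentence even though the full stop is not the last character.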

#### Separating tokens <a href="#separating-tokens" id="separating-tokens"></a>

* Any two words separated by a space are considered two different tokens.
* Additionally, some words contain multiple tokens and need to be split further, for example:
  * Opening parentheses, e.g. `(something`, and closing parentheses, e.g. `something)`.
  * Punctuation marks that have no space between them and the previous word, e.g. `alex;`.
  * English language contractions, e.g. `I'll` and `I'm`.
  * Symbols, e.g. `$33`.

So based on the above rules you'll get an array of tokens.
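
The rules above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's code: `tokenize` is a hypothetical helper, it does not consult an abbreviation dictionary (so `Mr.` would be split), and it leaves symbols such as `$` attached to the word.

```javascript
// Hypothetical helper: split one sentence into tokens using simplified rules.
function tokenize(sentence) {
  const tokens = [];
  for (let chunk of sentence.split(/\s+/).filter(Boolean)) {
    const leading = [];
    const trailing = [];
    // Peel opening brackets and quotes off the front.
    while (chunk.length > 1 && /^[(\["]/.test(chunk)) {
      leading.push(chunk[0]);
      chunk = chunk.slice(1);
    }
    // Peel closing brackets, quotes, and punctuation off the end.
    while (chunk.length > 1 && /[)\]",;:.!?]$/.test(chunk)) {
      trailing.unshift(chunk[chunk.length - 1]);
      chunk = chunk.slice(0, -1);
    }
    // Split English contractions: "didn't" -> "did" + "n't", "I'll" -> "I" + "'ll".
    const contraction = chunk.match(/^(.+?)(n't|'(?:ll|m|re|ve|s|d))$/i);
    const core = contraction ? [contraction[1], contraction[2]] : [chunk];
    tokens.push(...leading, ...core, ...trailing);
  }
  return tokens;
}

const demoTokens = tokenize("I didn't (really) address alex; I'll go.");
console.log(demoTokens);
```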

### Extensibility <a href="#extensibility" id="extensibility"></a>

As mentioned in the note above, the lexer takes into consideration a dot (`.`) that comes after an abbreviation, as in `Morty Jr. had a burger`, so it doesn't treat it as a sentence end.

This is based on a dictionary of common abbreviations. The dictionary includes about 160 common abbreviations, which should suffice for most use cases, but if you need to extend it you can do something like the following:

```typescript
import {abbreviations} from "lexed";
abbreviations.push("mme"); // French abbreviation for Madame.
```
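
When adding several entries, it may help to normalize them and skip duplicates. The sketch below uses a local array as a stand-in for the exported `abbreviations` list and assumes, based on the `"mme"` example above, that entries are lowercase and stored without the trailing dot; `addAbbreviations` is a hypothetical helper, not part of the library.

```javascript
// Stand-in for the exported `abbreviations` array
// (assumption: lowercase strings without the trailing dot).
const abbreviations = ["mr", "mrs", "jr"];

// Hypothetical helper: add each entry once, lowercased and without a trailing dot.
function addAbbreviations(list, extra) {
  for (const item of extra) {
    const normalized = item.toLowerCase().replace(/\.$/, "");
    if (!list.includes(normalized)) list.push(normalized);
  }
  return list;
}

addAbbreviations(abbreviations, ["Mme.", "mr", "Mlle"]);
console.log(abbreviations);
```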

### Standalone usage <a href="#standalone-usage" id="standalone-usage"></a>

The lexer library can also be used as a standalone package:

```
npm i --save lexed
```

For more about the lexer library (i.e. **lexed**), refer to its [readme.md](https://github.com/FinNLP/lexed/blob/master/readme.md).

