Tokenization

A tokenizer takes raw text and splits it into string tokens. It also returns the begin (inclusive) and the end (exclusive) character offsets from the original text for each token. ELIT’s tokenizers provide an option of performing several types of sentence segmentation, which groups chunks of consecutive tokens into sentences:

0: no segmentation.
1: segment by newlines (\n).
2: segment by symbol rules.
3: segment by 1 and 2.

Space Tokenizer

The Space Tokenizer splits input text by whitespaces, which is useful when the input text is already tokenized (either manually or by some other tool) such that no further tokenization is necessary.

Associated models: elit_tok_space_en
API reference: SpaceTokenizer
Decode parameters: * segment: 0, 1 (default), 2, or 3

Web API

{"model": "elit_tok_space_en", "args": {"segment": 1}}

Python API

from elit.component import SpaceTokenizer
tok = SpaceTokenizer()
text = [
    'This is the 1st sentence\nThis is the 2nd sentence',
    'This is the 3rd sentence\nThis is the 4th sentence']
print(tok.decode(text, segment=1))  # segment by newlines (default)

Output

[
  {
    "doc_id": 0,
    "sens": [
      {
        "sid": 0,
        "tok": ["This", "is", "the", "1st", "sentence"],
        "off": [[0, 4], [5, 7], [8, 11], [12, 15], [16, 24]]
      },
      {
        "sid": 1,
        "tok": ["This", "is", "the", "2nd", "sentence"],
        "off": [[25, 29], [30, 32], [33, 36], [37, 40], [41, 49]]
      }
    ]
  },
  {
    "doc_id": 1,
    "sens": [
      {
        "sid": 0,
        "tok": ["This", "is", "the", "3rd", "sentence"],
        "off": [[0, 4], [5, 7], [8, 11], [12, 15], [16, 24]]
      },
      {
        "sid": 1,
        "tok": ["This", "is", "the", "4th", "sentence"],
        "off": [[25, 29], [30, 32], [33, 36], [37, 40], [41, 49]]
      }
    ]
  }
]

English Tokenizer

The English Tokenizer splits input text into linguistic tokens using lexicalized rules.

Associated models: elit_tok_lexrule_en
API reference: EnglishTokenizer
Decode parameters: * segment: 0, 1, 2 (default), or 3

The following shows key features of this tokenizer:

Key Features
Feature	Input Text	Tokens
Email addresses	`Email (support@elit.cloud)`	`["Email", "(", "support@elit.cloud", ")"]`
Hyperlinks	`URL: https://elit.cloud`	`["URL", ":", "https://elit.cloud"]`
Emoticons	`I love ELIT :-)!?.`	`["I", "love", "ELIT", ":-)", "!?."]`
Hashtags	`ELIT is the #1 platform #elit2018.`	`["ELIT", "is", "the", "#", "1", "platform", "#elit2018", "."]`
HTML entities	`A←B`	`["A", "←", "B"]`
Hyphens	`(123) 456-7890, 123-456-7890, 2014-2018`	`["(123)", "456-7890", ",", "123-456-7890", ",", "2014", "-", "2018"]`
List items	`(A)First (A.1)Second [2a]Third [Forth]`	`["(A)", "First", "(A.1)", "Second", "[2a]", "Third", "[", "Forth", "]"]`
Units	`$1,000 20mg 100cm 11:00a.m. 10:30PM`	`["$", "1,000", "20", "mg", "100", "cm", "11:00", "a.m.", "10:30", "PM"]`
Acronyms	`I'm gonna miss Dr. Choi 'cause he isn't here.`	`["I", "'m", "gon", "na", "miss", "Dr.", "Choi", "'cause", "he", "is", "n't", "here", "."]`

Web API

{"model": "elit_tok_lexrule_en", "args": {"segment": 2}}

Python API

from elit.component import EnglishTokenizer
tok = EnglishTokenizer()
text = [
    "Mr. Johnson doesn't like cats! What's his favorite then?",
    "He likes puffy-dogs. He is gonna buy one."]
print(tok.decode(text, segment=2))  # segment by symbol rule