Tokenization
A tokenizer takes raw text and splits it into string tokens. For each token, it also returns the begin (inclusive) and end (exclusive) character offsets in the original text. ELIT’s tokenizers can optionally perform sentence segmentation, which groups consecutive tokens into sentences. The segment option selects one of four modes:
0: no segmentation.
1: segment by newlines (\n).
2: segment by symbol rules.
3: segment by 1 and 2.
Space Tokenizer
The Space Tokenizer splits input text on whitespace, which is useful when the input is already tokenized (either manually or by another tool) and needs no further tokenization.
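The behavior described above can be approximated in a few lines of plain Python. This is only a sketch and not ELIT's implementation: the function name `space_tokenize` and its `by_newline` flag are hypothetical, and the returned dictionaries simply mirror the `sid`/`tok`/`off` fields shown in the output further down.

```python
import re

def space_tokenize(text, by_newline=True):
    """Hypothetical helper: split on whitespace and track character offsets."""
    sentences = []
    pos = 0  # offset of the current chunk within the original text
    # by_newline=True loosely corresponds to segment=1; otherwise one sentence.
    chunks = text.split("\n") if by_newline else [text]
    for chunk in chunks:
        toks, offs = [], []
        for m in re.finditer(r"\S+", chunk):
            toks.append(m.group())
            # begin is inclusive, end is exclusive
            offs.append([pos + m.start(), pos + m.end()])
        if toks:  # skip empty lines
            sentences.append({"sid": len(sentences), "tok": toks, "off": offs})
        pos += len(chunk) + 1  # +1 accounts for the newline removed by split
    return sentences

print(space_tokenize("This is the 1st sentence\nThis is the 2nd sentence"))
```

Offsets are computed against the original string rather than against each line, so they stay valid after segmentation.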
Associated models:
elit_tok_space_en
API reference: SpaceTokenizer
Decode parameters:
segment: 0, 1 (default), 2, or 3
Web API
{"model": "elit_tok_space_en", "args": {"segment": 1}}
Python API
from elit.component import SpaceTokenizer
tok = SpaceTokenizer()
text = [
'This is the 1st sentence\nThis is the 2nd sentence',
'This is the 3rd sentence\nThis is the 4th sentence']
print(tok.decode(text, segment=1)) # segment by newlines (default)
Output
[
{
"doc_id": 0,
"sens": [
{
"sid": 0,
"tok": ["This", "is", "the", "1st", "sentence"],
"off": [[0, 4], [5, 7], [8, 11], [12, 15], [16, 24]]
},
{
"sid": 1,
"tok": ["This", "is", "the", "2nd", "sentence"],
"off": [[25, 29], [30, 32], [33, 36], [37, 40], [41, 49]]
}
]
},
{
"doc_id": 1,
"sens": [
{
"sid": 0,
"tok": ["This", "is", "the", "3rd", "sentence"],
"off": [[0, 4], [5, 7], [8, 11], [12, 15], [16, 24]]
},
{
"sid": 1,
"tok": ["This", "is", "the", "4th", "sentence"],
"off": [[25, 29], [30, 32], [33, 36], [37, 40], [41, 49]]
}
]
}
]
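Because begin offsets are inclusive and end offsets are exclusive, slicing the original text with an offset pair should give back the token. The check below is a small sketch using the first document from the output above; `check_offsets` is a hypothetical helper, not part of ELIT.

```python
def check_offsets(text, sentences):
    """Return True if every [begin, end) pair slices its token out of text."""
    return all(
        text[begin:end] == tok
        for sen in sentences
        for tok, (begin, end) in zip(sen["tok"], sen["off"])
    )

text = "This is the 1st sentence\nThis is the 2nd sentence"
sentences = [
    {"sid": 0,
     "tok": ["This", "is", "the", "1st", "sentence"],
     "off": [[0, 4], [5, 7], [8, 11], [12, 15], [16, 24]]},
    {"sid": 1,
     "tok": ["This", "is", "the", "2nd", "sentence"],
     "off": [[25, 29], [30, 32], [33, 36], [37, 40], [41, 49]]},
]
print(check_offsets(text, sentences))  # True
```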
English Tokenizer
The English Tokenizer splits input text into linguistic tokens using lexicalized rules.
Associated models:
elit_tok_lexrule_en
API reference: EnglishTokenizer
Decode parameters:
segment: 0, 1, 2 (default), or 3
The following shows key features of this tokenizer:
| Feature | Input Text | Tokens |
|---|---|---|
| Email addresses | | |
| Hyperlinks | | |
| Emoticons | | |
| Hashtags | | |
| HTML entities | | |
| Hyphens | | |
| List items | | |
| Units | | |
| Acronyms | | |
Web API
{"model": "elit_tok_lexrule_en", "args": {"segment": 2}}
Python API
from elit.component import EnglishTokenizer
tok = EnglishTokenizer()
text = [
"Mr. Johnson doesn't like cats! What's his favorite then?",
"He likes puffy-dogs. He is gonna buy one."]
print(tok.decode(text, segment=2)) # segment by symbol rules (default)