English Datasets

  • sc = sentence count.
  • tc = token count.

CRAFT

BOLT

EWT

  • English Web Treebank
    • Question-answer (sc = 3,089, tc = 50,404)
    • Email (sc = 3,436, tc = 51,504)
    • Newsgroup (sc = 2,122, tc = 41,891)
    • Review (sc = 2,951, tc = 45,864)
    • Weblog (sc = 1,886, tc = 42,988)

OntoNotes

  • OntoNotes 5.0
    • Broadcast conversation (sc = 14,648, tc = 239,940)
    • Broadcast news (sc = 11,867, tc = 240,241)
    • News magazine (sc = 7,960, tc = 194,926)
    • Newswire (sc = 40,491, tc = 1,038,190)
    • Pivot text (sc = 24,386, tc = 339,013)
    • Telephone conversation (sc = 10,955, tc = 112,847)
    • Weblog (sc = 11,800, tc = 262,049)

QuestionBank

MiPACQ

SHARP

  • Strategic Health IT Advanced Research Projects
    • Clinical note (sc = 7,841, tc = 111,789)
    • Seattle group health note (sc = 8,268, tc = 110,208)
    • Stratified (sc = 5,022, tc = 51,629)
    • Stratified Seattle group health note (sc = 15,948, tc = 165,960)

THYME

Mixed

A combined dataset consisting of CRAFT, BOLT, EWT, OntoNotes, QuestionBank, MiPACQ, SHARP, and THYME.
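
The per-genre counts listed above can be tallied per corpus. The following is a minimal Python sketch, assuming the figures are copied by hand from the EWT, OntoNotes, and SHARP lists (the only corpora with counts in this document). The names COUNTS and totals are illustrative and do not belong to any released tool.

```python
# (sentence count, token count) per genre, copied from the lists above.
# Corpora without listed counts (CRAFT, BOLT, QuestionBank, MiPACQ, THYME) are omitted.
COUNTS = {
    "EWT": {
        "Question-answer": (3_089, 50_404),
        "Email": (3_436, 51_504),
        "Newsgroup": (2_122, 41_891),
        "Review": (2_951, 45_864),
        "Weblog": (1_886, 42_988),
    },
    "OntoNotes": {
        "Broadcast conversation": (14_648, 239_940),
        "Broadcast news": (11_867, 240_241),
        "News magazine": (7_960, 194_926),
        "Newswire": (40_491, 1_038_190),
        "Pivot text": (24_386, 339_013),
        "Telephone conversation": (10_955, 112_847),
        "Weblog": (11_800, 262_049),
    },
    "SHARP": {
        "Clinical note": (7_841, 111_789),
        "Seattle group health note": (8_268, 110_208),
        "Stratified": (5_022, 51_629),
        "Stratified Seattle group health note": (15_948, 165_960),
    },
}


def totals(corpus):
    """Return (total sentences, total tokens) for one corpus in COUNTS."""
    pairs = list(COUNTS[corpus].values())
    return sum(sc for sc, _ in pairs), sum(tc for _, tc in pairs)


if __name__ == "__main__":
    for name in COUNTS:
        sc, tc = totals(name)
        print(f"{name}: sc = {sc:,}, tc = {tc:,}, tokens per sentence = {tc / sc:.1f}")
```

Because the Mixed dataset also includes corpora whose counts are not listed here, these totals are a lower bound on its size, not its actual size.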

Part-of-Speech Tags

Words:

Tag   Description               Tag   Description
ADD   Email                     PDT   Predeterminer
AFX   Affix                     POS   Possessive ending
CC    Coordinating conjunction  RB    Adverb
CD    Cardinal number           RBR   Adverb, comparative
DT    Determiner                RBS   Adverb, superlative
EX    Existential there         RP    Particle
FW    Foreign word              TO    To
GW    Go with                   UH    Interjection
IN    Preposition               VB    Verb, base form
JJ    Adjective                 VBD   Verb, past tense
JJR   Adjective, comparative    VBG   Verb, gerund or present participle
JJS   Adjective, superlative    VBN   Verb, past participle
LS    List item                 VBP   Verb, non-3rd person singular present
MD    Modal                     VBZ   Verb, 3rd person singular present
NN    Noun, singular or mass    WDT   Wh-determiner
NNS   Noun, plural              WP    Wh-pronoun
NNP   Proper noun, singular     WP$   Wh-pronoun, possessive
NNPS  Proper noun, plural       WRB   Wh-adverb
PRP   Pronoun                   XX    Unknown
PRP$  Pronoun, possessive

Punctuation:

Tag  Description  Tag    Description    Tag   Description
$    Currency     ``     Left quote     HYPH  Hyphen
:    Colon        ''     Right quote    NFP   Superfluous punctuation
,    Comma        -LRB-  Left bracket   SYM   Symbol
.    Period       -RRB-  Right bracket
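
Many of the word tags above share a prefix with related tags (NN*, VB*, JJ*, RB*), which makes coarse grouping straightforward. The sketch below is a hypothetical helper, not part of the tagset definition: the coarse class names and the function coarse_pos are assumptions chosen for illustration, and the right quote tag is written as two apostrophes, following the Penn Treebank convention.

```python
# Tags listed in the punctuation table (right quote written as two apostrophes).
PUNCTUATION_TABLE_TAGS = frozenset(
    {"$", ":", ",", ".", "``", "''", "-LRB-", "-RRB-", "HYPH", "NFP", "SYM"}
)

# (prefix, coarse class) pairs; the class names are illustrative assumptions.
PREFIX_CLASSES = (
    ("NN", "noun"),        # NN, NNS, NNP, NNPS
    ("VB", "verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),   # JJ, JJR, JJS
    ("RB", "adverb"),      # RB, RBR, RBS (RP is a particle and does not match)
    ("PRP", "pronoun"),    # PRP, PRP$
    ("W", "wh-word"),      # WDT, WP, WP$, WRB
)


def coarse_pos(tag: str) -> str:
    """Map a fine-grained tag to a coarse class, or "other" if no rule applies."""
    if tag in PUNCTUATION_TABLE_TAGS:
        return "punctuation"
    for prefix, name in PREFIX_CLASSES:
        if tag.startswith(prefix):
            return name
    return "other"


if __name__ == "__main__":
    for tag in ("VBZ", "NNPS", "RP", "-LRB-", "WP$"):
        print(tag, coarse_pos(tag))
```

Tags with no shared prefix, such as CC, DT, IN, and MD, fall into "other" here; a real application would likely give each its own class.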

Named Entity Tags

Named entities:

Tag Description
PERSON People, including fictional.
NORP Nationalities or religious or political groups.
FAC Buildings, airports, highways, bridges, etc.
ORG Companies, agencies, institutions, etc.
GPE Countries, cities, states.
LOC Non-GPE locations, mountain ranges, bodies of water.
PRODUCT Vehicles, weapons, foods, etc. (not services).
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LAW Named documents made into laws.
LANGUAGE Any named language.

Other entities:

Tag Description
DATE Absolute or relative dates or periods.
TIME Times smaller than a day.
PERCENT Percentage (including "%").
MONEY Monetary values, including unit.
QUANTITY Measurements, as of weight or distance.
ORDINAL "first", "second", etc.
CARDINAL Numerals that do not fall under another type.
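
The split between named entities and other entities can be useful when filtering tagger output, for example to keep only proper names. The following is a minimal sketch under that assumption: the function entity_group and the sample spans are hypothetical and only illustrate the two groups listed above.

```python
# Tag inventories copied from the two tables above.
NAMED_ENTITY_TAGS = frozenset({
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC",
    "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE",
})
OTHER_ENTITY_TAGS = frozenset({
    "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
})


def entity_group(tag: str) -> str:
    """Return "named" or "other" for a known tag; raise ValueError otherwise."""
    if tag in NAMED_ENTITY_TAGS:
        return "named"
    if tag in OTHER_ENTITY_TAGS:
        return "other"
    raise ValueError(f"unknown entity tag: {tag!r}")


if __name__ == "__main__":
    # Hypothetical (text, tag) spans, made up for illustration.
    spans = [("Emory University", "ORG"), ("2015", "DATE"), ("Atlanta", "GPE")]
    named_only = [text for text, tag in spans if entity_group(tag) == "named"]
    print(named_only)
```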

Dependency Labels

See the Deep Dependency Guidelines for more details:

Label  Description               Label  Description
acl    clausal modifier of noun  lv     light verb
adv    adverbial                 mark   clausal marker
advcl  adverbial clause          meta   meta element
advnp  adverbial noun phrase     modal  modal
appo   apposition                neg    negation
attr   attribute                 nsbj   nominal subject
aux    auxiliary verb            num    numeric modifier
case   case marker               obj    object
cc     coordinating conjunction  p      punctuation or symbol
com    compound word             poss   possessive modifier
comp   complement                ppmod  prepositional phrase
conj   conjunct                  prn    parenthetical notation
cop    copula                    prt    verb particle
csbj   clausal subject           raise  raising predicate
dat    dative                    r-*    referential
dep    unclassified dependency   relcl  relative clause
det    determiner                root   root
disc   discourse element         voc    vocative
expl   expletive
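
The r-* entry is a pattern and not a single label. The sketch below shows one way to handle it when reading annotated data, assuming the part after "r-" is one of the other labels in the table (for example, r-nsbj); that reading is an assumption and should be checked against the Deep Dependency Guidelines. The name split_label is hypothetical.

```python
from typing import Tuple

# Base labels copied from the table above (the r-* pattern is handled separately).
BASE_LABELS = frozenset({
    "acl", "adv", "advcl", "advnp", "appo", "attr", "aux", "case", "cc",
    "com", "comp", "conj", "cop", "csbj", "dat", "dep", "det", "disc", "expl",
    "lv", "mark", "meta", "modal", "neg", "nsbj", "num", "obj", "p", "poss",
    "ppmod", "prn", "prt", "raise", "relcl", "root", "voc",
})


def split_label(label: str) -> Tuple[str, bool]:
    """Return (base label, is_referential); raise ValueError for unknown labels."""
    referential = label.startswith("r-")
    base = label[2:] if referential else label
    if base not in BASE_LABELS:
        raise ValueError(f"unknown dependency label: {label!r}")
    return base, referential


if __name__ == "__main__":
    for label in ("nsbj", "r-nsbj", "relcl"):
        print(label, split_label(label))
```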