English Datasets

  • sc = sentence count.

  • tc = token count.

CRAFT

BOLT

  • Broad Operational Language Translation
      * Conversational telephone speech (sc = 11,552, tc = 160,319)
      * Discussion forum (sc = 17,382, tc = 396,584)
      * SMS message (sc = 22,883, tc = 260,431)

EWT

  • English Web Treebank
      * Question-answer (sc = 3,089, tc = 50,404)
      * Email (sc = 3,436, tc = 51,504)
      * Newsgroup (sc = 2,122, tc = 41,891)
      * Review (sc = 2,951, tc = 45,864)
      * Weblog (sc = 1,886, tc = 42,988)

OntoNotes

  • OntoNotes 5.0
      * Broadcast conversation (sc = 14,648, tc = 239,940)
      * Broadcast news (sc = 11,867, tc = 240,241)
      * News magazine (sc = 7,960, tc = 194,926)
      * Newswire (sc = 40,491, tc = 1,038,190)
      * Pivot text (sc = 24,386, tc = 339,013)
      * Telephone conversation (sc = 10,955, tc = 112,847)
      * Weblog (sc = 11,800, tc = 262,049)

QuestionBank

MiPACQ

SHARP

  • Strategic Health IT Advanced Research Projects
      * Clinical note (sc = 7,841, tc = 111,789)
      * Seattle group health note (sc = 8,268, tc = 110,208)
      * Stratified (sc = 5,022, tc = 51,629)
      * Stratified Seattle group health note (sc = 15,948, tc = 165,960)

THYME

Mixed

A combined dataset consisting of CRAFT, BOLT, EWT, OntoNotes, QuestionBank, MiPACQ, SHARP, and THYME.
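As a quick illustration of how the sc and tc figures listed above can be used, the sketch below totals the EWT genres and derives the average sentence length (tokens per sentence). The `Genre` class and the `totals` helper are hypothetical names introduced only for this example; the numbers are copied from the EWT list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Genre:
    """One genre of a dataset (hypothetical helper, not part of any toolkit)."""
    name: str
    sc: int  # sentence count
    tc: int  # token count

    @property
    def tokens_per_sentence(self) -> float:
        # Average sentence length for this genre.
        return self.tc / self.sc


# EWT figures as listed in this document.
EWT = [
    Genre("Question-answer", 3_089, 50_404),
    Genre("Email", 3_436, 51_504),
    Genre("Newsgroup", 2_122, 41_891),
    Genre("Review", 2_951, 45_864),
    Genre("Weblog", 1_886, 42_988),
]


def totals(genres):
    """Return (total sentence count, total token count) over all genres."""
    return sum(g.sc for g in genres), sum(g.tc for g in genres)


if __name__ == "__main__":
    total_sc, total_tc = totals(EWT)
    print(f"EWT: sc = {total_sc:,}, tc = {total_tc:,}")
    for g in EWT:
        print(f"{g.name}: {g.tokens_per_sentence:.2f} tokens per sentence")
```

The same pattern applies to any dataset above that reports per-genre counts; datasets listed without figures cannot be totalled this way.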

Part-of-Speech Tags

| Tag | Description |
|---|---|
| ADD | Email |
| AFX | Affix |
| CC | Coordinating conjunction |
| CD | Cardinal number |
| DT | Determiner |
| EX | Existential _there_ |
| FW | Foreign word |
| GW | Go with |
| IN | Preposition |
| JJ | Adjective |
| JJR | Adjective, comparative |
| JJS | Adjective, superlative |
| LS | List item |
| MD | Modal |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PRP | Pronoun |
| PRP$ | Pronoun, possessive |
| PDT | Predeterminer |
| POS | Possessive ending |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| TO | To |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBG | Verb, gerund or present participle |
| VBN | Verb, past participle |
| VBP | Verb, non-3rd person singular present |
| VBZ | Verb, 3rd person singular present |
| WDT | _Wh_-determiner |
| WP | _Wh_-pronoun |
| WP$ | _Wh_-pronoun, possessive |
| WRB | _Wh_-adverb |
| XX | Unknown |

Punctuation Tags

| Tag | Description |
|---|---|
| $ | Currency |
| : | Colon |
| , | Comma |
| . | Period |
| `` | Left quote |
| '' | Right quote |
| -LRB- | Left bracket |
| -RRB- | Right bracket |
| HYPH | Hyphen |
| NFP | Superfluous punctuation |
| SYM | Symbol |
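
One common use of the split between word-level and punctuation tags is filtering punctuation out of tagged output. The following is a minimal sketch, not the API of any particular tagger: `PUNCTUATION_TAGS`, `is_punctuation_tag`, and `strip_punctuation` are hypothetical names, and the sketch assumes the conventional Penn Treebank tags `,` for the comma and `''` for the right quote.

```python
# Punctuation tags and their descriptions, following the table above.
PUNCTUATION_TAGS = {
    "$": "Currency",
    ":": "Colon",
    ",": "Comma",         # assumed: conventional Penn Treebank comma tag
    ".": "Period",
    "``": "Left quote",
    "''": "Right quote",  # assumed: conventional Penn Treebank closing-quote tag
    "-LRB-": "Left bracket",
    "-RRB-": "Right bracket",
    "HYPH": "Hyphen",
    "NFP": "Superfluous punctuation",
    "SYM": "Symbol",
}


def is_punctuation_tag(tag: str) -> bool:
    """Return True if the tag belongs to the punctuation tag set."""
    return tag in PUNCTUATION_TAGS


def strip_punctuation(tagged_tokens):
    """Drop (token, tag) pairs whose tag is a punctuation tag."""
    return [(tok, tag) for tok, tag in tagged_tokens if not is_punctuation_tag(tag)]


if __name__ == "__main__":
    # Hand-tagged example sentence; the tags are illustrative.
    sentence = [
        ("Prices", "NNS"), ("rose", "VBD"), (",", ","),
        ("he", "PRP"), ("said", "VBD"), (".", "."),
    ]
    print(strip_punctuation(sentence))
```

Note that this follows the table in treating `$` and `SYM` as punctuation tags; a pipeline that needs to keep currency or symbol tokens would remove those keys from the set.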