English Datasets
=================
* sc = sentence count.
* tc = token count.
CRAFT
-----
* `Colorado Richly Annotated Full Text Corpus `_
* Biomedical journal articles (sc = 19,792, tc = 554,539)
BOLT
----
* `Broad Operational Language Translation `_
* Conversational telephone speech (sc = 11,552, tc = 160,319)
* Discussion forum (sc = 17,382, tc = 396,584)
* SMS message (sc = 22,883, tc = 260,431)
EWT
---
* `English Web Treebank `_
* Question-answer (sc = 3,089, tc = 50,404)
* Email (sc = 3,436, tc = 51,504)
* Newsgroup (sc = 2,122, tc = 41,891)
* Review (sc = 2,951, tc = 45,864)
* Weblog (sc = 1,886, tc = 42,988)
OntoNotes
---------
* `OntoNotes 5.0 `_
* Broadcasting conversation (sc = 14,648, tc = 239,940)
* Broadcasting news (sc = 11,867, tc = 240,241)
* News magazine (sc = 7,960, tc = 194,926)
* Newswire (sc = 40,491, tc = 1,038,190)
* Pivot text (sc = 24,386, tc = 339,013)
* Telephone conversation (sc = 10,955, tc = 112,847)
* Weblog (sc = 11,800, tc = 262,049)
QuestionBank
------------
* `QuestionBank Revised `_
* Question (sc = 3,989, tc = 38,100)
MiPACQ
------
* `Multi-source Integrated Platform for Answering Clinical Questions `_
* Clinical note (sc = 9,706, tc = 132,235)
* Clinical question (sc = 1,980, tc = 37,178)
* Medpedia (sc = 2,921, tc = 49,252)
* Pathological note (sc = 1,182, tc = 22,088)
SHARP
-----
* `Strategic Health IT Advanced Research Projects `_
* Clinical note (sc = 7,841, tc = 111,789)
* Seattle group health note (sc = 8,268, tc = 110,208)
* Stratified (sc = 5,022, tc = 51,629)
* Stratified Seattle group health note (sc = 15,948, tc = 165,960)
THYME
-----
* `Temporal History of Your Medical Events `_
* Brain cancer note (sc = 21,284, tc = 263,011)
* Clinical/Pathological note (sc = 30,090, tc = 448,603)
Mixed
-----
A combined dataset consisting of `CRAFT`, `BOLT`, `EWT`, `OntoNotes`, `QuestionBank`, `MiPACQ`, `SHARP`, and `THYME`.
Part-of-Speech Tags
-------------------
.. csv-table:: Part-of-Speech Tags
:header-rows: 1
Tag, Description
---, ---
ADD, Email
AFX, Affix
CC, Coordinating conjunction
CD, Cardinal number
DT, Determiner
EX, Existential _there_
FW, Foreign word
GW, Go with
IN, Preposition
JJ, Adjective
JJR, Adjective, comparative
JJS, Adjective, superlative
LS, List item
MD, Modal
NN, Noun, singular or mass
NNS, Noun, plural
NNP, Proper noun, singular
NNPS, Proper noun, plural
PRP, Pronoun
PRP$, Pronoun, possessive
PDT, Predeterminer
POS, Possessive ending
RB, Adverb
RBR, Adverb, comparative
RBS, Adverb, superlative
RP, Particle
TO, To
UH, Interjection
VB, Verb, base form
VBD, Verb, past tense
VBG, Verb, gerund or present participle
VBN, Verb, past participle
VBP, Verb, non-3rd person singular present
VBZ, Verb, 3rd person singular present
WDT, _Wh_-determiner
WP, _Wh_-pronoun
WP$, _Wh_-pronoun, possessive
WRB, _Wh_-adverb
XX, Unknown
.. csv-table:: Punctuation Tags
:header-rows: 1
Tag, Description
---, ---
$, Currency
:, Colon
,, Comma
., Period
``, Left quote
``, Right quote
-LRB-, Left bracket
-RRB-, Right bracket
HYPH, Hyphen
NFP, Superfluous punctuation
SYM, Symbol