English Datasets
sc = sentence count.
tc = token count.
CRAFT
Colorado Richly Annotated Full Text Corpus
* Biomedical journal articles (sc = 19,792, tc = 554,539)
BOLT
Broad Operational Language Translation
* Conversational telephone speech (sc = 11,552, tc = 160,319)
* Discussion forum (sc = 17,382, tc = 396,584)
* SMS message (sc = 22,883, tc = 260,431)
EWT
English Web Treebank
* Question-answer (sc = 3,089, tc = 50,404)
* Email (sc = 3,436, tc = 51,504)
* Newsgroup (sc = 2,122, tc = 41,891)
* Review (sc = 2,951, tc = 45,864)
* Weblog (sc = 1,886, tc = 42,988)
OntoNotes
OntoNotes 5.0
* Broadcast conversation (sc = 14,648, tc = 239,940)
* Broadcast news (sc = 11,867, tc = 240,241)
* News magazine (sc = 7,960, tc = 194,926)
* Newswire (sc = 40,491, tc = 1,038,190)
* Pivot text (sc = 24,386, tc = 339,013)
* Telephone conversation (sc = 10,955, tc = 112,847)
* Weblog (sc = 11,800, tc = 262,049)
QuestionBank
QuestionBank Revised
* Question (sc = 3,989, tc = 38,100)
MiPACQ
Multi-source Integrated Platform for Answering Clinical Questions
* Clinical note (sc = 9,706, tc = 132,235)
* Clinical question (sc = 1,980, tc = 37,178)
* Medpedia (sc = 2,921, tc = 49,252)
* Pathological note (sc = 1,182, tc = 22,088)
THYME
Temporal Histories of Your Medical Events
* Brain cancer note (sc = 21,284, tc = 263,011)
* Clinical/Pathological note (sc = 30,090, tc = 448,603)
Mixed
A combined dataset consisting of CRAFT, BOLT, EWT, OntoNotes, QuestionBank, MiPACQ, SHARP, and THYME.
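
To get a rough sense of how much each corpus contributes to the combined dataset, the per-genre counts listed above can be summed per corpus. The following is a minimal sketch, not part of any released tooling: the names `CORPORA` and `corpus_totals` are hypothetical, and the numbers are copied from this page. SHARP is left out because no counts are listed for it here, so the grand total is only a lower bound on the size of Mixed.

```python
# (sc, tc) pairs per genre, copied from the listing above.
# SHARP is omitted: this page gives no counts for it.
CORPORA = {
    "CRAFT": [(19_792, 554_539)],
    "BOLT": [(11_552, 160_319), (17_382, 396_584), (22_883, 260_431)],
    "EWT": [(3_089, 50_404), (3_436, 51_504), (2_122, 41_891),
            (2_951, 45_864), (1_886, 42_988)],
    "OntoNotes": [(14_648, 239_940), (11_867, 240_241), (7_960, 194_926),
                  (40_491, 1_038_190), (24_386, 339_013),
                  (10_955, 112_847), (11_800, 262_049)],
    "QuestionBank": [(3_989, 38_100)],
    "MiPACQ": [(9_706, 132_235), (1_980, 37_178), (2_921, 49_252),
               (1_182, 22_088)],
    "THYME": [(21_284, 263_011), (30_090, 448_603)],
}


def corpus_totals(corpora):
    """Return {corpus: (total sentence count, total token count)}."""
    return {
        name: (sum(sc for sc, _ in genres), sum(tc for _, tc in genres))
        for name, genres in corpora.items()
    }


totals = corpus_totals(CORPORA)
for name, (sc, tc) in totals.items():
    print(f"{name:<13} sc = {sc:>9,}  tc = {tc:>11,}")

# Lower bound on Mixed, since SHARP is not counted.
listed_sc = sum(sc for sc, _ in totals.values())
listed_tc = sum(tc for _, tc in totals.values())
print(f"{'Listed total':<13} sc = {listed_sc:>9,}  tc = {listed_tc:>11,}")
```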