| ACL Anthology Reference Corpus (ARC) | English | open | 62,196,334 |
| Afrikaans Web 2024 (afTenTen24) | Afrikaans | trial | 141,774,668 |
| Afrikaans Web 2024 (afTenTen24, HunPos tagger) | Afrikaans | main | 142,303,550 |
| Afrikaans Wikipedia 2022 | Afrikaans | trial | 22,227,137 |
| Afrikaans Wikipedia corpus 2018 (afwiki) | Afrikaans | main | 14,466,792 |
| Albanian Web 2020 (sqTenTen20) | Albanian | trial | 528,084,150 |
| Alsatian Drama Corpus | German | main | 276,204 |
| American Spanish Web 2011 (esamTenTen11) | Spanish | main | 7,475,579,365 |
| Amharic Web 2013-17 (amWaC17) | Amharic | trial | 25,975,846 |
| ArabCC – Learner Corpus of English Essays | English | main | 202,364 |
| Arabic Learner Corpus (ALC) | Arabic | main | 362,712 |
| Arabic Trends (2014–today) | Arabic | trial | 6,840,655,044 |
| Arabic Web 2009 | Arabic | main | 150,282,522 |
| Arabic Web 2012 (arTenTen12) | Arabic | main | 7,475,624,779 |
| Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) | Arabic | main | 115,315,274 |
| Arabic Web 2024 (arTenTen24) | Arabic | trial | 6,572,150,262 |
| Araneum Anglicum Africanum Maius [2015] | English | main | 854,484,093 |
| Araneum Anglicum Asiaticum Maius [2015] | English | main | 867,259,037 |
| Araneum Anglicum Maius [2015] | English | trial | 888,466,066 |
| Araneum Finnicum Maius [2014] | Finnish | main | 817,453,523 |
| Araneum Francogallicum Maius [2015] | French | main | 933,688,995 |
| Araneum Germanicum Maius [2013] | German | main | 875,465,845 |
| Araneum Hispanicum Maius [2013] | Spanish | main | 892,299,770 |
| Araneum Hungaricum Maius [2014] | Hungarian | trial | 792,549,686 |
| Araneum Italicum Maius (Italian, 14.12) 1,20 G | Italian | main | 890,568,531 |
| Araneum Nederlandicum Maius [2013] | Dutch | main | 713,417,518 |
| Araneum Polonicum Maius [2013] | Polish | main | 595,768,667 |
| Araneum Portugallicum Maius [2015] | Portuguese | main | 862,134,902 |
| Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G | Russian | trial | 859,319,823 |
| Araneum Slovacum Maius [2013] | Slovak | trial | 816,125,010 |
| Armenian Wikipedia corpus 2020 (hywiki20) | Armenian | trial | 51,349,694 |
| Assamese Wikipedia 2023 (asWiki23) | Assamese | trial | 2,581,684 |
| Australian Legislative Corpus 2023 | English | ondemand | 138,411,932 |
| Bashkir Drama Corpus | Bashkir | main | 18,723 |
| Basque Web (BasqueWaC v2) | Basque | trial | 99,719,584 |
| Belarusian Web 2016 (beTenTen16) | Belarusian | main | 63,327,264 |
| Belarusian Web 2020 (beTenTen20) | Belarusian | trial | 51,297,389 |
| Bengali Web (bnWaC) | Bengali | main | 11,519,730 |
| Bengali Web 2017 (bnTenTen17) | Bengali | main | 812,606,941 |
| Bengali Web 2021 (bnTenTen21) | Bengali | trial | 470,732,738 |
| BIBLE Polish-Swahili | Polish | main | 138,216 |
| BIBLE Swahili-Polish | Swahili | main | 139,160 |
| Boot Camp English | English | trial | 85,683,246 |
| Bosnian Web (bsWaC 1.2) | Bosnian | trial | 248,478,730 |
| Brexit corpus (English) | English | trial | 108,452,923 |
| Brexit corpus without retweets (English) | English | trial | 4,789,571 |
| British Academic Spoken English Corpus (BASE) | English | open | 1,477,281 |
| British Academic Written English Corpus (BAWE) | English | open | 6,968,089 |
| British Law Report Corpus | English | main | 8,515,749 |
| British National Corpus (BNC) | English | trial | 96,132,981 |
| British National Corpus (BNC), tagged by CLAWS | English | trial | 96,052,598 |
| British National Corpus 2014 (BNC2014, spoken part) | English | trial | 10,495,185 |
| British parliamentary debates (ParlaMint 2.1, CoNLL format) | English | main | 100,967,492 |
| British Web 2007 (ukWaC) | English | main | 1,313,058,436 |
| Brown Corpus | English | open | 1,007,299 |
| Brown Family | English | main | 6,963,778 |
| Brown Family (CLAWS + TreeTagger tags) | English | main | 6,975,474 |
| Bulgarian National Corpus (BulgarianNC) | Bulgarian | main | 20,975,703 |
| Bulgarian National Corpus nonweb genres | Bulgarian | main | 22,398,507 |
| Bulgarian National Corpus with web | Bulgarian | main | 419,512,059 |
| Bulgarian Web 2012 (bgTenTen12) | Bulgarian | main | 705,156,683 |
| Bulgarian Web 2021 (bgTenTen21) | Bulgarian | trial | 4,674,884,452 |
| Burmese Web 2021 (myTenTen21) | Burmese | trial | 557,329,406 |
| Cambridge Academic English | English | main | 3,163,648 |
| Cantonese Web (CantoneseWaC) | Cantonese | trial | 30,898,663 |
| Catalan Trends (2022–today) | Catalan | trial | 87,949,884 |
| Catalan Web 2014 (caTenTen14) | Catalan | trial | 182,608,420 |
| Cebuano Web 2018 (cebTenTen18) | Cebuano | trial | 4,552,105 |
| CELEN: Learner Corpus of Spanish in Japan | Spanish | open | 658,467 |
| CHILDES Afrikaans Corpus | Afrikaans | main | 26,020 |
| CHILDES Catalan Corpus | Catalan | main | 209,525 |
| CHILDES Croatian Corpus | Croatian | main | 300,832 |
| CHILDES Danish Corpus | Danish | main | 285,231 |
| CHILDES English Corpus | English | main | 22,693,506 |
| CHILDES Estonian Corpus | Estonian | main | 313,457 |
| CHILDES Farsi Corpus | Persian | main | 120,527 |
| CHILDES French Corpus | French | main | 2,583,460 |
| CHILDES Gaelic Corpus | Irish | main | 16,848 |
| CHILDES German Corpus | German | main | 5,941,266 |
| CHILDES Hebrew Corpus | Hebrew | main | 807,657 |
| CHILDES Hungarian Corpus | Hungarian | main | 247,881 |
| CHILDES Italian Corpus | Italian | main | 459,881 |
| CHILDES Japanese Corpus | Japanese | main | 1,578,068 |
| CHILDES Korean Corpus | Korean | main | 36,056 |
| CHILDES Norwegian Corpus | Norwegian | main | 56,827 |
| CHILDES Polish Corpus | Polish | main | 1,041,300 |
| CHILDES Portuguese Corpus | Portuguese | main | 216,407 |
| CHILDES Russian Corpus | Russian | main | 48,791 |
| CHILDES Spanish Corpus | Spanish | main | 802,743 |
| CHILDES Swedish Corpus | Swedish | main | 520,478 |
| CHILDES Tamil Corpus | Tamil | main | 15,490 |
| CHILDES Thai Corpus | Thai | main | 243,939 |
| CHILDES Turkish Corpus | Turkish | main | 178,100 |
| Chinese GigaWord 2 Corpus: Mainland, simplified | Chinese Simplified | main | 205,031,379 |
| Chinese GigaWord 2 Corpus: Taiwan, traditional | Chinese Traditional | main | 382,600,557 |
| Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) | Chinese Traditional | main | 259,156,002 |
| Chinese Traditional Web 2011 (TaiwanWaC) | Chinese Traditional | main | 259,156,002 |
| Chinese Trends (2023–today) | Chinese Simplified | trial | 30,960,922 |
| Chinese Web 2005 (Internet-ZH, NEUCSP tagger) | Chinese Simplified | main | 198,205,344 |
| Chinese Web 2011 (zhTenTen11, sample 10M) | Chinese Simplified | main | 9,012,125 |
| Chinese Web 2011 (zhTenTen11, Stanford tagger) | Chinese Simplified | main | 1,729,867,455 |
| Chinese Web 2017 (zhTenTen17) Simplified | Chinese Simplified | trial | 13,531,331,169 |
| Chinese Web 2017 (zhTenTen17) Traditional | Chinese Traditional | trial | 2,400,405,372 |
| COMPAS 2015 | English | ondemand | 114,967,191 |
| COMPAS 2016 | English | ondemand | 260,896,404 |
| CoPEP - The Corpus of Portuguese from Academic Journals (v. 1.4) | Portuguese | main | 40,423,011 |
| Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ | N'Ko | open | 4,102,593 |
| Corpus of Academic Journal Articles (CAJA) | English | ondemand | 78,970,299 |
| Corpus of English Dialogues 1560–1760 | English | ondemand | 1,151,171 |
| Corpus of Estonian Web sentences 2020 | Estonian | main | 280,961,465 |
| Corpus of Estonian Web sentences 2021 | Estonian | main | 473,455,876 |
| Corpus of the MagyarOK teaching materials for Hungarian, levels A1 to B2 | Hungarian | open | 259,200 |
| COVID-19 Open Research Dataset (CORD-19) | English | open | 1,443,530,655 |
| Crimean Tatar National Monolingual & Parallel Corpora, Crimean Tatar | Crimean Tatar | open | 2,958,868 |
| Crimean Tatar National Monolingual & Parallel Corpora, English | English | open | 92,947 |
| Crimean Tatar National Monolingual & Parallel Corpora, Russian | Russian | open | 538,135 |
| Crimean Tatar National Monolingual & Parallel Corpora, Ukrainian | Ukrainian | open | 344,454 |
| Croatian parliamentary debates (ParlaMint 2.1) | Croatian | main | 20,337,753 |
| Croatian parliamentary debates (ParlaMint 2.1, CoNLL format) | Croatian | main | 20,342,230 |
| Croatian Web (hrWaC 2.2, ReLDI) | Croatian | main | 1,210,021,198 |
| Croatian Web (hrWaC 2.2, RFTagger) | Croatian | main | 1,211,328,660 |
| csSkELL v1 (whole documents) | Czech | main | 1,717,516,129 |
| csSkELL v2.2 (sentences with GDEX scores) | Czech | main | 1,443,410,941 |
| Cundeelee Wangka Stories (Cundeelee Wangka) | Cundeelee Wangka | ondemand | 1,965 |
| Cundeelee Wangka Stories (English) | English | ondemand | 4,423 |
| Czech Drama Corpus | Czech | main | 135,105 |
| Czech European Literary Text Collection (ELTeC) | Czech | main | 5,472,720 |
| Czech news and web 1995–2002 (czes2.2) | Czech | main | 366,796,757 |
| Czech RapCor Boosted v1 | Czech | main | 3,701,921 |
| Czech Trends (2014–today) | Czech | trial | 2,394,468,832 |
| Czech Web (csTenTen 12+17+19) | Czech | trial | 11,722,066,502 |
| Czech Web 2012 (csTenTen12 v9a) | Czech | main | 4,175,089,441 |
| Czech Web 2019 (csTenTen19) | Czech | main | 6,280,217,621 |
| Czech Web 2023 (csTenTen23) | Czech | trial | 4,456,427,977 |
| CzechParl 2012 (v2 with lempos) | Czech | main | 37,184,025 |
| Danish Gigaword (DAGW) | Danish | main | 964,617,784 |
| Danish Trends | Danish | trial | 144,238,919 |
| Danish Web 2010 (DanishWaC) | Danish | main | 288,272,967 |
| Danish Web 2014 (daTenTen14) | Danish | main | 2,040,976,501 |
| Danish Web 2017 (daTenTen17) | Danish | main | 1,956,590,663 |
| Danish Web 2020 (daTenTen20) | Danish | trial | 3,480,275,804 |
| DGT-Translation Memory parallel – Bulgarian | Bulgarian | main | 25,912,721 |
| DGT-Translation Memory parallel – Croatian | Croatian | main | 3,968,608 |
| DGT-Translation Memory parallel – Czech | Czech | main | 43,621,933 |
| DGT-Translation Memory parallel – Danish | Danish | main | 44,962,280 |
| DGT-Translation Memory parallel – Dutch | Dutch | main | 50,523,892 |
| DGT-Translation Memory parallel – English | English | main | 59,106,576 |
| DGT-Translation Memory parallel – Estonian | Estonian | main | 34,155,488 |
| DGT-Translation Memory parallel – Finnish | Finnish | main | 35,129,923 |
| DGT-Translation Memory parallel – French | French | main | 58,224,781 |
| DGT-Translation Memory parallel – German | German | main | 45,380,666 |
| DGT-Translation Memory parallel – Greek | Greek | main | 51,865,988 |
| DGT-Translation Memory parallel – Hungarian | Hungarian | main | 2,306,272 |
| DGT-Translation Memory parallel – Irish | Irish | main | 1,065,421 |
| DGT-Translation Memory parallel – Italian | Italian | main | 53,260,912 |
| DGT-Translation Memory parallel – Latvian | Latvian | main | 38,898,134 |
| DGT-Translation Memory parallel – Lithuanian | Lithuanian | main | 38,675,242 |
| DGT-Translation Memory parallel – Maltese | Maltese | main | 22,388,562 |
| DGT-Translation Memory parallel – Polish | Polish | main | 44,149,107 |
| DGT-Translation Memory parallel – Portuguese | Portuguese | main | 53,950,705 |
| DGT-Translation Memory parallel – Romanian | Romanian | main | 26,644,734 |
| DGT-Translation Memory parallel – Slovak | Slovak | main | 43,276,048 |
| DGT-Translation Memory parallel – Slovenian | Slovenian | main | 42,897,385 |
| DGT-Translation Memory parallel – Spanish | Spanish | main | 57,311,149 |
| DGT-Translation Memory parallel – Swedish | Swedish | main | 44,378,725 |
| Directory of Open Access Journals (DOAJ) – English | English | trial | 2,662,763,697 |
| Dutch Trends | Dutch | trial | 390,735,265 |
| Dutch Web 2014 (nlTenTen14) | Dutch | main | 2,253,777,579 |
| Dutch Web 2020 (nlTenTen20) | Dutch | trial | 5,890,009,964 |
| e-flux (International art English) | English | main | 5,036,119 |
| EcoLexicon English Corpus (EEC) | English | open | 23,169,446 |
| ELEXIS Bulgarian Web 2021 | Bulgarian | main | 1,014,316,771 |
| ELEXIS Bulgarian Web 2021 (bgTenTen21) WSD sample | Bulgarian | main | 1,992,046 |
| ELEXIS Croatian Web 2020 | Croatian | main | 1,006,040,496 |
| ELEXIS Croatian Web 2020 (hrTenTen20) WSD sample | Croatian | main | 1,964,238 |
| ELEXIS Czech Web 2019 | Czech | main | 949,730,627 |
| ELEXIS Czech Web 2019 (csTenTen19) WSD sample | Czech | main | 1,970,054 |
| ELEXIS Danish Web 2020 | Danish | main | 989,769,308 |
| ELEXIS Danish Web 2020 (daTenTen20) WSD sample | Danish | main | 1,982,549 |
| ELEXIS Dutch Web 2020 | Dutch | main | 1,024,660,354 |
| ELEXIS Dutch Web 2020 (nlTenTen20) WSD sample | Dutch | main | 1,982,397 |
| ELEXIS English Web 2020 | English | main | 1,000,329,442 |
| ELEXIS English Web 2020 (enTenTen20, no genres and topics) WSD sample | English | main | 1,999,789 |
| ELEXIS Estonian Web 2021 | Estonian | main | 1,006,940,696 |
| ELEXIS Estonian Web 2021 (etTenTen21) WSD sample | Estonian | main | 1,995,380 |
| ELEXIS Finnish Web 2019 | Finnish | main | 1,011,352,644 |
| ELEXIS Finnish Web 2019 (fiTenTen19) WSD sample | Finnish | main | 1,993,821 |
| ELEXIS French Web 2020 | French | main | 1,069,392,783 |
| ELEXIS French Web 2020 (frTenTen20) WSD sample | French | main | 2,099,651 |
| ELEXIS German Web 2020 | German | main | 1,023,830,342 |
| ELEXIS German Web 2020 (deTenTen20) WSD sample | German | main | 1,998,166 |
| ELEXIS Greek Web 2019 | Greek | main | 1,003,265,093 |
| ELEXIS Greek Web 2019 (elTenTen19) WSD sample | Greek | main | 1,961,351 |
| ELEXIS Hebrew Web 2021 | Hebrew | main | 1,043,504,840 |
| ELEXIS Hebrew Web 2021 (heTenTen21) WSD sample | Hebrew | main | 2,017,821 |
| ELEXIS Hungarian Web 2020 | Hungarian | main | 994,806,145 |
| ELEXIS Hungarian Web 2020 (huTenTen20) WSD sample | Hungarian | main | 1,989,855 |
| ELEXIS Irish Web 2021 | Irish | main | 58,130,702 |
| ELEXIS Irish Web 2021 (gaTenTen21) WSD sample | Irish | main | 1,980,914 |
| ELEXIS Italian Web 2020 | Italian | main | 1,020,349,212 |
| ELEXIS Italian Web 2020 (itTenTen20) WSD sample | Italian | main | 1,996,623 |
| ELEXIS Latvian Web 2021 | Latvian | main | 1,029,262,793 |
| ELEXIS Latvian Web 2021 (lvTenTen21) WSD sample | Latvian | main | 2,006,576 |
| ELEXIS Lithuanian Web 2021 | Lithuanian | main | 846,563,251 |
| ELEXIS Lithuanian Web 2021 (ltTenTen21) WSD sample | Lithuanian | main | 2,004,075 |
| ELEXIS Polish Web 2019 | Polish | main | 987,945,132 |
| ELEXIS Polish Web 2019 (plTenTen19) WSD sample | Polish | main | 1,971,906 |
| ELEXIS Portuguese Web 2020 | Portuguese | main | 1,021,937,614 |
| ELEXIS Portuguese Web 2020 (ptTenTen20) WSD sample | Portuguese | main | 1,997,515 |
| ELEXIS Romanian Web 2021 | Romanian | main | 995,033,835 |
| ELEXIS Romanian Web 2021 (roTenTen21) WSD sample | Romanian | main | 1,968,801 |
| ELEXIS Slovak Web 2021 | Slovak | main | 1,008,238,227 |
| ELEXIS Slovak Web 2021 (skTenTen21) WSD sample | Slovak | main | 1,975,380 |
| ELEXIS Slovene Web 2020 (slTenTen20) WSD sample | Slovenian | main | 1,964,284 |
| ELEXIS Slovenian Web 2020 | Slovenian | main | 1,007,206,400 |
| ELEXIS Spanish Web 2020 | Spanish | main | 1,012,502,656 |
| ELEXIS Spanish Web 2020 (esTenTen20) WSD sample | Spanish | main | 1,988,999 |
| ELEXIS Swedish Web 2020 | Swedish | main | 1,006,477,461 |
| ELEXIS Swedish Web 2020 (svTenTen20) WSD sample | Swedish | main | 1,980,144 |
| Elsevier OA CC-BY Corpus | English | main | 187,615,459 |
| English Broadsheet Newspapers 1993–2021 (SiBol) | English | main | 858,566,374 |
| English Corpus for SKELL 3.10 | English | main | 1,038,200,313 |
| English Corpus for SKELL 3.11 | English | main | 1,038,200,313 |
| English Corpus for SkELL 3.8 | English | main | 1,041,772,774 |
| English Corpus for SkELL 3.9 | English | main | 1,041,138,575 |
| English Drama Corpus | English | main | 18,846,687 |
| English European Literary Text Collection (ELTeC) | English | main | 10,769,589 |
| English Historical Book Collection (EEBO, ECCO, Evans) | English | main | 826,296,048 |
| English parliamentary debates (ParlaMint 2.1) | English | main | 100,616,051 |
| English Preposition Corpus | English | trial | 2,136,325 |
| English Trends (2014–today) | English | trial | 86,336,782,713 |
| English Web 2008 (ententen08_tt31) | English | trial | 3,083,193,293 |
| English Web 2012 (enTenTen12) | English | main | 11,191,860,036 |
| English Web 2013 (enTenTen13) | English | main | 19,685,733,337 |
| English Web 2015 (enTenTen15) | English | main | 13,190,556,334 |
| English Web 2018 (enTenTen18) | English | main | 21,926,740,748 |
| English Web 2021 (enTenTen21) | English | trial | 52,268,286,493 |
| English Wikipedia | English | main | 1,356,523,079 |
| English Wikipedia sample with Error annotations | English | trial | 951,824 |
| Environment corpus | English | main | 61,197,742 |
| Estonian Corpus for Learners 2020 (etSkELL) | Estonian | main | 280,572,215 |
| Estonian coursebook corpus 2018 | Estonian | main | 121,114 |
| Estonian National Corpus 2021 (Estonian NC 2021) | Estonian | main | 2,410,296,919 |
| Estonian National Corpus 2021 (Estonian NC 2021, CoNLL format) | Estonian | main | 2,410,296,919 |
| Estonian National Corpus 2023 (Estonian NC 2023) | Estonian | main | 3,080,721,728 |
| Estonian Trends | Estonian | trial | 250,047,851 |
| Estonian Web 2017 (etTenTen17) | Estonian | main | 658,558,136 |
| Estonian Web 2019 (etTenTen19) | Estonian | main | 508,447,009 |
| Estonian Web 2021 (etTenTen21) | Estonian | trial | 725,832,092 |
| Estonian Web 2023 (etTenTen23) | Estonian | trial | 1,508,458,913 |
| EUR-Lex 2/2016 parallel – Bulgarian | Bulgarian | trial | 329,071,554 |
| EUR-Lex 2/2016 parallel – Croatian | Croatian | trial | 109,138,184 |
| EUR-Lex 2/2016 parallel – Czech | Czech | trial | 350,230,088 |
| EUR-Lex 2/2016 parallel – Danish | Danish | trial | 519,765,085 |
| EUR-Lex 2/2016 parallel – Dutch | Dutch | trial | 583,263,688 |
| EUR-Lex 2/2016 parallel – English | English | trial | 629,722,593 |
| EUR-Lex 2/2016 parallel – Estonian | Estonian | trial | 291,077,511 |
| EUR-Lex 2/2016 parallel – Finnish | Finnish | trial | 384,119,975 |
| EUR-Lex 2/2016 parallel – French | French | trial | 677,063,993 |
| EUR-Lex 2/2016 parallel – German | German | trial | 528,617,843 |
| EUR-Lex 2/2016 parallel – Greek | Greek | trial | 579,344,223 |
| EUR-Lex 2/2016 parallel – Hungarian | Hungarian | trial | 340,618,970 |
| EUR-Lex 2/2016 parallel – Irish | Irish | trial | 31,439,542 |
| EUR-Lex 2/2016 parallel – Italian | Italian | trial | 606,070,097 |
| EUR-Lex 2/2016 parallel – Latvian | Latvian | trial | 324,734,544 |
| EUR-Lex 2/2016 parallel – Lithuanian | Lithuanian | trial | 323,151,426 |
| EUR-Lex 2/2016 parallel – Maltese | Maltese | trial | 314,396,006 |
| EUR-Lex 2/2016 parallel – Polish | Polish | trial | 360,862,149 |
| EUR-Lex 2/2016 parallel – Portuguese | Portuguese | trial | 595,066,701 |
| EUR-Lex 2/2016 parallel – Romanian | Romanian | trial | 336,928,068 |
| EUR-Lex 2/2016 parallel – Slovak | Slovak | trial | 255,531,673 |
| EUR-Lex 2/2016 parallel – Slovenian | Slovenian | trial | 351,899,258 |
| EUR-Lex 2/2016 parallel – Spanish | Spanish | trial | 635,187,126 |
| EUR-Lex 2/2016 parallel – Swedish | Swedish | trial | 478,485,126 |
| EUR-Lex judgments 12/2016 parallel – Bulgarian | Bulgarian | trial | 17,071,495 |
| EUR-Lex judgments 12/2016 parallel – Croatian | Croatian | trial | 5,613,468 |
| EUR-Lex judgments 12/2016 parallel – Czech | Czech | trial | 18,226,505 |
| EUR-Lex judgments 12/2016 parallel – Danish | Danish | trial | 34,934,021 |
| EUR-Lex judgments 12/2016 parallel – Dutch | Dutch | trial | 40,534,071 |
| EUR-Lex judgments 12/2016 parallel – English | English | trial | 42,339,337 |
| EUR-Lex judgments 12/2016 parallel – Estonian | Estonian | trial | 15,029,608 |
| EUR-Lex judgments 12/2016 parallel – Finnish | Finnish | trial | 23,601,422 |
| EUR-Lex judgments 12/2016 parallel – French | French | trial | 48,023,524 |
| EUR-Lex judgments 12/2016 parallel – German | German | trial | 35,297,517 |
| EUR-Lex judgments 12/2016 parallel – Greek | Greek | trial | 35,815,108 |
| EUR-Lex judgments 12/2016 parallel – Hungarian | Hungarian | trial | 17,940,879 |
| EUR-Lex judgments 12/2016 parallel – Italian | Italian | trial | 42,053,315 |
| EUR-Lex judgments 12/2016 parallel – Latvian | Latvian | trial | 16,908,831 |
| EUR-Lex judgments 12/2016 parallel – Lithuanian | Lithuanian | trial | 16,252,111 |
| EUR-Lex judgments 12/2016 parallel – Maltese | Maltese | trial | 19,146,797 |
| EUR-Lex judgments 12/2016 parallel – Polish | Polish | trial | 18,799,551 |
| EUR-Lex judgments 12/2016 parallel – Portuguese | Portuguese | trial | 35,412,936 |
| EUR-Lex judgments 12/2016 parallel – Romanian | Romanian | trial | 17,592,388 |
| EUR-Lex judgments 12/2016 parallel – Slovak | Slovak | trial | 18,265,664 |
| EUR-Lex judgments 12/2016 parallel – Slovenian | Slovenian | trial | 18,439,766 |
| EUR-Lex judgments 12/2016 parallel – Spanish | Spanish | trial | 39,431,836 |
| EUR-Lex judgments 12/2016 parallel – Swedish | Swedish | trial | 30,666,764 |
| Europarl spoken parallel – Bulgarian | Bulgarian | trial | 9,215,233 |
| Europarl spoken parallel – Czech | Czech | trial | 13,013,774 |
| Europarl spoken parallel – Danish | Danish | trial | 48,343,860 |
| Europarl spoken parallel – Dutch | Dutch | trial | 54,007,722 |
| Europarl spoken parallel – English | English | trial | 53,837,625 |
| Europarl spoken parallel – English | English | open | 15,099,625 |
| Europarl spoken parallel – Estonian | Estonian | trial | 11,171,727 |
| Europarl spoken parallel – Finnish | Finnish | trial | 34,182,031 |
| Europarl spoken parallel – French | French | trial | 59,145,988 |
| Europarl spoken parallel – French | French | open | 16,815,290 |
| Europarl spoken parallel – German | German | trial | 47,805,055 |
| Europarl spoken parallel – Greek | Greek | trial | 38,868,863 |
| Europarl spoken parallel – Hungarian | Hungarian | trial | 12,421,715 |
| Europarl spoken parallel – Italian | Italian | trial | 52,871,060 |
| Europarl spoken parallel – Latvian | Latvian | trial | 11,920,085 |
| Europarl spoken parallel – Lithuanian | Lithuanian | trial | 11,424,032 |
| Europarl spoken parallel – Polish | Polish | trial | 13,034,164 |
| Europarl spoken parallel – Polish | Polish | open | 13,034,164 |
| Europarl spoken parallel – Portuguese | Portuguese | trial | 53,778,766 |
| Europarl spoken parallel – Romanian | Romanian | trial | 9,554,864 |
| Europarl spoken parallel – Slovak | Slovak | trial | 12,942,651 |
| Europarl spoken parallel – Slovenian | Slovenian | trial | 12,496,942 |
| Europarl spoken parallel – Spanish | Spanish | trial | 54,302,284 |
| Europarl spoken parallel – Spanish | Spanish | open | 15,513,307 |
| Europarl spoken parallel – Swedish | Swedish | trial | 46,303,799 |
| European Spanish Web 2011 (eseuTenTen11) | Spanish | main | 2,021,633,644 |
| Film Corpus | English | main | 21,661,806 |
| Finnish Web 2014 (fiTenTen14) | Finnish | main | 1,404,083,812 |
| Finnish Web 2014 (fiTenTen14, TreeTagger v2) | Finnish | main | 1,404,100,049 |
| Finnish Web 2024 (fiTenTen24) | Finnish | trial | 4,417,192,749 |
| Frantext (French literature of the 18th-20th century) | French | main | 15,573,070 |
| Frantext (French literature of the 18th-20th century), without trends | French | main | 15,573,070 |
| French corpus of 88,000 SMS (88milSMS) | French | trial | 1,206,663 |
| French Drama Corpus | French | main | 12,822,260 |
| French European Literary Text Collection (ELTeC) | French | main | 8,557,536 |
| French Trends | French | trial | 1,085,578,896 |
| French Web 2008 (v2 with lempos) | French | main | 104,705,211 |
| French Web 2010 (frWaC) | French | main | 1,330,564,200 |
| French Web 2012 (frTenTen12) | French | main | 9,889,689,889 |
| French Web 2017 (frTenTen17) | French | main | 5,752,261,039 |
| French Web 2020 (frTenTen20) | French | main | 15,115,914,647 |
| French Web 2023 (frTenTen23) | French | trial | 23,191,789,469 |
| Georgian Web 2013 (kaWaC) | Georgian | trial | 50,713,604 |
| Georgian Web 2024 (kaTenTen24) | Georgian | trial | 869,908,570 |
| German Corpus for SkELL 1.0 | German | main | 769,810,745 |
| German Drama Corpus | German | main | 9,374,314 |
| German European Literary Text Collection (ELTeC) | German | main | 10,724,668 |
| German Political Speeches Corpus | German | trial | 11,144,258 |
| German Trends | German | trial | 2,230,143,165 |
| German Web 2010 | German | main | 2,338,036,362 |
| German Web 2010 (deWaC) | German | main | 1,348,188,416 |
| German Web 2013 (deTenTen13) | German | main | 16,526,335,416 |
| German Web 2018 (deTenTen18) | German | main | 5,346,041,196 |
| German Web 2020 (deTenTen20) | German | main | 17,512,733,172 |
| German Web 2023 (deTenTen23) | German | trial | 16,667,474,100 |
| GerManC (German Newspapers 1650-1800) | German | main | 667,310 |
| Gigafida v2.0 (referenčni) | Slovenian | main | 1,109,441,592 |
| Greek Drama Corpus | Greek | main | 269,334 |
| Greek Trends | Greek | trial | 1,019,615,502 |
| Greek Web (GkWaC with lempos) | Greek | main | 124,285,612 |
| Greek Web 2014 (elTenTen14) | Greek | main | 1,671,692,845 |
| Greek Web 2019 (elTenTen19) | Greek | trial | 2,342,091,029 |
| Guangwai - Lancaster Chinese Learner Corpus | Chinese Simplified | open | 1,289,060 |
| Gujarati Web (guWaC) | Gujarati | main | 17,960,095 |
| Gujarati Web 2021 (guTenTen21) | Gujarati | trial | 88,574,710 |
| Gutenberg Afrikaans 2020 | Afrikaans | main | 315,010 |
| Gutenberg Bulgarian 2020 | Bulgarian | main | 33,352 |
| Gutenberg Catalan 2020 | Catalan | main | 1,320,242 |
| Gutenberg Chinese Traditional 2020 | Chinese Traditional | main | 27,136,782 |
| Gutenberg Czech 2020 | Czech | main | 364,683 |
| Gutenberg Danish 2020 | Danish | main | 3,959,344 |
| Gutenberg Dutch 2020 | Dutch | main | 87,390,658 |
| Gutenberg English 2020 | English | main | 2,903,177,585 |
| Gutenberg Esperanto 2020 | Esperanto | trial | 2,024,013 |
| Gutenberg Finnish 2020 | Finnish | main | 68,174,366 |
| Gutenberg French 2020 | French | main | 197,560,500 |
| Gutenberg German 2020 | German | main | 74,709,930 |
| Gutenberg Greek 2020 | Greek | main | 7,837,742 |
| Gutenberg Hebrew 2020 | Hebrew | main | 158,212 |
| Gutenberg Hungarian 2020 | Hungarian | main | 9,140,833 |
| Gutenberg Icelandic 2020 | Icelandic | main | 82,211 |
| Gutenberg Italian 2020 | Italian | main | 93,049,296 |
| Gutenberg Japanese 2020 | Japanese | main | 963,368 |
| Gutenberg Latin 2020 | Latin | main | 3,871,335 |
| Gutenberg Norwegian Bokmål 2020 | Norwegian Bokmål | main | 762,295 |
| Gutenberg Polish 2020 | Polish | main | 421,318 |
| Gutenberg Portuguese 2020 | Portuguese | main | 14,309,476 |
| Gutenberg Russian 2020 | Russian | main | 13,643 |
| Gutenberg Serbian 2020 | Serbian | main | 70,724 |
| Gutenberg Spanish 2020 | Spanish | main | 37,202,233 |
| Gutenberg Swedish 2020 | Swedish | main | 7,919,783 |
| Gutenberg Tagalog 2020 | Tagalog | main | 2,468,064 |
| Gutenberg Telugu 2020 | Telugu | main | 157,077 |
| Gutenberg Welsh 2020 | Welsh | main | 221,733 |
| Hausa Web 2015 (hausaWaC15) | Hausa (Boko) | trial | 5,304,300 |
| Hebrew Drama Corpus | Hebrew | main | 954,359 |
| Hebrew General Corpus (web crawled, mostly newspapers) | Hebrew | main | 157,947,728 |
| Hebrew Translation Corpus | Hebrew | trial | 1,180,003 |
| Hebrew Trends | Hebrew | trial | 323,560,516 |
| Hebrew Web (HebWaC) | Hebrew | main | 47,832,254 |
| Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) | Hebrew | ondemand | 895,876,116 |
| Hebrew Web 2014 (heTenTen14, no POS tagging) | Hebrew | main | 890,282,843 |
| Hebrew Web 2021 (heTenTen21) | Hebrew | trial | 2,775,686,699 |
| Hindi Web 2012 (HindiWaC v. 4) | Hindi | trial | 107,960,109 |
| Hindi Web 2013 (hiTenTen13) | Hindi | main | 351,289,441 |
| Hindi Web 2017 (hiTenTen17) | Hindi | main | 1,228,379,747 |
| Hindi Web 2021 (hiTenTen21) | Hindi | trial | 792,395,313 |
| Hungarian Drama Corpus | Hungarian | main | 533,088 |
| Hungarian Trends | Hungarian | trial | 544,761,000 |
| Hungarian Web 2012 (huTenTen12) | Hungarian | main | 2,572,620,694 |
| Hungarian Web 2020 (huTenTen20) | Hungarian | main | 5,164,717,029 |
| Hungarian Web 2023 (huTenTen23) | Hungarian | trial | 3,494,350,960 |
| Hungary European Literary Text Collection (ELTeC) | Hungarian | main | 6,626,495 |
| Icelandic Gigaword Corpus 2017 | Icelandic | main | 532,028,866 |
| Icelandic parliamentary debates (ParlaMint 2.1) | Icelandic | main | 23,468,157 |
| Icelandic parliamentary debates (ParlaMint 2.1, CoNLL format) | Icelandic | main | 23,461,109 |
| Icelandic texts [sample] | Icelandic | trial | 5,436,035 |
| Icelandic Web 2020 (isTenTen20) | Icelandic | trial | 518,620,759 |
| Igbo Web 2015 (IgboWaC15) | Igbo | main | 331,042 |
| Igbo Web 2017 (igTenTen17) | Igbo | trial | 629,294 |
| Indonesian Web (IndonesianWaC) | Indonesian | trial | 90,120,046 |
| Indonesian Web 2020 (idTenTen20) | Indonesian | main | 3,687,192,045 |
| Indonesian Web 2024 (idTenTen24) | Indonesian | trial | 7,108,841,939 |
| Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) | Irish | open | 478,445 |
| Irish Trends | Irish | trial | 3,288,003 |
| Irish Web 2022 (gaTenTen22) | Irish | trial | 125,040,541 |
| Italian Corpus for SkELL 1.0 | Italian | main | 328,270,600 |
| Italian Drama Corpus | Italian | main | 1,669,717 |
| Italian Trends (2014–today) | Italian | trial | 9,945,540,574 |
| Italian Web 2006 (itWaC) | Italian | main | 1,597,295,469 |
| Italian Web 2010 (itTenTen) | Italian | main | 2,588,873,046 |
| Italian Web 2016 (itTenTen16) | Italian | main | 4,989,729,171 |
| Italian Web 2020 (itTenTen20) | Italian | trial | 12,451,734,885 |
| itWAC (reduced) | Italian | main | 751,542,948 |
| Japanese Web 2006 (jpWaC) | Japanese | main | 336,867,039 |
| Japanese Web 2011 (jaTenTen11) | Japanese | trial | 8,432,294,787 |
| Japanese Web 2011 (jaTenTen11, sample) | Japanese | main | 301,407,652 |
| Japanese Web 2011 sample (jaTenTen11, LUW) | Japanese | trial | 163,837,764 |
| Kannada Web 2012 (knWaC12) | Kannada | trial | 11,056,526 |
| KAS-Dipl (diplome) | Slovenian | main | 568,188,810 |
| KAS-Dr (doktorati) | Slovenian | main | 30,244,519 |
| KAS-Mag (magisteriji) | Slovenian | main | 157,168,378 |
| Khmer Web 2018 (kmTenTen18) | Khmer | main | 16,500,379 |
| Khmer Web 2021 (kmTenTen21) | Khmer | trial | 103,066,083 |
| Korean Web 2012 (koTenTen12) | Korean | main | 461,196,240 |
| Korean Web 2018 (koTenTen18) | Korean | trial | 1,668,851,720 |
| Korpus Malti v2.0 | Maltese | trial | 110,714,844 |
| KSUCCA (Classical Arabic) | Arabic | trial | 46,705,577 |
| Lao Web 2018 (loTenTen18) | Lao | main | 15,862,991 |
| Lao Web 2019 (loTenTen19) | Lao | trial | 105,018,584 |
| LatinISE corpus | Latin | trial | 11,139,890 |
| Latvian Web (LatvianWaC) | Latvian | main | 57,666,024 |
| Latvian Web 2014 (lvTenTen14) | Latvian | trial | 530,367,474 |
| Lektor (Learner corpus of proofreading and translations) | Slovenian | main | 953,038 |
| LEXMCI | English | main | 1,448,180,339 |
| Limerick Corpus of Irish English (LCIE 2004) | English | main | 830,210 |
| Lithuanian parliamentary debates (ParlaMint 2.1) | Lithuanian | main | 14,573,624 |
| Lithuanian parliamentary debates (ParlaMint 2.1, CoNLL format) | Lithuanian | main | 14,428,682 |
| Lithuanian Web (LithuanianWaC v2) | Lithuanian | main | 48,650,918 |
| Lithuanian Web 2014 (ltTenTen14) | Lithuanian | main | 778,151,979 |
| Lithuanian Web 2021 (ltTenTen21) | Lithuanian | trial | 1,772,410,416 |
| London English Corpus | English | main | 2,391,040 |
| MaCoCu Albanian Web v1 (2022) | Albanian | main | 617,643,884 |
| MaCoCu Bosnian Web v1 (2021-2022) | Bosnian | trial | 715,708,157 |
| MaCoCu Croatian Web v2 (2021–2022) | Croatian | trial | 2,299,750,788 |
| MaCoCu Macedonian Web v2 (2021) | Macedonian | trial | 512,171,886 |
| MaCoCu Maltese Web v2 (2021) | Maltese | main | 331,665,362 |
| MaCoCu Montenegrin Web v1 (2021-2022) | Montenegrin | main | 157,680,373 |
| MaCoCu Serbian Web v1 (2021-2022) | Serbian | trial | 2,435,143,021 |
| MaCoCu Slovene Web v2 (2021-2022) | Slovenian | main | 1,863,942,989 |
| MaCoCu Turkish Web v2 (2021) | Turkish | main | 4,261,087,826 |
| MaCoCu Ukrainian Web v1 (2021-2022) | Ukrainian | main | 5,912,040,719 |
| Magpie corpus | English | main | 4,597,782 |
| Malay Web 2020 (msTenTen20) | Malay | main | 296,419,465 |
| Malay Web 2024 (msTenTen24) | Malay | trial | 805,094,746 |
| Malayalam Web (malayalamWaC) | Malayalam | trial | 15,950,663 |
| Malaysian Web (MalaysianWaC) | Malay | trial | 182,578,743 |
| Maldivian Web 2022 (dvTenTen22) | Maldivian | trial | 20,880,246 |
| Maldivian Wikipedia corpus 2019 (dvwiki) | Maldivian | trial | 548,211 |
| Maltese Trends | Maltese | trial | 9,902,243 |
| Maori Web 2013 and 2020 (miTenTen20) | Maori | trial | 11,814,825 |
| MDPI Open Peer Review Corpus 2 | English | open | 721,890,270 |
| Medical Web Corpus | English | main | 33,961,786 |
| Merlin Written Learner Czech | Czech | main | 75,526 |
| Merlin Written Learner German | German | main | 150,256 |
| Merlin Written Learner Italian | Italian | main | 107,797 |
| METCLIL: Metaphor in EMI seminars | English | open | 110,493 |
| Mongolian Web Texts 2016 (mnWaC16) | Mongolian | trial | 6,104,565 |
| Mueller Report | English | trial | 167,103 |
| Nepalbhasa Online Media Corpus | Newari | open | 7,750,050 |
| Nepali National Corpus | Nepali | trial | 13,440,835 |
| Nepali Web (NepaliWaC) | Nepali | main | 1,290,388 |
| New corpus for English (NCI English) | English | main | 216,618,095 |
| New Model Corpus | English | main | 95,276,958 |
| Newspapers in Portuguese (CetemPúblico, CetenFolha) | Portuguese | main | 56,768,822 |
| Norwegian Bokmål Trends | Norwegian Bokmål | trial | 110,623,528 |
| Norwegian dictionary corpus (Nynorskkorpuset) | Norwegian | main | 74,496,664 |
| Norwegian European Literary Text Collection (ELTeC) | Norwegian | main | 2,743,719 |
| Norwegian Nynorsk Trends | Norwegian Nynorsk | trial | 11,185,612 |
| Norwegian Web 2012 |
Norwegian |
main |
669,511,569 |
| Norwegian Web 2017 (noTenTen17, Bokmål and Nynorsk) |
Norwegian |
trial |
2,630,849,803 |
| Norwegian Web 2017 (noTenTen17, Bokmål) |
Norwegian Bokmål |
trial |
2,461,704,417 |
| Norwegian Web 2017 (noTenTen17, Nynorsk) |
Norwegian Nynorsk |
trial |
169,145,386 |
| Norwegian Web 2023 (nnTenTen23, Nynorsk) |
Norwegian Nynorsk |
trial |
151,767,346 |
| Norwegian Web 2023 (noTenTen23, Bokmål) |
Norwegian Bokmål |
trial |
2,471,455,518 |
| OEC |
English |
ondemand |
2,073,319,589 |
| Old French and Middle French (BFM 2022) |
French |
main |
6,002,552 |
| Open American National Corpus (spoken) |
English |
main |
3,202,026 |
| Open American National Corpus (written) |
English |
main |
11,048,137 |
| Open Cambridge Learner Corpus (Uncoded) |
English |
ondemand |
2,975,701 |
| Open Parallel Corpus (OPUS) – Afrikaans |
Afrikaans |
main |
586,334 |
| Open Parallel Corpus (OPUS) – Albanian |
Albanian |
main |
46,304,346 |
| Open Parallel Corpus (OPUS) – Arabic |
Arabic |
main |
300,000,057 |
| Open Parallel Corpus (OPUS) – Bosnian |
Bosnian |
main |
43,582,516 |
| Open Parallel Corpus (OPUS) – Bulgarian |
Bulgarian |
main |
183,115,244 |
| Open Parallel Corpus (OPUS) – Croatian |
Croatian |
main |
121,369,625 |
| Open Parallel Corpus (OPUS) – Czech |
Czech |
main |
203,845,619 |
| Open Parallel Corpus (OPUS) – Danish |
Danish |
main |
120,107,271 |
| Open Parallel Corpus (OPUS) – Dutch |
Dutch |
main |
356,363,571 |
| Open Parallel Corpus (OPUS) – English |
English |
main |
1,139,515,048 |
| Open Parallel Corpus (OPUS) – Estonian |
Estonian |
main |
64,879,741 |
| Open Parallel Corpus (OPUS) – Finnish |
Finnish |
main |
131,985,872 |
| Open Parallel Corpus (OPUS) – French |
French |
main |
766,833,908 |
| Open Parallel Corpus (OPUS) – German |
German |
main |
125,229,773 |
| Open Parallel Corpus (OPUS) – Greek |
Greek |
main |
239,360,926 |
| Open Parallel Corpus (OPUS) – Hebrew |
Hebrew |
main |
130,972,343 |
| Open Parallel Corpus (OPUS) – Hindi |
Hindi |
main |
854,741 |
| Open Parallel Corpus (OPUS) – Hungarian |
Hungarian |
main |
157,495,018 |
| Open Parallel Corpus (OPUS) – Italian v2 |
Italian |
main |
180,532,849 |
| Open Parallel Corpus (OPUS) – Japanese |
Japanese |
main |
5,455,106 |
| Open Parallel Corpus (OPUS) – Korean |
Korean |
main |
374,850 |
| Open Parallel Corpus (OPUS) – Latvian |
Latvian |
main |
24,499,516 |
| Open Parallel Corpus (OPUS) – Lithuanian |
Lithuanian |
main |
29,621,940 |
| Open Parallel Corpus (OPUS) – Macedonian |
Macedonian |
main |
40,348,792 |
| Open Parallel Corpus (OPUS) – Persian |
Persian |
main |
4,425,133 |
| Open Parallel Corpus (OPUS) – Polish |
Polish |
main |
208,008,636 |
| Open Parallel Corpus (OPUS) – Portuguese |
Portuguese |
main |
272,300,927 |
| Open Parallel Corpus (OPUS) – Portuguese |
Portuguese |
main |
297,700,205 |
| Open Parallel Corpus (OPUS) – Romanian |
Romanian |
main |
282,408,295 |
| Open Parallel Corpus (OPUS) – Russian |
Russian |
main |
307,709,872 |
| Open Parallel Corpus (OPUS) – Serbian |
Serbian |
main |
153,237,786 |
| Open Parallel Corpus (OPUS) – Slovak |
Slovak |
main |
62,451,407 |
| Open Parallel Corpus (OPUS) – Slovenian |
Slovenian |
main |
121,228,966 |
| Open Parallel Corpus (OPUS) – Spanish |
Spanish |
main |
701,944,027 |
| Open Parallel Corpus (OPUS) – Swedish |
Swedish |
main |
102,298,686 |
| Open Parallel Corpus (OPUS) – Turkish |
Turkish |
main |
151,342,424 |
| Open Parallel Corpus (OPUS) – Ukrainian |
Ukrainian |
main |
2,577,481 |
| Open Parallel Corpus OPUS – Chinese Simplified |
Chinese Simplified |
main |
243,427,123 |
| Open Parallel Corpus OPUS – Chinese Traditional |
Chinese Traditional |
main |
380,245 |
| Open Parallel Corpus OPUS – Norwegian Bokmål |
Norwegian |
main |
20,237,510 |
| OpenSubtitles 2018 parallel – Afrikaans |
Afrikaans |
main |
341,349 |
| OpenSubtitles 2018 parallel – Albanian |
Albanian |
main |
15,662,170 |
| OpenSubtitles 2018 parallel – Arabic |
Arabic |
main |
333,329,378 |
| OpenSubtitles 2018 parallel – Armenian |
Armenian |
main |
24,216 |
| OpenSubtitles 2018 parallel – Basque |
Basque |
main |
3,919,829 |
| OpenSubtitles 2018 parallel – Bengali |
Bengali |
main |
2,270,841 |
| OpenSubtitles 2018 parallel – Bosnian |
Bosnian |
main |
125,323,299 |
| OpenSubtitles 2018 parallel – Brazilian Portuguese |
Portuguese |
main |
545,598,189 |
| OpenSubtitles 2018 parallel – Breton |
Breton |
trial |
85,503 |
| OpenSubtitles 2018 parallel – Bulgarian |
Bulgarian |
main |
371,685,493 |
| OpenSubtitles 2018 parallel – Catalan |
Catalan |
main |
3,273,561 |
| OpenSubtitles 2018 parallel – Chinese Simplified |
Chinese Simplified |
main |
119,998,854 |
| OpenSubtitles 2018 parallel – Chinese Traditional |
Chinese Traditional |
main |
41,876,166 |
| OpenSubtitles 2018 parallel – Croatian |
Croatian |
main |
370,177,938 |
| OpenSubtitles 2018 parallel – Czech |
Czech |
main |
453,218,524 |
| OpenSubtitles 2018 parallel – Danish |
Danish |
main |
135,228,416 |
| OpenSubtitles 2018 parallel – Dutch |
Dutch |
main |
444,413,064 |
| OpenSubtitles 2018 parallel – English |
English |
main |
1,211,666,401 |
| OpenSubtitles 2018 parallel – Esperanto |
Esperanto |
main |
396,790 |
| OpenSubtitles 2018 parallel – Estonian |
Estonian |
main |
107,391,459 |
| OpenSubtitles 2018 parallel – European Portuguese |
Portuguese |
main |
466,021,603 |
| OpenSubtitles 2018 parallel – Finnish |
Finnish |
main |
175,247,181 |
| OpenSubtitles 2018 parallel – French |
French |
main |
462,749,061 |
| OpenSubtitles 2018 parallel – Galician |
Galician |
trial |
1,572,312 |
| OpenSubtitles 2018 parallel – Georgian |
Georgian |
main |
1,157,136 |
| OpenSubtitles 2018 parallel – German |
German |
main |
185,133,927 |
| OpenSubtitles 2018 parallel – Greek |
Greek |
main |
457,347,003 |
| OpenSubtitles 2018 parallel – Hebrew |
Hebrew |
main |
371,473,205 |
| OpenSubtitles 2018 parallel – Hindi |
Hindi |
main |
675,322 |
| OpenSubtitles 2018 parallel – Hungarian |
Hungarian |
main |
378,525,740 |
| OpenSubtitles 2018 parallel – Icelandic |
Icelandic |
main |
9,194,074 |
| OpenSubtitles 2018 parallel – Indonesian |
Indonesian |
main |
77,273,767 |
| OpenSubtitles 2018 parallel – Italian |
Italian |
main |
431,415,848 |
| OpenSubtitles 2018 parallel – Japanese |
Japanese |
main |
15,224,480 |
| OpenSubtitles 2018 parallel – Kazakh |
Kazakh |
main |
14,172 |
| OpenSubtitles 2018 parallel – Korean |
Korean |
main |
7,432,927 |
| OpenSubtitles 2018 parallel – Latvian |
Latvian |
main |
2,494,901 |
| OpenSubtitles 2018 parallel – Lithuanian |
Lithuanian |
main |
6,806,857 |
| OpenSubtitles 2018 parallel – Macedonian |
Macedonian |
main |
28,859,153 |
| OpenSubtitles 2018 parallel – Malay |
Malay |
main |
13,465,077 |
| OpenSubtitles 2018 parallel – Malayalam |
Malayalam |
main |
1,671,708 |
| OpenSubtitles 2018 parallel – Norwegian (Mixed) |
Norwegian |
main |
61,215,172 |
| OpenSubtitles 2018 parallel – Persian |
Persian |
main |
53,444,595 |
| OpenSubtitles 2018 parallel – Polish |
Polish |
main |
496,167,686 |
| OpenSubtitles 2018 parallel – Romanian |
Romanian |
main |
658,289,867 |
| OpenSubtitles 2018 parallel – Russian |
Russian |
main |
180,032,832 |
| OpenSubtitles 2018 parallel – Serbian |
Serbian |
main |
480,367,760 |
| OpenSubtitles 2018 parallel – Sinhalese |
Sinhalese |
trial |
3,430,727 |
| OpenSubtitles 2018 parallel – Slovak |
Slovak |
main |
66,455,056 |
| OpenSubtitles 2018 parallel – Slovenian |
Slovenian |
main |
198,366,873 |
| OpenSubtitles 2018 parallel – Spanish |
Spanish |
main |
753,235,853 |
| OpenSubtitles 2018 parallel – Swedish |
Swedish |
main |
153,717,474 |
| OpenSubtitles 2018 parallel – Tagalog |
Tagalog |
main |
96,291 |
| OpenSubtitles 2018 parallel – Tamil |
Tamil |
main |
132,055 |
| OpenSubtitles 2018 parallel – Telugu |
Telugu |
main |
109,730 |
| OpenSubtitles 2018 parallel – Thai |
Thai |
main |
33,223,171 |
| OpenSubtitles 2018 parallel – Turkish |
Turkish |
main |
461,809,489 |
| OpenSubtitles 2018 parallel – Ukrainian |
Ukrainian |
main |
5,054,963 |
| OpenSubtitles 2018 parallel – Urdu |
Urdu |
main |
229,947 |
| OpenSubtitles 2018 parallel – Vietnamese |
Vietnamese |
main |
31,848,385 |
| OPUS MontenegrinSubs parallel – English |
English |
trial |
468,337 |
| OPUS MontenegrinSubs parallel – Montenegrin |
Montenegrin |
trial |
365,698 |
| Oromo Web 2016 (orWaC16) |
Oromo |
trial |
4,249,953 |
| Oxford Children's Corpus 2015 (PTag) |
English |
ondemand |
210,322,185 |
| Oxford Children's Corpus 2015 -- Education (PTag) |
English |
ondemand |
1,323,174 |
| Oxford Children's Corpus 2015 -- Reading (PTag) |
English |
ondemand |
34,284,687 |
| Oxford Children's Corpus 2015 -- Writing (PTag) |
English |
ondemand |
174,714,324 |
| Oxford Children's Corpus 2016 (PTag) |
English |
ondemand |
284,360,063 |
| Oxford Children's Corpus 2016 -- Reading (PTag) |
English |
ondemand |
53,858,955 |
| Oxford Children's Corpus 2016 -- Writing (PTag) |
English |
ondemand |
229,177,934 |
| Oxford Corpus of Academic English (OCAE, April 2012) |
English |
ondemand |
71,371,739 |
| Paisa |
Italian |
main |
221,989,288 |
| ParlaTalk Austria - parliamentary debates |
German |
trial |
14,216,980 |
| ParlaTalk Belgium (Dutch) - parliamentary debates |
Dutch |
trial |
61,465,546 |
| ParlaTalk Belgium (French) - parliamentary debates |
French |
trial |
61,302,487 |
| ParlaTalk Bulgaria - parliamentary debates |
Bulgarian |
trial |
8,233,209 |
| ParlaTalk Czechia - parliamentary debates |
Czech |
trial |
37,092,731 |
| ParlaTalk Denmark - parliamentary debates |
Danish |
trial |
90,339,857 |
| ParlaTalk Estonia - parliamentary debates |
Estonian |
trial |
12,013,335 |
| ParlaTalk Finland - parliamentary debates |
Finnish |
trial |
25,722,677 |
| ParlaTalk Finland parliamentary debates (old version) |
Finnish |
main |
22,660,060 |
| ParlaTalk France - parliamentary debates |
French |
trial |
108,766,242 |
| ParlaTalk Germany - parliamentary debates |
German |
trial |
287,861,438 |
| ParlaTalk Greece - parliamentary debates |
Greek |
trial |
81,525,195 |
| ParlaTalk Hungary - parliamentary debates |
Hungarian |
trial |
55,999,529 |
| ParlaTalk Ireland - parliamentary debates |
English |
trial |
50,924,083 |
| ParlaTalk Italy - parliamentary debates |
Italian |
trial |
107,861,747 |
| ParlaTalk Latvia - parliamentary debates |
Latvian |
trial |
1,001,694,349 |
| ParlaTalk Netherlands - parliamentary debates |
Dutch |
trial |
108,292,132 |
| ParlaTalk Poland - parliamentary debates |
Polish |
trial |
20,567,963 |
| ParlaTalk Portugal - parliamentary debates |
Portuguese |
trial |
147,871,854 |
| ParlaTalk Romania - parliamentary debates |
Romanian |
trial |
45,475,225 |
| ParlaTalk Slovakia - parliamentary debates |
Slovak |
trial |
12,247,101 |
| ParlaTalk Slovenia - parliamentary debates |
Slovenian |
trial |
86,897,922 |
| ParlaTalk Spain - parliamentary debates |
Spanish |
trial |
467,965,192 |
| ParlaTalk Sweden - parliamentary debates |
Swedish |
trial |
137,727,701 |
| Parsed German Web (sDeWaC) |
German |
main |
755,165,551 |
| Penn Corpora of Historical English |
English |
ondemand |
3,800,639 |
| Persian Trends |
Persian |
trial |
564,681,487 |
| PICAE 2010 |
English |
ondemand |
31,025,920 |
| Polish Drama Corpus |
Polish |
main |
117,230 |
| Polish European Literary Text Collection (ELTeC) |
Polish |
main |
8,226,827 |
| Polish language of the 1960s |
Polish |
main |
546,042 |
| Polish Parliamentary Corpus (PPC) |
Polish |
main |
553,858,723 |
| Polish Trends |
Polish |
trial |
1,014,472,583 |
| Polish Web (PolishWac, Morfeusz and TaKIPI tagger) |
Polish |
main |
103,028,410 |
| Polish Web 2012 (plTenTen12, RFTagger) |
Polish |
main |
7,715,835,214 |
| Polish Web 2012 sample (plTenTen12) |
Polish |
main |
45,208,497 |
| Polish Web 2019 (plTenTen19) |
Polish |
trial |
3,994,024,317 |
| Polish Web 2019 term reference (plTenTen19_01) |
Polish |
trial |
181,036,098 |
| Portuguese European Literary Text Collection (ELTeC) |
Portuguese |
main |
6,626,158 |
| Portuguese Trends |
Portuguese |
trial |
1,156,069,003 |
| Portuguese Web 2011 (ptTenTen11) |
Portuguese |
main |
3,896,392,719 |
| Portuguese Web 2011 (ptTenTen11, Palavras parsed) |
Portuguese |
main |
2,757,635,105 |
| Portuguese Web 2018 (ptTenTen18) |
Portuguese |
trial |
7,407,393,731 |
| Portuguese Web 2023 (ptTenTen23) |
Portuguese |
trial |
16,976,742,883 |
| Project Gutenberg English |
English |
main |
443,471,071 |
| pukWaC (ukWaC parsed with MaltParser) |
English |
main |
39,496,785 |
| Quran annotated corpus [unvowelled Arabic] |
Arabic |
main |
128,243 |
| Quran annotated corpus [unvowelled Latin] |
Arabic |
main |
99,268 |
| Quran annotated corpus [vowelled Arabic] |
Arabic |
main |
128,241 |
| Quran annotated corpus [vowelled Latin] |
Arabic |
main |
97,970 |
| RapCor Boosted v1 |
French |
trial |
47,303,463 |
| RapCor1360 - Francophone rap songs |
French |
main |
735,513 |
| Riznica v0.1 |
Croatian |
main |
85,273,724 |
| Roman Drama Corpus |
Latin |
main |
278,890 |
| Romanian European Literary Text Collection (ELTeC) |
Romanian |
main |
5,420,094 |
| Romanian Web 2016 (roTenTen16) |
Romanian |
main |
2,640,496,763 |
| Romanian Web 2021 (roTenTen21) |
Romanian |
trial |
2,763,173,824 |
| ruSkELL 1.6 |
Russian |
main |
975,584,449 |
| Russian Drama Corpus |
Russian |
main |
2,011,699 |
| Russian Sites in Estonian Web 2017–2023 |
Russian |
main |
312,244,562 |
| Russian Trends |
Russian |
trial |
2,464,041,131 |
| Russian Web 2006 (v2 with lempos) |
Russian |
main |
147,930,261 |
| Russian Web 2011 (ruTenTen11) |
Russian |
trial |
14,553,856,113 |
| Russian Web 2017 (ruTenTen17) |
Russian |
main |
9,034,837,939 |
| Russian Web 2020 (ruTenTen20) |
Russian |
trial |
19,125,894,850 |
| Samoan Web (SamoanWac1) |
Samoan |
trial |
3,115,385 |
| Santa Barbara Corpus of Spoken American English |
English |
main |
249,655 |
| ScienceBlogs |
English |
main |
103,175,233 |
| Scottish Gaelic Wiki 2015 (gdWiki) |
Scottish Gaelic |
trial |
980,026 |
| Semcor v3.0 (sense-tagged corpus) |
English |
main |
664,038 |
| Serbian European Literary Text Collection (ELTeC) |
Serbian |
main |
3,953,020 |
| Serbian Web (srWaC 1.2 processed by Hunpos) |
Serbian |
main |
477,724,164 |
| Serbian Web (srWaC 1.2 processed by RFTagger v1) |
Serbian (Latin) |
trial |
441,888,202 |
| Serbian Web (srWaC 1.2) |
Serbian (Latin) |
trial |
476,888,297 |
| Setswana/Tswana Web (SetswanaWaC v2) |
Setswana |
trial |
11,496,687 |
| Shakespeare English Drama Corpus |
English |
main |
810,929 |
| Shakespeare German Drama Corpus |
German |
main |
796,439 |
| Slovak Trends |
Slovak |
trial |
347,435,721 |
| Slovak Web 2011 (skTenTen11) |
Slovak |
main |
540,112,634 |
| Slovak Web 2011 (skTenTen11, ambiguity tag, lempos) |
Slovak |
main |
715,707,053 |
| Slovak Web 2023 (skTenTen23) |
Slovak |
trial |
898,031,101 |
| Slovene Trends |
Slovenian |
trial |
199,562,010 |
| Slovenian European Literary Text Collection (ELTeC) |
Slovenian |
main |
5,609,847 |
| Slovenian reference corpus (FidaPLUS v2) |
Slovenian |
trial |
600,309,637 |
| Slovenian Web (slWaC 2.1) |
Slovenian |
trial |
754,255,589 |
| Slovenian Web (slWaC 2.1, processed with TreeTagger version 2) |
Slovenian |
trial |
755,255,547 |
| Slovenian Web 2015 (slTenTen15, TreeTagger v2) |
Slovenian |
trial |
829,544,337 |
| Somali Web 2016 (soWaC16) |
Somali |
trial |
71,871,585 |
| SoNaR |
Dutch |
ondemand |
425,978,755 |
| Sorani Kurdish Wikipedia corpus 2020 (ckbwiki20) |
Kurdish (Sorani) |
trial |
5,042,449 |
| Spanish Calderon Drama Corpus |
Spanish |
main |
2,112,643 |
| Spanish Drama Corpus |
Spanish |
main |
371,624 |
| Spanish European Literary Text Collection (ELTeC) |
Spanish |
main |
7,186,472 |
| Spanish Trends |
Spanish |
trial |
2,211,436,915 |
| Spanish Web 2005 (SpanishWaC) |
Spanish |
main |
97,773,185 |
| Spanish Web 2011 (esTenTen11, Eu + Am) |
Spanish |
main |
9,497,213,009 |
| Spanish Web 2018 (esTenTen18) |
Spanish |
main |
16,953,735,742 |
| Spanish Web 2023 (esTenTen23) |
Spanish |
trial |
28,652,392,686 |
| Susanne |
English |
trial |
128,998 |
| Swahili Web 2014 (swWaC) |
Swahili |
trial |
17,882,483 |
| Swedish Drama Corpus |
Swedish |
main |
581,524 |
| Swedish European Literary Text Collection (ELTeC) |
Swedish |
main |
4,240,209 |
| Swedish Parole |
Swedish |
main |
21,735,113 |
| Swedish Web 2014 (svTenTen14) |
Swedish |
main |
3,401,035,817 |
| Swedish Web 2020 (svTenTen20) |
Swedish |
trial |
2,366,298,161 |
| Tagalog (Filipino) Web 2019 (tlTenTen19) |
Tagalog |
trial |
198,303,250 |
| Tajik Web (TajikWaC) |
Tajik |
trial |
93,151,897 |
| TalkBank Persian (blog posts) |
Persian |
trial |
269,753,238 |
| Tamil Trends |
Tamil |
trial |
92,207,043 |
| Tamil Web 2015 (TamilWaC) |
Tamil |
main |
26,750,515 |
| Tamil Web 2021 (taTenTen21) |
Tamil |
trial |
823,837,031 |
| Tatar Drama Corpus |
Turkish |
main |
10,595 |
| Tatar Mixed Corpus |
Tatar |
trial |
102,779,803 |
| Tatar News (2000–2014) |
Tatar |
main |
24,927,439 |
| Tatar Web 2015 sample |
Tatar |
trial |
195,901 |
| Telugu Web 2017 (teTenTen) |
Telugu |
trial |
126,807,158 |
| Terms of Service (English) |
English |
open |
168,199 |
| Thai Web (ThaiWaC) |
Thai |
trial |
82,787,119 |
| Thai Web 2018 (thTenTen18) |
Thai |
trial |
640,530,227 |
| The Annotated Corpus of Classical Tibetan (ACTib 2.0) |
Tibetan |
trial |
170,202,078 |
| The Digital Corpus of Sanskrit (2010 – 2019) |
Sanskrit (romanised) |
trial |
3,361,394 |
| The Digital Parisian Stage Corpus |
French |
main |
172,202 |
| The New Corpus for Ireland |
Irish |
main |
29,886,201 |
| Tigrinya Web 2016 (tiWaC16) |
Tigrinya |
trial |
2,087,613 |
| Timestamped JSI web corpus 2014-2016 Catalan |
Catalan |
trial |
99,395,494 |
| Timestamped JSI web corpus 2014-2016 Finnish |
Finnish |
trial |
119,109,490 |
| Timestamped JSI web corpus 2014-2016 French |
French |
trial |
1,870,341,756 |
| Timestamped JSI web corpus 2014-2016 German |
German |
trial |
1,987,759,563 |
| Timestamped JSI web corpus 2014-2016 Hebrew |
Hebrew |
trial |
111,339,363 |
| Timestamped JSI web corpus 2014-2016 Hungarian |
Hungarian |
trial |
180,843,359 |
| Timestamped JSI web corpus 2014-2016 Korean |
Korean |
trial |
438,816,127 |
| Timestamped JSI web corpus 2014-2016 Polish |
Polish |
trial |
157,930,228 |
| Timestamped JSI web corpus 2014-2016 Portuguese |
Portuguese |
trial |
1,109,771,393 |
| Timestamped JSI web corpus 2014-2016 Russian |
Russian |
trial |
1,120,731,416 |
| Timestamped JSI web corpus 2014-2016 Serbian |
Serbian |
trial |
86,380,673 |
| Timestamped JSI web corpus 2014-2016 Spanish |
Spanish |
trial |
4,055,944,612 |
| Timestamped JSI web corpus 2014-2016 Swedish |
Swedish |
trial |
335,782,681 |
| Timestamped JSI web corpus 2014-2021 Catalan |
Catalan |
main |
449,634,119 |
| Timestamped JSI web corpus 2014-2021 Finnish |
Finnish |
main |
421,879,841 |
| Timestamped JSI web corpus 2014-2021 French |
French |
main |
6,998,186,326 |
| Timestamped JSI web corpus 2014-2021 German |
German |
main |
7,055,641,455 |
| Timestamped JSI web corpus 2014-2021 Hebrew |
Hebrew |
main |
466,851,576 |
| Timestamped JSI web corpus 2014-2021 Hungarian |
Hungarian |
main |
903,862,798 |
| Timestamped JSI web corpus 2014-2021 Korean |
Korean |
main |
1,576,995,357 |
| Timestamped JSI web corpus 2014-2021 Polish |
Polish |
main |
973,863,152 |
| Timestamped JSI web corpus 2014-2021 Portuguese |
Portuguese |
main |
4,685,199,909 |
| Timestamped JSI web corpus 2014-2021 Russian |
Russian |
main |
5,788,590,952 |
| Timestamped JSI web corpus 2014-2021 Serbian |
Serbian |
main |
565,311,513 |
| Timestamped JSI web corpus 2014-2021 Spanish |
Spanish |
main |
16,358,148,966 |
| Timestamped JSI web corpus 2014-2021 Swedish |
Swedish |
main |
1,162,692,802 |
| Timestamped JSI web corpus 2014-2022 Estonian |
Estonian |
main |
270,502,859 |
| Timestamped JSI web corpus 2021-03 Catalan |
Catalan |
main |
12,107,597 |
| Timestamped JSI web corpus 2021-03 Czech |
Czech |
main |
20,431,801 |
| Timestamped JSI web corpus 2021-03 Finnish |
Finnish |
main |
6,154,402 |
| Timestamped JSI web corpus 2021-03 French |
French |
main |
145,384,862 |
| Timestamped JSI web corpus 2021-03 German |
German |
main |
126,775,824 |
| Timestamped JSI web corpus 2021-03 Hebrew |
Hebrew |
main |
8,450,710 |
| Timestamped JSI web corpus 2021-03 Hungarian |
Hungarian |
main |
30,439,114 |
| Timestamped JSI web corpus 2021-03 Italian |
Italian |
main |
365,307,999 |
| Timestamped JSI web corpus 2021-03 Korean |
Korean |
main |
19,324,576 |
| Timestamped JSI web corpus 2021-03 Polish |
Polish |
main |
38,911,481 |
| Timestamped JSI web corpus 2021-03 Portuguese |
Portuguese |
main |
108,540,406 |
| Timestamped JSI web corpus 2021-03 Russian |
Russian |
main |
150,971,438 |
| Timestamped JSI web corpus 2021-03 Serbian |
Serbian |
main |
15,122,285 |
| Timestamped JSI web corpus 2021-03 Spanish |
Spanish |
main |
373,185,400 |
| Timestamped JSI web corpus 2021-03 Swedish |
Swedish |
main |
22,715,935 |
| Timestamped JSI web corpus 2021-04 Catalan |
Catalan |
main |
8,926,986 |
| Timestamped JSI web corpus 2021-04 Czech |
Czech |
main |
15,095,366 |
| Timestamped JSI web corpus 2021-04 Finnish |
Finnish |
main |
5,624,514 |
| Timestamped JSI web corpus 2021-04 French |
French |
main |
113,581,013 |
| Timestamped JSI web corpus 2021-04 German |
German |
main |
89,579,085 |
| Timestamped JSI web corpus 2021-04 Hebrew |
Hebrew |
main |
6,544,178 |
| Timestamped JSI web corpus 2021-04 Hungarian |
Hungarian |
main |
23,392,828 |
| Timestamped JSI web corpus 2021-04 Italian |
Italian |
main |
261,813,779 |
| Timestamped JSI web corpus 2021-04 Korean |
Korean |
main |
15,506,235 |
| Timestamped JSI web corpus 2021-04 Polish |
Polish |
main |
28,676,001 |
| Timestamped JSI web corpus 2021-04 Portuguese |
Portuguese |
main |
85,486,841 |
| Timestamped JSI web corpus 2021-04 Russian |
Russian |
main |
117,645,204 |
| Timestamped JSI web corpus 2021-04 Serbian |
Serbian |
main |
12,237,307 |
| Timestamped JSI web corpus 2021-04 Spanish |
Spanish |
main |
289,923,417 |
| Timestamped JSI web corpus 2021-04 Swedish |
Swedish |
main |
16,876,787 |
| Timestamped JSI web corpus 2021-2022 Ukrainian |
Ukrainian |
main |
199,135,032 |
| Timestamped JSI web corpus 2021-22 Spanish |
Spanish |
main |
5,869,620,451 |
| Toxicity Corpus |
English |
main |
102,132,547 |
| Transhistorical Corpus of Written English (TCWE) |
English |
open |
501,633 |
| Turkic web – Azerbaijani |
Azerbaijani |
trial |
94,267,206 |
| Turkic web – Kazakh |
Kazakh |
trial |
139,417,763 |
| Turkic web – Kyrgyz |
Kyrgyz |
trial |
19,369,507 |
| Turkic web – Turkmen |
Turkmen |
trial |
2,105,359 |
| Turkic web – Uzbek |
Uzbek |
trial |
18,720,334 |
| Turkish parliamentary debates (ParlaMint 2.1) |
Turkish |
main |
40,873,301 |
| Turkish parliamentary debates (ParlaMint 2.1, CoNLL format) |
Turkish |
main |
42,913,306 |
| Turkish Web (trWaC) |
Turkish |
main |
32,791,491 |
| Turkish Web 2012 (trTenTen12) |
Turkish |
main |
3,388,418,900 |
| Turkish Web 2020 (trTenTen20) |
Turkish |
trial |
4,980,168,485 |
| Ukrainian Drama Corpus |
Ukrainian |
main |
322,441 |
| Ukrainian European Literary Text Collection (ELTeC) |
Ukrainian |
main |
1,818,180 |
| Ukrainian Trends |
Ukrainian |
trial |
1,020,717,578 |
| Ukrainian Web 2014 (ukTenTen14) |
Ukrainian |
main |
2,194,447,594 |
| Ukrainian Web 2020 and 2014 (ukTenTen20) |
Ukrainian |
main |
2,592,516,436 |
| Ukrainian Web 2022 (ukTenTen22) |
Ukrainian |
trial |
7,594,784,148 |
| UKWaC super sensed |
English |
main |
315,402,632 |
| United Nations Parallel Corpus (UNPC) – Arabic |
Arabic |
trial |
545,594,235 |
| United Nations Parallel Corpus (UNPC) – Chinese |
Chinese Simplified |
trial |
372,004,482 |
| United Nations Parallel Corpus (UNPC) – English |
English |
trial |
664,924,245 |
| United Nations Parallel Corpus (UNPC) – French |
French |
trial |
800,980,141 |
| United Nations Parallel Corpus (UNPC) – Russian |
Russian |
trial |
529,667,487 |
| United Nations Parallel Corpus (UNPC) – Spanish |
Spanish |
trial |
692,809,915 |
| Urdu Web (UrduWaC) |
Urdu |
main |
53,269,273 |
| Urdu Web 2018 (urTenTen18) |
Urdu |
trial |
245,656,128 |
| Vietnamese Web (viWaC) |
Vietnamese |
trial |
106,664,817 |
| Vietnamese Web 2017 (viTenTen17) |
Vietnamese |
trial |
6,056,899,600 |
| Welsh Web 2013 (WelshWaC) |
Welsh |
trial |
12,458,397 |
| Welsh web corpus |
Welsh |
main |
50,392,441 |
| Western Frisian Web 2013 (FrisianWaC) |
Frisian |
trial |
3,116,119 |
| Western Punjabi Web 2017 in Shahmukhi script (pnbTenTen17) |
Punjabi (Shahmukhi) |
trial |
2,806,904 |
| Yiddish Drama Corpus |
Yiddish |
main |
51,351 |
| Yiddish Wikipedia corpus 2018 (yiwiki) |
Yiddish |
trial |
2,106,912 |
| Yoruba Web 2015 (YorubaWaC15) |
Yoruba |
trial |
2,816,965 |