Cristina Ramos edited this page Nov 12, 2020 · 55 revisions

For each new data resource added to Awakateko, please add an entry below that includes:

  • Name of corpus/resource
  • Language or languages
  • Amount of data (# words, # segments, etc. - the unit depends on the corpus)
  • Link to website/source of data (i.e. where did you find the data?)
  • Link to relevant paper(s)

UniMorph resources

Parallel multilingual corpora

JW300 Corpus

Description:

A multilingual corpus of Jehovah's Witnesses teachings, available through an open-source API in the form of bitexts between pairs of languages. Because the amount of text differs from language to language, the API creates a custom .txt document for each query, based on the two languages requested and the texts they have in common.

This corpus is available on Awakateko through the OPUS API in the FOLTA virtual environment. For directions on accessing individual queries, see here.

Languages:

There are 380 languages in the corpus. A matrix available here shows the intersections of all of the languages and allows viewing of sample texts and sentence alignments.

Quantity:

  • 380 languages, 46,219 bitexts
  • total number of files: 1,285,939
  • total number of tokens: 1.95G
  • total number of sentence fragments: 105.11M

Source: http://opus.nlpl.eu/JW300.php

Paper & Link:

Agić, Ž., & Vulić, I. (2019). JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3204–3210. https://www.aclweb.org/anthology/P19-1310

XML Bible Corpus

Description:

"This [is] an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP tasks. Using the Book, Chapter and Verse indices the corpus is aligned (almost) at a sentence level." - Documentation

This corpus can be accessed on Awakateko, in the form of .xml files in the /corpora/xml_bible-corpus-v.1.13.1 directory. Directions for accessing and working with the corpus can be found here.
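
The verse-level alignment described above can be sketched in a few lines. This is only an illustration, not the corpus authors' tooling: it assumes each verse is stored as a `seg` element carrying `type="verse"` and a book/chapter/verse `id` attribute, which should be checked against the actual files. The function names `verses_by_id` and `align` are hypothetical.

```python
import xml.etree.ElementTree as ET


def verses_by_id(xml_text):
    """Map verse id -> verse text for one Bible file (assumed layout)."""
    root = ET.fromstring(xml_text)
    verses = {}
    # Assumption: each verse is a <seg type="verse" id="..."> element.
    for seg in root.iter("seg"):
        if seg.get("type") == "verse" and seg.get("id"):
            verses[seg.get("id")] = "".join(seg.itertext()).strip()
    return verses


def align(src_xml, tgt_xml):
    """Pair up verses that carry the same id in both files."""
    src = verses_by_id(src_xml)
    tgt = verses_by_id(tgt_xml)
    # Keep source order; skip verses missing or empty on either side.
    return [(vid, src[vid], tgt[vid]) for vid in src if src[vid] and tgt.get(vid)]
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Skipping verses that are missing on one side matters here because many of the Bibles are only partial texts.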

Languages:

There are 108 Bibles in 102 languages in this corpus, including 20 languages written in a non-Latin script, 39 languages with fewer than 1 million speakers, and 67 non-Indo-European languages. 45 of the Bibles are partial texts that do not contain the entire Bible. The list of languages can be found here.

Quantity:

Still to be calculated. -jk

Source:

This corpus is courtesy of Christos Christodoulopoulos and Mark Steedman: http://christos-c.com/bible/

Paper & Link:

Christodoulopoulos, C., & Steedman, M. (2015). A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49(2), 375–395. https://doi.org/10.1007/s10579-014-9287-y

TXT Bible Corpus

Description:

This corpus was provided to us by Dr. Taraka Kasicheyanula, to help fill language gaps in the XML corpus.

This corpus can be accessed on Awakateko, in the form of .txt files in the /corpora/txt_bible-corpus directory.

Languages:

Specific details about each Bible in the corpus can be found here.

aai aak aau aaz abt abx aby aca acc acd ace acf ach acn
acr acu ade adh adi adj adl ady adz aeb aeu aey afr agd
agg agm agn agr agt agu agw agx ahk aia aii aim aji ajz
akb ake akh ald alj aln alp alq alt aly alz ame amf amh
amk amm amn amp amr amu amx ann anv aoi aoj aom aon aoz
apb ape apn apr apt apu apw apy apz ara arb are arl arn
arp ary arz asg aso ata atb atd atg att auc aui auy ava
avt avu awa awb awi awx aym ayo ayr aze azg azz bak bam
ban bao bar bav bba bbb bbc bbj bbr bcc bch bci bcl bco
bcw bdd bdh bea bef bel bem ben beq bex bfd bfo bgr bgs
bgz bhg bhl bhp bib big bim bis biu biv bjp bjr bjv bkd
bkq bku bkv blh blw blz bmb bmh bmk bmq bmr bmu bnj bnp
boa bod boj bom bon box bpr bps bqc bqj bqp bre bru bsc
bsn bsp bss btd bth bto bts btt btx bud bug buk bul bum
bus bvr bvz bwd bwq bwu bxh bxr byr byx bzd bzh bzi bzj
caa cab cac caf cag cak cao cap caq car cas cat cav cax
cbc cbi cbk cbr cbs cbt cbu cbv cce cco ceb ceg ces cfm
cgc cha chd che chf chk chq chr chu chv chz cjo cjp cjs
cjv ckb cko ckt cle clu cly cme cmo cnh cni cnl cnt cnw
coe cof cok con cop cor cot cpa cpb cpc cpu cpy crh crm
crn crq crs crt crx csk cso csy cta ctd ctp ctu cub cuc
cui cuk cul cut cux cwe cwt cya cym czt daa dad dah dak
dan dar ded des deu dgc dgi dgr dgz dhg dhm dig dik dip
dis dje djk djr dng dnj dob dop dow dtp dts due dug duo
dur dwr dww dyi dyo dyu ebk efi eka eko ell emi emp enb
eng enm enx epo eri ese esi esk est esu etr etu eus eve
ewe eza faa fad fai fal fao ffm fij fil fin fon for fra
fry fub fue fuf fuh fuq fuv gaa gag gah gam gaw gbi gbo
gbr gde gdg gdn gdr geb gej gfk ghe ghs gid gil giz gjn
gkn gkp gla gle glk glv gmv gnb gnd gng gnn gnw gof gog
gor got gqr grc grt gso gub guc gud gug guh gui guj guk
gul gum gun guo guq gur guw gux guz gvc gvf gvl gvn gwi
gya gym gyr hae hag hak hat hau haw hay hbo hch heb heg
heh hif hig hil hin hix hla hlt hmo hne hnj hnn hns hop
hot hra hrv hto hub hui hun hus huu huv hva hvn hwc hye
i ian iba ibo icr ifa ifb ifk ifu ify ign ikk iku ikw
ilb ilo imo inb ind ino iou ipi iqw iri irk iry isd isl
ita itv ium ivb ivv iws ixl izr izz jac jae jam jav jbu
jic jiv jmc jpn jra jvn k kaa kab kac kal kan kao kap
kaq kat kaz kbc kbd kbh kbm kbp kbq kbr kck kdc kde kdh
kdi kdj kdl kek ken ket kew kez kff kgf kgk kgp khk khm
khs khy khz kia kik kin kir kix kjb kje kjh kjs kkc kki
kkj kkl klt klv kma kmg kmh kmk kmm kmo kmr kms kmu kne
knf kng knj knk kno knv kog kor kos kpf kpg kpj kpr kpv
kpw kpx kpz kqc kqe kqf kqo kqp kqs kqw kqy krc kri krj
ksc ksd ksf ksp ksr kss ksw ktb ktj ktm kto ktu kua kub
kud kue kum kup kus kvj kvn kwd kwf kwi kwj kxc kxm kxw
kyc kyf kyg kyq kyu kyz kze kzf lac lai laj lam lao las
lat lav lbb lbj lbk lcm ldi lee lef leg leh lem leu lew
lex lgm lh lhi lhm lhu lia lid lif lin lip lit ljp lmk
lmp lob lol lom loq loz lsi lsm lug luo lus lwo lww lzh
maa mad maf mah mai maj mak mal mam maq mar mau mav maw
maz mbb mbc mbd mbf mbh mbi mbj mbl mbs mbt mca mcb mcd
mcf mck mcn mco mcp mcq mcu mda mdy med mee mej mek men
meq meu mfe mfh mfi mfk mfq mfy mfz mgc mgh mhi mhl mhr
mhx mhy mib mic mie mif mig mih mil min mio miq mir mit
miy miz mjc mjw mkd mkl mkn mks mlp mlt mmn mmo mmx mna
mnb mnf mnh mnk mnx moc mog moh mop mor mos mox mpg mph
mpm mpp mps mpt mpx mqb mqj mqy mrg mri mrw msa msb msc
mse msk msm msy mta mtg mti mtj mto mtp mua muh mur mux
muy mva mvn mvp mwc mwf mwm mwp mwq mwv mww mxb mxp mxq
mxt mya myb myk myu myv myw myx myy mza mzh mzk mzl mzm
mzw mzz nab naf nak nan naq nas nav nbc nbe nbl nbq nca
nch ncj ncl nct ncu ndc nde ndi ndj ndo ndp nds ndz neb
nep nfa nfr ngc ngp ngu nhd nhe nhg nhi nho nhr nhu nhw
nhx nhy nia nif nii nij nim nin niq niv niy njb njm njn
njo njz nko nlc nld nma nmf nmo nmw nmz nnb nng nnh nno
nnp nnq nnw noa nob nog nop not nou nph npi npl npo npy
nrf nri nsa nse nsn nso nss nst nsu ntp ntr nus nuy nvm
nwb nwi nwx nxd nya nyf nyn nyo nyu nyy nzm o obo oji
ojs okv old omw on ong ons ood opm ory oss ote otm otn
otq ots oym ozm pab pad pag pah pam pan pao pap pbb pbc
pbi pbl pcm pdt pes pfe pib pio pir pis pkb plg pls plt
plu plw pma pmf pmx pnc pne poe poh poi pol pon por pot
poy ppk ppo pps prf prk prs pse ptp ptu pua pwg pww py
qub quc quf quh qul qup quw quy quz qvc qve qvh qvi qvm
qvn qvo qvs qvw qvz qwh qxh qxn qxo qxr r rai rim rkb
rmo rmy ron roo rop rro ruf run rus rwo sab sag sah sas
sba sbd sbe sbl sda seh sey sgb sgw sgz shi shk shp shu
sig sil sim sin sja sld slk sll slv sme smk sml smo sna
snc snd snn snp snw sny som soq sot soy spa spl spp sps
spy sqi sri srm srn srp srq ssd ssg ssw ssx stn stp sua
sue suk sun sur sus swe swg swh swk swp sxb sxn syb syc
szb tab tac taj tam taq tar tat tav taw tbc tbg tbk tbl
tbo tby tbz tca tcc tcs tcz tdt ted tee tel tem teo ter
tew tfr tgk tgl tgo tgp tha thk tif tih tik tim tir tiy
tke tkr tku tlb tlf tlh tmd tna tnc tnk tnn tnp tob toc
tod toh toi toj ton too top tos tpa tpi tpm tpp tpt tpz
tqb trc trn trq tsg tsn tso tsw tsz ttc tte ttq tuc tue
tuf tui tuk tum tuo tur tvk twi twu txu tyv tzc tzh tzj
tzo ubr ubu udu uig ukr upv ura urb urd uri urk urt usa
usp uvh uvl uyg uzb v vag var ven vid vie viv vmy vun
vut waj wal wap war wat way wbm wbp wca wed wer whk wib
wim wiu wmt wmw wnc wnu wob wol wos wrk wrs wsk wuv wwa
xal xav xbi xbr xed xho xla xnn xon xrb xsb xsi xsm xsr
xsu xtd xtm xuo yaa yad yal yam yan yaq yby ycn yim ykg
yle yli yml yom yon yor yrb yre yrk yss yua yue yuj yut
yuw yuz yva zaa zab zac zad zae zai zam zao zar zas zat
zav zaw zca zho zia ziw zom zos zpc zpi zpl zpm zpo zpq
zpt zpu zpv zpz zsm zsr ztq zty zul zyp
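
The entries above appear to be three-letter language codes, but a few tokens are shorter (for example `i`, `k`, `lh`). Whether these are truncated codes or artifacts of how the list was generated has not been verified. A minimal sketch for flagging such tokens, assuming every entry should be exactly three lowercase letters (the helper name `split_codes` is hypothetical):

```python
import re

# Assumption: every entry should be a three-letter lowercase code.
CODE_PATTERN = re.compile(r"^[a-z]{3}$")


def split_codes(listing):
    """Return (well_formed, suspicious) tokens from a whitespace-separated listing."""
    tokens = listing.split()
    good = [t for t in tokens if CODE_PATTERN.match(t)]
    bad = [t for t in tokens if not CODE_PATTERN.match(t)]
    return good, bad
```

The pattern only checks the shape of each token. It does not confirm that a well-formed token is a registered language code, so a lookup against a code table would still be needed for that.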

Quantity:

Bibles Total: 1820

Languages represented: 1396

Source Paper & Link:

Mayer, T., & Cysouw, M. (2014). Creating a Massively Parallel Bible Corpus. Proceedings of the International Conference on Language Resources and Evaluation (LREC), 3158–3163. http://www.lrec-conf.org/proceedings/lrec2014/pdf/220_Paper.pdf

Myanmar - English Parallel Data from ALT for WAT 2020

Description:

This corpus contains the data provided for WAT 2020 from the ALT parallel corpus of Asian languages, specifically the Burmese-English parallel data. It is not tagged.

Languages:

Myanmar (Burmese) and English parallel data. The full Asian Language Treebank (ALT) contains parallel data for 13 languages.

Quantity:

6 files (3 Myanmar, 3 English); 20,000 parallel sentences in total

Source:

http://lotus.kuee.kyoto-u.ac.jp/WAT/my-en-data/

Paper and Link:

Thu, Y. K., Pa, W. P., Utiyama, M., Finch, A., & Sumita, E. (2016, May). Introducing the Asian Language Treebank (ALT). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 1574-1578).

https://www.aclweb.org/anthology/L16-1249.pdf

Wixarika-Spanish Parallel Corpus

Description:

The Wixarika-Spanish Parallel Corpus is composed of 8,967 different phrases in Wixarika with their Spanish translations. Wixarika (also known as Huichol) is a polysynthetic indigenous language spoken in Mexico by roughly 50,000 native speakers. The corpus consists of a parallel collection of sentences taken from the classic fairy tales of Hans Christian Andersen and the Brothers Grimm. This work was done by Dionico Carrillo Gonzáles (dionico94@gmail.com) in 2016. The corpus is licensed under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Languages:

Wixarika and Spanish

Quantity:

8 Files: 11,562 total phrases: 56,0337 total tokens

Source:

https://github.com/pywirrarika/wixarikacorpora

Paper and Link:

Mager, M., Carrillo, D., & Meza, I. (2018). Probabilistic Finite-State morphological segmenter for Wixarika (huichol) language. Journal of Intelligent & Fuzzy Systems, 34(5), 3081-3087.

https://www.researchgate.net/publication/325363441_Probabilistic_Finite-State_morphological_segmenter_for_Wixarika_huichol_language

Shipibo-konibo and Spanish Parallel Corpus

Description:

The Shipibo-konibo language (approx. 26,000 speakers) belongs to the Panoan language family and is spoken in the Amazon region of Peru and Brazil. This Spanish and Shipibo-konibo parallel corpus was constructed from educational and religious documents.

Languages:

Shipibo-konibo and Spanish

Quantity:

3 Files:

BibliaShiSpa_1.txt contains 9,804 aligned verses from the Bible (SHI and SPA). Row structure: {book,chapter,versicle,SHI,SPA}

BibliaShiSpa_2.txt contains 13,587 aligned sentences from the Bible (SHI and SPA). Row structure: {book,chapter,versicle,SHI,SPA}

traduccionTsanas1.csv contains 1,545 aligned sentences from a kindergarten book. Row structure: bookName,sentenceNumber,SHI,SPA
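
The row structure listed for traduccionTsanas1.csv can be read into named fields as sketched below. This is a sketch under assumptions: the file is taken to be comma-delimited with no header row, which has not been checked against the actual file, and `read_pairs` is a hypothetical helper name. The sample row uses placeholder text rather than real corpus data.

```python
import csv
import io
from collections import namedtuple

# Field names taken from the row structure listed for traduccionTsanas1.csv.
Row = namedtuple("Row", ["bookName", "sentenceNumber", "SHI", "SPA"])


def read_pairs(handle):
    """Yield Row tuples from a comma-delimited file with four columns per row."""
    for fields in csv.reader(handle):
        if len(fields) != 4:
            continue  # skip malformed or blank rows rather than guess
        yield Row(*(f.strip() for f in fields))


# Usage with an in-memory placeholder row (not real corpus content).
sample = io.StringIO("Tsanas,1,shipibo sentence,spanish sentence\n")
for row in read_pairs(sample):
    print(row.sentenceNumber, row.SHI, row.SPA)
```

For the real file, `open(path, newline="", encoding="utf-8")` could be passed in place of the `StringIO` object. The encoding is also an assumption.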

Source:

http://chana.inf.pucp.edu.pe/resources/

Paper and Link:

Galarreta, Ana-Paula & Melgar, Andrés & Oncevay Marcos, Félix. (2017). Corpus Creation and Initial SMT Experiments between Spanish and Shipibo-konibo. 238-244. 10.26615/978-954-452-049-6_033.

https://www.acl-bg.org/proceedings/2017/RANLP%202017/pdf/RANLP033.pdf

Link to APIs

http://chana.inf.pucp.edu.pe/index.php/en/api-2/

Nunavut Hansard Inuktitut-English Parallel Corpus 3.0.1

Description:

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. This corpus is collected from the Nunavut Hansard between 1999 and 2017, representing 16 sessions over 4 assemblies, and 687 days of debates in the Legislative Assembly of Nunavut. This corpus was processed in 2019 by Eric Joanis (Eric.Joanis@cnrc-nrc.gc.ca), with the assistance of Rebecca Knowles, Roland Kuhn, Samuel Larkin, Patrick Littell, Chi-kiu Lo and Darlene Stewart, National Research Council Canada, and Jeffrey Micher, US Army Research Laboratory.

Languages:

Inuktitut and English

Quantity:

Files: 9

8,068,977 Inuktitut words

17,330,271 English words

Source:

https://nrc-digital-repository.canada.ca/eng/view/object/?id=c7e34fa7-7629-43c2-bd6d-19b32bf64f60

Paper and Link:

Eric Joanis, Rebecca Knowles, Roland Kuhn, Samuel Larkin, Patrick Littell, Chi-kiu Lo, Darlene Stewart and Jeffrey Micher. The Nunavut Hansard Inuktitut-English Parallel Corpus 3.0 with Preliminary Machine Translation Results. Submitted to LREC 2020

https://www.aclweb.org/anthology/2020.lrec-1.312.pdf

Monolingual data, including IGT and other resources from language documentation projects - initial focus on Tibeto-Burman and Mayan languages

Annotated Corpora of Classical Tibetan (ACTib)

Description:

This corpus is a very large set of POS-tagged Tibetan data. It is a compilation of texts from the Buddhist Digital Resource Center. The POS tags were generated by a model trained on data from the SOAS corpus, which was compiled and tagged by Hill and Garrett.

Languages:

Classical Tibetan in Tibetan script, tagged with roman alphabet tags

Quantity:

Number of files: 11 collections of texts. Each collection varies in how many texts it has, ranging from around 100 to several hundred texts per collection. Each collection contains tagged and segmented versions of the data in the 'pos' folder and segmented-only versions in the 'seg' folder.

Number of tokens: >185 million across all collections

Source:

https://zenodo.org/record/3951503

Paper and Link:

Meelen, Marieke, & Roux, Élie. (2020). The Annotated Corpus of Classical Tibetan (ACTib) - Version 2.0 (Segmented & POS-tagged) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3951503

SOAS Corpus of Classical Tibetan

Description:

This corpus is Tibetan data with POS tags and morphological/case information. It is the manually tagged training data used to create the tags for the ACTib corpus. The zip file contains two copies of the data: one seemingly plain data folder and one folder of macOS-compatible data. Within the data folder there are 4 texts, each in two versions. The versions differ in the number of tags used: the .txt files with "lex" in their name use extremely specific tags.

Languages:

Classical Tibetan with roman alphabet tags

Quantity:

Number of files: 8

Number of texts: 4

Number of sentences: not yet determined

Source:

Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878

Paper and Link:

I couldn't find a paper directly describing this dataset, so I have linked the author's paper most closely related to the data.

Hill, Nathan & Meelen, Marieke. (2017). Segmenting and POS tagging Classical Tibetan using a Memory-Based Tagger. Himalayan Linguistics. 16. 10.5070/H916234501.

https://www.researchgate.net/publication/322631224_Segmenting_and_POS_tagging_Classical_Tibetan_using_a_Memory-Based_Tagger

UCSY myPOS Burmese Data

Description:

This small set of POS-tagged Burmese data was created by researchers at UCSY. The data is contained in a single .txt file and is approximately 11,000 sentences long according to the GitHub page. The POS tags were produced by manual annotation.

Languages:

Burmese, tags use roman alphabet

Quantity:

1 file; ~11,000 sentences; ~240,000 words

Source:

https://github.com/ye-kyaw-thu/myPOS

Paper and Link:

https://github.com/ye-kyaw-thu/myPOS/blob/master/CICLING2017/myPOS-CICLing2017-paper.pdf


Typological/linguistic resources