Data collected
| Home | Data | Readings |
|---|---|---|
For each new data resource added to awakateko, please add an entry below that includes:
- Name of corpus/resource
- Language or languages
- Amount of data (# words, # segments, etc. - the unit depends on the corpus)
- Link to website/source of data (i.e. where did you find the data?)
- Link to relevant paper(s)
Description:
A multilingual corpus of Jehovah's Witnesses teachings, available through an open-source API as bitexts between pairs of languages. Because the amount of text differs from language to language, the API builds a custom .txt document for each query from the texts that the two requested languages have in common.
This corpus is available on Awakateko through the OPUS API in the FOLTA virtual environment. For directions on accessing individual queries, see here.
Languages:
There are 380 languages in the corpus. There is a matrix available here that shows the intersections of all of the languages, as well as allows for the viewing of sample texts and sentence alignments.
Quantity:
- 380 languages, 46,219 bitexts
- total number of files: 1,285,939
- total number of tokens: 1.95G
- total number of sentence fragments: 105.11M
Source: http://opus.nlpl.eu/JW300.php
Paper & Link:
Agić, Ž., & Vulić, I. (2019). JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3204–3210. https://www.aclweb.org/anthology/P19-1310
Description:
"This [is] an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP tasks. Using the Book, Chapter and Verse indices the corpus is aligned (almost) at a sentence level." - Documentation
This corpus can be accessed on Awakateko, in the form of .xml files in the /corpora/xml_bible-corpus-v.1.13.1 directory. Directions for accessing and working with the corpus can be found here.
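The verse-level alignment described above can be sketched in Python. This is a minimal sketch, not the corpus authors' tooling: it assumes each bible marks verses as `<seg type="verse" id="b.BOOK.chapter.verse">` elements, and the two sample documents and the function names `load_verses` and `align_verses` are made up for illustration. Check the element and attribute names against the actual .xml files before relying on it.

```python
import xml.etree.ElementTree as ET

# Toy documents for illustration only; the real files are far larger.
# The <seg type="verse" id="..."> layout is an assumption about the corpus format.
ENGLISH_SAMPLE = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
</div></div></body></text></cesDoc>"""

SPANISH_SAMPLE = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">En el principio creó Dios los cielos y la tierra.</seg>
</div></div></body></text></cesDoc>"""


def load_verses(xml_text):
    """Map each verse id (e.g. 'b.GEN.1.1') to its text for one bible."""
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter("seg"):
        if seg.get("type") == "verse" and seg.get("id"):
            text = "".join(seg.itertext()).strip()
            if text:  # partial bibles may contain empty verses
                verses[seg.get("id")] = text
    return verses


def align_verses(src, tgt):
    """Pair verses whose ids occur in both bibles, keeping source order."""
    return [(vid, src[vid], tgt[vid]) for vid in src if vid in tgt]


aligned = align_verses(load_verses(ENGLISH_SAMPLE), load_verses(SPANISH_SAMPLE))
for verse_id, english, spanish in aligned:
    print(verse_id, "|", english, "|", spanish)
```

To work from files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`. Pairing on shared ids means partial bibles simply contribute fewer aligned verses.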
Languages:
There are 108 bibles in 102 languages in this corpus, including 20 languages written in a non-Latin script, 39 languages with fewer than 1 million speakers, and 67 non-Indo-European languages. 45 of the bibles are partial texts that do not contain the entire bible. The list of languages can be found here.
Quantity:
Again, I still need to calculate this. -jk
Source:
This corpus is courtesy of Christos Christodoulopoulos and Mark Steedman: http://christos-c.com/bible/
Paper & Link:
Christodoulopoulos, C., & Steedman, M. (2015). A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49(2), 375–395. https://doi.org/10.1007/s10579-014-9287-y
Description:
This corpus was provided to us by Dr. Taraka Kasicheyanula, to help fill language gaps in the XML corpus.
This corpus can be accessed on Awakateko, in the form of .txt files in the /corpora/txt_bible-corpus directory.
Languages:
Specific details about each bible in the corpus can be found here. The language codes covered are:
| aai | aak | aau | aaz | abt | abx | aby | aca | acc | acd | ace | acf | ach | acn |
| acr | acu | ade | adh | adi | adj | adl | ady | adz | aeb | aeu | aey | afr | agd |
| agg | agm | agn | agr | agt | agu | agw | agx | ahk | aia | aii | aim | aji | ajz |
| akb | ake | akh | ald | alj | aln | alp | alq | alt | aly | alz | ame | amf | amh |
| amk | amm | amn | amp | amr | amu | amx | ann | anv | aoi | aoj | aom | aon | aoz |
| apb | ape | apn | apr | apt | apu | apw | apy | apz | ara | arb | are | arl | arn |
| arp | ary | arz | asg | aso | ata | atb | atd | atg | att | auc | aui | auy | ava |
| avt | avu | awa | awb | awi | awx | aym | ayo | ayr | aze | azg | azz | bak | bam |
| ban | bao | bar | bav | bba | bbb | bbc | bbj | bbr | bcc | bch | bci | bcl | bco |
| bcw | bdd | bdh | bea | bef | bel | bem | ben | beq | bex | bfd | bfo | bgr | bgs |
| bgz | bhg | bhl | bhp | bib | big | bim | bis | biu | biv | bjp | bjr | bjv | bkd |
| bkq | bku | bkv | blh | blw | blz | bmb | bmh | bmk | bmq | bmr | bmu | bnj | bnp |
| boa | bod | boj | bom | bon | box | bpr | bps | bqc | bqj | bqp | bre | bru | bsc |
| bsn | bsp | bss | btd | bth | bto | bts | btt | btx | bud | bug | buk | bul | bum |
| bus | bvr | bvz | bwd | bwq | bwu | bxh | bxr | byr | byx | bzd | bzh | bzi | bzj |
| caa | cab | cac | caf | cag | cak | cao | cap | caq | car | cas | cat | cav | cax |
| cbc | cbi | cbk | cbr | cbs | cbt | cbu | cbv | cce | cco | ceb | ceg | ces | cfm |
| cgc | cha | chd | che | chf | chk | chq | chr | chu | chv | chz | cjo | cjp | cjs |
| cjv | ckb | cko | ckt | cle | clu | cly | cme | cmo | cnh | cni | cnl | cnt | cnw |
| coe | cof | cok | con | cop | cor | cot | cpa | cpb | cpc | cpu | cpy | crh | crm |
| crn | crq | crs | crt | crx | csk | cso | csy | cta | ctd | ctp | ctu | cub | cuc |
| cui | cuk | cul | cut | cux | cwe | cwt | cya | cym | czt | daa | dad | dah | dak |
| dan | dar | ded | des | deu | dgc | dgi | dgr | dgz | dhg | dhm | dig | dik | dip |
| dis | dje | djk | djr | dng | dnj | dob | dop | dow | dtp | dts | due | dug | duo |
| dur | dwr | dww | dyi | dyo | dyu | ebk | efi | eka | eko | ell | emi | emp | enb |
| eng | enm | enx | epo | eri | ese | esi | esk | est | esu | etr | etu | eus | eve |
| ewe | eza | faa | fad | fai | fal | fao | ffm | fij | fil | fin | fon | for | fra |
| fry | fub | fue | fuf | fuh | fuq | fuv | gaa | gag | gah | gam | gaw | gbi | gbo |
| gbr | gde | gdg | gdn | gdr | geb | gej | gfk | ghe | ghs | gid | gil | giz | gjn |
| gkn | gkp | gla | gle | glk | glv | gmv | gnb | gnd | gng | gnn | gnw | gof | gog |
| gor | got | gqr | grc | grt | gso | gub | guc | gud | gug | guh | gui | guj | guk |
| gul | gum | gun | guo | guq | gur | guw | gux | guz | gvc | gvf | gvl | gvn | gwi |
| gya | gym | gyr | hae | hag | hak | hat | hau | haw | hay | hbo | hch | heb | heg |
| heh | hif | hig | hil | hin | hix | hla | hlt | hmo | hne | hnj | hnn | hns | hop |
| hot | hra | hrv | hto | hub | hui | hun | hus | huu | huv | hva | hvn | hwc | hye |
| i | ian | iba | ibo | icr | ifa | ifb | ifk | ifu | ify | ign | ikk | iku | ikw |
| ilb | ilo | imo | inb | ind | ino | iou | ipi | iqw | iri | irk | iry | isd | isl |
| ita | itv | ium | ivb | ivv | iws | ixl | izr | izz | jac | jae | jam | jav | jbu |
| jic | jiv | jmc | jpn | jra | jvn | k | kaa | kab | kac | kal | kan | kao | kap |
| kaq | kat | kaz | kbc | kbd | kbh | kbm | kbp | kbq | kbr | kck | kdc | kde | kdh |
| kdi | kdj | kdl | kek | ken | ket | kew | kez | kff | kgf | kgk | kgp | khk | khm |
| khs | khy | khz | kia | kik | kin | kir | kix | kjb | kje | kjh | kjs | kkc | kki |
| kkj | kkl | klt | klv | kma | kmg | kmh | kmk | kmm | kmo | kmr | kms | kmu | kne |
| knf | kng | knj | knk | kno | knv | kog | kor | kos | kpf | kpg | kpj | kpr | kpv |
| kpw | kpx | kpz | kqc | kqe | kqf | kqo | kqp | kqs | kqw | kqy | krc | kri | krj |
| ksc | ksd | ksf | ksp | ksr | kss | ksw | ktb | ktj | ktm | kto | ktu | kua | kub |
| kud | kue | kum | kup | kus | kvj | kvn | kwd | kwf | kwi | kwj | kxc | kxm | kxw |
| kyc | kyf | kyg | kyq | kyu | kyz | kze | kzf | lac | lai | laj | lam | lao | las |
| lat | lav | lbb | lbj | lbk | lcm | ldi | lee | lef | leg | leh | lem | leu | lew |
| lex | lgm | lh | lhi | lhm | lhu | lia | lid | lif | lin | lip | lit | ljp | lmk |
| lmp | lob | lol | lom | loq | loz | lsi | lsm | lug | luo | lus | lwo | lww | lzh |
| maa | mad | maf | mah | mai | maj | mak | mal | mam | maq | mar | mau | mav | maw |
| maz | mbb | mbc | mbd | mbf | mbh | mbi | mbj | mbl | mbs | mbt | mca | mcb | mcd |
| mcf | mck | mcn | mco | mcp | mcq | mcu | mda | mdy | med | mee | mej | mek | men |
| meq | meu | mfe | mfh | mfi | mfk | mfq | mfy | mfz | mgc | mgh | mhi | mhl | mhr |
| mhx | mhy | mib | mic | mie | mif | mig | mih | mil | min | mio | miq | mir | mit |
| miy | miz | mjc | mjw | mkd | mkl | mkn | mks | mlp | mlt | mmn | mmo | mmx | mna |
| mnb | mnf | mnh | mnk | mnx | moc | mog | moh | mop | mor | mos | mox | mpg | mph |
| mpm | mpp | mps | mpt | mpx | mqb | mqj | mqy | mrg | mri | mrw | msa | msb | msc |
| mse | msk | msm | msy | mta | mtg | mti | mtj | mto | mtp | mua | muh | mur | mux |
| muy | mva | mvn | mvp | mwc | mwf | mwm | mwp | mwq | mwv | mww | mxb | mxp | mxq |
| mxt | mya | myb | myk | myu | myv | myw | myx | myy | mza | mzh | mzk | mzl | mzm |
| mzw | mzz | nab | naf | nak | nan | naq | nas | nav | nbc | nbe | nbl | nbq | nca |
| nch | ncj | ncl | nct | ncu | ndc | nde | ndi | ndj | ndo | ndp | nds | ndz | neb |
| nep | nfa | nfr | ngc | ngp | ngu | nhd | nhe | nhg | nhi | nho | nhr | nhu | nhw |
| nhx | nhy | nia | nif | nii | nij | nim | nin | niq | niv | niy | njb | njm | njn |
| njo | njz | nko | nlc | nld | nma | nmf | nmo | nmw | nmz | nnb | nng | nnh | nno |
| nnp | nnq | nnw | noa | nob | nog | nop | not | nou | nph | npi | npl | npo | npy |
| nrf | nri | nsa | nse | nsn | nso | nss | nst | nsu | ntp | ntr | nus | nuy | nvm |
| nwb | nwi | nwx | nxd | nya | nyf | nyn | nyo | nyu | nyy | nzm | o | obo | oji |
| ojs | okv | old | omw | on | ong | ons | ood | opm | ory | oss | ote | otm | otn |
| otq | ots | oym | ozm | pab | pad | pag | pah | pam | pan | pao | pap | pbb | pbc |
| pbi | pbl | pcm | pdt | pes | pfe | pib | pio | pir | pis | pkb | plg | pls | plt |
| plu | plw | pma | pmf | pmx | pnc | pne | poe | poh | poi | pol | pon | por | pot |
| poy | ppk | ppo | pps | prf | prk | prs | pse | ptp | ptu | pua | pwg | pww | py |
| qub | quc | quf | quh | qul | qup | quw | quy | quz | qvc | qve | qvh | qvi | qvm |
| qvn | qvo | qvs | qvw | qvz | qwh | qxh | qxn | qxo | qxr | r | rai | rim | rkb |
| rmo | rmy | ron | roo | rop | rro | ruf | run | rus | rwo | sab | sag | sah | sas |
| sba | sbd | sbe | sbl | sda | seh | sey | sgb | sgw | sgz | shi | shk | shp | shu |
| sig | sil | sim | sin | sja | sld | slk | sll | slv | sme | smk | sml | smo | sna |
| snc | snd | snn | snp | snw | sny | som | soq | sot | soy | spa | spl | spp | sps |
| spy | sqi | sri | srm | srn | srp | srq | ssd | ssg | ssw | ssx | stn | stp | sua |
| sue | suk | sun | sur | sus | swe | swg | swh | swk | swp | sxb | sxn | syb | syc |
| szb | tab | tac | taj | tam | taq | tar | tat | tav | taw | tbc | tbg | tbk | tbl |
| tbo | tby | tbz | tca | tcc | tcs | tcz | tdt | ted | tee | tel | tem | teo | ter |
| tew | tfr | tgk | tgl | tgo | tgp | tha | thk | tif | tih | tik | tim | tir | tiy |
| tke | tkr | tku | tlb | tlf | tlh | tmd | tna | tnc | tnk | tnn | tnp | tob | toc |
| tod | toh | toi | toj | ton | too | top | tos | tpa | tpi | tpm | tpp | tpt | tpz |
| tqb | trc | trn | trq | tsg | tsn | tso | tsw | tsz | ttc | tte | ttq | tuc | tue |
| tuf | tui | tuk | tum | tuo | tur | tvk | twi | twu | txu | tyv | tzc | tzh | tzj |
| tzo | ubr | ubu | udu | uig | ukr | upv | ura | urb | urd | uri | urk | urt | usa |
| usp | uvh | uvl | uyg | uzb | v | vag | var | ven | vid | vie | viv | vmy | vun |
| vut | waj | wal | wap | war | wat | way | wbm | wbp | wca | wed | wer | whk | wib |
| wim | wiu | wmt | wmw | wnc | wnu | wob | wol | wos | wrk | wrs | wsk | wuv | wwa |
| xal | xav | xbi | xbr | xed | xho | xla | xnn | xon | xrb | xsb | xsi | xsm | xsr |
| xsu | xtd | xtm | xuo | yaa | yad | yal | yam | yan | yaq | yby | ycn | yim | ykg |
| yle | yli | yml | yom | yon | yor | yrb | yre | yrk | yss | yua | yue | yuj | yut |
| yuw | yuz | yva | zaa | zab | zac | zad | zae | zai | zam | zao | zar | zas | zat |
| zav | zaw | zca | zho | zia | ziw | zom | zos | zpc | zpi | zpl | zpm | zpo | zpq |
| zpt | zpu | zpv | zpz | zsm | zsr | ztq | zty | zul | zyp |
Quantity:
Bibles Total: 1820
Languages represented: 1396
Source Paper & Link:
Mayer, T., & Cysouw, M. (2014). Creating a Massively Parallel Bible Corpus. Proceedings of the International Conference on Language Resources and Evaluation (LREC), 3158–3163. http://www.lrec-conf.org/proceedings/lrec2014/pdf/220_Paper.pdf
Description:
This corpus contains the data released for WAT2020 from the ALT parallel corpus of Asian languages, specifically the parallel data for Burmese and English. It is not tagged.
Languages:
Myanmar (Burmese) and English parallel data. The full Asian Language Treebank (ALT) contains parallel data for 13 languages.
Quantity:
- 6 files (3 Myanmar, 3 English)
- 20,000 parallel sentences in total
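For working with the paired Myanmar and English files, a minimal Python sketch follows. It assumes the two files of a pair are line-aligned (line n of one file translates line n of the other), which should be verified against the downloaded data. The function name `read_parallel`, the file names, and the placeholder sentences are hypothetical.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(src_path, tgt_path):
    """Pair up line n of the source file with line n of the target file.

    Assumes the two files are line-aligned; raises if the line counts differ.
    """
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line counts differ: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))


# Demonstration with placeholder files; swap in the real corpus paths.
with tempfile.TemporaryDirectory() as tmp:
    my_file = Path(tmp) / "sample.my"  # hypothetical file name
    en_file = Path(tmp) / "sample.en"  # hypothetical file name
    my_file.write_text("<my sentence 1>\n<my sentence 2>\n", encoding="utf-8")
    en_file.write_text("<en sentence 1>\n<en sentence 2>\n", encoding="utf-8")
    pairs = read_parallel(my_file, en_file)

print(len(pairs), "sentence pairs")
```

The length check is a cheap guard: a mismatch in line counts usually means a file was truncated or a pair was mixed up.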
Source:
http://lotus.kuee.kyoto-u.ac.jp/WAT/my-en-data/
Paper and Link:
Thu, Y. K., Pa, W. P., Utiyama, M., Finch, A., & Sumita, E. (2016, May). Introducing the Asian Language Treebank (ALT). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 1574-1578).
Description:
The Wixarika-Spanish Parallel Corpus is composed of 8,967 different Wixarika-Spanish phrase pairs. Wixarika (also known as Huichol) is a polysynthetic indigenous language spoken in Mexico by roughly 50,000 native speakers. The corpus consists of a parallel collection of sentences drawn from the classic fairy tales of Hans Christian Andersen and the Brothers Grimm. This work was done by Dionio Carrillo Gonzáles (dionico94@gmail.com) in 2016. The corpus is licensed under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Languages:
Wixarika and Spanish
Quantity:
8 Files: 11,562 total phrases: 56,0337 total tokens
Source:
https://github.com/pywirrarika/wixarikacorpora
Paper and Link:
Mager, M., Carrillo, D., & Meza, I. (2018). Probabilistic Finite-State morphological segmenter for Wixarika (huichol) language. Journal of Intelligent & Fuzzy Systems, 34(5), 3081-3087.
Description:
The Shipibo-konibo language (approx. 26,000 speakers) belongs to the Panoan language family and is spoken in the Amazon region of Peru and Brazil. This parallel corpus between Spanish and Shipibo-konibo language was constructed using educational and religious documents.
Languages:
Shipibo-konibo and Spanish
Quantity:
3 files:
- BibliaShiSpa_1.txt contains 9,804 aligned verses from the bible (SHI and SPA). Row structure: {book,chapter,versicle,SHI,SPA}
- BibliaShiSpa_2.txt contains 13,587 aligned sentences from the bible (SHI and SPA). Row structure: {book,chapter,versicle,SHI,SPA}
- traduccionTsanas1.csv contains 1,545 aligned sentences from a kindergarten book. Row structure: bookName,sentenceNumber,SHI,SPA
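The row structure listed for traduccionTsanas1.csv can be read with Python's standard `csv` module. The sketch below assumes comma-delimited rows with standard CSV quoting and tolerates an optional header line; neither detail is confirmed by the source. The sample rows are placeholders, and `AlignedSentence` and `read_aligned_rows` are hypothetical names.

```python
import csv
import io
from typing import NamedTuple


class AlignedSentence(NamedTuple):
    """One row of the assumed layout: bookName,sentenceNumber,SHI,SPA."""
    book_name: str
    sentence_number: int
    shi: str
    spa: str


def read_aligned_rows(handle):
    """Parse rows from an open text handle, skipping malformed or header rows."""
    rows = []
    for record in csv.reader(handle):
        if len(record) != 4:
            continue  # not a bookName,sentenceNumber,SHI,SPA row
        book, number, shi, spa = record
        if not number.strip().isdigit():
            continue  # most likely a header line
        rows.append(AlignedSentence(book, int(number), shi.strip(), spa.strip()))
    return rows


# Placeholder content; the real file holds Shipibo-konibo and Spanish sentences.
SAMPLE = (
    "bookName,sentenceNumber,SHI,SPA\n"
    'Tsanas1,1,<SHI sentence 1>,"<SPA sentence 1, with a comma>"\n'
    "Tsanas1,2,<SHI sentence 2>,<SPA sentence 2>\n"
)

rows = read_aligned_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row.sentence_number, row.shi, "=>", row.spa)
```

For the real file, pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` sample. The two BibliaShiSpa files have five fields per row and may use a different delimiter, so they would need their own variant of this reader.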
Source:
http://chana.inf.pucp.edu.pe/resources/
Paper and Link:
Galarreta, Ana-Paula & Melgar, Andrés & Oncevay Marcos, Félix. (2017). Corpus Creation and Initial SMT Experiments between Spanish and Shipibo-konibo. 238-244. 10.26615/978-954-452-049-6_033.
https://www.acl-bg.org/proceedings/2017/RANLP%202017/pdf/RANLP033.pdf
Link to APIs:
http://chana.inf.pucp.edu.pe/index.php/en/api-2/
Description:
The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. This corpus is collected from the Nunavut Hansard between 1999 and 2017, representing 16 sessions over 4 assemblies, and 687 days of debates in the Legislative Assembly of Nunavut. This corpus was processed in 2019 by Eric Joanis (Eric.Joanis@cnrc-nrc.gc.ca), with the assistance of Rebecca Knowles, Roland Kuhn, Samuel Larkin, Patrick Littell, Chi-kiu Lo and Darlene Stewart, National Research Council Canada, and Jeffrey Micher, US Army Research Laboratory.
Languages:
Inuktitut and English
Quantity:
Files: 9
8,068,977 Inuktitut words
17,330,271 English words
Source:
https://nrc-digital-repository.canada.ca/eng/view/object/?id=c7e34fa7-7629-43c2-bd6d-19b32bf64f60
Paper and Link:
Eric Joanis, Rebecca Knowles, Roland Kuhn, Samuel Larkin, Patrick Littell, Chi-kiu Lo, Darlene Stewart, and Jeffrey Micher. (2020). The Nunavut Hansard Inuktitut-English Parallel Corpus 3.0 with Preliminary Machine Translation Results. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020).
https://www.aclweb.org/anthology/2020.lrec-1.312.pdf
Monolingual data, including IGT and other resources from language documentation projects - initial focus on Tibeto-Burman and Mayan languages
Description:
This corpus is a very large set of POS-tagged Tibetan data. It is a compilation of texts from the Buddhist Digital Resource Center. The POS tags were generated by a model trained on data from the SOAS corpus, which was compiled and tagged by Hill and Garrett.
Languages:
Classical Tibetan in Tibetan script, tagged with Roman-alphabet tags
Quantity:
- Number of files: 11 collections of texts, each containing from around 100 to several hundred texts
- Each collection contains tagged and segmented versions of the data in the 'pos' folder and segmented-only versions in the 'seg' folder
- Number of tokens: >185 million across all collections
Source:
https://zenodo.org/record/3951503#.X20_u4tOlEZ
Paper and Link:
Meelen, Marieke, & Roux, Élie. (2020). The Annotated Corpus of Classical Tibetan (ACTib) - Version 2.0 (Segmented & POS-tagged) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3951503
Description:
This corpus is Tibetan data with POS tags and morphological/case information. It is the manually annotated training data used to create the tags for the ACTib corpus. The zip file holds two copies of the data: one seemingly plain data folder and one folder of macOS-compatible data. The data folder contains 4 texts, each in two versions. The versions differ in the number of tags used: the .txt files with "lex" in their name use extremely specific tags.
Languages:
Classical Tibetan with Roman-alphabet tags
Quantity:
- Number of files: 8
- Number of texts: 4
- Number of sentences: not sure
Source:
Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878
Paper and Link:
I couldn't find a paper directly describing this dataset, so I have linked the author's paper most closely related to the data.
Hill, Nathan & Meelen, Marieke. (2017). Segmenting and POS tagging Classical Tibetan using a Memory-Based Tagger. Himalayan Linguistics. 16. 10.5070/H916234501.
Description:
This small set of POS-tagged Burmese data was created by researchers at UCSY. The data is contained in a single .txt file and is approximately 11,000 sentences long according to the GitHub page. The POS tags were produced by manual annotation.
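A short Python sketch for loading the tagged sentences is given here. It assumes one sentence per line with space-separated `word/tag` tokens, which is an assumption about the file layout and should be checked against the repository's README. The function names, the placeholder tokens, and the tag labels in the sample are illustrative only.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split one 'word/tag word/tag ...' line into (word, tag) pairs.

    The tag is taken after the LAST slash, so a slash inside the word survives.
    Tokens without any slash get a tag of None.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:  # no slash at all: keep the token, leave it untagged
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


def tag_counts(lines):
    """Count how often each tag occurs across an iterable of tagged lines."""
    counts = Counter()
    for line in lines:
        counts.update(tag for _, tag in parse_tagged_line(line))
    return counts


# Placeholder tokens stand in for Burmese words; the tags are illustrative.
SAMPLE_LINES = ["tok1/n tok2/v ./punc", "tok3/n tok4/n"]

print(parse_tagged_line(SAMPLE_LINES[0]))
print(tag_counts(SAMPLE_LINES))
```

Counting tags over the whole file is a quick sanity check that the parse matches the tagset the corpus documents.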
Languages:
Burmese, with Roman-alphabet tags
Quantity:
- 1 file
- ~11,000 sentences
- ~240,000 words
Source:
https://github.com/ye-kyaw-thu/myPOS
Paper and Link:
https://github.com/ye-kyaw-thu/myPOS/blob/master/CICLING2017/myPOS-CICLing2017-paper.pdf