MATAS corpus (version 3.0)

CONTRIBUTORS
Erika Rimkutė, Agnė Bielinskienė, Loic Boizou, Virginijus Dadurkevičius, Jolanta Kovalevskaitė, Andrius Utka

DESCRIPTION
An updated, manually checked, morphologically annotated version of the MATAS corpus.

LANGUAGE
Lithuanian

PREVIOUS VERSIONS
1. MATAS v0.2 (http://hdl.handle.net/20.500.11821/9)
2. MATAS v1.0 (http://hdl.handle.net/20.500.11821/33)

FORMATS, STANDARDS
1. CoNLL-U (https://universaldependencies.org/format.html)
2. JABLONSKIS tagset v2 (https://sitti.vdu.lt/jablonskis-en/)
3. MULTEXT-East tagset (http://nl.ijs.si/ME/V4/msd/html/index.html)
4. UTF-8

SIZE
Tokens (incl. punctuation): 2,137,287
Words:                      1,694,819
Sentences:                    144,047
Documents:                      1,234

GENRES
Genres               Files    Tokens      %
documents (dok)         74    289697   13.6
fiction (gro)           33    428929   20.1
scientific (mok)        75    517092   24.2
news (pub)            1047    757201   35.4
transcripts (ste)        5    144368    6.8

POS COUNTS
noun (N)             637306
verb (V)             338659
adjective (A)        122411
pronoun (P)          147579
numeral (M)           62425
adverb (R)           105235
preposition (S)       77431
conjunction (C)      129492
particle (Q)          36523
interjection (I)       3015
onomatopoeia (O)        209
abbreviation (Y)      28023
others (X)             6511
punctuation (T)      442468

PUBLISHER
Institute of Digital Resources and Interdisciplinary Research (SITTI), Vytautas Magnus University
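
EXAMPLE
The figures above can in principle be recomputed from the CoNLL-U files. Note that the Words figure equals Tokens minus punctuation (2,137,287 - 442,468 = 1,694,819). The following is a minimal sketch in Python, not an official tool of the corpus. It rests on several assumptions that are not stated in this README: that the POS letter (N, V, T, ...) is the first character of the tag in the XPOS column (column 5), that sentences are separated by blank lines as the CoNLL-U format prescribes, and that the sample text, the function name `corpus_stats`, and the parameter `tag_column` are all hypothetical. Adjust `tag_column` if the tags are stored in a different column.

```python
from collections import Counter

# Hypothetical two-sentence sample in CoNLL-U layout (10 tab-separated columns).
# The single-letter tags in column 5 (XPOS) are placeholders borrowed from the
# POS COUNTS list; the real files may hold fuller tags or use another column.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tAš\t_\t_\tP\t_\t_\t_\t_\t_\n"
    "2\tskaitau\t_\t_\tV\t_\t_\t_\t_\t_\n"
    "3\tknygą\t_\t_\tN\t_\t_\t_\t_\t_\n"
    "4\t.\t_\t_\tT\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tKnyga\t_\t_\tN\t_\t_\t_\t_\t_\n"
    "2\tįdomi\t_\t_\tA\t_\t_\t_\t_\t_\n"
    "3\t.\t_\t_\tT\t_\t_\t_\t_\t_\n"
    "\n"
)


def corpus_stats(text, tag_column=4):
    """Count sentences, tokens, words, and POS letters in CoNLL-U text.

    tag_column is the zero-based index of the column holding the tag
    (4 = XPOS); this choice is an assumption, not taken from the README.
    """
    sentences = 0
    pos = Counter()
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment or metadata line.
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        # Use the first character of the tag as the POS letter.
        pos[cols[tag_column][:1]] += 1
    tokens = sum(pos.values())
    words = tokens - pos["T"]  # words = tokens without punctuation (T)
    return {
        "sentences": sentences,
        "tokens": tokens,
        "words": words,
        "pos": dict(pos),
    }


stats = corpus_stats(SAMPLE)
print(stats)
```

To apply the same function to a corpus file, read it as UTF-8 (the encoding listed under FORMATS, STANDARDS) and pass the text in, for example `corpus_stats(open(path, encoding="utf-8").read())`, where `path` is whatever file name the distribution uses.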