MATAS corpus (version 1.0)
DESCRIPTION
MATAS is a manually checked, morphologically annotated corpus of Lithuanian.
FORMATS
1. CoNLL-U (CONLLU, conllu)
2. SketchEngine - tab-delimited, one word per line (TAB-WPL, txt)
SIZE
Wordform count: 1,693,410
Sentence count: 144,047
GENRES
Contains 5 genres: Documents (14%), Fiction (19%), Periodicals (36%), Scientific texts (24%), Transcripts (7%)
TAGSETS
Morphological annotation is presented with 3 different tagsets:
- Universal Dependencies (POS in column 4, morphological categories in column 6), see universaldependencies.org;
- Jablonskis (column 5), see Documentation folder;
- Multext-East (column 10), see Documentation folder.
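
The column layout above can be illustrated with a minimal Python sketch that splits one CoNLL-U token line and picks out the four tagset columns. The sample line is hypothetical: the word form and UD values are only an illustration, JABLONSKIS_TAG and MULTEXT_TAG are placeholders rather than real tags from the corpus, and the function names read_tags and parse_feats are invented for this example. The exact form of the values in columns 5 and 10 should be checked against the Documentation folder.

```python
from typing import Dict

# Hypothetical token line (not taken from the corpus). Columns 5 and 10 hold
# placeholders, because the real tag strings are defined in the Documentation folder.
SAMPLE_LINE = (
    "1\tnamas\tnamas\tNOUN\tJABLONSKIS_TAG\t"
    "Case=Nom|Gender=Masc|Number=Sing\t_\t_\t_\tMULTEXT_TAG"
)


def read_tags(line: str) -> Dict[str, str]:
    """Split one CoNLL-U token line and return the tagset columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    return {
        "form": cols[1],
        "ud_pos": cols[3],        # column 4: Universal Dependencies POS
        "jablonskis": cols[4],    # column 5: Jablonskis tag
        "ud_feats": cols[5],      # column 6: UD morphological categories
        "multext_east": cols[9],  # column 10: Multext-East tag
    }


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a UD feature string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":  # CoNLL-U uses '_' for an empty field
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


tags = read_tags(SAMPLE_LINE)
print(tags["ud_pos"], tags["jablonskis"], tags["multext_east"])
print(parse_feats(tags["ud_feats"]))
```

When reading a whole file, lines starting with "#" (sentence-level comments) and blank lines (sentence boundaries) would need to be skipped before calling read_tags.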
JABLONSKIS AND MULTEXT-EAST TAGSETS
Jablonskis: Lithuanian tagset, human-readable
Multext-East: English tagset, machine-readable
Please use the following text to cite this item:
Rimkutė E., Daudaravičius V., Utka A. 2007: Morphological Annotation of the Lithuanian Corpus. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; Workshop Balto-Slavonic Natural Language Processing 2007, Prague, 94–99.