| dc.description |
The English-Lithuanian Parallel Migration Corpus includes original English texts and their Lithuanian translations, aligned at the sentence level. The texts are drawn from EU legal acts and other migration-related documents published in the EUR-Lex database between 1998 and 2024.
The total size of the corpus is 1,223,350 words (EN - 688,410 words; LT - 534,940 words). The corpus contains 43,345 aligned segments (sentences).
Within the dataset, the following files are included:
1) EN-LT_Parallel_Migration_Corpus_TMX.zip
This file is composed of 51 files in TMX (translation memory exchange) format:
- 50 separate EN-LT TMX files with aligned texts
- 1 combined file consolidating all 50 EN-LT TMX files
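As a minimal sketch of how the aligned segments might be read from one of the TMX files, the Python function below collects EN-LT sentence pairs using only the standard library. It assumes the usual TMX layout (<tu> units containing <tuv> variants with an xml:lang attribute and a <seg> child) and that the language codes start with "en" and "lt"; the function name read_tmx_pairs and its parameters are illustrative and not part of the dataset.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="en", tgt_lang="lt"):
    """Return a list of (source, target) segment pairs.

    `source` is a path or a binary file-like object holding a TMX document.
    Assumption: language codes begin with a two-letter prefix such as
    "en" or "lt" (for example "en-GB" is treated as "en").
    """
    pairs = []
    root = ET.parse(source).getroot()
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text nested in inline markup.
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        # Keep only units that have both languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

Called on the combined TMX file, a function like this would be expected to return one pair per aligned segment, provided the files follow the layout assumed above.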
2) EN-LT_Parallel_Migration_Corpus_VERT.zip
This file is composed of 102 files in VERT (vertical text) format:
- 50 separate EN files with morphological annotation
- 1 combined EN file consolidating all 50 EN VERT files
- 50 separate LT files with morphological annotation
- 1 combined LT file consolidating all 50 LT VERT files
Sentence alignment:
Each <align> block corresponds to a TMX translation unit <tu>.
Morphological annotation structure:
EN: wordform | tag | lempos (EN TreeTagger)
LT: wordform | lempos | tag (LT MULTEXT-East)
Tagset references:
https://www.sketchengine.eu/english-treetagger-pipeline-2/
https://www.sketchengine.eu/lithuanian-multext-east-part-of-speech-tagset/
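Because the column order differs between the EN and LT files, a small helper can make the difference explicit when reading them. The sketch below assumes that each token occupies one line with tab-separated columns and that structural tags such as <align> sit on lines of their own, as is common for vertical files. The separator, the names COLUMN_ORDER and parse_vert_line, and the handling of malformed lines are assumptions rather than documented properties of the dataset.

```python
# Column order per language, as listed in the annotation structure above.
COLUMN_ORDER = {
    "en": ("wordform", "tag", "lempos"),
    "lt": ("wordform", "lempos", "tag"),
}


def parse_vert_line(line, lang):
    """Return a dict for a token line, or None for any other line.

    Assumption: columns are separated by tab characters.
    """
    line = line.rstrip("\r\n")
    if not line.strip():
        return None  # blank line
    # A structural tag such as <align> or <s> has no tab-separated columns.
    if line.startswith("<") and line.endswith(">") and "\t" not in line:
        return None
    fields = line.split("\t")
    names = COLUMN_ORDER[lang]
    if len(fields) != len(names):
        return None  # unexpected column count; skipped in this sketch
    return dict(zip(names, fields))
```

Returning dictionaries keyed by column name lets downstream code ask for the tag or lempos without depending on the per-language column position.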
3) EN-LT_Parallel_Migration_Corpus_TXT.zip
This file is composed of 100 files in TXT (plain text) format:
- 50 separate EN files
- 50 separate LT files
4) EN-LT_Parallel_Migration_Corpus_CSV(Metadata).zip
This file is composed of 2 files with metadata in CSV (comma-separated values) format:
- 1 EN file with metadata
- 1 LT file with metadata
Metadata categories: Form of document, File name (CELEX number of document), Title of document, Author of document (Institution), Year of Publication, Word count, URL.
The dataset comprises a total of 255 files, all encoded in UTF-8. |