Show simple item record

 
dc.contributor.author Usinskiene, Olga
dc.contributor.author Rackevičienė, Sigita
dc.date.accessioned 2025-10-20T18:12:58Z
dc.date.available 2025-10-20T18:12:58Z
dc.date.issued 2025-10-15
dc.identifier.uri http://hdl.handle.net/20.500.11821/72
dc.description The English-Lithuanian Parallel Migration Corpus includes original English texts and their Lithuanian translations, aligned at the sentence level. The texts are drawn from EU legal acts and other migration-related documents published in the EUR-Lex database between 1998 and 2024. The corpus totals 1,223,350 words (EN: 688,410 words; LT: 534,940 words) and contains 43,345 aligned segments (sentences).
The dataset includes the following files:
1) EN-LT_Parallel_Migration_Corpus_TMX.zip
This archive contains 51 files in TMX (Translation Memory eXchange) format:
- 50 separate EN-LT TMX files with aligned texts
- 1 combined file consolidating all 50 EN-LT TMX files
2) EN-LT_Parallel_Migration_Corpus_VERT.zip
This archive contains 102 files in VERT (vertical text) format:
- 50 separate EN files with morphological annotation
- 1 combined EN file consolidating all 50 EN VERT files
- 50 separate LT files with morphological annotation
- 1 combined LT file consolidating all 50 LT VERT files
Sentence alignment: each <align> block corresponds to a TMX translation unit <tu>.
Morphological annotation structure:
EN: wordform | tag | lempos (EN TreeTagger)
LT: wordform | lempos | tag (LT MULTEXT-East)
Tagset references:
https://www.sketchengine.eu/english-treetagger-pipeline-2/
https://www.sketchengine.eu/lithuanian-multext-east-part-of-speech-tagset/
3) EN-LT_Parallel_Migration_Corpus_TXT.zip
This archive contains 100 files in TXT (plain text) format:
- 50 separate EN files
- 50 separate LT files
4) EN-LT_Parallel_Migration_Corpus_CSV(Metadata).zip
This archive contains 2 metadata files in CSV (comma-separated values) format:
- 1 EN file with metadata
- 1 LT file with metadata
Metadata categories: Form of document, File name (CELEX number of document), Title of document, Author of document (Institution), Year of Publication, Word count, URL.
The dataset comprises a total of 255 files, all encoded in UTF-8.
dc.language.iso eng
dc.language.iso lit
dc.publisher Mykolas Romeris University
dc.rights PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT
dc.rights.uri https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm
dc.rights.label PUB
dc.subject parallel corpus
dc.subject specialized corpus
dc.subject migration corpus
dc.subject migration
dc.title English-Lithuanian Parallel Migration Corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN-LT
contact.person Sigita Rackevičienė, sigita.rackeviciene@mruni.eu, Mykolas Romeris University
size.info 1223350 words
size.info 43345 sentences
size.info 255 files
files.size 17738735
files.count 6
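
The description states that the TMX files hold sentence-aligned EN-LT texts, one translation unit <tu> per aligned segment. A minimal sketch of reading such pairs with Python's standard library follows. It assumes the files use the usual TMX layout (<tu> containing <tuv> elements with an xml:lang or lang attribute and a <seg> child) and that the language codes begin with "en" and "lt"; neither detail is confirmed by this record. The function name read_tmx_pairs and the sample segment are hypothetical and serve only as illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src="en", tgt="lt"):
    """Return (source, target) segment pairs from a TMX file or file object.

    Assumption: language codes start with the two-letter prefixes given
    in `src` and `tgt` (for example "en", "EN" or "en-GB").
    """
    pairs = []
    root = ET.parse(source).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup, if any.
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


# Illustrative in-memory sample (not taken from the corpus).
SAMPLE = (
    '<tmx version="1.4"><header/><body>'
    '<tu><tuv xml:lang="en"><seg>Visa policy</seg></tuv>'
    '<tuv xml:lang="lt"><seg>Vizų politika</seg></tuv></tu>'
    "</body></tmx>"
).encode("utf-8")

pairs = read_tmx_pairs(io.BytesIO(SAMPLE))
print(pairs)
```

To read a file from the unpacked archive, pass its path instead of the in-memory sample. If the description's figures hold, the combined TMX file should yield 43,345 pairs.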


Files in this item (16.92 MB in total)

This item is Publicly Available and licensed under: PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT
Name                                               Size      Format           Description
readme_LT.pdf                                      61.67 KB  PDF              description
readme_EN.pdf                                      58.46 KB  PDF              description
EN-LT_Parallel_Migration_Corpus_TMX.zip            5.36 MB   application/zip  EN-LT_Parallel_Migration_Corpus_TMX
EN-LT_Parallel_Migration_Corpus_VERT.zip           9.17 MB   application/zip  EN-LT_Parallel_Migration_Corpus_VERT
EN-LT_Parallel_Migration_Corpus_TXT.zip            2.26 MB   application/zip  EN-LT_Parallel_Migration_Corpus_TXT
EN-LT_Parallel_Migration_Corpus_CSV(Metadata).zip  9.92 KB   application/zip  EN-LT_Parallel_Migration_Corpus_CSV(Metadata)
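
For the VERT archive, the record gives a different annotation column order for each language (EN: wordform | tag | lempos; LT: wordform | lempos | tag) and notes that each <align> block corresponds to one TMX translation unit. The sketch below groups tokens by <align> block and normalizes the column order. It assumes the common vertical-text convention of one token per line with tab-separated columns, and that other structural markup (such as <s>) appears on lines of its own. These details, the names Token and read_align_blocks, and the sample tokens and tags are illustrative assumptions, not taken from the corpus files.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    wordform: str
    tag: str
    lempos: str


# Column order per language, as given in the dataset description.
COLUMN_ORDER = {
    "en": ("wordform", "tag", "lempos"),
    "lt": ("wordform", "lempos", "tag"),
}


def read_align_blocks(lines: Iterable[str], lang: str) -> Iterator[list[Token]]:
    """Yield one list of tokens per <align> block."""
    order = COLUMN_ORDER[lang]
    block = None  # None while outside an <align> block
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("<align"):
            block = []
        elif line.startswith("</align"):
            if block is not None:
                yield block
            block = None
        elif line.startswith("<") and line.endswith(">") and "\t" not in line:
            continue  # other structural markup, e.g. <s> (assumed)
        elif block is not None and line:
            fields = line.split("\t")
            if len(fields) == 3:  # skip lines that do not have three columns
                block.append(Token(**dict(zip(order, fields))))


# Illustrative samples; the tokens and tags are made up for the demo.
EN_SAMPLE = ["<align>", "<s>", "visa\tNN\tvisa-n", "</s>", "</align>"]
LT_SAMPLE = ["<align>", "<s>", "viza\tviza-n\tNcfsnn", "</s>", "</align>"]

en_blocks = list(read_align_blocks(EN_SAMPLE, "en"))
lt_blocks = list(read_align_blocks(LT_SAMPLE, "lt"))
print(en_blocks, lt_blocks)
```

With a real file, pass an open text handle, for example open(path, encoding="utf-8"), since the record states that all files are encoded in UTF-8. Reading the combined EN and LT files in parallel and zipping the two generators would then pair the blocks segment by segment, provided both files list their <align> blocks in the same order.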
