<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>CLARIN-LT digital library in the Republic of Lithuania</title>
<link>https://clarin.vdu.lt:443/xmlui</link>
<description>The CLARIN-LT digital repository system captures, stores, indexes, preserves, and distributes digital research material.</description>
<pubDate xmlns="http://apache.org/cocoon/i18n/2.1">Fri, 06 Mar 2026 18:20:31 GMT</pubDate>
<dc:date>2026-03-06T18:20:31Z</dc:date>
<item>
<title>English (L2) Learner Corpus</title>
<link>http://hdl.handle.net/20.500.11821/80</link>
<description>English (L2) Learner Corpus
Juknevičienė, Rita; Šeškauskienė, Inesa
The NEC corpus sample used in the study comprises 433 examination responses (essays) written in L2 English on two topics, namely, The importance of volunteering for young people (from the English examination of 2012, coded as ‘1’; see the code explanation in the description file), and Studying abroad: advantages and disadvantages (from the pilot examination of 2012, coded as ‘2’). The total number of words in the sample is 89,232.&#13;
Reference&#13;
Juknevičienė, R. and I. Šeškauskienė. (2014). The National Examination of English in Lithuania: Searching for Evidence of CEFR Criterial Achievement Levels. Studies About Languages, 25, 88–96. https://doi.org/10.5755/j01.sal.0.25.8579.
</description>
<pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/80</guid>
<dc:date>2026-03-06T00:00:00Z</dc:date>
</item>
<item>
<title>Lithuanian-English Parallel Cybersecurity Corpus – DVITAS</title>
<link>http://hdl.handle.net/20.500.11821/78</link>
<description>Lithuanian-English Parallel Cybersecurity Corpus – DVITAS
Mickevič, Jolanta; Rackevičienė, Sigita
The Lithuanian-English Parallel Cybersecurity Corpus consists of official cybersecurity documents of the Republic of Lithuania and their English translations, dating from 2014 to 2024. The documents were obtained from the legal act repositories of the Republic of Lithuania (e-seimas.lrs.lt; e-tar.lt) and from the official website of the National Cyber Security Centre under the Ministry of National Defence of the Republic of Lithuania (nksc.lt).&#13;
&#13;
The total size of the corpus is 216,213 words (LT: 96,085 words; EN: 120,128 words). The texts are aligned at the sentence level; the corpus contains 6,417 aligned segments (sentences).&#13;
&#13;
The dataset consists of 152 files encoded in UTF-8. The files are arranged in the following archives:&#13;
&#13;
1) LT-EN_Parallel_Cybersecurity_Corpus_TMX.zip - 30 files in TMX (translation memory exchange) format: &#13;
   - 29 separate LT-EN TMX files with aligned texts&#13;
   - 1 combined file consolidating all 29 LT-EN TMX files &#13;
&#13;
2) LT-EN_Parallel_Cybersecurity_Corpus_VERT.zip - 60 files in VERT (vertical text) format:&#13;
   - 29 separate LT VERT files with morphological annotation,&#13;
   - 1 combined file consolidating all 29 LT VERT files,&#13;
   - 29 separate EN VERT files with morphological annotation,&#13;
   - 1 combined file consolidating all 29 EN VERT files.&#13;
   SENTENCE ALIGNMENT: &#13;
   Each &lt;align&gt; block in VERT files corresponds to a translation unit &lt;tu&gt; in TMX files.&#13;
   MORPHOLOGICAL ANNOTATION STRUCTURE:&#13;
   LT: wordform | lempos | tag (LT MULTEXT-East)   &#13;
   EN: wordform | tag | lempos (EN TreeTagger)&#13;
   TAGSET REFERENCES:&#13;
   https://www.sketchengine.eu/lithuanian-multext-east-part-of-speech-tagset/ &#13;
   https://www.sketchengine.eu/english-treetagger-pipeline-2/ &#13;
&#13;
3) LT-EN_Parallel_Cybersecurity_Corpus_TXT.zip - 60 files in TXT (plain text) format: &#13;
   - 29 separate LT TXT files,&#13;
   - 1 combined file consolidating all 29 LT TXT files,&#13;
   - 29 separate EN TXT files,&#13;
   - 1 combined file consolidating all 29 EN TXT files.&#13;
&#13;
4) LT-EN_Parallel_Cybersecurity_Corpus_CSV(Metadata).zip - 2 files with metadata in CSV (comma separated values) format: &#13;
   - 1 LT CSV file with metadata,&#13;
   - 1 EN CSV file with metadata.&#13;
   Metadata categories: &#13;
   File names, Type of document, Title of document, Author of document, Year, Words, Source, URL.
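
As a rough illustration of the morphological annotation structure listed under archive 2, the Python sketch below maps each token line of a VERT file to named fields. It assumes that columns are tab-separated and that structural lines (such as the align blocks) contain no tab; neither detail is stated in this description. The function name parse_vert_tokens is hypothetical and not part of the dataset.

```python
# Column order per language, as given in the annotation structure above.
COLUMN_ORDER = {
    "LT": ("wordform", "lempos", "tag"),
    "EN": ("wordform", "tag", "lempos"),
}


def parse_vert_tokens(lines, language):
    """Return a list of token dicts from the lines of a VERT file.

    Assumption: columns are tab-separated. Lines without a tab
    (structural markup or blank lines) are skipped.
    """
    columns = COLUMN_ORDER[language]
    tokens = []
    for line in lines:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            continue
        values = line.split("\t")
        tokens.append(dict(zip(columns, values)))
    return tokens


# Placeholder line only; it is not taken from the corpus.
print(parse_vert_tokens(["WORDFORM\tTAG\tLEMPOS\n"], "EN"))
```

Keeping the column order in one table per language means the same function can serve both the LT and EN files, which differ only in where the tag and lempos columns sit.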
</description>
<pubDate>Sat, 24 Jan 2026 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/78</guid>
<dc:date>2026-01-24T00:00:00Z</dc:date>
</item>
<item>
<title>CCLL Lemmatised Frequency Lists</title>
<link>http://hdl.handle.net/20.500.11821/77</link>
<description>CCLL Lemmatised Frequency Lists
Petkevičius, Mindaugas
The resource contains 6 frequency lists for the Corpus of Contemporary Lithuanian language (CCLL) (https://sitti.vdu.lt/en/services/)&#13;
1-LT_token_freq_list.txt&#13;
- a full frequency list of all tokens in CCLL&#13;
2-LT_token_freq_stats.txt&#13;
- token statistics and the 100 most common tokens in CCLL&#13;
3-LT_alpha_wordform_freq_list.txt&#13;
- a full frequency list of Lithuanian alphabetic wordforms in CCLL&#13;
4-LT_lemma_alpha_freq_list.txt&#13;
- a full frequency list of Lithuanian alphabetic lemmas in CCLL&#13;
5-LT_lemma_and_punct_freq_list_freq_list.txt&#13;
- a full frequency list of Lithuanian lemmas and punctuation marks in CCLL&#13;
6-LT_lemma_and_punct_freq_stats.txt&#13;
- statistics on lemmas and punctuation marks, and the 100 most common lemmas and punctuation marks in CCLL.
</description>
<pubDate>Mon, 29 Dec 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/77</guid>
<dc:date>2025-12-29T00:00:00Z</dc:date>
</item>
<item>
<title>Lithuanian Science and Research Terminology: Multilingual Term List</title>
<link>http://hdl.handle.net/20.500.11821/76</link>
<description>Lithuanian Science and Research Terminology: Multilingual Term List
Rimkutė, Erika; Bielinskienė, Agnė; Boizou, Loïc; Grigonytė, Gintarė; Kovalevskaitė, Jolanta; Utka, Andrius
Tab-separated (TSV) UTF-8 text file containing 223 Lithuanian science and research terms with definitions and translation equivalents in English, German, and French.&#13;
Intended use: as a reference resource or for import into computer-assisted translation (CAT) tools and translation projects.&#13;
Fields (columns) per record:&#13;
* **Term-LT** – Lithuanian term&#13;
* **Definition-LT** – Lithuanian definition&#13;
* **Term-EN** – English equivalent&#13;
* **Term-DE** – German equivalent&#13;
* **Term-FR** – French equivalent&#13;
* **Domain-LT** – subject domain in Lithuanian&#13;
* **Acronym-LT** – Lithuanian acronym (if applicable)
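
A minimal Python sketch of how such a file might be read, assuming it begins with a header row that uses exactly the column names listed above (the description does not confirm this). The function name read_terms is hypothetical, and the sample row contains placeholder values rather than entries from the term list.

```python
import csv
import io

# Column names as listed in the description; whether the file really
# starts with a header row using these exact names is an assumption.
FIELDS = ["Term-LT", "Definition-LT", "Term-EN", "Term-DE",
          "Term-FR", "Domain-LT", "Acronym-LT"]


def read_terms(handle):
    """Yield one dict per record from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Missing or empty cells (e.g. no acronym) become empty strings.
        yield {field: (row.get(field) or "").strip() for field in FIELDS}


# Placeholder data only; these values are not taken from the term list.
sample = io.StringIO(
    "\t".join(FIELDS) + "\n"
    + "\t".join(["lt-term", "lt-definition", "en-term",
                 "de-term", "fr-term", "lt-domain", ""]) + "\n"
)
records = list(read_terms(sample))
print(records[0]["Term-EN"])
```

For the real file, the in-memory sample would be replaced by a handle opened with encoding="utf-8" and newline="".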
</description>
<pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/76</guid>
<dc:date>2025-12-05T00:00:00Z</dc:date>
</item>
<item>
<title>Frequency List of Lithuanian Homoforms</title>
<link>http://hdl.handle.net/20.500.11821/75</link>
<description>Frequency List of Lithuanian Homoforms
Žemrietė, Miglė; Rimkutė, Erika; Dadurkevičius, Virginijus; Petkevičius, Mindaugas
The list contains 63,139 homoforms. In the Frequency List of Lithuanian Homoforms, the following data are provided for each homoform: 1) the homoform itself, 2) its lemma (or lemmas), 3) detailed morphological information, 4) its frequency in the morphologically annotated corpus MATAS (for more information on this corpus, see https://clarin.vdu.lt/xmlui/handle/20.500.11821/61), 5) the frequency of the homoform (each homoform consists of at least two components; the number in the homoform frequency column indicates how many times a given component of the homoform occurs in the MATAS corpus; a zero indicates that the component is only theoretically possible), 6) the type and subtype of the homoform. The list of homoforms can be searched at https://morph.vdu.lt/homoformos/.
</description>
<pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/75</guid>
<dc:date>2025-11-18T00:00:00Z</dc:date>
</item>
<item>
<title>LOCOLE (Longitudinal Corpus of Learner English)</title>
<link>http://hdl.handle.net/20.500.11821/74</link>
<description>LOCOLE (Longitudinal Corpus of Learner English)
Juknevičienė, Rita; Vilkaitė-Lozdienė, Laura; Kasteckienė, Jurga; Salei, Palina
Information about LOCOLE&#13;
This corpus comprises essays written by university students of English Philology over the course of one academic year. The essays were collected four times during the 2024-2025 academic year. They were all written by hand in the classroom, without access to any reference tools. The essays were manually keyboarded, preserving the authentic learner writing, including non-standard spelling, language use, and punctuation. Essay topics are provided in Table 1 below.&#13;
Table 1. Essays in the corpus&#13;
Cohort	Essay topic	Number of essays	Date of data collection&#13;
1.	Is education the key to success?	28	September&#13;
2.	University pressures	26	October&#13;
3.	Can AI replace human teachers?	29	January&#13;
4.	Linguistic theories have no place outside academia	26	May &#13;
		109 in total	&#13;
&#13;
Text ID&#13;
Each text in the corpus has a unique ID code, for example, 1_F_01. The first number in the code identifies the cohort (by time of data collection and topic prompt). The last number is the participant's sequential number in the list (see the CSV file). The letter indicates the gender of the participant:&#13;
-	F stands for ‘Female’,&#13;
-	M stands for ‘Male’,&#13;
-	O stands for ‘Other’,&#13;
-	N shows that the participant preferred not to indicate their gender.&#13;
More detailed information about the participants is available in the attached CSV file. &#13;
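
As a rough illustration of the ID scheme described above, the following Python sketch splits an ID such as 1_F_01 into its three parts. The function name parse_text_id and the dictionary keys are hypothetical and not part of the corpus distribution; the cohort-to-month mapping is taken from Table 1.

```python
# Hypothetical helper, not part of the LOCOLE distribution.
GENDER_LABELS = {"F": "Female", "M": "Male", "O": "Other", "N": "Not indicated"}
COHORT_MONTHS = {1: "September", 2: "October", 3: "January", 4: "May"}


def parse_text_id(text_id):
    """Split an ID such as '1_F_01' into cohort, gender, and participant number."""
    cohort, gender, participant = text_id.split("_")
    if gender not in GENDER_LABELS:
        raise ValueError("unknown gender code: " + gender)
    return {
        "cohort": int(cohort),
        "month": COHORT_MONTHS[int(cohort)],
        "gender": GENDER_LABELS[gender],
        "participant": int(participant),
    }


print(parse_text_id("1_F_01"))
```

An ID with an unknown gender letter raises a ValueError, which makes malformed file names easy to spot when processing the whole corpus.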
&#13;
Funding sources&#13;
The digitization of the corpus was supported by the Research Council of Lithuania as a Student Summer Internship project.
</description>
<pubDate>Fri, 14 Nov 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/74</guid>
<dc:date>2025-11-14T00:00:00Z</dc:date>
</item>
<item>
<title>Corpus of Transcriptions - part 2</title>
<link>http://hdl.handle.net/20.500.11821/73</link>
<description>Corpus of Transcriptions - part 2
Bikelienė, Lina; Martín de la Rosa, Victoria; Černelytė, Laura
The second part of the Corpus of Transcriptions contains phonemic transcriptions of a short passage from Lecumberri and Maidment (2000, p. 78) produced by undergraduate students at the end of a course in English Phonetics: (1) Vilnius University subcorpus (22 files, c. 2,853 tokens, c. 11,712 phonetic units); (2) Complutense University of Madrid subcorpus (27 files, c. 3,737 tokens, c. 14,956 phonetic units). The preferred variety of English is SSBE. The data can be used for comparative research on non-native English learners' phonemic transcriptions. The files are in TXT, DOCX, and ODT formats; MP4A and MP3 audio files are also included. See also the first part of the corpus: https://clarin.vdu.lt/xmlui/handle/20.500.11821/67.
</description>
<pubDate>Sun, 26 Oct 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/73</guid>
<dc:date>2025-10-26T00:00:00Z</dc:date>
</item>
<item>
<title>English-Lithuanian Parallel Migration Corpus</title>
<link>http://hdl.handle.net/20.500.11821/72</link>
<description>English-Lithuanian Parallel Migration Corpus
Usinskiene, Olga; Rackevičienė, Sigita
The English-Lithuanian Parallel Migration Corpus includes original English texts and their Lithuanian translations, aligned at the sentence level. The texts are drawn from EU legal acts and other migration-related documents published in the EUR-Lex database between 1998 and 2024.&#13;
&#13;
The total size of the corpus is 1,223,350 words (EN - 688,410 words; LT - 534,940 words). The corpus contains 43,345 aligned segments (sentences).&#13;
&#13;
Within the dataset, the following files are included:&#13;
1) EN-LT_Parallel_Migration_Corpus_TMX.zip &#13;
   This file is composed of 51 files in TMX (translation memory exchange) format: &#13;
   - 50 separate EN-LT TMX files with aligned texts&#13;
   - 1 combined file consolidating all 50 EN-LT TMX files&#13;
2) EN-LT_Parallel_Migration_Corpus_VERT.zip&#13;
   This file is composed of 102 files in VERT (vertical text) format:&#13;
   - 50 separate EN files with morphological annotation&#13;
   - 1 combined EN file consolidating all 50 EN VERT files&#13;
   - 50 separate LT files with morphological annotation&#13;
   - 1 combined LT file consolidating all 50 LT VERT files&#13;
   Sentence alignment: &#13;
   Each &lt;align&gt; block corresponds to a TMX translation unit &lt;tu&gt;.&#13;
   Morphological annotation structure:&#13;
   EN: wordform | tag | lempos (EN TreeTagger)&#13;
   LT: wordform | lempos | tag (LT MULTEXT-East)&#13;
   Tagset references:&#13;
   https://www.sketchengine.eu/english-treetagger-pipeline-2/&#13;
   https://www.sketchengine.eu/lithuanian-multext-east-part-of-speech-tagset/&#13;
3) EN-LT_Parallel_Migration_Corpus_TXT.zip&#13;
   This file is composed of 100 files in TXT (plain text) format: &#13;
   - 50 separate EN files&#13;
   - 50 separate LT files&#13;
4) EN-LT_Parallel_Migration_Corpus_CSV(Metadata).zip&#13;
   This file is composed of 2 files with metadata in CSV (comma separated values) format: &#13;
   - 1 EN file with metadata&#13;
   - 1 LT file with metadata&#13;
   Metadata categories: Form of document, File name (CELEX number of document), Title of document, Author of document (Institution), Year of Publication, Word count, URL.&#13;
The dataset comprises a total of 255 files, all encoded in UTF-8.
</description>
<pubDate>Wed, 15 Oct 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/72</guid>
<dc:date>2025-10-15T00:00:00Z</dc:date>
</item>
<item>
<title>Oral History Resource: Lithuanian Testimonies of Siberian Deportations</title>
<link>http://hdl.handle.net/20.500.11821/71</link>
<description>Oral History Resource: Lithuanian Testimonies of Siberian Deportations
Usonytė, Gabrielė; Augustinavičienė, Elvyra; Vaičenonienė, Jurgita
The oral history resource includes:&#13;
(1) Audio recordings (recorded in 2009-2010) of personal narratives by siblings Pranas Šuminskas and Vladislava Šuminskaitė about their childhood in the Lithuanian village of Laičiai, their deportation to Siberia, and their lives after returning from exile. In total, there are 54 recordings (Šuminskaitė (VS) – 18 recordings, Šuminskas (PS) – 27, both speakers (PS-VS) – 9) with a combined duration of 7 hours 14 minutes. &#13;
(2) Transcriptions of audio recordings.&#13;
(3) An extended description by Gabrielė Usonytė on her acquaintance with the speakers, the recording process and a transcribed version of the records accompanied by related illustrative material. &#13;
(4) English translation of Gabrielė Usonytė’s text as well as the speakers’ biographies. The translation was performed by ChatGPT 3.5 (OpenAI) and edited by Jurgita Vaičenonienė.&#13;
This oral and written resource might be of interest to linguists (as the speakers speak the Eastern Aukštaitian dialect), historians, anthropologists and other researchers both in Lithuania and abroad.
</description>
<pubDate>Fri, 22 Aug 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/71</guid>
<dc:date>2025-08-22T00:00:00Z</dc:date>
</item>
<item>
<title>LITUND corpus v1</title>
<link>http://hdl.handle.net/20.500.11821/70</link>
<description>LITUND corpus v1
Dambrauskas, Edgaras; Utka, Andrius
LITUND contains two comparable corpora:&#13;
1. Unreliable news texts. 147 full-text articles (100,678 words) identified as misleading by professional fact-checkers. The corpus includes a metadata file with the following information: file name, text topic category, title, the specific false claim addressed, publication date, URL of the text, word count, debunking reference, and URL of the debunking reference.&#13;
2. LRT corpus. 147 full-text articles (131,640 words) published by Lithuania’s national broadcaster (LRT) on topics similar to those in the Unreliable News Corpus. The corpus includes a metadata file with the following information: file name, text topic category, publication date, URL of the text, and word count.&#13;
The corpora are available in two formats: 1) plain text (UTF-8 encoding) and 2) morphologically tagged text in CoNLL-U format. The morphological annotation was produced by the morphological analyser MORFUOKLIS (https://sitti.vdu.lt/morfuoklis/en/about).&#13;
The corpus covers 6 topic categories: Environment, COVID-19, Health, Politics, War in Ukraine, Others.
</description>
<pubDate>Wed, 25 Jun 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">http://hdl.handle.net/20.500.11821/70</guid>
<dc:date>2025-06-25T00:00:00Z</dc:date>
</item>
</channel>
</rss>
