CLARIN-LT digital library in the Republic of Lithuania

https://clarin.vdu.lt:443/xmlui
The CLARIN-LT digital repository system captures, stores, indexes, preserves, and distributes digital research material.
Wed, 24 Apr 2024 23:04:00 GMT

Read Speech Corpus (7G)
http://hdl.handle.net/20.500.11821/58
Raškinis, Gailius; Rudžionis, Vytautas
The corpus of read Lithuanian speech „7G“ was compiled in 2015-2016. The corpus consists of 352 audio recordings with a total duration of over 7 hours. Seven different speakers read excerpts from books and a list of isolated words (the list reflects the diversity of triphones in Lithuanian). The audio recordings are stored as WAV files (PCM, 44.1 kHz, 16-bit, mono). Annotations are stored in MLF format (the format used by the HTK Toolkit). Most of the speakers are young women aged between 20 and 25. The aim was to obtain recordings in as natural a recording environment as possible, so no requirements were placed on the speakers in terms of recording equipment, microphone settings or recording environment. Most of the speakers used personal laptops with a built-in microphone.
Tue, 03 Jan 2017 00:00:00 GMT

Wordlist of the Contemporary Corpus of Lithuanian Language in the Face of War in Ukraine
http://hdl.handle.net/20.500.11821/57
Dadurkevičius, Virginijus
We present a comparative wordlist based on the Corpus of the Contemporary Lithuanian Language (CCLL2 version 2, pre-2020), supplemented by media lexicons (courtesy of the news media company 15min – www.15min.lt) and social network lexicons from the period of the war in Ukraine (Feb 2022 to Feb 2024). For a fair comparison, all word counts have been normalized as if each source contained 100m words. CCLL2 has 162m words, the wartime media 36m words, and the wartime social networks 2m words.
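As a rough illustration of the normalization described above, the following Python sketch assumes a simple linear rescaling of each raw count to a 100m-word source. The function name, the dictionary keys, and the use of the rounded corpus sizes quoted above are illustrative assumptions, not part of the published resource, and the exact procedure used by the author may differ.

```python
# Minimal sketch, assuming counts are rescaled linearly to 100m words.
# Corpus sizes are the rounded figures quoted in the description.
TARGET_SIZE = 100_000_000

CORPUS_SIZES = {
    "ccll2": 162_000_000,   # CCLL2 version 2, pre-2020
    "media": 36_000_000,    # wartime media
    "social": 2_000_000,    # wartime social networks
}


def normalize_count(raw_count: int, source: str) -> float:
    """Rescale a raw count as if the source contained TARGET_SIZE words."""
    return raw_count * TARGET_SIZE / CORPUS_SIZES[source]


# A word seen 1,000 times in the 2m-word social network lexicon
# is scaled up by a factor of 50.
print(normalize_count(1000, "social"))
```

Under this assumption, counts from the smaller sources are scaled up and CCLL2 counts are scaled down, so frequencies from the three sources can be compared directly.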
The term "word" here excludes punctuation, numbers, dates, URLs, hashtags, popular English words, etc. The data is a tab-separated values (TSV) text file with the following columns: word (token), CCLL2 count, CCLL2 docs, media count, media docs, social networks count, social networks docs. Here "docs" means the (normalized) number of documents containing a particular word. All words are treated as case-insensitive and written in capital letters.
Wed, 13 Mar 2024 00:00:00 GMT

JABLONSKIS tagset v2
http://hdl.handle.net/20.500.11821/56
Rimkutė, Erika; Bielinskienė, Agnė; Boizou, Loïc; Utka, Andrius; Dadurkevičius, Virginijus
JABLONSKIS VERSION 2 is a standard Lithuanian morphological tagset based on the abbreviations of parts of speech and other grammatical categories commonly used in Lithuanian linguistic works. The tagset is most suitable for applications in which the tags need to be human-readable. The data contains: 1) Lithuanian and English descriptions of the tagset; 2) TSV files with POS categories and values.
Wed, 24 Jan 2024 00:00:00 GMT

Lithuanian-English Cybersecurity Termbase v.0.1
http://hdl.handle.net/20.500.11821/55
Utka, Andrius; Rackevičienė, Sigita; Bielinskienė, Agnė; Laurinaitis, Marius; Mockienė, Liudmila; Rokas, Aivaras
The bilingual termbase is a TBX export of the online termbase https://www.terminologue.org/csterms/. The termbase includes terms for 233 cybersecurity concepts.
Thu, 13 Apr 2023 00:00:00 GMT

DIGIRES COVID-19 ML Dataset v.1
http://hdl.handle.net/20.500.11821/54
Amilevičius, Darius; Utka, Andrius; Meidutė, Aistė; Ruzaitė, Jūratė
DIGIRES COVID-19 ML dataset v.1 is a tab-separated (.tsv) file prepared for training machine learning algorithms.
The training dataset was compiled from various public Lithuanian online media sources. It contains 351 records with the following attributes:
"Title": the title of a news article
"Text": the text of the article
"Label": a label that marks the article as 1 (unreliable) or 0 (reliable): 1) "unreliable" marks articles that were identified by professional fact checkers as fake news; 2) "reliable" marks trustworthy articles.
Classes     Labels  Word tokens
Reliable    175     67902
Unreliable  176     118747
Total       351     186649
Mon, 20 Feb 2023 00:00:00 GMT

DIGIRES COVID-19 Corpus v.1
http://hdl.handle.net/20.500.11821/53
Amilevičius, Darius; Utka, Andrius; Meidutė, Aistė; Ruzaitė, Jūratė
DIGIRES COVID-19 Corpus v.1 consists of 351 Lithuanian media articles about the COVID-19 pandemic. The corpus was compiled from various public Lithuanian online media sources. The corpus contains 351 files in plain text format (TXT) with UTF-8 encoding. Each article consists of a title (on the first line) and an article body. Files are classified into two subcorpora: 1) "unreliable", which contains articles that were identified by professional fact checkers as fake news; 2) "reliable", which contains trustworthy articles.
Subcorpus   Files  Word tokens
Reliable    175    67902
Unreliable  176    118747
Total       351    186649
Mon, 20 Feb 2023 00:00:00 GMT

English-Lithuanian Comparable Vaccination Corpus
http://hdl.handle.net/20.500.11821/52
Dalgedaitė, Jovita; Rackevičienė, Sigita
Two news portals were selected for building the comparable corpora: the Lithuanian portal DELFI and the English portal The Guardian. The compiled corpora comprise 135 Lithuanian articles from the DELFI portal and 135 English articles from The Guardian portal.
The main criterion for extracting articles from the portals was the presence of two keywords in the articles: "vaccination" and "vaccine". The selected time period for the articles was January 2021 to September 2021. 30 articles (15 Lithuanian and 15 English) were selected for each month of this period. The extracted articles were used to build two types of comparable corpora necessary for further analysis: full-text corpora composed of the full texts of articles, and extract corpora composed of the titles and lead paragraphs of articles. The sizes of the full-text corpora are as follows: the Lithuanian full-text corpus ‘Lithuanian media articles on vaccination’ contains 45,827 words, and the English full-text corpus ‘English media articles on vaccination’ contains 96,759 words. The sizes of the extract corpora are as follows: the Lithuanian extract corpus contains 4,863 words, and the English extract corpus contains 3,828 words.
Wed, 30 Nov 2022 00:00:00 GMT

Frequency lists of pivot words and GSE counts
http://hdl.handle.net/20.500.11821/51
Dadurkevičius, Virginijus; Utka, Andrius
The resource contains data used to estimate the number of words in Lithuanian texts indexed by selected Global Search Engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by ООО «Яндекс», Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. Shorter lists for Belarusian, Estonian, Finnish, Latvian, Polish, and Russian were also compiled. Pivot words are words with special characteristics that are used to estimate the number of words in corpora.
Pivot words used to estimate the number of words indexed by GSE should meet the following criteria: 1) frequency of occurrence of 10-100; 2) do not coincide with regular words in another language; 3) longer than 6 letters; 4) not of international origin; 5) not foreign loanwords; 6) not proper names of any kind; 7) not headword forms; 8) contain only basic Latin letters; 9) not specific to a particular domain or time period; 10) do not coincide with variants of other words when diacritics are removed; 11) do not coincide, when commonly misspelled, with words in other languages. The low frequency of pivot words is crucial for treating the count of document matches reported by GSE as an indicator of the word count. Comparative results for the neighbouring Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages have also been assessed. The results have been published in https://www.bjmc.lu.lv/fileadmin/user_upload/lu_portal/projekti/bjmc/Contents/10_3_06_Dadurkevicius.pdf.
Fri, 11 Nov 2022 00:00:00 GMT

Pedagogic Corpus of Lithuanian
http://hdl.handle.net/20.500.11821/50
Rimkutė, Erika; Kamandulytė-Merfeldienė, Laura; Aleksandravičiūtė, Gabrielė; Anglickienė, Laimutė; Barkauskaitė, Giedrė; Bielinskienė, Agnė; Boizou, Loïc; Grigonytė, Gintarė; Kovalevskaitė, Jolanta; Virbickienė, Gabrielė
The Pedagogic Corpus of Lithuanian is a monolingual specialized corpus prepared for learning and teaching Lithuanian in a foreign language classroom. The pedagogic corpus includes authentic Lithuanian texts, selected using such criteria as a learner-relevant communicative function and genre. Both spoken and written language are represented in the corpus.
The size of the corpus is 669,000 tokens: 111,000 tokens from texts and spoken language for A1-A2 levels and 558,000 tokens from texts and spoken language for B1-B2 levels (according to the Common European Framework of Reference for Languages). The spoken component constitutes approx. 7.5% of the corpus. The written part of the corpus (containing 620,000 tokens) includes levelled texts from coursebooks and unlevelled texts from other sources. The texts from coursebooks and other sources can be classified into 29 text types (dialogs, narratives, information, etc.) and 4 groups according to their communicative aims: informational texts, educational texts, advertising and fiction. There are two types of search in the corpus: simple and advanced (see „Search Tips“). Simple Search allows you to find instances of a search item (word form, lemma, two words) in the whole corpus or in a particular part of the corpus (spoken or written texts). After selecting the written subcorpus, you can further select the text type (coursebook or non-coursebook texts) and/or the genre of the written texts. Advanced Search offers all the features of Simple Search plus some additional options. Since the Pedagogic Corpus is morphologically annotated, Advanced Search allows you to search by grammatical features (e.g. part of speech, case, number, verb form, etc.). At https://kalbu.vdu.lt/mokymosi-priemones/mokomasis-tekstynas/ you can find truncated wordlists: lists of lemmas and word forms (for the whole corpus, for the spoken and written components, and for each level) and lists for particular parts of speech in the whole corpus. The lists can be downloaded as .xlsx files.
REFERENCE
Kovalevskaitė, Jolanta and Rimkutė, Erika. "Pedagogic Corpus of Lithuanian: A New Resource for Learning and Teaching Lithuanian as a Foreign Language" Sustainable Multilingualism, vol.17, no.1, 2020, pp.197-230.
https://doi.org/10.2478/sm-2020-0019
Mon, 29 Aug 2022 00:00:00 GMT

The Database of Lithuanian multiword expressions
http://hdl.handle.net/20.500.11821/49
Bielinskienė, Agnė; Boizou, Loïc; Bumbulienė, Ieva; Kovalevskaitė, Jolanta; Krilavičius, Tomas; Mandravickaitė, Justina; Rimkutė, Erika; Vaičenonienė, Jurgita; Vilkaitė-Lozdienė, Laura
The Database of Lithuanian multiword expressions (MWEs) has been freely accessible for online search at https://resursai.pastovu.vdu.lt/paieska/paprastoji since 2019. It contains two-word and three-word MWEs extracted from the DELFI.lt corpus, which represents news texts on various topics (https://klc.vdu.lt/pastovuSearch.html). Initially, 12,000 MWEs (mostly collocations, a few idioms) were included in the database. In 2022, the database was updated by adding new collocations from the same corpus and filtering arbitrary collocations: out of approx. 19,000 collocations, approx. 9,000 are marked as arbitrary collocations, i.e., having lexical collocability restrictions. The database provides rich information about the usage of collocations: lemma, word forms, frequencies (in the DELFI.lt corpus), morphological information, syntactic relations, grammatical variants, text genres, and usage examples. Cases of usage variation are also illustrated, for example, word order changes or insertions between collocation constituents.
Mon, 20 Jun 2022 00:00:00 GMT