dc.contributor.author Dadurkevičius, Virginijus
dc.contributor.author Utka, Andrius
dc.date.accessioned 2022-11-28T13:36:30Z
dc.date.available 2022-11-28T13:36:30Z
dc.date.issued 2022-11-11
dc.identifier.uri http://hdl.handle.net/20.500.11821/51
dc.description The resource contains data used to estimate the number of words in Lithuanian texts indexed by the selected Global Search Engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by ООО «Яндекс», Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. Shorter lists for the Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages were also compiled. Pivot words are words with special characteristics that are used to estimate the number of words in corpora. Pivot words used to estimate the number of words indexed by GSE should meet the following criteria: 1) frequency of occurrence between 10 and 100; 2) do not coincide with regular words in another language; 3) longer than 6 letters; 4) not of international origin; 5) not foreign loanwords; 6) not proper names of any kind; 7) not headword forms; 8) contain only basic Latin letters; 9) not specific to a particular domain or time period; 10) do not coincide with variants of other words when diacritics are removed; 11) do not coincide, when commonly misspelled, with words in other languages. A low frequency of pivot words is crucial for treating the count of document matches reported by a GSE as an indicator of the word count. Comparative results for the neighbouring Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages have also been assessed. The results have been published at https://www.bjmc.lu.lv/fileadmin/user_upload/lu_portal/projekti/bjmc/Contents/10_3_06_Dadurkevicius.pdf.
dc.language.iso lit
dc.language.iso bel
dc.language.iso est
dc.language.iso fin
dc.language.iso lav
dc.language.iso pol
dc.language.iso rus
dc.publisher SITTI, Vytautas Magnus University
dc.rights PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT
dc.rights.uri https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm
dc.rights.label PUB
dc.subject global search engines
dc.subject Google
dc.subject Bing
dc.subject Yandex
dc.subject Lithuanian language
dc.subject webometrics
dc.subject corpus
dc.subject pivot words
dc.title Frequency lists of pivot words and GSE counts
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.mediaType text
hidden false
hasMetadata false
has.files yes
branding CLARIN-LT
contact.person Virginijus Dadurkevičius virginijus.dadurkevicius@vdu.lt Vytautas Magnus University
size.info 199 words
files.size 10799
files.count 1
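
The mechanically checkable criteria from the dc.description field (1, 3, and 8) can be illustrated with a short sketch. This is not the authors' tooling: the function names, the inclusive frequency bounds, and the input format (a word-to-frequency mapping) are assumptions made for illustration only. The remaining criteria (loanwords, proper names, headword forms, cross-language collisions, and so on) need lexical resources and manual judgement and are not covered here.

```python
import string

# Criterion 8: only basic Latin letters (a-z), so no diacritics such as "ž" or "ė".
BASIC_LATIN = set(string.ascii_lowercase)


def is_pivot_candidate(word: str, corpus_frequency: int) -> bool:
    """Check criteria 1, 3 and 8 only (hypothetical helper, not from the resource)."""
    w = word.lower()
    in_freq_range = 10 <= corpus_frequency <= 100   # criterion 1, bounds assumed inclusive
    long_enough = len(w) > 6                        # criterion 3: longer than 6 letters
    basic_latin_only = bool(w) and all(ch in BASIC_LATIN for ch in w)  # criterion 8
    return in_freq_range and long_enough and basic_latin_only


def filter_candidates(freq_list: dict[str, int]) -> list[str]:
    """Keep the words of a word -> frequency mapping that pass the three checks."""
    return [w for w, f in freq_list.items() if is_pivot_candidate(w, f)]
```

Words that pass this filter are only candidates; each would still have to be reviewed against the other eight criteria before being used as a pivot word.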


 Files in this item

This item is Publicly Available and licensed under: PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT
Name: GSE_data_20220907.zip
Size: 10.55 KB
Format: application/zip
Description: Unknown
