dc.contributor.author |
Dadurkevičius, Virginijus |
dc.contributor.author |
Utka, Andrius |
dc.date.accessioned |
2022-11-28T13:36:30Z |
dc.date.available |
2022-11-28T13:36:30Z |
dc.date.issued |
2022-11-11 |
dc.identifier.uri |
http://hdl.handle.net/20.500.11821/51 |
dc.description |
The resource contains data used to estimate the number of words in Lithuanian texts indexed by selected global search engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by ООО «Яндекс», Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. Shorter lists for the Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages were also compiled.
Pivot words are words with special characteristics that are used to estimate the number of words in corpora. Pivot words used to estimate the number of words indexed by GSE should meet the following criteria: 1) have a frequency of occurrence between 10 and 100; 2) do not coincide with regular words in another language; 3) are longer than 6 letters; 4) are not of international origin; 5) are not foreign loanwords; 6) are not proper names of any kind; 7) are not headword forms; 8) contain only basic Latin letters; 9) are not specific to a particular domain or time period; 10) do not coincide with variants of other words when diacritics are removed; 11) do not coincide, when commonly misspelled, with words in other languages.
The low frequency of pivot words is crucial for treating the count of document matches reported by a GSE as an indicator of the word count. Comparative results for the neighbouring Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages have also been assessed. The results have been published at https://www.bjmc.lu.lv/fileadmin/user_upload/lu_portal/projekti/bjmc/Contents/10_3_06_Dadurkevicius.pdf. |
dc.language.iso |
lit |
dc.language.iso |
bel |
dc.language.iso |
est |
dc.language.iso |
fin |
dc.language.iso |
lav |
dc.language.iso |
pol |
dc.language.iso |
rus |
dc.publisher |
SITTI, Vytautas Magnus University |
dc.rights |
PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT |
dc.rights.uri |
https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm |
dc.rights.label |
PUB |
dc.subject |
global search engines |
dc.subject |
Google |
dc.subject |
Bing |
dc.subject |
Yandex |
dc.subject |
Lithuanian language |
dc.subject |
webometrics |
dc.subject |
corpus |
dc.subject |
pivot words |
dc.title |
Frequency lists of pivot words and GSE counts |
dc.type |
lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
hidden |
false |
hasMetadata |
false |
has.files |
yes |
branding |
CLARIN-LT |
contact.person |
Virginijus Dadurkevičius virginijus.dadurkevicius@vdu.lt Vytautas Magnus University |
size.info |
199 words |
files.size |
10799 |
files.count |
1 |