We present a comparative wordlist based on the Corpus of the Contemporary Lithuanian Language (CCLL2, version 2, pre-2020), supplemented by two lexicons from the period of the war in Ukraine (Feb 2022 to Feb 2024): news media (courtesy of the news media company 15min, www.15min.lt) and social networks.
For a fair comparison, all word counts have been normalized to a common size of 100m words per source. The original sizes are 162m words for CCLL2, 36m words for wartime media, and 2m words for wartime social networks. Here the term "word" excludes punctuation, numbers, dates, URLs, hashtags, popular English words, etc.
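The normalization described above can be illustrated with a minimal sketch. It assumes simple linear scaling of raw counts to a 100m-word corpus; the exact procedure used to build the wordlist (rounding, and how document counts were normalized) is not stated here, and the names `CORPUS_SIZES` and `normalize_count` are hypothetical.

```python
# Approximate source sizes in words, taken from the description above.
CORPUS_SIZES = {
    "ccll2": 162_000_000,   # CCLL2, pre-2020
    "media": 36_000_000,    # wartime media
    "social": 2_000_000,    # wartime social networks
}
TARGET_SIZE = 100_000_000   # common size all counts are scaled to


def normalize_count(raw_count: float, source: str) -> float:
    """Scale a raw count to what it would be in a 100m-word source.

    Assumption: plain linear scaling, with no rounding or smoothing.
    """
    return raw_count * TARGET_SIZE / CORPUS_SIZES[source]


# A word seen 3,600 times in the 36m-word media source scales to 10,000.
print(normalize_count(3600, "media"))   # 10000.0
```

Because the social networks source is the smallest (2m words), each raw occurrence there is multiplied by 50, so small raw differences are magnified in the normalized figures.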
The data is provided as a tab-separated values (TSV) text file with the following columns: word (token), CCLL2 count, CCLL2 docs, media count, media docs, social networks count, social networks docs. Here "docs" is the (normalized) number of documents containing the word. Words are treated as case-insensitive and are written in capital letters.
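A file with this layout could be read along the following lines. This is a sketch under stated assumptions: the field names in `Entry` are hypothetical labels for the seven columns, the values are parsed as floats because they are normalized, and whether the file starts with a header row is not known, so it is left as a parameter. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Entry(NamedTuple):
    # Hypothetical field names for the seven columns described above.
    word: str
    ccll2_count: float
    ccll2_docs: float
    media_count: float
    media_docs: float
    social_count: float
    social_docs: float


def read_wordlist(handle: TextIO, has_header: bool = False) -> Iterator[Entry]:
    """Yield one Entry per well-formed TSV row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: at most one header line
    for row in reader:
        if len(row) != 7:
            continue  # skip blank or malformed lines
        yield Entry(row[0], *(float(value) for value in row[1:]))


# Invented sample rows, for illustration only.
sample = (
    "KARAS\t1200\t800\t5400\t3100\t6100\t2900\n"
    "TAIKA\t900\t650\t1500\t1100\t1300\t700\n"
)
entries = list(read_wordlist(io.StringIO(sample)))
lookup = {entry.word: entry for entry in entries}

# Words are stored in capital letters, so upper-case a query before lookup.
print(lookup["karas".upper()].media_count)
```

For a real file, the same function can be called on `open(path, encoding="utf-8", newline="")`, where the UTF-8 encoding is itself an assumption about how the file was saved.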