Lithuanian 2-gram dataset

Authors: Bielinskienė, Agnė ; Boizou, Loïc ; Bumbulienė, Ieva ; Kovalevskaitė, Jolanta ; Krilavičius, Tomas ; Mandravickaitė, Justina ; Rimkutė, Erika ; Vilkaitė-Lozdienė, Laura

Project URL: http://mwe.lt/

Date issued: 2019

Type: lexicalConceptualResource

Size: 67000000 entries

Language(s): Lithuanian

Description: Dataset of 2-grams with frequencies, extracted from the Delfi.lt corpus (~70 million words, period: March 2014 - November 2016). First, the corpus was split into sentences; then symbol analysis was performed, along with analysis of intended structures made of symbols. A dictionary of abbreviations was also used in order to preserve abbreviations. Finally, 2-grams were generated, yielding 67 million entries in total. The frequency of each entry is included in the dataset.
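
The processing steps described above (sentence splitting, abbreviation preservation, 2-gram generation with frequencies) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the abbreviation list, the function names, the lowercasing, the punctuation stripping, and the choice to count 2-grams only within a sentence are all assumptions made for illustration, and the symbol-analysis step is not modelled.

```python
from collections import Counter

# Hypothetical abbreviation list; the dictionary actually used for the
# dataset is not given in this record.
ABBREVIATIONS = {"pvz.", "t.y.", "prof."}


def split_sentences(text):
    """Split whitespace-separated tokens into sentences.

    A token ending in '.', '!' or '?' closes a sentence unless it is a
    known abbreviation.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def normalize(token):
    """Lowercase a token; strip edge punctuation unless it is an abbreviation."""
    lowered = token.lower()
    if lowered in ABBREVIATIONS:
        return lowered
    return lowered.strip(".,;:!?\"()")


def count_bigrams(text):
    """Count adjacent token pairs within each sentence (assumed behaviour)."""
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = [t for t in (normalize(tok) for tok in sentence) if t]
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

For example, `count_bigrams("Prof. Jonas skaito knygą. Jonas skaito laikraštį.")` counts the pair `("jonas", "skaito")` twice, keeps `"prof."` intact as an abbreviation, and produces no pair spanning the sentence boundary.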

Publisher: Baltic Institute of Advanced Technology; Vytautas Magnus University

Subject(s): n-grams Lithuanian

Collection(s): CLARIN-LT

Files in this item

This item is publicly available and licensed under: PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT

Name: 2gram.zip
Size: 89.93 MB
Format: application/zip
Description: Lithuanian 2gram dataset