dc.contributor.author |
Ulčar, Matej |
dc.contributor.author |
Robnik-Šikonja, Marko |
dc.date.accessioned |
2020-12-24T06:11:39Z |
dc.date.available |
2020-12-24T06:11:39Z |
dc.date.issued |
2020-12-24 |
dc.identifier.uri |
http://hdl.handle.net/20.500.11821/42 |
dc.description |
Trilingual BERT-like (Bidirectional Encoder Representations from Transformers) model, trained on Lithuanian, Latvian, and English data. A state-of-the-art tool representing words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. LitLat BERT is distributed as neural network weights and configuration files in PyTorch format (i.e., to be used with the PyTorch library).
The corpora used for training the model have 4.07 billion tokens in total, of which 2.32 billion are English, 1.21 billion are Lithuanian and 0.53 billion are Latvian.
LitLat BERT is based on the XLM-RoBERTa model and comes in two versions: one for use with the transformers library (https://github.com/huggingface/transformers) and one for use with the fairseq library (https://github.com/pytorch/fairseq). More information is available in readme.txt. |
dc.language.iso |
lit |
dc.language.iso |
lav |
dc.language.iso |
eng |
dc.publisher |
University of Ljubljana, Faculty of Computer and Information Science |
dc.rights |
PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT |
dc.rights.uri |
https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm |
dc.rights.label |
PUB |
dc.source.uri |
https://embeddia.eu |
dc.subject |
BERT |
dc.subject |
RoBERTa |
dc.subject |
embeddings |
dc.subject |
multilingual |
dc.subject |
model |
dc.subject |
contextual embeddings |
dc.subject |
word embeddings |
dc.title |
LitLat BERT |
dc.type |
toolService |
metashare.ResourceInfo#ContentInfo.detailedType |
tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent |
true |
has.files |
yes |
branding |
CLARIN-LT |
contact.person |
Matej Ulčar matej.ulcar@fri.uni-lj.si University of Ljubljana, Faculty of Computer and Information Science |
sponsor |
European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds |
files.size |
1969610215 |
files.count |
3 |