Show simple item record

 
dc.contributor.author Ulčar, Matej
dc.contributor.author Robnik-Šikonja, Marko
dc.date.accessioned 2020-12-24T06:11:39Z
dc.date.available 2020-12-24T06:11:39Z
dc.date.issued 2020-12-24
dc.identifier.uri http://hdl.handle.net/20.500.11821/42
dc.description Trilingual BERT-like (Bidirectional Encoder Representations from Transformers) model, trained on Lithuanian, Latvian, and English data. It is a state-of-the-art tool that represents words/tokens as contextually dependent word embeddings and is used for various NLP classification tasks by fine-tuning the model end-to-end. LitLat BERT is distributed as neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library). The corpora used for training the model contain 4.07 billion tokens in total, of which 2.32 billion are English, 1.21 billion are Lithuanian and 0.53 billion are Latvian. LitLat BERT is based on the XLM-RoBERTa model and comes in two versions: one for use with the transformers library (https://github.com/huggingface/transformers) and one for use with the fairseq library (https://github.com/pytorch/fairseq); short usage sketches for both versions are given below. More information is available in readme.txt.
dc.language.iso lit
dc.language.iso lav
dc.language.iso eng
dc.publisher University of Ljubljana, Faculty of Computer and Information Science
dc.rights PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT
dc.rights.uri https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm
dc.rights.label PUB
dc.source.uri https://embeddia.eu
dc.subject BERT
dc.subject RoBERTa
dc.subject embeddings
dc.subject multilingual
dc.subject model
dc.subject contextual embeddings
dc.subject word embeddings
dc.title LitLat BERT
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN-LT
contact.person Matej Ulčar matej.ulcar@fri.uni-lj.si University of Ljubljana, Faculty of Computer and Information Science
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds
files.size 1969610215
files.count 3
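
The following is a minimal usage sketch for the transformers version, assuming LitLat-BERT_transformers.tar.gz (listed under Files below) has been extracted to a local directory ./litlat-bert; the directory name and the example sentence are assumptions, not part of the distribution.

    # Minimal sketch: load the extracted transformers-compatible model and
    # predict a masked token. The path ./litlat-bert is an assumed extraction
    # location; adjust it to wherever the archive was unpacked.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_dir = "./litlat-bert"  # assumed local path
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForMaskedLM.from_pretrained(model_dir)

    text = f"Vilnius is the capital of {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Take the highest-scoring vocabulary item at the masked position.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    predicted_id = logits[0, mask_pos].argmax(dim=-1)
    print(tokenizer.decode(predicted_id))

For fine-tuning on a classification task, the same directory can be loaded with AutoModelForSequenceClassification instead of AutoModelForMaskedLM.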


 Files in this item

This item is Publicly Available and licensed under: PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT

Name: readme.txt
Size: 2.13 KB
Format: Text file
Description: Readme file, with additional information.

Name: LitLat-BERT_transformers.tar.gz
Size: 341.8 MB
Format: application/gzip
Description: Compressed folder with the transformers-compatible model, subword vocabulary and configuration files.

Name: LitLat-BERT_fairseq.tar.gz
Size: 1.5 GB
Format: application/gzip
Description: Compressed folder with the fairseq-compatible model, subword vocabulary and configuration files.
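
A corresponding minimal sketch for the fairseq version, assuming LitLat-BERT_fairseq.tar.gz has been extracted to ./litlat-bert-fairseq and that the checkpoint file inside is named model.pt (both names are assumptions; check readme.txt for the actual layout):

    # Minimal sketch: load the fairseq-compatible model and extract contextual
    # embeddings for one sentence. Paths and file names are assumed.
    from fairseq.models.roberta import XLMRModel

    litlat = XLMRModel.from_pretrained("./litlat-bert-fairseq", checkpoint_file="model.pt")
    litlat.eval()  # disable dropout for inference

    tokens = litlat.encode("Riga is the capital of Latvia.")
    features = litlat.extract_features(tokens)  # (1, sequence_length, hidden_size)
    print(features.shape)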
