LitLat BERT is an XLM-RoBERTa-based model trained on three languages: Lithuanian, Latvian, and English. The corpora used for training the model contain 4.07 billion tokens in total (full word tokens, not subword tokens): the English corpora have 2.32 billion tokens, the Lithuanian corpora 1.21 billion tokens, and the Latvian corpora 0.53 billion tokens. The model has 12 transformer layers, each with a hidden size of 768.

It was trained as a masked language model, that is, the training task was to predict masked tokens in sequences of 512 tokens, with 15% of tokens masked at random. We used whole-word masking, meaning that for words composed of more than one subword token, all tokens of the word were masked together (a short sketch of this procedure is given below). The model was trained with the fairseq toolkit, which is based on the PyTorch library.

Text is tokenized into subword tokens before it is fed to the model, both for training and for usage. We used a SentencePiece model as the tokenizer. The file "dict.txt" lists all the subword tokens. Tokens prefixed with the ▁ symbol are used at the beginning of a word; tokens without this prefix are only used as continuations, i.e., they are necessarily combined with one or more preceding tokens to form a full word (see the tokenization example below). The total vocabulary of the model consists of 84,200 subword tokens. Because the model was trained on sequences of 512 tokens, that is the maximum input length it supports.

For practical usage of the model with the fairseq toolkit, see the examples at https://github.com/pytorch/fairseq/tree/master/examples/camembert and https://github.com/pytorch/fairseq/tree/master/examples/roberta, replacing the path to the model with the path to the downloaded LitLat BERT, and the code sketch below. RobertaModel and CamembertModel (imported from fairseq.models.roberta) seem to work just as well as XLMRobertaModel for LitLat BERT.

For practical usage of the model with the transformers library, see the examples at https://github.com/huggingface/transformers, replacing the path to the model with the path to the downloaded LitLat BERT, or use the shorthand "EMBEDDIA/litlat-bert" to download the model from the Hugging Face hub on the fly (it might be downloaded again each time it is used).
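To make the whole-word masking step concrete, here is a minimal sketch of how subword pieces can be grouped into words via the ▁ prefix and masked together. The grouping follows the ▁ convention described above; the sampling details are simplified assumptions (real masked-LM training also keeps some selected tokens unchanged or replaces them with random tokens), so this is illustrative rather than a reproduction of the actual training code.

```python
import random

def whole_word_mask(pieces, mask_token="<mask>", mask_prob=0.15):
    """Mask whole words in a list of SentencePiece pieces.

    A new word starts at every piece prefixed with ▁; pieces without
    the prefix continue the preceding word. Simplified sketch only.
    """
    # Indices where a new word begins (piece starts with ▁).
    starts = [i for i, p in enumerate(pieces) if p.startswith("▁")]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # treat any leading pieces as one word
    # Slice the piece list into whole words.
    words = [pieces[s:e] for s, e in zip(starts, starts[1:] + [len(pieces)])]

    masked = []
    for word in words:
        if random.random() < mask_prob:
            # Whole-word masking: every piece of the word is masked.
            masked.extend([mask_token] * len(word))
        else:
            masked.extend(word)
    return masked

# Hypothetical segmentation of "Labas rytas" ("Good morning").
print(whole_word_mask(["▁Labas", "▁ryt", "as"]))
```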
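A quick way to see the ▁ convention in practice is to tokenize a sentence with the model's tokenizer (using the transformers library here; the example sentence is arbitrary and the exact segmentation depends on the learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/litlat-bert")
pieces = tokenizer.tokenize("Vilnius yra Lietuvos sostinė.")
print(pieces)
# Word-initial pieces start with ▁; any piece without the prefix
# continues the preceding word.
```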
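Following the fairseq RoBERTa/CamemBERT examples linked above, loading and querying the model might look like the sketch below. The path "/path/to/litlat-bert" is a placeholder for the directory with the downloaded model, and the checkpoint file name and bpe setting are assumptions about the distributed files (the directory is assumed to contain the checkpoint, dict.txt, and the SentencePiece model); the Lithuanian example sentence is only illustrative.

```python
from fairseq.models.roberta import RobertaModel

# Load the downloaded model; adjust checkpoint_file to the actual file name.
litlat = RobertaModel.from_pretrained(
    '/path/to/litlat-bert',
    checkpoint_file='model.pt',
    bpe='sentencepiece',
)
litlat.eval()  # disable dropout for deterministic results

# Fill in a masked word ("Vilnius is the capital of <mask>.").
print(litlat.fill_mask('Vilnius yra <mask> sostinė.', topk=3))

# Extract contextual features (shape: 1 x sequence_length x 768).
tokens = litlat.encode('Labas rytas')
features = litlat.extract_features(tokens)
```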
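With the transformers library, the model can be loaded either from a local path or via the "EMBEDDIA/litlat-bert" shorthand. A minimal masked-word prediction sketch (the example sentence is again illustrative):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# A local path to the downloaded model works here as well.
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/litlat-bert")
model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/litlat-bert")

# Predict the most likely fillers for the masked word.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill(f"Vilnius yra {tokenizer.mask_token} sostinė."))
```

If repeated downloads are a concern, model.save_pretrained() and tokenizer.save_pretrained() can be used to store a local copy and load from that path afterwards.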