Published September 4, 2019 | Version 1.2.0
Software Open

huggingface/pytorch-transformers: DistilBERT, GPT-2 Large, XLM multilingual models, bug fixes

Description

New model architecture: DistilBERT

This release adds Hugging Face's new transformer architecture, DistilBERT, described in "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" by Victor Sanh, Lysandre Debut and Thomas Wolf.

This new model architecture comes with two pretrained checkpoints:

  • distilbert-base-uncased: the base DistilBert model
  • distilbert-base-uncased-distilled-squad: DistilBert model fine-tuned with distillation on SQuAD.
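
As a rough sketch, the two checkpoints above might be loaded as follows. It assumes the DistilBertTokenizer, DistilBertModel and DistilBertForQuestionAnswering classes exported by pytorch-transformers 1.2.0. Pairing the SQuAD checkpoint with the question-answering head is an illustrative choice, and CHECKPOINT_CLASSES and load_distilbert are hypothetical names, not part of the library.

```python
# Hypothetical mapping from shortcut name to the model class used to load it.
# Pairing the SQuAD checkpoint with the QA head is an assumption for this sketch.
CHECKPOINT_CLASSES = {
    "distilbert-base-uncased": "DistilBertModel",
    "distilbert-base-uncased-distilled-squad": "DistilBertForQuestionAnswering",
}


def load_distilbert(shortcut_name):
    """Return (tokenizer, model) for one of the two DistilBERT checkpoints."""
    if shortcut_name not in CHECKPOINT_CLASSES:
        raise ValueError("unknown DistilBERT checkpoint: %s" % shortcut_name)

    # Imported lazily so that the mapping above can be inspected
    # even when pytorch-transformers is not installed.
    import pytorch_transformers as pt

    tokenizer = pt.DistilBertTokenizer.from_pretrained(shortcut_name)
    model_class = getattr(pt, CHECKPOINT_CLASSES[shortcut_name])
    model = model_class.from_pretrained(shortcut_name)
    model.eval()  # disable dropout for inference
    return tokenizer, model
```

Calling load_distilbert("distilbert-base-uncased") downloads and caches the weights on first use, so it needs network access.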

An awaited new pretrained checkpoint: GPT-2 large (774M parameters)

The third OpenAI GPT-2 checkpoint (GPT-2 large) is available in the library under the shortcut name gpt2-large: 774M parameters, 36 layers, and 20 heads.
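
The 774M figure can be sanity-checked with a short back-of-the-envelope count. The sketch below is not taken from the library. It assumes a hidden size of 1280, a vocabulary of 50257 tokens and 1024 positions (values from the published GPT-2 large configuration, not stated in these notes), and output weights tied to the token embeddings. The function name gpt2_parameter_count is hypothetical.

```python
def gpt2_parameter_count(n_layer, d_model, vocab_size=50257, n_positions=1024):
    """Count weights and biases of a GPT-2 style decoder with tied output embeddings."""
    token_embeddings = vocab_size * d_model
    position_embeddings = n_positions * d_model

    # Attention: fused query/key/value projection plus output projection.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to 4 * d_model, then project back.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)

    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d_model
    return token_embeddings + position_embeddings + n_layer * per_layer + final_layer_norm


if __name__ == "__main__":
    total = gpt2_parameter_count(n_layer=36, d_model=1280)  # assumed hidden size
    print(total)       # 774030080, i.e. about 774M
    print(1280 // 20)  # 64 dimensions per attention head with 20 heads
```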

New XLM multilingual pretrained checkpoints in 17 and 100 languages

We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.

New dependency: sacremoses

Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). XLM word tokenization now relies on sacremoses, a Python port of the Moses tokenizer, truecaser and normalizer by @alvations.

For a few languages (Thai, Japanese and Chinese), the XLM tokenizer requires additional dependencies. These dependencies are optional at the library level: using the XLM tokenizer in one of these languages without the corresponding dependency raises an error with installation instructions. The optional dependencies are:

  • pythainlp: Thai tokenizer
  • kytea: Japanese tokenizer, a wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
  • jieba: Chinese tokenizer *

* XLM used the Stanford Segmenter. However, its wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead and will be deprecated. Jieba is much faster and pip-installable, but its output does not always match the Stanford Segmenter's. A possible workaround is an argument that lets users segment sentences themselves and bypass the segmenter. For reference, nltk.tokenize.stanford_segmenter is also included in this PR.
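
Before tokenizing Thai, Japanese or Chinese text, it can help to check which of these optional packages are present. The following is a minimal sketch using only the standard library. The table OPTIONAL_XLM_DEPS and the function missing_xlm_dependencies are hypothetical helpers, and the import name Mykytea for the KyTea wrapper is an assumption that should be verified against the installed package.

```python
import importlib.util

# language code -> (package name from the list above, importable module name)
# The module name "Mykytea" is an assumption about the KyTea Python wrapper.
OPTIONAL_XLM_DEPS = {
    "th": ("pythainlp", "pythainlp"),
    "ja": ("kytea", "Mykytea"),
    "zh": ("jieba", "jieba"),
}


def missing_xlm_dependencies(languages):
    """Return {language: package} for optional tokenizers that cannot be imported."""
    missing = {}
    for lang in languages:
        if lang not in OPTIONAL_XLM_DEPS:
            continue  # no extra dependency listed for this language
        package, module = OPTIONAL_XLM_DEPS[lang]
        if importlib.util.find_spec(module) is None:
            missing[lang] = package
    return missing


if __name__ == "__main__":
    for lang, package in missing_xlm_dependencies(["en", "th", "ja", "zh"]).items():
        print("language %r needs the optional package %r" % (lang, package))
```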

Bug fixes and improvements to the library modules

  • The Bertology script has seen major improvements (@tuvuumass)
  • Iterative tokenization is now faster and accepts an arbitrary number of added tokens (@samvelyan)
  • Added RoBERTa to AutoModels and AutoTokenizers (@LysandreJik)
  • Added GPT-2 Large 774M model (@thomwolf)
  • Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (@LysandreJik @thomwolf)
  • Multi-GPU training has been patched (@FeiWang96)
  • Scripts are updated to reflect PyTorch 1.1.0 changes (scheduler, optimizer) (@Morizeyao, @adai183)
  • Updated the in-depth BERT fine-tuning scripts to pytorch-transformers (@Morizeyao)
  • Models with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (@LysandreJik @thomwolf)
  • Add proxies and force_download options to the from_pretrained() method, so that downloads can go through proxies and cached models/tokenizers can be refreshed (@thomwolf)
  • Add a shortcut to each special token with _id properties (e.g. tokenizer.cls_token_id for the id of tokenizer.cls_token in the vocabulary) (@thomwolf)
  • Fix the GPT-2 and RoBERTa tokenizers so that sentences to be tokenized always begin with at least one space (see note by fairseq authors) (@thomwolf)
  • Fix and clean up byte-level BPE tests (@thomwolf)
  • Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests (@LysandreJik)
  • Fix a warning raised when the decode method is called for a model with no sep_token, like GPT-2 (@LysandreJik)
  • Updated the tokenizers saving method (@boy2000-007man)
  • SpaCy tokenizers have been updated in the tokenizers (@GuillemGSubies)
  • Stable EnvironmentErrors have been added to utility files (@abhishekraok)
  • Fixed distributed barrier hang (@VictorSanh)
  • Encoding functions now return the input tokens instead of throwing an error when not implemented in the child class (@LysandreJik)
  • Change layer norm code to PyTorch's native layer norm (@dhpollack)
  • Improve tokenization of XLM for multilingual inputs (@shijie-wu)
  • Add language input and access to language to id conversion in XLM tokenizer (@thomwolf)
  • Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (@thomwolf)
  • Added new AutoModels: AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering (@LysandreJik)
  • Torch.hub is now based on AutoModels (@LysandreJik @thomwolf)
  • Fix Transformer-XL attention mask dtype to be bool (@CrafterKolyan)
  • Adding DistilBert model architecture and checkpoints (@VictorSanh @LysandreJik @thomwolf)
  • Fixes to DistilBert configuration and training script (@stefan-it)
  • Fix XLNet attention mask for fp16 (@ziliwang)
  • Documentation auto-deploy (@LysandreJik)
  • Fix to add a tuple of tokens (@epwalsh)
  • Update fp16 apex implementation in scripts (@anhnt170489)
  • Fix XLNet bias resizing when adding/removing tokens (@LysandreJik)
  • Fix tokenizer reloading in example scripts (@rabeehk)
  • Fix byte-level decoding error when using added tokens (@thomwolf @LysandreJik)
  • Fix epsilon value in RoBERTa pretrained checkpoints (@julien-c)
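
To illustrate why the GPT-2 and RoBERTa tokenizer fix in the list above matters: byte-level BPE attaches the preceding space to each word, so a word at the very start of a sentence is split differently from the same word in the middle unless a space is prepended. The sketch below uses a simplified stand-in for the pre-tokenization pattern (the real one also handles contractions and Unicode letter classes). The names PRETOKENIZE, pre_tokenize and with_leading_space are hypothetical, not the library's code.

```python
import re

# Simplified approximation of a byte-level BPE pre-tokenization pattern:
# an optional leading space is kept attached to the word that follows it.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_tokenize(text):
    """Split text into chunks, keeping each leading space with its word."""
    return PRETOKENIZE.findall(text)


def with_leading_space(text):
    """Make sure the text begins with at least one space."""
    return text if text.startswith(" ") else " " + text


if __name__ == "__main__":
    print(pre_tokenize("Hello world"))                      # ['Hello', ' world']
    print(pre_tokenize(with_leading_space("Hello world")))  # [' Hello', ' world']
```

With the leading space, the first word yields the same chunk (' Hello') that it would in the middle of a sentence, so it is looked up under the same vocabulary entry.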

Files

huggingface/pytorch-transformers-1.2.0.zip (794.5 kB)
md5:4f217f6ee688e4642893b5669a665420

Additional details