huggingface/pytorch-transformers: DistilBERT, GPT-2 Large, XLM multilingual models, bug fixes
Creators
- Thomas Wolf (1)
- Lysandre Debut (2)
- Victor SANH (1)
- Denis
- Matt
- Grégory Châtel (3)
- Julien Chaumond (2)
- Tim Rault (1)
- Catalin Voss (4)
- Fei Wang (5)
- Malte Pietsch (6)
- Davide Fiocco
- dhanajitb
- Stefan Schweter
- Ananya Harsh Jha
- yzy5630
- Yongbo Wang (7)
- Shijie Wu
- Guillem García Subies
- Weixin Wang
- Zeyao Du
- Chi-Liang, Liu (8)
- Nikolay Korolev (9)
- Joel Grus (10)
- Jade Abbott (11)
- David Pollack (12)
- matej-svejda
- Clement (1)
- Ailing
- Abhishek Rao (13)
- 1. @huggingface
- 2. Hugging Face
- 3. DisAItek & Intel AI Innovators
- 4. Stanford University
- 5. @ShannonAI
- 6. deepset
- 7. Red Hat
- 8. @ntu-spml-lab @Yoctol
- 9. @JetBrains
- 10. @allenai
- 11. @RetroRabbit
- 12. i2x
- 13. @microsoft
Description
New model architecture: DistilBERT
This release adds Hugging Face's new transformer architecture, DistilBERT, described in "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" by Victor Sanh, Lysandre Debut and Thomas Wolf.
This new model architecture comes with two pretrained checkpoints:
- distilbert-base-uncased: the base DistilBert model
- distilbert-base-uncased-distilled-squad: DistilBert model fine-tuned with distillation on SQuAD.
The third OpenAI GPT-2 checkpoint (GPT-2 large) is available in the library under the shortcut name gpt2-large: 774M parameters, 36 layers, and 20 heads.
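As a rough sanity check on the 774M figure, the parameter count of a GPT-2 style decoder can be estimated from its shape. The sketch below is an illustration, not code from the library. The hidden size (1280), vocabulary size (50257), and context length (1024) are assumptions that are not stated in these release notes; they are taken from the published GPT-2 large configuration. The output projection is assumed to share weights with the token embedding.

```python
def gpt2_parameter_count(n_layer, d_model, vocab_size=50257, n_positions=1024):
    """Estimate the parameter count of a GPT-2 style decoder-only transformer.

    Assumes a 4x feed-forward expansion and an output layer tied to the
    token embedding (so it adds no extra parameters).
    """
    # Weights per block: attention QKV (3*d^2) + attention output (d^2)
    # + feed-forward in (4*d^2) + feed-forward out (4*d^2) = 12*d^2.
    # Biases (9*d) and two layer norms (4*d) add 13*d per block.
    per_layer = 12 * d_model**2 + 13 * d_model
    # Token embeddings plus learned position embeddings.
    embeddings = (vocab_size + n_positions) * d_model
    # Final layer norm: scale and bias.
    final_layer_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_layer_norm


# Shape assumed for gpt2-large: 36 layers, 20 heads, hidden size 1280.
total = gpt2_parameter_count(n_layer=36, d_model=1280)
head_dim = 1280 // 20  # width of each attention head

print(f"{total:,} parameters (~{total / 1e6:.0f}M), head dimension {head_dim}")
```

Under these assumptions the estimate comes out at about 774M, in line with the figure quoted for gpt2-large.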
We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.
New dependency: sacremoses
Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). We now rely on sacremoses, a Python port of the Moses tokenizer, truecaser and normalizer by @alvations, for XLM word tokenization.
In a few languages (Thai, Japanese and Chinese), the XLM tokenizer requires additional dependencies. These dependencies are optional at the library level: using the XLM tokenizer in these languages without them raises an error with installation instructions. The optional dependencies are:
- pythainlp: Thai tokenizer
- kytea: Japanese tokenizer, a wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
- jieba: Chinese tokenizer *
* XLM used the Stanford Segmenter. However, the wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead, and it will be deprecated. Jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument that lets users segment the sentence themselves and bypass the segmenter. As a reference, I also include nltk.tokenize.stanford_segmenter in this PR.
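The optional-dependency behaviour described above (import only when a language needs it, and fail with installation instructions otherwise) can be sketched with a small helper. This is a generic illustration of the pattern, not the library's actual implementation; `require_optional` and `tokenize_chinese` are hypothetical names introduced here.

```python
import importlib


def require_optional(module_name, pip_name=None, purpose=""):
    """Import an optional dependency, or fail with installation instructions.

    Hypothetical helper for illustration; not part of pytorch-transformers.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        pip_name = pip_name or module_name
        reason = f" {purpose}" if purpose else ""
        raise ImportError(
            f"{module_name} is required{reason}. "
            f"Install it with: pip install {pip_name}"
        ) from exc


def tokenize_chinese(text):
    # jieba is imported lazily, so users who never tokenize Chinese text
    # do not need it installed.
    jieba = require_optional("jieba", purpose="for Chinese word segmentation")
    return list(jieba.cut(text))
```

Deferring the import to the call site keeps the base install light: the error only appears when a user actually tokenizes text in a language that needs the extra package.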
Bug fixes and improvements to the library modules
- Bertology script has seen major improvements (@tuvuumass)
- Iterative tokenization is now faster and accepts an arbitrary number of added tokens (@samvelyan)
- Added RoBERTa to AutoModels and AutoTokenizers (@LysandreJik)
- Added GPT-2 Large 774M model (@thomwolf)
- Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (@LysandreJik @thomwolf)
- Multi-GPU training has been patched (@FeiWang96)
- Scripts are updated to reflect PyTorch 1.1.0 changes (scheduler, optimizer) (@Morizeyao, @adai183)
- Updated the in-depth BERT fine-tuning scripts to pytorch-transformers (@Morizeyao)
- Models with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (@LysandreJik @thomwolf)
- Add proxies and force_download options to the from_pretrained() method to be able to use proxies and update cached models/tokenizers (@thomwolf)
- Add a shortcut to each special token with _id properties (e.g. tokenizer.cls_token_id for the id in the vocabulary of tokenizer.cls_token) (@thomwolf)
- Fix GPT2 and RoBERTa tokenizers so that sentences to be tokenized always begin with at least one space (see note by fairseq authors) (@thomwolf)
- Fix and clean up byte-level BPE tests (@thomwolf)
- Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests (@LysandreJik)
- Fix a warning raised when the decode method is called for a model with no sep_token like GPT-2 (@LysandreJik)
- Updated the tokenizers saving method (@boy2000-007man)
- SpaCy tokenizers have been updated in the tokenizers (@GuillemGSubies)
- Stable EnvironmentErrors have been added to utility files (@abhishekraok)
- Fixed distributed barrier hang (@VictorSanh)
- Encoding functions now return the input tokens instead of throwing an error when not implemented in child class (@LysandreJik)
- Change layer norm code to PyTorch's native layer norm (@dhpollack)
- Improve tokenization of XLM for multilingual inputs (@shijie-wu)
- Add language input and access to language to id conversion in XLM tokenizer (@thomwolf)
- Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (@thomwolf)
- Added new AutoModels: AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering (@LysandreJik)
- Torch.hub is now based on AutoModels (@LysandreJik @thomwolf)
- Fix Transformer-XL attention mask dtype to be bool (@CrafterKolyan)
- Adding DistilBert model architecture and checkpoints (@VictorSanh @LysandreJik @thomwolf)
- Fixes to DistilBert configuration and training script (@stefan-it)
- Fix XLNet attention mask for fp16 (@ziliwang)
- Documentation auto-deploy (@LysandreJik)
- Fix to add a tuple of tokens (@epwalsh)
- Update fp16 apex implementation in scripts (@anhnt170489)
- Fix XLNet bias resizing when adding/removing tokens (@LysandreJik)
- Fix tokenizer reloading in example scripts (@rabeehk)
- Fix byte-level decoding error when using added tokens (@thomwolf @LysandreJik)
- Fix epsilon value in RoBERTa pretrained checkpoints (@julien-c)
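To make the _id shortcut properties mentioned in the list above more concrete, here is a toy sketch of the idea: each special token attribute gets a matching read-only property that looks the token up in the vocabulary. `ToyTokenizer` is a hypothetical stand-in written for this illustration and is not the library's tokenizer class. Returning None for an undefined special token is an assumption of this sketch, not documented library behaviour.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not the library's class."""

    def __init__(self, vocab, cls_token="[CLS]", sep_token=None):
        self.vocab = dict(vocab)    # token string -> integer id
        self.cls_token = cls_token  # special token stored as a string
        self.sep_token = sep_token  # None for models without a separator

    def _token_to_id(self, token):
        # Assumption of this sketch: an undefined special token maps to None.
        if token is None:
            return None
        return self.vocab[token]

    @property
    def cls_token_id(self):
        # Shortcut: id in the vocabulary of self.cls_token.
        return self._token_to_id(self.cls_token)

    @property
    def sep_token_id(self):
        # Shortcut: id in the vocabulary of self.sep_token.
        return self._token_to_id(self.sep_token)


tokenizer = ToyTokenizer({"[PAD]": 0, "[CLS]": 1, "hello": 2})
print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.sep_token, tokenizer.sep_token_id)
```

Because the properties derive the id from the token string on each access, changing the token attribute keeps the id in sync without a second assignment.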
Files
Name | Size |
---|---|
huggingface/pytorch-transformers-1.2.0.zip (md5:4f217f6ee688e4642893b5669a665420) | 794.5 kB |
Additional details
Related works
- Is supplement to
- https://github.com/huggingface/pytorch-transformers/tree/1.2.0 (URL)