huggingface/transformers: Trainer, TFTrainer, Multilingual BART, Encoder-decoder improvements, Generation Pipeline
Creators
- Thomas Wolf [1]
- Lysandre Debut [2]
- Julien Chaumond [2]
- Victor Sanh [1]
- Patrick von Platen
- Aymeric Augustin [3]
- Rémi Louf
- Morgan Funtowicz [4]
- Sam Shleifer [5]
- Stefan Schweter
- Manuel Romero
- Denis
- erenup
- Matt
- Piero Molino
- Grégory Châtel [6]
- Bram Vanroy [7]
- Tim Rault [1]
- Gunnlaugur Thor Briem [8]
- Anthony Moi [2]
- Malte Pietsch [9]
- Catalin Voss [10]
- Bilal Khan
- Fei Wang [11]
- Louis Martin
- Davide Fiocco
- Martin Malmsten
- Lorenzo Ampil [12]
- Husein Zolkepli
- Clement [1]
- 1. @huggingface
- 2. Hugging Face
- 3. @canalplus
- 4. HuggingFace
- 5. Huggingface
- 6. DisAItek & Intel AI Innovators
- 7. @UGent
- 8. Qlik
- 9. deepset
- 10. Stanford University
- 11. University of Southern California
- 12. @thinkingmachines
Description
Trainer & TFTrainer
Version 2.9 introduces a new Trainer class for PyTorch, and its equivalent TFTrainer for TF 2. This allowed us to completely reorganize the example scripts for a cleaner codebase.
The main features of the Trainer are:
- Same user-facing API for PyTorch and TF 2
- Support for CPU, GPU, Multi-GPU, and TPU
- Easier than ever to share your fine-tuned models
The TFTrainer was largely contributed by awesome community member @jplu! 🔥 🔥
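As a rough illustration of the shared API, here is a minimal PyTorch fine-tuning sketch with the new Trainer. The toy dataset, checkpoint, and hyperparameters are placeholders, and the InputFeatures-based item format follows the v2.9-era default data collator:

```python
from torch.utils.data import Dataset

from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    InputFeatures,
    Trainer,
    TrainingArguments,
)


class ToyDataset(Dataset):
    """Tiny in-memory dataset; items are InputFeatures, which the
    default data collator knows how to batch."""

    def __init__(self, texts, labels, tokenizer):
        self.features = [
            InputFeatures(
                label=label,
                **tokenizer.encode_plus(text, max_length=32, pad_to_max_length=True),
            )
            for text, label in zip(texts, labels)
        ]

    def __len__(self):
        return len(self.features)

    def __getitem__(self, i):
        return self.features[i]


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="./out",
    num_train_epochs=1,
    per_gpu_train_batch_size=2,  # renamed per_device_* in later releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ToyDataset(["great movie", "awful movie"], [1, 0], tokenizer),
)
trainer.train()
trainer.save_model()  # writes weights + config, ready to upload and share
```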
A few additional features of the example scripts are:
- Generate argparsers from type hints on dataclasses (see the sketch after this list)
- Can load arguments from JSON files
- Logging through TensorBoard and wandb
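For instance, an example script can declare its arguments as a dataclass and let HfArgumentParser generate the CLI; the ModelArguments fields and the args.json file below are made up for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class ModelArguments:
    # Each typed field below becomes a typed command-line flag.
    model_name_or_path: str = field(default="bert-base-uncased")
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where to cache pretrained models."}
    )


parser = HfArgumentParser((ModelArguments, TrainingArguments))

# From the command line, e.g.:
#   python run.py --model_name_or_path roberta-base --output_dir ./out
model_args, training_args = parser.parse_args_into_dataclasses()

# ... or equivalently from a JSON file:
# model_args, training_args = parser.parse_json_file("args.json")
```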
Documentation for the Trainer is still work-in-progress, please consider contributing improvements.
TPU Support
- Both the TensorFlow and PyTorch trainers have TPU support (@jplu, @LysandreJik, @julien-c). An additional utility is added so that the TPU scripts may be launched in a manner similar to torch.distributed.
- This was built with the support of @jysohn23, a member of the Google TPU team.
New BART checkpoint converted: this adds the mbart-en-ro model, a BART variant fine-tuned on English-Romanian translation.
huggingface/tokenizers
- Additional tests and support have been added for huggingface/tokenizers tokenizers (@mfuntowicz, @thomwolf)
- TensorFlow models work out-of-the-box with the new tokenizers (@LysandreJik)
Auto-regressive decoding for T5 has been greatly sped up by storing past key/value states. This work was done on both PyTorch and TensorFlow.
Breaking change: the default number of outputs returned by T5Model and T5ForConditionalGeneration increases from 4 to 5, as the output now includes the past_key_value_states.
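For example, a plain generate() call now benefits from the cached states automatically; a minimal sketch (checkpoint and prompt are arbitrary):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer.encode(
    "translate English to German: The house is wonderful.", return_tensors="pt"
)
# generate() caches past key/value states between decoding steps, so each
# step only processes the newest token; no extra arguments are needed.
output_ids = model.generate(input_ids, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```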
Encoder-Decoder enhancements
- Apply the Encoder-Decoder 1.5GB memory savings to TF as well (@patrickvonplaten, a translation of the same work on PyTorch models by @sshleifer)
- The BART summarization fine-tuning script now works for T5 as well (@sshleifer)
- Clean Encoder-Decoder models with a Bart/T5-like API and add generation support (@patrickvonplaten); see the sketch below
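A rough sketch of the cleaned-up API, assuming a BERT-to-BERT model built with EncoderDecoderModel.from_encoder_decoder_pretrained; the cross-attention weights are newly initialized, so the output is not meaningful until the model is fine-tuned:

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tie a pretrained BERT encoder to a pretrained BERT decoder; the
# cross-attention layers are randomly initialized and need fine-tuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

input_ids = tokenizer.encode(
    "A long article that we would like to summarize.", return_tensors="pt"
)
# The Bart/T5-like API means generate() now works on encoder-decoder models.
output_ids = model.generate(
    input_ids, decoder_start_token_id=tokenizer.cls_token_id, max_length=20
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```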
- Question Answering support for Albert and Roberta in TF, with TFAlbertForQuestionAnswering (@Pierrci)
- The question answering pipeline now handles impossible answers (@bryant1410)
- Remove tqdm logging (@mfuntowicz)
- Sentiment analysis pipeline can now handle more than two sequences (@xxbidiao)
- Rewritten batch support in pipelines (@mfuntowicz)
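For instance, with the rewritten batch support the sentiment analysis pipeline accepts any number of sequences in one call; a minimal sketch (this downloads the default checkpoint on first use):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
# Batch support: pass several sequences at once.
results = classifier([
    "I love the new Trainer API!",
    "This release broke my code.",
    "The documentation could be better.",
])
for result in results:
    print(result)  # e.g. {'label': 'POSITIVE', 'score': 0.99}
```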
Implements a text generation pipeline, GenerationPipeline, which works with any ModelWithLMHead.
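A minimal sketch, assuming the pipeline is registered under the "text-generation" task; GPT-2 is an arbitrary model choice here:

```python
from transformers import pipeline

# Any model with a language-modeling head should work.
generator = pipeline("text-generation", model="gpt2")
print(generator("In this release, Transformers", max_length=30))
```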
- Clean the generate testing functions (@patrickvonplaten)
- Notebooks updated in the documentation (@LysandreJik)
- Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (@ethanjperez)
- Fixed RoBERTa conversion script (@myleott)
- Speedup torch summarization tests (@sshleifer)
- Optimize causal mask using torch.where (@Akababa)
- Improved benchmarking utils (@patrickvonplaten)
- Fixed edge case for bert tokenization (@patrickvonplaten)
- SummarizationDataset cleanup (@sshleifer)
- BART: Replace config.output_past with use_cache kwarg (@sshleifer)
- Better documentation for Summarization and Translation pipeline (@julien-c)
- Additional documentation for model cards (@julien-c)
- Fix force_download of files on Windows (@calpt)
- Fix shuffling issue for distributed training (@elk-cloner)
- Shift labels internally within TransfoXLLMHeadModel when called with labels (@TevenLeScao)
- Remove output_past everywhere and replace it with the use_cache argument (@patrickvonplaten)
- Added unit test for run_bart_sum (@sshleifer)
- Cleaner code by factoring a few methods back into PreTrainedModel (@sshleifer)
- [Bert] Remove hard-coded pad token id (@patrickvonplaten)
- Clean pipelines test and remove unnecessary code (@patrickvonplaten)
- JITting is not compatible with PyTorch/XLA or any other framework that requires serialization. The JITted methods were removed (@LysandreJik)
- Change newstest2013 to newstest2014 and clean up (@patrickvonplaten)
- Factor out the tensor conversion method in PreTrainedTokenizer (@sshleifer)
- Remove tanh torch warnings (@aryanshomray)
- Fix token_type_id in BERT question-answering example (@siboehm)
- Add CircleCI workflow to build docs for preview (@harupy)
- Higher tolerance for past testing in T5 and TF T5 (@patrickvonplaten)
- XLM tokenizer should encode with bos token (@LysandreJik, @patrickvonplaten)
- Fix summarization do_predict (@sshleifer)
- Encode to the max length of the input, not the max length of the tokenizer, for batch input (@patrickvonplaten)
- Add qas_id to SquadResult and SquadExample (@jarednielsen)
- Fix bug in run_*.py scripts: double wrapping into DataParallel during eval (@and-kul)
- Fix torchhub integration (@julien-c)
- Fix TFAlbertForSequenceClassification classifier dropout probability (@jarednielsen)
- Change uses of pow(x, 3) to pow(x, 3.0) (@mneilly-et)
- Shuffle train subset for summarization example (@Colanim)
- Removed the boto3 dependency (@julien-c)
- Add DialoGPT training tips (@patrickvonplaten)
- Generation can now start with an empty prompt (@patrickvonplaten)
- GPT-2 is now traceable (@jazzcook15)
- Add known 3rd party to setup.cfg; removes local/circle ci isort discrepancy. (@sshleifer)
- Allow a more backward compatible behavior of max_len_single_sentence and max_len_sentences_pair (@thomwolf)
- Now using CDN urls for weights (@julien-c)
- [Fix common tests on GPU] send model, ids to torch_device (@sshleifer)
- Fix TF input docstrings to refer to tf.Tensor rather than torch.Float (@jarednielsen)
- Additional metadata for training arguments (@parmarsuraj99)
- [ci] Load pretrained models into the default (long-lived) cache (@julien-c)
- Add timeout_decorator to tests (@sshleifer)
- Added XLM-R to the multilingual section in the documentation (@stefan-it)
- Better num_labels in configuration objects
- Updated PyTorch Lightning scripts (@williamFalcon)
- Tests now pass with torch 1.5.0 (@LysandreJik)
- Ensure fast tokenizer can construct single-element tensor without pad token (@mfuntowicz)
Files
| Name | Size |
|---|---|
| huggingface/transformers-v2.9.0.zip (md5:a701429b07dda35436126ae46a802e7d) | 5.1 MB |
Additional details
Related works
- Is supplement to: https://github.com/huggingface/transformers/tree/v2.9.0