Transformers: State-of-the-Art Natural Language Processing

Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Perric; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander M.

Disclaimer: this release is the first release with no Python 3.6 support.


The OPT model was proposed in Open Pre-trained Transformer Language Models by Meta AI. OPT is a series of open-source large causal language models that perform similarly to GPT-3.

  • Add OPT by @younesbelkada in #17088

The FLAVA model was proposed in FLAVA: A Foundational Language And Vision Alignment Model by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela, and was accepted at CVPR 2022.

The paper aims to create a single unified foundation model that works across vision, language, and vision-and-language multimodal tasks.

  • [feat] Add FLAVA model by @apsdehal in #16654

The YOLOS model was proposed in You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. YOLOS proposes to just leverage the plain Vision Transformer (ViT) for object detection, inspired by DETR. It turns out that a base-sized encoder-only Transformer can also achieve 42 AP on COCO, similar to DETR and much more complex frameworks such as Faster R-CNN.

  • Add YOLOS by @NielsRogge in #16848

The RegNet model was proposed in Designing Network Design Spaces by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár.

The authors design search spaces to perform Neural Architecture Search (NAS). They start from a high-dimensional search space and iteratively reduce it by empirically applying constraints based on the best-performing models sampled from the current search space.

  • RegNet by @FrancescoSaverioZuppichini in #16188
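
The iterative narrowing described for RegNet can be illustrated with a small, self-contained sketch. This is not the authors' procedure: the `toy_error` function, the `width` and `depth` parameters, and the "keep the best 20%" rule are all hypothetical stand-ins chosen for illustration. The only idea carried over from the description above is the loop itself: sample from the current space, look at the best performers, and constrain the space accordingly.

```python
import random


def toy_error(width, depth):
    """Hypothetical stand-in for training a model and measuring its error."""
    # Assumed optimum near width=64, depth=16; purely illustrative.
    return abs(width - 64) / 64 + abs(depth - 16) / 16


def refine_space(space, n_samples=200, keep_frac=0.2, rounds=3, seed=0):
    """Iteratively shrink per-parameter ranges around the best sampled configs."""
    rng = random.Random(seed)
    history = [dict(space)]
    for _ in range(rounds):
        # Sample configurations uniformly from the current space.
        samples = [
            {name: rng.randint(lo, hi) for name, (lo, hi) in space.items()}
            for _ in range(n_samples)
        ]
        # Rank by the toy error and keep the best-performing fraction.
        samples.sort(key=lambda cfg: toy_error(cfg["width"], cfg["depth"]))
        best = samples[: max(1, int(n_samples * keep_frac))]
        # Constrain each range to the span covered by the best configs.
        space = {
            name: (min(cfg[name] for cfg in best), max(cfg[name] for cfg in best))
            for name in space
        }
        history.append(dict(space))
    return space, history


initial = {"width": (8, 512), "depth": (2, 64)}
final, history = refine_space(initial)
for step, ranges in enumerate(history):
    print(step, ranges)
```

Each round's ranges are nested inside the previous round's, so the search space can only stay the same or shrink.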

The TAPEX model was proposed in TAPEX: Table Pre-training via Learning a Neural SQL Executor by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. TAPEX pre-trains a BART model to solve synthetic SQL queries, after which it can be fine-tuned to answer natural language questions about tabular data and to perform table fact checking.

  • Add TAPEX by @NielsRogge in #16473
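
To make "solving synthetic SQL queries" concrete, the sketch below builds one (input, target) text pair of the kind such pre-training could use: a SQL query plus a flattened table as input, and the query's execution result as the target. It uses only Python's standard `sqlite3` module. The `flatten_table` and `make_example` helpers and the exact linearization format are assumptions made for illustration, not the TAPEX code or the library's tokenizer.

```python
import sqlite3


def flatten_table(header, rows):
    """Linearize a table into one string (format assumed for illustration)."""
    parts = ["col : " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(parts)


def make_example(header, rows, sql):
    """Execute `sql` on the table and return an (input, target) text pair."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{name}"' for name in header)
    conn.execute(f"CREATE TABLE t ({cols})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    # The target is what a real SQL engine returns; the model learns to imitate it.
    target = ", ".join(str(v) for row in result for v in row)
    source = f"{sql} {flatten_table(header, rows)}"
    return source, target


header = ["year", "city"]
rows = [(1896, "athens"), (1900, "paris"), (1904, "st. louis")]
source, target = make_example(header, rows, "SELECT city FROM t WHERE year = 1900")
print(source)
print(target)
```

Because the targets come from an actual SQL executor, arbitrarily many such pairs can be generated without manual labeling.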

Data2Vec: vision

The Data2Vec model was proposed in data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu and Michael Auli. Data2Vec proposes a unified framework for self-supervised learning across different data modalities - text, audio and images. Importantly, predicted targets for pre-training are contextualized latent representations of the inputs, rather than modality-specific, context-independent targets.

The vision model is added in v4.19.0.

  • [Data2Vec] Add data2vec vision by @patrickvonplaten in #16760
  • Add Data2Vec for Vision in TF by @sayakpaul in #17008

FSDP integration in Trainer

PyTorch recently upstreamed FairScale's FSDP (Fully Sharded Data Parallel) into PyTorch Distributed with additional optimizations. This release integrates it into the Trainer API.

FSDP enables distributed training at scale: it is a wrapper that shards module parameters across data-parallel workers. It is inspired by Xu et al. as well as by ZeRO Stage 3 from DeepSpeed. PyTorch FSDP focuses on production readiness and long-term support, which includes better integration with ecosystems and improvements in performance, usability, reliability, debuggability, and composability.

  • PyTorch FSDP integration in Trainer by @pacman100 in #17136
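
The idea of sharding parameters across data-parallel workers can be shown with a toy, framework-free sketch. This is a conceptual illustration only, not how PyTorch implements FSDP: the `shard_parameters` and `all_gather` helpers are hypothetical names, parameters are plain Python floats, and the "workers" are just list indices. Each worker stores one equal-sized slice of the flattened parameters, and the full set is reassembled only when it is needed.

```python
def shard_parameters(flat_params, world_size):
    """Split a flat parameter list into equal-sized shards, one per worker."""
    shard_size = -(-len(flat_params) // world_size)  # ceiling division
    # Pad so that every worker holds a shard of the same length.
    padding = shard_size * world_size - len(flat_params)
    padded = flat_params + [0.0] * padding
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


def all_gather(shards, original_length):
    """Rebuild the full parameter list from every worker's shard."""
    full = [value for shard in shards for value in shard]
    return full[:original_length]  # drop the padding


params = [0.1 * i for i in range(10)]
shards = shard_parameters(params, world_size=4)
restored = all_gather(shards, len(params))

print([len(s) for s in shards])  # per-worker storage
print(restored == params)
```

With 10 parameters and 4 workers, each worker stores 3 values instead of 10, which is where the memory saving comes from as models grow.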

Training scripts

New example scripts were added for image classification and semantic segmentation. Both now have versions that leverage the Trainer API and Accelerate.

  • Add image classification script, no trainer by @NielsRogge in #16727
  • Add semantic script no trainer, v2 by @NielsRogge in #16788
  • Add semantic script, trainer by @NielsRogge in #16834

Documentation in Spanish

To continue democratizing good machine learning, we're making the Transformers documentation more accessible to non-English speakers, starting with Spanish (572M speakers worldwide).

  • Added es version of language_modeling.mdx doc by @jQuinRivero in #17021
  • Spanish translation of the file philosophy.mdx by @jkmg in #16922
  • Documentation: Spanish translation of fast_tokenizers.mdx by @jloayza10 in #16882
  • Translate index.mdx (to ES) and add Spanish models to quicktour.mdx examples by @omarespejel in #16685
  • Spanish translation of the file multilingual.mdx by @SimplyJuanjo in #16329

Improvements and bugfixes

  • [modeling_utils] rearrange text by @stas00 in #16632
  • Added Annotations for PyTorch models by @anmolsjoshi in #16619
  • Allow the same config in the auto mapping by @sgugger in #16631
  • Update no_trainer scripts with new Accelerate functionalities by @muellerzr in #16617
  • Fix doc example by @NielsRogge in #16448
  • Add inputs vector to calculate metric method by @lmvasque in #16461
  • [megatron-bert-uncased-345m] fix conversion by @stas00 in #16639
  • Remove parent/child tests in auto model tests by @sgugger in #16653
  • Updated _load_pretrained_model_low_mem to check if keys are in the state_dict by @FrancescoSaverioZuppichini in #16643
  • Update Support image on by @BritneyMuller in #16615
  • bert: properly mention deprecation of TF2 conversion script by @stefan-it in #16171
  • add vit tf doctest with @add_code_sample_docstrings by @johko in #16636
  • Fix error in doc of DataCollatorWithPadding by @secsilm in #16662
  • Fix QA sample by @ydshieh in #16648
  • TF generate refactor - Beam Search by @gante in #16374
  • Add tests for no_trainer and fix existing examples by @muellerzr in #16656
  • only load state dict when the checkpoint is not None by @laurahanu in #16673
  • [Trainer] tf32 arg doc by @stas00 in #16674
  • Update audio examples with MInDS-14 by @stevhliu in #16633
  • add a warning in SpmConverter for sentencepiece's model using the byte fallback feature by @SaulLu in #16629
  • Fix some doc examples in task summary by @ydshieh in #16666
  • Jia multi gpu eval by @liyongsea in #16428
  • Generate: min length can't be larger than max length by @gante in #16668
  • fixed crash when deleting older checkpoint and a file f"{checkpoint_prefix}-*" exist by @sadransh in #16686
  • [Doctests] Correct task summary by @patrickvonplaten in #16644
  • Add Doc Test for BERT by @vumichien in #16523
  • Fix t5 shard on TPU Pods by @agemagician in #16527
  • update decoder_vocab_size when resizing embeds by @patil-suraj in #16700
  • Fix TF_MASKED_LM_SAMPLE by @ydshieh in #16698
  • Rename the method test_torchscript by @ydshieh in #16693
  • Reduce memory leak in _create_and_check_torchscript by @ydshieh in #16691
  • Enable more test_torchscript by @ydshieh in #16679
  • Don't push checkpoints to hub in no_trainer scripts by @muellerzr in #16703
  • Private repo TrainingArgument by @nbroad1881 in #16707
  • Handle image_embeds in ViltModel by @ydshieh in #16696
  • Improve PT/TF equivalence test by @ydshieh in #16557
  • Fix example logs repeating themselves by @muellerzr in #16669
  • [Bart] correct doc test by @patrickvonplaten in #16722
  • Add Doc Test GPT-2 by @ArEnSc in #16439
  • Only call get_output_embeddings when tie_word_embeddings is set by @smelm in #16667
  • Update by @raki-1203 in #16652
  • Qdqbert example add benchmark script with ORT-TRT by @shangz-ai in #16592
  • Replace assertion with exception by @anmolsjoshi in #16720
  • Change the chunk_iter function to handle by @Narsil in #16730
  • Remove duplicate header by @sgugger in #16732
  • Moved functions to by @anmolsjoshi in #16625
  • TF: remove set_tensor_by_indices_to_value by @gante in #16729
  • Add Doc Tests for Reformer PyTorch by @hiromu166 in #16565
  • [FlaxSpeechEncoderDecoder] Fix input shape bug in weights init by @sanchit-gandhi in #16728
  • [FlaxWav2Vec2Model] Fix bug in attention mask by @sanchit-gandhi in #16725
  • add Bigbird ONNX config by @vumichien in #16427
  • TF generate: handle case without cache in beam search by @gante in #16704
  • Fix decoding score comparison when using logits processors or warpers by @bryant1410 in #10638
  • [Doctests] Fix all T5 doc tests by @patrickvonplaten in #16646
  • Fix #16660 (tokenizers setters of ids of special tokens) by @davidleonfdez in #16661
  • [from_pretrained] refactor find_mismatched_keys by @stas00 in #16706
  • Add Doc Test for GPT-J by @ArEnSc in #16507
  • Fix and improve CTRL doctests by @jeremyadamsfisher in #16573
  • [modeling_utils] better explanation of ignore keys by @stas00 in #16741
  • CI: setup-dependent pip cache by @gante in #16751
  • Reduce Funnel PT/TF diff by @ydshieh in #16744
  • Add defensive check for config num_labels and id2label by @sgugger in #16709
  • Add self training code for text classification by @tuvuumass in #16738
  • [self-scheduled ci] explain where dependencies are by @stas00 in #16757
  • Fixup no_trainer examples scripts and add more tests by @muellerzr in #16765
  • [Doctest] added doctest changes for electra by @bhadreshpsavani in #16675
  • Enabling Tapex in table question answering pipeline. by @Narsil in #16663
  • [Flax .from_pretrained] Raise a warning if model weights are not in float32 by @sanchit-gandhi in #16762
  • Fix batch size in evaluation loop by @sgugger in #16763
  • Make nightly install dev accelerate by @muellerzr in #16783
  • [deepspeed / m2m_100] make deepspeed zero-3 work with layerdrop by @stas00 in #16717
  • Kill async pushes when calling push_to_hub with blocking=True by @sgugger in #16755
  • Improve image classification example by @NielsRogge in #16585
  • [SpeechEncoderDecoderModel] Fix bug in reshaping labels by @sanchit-gandhi in #16748
  • Fix issue avoid-missing-comma found at by @code-review-doctor in #16768
  • [trainer / deepspeed] fix hyperparameter_search by @stas00 in #16740
  • [modeling utils] revamp from_pretrained(..., low_cpu_mem_usage=True) + tests by @stas00 in #16657
  • Fix PT TF ViTMAE by @ydshieh in #16766
  • Update by @NielsRogge in #16797
  • Pin Jax to last working release by @sgugger in #16808
  • CI: non-remote GH Actions now use a python venv by @gante in #16789
  • TF generate refactor - XLA sample by @gante in #16713
  • Raise error and suggestion when using custom optimizer with Fairscale or Deepspeed by @allanj in #16786
  • Create empty venv on cache miss by @gante in #16816
  • [ViT, BEiT, DeiT, DPT] Improve code by @NielsRogge in #16799
  • [Quicktour Audio] Improve && remove ffmpeg dependency by @patrickvonplaten in #16723
  • fix megatron bert convert state dict naming by @Codle in #15820
  • use base_version to check torch version in torch_less_than_1_11 by @nbroad1881 in #16806
  • Allow passing encoder_ouputs as tuple to EncoderDecoder Models by @jsnfly in #16814
  • Refactor issues with yaml by @LysandreJik in #16772
  • fix _setup_devices in case where there is no torch.distributed package in build by @dlwh in #16821
  • Clean up semantic segmentation tests by @NielsRogge in #16801
  • Fix LayoutLMv2 tokenization docstrings by @qqaatw in #16187
  • Wav2 vec2 phoneme ctc tokenizer optimisation by @ArthurZucker in #16817
  • [Flax] improve large model init and loading by @patil-suraj in #16148
  • Some tests misusing assertTrue for comparisons fix by @code-review-doctor in #16771
  • Type hints added for TFMobileBert by @Dahlbomii in #16505
  • fix seeking text column name twice by @dandelin in #16624
  • Add onnx export of models with a multiple choice classification head by @echarlaix in #16758
  • [ASR Pipeline] Correct init docs by @patrickvonplaten in #16833
  • Add doc about attention_mask on gpt2 by @wiio12 in #16829
  • TF: Add sigmoid activation function by @gante in #16819
  • Correct Logging of Eval metric to Tensorboard by @Jeevesh8 in #16825
  • replace Speech2TextTokenizer by Speech2TextFeatureExtractor in some docstrings by @SaulLu in #16835
  • Type hints added to Speech to Text by @Dahlbomii in #16506
  • Improve test_pt_tf_model_equivalence on PT side by @ydshieh in #16731
  • Add support for bitsandbytes by @manuelciosici in #15622
  • [Typo] Fix typo in modeling utils by @patrickvonplaten in #16840
  • add DebertaV2 fast tokenizer by @mingboiz in #15529
  • Fixing return type tensor with num_return_sequences>1. by @Narsil in #16828
  • [modeling_utils] use less cpu memory with sharded checkpoint loading by @stas00 in #16844
  • [docs] fix url by @stas00 in #16860
  • Fix custom init sorting script by @sgugger in #16864
  • Fix multiproc metrics in no_trainer examples by @muellerzr in #16865
  • Long QuestionAnsweringPipeline fix. by @Narsil in #16778
  • t5: add conversion script for T5X to FLAX by @stefan-it in #16853
  • tiny tweak to allow BatchEncoding.token_to_char when token doesn't correspond to chars by @ghlai9665 in #15901
  • Adding support for array key in raw dictionnaries in ASR pipeline. by @Narsil in #16827
  • Return input_ids in ImageGPT feature extractor by @sgugger in #16872
  • Use ACT2FN to fetch ReLU activation by @eldarkurtic in #16874
  • Fix GPT-J onnx conversion by @ChainYo in #16780
  • Fix doctest list by @ydshieh in #16878
  • New features for CodeParrot training script by @loubnabnl in #16851
  • Add missing entries in mappings by @ydshieh in #16857
  • TF: rework XLA generate tests by @gante in #16866
  • Minor fixes/improvements in convert_file_size_to_int by @mariosasko in #16891
  • Add doc tests for Albert and Bigbird by @vumichien in #16774
  • Add OnnxConfig for ConvBERT by @ChainYo in #16859
  • TF: XLA repetition penalty by @gante in #16879
  • Changes in create_optimizer to support tensor parallelism with SMP by @cavdard in #16880
  • [DocTests] Fix some doc tests by @patrickvonplaten in #16889
  • add bigbird typo fixes by @ChainYo in #16897
  • Fix doc test quicktour dataset by @patrickvonplaten in #16929
  • Add missing ckpt in config docs by @ydshieh in #16900
  • Fix PyTorch RAG tests GPU OOM by @ydshieh in #16881
  • Fix RemBertTokenizerFast by @ydshieh in #16933
  • TF: XLA logits processors - minimum length, forced eos, and forced bos by @gante in #16912
  • TF: XLA Logits Warpers by @gante in #16899
  • added deit onnx config by @rushic24 in #16887
  • TF: XLA stable softmax by @gante in #16892
  • Replace deprecated logger.warn with warning by @sanchit-gandhi in #16876
  • Fix issue probably-meant-fstring found at by @code-review-doctor in #16913
  • Limit the use of PreTrainedModel.device by @sgugger in #16935
  • apply torch int div to layoutlmv2 by @ManuelFay in #15457
  • FIx Iterations for decoder by @agemagician in #16934
  • Add onnx config for RoFormer by @skrsna in #16861
  • documentation: some minor clean up by @mingboiz in #16850
  • Fix RuntimeError message format by @ftnext in #16906
  • use original loaded keys to find mismatched keys by @tricktreat in #16920
  • [Research] Speed up evaluation for XTREME-S by @anton-l in #16785
  • Fix HubertRobustTest PT/TF equivalence test on GPU by @ydshieh in #16943
  • Misc. fixes for Pytorch QA examples: by @searchivarius in #16958
  • [HF Argparser] Fix parsing of optional boolean arguments by @NielsRogge in #16946
  • Fix distributed_concat with scalar tensor by @Yard1 in #16963
  • Update custom_models.mdx by @mishig25 in #16964
  • Fix add-new-model-like when model doesn't support all frameworks by @sgugger in #16966
  • Fix multiple deletions of the same files in save_pretrained by @sgugger in #16947
  • Fixup no_trainer save logic by @muellerzr in #16968
  • Fix doc notebooks links by @sgugger in #16969
  • Fix check_all_models_are_tested by @ydshieh in #16970
  • Add -e flag to some GH workflow yml files by @ydshieh in #16959
  • Update by @datquocnguyen in #16941
  • Update check_models_are_tested to deal with Windows path by @ydshieh in #16973
  • Add parameter --config_overrides for by @conan1024hao in #16961
  • Rename a class to reflect framework pattern AutoModelXxx -> TFAutoModelXxx by @amyeroberts in #16993
  • set eos_token_id to None to generate until max length by @ydshieh in #16989
  • Fix savedir for by epoch by @muellerzr in #16996
  • Update README to latest release by @sgugger in #16997
  • use scale=1.0 in floats_tensor called in speech model testers by @ydshieh in #17007
  • Update all require decorators to use skipUnless when possible by @muellerzr in #16999
  • TF: XLA bad words logits processor and list of processors by @gante in #16974
  • Make create_extended_attention_mask_for_decoder static method by @pbelevich in #16893
  • Update by @tarzanwill in #16977
  • Updating variable names. by @Narsil in #16445
  • Revert "Updating variable names. by @Narsil in #16445)"
  • Replace dict/BatchEncoding instance checks by Mapping by @sgugger in #17014
  • Result of new doc style with fixes by @sgugger in #17015
  • Add a check on config classes docstring checkpoints by @ydshieh in #17012
  • Add translating guide by @omarespejel in #17004
  • update docs of length_penalty by @manandey in #17022
  • [FlaxGenerate] Fix bug in decoder_start_token_id by @sanchit-gandhi in #17035
  • Fx with meta by @michaelbenayoun in #16836
  • [Flax(Speech)EncoderDecoder] Fix bug in decoder_module by @sanchit-gandhi in #17036
  • Fix typo in RetriBERT docstring by @mpoemsl in #17018
  • add torch.no_grad when in eval mode by @JunnYu in #17020
  • Disable Flax GPU tests on push by @sgugger in #17042
  • Clean up vision tests by @NielsRogge in #17024
  • [Trainer] Move logic for checkpoint loading into separate methods for easy overriding by @calpt in #17043
  • Update no_trainer examples to use new logger by @muellerzr in #17044
  • Fix no_trainer examples to properly calculate the number of samples by @muellerzr in #17046
  • Allow all imports from transformers by @LysandreJik in #17050
  • Make the sacremoses dependency optional by @LysandreJik in #17049
  • Clean up by @sgugger in #17045
  • [T5 Tokenizer] Model has no fixed position ids - there is no hardcode… by @patrickvonplaten in #16990
  • [FlaxBert] Add ForCausalLM by @sanchit-gandhi in #16995
  • Move test model folders by @ydshieh in #17034
  • Make Trainer compatible with sharded checkpoints by @sgugger in #17053
  • Remove Python and use v2 action by @sgugger in #17059
  • Fix RNG reload in resume training from epoch checkpoint by @sgugger in #17055
  • Remove device parameter from create_extended_attention_mask_for_decoder by @pbelevich in #16894
  • Fix hashing for deduplication by @thomasw21 in #17048
  • Skip RoFormer ONNX test if rjieba not installed by @lewtun in #16981
  • Remove masked image modeling from BEIT ONNX export by @lewtun in #16980
  • Make sure telemetry arguments are not returned as unused kwargs by @sgugger in #17063
  • Type hint complete Albert model file. by @karthikrangasai in #16682
  • Deprecate model templates by @sgugger in #17062
  • Update to build via git for accelerate by @muellerzr in #17084
  • Allow saved_model export of TFCLIPModel in save_pretrained by @seanmor5 in #16886
  • Fix DeBERTa token_type_ids by @deutschmn in #17082
  • 📝 open fresh PR for pipeline doctests by @stevhliu in #17073
  • minor change on TF Data2Vec test by @ydshieh in #17085
  • Added spanish translation of autoclass_tutorial. by @Duedme in #17069
  • type hints for pytorch models by @robotjellyzone in #17064
  • Add type hints for BERTGeneration by @robsmith155 in #17047
  • Fix MLflowCallback and add support for MLFLOW_EXPERIMENT_NAME by @orieg in #17091
  • Remove torchhub test by @sgugger in #17097
  • fix missing "models" in pipeline test module by @ydshieh in #17090
  • Fix link to example scripts by @stevhliu in #17103
  • Fix self-push CI report path in cat by @ydshieh in #17111
  • Added BigBirdPegasus onnx config by @nandwalritik in #17104
  • split single_gpu and multi_gpu by @ydshieh in #17083
  • LayoutLMv2Processor: ensure 1-to-1 mapping between images and samples in case of overflowing tokens by @ghlai9665 in #17092
  • Add type hints for BigBirdPegasus and Data2VecText PyTorch models by @robsmith155 in #17123
  • add mobilebert onnx configs by @manandey in #17029
  • [WIP] Fix Pyright static type checking by replacing if-else imports with try-except by @d-miketa in #16578
  • Add the auto_find_batch_size capability from Accelerate into Trainer by @muellerzr in #17068
  • Fix MLflowCallback end_run() and add support for tags and nested runs by @orieg in #17130
  • Fix all docs for accelerate install directions by @muellerzr in #17145
  • LogSumExp trick question_answering pipeline. by @Narsil in #17143
  • train args defaulting None marked as Optional by @d-miketa in #17156
  • [trainer] sharded _load_best_model by @stas00 in #17150
  • [Deepspeed] add many more models to the model zoo test by @stas00 in #12695
  • Fixing the output of code examples in the preprocessing chapter by @HallerPatrick in #17162
  • missing file by @stas00 in #17164
  • Add MLFLOW_FLATTEN_PARAMS support in MLflowCallback by @orieg in #17148
  • Fix template init by @sgugger in #17163
  • MobileBERT tokenizer tests by @leondz in #16896
  • [M2M100 doc] remove duplicate example by @patil-suraj in #17175
  • Extend Transformers Trainer Class to Enable PyTorch SGD/Adagrad Optimizers for Training by @jianan-gu in #17154
  • propagate "attention_mask" dtype for "use_past" in OnnxConfig.generate_dummy_inputs by @arampacha in #17105
  • Convert image to rgb for clip model by @hengkuanwee in #17101
  • Add missing RetriBERT tokenizer tests by @mpoemsl in #17017
  • [WIP] Enable reproducibility for distributed trainings by @hasansalimkanmaz in #16907
  • Remove unnecessary columns for all dataset types in Trainer by @Yard1 in #17166
  • Fix LED documentation by @manuelciosici in #17181
  • Ensure tensors are at least 1d for pad and concat by @Yard1 in #17179
  • add shift_tokens_right in FlaxMT5 by @patil-suraj in #17188
  • Remove columns before passing to data collator by @Yard1 in #17187
  • Remove duplicated os.path.join by @shijie-wu in #17192
  • Spanish translation of philosophy.mdx #15947 by @jkmg in #16922
  • Fix style error in Spanish docs by @osanseviero in #17197
  • Fix contents in index.mdx to match docs' sidebar by @omarespejel in #17198
  • ViT and Swin symbolic tracing with torch.fx by @michaelbenayoun in #17182
  • migrate azure blob for beit checkpoints by @donglixp in #16902
  • Update data2vec.mdx to include a Colab Notebook link (that shows fine-tuning) by @sayakpaul in #17194
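
Two entries in the list above mention numerically stable probability computations ("LogSumExp trick question_answering pipeline" and "TF: XLA stable softmax"). The sketch below shows the general log-sum-exp technique in plain Python. It is a generic illustration and not the code from those pull requests; the function names are chosen here for clarity.

```python
import math


def logsumexp(scores):
    """Compute log(sum(exp(s))) without overflow by factoring out the maximum."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def stable_softmax(scores):
    """Softmax computed via log-sum-exp, safe for large-magnitude scores."""
    lse = logsumexp(scores)
    return [math.exp(s - lse) for s in scores]


scores = [1000.0, 1001.0, 1002.0]

# The naive form overflows because exp(1000) exceeds the float range.
try:
    naive = [math.exp(s) for s in scores]
except OverflowError:
    naive = None

probs = stable_softmax(scores)
print(naive)
print(probs, sum(probs))
```

Subtracting the maximum leaves the result mathematically unchanged, since the common factor cancels between numerator and denominator, but keeps every exponent at or below zero.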

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @anmolsjoshi
    • Added Annotations for PyTorch models (#16619)
    • Replace assertion with exception (#16720)
    • Moved functions to (#16625)
  • @vumichien
    • Add Doc Test for BERT (#16523)
    • add Bigbird ONNX config (#16427)
    • Add doc tests for Albert and Bigbird (#16774)
  • @tuvuumass
    • Add self training code for text classification (#16738)
  • @sayakpaul
    • Add Data2Vec for Vision in TF (#17008)
  • @robotjellyzone
    • type hints for pytorch models (#17064)
  • @d-miketa
    • [WIP] Fix Pyright static type checking by replacing if-else imports with try-except (#16578)
    • train args defaulting None marked as Optional (#17156)