Software Open Access

Transformers: State-of-the-Art Natural Language Processing

Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Perric; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander M.

New models

XGLM

The XGLM model was proposed in Few-shot Learning with Multilingual Language Models by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.

XGLM is a GPT3-like multilingual model trained on a balanced corpus covering a diverse set of languages.


ConvNeXT

The ConvNeXT model was proposed in A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.

ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them.


PoolFormer

The PoolFormer model was proposed in MetaFormer is Actually What You Need for Vision by Sea AI Labs.


PLBART

The PLBART model was proposed in Unified Pre-training for Program Understanding and Generation by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.

PLBART is a BART-like model that can be used for code summarization, code generation, and code translation. The pre-trained model plbart-base was trained with a multilingual denoising task on Java, Python and English.


Data2Vec

The Data2Vec model was proposed in data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu and Michael Auli.

Data2Vec proposes a unified framework for self-supervised learning across different data modalities - text, audio and images. Importantly, predicted targets for pre-training are contextualized latent representations of the inputs, rather than modality-specific, context-independent targets.


MaskFormer

The MaskFormer model was proposed in Per-Pixel Classification is Not All You Need for Semantic Segmentation by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.

MaskFormer addresses semantic segmentation with a mask classification paradigm instead of performing classic pixel-level classification.
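To make the mask classification idea concrete, here is a minimal sketch in plain Python of how a set of predicted masks, each with its own class distribution, can be combined into a per-pixel class map. It follows the semantic inference rule described in the MaskFormer paper (score of class c at a pixel = sum over masks of class probability times mask probability), but it is not the library's implementation: the function name and the toy inputs are hypothetical, and the "no object" class is left out for brevity.

```python
def semantic_map_from_masks(class_probs, mask_probs):
    """Combine N mask predictions into a per-pixel class map.

    class_probs: N lists of K class probabilities (one list per predicted mask).
    mask_probs:  N mask-probability grids, each H x W, values in [0, 1].
    Returns an H x W grid of class indices.
    """
    num_classes = len(class_probs[0])
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over masks of p_i(c) * m_i[y][x].
            scores = [
                sum(p[c] * m[y][x] for p, m in zip(class_probs, mask_probs))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        result.append(row)
    return result


# Hypothetical toy input: two predicted masks over a 2 x 2 image, two classes.
class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [
    [[0.9, 0.8], [0.1, 0.2]],  # mask 0 mostly covers the top row
    [[0.1, 0.1], [0.9, 0.7]],  # mask 1 mostly covers the bottom row
]
segmentation = semantic_map_from_masks(class_probs, mask_probs)
print(segmentation)
```

In classic per-pixel classification, each pixel would instead get its own class distribution directly; here the class decision is made once per mask and then spread over the pixels that the mask covers.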

Code in the Hub

This is a new experimental feature added to the library. It allows you to share a custom model (with configuration, tokenizer, feature extractor, processor) with anyone through the Model Hub while still using the Auto-classes API of the Transformers library.

See the documentation for more information!


We are working on updating the existing guides in the documentation, and writing more!

Time Stamps for Speech models

Speech models trained with the CTC loss (Wav2Vec2, XLS-R, HuBERT, WavLM, ...) can now output time stamps in addition to the transcription of the input audio. For example, you can retrieve the start and end time of every transcribed word via the Wav2Vec2CTCTokenizer.decode method or the Wav2Vec2ProcessorWithLM.decode method. See the documentation here and here respectively.

This feature can also be directly used via the ASR pipeline - see here and this example.
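The reason CTC models can provide time stamps is that they emit one prediction per audio frame, so every decoded character can be traced back to a span of frames. The sketch below illustrates that idea with a simplified greedy CTC decoder in plain Python. It is not the library's code: the function name, the toy vocabulary, and the frame ids are hypothetical, and the 20 ms frame duration assumes a Wav2Vec2-style model that produces one frame per 320 samples of 16 kHz audio.

```python
def ctc_word_timestamps(frame_ids, vocab, blank_id, seconds_per_frame, delimiter="|"):
    """Greedy CTC decoding that also returns start/end times per word."""
    # 1. Collapse runs of identical frame predictions, remembering frame spans.
    runs = []  # (token_id, start_frame, end_frame), end is exclusive
    for frame, token_id in enumerate(frame_ids):
        if runs and runs[-1][0] == token_id:
            runs[-1] = (token_id, runs[-1][1], frame + 1)
        else:
            runs.append((token_id, frame, frame + 1))

    # 2. Drop blanks and group the remaining characters into words.
    words, current = [], None
    for token_id, start, end in runs:
        if token_id == blank_id:
            continue
        char = vocab[token_id]
        if char == delimiter:
            if current is not None:
                words.append(current)
                current = None
            continue
        if current is None:
            current = {"word": char, "start_frame": start, "end_frame": end}
        else:
            current["word"] += char
            current["end_frame"] = end
    if current is not None:
        words.append(current)

    # 3. Convert frame indices to seconds.
    return [
        {
            "word": w["word"],
            "start_time": round(w["start_frame"] * seconds_per_frame, 2),
            "end_time": round(w["end_frame"] * seconds_per_frame, 2),
        }
        for w in words
    ]


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
vocab = {0: "<pad>", 1: "|", 2: "h", 3: "i", 4: "y", 5: "o"}
# One predicted id per frame (assumed: 320 samples per frame at 16 kHz = 20 ms).
frame_ids = [0, 2, 2, 0, 3, 1, 1, 0, 4, 5, 5, 0]
timestamps = ctc_word_timestamps(
    frame_ids, vocab, blank_id=0, seconds_per_frame=320 / 16000
)
for entry in timestamps:
    print(entry)
```

With the library itself, the same kind of information is returned by the decode methods named above, so there should be no need to reimplement this; the sketch only shows where the numbers come from.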

Breaking change

Unfortunately, some bugs had crept into CLIPTokenizerFast: the tokenizations produced by CLIPTokenizer and CLIPTokenizerFast were not equal. CLIPTokenizerFast has been corrected to encode text with the same strategy as CLIPTokenizer.

What does this mean for you? You need to use the tokenizer that was used to train the CLIP model you are using. For example:

  • Case 1: you use openai/clip-vit-base-patch32, openai/clip-vit-base-patch16 or openai/clip-vit-large-patch14. Before v4.17.0, the correct tokenizer was CLIPTokenizer. From v4.17.0 onward, you can use either CLIPTokenizer or CLIPTokenizerFast.
  • Case 2: you have trained your own CLIP model using CLIPTokenizerFast. Your tokenizer is no longer a CLIPTokenizerFast, and we recommend that you either load your tokenizer.json directly in a PreTrainedTokenizerFast or keep using a version prior to v4.17.0.
  • Case 3: you have trained your own CLIP model using CLIPTokenizer. You can now produce a fast equivalent of your tokenizer with CLIPTokenizerFast.from_pretrained("Path to local folder or Hub repo with slow tokenizer files", from_slow=True).

To make CLIPTokenizerFast identical to CLIPTokenizer, the template of the tokenization of a sentence pair (A,B) has been modified. The previous template was <|startoftext|> A B <|endoftext|> and the new one is <|startoftext|> A <|endoftext|> <|endoftext|> B <|endoftext|>.
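The difference between the two sentence-pair templates can be spelled out with a small sketch. The helper names below are hypothetical and the word lists are placeholders rather than real CLIP BPE tokens; the code only mirrors the two layouts described above and does not call the library.

```python
BOS = "<|startoftext|>"
EOS = "<|endoftext|>"


def old_pair_template(tokens_a, tokens_b):
    """Sentence-pair layout used by CLIPTokenizerFast before the fix."""
    return [BOS] + tokens_a + tokens_b + [EOS]


def new_pair_template(tokens_a, tokens_b):
    """Sentence-pair layout shared by both tokenizers after the fix."""
    return [BOS] + tokens_a + [EOS, EOS] + tokens_b + [EOS]


# Placeholder tokens for sentences A and B (not real CLIP BPE output).
a = ["a", "photo"]
b = ["of", "cats"]
old_layout = old_pair_template(a, b)
new_layout = new_pair_template(a, b)
print(" ".join(old_layout))
print(" ".join(new_layout))
```

The new layout is two tokens longer for the same pair, which is one reason inputs encoded with the old fast tokenizer are not interchangeable with those produced from v4.17.0 onward.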

What's Changed

Impressive community contributors

The community contributors below have significantly contributed to the v4.17.0 release. Thank you!

  • @sayakpaul, for contributing the TensorFlow version of ConvNext
  • @gchhablani, for contributing PLBart
  • @edugp, for contributing Data2Vec

New Contributors

Full Changelog:
