Published November 5, 2021 | Version v3.2.0
Software
Open
explosion/spaCy: v3.2.0: Registered scoring functions, Doc input, floret vectors and more
Creators
- Ines Montani¹
- Matthew Honnibal¹
- Sofie Van Landeghem²
- Adriane Boyd
- Henning Peters
- Paul O'Leary McCann³
- Maxim Samsonov
- Jim Geovedi
- Jim O'Regan
- György Orosz⁴
- Duygu Altinok⁵
- Søren Lind Kristiansen
- Roman⁶
- Explosion Bot⁵
- Leander Fiedler⁷
- Grégory Howard
- Wannaphong Phatthiyaphaibun⁸
- Yohei Tamura
- Sam Bozek
- murat
- Mark Amery
- Björn Böing⁹
- Pradeep Kumar Tippa
- Leif Uwe Vogelsang
- Bram Vanroy¹⁰
- Ramanan Balakrishnan¹¹
- Vadim Mazaev
- GregDubbin
- 1. Founder @explosion
- 2. Explosion & OxyKodit
- 3. Cotonoha
- 4. LogMeIn, Meltwater
- 5. @explosion
- 6. @kouchtv
- 7. Nord/LB
- 8. @PyThaiNLP
- 9. @codecentric
- 10. @UGent
- 11. @Semantics3
Description
✨ New features and improvements
- NEW: Registered scoring functions for each component in the config.
- NEW: `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- Support for a wider range of language codes in `spacy.blank()` including IETF language tags, for example `fra` for French and `zh-Hans` for Chinese.
- New package `spacy-loggers` for additional loggers.
- New Irish lemmatizer.
- New Portuguese noun chunks and updated Spanish noun chunks.
- Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
- Japanese reading and inflection from `sudachipy` are annotated as `Token.morph` features.
- Additional `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- Trained pipelines:
  - New transformer pipeline for Japanese `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
  - Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez and @TeMU-BSC!
  - Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
  - Universal Dependencies corpora updated to v2.8.
  - Trailing space added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
  - English attribute ruler patterns updated to improve `Token.pos` and `Token.morph`.
For more details, see the New in v3.2 usage guide.
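The `Doc` input support listed above can be illustrated with a minimal sketch. It assumes spaCy v3.2 or later is installed; the blank English pipeline, the `sentencizer` component and the example words are illustrative choices, not part of the release notes.

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline needs no trained model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Custom tokenization: build the Doc directly from a word list.
words = ["Hello", "world", ".", "Pre-tokenized", "input", "works", "."]
doc = Doc(nlp.vocab, words=words)

# Passing a Doc instead of a string skips the tokenizer and runs
# only the pipeline components on the existing tokens.
doc = nlp(doc)

print([token.text for token in doc])
print([sent.text for sent in doc.sents])
```

Because the tokenizer is skipped, the tokens stay exactly as supplied in `words`, while the `sentencizer` still adds sentence boundaries.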
🔴 Bug fixes
- Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
- Fix issue #9032: Retain alignment between doc and context for `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- Fix issue #9136: Ignore prefixes when applying suffix patterns in `Tokenizer`.
⚠️ Backwards incompatibilities
- In the `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- The tokenizer classes `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
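The `store_user_data` behavior of `DocBin` can be sketched as follows. This is a minimal example assuming spaCy v3.2 or later; the `"source"` key, its value and the `roundtrip` helper are hypothetical names introduced only for illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Serialize this text.")
doc.user_data["source"] = "release-notes-example"  # hypothetical key and value

# One DocBin keeps user data, the other drops it.
kept = DocBin(store_user_data=True, docs=[doc])
dropped = DocBin(store_user_data=False, docs=[doc])


def roundtrip(doc_bin):
    """Serialize a DocBin to bytes and load its first Doc back (hypothetical helper)."""
    loader = DocBin(store_user_data=doc_bin.store_user_data)
    loaded = loader.from_bytes(doc_bin.to_bytes())
    return list(loaded.get_docs(nlp.vocab))[0]


doc_kept = roundtrip(kept)
doc_dropped = roundtrip(dropped)

print(doc_kept.user_data.get("source"))
print("source" in doc_dropped.user_data)
```

Whether `doc.user_data` survives the round trip is decided by the `store_user_data` setting of the `DocBin` that serialized the docs.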
📖 Documentation and examples
- Demo projects for floret vectors:
  - `pipeline/floret_vectors_demo`: basic floret vector training and importing.
  - `pipeline/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
  - `pipeline/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
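As background for the floret demos, the idea of combining fastText-style subwords with Bloom embeddings can be sketched in plain Python. This is a conceptual illustration only, not floret's actual hashing, training or file format: the function names, the SHA-256-based hashing, the n-gram range and the placeholder table values are all assumptions made for the example.

```python
import hashlib


def char_ngrams(word, minn=3, maxn=5):
    """Return the wrapped word plus its character n-grams (fastText-style)."""
    wrapped = f"<{word}>"
    grams = [wrapped]
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            gram = wrapped[i:i + n]
            if gram not in grams:
                grams.append(gram)
    return grams


def bloom_rows(key, n_rows, n_hashes=2):
    """Map a key to several rows of one small table (the Bloom embedding idea)."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows


def make_table(n_rows, dim):
    """Deterministic placeholder values standing in for trained weights."""
    return [[((r * 31 + c * 17) % 13 - 6) / 6.0 for c in range(dim)]
            for r in range(n_rows)]


def word_vector(word, table):
    """Average the table rows selected by every n-gram of the word."""
    n_rows, dim = len(table), len(table[0])
    total = [0.0] * dim
    count = 0
    for gram in char_ngrams(word):
        for row in bloom_rows(gram, n_rows):
            for c in range(dim):
                total[c] += table[row][c]
            count += 1
    return [value / count for value in total]


table = make_table(n_rows=50, dim=4)
vec = word_vector("tokenization", table)
print(len(vec))
```

Since every string hashes into the fixed-size table, any word receives a vector, including words never seen before, and the table size is independent of the vocabulary size. That is the sense in which such vectors are compact and give full coverage.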
👥 Contributors
@adrianeboyd, @Avi197, @baxtree, @cayorodriguez, @DuyguA, @fgaim, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker
Files
Name | Size |
---|---|
explosion/spaCy-v3.2.0.zip (md5:df3d3ff0a4b71ffe7c30e9fc27502465) | 10.8 MB |
Additional details
Related works
- Is supplement to
- https://github.com/explosion/spaCy/tree/v3.2.0 (URL)