explosion/spaCy: v3.2.0: Registered scoring functions, Doc input, floret vectors and more

doi:10.5281/zenodo.5648257

Published November 5, 2021 | Version v3.2.0

Software Open


1. Founder @explosion
2. Explosion & OxyKodit
3. Cotonoha
4. LogMeIn, Meltwater
5. @explosion
6. @kouchtv
7. Nord/LB
8. @PyThaiNLP
9. @codecentric
10. @UGent
11. @Semantics3

✨ New features and improvements

NEW: Registered scoring functions for each component in the config.
NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
New overwrite config setting for entity_linker, morphologizer, tagger, sentencizer and senter, controlling whether existing annotation is overwritten.
New extend config setting for morphologizer, controlling whether existing feature types are preserved.
Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
New package spacy-loggers for additional loggers.
New Irish lemmatizer.
New Portuguese noun chunks and updated Spanish noun chunks.
Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
Japanese reading and inflection from sudachipy are annotated as Token.morph features.
Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
LIKE_URL attribute includes the tokenizer URL pattern.
--n-save-epoch option for spacy pretrain.
Trained pipelines:
- New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
- Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez and @TeMU-BSC!
- Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
- Universal Dependencies corpora updated to v2.8.
- Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- English attribute ruler patterns updated to improve Token.pos and Token.morph.

For more details, see the New in v3.2 usage guide.
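
As an illustration of the Doc input feature listed above, here is a minimal sketch that assumes spaCy v3.2 or later is installed. It uses a blank English pipeline so that no trained model is needed. The example words are made up.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: no trained components, nothing to download.
nlp = spacy.blank("en")

# Build a Doc with custom tokenization instead of relying on nlp.tokenizer.
# The words and spacing below are illustrative only.
words = ["New", "York-based", "start-up"]
spaces = [True, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Since v3.2 a Doc can be passed where a string was expected. The tokenizer
# is skipped and the pipeline components run on the Doc as given.
processed = nlp(doc)
print([token.text for token in processed])

# nlp.pipe() accepts Doc objects in the same way.
batch = [
    Doc(nlp.vocab, words=["Hello", "world"]),
    Doc(nlp.vocab, words=["Second", "doc"]),
]
piped = list(nlp.pipe(batch))
for d in piped:
    print([token.text for token in d])
```

The same pattern applies when custom extension attributes need to be set on the Doc before the components run.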

🔴 Bug fixes

Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
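
Because the fix for #9136 changes how prefix and suffix patterns interact, it can be worth checking a few representative strings after upgrading. The sketch below assumes spaCy is installed and uses a blank English pipeline. The helper name show_tokenization and the sample string are illustrative, not part of spaCy. No expected output is shown, because the result depends on the installed version and language defaults.

```python
import spacy

nlp = spacy.blank("en")


def show_tokenization(nlp, text):
    """Print which tokenizer rule produced each token and return the token texts.

    Hypothetical helper for this example; it only wraps existing spaCy calls.
    """
    # Tokenizer.explain returns (rule name, token text) pairs for debugging.
    for rule, piece in nlp.tokenizer.explain(text):
        print(f"{rule:<12} {piece!r}")
    return [token.text for token in nlp(text)]


sample = "°c."  # a case that the affix handling change can affect
tokens = show_tokenization(nlp, sample)
print(tokens)
```

Running this on a sample of your own data under both the old and the new version gives a quick view of any differences.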

⚠️ Backwards incompatibilities

In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
In DocBin, user data is now always serialized according to the store_user_data option, see #9190.
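
To illustrate the DocBin change, the following sketch round-trips a Doc with and without store_user_data. It assumes spaCy v3.2 or later. The user_data key and value are hypothetical.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Serialize me with my user data.")
doc.user_data["source"] = "example-corpus"  # hypothetical key and value

# With store_user_data=True, user data is written and restored.
with_data = DocBin(store_user_data=True)
with_data.add(doc)
restored = list(
    DocBin(store_user_data=True).from_bytes(with_data.to_bytes()).get_docs(nlp.vocab)
)
print(restored[0].user_data.get("source"))

# With store_user_data=False, user data is left out of the round trip.
without_data = DocBin(store_user_data=False)
without_data.add(doc)
plain = list(
    DocBin(store_user_data=False).from_bytes(without_data.to_bytes()).get_docs(nlp.vocab)
)
print("source" in plain[0].user_data)
```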

📖 Documentation and examples

Demo projects for floret vectors:
- pipeline/floret_vectors_demo: basic floret vector training and importing.
- pipeline/floret_fi_core_demo: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
- pipeline/floret_ko_ud_demo: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
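
For readers new to the idea behind floret, here is a toy, pure-Python sketch of how fastText-style subwords can be combined with a Bloom-style hashed table. It is not floret's implementation: the hash function (SHA-256), the n-gram range, the table values and the averaging step are all assumptions chosen to keep the example self-contained.

```python
import hashlib


def char_ngrams(word, min_n=3, max_n=5):
    """fastText-style subwords: the marked word plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def bloom_rows(key, n_rows, n_hashes=2):
    """Map one key to several rows of a small table, one row per hash seed."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows


def build_table(n_rows, dim):
    """Deterministic toy values standing in for trained embedding rows."""
    return [
        [((r * 31 + c * 17) % 13 - 6) / 10.0 for c in range(dim)]
        for r in range(n_rows)
    ]


def word_vector(word, table, n_hashes=2):
    """Average the hashed rows of all subwords, so any string gets a vector."""
    dim = len(table[0])
    total = [0.0] * dim
    grams = char_ngrams(word)
    for gram in grams:
        for row in bloom_rows(gram, len(table), n_hashes):
            for c in range(dim):
                total[c] += table[row][c]
    count = len(grams) * n_hashes
    return [value / count for value in total]


table = build_table(n_rows=50, dim=8)
vector = word_vector("tokenizer", table)
print(len(vector))
```

The table has a fixed number of rows regardless of vocabulary size, which is what makes the vectors compact. Hashing subwords means unseen or misspelled words still receive a vector, which is the full-coverage property mentioned in the feature list.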

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @cayorodriguez, @DuyguA, @fgaim, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

Files

Name	Size
explosion/spaCy-v3.2.0.zip (md5:df3d3ff0a4b71ffe7c30e9fc27502465)	10.8 MB

Additional details

Is supplement to: https://github.com/explosion/spaCy/tree/v3.2.0 (URL)

	All versions	This version
Views	23,835	289
Downloads	820	7
Data volume	16.9 GB	97.0 MB
