explosion/spaCy: v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish

Ines Montani; Matthew Honnibal; Sofie Van Landeghem; Adriane Boyd; Henning Peters; Paul O'Leary McCann; Maxim Samsonov; Jim Geovedi; Jim O'Regan; Duygu Altinok; György Orosz; Søren Lind Kristiansen; Lj Miranda; Daniël de Kok; Roman; Explosion Bot; Leander Fiedler; Grégory Howard; Edward; Wannaphong Phatthiyaphaibun; Yohei Tamura; Sam Bozek; murat; Ryn Daniels; Mark Amery; Björn Böing; Bram Vanroy; Pradeep Kumar Tippa

doi:10.5281/zenodo.6504092

Published April 29, 2022 | Version v3.3.0

Software Open


1. Founder @explosion
2. Explosion & OxyKodit
3. Cotonoha
4. @deepgram
5. LogMeIn, Meltwater
6. @explosion
7. @kouchtv
8. Nord/LB
9. Explosion AI
10. @PyThaiNLP
11. @codecentric
12. @UGent

✨ New features and improvements

Improved speeds for many components (see the speed benchmarks for trained pipelines):
- Speed up parser and NER by using constant-time head lookups (#10048).
- Support unnormalized softmax probabilities in spacy.Tagger.v2 to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
- Speed up parser projectivization functions (#10241).
- Replace Ragged with faster AlignmentArray in Example for training (#10319).
- Improve Matcher speed (#10659).
- Improve serialization speed for empty Doc.spans (#10250).
NEW: A trainable lemmatizer component that uses edit trees to transform tokens into lemmas. Add it to your config with `spacy init config -p trainable_lemmatizer` or by using the quickstart.
Language updates:
- Initial support for Lower Sorbian and Upper Sorbian.
- New noun chunks for Finnish.
- Updated noun chunks for French, Italian and Spanish.
- Additional updates for English, French, Italian, Japanese, Korean, Norwegian, Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
Big endian support with thinc v8.0.14+ and thinc-bigendian-ops.
Config comparisons with `spacy debug diff-config`.
displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
SpanCategorizer.set_candidates for debugging span suggesters.
The quickstart now supports adding spancat and trainable_lemmatizer components.
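
To make the edit-tree idea behind the trainable lemmatizer concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: the names `Subst`, `Match`, `build_tree` and `apply_tree` are hypothetical, and the step that decides which tree to apply to a given token is left out. The sketch only shows how a tree derived from one form/lemma pair can be reused on other forms that follow the same pattern.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass(frozen=True)
class Subst:
    """Leaf node: replace the exact string `orig` with `repl`."""
    orig: str
    repl: str


@dataclass(frozen=True)
class Match:
    """Interior node: keep the middle of the string, recurse on both ends."""
    prefix_len: int
    suffix_len: int
    prefix_tree: Optional["Tree"]
    suffix_tree: Optional["Tree"]


Tree = Union[Subst, Match]


def build_tree(form: str, lemma: str) -> Optional[Tree]:
    """Derive an edit tree that rewrites `form` into `lemma` (illustrative only)."""
    if not form and not lemma:
        return None  # nothing to rewrite on this side
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if m.size == 0:
        # No shared substring: fall back to a plain substitution.
        return Subst(form, lemma)
    return Match(
        prefix_len=m.a,
        suffix_len=len(form) - (m.a + m.size),
        prefix_tree=build_tree(form[:m.a], lemma[:m.b]),
        suffix_tree=build_tree(form[m.a + m.size:], lemma[m.b + m.size:]),
    )


def apply_tree(tree: Optional[Tree], form: str) -> Optional[str]:
    """Apply a tree to a form; return None if the tree does not fit."""
    if tree is None:
        return form  # only reached with an empty string
    if isinstance(tree, Subst):
        return tree.repl if form == tree.orig else None
    if len(form) < tree.prefix_len + tree.suffix_len:
        return None
    split = len(form) - tree.suffix_len
    prefix = apply_tree(tree.prefix_tree, form[:tree.prefix_len])
    suffix = apply_tree(tree.suffix_tree, form[split:])
    if prefix is None or suffix is None:
        return None
    return prefix + form[tree.prefix_len:split] + suffix


# A tree learned from one pair generalizes to forms with the same affixes.
tree = build_tree("walked", "walk")
print(apply_tree(tree, "talked"))   # talk
print(apply_tree(tree, "running"))  # None, the "-ed" suffix is missing
```

Because the tree stores affix lengths and substitutions rather than whole words, a small set of trees can cover many inflected forms.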

📦 Trained pipelines

v3.3 introduces trained pipelines for Finnish, Korean and Swedish, which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

Package	Language	UPOS	Parser LAS	NER F
`fi_core_news_sm`	Finnish	92.5	71.9	75.9
`fi_core_news_md`	Finnish	95.9	78.6	80.6
`fi_core_news_lg`	Finnish	96.2	79.4	82.4
`ko_core_news_sm`	Korean	86.1	65.6	71.3
`ko_core_news_md`	Korean	94.7	80.9	83.1
`ko_core_news_lg`	Korean	94.7	81.3	85.3
`sv_core_news_sm`	Swedish	95.0	75.9	74.7
`sv_core_news_md`	Swedish	96.3	78.5	79.3
`sv_core_news_lg`	Swedish	96.3	79.1	81.1
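
Why hashed subwords leave no out-of-vocabulary words can be illustrated with a small sketch. This is not floret's code: the table size, vector width, hash function and helper names below are all assumptions chosen for illustration, and the table holds random numbers rather than trained values. Each word is split into character n-grams, every n-gram is hashed into a few rows of one fixed-size table, and the word vector is the average of those rows, so any string maps to some vector.

```python
import hashlib
import random
from typing import List

TABLE_ROWS = 1000  # hypothetical: far fewer rows than distinct words
DIM = 4            # hypothetical vector width
HASH_COUNT = 2     # hypothetical: each entry is hashed into two rows

# Stand-in for a trained table: deterministic pseudo-random values.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(TABLE_ROWS)]


def subwords(word: str, minn: int = 3, maxn: int = 5) -> List[str]:
    """Return the boundary-marked word plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def row_for(key: str, seed: int) -> int:
    """Map a string to a table row; md5 is only a stand-in hash here."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % TABLE_ROWS


def vector(word: str) -> List[float]:
    """Average the table rows selected by all subwords and hash seeds."""
    rows = [row_for(g, seed) for g in subwords(word) for seed in range(HASH_COUNT)]
    return [sum(TABLE[r][d] for r in rows) / len(rows) for d in range(DIM)]


# A word never seen before still receives a vector of the usual width.
print(len(vector("epäjärjestelmällisyys")))
```

The table size is fixed up front, which is what keeps the vectors compact regardless of how many distinct words appear in the data.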

🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

Model	v3.2 Lemma Acc	v3.3 Lemma Acc
`da_core_news_md`	84.9	94.8
`de_core_news_md`	73.4	97.7
`el_core_news_md`	56.5	88.9
`fi_core_news_md`	-	86.2
`it_core_news_md`	86.6	97.2
`ko_core_news_md`	-	90.0
`lt_core_news_md`	71.1	84.8
`nb_core_news_md`	76.7	97.1
`nl_core_news_md`	81.5	94.0
`pl_core_news_md`	87.1	93.7
`pt_core_news_md`	76.7	96.9
`ro_core_news_md`	81.8	95.5
`sv_core_news_md`	-	95.5

🔴 Bug fixes

Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
Fix issue #9443: Fix Scorer.score_cats for missing labels.
Fix issue #9669: Fix entity linker batching.
Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
Fix issue #9904: Fix textcat loss scaling.
Fix issue #9956: Compare all Span attributes consistently.
Fix issue #10073: Add "spans" to the output of doc.to_json.
Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
Fix issue #10189: Allow Example to align whitespace annotation.
Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
Fix issue #10324: Fix Tok2Vec for empty batches.
Fix issue #10347: Update basic functionality for rehearse.
Fix issue #10394: Fix Vectors.n_keys for floret vectors.
Fix issue #10400: Use meta in util.load_model_from_config.
Fix issue #10451: Fix Example.get_matching_ents.
Fix issue #10460: Fix initial special cases for Tokenizer.explain.
Fix issue #10521: Stream large assets on download in spaCy projects.
Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
Fix issue #10551: Add automatic vector deduplication for init vectors.

🚀 Notes about upgrading from v3.2

To see the speed improvements for the Tagger architecture, edit your configs to switch from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `spacy init fill-config`.
Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
`Doc.from_docs` now includes `Doc.tensor` by default and supports exclusions through an `exclude` argument in the same format as `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and `user_data`.
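
The span ordering change noted above can be pictured as comparing tuples. The sketch below is an illustration under stated assumptions, not spaCy's code: `SpanKey` is a hypothetical stand-in for a span carrying the four attributes named in the note, and the label and KB ID values are made-up integer IDs. Spans with identical boundaries are ordered by label and then by KB ID.

```python
from typing import NamedTuple


class SpanKey(NamedTuple):
    """Hypothetical stand-in for the attributes that take part in ordering."""
    start: int
    end: int
    label: int  # made-up integer ID, not a readable label string
    kb_id: int  # made-up integer ID


spans = [
    SpanKey(start=0, end=2, label=20, kb_id=0),
    SpanKey(start=0, end=2, label=10, kb_id=0),
    SpanKey(start=0, end=1, label=30, kb_id=0),
    SpanKey(start=0, end=2, label=10, kb_id=5),
]

# Tuples compare field by field: start, then end, then label, then kb_id.
ordered = sorted(spans)
for span in ordered:
    print(span)
```

If existing code relies on a particular order for spans that share boundaries, sorting with an explicit `key` makes that order independent of the comparison rules.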

📖 Documentation and examples

spaCy universe additions:
- classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
- Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
- Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
- EDS-NLP: spaCy components to extract information from clinical notes written in French.
- HuSpaCy: Industrial-strength Hungarian natural language processing.
- Klayers: spaCy as an AWS Lambda Layer.
- Named Entity Recognition (NER) using spaCy (video).
- Scrubadub: Remove personally identifiable information from text using spaCy.
- spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
- tmtoolkit: Text mining and topic modeling toolkit.

👥 Contributors

@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996

Files

explosion/spaCy-v3.3.0.zip (11.0 MB), md5:b3b0b0d7226831a9a1957babcf4d2d4e

Additional details

Is supplement to: https://github.com/explosion/spaCy/tree/v3.3.0 (URL)
