explosion/spaCy: v3.5.0: New CLI commands, language updates, bug fixes and much more

Ines Montani; Matthew Honnibal; Sofie Van Landeghem; Adriane Boyd; Henning Peters; Paul O'Leary McCann; jim geovedi; Jim O'Regan; Maxim Samsonov; György Orosz; Daniël de Kok; Duygu Altinok; Søren Lind Kristiansen; Madeesh Kannan; Raphaël Bournhonesque; Lj Miranda; Peter Baumgartner; Edward; Explosion Bot; Richard Hudson; Raphael Mitsch; Roman; Leander Fiedler; Ryn Daniels; Wannaphong Phatthiyaphaibun; Grégory Howard; Yohei Tamura; Sam Bozek

doi:10.5281/zenodo.7553910

Published January 20, 2023 | Version v3.5.0

Software Open

1. Founder @explosion
2. Explosion & OxyKodit
3. Cotonoha
4. LogMeIn, Meltwater
5. @explosion
6. @deepgram
7. @kouchtv
8. Nord/LB
9. @PyThaiNLP
10. @indeedeng

✨ New features and improvements

NEW: New apply CLI command to annotate new documents with a trained pipeline (#11376).
NEW: New benchmark CLI command to benchmark pipelines. The new benchmark speed subcommand measures the speed of a pipeline, and the benchmark accuracy subcommand is a new alias for evaluate (#11902).
NEW: New find-threshold CLI command to identify an optimal threshold for classification models (#11280).
NEW: New FUZZY Matcher operator for fuzzy matches based on Levenshtein edit distance. In addition, the FUZZY and REGEX operators are now supported in combination with IN/NOT_IN (#11359).
Language updates for Ancient Greek, Dutch, Russian, Slovenian and Ukrainian (#11345, #11162, #11426, #11753, #11811, #11997, more details below).
Allow up to typer v0.7.x (#11720), mypy 0.990 (#11801) and typing_extensions v4.4.x (#12036).
New spacy.ConsoleLogger.v3 with expanded progress tracking (#11972).
Improved scoring behavior for textcat with spacy.textcat_scorer.v2 (#11696 and #11971) and spacy.textcat_multilabel_scorer.v2 (#11820).
Improved customizability of the knowledge base used for entity linking, with the default implementation being the new InMemoryLookupKB (#11268).
Optional before_update callback that is invoked at the start of each training step (#11739).
Improve performance of SpanGroup (#11380).
Improve UX around displacy.serve when the default port is in use (#11948).
Patch a security vulnerability in extracting tar files (#11746).
Add equality definition for vectors (#11806).
Allow interpolation of variables in directory names in projects (#11235).
Update default component configs to use the latest tok2vec version (#11618).
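
The FUZZY operator listed above is described as matching on Levenshtein edit distance. As a rough illustration of that measure only, here is a minimal pure-Python sketch. The function names `levenshtein` and `is_fuzzy_match` and the `max_distance` parameter are hypothetical and are not part of spaCy's API; how spaCy chooses its distance threshold is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def is_fuzzy_match(token_text: str, pattern_text: str, max_distance: int) -> bool:
    """Hypothetical helper: accept a token if it is within `max_distance`
    edits of the pattern string (the threshold is an assumption)."""
    return levenshtein(token_text, pattern_text) <= max_distance


if __name__ == "__main__":
    print(levenshtein("definately", "definitely"))
    print(is_fuzzy_match("Obamma", "Obama", max_distance=1))
```

A misspelling such as "definately" is one substitution away from "definitely", which is the kind of near-miss a fuzzy token pattern is meant to tolerate.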

🔴 Bug fixes

#11382: Fix lookup behavior for the French and Catalan lemmatizers.
#11385: Ensure that downstream components can train properly on a frozen tok2vec or transformer layer.
#11762: Support local file system remotes for projects.
#11763: Raise an error when unsupported values are used for textcat.
#11834: Ensure Vocab.to_disk respects the exclude setting for lookups and vectors.
#12009: Fix a few typing issues for SpanGroup and Span objects.
#12098: Correctly handle missing annotations in the edit tree lemmatizer.

⚠️ Backwards incompatibilities and model updates

The following changes may require you to update code that is using the relevant functionality:

An error is now raised when unsupported values are given as input to train a textcat or textcat_multilabel model - ensure that values are 0.0 or 1.0 as explained in the docs.
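
To find offending annotations before training fails, a pre-check along the following lines may help. This is a minimal sketch: `invalid_cat_values` is a hypothetical helper, not a spaCy function, and it assumes category annotations are available as a plain dictionary mapping each label to a number.

```python
from typing import Dict, List


def invalid_cat_values(cats: Dict[str, float]) -> List[str]:
    """Return the labels whose value is neither 0.0 nor 1.0.

    Hypothetical pre-check; spaCy performs its own validation during
    training and raises an error for unsupported values.
    """
    return [label for label, value in cats.items() if value not in (0.0, 1.0)]


if __name__ == "__main__":
    annotations = {"POSITIVE": 0.7, "NEGATIVE": 0.0}
    bad_labels = invalid_cat_values(annotations)
    if bad_labels:
        print(f"Unsupported textcat values for labels: {bad_labels}")
```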

The following changes may influence the output of your language pipeline or trained models:

Updates to language defaults:
- Extended support for Slovenian (#11162).
- Switch Russian and Ukrainian lemmatizers to pymorphy3 (#11345, #11811).
- Support for editorial punctuation in Ancient Greek (#11426).
- Update to Russian tokenizer exceptions (#11753).
- Small fix in the list of Dutch stop words (#11997).
Updates to model defaults:
- Use the latest tok2vec defaults in all components (#11618).
- Improve the default attributes used for the textcat and textcat_multilabel components (#11698).
- Update the default scorer for textcat and textcat_multilabel to fix a bug related to threshold for textcat and to make it possible to score multiple textcat/textcat_multilabel components in a single pipeline with custom scorers. If no custom scorers are used, the cat_p/r/f scores will now only reflect the final component's labels and performance (#11696, #11820).
- Correct the token_acc score to report the intended measure (# correct tokens / # predicted tokens, the same as in spaCy v2). The token_acc scores for v3.5 will be lower for the same performance because they were incorrectly inflated in v3.0-v3.4. The token_p/r/f scores should remain unchanged (#12073).
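
The corrected token_acc measure (correct tokens divided by predicted tokens) can be sketched as follows. This is an illustration under a simplifying assumption: a predicted token counts as correct only when its character span exactly matches a gold token. spaCy's scorer uses its own alignment logic, and the function name and `Span` alias here are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char) of a token


def token_acc(predicted: List[Span], gold: List[Span]) -> float:
    """Fraction of predicted tokens whose span also appears in the gold
    tokenization (assumption: exact character-span match)."""
    if not predicted:
        return 0.0
    gold_set = set(gold)
    correct = sum(1 for span in predicted if span in gold_set)
    return correct / len(predicted)


if __name__ == "__main__":
    # Text: "can't stop"
    gold = [(0, 2), (2, 5), (6, 10)]   # "ca", "n't", "stop"
    predicted = [(0, 5), (6, 10)]      # "can't", "stop"
    print(token_acc(predicted, gold))  # 1 correct out of 2 predicted
```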

The following functionality will be changed in the near future - so it's best to start updating your scripts now to make them more generic:

From v4 onwards, we'll rename the master branch to main.

📦 Trained pipelines updates

The CNN pipelines add IS_SPACE as a tok2vec feature for tagger and morphologizer components to improve tagging of non-whitespace vs. whitespace tokens.
The transformer pipelines require spacy-transformers v1.2, which uses the exact alignment from tokenizers for fast tokenizers instead of the heuristic alignment from spacy-alignments. For all trained pipelines except ja_core_news_trf, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the spacy-transformers changes are available in the v1.2.0 release notes.
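
To give a feel for what an offset-based alignment between spaCy tokens and transformer wordpieces looks like, here is a minimal sketch. It assumes both tokenizations are available as character spans over the same text and links each token to every wordpiece that overlaps it. The function `align_by_offsets` is hypothetical and is not the spacy-transformers implementation.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def align_by_offsets(token_spans: List[Span], piece_spans: List[Span]) -> List[List[int]]:
    """For each token, list the indices of the wordpieces whose character
    span overlaps the token's span."""
    alignment = []
    for tok_start, tok_end in token_spans:
        overlapping = [
            i
            for i, (piece_start, piece_end) in enumerate(piece_spans)
            if piece_start < tok_end and tok_start < piece_end
        ]
        alignment.append(overlapping)
    return alignment


if __name__ == "__main__":
    # Text: "unbelievable!"
    tokens = [(0, 12), (12, 13)]                 # "unbelievable", "!"
    pieces = [(0, 2), (2, 8), (8, 12), (12, 13)]  # "un", "believ", "able", "!"
    print(align_by_offsets(tokens, pieces))
```

Because the spans come straight from character offsets, no string-similarity heuristic is needed to decide which wordpieces belong to which token.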

📖 Documentation and examples

We've ported our website from Gatsby to Next 🥳
Updated the documentation on supported languages.
Added a note about experimental M1 GPU support to the installation quickstart.
Included documentation for the biluo_to_iob and iob_to_biluo functions.
Fixed model links in the v3.4 usage documentation.
Removed "new" tags from functionality introduced in spaCy v2.x.
Various small additions, spelling and typo fixes.
spaCy Universe additions:
- greCy: Providing Ancient Greek models
- spacy-pythainlp: Add Thai support for spaCy
New projects:
- Accelerate NER with Speedster (experimental)

👥 Contributors

@aaronzipp, @adrianeboyd, @albertvillanova, @ArchiDevil, @cfuerbachersparks, @damian-romero, @danieldk, @darigovresearch, @DSLituiev, @essenmitsosse, @gremur, @honnibal, @ines, @jmyerston, @JosPolfliet, @kadarakos, @koaning, @kwhumphreys, @ljvmiranda921, @MarcoGorelli, @orglce, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @ryndaniels, @shadeMe, @svlandeg, @thomashacker, @TrellixVulnTeam, @wannaphong, @zhiiw, @zrpxx

Files

Name	Size
explosion/spaCy-v3.5.0.zip (md5:7ed9013b721d262d30526ab58ae5777f)	11.1 MB

Additional details

Is supplement to: https://github.com/explosion/spaCy/tree/v3.5.0 (URL)

	All versions	This version
Views	38,268	415
Downloads	2,324	23
Data volume	40.2 GB	255.8 MB