explosion/spaCy: v3.5.0: New CLI commands, language updates, bug fixes and much more
Authors/Creators
- Ines Montani [1]
- Matthew Honnibal [1]
- Sofie Van Landeghem [2]
- Adriane Boyd
- Henning Peters
- Paul O'Leary McCann [3]
- jim geovedi
- Jim O'Regan
- Maxim Samsonov
- György Orosz [4]
- Daniël de Kok [5]
- Duygu Altinok [6]
- Søren Lind Kristiansen
- Madeesh Kannan
- Raphaël Bournhonesque
- Lj Miranda [5]
- Peter Baumgartner [5]
- Edward [5]
- Explosion Bot [5]
- Richard Hudson
- Raphael Mitsch [5]
- Roman [7]
- Leander Fiedler [8]
- Ryn Daniels
- Wannaphong Phatthiyaphaibun [9]
- Grégory Howard
- Yohei Tamura [10]
- Sam Bozek
- 1. Founder @explosion
- 2. Explosion & OxyKodit
- 3. Cotonoha
- 4. LogMeIn, Meltwater
- 5. @explosion
- 6. @deepgram
- 7. @kouchtv
- 8. Nord/LB
- 9. @PyThaiNLP
- 10. @indeedeng
Description
- NEW: New `apply` CLI command to annotate new documents with a trained pipeline (#11376).
- NEW: New `benchmark` CLI command to benchmark pipelines. The new `benchmark speed` subcommand measures the speed of a pipeline, the `benchmark accuracy` subcommand is a new alias for `evaluate` (#11902).
- NEW: New `find-threshold` CLI command to identify an optimal threshold for classification models (#11280).
- NEW: New `FUZZY` `Matcher` operator for fuzzy matches based on Levenshtein edit distance. In addition, the `FUZZY` and `REGEX` operators are now supported in combination with `IN`/`NOT_IN` (#11359).
- Language updates for Ancient Greek, Dutch, Russian, Slovenian and Ukrainian (#11345, #11162, #11426, #11753, #11811, #11997, more details below).
- Allow up to `typer` v0.7.x (#11720), `mypy` 0.990 (#11801) and `typing_extensions` v4.4.x (#12036).
- New `spacy.ConsoleLogger.v3` with expanded progress tracking (#11972).
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` (#11696 and #11971) and `spacy.textcat_multilabel_scorer.v2` (#11820).
- Improved customizability of the knowledge base used for entity linking, with the default implementation being the new `InMemoryLookupKB` (#11268).
- Optional `before_update` callback that is invoked at the start of each training step (#11739).
- Improve performance of `SpanGroup` (#11380).
- Improve UX around `displacy.serve` when the default port is in use (#11948).
- Patch a security vulnerability in extracting tar files (#11746).
- Add equality definition for vectors (#11806).
- Allow interpolation of variables in directory names in projects (#11235).
- Update default component configs to use the latest `tok2vec` version (#11618).
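The `FUZZY` operator listed above matches on Levenshtein edit distance. As a rough illustration of that measure only, here is a plain-Python sketch of the standard dynamic-programming computation. It is not spaCy's implementation, and the function name `levenshtein` is a hypothetical helper chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    # previous[j] is the distance between the prefix of `a` processed
    # so far (excluding the current character) and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# A common misspelling differs from the correct word by one substitution.
print(levenshtein("definitely", "definately"))
```

A fuzzy match then amounts to accepting a token when this distance stays below some allowed maximum. How spaCy chooses that maximum is not covered by this sketch.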
- #11382: Fix lookup behavior for the French and Catalan lemmatizers.
- #11385: Ensure that downstream components can train properly on a frozen `tok2vec` or `transformer` layer.
- #11762: Support local file system remotes for projects.
- #11763: Raise an error when unsupported values are used for `textcat`.
- #11834: Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and `vectors`.
- #12009: Fix a few typing issues for `SpanGroup` and `Span` objects.
- #12098: Correctly handle missing annotations in the edit tree lemmatizer.
The following changes may require you to update code that is using the relevant functionality:
- An error is now raised when unsupported values are given as input to train a `textcat` or `textcat_multilabel` model - ensure that values are 0.0 or 1.0 as explained in the docs.
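To catch such values before training starts, you could pre-check your own annotations. The following is a minimal sketch, assuming the categories are stored as a dict mapping label names to floats. The helper name `validate_cats` is hypothetical and is not part of spaCy.

```python
from typing import Dict


def validate_cats(cats: Dict[str, float]) -> None:
    """Raise ValueError if any category value is not exactly 0.0 or 1.0."""
    for label, value in cats.items():
        if value not in (0.0, 1.0):
            raise ValueError(
                f"Unsupported value {value!r} for label {label!r}: "
                "expected 0.0 or 1.0"
            )


# Valid annotations pass silently.
validate_cats({"POSITIVE": 1.0, "NEGATIVE": 0.0})

# A soft label such as 0.5 is rejected.
try:
    validate_cats({"POSITIVE": 0.5, "NEGATIVE": 0.5})
except ValueError as err:
    print(err)
```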
The following changes may influence the output of your language pipeline or trained models:
- Updates to language defaults:
  - Extended support for Slovenian (#11162).
  - Switch Russian and Ukrainian lemmatizers to `pymorphy3` (#11345, #11811).
  - Support for editorial punctuation in Ancient Greek (#11426).
  - Update to Russian tokenizer exceptions (#11753).
  - Small fix in the list of Dutch stop words (#11997).
- Updates to model defaults:
  - Use the latest `tok2vec` defaults in all components (#11618).
  - Improve the default attributes used for the `textcat` and `textcat_multilabel` components (#11698).
  - Update the default scorer for `textcat` and `textcat_multilabel` to fix a bug related to `threshold` for `textcat` and to make it possible to score multiple `textcat`/`textcat_multilabel` components in a single pipeline with custom scorers. If no custom scorers are used, the `cat_p/r/f` scores will now only reflect the final component's labels and performance (#11696, #11820).
  - Correct the `token_acc` score to report the intended measure (# correct tokens / # predicted tokens, the same as in spaCy v2). The `token_acc` scores for v3.5 will be lower for the same performance because they were incorrectly inflated in v3.0-v3.4. The `token_p/r/f` scores should remain unchanged (#12073).
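The `token_acc` definition above (# correct tokens / # predicted tokens) can be illustrated with a small sketch. It assumes tokens are represented as `(start, end)` character offsets and that a predicted token counts as correct when its offsets exactly match a gold token. The function name `token_accuracy` and this representation are assumptions for illustration, not spaCy's scorer.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of a token


def token_accuracy(gold: List[Span], predicted: List[Span]) -> float:
    """Return the share of predicted tokens that also appear in the gold tokens."""
    if not predicted:
        return 0.0
    gold_spans = set(gold)
    correct = sum(1 for span in predicted if span in gold_spans)
    return correct / len(predicted)


# Text: "New York-based"
# Gold tokenization:      New | York | - | based
# Predicted tokenization: New | York-based
gold = [(0, 3), (4, 8), (8, 9), (9, 14)]
predicted = [(0, 3), (4, 14)]
print(token_accuracy(gold, predicted))
```

Here one of the two predicted tokens matches a gold token, so the score is one half.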
The following functionality will be changed in the near future, so it's best to start updating your scripts now to make them more generic:
- From v4 onwards, we'll rename the `master` branch to `main`.
- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and `morphologizer` components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require `spacy-transformers` v1.2, which uses the exact alignment from `tokenizers` for fast tokenizers instead of the heuristic alignment from `spacy-alignments`. For all trained pipelines except `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the `spacy-transformers` changes are in the v1.2.0 release notes.
- We've ported our website from Gatsby to Next 🥳
- Updated the documentation on supported languages.
- Added a note about experimental M1 GPU support to the installation quickstart.
- Included documentation for the `biluo_to_iob` and `iob_to_biluo` functions.
- Fixed model links in the v3.4 usage documentation.
- Removed "new" tags of functionality from spaCy v2.x.
- Various small additions, spelling and typo fixes.
- spaCy Universe additions:
  - greCy: Providing Ancient Greek models
  - spacy-pythainlp: Add Thai support for spaCy
- New projects:
  - Accelerate NER with Speedster (experimental)
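The `biluo_to_iob` and `iob_to_biluo` functions mentioned in the documentation updates above convert between entity tagging schemes. As a rough sketch of the BILUO-to-IOB direction only: a single-token entity (`U-`) becomes a begin tag, and the last token of an entity (`L-`) becomes an inside tag. The function name `biluo_to_iob_sketch` is hypothetical, and spaCy's own functions may handle edge cases differently.

```python
from typing import List


def biluo_to_iob_sketch(tags: List[str]) -> List[str]:
    """Convert BILUO tags to IOB tags: U- becomes B-, L- becomes I-."""
    converted = []
    for tag in tags:
        if tag.startswith("U-"):
            converted.append("B-" + tag[2:])
        elif tag.startswith("L-"):
            converted.append("I-" + tag[2:])
        else:
            # B-, I- and O tags are identical in both schemes.
            converted.append(tag)
    return converted


tags = ["O", "B-ORG", "I-ORG", "L-ORG", "O", "U-GPE"]
print(biluo_to_iob_sketch(tags))
```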
@aaronzipp, @adrianeboyd, @albertvillanova, @ArchiDevil, @cfuerbachersparks, @damian-romero, @danieldk, @darigovresearch, @DSLituiev, @essenmitsosse, @gremur, @honnibal, @ines, @jmyerston, @JosPolfliet, @kadarakos, @koaning, @kwhumphreys, @ljvmiranda921, @MarcoGorelli, @orglce, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @ryndaniels, @shadeMe, @svlandeg, @thomashacker, @TrellixVulnTeam, @wannaphong, @zhiiw, @zrpxx
Files
| Name | Size |
|---|---|
| explosion/spaCy-v3.5.0.zip (md5:7ed9013b721d262d30526ab58ae5777f) | 11.1 MB |
Additional details
Related works
- Is supplement to: https://github.com/explosion/spaCy/tree/v3.5.0 (URL)