explosion/spaCy: v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish
Authors/Creators
- Ines Montani (1)
- Matthew Honnibal (1)
- Sofie Van Landeghem (2)
- Adriane Boyd
- Henning Peters
- Paul O'Leary McCann (3)
- Maxim Samsonov
- Jim Geovedi
- Jim O'Regan
- Duygu Altinok (4)
- György Orosz (5)
- Søren Lind Kristiansen
- Lj Miranda (6)
- Daniël de Kok (6)
- Roman (7)
- Explosion Bot (6)
- Leander Fiedler (8)
- Grégory Howard
- Edward (9)
- Wannaphong Phatthiyaphaibun (10)
- Yohei Tamura
- Sam Bozek
- murat
- Ryn Daniels
- Mark Amery
- Björn Böing (11)
- Bram Vanroy (12)
- Pradeep Kumar Tippa
- 1. Founder @explosion
- 2. Explosion & OxyKodit
- 3. Cotonoha
- 4. @deepgram
- 5. LogMeIn, Meltwater
- 6. @explosion
- 7. @kouchtv
- 8. Nord/LB
- 9. Explosion AI
- 10. @PyThaiNLP
- 11. @codecentric
- 12. @UGent
Description
- Improved speeds for many components, see speed benchmarks for trained pipelines:
  - Speed up parser and NER by using constant-time head lookups (#10048).
  - Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
  - Speed up parser projectivization functions (#10241).
  - Replace `Ragged` with faster `AlignmentArray` in `Example` for training (#10319).
  - Improve `Matcher` speed (#10659).
  - Improve serialization speed for empty `Doc.spans` (#10250).
- NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with `spacy init config -p trainable_lemmatizer` or using the quickstart.
- Language updates:
- Big endian support with `thinc` v8.0.14+ and `thinc-bigendian-ops`.
- Config comparisons with `spacy debug diff-config`.
- displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
- `SpanCategorizer.set_candidates` for debugging span suggesters.
- The quickstart now supports adding `spancat` and `trainable_lemmatizer` components.
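One way to see why unnormalized softmax output can be enough at inference time: softmax is monotonic, so the highest raw score and the highest normalized probability always point at the same class. The sketch below is plain Python and does not use spaCy or thinc; the scores and helper names are made up for illustration.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for four tags on one token.
raw_scores = [1.2, -0.3, 3.1, 0.5]

predicted_raw = argmax(raw_scores)            # no exp, no division
predicted_norm = argmax(softmax(raw_scores))  # full normalization

assert predicted_raw == predicted_norm
```

Skipping normalization only makes sense when the prediction is all you need. If downstream code reads the scores as probabilities, they have to be normalized.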
v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
| Package | Language | UPOS | Parser LAS | NER F |
|---|---|---|---|---|
| `fi_core_news_sm` | Finnish | 92.5 | 71.9 | 75.9 |
| `fi_core_news_md` | Finnish | 95.9 | 78.6 | 80.6 |
| `fi_core_news_lg` | Finnish | 96.2 | 79.4 | 82.4 |
| `ko_core_news_sm` | Korean | 86.1 | 65.6 | 71.3 |
| `ko_core_news_md` | Korean | 94.7 | 80.9 | 83.1 |
| `ko_core_news_lg` | Korean | 94.7 | 81.3 | 85.3 |
| `sv_core_news_sm` | Swedish | 95.0 | 75.9 | 74.7 |
| `sv_core_news_md` | Swedish | 96.3 | 78.5 | 79.3 |
| `sv_core_news_lg` | Swedish | 96.3 | 79.1 | 81.1 |
🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
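To illustrate how subwords combined with hashing can give every string a vector from a small table, here is a rough sketch in plain Python. It is not floret's implementation: the n-gram sizes, the MD5-based hashing, the random table and all function names are assumptions chosen for the example.

```python
import hashlib
import random


def subwords(word, minn=3, maxn=4):
    """The padded word plus its character n-grams, without duplicates."""
    padded = f"<{word}>"
    grams = [padded]
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return list(dict.fromkeys(grams))


def rows_for(gram, n_rows, n_hashes):
    """Map one string to several table rows using seeded hashes."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.md5(f"{seed}:{gram}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows


def make_table(n_rows, dim, seed=0):
    """Stand-in for a trained table: fixed pseudo-random rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_rows)]


def embed(word, table, n_hashes=2):
    """Average the rows selected by every subword and every hash."""
    dim = len(table[0])
    vec = [0.0] * dim
    grams = subwords(word)
    for gram in grams:
        for row in rows_for(gram, len(table), n_hashes):
            for k in range(dim):
                vec[k] += table[row][k]
    count = len(grams) * n_hashes
    return [v / count for v in vec]


table = make_table(n_rows=50, dim=8)
known = embed("koira", table)
unseen = embed("tuntematonsana", table)  # any string still gets a vector
```

Because rows are chosen by hashing, the table size is fixed in advance and does not grow with the vocabulary. Collisions between strings are the price, which using several hashes per string is meant to soften.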
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
|---|---|---|
| `da_core_news_md` | 84.9 | 94.8 |
| `de_core_news_md` | 73.4 | 97.7 |
| `el_core_news_md` | 56.5 | 88.9 |
| `fi_core_news_md` | - | 86.2 |
| `it_core_news_md` | 86.6 | 97.2 |
| `ko_core_news_md` | - | 90.0 |
| `lt_core_news_md` | 71.1 | 84.8 |
| `nb_core_news_md` | 76.7 | 97.1 |
| `nl_core_news_md` | 81.5 | 94.0 |
| `pl_core_news_md` | 87.1 | 93.7 |
| `pt_core_news_md` | 76.7 | 96.9 |
| `ro_core_news_md` | 81.8 | 95.5 |
| `sv_core_news_md` | - | 95.5 |
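For readers unfamiliar with edit trees, the following is a simplified sketch of the idea in plain Python. It is not spaCy's implementation, and the class and function names are invented for illustration. A tree is derived from one (form, lemma) pair by keeping their longest common substring and recursively describing how the prefix and suffix change. The same tree can then be applied to other forms that follow the same pattern.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Subst:
    """Leaf: replace the exact string `orig` with `repl`."""
    orig: str
    repl: str


@dataclass(frozen=True)
class Match:
    """Inner node: keep the middle, rewrite the prefix and the suffix."""
    prefix_len: int
    suffix_len: int
    prefix_tree: Optional["Tree"]
    suffix_tree: Optional["Tree"]


Tree = Union[Subst, Match]


def longest_common_substring(a: str, b: str) -> Tuple[int, int, int]:
    """Return (start in a, start in b, length) of the longest shared run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def build_tree(form: str, lemma: str) -> Optional[Tree]:
    """Derive an edit tree that rewrites `form` into `lemma`."""
    if not form and not lemma:
        return None  # nothing to rewrite
    start_f, start_l, length = longest_common_substring(form, lemma)
    if length == 0:
        return Subst(form, lemma)
    return Match(
        prefix_len=start_f,
        suffix_len=len(form) - start_f - length,
        prefix_tree=build_tree(form[:start_f], lemma[:start_l]),
        suffix_tree=build_tree(form[start_f + length:], lemma[start_l + length:]),
    )


def apply_tree(tree: Optional[Tree], form: str) -> Optional[str]:
    """Apply a tree to a form; return None if the tree does not fit."""
    if tree is None:
        return form if form == "" else None
    if isinstance(tree, Subst):
        return tree.repl if form == tree.orig else None
    if len(form) < tree.prefix_len + tree.suffix_len:
        return None
    split = len(form) - tree.suffix_len
    prefix = apply_tree(tree.prefix_tree, form[:tree.prefix_len])
    suffix = apply_tree(tree.suffix_tree, form[split:])
    if prefix is None or suffix is None:
        return None
    return prefix + form[tree.prefix_len:split] + suffix


english = build_tree("walked", "walk")   # strip the suffix "ed"
german = build_tree("gesagt", "sagen")   # strip "ge", turn "t" into "en"
```

Applying `english` to `"talked"` or `german` to `"gemacht"` reuses the learned pattern on an unseen form, while a form that does not fit the tree yields `None`. How a trained component chooses which tree to apply to a given token is outside the scope of this sketch.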
- Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
- Fix issue #9443: Fix `Scorer.score_cats` for missing labels.
- Fix issue #9669: Fix entity linker batching.
- Fix issue #9903: Handle `_` value for UPOS in CoNLL-U converter.
- Fix issue #9904: Fix textcat loss scaling.
- Fix issue #9956: Compare all `Span` attributes consistently.
- Fix issue #10073: Add `"spans"` to the output of `doc.to_json`.
- Fix issue #10086: Add tokenizer option to allow `Matcher` handling for all special cases.
- Fix issue #10189: Allow `Example` to align whitespace annotation.
- Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
- Fix issue #10347: Update basic functionality for `rehearse`.
- Fix issue #10394: Fix `Vectors.n_keys` for floret vectors.
- Fix issue #10400: Use `meta` in `util.load_model_from_config`.
- Fix issue #10451: Fix `Example.get_matching_ents`.
- Fix issue #10460: Fix initial special cases for `Tokenizer.explain`.
- Fix issue #10521: Stream large assets on download in spaCy projects.
- Fix issue #10536: Handle unknown tags in `KoreanTokenizer` tag map.
- Fix issue #10551: Add automatic vector deduplication for `init vectors`.
- To see the speed improvements for the `Tagger` architecture, edit your configs to switch from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
- Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
- Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
- `Doc.from_docs` now includes `Doc.tensor` by default and supports excludes with an `exclude` argument in the same format as `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and `user_data`.
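The effect of the span comparison change can be pictured with a small stand-in in plain Python. `SpanKey` below is a hypothetical tuple type, not spaCy's `Span`. It only shows what it means to compare on start, end, label and KB ID in that order, using string labels for readability (how spaCy represents and orders labels internally is not covered here).

```python
from typing import NamedTuple


class SpanKey(NamedTuple):
    """Hypothetical stand-in for the attributes used in the comparison."""
    start: int
    end: int
    label: str
    kb_id: str


spans = [
    SpanKey(0, 2, "ORG", ""),
    SpanKey(0, 2, "GPE", "Q60"),
    SpanKey(0, 1, "PERSON", ""),
    SpanKey(0, 2, "GPE", "Q1384"),
]

# Tuples compare field by field, so ties on start and end are broken
# by label, and ties on label are broken by KB ID.
ordered = sorted(spans)
```

Three of these spans share the same start and end, so their relative order is decided entirely by label and KB ID. Code that relied on the order of such overlapping spans when sorting may need a second look after upgrading.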
- spaCy universe additions:
- classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
- Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
- Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
- EDS-NLP: spaCy components to extract information from clinical notes written in French.
- HuSpaCy: Industrial-strength Hungarian natural language processing.
- Klayers: spaCy as an AWS Lambda Layer.
- Named Entity Recognition (NER) using spaCy (video).
- Scrubadub: Remove personally identifiable information from text using spaCy.
- spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
- tmtoolkit: Text mining and topic modeling toolkit.
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996
Files
explosion/spaCy-v3.3.0.zip (11.0 MB)
md5:b3b0b0d7226831a9a1957babcf4d2d4e
Additional details
Related works
- Is supplement to: https://github.com/explosion/spaCy/tree/v3.3.0 (URL)