explosion/spaCy: v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish
Creators
- Ines Montani (1)
- Matthew Honnibal (1)
- Sofie Van Landeghem (2)
- Adriane Boyd
- Henning Peters
- Paul O'Leary McCann (3)
- Maxim Samsonov
- Jim Geovedi
- Jim O'Regan
- Duygu Altinok (4)
- György Orosz (5)
- Søren Lind Kristiansen
- Lj Miranda (6)
- Daniël de Kok (6)
- Roman (7)
- Explosion Bot (6)
- Leander Fiedler (8)
- Grégory Howard
- Edward (9)
- Wannaphong Phatthiyaphaibun (10)
- Yohei Tamura
- Sam Bozek
- murat
- Ryn Daniels
- Mark Amery
- Björn Böing (11)
- Bram Vanroy (12)
- Pradeep Kumar Tippa
- 1. Founder @explosion
- 2. Explosion & OxyKodit
- 3. Cotonoha
- 4. @deepgram
- 5. LogMeIn, Meltwater
- 6. @explosion
- 7. @kouchtv
- 8. Nord/LB
- 9. Explosion AI
- 10. @PyThaiNLP
- 11. @codecentric
- 12. @UGent
Description
✨ New features and improvements
- Improved speeds for many components, see speed benchmarks for trained pipelines:
  - Speed up parser and NER by using constant-time head lookups (#10048).
  - Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
  - Speed up parser projectivization functions (#10241).
  - Replace `Ragged` with faster `AlignmentArray` in `Example` for training (#10319).
  - Improve `Matcher` speed (#10659).
  - Improve serialization speed for empty `Doc.spans` (#10250).
- NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with `spacy init config -p trainable_lemmatizer` or using the quickstart.
- Language updates:
- Big endian support with `thinc` v8.0.14+ and `thinc-bigendian-ops`.
- Config comparisons with `spacy debug diff-config`.
- displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
- `SpanCategorizer.set_candidates` for debugging span suggesters.
- The quickstart now supports adding `spancat` and `trainable_lemmatizer` components.
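The speed-up from unnormalized softmax probabilities rests on a simple property: softmax preserves the ranking of its inputs, so the best label can be read directly from the raw scores. The following is a minimal pure-Python sketch of that property, not spaCy's or thinc's implementation; the helper names `softmax` and `argmax` and the example scores are illustrative assumptions.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for four labels on a single token.
raw_scores = [1.2, -0.3, 3.4, 0.5]
probabilities = softmax(raw_scores)

# The winning label is the same with or without normalization,
# so the exp/sum/divide work can be skipped when only the label is needed.
best_from_raw = argmax(raw_scores)
best_from_probs = argmax(probabilities)
```

If downstream code needs calibrated probabilities rather than just the predicted label, normalization is still required.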
v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
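To illustrate why subwords combined with Bloom embeddings leave no out-of-vocabulary words, here is a rough pure-Python sketch: every word is split into character n-grams, each n-gram is hashed several times into a small fixed-size table, and the selected rows are averaged. This is not floret's actual code. The hash function, the n-gram range, the table size and all function names (`char_ngrams`, `bucket`, `make_table`, `word_vector`) are assumptions chosen for readability, and the table holds random placeholder weights rather than trained values.

```python
import hashlib
import random


def char_ngrams(word, min_n=3, max_n=5):
    """Character n-grams of the word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


def bucket(piece, seed, n_buckets):
    """Map a string to a table row; a different seed acts as a different hash."""
    digest = hashlib.sha256(f"{seed}:{piece}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def make_table(n_buckets, dim, seed=0):
    """Placeholder weights; a real table would be learned during training."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_buckets)]


def word_vector(word, table, n_hashes=2):
    """Average the rows selected by hashing the full word and its n-grams."""
    dim = len(table[0])
    total = [0.0] * dim
    pieces = [f"<{word}>"] + char_ngrams(word)
    for piece in pieces:
        for seed in range(n_hashes):
            row = table[bucket(piece, seed, len(table))]
            for i, value in enumerate(row):
                total[i] += value
    count = len(pieces) * n_hashes
    return [value / count for value in total]


table = make_table(n_buckets=50, dim=4)
known = word_vector("kissa", table)
unseen = word_vector("xyzzyplugh", table)  # never stored anywhere, still gets a vector
```

Because lookups go through hashing rather than a vocabulary, the table size is fixed in advance and independent of how many distinct words appear.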
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
---|---|---|
da_core_news_md | 84.9 | 94.8 |
de_core_news_md | 73.4 | 97.7 |
el_core_news_md | 56.5 | 88.9 |
fi_core_news_md | - | 86.2 |
it_core_news_md | 86.6 | 97.2 |
ko_core_news_md | - | 90.0 |
lt_core_news_md | 71.1 | 84.8 |
nb_core_news_md | 76.7 | 97.1 |
nl_core_news_md | 81.5 | 94.0 |
pl_core_news_md | 87.1 | 93.7 |
pt_core_news_md | 76.7 | 96.9 |
ro_core_news_md | 81.8 | 95.5 |
sv_core_news_md | - | 95.5 |
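The trainable lemmatizer is described above as using edit trees to transform tokens to lemmas. As a rough intuition for what an edit tree is, the following simplified pure-Python sketch builds a tree from one (form, lemma) pair by recursively matching the longest common substring, then applies that tree to a different form. It is an illustration under stated assumptions, not spaCy's implementation: the tuple-based tree layout and the names `build_tree` and `apply_tree` are hypothetical, and in the real component a trained model chooses which tree to apply to each token.

```python
from difflib import SequenceMatcher


def build_tree(form, lemma):
    """Build a simplified edit tree that rewrites `form` into `lemma`."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    match = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if match.size == 0:
        # Nothing in common: replace the whole piece.
        return ("subst", form, lemma)
    prefix_len = match.a
    suffix_len = len(form) - (match.a + match.size)
    left = build_tree(form[:match.a], lemma[:match.b])
    right = build_tree(form[match.a + match.size:], lemma[match.b + match.size:])
    # Only the prefix and suffix lengths are stored, so the unchanged middle
    # part may differ from word to word.
    return ("match", prefix_len, suffix_len, left, right)


def apply_tree(tree, form):
    """Apply an edit tree to a form; return None if the tree does not fit."""
    if tree[0] == "subst":
        _, original, replacement = tree
        return replacement if form == original else None
    _, prefix_len, suffix_len, left, right = tree
    if prefix_len + suffix_len > len(form):
        return None
    cut = len(form) - suffix_len
    left_out = apply_tree(left, form[:prefix_len])
    right_out = apply_tree(right, form[cut:])
    if left_out is None or right_out is None:
        return None
    return left_out + form[prefix_len:cut] + right_out


# A tree learned from "walked" -> "walk" strips a final "ed" from other forms.
strip_ed = build_tree("walked", "walk")
lemma_for_talked = apply_tree(strip_ed, "talked")
lemma_for_running = apply_tree(strip_ed, "running")  # tree does not fit
```

Because a tree generalizes across words that share the same edit pattern, lemmatization becomes a classification problem over a finite set of trees collected from the training data.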
- Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
- Fix issue #9443: Fix `Scorer.score_cats` for missing labels.
- Fix issue #9669: Fix entity linker batching.
- Fix issue #9903: Handle `_` value for UPOS in CoNLL-U converter.
- Fix issue #9904: Fix textcat loss scaling.
- Fix issue #9956: Compare all `Span` attributes consistently.
- Fix issue #10073: Add `"spans"` to the output of `doc.to_json`.
- Fix issue #10086: Add tokenizer option to allow `Matcher` handling for all special cases.
- Fix issue #10189: Allow `Example` to align whitespace annotation.
- Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
- Fix issue #10347: Update basic functionality for `rehearse`.
- Fix issue #10394: Fix `Vectors.n_keys` for floret vectors.
- Fix issue #10400: Use `meta` in `util.load_model_from_config`.
- Fix issue #10451: Fix `Example.get_matching_ents`.
- Fix issue #10460: Fix initial special cases for `Tokenizer.explain`.
- Fix issue #10521: Stream large assets on download in spaCy projects.
- Fix issue #10536: Handle unknown tags in `KoreanTokenizer` tag map.
- Fix issue #10551: Add automatic vector deduplication for `init vectors`.
- To see the speed improvements for the `Tagger` architecture, edit your configs to switch from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
- Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
- Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
- `Doc.from_docs` now includes `Doc.tensor` by default and supports excludes with an `exclude` argument in the same format as `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and `user_data`.
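The effect of the span comparison change can be pictured as comparing tuples of (start, end, label, KB ID) instead of positions alone. The sketch below models this with a hypothetical `SpanKey` dataclass in plain Python; it is not spaCy's `Span` class. It also assumes labels and KB IDs compare as strings, whereas spaCy stores them as hash values, so the relative order of spans that differ only in label may not match this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class SpanKey:
    """Hypothetical stand-in for a span; fields compare in declaration order."""
    start: int
    end: int
    label: str = ""
    kb_id: str = ""


spans = [
    SpanKey(0, 2, "ORG"),
    SpanKey(0, 2, "GPE"),
    SpanKey(0, 1, "PERSON"),
    SpanKey(0, 2, "GPE", "Q64"),
]

# Spans covering the same tokens are no longer tied: label and KB ID break the tie.
ordered = sorted(spans)
same_offsets_differ = SpanKey(0, 2, "GPE") < SpanKey(0, 2, "ORG")
```

Code that relied on the previous order of spans with identical offsets, for example when picking the first span after sorting, may need an explicit sort key.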
- spaCy universe additions:
- classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
- Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
- Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
- EDS-NLP: spaCy components to extract information from clinical notes written in French.
- HuSpaCy: Industrial-strength Hungarian natural language processing.
- Klayers: spaCy as an AWS Lambda Layer.
- Named Entity Recognition (NER) using spaCy (video).
- Scrubadub: Remove personally identifiable information from text using spaCy.
- spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
- tmtoolkit: Text mining and topic modeling toolkit.
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996
Files
- explosion/spaCy-v3.3.0.zip (11.0 MB), md5:b3b0b0d7226831a9a1957babcf4d2d4e
Additional details
Related works
- Is supplement to
- https://github.com/explosion/spaCy/tree/v3.3.0 (URL)