Published April 29, 2022 | Version v3.3.0
Software Open

explosion/spaCy: v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish

Description

✨ New features and improvements

📦 Trained pipelines

v3.3 introduces trained pipelines for Finnish, Korean and Swedish, which feature the trainable lemmatizer and floret vectors. Because floret vectors use Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

Package          Language  UPOS  Parser LAS  NER F
fi_core_news_sm  Finnish   92.5  71.9        75.9
fi_core_news_md  Finnish   95.9  78.6        80.6
fi_core_news_lg  Finnish   96.2  79.4        82.4
ko_core_news_sm  Korean    86.1  65.6        71.3
ko_core_news_md  Korean    94.7  80.9        83.1
ko_core_news_lg  Korean    94.7  81.3        85.3
sv_core_news_sm  Swedish   95.0  75.9        74.7
sv_core_news_md  Swedish   96.3  78.5        79.3
sv_core_news_lg  Swedish   96.3  79.1        81.1
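
To make the "compact vectors with no out-of-vocabulary words" point concrete, here is a toy sketch of the general idea behind hashed subword embeddings: every word is split into character n-grams, and each n-gram is hashed into a small fixed-size table, so any string maps to some rows. This is an illustration only, not floret's implementation. The function names, the table size, the n-gram range and the use of MD5 as the hash are all assumptions made for the example.

```python
import hashlib
import random

N_ROWS = 1000  # assumed table size, far smaller than a real vocabulary
DIM = 4        # assumed vector width, kept tiny for readability

# Stand-in for a trained embedding table (random values, illustration only).
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(N_ROWS)]


def subword_rows(word, min_n=3, max_n=5, n_hashes=2):
    """Return the table rows for a word and its character n-grams."""
    padded = f"<{word}>"  # boundary markers around the word
    ngrams = {padded}     # the whole word is hashed as well
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            ngrams.add(padded[i:i + n])
    rows = set()
    for gram in ngrams:
        # Several hashes per n-gram (the Bloom-embedding idea); the seed is
        # simply mixed into the hashed string here.
        for seed in range(n_hashes):
            digest = hashlib.md5(f"{seed}:{gram}".encode("utf8")).digest()
            rows.add(int.from_bytes(digest[:8], "big") % N_ROWS)
    return sorted(rows)


def vector(word):
    """Average the selected rows; works for any string, seen or unseen."""
    rows = subword_rows(word)
    return [sum(TABLE[r][d] for r in rows) / len(rows) for d in range(DIM)]


print(len(subword_rows("epäjärjestelmällisyys")), "rows used")
print(vector("aivan-uusi-sana"))
```

Since every n-gram lands in one of the fixed rows, the table size is independent of vocabulary size, and a word never seen before still gets a vector built from its subwords.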

๐Ÿ™ Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

Model            v3.2 Lemma Acc  v3.3 Lemma Acc
da_core_news_md  84.9            94.8
de_core_news_md  73.4            97.7
el_core_news_md  56.5            88.9
fi_core_news_md  -               86.2
it_core_news_md  86.6            97.2
ko_core_news_md  -               90.0
lt_core_news_md  71.1            84.8
nb_core_news_md  76.7            97.1
nl_core_news_md  81.5            94.0
pl_core_news_md  87.1            93.7
pt_core_news_md  76.7            96.9
ro_core_news_md  81.8            95.5
sv_core_news_md  -               95.5

🔴 Bug fixes
  • Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
  • Fix issue #9443: Fix Scorer.score_cats for missing labels.
  • Fix issue #9669: Fix entity linker batching.
  • Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
  • Fix issue #9904: Fix textcat loss scaling.
  • Fix issue #9956: Compare all Span attributes consistently.
  • Fix issue #10073: Add "spans" to the output of doc.to_json.
  • Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
  • Fix issue #10189: Allow Example to align whitespace annotation.
  • Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
  • Fix issue #10324: Fix Tok2Vec for empty batches.
  • Fix issue #10347: Update basic functionality for rehearse.
  • Fix issue #10394: Fix Vectors.n_keys for floret vectors.
  • Fix issue #10400: Use meta in util.load_model_from_config.
  • Fix issue #10451: Fix Example.get_matching_ents.
  • Fix issue #10460: Fix initial special cases for Tokenizer.explain.
  • Fix issue #10521: Stream large assets on download in spaCy projects.
  • Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
  • Fix issue #10551: Add automatic vector deduplication for init vectors.
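
One of the fixes above (#9956) changes how spans are compared: ordering now considers start, end, label and KB ID together. A rough way to picture this is plain tuple comparison over those four fields. The sketch below is a stand-in written for illustration, not spaCy's `Span` class. `ToySpan` is a hypothetical name, and labels and KB IDs are modelled as strings here, whereas spaCy stores them as hash values, so the actual tie-break order between different labels may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ToySpan:
    """Hypothetical stand-in: ordered by (start, end, label, kb_id)."""
    start: int
    end: int
    label: str = ""
    kb_id: str = ""


spans = [
    ToySpan(0, 2, "PERSON"),
    ToySpan(0, 2, "ORG", "Q2"),
    ToySpan(0, 2, "ORG"),
    ToySpan(0, 1, "ORG"),
]

# Spans sharing start and end no longer compare as equal in ordering:
# label and KB ID act as tie-breakers.
ordered = sorted(spans)
for span in ordered:
    print(span)
```

Code that sorts spans and relied on the previous order among spans with identical offsets may therefore see them in a slightly different sequence.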

🚀 Notes about upgrading from v3.2
  • To see the speed improvements for the Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
  • Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
  • Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
  • Doc.from_docs now includes Doc.tensor by default and supports excluding fields through an exclude argument, in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
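
As a minimal sketch of the Tagger switch described in the first note, the snippet below rewrites the architecture name in a config held as text. The config fragment, the helper name `upgrade_tagger_architecture` and the file names in the closing comment are assumptions for illustration; a real config will contain more settings.

```python
def upgrade_tagger_architecture(config_text: str) -> str:
    """Switch the tagger model from spacy.Tagger.v1 to spacy.Tagger.v2."""
    return config_text.replace('"spacy.Tagger.v1"', '"spacy.Tagger.v2"')


# Hypothetical excerpt of a v3.2-era training config.
old_config = """[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
"""

new_config = upgrade_tagger_architecture(old_config)
print(new_config)

# After saving the edited config, fill in any settings the new
# architecture expects (file names are placeholders):
#   python -m spacy init fill-config base_config.cfg config.cfg
```

Editing the file by hand works just as well; the point is that only the architecture name changes before `init fill-config` is run.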

📖 Documentation and examples

👥 Contributors

@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996

Files

explosion/spaCy-v3.3.0.zip (11.0 MB)
md5:b3b0b0d7226831a9a1957babcf4d2d4e
