explosion/spaCy: v2.3.0: Models for Chinese, Danish, Japanese, Polish and Romanian, new updated models with vectors, faster loading, small API improvements & lots of bug fixes

Ines Montani; Matthew Honnibal; Sofie Van Landeghem; Henning Peters; Adriane Boyd; Maxim Samsonov; Jim Geovedi; Jim Regan; György Orosz; Paul O'Leary McCann; Søren Lind Kristiansen; Duygu Altinok; Roman; Leander Fiedler; Grégory Howard; Explosion Bot; Sam Bozek; Wannaphong Phatthiyaphaibun; Mark Amery; Björn Böing; Pradeep Kumar Tippa; Yohei Tamura; Leif Uwe Vogelsang; Ramanan Balakrishnan; Vadim Mazaev; GregDubbin; jeannefukumaru; Jens Dahl Møllerhøj; Avadh Patel

doi:10.5281/zenodo.3897194

Published June 16, 2020 | Version v2.3.0

Software Open

1. Founder @explosion
2. Explosion & OxyKodit
3. LogMeIn, Meltwater
4. Cotonoha
5. German Autolabs
6. @kouchtv
7. @explosion
8. @PyThaiNLP
9. @codecentric
10. PKSHA Technology
11. @Semantics3
12. BotXO
13. SUNY Binghamton - Computer Science

⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
NEW: Alpha support for Armenian, Gujarati and Malayalam.
NEW: Lookup lemmatization for Polish.
NEW: Allow Matcher to match on both Doc and Span objects.
NEW: Add Token.is_sent_end property.
Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
Add support for pkuseg alongside jieba for Chinese.
Switch from fugashi to sudachipy for Japanese.
Improve punctuation used in sentencizer.
Switch to new and more consistent alignment method in gold.align.
Reduce stored lexemes data and move non-derivable features to spacy-lookups-data.
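
Two of the API additions above, matching on Span objects and Token.is_sent_end, can be sketched as follows. This is a minimal illustration, not code from the release: the example text and the "GREETING" pattern name are made up, the list-of-patterns form of Matcher.add is assumed to be available in your installed version, and sentence starts are set by hand so that no model download is needed.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
doc = nlp("Hello world. Hello again.")  # 6 tokens

matcher = Matcher(nlp.vocab)
# "GREETING" is an illustrative name; the pattern matches the token "hello"
# case-insensitively. The list-of-patterns call form is an assumption.
matcher.add("GREETING", [[{"LOWER": "hello"}]])

# The same Matcher can be applied to a whole Doc or to a Span of it.
doc_matches = matcher(doc)
span = doc[3:6]  # "Hello again."
span_matches = matcher(span)

# Set sentence boundaries by hand (illustrative rule: a new sentence starts
# at the first token and after every "."), then read Token.is_sent_end.
for token in doc:
    token.is_sent_start = token.i == 0 or doc[token.i - 1].text == "."

sentence_final = [token.i for token in doc if token.is_sent_end]

print(len(doc_matches), len(span_matches), sentence_final)
```

In a real pipeline the sentence boundaries would normally come from the parser or the sentencizer rather than from a hand-written rule.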

🔴 Bug fixes

Fix issue #5056: Introduce support for matching Span objects.
Fix issue #5086: Remove Vectors.from_glove.
Fix issue #5131: Improve data processing in named entity linking scripts.
Fix issue #5137: Fix passing of component configuration to component.
Fix issue #5144: Fix sentence comparison in test util.
Fix issue #5166: Fix handling of exclusive_classes in textcat ensemble.
Fix issue #5170: Set rank for new vector in Vocab.set_vector.
Fix issue #5181: Prevent None values in gold fields.
Fix issue #5191: Fix GoldParse initialization when the number of tokens has changed.
Fix issue #5193: Correctly pin cupy-cuda extra dependencies.
Fix issue #5200: Fix minor bugs in train CLI.
Fix issue #5216: Modify Vectors.resize to work with cupy.
Fix issue #5228: Raise error for inplace resize with new vector dimension.
Fix issue #5230: Fix unittest warnings when saving a model.
Fix issue #5257: Use inline flags in token_match patterns.
Fix issues #5278, #5359: Add missing __init__.py files to language data tests.
Fix issue #5281: Fix comparison predicate handling for !=.
Fix issue #5287: Normalize TokenC.sent_start values for Matcher.
Fix issue #5292: Fix typo in option name --n-save_every.
Fix issue #5303: Use max(uint64) for OOV lexeme rank.
Fix issue #5311: Fix alignment of cards on landing page.
Fix issue #5320: Fix most_similar for vectors with unused rows.
Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
Fix issue #5356: Fix bug in Span.similarity that could trigger TypeError.
Fix issue #5361: Fix problems with lower and whitespace in variants.
Fix issue #5373: Improve exceptions for 'd (would/had) in English.
Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
Fix issues #5393, #5458: Fix check for overlapping spans in noun chunks.
Fix issue #5429: Modify array type to accommodate OOV_RANK.
Fix issue #5430: Check that row is within bounds when adding vector.
Fix issue #5435: Use Token.sent_start for Span.sent.
Fix issue #5436: Fix ErrorsWithCodes().__class__ return value.
Fix issue #5450: Disallow merging 0-length spans.

📖 Documentation and examples

Fix various typos and inconsistencies.
Add new projects to the spaCy Universe.
Move bin/wiki_entity_linking scripts for Wikipedia to projects repo.

🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!

📦 Model packages (46)

Model	Language	Version	Vectors
zh_core_web_sm	Chinese	2.3.0	𐄂
zh_core_web_md	Chinese	2.3.0	✓
zh_core_web_lg	Chinese	2.3.0	✓
da_core_news_sm	Danish	2.3.0	𐄂
da_core_news_md	Danish	2.3.0	✓
da_core_news_lg	Danish	2.3.0	✓
nl_core_news_sm	Dutch	2.3.0	𐄂
nl_core_news_md	Dutch	2.3.0	✓
nl_core_news_lg	Dutch	2.3.0	✓
en_core_web_sm	English	2.3.0	𐄂
en_core_web_md	English	2.3.0	✓
en_core_web_lg	English	2.3.0	✓
fr_core_news_sm	French	2.3.0	𐄂
fr_core_news_md	French	2.3.0	✓
fr_core_news_lg	French	2.3.0	✓
de_core_news_sm	German	2.3.0	𐄂
de_core_news_md	German	2.3.0	✓
de_core_news_lg	German	2.3.0	✓
el_core_news_sm	Greek	2.3.0	𐄂
el_core_news_md	Greek	2.3.0	✓
el_core_news_lg	Greek	2.3.0	✓
it_core_news_sm	Italian	2.3.0	𐄂
it_core_news_md	Italian	2.3.0	✓
it_core_news_lg	Italian	2.3.0	✓
ja_core_news_sm	Japanese	2.3.0	𐄂
ja_core_news_md	Japanese	2.3.0	✓
ja_core_news_lg	Japanese	2.3.0	✓
lt_core_news_sm	Lithuanian	2.3.0	𐄂
lt_core_news_md	Lithuanian	2.3.0	✓
lt_core_news_lg	Lithuanian	2.3.0	✓
nb_core_news_sm	Norwegian Bokmål	2.3.0	𐄂
nb_core_news_md	Norwegian Bokmål	2.3.0	✓
nb_core_news_lg	Norwegian Bokmål	2.3.0	✓
pl_core_news_sm	Polish	2.3.0	𐄂
pl_core_news_md	Polish	2.3.0	✓
pl_core_news_lg	Polish	2.3.0	✓
pt_core_news_sm	Portuguese	2.3.0	𐄂
pt_core_news_md	Portuguese	2.3.0	✓
pt_core_news_lg	Portuguese	2.3.0	✓
ro_core_news_sm	Romanian	2.3.0	𐄂
ro_core_news_md	Romanian	2.3.0	✓
ro_core_news_lg	Romanian	2.3.0	✓
es_core_news_sm	Spanish	2.3.0	𐄂
es_core_news_md	Spanish	2.3.0	✓
es_core_news_lg	Spanish	2.3.0	✓
xx_ent_wiki_sm	Multi-language	2.3.0	𐄂

👥 Contributors

Thanks to @mabraham, @sloev, @pinealan, @pmbaumgartner, @Baciccin, @nlptechbook, @guerda, @Tiljander, @nikhilsaldanha, @tommilligan, @Jacse, @leicmi, @YohannesDatasci, @mirfan899, @koaning, @umarbutler, @chopeen, @paoloq, @thomasthiebaud, @sebastienharinck, @elben10, @laszabine, @Mlawrence95, @sabiqueqb, @punitvara, @michael-k, @louisguitton, @vondersam, @thoppe, @vishnupriyavr, @ilivans and @osori for the pull requests and contributions.

🙏 Special thanks to everyone who helped us develop and test the new models: @lixiepeng, @lingvisa and @howl-anderson (Chinese), @hvingelby (Danish), @hiroshi-matsuda-rit and @polm (Japanese), @ryszardtuora (Polish) and @avramandrei and @dumitrescustefan (Romanian).

Files

explosion/spaCy-v2.3.0.zip

Files (5.9 MB)

Name	Size
explosion/spaCy-v2.3.0.zip (md5:c49c792a655dc448f9fe2ca889c384f8)	5.9 MB

Additional details

Is supplement to: https://github.com/explosion/spaCy/tree/v2.3.0 (URL)

	All versions	This version
Views	26,074	926
Downloads	998	16
Data volume	19.7 GB	94.2 MB
