explosion/spaCy: v2.3.0: Models for Chinese, Danish, Japanese, Polish and Romanian, new updated models with vectors, faster loading, small API improvements & lots of bug fixes
Creators
- Ines Montani (1)
- Matthew Honnibal (1)
- Sofie Van Landeghem (2)
- Henning Peters
- Adriane Boyd
- Maxim Samsonov
- Jim Geovedi
- Jim Regan
- György Orosz (3)
- Paul O'Leary McCann (4)
- Søren Lind Kristiansen
- Duygu Altinok (5)
- Roman (6)
- Leander Fiedler
- Grégory Howard
- Explosion Bot (7)
- Sam Bozek
- Wannaphong Phatthiyaphaibun (8)
- Mark Amery
- Björn Böing (9)
- Pradeep Kumar Tippa
- Yohei Tamura (10)
- Leif Uwe Vogelsang
- Ramanan Balakrishnan (11)
- Vadim Mazaev
- GregDubbin
- jeannefukumaru
- Jens Dahl Møllerhøj (12)
- Avadh Patel (13)
- 1. Founder @explosion
- 2. Explosion & OxyKodit
- 3. LogMeIn, Meltwater
- 4. Cotonoha
- 5. German Autolabs
- 6. @kouchtv
- 7. @explosion
- 8. @PyThaiNLP
- 9. @codecentric
- 10. PKSHA Technology
- 11. @Semantics3
- 12. BotXO
- 13. SUNY Binghamton - Computer Science
Description
✨ New features and improvements

⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and to print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
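As a rough sketch of how the check could be scripted, the snippet below wraps the `python -m spacy` command line in two small helpers. The helper names `spacy_cli_command` and `validate_models` are made up for this illustration and are not part of spaCy; only the `validate` and `download` subcommands come from spaCy's CLI.

```python
import subprocess
import sys


def spacy_cli_command(*args):
    """Build a `python -m spacy ...` command for the current interpreter.

    Hypothetical helper for this example, not a spaCy function.
    """
    return [sys.executable, "-m", "spacy", *args]


def validate_models():
    """Run `spacy validate` and return its exit code.

    Requires spaCy to be installed in the current environment.
    """
    result = subprocess.run(spacy_cli_command("validate"), check=False)
    return result.returncode
```

Calling `validate_models()` prints the compatibility report to the terminal. A model flagged as outdated can then be fetched again with a command built the same way, for example `spacy_cli_command("download", "en_core_web_sm")`, where the model name is only an example.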
- NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
- NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
- NEW: Alpha support for Armenian, Gujarati and Malayalam.
- NEW: Lookup lemmatization for Polish.
- NEW: Allow `Matcher` to match on both `Doc` and `Span` objects.
- NEW: Add `Token.is_sent_end` property.
- Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
- Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
- Add support for `pkuseg` alongside `jieba` for Chinese.
- Switch from `fugashi` to `sudachipy` for Japanese.
- Improve punctuation used in sentencizer.
- Switch to new and more consistent alignment method in `gold.align`.
- Reduce stored lexemes data and move non-derivable features to `spacy-lookups-data`.
- Fix issue #5056: Introduce support for matching `Span` objects.
- Fix issue #5086: Remove `Vectors.from_glove`.
- Fix issue #5131: Improve data processing in named entity linking scripts.
- Fix issue #5137: Fix passing of component configuration to component.
- Fix issue #5144: Fix sentence comparison in test util.
- Fix issue #5166: Fix handling of `exclusive_classes` in textcat ensemble.
- Fix issue #5170: Set rank for new vector in `Vocab.set_vector`.
- Fix issue #5181: Prevent `None` values in gold fields.
- Fix issue #5191: Fix `GoldParse` initialization when the number of tokens has changed.
- Fix issue #5193: Correctly pin `cupy-cuda` extra dependencies.
- Fix issue #5200: Fix minor bugs in train CLI.
- Fix issue #5216: Modify `Vectors.resize` to work with `cupy`.
- Fix issue #5228: Raise error for inplace resize with new vector dimension.
- Fix issue #5230: Fix `unittest` warnings when saving a model.
- Fix issue #5257: Use inline flags in `token_match` patterns.
- Fix issue #5278, #5359: Add missing `__init__.py` files to language data tests.
- Fix issue #5281: Fix comparison predicate handling for `!=`.
- Fix issue #5287: Normalize `TokenC.sent_start` values for `Matcher`.
- Fix issue #5292: Fix typo in option name `--n-save_every`.
- Fix issue #5303: Use `max(uint64)` for OOV lexeme rank.
- Fix issue #5311: Fix alignment of cards on landing page.
- Fix issue #5320: Fix `most_similar` for vectors with unused rows.
- Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
- Fix issue #5356: Fix bug in `Span.similarity` that could trigger `TypeError`.
- Fix issue #5361: Fix problems with lower and whitespace in variants.
- Fix issue #5373: Improve exceptions for `'d` (would/had) in English.
- Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
- Fix issue #5393, #5458: Fix check for overlapping spans in noun chunks.
- Fix issue #5429: Modify array type to accommodate `OOV_RANK`.
- Fix issue #5430: Check that row is within bounds when adding vector.
- Fix issue #5435: Use `Token.sent_start` for `Span.sent`.
- Fix issue #5436: Fix `ErrorsWithCodes().__class__` return value.
- Fix issue #5450: Disallow merging 0-length spans.
- Fix various typos and inconsistencies.
- Add new projects to the spaCy Universe.
- Move `bin/wiki_entity_linking` scripts for Wikipedia to `projects` repo.
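The two API additions listed above, matching on `Span` objects and `Token.is_sent_end`, can be tried without downloading a model. The following is a minimal sketch, not taken from the release notes: it assumes a blank English pipeline, a hand-built `Doc`, manually set sentence boundaries, and the `matcher.add(key, patterns)` call form. The pattern name `GREETING` and the example words are arbitrary.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank pipeline is enough here: only the shared vocab is needed.
nlp = spacy.blank("en")
words = ["Hello", "world", ".", "Hello", "again", "."]
doc = Doc(nlp.vocab, words=words)

# Set sentence boundaries by hand so the example does not need a parser.
# Tokens 0 and 3 start a sentence; every other token explicitly does not.
for token in doc:
    token.is_sent_start = token.i in (0, 3)

matcher = Matcher(nlp.vocab)
matcher.add("GREETING", [[{"ORTH": "Hello"}]])

doc_matches = matcher(doc)       # match over the whole Doc
span_matches = matcher(doc[3:])  # match over a Span covering the second sentence

# Token.is_sent_end is True for the last token of each sentence.
sentence_final = [token.text for token in doc if token.is_sent_end]

print(len(doc_matches), len(span_matches), sentence_final)
```

Passing a `Span` restricts matching to the tokens inside it, which is why the second call finds only one of the two greetings.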
🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!

📦 Model packages (46)

Model | Language | Version | Vectors |
---|---|---|---|
zh_core_web_sm | Chinese | 2.3.0 | 𐄂 |
zh_core_web_md | Chinese | 2.3.0 | ✓ |
zh_core_web_lg | Chinese | 2.3.0 | ✓ |
da_core_news_sm | Danish | 2.3.0 | 𐄂 |
da_core_news_md | Danish | 2.3.0 | ✓ |
da_core_news_lg | Danish | 2.3.0 | ✓ |
nl_core_news_sm | Dutch | 2.3.0 | 𐄂 |
nl_core_news_md | Dutch | 2.3.0 | ✓ |
nl_core_news_lg | Dutch | 2.3.0 | ✓ |
en_core_web_sm | English | 2.3.0 | 𐄂 |
en_core_web_md | English | 2.3.0 | ✓ |
en_core_web_lg | English | 2.3.0 | ✓ |
fr_core_news_sm | French | 2.3.0 | 𐄂 |
fr_core_news_md | French | 2.3.0 | ✓ |
fr_core_news_lg | French | 2.3.0 | ✓ |
de_core_news_sm | German | 2.3.0 | 𐄂 |
de_core_news_md | German | 2.3.0 | ✓ |
de_core_news_lg | German | 2.3.0 | ✓ |
el_core_news_sm | Greek | 2.3.0 | 𐄂 |
el_core_news_md | Greek | 2.3.0 | ✓ |
el_core_news_lg | Greek | 2.3.0 | ✓ |
it_core_news_sm | Italian | 2.3.0 | 𐄂 |
it_core_news_md | Italian | 2.3.0 | ✓ |
it_core_news_lg | Italian | 2.3.0 | ✓ |
ja_core_news_sm | Japanese | 2.3.0 | 𐄂 |
ja_core_news_md | Japanese | 2.3.0 | ✓ |
ja_core_news_lg | Japanese | 2.3.0 | ✓ |
lt_core_news_sm | Lithuanian | 2.3.0 | 𐄂 |
lt_core_news_md | Lithuanian | 2.3.0 | ✓ |
lt_core_news_lg | Lithuanian | 2.3.0 | ✓ |
nb_core_news_sm | Norwegian Bokmål | 2.3.0 | 𐄂 |
nb_core_news_md | Norwegian Bokmål | 2.3.0 | ✓ |
nb_core_news_lg | Norwegian Bokmål | 2.3.0 | ✓ |
pl_core_news_sm | Polish | 2.3.0 | 𐄂 |
pl_core_news_md | Polish | 2.3.0 | ✓ |
pl_core_news_lg | Polish | 2.3.0 | ✓ |
pt_core_news_sm | Portuguese | 2.3.0 | 𐄂 |
pt_core_news_md | Portuguese | 2.3.0 | ✓ |
pt_core_news_lg | Portuguese | 2.3.0 | ✓ |
ro_core_news_sm | Romanian | 2.3.0 | 𐄂 |
ro_core_news_md | Romanian | 2.3.0 | ✓ |
ro_core_news_lg | Romanian | 2.3.0 | ✓ |
es_core_news_sm | Spanish | 2.3.0 | 𐄂 |
es_core_news_md | Spanish | 2.3.0 | ✓ |
es_core_news_lg | Spanish | 2.3.0 | ✓ |
xx_ent_wiki_sm | Multi-language | 2.3.0 | 𐄂 |
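The package names in the table follow spaCy's `lang_type_genre_size` naming pattern, and in this release the `sm` packages are the ones listed without vectors. A small sketch of how a script might read that information off a name is shown below. `ModelName` and `parse_model_name` are hypothetical helpers written for this illustration, and the vectors rule is inferred only from the table, not from any spaCy API.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Parts of a model package name such as "zh_core_web_sm" (illustrative only)."""

    lang: str   # language code, e.g. "zh", or "xx" for multi-language
    kind: str   # e.g. "core" or "ent"
    genre: str  # e.g. "web", "news" or "wiki"
    size: str   # "sm", "md" or "lg"

    @property
    def has_vectors(self) -> bool:
        # Per the table for this release: md and lg ship vectors, sm does not.
        return self.size in ("md", "lg")


def parse_model_name(name: str) -> ModelName:
    """Split a package name into its four underscore-separated parts."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"Unexpected model name: {name!r}")
    return ModelName(*parts)


for name in ("zh_core_web_lg", "xx_ent_wiki_sm"):
    parsed = parse_model_name(name)
    print(name, parsed.lang, parsed.size, parsed.has_vectors)
```

Checking `has_vectors` before calling similarity methods is one way a pipeline could pick an `md` or `lg` package when vectors are needed.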
👥 Contributors
Thanks to @mabraham, @sloev, @pinealan, @pmbaumgartner, @Baciccin, @nlptechbook, @guerda, @Tiljander, @nikhilsaldanha, @tommilligan, @Jacse, @leicmi, @YohannesDatasci, @mirfan899, @koaning, @umarbutler, @chopeen, @paoloq, @thomasthiebaud, @sebastienharinck, @elben10, @laszabine, @Mlawrence95, @sabiqueqb, @punitvara, @michael-k, @louisguitton, @vondersam, @thoppe, @vishnupriyavr, @ilivans and @osori for the pull requests and contributions.
🙏 Special thanks to everyone who helped us develop and test the new models: @lixiepeng, @lingvisa and @howl-anderson (Chinese), @hvingelby (Danish), @hiroshi-matsuda-rit and @polm (Japanese), @ryszardtuora (Polish) and @avramandrei and @dumitrescustefan (Romanian).
Files
Name | Size |
---|---|
explosion/spaCy-v2.3.0.zip (md5:c49c792a655dc448f9fe2ca889c384f8) | 5.9 MB |
Additional details
Related works
- Is supplement to: https://github.com/explosion/spaCy/tree/v2.3.0 (URL)