explosion/spaCy: v2.1.0a1: New models, joint word segmentation and parsing, better Matcher, bug fixes & more
Creators
- Matthew Honnibal [1]
- Ines Montani [1]
- Henning Peters [2]
- Maxim Samsonov
- Jim Geovedi
- Jim Regan
- György Orosz [3]
- Søren Lind Kristiansen
- Paul O'Leary McCann
- Duygu Altinok [4]
- Roman [5]
- Grégory Howard
- Alex [6]
- Kit [7]
- Sam Bozek
- Explosion Bot [8]
- Mark Amery
- Leif Uwe Vogelsang
- Pradeep Kumar Tippa
- GregDubbin
- Vadim Mazaev
- Jens Dahl Møllerhøj [9]
- wbwseeker
- Wannaphong Phatthiyaphaibun [10]
- Magnus Burton
- Yubing Dong (Tom) [11]
- thomasO
- Ramanan Balakrishnan [12]
- Avadh Patel [13]

Affiliations
1. Founder @explosion
2. RiseML
3. LogMeIn, Meltwater
4. 4Com
5. LinguaLeo
6. NSU
7. Founder @talecamp
8. @explosion
9. mollerhoj
10. @PyThaiNLP
11. Quora
12. @Semantics3
13. SUNY Binghamton - Computer Science
Description
🌙 This is an alpha pre-release of spaCy v2.1.0, available on pip as `spacy-nightly`. It's not intended for production use.
pip install -U spacy-nightly
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models; see below for details and benchmarks.
✨ New features and improvements
Tagger, Parser & NER
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Fix bugs in beam-search training objective.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v6.11, which defaults to single-thread with fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
CLI
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
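
The parser notes above mention Levenshtein alignment, which is used to line up two token sequences (for example, gold-standard tokens and the tokenizer's output) during training. The sketch below is not spaCy's implementation. It is a plain dynamic-programming version, under a hypothetical helper name `align_tokens`, that only illustrates what such an alignment produces when one side splits "New York" and the other does not.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def align_tokens(a: List[str], b: List[str]) -> Tuple[int, List[Pair]]:
    """Return (edit distance, alignment) between two token sequences.

    The alignment is a list of (i, j) index pairs into a and b; None marks
    a token that has no counterpart in the other sequence.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the edit distance between a[:i] and b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# Illustrative data only: one side keeps "New York" as two tokens.
gold = ["I", "live", "in", "New", "York"]
predicted = ["I", "live", "in", "NewYork"]
distance, alignment = align_tokens(gold, predicted)
print(distance, alignment)
```

This quadratic-time table is the textbook formulation. The release note only says that spaCy's alignment became faster, so the sketch should not be read as a description of how that speed-up was achieved.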
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix serialization of custom tokenizer if not all functions are defined.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
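
One behaviour change listed above that can affect existing code is the case-insensitive stop-word check (issue #1889). The sketch below does not use spaCy. It assumes a tiny made-up stop list and a hypothetical `is_stop` helper, and only shows the idea of comparing the lowercased form so that "The" and "OF" count as stop words too.

```python
# Tiny illustrative stop list; this is an assumption, not spaCy's actual list.
STOP_WORDS = {"the", "a", "an", "and", "of"}


def is_stop(text: str) -> bool:
    """Case-insensitive stop-word check: compare the lowercased form."""
    return text.lower() in STOP_WORDS


tokens = ["The", "history", "OF", "Rome"]
flags = [is_stop(token) for token in tokens]
print(list(zip(tokens, flags)))
```

Code that relied on capitalised or upper-case forms not being flagged as stop words may therefore see different results after upgrading.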
Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
---|---|---|---|---|---|---|---|---|
en_core_web_sm | English | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | ✗ | 28 MB |
en_core_web_md | English | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
en_core_web_lg | English | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
de_core_news_sm | German | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | ✗ | 26 MB |
de_core_news_md | German | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
es_core_news_sm | Spanish | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | ✗ | 28 MB |
es_core_news_md | Spanish | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
pt_core_news_sm | Portuguese | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | ✗ | 29 MB |
fr_core_news_sm | French | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 (1) | ✗ | 32 MB |
fr_core_news_md | French | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 (1) | ✓ | 100 MB |
it_core_news_sm | Italian | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | ✗ | 27 MB |
nl_core_news_sm | Dutch | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | ✗ | 27 MB |
el_core_news_sm | Greek | 2.1.0a0 | 84.5 | 81.0 | 95.0 | 73.5 | ✗ | 27 MB |
el_core_news_md | Greek | 2.1.0a0 | 87.7 | 84.7 | 96.3 | 80.2 | ✓ | 143 MB |
xx_ent_wiki_sm | Multi | 2.1.0a0 | - | - | - | 83.8 | ✗ | 9 MB |

1) We're currently investigating this, as the results are anomalously low.
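
The UAS and LAS figures reported for these models are attachment scores: the share of tokens whose predicted head is correct (unlabelled) and the share whose head and dependency label are both correct (labelled). The sketch below shows that arithmetic on a single made-up sentence. The helper name `attachment_scores` and the example arcs are assumptions for illustration, and details of the actual evaluation, such as how punctuation is treated, are not stated in these notes.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for one parsed sentence."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    # UAS counts tokens whose predicted head is correct.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS additionally requires the dependency label to match.
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


# "She ate fresh bread": (head index, label) per token; the root points to itself.
gold = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (1, "nmod")]
uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=100.0 LAS=75.0
```

Because LAS requires everything UAS requires plus a matching label, it can never exceed UAS, which is consistent with every row of the table.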
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
Thanks to @DuyguA, @giannisdaras, @mgogoulos and @louridas for the pull requests and contributions.
Files
Name | Size |
---|---|
explosion/spaCy-v2.1.0a1.zip (md5:000575f96a0f6db41b081a3284cedc1e) | 24.6 MB |
Additional details
Related works
- Is supplement to: https://github.com/explosion/spaCy/tree/v2.1.0a1