explosion/spaCy: v2.1.0a4: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
Authors/Creators
- Matthew Honnibal1
- Ines Montani1
- Henning Peters2
- Maxim Samsonov
- Jim Geovedi
- Jim Regan
- György Orosz3
- Søren Lind Kristiansen
- Paul O'Leary McCann
- Duygu Altinok4
- Roman5
- Grégory Howard
- Alex6
- Sam Bozek
- Explosion Bot7
- Mark Amery
- Leif Uwe Vogelsang
- Pradeep Kumar Tippa
- GregDubbin
- Wannaphong Phatthiyaphaibun8
- Vadim Mazaev
- Jens Dahl Møllerhøj9
- wbwseeker
- Magnus Burton
- mpuels10
- Tom Dong11
- thomasO
- Ramanan Balakrishnan12
- Avadh Patel13
- 1. Founder @explosion
- 2. RiseML
- 3. LogMeIn, Meltwater
- 4. 4Com
- 5. @kouchtv
- 6. chatme.ai
- 7. @explosion
- 8. @PyThaiNLP
- 9. mollerhoj
- 10. @yoyolabsio
- 11. Quora
- 12. @Semantics3
- 13. SUNY Binghamton - Computer Science
Description
This is an alpha pre-release of spaCy v2.1.0 and available on pip as `spacy-nightly`. It's not intended for production use.
pip install -U spacy-nightly
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models; see below for details and benchmarks.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- Improve loading time of `French` by ~30%.
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
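To make the `PhraseMatcher` item above more concrete, here is a minimal, library-free sketch of what matching on a token attribute other than the verbatim text means. It is not spaCy's implementation or API: `find_phrases` and its `key` parameter are hypothetical names, and tokens are plain strings rather than `Token` objects.

```python
def find_phrases(tokens, phrases, key=lambda token: token):
    """Return sorted (start, end) spans where a phrase matches on the chosen attribute."""
    # Project every token and every phrase onto the attribute we compare on.
    keyed_tokens = [key(token) for token in tokens]
    keyed_phrases = {tuple(key(token) for token in phrase) for phrase in phrases}
    matches = []
    for phrase in keyed_phrases:
        width = len(phrase)
        if width == 0:
            continue  # an empty phrase would match everywhere, so skip it
        for start in range(len(keyed_tokens) - width + 1):
            if tuple(keyed_tokens[start:start + width]) == phrase:
                matches.append((start, start + width))
    return sorted(matches)


tokens = ["I", "like", "BARACK", "OBAMA", "and", "Barack", "Obama"]
phrases = [["Barack", "Obama"]]

# Comparing the verbatim text only finds the exact-case occurrence.
exact_matches = find_phrases(tokens, phrases)
# Comparing lowercase forms also finds the upper-case occurrence.
lower_matches = find_phrases(tokens, phrases, key=str.lower)
```

Swapping the `key` function is the whole trick: the same scan works for any per-token attribute, which is roughly the flexibility the changelog entry describes for `LOWER`, `POS` or `TAG`.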
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
- Improved JSON(L) format for training (see #2928, #2932).
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'`; the name `'sbd'` is deprecated.
  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
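If you have scripts that still build `spacy train` calls with the old disable flags, a small helper like the following can translate them into the new `--pipeline` value. This is only a sketch: the default component order and the flag names `--no-tagger` and `--no-entities` are assumptions for illustration (only `--no-parser` appears in the notes above), and `legacy_flags_to_pipeline` is a hypothetical name.

```python
# Assumed default components and legacy flag names; adjust to your own setup.
DEFAULT_PIPELINE = ["tagger", "parser", "ner"]
LEGACY_FLAGS = {
    "--no-tagger": "tagger",
    "--no-parser": "parser",
    "--no-entities": "ner",
}


def legacy_flags_to_pipeline(flags):
    """Translate old-style disable flags into a comma-separated pipeline value."""
    # Collect the components that the legacy flags switch off; unknown flags are ignored.
    disabled = {LEGACY_FLAGS[flag] for flag in flags if flag in LEGACY_FLAGS}
    # Keep the remaining components in their default order.
    return ",".join(name for name in DEFAULT_PIPELINE if name not in disabled)


# Mirrors the migration example above: --no-parser becomes --pipeline tagger,ner
pipeline_value = legacy_flags_to_pipeline(["--no-parser"])
```

The returned string can then be passed after `--pipeline` when assembling the new command line.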
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
|---|---|---|---|---|---|---|---|---|
| en_core_web_sm | English | 2.1.0a5 | 91.2 | 89.3 | 96.9 | 85.6 | ✗ | 10 MB |
| en_core_web_md | English | 2.1.0a5 | 91.4 | 89.5 | 96.9 | 85.9 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a5 | 91.5 | 89.7 | 97.0 | 86.3 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a5 | 91.3 | 89.0 | 97.1 | 82.2 | ✗ | 10 MB |
| de_core_news_md | German | 2.1.0a5 | 92.0 | 90.0 | 97.4 | 82.7 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a5 | 89.9 | 86.7 | 96.6 | 87.3 | ✗ | 10 MB |
| es_core_news_md | Spanish | 2.1.0a5 | 90.6 | 87.7 | 97.0 | 88.0 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a5 | 89.3 | 86.0 | 78.5 | 87.8 | ✗ | 12 MB |
| fr_core_news_sm | French | 2.1.0a5 | 87.3 | 84.4 | 94.4 | 81.0 | ✗ | 14 MB |
| fr_core_news_md | French | 2.1.0a5 | 88.8 | 86.1 | 94.9 | 82.2 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a5 | 90.8 | 87.0 | 95.7 | 84.8 | ✗ | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a5 | 83.7 | 77.4 | 90.9 | 85.4 | ✗ | 10 MB |
| el_core_news_sm | Greek | 2.1.0a5 | 85.5 | 81.8 | 94.7 | 75.9 | ✗ | 10 MB |
| el_core_news_md | Greek | 2.1.0a5 | 88.5 | 85.2 | 96.8 | 80.01 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a5 | - | - | - | 82.8 | ✗ | 3 MB |
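The UAS and LAS figures reported for these models are attachment scores: the share of tokens whose predicted head is correct (unlabelled) and the share whose head and dependency label are both correct (labelled). The sketch below shows that arithmetic on a made-up sentence. It is not spaCy's evaluation code, it ignores details such as punctuation handling, and `attachment_scores` is a hypothetical helper.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for two aligned lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0, 0.0
    # Unlabelled: only the head index has to agree.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labelled: head index and dependency label both have to agree.
    labelled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labelled_hits / total


# Hypothetical four-token sentence: each entry is (head index, dependency label).
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (1, "det"), (1, "obj")]
uas, las = attachment_scores(gold, predicted)
```

Here three of four heads are right but only two of four head-and-label pairs, which is why LAS can never exceed UAS.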
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.
Files
| Name | Size |
|---|---|
| explosion/spaCy-v2.1.0a4.zip (md5:2029dc6b6a40873f98a22c817ca0ca30) | 29.0 MB |
Additional details
Related works
- Is supplement to: https://github.com/explosion/spaCy/tree/v2.1.0a4