explosion/spaCy: v2.1.0a3: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

Matthew Honnibal; Ines Montani; Matthew Honnibal; Henning Peters; Maxim Samsonov; Jim Geovedi; Jim Regan; György Orosz; Søren Lind Kristiansen; Paul O'Leary McCann; Duygu Altinok; Roman; Grégory Howard; Alex; Sam Bozek; Explosion Bot; Mark Amery; Leif Uwe Vogelsang; Pradeep Kumar Tippa; GregDubbin; Wannaphong Phatthiyaphaibun; Vadim Mazaev; Jens Dahl Møllerhøj; wbwseeker; Magnus Burton; mpuels; Yubing Dong (Tom); thomasO; Ramanan Balakrishnan; Avadh Patel

doi:10.5281/zenodo.1653208

Published November 28, 2018 | Version v2.1.0a3

Software Open

1. Founder @explosion
2. RiseML
3. LogMeIn, Meltwater
4. 4Com
5. @kouchtv
6. chatme.ai
7. @explosion
8. @PyThaiNLP
9. mollerhoj
10. @yoyolabsio
11. Quora
12. @Semantics3
13. SUNY Binghamton - Computer Science

🌙 This is an alpha pre-release of spaCy v2.1.0 and is available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis dependency, which provides faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser & NER

NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
Make parser, tagger and NER faster, through better hyperparameters.
Add EntityRecognizer.labels property.
Remove document length limit during training, by implementing faster Levenshtein alignment.
Use Thinc v7.0, which defaults to single-threaded execution with a fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
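
The Levenshtein alignment mentioned above pairs up the tokens of two different tokenizations of the same text. As a rough illustration only, here is a minimal pure-Python sketch of the textbook dynamic-programming version. It is not spaCy's implementation and it is not the faster variant the release refers to; the function name align_tokens and its return format are assumptions made for this example.

```python
def align_tokens(a, b):
    """Align two token sequences using Levenshtein edit operations.

    Returns (cost, pairs). Each pair is (i, j), an index into `a` and `b`;
    i or j is None where a token has no counterpart in the other sequence.
    This is the standard O(len(a) * len(b)) algorithm, shown for clarity.
    """
    n, m = len(a), len(b)
    # dist[i][j] holds the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(substitute, delete, insert)

    # Walk back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Hypothetical example: one tokenizer splits "can't", the other does not.
cost, pairs = align_tokens(["I", "ca", "n't", "go"], ["I", "can't", "go"])
print(cost, pairs)
```

In this example the over-segmented tokens "ca" and "n't" end up next to the single token "can't" in the alignment, which is the kind of mismatch the joint segmentation and parsing feature is meant to handle.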

Models & Language Data

NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
NEW: The English and German models are now available under the MIT license.
NEW: Statistical models for Greek.

CLI

NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
NEW: ud-train command to train and evaluate using the CoNLL 2017 shared task data.
Check if model is already installed before downloading it via spacy download.
Pass additional arguments of download command to pip to customise installation.
Improve train command by letting GoldCorpus stream data, instead of loading into memory.
Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
Add support for multi-task objectives to train command.
Add support for data-augmentation to train command.

Other

NEW: Doc.retokenize context manager for merging tokens more efficiently.
NEW: Add support for custom pipeline component factories via entry points (#2348).
NEW: Implement fastText vectors with subword features.
NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
Add warnings if .similarity method is called with empty vectors or without word vectors.
Improve the rule-based Matcher. Add a return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and an as_tuples keyword argument to add context to the Doc objects.
Make stop word checks via Token.is_stop and Lexeme.is_stop case-insensitive.
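
As a small illustration of the Doc.retokenize context manager listed above, the sketch below merges a span of tokens in place. It assumes spaCy v2.1 or later (for example spacy-nightly) is installed. The helper name merge_span is hypothetical, and a blank English pipeline is used so that no model download is needed.

```python
def merge_span(text, start, end):
    """Tokenize `text` and merge the tokens in [start, end) into one token.

    Returns the token texts after merging.
    """
    # Imported inside the function so this sketch can be loaded
    # even in an environment where spaCy is not installed.
    import spacy

    nlp = spacy.blank("en")  # tokenizer only, no statistical model
    doc = nlp(text)
    # All merges registered inside the block are applied together on exit.
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[start:end])
    return [token.text for token in doc]
```

Calling merge_span("New York is big", 0, 2) should return three tokens, the first being "New York". Because the merges are applied once when the block exits, several spans can be merged in the same block without recomputing the Doc after each one.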

🚧 Under construction
This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

Enhanced pattern API for rule-based Matcher (see #1971).

Improve tokenizer performance (see #1642).

Allow retokenizer to update Lexeme attributes on merge (see #2390).

md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

Improved JSON(L) format for training (see #2928, #2932).

Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).

Refactor CLI and add debug-data command to validate training data (see #2932).

🔴 Bug fixes

Fix issue #1487: Add Doc.retokenize() context manager.
Fix issue #1574: Make sure stop words are available in medium and large English models.
Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
Fix issue #1865: Correct licensing of it_core_news_sm model.
Fix issue #1889: Make stop words case-insensitive.
Fix issue #1903: Add relcl dependency label to symbols.
Fix issue #2014: Make Token.pos_ writeable.
Fix issue #2369: Respect pre-defined warning filters.
Fix issue #2482: Fix serialization when parser model is empty.
Fix issues #2671, #2675: Fix incorrect match ID on some patterns.
Fix issue #2772: Fix bug in sentence starts for non-projective parses.
Fix issue #2782: Make like_num work with prefixed numbers.
Fix serialization of custom tokenizer if not all functions are defined.
Fix bugs in beam-search training objective.
Fix problems with model pickling.

⚠️ Backwards incompatibilities

This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
If you've been training your own models, you'll need to retrain them with the new version.
While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

Model	Language	Version	UAS	LAS	POS	NER F	Vec	Size
en_core_web_sm	English	2.1.0a4	91.7	89.8	96.8	85.7	𐄂	12 MB
en_core_web_md	English	2.1.0a4	92.0	90.1	97.0	86.2	✓	93 MB
en_core_web_lg	English	2.1.0a4	92.1	90.3	97.0	86.5	✓	780 MB
de_core_news_sm	German	2.1.0a4	91.9	89.8	97.2	83.4	𐄂	12 MB
de_core_news_md	German	2.1.0a4	91.3	90.5	97.4	83.6	✓	212 MB
es_core_news_sm	Spanish	2.1.0a4	90.1	87.1	96.8	89.3	𐄂	12 MB
es_core_news_md	Spanish	2.1.0a4	90.7	87.8	97.1	89.4	✓	72 MB
pt_core_news_sm	Portuguese	2.1.0a4	89.2	85.8	79.8	82.4	𐄂	14 MB
fr_core_news_sm	French	2.1.0a4	87.2	84.0	94.4	67.0 (1)	𐄂	16 MB
fr_core_news_md	French	2.1.0a4	88.8	86.0	94.9	70.0 (1)	✓	84 MB
it_core_news_sm	Italian	2.1.0a4	90.6	87.0	96.0	81.7	𐄂	12 MB
nl_core_news_sm	Dutch	2.1.0a4	83.1	77.2	91.3	87.3	𐄂	12 MB
el_core_news_sm	Greek	2.1.0a4	84.2	80.4	94.6	71.5	𐄂	12 MB
el_core_news_md	Greek	2.1.0a4	87.5	84.1	96.4	78.3	✓	128 MB
xx_ent_wiki_sm	Multi	2.1.0a4	-	-	-	83.2	𐄂	4 MB

1) We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
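
To make the UAS and LAS columns concrete, here is a minimal sketch of how the two scores are commonly defined: UAS counts tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The function name attachment_scores and the toy data are assumptions for illustration. This is a simplification, not spaCy's scorer, and it does not apply conventions such as excluding punctuation.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    `gold` and `predicted` are equal-length lists with one
    (head_index, dependency_label) tuple per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    # Unlabelled: only the head index has to match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labelled: head index and dependency label both have to match.
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


# Toy four-token sentence: one wrong label and one wrong head.
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 75.0 50.0
```

Three of the four heads are correct, but only two tokens have both the right head and the right label, so LAS can never exceed UAS.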

📖 Documentation and examples

Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas and @skrcode for the pull requests and contributions.

Files

Name	Size
explosion/spaCy-v2.1.0a3.zip md5:4f092f8435bea155558c43b5516b881e	24.6 MB

Additional details

Is supplement to: https://github.com/explosion/spaCy/tree/v2.1.0a3 (URL)

	All versions	This version
Views	42,772	540
Downloads	2,711	21
Data volume	45.8 GB	566.0 MB
