explosion/spaCy: v2.1.0a5: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

Matthew Honnibal; Ines Montani; Matthew Honnibal; Henning Peters; Maxim Samsonov; Jim Geovedi; Jim Regan; György Orosz; Søren Lind Kristiansen; Paul O'Leary McCann; Duygu Altinok; Roman; Grégory Howard; Alex; Sam Bozek; Explosion Bot; Mark Amery; Leif Uwe Vogelsang; Pradeep Kumar Tippa; GregDubbin; Wannaphong Phatthiyaphaibun; Vadim Mazaev; Jens Dahl Møllerhøj; wbwseeker; Magnus Burton; mpuels; Tom Dong; thomasO; Ramanan Balakrishnan; Avadh Patel

doi:10.5281/zenodo.2532212

Published January 5, 2019 | Version v2.1.0a5

Software Open


1. Founder @explosion
2. RiseML
3. LogMeIn, Meltwater
4. 4Com
5. @kouchtv
6. ChatMe.ai
7. @explosion
8. @PyThaiNLP
9. mollerhoj
10. @yoyolabsio
11. Quora
12. @Semantics3
13. SUNY Binghamton - Computer Science

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
Make parser, tagger and NER faster, through better hyperparameters.
Make TextCategorizer default to a simpler, GPU-friendly model.
Add EntityRecognizer.labels property.
Remove document length limit during training, by implementing faster Levenshtein alignment.
Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
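
The Levenshtein alignment mentioned in the list above compares two tokenizations of the same text. The block below is a minimal sketch of the underlying edit distance over token sequences, keeping only two rows of the dynamic-programming table in memory. It is not spaCy's implementation; the function name `levenshtein` and the sample tokens are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences, using two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        current = [i]
        for j, tok_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (tok_a != tok_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Hypothetical gold-standard and predicted tokenizations of one sentence.
gold = ["New", "York", "is", "big", "."]
predicted = ["New", "York", "is", "big."]
distance = levenshtein(gold, predicted)
```

Because only two rows are held at a time, memory grows with the length of the shorter sequence rather than with the product of both lengths.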

Models & Language Data

NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
NEW: The English and German models are now available under the MIT license.
NEW: Statistical models for Greek.
Improve loading time of French by ~30%.

CLI

NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
Check if model is already installed before downloading it via spacy download.
Pass additional arguments of download command to pip to customise installation.
Improve train command by letting GoldCorpus stream data, instead of loading into memory.
Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
Add support for multi-task objectives to train command.
Add support for data-augmentation to train command.
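
Streaming training data, as described for the train command above, means yielding examples one at a time instead of building the full list first. The following is a rough sketch of that idea with a plain generator over JSON lines. It does not reproduce GoldCorpus; the helper `stream_examples` and the `text` field are hypothetical.

```python
import io
import json


def stream_examples(file_like):
    """Yield one parsed example per non-empty line instead of building a list."""
    for line in file_like:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical JSONL data; the field name is illustrative only.
data = io.StringIO('{"text": "first"}\n\n{"text": "second"}\n')
examples = stream_examples(data)
first = next(examples)  # only the first line has been parsed at this point
remaining = [example["text"] for example in examples]
```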

Other

NEW: Doc.retokenize context manager for merging tokens more efficiently.
NEW: Add support for custom pipeline component factories via entry points (#2348).
NEW: Implement fastText vectors with subword features.
NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
Add warnings if .similarity method is called with empty vectors or without word vectors.
Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
Refactor CLI and add debug-data command to validate training data (see #2932).
Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
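
As an illustration of the Doc.retokenize context manager listed above, here is a short sketch that merges two tokens into one. It assumes spaCy is installed and uses a blank English pipeline, so no model download is needed; the example sentence is arbitrary.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no statistical model required
doc = nlp("New York is big")

# Merges are collected inside the block and applied when it exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

tokens = [token.text for token in doc]
```

Queuing several merges in one block lets them be applied together, which is where the efficiency gain over merging one span at a time would come from.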

🚧 Under construction
This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

Enhanced pattern API for rule-based Matcher (see #1971).

Improve tokenizer performance (see #1642).

Allow retokenizer to update Lexeme attributes on merge (see #2390).

md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

Improved JSON(L) format for training (see #2928, #2932).

🔴 Bug fixes

Fix issue #1487: Add Doc.retokenize() context manager.
Fix issue #1574: Make sure stop words are available in medium and large English models.
Fix issue #1585: Prevent parser from predicting unseen classes.
Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
Fix issue #1782, #2343: Fix training on GPU.
Fix issue #1816: Allow custom Language subclasses via entry points.
Fix issue #1865: Correct licensing of it_core_news_sm model.
Fix issue #1889: Make stop words case-insensitive.
Fix issue #1903: Add relcl dependency label to symbols.
Fix issue #2014: Make Token.pos_ writeable.
Fix issue #2369: Respect pre-defined warning filters.
Fix issue #2482: Fix serialization when parser model is empty.
Fix issue #2648: Fix KeyError in Vectors.most_similar.
Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
Fix issue #2772: Fix bug in sentence starts for non-projective parses.
Fix issue #2779: Fix handling of pre-set entities.
Fix issue #2782: Make like_num work with prefixed numbers.
Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
Fix issue #2871: Fix vectors for reserved words.
Fix issue #3027: Allow Span to take unicode value for label argument.
Fix issue #3048: Raise better errors for uninitialized pipeline components.
Fix serialization of custom tokenizer if not all functions are defined.
Fix bugs in beam-search training objective.
Fix problems with model pickling.
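
The fix for #2870 above can be pictured as a constraint on which BILUO actions are allowed for a token. The sketch below only illustrates that rule and is not the entity recognizer's actual code; `allowed_actions` and the constants are hypothetical names.

```python
BILUO_ACTIONS = ("B", "I", "L", "U", "O")
# Actions the release notes say may no longer be predicted for whitespace.
FORBIDDEN_ON_WHITESPACE = {"B", "L", "U"}


def allowed_actions(token_text):
    """Return the BILUO actions permitted for a token under the whitespace rule."""
    if token_text.isspace():
        return tuple(a for a in BILUO_ACTIONS if a not in FORBIDDEN_ON_WHITESPACE)
    return BILUO_ACTIONS


whitespace_actions = allowed_actions(" ")
word_actions = allowed_actions("Berlin")
```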

⚠️ Backwards incompatibilities

This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
If you've been training your own models, you'll need to retrain them with the new version.
While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.
```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```
The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
```diff
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
```
Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en_core_web_sm | English | 2.1.0a5 | 91.2 | 89.3 | 96.9 | 85.6 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a5 | 91.4 | 89.5 | 96.9 | 85.9 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a5 | 91.5 | 89.7 | 97.0 | 86.3 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a5 | 91.3 | 89.0 | 97.1 | 82.2 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a5 | 92.0 | 90.0 | 97.4 | 82.7 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a5 | 89.9 | 86.7 | 96.6 | 87.3 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a5 | 90.6 | 87.7 | 97.0 | 88.0 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a5 | 89.3 | 86.0 | 78.5 | 87.8 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a5 | 87.3 | 84.4 | 94.4 | 81.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a5 | 88.8 | 86.1 | 94.9 | 82.2 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a5 | 90.8 | 87.0 | 95.7 | 84.8 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a5 | 83.7 | 77.4 | 90.9 | 85.4 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a5 | 85.5 | 81.8 | 94.7 | 75.9 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a5 | 88.5 | 85.2 | 96.8 | 80.01 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a5 | - | - | - | 82.8 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
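
To make the UAS and LAS columns concrete, the sketch below computes both scores for a toy sentence. It assumes each token is represented as a (head index, dependency label) pair and that every token is scored; the evaluation behind the table may differ in details such as punctuation handling. The name `attachment_scores` is hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for aligned (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one token is required")
    # UAS counts correct heads; LAS also requires the correct label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labelled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labelled_hits / total


# Toy annotations: one wrong label (token 3) and one wrong head (token 4).
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, predicted)
```

Here three of four heads are correct but only two tokens have both head and label right, so LAS can never exceed UAS.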

📖 Documentation and examples

Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.

Files

explosion/spaCy-v2.1.0a5.zip (29.0 MB, md5:8ad8d7a04104437ac6ca300ae6637469)

Additional details

Is supplement to: https://github.com/explosion/spaCy/tree/v2.1.0a5 (URL)

