explosion/spaCy: v2.1.0a1: New models, joint word segmentation and parsing, better Matcher, bug fixes & more

doi:10.5281/zenodo.1346301

Published August 16, 2018 | Version v2.1.0a1

Software Open

1. Founder @explosion
2. RiseML
3. LogMeIn, Meltwater
4. 4Com
5. LinguaLeo
6. NSU
7. Founder @talecamp
8. @explosion
9. mollerhoj
10. @PyThaiNLP
11. Quora
12. @Semantics3
13. SUNY Binghamton - Computer Science

🌙 This is an alpha pre-release of spaCy v2.1.0, available on pip as spacy-nightly. It's not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

✨ New features and improvements

Tagger, Parser & NER

NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
Make parser, tagger and NER faster, through better hyperparameters.
Fix bugs in beam-search training objective.
Remove document length limit during training, by implementing faster Levenshtein alignment.
Use Thinc v6.11, which defaults to single-threaded execution with a fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
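To illustrate the alignment step mentioned above, here is a minimal pure-Python sketch of Levenshtein alignment between two tokenizations of the same text. It is not spaCy's implementation (which is optimised and not shown in these notes); the function name `levenshtein_alignment` and the example tokens are assumptions made for illustration only.

```python
def levenshtein_alignment(source, target):
    """Return (distance, alignment) for two token sequences.

    alignment is a list of (source_index, target_index) pairs, where
    None marks a token that has no counterpart on the other side.
    Hypothetical helper for illustration; not part of spaCy's API.
    """
    n, m = len(source), len(target)
    # cost[i][j] is the edit distance between source[:i] and target[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            delete = cost[i - 1][j] + 1
            insert = cost[i][j - 1] + 1
            cost[i][j] = min(substitute, delete, insert)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            alignment.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            alignment.append((i - 1, None))
            i -= 1
        else:
            alignment.append((None, j - 1))
            j -= 1
    alignment.reverse()
    return cost[n][m], alignment


# Example: the second tokenization contains one extra token.
distance, alignment = levenshtein_alignment(
    ["New", "York", "City"], ["New", "York", "-", "City"]
)
print(distance)   # 1
print(alignment)  # [(0, 0), (1, 1), (None, 2), (2, 3)]
```

This table-based version needs time and memory proportional to the product of the two sequence lengths, which is one reason a faster alignment matters for long documents.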

Models & Language Data

NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
NEW: The English and German models are now available under the MIT license.
NEW: Statistical models for Greek.

CLI

NEW: ud-train command to train and evaluate using the CoNLL 2017 shared task data.
Check if model is already installed before downloading it via spacy download.
Pass additional arguments of download command to pip to customise installation.
Improve train command by letting GoldCorpus stream data, instead of loading into memory.
Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
Add support for multi-task objectives to train command.
Add support for data-augmentation to train command.

Other

NEW: Doc.retokenize context manager for merging tokens more efficiently.
NEW: Add support for custom pipeline component factories via entry points (#2348).
NEW: Implement fastText vectors with subword features.
NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
Add warnings if .similarity method is called with empty vectors or without word vectors.
Improve the rule-based Matcher. Matcher.pipe now accepts a return_matches keyword argument to yield (doc, matches) tuples instead of only Doc objects, and an as_tuples keyword argument to add context to the Doc objects.
Make stop-word checks via Token.is_stop and Lexeme.is_stop case-insensitive.
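The warning for empty vectors in the list above exists because cosine similarity is undefined when either vector has zero length. The following pure-Python sketch shows the idea; it is not spaCy's code, and the function name and warning text are hypothetical.

```python
import math
import warnings


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity that warns instead of dividing by zero.

    Hypothetical helper for illustration; spaCy's actual warning
    message and return value may differ.
    """
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(x * x for x in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty or all-zero vector carries no direction, so the
        # score would be meaningless. Warn and fall back to 0.0.
        warnings.warn(
            "Similarity is undefined for empty or all-zero vectors.",
            UserWarning,
        )
        return 0.0
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # 0.0, after emitting a UserWarning
```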

🚧 Under construction
This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

Enhanced pattern API for rule-based Matcher (see #1971).

Improve tokenizer performance (see #1642).

Allow retokenizer to update Lexeme attributes on merge (see #2390).

md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

🔴 Bug fixes

Fix issue #1487: Add Doc.retokenize() context manager.
Fix issue #1574: Make sure stop words are available in medium and large English models.
Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
Fix issue #1865: Correct licensing of it_core_news_sm model.
Fix issue #1889: Make stop words case-insensitive.
Fix issue #1903: Add relcl dependency label to symbols.
Fix issue #2014: Make Token.pos_ writeable.
Fix issue #2369: Respect pre-defined warning filters.
Fix issues #2671 and #2675: Fix incorrect match ID on some patterns.
Fix serialization of custom tokenizer if not all functions are defined.

⚠️ Backwards incompatibilities

This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
If you've been training your own models, you'll need to retrain them with the new version.
While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

Model	Language	Version	UAS	LAS	POS	NER F	Vec	Size
en_core_web_sm	English	2.1.0a0	91.8	90.0	96.8	85.6	𐄂	28 MB
en_core_web_md	English	2.1.0a0	92.0	90.2	97.0	86.2	✓	107 MB
en_core_web_lg	English	2.1.0a0	92.1	90.3	97.0	86.2	✓	805 MB
de_core_news_sm	German	2.1.0a0	92.0	90.1	97.2	83.8	𐄂	26 MB
de_core_news_md	German	2.1.0a0	92.4	90.7	97.4	84.2	✓	228 MB
es_core_news_sm	Spanish	2.1.0a0	90.1	87.2	96.9	89.4	𐄂	28 MB
es_core_news_md	Spanish	2.1.0a0	90.7	88.0	97.2	89.5	✓	88 MB
pt_core_news_sm	Portuguese	2.1.0a0	89.4	86.3	80.1	82.7	𐄂	29 MB
fr_core_news_sm	French	2.1.0a0	88.8	85.7	94.4	67.3 (1)	𐄂	32 MB
fr_core_news_md	French	2.1.0a0	88.7	86.0	95.0	70.4 (1)	✓	100 MB
it_core_news_sm	Italian	2.1.0a0	90.7	87.1	96.1	81.3	𐄂	27 MB
nl_core_news_sm	Dutch	2.1.0a0	83.5	77.6	91.5	87.3	𐄂	27 MB
el_core_news_sm	Greek	2.1.0a0	84.5	81.0	95.0	73.5	𐄂	27 MB
el_core_news_md	Greek	2.1.0a0	87.7	84.7	96.3	80.2	✓	143 MB
xx_ent_wiki_sm	Multi	2.1.0a0	-	-	-	83.8	𐄂	9 MB

1) We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
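As a rough illustration of how the two parser metrics relate, the sketch below computes UAS and LAS from gold and predicted (head, label) pairs. The function name and the toy data are assumptions; real evaluation scripts often add details such as skipping punctuation, which are not shown here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold and predicted are lists with one (head_index, dep_label)
    pair per token. Hypothetical helper for illustration only.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    total = len(gold)
    # UAS counts tokens whose head is correct, regardless of the label.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS additionally requires the dependency label to match.
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 75.0 50.0
```

Because a labelled match requires the head to be correct as well, LAS can never exceed UAS, which is consistent with every row of the table.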

📖 Documentation and examples

Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos and @louridas for the pull requests and contributions.

Files

Name	Size
explosion/spaCy-v2.1.0a1.zip md5:000575f96a0f6db41b081a3284cedc1e	24.6 MB

Additional details

Is supplement to: https://github.com/explosion/spaCy/tree/v2.1.0a1 (URL)
