Published July 12, 2022 | Version v3.4.0
Software Open

explosion/spaCy: v3.4.0: Updated types, speed improvements and pipelines for Croatian

Description

✨ New features and improvements

  • Support for mypy 0.950+ and pydantic v1.9 (#10786).
  • Prebuilt linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
  • Min/max {n,m} operator for Matcher patterns (#10981).
  • Language updates:
    • Improve tokenization for Cyrillic combining diacritics (#10837).
    • Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
  • Improved speed of vector lookups (#10992).
  • For the parser, use C saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
  • Improved speed of Example.get_aligned_parse and Example.get_aligned (#10952).
  • Improved speed of StringStore lookups (#10938).
  • Updated spacy project clone to try both main and master branches by default (#10843).
  • Added confidence threshold for named entity linker (#11016).
  • Improved handling of Typer optional default values for init_config_cli (#10788).
  • Added cycle detection in parser projectivization methods (#10877).
  • Added counts for NER labels in debug data (#10960).
  • Support for adding NVTX ranges to TrainablePipe components (#10965).
  • Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).
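
The min/max operator listed above puts a lower and upper bound on how often a token pattern may repeat. The sketch below is a plain-Python illustration of that idea, not spaCy's implementation: `parse_min_max_op` and `allows` are hypothetical helpers, and the pattern is only written in the Matcher's dictionary style as data. The additional forms it accepts (`{n}`, `{n,}`, `{,m}`) follow the variants described in spaCy's Matcher documentation and should be read as an assumption here, since the note above only names `{n,m}`.

```python
import re
from typing import Optional, Tuple

# Hypothetical helpers for illustration only; they are not part of spaCy.
_MIN_MAX_OP = re.compile(r"^\{(\d*)(,?)(\d*)\}$")


def parse_min_max_op(op: str) -> Tuple[int, Optional[int]]:
    """Return (minimum, maximum) for a "{n,m}"-style operator string.

    maximum is None when the operator has no upper bound.
    """
    match = _MIN_MAX_OP.match(op)
    if match is None:
        raise ValueError(f"not a min/max operator: {op!r}")
    low, comma, high = match.groups()
    if not comma:
        # "{n}": exactly n repetitions.
        if not low:
            raise ValueError(f"empty operator: {op!r}")
        return int(low), int(low)
    if not low and not high:
        raise ValueError(f"operator needs at least one bound: {op!r}")
    minimum = int(low) if low else 0          # "{,m}" starts at zero
    maximum = int(high) if high else None     # "{n,}" has no upper bound
    if maximum is not None and minimum > maximum:
        raise ValueError(f"minimum exceeds maximum: {op!r}")
    return minimum, maximum


def allows(op: str, repetitions: int) -> bool:
    """Check whether a repetition count satisfies the operator."""
    minimum, maximum = parse_min_max_op(op)
    return repetitions >= minimum and (maximum is None or repetitions <= maximum)


# A Matcher-style token pattern written as plain data:
# one to three occurrences of "very", followed by "good".
pattern = [{"LOWER": "very", "OP": "{1,3}"}, {"LOWER": "good"}]

very_op = pattern[0]["OP"]
print(parse_min_max_op(very_op))  # (1, 3)
print(allows(very_op, 2))         # True
print(allows(very_op, 4))         # False
```

With spaCy v3.4 or later installed, a pattern shaped like `pattern` is what would be passed to the Matcher; the helpers here only make the bounds explicit.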

📦 Trained pipelines updates

We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.

Package          UPOS  Parser LAS  NER F
hr_core_news_sm  96.6  77.5        76.1
hr_core_news_md  97.3  80.1        81.8
hr_core_news_lg  97.5  80.4        83.0

🙏 Special thanks to @gtoffoli for help with the new pipelines!

The English pipelines have new word vectors:

Package         Model Version  TAG   Parser LAS  NER F
en_core_web_md  v3.3.0         97.3  90.1        84.6
en_core_web_md  v3.4.0         97.2  90.3        85.5
en_core_web_lg  v3.3.0         97.4  90.1        85.3
en_core_web_lg  v3.4.0         97.3  90.2        85.6

All CNN pipelines have been extended to add whitespace augmentation.

🔴 Bug fixes

  • Fix issue #10960: Support hyphens in NER labels.
  • Fix issue #10994: Fix horizontal spacing for spans in displaCy.
  • Fix issue #11013: Check for any token with a vector in Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
  • Fix issue #11056: Don't use get_array_module in textcat.
  • Fix issue #11092: Fix vertical alignment for spans in displaCy.

🚀 Notes about upgrading from v3.3

  • Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
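
The behaviour change described above can be pictured with a toy model. The classes below (`ToyVocab`, `ToyDoc`) are hypothetical stand-ins, not spaCy's `Vocab` or `Doc`, and the sketch deliberately ignores anything else the real property may take into account. It contrasts only the two checks named in the note: "the vocab contains vectors" versus "at least one token in the doc has a vector".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyVocab:
    # Maps a word to its vector; words without an entry have no vector.
    vectors: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class ToyDoc:
    words: List[str]
    vocab: ToyVocab

    def has_vector_v33_style(self) -> bool:
        # Older check as described in the note: only asks whether the
        # vocab contains any vectors at all.
        return len(self.vocab.vectors) > 0

    def has_vector_v34_style(self) -> bool:
        # Newer check as described in the note: True if at least one
        # token in this doc has a vector.
        return any(word in self.vocab.vectors for word in self.words)


vocab = ToyVocab(vectors={"apple": [0.1, 0.2]})
unknown_doc = ToyDoc(words=["zebra", "quokka"], vocab=vocab)
mixed_doc = ToyDoc(words=["apple", "zebra"], vocab=vocab)

print(unknown_doc.has_vector_v33_style())  # True: the vocab has vectors
print(unknown_doc.has_vector_v34_style())  # False: no token has one
print(mixed_doc.has_vector_v34_style())    # True: "apple" has a vector
```

Code that used `Doc.has_vector` as a proxy for "this pipeline ships vectors" may therefore see different results after upgrading, for example on documents made up entirely of out-of-vocabulary tokens.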

📖 Documentation and examples

  • spaCy universe additions:
    • Aim-spacy: An Aim-based spaCy experiment tracker.
    • Asent: Fast, flexible and transparent sentiment analysis.
    • spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
    • spacy-report: Generates interactive reports for spaCy models.

👥 Contributors

@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere

Files

explosion/spaCy-v3.4.0.zip (11.0 MB)
md5:bbb77b802250927d792f87248dd48020