There is a newer version of this record available.

Software Open Access

explosion/spaCy: v2.0.11: Alpha Vietnamese support, fixes to vectors, improved errors and more

Matthew Honnibal; Ines Montani; Henning Peters; Maxim Samsonov; Jim Geovedi; Jim Regan; György Orosz; Søren Lind Kristiansen; Roman; Duygu Altinok; Paul O'Leary McCann; Grégory Howard; Alex; Kit; Sam Bozek; Explosion Bot; Mark Amery; Leif Uwe Vogelsang; GregDubbin; Vadim Mazaev; Pradeep Kumar Tippa; wbwseeker; Wannaphong Phatthiyaphaibun; Magnus Burton; mpuels; Yubing Dong (Tom); thomasO; Ramanan Balakrishnan; Avadh Patel

📊 Help us improve spaCy and take the User Survey 2018!

✨ New features and improvements
  • NEW: Alpha Vietnamese support with tokenization via Pyvi.
  • NEW: Improved system for error messages and warnings. Errors now have unique error codes and are referenced in one place, and all unspecified asserts have been replaced with descriptive errors. See #2163 for implementation details, and let us know if you have any suggestions for errors and warnings in #2164!
  • Improve language data for Polish.
  • Tidy up dependencies and drop six, html5lib, ftfy and requests.
  • Improve efficiency (and potentially accuracy) of beam-search training, by randomly using greedy updates for some sentences. This can be controlled by changing the beam_update_prob entry in nlp.parser.cfg. The default value is 0.5, so 50% of beam updates will be done as greedy updates.
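The mixing of greedy and beam updates described in the last item above can be illustrated with a small standalone sketch. This is not spaCy's training code: the function name choose_update_modes and the seed argument are hypothetical, and the sketch assumes, following the wording of the note, that beam_update_prob is the chance that a given sentence receives a greedy update.

```python
import random


def choose_update_modes(n_sentences, beam_update_prob=0.5, seed=0):
    """Pick 'greedy' or 'beam' for each sentence (hypothetical helper).

    Assumption: beam_update_prob is read as the probability that a
    sentence is trained with a greedy update instead of a beam update.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        "greedy" if rng.random() < beam_update_prob else "beam"
        for _ in range(n_sentences)
    ]


modes = choose_update_modes(10000)
greedy_share = modes.count("greedy") / len(modes)
print("share of greedy updates: %.3f" % greedy_share)
```

With the default of 0.5, roughly half of the sentences land in each mode. In spaCy itself, the note states that the value is read from the beam_update_prob entry in nlp.parser.cfg.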
🔴 Bug fixes
  • Fix issue #1554, #1752, #2159: Fix Token.ent_iob after Doc.merge(), and ensure consistency in Doc.ents.
  • Fix issue #1660: Fix loading of multiple vector models.
  • Fix issue #1967: Allow entity types with dashes.
  • Fix issue #2032: Fix accidentally quadratic runtime in Vocab.set_vector.
  • Fix issue #2050: Correct mistakes in Italian lemmatizer data.
  • Fix issue #2073: Make Token.set_extension work as expected.
  • Fix issue #2100, #2151, #2181: Drop six and html5lib and prevent dependency conflict with TensorFlow / Keras.
  • Fix issue #2101: Improve error message if token text is empty string.
  • Fix issue #2121: Fix Language.to_bytes and pickling in Thinc.
  • Fix issue #2156: Fix hashtag example in Matcher docs.
  • Fix issue #2177: Don't raise error in set_extension if getter and setter are specified or if default=None, and add error if setter is specified with no getter.
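The argument rules from the fix for #2177 can be sketched as a small standalone check. This is an illustration rather than spaCy's implementation: the function name check_extension_args and the error messages are hypothetical, and the "exactly one of default, method or getter" rule is an assumption about how the remaining cases are handled.

```python
def check_extension_args(**kwargs):
    """Validate keyword arguments for a set_extension-style call.

    Hypothetical helper that mirrors the rules in the release note.
    """
    getter = kwargs.get("getter")
    setter = kwargs.get("setter")
    method = kwargs.get("method")

    # Rule from the note: a setter without a getter is an error.
    if setter is not None and getter is None:
        raise ValueError("setter specified without a getter")

    # Test for the presence of the 'default' key, not its value,
    # so that default=None still counts as a supplied default.
    provided = ["default" in kwargs, method is not None, getter is not None]

    # Assumption: exactly one of default, method or getter is expected.
    if sum(provided) != 1:
        raise ValueError("specify exactly one of default, method or getter")
    return True


check_extension_args(default=None)              # accepted
check_extension_args(getter=len, setter=print)  # accepted
```

Checking for the key instead of the value is what lets default=None pass, and a getter paired with a setter still counts as a single option.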
📖 Documentation and examples
👥 Contributors

Thanks to @jimregan, @justindujardin, @trungtv, @katrinleinweber and @skrcode for the pull requests and contributions.
