cltk: v0.1.29
Creators
- 1. Universidad de Buenos Aires
- 2. Archimedes Digital
- 3. Gitter
- 4. Columbia University
Description
This release adds basic Word2Vec support and introduces Greek and Latin Word2Vec models (https://github.com/cltk/latin_word2vec_cltk and https://github.com/cltk/greek_word2vec_cltk). The key functionality is a keyword expander for use when querying the TLG and PHI5 corpora.
From the docs:
Word2Vec is a vector space model that is especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words that appear in similar contexts (something akin to synonyms; think of them as lexical clusters).
The CLTK repository contains pre-trained Word2Vec models for Latin (import as latin_word2vec_cltk), one lemmatized and the other not. They were trained on the PHI5 corpus. To train your own, see the README at the Latin Word2Vec repository.
One of the most useful simple features of Word2Vec is as a keyword expander. Keyword expansion takes a query term, finds words similar to it, and searches for those as well.
The threshold parameter sets the minimum closeness between the query term and its neighboring words. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.
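The idea behind keyword expansion can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the CLTK implementation: the three-dimensional vectors, the sample sentences, and the helper names (`strip_regex`, `expand_keyword`, `search`) are all hypothetical, and the regex stripping rule is a simplification chosen for the example.

```python
import math
import re

# Hypothetical 3-dimensional embeddings, invented for illustration.
# Real Word2Vec models use hundreds of dimensions learned from a corpus.
TOY_VECTORS = {
    "amicitia": [0.9, 0.1, 0.0],
    "amicus": [0.8, 0.2, 0.1],
    "familiaritas": [0.7, 0.3, 0.0],
    "bellum": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def strip_regex(term):
    """Drop backslash escapes, then any remaining non-word characters."""
    return re.sub(r"\\.|\W", "", term)


def expand_keyword(term, threshold):
    """Return the cleaned keyword followed by neighbors at or above the threshold."""
    keyword = strip_regex(term)
    query_vec = TOY_VECTORS[keyword]
    neighbors = sorted(
        word
        for word, vec in TOY_VECTORS.items()
        if word != keyword and cosine(query_vec, vec) >= threshold
    )
    return [keyword] + neighbors


def search(sentences, term, threshold):
    """Return the sentences that contain the keyword or any expanded neighbor."""
    terms = expand_keyword(term, threshold)
    pattern = re.compile("|".join(re.escape(word) for word in terms))
    return [sentence for sentence in sentences if pattern.search(sentence)]


# Hypothetical mini-corpus standing in for PHI5 or TLG.
SENTENCES = [
    "amicitia vera rara est",
    "amicus certus in re incerta cernitur",
    "bellum omnium contra omnes",
]

hits = search(SENTENCES, r"\bamicitia\b", threshold=0.9)
print(hits)
```

Raising the threshold narrows the expansion: in this toy data, a stricter value keeps only the closest neighbor, while a looser one admits more distant words at the cost of noisier search results.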
The keyword expander uses get_sims(), which in turn relies on the Gensim package, to find similar terms.
To add and subtract vectors, you need to load the models yourself with Gensim.
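What adding and subtracting vectors means can be shown with a toy example. The sketch below does not load the CLTK models or call Gensim; the two-dimensional vectors and the `analogy` helper are hypothetical and exist only to make the arithmetic visible. On a real model loaded with Gensim, a comparable query is available through `most_similar(positive=[...], negative=[...])`.

```python
import math

# Hypothetical 2-dimensional embeddings, invented for illustration.
# The first axis loosely encodes "royalty", the second "gender".
TOY_VECTORS = {
    "rex": [1.0, 1.0],
    "regina": [1.0, -1.0],
    "vir": [0.0, 1.0],
    "femina": [0.0, -1.0],
    "bellum": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(TOY_VECTORS.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, TOY_VECTORS[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, TOY_VECTORS[word])]
    # Exclude the input words so the answer is a genuinely different term.
    candidates = [w for w in TOY_VECTORS if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, TOY_VECTORS[w]))


# rex - vir + femina lands on the vector for regina in this toy space.
result = analogy(positive=["rex", "femina"], negative=["vir"])
print(result)
```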
Files
cltk-v0.1.29.zip (421.7 kB)
md5:97b73a00f7c29f0a67dc48369316c8b9
Additional details
Related works
- Is supplement to
- https://github.com/kylepjohnson/cltk/tree/v0.1.29 (URL)