TELF.pre_processing.Vulture package#

Submodules#

TELF.pre_processing.Vulture.pre_process module#

TELF.pre_processing.Vulture.pre_process.advance_document_clean(document: str, document_id: str, nlp, allowed_postags={'ADJ': 1, 'ADV': 1, 'NOUN': 1, 'PROPN': 1, 'VERB': 1}, lemmatize=True) tuple[source]#

Forms bigrams and trigrams, filters tokens to the allowed POS tags, and applies spaCy lemmatization.

Parameters:
  • document (str) – single text document.

  • document_id (str) – unique identifier of the given document.

  • nlp (callable) – spaCy NLP model.

  • allowed_postags (dict, optional) – Allowed POS tags, keyed for O(1) lookup. The default is {'ADJ': 1, 'ADV': 1, 'NOUN': 1, 'PROPN': 1, 'VERB': 1}.

  • lemmatize (bool, optional) – Whether to apply lemmatization. The default is True.

Returns:

document ID, cleaned document pair.

Return type:

tuple
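
A minimal usage sketch, assuming the spaCy model en_core_web_sm is installed; the example document and ID are illustrative:

    import spacy
    from TELF.pre_processing.Vulture.pre_process import advance_document_clean

    nlp = spacy.load("en_core_web_sm")
    document_id, cleaned = advance_document_clean(
        "Vultures are scavenging birds of prey.",  # document
        "doc-001",                                 # document_id
        nlp,                                       # spaCy NLP model
        lemmatize=True,
    )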

TELF.pre_processing.Vulture.pre_process.build_stem_map(vocabulary, method='frequency')[source]#

Stems the vocabulary and builds a map from each variant to a representative form. The map is not yet unified.

Parameters:

  • vocabulary (list) – List of vocabulary tokens to stem.

  • method (str, optional) – Method used to build the stem map. The default is 'frequency'.

Returns:

vocab_stems – Map of variants to the shortest variant, not yet unified.

Return type:

dict (str:str)
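
A minimal sketch; the vocabulary and the exact shape of the output are illustrative assumptions:

    from TELF.pre_processing.Vulture.pre_process import build_stem_map

    vocabulary = ["connect", "connected", "connecting", "connection"]
    vocab_stems = build_stem_map(vocabulary, method="frequency")
    # Assumed shape: variants keyed to the shortest variant,
    # e.g. {"connected": "connect", "connecting": "connect", ...}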

TELF.pre_processing.Vulture.pre_process.build_vocab_stem_subs(vocabulary: list, min_char_len: int = 4)[source]#

Stems the vocabulary and constructs a map of all variants to the shortest variant.

Parameters:

  • vocabulary (list) – List of vocabulary tokens to stem.

  • min_char_len (int, optional) – Minimum number of characters a token must have to be considered for stemming. The default is 4.

Returns:

  • subs_stemed (dict (str:str)) – Map of variants to the shortest variant.

  • shortened_vocabulary (list) – New vocabulary containing only the mapped variants.
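
A minimal sketch with an illustrative vocabulary:

    from TELF.pre_processing.Vulture.pre_process import build_vocab_stem_subs

    vocabulary = ["connect", "connected", "connecting", "connection"]
    subs_stemed, shortened_vocabulary = build_vocab_stem_subs(
        vocabulary, min_char_len=4
    )
    # subs_stemed maps each variant to the shortest variant;
    # shortened_vocabulary keeps only the mapped variants.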

TELF.pre_processing.Vulture.pre_process.check_character_lengths(document: str, document_id: str, min_characters=2, min_unique_characters=2) tuple[source]#

Checks tokens against minimum total and unique character-length requirements.

Parameters:
  • document (str) – single text document.

  • document_id (str) – unique identifier of the given document.

  • min_characters (int, optional) – Minimum number of characters a token should have. The default is 2.

  • min_unique_characters (int, optional) – Minimum number of unique characters a token should have. The default is 2.

Returns:

document ID, cleaned document pair.

Return type:

tuple
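
A minimal sketch; the input document is illustrative, and the described filtering is the assumed effect of the thresholds:

    from TELF.pre_processing.Vulture.pre_process import check_character_lengths

    document_id, cleaned = check_character_lengths(
        "aa bbb a cccc", "doc-001",
        min_characters=2, min_unique_characters=2,
    )
    # Under these thresholds, "a" (too short) and "aa"/"bbb"
    # (only one unique character) would be dropped.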

TELF.pre_processing.Vulture.pre_process.correct_text(documents: dict, corrections: dict)[source]#

Checks word boundaries for all occurrences of the correction keys and replaces each key with its value.

Parameters:
  • documents (dict) – Mapping of document IDs to document text for the corpus.

  • corrections (dict) – Mapping of replacements, where each key is the source string and each value is its replacement.

Returns:

results – List of tuples containing the original (ID, text) pairings.

Return type:

list
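
A minimal sketch with illustrative documents and corrections:

    from TELF.pre_processing.Vulture.pre_process import correct_text

    documents = {"doc-001": "the colour of the graphane sample"}
    corrections = {"colour": "color", "graphane": "graphene"}
    results = correct_text(documents, corrections)
    # results is a list of (id, corrected text) tuples.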

TELF.pre_processing.Vulture.pre_process.lemmatize_document(document: str, document_id: str) tuple[source]#

Performs lemmatization of a given document.

Parameters:
  • document (str) – single text document.

  • document_id (str) – unique identifier of the given document.

Returns:

document ID, lemmatized document pair.

Return type:

tuple
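
A minimal sketch with an illustrative document:

    from TELF.pre_processing.Vulture.pre_process import lemmatize_document

    document_id, lemmatized = lemmatize_document("the mice were running", "doc-001")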

TELF.pre_processing.Vulture.pre_process.nested_defaultdict()[source]#

Factory for nested default dictionaries. When used as the default factory of another defaultdict, the third dimension becomes a list.

Parameters:

None

Return type:

default dict with list values
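
A sketch of the assumed usage pattern, with nested_defaultdict serving as the factory of an outer defaultdict; the keys are illustrative:

    from collections import defaultdict

    from TELF.pre_processing.Vulture.pre_process import nested_defaultdict

    table = defaultdict(nested_defaultdict)
    table["doc-001"]["tokens"].append("vulture")  # list values along the third dimension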

TELF.pre_processing.Vulture.pre_process.simple_document_clean(document: str, document_id: str, stop_words, stop_phrases: list, clean_settings={}) tuple[source]#

Cleans the given document.

Has the capability to:

  • make lower case

  • remove copyright statements with copyright symbol

  • remove stop phrases

  • make hyphenated words a single word

  • remove newline characters

  • remove emails

  • replace dashes with spaces

  • replace [ and ] with space

  • replace \ with space

  • replace ^ with space

  • remove numbers

  • remove non-ASCII characters

  • remove tags

  • remove any other symbols

  • remove anything in between [ and ], including the brackets

  • filter stop words

Parameters:
  • document (str) – single text document.

  • document_id (str) – unique identifier of the given document.

  • stop_words (dict (recommended) or list) – Stop words to remove. A dict (hashmap) gives O(1) lookup; a list gives O(n) lookup.

  • stop_phrases (list) – List of phrases to be removed.

  • clean_settings (dict, optional) – Settings used in simple cleaning. See _organize_simple_clean_defaults for defaults.

Returns:

document ID, cleaned document pair.

Return type:

tuple
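
A minimal sketch; the stop words, stop phrase, and document are illustrative:

    from TELF.pre_processing.Vulture.pre_process import simple_document_clean

    stop_words = {"the": 1, "of": 1}        # dict recommended for O(1) lookup
    stop_phrases = ["all rights reserved"]
    document_id, cleaned = simple_document_clean(
        "The Colour of Graphene. All rights reserved.",
        "doc-001",
        stop_words,
        stop_phrases,
        clean_settings={},                  # fall back to the defaults
    )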

TELF.pre_processing.Vulture.pre_process.stem_document(document: str, document_id: str) tuple[source]#

Performs stemming of a given document.

Parameters:
  • document (str) – single text document.

  • document_id (str) – unique identifier of the given document.

Returns:

document ID, stemmed document pair.

Return type:

tuple
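
A minimal sketch with an illustrative document:

    from TELF.pre_processing.Vulture.pre_process import stem_document

    document_id, stemmed = stem_document("running runners ran", "doc-001")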

TELF.pre_processing.Vulture.pre_process.strip_suffixes(word, suffixes)[source]#

Removes any of the given suffixes found on the word.

Parameters:
  • word (str) – Word to remove the suffix from.

  • suffixes (list) – List of common English suffixes.

Returns:

word – The word with any matching suffix removed.

Return type:

str
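
A minimal sketch; the suffix list and the expected result are illustrative assumptions:

    from TELF.pre_processing.Vulture.pre_process import strip_suffixes

    suffixes = ["ing", "ed", "es", "s"]
    word = strip_suffixes("processing", suffixes)  # expected: "process" (assumed)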

TELF.pre_processing.Vulture.vulture module#

© 2022. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

class TELF.pre_processing.Vulture.vulture.Vulture(*, n_jobs=-1, n_nodes=1, parallel_backend='multiprocessing', cache='/tmp', verbose=False)[source]#

Bases: object

Vulture is a parallel, multi-node, distributed document pre-processing tool. It is designed to be simple and fast.

Vultures are nature's cleaners!

DEFAULT_PIPELINE = [SimpleCleaner(effective_stop_words=['characteristics', 'acknowledgment', 'characteristic', 'significantly', 'automatically', 'unfortunately', 'corresponding', 'substantially', 'predominantly', 'approximately', 'investigation', 'applications', 'particularly', 'sufficiently', 'specifically', 'demonstrates', 'representing', 'consequently', 'respectively', 'introduction', 'successfully', 'nevertheless', 'demonstrated', 'demonstrate', 'conclusions', ... (+1359 more)], patterns={'standardize_hyphens': (re.compile('[\\u002D\\u2010\\u2011\\u2012\\u2013\\u2014\\u2015\\u2212\\u2E3A\\u2E3B]'), '-'), 'remove_copyright_statement': None, 'remove_stop_phrases': None, 'make_lower_case': None, 'normalize': None, 'remove_trailing_dash': ('(?<!\\w)-|-(?!\\w)', ''), 'make_hyphens_words': ('([a-z])\\-([a-z])', ''), 'remove_next_line': ('\\n+', ' '), 'remove_email': ('\\S*@\\S*\\s?', ''), 'remove_formulas': ('\\b\\w*[\\=\\≈\\/\\\\\\±]\\w*\\b', ''), 'remove_dash': ('-', ''), 'remove_between_[]': ('\\[.*?\\]', ' '), 'remove_between_()': ('\\(.*?\\)', ' '), 'remove_[]': ('[\\[\\]]', ' '), 'remove_()': ('[()]', ' '), 'remove_\\': ('\\\\', ' '), 'remove_numbers': ('\\d+', ''), 'remove_standalone_numbers': ('\\b\\d+\\b', ''), 'remove_nonASCII_boundary': ('\\b[^\\x00-\\x7F]+\\b', ''), 'remove_nonASCII': ('[^\\x00-\\x7F]+', ''), 'remove_tags': ('&lt;/?.*?&gt;', ''), 'remove_special_characters': ('[!|"|#|$|%|&|\\|\\\'|(|)|*|+|,|.|/|:|;|<|=|>|?|@|[|\\|]|^|_|`|{|\\||}|~]', ''), 'isolate_frozen': None, 'remove_extra_whitespace': ('\\s+', ' '), 'remove_stop_words': None, 'min_characters': None}, sw_pattern=re.compile('\\b[\\w-]+\\b'))]#
PARALLEL_BACKEND_OPTIONS = {'loky', 'multiprocessing', 'threading'}#
property cache#
clean(documents, steps=None, substitutions=None, save_path=None)[source]#
clean_dataframe(df, columns, steps=None, substitutions=None, append_to_original_df=False, concat_cleaned_cols=False)[source]#
property n_jobs#
property n_nodes#
property parallel_backend#
property save_path#
use_mpi()[source]#
property verbose#
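
A minimal end-to-end sketch. The {id: text} shape of documents is an assumption, inferred from the other functions in this module; with steps=None, clean is assumed to fall back to DEFAULT_PIPELINE:

    from TELF.pre_processing.Vulture.vulture import Vulture

    documents = {
        "doc-001": "Vultures are nature's cleaners!",
        "doc-002": "Parallel document pre-processing.",
    }
    vulture = Vulture(n_jobs=-1, n_nodes=1, verbose=True)
    cleaned = vulture.clean(documents)  # steps=None -> DEFAULT_PIPELINE (assumed)
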
TELF.pre_processing.Vulture.vulture.chunk_tuple_list(l, n_chunks)[source]#

Splits the given list of (key, value) tuples into sub-lists.

Parameters:
  • l (list of tuple) – List of (key, value) tuples to split.

  • n_chunks (int) – How many sub-lists to create.

Yields:

list – Sub-list containing (key, value) tuples.
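
A minimal sketch with an illustrative tuple list:

    from TELF.pre_processing.Vulture.vulture import chunk_tuple_list

    pairs = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
    for chunk in chunk_tuple_list(pairs, n_chunks=2):
        print(chunk)  # each chunk is a sub-list of (key, value) tuples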

Module contents#