Preprocessing Utils¶
-
preprocessing.preprocessing_ess_utils.
check_if_answer_is_special_category
(text, answer_value, ess_special_answer_categories)[source]¶ Verifies whether a given answer segment is one of the special answer categories by testing the answer text against the attributes of the SpecialAnswerCategories object. This method standardizes the special answer category values.
- Parameters
text (param1) – answer segment currently being analyzed.
answer_value (param2) – answer category value, defined in clean_answer() method.
ess_special_answer_categories (param3) – instance of the SpecialAnswerCategories object, in accordance with the country_language.
- Returns
answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.
-
preprocessing.preprocessing_ess_utils.
check_if_segment_is_instruction
(sentence, country_language)[source]¶ Calls the appropriate instruction recognition method, according to the language.
- Parameters
sentence (param1) – sentence (string) being analyzed in the outer loop of data extraction.
country_language (param2) – country_language metadata (string), embedded in the file name.
- Returns
the return value of the language-specific instruction_recognition method, passed through unchanged (boolean).
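The dispatch described in this entry can be sketched as a lookup from language code to recognizer. This is a minimal sketch, not the module's implementation: the two recognizer stubs, the `RECOGNIZERS` table, and the assumption that country_language begins with a three-letter language code (for example `ENG_GB`) are all hypothetical.

```python
import re

# Hypothetical stand-ins for the language-specific recognizers.
def recognize_english(text, country_language):
    return bool(re.match(r"(ask|read out|interviewer)\b", text, re.IGNORECASE))

def recognize_german(text, country_language):
    return bool(re.match(r"(befragte|interviewer)\b", text, re.IGNORECASE))

# Assumption: language code -> recognizer function.
RECOGNIZERS = {"ENG": recognize_english, "GER": recognize_german}

def check_if_segment_is_instruction_sketch(sentence, country_language):
    # Assumption: the language code is the first "_"-separated token.
    language = country_language.split("_")[0]
    recognizer = RECOGNIZERS.get(language)
    if recognizer is None:
        return False  # unknown language: treat the segment as a non-instruction
    # Pass the recognizer's boolean through unchanged.
    return recognizer(sentence, country_language)
```

A table lookup keeps the dispatcher short as languages are added; a chain of if/elif branches would behave the same way.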
-
preprocessing.preprocessing_ess_utils.
clean_answer
(text, ess_special_answer_categories)[source]¶ Cleans the answer segment by standardizing the text (when it is a special answer category) and assigning a category value to it.
- Parameters
text (param1) – answer segment currently being analyzed.
ess_special_answer_categories (param2) – instance of the SpecialAnswerCategories object, in accordance with the country_language.
- Returns
answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.
-
preprocessing.preprocessing_ess_utils.
clean_text
(text)[source]¶ Cleans Request, Introduction and Instruction text segments by removing undesired characters and standardizing some character representations. A string input is expected. If the input is not a string instance, the method returns ‘’, so the entry is ignored in the data extraction loop.
- Parameters
text (param1) – text to be cleaned.
- Returns
cleaned text (string).
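The non-string guard described in this entry can be illustrated as follows. This is a minimal sketch: the function name is hypothetical, and the specific cleaning steps (straightening curly apostrophes, collapsing whitespace) are assumptions, since the entry does not list which characters are removed or standardized.

```python
import re

def clean_text_sketch(text):
    # Non-string input (e.g. None or a float) yields '' so that the caller
    # can skip the entry.
    if not isinstance(text, str):
        return ''
    # Assumption: straighten curly single quotes.
    text = text.replace('\u2019', "'").replace('\u2018', "'")
    # Assumption: collapse any run of whitespace into one space.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
```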
-
preprocessing.preprocessing_ess_utils.
expand_interviewer_abbreviations
(text, country_language)[source]¶ Replaces abbreviations of the word interviewer with the full form.
- Parameters
text (param1) – sentence being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
text (string) without abbreviations for the word interviewer, when applicable.
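One way to expand such abbreviations is a regex substitution. This is a minimal sketch under stated assumptions: the abbreviation forms (`INT.`, `INTERV.`), the English-only handling, and the check for `ENG` inside country_language are hypothetical and not taken from the module.

```python
import re

# Hypothetical abbreviation pattern, English only.
INTERVIEWER_ABBREVIATION = re.compile(r"\bINT(?:ERV)?\.", re.IGNORECASE)

def expand_interviewer_abbreviations_sketch(text, country_language):
    # Assumption: the language code appears somewhere in country_language.
    if "ENG" in country_language:
        return INTERVIEWER_ABBREVIATION.sub("Interviewer", text)
    # Other languages are returned unchanged in this sketch.
    return text
```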
-
preprocessing.preprocessing_ess_utils.
get_country_language_and_study_info
(filename)[source]¶ Retrieves the country/language and study metadata based on the input filename or the survey_item_ID prefix. The filenames follow the naming rule SSS_RRR_YYYY_CC_LLL, where:
S = study name
R = round or wave
Y = study year
C = country (two-character ISO code, except for SOURCE)
L = language
- Parameters
filename (param1) – name of the input file.
- Returns
country/language (string) and study metadata (string).
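The naming rule can be unpacked by splitting on the underscore. This is a minimal sketch, not the module's code: the function name, the returned dictionary, and the example filenames in the usage note are hypothetical, and the sketch makes no claim about how the real function combines the fields into its two return values.

```python
import os

def parse_filename_sketch(filename):
    # Drop any directory and extension, then split on the underscore.
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected filename: {filename!r}")
    study, round_or_wave, year, country, language = parts
    return {
        "study": study,
        "round": round_or_wave,
        "year": year,
        "country": country,    # two-character ISO code, or SOURCE
        "language": language,
    }
```

For a hypothetical file named `ESS_R01_2002_GB_ENG.xlsx`, the sketch yields country `GB` and language `ENG`; a source questionnaire such as `ESS_R01_2002_SOURCE_ENG.xlsx` yields country `SOURCE`.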
-
preprocessing.preprocessing_ess_utils.
instantiate_special_answer_category_object
(country_language)[source]¶ Instantiates the SpecialAnswerCategories object that stores both the text and category values of the special answers (don’t know, refusal, not applicable and write down) in accordance with the country_language metadata parameter.
- Parameters
country_language (param1) – country_language metadata parameter, embedded in file name.
- Returns
instance of the SpecialAnswerCategories object (Python object), in accordance with the country_language.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_catalan_spanish
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in either Spanish or Catalan, based on regex patterns with named groups.
- Parameters
text (param1) – text (in Spanish or Catalan) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_czech
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Czech, based on regex patterns with named groups.
- Parameters
text (param1) – text (in Czech) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_english
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in English, based on regex patterns with named groups.
- Parameters
text (param1) – text (in English) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
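A recognizer built on named groups might look like the following. This is a minimal sketch: the pattern, the group names, and both function names are hypothetical, and the cue words are illustrative rather than the ones the module uses.

```python
import re

# Hypothetical pattern. Each alternative is a named group, so a match also
# tells which kind of instruction cue was found.
INSTRUCTION_PATTERN = re.compile(
    r"(?P<ask>ask\b)|(?P<read>read out\b)|(?P<card>(?:show )?card\b)",
    re.IGNORECASE,
)

def instruction_recognition_sketch(text):
    # True when the segment starts with one of the cue words.
    return INSTRUCTION_PATTERN.match(text.strip()) is not None

def matched_cue(text):
    # Name of the group that matched, or None when nothing matched.
    match = INSTRUCTION_PATTERN.match(text.strip())
    return match.lastgroup if match else None
```

Named groups add nothing to the boolean result, but they make it easy to log or debug which cue triggered the match.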
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_french
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in French, based on regex patterns with named groups.
- Parameters
text (param1) – text (in French) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_german
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in German, based on regex patterns with named groups.
- Parameters
text (param1) – text (in German) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_norwegian
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Norwegian, based on regex patterns with named groups.
- Parameters
text (param1) – text (in Norwegian) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_portuguese
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Portuguese, based on regex patterns with named groups.
- Parameters
text (param1) – text (in Portuguese) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_russian
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Russian, based on regex patterns with named groups.
- Parameters
text (param1) – text (in Russian) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
remove_spaces_from_item_name
(item_name)[source]¶ Removes spaces in item names such as A 1, because the MCSQ standard uses item names without spaces (A1).
- Parameters
item_name (param1) – item_name (string) retrieved from the input file.
- Returns
item_name (string) without spaces.
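The transformation amounts to dropping whitespace from the name. A minimal sketch, with a hypothetical function name and the assumption that every whitespace character should be removed (not only the space character):

```python
def remove_spaces_sketch(item_name):
    # str.split() with no argument splits on any whitespace run, so joining
    # the pieces removes spaces, tabs and stray newlines alike.
    return "".join(item_name.split())
```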
-
preprocessing.preprocessing_ess_utils.
retrieve_item_module
(item_name, study)[source]¶ Retrieves the module of the survey_item, based on information from the ESSModulesRRR objects. This information comes from the source questionnaires.
- Parameters
item_name (param1) – name of survey item, retrieved in previous steps.
study (param2) – study metadata, embedded in the file name.
- Returns
module of survey_item (string).
-
preprocessing.preprocessing_ess_utils.
retrieve_supplementary_module
(essmodules, item_name)[source]¶ Matches the item_name against the dictionary stored in the ESSModulesRRR objects. Rotating/supplementary modules are defined per round because they may change from round to round.
- Parameters
essmodules (param1) – ESSModulesRRR object (Python object), instantiated according to the round.
item_name (param2) – name of the survey item (string), retrieved in previous steps.
- Returns
matching value for item name (string).
-
preprocessing.preprocessing_ess_utils.
standardize_study_metadata
(study)[source]¶ Transforms study metadata present in the input file to the standard used in the MCSQ format.
- Parameters
study (param1) – study metadata extracted from input file (Study column).
- Returns
Standardized study parameter (string).
-
preprocessing.preprocessing_ess_utils.
standardize_supplementary_item_name
(item_name)[source]¶ Standardizes the item name metadata of supplementary modules G, H and I.
- Parameters
item_name (param1) – item_name metadata, extracted from input file.
- Returns
Standardized item_name, when applicable.
Python3 script with utility functions for preprocessing.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
-
preprocessing.utils.
determine_country
(filename)[source]¶ Determines the full name of the country, based on the ISO country code embedded in the file name.
- Parameters
filename (param1) – input file name.
- Returns
full name of the country (string).
-
preprocessing.utils.
determine_sentence_tokenizer
(filename)[source]¶ Provides the sentence splitter suffix used to instantiate the splitter in accordance with the target language (information embedded in the filename).
- Parameters
filename (param1) – input file name.
- Returns
a sentence splitter suffix (string) according to the target language.
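A suffix lookup of this kind could be sketched as a dictionary keyed by language code. Everything here is an assumption made for illustration: the function name, the three-letter language codes, the position of the code as the last underscore-separated token of the filename, and the suffix format (model file names such as `english.pickle`, as found under `tokenizers/punkt/` in classic NLTK data distributions).

```python
import os

# Assumption: three-letter language codes mapped to Punkt model file names.
PUNKT_SUFFIXES = {
    "ENG": "english.pickle",
    "GER": "german.pickle",
    "FRE": "french.pickle",
    "SPA": "spanish.pickle",
    "POR": "portuguese.pickle",
    "CZE": "czech.pickle",
    "NOR": "norwegian.pickle",
    "RUS": "russian.pickle",
}

def determine_sentence_tokenizer_sketch(filename):
    # Assumption: the language code is the last "_"-separated token.
    stem = os.path.splitext(os.path.basename(filename))[0]
    language = stem.split("_")[-1]
    # Returns None for languages without an entry in the table.
    return PUNKT_SUFFIXES.get(language)
```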
-
preprocessing.utils.
get_sentence_splitter
(filename)[source]¶ Decides which Punkt sentence tokenizer from NLTK should be instantiated, according to the information embedded in the filename.
- Parameters
filename (param1) – input file name.
- Returns
a sentence splitter (NLTK object) instantiated according to the target language.
-
preprocessing.utils.
recognize_standard_response_scales
(filename, text)[source]¶ Recognizes special answer categories from EVS by testing the answer segment against the language-dependent pattern definitions for the special categories.
- Parameters
filename (param1) – input file name.
text (param2) – answer text segment.
- Returns
If a pattern is found, returns a string indicating the special category; otherwise returns None.
-
preprocessing.preprocessing_evs_utils.
clean_answer_text_evs
(text, filename)[source]¶ Removes undesired characters from request/response text.
- Parameters
text (param1) – request/response text (string) extracted from the input file.
filename (param2) – name of the input file (string).
- Returns
clean request/response text (string).
-
preprocessing.preprocessing_evs_utils.
clean_instruction
(text)[source]¶ Removes undesired characters from instruction text.
- Parameters
text (param1) – instruction text (string) extracted from the input file.
- Returns
clean instruction text (string) or ‘’ when text is not an instance of a string.
-
preprocessing.preprocessing_evs_utils.
clean_text
(text, filename)[source]¶ Removes undesired characters from request/response text.
- Parameters
text (param1) – request/response text (string) extracted from the input file.
filename (param2) – name of the input file (string).
- Returns
clean request/response text (string).
-
preprocessing.preprocessing_evs_utils.
get_country_language_and_study_info
(filename)[source]¶ Retrieves the country/language and study metadata based on the input filename.
- Parameters
filename (param1) – name of the input file (string).
- Returns
country/language (string) and study (string) metadata.
-
preprocessing.preprocessing_evs_utils.
standardize_item_name
(item_name)[source]¶ Standardizes a given item_name, if it is not already in the standard form.
- Parameters
item_name (param1) – item name (string) extracted from the input file.
- Returns
standardized item_name (string).
-
preprocessing.preprocessing_evs_utils.
standardize_special_response_category
(filename, text)[source]¶ Standardizes the text of special response categories (don’t know, no answer, not applicable), according to the language (indicated in the filename).
- Parameters
filename (param1) – name of the input file.
text (param2) – response text.
- Returns
standardized response category text (string).
-
preprocessing.preprocessing_evs_utils.
standardize_special_response_category_value
(filename, catValu, text)[source]¶ Standardizes a response category value, if it is a special response category. Standard values:
Refusal = 777
Don’t know = 888
Does not apply = 999
- Parameters
filename (param1) – name of the input file.
catValu (param2) – response category value, extracted from input file.
text (param3) – text of response category, to test against special response category patterns.
- Returns
standardized response category value (string).
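The value standardization can be sketched as a pattern test followed by a table lookup. The three standard values come from this entry; the function name, the category keys, and the English-only patterns are hypothetical. The real function selects patterns by the language encoded in the filename, which this sketch leaves out.

```python
import re

# Standard values as stated in the entry above.
SPECIAL_VALUES = {"refusal": "777", "dont_know": "888", "not_applicable": "999"}

# Hypothetical English-only patterns, for illustration.
SPECIAL_PATTERNS = {
    "refusal": re.compile(r"^refus(al|ed)$", re.IGNORECASE),
    "dont_know": re.compile("^don['’]?t know$", re.IGNORECASE),
    "not_applicable": re.compile(r"^(not applicable|does not apply)$", re.IGNORECASE),
}

def standardize_value_sketch(cat_value, text):
    stripped = text.strip()
    for category, pattern in SPECIAL_PATTERNS.items():
        if pattern.match(stripped):
            return SPECIAL_VALUES[category]
    # Not a special category: keep the original value, as a string.
    return str(cat_value)
```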