Preprocessing Utils

preprocessing.preprocessing_ess_utils.check_if_answer_is_special_category(text, answer_value, ess_special_answer_categories)[source]

Verifies whether a given answer segment is one of the special answer categories by testing the answer text against the attributes of the SpecialAnswerCategories object. This method standardizes the special answer category values.

Parameters
  • text (param1) – answer segment currently being analyzed.

  • answer_value (param2) – answer category value, defined in clean_answer() method.

  • ess_special_answer_categories (param3) – instance of SpecialAnswerCategories object, in accordance with the country_language.

Returns

answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.

preprocessing.preprocessing_ess_utils.check_if_segment_is_instruction(sentence, country_language)[source]

Calls the appropriate instruction recognition method, according to the language.

Parameters
  • sentence (string) – sentence being analyzed in the outer loop of data extraction.

  • country_language (string) – country_language metadata, embedded in file name.

Returns

the return value of the selected instruction_recognition method, passed through unchanged (boolean).
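
The dispatch described above can be sketched roughly as follows. This is only an illustration: the stub recognizers, the RECOGNIZERS table, the language codes ('ENG', 'FRE') and the fallback to False for an unknown language are all assumptions, not the module's actual code.

```python
# Stub recognizers standing in for the instruction_recognition_* functions (hypothetical).
def _recognize_english(sentence, country_language):
    return sentence.lower().startswith('interviewer')


def _recognize_french(sentence, country_language):
    return sentence.lower().startswith('enquêteur')


# Assumed language codes; the real country_language format is not documented here.
RECOGNIZERS = {'ENG': _recognize_english, 'FRE': _recognize_french}


def check_if_segment_is_instruction_sketch(sentence, country_language):
    """Pick a recognizer by language and pass its boolean result through."""
    for code, recognizer in RECOGNIZERS.items():
        if code in country_language:
            return recognizer(sentence, country_language)
    # Assumption: an unsupported language is treated as "not an instruction".
    return False
```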

preprocessing.preprocessing_ess_utils.clean_answer(text, ess_special_answer_categories)[source]

Cleans the answer segment by standardizing the text (when it is a special answer category) and assigning a category value to it.

Parameters
  • text (param1) – answer segment currently being analyzed.

  • ess_special_answer_categories (param2) – instance of SpecialAnswerCategories object, in accordance with the country_language.

Returns

answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.

preprocessing.preprocessing_ess_utils.clean_text(text)[source]

Cleans Request, Introduction and Instruction text segments by removing undesired characters and standardizing some character representations. A string input is expected. If the input is not a string instance, the method returns ‘’, so the entry is ignored in the data extraction loop.

Parameters

text (param1) – text to be cleaned.

Returns

cleaned text (string).
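
A minimal sketch of the non-string guard described above. The function name and the two normalization rules are assumptions chosen for illustration; the characters actually removed by clean_text() are not listed in this reference.

```python
import re


def clean_text_sketch(text):
    """Hypothetical stand-in for clean_text()."""
    # Non-string input (for example a NaN from an empty cell) yields ''
    # so that the caller can skip the entry.
    if not isinstance(text, str):
        return ''
    # Assumed normalizations, for illustration only.
    text = text.replace('\u2019', "'")  # typographic apostrophe -> ASCII apostrophe
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    return text.strip()
```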

preprocessing.preprocessing_ess_utils.expand_interviewer_abbreviations(text, country_language)[source]

Switches abbreviations of the word interviewer for the full form.

Parameters
  • text (param1) – sentence being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

text (string) without abbreviations for the word interviewer, when applicable.
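
One way such an expansion could look, assuming English text, an 'ENG' marker inside country_language, and the abbreviation "Int."; none of these details are confirmed by this reference, and the pattern below is hypothetical.

```python
import re

# Hypothetical abbreviation pattern for English; not the project's own pattern.
INTERVIEWER_ABBREVIATION = re.compile(r'\bInt\.', flags=re.IGNORECASE)


def expand_interviewer_abbreviations_sketch(text, country_language):
    """Replace the assumed abbreviation with the full word for English text."""
    if 'ENG' in country_language:  # assumed language code
        return INTERVIEWER_ABBREVIATION.sub('Interviewer', text)
    # Other languages are left untouched in this sketch.
    return text
```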

preprocessing.preprocessing_ess_utils.get_country_language_and_study_info(filename)[source]

Retrieves the country/language and study metadata based on the input filename, or survey_item_ID prefix. The filenames follow the naming convention SSS_RRR_YYYY_CC_LLL, where:

  • S = study name

  • R = round or wave

  • Y = study year

  • C = country (two-character ISO code, except for SOURCE)

  • L = language

Parameters

filename (param1) – name of the input file.

Returns

country/language (string) and study metadata (string).
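
The naming convention above can be unpacked with a plain split on underscores. The helper name, the returned dictionary and the example file name used below are hypothetical; the real function returns two strings whose exact format is not shown in this reference.

```python
from pathlib import Path


def parse_filename_sketch(filename):
    """Hypothetical helper: split SSS_RRR_YYYY_CC_LLL into its fields."""
    # Drop any directory and extension, then keep the first five fields so that
    # a longer survey_item_ID starting with the same prefix is also accepted.
    fields = Path(filename).stem.split('_')[:5]
    if len(fields) < 5:
        raise ValueError(f'unexpected file name: {filename!r}')
    study, round_, year, country, language = fields
    return {
        'study': study,
        'round': round_,
        'year': year,
        'country': country,
        'language': language,
    }
```

For a made-up name such as 'ESS_R01_2002_ES_SPA.xml', the sketch yields study 'ESS', round 'R01', year '2002', country 'ES' and language 'SPA'.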

preprocessing.preprocessing_ess_utils.instantiate_special_answer_category_object(country_language)[source]

Instantiates the SpecialAnswerCategories object that stores both the text and category values of the special answers (don’t know, refusal, not applicable and write down) in accordance with the country_language metadata parameter.

Parameters

country_language (param1) – country_language metadata parameter, embedded in file name.

Returns

instance of SpecialAnswerCategories object (Python object), in accordance with the country_language.

preprocessing.preprocessing_ess_utils.instruction_recognition_catalan_spanish(text, country_language)[source]

Recognizes an instruction segment for texts written either in Spanish or Catalan, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Spanish or Catalan) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_czech(text, country_language)[source]

Recognizes an instruction segment for texts written in Czech, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Czech) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_english(text, country_language)[source]

Recognizes an instruction segment for texts written in English, based on regex named groups patterns.

Parameters
  • text (param1) – text (in English) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.
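
As an illustration of recognition with regex named groups, the sketch below uses an invented English pattern. The group names and the trigger phrases are assumptions; the module's actual patterns are not reproduced in this reference.

```python
import re

# Hypothetical pattern: each alternative is a named group, so a match
# also tells which kind of instruction cue was found (match.lastgroup).
INSTRUCTION_PATTERN = re.compile(
    r'^(?P<interviewer>interviewer\b)'
    r'|(?P<showcard>\bshow\s?card\b)'
    r'|(?P<read_out>\bread out\b)',
    flags=re.IGNORECASE,
)


def instruction_recognition_english_sketch(text, country_language):
    """Return True when any of the assumed instruction cues occurs in text."""
    # country_language is accepted for signature parity but unused in this sketch.
    return INSTRUCTION_PATTERN.search(text) is not None
```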

preprocessing.preprocessing_ess_utils.instruction_recognition_french(text, country_language)[source]

Recognizes an instruction segment for texts written in French, based on regex named groups patterns.

Parameters
  • text (param1) – text (in French) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_german(text, country_language)[source]

Recognizes an instruction segment for texts written in German, based on regex named groups patterns.

Parameters
  • text (param1) – text (in German) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_norwegian(text, country_language)[source]

Recognizes an instruction segment for texts written in Norwegian, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Norwegian) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_portuguese(text, country_language)[source]

Recognizes an instruction segment for texts written in Portuguese, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Portuguese) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_russian(text, country_language)[source]

Recognizes an instruction segment for texts written in Russian, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Russian) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.remove_spaces_from_item_name(item_name)[source]

Removes spaces in item names such as A 1, because the MCSQ standard uses item names without spaces (A1).

Parameters

item_name (string) – item_name retrieved from the input file.

Returns

item_name (string) without spaces.
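
A one-line sketch of the normalization, under the assumption that any whitespace (not only a single space) should be dropped; the function name is hypothetical.

```python
def remove_spaces_from_item_name_sketch(item_name):
    """Hypothetical stand-in: 'A 1' becomes 'A1'."""
    # str.split() with no argument splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs and newlines alike.
    return ''.join(item_name.split())
```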

preprocessing.preprocessing_ess_utils.retrieve_item_module(item_name, study)[source]

Retrieves the module of the survey_item, based on information from the ESSModulesRRR objects. This information comes from the source questionnaires.

Parameters
  • item_name (param1) – name of survey item, retrieved in previous steps.

  • study (param2) – study metadata, embedded in the file name.

Returns

module of survey_item (string).

preprocessing.preprocessing_ess_utils.retrieve_supplementary_module(essmodules, item_name)[source]

Matches the item_name against the dictionary stored in the ESSModulesRRR objects. Rotating/supplementary modules are defined by round because they may change from round to round.

Parameters
  • essmodules (Python object) – ESSModulesRRR object, instantiated according to the round.

  • item_name (string) – name of survey item, retrieved in previous steps.

Returns

matching value for item name (string).

preprocessing.preprocessing_ess_utils.standardize_study_metadata(study)[source]

Transforms study metadata present in the input file to the standard used in the MCSQ format.

Parameters

study (param1) – study metadata extracted from input file (Study column).

Returns

Standardized study parameter (string).

preprocessing.preprocessing_ess_utils.standardize_supplementary_item_name(item_name)[source]

Standardizes the item name metadata of supplementary modules G, H and I.

Parameters

item_name (param1) – item_name metadata, extracted from input file.

Returns

Standardized item_name, when applicable.

Python3 script with utility functions for preprocessing.

Author: Danielly Sorato

Author contact: danielly.sorato@gmail.com

preprocessing.utils.determine_country(filename)[source]

Determines the full name of the country, based on the ISO country code embedded in the file name.

Parameters

filename (param1) – input file name.

Returns

full name of the country (string).

preprocessing.utils.determine_sentence_tokenizer(filename)[source]

Provides the sentence splitter suffix needed to instantiate the splitter for the target language (information embedded in the filename).

Parameters

filename (param1) – input file name.

Returns

a sentence splitter suffix (string) according to the target language.

preprocessing.utils.get_sentence_splitter(filename)[source]

Decides which Punkt Sentence Tokenizer from NLTK should be instantiated, according to the information embedded in the filename.

Parameters

filename (param1) – input file name.

Returns

a sentence splitter (NLTK object) instantiated according to the target language.

preprocessing.utils.recognize_standard_response_scales(filename, text)[source]

Recognizes special answer categories from EVS by testing the answer segment against the language dependent pattern definitions for the special categories.

Parameters
  • filename (param1) – input file name.

  • text (param2) – answer text segment.

Returns

If a pattern was found, returns a string informing the special category, otherwise returns None.

preprocessing.preprocessing_evs_utils.clean_answer_text_evs(text, filename)[source]

Removes undesired characters from request/response text.

Parameters
  • text (string) – request/response text extracted from the input file.

  • filename (string) – name of the input file.

Returns

clean request/response text (string).

preprocessing.preprocessing_evs_utils.clean_instruction(text)[source]

Removes undesired characters from instruction text.

Parameters

text (string) – instruction text extracted from the input file.

Returns

clean instruction text (string) or ‘’ when text is not an instance of a string.

preprocessing.preprocessing_evs_utils.clean_text(text, filename)[source]

Removes undesired characters from request/response text.

Parameters
  • text (string) – request/response text extracted from the input file.

  • filename (string) – name of the input file.

Returns

clean request/response text (string).

preprocessing.preprocessing_evs_utils.get_country_language_and_study_info(filename)[source]

Retrieves the country/language and study metadata based on the input filename.

Parameters

filename (string) – name of the input file.

Returns

country/language (string) and study (string) metadata.

preprocessing.preprocessing_evs_utils.standardize_item_name(item_name)[source]

Standardizes a given item_name, if it is not already in the standard form.

Parameters

item_name (string) – item name extracted from the input file.

Returns

standardized item_name (string).

preprocessing.preprocessing_evs_utils.standardize_special_response_category(filename, text)[source]

Standardizes the text of special response categories (don’t know, no answer, not applicable), according to the language (indicated in the filename).

Parameters
  • filename (param1) – name of the input file.

  • text (param2) – response text.

Returns

standardized response category text (string).

preprocessing.preprocessing_evs_utils.standardize_special_response_category_value(filename, catValu, text)[source]

Standardizes a response category value, if it is a special response category. Standard values: Refusal = 777, Don’t know = 888, Does not apply = 999.

Parameters
  • filename (param1) – name of the input file.

  • catValu (param2) – response category value, extracted from input file.

  • text (param3) – text of response category, to test against special response category patterns.

Returns

standardized response category value (string).
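
The value mapping stated above (777, 888, 999) can be sketched as a pattern lookup. Only the three codes come from this reference; the English-only patterns, the names below and the omission of the filename-based language selection are simplifying assumptions.

```python
import re

# Codes taken from the documentation above.
SPECIAL_CATEGORY_VALUES = {
    'refusal': '777',
    'dont_know': '888',
    'does_not_apply': '999',
}

# Hypothetical English patterns; the real ones are language dependent.
SPECIAL_CATEGORY_PATTERNS = {
    'refusal': re.compile(r'^refus(al|ed)$', re.IGNORECASE),
    'dont_know': re.compile(r"^don['\u2019]?t know$", re.IGNORECASE),
    'does_not_apply': re.compile(r'^(does not apply|not applicable)$', re.IGNORECASE),
}


def standardize_value_sketch(cat_value, text):
    """Return the special code if text matches a pattern, else the original value."""
    cleaned = text.strip()
    for category, pattern in SPECIAL_CATEGORY_PATTERNS.items():
        if pattern.match(cleaned):
            return SPECIAL_CATEGORY_VALUES[category]
    # Ordinary categories keep their value, returned as a string.
    return str(cat_value)
```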