ESS data extraction, preprocessing and treatment methods

preprocessing.txt2spreadsheet.main(folder_path, has_supplementary)[source]

Main method of the ESS plain text to spreadsheet data transformation algorithm. The data is extracted from the plain text file (which follows an internal specification of the MCSQ project), preprocessed, and assigned the appropriate metadata.

The algorithm outputs a CSV representation of df_questionnaire, the pandas dataframe used to store questionnaire data.

Parameters
  • folder_path (param1) – path to the folder where the plain text files are.

  • has_supplementary (param2) – boolean variable that indicates if there is a supplementary spreadsheet to be appended.

preprocessing.txt2spreadsheet.process_answer_segment(raw_item, survey_item_prefix, study, item_name, df_questionnaire, country_language)[source]

Extracts and processes the answer segments from a raw item. The answer segments are always after the {ANSWERS} tag. If there are no answer segments, the answer segment is the ‘write down’ text for the target language.

Parameters
  • raw_item (param1) – raw survey item, retrieved in previous steps.

  • survey_item_prefix (param2) – prefix of survey_item_ID.

  • study (param3) – metadata parameter about study embedded in the file name.

  • item_name (param4) – item_name metadata parameter, retrieved in previous steps.

  • df_questionnaire (param5) – pandas dataframe to store questionnaire data.

  • country_language (param6) – country_language metadata, embedded in file name.

Returns

updated df_questionnaire when new valid answer segments are included, or df_questionnaire in the same state it was when no new valid answer segments were included.
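
The fallback described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name, the assumption of one answer per line after the tag, and the write_down_text argument (standing in for the target-language ‘write down’ text) are all hypothetical.

```python
def extract_answer_segments(raw_item, write_down_text):
    """Return the answer segments found after the {ANSWERS} tag.

    Assumption: each answer sits on its own line after the tag.
    """
    _, _, after_tag = raw_item.partition("{ANSWERS}")
    answers = [line.strip() for line in after_tag.splitlines() if line.strip()]
    # No answers found: fall back to the 'write down' text.
    return answers if answers else [write_down_text]


with_answers = extract_answer_segments(
    "B1 {QUESTION} Question text\n{ANSWERS}\nMolt\nBastant\nPoc\nGens",
    "Write down",
)
without_answers = extract_answer_segments(
    "F2 {QUESTION} Open question text", "Write down"
)
```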

preprocessing.txt2spreadsheet.process_intro_segment(raw_item, survey_item_prefix, study, item_name, df_questionnaire, splitter)[source]

Extracts and processes the introduction segments from a raw item. The introduction segments are always between the item name and {QUESTION} tag, for instance:

{INTRO} Ara m’agradaria fer-li algunes preguntes sobre política i el govern.

B1 {QUESTION} En quina mesura diria vostè que l’interessa la política? Vostè diria que l’interessa…

{ANSWERS} Molt Bastant Poc Gens

Parameters
  • raw_item (param1) – raw survey item, retrieved in previous steps.

  • survey_item_prefix (param2) – prefix of survey_item_ID.

  • study (param3) – metadata parameter about study embedded in the file name.

  • item_name (param4) – item_name metadata parameter, retrieved in previous steps.

  • df_questionnaire (param5) – pandas dataframe to store questionnaire data.

  • splitter (param6) – sentence segmentation from NLTK library.

Returns

updated df_questionnaire when new valid introduction segments are included, or df_questionnaire in the same state it was when no new valid introduction segments were included.
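
As a rough sketch of the extraction step, the snippet below takes the text between the {INTRO} and {QUESTION} tags, drops the item name that precedes {QUESTION}, and splits the rest into sentences. The function name is hypothetical, and the regex-based sentence split is only a stand-in for the NLTK splitter that the real method receives.

```python
import re


def extract_intro_sentences(raw_item, item_name):
    """Return the introduction sentences of a raw item (illustrative only)."""
    match = re.search(r"\{INTRO\}(.*?)\{QUESTION\}", raw_item, flags=re.DOTALL)
    if match is None:
        return []
    intro = match.group(1).strip()
    # The item name (e.g. B1) sits right before the {QUESTION} tag.
    if item_name and intro.endswith(item_name):
        intro = intro[: -len(item_name)].strip()
    # Naive sentence split, used here instead of the NLTK splitter.
    sentences = re.split(r"(?<=[.!?])\s+", intro)
    return [sentence for sentence in sentences if sentence]


raw_item = (
    "{INTRO} Ara m’agradaria fer-li algunes preguntes sobre política i el govern.\n\n"
    "B1 {QUESTION} En quina mesura diria vostè que l’interessa la política?"
)
intro_sentences = extract_intro_sentences(raw_item, "B1")
```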

preprocessing.txt2spreadsheet.process_question_segment(raw_item, survey_item_prefix, study, item_name, df_questionnaire, splitter, country_language)[source]

Extracts and processes the question segments from a raw item. The question segments are always between the {QUESTION} and {ANSWERS} tags, for instance:

G2 {QUESTION} Per a ell és important ser ric. Vol tenir molts diners i coses cares.

{ANSWERS} Se sembla molt a mi Se sembla a mi Se sembla una mica a mi Se sembla poc a mi No se sembla a mi No se sembla gens a mi

Parameters
  • raw_item (param1) – raw survey item, retrieved in previous steps.

  • survey_item_prefix (param2) – prefix of survey_item_ID.

  • study (param3) – metadata parameter about study embedded in the file name.

  • item_name (param4) – item_name metadata parameter, retrieved in previous steps.

  • df_questionnaire (param5) – pandas dataframe to store questionnaire data.

  • splitter (param6) – sentence segmentation from NLTK library.

  • country_language (param7) – country_language metadata, embedded in file name.

Returns

updated df_questionnaire when new valid question segments are included, or df_questionnaire in the same state it was when no new valid question segments were included.
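
The tag layout above lends itself to a simple split. The following sketch (a hypothetical helper, not the project's code) isolates the text between {QUESTION} and {ANSWERS} using only string partitioning; sentence segmentation and metadata attribution are left out.

```python
def split_question_and_answers(raw_item):
    """Return (question_text, answers_text) for a raw item.

    Both values are empty strings when the corresponding tag is missing.
    """
    _, _, after_question = raw_item.partition("{QUESTION}")
    question, _, answers = after_question.partition("{ANSWERS}")
    return question.strip(), answers.strip()


question_text, answers_text = split_question_and_answers(
    "G2 {QUESTION} Per a ell és important ser ric. "
    "Vol tenir molts diners i coses cares.\n\n"
    "{ANSWERS} Se sembla molt a mi"
)
```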

preprocessing.txt2spreadsheet.retrieve_raw_items_from_file(file)[source]

Extracts the raw items from an ESS plain text file, based on an item name regex pattern. Also excludes blank lines and non-relevant scale items.

Parameters

file (param1) – input ESS plain text file.

Returns

retrieved raw items (list of strings).

preprocessing.txt2spreadsheet.set_initial_structures(filename)[source]

Set initial structures that are necessary for the extraction of each questionnaire.

Parameters

filename (param1) – name of the input file.

Returns

df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), study/country_language, which are metadata parameters embedded in the file name (string and string) and sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).

Python3 script to transform XML ESS data into the spreadsheet format used as input for MCSQ.

Author: Danielly Sorato

Author contact: danielly.sorato@gmail.com

preprocessing.ess_xml_data_extraction.adjust_item_name(item_name)[source]

Adjust item_name inconsistencies (and item_type in some cases) present in source XML file.

Parameters

item_name (param1) – item_name metadata, extracted from input file.

Returns

adjusted item_name and item_type metadata.

preprocessing.ess_xml_data_extraction.clean(text)[source]

Cleans the question or instruction segment, by standardizing the text and removing undesired elements.

Parameters

text (param1) – question or instruction segment currently being analyzed.

Returns

standardized question or instruction text (string).

preprocessing.ess_xml_data_extraction.clean_answer_category(text)[source]

Cleans the answer segment, by standardizing the text and removing undesired elements.

Parameters

text (param1) – answer segment currently being analyzed.

Returns

standardized answer text (string).

preprocessing.ess_xml_data_extraction.get_answer_id(node, parent_map)[source]

Gets the answer id from the node attributes, if it exists.

Parameters
  • node (param1) – current xml tree node being analyzed in outer loop.

  • parent_map (param2) – a dictionary containing information about parent-child relationships in XML tree.

Returns

answer_id (string) if it exists, otherwise None.
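
For readers unfamiliar with the parent_map argument: xml.etree.ElementTree elements do not keep a reference to their parent, so a child-to-parent dictionary is commonly built once and passed around. The sketch below shows that idiom and a lookup that falls back to the parent node. The helper names, the fallback itself and the attribute name "id" are assumptions made for illustration, not the actual ESS XML schema.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element in the tree to its parent element."""
    return {child: parent for parent in root.iter() for child in parent}


def get_attribute_or_none(node, parent_map, attribute="id"):
    """Read an attribute from the node, else from its parent, else None."""
    if attribute in node.attrib:
        return node.attrib[attribute]
    parent = parent_map.get(node)
    # Compare with None explicitly: an Element without children is falsy.
    if parent is not None:
        return parent.attrib.get(attribute)
    return None


root = ET.fromstring('<answers id="7"><answer>Yes</answer></answers>')
parent_map = build_parent_map(root)
answer_id = get_attribute_or_none(root[0], parent_map)
```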

preprocessing.ess_xml_data_extraction.identify_showcard_instruction(text, country_language)[source]

Identifies showcard instructions using language-specific definitions of the word ‘card’ used in the ESS files. If the text matches the word, then it is a showcard instruction.

Parameters
  • text (param1) – text segment being analyzed.

  • country_language (param2) – country and language metadata, embedded in the name of the input file.

Returns

item_type (string): instruction if the text is a showcard instruction, otherwise request.
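
A minimal sketch of this kind of check is shown below. The word list, the assumption that the language code is the last part of country_language (for example GB_ENG), and the function name are all illustrative; the words actually matched by the project may differ.

```python
import re

# Hypothetical word list keyed by language code; not the project's own table.
CARD_WORDS = {
    "ENG": "card",
    "SPA": "tarjeta",
    "CAT": "targeta",
    "GER": "karte",
    "FRE": "carte",
}


def showcard_item_type(text, country_language):
    """Return 'instruction' when the text mentions a showcard, else 'request'."""
    language = country_language.split("_")[-1]
    word = CARD_WORDS.get(language)
    if word and re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
        return "instruction"
    return "request"


card_type = showcard_item_type("Please use this card.", "GB_ENG")
plain_type = showcard_item_type("How old are you?", "GB_ENG")
```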

preprocessing.ess_xml_data_extraction.process_answer_node(ess_answers, df_answers, parent_map, ess_special_answer_categories, extract_source)[source]

Iterates through answer nodes to extract answer segments.

Parameters
  • ess_answers (param1) – answer nodes.

  • df_answers (param2) – a dataframe to store processed answer segments

  • parent_map (param3) – a dictionary containing information about parent-child relationships in XML tree.

  • ess_special_answer_categories (param4) – instance of SpecialAnswerCategories object, in accordance with the country_language.

  • extract_source (param5) – flag that indicates if the script should extract the ENG_SOURCE data or the target language.

Returns

Updated df_answers dataframe, with new answer segments.

preprocessing.ess_xml_data_extraction.process_question_instruction_node(ess_questions_instructions, df_question_instruction, parent_map, splitter, country_language, extract_source)[source]

Iterates through question nodes to extract questions and instructions (introductions are not present in the metadata).

Parameters
  • ess_questions_instructions (param1) – question and instruction nodes.

  • df_question_instruction (param2) – a dataframe to store processed question and instruction segments

  • parent_map (param3) – a dictionary containing information about parent-child relationships in XML tree.

  • splitter (param4) – sentence segmentation from NLTK library.

  • country_language (param5) – country and language metadata, extracted from the input file name.

  • extract_source (param6) – flag that indicates if the script should extract the ENG_SOURCE data or the target language.

Returns

Updated df_question_instruction dataframe, with new question and instruction segments.

preprocessing.ess_xml_data_extraction.segment_question_instruction(df_question_instruction, parent_map, node, item_name, item_type, splitter, country_language)[source]

Extracts the question/instruction text segments from a node, if the node text exists. Introduction segments are not present in the metadata.

Parameters
  • df_question_instruction (param1) – a dataframe to store processed question and instruction segments

  • parent_map (param2) – a dictionary containing information about parent-child relationships in XML tree.

  • node (param3) – XML node being analyzed.

  • item_name (param4) – item name metadata extracted from node.attrib[‘name’]

  • item_type (param5) – item type metadata inferred from parent_map[node].attrib[‘type_name’]

  • splitter (param6) – Sentence segmenter object from NLTK

  • country_language (param7) – country and language metadata, extracted from the input file name.

Returns

Updated df_question_instruction dataframe, with new question and instruction segments.

preprocessing.ess_xml_data_extraction.set_initial_structures(filename, extract_source)[source]

Set initial structures that are necessary for the extraction of each questionnaire.

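Parameters
  • filename (param1) – name of the input file.

  • extract_source (param2) – flag that indicates if the script should extract the ENG_SOURCE data or the target language.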

Returns

df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), study/country_language, which are metadata parameters embedded in the file name (string and string) and sentence splitter to segment request/instruction segments when necessary (NLTK object).

preprocessing.preprocessing_ess_utils.check_if_answer_is_special_category(text, answer_value, ess_special_answer_categories)[source]

Verifies if a given answer segment is one of the special answer categories, by testing the answer text against the attributes of SpecialAnswerCategories object. This method serves the purpose of standardizing the special answer category values.

Parameters
  • text (param1) – answer segment currently being analyzed.

  • answer_value (param2) – answer category value, defined in clean_answer() method.

  • ess_special_answer_categories (param3) – instance of SpecialAnswerCategories object, in accordance with the country_language.

Returns

answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.
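
The standardization idea can be sketched as below. The stand-in class, its attribute names, and every text and value in it are placeholders invented for this example; the real SpecialAnswerCategories classes define their own texts and values per language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialAnswerCategoriesDemo:
    """Hypothetical stand-in holding (text, category value) pairs."""
    dont_know: tuple = ("Don't know", "888")
    refuse: tuple = ("Refusal", "777")
    dont_apply: tuple = ("Not applicable", "999")
    write_down: tuple = ("Write down", "997")


def standardize_special_answer(text, answer_value, categories):
    """Return the canonical (text, value) pair for special answers."""
    normalized = text.strip().lower()
    special = (
        categories.dont_know,
        categories.refuse,
        categories.dont_apply,
        categories.write_down,
    )
    for canonical_text, canonical_value in special:
        if normalized == canonical_text.lower():
            return canonical_text, canonical_value
    # Not a special category: keep the text and value unchanged.
    return text, answer_value


demo = SpecialAnswerCategoriesDemo()
special_result = standardize_special_answer("don't know ", "8", demo)
regular_result = standardize_special_answer("Very much", "1", demo)
```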

preprocessing.preprocessing_ess_utils.check_if_segment_is_instruction(sentence, country_language)[source]

Calls the appropriate instruction recognition method, according to the language.

Parameters
  • sentence (param1) – sentence being analyzed in outer loop of data extraction.

  • country_language (param2) – country_language metadata, embedded in file name.

Returns

the return value of the selected instruction_recognition method (boolean).

preprocessing.preprocessing_ess_utils.clean_answer(text, ess_special_answer_categories)[source]

Cleans the answer segment by standardizing the text (when it is a special answer category) and attributing a category value to it.

Parameters
  • text (param1) – answer segment currently being analyzed.

  • ess_special_answer_categories (param2) – instance of SpecialAnswerCategories object, in accordance with the country_language.

Returns

answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.

preprocessing.preprocessing_ess_utils.clean_text(text)[source]

Cleans Request, Introduction and Instruction text segments by removing undesired characters and standardizing some character representations. A string input is expected; if the input is not a string instance, the method returns ‘’ so that the entry is ignored in the data extraction loop.

Parameters

text (param1) – text to be cleaned.

Returns

cleaned text (string).
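
The non-string guard described above might look like the following sketch. Only the guard reflects the documented behaviour; the specific replacements (straightening a curly apostrophe, collapsing whitespace) are assumptions chosen for illustration.

```python
import re


def clean_text_sketch(text):
    """Illustrative cleaning step; returns '' for non-string input."""
    if not isinstance(text, str):
        # E.g. NaN or None cells read from a spreadsheet are ignored.
        return ""
    text = text.replace("\u2019", "'")  # assumed character standardization
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.strip()


cleaned = clean_text_sketch("  l\u2019interessa \n la  política ")
ignored = clean_text_sketch(None)
```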

preprocessing.preprocessing_ess_utils.expand_interviewer_abbreviations(text, country_language)[source]

Switches abbreviations of the word interviewer for the full form.

Parameters
  • text (param1) – sentence being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

text (string) without abbreviations for the word interviewer, when applicable.

preprocessing.preprocessing_ess_utils.get_country_language_and_study_info(filename)[source]

Retrieves the country/language and study metadata based on the input filename, or survey_item_ID prefix. The filenames follow the naming rule SSS_RRR_YYYY_CC_LLL, where:
  • S = study name

  • R = round or wave

  • Y = study year

  • C = country (two-letter ISO code, except for SOURCE)

  • L = language

Parameters

filename (param1) – name of the input file.

Returns

country/language (string) and study metadata (string).
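
Under the naming rule above, the split could be sketched as follows. The function name and the example filename are hypothetical, and the sketch assumes that the study metadata is the first three underscore-separated parts and the country/language the next two.

```python
from pathlib import Path


def split_filename_metadata(filename):
    """Return (country_language, study) from an SSS_RRR_YYYY_CC_LLL name."""
    parts = Path(filename).stem.split("_")
    study = "_".join(parts[:3])               # SSS_RRR_YYYY
    country_language = "_".join(parts[3:5])   # CC_LLL
    return country_language, study


country_language, study = split_filename_metadata("ESS_R01_2002_ES_CAT.txt")
```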

preprocessing.preprocessing_ess_utils.instantiate_special_answer_category_object(country_language)[source]

Instantiates the SpecialAnswerCategories object that stores both the text and category values of the special answers (don’t know, refusal, not applicable and write down) in accordance with the country_language metadata parameter.

Parameters

country_language (param1) – country_language metadata parameter, embedded in file name.

Returns

instance of SpecialAnswerCategories object (Python object), in accordance with the country_language.

preprocessing.preprocessing_ess_utils.instruction_recognition_catalan_spanish(text, country_language)[source]

Recognizes an instruction segment for texts written either in Spanish or Catalan, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Spanish or Catalan) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_czech(text, country_language)[source]

Recognizes an instruction segment for texts written in Czech, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Czech) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_english(text, country_language)[source]

Recognizes an instruction segment for texts written in English, based on regex named groups patterns.

Parameters
  • text (param1) – text (in English) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.
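
To illustrate what recognition based on regex named groups can look like, here is a small sketch for English. The three patterns are examples of typical interviewer instructions and are assumptions; they are not the patterns used by the project.

```python
import re

# Each alternative is a named group, so a match also tells which rule fired.
INSTRUCTION_PATTERN = re.compile(
    r"(?P<read_out>\bread out\b)"
    r"|(?P<routing>\bask (?:all|if)\b)"
    r"|(?P<interviewer>\binterviewer\b)",
    flags=re.IGNORECASE,
)


def is_instruction(text):
    """Return True when any of the named patterns occurs in the text."""
    return INSTRUCTION_PATTERN.search(text) is not None


def matched_rule(text):
    """Return the name of the group that matched, or None."""
    match = INSTRUCTION_PATTERN.search(text)
    return match.lastgroup if match else None


instruction_found = is_instruction("READ OUT each statement")
question_found = is_instruction("How interested would you say you are in politics?")
```

Naming the groups keeps a long alternation readable and makes it possible to log which rule classified a segment as an instruction.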

preprocessing.preprocessing_ess_utils.instruction_recognition_french(text, country_language)[source]

Recognizes an instruction segment for texts written in French, based on regex named groups patterns.

Parameters
  • text (param1) – text (in French) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_german(text, country_language)[source]

Recognizes an instruction segment for texts written in German, based on regex named groups patterns.

Parameters
  • text (param1) – text (in German) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_norwegian(text, country_language)[source]

Recognizes an instruction segment for texts written in Norwegian, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Norwegian) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_portuguese(text, country_language)[source]

Recognizes an instruction segment for texts written in Portuguese, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Portuguese) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.instruction_recognition_russian(text, country_language)[source]

Recognizes an instruction segment for texts written in Russian, based on regex named groups patterns.

Parameters
  • text (param1) – text (in Russian) currently being analyzed.

  • country_language (param2) – country_language metadata embedded in file name.

Returns

True if the segment is an instruction or False if it is not.

preprocessing.preprocessing_ess_utils.remove_spaces_from_item_name(item_name)[source]

Removes spaces from item names such as A 1, because the MCSQ standard uses item names without spaces (A1).

Parameters

item_name (param1) – item_name retrieved from the input file.

Returns

item_name (string) without spaces.
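
A one-line sketch of this normalization, assuming that any whitespace character (not only a plain space) should be dropped; the function name is hypothetical:

```python
import re


def strip_item_name_spaces(item_name):
    """Remove all whitespace from an item name, e.g. 'A 1' becomes 'A1'."""
    return re.sub(r"\s+", "", item_name)


normalized_name = strip_item_name_spaces("A 1")
```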

preprocessing.preprocessing_ess_utils.retrieve_item_module(item_name, study)[source]

Retrieves the module of the survey_item, based on information from the ESSModulesRRR objects. This information comes from the source questionnaires.

Parameters
  • item_name (param1) – name of survey item, retrieved in previous steps.

  • study (param2) – study metadata, embedded in the file name.

Returns

module of survey_item (string).

preprocessing.preprocessing_ess_utils.retrieve_supplementary_module(essmodules, item_name)[source]

Matches the item_name against the dictionary stored in the ESSModulesRRR objects. Rotating/supplementary modules are defined by round because they may change from round to round.

Parameters
  • essmodules (param1) – ESSModulesRRR object, instantiated according to the round.

  • item_name (param2) – name of survey item, retrieved in previous steps.

Returns

matching value for item name (string).
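
The dictionary match can be pictured with the sketch below. The stand-in class, its modules attribute, the module titles, and the rule of matching on the leading letters of the item name are all assumptions made for illustration.

```python
import re


class ModulesDemo:
    """Hypothetical stand-in for a per-round modules object."""
    modules = {
        "D": "D - Placeholder rotating module",
        "E": "E - Another placeholder rotating module",
    }


def lookup_module(essmodules, item_name):
    """Return the module whose key equals the item name's letter prefix."""
    prefix = re.match(r"[A-Za-z]+", item_name)
    if prefix is None:
        return None
    return essmodules.modules.get(prefix.group(0).upper())


module_name = lookup_module(ModulesDemo(), "D12")
```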

preprocessing.preprocessing_ess_utils.standardize_study_metadata(study)[source]

Transforms study metadata present in the input file to the standard used in the MCSQ format.

Parameters

study (param1) – study metadata extracted from input file (Study column).

Returns

Standardized study parameter (string).

preprocessing.preprocessing_ess_utils.standardize_supplementary_item_name(item_name)[source]

Standardizes the item name metadata of supplementary modules G, H and I.

Parameters

item_name (param1) – item_name metadata, extracted from input file.

Returns

Standardized item_name, when applicable.

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesCAT[source]

Class encapsulating special answer categories for Catalan

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesCZE[source]

Class encapsulating special answer categories for Czech

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesENG[source]

Class encapsulating special answer categories for English

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesFRE[source]

Class encapsulating special answer categories for French

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesGER[source]

Class encapsulating special answer categories for German

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesNOR[source]

Class encapsulating special answer categories for Norwegian

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesPOR[source]

Class encapsulating special answer categories for Portuguese

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesRUS_EE[source]

Class encapsulating special answer categories for Russian from Estonia

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesRUS_IL[source]

Class encapsulating special answer categories for Russian from Israel

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesRUS_LT[source]

Class encapsulating special answer categories for Russian from Lithuania

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesRUS_LV[source]

Class encapsulating special answer categories for Russian from Latvia

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesRUS_RU_UA[source]

Class encapsulating special answer categories for Russian from Russian Federation and Ukraine

class preprocessing.ess_special_answer_categories.SpecialAnswerCategoriesSPA[source]

Class encapsulating special answer categories for Spanish

class preprocessing.essmodules.ESSSModulesR01[source]

Rotating modules in ESS round 1.

class preprocessing.essmodules.ESSSModulesR02[source]

Rotating modules in ESS round 2.

class preprocessing.essmodules.ESSSModulesR03[source]

Rotating modules in ESS round 3.

class preprocessing.essmodules.ESSSModulesR04[source]

Rotating modules in ESS round 4.

class preprocessing.essmodules.ESSSModulesR05[source]

Rotating modules in ESS round 5.

class preprocessing.essmodules.ESSSModulesR06[source]

Rotating modules in ESS round 6.

class preprocessing.essmodules.ESSSModulesR07[source]

Rotating modules in ESS round 7.

class preprocessing.essmodules.ESSSModulesR08[source]

Rotating modules in ESS round 8.

class preprocessing.essmodules.ESSSModulesR09[source]

Rotating modules in ESS round 9.