ESS data extraction, preprocessing and treatment methods¶
-
preprocessing.txt2spreadsheet.
main
(folder_path, has_supplementary)[source]¶ Main method of the ESS plain text to spreadsheet data transformation algorithm. The data is extracted from the plain text file (which follows an internal specification of the MCSQ project), preprocessed, and annotated with the appropriate metadata.
The algorithm outputs a CSV representation of df_questionnaire, the pandas dataframe used to store questionnaire data.
- Parameters
folder_path (param1) – path to the folder where the plain text files are.
has_supplementary (param2) – boolean variable that indicates if there is a supplementary spreadsheet to be appended.
-
preprocessing.txt2spreadsheet.
process_answer_segment
(raw_item, survey_item_prefix, study, item_name, df_questionnaire, country_language)[source]¶ Extracts and processes the answer segments from a raw item. The answer segments always come after the {ANSWERS} tag. If there are no answer segments, then the answer segment is the one corresponding to ‘write down’ in the target language.
- Parameters
raw_item (param1) – raw survey item, retrieved in previous steps.
survey_item_prefix (param2) – prefix of survey_item_ID.
study (param3) – metadata parameter about study embedded in the file name.
item_name (param4) – item_name metadata parameter, retrieved in previous steps.
df_questionnaire (param5) – pandas dataframe to store questionnaire data.
country_language (param6) – country_language metadata, embedded in file name.
- Returns
df_questionnaire, updated with any new valid answer segments, or returned unchanged when no new valid answer segments were found.
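The fallback rule described above can be illustrated with a minimal sketch. It is not the project's implementation: the helper name `extract_answer_segments`, the `write_down_text` argument, and the assumption that each answer sits on its own line are all hypothetical.

```python
ANSWERS_TAG = "{ANSWERS}"


def extract_answer_segments(raw_item, write_down_text):
    """Return the answer lines that follow the {ANSWERS} tag.

    Hypothetical helper: assumes one answer per line and falls back to a
    single 'write down' answer when the tag is missing or nothing follows it.
    """
    _, tag, tail = raw_item.partition(ANSWERS_TAG)
    answers = [line.strip() for line in tail.splitlines() if line.strip()]
    if not tag or not answers:
        return [write_down_text]
    return answers


raw = "B1 {QUESTION} En quina mesura diria vostè...?\n{ANSWERS}\nMolt\nBastant\nPoc\nGens"
print(extract_answer_segments(raw, "Write down"))
```

In the real pipeline the 'write down' text would presumably come from the SpecialAnswerCategories object for the target language rather than from a plain argument.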
-
preprocessing.txt2spreadsheet.
process_intro_segment
(raw_item, survey_item_prefix, study, item_name, df_questionnaire, splitter)[source]¶ Extracts and processes the introduction segments from a raw item. The introduction segments are always between the item name and {QUESTION} tag, for instance:
{INTRO} Ara m’agradaria fer-li algunes preguntes sobre política i el govern.
B1 {QUESTION} En quina mesura diria vostè que l’interessa la política? Vostè diria que l’interessa…
{ANSWERS} Molt Bastant Poc Gens
- Parameters
raw_item (param1) – raw survey item, retrieved in previous steps.
survey_item_prefix (param2) – prefix of survey_item_ID.
study (param3) – metadata parameter about study embedded in the file name.
item_name (param4) – item_name metadata parameter, retrieved in previous steps.
df_questionnaire (param5) – pandas dataframe to store questionnaire data.
splitter (param6) – sentence segmentation from NLTK library.
- Returns
df_questionnaire, updated with any new valid introduction segments, or returned unchanged when no new valid introduction segments were found.
-
preprocessing.txt2spreadsheet.
process_question_segment
(raw_item, survey_item_prefix, study, item_name, df_questionnaire, splitter, country_language)[source]¶ Extracts and processes the question segments from a raw item. The question segments are always between the {QUESTION} and {ANSWERS} tags, for instance:
G2 {QUESTION} Per a ell és important ser ric. Vol tenir molts diners i coses cares.
{ANSWERS} Se sembla molt a mi Se sembla a mi Se sembla una mica a mi Se sembla poc a mi No se sembla a mi No se sembla gens a mi
- Parameters
raw_item (param1) – raw survey item, retrieved in previous steps.
survey_item_prefix (param2) – prefix of survey_item_ID.
study (param3) – metadata parameter about study embedded in the file name.
item_name (param4) – item_name metadata parameter, retrieved in previous steps.
df_questionnaire (param5) – pandas dataframe to store questionnaire data.
splitter (param6) – sentence segmentation from NLTK library.
country_language (param7) – country_language metadata, embedded in file name.
- Returns
df_questionnaire, updated with any new valid question segments, or returned unchanged when no new valid question segments were found.
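As a rough illustration of the tag layout, the text between {QUESTION} and {ANSWERS} can be captured with a regular expression. This is only a sketch under assumptions: the pattern and the helper name `extract_question_text` are hypothetical, and sentence segmentation with the NLTK splitter is left out.

```python
import re

# Hypothetical pattern: capture everything after {QUESTION} up to {ANSWERS},
# or up to the end of the item when no {ANSWERS} tag is present.
QUESTION_RE = re.compile(
    r"\{QUESTION\}(?P<question>.*?)(?:\{ANSWERS\}|\Z)", re.DOTALL
)


def extract_question_text(raw_item):
    """Return the whitespace-normalized question text, or '' if no tag is found."""
    match = QUESTION_RE.search(raw_item)
    if match is None:
        return ""
    return " ".join(match.group("question").split())


raw_item = (
    "G2 {QUESTION} Per a ell és important ser ric. "
    "Vol tenir molts diners i coses cares.\n"
    "{ANSWERS} Se sembla molt a mi"
)
print(extract_question_text(raw_item))
```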
-
preprocessing.txt2spreadsheet.
retrieve_raw_items_from_file
(file)[source]¶ Extracts the raw items from an ESS plain text file, based on an item name regex pattern. Blank lines and irrelevant scale items are excluded.
- Parameters
file (param1) – input ESS plain text file.
- Returns
retrieved raw items (list of strings).
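One way to picture the item retrieval is to start a new raw item whenever a line begins with something that looks like an item name. The sketch below is a simplification under stated assumptions: the regex is a hypothetical stand-in for the project's item name pattern, and the handling of {INTRO} lines and scale items is omitted.

```python
import re

# Hypothetical item name pattern: a capital letter, one to three digits and
# an optional lowercase letter, e.g. "B1", "G2" or "F14a".
ITEM_NAME_RE = re.compile(r"^[A-Z]\d{1,3}[a-z]?\b")


def group_lines_into_raw_items(lines):
    """Group non-blank lines into raw items, one per detected item name."""
    raw_items = []
    current = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are dropped
        if ITEM_NAME_RE.match(line) and current:
            raw_items.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        raw_items.append("\n".join(current))
    return raw_items


lines = ["B1 {QUESTION} a?", "", "{ANSWERS}", "Molt", "G2 {QUESTION} b", "{ANSWERS}", "x"]
print(group_lines_into_raw_items(lines))
```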
-
preprocessing.txt2spreadsheet.
set_initial_structures
(filename)[source]¶ Set initial structures that are necessary for the extraction of each questionnaire.
- Parameters
filename (param1) – name of the input file.
- Returns
df_questionnaire, a pandas dataframe to store questionnaire data; survey_item_prefix, the prefix of survey_item_ID (string); study and country_language, metadata parameters embedded in the file name (strings); and a sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).
Python 3 script to transform ESS XML data into the spreadsheet format used as input for MCSQ.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
-
preprocessing.ess_xml_data_extraction.
adjust_item_name
(item_name)[source]¶ Adjust item_name inconsistencies (and item_type in some cases) present in source XML file.
- Parameters
item_name (param1) – item_name metadata, extracted from input file.
- Returns
adjusted item_name and item_type metadata.
-
preprocessing.ess_xml_data_extraction.
clean
(text)[source]¶ Cleans the question or instruction segment, by standardizing the text and removing undesired elements.
- Parameters
text (param1) – question or instruction segment currently being analyzed.
- Returns
standardized question or instruction text (string).
-
preprocessing.ess_xml_data_extraction.
clean_answer_category
(text)[source]¶ Cleans the answer segment, by standardizing the text and removing undesired elements.
- Parameters
text (param1) – answer segment currently being analyzed.
- Returns
standardized answer text (string).
-
preprocessing.ess_xml_data_extraction.
get_answer_id
(node, parent_map)[source]¶ Gets the answer id from the node attributes, if it exists.
- Parameters
node (param1) – current xml tree node being analyzed in outer loop.
parent_map (param2) – a dictionary containing information about parent-child relationships in XML tree.
- Returns
answer_id (string) if it exists, otherwise None.
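The parent_map argument reflects the fact that ElementTree nodes do not keep a reference to their parent. A common idiom for building such a dictionary is sketched below. The element names, the `id` attribute and the helper `answer_id_or_none` are hypothetical, since the actual ESS XML schema is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment, not the real ESS schema.
xml_text = """
<questionnaire>
  <question name="B1">
    <answer id="1">Molt</answer>
    <answer>Bastant</answer>
  </question>
</questionnaire>
""".strip()

root = ET.fromstring(xml_text)

# Map every child element to its parent element.
parent_map = {child: parent for parent in root.iter() for child in parent}


def answer_id_or_none(node):
    """Return the node's 'id' attribute if present, otherwise None."""
    return node.attrib.get("id")


answers = root.findall(".//answer")
for node in answers:
    print(parent_map[node].attrib["name"], answer_id_or_none(node), node.text)
```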
-
preprocessing.ess_xml_data_extraction.
identify_showcard_instruction
(text, country_language)[source]¶ Language specific definitions of the word ‘card’ used in the ESS files. If the text matches the word, then it is a showcard instruction.
- Parameters
text (param1) – text segment being analyzed.
country_language (param2) – country and language metadata, embedded in the name of the input file.
- Returns
item_type (string): instruction if the text is a showcard instruction, otherwise request.
-
preprocessing.ess_xml_data_extraction.
process_answer_node
(ess_answers, df_answers, parent_map, ess_special_answer_categories, extract_source)[source]¶ Iterates through answer nodes to extract answer segments.
- Parameters
ess_answers (param1) – answer nodes.
df_answers (param2) – a dataframe to store processed answer segments.
parent_map (param3) – a dictionary containing information about parent-child relationships in XML tree.
ess_special_answer_categories (param4) – instance of SpecialAnswerCategories object, in accordance with the country_language.
extract_source (param5) – flag that indicates if the script should extract the ENG_SOURCE data or the target language.
- Returns
Updated df_answers dataframe, with new answer segments.
-
preprocessing.ess_xml_data_extraction.
process_question_instruction_node
(ess_questions_instructions, df_question_instruction, parent_map, splitter, country_language, extract_source)[source]¶ Iterates through question nodes to extract questions and instructions (introductions are not present in the metadata).
- Parameters
ess_questions_instructions (param1) – question and instruction nodes.
df_question_instruction (param2) – a dataframe to store processed question and instruction segments
parent_map (param3) – a dictionary containing information about parent-child relationships in XML tree.
splitter (param4) – sentence segmentation from NLTK library.
country_language (param5) – country and language metadata, extracted from the input file name.
extract_source (param6) – flag that indicates if the script should extract the ENG_SOURCE data or the target language.
- Returns
Updated df_question_instruction dataframe, with new question and instruction segments.
-
preprocessing.ess_xml_data_extraction.
segment_question_instruction
(df_question_instruction, parent_map, node, item_name, item_type, splitter, country_language)[source]¶ Extracts the question/instruction text segments from a node, if the node text exists. Introductions are not present in the metadata.
- Parameters
df_question_instruction (param1) – a dataframe to store processed question and instruction segments
parent_map (param2) – a dictionary containing information about parent-child relationships in XML tree.
node (param3) – XML node being analyzed.
item_name (param4) – item name metadata extracted from node.attrib[‘name’]
item_type (param5) – item type metadata inferred from parent_map[node].attrib[‘type_name’]
splitter (param6) – Sentence segmenter object from NLTK
country_language (param7) – country and language metadata, extracted from the input file name.
- Returns
Updated df_question_instruction dataframe, with new question and instruction segments.
-
preprocessing.ess_xml_data_extraction.
set_initial_structures
(filename, extract_source)[source]¶ Set initial structures that are necessary for the extraction of each questionnaire.
- Parameters
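filename (param1) – name of the input file.
extract_source (param2) – flag that indicates if the script should extract the ENG_SOURCE data or the target language.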
- Returns
df_questionnaire, a pandas dataframe to store questionnaire data; survey_item_prefix, the prefix of survey_item_ID (string); study and country_language, metadata parameters embedded in the file name (strings); and a sentence splitter to segment request/instruction segments when necessary (NLTK object).
-
preprocessing.preprocessing_ess_utils.
check_if_answer_is_special_category
(text, answer_value, ess_special_answer_categories)[source]¶ Verifies if a given answer segment is one of the special answer categories, by testing the answer text against the attributes of SpecialAnswerCategories object. This method serves the purpose of standardizing the special answer category values.
- Parameters
text (param1) – answer segment currently being analyzed.
answer_value (param2) – answer category value, defined in clean_answer() method.
ess_special_answer_categories (param3) – instance of SpecialAnswerCategories object, in accordance with the country_language.
- Returns
answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.
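The standardization idea can be sketched as follows: when an answer matches a known surface form of a special category, the canonical text and value stored in the categories object replace the extracted ones. Everything in this sketch is hypothetical, including the class name, the attribute names, the category values and the surface variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialAnswerCategoriesSketch:
    """Hypothetical stand-in: each attribute is a (text, category value) pair."""
    dont_know: tuple = ("Don't know", "888")
    refuse: tuple = ("Refusal", "777")


# Hypothetical surface variants that map to each special category.
VARIANTS = {
    "dont_know": {"don't know", "(don't know)", "dk"},
    "refuse": {"refusal", "(refusal)", "refused"},
}


def standardize_special_answer(text, answer_value, categories):
    """Return the canonical (text, value) for special answers, else the inputs."""
    normalized = text.strip().lower()
    for attribute, variants in VARIANTS.items():
        if normalized in variants:
            return getattr(categories, attribute)
    return text, answer_value


categories = SpecialAnswerCategoriesSketch()
print(standardize_special_answer("(Don't know)", "8", categories))
print(standardize_special_answer("Molt", "1", categories))
```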
-
preprocessing.preprocessing_ess_utils.
check_if_segment_is_instruction
(sentence, country_language)[source]¶ Calls the appropriate instruction recognition method, according to the language.
- Parameters
sentence (param1) – sentence being analyzed in outer loop of data extraction.
country_language (param2) – country_language metadata, embedded in file name.
- Returns
the return value of the language-specific instruction_recognition method, passed through unchanged (boolean).
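A dispatcher of this kind is often written as a dictionary lookup from language code to recognizer. The sketch below assumes country_language contains a three-letter language code separated by an underscore. The recognizers, their patterns and the function name `dispatch_instruction_check` are hypothetical and much simpler than language-specific rules would need to be.

```python
import re


def _recognize_english(text):
    # Hypothetical cue words for English instructions.
    return re.search(r"\b(?:interviewer|showcard)\b", text, re.IGNORECASE) is not None


def _recognize_german(text):
    # Hypothetical cue words for German instructions.
    return re.search(r"\b(?:interviewer|karte)\b", text, re.IGNORECASE) is not None


RECOGNIZERS = {"ENG": _recognize_english, "GER": _recognize_german}


def dispatch_instruction_check(sentence, country_language):
    """Call the recognizer registered for the language code, if any."""
    for part in country_language.split("_"):
        recognizer = RECOGNIZERS.get(part)
        if recognizer is not None:
            return recognizer(sentence)
    return False  # unknown language: treat the sentence as a non-instruction


print(dispatch_instruction_check("INTERVIEWER: read out", "GB_ENG"))
print(dispatch_instruction_check("How old are you?", "GB_ENG"))
```

Looking up each underscore-separated part keeps the sketch independent of whether the country or the language code comes first.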
-
preprocessing.preprocessing_ess_utils.
clean_answer
(text, ess_special_answer_categories)[source]¶ Cleans the answer segment, by standardizing the text (when it is a special answer category), and attributing a category value to it.
- Parameters
text (param1) – answer segment currently being analyzed.
ess_special_answer_categories (param2) – instance of SpecialAnswerCategories object, in accordance with the country_language.
- Returns
answer text (string) and its category value (string). When the answer is a special answer category, the text and category values are the ones stored in the SpecialAnswerCategories object.
-
preprocessing.preprocessing_ess_utils.
clean_text
(text)[source]¶ Cleans Request, Introduction and Instruction text segments by removing undesired characters and standardizing some character representations. A string input is expected; if the input is not a string instance, the method returns ‘’ so that the entry is ignored in the data extraction loop.
- Parameters
text (param1) – text to be cleaned.
- Returns
cleaned text (string).
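The guard against non-string input can be sketched as below. Which characters the real method removes or standardizes is not documented here, so the quote and whitespace handling in `clean_text_sketch` are assumptions chosen only for illustration.

```python
import re


def clean_text_sketch(text):
    """Return a cleaned string, or '' when the input is not a string."""
    if not isinstance(text, str):
        return ""  # e.g. NaN or None cells are skipped by the caller
    # Assumed standardization: curly quotes become straight quotes.
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Assumed cleanup: tabs, newlines and repeated spaces collapse to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(repr(clean_text_sketch(float("nan"))))
print(repr(clean_text_sketch("  l\u2019interessa\tla   política ")))
```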
-
preprocessing.preprocessing_ess_utils.
expand_interviewer_abbreviations
(text, country_language)[source]¶ Switches abbreviations of the word interviewer for the full form.
- Parameters
text (param1) – sentence being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
text (string) without abbreviations for the word interviewer, when applicable.
-
preprocessing.preprocessing_ess_utils.
get_country_language_and_study_info
(filename)[source]¶ Retrieves the country/language and study metadata based on the input filename, or survey_item_ID prefix. The filenames follow the naming convention SSS_RRR_YYYY_CC_LLL, where:
S = study name
R = round or wave
Y = study year
C = country (two-letter ISO code, except for SOURCE)
L = language
- Parameters
filename (param1) – name of the input file.
- Returns
country/language (string) and study metadata (string).
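Given the naming convention above, the metadata can be recovered by splitting the file name on underscores. This is a minimal sketch, not the project's code: the function name `parse_filename`, the example file name, and the choice to keep study name, round and year together as the study string are assumptions.

```python
from pathlib import Path


def parse_filename(filename):
    """Split an SSS_RRR_YYYY_CC_LLL file name into (country_language, study)."""
    parts = Path(filename).stem.split("_")
    if len(parts) < 5:
        raise ValueError(f"unexpected file name: {filename!r}")
    study = "_".join(parts[:3])             # assumed: study name, round, year
    country_language = "_".join(parts[3:5])  # last two fields of the convention
    return country_language, study


print(parse_filename("ESS_R01_2002_ES_CAT.txt"))
```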
-
preprocessing.preprocessing_ess_utils.
instantiate_special_answer_category_object
(country_language)[source]¶ Instantiates the SpecialAnswerCategories object that stores both the text and category values of the special answers (don’t know, refusal, not applicable and write down) in accordance with the country_language metadata parameter.
- Parameters
country_language (param1) – country_language metadata parameter, embedded in file name.
- Returns
instance of SpecialAnswerCategories object (Python object), in accordance with the country_language.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_catalan_spanish
(text, country_language)[source]¶ Recognizes an instruction segment for texts written either in Spanish or Catalan, based on regex named groups patterns.
- Parameters
text (param1) – text (in Spanish or Catalan) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
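Regex named groups make it possible to tell which cue triggered a match. The sketch below shows the technique with two hypothetical cues for Spanish and Catalan (an interviewer mention and a showcard reference). The patterns and function names are assumptions and are not the project's actual rules.

```python
import re

# Hypothetical cues; each alternative is a named group.
INSTRUCTION_RE = re.compile(
    r"(?P<interviewer>\bentrevistador(?:a)?\b)"
    r"|(?P<showcard>\b(?:tarjeta|targeta)\s+\w+)",
    re.IGNORECASE,
)


def which_instruction_cue(text):
    """Return the name of the group that matched first, or None."""
    match = INSTRUCTION_RE.search(text)
    return match.lastgroup if match else None


def looks_like_instruction(text):
    """Boolean form, mirroring the True/False contract described above."""
    return which_instruction_cue(text) is not None


for sample in ("ENTREVISTADOR: mostrar tarjeta 5", "Mostrar targeta 12", "¿Cuántos años tiene?"):
    print(sample, "->", which_instruction_cue(sample))
```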
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_czech
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Czech, based on regex named groups patterns.
- Parameters
text (param1) – text (in Czech) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_english
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in English, based on regex named groups patterns.
- Parameters
text (param1) – text (in English) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_french
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in French, based on regex named groups patterns.
- Parameters
text (param1) – text (in French) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_german
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in German, based on regex named groups patterns.
- Parameters
text (param1) – text (in German) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_norwegian
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Norwegian, based on regex named groups patterns.
- Parameters
text (param1) – text (in Norwegian) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_portuguese
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Portuguese, based on regex named groups patterns.
- Parameters
text (param1) – text (in Portuguese) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
instruction_recognition_russian
(text, country_language)[source]¶ Recognizes an instruction segment for texts written in Russian, based on regex named groups patterns.
- Parameters
text (param1) – text (in Russian) currently being analyzed.
country_language (param2) – country_language metadata embedded in file name.
- Returns
True if the segment is an instruction or False if it is not.
-
preprocessing.preprocessing_ess_utils.
remove_spaces_from_item_name
(item_name)[source]¶ Removes spaces from item names such as A 1, because the MCSQ standard requires item names without spaces (A1).
- Parameters
item_name (param1) – item_name retrieved from the input file.
- Returns
item_name (string) without spaces.
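A minimal sketch of this normalization, assuming that every kind of whitespace inside the item name should be dropped (the helper name is hypothetical):

```python
import re


def remove_spaces_sketch(item_name):
    """Drop all whitespace so that 'A 1' becomes 'A1'."""
    return re.sub(r"\s+", "", item_name)


print(remove_spaces_sketch("A 1"))
```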
-
preprocessing.preprocessing_ess_utils.
retrieve_item_module
(item_name, study)[source]¶ Retrieves the module of the survey_item, based on information from the ESSModulesRRR objects. This information comes from the source questionnaires.
- Parameters
item_name (param1) – name of survey item, retrieved in previous steps.
study (param2) – study metadata, embedded in the file name.
- Returns
module of survey_item (string).
-
preprocessing.preprocessing_ess_utils.
retrieve_supplementary_module
(essmodules, item_name)[source]¶ Matches the item_name against the dictionary stored in the ESSModulesRRR objects. Rotating/supplementary modules are defined by round because they may change from round to round.
- Parameters
essmodules (param1) – ESSModulesRRR object, instantiated according to the round.
item_name (param2) – name of survey item, retrieved in previous steps.
- Returns
matching value for item name (string).
-
preprocessing.preprocessing_ess_utils.
standardize_study_metadata
(study)[source]¶ Transforms study metadata present in the input file to the standard used in the MCSQ format.
- Parameters
study (param1) – study metadata extracted from input file (Study column).
- Returns
Standardized study parameter (string).
-
preprocessing.preprocessing_ess_utils.
standardize_supplementary_item_name
(item_name)[source]¶ Standardizes the item name metadata of supplementary modules G, H and I.
- Parameters
item_name (param1) – item_name metadata, extracted from input file.
- Returns
Standardized item_name, when applicable.
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesCAT
[source]¶ Class encapsulating special answer categories for Catalan
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesCZE
[source]¶ Class encapsulating special answer categories for Czech
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesENG
[source]¶ Class encapsulating special answer categories for English
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesFRE
[source]¶ Class encapsulating special answer categories for French
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesGER
[source]¶ Class encapsulating special answer categories for German
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesNOR
[source]¶ Class encapsulating special answer categories for Norwegian
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesPOR
[source]¶ Class encapsulating special answer categories for Portuguese
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesRUS_EE
[source]¶ Class encapsulating special answer categories for Russian from Estonia
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesRUS_IL
[source]¶ Class encapsulating special answer categories for Russian from Israel
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesRUS_LT
[source]¶ Class encapsulating special answer categories for Russian from Lithuania
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesRUS_LV
[source]¶ Class encapsulating special answer categories for Russian from Latvia
-
class
preprocessing.ess_special_answer_categories.
SpecialAnswerCategoriesRUS_RU_UA
[source]¶ Class encapsulating special answer categories for Russian from Russian Federation and Ukraine