SHARE data extraction, preprocessing and treatment methods

preprocessing.share_covid_data_extraction.get_language_country_iso_codes(language_country)[source]

Returns the ISO codes for language and country based on the values retrieved from the input file. Only for the target languages of MCSQ.

Parameters

language_country (param1) – language and country information retrieved from input file.

Returns

language_country (string). Variable representing the language and country metadata in ISO codes.

preprocessing.share_covid_data_extraction.preprocess_answer_segment(row, df_questionnaire, survey_item_prefix, splitter)[source]

Extracts and processes the answer segments from the input file.

Parameters
  • row (param1) – dataframe row being currently analyzed.

  • df_questionnaire (param2) – pandas dataframe to store questionnaire data.

  • survey_item_prefix (param3) – prefix of survey_item_ID.

  • splitter (param4) – NLTK object for sentence segmentation, instantiated in accordance with the language.

Returns

updated df_questionnaire with new valid answer segments.

preprocessing.share_covid_data_extraction.preprocess_instruction_segment(row, df_questionnaire, survey_item_prefix, splitter)[source]

Extracts and processes the instruction segments from the input file.

Parameters
  • row (param1) – dataframe row being currently analyzed.

  • df_questionnaire (param2) – pandas dataframe to store questionnaire data.

  • survey_item_prefix (param3) – prefix of survey_item_ID.

  • splitter (param4) – NLTK object for sentence segmentation, instantiated in accordance with the language.

Returns

updated df_questionnaire with new valid instruction segments.

preprocessing.share_covid_data_extraction.preprocess_question_segment(row, df_questionnaire, survey_item_prefix, splitter)[source]

Extracts and processes the question segments from the input file.

Parameters
  • row (param1) – dataframe row being currently analyzed.

  • df_questionnaire (param2) – pandas dataframe to store questionnaire data.

  • survey_item_prefix (param3) – prefix of survey_item_ID.

  • splitter (param4) – NLTK object for sentence segmentation, instantiated in accordance with the language.

Returns

updated df_questionnaire with new valid question segments.

preprocessing.share_covid_data_extraction.replace_abbreviations_and_fills(sentence)[source]

Replaces abbreviations and fill texts in the text from the input file.

Parameters

sentence (param1) – text segment from input file.

Returns

sentence (string). Text segment without abbreviations and fills text.

preprocessing.share_covid_data_extraction.retrieve_module_from_item_name(item_name)[source]

Returns the module of the question based on the item_name variable. This information comes from http://www.share-project.org/special-data-sets/share-covid-19-questionnaire.html

Parameters

item_name (param1) – item_name information retrieved from input file.

Returns

module (string). Module of the question.

preprocessing.share_covid_data_extraction.set_initial_structures(language_country)[source]

Set initial structures that are necessary for the extraction of each questionnaire.

Parameters

language_country (param1) – language and country of the subdataframe being analyzed

Returns

df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), and sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).

Python3 script to extract data from XML SHARE input files.

Author: Danielly Sorato

Author contact: danielly.sorato@gmail.com

preprocessing.share_xml_data_extraction.build_questionnaire_structure(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)[source]

Builds the final questionnaire from df_questions, df_answers and df_procedures. Calls the fill_extraction() and fill_unrolling() methods to replace the dynamic fills in the texts with the appropriate string definitions found in df_procedures.

Parameters
  • df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.

  • df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.

  • df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.

  • df_questionnaire (param4) – a dataframe to hold the final questionnaire.

  • survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.

  • share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.

  • special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.)

  • study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.

Returns

The final SHARE questionnaire, stored in df_questionnaire (pandas dataframe).

preprocessing.share_xml_data_extraction.clean_answer_text(text, country_language)[source]

Substitutes HTML markups in the answer text segments with fixed values

Parameters
  • text (param1) – the answer text segment.

  • country_language (param2) – country_language metadata, embedded in file name.

Returns

the answer text (string) where the markups were replaced (if present in original string).
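The kind of markup substitution described above can be sketched as follows. This is a minimal illustration, not the module's code: the function name strip_markup is hypothetical, and the choices it makes (line-break tags become a space, every other tag is dropped) are assumptions, since the fixed values used by clean_answer_text are not listed here.

```python
import re


def strip_markup(text):
    """Replace HTML line breaks with a space and drop other tags (illustrative only)."""
    # Assumption: <br>, <br/> and <br /> separate words, so they become a space.
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    # Assumption: remaining tags such as <b> or </i> carry no text of their own.
    text = re.sub(r"</?[A-Za-z][^>]*>", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(strip_markup("<b>Very good</b><br/>health"))  # Very good health
```

A string without markup is returned unchanged apart from whitespace normalization, which matches the "if present in original string" wording of the return description.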

preprocessing.share_xml_data_extraction.clean_text_share(text, country_language, w7flag)[source]

Substitutes HTML markups and certain fills in the text segments with fixed values.

Parameters
  • text (param1) – the text segment.

  • country_language (param2) – country_language metadata, embedded in file name.

  • w7flag (param3) – a boolean flag that indicates whether the segment comes from an input XML file of SHARE wave 7.

Returns

the text (string) where the markups and fills were replaced (if present in original string).

preprocessing.share_xml_data_extraction.eliminate_showcardID_and_adjust_item_type(text, item_name)[source]

Substitutes the SHOWCARD_ID strings with a card number (the card IDs are not available in the input XML files).

Parameters
  • text (param1) – the text segment being analyzed (either request or instruction).

  • item_name (param2) – item_name metadata, extracted directly from the input XML file. If ‘intro’ is in the item_name, the segment receives the introduction item_type.

Returns

text (string) and item_type (string). The SHOWCARD_ID strings are removed from the text segment.

preprocessing.share_xml_data_extraction.extract_answers(subnode, df_answers, name, country_language, output_source_questionnaire_flag)[source]

Extract answers text from XML nodes of SHARE w8 files.

Parameters
  • subnode (param1) – child node being analyzed in outer loop.

  • df_answers (param2) – pandas dataframe containing answers extracted from XML file

  • name (param3) – name of the answer structure inside XML file

  • country_language (param4) – country_language metadata, embedded in file name.

  • output_source_questionnaire_flag (param5) – indicates whether the data to be extracted is in the source (1) or the target language (any other value)

Returns

df_answers (pandas dataframe) filled with retrieved answer segments extracted from answer_element nodes.

preprocessing.share_xml_data_extraction.extract_categories(subnode, df_answers, name, country_language, output_source_questionnaire_flag)[source]

Extracts the categories (i.e., answers) from SHARE W07 XML files.

Parameters
  • subnode (param1) – subnode of categories node.

  • df_answers (param2) – a dataframe to store answer text and its attributes

  • country_language (param3) – country and language metadata, contained in the filename

  • output_source_questionnaire_flag (param4) – indicates whether the data to be extracted is in the source (1) or the target language (any other value)

Returns

df_answers (pandas dataframe) filled with retrieved answer segments extracted from category_element nodes.

preprocessing.share_xml_data_extraction.extract_qenums(subnode, df_answers, name, country_language, output_source_questionnaire_flag)[source]

Extracts the qenums (i.e., answers) from SHARE W07 XML files.

Parameters
  • subnode (param1) – subnode of categories node.

  • df_answers (param2) – a dataframe to store answer text and its attributes

  • country_language (param3) – country and language metadata, contained in the filename

  • output_source_questionnaire_flag (param4) – indicates whether the data to be extracted is in the source (1) or the target language (any other value)

Returns

df_answers (pandas dataframe) filled with retrieved answer segments extracted from qenum_element nodes.

preprocessing.share_xml_data_extraction.extract_questions_and_procedures_w7(subnode, df_questions, df_procedures, parent_map, name, tmt_id, splitter, country_language, output_source_questionnaire_flag)[source]

Extracts the questions and procedures text segments from SHARE wave 7 XML files.

Parameters
  • df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.

  • df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.

  • parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.

  • name (param4) – name node attribute inside XML file

  • tmt_id (param5) – tmt_id node attribute inside XML file

  • splitter (param6) – Sentence segmenter object from NLTK

  • country_language (param7) – country and language metadata, contained in the filename

  • output_source_questionnaire_flag (param8) – indicates whether the data to be extracted is in the source (1) or the target language (any other value)

Returns

df_questions (pandas dataframe) and df_procedures (pandas dataframe), filled with the text segments extracted from the questions and procedures nodes.
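The parent_map parameter exists because elements in Python's xml.etree.ElementTree do not keep a reference to their parent. A common way to build such a child-to-parent dictionary is sketched below; the XML fragment and its node names are hypothetical and only stand in for a real SHARE input file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; real SHARE files use their own node names.
xml_text = (
    "<questionnaire>"
    "<question name='DN001_'><text>Intro</text></question>"
    "</questionnaire>"
)
root = ET.fromstring(xml_text)

# Map every child element to its parent by walking the whole tree once.
parent_map = {child: parent for parent in root.iter() for child in parent}

text_node = root.find(".//text")
question_node = parent_map[text_node]
print(question_node.tag, question_node.attrib["name"])  # question DN001_
```

The root element has no entry in the dictionary, so a lookup on it should be guarded (for example with parent_map.get).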

preprocessing.share_xml_data_extraction.extract_questions_and_procedures_w8(subnode, df_questions, df_procedures, parent_map, name, splitter, country_language, output_source_questionnaire_flag)[source]

Extracts the questions and procedures text segments from SHARE wave 8 XML files.

Parameters
  • df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.

  • df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.

  • parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.

  • name (param4) – name node attribute inside XML file

  • splitter (param5) – Sentence segmenter object from NLTK

  • country_language (param6) – country and language metadata, contained in the filename

  • output_source_questionnaire_flag (param7) – indicates whether the data to be extracted is in the source (1) or the target language (any other value)

Returns

df_questions (pandas dataframe) and df_procedures (pandas dataframe), filled with the text segments extracted from the questions and procedures nodes.

preprocessing.share_xml_data_extraction.fill_extraction(text)[source]

Retrieves all dynamic fills (if there are any) from a given SHARE text segment, so that these fills can later be replaced by their natural language text definitions.

Parameters

text (param1) – the text segment.

Returns

either a list of fills (list of strings), or null if there are no matching fills in the text segment.
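A rough sketch of this kind of extraction is given below. The fill syntax is an assumption: the example treats a fill as a caret followed by an identifier (such as ^FL_HE_SHE), which may differ from the notation in the actual input files. The names FILL_PATTERN and find_fills are hypothetical.

```python
import re

# Assumption: a fill is a caret followed by an identifier, e.g. ^FL_HE_SHE.
FILL_PATTERN = re.compile(r"\^[A-Za-z_][A-Za-z0-9_]*")


def find_fills(text):
    """Return the list of fills found in text, or None when there are none."""
    fills = FILL_PATTERN.findall(text)
    return fills if fills else None


print(find_fills("Has ^FL_HE_SHE visited a doctor since ^FL_LAST_INTERVIEW?"))
print(find_fills("How old are you?"))
```

Returning None rather than an empty list mirrors the return description above and lets the caller skip the substitution step with a simple check.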

preprocessing.share_xml_data_extraction.fill_substitution_in_answer(text, fills, df_procedures)[source]

Substitutes the fills in the answer text segments. The fill is substituted only if it was found in the procedure nodes (this can be checked by filtering the df_procedures dataframe by the fill present in the answer segment).

Parameters
  • text (param1) – the answer text segment.

  • fills (param2) – the list of fills that are present in the text segment. Effectively, for answers the fill list has just one element.

  • df_procedures (param3) – a dataframe that stores the contents of the procedures nodes, where the fill definitions are.

Returns

text (string). The answer text segment, with the fill substituted if its definition was found in df_procedures.

preprocessing.share_xml_data_extraction.fill_unrolling(text, fills, df_procedures, df_questionnaire, survey_item_id, item_name, share_modules, study, item_type)[source]

Replaces all dynamic fills found in a given text segment by their string definitions in the df_procedures dataframe.

Parameters
  • text (param1) – the text segment that contains at least one dynamic fill.

  • fills (param2) – the list of dynamic fills found in the text segment passed as parameter.

  • df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.

  • df_questionnaire (param4) – a dataframe to hold the final questionnaire.

  • item_name (param5) – the item name metadata, extracted in previous steps.

  • share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.

  • study (param7) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.

  • item_type (param8) – the item type metadata, extracted in previous steps.

Returns

The updated df_questionnaire (pandas dataframe), in which the dynamic fill(s) of the text segment have been replaced.
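The string substitution at the core of this step can be sketched as follows. This is a simplified illustration only: a plain dictionary stands in for the df_procedures dataframe, the caret-prefixed fill names are an assumed notation, and the function unroll_fills is hypothetical. It does not update a questionnaire dataframe as the documented function does.

```python
def unroll_fills(text, fills, definitions):
    """Replace each fill in text with its definition, when one is known."""
    # Replace longer fills first so a short fill is never substituted
    # inside a longer one that shares its prefix.
    for fill in sorted(fills, key=len, reverse=True):
        if fill in definitions:
            text = text.replace(fill, definitions[fill])
    return text


# Hypothetical fill definitions standing in for df_procedures.
definitions = {"^FL_HE_SHE": "he/she"}
print(unroll_fills("Has ^FL_HE_SHE visited a doctor?", ["^FL_HE_SHE"], definitions))
```

Fills without a known definition are left in place, so they remain visible for later inspection instead of being silently dropped.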

preprocessing.share_xml_data_extraction.filter_items_to_build_questionnaire_structure_w7(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)[source]

Filters the question and answer dataframes by the tmt_ids. Only segments with the same tmt_id are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.

Parameters
  • df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.

  • df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.

  • df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.

  • df_questionnaire (param4) – a dataframe to hold the final questionnaire.

  • survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.

  • share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.

  • special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.)

  • study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.

Returns

The final SHARE wave 7 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.

preprocessing.share_xml_data_extraction.filter_items_to_build_questionnaire_structure_w8(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)[source]

Filters the question and answer dataframes by the item name. Only segments with the same item name are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.

Parameters
  • df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.

  • df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.

  • df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.

  • df_questionnaire (param4) – a dataframe to hold the final questionnaire.

  • survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.

  • share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.

  • special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.)

  • study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.

Returns

The final SHARE wave 8 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.

preprocessing.share_xml_data_extraction.get_module_metadata(item_name, share_modules)[source]

Gets the module to which a given survey item pertains, based on the survey item name.

Parameters
  • item_name (param1) – item_name metadata, extracted directly from the input XML file.

  • share_modules (param2) – a dictionary of module names (taken from SHARE website), encapsulated in the SHAREModules object.

Returns

module (string) the module name.
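One way such a lookup could work is sketched below. It assumes that the module code is the leading run of letters in the item name (for example DN in DN001_); the two-entry dictionary and the function lookup_module are hypothetical stand-ins for the SHAREModules object and the documented function.

```python
import re

# Hypothetical excerpt; the full mapping lives in the SHAREModules object.
share_modules = {"DN": "Demographics", "PH": "Physical Health"}


def lookup_module(item_name, modules):
    """Return the module name for an item name, or None if it is unknown."""
    # Assumption: the module code is the leading run of letters in the item name.
    match = re.match(r"[A-Za-z]+", item_name)
    if match is None:
        return None
    return modules.get(match.group(0).upper())


print(lookup_module("DN001_", share_modules))  # Demographics
```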

preprocessing.share_xml_data_extraction.main(filename)[source]

Flag that indicates if the data to be extracted is from the source or the target questionnaire.

preprocessing.share_xml_data_extraction.replace_fill_in_answer(text)[source]

Substitutes certain fills in the answer text segments with fixed values.

Parameters

text (param1) – the answer text segment.

Returns

the answer text (string) where the fills were replaced (if present in original string).

preprocessing.share_xml_data_extraction.replace_untranslated_instructions(country_language, text)[source]

Replaces certain dynamic fills that are not defined in the input file by language-dependent fixed values.

Parameters
  • country_language (param1) – country and language metadata, contained in the filename.

  • text (param2) – the text segment.

Returns

The text segment (string) without certain dynamic fills (if there were any).

preprocessing.share_xml_data_extraction.set_initial_structures(filename, output_source_questionnaire_flag)[source]

Set initial structures that are necessary for the extraction of each questionnaire.

Parameters

filename (param1) – name of the input file.

Returns

df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), study/country_language, which are metadata parameters embedded in the file name (string and string) and sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).

preprocessing.share_xml_data_extraction.split_answer_text_item_value_from_categories(text)[source]

Splits the answer text and its item value in the category node

Parameters

text (param1) – text from category node, containing item value and answer text segment

Returns

item_value (string) and answer text segment (string)
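A minimal sketch of such a split is shown below, assuming the category text starts with a numeric item value, optionally followed by a period, and then the answer text (for example "1. Excellent"). The actual layout of the category nodes may differ, and the function name split_value_and_text is hypothetical.

```python
import re


def split_value_and_text(text):
    """Split '1. Excellent' into ('1', 'Excellent'); value is None if absent."""
    # Assumption: an optional minus sign and digits, an optional period,
    # then the answer text.
    match = re.match(r"\s*(-?\d+)\s*\.?\s*(.*)", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    return match.group(1), match.group(2).strip()


print(split_value_and_text("1. Excellent"))  # ('1', 'Excellent')
print(split_value_and_text("Other"))         # (None, 'Other')
```

The item value is kept as a string, in line with the return description above.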

class preprocessing.sharemodules.SHAREModules[source]

SHARE modules, information taken from SHARE website