SHARE data extraction, preprocessing and treatment methods
-
preprocessing.share_covid_data_extraction.
get_language_country_iso_codes
(language_country) Returns the ISO codes for language and country based on the values retrieved from the input file. Only for the target languages of MCSQ.
- Parameters
language_country (param1) – language and country information retrieved from the input file.
- Returns
language_country (string). Variable representing the language and country metadata in ISO codes.
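As a rough illustration of this kind of lookup, the sketch below maps a language/country label to ISO-style codes. The input layout ("German (Austria)"), the table contents, the helper name to_iso_codes and the LANG_COUNTRY output form are all assumptions made for the example, not the values used by the module.

```python
# Hypothetical lookup tables; the real module may cover other languages and formats.
LANGUAGE_CODES = {"German": "GER", "French": "FRE", "Spanish": "SPA"}
COUNTRY_CODES = {"Germany": "DE", "Austria": "AT", "France": "FR", "Spain": "ES"}


def to_iso_codes(language_country):
    """Turn a label such as 'German (Austria)' into 'GER_AT' (assumed layout)."""
    language, _, country = language_country.partition("(")
    language = language.strip()
    country = country.rstrip(")").strip()
    if language not in LANGUAGE_CODES or country not in COUNTRY_CODES:
        # Only a fixed set of target languages is supported in this sketch.
        raise ValueError(f"unsupported language/country: {language_country!r}")
    return f"{LANGUAGE_CODES[language]}_{COUNTRY_CODES[country]}"
```

Raising on unknown labels, rather than returning a default, keeps unsupported languages from silently entering the output.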
-
preprocessing.share_covid_data_extraction.
preprocess_answer_segment
(row, df_questionnaire, survey_item_prefix, splitter) Extracts and processes the answer segments from the input file.
- Parameters
row (param1) – dataframe row being currently analyzed.
df_questionnaire (param2) – pandas dataframe to store questionnaire data.
survey_item_prefix (param3) – prefix of survey_item_ID.
splitter (param4) – NLTK object for sentence segmentation, instantiated according to the language.
- Returns
updated df_questionnaire with new valid answer segments.
-
preprocessing.share_covid_data_extraction.
preprocess_instruction_segment
(row, df_questionnaire, survey_item_prefix, splitter) Extracts and processes the instruction segments from the input file.
- Parameters
row (param1) – dataframe row being currently analyzed.
df_questionnaire (param2) – pandas dataframe to store questionnaire data.
survey_item_prefix (param3) – prefix of survey_item_ID.
splitter (param4) – NLTK object for sentence segmentation, instantiated according to the language.
- Returns
updated df_questionnaire with new valid instruction segments.
-
preprocessing.share_covid_data_extraction.
preprocess_question_segment
(row, df_questionnaire, survey_item_prefix, splitter) Extracts and processes the question segments from the input file.
- Parameters
row (param1) – dataframe row being currently analyzed.
df_questionnaire (param2) – pandas dataframe to store questionnaire data.
survey_item_prefix (param3) – prefix of survey_item_ID.
splitter (param4) – NLTK object for sentence segmentation, instantiated according to the language.
- Returns
updated df_questionnaire with new valid question segments.
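The three preprocess_*_segment functions share one pattern: split a text into sentences and store one record per sentence. The sketch below shows that pattern using only the standard library. RegexSplitter is a stand-in for the NLTK splitter (it exposes the same tokenize() method), a plain list of dicts stands in for df_questionnaire, and the field names, prefix and item name are hypothetical.

```python
import re


class RegexSplitter:
    """Stand-in for the NLTK sentence splitter; exposes a tokenize() method."""

    def tokenize(self, text):
        # Split after '.', '!' or '?' when followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def add_segments(text, rows, survey_item_prefix, item_name, item_type, splitter):
    """Append one record per sentence; `rows` stands in for df_questionnaire."""
    for sentence in splitter.tokenize(text):
        rows.append({
            "survey_item_ID": f"{survey_item_prefix}{len(rows) + 1}",
            "item_name": item_name,
            "item_type": item_type,
            "text": sentence,
        })
    return rows


rows = add_segments(
    "How are you today? Please answer honestly.",
    [], "SHA_COVID_ENG_GB_", "Q1", "REQUEST", RegexSplitter(),
)
```

Passing the splitter in as an argument, as the documented signatures do, lets the same code run with a language-specific segmenter.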
-
preprocessing.share_covid_data_extraction.
replace_abbreviations_and_fills
(sentence) Replaces abbreviations and fill text in a text segment from the input file.
- Parameters
sentence (param1) – text segment from input file.
- Returns
sentence (string). Text segment without abbreviations and fill text.
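A minimal sketch of this kind of cleanup, assuming a small abbreviation table and fills written in curly braces. Both the table and the fill syntax are hypothetical; the actual replacements depend on the input file.

```python
import re

# Hypothetical abbreviation table and fill syntax, for illustration only.
ABBREVIATIONS = {"e.g.": "for example", "etc.": "and so on"}
FILL_PATTERN = re.compile(r"\{[^{}]*\}")


def strip_abbreviations_and_fills(sentence):
    """Expand known abbreviations and drop fill placeholders."""
    for short, full in ABBREVIATIONS.items():
        sentence = sentence.replace(short, full)
    sentence = FILL_PATTERN.sub("", sentence)
    # Collapse the whitespace left behind by removed fills.
    return re.sub(r"\s+", " ", sentence).strip()
```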
-
preprocessing.share_covid_data_extraction.
retrieve_module_from_item_name
(item_name) Returns the module of the question based on the item_name variable. This information comes from http://www.share-project.org/special-data-sets/share-covid-19-questionnaire.html
- Parameters
item_name (param1) – item_name information retrieved from input file.
- Returns
module (string). Module of the question.
-
preprocessing.share_covid_data_extraction.
set_initial_structures
(language_country) Sets the initial structures that are necessary for the extraction of each questionnaire.
- Parameters
language_country (param1) – language and country of the subdataframe being analyzed.
- Returns
df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), and sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).
Python3 script to extract data from XML SHARE input files.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
-
preprocessing.share_xml_data_extraction.
build_questionnaire_structure
(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study) Builds the final questionnaire from df_questions, df_answers and df_procedures. Calls the fill_extraction() and fill_unrolling() methods to replace the dynamic fills in the texts with the appropriate string definitions found in df_procedures.
- Parameters
df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
- Returns
The final SHARE questionnaire, stored in df_questionnaire (pandas dataframe).
-
preprocessing.share_xml_data_extraction.
clean_answer_text
(text, country_language) Substitutes HTML markup in the answer text segments with fixed values.
- Parameters
text (param1) – the answer text segment.
country_language (param2) – country_language metadata, embedded in the file name.
- Returns
the answer text (string) where the markups were replaced (if present in original string).
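The general idea of markup cleanup can be sketched with the standard library. This is an illustrative simplification: it treats every tag the same way and ignores country_language, whereas the documented function substitutes fixed, possibly language-dependent values. The helper name strip_markup is hypothetical.

```python
import html
import re


def strip_markup(text):
    """Replace line-break tags with spaces, drop other tags, decode entities."""
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)
    # Normalize whitespace introduced by the substitutions.
    return re.sub(r"\s+", " ", text).strip()
```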
-
preprocessing.share_xml_data_extraction.
clean_text_share
(text, country_language, w7flag) Substitutes HTML markup and certain fills in the text segments with fixed values.
- Parameters
text (param1) – the answer text segment.
country_language (param2) – country_language metadata, embedded in the file name.
w7_flag (param3) – a boolean flag that indicates if the segment comes from an input XML file in SHARE w7.
- Returns
the text (string) where the markups and fills were replaced (if present in original string).
-
preprocessing.share_xml_data_extraction.
eliminate_showcardID_and_adjust_item_type
(text, item_name) Substitutes the SHOWCARD_ID strings with a card number (the card IDs are not available in the input XML files).
- Parameters
text (param1) – the text segment being analyzed (either request or instruction).
item_name (param2) – item_name metadata, extracted directly from the input XML file. If ‘intro’ is in the item_name, the segment receives the introduction item_type.
- Returns
text (string) and item_type (string). The SHOWCARD_ID strings are removed from the text segment.
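A hedged sketch of the two steps described above: replacing the SHOWCARD_ID placeholder and deriving the item type from the item name. The default card number, the item type labels and the helper name resolve_showcard are assumptions, and the sketch defaults to a request type where the real function may also assign instruction.

```python
def resolve_showcard(text, item_name, card_number=1):
    """Return (text, item_type) with SHOWCARD_ID placeholders replaced."""
    # The card IDs are not in the input, so a fixed number is used (assumption).
    text = text.replace("SHOWCARD_ID", str(card_number))
    # Item names containing 'intro' mark introduction segments.
    item_type = "INTRODUCTION" if "intro" in item_name.lower() else "REQUEST"
    return text, item_type
```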
-
preprocessing.share_xml_data_extraction.
extract_answers
(subnode, df_answers, name, country_language, output_source_questionnaire_flag) Extracts the answer text from XML nodes of SHARE w8 files.
- Parameters
subnode (param1) – child node being analyzed in outer loop.
df_answers (param2) – pandas dataframe containing answers extracted from the XML file.
name (param3) – name of the answer structure inside the XML file.
country_language (param4) – country_language metadata, embedded in the file name.
output_source_questionnaire_flag (param5) – indicates whether the data to be extracted is in the source language (1) or the target language (any other value).
- Returns
df_answers (pandas dataframe) filled with retrieved answer segments extracted from answer_element nodes.
-
preprocessing.share_xml_data_extraction.
extract_categories
(subnode, df_answers, name, country_language, output_source_questionnaire_flag) Extracts the categories (i.e., answers) from SHARE W07 XML files.
- Parameters
subnode (param1) – subnode of categories node.
df_answers (param2) – a dataframe to store answer text and its attributes.
country_language (param3) – country and language metadata, contained in the filename.
output_source_questionnaire_flag (param4) – indicates whether the data to be extracted is in the source language (1) or the target language (any other value).
- Returns
df_answers (pandas dataframe) filled with retrieved answer segments extracted from category_element nodes.
-
preprocessing.share_xml_data_extraction.
extract_qenums
(subnode, df_answers, name, country_language, output_source_questionnaire_flag) Extracts the qenums (i.e., answers) from SHARE W07 XML files.
- Parameters
subnode (param1) – subnode of categories node.
df_answers (param2) – a dataframe to store answer text and its attributes.
country_language (param3) – country and language metadata, contained in the filename.
output_source_questionnaire_flag (param4) – indicates whether the data to be extracted is in the source language (1) or the target language (any other value).
- Returns
df_answers (pandas dataframe) filled with retrieved answer segments extracted from qenum_element nodes.
-
preprocessing.share_xml_data_extraction.
extract_questions_and_procedures_w7
(subnode, df_questions, df_procedures, parent_map, name, tmt_id, splitter, country_language, output_source_questionnaire_flag) Extracts the questions and procedures text segments from SHARE wave 7 XML files.
- Parameters
df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.
name (param4) – name node attribute inside XML file
tmt_id (param5) – tmt_id node attribute inside XML file
splitter (param6) – Sentence segmenter object from NLTK
country_language (param7) – country and language metadata, contained in the filename
output_source_questionnaire_flag (param8) – indicates whether the data to be extracted is in the source language (1) or the target language (any other value).
- Returns
df_questions and df_procedures (pandas dataframes), filled with the text segments extracted from the questions and procedures nodes.
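The parent_map parameter reflects a common idiom with xml.etree.ElementTree, whose elements do not store a reference to their parent. The sketch below builds such a map for a toy document. The element and attribute names are hypothetical and do not reproduce the SHARE XML schema.

```python
import xml.etree.ElementTree as ET

xml_source = """
<questionnaire>
  <question name="Q1">
    <qtext>How old are you?</qtext>
  </question>
</questionnaire>
"""

root = ET.fromstring(xml_source)

# ElementTree nodes do not keep a reference to their parent,
# so the child-to-parent relation is built once up front.
parent_map = {child: parent for parent in root.iter() for child in parent}

text_node = root.find("./question/qtext")
question_node = parent_map[text_node]
print(question_node.get("name"))
```

With the map in place, code that visits a text node can reach attributes stored on an enclosing node without re-traversing the tree.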
-
preprocessing.share_xml_data_extraction.
extract_questions_and_procedures_w8
(subnode, df_questions, df_procedures, parent_map, name, splitter, country_language, output_source_questionnaire_flag) Extracts the questions and procedures text segments from SHARE wave 8 XML files.
- Parameters
df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.
name (param4) – name node attribute inside XML file
splitter (param5) – Sentence segmenter object from NLTK
country_language (param6) – country and language metadata, contained in the filename
output_source_questionnaire_flag (param7) – indicates whether the data to be extracted is in the source language (1) or the target language (any other value).
- Returns
df_questions and df_procedures (pandas dataframes), filled with the text segments extracted from the questions and procedures nodes.
-
preprocessing.share_xml_data_extraction.
fill_extraction
(text) Retrieves all dynamic fills (if there are any) from a given SHARE text segment, so that these fills can later be replaced by their natural language text definitions.
- Parameters
text (param1) – the text segment.
- Returns
either a list of fills (list of strings), or null if there are no matching fills in the text segment.
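Fill extraction of this kind is typically a regular-expression search. The sketch below assumes that a fill is written as a caret followed by an identifier (for example ^FLName). That syntax and the helper name find_fills are assumptions, and None stands in for the "null" return value.

```python
import re

# Assumed fill syntax: a caret followed by an identifier, e.g. "^FLName".
FILL_PATTERN = re.compile(r"\^[A-Za-z_][A-Za-z0-9_]*")


def find_fills(text):
    """Return the fills found in text, or None when there are none."""
    fills = FILL_PATTERN.findall(text)
    return fills or None
```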
-
preprocessing.share_xml_data_extraction.
fill_substitution_in_answer
(text, fills, df_procedures) Substitutes the fills in the answer text segments. The fill is substituted only if it was found in the procedure nodes (this can be checked by filtering the df_procedures dataframe by the fill present in the answer segment).
- Parameters
text (param1) – the answer text segment.
fills (param2) – the list of fills that are present in the text segment. Effectively, for answers the fill list has just one element.
df_procedures (param3) – a dataframe that stores the contents of the procedures nodes, where the fill definitions are.
- Returns
module (string) the module name.
-
preprocessing.share_xml_data_extraction.
fill_unrolling
(text, fills, df_procedures, df_questionnaire, survey_item_id, item_name, share_modules, study, item_type) Replaces all dynamic fills found in a given text segment by their string definitions in the df_procedures dataframe.
- Parameters
text (param1) – the text segment that contains at least one dynamic fill.
fills (param2) – the list of dynamic fills found in the text segment passed as parameter.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
item_name (param5) – the item name metadata, extracted in previous steps.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
study (param7) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
item_type (param8) – the item type metadata, extracted in previous steps.
- Returns
The updated df_questionnaire (pandas dataframe), with the dynamic fill(s) in the text segment replaced by their definitions.
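One plausible reading of "unrolling" is that a segment with several fills, each having several possible definitions, expands into one variant per combination. The sketch below illustrates that reading only. The fill syntax, the definitions dictionary (standing in for df_procedures) and the helper name unroll are assumptions, not the documented behaviour.

```python
from itertools import product


def unroll(text, fills, definitions):
    """Return one variant of text per combination of fill definitions."""
    options = [definitions[fill] for fill in fills]
    variants = []
    for combination in product(*options):
        variant = text
        for fill, value in zip(fills, combination):
            variant = variant.replace(fill, value)
        variants.append(variant)
    return variants


# Hypothetical fill definitions standing in for df_procedures.
definitions = {"^FLPronoun": ["he", "she"], "^FLTense": ["is", "was"]}
variants = unroll(
    "Where ^FLTense ^FLPronoun born?", ["^FLTense", "^FLPronoun"], definitions
)
```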
-
preprocessing.share_xml_data_extraction.
filter_items_to_build_questionnaire_structure_w7
(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study) Filters the question and answer dataframes by the tmt_ids. Only segments with the same tmt_id are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.
- Parameters
df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
- Returns
The final SHARE wave 7 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.
-
preprocessing.share_xml_data_extraction.
filter_items_to_build_questionnaire_structure_w8
(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study) Filters the question and answer dataframes by the item name. Only segments with the same item name are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.
- Parameters
df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
- Returns
The final SHARE wave 8 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.
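Both filter functions restrict alignment candidates to segments that share a key: tmt_id for wave 7, item name for wave 8. A minimal sketch of that grouping, with lists of dicts standing in for the dataframes and a hypothetical helper name align_by_key:

```python
from collections import defaultdict


def align_by_key(questions, answers, key):
    """Pair each question with the answers that share the same key value."""
    answers_by_key = defaultdict(list)
    for answer in answers:
        answers_by_key[answer[key]].append(answer)
    return [(q, answers_by_key.get(q[key], [])) for q in questions]


# Toy records; the field names and values are illustrative.
questions = [{"tmt_id": "10", "text": "Are you married?"},
             {"tmt_id": "11", "text": "How old are you?"}]
answers = [{"tmt_id": "10", "text": "Yes"}, {"tmt_id": "10", "text": "No"}]
pairs = align_by_key(questions, answers, "tmt_id")
```

Passing "item_name" as the key instead would give the wave 8 variant of the same filter.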
-
preprocessing.share_xml_data_extraction.
get_module_metadata
(item_name, share_modules) Gets the module to which a given survey item pertains, based on the survey item name.
- Parameters
item_name (param1) – item_name metadata, extracted directly from the input XML file.
share_modules (param2) – a dictionary of module names (taken from SHARE website), encapsulated in the SHAREModules object.
- Returns
module (string) the module name.
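A sketch of a module lookup, assuming that the leading letters of the item name identify the module. The rule, the dictionary contents, the fallback label and the helper name module_for are illustrative assumptions. The real dictionary is round dependent and is taken from the SHARE website.

```python
import re

# Illustrative subset of a module dictionary; real values are round dependent.
share_modules = {"DN": "Demographics", "PH": "Physical Health"}


def module_for(item_name, share_modules):
    """Look up the module from the leading letters of the item name."""
    match = re.match(r"[A-Za-z]+", item_name)
    prefix = match.group(0).upper() if match else ""
    # Fallback label for unknown prefixes (assumption).
    return share_modules.get(prefix, "No module")
```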
-
preprocessing.share_xml_data_extraction.
main
(filename) Flag that indicates if the data to be extracted is from the source or the target questionnaire.
-
preprocessing.share_xml_data_extraction.
replace_fill_in_answer
(text) Substitutes certain fills in the answer text segments with fixed values.
- Parameters
text (param1) – the answer text segment.
- Returns
the answer text (string) where the fills were replaced (if present in original string).
-
preprocessing.share_xml_data_extraction.
replace_untranslated_instructions
(country_language, text) Replaces certain dynamic fills that are not defined in the input file by language-dependent fixed values.
- Parameters
country_language (param1) – country and language metadata, contained in the filename.
text (param2) – the text segment.
- Returns
The text segment (string) without certain dynamic fills (if there were any).
-
preprocessing.share_xml_data_extraction.
set_initial_structures
(filename, output_source_questionnaire_flag) Sets the initial structures that are necessary for the extraction of each questionnaire.
- Parameters
filename (param1) – name of the input file.
- Returns
df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), study/country_language, which are metadata parameters embedded in the file name (string and string) and sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).
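Since the study and country_language metadata are embedded in the file name, part of this setup amounts to parsing that name. The sketch below assumes a STUDY_ROUND_YEAR_LANG_COUNTRY.xml layout. The layout, the example file name and the helper name parse_filename are hypothetical.

```python
from pathlib import Path


def parse_filename(filename):
    """Split 'STUDY_ROUND_YEAR_LANG_COUNTRY.xml' (assumed layout) into metadata."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    study = "_".join(parts[:3])
    country_language = "_".join(parts[3:])
    # The survey item prefix combines both pieces of metadata.
    survey_item_prefix = f"{study}_{country_language}_"
    return study, country_language, survey_item_prefix
```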
-
preprocessing.share_xml_data_extraction.
split_answer_text_item_value_from_categories
(text) Splits the answer text and its item value in the category node.
- Parameters
text (param1) – text from category node, containing item value and answer text segment.
- Returns
item_value (string) and answer text segment (string)
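A minimal sketch of such a split, assuming category text of the form "1. Yes" (a number, an optional separator, then the answer text). The layout and the helper name split_item_value are assumptions.

```python
import re


def split_item_value(text):
    """Split '1. Yes' (assumed layout) into ('1', 'Yes')."""
    match = re.match(r"\s*(-?\d+)\s*[.:)]?\s*(.*)", text, flags=re.DOTALL)
    if match is None:
        # No leading number: return the text unchanged, without an item value.
        return None, text.strip()
    return match.group(1), match.group(2).strip()
```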