Plain XML to spreadsheet

Main method that calls the EVS/ESS scripts to generate MCSQ spreadsheet inputs.

Author: Danielly Sorato

Author contact: danielly.sorato@gmail.com

preprocessing.main_xml_files.main(folder_path)[source]

This main file calls the transformation algorithms in the evs_xml_data_extraction, ess_xml_data_extraction and share_xml_data_extraction scripts.

  • evs_xml_data_extraction is called for EVS files.

  • ess_xml_data_extraction is called for ESS files.

  • share_xml_data_extraction is called for SHARE files.

The algorithm transforms an XML file into a structured spreadsheet format with its associated metadata.

Call main script using folder_path, for instance: reset && python3 main.py /path/to/your/data

Parameters

folder_path (param1) – the path of the directory containing the files to transform
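
The per-study dispatch described above can be sketched as follows. This is only an illustration: the helper names pick_extractor and plan, the example file names, and the assumption that each file name starts with a three-letter study marker (EVS, ESS, SHA) are hypothetical and not taken from the actual main_xml_files module.

```python
from pathlib import Path

# Assumed mapping from the first three letters of a file name to the
# extraction script that handles it. The script names come from the
# description above; the markers themselves are an assumption.
EXTRACTORS = {
    "EVS": "evs_xml_data_extraction",
    "ESS": "ess_xml_data_extraction",
    "SHA": "share_xml_data_extraction",
}


def pick_extractor(filename):
    """Return the name of the script for a file, or None if unknown."""
    marker = Path(filename).name[:3].upper()
    return EXTRACTORS.get(marker)


def plan(filenames):
    """Pair every recognised XML file with the script that should process it."""
    pairs = []
    for filename in sorted(filenames):
        extractor = pick_extractor(filename)
        if filename.lower().endswith(".xml") and extractor is not None:
            pairs.append((filename, extractor))
    return pairs


# Hypothetical file names, used only to exercise the sketch.
work = plan(["ESS_example.xml", "EVS_example.xml", "readme.txt"])
```

Files that match no marker are skipped rather than raising an error, which keeps a mixed folder from aborting the whole run; whether the real script behaves this way is not stated in this documentation.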

preprocessing.evs_xml_data_extraction.process_ivuinstr_node(filename, ivuInstr, survey_item_prefix, study, item_name, module, df_questionnaire)[source]

Extracts information from ivuInstr node (instructions). The text is split into sentences and appropriate metadata is attributed to it.

Parameters
  • filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.

  • ivuInstr (param2) – valid node that is being analyzed.

  • survey_item_prefix (param3) – prefix of the survey_item_ID metadata

  • study (param4) – study metadata, retrieved from the filename.

  • item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.

  • module (param6) – module of survey_item, extracted in previous steps of the loop.

  • df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.

Returns

df_questionnaire, updated with any new valid segments extracted from the ivuInstr node, or unchanged when no new valid segments are found.
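
The split-and-annotate step shared by the node-processing functions can be pictured with the sketch below. It makes several assumptions: a naive regular-expression splitter stands in for the real sentence splitter, plain dictionaries stand in for rows of the pandas dataframe, and the column names, the "INSTRUCTION" label and the sample values are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive stand-in splitter: break after '.', '!' or '?' plus whitespace."""
    collapsed = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", collapsed) if s]


def instruction_rows(node, survey_item_prefix, study, item_name, module, start=0):
    """Build one row (a dict) per sentence found in an instruction node."""
    text = "".join(node.itertext())
    rows = []
    for offset, sentence in enumerate(split_sentences(text), start=1):
        rows.append({
            # Hypothetical column names, chosen to mirror the parameters above.
            "survey_item_ID": f"{survey_item_prefix}{start + offset}",
            "study": study,
            "module": module,
            "item_type": "INSTRUCTION",
            "item_name": item_name,
            "text": sentence,
        })
    return rows


# Hypothetical node and metadata values.
node = ET.fromstring("<ivuInstr>Show card 5. Code one answer only.</ivuInstr>")
rows = instruction_rows(node, "EVS_EXAMPLE_", "EVS_EXAMPLE", "Q1", "Module A")
```

An empty node yields an empty list, which corresponds to the case where the questionnaire is returned unchanged.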

preprocessing.evs_xml_data_extraction.process_preqtxt_node(filename, preQTxt, survey_item_prefix, study, item_name, module, df_questionnaire)[source]

Extracts information from preQTxt node (requests and introductions). The text is split into sentences and appropriate metadata is attributed to it.

Parameters
  • filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.

  • preQTxt (param2) – valid node that is being analyzed.

  • survey_item_prefix (param3) – prefix of the survey_item_ID metadata

  • study (param4) – study metadata, retrieved from the filename.

  • item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.

  • module (param6) – module of survey_item, extracted in previous steps of the loop.

  • df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.

Returns

df_questionnaire, updated with any new valid segments extracted from the preQTxt node, or unchanged when no new valid segments are found.

preprocessing.evs_xml_data_extraction.process_qstnLit_node(filename, qstnLit, survey_item_prefix, study, item_name, module, df_questionnaire)[source]

Extracts information from qstnLit node (requests). The text is split into sentences and appropriate metadata is attributed to it.

Parameters
  • filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.

  • qstnLit (param2) – valid node that is being analyzed.

  • survey_item_prefix (param3) – prefix of the survey_item_ID metadata

  • study (param4) – study metadata, retrieved from the filename.

  • item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.

  • module (param6) – module of survey_item, extracted in previous steps of the loop.

  • df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.

Returns

df_questionnaire, updated with any new valid segments extracted from the qstnLit node, or unchanged when no new valid segments are found.

preprocessing.evs_xml_data_extraction.process_response_with_id_node(filename, node, survey_item_prefix, study, df_questionnaire, response_dict)[source]

Extracts information from a response node that has the ID attribute. When the attribute is present, the translation text is in this node. The text and category value are stored in the response_dict dictionary so that response categories referencing this ID can reuse them.

Parameters
  • filename (param1) – name of the input file.

  • node (param2) – response category node that is being analyzed.

  • survey_item_prefix (param3) – prefix of the survey_item_ID metadata

  • study (param4) – study metadata, retrieved from the filename.

  • df_questionnaire (param5) – pandas dataframe where the questionnaire is being stored.

  • response_dict (param6) – dictionary that stores response category text and value.

Returns

df_questionnaire (pandas dataframe), with new information extracted from the response node and updated response_dict.

preprocessing.evs_xml_data_extraction.process_response_with_id_reference_node(node, survey_item_prefix, study, df_questionnaire, response_dict)[source]

Extracts information from a response node that references an ID (attribute sdatrefs). The response text and category value are retrieved from the response_dict dictionary, which is filled by the process_response_with_id_node method.

Parameters
  • node (param1) – response category node that is being analyzed.

  • survey_item_prefix (param2) – prefix of the survey_item_ID metadata

  • study (param3) – study metadata, retrieved from the filename.

  • df_questionnaire (param4) – pandas dataframe where the questionnaire is being stored.

  • response_dict (param5) – dictionary that stores response category text and value, filled by process_response_with_id_node.

Returns

df_questionnaire (pandas dataframe), with new information extracted from the response node.
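
The interplay between the two response-node functions (store by ID first, resolve by sdatrefs later) can be sketched as below. The sample XML, the child element names catgry, catValu and txt, and the helper names remember_response and resolve_response are assumptions made for illustration; only the ID and sdatrefs attributes and the response_dict idea come from the descriptions above.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: two categories carry their own ID and text,
# a third one only points back to the first via sdatrefs.
SAMPLE = """
<var name="v1">
  <catgry ID="c1"><catValu>1</catValu><txt>Agree</txt></catgry>
  <catgry ID="c2"><catValu>2</catValu><txt>Disagree</txt></catgry>
  <catgry sdatrefs="c1"/>
</var>
"""


def remember_response(node, response_dict):
    """Store text and value of a category that carries its own ID."""
    response_dict[node.get("ID")] = {
        "text": (node.findtext("txt") or "").strip(),
        "value": (node.findtext("catValu") or "").strip(),
    }
    return response_dict


def resolve_response(node, response_dict):
    """Look up text and value for a category that only references an ID."""
    return response_dict.get(node.get("sdatrefs"))


response_dict = {}
resolved = []
for category in ET.fromstring(SAMPLE.strip()).iter("catgry"):
    if category.get("ID") is not None:
        remember_response(category, response_dict)
    elif category.get("sdatrefs") is not None:
        resolved.append(resolve_response(category, response_dict))
```

The sketch relies on nodes with an ID being visited before the nodes that reference them, since a reference to an ID not yet stored resolves to None here.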

preprocessing.evs_xml_data_extraction.process_txt_node(filename, txt, survey_item_prefix, study, item_name, module, df_questionnaire)[source]

Extracts information from txt node (requests). The text is split into sentences and appropriate metadata is attributed to it.

Parameters
  • filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.

  • txt (param2) – valid node that is being analyzed.

  • survey_item_prefix (param3) – prefix of the survey_item_ID metadata

  • study (param4) – study metadata, retrieved from the filename.

  • item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.

  • module (param6) – module of survey_item, extracted in previous steps of the loop.

  • df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.

Returns

df_questionnaire, updated with any new valid segments extracted from the txt node, or unchanged when no new valid segments are found.

preprocessing.evs_xml_data_extraction.process_valid_node(filename, node, survey_item_prefix, study, module, df_questionnaire)[source]

When the node is valid (its variable is listed in the EVSModulesYYYY classes), calls the appropriate method, chosen by node tag, to extract information from the node and its children.

Parameters
  • filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.

  • node (param2) – valid node that is being analyzed.

  • survey_item_prefix (param3) – prefix of the survey_item_ID metadata

  • study (param4) – study metadata, retrieved from the filename.

  • module (param5) – module of survey_item, extracted in previous steps of the loop.

  • df_questionnaire (param6) – pandas dataframe where the questionnaire is being stored.

Returns

df_questionnaire, updated with any new valid segments extracted from the node, or unchanged when no new valid segments are found.
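
Dispatching on the node tag can be sketched with a lookup table, as below. The tag names are those of the functions documented above; the handler factory describe, the sample XML layout, and the returned tuples are hypothetical simplifications, since the real handlers update df_questionnaire instead.

```python
import xml.etree.ElementTree as ET


def describe(kind):
    """Build a toy handler that returns the segment kind and cleaned text."""
    def handler(node):
        return (kind, " ".join("".join(node.itertext()).split()))
    return handler


# One handler per tag; unknown tags are simply skipped.
HANDLERS = {
    "preQTxt": describe("introduction or request"),
    "qstnLit": describe("request"),
    "ivuInstr": describe("instruction"),
    "txt": describe("request"),
}


def process_children(node):
    """Visit the node and its descendants, calling the handler for each known tag."""
    results = []
    for child in node.iter():
        handler = HANDLERS.get(child.tag)
        if handler is not None:
            results.append(handler(child))
    return results


# Hypothetical variable node.
variable = ET.fromstring(
    "<var name='v1'><qstn>"
    "<preQTxt>Please look at this card.</preQTxt>"
    "<qstnLit>How happy are you?</qstnLit>"
    "<ivuInstr>Code one answer.</ivuInstr>"
    "</qstn><notes>ignored</notes></var>"
)
segments = process_children(variable)
```

A table keeps the tag-to-function mapping in one place; an if/elif chain over the tag would express the same dispatch.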

preprocessing.evs_xml_data_extraction.retrieve_item_module(study, country_language, name)[source]

Retrieves the module of the survey_item, based on information from the EVSModulesYYYY objects. This information comes from the EVS_modules_reference.xlsx file, sent by Evelyn.

Parameters
  • study (param1) – study metadata, embedded in the file name.

  • country_language (param2) – country_language metadata, embedded in the file name.

  • name (param3) – attribute ‘name’ of the analyzed node, which is an EVS variable. The variables of interest are listed in the EVSModulesYYYY objects.

Returns

appropriate module of survey_item (string).
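
The module lookup can be pictured as a search through per-study sets of variable names. Everything in this sketch is hypothetical: the study key, module labels, variable names, the lookup_module helper and the "No module" fallback are placeholders, and the country_language parameter of the real function is left out because its role is not described here.

```python
# Placeholder reference data standing in for the EVSModulesYYYY objects.
MODULES_BY_STUDY = {
    "EVS_EXAMPLE": {
        "Module A": {"v1", "v2"},
        "Module B": {"v10"},
    },
}


def lookup_module(study, name, default="No module"):
    """Return the module whose variable set contains the given variable name."""
    for module, variables in MODULES_BY_STUDY.get(study, {}).items():
        if name.lower() in variables:
            return module
    return default


module = lookup_module("EVS_EXAMPLE", "V10")
```

Variable names are lowercased before comparison so that "V10" and "v10" resolve to the same module; that normalisation is a choice of this sketch, not a documented behaviour.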