EVS data extraction, preprocessing and treatment methods
-
preprocessing.evs_xml_data_extraction.process_ivuinstr_node(filename, ivuInstr, survey_item_prefix, study, item_name, module, df_questionnaire)
Extracts information from an ivuInstr node (instructions). The text is split into sentences, and the appropriate metadata is attached to each sentence.
- Parameters
filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.
ivuInstr (param2) – valid node that is being analyzed.
survey_item_prefix (param3) – prefix of the survey_item_ID metadata
study (param4) – study metadata, retrieved from the filename.
item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.
module (param6) – module of survey_item, extracted in previous steps of the loop.
df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.
- Returns
df_questionnaire, updated with the new valid segments extracted from the ivuInstr node, or returned unchanged when no new valid segments are found.
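The split-and-annotate pattern described for the node-processing functions can be illustrated with a minimal sketch. It is not the module's implementation: the regex splitter stands in for the real sentence splitter, a plain list of dicts stands in for the pandas dataframe, and the column names, the ID numbering scheme and the "INSTRUCTION" label are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Naive stand-in splitter (assumption): split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def add_instruction_rows(text, survey_item_prefix, study, item_name, module, rows):
    """Return rows plus one new row per sentence, or rows unchanged if nothing valid is found."""
    sentences = split_sentences(text) if isinstance(text, str) else []
    if not sentences:
        return rows
    start = len(rows)
    new_rows = [
        {
            "survey_item_ID": f"{survey_item_prefix}{start + i}",  # hypothetical ID scheme
            "study": study,
            "module": module,
            "item_type": "INSTRUCTION",  # hypothetical label
            "item_name": item_name,
            "text": sentence,
        }
        for i, sentence in enumerate(sentences)
    ]
    return rows + new_rows

rows = add_instruction_rows(
    "Show card 1. Read out.", "EVS_ENG_GB_2008_", "EVS_2008", "Q1", "Perceptions of life", []
)
for row in rows:
    print(row["survey_item_ID"], row["text"])
```

Returning the input collection untouched when no sentence survives mirrors the documented return contract, so callers can always reassign the result.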
-
preprocessing.evs_xml_data_extraction.process_preqtxt_node(filename, preQTxt, survey_item_prefix, study, item_name, module, df_questionnaire)
Extracts information from a preQTxt node (requests and introductions). The text is split into sentences, and the appropriate metadata is attached to each sentence.
- Parameters
filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.
preQTxt (param2) – valid node that is being analyzed.
survey_item_prefix (param3) – prefix of the survey_item_ID metadata
study (param4) – study metadata, retrieved from the filename.
item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.
module (param6) – module of survey_item, extracted in previous steps of the loop.
df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.
- Returns
df_questionnaire, updated with the new valid segments extracted from the preQTxt node, or returned unchanged when no new valid segments are found.
-
preprocessing.evs_xml_data_extraction.process_qstnLit_node(filename, qstnLit, survey_item_prefix, study, item_name, module, df_questionnaire)
Extracts information from a qstnLit node (requests). The text is split into sentences, and the appropriate metadata is attached to each sentence.
- Parameters
filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.
qstnLit (param2) – valid node that is being analyzed.
survey_item_prefix (param3) – prefix of the survey_item_ID metadata
study (param4) – study metadata, retrieved from the filename.
item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.
module (param6) – module of survey_item, extracted in previous steps of the loop.
df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.
- Returns
df_questionnaire, updated with the new valid segments extracted from the qstnLit node, or returned unchanged when no new valid segments are found.
-
preprocessing.evs_xml_data_extraction.process_response_with_id_node(filename, node, survey_item_prefix, study, df_questionnaire, response_dict)
Extracts information from a response node that has the ID attribute. A node with the ID attribute holds the translated text itself. The text and category value are stored in the response_dict dictionary so that response categories referencing this ID can reuse them.
- Parameters
filename (param1) – name of the input file.
node (param2) – response category node that is being analyzed.
survey_item_prefix (param3) – prefix of the survey_item_ID metadata
study (param4) – study metadata, retrieved from the filename.
df_questionnaire (param5) – pandas dataframe where the questionnaire is being stored.
response_dict (param6) – dictionary that stores response category text and value.
- Returns
df_questionnaire (pandas dataframe) with the new information extracted from the response node, and the updated response_dict.
-
preprocessing.evs_xml_data_extraction.process_response_with_id_reference_node(node, survey_item_prefix, study, df_questionnaire, response_dict)
Extracts information from a response node that contains a reference to an ID (attribute sdatrefs). The response text and category value are retrieved from the response_dict dictionary, which is populated by process_response_with_id_node().
- Parameters
node (param) – response category node that is being analyzed.
survey_item_prefix (param) – prefix of the survey_item_ID metadata
study (param) – study metadata, retrieved from the filename.
df_questionnaire (param) – pandas dataframe where the questionnaire is being stored.
response_dict (param) – dictionary that stores response category text and value.
- Returns
df_questionnaire (pandas dataframe), with new information extracted from the response node.
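The interplay between the two response functions (store on ID, look up on sdatrefs) can be sketched as follows. This is an illustration only: the XML fragment is made up, the element name catgry is an assumption, and the single function below merges the two steps that the module keeps in separate functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the ID and sdatrefs attributes come from the
# reference above; the catgry element name and the texts are assumptions.
XML = """
<codeBook>
  <catgry ID="C1"><catValu>1</catValu><txt>Very important</txt></catgry>
  <catgry ID="C2"><catValu>2</catValu><txt>Not important</txt></catgry>
  <catgry sdatrefs="C1"/>
  <catgry sdatrefs="C2"/>
</codeBook>
"""

def collect_responses(root):
    """Store (text, value) for nodes with an ID; resolve nodes that reference an ID."""
    response_dict = {}
    resolved = []
    for node in root.iter("catgry"):
        if "ID" in node.attrib:
            # The node carries the text itself: remember it under its ID.
            response_dict[node.attrib["ID"]] = (node.findtext("txt"), node.findtext("catValu"))
        elif "sdatrefs" in node.attrib:
            # The node only points to an ID: reuse the stored text and value.
            entry = response_dict.get(node.attrib["sdatrefs"])
            if entry is not None:
                resolved.append(entry)
    return response_dict, resolved

response_dict, resolved = collect_responses(ET.fromstring(XML))
print(resolved)
```

The sketch relies on ID-bearing nodes appearing before the nodes that reference them, which is why the dictionary is passed along and updated as the file is traversed.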
-
preprocessing.evs_xml_data_extraction.process_txt_node(filename, txt, survey_item_prefix, study, item_name, module, df_questionnaire)
Extracts information from a txt node (requests). The text is split into sentences, and the appropriate metadata is attached to each sentence.
- Parameters
filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.
txt (param2) – valid node that is being analyzed.
survey_item_prefix (param3) – prefix of the survey_item_ID metadata
study (param4) – study metadata, retrieved from the filename.
item_name (param5) – item_name metadata, retrieved from the process_valid_node() method.
module (param6) – module of survey_item, extracted in previous steps of the loop.
df_questionnaire (param7) – pandas dataframe where the questionnaire is being stored.
- Returns
df_questionnaire, updated with the new valid segments extracted from the txt node, or returned unchanged when no new valid segments are found.
-
preprocessing.evs_xml_data_extraction.process_valid_node(filename, node, survey_item_prefix, study, module, df_questionnaire)
Calls the appropriate method, chosen by node tag, to extract information from the node and its children. It applies only when the node is valid, that is, when its variable is listed in the EVSModulesYYYY classes.
- Parameters
filename (param1) – name of the input file. It will be used to instantiate the sentence splitter.
node (param2) – valid node that is being analyzed.
survey_item_prefix (param3) – prefix of the survey_item_ID metadata
study (param4) – study metadata, retrieved from the filename.
module (param5) – module of survey_item, extracted in previous steps of the loop.
df_questionnaire (param6) – pandas dataframe where the questionnaire is being stored.
- Returns
df_questionnaire, updated with the new valid segments extracted from the node, or returned unchanged when no new valid segments are found.
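Dispatching on the node tag can be pictured with the sketch below. It is an assumption about the control flow, not the actual code: the handler functions are simplified placeholders that return a (label, text) pair, the labels are hypothetical, and the XML fragment is invented for the example.

```python
import xml.etree.ElementTree as ET

def text_of(node):
    """Return the stripped text of a node, or '' when the node has no text."""
    return (node.text or "").strip()

# Placeholder handlers keyed by tag; the tags are the ones documented above.
HANDLERS = {
    "ivuInstr": lambda node: ("instruction", text_of(node)),
    "preQTxt": lambda node: ("introduction_or_request", text_of(node)),
    "qstnLit": lambda node: ("request", text_of(node)),
    "txt": lambda node: ("request", text_of(node)),
}

def process_children(node):
    """Walk the node and its descendants, calling the handler that matches each tag."""
    results = []
    for child in node.iter():
        handler = HANDLERS.get(child.tag)
        if handler is not None:
            results.append(handler(child))
    return results

# Hypothetical variable node; the wrapper elements are assumptions.
var = ET.fromstring(
    "<var name='v1'><qstn>"
    "<preQTxt>Please look at this card.</preQTxt>"
    "<qstnLit>How important is work?</qstnLit>"
    "<ivuInstr>Read out.</ivuInstr>"
    "</qstn><notes>ignored</notes></var>"
)
print(process_children(var))
```

A lookup table keeps the tag-to-function mapping in one place; tags without a handler are skipped.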
-
preprocessing.evs_xml_data_extraction.retrieve_item_module(study, country_language, name)
Retrieves the module of the survey_item, based on information from the EVSModulesYYYY objects. This information comes from the EVS_modules_reference.xlsx file, sent by Evelyn.
- Parameters
study (param1) – study metadata, embedded in the file name.
country_language (param2) – country_language metadata, embedded in the file name.
name (param3) – attribute ‘name’ of the analyzed node, which is an EVS variable. The variables of interest are listed in the EVSModulesYYYY objects.
- Returns
appropriate module of survey_item (string).
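A module lookup of this kind could look like the sketch below. The internal layout of the EVSModulesYYYY classes is not shown in this reference, so the stand-in class, its modules attribute, the variable names and the registry are all hypothetical. The country_language argument is left out for brevity, and returning None for an unlisted variable is also an assumption.

```python
class ModulesSketch2008:
    """Hypothetical stand-in for an EVSModulesYYYY class; variable names are made up."""
    modules = {
        "Perceptions of life": {"v1", "v2"},
        "Work": {"v90", "v91"},
    }

# Hypothetical mapping from study metadata to the class that lists its variables.
REGISTRY = {"EVS_2008": ModulesSketch2008}

def retrieve_item_module_sketch(study, name):
    """Return the module that lists the variable name, or None when it is not listed."""
    modules_class = REGISTRY.get(study)
    if modules_class is None:
        return None
    for module, variables in modules_class.modules.items():
        if name.lower() in variables:
            return module
    return None

print(retrieve_item_module_sketch("EVS_2008", "v90"))
```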
-
preprocessing.preprocessing_evs_utils.clean_answer_text_evs(text, filename)
Removes undesired characters from request/response text.
- Parameters
text (param) – request/response text extracted from the input file.
filename (param) – name of the input file.
- Returns
clean request/response text (string).
-
preprocessing.preprocessing_evs_utils.clean_instruction(text)
Removes undesired characters from instruction text.
- Parameters
text (param1) – instruction text extracted from the input file.
- Returns
clean instruction text (string), or ‘’ when text is not a string.
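The documented contract (a cleaned string, or an empty string for non-string input) can be sketched as below. Which characters count as undesired is not stated in this reference, so the choice of line breaks, tabs and repeated spaces is an assumption.

```python
import re

def clean_instruction_sketch(text):
    """Return cleaned text, or '' when the input is not a string (e.g. None or NaN)."""
    if not isinstance(text, str):
        return ""
    # Assumed clean-up rules: turn line breaks and tabs into spaces, then collapse runs.
    text = re.sub(r"[\r\n\t]+", " ", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()

print(repr(clean_instruction_sketch("  Read\n out.\t")))
print(repr(clean_instruction_sketch(None)))
```

Guarding on the type first is useful when values come from a dataframe, where missing cells are often floats rather than strings.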
-
preprocessing.preprocessing_evs_utils.clean_text(text, filename)
Removes undesired characters from request/response text.
- Parameters
text (param) – request/response text extracted from the input file.
filename (param) – name of the input file.
- Returns
clean request/response text (string).
-
preprocessing.preprocessing_evs_utils.get_country_language_and_study_info(filename)
Retrieves the country/language and study metadata based on the input filename.
- Parameters
filename (param) – name of the input file.
- Returns
country/language (string) and study (string) metadata.
-
preprocessing.preprocessing_evs_utils.standardize_item_name(item_name)
Standardizes a given item_name if it is not already in the standard form.
- Parameters
item_name (param1) – item name extracted from the input file.
- Returns
standardized item_name (string).
-
preprocessing.preprocessing_evs_utils.standardize_special_response_category(filename, text)
Standardizes the text of special response categories (don’t know, no answer, not applicable) according to the language indicated in the filename.
- Parameters
filename (param1) – name of the input file.
text (param2) – response text.
- Returns
standardized response category text (string).
-
preprocessing.preprocessing_evs_utils.standardize_special_response_category_value(filename, catValu, text)
Standardizes a response category value if it belongs to a special response category. Standard values: Refusal = 777, Don’t know = 888, Does not apply = 999.
- Parameters
filename (param1) – name of the input file.
catValu (param2) – response category value, extracted from input file.
text (param3) – text of response category, to test against special response category patterns.
- Returns
standardized response category value (string).
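The value standardization can be sketched as a pattern lookup. Only the three standard values come from the description above; the English-only regular expressions are hypothetical, and the language selection that the real function derives from the filename is left out.

```python
import re

# Standard values from the description above; the patterns are assumptions.
SPECIAL_VALUES = [
    (re.compile(r"refus", re.IGNORECASE), "777"),
    (re.compile(r"don.?t know", re.IGNORECASE), "888"),
    (re.compile(r"does not apply|not applicable", re.IGNORECASE), "999"),
]

def standardize_value_sketch(cat_valu, text):
    """Return the standard value for a special category, else the original value."""
    for pattern, value in SPECIAL_VALUES:
        if pattern.search(text):
            return value
    return cat_valu

print(standardize_value_sketch("8", "Don't know"))
print(standardize_value_sketch("3", "Agree"))
```

Ordinary categories fall through the loop and keep the value extracted from the input file.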
-
class preprocessing.evsmodules.EVSModules1990
Class encapsulating the variables of interest in EVS 1990. The files give no indication of the module names.
-
class preprocessing.evsmodules.EVSModules1999
Class encapsulating variables that compose the following modules in EVS 1999: Perceptions of life, Politics and society, Environment, Family, Work, Religion and morale, National Identity, Life Experiences, Socio demographics, Administrative.
-
class preprocessing.evsmodules.EVSModules2008
Class encapsulating variables that compose the following modules in EVS 2008: Perceptions of life, Politics and society, Environment, Family, Work, Religion and morale, National Identity, Life Experiences, Socio demographics, Respondent Parents, Respondent Partner, Administrative.