Preprocessing SHARE COVID Questionnaires
-
preprocessing.share_covid_data_extraction.get_language_country_iso_codes(language_country)
Returns the ISO codes for the language and country, based on the values retrieved from the input file. Only the target languages of MCSQ are supported.
- Parameters
language_country (string) – language and country information retrieved from the input file.
- Returns
language_country (string). Variable representing the language and country metadata in ISO codes.
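The conversion described above can be sketched as a pair of lookup tables. This is a minimal illustration, not the module's actual code: the function name `to_iso_codes`, the input layout `<language> (<country>)`, the three-letter language plus two-letter country output shape, and the entries in both tables are all assumptions, since the input file format and the full list of MCSQ target languages are not given here.

```python
# Hypothetical lookup tables: the real input labels and the full list of
# MCSQ target languages are not given in this documentation.
LANGUAGE_CODES = {"German": "GER", "French": "FRE", "Spanish": "SPA"}
COUNTRY_CODES = {"Germany": "DE", "Austria": "AT", "France": "FR", "Spain": "ES"}


def to_iso_codes(language_country: str) -> str:
    """Map a label such as 'German (Austria)' to a code such as 'GER_AT'."""
    # Assumed input layout: '<language> (<country>)'.
    language, _, country = language_country.partition("(")
    language = language.strip()
    country = country.rstrip(") ").strip()
    try:
        return f"{LANGUAGE_CODES[language]}_{COUNTRY_CODES[country]}"
    except KeyError:
        # Anything outside the tables is rejected rather than guessed.
        raise ValueError(f"unsupported language/country: {language_country!r}") from None
```

Raising on unknown labels, rather than returning a default, mirrors the note that only target languages are handled.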
-
preprocessing.share_covid_data_extraction.preprocess_answer_segment(row, df_questionnaire, survey_item_prefix, splitter)
Extracts and processes the answer segments from the input file.
- Parameters
row – dataframe row currently being analyzed.
df_questionnaire – pandas dataframe that stores the questionnaire data.
survey_item_prefix – prefix of survey_item_ID.
splitter – NLTK sentence-segmentation object instantiated for the questionnaire language.
- Returns
updated df_questionnaire with new valid answer segments.
-
preprocessing.share_covid_data_extraction.preprocess_instruction_segment(row, df_questionnaire, survey_item_prefix, splitter)
Extracts and processes the instruction segments from the input file.
- Parameters
row – dataframe row currently being analyzed.
df_questionnaire – pandas dataframe that stores the questionnaire data.
survey_item_prefix – prefix of survey_item_ID.
splitter – NLTK sentence-segmentation object instantiated for the questionnaire language.
- Returns
updated df_questionnaire with new valid instruction segments.
-
preprocessing.share_covid_data_extraction.preprocess_question_segment(row, df_questionnaire, survey_item_prefix, splitter)
Extracts and processes the question segments from the input file.
- Parameters
row – dataframe row currently being analyzed.
df_questionnaire – pandas dataframe that stores the questionnaire data.
survey_item_prefix – prefix of survey_item_ID.
splitter – NLTK sentence-segmentation object instantiated for the questionnaire language.
- Returns
updated df_questionnaire with new valid question segments.
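The answer, instruction, and question functions share one shape: take a piece of text, split it into sentences, and append one record per valid segment. The sketch below shows that shape under loud assumptions. It uses a plain list of dicts in place of the pandas dataframe and a regex in place of the NLTK splitter so that it runs without third-party packages, and the names `regex_splitter`, `add_segments`, `item_type`, and `text` are hypothetical; only `survey_item_ID` and `survey_item_prefix` come from the documentation above.

```python
import re


def regex_splitter(text):
    """Stand-in for the NLTK sentence splitter: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def add_segments(text, rows, survey_item_prefix, item_type, split=regex_splitter):
    """Append one record per non-empty sentence of `text` and return `rows`."""
    for sentence in split(text):
        rows.append({
            # Assumed ID scheme: prefix followed by a running counter.
            "survey_item_ID": f"{survey_item_prefix}{len(rows) + 1}",
            "item_type": item_type,
            "text": sentence,
        })
    return rows


rows = add_segments("Please answer. How are you?", [], "PREFIX_", "instruction")
```

A list of dicts like `rows` can be handed to `pandas.DataFrame(rows)` if a dataframe is needed afterwards.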
-
preprocessing.share_covid_data_extraction.replace_abbreviations_and_fills(sentence)
Replaces abbreviations and fill text in a text segment from the input file.
- Parameters
sentence (string) – text segment from the input file.
- Returns
sentence (string). Text segment without abbreviations and fill text.
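As a rough illustration of this kind of cleanup, the sketch below expands a few abbreviations and drops fill placeholders with a regular expression. Everything specific in it is an assumption: the abbreviation table, the idea that fills appear as `{FillName}`, the choice to delete fills rather than substitute text, and the function name `clean_sentence`. The actual rules used for the SHARE COVID files are not documented here.

```python
import re

# Hypothetical examples; the actual abbreviation list and fill markup
# used in the SHARE COVID input files are not given in this documentation.
ABBREVIATIONS = {"e.g.": "for example", "etc.": "and so on"}
FILL_PATTERN = re.compile(r"\{[^{}]*\}")  # assumed: fills look like {FillName}


def clean_sentence(sentence: str) -> str:
    """Expand known abbreviations, drop fill placeholders, tidy whitespace."""
    for short, expanded in ABBREVIATIONS.items():
        sentence = sentence.replace(short, expanded)
    sentence = FILL_PATTERN.sub("", sentence)
    # Removing a fill can leave doubled spaces behind; collapse them.
    return re.sub(r"\s+", " ", sentence).strip()
```

One plausible reason to expand abbreviations before segmentation is that their periods can otherwise be mistaken for sentence boundaries.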
-
preprocessing.share_covid_data_extraction.retrieve_module_from_item_name(item_name)
Returns the module of the question based on the item_name variable. The mapping between item names and modules comes from http://www.share-project.org/special-data-sets/share-covid-19-questionnaire.html
- Parameters
item_name – item_name information retrieved from the input file.
- Returns
module (string). Module of the question.
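A lookup of this kind might be written as a prefix table, as sketched below. The prefixes and module names here are deliberate placeholders, not the real SHARE COVID-19 values (those are defined by the questionnaire documentation linked above), and both the function name `module_from_item_name` and the rule "the leading letters of the item name identify the module" are assumptions.

```python
import re

# Placeholder prefixes and module names, for illustration only; the real
# mapping comes from the SHARE COVID-19 questionnaire documentation.
MODULES_BY_PREFIX = {"XH": "Module H", "XW": "Module W"}


def module_from_item_name(item_name, modules=MODULES_BY_PREFIX, default="No module"):
    """Return the module whose prefix equals the leading letters of item_name."""
    match = re.match(r"[A-Za-z]+", item_name)
    prefix = match.group(0).upper() if match else ""
    return modules.get(prefix, default)
```

Passing the table as an argument keeps the placeholder data separate from the lookup logic, so the real mapping can be supplied without touching the function.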
-
preprocessing.share_covid_data_extraction.set_initial_structures(language_country)
Sets the initial structures needed to extract each questionnaire.
- Parameters
language_country – language and country of the subdataframe being analyzed.
- Returns
df_questionnaire (pandas dataframe), which stores the questionnaire data; survey_item_prefix (string), the prefix of survey_item_ID; and splitter (NLTK object), a sentence splitter used to segment request, introduction, and instruction segments when necessary.