Preprocessing SHARE COVID Questionnaires

preprocessing.share_covid_data_extraction.get_language_country_iso_codes(language_country)

Returns the ISO codes for the language and country, based on the values retrieved from the input file. Applies only to the target languages of MCSQ.

Parameters

language_country (string) – language and country information retrieved from the input file.

Returns

language_country (string). Variable representing the language and country metadata in ISO codes.
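
As a rough illustration of the kind of lookup this function performs, the sketch below maps a raw language/country label to an ISO-style code. The raw labels, the LANG_COUNTRY output form, the table contents, and the helper name to_iso_codes are all assumptions made for this example; the values actually handled by the module are not documented here.

```python
# Hypothetical lookup table: both the raw labels and the codes are
# illustrative assumptions, not the module's real data.
ISO_CODES = {
    "french (belgium)": "FRE_BE",
    "german (austria)": "GER_AT",
    "spanish (spain)": "SPA_ES",
}


def to_iso_codes(language_country):
    """Return an ISO-style code for a raw label, or None if it is not a target."""
    key = language_country.strip().lower()
    return ISO_CODES.get(key)


print(to_iso_codes("French (Belgium)"))
print(to_iso_codes("Unknown (Nowhere)"))
```

Returning None for labels outside the table is one way to express "only for target languages"; the real function may signal this differently.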

preprocessing.share_covid_data_extraction.preprocess_answer_segment(row, df_questionnaire, survey_item_prefix, splitter)

Extracts and processes the answer segments from the input file.

Parameters
  • row – dataframe row currently being analyzed.

  • df_questionnaire – pandas dataframe that stores the questionnaire data.

  • survey_item_prefix – prefix of survey_item_ID.

  • splitter – NLTK object for sentence segmentation, instantiated according to the language.

Returns

updated df_questionnaire with new valid answer segments.

preprocessing.share_covid_data_extraction.preprocess_instruction_segment(row, df_questionnaire, survey_item_prefix, splitter)

Extracts and processes the instruction segments from the input file.

Parameters
  • row – dataframe row currently being analyzed.

  • df_questionnaire – pandas dataframe that stores the questionnaire data.

  • survey_item_prefix – prefix of survey_item_ID.

  • splitter – NLTK object for sentence segmentation, instantiated according to the language.

Returns

updated df_questionnaire with new valid instruction segments.

preprocessing.share_covid_data_extraction.preprocess_question_segment(row, df_questionnaire, survey_item_prefix, splitter)

Extracts and processes the question segments from the input file.

Parameters
  • row – dataframe row currently being analyzed.

  • df_questionnaire – pandas dataframe that stores the questionnaire data.

  • survey_item_prefix – prefix of survey_item_ID.

  • splitter – NLTK object for sentence segmentation, instantiated according to the language.

Returns

updated df_questionnaire with new valid question segments.
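
The answer, instruction, and question functions share the same signature and the same general pattern: take a row, segment its text, and add the resulting segments to df_questionnaire. A minimal sketch of that pattern follows. It is not the module's implementation: the column names, the item_type argument, the add_segments helper, and the RegexSplitter stand-in (used so the example runs without NLTK data) are all assumptions.

```python
import re

import pandas as pd


class RegexSplitter:
    """Stand-in for an NLTK sentence tokenizer; it only mimics tokenize()."""

    def tokenize(self, text):
        # Split after ., ! or ? when followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def add_segments(row, df_questionnaire, survey_item_prefix, splitter, item_type):
    """Append one dataframe row per sentence found in row['text']."""
    for sentence in splitter.tokenize(row["text"]):
        position = len(df_questionnaire)
        # Values are listed in the same order as the dataframe columns.
        df_questionnaire.loc[position] = [
            f"{survey_item_prefix}{position}",
            row["item_name"],
            item_type,
            sentence,
        ]
    return df_questionnaire


# Hypothetical column layout and input row.
df = pd.DataFrame(columns=["survey_item_ID", "item_name", "item_type", "text"])
row = {"item_name": "Q1", "text": "Please read carefully. Answer every question."}
df = add_segments(row, df, "SHA_COVID_", RegexSplitter(), "INSTRUCTION")
print(df)
```

Passing the splitter in as an argument, as the documented functions do, keeps the language-specific part out of the segment logic.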

preprocessing.share_covid_data_extraction.replace_abbreviations_and_fills(sentence)

Replaces abbreviations and fill text in a text segment from the input file.

Parameters

sentence (string) – text segment from the input file.

Returns

sentence (string). Text segment without abbreviations or fill text.
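
One plausible way to do this kind of cleanup is a small substitution pass, sketched below. The abbreviation table, the assumption that fills appear as {FillName} placeholders, and the helper name clean_sentence are invented for illustration; the real function may use different patterns and may substitute text for fills instead of dropping them.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"e.g.": "for example", "i.e.": "that is"}

# Assumes fills look like {FillName}; the real markup may differ.
FILL_PATTERN = re.compile(r"\{[^{}]*\}")


def clean_sentence(sentence):
    """Expand known abbreviations, drop fill placeholders, tidy whitespace."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        sentence = sentence.replace(abbreviation, expansion)
    sentence = FILL_PATTERN.sub("", sentence)
    # Collapse the double spaces left behind by removed placeholders.
    return re.sub(r"\s+", " ", sentence).strip()


print(clean_sentence("Did you {FillName} work from home, i.e. remotely?"))
```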

preprocessing.share_covid_data_extraction.retrieve_module_from_item_name(item_name)

Returns the module of the question based on the item_name variable. This information comes from http://www.share-project.org/special-data-sets/share-covid-19-questionnaire.html

Parameters

item_name – item_name information retrieved from the input file.

Returns

module (string). Module of the question.
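
A lookup of this kind is often keyed on the leading letters of the item name. The sketch below shows that idea only: the prefixes, the module names, and the helper name module_from_item_name are invented for illustration, and the authoritative list is the one on the SHARE page referenced above.

```python
import re

# Hypothetical prefix-to-module table; not the real SHARE COVID mapping.
MODULE_BY_PREFIX = {
    "CAH": "Health",
    "CAW": "Work",
    "CAE": "Economic situation",
}


def module_from_item_name(item_name):
    """Return the module for an item name based on its leading letters."""
    match = re.match(r"[A-Za-z]+", item_name)
    if match is None:
        return None
    return MODULE_BY_PREFIX.get(match.group(0).upper())


print(module_from_item_name("CAH002_"))
```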

preprocessing.share_covid_data_extraction.set_initial_structures(language_country)

Sets the initial structures needed to extract each questionnaire.

Parameters

language_country – language and country of the subdataframe being analyzed.

Returns

df_questionnaire (pandas dataframe), which stores the questionnaire data; survey_item_prefix (string), the prefix of survey_item_ID; and splitter (NLTK object), a sentence splitter used to segment request, introduction, and instruction segments when necessary.
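
To make the three return values concrete, here is a minimal sketch of a setup function with the same shape. The column list, the SHA_COVID_ prefix format, the helper name initial_structures, and the RegexSplitter stand-in for the NLTK object are assumptions made for this example, not the module's actual choices.

```python
import re

import pandas as pd

# Hypothetical column layout for the questionnaire dataframe.
COLUMNS = ["survey_item_ID", "module", "item_type", "item_name", "text"]


class RegexSplitter:
    """Stand-in for a language-specific NLTK sentence tokenizer."""

    def tokenize(self, text):
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def initial_structures(language_country):
    """Return an empty dataframe, an ID prefix, and a sentence splitter."""
    df_questionnaire = pd.DataFrame(columns=COLUMNS)
    # Assumed prefix format: study name followed by the language/country code.
    survey_item_prefix = f"SHA_COVID_{language_country}_"
    splitter = RegexSplitter()
    return df_questionnaire, survey_item_prefix, splitter


df_questionnaire, survey_item_prefix, splitter = initial_structures("ENG_GB")
print(survey_item_prefix, list(df_questionnaire.columns))
```

The returned values can then be passed to the segment functions documented above, which take them as df_questionnaire, survey_item_prefix, and splitter.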