Plain text to spreadsheet (ESS)

preprocessing.txt2spreadsheet.main(folder_path, has_supplementary)[source]

Main method of the ESS plain text to spreadsheet data transformation algorithm. The data is extracted from the plain text file (which follows an internal specification of the MCSQ project), preprocessed, and assigned the appropriate metadata.

The algorithm outputs a CSV representation of df_questionnaire, the pandas dataframe used to store questionnaire data.

Parameters
  • folder_path (param1) – path to the folder where the plain text files are.

  • has_supplementary (param2) – boolean that indicates whether a supplementary spreadsheet should be appended.

preprocessing.txt2spreadsheet.process_answer_segment(raw_item, survey_item_prefix, study, item_name, df_questionnaire, country_language)[source]

Extracts and processes the answer segments from a raw item. The answer segments always come after the {ANSWERS} tag. If there are no answer segments, the answer segment defaults to the target-language equivalent of ‘write down’.

Parameters
  • raw_item (param1) – raw survey item, retrieved in previous steps.

  • survey_item_prefix (param2) – prefix of survey_item_ID.

  • study (param3) – metadata parameter about study embedded in the file name.

  • item_name (param4) – item_name metadata parameter, retrieved in previous steps.

  • df_questionnaire (param5) – pandas dataframe to store questionnaire data.

  • country_language (param6) – country_language metadata, embedded in file name.

Returns

The updated df_questionnaire if new valid answer segments were added; otherwise df_questionnaire unchanged.
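The answer extraction rule described above can be illustrated with a minimal sketch. It is not the project's implementation: the helper name extract_answer_segments, the fallback table, the ENG_GB code, and the assumption that each answer sits on its own line after the tag are all hypothetical.

```python
ANSWERS_TAG = "{ANSWERS}"

# Hypothetical fallback table: the real per-language 'write down' texts
# are not listed in this documentation.
WRITE_DOWN_FALLBACK = {"ENG_GB": "Write down"}


def extract_answer_segments(raw_item, country_language):
    """Return the answer lines found after {ANSWERS}, or a fallback.

    Assumes (for illustration only) one answer per line after the tag.
    """
    # partition() yields an empty tail when the tag is absent.
    _, _, tail = raw_item.partition(ANSWERS_TAG)
    answers = [line.strip() for line in tail.splitlines() if line.strip()]
    if not answers:
        # No answer segments: fall back to the 'write down' text.
        return [WRITE_DOWN_FALLBACK.get(country_language, "Write down")]
    return answers
```

An item without an {ANSWERS} tag, or with nothing after it, yields only the fallback text, which mirrors the behaviour described for open-ended items.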

preprocessing.txt2spreadsheet.process_intro_segment(raw_item, survey_item_prefix, study, item_name, df_questionnaire, splitter)[source]

Extracts and processes the introduction segments from a raw item. The introduction segments are always between the item name and {QUESTION} tag, for instance:

{INTRO} Ara m’agradaria fer-li algunes preguntes sobre política i el govern.

B1 {QUESTION} En quina mesura diria vostè que l’interessa la política? Vostè diria que l’interessa…

{ANSWERS} Molt Bastant Poc Gens

Parameters
  • raw_item (param1) – raw survey item, retrieved in previous steps.

  • survey_item_prefix (param2) – prefix of survey_item_ID.

  • study (param3) – metadata parameter about study embedded in the file name.

  • item_name (param4) – item_name metadata parameter, retrieved in previous steps.

  • df_questionnaire (param5) – pandas dataframe to store questionnaire data.

  • splitter (param6) – sentence segmentation from NLTK library.

Returns

The updated df_questionnaire if new valid introduction segments were added; otherwise df_questionnaire unchanged.
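As a rough sketch of the introduction handling, the snippet below follows the layout of the example above, where the {INTRO} text precedes the item name and the {QUESTION} tag. The helper name, the item-name pattern, and the regex-based sentence splitter (a simple stand-in for the NLTK splitter) are assumptions made for illustration.

```python
import re

# Hypothetical item-name pattern (e.g. "B1", "G2"); the real regex is not
# given in this documentation.
TRAILING_ITEM_NAME = re.compile(r"\b[A-Z]\d+[a-z]?\s*$")

# Naive stand-in for the NLTK sentence splitter: split on whitespace that
# follows sentence-final punctuation.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def extract_intro_sentences(raw_item):
    """Return the introduction sentences of a raw item, or an empty list."""
    before_question, tag, _ = raw_item.partition("{QUESTION}")
    if not tag or "{INTRO}" not in before_question:
        return []
    intro = before_question.split("{INTRO}", 1)[1].strip()
    # Drop the item name that precedes the {QUESTION} tag.
    intro = TRAILING_ITEM_NAME.sub("", intro).strip()
    return [s for s in SENTENCE_BOUNDARY.split(intro) if s]
```

Returning an empty list when no introduction is present corresponds to leaving df_questionnaire unchanged.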

preprocessing.txt2spreadsheet.process_question_segment(raw_item, survey_item_prefix, study, item_name, df_questionnaire, splitter, country_language)[source]

Extracts and processes the question segments from a raw item. The question segments are always between the {QUESTION} and {ANSWERS} tags, for instance:

G2 {QUESTION} Per a ell és important ser ric. Vol tenir molts diners i coses cares.

{ANSWERS} Se sembla molt a mi Se sembla a mi Se sembla una mica a mi Se sembla poc a mi No se sembla a mi No se sembla gens a mi

Parameters
  • raw_item (param1) – raw survey item, retrieved in previous steps.

  • survey_item_prefix (param2) – prefix of survey_item_ID.

  • study (param3) – metadata parameter about study embedded in the file name.

  • item_name (param4) – item_name metadata parameter, retrieved in previous steps.

  • df_questionnaire (param5) – pandas dataframe to store questionnaire data.

  • splitter (param6) – sentence segmentation from NLTK library.

  • country_language (param7) – country_language metadata, embedded in file name.

Returns

The updated df_questionnaire if new valid question segments were added; otherwise df_questionnaire unchanged.
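The "between {QUESTION} and {ANSWERS}" rule can be sketched with plain string partitioning. The helper name extract_question_text is hypothetical, and the sketch omits sentence segmentation and dataframe handling.

```python
def extract_question_text(raw_item):
    """Return the text between {QUESTION} and {ANSWERS}, whitespace-normalised.

    Returns an empty string when the item has no {QUESTION} tag.
    """
    _, question_tag, after_question = raw_item.partition("{QUESTION}")
    if not question_tag:
        return ""
    # If {ANSWERS} is absent, everything after {QUESTION} is kept.
    question, _, _ = after_question.partition("{ANSWERS}")
    return " ".join(question.split())
```

Applied to the G2 item above, this returns the two sentences of the question as one string, which a sentence splitter could then segment.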

preprocessing.txt2spreadsheet.retrieve_raw_items_from_file(file)[source]

Extracts the raw items from an ESS plain text file, based on an item name regex pattern. Also excludes blank lines and non-relevant scale items.

Parameters

file (param1) – input ESS plain text file. Type: Python module.

Returns

retrieved raw items (list of strings).
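A minimal sketch of grouping lines into raw items might look as follows. The item-name pattern and the helper name are assumptions. For simplicity, the sketch treats a line starting with an item name as the start of a new item, ignores any lines before the first item name, and does not reproduce the exclusion of non-relevant scale items.

```python
import re

# Hypothetical item-name pattern (e.g. "B1", "G2"); the real regex is not
# given in this documentation.
ITEM_NAME = re.compile(r"^[A-Z]\d+[a-z]?\b")


def retrieve_raw_items(file):
    """Group the non-blank lines of an open text file into raw items."""
    items = []
    for line in file:
        line = line.strip()
        if not line:
            continue  # exclude blank lines
        if ITEM_NAME.match(line):
            items.append(line)  # a new item starts here
        elif items:
            items[-1] += "\n" + line  # continuation of the current item
    return items
```

Any iterable of lines works as input, for example an open file handle or an io.StringIO object.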

preprocessing.txt2spreadsheet.set_initial_structures(filename)[source]

Sets the initial structures needed to extract each questionnaire.

Parameters

filename (param1) – name of the input file.

Returns

df_questionnaire, the pandas dataframe that stores questionnaire data; survey_item_prefix, the prefix of survey_item_ID (string); study and country_language, metadata parameters embedded in the file name (both strings); and a sentence splitter used to segment request/introduction/instruction segments when necessary (NLTK object).
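The file naming convention is not specified here, so the following sketch of deriving metadata from a file name rests on a loudly stated assumption: names of the form STUDY_ROUND_YEAR_LANGUAGE_COUNTRY.txt, such as the hypothetical ESS_R01_2002_CAT_ES.txt. The helper name is also hypothetical, and the dataframe and NLTK splitter are left out.

```python
from pathlib import Path


def parse_filename_metadata(filename):
    """Derive (survey_item_prefix, study, country_language) from a file name.

    Assumes the hypothetical pattern STUDY_ROUND_YEAR_LANGUAGE_COUNTRY.txt.
    """
    stem = Path(filename).stem
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    study = "_".join(parts[:3])             # e.g. "ESS_R01_2002"
    country_language = "_".join(parts[3:])  # e.g. "CAT_ES"
    survey_item_prefix = stem + "_"         # prefix for survey_item_ID values
    return survey_item_prefix, study, country_language
```

Raising an error on unexpected layouts is a design choice of this sketch; it keeps wrongly named files from silently producing incorrect metadata.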