Data extraction, preprocessing and treatment methods commons

Python3 script with utility functions for preprocessing.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com

preprocessing.utils.determine_country(filename)[source]

Determines the full name of the country, based on the ISO country code embedded in the file name.

Parameters

filename (param1) – input file name.

Returns

full name of the country (string).
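A minimal sketch of how such a lookup could work. The file name convention (underscore-separated fields ending in a two-letter country code, as in ESS_R01_2002_ENG_GB.xml), the COUNTRY_NAMES table and the function name determine_country_sketch are all assumptions made for illustration; the actual implementation may differ.

```python
import re

# Hypothetical subset of ISO 3166-1 alpha-2 codes; the real module may cover more.
COUNTRY_NAMES = {
    "GB": "United Kingdom",
    "DE": "Germany",
    "FR": "France",
    "ES": "Spain",
}


def determine_country_sketch(filename):
    """Return the full country name for the ISO code embedded in filename.

    Assumes the code is the last underscore-separated field, optionally
    followed by a file extension. Returns None if no known code is found.
    """
    match = re.search(r"_([A-Z]{2})(?:\.\w+)?$", filename)
    if match is None:
        return None
    return COUNTRY_NAMES.get(match.group(1))


print(determine_country_sketch("ESS_R01_2002_ENG_GB.xml"))
```

Under these assumptions, a file name without a trailing country code yields None instead of raising an error.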

preprocessing.utils.determine_sentence_tokenizer(filename)[source]

Provides the suffix needed to instantiate the sentence splitter for the target language (information embedded in the file name).

Parameters

filename (param1) – input file name.

Returns

a sentence splitter suffix (string) according to the target language.
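The suffix selection can be sketched as a lookup on a language code found in the file name. The language codes, the "_CODE_" file name convention, the suffix values and the function name below are assumptions for illustration, not the module's confirmed behaviour.

```python
# Hypothetical mapping from language codes in the file name to Punkt model suffixes.
PUNKT_SUFFIXES = {
    "ENG": "english.pickle",
    "GER": "german.pickle",
    "FRE": "french.pickle",
    "SPA": "spanish.pickle",
}


def sentence_tokenizer_suffix_sketch(filename):
    """Return the Punkt model suffix for the language code embedded in filename."""
    for code, suffix in PUNKT_SUFFIXES.items():
        # Assumed convention: the language code appears between underscores.
        if f"_{code}_" in filename:
            return suffix
    # Raising here is a design choice of this sketch, not documented behaviour.
    raise ValueError(f"No known language code in file name: {filename}")


print(sentence_tokenizer_suffix_sketch("EVS_R03_1999_GER_DE"))
```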

preprocessing.utils.get_sentence_splitter(filename)[source]

Decides which NLTK Punkt sentence tokenizer should be instantiated, according to the information embedded in the file name.

Parameters

filename (param1) – input file name.

Returns

a sentence splitter (NLTK object) instantiated according to the target language.
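A hedged sketch of the instantiation step, assuming an NLTK installation that ships the pickled Punkt models under tokenizers/punkt/ and that they have already been downloaded. The language table, the file name convention and the helper names are hypothetical.

```python
# Hypothetical mapping from language codes in the file name to Punkt model names.
LANGUAGE_MODELS = {"ENG": "english", "GER": "german"}


def punkt_resource_for(filename):
    """Build the NLTK resource path of the Punkt model matching filename."""
    for code, model in LANGUAGE_MODELS.items():
        # Assumed convention: the language code appears between underscores.
        if f"_{code}_" in filename:
            return f"tokenizers/punkt/{model}.pickle"
    raise ValueError(f"No known language code in file name: {filename}")


def get_sentence_splitter_sketch(filename):
    """Load the Punkt sentence tokenizer for the language embedded in filename."""
    # Imported lazily so the path helper above can be used without NLTK installed.
    import nltk

    return nltk.data.load(punkt_resource_for(filename))
```

The returned object can then split text with its tokenize method, for example get_sentence_splitter_sketch("ESS_R01_2002_ENG_GB.xml").tokenize(text), which returns a list of sentences.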

preprocessing.utils.recognize_standard_response_scales(filename, text)[source]

Recognizes special answer categories from EVS by testing the answer segment against the language-dependent pattern definitions for the special categories.

Parameters
  • filename (param1) – input file name.

  • text (param2) – answer text segment.

Returns

If a pattern is found, returns a string identifying the special category; otherwise returns None.
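The pattern test described above can be sketched with one list of regular expressions per language. The patterns, the category labels ("dontknow", "refusal", "dontapply"), the file name convention and the function name are illustrative assumptions and are not taken from the module.

```python
import re

# Hypothetical language-dependent patterns for special answer categories.
SPECIAL_CATEGORY_PATTERNS = {
    "ENG": [
        (re.compile(r"^\s*don'?t\s+know\s*$", re.IGNORECASE), "dontknow"),
        (re.compile(r"^\s*(refusal|no\s+answer)\s*$", re.IGNORECASE), "refusal"),
        (re.compile(r"^\s*not\s+applicable\s*$", re.IGNORECASE), "dontapply"),
    ],
    "GER": [
        (re.compile(r"^\s*wei(ß|ss)\s+nicht\s*$", re.IGNORECASE), "dontknow"),
        (re.compile(r"^\s*keine\s+antwort\s*$", re.IGNORECASE), "refusal"),
    ],
}


def recognize_special_category_sketch(filename, text):
    """Return the special category matched by text, or None if nothing matches."""
    for code, patterns in SPECIAL_CATEGORY_PATTERNS.items():
        # Assumed convention: the language code appears between underscores.
        if f"_{code}_" not in filename:
            continue
        for pattern, category in patterns:
            if pattern.match(text):
                return category
    return None


print(recognize_special_category_sketch("EVS_R03_1999_ENG_GB", "Don't know"))
print(recognize_special_category_sketch("EVS_R03_1999_ENG_GB", "Strongly agree"))
```

Anchoring each pattern to the whole segment keeps ordinary answers that merely contain a phrase such as "don't know" from being classified as a special category.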

Main method that calls the EVS/ESS scripts to generate MCSQ spreadsheet inputs.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com

preprocessing.main_xml_files.main(folder_path)[source]

This main file calls the transformation algorithms inside the evs_xml_data_extraction, ess_xml_data_extraction and share_xml_data_extraction scripts.

  • evs_xml_data_extraction is called for EVS files.

  • ess_xml_data_extraction is called for ESS files.

  • share_xml_data_extraction is called for SHARE files.

The algorithm transforms an XML file into a structured spreadsheet format with valuable metadata.

Call the main script with folder_path as its argument, for instance: reset && python3 main.py /path/to/your/data

Parameters

folder_path (param1) – the path of the directory containing the files to transform.
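The dispatch performed by main can be sketched as follows. This is only an illustration: the assumption that file names start with a study prefix (EVS, ESS, SHA), the EXTRACTORS table and the helper names select_extractor and plan_extraction are hypothetical, and the sketch only reports which script would handle each file rather than calling it.

```python
import os

# Hypothetical mapping from file name prefix to the extraction script named above.
EXTRACTORS = {
    "EVS": "evs_xml_data_extraction",
    "ESS": "ess_xml_data_extraction",
    "SHA": "share_xml_data_extraction",
}


def select_extractor(filename):
    """Return the name of the script responsible for filename, or None."""
    for prefix, module_name in EXTRACTORS.items():
        if filename.startswith(prefix):
            return module_name
    return None


def plan_extraction(folder_path):
    """List (file name, script name) pairs for the XML files in folder_path."""
    plan = []
    for name in sorted(os.listdir(folder_path)):
        if not name.lower().endswith(".xml"):
            continue  # only XML inputs are considered in this sketch
        module_name = select_extractor(name)
        if module_name is not None:
            plan.append((name, module_name))
    return plan


print(select_extractor("ESS_R01_2002_ENG_GB.xml"))
```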