Data extraction, preprocessing and treatment methods commons
Python 3 script with utility functions for preprocessing.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
- preprocessing.utils.determine_country(filename)
Determines the full name of the country from the ISO country code embedded in the file name.
- Parameters
filename (param1) – input file name.
- Returns
full name of the country (string).
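A minimal sketch of the lookup described above, assuming the ISO 3166-1 alpha-2 country code is the last underscore-separated token of the file name (for example `ESS_R01_2002_ENG_GB.xml`). The file-name layout, the `COUNTRY_NAMES` table and the name `country_from_filename` are illustrative assumptions, not the module's actual implementation.

```python
import os

# Hypothetical subset of ISO 3166-1 alpha-2 codes; the real module
# may cover a different set of countries.
COUNTRY_NAMES = {
    "GB": "United Kingdom",
    "DE": "Germany",
    "FR": "France",
    "PT": "Portugal",
    "ES": "Spain",
}


def country_from_filename(filename):
    """Return the full country name for the code embedded in filename.

    Assumes the country code is the last '_'-separated token of the
    file name, before the extension. Returns None for unknown codes.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    code = stem.split("_")[-1].upper()
    return COUNTRY_NAMES.get(code)
```

Returning None for an unknown code is one possible choice; raising an exception would surface misnamed files earlier.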
- preprocessing.utils.determine_sentence_tokenizer(filename)
Provides the sentence splitter suffix needed to instantiate the splitter for the target language (information embedded in the filename).
- Parameters
filename (param1) – input file name.
- Returns
a sentence splitter suffix (string) according to the target language.
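As a rough illustration of this step, the sketch below maps a language code taken from the file name to the name of an NLTK Punkt model. It assumes the three-letter language code is the second-to-last underscore-separated token of the file name; that layout, the `PUNKT_LANGUAGE` table and the name `tokenizer_suffix_from_filename` are hypothetical, and the actual suffix format returned by the module may differ.

```python
import os

# Hypothetical mapping from language codes used in file names to the
# names of Punkt models distributed with NLTK.
PUNKT_LANGUAGE = {
    "ENG": "english",
    "GER": "german",
    "FRE": "french",
    "POR": "portuguese",
    "SPA": "spanish",
    "RUS": "russian",
    "NOR": "norwegian",
    "CZE": "czech",
}


def tokenizer_suffix_from_filename(filename):
    """Return the Punkt model name for the language embedded in filename.

    Assumes the language code is the second-to-last '_'-separated token.
    Raises ValueError when the code is missing or not in the table.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    tokens = stem.split("_")
    if len(tokens) < 2:
        raise ValueError("no language code found in %r" % filename)
    language = tokens[-2].upper()
    if language not in PUNKT_LANGUAGE:
        raise ValueError("unsupported language code %r" % language)
    return PUNKT_LANGUAGE[language]
```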
- preprocessing.utils.get_sentence_splitter(filename)
Instantiates the Punkt sentence tokenizer from NLTK that matches the target language, according to the information embedded in the filename.
- Parameters
filename (param1) – input file name.
- Returns
a sentence splitter (NLTK object) instantiated according to the target language.
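One way such a splitter could be instantiated is sketched below. It assumes NLTK is installed, the Punkt models have been downloaded, and the installed NLTK release still loads the pickled per-language models through `nltk.data.load`; newer releases distribute the Punkt data in a different format, so the loading call may need adapting. The helper names `punkt_resource_path` and `load_sentence_splitter` are hypothetical.

```python
def punkt_resource_path(language_name):
    """Build the NLTK resource path of a pickled Punkt model."""
    return "tokenizers/punkt/{}.pickle".format(language_name)


def load_sentence_splitter(language_name):
    """Load the Punkt sentence tokenizer for language_name.

    Assumption: the pickled Punkt models are available locally and the
    installed NLTK version supports loading them this way.
    """
    # Imported lazily so the path helper works without NLTK installed.
    import nltk

    return nltk.data.load(punkt_resource_path(language_name))
```

With the models available, `load_sentence_splitter("english").tokenize(text)` would return the list of sentences found in `text`.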
- preprocessing.utils.recognize_standard_response_scales(filename, text)
Recognizes special answer categories from EVS by testing the answer segment against the language-dependent pattern definitions for those categories.
- Parameters
filename (param1) – input file name.
text (param2) – answer text segment.
- Returns
If a pattern is found, returns a string naming the special category; otherwise returns None.
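The pattern test can be sketched as follows. To stay self-contained, this version takes a language code directly instead of deriving it from the file name. The pattern table, the category labels and the name `match_special_category` are all hypothetical and far shorter than what a real survey corpus would need.

```python
import re

# Hypothetical, heavily abbreviated pattern definitions, keyed by
# language code and then by special category label.
SPECIAL_CATEGORY_PATTERNS = {
    "ENG": {
        "dontknow": re.compile(r"^\s*(don'?t|do not) know\s*$", re.IGNORECASE),
        "refusal": re.compile(r"^\s*(refusal|refused)\s*$", re.IGNORECASE),
        "dontapply": re.compile(
            r"^\s*(not applicable|does not apply)\s*$", re.IGNORECASE
        ),
    },
    "GER": {
        "dontknow": re.compile(r"^\s*wei(ß|ss) nicht\s*$", re.IGNORECASE),
        "refusal": re.compile(r"^\s*verweigert\s*$", re.IGNORECASE),
    },
}


def match_special_category(language, text):
    """Return the label of the first matching special category, or None."""
    patterns = SPECIAL_CATEGORY_PATTERNS.get(language, {})
    for category, pattern in patterns.items():
        if pattern.match(text):
            return category
    return None
```

Anchoring each pattern to the whole segment keeps ordinary answers that merely contain a phrase such as "know" from being classified as special categories.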
Main method that calls the EVS/ESS scripts to generate MCSQ spreadsheet inputs.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
- preprocessing.main_xml_files.main(folder_path)
Calls the transformation algorithms in the evs_xml_data_extraction, ess_xml_data_extraction and share_xml_data_extraction scripts.
evs_xml_data_extraction is called for EVS files, ess_xml_data_extraction for ESS files, and share_xml_data_extraction for SHARE files.
The algorithm transforms an XML file into a structured spreadsheet format with valuable metadata.
Call the main script with folder_path as its argument, for instance: reset && python3 main.py /path/to/your/data
- Parameters
folder_path (param1) – the path of the directory containing the files to transform.
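The dispatch performed by main can be pictured with the sketch below, which only plans which script would handle each file rather than running any extraction. It assumes the study is identified by the first underscore-separated token of the file name (`EVS`, `ESS` or `SHA`); that convention and the names `extractor_for` and `plan_extraction` are hypothetical.

```python
import os

# Hypothetical file-name prefixes mapped to the extraction scripts
# named in the description above.
EXTRACTORS = {
    "EVS": "evs_xml_data_extraction",
    "ESS": "ess_xml_data_extraction",
    "SHA": "share_xml_data_extraction",
}


def extractor_for(filename):
    """Return the name of the script that would handle filename, or None."""
    study = os.path.basename(filename).split("_")[0].upper()
    return EXTRACTORS.get(study)


def plan_extraction(folder_path):
    """List (file path, script name) pairs for the XML files in folder_path."""
    plan = []
    for name in sorted(os.listdir(folder_path)):
        if not name.lower().endswith(".xml"):
            continue  # skip anything that is not an XML input
        script = extractor_for(name)
        if script is not None:
            plan.append((os.path.join(folder_path, name), script))
    return plan
```

Separating the planning step from the extraction itself makes it easy to check which files will be picked up before any spreadsheet is written.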