Part-of-Speech and Named Entity Recognition annotations in MCSQ
-
annotation.mcsq_annotation.ner_annotation(df, ner, filename)
Iterates through the preprocessed and POS tag annotated ENG, CAT, GER, FRE, POR, NOR and SPA spreadsheets, adding the NER annotation.
- Parameters
df (param1) – the dataframe that holds the preprocessed and POS tag annotated questionnaire.
ner (param2) – pretrained NER model provided by spaCy or FlairNLP.
- Returns
df_tagged (pandas dataframe), the questionnaire with added NER annotations.
-
annotation.mcsq_annotation.pos_tag_annotation(df, pos)
Iterates through the preprocessed spreadsheets, adding the POS tag annotation.
- Parameters
df (param1) – the dataframe that holds the preprocessed questionnaire.
pos (param2) – pretrained or in-house trained (CAT, POR, RUS) model provided by FlairNLP.
- Returns
df_tagged (pandas dataframe), the questionnaire with added POS tag annotations.
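The row-by-row annotation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names `text` and `pos_tagged_text`, the function name `pos_tag_annotation_sketch`, and the toy tagger are hypothetical and stand in for the real FlairNLP model and spreadsheet layout.

```python
import pandas as pd


def toy_pos_tagger(text):
    """Hypothetical stand-in for a real tagger: labels every token as 'X'."""
    return " ".join(f"{token}<X>" for token in text.split())


def pos_tag_annotation_sketch(df, pos):
    """Return a copy of df with an added 'pos_tagged_text' column.

    'text' and 'pos_tagged_text' are assumed column names, and `pos` is
    any callable that maps a string to its tagged form.
    """
    df_tagged = df.copy()
    tagged = []
    for _, row in df_tagged.iterrows():
        text = row["text"]
        # Leave empty or missing segments unannotated.
        if isinstance(text, str) and text.strip():
            tagged.append(pos(text))
        else:
            tagged.append(None)
    df_tagged["pos_tagged_text"] = tagged
    return df_tagged


df = pd.DataFrame({"text": ["How old are you?", ""]})
df_tagged = pos_tag_annotation_sketch(df, toy_pos_tagger)
```

Passing the tagger in as an argument mirrors the `(df, pos)` signature, so the same loop could serve any language once a suitable model is selected.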
-
annotation.mcsq_annotation.select_ner_model(language)
Selects the appropriate named entity recognition (NER) model based on the language. ENG, GER, FRE and SPA use pretrained models provided by Flair. CZE and RUS use a multilingual pretrained model provided by DeepPavlov. CAT, NOR and POR use pretrained models provided by spaCy.
- Parameters
language (param1) – 3-letter ISO language code.
- Returns
NER tagging model (spaCy or FlairNLP model).
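The language-to-provider mapping described above could be expressed as a lookup table. The sketch below is hypothetical: `NER_PROVIDER` and `ner_provider_for` are illustrative names, and the function returns only the provider name rather than loading a model, since the actual model identifiers are not given here.

```python
# Provider per language code, taken from the description above.
NER_PROVIDER = {
    "ENG": "flair", "GER": "flair", "FRE": "flair", "SPA": "flair",
    "CZE": "deeppavlov", "RUS": "deeppavlov",
    "CAT": "spacy", "NOR": "spacy", "POR": "spacy",
}


def ner_provider_for(language):
    """Hypothetical helper: return the provider name for a 3-letter code."""
    try:
        return NER_PROVIDER[language.upper()]
    except KeyError:
        # Fail loudly instead of silently falling back to a default model.
        raise ValueError(
            f"No NER model configured for language {language!r}"
        ) from None
```

Raising on an unknown code is a design choice made for this sketch; how the real function handles unsupported languages is not stated.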
-
annotation.mcsq_annotation.select_pos_model(language)
Selects the appropriate POS tagging model based on the language. ENG uses a pretrained model provided by Flair. NOR, SPA, GER, CZE and FRE use a multilingual pretrained model provided by Flair. CAT, RUS and POR use models trained in-house.
- Parameters
language (param1) – 3-letter ISO language code.
- Returns
Part-of-speech tagging model (PyTorch object).
-
annotation.ner_annotation_cze_rus.ner_annotation(df, ner)
Iterates through the preprocessed and POS tag annotated RUS and CZE spreadsheets, adding the NER annotation. POS tagging is done in the mcsq_annotation script. CZE and RUS use a multilingual pretrained model provided by DeepPavlov.
The Slavic-BERT-NER model from DeepPavlov depends on library versions that are incompatible with those required by the mcsq_annotation script, so this script should be run in a separate virtual environment.
- Parameters
df (param1) – the dataframe that holds the preprocessed and POS tag annotated questionnaire.
ner (param2) – pretrained NER model provided by DeepPavlov.
- Returns
df_tagged (pandas dataframe), the questionnaire with added NER annotations.
-
annotation.wis_annotated_text_to_alignment.add_annotation(df_source, df_target, df_alignment)
Adds NER/POS annotations to the alignment files by copying the annotations from the spreadsheets. Unlike the EVS, ESS and SHARE files, all the WIS files have 1-1 correspondences and come prealigned, so they do not have to go through the alignment algorithm.
- Parameters
df_source (param1) – the dataframe that holds the preprocessed annotated source questionnaire.
df_target (param2) – the dataframe that holds the preprocessed annotated target questionnaire.
df_alignment (param3) – the dataframe that holds the alignment questionnaire, without annotations.
- Returns
df_alignment (pandas dataframe) with added NER and POS annotations copied from df_source and df_target.
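Because the WIS files are prealigned 1-1, copying annotations amounts to a lookup by item identifier. The sketch below shows one way this could work; every column name (`survey_item_ID`, `source_survey_item_ID`, `pos_tagged_text`, and so on), the function name `add_annotation_sketch`, and the toy tag format are assumptions made for illustration, not the actual MCSQ schema. It also assumes item identifiers are unique within each questionnaire.

```python
import pandas as pd


def add_annotation_sketch(df_source, df_target, df_alignment):
    """Copy POS/NER annotations into a copy of the alignment dataframe.

    All column names used here are assumed, not taken from the real files.
    """
    out = df_alignment.copy()
    for side, df in (("source", df_source), ("target", df_target)):
        # Index the annotated questionnaire by its (assumed unique) item ID.
        lookup = df.set_index("survey_item_ID")
        for kind in ("pos", "ner"):
            out[f"{side}_{kind}_tagged_text"] = out[
                f"{side}_survey_item_ID"
            ].map(lookup[f"{kind}_tagged_text"])
    return out


# Toy data with a made-up tag format.
df_source = pd.DataFrame({
    "survey_item_ID": ["ENG_1"],
    "pos_tagged_text": ["Age<NOUN>"],
    "ner_tagged_text": ["Age<O>"],
})
df_target = pd.DataFrame({
    "survey_item_ID": ["GER_1"],
    "pos_tagged_text": ["Alter<NOUN>"],
    "ner_tagged_text": ["Alter<O>"],
})
df_alignment = pd.DataFrame({
    "source_survey_item_ID": ["ENG_1"],
    "target_survey_item_ID": ["GER_1"],
})
aligned = add_annotation_sketch(df_source, df_target, df_alignment)
```

Identifiers in the alignment that have no match in the annotated questionnaire end up with missing values in the copied columns.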