Published October 11, 2020 | Version 1.0
Dataset Open

Study of terminological subsystems of modern school textbooks in Russian with the help of word embedding models Word2Vec and neural networks

Description

The aim of the project is to analyse the inventory and functioning of scientific terms and special lexemes in textbooks for secondary schools of the Russian Federation using modern methods of natural language processing and deep learning. The number of terms from different fields of knowledge that a pupil is expected to learn during secondary school has never been evaluated. According to preliminary estimates based on the Model Basic Curriculum for General and Secondary Education (2015), the subject "Russian language" alone presupposes that a pupil finishing the 11th grade should be able to understand, recognise and use about 1000 terms and terminological combinations. Taking into account the number of school subjects, the total number of special vocabulary units studied in general education schools therefore runs into the thousands. At the same time, the inventory and functioning of terms in textbooks for different school subjects have not been compared, and the relationship between the terminological density of textbooks for a given subject and the place that subject occupies in the curriculum is unclear. The traditional way of compiling lists of scientific terms is to glean them from specialised texts and write them down manually. Although this method is reliable in that selection rests on expert judgement, it cannot be applied to large data sets and reflects neither the frequency of use of terms, nor the specifics of their syntagmatic connections, nor the systemic relationships between terms.
The current project aims to fill this gap by 1) creating a full-text corpus of school textbooks for grades 5–11 included in the Federal List compiled by the Ministry of Education, 2) automatically extracting, stratifying, and mapping terms with the help of distributional semantics algorithms, and 3) creating and training a deep neural network capable of predicting the subject, level of education and educational topic given a group of vector representations of terms as input. The results of the research can be of fundamental interest for the development of terminology science and also have practical applications in the creation of different types of educational literature.
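
The notions of term frequency and terminological density mentioned in the description can be made concrete with a small illustration. The following is a minimal sketch, not the project's method: it assumes single-word terms, matches surface forms only (a real pipeline for Russian would need lemmatisation and multi-word term handling), and the function name `term_stats`, the sample sentence and the term list are all hypothetical.

```python
import re
from collections import Counter


def term_stats(text, terms):
    """Return (density, counts) for single-word terms found in text.

    density = number of token occurrences that are terms / total tokens.
    Assumption: exact, case-insensitive surface matching, no lemmatisation.
    """
    tokens = re.findall(r"\w+", text.lower())
    term_set = {t.lower() for t in terms}
    counts = Counter(tok for tok in tokens if tok in term_set)
    density = sum(counts.values()) / len(tokens) if tokens else 0.0
    return density, counts


# Toy input, invented purely for illustration.
sample = "The noun is a part of speech. A verb is another part of speech."
density, counts = term_stats(sample, ["noun", "verb", "speech"])
print(f"density={density:.3f}", dict(counts))
```

Comparing such density values across textbooks for different subjects is one simple way to approach the comparison that the description says has not yet been made.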

Funding: The reported study was funded by RFBR, project number 19-29-14032

Files

Textbooks_RFFI_1_zenodo.zip

Files (2.1 GB)

md5:6ea8415e3eb088fcbaff22d6fe75251b
Size: 2.1 GB
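
Because the archive is large, it may be worth checking the download against the md5 value listed above before unpacking it. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are illustrative, and the local path is an assumption about where the file was saved.

```python
import hashlib

# Checksum as published on this record page.
EXPECTED_MD5 = "6ea8415e3eb088fcbaff22d6fe75251b"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at path matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Usage (assumes the archive sits in the current directory):
# print(verify_download("Textbooks_RFFI_1_zenodo.zip"))
```

Reading in chunks keeps memory use constant, which matters for a 2.1 GB file.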