Imputation techniques for the reconstruction of missing interconnected data from Higher Education Institutions

Educational institution data constitute the basis for several important analyses of educational systems; however, they often contain non-negligible shares of missing values, for several reasons. In this work we consider the relevant case of the European Tertiary Education Register (ETER), which describes the Higher Education Institutions of Europe. The presence of missing values prevents the full exploitation of this database, since several types of analyses that could be performed are currently impracticable. The imputation of artificial data, reconstructed with the aim of being statistically equivalent to the (unknown) missing data, would allow these problems to be overcome. A main complication in the imputation of this type of data is given by the correlations that exist among all the variables. We propose several imputation techniques designed to deal with the different types of missing values appearing in these interconnected data, and we use them to impute the database. Moreover, we evaluate the accuracy of the proposed approach by artificially introducing missing data, imputing them, and comparing imputed and original values. Results show that the information reconstruction does not introduce statistically significant changes in the data and that the imputed values are close enough to the original values.


Introduction
Organizations providing higher level education, such as traditional universities, universities of applied sciences, polytechnics, community colleges, liberal arts colleges, etc., are collectively called Higher Education Institutions (HEIs). The data describing each individual HEI (for instance, the number of students, the number of graduates, etc.) are called HEI microdata. Their availability is essential to support empirically oriented and robust policy making in the dynamic landscape of educational systems [1]. A pioneering research project called Aquameth [2] was the first attempt to gather microdata about European HEIs. Since this first experience, the presence of missing or noisy data and the lack of comparability among data appeared to be the most critical obstacles to an appropriate usage of the collected data [3]. After the Aquameth project, a large-scale study called EUMIDA was commissioned by the European Union from 2009 to 2011, and it showed the feasibility of a European-level data collection on individual HEIs [1]. Since then, the need to build a register of Higher Education Institutions in Europe has been repeatedly underlined.
Despite the effort put into the data collection, the current ETER database includes many scattered missing values, and this creates problems for the usage of the affected variables. In particular, it clearly does not allow the micro analysis of the institutions containing the missing values, and it also prevents the macro analysis (at the aggregate level) of many categories of institutions, whenever they include the incomplete ones. Thus, the main goal of this work is to propose a methodology to reconstruct those missing values for the key variables of the institutions.
Imputation techniques have been studied in many fields to tackle the widespread problem of missing data, see e.g. [4][5][6]. Consequently, many approaches to the problem have been proposed, based on several different techniques, including: k-nearest neighbors [7] or other forms of proximity search [8], fuzzy clustering [9,10], bagging algorithms [11], ensembles of neural networks [12], autoencoder neural networks [13], denoising autoencoders [14], other deep neural networks [15], Bayesian networks [16,17], integer linear programming [18], pattern sequence forecasting [19], use of a knowledge base [20], and similarity rules [21], each possibly combined or hybridized with additional techniques. Existing methods to deal with missing data can be organized according to different perspectives. For example, one may discriminate on the basis of the source of the imputed information, which can be: (a) the other variables of the same unit under imputation, taking advantage of some existing relations among the variables; (b) other units similar to the treated one, which are often called donors; (c) other sources, external to the dataset under imputation, but storing the same information, e.g. statistical registers. Donors are generally complete units [22,23], though even incomplete ones can be used if necessary [24], and their use often requires the solution of optimization problems [25]. Another distinction sometimes adopted is between statistical methods and machine learning approaches [8].
Currently, there is no evidence of a single ''best'' imputation method. Rather, it appears that the performance of each method depends in large measure on the characteristics of the dataset, as also suggested in [26]. The imputation of missing data in time series is a particularly difficult task [27], and many general techniques are not able to deal satisfactorily with this case. The subcase of multivariate time series remains among the most challenging tasks, as observed in [16,28].
Educational data have important peculiarities. To begin with, they contain multivariate time series (number of students, number of graduates, etc.). Furthermore, it can be observed that almost all data of an institution are interconnected: the number of graduates is not independent of the number of students, and the expenditure is not independent of the staff, to give just two simple examples. Thus, the imputation becomes particularly difficult: imputed values must belong to time series, and each of them may affect the picture of the whole institution. Therefore, imputation techniques for this type of data should be specifically designed. An approach to improve the data quality for the same ETER dataset, not dealing with missing data and so complementing the present work, has been developed in [29].
The proposed imputation methodology combines an interdisciplinary set of tools coming from information management, machine learning and statistics with an investigation of the relations existing among the variables and of the types of missing patterns actually contained in the data. The proposed methodology works at the formal level, with a data-driven approach, i.e., learning many aspects of the imputation techniques from the data themselves. Therefore, it could also be used to impute datasets with educational data from other sources, or even datasets with different meaning but sharing the features of time series and interconnection among the variables. We validate our approach by comparing the statistical features of the data before and after imputation. Moreover, we evaluate its reconstruction accuracy by means of the following experiments: (1) we take a set of complete records and artificially introduce missing values in them; (2) we use the proposed imputation techniques to impute those missing values; (3) we compare the imputed values with the original known values, and we study the occurrence of significant differences. The work is organized as follows. Section 2 explains the variables and the different types of missing values that appear in the current version of ETER. Sections 3 and 4 introduce the proposed imputation techniques, distinguishing between those based on functions of the available values of the same institution and those based on other institutions (donors). Section 5 shows the results of the application of the proposed methodology to reconstruct the selected ETER variables. Section 6 describes the tests of accuracy carried out to further evaluate the performance of the proposed methodology. Section 7 draws conclusions.

Distribution of the missing values
The ETER database is composed of 3208 units, each representing a single European HEI over a number of years. Each institution I_j is described by a number of values s^j_{vk}, where v ∈ V is the index of the variable (e.g., students), k ∈ T that of the year (e.g., 2011) and j ∈ J that of the institution (e.g., Sapienza University). The target of our imputation is constituted by the following 8 variables:
• Total number of students enrolled (called in ETER ''Total Students Enrolled at ISCED 5-7'').
• Total number of graduates (in ETER ''Tot. Stud. Grad. at ISCED 5-7'').
• Total number of Ph.D. students enrolled (ISCED 8).
• Total number of Ph.D. graduates (ISCED 8).
• Total academic staff (measured in ETER either in ''Full Time Equivalent - FTE'' or in ''Head Count - HC'').
• Total non-academic staff (technical and administrative staff, measured in ETER either in ''Full Time Equivalent - FTE'' or in ''Head Count - HC'').
• Total current expenditure (measured in ETER in Euro).
• Total current revenues (measured in ETER in Euro).
These variables are selected because they are very relevant in many types of analyses. Moreover, they are the main descriptors of the size of an institution. The importance of size in the characterization of the HEIs, and its connection to their overall performance, is well known in the specialized literature, see e.g. [30]. Each of the above variables takes a value for each year of the time horizon, which at present goes from 2011 to 2016 for the majority of the institutions (even if some may have fewer years, and some may also have 2017). To lighten the notation, when there is no ambiguity, we will denote the values of a time series simply by an abbreviated variable name and the position in the series: for example, the sequence of graduates of an institution AAA, {v^AAA_graduates2011, v^AAA_graduates2012, . . . , v^AAA_graduates2016}, may be simply denoted by {grad_1, grad_2, grad_3, grad_4, grad_5, grad_6}.
A specificity of these data, with respect to other imputation cases, is given by the relations connecting all the above values. Indeed, the different values of each single time series are obviously related. For example, the number of students enrolled in 2011 is necessarily related to the number of students enrolled in 2012, in 2013, and so on, and in most cases the time series exhibits a trend, in the sense that, if some values are for instance increasing, the next value can be expected to be increasing as well. However, the trend may also change direction during the time series. Moreover, all the above variables are positively correlated. For example, the number of graduates is generally a certain proportion of the number of students, because graduates were indeed students in the previous years; the staff tends to increase or decrease in some measure with the number of students, and so do the expenditure and the revenues. We will denote a missing value by ''⊔''. We are interested in distinguishing the following types of missing values:
• Isolated Internal Missing: occurs when v_i = ⊔ for some 2 ≤ i ≤ t − 1, with v_{i−1} ≠ ⊔ and v_{i+1} ≠ ⊔.
• Isolated Extreme Missing: occurs when v_i = ⊔ for i = 1 or i = t, with respectively v_2 ≠ ⊔ or v_{t−1} ≠ ⊔.
• Missing Sequence of length L: occurs when v_{i+1} = · · · = v_{i+L} = ⊔, with v_i ≠ ⊔ and v_{i+L+1} ≠ ⊔ (the missing sequence does not cover the whole time series).
• Full Sequence Missing: occurs when v_1 = · · · = v_t = ⊔ (the missing sequence covers the whole time series).
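The four categories above can be sketched in code as follows. This is an illustrative classifier, not part of the paper's methodology; the function name and return labels are ours, and None stands for ''⊔''.

```python
# Classify the missing value at 0-based position i of a time series,
# following the four categories defined above (None stands for "⊔").
def classify_missing(seq, i):
    t = len(seq)
    if seq[i] is not None:
        return "not_missing"
    if all(v is None for v in seq):
        return "full_sequence_missing"
    if 1 <= i <= t - 2 and seq[i - 1] is not None and seq[i + 1] is not None:
        return "isolated_internal"
    if (i == 0 and seq[1] is not None) or (i == t - 1 and seq[t - 2] is not None):
        return "isolated_extreme"
    # otherwise the value belongs to a missing sequence of length L < t
    return "part_of_missing_sequence"
```

For instance, `classify_missing([100, None, 120, 130, 140, 150], 1)` falls in the isolated internal case, while a record with all six values missing is a full sequence missing.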
The current situation of missing values is reported in Table 1.
More details on the types of missing data are in Tables 2-6 in Section 5. Note that the most common type of missing value is the full sequence missing.

Imputation based on the available values of the sequence
This Section describes a category of imputation techniques based on functions of the available values in the sequence under imputation (i.e., the values which are not missing), starting with the very simple approaches based on average and linear regression that will be used as building blocks for the proposed Trend Smoothing technique. We also discuss, for each imputation technique, the type of missing for which it should be used.

Imputation based on weighted average
Very intuitive imputation techniques are based on averages of the values already available in the sequence. This approach relies on the assumption that the values in the time series are related, and not independent [31]. In general, there exist several mathematical types of averages. In our case, it appears reasonable to assume that, generally speaking, the closer the data are on the time scale, the more related their values are. Therefore, we propose to impute a value v_i for an isolated internal missing by using a weighted arithmetic mean of the surrounding available values:

v_i = ( Σ_{h ∈ T′, h ≠ i} w_h v_h ) / ( Σ_{h ∈ T′, h ≠ i} w_h ),

where T′ ⊆ T is the set of the instants whose values are available.
The weights w_h should progressively decrease with the time distance between h and the instant i under imputation. We define the yearly decrement d as a positive value < 1 such that

w_h = d^{|i−h|−1} α,

where α is the weight assigned to the instants adjacent to i. For instance, if d = 0.5, i = 4 and we denote the value of w_3 by α, we have w_1 = (1/4)α, w_2 = (1/2)α, w_3 = α, w_5 = α, w_6 = (1/2)α, and so on. However, this condition alone is not enough to find the w_h, since the value of d must still be determined.
Thus, we propose to determine their values by using the following data-driven approach: for each value v_i of each institution j, with i ∈ T \ {1}, we compute the variation δ_{ij} with respect to the preceding value as

δ_{ij} = v_i / v_{i−1}.

The set of these variations can now be studied to find its average value δ̄, using the arithmetic mean, and two extreme values e_1, e_2 such that the interval [e_1, e_2] contains 90% of the variations.
Then, we search for the yearly decrement d that best fits the average variation δ̄. By defining the set D of the indices of the institutions containing isolated missing values, and by assuming, with a slight abuse of notation, that i always represents the index of that missing value, this is obtained by solving the following optimization problem:

min_{d ∈ (0,1)} Σ_{j ∈ D} ( v^j_i(d) − δ̄ v^j_{i−1} )²,

where v^j_i(d) is the value imputed by the weighted average, for which the dependence on d is made explicit, and δ̄ v^j_{i−1} is the value obtainable for v^j_i under the average variation δ̄. In general, no single d produces a perfect fit for all institutions; however, the minimization of the above squared sums provides the value of d that best distributes the differences.
The described Weighted Average Imputation appears feasible for imputing isolated missing values, in particular internal ones. It can be adapted to the extreme case by considering only one side of the former case; however, it may fail to capture the data trend when there is a distinct increase or decrease of the values over the years. In any case, the Weighted Average approach can be used as a building block to develop more sophisticated imputation techniques for missing sequences.
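A minimal sketch of the Weighted Average Imputation and of the data-driven choice of d is given below, assuming the geometric weight form w_h = d^{|i−h|−1} discussed above. The paper determines d by solving an optimization problem; the simple grid search used here is our illustrative substitute.

```python
# Weighted Average Imputation: impute position i of seq (None = missing value)
# with weights decaying geometrically with the distance from i.
def weighted_average_impute(seq, i, d):
    num = den = 0.0
    for h, v in enumerate(seq):
        if h != i and v is not None:
            w = d ** (abs(i - h) - 1)   # w_h = d^(|i-h|-1), up to the factor alpha
            num += w * v
            den += w
    return num / den

# Grid search for the yearly decrement d that best fits the average variation:
# cases is a list of (seq, i) pairs with seq[i] missing; the target for each
# case is mean_delta * seq[i-1], as in the squared-sum minimization above.
def fit_decrement(cases, mean_delta):
    best_d, best_err = None, float("inf")
    for k in range(1, 100):
        d = k / 100
        err = sum((weighted_average_impute(seq, i, d) - mean_delta * seq[i - 1]) ** 2
                  for seq, i in cases)
        if err < best_err:
            best_d, best_err = d, err
    return best_d
```

Note that the common factor α cancels in the weighted mean, so only the decrement d matters.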

Imputation based on linear regression
Another basic technique that can be used to reconstruct values in a time series is linear regression. The missing value v i is approximated with the value given by the straight line interpolating the available values v h , with h ∈ T , h ̸ = i. This approach relies on the assumption that the values in the time series are not only related but also subject to a temporal evolution, which often exhibits a trend. See [31] for further discussion of the field of application of this approach. This approach does not need to compute weights or other parameters. It appears feasible to impute isolated missing values, in particular extreme ones, because it is able to capture the data trend.
However, when there is a sharp increase or decrease in the available values, it may predict negative values, which are clearly infeasible. For example, if we have the following sequence ⊔, 100, 250, 410, 550, 690, the Linear Regression Imputation would provide -44 for the value of the first period. Clearly, negative values are not acceptable, and even replacing them with 0 would not be a good solution. In similar cases, we propose to smoothen the trend by computing a value v 1 ∈ (0, v 2 ). In particular, v 1 can be computed with the exponentiation operation as (v 2 ) c , with c ∈ (0, 1). For instance, using c = 0.5 gives the square root, which in the example above would produce v 1 = √ 100 = 10, which is a more reasonable value.
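The linear regression imputation with the exponentiation fallback can be sketched as follows; the fallback is shown for the extreme positions only, where the text applies it, and c = 0.5 (the square root) is used as in the example above.

```python
# Linear Regression Imputation for the missing position i of seq (None =
# missing): fit a least-squares line through the available (h, v_h) points
# and evaluate it at i; if the prediction is negative, smooth the trend by
# imputing (adjacent value) ** c instead (extreme positions only).
def linreg_impute(seq, i, c=0.5):
    pts = [(h, v) for h, v in enumerate(seq) if h != i and v is not None]
    n = len(pts)
    mx = sum(h for h, _ in pts) / n
    my = sum(v for _, v in pts) / n
    slope = (sum((h - mx) * (v - my) for h, v in pts)
             / sum((h - mx) ** 2 for h, _ in pts))
    pred = my + slope * (i - mx)
    if pred >= 0:
        return pred
    # negative prediction: exponentiation fallback on the adjacent value
    neighbor = seq[1] if i == 0 else seq[-2]
    return neighbor ** c
```

On the sequence ⊔, 100, 250, 410, 550, 690 of the text, the fitted line predicts −44 for the first period, so the fallback returns 100^0.5 = 10.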
To select the value of the exponent c, we propose to use again a data-driven approach. When focusing on the case of initial isolated missing, we define the set E of the institutions which would have a negative initial value v 1 if approximated with linear regression, and we compute a new average variationδ E limited to the institutions in E. Now, we search for the value of c that better fits this new average variationδ E by solving the following optimization problem, Specularly, when focusing on the case of final isolated missing, we define the set F of the institutions which would have a negative final value v t if approximated with linear regression, and we compute another average variationδ F limited to the institutions in F . In this case, we search for the value of c that better fits this new average variationδ F by solving the following optimization problem, represents the difference between the imputed value v j t (c) and the valueδ F v j t−1 , which is the value obtainable for v j t when assuming v j t (c) = (v j t−1 ) c and the average variationδ F .
The described Linear Regression Imputation, possibly integrating the described exponentiation technique, appears feasible to impute isolated missing values, in particular when they are extreme. Moreover, similarly to the approach described in the previous Section, it can be used as a building block to develop more sophisticated imputation techniques for missing sequences.

Trend smoothing imputation
To capture a trend in the series without being excessively (mis)led by it, we propose to combine the two simple approaches described above by means of the following technique. By denoting with WA_i the value given for instant i by the weighted average approach, and with LR_i the value given for instant i by the linear regression approach, the value imputed for v_i is obtained as a convex combination of the two:

v_i = a · LR_i + (1 − a) · WA_i, with a ∈ [0, 1].

Note that the value LR_i is intended to be already smoothened by means of the exponentiation operation explained above. Coefficient a should give more importance to the contribution of the linear regression when the slope of the interpolating straight line is nearly flat, and more importance to the contribution of the weighted average when that slope is too steep. This transition from one extreme to the other should occur without discontinuities. Therefore, denoting by m the angular coefficient of the interpolating straight line, by T′ ⊂ T the set of the instants whose values are available, and by s a constant value usually set at 2 that controls the steepness of the transition, we choose a as a smooth decreasing function of the normalized slope |m| / min_{h ∈ T′} v_h, equal to 1 when m = 0 and tending to 0 as the normalized slope grows. In other words, we follow the trend given by the available values when it is reasonably increasing or decreasing; we progressively consider the trend less reliable when the available values make it too steep. The absolute value of the angular coefficient is divided by the smallest available value in order to ''normalize'' it, because an increase, for example, from 1000 to 1500 is more reasonable than an increase from 1 to 501, although both produce the same value of m. The technique proposed in this Section is called Trend Smoothing Imputation, since it follows the trend but progressively smoothens it when needed, without discontinuities. This technique appears feasible for isolated missing values, and also for missing sequences of length L ≤ t − 2. In particular, we will use it for this last case in our experiments. Missing sequences of length t − 1 leave only one non-missing value, hence no trend exists; these cases are better assimilated to the full sequence missing.
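The combination can be sketched as below. The exact functional form of the coefficient a is not reproduced here; the version used in this sketch, a = 1 / (1 + s·|m| / min v_h), is only one possible smooth function with the stated behavior (a = 1 for a flat line, a → 0 as the normalized slope grows) and is our assumption.

```python
# Trend Smoothing combination of the two basic imputations. The form of the
# coefficient a below is an assumed example: a = 1 / (1 + s * |m| / min_v),
# which equals 1 when the slope m is 0 and tends to 0 for steep slopes.
def trend_smoothing_impute(wa_value, lr_value, slope, min_available, s=2.0):
    a = 1.0 / (1.0 + s * abs(slope) / min_available)
    return a * lr_value + (1.0 - a) * wa_value
```

With a flat interpolating line the result coincides with the linear regression value; with a very steep line it approaches the weighted average value.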
Finally, in the case of interconnected variables, for example the number of students and the number of graduates, we may compute an average ratio between the values available year by year for that couple of variables, and impose that the imputed values remain ''not too far'' from that ratio. This requirement can be practically implemented according to the peculiarities of the specific case.

Imputation techniques based on donors
The techniques described in the previous Sections basically rely on some type of prolongation of the values available in the sequence. When the full sequence of values is missing, or even when there is only one available value and the rest of the sequence is missing, those techniques are clearly inapplicable. However, unfortunately, the case of full sequence missing is the most common type of missing. Note, indeed, that a full sequence missing generally corresponds to 6 missing values with the current time horizon.
Therefore, it is important to deal with such a type of missing, even if this happens to be the most difficult case. Thus, we need to resort to a different imputation technique, which is also well established in the field [6]. This technique is called donor imputation, and is based on the search for a complete (and correct) record (i.e., institution) being as similar as possible to the incomplete record under imputation. Similarity is judged by defining a distance function among records, and thus minimum distance corresponds to maximum similarity. When this complete record at minimum distance is found, it is used as a donor: the missing values of the incomplete record are replaced by the corresponding values of the complete record. The incomplete record under imputation is also called the recipient. A possible variant is to search for a small set of records at minimum distance (e.g., 3) and not just one, and then select the donor randomly among them. For this reason, this technique is sometimes called k-nearest neighbor imputation. Of course, this approach is easily applied when the set of possible donors is large enough to have donors in the vicinity of each incomplete record that must be treated. A recognized statistical advantage of donor imputation with respect to other imputation techniques is that it does not create artificial and possibly unlikely values; rather, it takes values already appearing in the dataset, and they are selected with higher probability whenever they are more frequent, see also [18]. Hence, donor imputation tends to preserve the data properties and their frequency distributions without hypotheses on the distribution of the data, which may be questionable [12].
The similarity criterion is fundamental and must be defined for the specific case. In the case of educational institutions, we identify a number of variables under which institutions should be considered similar. They are essentially variables describing the size, the type and the geographical location of the institution. For each of these variables, a difference in the values may give a certain contribution to the total distance or directly lead to the exclusion of the record as a possible donor. In the specific case of the ETER database, containing European institutions, we set up the following distance calculation scheme.

1. Variable: ''Institution Category standardized'', whose value can be 1 = University, 2 = University of Applied Science, 0 = Other. For this variable we accept only donors from the same category.
2. Variable: ''Distance education institution'', which tells whether the institution is essentially telematic or traditional. A difference in this variable gives a contribution to the distance denoted by p_1, which should be set at a value corresponding to a strong penalization.
3. Variable: ''Institution Category'', which reports the category of the institution with more granularity than the former ''Institution Category standardized''. A difference in this variable gives again a contribution to the distance of p_1.
4. Variables: ''Total Current expenditure'', ''Total Current revenues'', ''Total academic staff'' (which can be measured either in Full Time Equivalent or in Headcount). These variables basically describe the size of the institution. For these numerical variables, each difference between two values v′ and v″ gives a contribution to the distance computed as p_1 · |v′ − v″| / max(v′, v″). In other words, the contribution is between 0 and p_1, depending on the absolute difference between the two values normalized by dividing it by the larger value. Hence, the maximum contribution to the total distance given by these three variables is 3p_1. Size indicators are very important for the selection of a donor; unfortunately, these variables are often missing in the ETER database, so their use is limited in practice.
5. Variable: ''Country'', reporting the country of the institution. We define similarity according to geographical areas. The same country gives a contribution of 0 to the distance; different countries in the same area give a contribution denoted by p_2, which should be set at a value providing a light penalization; different areas give a contribution of p_1 (strong penalization).
6. Variable: ''Legal status'', reporting whether the institution is public or private. A difference in this variable gives a contribution of p_2 to the distance.
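The distance calculation scheme above can be sketched as follows. The field names and the numerical values of the penalties p_1 and p_2 are placeholders of ours, not taken from the paper.

```python
# Illustrative recipient-donor distance following the six-point scheme above.
# Dictionary keys and the values of P1, P2 are placeholders (assumptions).
P1, P2 = 10.0, 1.0   # strong / light penalization

def donor_distance(recipient, donor):
    # 1. Institution Category standardized: donors from another category excluded.
    if recipient["category_std"] != donor["category_std"]:
        return float("inf")
    dist = 0.0
    # 2.-3. Distance education flag and fine-grained category: strong penalty p1.
    if recipient["distance_edu"] != donor["distance_edu"]:
        dist += P1
    if recipient["category"] != donor["category"]:
        dist += P1
    # 4. Size variables: contribution p1 * |v' - v''| / max(v', v''), when available.
    for var in ("expenditure", "revenues", "academic_staff"):
        v1, v2 = recipient.get(var), donor.get(var)
        if v1 is not None and v2 is not None:
            dist += P1 * abs(v1 - v2) / max(v1, v2)
    # 5. Country: 0 if equal, p2 within the same area, p1 across areas.
    if recipient["country"] != donor["country"]:
        dist += P2 if recipient["area"] == donor["area"] else P1
    # 6. Legal status: light penalty p2.
    if recipient["legal_status"] != donor["legal_status"]:
        dist += P2
    return dist
```

The candidate donor at minimum distance (or one chosen among the k nearest) is then used to fill the recipient's missing values.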
Unfortunately, in the case of ETER, the number of possible donors is not large enough to guarantee the presence of suitable donors for each HEI. In particular, some types of institutions or some countries have very few complete institutions, and even exploiting the incomplete institutions, used as sets of partial donors for the same recipient, the problem remains unsolved. Hence, to avoid using donors too different from the recipient, which would produce unacceptable data, we need to impose filters on acceptable donors, i.e., criteria to recognize and exclude unacceptable donors. For this reason, for some of the incomplete HEIs it is not possible to obtain donors, and they remain unimputed. A first filter for donors is implemented by comparing the average values of donor and recipient on a number of size-related variables, such as: Total Current Expenditure; Total Current Revenues; Total Students Enrolled; Total Graduates; Total Academic Staff. When the two corresponding average values of donor and recipient are both available, and their relative difference is larger than a threshold (e.g., 30%), the donor is considered not acceptable. This filter also uses any values of the variable under imputation that may be available in the recipient (e.g., when the recipient has only 1 value and the rest of the sequence is missing), again by comparing the average values of donor and recipient.
Moreover, some HEIs may be unsuitable to donate because, even if they are complete, their values are too uncommon, and it is not advisable to replicate them. We detect some of them by computing, for each institution, the ratios between all pairs of variables (e.g., graduates/students, Ph.D. students/graduates, expenditure/students, etc.) and by excluding those having extreme values of the ratios (e.g., top and bottom 2%). Some others are detected by computing the volatility of the sequences, and by excluding those having too volatile values (e.g., top 2%).
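These exclusion filters can be sketched as follows. The paper does not specify the volatility measure; the coefficient of variation used below is our assumption, and the percentile bounds are supposed to be precomputed on the whole set of candidates.

```python
# Sketch of the "too uncommon to donate" filters. The ratio bounds and the
# volatility threshold are assumed to be precomputed (e.g., 2% tails over
# all candidate donors); the volatility measure below (coefficient of
# variation of the sequence) is an assumption of ours.
def volatility(seq):
    m = sum(seq) / len(seq)
    return (sum((v - m) ** 2 for v in seq) / len(seq)) ** 0.5 / m

def acceptable_donor(ratio, ratio_lo, ratio_hi, vol, vol_hi):
    # exclude candidates with extreme variable ratios or too volatile sequences
    return ratio_lo <= ratio <= ratio_hi and vol <= vol_hi
```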
Another technique which we introduce to deal with the relative scarcity of donors is scaling. When the recipient has a missing sequence containing one non-missing value, for instance v 2 , and the sequence of the donor's values is (w 1 , w 2 , . . . , w t ), we learn from v 2 and w 2 a size ratio r = v 2 /w 2 between the two institutions, and scale the donor's values by making the recipient sequence become (rw 1 , v 2 , rw 3 , . . . , rw t ). On the other hand, when the recipient has a full sequence missing, scaling can be done by using another recipient's variable strongly related to the variable under imputation. For example, if the recipient has values (s 1 , s 2 , . . . , s t ) for students and all missing for graduates, and the donor has (t 1 , t 2 , . . . , t t ) for students and (g 1 , g 2 , . . . , g t ) for graduates, we learn the sequence of size ratios (r 1 = s 1 /t 1 , r 2 = s 2 /t 2 , . . . , r t = s t /t t ), and scale the donor's values by making the recipient graduates become (r 1 g 1 , r 2 g 2 , . . . , r t g t ). This allows the imputation of values suitable for the recipient's size even if that differs from the donor's size.
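Both scaling cases described above can be sketched in a few lines. Function names are ours; None again stands for a missing value.

```python
# Case 1: the recipient has exactly one known value v_k; rescale the donor
# sequence by r = v_k / w_k, keeping the recipient's known value in place.
def scale_donor_single(recipient, donor):
    k = next(i for i, v in enumerate(recipient) if v is not None)
    r = recipient[k] / donor[k]
    return [recipient[i] if i == k else r * w for i, w in enumerate(donor)]

# Case 2: the recipient's sequence is fully missing; learn year-by-year size
# ratios from a strongly related variable (e.g., students) available in both
# records, and apply them to the donor's sequence under imputation (graduates).
def scale_donor_related(recipient_students, donor_students, donor_graduates):
    ratios = [s / t for s, t in zip(recipient_students, donor_students)]
    return [r * g for r, g in zip(ratios, donor_graduates)]
```

For example, a recipient knowing only v_2 = 50 against a donor with w_2 = 100 yields r = 0.5, so every donated value is halved before imputation.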
Furthermore, similarly to the case of Trend Smoothing Imputation, for the couples of variables which are practically linked, such as the number of students and the number of graduates, we use the trend of the variable available in the recipient to improve the selection of the donor. For example, if we are imputing the whole sequence of the number of students and the number of graduates is available in the recipient and is increasing, we accept only donors with an increasing number of graduates or, if that is missing in the donor, with an increasing number of students. On the other hand, when the recipient misses both the sequence of students and that of graduates, the imputations of the two variables should either be done by the same donor or by two partial donors with sequences compatible in size, ratio and trend. Finally, we set a limit on the number of times an institution can be used as a donor, to avoid any risk of replicating the same values too often.

Results on the real missing values of ETER database
We report in this section the results of the imputation of all the ETER missing values. Since these are the real missing values, the original values (i.e., those lost due to the missing) are not available for a direct check. While the techniques described in Sections 3.1, 3.2, 3.3 always provide a value, the donor imputation described in Section 4 may leave institutions unimputed when no acceptable donor is available for them. More specifically, Tables 2-6 describe the types of missing appearing in the database before and after the imputation operations, respectively for: Total Students and Total Graduates; Total Ph.D. Students and Total Ph.D. Graduates; Total Academic staff FTE and HC; Total Non-academic staff FTE and HC; Total Expenditure and Total Revenues. All the implementation codes of our methodology are available in [32].
As observable, the percentage of institutions without missing values goes from often lower than 50% before imputation to generally over 90% after imputation. The non-imputed cases are all due to the unavailability of acceptable donors for some institutions; these are often short sequences, because those institutions often cover less than 6 years in their time horizon. Clearly, they could be imputed by further relaxing the filters used on the donors, but the quality of the imputed values would worsen. Thus, we prefer this compromise between coverage and quality of the imputation. The results of our methodology on the ETER database are available in [33].
Since the original values are not available, the quality of the imputation cannot be evaluated by matching original and imputed values. Hence, we evaluate it by comparing, for the dataset before imputation and after imputation:
• the frequency distribution of each variable;
• the ratios between selected couples of variables; this type of measure is particularly informative for interconnected variables.
We do this by means of the so-called violin plots juxtaposing the frequency distributions of the variables in Figs. 1-10, and the box plots juxtaposing the statistical description of the ratios in Figs. 11-13. We consider the following ratios informative: Graduates/Students; Ph.D. Graduates/Ph.D. Students; Expenditure/Revenues; Expenditure/All Students; Revenues/All Students; Academic Staff FTE/All Students; Academic Staff HC/All Students; Non-academic Staff FTE/All Students; Non-academic Staff HC/All Students. With ''All Students'' we denote the sum of Students and Ph.D. Students; this quantity has been introduced to evaluate fairly the institutions focused on producing Ph.D. graduates, for which a ratio over Students alone would be misleading.
Note that we chose to compare the ratios of the data before imputation to those of the imputed data only (and not to those of all data after imputation), because with the second choice possible differences would be too ''diluted'' to be evident. Note, moreover, that the missing values are not equally distributed over the institutions, but are sometimes more concentrated on small institutions, especially for expenditure and revenues. Thus, when the small institutions are imputed, small values appear more frequently in the distribution, and this behavior is correct. On the contrary, imputing the small institutions with values similar to those of the larger institutions would not be correct. Therefore, the ''ideal'' imputation does not necessarily correspond to the exact replication of the data distribution in the left side of the violin plots.
From these analyses, we observe that: (1) the imputation considerably increased the coverage of the database; some missing values are still present, due to the relative scarcity of donors, but the usability has greatly improved; (2) the distributions of the data have been generally preserved; in some cases, mainly Expenditure and Revenues (Figs. 9 and 10), the number of small values correctly increased, because the missing values were localized mainly on small institutions; (3) the ratios of the imputed data show that they maintain very well the relations between the interconnected variables, so the data quality appears very satisfactory.
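As a minimal sketch of the ratio comparison described above, the following code computes the distribution of one such ratio while skipping institutions with missing entries (the arrays, the helper name, and the nan-masking convention are illustrative, not the paper's actual pipeline):

```python
import numpy as np

def ratio_distribution(values_num, values_den):
    """Elementwise ratios between two aligned arrays for one year,
    dropping pairs where either value is missing (np.nan) or the
    denominator is zero. Hypothetical helper, for illustration only."""
    num = np.asarray(values_num, dtype=float)
    den = np.asarray(values_den, dtype=float)
    mask = ~np.isnan(num) & ~np.isnan(den) & (den != 0)
    return num[mask] / den[mask]

# e.g. Graduates / All Students, where All Students = Students + Ph.D. Students
students = np.array([1000.0, 5000.0, np.nan, 800.0])
phd = np.array([50.0, 400.0, 20.0, np.nan])
graduates = np.array([220.0, 1100.0, 30.0, 190.0])

all_students = students + phd   # nan propagates, so incomplete rows drop out
r = ratio_distribution(graduates, all_students)
```

The resulting array can then be fed to a violin or box plot for the before/after comparison.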

Performance and accuracy evaluation
To evaluate the accuracy of the proposed methodology, we set up the following experiments using 4 particularly relevant variables: Total Students, Total Graduates, Total Academic Staff FTE, Total Expenditure.

1. We identify a dataset composed of the institutions having all the values for the variable in analysis, and artificially introduce in this variable random missing values according to the perturbation schemes described below.
2. For each type of such missing values, we impute the values using the technique developed for that type of missing value among the proposed ones.
3. We compare the imputed values with the original values, which in these experiments are known, and we study the occurrence of significant differences.
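The three steps above can be sketched as a single experiment loop; the `perturb` and `impute` callables stand in for the concrete techniques of the paper, and the toy implementations below (blank one position, impute with the mean of the rest) are purely illustrative:

```python
def evaluation_experiment(complete_records, perturb, impute):
    """Sketch of the accuracy experiment (hypothetical API):
    1. perturb(record) returns the record with artificial missing
       values and the list of positions that were blanked;
    2. impute(record) fills the missing positions;
    3. imputed values are compared with the known originals."""
    differences = []
    for original in complete_records:
        perturbed, blanked = perturb(original)
        imputed = impute(perturbed)
        for t in blanked:
            # relative difference between imputed and original value
            differences.append((imputed[t] - original[t]) / original[t])
    return differences

# toy check: blank position 1, impute it with the mean of the other two
recs = [[10.0, 12.0, 14.0]]
perturb = lambda r: ([v if i != 1 else None for i, v in enumerate(r)], [1])
impute = lambda r: [v if v is not None
                    else sum(x for x in r if x is not None) / 2
                    for v in r]
diffs = evaluation_experiment(recs, perturb, impute)
```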

Perturbation scheme
We use the following perturbation scheme. For each of the 4 variables in analysis, we perturb the dataset in 3 alternative manners: in a first case we introduce isolated missing values, in a second case missing sequences of length L, and in a third case fully (or almost fully) missing sequences, as explained below.
• Perturbation 1. Each record is perturbed with one isolated missing value randomly located over the time horizon T .
• Perturbation 2. Each record is perturbed with one missing sequence randomly located over the horizon T . The length of the sequence is 2 with probability 0.5, 3 with probability 0.35, and 4 with probability 0.15.
• Perturbation 3. The dataset is randomly partitioned into two parts, taking care of keeping, in each partition, representatives of all countries and types of institutions. This is obtained by splitting each country and type of institution independently. Then, one partition is perturbed, and the other is used as the set of possible donors. Hence, the set of possible donors does not include any of the institutions that undergo perturbation. The perturbation introduces full sequence missing with probability 0.5, and a missing sequence of length 5 (i.e., only 1 value is available in the time horizon), randomly located over the time horizon, with probability 0.5.
[Fig. 11. Box plot comparing the ratios reported near each plot of the data before imputation to those of imputed data only. All Students means Students + Ph.D. Students.]
Therefore, in the first two tests every record is perturbed and undergoes imputation, while in the third test this happens for half of the records. Note that such a situation is considerably worse than the standard situation in real databases. Therefore, the results on the occurrences of significant differences can be seen as a worst-case bound over the results of a real application.
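Perturbation 2 can be sketched as follows; the time horizon T = 6 is inferred from the remark that a length-5 sequence leaves only 1 value available, and the function name and interface are illustrative:

```python
import random

def perturbation_2(record, T=6, rng=random.Random(0)):
    """Sketch of Perturbation 2: blank one missing sequence of length
    2 (p = 0.5), 3 (p = 0.35) or 4 (p = 0.15), randomly located over
    the time horizon [0, T). Hypothetical interface."""
    L = rng.choices([2, 3, 4], weights=[0.5, 0.35, 0.15])[0]
    start = rng.randrange(T - L + 1)          # sequence fits inside [0, T)
    blanked = list(range(start, start + L))
    perturbed = [None if t in blanked else v for t, v in enumerate(record)]
    return perturbed, blanked

series = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
perturbed, blanked = perturbation_2(series)
```

Perturbations 1 and 3 follow the same pattern, with a single blanked position and with full-sequence blanking on one partition, respectively.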
For the first perturbation case, we apply the Weighted Average imputation technique described in Section 3.1 to the internal isolated missing values, and Linear Regression imputation, integrated with the exponentiation technique described in Section 3.2, to the external isolated missing values. Results for this case are reported in Table 7 and Fig. 14. For the second perturbation case, we apply the Trend Smoothing imputation described in Section 3.3, and results are reported in Table 8 and Fig. 15. For the third perturbation case, we apply Donor Imputation with the distance function and the filter of admissible donors described in Section 4. Distance parameters p1 and p2 are set at 3 and 1, respectively. Note that, in this experiment, each institution received a donor, so no missing values remain after the imputation. Results for this third case are reported in Table 9 and Fig. 16.
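For an internal isolated missing value, weighted-average imputation amounts to a weighted mean of the adjacent available values. The sketch below uses equal weights for illustration; the paper's Section 3.1 technique may calibrate them differently:

```python
def weighted_average_impute(series, t, w_prev=0.5, w_next=0.5):
    """Hypothetical sketch of weighted-average imputation for an
    internal isolated missing value at position t: a weighted mean
    of the two adjacent available values. The weights here are
    illustrative defaults, not the paper's calibrated ones."""
    prev_v, next_v = series[t - 1], series[t + 1]
    return w_prev * prev_v + w_next * next_v

# a short time series with one internal isolated missing value
series = [100.0, None, 120.0]
series[1] = weighted_average_impute(series, 1)
```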

Evaluation of the reconstruction accuracy
This Section aims at evaluating the accuracy in the reconstruction of the imputed values, i.e., how similar the imputed value is to the original value. For each imputed value v_ij of institution j at time i, with i ∈ T, we compute the relative difference η_ij with respect to the original value v⋆_ij:

η_ij = (v_ij − v⋆_ij) / v⋆_ij

An ''ideal'' imputation would provide very ''small'' values for η_ij, with 0 being the limit. However, 0 is not a realistic target, and we need a scale to determine which values are actually ''small''.
To do so, we again use a data-driven measure: we consider the two extreme values e1 and e2 defining the so-called Interval of Moderate Variations (IMV), containing 90% of the values of the variations δ_ij, as explained in Section 3.1. We then study the frequency of the imputations whose corresponding η_ij lies within or outside the IMV = [e1, e2], and we call ''significant'' a difference for which η_ij ∉ [e1, e2]. Moreover, to consider also a data-independent measure, we study the frequency of the imputations whose corresponding η_ij lies within or outside a Fixed Interval (FI) defined as FI = [−10%, 10%], which is more restrictive than the previous interval in the analyzed cases. The outcome of these analyses is reported in Tables 7-9.
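The two significance counts can be computed as in the following sketch; the IMV bounds below are placeholders, whereas in the paper they are the empirical bounds containing 90% of the observed variations:

```python
def significance_rates(etas, imv=(-0.25, 0.25), fi=(-0.10, 0.10)):
    """Fraction of relative differences falling outside the
    data-driven IMV and outside the fixed interval FI = [-10%, 10%].
    The default IMV bounds are illustrative placeholders."""
    out_imv = sum(1 for e in etas if not imv[0] <= e <= imv[1])
    out_fi = sum(1 for e in etas if not fi[0] <= e <= fi[1])
    n = len(etas)
    return out_imv / n, out_fi / n

# four illustrative relative differences: one outside both intervals,
# one outside FI only
rates = significance_rates([0.02, -0.08, 0.15, -0.30])
```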
As a general observation, we note that the imputed values lie quite near to the original ones in the majority of cases, both in the data-driven and in the absolute measurements. Since the range of possible values is very wide, this is a very positive result.
Moreover, even when the imputed values are not so near, we hypothesize that the positive and negative errors should statistically compensate each other. To test this hypothesis, we consider ''global'' descriptors of the data, providing in Figs. 14-16 the box plots for: (1) the original data suppressed by the perturbation; and (2) the new data imputed by our methodology. Note that this is a very selective analysis, aimed at magnifying possible differences in the data values; it is possible only in this case because we actually know the original data suppressed in the perturbation. As observable, our hypothesis appears to be confirmed, since the imputed data appear statistically equivalent to the suppressed data to a considerable extent.

Conclusions
Data describing the situation of Educational Institutions are currently used for a wide variety of analyses, even to support important economic and political decisions. However, such data often contain non-negligible shares of missing values, also due to the structure of the gathering process, and this may invalidate the above operations. The missing information should therefore be optimally reconstructed, by imputing data as similar as possible to the unknown original ones. This is a very difficult task. We develop imputation techniques for the reconstruction of partial numerical sequences, based on the combination of weighted average and linear regression, and techniques for the reconstruction of full numerical sequences, based on the use of donors. We search for data-driven optimal solutions, in the sense that we aim at maximizing the conservation of the global data features.
Experiments on real-world data containing real missing values confirm that the imputation process is practically feasible and very useful. Experiments on real-world data artificially perturbed by inserting several types of missing values show that the reconstructed data are satisfactorily similar to the original data. One additional important advantage of the proposed procedure is that it works at the formal level, with a data-driven approach. Hence, it could be adapted to impute different datasets containing educational data from other origins, or even datasets with different meaning but sharing the feature of interconnection among the variables. Future work includes the integration of the proposed methodology with the web scraping techniques described in [34,35], using the Universities' websites, or with the logic-based techniques described in [36,37] to extract data-supported logic descriptions of the Institutions.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.