
Published June 16, 2019 | Version 1.0
Report Open

Predictive Models of Student Performance for Data-Driven Learning Analytics

  • 1. Indiana University

Description

Analytic tools are useful for detecting patterns in education data and providing insights about student performance and learning. This study compared six supervised learning algorithms (linear regression, ridge regression, the lasso, regression trees, random forest regression, and gradient boosted regression) and identified features important for predicting student performance. The dataset consisted of N=1044 observations from two secondary schools in Portugal (UCI Machine Learning Repository; Cortez & Silva, 2008). Performance was assessed by final grades (range: 0-20) in two courses, mathematics and Portuguese. The models were fit to training data with 27 independent variables and evaluated on a testing subset. Overall, students performed worse in mathematics than in Portuguese. The models selected a similar set of variables as important for predicting performance: mother's education level, student plans for higher education, and weekly study time were positively related to predicted performance, whereas course subject, school educational support, and romantic relationships were associated with decreased student performance. The models differed in the number, weighting, order, and importance given to predictor variables. Linear regression provided a model with 13 predictors. Ridge regression shrank the coefficient estimates toward zero; the lasso performed variable selection for a model with 20 predictors. There was a tradeoff between model complexity and interpretability. The single pruned regression tree provided a simple, interpretable non-linear model with four features. Random forest regression and gradient boosting reduced overfitting, but were more difficult to interpret. Advantages and limitations of the different models are discussed. Applications for educational data mining (EDM) and learning analytics (LA) are considered.
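The difference between ridge shrinkage and lasso variable selection noted in the description can be illustrated with a minimal Python sketch. It is not the report's code (the report's models were built in R). It assumes an orthonormal design matrix and one common scaling of the penalties, under which both estimators have simple closed forms. The coefficient values and the penalty `lam` are hypothetical and are not taken from the report.

```python
# Illustrative sketch only: ridge vs. lasso under the simplifying assumption
# of an orthonormal design matrix (X^T X = I). Under that assumption, with
# objectives 0.5*||y - Xb||^2 + 0.5*lam*||b||^2 (ridge) and
# 0.5*||y - Xb||^2 + lam*||b||_1 (lasso), each penalized coefficient is a
# simple function of the ordinary least squares coefficient.

def ridge_shrink(beta_ols, lam):
    """Scale every coefficient by 1 / (1 + lam); none becomes exactly zero."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_shrink(beta_ols, lam):
    """Soft-threshold every coefficient at lam; small ones become exactly zero."""
    shrunk = []
    for b in beta_ols:
        magnitude = abs(b) - lam
        if magnitude <= 0.0:
            shrunk.append(0.0)  # predictor dropped from the model
        else:
            shrunk.append(magnitude if b > 0 else -magnitude)
    return shrunk


# Hypothetical least squares coefficients and penalty (not from the report).
beta_ols = [2.0, -0.5, 0.1, 1.2]
lam = 0.5

ridge = ridge_shrink(beta_ols, lam)
lasso = lasso_shrink(beta_ols, lam)

print("ridge:", ridge)
print("lasso:", lasso)
print("predictors kept by lasso:", sum(1 for b in lasso if b != 0.0))
```

In this toy setting ridge keeps every predictor with a smaller coefficient, while the lasso sets the weakest coefficients to exactly zero, which is why it yields a model with fewer predictors.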

Notes

This report was completed as an independent research project. Data cleaning and preparation were performed in an interactive Python notebook. All models were constructed in R using RStudio. Thanks to Michael Smith, Director of Analytics at ICF, for providing an introduction to Learning Analytics. Elizabeth Whynott and Douglas Wilbur provided helpful comments.

Files

Shiverick-LA-models.pdf

Files (1.1 MB)

md5:1d43e4a11d390c9422b9252bc779ae16 (961.1 kB)
md5:6b183a02e866b398d6b96ccba896036d (112.4 kB)