Published July 26, 2025 | Version v1
Dataset Open

Student Performance and Learning Behavior Dataset for Educational Analytics

  • 1. Mohammed V University

Description

The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.

The dataset covers the following categories of variables:

  • Study behaviors and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions

  • Resource access and learning environment: Resources, Internet, EduTech

  • Motivation and psychological factors: Motivation, StressLevel

  • Demographic information: Gender, Age (ranging from 18 to 30 years)

  • Learning preference classification: LearningStyle

  • Academic performance indicators: ExamScore, FinalGrade

In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.
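
The split between performance indicators and profile variables described above could be sketched as follows. This is only an illustration, not the notebook's code: it assumes the CSV headers match the attribute names listed in this description exactly, and `COLUMN_GROUPS` and `split_features_targets` are hypothetical names introduced here.

```python
import pandas as pd

# Column groups as listed in the description (header spelling is an assumption).
COLUMN_GROUPS = {
    "study_behavior": ["StudyHours", "Attendance", "Extracurricular",
                       "AssignmentCompletion", "OnlineCourses", "Discussions"],
    "resources": ["Resources", "Internet", "EduTech"],
    "motivation": ["Motivation", "StressLevel"],
    "demographics": ["Gender", "Age"],
    "learning_style": ["LearningStyle"],
    "performance": ["ExamScore", "FinalGrade"],
}
ALL_COLUMNS = [col for cols in COLUMN_GROUPS.values() for col in cols]


def split_features_targets(df: pd.DataFrame):
    """Check that the expected columns exist, then separate the two
    performance indicators from the remaining profile variables."""
    missing = sorted(set(ALL_COLUMNS) - set(df.columns))
    if missing:
        raise ValueError(f"missing columns: {missing}")
    targets = df[COLUMN_GROUPS["performance"]]
    features = df.drop(columns=COLUMN_GROUPS["performance"])
    return features, targets


# Usage, assuming merged_dataset.csv sits in the working directory:
# df = pd.read_csv("merged_dataset.csv")
# features, targets = split_features_targets(df)
```

The helper raises a `ValueError` when a header is missing or spelled differently, so a mismatch between this description and the file shows up immediately.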

 

The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:

  1. Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.

  2. Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).

  3. Data Preprocessing

    • Encoding categorical variables using LabelEncoder.

    • Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).

    • Detecting and removing duplicates.

  4. Clustering Analysis

    • Applying K-Means clustering to segment learners into distinct profiles.

    • Determining the optimal number of clusters using the Elbow Method and Silhouette Score.

    • Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).

  5. Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.

  6. Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.

  7. Statistical Analysis – Using ANOVA and regression to test whether academic performance differs significantly between clusters.

  8. Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
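
Steps 3 and 4 above (duplicate removal, label encoding, Min–Max scaling, then K-Means with a scan over candidate cluster counts) might look roughly like the sketch below. It is not taken from Project.ipynb: the helper names `preprocess` and `scan_k`, the range of k values, and the small synthetic frame standing in for the real data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler


def preprocess(df: pd.DataFrame) -> np.ndarray:
    """Drop duplicate rows, label-encode non-numeric columns,
    and Min-Max scale every column to [0, 1]."""
    df = df.drop_duplicates().copy()
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    return MinMaxScaler().fit_transform(df)


def scan_k(X: np.ndarray, k_values=range(2, 7), seed: int = 0) -> pd.DataFrame:
    """Fit K-Means for each k and record inertia (for an elbow plot),
    silhouette score, and Davies-Bouldin index."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "inertia": model.inertia_,
            "silhouette": silhouette_score(X, model.labels_),
            "davies_bouldin": davies_bouldin_score(X, model.labels_),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in data (NOT the real dataset): three separated groups.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 60.0], [10.0, 75.0], [20.0, 95.0]])
points = np.vstack(
    [c + rng.normal(scale=[0.5, 1.5], size=(50, 2)) for c in centers]
)
demo = pd.DataFrame(points, columns=["StudyHours", "Attendance"])
demo["Internet"] = np.repeat(["No", "Yes", "Yes"], 50)
# Append one exact duplicate so the de-duplication step has something to do.
demo = pd.concat([demo, demo.iloc[[0]]], ignore_index=True)

X = preprocess(demo)
report = scan_k(X)
best_k = int(report.loc[report["silhouette"].idxmax(), "k"])
print(report.round(3))
print("k with the highest silhouette score:", best_k)
```

A higher silhouette score and a lower Davies–Bouldin index both indicate better-separated clusters. The inertia column is what an elbow plot would display.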

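Steps 5 and 7 above (PCA on z-scored features, then a one-way ANOVA of performance across clusters) can be sketched in the same spirit. Again, this is an assumption-laden illustration rather than the notebook's implementation: `project_2d` and `anova_by_cluster` are hypothetical helpers, and the data are randomly generated, not drawn from merged_dataset.csv.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_2d(features: pd.DataFrame):
    """Z-score the features, then project them onto the first two
    principal components. Also returns per-feature loadings."""
    Z = StandardScaler().fit_transform(features)
    pca = PCA(n_components=2).fit(Z)
    loadings = pd.DataFrame(
        pca.components_.T, index=features.columns, columns=["PC1", "PC2"]
    )
    return pca.transform(Z), pca.explained_variance_ratio_, loadings


def anova_by_cluster(scores: pd.Series, labels: pd.Series):
    """One-way ANOVA: does the mean score differ between clusters?"""
    groups = [group.to_numpy() for _, group in scores.groupby(labels)]
    return f_oneway(*groups)


# Synthetic stand-in data (NOT the real dataset): 3 clusters of 40 students.
rng = np.random.default_rng(1)
labels = pd.Series(np.repeat([0, 1, 2], 40), name="cluster")
features = pd.DataFrame({
    "StudyHours": rng.normal(10, 3, 120) + 4 * labels,
    "Attendance": rng.normal(80, 8, 120),
    "Motivation": rng.normal(5, 1, 120),
})
exam = pd.Series(rng.normal(55 + 15 * labels.to_numpy(), 5), name="ExamScore")

coords, ratio, loadings = project_2d(features)
result = anova_by_cluster(exam, labels)
print("explained variance ratio:", ratio.round(3))
print(loadings.round(3))
print(f"F = {result.statistic:.1f}, p = {result.pvalue:.3g}")
```

The loadings table gives a rough view of which features drive each component, which is one way to read "feature importance exploration" with PCA. A small p-value from `f_oneway` only says that at least one cluster mean differs. It does not say which one.
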
Files

merged_dataset.csv

Files (2.4 MB)

  • md5:f4e2d66446de33c9591292a90e6a7423 (539.6 kB)
  • md5:5a9e4e66505f111b1367d2917827f82a (1.9 MB)
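
The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below uses only the Python standard library. Since this listing does not show which checksum belongs to which file name, the hypothetical helper `matches_listing` simply checks membership in the set of both values.

```python
import hashlib

# Checksums copied from the file listing in this record.
LISTED_MD5 = {
    "f4e2d66446de33c9591292a90e6a7423",
    "5a9e4e66505f111b1367d2917827f82a",
}


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path) -> bool:
    """True if the file's MD5 equals one of the listed checksums."""
    return md5_of(path) in LISTED_MD5


# Usage, assuming the file was downloaded to the working directory:
# print(matches_listing("merged_dataset.csv"))
```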

Additional details

Dates

Available
2025-07-26