Data and Code to reproduce results in paper "Leveraging network topology for credit risk assessment in P2P lending: A comparative study under the lens of machine learning"

Liu, Yiting

doi:10.5281/zenodo.17990581

Published May 7, 2024 | Version v1

Computational notebook Open

1. University of Twente
2. BFH Bern University of Applied Sciences
3. Bern University of Applied Sciences

This repository contains the code needed to reproduce the results in the paper:

Yiting Liu, Lennart John Baals, Jörg Osterrieder, Branka Hadji-Misheva,
Leveraging network topology for credit risk assessment in P2P lending: A comparative study under the lens of machine learning,
Expert Systems with Applications,
Volume 252, Part B,
2024,
124100,
ISSN 0957-4174,
https://doi.org/10.1016/j.eswa.2024.124100.
(https://www.sciencedirect.com/science/article/pii/S0957417424009667)
Abstract: Peer-to-Peer (P2P) lending markets have witnessed remarkable growth, revolutionizing the way borrowers and lenders interact. Despite the increasing popularity of P2P lending, it poses significant challenges related to credit risk assessment and default prediction with meaningful implications for financial stability. Traditional credit risk models have been widely employed in the field of P2P lending; however, they may not be capable to capture latent factor information inherent to a loan network based on similarity distances. Thus, in this study we propose an enhanced two-step modeling approach for Machine Learning (ML) that utilizes insights from network analysis and subsequently combines derived network centrality metrics with traditional credit risk factors to improve the prediction accuracy in the credit default prediction process. Through a comparative analysis of three classical ML models with varying degrees of complexity, namely Elastic Net (EN), Random Forest (RF), and Multi-Layer Perceptron (MLP), we showcase novel evidence that the systematic inclusion of network topology features in the credit scoring process can significantly improve the prediction accuracy of the scoring models. Additional robustness tests via the inclusion of randomly shuffled centrality metrics in the analysis, and a further comparison of the graph-based models against a pertinent state-of-the-art credit scoring model in form of XGBoost, further confirm our results. The insights from this study bear valuable conclusions for P2P lending platforms to further improve their scoring systems with graph-enhanced metrics, thereby reducing default risk and facilitating greater access to credit.
Keywords: Peer-to-Peer-lending; Credit default prediction; Machine Learning; Network centrality
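
The two-step idea described in the abstract (derive centrality metrics from a similarity-based loan network, then combine them with the traditional features) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the toy feature vectors, the Euclidean distance, the `THRESHOLD` cut-off, the choice of degree centrality, and the helper names `degree_centrality` and `augment` are hypothetical and are not taken from the paper or the notebooks, which should be consulted for the actual procedure.

```python
from itertools import combinations
from math import dist

# Toy, made-up borrower feature vectors (hypothetical; not Bondora data).
borrowers = {
    "a": (0.10, 0.20),
    "b": (0.15, 0.25),
    "c": (0.90, 0.80),
    "d": (0.12, 0.18),
}

THRESHOLD = 0.2  # hypothetical distance cut-off for drawing an edge


def degree_centrality(points, threshold):
    """Step 1: link borrowers closer than `threshold`, then score each node."""
    degree = {name: 0 for name in points}
    for u, v in combinations(points, 2):
        if dist(points[u], points[v]) < threshold:
            degree[u] += 1
            degree[v] += 1
    n = len(points)
    # Normalise by the maximum possible number of neighbours.
    return {name: d / (n - 1) for name, d in degree.items()}


def augment(points, centrality):
    """Step 2: append the centrality score to each feature vector."""
    return {name: feats + (centrality[name],) for name, feats in points.items()}


centrality = degree_centrality(borrowers, THRESHOLD)
augmented = augment(borrowers, centrality)
```

In this toy example borrower "c" sits far from the others and ends up isolated, so its added feature is zero, while the three mutually similar borrowers receive the same non-zero score. The augmented vectors would then be passed to whichever classifier is being compared.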

Raw data:

The raw dataset was downloaded on April 22nd, 2022, as part of Bondora’s daily updated public report. Loan starting dates span from June 16th, 2009, to April 21st, 2022. The original dataset covers 231,039 individual borrowers characterized through 112 categorical and continuous variables. Among these loans, 79,424 have been recorded with delayed interest payments according to the platform, while 151,615 loans have no recorded delay on interest payments before the download date of the data. Specifically, the dataset details borrower demographics, financial attributes, and past credit market interactions.
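
As a quick sanity check on the figures above, the two loan counts should add up to the total, and the share of loans with a recorded delay follows directly. The variable names below are illustrative only; the counts are those quoted in the text.

```python
# Counts quoted in the raw-data description.
total_loans = 231_039
delayed = 79_424        # loans with recorded delayed interest payments
not_delayed = 151_615   # loans with no recorded delay

# The two groups should partition the dataset.
assert delayed + not_delayed == total_loans

delayed_share = delayed / total_loans
print(f"Share of loans with a recorded delay: {delayed_share:.1%}")  # 34.4%
```

Roughly one loan in three carries a recorded delay, so the two groups are noticeably imbalanced.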

The raw dataset cannot be made public due to the restrictions of the Bondora platform (https://bondora.com/en/terms/):
13.4 The Portal, Portal's website and the copyright of the contents thereof belong to the Company. The User does not have the right to save, copy, change, transfer, forward or disclose the pages of the Portal for a purpose other than personal use.

Data cleaning:

Bondora.R

1_data_washing.ipynb

These two files clean the data, as described in Section 4 of the paper.

Data metadata:

Description cleaned.xlsx

This file describes the meaning of features in the cleaned dataset.

The cleaned dataset, being second-hand data derived from the raw dataset, likewise cannot be made public due to the restrictions of the platform.

Modeling:

The remaining notebooks.

These notebooks generate the results presented in the paper.

Files (24.4 MB)

Name	MD5	Size
1_data_washing.ipynb	c232a80cc95dd5863af5d9b57ff90902	143.5 kB
2_descriptive_statistics.ipynb	577a74ae7dc0ca93d39968b117b8c995	53.8 kB
3_data_pre-processing_and_models_training.ipynb	2a694cbb3efda1165ef25d33b3c4ed45	380.5 kB
4_model_performance_analysis.ipynb	0768c8a86183c22709c788fbe9181355	6.2 MB
5_model_re-training_and_testing.ipynb	83652b4cd1379d7e34dc585039651ba4	2.3 MB
6_models_explanation_LIME.ipynb	36f5eebe16a4b74a35176407968d9fcb	14.9 MB
6_models_explanation_SHAP.ipynb	007aaf4d8c2755bb8c6f51f34268f48c	331.8 kB
Bondora.R	89d34a85ed3e6ae78f61b2d252889620	29.5 kB
Description cleaned.xlsx	7a79ddebd1f75985722deecd5fc88d76	24.6 kB
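
The MD5 digests listed with the files can be used to confirm that a download is intact. The following is a minimal sketch, assuming the files have been saved under their original names in a single directory; the helper names `md5_of` and `verify` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path

# MD5 digests as listed in the file table of this record.
EXPECTED_MD5 = {
    "1_data_washing.ipynb": "c232a80cc95dd5863af5d9b57ff90902",
    "2_descriptive_statistics.ipynb": "577a74ae7dc0ca93d39968b117b8c995",
    "3_data_pre-processing_and_models_training.ipynb": "2a694cbb3efda1165ef25d33b3c4ed45",
    "4_model_performance_analysis.ipynb": "0768c8a86183c22709c788fbe9181355",
    "5_model_re-training_and_testing.ipynb": "83652b4cd1379d7e34dc585039651ba4",
    "6_models_explanation_LIME.ipynb": "36f5eebe16a4b74a35176407968d9fcb",
    "6_models_explanation_SHAP.ipynb": "007aaf4d8c2755bb8c6f51f34268f48c",
    "Bondora.R": "89d34a85ed3e6ae78f61b2d252889620",
    "Description cleaned.xlsx": "7a79ddebd1f75985722deecd5fc88d76",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {filename: status}, where status is 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        else:
            results[name] = "ok" if md5_of(path) == expected else "mismatch"
    return results
```

Calling `verify("path/to/download")` returns one status per file; anything other than "ok" suggests the file should be downloaded again.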

Additional details

Swiss National Science Foundation
100019E_205487
European Cooperation in Science and Technology
COST Action CA19130
