Propensity Score Matching (PSM) Python-based code

Pariente, Emilio

doi:10.5281/zenodo.15030430

Published March 15, 2025 | Version v2.0

Software Open

Pariente, Emilio (Project leader)^{1, 2, 3}

1. University of Cantabria
2. Servicio Cántabro de Salud
3. Instituto de Investigación Marqués de Valdecilla (IDIVAL)

OVERVIEW / FINAL SUMMARY

This repository provides four variants of a free, Python-based code for performing propensity score (PS) matching. It is an initiative of the Camargo Cohort Study (Cantabria, Spain), developed to share the tool and encourage wider use of PS matching.

The code overcomes compatibility issues with R versions and R packages, and implements (i) logistic regression to compute PS, (ii) 1:N matching using the K-nearest neighbour (KNN) algorithm with a customisable caliper, (iii) sampling with or without replacement, (iv) visualisations to assess matching quality and (v) statistics to evaluate the balance.
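The matching step described above can be illustrated with a minimal sketch. It is not the repository's implementation: it assumes the PS values have already been estimated, replaces the KNN routine with a plain greedy loop, and applies the caliper directly on the PS scale. The function name `match_nearest` and the unit identifiers are hypothetical.

```python
def match_nearest(treated_ps, control_ps, ratio=1, caliper=0.20, replacement=False):
    """Greedy 1:N nearest-neighbour matching on the propensity score.

    treated_ps, control_ps: dicts mapping a unit id to its propensity score.
    Returns a dict mapping each matched treated id to a list of control ids.
    Assumption: the caliper is an absolute distance on the PS scale.
    """
    available = dict(control_ps)
    matches = {}
    for t_id, t_score in treated_ps.items():
        # Controls inside the caliper, ordered from closest to farthest.
        candidates = sorted(
            (abs(t_score - c_score), c_id)
            for c_id, c_score in available.items()
            if abs(t_score - c_score) <= caliper
        )
        chosen = [c_id for _, c_id in candidates[:ratio]]
        if chosen:
            matches[t_id] = chosen
        if not replacement:
            # Without replacement, a control can be used only once.
            for c_id in chosen:
                del available[c_id]
    return matches


treated = {"T1": 0.62, "T2": 0.60, "T3": 0.95}
controls = {"C1": 0.61, "C2": 0.55, "C3": 0.30}

pairs = match_nearest(treated, controls, ratio=1, caliper=0.10)
pairs_with_replacement = match_nearest(
    treated, controls, ratio=1, caliper=0.10, replacement=True
)
print(pairs)
print(pairs_with_replacement)
```

In this toy example T3 has no control inside the caliper and is left unmatched, and sampling with replacement lets C1 serve as the match for both T1 and T2.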

Outputs:

Matched pairs stored as a '.csv' file, allowing a Cox regression (COXREG) to be performed ('SET' in SPSS).
Diagnostic plots stored in the specified output folder, providing a view of SMD and PS distribution.
Statistics for matching validation: SMD, variance ratio (VR), McFadden's pseudo-R², and L1 multivariate imbalance.
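As a rough illustration of the first two statistics listed above, the sketch below computes the SMD and the variance ratio for one continuous covariate. It assumes the common definition of the SMD with the average of the two group variances in the denominator; the repository's scripts may use a different variant. The function names and the age values are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def smd(treated, control):
    """Standardized mean difference: mean gap divided by the pooled SD,
    where the pooled SD is the square root of the averaged group variances."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of sample variances (treated over control); 1.0 means equal spread."""
    return variance(treated) / variance(control)


# Hypothetical ages in a treated and a control group.
age_treated = [60.0, 62.0, 64.0, 66.0, 68.0]
age_control = [58.0, 60.0, 62.0, 64.0, 66.0]

age_smd = smd(age_treated, age_control)
age_vr = variance_ratio(age_treated, age_control)
print(f"SMD = {age_smd:.3f}, VR = {age_vr:.2f}")
```

Running the same two functions on the covariates before and after matching gives the before/after comparison that the diagnostic plots summarise.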

The code has been developed using information from the Matplotlib, NumPy and Seaborn libraries, with support and refinements from OpenAI's ChatGPT.

No funding was received for conducting this work and there are no financial or non-financial interests to disclose.

Methodological Note (updated perspective)

Subsequent applications of this PSM framework in longitudinal cardiometabolic–osteogenic research have provided additional insight into one of its most informative features. Beyond achieving balance in the covariates explicitly included in the propensity score model, the matching procedure frequently led to a meaningful reduction in the standardized mean differences of variables not directly specified in the PS equation.

This pattern, initially observed during balance diagnostics, is unlikely to be incidental. Rather, it suggests that the propensity score may capture underlying latent correlation structures embedded within interconnected cardiometabolic domains. In such contexts, balancing upstream determinants can indirectly equilibrate associated biological or structural variables that share common pathways or covariance patterns.

From a methodological standpoint, this phenomenon supports the view that well-specified PS models do not merely equalise isolated predictors but may approximate broader susceptibility architectures when the included covariates represent stable upstream determinants. This observation, while dependent on context and model specification, reinforces the interpretative value of comprehensive balance assessment beyond the primary covariates.

Example of balance dynamics before and after matching. Notably, the reduction in standardized mean differences extends beyond covariates explicitly included in the PS model, illustrating indirect equilibrium of correlated domains.
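The indirect effect described in this note can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions, not the analysis behind the figure: the data are synthetic, the PS is a known function of a single covariate `x1` rather than a fitted logistic model, and the caliper of 0.05 on the PS scale is an arbitrary choice. The covariate `x2` is correlated with `x1` but plays no part in the PS.

```python
import random
from math import exp, sqrt
from statistics import mean, variance

random.seed(0)  # fixed seed so the sketch is reproducible


def smd(a, b):
    """Standardized mean difference with the averaged-variance denominator."""
    return (mean(a) - mean(b)) / sqrt((variance(a) + variance(b)) / 2)


# Synthetic units: x1 drives treatment; x2 is correlated with x1 but is
# deliberately left out of the propensity score.
units = []
for _ in range(1500):
    x1 = random.gauss(0, 1)
    x2 = 0.8 * x1 + 0.6 * random.gauss(0, 1)
    ps = 1 / (1 + exp(-(x1 - 1.5)))  # known PS, a function of x1 only
    is_treated = random.random() < ps
    units.append((is_treated, ps, x2))

treated = [u for u in units if u[0]]
controls = [u for u in units if not u[0]]
smd_before = smd([u[2] for u in treated], [u[2] for u in controls])

# Greedy 1:1 matching without replacement, caliper of 0.05 on the PS scale.
available = list(controls)
matched_t, matched_c = [], []
for t in treated:
    if not available:
        break
    best = min(available, key=lambda c: abs(c[1] - t[1]))
    if abs(best[1] - t[1]) <= 0.05:
        matched_t.append(t)
        matched_c.append(best)
        available.remove(best)

smd_after = smd([u[2] for u in matched_t], [u[2] for u in matched_c])
print(f"SMD of x2 before matching: {smd_before:.3f}")
print(f"SMD of x2 after matching:  {smd_after:.3f}")
```

Because matching on the PS balances `x1`, and `x2` inherits most of its group difference from `x1`, the SMD of `x2` shrinks even though `x2` never enters the PS. The effect depends on the strength of that correlation, which is consistent with the context dependence noted above.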

Other

CODE	REPLACEMENT	CUSTOMISABLE RATIO AND CALIPER	MATCHED PAIRS	PSM ASSESSMENT
PS matching code 1	Without	Ratio: line 73; Caliper: line 84	.csv file	SMD (barplot and lineplot) (.png)
PS matching code 2	Without	Ratio: line 88; Caliper: line 89	.csv file	SMD, VR and pseudo-R² (.csv, .txt)
PS matching code 3	Without	Ratio: line 163; Caliper: line 168	.csv file	Lineplot with improvements (.png); Balance report (SMD, VR, pseudo-R² and L1 imbalance) (.docx)
PS matching code 4	With	Ratio: line 89; Caliper: line 100	.csv file	SMD (barplot and lineplot) (.png)
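For readers unfamiliar with the last two statistics in the balance report, the following sketch shows how they are commonly defined. It is a hedged illustration rather than the code used in the scripts: McFadden's pseudo-R² is computed as one minus the ratio of the model log-likelihood to the intercept-only log-likelihood, and the L1 imbalance as half the summed absolute differences between the treated and control relative frequencies over the cells of a multivariate cross-tabulation. The covariates are assumed to be already coarsened into categories, and all names and values are hypothetical.

```python
from collections import Counter
from math import log


def mcfadden_r2(y, p_hat):
    """McFadden's pseudo-R2: 1 - logL(model) / logL(intercept-only model).

    y: 0/1 treatment indicators; p_hat: fitted probabilities of treatment.
    """
    ll_model = sum(log(p) if t else log(1 - p) for t, p in zip(y, p_hat))
    p0 = sum(y) / len(y)  # the intercept-only model predicts the treated share
    ll_null = sum(log(p0) if t else log(1 - p0) for t in y)
    return 1 - ll_model / ll_null


def l1_imbalance(treated_rows, control_rows):
    """L1 imbalance: 0.5 * sum over cells of |treated share - control share|.

    Each row is a tuple of already-coarsened covariate values (one cell).
    0 means identical multivariate distributions, 1 means no overlap.
    """
    f = Counter(map(tuple, treated_rows))
    g = Counter(map(tuple, control_rows))
    n_t, n_c = len(treated_rows), len(control_rows)
    cells = set(f) | set(g)
    return 0.5 * sum(abs(f[c] / n_t - g[c] / n_c) for c in cells)


# Toy inputs (hypothetical).
y = [1, 1, 0, 0]
p_hat = [0.8, 0.7, 0.3, 0.2]
r2 = mcfadden_r2(y, p_hat)

treated_rows = [("old", "smoker"), ("old", "non-smoker"), ("young", "non-smoker")]
control_rows = [("old", "smoker"), ("young", "non-smoker"), ("young", "non-smoker")]
l1 = l1_imbalance(treated_rows, control_rows)

print(f"McFadden pseudo-R2 = {r2:.4f}, L1 = {l1:.4f}")
```

When the PS model is refitted in a matched sample, a pseudo-R² close to zero is usually read as a sign that the covariates no longer predict treatment, and a lower L1 after matching indicates better multivariate balance.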

Notes

Comparison between Python-based code and PSM performed by SPSS (based on R packages)

The code was tested against a PSM run in SPSS with R-based packages (Propensity Score Matching for SPSS v1.0, by Thoemmes F). Both methods were applied to the same dataset with the same settings: 5 covariates to estimate the PS, caliper = 0.20, 1:1 ratio and sampling without replacement. We observed marked discrepancies in the PS values and in the composition of the matched sample. These differences were probably due to several factors (PS estimation, optimisation algorithms, caliper application, among others), reflecting the different behaviour of the Python libraries (matplotlib, sklearn) and the R-based packages (MatchIt, RItools, cem).

However, as shown in the file Comparison_Rpackage_and_Python.docx, the SMD values obtained with the two methods were virtually identical. Given that the SMD is the most widely recognized statistic for balance assessment, this result supports our approach and indicates that the Python implementation is reliable.

Notes

Final comments

Given the growing use of PSM and the known compatibility issues between versions of SPSS, R and the R packages on which PSM relies, the primary objective of this initiative was to develop a Python-based script that could be implemented regardless of the version of SPSS and R. The tool should be complete, well-validated and easy to implement, with the intention of making it available to clinicians and researchers.

Secondary objectives were to produce a matched sample of identified pairs and a well-structured balance report. PSM for SPSS v1.0, the only version we were able to get running, provides a matched sample, but the pairs are not identified, and this information is crucial for running a COXREG. As for the balance report, it encompasses the recommended statistics, with the Overall balance discarded because of the lack of broad consensus on it, and we consider this an achievement.
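To make the idea of identified pairs concrete, the sketch below writes a matched sample in long format, with one row per unit and a shared pair identifier that a stratified or conditional Cox model can use as the matched-set variable. The column names (`pair_id`, `unit_id`, `treated`) and the input dictionary are assumptions for illustration and may not match the layout of the '.csv' files produced by the scripts.

```python
import csv
import io


def pairs_to_rows(matches):
    """Turn {treated_id: [control_ids]} into long-format rows with a pair id."""
    rows = []
    for pair_id, (t_id, c_ids) in enumerate(matches.items(), start=1):
        rows.append({"pair_id": pair_id, "unit_id": t_id, "treated": 1})
        for c_id in c_ids:
            rows.append({"pair_id": pair_id, "unit_id": c_id, "treated": 0})
    return rows


# Hypothetical output of a 1:1 matching step.
matches = {"T1": ["C1"], "T2": ["C2"]}
rows = pairs_to_rows(matches)

# Written to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pair_id", "unit_id", "treated"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With 1:N matching the same function yields one treated row and N control rows per `pair_id`, so each identifier denotes a matched set rather than a strict pair.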

Finally, there was an unexpected finding: the colour assignment in the lineplot, based on whether a covariate is included in the PS, showed that PSM can also indirectly reduce the SMD of covariates not explicitly included in the PS model, likely because of underlying correlations or associations.

Files

Files (637.6 kB)

Name	MD5	Size
Balance report.png	639de8079996405e2be0ad0386d68b2a	25.5 kB
Comparison_Rpackage_and_Python.docx	8441fbe8e7fca32b1a0b65d7519170f1	254.0 kB
PS matching_code 1.py	25c29a4ed1f97164ab6865eaa80df1fe	10.4 kB
PS matching_code 2.py	5436ae051adcf047b4309bd09e37084f	7.7 kB
PS matching_code 3.py	478d4a7a27e12647081563fe596ffa53	10.2 kB
PS matching_code 4.py	8c1af0c66f3d8b954af24e5a1fe51827	10.8 kB
PS_density.png	982012e6073b69e9e64d02a06fbd209b	60.2 kB
PS_frequency.png	4132dcd992a4e6a273ce3662a9a4297d	47.2 kB
SMD_barplot.png	52b1125472fe0760979c7efe0ccea770	75.8 kB
SMD_lineplot.png	88295d3cb5ce0b60e853d61ccfe6442f	135.8 kB

Additional details

Alternative title: SUMMARY

Updated: 2025-03-03

Python-based code for implementing PSM

Repository URL: https://github.com/epsar-co/Propensity-Score-Matching-Python-based-code.git
Programming language: Python

References

Staffa SJ, Zurakowski D. Five Steps to Successfully Implement and Evaluate Propensity Score Matching in Clinical Research Studies. Anesth Analg. 2018;127:1066-1073. doi: 10.1213/ANE.0000000000002787.
Thoemmes F. Propensity score matching in SPSS. 2012. Available at: https://arxiv.org/pdf/1201.6385.
Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25:1-21. doi: 10.1214/09-STS313.
Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol. 2006;163:1149-1156. doi: 10.1093/aje/kwj149.
Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behav Res. 2011;46:399-424. doi: 10.1080/00273171.2011.568786.
Zhang Z, Kim HJ, Lonjon G, Zhu Y; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Balance diagnostics after propensity score matching. Ann Transl Med. 2019;7:16. doi: 10.21037/atm.2018.12.10.
