Propensity Score Matching Python-based code

Pariente, Emilio

doi:10.5281/zenodo.14750770

Published January 27, 2025 | Version v8

Software Open

Pariente, Emilio (Project leader)^{1, 2, 3}

1. University of Cantabria
2. Servicio Cántabro de Salud
3. Instituto de Investigación Marqués de Valdecilla (IDIVAL)

This repository provides four versions of free, Python-based code for performing propensity score (PS) matching. It is an initiative of the Camargo Cohort Study, developed to share the tool and promote the use of PS matching.

The code overcomes compatibility issues with R versions and R packages, and implements (i) logistic regression to compute PS, (ii) 1:N matching using the K-nearest neighbour (KNN) algorithm with a customisable caliper, (iii) sampling with or without replacement, and (iv) visualisations to assess matching quality.
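As a rough illustration of steps (ii) and (iii), the following is a minimal sketch of greedy 1:N nearest-neighbour matching on the propensity score. It is not the repository's implementation: it assumes the PS has already been estimated, treats the caliper as a maximum absolute difference on the PS scale, and uses a simple greedy search rather than a KNN library. The function name `match_nearest` and its parameters are hypothetical.

```python
def match_nearest(ps_treated, ps_control, ratio=1, caliper=0.2, replacement=False):
    """Greedy 1:N nearest-neighbour matching on the propensity score (sketch).

    ps_treated / ps_control: dicts mapping a unit id to its propensity score.
    caliper: assumed here to be a maximum absolute PS difference.
    Returns a dict mapping each matched treated id to a list of control ids.
    """
    available = dict(ps_control)  # controls still eligible for matching
    matches = {}
    for t_id, t_ps in ps_treated.items():
        # Rank eligible controls by their distance to this treated unit.
        ranked = sorted(available, key=lambda c: abs(available[c] - t_ps))
        # Keep up to `ratio` controls that fall inside the caliper.
        chosen = [c for c in ranked if abs(available[c] - t_ps) <= caliper][:ratio]
        if chosen:
            matches[t_id] = chosen
        if not replacement:
            # Without replacement, a control can be used only once.
            for c in chosen:
                del available[c]
    return matches


treated = {"t1": 0.30, "t2": 0.32, "t3": 0.90}
control = {"c1": 0.31, "c2": 0.35, "c3": 0.60}
print(match_nearest(treated, control, ratio=1, caliper=0.05))
# {'t1': ['c1'], 't2': ['c2']}
```

In this sketch, treated units with no control inside the caliper (here `t3`) are left unmatched, and with `replacement=True` the same control may be assigned to several treated units.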

Outputs:

- Matched pairs, stored as a '.csv' file, which allows a Cox regression to be performed ('SET' in SPSS).
- Diagnostic plots, stored in the specified output folder, showing the standardised mean difference (SMD) and PS distributions.
- Statistics for matching validation: SMD, variance ratio (VR), and McFadden's pseudo-R^2.
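To make the matched-pairs output more concrete, here is a hedged sketch of one possible '.csv' layout in long format, with one row per unit and a shared identifier per matched set. It assumes that 'SET' denotes such a matched-set identifier; the column names `unit_id` and `treated`, and the function `write_matched_sets`, are hypothetical and may differ from the files the scripts actually produce.

```python
import csv
import io


def write_matched_sets(matches, handle):
    """Write matched sets in long format: one row per unit, one SET id per set."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SET", "unit_id", "treated"])
    for set_id, (t_id, controls) in enumerate(matches.items(), start=1):
        writer.writerow([set_id, t_id, 1])  # the treated unit of this set
        for c_id in controls:
            writer.writerow([set_id, c_id, 0])  # its matched controls


# Example with an in-memory buffer; a real file handle works the same way.
buffer = io.StringIO()
write_matched_sets({"t1": ["c1", "c2"], "t2": ["c3"]}, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A set identifier of this kind is what a stratified analysis needs in order to treat each matched set as a stratum.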

The code has been developed using information from the Matplotlib, NumPy and Seaborn libraries, with support and refinements from OpenAI's ChatGPT.

No funding was received for conducting this work and there are no financial or non-financial interests to disclose.

Notes

The code has been tested and works with datasets in SPSS v25, v28 and v29 (run via 'open script'), under the following versions:

- Python v3.10 and v3.11
- R v4.3.0 and v4.4.0, with the 'reticulate' package v1.39 and v1.40

The code tolerates missing values reasonably well, but they should be reduced as much as possible beforehand.

Usage
Adapt the code to your own research:
- Replace C:\PATH_TO_YOUR_DATASET.sav with the path to your dataset
- Replace COVS with your covariates (variable names, not labels)
- Choose the ratio (1:1, 1:2...) and the caliper
- Choose bar colours and adjust the limits of the x-axis and y-axis to the desired range
- Replace C:\PATH_TO_YOUR_FOLDER with the path to your output folder
Run the script [RStudio, SPSS (File / Open script)...]
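
The settings listed above might be gathered in one place, as in the sketch below. Only `COVS` and the two placeholder paths come from the description; the names `DATASET`, `OUTPUT_FOLDER`, `RATIO`, `CALIPER` and `check_settings`, as well as the example covariate names, are hypothetical and are not taken from the scripts.

```python
from pathlib import PureWindowsPath

# Placeholders to be replaced with your own paths.
DATASET = PureWindowsPath(r"C:\PATH_TO_YOUR_DATASET.sav")
OUTPUT_FOLDER = PureWindowsPath(r"C:\PATH_TO_YOUR_FOLDER")

# Variable names (not labels) of the covariates; these three are made up.
COVS = ["age", "sex", "bmi"]

RATIO = 1      # 1 for 1:1 matching, 2 for 1:2, and so on
CALIPER = 0.2  # maximum tolerated PS distance


def check_settings(covs, ratio, caliper):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    if not covs:
        problems.append("COVS is empty")
    if len(set(covs)) != len(covs):
        problems.append("COVS contains duplicate names")
    if not isinstance(ratio, int) or ratio < 1:
        problems.append("the ratio must be a positive integer")
    if caliper <= 0:
        problems.append("the caliper must be positive")
    return problems


print(check_settings(COVS, RATIO, CALIPER))
```

Checking the settings before running the matching is a design choice of this sketch; it simply surfaces typing mistakes early.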

All four versions perform PS matching and store the matched pairs. Features of each version:

* Code 1: Sampling without replacement. Five plots showing SMD and PS distributions.

* Code 2: Sampling with replacement. Five plots.

* Code 3: Sampling without replacement. In addition, a line plot shows the SMD before and after matching, with colours indicating whether each covariate is included in the PS model. The plot illustrates that PSM can also indirectly reduce the SMD of covariates not explicitly included in the PS model, due to underlying correlations or associations.

* Code 4: Sampling without replacement. Focused on matching validation, it stores 3 statistics:

- SMD (covariates included in the PS model): the target is an absolute post-matching SMD < 0.1

- VR (covariates included in the PS model): a post-matching VR close to 1

- McFadden's pseudo-R^2: a post-matching value close to zero, indicating that the covariates included in the PS model no longer explain the variability of the dependent variable (DV)
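
The three statistics above can be sketched with the standard library alone. This is an illustration of common textbook definitions, not the code in PSM4: the SMD uses the pooled standard deviation of the two groups, the VR is the ratio of sample variances, and McFadden's pseudo-R^2 is computed as 1 - LL_model / LL_null from fitted probabilities that are assumed to come from a PS model refitted on the matched sample. All function names are hypothetical.

```python
import math
from statistics import mean, variance


def smd(treated, control):
    """Standardised mean difference, using the pooled SD of both groups."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of the sample variances (treated over control)."""
    return variance(treated) / variance(control)


def mcfadden_r2(y, p):
    """McFadden's pseudo-R^2: 1 - LL_model / LL_null.

    y: 0/1 group indicators; p: fitted probabilities of y == 1.
    The null model predicts the overall proportion of y == 1 for every unit.
    """
    ll_model = sum(math.log(pi if yi else 1 - pi) for yi, pi in zip(y, p))
    p0 = mean(y)
    ll_null = sum(math.log(p0 if yi else 1 - p0) for yi in y)
    return 1 - ll_model / ll_null


# A covariate with equal spread but shifted means in the two groups.
print(smd([1, 2, 3], [2, 3, 4]))             # -1.0
print(variance_ratio([1, 2, 3], [2, 3, 4]))  # 1.0
```

When the fitted probabilities are no better than the overall proportion, for example `mcfadden_r2([1, 0, 1, 0], [0.5] * 4)`, the pseudo-R^2 is zero, which is the situation the post-matching criterion aims for.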

Notes

The code was also tested by comparing its results with those of a PSM run in SPSS using R packages (Propensity Score Matching for SPSS v1.0, by Thoemmes F). We applied the same settings to both methods: 5 covariates included in the PS, caliper = 0.20, ratio 1:1, and sampling without replacement. We observed marked discrepancies in the PS values and in the composition of the matched sample. However, the post-matching balance met the standard thresholds with both methods. The differences are probably due to several factors (PS estimation, optimisation algorithms, caliper application...), reflecting the different behaviour of the Python libraries (matplotlib, sklearn) and the R-based packages (MatchIt, RItools, cem). In our opinion, from a practical standpoint, a method can be considered acceptable if the balance after matching meets the key criterion of an absolute SMD < 0.1 (10%) in the covariates. This indicates a good PSM model, regardless of the PS values or the composition of the matched pairs.

Files

Files (226.6 kB)

Name	MD5	Size
Figure_1.png	ef269c09c403a30538e2f9cc821c6570	46.2 kB
PSM1_'Without_Replacement'.py	48263ac797d6ea828b92b51e4dd80729	10.8 kB
PSM2_'With_Replacement'.py	d871cfd950226f0c758d2157a119dec7	11.3 kB
PSM3_'LINEPLOT'.py	a7a4280f8e69a9b0b149c4cdf60a2fb2	8.3 kB
PSM4_'SMD, VR and SeudoR^2'.py	4a7ec2ec49d25b0960c29736badee9ce	7.5 kB
SMD_barplot.png	0ad0391808cf4fb1070857ee12787671	63.9 kB
SMD_lineplot.png	1dfed3ff74f225bf1e1d22541f499f42	78.5 kB

Additional details

Updated: 2025-01-27

Python-based code for implementing PSM

Repository URL: https://github.com/epsar-co/Propensity-Score-Matching-Python-based-code.git
Programming language: Python

Staffa SJ, Zurakowski D. Five Steps to Successfully Implement and Evaluate Propensity Score Matching in Clinical Research Studies. Anesth Analg. 2018;127:1066-1073. doi: 10.1213/ANE.0000000000002787.
Thoemmes F. Propensity score matching in SPSS. 2012. Available at: https://arxiv.org/pdf/1201.6385.
Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25:1-21. doi: 10.1214/09-STS313.
Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol. 2006;163:1149-56. doi: 10.1093/aje/kwj149.
