Published January 27, 2025 | Version v8
Software Open

Propensity Score Matching Python-based code

  • 1. University of Cantabria
  • 2. Servicio Cántabro de Salud
  • 3. Instituto de Investigación Marqués de Valdecilla (IDIVAL)

Description

This repository provides 4 versions of a free, Python-based code for performing propensity score (PS) matching. It is an initiative of the Camargo Cohort Study, developed with the aim of sharing the tool and spreading the use of PS matching.

The code overcomes compatibility issues with R versions and R packages, and implements (i) logistic regression to compute PS, (ii) 1:N matching using the K-nearest neighbour (KNN) algorithm with a customisable caliper, (iii) sampling with or without replacement, and (iv) visualisations to assess matching quality.
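The matching step described above can be illustrated with a minimal sketch. It assumes the PS values have already been estimated, applies the caliper directly on the PS scale, and uses a simple greedy search rather than the repository's KNN implementation; the function name `match_nearest` and the example ids are hypothetical.

```python
def match_nearest(ps_treated, ps_control, ratio=1, caliper=0.2, replace=False):
    """Greedy 1:N nearest-neighbour matching on the propensity score.

    ps_treated / ps_control: dicts mapping a unit id to its PS.
    Returns a dict mapping each matched treated id to its control ids.
    """
    available = dict(ps_control)  # controls that are still eligible
    matches = {}
    for t_id, t_ps in ps_treated.items():
        # Rank eligible controls by absolute PS distance to this treated unit.
        ranked = sorted(available, key=lambda c: abs(available[c] - t_ps))
        # Keep the closest `ratio` controls that fall inside the caliper.
        chosen = [c for c in ranked if abs(available[c] - t_ps) <= caliper][:ratio]
        if chosen:
            matches[t_id] = chosen
        if not replace:
            # Without replacement, a matched control cannot be reused.
            for c in chosen:
                del available[c]
    return matches


# Illustrative PS values (assumed, not taken from any dataset).
treated = {"t1": 0.62, "t2": 0.60, "t3": 0.95}
control = {"c1": 0.61, "c2": 0.55, "c3": 0.30}

without_repl = match_nearest(treated, control, ratio=1, caliper=0.1, replace=False)
with_repl = match_nearest(treated, control, ratio=1, caliper=0.1, replace=True)
```

In this toy example the treated unit `t3` has no control inside the caliper and is left unmatched, and sampling with replacement lets the same control serve two treated units.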

Outputs:

  • Matched pairs stored as a '.csv' file, so that a Cox regression (Coxreg) can be performed ('SET' in SPSS).
  • Diagnostic plots stored in the specified output folder, providing a view of SMD and PS distribution.
  • Statistics for matching validation: SMD, variance ratio (VR), and McFadden's pseudo-R^2.
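As a rough illustration of two of these statistics, the sketch below computes the SMD and the VR for a single continuous covariate. It assumes the common definition of the SMD with a pooled standard deviation (the square root of the average of the two group variances); the repository's exact formula may differ, and the function names and data are hypothetical.

```python
from statistics import mean, variance


def smd(treated, control):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of the sample variances (treated / control)."""
    return variance(treated) / variance(control)


# Illustrative covariate values (assumed, not real data).
age_treated = [60, 62, 64, 66]
age_control = [58, 61, 64, 65]

age_smd = smd(age_treated, age_control)
age_vr = variance_ratio(age_treated, age_control)
balanced = abs(age_smd) < 0.1  # the usual post-matching target
```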

The code has been developed using information from the Matplotlib, NumPy and Seaborn libraries, with support and refinements from OpenAI's ChatGPT.

No funding was received for conducting this work and there are no financial or non-financial interests to disclose. 

Notes

The code has been tested and works with datasets in SPSS v25, 28 and 29 ('open script'), with Python v3.10 and 3.11, and with R versions 4.3.0 and 4.4.0 using the 'Reticulate' package (1.39 and 1.40).
It tolerates missing values acceptably. However, it is desirable to reduce them as much as possible.
 
Usage
Adapt the code to your own research:
- Replace C:\PATH_TO_YOUR_DATASET.sav with the path to your dataset
- Replace COVS with the covariates in your data (variable name, not label)
- Choose the ratio (1:1, 1:2...) and the caliper
- Choose bar colors and adjust the limits of the x-axis and y-axis to the desired range
- Replace C:\PATH_TO_YOUR_FOLDER with the path to your output folder
Run the script [RStudio, SPSS (File / Open script)...]
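The settings listed above could be collected and sanity-checked in one place, as in the following sketch. Only the two placeholder paths and the name COVS come from the usage notes; the other names (`RATIO`, `CALIPER`, `check_settings`) and the covariate names are hypothetical and may not match the variables used in the scripts.

```python
from pathlib import PureWindowsPath

# Placeholders copied from the usage notes; replace them with your own values.
DATASET_PATH = PureWindowsPath(r"C:\PATH_TO_YOUR_DATASET.sav")
OUTPUT_DIR = PureWindowsPath(r"C:\PATH_TO_YOUR_FOLDER")
COVS = ["age", "sex", "bmi"]  # hypothetical variable names (names, not labels)
RATIO = 1                     # 1 means 1:1 matching, 2 means 1:2, ...
CALIPER = 0.2                 # maximum PS distance allowed for a match


def check_settings(dataset_path, covs, ratio, caliper):
    """Return a list of problems found in the user-edited settings."""
    problems = []
    if dataset_path.suffix.lower() != ".sav":
        problems.append("dataset should be an SPSS .sav file")
    if not covs or len(set(covs)) != len(covs):
        problems.append("COVS must be a non-empty list without duplicates")
    if not (isinstance(ratio, int) and ratio >= 1):
        problems.append("ratio must be a positive integer")
    if caliper <= 0:
        problems.append("caliper must be positive")
    return problems


problems = check_settings(DATASET_PATH, COVS, RATIO, CALIPER)
```

An empty `problems` list means the edited values pass these basic checks; it does not verify that the file exists or that the covariates are present in the dataset.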
 
All four codes perform PS matching and store the matched pairs. Features:
* Code 1: Sampling without replacement. Five plots showing SMD and PS distributions. 
* Code 2: Sampling with replacement. Five plots. 
* Code 3: Sampling without replacement. In addition, a lineplot showing the SMD before and after matching. Covariates are coloured according to whether or not they are included in the PS. The plot shows that PSM can also indirectly reduce the SMD of covariates not explicitly included in the PS model, due to underlying correlations or associations.
* Code 4: Sampling without replacement. Focused on matching validation, it stores 3 statistics: 
- SMD (covariates included in PS): the objective is an absolute SMD postmatching <0.1
- VR (covariates included in PS): the objective is a VR postmatching close to 1
- McFadden's pseudo-R^2: the objective is a value postmatching close to zero, indicating that the covariates included in the PS model no longer explain the variability of the dependent variable (DV)
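For reference, McFadden's pseudo-R^2 is defined as 1 - LL_model / LL_null, where LL_null is the log-likelihood of an intercept-only model. The sketch below evaluates it from a binary DV and predicted probabilities; it assumes the probabilities come from a logistic regression refitted on the matched sample, which is not reproduced here, and the function names and data are illustrative.

```python
from math import log


def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y given probabilities p."""
    return sum(log(pi) if yi == 1 else log(1 - pi) for yi, pi in zip(y, p))


def mcfadden_r2(y, p_model):
    """McFadden's pseudo-R^2: 1 - LL_model / LL_null."""
    # The intercept-only (null) model predicts the overall proportion of 1s.
    p_null = sum(y) / len(y)
    ll_null = log_likelihood(y, [p_null] * len(y))
    ll_model = log_likelihood(y, p_model)
    return 1 - ll_model / ll_null


# Illustrative values (assumed): 1 = treated, 0 = control.
group = [1, 1, 0, 0]
r2_informative = mcfadden_r2(group, [0.9, 0.8, 0.2, 0.1])    # covariates predict group
r2_uninformative = mcfadden_r2(group, [0.5, 0.5, 0.5, 0.5])  # covariates carry no signal
```

When the predicted probabilities are no better than the overall proportion, the two log-likelihoods coincide and the statistic is zero, which is the situation expected after successful matching.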

Notes

The code was also tested by comparing its results with those of a PSM run in SPSS using R packages (Propensity Score Matching for SPSS v1.0, by Thoemmes F). We applied the same settings to both methods: 5 covariates included in the PS, caliper = 0.20, ratio 1:1, and sampling without replacement. We observed significant discrepancies in the PS values and in the composition of the matched sample. However, the post-matching balance met the standard thresholds with both methods. The differences are probably due to several factors (PS estimation, optimisation algorithms, caliper application...), reflecting the different implementations offered by Python libraries (matplotlib, sklearn) and R-based packages (MatchIt, RItools, cem). In our opinion, from a practical standpoint, a method can be considered acceptable if the balance after matching meets the key criterion of an absolute SMD <0.1 (10%) in the covariates. This indicates a good PSM model, regardless of the PS values or the composition of the matched pairs.

Files

Figure_1.png

Files (226.6 kB)

Name Size
md5:ef269c09c403a30538e2f9cc821c6570 46.2 kB
md5:48263ac797d6ea828b92b51e4dd80729 10.8 kB
md5:d871cfd950226f0c758d2157a119dec7 11.3 kB
md5:a7a4280f8e69a9b0b149c4cdf60a2fb2 8.3 kB
md5:4a7ec2ec49d25b0960c29736badee9ce 7.5 kB
md5:0ad0391808cf4fb1070857ee12787671 63.9 kB
md5:1dfed3ff74f225bf1e1d22541f499f42 78.5 kB

Additional details

Dates

Updated
2025-01-27
Python-based code for implementing PSM

References

  • Staffa SJ, Zurakowski D. Five Steps to Successfully Implement and Evaluate Propensity Score Matching in Clinical Research Studies. Anesth Analg. 2018;127:1066-1073. doi: 10.1213/ANE.0000000000002787.
  • Thoemmes F. Propensity score matching in SPSS. 2012. Available at: https://arxiv.org/pdf/1201.6385.
  • Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25:1-21. doi: 10.1214/09-STS313.
  • Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol. 2006;163:1149-56. doi: 10.1093/aje/kwj149.