Published May 30, 2023 | Version v1
Dataset Open

A systematic evaluation of normalization methods and probe replicability using Infinium EPIC methylation data

  • 1. University of Toronto
  • 2. University of Sao Paulo
  • 3. Hospital for Sick Children

Description

Background

The Infinium EPIC array measures the methylation status of more than 850,000 CpG sites. The EPIC BeadChip combines two probe designs: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics, which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe-type bias as well as other issues such as background and dye bias.

Methods

This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson's correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2-normalized data.
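As a rough illustration of two of the quantities named above, the following minimal sketch computes a per-probe mean absolute beta-value difference and an ICC for paired replicates. The function names and toy values are hypothetical, and the one-way random-effects form ICC(1,1) is an assumption made here for illustration; the description does not state which ICC model the study used.

```python
from statistics import mean


def mean_abs_beta_diff(rep_a, rep_b):
    """Mean absolute beta-value difference for one probe across replicate pairs."""
    return mean(abs(a - b) for a, b in zip(rep_a, rep_b))


def icc_oneway(rep_a, rep_b):
    """One-way random-effects ICC(1,1) for one probe measured twice per sample.

    Assumed model (not confirmed by the source): k = 2 measurements per sample.
    """
    n = len(rep_a)  # number of replicate pairs, must be at least 2
    k = 2           # measurements per sample
    pair_means = [(a + b) / 2 for a, b in zip(rep_a, rep_b)]
    grand_mean = mean(pair_means)
    # Between-sample mean square
    msb = k * sum((m - grand_mean) ** 2 for m in pair_means) / (n - 1)
    # Within-sample (replicate) mean square
    msw = sum(
        (a - m) ** 2 + (b - m) ** 2
        for a, b, m in zip(rep_a, rep_b, pair_means)
    ) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    return (msb - msw) / denom if denom > 0 else float("nan")


# Hypothetical beta values for a single probe in three replicate pairs
first_run = [0.10, 0.50, 0.90]
second_run = [0.12, 0.47, 0.91]
probe_diff = mean_abs_beta_diff(first_run, second_run)
probe_icc = icc_oneway(first_run, second_run)
```

In this form, the ICC approaches 1 when replicate disagreement is small relative to the spread between samples, and drops toward or below 0 when the two are comparable.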

Results

The method we define as SeSAMe 2, which consists of the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was the best-performing normalization method, while quantile-based methods performed worst. Whole-array Pearson's correlations were high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1 and relatively low standard deviations. These results suggest that poor probe reliability largely reflects limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
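The link between low biological variation and low ICC can be seen from the variance-ratio definition of the ICC: between-sample variance divided by the sum of between-sample and technical variance. The sketch below is an illustration only; the function name and the standard deviations are hypothetical and are not taken from the dataset.

```python
def expected_icc(biological_sd, technical_sd):
    """Population ICC as between-sample variance over total variance."""
    between_var = biological_sd ** 2
    technical_var = technical_sd ** 2
    return between_var / (between_var + technical_var)


# Same hypothetical technical noise for both probes (SD of 0.02 beta units)
technical_sd = 0.02

# Probe near 0 or 1 with little variation between individuals
invariant_probe_icc = expected_icc(biological_sd=0.01, technical_sd=technical_sd)

# Probe with intermediate methylation and wide variation between individuals
variable_probe_icc = expected_icc(biological_sd=0.10, technical_sd=technical_sd)
```

With identical technical noise, the nearly invariant probe falls well below the 0.50 threshold while the variable probe sits close to 1, which is the pattern the results describe.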

Notes

We provide the data in an Excel file, which contains the absolute differences in beta values between replicate samples for each probe, with separate tabs for the raw data and for each normalization method.
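One possible way to work with a workbook organized like this is sketched below. It assumes pandas (with openpyxl) is installed and that each tab can be reduced to a list of absolute differences; the file path, tab names, and column layout are not given here, so everything named in the sketch is hypothetical.

```python
from statistics import median


def load_tabs(path):
    """Read every tab of a workbook into a dict of DataFrames keyed by tab name.

    pandas is imported lazily so the summary helper below works without it.
    """
    import pandas as pd
    return pd.read_excel(path, sheet_name=None)


def rank_methods(diffs_by_method):
    """Rank methods by median absolute beta-value difference, smallest first.

    diffs_by_method maps a (hypothetical) tab name to a list of absolute
    differences between replicates.
    """
    summary = [(name, median(values)) for name, values in diffs_by_method.items()]
    return sorted(summary, key=lambda item: item[1])


# Toy values standing in for two tabs; not taken from the dataset
toy_diffs = {
    "raw": [0.05, 0.03, 0.04],
    "method_a": [0.01, 0.02, 0.03],
}
ranking = rank_methods(toy_diffs)
```

A smaller median absolute difference indicates closer agreement between replicates under that method.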

Funding provided by: McLaughlin Centre Accelerator Grants in Genomic Medicine 2020*
Crossref Funder Registry ID:
Award Number: MC-2020-11

Funding provided by: Conselho Nacional de Desenvolvimento Científico e Tecnológico
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100003593
Award Number: CNPq INCT 465355/2014-5

Funding provided by: Fundação de Amparo à Pesquisa do Estado de São Paulo
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100001807
Award Number: SABE 2014/50649-6

Funding provided by: Fundação de Amparo à Pesquisa do Estado de São Paulo
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100001807
Award Number: INCT 2014/50931-3

Funding provided by: Ontario Graduate Scholarship

Funding provided by: UTM Postdoctoral Fellowship Award

Files

README.md

Files (1.7 GB)

Checksum Size
md5:c55ed01a5c9a7369e00a4efeebf98583 1.7 GB
md5:85cb81a98da7e9cd2297949a4f168c39 1.0 kB

Additional details

Related works