Published March 29, 2025 | Version v1

Open format raw data for MALDI Imaging Mass Spectrometry Differentiates Basal Cell Carcinoma from Trichoblastoma and Trichoepithelioma: A Proof of Principle Study

  • 1. St. George's University School of Medicine, True Blue, Grenada
  • 2. Frontier Diagnostics, LLC, Nashville, Tennessee, USA
  • 3. Pathology Associates of Saint Thomas, Nashville, Tennessee, USA
  • 4. Mass Spectrometry Research Center, Department of Biochemistry, Vanderbilt University, Nashville, Tennessee, USA
  • 5. Frontier Diagnostics, LLC, Nashville, Tennessee, USA
  • 6. Duke University Department of Pathology and Dermatology, Durham, North Carolina, USA

Description

Raw data in HDF5 format accompanying "MALDI Imaging Mass Spectrometry Differentiates Basal Cell Carcinoma from Trichoblastoma and Trichoepithelioma: A Proof of Principle Study".

Abstract

Background

Basal cell carcinoma (BCC) comprises a large portion of dermatopathology specimens; however, benign mimics such as trichoblastoma/trichoepithelioma (TB/TE) put accurate diagnosis at risk, which can lead to inappropriate clinical management and overuse of healthcare resources. This study aims to address the challenges of traditional histopathological evaluation by using matrix-assisted laser desorption ionization imaging mass spectrometry (MALDI IMS).

Methods and Findings 

Formalin-fixed paraffin-embedded BCC and TB/TE tissue blocks were taken from archival tissue. A cohort of 69 BCC and TB/TE specimens was identified, each having three concordant diagnoses given by dermatopathologists after a blinded analysis. H&E-stained sections of each specimen were imaged for pathological analysis and uploaded to digital annotation software with the following classifications: BCC, TB, TE, BCC stroma, TB stroma, and TE stroma. Mass spectra were collected from unstained serial sections guided by the areas annotated by the dermatopathologists on the H&E-stained serial sections. Before informatics analysis, the data from the cohort were divided randomly into a training set (n=55) and a validation set (n=14). Prediction models were developed using a support vector machine (SVM) classification model from the training set data.

The platform predicted BCC and TB/TE in model 2 (tumor nests alone) with a sensitivity of 98.9% (95% CI 98.3-99.4%) and specificity of 88.4% (95% CI 78.4-94.5%) at the spectral level in the validation set. Model 1 (stroma alone) had a sensitivity of 46.1% (95% CI 43.0-49.1%) and specificity of 99.2% (95% CI 97.1-99.9%). A combined model 3 (tumor nests and stroma) had a sensitivity of 90.26% (95% CI 89.1-91.3%) and a specificity of 97.1% (95% CI 94.6-98.7%). The limitations of this study included a small sample set, which included easily identifiable cases obtained from a single tissue source.

Conclusions

Our study proves that BCC and TB/TE exhibit different proteomic profiles that one can use to enable accurate differential diagnosis. While our findings are not yet validated for clinical use, this merits further research to support IMS as an ancillary diagnostic tool for adequately and efficiently identifying the most common cutaneous malignancy in the United States. We recommend that future studies obtain a more extensive set of histologically challenging cases from multiple institutions and adequate clinical follow-up to confirm diagnostic accuracy.

 

Documentation: Reading MSI Spot Data H5 Files

Overview

This document describes the structure of the H5 files created by the Mass Spectrometry Imaging (MSI) data processing pipeline and provides examples of how to read these files using both Python and R.

H5 File Structure

Each H5 file contains the following datasets:

  1. intensity: A 2D array (matrix) containing intensity values for each spot and m/z value

    • Dimensions: [number_of_spots × number_of_m/z_values]
    • Data type: 32-bit floating point (float32)
    • Compression: GZIP (level 1)
  2. mz_values: A 1D array containing the m/z values

    • Dimensions: [number_of_m/z_values]
    • Data type: 32-bit floating point (float32)
  3. spot_names: A 1D array containing the spot names/identifiers

    • Dimensions: [number_of_spots]
    • Data type: ASCII string (bytes)
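
As a quick check of the layout described above, the following sketch writes a tiny synthetic file with the same three datasets and then reports each dataset's shape, data type, and compression. The file name `demo-MSI-spot-data.h5`, the spot labels, and the helper `describe_h5` are illustrative assumptions and are not part of the deposited data; pointing `describe_h5` at a downloaded file should show the real dimensions.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_h5(path):
    """Return {dataset name: (shape, dtype, compression)} for top-level datasets."""
    summary = {}
    with h5py.File(path, "r") as f:
        for name, item in f.items():
            if isinstance(item, h5py.Dataset):
                summary[name] = (item.shape, str(item.dtype), item.compression)
    return summary


# Build a small synthetic file that mimics the documented layout
# (assumption: 4 spots x 6 m/z values; the real files are far larger).
demo_path = os.path.join(tempfile.mkdtemp(), "demo-MSI-spot-data.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("intensity", data=np.zeros((4, 6), dtype=np.float32),
                     compression="gzip", compression_opts=1)
    f.create_dataset("mz_values", data=np.linspace(500, 505, 6, dtype=np.float32))
    f.create_dataset("spot_names", data=np.array([b"s0", b"s1", b"s2", b"s3"]))

layout = describe_h5(demo_path)
for name, (shape, dtype, compression) in layout.items():
    print(f"{name}: shape={shape}, dtype={dtype}, compression={compression}")
```

If the first dimension of `intensity` does not equal the length of `spot_names`, the file was probably written with a different orientation than the one documented here.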

Reading the H5 Files in Python

Using h5py

import h5py
import numpy as np
import matplotlib.pyplot as plt

# Open the H5 file
file_path = "sample-MSI-spot-data.h5"
with h5py.File(file_path, 'r') as f:
    # Read the datasets
    intensity_array = f['intensity'][:]  # Read the full intensity array
    mz_values = f['mz_values'][:]        # Read the m/z values
    spot_names = f['spot_names'][:]      # Read the spot names
    
    # Convert byte strings to regular strings only when the entries are bytes
    spot_names = [name.decode('utf-8') if isinstance(name, bytes) else str(name)
                  for name in spot_names]
    
    # Print basic information
    print(f"Number of spots: {intensity_array.shape[0]}")
    print(f"Number of m/z values: {intensity_array.shape[1]}")
    print(f"First few m/z values: {mz_values[:5]}")
    print(f"First few spot names: {spot_names[:5]}")
    
    # Example: Extract spectrum for first spot
    plt.figure(figsize=(10, 6))
    plt.plot(mz_values, intensity_array[0, :])
    plt.xlabel('m/z')
    plt.ylabel('Intensity')
    plt.title(f'Mass Spectrum for Spot: {spot_names[0]}')
    plt.show()
    
    # Example: Extract intensity for a specific m/z value across all spots
    target_mz = 500.0  # Replace with your m/z of interest
    closest_mz_idx = np.abs(mz_values - target_mz).argmin()
    actual_mz = mz_values[closest_mz_idx]
    print(f"Closest m/z to {target_mz} is {actual_mz}")
    
    intensity_at_mz = intensity_array[:, closest_mz_idx]
    # Now intensity_at_mz contains the intensity for that m/z across all spots

Reading the H5 Files in R

Using rhdf5

# Install required packages if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!requireNamespace("rhdf5", quietly = TRUE))
    BiocManager::install("rhdf5")

library(rhdf5)
library(ggplot2)

# Set the file path
file_path <- "sample-MSI-spot-data.h5"

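# Read the datasets
# rhdf5 reverses the dimension order by default; native = TRUE keeps the
# [spots x m/z] orientation described in the file structure section
intensity_array <- h5read(file_path, "intensity", native = TRUE)
mz_values <- as.numeric(h5read(file_path, "mz_values"))
spot_names <- h5read(file_path, "spot_names")

# rhdf5 returns HDF5 string datasets as character data, so rawToChar() is not
# needed; as.character() only drops the array dimensions
spot_names <- as.character(spot_names)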

# Print basic information
cat("Number of spots:", dim(intensity_array)[1], "\n")
cat("Number of m/z values:", dim(intensity_array)[2], "\n")
cat("First few m/z values:", head(mz_values, 5), "\n")
cat("First few spot names:", head(spot_names, 5), "\n")

# Example: Extract spectrum for first spot
first_spot_spectrum <- intensity_array[1,]
spectrum_data <- data.frame(mz = mz_values, intensity = first_spot_spectrum)

# Plot the spectrum
ggplot(spectrum_data, aes(x = mz, y = intensity)) +
  geom_line() +
  labs(title = paste("Mass Spectrum for Spot:", spot_names[1]),
       x = "m/z",
       y = "Intensity") +
  theme_minimal()

# Example: Find intensity for a specific m/z value across all spots
target_mz <- 500.0
closest_mz_idx <- which.min(abs(mz_values - target_mz))
actual_mz <- mz_values[closest_mz_idx]
cat("Closest m/z to", target_mz, "is", actual_mz, "\n")

intensity_at_mz <- intensity_array[, closest_mz_idx]
# Now intensity_at_mz contains the intensity for that m/z across all spots

Additional Information

  • The intensity values represent the intensity of the mass spectrometry signal for each m/z value at each spot.
  • Spot names typically include information about the location of the spot on the sample and can be cross-referenced with the spot information in the accompanying training and test .csv files.
  • The m/z values represent the mass-to-charge ratio of the detected ions.
  • This H5 file format allows for efficient storage and retrieval of large MSI datasets.
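
The cross-referencing mentioned above can be sketched as a lookup from spot name to row index in the intensity matrix. The column layout of the training and test .csv files is not described here, so the column names `spot` and `label`, the toy values, and both helper functions are hypothetical; check the real CSV header before adapting this.

```python
import csv
import io


def index_spots(spot_names):
    """Map each spot name to its row index in the intensity matrix."""
    return {name: row for row, name in enumerate(spot_names)}


def rows_for_csv(csv_text, spot_index, spot_column="spot"):
    """Yield (matrix_row, csv_record) for CSV records whose spot is in the H5 file.

    `spot_column` is a hypothetical column name, not taken from the real files.
    """
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = spot_index.get(record[spot_column])
        if row is not None:
            yield row, record


# Toy stand-ins for decoded spot names and an annotation CSV (made-up values).
spot_names = ["A1", "A2", "B1"]
csv_text = "spot,label\nB1,BCC\nA1,TB\nZ9,TE\n"

matches = list(rows_for_csv(csv_text, index_spots(spot_names)))
for row, record in matches:
    print(row, record["spot"], record["label"])
```

Each returned row index can then be used to pull the matching spectrum, for example `intensity_array[row, :]`. Records whose spot name is absent from the H5 file are skipped.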

Troubleshooting

If you encounter issues reading the file:

  1. Ensure the H5 file exists at the specified path.
  2. Verify that you have the correct version of h5py (Python) or rhdf5 (R) installed.
  3. For large files, ensure you have sufficient memory available.
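
When memory is the limiting factor, one option is to slice the `intensity` dataset in batches instead of loading it with `[:]`, since h5py reads only the requested rows from disk. The sketch below computes a mean spectrum this way on a small synthetic file; the function name `mean_spectrum`, the batch size, and the demo file are assumptions chosen for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def mean_spectrum(path, batch_size=1024):
    """Average the spectra of all spots, reading `batch_size` rows at a time."""
    with h5py.File(path, "r") as f:
        dset = f["intensity"]              # no data is read until it is sliced
        n_spots, n_mz = dset.shape
        total = np.zeros(n_mz, dtype=np.float64)
        for start in range(0, n_spots, batch_size):
            block = dset[start:start + batch_size, :]   # only this slab is loaded
            total += block.sum(axis=0, dtype=np.float64)
    return total / n_spots


# Synthetic demonstration file (assumption: 5 spots x 3 m/z values).
demo_path = os.path.join(tempfile.mkdtemp(), "demo-MSI-spot-data.h5")
data = np.arange(15, dtype=np.float32).reshape(5, 3)
with h5py.File(demo_path, "w") as f:
    f.create_dataset("intensity", data=data)

avg = mean_spectrum(demo_path, batch_size=2)
print(avg)
```

Peak memory then scales with `batch_size` times the number of m/z values rather than with the full matrix.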

Files

nml-bcc-tricho-nostroma-sample-train.csv

Files (21.5 GB)

Checksum  Size
md5:f6d8b8b34558f4fba096a451ec78f388  285.8 MB
md5:5a88ca9453e3e705969a6c7b77bed8da  132.0 MB
md5:29dc0fab833d43ab2da16d58b9d5dc0f  86.0 MB
md5:29dc0fab833d43ab2da16d58b9d5dc0f  86.0 MB
md5:cb41c4c07e7c1063ae6c5548b5d9594e  300.8 MB
md5:cb41c4c07e7c1063ae6c5548b5d9594e  300.8 MB
md5:998f4a2673623fdf7216ac6b94253875  341.5 MB
md5:998f4a2673623fdf7216ac6b94253875  341.5 MB
md5:8850f10029d91b4ce5a6a30976332f64  61.6 MB
md5:f58c0ce691b294180650fd2d00bc772f  75.6 MB
md5:c4519ee509238328b852ac9c81103f83  93.7 MB
md5:e83716ed16c0467ed29e65aa10e6ec55  171.3 MB
md5:f33b49a25edce3a56ea8e73e1cfaa139  210.9 MB
md5:063296e0179b9f88a2967163c24991ce  94.8 MB
md5:3cafd9889e5f100b55ee8dc731745dc8  30.3 MB
md5:2b7f9d7c5a9af49e69fb124c7456b858  232.2 MB
md5:0f69aae0c0b72b8027ef3d826cd17e6a  214.6 MB
md5:0c1d482e18a40a501be1db646d3cd2e9  105.0 MB
md5:1ebd6b23e7478d87a2479c730c679c7d  323.8 MB
md5:b82edc1081247a51da5665e752519289  65.2 MB
md5:ea01993e7029e84e96ce14dfabcec4c8  724.6 MB
md5:ca58c9b330960d2884ff31a3e15deeb3  303.7 MB
md5:f9af8e9fbe3a48e7bc48583e6e7a6c70  324.1 MB
md5:5edaabfafe42af3326e074c39b2e38b6  138.6 MB
md5:65fbd4ed1e3786b1d79da4cc7a937472  298.6 MB
md5:ea2d2e90afeac9972fd0ae085fe19786  352.1 MB
md5:40a36606a2510ece9e39cb81bad024d8  158.3 MB
md5:a862539c3af8fdd565e63d073093d4f2  263.8 MB
md5:0f6668e8ab1f7582ffa94db3c2b94490  691.2 MB
md5:ba3215af0a6674052d378b7a2fb8c426  163.6 MB
md5:9ce0e3cfa0add1574ff7edd31357e8d3  148.6 MB
md5:1f2129e525deb50ed5c788578a18c3d9  237.8 MB
md5:710a5f2931a85672dc1965a264ebefef  163.3 MB
md5:92afc92174cc2dd3e248e277c50cf747  261.7 MB
md5:09efb12d30af32c3e4bbe9ffc01fff8a  346.4 MB
md5:0ec247f5851398bde049037e5c441dd6  352.3 MB
md5:a2f3ea2ffbc1fa36a55d5691eb18812f  392.6 MB
md5:e40bc404338d1184ac6fd4ac264ac5c9  205.9 MB
md5:440dc29656e5815f30ccd99c85d752d4  261.3 MB
md5:ff593f566406db45dff5a57c4bd92164  246.0 MB
md5:a39c690a264c3cbb5dd55de0ec692198  1.2 GB
md5:558f3d1a3bb8beca35aa1c892a95165f  411.0 MB
md5:a12449919fea13f5ebf2d8bd0e6f053a  486.3 MB
md5:dd2dfa86130825600201b3454ee6aa91  461.4 MB
md5:9a89b77bc92e68019ce97911330560a0  427.2 MB
md5:bb269a6d90149558ca725273eb9811c7  269.8 MB
md5:22ee8004e538548c9b92e7ad005fef72  539.8 MB
md5:ccd28597dfda7ba3e2ddd463886c9757  113.9 MB
md5:3813990f156655ff5709c2945381f523  467.5 MB
md5:69457eb50ae30a06e5f60e1c9b5da8d4  580.1 MB
md5:aec8a18e990892b84e1bddaafb2c6168  237.2 MB
md5:000734ab793dac10c19f0cb4510d051a  1.1 GB
md5:5073b32d8045cb25fe69af78cfec7fc5  104.9 MB
md5:3d368432cb70c0d5adff5ba6feecdf73  713.2 MB
md5:3b6a54fe116abcaf7df74ba0ed7405a9  223.9 MB
md5:5ad4833a88e8da5483b445f431a3d707  1.3 GB
md5:deababfe9780c3018c898ae331164d15  729.0 MB
md5:81d7b6f026961b201ce940a1e9e06412  936.3 MB
md5:8c1b6b912329b0779e0ddc33046ba928  675.2 MB
md5:a6daed492e634b3ec6ff39a2a3539d90  277.9 MB
md5:cefce251a91d2a1e6449c420e48cf97d  191.1 MB
md5:31a6c6ffc724a2086e8e3bc01c4e5689  60.7 MB
md5:ea3f238cdd50f7abcb46f31b3f9567e6  405.2 MB
md5:511bfe84da4ae1e04bdd9123fe04bd32  1.3 kB
md5:8e99d51982cca2b32873f84e3e46c0d6  679.9 kB
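
Because every entry in the file listing carries an md5 checksum, a download can be checked against it before analysis. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listing` and the temporary demo file are assumptions for illustration, and the demo checksum belongs to the five-byte string `hello`, not to any file in this record.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:5d41...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demonstration on a throwaway file whose checksum is known in advance.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as handle:
    handle.write(b"hello")

ok = matches_listing(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
print("checksum matches:", ok)
```

Reading in chunks keeps memory use flat even for the files above 1 GB.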

Additional details

Funding

National Institutes of Health
A Molecular Diagnostic Assay for Accurately Differentiating Melanoma from Benign Lesions 5R44CA228897