Simulating Cross-Contamination Datasets

Table of Contents

1 Introduction

The CELSeq2 protocol used in much of the Pre-Processing Tutorial has a handful of published datasets using the multiplex format described, usually in different stages of quality without too many clear and publicly available examples of contamination (only those seen privately).

To circumvent this issue, and to gain better control over the educational content, simulating the datasets seemed to be the better approach.

2 Methods

2.1 Helper Functions

writeTSV <- function(mat, file){
    write.table(as.matrix(mat), file=file,
            col.names=NA, sep='\t', quote=FALSE)

## Abandoned methods, ignore
## cbind.fill <- function(...){
##     nm <- list(...)
##     nm <- lapply(nm, as.matrix)
##     n <- max(sapply(nm, nrow))
##, lapply(nm, function(x) rbind(x, matrix(,n-nrow(x), ncol(x)))))
## }

## bufferBatch <- function(batch){
##     tmp <- cbind.fill(batch, union_of_all_names)
##     tmp[] <- 0
##     tmp2 <- apply(tmp[,1:192], 1:2, as.integer)
##     return(tmp2)
## }

2.2 General Parameters

Here we set some general parameters, where we split cell types into 3 general categories of "Good", "Bad", and "Empty", with some expectation on how many of each we expect to see in a good or bad batch.

We also load the celseq2 barcodes and segment them into odd (1-96) and even (97-192) subsets due to the alternate barcoding scheme used in the sorting protocol.

goodbatch.goodcells = 70
goodbatch.badcells = 10
goodbatch.emptycells= 16

badbatch.goodcells = 0
badbatch.badcells = 3
badbatch.emptycells = 93

barcodes = read.table('../test-data/celseq_barcodes.192.raw', stringsAsFactors=F)[,1]

barcodes.1_96 <- barcodes[1:96]
barcodes.97_192 <- barcodes[97:192]

2.3 Cell Counts

We define a 'Cell' that takes an alpha and beta shape parameter (not to be confused with the parameters of a Gamma distribution!) and the number of genes detected in that cell. Good, Bad, Empty cells are then populated with gene counts sampled from a normal distribution.

(A negative binomial was initially chosen, as is expected with discrete count data, but the R method of sampling from one does not easily allow one to set mean and dispersion as desired, and so it was switched to an easier method).

  • A "Good" cell has around 5-8k total counts with individual counts in the 0-20 range
  • A "Bad" cell has around 2-4k total counts with individual counts in the 0-10 range
  • An "Empty" cell has around 500 total counts with individual counts in the 0-5 range
testCell <- function(alpha, beta, num.genes, debug=FALSE){
    ##return(rnbinom(n=num.genes, size=size, mu=mu))
    vals <- sapply(as.integer(rnorm(num.genes, alpha, beta)),
                   function(x){if(x<0){return(0)} else {return(x)}})
    if (debug){
    else {

goodCell <- function(ngenes){
    ##return(rnbinom(n=num.genes, size=size, mu=mu))
    return(testCell(-10,9, ngenes) * runif(1, 0.5, 1.1))

badCell <- function(ngenes){
    ##return(rnbinom(n=num.genes, size=0.1, mu=0.5))
    return(testCell(-13,9, ngenes) * runif(1, 0.5, 1))

emptyCell <- function(ngenes){
    ##return(rnbinom(n=num.genes, size=0.1, mu=0.1))
    return(testCell(-20,9, ngenes))

2.4 Simulating a Batch

A batch consists of three types of cells; Good, Bad, and Empty. The numbers of each type are required and cells are simulated according to them.

A row offset provides a means of ensuring that each batch does not always detect the same set of genes (as in real data), with only some overlap permitted batches.

populateBatch <- function(goodcells, badcells, emptycells, ngenes){
    tmp.cell <- c()
    if (goodcells > 0){
        for (c in 1:goodcells){
            tmp.cell <- cbind(tmp.cell, as.integer(goodCell(ngenes)))
    if (badcells > 0){
        for (c in 1:badcells){
            tmp.cell <- cbind(tmp.cell, as.integer(badCell(ngenes)))
    if (emptycells > 0){
        for (c in 1:emptycells){
            tmp.cell <- cbind(tmp.cell, as.integer(emptyCell(ngenes)))

generateBatch <- function(batch.prefix="P1_B1",, ngenes, goodcells, badcells, emptycells){
    bar.good <- paste(batch.prefix,, sep="_")
    tmp.cell <- populateBatch(goodcells, badcells, emptycells, ngenes)
    colnames(tmp.cell) <- bar.good

rowNameOffset <- function(rlen, offset){
    ## Begin labelling from this offset so that all batches do not always contain the same set of genes
    ## But do permit overlaps
    #rfract <- -1
    #while (rfract < 0 || rfract > 0.4){
    #    rfract = rnorm(1, 0.5, 0.25) - 0.25
    return(as.integer(rlen * offset))

makeBatch <- function(batch.prefix, ngenes, offset=0, odd=TRUE, props=NULL){

    if (is.null(props)){
        g.g = goodbatch.goodcells; g.b = goodbatch.badcells; g.e = goodbatch.emptycells;
        b.g = badbatch.goodcells; b.b = badbatch.badcells; b.e = badbatch.emptycells;
    } else {
        g.g = props[1]; g.b = props[2]; g.e = props[3];
        b.g = props[4]; b.b = props[5]; b.e = props[6];
    batch.good <- generateBatch(batch.prefix, barcodes.1_96 else barcodes.97_192, ngenes,
                                g.g, g.b, g.e)
    batch.bad <- generateBatch(batch.prefix, barcodes.97_192 else barcodes.1_96, ngenes,
                               b.g, b.b, b.e)

    batch <- cbind(batch.good, batch.bad)
    ## Shuffle columns
    batch.shuff <- batch[,sample(ncol(batch))]
    off <- rowNameOffset(ngenes, offset)
    rownames(batch.shuff) <- paste("GENE", (1+off):(nrow(batch.shuff)+off), sep="")

2.5 Generating 8 batches

We choose to generate 8 batches with varying numbers of detected genes, with varying levels of overlap, following the Odd/Even scheme of the MPI Freiburg lab which uses the Odd barcodes with the odd numbered batches, and the Even barcodes with even numbered batches.

Most batches have a distribution of cell categories that consist of the numbers given in the parameter section above. Batch 4 has slightly more inflated bad cells to show that it is still a good batch but has some cross contamination, and Batch6 has more bad cells than good cells to show that it is a batch with both cross contamination and likely label mixup.

We write each matrix to file so that we can use them in the Pre-processing tutorial where we can merge them later.

batch.1 <- makeBatch("P1_B1", 16000, 0.3, odd=TRUE)
batch.2 <- makeBatch("P1_B2", 15000, 0.23, odd=FALSE)
batch.3 <- makeBatch("P1_B3", 16200, 0.15, odd=TRUE)
batch.4 <- makeBatch("P1_B4", 14400, 0.05, odd=FALSE, props=c(70,10,16, 1, 35, 60))
batch.5 <- makeBatch("P2_B5", 17000, 0.22, odd=TRUE)
batch.6 <- makeBatch("P2_B6", 13000, 0.3, odd=FALSE, props=c(10,10,76, 32, 32, 32))
batch.7 <- makeBatch("P2_B7", 13800, 0, odd=TRUE)
batch.8 <- makeBatch("P2_B8", 14000, 0.11, odd=FALSE)

writeTSV(batch.1, "../test-data/p1b1.tsv")
writeTSV(batch.2, "../test-data/p1b2.tsv")
writeTSV(batch.3, "../test-data/p1b3.tsv")
writeTSV(batch.4, "../test-data/p1b4.tsv")
writeTSV(batch.5, "../test-data/p2b5.tsv")
writeTSV(batch.6, "../test-data/p2b6.tsv")
writeTSV(batch.7, "../test-data/p2b7.tsv")
writeTSV(batch.8, "../test-data/p2b8.tsv")

At this point the generation of the training material ends. The next section deals with merging the matrices in R and generating a combined matrix that can be used to be fed directly into the cross-contamination tool.

3 Merging Matrices

Here we make use of `dplyr` library to perform a full table join on the column names. The in-built functions in R (namely `cbind` and `merge`) failed to provide this functionality.

makeNice <- function(batch){
    tmp <- %>% mutate(names=rownames(batch))
merge2batches <- function(b1,b2){
    tmp <- full_join(b1, b2, by="names")

batch.n1 <- makeNice(batch.1)
batch.n2 <- makeNice(batch.2)
batch.n3 <- makeNice(batch.3)
batch.n4 <- makeNice(batch.4)
batch.n5 <- makeNice(batch.5)
batch.n6 <- makeNice(batch.6)
batch.n7 <- makeNice(batch.7)
batch.n8 <- makeNice(batch.8)

batch.m12 <- merge2batches(batch.n1, batch.n2)
batch.m34 <- merge2batches(batch.n3, batch.n4)
batch.m56 <- merge2batches(batch.n5, batch.n6)
batch.m78 <- merge2batches(batch.n7, batch.n8)
batch.m1234 <- merge2batches(batch.m12, batch.m34)
batch.m5678 <- merge2batches(batch.m56, batch.m78)
total.matrix <- merge2batches(batch.m1234, batch.m5678)

total.matrix[] <- 0
total.matrix2 <- total.matrix %>% select(-c(names))
total.matrix <- total.matrix2

Below we merge all column names to get a union of all sets in order to confirm that the merge performed by dplyr is correct. If so, then the column names of the total matrix should match the union of all sets.

t1 <- union(rownames(batch.1), rownames(batch.2))
t2 <- union(rownames(batch.3), rownames(batch.4))
t3 <- union(rownames(batch.5), rownames(batch.6))
t4 <- union(rownames(batch.7), rownames(batch.8))
a1 <- union(t1,t2)
a2 <- union(t3,t4)
union_of_all_names <- union(a1,a2)

if (nrow(total.matrix) != length(union_of_all_names)){
} else {
    rownames(total.matrix) <- union_of_all_names

writeTSV(total.matrix, "../test-data/test_big.matrix")

Author: Mehmet Tekman

Created: 2019-06-23 So 19:28