Basic description of the genome formula distance metric adapted to deltaGF

Intro

The genome formula is the set of relative frequencies for all virus genome segments. The GF for a genome with k segments is:

\(GF = \{f_1, f_2, [...], f_k\}\),

where f is the relative mean frequency of a segment, such that for the \(i^{th}\) segment \(f_j = c_j/\sum^k_{i=1}c_i\) and \(\sum^k_{i=1}f_i=1\). Sicard et al. (2013) proposed the \(\Delta\)GF metric, the sum of the absolute distance between all genome segments for two genome formula observations a and b:

\(\Delta GF_{a,b} = \sum^k_{i=1}\lvert{f_{a,i}-f_{b,i}}\rvert/2\).

Here we briefly consider some properties of this metric and consider some theoretical predictions for what values to expect for it, modifying the code we used for the genome formula distance metric.

Predicted \(\Delta\)GF values for random genome formula values

Most studies comment that there is considerable genome formula variation between virus populations. It is therefore interesting to explore how much genome formula variation is predicted for different conditions. We first consider the situation that the accumulation of different genome segments is random (and independent) for each segment. We use a numerical approach for estimating the mean distance between two such observations. We draw the value of accumulation for each segment from a uniform distribution covering the range of possible values - from 0 to some maximum (i.e., 1) - for two populations, and then determine the \(\Delta\)GF. For simplicity, the code below still refers to genome formula distance, but it has been adapted to reflect the \(\Delta\)GF.

# First we write a function for predicting the mean GF distance based on the 
# number of genome segments, assuming the absolute frequency of each segments 
# is drawn from a uniform distribution. Because we are comparing pairs of 
# GF observations, there are two arrays generated.
predict.gf.dist <- function(gen.seg, num.iter) {
  sim.result.1 <- array(NA, dim = c(num.iter, gen.seg)) # Array for storing the 
                                                        # absolute frequencies
                                                        # for GF Obs 1.
  n.sim.result.1 <- sim.result.1  # For relative freqs of GF Obs 1.
  sim.result.2 <- sim.result.1    # For absolute freqs of GF Obs 2.
  n.sim.result.2 <- sim.result.1  # For relative freqs of GF Obs 2.
  
  # Draw values for the absolute frequencies from a uniform distribution, 
  # determine the sum of each row and then determine the relaive frequencies. 
  # Note that with the standard parameter settings runif() draws values 
  # *between* 0 and 1, rendering the desired range.
  for(i in 1:gen.seg) {
    sim.result.1[,i] <- runif(n = num.iter, min = 0, max = 1)
    sim.result.2[,i] <- runif(n = num.iter, min = 0, max = 1)
  }
  
  sum.rows.1 <- rowSums(sim.result.1)
  sum.rows.2 <- rowSums(sim.result.2)
  
  for(i in 1:gen.seg) {
    n.sim.result.1[,i] <- sim.result.1[,i]/sum.rows.1
    n.sim.result.2[,i] <- sim.result.2[,i]/sum.rows.2 
  }  
  
  # Determine the cumulative distance between all corresponding GF observations.
  # Here the code has been modified to reflect the dGF instead of genome formula
  # distance.
  gf.dist = rowSums(abs(n.sim.result.1 - n.sim.result.2)/2)

  # Determine and return the mean distance
  return(mean(gf.dist))
}
  
# Set parameters for the  example, and make arrays to store the result.
virus.seg.range <- 2:10 # The range of number of segments to be explored. For
                        # simplicity the model will assume a balanced GF.
num.iter.1 <- 10^6      # The number of times to iterate the process
set.seed(666)           # Set random seed for reproducibility

# Now run the prediction for each value of the number of genome segments, and
# generate an array to store all results (i.e., some columns will not be used 
# yet here.)
num.virus.seg <- length(virus.seg.range)
comb.res <- array(data = NA, dim = c(num.virus.seg, 4))
colnames(comb.res) = c("Num_gen_seg", "Dist_Random", "Dist_Pois", "Lambda")
comb.res[,1] <- virus.seg.range

for(i in 1:num.virus.seg){
  comb.res[i,2] = predict.gf.dist(gen.seg = virus.seg.range[i], num.iter = num.iter.1)
}

# Plot the results.
plot(x = comb.res[,1], y = comb.res[,2], xlab = "Number of genome segments", 
  ylab = expression("Mean genome formula distance"), 
  type = "b", col = "blue", pch = 17, frame = FALSE, ylim = c(0, 0.5), 
  xlim = c(2, max(virus.seg.range)))

The expected random \(\Delta\)GF therefore behaves differently than the genome formula distance, starting at lower values for a small number of segments but increasing slightly as the number of segments increases.

Predicted \(\Delta\)GF values for a Poisson distributed number of infection founders

We have considered a model in which segment frequencies are uniformly distributed, for which the biological underpinning is simply the requirement that each segment is present at a non-zero frequency. But we can also consider other models that are more strongly rooted in biology. For example, we might hypothesize that the main source of genome formula variation is genetic bottlenecks at the start of infection. Then we can allow the number of segment copies that effectively enters the host to follow a Poisson distribution with a mean \(\lambda\) for all segments and \(\lambda_i\) for each of the j segment types. The amount of genome formula variation will depend on \(\lambda\), with the highest variation occurring at intermediate values (Zwart and Elena, 2020). Because the value of \(\lambda\) will depend on the exact conditions (e.g., virus-particle dose, host susceptibility to infection) and it is difficult to measure in practice, we predict the mean \(\Delta\)GF for a range of \(\lambda\) values to determine the maximum value \(*\)\(GF^{drift}_{a,b}\).

# Note: variables set in the first chunk of R code are re-used here.

# Load library for random draws from the zero-truncated Poisson distribution, and 
# for making heatmaps
library(actuar)
library(lattice)
library(viridisLite)

# Model parameters
virus.moi.range <- 10^seq(-1, 3, by = 0.01)  # The range of lambda values to be 
                                            # explored. I call this parameter MOI
                                            # in the code, but this is not to be 
                                            # confused with cellular MOI. 
num.moi.values <- length(virus.moi.range)
num.iter.2 <- 10^5        # The number of iterations used in the estimation

# Array to store the final GF distance values.
results.p <- array(data = NA, dim = c((num.virus.seg*num.moi.values), 3))
colnames(results.p) = c("Num_gen_seg", "Lambda", "Distance")
counter = 1

# Nested loops to run the model for all conditions
for(i in 1:num.virus.seg) {
  
  # Set number of segments
  num.segs.now <- virus.seg.range[i]
  
  for(j in 1:num.moi.values) {
    
    # Set moi, then determine for each segment assuming equimolarity
    moi <- virus.moi.range[j]
    moi.seg = moi/num.segs.now
    
    # First draw numbers for the number of successful founders of each segment,
    # then determine their relative frequencies. We are considering the
    # distance between two observations, and use an array for each observation.
    f.data.1  <- array(data = NA, dim = c(num.iter.2, num.segs.now))
    f.data.2  <- f.data.1
    rf.data.1 <- f.data.1
    rf.data.2 <- f.data.1
    for(k in 1:num.segs.now) {
      f.data.1[,k] = rztpois(n = num.iter.2, lambda = moi.seg)
      f.data.2[,k] = rztpois(n = num.iter.2, lambda = moi.seg)
    } 
    sum.rows.f.1 <- rowSums(f.data.1)
    sum.rows.f.2 <- rowSums(f.data.2)
    for(k in 1:num.iter.2) {
      rf.data.1[k,] = f.data.1[k,]/sum.rows.f.1[k]
      rf.data.2[k,] = f.data.2[k,]/sum.rows.f.2[k]
    } 
    
    # Determine the euclidean distance between all corresponding GF observations.
    # Here the code has been modified to reflect the dGF instead of genome formula
    # distance.
    gf.dist = rowSums(abs(rf.data.1 - rf.data.2)/2)

    results.p[counter,] = c(num.segs.now, moi, mean(gf.dist)) 
    # print(counter)
    counter = counter + 1
    }
}

# Make a heatmap of the results
res.p <- as.data.frame(results.p)
levelplot(Distance ~ Num_gen_seg * log10(Lambda), data=res.p, col.regions = magma(100))

# Now lets determine the maximum Poisson variation for each number of
# genome segments, and send this value to the overview array
for(i in 1:num.virus.seg) {
  eval.now <- which(res.p$Num_gen_seg == virus.seg.range[i])
  rank.now <- order(res.p$Distance[eval.now], decreasing = TRUE)
  comb.res[i,3] <- res.p[rank.now[1],3]
  comb.res[i,4] <- res.p[rank.now[1],2]
}

# Plot the results, recreating the random GF distance plot and then add the 
# Max Poisson prediction.
plot(x = comb.res[,1], y = comb.res[,2], xlab = "Number of genome segments", 
  ylab = expression("Mean genome formula distance"), 
  type = "b", col = "blue", pch = 17, frame = FALSE, ylim = c(0, 0.5), 
  xlim = c(2, max(virus.seg.range)))
  lines(comb.res[,1], y = comb.res[,3], type = "b", col = "green", 
        pch = 18, lty = 2)

The heatmap illustrates how the highest \(\Delta\)GF variation is indeed obtained at intermediate values for \(\lambda\), as had been anticipated based on previous work.

Finally, make a table summarizing the results for both distance metrics, to give precise values for these reference values.

# Make a table of the final results.
library(knitr)
kable(comb.res, caption = "Predicted random genome formula variation", escape = FALSE)

Predicted random genome formula variation
Num_gen_seg	Dist_Random	Dist_Pois	Lambda
2	0.2725972	0.2034137	5.370318
3	0.3045872	0.1980846	7.079458
4	0.3156749	0.1850404	9.332543
5	0.3211271	0.1742081	10.964782
6	0.3241218	0.1584896	13.489629
7	0.3260013	0.1492817	15.135613
8	0.3274279	0.1411478	16.982436
9	0.3284513	0.1323558	19.054607
10	0.3290911	0.1235538	21.877616

R Markdown File 9: Delta GF Metric Properties

Mark P. Zwart and Marcelle Johnson

6-11-2023

Intro

Predicted \(\Delta\)GF values for random genome formula values

Predicted \(\Delta\)GF values for a Poisson distributed number of infection founders