Basic description of the genome formula distance metric

Intro

The genome formula is the set of relative frequencies for all virus genome segments. The GF for a genome with k segments is:

\(GF = \{f_1, f_2, [...], f_k\}\),

where f is the relative mean frequency of a segment, such that for the \(i^{th}\) segment \(f_j = c_j/\sum^k_{i=1}c_i\) and \(\sum^k_{i=1}f_i=1\). We previously proposed a simple metric for empirically observed genome formula values (Boezen et al., Front Virol 2023), the Euclidean distance between two genome formula observations a and b:

\(D_{a,b} = \sqrt{\sum^k_{i=1}(f_{a,i}-f_{b,i})^2}\).

One advantage of this metric is that it is easy to visualize and interpret. Moreover, simple and robust approaches such as PERMANOVA can be applied to genome formula data using existing software packages. Here we briefly consider some properties of this metric and consider some theoretical predictions for what values to expect for it.

Predicted genome formula distance values for random genome formula values

Most studies comment that there is considerable genome formula variation between virus populations. It is therefore interesting to explore how much genome formula variation is predicted for different conditions. We first consider the situation that the accumulation of different genome segments is random (and independent) for each segment. We use a numerical approach for estimating the mean distance between two such observations. We draw the value of accumulation for each segment from a uniform distribution covering the range of possible values - from 0 to some maximum (i.e., 1) - for two populations, and then determine the mean distance between them, \(D^{rand}_{a,b}\).

# First we write a function for predicting the mean GF distance based on the 
# number of genome segments, assuming the absolute frequency of each segments 
# is drawn from a uniform distribution. Because we are comparing pairs of 
# GF observations, there are two arrays generated.
predict.gf.dist <- function(gen.seg, num.iter) {
  sim.result.1 <- array(NA, dim = c(num.iter, gen.seg)) # Array for storing the 
                                                        # absolute frequencies
                                                        # for GF Obs 1.
  n.sim.result.1 <- sim.result.1  # For relative freqs of GF Obs 1.
  sim.result.2 <- sim.result.1    # For absolute freqs of GF Obs 2.
  n.sim.result.2 <- sim.result.1  # For relative freqs of GF Obs 2.
  
  # Draw values for the absolute frequencies from a uniform distribution, 
  # determine the sum of each row and then determine the relaive frequencies. 
  # Note that with the standard parameter settings runif() draws values 
  # *between* 0 and 1, rendering the desired range.
  for(i in 1:gen.seg) {
    sim.result.1[,i] <- runif(n = num.iter, min = 0, max = 1)
    sim.result.2[,i] <- runif(n = num.iter, min = 0, max = 1)
  }
  
  sum.rows.1 <- rowSums(sim.result.1)
  sum.rows.2 <- rowSums(sim.result.2)
  
  for(i in 1:gen.seg) {
    n.sim.result.1[,i] <- sim.result.1[,i]/sum.rows.1
    n.sim.result.2[,i] <- sim.result.2[,i]/sum.rows.2 
  }  
  
  # Determine the euclidean distance between all corresponding GF observations.
  gf.dist = sqrt(rowSums((n.sim.result.1 - n.sim.result.2)^2))

  # Determine and return the mean distance
  return(mean(gf.dist))
}
  
# Set parameters for the  example, and make arrays to store the result.
virus.seg.range <- 2:10 # The range of number of segments to be explored. For
                        # simplicity the model will assume a balanced GF.
num.iter.1 <- 10^6      # The number of times to iterate the process
set.seed(666)           # Set random seed for reproducibility

# Now run the prediction for each value of the number of genome segments, and
# generate an array to store all results (i.e., some columns will not be used 
# yet here.)
num.virus.seg <- length(virus.seg.range)
comb.res <- array(data = NA, dim = c(num.virus.seg, 4))
colnames(comb.res) = c("Num_gen_seg", "Dist_Random", "Dist_Pois", "Lambda")
comb.res[,1] <- virus.seg.range

for(i in 1:num.virus.seg){
  comb.res[i,2] = predict.gf.dist(gen.seg = virus.seg.range[i], num.iter = num.iter.1)
}

# Plot the results.
plot(x = comb.res[,1], y = comb.res[,2], xlab = "Number of genome segments", 
  ylab = expression("Mean genome formula distance"), 
  type = "b", col = "blue", pch = 17, frame = FALSE, ylim = c(0, 0.5), 
  xlim = c(2, max(virus.seg.range)))

The expected random genome formula distance is therefore expected to increase slightly as the number of segments increases from 2 to 3, but steadily decreases when the number of segments increases above 3. The decrease in predicted distance with increasing segment number occurs because of reversion to the mean when sampling multiple segments. The slight increase in predicted distance between 2 and 3 segments is perhaps associated with a measurement with a single or multiple degrees of freedom. (I.e., we only need one relative frequency to represent the genome formula of a bipartite virus, as all genome formula values can be represented by a line in 2-dimensional segment-frequency space as illustrated in Figure 1 in the main paper.)

Predicted genome formula distance values for a Poisson distributed number of infection founders

We have considered a model in which segment frequencies are uniformly distributed, for which the biological underpinning is simply the requirement that each segment is present at a non-zero frequency. But we can also consider other models that are more strongly rooted in biology. For example, we might hypothesize that the main source of genome formula variation is genetic bottlenecks at the start of infection. Then we can allow the number of segment copies that effectively enters the host to follow a Poisson distribution with a mean \(\lambda\) for all segments and \(\lambda_i\) for each of the j segment types. The amount of genome formula variation will depend on \(\lambda\), with the highest variation occurring at intermediate values (Zwart and Elena, 2020). Because the value of \(\lambda\) will depend on the exact conditions (e.g., virus-particle dose, host susceptibility to infection) and it is difficult to measure in practice, we predict the mean distance for a range of \(\lambda\) values to determine the maximum value \(d^{drift}_{a,b}\).

In the R code below, we predict the mean genome formula distance when there is a Poisson-distributed number of founders for each segment.

# Note: variables set in the first chunk of R code are re-used here.

# Load library for random draws from the zero-truncated Poisson distribution, and 
# for making heatmaps
library(actuar)
library(lattice)
library(viridisLite)

# Model parameters
virus.moi.range <- 10^seq(-1, 3, by = 0.01)  # The range of lambda values to be 
                                            # explored. I call this parameter MOI
                                            # in the code, but this is not to be 
                                            # confused with cellular MOI. 
num.moi.values <- length(virus.moi.range)
num.iter.2 <- 10^5        # The number of iterations used in the estimation

# Array to store the final GF distance values.
results.p <- array(data = NA, dim = c((num.virus.seg*num.moi.values), 3))
colnames(results.p) = c("Num_gen_seg", "Lambda", "Distance")
counter = 1

# Nested loops to run the model for all conditions
for(i in 1:num.virus.seg) {
  
  # Set number of segments
  num.segs.now <- virus.seg.range[i]
  
  for(j in 1:num.moi.values) {
    
    # Set moi, then determine for each segment assuming equimolarity
    moi <- virus.moi.range[j]
    moi.seg = moi/num.segs.now
    
    # First draw numbers for the number of successful founders of each segment,
    # then determine their relative frequencies. We are considering the
    # distance between two observations, and use an array for each observation.
    f.data.1  <- array(data = NA, dim = c(num.iter.2, num.segs.now))
    f.data.2  <- f.data.1
    rf.data.1 <- f.data.1
    rf.data.2 <- f.data.1
    for(k in 1:num.segs.now) {
      f.data.1[,k] = rztpois(n = num.iter.2, lambda = moi.seg)
      f.data.2[,k] = rztpois(n = num.iter.2, lambda = moi.seg)
    } 
    sum.rows.f.1 <- rowSums(f.data.1)
    sum.rows.f.2 <- rowSums(f.data.2)
    for(k in 1:num.iter.2) {
      rf.data.1[k,] = f.data.1[k,]/sum.rows.f.1[k]
      rf.data.2[k,] = f.data.2[k,]/sum.rows.f.2[k]
    } 
    
    # Determine the euclidean distance between all corresponding GF observations.
    gf.dist = sqrt(rowSums((rf.data.1 - rf.data.2)^2))

    results.p[counter,] = c(num.segs.now, moi, mean(gf.dist)) 
    # print(counter)
    counter = counter + 1
    }
}

# Make a heatmap of the results
res.p <- as.data.frame(results.p)
levelplot(Distance ~ Num_gen_seg * log10(Lambda), data=res.p, col.regions = magma(100))

# Now lets determine the maximum Poisson variation for each number of
# genome segments, and send this value to the overview array
for(i in 1:num.virus.seg) {
  eval.now <- which(res.p$Num_gen_seg == virus.seg.range[i])
  rank.now <- order(res.p$Distance[eval.now], decreasing = TRUE)
  comb.res[i,3] <- res.p[rank.now[1],3]
  comb.res[i,4] <- res.p[rank.now[1],2]
}

# Plot the results, recreating the random GF distance plot and then add the 
# Max Poisson prediction.
plot(x = comb.res[,1], y = comb.res[,2], xlab = "Number of genome segments", 
  ylab = expression("Mean genome formula distance"), 
  type = "b", col = "blue", pch = 17, frame = FALSE, ylim = c(0, 0.5), 
  xlim = c(2, max(virus.seg.range)))
  lines(comb.res[,1], y = comb.res[,3], type = "b", col = "green", 
        pch = 18, lty = 2)

The heatmap illustrates how the highest GF variation is indeed obtained at intermediate values for \(\lambda\), as had been anticipated based on previous work. The genome formula distance values \(d^{drift}_{a,b}\) follow a similar pattern to \(d^{rand}_{a,b}\), although being appreciably lower.

Finally, make a table summarizing the results for both distance metrics, to give precise values for these reference values.

# Make a table of the final results.
library(knitr)
kable(comb.res, caption = "Predicted random genome formula variation", escape = FALSE)

Predicted random genome formula variation
Num_gen_seg	Dist_Random	Dist_Pois	Lambda
2	0.3855106	0.2876704	5.370318
3	0.3905047	0.2801339	7.079458
4	0.3638186	0.2629384	9.120108
5	0.3366711	0.2494165	10.471286
6	0.3132396	0.2340507	12.302688
7	0.2933507	0.2188568	14.125375
8	0.2767207	0.2059531	15.848932
9	0.2625185	0.1928851	18.197009
10	0.2500673	0.1846531	19.498446

R Markdown File 1: Basic description of the genome formula distance metric

Mark P. Zwart and Marcelle Johnson

6-11-2023

Intro

Predicted genome formula distance values for random genome formula values

Predicted genome formula distance values for a Poisson distributed number of infection founders