Cluster: Infer alternative structures by clustering reads’ mutations
Cluster: Input files
Cluster input file: Mask report
You can give any number of Mask report files as inputs for the Cluster step. See List Input Files for ways to list multiple files.
Cluster all masked reads in out:
seismic cluster out
Cluster reads from sample-1 masked over reference ref-1, section abc:
seismic cluster out/sample-1/mask/ref-1/abc
Cluster: Settings
Cluster setting: Maximum order (number of clusters)
To infer alternative RNA structures, SEISMIC-RNA uses an optimized version of our original DREEM algorithm [Tomezsko et al. (2020)], which is a type of expectation-maximization (EM). All EM algorithms need the order of clustering (i.e. number of clusters) to be prespecified; however, the optimal order is unknown before the algorithm runs, creating a chicken-and-egg problem.
SEISMIC-RNA solves this problem by first running the EM algorithm at order 1, then order 2, then 3, and so on until one of two limits is reached:
The Bayesian information criterion (BIC) worsens upon increasing the order.
The maximum order is reached. You can set this limit using --max-clusters (-k). If you run the entire workflow using seismic wf (see Workflow: Run all steps), then the maximum order defaults to 0 (which disables clustering). If you run the Cluster step individually using seismic cluster, then the maximum order defaults to 2 (the minimum non-trivial number).
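The stopping rule above can be sketched in Python as follows. This is a minimal illustration, not SEISMIC-RNA's actual code: the function select_order, the callback bic_of_order, and the BIC values are all hypothetical and exist only for this example.

```python
from typing import Callable, Optional


def select_order(bic_of_order: Callable[[int], float],
                 max_order: int) -> Optional[int]:
    """Return the order with the best (lowest) BIC, or None if disabled.

    bic_of_order is a hypothetical callback that clusters at the given
    order and returns the resulting BIC; it is not a SEISMIC-RNA function.
    """
    if max_order < 1:
        # A maximum order of 0 disables clustering.
        return None
    best_order = 1
    best_bic = bic_of_order(1)
    for order in range(2, max_order + 1):
        bic = bic_of_order(order)
        if bic > best_bic:
            # Limit 1: the BIC worsened (increased), so stop here.
            break
        best_order, best_bic = order, bic
    # Limit 2: the loop also ends once max_order has been reached.
    return best_order


# Made-up BIC values, for illustration only.
example_bics = {1: 1050.0, 2: 980.0, 3: 995.0, 4: 900.0}
print(select_order(example_bics.__getitem__, max_order=4))
```

With these made-up values, the search stops at order 3 because its BIC is higher than that of order 2, so order 4 is never tried even though the maximum order would allow it.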
Note
If the BIC score worsens (increases) before reaching the maximum order, then clustering will stop. The report (see Cluster Report) records the maximum order you specified (field “Maximum Number of Clusters”) and the order that yielded the best BIC (field “Optimal Number of Clusters”), which is always less than or equal to the maximum order you specified.
Note
If you realize after clustering that it would have been better to have run
clustering with a higher/lower maximum order, then you can edit the results
using +addclust/+delclust (see Add/Delete Orders to/from an Already-Clustered Dataset).
Cluster setting: Expectation-maximization iterations
Expectation-maximization is an iterative algorithm, meaning that it begins by guessing an initial solution and then calculates progressively better solutions, halting once successive solutions cease changing, which is called convergence.
You can limit the minimum/maximum number of iterations per number of clusters
using --min-em-iter and --max-em-iter, respectively.
Generally, as the number of clusters increases, so does the number of iterations
required for convergence.
Thus, to treat different numbers of clusters more fairly, SEISMIC-RNA multiplies
the iteration limits by the number of clusters.
For example, if you use --max-em-iter 300, then SEISMIC-RNA will allow up to
600 iterations for 2 clusters, 900 iterations for 3 clusters, and so on.
The exception is for 1 cluster: since all reads go into the same cluster, there
is no need to iterate, so the iteration limit is always the minimum possible, 2.
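The scaling rule described above amounts to the following arithmetic. The function name max_iterations is hypothetical and chosen only for this sketch; according to the text, the minimum limit (--min-em-iter) is multiplied in the same way.

```python
def max_iterations(max_em_iter: int, n_clusters: int) -> int:
    """Iteration limit for one order, following the rule described above."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    if n_clusters == 1:
        # All reads go into the same cluster, so the limit is always 2.
        return 2
    # Otherwise, the limit is multiplied by the number of clusters.
    return max_em_iter * n_clusters


for n_clusters in (1, 2, 3):
    print(n_clusters, max_iterations(300, n_clusters))
```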
You can set the threshold for convergence with --em-thresh followed by the
minimum difference between log-likelihoods of successive iterations for the
iterations to be considered different.
For example, suppose you set the threshold to 0.1 with --em-thresh 0.1.
If iterations 38 and 39 had log-likelihoods of -7.28 and -7.17, respectively,
then the algorithm would keep going because their difference in log-likelihood
(0.11) would exceed the threshold. If iteration 40 then had a log-likelihood of
-7.08, the algorithm would consider itself converged and stop running because
the difference in log-likelihood between iterations 39 and 40 would be 0.09,
which would be below the threshold.
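The convergence test in this example can be written as a short Python sketch. The function has_converged is hypothetical, and it assumes that the difference is taken as the current log-likelihood minus the previous one, as in the example above.

```python
def has_converged(prev_log_like: float,
                  curr_log_like: float,
                  em_thresh: float) -> bool:
    """Return True if the gain in log-likelihood is below the threshold."""
    # Assumption: the difference is current minus previous log-likelihood.
    return (curr_log_like - prev_log_like) < em_thresh


# Iterations 38 and 39: the difference is about 0.11, so keep iterating.
print(has_converged(-7.28, -7.17, em_thresh=0.1))
# Iterations 39 and 40: the difference is about 0.09, so stop.
print(has_converged(-7.17, -7.08, em_thresh=0.1))
```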
Cluster setting: Expectation-maximization runs
Expectation-maximization is guaranteed to return a locally optimal solution,
but there is no guarantee that the solution will be globally optimal.
To improve the odds of finding the global optimum, SEISMIC-RNA runs EM multiple
times (by default, 6 times), each time starting at a different initial guess.
The idea is that if multiple EM runs, initialized randomly, converge on the same
solution, then that solution is probably the global optimum.
You can set the number of independent EM runs using --em-runs (-e).
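The general idea of restarting EM from several random initial guesses can be sketched as below. This is a generic pattern, not SEISMIC-RNA's implementation: best_of_runs, run_em, and toy_run are hypothetical names, and keeping the run with the highest log-likelihood is an assumption made for this sketch.

```python
import random
from typing import Any, Callable, Tuple

# Each hypothetical run returns (log-likelihood, solution).
RunResult = Tuple[float, Any]


def best_of_runs(run_em: Callable[[random.Random], RunResult],
                 em_runs: int = 6,
                 seed: int = 0) -> RunResult:
    """Run EM em_runs times and keep the run with the highest likelihood."""
    if em_runs < 1:
        raise ValueError("em_runs must be at least 1")
    seeder = random.Random(seed)
    results = []
    for _ in range(em_runs):
        # Give each run its own generator so the initial guesses differ.
        run_rng = random.Random(seeder.getrandbits(32))
        results.append(run_em(run_rng))
    # Assumption: the best run is the one with the highest log-likelihood.
    return max(results, key=lambda result: result[0])


def toy_run(rng: random.Random) -> RunResult:
    """Toy stand-in for EM that lands in one of two local optima."""
    if rng.random() < 0.5:
        return -100.0, "optimum A"
    return -120.0, "optimum B"


print(best_of_runs(toy_run, em_runs=6))
```

More runs make it more likely that at least one of them starts near the global optimum, at the cost of proportionally more computation.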
Cluster: Output files
All output files go into the directory OUT/SAMPLE/cluster/REFERENCE/SECTION.
Cluster output file: Batch of cluster memberships
Each batch of clustered reads contains a ClustBatchIO object and is saved to
the file cluster-batch-{num}.brickle, where {num} is the batch number.
See ../../data/cluster/cluster for details on the data structure.
See Brickle: Compressed Python Objects for more information on brickle files.
Cluster output file: Cluster report
SEISMIC-RNA also writes a report file, cluster-report.json, that records the
settings you used for running the Cluster step and summarizes the results, such
as the number of clusters, number of iterations, and the BIC scores.
See Cluster Report for more information.
Note
You must look at the report file to determine whether your clusters come from true alternative structures or are just noise and artifacts. See Cluster: Verify clusters for how to verify that your clusters are real.
Cluster: Verify clusters
You must check whether your clusters are real or artifacts.
In your cluster report:
The number of clusters that SEISMIC-RNA found is Optimal Number of Clusters. Several important caveats exist about this number:
This number can never exceed the Maximum Number of Clusters. So if you want to know whether an RNA forms N alternative structures, the results of clustering can provide useful information only if you set the Maximum Number of Clusters to at least N.
A “cluster” is as subjective as a “conformational state”: two clusters can correspond to completely different structures at one extreme and to slightly different structures at the other. With more reads comes better ability to distinguish clusters that are more similar – the same way that, in a study examining differences between two groups, larger sample sizes would enable finding more subtle differences. Thus, the number of clusters you find will generally increase with more reads, but that doesn’t mean that your RNA actually forms more structures, just that you can resolve more subtle structural differences.
The Number of Unique Bit Vectors is the number of reads that were used for clustering; it should be about 20,000 at minimum, and ideally ≥ 30,000. If you have < 20,000 unique bit vectors, then clustering will probably not be able to find real clusters; so if the Optimal Number of Clusters is 1, then that does not mean your RNA necessarily forms only one structure.
Expectation-maximization is guaranteed to find a local optimum, but not a global optimum. SEISMIC-RNA thus runs multiple trajectories from different starting points; if the trajectories converge to the same solution, then that solution is likely (but still not necessarily) the global optimum. You must check if your trajectories converged to the same solution by checking the fields “NRMSD from Run 0” and “Correlation with Run 0” in the report. If all runs converged to identical solutions, then every NRMSD would be 0 and every Correlation would be 1. Generally, the runs are sufficiently reproducible if runs 1 and 2 have NRMSDs less than 0.05 and Correlations greater than 0.98 with respect to run 0. If not, then you have no evidence that run 0 is the global optimum for that number of clusters, so it would be best to rerun clustering using more independent runs to increase the chances of finding the global optimum.
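The reproducibility check in the last item can be expressed as a small Python sketch. The function runs_reproducible is hypothetical, and it assumes you have already copied the values for runs 1, 2, and so on (excluding run 0 itself) out of the fields “NRMSD from Run 0” and “Correlation with Run 0”; how those fields are laid out in the report file is not assumed here.

```python
from typing import Sequence


def runs_reproducible(nrmsds: Sequence[float],
                      correlations: Sequence[float],
                      max_nrmsd: float = 0.05,
                      min_correlation: float = 0.98,
                      n_check: int = 2) -> bool:
    """Check whether the first n_check runs after run 0 agree with run 0.

    nrmsds[i] and correlations[i] must describe run i + 1 versus run 0.
    """
    if len(nrmsds) != len(correlations):
        raise ValueError("need one NRMSD and one correlation per run")
    if len(nrmsds) < n_check:
        # Too few runs to judge reproducibility.
        return False
    return all(nrmsd < max_nrmsd and corr > min_correlation
               for nrmsd, corr in zip(nrmsds[:n_check],
                                      correlations[:n_check]))


# Made-up values: runs 1 and 2 agree with run 0, and run 3 is not checked.
print(runs_reproducible([0.01, 0.03, 0.20], [0.999, 0.990, 0.800]))
# Made-up values: run 2 fails both thresholds.
print(runs_reproducible([0.01, 0.08], [0.999, 0.970]))
```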
Cluster: Troubleshoot and optimize
Run Cluster with higher orders, without repeating the work already done
The tool +addclust exists for this purpose: see Command line for adding orders.
Delete unnecessary higher orders, without repeating the work already done
The tool +delclust exists for this purpose: see Command line for deleting orders.
Cluster takes too long to finish
Adjust the settings of seismic cluster:
Increase the threshold for convergence (--em-thresh). Larger thresholds will make clustering converge in fewer iterations at the cost of making the runs end at more variable solutions. Check the Log Likelihood per Run field to verify that clustering is finding the global optimum; see Cluster: Verify clusters for more information.
Decrease the number of independent runs (--em-runs/-e) to 3 or 4; don’t go below 3 for anything you intend to publish, or else you won’t be able to tell if your clustering is finding the global optimum.