apzubarev/Degree-of-nontrivial-ultrametricity-for-RNA-macrostates: Calculation of the degree of nontrivial ultrametricity for RNA macrostates Version 7
Description
Calculation of the degree of nontrivial ultrametricity for RNA macrostates. PHYSICALLY RIGOROUS APPROACH: distance between basins via spectral decomposition of the transition rate matrix (Mahalanobis distance in the space of eigenvectors of the symmetrized matrix K).
METHOD:
- A transition rate matrix K is constructed between all structures (N x N, where N ~ 2000) based on the Kramers formula.
- K is symmetrized taking into account detailed balance.
- The m smallest-magnitude eigenvalues and the corresponding eigenvectors are computed (Lanczos method for sparse matrices).
- Automatic filtering of noise modes is performed by searching for a spectral gap: if the ratio |lambda_k| / |lambda_{k-1}| exceeds a threshold (default 10^6), modes with indices < k are discarded as numerical noise.
- Each basin of attraction is represented by a characteristic vector chi_A in the space of structures.
- The distance between basins A and B is defined as the weighted Euclidean distance between projections of chi_A and chi_B onto eigenvectors (Mahalanobis distance).
- The resulting distance is a metric; the distance matrix is then tested for ultrametricity.
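The steps above can be sketched on a toy landscape. This is a minimal illustration under stated assumptions, not the repository's implementation: the rate form `k0 * exp(-(max(E_i, E_j) - E_i) / kT)`, the normalized indicator vector for chi_A, the `1/|lambda_k|` weights, and the choice of the first spectral gap are all assumptions made for the example, and the function names are hypothetical. A dense `numpy.linalg.eigh` call stands in for the sparse Lanczos solver because the example matrix is tiny.

```python
import numpy as np


def build_k_sym(energies, edges, kT=0.616, k0=1.0):
    """Symmetrized rate matrix S = Pi^(-1/2) K Pi^(1/2).

    Assumed rate i -> j: k0 * exp(-(max(E_i, E_j) - E_i) / kT), which obeys
    detailed balance with pi_i ~ exp(-E_i / kT). kT = 0.616 kcal/mol is RT at 37 C.
    Each undirected edge must appear once in `edges`.
    """
    e = np.asarray(energies, dtype=float)
    s = np.zeros((len(e), len(e)))
    for i, j in edges:
        barrier = max(e[i], e[j])
        # Off-diagonal entries of S reduce to k0 * exp(-|E_i - E_j| / (2 kT)).
        s[i, j] = s[j, i] = k0 * np.exp(-abs(e[i] - e[j]) / (2.0 * kT))
        # Diagonal entries equal those of K: minus the total escape rate.
        s[i, i] -= k0 * np.exp(-(barrier - e[i]) / kT)
        s[j, j] -= k0 * np.exp(-(barrier - e[j]) / kT)
    return s


def filter_noise_modes(eigvals, eigvecs, gap_threshold=1e6):
    """Discard modes below the first gap |lambda_k| / |lambda_{k-1}| > threshold."""
    order = np.argsort(np.abs(eigvals))
    lam = np.abs(eigvals[order])
    vec = eigvecs[:, order]
    start = 0
    for k in range(1, len(lam)):
        # Written as a product so an exactly zero eigenvalue causes no division.
        if lam[k] > gap_threshold * lam[k - 1]:
            start = k  # modes with index < k are treated as numerical noise
            break
    return lam[start:], vec[:, start:]


def basin_distance_matrix(lam, vec, basins):
    """Weighted Euclidean distance between basin projections (weights 1/|lambda_k|)."""
    coords = []
    for members in basins:
        chi = np.zeros(vec.shape[0])
        chi[list(members)] = 1.0 / np.sqrt(len(members))  # normalization is an assumption
        coords.append((vec.T @ chi) / np.sqrt(lam))
    coords = np.array(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy landscape: six structures on a path graph, three basins of two structures.
energies = [-5.0, -4.0, -1.0, -4.5, -2.0, -3.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
k_sym = build_k_sym(energies, edges)
eigvals, eigvecs = np.linalg.eigh(k_sym)  # dense stand-in for Lanczos
lam, vec = filter_noise_modes(eigvals, eigvecs)
dist = basin_distance_matrix(lam, vec, [[0, 1], [2, 3], [4, 5]])
```

In this toy case the stationary mode has an eigenvalue at round-off level, so the gap filter removes it and five modes remain. For N ~ 2000 a sparse solver such as `scipy.sparse.linalg.eigsh` would be the natural replacement for the dense call.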
HANDLING OF DISCONNECTED GRAPHS: Before the K_sym matrix is constructed, the connectivity of the structure graph is checked. If the graph contains multiple connected components, each component is processed separately: its own K_sym matrix is built, the spectral decomposition is performed, and ultrametricity is checked. Components with fewer than 3 basins are skipped. Components containing fewer than ALPHA_COMPONENT_THRESHOLD * N structures are classified as noise and excluded from the calculation of f_inter and from the final spectral analysis. IMPORTANT: The same logic (processing ALL significant components, then taking a weighted average) is applied UNIFORMLY in the main stage and in all null hypothesis testing modes (nt_shuffle, energy_shuffle, topo_shuffle), so that real and null-model results are computed identically and remain directly comparable.
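The component filtering described above might look like the following sketch. It uses only the standard library; the function names, the `basin_of` mapping, and the example value `alpha=0.2` are hypothetical, and weighting each component by its number of structures is an assumption, since the description does not state the weights.

```python
from collections import deque


def connected_components(n, edges):
    """Connected components of an undirected graph on vertices 0..n-1 (BFS)."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen = [False] * n
    comps = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        comps.append(sorted(comp))
    return comps


def significant_components(n, edges, basin_of, alpha, min_basins=3):
    """Keep components with at least alpha * n structures and at least min_basins basins."""
    kept = []
    for comp in connected_components(n, edges):
        if len(comp) < alpha * n:
            continue  # classified as noise
        if len({basin_of[v] for v in comp}) < min_basins:
            continue  # too few basins to form a triple
        kept.append(comp)
    return kept


def weighted_average(values, components):
    """Average per-component values, weighted by component size (assumed weights)."""
    sizes = [len(c) for c in components]
    return sum(v * s for v, s in zip(values, sizes)) / sum(sizes)


# Ten structures: one large component, one small component, one isolated vertex.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (6, 7), (7, 8)]
basin_of = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5]
components = connected_components(10, edges)
kept = significant_components(10, edges, basin_of, alpha=0.2)
```

Here the six-structure component passes both filters, the three-structure component is skipped because it holds only two basins, and the isolated vertex falls below the size threshold.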
STATISTICAL MODE (NUM_STAT > 1): When NUM_STAT > 1, NUM_STAT independent runs are performed for each sequence with different random samples of structures (seed varies: RANDOM_SEED, RANDOM_SEED+1, ..., RANDOM_SEED+NUM_STAT-1). Runs are executed IN PARALLEL via multiprocessing.Pool to maximize computational resource utilization. Results are averaged, and the final table displays mean values and standard deviations (mean +/- std). Integer quantities (number of structures, basins, connected components) are rounded to integers.
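A minimal sketch of this seed schedule and aggregation is shown below. `single_run` is a hypothetical stand-in for one full pipeline run and returns made-up numbers; the use of the sample standard deviation and the rounding of both mean and std for integer quantities are assumptions, and `processes=1` is added only so the sketch can run without a worker pool.

```python
import random
import statistics
from multiprocessing import Pool


def single_run(seed):
    """Hypothetical stand-in for one pipeline run; values are synthetic."""
    rng = random.Random(seed)
    return {"n_basins": rng.randint(80, 120), "ultrametricity": rng.uniform(0.6, 0.8)}


def run_statistics(random_seed, num_stat, processes=None):
    """Run num_stat seeded runs and return (seeds, {metric: (mean, std)})."""
    seeds = [random_seed + i for i in range(num_stat)]
    if processes == 1:
        results = [single_run(s) for s in seeds]  # serial fallback
    else:
        with Pool(processes) as pool:
            results = pool.map(single_run, seeds)  # one run per worker task
    summary = {}
    for key in results[0]:
        values = [r[key] for r in results]
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[key] = (statistics.mean(values), std)
    return seeds, summary


def format_entry(mean, std, integer=False):
    """Format a table cell as 'mean +/- std'; integer quantities are rounded."""
    if integer:
        return f"{round(mean)} +/- {round(std)}"
    return f"{mean:.3f} +/- {std:.3f}"
```

When a pool is used, the call should sit under `if __name__ == "__main__":` so that worker processes can import the module safely on platforms that spawn rather than fork.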
OUTPUT MODES: VERBOSE = True -- full log (steps, components, spectral analysis). VERBOSE = False -- brief log: the sequence header and parameters are printed once, then only RUN/COMPLETED lines, followed by the statistics block.
NULL HYPOTHESIS TESTING (NULL_MODEL_TYPE): Testing is performed by comparing the real system with null models that differ in their degree of "randomness". All null models are executed IN PARALLEL and process ALL significant connected components. Statistical significance is assessed via a TWO-SIDED p-value, since biological function may require either pronounced hierarchy (high ultrametricity) or its absence/specific frustration (low ultrametricity). The two-sided criterion tests the significance of the deviation of the real value from the mean of the null ensemble in both directions.
'none' : Program runs in standard mode without tests.
'full_analysis' : (RECOMMENDED) FULL MECHANISM ANALYSIS. Automatically performs TWO independent tests: 1. Energy Shuffle: the graph is preserved and energies are shuffled; shows the contribution of pure graph topology. 2. Topo Shuffle: configuration model (edge rewiring preserving vertex degrees) plus energy shuffling; shows the baseline chaos level. Outputs TWO separate summary tables, one per test.
'topo_shuffle' : CONFIGURATION MODEL. Graph edge rewiring via double_edge_swap while preserving vertex degree sequence + energy shuffling. Destroys topological correlations while preserving mobility distribution. Basins are re-determined for ALL significant components.
'energy_shuffle' : (WEAK RANDOMNESS / TOPOLOGICAL ORDER) Neighborhood graph is fully preserved (including all topological correlations), but vertex energies are randomly shuffled. Basins are re-determined for these random energies in ALL significant components. Allows isolating the contribution of PURE GRAPH TOPOLOGY to ultrametricity.
'nt_shuffle' : NUCLEOTIDE SHUFFLING (BIOLOGICAL CONTROL). Random permutations of the original RNA sequence are generated preserving nucleotide composition. For each permutation, all stages are fully re-executed: structure generation, graph construction, search for ALL significant components and basins, spectral analysis. This is the strictest test checking whether ultrametricity is due to specific nucleotide order. TIME-CONSUMING.
'random_basins' : (GEOMETRIC CONTROL) The real spectrum of the K_sym matrix is preserved, but structures are randomly partitioned into basins of the same sizes. Checks whether ultrametricity is an artifact of the geometry of the high-dimensional eigenvector space. Does not affect the graph or the energies.
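The randomization steps behind these modes, and the two-sided comparison with the null ensemble, can be illustrated with the standard library alone. This is a sketch under assumptions: all function names are hypothetical, the rewiring is a hand-written double edge swap rather than the repository's call, and the p-value uses the common empirical convention (count + 1) / (n + 1), which the description does not confirm.

```python
import random
import statistics
from collections import Counter


def energy_shuffle(energies, rng):
    """energy_shuffle: keep the graph, permute the vertex energies."""
    shuffled = list(energies)
    rng.shuffle(shuffled)
    return shuffled


def nt_shuffle(sequence, rng):
    """nt_shuffle: permute nucleotides; composition is preserved."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def degree_sequence(edges):
    """Map vertex -> degree for an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def degree_preserving_rewire(edges, n_swaps, rng, max_tries=10000):
    """topo_shuffle: swap (a-b, c-d) -> (a-d, c-b), rejecting loops and duplicates."""
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick the orientation of the second edge at random
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len({a, b, c, d}) < 4 or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def random_basins(basin_sizes, rng):
    """random_basins: random partition of structures into basins of the given sizes."""
    order = list(range(sum(basin_sizes)))
    rng.shuffle(order)
    basins, start = [], 0
    for size in basin_sizes:
        basins.append(sorted(order[start:start + size]))
        start += size
    return basins


def two_sided_p_value(real_value, null_values):
    """Share of null values at least as far from the null mean as the real value."""
    null_mean = statistics.mean(null_values)
    observed = abs(real_value - null_mean)
    extreme = sum(1 for x in null_values if abs(x - null_mean) >= observed)
    return (extreme + 1) / (len(null_values) + 1)


rng = random.Random(0)
ring = [(i, (i + 1) % 8) for i in range(8)]
rewired = degree_preserving_rewire(ring, n_swaps=10, rng=rng)
partition = random_basins([3, 3, 2], rng)
```

Because the p-value is measured from the null mean in both directions, a real value far below the null ensemble is as significant as one equally far above it.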
ENSEMBLE EXPECTATION (EXPECTATION_BY_RNA = True): If EXPECTATION_BY_RNA = True, a row "AVERAGE OVER ALL RNAs" is appended to the END OF EACH summary table (main and all null model tables), containing mean values and STD of all corresponding metrics across the entire set of sequences. This allows estimating typical values and spread across the ensemble for both real data and each null model.
ADVANTAGES:
- Accounts for all possible transition paths (via spectral decomposition).
- Context-independent (distance between A and B is determined only by them, not by presence of other basins).
- Symmetric and guaranteed to be a metric.
- Automatically filters numerical noise via spectral gap search.
- Correctly handles disconnected structure graphs.
- Computational complexity O(mNE + K^2*m), where K here denotes the number of basins (not the rate matrix K); N ~ 2000 structures and K ~ 100 basins are processed in seconds.
- Full parallelization of main stage and all null models.
- Unified methodology for component processing in all modes.
- Two-sided statistical significance testing.
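The metric and ultrametric properties referred to in this list can be checked directly on a distance matrix. The sketch below is a generic test, not the repository's definition of the degree of nontrivial ultrametricity: it counts the triples whose two largest pairwise distances agree within a relative tolerance, which is equivalent to the strong triangle inequality up to that tolerance. The function names and the tolerance value are assumptions.

```python
from itertools import combinations


def is_metric(dist, tol=1e-9):
    """Check zero diagonal, non-negativity, symmetry and the triangle inequality."""
    n = len(dist)
    for i in range(n):
        if abs(dist[i][i]) > tol:
            return False
        for j in range(n):
            if dist[i][j] < -tol or abs(dist[i][j] - dist[j][i]) > tol:
                return False
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if dist[i][j] > dist[i][k] + dist[k][j] + tol:
                    return False
    return True


def ultrametric_fraction(dist, rel_tol=0.05):
    """Fraction of triples whose two largest distances differ by at most rel_tol."""
    total = ok = 0
    for i, j, k in combinations(range(len(dist)), 3):
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        total += 1
        if largest - middle <= rel_tol * largest:
            ok += 1
    return ok / total if total else float("nan")


# Distances read off a two-level tree ((0,1),(2,3)): exactly ultrametric.
tree = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 2],
    [5, 5, 2, 0],
]
# Four equally spaced points on a line: a metric, but no triple is ultrametric.
line = [[abs(i - j) for j in range(4)] for i in range(4)]
```

Both example matrices are metrics, yet the tree distances score 1.0 and the line distances score 0.0, which shows why a metric still has to be tested for ultrametricity.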
STRUCTURE GENERATION MODE: Stochastic sampling (pbacktrack) from Gibbs distribution.
Dependencies: pip install viennarna numpy scipy biopython
Files
| Name | Size | MD5 |
|---|---|---|
| apzubarev/Degree-of-nontrivial-ultrametricity-for-RNA-macrostates-Version_7.zip | 27.6 kB | 789a26cf73da492be1c8d1f3f4ae963a |
Additional details
Related works
- Is supplement to (Software): https://github.com/apzubarev/Degree-of-nontrivial-ultrametricity-for-RNA-macrostates/tree/Version_7