Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks

Accessible chromatin regions are critical components of gene regulation but modeling them directly from sequence remains challenging, especially within plants, whose mechanisms of chromatin remodeling are less understood than in animals. We trained an existing deep‐learning architecture, DanQ, on data from 12 angiosperm species to predict the chromatin accessibility in leaf of sequence windows within and across species. We also trained DanQ on DNA methylation data from 10 angiosperms because unmethylated regions have been shown to overlap significantly with ACRs in some plants. The across‐species models have comparable or even superior performance to a model trained within species, suggesting strong conservation of chromatin mechanisms across angiosperms. Testing a maize (Zea mays L.) held‐out model on a multi‐tissue chromatin accessibility panel revealed our models are best at predicting constitutively accessible chromatin regions, with diminishing performance as cell‐type specificity increases. Using a combination of interpretation methods, we ranked JASPAR motifs by their importance to each model and saw that the TCP and AP2/ERF transcription factor (TF) families consistently ranked highly. We embedded the top three JASPAR motifs for each model at all possible positions on both strands in our sequence window and observed position‐ and strand‐specific patterns in their importance to the model. With our publicly available across‐species ‘a2z’ model it is now feasible to predict the chromatin accessibility and methylation landscape of any angiosperm genome.


INTRODUCTION
Accessible chromatin regions (ACRs) are known to play a critical role in eukaryotic gene regulation but their comprehensive identification in plants remains a challenge (Marand et al., 2017;Weber et al., 2016). Current methods to assay chromatin accessibility are highly environment specific and relatively expensive compared with DNA sequencing, limiting the number of species or conditions that can be investigated. Assaying chromatin accessibility in plants comes with additional unique challenges: the cell wall makes plant nuclei hard to isolate and many active transposon families shuffle, create, and destroy regulatory regions over time (Hirsch & Springer, 2017). Regions that lack DNA methylation are known to be stable over developmental time and overlap significantly with ACRs in plants with larger genomes (Crisp et al., 2020), suggesting they may contain a superset of ACRs across cell types. Computational models capable of predicting chromatin accessibility and methylation state directly from DNA sequence would enable a wide range of previously intractable studies on gene regulation across evolutionary time as well as estimation of noncoding variant effects for use in contexts such as breeding. Plants also provide an excellent system to study the genetic basis of adaptation (Anderson et al., 2011). Now that it is feasible to assemble genomes of thousands of species, regulatory regions that control adaptation can be identified, providing valuable insight on how to breed crops resilient to climate change. Recent advances in machine learning, particularly deep learning, have catalyzed a vast number of applications to biological prediction including RNA abundance (Agarwal & Shendure, 2020;Avsec, Agarwal, et al., 2021;Washburn et al., 2019), chromatin state (Kelley, 2020;Quang & Xie, 2016;Zhou & Troyanskaya, 2015), and transcription factor (TF) binding (Tu et al., 2020) directly from DNA sequence. Many of these models have so far only been trained within a single species to predict within the same species, usually using held-out chromosomes as a test set to control for sequence relatedness. At a high level, plant chromatin has characteristics similar to animal chromatin: chromatin is organized into hierarchical compartments, distal regulatory regions are colocalized to genes through chromatin looping, and various histone modifications signal a wide variety of local chromatin states. However, the exact mechanisms driving chromatin accessibility are known to be quite different in terms of specific histone modifications (Lu et al., 2019), pioneer factors (Yamaguchi, 2021), and chromatin looping mechanisms (Doğan & Liu, 2018). Because of these differences, plant-specific chromatin accessibility models are likely to be necessary.
We know that TF binding sites are strongly conserved across evolutionary time (Chen et al., 2018;Tu et al., 2020) and highly enriched in ACRs (Shlyueva et al., 2014). Certain deep-learning model architectures, such as convolutional

Core Ideas
• Cross-species models of chromatin state from sequence are comparable or superior to withinspecies models. • Model performance is highest on accessible regions open in many tissues. • Transcription factor motifs can be ranked by importance to each species and chromatin state.
neural networks (CNNs), have already been shown effective for predicting chromatin accessibility within species by recognizing important motifs (Quang & Xie, 2016;Zhou & Troyanskaya, 2015) and their spatial relationships (Avsec, Weilert, et al., 2021). These CNN-based architectures can accurately predict chromatin accessibility in humans Quang & Xie, 2016;Zhou & Troyanskaya, 2015) as well as in plants (Shen et al., 2021;Zhao et al., 2021). However, the vast majority of previous work has focused on improving performance within species and across cell types with little focus on across-species prediction (Kelley, 2020) or prediction within unobserved species. Previous work (Chen et al., 2018;Krützfeldt et al., 2020) has observed that CNNs require much larger training data sets than earlier model architectures to achieve equivalent or better performance. By incorporating multiple species into the training data, we not only increase the number of observations but also the total evolutionary time between observations, which reduces confounding neutral variation within conserved sequences. For the purposes of predicting regulatory regions in unobserved plant species, training a model across species will be critical to learn important motifs and syntax that are conserved across longer evolutionary time periods. Therefore, we predicted that previously published deep-learning architectures could work well across species and make accurate chromatin accessibility and methylation predictions in related unobserved species.
Here, we train DanQ (Quang & Xie, 2016) to predict chromatin accessibility in leaf using assay for transposaseaccessible chromatin with sequencing (ATAC-seq) data from 12 angiosperm species (Lu et al., 2019), comparing the performance of within-species models to across-species models. We also train DanQ to predict unmethylated regions using methylation data from 10 angiosperm species including five previously published grasses (Crisp et al., 2020). Using a maize single-cell ATAC (scATAC) accessibility atlas , we see that the accessibility model has similar performance across cell types but is highly variable across regions with different levels of cell-type specificity. Using various interpretation methods designed for CNNs, we compare and contrast which motifs were important across angiosperms for predicting chromatin accessibility in leaves or methylation state. Our publicly available pan-angiosperm chromatin state models are an important stepping stone toward a better understanding of gene regulation and adaptation.

Software environment
The software environment for the experiments was managed by conda (v4.10.3). Packages were downloaded from the conda-forge (conda-forge Community, 2015) and bioconda (Grüning et al., 2018) channels. Software versions not explicitly mentioned in the methods are defined in the conda environment files in the companion code repository on Zenodo (Wrightsman et al., 2021). All of the code and data files required to reproduce this manuscript, including all experiments and figures, are available in the associated Zenodo repository. The trained parameters for all models are also available in the associated Zenodo repository.

Raw data
The angiosperm ATAC-seq peaks (Lu et al., 2019) were downloaded from NCBI GEO accession GSE128434.  (Mascher et al., 2017) were downloaded from Ensembl Plants. The genome and annotation for asparagus (Asparagus officinalis L.) (v1.1) (Harkess et al., 2017) was downloaded from the Asparagus Genome Project website. Unmethylated regions for the grasses were downloaded from the supplemental information of Crisp et al. (2020). For the unmethylated regions, the maize AGPv4 genome and annotation was downloaded from MaizeGDB. The grapevine (Vitis vinifera L.) genome and annotation (Genoscope.12X) (The French-Italian Public Consortium for Grapevine Genome Characterization, 2007) were downloaded from the Genoscope website.

Unmethylated region calling on non-grass species
Unmethylated region analysis on the nongrass species was performed as per Crisp et al. (2020) using the data summarized in Supplemental Table S14. Briefly, sequencing reads were trimmed and quality checked using Trim galore! (0.6.4_dev), powered by cutadapt (v1.18) (Martin, 2011) and fastqc (v0.11.4). For all libraries, 20 bp were trimmed from the 5′ ends of both R1 and R2 reads and aligned with bsmap (v2.74) (Xi & Li, 2009) to the respective genomes with the following parameters: -v 5 to allow five mismatches, -r 0 to report only unique mapping pairs, and -p 1 and -q 20 to allow quality trimming to Q20. Output SAM files were parsed with SAMtools  fixsam, sorted, and then indexed. Picard MarkDuplicates (Broad Institute, 2019) was used to remove duplicates, BamTools filter to remove improperly paired reads, and bamUtil clipOverlap (Jun et al., 2015) to trim overlapping reads so as to only count cytosines once per sequenced molecule in a pair for paired-end reads. The methylratio.py script from bsmap was used to extract per-site methylation data summaries for each context (CH/CHG/CHH) and reads were summarized into nonoverlapping 100-bp windows tiling the genome. Whole-genome bisulfite sequencing pipelines are available on GitHub. To identify unmethylated regions, each 100-bp tile of the genome was classified into one of six domains or types-'missing data' (including 'no data' and 'no sites'), 'high CHH/RdDM', 'Heterochromatin', 'CG only', 'Unmethylated' or 'intermediate'-in preferential order as per Crisp et al. (2020).

Training data preprocessing
Interval manipulation was done using a combination of the GNU coreutils, gawk, and bedtools (Quinlan & Hall, 2010). We created our positive observations by symmetrically extending each accessible or unmethylated region from the midpoint by half of the window size (300, 600, or 1000 bp).
Our negative observations are randomly sampled from the rest of the genome not covered by the union of the resized positive observations and the original peaks. Observations were labeled as 'genic' if more than half of the range overlapped with a gene annotation, as 'proximal' if not genic and more than half of the range was within the proximal cutoff (2 kb), and as 'distal' if neither genic nor proximal. Previous work (Krützfeldt et al., 2020;Wei & Dunbrack, 2013) has shown that classifiers train best on balanced sets with an equal number of positive and negative examples but should be tested on the true class distribution to get an accurate performance estimate. Therefore, for the across-species models, we randomly sampled 6% of the observations and divided them equally between a validation and test set. For the withinspecies models, we randomly chose a hold-out chromosome to follow best practice for reducing contamination of related sequences between the training and test sets. As a heuristic to select held-out chromosomes across genome assemblies of varying contiguity, we randomly select within chromosomes that are at least a million base pairs long and have more than five positive observations. We then down sampled the remaining observations to obtain a training set for the across-species models with a balanced representation of species and target class. The Ns were encoded as vectors with equal probability assigned to each base as opposed to all zeros, which is another common practice. Sequences were extracted using BioPython (Cock et al., 2009) and pyfaidx (Shirley et al., 2015)

Training and evaluating models
The DanQ, Basset, CharPlant, and DeeperDeepSEA architectures were implemented and trained using Keras (Chollet, 2015) and TensorFlow (TensorFlow Developers, 2022). The across-species models were tested on a given species and trained on the remainder. Within-species models were tested on a held-out chromosome and trained on the other chromosomes. Because our ratio of accessible to inaccessible chromatin observations is heavily unbalanced, we focus more on the area under the precision-recall curve (auPR) to measure model performance as opposed to the more commonly reported area under the receiver operating characteristic curve. Performance metrics were measured using scikit-learn (Pedregosa et al., 2011) and curves were plotted using matplotlib (Hunter, 2007). Each model was trained three times to obtain an estimate of variability in performance because of the stochastic nature of the model variable initialization. For comparison between models, we used the first of the three trained models.
The bag-of-k-mers model was trained and tested independently on the within-species maize accessibility and methylation training data using code adapted from Tu et al. (2020) and compared with the within-species maize accessibility and methylation models. The Basset, CharPlant, and DeeperDeepSEA architectures were trained in the same manner as the across-species DanQ configuration with the same window size of 600 bp. For the two-step masked model comparison, we masked the maize-held-out accessibility and methylation model predictions to zero if more than half of a region overlapped with an annotated repeat from Repeat-Masker. We used pybedtools (Dale et al., 2011) to compute overlaps between the test set and the repeats. We preprocessed the yeast and human cell line ATAC-seq peaks in the same manner as the angiosperm ATAC-seq peaks and used the maize-held-out model to make predictions on the yeast and human peaks. The grasses accessibility model was trained and evaluated in the same manner as the across-species angiosperm accessibility model but restricted to only grass species. The 'balDist' accessibility model extended the training data balancing to distance class in addition to chromatin state, meaning the training data had equal representation for each species, distance class (genic, proximal, distal), and target class (accessible or inaccessible or unmethylated or methylated). The 'exp' accessibility model changed the activation function on the convolutional layer from rectified linear unit to exponential. The 'all_v_AtZm' accessibility model was tested on Arabidopsis and maize and trained on the rest of the angiosperm species. All trained model weights are available on Zenodo (Wrightsman et al., 2021).
The dendrogram in Figure 1 was plotted using the Phylo package of Biopython (Talevich et al., 2012).

Analysis of maize scATAC-seq data
The scATAC-seq peaks were preprocessed in the same manner as the other peaks to generate uniform 600-bp regions. Peaks were classified as open in a cell type if their counts per million (a normalized depth measurement) value was greater than log 2 5 in that cell type, which would represent no reads observed in that peak in that cell type, based on the methods reported in Marand et al. (2021). Accessibility was predicted using the maize-held-out model.

TF-MoDISco and k-mer occlusion
We ran TF-MoDISco (Shrikumar et al., 2020) with a sliding window size of 15 bp, a flank size of 5 bp, and a target seqlet false discovery rate of 0.15. For converting seqlets to patterns, we set 'trim_to_window_size' to 15 bp, 'ini-tial_flank_to_add' to 5 bp and specified a final minimum cluster size of 60.
The k-mer-occlusion method involves masking (replacing with Ns) a sliding k-mer across each sequence in a given model's test set. The difference between the model's masked and unmasked prediction is the k-mer's 'effect size'. We ran the k-mer-occlusion method with a k-mer size of 10 bp on all F I G U R E 1 Performance of the across-species chromatin state classifiers. The top middle and top right show the mean and standard error (due to variability in the stochastic model training process) of the area under the precision-recall curve (auPR) for the accessibility and methylation models, respectively, per species for both the within-and across-species training configurations. The bottom left is the precision-recall curve across all hold-out species for the across-species models split by distance class and chromatin feature. The bottom middle and bottom right are the precision-recall curves for the across-species accessibility and methylation models, respectively, split by species. The auPR is shown in parentheses within the figure legends species and chromatin feature pairs. The top 5% accessibilityor methylation-reducing k-mers per species and chromatin feature were classified as 'high-effect' k-mers. We performed an all-by-all global alignment of the high-effect k-mers per species and chromatin feature using Biopython's pairwise aligner (Cock et al., 2009). Using the alignment distance matrix, we clustered these high-effect k-mers into 100 representative k-mers using k-medoids (Bauckhage, 2015). We took the 100 medoid k-mers for each species and chromatin feature pair and did another all-by-all global alignment to create another distance matrix. The embedded k-mers coordinates were created using the MDS function in scikit-learn's manifold package. High-effect k-mers were matched to JAS-PAR 2020 CORE plantae motifs using FIMO (Grant et al., 2011) and a q-value threshold of 0.05.

Positional global importance analysis
Global importance analysis (GIA) (Koo et al., 2021) measures the average difference in model predictions from a sampled background set of sequences to the same set with the sequence embedded within them. We ran a positional GIA (pGIA) analysis for each species and chromatin feature pair by embedding the consensus motifs of the 530 JASPAR 2020 CORE plantae TFs in both orientations at each possible position within 1,000 generated 600-bp sequences. The 600-bp sequences were generated using a profile model, where bases were sampled at each position according to their relative frequency in the model's test set at that position. The GNU parallel (Tange, 2018) was used to speed up the pGIA analysis. JASPAR motifs were ranked by their maximum global importance across all positions. The TF families and classes were obtained from the JASPAR API (v1).

Recurrent CNNs accurately model chromatin state across species
To train a successful chromatin state classifier, we needed to choose a window size that balanced genomic context with resolution. We tested a few different model configurations and decided upon 600-bp windows because higher window sizes showed diminishing returns on performance on our validation set while decreasing our effective resolution (Supplemental Figure S1). We preprocessed the ATAC-seq and unmethylated peaks by taking the midpoint and symmetrically extending to half the window size in both directions to obtain our positive observations. Negatives were sampled from the rest of the genome. After preprocessing, we had 26,280 training regions per species (315,360 total) for the across-species accessibility models and 35,652 training regions per species (356,520 total) for the methylation models split evenly between classes.
As a baseline for comparison to previous, within-species, chromatin state CNN models as well as our across-species models, we trained within-species DanQ model configurations for each of the angiosperm species in our data. We also trained across-species model configurations each using a different species as a test set. Generally, we observed that a given across-species model has a comparable, if not superior, auPR to the within-species model (Figure 1, top middle and top right). Although the across-species accessibility model auPR and areas under the receiver-operating characteristic curve vary substantially (Figure 1, bottom middle and bottom right; Supplemental Figure S2), they are also within the range of those observed in the original DanQ and DeepSEA human models and superior to the bag-of-kmers model within maize (Supplemental Figure S3). We also see that both within-species and across-species performance decreases as genome size increases (Supplemental Figure S4). When comparing the accessibility and methylation models, we see the same trends in performance for each species. To assess whether recurrent CNNs were a better architecture choice for across-species accessibility models over standard CNNs, we trained across-species accessibility configurations of the Basset, CharPlant, and DeeperDeepSEA architectures. We observed that DanQ was a superior architecture for acrossspecies accessibility modeling for almost all hold-out species (Supplemental Figure S5).
To see if the models were more accurate in predicting accessible or unmethylated regions near or within genes, where these regions are known to be enriched, we looked at the precision-recall curves across different distance classes (genic, proximal, or distal). Observations were labeled as genic if more than half of the range overlapped with a gene annotation, as proximal if not genic and more than half of the range was within the proximal cutoff (2 kb), and as distal if neither genic nor proximal. We see that the across-species models for both chromatin features perform the worst on distal regions but show contrasting results on the genic and proximal regions (Figure 1, bottom left). This could be driven by the imbalanced distribution of regions between the distance classes, with accessible regions biased toward the proximal class and unmethylated regions toward the genic class (Supplemental Figure S6). In particular, barley has proportionally many more distal accessible and unmethylated regions, which could explain the lower overall performance. The across-species accessibility models are very precise when calling inaccessible chromatin, with most of the errors being false positives, particularly in distal regions (Supplemental Figure S7). We see a much different result in the methylation model, which shows only a slight bias toward false positives.
To control for potential trans-driven transposon silencing, we tested a two-step model that takes the predictions of the a2z model and then masks them with zeros if they overlap annotated transposons in maize. We see that these two-step repeat-masked models do much better (ΔauPR 0.15 for accessibility and 0.07 for methylation) than the naive models (Supplemental Figure S8), suggesting a relatively straightforward way to reduce false positives in larger plant genomes with more transposon-derived sequence.
Finally, we wanted to assess how far out in evolutionary time the angiosperm model could work. We ran the model against ATAC-seq data from yeast and a human GM12878 cell line (Schep et al., 2015). We see the plant-trained model has some ability (Supplemental Figure S9) to predict chromatin accessibility in yeast (auPR 0.21), if not human cell lines (auPR 0.02).

Leaf-trained models struggle to predict cell-type-specific ACRs
Knowing the a2z models are capable of working across species, we then asked how well the leaf-trained accessibility models could work across cell types. We used scATAC-seq data from six maize organs  as a multiple cell type test set for our single-tissue model. Using a model trained on every species with ATAC-seq data except maize, we predicted the accessibility of each scATAC peak as well as negatives sampled from the rest of the genome. Looking at the area under the threshold-recall curve, we see that the model does better on peaks that are accessible across many cell types, with a sharp decrease in peaks only accessible in five or fewer cell types, which are likely to be a mix of false positives and highly cell-type-specific peaks (Figure 2, left). The model does best on peaks that are generally open across many cell types, which comprise the largest portion of the training data (Supplemental Figure S10). This is clearly shown when looking at the overall precision-recall curves in the best (guard cell) and worst (trichoblast) cell types as well as a union of all cell types. There is not a substantial difference between the three (Figure 2, right).

Interpretation methods reveal important conserved and species-specific motifs
Although chromatin state models that work across angiosperms are a useful tool, we may be able to gain new insights into chromatin biology by dissecting what motifs and higherorder motif patterns the model is learning to use to separate accessible from inaccessible chromatin or unmethylated from methylated regions. We started with the attribution tool TF-MoDISco to identify important motifs in the maize and Arabidopsis test sets using their respective held-out models. Although TF-MoDISco qualitatively identified many important motifs (Supplemental Figure S11), most of them ranked similarly by attribution score and therefore could not be quantitatively compared in terms of effect size or importance relative to each other.
To obtain better estimates of sequence-effect size, we developed a method that masks sliding windows across a set of sequences and evaluates the change in the model prediction, which we refer to as the k-mer occlusion method. Using a k-mer size of 10 bp, representing a common estimate of core binding site length, we ran a k-mer occlusion to get effect sizes for each k-mer in the test set, binned k-mers into 'higheffect' and 'null-effect', and then scanned them for matches to JASPAR 2020 CORE plantae (Fornes et al., 2019) binding motifs. For our accessibility models, we see that approximately 20-40% of high-effect k-mers match with JASPAR motifs, whereas our methylation models generally seem to have poor matching between JASPAR motifs and high-effect k-mers (Supplemental Figure S12). To look at how similar the high-effect k-mers were between chromatin features and species, we used k-medoids to get a subset of representative k-mers and then visualized the distances between them using multidimensional scaling. Surprisingly, the high-effect k-mers across species and chromatin features cluster together, with slight separation between methylation and accessibility (Figure 3, left). However, there is no separation between species (Supplemental Figure S13) nor monocots and dicots (Figure 3, middle, right) for either chromatin feature.
To understand which known biological motifs were being recognized as important to the model, we used a recently developed model interpretation method known as GIA (Koo et al., 2021). First, we ranked JASPAR motifs by their maximum global importance across all positions for each model (Table 1) and saw both species-specific and common TFs across the models. One of the most remarkable observations is that the top 10 motifs in the Arabidopsis model are all from the TCP family. The maize accessibility model also ranked TCP motifs in the top 10 but behind Dof-type motifs. The Arabidopsis and maize methylation models rank the same two motifs at the top and share mostly the same families between the rest. Next, we looked at the positional effects of the top three TFs across Arabidopsis accessibility (Figure 4, top left) and methylation (bottom left) as well as maize accessibility (top right) and methylation (bottom right). The most striking feature is the sawtooth pattern seen across both species and chromatin feature models; however, the cause of this pattern is unclear. The Arabidopsis accessibility model shows a clear bias toward the center of the accessible regions for the top three TFs, whereas the other models are not as consistent.

DISCUSSION
We have shown that recurrent CNNs, DanQ in particular, are an effective architecture on which to base acrossspecies sequence to chromatin state models. By incorporating sequence data from multiple species, we not only increase the size of our training data set, a critical factor for deep-learning models, but also reduce the amount of confounding neutral variation around functional motifs. Being able to predict chromatin state across species also opens the door for studies of regulatory regions in additional angiosperm species with only genomic sequence data. Beyond angiosperms, the a2z model's predictive ability in yeast suggests it is capable of working effectively across wide evolutionary timescales. Unsurprisingly, we noticed that the performance across different peak classes relates to their relative abundance in the training set. Future work looking at ways to balance or weight observations in rarer peak classes would likely improve the generalizability of the models. This is particularly important for working toward better across-tissue chromatin state models, where the tissue-specific peaks are usually the minority in any given data set, as well as with larger genomes, where distal peaks are more prevalent. Further, most sequence-based model architectures, including DanQ, only take in cis sequence, which is known (Xiao & Wagner, 2015) to account for only a portion of the variation in local chromatin state. Model architectures that can effectively incorporate trans factors, such as chromatin-remodeling TFs on neighboring regulatory elements (Taberl et al., 2011) or small RNA silencing (Ito, 2011), will likely surpass current methods but their across-species applicability remains an open question. By far, the most prevalent error of the accessibility models in particular is calling false positives, which may be due to lack of trans information. A portion of these false positives may also be under called ATAC-seq peaks that are open in very specific cell types, because the peaks from Lu et al. (2019) were called with relatively conservative thresholds.
Interpreting deep-learning models remains a challenge but is an especially critical one to overcome. Here we use occlusion-and perturbation-based methods instead of gradient-based approaches like TF-MoDISco and saliency F I G U R E 4 Positional global importance analysis plots for Arabidopsis (left) and maize (right) accessibility (top) and methylation (bottom). The solid and dotted lines represent the importance scores for the positive and negative strand, respectively. Only the top three JASPAR motifs ranked by the maximum global importance across the sequence were plotted maps to trade longer computational times for reduced noise (Kim et al., 2019) in effect estimates. Particularly because eukaryotic TF binding sites are known to be degenerate (Stewart et al., 2012), point mutation effect sizes in regulatory sequences are likely to be small and harder to estimate accurately with our limited data. The lack of separation between clades and species in the multidimensional scaling plots for each chromatin feature is not too surprising. The across-species models must learn to prioritize motifs that are generalizable across species and so potential species-or cladespecific motifs are ignored. The sawtooth pattern, which is stronger in some TFs than in others, could be a manifestation of the model learning a helical face bias for specific TF binding. Further controls will be necessary to investigate that hypothesis, as the pattern may also be an artifact of the maximum pooling or long short-term memory layers. Not all of the pGIA results agree with current theory. For example, some of the motifs have a noticeable strand bias, but enhancers are known to operate in an orientation-independent (Arnold et al., 2013) manner. Given some of them are relatively simple motifs, it is possible that these matches are surrogates for important nonbinding motifs. We chose to rank JASPAR motifs by maximum global importance across the sequence as a rough estimate for importance to regulating the given chromatin feature state, though other methods of ranking could be preferable depending on the use case. Because positive observations are created by extending from the midpoint, the effect of TFs that bind to the center of accessible or unmethylated regions will be easier to estimate because they are more aligned across the test set sequences. In contrast, TFs that bind to the edges of accessible or unmethylated regions are not aligned because the lengths of the true, unextended ATAC-seq peaks are not equal.
The top 10 JASPAR motifs are very different between the features but remarkably similar between the species within each feature. Of the two known (Jin et al., 2021;Lai et al., 2021;Tao et al., 2017) plant pioneer TFs (LEC1 and LEAFY), only LEAFY is present in JASPAR but does not show up in the top 10 motifs for any of the models. This is not unexpected, as it is a floral TF, and our models are trained on leaf accessible regions. The strong presence of the TCP family in the highly ranked accessibility TFs is promising, because they are known (Yang et al., 2020) to be involved in chromatin remodeling. What role the Dof-type TFs play in accessibility is still unclear because of the wide variety of roles they play (Noguero et al., 2013). The shared top two motifs between the methylation models have evidence that they are involved in plant pathogen response (Bethke et al., 2009;Ou et al., 2011). Knowing that plant immunity genes are among the most variable (Van de Weyer et al., 2019), it would be interesting to see if these unmethylated regions are harboring a large library of rapidly inducible resistance genes that remain mostly inaccessible until needed. With the high similarity in binding motifs by definition within families, it is quite possible that some highly ranked TFs are false positives because of association with the few causal TFs in the same family. Although it is useful to use JASPAR motifs as specific testable hypotheses, there are only 530 motifs in the database, and with the lowest estimates of angiosperm TF gene count starting at ∼1,500 (Lang et al., 2010), critical TFs may still be missing.
Moving forward, more focus is necessary on collecting high-quality accessible regions across a variety of cell types to train models that are capable of simultaneously generalizing across both tissues and species. Lessons learned from successful across-species and across-tissue chromatin state models could be applied to improve more task-specific sequence models such as enhancer prediction (Min et al., 2017) or promoter-enhancer contact prediction (Li et al., 2019). With the release of highly accurate protein-folding models, such as AlphaFold2 (Jumper et al., 2021), the missing species-specific TF binding motifs in any genome may finally be feasible to estimate using simulated DNA docking approaches. Now that many deep-learning-based approaches borrowed from other fields (Avsec, Agarwal, et al., 2021;Tu et al., 2020) have been shown to be successful in mapping genomic sequence to a variety of cellular phenotypes, better interpretation methods to assess what these black-box models are learning will be important to optimize toward more biologically relevant architectures.

A C K N O W L E D G M E N T S
This work was funded by an NSF Graduate Research Fellowship (DGE-1650441) and the USDA-ARS to T.W., an NSF Postdoctoral Fellowship in Biology (DBI-1905869) to A.P.M., an Australian Research Council (ARC) Discovery Early Career Award (DE200101748) to P.A.C., the NSF IOS-1934384 to N.M.S., and the USDA-ARS to E.S.B. The Texas Advanced Computing Center supported a portion of the compute time for the analyses with their Frontera system. Peter Koo contributed helpful comments during the analyses.

C O N F L I C T O F I N T E R E S T
The authors declare no conflict of interest.