Context of deletions and insertions in human coding sequences

We studied the dependence of the rate of short deletions and insertions on their contexts using the data on mutations within coding exons at 19 human loci that cause mendelian diseases. We confirm that periodic sequences consisting of three to five or more nucleotides are mutagenic. Mutability of sequences with strongly biased nucleotide composition is also elevated, even when mutations within homonucleotide runs longer than three nucleotides are ignored. In contrast, no elevated mutation rates have been detected for imperfect direct or inverted repeats. Among known candidate contexts, the indel context GTAAGT and regions with purine‐pyrimidine imbalance between the two DNA strands are mutagenic in our sample, and many others are not mutagenic. Data on mutation hot spots suggest two novel contexts that increase the deletion rate. Comprehensive analysis of mutability of all possible contexts of lengths four, six, and eight indicates a substantially elevated deletion rate within YYYTG and similar sequences, which is one of the two contexts revealed by the hot spots. Possible contexts that increase the insertion rate (AT(A/C)(A/C)GCC and TACCRC) and decrease deletion (TATCGC) or insertion (GCGG) rates have also been identified. Two‐thirds of deletions remove a repeat, and over 80% of insertions create a repeat, i.e., they are duplications. Hum Mutat 23:177–185, 2004. Published 2003 Wiley‐Liss, Inc.


INTRODUCTION
Mutability varies substantially along nucleotide sequences. At some extremely mutable sites, mutation rates exceed the average per-site rate by an order of magnitude or more. However, such mutation hot spots [Benzer, 1961; Coulondre et al., 1978] are rare in human coding sequences [Kondrashov, 2003], with the only exception being substitution hot spots at methylated CpG sites [Cooper and Youssoufian, 1988]. Thus, at the majority of sites, local mutation rates deviate from the average by no more than a factor of two to five.
Properties of sites with unusually high (or low) mutation rates can shed light on the mechanisms of spontaneous mutation [Miller, 1983; Horsfall et al., 1990; Boulikas, 1992; Dogliotti et al., 1998; Rogozin et al., 2001a; Rogozin and Pavlov, 2003; Maki, 2002]. For example, comparison of mutational hot spots at the human APC locus with the error spectrum of DNA polymerase β suggests that at least some mutations at this locus are caused by errors of this polymerase [Muniappan and Thilly, 2002]. A similar comparison suggests that DNA polymerase η is involved in somatic hypermutation of mammalian immunoglobulin genes [Rogozin et al., 2001b; Pavlov et al., 2002].
Here, we analyze local contexts of deletions and insertions in coding regions of 19 human loci that cause mendelian diseases. We consider only deletions and insertions, because such mutations, at least when causing a frameshift, always lead to loss-of-function phenotypes. In contrast, phenotypic ascertainment of nucleotide substitutions is incomplete and involves unavoidable biases, which obscures patterns in mutation.

MATERIALS AND METHODS
Hot Spots
We regarded as a hot spot each site at which a particular deletion or insertion has been found at least three times. This threshold was obtained using the CLUSTERM program [Glazko et al., 1998; Rogozin et al., 2001a]. With our samples, the local mutation rate at a hot spot so defined is at least 10⁻⁸, which exceeds the average per nucleotide rate of deletions or insertions by factors of 20 or 50, respectively [Kondrashov, 2003].

Analysis of the Impact of a Context
A context may contain sites of two types: 1) those where mutations are taken into account, denoted by uppercase letters (a context must contain at least one such site); and 2) those which only determine whether the context is present at a particular location, denoted by lowercase letters. For example, a context aTGc is present (exactly) if and only if the sequence contains the segment "...atgc..."; however, only mutations that affect the two central nucleotides of such segments will be taken into account.
For a context, we calculated n+ and n-, the numbers of nucleotides in the coding exons of a locus that belong and do not belong to it (only to uppercase sites), and d+ and d- (i+ and i-), the numbers of deletions (insertions) that occurred within and outside of the uppercase sites of the context. A nucleotide was considered as belonging to the context when it was covered by the context on either DNA strand. When the exact position of a mutation was uncertain (for example, a mutation that transforms "...atgta..." into "...ata..." can be a deletion of either tg or gt), each possible position was included with the weight 1/q, where q is the total number of possible positions for the mutation. For a deletion, every deleted nucleotide was considered as a site where the deletion occurred. For an insertion, both nucleotides that flank the inserted sequence were considered as sites of the insertion.
The impact of the context on the per nucleotide deletion rate at the mth locus was described by the ratio of the densities of deletions within and outside the context, Rm = (d+/n+)/(d-/n-) (loci at which n+ = 0 were treated as missing data; for reasonable contexts, d- and n- are always nonzero). After this, the average impact, I, and its standard error, E, were calculated for the set of Rm values corresponding to the 19 loci. Insertions were treated analogously.
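The site bookkeeping described above can be illustrated with a minimal Python sketch. It is not the authors' C program; the function names (`context_sites`, `deletion_positions`, `position_weights`) are hypothetical, matching is exact (no deviations allowed), and coordinates are 0-based.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def context_sites(seq, context):
    """Positions of seq covered by the uppercase sites of the context
    on either strand (exact matching only, an assumption of this sketch)."""
    seq = seq.lower()
    pattern = context.lower()
    upper = [i for i, ch in enumerate(context) if ch.isupper()]
    n, m = len(seq), len(context)
    covered = set()
    for start in range(n - m + 1):          # forward strand
        if seq[start:start + m] == pattern:
            covered.update(start + i for i in upper)
    rc = reverse_complement(seq)
    for start in range(n - m + 1):          # opposite strand
        if rc[start:start + m] == pattern:
            # position p of the reverse complement is position n-1-p of seq
            covered.update(n - 1 - (start + i) for i in upper)
    return covered

def deletion_positions(seq, start, length):
    """All start positions at which deleting `length` nucleotides yields
    the same product as deleting seq[start:start+length]."""
    seq = seq.lower()
    product = seq[:start] + seq[start + length:]
    return [s for s in range(len(seq) - length + 1)
            if seq[:s] + seq[s + length:] == product]

def position_weights(seq, start, length):
    """Weight 1/q for each of the q indistinguishable positions."""
    positions = deletion_positions(seq, start, length)
    return {s: 1 / len(positions) for s in positions}
```

For the example in the text, `position_weights("atgta", 1, 2)` assigns weight 1/2 to each of the two possible deletions (tg and gt).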
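As a worked illustration of these quantities, the following Python sketch computes Rm for one locus and then I and E over loci. The text does not state how E was obtained; the sketch assumes the usual standard error of the mean (sample standard deviation divided by the square root of the number of loci with data). The function names are hypothetical.

```python
from math import sqrt

def impact_ratio(d_in, n_in, d_out, n_out):
    """Rm = (d+/n+)/(d-/n-); returns None (missing data) when n+ is 0."""
    if n_in == 0:
        return None
    return (d_in / n_in) / (d_out / n_out)

def average_impact(ratios):
    """Mean I and standard error E over loci, skipping missing values.
    Assumes at least two loci have data."""
    values = [r for r in ratios if r is not None]
    k = len(values)
    mean = sum(values) / k
    variance = sum((r - mean) ** 2 for r in values) / (k - 1)
    return mean, sqrt(variance / k)
```

For instance, a locus with 4 deletions among 100 context nucleotides and 10 deletions among 1,000 other nucleotides gives Rm of about 4.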
Nucleotides that belong to mutagenic periodic sequences (i.e., homonucleotide runs longer than three nucleotides, sequences in which a segment of length two is present more than two times in tandem, or sequences in which a segment of length three, four, or five is present at least twice in tandem; see below) were ignored, together with mutations at these sequences, when other contexts were investigated. In some cases, only subsets of mutations (e.g., only deletions of length one) were analyzed. An ad hoc C program performing the analyses is available at ftp://ftp.ncbi.nih.gov/pub/kondrashov/context.
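The definition of mutagenic periodic sequences given above can be sketched in Python as follows. This is an illustrative reading of the definition, assuming that repeated segments are in tandem; the names `MIN_COPIES` and `periodic_sites` are hypothetical.

```python
# Minimum number of tandem copies for each period, as defined in the text:
# runs of >= 4 for period one, >= 3 copies for period two,
# >= 2 copies for periods three to five.
MIN_COPIES = {1: 4, 2: 3, 3: 2, 4: 2, 5: 2}

def periodic_sites(seq):
    """Return 0-based positions that lie within a mutagenic periodic sequence."""
    seq = seq.lower()
    covered = set()
    for period, copies in MIN_COPIES.items():
        span = period * copies
        for start in range(len(seq) - span + 1):
            unit = seq[start:start + period]
            # A window qualifies if it is `copies` tandem repeats of its unit.
            if seq[start:start + span] == unit * copies:
                covered.update(range(start, start + span))
    return covered
```

Overlapping windows are merged through the set union, so a longer run is covered in full.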

Choice of Potentially Important Contexts
The analysis described above tests the impact of a particular context on the mutation rate. We identified contexts to be tested in four ways. The first two ways rely on the existing data on spontaneous mutation, and the other two ways do not use any preexisting information.
First, we considered known ARMs [Gordenin and Resnick, 1998], all of which are relational contexts. Second, we tested textual contexts known or suspected to affect mutation in other datasets. This information was collected from the literature (cited below) and from the compilation of recombination signals and mutational hot spots (ftp.bionet.nsc.ru/pub/biology/dbms/RE-COMB.ZIP).
Third, we looked for common contexts in mutation hot spots using the MEME [Grundy et al., 1996] and REGRT [Berikov and Rogozin, 1999] programs. Fourth, we identified potential contexts automatically. This was done as follows. First, we tested the impact on mutability of all possible 4^L contexts of length L (L = 4, 6, or 8). For each such context, all sequences that deviate from it by no more than k nucleotides were treated as belonging to this context. After this, we selected a small fraction of the most (or the least) mutable contexts, and performed their classification using single-link clustering [Kondrashov and Shabalina, 2002]. For this purpose, two contexts were considered similar if and only if they differed from each other by a single substitution. For several of the most populous classes, we derived their consensus sequences and studied their impacts on mutability.
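The automatic procedure can be sketched in Python under simple assumptions: contexts are plain strings over a, c, g, t, a segment belongs to a context when it differs at no more than k positions, and single-link clustering is implemented as finding connected components of the graph in which contexts differing by one substitution are linked. All function names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def all_contexts(length):
    """Generate all 4^L contexts of the given length."""
    return ("".join(p) for p in product("acgt", repeat=length))

def belongs(segment, context, k):
    """True if the segment deviates from the context by at most k nucleotides."""
    return len(segment) == len(context) and hamming(segment, context) <= k

def single_link_clusters(contexts):
    """Group contexts into classes; two contexts are linked if and only if
    they differ by a single substitution (union-find over all pairs)."""
    contexts = list(contexts)
    parent = list(range(len(contexts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(contexts)):
        for j in range(i + 1, len(contexts)):
            if hamming(contexts[i], contexts[j]) == 1:
                parent[find(i)] = find(j)

    classes = {}
    for i, ctx in enumerate(contexts):
        classes.setdefault(find(i), []).append(ctx)
    return sorted(classes.values(), key=len, reverse=True)
```

The pairwise comparison is quadratic in the number of selected contexts, which is acceptable here because only a small fraction of contexts is retained for classification.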

Analysis of the Impact of Imperfect Direct or Inverted Repeats
It has been suggested that deletions and insertions may result from repair of short heteroduplexes formed by complementary regions within imperfect direct or inverted repeats [Ripley and Glickman, 1983; Golding and Glickman, 1985]. We attempted to detect such heteroduplex-repair mutagenesis using a modification of a Monte Carlo procedure implemented in the CONSEN program [Rogozin and Kondrashav, 1992; Rogozin and Pavlov, 2003]. A weight Wj of site j is N*M/L, where N is the number of deletions/insertions at this site that are compatible with the heteroduplex-repair mechanism, M is the number of complementary nucleotides in a potential heteroduplex (M > 4), and L is the distance between two regions of direct or inverted repeats (5 < L < 100). The average of Wj, W, was calculated for all sites in the mutation target sequence. The distribution of average statistical weights Wrandom was calculated for 10,000 groups of random sites. Each group contained a number of mutations equal to the observed one with the same distribution of mutations throughout the sites. Based on the distribution of Wrandom, the probability P(W ≤ Wrandom) was calculated.
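A minimal Python sketch of such a Monte Carlo test is given below. It is not the CONSEN implementation. It assumes that a structural score M/L has already been computed for every site of the target sequence (`site_scores`, a hypothetical input, with 0 for sites lacking a potential heteroduplex), and that `counts_by_site` maps each mutated site to its number N of compatible mutations.

```python
import random

def mean_weight(counts_by_site, site_scores):
    """Average of Wj = N * (M/L) over all sites of the target sequence."""
    total = sum(n * site_scores[s] for s, n in counts_by_site.items())
    return total / len(site_scores)

def heteroduplex_p_value(counts_by_site, site_scores, n_groups=10_000, seed=0):
    """Estimate P(W <= Wrandom) by reassigning the observed per-site
    mutation counts to randomly chosen sites."""
    rng = random.Random(seed)
    observed = mean_weight(counts_by_site, site_scores)
    counts = list(counts_by_site.values())
    sites = range(len(site_scores))
    hits = 0
    for _ in range(n_groups):
        # Same number of mutated sites and same counts, random locations.
        random_sites = rng.sample(sites, len(counts))
        w_random = mean_weight(dict(zip(random_sites, counts)), site_scores)
        if observed <= w_random:
            hits += 1
    return hits / n_groups
```

A small value indicates that mutations coincide with potential heteroduplexes more often than expected by chance; a value near 1 indicates no such association.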

RESULTS
Hot Spots
A total of 50 deletion hot spots and 10 insertion hot spots were detected at the 19 loci. Only 21 deletion hot spots occurred within periodic contexts; eight deletion hot spots occurred within yyYTG contexts, two occurred at the acACTTacaa motif, and the rest involved diverse sequences without obvious common features (Table 1). Only seven hot spots involved deletions of one nucleotide, and deletions of length four were responsible for 16 hot spots. In contrast, deletions of one nucleotide were five times more common than deletions of four nucleotides among all deletions in human coding sequences (see Fig. 5 of Kondrashov [2003]).
Most of the insertion hot spots were located within periodic sequences, and most of the corresponding insertions were only one nucleotide long (Table 1). The difference between the prevalences of periodic sequences in deletion vs. insertion hot spots is statistically significant (by the Fisher exact test, P = 0.03).
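The comparison can be reproduced approximately with a short Python sketch of the one-sided Fisher exact test. The deletion counts (21 of 50 hot spots within periodic sequences) are stated above; the insertion count of 8 of 10 is an assumption taken from the Discussion, and the text does not say whether a one-sided or two-sided test was used. The function name is hypothetical.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """Probability of a or more in the top-left cell of the 2x2 table
    [[a, b], [c, d]] under the hypergeometric null with fixed margins."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, x) * comb(total - col1, row1 - x)
               for x in range(a, upper + 1))
    return tail / comb(total, row1)

# Rows: insertion, deletion hot spots; columns: periodic, non-periodic.
p_value = fisher_one_sided(8, 2, 21, 29)
```

Under these assumed counts the one-sided tail probability is close to the reported value of 0.03.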

Mutation at Periodic Sequences
Figures 1, 2, and 3 present data on the mutation rates in periodic sequences. Sequences with period equal to one (homonucleotide runs) are mutagenic when they are four or more nucleotides long (Fig. 1). Sequences with period equal to two are mutagenic when the number of identical segments of two nucleotides is three or more (Fig. 2; there was not enough data for insertions into such sequences). In both cases, the mutation rate grows rapidly with the number of identical segments. When the period is three nucleotides or longer, even two identical segments in tandem are mutagenic, at least for deletions, and the mutation rate increases with the length of the period (Fig. 3; in our data there were too few sequences with three or more such segments to study the dependence of the mutation rate on the number of identical segments).
For all mutagenic periodic sequences (i.e., for those of length ≥4 with a repeated segment of length one, or of length ≥6 with a repeated segment of length two, or with at least two repeated segments of length ≥3), their average impacts on the rates of deletion and insertion were 2.27 ± 0.16 and 2.01 ± 0.25, respectively. Over one-third of all deletions (628), and over 60% (236) of all insertions occur within such periodic sequences.

Mutation of Sequences With Biased Nucleotide Composition
Even when we ignore mutations within homonucleotide runs longer than three nucleotides, which are mutagenic per se, short sequences that mostly consist of just one nucleotide have elevated mutation rates. For example, the impacts of sequences of length six with five identical nucleotides on the rates of deletion and insertion are 2.48 ± 0.41 and 2.84 ± 1.44, respectively. For sequences of length eight with six or seven identical nucleotides, the corresponding impacts on the rates of deletion and insertion are 2.99 ± 0.55 and 2.30 ± 0.86, respectively.

Mutation Within Imperfect Direct or Inverted Repeats
We did not observe an increased mutation rate at imperfect direct or inverted repeats. For mutations that can be interpreted as products of heteroduplex repair events, P(W < Wrandom) varied between 0.12 and 0.96. Thus, the observed co-occurrence of deletions/insertions and repeats was not statistically significant. Table 2 lists two known textual contexts that were found to increase the deletion rate, as well as some other previously studied contexts which were not significantly mutagenic in our dataset.

Mutation at Textual Contexts
Screening of all contexts of length eight (with k = 2) reveals 59 contexts with high deletion rates, each of which had I > 2.5 and I - 2*E > 1.0 (these conditions ensure that the context increases the deletion rate substantially, and that this increase is statistically significant, P < 0.05). Classification of these contexts produces 31 classes, three of which each contain more than five members (Table 3). We can see that all these classes contain, in three different phases, essentially the same context, which also appears in eight deletion hot spots (Table 1). If, as suggested by the hot spots, we define this context as yyYTG (or CARrr in the opposite strand) and allow one deviation from the exact context (k = 1), its impacts on deletion and insertion rates are 3.19 ± 0.72 and 1.18 ± 0.33, respectively. If we define this context as cyCTGt (k = 1), its impacts on deletion and insertion rates are 2.24 ± 0.42 and 1.36 ± 0.37, respectively. Screening of all contexts of lengths four (with k = 0) and six (with k = 1) revealed the same mutable context (data not reported). Essentially the same context has also been found by the MEME and REGRT programs. However, none of the other predictions made by these programs on the basis of hot spots was confirmed when the complete gene sequences were taken into account (data not reported).
Screening of all contexts of length six (with k = 1) revealed 28 contexts with low deletion rates, each of which had I < 0.5 and I - 2*E < 1.0. Their classification produced 24 classes, 23 with one context each, and one with five contexts. The impacts of the consensus sequence of this largest class, TATCGC (k = 1), on deletion and insertion rates are 0.24 ± 0.087 and 2.62 ± 0.97, respectively. Screening of all contexts of lengths eight and four did not reveal additional clear-cut contexts with low deletion rates.
Screening of all contexts of length eight (with k = 2) revealed 82 contexts with high insertion rates, each of which had I > 2.5 and I - 2*E > 1.0. Their classification produced only two classes with more than three members. The impacts of the consensus sequence of the first class, AT(A/C)(A/C)GCC (k = 1), on deletion and insertion rates are 1.15 ± 0.30 and 2.66 ± 0.64, respectively. The corresponding impacts of the consensus sequence of the second class, TACCRC (k = 1), on deletion and insertion rates are 0.74 ± 0.21 and 3.36 ± 1.54, respectively. Screening of all contexts of lengths six and four did not reveal additional clear-cut contexts with high insertion rates. Screenings of all contexts of lengths four, six, and eight produced several classes of contexts with low insertion rates, whose consensus sequences shared one common motif, GCGG. The impacts of the GCGG sequence (k = 0) on rates of deletion and insertion are 0.55 ± 0.25 and 0.07 ± 0.07, respectively.
Table 1 note. The position of a hot spot is the leftmost possible position of the first deleted or inserted nucleotide, i.e., the number of the first capitalized nucleotide. Such nucleotides are also underlined. Nucleotides are numbered as in the sequence whose accession is provided.

Repeat Removals and Duplications
Among all deletions, 66% (1212) lead to removal of a repeat (''deduplication''), in the sense that the deleted sequence is identical to a sequence bordering the site of deletion. Among all insertions, 81% (311) are duplications, i.e., the inserted sequence is identical to a sequence bordering the site of insertion.
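The two definitions above translate directly into a small Python sketch. The function names are hypothetical, coordinates are 0-based, and an insertion at position `pos` is taken to lie between `seq[pos - 1]` and `seq[pos]`; these conventions are assumptions of the sketch.

```python
def deletion_removes_repeat(seq, start, length):
    """True if the deleted segment seq[start:start+length] is identical
    to the segment of the same length immediately to its left or right."""
    deleted = seq[start:start + length]
    left = seq[max(0, start - length):start]
    right = seq[start + length:start + 2 * length]
    return deleted == left or deleted == right

def insertion_is_duplication(seq, pos, inserted):
    """True if the inserted sequence is identical to the segment of the
    same length bordering the insertion site on either side."""
    k = len(inserted)
    left = seq[max(0, pos - k):pos]
    right = seq[pos:pos + k]
    return inserted == left or inserted == right
```

Near the ends of the sequence the bordering segment is shorter than the mutation and therefore never matches, so such cases are classified as non-repeats.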

DISCUSSION
Data on disease-causing deletions and insertions at autosomal dominant or X-linked loci are suitable for studying contexts of mutation. Indeed, drastic frameshift alleles of such loci must persist in the population for only a few generations, so that different patients must carry independent mutations. Even at loci causing late-onset (e.g., APC; Bjork et al. [1999]) or relatively mild (e.g., JAG1; Crosnier et al. [1999]) diseases, 50% or more of patients carry de novo mutations (see Kondrashov [2003] for review), indicating short persistence times, at least for complete loss-of-function alleles.
Our analysis confirms that sequence periodicity is mutagenic [Streisinger et al., 1966; Miller, 1983; Ripley, 1990; Gordenin and Resnick, 1998; Bebenek and Kunkel, 2000]. The impact of periodicity rapidly increases with the number and length of repeated sequence segments (Figs. 1-3). Similar results were obtained by Greenblatt et al. [1996] and Halandoga et al. [2001] for somatic mutations. However, periodicity per se does not determine the mutation rate exactly. Some periodic sequences are mutation hot spots (Table 1), but many others with the same patterns of periodicity are not, and periodic hot spots of deletions and of insertions do not overlap. On average, deletions in human coding sequences are approximately three times more common than insertions [Kondrashov, 2003]; however, within some periodic sequences, insertions are much more common than deletions. Expanding disease-causing microsatellites (CTG)n, (CGG)n, and (GAA)n are well-known examples of such sequences [Mitas, 1997; Petruska et al., 1998]. In contrast, periodic sequences that are more prone to deletions than to insertions will disappear, unless maintained by purifying selection.
Contexts containing primarily one nucleotide (e.g., AAAGACAA) are also mutagenic, even when we disregard mutagenic homonucleotide runs. This suggests that a relaxed version of Streisinger's model [Streisinger et al., 1966], allowing some deviations from exact periodicity at periodic contexts, is still applicable to spontaneous mutation in human protein-coding genes.
We did not observe any significant increase in mutation at contexts that contain inverted or direct repeats separated by 5-100 nucleotides. Thus, our data offer no support for the short heteroduplex repair model of mutation [Ripley and Glickman, 1983; Golding and Glickman, 1985].
Among the contexts known or suspected to be mutagenic, our data support only two. Contexts with R/Y imbalance between strands [Boulikas, 1992], and a motif of complex mutations (indels) GTAAGT [Chuzhanova et al., 2003a] were found to increase the deletion (but not the insertion) rate. Also, our data showed that AT-rich sequences may be marginally mutagenic.
We found two new contexts that increase the deletion rate. The more common one is yyYTG (Table 3). This context is present in eight deletion hot spots (Table 1). A similar motif ytG (hot spot of deletions of one nucleotide G) was observed in the spectra of errors produced by E. coli DNA polymerase I in vitro [Papanicolaou and Ripley, 1989]. Also, (CTG)n is prone to duplication events in several human disease-causing genes (reviewed by Mitas [1997]). Three independent observations of error-prone synthesis of CTG-containing sequences in vivo and in vitro suggest a general property of different DNA polymerases. Another deletion motif, acACTTacaa (k = 0), has been encountered only in two hot spots (Table 1). Among all the deletions at hot spots, four-nucleotide-long deletions were overrepresented (Table 1). An excess of four-nucleotide-long deletions has also been found among spontaneous mutations in the E. coli lacI gene [Schaaper et al., 1986].
We have found two new contexts that increase the insertion rate. Although statistically significant, they probably should still be treated with caution, since the amount of data on insertions was four times below that on deletions. Eight out of 10 insertion hot spots produce single-nucleotide insertions and are located within periodic sequences (Table 1). These features differentiate them from deletion hot spots and suggest that mechanisms of deletions and insertions have different context properties. This is also supported by the absence of any overlaps between deletion and insertion hot spots (Table 1).
We have also identified one context each as a deletion and an insertion cold spot, TATCGC and GCGG. Mutation cold spots may be harder to identify than hot spots, since mutations from a sample may be absent within a particular context simply by chance. However, a large number of sites in our sample of exons of 19 loci belong to our insertion cold spot sequence (tetranucleotide with no deviations allowed) or deletion cold spot sequence (hexanucleotide with one deviation allowed). Thus, these cold spot contexts may well be real.
Some mutation-affecting contexts, such as the CpG motif, which facilitates substitution [Cooper and Youssoufian, 1988], can be defined unequivocally. Often, however, a large number of similar short sequences are known to have higher (or lower) mutabilities, and the context is hard to define. Sometimes, it may be desirable to consider several related contexts [Rogozin and Pavlov, 2003]. For example, mutation hot spots associated with somatic hypermutation in immunoglobulin genes have been reported as rGy(a/t), G being the mutable base, or gaRy(a/t) (see Rogozin and Pavlov [2003]). Rogozin et al. [2001b] proposed a statistical method to evaluate the relative merits of different consensus sequences. However, statistical analysis of 15 mutational spectra in immunoglobulin genes suggested not one sequence, but two sequences, rGy(a/t) and aGy(a/t), that had the same best score. Both motifs were used for further analysis of errors made by DNA polymerases in vitro [Rogozin et al., 2001b]. Here, we considered different variants of mutable motifs. We suggested the yyYTG consensus sequence; however, several hot spots have one mismatch with this sequence (e.g., the deletion hot spot at position 5012, Table 1), and were not included in the yyYTG set (Table 1). Thus, some other variants of suggested mutable contexts of deletions/insertions might exist, although a more accurate formal description of these motifs awaits larger datasets.
Perhaps the patterns in deletions and insertions observed within our sample of 19 human loci are representative of other coding genes. However, intergenic regions may have substantially different patterns in deletions and insertions, since local properties of noncoding and coding sequences are not the same (for example, noncoding sequences contain more repetitive fragments and fewer CpG sites). The ratio of deletion and insertion rates in noncoding regions is not yet known.
Mutations in periodic contexts cannot be used as fingerprints for identifying DNA polymerases and/or repair enzymes, since many of them are error-prone (at least DNA polymerases are [Bebenek and Kunkel, 2000]) in such contexts. Fortunately, many mutagenic contexts described here consist of nonperiodic sequences. In vitro studies [Pavlov et al., 2002; Muniappan and Thilly, 2002] may identify DNA polymerases that are error-prone within these contexts, and thus shed light on the mechanisms of spontaneous mutation.