Sequence, Structural, Functional, and Phylogenetic Analyses of Three Glycosidase Families

Glycosidases, which cleave the glycosidic bond between a carbohydrate and another moiety, have been classified into over 63 families. Here, a variety of computational techniques have been employed to examine three families important in normal and abnormal pathology with the aim of developing a framework for future homology modeling, experimental and other studies. Family 1 includes bacterial and archaeal enzymes as well as lactase phlorizin-hydrolase and klotho, glycosidases implicated in disaccharide intolerance II and aging respectively. A statistical model, a hidden Markov model (HMM), for the family 1 glycosidase domain was trained and used as the basis for comparative examination of the conserved and variable sequence and structural features as well as the phylogenetic relationships between family members. Although the structures of four family 1 glycosidases have been determined, this is the first comparative examination of all these enzymes. Aspects that are unique to specific members or subfamilies (substrate binding loops) as well those common to all members (a ([3hx)8 barrel fold) have been defined. Active site residues in some domains in klotho and lactase-phlorizin hydrolases differ from other members and in one instance may bind but not cleave substrate. The four invariant and most highly conserved residues are not residues implicated in catalysis and/or substrate binding. Of these, a histidine may be involved in transition state stabilization. Glucosylceramidase (family 30) and galactosylceramidase (family 59) are mutated in the lysosomal storage disorders Gaucher disease and Krabbe disease, respectively. HMM-based analysis, structure prediction studies and examination of disease mutations reveal a glycosidase domain common to these two families that also occurs in some bacterial glycosidases. Similarities in the reactions catalyzed by families 30 and 59 are reflected in the presence of a structurally and functionally related (13/a)8 barrel fold related to that in family 1. © 1998 Academic Press


INTRODUCTION
Glycdsyl hydrolases (glycosidases) are a diverse group of evolutionarily conserved enzymes which cleave the glycosidic bond between two or more carbohydrates or between a carbohydrate and a non-carbohydrate (aglycone) moiety (reviewed in (1)(2)(3)). Enzymatic hydrolysis occurs via a two step mechanism: formation of a glycosylenzyme with concomitant aglycone departure followed by hydrolysis of the glycosyl-enzyme by a water molecule. The reaction leads to either reten-tion or inversion of stereochemistry. Based on their primary sequences, glycosidases have been classified into over 63 families grouped into five clans or superfamilies (1). The three families of interest in this work (families 1, 30, and 59) contain members which have roles in both normal and abnormal pathology: aging, disaccharide intolerance II and two lysosomal storage disorders (Gaucher disease and Krabbe disease). Since these enzymes are known or putative ceramidases, they may be important in apoptosis. Here, a variety of computational techniques have been employed to analyze these glycosidase at the molecular level in order to provide insights into their structures and mechanisms of action and thus a framework for further studies.
Glycosidases belonging to family 1 have been the most extensively studied. Oligomerization is a common characteristic with some being tetramers (7,8) and others being dimers (9,10). Although myrosinases are S-glycosidases, they share extensive similarity with the O-glycosidase members of this family. Structure prediction (11-13) and crystallographic studies (7,10,14,15) of this family indicate that it has the classic (13/e08 barrel fold first observed in the structure of triose phosphate isomerase (16). The glycosidases whose structures have been determined are cyanogenic 13-glucosidase (linamarase) from white clover Trifolium repens (10), 6-phospho-13-galactosidase from the mesophilic bacterium Lactococcus lactis (14), 13-glycosidase from the hyperthermophilic archaeon Sulfolobus solfataricus (7) and myrosinase from Sinapis alba (15). Glycosylation is believed to involve nucleophilic attack by a critical, conserved glutamic acid (Glu) residue. The water molecule that mediates deglycosylation has been believed to be activated by a second conserved Glu acting as a general base. However, recent studies of myrosinase (15) in which this Glu is replaced by glutamine (Gln) suggest that hydrolysis of the glycosyl-enzyme intermediate occurs via the precise positioning of a nucleophilic water molecule and not by a general base activation of water.
The current study refines and extends preliminary work on KLI/KL2 and family 1 glycosidases (6) that employed the statistical modeling method known as hidden Markov models (HMMs). Profilebased HMMs of the type used here (17)(18)(19)(20) can characterize the primary sequence features of a family, generate a multiple sequence alignment, identify new members (database searches) and serve as the basis for structure prediction and phylogenetic studies (see for example (21)(22)(23)(24)(25)(26)(27)(28)(29)(30)(31)(32)). Examination of the sequence, structural, functional and phylogenetic features of the family 1 glycosidase domain provides information on the family as a whole as well as specific members, notably Klotho and lactase phlorizin hydrolase. These data were used subsequently to make inferences about the proteins mutated in Gaucher and Krabbe diseases, enzymes for which there is a paucity of information at the molecular level. An HMM trained for a glycosidase domain predicted to be common to families 30 and 59 and some bacterial glycosidases, structure prediction studies, and analysis of Gaucher and Krabbe disease mutations yield insights into the structure, function and active site of this family 30/50 glycosidase domain.

Statistical Modeling: Hidden Markov Model
A more detailed description of HMMs as used in modeling families of related sequences can be found elsewhere (17,(33)(34)(35) so only a summary is provided. Such HMMs consist of a sequence of nodes corresponding to the columns in a multiple sequence alignment of the family and can be viewed as profiles recast in a probabilistic framework. A match state corresponds to the consensus position in an alignment, an insert state permits insertions relative to the consensus and a delete state allows consensus positions to be skipped. Training an HMM (estimating its parameters) involves creating a stochastic model representing the family used to train it (the training set) by describing transitions into a match, delete or insert state and the occurrence of a given residue in a particular match or insert state. Generating a multiple sequence alignment for a family involves aligning each member to the HMM rather than to other members. A database search consists of scoring each sequence in a database against the model and evaluating the significance of the resultant score.
HMM creation, training and use were performed with v2.0 of the SAM (Sequence Alignment and Modeling Software System) suite (17,18) running on a MASPAR MP-2204 with a DEC Alpha 3000/300X frontend at the University of California Santa Cruz (UCSC). To improve the ability of the HMM to generalize, Dirichlet mixture priors (19,20) were employed. Free Insertion Modules .(FIMs) were utilized to allow an arbitrary number of insertions at either end of the HMM to accommodate domains that occurred within larger sequences. Sequencing weighting was used to ameliorate the problem of overrepresentation of some sequences.
In previous work (21)(22)(23)(24)(25)(26)(27), remote homologues were identified by an iterative scanning strategy: sequences found in one round of HMM searching were added to the training set and the expanded set was used to retrain the HMM for the next round of searching. This procedure was repeated until convergence. This approach to database searching using HMMs is similar to that obtained by use of PSI-BLAST (36), a recent extension of BLAST (37). PSI-BLAST performs an initial gapped BLAST search of the database. In subsequent iterations, statistically significant alignments from the previous search are employed to construct a position-specific score matrix for use in the next round of searching in place of the query and standard amino acid substitution matrix. PSI-BLAST converges and stops if all sequences found at a particular round below a threshold value were already in the model at the beginning of the round.
The SAM program hmm score and an initial HMM trained using 10 sequences (6) were used for HMM database searches by calculating logodds scores (38,39) for all sequences in a non-redundant protein database obtained from the NCI (40) and updated weekly at UCSC. The log-odds score for a sequence measures how much more likely it is to have been generated by the HMM as opposed to a competing NULL model consisting of a simple FIM loop (39). The level of significance cr is related to a log-odds score d as follows: cr _< Nz -J (N is the database size and z is the logarithm base) (39). The PSI-BLAST parameter E estimates the statistical significance of a hit by specifying the number of hits with a given score that are expected by chance in a search of a database of given size. The default expected number of false positives E is 0.01 (0.01 matches with a given score would be expected purely by chance). The SAM NULL model, assumed to be a reasonably accurate description of the space the sequences are drawn from, is unlikely to be a good model for the score distribution of all "random" sequences. Hence, although SAM o-values only approximate PSI-BLAST E values (or ~ E), the two are comparable and o-= 0.01 certainly indicates significance in spite of this significance level being pessimistic.
Taking into account the number of sequences in the database searched (-272,000 different proteins in November 1997), a significant log-odds score is considered to be 22.7, the value at which o-= 0.01. Log-odds scores higher than this value denote fewer expected false positives. The approach employed here emphasizes training an HMM that discriminates between training arid non-training set sequences, i.e., one in which the gap in log-odds scores between the lowest scoring training set sequence and the highest scoring non-training set (database) sequence is relatively large (usually greater than 5.0) and the absolute log-odds score for the lowest training set sequence is greater than 22.7. In addition, efforts were made to ensure that training resulted in an HMM capable of yielding an alignment such that known enzymatic elements aligned.
A database search with the initial HMM revealed a number of sequences with log-odds scores higher than or close to that of the lowest scoring training set sequence. The alignment of sequences with log-odds scores greater than 22.7 was examined and those which possessed regions conserved in the initial HMM were retained and added to the training set. The HMM was then retrained with this expanded training set. Further rounds of "search, align and retrain" revealed fearer and fewer new sequences with the domain. The gap in log-odds scores between training set and non-training set sequences remained relatively constant. At this point (November 1997) and after approximately 10 iterations, a final HMM was trained and used for subsequent studies.

Phylogenetic Analysis
An HMM-generated alignment of the training set containing only match and delete states was utilized for phylogenetic studies. Insert states are not modeled by an HMM because the regions in a sequence they represent are the most divergent parts of the molecules and are likely to be sources of systematic error in phylogenetic analysis. The MOLPHY suite uses a probabilistic procedure for inferring phylogenetic relationships (41,42). A number of initial trees were generated using the default JTT model for amino acid substitutions and v2.3 of the program. A maximum likelihood distance matrix was calculated and employed to infer a number of approximate trees with NJdist, a neighbor-joining method. The Star Decomposition algorithm was used to calculate another tree (not the maximum likelihood tree). Starting from these initial trees, repeated local rearrangements were employed to search for better tree topologies. Among these final trees, the one with the highest likelihood was selected. Approximate bootstrap probabilities were computed using the RELL method.

Fcunily 1 Glycosidase Domain: Klotho, Lactase Phlorizin Hydrolase
Hidden Markov model. Table 1 lists the sequences that comprised the final training set (very close homologues are not shown). Of -273,000 sequences searched using the final HMM, the lowest scoring training set sequence had a log-odds score of 138.5. Only training set sequences and close homologues had scores above this value. The next highest scoring sequences were fragments of Sinapis alba myrosinases (44.5-63.1); all other database sequences had scores below 15.4. Thus, sequences with log-odds scores greater than 138.0 are classified as possessing a family 1 glycosidase domain. PSI-BLAST searches with a few randomly selected divergent domains (data not shown) yielded the same set of proteins (those known to belong to this family). The large gap in log-odds scores between training set and non-training set sequences suggests that further generalization of the HMM is required to detect more distantly related family members.
Phyiogenetic analysis. Figure     The name of the organism, sequence abbreviation and the enzyme are listed. ~: denotes enzymes whose three-dimensional structures have been solved and whose amino acid sequence are taken from the PDB entries. Databank codes are given in square parenthesis.
Sa_2MYR and Ss_IGOW are likely to be the most and least similar, respectively, to the domains in Klotho, subsequent structural analyses will focus largely on these two enzymes.
Sequence and structural features. An HMMgenerated multiple sequence alignment of the training set and the lengths and locations of B-strands and a-helices in domains of known structure were used to infer the secondary structure elements likely to be present in all domains. These results together with the positions most likely to be important for structure and/or function are shown in Figure 2. Although family members range in length from -380 (Rn_LPH1, Oc_LPH1, Hs_LPH1) to 609 (Cw_BG2) residues, the 356 nodes in the current HMM provide an estimate for the number of positions likely to be common to all sequences that possess the domain. Differences in length can be attributed largely to variations in the size of two loop regions labeled L1 and L2. Although the first domain in lactase-phlorizin hydrolases is the most divergent in terms of primary sequence (Figure 1), it appears to represent the minimal domain. Only 10% of the positions are conserved across all the domains (37/  Figure 2. Sequence identifiers are given in Table 1 and those in italics are glycosidases whose three-dimensional structures have been determirled. Subfamilies discussed in the text are labeled at their root. Local bootstrap probabilities are given for each branch and indicate the bootstrap probability of that branch. Employing only residues in the core barrel for superimposition leads to a good spatial alignment of not only the core glycosidase 13-strands and a-helices (cyan), but also the active site residues (side chains drawn explicitly).

~A
In both enzymes, the core glycosidase elements are located at the end of the barrel containing the active site. The tetramer interface in Ss_IGOW (7) and the dimer interface in Sa_2MYR (15) (blue) are located at similar positions on the outer surface of the domain. Since lactasephlorizin hydrolase and Klotho are the only family members that possess two or more copies of the domain, the analogous regions may be the sites of interaction between the constituent domains. Thus, these enzymes may be a pseudotetramer and pseudodimer, respectively. It is possible that two Klotho molecules associate to form a dimer of pseudodimers that results in juxtaposition of the amino-termini extracellularly and carboxy-termini intracellularly. This association could be triggered by ligand binding and thus be an important mechanism underlying the biological activity of Klotho. With regards to the L1 and L2 loops (yellow), their variability at both the primary sequence and tertiary structure levels suggests that they could form flexible flaps over the the active site channel in all domains and thus play a key role in substrate recognition (particularly the nonreducing end).
The organization of and features observed in Ss_IGOW, Sa_2MYR and other family members provide a detailed framework for creating a threedimensional model for the domains in Klotho and lactase-phlorizin hydrolase by homology modelling. The folding of these domains should create enzymes with similar, though not identical, active sites whose differences are likely to be important.
Active site. Figure 5 shows the great similarity in the geometry of active site residues in a eukaryotic S-glycosidase and an archaeal Oglycosidase. Side chains shown in magenta and green and the corresponding residues in other family members ( Figure 2) have been discussed previously because of their roles in substrate binding and hydrolysis (1-3, 7, 10, 14, 15). Together with loops L1 and L2, differences between these residues are likely to play a role in precise substrate recognition. The nucleophile of family 1 glycosidases is a Glu at the end of 137. Inspection of Figure 2 indicates that some members of subfamily C differ at this position. The Asp in the first domain of lactase-phlorizin hydrolases and the serine (Ser) in the second domain of Klotho (Hs_KL2) are residues that could behave as nucleophiles. In the second domain of lactasephlorizin hydrolase, however, unless some posttranscriptional event alters the glycine (Gly) at this position, this domain may be unable to perform the first glycosylation step, i.e., it may be able to bind but not cleave its substrate. The Glu/Gln between 134 and 134(al) implicated in positioning the nucleophilic water molecule (15) is changed to Asp in the first domain of lactasephlorizin hydrolases and asparagine (Asn) in Hs_KL1. This observation supports the proposition that hydrolysis of the glycosyl-enzyme intermediate of retaining glycosidases may not necessarily involve base activation of the water molecule.
In contrast to what might be expected, the four invariant and most highly conserved positions (yellow) do not include the proton donor and nucleophile implicated in catalysis. These tyrosine (Tyr), histidine (His), proline (Pro) and aspartate   Figure 4. The similarities and differences between the two family 1 glycosidases shown in Figure 3 (the enzymes are in the same orientations). The coloring scheme is that given in Figure 2. Ribbons correspond to the 13-strands and a-helices of the barrel (red) and family 1 glycosidase domain (cyan) (those in grey are elements specific to each structure). Side chains drawn explicitly in magenta, yellow and green are important for structure and/or function. L2 (yellow) denotes a loop region proposed to be important in substrate recognition. The 2-deoxy-2-fluorglucosyl substrate bound at the active site of myrosinase is shown with carbon atoms in grey, oxygen.in red and fluorine in green.
characterizing both the variable and conserved regions of a family and the necessity for examining the entire domain rather than sequence motifs.

Famih, 30/59 Glvcosidase Domain: Gaucher and Krabbe Disease Proteins
Hidden Markov model and secondary structure prediction. PSI-BLAST (36) searches using the Gaucher disease protein as the query sequence revealed statistically significant similarities to some bacterial and worm glycosidases and Krabbe disease protein. These results suggested a common underlying architecture among these proteins so these sequences were used to train a family 30/59 glycosidase domain HMM. The final HMM had a total of 351 nodes, approximately the length of the family 1 glycosidase domain HMM. Since no structural data are available for any members of family 30/59, residues corresponding to match and delete states in an HMM-generated alignment of the 14 training set sequences were used to predict the secondary structure for this domain using two different methods: DSC (51 ) and PHD (52). These predictions were combined to create a consensus secondary structure which indicated an alternating pattern of 13-strands and c~-helices followed by a series of 13-strands. Since many glycosidases appear to have a (13/008 barrel fold, this alternating pattern co,uld indicate a similar fold. The locations of the nucleophile and proton donor at the ends of 134 and 137 in family 1 ( Figure 2) were employed as constraints on the placement of specific 13strands and cx-helices in a barrel. The consensus secondary structure and patterns of residues from family 1 were used to infer/estimate the core of the (13/a)8 bah'el. The results are shown in Figure 6.
Fold prediction. A further set of experiments were performed to assess whether the domain could have a (13/o08 barrel fold. For each glycosidase, a sequence consisting of residues corresponding to match states (those shown in Figure 6) were employed as input to 123D (53). 123D determines a plausible fold for a protein of unknown structure from a library of representative protein structures. It uses a substitution matrix, secondary structure prediction and contact capacity potentials to thread a sequence through a set of structures. For each sequence, although few structures from the iibrary had Z-scores significantly greater than 0.0, among the top 10 structures were one or more glycosidases that have been classified (1) as follows: o Family 5: endoglucanases and 13-mannanases Clostridium thelw~ocellum endo-l,4-13glucanase C307 (cellulase) (PDB entry 1CEO; Swiss-Prot entry GUNC_CLOSF).
Apart from 1PGS, the SCOP database of structures (54), categorizes these structures as possessing a 13/a barrel fold or its close variant the cellulase fold.

DISCUSSION
The family 1 glycosidase domain HMM trained here attempts to capture the core elements of the (~/a)8 barrel global fold and the residues involved in recognizing the carbohydrate residue adjacent to the scissile bond. While knowledge of these common features is a necessary early step in providing greater insights into this family, particularly members such as Klotho and lactasephlorizin hydrolase, it is insufficient for a comprehensive understanding. Clearly, the precise substrate specificity and mode of action of each member will be governed to a large degree by the more variable parts of the structure, most notably the LI and L2 loops. The flexibility of HMMs means that it is simple and straightforward to model these variable regions explicitly (in more detail) and generate alignmenis for further examination. Alternatively, if the goal is detecting more remote homologues and or characterizing the (13/a)8 barrel, the connecting regions within and between the 13/a repeats could be modeled implicitly by converting them to insertions leaving only the [3-strands and a-helices of the core barrel. The latter strategy would be most suitable for identifying distant relationships by merging specific glycosidase families and superfamilies in an effort to approximate the "archetypal" (or ancestral) glycosidase fold. In either analysis, extension of the current HMM generates automatically a statistical model that can be used for database searching. HMMs provide a framework for comparative analysis of genomes/proteins as well as detailed analysis of specific proteins. A synthesis of the considerable information on the family 1 domain and data from studies of the family 30/59 glycosidase domain demonstrate the ability of computational studies to provide insights into Gaucher and Krabbe diseases.
The results here provide some clues into cellular senescence and thus indirectly into the complex, multifactorial phenomenon of organisreal aging. In addition to Klotho, some of the eucaryotic family 1 glycosidases in subfamily C (Figure 1) also appear to be associated with aging. For example, defects in human lactase-phlorizin hydrolase are the cause of disaccharide intolerance type II or III and the activity of this enzyme sometimes declines in adults (55). There is an apparent age-related decline in its activity in rats (56). One of the plant enzymes (Zm_GLUI) cleaves the biologically inactive plant hormone conjugates cytokinin-O-glucosides and kinetin-N3glucoside, releasing the active cytokinins (57). Cytokinins, compounds structurally related to adenine, are found in many plants, in bacteria and in the tRNA of many bacteria and eukaryotes. In plants, they appear to promote cell division and differentiation and are associated with delay of senescence and the promotion of chloroplast and lateral bud development (reviewed in (58-60)). The cytokinin kinetin (N6-furfuryladenine) retards senescence in plants, delays aging in human cells in culture, slows development of insects and prolongs their lifespan (61-63). Recently, kinetin has been detected in commercially available DNA, human cellular DNA and plant cell extracts (64). A mechanism for the in vivo formation of kinetin has been suggested: it is a secondary oxdiative damage product of DNA (65,66).
The known or inferred glycosidase activities of Klotho, lactase-phlorizin hydrolase, Gaucher disease protein and Krabbe disease protein include hydrolysis of membrane phospholipids thereby generating ceramide, a second messenger that activates the apoptotic cascade. Whether these four mammalian enzymes also have a role in the generation of kinetin is unknown. The data here provide a foundation for modeling the threedimensional structures of the glycosidase domains present in, among others, these enzymes. Although these and other computational and experimental studies can be useful in understanding the normal and abnormal phenotypes at the molecular level, it will be a challenge translating these observations into an understanding of the cellular, organismal and clinical aspects of aging, cancer and certain human disorders.