Determining enzyme function by predicting substrate specificity

There are four research areas in the Department of Pharmaceutical Chemistry. Determining enzyme function by predicting substrate specificity is a research challenge within computational chemistry and biology.

The challenge

Exponential growth in genome sequencing, which allows scientists to determine the linear sequence of amino acids (residues) in protein molecules encoded by genes, has yielded a conundrum.

As of June 2014, the UniProt database held more than 69 million protein sequences identified in life forms ranging from humans to bacteria, up from fewer than 10 million at the start of 2010. But at least half of those have uncertain, unknown, or incorrectly annotated functions.

In other words, despite approaches that include automated comparisons of the discovered proteins’ sequence data, their specific chemical and biological functions either cannot be determined or have been mislabeled. This is because closely related proteins (those with high sequence similarity) can have different functions, and unrelated proteins can have similar functions. The problem becomes ever more significant and self-perpetuating as newly discovered proteins are assigned incorrect functions because of their similarity to previously misannotated ones.

This increasing gap between knowledge of sequences and functions is especially significant when it comes to enzymes—proteins that catalyze and vastly accelerate the biochemical reactions vital to life and health. As the founding publication of the Enzyme Function Initiative notes: “Sequence similarity and/or genome/operon context provide coarse function clues… but they are rarely sufficient to provide information about the substrate specificity and, therefore, the actual reaction that is catalyzed.”

Being able to rapidly assign reliable functions to newly sequenced enzymes would help scientists decipher metabolic pathways, providing for both new drug targets and the potential to apply lessons about the relations between sequence, structure, and function to the engineering of enzymes for therapeutic use.

Examples of our research and methods include

Applying homology modeling and docking to determine substrates

Department scientists lead the computation core that is central to the Enzyme Function Initiative (EFI), an NIH-funded nine-institution effort to develop a large-scale, multidisciplinary sequence/structure-based strategy to determine the functions of unknown enzymes discovered in bacterial genome projects, partly via high-throughput prediction of their substrate specificity.

The EFI strategy is tested by selecting enzymes of unknown function that belong to one of several large and complex protein superfamilies. Superfamilies are groups of evolutionarily related enzymes that share a specific conserved chemical capability (for example, a partial reaction; that is, a single mechanistic step in catalysis or stabilizing the same type of reaction intermediate) performed by conserved active site features.


A depiction of mandelate racemase from Pseudomonas putida (1MDR)

Mandelate racemases, a family of enzymes within the functionally diverse enolase superfamily, catalyze interconversion of the (R-) right-handed and (S-) left-handed mirror image molecules (enantiomers) of mandelate, in a pathway for the latter’s metabolic breakdown (catabolism).

The protein backbone is shown as a ribbon colored by secondary structure (alpha helices turquoise, beta-strands purple, the rest gray). Also displayed are the side chains of functional residues (gold), a metal ion (bright green), and a ligand similar to mandelate (pink).

Yet enzymes in the same superfamily can catalyze very different overall reactions, making them “functionally diverse” and requiring further characterization. For example, enzymes in the enolase superfamily (about 25,000 known sequences, per the SFLD as of late 2013) share a metal ion in their active sites and the chemical step of abstracting alpha-protons from carboxylic acids. But a superfamily member such as muconate lactonizing enzyme breaks down aromatic compounds for soil bacteria while its close relative, glucarate dehydratase, breaks down sugars for metabolism.

Selected enzymes are cloned and expressed, and up to 100 per year have their structures determined via x-ray crystallography. Department-led computational efforts leverage and expand upon that structural information for high-throughput determination of function in two ways:

  • Applying comparative structure (homology) modeling, using databases and software developed by scientists in the School’s Department of Bioengineering and Therapeutic Sciences, to effectively extend the number of determined enzyme structures.
  • Employing in silico screening (docking) to rank virtual metabolite libraries, thus greatly winnowing the number of substrate candidates for in vitro testing. Such computational screening is much faster and cheaper than physical assays, casts a wider net beyond commercially available or readily synthesized compounds, and also provides detailed information about even negative results (i.e., non-binding interactions) to further guide substrate selection.
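The winnowing step described in the second bullet can be pictured as a simple ranking. The sketch below is only an illustration, not the department's pipeline: the metabolite names and score values are made up, and it assumes the common docking convention that a lower (more negative) score indicates a more favorable predicted interaction.

```python
from typing import NamedTuple


class DockingResult(NamedTuple):
    metabolite: str  # hypothetical library entry
    score: float     # hypothetical docking score; lower = better (assumed)


def shortlist(results: list[DockingResult], keep: int) -> list[DockingResult]:
    """Return the `keep` best-scoring metabolites, best first."""
    ranked = sorted(results, key=lambda r: r.score)
    return ranked[:keep]


# Illustrative values only; not taken from any study.
library = [
    DockingResult("metabolite_A", -41.2),
    DockingResult("metabolite_B", -12.7),
    DockingResult("metabolite_C", -55.9),
    DockingResult("metabolite_D", -30.1),
]

candidates = shortlist(library, keep=2)
for r in candidates:
    print(f"{r.metabolite}\t{r.score}")
```

Only the shortlisted candidates would then go on to in vitro testing; in practice the full ranked list is also kept, since poor scores can still help guide substrate selection.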

Such in silico screening represents a new application of docking and a different challenge from seeking ligands as potential drug leads. Drug-lead molecules may merely bind to an enzyme’s active site, competing with and blocking endogenous ligands to serve as therapeutic inhibitors. Substrates, by contrast, must precisely orient and align their reactive, specificity-determining groups with the enzyme’s catalytic residues.

Such substrate docking incorporates department modeling work that accounts for the interactive conformational flexibility of both ligand and active site (induced fit) and that predicts and accounts for the role of hydrogen bonds in stabilizing enzyme-substrate complexes.

New computational tools being developed here to guide the selection and/or synthesis of candidate substrates are continuously tested and refined by comparing their results with biochemical assays, and, where possible, ligand-bound crystal structures. When possible, further testing is done for in vivo function via mutant knockout / overexpression bacteria, transcriptomics, and metabolomics.

Specific studies in this area have included

Metabolite docking to homology models

Department scientists applied virtual metabolite docking to homology models of active sites to guide the discovery of substrate specificities and biochemical function of a subset of enzymes from the enolase superfamily.

The study’s 65 target enzymes were representative of a group (more than 2,600 sequences per SFLD as of June 2014) that share key conserved active site residues and motifs indicating they epimerize dipeptides; that is, they invert the spatial orientation of atoms around the substrates’ asymmetric carbons.


A dipeptide epimerase from Enterococcus faecalis in complex with dipeptide L-isoleucine-L-tyrosine substrate. The protein backbone of the enzyme is shown as a tan ribbon and the active site metal ion as a yellow ball. The dipeptide and the side chains of selected active site residues are shown as sticks color-coded by element: carbons tan (enzyme) or brown (dipeptide), oxygens red, and nitrogens blue.


Closeup of the active site, with the dipeptide’s alpha-carbons shown as balls. Parts of the ribbon that would somewhat obscure the view have been made translucent.

Initially, two such dipeptide epimerases had been characterized, from E. coli and B. subtilis, and both were found to be specific for L-Ala-D/L-Glu (AEEs). These are believed to be involved in the recycling of cell wall polymers, of which that substrate is a component.

The analysis—screening all possible dipeptides against models for dozens of related proteins—predicted an unexpected and notable diversity, including enzymes specific for hydrophobic dipeptides and a small group with specificity for positively charged dipeptides.
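To give a sense of the size of such a screening library, the sketch below enumerates every ordered pair of the 20 standard amino acids. It is an assumption-laden illustration: it ignores stereochemistry (L versus D residues) and any nonstandard residues, so the library actually docked in the study may have been defined differently.

```python
from itertools import product

# One-letter codes for the 20 standard proteinogenic amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_dipeptides(alphabet: str = AMINO_ACIDS) -> list[str]:
    """Every ordered pair of residues, e.g. 'AE' for Ala-Glu."""
    return [a + b for a, b in product(alphabet, repeat=2)]


dipeptides = all_dipeptides()
print(len(dipeptides))  # 400
```

Order matters here because Ala-Glu and Glu-Ala are different molecules, which is why the count is 20 × 20 rather than the number of unordered pairs.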

The findings underscored the synergistic benefit of combining computational modeling and bioinformatics sequence analysis. Predictions were investigated for some enzymes in vitro and in crystal structures, including substrate-ligand complexes, to confirm their accuracy and further detail the structural bases of the specificities.

Predicting chain building of unknown enzymes

Researchers here computationally predicted the chain-length specificity of a subgroup of the isoprenoid synthase (IS) superfamily: more than 9,000 trans-polyprenyl transferases (E-PTS; per SFLD as of late 2013). These enzymes catalyze elongation of varied-length linear chains from 5-carbon building blocks (isoprenes: isopentenyl diphosphate, or IPP, and dimethylallyl diphosphate, or DMAPP). The chains serve, in turn, as “trunk” substrates for enzymes biosynthesizing the more than 55,000 known branching isoprenoid metabolites that play key roles in cells from all domains of life.


A depiction of the structure of an E-PTS enzyme (GGPP synthase, a drug target in certain malarial parasites) determined by x-ray crystallography with aspartic acid-rich motifs. In its elongation cavity, active sites S1 and S2 bind 5-carbon building blocks (IPP, DMAPP), which link in an isoprenoid chain after cleavage of DMAPP’s diphosphate.

While E-PTSs share a common protein fold and active site binding motifs, there is no simple relationship between their functional and overall sequence similarity. Indeed, mutations in small numbers of residues in the enzymes’ elongation cavity, where chains are assembled, greatly alter chain-length specificity.

Department scientists and EFI colleagues used bioinformatics analysis to choose E-PTSs maximally distant in sequence from those with solved structures and functional annotations. Researchers generated new structures and homology models of 74 enzymes for docking that evaluated the steric complementarity of polyprenyl products and the elongation cavities.

Upon biochemical verification, this approach predicted chain-length specificities to within one isoprene unit 94 percent of the time, and the enzymes’ own products often vary by as much. Such modeling might thus predict the function of the complete E-PTS subgroup, and it could potentially correct substantial numbers of automated misannotations that relied on sequence similarity.
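The "within one isoprene unit" criterion can be made concrete with a small calculation. The sketch below is hypothetical throughout: the function names are illustrative and the predicted/observed chain lengths are invented to show the arithmetic, not drawn from the study. It relies only on the fact stated above that each building block contributes 5 carbons.

```python
CARBONS_PER_UNIT = 5  # each isoprene building block contributes 5 carbons


def units(chain_carbons: int) -> int:
    """Convert a chain length in carbons (C10, C15, ...) to isoprene units."""
    if chain_carbons % CARBONS_PER_UNIT != 0:
        raise ValueError("chain length must be a multiple of 5 carbons")
    return chain_carbons // CARBONS_PER_UNIT


def fraction_within_one_unit(pairs: list[tuple[int, int]]) -> float:
    """Share of (predicted, observed) carbon counts differing by <= 1 unit."""
    hits = sum(abs(units(p) - units(o)) <= 1 for p, o in pairs)
    return hits / len(pairs)


# Hypothetical (predicted, observed) chain lengths in carbons.
example = [(15, 15), (20, 15), (20, 30), (40, 45)]
print(fraction_within_one_unit(example))  # 0.75
```

In this invented set, three of the four predictions fall within one unit (5 carbons) of the observed product, giving 0.75; the study's reported figure corresponds to the same kind of ratio computed over its verified enzymes.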