Visualizing and integrating bioinformatics and biomolecular data

There are four research areas in the Department of Pharmaceutical Chemistry. Visualizing and integrating bioinformatics and biomolecular data is a research challenge within computational chemistry and biology.

The challenge

Once a biomolecule’s structure is determined, visualization software is needed to view and interact with such complex three-dimensional objects.

Like architects and surgeons, biochemists and structural biologists need to analyze the structures they’re working with—and how they interrelate with neighboring structures—from varied angles and with different layers revealed. They need to examine molecular structures up close, at the level of individual atoms and chemical bonds, with relevant details color-coded for clarity, and from a distance, as part of the larger complexes in which most proteins routinely assemble to carry out tasks inside cells (e.g., ribosomes that make new proteins, proteasomes that degrade unneeded proteins, etc.).

Indeed, such cellular machinery can be too large, dynamic, flexible, and fragile for its structure to be captured in full by any one technique of structural biology. Thus software is needed to fit together experimental data about its molecular components from a variety of sources (x-ray crystallography, NMR spectroscopy, electron microscopy, etc.) into combined models.

There is also a need to visually organize molecules into biological systems, to convert the ever-growing reams of data about the interactions between proteins in cells (interactomes) into network maps in order to discern and dissect complex biological pathways that can be perturbed by disease and to discover key nodes for therapeutic intervention.

Likewise, bioinformatics techniques are required to make useful sense of the explosion in data from genomics research. While the amino acid sequences of tens of millions of proteins from thousands of species have been discovered, the vast majority of enzymes (proteins that catalyze the chemical reactions essential to life) have uncertain, unknown, or incorrectly annotated function.

Thus, just as software can help convert the raw data of atomic coordinates into molecular snapshots, so too are programs necessary to analyze and compare those tens of millions of long, varying strings of amino acids for relevant similarities to known proteins and to speed determination of their function. Being able to rapidly and accurately resolve the structures and functions of uncharacterized enzymes could yield countless new drug targets as well as provide templates for the engineering of therapeutic proteins.

Examples of our research, methods, and resources include

UCSF Resource for Biocomputing, Visualization and Informatics

The department is home to the UCSF Resource for Biocomputing, Visualization and Informatics (RBVI), which develops software and web-based resources for the visualization of molecular structures—from atomic-level details to large interacting complexes of molecules—drawn from multiple data sources. The Resource also creates computational tools to help visually map molecular interactions in biological pathways and systems, as well as to organize and analyze biological data to find meaningful similarities and correlations between protein sequences, structures, and functions.

ViewDock dialog

The Chimera app ViewDock aids the interactive screening of compounds from the outputs of molecular docking programs. UCSF DOCK was the first such software, developed by department scientists in the early 1980s and revised many times since. Docking programs virtually screen small molecules (ligands) for the relative strength (affinity) with which they bind to protein active sites, potentially altering the proteins’ activity therapeutically.

The target structure shown here is H-Ras, a protein often mutated in human cancers. Selecting a potential ligand name in the dialog displays the molecule. The docked molecule pictured is ribose monophosphate. (Potential hydrogen bonds are shown as yellow lines.)

Established in 1970, the RBVI was a pioneer in the field of molecular graphics and is the nation’s oldest Biomedical Technology Research Resource (BTRR). The National Institutes of Health (NIH)-funded BTRR program supports the development of broadly applicable, dynamically evolving enabling technologies (in this case biomolecular and bioinformatics visualization), in contrast to more typical NIH grants that fund more narrowly defined research projects. The program requires numerous collaborative test-bed projects, as well as user training and wide dissemination of the advanced methodologies to the scientific community.

UCSF Chimera

Integrating multi-source/scale data to visualize molecular complexes

The RBVI continually develops, updates, and provides support for UCSF Chimera, a widely used, highly extensible program for interactive 3-D visualization of macromolecular structures and related data. The package provides more than 110 tools for the interactive analysis of atomic-level models, density maps, and protein sequences.

Chimera can fetch molecular structures, sequences, and density maps from web-linked databases, then allow users to do structural analyses such as measuring distances and angles, identifying hydrogen bonds and contacts, and using coloring or other stylizations to highlight properties such as sequence conservation and electrostatic potential. It also provides for modifying or building atomic-level structures, comparative (homology) modeling of protein structures, and fitting of atomic-level structures into lower-resolution data such as electron microscopy density maps of large assemblies.
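
The distance and angle measurements mentioned above reduce to simple vector arithmetic on atomic coordinates. The following is a minimal Python sketch of that arithmetic, not Chimera's own code; the function names and the example coordinates are assumptions made for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two 3-D points (same units as the inputs)."""
    return math.dist(a, b)

def angle_degrees(a, b, c):
    """Angle at vertex b formed by the points a-b-c, in degrees."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point rounding error
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Hypothetical coordinates in angstroms, not taken from any real structure
atom_a = (1.0, 0.0, 0.0)
atom_b = (0.0, 0.0, 0.0)
atom_c = (0.0, 1.0, 0.0)
print(distance(atom_a, atom_b), angle_degrees(atom_a, atom_b, atom_c))
```

In an interactive viewer the coordinates would come from the atoms a user selects; here they are hard-coded to keep the sketch self-contained.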

Some calculations are performed directly within the program, whereas others make use of web services provided by the Resource. The graphical scenes can be manipulated interactively in 3-D, with many options for labeling, display style, and visual effects (e.g., shadows) to enhance clarity.

Chimera allows properties of molecules such as electrostatic potential to be visualized with coloring. A highly negatively charged small molecule, inositol hexakisphosphate (depicted in stick mode), binds to a protease enzyme domain, which has its molecular surface colored to indicate electrostatic potential: red for negative charge, white for neutral, and blue for positive.
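
The red-white-blue scheme described in this caption can be approximated by a linear interpolation between three colors. The Python sketch below illustrates the idea only; the function name and the saturation limit of 10 (in arbitrary potential units) are assumptions, not Chimera settings.

```python
def potential_to_rgb(value, limit=10.0):
    """Map an electrostatic potential value to an (r, g, b) tuple in 0..1.

    Negative values shade toward red, zero is white, and positive values
    shade toward blue. `limit` is an assumed saturation value.
    """
    # Scale to [-1, 1], saturating beyond the limit
    t = max(-1.0, min(1.0, value / limit))
    if t < 0:
        # Fade the green and blue channels to move from white toward red
        return (1.0, 1.0 + t, 1.0 + t)
    # Fade the red and green channels to move from white toward blue
    return (1.0 - t, 1.0 - t, 1.0)

for v in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(v, potential_to_rgb(v))
```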

This animation shows the binding interface between porcine pancreatic trypsin (left) and a trypsin inhibitor from soybean (right). In the open position, each protein is rotated 90 degrees outward from the closed (bound) position.

The Resource-developed software also provides simple graphical interfaces for animating molecular dynamics simulations of atomic-level interactions within and between molecules over time based on different input trajectories, plus interactive morphing of proteins between conformations.

Most proteins operate as part of larger molecular assemblies. Thus Chimera can generate interactive depictions of complexes and cellular machinery composed of dozens of interacting molecules, such as ribosomes, microtubules, proteasomes, transmembrane channels, and virus capsids, by importing and fitting multi-scale data (including experimental and theoretical models).

As depicted using Chimera, the shell (or capsid) of a hepatitis B virus, about 420 angstroms in diameter, is an assembly of many separate proteins. Each colored blob is a separate protein, with multiple copies of the same protein shown in the same color. A copy of each type of protein is shown in ribbon format near the center.

Chimera is distinguished by an interface design and detailed documentation that allow extension features tailored to specific needs to be provided or coded by end users.

Since its initial release 21 versions ago in 2004, Chimera has been referenced in more than 6,000 journal articles and downloaded by more than 370,000 users as of early 2014.

Examples of key features developed for UCSF Chimera by RBVI staff and collaborators include:

  • Reading more molecular data formats than any other program. Users can readily visualize atomic-level structure data from dozens of databases, ranging from the nearly 100,000 experimentally determined structures (as of early 2014) in the Protein Data Bank to the nearly five million unique sequences homology modeled in ModBase, a database of comparative protein structure models calculated by the program Modeller, developed by colleagues in the Department of Bioengineering and Therapeutic Sciences.
  • Importing and visualizing electron and other 3D microscopy data (density maps) from multiple formats, with a broad selection of interactive tools for their analysis. Such electron micrographs capture larger biological complexes such as virus particles, ribosomes, and microtubules in which large numbers of proteins interact, albeit at coarser resolutions. Chimera tools enable users to interactively segment maps to sort out substructures, fit atomic-level components into density maps, measure lengths and volumes to suggest potential molecular components for fitting, and automatically color surfaces to indicate probable locations of various types of macromolecules.

electron tomography of human immune T-cell

Chimera visualization shows electron tomography of a human immune system T-cell attacking another cell. Vesicles (blue) containing serine protease enzymes (which destroy cells by lysis, rupturing their cell membranes) are shown being transported to the adjacent cell membranes (in orange) along microtubules (red) to kill the target cell. The experimental data is from a 2006 Nature study by University of Oxford researchers testing a theory that killer T-cells use centriole organelles (yellow) to organize microtubules that are used to tow lytic vesicles to the interface with the target cell. The data did not provide clear support for this theory, and the study instead reported a different mechanism for the delivery of such secreted vesicles to the cell membrane.

  • Direct interfacing with web services (some hosted by RBVI) that allow sequence-structure data retrieval and computational services (e.g., BLAST searches of sequence similarity, sequence alignment, homology modeling, docking calculations, etc.).
  • Performing combinatorial multi-scale modeling and visualization of large molecular assemblies, fitting molecules into density maps of larger complexes either directly in Chimera or via web services. The latter’s calculations are launched automatically, with the results returned to Chimera for interactive adjustment and analysis. This work may tap a diverse set of the software’s features and web services including:
    • multiple simultaneous fitting of atomic-level structures into EM density maps by the MultiFit module of the IMP integrative modeling platform
    • small angle X-ray scattering (SAXS) profiles of atomic-level structures calculated by FoXS, another IMP module, to model flexible conformations and those in solution (i.e., native conditions)
    • graphical user interfaces to simplify set up of input data/parameters, evaluate results, and perform iterative refinements
    • evaluation of side chain conformations from backbone-dependent and backbone-independent rotamer libraries
    • “peel-back” animation to explore and convey interrelations between layers in multi-scale models of larger complexes such as muscle fibers
    • any of the tools for general molecular analysis, including identification of hydrogen bonds and contacts, structure comparison by superposition and morphing, and coloring to show properties like sequence conservation

An animation made with UCSF Chimera provides structural analysis of an electron microscopy density map of the thick filaments that perform muscle contractions. These filaments are made up of complexes of myosin protein molecules. The animation shows that (in order):

  1. the structure has rotational symmetry (or uniformity) if turned 90 degrees
  2. it also has translational symmetry in 43.5 nanometer sections
  3. it is a four-strand helix, with the myosin molecules’ J-shaped “heads” on the surface and “tails” comprising the muscle filament body
  4. atomic-level details of myosin molecule heads can be hand-placed (by computer mouse) in the larger assembly, then optimized and replicated by the software
  5. the assembly’s 12 subfilaments can be dissected and rotated by the software

Source: Data from John Woodhead, PhD, and Roger Craig, PhD, both of University of Massachusetts Medical School.

Cytoscape

Using network visualizations to merge systems and structural biology

Cytoscape is the most commonly used open source network visualization program. It is routinely applied in proteomics to map biological systems and protein interactions in metabolic, signaling, and regulatory pathways within and between cells. The RBVI, along with UCSF colleagues in the Bioinformatics Core of the Gladstone Institutes, is one of seven institutions providing core Cytoscape development for that purpose.

An RBVI focus is using Cytoscape apps to bridge the complementary data sets of systems and structural/molecular biology. This approach reflects the increasing overlap between the two: Systems approaches are becoming more granular, posing hypotheses about the interactions of individual proteins or the roles of specific metabolites in a pathway. Meanwhile, molecular and structural biologists increasingly investigate the impact of regulatory pathways on the expression of proteins as well as how larger complexes of proteins work together to perform biological functions.

Examples of the 22 Cytoscape apps that RBVI has developed and supports include:

  • structureViz, which allows Cytoscape users to select nodes representing proteins in a given biological network and interactively display and analyze the detailed 3-D structures associated with those proteins in UCSF Chimera.

    Cytoscape users can also use Chimera sequence-structure analysis tools, such as MatchMaker (which superimposes structures) and Multalign Viewer (which displays sequence alignments and automatically associates them with structures). Thus functional residues and positions of conservation or divergence in the sequence alignment are easily mapped onto structures for further analyses. This also allows researchers to explore the possible structural implications of neighboring proteins in a pathway.

    structureViz screenshot

    A screen capture of the structureViz app in action. A sequence similarity network (SSN) of the phosphotriesterase family of enzymes is shown in Cytoscape. Once the user has selected some protein nodes of interest in Cytoscape, structureViz is used to open the corresponding structures in Chimera and spatially align them. The three structures in the image are colored white, cyan, and magenta, with selected parts outlined in green.


    To demonstrate the potential application of structureViz and associated tools, RBVI researchers examined a metabolic enzyme called isocitrate dehydrogenase (IDH1) that is mutated in glioblastoma multiforme, the most common and aggressive brain tumor in humans.

    The Chimera MatchMaker tool superimposed the structures of the IDH1 wild-type and mutant forms to visually assess changes in the enzyme’s active site. The Chimera FindHBond tool (below) revealed that the mutant structure lost a hydrogen bond to its usual substrate. This structural change yields an altered product, 2-hydroxyglutarate (2HG), thought to be a cancer-promoting metabolite (oncometabolite).

    Wild-type and mutant IDH1 structures

    The wild-type IDH1 enzyme with the substrate isocitrate, cofactor NADP (both dark green), and calcium ion (purple) bound in the active site. Dashed red lines show H-bonds, identified with the Chimera FindHBond tool, between its Arginine 132 residue and isocitrate. The inset shows the mutated protein, in which residue 132 is a histidine, with ligands modeled into the structure via MatchMaker superposition. As assessed by FindHBond, the histidine side chain is too far from the substrate to H-bond with it, resulting in an altered product.


    By using Cytoscape to view a network of proteins organized by their ligand-binding specificity, the researchers found five proteins known to bind to small molecules similar to the 2HG oncometabolite. (Another RBVI-developed app, chemViz, was used to inspect the chemical properties of the ligands.) Those proteins included glutamyl aminopeptidase, an enzyme previously implicated in regulating brain tumor-associated blood vessels.
  • clusterMaker unifies a wide variety of clustering techniques (algorithms) and visualization styles in a single interface for recognizing biologically meaningful patterns in large data sets and for confirming or generating hypotheses about biological function.

    For example, the app facilitates combined analyses of potentially complementary data sets from different types of experiments (e.g., yeast two-hybrid screening, high-throughput mass spectrometry protein complex identification) in order to more accurately identify stable complexes from clustered protein-protein interactions.

    It also notably interconnects cluster and network analysis of multiple types of biological data, including expression, genetic interaction, and physical interaction. For example, combining purported protein complex findings with data on gene expression in response to particular stimuli can suggest regulatory roles for particular proteins in a complex.
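
Returning to the IDH1 example in the structureViz item above, the loss of a hydrogen bond when a side chain sits too far from the substrate can be illustrated with a simple distance test. The Python sketch below is a deliberately simplified stand-in: tools such as FindHBond apply more detailed geometric criteria, and the 3.5 angstrom cutoff, atom labels, and coordinates here are all assumptions made for illustration.

```python
import math

# Assumed donor-acceptor distance cutoff in angstroms; real hydrogen-bond
# detection also considers angles, which this sketch omits.
MAX_DONOR_ACCEPTOR_DISTANCE = 3.5

def possible_hbonds(donors, acceptors, cutoff=MAX_DONOR_ACCEPTOR_DISTANCE):
    """Return (donor, acceptor, distance) for every pair within the cutoff."""
    pairs = []
    for d_name, d_xyz in donors.items():
        for a_name, a_xyz in acceptors.items():
            dist = math.dist(d_xyz, a_xyz)
            if dist <= cutoff:
                pairs.append((d_name, a_name, round(dist, 2)))
    return pairs

# Hypothetical coordinates, not taken from any IDH1 structure
donors = {"ARG132 NH1": (0.0, 0.0, 0.0), "HIS132 NE2": (0.0, 0.0, 6.0)}
acceptors = {"isocitrate O1": (2.9, 0.0, 0.0)}
hbonds = possible_hbonds(donors, acceptors)
print(hbonds)
```

With these made-up coordinates only the arginine atom falls within the cutoff, mirroring the qualitative point of the figure.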

SFLD: Using bioinformatics and visualization to classify enzyme function

The RBVI hosts and develops infrastructure for the Structure-Function Linkage Database (SFLD) in collaboration with colleagues in the Department of Bioengineering and Therapeutic Sciences (BTS).

The SFLD addresses the exponentially growing gap between the tens of millions of protein amino acid sequences discovered via genomics and the accurate knowledge of these biomolecules’ functions. The database focuses on enzymes, which catalyze the chemical reactions essential to life and are thus key therapeutic targets.

There is no simple way of correlating enzymes’ primary structures (chains of hundreds or thousands of amino acids) with their functions. Indeed, many enzymes are assigned the wrong function (misannotated) based solely on their overall sequence similarity to others with known functions. However, certain similarities in their composition can indeed reveal vital clues toward functional classification. These similarities reveal evolutionary relatedness as well as amino acid configurations (sequence motifs) experimentally shown to carry out specific chemical functions.

The SFLD is a hierarchical classification resource that describes sequence-structure-function relationships within functionally diverse enzyme superfamilies. The members of such a superfamily can catalyze very different overall reactions, but share a common ancestor and an aspect of chemical function, such as a partial reaction, carried out by a conserved set of active site residues. Superfamilies are further subdivided into families, sets of enzymes that catalyze the same overall reaction.

dipeptide epimerase

Web page for the dipeptide epimerase family in the Structure-Function Linkage Database (SFLD).

The SFLD’s core focus is a dozen functionally diverse superfamilies (comprising about 365,000 enzymes as of early 2014) manually curated by BTS scientists such that they are reliably annotated (with functional evidence coding) and can serve as a “gold standard” for developing and evaluating the more automated methods that are ultimately needed.

The RBVI implements and maintains SFLD searchability by superfamily, reaction, or enzyme, and its hierarchy of specific sequence similarity and associated functional aspects (superfamily, then subgroups with more shared features, then families that catalyze the same overall reaction). The Resource also provides crucial tools and web services for comparing unknown enzyme sequences with those in the SFLD core dataset such as:

  • Searching for SFLD sequences similar to the query using BLAST and/or comparison to Hidden Markov Models (HMMs) representing SFLD families, subgroups, and superfamilies. The unknown sequence can be added to the pre-existing alignments for the HMM hits of interest.
  • Chimera mapping of sequence motifs (via the Multalign Viewer tool) in alignments from the SFLD onto structures, and calculation of measures of sequence conservation (entropy, variability, etc.) that can be shown as histograms above the alignment.
  • Comparative (homology) modeling of unknown structures using the Modeller program developed by BTS colleagues, launched from a Multalign Viewer window containing an alignment of the template (known structure) sequence with the target (unknown structure) sequence.
  • Chimera visualization of active site properties including shape, size, hydrophobicity, and electrostatic potential to help infer possible ligands and thus guide the selection of virtual substrates for docking.
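
As a rough illustration of the entropy-based conservation measure mentioned in the list above, the following Python sketch computes Shannon entropy for each column of a toy alignment. It is not the calculation Chimera performs (gap handling and normalization may differ); the function names and sequences are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def alignment_entropies(sequences):
    """Entropy for each column of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]

# Toy alignment (hypothetical sequences); "-" marks a gap
alignment = ["DADK", "DGDR", "DSDK", "DTD-"]
entropies = alignment_entropies(alignment)
print(entropies)
```

Low entropy marks a conserved column (here the first and third, which are always D), while high entropy marks a variable one.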

Case study: Combining SFLD data and tools with Chimera to correct enzyme misannotation

M. capsulatus

In this Cytoscape-visualized sequence similarity network (SSN) the unknown protein sequence from M. capsulatus (yellow rectangle) clusters more with the dipeptide epimerases (light green) than with the chloromuconate cycloisomerases (pink), or several other family subsets of the enolase superfamily.

Network legend (color swatches not reproduced). Categories shown: unknown; dipeptide epimerase; chloromuconate cycloisomerase; muconate cycloisomerase (syn); muconate cycloisomerase (anti); N-succinylamino acid racemase 2; o-succinylbenzoate synthase; unclassified.

In a demonstration scenario, scientists here used RBVI network, sequence, and structure analysis tools to investigate an unknown enzyme sequence from Methylococcus capsulatus that was previously part of a published study by School scientists and colleagues in the Enzyme Function Initiative.

The sequence was annotated in two major protein databases as a supposed chloromuconate cycloisomerase. That family is a subset of the SFLD’s enolase superfamily, all of which carry out the partial reaction of removing a proton from a carbon adjacent to a carboxylic acid. In addition, they all contain specific conserved active site residues that bind a divalent metal ion which, in turn, stabilizes the reaction intermediate.

However, incorporation of this sequence into a sequence similarity network (SSN) for analysis in Cytoscape revealed that it clusters more closely with the dipeptide epimerases, a different family within the enolase superfamily. (In the SFLD, families are synonymous with the catalyzed reaction. Thus, beyond their shared partial reaction, dipeptide epimerases catalyze the structural inversion of dual amino acid peptides, while chloromuconate cycloisomerases open or close the ring structures of chlorinated muconates.)
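
The general idea of a sequence similarity network can be sketched as follows: sequences are nodes, an edge is kept when a pairwise similarity score meets a chosen threshold, and clusters emerge as connected groups. This Python sketch makes several assumptions: the scores, threshold, and sequence labels are invented, and it is not the procedure used to build the networks shown here.

```python
from collections import defaultdict

def build_network(pairwise_scores, threshold):
    """Keep an edge between two sequences when their score meets the threshold."""
    graph = defaultdict(set)
    for (a, b), score in pairwise_scores.items():
        _ = graph[a], graph[b]  # make sure both sequences exist as nodes
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clusters(graph):
    """Group nodes into connected components using depth-first search."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        result.append(component)
    return result

# Hypothetical similarity scores (higher means more similar)
scores = {
    ("query", "DE1"): 60, ("query", "DE2"): 55, ("DE1", "DE2"): 80,
    ("query", "CMC1"): 20, ("CMC1", "CMC2"): 75, ("DE1", "CMC1"): 18,
}
groups = clusters(build_network(scores, threshold=40))
print(groups)
```

Raising or lowering the threshold changes how finely the network separates, which is why such networks give a broad visual overview rather than a definitive classification.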

While the SSNs give a broad visual perspective on how sequences compare, further analysis of the M. capsulatus sequence using the SFLD’s Hidden Markov Models (HMMs) confirmed its most statistically significant sequence similarity to be with the dipeptide epimerase family.

As expected, the query sequence shares certain conserved sequence patterns characteristic of the enolase superfamily. However, aligning the query to SFLD sequences and displaying the results in Chimera’s Multalign Viewer reveals that the query sequence also contains a DXD motif characteristic of the dipeptide epimerase family.
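
A short motif such as DXD (aspartate, any residue, aspartate) can be located in a raw sequence with a regular expression. The Python sketch below shows only that pattern search; the function name and the sequence fragment are hypothetical, and a match by itself does not establish family membership, since the motif's position within the alignment and the structure is what matters.

```python
import re

# D, any one residue, D; the lookahead lets overlapping matches be reported
DXD = re.compile(r"(?=(D[A-Z]D))")

def find_dxd(sequence):
    """Return (1-based position, matched residues) for each DXD occurrence."""
    return [(m.start() + 1, m.group(1)) for m in DXD.finditer(sequence.upper())]

# Hypothetical sequence fragment, not the M. capsulatus protein
hits = find_dxd("MKTDADGDLLAVDPD")
print(hits)
```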

dipeptide epimerase

Chimera depiction of the active site of a representative dipeptide epimerase from the SFLD, with the Ala-Glu substrate and the side chains of several active site residues shown as sticks. The family-conserved Asp-X-Asp (DXD) residues are selected in both the structure (green outlines) and sequence alignment. These Asp residues interact with the substrate. Three superfamily-conserved residues and the substrate bind the metal ion (light green sphere). Conservation is shown as a histogram above the sequences in the alignment.

Chimera includes an interface to comparative (homology) modeling with the Modeller program, developed by BTS colleagues and run on an RBVI web service. The homology modeling can be launched given at least one known structure to use as a template and a sequence alignment containing the sequences of both the target (query) and the template. (The structure and alignment in the figure above are from the SFLD.)

The active sites of the resulting homology models can be examined to infer substrate specificity. For example, Chimera interactive tools can be used to visualize active site pocket volumes by representing the van der Waals radii of atoms as solid surfaces, and to color the surfaces to indicate electrostatic potential, hydrophobicity, and other properties.

Chimera results

Chimera 3-D visualization of Modeller results. The experimentally determined dipeptide epimerase structure used as a template (beige) and the five models of the target (query) sequence (blue, magenta, etc.) have been superimposed with MatchMaker.

coulombic

Active site pocket surfaces show query enzyme has about a 50 percent greater pocket volume and is more predominantly negatively charged (red), thus suggesting dipeptide substrates of larger size with positively charged side chains, as was experimentally confirmed.