Visualizing and integrating bioinformatics and biomolecular data
Examples of our research, methods, and resources include
UCSF Resource for Biocomputing, Visualization and Informatics
The department is home to the UCSF Resource for Biocomputing, Visualization and Informatics (RBVI), which develops software and web-based resources for the visualization of molecular structures—from atomic-level details to large interacting complexes of molecules—drawn from multiple data sources. The Resource also creates computational tools to help visually map molecular interactions in biological pathways and systems, as well as to organize and analyze biological data to find meaningful similarities and correlations between protein sequences, structures, and functions.
Established in 1970, the RBVI was a pioneer in the field of molecular graphics and is the nation’s oldest Biomedical Technology Research Resource (BTRR). The National Institutes of Health (NIH)-funded BTRR program supports the development of broadly applicable, dynamically evolving enabling technologies (in this case, biomolecular and bioinformatics visualization), in contrast to more typical NIH grants, which fund more narrowly defined research projects. The program requires numerous collaborative test-bed projects, as well as user training and wide dissemination of the advanced methodologies to the scientific community.
Integrating multi-source/scale data to visualize molecular complexes
The RBVI continually develops, updates, and provides support for UCSF Chimera, a widely used, highly extensible program for interactive 3-D visualization of macromolecular structures and related data. The package provides more than 110 tools for the interactive analysis of atomic-level models, density maps, and protein sequences.
Chimera can fetch molecular structures, sequences, and density maps from web-linked databases, then lets users perform structural analyses such as measuring distances and angles, identifying hydrogen bonds and contacts, and using coloring or other stylizations to highlight properties such as sequence conservation and electrostatic potential. It also supports modifying or building atomic-level structures, comparative (homology) modeling of protein structures, and fitting of atomic-level structures into lower-resolution data such as electron microscopy density maps of large assemblies.
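As a rough illustration of the geometry behind distance and angle measurements, here is a minimal Python sketch. It is not Chimera code and does not use Chimera's API; the function names and the coordinates are hypothetical, chosen only to show the arithmetic.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3-D points (same units as the input)."""
    return math.dist(a, b)


def angle(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


# Hypothetical atom coordinates in angstroms, not taken from a real structure.
atom_a = (0.0, 0.0, 0.0)
atom_b = (1.0, 0.0, 0.0)
atom_c = (1.0, 2.0, 0.0)

print(round(distance(atom_a, atom_c), 3))
print(angle(atom_a, atom_b, atom_c))
```

Hydrogen-bond and contact detection build on the same primitives by applying distance and angle cutoffs to candidate atom pairs; the specific criteria Chimera uses are not reproduced here.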
Some calculations are performed directly within the program, whereas others make use of web services provided by the Resource. The graphical scenes can be manipulated interactively in 3-D, with many options for labeling, display style, and visual effects (e.g., shadows) to enhance clarity.
The Resource-developed software also provides simple graphical interfaces for animating molecular dynamics simulations of atomic-level interactions within and between molecules over time based on different input trajectories, plus interactive morphing of proteins between conformations.
Most proteins operate as part of larger molecular assemblies. Chimera can therefore generate interactive depictions of complexes and cellular machinery composed of dozens of interacting molecules, such as ribosomes, microtubules, proteasomes, transmembrane channels, and virus capsids, by importing and fitting multi-scale data (including experimental and theoretical models).
Chimera is distinguished by an interface design and detailed documentation that allow extension features tailored to specific needs to be provided with the program or coded by end users.
Since its initial release 21 versions ago in 2004, Chimera has been referenced in more than 6,000 journal articles and downloaded by more than 370,000 users as of early 2014.
Examples of key features developed for UCSF Chimera by RBVI staff and collaborators include:
- Reading more molecular data formats than any other program. Users can readily visualize atomic-level structure data from dozens of databases, ranging from the nearly 100,000 experimentally determined structures (as of early 2014) in the Protein Data Bank to the nearly five million unique sequences homology modeled in ModBase, a database of comparative protein structure models generated by the program Modeller, developed by colleagues in the Department of Bioengineering and Therapeutic Sciences.
- Importing and visualizing electron and other 3D microscopy data (density maps) from multiple formats, with a broad selection of interactive tools for their analysis. Such electron micrographs capture larger biological complexes such as virus particles, ribosomes, and microtubules, in which large numbers of proteins interact, albeit at coarser resolutions. Chimera tools enable interactive segmentation to sort out substructures, fitting of atomic-level components into density maps, measurement of lengths and volumes to suggest potential molecular components for fitting, and automated coloring of surfaces to indicate probable locations of various types of macromolecules.
- Direct interfacing with web services (some hosted by RBVI) that allow sequence-structure data retrieval and computational services (e.g., BLAST searches for sequence similarity, sequence alignment, homology modeling, and docking calculations).
- Performing combinatorial multi-scale modeling and visualization of large molecular assemblies, fitting molecules into density maps of larger complexes either directly in Chimera or via web services. Web-service calculations are launched automatically, with the results returned to Chimera for interactive adjustment and analysis. This work may tap a diverse set of the software’s features and web services including:
  - multiple simultaneous fitting of atomic-level structures into EM density maps by the MultiFit module of the IMP integrative modeling platform
  - small angle X-ray scattering (SAXS) profiles of atomic-level structures calculated by FoXS, another IMP module, to model flexible conformations and those in solution (i.e., native conditions)
  - graphical user interfaces to simplify setup of input data/parameters, evaluate results, and perform iterative refinements
  - evaluation of side chain conformations from backbone-dependent and backbone-independent rotamer libraries
  - “peel-back” animation to explore and convey interrelations between layers in multi-scale models of larger complexes such as muscle fibers
  - any of the tools for general molecular analysis, including identification of hydrogen bonds and contacts, structure comparison by superposition and morphing, and coloring to show properties like sequence conservation
Using network visualizations to merge systems and structural biology
Cytoscape is the most commonly used open source network visualization program. It is routinely applied in proteomics to map biological systems and protein interactions in metabolic, signaling, and regulatory pathways within and between cells. The RBVI, along with UCSF colleagues in the Bioinformatics Core of the Gladstone Institutes, is one of seven institutions providing core Cytoscape development for that purpose.
An RBVI focus is using Cytoscape apps to bridge the complementary data sets of systems and structural/molecular biology. This approach reflects the increasing overlap between the two: Systems approaches are becoming more granular, posing hypotheses about the interactions of individual proteins or the roles of specific metabolites in a pathway. Meanwhile, molecular and structural biologists increasingly investigate the impact of regulatory pathways on the transcription of proteins as well as how larger complexes of proteins work together to perform biological functions.
Examples of the 22 Cytoscape apps that RBVI has developed and supports include:
- structureViz, which allows Cytoscape users to select nodes representing proteins in a given biological network and interactively display and analyze the detailed 3-D structures associated with those proteins in UCSF Chimera.
Cytoscape users can also use Chimera sequence-structure analysis tools, such as MatchMaker (which superimposes structures) and Multalign Viewer (which displays sequence alignments and automatically associates them with structures). Thus functional residues and positions of conservation or divergence in the sequence alignment are easily mapped onto structures for further analyses. This also allows researchers to explore the possible structural implications of neighboring proteins in a pathway.
To demonstrate the potential application of structureViz and associated tools, RBVI researchers examined a metabolic enzyme called isocitrate dehydrogenase (IDH1) that is mutated in glioblastoma multiforme, the most common and aggressive brain tumor in humans.
The Chimera MatchMaker tool superimposed the structures of the IDH1 wild-type and mutant forms to visually assess changes in its active site. The Chimera FindHBond tool (below) revealed that the mutant structure lost a hydrogen bond to its usual substrate. This structural change yields an altered product, 2-hydroxyglutarate (2HG), thought to be a cancer-promoting metabolite (an oncometabolite).
By using Cytoscape to view a network of proteins organized by their ligand-binding specificity, the researchers found five proteins known to bind to small molecules similar to the 2HG oncometabolite. (Another RBVI-developed app, chemViz, was used to inspect the chemical properties of the ligands.) Those proteins included glutamyl aminopeptidase, an enzyme previously implicated in regulating brain tumor-associated blood vessels.
- clusterMaker unifies a wide variety of clustering techniques (algorithms) and visualization styles in a single interface for recognizing biologically meaningful patterns in large data sets and for confirming or generating hypotheses about biological function.
For example, the app facilitates combined analyses of potentially complementary data sets from different types of experiments (e.g., yeast two-hybrid screening and high-throughput mass spectrometry protein complex identification) in order to more accurately identify stable complexes from clustered protein-protein interactions.
It also notably interconnects cluster and network analysis of multiple types of biological data, including expression, genetic interaction, and physical interaction. For example, combining purported protein complex findings with data on gene expression in response to particular stimuli can suggest regulatory roles for particular proteins in a complex.
SFLD: Using bioinformatics and visualization to classify enzyme function
The RBVI hosts and develops infrastructure for the Structure-Function Linkage Database (SFLD) in collaboration with colleagues in the Department of Bioengineering and Therapeutic Sciences (BTS).
The SFLD addresses the exponentially growing gap between the tens of millions of protein amino acid sequences discovered via genomics and the accurate knowledge of these biomolecules’ functions. The database focuses on enzymes, which catalyze the chemical reactions essential to life and are thus key therapeutic targets.
There is no simple way of correlating enzymes’ primary structures (chains of hundreds or thousands of amino acids) with their functions. Indeed, many enzymes are assigned the wrong function (misannotated) based solely on their overall sequence similarity to others with known functions. However, certain similarities in their composition can reveal vital clues toward functional classification. These similarities reflect evolutionary relatedness as well as amino acid configurations (sequence motifs) experimentally shown to carry out specific chemical functions.
The SFLD is a hierarchical classification resource that describes sequence-structure-function relationships within functionally diverse enzyme superfamilies. The members of such a superfamily can catalyze very different overall reactions, but share a common ancestor and an aspect of chemical function, such as a partial reaction, carried out by a conserved set of active site residues. Superfamilies are further subdivided into families, sets of enzymes that catalyze the same overall reaction.
The SFLD’s core focus is a dozen functionally diverse superfamilies (comprising about 365,000 enzymes as of early 2014) manually curated by BTS scientists such that they are reliably annotated (with functional evidence coding) and can serve as a “gold standard” for developing and evaluating the more automated methods that are ultimately needed.
The RBVI implements and maintains the SFLD’s search functions (by superfamily, reaction, or enzyme) and its hierarchy of sequence similarity and associated functional aspects (superfamily, then subgroups with more shared features, then families that catalyze the same overall reaction). The Resource also provides crucial tools and web services for comparing unknown enzyme sequences with those in the SFLD core dataset, such as:
- Searching for SFLD sequences similar to the query using BLAST and/or comparison to Hidden Markov Models (HMMs) representing SFLD families, subgroups, and superfamilies. The unknown sequence can be added to the pre-existing alignments for the HMM hits of interest.
- Chimera mapping of sequence motifs (via the Multalign Viewer tool) in alignments from the SFLD onto structures, and calculation of measures of sequence conservation (entropy, variability, etc.) that can be shown as histograms above the alignment.
- Comparative (homology) modeling of unknown structures using the Modeller program developed by BTS colleagues, launched from a Multalign Viewer window containing an alignment of the template (known structure) sequence with the target (unknown structure) sequence.
- Chimera visualization of active site properties including shape, size, hydrophobicity, and electrostatic potential to help infer possible ligands and thus guide the selection of virtual substrates for docking.
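One of the conservation measures mentioned in the list above, entropy, can be sketched in a few lines of Python. This is a generic Shannon-entropy calculation, assuming gaps are simply ignored and no normalization is applied; it is not necessarily how Multalign Viewer computes its values, and the toy alignment is invented for illustration.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of one alignment column, skipping gaps."""
    residues = [r for r in column if r != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(sequences):
    """Per-column entropy for a list of equal-length aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_entropy(col) for col in zip(*sequences)]


# Toy alignment, invented for illustration only.
toy_alignment = ["KDAD", "KDGD", "KEAD", "RD-D"]
entropies = alignment_entropy(toy_alignment)
for position, value in enumerate(entropies, start=1):
    print(position, round(value, 3))
```

Low entropy marks a conserved column (the last column here, all D, scores zero), while higher values mark variable positions. Such per-column values are the kind of quantity that can be plotted as a histogram above an alignment.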
Case study: Combining SFLD data and tools with Chimera to correct enzyme misannotation
The query sequence in this case study, from M. capsulatus, was annotated in two major protein databases as a supposed chloromuconate cycloisomerase. That family is a subset of the SFLD’s enolase superfamily, all of whose members carry out the partial reaction of removing a proton from a carbon adjacent to a carboxylic acid. In addition, they all contain specific conserved active site residues that bind a divalent metal ion which, in turn, stabilizes the reaction intermediate.
However, incorporation of this sequence into a sequence similarity network (SSN) for analysis in Cytoscape revealed that it clusters more closely with the dipeptide epimerases, a different family within the enolase superfamily. (In the SFLD, families are synonymous with the catalyzed reaction. Thus, beyond their shared partial reaction, dipeptide epimerases catalyze the structural inversion of dual amino acid peptides, while chloromuconate cycloisomerases open or close the ring structures of chlorinated muconates.)
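The idea behind a sequence similarity network can be sketched in plain Python: sequences are nodes, an edge joins any pair whose similarity meets a chosen threshold, and clusters are read off as connected components. The sequence labels, scores, and threshold below are hypothetical placeholders, not SFLD data, and this sketch does not use Cytoscape or any SFLD tooling.

```python
def build_network(scores, threshold):
    """Adjacency map keeping only pairs whose similarity meets the threshold."""
    graph = {}
    for (a, b), score in scores.items():
        # Register both nodes so isolated sequences still appear in the network.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Connected components via depth-first search, each returned sorted."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        result.append(sorted(component))
    return result


# Hypothetical pairwise similarity scores (0 to 1), for illustration only.
toy_scores = {
    ("query", "epimerase_1"): 0.62,
    ("query", "epimerase_2"): 0.58,
    ("epimerase_1", "epimerase_2"): 0.71,
    ("query", "cycloisomerase_1"): 0.31,
    ("cycloisomerase_1", "cycloisomerase_2"): 0.66,
}

for group in clusters(build_network(toy_scores, threshold=0.5)):
    print(group)
```

With these made-up numbers the query groups with the two epimerase-labeled sequences at a threshold of 0.5, while lowering the threshold to 0.3 merges everything into one cluster. That sensitivity to the cutoff is why such networks are typically explored interactively at several thresholds.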
While the SSNs give a broad visual perspective on how sequences compare, further analysis of the M. capsulatus sequence using the SFLD’s Hidden Markov Models (HMMs) confirmed its most statistically significant sequence similarity to be with the dipeptide epimerase family.
As expected, the query sequence shares certain conserved sequence patterns characteristic of the enolase superfamily. However, aligning the query to SFLD sequences and displaying the results in Chimera’s Multalign Viewer reveals that the query sequence also contains a DXD motif characteristic of the dipeptide epimerase family.
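A simple way to scan a sequence for a DXD pattern (aspartate, any residue, aspartate) is a regular expression. The sketch below is an assumption-laden illustration: the example sequence is invented, positions are reported on the ungapped sequence, and a bare pattern match says nothing about whether the residues sit in the active site, which is what the alignment and structure views establish.

```python
import re

# Lookahead so overlapping occurrences (e.g. "DDDD") are all reported.
DXD_PATTERN = re.compile(r"(?=(D[A-Z]D))")


def find_dxd(sequence):
    """Return (0-based start, matched triplet) for each DXD occurrence.

    Gap characters are removed first, so positions refer to the ungapped
    sequence rather than to alignment columns.
    """
    ungapped = sequence.replace("-", "").upper()
    return [(m.start(), m.group(1)) for m in DXD_PATTERN.finditer(ungapped)]


# Invented sequence fragment, for illustration only.
print(find_dxd("MKDADLLDGDW"))   # [(2, 'DAD'), (7, 'DGD')]
print(find_dxd("MK-DA-DLL"))     # [(2, 'DAD')]
```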
Chimera includes an interface to comparative (homology) modeling with the Modeller program, developed by BTS colleagues and run on an RBVI web service. The homology modeling can be launched given at least one known structure to use as a template and a sequence alignment containing the sequences of both the target (query) and the template. (The structure and alignment in the figure above are from the SFLD.)
The active sites of the resulting homology models can be examined to infer substrate specificity. For example, Chimera interactive tools can be used to visualize active site pocket volumes by representing the van der Waals radii of atoms as solid surfaces, and to color the surfaces to indicate electrostatic potential, hydrophobicity, and other properties.