IT (Informatics Talent) Guy scouts out the next bioinformatics leaders
Inspired by this month's All-Stars awards, I
decided to get a jump-start on next year's contest by seeking out
the hottest bioinformatics software talent now. I put up a shingle
and opened Nat's Talent Agency. There are loads of great ideas out
there, but I winnowed the list down to four acts that I think might
explode in the next 12 months.
None is really new. They've all been playing
off-off-Broadway and the back streets of Vegas since before Ed
Sullivan. But this might just be the year they hit the big time. You
know how it is - overnight success after years of hard
work.
Knowledge Mining
A promising emerging act is knowledge mining, a
song and dance troupe that's been around so long and changed its
name so many times that it's hard to keep track of all the rooms
it's played. It started as information retrieval (boring), then text
mining, then literature mining. The new name has the most zing.
The goal is to search the literature better
through software that "understands" more of the content. Things like
gene and protein names, biological functions and processes, diseases
and physiology, anatomy, drugs and compounds, assays, and more. It's
like PubMed on steroids.
This scene is getting so hot that this year's
big data-mining competition, the KDD Cup 2002, which covers all
industries, focused on biomedicine. And even more to the point, the
winner (in one category) was - ta-da - Celera in partnership with a
commercial knowledge mining company, ClearForest.
The idea is to be able to answer queries like,
"Find all references that discuss compounds that affect acetylation
for treatment of neurodegenerative disorders." Or for the more
molecular folks, "Find all references that discuss molecules that
affect the acetylation of transcription factors."
I haven't been able to see any of these
products live and can't judge how well they work on real problems.
The nice people at BioWisdom did a Web-based demo for me, but I
could only poke around in areas they had scripted. The Cellomics
website offers a free two-week trial, but they came through with the
required password too late.
If it works, knowledge mining will be a mega
star - the Tom Cruise of bioinformatics.
Pathway Modeling
Pathway modeling is my second pick for stardom.
It's a technology with obvious appeal. Biologists invariably draw
pathway diagrams to illustrate any biological process with more than
one step. The challenge is to pirouette beyond informal pathway
diagrams to formal models that represent biological processes in a
precise mathematical or computational form.
The most obvious reason to do formal pathway
modeling is to simulate the dynamic behavior of a biological process
of interest. This storyline has been kicking around for eons, but
it's never been very practical because it requires detailed
information on reaction rates and such, which is simply not
available for most processes of interest. A newer twist on the idea
is network inference, in which you start with a partial model and
try to figure out the missing bits by comparing simulated results to
experimental data. There is some hope that this will reduce the need
for detailed information about each reaction.
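To make the simulation idea concrete, here is a minimal Python sketch of a two-step pathway A -> B -> C stepped forward in time with a simple Euler method. The rate constants, starting amount, step size, and function name are all invented for illustration; real models need measured rates, which is exactly the sticking point.

```python
def simulate_chain(k1, k2, a0=1.0, dt=0.001, t_end=10.0):
    """Integrate A -> B -> C with first-order mass-action kinetics (Euler steps)."""
    a, b, c = a0, 0.0, 0.0
    steps = int(round(t_end / dt))
    trajectory = [(0.0, a, b, c)]
    for i in range(1, steps + 1):
        v1 = k1 * a          # rate of A -> B
        v2 = k2 * b          # rate of B -> C
        a -= v1 * dt
        b += (v1 - v2) * dt
        c += v2 * dt
        trajectory.append((i * dt, a, b, c))
    return trajectory

# Hypothetical rate constants, chosen only to give a readable curve.
for t, a, b, c in simulate_chain(k1=1.0, k2=0.5)[::2000]:
    print(f"t={t:5.2f}  A={a:.3f}  B={b:.3f}  C={c:.3f}")
```

Network inference, in this picture, would amount to adjusting k1 and k2 (or adding missing steps) until the simulated curves match the measured ones.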
Pathway models can also play a knowledge
management role by organizing information about biological processes
in a form that is accessible and intuitive to researchers. This
theme was expressed years ago by Kurt Kohn, one of the patriarchs of
the field, but seems to have been dropped. I'd like to see this
genre revived: it could be the first starring role for the
technology.
There are a lot of academic codes available and
a few commercial ones. I tried one of each: the Jarnac/JDesigner
suite from Herbert Sauro at Caltech, and VisualCell from Gene
Network Sciences (with which my institution has a partnership). Both
worked well, but are intended for different kinds of pathways.
Jarnac/JDesigner is aimed at metabolic pathways, while VisualCell's
strength is on regulatory pathways.
Jarnac/JDesigner offers both a textual and a
graphical language, which I find a plus (see sidebar). I mainly used
the text language. It comes with a built-in simulator and simple
graphics for plotting the simulation results. It was a lot of fun
and very easy to vary the efficiency of steps in a metabolic pathway
and see the effects (which were generally small, as predicted by
theory). Jarnac/JDesigner is a great teaching tool even if it has no
place in your research.
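The "generally small" effect is easy to reproduce on paper. The sketch below is not Jarnac code; it is a Python toy that assumes a linear chain in which step i runs at rate e_i * (S_before - S_after), so the steady-state flux behaves like current through resistors in series. The activities and concentrations are made up.

```python
def steady_state_flux(activities, s_in=1.0, s_out=0.0):
    """Steady-state flux through a linear chain where step i runs at e_i * (S_prev - S_next)."""
    return (s_in - s_out) / sum(1.0 / e for e in activities)

base = [1.0, 1.0, 1.0]       # three equally efficient steps (assumed values)
j0 = steady_state_flux(base)
for i in range(len(base)):
    boosted = list(base)
    boosted[i] *= 1.10       # make one step 10% more efficient
    j1 = steady_state_flux(boosted)
    print(f"step {i + 1}: +10% activity -> {100 * (j1 / j0 - 1):.1f}% change in flux")
```

Under these assumptions a 10% boost to any single step raises the overall flux by only about 3%, because control of the flux is shared among all the steps.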
VisualCell is purely graphical. Not
surprisingly, the language is complicated - it has to be in order to
precisely model real pathways. But with the help of the experts at
the company, I was able to learn the language in a day, and use it
the next day to create a small but realistic model of a disease
process. My learning curve was shortened by the abstract modeling
language.
High-Performance Sequence Analysis
A new generation of sequence rockers is hitting
the charts, hoping to unseat the reigning platinum record holder,
BLAST. Most promise blazing speed with no loss of sensitivity,
although one act (MPSRCH) is going for the sensitivity market at
somewhat lower speed.
In this context, sensitivity refers to the
ability of an algorithm to find distant matches, i.e., sequences in
the database that are only vaguely similar to the query sequence.
The flip side - specificity - refers to how well an algorithm avoids
reporting false matches. This is not usually a concern, since the
algorithms assign a score to each reported match, and the user is
free to ignore matches with scores that are too low. As the
algorithms get more sophisticated, specificity will probably become
more of an issue.
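Here is a small Python sketch of how a score cutoff trades one quantity against the other. The scores and the true/false labels are invented; in real life you rarely know which matches are genuine, which is part of the problem.

```python
def evaluate(hits, threshold):
    """hits: list of (score, is_real). Returns (sensitivity, specificity) at a cutoff."""
    tp = sum(1 for score, real in hits if real and score >= threshold)
    fn = sum(1 for score, real in hits if real and score < threshold)
    tn = sum(1 for score, real in hits if not real and score < threshold)
    fp = sum(1 for score, real in hits if not real and score >= threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Made-up data: True marks a genuine relative of the query, False a chance match.
hits = [(95, True), (60, True), (32, True), (40, False), (28, False), (15, False)]
for threshold in (20, 35, 50):
    sens, spec = evaluate(hits, threshold)
    print(f"cutoff {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Raising the cutoff throws away false matches but also loses the distant relative scoring 32, which is the sensitivity-versus-specificity tension in miniature.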
Community Software
Our final contestant - community software - is
the sentimental, feel-good favorite. Community software is the step
beyond open source in which programmers from many places join
together to create software of value to all. It's the global village
in action.
Two big community efforts are underway: BioPerl
and MGED (Microarray Gene Expression Data). BioPerl is further along
and has already debuted its software. The MGED people are working
furiously and hopefully will raise the curtain soon.
BioPerl is a colossal production with 450 Perl
modules focused on sequence-related issues. There is code to read
and write the major sequence formats, create indexed sequence files,
and work with pairwise and multiple sequence alignments. It provides
wrappers for many popular programs including BLAST, HMMER, Sim4, and
others, though some important tools (e.g., FASTA) are oddly absent.
There is also software to create graphical displays of annotated
sequences.
The internals are a tour de force of Perl
programming. It's a veritable how-to guide for advanced Perl
programmers, reflecting the extraordinary software skills of the
developers.
Time will tell whether the BioPerl troupe can
hang together and even expand. I predict a deluge of new programmers
wanting to add their favorite software to the show. The BioPerl
stage managers will have to decide whether to let the newcomers on
stage, and if so, how to maintain quality, or to shoo them away and
become a closed shop.
Having seen all the contestants, here are my
final predictions for the technologies that will be shining brightly
in the not too distant future: My heart says community software. My
wistful eye says knowledge mining. My scientific hopes say pathway
modeling. And my pragmatic side says high-performance sequence
analysis.
Knowledge Mining: Can it work?
The hard part of doing literature searches is
going back in time and reinterpreting old results in light of new
data or theories.
For example, in answering the first question in
the main text, the system should report that valproate - a histone
deacetylase inhibitor - was tried on Huntington's disease patients
in a case report published in 2000. What makes this tricky is that
valproate wasn't known to be involved with acetylation when the
article was published, and no terms related to acetylation appear in
the paper. Moreover, the subsequent paper that connects valproate to
acetylation doesn't actually mention the drug by this name, but
rather talks about valproic acid, which is the active form.
The case report describes profound improvements
in HD symptoms. An exciting connection. But don't get too excited.
If the system were really smart, it would
temper its enthusiasm by noting that valproate was given in
combination with another drug: the authors were mainly interested in
the other drug, and they never followed up the valproate angle.
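A toy Python sketch shows the kind of bookkeeping involved. Everything in it is hand-built for illustration: the paper records, the identifiers, and the three little lookup tables stand in for what a real product would have to extract from the literature itself, and none of it reflects how any vendor's system works.

```python
# Hand-built toy knowledge base (all entries are illustrative assumptions).
SYNONYMS = {"valproic acid": "valproate"}                 # alias -> canonical name
COMPOUND_FACTS = {"valproate": {"affects": {"acetylation"}}}  # learned from a later paper
DISEASE_CLASSES = {"huntington's disease": "neurodegenerative disorder"}

PAPERS = [
    {"id": "case-report-2000", "text": "Valproate given to Huntington's disease patients."},
    {"id": "mechanism-paper", "text": "Valproic acid inhibits histone deacetylase."},
    {"id": "unrelated", "text": "Aspirin and headache."},
]

def canonical_compounds(text):
    """Compounds mentioned in the text, under any known name."""
    text = text.lower()
    found = {name for name in COMPOUND_FACTS if name in text}
    found |= {name for alias, name in SYNONYMS.items() if alias in text}
    return found

def disease_classes(text):
    """Broad disease classes implied by the diseases named in the text."""
    text = text.lower()
    return {cls for disease, cls in DISEASE_CLASSES.items() if disease in text}

def search(process, disease_class):
    """Papers naming a compound known to affect `process` and a disease in `disease_class`."""
    results = []
    for paper in PAPERS:
        compounds = canonical_compounds(paper["text"])
        affects = any(process in COMPOUND_FACTS[c]["affects"] for c in compounds)
        if affects and disease_class in disease_classes(paper["text"]):
            results.append(paper["id"])
    return results

print(search("acetylation", "neurodegenerative disorder"))
```

The case report is found even though the word "acetylation" never appears in it. What the toy cannot do is the smart part: notice the second drug and dial down the excitement.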
In answering the second question, the system
would have to tap dance around a rapidly changing area of science.
Until recently, people thought that the main way acetylation
affected transcription was by changing the acetylation level of
histones. Histones are the proteins around which DNA is wrapped to
form a compact, three-dimensional structure. The old theory was that
acetylation caused histones to loosen their grip on the DNA and
allow transcription factors to sneak in and do their job. The
cognoscenti now believe that this is only one effect, and that the
acetylation status of transcription factors is important, too. The
net effect is that many papers that talk about histone acetylation
have to be re-interpreted in this new light.
- NG
Pathway Modeling: Graphics vs. Text
In the pathway modeling field there's a big
emphasis on graphical representation of models. This emphasis is
understandable given that biologists are alleged to think in
pictures.
I find this exasperating, because it can be
fiendishly hard to describe a complex process in pictures. There are
many things that are just plain easier to say in text.
Here's an example.
1) Protein kinase A (PKA) activates the
transcription factor CREB by phosphorylating the serine at position
133.
2) When activated, CREB can join a
three-molecule complex, consisting of itself, either CREB binding
protein (CBP) or a related protein p300, and TAFII130.
3) TAFII130 in turn can bind the TFIID subunit
of the basal transcriptional complex, which includes the TATA-binding
protein (TBP). The details of this interaction are not known.
4) When the transcriptional complex is fully
assembled, TBP can bind the TATA box upstream of the transcription
initiation site.
5) This brings RNA polymerase II into contact
with the DNA to be transcribed, and transcription can proceed.
The words describe the process clearly. To turn
this into a model, one would have to recraft it using a precise
computer language, which wouldn't be too hard. I don't see what a
picture would add.
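For the curious, here is one way the five steps might be recrafted, sketched in Python as a list of rules plus a tiny engine that fires them. The species labels ("CREB-P", "assembled complex", and so on) and the all-or-nothing logic are my own simplifications; nothing is consumed, and there are no rates or amounts.

```python
# Each rule: (description, species required, species produced). Labels are illustrative.
RULES = [
    ("PKA phosphorylates CREB at Ser133", {"PKA", "CREB"}, {"CREB-P"}),
    ("CREB-P joins CBP and TAFII130", {"CREB-P", "CBP", "TAFII130"}, {"CREB complex"}),
    ("CREB-P joins p300 and TAFII130", {"CREB-P", "p300", "TAFII130"}, {"CREB complex"}),
    ("TAFII130 binds TFIID", {"CREB complex", "TFIID"}, {"assembled complex"}),
    ("TBP binds the TATA box", {"assembled complex", "TATA box"}, {"bound promoter"}),
    ("RNA polymerase II transcribes", {"bound promoter", "RNA Pol II"}, {"transcript"}),
]

def run(initial):
    """Fire rules until nothing new appears; return (species present, rules fired)."""
    present = set(initial)
    fired = []
    changed = True
    while changed:
        changed = False
        for description, needs, makes in RULES:
            if needs <= present and not makes <= present:
                present |= makes
                fired.append(description)
                changed = True
    return present, fired

cell = {"PKA", "CREB", "p300", "TAFII130", "TFIID", "TATA box", "RNA Pol II"}
present, fired = run(cell)
for description in fired:
    print(description)
print("transcription proceeds:", "transcript" in present)
```

Take PKA out of the starting set and no rule fires, which is the behavior the prose describes. The "either CBP or p300" clause costs one extra line of text.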
There's an apropos lesson from software
engineering: diagrams are a great way to document programs, but text
is the best way to write them.
- NG
High-Performance Sequence Analysis
Many fast sequence search methods gain their
speed from a simple trick. They start by finding short, exact
matches called seeds and then extend the seeds into longer, inexact
matches. The trick pays off because short exact matches can be found
very fast.
A key parameter of such methods is the size of
the initial seeds. This is the "word length" parameter you may have
seen in BLAST.
One simple way to gain speed is to increase the
seed length, but this reduces sensitivity. Another approach is to
pre-process the database and create an index telling where any given
seed exists.
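A bare-bones Python sketch of the seed-and-extend idea with a pre-built index follows. It is not BLAST or any of the programs in the table: the extension step simply walks outward without gaps until it runs out of a mismatch budget, and the function names, the budget, and the sequences are all invented for illustration.

```python
from collections import defaultdict

def build_index(database, k):
    """Map every length-k word to the positions where it starts in the database."""
    index = defaultdict(list)
    for i in range(len(database) - k + 1):
        index[database[i:i + k]].append(i)
    return index

def extend(query, database, q, d, k, max_mismatches=1):
    """Grow an exact seed in both directions, tolerating a few mismatches (no gaps)."""
    left_q, left_d, mismatches = q, d, 0
    while left_q > 0 and left_d > 0:
        if query[left_q - 1] != database[left_d - 1]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        left_q -= 1
        left_d -= 1
    right_q, right_d = q + k, d + k
    while right_q < len(query) and right_d < len(database):
        if query[right_q] != database[right_d]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        right_q += 1
        right_d += 1
    return left_d, right_d, mismatches

def search(query, database, k=4, max_mismatches=1):
    """Look up each query word in the index, then extend every seed hit."""
    index = build_index(database, k)
    matches = set()
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            matches.add(extend(query, database, q, d, k, max_mismatches))
    return sorted(matches)

database = "TTACGTTAGCTT"
for start, end, mismatches in search("ACGTAAGC", database):
    print(database[start:end], f"at {start}-{end} with {mismatches} mismatch(es)")
```

Building the index costs one pass over the database; after that, each seed lookup is a dictionary access, no matter how big the database is. Raising k shrinks the number of seed hits to extend, which is where both the speed gain and the sensitivity loss come from.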
A more sophisticated approach is to build a
fancy data structure called a suffix tree that effectively tells
where all sequences of any length exist in the database. Suffix
trees are widely used in the computer field, but have seen limited
use in bioinformatics because the traditional implementations
consume a lot of memory - about 40 bytes for each letter in the
database, which comes to 120 GB for the entire human genome. Too big
to be practical. Recent improvements in the method have cut the
memory requirements to 17 bytes per letter (50 GB for the human
genome), and have reduced the penalty for storing the data structure
on disk, which brings the method to the verge of practicality.
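The arithmetic is easy to check. The snippet below assumes a round 3 billion letters for the human genome, which is my figure for the sake of the calculation.

```python
GENOME_LETTERS = 3_000_000_000   # assumed round figure for the human genome

def suffix_tree_gigabytes(bytes_per_letter, letters=GENOME_LETTERS):
    """Memory needed for a suffix tree at a given per-letter cost, in GB (10^9 bytes)."""
    return bytes_per_letter * letters / 1e9

for label, cost in (("traditional", 40), ("improved", 17)):
    print(f"{label}: {suffix_tree_gigabytes(cost):.0f} GB")
```

That gives 120 GB and 51 GB, the latter being the "50 GB" quoted above after rounding.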
A different approach is to switch from exact
match seeds to ones with a limited number of mismatches. This makes
it possible to improve sensitivity for a given seed length at the
cost of slowing down the search for initial seeds. On certain
computers, notably the Cray SV vector machines, inexact matches can
be found almost as fast as exact ones, making this approach very
attractive. (Note that my institution has a partnership with
Cray.)
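To see where the slowdown comes from on ordinary hardware, consider the brute-force way to allow mismatches: look up not just the seed but every word within a few substitutions of it. The Python sketch below counts those words for DNA; the function name is made up, and this is only one of several ways to do inexact seeding, not necessarily what any of the programs here do.

```python
from itertools import combinations, product

def neighbors(seed, max_mismatches, alphabet="ACGT"):
    """All words of the same length within max_mismatches substitutions of seed."""
    words = {seed}
    for m in range(1, max_mismatches + 1):
        for positions in combinations(range(len(seed)), m):
            # At each chosen position, try every letter except the original.
            choices = [[c for c in alphabet if c != seed[p]] for p in positions]
            for letters in product(*choices):
                word = list(seed)
                for p, c in zip(positions, letters):
                    word[p] = c
                words.add("".join(word))
    return words

for m in (0, 1, 2):
    print(f"{m} mismatch(es): {len(neighbors('ACGTACGT', m))} words to look up")
```

For an 8-letter DNA seed the count goes from 1 to 25 to 277, so each extra mismatch allowed multiplies the lookup work.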
One algorithm that marches to a different
drummer is MPSRCH, which has opted for sensitivity over speed.
MPSRCH claims to implement the gold standard, most sensitive
algorithm known, namely full Smith-Waterman dynamic programming. I
tried their algorithm on their website and it's incredibly fast - I
wonder how they do it!
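For reference, here is the textbook Smith-Waterman recurrence in a few lines of Python. This says nothing about how MPSRCH does it; the scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for the sketch, and real protein searches use substitution matrices and fancier gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b, with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACGTAAGC", "TTACGTTAGCTT"))
```

Every letter of the query is compared against every letter of the database, so the work grows with the product of the two lengths. That is why full Smith-Waterman is normally the slow, sensitive option.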
- NG
Nat Goodman, PhD, helped found the
Whitehead/MIT Center for Genome Research, directed a bioinformatics
group at the Jackson Laboratory and led a bioinformatics marketing
team for Compaq Computer. He is currently a senior research
scientist at the Institute for Systems Biology and an affiliate
professor of bioinformatics at University of Alaska-Fairbanks. Send
your comments to Nat at ngoodman@genomeweb.com.
TABLES
More on Knowledge Mining
| Product | Company | Notes | URL |
| CELL | Incellico | | http://www.incellico.com/ |
| CellSpace Knowledge Miner | Cellomics | | http://www.cellomics.com/ |
| DiscoveryInsight | BioWisdom | | http://www.biowisdom.com/ |
| Gene Ontology Knowledge Discovery System (GO KDS) | GeneEd & Reel Two | Pre-release | http://www.geneed.com/ http://www.reeltwo.com/ |
| | Ingenuity | | http://www.ingenuity.com/ |
| | Celera & ClearForest | Press release | http://www.clearforest.com/ whatsnew/press_releases.asp?id=24 |
Community Software
| Organization | URL |
| BioPerl | http://www.bioperl.org/ |
| Microarray Gene Expression Data (MGED) Society | http://www.mged.org/ |
Help on High-Performance Sequence Analysis
| Program | Source | Web Server | Software Availability | Remarks | URL |
| BLAT | Jim Kent, University of California at Santa Cruz | Yes | Free for academics | Used in Santa Cruz genome browser | http://www.soe.ucsc.edu/ kent/ |
| FLAG | Biomedical Engineering Center, Industrial Technology Research Institute of Taiwan | Yes | | | http://flag.itri.org.tw/ |
| MPSRCH | Aneda | Demo | | Complete Smith-Waterman | http://www.anedabio.com/ |
| MUMmer 2 | TIGR | | Free for academics | Based on suffix trees | http://www.tigr.org/software/mummer/ |
| PatternHunter | Bioinformatics Solutions | Demo | Free for academics | | http://www.bioinformaticssolutions.com/ |
| WABA | Jim Kent, University of California at Santa Cruz | Yes | Free for academics | | http://www.soe.ucsc.edu/ kent/ |
Pathways Packages: ACADEMIC
| Package | Authors | Notes | URL |
| BioQuest | Brian White | Can be ordered through the ePress Project at the University of Maryland | http://omega.cc.umb.edu/ bwhite/ek.html |
| BioSpice | Adam Arkin | Some software can be downloaded, but the major work, BioSpice, seems to be hidden in a private area | http://www.lbl.gov/ aparkin |
| DBSolve | Igor Goryanin | Free | http://websites.ntl.com/ igor.goryanin |
| DynaFit | BioKin Ltd. | Free for academics; free trial version for all | http://www.biokin.com/ |
| E-Cell | Masaru Tomita | Open source | http://www.e-cell.org/ |
| Electric Arc | Gene Selkov | Diagramming tool; apparently open source | http://home.xnet.com/ selkovjr/ElectricArc/ |
| Gepasi | Pedro Mendes | Free | http://www.gepasi.org/ |
| Jarnac/JDesigner | Herbert Sauro | Open source | http://www.cds.caltech.edu/ hsauro/ |
| NetBuilder | Hamid Bolouri | Free | http://strc.herts.ac.uk/bio/maria/NetBuilder/index.html |
| StochSim | Dennis Bray | Open source | http://www.zoo.cam.ac.uk/comp-cell/StochSim.html |
| VCell | Jim Schaff | Web server; software not available | http://www.nrcam.uchc.edu/ |
| WinScamp | Herbert Sauro | Open source | http://www.cds.caltech.edu/ hsauro/ |
Pathways Packages: COMMERCIAL
| Product | Company | URL |
| DigitalCell/VisualCell | Gene Network Sciences | http://www.gnsbiotech.com/ |
| PathwayPrism | Physiome | http://www.physiome.com/ |
| PhysioLab | Entelos | http://www.entelos.com/ |
Pathways Packages: STANDARDS AND INTEREST GROUPS
| Product | Notes | URL |
| CellML | Co-development of Physiome and the Bioengineering Institute, University of Auckland | http://www.cellml.org/ |
| Systems Biology Markup Language (SBML) | Part of Systems Biology Workbench (SBW), ERATO Kitano | http://www.cds.caltech.edu/ erato/sbml/docs/ |
| BioPathways Consortium | Systems Biology Project | http://www.biopathways.org/ |