Investigators
Bo Han, M.S.
Brown Celeste, Ph.D.
Dunker Keith, Ph.D.
Garner Ethan
Iakoucheva Lilia M., Ph.D.
Lawson J. David, Ph.D.
Li Xiaohong, Ph.D.
O’Connor Tim
Obradovic Zoran, Ph.D.
Peng Kang, M.S.
Radivojac Predrag, M.Sc.
Romero Pedro, Ph.D.
Vucetic Slobodan, Ph.D.
Xie Hongbo, M.S.
Wang Junping, Ph.D.
Problem
Protein function is generally thought to follow from the prior formation
of a specific three-dimensional structure. In contrast to this view,
many proteins that require a lack of three-dimensional structure
for function have been reported in the literature over the
last 50 years. These “intrinsically disordered” proteins exist as
structural ensembles, either at the secondary or tertiary structure
level. In other words, disordered proteins or regions have atomic
coordinates and Ramachandran angles that vary significantly over
time. Both extended (i.e., random coil-like) regions - with perhaps
some secondary structure - and collapsed (i.e., partially folded
or molten globule-like) domains - with poorly packed secondary structure
units - are included in this definition. The existence of proteins
with intrinsic disorder calls for a re-assessment of the
view that prior folding into 3-D structure is always required for
protein function, a view sometimes called “the protein structure-function
paradigm.”
Results
In summary, our bioinformatics work provides strong evidence regarding
the importance of disordered protein. Recently, Peter Wright,
who is Editor in Chief of the Journal of Molecular Biology, and
H. Jane Dyson emphasized the importance of our results to the molecular
biology community in the first section of a survey on intrinsically
unstructured proteins (J. Mol. Biol. v. 293:321-331, 1999). Our
results suggest that there is a need to critically re-assess the protein
structure-function paradigm taken for granted by most molecular
biologists.
Protein function is not only the basis for interpreting the
data from the human genome project, but also one of the cornerstones
of molecular biology. Our work therefore has the potential for widespread
impact, not only in academia, but also across the biotechnology
and pharmaceutical industries.
Summary
Towards the objective of understanding the
commonness, flavors, complexity and function of protein disorder,
we assembled a database of known disordered protein sequence segments
and used it for developing predictors of protein disorder from primary
sequence information. The preliminary results were obtained by analyzing
sequences from the Protein Data Bank (PDB), the Swiss Protein (SwissProt)
database, and 34 complete or nearly complete genomes. In summary,
these prior studies provide strong evidence that: (1) disorder is
a very common element of protein structure; (2) the strength of
disorder prediction is correlated with sequence complexity; and
(3) eukaryotes evidently have a much larger fraction of proteins
with intrinsic disorder than eubacteria or archaebacteria.
Prediction of disorder from sequence
Since amino acid sequence determines protein 3D structure, we reasoned
that, if disorder were crucial to function, then amino acid sequence
would determine lack of 3D structure, or disorder, as well. To test
the hypothesis that disorder is encoded by the sequence, we have
assembled a dataset of ordered and disordered protein sequence segments
and used it to develop several predictors of disorder. Observed
prediction accuracies were in the 70-83% range [Romero,
P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., and Dunker,
A.K., Proc. Pacific Symposium
on Biocomputing, Hawaii, 1998, vol. 3, pp. 435-446][Romero,
P., Obradovic, Z., and Dunker, A.K., Artificial Intelligence Review,
2000, Vol. 14, No. 6, S2, pp. 447-484][Romero,
P., Obradovic, Z., and Dunker, A.K., Proc. IEEE Int. Conf. on Neural
Networks, Houston, TX, 1997, vol. 1, pp. 90-95][Garner,
E., Cannon, P., Romero, P., Obradovic, Z., and Dunker, A.K., Proc.
Genome Informatics 1998, Tokyo, Japan, pp. 201-213][Li,
X., Romero, P., Rani, M., Dunker, A.K., and Obradovic, Z., Proc.
Genome Informatics 10, Tokyo, Japan, 1999, pp.
30-40]. These accuracies far exceed the 50% expected by chance, indicating
that disorder is indeed very likely to be encoded by the sequence.
Our most accurate predictor [Vucetic,
S., Radivojac, P., Obradovic, Z., Brown, C.J., and Dunker, A.K.,
Proc. 2001 IEEE/INNS International Joint Conference on Neural Networks,
Washington D.C., 2001, vol. 4, pp. 2718-2723] with 82.6% overall
accuracy (88.8% accuracy on ordered proteins, and 76.5% accuracy
on disordered proteins) is an ensemble of neural networks. However,
the difference in accuracy as compared to logistic regression classifiers
is smaller than 1% [Vucetic,
S., Radivojac, P., Obradovic, Z., Brown, C.J., and Dunker, A.K.,
Proc. 2001 IEEE/INNS International Joint Conference on Neural Networks,
Washington D.C., 2001, vol. 4, pp. 2718-2723]. Such relatively
high accuracies strongly support the hypothesis that disorder is
an element of native protein structure that is encoded by the amino
acid sequence.
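As a quick arithmetic check on the figures quoted above: if the evaluation set contained equal numbers of ordered and disordered examples (an assumption, suggested by the 50% chance baseline but not stated explicitly here), the overall accuracy would be approximately the mean of the two per-class accuracies. The function name below is illustrative only.

```python
def balanced_accuracy(acc_ordered, acc_disordered):
    """Mean of the per-class accuracies (assumes equally sized classes)."""
    return (acc_ordered + acc_disordered) / 2.0


# Per-class accuracies quoted in the text for the neural network ensemble.
overall = balanced_accuracy(88.8, 76.5)
print(f"{overall:.2f}")
```

Under that assumption the mean lands within rounding distance of the reported 82.6% overall accuracy; any small gap would be expected if the published figure was computed from unrounded per-class values.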
Understanding the relationship between protein sequence and disordered protein.
We have constructed more than 6,000 composition-based and 265 property-based
sequence attributes and evaluated them with respect to their ability to discriminate
protein order and disorder [Li,
X., Obradovic, Z., Brown, C.J., Garner, E.C., and Dunker, A.K.,
Proc. Genome Informatics 11, Tokyo, Japan, 2000, pp. 172-184][Williams,
R.M., Obradovic, Z., Mathura, V., Braun, W., Garner, E.C., Young,
J., Takayama, S., Brown, C.J., and Dunker, A.K., 2000, Proc.
6th Pacific Symposium on Biocomputing, Maui,
Hawaii, pp. 89-100].
Our studies [Romero,
P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., and Dunker,
A.K., Proc. IEEE Int.
Conf. on Neural Networks, Houston, TX, 1997, vol. 1, pp. 90-95]
[Xie,
Q., Arnold, G.E., Romero, P., Obradovic, Z., Garner, E., and Dunker,
A.K., Proc. Genome Informatics 1998, Tokyo, Japan, pp. 193-200]
suggest that, compared to ordered sequences, disordered sequences
tend to have lower aromatic content, higher net charge, higher values
for the flexibility indices, and greater values for hydropathy as
well as other identifiable characteristics. Although ordered globular
proteins apparently have a lower bound for sequence complexity [Romero,
P., Obradovic, Z., and Dunker, A.K., FEBS
Letters, 1999, vol. 462, pp. 363-367], disorder does not have
such a lower bound [Romero,
P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., and Dunker,
A.K., Proteins: Structure,
Function and Genetics, 2001, vol. 42, pp. 38-48]. Overall,
the sequence differences observed between ordered and disordered
proteins make biochemical sense. Having amino acid compositions
that would be expected to lead to disorder adds weight to the view
that disorder is indeed encoded by the sequence.
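A few of the sequence attributes mentioned above can be sketched in code. This is a minimal illustration, not the attribute definitions used in the cited studies: the residue groupings (Phe, Trp, Tyr as aromatic; Lys and Arg as +1; Asp and Glu as -1; His treated as neutral) and the use of Shannon entropy as the sequence complexity measure are assumptions made for the example.

```python
import math
from collections import Counter

# Assumed residue groupings (one-letter codes); not taken from the cited papers.
AROMATIC = set("FWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def aromatic_fraction(seq):
    """Fraction of residues that are aromatic."""
    return sum(1 for aa in seq if aa in AROMATIC) / len(seq)


def net_charge(seq):
    """Count of positive residues minus count of negative residues."""
    positives = sum(1 for aa in seq if aa in POSITIVE)
    negatives = sum(1 for aa in seq if aa in NEGATIVE)
    return positives - negatives


def shannon_entropy(seq):
    """Shannon entropy (bits) of the residue composition; lower means less complex."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


# Made-up example segment, for illustration only.
segment = "MSEEKKSPEEKKSPSGSE"
print(aromatic_fraction(segment), net_charge(segment), shannon_entropy(segment))
```

In practice such attributes would be computed over a window of residues and fed to a classifier; the window size and classifier are left out here because the text does not specify them.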
Estimation of the commonness of protein disorder.
Proteins with long disordered regions (>40 amino acids) were
occasionally found in protein structures characterized by X-ray
diffraction [Romero,
P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., and Dunker,
A.K., Proc. IEEE Int. Conf. on Neural Networks,
Houston, TX, 1997,
vol. 1, pp. 90-95]. We applied our predictors to sequence and
structure databases (SwissProt and PDB, respectively) with the result
that disorder appears to be much more common than previously thought.
Conservative estimates indicate that at least 25% of the sequences
in SwissProt contain long disordered regions [Romero,
P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., and Dunker,
A.K., Proteins: Structure, Function and Genetics, 2001, vol. 42,
pp. 38-48][Romero,
P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., Guilliot,
S., Garner, E., and Dunker, A.K., Proc. Pacific Symposium on Biocomputing,
Hawaii, 1998, vol. 3, pp. 435-446]. A similar analysis of 34 complete
or nearly complete genomes yielded estimates that the percentage of proteins
with long disorder in 22 bacteria, 7 archaea, and 5 eukaryotes ranges
from 7-33%, 9-37%, and 36-63%, respectively [Dunker,
A.K., Obradovic, Z., Romero, P., Garner, E.C., and Brown, C.J.,
Proc. Genome Informatics
11, Tokyo, Japan, 2000, pp. 161-171].
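The counting step behind estimates of this kind can be sketched as follows, assuming a predictor has already produced a per-residue ordered/disordered label for each protein. The function names, the data layout (a dict of boolean lists), and the toy labels are hypothetical; only the ">40 amino acids" threshold comes from the text.

```python
def longest_disordered_run(labels):
    """Length of the longest run of consecutive residues labeled disordered."""
    longest = current = 0
    for is_disordered in labels:
        current = current + 1 if is_disordered else 0
        longest = max(longest, current)
    return longest


def fraction_with_long_disorder(predictions, min_length=40):
    """Fraction of proteins whose longest predicted disordered run exceeds min_length."""
    if not predictions:
        return 0.0
    hits = sum(
        1 for labels in predictions.values()
        if longest_disordered_run(labels) > min_length
    )
    return hits / len(predictions)


# Toy, made-up per-residue labels (True = predicted disordered).
toy_predictions = {
    "protein_a": [True] * 45 + [False] * 100,
    "protein_b": [False] * 80 + [True] * 40,  # exactly 40 does not count
    "protein_c": [True] * 20 + [False] + [True] * 25,
    "protein_d": [False] * 50 + [True] * 60 + [False] * 30,
}
print(fraction_with_long_disorder(toy_predictions))
```

Requiring a long consecutive run, rather than a total count of disordered residues, makes the estimate conservative with respect to scattered single-residue prediction errors.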
Evolution of disordered protein.
Differences in the amino-acid composition of ordered and disordered
protein may result in or from evolutionary differences between these
two types of protein. We find that both the quantity and quality
of amino-acid replacements in disordered protein differ from those in ordered protein.
We recently completed an evolutionary study of 28 protein families
with ordered and disordered regions, and found that 20 of the families
have disordered regions that evolve significantly more rapidly than
their ordered regions, and 3 families have disordered regions that
evolve more slowly [Brown, C.J., Takayama, S., Campen, A.M., Vise,
P., Marshall, T., Oldfield, C.J., Williams, C.J., and Dunker, A.K.,
2002]. Differences in amino-acid composition may also affect the
types of amino acid replacements that accumulate in disordered protein.
Matrices that furnish the probability for replacing a given amino
acid by another are generally based on ordered protein sequences.
We are developing scoring matrices using disordered protein families.
We find that scoring matrices based on disordered protein are more
successful in aligning homologous disordered protein sequences than
the commonly used scoring matrices [Radivojac,
P., Obradovic, Z., Brown, C.J., and Dunker, A.K., Proc. 7th Pacific
Symposium on Biocomputing, Hawaii, 2002, pp. 589-600].
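One generic way to derive a scoring matrix from aligned sequences is a log-odds calculation in the style of the BLOSUM matrices: each score compares how often two residues are observed aligned with how often they would be aligned by chance. The sketch below illustrates that general idea only; it is not the procedure of the cited paper, and the function name, input format, and toy data are hypothetical.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Log-odds (base 2) scores from (residue, residue) pairs taken from aligned columns."""
    # Count unordered pairs so that (A, S) and (S, A) are treated alike.
    pair_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Background residue frequencies derived from the same pairs.
    residue_counts = Counter()
    for (a, b), count in pair_counts.items():
        residue_counts[a] += count
        residue_counts[b] += count
    total_residues = 2 * total_pairs
    freq = {r: c / total_residues for r, c in residue_counts.items()}

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        # Unordered mismatches can arise two ways (a,b) or (b,a).
        expected = freq[a] * freq[b] * (1 if a == b else 2)
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Toy, made-up aligned pairs: identities only, so identities score above chance.
toy_pairs = [("A", "A"), ("A", "A"), ("S", "S"), ("S", "S")]
print(log_odds_scores(toy_pairs))
```

Pairs never observed in the input receive no score here; a practical matrix would need pseudocounts, sequence weighting or clustering, and integer scaling, all of which are omitted.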
Function of confirmed disordered proteins.
We recently completed a survey of functions associated with disordered
protein from over 100 proteins [Dunker,
A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., and Obradovic,
Z., Biochemistry, 2002,
May 28th, vol. 41, issue 21, pp. 6573-6582]. Disordered protein
was identified either by missing electron density in X-ray crystal
structure entries in PDB, or by word
searches for “NMR” or “circular dichroism” and “disordered”
or “unstructured” or “unfolded” in PubMed. The circular dichroism
papers generally had detailed discussions of the functions of their
disordered protein. NMR papers had somewhat less functional information,
and X-ray crystallography papers had very little functional information
for disordered regions. In order to find as much functional information
as possible for each disordered region, the SwissProt database was
searched, in-depth literature reviews were performed, and corresponding
authors were contacted by email. We found 28 functions performed
by the disordered regions of proteins. These functions can be summarized
into four broad categories: molecular recognition, molecular assembly/disassembly,
protein modification and entropic chains.
Disorder in cell-signaling and cancer.
Many disordered regions
are involved in binding to DNA, RNA, or other proteins [Dunker,
A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., and Obradovic,
Z., Biochemistry,
2002, May 28th, vol. 41, issue 21, pp. 6573-6582].
This observation led to the hypothesis that disorder plays
an important role in the processes of molecular recognition, signaling
and regulation. To test this hypothesis, we applied our predictor
of disorder to a database of signaling proteins involved in the
broadest cascade of macromolecular interactions. Cancer-associated
proteins were also tested, since they are closely related to
the cell signaling machinery; many are transcription factors overexpressed
as a result of activation during tumorigenesis. We found that there
is significantly more predicted disorder in signaling and cancer-associated
proteins than in several other categories of protein function, such
as metabolism, biosynthesis and degradation [Iakoucheva,
L.M., Brown, C.J., Lawson, J.D., Obradovic, Z., and Dunker, A.K.,
Journal of Molecular
Biology, 2002, vol. 323, pp. 573-584].