Zoran Obradovic - Bioinformatics

Bioinformatics, Disordered Proteins and Function

Investigators

Bo Han, M.S.
Brown Celeste, Ph.D.
Dunker Keith, Ph.D.
Garner Ethan
Iakoucheva Lilia M., Ph.D.
Lawson J. David, Ph.D.
Li Xiaohong, Ph.D.
O'Connor Tim
Obradovic Zoran, Ph.D.
Peng Kang, M.S.
Radivojac Predrag, M.Sc.
Romero Pedro, Ph.D.
Vucetic Slobodan, Ph.D.
Xie Hongbo, M.S.
Wang Junping, Ph.D.

Problem

Protein function is generally thought to follow from the prior formation of a specific three-dimensional structure. In contrast to this view, many proteins that require a lack of three-dimensional structure for function have been reported in the literature over the last 50 years. These "intrinsically disordered" proteins exist as structural ensembles at either the secondary or the tertiary structure level. In other words, disordered proteins or regions have atomic coordinates and Ramachandran angles that vary significantly over time. Both extended (i.e., random coil-like) regions, with perhaps some secondary structure, and collapsed (i.e., partially folded or molten globule-like) domains, with poorly packed secondary structure units, are included in this definition. The existence of proteins with intrinsic disorder calls for a re-assessment of the view that prior folding into 3-D structure is always required for protein function, a view sometimes called "the protein structure-function paradigm."

Results

In summary, our bioinformatics work provides strong evidence for the importance of disordered protein. Recently, Peter Wright, who is Editor in Chief of the Journal of Molecular Biology, and H. Jane Dyson emphasized the importance of our results to the molecular biology community in the first section of a survey on intrinsically unstructured proteins (J. Mol. Biol. v. 293:321-331, 1999). Our results suggest that there is a need to critically re-assess the protein structure-function paradigm taken for granted by most molecular biologists.
Protein function is not only the basis for interpreting the data from the human genome project, but also one of the cornerstones of molecular biology. Our work therefore has the potential for widespread impact, not only in academia, but also across the biotechnology and pharmaceutical industries.

Summary

Towards the objective of understanding the commonness, flavors, complexity and function of protein disorder, we assembled a database of known disordered protein sequence segments and used it to develop predictors of protein disorder from primary sequence information. The preliminary results were obtained by analyzing sequences from the Protein Data Bank (PDB), the Swiss Protein (SwissProt) database, and 34 complete or nearly complete genomes. In summary, these prior studies provide strong evidence that: (1) disorder is a very common element of protein structure; (2) the strength of disorder prediction is correlated with sequence complexity; and (3) eukaryotes evidently have a much larger fraction of proteins with intrinsic disorder than eubacteria or archaebacteria.

Prediction of disorder from sequence

Since amino acid sequence determines protein 3-D structure, we reasoned that, if disorder is crucial to function, then amino acid sequence should determine lack of 3-D structure, or disorder, as well. To test the hypothesis that disorder is encoded by the sequence, we assembled a dataset of ordered and disordered protein sequence segments and used it to develop several predictors of disorder. Observed prediction accuracies were in the 70-83% range [Romero, P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., and Dunker, A.K., Proc. Pacific Symposium on Biocomputing, Hawaii, 1998, vol. 3, pp. 435-446][Romero, P., Obradovic, Z., and Dunker, A.K., Artificial Intelligence Review, 2000, Vol. 14, No. 6, S2, pp. 447-484][Romero, P., Obradovic, Z., and Dunker, A.K., Proc. IEEE Int. Conf. on Neural Networks, Houston, TX, 1997, vol. 1, pp. 90-95][Garner, E., Cannon, P., Romero, P., Obradovic, Z., and Dunker, A.K., Proc. Genome Informatics 1998, Tokyo, Japan, pp. 201-213][Li, X., Romero, P., Rani, M., Dunker, A.K., and Obradovic, Z., Proc. Genome Informatics 10, Tokyo, Japan, 1999, pp. 30-40]. These accuracies far exceed the 50% expected by chance, demonstrating that disorder is indeed very likely to be encoded by the sequence. Our most accurate predictor [Vucetic, S., Radivojac, P., Obradovic, Z., Brown, C.J., and Dunker, A.K., Proc. 2001 IEEE/INNS International Joint Conference on Neural Networks, Washington D.C., 2001, vol. 4, pp. 2718-2723], with 82.6% overall accuracy (88.8% accuracy on ordered proteins, and 76.5% accuracy on disordered proteins), is an ensemble of neural networks. However, its accuracy exceeds that of logistic regression classifiers by less than 1% [Vucetic, S., Radivojac, P., Obradovic, Z., Brown, C.J., and Dunker, A.K., Proc. 2001 IEEE/INNS International Joint Conference on Neural Networks, Washington D.C., 2001, vol. 4, pp. 2718-2723].
Such relatively high accuracies strongly support the hypothesis that disorder is an element of native protein structure that is encoded by the amino acid sequence.
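The overall figure quoted above is consistent with an unweighted mean of the two per-class accuracies, which is what one would expect if the 50% chance baseline refers to a class-balanced evaluation. The following Python sketch illustrates that arithmetic. The function names and the toy labels are hypothetical, and the assumption that the reported overall accuracy is a balanced mean is ours, not a statement from the cited paper.

```python
def per_class_accuracy(y_true, y_pred, label):
    """Fraction of examples whose true class is `label` that were predicted as `label`."""
    hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
    if not hits:
        raise ValueError(f"no examples with true label {label!r}")
    return sum(hits) / len(hits)


def balanced_accuracy(y_true, y_pred, labels=("O", "D")):
    """Unweighted mean of the per-class accuracies."""
    return sum(per_class_accuracy(y_true, y_pred, c) for c in labels) / len(labels)


# Toy labels, invented for illustration: "O" = ordered, "D" = disordered.
y_true = list("OOOODDDD")
y_pred = list("OOODDDOO")
toy_balanced = balanced_accuracy(y_true, y_pred)  # (3/4 + 2/4) / 2

# Unweighted mean of the per-class figures quoted in the text.
reported_mean = (88.8 + 76.5) / 2
```

With the quoted per-class values, the unweighted mean is 82.65, which agrees with the reported 82.6% to within rounding.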

Understanding the relationship between protein sequence and disordered protein.

We have constructed more than 6,000 composition-based and 265 property-based sequence attributes and evaluated their ability to discriminate protein order from disorder [Li, X., Obradovic, Z., Brown, C.J., Garner, E.C., and Dunker, A.K., Proc. Genome Informatics 11, Tokyo, Japan, 2000, pp. 172-184][Williams, R.M., Obradovic, Z., Mathura, V., Braun, W., Garner, E.C., Young, J., Takayama, S., Brown, C.J., and Dunker, A.K., 2000, Proc. 6th Pacific Symposium on Biocomputing, Maui, Hawaii, pp. 89-100]. Our studies [Romero, P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., and Dunker, A.K., Proc. IEEE Int. Conf. on Neural Networks, Houston, TX, 1997, vol. 1, pp. 90-95][Xie, Q., Arnold, G.E., Romero, P., Obradovic, Z., Garner, E., and Dunker, A.K., Proc. Genome Informatics 1998, Tokyo, Japan, pp. 193-200] suggest that, compared to ordered sequences, disordered sequences tend to have lower aromatic content, higher net charge, higher values for the flexibility indices, and greater values for hydropathy, as well as other identifiable characteristics. Although ordered globular proteins apparently have a lower bound for sequence complexity [Romero, P., Obradovic, Z., and Dunker, A.K., FEBS Letters, 1999, vol. 462, pp. 363-367], disorder does not have such a lower bound [Romero, P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., and Dunker, A.K., Proteins: Structure, Function and Genetics, 2001, vol. 42, pp. 38-48]. Overall, the sequence differences observed between ordered and disordered proteins make biochemical sense. The finding that disordered regions have amino acid compositions that would be expected to lead to disorder adds weight to the view that disorder is indeed encoded by the sequence.
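Three of the attributes mentioned above (aromatic content, net charge, and sequence complexity) can be computed directly from a sequence segment. The Python sketch below is illustrative only. The residue sets (F, W, Y as aromatic; K, R as positive; D, E as negative) and the use of Shannon entropy of the residue composition as a complexity measure are our assumptions, and the function names are hypothetical rather than taken from the cited papers.

```python
import math
from collections import Counter

AROMATIC = set("FWY")   # assumed aromatic residues
POSITIVE = set("KR")    # assumed positively charged residues
NEGATIVE = set("DE")    # assumed negatively charged residues


def aromatic_content(segment):
    """Fraction of residues in the segment that are aromatic."""
    return sum(res in AROMATIC for res in segment) / len(segment)


def net_charge(segment):
    """Count of positive residues minus count of negative residues."""
    pos = sum(res in POSITIVE for res in segment)
    neg = sum(res in NEGATIVE for res in segment)
    return pos - neg


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of the segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def windows(sequence, size):
    """Yield every contiguous window of the given size."""
    for start in range(len(sequence) - size + 1):
        yield sequence[start:start + size]
```

Applying these functions to each window returned by `windows(sequence, size)` gives per-position attribute profiles of the kind that could be fed to a classifier. A segment made of one repeated residue has entropy 0, the lowest possible complexity.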

Estimation of the commonness of protein disorder.

Proteins with long disordered regions (>40 amino acids) were occasionally found in protein structures characterized by X-ray diffraction [Romero, P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., and Dunker, A.K., Proc. IEEE Int. Conf. on Neural Networks, Houston, TX, 1997, vol. 1, pp. 90-95]. We applied our predictors to sequence and structure databases (SwissProt and PDB, respectively) and found that disorder appears to be much more common than previously thought. Conservative estimates indicate that at least 25% of the sequences in SwissProt contain long disordered regions [Romero, P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., and Dunker, A.K., Proteins: Structure, Function and Genetics, 2001, vol. 42, pp. 38-48][Romero, P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., Guilliot, S., Garner, E., and Dunker, A.K., Proc. Pacific Symposium on Biocomputing, Hawaii, 1998, vol. 3, pp. 435-446]. A similar analysis of 34 complete genomes yielded estimates that the percentage of proteins with long disorder in 22 bacteria, 7 archaea, and 5 eukaryotes ranges from 7-33%, 9-37%, and 36-63%, respectively [Dunker, A.K., Obradovic, Z., Romero, P., Garner, E.C., and Brown, C.J., Proc. Genome Informatics 11, Tokyo, Japan, 2000, pp. 161-171].
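Given per-residue order/disorder predictions, the fraction of proteins containing a long disordered region can be estimated by scanning each prediction for its longest consecutive disordered run. The Python sketch below assumes each prediction is a string of "D" (disordered) and "O" (ordered) labels, one per residue. That encoding and the function names are hypothetical, and the threshold of 41 residues simply reflects the ">40 amino acids" criterion above.

```python
from itertools import groupby


def longest_disordered_run(labels, disordered="D"):
    """Length of the longest run of consecutive disordered residues."""
    return max(
        (sum(1 for _ in run) for key, run in groupby(labels) if key == disordered),
        default=0,
    )


def has_long_disorder(labels, min_len=41):
    """True if the prediction contains a disordered run of more than 40 residues."""
    return longest_disordered_run(labels) >= min_len


def fraction_with_long_disorder(predictions, min_len=41):
    """Fraction of proteins whose prediction contains a long disordered region."""
    if not predictions:
        raise ValueError("no predictions supplied")
    return sum(has_long_disorder(p, min_len) for p in predictions) / len(predictions)


# Toy predictions, invented for illustration.
toy = [
    "D" * 41,                    # one run of 41: long disorder
    "O" * 100,                   # fully ordered
    "O" * 10 + "D" * 50,         # run of 50: long disorder
    "D" * 40 + "O" + "D" * 40,   # two runs of 40: not long
]
toy_fraction = fraction_with_long_disorder(toy)
```

Requiring a consecutive run, rather than a total count of disordered residues, keeps proteins with many short disordered stretches from being counted as having long disorder.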

Evolution of disordered protein.

Differences in the amino-acid composition of ordered and disordered protein may result in or from evolutionary differences between these two types of protein. We find that both the quantity and the quality of amino-acid replacements in disordered protein differ from those in ordered protein. We recently completed an evolutionary study of 28 protein families with ordered and disordered regions, and found that 20 of the families have disordered regions that evolve significantly more rapidly than their ordered regions, and 3 families have disordered regions that evolve more slowly [Brown, C.J., Takayama, S., Campen, A.M., Vise, P., Marshall, T., Oldfield, C.J., Williams, C.J., and Dunker, A.K., 2002]. Differences in amino-acid composition may also affect the types of amino acid replacements that accumulate in disordered protein. Matrices that furnish the probability of replacing a given amino acid by another are generally based on ordered protein sequences. We are developing scoring matrices using disordered protein families. We find that scoring matrices based on disordered protein are more successful in aligning homologous disordered protein sequences than the commonly used scoring matrices [Radivojac, P., Obradovic, Z., Brown, C.J., and Dunker, A.K., Proc. 7th Pacific Symposium on Biocomputing, Hawaii, 2002, pp. 589-600].
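To illustrate how a replacement scoring matrix can be derived from a set of aligned sequences, the Python sketch below computes generic BLOSUM-style log-odds scores from pairwise alignments. This is a standard textbook construction shown for orientation only. It is not the procedure used in the cited paper, and the function name and input format (pairs of equal-length aligned strings with "-" for gaps) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def log_odds_scores(aligned_pairs):
    """Log-odds (base 2) scores for residue pairs observed in the alignments.

    Keys of the returned dict are alphabetically sorted residue pairs.
    """
    pair_counts = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned sequences must have equal length")
        for x, y in zip(a, b):
            if x == "-" or y == "-":
                continue  # skip gapped columns
            pair_counts[tuple(sorted((x, y)))] += 1

    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("no ungapped aligned columns")

    # Observed frequency of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the pairs.
    p = defaultdict(float)
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = math.log2(freq / expected)
    return scores


# Toy alignment, invented for illustration.
toy_scores = log_odds_scores([("AASS", "AASA")])
```

A positive score means the pair is observed more often than its background frequencies predict, and a negative score means less often. Building the counts from disordered rather than ordered families changes both the observed pair frequencies and the background composition, which is why the resulting matrices can differ.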

Function of confirmed disordered proteins.

We recently completed a survey of functions associated with disordered protein from over 100 proteins [Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., and Obradovic, Z., Biochemistry, 2002, May 28th, vol. 41, issue 21, pp. 6573-6582]. Disordered protein was identified either by missing electron density in X-ray crystal structure entries in PDB, or by word searches for "NMR" or "circular dichroism" and "disordered" or "unstructured" or "unfolded" in PubMed. The circular dichroism papers generally had detailed discussions of the functions of their disordered protein. NMR papers had somewhat less functional information, and X-ray crystallography papers had very little functional information for disordered regions. To find as much functional information as possible for each disordered region, we searched the SwissProt database, performed in-depth literature reviews, and contacted corresponding authors by email. We found 28 functions performed by the disordered regions of proteins. These functions can be summarized in four broad categories: molecular recognition, molecular assembly/disassembly, protein modification and entropic chains.

Disorder in cell-signaling and cancer.

Many disordered regions are involved in binding to DNA, RNA, or other proteins [Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., and Obradovic, Z., Biochemistry, 2002, May 28th, vol. 41, issue 21, pp. 6573-6582]. This observation led to the hypothesis that disorder plays an important role in the processes of molecular recognition, signaling and regulation. To test this hypothesis, we applied our predictor of disorder to a database of signaling proteins involved in the broadest cascade of macromolecular interactions. Cancer-associated proteins were also tested, since they are closely related to the cell signaling machinery; many are transcription factors overexpressed as a result of activation during tumorigenesis. We found that there is significantly more predicted disorder in signaling and cancer-associated proteins than in several other categories of protein function, such as metabolism, biosynthesis and degradation [Iakoucheva, L.M., Brown, C.J., Lawson, J.D., Obradovic, Z., and Dunker, A.K., Journal of Molecular Biology, 2002, vol. 323, pp. 573-584].

© 2007 Center for IST, Temple University