Methods Inf Med 2001; 40(04): 346-358
DOI: 10.1055/s-0038-1634431
Original Article
Schattauer GmbH

What is Bioinformatics? A Proposed Definition and Overview of the Field

N. M. Luscombe 1, D. Greenbaum 1, M. Gerstein 1
1   Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, USA

Publication History

Publication Date: 08 February 2018 (online)

Summary

Background: The recent flood of data from genome sequences and functional genomics has given rise to a new field, bioinformatics, which combines elements of biology and computer science.

Objectives: Here we propose a definition for this new field and review some of the research that is being pursued, particularly in relation to transcriptional regulatory systems.

Methods: Our definition is as follows: Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied mathematics, computer science, and statistics) to understand and organize the information associated with these molecules, on a large scale.

Results and Conclusions: Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (e.g., expression data). Additional information includes the text of scientific papers and “relationship data” from metabolic pathways, taxonomy trees, and protein-protein interaction networks. Bioinformatics employs a wide range of computational techniques, including sequence and structural alignment, database design and data mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and function, gene finding, and expression data clustering. The emphasis is on approaches that integrate a variety of computational methods and heterogeneous data sources. Finally, bioinformatics is a practical discipline. We survey some representative applications, such as finding homologues, designing drugs, and performing large-scale censuses. Additional information pertinent to the review is available on the web at http://bioinfo.mbb.yale.edu/what-is-it.

 