Bioinformatics: May 2007

Thursday, May 31, 2007

Homology Modeling

INTRODUCTION -
The functional characterization of a protein sequence is one of the most frequent problems in biology. It is greatly helped by an accurate 3D structure of the protein being studied.
In the absence of an experimentally determined structure, comparative/homology modeling can sometimes provide a useful 3D model for a protein (target) that is related to at least one known protein structure (template).
Despite the progress in ab initio protein structure prediction, comparative modeling remains the only method that can reliably predict the 3D structure of a protein with accuracy comparable to that of an experimentally determined structure.
The 3D structures of proteins from the same family are more conserved than their primary sequences. Therefore, if similarity between two proteins is detectable at the sequence level, structural similarity can usually be assumed.
Moreover, proteins that share low or even non-detectable sequence similarity often have similar structures.
There are several computer programs and web servers that automate the comparative modeling process (e.g. the SWISS-MODEL server). These construct an atomic-resolution model of a protein from its amino acid sequence.
The quality of the model depends on the quality of the sequence alignment and of the template structure.

COMPARATIVE MODELING STEPS -
1) SEARCHING FOR RELATED PROTEIN STRUCTURE
Comparative modeling usually starts by searching the PDB (Protein Data Bank) of known protein structures using the target sequence as the query. This search is usually done by comparing the target sequence with the sequence of each structure in the database. A variety of sequence-sequence comparison methods can be used. Frequently, the availability of many sequences related to the target or to the potential templates allows more sensitive searching with sequence profile methods and hidden Markov models (HMMs). A good starting point for template searches is the many database search servers on the Internet.

2) SELECTING TEMPLATES
Once a list of potential templates has been obtained, it is necessary to select one or more templates that are appropriate for the particular modeling problem. Factors taken into account when selecting templates:

o The quality of the template increases with the overall sequence similarity to the target and decreases with the number and length of gaps in alignment.

§ The simplest template selection rule is to select the structure with the highest sequence similarity to the target sequence.

o The family of proteins that includes the target and templates can frequently be organized into subfamilies.

§ The construction of a multiple alignment & a phylogenetic tree can help in selecting the templates from the subfamily that is closest to the target sequence.

o The similarity between the “environment” of the template and that in which the target needs to be modeled should be considered.

§ The term “environment” includes factors like solvent, pH, ligands and quaternary interactions.

o The quality of the experimentally determined structure is another important factor in template selection.

§ The resolution and R-factor of a crystallographic structure, and the number of restraints per residue for an NMR structure, indicate the accuracy of the structure.

§ This information can generally be obtained from the PDB template files or from the article describing the structure determination.

§ For example, if two templates have comparable sequence similarity to the target, the one determined at the highest resolution should generally be used.
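The last rule above (comparable similarity, then best resolution) can be sketched in a few lines of Python. This is only an illustration: the Template record, the template names and the 2-point identity tolerance used to decide what counts as "comparable" are all assumptions, not values from any modeling package.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str          # hypothetical template label
    identity: float    # percent sequence identity to the target
    resolution: float  # crystallographic resolution in angstroms (lower is better)


def select_template(candidates, identity_tolerance=2.0):
    """Pick the best-resolved template among those with comparable identity.

    identity_tolerance is an assumed cutoff: templates within this many
    percentage points of the best identity are treated as comparable.
    """
    if not candidates:
        raise ValueError("no candidate templates")
    best_identity = max(c.identity for c in candidates)
    comparable = [c for c in candidates
                  if best_identity - c.identity <= identity_tolerance]
    # Among comparable templates, the lowest resolution value wins.
    return min(comparable, key=lambda c: c.resolution)


candidates = [
    Template("tplA", 45.0, 2.8),
    Template("tplB", 44.0, 1.6),
    Template("tplC", 30.0, 1.2),
]
print(select_template(candidates).name)
```

Here tplC has the best resolution but is too dissimilar to be considered, so the choice falls between tplA and tplB. In practice the environment and subfamily factors listed above would also enter the decision.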

3) TARGET-TEMPLATE ALIGNMENT.
To build a model, all comparative modeling programs depend on a list of assumed structural equivalences between the target and template residues. This list is defined by the alignment of the target and template sequences. Search methods tend to be tuned for detecting remote relationships rather than for producing optimal alignments. Therefore, once the templates are selected, a specialized alignment method should be used to align them with the target sequence. The alignment is relatively simple to obtain when the target-template sequence identity is above 40%.
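To see where a given alignment stands relative to the 40% mark, percent identity can be computed directly from the two aligned rows. The sketch below assumes gaps are written as "-" and divides by the number of columns where both sequences have a residue. Other definitions of the denominator exist, so treat this as one possible convention. The sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns in which neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical target/template rows taken from an alignment.
target = "MKT-AYIAK"
template = "MKTQAY-AR"
print(round(percent_identity(target, template), 1))
```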

4) MODEL BUILDING.
Once the initial target-template alignment is built, a variety of methods can be used to construct a 3D model for the target protein.
The original and still widely used method is modeling by rigid body assembly.
This method constructs the model from a few core regions and from loops and side chains, which are obtained from dissecting related structures.

LOOP BUILDING: there are 2 ways:

o AB INITIO loop prediction – is based on the conformational search or enumeration of conformations in a given environment, guided by a scoring/energy function.

o DB approach – consists of finding a segment of main chain that fits the two stem regions of a loop.

§ Whether a given loop prediction is correct can be estimated from its RMSD (root-mean-square deviation) from the actual loop conformation, which should be less than 2 Å.
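For reference, RMSD is the square root of the mean squared distance between corresponding atoms. A minimal sketch, assuming the two loops have already been superimposed (for example on their stem residues) and that the atoms are listed in the same order. The coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates in angstroms for a predicted and a reference loop.
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.5, 1.0)]
value = rmsd(predicted, reference)
print(value, "acceptable" if value < 2.0 else "poor")
```

This does not perform the superposition itself. A full comparison would first fit one structure onto the other.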

5) MODEL EVALUATION.
After a model has been built, it is important to check for possible errors.
The quality of a model can be approximately predicted from the sequence similarity between the target and template. A sequence identity above 30% is a relatively good predictor of the expected accuracy of a model. However, other factors, including environment, can strongly influence the accuracy of a model.
For example, some calcium-binding proteins undergo large conformational changes when bound to calcium. If a calcium-free template is used to model the calcium-bound state of a target, it is likely that the model will be incorrect irrespective of the target-template similarity.
If the target-template sequence identity falls below 30%, the sequence identity becomes significantly less reliable as a measure of expected accuracy of a single model.

COMMON USES OF COMPARATIVE PROTEIN STRUCTURE MODELS.

1) Designing (site-directed) mutants to test hypotheses about function.

2) Identifying active sites and binding sites.

3) Searching for ligands of a given binding site.

4) Designing and improving ligands of a given binding site.

5) Protein-protein docking simulations.

6) Testing a given sequence-structure alignment.

Wednesday, May 30, 2007

Biology in the Computer Age.

From the interaction of species and populations, to the function of tissues and cells within an individual organism, biology is defined as the study of living organisms. Now, at the beginning of the 21st century, we use sophisticated laboratory technology that allows us to collect data faster than we can interpret it. We have large volumes of DNA sequence data at our fingertips, but how do we figure out which parts of that DNA control the various chemical processes in life?
Bioinformatics is the science of using information to understand biology. Strictly speaking, bioinformatics is a subset of the larger field of computational biology, the application of quantitative analytical techniques in modeling biological systems.
The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition. Researchers come to bioinformatics from many fields, including mathematics, computer science, and linguistics. Unfortunately, biology is a science of the specific as well as the general. Bioinformatics is full of pitfalls for those who look for patterns and make predictions without completely understanding where biological data comes from and what it means. By providing algorithms, databases, user interfaces and statistical tools, bioinformatics makes it possible to do exciting things such as compare DNA sequences and generate results that are potentially significant. These new tools also give you the opportunity to overinterpret data and assign meaning where none really exists.

What Informatics Means to Biologists....
The science of informatics is concerned with the representation, organization, manipulation, distribution, maintenance and use of information, particularly in digital form. There is more than one interpretation of what bioinformatics actually means, and it's quite possible to go out and apply for a job doing bioinformatics and find that the expectations of the job are entirely different from what you thought.
The functional aspect of bioinformatics is the representation, storage and distribution of data. Intelligent design of data formats and databases, creation of tools to query those databases, and development of user interfaces that bring together different tools to allow the user to ask complex questions about the data are all aspects of the development of bioinformatics infrastructure.
Developing analytical tools to discover knowledge in data is the second and more scientific aspect of bioinformatics. There are many levels at which we use biological information, whether we are comparing sequences to develop a hypothesis about the function of a newly discovered gene, breaking down known 3D protein structures into bits to find patterns that can help predict how the protein folds, or modeling how proteins and metabolites in a cell work together to make the cell function. The ultimate goal of analytical bioinformaticians is to develop predictive methods that allow scientists to model the function and phenotype of an organism based only on its genome sequence. This is a grand goal, and one that will be approached only in small steps, by many scientists working together.

Challenges Biology Offers Computer Scientists....
The goal of biology is to develop a quantitative understanding of how living things are built from the genome that encodes them.
Cracking the genome code is complex. At the simplest level, we still have difficulty identifying unknown genes by computer analysis of genome sequence. We still have not managed to predict or model how a chain of amino acids folds into the specific structure of a functional protein.
Beyond the single-molecule level, the challenges are immense. The amount of data in GenBank is now growing at an exponential rate, and as data types beyond DNA, RNA and protein sequence begin to undergo the same kind of explosion, simply managing, accessing and presenting this data to users in an intelligible form is a critical task. Human-computer interaction specialists need to work closely with academic and clinical researchers in the biological sciences to manage such staggering amounts of data.
Biological data is very complex and interlinked. A spot on a DNA array, for instance, is connected not only to immediate information about its intensity, but to layers of information about genomic location, DNA sequences, structure, function, and more. Creating information systems that allow biologists to seamlessly follow these links without getting lost in a sea of information is also a huge opportunity for computer scientists.
Finally, each gene in the genome isn't an independent entity. Multiple genes interact to form biochemical pathways, which in turn feed into other pathways. Biochemistry is influenced by the external environment, by interaction with pathogens, and by other stimuli. Putting genomic, biochemical and physiological data together will be the work of a generation of computational biologists. Computer scientists, mathematicians and statisticians will be a vital part of this effort.

Skills of a Bioinformatician....

Should have a fairly deep background in some aspect of molecular biology.
Must absolutely understand the central dogma of molecular biology. Understanding how and why DNA sequence is transcribed into RNA and translated into protein is vital.
Should have substantial experience with at least one or two major molecular biology software packages, either for sequence analysis or molecular modeling. The experience of learning one of these packages makes it much easier to use other software quickly.
Should be comfortable working in a command-line computing environment.
Should have experience with programming in a language such as C/C++, as well as in a scripting language such as Perl or Python.

Tuesday, May 29, 2007

Phylogenetic tree

A phylogenetic tree is a tree showing the evolutionary interrelationships among various species or other entities that are believed to have a common ancestor. A phylogenetic tree is a form of a cladogram. In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants, and in some trees the edge lengths correspond to time estimates. Each node in a phylogenetic tree is called a taxonomic unit. Internal nodes are generally referred to as Hypothetical Taxonomic Units (HTUs) as they cannot be directly observed.

A rooted phylogenetic tree is a directed tree with a unique node corresponding to the (usually imputed) most recent common ancestor of all the entities at the leaves of the tree. The most common method for rooting trees is the use of an uncontroversial outgroup - close enough to allow inference from sequence or trait data, but far enough to be a clear outgroup.
While unrooted phylogenetic trees can be generated from rooted ones by omitting the root, a root cannot be inferred on an unrooted tree without either an outgroup or additional assumptions (for instance, about relative rates of divergence).
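The idea that an internal node is the most recent common ancestor of its descendants is easy to show on a small rooted tree. The sketch below stores the tree as a child-to-parent dictionary. The species, the HTU labels and the topology are purely illustrative.

```python
def path_to_root(parent, node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def most_recent_common_ancestor(parent, a, b):
    """First node on b's path to the root that also lies on a's path."""
    seen = set(path_to_root(parent, a))
    for node in path_to_root(parent, b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same rooted tree")


# Hypothetical rooted tree: leaves are observed taxa, htu1/htu2/root are HTUs.
parent = {
    "human": "htu1", "chimp": "htu1",
    "htu1": "htu2", "mouse": "htu2",
    "htu2": "root", "chicken": "root",  # chicken plays the outgroup here
}
print(most_recent_common_ancestor(parent, "human", "mouse"))
```

Dropping the "root" node and joining "htu2" directly to "chicken" would give the corresponding unrooted tree, in which this question can no longer be answered.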

There are three main families of methods for constructing phylogenetic trees: distance-based methods such as neighbor-joining, parsimony-based methods such as maximum parsimony, and probabilistic methods such as maximum likelihood or Bayesian inference.
Tree-building methods can be assessed on the basis of several criteria:

efficiency (how long does it take to compute the answer, how much memory does it need?)
power (does it make good use of the data, or is information being wasted?)
consistency (will it converge on the same answer repeatedly, if each time given different data for the same model problem?)
robustness (does it cope well with violations of the assumptions of the underlying model?)
falsifiability (does it alert us when it is not good to use, i.e. when assumptions are violated?)
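Distance-based methods such as neighbor-joining take a matrix of pairwise distances as their input. As a rough sketch of that first step only (not of neighbor-joining itself), the code below computes the proportion of differing sites between aligned DNA sequences and applies the Jukes-Cantor correction. The choice of Jukes-Cantor is an assumption made here for illustration, and the sequences are made up.

```python
import math


def p_distance(a, b, gap="-"):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3) for DNA."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # sequences are saturated; distance is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Nested dict of corrected distances for a {name: aligned_sequence} dict."""
    names = list(seqs)
    return {a: {b: jukes_cantor(p_distance(seqs[a], seqs[b])) for b in names}
            for a in names}


# Hypothetical aligned sequences.
seqs = {"taxonA": "ACGTACGT", "taxonB": "ACGTACGA", "taxonC": "ACCTTCGA"}
for name, row in distance_matrix(seqs).items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```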

Cladistics :- is a branch of biology that determines the evolutionary relationships between organisms based on derived similarities. It is the most prominent of several forms of phylogenetic systematics, which study the evolutionary relationships between organisms. Cladistics is a method of rigorous analysis, using "shared derived traits" of the organisms being studied. Cladistic analysis forms the basis for most modern systems of biological classification, which seek to group organisms by evolutionary relationships. In contrast, phenetics groups organisms based on their overall similarity, while approaches that are more traditional tend to rely on key characters (morphology). Willi Hennig (1913 - 1976) is widely regarded as the founder of cladistics.
As the end result of a cladistic analysis, treelike relationship-diagrams called "cladograms" are drawn up to show different hypotheses of relationships. A cladistic analysis can be based on as much or as little information as the researcher selects. Modern systematic research is likely to be based on a wide variety of information, including DNA-sequences (so called "molecular data"), biochemical data and morphological data.

A taxon:- (plural taxa), or taxonomic unit, is a grouping of organisms (named or unnamed). Once named, a taxon will usually have a rank and can be placed at a particular level in a hierarchy.

Multiple Alignment

Multiple Sequence Alignment (MSA) is an alignment of 3 or more sequences such that each column of the alignment is an attempt to represent the evolutionary changes, including substitution, insertion, and deletion.
Alignment of protein sequences can reveal whether there is a functional and evolutionary relationship between the sequences.
Consensus information retrieved from an MSA can be used to design specific DNA probes that identify other members of the same group or family of similar sequences in different organisms.
The consensus can also be used as a query sequence in a database search program to find other sequences with a similar pattern.
MSAs are starting points for phylogenetic analysis.
Once the MSA has been found, the number or types of changes in the aligned sequence residues may be used for a phylogenetic analysis.
Each column in the alignment predicts the mutations that occurred at one site during the evolution of the sequence family.
Within the columns are original characters that were present early as well as other derived characters that appeared later in evolutionary time.
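A simple way to get a consensus out of an MSA is to take the most common character in each column. The sketch below does exactly that, ignoring gaps. Real tools use more refined rules (thresholds, ambiguity codes), so this is only a minimal illustration with invented sequences.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Majority-rule consensus of equally long aligned sequences."""
    if not alignment:
        raise ValueError("alignment is empty")
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch != gap)
        # A column made only of gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Hypothetical three-sequence alignment.
msa = [
    "ACGT-A",
    "ACGTTA",
    "ACCTTA",
]
print(consensus(msa))
```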

MSA can be global or local.....
An alignment of 2 sequences can be a global alignment of the entire sequences or a local alignment of locally conserved regions in these sequences. These two types of alignment are also used in MSA.
In a global MSA, pairwise alignment is extended to include 3 or more related sequences.
Locally conserved patterns in DNA sequences can represent conserved regulatory information in the genome.
These patterns may be found by local MSA methods designed to search for such conserved regions.

Tools....
Two main tools are used for MSA
a) CLUSTALW
b) TCOFFEE

Monday, May 28, 2007

Pairwise Alignment

This is done when 2 sequences (DNA, RNA or protein) are compared against each other to retrieve information about their evolutionary, structural and functional relationships.
This alignment is of 2 types:

1) Local Alignment.
2) Global Alignment.

LOCAL ALIGNMENT
In this alignment, stretches of sequence with a high density of matches are aligned, generating one or more islands of matches (subalignments) in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their length but dissimilar elsewhere, sequences that differ in length, or sequences that share a conserved region or domain.

The SMITH-WATERMAN ALGORITHM is used to produce local alignments between pairs of DNA or protein sequences -
1) Assigns a score to each pair of residues
- uses similarity scores only (identical or similar residues have positive scores and dissimilar ones have 0 or negative scores)
- uses positive scores for related residues
- uses negative scores for mismatches and gaps
2) Initializes the edges of the matrix with zeros
3) As the scores are summed in the matrix, any score below 0 is recorded as 0.
4) Begins the traceback at the maximum value found anywhere in the matrix.
5) Continues the traceback until the score falls to 0.
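The five steps above can be sketched in Python as follows. The scoring values (match +2, mismatch -1, gap -2) and the simple linear gap penalty are assumptions picked for the example. Real protein alignments would use a substitution matrix instead of a single match/mismatch score.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step 2: the first row and column stay at zero.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch  # step 1
            # Step 3: anything below 0 is recorded as 0.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Step 4: start the traceback at the maximum cell.
    i, j = best_pos
    top, bottom = [], []
    # Step 5: stop when the score falls to 0.
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Two made-up sequences that share only the island "ACGT".
print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```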

GLOBAL ALIGNMENT
In global alignment, an attempt is made to align the entire sequence, using all sequence characters. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.

Here, the NEEDLEMAN-WUNSCH ALGORITHM is used to produce global alignments. The alignment is stretched over the entire sequence length to include as many matching characters as possible, up to and including the sequence ends. A global alignment is made possible by including gaps either within the alignment or at either end of one or both sequences. When the alignment is printed, vertical bars between the sequences indicate identical characters.
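A compact sketch of Needleman-Wunsch is given below, together with a helper that prints the vertical bars between identical characters. The scores (match +1, mismatch -1, gap -1) and the linear gap penalty are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Unlike local alignment, leading gaps are penalized along the edges.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


def format_alignment(x, y):
    """Three-line view with '|' between identical characters."""
    bars = "".join("|" if p == q and p != "-" else " " for p, q in zip(x, y))
    return f"{x}\n{bars}\n{y}"


score, x, y = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(format_alignment(x, y))
```

Several alignments can share the same optimal score, so the exact position of the gap in the output depends on the order in which the traceback checks its three options.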

GLOBAL ALIGNMENT vs LOCAL ALIGNMENT
When trying to deduce evolutionary history by examining protein sequence similarities and differences, global alignment is typically meaningful and effective, since it compares proteins of the same sequence family.
However, in many biological applications, local alignment is more meaningful than global alignment. This is particularly true when large stretches of anonymous DNA are compared, since only some internal stretches of those strings may be related.
When comparing protein sequences, local alignment is critical because proteins from very different families are often made up of the same structural or functional subunits (motifs or domains), and local alignment is appropriate for searching for these subunits.
An interesting example of a conserved domain comes from the proteins encoded by homeobox genes. These genes show up in a wide variety of species, from fruit flies to frogs to humans. They regulate embryonic development at a high level, and a single mutation in one of these genes can transform one body part into another. The protein sequences that these genes encode are very different in each species, except in one region called the homeodomain. This domain consists of about 60 amino acids that form the part of the regulatory protein that binds to DNA. Homeodomains made by certain insect and mammalian genes are particularly similar, showing about 50-95% identity in alignments without spaces.
Protein-to-DNA binding is central to how those proteins regulate embryonic development and cell differentiation. So the amino acid sequence in the most biologically critical part of those proteins is highly conserved, whereas the other parts of the protein sequences show very little similarity. In such cases, local alignment is certainly a more appropriate way to compare protein sequences than global alignment. Local alignment in proteins is additionally important because certain isolated characters of related proteins may be more highly conserved than the rest of the protein.
Local alignment is more likely to detect these conserved characters than global alignment. It is the most appropriate type of alignment for comparing proteins from different protein families.

Tuesday, May 22, 2007

Sequence Alignment

Sequence alignment is a way of comparing two (pairwise) or more (multiple) primary sequences of DNA, RNA or protein to identify regions of similarity (i.e., searching for a series of individual characters or character patterns that are in the same order in the sequences) that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and non-identical characters are either placed in the same column as a mismatch or placed opposite a gap in the other sequence.
In an optimal alignment, non-identical characters and gaps are placed so as to bring as many identical or similar characters as possible into the same columns.

"-" represents a gap
"*" represents a match
":" represents a mismatch

If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In protein sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions, in a particular region of the sequence suggests that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairing can indicate a similar functional or structural role.
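The notation used in this post ("-" for a gap inside a sequence row, "*" for a match, ":" for a mismatch) can be generated for any pair of aligned rows. In the sketch below, a column containing a gap gets a blank in the annotation line, which is a choice made here for readability. The sequences are made up.

```python
def annotate(aligned_a, aligned_b, gap="-"):
    """Build the marker line: '*' match, ':' mismatch, ' ' for a gap column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            symbols.append(" ")   # possible indel
        elif x == y:
            symbols.append("*")   # conserved position
        else:
            symbols.append(":")   # possible point mutation
    return "".join(symbols)


row1 = "ACG-TA"
row2 = "ACCTTA"
print(row1)
print(annotate(row1, row2))
print(row2)
```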

TYPES OF SEQUENCE ALIGNMENT

1) Pairwise Sequence Alignment.
2) Multiple Sequence Alignment.

Sequence Alignment and Sequence Comparison are more or less the same thing: sequences can be compared only after they have been aligned.
