Problem Set #2

mcmillan / Version 12

Comp 555: BioAlgorithms -- Spring 2015

Problem Set #1

Issued: 3/25/2015 Due: In class 4/6/2015

Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. No late homework will be accepted. However, your lowest homework score will be dropped. Feel free to work with others, but the work you hand in should be your own.

Homologous Genes

In biology, homology is the existence of shared ancestry between a pair of structures, or genes, in different species. A common example of homologous structures in evolutionary biology is the wings of bats and the arms of primates. Evolutionary theory explains the existence of homologous structures adapted to different purposes as the result of descent with modification from a common ancestor.

As with anatomical structures, homology between protein or DNA sequences is defined in terms of shared ancestry. Two segments of DNA can have shared ancestry because of either a speciation event (orthologs) or a duplication event (paralogs).

Homology among proteins or DNA is often incorrectly concluded on the basis of sequence similarity. The terms "percent homology" and "sequence similarity" are often used interchangeably. As with anatomical structures, high sequence similarity might occur because of convergent evolution, or, as with shorter sequences, by chance. Such sequences are similar but not homologous. Sequence regions that are homologous are also called conserved. This is not to be confused with conservation in amino acid sequences, in which the amino acid at a specific position has been substituted with a different one with functionally equivalent physicochemical properties. One can, however, refer to partial homology, where a fraction of the sequences compared share (or are presumed to share) descent, while the rest do not. For example, partial homology may result from a gene fusion event.

In this exercise we will explore the sequence similarities between four sequences that vary significantly in their evolutionary distance from a common ancestor. The sequences that you will work with are given below:

All of your problem solutions should be turned in as a single iPython Notebook.

Problem 1. Normalization

Rarely will it be the case that you are given two sequences whose starts and ends are commensurate. What we'd like to do is to find the appropriate starts and ends for comparing each pair of sequences. To simplify the problem, the human, mouse, and yeast gene sequences have already been normalized (their starts and ends are already suitable for a global alignment, or other forms of sequence comparison). The problem is with the fish sequence. In this problem, try to find a suitable start and end for the fish sequence that maximizes its similarity to the human sequence (Hint: consider making a small change to Local Alignment as an approach).
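One possible reading of the hint is a "fitting" alignment: the normalized sequence is aligned end to end, while skipping a prefix or suffix of the other sequence costs nothing. The sketch below is only an illustration under that assumption, not the intended solution. The function name `fit_alignment`, its parameter names, and the scoring values (borrowed from Problem 3) are all hypothetical choices.

```python
def fit_alignment(reference, target, match=1, mismatch=0, gap=-1):
    """Find the slice of `target` that best aligns to all of `reference`.

    Returns (score, start, end) such that target[start:end] is the region.
    Scoring values are assumptions borrowed from Problem 3.
    """
    n, m = len(reference), len(target)
    # score[i][j]: best score aligning reference[:i] to some suffix of target[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # origin[i][j]: index in target where that best alignment begins
    origin = [[0] * (m + 1) for _ in range(n + 1)]

    for j in range(m + 1):
        origin[0][j] = j            # skipping a prefix of target is free
    for i in range(1, n + 1):
        score[i][0] = i * gap       # reference must be consumed in full

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = reference[i - 1] == target[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # gap in target
            left = score[i][j - 1] + gap    # gap in reference
            best = max(diag, up, left)
            score[i][j] = best
            if best == diag:
                origin[i][j] = origin[i - 1][j - 1]
            elif best == up:
                origin[i][j] = origin[i - 1][j]
            else:
                origin[i][j] = origin[i][j - 1]

    # Skipping a suffix of target is also free: take the best cell in the last row.
    end = max(range(m + 1), key=lambda j: score[n][j])
    return score[n][end], origin[n][end], end
```

Here `reference` would play the role of the human sequence and `target` the fish sequence. The full table uses memory proportional to the product of the two lengths, which may need attention for long genes.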

Problem 2. Longest Conserved Substring

Devise a method to find the longest conserved substring shared by all four sequences. Recall that a substring is a contiguous series of bases; a conserved substring is one that appears in all four strings.
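As a rough starting point, one simple (not necessarily the fastest) approach is to binary search on the substring length and intersect the sets of substrings of that length from each sequence. This works because any common substring of length k contains a common substring of length k-1. The function names below are hypothetical, and the toy strings in the test are made up for illustration.

```python
def common_substrings_of_length(seqs, k):
    """Return the set of length-k substrings present in every sequence."""
    first = seqs[0]
    common = {first[i:i + k] for i in range(len(first) - k + 1)}
    for s in seqs[1:]:
        common &= {s[i:i + k] for i in range(len(s) - k + 1)}
        if not common:
            break                   # nothing left to intersect
    return common


def longest_common_substring(seqs):
    """Return one longest substring shared by all sequences ('' if none)."""
    lo, hi = 0, min(len(s) for s in seqs)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2    # round up so the loop always progresses
        found = common_substrings_of_length(seqs, mid)
        if found:
            best = min(found)       # pick deterministically among ties
            lo = mid
        else:
            hi = mid - 1
    return best
```

Each probe stores every substring of the tested length, so memory grows with both sequence length and k. A suffix-tree or suffix-array method would scale better for long genes.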

Problem 3. Evolutionary Distance

Finally, we will compute the pairwise distances between all pairs of normalized gene sequences. Our measure of distance will be the edit distance between two sequences where matches are weighted as +1, mismatches as 0, and any introduced gaps as -1.

Generate a table with the six distances, one for each pair of the given normalized sequences.
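The weighting described above can be sketched as a standard global-alignment dynamic program, with one score computed per pair. Note that with these weights a larger value means the two sequences are more similar. The names `alignment_score` and `pairwise_scores` are hypothetical, and the dictionary of named sequences is an assumed input format, not something specified by the assignment.

```python
from itertools import combinations


def alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment score of a and b, keeping only two rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(named_seqs):
    """Map each unordered pair of names to its alignment score."""
    return {
        (x, y): alignment_score(named_seqs[x], named_seqs[y])
        for x, y in combinations(named_seqs, 2)
    }
```

With four named sequences, `pairwise_scores` returns six entries, which can then be laid out as a table in the notebook.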

Submit your solutions, in the form of an iPython Notebook, here.