Logged in as: guest Log in | ||||||||||||||||||||||||||||
Home | Research | Courses | Publications | |||||||||||||||||||||||||
Comp 590-087/790-087: BioAlgorithms -- Fall 2011
Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. No late homework will accepted. However, your lowest homework will be dropped. Feel free to work with others, but the work you hand in should be your own.
Doublet starred (**) parts of questions are optional and worth up to 10% extra credit. Question 1. In a Sequencing By Hybridization (SBH) study the following spectrum of 3-mers is collected:
Question 2.
Question 3. Compute a suffix array for the megabase DNA segment at this link. Use it, to answer the following questions about the sequence: A. What is the shortest unique substring (excluding any substring ending with the special end-of-string character)? B. What is the longest repeated substring (overlaps are allowed)? **C. Find the longest match to a given target string, s, with at most one mismatch (you can assume that the length s is at least 20). Feel free to start with the suffix array code given in class. You are welcome to speed it up if you like. It took about 40 mins for the argsort() to complete on my machine. The performance of findFirst() and findLast() were fine. Turn in your code to implement parts A and B as well as an answer. If you choose to do part C for extra credit, turn in your code and several examples. Posted by leonardo on 2011-11-16 02:43:35 (references Version 9)In Question 3.A, what does "excluding any string at the end of the segment" mean? For example, if the DNA fragment is AACCCTTTG, does that mean that we should output AA as the shortest unique substring, instead of G? What is the purpose of that? Posted by mcmillan on 2011-11-22 02:04:34 (references Version 12)Recall a special end-of-string character, '$', is used to terminate every string. This leads to a series of short substrings all ending with '$' in the suffix array. The point of this question is to find the shortest nonrepeated substring that is not made distinct from the substrings immedidately preceding or following it in the suffix array by this '$'. Thus, the 'G' in the previous comment is the shortest nonrepeated substring of AACCCTTTG$, because it does not rely on the '$' to make it unique. However, 'T' is not the shortest nonrepeated substring of AAACCCTTT$; instead either of {AC, CT} are. |