Issued Thursday, 3/1/2018; Due Thursday, 3/22/2018\n",
"\n",
"**Homework Information:** Some of the problems are probably too long to start the night before the due date, so plan accordingly. Late problem sets will be penalized by a factor of 70.71% for each class meeting after the due date. Feel free to get help from others, but the work you submit should be your own."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"section": "signature"
},
"outputs": [],
"source": [
"# Replace the following string values with the requested information\n",
"class Student:\n",
" first = \"Your given name\"\n",
" last = \"Your family name\"\n",
" onyen = \"Your UNC Onyen\"\n",
" pid = \"Your UNC pid\""
]
},
{
"cell_type": "markdown",
"metadata": {
"number": 1,
"section": "problem"
},
"source": [
"---\n",
"**Problem #1:** A file of *15-mers* simulating short reads from a genome can be downloaded [here](http://csbio.unc.edu/mcmillan/Comp555S18/data/kmers.txt). How many distinct nodes appear in the De Bruijn graph that represents these *15-mers* as edges? How many nodes are semi-balanced? How many nodes are balanced? How many are balanced with both in-degree and out-degree equal to 1?"
]
},
{
"cell_type": "raw",
"metadata": {
"number": 1,
"section": "answer"
},
"source": [
"Enter your answer here"
]
},
{
"cell_type": "markdown",
"metadata": {
"number": 2,
"section": "problem"
},
"source": [
"---\n",
"**Problem #2:** What is the length of the Eulerian path that can be constructed in the De Bruijn graph described in Problem #1? How does the resulting constructed sequence compare to the plasmid sequence of [*Salmonella Typhimurium*](http://csbio.unc.edu/mcmillan/Comp555S18/data/SalmonellaTyphimurium.fa)? (Hint: one method of comparison is to consider how many k-mers the two sequences share. Of course you can compare with k=15, but also consider the smallest value of k for which the two sequences differ in their sets of k-mers.)"
]
},
{
"cell_type": "raw",
"metadata": {
"number": 2,
"section": "answer"
},
"source": [
"Enter your answer here"
]
},
{
"cell_type": "markdown",
"metadata": {
"number": 3,
"section": "problem"
},
"source": [
"---\n",
"**Problem #3:** You will find a BWT of the primary \"genome\" sequence of *Salmonella Typhimurium* [here](http://csbio.unc.edu/mcmillan/Comp555S18/data/SalmonellaTyphimurium.bwt). This BWT is compressed as follows: each run of two or more identical characters is replaced by its ASCII-encoded length followed by a single copy of the character. For example, the string \"AAACGGTTTTTTTTTT\" would be encoded as the string \"3AC2G10T\". What is the compression ratio of this BWT, where the compression ratio is given by:\n",
"\n",
"$$\\frac{\\text{len}(\\text{compressed BWT})}{\\text{len}(\\text{sequence})}$$\n",
"\n",
"What is the average run length in the BWT (count each character that is not part of a run as a run of length 1)?"
]
},
{
"cell_type": "raw",
"metadata": {
"number": 3,
"section": "answer"
},
"source": [
"Enter your answers here"
]
},
{
"cell_type": "markdown",
"metadata": {
"number": 4,
"section": "problem"
},
"source": [
"**Problem #4:** Uncompress the BWT from Problem #3, and use the code given in class to find how many times the substring \"ATGACAACGC\" and its reverse complement \"GCGTTGTCAT\" appear in the *Salmonella Typhimurium* genome. Repeat this for the substring \"ATGACAACGCAT\" and its reverse complement \"ATGCGTTGTCAT\". Note that these sequences are the first few bases of the *HolC* gene that you searched for in Problem Set #2."
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": true,
"number": 4,
"section": "answer"
},
"source": [
"Enter your answers here"
]
},
{
"cell_type": "markdown",
"metadata": {
"number": 5,
"section": "problem"
},
"source": [
"**Problem #5: Programming Problem** \n",
"\n",
"In class we discussed an algorithm to produce a BWT from a suffix array. In this problem you are asked to write code to do the inverse: ***produce a suffix array from a BWT***. In the space provided below, write your function and test it by finding the genomic indices for all of the substrings that you reported for Problem #4. (Hint: The first implicit suffix in the BWT begins with '$', which is the last character of the original string.)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"number": 5,
"section": "answer"
},
"outputs": [],
"source": [
"# Enter your code here"
]
},
{
"cell_type": "markdown",
"metadata": {
"section": "submit"
},
"source": [
"Click [here to submit](http://csbio.unc.edu/mcmillan/index.py?run=PS.upload) your completed problem set"
]
}
],
"metadata": {
"celltoolbar": "Edit Metadata",
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 1
}