In [1]:
%%javascript
var width = window.innerWidth || document.documentElement.clientWidth || document.body.clientWidth;
var height = window.innerHeight || document.documentElement.clientHeight || document.body.clientHeight;

IPython.notebook.kernel.execute("windowSize = (" + width + "," + height + ")");
// suitable for small screens
nbpresent.mode.tree.set(
    ["app", "theme-manager", "themes", "my-theme"], 
    {
    palette: {
        "blue": { id: "blue", rgb: [0, 153, 204] },
        "black": { id: "black", rgb: [0, 0, 0] },
        "white": { id: "white", rgb: [255, 255, 255] },
        "red": { id: "red", rgb: [240, 32, 32] },
        "gray": { id: "gray", rgb: [128, 128, 128] },
    },
    backgrounds: {
        "my-background": {
            "background-color": "white"
        }
    },
    "text-base": {
        "font-family": "Georgia",
        "font-size": 2.5
    },
    rules: {
        h1: {
            "font-size": 5.5,
            color: "blue",
            "text-align": "center"
        },
        h2: {
            "font-size": 3,
            color: "blue",
            "text-align": "center"
        },
        h3: {
            "font-size": 3,
            color: "black",
        },
        "ul li": {
            "font-size": 2.5,
            color: "black"
        },
        "ul li ul li": {
            "font-size": 2.0,
            color: "black"
        },
        "code": {
            "font-size": 1.6,
        },
        "pre": {
            "font-size": 1.6,
        }
    }
});


# Comparing Sequences

<img src="./Media/Clustal.gif" width="700">
<p style="text-align: center;">By Miguel Andrade at English Wikipedia</p>

<p style="text-align: right; clear: right;">1</p>

# Sequence Similarity

* A common problem in Biology

<table style="font-size: 12px;"><tbody>
<tr>
<th colspan="2" style="text-align: center;">
Insulin Protein Sequence
</th>
</tr>
<tr>
<td style="text-align: right;">Human</td>
<td><code>MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN</code></td>
</tr>
<tr>
<td style="text-align: right;">Dog</td>
<td><code>MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEDLQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN</code></td>
</tr>
<tr>
<td style="text-align: right;">Cat</td>
<td><code>MAPWTRLLPLLALLSLWIPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEDLQGKDAELGEAPGAGGLQPSALEAPLQKRGIVEQCCASVCSLYQLEHYCN</code></td>
</tr>
<tr>
<td style="text-align: right;">Pig</td>
<td><code>MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN</code></td>
</tr>
</tbody></table>
<img src="./Media/InsulinHexamer.jpg" width="240" style="float: right; margin-right: 120px;">

* All similar, but how similar?
* How do you measure similarity?
* Does Hamming distance work here?
* Uses
  - To establish a *phylogeny*
  - To identify *functional* or *conserved* components of the sequence
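To make the Hamming-distance question concrete, here is a minimal sketch. The function name `hamming` and the toy DNA strings are illustrative assumptions, not part of the original slides; the two longer strings are prefixes of the human and dog sequences in the table above.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


# Toy strings (hypothetical): the second is the first shifted left by one.
# Position-by-position comparison reports a mismatch everywhere.
shifted_pair = hamming("ATGCATGC", "TGCATGCA")

# Equal-length prefixes of the human and dog insulin sequences.
human_prefix = "MALWMRLLPLLALLALWGPDPAAAFVNQ"
dog_prefix = "MALWMRLLPLLALLALWAPAPTRAFVNQ"
same_length_pair = hamming(human_prefix, dog_prefix)
```

The full human and dog sequences have different lengths (98 and 110), so this function would raise `ValueError` on them. A single inserted or deleted character also shifts everything after it, which is why a position-by-position measure is a poor fit here.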

<p style="text-align: right; clear: right;">2</p>

# Hand Alignments

* Not that long ago, many alignments were done by hand

<pre  style="font-size: 12px;">
Human : MALWMRLLPLLALLALWGPdPAaAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQ_____________GSLQPLALEGs_LQKRGIVEQCCTSICSLYQLENYCN
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||             |||||||||||||||||||||||||||||||||||||
  Dog : MALWMRLLPLLALLALWAPAPtRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREvEDLQvrDVELaG_APGeGGLQPLALEGA_LQKRGIVEQCCTSICSLYQLENYCN
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  |||| | ||| ||||||||| | |||||||||||||||||||||||||
  Cat : MApWtRLLPLLALLsLWiPAPtRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEDLQgkDaEL_GeAPGaGGLQPsALE_APLQKRGIVEQCCaSvCSLYQLEHYCN
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  |||| | ||  |||||||||||| ||||||||||||||||||||||||
  Pig : MALWtRLLPLLALLAlWAPAPAqAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEnpQagaVEL_Gggl__GGLQaLALEGpP_QKRGIVEQCCTSICSLYQLENYCN
                               AFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAE                             QKRGIVEQCC SICSLYQLENYCN
</pre>

* Long conserved regions are shown in the bottom row of the alignment
* Solution strategy?
* Is this a well-defined problem?
  - Is there an optimal or best solution?
  - Did we find it?
* By the way, this is an easy case. Within vertebrates, the amino acid sequence of insulin is strongly conserved.

<p style="text-align: right; clear: right;">3</p>

# The Alignment Game

Let's simplify the problem a bit by considering only 2 sequences, and establishing rules as if it were a game.

* Rules:
  - You must remove all characters from both sequences
  - There are 3 possible moves at any point in the game.
  - Each move removes at least one character from one of the two given strings
  - Pressing [Match] removes the left-most character from each of the two sequences
    * You get 1 point if the characters match, otherwise you get 0 points
  - Pressing [DEL] removes the left-most character from the top sequence
    * You lose 1 point
  - Pressing [INS] removes the left-most character from the bottom sequence
    * You lose 1 point
  - Your point total is allowed to go negative

* Objective: Get the most points
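The rules above can be sketched as a small scoring function. This is an illustration under assumptions: the name `play`, the one-letter move codes (`M`, `D`, `I`), and the toy strings are hypothetical and do not appear in the notebook's JavaScript.

```python
def play(top: str, bot: str, moves: str) -> int:
    """Score a complete game; moves is a string of 'M', 'D' and 'I'."""
    score = i = j = 0  # i and j index the left-most remaining characters
    for move in moves:
        if move == "M":  # Match: consume one character from each sequence
            if i >= len(top) or j >= len(bot):
                raise ValueError("Match needs a character in both sequences")
            score += 1 if top[i] == bot[j] else 0
            i += 1
            j += 1
        elif move == "D":  # Del: consume one character from the top sequence
            if i >= len(top):
                raise ValueError("top sequence is already empty")
            score -= 1
            i += 1
        elif move == "I":  # Ins: consume one character from the bottom sequence
            if j >= len(bot):
                raise ValueError("bottom sequence is already empty")
            score -= 1
            j += 1
        else:
            raise ValueError("unknown move: " + move)
    if i != len(top) or j != len(bot):
        raise ValueError("the game must remove all characters")
    return score


# Two different complete games on the same toy pair of strings.
with_deletion = play("ACGT", "AGT", "MDMM")  # delete the C, then match G and T
no_realignment = play("ACGT", "AGT", "MMMD")  # match blindly, delete at the end
```

Here the first game earns three matches and pays one deletion, while the second earns only one match before paying the same penalty, so the order of moves matters.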

In [6]:
%%javascript

var Insulin = {
    "Human" : "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN",    
    "Dog" : "MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEDLQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN",
    "Cat" : "MAPWTRLLPLLALLSLWIPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEDLQGKDAELGEAPGAGGLQPSALEAPLQKRGIVEQCCASVCSLYQLEHYCN",
    "Pig" : "MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN"
};

var score = 0;
var topIndex = 0;
var botIndex = 0;

function Match() {
    var top = Insulin[document.getElementById("top").value];
    var bot = Insulin[document.getElementById("bot").value];
    // A match needs at least one character left in both sequences
    if (topIndex >= top.length || botIndex >= bot.length) return;
    score += (top[topIndex] == bot[botIndex]) ? 1 : 0;
    topIndex += 1;
    botIndex += 1;
    document.getElementById("topseq").innerHTML = top.substring(topIndex,topIndex+24);
    document.getElementById("botseq").innerHTML = bot.substring(botIndex,botIndex+24);
    document.getElementById("scoreboard").innerHTML = score;
}

function Delete() {
    var top = Insulin[document.getElementById("top").value];
    // Nothing to delete once the top sequence is used up
    if (topIndex >= top.length) return;
    score -= 1;
    topIndex += 1;
    document.getElementById("topseq").innerHTML = top.substring(topIndex,topIndex+24);
    document.getElementById("scoreboard").innerHTML = score;
}

function Insert() {
    var bot = Insulin[document.getElementById("bot").value];
    // Nothing to insert once the bottom sequence is used up
    if (botIndex >= bot.length) return;
    score -= 1;
    botIndex += 1;
    document.getElementById("botseq").innerHTML = bot.substring(botIndex,botIndex+24);
    document.getElementById("scoreboard").innerHTML = score;
}


function onStart() {
    var row, cell, button;
    var board = document.getElementById("board");
    var table = document.createElement("table");
    var tbody = document.createElement("tbody");
    
    topIndex = 0;
    botIndex = 0;
    var topseq = document.createElement("pre");
    topseq.id = "topseq";
    topseq.style = "color: blue; font-size: 50px; line-height: 100%;"
    topseq.innerHTML = Insulin[document.getElementById("top").value].substring(topIndex,topIndex+24);
    
    var botseq = document.createElement("pre");
    botseq.id = "botseq";
    botseq.style = "color: blue; font-size: 50px; line-height: 100%;"
    botseq.innerHTML = Insulin[document.getElementById("bot").value].substring(botIndex,botIndex+24);
    
    row = document.createElement("tr");
    cell = document.createElement("td");
    cell.style = "padding: 5px;";
    cell.rowSpan = 2;
    button = document.createElement("input");
    button.style = "height: 90px;";
    button.type = "button";
    button.value = "Match";
    button.addEventListener("click", Match, false);
    cell.appendChild(button);
    row.appendChild(cell);
    cell = document.createElement("td");
    button = document.createElement("input");
    button.style = "height: 40px;";
    button.type = "button";
    button.value = "DEL";
    button.addEventListener("click", Delete, false);
    cell.appendChild(button);
    row.appendChild(cell);
    cell = document.createElement("td");
    cell.style = "padding-left: 20px;"
    cell.appendChild(topseq);
    row.appendChild(cell);
    tbody.appendChild(row);

    row = document.createElement("tr");
    cell = document.createElement("td");
    button = document.createElement("input");
    button.style = "height: 40px;";
    button.type = "button";
    button.value = "INS";
    button.addEventListener("click", Insert, false);
    cell.appendChild(button);
    row.appendChild(cell);
    cell = document.createElement("td");
    cell.appendChild(botseq);
    cell.style = "padding-left: 20px;"
    row.appendChild(cell);
    tbody.appendChild(row);
    table.appendChild(tbody);

    board.innerHTML = "";
    board.appendChild(table);
    score = 0;
    document.getElementById("scoreboard").innerHTML = score;
}

function selection(name, check) {
    var element = document.createElement("select");
    element.id = name;
    for (var key in Insulin) {
        var option = document.createElement("option");
        option.value = key;
        option.appendChild(document.createTextNode(key));
        option.selected = (key == check);
        element.appendChild(option);
    }
    return element;
}

element.append(selection("top", "Human"));
element.append("&nbsp;&nbsp;&nbsp;");
element.append(selection("bot", "Dog"));
element.append("&nbsp;&nbsp;&nbsp;");

var start = document.createElement("input");
start.type = "button";
start.value = "Start";
start.addEventListener("click", onStart, false);
element.append(start);

element.append("&nbsp;&nbsp;&nbsp; <b>Score:</b> ");
var scoreboard = document.createElement("span");
scoreboard.id = "scoreboard";
scoreboard.innerHTML = score;
element.append(scoreboard);

var board = document.createElement("div");
board.id = "board";
board.style = "margin: 20px; height: 160px;"
element.append(board);

// Hidden-Cell


<p style="text-align: right; clear: right;">4</p>

# How do you get the highest possible score?

* The solution may not be unique
<img src="./Media/BruteForce.jpg" width="150" style="float: right;">

* How many presses?
  - Minimum moves = *Max(len(top), len(bot))*
  - Maximum moves = *len(top) + len(bot)*

* How many possible games (sequences of moves)?
  - Fewer than $3^{len(top) + len(bot)}$

* How big are these for our problem instance?
  - len(Human) = 98, len(Dog) = 110
  - $3^{208} \approx 1.74 \times 10^{99}$, roughly a sixth of a googol (not a google)

* What algorithm solves this problem?
  - Make each move by considering only a short horizon following the current alignment thus far
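The size of the search space can be checked directly with Python's arbitrary-precision integers. The sketch below also counts the distinct complete games exactly, using a recurrence that is an added assumption rather than something stated on the slide: a game ending with `i` top and `j` bottom characters consumed must have ended with a Match, a Del, or an Ins. The helper name `count_games` is hypothetical.

```python
# Upper bound from the slide: 3 choices for each of at most 98 + 110 presses.
upper_bound = 3 ** 208
digits = len(str(upper_bound))  # number of decimal digits in the bound


def count_games(n: int, m: int) -> int:
    """Count distinct move sequences that consume n top and m bottom characters."""
    # counts[i][j] = number of games that consume i top and j bottom characters.
    # With one empty sequence only a single game exists (all Del or all Ins).
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (
                counts[i - 1][j]         # last move was Del
                + counts[i][j - 1]       # last move was Ins
                + counts[i - 1][j - 1]   # last move was Match
            )
    return counts[n][m]


human_vs_dog_games = count_games(98, 110)
```

Printing `human_vs_dog_games` shows a count well below the bound, but still far too large to enumerate.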

<p style="text-align: right; clear: right;">5</p>

# There is an efficient solution

* It relies on a rather surprising idea

  - The best score for the full strings *top* and *bot* can be found by finding the best score for every pair of prefixes top[0:*n*] and bot[0:*m*], for all values of *n* up to len(top) and *m* up to len(bot)
  - Finding this solution requires only $O(len(top)len(bot))$ steps
  - Computing the score alone requires a table of size only $Max(len(top),len(bot))$, because one row can be kept at a time; recovering the moves themselves requires the full table

* But before we solve this problem, let's look at another related problem

* Finding a best city tour on a Manhattan grid

<img src="./Media/ManhattanGrid.jpg" width="400">

<p style="text-align: right; clear: right;">6</p>

# Manhattan Tourist Problem (MTP)

Imagine seeking a path from a given source to a given destination in a Manhattan-like city grid that maximizes the number of attractions (<span style="color: red;">\*</span>) passed, with one caveat: at every step you must make progress towards the goal.
We treat the city map as a graph, with a *vertex* at each intersection and a *weighted edge* along each block. The weights are the number of attractions along each block.

<img src="./Media/MTPVer01.png" width="400">

<p style="text-align: right; clear: right;">7</p>

# Manhattan Tourist Game

**Goal:** Find the maximum weighted shortest path in a grid.

**Input:** A weighted grid G with two distinct vertices, one labeled *source* and the other labeled *destination*

**Output:** A *shortest* path in G from *source* to *destination* with the *greatest* weight
  * There are many *shortest* paths that go south 4 blocks and east 4 blocks
  * Of those paths, which sees the most sights?

<img src="./Media/TallestMidget.jpg" width="150">

<p style="text-align: right; clear: right;">8</p>

# MTP: A Greedy Algorithm Is Not Optimal

<img src="./Media/MTPGreedy.png" width="500">

Different types of ***Greedy***
* <span style="color: red;">*Short horizon*</span>: At each block select the direction where the next block offers the most attractions
* <span style="color: magenta;">*Long horizon*</span>: Look ahead at all streets between your current position and the destination, and go towards the street with the most attractions
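A minimal sketch of the short-horizon strategy follows. The grid encoding and the weights are made up for illustration and are not taken from the figure: `down[i][j]` is the number of attractions on the block from intersection (i, j) south to (i+1, j), `right[i][j]` is the number on the block from (i, j) east to (i, j+1), and indices are (row, column).

```python
# Hypothetical 2-block by 2-block grid (3 x 3 intersections).
down = [[2, 4, 1],
        [1, 1, 3]]
right = [[3, 1],
         [8, 9],
         [2, 1]]


def greedy_short_horizon(down, right):
    """At each intersection take the adjacent block with more attractions."""
    n, m = len(down), len(right[0])  # destination is intersection (n, m)
    i = j = total = 0
    while (i, j) != (n, m):
        # -1 marks a direction that would leave the grid (weights are >= 0).
        go_south = down[i][j] if i < n else -1
        go_east = right[i][j] if j < m else -1
        if go_south > go_east:  # ties go east
            total += go_south
            i += 1
        else:
            total += go_east
            j += 1
    return total


def path_score(down, right, moves):
    """Score an explicit path given as a string of 'S' (south) and 'E' (east)."""
    i = j = total = 0
    for move in moves:
        if move == "S":
            total += down[i][j]
            i += 1
        else:
            total += right[i][j]
            j += 1
    return total


greedy_total = greedy_short_horizon(down, right)
better_total = path_score(down, right, "SEES")
```

On this grid the greedy walk is lured east by the first block and collects 19 attractions, while the path south, east, east, south collects 22.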

<p style="text-align: right; clear: right;">9</p>

# MTP: Observations

* There are a limited number of ways to reach any destination
  - For example, in our grid, one can reach the destination node, *(n,m)*, from either the north, *(n,m-1)*, or the west, *(n-1,m)*.
  - For each of those routes there is a known number of sights to see, so the best path is:
  
  $$Score(n,m) = Max(Score(n-1,m)+Edge(n-1,m), Score(n,m-1)+Edge(n,m-1))$$


  - Why is there only one edge per intersection? Because only one direction makes progress to our goal
  - This rule applies recursively with the base case
  
  $$Score(0,0) = 0$$

* We could write this strategy as a recursive algorithm, but it would still not be efficient. Why?

<img src="./Media/MTPInsight.png" width="320">
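One way to see the inefficiency is to write the recursion literally and count the calls. This is a sketch under assumptions: the grid encoding (`down[i][j]` for the block from (i, j) south to (i+1, j), `right[i][j]` for the block from (i, j) east to (i, j+1), indices as (row, column)) and the weights are hypothetical, and `naive_score` is an illustrative name.

```python
# Hypothetical 2-block by 2-block grid (3 x 3 intersections).
down = [[2, 4, 1],
        [1, 1, 3]]
right = [[3, 1],
         [8, 9],
         [2, 1]]


def naive_score(down, right):
    """Evaluate the recurrence by plain recursion; return (best score, calls)."""
    n, m = len(down), len(right[0])
    calls = 0

    def score(i, j):
        nonlocal calls
        calls += 1
        if i == 0 and j == 0:  # base case: Score(0,0) = 0
            return 0
        options = []
        if i > 0:  # arrive from the north
            options.append(score(i - 1, j) + down[i - 1][j])
        if j > 0:  # arrive from the west
            options.append(score(i, j - 1) + right[i][j - 1])
        return max(options)

    return score(n, m), calls


best, calls = naive_score(down, right)
```

Even on this tiny grid the recursion makes 19 calls for only 9 intersections, because the same intersection is re-scored once for every route that reaches it. The repetition grows rapidly with grid size.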

<p style="text-align: right; clear: right;">10</p>

# A New Solution Strategy

*Dynamic Programming* is a technique for *computing recurrence relations efficiently by storing and reusing intermediate results*

Three keys to constructing a dynamic programming solution:
  1. Formulate the answer as a recurrence relation
  2. Consider all instances of the recurrence at each step (In our case this means all paths that lead to a vertex).
  3. Order evaluations so you will always have precomputed the needed partial results

**Irony:** Often the most efficient approach to solving a specific problem involves solving every smaller subproblem.
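"Storing and reusing intermediate results" can be sketched by adding a cache to the recursive form of the recurrence. The grid encoding and weights below are hypothetical (`down[i][j]` is the block from (i, j) south to (i+1, j), `right[i][j]` the block from (i, j) east to (i, j+1)), and `functools.lru_cache` is just one convenient way to memoize in Python.

```python
from functools import lru_cache

# Hypothetical 2-block by 2-block grid (3 x 3 intersections).
down = [[2, 4, 1],
        [1, 1, 3]]
right = [[3, 1],
         [8, 9],
         [2, 1]]


def memoized_score(down, right):
    """Return (best score, number of intersections actually evaluated)."""
    n, m = len(down), len(right[0])

    @lru_cache(maxsize=None)
    def score(i, j):
        if i == 0 and j == 0:
            return 0
        options = []
        if i > 0:  # arrive from the north
            options.append(score(i - 1, j) + down[i - 1][j])
        if j > 0:  # arrive from the west
            options.append(score(i, j - 1) + right[i][j - 1])
        return max(options)

    best = score(n, m)
    # Each cache miss corresponds to one intersection computed for the first time.
    return best, score.cache_info().misses


best, evaluated = memoized_score(down, right)
```

Each of the 9 intersections is now evaluated exactly once. For large grids an explicit table filled in a suitable order avoids Python's recursion limit.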

<p style="text-align: right; clear: right;">11</p>

# MTP Dynamic Program Solution

<img src="./Media/MTP2.png" width="600">

The solution may not be unique, but it will have the best possible score

<p style="text-align: right; clear: right;">12</p>

# MTP Dynamic Program Strategy

* Instead of solving the Manhattan Tourist problem directly, (i.e. the path from (0,0) to (n,m)) we will solve a more general problem: find the longest path from (0,0) to any arbitrary vertex (i,j).


* If the longest path from (0,0) to (n,m) passes through some vertex (i,j), then its portion from (0,0) to (i,j) must itself be a longest path to (i,j). Otherwise, you could increase the weight along your path by swapping in a better route to (i,j).

<img src="./Media/cityblock.jpg" width="500">

<p style="text-align: right; clear: right;">13</p>

# MTP: Dynamic Program

* Calculate optimal path score for *every* vertex in the graph between our source and destination
* Each vertex’s score is the maximum, over its prior vertices, of that vertex's score plus the weight of the connecting edge

<img src="./Media/MTPOrder.png" width="500">

<p style="text-align: right; clear: right;">14</p>

# MTP: Dynamic Program Continued

<img src="./Media/MTPOrderPart2.png" width="500">

<p style="text-align: right; clear: right;">15</p>

# MTP: Dynamic Program Continued

<img src="./Media/MTPOrderPart3.png" width="500">

<p style="text-align: right; clear: right;">16</p>

# MTP: Dynamic Program Continued

<img src="./Media/MTPOrderPart4.png" width="500">

<p style="text-align: right; clear: right;">17</p>

# MTP: Dynamic Program Continued

<img src="./Media/MTPOrderPart5.png" width="500">

<p style="text-align: right; clear: right;">18</p>

# MTP: Dynamic Program Continued

<img src="./Media/MTPOrderPart6.png" width="400">

* Once the *destination* node (intersection) is reached, we’re done.
* Our table will have the answer of the maximum number of attractions stored in the entry associated with the destination.
* We use the *links* back in the table to recover the path. (Backtracking)
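The table fill and the backtracking step can be sketched together. The encoding and weights are assumptions for illustration, not read from the figures: `down[i][j]` is the block from (i, j) south to (i+1, j), `right[i][j]` the block from (i, j) east to (i, j+1), and the table is filled row by row rather than in the order drawn above.

```python
# Hypothetical 2-block by 2-block grid (3 x 3 intersections).
down = [[2, 4, 1],
        [1, 1, 3]]
right = [[3, 1],
         [8, 9],
         [2, 1]]


def manhattan_tourist(down, right):
    """Return (best score, path) where path is a string of 'S' and 'E' moves."""
    n, m = len(down), len(right[0])
    score = [[0] * (m + 1) for _ in range(n + 1)]
    came_from = [[None] * (m + 1) for _ in range(n + 1)]  # the "links" back

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # arrive from the north by moving south
                score[i][j] = score[i - 1][j] + down[i - 1][j]
                came_from[i][j] = "S"
            if j > 0:  # arrive from the west by moving east
                via_west = score[i][j - 1] + right[i][j - 1]
                if i == 0 or via_west > score[i][j]:
                    score[i][j] = via_west
                    came_from[i][j] = "E"

    # Backtracking: follow the links from the destination to the source.
    path = []
    i, j = n, m
    while came_from[i][j] is not None:
        step = came_from[i][j]
        path.append(step)
        if step == "S":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(path))


best, path = manhattan_tourist(down, right)
```

On this grid the result is a score of 22 along the path `SEES`. When two routes tie, this sketch keeps the northern one, which is one of possibly several equally good paths.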

<p style="text-align: right; clear: right;">19</p>

# MTP: Recurrence

Computing the score for a point (i,j) by the recurrence relation:

<img src="./Media/MTPRecurrence.png" width="500">

The running time is *O(nm)* for an n &times; m grid
 * (You visit each intersection once, and perform 2 tests)

(n = # of rows, m = # of columns)
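Written out in the *Score*/*Edge* notation used earlier, the recurrence with its boundary cases would read as follows. This assumes the figure uses the same convention, with *Edge(a,b)* standing for the weight of the block that leads from *(a,b)* into the vertex being scored:

$$Score(i,j) = \begin{cases}
0 & i = 0,\ j = 0 \\
Score(i-1,0) + Edge(i-1,0) & i > 0,\ j = 0 \\
Score(0,j-1) + Edge(0,j-1) & i = 0,\ j > 0 \\
Max(Score(i-1,j) + Edge(i-1,j),\ Score(i,j-1) + Edge(i,j-1)) & i > 0,\ j > 0
\end{cases}$$

Each of the $(n+1)(m+1)$ intersections needs only a constant amount of work, which is where the $O(nm)$ bound comes from.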

<p style="text-align: right; clear: right;">20</p>

# Manhattan Is Not A Perfect Grid

<img src="./Media/Broadway.png" width="500">

* Easy to fix: just add more cases to the recurrence.
* The score at point B is given by:

<img src="./Media/DiagGrid.png" width="400">
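A sketch of the extra case, under assumptions: diagonal streets are stored in a hypothetical dictionary `diag`, where `diag[(i, j)]` is the weight of a street from (i, j) to (i+1, j+1). The `down`/`right` encoding (south and east blocks, indexed by row and column) and all weights are made up for illustration.

```python
# Hypothetical 2-block by 2-block grid (3 x 3 intersections).
down = [[2, 4, 1],
        [1, 1, 3]]
right = [[3, 1],
         [8, 9],
         [2, 1]]


def manhattan_with_diagonals(down, right, diag):
    """Best score to the far corner when some diagonal streets also exist."""
    n, m = len(down), len(right[0])
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue  # the source keeps its score of 0
            candidates = []
            if i > 0:  # arrive from the north
                candidates.append(score[i - 1][j] + down[i - 1][j])
            if j > 0:  # arrive from the west
                candidates.append(score[i][j - 1] + right[i][j - 1])
            if (i - 1, j - 1) in diag:  # arrive along a diagonal street
                candidates.append(score[i - 1][j - 1] + diag[(i - 1, j - 1)])
            score[i][j] = max(candidates)
    return score[n][m]


plain_grid = manhattan_with_diagonals(down, right, {})
with_broadway = manhattan_with_diagonals(down, right, {(0, 0): 15})
```

With no diagonals the result matches the plain grid (22). Adding one diagonal street worth 15 from the source raises the best score to 27, and the only code change is one more candidate in the maximum.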

<p style="text-align: right; clear: right;">21</p>

# Other ways to safely explore Manhattan

* We chose to evaluate our table in a particular order: by uniform distance from the source (all points one block away, then 2 blocks, etc.)
* Other strategies:
  - Column by column
  - Row by row
  - Along diagonals
* This choice can have performance implications

<table style="border: none;">
<tr style="border: none;">
<td style="border: none; padding: 0px 20px;"><img src="./Media/Traversal2.png" width="192px"></td>
<td style="border: none;"><img src="./Media/Traversal1.png" width="400px"></td>
</tr>
</table>
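Any order works as long as the northern and western neighbors of an intersection are scored before the intersection itself. The sketch below fills the same table in three orders and compares the results. The grid encoding (`down` for south blocks, `right` for east blocks, indexed by row and column), the weights, and the helper name `fill` are illustrative assumptions.

```python
# Hypothetical 2-block by 2-block grid (3 x 3 intersections).
down = [[2, 4, 1],
        [1, 1, 3]]
right = [[3, 1],
         [8, 9],
         [2, 1]]


def fill(down, right, order):
    """Fill the score table, visiting intersections in the given order."""
    n, m = len(down), len(right[0])
    # None marks "not computed yet"; an unsafe order fails with a TypeError
    # when it tries to add a weight to None.
    score = [[None] * (m + 1) for _ in range(n + 1)]
    for i, j in order:
        candidates = []
        if i > 0:  # arrive from the north
            candidates.append(score[i - 1][j] + down[i - 1][j])
        if j > 0:  # arrive from the west
            candidates.append(score[i][j - 1] + right[i][j - 1])
        score[i][j] = max(candidates, default=0)  # the source has no candidates
    return score


n, m = len(down), len(right[0])
row_by_row = [(i, j) for i in range(n + 1) for j in range(m + 1)]
column_by_column = [(i, j) for j in range(m + 1) for i in range(n + 1)]
# Anti-diagonals: all intersections d blocks from the source, for d = 0, 1, 2, ...
along_diagonals = [(i, d - i)
                   for d in range(n + m + 1)
                   for i in range(max(0, d - m), min(n, d) + 1)]

tables = [fill(down, right, order)
          for order in (row_by_row, column_by_column, along_diagonals)]
```

All three tables come out identical. The orders differ only in memory access pattern and in how much of the work within one step is independent, which is where the performance implications come from.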

<p style="text-align: right; clear: right;">22</p>

# Next Time

* Return to biology
* Solving sequence alignments using Dynamic Programming

<img src="./Media/McDNAs.png" width="600">

<p style="text-align: right; clear: right;">23</p>

In [3]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $("div.input:contains('Hidden-Cell')").hide();
    } else {
        $("div.input:contains('Hidden-Cell')").show();
    }
    code_show = !code_show;
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Toggle Hidden Cells"></form>''')