"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from graphviz import Digraph\n",
"\n",
"dot = Digraph(comment='debruijn')\n",
"\n",
"for km1mer in nodes:\n",
" dot.node(km1mer, km1mer)\n",
" \n",
"for src in nodes:\n",
" ends = nodes[src]\n",
" for end in ends:\n",
" dot.edge(src, end)\n",
"\n",
"dot.format = 'png'\n",
"dot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Eulerian walks\n",
"\n",
"Why create this weird graph structure from our reads? At\n",
"first it seems like it makes things more complicated. Actually\n",
"this is an example of transforming your problem to make it\n",
"easier.\n",
"\n",
"To reconstruct the original string from this graph all we\n",
"have to do is find a trip along all the edges of the graph\n",
"that visits each edge only once. If we can do that, we\n",
"are done. People who are into graph theory call this\n",
"an eulerian walk. If you grew up in Germany you might\n",
"recognise this as solving the [kids puzzle](https://de.wikipedia.org/wiki/Haus_vom_Nikolaus) \"Das ist das\n",
"Haus vom Ni-ko-laus.\"\n",
"\n",
"\n",
"

Von SoylentGreen - Eigenes Werk, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2308144

\n",
"\n",
"\n",
"## Is it a tour?\n",
"\n",
"First things first, how will we know if a proposed tour is\n",
"a valid one? This is what the next few functions take care\n",
"of."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def edges(graph):\n",
" \"\"\"List all directed edges of `graph`\"\"\"\n",
" for node in graph:\n",
" for target in graph[node]:\n",
" yield (node, target)\n",
"\n",
"\n",
"def follow_tour(tour, graph):\n",
" \"\"\"Follow a tour and check it is eulerian\"\"\"\n",
" edges_ = list(edges(graph))\n",
" for start, end in zip(tour, tour[1:]):\n",
" try:\n",
" edges_.remove((start, end))\n",
" # most likely removing an edge that was already used\n",
" except:\n",
" return False\n",
" \n",
" # if there are any edges left this is neither\n",
" # an eulerian tour nor an eulerian trail\n",
" if edges_:\n",
" return False\n",
" else:\n",
" return True\n",
"\n",
"\n",
"def check_tour(start, graph):\n",
" our_tour = tour(start, graph)\n",
" valid_tour = follow_tour(our_tour, graph)\n",
" return valid_tour, \"\".join(s[0] for s in our_tour) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To construct an actual eulerian cycle or trail we use\n",
"[Hierholzer's algorithm](https://en.wikipedia.org/wiki/Eulerian_path#Hierholzer.27s_algorithm).\n",
"\n",
"There are some subtleties we would have to take\n",
"care of for a production grade implementation, like\n",
"dealing with where to start when the graph only contains\n",
"an eulerian path and not a cycle, etc. However it\n",
"does the job for small examples and allows you to\n",
"witness the miracle of genome assembly!"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def tour(start_node, graph):\n",
" \"\"\"Find an eulerian cycle or trail.\n",
" \n",
" This does not check if the graph is eulerian\n",
" so it might return tours that are nonsense.\n",
" \"\"\"\n",
" # _tour() modifies the graph structure so we need to copy it\n",
" graph = copy.deepcopy(graph)\n",
" return _tour(start_node, graph)\n",
"\n",
"def _tour(start_node, graph, end=None):\n",
" tour = [start_node]\n",
" finish_on = end if end is not None else start_node\n",
" while True:\n",
" options = graph[tour[-1]]\n",
"\n",
" # eulerian trail, not tour?\n",
" if not options:\n",
" break\n",
" \n",
" tour.append(options.pop())\n",
" if tour[-1] == finish_on:\n",
" break\n",
" \n",
" # when we insert a sub-tour we extend the\n",
" # length of tour, need to track this\n",
" offset = 0\n",
" for n,step in enumerate(tour[:]):\n",
" options = graph[step]\n",
" if options:\n",
" t = _tour(options.pop(), graph, step)\n",
" n += offset\n",
" tour = tour[:n+1] + t + tour[n+1:]\n",
" offset += len(t)\n",
" \n",
" return tour"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"below some examples of graphs I tried out while\n",
"writing the algorithm. This is an [interactive blog post](http://betatim.github.io/posts/interactive-posts/)\n",
"so play around with them. Use the snippet form above to\n",
"draw each graph if you prefer to see it visually."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"defaultdict(, {'b': ['c', 'd', 'f'], 'c': ['b'], 'x': ['d'], 'f': ['g'], 'd': ['e', 'b'], 'g': ['a'], 'a': ['b'], 'e': ['x']})\n",
"cbfgabdexdbc\n"
]
}
],
"source": [
"check_tour('a', {'a': ['b'], 'b': ['a']})\n",
"check_tour('a', {'a': ['b'], 'b': ['c'], 'c': ['a', 'e'], 'e': ['f'], 'f': ['c']})\n",
"check_tour('a', {'a': ['b'], 'b': ['c'], 'c': ['a', 'e'], 'e': ['f'],\n",
" 'f': ['c', 'g'], 'g': ['f']})\n",
"check_tour('f', {'a': ['b'], 'b': ['c'], 'c': ['a', 'e'], 'e': ['f'],\n",
" 'f': ['c', 'g'], 'g': ['f']})\n",
"check_tour('g', {'a': ['b'], 'b': ['c'], 'c': ['a', 'e'], 'e': ['f'],\n",
" 'f': ['c', 'g'], 'g': ['f']})\n",
"check_tour('c', {'a': ['b'], 'b': ['c'], 'c': ['a', 'e'], 'e': ['f'],\n",
" 'f': ['c', 'g'], 'g': ['f']})\n",
"\n",
"\n",
"random.seed(54)\n",
"genome = 'abcbdexdbfga'\n",
"g = make_graph(genome, 2)\n",
"\n",
"valid, t = check_tour(genome[random.randint(0, len(genome)-1)], g)\n",
"\n",
"print(g)\n",
"print(t)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"I hope this helps you understand a bit more how all this genome assembly\n",
"stuff works. it certainly helped me. If you know about the biology behind\n",
"all this, please be lenient, if you spot gross mistakes or imprecise\n",
"statements that cause confusion for novices do get in touch!\n",
"\n",
"If you find a mistake or want to tell me something else get in touch on twitter @[betatim](//twitter.com/betatim)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}