Commit: Add slides for icml mloss

mrocklin committed Jul 9, 2015
1 parent 210ce81 commit 4ad943e
Showing 15 changed files with 752 additions and 0 deletions.
72 changes: 72 additions & 0 deletions docs/source/_static/presentations/icml-mloss.html
@@ -0,0 +1,72 @@
<!doctype html>
<html lang="en">

<head>
<meta charset="utf-8">

<title>Slides</title>

<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/default.css" id="theme">

<link rel="stylesheet" href="lib/css/zenburn.css">
</head>

<body>

<div class="reveal">

<div class="slides">

<section data-markdown="markdown/icml-mloss.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/foundations.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-array.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-array-meteorology.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/dask-core.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-svd.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/cross-validation.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/finish.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
</div>
</div>

<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>

<script>

Reveal.initialize({
controls: true,
progress: true,
history: true,
center: true,

// Optional libraries used to extend on reveal.js
dependencies: [
{ src: 'lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: 'marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: 'plugin/notes/notes.js' }
]
});

</script>

</body>
</html>
Binary file not shown.

(3 more files could not be displayed.)
2 changes: 2 additions & 0 deletions docs/source/_static/presentations/index.html
@@ -7,6 +7,8 @@
<body>
<ul>

<li><a href="icml-mloss.html">Dask and Parallel Python</a>
at <a href="http://mloss.org/workshop/icml15/">ICML - Machine Learning and Open Source Software workshop, Lille, July 2015</a></li>
<li><a href="pydata-berlin.html">Dask and Parallel PyData</a>
at <a href="https://pydata.org/berlin2015/">PyData Berlin, 2015</a></li>
<li><a href="ucar-sea-2015.html">Dask.array at UCAR SEA -- Boulder, CO April 12th</a> -- <a href="https://sea.ucar.edu/event/out-core-computations-blaze">video</a>
16 changes: 16 additions & 0 deletions docs/source/_static/presentations/markdown/dask-dataframe.md
@@ -0,0 +1,16 @@
## `dask.dataframe` is...

* an out-of-core, multi-core, partitioned dataframe
* that copies the `pandas` interface
* using blocked algorithms
* and task scheduling
* to orchestrate many in-memory Pandas operations

<img src="images/dataframe.png" alt="Partitioned DataFrame">


## `dask.dataframe` is...

Very new.

Ready for use, but known failures are still popping up.
85 changes: 85 additions & 0 deletions docs/source/_static/presentations/markdown/dask-svd.md
@@ -0,0 +1,85 @@
## Example: SVD


## Most Parallel Computation is Simple

    >>> import json
    >>> import dask.bag as db
    >>> b = (db.from_s3('githubarchive-data', '2015-01-01-*.json.gz')
               .map(json.loads)
               .filter(lambda d: d['type'] == 'PushEvent')
               .count())

<img src="images/embarrassing.png" alt="embarassingly parallel dask workload">


## What about more complex workflows?

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)

<a href="images/dask-svd.png">
<img src="images/dask-svd.png" alt="Dask SVD graph" width="30%">
</a>

*Work by Mariano Tepper. "Compressed Nonnegative Matrix Factorization is Fast
and Accurate" [arXiv](http://arxiv.org/abs/1505.04650)*


## SVD - Dict

>>> s.dask
{('x', 0, 0): (np.ones, (1000, 1000)),
('x', 1, 0): (np.ones, (1000, 1000)),
('x', 2, 0): (np.ones, (1000, 1000)),
('x', 3, 0): (np.ones, (1000, 1000)),
('x', 4, 0): (np.ones, (1000, 1000)),
('tsqr_2_QR_st1', 0, 0): (np.linalg.qr, ('x', 0, 0)),
('tsqr_2_QR_st1', 1, 0): (np.linalg.qr, ('x', 1, 0)),
('tsqr_2_QR_st1', 2, 0): (np.linalg.qr, ('x', 2, 0)),
('tsqr_2_QR_st1', 3, 0): (np.linalg.qr, ('x', 3, 0)),
('tsqr_2_QR_st1', 4, 0): (np.linalg.qr, ('x', 4, 0)),
('tsqr_2_R', 0, 0): (operator.getitem, ('tsqr_2_QR_st2', 0, 0), 1),
     ('tsqr_2_R_st1', 0, 0): (operator.getitem, ('tsqr_2_QR_st1', 0, 0), 1),
('tsqr_2_R_st1', 1, 0): (operator.getitem, ('tsqr_2_QR_st1', 1, 0), 1),
('tsqr_2_R_st1', 2, 0): (operator.getitem, ('tsqr_2_QR_st1', 2, 0), 1),
('tsqr_2_R_st1', 3, 0): (operator.getitem, ('tsqr_2_QR_st1', 3, 0), 1),
('tsqr_2_R_st1', 4, 0): (operator.getitem, ('tsqr_2_QR_st1', 4, 0), 1),
     ('tsqr_2_R_st1_stacked', 0, 0): (np.vstack,
                                       [('tsqr_2_R_st1', 0, 0),
                                        ('tsqr_2_R_st1', 1, 0),
                                        ('tsqr_2_R_st1', 2, 0),
                                        ('tsqr_2_R_st1', 3, 0),
                                        ('tsqr_2_R_st1', 4, 0)]),
('tsqr_2_QR_st2', 0, 0): (np.linalg.qr, ('tsqr_2_R_st1_stacked', 0, 0)),
('tsqr_2_SVD_st2', 0, 0): (np.linalg.svd, ('tsqr_2_R', 0, 0)),
('tsqr_2_S', 0): (operator.getitem, ('tsqr_2_SVD_st2', 0, 0), 1)}


## SVD - Parallel Profile

<iframe src="../svd.profile.html"
marginwidth="0"
marginheight="0" scrolling="no" width="800"
height="300"></iframe>

*Bokeh profile tool by Jim Crist*


## Randomized Approximate Parallel Out-of-Core SVD

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd_compressed(x, k=100, n_power_iter=2)

<a href="images/dask-svd-random.png">
<img src="images/dask-svd-random.png"
alt="Dask graph for random SVD"
width="10%" >
</a>

N. Halko, P. G. Martinsson, and J. A. Tropp.
*Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions.*

*Dask implementation by Mariano Tepper*
27 changes: 27 additions & 0 deletions docs/source/_static/presentations/markdown/icml-mloss.md
@@ -0,0 +1,27 @@
## Dask

*or*

## Python and Parallelism

*Matthew Rocklin*

Continuum Analytics


## Outline

* .
* Dask - Dynamic Task Scheduling
* Dask.array - out-of-core, parallel NumPy
* Dask with other workloads
* .


## Outline

* Numeric Python and Parallelism
* Dask - Dynamic Task Scheduling
* Dask.array - out-of-core, parallel NumPy
* Dask with other workloads
* Parallelism and Machine Learning
32 changes: 32 additions & 0 deletions docs/source/_static/presentations/markdown/mloss/cross-validation.md
@@ -0,0 +1,32 @@
## Example: Cross Validation

Afternoon sprint with Olivier Grisel

for fold_id in range(n_folds):
...
dsk[(name, 'model', model_id)] = clone(model)

for partition_id in range(data.npartitions):
if partition_id % n_folds == fold_id:
dsk[(name, 'validation', validation_id)] = (score, ...)
else:
dsk[(name, 'model', model_id)] = (_partial_fit, ...)

...
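
Below is a self-contained toy sketch of the same pattern. It is illustrative only, not the sprint code: the random data, helper names, and fold layout are made up.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import SGDClassifier
    from dask.threaded import get

    def _partial_fit(model, block):
        X, y = block
        model.partial_fit(X, y, classes=[0, 1])
        return model

    def _score(model, block):
        X, y = block
        return model.score(X, y)

    # Toy "partitions": four in-memory (X, y) blocks
    blocks = [(np.random.rand(20, 5), np.random.randint(0, 2, 20))
              for _ in range(4)]
    n_folds = 2
    base = SGDClassifier(random_state=0)

    dsk = {('data', i): b for i, b in enumerate(blocks)}
    for fold in range(n_folds):
        prev = ('model', fold, 'init')
        dsk[prev] = (clone, base)                 # fresh model per fold
        held_out = None
        for i in range(len(blocks)):
            if i % n_folds == fold:
                held_out = ('data', i)            # reserve block for validation
            else:
                key = ('model', fold, i)
                dsk[key] = (_partial_fit, prev, ('data', i))
                prev = key                        # chain fits within the fold
        dsk[('score', fold)] = (_score, prev, held_out)

    scores = get(dsk, [('score', f) for f in range(n_folds)])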


## Cross Validation

<a href="../images/dask-cross-validation.pdf">
<img src="../images/dask-cross-validation.png" alt="Cross validation dask"
width="40%">
</a>


## Cross Validation

This killed the small-memory-footprint heuristics in the dask scheduler.
We are fixing it with a small amount of
[static scheduling](https://github.com/ContinuumIO/dask/pull/403).

[Profile](https://rawgit.com/mrocklin/8ec0443c94da553fe00c/raw/ff7d8d0754d07f35086b08c0d21865a03b3edeac/profile.html)
109 changes: 109 additions & 0 deletions docs/source/_static/presentations/markdown/mloss/dask-core.md
@@ -0,0 +1,109 @@
## `dask.core`

Dead simple task scheduling

[dask.pydata.org](http://dask.pydata.org/en/latest/)


## We've seen `dask.array`

* Turns Numpy-ish code

(2*x + 1) ** 3

* Into Graphs

![](images/embarrassing.png)


## We've seen `dask.array`

* Turns Numpy-ish code

(2*x + 1) ** 3

* Then executes those graphs

![](images/embarrassing.gif)
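
For example, a minimal round trip on a toy array (summed here so the result fits on one line):

    >>> import dask.array as da
    >>> x = da.ones((15,), chunks=(5,))
    >>> y = (2 * x + 1) ** 3        # builds a graph; nothing computed yet
    >>> y.sum().compute()           # executes the graph
    405.0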


### Q: What constitutes a dask graph?


<img src="images/dask-simple.png"
alt="A simple dask dictionary"
width="18%"
align="right">

# Normal Python # Dask

def inc(i):
return i + 1

def add(a, b):
return a + b

x = 1 d = {'x': 1,
y = inc(x) 'y': (inc, 'x'),
z = add(y, 10) 'z': (add, 'y', 10)}

<hr>

>>> from dask.threaded import get
>>> get(d, 'z')
12

* Simple representation
* Use Python to generate graphs (no DSL)
* Not user-friendly


### Example - dask.array

    >>> import dask.array as da
    >>> x = da.arange(15, chunks=(5,))
    >>> x
    dask.array<x, shape=(15,), chunks=((5, 5, 5)), dtype=None>

>>> x.dask
{("x", 0): (np.arange, 0, 5),
("x", 1): (np.arange, 5, 10),
("x", 2): (np.arange, 10, 15)}

>>> x.sum().dask
{("x", 0): (np.arange, 0, 5),
("x", 1): (np.arange, 5, 10),
("x", 2): (np.arange, 10, 15),
("s", 0): (np.sum, ("x", 0)),
("s", 1): (np.sum, ("x", 1)),
("s", 2): (np.sum, ("x", 2)),
("s",): (sum, [("s", 0), ("s", 1), ("s", 2)])}


### Dask.array is a convenient way to make dictionaries

<hr>

### Dask is a convenient way to make libraries like dask.array


## Dask works for more than just arrays

* `dask.array` = `numpy` + `threading`
* `dask.bag` = `list` + `multiprocessing`
* `dask.dataframe` = `pandas` + `threading`
* ...


* Collections build graphs
* Schedulers execute graphs

<img src="images/collections-schedulers.png"
width="100%">

* Neither side needs the other
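
As a sketch of that decoupling, the same hand-written graph runs unchanged under either the threaded or the multiprocessing scheduler (assuming `dask.multiprocessing` is available in this installation):

    >>> from dask.threaded import get as threaded_get
    >>> from dask.multiprocessing import get as mp_get
    >>> def inc(i):
    ...     return i + 1
    >>> def add(a, b):
    ...     return a + b
    >>> d = {'x': 1, 'y': (inc, 'x'), 'z': (add, 'y', 10)}
    >>> threaded_get(d, 'z')    # thread pool
    12
    >>> mp_get(d, 'z')          # process pool, same graph
    12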


### Question: Is there a similar class of problems in ML?

<hr>

### Question: How should we write them down as code?
25 changes: 25 additions & 0 deletions docs/source/_static/presentations/markdown/mloss/finish.md
@@ -0,0 +1,25 @@
## Final Thoughts

* Python and Parallelism
* Most data is small
* Storage, representation, streaming, sampling offer bigger gains
* That being said, please [release the GIL](https://github.com/scikit-image/scikit-image/pull/1519)

* Dask: Dynamic task scheduling yields sane parallelism
* Simple library to enable parallelism
* Dask.array/dataframe demonstrate ability
* Rarely optimal performance (Theano is far smarter)
* Scheduling necessary for composed algorithms

* Questions:
* Appropriate class of problems in ML?
* What is the right API for algorithm builders?


## Questions

[http://dask.pydata.org](http://dask.pydata.org)

<img src="images/jenga.png" width="60%">

<img src="images/fail-case.gif" width="60%">