Commit: Add slides for icml mloss

mrocklin committed Jul 9, 2015
1 parent 210ce81 commit 4ad943e
Showing 15 changed files with 752 additions and 0 deletions.
72 changes: 72 additions & 0 deletions docs/source/_static/presentations/icml-mloss.html
@@ -0,0 +1,72 @@
<!doctype html>
<html lang="en">

<head>
<meta charset="utf-8">

<title>Slides</title>

<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/default.css" id="theme">

<link rel="stylesheet" href="lib/css/zenburn.css">
</head>

<body>

<div class="reveal">

<div class="slides">

<section data-markdown="markdown/icml-mloss.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/foundations.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-array.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-array-meteorology.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/dask-core.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-svd.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/cross-validation.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/mloss/finish.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
</div>
</div>

<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>

<script>

Reveal.initialize({
controls: true,
progress: true,
history: true,
center: true,

// Optional libraries used to extend on reveal.js
dependencies: [
{ src: 'lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: 'marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: 'plugin/notes/notes.js' }
]
});

</script>

</body>
</html>
Binary file not shown.

(3 more files could not be displayed.)
2 changes: 2 additions & 0 deletions docs/source/_static/presentations/index.html
@@ -7,6 +7,8 @@
<body>
<ul>

<li><a href="icml-mloss.html">Dask and Parallel Python</a>
at <a href="http://mloss.org/workshop/icml15/">ICML - Machine Learning and Open Source Software workshop, Lille, July 2015</a></li>
<li><a href="pydata-berlin.html">Dask and Parallel PyData</a>
at <a href="https://pydata.org/berlin2015/">PyData Berlin, 2015</a></li>
<li><a href="ucar-sea-2015.html">Dask.array at UCAR SEA -- Boulder, CO April 12th</a> -- <a href="https://sea.ucar.edu/event/out-core-computations-blaze">video</a>
16 changes: 16 additions & 0 deletions docs/source/_static/presentations/markdown/dask-dataframe.md
@@ -0,0 +1,16 @@
## `dask.dataframe` is...

* an out-of-core, multi-core, partitioned dataframe
* that copies the `pandas` interface
* using blocked algorithms
* and task scheduling
* to orchestrate many in-memory Pandas operations

<img src="images/dataframe.png" alt="Partitioned DataFrame">


## `dask.dataframe` is...

Very new.

Ready for use, but known failures are still popping up.
85 changes: 85 additions & 0 deletions docs/source/_static/presentations/markdown/dask-svd.md
@@ -0,0 +1,85 @@
## Example: SVD


## Most Parallel Computation is Simple

    >>> import json
    >>> import dask.bag as db
    >>> b = (db.from_s3('githubarchive-data', '2015-01-01-*.json.gz')
               .map(json.loads)
               .filter(lambda d: d['type'] == 'PushEvent')
               .count())

<img src="images/embarrassing.png" alt="embarassingly parallel dask workload">


## What about more complex workflows?

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd(x)

<a href="images/dask-svd.png">
<img src="images/dask-svd.png" alt="Dask SVD graph" width="30%">
</a>

*Work by Mariano Tepper. "Compressed Nonnegative Matrix Factorization is Fast
and Accurate" [arXiv](http://arxiv.org/abs/1505.04650)*


## SVD - Dict

>>> s.dask
{('x', 0, 0): (np.ones, (1000, 1000)),
('x', 1, 0): (np.ones, (1000, 1000)),
('x', 2, 0): (np.ones, (1000, 1000)),
('x', 3, 0): (np.ones, (1000, 1000)),
('x', 4, 0): (np.ones, (1000, 1000)),
('tsqr_2_QR_st1', 0, 0): (np.linalg.qr, ('x', 0, 0)),
('tsqr_2_QR_st1', 1, 0): (np.linalg.qr, ('x', 1, 0)),
('tsqr_2_QR_st1', 2, 0): (np.linalg.qr, ('x', 2, 0)),
('tsqr_2_QR_st1', 3, 0): (np.linalg.qr, ('x', 3, 0)),
('tsqr_2_QR_st1', 4, 0): (np.linalg.qr, ('x', 4, 0)),
('tsqr_2_R', 0, 0): (operator.getitem, ('tsqr_2_QR_st2', 0, 0), 1),
     ('tsqr_2_R_st1', 0, 0): (operator.getitem, ('tsqr_2_QR_st1', 0, 0), 1),
('tsqr_2_R_st1', 1, 0): (operator.getitem, ('tsqr_2_QR_st1', 1, 0), 1),
('tsqr_2_R_st1', 2, 0): (operator.getitem, ('tsqr_2_QR_st1', 2, 0), 1),
('tsqr_2_R_st1', 3, 0): (operator.getitem, ('tsqr_2_QR_st1', 3, 0), 1),
('tsqr_2_R_st1', 4, 0): (operator.getitem, ('tsqr_2_QR_st1', 4, 0), 1),
     ('tsqr_2_R_st1_stacked', 0, 0): (np.vstack,
                                       [('tsqr_2_R_st1', 0, 0),
                                        ('tsqr_2_R_st1', 1, 0),
                                        ('tsqr_2_R_st1', 2, 0),
                                        ('tsqr_2_R_st1', 3, 0),
                                        ('tsqr_2_R_st1', 4, 0)]),
('tsqr_2_QR_st2', 0, 0): (np.linalg.qr, ('tsqr_2_R_st1_stacked', 0, 0)),
('tsqr_2_SVD_st2', 0, 0): (np.linalg.svd, ('tsqr_2_R', 0, 0)),
('tsqr_2_S', 0): (operator.getitem, ('tsqr_2_SVD_st2', 0, 0), 1)}


## SVD - Parallel Profile

<iframe src="../svd.profile.html"
marginwidth="0"
marginheight="0" scrolling="no" width="800"
height="300"></iframe>

*Bokeh profile tool by Jim Crist*


## Randomized Approximate Parallel Out-of-Core SVD

>>> import dask.array as da
>>> x = da.ones((5000, 1000), chunks=(1000, 1000))
>>> u, s, v = da.linalg.svd_compressed(x, k=100, n_power_iter=2)

<a href="images/dask-svd-random.png">
<img src="images/dask-svd-random.png"
alt="Dask graph for random SVD"
width="10%" >
</a>

N. Halko, P. G. Martinsson, and J. A. Tropp.
*Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions.*

*Dask implementation by Mariano Tepper*
27 changes: 27 additions & 0 deletions docs/source/_static/presentations/markdown/icml-mloss.md
@@ -0,0 +1,27 @@
## Dask

*or*

## Python and Parallelism

*Matthew Rocklin*

Continuum Analytics


## Outline

* .
* Dask - Dynamic Task Scheduling
* Dask.array - out-of-core, parallel NumPy
* Dask with other workloads
* .


## Outline

* Numeric Python and Parallelism
* Dask - Dynamic Task Scheduling
* Dask.array - out-of-core, parallel NumPy
* Dask with other workloads
* Parallelism and Machine Learning
32 changes: 32 additions & 0 deletions docs/source/_static/presentations/markdown/mloss/cross-validation.md
@@ -0,0 +1,32 @@
## Example: Cross Validation

Afternoon sprint with Olivier Grisel

for fold_id in range(n_folds):
...
dsk[(name, 'model', model_id)] = clone(model)

for partition_id in range(data.npartitions):
if partition_id % n_folds == fold_id:
dsk[(name, 'validation', validation_id)] = (score, ...)
else:
dsk[(name, 'model', model_id)] = (_partial_fit, ...)

...
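
Below is a self-contained toy sketch of the same pattern. It is illustrative only, not the sprint code: the random data, helper names, and fold layout are made up.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import SGDClassifier
    from dask.threaded import get

    def _partial_fit(model, block):
        X, y = block
        model.partial_fit(X, y, classes=[0, 1])
        return model

    def _score(model, block):
        X, y = block
        return model.score(X, y)

    # Toy "partitions": four in-memory (X, y) blocks
    blocks = [(np.random.rand(20, 5), np.random.randint(0, 2, 20))
              for _ in range(4)]
    n_folds = 2
    base = SGDClassifier(random_state=0)

    dsk = {('data', i): b for i, b in enumerate(blocks)}
    for fold in range(n_folds):
        prev = ('model', fold, 'init')
        dsk[prev] = (clone, base)                 # fresh model per fold
        held_out = None
        for i in range(len(blocks)):
            if i % n_folds == fold:
                held_out = ('data', i)            # reserve block for validation
            else:
                key = ('model', fold, i)
                dsk[key] = (_partial_fit, prev, ('data', i))
                prev = key                        # chain fits within the fold
        dsk[('score', fold)] = (_score, prev, held_out)

    scores = get(dsk, [('score', f) for f in range(n_folds)])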


## Cross Validation

<a href="../images/dask-cross-validation.pdf">
<img src="../images/dask-cross-validation.png" alt="Cross validation dask"
width="40%">
</a>


## Cross Validation

This killed the small-memory-footprint heuristics in the dask scheduler.
We are fixing it with a small amount of
[static scheduling](https://github.com/ContinuumIO/dask/pull/403).

[Profile](https://rawgit.com/mrocklin/8ec0443c94da553fe00c/raw/ff7d8d0754d07f35086b08c0d21865a03b3edeac/profile.html)
109 changes: 109 additions & 0 deletions docs/source/_static/presentations/markdown/mloss/dask-core.md
@@ -0,0 +1,109 @@
## `dask.core`

Dead simple task scheduling

[dask.pydata.org](http://dask.pydata.org/en/latest/)


## We've seen `dask.array`

* Turns Numpy-ish code

(2*x + 1) ** 3

* Into Graphs

![](images/embarrassing.png)


## We've seen `dask.array`

* Turns Numpy-ish code

(2*x + 1) ** 3

* Then executes those graphs

![](images/embarrassing.gif)
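
For example, a minimal round trip on a toy array (summed here so the result fits on one line):

    >>> import dask.array as da
    >>> x = da.ones((15,), chunks=(5,))
    >>> y = (2 * x + 1) ** 3        # builds a graph; nothing computed yet
    >>> y.sum().compute()           # executes the graph
    405.0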


### Q: What constitutes a dask graph?


<img src="images/dask-simple.png"
alt="A simple dask dictionary"
width="18%"
align="right">

# Normal Python # Dask

def inc(i):
return i + 1

def add(a, b):
return a + b

x = 1 d = {'x': 1,
y = inc(x) 'y': (inc, 'x'),
z = add(y, 10) 'z': (add, 'y', 10)}

<hr>

>>> from dask.threaded import get
>>> get(d, 'z')
12

* Simple representation
* Use Python to generate graphs (no DSL)
* Not user-friendly


### Example - dask.array

    >>> import dask.array as da
    >>> x = da.arange(15, chunks=(5,))
    >>> x
    dask.array<x, shape=(15,), chunks=((5, 5, 5)), dtype=None>

>>> x.dask
{("x", 0): (np.arange, 0, 5),
("x", 1): (np.arange, 5, 10),
("x", 2): (np.arange, 10, 15)}

>>> x.sum().dask
{("x", 0): (np.arange, 0, 5),
("x", 1): (np.arange, 5, 10),
("x", 2): (np.arange, 10, 15),
("s", 0): (np.sum, ("x", 0)),
("s", 1): (np.sum, ("x", 1)),
("s", 2): (np.sum, ("x", 2)),
("s",): (sum, [("s", 0), ("s", 1), ("s", 2)])}


### Dask.array is a convenient way to make dictionaries

<hr>

### Dask is a convenient way to make libraries like dask.array


## Dask works for more than just arrays

* `dask.array` = `numpy` + `threading`
* `dask.bag` = `list` + `multiprocessing`
* `dask.dataframe` = `pandas` + `threading`
* ...


* Collections build graphs
* Schedulers execute graphs

<img src="images/collections-schedulers.png"
width="100%">

* Neither side needs the other
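
As a sketch of that decoupling, the same hand-written graph runs unchanged under either the threaded or the multiprocessing scheduler (assuming `dask.multiprocessing` is available in this installation):

    >>> from dask.threaded import get as threaded_get
    >>> from dask.multiprocessing import get as mp_get
    >>> def inc(i):
    ...     return i + 1
    >>> def add(a, b):
    ...     return a + b
    >>> d = {'x': 1, 'y': (inc, 'x'), 'z': (add, 'y', 10)}
    >>> threaded_get(d, 'z')    # thread pool
    12
    >>> mp_get(d, 'z')          # process pool, same graph
    12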


### Question: Is there a similar class of problems in ML?

<hr>

### Question: How should we write them down as code?
25 changes: 25 additions & 0 deletions docs/source/_static/presentations/markdown/mloss/finish.md
@@ -0,0 +1,25 @@
## Final Thoughts

* Python and Parallelism
* Most data is small
* Storage, representation, streaming, sampling offer bigger gains
* That being said, please [release the GIL](https://github.com/scikit-image/scikit-image/pull/1519)

* Dask: Dynamic task scheduling yields sane parallelism
* Simple library to enable parallelism
* Dask.array/dataframe demonstrate ability
* Rarely optimal performance (Theano is far smarter)
* Scheduling necessary for composed algorithms

* Questions:
* Appropriate class of problems in ML?
* What is the right API for algorithm builders?


## Questions

[http://dask.pydata.org](http://dask.pydata.org)

<img src="images/jenga.png" width="60%">

<img src="images/fail-case.gif" width="60%">