update mloss talk
mrocklin committed Jul 9, 2015
1 parent 4ad943e commit dcbd449
Showing 6 changed files with 43 additions and 14 deletions.
6 changes: 6 additions & 0 deletions docs/source/_static/presentations/icml-mloss.html
@@ -24,6 +24,9 @@
<section data-markdown="markdown/mloss/foundations.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/parallel-options.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-array.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
@@ -33,6 +36,9 @@
<section data-markdown="markdown/mloss/dask-core.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-dataframe.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
<section data-markdown="markdown/dask-svd.md"
data-separator="^\n\n\n"
data-vertical="^\n\n"></section>
Binary file removed docs/source/_static/presentations/images/frame.png
@@ -17,8 +17,8 @@ Afternoon sprint with Olivier Grisel

## Cross Validation

<a href="../images/dask-cross-validation.pdf">
<img src="../images/dask-cross-validation.png" alt="Cross validation dask"
<a href="images/dask-cross-validation.png">
<img src="images/dask-cross-validation.png" alt="Cross validation dask"
width="40%">
</a>
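
The slide links to a picture of the cross-validation task graph. As a rough sketch of the idea only (not the sprint code; scikit-learn's `KFold`/`LogisticRegression` and dask's threaded scheduler are assumed here), each fold's fit-and-score can be written as an independent task in a plain dask graph:

```python
# Rough sketch: one task per cross-validation fold, executed in parallel.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from dask.threaded import get

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def fit_and_score(train, test):
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    return model.score(X[test], y[test])

# Keys name the tasks; values are (function, *args) tuples.
dsk = {('score', i): (fit_and_score, train, test)
       for i, (train, test) in enumerate(KFold(n_splits=5).split(X))}

scores = get(dsk, list(dsk))     # threaded scheduler runs the five folds
print(np.mean(scores))
```

Because the folds share nothing, the scheduler is free to run them concurrently.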

15 changes: 9 additions & 6 deletions docs/source/_static/presentations/markdown/mloss/foundations.md
@@ -10,12 +10,12 @@
<img src="images/jenga.png" width="100%">


### Shared data structures enable interactions without coordination
### Shared data structures enable interactions


### Enables a vibrant ecosystem
### Shared data structures enable a vibrant ecosystem

### but exposes us to risk of obsolescence
### but expose us to risk of obsolescence


### Python, NumPy and Pandas are old(ish)
@@ -32,6 +32,7 @@
* Poor support for variable length strings
* Poor support for missing data
* Poor support for nested/semi-structured data
* Code bases are now hard to change


### The Numeric Python ecosystem inherits these limitations
@@ -73,10 +74,12 @@

## Why do we still use Python?

* Easy to set up and use by domain scientists
* Ubiquitous
* Easy to set up and use
* C/Fortran heritage
* Hundreds of PhD theses in software stack
* Strong academic and industry communities
* Domain expertise in the software stack (scikits)
* Strong academic and industry relationship
* Other communities (web, sysops, etc.)


### PyData rests on single-threaded foundations
32 changes: 26 additions & 6 deletions docs/source/_static/presentations/markdown/parallel-options.md
@@ -1,14 +1,17 @@
## My Job: Work towards a parallel Numeric Python stack


## Python's options for Parallelism

Explicit control. Fast but hard.

* Threads/Processes
* MPI
* Concurrent.futures
* Threads/Processes/MPI
* Concurrent.futures/...
* Joblib
* .
* .
* .
* IPython parallel
* Luigi
* PySpark
* Hadoop (mrjob)
@@ -21,13 +24,13 @@ Implicit control. Restrictive/slow but easy.

Explicit control. Fast but hard.

* Threads/Processes
* MPI
* Concurrent.futures
* Threads/Processes/MPI
* Concurrent.futures/...
* Joblib
* .
* . <-- I need this
* .
* IPython parallel
* Luigi
* PySpark
* Hadoop (mrjob)
@@ -36,8 +39,25 @@ Explicit control. Fast but hard.
Implicit control. Restrictive but easy.


### My Solution: Dynamic task scheduling
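
A minimal sketch of what dynamic task scheduling means in dask (the `inc`/`add` functions are only illustrative): a graph is an ordinary dict mapping keys to `(function, *args)` task tuples, and a scheduler walks it at runtime.

```python
# Minimal sketch of a dask graph: a plain dict of tasks.
from dask.threaded import get

def inc(x):
    return x + 1

def add(x, y):
    return x + y

dsk = {'a': 1,
       'b': (inc, 'a'),         # b = inc(a)
       'c': (inc, 'a'),         # c = inc(a), independent of b
       'd': (add, 'b', 'c')}    # d = add(b, c), runs once b and c finish

print(get(dsk, 'd'))            # 4; b and c may run in parallel threads
```

Nothing executes until `get` is called, so the scheduler can order and parallelize the work as it sees fit.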


## Scale

* Single four-core laptop (Gigabyte scale)
* Single thirty-core workstation (Terabyte scale)
* Distributed thousand-core cluster (Petabyte scale)


## Scale

* Single four-core laptop (Gigabyte scale)
* **Single thirty-core workstation (Terabyte scale)**
* Distributed thousand-core cluster (Petabyte scale)


## Outline

* Dask.array - parallel array library using dask (see the sketch below)
* Dask - internals
* Dask.dataframe/other - think about whether this is useful to you
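
For the first outline item, a small illustrative dask.array example (the shapes and chunk sizes here are arbitrary choices):

```python
# Illustrative only: dask.array mirrors a subset of the numpy API and
# splits one large array into many small numpy chunks scheduled in parallel.
import numpy as np
import dask.array as da

x = da.ones((10000, 10000), chunks=(1000, 1000))  # 100 blocks of 1000x1000
y = (x + x.T).mean(axis=0)                        # builds a task graph, no work yet

result = y.compute()                              # scheduler executes the graph
print(result.shape, np.allclose(result, 2.0))     # (10000,) True
```

The expression only builds a graph of per-chunk numpy calls; `compute()` hands that graph to the scheduler described above.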
