Commit 57f4177

documentation: Update docs for 0.3.0.0
1 parent fc2403a commit 57f4177

5 files changed

Lines changed: 107 additions & 30 deletions

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -1,5 +1,13 @@
 # Revision history for dataframe

+## 0.3.0.0
+* Now supports inner joins
+* Aggregations are now expressions allowing for more expressive aggregation logic.
+* In GHCI, you can now create type-safe bindings for each column and use those in expressions.
+* Added pandas and polars benchmarks.
+* Performance improvements to `groupBy`.
+* Various bug fixes.
+
 ## 0.2.0.2
 * Experimental Apache Parquet support.
 * Rename conversion columns (changed from toColumn and toColumn' to fromVector and fromList).
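
The changelog mentions inner joins without showing what they compute. As a rough illustration only, here is a minimal sketch in plain Haskell (base only). It does not use the dataframe API: `innerJoin` and the sample rows are hypothetical names chosen for this example, and the nested-loop strategy is quadratic, which is acceptable for a demonstration but says nothing about how the library implements joins.

```haskell
-- | Inner join of two keyed row lists (illustrative, not the library's code):
-- keep only keys present on both sides, pairing every matching left row
-- with every matching right row.
innerJoin :: Eq k => [(k, a)] -> [(k, b)] -> [(k, a, b)]
innerJoin left right =
  [ (k, a, b)
  | (k, a) <- left      -- walk the left rows in order
  , (k', b) <- right    -- scan the right rows for the same key
  , k == k'
  ]

main :: IO ()
main = do
  let regions = [("INLAND", 1), ("ISLAND", 2), ("NEAR BAY", 3)] :: [(String, Int)]
      labels  = [("ISLAND", "remote"), ("NEAR BAY", "coastal"), ("NEAR OCEAN", "coastal")]
  -- "INLAND" and "NEAR OCEAN" appear on one side only, so they are dropped.
  mapM_ print (innerJoin regions labels)
```

Rows whose key is missing from either side never appear in the result, which is the defining property of an inner join.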

README.md

Lines changed: 94 additions & 25 deletions
@@ -25,38 +25,83 @@ A fast, safe, and intuitive DataFrame library.

 ## Why use this DataFrame library?

-* Encourages concise, declarative, and composable data pipelines through a powerful abstraction model.
+* Encourages concise, declarative, and composable data pipelines.
 * Static typing makes code easier to reason about and catches many bugs at compile time—before your code ever runs.
 * Delivers high performance thanks to Haskell’s optimizing compiler and efficient memory model.
 * Designed for interactivity: expressive syntax, helpful error messages, and sensible defaults.

-## Installing
-
-### Jupyter notebook
-* We have a [hosted version of the Jupyter notebook](https://ihaskell-dataframe-crf7g5fvcpahdegz.westus2-01.azurewebsites.net/lab/) on azure sites.
-* Use the Dockerfile in the [ihaskell-dataframe](https://github.com/mchav/ihaskell-dataframe) to build and run an image with dataframe integration.
-* For a preview check out the [California Housing](https://ihaskell-dataframe-crf7g5fvcpahdegz.westus2-01.azurewebsites.net/lab/tree/California%20Housing.ipynb) notebook.
-
-### CLI
-* Install Haskell (ghc + cabal) via [ghcup](https://www.haskell.org/ghcup/install/) selecting all the default options.
-* Install snappy (needed for Parquet support) by running: `sudo apt install libsnappy-dev`.
-* To install dataframe run `cabal update && cabal install dataframe`
-* Open a Haskell repl with dataframe loaded by running `cabal repl --build-depends dataframe`.
-* Follow along any one of the tutorials below.
-
-
-## What is exploratory data analysis?
-We provide a primer [here](https://github.com/mchav/dataframe/blob/main/docs/exploratory_data_analysis_primer.md) and show how to do some common analyses.
+## Example usage

-## Coming from other dataframe libraries
-Familiar with another dataframe library? Get started:
-* [Coming from Pandas](https://github.com/mchav/dataframe/blob/main/docs/coming_from_pandas.md)
-* [Coming from Polars](https://github.com/mchav/dataframe/blob/main/docs/coming_from_polars.md)
-* [Coming from dplyr](https://github.com/mchav/dataframe/blob/main/docs/coming_from_dplyr.md)
+### Interactive environment
+```haskell
+ghci> import qualified DataFrame as D
+ghci> import DataFrame ((|>))
+ghci> df <- D.readCsv "./data/housing.csv"
+ghci> D.columnInfo df
+--------------------------------------------------------------------------------------------------------------------
+index | Column Name | # Non-null Values | # Null Values | # Partially parsed | # Unique Values | Type
+------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
+Int | Text | Int | Int | Int | Int | Text
+------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
+0 | total_bedrooms | 20433 | 207 | 0 | 1924 | Maybe Double
+1 | ocean_proximity | 20640 | 0 | 0 | 5 | Text
+2 | median_house_value | 20640 | 0 | 0 | 3842 | Double
+3 | median_income | 20640 | 0 | 0 | 12928 | Double
+4 | households | 20640 | 0 | 0 | 1815 | Double
+5 | population | 20640 | 0 | 0 | 3888 | Double
+6 | total_rooms | 20640 | 0 | 0 | 5926 | Double
+7 | housing_median_age | 20640 | 0 | 0 | 52 | Double
+8 | latitude | 20640 | 0 | 0 | 862 | Double
+9 | longitude | 20640 | 0 | 0 | 844 | Double
+ghci> :exposeColumns df
+ghci> import qualified DataFrame.Functions as F
+ghci> df |> D.groupBy ["ocean_proximity"] |> D.aggregate [(F.mean median_house_value) `F.as` "avg_house_value" ]
+--------------------------------------------
+index | ocean_proximity | avg_house_value
+------|-----------------|-------------------
+Int | Text | Double
+------|-----------------|-------------------
+0 | <1H OCEAN | 240084.28546409807
+1 | INLAND | 124805.39200122119
+2 | ISLAND | 380440.0
+3 | NEAR BAY | 259212.31179039303
+4 | NEAR OCEAN | 249433.97742663656
+ghci> df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 10
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household
+------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|-----------------|--------------------
+Int | Double | Double | Double | Double | Maybe Double | Double | Double | Double | Double | Text | Double
+------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|-----------------|--------------------
+0 | -122.23 | 37.88 | 41.0 | 880.0 | Just 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 6.984126984126984
+1 | -122.22 | 37.86 | 21.0 | 7099.0 | Just 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 6.238137082601054
+2 | -122.24 | 37.85 | 52.0 | 1467.0 | Just 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 8.288135593220339
+3 | -122.25 | 37.85 | 52.0 | 1274.0 | Just 235.0 | 558.0 | 219.0 | 5.6431000000000004 | 341300.0 | NEAR BAY | 5.8173515981735155
+4 | -122.25 | 37.85 | 52.0 | 1627.0 | Just 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 6.281853281853282
+5 | -122.25 | 37.85 | 52.0 | 919.0 | Just 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY | 4.761658031088083
+6 | -122.25 | 37.84 | 52.0 | 2535.0 | Just 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY | 4.9319066147859925
+7 | -122.25 | 37.84 | 52.0 | 3104.0 | Just 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY | 4.797527047913447
+8 | -122.26 | 37.84 | 42.0 | 2555.0 | Just 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY | 4.294117647058823
+9 | -122.25 | 37.84 | 52.0 | 3549.0 | Just 707.0 | 1551.0 | 714.0 | 3.6912000000000003 | 261100.0 | NEAR BAY | 4.970588235294118
+ghci> df |> D.derive "nonsense_feature" (latitude + ocean_proximity) |> D.take 10
+
+<interactive>:14:47: error: [GHC-83865]
+Couldn't match type Text with Double
+Expected: Expr Double
+Actual: Expr Text
+In the second argument of (+), namely ocean_proximity
+In the second argument of derive, namely
+(latitude + ocean_proximity)
+In the second argument of (|>), namely
+derive "nonsense_feature" (latitude + ocean_proximity)
+```
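
The type error at the end of the session above comes from each column binding carrying its element type (`Expr Double` versus `Expr Text`). A minimal sketch of that idea follows, assuming a deliberately simplified expression type: `Expr`, `Col`, `Lit`, `Lift1`, `Lift2`, and `eval` below are hypothetical and are not the library's actual definitions.

```haskell
-- | A toy typed expression (hypothetical, for illustration only). The type
-- parameter records the element type of the column it evaluates to.
data Expr a
  = Col String [a]                        -- a named column with its values
  | Lit a                                 -- a constant repeated for every row
  | Lift1 (a -> a) (Expr a)               -- element-wise unary operation
  | Lift2 (a -> a -> a) (Expr a) (Expr a) -- element-wise binary operation

-- Arithmetic on expressions is only available when the element type is numeric.
instance Num a => Num (Expr a) where
  (+) = Lift2 (+)
  (-) = Lift2 (-)
  (*) = Lift2 (*)
  abs = Lift1 abs
  signum = Lift1 signum
  fromInteger = Lit . fromInteger

instance Fractional a => Fractional (Expr a) where
  (/) = Lift2 (/)
  fromRational = Lit . fromRational

-- | Evaluate an expression to its column values. A bare literal yields an
-- infinite list; zipWith truncates it to the length of the other operand.
eval :: Expr a -> [a]
eval (Col _ xs) = xs
eval (Lit x) = repeat x
eval (Lift1 f e) = map f (eval e)
eval (Lift2 f l r) = zipWith f (eval l) (eval r)

main :: IO ()
main = do
  let totalRooms = Col "total_rooms" [880, 7099] :: Expr Double
      households = Col "households" [126, 1138] :: Expr Double
      oceanProximity = Col "ocean_proximity" ["NEAR BAY", "NEAR BAY"] :: Expr String
  print (eval (totalRooms / households))
  print (eval oceanProximity)
  -- Uncommenting the next line fails to compile: (+) needs both operands to
  -- have the same type, but oceanProximity is Expr String, not Expr Double.
  -- print (eval (totalRooms + oceanProximity))
```

Because the mismatch is rejected by the type checker, the mistake surfaces before any data is processed, which is the behaviour the GHCi session demonstrates.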

-## Example usage
+Key features in the example:
+* Intuitive, SQL-like API to get from data to insights.
+* Create type-safe references to columns in a dataframe using `:exposeColumns`
+* Type-safe column transformations for faster and safer exploration.
+* Fluid, chaining API that makes code easy to reason about.
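
To make the chaining style in the list above concrete, here is a small sketch in plain Haskell (base only) of a left-to-right pipeline that groups rows by key and takes a mean per group. It is an assumption-laden illustration: the `|>` defined here is a hypothetical stand-in for the library's operator, and `meanBy` is an invented helper, not part of the dataframe API.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- | Left-to-right application, so a pipeline reads top to bottom.
-- Hypothetical stand-in for the library's `|>`; base offers the same
-- idea as `(&)` in Data.Function.
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- | Mean of the second component for each distinct key, ordered by key.
meanBy :: Ord k => [(k, Double)] -> [(k, Double)]
meanBy rows =
  rows
    |> sortOn fst               -- bring equal keys next to each other
    |> groupBy ((==) `on` fst)  -- split into runs sharing a key
    |> concatMap summarize      -- one output row per group
  where
    summarize [] = []
    summarize grp@((k, _) : _) =
      [(k, sum (map snd grp) / fromIntegral (length grp))]

main :: IO ()
main = do
  let rows = [("INLAND", 100), ("ISLAND", 300), ("INLAND", 200)] :: [(String, Double)]
  print (meanBy rows)
```

Each stage takes the previous stage's result as its input, which mirrors how `df |> D.groupBy ... |> D.aggregate ...` reads in the session above.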

-### Code example
+### Standalone script example
 ```haskell
 -- Useful Haskell extensions.
 {-# LANGUAGE OverloadedStrings #-} -- Allow string literal to be interpreted as any other string type.
@@ -108,6 +153,30 @@ Full example in `./examples` folder using many of the constructs in the API.
 ### Visual example
 ![Screencast of usage in GHCI](./static/example.gif)

+## Installing
+
+### Jupyter notebook
+* We have a [hosted version of the Jupyter notebook](https://ihaskell-dataframe-crf7g5fvcpahdegz.westus2-01.azurewebsites.net/lab/) on azure sites.
+* Use the Dockerfile in the [ihaskell-dataframe](https://github.com/mchav/ihaskell-dataframe) to build and run an image with dataframe integration.
+* For a preview check out the [California Housing](https://ihaskell-dataframe-crf7g5fvcpahdegz.westus2-01.azurewebsites.net/lab/tree/California%20Housing.ipynb) notebook.
+
+### CLI
+* Install Haskell (ghc + cabal) via [ghcup](https://www.haskell.org/ghcup/install/) selecting all the default options.
+* Install snappy (needed for Parquet support) by running: `sudo apt install libsnappy-dev`.
+* To install dataframe run `cabal update && cabal install dataframe`
+* Open a Haskell repl with dataframe loaded by running `cabal repl --build-depends dataframe`.
+* Follow along any one of the tutorials below.
+
+
+## What is exploratory data analysis?
+We provide a primer [here](https://github.com/mchav/dataframe/blob/main/docs/exploratory_data_analysis_primer.md) and show how to do some common analyses.
+
+## Coming from other dataframe libraries
+Familiar with another dataframe library? Get started:
+* [Coming from Pandas](https://github.com/mchav/dataframe/blob/main/docs/coming_from_pandas.md)
+* [Coming from Polars](https://github.com/mchav/dataframe/blob/main/docs/coming_from_polars.md)
+* [Coming from dplyr](https://github.com/mchav/dataframe/blob/main/docs/coming_from_dplyr.md)
+
 ## Supported input formats
 * CSV
 * Apache Parquet (still buggy and experimental)

dataframe.cabal

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 cabal-version: 2.4
 name: dataframe
-version: 0.2.0.2
+version: 0.2.0.3

-synopsis: An intuitive, dynamically-typed DataFrame library.
+synopsis: A fast, safe, and intuitive DataFrame library.

-description: An intuitive, dynamically-typed DataFrame library for exploratory data analysis.
+description: A fast, safe, and intuitive DataFrame library for exploratory data analysis.

 bug-reports: https://github.com/mchav/dataframe/issues
 license: GPL-3.0-or-later

docs/index.rst

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 dataframe
 =========

-An intuitive, dynamically-typed DataFrame library for exploratory data analysis.
+A fast, safe, and intuitive DataFrame library for exploratory data analysis.

 .. toctree::
 :maxdepth: 2

flake.nix

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
-description = "An intuitive, dynamically-typed DataFrame library";
+description = "A fast, safe, and intuitive DataFrame library.";

 inputs = {
 nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
