Commit 34af2bc

docs: Update documentation for clarity and accuracy across multiple files
1 parent 2c76b5e commit 34af2bc

7 files changed

Lines changed: 778 additions & 73 deletions

README.md

Lines changed: 331 additions & 41 deletions
@@ -21,58 +21,348 @@
2121

2222
# DataFrame
2323

24-
A fast, safe, and intuitive DataFrame library.
24+
Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.
2525

26-
## Why use this DataFrame library?
26+
The library ships three API layers — all operating on the same underlying `DataFrame` type at runtime:
2727

28-
* Encourages concise, declarative, and composable data pipelines.
29-
* Lets you opt into your preferred level of type safety: keep it lightweight for rapid exploration or lock it down completely for robust production pipelines.
30-
* Delivers high performance thanks to Haskell’s optimizing compiler and efficient memory model.
31-
* Designed for interactivity: expressive syntax, helpful error messages, and sensible defaults.
32-
* Works seamlessly in both command-line and notebook environments—great for exploration and scripting alike.
28+
- **Untyped** (`import qualified DataFrame as D`): string-based column names, great for exploration and scripting.
29+
- **Typed** (`import qualified DataFrame.Typed as T`): phantom-type schema tracking with compile-time column validation.
30+
- **Monadic API**: write your transformation as a self-contained pipeline.
3331

34-
## Features
35-
- Type-safe column operations with compile-time guarantees
36-
- Familiar, approachable API designed to feel easy coming from other languages.
37-
- Interactive REPL for data exploration and plotting.
32+
## Why this library?
3833

39-
## Quick start
40-
Browse through some examples in [binder](https://mybinder.org/v2/gh/mchav/ihaskell-dataframe/HEAD).
34+
* Concise, declarative, composable data pipelines using the `|>` pipe operator.
35+
* Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
36+
* High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
37+
* Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.
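For readers who have not met a pipe operator before, the sketch below shows the left-to-right style that `|>` enables. It defines a local stand-in operator on plain lists; the assumption is that the library's `|>` behaves like reverse function application, which this README does not spell out.

```haskell
-- Local stand-in for a left-to-right pipe (assumed semantics: x |> f = f x).
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

main :: IO ()
main =
  -- Each stage receives the result of the previous one, read top to bottom.
  print ([1 .. 10 :: Int] |> filter even |> map (* 2) |> sum)
```

Because the operator is left-associative, the stages run in the order they are written, which is what lets the pipelines in this README read like a recipe.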
4138

4239
## Install
43-
See the [Quick Start](https://dataframe.readthedocs.io/en/latest/quick_start.html) guide for setup and installation instructions.
4440

45-
## Example
41+
```bash
42+
cabal update
43+
cabal install dataframe
44+
```
45+
46+
To use as a dependency in a project:
47+
48+
```
49+
build-depends: base >= 4, dataframe
50+
```
51+
52+
Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:
53+
54+
```bash
55+
dataframe
56+
```
57+
58+
## Quick Start
59+
60+
Save this as `Example.hs` and run with `cabal run Example.hs`:
61+
62+
```haskell
63+
#!/usr/bin/env cabal
64+
{- cabal:
65+
build-depends: base >= 4, dataframe
66+
-}
67+
{-# LANGUAGE OverloadedStrings #-}
68+
{-# LANGUAGE TypeApplications #-}
69+
70+
import qualified DataFrame as D
71+
import qualified DataFrame.Functions as F
72+
import DataFrame.Operators
73+
74+
main :: IO ()
75+
main = do
76+
let sales = D.fromNamedColumns
77+
[ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
78+
, ("amount", D.fromList [100, 120, 50, 20, 40, 30 :: Int])
79+
]
80+
81+
-- Group by product and compute totals
82+
print $ sales
83+
|> D.groupBy ["product"]
84+
|> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
85+
, F.count (F.col @Int "amount") `as` "orders"
86+
]
87+
```
88+
89+
```
90+
-----------------------
91+
product | total | orders
92+
--------|-------|-------
93+
Int | Int | Int
94+
--------|-------|-------
95+
1 | 220 | 2
96+
2 | 70 | 2
97+
3 | 70 | 2
98+
```
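To make the group-by semantics concrete, the same totals can be reproduced on plain lists. This is only a sketch using `Data.Map.Strict` from the `containers` package; `totalsByProduct` is a hypothetical helper and not part of the dataframe API.

```haskell
import qualified Data.Map.Strict as M

-- (product, amount) rows, mirroring the `sales` frame above.
sales :: [(Int, Int)]
sales = zip [1, 1, 2, 2, 3, 3] [100, 120, 50, 20, 40, 30]

-- Hypothetical helper: fold rows into (total, orders) per product.
totalsByProduct :: [(Int, Int)] -> M.Map Int (Int, Int)
totalsByProduct rows =
  M.fromListWith combine [(p, (amt, 1)) | (p, amt) <- rows]
  where
    -- Merge two partial results for the same key.
    combine (t1, n1) (t2, n2) = (t1 + t2, n1 + n2)

main :: IO ()
main = mapM_ print (M.toList (totalsByProduct sales))
```

Each key ends up with the sum of its amounts and the number of rows that contributed, which corresponds to the `total` and `orders` columns in the table above.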
99+
100+
Reading from files works the same way:
101+
102+
```haskell
103+
df <- D.readCsv "data.csv"
104+
df <- D.readParquet "data.parquet"
105+
106+
-- Hugging Face datasets
107+
df <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"
108+
```
109+
110+
## Interactive REPL
111+
112+
The `dataframe` REPL comes with all imports pre-loaded. Here's a typical exploration session:
46113

47114
```haskell
48-
dataframe> df = D.fromNamedColumns [("product_id", D.fromList [1,1,2,2,3,3]), ("sales", D.fromList [100,120,50,20,40,30])]
49-
dataframe> df
50-
------------------
51-
product_id | sales
52-
-----------|------
53-
Int | Int
54-
-----------|------
55-
1 | 100
56-
1 | 120
57-
2 | 50
58-
2 | 20
59-
3 | 40
60-
3 | 30
115+
dataframe> df <- D.readCsv "./data/housing.csv"
116+
dataframe> D.dimensions df
117+
(20640, 10)
118+
119+
dataframe> D.describeColumns df
120+
------------------------------------------------------------------------
121+
Column Name | # Non-null Values | # Null Values | Type
122+
--------------------|--------------------|----------------|-------------
123+
Text | Int | Int | Text
124+
--------------------|--------------------|----------------|-------------
125+
total_bedrooms | 20433 | 207 | Maybe Double
126+
ocean_proximity | 20640 | 0 | Text
127+
median_house_value | 20640 | 0 | Double
128+
median_income | 20640 | 0 | Double
129+
households | 20640 | 0 | Double
130+
population | 20640 | 0 | Double
131+
total_rooms | 20640 | 0 | Double
132+
housing_median_age | 20640 | 0 | Double
133+
latitude | 20640 | 0 | Double
134+
longitude | 20640 | 0 | Double
135+
```
61136

137+
The `:declareColumns` macro generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing `F.col @Double "median_income"` every time:
138+
139+
```haskell
62140
dataframe> :declareColumns df
63-
"product_id :: Expr Int"
64-
"sales :: Expr Int"
65-
dataframe> df |> D.groupBy [F.name product_id] |> D.aggregate [F.sum sales `as` "total_sales"]
66-
------------------------
67-
product_id | total_sales
68-
-----------|------------
69-
Int | Int
70-
-----------|------------
71-
1 | 220
72-
2 | 70
73-
3 | 70
141+
"longitude :: Expr Double"
142+
"latitude :: Expr Double"
143+
"housing_median_age :: Expr Double"
144+
"total_rooms :: Expr Double"
145+
"total_bedrooms :: Expr (Maybe Double)"
146+
"population :: Expr Double"
147+
"households :: Expr Double"
148+
"median_income :: Expr Double"
149+
"median_house_value :: Expr Double"
150+
"ocean_proximity :: Expr Text"
151+
152+
dataframe> df |> D.groupBy ["ocean_proximity"]
153+
|> D.aggregate [F.mean median_house_value `as` "avg_value"]
154+
-------------------------------------
155+
ocean_proximity | avg_value
156+
-----------------|-------------------
157+
Text | Double
158+
-----------------|-------------------
159+
<1H OCEAN | 240084.28546409807
160+
INLAND | 124805.39200122119
161+
ISLAND | 380440.0
162+
NEAR BAY | 259212.31179039303
163+
NEAR OCEAN | 249433.97742663656
164+
```
165+
166+
Create new columns from existing ones:
167+
168+
```haskell
169+
dataframe> df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3
170+
-----------------------------------------------------------------------------------------------------------------
171+
longitude | latitude | housing_median_age | total_rooms | ... | ocean_proximity | rooms_per_household
172+
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
173+
Double | Double | Double | Double | ... | Text | Double
174+
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
175+
-122.23 | 37.88 | 41.0 | 880.0 | ... | NEAR BAY | 6.984126984126984
176+
-122.22 | 37.86 | 21.0 | 7099.0 | ... | NEAR BAY | 6.238137082601054
177+
-122.24 | 37.85 | 52.0 | 1467.0 | ... | NEAR BAY | 8.288135593220339
178+
```
179+
180+
Type mismatches are caught as compile errors — adding a `Double` column to a `Text` column won't silently produce garbage:
181+
182+
```haskell
183+
dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)
184+
185+
<interactive>:14:47: error: [GHC-83865]
186+
Couldn't match type 'Text' with 'Double'
187+
Expected: Expr Double
188+
Actual: Expr Text
189+
In the second argument of '(+)', namely 'ocean_proximity'
190+
In the second argument of 'derive', namely
191+
'(latitude + ocean_proximity)'
192+
```
193+
194+
## Template Haskell
195+
196+
For scripts and projects, Template Haskell can generate column bindings at compile time.
197+
198+
### Generate column references from a CSV
199+
200+
`declareColumnsFromCsvFile` reads your CSV at compile time and generates typed `Expr` bindings for every column:
201+
202+
```haskell
203+
{-# LANGUAGE TemplateHaskell #-}
204+
{-# LANGUAGE OverloadedStrings #-}
205+
206+
import qualified DataFrame as D
207+
import qualified DataFrame.Functions as F
208+
import DataFrame.Operators
209+
210+
-- Reads housing.csv at compile time and generates:
211+
-- latitude :: Expr Double
212+
-- total_rooms :: Expr Double
213+
-- ocean_proximity :: Expr Text
214+
-- ... one binding per column
215+
$(F.declareColumnsFromCsvFile "./data/housing.csv")
216+
217+
main :: IO ()
218+
main = do
219+
df <- D.readCsv "./data/housing.csv"
220+
print $ df
221+
|> D.derive "rooms_per_household" (total_rooms / households)
222+
|> D.filterWhere (median_income .>. 5)
223+
|> D.groupBy ["ocean_proximity"]
224+
|> D.aggregate [F.mean median_house_value `as` "avg_value"]
225+
```
226+
227+
Compare this with the manual version, which requires spelling out every column name and type:
228+
229+
```haskell
230+
-- Without TH — every column needs its name and type spelled out
231+
df |> D.derive "rooms_per_household"
232+
(F.col @Double "total_rooms" / F.col @Double "households")
233+
|> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)
234+
```
235+
236+
### Generate a schema type from a CSV
237+
238+
`deriveSchemaFromCsvFile` generates a type synonym for use with the typed API, so you don't have to write out every column name and type by hand:
239+
240+
```haskell
241+
{-# LANGUAGE TemplateHaskell #-}
242+
{-# LANGUAGE DataKinds #-}
243+
244+
import qualified DataFrame.Typed as T
245+
246+
-- Generates:
247+
-- type HousingSchema = '[ T.Column "longitude" Double
248+
-- , T.Column "latitude" Double
249+
-- , T.Column "total_rooms" Double
250+
-- , ...
251+
-- ]
252+
$(T.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")
253+
```
254+
255+
## Typed API
256+
257+
When you want compile-time guarantees that column names exist and types match, wrap your `DataFrame` in a `TypedDataFrame`:
258+
259+
```haskell
260+
{-# LANGUAGE DataKinds #-}
261+
{-# LANGUAGE TypeApplications #-}
262+
{-# LANGUAGE OverloadedStrings #-}
263+
264+
import qualified DataFrame as D
265+
import qualified DataFrame.Typed as T
266+
import Data.Text (Text)
267+
import DataFrame.Operators
268+
269+
type EmployeeSchema =
270+
'[ T.Column "name" Text
271+
, T.Column "department" Text
272+
, T.Column "salary" Double
273+
]
274+
275+
main :: IO ()
276+
main = do
277+
df <- D.readCsv "employees.csv"
278+
case T.freeze @EmployeeSchema df of
279+
Nothing -> putStrLn "Schema mismatch!"
280+
Just tdf -> do
281+
let result = tdf
282+
|> T.derive @"bonus" (T.col @"salary" * T.lit 0.1)
283+
|> T.filterWhere (T.col @"salary" .>. T.lit 50000)
284+
|> T.select @'["name", "bonus"]
285+
print (T.thaw result)
286+
```
287+
288+
`T.freeze` validates the runtime `DataFrame` against your schema once at the boundary. After that, every column access is checked at compile time:
289+
290+
```haskell
291+
-- Typo in column name → compile error
292+
tdf |> T.filterWhere (T.col @"slary" .>. T.lit 50000)
293+
-- error: Column "slary" not found in schema
294+
295+
-- Wrong type → compile error
296+
tdf |> T.filterWhere (T.col @"name" .>. T.lit 50000)
297+
-- error: Couldn't match type 'Text' with 'Double'
74298
```
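The "validate once at the boundary" idea can be illustrated without the library. The sketch below is an assumption-laden toy: `Table`, `Typed`, `KnownCols`, and this `freeze`/`thaw` pair are hypothetical names that track column names only, and they are not how `DataFrame.Typed` is implemented.

```haskell
{-# LANGUAGE AllowAmbiguousTypes #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeOperators #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

-- Toy untyped table: only the column names are modelled.
newtype Table = Table [String] deriving (Eq, Show)

-- The schema lives purely in the type; the runtime value is unchanged.
newtype Typed (cols :: [Symbol]) = Typed Table

-- Reflect a type-level list of names back to runtime strings.
class KnownCols (cols :: [Symbol]) where
  colNames :: [String]

instance KnownCols '[] where
  colNames = []

instance (KnownSymbol c, KnownCols cs) => KnownCols (c ': cs) where
  colNames = symbolVal (Proxy @c) : colNames @cs

-- The only runtime check: every promised column must be present.
freeze :: forall cols. KnownCols cols => Table -> Maybe (Typed cols)
freeze t@(Table names)
  | all (`elem` names) (colNames @cols) = Just (Typed t)
  | otherwise = Nothing

-- Dropping the schema is free; it was never stored at runtime.
thaw :: Typed cols -> Table
thaw (Typed t) = t

main :: IO ()
main = do
  let t = Table ["name", "department", "salary"]
  print (fmap thaw (freeze @'["name", "salary"] t)) -- succeeds
  print (fmap thaw (freeze @'["slary"] t))          -- Nothing
```

Once a `Typed cols` value exists, functions that take it can rely on the columns being present without re-checking, which is the property the typed API builds on.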
75299

300+
`filterAllJust` goes further — it strips `Maybe` from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable:
301+
302+
```haskell
303+
-- Before: TypedDataFrame '[Column "score" (Maybe Double), Column "name" Text]
304+
let cleaned = T.filterAllJust tdf
305+
-- After: TypedDataFrame '[Column "score" Double, Column "name" Text]
306+
307+
cleaned |> T.derive @"scaled" (T.col @"score" * T.lit 100)
308+
```
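Stripping `Maybe` from a schema is a type-level computation. A minimal sketch of one way to express it with closed type families follows; `Column`, `Strip`, and `StripAll` here are hypothetical stand-ins defined locally, not the library's own definitions.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}

import Data.Kind (Type)
import Data.Type.Equality ((:~:) (Refl))
import GHC.TypeLits (Symbol)

-- Hypothetical stand-in for a schema entry: a name paired with a type.
data Column (name :: Symbol) (a :: Type)

-- Remove one layer of Maybe from an element type, if present.
type family Strip (a :: Type) :: Type where
  Strip (Maybe a) = a
  Strip a = a

-- Apply Strip to every column in a schema.
type family StripAll (cols :: [Type]) :: [Type] where
  StripAll '[] = '[]
  StripAll (Column n a ': rest) = Column n (Strip a) ': StripAll rest

type Before = '[Column "score" (Maybe Double), Column "name" String]
type After = '[Column "score" Double, Column "name" String]

-- This definition type-checks only if the two schemas are equal.
proof :: StripAll Before :~: After
proof = Refl

main :: IO ()
main = proof `seq` putStrLn "StripAll Before ~ After"
```

The compiler evaluates `StripAll Before` while type-checking `proof`, so a mistake in the expected schema shows up as a compile error rather than a runtime failure.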
309+
310+
## Features
311+
312+
**I/O**: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (`hf://` URIs). Column projection and predicate pushdown for Parquet reads.
313+
314+
**Operations**: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.
315+
316+
**Expressions**: typed column references (`F.col @Double "x"`), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (`.==`, `.&&`), string matching (`like`, `regex`), casting, and user-defined functions via `lift`/`lift2`.
317+
318+
**Statistics**: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.
319+
320+
**Plotting**: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.
321+
322+
**Lazy engine**: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.
323+
324+
**Interop**: Arrow C Data Interface for zero-copy round-trips with Python and Polars.
325+
326+
**ML**: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.
327+
328+
**Notebooks**: IHaskell integration with [pre-built Binder examples](https://mybinder.org/v2/gh/mchav/ihaskell-dataframe/HEAD).
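The Expressions entry above mentions nullable-aware three-valued logic. As a rough illustration of what that usually means, here is Kleene-style logic over `Maybe Bool`, where `Nothing` stands for an unknown (null) value. The names `and3` and `eq3` are hypothetical, and whether the library's `.&&` and `.==` follow exactly these rules is an assumption.

```haskell
-- Kleene three-valued AND: a definite False wins even against a null.
and3 :: Maybe Bool -> Maybe Bool -> Maybe Bool
and3 (Just False) _ = Just False
and3 _ (Just False) = Just False
and3 (Just True) (Just True) = Just True
and3 _ _ = Nothing

-- Null-aware equality: comparing against a null yields unknown.
eq3 :: Eq a => Maybe a -> Maybe a -> Maybe Bool
eq3 (Just x) (Just y) = Just (x == y)
eq3 _ _ = Nothing

main :: IO ()
main = do
  print (and3 (Just False) Nothing)     -- Just False
  print (and3 (Just True) Nothing)      -- Nothing
  print (eq3 (Just (1 :: Int)) Nothing) -- Nothing
```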
329+
330+
## Lazy Queries
331+
332+
For files too large to fit in memory, `DataFrame.Lazy` provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and `runDataFrame` runs it through an optimizer before streaming results batch-by-batch:
333+
334+
```haskell
335+
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

import qualified DataFrame.Lazy as L
336+
import qualified DataFrame.Functions as F
337+
import DataFrame.Operators
338+
import DataFrame.Internal.Schema (Schema, schemaType)
339+
import Data.Text (Text)
340+
341+
mySchema :: Schema
342+
mySchema = [ ("name", schemaType @Text)
343+
, ("weight", schemaType @Double)
344+
, ("height", schemaType @Double)
345+
]
346+
347+
main :: IO ()
348+
main = do
349+
result <- L.runDataFrame $
350+
L.scanCsv mySchema "large_file.csv"
351+
|> L.filter (F.col @Double "height" .>. F.lit 1.7)
352+
|> L.select ["name", "weight", "height"]
353+
|> L.derive "bmi" (F.col @Double "weight"
354+
/ (F.col @Double "height" * F.col @Double "height"))
355+
|> L.take 1000
356+
print result
357+
```
358+
359+
The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.
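The rewrites described above can be pictured on a toy query plan. The following is a sketch only: the `Plan` and `Pred` types, the `optimize` function, and the extra `age` column are hypothetical and do not reflect the internals of `DataFrame.Lazy`.

```haskell
-- A predicate that names the column it reads.
data Pred = Gt String Double
  deriving (Eq, Show)

-- Toy logical plan; Scan carries the columns to read and pushed-down filters.
data Plan
  = Scan [String] [Pred]
  | Filter Pred Plan
  | Select [String] Plan
  | Take Int Plan
  deriving (Eq, Show)

-- Rule-based rewrite: push filters and projections into the scan.
optimize :: Plan -> Plan
optimize (Filter p child) = case optimize child of
  Scan cols preds -> Scan cols (preds ++ [p])          -- predicate pushdown
  child' -> Filter p child'
optimize (Select cs child) = case optimize child of
  Scan cols preds -> Scan (filter (`elem` cs) cols) preds -- drop unused columns
  child' -> Select cs child'
optimize (Take n child) = Take n (optimize child)
optimize plan@(Scan _ _) = plan

main :: IO ()
main =
  print . optimize $
    Take 1000 $
      Select ["name", "weight", "height"] $
        Filter (Gt "height" 1.7) $
          Scan ["name", "weight", "height", "age"] []
```

After optimization the plan is a single scan that reads three columns and applies the filter while reading, wrapped in a `Take` that a pull-based executor can use to stop requesting batches early.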
360+
76361
## Documentation
77-
* 📚 User guide: https://dataframe.readthedocs.io/en/latest/
78-
* 📖 API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html
362+
363+
* User guide: https://dataframe.readthedocs.io/en/latest/
364+
* API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html
365+
* [Coming from pandas, Polars, dplyr, or Frames?](docs/coming_from_other_implementations.md)
366+
* [Cookbook (SQL-style patterns)](docs/cookbook.md)
367+
* [Tutorials](docs/tutorial.md)
368+
* Discord: https://discord.gg/8u8SCWfrNC
