|
21 | 21 |
|
22 | 22 | # DataFrame |
23 | 23 |
|
24 | | -A fast, safe, and intuitive DataFrame library. |
| 24 | +Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety. |
25 | 25 |
|
26 | | -## Why use this DataFrame library? |
| 26 | +The library ships three API layers — all operating on the same underlying `DataFrame` type at runtime: |
27 | 27 |
|
28 | | -* Encourages concise, declarative, and composable data pipelines. |
29 | | -* Lets you opt into your preferred level of type safety: keep it lightweight for rapid exploration or lock it down completely for robust production pipelines. |
30 | | -* Delivers high performance thanks to Haskell’s optimizing compiler and efficient memory model. |
31 | | -* Designed for interactivity: expressive syntax, helpful error messages, and sensible defaults. |
32 | | -* Works seamlessly in both command-line and notebook environments—great for exploration and scripting alike. |
| 28 | +- **Untyped** (`import qualified DataFrame as D`): string-based column names, great for exploration and scripting.
| 29 | +- **Typed** (`import qualified DataFrame.Typed as T`): phantom-type schema tracking with compile-time column validation.
| 30 | +- **Monadic API**: write your transformation as a self-contained pipeline.
33 | 31 |
|
34 | | -## Features |
35 | | -- Type-safe column operations with compile-time guarantees |
36 | | -- Familiar, approachable API designed to feel easy coming from other languages. |
37 | | -- Interactive REPL for data exploration and plotting. |
| 32 | +## Why this library? |
38 | 33 |
|
39 | | -## Quick start |
40 | | -Browse through some examples in [binder](https://mybinder.org/v2/gh/mchav/ihaskell-dataframe/HEAD). |
| 34 | +* Concise, declarative, composable data pipelines using the `|>` pipe operator. |
| 35 | +* Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines. |
| 36 | +* High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability. |
| 37 | +* Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages. |
41 | 38 |
|
42 | 39 | ## Install |
43 | | -See the [Quick Start](https://dataframe.readthedocs.io/en/latest/quick_start.html) guide for setup and installation instructions. |
44 | 40 |
|
45 | | -## Example |
| 41 | +```bash |
| 42 | +cabal update |
| 43 | +cabal install dataframe |
| 44 | +``` |
| 45 | + |
| 46 | +To use as a dependency in a project: |
| 47 | + |
| 48 | +``` |
| 49 | +build-depends: base >= 4, dataframe |
| 50 | +``` |
| 51 | + |
| 52 | +Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing: |
| 53 | + |
| 54 | +```bash |
| 55 | +dataframe |
| 56 | +``` |
| 57 | + |
| 58 | +## Quick Start |
| 59 | + |
| 60 | +Save this as `Example.hs` and run with `cabal run Example.hs`: |
| 61 | + |
| 62 | +```haskell |
| 63 | +#!/usr/bin/env cabal |
| 64 | +{- cabal: |
| 65 | + build-depends: base >= 4, dataframe |
| 66 | +-} |
| 67 | +{-# LANGUAGE OverloadedStrings #-} |
| 68 | +{-# LANGUAGE TypeApplications #-} |
| 69 | + |
| 70 | +import qualified DataFrame as D |
| 71 | +import qualified DataFrame.Functions as F |
| 72 | +import DataFrame.Operators |
| 73 | + |
| 74 | +main :: IO () |
| 75 | +main = do |
| 76 | + let sales = D.fromNamedColumns |
| 77 | + [ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int]) |
| 78 | + , ("amount", D.fromList [100, 120, 50, 20, 40, 30 :: Int]) |
| 79 | + ] |
| 80 | + |
| 81 | + -- Group by product and compute totals |
| 82 | + print $ sales |
| 83 | + |> D.groupBy ["product"] |
| 84 | + |> D.aggregate [ F.sum (F.col @Int "amount") `as` "total" |
| 85 | + , F.count (F.col @Int "amount") `as` "orders" |
| 86 | + ] |
| 87 | +``` |
| 88 | + |
| 89 | +``` |
| 90 | +----------------------- |
| 91 | +product | total | orders |
| 92 | +--------|-------|------- |
| 93 | + Int | Int | Int |
| 94 | +--------|-------|------- |
| 95 | +1 | 220 | 2 |
| 96 | +2 | 70 | 2 |
| 97 | +3 | 70 | 2 |
| 98 | +``` |
| 99 | + |
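To make the grouping semantics concrete, here is a plain-Haskell sketch that reproduces the same totals with `Data.Map` from the `containers` package. It is independent of the library and says nothing about how `D.groupBy` or `D.aggregate` are implemented; the names `salesRows` and `totalsAndCounts` are made up for this illustration.

```haskell
import qualified Data.Map.Strict as Map

-- (product, amount) rows, copied from the example above.
salesRows :: [(Int, Int)]
salesRows = [(1, 100), (1, 120), (2, 50), (2, 20), (3, 40), (3, 30)]

-- Group by product, keeping a running (total, orders) pair per key.
totalsAndCounts :: [(Int, Int)] -> Map.Map Int (Int, Int)
totalsAndCounts rows =
  Map.fromListWith combine [(pid, (amount, 1)) | (pid, amount) <- rows]
  where
    combine (t1, c1) (t2, c2) = (t1 + t2, c1 + c2)
```

`Map.toList (totalsAndCounts salesRows)` evaluates to `[(1,(220,2)),(2,(70,2)),(3,(70,2))]`, matching the table above. The `groupBy`/`aggregate` pipeline does this bookkeeping for you and returns a `DataFrame`.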
| 100 | +Reading from files works the same way: |
| 101 | + |
| 102 | +```haskell |
| 103 | +df <- D.readCsv "data.csv" |
| 104 | +df <- D.readParquet "data.parquet" |
| 105 | + |
| 106 | +-- Hugging Face datasets |
| 107 | +df <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet" |
| 108 | +``` |
| 109 | + |
| 110 | +## Interactive REPL |
| 111 | + |
| 112 | +The `dataframe` REPL comes with all imports pre-loaded. Here's a typical exploration session: |
46 | 113 |
|
47 | 114 | ```haskell |
48 | | -dataframe> df = D.fromNamedColumns [("product_id", D.fromList [1,1,2,2,3,3]), ("sales", D.fromList [100,120,50,20,40,30])] |
49 | | -dataframe> df |
50 | | ------------------- |
51 | | -product_id | sales |
52 | | ------------|------ |
53 | | - Int | Int |
54 | | ------------|------ |
55 | | -1 | 100 |
56 | | -1 | 120 |
57 | | -2 | 50 |
58 | | -2 | 20 |
59 | | -3 | 40 |
60 | | -3 | 30 |
| 115 | +dataframe> df <- D.readCsv "./data/housing.csv" |
| 116 | +dataframe> D.dimensions df |
| 117 | +(20640, 10) |
| 118 | + |
| 119 | +dataframe> D.describeColumns df |
| 120 | +------------------------------------------------------------------------ |
| 121 | + Column Name | ## Non-null Values | ## Null Values | Type |
| 122 | +--------------------|--------------------|----------------|------------- |
| 123 | + Text | Int | Int | Text |
| 124 | +--------------------|--------------------|----------------|------------- |
| 125 | + total_bedrooms | 20433 | 207 | Maybe Double |
| 126 | + ocean_proximity | 20640 | 0 | Text |
| 127 | + median_house_value | 20640 | 0 | Double |
| 128 | + median_income | 20640 | 0 | Double |
| 129 | + households | 20640 | 0 | Double |
| 130 | + population | 20640 | 0 | Double |
| 131 | + total_rooms | 20640 | 0 | Double |
| 132 | + housing_median_age | 20640 | 0 | Double |
| 133 | + latitude | 20640 | 0 | Double |
| 134 | + longitude | 20640 | 0 | Double |
| 135 | +``` |
61 | 136 |
|
| 137 | +The `:declareColumns` macro generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing `F.col @Double "median_income"` every time: |
| 138 | + |
| 139 | +```haskell |
62 | 140 | dataframe> :declareColumns df |
63 | | -"product_id :: Expr Int" |
64 | | -"sales :: Expr Int" |
65 | | -dataframe> df |> D.groupBy [F.name product_id] |> D.aggregate [F.sum sales `as` "total_sales"] |
66 | | ------------------------- |
67 | | -product_id | total_sales |
68 | | ------------|------------ |
69 | | - Int | Int |
70 | | ------------|------------ |
71 | | -1 | 220 |
72 | | -2 | 70 |
73 | | -3 | 70 |
| 141 | +"longitude :: Expr Double" |
| 142 | +"latitude :: Expr Double" |
| 143 | +"housing_median_age :: Expr Double" |
| 144 | +"total_rooms :: Expr Double" |
| 145 | +"total_bedrooms :: Expr (Maybe Double)" |
| 146 | +"population :: Expr Double" |
| 147 | +"households :: Expr Double" |
| 148 | +"median_income :: Expr Double" |
| 149 | +"median_house_value :: Expr Double" |
| 150 | +"ocean_proximity :: Expr Text" |
| 151 | + |
| 152 | +dataframe> df |> D.groupBy ["ocean_proximity"] |
| 153 | + |> D.aggregate [F.mean median_house_value `as` "avg_value"] |
| 154 | +------------------------------------- |
| 155 | + ocean_proximity | avg_value |
| 156 | +-----------------|------------------- |
| 157 | + Text | Double |
| 158 | +-----------------|------------------- |
| 159 | + <1H OCEAN | 240084.28546409807 |
| 160 | + INLAND | 124805.39200122119 |
| 161 | + ISLAND | 380440.0 |
| 162 | + NEAR BAY | 259212.31179039303 |
| 163 | + NEAR OCEAN | 249433.97742663656 |
| 164 | +``` |
| 165 | + |
| 166 | +Create new columns from existing ones: |
| 167 | + |
| 168 | +```haskell |
| 169 | +dataframe> df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3 |
| 170 | +----------------------------------------------------------------------------------------------------------------- |
| 171 | + longitude | latitude | housing_median_age | total_rooms | ... | ocean_proximity | rooms_per_household |
| 172 | +-----------|----------|--------------------|-------------|-----|-----------------|-------------------- |
| 173 | + Double | Double | Double | Double | ... | Text | Double |
| 174 | +-----------|----------|--------------------|-------------|-----|-----------------|-------------------- |
| 175 | + -122.23 | 37.88 | 41.0 | 880.0 | ... | NEAR BAY | 6.984126984126984 |
| 176 | + -122.22 | 37.86 | 21.0 | 7099.0 | ... | NEAR BAY | 6.238137082601054 |
| 177 | + -122.24 | 37.85 | 52.0 | 1467.0 | ... | NEAR BAY | 8.288135593220339 |
| 178 | +``` |
| 179 | + |
| 180 | +Type mismatches are caught as compile errors — adding a `Double` column to a `Text` column won't silently produce garbage: |
| 181 | + |
| 182 | +```haskell |
| 183 | +dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity) |
| 184 | + |
| 185 | +<interactive>:14:47: error: [GHC-83865] |
| 186 | + • Couldn't match type 'Text' with 'Double' |
| 187 | + Expected: Expr Double |
| 188 | + Actual: Expr Text |
| 189 | + • In the second argument of '(+)', namely 'ocean_proximity' |
| 190 | + In the second argument of 'derive', namely |
| 191 | + '(latitude + ocean_proximity)' |
| 192 | +``` |
| 193 | + |
| 194 | +## Template Haskell |
| 195 | + |
| 196 | +For scripts and projects, Template Haskell can generate column bindings at compile time. |
| 197 | + |
| 198 | +### Generate column references from a CSV |
| 199 | + |
| 200 | +`declareColumnsFromCsvFile` reads your CSV at compile time and generates typed `Expr` bindings for every column: |
| 201 | + |
| 202 | +```haskell |
| 203 | +{-# LANGUAGE TemplateHaskell #-} |
| 204 | +{-# LANGUAGE OverloadedStrings #-} |
| 205 | + |
| 206 | +import qualified DataFrame as D |
| 207 | +import qualified DataFrame.Functions as F |
| 208 | +import DataFrame.Operators |
| 209 | + |
| 210 | +-- Reads housing.csv at compile time and generates: |
| 211 | +-- latitude :: Expr Double |
| 212 | +-- total_rooms :: Expr Double |
| 213 | +-- ocean_proximity :: Expr Text |
| 214 | +-- ... one binding per column |
| 215 | +$(F.declareColumnsFromCsvFile "./data/housing.csv") |
| 216 | + |
| 217 | +main :: IO () |
| 218 | +main = do |
| 219 | + df <- D.readCsv "./data/housing.csv" |
| 220 | + print $ df |
| 221 | + |> D.derive "rooms_per_household" (total_rooms / households) |
| 222 | + |> D.filterWhere (median_income .>. 5) |
| 223 | + |> D.groupBy ["ocean_proximity"] |
| 224 | + |> D.aggregate [F.mean median_house_value `as` "avg_value"] |
| 225 | +``` |
| 226 | + |
| 227 | +Compare this to the manual version, which requires spelling out every column name and type:
| 228 | + |
| 229 | +```haskell |
| 230 | +-- Without TH — every column needs its name and type spelled out |
| 231 | +df |> D.derive "rooms_per_household" |
| 232 | + (F.col @Double "total_rooms" / F.col @Double "households") |
| 233 | + |> D.filterWhere (F.col @Double "median_income" .>. F.lit 5) |
| 234 | +``` |
| 235 | + |
| 236 | +### Generate a schema type from a CSV |
| 237 | + |
| 238 | +`deriveSchemaFromCsvFile` generates a type synonym for use with the typed API, so you don't have to write out every column name and type by hand:
| 239 | + |
| 240 | +```haskell |
| 241 | +{-# LANGUAGE TemplateHaskell #-} |
| 242 | +{-# LANGUAGE DataKinds #-} |
| 243 | + |
| 244 | +import qualified DataFrame.Typed as T |
| 245 | + |
| 246 | +-- Generates: |
| 247 | +-- type HousingSchema = '[ T.Column "longitude" Double |
| 248 | +-- , T.Column "latitude" Double |
| 249 | +-- , T.Column "total_rooms" Double |
| 250 | +-- , ... |
| 251 | +-- ] |
| 252 | +$(T.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv") |
| 253 | +``` |
| 254 | + |
| 255 | +## Typed API |
| 256 | + |
| 257 | +When you want compile-time guarantees that column names exist and types match, wrap your `DataFrame` in a `TypedDataFrame`: |
| 258 | + |
| 259 | +```haskell |
| 260 | +{-# LANGUAGE DataKinds #-} |
| 261 | +{-# LANGUAGE TypeApplications #-} |
| 262 | +{-# LANGUAGE OverloadedStrings #-} |
| 263 | + |
| 264 | +import qualified DataFrame as D |
| 265 | +import qualified DataFrame.Typed as T |
| 266 | +import Data.Text (Text) |
| 267 | +import DataFrame.Operators |
| 268 | + |
| 269 | +type EmployeeSchema = |
| 270 | + '[ T.Column "name" Text |
| 271 | + , T.Column "department" Text |
| 272 | + , T.Column "salary" Double |
| 273 | + ] |
| 274 | + |
| 275 | +main :: IO () |
| 276 | +main = do |
| 277 | + df <- D.readCsv "employees.csv" |
| 278 | + case T.freeze @EmployeeSchema df of |
| 279 | + Nothing -> putStrLn "Schema mismatch!" |
| 280 | + Just tdf -> do |
| 281 | + let result = tdf |
| 282 | + |> T.derive @"bonus" (T.col @"salary" * T.lit 0.1) |
| 283 | + |> T.filterWhere (T.col @"salary" .>. T.lit 50000) |
| 284 | + |> T.select @'["name", "bonus"] |
| 285 | + print (T.thaw result) |
| 286 | +``` |
| 287 | + |
| 288 | +`T.freeze` validates the runtime `DataFrame` against your schema once at the boundary. After that, every column access is checked at compile time: |
| 289 | + |
| 290 | +```haskell |
| 291 | +-- Typo in column name → compile error |
| 292 | +tdf |> T.filterWhere (T.col @"slary" .>. T.lit 50000) |
| 293 | +-- error: Column "slary" not found in schema |
| 294 | + |
| 295 | +-- Wrong type → compile error |
| 296 | +tdf |> T.filterWhere (T.col @"name" .>. T.lit 50000) |
| 297 | +-- error: Couldn't match type 'Text' with 'Double' |
74 | 298 | ``` |
75 | 299 |
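The "validate once at the boundary" idea can be sketched in plain Haskell with a smart constructor. This is only an illustration of the pattern, not the library's implementation: the real typed API tracks the full schema at the type level, while this sketch uses a hypothetical `Payroll` wrapper over a `Data.Map` table and checks two column names.

```haskell
import qualified Data.Map.Strict as Map

-- Untyped table: column name -> values (all Double to keep the sketch short).
type RawTable = Map.Map String [Double]

-- Wrapper that should only be built via 'freezePayroll'
-- (a real module would not export the constructor).
newtype Payroll = Payroll RawTable
  deriving (Eq, Show)

-- Check the required columns once, at the boundary.
freezePayroll :: RawTable -> Maybe Payroll
freezePayroll table
  | all (`Map.member` table) ["salary", "bonus"] = Just (Payroll table)
  | otherwise = Nothing

-- Downstream code takes a Payroll, so it never re-checks for "salary".
salaries :: Payroll -> [Double]
salaries (Payroll table) = Map.findWithDefault [] "salary" table
```

A table missing `"bonus"` is rejected with `Nothing` at the boundary; any function that accepts a `Payroll` can assume both columns exist.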
|
| 300 | +`filterAllJust` goes further — it strips `Maybe` from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable: |
| 301 | + |
| 302 | +```haskell |
| 303 | +-- Before: TypedDataFrame '[Column "score" (Maybe Double), Column "name" Text] |
| 304 | +let cleaned = T.filterAllJust tdf |
| 305 | +-- After: TypedDataFrame '[Column "score" Double, Column "name" Text] |
| 306 | + |
| 307 | +cleaned |> T.derive @"scaled" (T.col @"score" * T.lit 100) |
| 308 | +``` |
| 309 | + |
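A row-oriented analogue in plain Haskell shows why changing the type matters. This is a sketch under simplifying assumptions (a list of records instead of a columnar frame); `RawRow`, `CleanRow`, and `filterAllJustRows` are hypothetical names, not part of the library.

```haskell
import Data.Maybe (mapMaybe)

-- A row whose score may be missing.
data RawRow = RawRow { rawName :: String, rawScore :: Maybe Double }

-- The same row after cleaning: the Maybe is gone from the type.
data CleanRow = CleanRow { name :: String, score :: Double }
  deriving (Eq, Show)

-- Keep a row only if its score is present.
cleanRow :: RawRow -> Maybe CleanRow
cleanRow (RawRow n s) = CleanRow n <$> s

-- Drop rows with missing scores and change the row type in one step.
filterAllJustRows :: [RawRow] -> [CleanRow]
filterAllJustRows = mapMaybe cleanRow

-- Downstream arithmetic needs no null handling.
scaled :: [CleanRow] -> [Double]
scaled = map ((* 100) . score)
```

Because `scaled` accepts only `CleanRow`, passing uncleaned rows is a compile error rather than a runtime surprise.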
| 310 | +## Features |
| 311 | + |
| 312 | +**I/O**: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (`hf://` URIs). Column projection and predicate pushdown for Parquet reads. |
| 313 | + |
| 314 | +**Operations**: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits. |
| 315 | + |
| 316 | +**Expressions**: typed column references (`F.col @Double "x"`), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (`.==`, `.&&`), string matching (`like`, `regex`), casting, and user-defined functions via `lift`/`lift2`. |
| 317 | + |
| 318 | +**Statistics**: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation. |
| 319 | + |
| 320 | +**Plotting**: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots. |
| 321 | + |
| 322 | +**Lazy engine**: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes. |
| 323 | + |
| 324 | +**Interop**: Arrow C Data Interface for zero-copy round-trips with Python and Polars. |
| 325 | + |
| 326 | +**ML**: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling. |
| 327 | + |
| 328 | +**Notebooks**: IHaskell integration with [pre-built Binder examples](https://mybinder.org/v2/gh/mchav/ihaskell-dataframe/HEAD). |
| 329 | + |
| 330 | +## Lazy Queries |
| 331 | + |
| 332 | +For files too large to fit in memory, `DataFrame.Lazy` provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and `runDataFrame` runs it through an optimizer before streaming results batch-by-batch: |
| 333 | + |
| 334 | +```haskell |
| | +{-# LANGUAGE OverloadedStrings #-}
| | +{-# LANGUAGE TypeApplications #-}
| | +
| 335 | +import qualified DataFrame.Lazy as L
| 336 | +import qualified DataFrame.Functions as F |
| 337 | +import DataFrame.Operators |
| 338 | +import DataFrame.Internal.Schema (Schema, schemaType) |
| 339 | +import Data.Text (Text) |
| 340 | + |
| 341 | +mySchema :: Schema |
| 342 | +mySchema = [ ("name", schemaType @Text) |
| 343 | + , ("weight", schemaType @Double) |
| 344 | + , ("height", schemaType @Double) |
| 345 | + ] |
| 346 | + |
| 347 | +main :: IO () |
| 348 | +main = do |
| 349 | + result <- L.runDataFrame $ |
| 350 | + L.scanCsv mySchema "large_file.csv" |
| 351 | + |> L.filter (F.col @Double "height" .>. F.lit 1.7) |
| 352 | + |> L.select ["name", "weight", "height"] |
| 353 | + |> L.derive "bmi" (F.col @Double "weight" |
| 354 | + / (F.col @Double "height" * F.col @Double "height")) |
| 355 | + |> L.take 1000 |
| 356 | + print result |
| 357 | +``` |
| 358 | + |
| 359 | +The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected. |
| 360 | + |
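The early-termination behaviour can be illustrated with ordinary lazy lists. This sketch does not use `DataFrame.Lazy` and does not reflect how its executor is built; it only shows that a pull-based pipeline can stop reading once enough rows are collected, even from an endless source. `Batch`, `endlessBatches`, and `tallRows` are names invented for the example.

```haskell
-- One batch of (name, height) rows.
type Batch = [(String, Double)]

-- An endless source of 100-row batches; heights grow by 0.01 per row.
endlessBatches :: [Batch]
endlessBatches =
  [ [ ("row" ++ show i, fromIntegral i / 100) | i <- [b * 100 .. b * 100 + 99] ]
  | b <- [0 :: Int ..]
  ]

-- Filter on height, then stop after n matching rows.
tallRows :: Int -> [Batch] -> [(String, Double)]
tallRows n = take n . filter ((> 1.7) . snd) . concat
```

`tallRows 1000 endlessBatches` terminates even though the source never ends, because `take` stops demanding batches once it has 1000 rows.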
76 | 361 | ## Documentation |
77 | | -* 📚 User guide: https://dataframe.readthedocs.io/en/latest/ |
78 | | -* 📖 API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html |
| 362 | + |
| 363 | +* User guide: https://dataframe.readthedocs.io/en/latest/ |
| 364 | +* API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html |
| 365 | +* [Coming from pandas, Polars, dplyr, or Frames?](docs/coming_from_other_implementations.md) |
| 366 | +* [Cookbook (SQL-style patterns)](docs/cookbook.md) |
| 367 | +* [Tutorials](docs/tutorial.md) |
| 368 | +* Discord: https://discord.gg/8u8SCWfrNC |