Commit 03fab4e

feat: Faster CSV I/O + more algorithms in dataframe-learn
1 parent a8f599c commit 03fab4e

165 files changed

Lines changed: 21618 additions & 2300 deletions

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -1,5 +1,19 @@
 # Revision history for dataframe

+## 2.2.0.0
+
+A large performance and ML release.
+
+### Highlights
+* **Much faster I/O and analytics.** CSV reading, group-by, joins and sorting were rebuilt on compact unboxed / `PackedText` columns, parallel open-addressing hash tables, and vectorized aggregation; the default reader drops `cassava` for a single-pass scanner, and `dataframe-fastcsv` adds multicore chunking. The end-to-end join + group-by pipeline runs several times faster with much lower memory and scales with `-threaded` / `+RTS -N`; results are byte-identical (golden-tested).
+* **Lazy engine on par with eager.** Bounded-source queries route through the same fast paths (streaming is preserved for unbounded sources); lazy `sortBy` now also orders non-`Text` columns correctly.
+* **New ML library (`dataframe-learn`).** scikit-learn-style estimators behind a uniform `fit` / `predict`: linear / ridge / lasso / logistic regression, SVM, trees, boosting, k-means, GMM, DBSCAN, PCA and kernel PCA, symbolic regression and feature synthesis, with metrics and cross-validation. Pure and deterministic; every model also compiles to a dataframe `Expr`.
+* **Typed joins are checked at compile time.** Keys must exist in both schemas with matching types (previously a runtime failure or a silent empty result).
+
+### Breaking changes
+* The `Column` GADT gains a `PackedText` constructor, so exhaustive matches need a new arm (`materializePacked` decodes it back to boxed `Text`). CSV reads are now strict and fully forced; schema columns parse as their declared type; ragged rows pad with null instead of silently misaligning columns; overflowing integers parse as `Double`. `dataframe-learn`'s old beam-search synthesis and per-model `fit*` helpers are replaced by `fit` / `predict`.
+* Coordinated major bumps (`dataframe-core`, `dataframe-learn` → `1.1.0.0`; umbrella `dataframe` → `2.2.0.0`) with inter-package lower bounds tightened so a newer package cannot resolve against an incompatible sibling. Drops `cassava` and `unordered-containers`; requires `text >= 2.1`.
+
 ## 2.1.0.3
 ### Packaging
 * Fix dependency resolution for the `dataframe` meta-package and its satellites
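
The CSV behaviour listed under "Breaking changes" (ragged rows padded with null, overflowing integers read as `Double`) can be illustrated with a small language-neutral sketch. This is not the library's reader: `parse_cell` and `read_rows` are hypothetical names, cells are typed one at a time rather than per column, and "overflowing" is assumed to mean outside the signed 64-bit range.

```python
import csv
import io

# Assumption: integers are stored as signed 64-bit values.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def parse_cell(text):
    """Type a single cell: empty -> None, in-range integer -> int,
    out-of-range integer or decimal -> float, anything else -> str."""
    if text == "":
        return None
    try:
        value = int(text)
    except ValueError:
        pass
    else:
        # An integer that does not fit in 64 bits falls back to a float.
        return value if INT64_MIN <= value <= INT64_MAX else float(value)
    try:
        return float(text)
    except ValueError:
        return text


def read_rows(source):
    """Read CSV text, padding short rows so every value stays in its column."""
    reader = csv.reader(io.StringIO(source))
    header = next(reader)
    rows = []
    for raw in reader:
        # Missing trailing fields become empty cells, which parse to None.
        padded = raw + [""] * (len(header) - len(raw))
        rows.append([parse_cell(cell) for cell in padded])
    return header, rows


header, rows = read_rows("a,b,c\n1,2,3\n4,5\n9223372036854775808,x,\n")
for row in rows:
    print(row)
```

The second data row is one field short and gains a trailing `None`; the first cell of the third row is one past the 64-bit maximum and comes back as a float. Rows with extra fields are left untouched here, since the changelog does not say how they are handled.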

data/ml/blobs.csv

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
+x,y,cluster
+10.338264760556216,0.847533060653685,2.0
+-8.030169963007346,5.6974805690319235,0.0
+10.72808765826904,0.8199785012860177,2.0
+-9.06787703268451,6.456351950062689,0.0
+-1.9752744232671209,4.798741021511607,1.0
+-7.950069649558368,6.313857688658719,0.0
+9.185263691410944,-0.02689531876281781,2.0
+-1.2924664642268644,5.094345005459218,1.0
+-8.386915515325072,6.748967461389298,0.0
+9.47365658115339,-0.3512223275732281,2.0
+-9.794694684045528,6.026719329863166,0.0
+-8.474546521207202,4.194596399725956,0.0
+-1.8006579280809625,4.866203894410084,1.0
+-8.798016496863006,3.7677237632249145,0.0
+9.608066918226854,1.550543508284739,2.0
+-0.46295420029132506,4.91792529440678,1.0
+9.085876965700251,0.8012523279907401,2.0
+-7.2757229200427895,5.372267655941989,0.0
+9.40388503768542,1.2438256022898402,2.0
+-8.208254840130058,6.1863971123041805,0.0
+-9.8405055750545,4.1558969180185,0.0
+-1.0130643168373106,3.870050967862403,1.0
+-1.750299605496018,5.8673653480977395,1.0
+-7.250170505806332,6.133910399530022,0.0
+-0.8427472133887682,4.013725740204026,1.0
+-8.783546167149966,7.221633621411183,0.0
+-7.801247618806028,5.454386723800413,0.0
+10.043495267038827,1.5187267683935621,2.0
+-9.428707750327403,4.758362395941994,0.0
+-0.7221251049321777,4.573801938428627,1.0
+-8.768888987279134,4.073279941977919,0.0
+-2.3850503444929023,3.3944007047064506,1.0
+9.52005242972435,1.344762816352543,2.0
+-1.6485932394255796,4.637081202200104,1.0
+-7.634316973046253,5.265196389986278,0.0
+-2.873923777802405,6.120339997469102,1.0
+-8.01938470265847,4.996106087665121,0.0
+8.650080809399894,0.5164250456305327,2.0
+-6.944219122655304,5.788217322112048,0.0
+9.411501808610677,1.1207212024133286,2.0
+8.727685388903897,0.7881397821440224,2.0
+9.940157883783744,0.31129800696905785,2.0
+-1.5441245562713808,3.7925654469165777,1.0
+-8.668033475458893,4.43578271480966,0.0
+-8.974177391694264,5.461137235845716,0.0
+-8.579735970574333,5.350794772331189,0.0
+-0.8507452109763282,5.35913594514609,1.0
+9.799881364516688,1.4185320368386882,2.0
+7.715643621051086,1.4926421993810217,2.0
+9.204875175494418,0.6321707789639058,2.0
+10.106053390869812,0.1776808950835841,2.0
+-8.549654572872772,6.552844582677501,0.0
+10.555503658306312,1.421054556659824,2.0
+-0.8409150384679815,4.759055557379698,1.0
+10.580079627664762,0.7854064702080787,2.0
+-10.125366055017796,5.068648373068846,0.0
+-0.36588996340355817,3.8233854113954955,1.0
+9.426230450546782,0.5860742620560562,2.0
+-1.885479568645208,4.623149715673959,1.0
+9.223577182486164,1.5910144728588864,2.0
+-7.2711793961873274,5.428189471188245,0.0
+9.453560211741014,-0.7499716472172048,2.0
+-1.984038662573283,4.364283036544825,1.0
+-8.392687024351563,5.800438033246547,0.0
+11.030025753690552,0.7766578881818996,2.0
+-2.2915284396471147,5.001466504584993,1.0
+-1.6145496750257826,4.777612786934544,1.0
+10.168318216276937,0.6986657131606543,2.0
+-2.2266293822669425,4.0573617784558555,1.0
+-8.254266273512885,4.37715621930642,0.0
+-9.437210089000642,6.767956346573112,0.0
+10.590353086938768,0.2993071633588282,2.0
+-2.79339785001823,4.767968183672174,1.0
+-1.2140195803496816,4.126048681774031,1.0
+-8.141432121082447,4.859540119431719,0.0
+9.089586732772219,-0.9279621694174798,2.0
+-7.1532744595716755,5.721844273167379,0.0
+-8.510143036409373,4.437832885485033,0.0
+-1.138567419393835,2.986258859412659,1.0
+-8.034838116698193,6.139679036447002,0.0
+-1.4384684377722188,5.3796868046940745,1.0
+-1.2202242152011087,3.034384563178521,1.0
+9.10502555609624,0.429281886212569,2.0
+10.585464339946231,0.008965041738488957,2.0
+-7.659707807990062,6.078774657537933,0.0
+9.823590118434234,0.6308585417952228,2.0
+-2.476845488661084,5.076283717062412,1.0
+-8.553522718036747,6.958005682199637,0.0
+-8.069594713163454,5.389290712648975,0.0
+-1.4505216161787537,3.810535489405869,1.0
+0.14236866854605823,4.102802122310017,1.0
+-8.713989201960986,4.654390170035729,0.0
+-1.462202464398755,4.709150238434744,1.0
+9.139446464353718,0.9973645548743786,2.0
+-2.0942032865826947,3.003362783235963,1.0
+9.384194636323251,0.5881152884036329,2.0
+-0.09043844586306131,4.594778743582171,1.0
+-8.630656062367228,5.125759989711783,0.0
+-8.258304908114546,5.178692149281611,0.0
+-1.6564983727218412,3.57173364605436,1.0
+-8.747527831262346,5.220834523095475,0.0
+8.289191474819738,0.9762956762429598,2.0
+-0.2667896158894523,5.2537491444891975,1.0
+-1.4567017014328651,3.240421719648844,1.0
+-2.578532846092601,5.281315504987536,1.0
+-7.060905510177389,5.334844843271841,0.0
+9.544505126636004,1.2726118372517006,2.0
+-1.789088183709013,4.9356797992475885,1.0
+-9.012750362421714,5.553681884762233,0.0
+-2.0885293775780265,3.819536691473433,1.0
+-8.160597657718792,4.443878519201542,0.0
+9.953505807271908,0.7455103282555472,2.0
+-8.628118580800681,4.175965562435486,0.0
+-0.4807339560151918,4.5869562471000975,1.0
+9.458262920723499,0.3546075672900924,2.0
+-1.3869004247455148,5.027138844442485,1.0
+-1.3046003840083444,5.232595722796476,1.0
+10.903405766013115,0.2879089162540264,2.0
+-8.167336731114846,7.396451887048452,0.0
+-1.1612914315034135,3.4000871166542685,1.0
+-1.7253529304545083,4.179477773883045,1.0
+-0.9912710391638034,5.444758663793727,1.0
+-2.3277031463654856,5.054227064960736,1.0
+8.147437705805391,1.071564094656846,2.0
+9.877041844526358,1.7070321683770884,2.0
+-0.3870567318504991,4.922009547520749,1.0
+-8.713603202383117,6.635883998983632,0.0
+9.549394662765959,1.8843175122058762,2.0
+-6.665876623219847,6.293607308205353,0.0
+-9.6218049853493,5.999675143347125,0.0
+-8.780572710866066,4.886490355985694,0.0
+10.03429276137997,0.5300549650285779,2.0
+9.157196652555177,1.2812079409319164,2.0
+8.1447728798683,0.5915572830522724,2.0
+9.146956239818431,1.017246090363041,2.0
+-9.165426205789453,5.897872124592354,0.0
+9.959392438724798,0.19080965893214255,2.0
+8.486697738274069,1.4784940535418505,2.0
+-0.3582643293862715,5.840807754114097,1.0
+-2.2186345261641804,4.615624117910637,1.0
+10.47604297470776,0.4921540622294554,2.0
+-2.7604144732798868,4.9784338438022795,1.0
+9.937212447233478,-0.1112822931713835,2.0
+-8.483627961806011,5.53784507561264,0.0
+9.270455589800079,1.4007210675022714,2.0
+0.25622916659958644,3.2714846427335473,1.0
+10.148892613250826,-0.16841584321546238,2.0
+9.702422433216821,2.162194444522933,2.0
+-9.104972635421447,5.6000283031270515,0.0
+-0.46045498366900267,4.696212695686656,1.0
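
This fixture has two coordinates and a cluster label per row, which is the shape needed to compute k-means inertia: the sum of squared distances from each point to the centroid of its cluster. The sketch below shows that arithmetic on the first six data rows. The `inertia` helper is hypothetical, and it groups by the labels in the file rather than by assignments from a fitted model, so it is not guaranteed to reproduce any value the test suite records.

```python
from collections import defaultdict


def inertia(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    total = 0.0
    for members in groups.values():
        # The centroid is the per-coordinate mean of the cluster's members.
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total


# First six data rows of data/ml/blobs.csv as (x, y) pairs plus labels.
sample_points = [
    (10.338264760556216, 0.847533060653685),
    (-8.030169963007346, 5.6974805690319235),
    (10.72808765826904, 0.8199785012860177),
    (-9.06787703268451, 6.456351950062689),
    (-1.9752744232671209, 4.798741021511607),
    (-7.950069649558368, 6.313857688658719),
]
sample_labels = [2.0, 0.0, 2.0, 0.0, 1.0, 0.0]

print(inertia(sample_points, sample_labels))
```

A cluster with a single member, such as label `1.0` in this sample, contributes zero because the point coincides with its own centroid.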

data/ml/golden.json

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+{
+  "gbm": {
+    "accuracy": 1.0
+  },
+  "kmeans": {
+    "inertia": 173.75480764755966
+  },
+  "linear_svc": {
+    "accuracy": 1.0
+  },
+  "logistic_binary": {
+    "accuracy": 1.0
+  },
+  "logistic_iris": {
+    "accuracy": 0.9733333333333334
+  },
+  "ols": {
+    "coef": [
+      -10.009866299810287,
+      -239.81564367242294,
+      519.8459200544604,
+      324.38464550232385,
+      -792.1756385522323,
+      476.7390210052585,
+      101.04326793803448,
+      177.06323767134674,
+      751.2736995571044,
+      67.62669218370488
+    ],
+    "intercept": 152.13348416289597
+  },
+  "pca": {
+    "components_abs": [
+      [
+        0.3613865917853659,
+        0.08452251406457255,
+        0.8566706059498347,
+        0.3582891971515517
+      ],
+      [
+        0.6565887712868534,
+        0.7301614347850159,
+        0.17337266279585964,
+        0.0754810199174582
+      ]
+    ],
+    "evr": [
+      0.9246187232017291,
+      0.05306648311706544
+    ]
+  },
+  "ridge": {
+    "alpha": 1.0,
+    "coef": [
+      29.466111893477002,
+      -83.15427636187523,
+      306.3526801506861,
+      201.62773437326962,
+      5.909614367497247,
+      -29.515495079689597,
+      -152.04028006186414,
+      117.31173160030173,
+      262.9442900143125,
+      111.87895643952352
+    ],
+    "intercept": 152.133484162896
+  }
+}
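
Two details of this file suggest how it is meant to be compared against. The PCA entry is named `components_abs`, which fits the usual practice of comparing principal components up to sign, since an eigenvector and its negation are equally valid. The accuracies are plain fractions of correct predictions. The sketch below shows both checks; `close`, `components_match` and `accuracy` are hypothetical helpers, the tolerance is an arbitrary choice, and the actual test harness in the commit is not shown on this page.

```python
import math


def close(a, b, tol=1e-9):
    """Floating-point comparison with both relative and absolute tolerance."""
    return math.isclose(a, b, rel_tol=tol, abs_tol=tol)


def components_match(actual, golden_abs, tol=1e-9):
    """Compare PCA components entry by entry, ignoring the sign of each value."""
    if len(actual) != len(golden_abs):
        return False
    for row, gold in zip(actual, golden_abs):
        if len(row) != len(gold):
            return False
        if not all(close(abs(x), g, tol) for x, g in zip(row, gold)):
            return False
    return True


def accuracy(predicted, expected):
    """Fraction of positions where the prediction equals the expected label."""
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# First golden component, with three signs flipped as another solver might return it.
golden_first = [[0.3613865917853659, 0.08452251406457255,
                 0.8566706059498347, 0.3582891971515517]]
flipped = [[-0.3613865917853659, 0.08452251406457255,
            -0.8566706059498347, -0.3582891971515517]]
print(components_match(flipped, golden_first))

# 146 correct out of 150 is one way to arrive at the recorded iris accuracy.
print(close(accuracy([0] * 146 + [1] * 4, [0] * 150), 0.9733333333333334))
```

If `logistic_iris` was scored on all 150 rows of the iris fixture, its value corresponds to 146 correct predictions; the page does not say which rows were used for scoring.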

data/ml/iris.csv

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
+sepal_length,sepal_width,petal_length,petal_width,species
+5.1,3.5,1.4,0.2,0.0
+4.9,3.0,1.4,0.2,0.0
+4.7,3.2,1.3,0.2,0.0
+4.6,3.1,1.5,0.2,0.0
+5.0,3.6,1.4,0.2,0.0
+5.4,3.9,1.7,0.4,0.0
+4.6,3.4,1.4,0.3,0.0
+5.0,3.4,1.5,0.2,0.0
+4.4,2.9,1.4,0.2,0.0
+4.9,3.1,1.5,0.1,0.0
+5.4,3.7,1.5,0.2,0.0
+4.8,3.4,1.6,0.2,0.0
+4.8,3.0,1.4,0.1,0.0
+4.3,3.0,1.1,0.1,0.0
+5.8,4.0,1.2,0.2,0.0
+5.7,4.4,1.5,0.4,0.0
+5.4,3.9,1.3,0.4,0.0
+5.1,3.5,1.4,0.3,0.0
+5.7,3.8,1.7,0.3,0.0
+5.1,3.8,1.5,0.3,0.0
+5.4,3.4,1.7,0.2,0.0
+5.1,3.7,1.5,0.4,0.0
+4.6,3.6,1.0,0.2,0.0
+5.1,3.3,1.7,0.5,0.0
+4.8,3.4,1.9,0.2,0.0
+5.0,3.0,1.6,0.2,0.0
+5.0,3.4,1.6,0.4,0.0
+5.2,3.5,1.5,0.2,0.0
+5.2,3.4,1.4,0.2,0.0
+4.7,3.2,1.6,0.2,0.0
+4.8,3.1,1.6,0.2,0.0
+5.4,3.4,1.5,0.4,0.0
+5.2,4.1,1.5,0.1,0.0
+5.5,4.2,1.4,0.2,0.0
+4.9,3.1,1.5,0.2,0.0
+5.0,3.2,1.2,0.2,0.0
+5.5,3.5,1.3,0.2,0.0
+4.9,3.6,1.4,0.1,0.0
+4.4,3.0,1.3,0.2,0.0
+5.1,3.4,1.5,0.2,0.0
+5.0,3.5,1.3,0.3,0.0
+4.5,2.3,1.3,0.3,0.0
+4.4,3.2,1.3,0.2,0.0
+5.0,3.5,1.6,0.6,0.0
+5.1,3.8,1.9,0.4,0.0
+4.8,3.0,1.4,0.3,0.0
+5.1,3.8,1.6,0.2,0.0
+4.6,3.2,1.4,0.2,0.0
+5.3,3.7,1.5,0.2,0.0
+5.0,3.3,1.4,0.2,0.0
+7.0,3.2,4.7,1.4,1.0
+6.4,3.2,4.5,1.5,1.0
+6.9,3.1,4.9,1.5,1.0
+5.5,2.3,4.0,1.3,1.0
+6.5,2.8,4.6,1.5,1.0
+5.7,2.8,4.5,1.3,1.0
+6.3,3.3,4.7,1.6,1.0
+4.9,2.4,3.3,1.0,1.0
+6.6,2.9,4.6,1.3,1.0
+5.2,2.7,3.9,1.4,1.0
+5.0,2.0,3.5,1.0,1.0
+5.9,3.0,4.2,1.5,1.0
+6.0,2.2,4.0,1.0,1.0
+6.1,2.9,4.7,1.4,1.0
+5.6,2.9,3.6,1.3,1.0
+6.7,3.1,4.4,1.4,1.0
+5.6,3.0,4.5,1.5,1.0
+5.8,2.7,4.1,1.0,1.0
+6.2,2.2,4.5,1.5,1.0
+5.6,2.5,3.9,1.1,1.0
+5.9,3.2,4.8,1.8,1.0
+6.1,2.8,4.0,1.3,1.0
+6.3,2.5,4.9,1.5,1.0
+6.1,2.8,4.7,1.2,1.0
+6.4,2.9,4.3,1.3,1.0
+6.6,3.0,4.4,1.4,1.0
+6.8,2.8,4.8,1.4,1.0
+6.7,3.0,5.0,1.7,1.0
+6.0,2.9,4.5,1.5,1.0
+5.7,2.6,3.5,1.0,1.0
+5.5,2.4,3.8,1.1,1.0
+5.5,2.4,3.7,1.0,1.0
+5.8,2.7,3.9,1.2,1.0
+6.0,2.7,5.1,1.6,1.0
+5.4,3.0,4.5,1.5,1.0
+6.0,3.4,4.5,1.6,1.0
+6.7,3.1,4.7,1.5,1.0
+6.3,2.3,4.4,1.3,1.0
+5.6,3.0,4.1,1.3,1.0
+5.5,2.5,4.0,1.3,1.0
+5.5,2.6,4.4,1.2,1.0
+6.1,3.0,4.6,1.4,1.0
+5.8,2.6,4.0,1.2,1.0
+5.0,2.3,3.3,1.0,1.0
+5.6,2.7,4.2,1.3,1.0
+5.7,3.0,4.2,1.2,1.0
+5.7,2.9,4.2,1.3,1.0
+6.2,2.9,4.3,1.3,1.0
+5.1,2.5,3.0,1.1,1.0
+5.7,2.8,4.1,1.3,1.0
+6.3,3.3,6.0,2.5,2.0
+5.8,2.7,5.1,1.9,2.0
+7.1,3.0,5.9,2.1,2.0
+6.3,2.9,5.6,1.8,2.0
+6.5,3.0,5.8,2.2,2.0
+7.6,3.0,6.6,2.1,2.0
+4.9,2.5,4.5,1.7,2.0
+7.3,2.9,6.3,1.8,2.0
+6.7,2.5,5.8,1.8,2.0
+7.2,3.6,6.1,2.5,2.0
+6.5,3.2,5.1,2.0,2.0
+6.4,2.7,5.3,1.9,2.0
+6.8,3.0,5.5,2.1,2.0
+5.7,2.5,5.0,2.0,2.0
+5.8,2.8,5.1,2.4,2.0
+6.4,3.2,5.3,2.3,2.0
+6.5,3.0,5.5,1.8,2.0
+7.7,3.8,6.7,2.2,2.0
+7.7,2.6,6.9,2.3,2.0
+6.0,2.2,5.0,1.5,2.0
+6.9,3.2,5.7,2.3,2.0
+5.6,2.8,4.9,2.0,2.0
+7.7,2.8,6.7,2.0,2.0
+6.3,2.7,4.9,1.8,2.0
+6.7,3.3,5.7,2.1,2.0
+7.2,3.2,6.0,1.8,2.0
+6.2,2.8,4.8,1.8,2.0
+6.1,3.0,4.9,1.8,2.0
+6.4,2.8,5.6,2.1,2.0
+7.2,3.0,5.8,1.6,2.0
+7.4,2.8,6.1,1.9,2.0
+7.9,3.8,6.4,2.0,2.0
+6.4,2.8,5.6,2.2,2.0
+6.3,2.8,5.1,1.5,2.0
+6.1,2.6,5.6,1.4,2.0
+7.7,3.0,6.1,2.3,2.0
+6.3,3.4,5.6,2.4,2.0
+6.4,3.1,5.5,1.8,2.0
+6.0,3.0,4.8,1.8,2.0
+6.9,3.1,5.4,2.1,2.0
+6.7,3.1,5.6,2.4,2.0
+6.9,3.1,5.1,2.3,2.0
+5.8,2.7,5.1,1.9,2.0
+6.8,3.2,5.9,2.3,2.0
+6.7,3.3,5.7,2.5,2.0
+6.7,3.0,5.2,2.3,2.0
+6.3,2.5,5.0,1.9,2.0
+6.5,3.0,5.2,2.0,2.0
+6.2,3.4,5.4,2.3,2.0
+5.9,3.0,5.1,1.8,2.0
