Commit c069143

RowLab (#42)
* first commit RowLab
* add measurement generator
* progress on rowlab boilerplate
* super naive impl done
* move naive solution to handout
* cleanup
* fix
* reset everything
* first working version
* final draft
* cleanup
* add tests and benchmark
* update handout
* update toml and build scripts
* better doc comment
* add writeups
* restore gitignore
1 parent 5ea706b commit c069143

21 files changed, +2113 -1 lines changed

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -13,7 +13,6 @@ Cargo.lock
1313
results/
1414
submission/
1515

16-
1716
# These are backup files generated by rustfmt
1817
**/*.rs.bk
1918

week12/rowlab/Cargo.toml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
[package]
2+
name = "rowlab"
3+
version = "0.1.0"
4+
edition = "2024"
5+
default-run = "rowlab"
6+
7+
[dependencies]
8+
itertools = "0.14.0"
9+
rand = "0.9.0"
10+
rand_distr = "0.5.1"
11+
regex = "1.11.1"
12+
13+
[dev-dependencies]
14+
criterion = "0.5"
15+
16+
[[bench]]
17+
name = "brc"
18+
harness = false

week12/rowlab/README.md

Lines changed: 159 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,159 @@
1+
### 98-008: Intro to the Rust Programming Language
2+
3+
# Row Lab
4+
5+
This is the final assignment of the semester! Congrats on making it here.
6+
7+
Your final task is to implement something similar to
8+
[The One Billion Row Challenge](https://www.morling.dev/blog/one-billion-row-challenge/). The goal
9+
here is to get familiar with parallelism in Rust, as well as put together everything you've learned
10+
over the past semester to write a program that has practical application in the real world!
11+
12+
We are not going to give that much guidance here, since at this point you should be familiar enough
13+
with Rust as a language that you can figure out everything on your own. Of course, we'll explain
14+
enough to get you started.
15+
16+
**The description of the original challenge can be found
17+
[here](https://www.morling.dev/blog/one-billion-row-challenge/), so give it a quick read!**
18+
19+
The main differences between this assignment and the real challenge are that A) we are not writing Java,
20+
and B) instead of reading the data from a file / disk, we computationally generate the random data
21+
(in-memory) via an iterator.
22+
23+
_The second difference is mainly because Gradescope does not support more than 6 GB of memory per
24+
autograder (1 billion rows is approximately 14 GB), which means the complete data cannot fit in
25+
memory. Asking you to interact with I/O while also dealing with parallelism seemed a bit too cruel
26+
for this assignment, so we modified the challenge slightly. That being said, we encourage you to
27+
take the code you write for this lab and try the real challenge yourself!_
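
To make the iterator-based setup concrete, here is a minimal sketch of what an in-memory measurement source could look like. It is not the lab's actual `measurements.rs`: the station names, the `Lcg` type, and the `measurements` function are all made up for illustration, and a toy linear congruential generator stands in for a real random number crate.

```rust
/// Hypothetical station names; the real lab defines its own set in `measurements.rs`.
pub const STATIONS: [&str; 3] = ["Pittsburgh", "Hamburg", "Nairobi"];

/// A tiny deterministic pseudo-random generator (an LCG), used only so that this sketch
/// needs no third-party crates.
#[derive(Clone)]
pub struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Keep the higher-quality upper bits.
        self.0 >> 33
    }
}

/// Returns an endless, cloneable iterator of `(station, temperature)` pairs.
pub fn measurements(seed: u64) -> impl Iterator<Item = (&'static str, f64)> + Clone {
    let mut rng = Lcg(seed);
    std::iter::repeat_with(move || {
        let station = STATIONS[(rng.next_u64() % STATIONS.len() as u64) as usize];
        // Temperatures in [-50.0, 50.0), stored with one decimal place.
        let tenths = (rng.next_u64() % 1000) as i64 - 500;
        (station, tenths as f64 / 10.0)
    })
}
```

Because nothing is stored, `measurements(42).take(1_000_000)` costs no memory up front; rows are produced only as the consumer pulls them. Cloning the iterator restarts the same sequence, which is handy for repeated benchmark runs.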
28+
29+
**For this lab, you are allowed to use third-party crates!** This means you will have to also submit
30+
your `Cargo.toml` file. See the [Submission](#submission) section for more information.
31+
32+
# Starter Code
33+
34+
We have provided quite a lot of starter code for you to use! The two files that you should be
35+
modifying are `aggregation.rs` and `lib.rs`. You are allowed to modify `main.rs` and
36+
`measurements.rs` locally on your own computer, but the Gradescope autograder will be using the
37+
starter code for those two files. The other two files you should know about are `tests/mock.rs` as
38+
well as `benches/brc.rs`, which are explained in the next two sections.
39+
40+
`aggregation.rs` contains our recommended helper structs and methods for aggregating the data. You
41+
are allowed to completely rewrite everything except the function signature of `aggregate` and the
42+
struct definitions for `WeatherStations` and `AggregationResults` (but you are allowed to and
43+
encouraged to change the fields of `AggregationResults`).
44+
45+
Once you have implemented the `todo!()`s in `aggregation.rs`, you can move on to `lib.rs`. We have
46+
provided you with a naive single-threaded version of this challenge. From here, it is up to you to
47+
make things faster! See the [Benchmarking](#benchmarking-and-leaderboard) section for some hints 🦀.
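
As a starting point, here is one possible shape for a parallel aggregation, sketched under the assumption that a batch of rows has already been collected into a slice (the real `aggregate` receives an iterator, so you would have to batch it yourself). The names `Stats`, `EMPTY`, `combine`, `aggregate_chunk`, and `parallel_aggregate` are hypothetical and not part of the starter code; the only threading primitive used is `std::thread::scope` from the standard library.

```rust
use std::collections::HashMap;
use std::thread;

/// Per-station running statistics: (min, max, sum, count).
pub type Stats = (f64, f64, f64, u64);

/// The identity element for combining statistics.
const EMPTY: Stats = (f64::INFINITY, f64::NEG_INFINITY, 0.0, 0);

/// Folds `other` into the entry for `station`, creating the entry if needed.
fn combine<'a>(map: &mut HashMap<&'a str, Stats>, station: &'a str, other: Stats) {
    let entry = map.entry(station).or_insert(EMPTY);
    entry.0 = entry.0.min(other.0);
    entry.1 = entry.1.max(other.1);
    entry.2 += other.2;
    entry.3 += other.3;
}

/// Aggregates one chunk sequentially into a map owned by the calling thread.
fn aggregate_chunk<'a>(chunk: &[(&'a str, f64)]) -> HashMap<&'a str, Stats> {
    let mut local = HashMap::new();
    for &(station, value) in chunk {
        combine(&mut local, station, (value, value, value, 1));
    }
    local
}

/// Splits `rows` across at most `workers` scoped threads and merges the partial maps.
pub fn parallel_aggregate<'a>(
    rows: &[(&'a str, f64)],
    workers: usize,
) -> HashMap<&'a str, Stats> {
    // `chunks` panics on a size of zero, so clamp both values to at least one.
    let chunk_size = rows.len().div_ceil(workers.max(1)).max(1);

    let partials: Vec<HashMap<&'a str, Stats>> = thread::scope(|scope| {
        // Spawn every worker first, then join, so the chunks really run concurrently.
        let handles: Vec<_> = rows
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || aggregate_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });

    let mut merged = HashMap::new();
    for partial in partials {
        for (station, stats) in partial {
            combine(&mut merged, station, stats);
        }
    }
    merged
}
```

Each thread owns its own map, so no locks are needed until the final, single-threaded merge. Whether this beats a crate-based approach is something the benchmark has to tell you.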
48+
49+
# Testing
50+
51+
There are 3 integration tests located in `tests/mock.rs`. We will manually check your code for
52+
parallelism, and as long as you have integrated parallelism in some non-trivial manner, you will
53+
receive full credit if you pass the 3 tests.
54+
55+
If you make any changes to struct definitions or function signatures, make sure that you can still
56+
compile everything with `cargo test`!
57+
58+
# Benchmarking and Leaderboard
59+
60+
We have set up benchmarking via [Criterion](https://bheisler.github.io/criterion.rs/book/) for you.
61+
You can run `cargo bench` to see how long (on average) your `aggregate` function takes to aggregate
62+
1 billion rows. Note that the minimum number of samples it will run is 10, so if your code is
63+
**very** slow, you might just want to run the small timing program in `main.rs` via `cargo run`.
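
If you want a quick measurement without Criterion, a one-shot timer is enough. The contents of `main.rs` are not reproduced here, so treat the following as a rough sketch of the idea rather than the provided program; `time_once` is a hypothetical helper name.

```rust
use std::time::{Duration, Instant};

/// Runs `work` once and returns its result together with the elapsed wall-clock time.
pub fn time_once<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = work();
    (result, start.elapsed())
}
```

Something like `time_once(|| aggregate(measurements.take(1_000_000)))` on a reduced row count gives fast feedback while you iterate. A single run is noisy, so use `cargo bench` before drawing conclusions.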
64+
65+
There will also be a leaderboard on Gradescope! Compete to please Ferris with the fastest time. We
66+
will give you quite a lot of extra credit if you can beat Ferris (our reference solution) by some
67+
non-trivial amount. The top leaderboard finishers might get a huge amount of points 🦀🦀🦀🦀🦀
68+
69+
### Optimizations
70+
71+
There are many, many ways to speed up a program like the one you need to implement. In fact, there
72+
is a whole field dedicated to speeding up this kind of program: when you have a `GROUP BY` clause in
73+
SQL, the relational database executing the SQL query is doing almost this exact aggregation! If you
74+
are interested in this, you should take CMU's
75+
[Database Systems](https://15445.courses.cs.cmu.edu/spring2025/) course.
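
To illustrate the `GROUP BY` connection, here is a small sequential sketch that keeps one running state per station and renders it in the `min/mean/max` format used by this lab's `Display` implementations. `Group`, `group_by_station`, and `render` are hypothetical names chosen for this example, and the SQL in the comment is only an approximate equivalent.

```rust
use std::collections::BTreeMap;

/// Running state for one group, similar to what a database keeps per `GROUP BY` key.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Group {
    min: f64,
    max: f64,
    sum: f64,
    count: u64,
}

impl Group {
    /// The mean is derived on demand from the sum and the count.
    fn mean(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// Roughly: SELECT station, MIN(t), AVG(t), MAX(t) FROM rows GROUP BY station ORDER BY station;
pub fn group_by_station<'a>(
    rows: impl Iterator<Item = (&'a str, f64)>,
) -> BTreeMap<&'a str, Group> {
    let mut groups = BTreeMap::new();
    for (station, t) in rows {
        let group = groups.entry(station).or_insert(Group {
            min: f64::INFINITY,
            max: f64::NEG_INFINITY,
            sum: 0.0,
            count: 0,
        });
        group.min = group.min.min(t);
        group.max = group.max.max(t);
        group.sum += t;
        group.count += 1;
    }
    groups
}

/// Formats the groups as `{a=min/mean/max, b=min/mean/max}`, sorted by station name.
pub fn render(groups: &BTreeMap<&str, Group>) -> String {
    let body: Vec<String> = groups
        .iter()
        .map(|(station, g)| format!("{}={:.1}/{:.1}/{:.1}", station, g.min, g.mean(), g.max))
        .collect();
    format!("{{{}}}", body.join(", "))
}
```

A `BTreeMap` keeps the keys sorted, which makes the output step trivial but the inserts slower than a hash map. Which side of that trade-off wins at a billion rows is exactly the kind of question the benchmark can answer.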
76+
77+
We won't go into detail here, but you are allowed to go online and look at all of the techniques
78+
other people have used for this challenge. You can also read the
79+
[Rust Performance Book](https://nnethercote.github.io/perf-book/) online. Just make sure not to copy
80+
and paste anyone else's code without citing them first!
81+
82+
For this assignment, we would actually encourage you to look at the reference solution after giving
83+
a good-faith attempt at designing an algorithm yourself. Our reference solution is purposefully not
84+
very well optimized, but it does show the syntax for using parallelism in Rust. We encourage you to
85+
play around with the code!
86+
87+
Note that because the original challenge involved reading from a file (interacting with I/O), not
88+
everything online will be applicable to this assignment. Still, there are a lot of cool things on the
89+
internet that you _can_ make use of. Also, be careful when trying to use SIMD, as you will be graded
90+
on the Gradescope Docker containers.
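
One way to stay safe is to check CPU features at runtime and always keep a portable fallback. The sketch below assumes the concern is that the grading machine's CPU may not support the same instructions as your own; `has_avx2`, `chosen_kernel`, and `sum_portable` are hypothetical names, and no hand-written SIMD kernel is included.

```rust
/// Reports whether AVX2 is available on the machine the program is running on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn has_avx2() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On other architectures there is no AVX2 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn has_avx2() -> bool {
    false
}

/// Names the code path a dispatcher would pick; a real one would call the kernel instead.
pub fn chosen_kernel() -> &'static str {
    if has_avx2() { "avx2" } else { "portable" }
}

/// A portable sum that accumulates into four independent lanes. This shape tends to be
/// friendly to the compiler's auto-vectorizer, but nothing here guarantees SIMD code.
pub fn sum_portable(values: &[f64]) -> f64 {
    let mut lanes = [0.0f64; 4];
    let chunks = values.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        for (lane, value) in lanes.iter_mut().zip(chunk) {
            *lane += value;
        }
    }
    lanes.iter().sum::<f64>() + tail.iter().sum::<f64>()
}
```

The important part is that the program still produces correct results when the check returns `false`, whatever hardware the container happens to run on.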
91+
92+
_That being said, if you really want to play around with I/O and perhaps some `unsafe`ty with system
93+
calls (like `mmap`), reach out to us! We might give permission for you to submit the real challenge
94+
if we think you are capable of it._
95+
96+
# Submission
97+
98+
For this lab, you are allowed to use third-party crates! **This means that you must also submit your
99+
`Cargo.toml` file** (otherwise we wouldn't be able to compile your code). The `build.rs` build
100+
script will handle that for you.
101+
102+
### Formatting and Style
103+
104+
The autograder will run these two commands on your code:
105+
106+
```sh
107+
cargo clippy && cargo fmt --all -- --check
108+
```
109+
110+
**If the autograder detects any errors from the commands above, you will not be able to receive**
111+
**any points.** This may seem strict, but we have decided to follow standard best practices for
112+
Rust.
113+
114+
By following [Rust's style guidelines](https://doc.rust-lang.org/stable/style-guide/), you ensure
115+
that anybody reading your code (who is familiar with Rust) will be able to easily navigate your
116+
code. This can help with diving into an unfamiliar code base, and it also eliminates the need for
117+
debate with others over style rules, saving time and energy.
118+
119+
See the official [guidelines](https://doc.rust-lang.org/stable/style-guide/) for more information.
120+
121+
### Unix
122+
123+
If you are on a Unix system, we will try to create a `handin.zip` automatically for you,
124+
**but you will need to have `zip` already installed**.
125+
126+
If you _do not_ have `zip` installed on your system, install `zip` on your machine or use the CMU
127+
Linux SSH machines. If you need help with this, please reach out to us!
128+
129+
Once you have `zip` installed, we will create the `handin.zip` automatically for you (_take a peek_
130+
_into the `build.rs` file if you're interested in how this works!_).
131+
132+
Once you have the `handin.zip` file, submit it (and only the zip) to Gradescope.
133+
134+
### Windows
135+
136+
If you are on a Windows system, you can zip the `src/` folder manually and upload that to
137+
Gradescope. For this lab, you also need to add the `Cargo.toml` file to that zip folder. Please
138+
reach out to us if you are unsure how to do this!
139+
140+
Note that you don't _need_ to name it `handin.zip`; you can name it whatever you'd like.
141+
142+
# Collaboration
143+
144+
In general, feel free to discuss homeworks with other students! As long as you do not copy someone
145+
else's work, any communication is fair game.
146+
147+
All formal questions should be asked on Piazza. Try to discuss on Piazza so that other students can
148+
see your questions and answers as well!
149+
150+
You can also discuss on Discord, but try to keep any technical questions on Piazza.
151+
152+
# Feedback
153+
154+
We would like to reiterate that you should let us know if you spent anywhere in significant excess
155+
of an hour on this homework.
156+
157+
In addition, Rust has a notoriously steep learning curve, so if you find yourself not understanding
158+
the concepts, you should reach out to us and let us know as well --- chances are, you're not the
159+
only one!

week12/rowlab/benches/brc.rs

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
//! The 1 billion row challenge! Except without interacting with any I/O!
2+
3+
use criterion::{Criterion, black_box, criterion_group, criterion_main};
4+
use rowlab::{BILLION, WeatherStations, aggregate};
5+
6+
pub fn one_billion_row_challenge(c: &mut Criterion) {
7+
// Create the measurements iterator. In the real challenge, you would be reading these values
8+
// from a file on disk.
9+
let stations = WeatherStations::new();
10+
let measurements = stations.measurements();
11+
12+
c.bench_function("brc", |b| {
13+
b.iter(|| {
14+
black_box(aggregate(measurements.clone().take(BILLION)));
15+
})
16+
});
17+
}
18+
19+
criterion_main!(benches);
20+
criterion_group! {
21+
name = benches;
22+
config = Criterion::default()
23+
.sample_size(10);
24+
targets = one_billion_row_challenge
25+
}

week12/rowlab/build.rs

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
use std::process::Command;
2+
3+
fn main() {
4+
if cfg!(unix) {
5+
Command::new("zip")
6+
.arg("-r")
7+
.arg("handin.zip")
8+
.arg("src/")
9+
.arg("Cargo.toml")
10+
.output()
11+
.expect("\nError: Unable to zip handin files. Either the zip executable is not installed on this computer, the zip binary is not on your PATH, or something went very wrong with zip. Please contact the staff for help!\n\n");
12+
}
13+
14+
println!("cargo:rerun-if-changed=handin.zip");
15+
}

week12/rowlab/src/aggregation.rs

Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
use itertools::Itertools;
2+
use std::collections::HashMap;
3+
use std::fmt::{Display, Write};
4+
5+
/// Aggregate statistics for a specific [`WeatherStation`].
6+
#[derive(Debug, Clone, Copy)]
7+
pub struct StationAggregation {
8+
/// The minimum temperature measurement.
9+
min: f64,
10+
/// The maximum temperature measurement.
11+
max: f64,
12+
/// The average / mean temperature measurement.
13+
mean: f64,
14+
/// Helper field for calculating mean (sum_measurements / num_measurements).
15+
sum_measurements: f64,
16+
/// Helper field for calculating mean (sum_measurements / num_measurements).
17+
num_measurements: f64,
18+
}
19+
20+
impl StationAggregation {
21+
/// Creates a new `StationAggregation` for computing aggregations.
22+
pub fn new() -> Self {
23+
Self {
24+
min: f64::INFINITY,
25+
mean: 0.0,
26+
max: f64::NEG_INFINITY,
27+
sum_measurements: 0.0,
28+
num_measurements: 0.0,
29+
}
30+
}
31+
32+
/// Updates the aggregation with a new measurement.
33+
///
34+
/// TODO(student): Is processing measurements one-by-one the best way to compute aggregations?
35+
/// Remember that you are allowed to add other methods in this implementation block!
36+
pub fn add_measurement(&mut self, measurement: f64) {
37+
todo!("Implement me!")
38+
}
39+
40+
pub fn min(&self) -> f64 {
41+
self.min
42+
}
43+
44+
pub fn max(&self) -> f64 {
45+
self.max
46+
}
47+
48+
pub fn mean(&self) -> f64 {
49+
self.mean
50+
}
51+
}
52+
53+
impl Display for StationAggregation {
54+
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
55+
write!(f, "{:.1}/{:.1}/{:.1}", self.min, self.mean, self.max)
56+
}
57+
}
58+
59+
/// The aggregation results for the billion row challenge.
60+
///
61+
/// TODO(student): This is purposefully not an ideal structure! You are allowed to change what
62+
/// types this struct contains. Think about what this structure should represent, and where the data
63+
/// might best be located. Also, you are allowed to use third-party data structures.
64+
#[derive(Debug)]
65+
pub struct AggregationResults {
66+
/// A map from weather station identifier to its aggregate metrics.
67+
results: HashMap<String, StationAggregation>,
68+
}
69+
70+
impl AggregationResults {
71+
/// Creates an empty `AggregationResults`.
72+
pub fn new() -> Self {
73+
Self {
74+
results: HashMap::new(),
75+
}
76+
}
77+
78+
/// Updates the metrics for the given station with a measurement.
79+
pub fn insert_measurement(&mut self, station: &str, measurement: f64) {
80+
todo!("Implement me!")
81+
}
82+
83+
/// Retrieves the stats of a specific station, if it exists. Used for testing purposes.
84+
pub fn get_metrics(&self, station: &str) -> Option<StationAggregation> {
85+
self.results.get(station).copied()
86+
}
87+
}
88+
89+
impl Display for AggregationResults {
90+
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
91+
// Sort the results by weather station ID and join into the output string format.
92+
let sorted_results: Vec<_> = self
93+
.results
94+
.iter()
95+
.sorted_by(|a, b| Ord::cmp(&a.0, &b.0))
96+
.collect();
97+
98+
f.write_char('{')?;
99+
100+
// Append each weather station's metrics to the output string.
101+
for (station, aggregation) in sorted_results.iter().take(sorted_results.len() - 1) {
102+
f.write_str(station)?;
103+
f.write_char('=')?;
104+
// Note that implementing `Display` on `StationAggregation` means that you can call
105+
// `to_string` and it will do a similar thing as `Display::fmt`.
106+
f.write_str(&aggregation.to_string())?;
107+
f.write_char(',')?;
108+
f.write_char(' ')?;
109+
}
110+
111+
let (last_station, last_aggregation) =
112+
sorted_results.last().expect("somehow empty results");
113+
f.write_str(last_station)?;
114+
f.write_char('=')?;
115+
f.write_str(&last_aggregation.to_string())?;
116+
117+
f.write_char('}')
118+
}
119+
}
120+
121+
impl Default for StationAggregation {
122+
fn default() -> Self {
123+
Self::new()
124+
}
125+
}
126+
127+
impl Default for AggregationResults {
128+
fn default() -> Self {
129+
Self::new()
130+
}
131+
}

week12/rowlab/src/lib.rs

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
#![doc = include_str!("../README.md")]
2+
3+
mod aggregation;
4+
use aggregation::AggregationResults;
5+
6+
mod measurements;
7+
pub use measurements::WeatherStations;
8+
9+
/// One billion.
10+
pub const BILLION: usize = 1_000_000_000;
11+
12+
/// Given an iterator that yields measurements for weather stations, aggregate each weather
13+
/// station's data.
14+
///
15+
/// TODO(student): This is purposefully a very bad way to compute aggregations (namely, completely
16+
/// sequentially). If you don't want to time out, you will need to introduce parallelism in some
17+
/// manner. And even after you introduce parallelism, there are many different things you can do to
18+
/// speed this up dramatically.
19+
///
20+
/// For this lab, we would encourage you to look at the reference solution after giving this a good
21+
/// attempt on your own! Note that the reference solution is purposefully not optimized in several
22+
/// places, and there is lots of room for improvement. We also encourage you to go online and see if
23+
/// you can find any interesting techniques for speeding this up.
24+
pub fn aggregate<'a, I>(measurements: I) -> AggregationResults
25+
where
26+
I: Iterator<Item = (&'a str, f64)>,
27+
{
28+
let mut results = AggregationResults::new();
29+
30+
for (station, measurement) in measurements {
31+
results.insert_measurement(station, measurement);
32+
}
33+
34+
results
35+
}
