Commit

adding xet-team org to other xet posts (#2678)
jsulz authored Feb 14, 2025
1 parent 766a3ee commit ff3d728
Showing 3 changed files with 26 additions and 20 deletions.
2 changes: 2 additions & 0 deletions from-files-to-chunks.md
@@ -3,7 +3,9 @@ title: "From Files to Chunks: Improving HF Storage Efficiency"
thumbnail: /blog/assets/from-files-to-chunks/thumbnail.png
authors:
- user: jsulz
  org: xet-team
- user: erinys
  org: xet-team
---

# From Files to Chunks: Improving HF Storage Efficiency
41 changes: 21 additions & 20 deletions improve_parquet_dedupe.md
@@ -3,7 +3,9 @@ title: "Improving Parquet Dedupe on Hugging Face Hub"
thumbnail: /blog/assets/improve_parquet_dedupe/thumbnail.png
authors:
- user: yuchenglow
  org: xet-team
- user: seanses
  org: xet-team
---

# Improving Parquet Dedupe on Hugging Face Hub
@@ -12,18 +14,18 @@ The Xet team at Hugging Face is working on improving the efficiency of the Hub's
storage architecture to make it easier and quicker for users to
store and update data and models. As Hugging Face hosts nearly 11PB of datasets
with Parquet files alone accounting for over 2.2PB of that storage,
optimizing Parquet storage is a high priority.

Most Parquet files are bulk exports from various data analysis pipelines
or databases, often appearing as full snapshots rather than incremental
updates. Data deduplication becomes critical for efficiency when users want to
update their datasets on a regular basis. Only by deduplicating can we store
all versions as compactly as possible, without requiring everything to be uploaded
again on every update. In an ideal case, we should be able to store every version
of a growing dataset with only a little more space than the size of its largest version.

Our default storage algorithm uses byte-level [Content-Defined Chunking (CDC)](https://joshleeb.com/posts/content-defined-chunking.html),
which generally dedupes well over insertions and deletions, but the Parquet layout brings some challenges.
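
For intuition, here is a minimal, illustrative sketch of byte-level CDC (not our production implementation; the constants and hash are placeholders): a rolling hash over the byte stream declares a chunk boundary whenever it hits a mask, so boundaries follow the content rather than absolute offsets, and an insertion or deletion only disturbs the chunks around the edit.

```python
# Toy content-defined chunker (illustrative only; real CDC implementations
# use gear/buzhash-style rolling hashes over a fixed window).
MASK = (1 << 16) - 1      # boundary when hash & MASK == 0 (~64 KB average chunks)
MIN_CHUNK = 8 * 1024
MAX_CHUNK = 256 * 1024

def chunk_boundaries(data: bytes):
    """Yield (start, end) byte ranges of content-defined chunks."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF   # cheap rolling-style hash
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

# Deduplication then stores each chunk once, keyed by its content hash:
# two file versions that share bytes also share most of their chunks.
```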
Here we run some experiments to see how some simple modifications behave on
Parquet files, using a 2GB Parquet file with 1,092,000 rows from the
[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/main/data/CC-MAIN-2013-20)
@@ -76,17 +78,17 @@ that absolute file offsets are part of the Parquet column headers (see the
structures ColumnChunk and ColumnMetaData)! This means that any
modification is likely to rewrite all the Column headers. So while the
data does dedupe well (it is mostly green), we get new bytes in every
column header.
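
These offsets are easy to see for yourself. A small sketch with pyarrow (the file path is a placeholder) prints the absolute offsets recorded in each column chunk's metadata; shifting any earlier byte in the file changes these values and therefore the bytes of every subsequent column header:

```python
import pyarrow.parquet as pq

# Inspect the absolute offsets stored in each column chunk's metadata.
meta = pq.ParquetFile("example.parquet").metadata  # placeholder path

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        print(
            f"row group {rg}, column {chunk.path_in_schema}: "
            f"data_page_offset={chunk.data_page_offset}, "
            f"dictionary_page_offset={chunk.dictionary_page_offset}"
        )
```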

In this case, the new file is only 89% deduped, requiring 230MB of additional
storage.

## Deletion

Here we delete a row from the middle of the file (note: insertion should have
similar behavior). As this reorganizes the entire row group layout (each
row group is 1000 rows), we see that while we dedupe the first half of
the file, the remaining file has completely new blocks.

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/improve_parquet_dedupe/3_delete.png" alt="Visualization of dedupe from data deletion" width=30%>
@@ -101,18 +103,18 @@ aggressively:
</p>

However, the file sizes are nearly 2x larger if we store the data
uncompressed.

Is it possible to have the benefit of dedupe and compression at the same
time?

## Content Defined Row Groups

One potential solution is to use not only byte-level CDC, but also apply it at the row level:
we split row groups not based on absolute count (1000 rows), but on a hash of a provided
“Key” column. In other words, I split off a row group whenever the hash of
the key column % [target row count] = 0, with some allowances for a minimum
and a maximum row group size.

I hacked up a quick, inefficient experimental demonstration
[here](https://gist.github.com/ylow/db38522fb0ca69bdf1065237222b4d1c).
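The gist is the actual experiment; the snippet below is only a rough illustration of the splitting rule described above (the hash, thresholds, and key handling are placeholder assumptions):

```python
import hashlib

def row_group_splits(keys, target=1000, min_rows=500, max_rows=2000):
    """Return end-row indices for content-defined row groups.

    A row group ends when the hash of the current row's key is divisible by
    `target`, subject to minimum and maximum row-group sizes. Because the
    decision depends only on the key's content, inserting or deleting rows
    only perturbs the neighboring row groups instead of shifting every
    boundary after the edit.
    """
    splits, size = [], 0
    for i, key in enumerate(keys):
        size += 1
        digest = hashlib.sha256(str(key).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        if (size >= min_rows and h % target == 0) or size >= max_rows:
            splits.append(i + 1)
            size = 0
    return splits
```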
@@ -132,22 +134,21 @@ Based on these experiments, we could consider improving Parquet file
dedupe-ability in a couple of ways:

1. Use relative offsets instead of absolute offsets for file structure
data. This would make the Parquet structures position independent and
easy to “memcpy” around, although it is an involved file format change that
is probably difficult to do.
2. Support content defined chunking on row groups. The format actually
supports this today as it does not require row groups to be uniformly sized,
so this can be done with minimal blast radius. Only Parquet format writers
would have to be updated (see the sketch below).
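
As a sketch of how little the writer side would need to change, something like the following would already work with today's format (using pyarrow here as an assumed writer; the split points could come from a key-hash rule like the one above):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_content_defined_row_groups(table: pa.Table, path: str, splits):
    """Write `table` so that each content-defined slice becomes a row group.

    Parquet already permits row groups of different sizes, so only the writer
    needs to change; existing readers are unaffected. (Each slice maps to one
    row group as long as it stays below pyarrow's default row-group cap.)
    """
    with pq.ParquetWriter(path, table.schema) as writer:
        start = 0
        for end in list(splits) + [table.num_rows]:
            if end > start:
                writer.write_table(table.slice(start, end - start))
                start = end
```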

While we will continue exploring ways to improve Parquet storage performance
(e.g., perhaps we could optionally rewrite Parquet files before uploading?
Strip absolute file offsets on upload and restore on download?), we would
love to work with the Apache Arrow project to see if there is interest in
implementing some of these ideas in the Parquet / Arrow code base.

In the meantime, we are also exploring the behavior of our data dedupe process
on other common filetypes. Please do try our [dedupe
estimator](https://github.com/huggingface/dedupe_estimator) and tell us about
your findings!

3 changes: 3 additions & 0 deletions rearchitecting-uploads-and-downloads.md
@@ -3,8 +3,11 @@ title: "Rearchitecting Hugging Face Uploads and Downloads"
thumbnail: /blog/assets/rearchitecting-uploads-and-downloads/thumbnail.png
authors:
- user: port8080
  org: xet-team
- user: jsulz
  org: xet-team
- user: erinys
  org: xet-team
---

# Rearchitecting Hugging Face Uploads and Downloads
