Commit

adding xet-team org to other xet posts (#2678)
jsulz authored Feb 14, 2025
1 parent 766a3ee commit ff3d728
Showing 3 changed files with 26 additions and 20 deletions.
2 changes: 2 additions & 0 deletions from-files-to-chunks.md
@@ -3,7 +3,9 @@ title: "From Files to Chunks: Improving HF Storage Efficiency"
thumbnail: /blog/assets/from-files-to-chunks/thumbnail.png
authors:
- user: jsulz
  org: xet-team
- user: erinys
  org: xet-team
---

# From Files to Chunks: Improving HF Storage Efficiency
41 changes: 21 additions & 20 deletions improve_parquet_dedupe.md
@@ -3,7 +3,9 @@ title: "Improving Parquet Dedupe on Hugging Face Hub"
thumbnail: /blog/assets/improve_parquet_dedupe/thumbnail.png
authors:
- user: yuchenglow
  org: xet-team
- user: seanses
  org: xet-team
---

# Improving Parquet Dedupe on Hugging Face Hub
@@ -12,18 +14,18 @@ The Xet team at Hugging Face is working on improving the efficiency of the Hub's
storage architecture to make it easier and quicker for users to
store and update data and models. As Hugging Face hosts nearly 11PB of datasets
with Parquet files alone accounting for over 2.2PB of that storage,
optimizing Parquet storage is a high priority.

Most Parquet files are bulk exports from various data analysis pipelines
or databases, often appearing as full snapshots rather than incremental
updates. Data deduplication becomes critical for efficiency when users want to
update their datasets on a regular basis. Only by deduplicating can we store
all versions as compactly as possible, without requiring everything to be uploaded
again on every update. In an ideal case, we should be able to store every version
of a growing dataset with only a little more space than the size of its largest version.

Our default storage algorithm uses byte-level [Content-Defined Chunking (CDC)](https://joshleeb.com/posts/content-defined-chunking.html),
which generally dedupes well over insertions and deletions, but the Parquet layout brings some challenges.
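
For intuition, here is a minimal, illustrative sketch of byte-level CDC (not our production implementation; the constants and hash are placeholders): a rolling hash over the byte stream declares a chunk boundary whenever it hits a mask, so boundaries follow the content rather than absolute offsets, and an insertion or deletion only disturbs the chunks around the edit.

```python
# Toy content-defined chunker (illustrative only; real CDC implementations
# use gear/buzhash-style rolling hashes over a fixed window).
MASK = (1 << 16) - 1      # boundary when hash & MASK == 0 (~64 KB average chunks)
MIN_CHUNK = 8 * 1024
MAX_CHUNK = 256 * 1024

def chunk_boundaries(data: bytes):
    """Yield (start, end) byte ranges of content-defined chunks."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF   # cheap rolling-style hash
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

# Deduplication then stores each chunk once, keyed by its content hash:
# two file versions that share bytes also share most of their chunks.
```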
Here we run some experiments to see how some simple modifications behave on
Parquet files, using a 2GB Parquet file with 1,092,000 rows from the
[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/main/data/CC-MAIN-2013-20)
@@ -76,17 +78,17 @@ that absolute file offsets are part of the Parquet column headers (see the
structures ColumnChunk and ColumnMetaData)! This means that any
modification is likely to rewrite all the Column headers. So while the
data does dedupe well (it is mostly green), we get new bytes in every
column header.
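
These offsets are easy to see for yourself. A small sketch with pyarrow (the file path is a placeholder) prints the absolute offsets recorded in each column chunk's metadata; shifting any earlier byte in the file changes these values and therefore the bytes of every subsequent column header:

```python
import pyarrow.parquet as pq

# Inspect the absolute offsets stored in each column chunk's metadata.
meta = pq.ParquetFile("example.parquet").metadata  # placeholder path

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        print(
            f"row group {rg}, column {chunk.path_in_schema}: "
            f"data_page_offset={chunk.data_page_offset}, "
            f"dictionary_page_offset={chunk.dictionary_page_offset}"
        )
```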

In this case, the new file is only 89% deduped, requiring 230MB of additional
storage.

## Deletion

Here we delete a row from the middle of the file (note: insertion should have
similar behavior). As this reorganizes the entire row group layout (each
row group is 1000 rows), we see that while we dedupe the first half of
the file, the remaining file has completely new blocks.

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/improve_parquet_dedupe/3_delete.png" alt="Visualization of dedupe from data deletion" width=30%>
@@ -101,18 +103,18 @@ aggressively:
</p>

However, the file sizes are nearly 2x larger if we store the data
uncompressed.

Is it possible to have the benefit of dedupe and compression at the same
time?

## Content Defined Row Groups

One potential solution is to use not only byte-level CDC, but also apply it at the row level:
we split row groups not based on absolute count (1000 rows), but on a hash of a provided
“Key” column. In other words, I split off a row group whenever the hash of
the key column % [target row count] = 0, with some allowances for a minimum
and a maximum row group size.

I hacked up a quick, inefficient experimental demonstration
[here](https://gist.github.com/ylow/db38522fb0ca69bdf1065237222b4d1c).
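The gist is the actual experiment; the snippet below is only a rough illustration of the splitting rule described above (the hash, thresholds, and key handling are placeholder assumptions):

```python
import hashlib

def row_group_splits(keys, target=1000, min_rows=500, max_rows=2000):
    """Return end-row indices for content-defined row groups.

    A row group ends when the hash of the current row's key is divisible by
    `target`, subject to minimum and maximum row-group sizes. Because the
    decision depends only on the key's content, inserting or deleting rows
    only perturbs the neighboring row groups instead of shifting every
    boundary after the edit.
    """
    splits, size = [], 0
    for i, key in enumerate(keys):
        size += 1
        digest = hashlib.sha256(str(key).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        if (size >= min_rows and h % target == 0) or size >= max_rows:
            splits.append(i + 1)
            size = 0
    return splits
```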
@@ -132,22 +134,21 @@ Based on these experiments, we could consider improving Parquet file
dedupe-ability in a couple of ways:

1. Use relative offsets instead of absolute offsets for file structure
data. This would make the Parquet structures position independent and
easy to “memcpy” around, although it is an involved file format change that
is probably difficult to do.
2. Support content defined chunking on row groups. The format actually
supports this today as it does not require row groups to be uniformly sized,
so this can be done with minimal blast radius. Only Parquet format writers
would have to be updated (see the sketch below).
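
As a sketch of how little the writer side would need to change, something like the following would already work with today's format (using pyarrow here as an assumed writer; the split points could come from a key-hash rule like the one above):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_content_defined_row_groups(table: pa.Table, path: str, splits):
    """Write `table` so that each content-defined slice becomes a row group.

    Parquet already permits row groups of different sizes, so only the writer
    needs to change; existing readers are unaffected. (Each slice maps to one
    row group as long as it stays below pyarrow's default row-group cap.)
    """
    with pq.ParquetWriter(path, table.schema) as writer:
        start = 0
        for end in list(splits) + [table.num_rows]:
            if end > start:
                writer.write_table(table.slice(start, end - start))
                start = end
```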

While we will continue exploring ways to improve Parquet storage performance
(e.g., perhaps we could optionally rewrite Parquet files before uploading?
Strip absolute file offsets on upload and restore on download?), we would
love to work with the Apache Arrow project to see if there is interest in
implementing some of these ideas in the Parquet / Arrow code base.

In the meantime, we are also exploring the behavior of our data dedupe process
on other common filetypes. Please do try our [dedupe
estimator](https://github.com/huggingface/dedupe_estimator) and tell us about
your findings!

3 changes: 3 additions & 0 deletions rearchitecting-uploads-and-downloads.md
@@ -3,8 +3,11 @@ title: "Rearchitecting Hugging Face Uploads and Downloads"
thumbnail: /blog/assets/rearchitecting-uploads-and-downloads/thumbnail.png
authors:
- user: port8080
  org: xet-team
- user: jsulz
  org: xet-team
- user: erinys
  org: xet-team
---

# Rearchitecting Hugging Face Uploads and Downloads
