Parquet Content-Defined Chunking

1. Parquet Content-Defined Chunking

Reduce Parquet file upload and download times on Hugging Face Hub by leveraging the new Xet storage layer and Apache Arrow’s Parquet Content-Defined Chunking (CDC) feature enabling more efficient and scalable data workflows.

Apache Parquet is a columnar storage format that is widely used in the data engineering community.

Hugging Face has introduced a new storage layer called Xet that leverages content-defined chunking to efficiently deduplicate chunks of data reducing storage costs and improving download/upload speeds. While Xet is format agnostic, Parquet's layout and column-chunk (data page) based compression can produce entirely different byte-level representations for data with minor changes, leading to suboptimal deduplication performance. To address this, the Parquet files should be written in a way that minimizes the byte-level differences between similar data, which is where content-defined chunking (CDC) comes into play.

1
import polars as pl
2
import pyarrow as pa
3
import pyarrow.compute as pc
4
import pyarrow.parquet as pq
5
from huggingface_hub import hf_hub_download
6
7
# download the dataset from Hugging Face Hub into local cache
8
path = hf_hub_download(
9
repo_id="Open-Orca/OpenOrca", filename="3_5M-GPT3_5-Augmented.parquet", repo_type="dataset"
10
)
11
path = ""
12
# read the cached parquet file into a PyArrow table
13
table = pq.read_table(
14
path,
15
schema=pa.schema(
16
[
17
pa.field("id", pa.string()),
18
pa.field("system_prompt", pa.string()),
19
pa.field("question", pa.large_string()),
20
pa.field("response", pa.large_string()),
21
]
22
),
23
)
24
table = pl.read_json(path)
25
table = pl.read_csv(path)
26
27
# augment the table
28
table_with_new_columns = table.add_column(
29
table.schema.get_field_index("question"), "question_length", pc.utf8_length(table["question"])
30
)
31
table_with_casted_columns = table.set_column(
32
table.schema.get_field_index("question"), "question_length", table["question"].cast("int32")
33
)
34
35
table_with_removed_columns = table.drop(["question_length", "response_length"])
36
37
table_with_appended_rows = pa.concat_tables([table, table])
38
39
40
# write the table to the Hugging Face Hub
41
pq.write_table(table, "hf://user/repo/filename.parquet", use_content_defined_chunking=True)
42
pq.write_table(
43
table.to_arrow(), "hf://user/repo/filename.parquet", use_content_defined_chunking=True
44
)
45

References

  1. Parquet Content-Defined Chunking