Parquet Content-Defined Chunking
Reduce Parquet file upload and download times on the Hugging Face Hub by leveraging the new Xet storage layer and Apache Arrow’s Parquet Content-Defined Chunking (CDC) feature, enabling more efficient and scalable data workflows.
Apache Parquet is a columnar storage format that is widely used in the data engineering community.
Hugging Face has introduced a new storage layer called Xet that leverages content-defined chunking to efficiently deduplicate chunks of data, reducing storage costs and improving upload and download speeds. While Xet is format agnostic, Parquet's layout and its column-chunk (data page) based compression can produce entirely different byte-level representations for data with only minor changes, which leads to suboptimal deduplication. To address this, Parquet files should be written in a way that minimizes the byte-level differences between similar data, which is where content-defined chunking (CDC) comes into play.
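To make the idea of content-defined chunking more concrete, here is a minimal, standard-library-only sketch. It is not the algorithm used by Xet or by Parquet CDC: the gear-style rolling hash, the tiny chunk sizes, and every name in it (`cdc_chunks`, `reuse_ratio`, `MASK`, and so on) are assumptions chosen for illustration. It compares how many chunks can be reused after a small insertion when boundaries are placed at fixed offsets versus when they are derived from the content itself.

```python
import hashlib
import random

# Toy parameters (assumptions for illustration; real systems use far larger chunks).
FIXED_SIZE = 64      # fixed-size chunk length in bytes
MIN_SIZE = 32        # minimum content-defined chunk length
MAX_SIZE = 256       # maximum content-defined chunk length
MASK = 0xFC000000    # cut when the top 6 hash bits are zero (about 1 in 64 positions)

# Gear table: one pseudo-random 32-bit value per possible byte value.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data: bytes) -> list[bytes]:
    """Split data at fixed offsets, regardless of content."""
    return [data[i:i + FIXED_SIZE] for i in range(0, len(data), FIXED_SIZE)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data where a rolling hash of the most recent bytes matches a pattern."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear rolling hash: bytes older than 32 positions are shifted out.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reuse_ratio(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of new chunks whose digest already exists among the old chunks."""
    known = {hashlib.sha256(c).digest() for c in old}
    reused = sum(hashlib.sha256(c).digest() in known for c in new)
    return reused / len(new)


data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(20_000))
# Simulate a small edit near the start: insert five bytes at offset 100.
modified = original[:100] + b"HELLO" + original[100:]

fixed_ratio = reuse_ratio(fixed_chunks(original), fixed_chunks(modified))
cdc_ratio = reuse_ratio(cdc_chunks(original), cdc_chunks(modified))

print(f"fixed-size chunks reused: {fixed_ratio:.1%}")
print(f"content-defined chunks reused: {cdc_ratio:.1%}")
```

With fixed offsets, the insertion shifts every later boundary, so almost no chunk after the edit matches its earlier version. With content-defined boundaries, the chunker typically falls back into step with the old boundaries shortly after the edit, so most chunks can be deduplicated. The same reasoning motivates writing Parquet files so that similar data produces similar bytes.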
```python
import polars as pl
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

# download the dataset from the Hugging Face Hub into the local cache
path = hf_hub_download(
    repo_id="Open-Orca/OpenOrca",
    filename="3_5M-GPT3_5-Augmented.parquet",
    repo_type="dataset",
)

# read the cached parquet file into a PyArrow table
table = pq.read_table(
    path,
    schema=pa.schema(
        [
            pa.field("id", pa.string()),
            pa.field("system_prompt", pa.string()),
            pa.field("question", pa.large_string()),
            pa.field("response", pa.large_string()),
        ]
    ),
)

# data in other formats can be loaded with Polars instead, for example:
# df = pl.read_json(path)
# df = pl.read_csv(path)

# augment the table: insert a computed column before "question"
table_with_new_columns = table.add_column(
    table.schema.get_field_index("question"),
    "question_length",
    pc.utf8_length(table["question"]),
)

# change the type of the computed column
table_with_casted_columns = table_with_new_columns.set_column(
    table_with_new_columns.schema.get_field_index("question_length"),
    "question_length",
    table_with_new_columns["question_length"].cast("int32"),
)

# remove a column
table_with_removed_columns = table_with_new_columns.drop(["question_length"])

# append rows
table_with_appended_rows = pa.concat_tables([table, table])

# write the table to the Hugging Face Hub with content-defined chunking enabled
pq.write_table(table, "hf://user/repo/filename.parquet", use_content_defined_chunking=True)

# a Polars DataFrame has to be converted to an Arrow table before writing:
# pq.write_table(
#     df.to_arrow(), "hf://user/repo/filename.parquet", use_content_defined_chunking=True
# )
```