[VL] Parquet writer rowgroup number is bigger than Spark using the same parameter #11534

@FelixYBW

Description

With default parameters, the Gluten Parquet writer creates many more row groups than Spark, which leads to much worse read performance than with Spark-generated files.

df.repartition(tbl_filenum[tbl]['part_cnt'], tbl_filenum[tbl]['part']) \
    .write.mode("overwrite") \
    .format("parquet") \
    .option("compression", "zstd") \
    .partitionBy(tbl_filenum[tbl]['part']) \
    .option('fs.s3a.committer.name', 'magic') \
    .save(f"s3a://presto-workload/{databasename}/{tbl}")
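If the extra row groups come from a smaller writer-side size target, one thing worth trying is pinning the row-group size explicitly. A minimal sketch, assuming `parquet.block.size` (the standard Parquet-MR knob, 128 MiB by default in Spark) is honored by the writer in use; whether Gluten's native writer respects the same knob is exactly what this issue questions:

```python
# Sketch: pass an explicit row-group size target to the Parquet writer.
# Assumption: parquet.block.size (in bytes) is honored by the active writer;
# Spark's built-in Parquet-MR writer defaults to 128 MiB.
ROW_GROUP_BYTES = 128 * 1024 * 1024

writer_options = {
    "compression": "zstd",
    "parquet.block.size": str(ROW_GROUP_BYTES),
}

# Usage (on an existing DataFrame `df` and output `path`):
# df.write.mode("overwrite").format("parquet").options(**writer_options).save(path)
print(writer_options["parquet.block.size"])
```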

The table below shows one stage of TPC-DS q9 when reading the Spark-generated and the Gluten-generated store_sales table:

Metric                   Spark generated   Gluten generated   Difference
Time of scan and filter  18.8 min          1.31 hours         4.2x slower
I/O wait time            2.09 hours        10.09 hours        4.8x slower
Time of scan             1.85 hours        8.84 hours         4.8x slower
Page load time           14.0 min          1.25 hours         5.4x slower
Data source read time    4.8 min           21.6 min           4.5x slower
Storage read bytes       31.5 GiB          34.7 GiB           10% more data read
Size of files read       240.7 GiB         256.6 GiB          6.6% larger
Peak memory              16.0 GiB          14.1 GiB           12% less memory used
Row groups processed     2,666             8,727              3.3x more row groups
Memory allocations       17,584,444        14,466,793         18% fewer allocations
Preloaded splits         1,585             1,710              8% more preloaded
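A back-of-envelope check on the numbers above makes the size gap concrete. This is a sketch, assuming "Size of files read" divided by "Row groups processed" approximates the average row-group size each reader saw:

```python
# Average row-group size implied by the table above.
MIB_PER_GIB = 1024

spark_files_gib, spark_row_groups = 240.7, 2666
gluten_files_gib, gluten_row_groups = 256.6, 8727

spark_avg_mib = spark_files_gib * MIB_PER_GIB / spark_row_groups
gluten_avg_mib = gluten_files_gib * MIB_PER_GIB / gluten_row_groups

print(f"Spark  avg row group: {spark_avg_mib:.1f} MiB")   # roughly 92 MiB
print(f"Gluten avg row group: {gluten_avg_mib:.1f} MiB")  # roughly 30 MiB
```

So Gluten is writing row groups roughly a third the size of Spark's, which matches the 3.3x row-group count in the table and explains the extra per-row-group overhead on the read side.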

Gluten version

None
