Closed
Labels: enhancement (New feature or request)
Description
With default parameters, the Gluten Parquet writer creates more row groups than Spark does, which leads to much worse read performance than with Spark-written files.
```python
df.repartition(tbl_filenum[tbl]['part_cnt'], tbl_filenum[tbl]['part']) \
    .write.mode("overwrite") \
    .format("parquet") \
    .option("compression", "zstd") \
    .partitionBy(tbl_filenum[tbl]['part']) \
    .option('fs.s3a.committer.name', 'magic') \
    .save(f"s3a://presto-workload/{databasename}/{tbl}")
```
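For vanilla Spark, the target row-group size is the parquet-mr setting `parquet.block.size` (128 MiB by default), which can be set via the Hadoop configuration or as a per-write option. A minimal sketch of raising it to get fewer, larger row groups is below; whether Gluten's native writer honors this same key is an assumption here, not something this issue confirms.

```python
# Sketch: raise the target row-group size so the Parquet writer emits
# fewer, larger row groups. `parquet.block.size` is the parquet-mr
# setting honored by vanilla Spark's writer; whether Gluten's native
# writer reads the same key is an ASSUMPTION, not confirmed here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Propagated to the Hadoop conf used by the Parquet writer.
    .config("spark.hadoop.parquet.block.size", str(256 * 1024 * 1024))  # 256 MiB
    .getOrCreate()
)

df = spark.range(1_000_000)
# The option can also be passed per write; Spark forwards Parquet data
# source options to the underlying Hadoop configuration.
df.write.mode("overwrite") \
    .option("parquet.block.size", str(256 * 1024 * 1024)) \
    .parquet("/tmp/demo_rowgroups")
```

Larger row groups amortize per-group footer and page-index overhead during scans, at the cost of higher writer memory.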
The table below shows one stage of TPC-DS q9 when reading the Spark-generated versus the Gluten-generated store_sales table:
| Metric | Spark-generated | Gluten-generated | Difference |
|---|---|---|---|
| Time of scan and filter | 18.8 min | 1.31 hours | 4.2x slower |
| I/O wait time | 2.09 hours | 10.09 hours | 4.8x slower |
| Time of scan | 1.85 hours | 8.84 hours | 4.8x slower |
| Page load time | 14.0 min | 1.25 hours | 5.4x slower |
| Data source read time | 4.8 min | 21.6 min | 4.5x slower |
| Storage read bytes | 31.5 GiB | 34.7 GiB | 10% more data read |
| Size of files read | 240.7 GiB | 256.6 GiB | 6.6% larger |
| Peak memory | 16.0 GiB | 14.1 GiB | 12% less memory used |
| Row groups processed | 2,666 | 8,727 | 3.3x more row groups |
| Memory allocations | 17,584,444 | 14,466,793 | 18% fewer allocations |
| Preloaded splits | 1,585 | 1,710 | 8% more preloaded |
Gluten version: None