ICEBERG TABLE - MultiStorage #5953
              
Unanswered · nikhilindikuzha asked this question in Q&A · Replies: 0
  
We have created an Iceberg table that initially contains four records. After a few days, due to retention policies, we need to move the first two records (i.e., the corresponding Parquet data files) to another storage layer (cold storage) in Azure Blob Storage. Once the data files are moved, the Iceberg metadata is updated accordingly. As a result, the table now references data files located in two different storage layers.
When executing a SELECT * FROM table, Spark should be able to access and query both storage layers seamlessly.
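For context, Iceberg manifests record absolute file URIs, which is why a single table can legitimately reference data files in more than one storage location. One way to confirm which layer each data file lives in is Iceberg's `files` metadata table; a minimal sketch, assuming a running Spark session with an Iceberg catalog, where `cat.db.tbl` is a placeholder for the actual catalog/table name:

```python
# Inspect the absolute URIs the Iceberg table currently references.
# `cat.db.tbl` is a hypothetical catalog/namespace/table name.
spark.sql("""
    SELECT file_path, record_count
    FROM cat.db.tbl.files
""").show(truncate=False)
```

The `file_path` column shows the full URI of each Parquet file, so entries rewritten to point at cold storage are visible directly.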
To facilitate this, we have set up two S3 proxy endpoints, each corresponding to a different storage system. However, in Spark, we can only set a single S3 endpoint per session using:
spark.conf.set("fs.s3a.endpoint", "")
spark.conf.set("fs.s3a.access.key", "")
spark.conf.set("fs.s3a.secret.key", "")
spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
This limitation prevents us from querying both storage locations within the same Spark session.
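One thing worth noting: the single-endpoint limit applies only to the global `fs.s3a.*` keys. Hadoop's S3A connector also supports per-bucket configuration, where `fs.s3a.bucket.<bucket>.*` keys override the global values for that bucket, so two proxy endpoints can coexist in one session if the hot and cold layers are reachable under distinct bucket names. A sketch, assuming that setup; `hot-bucket`, `cold-bucket`, the endpoint URLs, and the key placeholders are all hypothetical:

```python
from pyspark.sql import SparkSession

# Per-bucket S3A overrides: fs.s3a.bucket.<name>.* takes precedence over
# the global fs.s3a.* keys for paths in that bucket. All names below are
# placeholders for the actual buckets, endpoints, and credentials.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.bucket.hot-bucket.endpoint", "https://hot-proxy.example.com")
    .config("spark.hadoop.fs.s3a.bucket.hot-bucket.access.key", "<hot-access-key>")
    .config("spark.hadoop.fs.s3a.bucket.hot-bucket.secret.key", "<hot-secret-key>")
    .config("spark.hadoop.fs.s3a.bucket.cold-bucket.endpoint", "https://cold-proxy.example.com")
    .config("spark.hadoop.fs.s3a.bucket.cold-bucket.access.key", "<cold-access-key>")
    .config("spark.hadoop.fs.s3a.bucket.cold-bucket.secret.key", "<cold-secret-key>")
    .getOrCreate()
)
```

With this in place, `s3a://hot-bucket/...` and `s3a://cold-bucket/...` paths resolve through different proxies within the same session, which may be enough for Iceberg metadata that references files in both locations.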
Question:
How can we configure Spark to support multiple S3 proxy endpoints simultaneously, allowing seamless querying of data from both storage layers?