What happened:
Following an FSCK repair run from this library, a follow-up lookup of the table history via Spark fails with the following error:

```
Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `java.lang.String` from Array value (token `JsonToken.START_ARRAY`)
```
I fixed a similar bug with optimize runs early last year (#2317), so I believe the fix here is similar (I'm looking into it, but no luck so far). Specifically, I believe the repair run writes the `files_removed` metric out as an array rather than as a string.
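For anyone triaging, here's a quick way to confirm the shape on disk. This is just a sketch: it assumes the table path from the MRE below, and that the repair commit records its metrics under `commitInfo.operationMetrics` in the transaction log (the standard Delta log layout that Spark's `history()` reads):

```python
import json
from pathlib import Path

# Inspect the newest commit in the transaction log (path matches the MRE below)
log_dir = Path("./data/delta/_delta_log")
latest_commit = sorted(log_dir.glob("*.json"))[-1]

# Each commit file holds one JSON action per line; find the commitInfo action
for line in latest_commit.read_text().splitlines():
    entry = json.loads(line)
    if "commitInfo" in entry:
        metrics = entry["commitInfo"].get("operationMetrics", {})
        # Spark's history() expects every operationMetrics value to be a
        # string; after dt.repair() this appears to print a list instead
        print(type(metrics.get("files_removed")), metrics.get("files_removed"))
```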
What you expected to happen: Following a repair run, delta-rs should write out the `files_removed` metric in a way Spark understands.
How to reproduce it:
Here is an MRE:
```python
from deltalake import DeltaTable, write_deltalake
import pandas as pd
import time

# write some data into a delta table
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
write_deltalake("./data/delta", df, mode="append")

# second write so there's a file to delete manually
write_deltalake("./data/delta", df, mode="append")

# Load data from the delta table
dt = DeltaTable("./data/delta")

print("Take a moment to delete one of the parquet files written out")
time.sleep(10)

dt.repair()

### SPARK SECTION ###
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Initialize Spark session with Delta support
spark = SparkSession.builder \
    .appName("DeltaTableHistory") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.1,org.apache.hadoop:hadoop-aws:3.2.1") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

delta_table_path = "./data/delta"
delta_table = DeltaTable.forPath(spark, delta_table_path)

# Get the history of the Delta table, fails here
history_df = delta_table.history()
```
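For reference, this is the kind of shape difference I believe is at play. The exact field contents below are assumptions on my part for illustration, not copied from delta-rs output; the only thing the error message guarantees is that Spark expects a string where it finds an array:

```python
import json

# Spark's history() parses operationMetrics as Map[String, String], so
# every value must be a JSON string -- this shape parses fine
spark_compatible = {"files_removed": "1", "files_added": "0"}

# What the Jackson error suggests delta-rs writes for FSCK: an array value,
# which cannot be deserialized into a String field (file name is made up)
array_valued = {"files_removed": ["part-00000-example.parquet"], "files_added": []}

print(json.dumps(spark_compatible))
print(json.dumps(array_valued))
```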
More details:
From what I recall, delta-rs is technically following the protocol spec here. However, from an admittedly selfish perspective, this bug caused a breaking change in our Spark streaming jobs (which use table history under the hood), and I believe the best solution for now is to write this metric out in a way Spark understands.
Environment
Delta-rs version: 0.24.0
Binding: Python
Environment: Local, S3