
Delta Lake 3.2.1 RC1

Pre-release
@vkorukanti vkorukanti released this 04 Sep 16:48
· 523 commits to master since this release

We are excited to announce the release of Delta Lake 3.2.1 RC1! This release contains important bug fixes to 3.2.0, and users are encouraged to update to 3.2.1. Instructions for using this release candidate are at the end of these notes. To give feedback on this release candidate, please post in the Delta Users Slack or create issues in the Delta repository.

Details for each component follow.

Delta Spark

Delta Spark 3.2.1 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key changes of this release are:

  • Support for Apache Spark 3.5.2.
  • Support QueryExecutionListener for MERGE queries submitted through Scala API.
  • #3474
  • Support RESTORE on a Delta table with clustering enabled.
  • Fix replacing a clustered table with a non-clustered table.
  • Fix an issue when running clustering on a table with a single column selected as the clustering column.

Delta Universal Format (UniForm)

Documentation: https://docs.delta.io/3.2.1/delta-uniform.html
RC1 artifacts: delta-iceberg_2.12, delta-iceberg_2.13, delta-hudi_2.12, delta-hudi_2.13

The key changes of this release are:

  • The UniForm Iceberg conversion transaction no longer converts commits that contain only AddFile actions without data change.

Delta Sharing Spark

The key changes of this release are:

  • Upgrade delta-sharing-client to version 1.1.1, which removes the pre-signed URL from the error message on access errors.
  • Fix an issue with DeltaSharingLogFileStatus.

Delta Kernel

The key changes of this release are:

  • Fix comparison issues with string values having characters with surrogate pairs. This fixes a corner case when comparing characters (e.g. emojis) that have surrogate pairs in UTF-16 representation.
  • Fix ClassNotFoundException issue when loading LogStores in Kernel default Engine module. This issue happens in some environments where the thread local class loader is not set.
  • Fix error when querying tables with spaces in the path name. Now you can query tables with paths having any valid path characters.
  • Fix an issue with writing decimals as binary to Parquet files for certain combinations of scale and precision.
  • Throw a proper exception when the unsupported VOID data type is encountered while reading Delta tables.
  • Handle long type values in field metadata of columns in schema. Earlier, Kernel threw a parsing exception; now Kernel handles long types.
  • Fix an issue where Kernel retries multiple times when the _last_checkpoint file is not found. Now Kernel tries just once when a file-not-found exception is thrown.
  • Support reading Parquet files with legacy map type physical formats. Earlier, Kernel threw errors; now Kernel can read data from files containing legacy map physical formats.
  • Support reading Parquet files with legacy 3-level repeated type physical formats.
  • Write timestamp data to Parquet file as INT64 physical format instead of INT96 physical format. INT96 is a legacy physical format that is deprecated.
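
To illustrate the surrogate-pair corner case from the first item in the list above: ordering strings by UTF-16 code units can disagree with ordering by Unicode code points (equivalently, by UTF-8 bytes) once a character outside the Basic Multilingual Plane is involved. The following is a minimal Python sketch of that discrepancy only. It is not Kernel's implementation, and the helper names `utf16_key` and `utf8_key` are hypothetical.

```python
def utf16_key(s: str) -> bytes:
    """Sort key that orders strings by their UTF-16 code units (big-endian)."""
    return s.encode("utf-16-be")


def utf8_key(s: str) -> bytes:
    """Sort key that orders strings by UTF-8 bytes, which matches code point order."""
    return s.encode("utf-8")


fullwidth_tilde = "\uFF5E"   # U+FF5E, a single UTF-16 code unit
emoji = "\U0001F600"         # U+1F600, stored in UTF-16 as the surrogate pair D83D DE00

# By code point (and by UTF-8 bytes), U+FF5E sorts before U+1F600.
by_code_point = fullwidth_tilde < emoji
by_utf8 = utf8_key(fullwidth_tilde) < utf8_key(emoji)

# By UTF-16 code units the order flips, because the high surrogate
# 0xD83D is numerically smaller than 0xFF5E.
by_utf16 = utf16_key(fullwidth_tilde) < utf16_key(emoji)

print(by_code_point, by_utf8, by_utf16)
```

A comparison routine that walks UTF-16 code units one at a time therefore needs to account for surrogate pairs if it is expected to agree with code point order.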

For more information, refer to:

  • User guide on step-by-step process of using Kernel in a standalone Java program or in a distributed processing connector.
  • Slides explaining the rationale behind Kernel and the API design.
  • Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
  • Table and default Engine API Java documentation

Delta Standalone (deprecated in favor of Delta Kernel)

  1. API documentation: https://docs.delta.io/3.2.1/delta-standalone.html
  2. RC1 artifacts: delta-standalone_2.12, delta-standalone_2.13

No update to Standalone in this release. Standalone is being deprecated in favor of Delta Kernel, which supports advanced features in Delta tables.

Delta Storage

The key changes of this release are:

  • Fix an issue with VACUUM when using the S3DynamoDBLogStore, where the LogStore made unnecessary listFrom calls to DynamoDB, causing a ProvisionedThroughputExceededException.

How to use this Release Candidate [RC only]

Download Spark 3.5 from https://spark.apache.org/downloads.html.

For this release candidate, we have published the artifacts to a staging repository. Here’s how you can use them:

Spark Submit

Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1166 to the command line arguments.
Example:

spark-submit --packages io.delta:delta-spark_2.12:3.2.1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1166 examples/examples.py

Currently Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 3.2.1 by just providing the --packages io.delta:delta-spark_2.12:3.2.1 argument.
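
For scripted test runs, the spark-submit example above can also be assembled programmatically. The following Python sketch builds the same argument list for either Scala version. The helper names `delta_spark_coordinate` and `spark_submit_command` are hypothetical and are not part of Delta Lake or Spark; only the Maven coordinate pattern and the staging repository URL come from these notes.

```python
import shlex
from typing import List

# Staging repository for this release candidate, as given in these notes.
STAGING_REPO = "https://oss.sonatype.org/content/repositories/iodelta-1166"


def delta_spark_coordinate(scala_version: str, delta_version: str = "3.2.1") -> str:
    """Build the Maven coordinate for delta-spark (hypothetical helper)."""
    if scala_version not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"io.delta:delta-spark_{scala_version}:{delta_version}"


def spark_submit_command(app: str, scala_version: str = "2.12") -> List[str]:
    """Return the spark-submit argument list for running `app` against the RC."""
    return [
        "spark-submit",
        "--packages", delta_spark_coordinate(scala_version),
        "--repositories", STAGING_REPO,
        app,
    ]


cmd = spark_submit_command("examples/examples.py")
# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting mistakes if the command is later passed to `subprocess.run`.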

Spark Shell

bin/spark-shell --packages io.delta:delta-spark_2.12:3.2.1 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1166 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Spark SQL

bin/spark-sql --packages io.delta:delta-spark_2.12:3.2.1 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1166 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Maven

<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1166</url>
  </repository>
</repositories>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-spark_2.12</artifactId>
  <version>3.2.1</version>
</dependency>

SBT Project

libraryDependencies += "io.delta" %% "delta-spark" % "3.2.1"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1166"

(PySpark) Delta-Spark

  • Download two artifacts from pre-release: https://github.com/delta-io/delta/releases/tag/v3.2.1rc1
  • Artifacts to download are:
    • delta-spark-3.2.1.tar.gz
    • delta_spark-3.2.1-py3-none-any.whl
  • Keep them in one directory. Let’s call that ~/Downloads
  • pip install ~/Downloads/delta_spark-3.2.1-py3-none-any.whl
  • pip show delta-spark should show output similar to the below
Name: delta-spark
Version: 3.2.1
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: [email protected]
License: Apache-2.0
Location: /home/<user.name>/.conda/envs/delta-release/lib/python3.8/site-packages
Requires: importlib-metadata, pyspark
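
The same check can be done from Python instead of reading the `pip show` output by eye. This is a small sketch using the standard-library `importlib.metadata` module (Python 3.8+); the `installed_version` helper and the `EXPECTED_VERSION` constant are illustrative names, not part of the delta-spark package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

EXPECTED_VERSION = "3.2.1"  # version shown in the `pip show delta-spark` output above


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("delta-spark")
if found is None:
    print("delta-spark is not installed in this environment")
elif found == EXPECTED_VERSION:
    print(f"delta-spark {found} is installed")
else:
    print(f"delta-spark {found} is installed, expected {EXPECTED_VERSION}")
```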

Credits

Abhishek Radhakrishnan, Allison Portis, Charlene Lyu, Fred Storage Liu, Jiaheng Tang, Johan Lasperas, Lin Zhou, Marko Ilić, Scott Sandre, Tathagata Das, Tom van Bussel, Venki Korukanti, Wenchen Fan, Zihao Xu