The Apache Spark Connector for Lance allows Apache Spark to efficiently read datasets stored in Lance format.
Lance is a modern columnar data format optimized for machine learning workflows and datasets. It supports distributed, parallel scans and optimizations such as column and filter pushdown to improve performance. Lance also provides high-performance random access, reported to be 100 times faster than Parquet, without sacrificing scan performance.
By using the Apache Spark Connector for Lance, you can leverage Apache Spark's powerful data processing, SQL querying, and machine learning training capabilities on the AI data lake powered by Lance.
The connector is built using the Spark DatasourceV2 (DSv2) API. Please check this presentation to learn more about DSv2 features. Specifically, you can use the Apache Spark Connector for Lance to:
- Query Lance Datasets: Seamlessly query datasets stored in the Lance format using Spark.
- Distributed, Parallel Scans: Leverage Spark's distributed computing capabilities to perform parallel scans on Lance datasets.
- Column and Filter Pushdown: Optimize query performance by pushing down column selections and filters to the data source.
Java: 8, 11, 17
Scala: 2.12
Spark: 3.4, 3.5
Operating System: Linux x86, macOS
Jars with full dependencies are uploaded to the public S3 bucket spark-lance-artifacts, with the name pattern lance-spark-{spark-version}-{scala-version}-{connector-version}-jar-with-dependencies.jar.
For example, to get the Spark 3.5, Scala 2.12 jar for connector version 0.1.0, run:
wget https://spark-lance-artifacts.s3.amazonaws.com/lance-spark-3.5-2.12-0.1.0-jar-with-dependencies.jar
Launch spark-shell with the JAR that matches your Spark and Scala versions:
spark-shell --jars lance-spark-3.5-2.12-0.1.0-jar-with-dependencies.jar
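The JAR file name in the commands above follows the documented name pattern, so it can be assembled from the three version numbers. The following is a minimal shell sketch of that substitution. The variable names (SPARK_VERSION, SCALA_VERSION, CONNECTOR_VERSION, JAR_NAME, JAR_URL) are hypothetical and not part of the connector; the bucket URL is taken from the wget example.

```shell
# Hypothetical variable names; set the values to match your environment.
SPARK_VERSION="3.5"
SCALA_VERSION="2.12"
CONNECTOR_VERSION="0.1.0"

# Fill in the documented name pattern:
# lance-spark-{spark-version}-{scala-version}-{connector-version}-jar-with-dependencies.jar
JAR_NAME="lance-spark-${SPARK_VERSION}-${SCALA_VERSION}-${CONNECTOR_VERSION}-jar-with-dependencies.jar"
JAR_URL="https://spark-lance-artifacts.s3.amazonaws.com/${JAR_NAME}"

# Print the URL so it can be checked before downloading.
echo "${JAR_URL}"
```

With these variables set, "${JAR_URL}" can be passed to wget and "${JAR_NAME}" to spark-shell --jars, which keeps the two commands in sync when a version changes.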
Example Usage:
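import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Start a local Spark session.
SparkSession spark = SparkSession.builder()
    .appName("spark-lance-connector-test")
    .master("local")
    .getOrCreate();

// Load a Lance dataset through the connector.
Dataset<Row> data = spark.read()
    .format("lance")
    .option("db", "/path/to/example_db")
    .option("dataset", "lance_example_dataset")
    .load();

// Print up to 100 rows.
data.show(100);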
More examples can be found in SparkLanceConnectorReadTest.
This package depends on the Lance Java SDK and the Lance Catalog Java Modules.
You need to build those repositories locally before building this repository.
If your changes affect those repositories, the PR in lancedb/lance-spark will only pass CI after the PRs in lancedb/lance and lance/lance-catalog are merged.
This connector is built using Maven. To build everything:
./mvnw clean install
To build everything without running tests:
./mvnw clean install -DskipTests
The following build profiles let you switch between Scala and Spark versions:
- scala-2.12
- scala-2.13
- spark-3.4
- spark-3.5
For example, to use Scala 2.13:
./mvnw clean install -Pscala-2.13
To build a specific version like Spark 3.4:
./mvnw clean install -Pspark-3.4
To build only the Spark 3.4 module and the modules it depends on:
./mvnw clean install -Pspark-3.4 -pl lance-spark-3.4 -am
Use the shade-jar profile to create the jar with all dependencies for Spark 3.4:
./mvnw clean install -Pspark-3.4 -Pshade-jar -pl lance-spark-3.4 -am
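The per-version build commands above share one shape: a spark-X profile, an optional shade-jar profile, and a matching lance-spark-X module. The sketch below shows that composition as a small shell function. The helper name build_cmd and its arguments are hypothetical and are not part of this repository. The function only prints the command and does not run it. For versions other than 3.4, it assumes the module name follows the same lance-spark-{spark-version} pattern.

```shell
# Hypothetical helper: prints (does not run) the Maven command for one Spark version.
build_cmd() {
  spark_version="$1"   # e.g. 3.4
  shade="${2:-no}"     # pass "shade" to add the shade-jar profile

  profiles="-Pspark-${spark_version}"
  if [ "${shade}" = "shade" ]; then
    profiles="${profiles} -Pshade-jar"
  fi

  # -pl selects the module; -am also builds the modules it depends on.
  echo "./mvnw clean install ${profiles} -pl lance-spark-${spark_version} -am"
}

# Print the shaded-jar build command for Spark 3.4.
build_cmd 3.4 shade
```

Printing the command rather than executing it makes it easy to review the selected profiles before starting a full build.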
We use checkstyle and spotless to lint the code.
To verify checkstyle:
./mvnw checkstyle:check
To verify spotless:
./mvnw spotless:check
To apply spotless changes:
./mvnw spotless:apply