Is this a new bug in dbt-spark?
I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
The dbt seed command fails with the error java.nio.file.FileSystemException: Old entries for table s3://<bucket_name>/<table_path> still exist in the external log store
Expected Behavior
dbt seed successfully creates the table from the CSV seed file
Steps To Reproduce
dbt 1.8.5 + dbt-spark 1.8.0
seeds:
  +file_format: 'delta'
Apache Spark configured with the Delta Lake session extension and the S3 DynamoDB log store:
spark.delta.logStore.s3.impl=io.delta.storage.S3DynamoDBLogStore
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName=<your dynamo db table name>
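For reference, a minimal sketch (PySpark assumed) of how these properties could be applied when building a Delta-enabled session; in this setup they are actually set on the Spark Thrift Server that dbt-spark connects to, and the DynamoDB table name is a placeholder:

from pyspark.sql import SparkSession

# Sketch only: mirrors the properties listed above.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.delta.logStore.s3.impl", "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "<your dynamo db table name>")
    .getOrCreate()
)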
Relevant log output
[2024-08-20T15:50:46.627+0000] {logging_mixin.py:188} INFO - 15:50:46 Running with dbt=1.8.5
[2024-08-20T15:50:46.804+0000] {logging_mixin.py:188} INFO - 15:50:46 Registered adapter: spark=1.8.0
[2024-08-20T15:50:46.845+0000] {logging_mixin.py:188} INFO - 15:50:46 Unable to do partial parsing because saved manifest not found. Starting full parse.
[2024-08-20T15:50:49.738+0000] {logging_mixin.py:188} INFO - 15:50:49 [WARNING]: Deprecated functionality
The `tests` config has been renamed to `data_tests`. Please see
https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more
information.
[2024-08-20T15:50:50.267+0000] {logging_mixin.py:188} INFO - 15:50:50 Found 6 models, 42 data tests, 1 seed, 36 sources, 733 macros, 25 unit tests
[2024-08-20T15:50:50.275+0000] {logging_mixin.py:188} INFO - 15:50:50
[2024-08-20T15:52:03.105+0000] {logging_mixin.py:188} INFO - 15:52:03 Concurrency: 4 threads (target='dev')
[2024-08-20T15:52:03.106+0000] {logging_mixin.py:188} INFO - 15:52:03
[2024-08-20T15:52:03.110+0000] {logging_mixin.py:188} INFO - 15:52:03 1 of 1 START seed file <database>.<table> ........... [RUN]
[2024-08-20T15:52:11.569+0000] {logging_mixin.py:188} INFO - 15:52:11 1 of 1 ERROR loading seed file <database>.<table> ... [ERROR in 8.45s]
[2024-08-20T15:52:11.947+0000] {logging_mixin.py:188} INFO - 15:52:11
[2024-08-20T15:52:11.948+0000] {logging_mixin.py:188} INFO - 15:52:11 Finished running 1 seed in 0 hours 1 minutes and 21.67 seconds (81.67s).
[2024-08-20T15:52:12.023+0000] {logging_mixin.py:188} INFO - 15:52:12
[2024-08-20T15:52:12.024+0000] {logging_mixin.py:188} INFO - 15:52:12 Completed with 1 error and 0 warnings:
[2024-08-20T15:52:12.025+0000] {logging_mixin.py:188} INFO - 15:52:12
[2024-08-20T15:52:12.027+0000] {logging_mixin.py:188} INFO - 15:52:12 Runtime Error in seed <table> (seeds/<table>.csv)
Database Error
org.apache.hive.service.cli.HiveSQLException: Error running query: java.nio.file.FileSystemException: Old entries for table <s3_path> still exist in the external log store
at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.nio.file.FileSystemException: Old entries for table <s3_path> still exist in the external log store
at io.delta.storage.BaseExternalLogStore.write(BaseExternalLogStore.java:222)
at org.apache.spark.sql.delta.storage.LogStoreAdaptor.write(LogStore.scala:444)
at org.apache.spark.sql.delta.storage.DelegatingLogStore.write(DelegatingLogStore.scala:119)
at org.apache.spark.sql.delta.OptimisticTransactionImpl.writeCommitFile(OptimisticTransaction.scala:1806)
at org.apache.spark.sql.delta.OptimisticTransactionImpl.writeCommitFile$(OptimisticTransaction.scala:1798)
at org.apache.spark.sql.delta.OptimisticTransaction.writeCommitFile(OptimisticTransaction.scala:142)
at org.apache.spark.sql.delta.OptimisticTransactionImpl.doCommit(OptimisticTransaction.scala:1711)
at org.apache.spark.sql.delta.OptimisticTransactionImpl.doCommit$(OptimisticTransaction.scala:1682)
at org.apache.spark.sql.delta.OptimisticTransaction.doCommit(OptimisticTransaction.scala:142)
at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$doCommitRetryIteratively$3(OptimisticTransaction.scala:1651)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$doCommitRetryIteratively$2(OptimisticTransaction.scala:1648)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:140)
at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:138)
at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:142)
at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:133)
at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:128)
at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:117)
at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:142)
at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:132)
Additional Context
Possible root cause:
dbt-spark uses drop table + create table for seeds.
On the first seed run everything works: the table is created and the data is inserted. However, since we are using the DynamoDB lock store, lock entries for the Delta log files are also written to the DynamoDB table.
On the next run the table is dropped and the Delta Lake files in S3 are deleted, but the DynamoDB entries are not. When dbt-spark then executes the 'create table' statement, Delta tries to write 00000000000000000000.json and fails, because the corresponding lock record already exists.
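The stale entries can be confirmed by inspecting the DynamoDB table directly. A hedged sketch, assuming the default S3DynamoDBLogStore key schema (partition key tablePath, sort key fileName) and hypothetical table/path names:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical DynamoDB table name and S3 path; key names assume the default
# S3DynamoDBLogStore schema (tablePath / fileName).
ddb = boto3.resource("dynamodb")
lock_table = ddb.Table("delta_log_store")
resp = lock_table.query(
    KeyConditionExpression=Key("tablePath").eq("s3://my-bucket/my_seed")
)
for item in resp["Items"]:
    # Leftover entries (e.g. for 00000000000000000000.json) from the dropped
    # table show up here and make the next CREATE TABLE fail.
    print(item["fileName"])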
It feels like a better way to work with Delta Lake tables would be not drop + create, but create if not exists followed by truncate + insert, or a merge if that makes more sense.
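For illustration, a rough sketch of that alternative flow (hypothetical database, table, and location names; DELETE FROM is used here as a Delta-supported way to empty the table, which avoids rewriting 00000000000000000000.json on every run):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the Delta table only if it does not exist yet, so the existing
# DynamoDB lock entries for its _delta_log files stay valid.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.my_seed (id INT, name STRING)
    USING delta
    LOCATION 's3://my-bucket/my_seed'
""")

# Replace the contents instead of dropping and recreating the table.
spark.sql("DELETE FROM analytics.my_seed")
spark.sql("INSERT INTO analytics.my_seed VALUES (1, 'a'), (2, 'b')")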