[SPARK-52988][SQL] Fix race conditions at CREATE TABLE and FUNCTION when IF NOT EXISTS is used #51696
Thank you for pinging me, @attilapiros .
- Do you think your test code can be a part of test coverage? `createUserDefinedFunction` seems to be missed if we want to cover all function create/delete/alter.
- And, is this enough? For example, `functionExists` is okay because it's read-only?
cc @cloud-fan , @peter-toth , @yaooqinn , @LuciferYang , too
I do not think so. This change is fairly simple. With my test I just wanted to illustrate how easy it is to reproduce this.
It is not needed as the
In the sense of create/drop/alter it is not needed because of the |
instead of locking, can we do something like
|
This does not seem to be a function-specific issue.
Yes, that would also work for "CREATE FUNCTION IF NOT EXISTS", but isn't locking the safer / more correct solution, especially when the
Or is there some performance concern behind your suggestion?
Yes, locking is not good for high concurrency, and in fact the locking is only best effort, as we can't prevent other Spark applications/clients from creating/dropping tables at the same time.
b5a1c59
to
4e6242e
Compare
I have updated the PR by catching the exceptions.
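To illustrate the approach discussed above (catching the "already exists" error rather than locking), here is a minimal sketch in plain Scala. It is not the code from this PR: `ToyCatalog`, `AlreadyExistsException`, and `IfNotExistsDemo` are hypothetical stand-ins, and the in-memory map only plays the role of a catalog whose create call is atomic.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-ins for illustration only; none of these are Spark classes.
final class AlreadyExistsException(name: String)
  extends Exception(s"$name already exists")

final class ToyCatalog {
  private val functions = new ConcurrentHashMap[String, String]()

  // Atomic create: throws if some other caller registered the name first.
  def create(name: String, body: String): Unit =
    if (functions.putIfAbsent(name, body) != null) throw new AlreadyExistsException(name)

  def exists(name: String): Boolean = functions.containsKey(name)

  // IF NOT EXISTS without a lock: attempt the create and treat
  // "already exists" as success instead of checking exists() first.
  def createIfNotExists(name: String, body: String): Unit =
    try create(name, body)
    catch { case _: AlreadyExistsException => () }
}

object IfNotExistsDemo {
  // Runs `n` concurrent creates of the same name; returns true when none
  // of them failed and the function exists afterwards.
  def runConcurrently(n: Int): Boolean = {
    val catalog = new ToyCatalog
    val attempts = Future.traverse((1 to n).toList) { _ =>
      Future(catalog.createIfNotExists("f1", "i * i"))
    }
    Await.result(attempts, 30.seconds) // rethrows if any attempt failed
    catalog.exists("f1")
  }
}
```

A call such as `IfNotExistsDemo.runConcurrently(25)` mirrors the shape of the manual test in the PR description. The design point is that a separate existence check followed by a create leaves a window in which another thread (or another application) can win the race, whereas tolerating the resulting exception does not depend on any lock being held.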
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
Thank you, @attilapiros, @cloud-fan, and @yaooqinn.
…hen IF NOT EXISTS is used

### What changes were proposed in this pull request?

Fixing race conditions in CREATE TABLE and CREATE FUNCTION when IF NOT EXISTS is given.

### Why are the changes needed?

Even when "CREATE FUNCTION IF NOT EXISTS" is used, parallel executions can fail with the following exception:

```
2025-07-25 01:22:21,731 [AA-Rule-ThreadPoolExec-2] ERROR ***** - An error occured : org.apache.spark.sql.AnalysisException: Function default.SparkTestUDF already exists; line 1 pos 6734
  at org.apache.spark.sql.errors.QueryCompilationErrors$.functionAlreadyExistsError(QueryCompilationErrors.scala:654)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.registerFunction(SessionCatalog.scala:1487)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.resolvePersistentFunctionInternal(SessionCatalog.scala:1719)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.resolvePersistentFunction(SessionCatalog.scala:1675)
  at ...
```

Regarding `CREATE TABLE`:

```
scala> import scala.collection.parallel.CollectionConverters._
import scala.collection.parallel.CollectionConverters._

scala> (1 to 5).toList.par.foreach(_ => spark.sql("create table if not exists spark52988(a int)"))
     |
25/08/11 15:47:18 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
25/08/11 15:47:18 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore apiros10.96.131.100
25/08/11 15:47:18 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
25/08/11 15:47:19 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
25/08/11 15:47:19 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: [TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create table or view `default`.`spark52988` because it already exists.
Choose a different name, drop or replace the existing object, or add the IF NOT EXISTS clause to tolerate pre-existing objects. SQLSTATE: 42P07
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:226)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:218)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:422)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:123)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:79)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:77)....
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

`CREATE FUNCTION` after this change:

```
scala> (1 to 100).foreach { j => (1 to 25).toList.par.foreach(_ => spark.sql(s"create function if not exists f$j(i int) returns int return i * i")) }
25/08/12 10:34:20 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
25/08/12 10:34:20 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
25/08/12 10:34:20 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51696 from attilapiros/SPARK-52988.

Authored-by: attilapiros <[email protected]>
Signed-off-by: attilapiros <[email protected]>