PLUGIN-1823: Retrying all SQLTransientExceptions #597

Open
wants to merge 21 commits into
base: develop

sgarg-CS
Contributor

@sgarg-CS sgarg-CS commented May 16, 2025

PLUGIN-1823

Add a Failsafe retry policy to all the places in the database-plugins where SQLTransientException could be thrown.

Added three new properties (hidden from UI)

  • Initial Retry Duration (Default: 5 sec)
  • Max Retry Duration (Default: 80 sec)
  • Max Retry Count (Default: 5)
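
As a rough illustration of how these three defaults could interact, here is a minimal plain-Java sketch. It is not the PR's implementation (the PR delegates retrying to Failsafe). The class name `TransientRetrySketch`, the `Sleeper` interface and the doubling factor are assumptions made for this example only. With a doubling factor, an initial delay of 5 seconds reaches the 80-second cap exactly on the fifth retry.

```java
import java.sql.SQLTransientException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical illustration only; names and the doubling factor are assumptions. */
final class TransientRetrySketch {

  /** Abstraction over sleeping so the retry loop can be exercised without waiting. */
  interface Sleeper {
    void sleep(Duration delay) throws InterruptedException;
  }

  /** Delay before each retry: starts at initial, doubles each time, never exceeds max. */
  static List<Duration> backoffSchedule(Duration initial, Duration max, int maxRetries) {
    List<Duration> delays = new ArrayList<>();
    Duration next = initial;
    for (int i = 0; i < maxRetries; i++) {
      delays.add(next);
      Duration doubled = next.multipliedBy(2);
      next = doubled.compareTo(max) > 0 ? max : doubled;
    }
    return delays;
  }

  /** Runs the action, retrying only when it throws SQLTransientException. */
  static <T> T callWithRetry(Callable<T> action, List<Duration> delays, Sleeper sleeper)
      throws Exception {
    int retriesUsed = 0;
    while (true) {
      try {
        return action.call();
      } catch (SQLTransientException e) {
        if (retriesUsed >= delays.size()) {
          throw e; // retries exhausted, surface the last transient failure
        }
        sleeper.sleep(delays.get(retriesUsed++));
      }
      // Any other exception type propagates immediately without a retry.
    }
  }
}
```

Calling `backoffSchedule(Duration.ofSeconds(5), Duration.ofSeconds(80), 5)` yields delays of 5, 10, 20, 40 and 80 seconds, so a call that keeps failing would wait roughly 155 seconds in total before giving up under these assumptions.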

google-cla bot commented May 16, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@sgarg-CS sgarg-CS force-pushed the patch/plugin-1823 branch from 9da75b8 to ac813f0 Compare May 26, 2025 05:28
@sgarg-CS sgarg-CS added build and removed build labels May 27, 2025
@sgarg-CS sgarg-CS requested a review from itsankit-google May 30, 2025 04:56
Member

@itsankit-google itsankit-google left a comment

Overall, it looks like everything is being wrapped within Failsafe, so we might end up with nested retries. We need to ensure we add retries only where we actually interact with the JDBC client, not in top-level functions.

For example, adding retries to DriverManager.getConnection(connectionString, connectionProperties) makes sense because you are actually interacting with the source DB, but adding retries to the whole loadSchema(Connection connection, String query) does not make sense. We need to be careful when adding such retries.
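
To make the nested-retry concern concrete, the following plain-Java sketch counts how often a permanently failing call is invoked when a retrying top-level function wraps a retrying JDBC call. It does not use Failsafe; `NestedRetrySketch`, `Attempt` and `retry` are hypothetical names introduced only for this illustration.

```java
import java.sql.SQLTransientException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of how nested retries multiply the number of attempts. */
final class NestedRetrySketch {

  /** A unit of work that may fail with a checked exception. */
  interface Attempt {
    void run() throws Exception;
  }

  /** Runs the action once, then up to maxRetries more times while it keeps failing. */
  static void retry(int maxRetries, Attempt action) throws Exception {
    Exception last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        action.run();
        return;
      } catch (Exception e) {
        last = e;
      }
    }
    throw last;
  }

  /** Counts invocations of an always-failing call wrapped in two retry layers. */
  static int attemptsWithNestedRetries(int outerRetries, int innerRetries) {
    AtomicInteger calls = new AtomicInteger();
    try {
      retry(outerRetries, () -> retry(innerRetries, () -> {
        calls.incrementAndGet();
        throw new SQLTransientException("simulated outage");
      }));
    } catch (Exception expected) {
      // Every attempt fails by construction, so the final failure is expected here.
    }
    return calls.get();
  }
}
```

With five retries at each level, the innermost call runs (5 + 1) × (5 + 1) = 36 times instead of 6, and the waiting time grows accordingly. Retrying only at the point of contact with the JDBC client avoids this multiplication.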

@itsankit-google
Member

Please note that the E2E tests should not be modified and should not fail with these changes. Otherwise, we have done something wrong that prevents the expected failure messages from being produced.

Member

@itsankit-google itsankit-google left a comment

I can see some level of duplication in both AbstractDBSource & AbstractDBSink, can we please move it to the common AbstractDBUtil class?

Member

@itsankit-google itsankit-google left a comment

// Imports for Failsafe, FailsafeException and RetryPolicy are omitted: their package depends on the
// Failsafe version in use (dev.failsafe in 3.x, net.jodah.failsafe in 2.x).
// ErrorUtils, ErrorCategory, ErrorType, ErrorCodeType and ProgramFailureException come from the CDAP API.
import com.google.common.base.Strings;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public final class RetryUtils {

  private RetryUtils() {
    // utility class, not meant to be instantiated
  }

  /** Opens a JDBC connection, retrying according to the given policy. */
  public static Connection createConnectionWithRetry(RetryPolicy<?> retryPolicy, String connectionString,
                                                     Properties connectionProperties,
                                                     String externalDocumentationLink) throws Exception {
    try {
      return Failsafe.with(retryPolicy).get(() ->
        DriverManager.getConnection(connectionString, connectionProperties)
      );
    } catch (Exception e) {
      throw unwrapFailsafeException(e, externalDocumentationLink);
    }
  }

  /** Creates a Statement on the given connection, retrying according to the given policy. */
  public static Statement createStatementWithRetry(RetryPolicy<?> retryPolicy, Connection connection,
                                                   String externalDocumentationLink) throws Exception {
    try {
      return Failsafe.with(retryPolicy).get(connection::createStatement);
    } catch (Exception e) {
      throw unwrapFailsafeException(e, externalDocumentationLink);
    }
  }

  /** Prepares a statement for the given query, retrying according to the given policy. */
  public static PreparedStatement prepareStatementWithRetry(RetryPolicy<?> retryPolicy, Connection connection,
                                                            String sqlQuery,
                                                            String externalDocumentationLink) throws Exception {
    try {
      return Failsafe.with(retryPolicy).get(() ->
        connection.prepareStatement(sqlQuery)
      );
    } catch (Exception e) {
      throw unwrapFailsafeException(e, externalDocumentationLink);
    }
  }

  /**
   * Executes the query, retrying according to the given policy. The Statement stays open so that the
   * ResultSet remains usable; the caller should close it, for example via resultSet.getStatement().
   */
  public static ResultSet executeWithRetry(RetryPolicy<?> retryPolicy, Connection connection,
                                           String sqlQuery, String externalDocumentationLink) throws Exception {
    try {
      return Failsafe.with(retryPolicy).get(() -> connection.createStatement().executeQuery(sqlQuery));
    } catch (Exception e) {
      throw unwrapFailsafeException(e, externalDocumentationLink);
    }
  }

  /** Replaces a FailsafeException with its cause, converting SQLException causes to ProgramFailureException. */
  private static Exception unwrapFailsafeException(Exception e, String externalDocumentationLink) {
    if (e instanceof FailsafeException && e.getCause() instanceof Exception) {
      Exception cause = (Exception) e.getCause();
      if (cause instanceof SQLException) {
        return programFailureException((SQLException) cause, externalDocumentationLink);
      }
      return cause;
    }
    return e;
  }

  private static ProgramFailureException programFailureException(SQLException e,
                                                                 String externalDocumentationLink) {
    // wrap exception to ensure SQLException-child instances not exposed to contexts without jdbc
    // driver in classpath
    String errorMessage =
      String.format("SQL Exception occurred: [Message='%s', SQLState='%s', ErrorCode='%s'].",
        e.getMessage(), e.getSQLState(), e.getErrorCode());
    String errorMessageWithDetails = String.format("Error occurred while trying to" +
      " get schema from database." + "Error message: '%s'. Error code: '%s'. SQLState: '%s'", e.getMessage(),
        e.getErrorCode(), e.getSQLState());

    if (!Strings.isNullOrEmpty(externalDocumentationLink)) {
      if (!errorMessage.endsWith(".")) {
        errorMessage = errorMessage + ".";
      }
      errorMessage = String.format("%s For more details, see %s", errorMessage, externalDocumentationLink);
    }
    return ErrorUtils.getProgramFailureException(new ErrorCategory(ErrorCategory.ErrorCategoryEnum.PLUGIN),
      errorMessage, errorMessageWithDetails, ErrorType.USER, false, ErrorCodeType.SQLSTATE, e.getSQLState(),
        externalDocumentationLink, e);
  }
}

You can create a RetryUtils like above which accepts connection params.

Move retry logic into a separate class: RetryUtils and add exception handling
@sgarg-CS
Contributor Author

sgarg-CS commented Jun 3, 2025

Overall, it looks like everything is being wrapped within Failsafe, so we might end up with nested retries. We need to ensure we add retries only where we actually interact with the JDBC client, not in top-level functions.

For example, adding retries to DriverManager.getConnection(connectionString, connectionProperties) makes sense because you are actually interacting with the source DB, but adding retries to the whole loadSchema(Connection connection, String query) does not make sense. We need to be careful when adding such retries.

Refactored the code to add the retry logic only for the methods interacting with the JDBC client.

Member

@itsankit-google itsankit-google left a comment

How does it affect the UI if validateSchema fails for DB sinks? Can we test before/after the change?

Moved all the retry constants to RetryUtils class, Added the methods to determine error type and error category methods from Error code and SQL State
@sgarg-CS
Contributor Author

sgarg-CS commented Jun 5, 2025

How does it affect the UI if validateSchema fails for DB sinks? Can we test before/after the change?

Yes. Will check and update the behaviour here.

I've reverted the changes made to handle the SQLException thrown by the validateSchema() call in AbstractDBSink.configurePipeline(). Still, I see a change in the error message on the UI. This is probably because the SQLException is now caught and wrapped in a ProgramFailureException. Earlier, this wrapping was done for inferSchema() but not for validateSchema() in the AbstractDBSink class.

Test Scenario: Validate the schema in PostgreSQL Sink Plugin, if connection is not active. (Postgres DB is down)

[BEFORE CHANGES]

Error message on the UI:
Exception while trying to validate schema of database table '"users2"' for connection 'jdbc:postgresql://localhost:5433/postgres' with Connection to localhost:5433 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

[AFTER CHANGES]

Error message on the UI:
Error encountered while configuring the stage: 'Error occurred while trying to get schema from database. Error message: 'Connection to localhost:5433 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.'. Error code: '0'. SQLState: '08001''

Is the new error message acceptable or should we revert the changes done to throw ProgramFailureException for validateSchema() method?
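
For readers unfamiliar with the SQLState value in the message above: the first two characters of a SQLSTATE form its class, and class 08 is the standard "connection exception" class. The sketch below shows one way such a class could be read from a SQLException. `SqlStateSketch` and its method names are hypothetical and only cover a handful of standard classes; they are not the error-type mapping added in this PR.

```java
import java.sql.SQLException;

/** Hypothetical helper that names a few standard SQLSTATE classes. */
final class SqlStateSketch {

  /** Returns the two-character SQLSTATE class, or "" when the state is missing or too short. */
  static String sqlStateClass(SQLException e) {
    String state = e.getSQLState();
    return (state == null || state.length() < 2) ? "" : state.substring(0, 2);
  }

  /** Human-readable name for a few SQLSTATE classes defined by the SQL standard. */
  static String describe(SQLException e) {
    switch (sqlStateClass(e)) {
      case "08":
        return "connection exception";
      case "22":
        return "data exception";
      case "23":
        return "integrity constraint violation";
      case "28":
        return "invalid authorization specification";
      case "40":
        return "transaction rollback";
      case "42":
        return "syntax error or access rule violation";
      default:
        return "unclassified";
    }
  }
}
```

For the PostgreSQL failure reported above, `describe(new SQLException("Connection refused", "08001", 0))` returns "connection exception".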

@itsankit-google
Member

LGTM

@@ -12,7 +12,7 @@ errorMessageInvalidImportQuery=Import Query select must contain the string '$CON
\ to 1. Include '$CONDITIONS' in the Import Query
errorMessageBlankUsername=Username is required when password is given.
errorMessageInvalidTableName=Error encountered while configuring the stage: 'Error occurred while trying to get schema from database.Error message: 'IO Error: Unknown host specified '. Error code: '17002'. SQLState: '08006''
-errorMessageInvalidSinkDatabase=Error encountered while configuring the stage: 'Error occurred while trying to get schema from database.Error message: 'IO Error: Unknown host specified '. Error code: '17002'. SQLState: '08006''
+errorMessageInvalidSinkDatabase=Error encountered while configuring the stage: 'Error occurred while trying to get schema from database.
Member

This is not correct. Why are we making the error message generic? It should contain the exact error message, as before.

@@ -1,4 +1,4 @@
-errorMessageInvalidSourceDatabase=Error occurred while trying to get schema from database.Error message: 'Access denied for user '
+errorMessageInvalidSourceDatabase=Error occurred while trying to get schema from database.
Member

Similar comment here. It applies to all the error messages in the PR where we are reducing the exact messages and making them generic.

Member

If the error messages are not coming through as before, there is something wrong with the changes, and we should fix it there.

…rn types for overridden methods to base class
Comment on lines 385 to 386
protected String getExternalDocumentationLink() {
-return null;
+return "https://en.wikipedia.org/wiki/SQLSTATE";
Member

is this method still needed? can we remove it?

Contributor Author

Not needed now. Removed. 7c8815d

Comment on lines 165 to 167
protected String getExternalDocumentationLink() {
return "https://en.wikipedia.org/wiki/SQLSTATE";
}
Member

is this method still needed?

Contributor Author

No, not needed now. Removed it. 7c8815d

@@ -87,13 +86,17 @@ public abstract class AbstractDBSource<T extends PluginConfig & DatabaseSourceCo
Pattern.CASE_INSENSITIVE);
private static final Pattern WHERE_CONDITIONS = Pattern.compile("\\s+where \\$conditions",
Pattern.CASE_INSENSITIVE);
private final RetryPolicy<?> retryPolicy;

Member

please remove empty line

Contributor Author

Removed 0b57ce6

import java.sql.SQLTransientException;
import java.util.HashSet;
import java.util.Set;

Member

please remove empty line

Contributor Author

Removed 0b57ce6

4 participants