DC-760: dataset builder query generation #1516

Merged: pshapiro4broad merged 36 commits into develop from ps/dc-760-dataset-builder-query on Nov 8, 2023

Conversation

pshapiro4broad
Member

@pshapiro4broad pshapiro4broad commented Sep 29, 2023

Port over some code from tanagra that generates SQL queries from Java objects. This will be used in a future PR that handles API requests from the dataset builder to generate SQL based on a set of criteria in the cohort builder UI.

The code supports T-SQL and BigQuery SQL, and provides a Query API that supports:

  • multi-table select/where
  • join/left join
  • order by
  • filter expressions, such as =, !=, <, >, IN clauses
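
As a rough illustration of the API shape listed above, a minimal fluent builder might look like the sketch below. `SimpleQuery` and its methods are hypothetical stand-ins, not the classes ported in this PR, and the table and column names are made up. Values are inlined as text here, so the sketch does not address literal safety.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical, simplified stand-in for a query API; not the classes in this PR. */
final class SimpleQuery {
  private final List<String> selectFields = new ArrayList<>();
  private final List<String> joins = new ArrayList<>();
  private final List<String> filters = new ArrayList<>();
  private final List<String> orderFields = new ArrayList<>();
  private String primaryTable;

  SimpleQuery select(String... fields) {
    selectFields.addAll(List.of(fields));
    return this;
  }

  SimpleQuery from(String table, String alias) {
    primaryTable = table + " AS " + alias;
    return this;
  }

  SimpleQuery leftJoin(String table, String alias, String onCondition) {
    joins.add("LEFT JOIN " + table + " AS " + alias + " ON " + onCondition);
    return this;
  }

  /** Binary filter such as =, !=, <, >. The value is inlined as-is. */
  SimpleQuery where(String field, String operator, Object value) {
    filters.add(field + " " + operator + " " + value);
    return this;
  }

  /** IN filter over numeric values only, to sidestep quoting in this sketch. */
  SimpleQuery whereIn(String field, List<Integer> values) {
    String list = values.stream().map(String::valueOf).collect(Collectors.joining(", "));
    filters.add(field + " IN (" + list + ")");
    return this;
  }

  SimpleQuery orderBy(String field) {
    orderFields.add(field);
    return this;
  }

  String renderSQL() {
    StringBuilder sql = new StringBuilder("SELECT ");
    sql.append(String.join(", ", selectFields)).append(" FROM ").append(primaryTable);
    for (String join : joins) {
      sql.append(' ').append(join);
    }
    if (!filters.isEmpty()) {
      sql.append(" WHERE ").append(String.join(" AND ", filters));
    }
    if (!orderFields.isEmpty()) {
      sql.append(" ORDER BY ").append(String.join(", ", orderFields));
    }
    return sql.toString();
  }
}
```

Chaining `select`, `from`, `leftJoin`, `where`, `whereIn`, and `orderBy` and then calling `renderSQL()` yields a single SQL string with the clauses in that order.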

This doesn't include the code to execute these queries in TDR. That work was already done as part of a proof-of-concept; see #1433.

Since the code isn't used yet, I'm aiming for 80%+ code coverage in tests.

}

@Override
public String renderSQL(SqlPlatform platform) {
Contributor

Why is platform not used here? Should that get passed through somehow?

Member Author

It might be needed in the future, but wasn't needed yet. Everything this calls is private so if we were to use the sql platform in the future it wouldn't affect this class's contract.

I'm not totally wild about the way I added support for Azure vs. BQ here, since it involved threading the SQL platform through all the code, but it seemed like the best way at the time, and it does work.


@Override
public String renderSQL(SqlPlatform platform) {
// TODO: use named parameters for literals to protect against SQL injection
Contributor

This seems like something we should make sure gets done before this code ends up in production and is callable by any non-admin. Is there a ticket for this?

Member Author

No, but I can create it. I'm not an expert in the way the TDR currently generates SQL code but I don't think named templates are used in all cases.
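
For reference, one way the TODO could be approached is to have each literal register itself with a parameter collector and render only a placeholder. The sketch below is an assumption about a possible design, not code from this PR: `NamedParameters` and `BinaryFilter` are hypothetical names. Both BigQuery and T-SQL use `@name` syntax for named parameters, which is why the placeholder takes that form.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical collector: stores literal values and hands back @-prefixed placeholders. */
final class NamedParameters {
  private final Map<String, Object> values = new LinkedHashMap<>();

  /** Registers a literal and returns the placeholder to splice into the SQL text. */
  String add(Object literal) {
    String name = "p" + values.size();
    values.put(name, literal);
    return "@" + name;
  }

  /** Values keyed by parameter name (without the '@'), in registration order. */
  Map<String, Object> values() {
    return Collections.unmodifiableMap(values);
  }
}

/** Hypothetical filter whose value never appears in the rendered SQL text. */
record BinaryFilter(String field, String operator, Object value) {
  String renderSQL(NamedParameters parameters) {
    return field + " " + operator + " " + parameters.add(value);
  }
}
```

The caller would then pass `parameters.values()` to whatever executes the query, so a hostile string value travels as data rather than as SQL text.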

src/main/java/bio/terra/tanagra/query/Query.java (outdated, resolved)
}

// render the primary TableVariable
String sql =
Contributor

Should this have a todo to use PreparedStatements or something else safer?

Member Author

@pshapiro4broad pshapiro4broad Oct 26, 2023

I'm not sure how that would work, but I agree, I'd much rather use the prepared statement API if possible. I think you need a SQL connection to create one, so it would be hard to do in the current flow; that would require passing in some state in addition to platform.
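
One possible way around the connection problem, sketched under assumptions: rendering could return the SQL text with `?` placeholders together with the values in order, and the code that owns the connection could bind them later. `ParameterizedSql` is a hypothetical type, and the sketch assumes the executor uses JDBC, which may not hold for every backend discussed here.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical render result: SQL text with '?' placeholders plus the values in order. */
record ParameterizedSql(String sql, List<Object> parameters) {

  /** Called later by the code that owns the connection; rendering itself needs none. */
  PreparedStatement prepare(Connection connection) throws SQLException {
    PreparedStatement statement = connection.prepareStatement(sql);
    for (int i = 0; i < parameters.size(); i++) {
      // JDBC parameter indexes start at 1.
      statement.setObject(i + 1, parameters.get(i));
    }
    return statement;
  }
}
```

With this split, the render step stays free of connection state: only `platform` goes in, and a `ParameterizedSql` comes out.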

src/main/java/bio/terra/tanagra/query/TablePointer.java (outdated, resolved)
src/main/java/bio/terra/tanagra/query/SqlPlatform.java (outdated, resolved)
.build(),
tableVariable,
"alias");
assertThat(fieldVariableSqlFunctionWrapper.renderSQL(null), is("custom(t.field)"));
Contributor

Do you think it would be worth breaking this test up and creating individual tests for each assertion?

Member Author

Sure, I was planning to revisit this once I had more tests written; the way the code sets up FieldPointer and TablePointer is pretty verbose right now. In the original code these objects are only constructed using serialization, so not much effort was given to manual construction.

/** Enum for the types of external data pointers supported by Tanagra. */
public enum Type {
BQ_DATASET,
AZURE_DATASET
Contributor

We may want to specify this as a Synapse query (i.e. SYNAPSE_DATASET), as the counterpart to BQ.

Member Author

In the tanagra code there's a concept of "query executor" which handles issuing the queries to the database, which is what this defines. I added a concept of "sql type" to handle differences between BigQuery and T-SQL. That was OK for the proof-of-concept; for the new code we will use the TDR dataset ID to figure out how to run the query, so this concept can go away.

And with your suggestion to change the way table names are handled, it may be possible to remove the sql type concept too.
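
To make the "sql type" idea concrete, here is a small sketch of one dialect difference that such a concept has to cover: row limiting, where BigQuery uses a trailing `LIMIT n` and T-SQL uses `SELECT TOP n`. `SqlDialect` and `limitRows` are hypothetical names chosen for illustration and do not reflect the contents of `SqlPlatform` in this PR.

```java
/** Hypothetical dialect enum; each constant renders a row-limited SELECT its own way. */
enum SqlDialect {
  BIGQUERY {
    @Override
    String limitRows(String selectBody, int maxRows) {
      // BigQuery puts the limit at the end of the statement.
      return "SELECT " + selectBody + " LIMIT " + maxRows;
    }
  },
  TSQL {
    @Override
    String limitRows(String selectBody, int maxRows) {
      // T-SQL puts the limit right after SELECT.
      return "SELECT TOP " + maxRows + " " + selectBody;
    }
  };

  /** selectBody is everything after the SELECT keyword, e.g. "t.id FROM t". */
  abstract String limitRows(String selectBody, int maxRows);
}
```

Keeping such differences behind one type is what makes it possible to drop the concept later if another component takes over the dialect-specific rewriting.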

src/main/java/bio/terra/tanagra/query/UnionQuery.java (outdated, resolved)
Contributor

@snf2ye snf2ye left a comment

Looking more at the code, I would like to suggest a light integration with ANTLR. The idea would be to (1) build a general SQL query using the code in this PR, ignoring differences between BQ and T-SQL and ignoring small details like aliases, and then (2) parse the generated query using either the BQVisitor or SynapseVisitor. I think this has a few advantages:

  • Simplify the code already in this PR - We have to somehow generate a base query, but let's make it as straightforward as possible.
  • We can handle all language specific changes in one spot (SynapseVisitor or BQVisitor)
  • ANTLR will handle the complexities of queries from parquet files out of the box -- We already need to support updating a query to select from a set of parquet files in the SynapseVisitor.
  • Inherently by using the parser, the generated SQL will be checked to ensure it is valid.

src/main/java/bio/terra/tanagra/query/Query.java (outdated, resolved)
src/main/java/bio/terra/tanagra/query/Query.java (outdated, resolved)
@pshapiro4broad
Member Author

Looking more at the code, I would like to suggest a light integration with ANTLR. The idea would be to (1) build a general SQL query using the code in this PR, ignoring differences between BQ and T-SQL and ignoring small details like aliases, and then (2) parse the generated query using either the BQVisitor or SynapseVisitor. I think this has a few advantages:

I'm still of two minds about this. If the output of this code is passed to the parser, then the data flow looks like:

<UI objects / configuration objects> 
  -> <tanagra Query objects> 
  -> <SQL String> 
  -> <grammar/antlr Query objects> 
  -> <SQL String>

It seems wasteful to generate SQL, then parse it, only to generate SQL again. To avoid this, the logic in DatasetAwareVisitor and subclasses could be shared with a class that does the same work in the query object. One approach would be to change from Query.renderSql(SqlPlatform platform) to Query.renderSql(DatasetAwareVisitor visitor).
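
A minimal sketch of that visitor-based variant follows, assuming the visitor's job is limited to rewriting table names. `TableNameVisitor` and `TableQuery` are hypothetical; the actual methods of `DatasetAwareVisitor` are not shown in this thread, and the project and dataset names in the usage note are placeholders.

```java
/** Hypothetical visitor: maps a logical table name to its platform-specific form. */
interface TableNameVisitor {
  String visitTable(String tableName);
}

/** Hypothetical single-table query that delegates table naming to the visitor. */
final class TableQuery {
  private final String table;
  private final String column;

  TableQuery(String table, String column) {
    this.table = table;
    this.column = column;
  }

  /** Renders SQL once, with no parse step; the visitor supplies the platform details. */
  String renderSql(TableNameVisitor visitor) {
    return "SELECT t." + column + " FROM " + visitor.visitTable(table) + " AS t";
  }
}
```

A caller could pass `name -> "`my-project.my_dataset." + name + "`"` for a BigQuery-style fully qualified name, or a different lambda for another platform, and the query object would stay platform-agnostic.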

@pshapiro4broad
Member Author

@snf2ye Can you take another look at this? I've simplified things even further, although I haven't addressed all the comments above. The remaining issues I can see are:

  • table aliases are generated statically instead of using a visitor
  • BQ/Azure table names aren't generated at all yet

I'll make two tickets for this work. When it's done, it should work the same way that SynapseVisitor and BigQueryVisitor do, and will hopefully share code with it as well.

sonarcloud bot commented Nov 7, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 4 (rating A)

Coverage: 92.0%
Duplication: 0.0%

@pshapiro4broad pshapiro4broad merged commit 4fd8996 into develop Nov 8, 2023
10 checks passed
@pshapiro4broad pshapiro4broad deleted the ps/dc-760-dataset-builder-query branch November 8, 2023 15:25
3 participants