DC-760: dataset builder query generation #1516

pshapiro4broad · 2023-09-29T18:06:44Z

Port over some code from Tanagra that generates SQL queries from Java objects. This will be used in a future PR that handles API requests from the dataset builder to generate SQL based on a set of criteria in the cohort builder UI.

The code supports T-SQL and BigQuery SQL, and provides a Query API that supports:

multi-table select/where
join/left join
order by
filter expressions, such as =, !=, <, >, IN clauses
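To make the shape of such a Query API concrete, here is a minimal sketch of Java objects that render a select/where query. The type names (`TableVar`, `FieldVar`, `ComparisonFilter`, `SimpleQuery`) are hypothetical stand-ins for illustration, not the classes in this PR, and joins and ordering are left out.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-ins for the query objects; not the PR's actual classes.
record TableVar(String table, String alias) {
  String renderSQL() {
    return table + " AS " + alias;
  }
}

// A column qualified by the alias of the table it belongs to.
record FieldVar(TableVar table, String column) {
  String renderSQL() {
    return table.alias() + "." + column;
  }
}

// A filter such as "p.year_of_birth < 1980"; the value is assumed to be rendered already.
record ComparisonFilter(FieldVar field, String operator, String renderedValue) {
  String renderSQL() {
    return field.renderSQL() + " " + operator + " " + renderedValue;
  }
}

// A single-table SELECT with an optional WHERE clause.
record SimpleQuery(List<FieldVar> select, TableVar from, ComparisonFilter where) {
  String renderSQL() {
    String columns = select.stream().map(FieldVar::renderSQL).collect(Collectors.joining(", "));
    String sql = "SELECT " + columns + " FROM " + from.renderSQL();
    return where == null ? sql : sql + " WHERE " + where.renderSQL();
  }
}
```

Building the query from objects keeps alias handling in one place: every `FieldVar` asks its table for the alias instead of repeating it as a string.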

This doesn't include the code to execute these queries in TDR, although that work was already done as part of a proof of concept; see #1433.

Since the code isn't used yet, I'm aiming for 80%+ code coverage in tests.

s-rubenstein · 2023-10-25T18:37:25Z

src/main/java/bio/terra/tanagra/query/FieldVariable.java

+  }
+
+  @Override
+  public String renderSQL(SqlPlatform platform) {


Why is platform not used here? Should that get passed through somehow?

It might be needed in the future, but it isn't needed yet. Everything this method calls is private, so if we were to use the SQL platform later it wouldn't affect this class's contract.

I'm not totally wild about the way I added support for Azure vs. BQ here, since it involved threading the SQL platform through all the code, but it seemed like the best way at the time, and it does work.
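As an illustration of what "threading the platform through" looks like, here is a small sketch. The enum constants and the `TableRef`/`FieldRef` types are assumptions made for this example, and identifier quoting is used only because it is a well-known difference between the two dialects; the PR may vary other details.

```java
// Assumed platform enum; the PR's SqlPlatform may define different constants.
enum Platform {
  BIGQUERY,
  SYNAPSE
}

// Every renderable node takes the platform, whether or not it needs it.
interface SqlExpression {
  String renderSQL(Platform platform);
}

// Hypothetical table reference: the two dialects quote identifiers differently.
record TableRef(String name, String alias) implements SqlExpression {
  @Override
  public String renderSQL(Platform platform) {
    String quoted =
        switch (platform) {
          case BIGQUERY -> "`" + name + "`"; // BigQuery quotes identifiers with backticks
          case SYNAPSE -> "[" + name + "]"; // T-SQL quotes identifiers with brackets
        };
    return quoted + " AS " + alias;
  }
}

// Hypothetical field reference: accepts the platform to satisfy the interface but ignores it.
record FieldRef(TableRef table, String column) implements SqlExpression {
  @Override
  public String renderSQL(Platform platform) {
    return table.alias() + "." + column;
  }
}
```

The unused parameter in `FieldRef` is the cost of this design: every node carries the argument so that the few nodes that care can see it.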

s-rubenstein · 2023-10-25T18:40:07Z

src/main/java/bio/terra/tanagra/query/Literal.java

+
+  @Override
+  public String renderSQL(SqlPlatform platform) {
+    // TODO: use named parameters for literals to protect against SQL injection


This seems like something we should make sure gets done before this code ends up in production and is callable by any non-admin. Is there a ticket for this?

No, but I can create it. I'm not an expert in the way the TDR currently generates SQL code but I don't think named templates are used in all cases.
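One way the TODO could be addressed, sketched under assumptions: a literal renders a named placeholder and registers its value in a collector, so the value never becomes part of the SQL text. `NamedParameters` and `ParamLiteral` are hypothetical names, and the `@name` placeholder form is the one BigQuery and T-SQL both use for named parameters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical collector that hands out parameter names and remembers their values.
final class NamedParameters {
  private final Map<String, Object> values = new LinkedHashMap<>();

  // Registers a value and returns the placeholder to embed in the SQL text.
  String add(Object value) {
    String name = "p" + values.size();
    values.put(name, value);
    return "@" + name;
  }

  // The collected values, keyed by parameter name without the leading '@'.
  Map<String, Object> values() {
    return values;
  }
}

// Hypothetical literal that renders a placeholder instead of the raw value.
record ParamLiteral(Object value) {
  String renderSQL(NamedParameters params) {
    return params.add(value);
  }
}
```

With this shape, a hostile string such as `x' OR '1'='1` ends up only in the parameter map and is passed to the database as data when the query runs.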

s-rubenstein · 2023-10-25T18:54:19Z

src/main/java/bio/terra/tanagra/query/Query.java

+    }
+
+    // render the primary TableVariable
+    String sql =


Should this have a todo to use PreparedStatements or something else safer?

I'm not sure how that would work, but I agree, I'd much rather use the prepared statement API if possible. I think you need a SQL connection to create one, so it would be hard to do in the current flow; that would require passing in some state in addition to platform.
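A possible way around the "you need a connection" problem, sketched under the assumption that the query is eventually run over JDBC: rendering produces the SQL text plus its ordered values, and the statement is only created later by whoever owns the `Connection`. `ParameterizedSql` is a hypothetical type, not something in this PR.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical result of rendering: SQL text with '?' placeholders plus the values in order.
record ParameterizedSql(String sql, List<Object> values) {

  // Binding is deferred until a caller that owns a Connection asks for a statement.
  PreparedStatement prepare(Connection connection) throws SQLException {
    PreparedStatement statement = connection.prepareStatement(sql);
    for (int i = 0; i < values.size(); i++) {
      statement.setObject(i + 1, values.get(i)); // JDBC parameter indexes start at 1
    }
    return statement;
  }
}
```

The rendering code then needs no extra state beyond what it already has; only the execution layer touches the connection.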

s-rubenstein · 2023-10-25T20:17:17Z

src/test/java/bio/terra/tanagra/query/FieldVariableTest.java

+                .build(),
+            tableVariable,
+            "alias");
+    assertThat(fieldVariableSqlFunctionWrapper.renderSQL(null), is("custom(t.field)"));


Do you think it would be worth breaking this test up and creating individual tests for each assertion?

Sure, I was planning to revisit this once I had more tests written; the way the code sets up FieldPointer and TablePointer is pretty verbose right now. In the original code these objects are only constructed using serialization, so not much effort was given to manual construction.

snf2ye · 2023-10-26T14:04:43Z

src/main/java/bio/terra/tanagra/query/datapointer/DataPointer.java

+  /** Enum for the types of external data pointers supported by Tanagra. */
+  public enum Type {
+    BQ_DATASET,
+    AZURE_DATASET


We may want to specify this as a Synapse query (i.e. SYNAPSE_DATASET), as the counterpart to BQ.

In the Tanagra code there's a concept of "query executor" which handles issuing the queries to the database, which is what this defines. I added a concept of "sql type" to handle differences between BigQuery and T-SQL. That was OK for the proof of concept, but for the new code we will use the TDR dataset ID to figure out how to run the query, so this concept can go away.

And with your suggestion to change the way table names are handled, it may be possible to remove the sql type concept too.

snf2ye

Looking more at the code, I would like to suggest a light integration with ANTLR. The idea would be to (1) build a general SQL query using the code in this PR, ignoring differences between BQ and T-SQL and ignoring small details like aliases, and then (2) parse the generated query using either the BQVisitor or SynapseVisitor. I think this has a few advantages:

Simplify the code already in this PR - We have to somehow generate a base query, but let's make it as straightforward as possible.
We can handle all language specific changes in one spot (SynapseVisitor or BQVisitor)
ANTLR will handle the complexities of queries from parquet files out of the box -- We already need to support updating a query to select from a set of parquet files in the SynapseVisitor.
Inherently by using the parser, the generated SQL will be checked to ensure it is valid.

pshapiro4broad · 2023-10-27T14:28:54Z

Looking more at the code, I would like to suggest a light integration with ANTLR. The idea would be to (1) build a general SQL query using the code in this PR, ignoring differences between BQ and T-SQL and ignoring small details like aliases, and then (2) parse the generated query using either the BQVisitor or SynapseVisitor. I think this has a few advantages:

I'm still of two minds about this. If the output of this code is passed to the parser, then the data flow looks like:

<UI objects / configuration objects>
  -> <tanagra Query objects>
  -> <SQL String>
  -> <grammar/antlr Query objects>
  -> <SQL String>

It seems wasteful to generate SQL, then parse it, only to generate SQL again. To avoid this, the logic in DatasetAwareVisitor and subclasses could be shared with a class that does the same work in the query object. One approach would be to change from Query.renderSql(SqlPlatform platform) to Query.renderSql(DatasetAwareVisitor visitor).
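A rough sketch of that alternative, with loud caveats: `TableNameVisitor`, `VisitedTable`, and `Visitors` are hypothetical names, and the real DatasetAwareVisitor works on an ANTLR parse tree and has a different interface. The point is only that the query object can ask a dialect-specific collaborator for the table name instead of switching on a platform enum.

```java
// Hypothetical visitor; the real DatasetAwareVisitor is ANTLR-based and differs from this.
interface TableNameVisitor {
  String visitTableName(String tableName);
}

// Hypothetical query node that delegates dialect-specific naming to the visitor.
record VisitedTable(String name, String alias) {
  String renderSQL(TableNameVisitor visitor) {
    return visitor.visitTableName(name) + " AS " + alias;
  }
}

// Two illustrative visitors; the naming schemes are assumptions, not TDR's actual rules.
final class Visitors {
  private Visitors() {}

  // Fully qualified BigQuery-style name in backticks.
  static TableNameVisitor bigQuery(String project, String dataset) {
    return table -> "`" + project + "." + dataset + "." + table + "`";
  }

  // Bracket-quoted, schema-qualified T-SQL-style name.
  static TableNameVisitor bracketed(String schema) {
    return table -> "[" + schema + "].[" + table + "]";
  }
}
```

Under this design the query is rendered once, and the SQL string is never parsed back into objects.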

* Add test for more complicated sql
* Move test to OnDemand suite
* Add back to query test and split up assert
* Update src/test/java/bio/terra/service/snapshotbuilder/query/QueryTest.java
* Move Boolean to boolean
* Fix syntax

Co-authored-by: Phil Shapiro

pshapiro4broad · 2023-11-06T20:52:16Z

@snf2ye Can you take another look at this? I've simplified things even further, although I haven't addressed all the comments above. The remaining issues I can see are:

table aliases are generated statically instead of using a visitor
BQ/Azure table names aren't generated at all yet

I'll make two tickets for this work. When it's done, it should work the same way that SynapseVisitor and BigQueryVisitor do, and will hopefully share code with it as well.

sonarcloud · 2023-11-07T19:57:31Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
4 Code Smells

92.0% Coverage
0.0% Duplication

pshapiro4broad added 19 commits September 26, 2023 17:01

initially add all files

24893b0

replace StringSubstitutor with ST

38f04ce

a few unit tests

5994f9e

Update tests; use ST as a return value from `getSubstitutionTemplate()`; reformat

4c05445

remove some serialization code

b549e13

finish removing serialization code; remove unused code; many misc refactorings

935316f

refactor

21509e1

spotless

9f07671

disable mutable Date warnings for now

1e138e9

unit tests

28d6e78

unit tests

923c708

remove unused code revealed by coverage tests

19e0a4f

added a few more tests; removed builder pattern from Query; many other misc changes

6471e64

misc error handling cleanups; remove use of custom exception in a few places; move validation to constructors;

6ef904b

remove underlay code for now

822986c

move datapointer into query package

be6d595

Merge branch 'develop' into ps/dc-760-dataset-builder-query

ee86b07

remove executor, dataset code for now

9f1ba2c

remove more unused code relating to query execution

0c13fcb

pshapiro4broad marked this pull request as ready for review October 25, 2023 14:39

pshapiro4broad requested review from snf2ye, nmalfroy, samanehsan and okotsopoulos as code owners October 25, 2023 14:39

pshapiro4broad requested a review from s-rubenstein October 25, 2023 14:39

spotless

1b55455

s-rubenstein reviewed Oct 25, 2023

unit tests for TableVariable

d2bb812

snf2ye reviewed Oct 26, 2023

pshapiro4broad and others added 13 commits October 27, 2023 11:05

code review feedback: remove DataPointer, remove other unused code

2a1d115

remove unused classes

9af127f

move code out of tanagra package and into snapshotbuilder package

9e8c2b6

whoops, move tests too

7663221

de-builderize FieldPointer

d3fc04c

spotless; add tests for BinaryFilter and BooleanAndOrFilter

babe312

more tests

acb3c22

literal tests

a326ad6

remove unused code related to query response

2270245

more test coverage

a9323c3

remove platform; remove LIMIT and ORDER BY support

99760b4

add GROUP BY test

e847587

s-rubenstein approved these changes Nov 6, 2023

pshapiro4broad requested a review from snf2ye November 6, 2023 20:52

rjohanek self-requested a review November 7, 2023 13:10

spotbugs doesn't like 3.14

0ebb357

snf2ye approved these changes Nov 7, 2023

spotbugs doesn't like Date

6a4fa8b

pshapiro4broad merged commit 4fd8996 into develop Nov 8, 2023
10 checks passed

pshapiro4broad deleted the ps/dc-760-dataset-builder-query branch November 8, 2023 15:25
