HIVE-28938: Error in LATERAL VIEW with non native tables due to presence of incorrect virtual columns in RowResolver #5798


Merged: 6 commits merged into apache:master on May 12, 2025

Conversation

@soumyakanti3578 (Contributor) commented Apr 30, 2025


What changes were proposed in this pull request?

  1. Refactored the virtual-column handling logic in CalcitePlanner::genTableLogicalPlan.
  2. Removed obtainTableType(Table tabMetaData), as its logic was insufficient for classifying tables.
  3. Removed the enum TableType, which contained the types NATIVE, DRUID, and JDBC.
  4. Added methods isDruidTable and isJdbcTable in Table as an alternative to the enum types; isNonNative is already present in Table.
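The shape of the refactor in (2)-(4) can be sketched roughly as follows. This is a hypothetical, simplified sketch and not the actual Hive code: the field, constructor, and the string checks inside the predicates are assumptions for illustration; only the method names isNonNative, isDruidTable, and isJdbcTable come from the PR description.

```java
// Hypothetical, simplified sketch -- not the actual Hive Table class.
// The storageHandlerClass field and the contains() checks are assumptions.
public class Table {
    private final String storageHandlerClass; // null for native (managed) tables

    public Table(String storageHandlerClass) {
        this.storageHandlerClass = storageHandlerClass;
    }

    // Hive's Table already exposes isNonNative(); shown here for completeness.
    public boolean isNonNative() {
        return storageHandlerClass != null;
    }

    // The PR adds predicates like these instead of a NATIVE/DRUID/JDBC enum,
    // so tables that are neither Druid nor JDBC (e.g. Iceberg) no longer fall
    // through to a NATIVE default.
    public boolean isDruidTable() {
        return storageHandlerClass != null && storageHandlerClass.contains("Druid");
    }

    public boolean isJdbcTable() {
        return storageHandlerClass != null && storageHandlerClass.contains("Jdbc");
    }

    public static void main(String[] args) {
        Table iceberg = new Table("org.apache.iceberg.mr.hive.HiveIcebergStorageHandler");
        System.out.println(iceberg.isNonNative());  // true
        System.out.println(iceberg.isDruidTable()); // false
        System.out.println(iceberg.isJdbcTable());  // false
    }
}
```

With boolean predicates there is no exhaustive three-way classification to get wrong: an Iceberg table is simply non-native, rather than defaulting to NATIVE as it did under the removed enum.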

Why are the changes needed?

Explained in detail in https://issues.apache.org/jira/browse/HIVE-28938; in short, Iceberg tables were being misclassified as NATIVE tables, leading to the addition of incorrect virtual columns in RowResolver.

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -pl itests/qtest-iceberg -Piceberg -Pitests -Dtest=TestIcebergCliDriver -Dtest.output.overwrite=true -Dqfile="iceberg_multiple_lateral_views.q"

@amansinha100

> Explained in detail in https://issues.apache.org/jira/browse/HIVE-28938, but in short, iceberg tables were getting misclassified as NATIVE tables leading to addition of wrong virtual columns in RowResolver.

The nature of the code change in the PR is not limited to lateral views, so I am curious: is this a wider issue for Iceberg tables? Can you test with nested subqueries in the FROM clause, nested regular views, and nested materialized views?

@soumyakanti3578 (PR author)

This is indeed a wider issue. All Iceberg tables were classified as NATIVE tables in CalcitePlanner::genTableLogicalPlan; however, the side effect was only visible because the lateral view references all virtual columns too, as shown in the Jira. It is also visible in the output of explain formatted select * from <table>, for example by running:

create external table test (a int, b int, c int, d struct<e:string, f:int>) stored by iceberg;
explain formatted select * from test;

We see both NATIVE and NON-NATIVE virtual columns in the output:

{"CBOPlan":"{
  \"rels\": [
    {
      \"id\": \"0\",
      \"relOp\": \"org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveTableScan\",
      \"table\": [
        \"default\",
        \"test\"
      ],
      \"table:alias\": \"test\",
      \"inputs\": [],
      \"rowCount\": 1.0,
      \"avgRowSize\": 248.0,
      \"rowType\": {
        \"fields\": [
          {
            \"type\": \"INTEGER\",
            \"nullable\": true,
            \"name\": \"a\"
          },
          {
            \"type\": \"INTEGER\",
            \"nullable\": true,
            \"name\": \"b\"
          },
          {
            \"type\": \"INTEGER\",
            \"nullable\": true,
            \"name\": \"c\"
          },
          {
            \"fields\": {
              \"fields\": [
                {
                  \"type\": \"VARCHAR\",
                  \"nullable\": true,
                  \"precision\": 2147483647,
                  \"name\": \"e\"
                },
                {
                  \"type\": \"INTEGER\",
                  \"nullable\": true,
                  \"name\": \"f\"
                }
              ],
              \"nullable\": true
            },
            \"nullable\": true,
            \"name\": \"d\"
          },
          {
            \"type\": \"BIGINT\",
            \"nullable\": true,
            \"name\": \"BLOCK__OFFSET__INSIDE__FILE\"
          },
          {
            \"type\": \"VARCHAR\",
            \"nullable\": true,
            \"precision\": 2147483647,
            \"name\": \"INPUT__FILE__NAME\"
          },
          {
            \"fields\": {
              \"fields\": [
                {
                  \"type\": \"BIGINT\",
                  \"nullable\": true,
                  \"name\": \"writeid\"
                },
                {
                  \"type\": \"INTEGER\",
                  \"nullable\": true,
                  \"name\": \"bucketid\"
                },
                {
                  \"type\": \"BIGINT\",
                  \"nullable\": true,
                  \"name\": \"rowid\"
                }
              ],
              \"nullable\": true
            },
            \"nullable\": true,
            \"name\": \"ROW__ID\"
          },
          {
            \"type\": \"BOOLEAN\",
            \"nullable\": true,
            \"name\": \"ROW__IS__DELETED\"
          },
          {
            \"type\": \"INTEGER\",
            \"nullable\": true,
            \"name\": \"PARTITION__SPEC__ID\"
          },
          {
            \"type\": \"BIGINT\",
            \"nullable\": true,
            \"name\": \"PARTITION__HASH\"
          },
          {
            \"type\": \"VARCHAR\",
            \"nullable\": true,
            \"precision\": 2147483647,
            \"name\": \"FILE__PATH\"
          },
          {
            \"type\": \"BIGINT\",
            \"nullable\": true,
            \"name\": \"ROW__POSITION\"
          },
          {
            \"type\": \"VARCHAR\",
            \"nullable\": true,
            \"precision\": 2147483647,
            \"name\": \"PARTITION__PROJECTION\"
          },
          {
            \"type\": \"BIGINT\",
            \"nullable\": true,
            \"name\": \"SNAPSHOT__ID\"
          }
        ],
        \"nullable\": false
      },
      \"colStats\": [
        {
          \"name\": \"a\",
          \"ndv\": 1,
          \"minValue\": -9223372036854775808,
          \"maxValue\": 9223372036854775807
        },
        {
          \"name\": \"b\",
          \"ndv\": 1,
          \"minValue\": -9223372036854775808,
          \"maxValue\": 9223372036854775807
        },
        {
          \"name\": \"c\",
          \"ndv\": 1,
          \"minValue\": -9223372036854775808,
          \"maxValue\": 9223372036854775807
        },
        {
          \"name\": \"d\",
          \"ndv\": 1
        }
      ]
    },
    {
      \"id\": \"1\",
      \"relOp\": \"org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveProject\",
      \"fields\": [
        \"a\",
        \"b\",
        \"c\",
        \"d\"
      ],
      \"exprs\": [
        {
          \"input\": 0,
          \"name\": \"$0\"
        },
        {
          \"input\": 1,
          \"name\": \"$1\"
        },
        {
          \"input\": 2,
          \"name\": \"$2\"
        },
        {
          \"input\": 3,
          \"name\": \"$3\"
        }
      ],
      \"rowCount\": 1.0
    }
  ]
  }","optimizedSQL":"SELECT `a`, `b`, `c`, `d`
FROM `default`.`test`","cboInfo":"Plan optimized by CBO.","STAGE DEPENDENCIES":{"Stage-0":{"ROOT STAGE":"TRUE"}},"STAGE PLANS":{"Stage-0":{"Fetch Operator":{"limit:":"-1","Processor Tree:":{"TableScan":{"alias:":"test","columns:":["a","b","c","d"],"database:":"default","table:":"test","isTempTable:":"false","OperatorId:":"TS_0","children":{"Select Operator":{"expressions:":"a (type: int), b (type: int), c (type: int), d (type: struct<e:string,f:int>)","columnExprMap:":{"_col0":"a","_col1":"b","_col2":"c","_col3":"d"},"outputColumnNames:":["_col0","_col1","_col2","_col3"],"OperatorId:":"SEL_1","children":{"ListSink":{"OperatorId:":"LIST_SINK_3"}}}}}}}}}}

@zabetak (Member) left a comment:

LGTM, pending tests! Left some nit comments but can be merged even without addressing those.

@@ -151,14 +151,14 @@ STAGE PLANS:
  predicate: UDFToDouble(key) is not null (type: boolean)
  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
  Select Operator
- expressions: key (type: int)
+ expressions: UDFToDouble(key) (type: double)


I am curious why the HBase (and Kudu, in another file) column type changed from int to double. The stats above show 4 bytes, which indicates an integer.

@soumyakanti3578 (PR author):

With this PR, we don't see any virtual columns for the HBase table, whereas earlier, virtual columns for native tables were being wrongly added to HBase tables. This affects the plan in the CBO, especially in RelFieldTrimmer.

With the PR, the plan after RelFieldTrimmer is:

2025-05-09T12:06:15,305 DEBUG [0de9da36-a1b4-4747-8aa3-b84858aed485 main] rules.RelFieldTrimmer: Plan after trimming unused fields
HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[20])
  HiveProject(key=[$1], value=[$2])
    HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[not available])
      HiveProject(EXPR$0=[CAST($0):DOUBLE])
        HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
          HiveProject(key=[$0])
            HiveTableScan(table=[[default, hbase_table_1]], table:alias=[hbase_table_1])
      HiveProject(key=[$0], value=[$1], EXPR$0=[CAST($0):DOUBLE])
        HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
          HiveProject(key=[$0], value=[$1])
            HiveTableScan(table=[[default, src]], table:alias=[src])

whereas earlier the plan was:

2025-05-09T12:11:50,457 DEBUG [bd84dbdb-d563-43d7-a012-3132221196b4 main] rules.RelFieldTrimmer: Plan after trimming unused fields
HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[20])
  HiveProject(key=[$1], value=[$2])
    HiveJoin(condition=[=(CAST($0):DOUBLE, CAST($1):DOUBLE)], joinType=[inner], algorithm=[none], cost=[not available])
      HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
        HiveProject(key=[$0])
          HiveTableScan(table=[[default, hbase_table_1]], table:alias=[hbase_table_1])
      HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
        HiveProject(key=[$0], value=[$1])
          HiveTableScan(table=[[default, src]], table:alias=[src])

We can see that the CASTs are now pushed down below the Join.
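To illustrate why pushing the cast below the join can matter, here is a toy sketch in plain Java (not Calcite code; the class and method names are invented for illustration). In the old plan shape the join condition re-evaluates CAST($0):DOUBLE for every row pair; in the new plan shape each input projects the cast once per row, and the join compares the precomputed columns. Both shapes produce the same join result.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the two plan shapes; not Calcite code.
public class CastPushdown {
    // Old shape: cast sits inside the join condition, evaluated per comparison.
    static List<double[]> joinWithCastInCondition(List<Integer> left, List<Integer> right) {
        List<double[]> out = new ArrayList<>();
        for (int l : left)
            for (int r : right)
                if ((double) l == (double) r)        // cast evaluated for every pair
                    out.add(new double[]{l, r});
        return out;
    }

    // New shape: each side projects the cast once, the join compares columns.
    static List<double[]> joinWithCastPushedDown(List<Integer> left, List<Integer> right) {
        List<double[]> lp = new ArrayList<>();
        for (int k : left) lp.add(new double[]{k, (double) k});   // project EXPR$0 once
        List<double[]> rp = new ArrayList<>();
        for (int k : right) rp.add(new double[]{k, (double) k});
        List<double[]> out = new ArrayList<>();
        for (double[] l : lp)
            for (double[] r : rp)
                if (l[1] == r[1])                    // compare precomputed columns
                    out.add(new double[]{l[0], r[0]});
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = List.of(1, 2, 3);
        List<Integer> right = List.of(2, 3, 4);
        System.out.println(joinWithCastInCondition(left, right).size());  // 2
        System.out.println(joinWithCastPushedDown(left, right).size());   // 2
    }
}
```

With N left rows and M right rows, the old shape evaluates the cast O(N*M) times while the new shape evaluates it O(N+M) times; the precomputed column can also feed hash-join key construction directly.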


The new plan, with the cast of 'key' to double pushed down on both sides of the join, looks good to me, although I don't know why the extraneous virtual columns would have prevented the pushdown in the past, since the cast is on a user column. But that's a separate topic.

@soumyakanti3578 (PR author)

@zabetak @kasakrisz @amansinha100
I have addressed all the comments and the tests are finally passing.
Please review and merge if everything looks good.

@amansinha100 left a comment:

+1


@zabetak zabetak merged commit 0e287d4 into apache:master May 12, 2025
4 checks passed
@zabetak (Member) commented May 12, 2025

Thanks for the PR @soumyakanti3578 and for the reviews @amansinha100 and @kasakrisz !
