[HUDI-8182]Cache internalSchema for hive read, avoid each split reloa… #11914

muyihao · 2024-09-08T10:32:59Z

…d active timeline.

Change Logs

Addressing this issue: 11723
When using MR to read Hudi, use a global InternalSchema to avoid each reader listing the metadata directory to obtain the InternalSchema, which places a significant load on the HDFS NameNode.
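
To illustrate the idea described above, here is a minimal sketch. It is not the code from this PR: a plain `Map<String, String>` stands in for the Hadoop `JobConf`, and `SCHEMA_KEY_PREFIX`, `loadSchemaFromStorage`, `planSplits`, and `readerSchema` are hypothetical names. The schema is resolved once per table while splits are planned, and each reader then looks it up in the configuration instead of listing the metadata directory again.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SchemaCacheSketch {
    // Hypothetical key prefix; the PR defines its own constants.
    static final String SCHEMA_KEY_PREFIX = "sketch.internal.schema.cache";

    // Counts simulated metadata listings (stand-in for NameNode load).
    static final AtomicInteger listings = new AtomicInteger();

    // Stand-in for the expensive schema resolution against storage.
    static String loadSchemaFromStorage(String table) {
        listings.incrementAndGet();
        return "{\"table\":\"" + table + "\"}";
    }

    // Split planning: resolve each table's schema once and publish it in the conf.
    static void planSplits(Map<String, String> conf, Iterable<String> tables) {
        for (String table : tables) {
            conf.computeIfAbsent(SCHEMA_KEY_PREFIX + "." + table,
                key -> loadSchemaFromStorage(table));
        }
    }

    // Reader side: use the cached schema if the key exists, else fall back to storage.
    static String readerSchema(Map<String, String> conf, String table) {
        String cached = conf.get(SCHEMA_KEY_PREFIX + "." + table);
        return cached != null ? cached : loadSchemaFromStorage(table);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        planSplits(conf, List.of("db.t1", "db.t2", "db.t1"));
        // Simulate 100 readers (one per split) for the same table.
        for (int i = 0; i < 100; i++) {
            readerSchema(conf, "db.t1");
        }
        System.out.println("storage loads: " + listings.get()); // prints "storage loads: 2"
    }
}
```

In this sketch the number of storage loads depends on the number of distinct tables, not on the number of splits.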

Impact

Reduced metadata listing frequency to alleviate pressure on the HDFS NameNode.

Risk level (write none, low medium or high below)

low

Documentation Update

N/A

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

danny0405 · 2024-09-08T23:26:44Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java

@@ -375,6 +389,45 @@ public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // clear work from ThreadLocal after splits generated in case of thread is reused in pool.
    Utilities.clearWorkMapForConf(job);

+    // build internal schema for the query
+    if (job.getBoolean(INTERNAL_SCHEMA_CACHE_ENABLE, true)) {


Should it always be true because this is a deterministic optimization?

yes, you are right

danny0405 · 2024-09-08T23:27:06Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java

+          HoodieStorage storage = null;
+          try {
+            storage = new HoodieHadoopStorage(path.getFileSystem(job));
+            Option<StoragePath> tablePath = TablePathUtils.getTablePath(storage, HadoopFSUtils.convertToStoragePath(path));


In which case there are multiple table paths?

When we perform a query on two different Hudi tables, such as union, join, or subquery, there will be multiple table paths.

danny0405 · 2024-09-08T23:27:38Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java

+          job.set(INTERNAL_SCHEMA_CACHE_PREFIX + "." + k, v);
+        });
+
+        job.setBoolean(INTERNAL_SCHEMA_CACHE_VISIT, true);


Can we just check if the cache key exists instead?

Nice suggestion. I will make the changes as recommended

danny0405 · 2024-09-08T23:28:31Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

+      }
+    } else {
+      try {
+        TableSchemaResolver schemaUtil = new TableSchemaResolver(metaClient);


This is an invalid code path because the table itself is invalid.

Got it, I will fix it.

yihua · 2024-09-11T05:52:23Z

@muyihao any update on addressing the comments in the PR?

muyihao · 2024-09-11T13:25:56Z

Currently working on verifying whether HoodieCombineHiveInputFormat can handle multiple Hudi tables and making modifications based on the comments.

danny0405 · 2024-09-17T00:28:54Z

Hi, @muyihao Is this ready for review now?

muyihao · 2024-09-17T09:03:03Z

@danny0405 Yes, sorry for the late reply. Thank you for your help with the review. Currently, the modifications cache the latest schema for different tables when there are multiple table paths. However, this does not completely avoid loading the ActiveTimeline. When reading COW, it will still trigger the loading of the active timeline when searching for the fileSchema.

danny0405 · 2024-09-18T01:21:58Z

When reading COW, it will still trigger the loading of the active timeline when searching for the fileSchema.

Can you show us the line that triggers this logic?

muyihao · 2024-09-18T12:07:45Z

InternalSchemaCache#searchSchemaAndCache will trigger InternalSchemaCache#getSchemaByReadingCommitFile or InternalSchemaCache#getHistoricalSchemas, both of which may load the active timeline.

danny0405 · 2024-09-19T00:58:47Z

@muyihao Thanks for pointing it out. metaClient.getActiveTimeline returns the cached timeline first, so for one metaClient instance the listing happens only once. I guess it is not a problem if the metaClient can be shared across multiple input splits.
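
The caching behaviour described in this comment can be pictured with a small sketch. This is not Hudi's implementation: `MetaClientStandIn` and its `lister` are hypothetical stand-ins, and the only point is that a lazily cached timeline is listed once per instance, so the saving depends on sharing the instance.

```java
import java.util.List;
import java.util.function.Supplier;

public class CachedTimelineSketch {
    // Hypothetical stand-in for a meta client that caches its timeline.
    static final class MetaClientStandIn {
        private final Supplier<List<String>> lister; // simulated storage listing
        private List<String> activeTimeline;         // null until first load
        int listCalls = 0;

        MetaClientStandIn(Supplier<List<String>> lister) {
            this.lister = lister;
        }

        // Returns the cached timeline if present; lists storage only on first use.
        synchronized List<String> getActiveTimeline() {
            if (activeTimeline == null) {
                listCalls++;
                activeTimeline = lister.get();
            }
            return activeTimeline;
        }
    }

    public static void main(String[] args) {
        Supplier<List<String>> lister = () -> List.of("001.commit", "002.commit");

        // One shared instance: three splits trigger a single listing.
        MetaClientStandIn shared = new MetaClientStandIn(lister);
        for (int split = 0; split < 3; split++) {
            shared.getActiveTimeline();
        }
        System.out.println("shared instance listings: " + shared.listCalls); // prints 1

        // One instance per split: every split pays for its own listing.
        int total = 0;
        for (int split = 0; split < 3; split++) {
            MetaClientStandIn perSplit = new MetaClientStandIn(lister);
            perSplit.getActiveTimeline();
            total += perSplit.listCalls;
        }
        System.out.println("per-split instance listings: " + total); // prints 3
    }
}
```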

BTW, while reviewing the code, I found a bug in FileBasedInternalSchemaStorageManager: the passed-in metaClient does not always work if its base path does not end with .schema. We should fix it like this:

  public FileBasedInternalSchemaStorageManager(HoodieTableMetaClient metaClient) {
    this.baseSchemaPath = new StoragePath(metaClient.getMetaPath(), SCHEMA_NAME);
    this.storage = metaClient.getStorage();
    this.metaClient = metaClient.getBasePath().getName().equalsIgnoreCase(SCHEMA_NAME) ? metaClient : null;
  }

muyihao · 2024-09-20T12:59:57Z

@danny0405
Thank you for your reply. I also tried to serialize the metaClient and put it into the conf, but found that the StoragePath serialization fails.

Do you have any better ideas?

danny0405 · 2024-09-21T04:01:19Z

@muyihao Thanks, I have applied a patch against the master code (instead of your patch) to address the timeline listing issue. You can merge the patch first and then add your job conf setup logic; it should then work:

8182.patch.zip

muyihao · 2024-09-21T09:04:03Z

@danny0405 Thanks for the help, I have merged it.

danny0405 · 2024-09-21T09:18:51Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java

+          if (schema.isPresent()) {
+            LOG.info("Set internal schema and avro schema of path: " + path.toString());
+            job.set(INTERNAL_SCHEMA_CACHE_KEY_PREFIX + "." + path, SerDeHelper.toJson(schema.get()));
+            job.set(SCHEMA_CACHE_KEY_PREFIX + "." + path, schemaUtil.getTableAvroSchema().toString());


Do we cache the schema by path or by table name? The table name looks more straightforward. Also, schemaUtil.getTableAvroSchema() should be invoked first to set up the commit metadata cache in TableSchemaResolver and avoid redundant commit metadata deserialization.

@danny0405 That's a great suggestion. When we handle tables from two different databases together, there might be the same table name. For example, dev.table_xx and online.table_xx, where table_xx is the same. Should we consider this case?

I think we can use the table name for now and extend it to the path if needed.
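
The trade-off discussed in this thread can be sketched as follows. The prefix and both helper methods are hypothetical and only illustrate the concern raised above: keys built from the bare table name collide when two databases contain a table with the same name, while a more specific key (a qualified name here, or the table path) keeps the entries apart.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheKeySketch {
    // Hypothetical prefix, not the constant used in the PR.
    static final String PREFIX = "sketch.schema.cache";

    // Key built from the bare table name only.
    static String keyByTableName(String table) {
        return PREFIX + "." + table;
    }

    // Key built from database and table name (hypothetical alternative).
    static String keyByQualifiedName(String database, String table) {
        return PREFIX + "." + database + "." + table;
    }

    public static void main(String[] args) {
        Map<String, String> byName = new HashMap<>();
        byName.put(keyByTableName("table_xx"), "schema of dev.table_xx");
        // Same bare name: the second entry overwrites the first.
        byName.put(keyByTableName("table_xx"), "schema of online.table_xx");
        System.out.println("entries keyed by table name: " + byName.size()); // prints 1

        Map<String, String> byQualifiedName = new HashMap<>();
        byQualifiedName.put(keyByQualifiedName("dev", "table_xx"), "schema of dev.table_xx");
        byQualifiedName.put(keyByQualifiedName("online", "table_xx"), "schema of online.table_xx");
        System.out.println("entries keyed by qualified name: " + byQualifiedName.size()); // prints 2
    }
}
```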

danny0405 · 2024-09-21T09:22:54Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

      return;
    }
+    this.metaClient = metaClientOption.isPresent() ? metaClientOption.get() : setUpHoodieTableMetaClient();
+    this.internalSchemaOption = getInternalSchemaFromCache();
+    LOG.info("finish init schema evolution for split: {}", split);


Do we need this line for debugging? It does not look very helpful; maybe we can switch it to DEBUG level or just remove it.

danny0405 · 2024-09-21T09:23:23Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

+        json -> SerDeHelper.fromJson(json)
+    );
+    if (internalSchema != null && internalSchema.isPresent()) {
+      LOG.info("get internal schema from conf for split: {}" + split);


Ditto, I think we should remove this line.

Will remove it

danny0405 · 2024-09-21T09:25:46Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java

@@ -139,7 +186,10 @@ public void doEvolutionForRealtimeInputFormat(AbstractRealtimeRecordReader realt
      return;
    }
    if (internalSchemaOption.isPresent()) {
-      Schema tableAvroSchema = new TableSchemaResolver(metaClient).getTableAvroSchema();
+      Schema tableAvroSchema = getAvroSchemaFromCache();
+      if (tableAvroSchema == null) {


When internalSchemaOption.isPresent(), the result of getAvroSchemaFromCache() should also exist; this is guaranteed by the cache.

Yes, it seems a bit redundant, I will remove it.

danny0405 · 2024-09-22T00:52:04Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java

+        for (String path : uniqTablePaths) {
+          HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setBasePath(path).setConf(new HadoopStorageConfiguration(job)).build();
+          TableSchemaResolver schemaUtil = new TableSchemaResolver(metaClient);
+          String avroSchema = schemaUtil.getTableAvroSchema().toString();


yeah, the cache would be utilized.

hudi-bot · 2024-09-22T04:14:48Z

CI report:

2d84829 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

danny0405 · 2024-09-22T12:54:05Z

The test failures are not related, will merge it.

muyihao · 2024-09-23T00:47:34Z

Thanks for helping to land this PR. :)

github-actions bot added the size:M PR with lines of changes in (100, 300] label Sep 8, 2024

muyihao force-pushed the feat/cache-internal-schema-for-mr-read branch from 018164e to 6b5667d Compare September 8, 2024 10:43

danny0405 reviewed Sep 8, 2024

muyihao force-pushed the feat/cache-internal-schema-for-mr-read branch 2 times, most recently from 59d00b6 to a2a07b6 Compare September 16, 2024 03:28

muyihao force-pushed the feat/cache-internal-schema-for-mr-read branch from a2a07b6 to aacb48d Compare September 21, 2024 09:01

danny0405 reviewed Sep 21, 2024

muyihao force-pushed the feat/cache-internal-schema-for-mr-read branch from aacb48d to 19bed8b Compare September 21, 2024 10:13

[HUDI-8182] Cache internalSchema for hive read, avoid each split relo…

0e785c5

…ad active timeline.

muyihao force-pushed the feat/cache-internal-schema-for-mr-read branch from 19bed8b to 0e785c5 Compare September 21, 2024 10:16

danny0405 reviewed Sep 22, 2024

fix test failures

2d84829

danny0405 approved these changes Sep 22, 2024

danny0405 merged commit 77eb9e5 into apache:master Sep 22, 2024
42 of 43 checks passed
