HDFS-17916: DataStreamer#processDatanodeOrExternalError() fails to return byte arrays to ByteArrayManager #8466

Open
charlesconnell wants to merge 1 commit into apache:trunk from HubSpot:HDFS-17916

@charlesconnell

Description of PR

An error-handling code path in the DFS client's DataStreamer appears to discard DFSPacket objects without returning their contained byte arrays to the ByteArrayManager. I discovered this bug at my company after HBase server threads hung for hours in ByteArrayManager#allocate(). Because the leak only happens in an error-handling path, it only surfaces when the HDFS cluster is unhealthy.

I took a heap dump of a high-uptime but relatively healthy HBase server, and found evidence of leaked byte arrays there too. In the heap dump, the two FixedLengthManagers both had numAllocated = 9, but there were zero live DFSPacket objects. This suggests that the byte arrays and their containing DFSPackets had been garbage collected without FixedLengthManager ever being told.
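
To illustrate how a counter like numAllocated can stay elevated while no packets are alive, here is a minimal, self-contained sketch. It is not Hadoop code: `ToyPool`, `LeakSketch`, and `outstandingAfter` are hypothetical names, and the toy pool returns null when exhausted instead of blocking the way the hung ByteArrayManager#allocate() calls described above did.

```java
import java.util.concurrent.Semaphore;

// Toy model of a fixed-limit buffer pool. Hypothetical; NOT Hadoop's ByteArrayManager.
class LeakSketch {
  static final class ToyPool {
    private final int limit;
    private final Semaphore permits;

    ToyPool(int limit) {
      this.limit = limit;
      this.permits = new Semaphore(limit);
    }

    // Hands out a buffer, or null when the limit is reached.
    // (Assumption for the sketch: a blocking pool would wait here instead.)
    byte[] allocate(int size) {
      return permits.tryAcquire() ? new byte[size] : null;
    }

    // The pool only learns a buffer is free if the caller says so.
    void release(byte[] buf) {
      permits.release();
    }

    int numAllocated() {
      return limit - permits.availablePermits();
    }
  }

  // Runs a simulated error path up to `times` times and reports how many
  // buffers the pool still believes are in use afterwards.
  static int outstandingAfter(int limit, int times, boolean releaseOnErrorPath) {
    ToyPool pool = new ToyPool(limit);
    for (int i = 0; i < times; i++) {
      byte[] buf = pool.allocate(16);
      if (buf == null) {
        break; // pool exhausted: a blocking allocate() would hang at this point
      }
      if (releaseOnErrorPath) {
        pool.release(buf);
      }
      // Without the release, buf becomes garbage but the pool still counts it.
    }
    return pool.numAllocated();
  }

  public static void main(String[] args) {
    System.out.println("without release: " + outstandingAfter(3, 5, false));
    System.out.println("with release:    " + outstandingAfter(3, 5, true));
  }
}
```

In this sketch, dropping the buffer without calling `release` leaves the pool's count at its limit even though nothing references the arrays any more, which matches the shape of the heap-dump observation.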

In DataStreamer.java starting at line 1410, the DFSPacket that is remove()'d from dataQueue is left to be garbage collected without its buffer ever being released.

    if (!streamerClosed && dfsClient.clientRunning) {
      if (stage == BlockConstructionStage.PIPELINE_CLOSE) {
        // If we had an error while closing the pipeline, we go through a fast-path
        // where the BlockReceiver does not run. Instead, the DataNode just finalizes
        // the block immediately during the 'connect ack' process. So, we want to pull
        // the end-of-block packet from the dataQueue, since we don't actually have
        // a true pipeline to send it over.
        //
        // We also need to set lastAckedSeqno to the end-of-block Packet's seqno, so that
        // a client waiting on close() will be aware that the flush finished.
        synchronized (dataQueue) {
          DFSPacket endOfBlockPacket = dataQueue.remove();  // remove the end of block packet
          // Close any trace span associated with this Packet
          Span span = endOfBlockPacket.getSpan();
          if (span != null) {
            span.finish();
            endOfBlockPacket.setSpan(null);
          }
          assert endOfBlockPacket.isLastPacketInBlock();
          assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
          lastAckedSeqno = endOfBlockPacket.getSeqno();
          pipelineRecoveryCount = 0;
          dataQueue.notifyAll();
        }
        endBlock();
      } else {
        initDataStreaming();
      }
    } 

This PR adds this line in order to return the packet's buffer to the ByteArrayManager:

endOfBlockPacket.releaseBuffer(byteArrayManager);

Contains content generated by Claude Opus 4.7

How was this patch tested?

New unit tests added

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

AI Tooling

If an AI tool was used:

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|------|-----------|---------|---------|---------|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | docker | 0m 32s | | Docker failed to build run-specific yetus/hadoop:tp-2658}. |

| Subsystem | Report/Notes |
|-----------|--------------|
| GITHUB PR | #8466 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8466/1/console |
| versions | git=2.34.1 |
| Powered by | Apache Yetus 0.14.1 https://yetus.apache.org |

This message was automatically generated.

@charlesconnell
Author

Yetus failure appears unrelated to this PR

Contributor

@ZanderXu ZanderXu left a comment

LGTM +1
