[Debugify] Improve reduction of debugify coverage build output #150212

Open
wants to merge 3 commits into
base: main

SLTozer
Contributor

@SLTozer SLTozer commented Jul 23, 2025

In current DebugLoc coverage builds, the output for any reasonably large build can become very large if any missing DebugLocs are present; this happens because single errors in LLVM may result in many errors being reported in the output report. The main cause of this is that the empty locations attached to instructions may be propagated to other instructions in later passes, which will each be reported as new errors. This patch prevents this by adding an "unknown" annotation to instructions after reporting them once, ensuring that any other DebugLocs copied or derived from the original empty location will not be marked as new errors.
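
To illustrate the propagation problem described above, here is a toy Python model. It is only a sketch under stated assumptions: the instruction names, the three passes, and the `UNKNOWN` marker are hypothetical stand-ins and not LLVM APIs. The check loosely mirrors the idea of reporting an instruction whose location is empty after a pass unless it was already empty before that pass.

```python
# Toy model only: the names, passes and UNKNOWN marker are hypothetical.
UNKNOWN = "unknown"  # stands in for an annotated "unknown" location


def run_pipeline(annotate):
    """Run three toy passes and return the list of (pass, instr) reports."""
    instrs = {"add": "a.c:1"}  # instruction name -> location (None = empty)
    reports = []
    passes = [
        # Pass A drops the location: this is the single real bug.
        ("A", lambda: instrs.update(add=None)),
        # Passes B and C create new instructions that copy an existing location.
        ("B", lambda: instrs.update(copy_B=instrs["add"])),
        ("C", lambda: instrs.update(copy_C=instrs["copy_B"])),
    ]
    for name, transform in passes:
        before = dict(instrs)
        transform()
        for instr, loc in list(instrs.items()):
            if loc is not None:
                continue
            if instr in before and before[instr] is None:
                continue  # already empty before this pass, not a new error
            reports.append((name, instr))
            if annotate:
                # Annotate after reporting once, so that copies of this
                # location are not reported again by later passes.
                instrs[instr] = UNKNOWN
    return reports


without_annotation = run_pipeline(annotate=False)
with_annotation = run_pipeline(annotate=True)
```

In this model the un-annotated run reports one error per pass, while the annotated run reports only the original drop in pass A.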

As a separate but related change, this patch updates the report generation script to deduplicate results using the recorded stacktrace if it is available, instead of the pass+instruction combination. This reduces how aggressively the report is reduced, but makes the reduction highly reliable, as the stacktrace allows us to identify very precisely when two bugs have originated from the same place.
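
As a rough sketch of the stacktrace-based key, the standalone function below follows the same logic as the `reduced_key` method added to the script in this patch: strip pointer values from the stacktrace, and fall back to pass+instruction when no stacktrace was recorded. The two stacktrace strings are invented for illustration; their format and symbol names are assumptions, not real tool output.

```python
import re


def reduced_key(origin, bug_pass, instr):
    """Return a deduplication key for a DILocation bug."""
    if origin is not None:
        # Pointer values differ between runs and files, so remove them
        # before comparing stacktraces.
        return re.sub(r"0x[0-9a-fA-F]+", "", origin)
    # No stacktrace recorded: fall back to pass + instruction.
    return bug_pass + instr


# Hypothetical stacktraces for the same origin, captured in two different
# files; only the addresses differ.
trace_a = "#0 0x000055d1c2a3 llvm::Foo::bar()\n#1 0x000055d1c2ff llvm::runPass()"
trace_b = "#0 0x00007f3310b3 llvm::Foo::bar()\n#1 0x00007f3310ef llvm::runPass()"

seen = set()
kept = []
for origin, bug_pass, instr in [
    (trace_a, "instcombine", "add"),
    (trace_b, "licm", "sub"),  # same origin as above, so it is dropped
    (None, "licm", "add"),  # no stacktrace, keyed by pass + instruction
]:
    key = reduced_key(origin, bug_pass, instr)
    if key not in seen:
        seen.add(key)
        kept.append((bug_pass, instr))
```

Note that with a stacktrace present, the pass and instruction names no longer take part in the key, so two reports that differ in both can still be merged if they share an origin.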

@llvmbot
Member

llvmbot commented Jul 23, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-debuginfo

Author: Stephen Tozer (SLTozer)

Full diff: https://github.com/llvm/llvm-project/pull/150212.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/Debugify.cpp (+9)
  • (modified) llvm/utils/llvm-original-di-preservation.py (+33-17)
diff --git a/llvm/lib/Transforms/Utils/Debugify.cpp b/llvm/lib/Transforms/Utils/Debugify.cpp
index 4210ce6da1eb2..39ff6e94510fb 100644
--- a/llvm/lib/Transforms/Utils/Debugify.cpp
+++ b/llvm/lib/Transforms/Utils/Debugify.cpp
@@ -707,6 +707,15 @@ bool llvm::checkDebugInfoMetadata(Module &M,
       DILocsBefore, DILocsAfter, InstToDelete, NameOfWrappedPass,
       FileNameFromCU, ShouldWriteIntoJSON, Bugs);
 
+#if LLVM_ENABLE_DEBUGLOC_TRACKING_COVERAGE
+  // If we are tracking DebugLoc coverage, replace each empty DebugLoc with an
+  // annotated location now so that it does not show up in future passes even if
+  // it is propagated to other instructions.
+  for (const auto &L : DILocsAfter)
+    if (!L.second)
+      L.first->setDebugLoc(DebugLoc::getUnknown());
+#endif
+
   bool ResultForVars = checkVars(DIVarsBefore, DIVarsAfter, NameOfWrappedPass,
                                  FileNameFromCU, ShouldWriteIntoJSON, Bugs);
 
diff --git a/llvm/utils/llvm-original-di-preservation.py b/llvm/utils/llvm-original-di-preservation.py
index b5ccd7a3224f8..cdca2ec8153b8 100755
--- a/llvm/utils/llvm-original-di-preservation.py
+++ b/llvm/utils/llvm-original-di-preservation.py
@@ -6,6 +6,7 @@
 from __future__ import print_function
 import argparse
 import os
+import re
 import sys
 from json import loads
 from collections import defaultdict
@@ -22,6 +23,14 @@ def __init__(self, origin, action, bb_name, fn_name, instr):
     def key(self):
         return self.action + self.bb_name + self.fn_name + self.instr
 
+    def reduced_key(self, bug_pass):
+        if self.origin is not None:
+            # If we have the origin stacktrace available, we can use it to efficiently deduplicate identical errors. We
+            # just need to remove the pointer values from the string first, so that we can deduplicate across files.
+            origin_no_addr = re.sub(r"0x[0-9a-fA-F]+", "", self.origin)
+            return origin_no_addr
+        return bug_pass + self.instr
+
     def to_dict(self):
         result = {
             "instr": self.instr,
@@ -42,6 +51,9 @@ def __init__(self, action, fn_name):
     def key(self):
         return self.action + self.fn_name
 
+    def reduced_key(self, bug_pass):
+        return bug_pass + self.fn_name
+
     def to_dict(self):
         return {
             "fn_name": self.fn_name,
@@ -58,6 +70,9 @@ def __init__(self, action, name, fn_name):
     def key(self):
         return self.action + self.name + self.fn_name
 
+    def reduced_key(self, bug_pass):
+        return bug_pass + self.name
+
     def to_dict(self):
         return {
             "fn_name": self.fn_name,
@@ -478,7 +493,11 @@ def get_json_chunk(file, start, size):
 # Parse the program arguments.
 def parse_program_args(parser):
     parser.add_argument("file_name", type=str, help="json file to process")
-    parser.add_argument("--reduce", action="store_true", help="create reduced report")
+    parser.add_argument(
+        "--reduce",
+        action="store_true",
+        help="create reduced report by deduplicating bugs within and across files",
+    )
 
     report_type_group = parser.add_mutually_exclusive_group(required=True)
     report_type_group.add_argument(
@@ -523,13 +542,10 @@ def Main():
     di_sp_bugs_summary = OrderedDict()
     di_var_bugs_summary = OrderedDict()
 
-    # Compress similar bugs.
-    # DILocBugs with same pass & instruction name.
-    di_loc_pass_instr_set = set()
-    # DISPBugs with same pass & function name.
-    di_sp_pass_fn_set = set()
-    # DIVarBugs with same pass & variable name.
-    di_var_pass_var_set = set()
+    # If we are using --reduce, use these sets to deduplicate similar bugs within and across files.
+    di_loc_reduced_set = set()
+    di_sp_reduced_set = set()
+    di_var_reduced_set = set()
 
     start_line = 0
     chunk_size = 1000000
@@ -585,9 +601,9 @@ def Main():
                     if not di_loc_bug.key() in di_loc_set:
                         di_loc_set.add(di_loc_bug.key())
                         if opts.reduce:
-                            pass_instr = bugs_pass + instr
-                            if not pass_instr in di_loc_pass_instr_set:
-                                di_loc_pass_instr_set.add(pass_instr)
+                            reduced_key = di_loc_bug.reduced_key(bugs_pass)
+                            if not reduced_key in di_loc_reduced_set:
+                                di_loc_reduced_set.add(reduced_key)
                                 di_loc_bugs.append(di_loc_bug)
                         else:
                             di_loc_bugs.append(di_loc_bug)
@@ -608,9 +624,9 @@ def Main():
                     if not di_sp_bug.key() in di_sp_set:
                         di_sp_set.add(di_sp_bug.key())
                         if opts.reduce:
-                            pass_fn = bugs_pass + name
-                            if not pass_fn in di_sp_pass_fn_set:
-                                di_sp_pass_fn_set.add(pass_fn)
+                            reduced_key = di_sp_bug.reduced_key(bugs_pass)
+                            if not reduced_key in di_sp_reduced_set:
+                                di_sp_reduced_set.add(reduced_key)
                                 di_sp_bugs.append(di_sp_bug)
                         else:
                             di_sp_bugs.append(di_sp_bug)
@@ -632,9 +648,9 @@ def Main():
                     if not di_var_bug.key() in di_var_set:
                         di_var_set.add(di_var_bug.key())
                         if opts.reduce:
-                            pass_var = bugs_pass + name
-                            if not pass_var in di_var_pass_var_set:
-                                di_var_pass_var_set.add(pass_var)
+                            reduced_key = di_var_bug.reduced_key(bugs_pass)
+                            if not reduced_key in di_var_reduced_set:
+                                di_var_reduced_set.add(reduced_key)
                                 di_var_bugs.append(di_var_bug)
                         else:
                             di_var_bugs.append(di_var_bug)

github-actions bot commented Jul 23, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.
