feat: topic graph quality scoring and auto-prune pipeline #607
Conversation
Add GTD/LTD embedding-based quality metrics for topic graphs with CLI commands for scoring, threshold optimization, and prune overlay on inspect.
Summary of Changes

Hello @scp7, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces experimental features for evaluating and optimizing topic graph quality. It provides new CLI commands to score topic graphs based on embedding-based drift metrics (Global Topic Drift and Local Topic Drift) and to optimize the thresholds for these metrics. Additionally, the existing `inspect` command gains a prune overlay.
Code Review
This pull request introduces significant new features for topic graph quality analysis, including `topic score` and `topic optimize-thresholds` commands. The changes are well-structured, with a new `topic_quality.py` module containing the core logic, updates to the CLI, and corresponding documentation and tests. The implementation is generally robust, with good error handling and optimizations like memoization. I have one suggestion to improve the robustness of the embedding generation logic.
Replace the 3-step GTD/LTD pipeline with a simplified 4-step cascade using renamed metrics: `global_coherence`, `parent_coherence`, `sibling_coherence`.

Pipeline steps (each operates on surviving nodes from prior steps):
1. `global_coherence` < 0 (hardcoded gate)
2. `parent_coherence` < threshold (default 0.25)
3. `sibling_coherence` < lower threshold (default 0.20, outliers)
4. `sibling_coherence` > upper threshold (default 0.68, repetitive)

Changes:
- Add `_compute_sibling_coherence_by_id()` to `topic_quality.py`
- Rewrite `_evaluate_thresholds()` with 4-step cascade logic
- Update CLI flags for `topic score` and `optimize-thresholds` commands
- Add per-step removal breakdown to CLI score output
- Move `sentence-transformers` to optional `[scoring]` extra
- Update tests with 6-node fixture and new metric assertions
- Rewrite topic-score.md and topic-optimize-thresholds.md docs

Verified: identical output to the research reference script on seo-graph-10tools-single-5depth.json (3906 nodes, 1545 removed, per-node metric diffs < 3e-7).
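For reviewers skimming the commit message, the cascade amounts to a chain of filters where each threshold is applied only to the survivors of the previous step. A minimal sketch, assuming nodes carry precomputed coherence scores; the `Node` shape and `cascade_prune` name are illustrative, not the PR's actual API:

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int
    global_coherence: float
    parent_coherence: float
    sibling_coherence: float


def cascade_prune(
    nodes: list[Node],
    parent_thr: float = 0.25,
    sib_lower: float = 0.20,
    sib_upper: float = 0.68,
) -> list[Node]:
    steps = [
        lambda n: n.global_coherence >= 0,           # 1. hardcoded gate
        lambda n: n.parent_coherence >= parent_thr,  # 2. drifted from parent
        lambda n: n.sibling_coherence >= sib_lower,  # 3. outlier among siblings
        lambda n: n.sibling_coherence <= sib_upper,  # 4. repetitive among siblings
    ]
    survivors = nodes
    for keep in steps:
        # Each step only sees nodes that survived all prior steps
        survivors = [n for n in survivors if keep(n)]
    return survivors
```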
… strategy

Wire topic scoring into the `generate` pipeline via `ScoringConfig` in YAML. When `scoring.prune` is true, the pruned graph is saved as a `_scored.json` derived artifact while the original graph is preserved untouched. Score-only mode (`prune: false`) reports metrics without modifying the graph. Includes sibling coherence threshold validation, a root node guard, and an in-memory graph path for score-only mode.
Document the `topics.scoring` YAML section in the configuration reference, add a Topic Scoring and Pruning section to the `generate` command docs, and cross-link from topic-score.md back to the config reference.
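For orientation, the fields exercised by the pipeline code in this PR suggest a config section roughly like the following; the embedding model value and exact nesting are assumptions, not copied from the docs:

```yaml
# Illustrative topics.scoring section; keys mirror the ScoringConfig fields
# used in this PR, the model name is a placeholder assumption.
topics:
  save_as: topic_graph.json
  scoring:
    prune: true                    # false = score-only mode, graph untouched
    parent_coherence: 0.25
    sibling_coherence_lower: 0.20
    sibling_coherence_upper: 0.68
    embedding_key: topic
    embedding_model: all-MiniLM-L6-v2
    save_report: true
```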
/gemini review
Code Review
This pull request introduces a significant new feature: topic graph quality scoring and an automated pruning pipeline. The changes are well-structured, with a new `topic_quality.py` module for the core logic, new CLI commands, and configuration options. The implementation preserves the original graph, which is a good design choice. I've identified a few areas for improvement, mainly around code duplication in the CLI and error handling in the new commands; addressing these would enhance maintainability. Overall, this is a solid addition to the project.
```diff
 import warnings

-from typing import Literal
+from typing import Literal, Self
```
The `Self` type was added in Python 3.11. To maintain compatibility with Python 3.10 as specified in pyproject.toml, you should import `Self` from `typing_extensions` for versions older than 3.11.
```diff
-from typing import Literal, Self
+from typing import Literal
+from typing_extensions import Self
```
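A version-gated variant of the same fix also works; since `typing_extensions` re-exports `Self` on 3.11+, either form is correct. This is just an alternative sketch, not part of the suggestion above:

```python
import sys

# Prefer the stdlib Self on 3.11+, fall back to typing_extensions on 3.10
if sys.version_info >= (3, 11):
    from typing import Self
else:
    from typing_extensions import Self
```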
```python
if scoring.prune:
    graph, report = prune_graph(
        topic_model,
        parent_coherence=scoring.parent_coherence,
        sibling_coherence_lower=scoring.sibling_coherence_lower,
        sibling_coherence_upper=scoring.sibling_coherence_upper,
        embedding_key=scoring.embedding_key,
        embedding_model=scoring.embedding_model,
    )

    summary = report["summary"]
    tui.success("Topic graph scored and pruned")
    tui.console.print(f" Nodes: {summary['original_node_count']}")
    tui.console.print(f" Flagged: {summary['flagged_node_count']}")
    tui.console.print(f" Pruned: {summary['removed_node_count']}")
    tui.console.print(f" Remaining: {summary['remaining_node_count']}")

    # Save pruned graph as derived artifact, original stays untouched
    p = Path(graph_save_path)
    scored_path = str(p.with_stem(f"{p.stem}_scored"))
    graph.save(scored_path)
    tui.info(f"Pruned graph saved to {scored_path}")
    tui.info(f"Original graph preserved at {graph_save_path}")

    if scoring.save_report:
        report_path = derive_topic_score_report_path(graph_save_path)
        write_topic_score_report(report, report_path)
        tui.info(f"Score report saved to {report_path}")

    return graph

# score-only mode: generate report without pruning
report = score_topic_graph(
    topic_model,
    parent_coherence=scoring.parent_coherence,
    sibling_coherence_lower=scoring.sibling_coherence_lower,
    sibling_coherence_upper=scoring.sibling_coherence_upper,
    embedding_key=scoring.embedding_key,
    embedding_model=scoring.embedding_model,
)

summary = report["summary"]
tui.success("Topic graph scored (prune disabled)")
tui.console.print(f" Nodes: {summary['original_node_count']}")
tui.console.print(f" Flagged: {summary['flagged_node_count']}")
tui.console.print(f" Would remove: {summary['removed_node_count']}")

if scoring.save_report:
    report_path = derive_topic_score_report_path(graph_save_path)
    write_topic_score_report(report, report_path)
    tui.info(f"Score report saved to {report_path}")

return topic_model
```
This function has a significant amount of duplicated code between the `if scoring.prune:` block and the else block for score-only mode. The arguments passed to `prune_graph` and `score_topic_graph` are identical, and the logic for saving the report is also repeated. This can be refactored to improve maintainability and reduce redundancy.
You can extract the common logic, such as preparing arguments and saving the report, to be executed outside the conditional blocks. The summary printing can also be consolidated.
```python
scoring_args = {
    "parent_coherence": scoring.parent_coherence,
    "sibling_coherence_lower": scoring.sibling_coherence_lower,
    "sibling_coherence_upper": scoring.sibling_coherence_upper,
    "embedding_key": scoring.embedding_key,
    "embedding_model": scoring.embedding_model,
}
if scoring.prune:
    graph, report = prune_graph(topic_model, **scoring_args)
    summary = report["summary"]
    tui.success("Topic graph scored and pruned")
    tui.console.print(f" Nodes: {summary['original_node_count']}")
    tui.console.print(f" Flagged: {summary['flagged_node_count']}")
    tui.console.print(f" Pruned: {summary['removed_node_count']}")
    tui.console.print(f" Remaining: {summary['remaining_node_count']}")
    # Save pruned graph as derived artifact, original stays untouched
    p = Path(graph_save_path)
    scored_path = str(p.with_stem(f"{p.stem}_scored"))
    graph.save(scored_path)
    tui.info(f"Pruned graph saved to {scored_path}")
    tui.info(f"Original graph preserved at {graph_save_path}")
    result_model = graph
else:
    # score-only mode: generate report without pruning
    report = score_topic_graph(topic_model, **scoring_args)
    summary = report["summary"]
    tui.success("Topic graph scored (prune disabled)")
    tui.console.print(f" Nodes: {summary['original_node_count']}")
    tui.console.print(f" Flagged: {summary['flagged_node_count']}")
    tui.console.print(f" Would remove: {summary['removed_node_count']}")
    result_model = topic_model
if scoring.save_report:
    report_path = derive_topic_score_report_path(graph_save_path)
    write_topic_score_report(report, report_path)
    tui.info(f"Score report saved to {report_path}")
return result_model
```

```python
def subtree_has_highlight(node_id: int, stack: set[int] | None = None) -> bool:
    if node_id in memo_has_highlight:
        return memo_has_highlight[node_id]

    local_stack = stack or set()
    if node_id in local_stack:
        return str(node_id) in highlighted_ids

    local_stack.add(node_id)
    has_self = str(node_id) in highlighted_ids
    has_descendant = any(
        subtree_has_highlight(child.id, local_stack) for child in graph.nodes[node_id].children
    )
    local_stack.remove(node_id)
    result = has_self or has_descendant
    memo_has_highlight[node_id] = result
    return result
```
This recursive function `subtree_has_highlight`, which checks whether a subtree contains a highlighted node, can be simplified. The current implementation with `local_stack` to handle cycles is more complex than it needs to be. Since you're already using a `memo_has_highlight` cache, you can leverage it more effectively to prevent re-computation and simplify the cycle detection logic. A more direct check against the cache and the `highlighted_ids` set would be cleaner.
```diff
-def subtree_has_highlight(node_id: int, stack: set[int] | None = None) -> bool:
-    if node_id in memo_has_highlight:
-        return memo_has_highlight[node_id]
-    local_stack = stack or set()
-    if node_id in local_stack:
-        return str(node_id) in highlighted_ids
-    local_stack.add(node_id)
-    has_self = str(node_id) in highlighted_ids
-    has_descendant = any(
-        subtree_has_highlight(child.id, local_stack) for child in graph.nodes[node_id].children
-    )
-    local_stack.remove(node_id)
-    result = has_self or has_descendant
-    memo_has_highlight[node_id] = result
-    return result
+def subtree_has_highlight(node_id: int) -> bool:
+    if node_id in memo_has_highlight:
+        return memo_has_highlight[node_id]
+    # To prevent infinite recursion on cycles, assume false for now
+    memo_has_highlight[node_id] = False
+    has_self = str(node_id) in highlighted_ids
+    has_descendant = any(
+        subtree_has_highlight(child.id) for child in graph.nodes[node_id].children
+    )
+    result = has_self or has_descendant
+    memo_has_highlight[node_id] = result
+    return result
```
```python
except FileNotFoundError as e:
    tui.error(str(e))
    sys.exit(1)
except ValueError as e:
    tui.error(str(e))
    sys.exit(1)
except ConfigurationError as e:
    tui.error(str(e))
    sys.exit(1)
except Exception as e:
    tui.error(f"Error scoring graph: {e}")
    sys.exit(1)
```
The exception handling here is a bit broad. Catching a generic `Exception` can mask specific issues and make debugging harder. It's better to catch the specific exceptions you expect, and let unexpected ones propagate or be handled by a more general error handler. In this case, you've already handled `FileNotFoundError`, `ValueError`, and `ConfigurationError`, so the generic `Exception` catch is likely for other unexpected issues. It would be better to log the full traceback in debug mode to aid in diagnosing these unexpected errors.
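A rough sketch of that consolidation, reusing the command's existing `tui` and `ConfigurationError` names; `run_score_command` is a hypothetical stand-in for the command body:

```python
import logging
import sys

logger = logging.getLogger(__name__)

try:
    report = run_score_command()  # hypothetical stand-in for the command body
except (FileNotFoundError, ValueError, ConfigurationError) as e:
    tui.error(str(e))
    sys.exit(1)
except Exception as e:
    # Keep the traceback available for debugging instead of swallowing it
    logger.debug("Unexpected error while scoring graph", exc_info=True)
    tui.error(f"Error scoring graph: {e}")
    sys.exit(1)
```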
```python
except (FileNotFoundError, ValueError, ConfigurationError) as e:
    tui.error(str(e))
    sys.exit(1)
except Exception as e:
    tui.error(f"Error optimizing thresholds: {e}")
    sys.exit(1)
```
Similar to the `topic score` command, the exception handling here is too broad. Catching a generic `Exception` can hide the root cause of problems. It's better to handle specific, expected exceptions and have a more general handler at a higher level if needed. For unexpected errors, logging the full traceback when in debug mode would be very helpful for developers.
```python
texts = [graph.nodes[node_id].topic for node_id in missing_ids]
vectors = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

for node_id, vector in zip(missing_ids, vectors, strict=False):
```
The `zip` function is used with `strict=False`. While this prevents a `ValueError` if the iterables have different lengths, it might hide potential bugs where `missing_ids` and `vectors` are expected to have the same length. If they are always expected to match, using `strict=True` (available since Python 3.10; note that `strict=False` remains the default) would be safer and make the code's intent clearer. If they can have different lengths, a comment explaining why would be helpful.
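A tiny illustration of the difference:

```python
ids = [1, 2, 3]
vecs = ["a", "b"]

print(list(zip(ids, vecs, strict=False)))  # [(1, 'a'), (2, 'b')]: mismatch silently truncated
# list(zip(ids, vecs, strict=True))  # raises ValueError: zip() argument 2 is shorter than argument 1
```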
| raise ValueError("trials must be greater than zero") | ||
|
|
||
| if search == "random": | ||
| rng = random.Random(seed) # noqa: S311 |
Using `random.Random(seed)` is a good practice for reproducibility. The `noqa: S311` comment signals awareness of the lint warning about using `random` instead of `secrets` for security-sensitive purposes. While this is not a security-critical context, it's worth noting that cryptographic or security-related random number generation should use the `secrets` module. For this use case (reproducible random search), `random` is appropriate.
```python
feasible = [e for e in evaluations if e["passes_constraints"]]
pool = feasible if feasible else evaluations
pool_sorted = sorted(pool, key=lambda e: e["objective"])
best = pool_sorted[0] if pool_sorted else None
```
How are we sure `pool_sorted[0]` is the optimum? We could in fact tune one parameter up and another down in an unbalanced way and still achieve a value closest to the desired constraint. Maybe we can enforce a similar strength of pruning across the three thresholds? In other words, the `parent_coherence` value should stay close to `sibling_coherence_lower`, and `sibling_coherence_upper + sibling_coherence_lower` should be close to 0.9.
This makes sure the distribution of the three values stays similar to the defaults:
- `parent_coherence`: 0.25
- `sibling_coherence_lower`: 0.2
- `sibling_coherence_upper`: 0.68
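One hypothetical way to encode that balance is a penalty term added to the search objective that keeps candidate thresholds close to the default shape; the function name and 0.5 weight below are illustrative:

```python
def shape_penalty(t: dict[str, float], weight: float = 0.5) -> float:
    # Keep parent_coherence close to sibling_coherence_lower (defaults differ by 0.05)
    gap = abs(t["parent_coherence"] - t["sibling_coherence_lower"])
    # Keep the sum of the sibling bounds near the default sum (0.88, ~0.9)
    sibling_sum = t["sibling_coherence_lower"] + t["sibling_coherence_upper"]
    return weight * (gap + abs(sibling_sum - 0.88))


# e.g. objective = removed_ratio + 1.5 * internal_removed_ratio + shape_penalty(thresholds)
```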
```python
)
removed_ratio = (removed_count / total_nodes) if total_nodes else 0.0
internal_removed_ratio = (internal_removed / total_nodes) if total_nodes else 0.0
objective = removed_ratio + (1.5 * internal_removed_ratio)
```
Do you want to penalise removal more for nodes at lower depth, with the penalty gradually decreasing as we approach nodes on the outside of the graph? To implement this, rather than one broad `internal_removed_ratio` category, we could compute a `removed_ratio` per depth and penalise the removal of lower-depth nodes more.
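A possible shape for that, assuming the evaluator can record the depth of each removed node; `depth_weighted_removed_ratio` and the decay rate are illustrative, not part of this PR:

```python
from collections import Counter


def depth_weighted_removed_ratio(
    removed_depths: list[int], total_nodes: int, decay: float = 0.5
) -> float:
    if not total_nodes:
        return 0.0
    counts = Counter(removed_depths)
    # Weight 1.0 at depth 0, geometrically less toward the leaves, so removing
    # shallow (structurally important) nodes costs more than removing leaves.
    return sum((decay**depth) * (count / total_nodes) for depth, count in counts.items())


# e.g. a removed depth-1 node contributes 0.5/total, a depth-4 node 0.0625/total
```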
```python
if prune_overlay and prune_overlay.get("thresholds"):
    thresholds = prune_overlay["thresholds"]
    tui.console.print(
```
Add the coherence score to `topic inspect --show-pruned` to help the user determine a threshold by inspecting the data. So basically, show the coherence score for that node behind `LOW_PARENT_COHERENCE`.
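Something like this line of inspect output is what the request amounts to (format purely illustrative):

```text
[PRUNED] 142 "example topic"  LOW_PARENT_COHERENCE (parent_coherence=0.18 < threshold 0.25)
```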
```python
graph_save_path = topics_save_as or config.topics.save_as or "topic_graph.json"

if scoring.prune:
    graph, report = prune_graph(
```
Enable another argument where the user can input the ids of nodes they wish to rescue from the filtering. For instance, after manual inspection the user might choose to keep some nodes while still discarding all other nodes that fall outside the thresholds.
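A minimal sketch of that rescue mechanism, assuming a hypothetical `keep_ids` input; neither the flag nor the helper exists in this PR:

```python
def apply_rescue_list(flagged_ids: set[int], keep_ids: set[int]) -> set[int]:
    # Nodes the user explicitly rescues survive even when their coherence
    # scores fall outside the configured thresholds.
    return flagged_ids - keep_ids


# e.g. a hypothetical --keep-ids 12,87 CLI flag would feed keep_ids={12, 87}
# into the prune step before any nodes are removed
```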
```python
    return combos[:trials]


def optimize_topic_thresholds(
```
We should also add the score_report to the output of optimize-thresholds so the user can feed it into graph inspection. Currently it only writes `_threshold_optimization.json`, which cannot be used as an input to `topic inspect`. So to generate `score_report.json` today, the user has to copy the chosen coherence thresholds and manually score the graph again, which adds an extra step.
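A rough sketch of closing that gap inside the optimizer command, reusing the report helpers this PR already defines (`score_topic_graph`, `write_topic_score_report`, `derive_topic_score_report_path`); the `optimization_result` shape and `graph_path` variable are assumptions:

```python
# After the search, score the graph once with the winning thresholds and emit
# a regular score report alongside _threshold_optimization.json.
best = optimization_result["best"]["thresholds"]  # assumed output shape
report = score_topic_graph(
    topic_model,
    parent_coherence=best["parent_coherence"],
    sibling_coherence_lower=best["sibling_coherence_lower"],
    sibling_coherence_upper=best["sibling_coherence_upper"],
)
write_topic_score_report(report, derive_topic_score_report_path(graph_path))
```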
Kexin-xu-01 left a comment
Hi @scp7, I have tested the workflow on a few datasets and left some feedback. Overall it looks great! Thank you so much for integrating the functions into the workflow!
Summary
- Wires topic scoring into the `generate` pipeline via a new `topics.scoring` YAML config section with `prune: true/false` control
- Saves the pruned graph as a `_scored` derivative (Option B file strategy)
- Adds `topic score` and `topic optimize-thresholds` CLI commands for standalone scoring and threshold search
- Adds `topic inspect --score-report --show-pruned`
- Moves `sentence-transformers` to a `[scoring]` optional extra to avoid dependency conflicts

Changes
- `topic_quality.py` (+702): `prune_graph()` and the core scoring logic
- `config.py`: `ScoringConfig` Pydantic model with threshold validation
- `cli.py`: `topic score` and `topic optimize-thresholds` commands, `_score_and_prune_topic_model()` pipeline helper
- Tests: `test_topic_quality.py`, `test_config.py`, `test_topic_inspector.py`
- Docs: `configuration.md`, `generate.md`, `topic-score.md`, `topic-optimize-thresholds.md`
- `pyproject.toml`: `[scoring]` optional extra for `sentence-transformers`

Test plan
- Tests pass (`uv run pytest`)
- Lint passes (`make lint`)
- Formatting passes (`make format`)
- `deepfabric generate config.yaml` with a `scoring:` section in YAML
- `deepfabric topic score graph.json` standalone
- `deepfabric topic inspect graph.json --score-report report.json --show-pruned`