Add Module Disassembler for Code Deduplication and Restructuring #99

Draft
wants to merge 4 commits into
base: develop
136 changes: 136 additions & 0 deletions MODULE_DISASSEMBLER_README.md
@@ -0,0 +1,136 @@
# Module Disassembler and Restructurer

A tool for analyzing codebases, identifying duplicate and redundant code, and restructuring modules by functionality.

## Overview

The Module Disassembler is designed to help you:

1. **Analyze** your codebase to understand its structure
2. **Identify** duplicate and redundant code
3. **Group** functions by their functionality
4. **Restructure** your codebase into more logical modules

This tool is particularly useful for:
- Refactoring large codebases
- Understanding unfamiliar code
- Improving code organization
- Reducing technical debt
- Preparing for architectural changes

## Features

- **Function Extraction**: Extracts all functions from your codebase using the Codegen SDK
- **Duplicate Detection**: Identifies exact and near-duplicate functions
- **Functionality Grouping**: Groups functions based on their purpose
- **Module Restructuring**: Generates new modules organized by functionality
- **Comprehensive Reporting**: Provides detailed reports in multiple formats

## Codegen SDK Integration

This tool leverages the Codegen SDK to provide advanced code analysis capabilities:

- **Symbol Extraction**: Uses the SDK to extract functions, classes, and other symbols
- **Dependency Analysis**: Analyzes function call relationships using the SDK's call graph
- **Type Information**: Leverages the SDK's type inference capabilities
- **Semantic Understanding**: Benefits from the SDK's semantic code understanding
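
To make the call-graph idea concrete, here is a minimal sketch that uses only Python's standard `ast` module as a stand-in. It does not use or mirror the Codegen SDK's API; `build_call_graph` and the sample source are hypothetical and exist only for illustration.

```python
import ast
from collections import defaultdict


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function name to the plain names it calls directly."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = defaultdict(set)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            graph[node.name]  # make sure functions without calls still appear
            for child in ast.walk(node):
                # Only calls of the form name(...) are tracked; method calls
                # such as obj.method(...) are ignored in this sketch.
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    graph[node.name].add(child.func.id)
    return dict(graph)


# Hypothetical sample input.
SAMPLE = '''
def format_results(data):
    return ", ".join(f"{k}: {v}" for k, v in data.items())

def save_results(data, filename):
    with open(filename, "w") as f:
        f.write(format_results(data))
'''

call_graph = build_call_graph(SAMPLE)
```

In this sample, `save_results` depends on `format_results` (and on the built-in `open`), while `format_results` has no tracked dependencies because it only makes method calls.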

## Installation

The Module Disassembler requires the Codegen SDK to be installed:

```bash
# Clone the repository
git clone https://github.com/Zeeeepa/codegen.git
cd codegen

# Install dependencies
pip install -e .
```

## Usage

```bash
python module_disassembler.py --repo-path /path/to/your/repo --output-dir /path/to/output
```

### Command Line Options

- `--repo-path`: Path to the repository to analyze (required)
- `--output-dir`: Directory to output restructured modules (required)
- `--report-file`: File to write the report to (default: disassembler_report.json)
- `--similarity-threshold`: Threshold for considering functions similar (0.0-1.0, default: 0.8)
- `--language`: Programming language of the codebase (auto-detected if not provided)

### Example Usage

```bash
# Analyze a repository and generate restructured modules
python module_disassembler.py --repo-path ./src/codegen --output-dir ./restructured

# Focus on a specific directory (example_usage.py accepts --focus-dir)
python example_usage.py --repo-path . --focus-dir codegen-on-oss --output-dir ./restructured

# Specify a language and similarity threshold
python module_disassembler.py --repo-path ./src/codegen --output-dir ./restructured --language python --similarity-threshold 0.7
```

## How It Works

1. **Function Extraction**: The tool uses the Codegen SDK to extract all functions from your codebase, with a fallback to AST parsing if needed.

2. **Duplicate Detection**: Functions are compared to identify exact duplicates (same hash) and near-duplicates (similarity above a threshold).

3. **Dependency Analysis**: The tool builds a dependency graph showing which functions call other functions, using the SDK's call graph capabilities.

4. **Functionality Grouping**: Functions are grouped based on their names and purposes using predefined patterns.

5. **Module Restructuring**: New modules are generated for each function group, with proper imports and documentation.
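
The duplicate-detection step can be sketched as follows. This is an illustration under assumptions, not the tool's actual implementation: it assumes exact duplicates are found by hashing whitespace-normalized source and near-duplicates by `difflib.SequenceMatcher`, and the names `normalize` and `find_duplicates` and the sample function bodies are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import combinations


def normalize(source: str) -> str:
    """Drop blank lines and trailing whitespace so formatting noise is ignored."""
    return "\n".join(line.rstrip() for line in source.splitlines() if line.strip())


def find_duplicates(functions: dict[str, str], threshold: float = 0.8):
    """Return (exact_pairs, similar_pairs) for a mapping of function id -> source."""
    normalized = {name: normalize(src) for name, src in functions.items()}
    digests = {
        name: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for name, text in normalized.items()
    }
    exact, similar = [], []
    for a, b in combinations(normalized, 2):
        if digests[a] == digests[b]:
            exact.append((a, b))  # same hash: exact duplicate
        elif SequenceMatcher(None, normalized[a], normalized[b]).ratio() >= threshold:
            similar.append((a, b))  # similarity at or above the threshold
    return exact, similar


# Hypothetical function bodies keyed by "file::function".
COUNT_BODY = (
    "result = {}\n"
    "for item in data:\n"
    "    result[item] = result.get(item, 0) + 1\n"
    "return result"
)
bodies = {
    "module1/file1.py::analyze_data": COUNT_BODY,
    "module2/file3.py::analyze_data": COUNT_BODY.replace("\n", "  \n"),
    "module1/file1.py::format_output": 'return ", ".join(f"{k}: {v}" for k, v in data.items())',
    "module2/file4.py::format_results": 'return ", ".join(f"{k}: {v}" for k, v in sorted(data.items()))',
}

exact_pairs, similar_pairs = find_duplicates(bodies, threshold=0.8)
```

Lowering `threshold` reports more pairs as similar, at the cost of more false positives. The pairwise comparison is quadratic in the number of functions, which is acceptable for a sketch but may need bucketing on a large codebase.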

## Function Categories

Functions are grouped into the following categories:

- **analysis**: Functions for analyzing, extracting, parsing, or processing data
- **visualization**: Functions for visualizing, plotting, or displaying data
- **utility**: Helper and utility functions
- **io**: Functions for reading/writing data
- **validation**: Functions for validating or checking data
- **metrics**: Functions for measuring, calculating, or computing metrics
- **core**: Core functionality like initialization, main functions, etc.
- **other**: Functions that don't fit into other categories
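
Name-based grouping of this kind could look roughly like the sketch below. The regular expressions are hypothetical, chosen only to mirror the category descriptions above; the patterns actually used by the tool may differ.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
CATEGORY_PATTERNS = [
    ("analysis", r"analy[sz]e|extract|parse|process"),
    ("visualization", r"visuali[sz]e|plot|display|show"),
    ("io", r"read|write|load|save"),
    ("validation", r"validate|check|verify"),
    ("metrics", r"measure|calculate|compute|metric"),
    ("core", r"^main$|init|setup"),
    ("utility", r"util|helper|format|convert"),
]


def categorize(function_name: str) -> str:
    """Return the first category whose pattern matches the function name."""
    lowered = function_name.lower()
    for category, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, lowered):
            return category
    return "other"


categories = {
    name: categorize(name)
    for name in ["analyze_data", "process_data", "validate_input",
                 "display_results", "save_results", "frobnicate"]
}
```

Because the first matching pattern wins, a name that matches several categories lands in whichever is listed first, which is one reason name-based grouping can misclassify functions.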

## Output

The tool generates:

1. **Restructured Modules**: Python files organized by functionality
2. **Package Structure**: An `__init__.py` file that imports all modules
3. **Analysis Report**: A detailed report of the analysis results
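
As a rough illustration of the package structure, the sketch below writes one module per category and an `__init__.py` that imports each of them. The helper `write_package` and the module contents are hypothetical; the real generator may lay out files and imports differently.

```python
import tempfile
from pathlib import Path


def write_package(output_dir: Path, modules: dict[str, str]) -> Path:
    """Write each module to <name>.py and an __init__.py importing all of them."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for name, source in modules.items():
        (output_dir / f"{name}.py").write_text(source, encoding="utf-8")
    init_lines = [f"from . import {name}" for name in sorted(modules)]
    init_path = output_dir / "__init__.py"
    init_path.write_text("\n".join(init_lines) + "\n", encoding="utf-8")
    return init_path


# Hypothetical category modules, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    init_path = write_package(
        Path(tmp) / "restructured",
        {
            "validation": "def validate_input(data):\n    return isinstance(data, list)\n",
            "analysis": "def analyze_data(data):\n    return len(data)\n",
        },
    )
    init_text = init_path.read_text(encoding="utf-8")
```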

## Fallback Mechanisms

The tool is designed to work even if the Codegen SDK is not fully functional:

- If SDK function extraction fails, it falls back to AST-based extraction
- If SDK dependency analysis fails, it falls back to AST-based dependency analysis
- All core functionality works without the SDK, but with reduced capabilities
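
The fallback behaviour can be pictured as a simple try-then-fall-back wrapper. In this sketch `extract_with_sdk` is a placeholder that always fails, standing in for an SDK-backed extractor; none of these function names come from the tool or from the Codegen SDK.

```python
import ast
import logging

logger = logging.getLogger(__name__)


def extract_with_sdk(source: str) -> list[str]:
    """Placeholder for an SDK-backed extractor; it always fails in this sketch."""
    raise RuntimeError("SDK unavailable")


def extract_with_ast(source: str) -> list[str]:
    """Collect function names using only the standard library."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def extract_functions(source: str, primary=extract_with_sdk, fallback=extract_with_ast):
    """Try the primary extractor first and fall back if it raises."""
    try:
        return primary(source)
    except Exception as exc:
        logger.warning("Primary extraction failed (%s); falling back to AST", exc)
        return fallback(source)


names = extract_functions("def a():\n    pass\n\ndef b():\n    pass\n")
```

Passing the extractors as parameters keeps the fallback path easy to exercise in tests without needing the SDK installed.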

## Limitations

- The current implementation uses regex for function extraction, which may not handle all edge cases correctly. A more robust implementation would use AST parsing.
- Function grouping is based on name patterns, which may not always accurately reflect the function's purpose.
- The tool currently only supports Python code, but could be extended to support other languages.
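
To illustrate the first limitation, the sketch below compares a simple line-based regex with `ast` on the same input. The regex is hypothetical and is not the one used by the tool; it is only meant to show the kinds of edge cases a regex approach can get wrong.

```python
import ast
import re

# Hypothetical input with an async function, a method, and a "def" inside a string.
SOURCE = '''
async def fetch(url):
    return url

class Loader:
    def load(self, path):
        return path

TEMPLATE = """
def not_real():
    pass
"""
'''

# A naive regex only sees lines that start with "def" at column 0.
regex_names = re.findall(r"^def\s+(\w+)", SOURCE, flags=re.MULTILINE)

# The AST sees real definitions, including async functions and methods,
# and ignores text inside string literals.
ast_names = sorted(
    node.name
    for node in ast.walk(ast.parse(SOURCE))
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
)
```

Here the regex reports only `not_real`, which is text inside a string, and misses both real functions, while the AST walk finds `fetch` and `load`.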

## Future Improvements

- Use AST parsing for more accurate function extraction
- Implement more sophisticated code similarity algorithms
- Add support for additional programming languages
- Improve function grouping using NLP techniques
- Add visualization of code dependencies
- Implement interactive mode for manual grouping

## License

This tool is part of the Codegen SDK and is subject to the same license terms.
102 changes: 102 additions & 0 deletions example_usage.py
@@ -0,0 +1,102 @@
#!/usr/bin/env python3
"""Example usage of the Module Disassembler for Codegen.

This script demonstrates how to use the module disassembler to analyze
and restructure the Codegen codebase.
"""

import argparse
import sys
import os
from pathlib import Path
from module_disassembler import ModuleDisassembler

# Import Codegen SDK
try:
    from codegen.sdk.core.codebase import Codebase
    from codegen.configs.models.codebase import CodebaseConfig
    from codegen.configs.models.secrets import SecretsConfig
    from codegen.shared.enums.programming_language import ProgrammingLanguage
except ImportError:
    print("Codegen SDK not found. Please install it first.")
    sys.exit(1)


def main():
    """Example usage of the module disassembler."""
    parser = argparse.ArgumentParser(description="Example usage of the Module Disassembler")

    parser.add_argument("--repo-path", default=".", help="Path to the repository to analyze (default: current directory)")
    parser.add_argument("--output-dir", default="./restructured_modules", help="Directory to output the restructured modules")
    parser.add_argument("--report-file", default="./disassembler_report.json", help="Path to the output report file")
    parser.add_argument("--similarity-threshold", type=float, default=0.8, help="Threshold for considering functions similar (0.0-1.0)")
    parser.add_argument("--focus-dir", default=None, help="Focus on a specific directory (e.g., 'codegen-on-oss')")
    parser.add_argument("--language", default=None, help="Programming language of the codebase (auto-detected if not provided)")

    args = parser.parse_args()

    # Resolve paths
    repo_path = Path(args.repo_path).resolve()
    output_dir = Path(args.output_dir).resolve()
    report_file = Path(args.report_file).resolve()

    # If focus directory is specified, adjust the repo path
    if args.focus_dir:
        focus_path = repo_path / args.focus_dir
        if not focus_path.exists():
            print(f"Error: Focus directory '{args.focus_dir}' does not exist in '{repo_path}'")
            sys.exit(1)
        repo_path = focus_path

    print(f"Analyzing repository: {repo_path}")
    print(f"Output directory: {output_dir}")
    print(f"Report file: {report_file}")
    print(f"Similarity threshold: {args.similarity_threshold}")
    print(f"Language: {args.language or 'Auto-detected'}")

    try:
        # Initialize the disassembler with Codegen SDK
        disassembler = ModuleDisassembler(repo_path=repo_path, language=args.language)

        # Perform the analysis
        print("Analyzing codebase...")
        disassembler.analyze(similarity_threshold=args.similarity_threshold)

        # Print summary statistics
        print("\nAnalysis Summary:")
        print(f"Total functions: {len(disassembler.functions)}")
        print(f"Duplicate groups: {len(disassembler.duplicate_groups)}")
        print(f"Similar groups: {len(disassembler.similar_groups)}")
        print("\nFunctions by category:")
        for category, funcs in disassembler.categorized_functions.items():
            print(f" - {category}: {len(funcs)} functions")

        # Generate restructured modules
        print("\nGenerating restructured modules...")
        disassembler.generate_restructured_modules(output_dir=output_dir)

        # Generate report
        print("Generating report...")
        disassembler.generate_report(output_file=report_file)

        print("\nAnalysis complete!")
        print(f"Restructured modules saved to: {output_dir}")
        print(f"Report saved to: {report_file}")

        # Provide next steps
        print("\nNext steps:")
        print("1. Review the generated report to understand the codebase structure")
        print("2. Examine the restructured modules to see the new organization")
        print("3. Use the restructured modules as a reference for refactoring")
        print("4. Import the restructured modules in your project")
        print("5. Run tests to ensure functionality is preserved")

    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()
629 changes: 629 additions & 0 deletions module_disassembler.py

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions run_disassembler.sh
@@ -0,0 +1,25 @@
#!/bin/bash
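# Run the module disassembler on the codegen-on-oss directory

# Stop on the first failing command so the completion message is only printed on success
set -euo pipefail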

# Ensure the script is executable
# chmod +x run_disassembler.sh

# Set the paths
REPO_PATH="."
FOCUS_DIR="codegen-on-oss"
OUTPUT_DIR="./restructured_modules"
REPORT_FILE="./disassembler_report.json"

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Run the module disassembler
python example_usage.py --repo-path "$REPO_PATH" --focus-dir "$FOCUS_DIR" --output-dir "$OUTPUT_DIR" --report-file "$REPORT_FILE"

# Print completion message
echo "Module disassembler completed!"
echo "Restructured modules are in: $OUTPUT_DIR"
echo "Report file: $REPORT_FILE"

# Optionally, you can run tests to verify the restructured modules
# python -m unittest test_module_disassembler.py
221 changes: 221 additions & 0 deletions test_module_disassembler.py
@@ -0,0 +1,221 @@
#!/usr/bin/env python3
"""Test script for the Module Disassembler.
This script tests the functionality of the module disassembler
by creating a simple test codebase with known duplicates and
verifying that they are correctly identified.
"""

import json
import os
import shutil
import tempfile
import unittest
import sys

from module_disassembler import ModuleDisassembler

# Try to import Codegen SDK for testing
try:
    from codegen.sdk.core.codebase import Codebase
    from codegen.configs.models.codebase import CodebaseConfig
    from codegen.configs.models.secrets import SecretsConfig
    HAS_SDK = True
except ImportError:
    HAS_SDK = False
    print("Warning: Codegen SDK not found. Some tests will be skipped.")


class TestModuleDisassembler(unittest.TestCase):
    """Test cases for the Module Disassembler."""

    def setUp(self):
        """Set up a test codebase with known duplicates."""
        # Create a temporary directory for the test codebase
        self.test_dir = tempfile.mkdtemp()
        self.output_dir = tempfile.mkdtemp()
        self.report_file = os.path.join(self.test_dir, "report.json")

        # Create a simple test codebase
        self._create_test_codebase()

    def tearDown(self):
        """Clean up temporary directories."""
        shutil.rmtree(self.test_dir)
        shutil.rmtree(self.output_dir)

    def _create_test_codebase(self):
        """Create a simple test codebase with known duplicates."""
        # Create directory structure
        os.makedirs(os.path.join(self.test_dir, "module1"))
        os.makedirs(os.path.join(self.test_dir, "module2"))

        # Create test files with duplicate functions

        # File 1: module1/file1.py
        with open(os.path.join(self.test_dir, "module1", "file1.py"), "w") as f:
            f.write("""def analyze_data(data):
    \"\"\"Analyze the given data.\"\"\"
    result = {}
    for item in data:
        if item in result:
            result[item] += 1
        else:
            result[item] = 1
    return result

def format_output(data):
    \"\"\"Format the output data.\"\"\"
    return "\\n".join([f"{k}: {v}" for k, v in data.items()])
""")

        # File 2: module1/file2.py
        with open(os.path.join(self.test_dir, "module1", "file2.py"), "w") as f:
            f.write("""def process_data(data):
    \"\"\"Process the given data.\"\"\"
    result = {}
    for item in data:
        if item in result:
            result[item] += 1
        else:
            result[item] = 1
    return result

def validate_input(data):
    \"\"\"Validate the input data.\"\"\"
    if not isinstance(data, list):
        raise ValueError("Input must be a list")
    return True
""")

        # File 3: module2/file3.py
        with open(os.path.join(self.test_dir, "module2", "file3.py"), "w") as f:
            f.write("""def analyze_data(data):
    \"\"\"Analyze the given data.\"\"\"
    result = {}
    for item in data:
        if item in result:
            result[item] += 1
        else:
            result[item] = 1
    return result

def display_results(data):
    \"\"\"Display the results.\"\"\"
    for key, value in data.items():
        print(f"{key}: {value}")
""")

        # File 4: module2/file4.py
        with open(os.path.join(self.test_dir, "module2", "file4.py"), "w") as f:
            f.write("""def format_results(data):
    \"\"\"Format the results.\"\"\"
    return "\\n".join([f"{k}: {v}" for k, v in data.items()])

def save_results(data, filename):
    \"\"\"Save the results to a file.\"\"\"
    with open(filename, "w") as f:
        f.write(format_results(data))
""")

    def test_function_extraction(self):
        """Test that functions are correctly extracted from the codebase."""
        disassembler = ModuleDisassembler(repo_path=self.test_dir)

        # Use AST-based extraction for testing
        disassembler._extract_functions_with_ast()

        # We should have 7 functions in total
        self.assertEqual(len(disassembler.functions), 7)

        # Check that all expected functions are found
        function_names = [func.name for func in disassembler.functions.values()]
        expected_names = ["analyze_data", "format_output", "process_data", "validate_input",
                          "display_results", "format_results", "save_results"]

        for name in expected_names:
            self.assertIn(name, function_names)

    def test_duplicate_detection(self):
        """Test that duplicates are correctly identified."""
        disassembler = ModuleDisassembler(repo_path=self.test_dir)
        disassembler._extract_functions_with_ast()
        disassembler._identify_duplicates(similarity_threshold=0.8)

        # We should have 1 duplicate group (analyze_data appears twice)
        self.assertEqual(len(disassembler.duplicate_groups), 1)

        # We should have 1 similar group (format_output and format_results are similar)
        self.assertEqual(len(disassembler.similar_groups), 1)

    def test_categorization(self):
        """Test that functions are correctly categorized."""
        disassembler = ModuleDisassembler(repo_path=self.test_dir)
        disassembler._extract_functions_with_ast()
        disassembler._categorize_functions()

        # Check that functions are categorized correctly
        self.assertIn("analysis", disassembler.categorized_functions)
        self.assertIn("validation", disassembler.categorized_functions)
        self.assertIn("visualization", disassembler.categorized_functions)

        # analyze_data and process_data should be in the analysis category
        analysis_names = [func.name for func in disassembler.categorized_functions["analysis"]]
        self.assertIn("analyze_data", analysis_names)
        self.assertIn("process_data", analysis_names)

        # validate_input should be in the validation category
        validation_names = [func.name for func in disassembler.categorized_functions["validation"]]
        self.assertIn("validate_input", validation_names)

        # display_results should be in the visualization category
        visualization_names = [func.name for func in disassembler.categorized_functions["visualization"]]
        self.assertIn("display_results", visualization_names)

    def test_full_analysis(self):
        """Test the full analysis process."""
        disassembler = ModuleDisassembler(repo_path=self.test_dir)

        # For testing, we'll use the AST-based methods directly
        disassembler._extract_functions_with_ast()
        disassembler._identify_duplicates(similarity_threshold=0.8)
        disassembler._build_dependency_graph_with_ast()
        disassembler._categorize_functions()

        disassembler.generate_restructured_modules(output_dir=self.output_dir)
        disassembler.generate_report(output_file=self.report_file)

        # Check that the output directory contains the expected structure
        self.assertTrue(os.path.exists(os.path.join(self.output_dir, "__init__.py")))
        self.assertTrue(os.path.exists(os.path.join(self.output_dir, "README.md")))

        # Check that the report file was created
        self.assertTrue(os.path.exists(self.report_file))

        # Load the report and check its structure
        with open(self.report_file) as f:
            report = json.load(f)

        self.assertIn("summary", report)
        self.assertIn("duplicates", report)
        self.assertIn("similar", report)
        self.assertIn("categories", report)

    @unittest.skipIf(not HAS_SDK, "Codegen SDK not available")
    def test_sdk_integration(self):
        """Test integration with Codegen SDK."""
        # This test only runs if the SDK is available
        disassembler = ModuleDisassembler(repo_path=self.test_dir)

        # Mock the SDK codebase for testing
        disassembler.codebase = None

        # Test that the fallback to AST works when SDK fails
        disassembler._extract_functions_with_sdk()

        # We should still have functions extracted via AST fallback
        self.assertGreater(len(disassembler.functions), 0)


if __name__ == "__main__":
    unittest.main()