Zeeeepa · codegen-sh · May 12, 2025
diff --git a/MODULE_DISASSEMBLER_README.md b/MODULE_DISASSEMBLER_README.md
@@ -0,0 +1,136 @@
+# Module Disassembler and Restructurer
+
+A tool for analyzing codebases, identifying duplicate and redundant code, and restructuring modules by functionality.
+
+## Overview
+
+The Module Disassembler is designed to help you:
+
+1. **Analyze** your codebase to understand its structure
+2. **Identify** duplicate and redundant code
+3. **Group** functions by their functionality
+4. **Restructure** your codebase into more logical modules
+
+This tool is particularly useful for:
+- Refactoring large codebases
+- Understanding unfamiliar code
+- Improving code organization
+- Reducing technical debt
+- Preparing for architectural changes
+
+## Features
+
+- **Function Extraction**: Extracts all functions from your codebase using the Codegen SDK
+- **Duplicate Detection**: Identifies exact and near-duplicate functions
+- **Functionality Grouping**: Groups functions based on their purpose
+- **Module Restructuring**: Generates new modules organized by functionality
+- **Comprehensive Reporting**: Provides detailed reports in multiple formats
+
+## Codegen SDK Integration
+
+This tool leverages the Codegen SDK to provide advanced code analysis capabilities:
+
+- **Symbol Extraction**: Uses the SDK to extract functions, classes, and other symbols
+- **Dependency Analysis**: Analyzes function call relationships using the SDK's call graph
+- **Type Information**: Leverages the SDK's type inference capabilities
+- **Semantic Understanding**: Benefits from the SDK's semantic code understanding
+
+## Installation
+
+The Module Disassembler requires the Codegen SDK to be installed:
+
+```bash
+# Clone the repository
+git clone https://github.com/Zeeeepa/codegen.git
+cd codegen
+
+# Install dependencies
+pip install -e .
+```
+
+## Usage
+
+```bash
+python module_disassembler.py --repo-path /path/to/your/repo --output-dir /path/to/output
+```
+
+### Command Line Options
+
+- `--repo-path`: Path to the repository to analyze (required)
+- `--output-dir`: Directory to output restructured modules (required)
+- `--report-file`: File to write the report to (default: disassembler_report.json)
+- `--similarity-threshold`: Threshold for considering functions similar (0.0-1.0, default: 0.8)
+- `--language`: Programming language of the codebase (auto-detected if not provided)
+
+### Example Usage
+
+```bash
+# Analyze a repository and generate restructured modules
+python module_disassembler.py --repo-path ./src/codegen --output-dir ./restructured
+
+# Focus on a specific directory (the --focus-dir option is provided by example_usage.py)
+python example_usage.py --repo-path . --focus-dir codegen-on-oss --output-dir ./restructured
+
+# Specify a language and similarity threshold
+python module_disassembler.py --repo-path ./src/codegen --output-dir ./restructured --language python --similarity-threshold 0.7
+```
+
+## How It Works
+
+1. **Function Extraction**: The tool uses the Codegen SDK to extract all functions from your codebase, with a fallback to AST parsing if needed.
+
+2. **Duplicate Detection**: Functions are compared to identify exact duplicates (same hash) and near-duplicates (similarity above a threshold).
+
+3. **Dependency Analysis**: The tool builds a dependency graph showing which functions call other functions, using the SDK's call graph capabilities.
+
+4. **Functionality Grouping**: Functions are grouped based on their names and purposes using predefined patterns.
+
+5. **Module Restructuring**: New modules are generated for each function group, with proper imports and documentation.
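+
+The duplicate detection in step 2 can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual implementation: it assumes exact duplicates are found by hashing the raw source text and near-duplicates by `difflib.SequenceMatcher`. The helper names `source_hash`, `similarity`, and `classify_pair` are hypothetical.
+
+```python
+import hashlib
+from difflib import SequenceMatcher
+
+
+def source_hash(source: str) -> str:
+    """Hash a function's source text; identical text gives an identical hash."""
+    return hashlib.sha256(source.strip().encode("utf-8")).hexdigest()
+
+
+def similarity(a: str, b: str) -> float:
+    """Return a similarity ratio between 0.0 and 1.0 for two source strings."""
+    return SequenceMatcher(None, a, b).ratio()
+
+
+def classify_pair(a: str, b: str, threshold: float = 0.8) -> str:
+    """Label a pair of functions as 'duplicate', 'similar', or 'distinct'."""
+    if source_hash(a) == source_hash(b):
+        return "duplicate"
+    if similarity(a, b) >= threshold:
+        return "similar"
+    return "distinct"
+
+
+# Illustrative inputs (not taken from a real codebase).
+first = "def format_output(data):\n    return ', '.join(sorted(data))\n"
+second = "def format_results(data):\n    return ', '.join(sorted(data))\n"
+third = (
+    "def validate_input(data):\n"
+    "    if not isinstance(data, list):\n"
+    "        raise ValueError('Input must be a list')\n"
+    "    return True\n"
+)
+
+print(classify_pair(first, first))   # duplicate
+print(classify_pair(first, second))  # similar
+print(classify_pair(first, third))   # distinct
+```
+
+Lowering `threshold` (the role played by `--similarity-threshold`) makes more pairs count as similar, at the cost of more false positives.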
+
+## Function Categories
+
+Functions are grouped into the following categories:
+
+- **analysis**: Functions for analyzing, extracting, parsing, or processing data
+- **visualization**: Functions for visualizing, plotting, or displaying data
+- **utility**: Helper and utility functions
+- **io**: Functions for reading/writing data
+- **validation**: Functions for validating or checking data
+- **metrics**: Functions for measuring, calculating, or computing metrics
+- **core**: Core functionality like initialization, main functions, etc.
+- **other**: Functions that don't fit into other categories
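+
+Name-based grouping into these categories might look like the sketch below. The keyword patterns are hypothetical, derived from the category descriptions above; the tool's predefined patterns may differ.
+
+```python
+import re
+
+# Hypothetical keyword patterns (assumption); checked in insertion order.
+CATEGORY_PATTERNS = {
+    "analysis": r"analy[sz]e|extract|parse|process",
+    "visualization": r"visuali[sz]e|plot|display|show",
+    "validation": r"validate|check|verify",
+    "io": r"read|write|load|save",
+    "metrics": r"measure|calculate|compute|count",
+    "core": r"^main$|init|setup|run",
+    "utility": r"helper|util|format|convert",
+}
+
+
+def categorize(function_name: str) -> str:
+    """Return the first category whose pattern matches the name, else 'other'."""
+    for category, pattern in CATEGORY_PATTERNS.items():
+        if re.search(pattern, function_name.lower()):
+            return category
+    return "other"
+
+
+for name in ["analyze_data", "display_results", "validate_input", "save_results", "frobnicate"]:
+    print(name, "->", categorize(name))
+```
+
+Because the match is a plain substring search, the first matching pattern wins, and a name such as `truncate_text` would land in `core` through the `run` keyword. That is one way name-based grouping can misjudge a function's purpose.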
+
+## Output
+
+The tool generates:
+
+1. **Restructured Modules**: Python files organized by functionality
+2. **Package Structure**: An `__init__.py` file that imports all modules, plus a generated `README.md`
+3. **Analysis Report**: A detailed report of the analysis results, written as JSON to `disassembler_report.json` by default
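+
+To inspect a generated report, a small helper along these lines may be useful. It is a sketch that assumes only the top-level keys checked in `test_module_disassembler.py` (`summary`, `duplicates`, `similar`, `categories`); the helper name `describe_report` is hypothetical, and the inner layout of each section is not assumed.
+
+```python
+import json
+from pathlib import Path
+
+# Top-level keys checked by test_module_disassembler.py.
+EXPECTED_KEYS = ("summary", "duplicates", "similar", "categories")
+
+
+def describe_report(path):
+    """Return the Python type name of each expected top-level section that is present."""
+    report = json.loads(Path(path).read_text(encoding="utf-8"))
+    return {key: type(report[key]).__name__ for key in EXPECTED_KEYS if key in report}
+```
+
+Calling `describe_report("disassembler_report.json")` after a run shows which sections the report contains.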
+
+## Fallback Mechanisms
+
+The tool is designed to work even if the Codegen SDK is not fully functional:
+
+- If SDK function extraction fails, it falls back to AST-based extraction
+- If SDK dependency analysis fails, it falls back to AST-based dependency analysis
+- Core functionality still works through the AST-based fallbacks, with reduced capabilities; note that `example_usage.py` exits early if the SDK cannot be imported
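+
+The AST-based fallbacks above can be approximated with Python's built-in `ast` module. The sketch below is an assumption about the general approach, not the tool's code; `extract_functions` and `build_call_graph` are hypothetical names, and only direct calls by simple name are recorded.
+
+```python
+import ast
+from typing import Dict, Set
+
+
+def extract_functions(source: str) -> Dict[str, ast.AST]:
+    """Map each function name found in the source to its AST node."""
+    tree = ast.parse(source)
+    return {
+        node.name: node
+        for node in ast.walk(tree)
+        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
+    }
+
+
+def build_call_graph(source: str) -> Dict[str, Set[str]]:
+    """Map each function to the simple names it calls directly."""
+    graph = {}
+    for name, node in extract_functions(source).items():
+        calls = set()
+        for child in ast.walk(node):
+            # Attribute calls such as f.write(...) are skipped in this sketch.
+            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
+                calls.add(child.func.id)
+        graph[name] = calls
+    return graph
+
+
+SAMPLE = '''
+def format_results(data):
+    return "\\n".join(f"{k}: {v}" for k, v in data.items())
+
+def save_results(data, filename):
+    with open(filename, "w") as f:
+        f.write(format_results(data))
+'''
+
+graph = build_call_graph(SAMPLE)
+print(sorted(graph["save_results"]))  # ['format_results', 'open']
+```
+
+A static pass like this sees only functions written out in source files, so functions created at runtime or called through attributes are missed.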
+
+## Limitations
+
+- Without the Codegen SDK, function extraction relies on the AST-based fallback, which only sees functions defined statically in source files and may not handle all edge cases correctly.
+- Function grouping is based on name patterns, which may not reflect a function's actual purpose.
+- The tool currently supports only Python code, but could be extended to support other languages.
+
+## Future Improvements
+
+- Handle more edge cases in AST-based function extraction
+- Implement more sophisticated code similarity algorithms
+- Add support for additional programming languages
+- Improve function grouping using NLP techniques
+- Add visualization of code dependencies
+- Implement interactive mode for manual grouping
+
+## License
+
+This tool is part of the Codegen SDK and is subject to the same license terms.
diff --git a/example_usage.py b/example_usage.py
@@ -0,0 +1,102 @@
+#!/usr/bin/env python3
+"""Example usage of the Module Disassembler for Codegen.
+
+This script demonstrates how to use the module disassembler to analyze
+and restructure the Codegen codebase.
+"""
+
+import argparse
+import sys
+import os
+from pathlib import Path
+from module_disassembler import ModuleDisassembler
+
+# Verify that the Codegen SDK is installed; the imported names are not used directly in this script
+try:
+    from codegen.sdk.core.codebase import Codebase
+    from codegen.configs.models.codebase import CodebaseConfig
+    from codegen.configs.models.secrets import SecretsConfig
+    from codegen.shared.enums.programming_language import ProgrammingLanguage
+except ImportError:
+    print("Codegen SDK not found. Please install it first.", file=sys.stderr)
+    sys.exit(1)
+
+
+def main():
+    """Example usage of the module disassembler."""
+    parser = argparse.ArgumentParser(description="Example usage of the Module Disassembler")
+
+    parser.add_argument("--repo-path", default=".", help="Path to the repository to analyze (default: current directory)")
+    parser.add_argument("--output-dir", default="./restructured_modules", help="Directory to output the restructured modules")
+    parser.add_argument("--report-file", default="./disassembler_report.json", help="Path to the output report file")
+    parser.add_argument("--similarity-threshold", type=float, default=0.8, help="Threshold for considering functions similar (0.0-1.0)")
+    parser.add_argument("--focus-dir", default=None, help="Focus on a specific directory (e.g., 'codegen-on-oss')")
+    parser.add_argument("--language", default=None, help="Programming language of the codebase (auto-detected if not provided)")
+
+    args = parser.parse_args()
+
+    # Resolve paths
+    repo_path = Path(args.repo_path).resolve()
+    output_dir = Path(args.output_dir).resolve()
+    report_file = Path(args.report_file).resolve()
+
+    # If focus directory is specified, adjust the repo path
+    if args.focus_dir:
+        focus_path = repo_path / args.focus_dir
+        if not focus_path.is_dir():
+            print(f"Error: Focus directory '{args.focus_dir}' does not exist or is not a directory in '{repo_path}'", file=sys.stderr)
+            sys.exit(1)
+        repo_path = focus_path
+
+    print(f"Analyzing repository: {repo_path}")
+    print(f"Output directory: {output_dir}")
+    print(f"Report file: {report_file}")
+    print(f"Similarity threshold: {args.similarity_threshold}")
+    print(f"Language: {args.language or 'Auto-detected'}")
+
+    try:
+        # Initialize the disassembler with Codegen SDK
+        disassembler = ModuleDisassembler(repo_path=repo_path, language=args.language)
+
+        # Perform the analysis
+        print("Analyzing codebase...")
+        disassembler.analyze(similarity_threshold=args.similarity_threshold)
+
+        # Print summary statistics
+        print("\nAnalysis Summary:")
+        print(f"Total functions: {len(disassembler.functions)}")
+        print(f"Duplicate groups: {len(disassembler.duplicate_groups)}")
+        print(f"Similar groups: {len(disassembler.similar_groups)}")
+        print("\nFunctions by category:")
+        for category, funcs in disassembler.categorized_functions.items():
+            print(f"  - {category}: {len(funcs)} functions")
+
+        # Generate restructured modules
+        print("\nGenerating restructured modules...")
+        disassembler.generate_restructured_modules(output_dir=output_dir)
+
+        # Generate report
+        print("Generating report...")
+        disassembler.generate_report(output_file=report_file)
+
+        print("\nAnalysis complete!")
+        print(f"Restructured modules saved to: {output_dir}")
+        print(f"Report saved to: {report_file}")
+
+        # Provide next steps
+        print("\nNext steps:")
+        print("1. Review the generated report to understand the codebase structure")
+        print("2. Examine the restructured modules to see the new organization")
+        print("3. Use the restructured modules as a reference for refactoring")
+        print("4. Import the restructured modules in your project")
+        print("5. Run tests to ensure functionality is preserved")
+
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)
+        import traceback
+        traceback.print_exc()
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/module_disassembler.py b/module_disassembler.py
diff --git a/run_disassembler.sh b/run_disassembler.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+# Run the module disassembler on the codegen-on-oss directory
+
+# Make the script executable with: chmod +x run_disassembler.sh
+set -euo pipefail  # Abort on errors so a failed run is not reported as completed
+
+# Set the paths
+REPO_PATH="."
+FOCUS_DIR="codegen-on-oss"
+OUTPUT_DIR="./restructured_modules"
+REPORT_FILE="./disassembler_report.json"
+
+# Create the output directory if it doesn't exist
+mkdir -p "$OUTPUT_DIR"
+
+# Run the module disassembler
+python example_usage.py --repo-path "$REPO_PATH" --focus-dir "$FOCUS_DIR" --output-dir "$OUTPUT_DIR" --report-file "$REPORT_FILE"
+
+# Print completion message
+echo "Module disassembler completed!"
+echo "Restructured modules are in: $OUTPUT_DIR"
+echo "Report file: $REPORT_FILE"
+
+# Optionally, you can run tests to verify the restructured modules
+# python -m unittest test_module_disassembler.py
diff --git a/test_module_disassembler.py b/test_module_disassembler.py
@@ -0,0 +1,221 @@
+#!/usr/bin/env python3
+"""Test script for the Module Disassembler.
+
+This script tests the functionality of the module disassembler
+by creating a simple test codebase with known duplicates and
+verifying that they are correctly identified.
+"""
+
+import json
+import os
+import shutil
+import tempfile
+import unittest
+import sys
+
+from module_disassembler import ModuleDisassembler
+
+# Try to import Codegen SDK for testing
+try:
+    from codegen.sdk.core.codebase import Codebase
+    from codegen.configs.models.codebase import CodebaseConfig
+    from codegen.configs.models.secrets import SecretsConfig
+    HAS_SDK = True
+except ImportError:
+    HAS_SDK = False
+    print("Warning: Codegen SDK not found. Some tests will be skipped.")
+
+
+class TestModuleDisassembler(unittest.TestCase):
+    """Test cases for the Module Disassembler."""
+
+    def setUp(self):
+        """Set up a test codebase with known duplicates."""
+        # Create a temporary directory for the test codebase
+        self.test_dir = tempfile.mkdtemp()
+        self.output_dir = tempfile.mkdtemp()
+        self.report_file = os.path.join(self.test_dir, "report.json")
+
+        # Create a simple test codebase
+        self._create_test_codebase()
+
+    def tearDown(self):
+        """Clean up temporary directories."""
+        shutil.rmtree(self.test_dir)
+        shutil.rmtree(self.output_dir)
+
+    def _create_test_codebase(self):
+        """Create a simple test codebase with known duplicates."""
+        # Create directory structure
+        os.makedirs(os.path.join(self.test_dir, "module1"))
+        os.makedirs(os.path.join(self.test_dir, "module2"))
+
+        # Create test files with duplicate functions
+
+        # File 1: module1/file1.py
+        with open(os.path.join(self.test_dir, "module1", "file1.py"), "w") as f:
+            f.write("""def analyze_data(data):
+    \"\"\"Analyze the given data.\"\"\"
+    result = {}
+    for item in data:
+        if item in result:
+            result[item] += 1
+        else:
+            result[item] = 1
+    return result
+
+def format_output(data):
+    \"\"\"Format the output data.\"\"\"
+    return "\\n".join([f"{k}: {v}" for k, v in data.items()])
+""")
+
+        # File 2: module1/file2.py
+        with open(os.path.join(self.test_dir, "module1", "file2.py"), "w") as f:
+            f.write("""def process_data(data):
+    \"\"\"Process the given data.\"\"\"
+    result = {}
+    for item in data:
+        if item in result:
+            result[item] += 1
+        else:
+            result[item] = 1
+    return result
+
+def validate_input(data):
+    \"\"\"Validate the input data.\"\"\"
+    if not isinstance(data, list):
+        raise ValueError("Input must be a list")
+    return True
+""")
+
+        # File 3: module2/file3.py
+        with open(os.path.join(self.test_dir, "module2", "file3.py"), "w") as f:
+            f.write("""def analyze_data(data):
+    \"\"\"Analyze the given data.\"\"\"
+    result = {}
+    for item in data:
+        if item in result:
+            result[item] += 1
+        else:
+            result[item] = 1
+    return result
+
+def display_results(data):
+    \"\"\"Display the results.\"\"\"
+    for key, value in data.items():
+        print(f"{key}: {value}")
+""")
+
+        # File 4: module2/file4.py
+        with open(os.path.join(self.test_dir, "module2", "file4.py"), "w") as f:
+            f.write("""def format_results(data):
+    \"\"\"Format the results.\"\"\"
+    return "\\n".join([f"{k}: {v}" for k, v in data.items()])
+
+def save_results(data, filename):
+    \"\"\"Save the results to a file.\"\"\"
+    with open(filename, "w") as f:
+        f.write(format_results(data))
+""")
+
+    def test_function_extraction(self):
+        """Test that functions are correctly extracted from the codebase."""
+        disassembler = ModuleDisassembler(repo_path=self.test_dir)
+
+        # Use AST-based extraction for testing
+        disassembler._extract_functions_with_ast()
+
+        # The four files define 8 functions in total (analyze_data appears in two files)
+        self.assertEqual(len(disassembler.functions), 8)
+
+        # Check that all expected functions are found
+        function_names = [func.name for func in disassembler.functions.values()]
+        expected_names = ["analyze_data", "format_output", "process_data", "validate_input",
+                          "display_results", "format_results", "save_results"]
+
+        for name in expected_names:
+            self.assertIn(name, function_names)
+
+    def test_duplicate_detection(self):
+        """Test that duplicates are correctly identified."""
+        disassembler = ModuleDisassembler(repo_path=self.test_dir)
+        disassembler._extract_functions_with_ast()
+        disassembler._identify_duplicates(similarity_threshold=0.8)
+
+        # We should have 1 duplicate group (analyze_data appears twice)
+        self.assertEqual(len(disassembler.duplicate_groups), 1)
+
+        # We should have 1 similar group (format_output and format_results are similar)
+        self.assertEqual(len(disassembler.similar_groups), 1)
+
+    def test_categorization(self):
+        """Test that functions are correctly categorized."""
+        disassembler = ModuleDisassembler(repo_path=self.test_dir)
+        disassembler._extract_functions_with_ast()
+        disassembler._categorize_functions()
+
+        # Check that functions are categorized correctly
+        self.assertIn("analysis", disassembler.categorized_functions)
+        self.assertIn("validation", disassembler.categorized_functions)
+        self.assertIn("visualization", disassembler.categorized_functions)
+
+        # analyze_data and process_data should be in the analysis category
+        analysis_names = [func.name for func in disassembler.categorized_functions["analysis"]]
+        self.assertIn("analyze_data", analysis_names)
+        self.assertIn("process_data", analysis_names)
+
+        # validate_input should be in the validation category
+        validation_names = [func.name for func in disassembler.categorized_functions["validation"]]
+        self.assertIn("validate_input", validation_names)
+
+        # display_results should be in the visualization category
+        visualization_names = [func.name for func in disassembler.categorized_functions["visualization"]]
+        self.assertIn("display_results", visualization_names)
+
+    def test_full_analysis(self):
+        """Test the full analysis process."""
+        disassembler = ModuleDisassembler(repo_path=self.test_dir)
+
+        # For testing, we'll use the AST-based methods directly
+        disassembler._extract_functions_with_ast()
+        disassembler._identify_duplicates(similarity_threshold=0.8)
+        disassembler._build_dependency_graph_with_ast()
+        disassembler._categorize_functions()
+
+        disassembler.generate_restructured_modules(output_dir=self.output_dir)
+        disassembler.generate_report(output_file=self.report_file)
+
+        # Check that the output directory contains the expected structure
+        self.assertTrue(os.path.exists(os.path.join(self.output_dir, "__init__.py")))
+        self.assertTrue(os.path.exists(os.path.join(self.output_dir, "README.md")))
+
+        # Check that the report file was created
+        self.assertTrue(os.path.exists(self.report_file))
+
+        # Load the report and check its structure
+        with open(self.report_file) as f:
+            report = json.load(f)
+
+        self.assertIn("summary", report)
+        self.assertIn("duplicates", report)
+        self.assertIn("similar", report)
+        self.assertIn("categories", report)
+
+    @unittest.skipIf(not HAS_SDK, "Codegen SDK not available")
+    def test_sdk_integration(self):
+        """Test integration with Codegen SDK."""
+        # This test only runs if the SDK can be imported
+        disassembler = ModuleDisassembler(repo_path=self.test_dir)
+
+        # Simulate an SDK failure by clearing the codebase (this is not a mock object)
+        disassembler.codebase = None
+
+        # Test that the fallback to AST works when SDK fails
+        disassembler._extract_functions_with_sdk()
+
+        # We should still have functions extracted via AST fallback
+        self.assertGreater(len(disassembler.functions), 0)
+
+
+if __name__ == "__main__":
+    unittest.main()