A novel retrieval-augmented generation (RAG) pipeline for precisely localizing malicious payloads in Android apps at the method level, bridging high-level behavior descriptions with low-level Smali code semantics.
RAML addresses the challenge of localizing exact malicious payloads in Android apps by leveraging LLMs to connect high-level malware family behavior knowledge with low-level Smali bytecode, enabling precise method-level localization with human-readable explanations.
- Novel RAG Framework: First work to apply the RAG paradigm to localizing malicious payloads in Android malware
- Method-Level Localization: Precisely identifies specific methods with detailed explanations
- Evaluation Framework: Includes LocApp dataset and real-world MalRadar analysis
- Class-Level Description Generation: LLMs generate natural-language descriptions from Smali code
- Vector Embedding: Embed descriptions into a vector database for semantic search
- Semantic Retrieval: Find relevant classes using similarity search
- LLM Re-ranking: Re-rank candidates with LLM assistance
- Method-Level Analysis: Pinpoint specific malicious methods and explain what they do
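
The steps above amount to a retrieve-then-re-rank loop, which can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RAML's implementation: the bag-of-words `embed` function stands in for the real embedding model, `rerank` replaces the LLM re-ranker with a simple term-overlap heuristic, and the class names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, descriptions: dict[str, str], top_k: int = 2) -> list[str]:
    """Stage 1: rank class descriptions by similarity to the behavior query."""
    q = embed(query)
    ranked = sorted(descriptions, key=lambda c: cosine(q, embed(descriptions[c])), reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: list[str], descriptions: dict[str, str]) -> list[str]:
    """Stage 2: placeholder for LLM re-ranking; here, count shared query terms."""
    terms = set(embed(query))
    return sorted(candidates, key=lambda c: len(terms & set(embed(descriptions[c]))), reverse=True)


# Hypothetical class-level descriptions, as an LLM might produce from Smali code.
descriptions = {
    "Lcom/example/SmsSpy;": "reads sms messages and contacts then uploads them to a remote server",
    "Lcom/example/MainActivity;": "displays the main screen and handles button clicks",
    "Lcom/example/AdHelper;": "loads banner advertisements in the background",
}
query = "steal contacts and sms messages and exfiltrate them to a server"

candidates = retrieve(query, descriptions, top_k=2)
ranked = rerank(query, candidates, descriptions)
print(ranked[0])
```

In the real pipeline the surviving classes would then be passed to the method-level analysis step; the sketch stops at class ranking.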
git clone <repository-url>
cd RAML
pip install -r requirements.txt
echo "OPENAI_API_KEY=your_key_here" > .env
python main.py /path/to/smali/folder --behaviors 1 3 4 --app-name "app_name"
- smali_folder: Path to Smali files (required)
- --behaviors: Behavior IDs to analyze (1-12, required)
- --app-name: App name (optional)
- --output-dir: Output directory (optional)
- --force-rebuild: Rebuild vector store (optional)
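
For readers who want to see how these arguments fit together, they could be declared with `argparse` roughly as follows. This sketch is based only on the options documented here; the actual `main.py` may use different defaults, types, and help text, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command-line options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Localize malicious payloads in Smali code")
    parser.add_argument("smali_folder", help="Path to Smali files")
    parser.add_argument(
        "--behaviors", type=int, nargs="+", required=True,
        choices=range(1, 13), metavar="ID",
        help="Behavior IDs to analyze (1-12)",
    )
    parser.add_argument("--app-name", default=None, help="App name")
    parser.add_argument("--output-dir", default=None, help="Output directory")
    parser.add_argument("--force-rebuild", action="store_true", help="Rebuild vector store")
    return parser


# Parse the example invocation from this README instead of sys.argv.
args = build_parser().parse_args(
    ["/path/to/smali/folder", "--behaviors", "1", "3", "4", "--app-name", "app_name"]
)
print(args.behaviors, args.app_name, args.force_rebuild)
```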
| ID | Behavior | Description |
|---|---|---|
| 1 | Privacy Stealing | Access/exfiltrate contacts, SMS, location, call logs |
| 2 | SMS/CALL Abuse | Send SMS, make calls, manipulate communication |
| 3 | Remote Control | C&C server communication, remote command execution |
| 4 | Bank/Financial Stealing | Banking trojans, credential theft, overlay attacks |
| 5 | Ransomware | File encryption, screen locking, ransom demands |
| 6 | Accessibility Abuse | Exploit accessibility services for automation |
| 7 | Privilege Escalation | Root exploits, system modifications |
| 8 | Stealthy Download | Covert app installation, silent downloads |
| 9 | Aggressive Advertising | Click fraud, ad manipulation |
| 10 | Cryptocurrency Mining | Background mining operations |
| 11 | Evasion Techniques | App hiding, anti-analysis measures |
| 12 | Premium Service Abuse | WAP billing fraud, hidden subscriptions |
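
The IDs in the table are what `--behaviors` accepts. A lookup such as the one below is one way to validate them and map them to names; `BEHAVIORS` and `resolve_behaviors` are hypothetical names used for illustration, and the project may store this mapping differently.

```python
# Behavior IDs and names copied from the table above.
BEHAVIORS = {
    1: "Privacy Stealing",
    2: "SMS/CALL Abuse",
    3: "Remote Control",
    4: "Bank/Financial Stealing",
    5: "Ransomware",
    6: "Accessibility Abuse",
    7: "Privilege Escalation",
    8: "Stealthy Download",
    9: "Aggressive Advertising",
    10: "Cryptocurrency Mining",
    11: "Evasion Techniques",
    12: "Premium Service Abuse",
}


def resolve_behaviors(ids: list[int]) -> list[str]:
    """Return behavior names for the given IDs, rejecting unknown ones."""
    unknown = [i for i in ids if i not in BEHAVIORS]
    if unknown:
        raise ValueError(f"Unknown behavior IDs: {unknown}")
    return [BEHAVIORS[i] for i in ids]


print(resolve_behaviors([1, 3, 4]))
```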
- SmaliLoader: Loads and parses Smali files
- SmaliParser: Extracts class structure, methods, permissions, API calls
- RetrievalEngine: Two-stage RAG for behavior detection
- ReportGenerator: Creates analysis reports
- Logger: Analysis session logging
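
To make the parsing step concrete, the sketch below extracts a class name, its methods, and the API calls each method makes from Smali text. It is a simplified illustration, not the project's `SmaliParser`: `parse_smali` is a hypothetical function, the sample class is invented, and real Smali files contain fields, annotations, and other directives that this ignores.

```python
def parse_smali(source: str) -> dict:
    """Collect the class descriptor and, per method, the targets of invoke-* calls."""
    info = {"class": None, "methods": {}}
    current = None  # signature of the method currently being read
    for raw in source.splitlines():
        line = raw.strip()
        if line.startswith(".class"):
            # The class descriptor is the last token, after any access flags.
            info["class"] = line.split()[-1]
        elif line.startswith(".method"):
            current = line.split()[-1]
            info["methods"][current] = []
        elif line.startswith(".end method"):
            current = None
        elif line.startswith("invoke-") and current is not None:
            # The call target follows the register list: "{...}, Lpkg/Cls;->name(args)ret"
            info["methods"][current].append(line.rsplit(", ", 1)[-1])
    return info


# Invented sample class for illustration.
SAMPLE = """\
.class public Lcom/example/SmsSpy;
.super Ljava/lang/Object;

.method public getId(Landroid/telephony/TelephonyManager;)Ljava/lang/String;
    .locals 1
    invoke-virtual {p1}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
    move-result-object v0
    return-object v0
.end method
"""

parsed = parse_smali(SAMPLE)
print(parsed["class"])
for signature, calls in parsed["methods"].items():
    print(signature, calls)
```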
- Embedding Model: OpenAI text-embedding-ada-002
- LLM Model: GPT-4 for analysis and reasoning
- Vector Store: ChromaDB for similarity search
- Output: JSON reports with explanations and confidence scores
- LocApp: Custom Android app with common malicious behaviors for controlled evaluation
- Real-World Analysis: Assessed on MalRadar malware samples
- Detailed JSON Report: Complete analysis with all findings
- Summary Report: Human-readable summary of key findings
- Analysis Metadata: App name, behaviors, duration, confidence scores
- Method Analysis: Specific methods involved in each behavior
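
As an illustration of how the pieces listed above might fit into a single JSON report, consider the sketch below. The field names, the `summarize` helper, and every value are assumptions made for this example; the schema RAML actually writes may differ.

```python
import json

# Hypothetical report layout; all values are placeholders, not real results.
report = {
    "metadata": {
        "app_name": "app_name",
        "behaviors": [1],
        "duration_seconds": 0.0,
    },
    "findings": [
        {
            "behavior_id": 1,
            "behavior": "Privacy Stealing",
            "class": "Lcom/example/SmsSpy;",
            "methods": [
                {
                    "name": "uploadContacts()V",
                    "explanation": "Placeholder explanation of what the method does.",
                    "confidence": 0.0,
                }
            ],
        }
    ],
}


def summarize(report: dict) -> list[str]:
    """Build one human-readable line per flagged method."""
    lines = []
    for finding in report["findings"]:
        for method in finding["methods"]:
            lines.append(f'{finding["behavior"]}: {finding["class"]}->{method["name"]}')
    return lines


serialized = json.dumps(report, indent=2)
print("\n".join(summarize(json.loads(serialized))))
```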
Configure via config.py:
- OpenAI settings (API key, model, temperature)
- Vector store settings (persistence, collection)
- Retrieval parameters (top-k, thresholds)
- Processing options (chunk sizes, limits)
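
A configuration module covering these four groups might look like the sketch below. Only the model names come from this README; the `Config` class, its field names, and all other default values are assumptions chosen for illustration and may not match the real `config.py`.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Config:
    # OpenAI settings
    openai_api_key: str = field(default_factory=lambda: os.getenv("OPENAI_API_KEY", ""))
    llm_model: str = "gpt-4"
    embedding_model: str = "text-embedding-ada-002"
    temperature: float = 0.0  # assumed default

    # Vector store settings (assumed names and values)
    persist_directory: str = "./vector_store"
    collection_name: str = "smali_classes"

    # Retrieval parameters (assumed values)
    top_k: int = 10
    similarity_threshold: float = 0.5

    # Processing options (assumed values)
    max_chunk_chars: int = 4000
    max_classes: int = 1000


# Override individual settings per run without editing the defaults.
config = Config(top_k=5)
print(config.llm_model, config.top_k)
```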