An interactive web application built with Streamlit that uses a Retrieval-Augmented Generation (RAG) pipeline to classify security event logs into standardized categories.
- Problem Statement
- Features
- RAG Architecture
- How It Works
- Tech Stack
- Setup & Installation
- Steps to Run
- Usage
- Example Input/Output
- Configuration
- Future Improvements
- License
## Problem Statement

Security teams deal with thousands of security event logs daily from various sources (firewalls, IDS/IPS, authentication systems, etc.). These logs need to be classified into standardized categories for:
- Compliance Requirements: Meeting regulatory and reporting obligations (SIEM and SOC reporting)
- Incident Response: Quick identification and prioritization of security events
- Threat Detection: Pattern recognition across different log formats
- Automation: Reducing manual classification effort and human error
Key challenges:

- Logs come in different formats and structures
- Manual classification is time-consuming and error-prone
- Need for consistent categorization across different log sources
- Requirement for confidence scoring to identify uncertain classifications
This application uses a RAG (Retrieval-Augmented Generation) pipeline to automatically classify security logs into five standardized fields:
- `eventClass`: Type of security event
- `eventOutcome`: Result of the event (success/failure)
- `eventSeverity`: Impact level
- `eventAction`: Action taken or required
- `eventCategory`: High-level category
## Features

- Interactive UI: Simple and intuitive web interface powered by Streamlit.
- CSV Upload: Easily upload your log files in `.csv` format.
- RAG Pipeline: Leverages a state-of-the-art RAG architecture for intelligent log classification.
- Vector Search: Uses FAISS for efficient similarity search to find relevant context for each log.
- LLM-Powered Classification: Utilizes Google's Gemini models to reason over log data and provide structured output.
- Downloadable Results: Export the classification results in both JSON and CSV formats.
- Combined Confidence Score: Calculates a hybrid confidence score based on both vector similarity and the LLM's own confidence.
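The JSON/CSV export from the feature list can be sketched with pandas; the column names and values below are illustrative, and the Streamlit `st.download_button` wiring is omitted:

```python
import pandas as pd

# Hypothetical classification results as produced by the pipeline
results = [
    {"original_message": "Failed login attempt for user admin",
     "eventClass": "Authentication Failure", "confidence": 0.89},
    {"original_message": "Firewall blocked incoming connection",
     "eventClass": "Network Access Control", "confidence": 0.93},
]

df = pd.DataFrame(results)

# Payloads for the two download buttons
json_payload = df.to_json(orient="records")
csv_payload = df.to_csv(index=False)

print(csv_payload.splitlines()[0])  # header row of the CSV export
```

In the app, `json_payload` and `csv_payload` would be passed to two `st.download_button` calls.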
## RAG Architecture

The application implements a Retrieval-Augmented Generation architecture with the following components:
```
┌──────────────────────────────────────────────────┐
│                   INPUT LAYER                    │
│ User uploads CSV with raw security log messages  │
└────────────────────────┬─────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────┐
│                 EMBEDDING LAYER                  │
│ Sentence-Transformers model (all-MiniLM-L6-v2)   │
│ converts text to 384-dimensional vectors         │
└────────────────────────┬─────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────┐
│                 RETRIEVAL LAYER                  │
│ FAISS vector database:                           │
│  • stores knowledge-base embeddings              │
│  • performs similarity search                    │
│  • returns top-k relevant contexts               │
└────────────────────────┬─────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────┐
│                AUGMENTATION LAYER                │
│ Combines:                                        │
│  • original log message                          │
│  • retrieved context (classification rules)      │
│  • structured prompt template                    │
└────────────────────────┬─────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────┐
│                 GENERATION LAYER                 │
│ Google Gemini LLM:                               │
│  • analyzes log + context                        │
│  • generates structured JSON output              │
│  • provides confidence score                     │
└────────────────────────┬─────────────────────────┘
                         ▼
┌──────────────────────────────────────────────────┐
│                   OUTPUT LAYER                   │
│ Structured classification:                       │
│  • eventClass, eventOutcome, eventSeverity       │
│  • eventAction, eventCategory                    │
│  • combined confidence score                     │
│  • export as JSON/CSV                            │
└──────────────────────────────────────────────────┘
```
- Knowledge Base:
  - Predefined classification rules and examples
  - Covers common security event patterns
  - Encoded into vector embeddings
- Vector Store (FAISS):
  - Fast similarity search
  - Efficient indexing of embeddings
  - Low memory footprint
- Embedding Model:
  - `sentence-transformers/all-MiniLM-L6-v2`
  - 384-dimensional embeddings
  - Optimized for semantic similarity
- LLM (Google Gemini):
  - Context-aware classification
  - Structured JSON output
  - Confidence estimation
- Hybrid Scoring:
  - Combines the vector similarity score and the LLM confidence score
  - Weighted average for final confidence
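A minimal sketch of the weighted average described above; the 0.6/0.4 weights mirror the configuration example later in this README, and the function name is illustrative:

```python
def combined_confidence(retrieval_score: float, llm_confidence: float,
                        w_retrieval: float = 0.6, w_llm: float = 0.4) -> float:
    """Blend the vector-similarity score and the LLM's self-reported
    confidence into a single 0-1 score."""
    score = w_retrieval * retrieval_score + w_llm * llm_confidence
    return max(0.0, min(1.0, score))  # clamp to [0, 1]

print(round(combined_confidence(0.92, 0.85), 3))  # 0.6*0.92 + 0.4*0.85 = 0.892
```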
## How It Works

The application follows a Retrieval-Augmented Generation (RAG) pipeline to classify each log message:
- Knowledge Base Creation: A predefined set of classification rules and descriptions is encoded into vector embeddings using `sentence-transformers`.
- Vector Database: These embeddings are stored in a FAISS index for fast and efficient retrieval.
- User Input: The user uploads a CSV file containing raw log messages.
- Retrieval: For each log message, the system creates an embedding and queries the FAISS index to retrieve the most semantically similar rules and context from the knowledge base.
- Generation: The original log message, along with the retrieved context, is passed to a Google Gemini model within a structured prompt.
- Output: The LLM analyzes the information and generates a structured JSON object containing the classification for the five required fields (`eventClass`, `eventOutcome`, etc.) and a confidence score.
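The retrieval step can be illustrated with plain NumPy: normalize the embeddings, then rank by inner product, which is what FAISS's `IndexFlatIP` does at scale. The toy bag-of-words `embed` below stands in for `SentenceTransformer('all-MiniLM-L6-v2').encode`, and the knowledge-base strings are made up for the example:

```python
import numpy as np

# Toy knowledge base (stand-ins for the real classification rules)
knowledge_base = [
    "Authentication failures indicate unsuccessful login attempts",
    "Firewall blocks represent denied network connections",
    "Malware detections indicate files flagged by antivirus",
]

# Shared vocabulary for the toy embedding
vocab = sorted({w for rule in knowledge_base for w in rule.lower().split()})

def embed(text: str) -> np.ndarray:
    """Toy embedding: L2-normalized bag-of-words vector."""
    vec = np.array([text.lower().split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

kb_vectors = np.stack([embed(rule) for rule in knowledge_base])

def retrieve_context(message: str, k: int = 2):
    """Return the k most similar rules with their cosine similarities."""
    sims = kb_vectors @ embed(message)   # inner product of unit vectors
    top = np.argsort(-sims)[:k]
    return [(knowledge_base[i], float(sims[i])) for i in top]

for rule, score in retrieve_context("Failed login attempt for user admin"):
    print(f"{score:.2f}  {rule}")
```

The retrieved rules are then interpolated into the prompt sent to Gemini in the generation step.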
## Tech Stack

- Frontend: Streamlit
- Backend: Python
- Data Handling: Pandas
- Language Model: Google Gemini
- Embedding Model: Sentence-Transformers
- Vector Database: FAISS (Facebook AI Similarity Search)
## Setup & Installation

Follow these steps to set up and run the project locally.
- Python 3.9 or higher
- pip package manager
- Google AI API key (available from Google AI Studio)
Clone the repository and create a virtual environment:

```shell
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Create a `requirements.txt` file with the following content:
```text
streamlit
pandas
faiss-cpu
sentence-transformers
google-generativeai
```
Then, install the packages:
```shell
pip install -r requirements.txt
```

This project requires a Google AI API key. For security, do not hardcode your key in `app.py`; use Streamlit's secrets management.
Create a folder and file: `.streamlit/secrets.toml`
Add your API key to the secrets.toml file:
```toml
# .streamlit/secrets.toml
GOOGLE_API_KEY = "AIzaSy..."
```

In `app.py`, replace the hardcoded API key with a call to Streamlit secrets:

```python
# In app.py, replace the hardcoded key:
# api_key = "AIzaSy..."
# With a lookup in Streamlit secrets:
api_key = st.secrets["GOOGLE_API_KEY"]
```

## Steps to Run

- Ensure all dependencies are installed:

  ```shell
  pip install -r requirements.txt
  ```

- Set up your API key in `.streamlit/secrets.toml`.

- Run the Streamlit application:

  ```shell
  streamlit run app.py
  ```

- Open your browser to `http://localhost:8501`.
If you prefer using Docker:

```shell
# Build the image
docker build -t log-classifier .

# Run the container
docker run -p 8501:8501 log-classifier
```

## Usage

- Open your web browser and navigate to the local URL provided (usually `http://localhost:8501`).
- Upload your CSV file. The CSV must contain a column named `Message` with the raw log text.
- Adjust the number of samples you want to process using the slider.
- Click the "Run Classification" button.
- View the results in the interactive table.
- Use the download buttons to save the output as JSON or CSV.
## Example Input/Output

Input CSV (`Message` column):

```text
Message
"Failed login attempt for user admin from IP 192.168.1.100"
"Firewall blocked incoming connection from 10.0.0.50 on port 443"
"User john.doe successfully authenticated via SSO"
"Malware detected in file document.exe, quarantined by antivirus"
"Suspicious outbound traffic to known C2 server detected"
```

JSON Format:

```json
[
  {
    "original_message": "Failed login attempt for user admin from IP 192.168.1.100",
    "eventClass": "Authentication Failure",
    "eventOutcome": "Failure",
    "eventSeverity": "Medium",
    "eventAction": "Alert",
    "eventCategory": "Authentication",
    "confidence": 0.89,
    "retrieval_score": 0.92,
    "llm_confidence": 0.85
  },
  {
    "original_message": "Firewall blocked incoming connection from 10.0.0.50 on port 443",
    "eventClass": "Network Access Control",
    "eventOutcome": "Blocked",
    "eventSeverity": "Low",
    "eventAction": "Block",
    "eventCategory": "Network Security",
    "confidence": 0.93,
    "retrieval_score": 0.95,
    "llm_confidence": 0.90
  },
  {
    "original_message": "User john.doe successfully authenticated via SSO",
    "eventClass": "Authentication Success",
    "eventOutcome": "Success",
    "eventSeverity": "Informational",
    "eventAction": "Allow",
    "eventCategory": "Authentication",
    "confidence": 0.96,
    "retrieval_score": 0.98,
    "llm_confidence": 0.94
  }
]
```

CSV Format:
| original_message | eventClass | eventOutcome | eventSeverity | eventAction | eventCategory | confidence |
|---|---|---|---|---|---|---|
| Failed login attempt for user admin from IP 192.168.1.100 | Authentication Failure | Failure | Medium | Alert | Authentication | 0.89 |
| Firewall blocked incoming connection from 10.0.0.50 on port 443 | Network Access Control | Blocked | Low | Block | Network Security | 0.93 |
| User john.doe successfully authenticated via SSO | Authentication Success | Success | Informational | Allow | Authentication | 0.96 |
- eventClass: Specific type of security event (e.g., "Authentication Failure", "Malware Detection")
- eventOutcome: Result of the event (Success, Failure, Blocked, Detected, etc.)
- eventSeverity: Impact level (Critical, High, Medium, Low, Informational)
- eventAction: Recommended or taken action (Alert, Block, Allow, Quarantine, Investigate)
- eventCategory: High-level category (Authentication, Network Security, Malware, etc.)
- confidence: Combined score (0-1) indicating classification certainty
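The field vocabulary above lends itself to a simple output check. The sketch below takes its allowed values from the descriptions above; the real app may accept a wider vocabulary, and `eventClass`/`eventCategory` are open-ended so they are not checked here:

```python
# Closed vocabularies per field, per the descriptions above (illustrative)
ALLOWED = {
    "eventOutcome": {"Success", "Failure", "Blocked", "Detected"},
    "eventSeverity": {"Critical", "High", "Medium", "Low", "Informational"},
    "eventAction": {"Alert", "Block", "Allow", "Quarantine", "Investigate"},
}

def validate(record: dict) -> list:
    """Return a list of problems found in one classification record."""
    errors = []
    for field, allowed in ALLOWED.items():
        if record.get(field) not in allowed:
            errors.append(f"{field}: unexpected value {record.get(field)!r}")
    conf = record.get("confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        errors.append("confidence: must be a number in [0, 1]")
    return errors

good = {"eventOutcome": "Failure", "eventSeverity": "Medium",
        "eventAction": "Alert", "confidence": 0.89}
print(validate(good))  # [] when the record is well-formed
```

A check like this can flag LLM outputs that drift outside the expected vocabulary before they reach the results table.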
## Configuration

Edit the `knowledge_base` list in the `create_knowledge_base()` function in `app.py`:
```python
knowledge_base = [
    "Authentication failures indicate unsuccessful login attempts...",
    "Firewall blocks represent denied network connections...",
    # Add your custom classification rules here
]
```

Change the `preferred_models` list in `find_working_models()`:
```python
preferred_models = [
    'gemini-1.5-flash',
    'gemini-1.5-pro',
    'gemini-pro',
]
```

Modify the number of context snippets retrieved:
```python
# In classify_log method
context = self.retrieve_context(message, k=5)  # Change k value
```

Modify the hybrid confidence calculation:
```python
# In classify_log method
combined_confidence = (
    0.6 * retrieval_score +  # Change weights as needed
    0.4 * llm_confidence
)
```

## Future Improvements

- Batch Processing: Add parallel processing for large CSV files
- Custom Knowledge Base Upload: Allow users to upload their own classification rules
- Filtering Options: Add filters for confidence threshold and severity levels
- Visualization Dashboard: Add charts for classification distribution and confidence scores
- Error Handling: Improve error messages and validation
- Multi-language Support: Extend to non-English security logs
- Fine-tuning: Train a custom model on domain-specific security logs
- Active Learning: Allow users to correct classifications and retrain
- API Endpoint: Create REST API for programmatic access
- Real-time Processing: Support streaming log ingestion
- Integration with SIEM: Connect to popular SIEM platforms
- Anomaly Detection: Add unsupervised learning for novel threat detection
- Contextual Analysis: Include temporal and relational analysis of events
- Multi-modal Input: Support for logs with additional metadata
- Explainability: Add LIME/SHAP explanations for classifications
Performance optimizations:

- Cache embeddings for repeated queries
- Use GPU acceleration for embedding generation
- Implement incremental FAISS index updates
- Add Redis for session management
- Optimize prompt engineering for faster LLM responses
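The embedding-cache idea from the list above can be sketched as an in-memory memo around the encoder; in the app this would wrap `SentenceTransformer.encode`, but the encoder below is a stand-in:

```python
import hashlib

class CachedEmbedder:
    """Memoize embeddings so repeated log lines are encoded only once."""

    def __init__(self, encode_fn):
        self._encode = encode_fn   # e.g. SentenceTransformer(...).encode
        self._cache = {}
        self.misses = 0            # how many texts were actually encoded

    def embed(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._encode(text)
        return self._cache[key]

# Stand-in encoder; a real one returns a 384-dimensional vector
embedder = CachedEmbedder(lambda text: [float(len(text))])
embedder.embed("failed login")
embedder.embed("failed login")   # second call is served from the cache
print(embedder.misses)  # 1
```

Persisting the cache externally (e.g. in Redis, as suggested above) would extend this across sessions.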
Security enhancements:

- Add authentication and authorization
- Implement rate limiting
- Add audit logging
- Support for on-premise deployment
- Data encryption at rest and in transit
## Contributing

Contributions are welcome! Please follow these steps:

- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
## License

This project is licensed under the MIT License. See the LICENSE file for details.
For questions or support, please open an issue on GitHub or contact the maintainers.
## Acknowledgements

- Streamlit for the amazing web framework
- Google AI for the Gemini API
- Sentence-Transformers for embedding models
- Facebook AI for FAISS vector search
- The open-source community for inspiration and support