Security Event Log RAG Classifier

An interactive web application built with Streamlit that uses a Retrieval-Augmented Generation (RAG) pipeline to classify security event logs into standardized categories.

📋 Table of Contents

Problem Statement
Features
RAG Architecture
How It Works
Tech Stack
Setup & Installation
Steps to Run
Usage
Example Input/Output
Configuration
Future Improvements
License

🎯 Problem Statement

Security teams deal with thousands of security event logs daily from various sources (firewalls, IDS/IPS, authentication systems, etc.). These logs need to be classified into standardized categories for:

Compliance Requirements: Meeting regulatory and audit reporting obligations (e.g., SIEM and SOC reporting)
Incident Response: Quick identification and prioritization of security events
Threat Detection: Pattern recognition across different log formats
Automation: Reducing manual classification effort and human error

Challenges:

Logs come in different formats and structures
Manual classification is time-consuming and error-prone
Need for consistent categorization across different log sources
Requirement for confidence scoring to identify uncertain classifications

Solution:

This application uses a RAG (Retrieval-Augmented Generation) pipeline to automatically classify security logs into five standardized fields:

eventClass: Type of security event
eventOutcome: Result of the event (e.g., success, failure, blocked)
eventSeverity: Impact level
eventAction: Action taken or required
eventCategory: High-level category
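As a rough illustration of the output shape, the five fields can be modeled as a small record that rejects incomplete results. The class name `Classification` and the helper `parse_classification` are hypothetical and are not taken from app.py.

```python
import json
from dataclasses import dataclass, asdict

# Field names come from the list above; the class itself is a hypothetical
# illustration, not the structure used in app.py.
REQUIRED_FIELDS = ("eventClass", "eventOutcome", "eventSeverity",
                   "eventAction", "eventCategory")


@dataclass
class Classification:
    eventClass: str
    eventOutcome: str
    eventSeverity: str
    eventAction: str
    eventCategory: str


def parse_classification(raw_json: str) -> Classification:
    """Parse a JSON object and fail loudly if any of the five fields is missing."""
    data = json.loads(raw_json)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Extra keys (such as a confidence value) are ignored in this sketch.
    return Classification(**{name: str(data[name]) for name in REQUIRED_FIELDS})
```

Calling `asdict()` on the result gives back a plain dictionary that can be serialized for export.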

✨ Features

Interactive UI: Simple and intuitive web interface powered by Streamlit.
CSV Upload: Easily upload your log files in .csv format.
RAG Pipeline: Retrieves relevant classification rules from a knowledge base and passes them to the LLM alongside each log message.
Vector Search: Uses FAISS for efficient similarity search to find relevant context for each log.
LLM-Powered Classification: Utilizes Google's Gemini models to reason over log data and provide structured output.
Downloadable Results: Export the classification results in both JSON and CSV formats.
Combined Confidence Score: Calculates a hybrid confidence score based on both vector similarity and the LLM's own confidence.

🏗️ RAG Architecture

The application implements a Retrieval-Augmented Generation architecture with the following components:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        INPUT LAYER                              │
│  User uploads CSV with raw security log messages                │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    EMBEDDING LAYER                              │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Sentence-Transformers Model                             │   │
│  │  (all-MiniLM-L6-v2)                                      │   │
│  │  Converts text to 384-dimensional vectors                │   │
│  └──────────────────────────────────────────────────────────┘   │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    RETRIEVAL LAYER                              │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  FAISS Vector Database                                   │   │
│  │  • Stores knowledge base embeddings                      │   │
│  │  • Performs similarity search                            │   │
│  │  • Returns top-k relevant contexts                       │   │
│  └──────────────────────────────────────────────────────────┘   │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                   AUGMENTATION LAYER                            │
│  Combines:                                                      │
│  • Original log message                                         │
│  • Retrieved context (classification rules)                     │
│  • Structured prompt template                                   │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    GENERATION LAYER                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Google Gemini LLM                                       │   │
│  │  • Analyzes log + context                                │   │
│  │  • Generates structured JSON output                      │   │
│  │  • Provides confidence score                             │   │
│  └──────────────────────────────────────────────────────────┘   │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                      OUTPUT LAYER                               │
│  Structured Classification:                                     │
│  • eventClass, eventOutcome, eventSeverity                      │
│  • eventAction, eventCategory                                   │
│  • Combined confidence score                                    │
│  • Export as JSON/CSV                                           │
└─────────────────────────────────────────────────────────────────┘

Component Details

Knowledge Base:
- Predefined classification rules and examples
- Covers common security event patterns
- Encoded into vector embeddings
Vector Store (FAISS):
- Fast similarity search
- Efficient indexing of embeddings
- Low memory footprint
Embedding Model:
- sentence-transformers/all-MiniLM-L6-v2
- 384-dimensional embeddings
- Optimized for semantic similarity
LLM (Google Gemini):
- Context-aware classification
- Structured JSON output
- Confidence estimation
Hybrid Scoring:
- Combines vector similarity score
- LLM confidence score
- Weighted average for final confidence

🧠 How It Works

The application follows a Retrieval-Augmented Generation (RAG) pipeline to classify each log message:

Knowledge Base Creation: A predefined set of classification rules and descriptions is encoded into vector embeddings using sentence-transformers.
Vector Database: These embeddings are stored in a FAISS index for fast and efficient retrieval.
User Input: The user uploads a CSV file containing raw log messages.
Retrieval: For each log message, the system creates an embedding and queries the FAISS index to retrieve the most semantically similar rules and context from the knowledge base.
Generation: The original log message, along with the retrieved context, is passed to a Google Gemini model within a structured prompt.
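Output: The LLM analyzes the information and generates a structured JSON object containing the classification for the five required fields (eventClass, eventOutcome, eventSeverity, eventAction, eventCategory) and its own confidence score, which is then combined with the retrieval similarity to produce the final confidence.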
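The augmentation and output steps above can be sketched as follows. This is a minimal illustration only: the template wording and the function names `build_prompt` and `extract_json` are assumptions, since the actual prompt used in app.py is not shown in this README.

```python
import json

# Hypothetical prompt template; the real wording in app.py may differ.
PROMPT_TEMPLATE = """You are a security log classifier.
Context rules:
{context}

Log message:
{message}

Respond with a JSON object containing eventClass, eventOutcome,
eventSeverity, eventAction, eventCategory and confidence (0-1)."""


def build_prompt(message: str, context_snippets: list[str]) -> str:
    """Combine the log message and the retrieved rules into one prompt string."""
    context = "\n".join(f"- {snippet}" for snippet in context_snippets)
    return PROMPT_TEMPLATE.format(context=context, message=message)


def extract_json(response_text: str) -> dict:
    """Parse the outermost {...} span of a model reply, ignoring text around it."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return json.loads(response_text[start:end + 1])
```

Parsing defensively is a design choice here: LLM replies sometimes wrap the JSON in extra prose, so the sketch extracts the object rather than assuming the whole reply is valid JSON.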

🛠️ Tech Stack

Frontend: Streamlit
Backend: Python
Data Handling: Pandas
Language Model: Google Gemini
Embedding Model: Sentence-Transformers
Vector Database: FAISS (Facebook AI Similarity Search)

🚀 Setup & Installation

Follow these steps to set up and run the project locally.

Prerequisites

Python 3.9 or higher
pip package manager
Google AI API key (available from Google AI Studio)

1. Clone the Repository

git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name

2. Create a Virtual Environment (Recommended)

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

3. Install Dependencies

Create a requirements.txt file with the following content:

streamlit
pandas
faiss-cpu
sentence-transformers
google-generativeai

Then, install the packages:

pip install -r requirements.txt

4. Configure Your API Key

This project requires a Google AI API key. For security, do not hardcode your key in app.py. Use Streamlit's secrets management.

Create a folder and file: .streamlit/secrets.toml

Add your API key to the secrets.toml file:

# .streamlit/secrets.toml
GOOGLE_API_KEY = "AIzaSy..."

In app.py, replace the hardcoded API key with a call to Streamlit secrets:

# In app.py, replace this line:
# api_key = "AIzaSy..."  # hardcoded key, redacted; never commit a real key

# With this line:
api_key = st.secrets["GOOGLE_API_KEY"]
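If the app may also run outside Streamlit (for example in tests), a small helper can look up the key in a secrets mapping first and then in the environment. This is a sketch under assumptions: `load_api_key` is a hypothetical helper, and the environment-variable fallback is not something the original app.py is known to do.

```python
import os
from typing import Mapping, Optional


def load_api_key(secrets: Mapping[str, str], name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from a secrets mapping, falling back to an environment variable."""
    key: Optional[str] = secrets.get(name) or os.environ.get(name)
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"{name} is not set in secrets.toml or the environment")
    return key
```

In app.py this might be called as `load_api_key(st.secrets)`, assuming the secrets file from the previous step exists.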

▶️ Steps to Run

Method 1: Local Development

Ensure all dependencies are installed:
```
pip install -r requirements.txt
```
Set up your API key in .streamlit/secrets.toml
Run the Streamlit application:
```
streamlit run app.py
```
Open your browser to http://localhost:8501

Method 2: Docker (Optional)

If you prefer using Docker (this requires a Dockerfile at the repository root):

# Build the image
docker build -t log-classifier .

# Run the container
docker run -p 8501:8501 log-classifier

📖 Usage

Open your web browser and navigate to the local URL provided (usually http://localhost:8501).
Upload your CSV file. The CSV must contain a column named Message with the raw log text.
Adjust the number of samples you want to process using the slider.
Click the "Run Classification" button.
View the results in the interactive table.
Use the download buttons to save the output as JSON or CSV.
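The upload requirements above (a `Message` column, plus a cap on the number of samples) can be checked before any classification runs. The sketch below uses only the standard library; `read_messages` is a hypothetical helper, and app.py itself uses Pandas for data handling.

```python
import csv
import io


def read_messages(csv_text: str, max_samples: int) -> list[str]:
    """Return up to max_samples non-empty values from the Message column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "Message" not in reader.fieldnames:
        raise ValueError("CSV must contain a column named 'Message'")
    messages: list[str] = []
    for row in reader:
        if len(messages) >= max_samples:
            break
        text = (row["Message"] or "").strip()
        if text:  # skip rows whose Message cell is empty
            messages.append(text)
    return messages
```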

📊 Example Input/Output

Sample Input CSV

Message
"Failed login attempt for user admin from IP 192.168.1.100"
"Firewall blocked incoming connection from 10.0.0.50 on port 443"
"User john.doe successfully authenticated via SSO"
"Malware detected in file document.exe, quarantined by antivirus"
"Suspicious outbound traffic to known C2 server detected"

Sample Output

JSON Format:

[
  {
    "original_message": "Failed login attempt for user admin from IP 192.168.1.100",
    "eventClass": "Authentication Failure",
    "eventOutcome": "Failure",
    "eventSeverity": "Medium",
    "eventAction": "Alert",
    "eventCategory": "Authentication",
    "confidence": 0.89,
    "retrieval_score": 0.92,
    "llm_confidence": 0.85
  },
  {
    "original_message": "Firewall blocked incoming connection from 10.0.0.50 on port 443",
    "eventClass": "Network Access Control",
    "eventOutcome": "Blocked",
    "eventSeverity": "Low",
    "eventAction": "Block",
    "eventCategory": "Network Security",
    "confidence": 0.93,
    "retrieval_score": 0.95,
    "llm_confidence": 0.90
  },
  {
    "original_message": "User john.doe successfully authenticated via SSO",
    "eventClass": "Authentication Success",
    "eventOutcome": "Success",
    "eventSeverity": "Informational",
    "eventAction": "Allow",
    "eventCategory": "Authentication",
    "confidence": 0.96,
    "retrieval_score": 0.98,
    "llm_confidence": 0.94
  }
]

CSV Format:

original_message	eventClass	eventOutcome	eventSeverity	eventAction	eventCategory	confidence
Failed login attempt for user admin from IP 192.168.1.100	Authentication Failure	Failure	Medium	Alert	Authentication	0.89
Firewall blocked incoming connection from 10.0.0.50 on port 443	Network Access Control	Blocked	Low	Block	Network Security	0.93
User john.doe successfully authenticated via SSO	Authentication Success	Success	Informational	Allow	Authentication	0.96
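Both export formats can be produced from the same list of result dictionaries. The following is a minimal standard-library sketch, not the export code in app.py; the helper names `to_json` and `to_csv` are hypothetical, and the column list mirrors the CSV sample above.

```python
import csv
import io
import json

# Column order taken from the CSV sample above.
COLUMNS = ["original_message", "eventClass", "eventOutcome", "eventSeverity",
           "eventAction", "eventCategory", "confidence"]


def to_json(results: list[dict]) -> str:
    """Serialize all fields, including retrieval_score and llm_confidence."""
    return json.dumps(results, indent=2)


def to_csv(results: list[dict]) -> str:
    """Write only the columns shown in the CSV sample; other keys are dropped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()
```

The returned strings could be handed to Streamlit's download buttons, though how app.py wires that up is not shown here.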

Field Descriptions

eventClass: Specific type of security event (e.g., "Authentication Failure", "Malware Detection")
eventOutcome: Result of the event (Success, Failure, Blocked, Detected, etc.)
eventSeverity: Impact level (Critical, High, Medium, Low, Informational)
eventAction: Recommended or taken action (Alert, Block, Allow, Quarantine, Investigate)
eventCategory: High-level category (Authentication, Network Security, Malware, etc.)
confidence: Combined score (0-1) indicating classification certainty
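One way to act on the confidence field downstream is to separate results that may need manual review. This is an illustrative sketch only: `split_by_confidence` and the 0.7 threshold are assumptions and are not part of the application.

```python
def split_by_confidence(results: list[dict], threshold: float = 0.7):
    """Partition results into (confident, needs_review) by the combined confidence."""
    confident, needs_review = [], []
    for result in results:
        # A missing confidence value is treated as 0.0, i.e. sent to review.
        if result.get("confidence", 0.0) >= threshold:
            confident.append(result)
        else:
            needs_review.append(result)
    return confident, needs_review
```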

📝 Configuration

Customizing the Knowledge Base

Edit the knowledge_base list in the create_knowledge_base() function in app.py:

knowledge_base = [
    "Authentication failures indicate unsuccessful login attempts...",
    "Firewall blocks represent denied network connections...",
    # Add your custom classification rules here
]

Adjusting the LLM Model

Change the preferred_models list in find_working_models():

preferred_models = [
    'gemini-1.5-flash',
    'gemini-1.5-pro',
    'gemini-pro'
]
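The order of the list matters if the first available entry is the one used. The sketch below shows that kind of fallback selection; `pick_model` is a hypothetical helper, and the real logic inside `find_working_models()` is not shown in this README and may differ.

```python
def pick_model(preferred: list[str], available: list[str]) -> str:
    """Return the first preferred model that appears in the available list."""
    # Available names may carry a "models/" prefix, so compare on the bare name.
    bare_names = {name.split("/", 1)[-1] for name in available}
    for name in preferred:
        if name in bare_names:
            return name
    raise RuntimeError("none of the preferred models is available")
```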

Tuning Retrieval Parameters

Modify the number of context snippets retrieved:

# In classify_log method
context = self.retrieve_context(message, k=5)  # Change k value
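To make the effect of `k` concrete, here is a pure-Python stand-in for top-k similarity search. It is not how FAISS works internally and is not the code in app.py; `cosine` and `top_k` are hypothetical helpers that only illustrate what "return the k most similar knowledge-base entries" means.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], kb_vectors: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar knowledge-base vectors."""
    scored = [(i, cosine(query, vector)) for i, vector in enumerate(kb_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A larger `k` gives the LLM more context at the cost of a longer prompt; if `k` exceeds the knowledge-base size, this sketch simply returns every entry.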

Adjusting Confidence Weighting

Modify the hybrid confidence calculation:

# In classify_log method
combined_confidence = (
    0.6 * retrieval_score +  # Change weights as needed
    0.4 * llm_confidence
)
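The same weighting can be wrapped in a small function so the weights stay consistent and always sum to 1. This is a sketch: the name `combined_confidence`, the clamping of inputs to the 0-1 range, and the rounding to two decimals are assumptions added for illustration.

```python
def _clamp(value: float) -> float:
    """Restrict a score to the 0-1 range (an assumption, not from app.py)."""
    return max(0.0, min(1.0, value))


def combined_confidence(retrieval_score: float, llm_confidence: float,
                        retrieval_weight: float = 0.6) -> float:
    """Weighted average of the two scores; the LLM weight is 1 - retrieval_weight."""
    if not 0.0 <= retrieval_weight <= 1.0:
        raise ValueError("retrieval_weight must be between 0 and 1")
    score = (retrieval_weight * _clamp(retrieval_score)
             + (1.0 - retrieval_weight) * _clamp(llm_confidence))
    return round(score, 2)
```

With the default 0.6/0.4 weights, the first sample output above works out as 0.6 × 0.92 + 0.4 × 0.85 = 0.892, which rounds to the reported 0.89.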

🔮 Future Improvements

Short-term Enhancements

Batch Processing: Add parallel processing for large CSV files
Custom Knowledge Base Upload: Allow users to upload their own classification rules
Filtering Options: Add filters for confidence threshold and severity levels
Visualization Dashboard: Add charts for classification distribution and confidence scores
Error Handling: Improve error messages and validation

Medium-term Enhancements

Multi-language Support: Extend to non-English security logs
Fine-tuning: Train a custom model on domain-specific security logs
Active Learning: Allow users to correct classifications and retrain
API Endpoint: Create REST API for programmatic access
Real-time Processing: Support streaming log ingestion

Long-term Enhancements

Integration with SIEM: Connect to popular SIEM platforms
Anomaly Detection: Add unsupervised learning for novel threat detection
Contextual Analysis: Include temporal and relational analysis of events
Multi-modal Input: Support for logs with additional metadata
Explainability: Add LIME/SHAP explanations for classifications

Performance Optimizations

Cache embeddings for repeated queries
Use GPU acceleration for embedding generation
Implement incremental FAISS index updates
Add Redis for session management
Optimize prompt engineering for faster LLM responses

Security Enhancements

Add authentication and authorization
Implement rate limiting
Add audit logging
Support for on-premise deployment
Data encryption at rest and in transit

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

📧 Contact

For questions or support, please open an issue on GitHub or contact the maintainers.

🙏 Acknowledgments

Streamlit for the amazing web framework
Google AI for the Gemini API
Sentence-Transformers for embedding models
Facebook AI for FAISS vector search
The open-source community for inspiration and support
