52 changes: 52 additions & 0 deletions schema/FileTypeToMetadata.md
@@ -0,0 +1,52 @@
# Mapping of metadata types to file types

| Metadata Type | Description | File Types |
|---------------|-------------|------------|
| **PEMetadata** | Windows executables and legacy DOS formats | PE, Malformed PE, DOS |
| **ELFMetadata** | Linux/Unix executables and kernel images | ELF, Linux Kernel Image |
| **MachOMetadata** | Apple/macOS native executables and packages | MACHOFAT, MACHOFAT64, EFIFAT, MACHO32, MACHO64, IPA, MACOS_DMG |
| **CoffMetadata** | Common Object File Format family (Unix/legacy) | COFF, XCOFF32, XCOFF64, ECOFF |
| **JavaMetadata** | Java bytecode and Android packages | JAVACLASS, JAR, WAR, EAR, APK |
| **AOUTMetadata** | Legacy Unix a.out executable format | A.OUT big, A.OUT little |
| **RPMMetadata** | Red Hat Package Manager packages | RPM Package |
| **UbootImageMetadata** | U-Boot bootloader images (embedded systems) | UIMAGE |
| **DockerImageMetadata** | Docker container image formats | DOCKER_GZIP, DOCKER_TAR |
| **OleMetadata** | Microsoft OLE/COM compound documents and installers | OLE, MSCAB, ISCAB, MSIX |
| **NativeMetadata** | LLVM compiler intermediate representations | LLVM_BITCODE, LLVM_IR |
| **OtherMetadata** | Archives, compressed files, and miscellaneous formats | GZIP, BZIP2, XZ, TAR, RAR, ZIP, AR_LIB, OMF_LIB, ZLIB, CPIO_BIN big, CPIO_BIN little, CPIO_ASCII_OLD, CPIO_ASCII_NEW, CPIO_ASCII_NEW_CRC, ZSTANDARD, ZSTANDARD_DICTIONARY, ISO_9660_CD |

**Notes:**
- **JavascriptMetadata** is defined but has no file types mapped (no JS-related types in your enum)
- All 52 file type enum values are accounted for
- All 13 metadata types are included
- Each file type appears exactly once

If you want to use JavascriptMetadata, you'll need to add file types like "JS", "NODE", "JAVASCRIPT" to your enum and map them accordingly.

# Implementation
FileType is an enum defined near the beginning of `observation.schema.json`.

Each specific metadata type appears as a variant in a `oneOf` array, like:

```JSON
"oneOf": [
{
"$ref": "#/$defs/AOUTMetadata",
"x-duckdb-table": "aout_metadata",
"x-duckdb-pk": "observation_uuid",
"x-duckdb-when": ["A.OUT big", "A.OUT little"]

},
```

The details follow later in the file, via the `#/$defs` reference:

```JSON
"AOUTMetadata": {
"type": "object",
"required": [ "aoutMachineType" ],
"properties": {
"aoutMachineType": { "type": "string" }
}
},
```
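For illustration, here is a minimal Python sketch (not the project's actual code) of how the `oneOf` variants could be read back to route a file type to its metadata table. It assumes the variants live under `properties.metadata.oneOf`, matching the excerpt above; the actual location in the schema may differ.

```python
import json

# Load the schema that drives everything (path assumed relative to this directory)
with open("observation.schema.json") as f:
    schema = json.load(f)

# Build a FileType -> table-name routing map from the oneOf variants.
# Assumption: the variants live under properties.metadata.oneOf.
routing = {}
for variant in schema["properties"]["metadata"]["oneOf"]:
    table = variant["x-duckdb-table"]
    for filetype in variant.get("x-duckdb-when", []):
        routing[filetype] = table

print(routing.get("A.OUT big"))  # -> "aout_metadata"
```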
29 changes: 29 additions & 0 deletions schema/Makefile
@@ -0,0 +1,29 @@
PYTHON = python3
DCG = datamodel-codegen
SCHEMA = observation.schema.json
MODELS = eyeon_models.py
DDL = eyeon_ddl.sql

.PHONY: all clean models ddl regen

all: models ddl

models: $(SCHEMA)
	$(PYTHON) -m pip show datamodel-code-generator >/dev/null || $(PYTHON) -m pip install datamodel-code-generator
	$(DCG) \
		--input $(SCHEMA) \
		--input-file-type jsonschema \
		--target-python-version 3.13 \
		--output $(MODELS) \
		--class-name ObservationModel \
		--use-standard-collections \
		--output-model-type pydantic_v2.BaseModel \
		--use-default

ddl: $(SCHEMA)
	$(PYTHON) gen_ddl.py $(SCHEMA) > $(DDL)

regen: clean all

clean:
	rm -f $(MODELS) $(DDL)
117 changes: 117 additions & 0 deletions schema/README.md
@@ -0,0 +1,117 @@
## Overview

This project uses a **JSON Schema** as the single source of truth for defining the structure and constraints of observation data. By maintaining the schema in one place, we ensure consistency across both application code and data storage.

### What It Does

- **Generates Python Data Classes:**
Automatically creates Pydantic model classes from the JSON Schema. These classes provide type-safe validation, serialization, and autocompletion within your Python code.

- **Generates Database DDL Scripts:**
Produces database schema (DDL) scripts that match the JSON Schema, ensuring the data stored in your database adheres to the same structure and constraints as your application code.

### Why This Approach

- **Consistency:**
Using the JSON Schema as the authoritative source eliminates discrepancies between code and database definitions.

- **Automation:**
Changes to the schema automatically propagate to both Python models and database scripts, reducing manual effort and the risk of errors.

- **Validation:**
Pydantic models enforce data validation at runtime, while the database schema enforces constraints at the storage layer.

- **Developer Experience:**
Autogenerated classes support IDE autocompletion and static analysis, making development faster and safer.

---

**In summary:**
This workflow streamlines development by ensuring that your data definitions are always up-to-date and consistent across your Python code and database, all driven from a single, version-controlled JSON Schema file.

# Workflow Process

Typically, the workflow is simply:

1. Modify the JSON schema as needed and confirm that all unit tests still pass.
   (Seth, can you provide an example of how to do this?)
2. Run `make all` in this directory, which will:
   1. `pip install datamodel-code-generator` if it's not already installed
   2. Generate `eyeon_models.py`, which contains the Pydantic model classes
   3. Generate a SQL DDL file that creates the necessary tables for the backend database
3. Go back to CLI development

# Reference

## Generating the Pydantic models

```bash
datamodel-codegen \
--input observation.schema.json \
--input-file-type jsonschema \
  --target-python-version 3.13 \
--output eyeon_models.py \
--class-name ObservationModel \
--use-standard-collections \
--output-model-type pydantic_v2.BaseModel
```

## Example Usage

```python
from eyeon_models import ObservationModel

eyeon = ObservationModel(
    bytecount=5,
    filename='test.me',
    magic='ooh/magic',
    md5='sumhash',
    observation_ts='10/13/2025',
    sha1='shame',
    sha256='sha256',
    uuid='xyz',
)
```
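
Since the generated classes are Pydantic v2 models, the standard v2 API applies to them, for example:

```python
# Standard Pydantic v2 methods on the generated model
print(eyeon.model_dump())        # dict of validated fields
print(eyeon.model_dump_json())   # JSON string

# Round-trip validation; raises pydantic.ValidationError on bad data
obs = ObservationModel.model_validate(eyeon.model_dump())
```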

## datamodel-codegen command overview

This command uses datamodel-code-generator to convert a JSON Schema file into Python classes targeting Pydantic v2, with specific output and type behavior.

Official documentation:
- https://koxudaxi.github.io/datamodel-code-generator/

### What the command does
- Reads your JSON Schema file, observation.schema.json.
- Generates Python model classes that conform to the schema.
- Targets Python 3.13 syntax and standard library types.
- Emits Pydantic v2 BaseModel classes for validation and serialization.
- Uses a specific top-level class name for the root schema.

### Command, broken down by argument

- `datamodel-codegen`
- The CLI tool that generates Python models from various schema sources.

- `--input observation.schema.json`
- Path to the source schema file.
- Tells the generator what to parse.

- `--input-file-type jsonschema`
- Explicitly sets the input format, helpful when the file extension is ambiguous.
- Ensures the parser interprets the file as JSON Schema, not OpenAPI or others.

- `--target-python-version 3.13`
- Controls the Python syntax features in the generated code.
- For Python 3.13, you get modern typing behaviors appropriate for that version.

- `--output eyeon_models.py`
- Destination file for the generated models.
- All classes are written into this single module.

- `--class-name ObservationModel`
- Sets the name of the root model class that represents the schema’s top-level object.
- Useful for autocompletion and clear imports in your app code.

- `--use-standard-collections`
- Uses built-in collection types like list and dict instead of typing.List and typing.Dict in annotations where appropriate.
- Generally results in cleaner, more modern type hints.

- `--output-model-type pydantic_v2.BaseModel`
- Targets Pydantic v2’s BaseModel for the generated classes.
- Ensures compatibility with Pydantic v2 APIs, including model_dump, model_validate, and RootModel for root schemas where needed.
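
Putting that together, the generated root model comes out roughly like the sketch below. This is illustrative only; the exact fields and types are driven by `observation.schema.json`.

```python
from pydantic import BaseModel

class ObservationModel(BaseModel):
    # Field names taken from the Example Usage above; types are assumed
    bytecount: int
    filename: str
    magic: str
    md5: str
    observation_ts: str
    sha1: str
    sha256: str
    uuid: str
    # ...plus the metadata/certificate structures defined in the schema
```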

## gen_ddl.py overview

This script loads the JSON schema and uses that to generate corresponding SQL tables. The goal is to convert from a JSON object model to a normalized SQL model with multiple tables.

For example, the Observation table is composed of the simple, top-level fields of the JSON model. It specifically excludes the more complex fields, such as metadata and certificates; those are built as separate tables that can be joined back to the observation table at query time.
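
A stripped-down sketch of that idea is shown below. It is not the actual `gen_ddl.py`; the type map and table name are assumptions for illustration.

```python
import json
import sys

# Minimal JSON Schema type -> DuckDB type map (assumed, not exhaustive)
TYPE_MAP = {"string": "VARCHAR", "integer": "BIGINT",
            "number": "DOUBLE", "boolean": "BOOLEAN"}

with open(sys.argv[1]) as f:
    schema = json.load(f)

cols = []
for name, prop in schema.get("properties", {}).items():
    sql_type = TYPE_MAP.get(prop.get("type"))
    if sql_type is None:
        # Skip complex fields (objects/arrays like metadata, certificates);
        # those become separate, joinable tables.
        continue
    cols.append(f"    {name} {sql_type}")

print("CREATE TABLE observations (\n" + ",\n".join(cols) + "\n);")
```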
68 changes: 68 additions & 0 deletions schema/X-DuckDb Directives.md
@@ -0,0 +1,68 @@
Here's a comprehensive table of all x-duckdb directives we've defined:

| Directive | Function | Example Definition | Example SQL Output |
|-----------|----------|-------------------|-------------------|
| **x-duckdb-table** | Specifies the table name for this schema node | `"signatures": { "type": "array", "x-duckdb-table": "signatures" }` | `CREATE TABLE signatures (...)` |
| **x-duckdb-pk** | Defines the primary key column name | `"x-duckdb-pk": "uuid"` | `uuid VARCHAR PRIMARY KEY` |
| **x-duckdb-pk-type** | Specifies the data type for the primary key (default: VARCHAR) | `"x-duckdb-pk": "signature_id", "x-duckdb-pk-type": "INTEGER"` | `signature_id INTEGER PRIMARY KEY` |
| **x-duckdb-pk-seq** | Defines the SEQUENCE object to use for auto-incrementing the PK. Used when no natural PK exists | `"x-duckdb-pk-seq": "certificate_seq"` | `CREATE SEQUENCE certificate_seq; PRIMARY KEY DEFAULT NEXTVAL('certificate_seq')` |
| **x-duckdb-fk** | Defines the foreign key column name referencing parent table | `"x-duckdb-fk": "observation_uuid"` | `observation_uuid VARCHAR, FOREIGN KEY (observation_uuid) REFERENCES observations(uuid)` |
| **x-duckdb-type** | Specifies the data type for this column (default: type defined in schema) | `"x-duckdb-type": "BIGINT"` | `bytesize BIGINT` |
| **x-duckdb-flatten** | Flattens nested object properties into parent table with prefix | `"elfIdent": { "type": "object", "x-duckdb-flatten": true, "properties": {"EI_CLASS": {...}} }` | `elfIdent_EI_CLASS INTEGER DEFAULT NULL` (in parent table) |
| **x-duckdb-array** | Stores array as native DuckDB array type | `"elfDependencies": { "type": "array", "x-duckdb-array": true, "items": {"type": "string"} }` | `elfDependencies VARCHAR[] DEFAULT NULL` |
| **x-duckdb-json** | Stores complex nested structures as JSON | `"binaries": { "type": "array", "x-duckdb-json": true }` | `binaries JSON DEFAULT NULL` |
| **x-duckdb-value-column** | For primitive arrays in separate tables, names the value column | `"hosts": { "type": "array", "x-duckdb-table": "observation_hosts", "x-duckdb-value-column": "host" }` | `CREATE TABLE observation_hosts (observation_uuid VARCHAR, host VARCHAR)` |
| **x-duckdb-discriminator** | Field used to determine polymorphic type (oneOf) | `"metadata": { "x-duckdb-discriminator": "filetype", "oneOf": [...] }` | Used to populate `filetype_type` column and view WHERE clauses |
| **x-duckdb-when** | List of discriminator values that trigger this variant | `{"$ref": "#/$defs/ELFMetadata", "x-duckdb-when": ["ELF"]}` | `WHERE filetype_type = 'ELF'` (in view definition) |
| **x-duckdb-polymorphic-strategy** | Strategy for handling oneOf: "separate-tables" or "single-table" | `"x-duckdb-polymorphic-strategy": "single-table"` | **separate-tables**: Multiple tables (pe_metadata, elf_metadata)<br>**single-table**: One wide table + views |

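As a concrete illustration, a generator could turn a single property's directives into a column definition along these lines (a sketch, assuming only the directives listed above):

```python
def column_for(name: str, prop: dict) -> str:
    """Map one schema property plus its x-duckdb directives to a column."""
    if prop.get("x-duckdb-json"):
        return f"{name} JSON DEFAULT NULL"
    if prop.get("x-duckdb-array"):
        item_type = "VARCHAR" if prop["items"]["type"] == "string" else "BIGINT"
        return f"{name} {item_type}[] DEFAULT NULL"
    # x-duckdb-type overrides the default mapping from the schema type
    return f"{name} {prop.get('x-duckdb-type', 'VARCHAR')}"

print(column_for("elfDependencies",
                 {"type": "array", "x-duckdb-array": True,
                  "items": {"type": "string"}}))
# -> elfDependencies VARCHAR[] DEFAULT NULL
```
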
## Polymorphic Strategy Comparison

### separate-tables (default)
```json
"metadata": {
"x-duckdb-discriminator": "filetype",
"x-duckdb-polymorphic-strategy": "separate-tables",
"oneOf": [
{"$ref": "#/$defs/PEMetadata", "x-duckdb-table": "pe_metadata", "x-duckdb-pk": "observation_uuid"}
]
}
```

```sql
CREATE TABLE pe_metadata (
observation_uuid VARCHAR PRIMARY KEY,
peMachine VARCHAR,
peIsExe BOOLEAN DEFAULT NULL,
FOREIGN KEY (observation_uuid) REFERENCES observations(uuid)
);
```

### single-table
```json
"metadata": {
"x-duckdb-discriminator": "filetype",
"x-duckdb-polymorphic-strategy": "single-table",
"x-duckdb-table": "file_metadata",
"oneOf": [
{"$ref": "#/$defs/PEMetadata", "x-duckdb-table": "pe_metadata", "x-duckdb-when": ["PE"]}
]
}
```

```sql
CREATE TABLE file_metadata (
observation_uuid VARCHAR PRIMARY KEY,
filetype_type VARCHAR,
peMachine VARCHAR DEFAULT NULL,
peIsExe BOOLEAN DEFAULT NULL,
elfDependencies VARCHAR[] DEFAULT NULL,
-- all columns from all variants
FOREIGN KEY (observation_uuid) REFERENCES observations(uuid)
);

CREATE VIEW pe_metadata AS
SELECT observation_uuid, peMachine, peIsExe, ...
FROM file_metadata
WHERE filetype_type = 'PE';
```
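
With either strategy, downstream code queries `pe_metadata` the same way, whether it is a real table or a view. A quick check from Python (database file name assumed from the viewer script in this PR):

```python
import duckdb

con = duckdb.connect("eyeon_metadata.duckdb", read_only=True)

# Works unchanged under both polymorphic strategies
df = con.execute(
    "SELECT observation_uuid, peMachine, peIsExe FROM pe_metadata LIMIT 5"
).df()
print(df)
```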
106 changes: 106 additions & 0 deletions schema/dlt/EyeOnRawViewer.py
@@ -0,0 +1,106 @@
# Requirements: streamlit, duckdb

import streamlit as st
import duckdb

st.title("Binary Metadata Viewer")

try:
    # Compact settings in collapsible section
    with st.sidebar:
        st.subheader("Database")
        db_path = st.text_input("Database path:", "eyeon_metadata.duckdb")

        con = duckdb.connect(db_path, read_only=True)
        schema_list = [s[0] for s in con.execute(
            "SELECT DISTINCT schema_name FROM information_schema.schemata ORDER BY ALL"
        ).fetchall()]

        # Schema selection inside the same sidebar context
        cur_schema = st.selectbox("Schema to use", schema_list)

        if cur_schema is not None:
            con.sql(f"USE {cur_schema}")
            table_list = [s[0] for s in con.execute("SHOW TABLES").fetchall()]
            with st.expander("Tables"):
                st.table(table_list)

    if "raw_obs" not in table_list:
        st.warning("Pick a valid schema. This one doesn't have the RAW_OBS table")
        st.stop()

    # Display some stats about the database
    # Go steal code from the old streamlit app...
    st.markdown("_Cool Stats Here_")

    # Main UI - prominently displayed
    filter_text = st.text_input(
        "🔍 Filter files:",
        placeholder="Use % or * for wildcard (case insensitive)"
    )

    # Apply filter (translate * wildcards into SQL % wildcards)
    if filter_text:
        files_df = con.execute(
            "SELECT uuid, filename, bytecount FROM raw_obs WHERE filename ILIKE ? ORDER BY filename",
            [f"%{filter_text.replace('*', '%')}%"]
        ).df()
    else:
        files_df = con.execute(
            "SELECT uuid, filename, bytecount FROM raw_obs ORDER BY filename"
        ).df()

    # File selector
    if len(files_df) > 0:
        selected_file = st.selectbox(
            "📄 Select a file:",
            files_df['filename'].tolist(),
            format_func=lambda x: x if x else "(unnamed)"
        )

        # Get the UUID for the selected file
        uuid = files_df[files_df['filename'] == selected_file]['uuid'].iloc[0]

        st.subheader("File Info")
        file_info = files_df[files_df['uuid'] == uuid]
        st.dataframe(file_info, use_container_width=True)

        # Get metadata tables dynamically (top-level only, no nested child tables)
        metadata_tables = [s[0] for s in con.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_name LIKE 'metadata_%' AND NOT contains(table_name, '__') "
            "ORDER BY table_name"
        ).fetchall()]

        for table in metadata_tables:
            result = con.execute(
                f"SELECT * FROM {table} WHERE uuid = ?",
                [uuid]
            ).df()

            if len(result) > 0:
                st.subheader(table.replace('metadata_', '').upper())
                st.dataframe(result, use_container_width=True)

                # Find any nested child tables split out of this one
                nested_tables = [s[0] for s in con.execute(
                    "SELECT table_name FROM information_schema.tables "
                    "WHERE table_name LIKE ? AND contains(table_name, '__') "
                    "ORDER BY table_name",
                    [f"{table}__%"]
                ).fetchall()]

                for nested_table in nested_tables:
                    nested_result = con.execute(
                        f"SELECT * FROM {nested_table} WHERE _dlt_parent_id = ?",
                        [result['_dlt_id'].iloc[0]]
                    ).df()

                    if len(nested_result) > 0:
                        st.subheader(nested_table.replace('metadata_', '').upper())
                        st.dataframe(nested_result, use_container_width=True)
    else:
        st.warning("No files found in database")

except Exception as e:
    st.error(f"Error: {e}")