jackwilkie/Signal-Based-Malware-Classification

Signal-Based Malware Classification using 1D CNNs

Official GitHub repo for the paper Signal-Based Malware Classification using 1D CNNs, which proposes the use of 1D signal representations as an alternative to byteplot images. This repository provides a reference implementation of the data processing pipeline, 1D CNNs, and evaluation procedure.

Malware Signals

Malware signals are 1D representations of the bytecode of an executable which act as an alternative to byteplot images as input to machine learning models. These signals can be statically extracted from various file formats (e.g. EXE, APK) and used to train a 1D CNN for malware classification. By using a 1D representation of the binaries, more information from the original binary is preserved and the addition of spurious spatial correlation is avoided, resulting in improved downstream model performance. A comparison of malware signals with byteplot images is shown below:
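To make the spatial-correlation point concrete, the following minimal sketch (not part of the repository) lays a byte sequence out as a byteplot and checks which file offsets end up as neighbors. The image width of 256 and the stand-in array are illustrative assumptions.

```python
import numpy as np

# Illustrative stand-in for a binary: the value at each position is its own file offset.
n_bytes = 65536
offsets = np.arange(n_bytes)

# Byteplot layout: wrap the sequence into rows of a fixed (assumed) width.
width = 256
byteplot = offsets.reshape(-1, width)

# Horizontal neighbors in the image are adjacent bytes in the file...
horizontal_gap = int(byteplot[0, 1] - byteplot[0, 0])
# ...but vertical neighbors are `width` bytes apart, even though a 2D
# convolution treats both directions as equally close.
vertical_gap = int(byteplot[1, 0] - byteplot[0, 0])

# In the 1D signal, every neighboring pair is adjacent in the file.
signal_gap = int(offsets[1] - offsets[0])

print(horizontal_gap, vertical_gap, signal_gap)  # 1 256 1
```

The vertical adjacency depends entirely on the chosen image width, which is the kind of spurious correlation a 1D representation avoids.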

The Python function bytes_to_signal converts np.int8 NumPy arrays representing bytes to 1D signal representations. Example usage:

import numpy as np
from create_signal import bytes_to_signal

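# Original byte sequence (each integer represents one byte)
# Replace these with your own bytes
# Values above 0x7F (e.g. 0x90) do not fit in np.int8, so build the array as
# np.uint8 and reinterpret the same bytes as np.int8
byte_array = np.array([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00], dtype=np.uint8).view(np.int8)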

# Desired 1D signal length
length = 65536

# Convert byte array into signal representation
# Returns an np.float32 array of shape (length,)
signal = bytes_to_signal(byte_array, length)

print(f"signal shape: {signal.shape}, dtype: {signal.dtype}")

Additionally, the script create_signal.py can be used to convert an np.int8 array to a 1D signal representation from the terminal:

python3 create_signal.py

Defaults

  • Uses the example binary at example_signals/example_binary.npy
  • Saves the signal to example_signals/example_signal.npy

Customization

  • Set the input binary path with --binary_path
  • Set the output save path with --save_path
  • Control the signal length with --length
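Both bytes_to_signal and create_signal.py expect the binary as an np.int8 array, so an executable on disk has to be read into that form first. The following is a minimal sketch of one way to do this with NumPy. The helper name file_to_int8_npy and the file names are hypothetical, and whether the repository prepares its inputs exactly this way is an assumption.

```python
from pathlib import Path

import numpy as np


def file_to_int8_npy(binary_file, npy_file):
    """Read a file's raw bytes as np.int8 and save them as a .npy array."""
    # np.fromfile reinterprets each raw byte as a signed 8-bit integer,
    # so byte values 0x80-0xFF appear as -128..-1.
    byte_array = np.fromfile(binary_file, dtype=np.int8)
    # Create the output directory if it does not exist yet.
    Path(npy_file).parent.mkdir(parents=True, exist_ok=True)
    np.save(npy_file, byte_array)
    return byte_array
```

The resulting .npy file can then be passed to the script with --binary_path, or loaded with np.load and passed to bytes_to_signal.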

1D CNNs

The 1D signal representations can be used to train 1D CNNs. 2D CNN architectures developed for byteplot images can easily be adapted to operate on the 1D signals by squaring the kernel size and stride of each convolution (e.g. a 3×3 kernel becomes a kernel of length 9), with the 1D equivalent models being found to outperform their 2D counterparts. Example usage for a ResNet1D18 model:

import torch as T
from model.model_factory import model_factory

# Select model name. 
# Can be any model from our model dict in ./model/model_factory.py
model_name = 'resnet1d18'

# Select activation function
# Can be 'relu' or 'gelu'
act = 'relu'

# Select task granularity
# Can be 'binary', 'type', or 'family'
task = 'type'

# Get number of classes based on task
if task == 'binary':
    n_classes = 2
elif task == 'type':
    n_classes = 47
else: # family level classification
    n_classes = 696
        
# Build model
model = model_factory(dict(
    name=model_name,
    n_classes=n_classes,
    in_channels=1,
    act_layer=act,
))

# Get test input
# Should be (batch size, 1, length)
L = 65536 # signal length
B = 8 # example batch size
x = T.randn((B, 1, L), dtype = T.float32)

# Test forward pass
# Output will be (batch size, num classes)
z = model(x)
print(f'output size: {z.size()}')
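The kernel-size and stride squaring described above can be checked with a little arithmetic. The sketch below is illustrative only: it uses the standard convolution output-length formula, and the 7×7 stride-2 layer, the 256×256 byteplot size, and the padding values are assumptions chosen for the example rather than values taken from the repository.

```python
def conv_out_len(n, kernel, stride, padding=0):
    """Output length of a convolution along one axis (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


# 2D layer on a 256x256 byteplot (65536 bytes): 7x7 kernel, stride 2.
k2d, s2d = 7, 2
side = conv_out_len(256, k2d, s2d, padding=k2d // 2)
positions_2d = side * side

# 1D equivalent on a 65536-sample signal: square the kernel size and stride.
k1d, s1d = k2d ** 2, s2d ** 2  # 49 and 4
positions_1d = conv_out_len(65536, k1d, s1d, padding=k1d // 2)

# Both layers see 49 inputs per output and produce the same number of outputs.
print(side, positions_2d, positions_1d)  # 128 16384 16384
```

Under these assumptions, squaring keeps the number of inputs per output and the overall downsampling factor the same as in the 2D layer, while every input within a 1D window is contiguous in the file.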

Comparison

After adapting a ResNetV2-152D model to operate on 1D signals and inserting squeeze-and-excitation layers, the ResNet1DV2-152D-SE model achieves state-of-the-art performance on the binary, type, and family task granularities on the MalNet dataset.

| Model | Binary F1 Score | Binary Precision | Binary Recall | Type F1 Score | Type Precision | Type Recall | Family F1 Score | Family Precision | Family Recall |
|---|---|---|---|---|---|---|---|---|---|
| ResNet1DV2-152D-SE | .874 | .907 | .846 | .503 | .643 | .453 | .507 | .580 | .480 |
| SHERLOCK | .854 | .920 | .810 | .497 | .628 | .447 | .491 | .568 | .461 |
| ResNet18 | .862 | .893 | .837 | .467 | .556 | .424 | .454 | .538 | .423 |
| ResNet50 | .854 | .907 | .814 | .479 | .566 | .441 | .468 | .541 | .443 |
| DenseNet121 | .864 | .900 | .834 | .471 | .558 | .428 | .461 | .529 | .438 |
| DenseNet169 | .864 | .890 | .841 | .477 | .573 | .433 | .462 | .545 | .434 |
| MobileNetV2(x.5) | .857 | .894 | .827 | .460 | .547 | .424 | .451 | .528 | .423 |
| MobileNetV2(x1) | .854 | .889 | .825 | .452 | .527 | .419 | .438 | .532 | .405 |
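Note that the F1 scores in the table are not the harmonic mean of the precision and recall reported beside them (for ResNet1DV2-152D-SE on the binary task, 2 × .907 × .846 / (.907 + .846) ≈ .875 rather than .874). This is what one would expect if each metric is averaged over classes separately. That the table uses such macro-averaging is an assumption here, not something stated above. A minimal sketch with invented labels shows the effect:

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the classes in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels (invented for illustration, not MalNet data)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

precision, recall, f1 = macro_scores(y_true, y_pred)
harmonic = 2 * precision * recall / (precision + recall)
print(f"macro F1 = {f1:.3f}, harmonic mean of macro P and R = {harmonic:.3f}")
```

Here the macro F1 is lower than the harmonic mean of the macro precision and recall, because F1 is computed per class before averaging.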

Running

(1) Install Requirements

This repository requires Python 3, PyTorch, and some common Python libraries. To install the required dependencies, run:

pip install -r requirements.txt

(2) Download Dataset

This work primarily uses the MalNet Signal dataset, a variation of the MalNet Image dataset in which the binaries have been preprocessed into malware signal representations instead of images. The MalNet Signal dataset can be downloaded from Hugging Face using the Hugging Face client:

huggingface-cli download --repo-type dataset jackwilkie/malnet_signal --local-dir dataset

After downloading and extracting the MalNet Signal dataset, it should be placed in the dataset/signals directory with the following structure:

dataset/
└── signals/
    └── <family>/
        └── <type>/
            └── <sample>.npy
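Given this layout, the family and type labels of each sample can be read directly from its path. The following sketch is an assumption about how one might index the files, not the repository's own data loader. The function name index_signals is hypothetical.

```python
from pathlib import Path


def index_signals(root="dataset/signals"):
    """Return (path, family, type) for every <family>/<type>/<sample>.npy under root."""
    samples = []
    for path in sorted(Path(root).glob("*/*/*.npy")):
        family = path.parent.parent.name  # first directory level
        type_ = path.parent.name          # second directory level
        samples.append((path, family, type_))
    return samples
```

Each path in the result can be loaded with np.load to obtain the stored signal.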

(3) Train Model

1D CNNs can be trained on the MalNet Signal dataset by running train.py:

python3 train.py

Model Selection

  • Choose the CNN architecture with the --model argument
  • Available models are defined in our models dictionary
  • Default: ResNet1DV2-152D-SE

Task Granularity

  • Set the classification task with the --task argument
  • Options: binary, type, family
  • Default: type (type-level classification)

Checkpoints

  • Model weights are saved to weights/<task>.pt.tar by default
    • Example: weights/type.pt.tar
  • Customize the save path with --checkpoint_path

Pretrained Weights

  • Alternatively, you can download pretrained models from Hugging Face for binary, type, and family level classification
  • For binary and family level classification a ResNet1DV2-152D-SE model using the GELU activation function is provided
  • For type level classification a ResNet1D-152 model using the ReLU activation function is provided, as this was found to have improved performance for less compute

(4) Eval Model

Trained models can then be evaluated using eval.py:

python3 eval.py

Notes

  • Ensure that the --task, --model, and --activation arguments match the values used during training
  • The checkpoint path should be specified using the --checkpoint_path argument
    • Example: weights/type.pt.tar
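One way to keep these arguments consistent is to build both commands from a single settings dictionary. The sketch below only assembles the argument lists. It assumes that train.py accepts the same --activation flag as eval.py and that the command-line model names match those used with model_factory (e.g. resnet1d18), neither of which is confirmed above.

```python
import sys

# Shared settings: the same values feed both training and evaluation.
settings = {
    "task": "type",
    "model": "resnet1d18",  # assumed CLI spelling of the model name
    "activation": "relu",
    "checkpoint_path": "weights/type.pt.tar",
}


def build_command(script, settings):
    """Turn the settings dictionary into an argument list for `script`."""
    cmd = [sys.executable, script]
    for flag, value in settings.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


train_cmd = build_command("train.py", settings)
eval_cmd = build_command("eval.py", settings)

# To run them from the repository root:
#   import subprocess
#   subprocess.run(train_cmd, check=True)
#   subprocess.run(eval_cmd, check=True)
print(" ".join(eval_cmd[1:]))
```

Because both commands are generated from the same dictionary, a change to the task, model, or activation cannot be applied to one script and forgotten for the other.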

Output

  • After evaluation, performance metrics will be printed to the terminal.

Citation

@misc{wilkie2025signalbasedmalwareclassificationusing,
    title={Signal-Based Malware Classification Using 1D CNNs}, 
    author={Jack Wilkie and Hanan Hindy and Ivan Andonovic and Christos Tachtatzis and Robert Atkinson},
    year={2025},
    eprint={2509.06548},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url={https://arxiv.org/abs/2509.06548}, 
}
