Signal-Based Malware Classification using 1D CNNs

Official GitHub repository for the paper Signal-Based Malware Classification using 1D CNNs, which proposes 1D signal representations as an alternative to byteplot images. This repository contains a reference implementation of the data processing pipeline, the 1D CNNs, and the evaluation procedure.

Malware Signals

Malware signals are 1D representations of the bytecode of an executable that serve as an alternative to byteplot images as input to machine learning models. These signals can be statically extracted from various file formats (e.g. EXE, APK) and used to train a 1D CNN for malware classification. Using a 1D representation preserves more information from the original binary and avoids introducing spurious spatial correlations, resulting in improved downstream model performance.

[Figure: comparison of malware signals with byteplot images]
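
The spurious spatial correlation mentioned above can be illustrated with simple index arithmetic. The sketch below assumes a byteplot that lays bytes out row by row at a fixed width; the helper byteplot_offset and the width of 256 are hypothetical and are not taken from this repository.

```python
def byteplot_offset(row: int, col: int, width: int) -> int:
    """Byte offset of pixel (row, col) when bytes are laid out row by row."""
    return row * width + col


width = 256  # hypothetical image width, chosen only for illustration

# Horizontal neighbours in the image are adjacent bytes in the file.
horizontal_gap = byteplot_offset(10, 6, width) - byteplot_offset(10, 5, width)

# Vertical neighbours look adjacent to a 2D kernel but are `width` bytes apart.
vertical_gap = byteplot_offset(11, 5, width) - byteplot_offset(10, 5, width)

# The last pixel of one row and the first pixel of the next row are
# consecutive bytes, yet they sit at opposite edges of the image.
wrap_gap = byteplot_offset(11, 0, width) - byteplot_offset(10, width - 1, width)

print(horizontal_gap, vertical_gap, wrap_gap)  # 1 256 1
```

A 1D signal keeps every pair of neighbouring samples exactly one byte apart, so a 1D kernel never mixes bytes that are only adjacent because of the chosen image width.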

The Python function bytes_to_signal converts np.int8 NumPy arrays representing bytes into 1D signal representations. Example usage:

import numpy as np
from create_signal import bytes_to_signal

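# Original byte sequence (each integer represents one byte)
# Replace these with your own bytes
# Values above 0x7F (such as 0x90) do not fit in np.int8, so build the
# array as np.uint8 and reinterpret the same bits as np.int8
byte_array = np.array([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00], dtype=np.uint8).view(np.int8)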

# Desired 1D signal length
length = 65536

# Convert byte array into signal representation
# Returns an np.float32 array of shape (length,)
signal = bytes_to_signal(byte_array, length)

print(f"signal shape: {signal.shape}, dtype: {signal.dtype}")

Additionally, the script create_signal.py can be used to convert an np.int8 array to a 1D signal representation from the terminal:

python3 create_signal.py

Defaults

Uses the example binary at example_signals/example_binary.npy
Saves the signal to example_signals/example_signal.npy

Customization

Set the input binary path with --binary_path
Set the output save path with --save_path
Control the signal length with --length
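
To run the script on your own file, the raw bytes first need to be stored as an np.int8 array in .npy format. The following is a minimal sketch of one way to do that; the file names sample.bin, sample_binary.npy, and sample_signal.npy are placeholders, and reinterpreting unsigned bytes as np.int8 is an assumption about what the script expects, based only on the dtype stated above.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real executable; replace with the path to your own file.
    binary_file = Path(tmp) / "sample.bin"
    binary_file.write_bytes(bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00]))

    # Read the raw bytes as unsigned values, then reinterpret the same bits
    # as np.int8 (assumed input dtype, see the note above).
    raw = np.fromfile(binary_file, dtype=np.uint8)
    byte_array = raw.view(np.int8)

    # Save in .npy format so the file can be passed via --binary_path.
    npy_path = Path(tmp) / "sample_binary.npy"
    np.save(npy_path, byte_array)

    # Reload to confirm the dtype and shape survive the round trip.
    reloaded = np.load(npy_path)

# Command line that would then convert the saved array (paths are placeholders).
command = (
    f"python3 create_signal.py --binary_path {npy_path.name} "
    "--save_path sample_signal.npy --length 65536"
)
print(reloaded.dtype, reloaded.shape)
print(command)
```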

1D CNNs

The 1D signal representations can be used to train 1D CNNs. 2D CNN architectures developed for byteplot images can easily be adapted to operate on 1D signals by squaring the kernel size and stride of each convolution; the resulting 1D models were found to outperform their 2D counterparts. Example usage for a ResNet1D18 model:

import torch as T
from model.model_factory import model_factory

# Select model name. 
# Can be any model from our model dict in ./model/model_factory.py
model_name = 'resnet1d18'

# Select activation function
# Can be 'relu' or 'gelu'
act = 'relu'

# Select task granularity
# Can be 'binary', 'type', or 'family'
task = 'type'

# Get number of classes based on task
if task == 'binary':
    n_classes = 2
elif task == 'type':
    n_classes = 47
else: # family level classification
    n_classes = 696
        
# Build model
model = model_factory(dict(
    name=model_name,
    n_classes=n_classes,
    in_channels=1,
    act_layer=act,
))

# Get test input
# Should be (batch size, 1, length)
L = 65536 # signal length
B = 8 # example batch size
x = T.randn((B, 1, L), dtype = T.float32)

# Test forward pass
# Output will be (batch size, num classes)
z = model(x)
print(f'output size: {z.size()}')
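
The kernel-size and stride squaring described at the start of this section can be checked with a little arithmetic. The sketch below is illustrative only: ConvParams and to_1d are hypothetical helpers rather than part of this repository, and the padding rule of (kernel_size - 1) // 2 is an assumption, since only the kernel size and stride are said to be squared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvParams:
    kernel_size: int
    stride: int

    @property
    def padding(self) -> int:
        # Assumption: "same"-style padding of (kernel_size - 1) // 2
        return (self.kernel_size - 1) // 2

    def out_length(self, length: int) -> int:
        # Standard convolution output-size formula along one axis
        return (length + 2 * self.padding - self.kernel_size) // self.stride + 1


def to_1d(conv2d: ConvParams) -> ConvParams:
    """Square the per-axis kernel size and stride of a square 2D convolution."""
    return ConvParams(kernel_size=conv2d.kernel_size ** 2, stride=conv2d.stride ** 2)


conv2d = ConvParams(kernel_size=3, stride=2)  # 3x3 kernel, stride 2 on each axis
conv1d = to_1d(conv2d)                        # kernel 9, stride 4

side = 256            # 256 x 256 byteplot image
length = side * side  # 65536-sample signal holding the same number of values

outputs_2d = conv2d.out_length(side) ** 2  # output positions of the 2D layer
outputs_1d = conv1d.out_length(length)     # output positions of the 1D layer
print(conv1d, outputs_2d, outputs_1d)
```

Under these assumptions both layers see 9 input values per kernel window and produce 16384 output positions, so the 1D layer matches the 2D layer's receptive field size and downsampling factor.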

Comparison

After adapting a ResNetV2-152D model to operate on 1D signals and inserting squeeze-and-excitation layers, the resulting ResNet1DV2-152D-SE model achieves state-of-the-art performance on the MalNet dataset, with the highest F1 score at the binary, type, and family task granularities among the models compared below.

Model                 Binary                        Type                          Family
                      F1 Score  Precision  Recall   F1 Score  Precision  Recall   F1 Score  Precision  Recall
ResNet1DV2-152D-SE    .874      .907       .846     .503      .643       .453     .507      .580       .480
SHERLOCK              .854      .920       .810     .497      .628       .447     .491      .568       .461
ResNet18              .862      .893       .837     .467      .556       .424     .454      .538       .423
ResNet50              .854      .907       .814     .479      .566       .441     .468      .541       .443
DenseNet121           .864      .900       .834     .471      .558       .428     .461      .529       .438
DenseNet169           .864      .890       .841     .477      .573       .433     .462      .545       .434
MobileNetV2(x.5)      .857      .894       .827     .460      .547       .424     .451      .528       .423
MobileNetV2(x1)       .854      .889       .825     .452      .527       .419     .438      .532       .405

Running

(1) Install Requirements

This repository requires Python 3, PyTorch, and some common Python libraries. To install the required dependencies, run:

pip install -r requirements.txt

(2) Download Dataset

This work primarily uses the MalNet Signal dataset, a variation of the MalNet Image dataset in which the binaries have been preprocessed into malware signal representations instead of images. The MalNet Signal dataset can be downloaded from Hugging Face using the Hugging Face client:

huggingface-cli download --repo-type dataset jackwilkie/malnet_signal --local-dir dataset

After downloading and extracting the MalNet Signal dataset, it should be placed in the dataset/signals directory with the following structure:

dataset/
└── signals/
    └── <family>/
        └── <type>/
            └── <sample>.npy
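
As a quick check that the files landed in the expected layout, the directory tree can be indexed with pathlib. This is a minimal sketch that follows the structure shown above; index_signals is a hypothetical helper (the repository's own data loading code may work differently), and the family, type, and sample names are placeholders.

```python
import tempfile
from pathlib import Path


def index_signals(root: Path) -> list[tuple[Path, str, str]]:
    """Return (path, family, type) for every <family>/<type>/<sample>.npy under root."""
    records = []
    for path in sorted(root.glob("*/*/*.npy")):
        family, type_ = path.parts[-3], path.parts[-2]
        records.append((path, family, type_))
    return records


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset" / "signals"

    # Placeholder names; empty files stand in for real .npy samples.
    samples = [
        ("family_a", "type_x", "s1"),
        ("family_a", "type_x", "s2"),
        ("family_b", "type_y", "s3"),
    ]
    for family, type_, sample in samples:
        folder = root / family / type_
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"{sample}.npy").touch()

    records = index_signals(root)
    labels = [(family, type_) for _, family, type_ in records]

print(labels)
```

With the real dataset, pointing root at dataset/signals and counting the distinct labels is a simple way to confirm that the download is complete.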

(3) Train Model

1D CNNs can be trained on the MalNet Signal dataset by running train.py:

python3 train.py

Model Selection

Choose the CNN architecture with the --model argument
Available models are defined in the model dict in ./model/model_factory.py
Default: ResNet1DV2-152D-SE

Task Granularity

Set the classification task with the --task argument
Options: binary, type, family
Default: type (type-level classification)

Checkpoints

Model weights are saved to weights/<task>.pt.tar by default
- Example: weights/type.pt.tar
Customize the save path with --checkpoint_path

Pretrained Weights

Alternatively, pretrained models for binary, type, and family level classification can be downloaded from Hugging Face
For binary and family level classification, a ResNet1DV2-152D-SE model using the GELU activation function is provided
For type level classification, a ResNet1D-152 model using the ReLU activation function is provided, as this was found to give better performance for less compute

(4) Eval Model

Trained models can then be evaluated using eval.py:

python3 eval.py

Notes

Ensure that the --task, --model, and --activation arguments match the values used during training
The checkpoint path should be specified using the --checkpoint_path argument
- Example: weights/type.pt.tar

Output

After evaluation, performance metrics will be printed to the terminal.
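
For readers who want to reproduce F1 score, precision, and recall figures on their own predictions, the sketch below computes them in plain Python. It assumes macro-averaging over classes, which this README does not state; macro_metrics is a hypothetical helper, not the function used by eval.py, and the labels are made up.

```python
def macro_metrics(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all classes seen."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

precision, recall, f1 = macro_metrics(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.722 0.667 0.656
```

Under macro-averaging the F1 score is the mean of the per-class F1 scores, so it generally differs from the harmonic mean of the averaged precision and recall.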

Citation

@misc{wilkie2025signalbasedmalwareclassificationusing,
    title={Signal-Based Malware Classification Using 1D CNNs}, 
    author={Jack Wilkie and Hanan Hindy and Ivan Andonovic and Christos Tachtatzis and Robert Atkinson},
    year={2025},
    eprint={2509.06548},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url={https://arxiv.org/abs/2509.06548}, 
}
