Official GitHub repository for the paper Signal-Based Malware Classification Using 1D CNNs, which proposes 1D signal representations as an alternative to byteplot images. This repository provides a reference implementation of the data processing pipeline, the 1D CNNs, and the evaluation procedure.
Malware signals are 1D representations of the bytecode of an executable that serve as an alternative to byteplot images as input to machine learning models. These signals can be statically extracted from various file formats (e.g. EXE, APK) and used to train a 1D CNN for malware classification. Using a 1D representation preserves more information from the original binary and avoids introducing spurious spatial correlation, resulting in improved downstream model performance. A comparison of malware signals with byteplot images is shown below:
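To make the spurious spatial correlation concrete, the following sketch reshapes a run of file offsets into a byteplot-style 2D grid and measures how far apart neighbouring pixels are in the original file. It is not part of the repository, and the image width of 256 is an arbitrary assumption chosen for illustration.

```python
import numpy as np

# Hypothetical byteplot width; real byteplot pipelines choose the width in different ways.
WIDTH = 256

# Use file offsets instead of byte values so adjacency can be read off directly.
offsets = np.arange(WIDTH * 4)

# Byteplot-style layout: consecutive bytes fill one row, then wrap onto the next row.
byteplot = offsets.reshape(-1, WIDTH)

# Horizontally adjacent pixels are adjacent bytes in the file.
horizontal_gap = int(byteplot[0, 1] - byteplot[0, 0])

# Vertically adjacent pixels are WIDTH bytes apart, yet a 2D kernel treats them as neighbours.
vertical_gap = int(byteplot[1, 0] - byteplot[0, 0])

print(f"horizontal gap: {horizontal_gap} byte(s), vertical gap: {vertical_gap} bytes")
```

In the 1D signal every neighbouring pair of samples corresponds to neighbouring positions in the file, so a 1D kernel only ever mixes values that were close together in the binary.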
The Python function bytes_to_signal converts np.int8 NumPy arrays representing bytes into 1D signal representations. Example usage:
import numpy as np
from create_signal import bytes_to_signal
# Original byte sequence (each integer represents one byte)
# Replace these with your own bytes
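# Values above 0x7F (such as 0x90) do not fit in np.int8 directly,
# so build the array as np.uint8 and reinterpret the same bytes as np.int8
byte_array = np.array([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00], dtype=np.uint8).view(np.int8)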
# Desired 1D signal length
length = 65536
# Convert byte array into signal representation
# Returns an np.float32 array of shape (length,)
signal = bytes_to_signal(byte_array, length)
print(f"signal shape: {signal.shape}, dtype: {signal.dtype}")
Additionally, the script create_signal.py can be used to convert an np.int8 array to a 1D signal representation from the terminal:
python3 create_signal.py
- Uses the example binary at example_signals/example_binary.npy
- Saves the signal to example_signals/example_signal.npy
- Set the input binary path with --binary_path
- Set the output save path with --save_path
- Control the signal length with --length
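The script takes its input as a saved np.int8 array, so a raw executable has to be converted to that form first. The sketch below shows one way to do this with NumPy; the helper name binary_to_npy and the temporary file paths are hypothetical, and it assumes that --binary_path expects a flat np.int8 array stored as .npy, like the bundled example binary.

```python
import tempfile
from pathlib import Path

import numpy as np


def binary_to_npy(binary_path, npy_path):
    """Read a file's raw bytes as np.int8 and save them as a .npy array."""
    raw = Path(binary_path).read_bytes()
    # Reinterpret each byte as a signed 8-bit integer (0x90 becomes -112); no bytes are lost.
    byte_array = np.frombuffer(raw, dtype=np.int8)
    np.save(npy_path, byte_array)
    return byte_array


# Demonstration on a tiny stand-in file; replace with the path to a real executable.
with tempfile.TemporaryDirectory() as tmp:
    exe_path = Path(tmp) / "sample.bin"
    exe_path.write_bytes(bytes([0x4D, 0x5A, 0x90, 0x00]))
    npy_path = Path(tmp) / "sample.npy"
    arr = binary_to_npy(exe_path, npy_path)
    reloaded = np.load(npy_path)

print(arr.dtype, arr.tolist())  # int8 [77, 90, -112, 0]
```

The saved file could then be passed to create_signal.py through --binary_path.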
The 1D signal representations can be used to train 1D CNNs. 2D CNN architectures developed for byteplot images can easily be adapted to operate on the 1D signals by squaring the kernel size and stride of each convolution; the resulting 1D models were found to outperform their 2D counterparts. Example usage for a ResNet1D18 model:
import torch as T
from model.model_factory import model_factory
# Select model name.
# Can be any model from our model dict in ./model/model_factory.py
model_name = 'resnet1d18'
# Select activation function
# Can be 'relu' or 'gelu'
act = 'relu'
# Select task granularity
# Can be 'binary', 'type', or 'family'
task = 'type'
# Get number of classes based on task
if task == 'binary':
    n_classes = 2
elif task == 'type':
    n_classes = 47
else:  # family level classification
    n_classes = 696
# Build model
model = model_factory(dict(
    name = model_name,
    n_classes = n_classes,
    in_channels = 1,
    act_layer = act,
))
# Get test input
# Should be (batch size, 1, length)
L = 65536 # signal length
B = 8 # example batch size
x = T.randn((B, 1, L), dtype = T.float32)
# Test forward pass
# Output will be (batch size, num classes)
z = model(x)
print(f'output size: {z.size()}')
After adapting a ResNetV2-152D model to operate on 1D signals and inserting squeeze-and-excitation layers, the resulting ResNet1DV2-152D-SE model achieves state-of-the-art performance at the binary, type, and family task granularities on the MalNet dataset.
| Model | Binary F1 Score | Binary Precision | Binary Recall | Type F1 Score | Type Precision | Type Recall | Family F1 Score | Family Precision | Family Recall |
|---|---|---|---|---|---|---|---|---|---|
| ResNet1DV2-152D-SE | .874 | .907 | .846 | .503 | .643 | .453 | .507 | .580 | .480 |
| SHERLOCK | .854 | .920 | .810 | .497 | .628 | .447 | .491 | .568 | .461 |
| ResNet18 | .862 | .893 | .837 | .467 | .556 | .424 | .454 | .538 | .423 |
| ResNet50 | .854 | .907 | .814 | .479 | .566 | .441 | .468 | .541 | .443 |
| DenseNet121 | .864 | .900 | .834 | .471 | .558 | .428 | .461 | .529 | .438 |
| DenseNet169 | .864 | .890 | .841 | .477 | .573 | .433 | .462 | .545 | .434 |
| MobileNetV2(x.5) | .857 | .894 | .827 | .460 | .547 | .424 | .451 | .528 | .423 |
| MobileNetV2(x1) | .854 | .889 | .825 | .452 | .527 | .419 | .438 | .532 | .405 |
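The 2D-to-1D adaptation described earlier (squaring the kernel size and stride) can be sketched in a few lines. The helper names below are hypothetical and the padding rule (half the 1D kernel, rounded down) is an assumption rather than the repository's confirmed choice; the example checks that a 3x3, stride-2 convolution on a 256x256 byteplot and its 1D counterpart on a length-65536 signal produce the same number of output positions.

```python
def to_1d_conv_params(kernel_size, stride):
    """Square a 2D convolution's kernel size and stride to get 1D equivalents."""
    k1d = kernel_size ** 2
    s1d = stride ** 2
    # Assumed padding rule: half the kernel, rounded down, mirroring the common 2D choice.
    return {"kernel_size": k1d, "stride": s1d, "padding": k1d // 2}


def conv_output_length(length, kernel_size, stride, padding):
    """Standard output-size formula for a convolution without dilation."""
    return (length + 2 * padding - kernel_size) // stride + 1


# 2D case: 3x3 kernel, stride 2, padding 1 on a 256x256 byteplot (65536 pixels).
side = conv_output_length(256, kernel_size=3, stride=2, padding=1)
positions_2d = side * side

# 1D case: the squared parameters applied to a length-65536 signal.
params = to_1d_conv_params(kernel_size=3, stride=2)
positions_1d = conv_output_length(65536, **params)

print(params)                      # {'kernel_size': 9, 'stride': 4, 'padding': 4}
print(positions_2d, positions_1d)  # 16384 16384
```

Squaring both parameters keeps the receptive field (9 input values) and the downsampling factor (4x fewer positions) the same as in the 2D layer. The resulting dictionary could be unpacked into torch.nn.Conv1d alongside the channel counts.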
This repository requires Python 3, PyTorch, and some common Python libraries. To install the required dependencies, run:
pip install -r requirements.txt
This work primarily uses the MalNet Signal dataset, a variation of the MalNet Image dataset in which the binaries have been preprocessed into malware signal representations instead of images. The MalNet Signal dataset can be downloaded from Hugging Face using the Hugging Face client:
huggingface-cli download --repo-type dataset jackwilkie/malnet_signal --local-dir dataset
After downloading and extracting the MalNet Signal dataset, it should be placed in the dataset/signals directory with the following structure:
dataset/
└── signals/
    └── <family>/
        └── <type>/
            └── <sample>.npy
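Given the directory layout above, the family and type labels of each sample can be read directly from its path. The following is a minimal sketch using only the standard library; the function names index_signals and build_label_map are hypothetical, and the repository's own data loader may assign labels differently. The binary task is left out because it depends on which type is treated as benign.

```python
from pathlib import Path


def index_signals(root="dataset/signals"):
    """Return (path, family, type) tuples for every .npy file under root."""
    samples = []
    for path in sorted(Path(root).glob("*/*/*.npy")):
        # Layout assumed from the tree above: <root>/<family>/<type>/<sample>.npy
        family, type_name = path.parts[-3], path.parts[-2]
        samples.append((path, family, type_name))
    return samples


def build_label_map(samples, task="type"):
    """Map each family or type name to an integer class index (alphabetical order)."""
    if task not in ("type", "family"):
        raise ValueError("this sketch only covers the 'type' and 'family' tasks")
    column = 1 if task == "family" else 2
    names = sorted({sample[column] for sample in samples})
    return {name: index for index, name in enumerate(names)}


if __name__ == "__main__":
    found = index_signals()
    print(f"found {len(found)} samples")
```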
1D CNNs can be trained on the MalNet Signal dataset by running train.py:
python3 train.py
- Choose the CNN architecture with the --model argument
  - Available models are defined in our models dictionary
  - Default: ResNet1DV2-152D-SE
- Set the classification task with the --task argument
  - Options: binary, type, family
  - Default: type (type-level classification)
- Model weights are saved to weights/<task>.pt.tar by default
  - Example: weights/type.pt.tar
  - Customize the save path with --checkpoint_path
- Alternatively, you can download pretrained models from Hugging Face for binary, type, and family level classification
  - For binary and family level classification, a ResNet1DV2-152D-SE model using the GELU activation function is provided
  - For type level classification, a ResNet1D-152 model using the ReLU activation function is provided, as this was found to have improved performance for less compute
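As an illustration of how the options above combine, the sketch below assembles a train.py command line for a given task, falling back to the documented default checkpoint location. The helper train_command is hypothetical and not part of the repository; it only uses the --task and --checkpoint_path flags listed above and leaves the model at its default.

```python
import shlex

# Task names accepted by --task, as listed above.
TASKS = ("binary", "type", "family")


def train_command(task="type", checkpoint_path=None):
    """Build the argument list for a train.py run on the given task."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if checkpoint_path is None:
        # Documented default save location: weights/<task>.pt.tar
        checkpoint_path = f"weights/{task}.pt.tar"
    return ["python3", "train.py", "--task", task, "--checkpoint_path", checkpoint_path]


command = train_command("family")
print(shlex.join(command))
# python3 train.py --task family --checkpoint_path weights/family.pt.tar
```

The resulting list could be passed to subprocess.run to launch training from another script.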
Trained models can then be evaluated using eval.py:
python3 eval.py
- Ensure that the --task, --model, and --activation arguments match the values used during training
- The checkpoint path should be specified using the --checkpoint_path argument
  - Example: weights/type.pt.tar
- After evaluation, performance metrics will be printed to the terminal.
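For readers who want to reproduce F1, precision, and recall figures on their own predictions, the following NumPy sketch computes macro-averaged versions of the three metrics. It assumes macro averaging (an unweighted mean over classes); whether eval.py uses exactly this averaging is not stated here, and the function name macro_metrics is hypothetical.

```python
import numpy as np


def macro_metrics(y_true, y_pred, n_classes):
    """Return macro-averaged F1, precision, and recall for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        # Classes with no predictions or no true samples score 0 in this sketch.
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "f1": float(np.mean(f1s)),
        "precision": float(np.mean(precisions)),
        "recall": float(np.mean(recalls)),
    }


metrics = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3)
print({name: round(value, 3) for name, value in metrics.items()})
# {'f1': 0.656, 'precision': 0.722, 'recall': 0.667}
```

Because macro F1 is the mean of per-class F1 scores, it generally differs from the harmonic mean of the macro precision and macro recall.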
@misc{wilkie2025signalbasedmalwareclassificationusing,
title={Signal-Based Malware Classification Using 1D CNNs},
author={Jack Wilkie and Hanan Hindy and Ivan Andonovic and Christos Tachtatzis and Robert Atkinson},
year={2025},
eprint={2509.06548},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2509.06548},
}


