In this repository, we host all the data and code related to our paper titled "Replacing Training with Reasoning: Reinterpreting Classic ML Pipelines with LLMs".
Large Language Models (LLMs) are increasingly used in software engineering tasks due to their strong performance across diverse applications. In this paper, we ask a fundamental and novel question: To what extent can LLMs replace traditional machine learning pipelines that rely on labeled data, feature engineering, and retraining? Our intuition is that many long-standing approaches in software engineering can be reimagined through the lens of reasoning rather than training. Unlike conventional pipelines that learn statistical patterns from data, LLMs can directly reason about contextual consistency using their pretrained knowledge. To illustrate this idea, we revisit a well-known anomaly detection pipeline (CHABADA for Android apps) and show how its clustering and retraining stages can be replaced with a simple prompting strategy. The result is a streamlined, zero-shot workflow that leverages semantic reasoning without labeled datasets, feature extraction, or retraining. Our goal is not to propose a new tool, but to highlight a broader paradigm: LLMs open the door to reinterpreting established ML-based workflows as reasoning pipelines. This perspective suggests a path toward lighter-weight, training-free alternatives for many specialized software engineering tasks.
The repository is organized into two main directories:
- 📁 **0_Data**: This directory contains all the data needed to run our experiments.
- 📁 **1_Code**: Contains all the code related to our approach. The code is provided as multiple Jupyter Notebooks to facilitate execution.
To launch the Jupyter Notebooks, you will need several libraries. We provide a `requirements.txt` file that you can use to set up a conda environment.
Follow the steps below:

- Create a conda environment named `demoEnv`: `conda create --name demoEnv python=3.8`
- Activate the newly created environment: `conda activate demoEnv`
- Install the required packages using `pip` and `requirements.txt`: `pip install -r requirements.txt`

Once these steps are complete, your environment will be set up with all the necessary libraries.
To decompile APKs, ApkTool must be installed on your system. Follow the steps below to set it up:
- **Download ApkTool:** Visit the official ApkTool page at https://ibotpeaches.github.io/Apktool/ and download the latest version.
- **Install ApkTool:** Follow the installation instructions for your operating system, which typically involve:
  - Placing the downloaded JAR file in a suitable directory.
  - Adding the ApkTool executable to your system's PATH for easier access.
- **Verify Installation:** Ensure ApkTool is installed correctly by running `apktool` in your terminal. This should display the ApkTool usage instructions if the installation was successful.
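If you prefer to check the installation from Python before running the notebooks, a minimal sketch is shown below. It only assumes that the executable is invoked by the name `apktool` on your PATH; the helper name is our own.

```python
import shutil

def apktool_available():
    """Return True if an `apktool` executable can be found on PATH."""
    return shutil.which("apktool") is not None

# Usage: guard notebook cells that decompile APKs.
if not apktool_available():
    print("ApkTool not found on PATH; install it before running the notebooks.")
```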
To execute the entire code, two API keys are required: one for AndroZoo and one for the OpenAI API. These keys should be set in an environment file named `.env`, placed in the main folder of the provided repository.
The keys should be named `ANDROZOO_API_KEY` and `OPENAI_API_KEY`.
- **ANDROZOO_API_KEY**: This key is necessary to download apps from the AndroZoo repository, as various operations on the APK files are performed "on-the-fly," such as app download, extraction, and deletion. It can be requested here: https://androzoo.uni.lu/access
- **OPENAI_API_KEY**: This key is required to utilize the Embedding models from OpenAI through their official API (https://platform.openai.com/overview).

💸 **Note:** Please be aware that using OpenAI's models may incur costs depending on the volume and type of API usage. Refer to OpenAI's pricing page for details.
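For illustration, a `.env` file with the two keys can be loaded with the `python-dotenv` package, or with a minimal hand-rolled loader such as the sketch below. The loader and the placeholder key values are our own illustration, not code from the notebooks.

```python
import os

# Expected .env contents (placeholder values, not real keys):
#   ANDROZOO_API_KEY=your-androzoo-key
#   OPENAI_API_KEY=your-openai-key

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv(): reads KEY=VALUE
    lines into os.environ, skipping blank lines and '#' comments."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```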
The provided Jupyter Notebooks facilitate the execution of our approach. The notebooks should be executed in the order listed below to ensure that data is processed correctly and inter-notebook dependencies are met.
1. **1_SensitiveAPIsExtraction.ipynb**
   Run this notebook first to extract all sensitive APIs invoked by each app using ApkTool and Androguard, based on a permission-to-method mapping from prior work.
2. **2_AnalysisWithLLM.ipynb**
   This notebook performs anomaly detection by prompting an LLM to assess whether each sensitive API call aligns with the app's inferred functionality, producing context-mismatch scores from 1 (benign) to 5 (suspicious).
3. **3_AblationStudy.ipynb**
   This notebook repeats the detection process without generating an intermediate summary of app functionalities, directly evaluating APIs against the raw app description to study the impact of summarization.
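As an illustration of the scoring step in `2_AnalysisWithLLM.ipynb`, the sketch below assembles a context-mismatch prompt for one sensitive API call and parses the model's numeric reply. The prompt wording, the `build_prompt`/`parse_score` helpers, and the example app are our own illustration, not the exact prompt used in the notebook.

```python
def build_prompt(app_summary, api_call):
    """Assemble a context-mismatch prompt for one sensitive API call.

    The wording here is illustrative; the notebook defines its own prompt.
    """
    return (
        "You are auditing an Android app.\n"
        f"App functionality summary: {app_summary}\n"
        f"Sensitive API call: {api_call}\n"
        "On a scale from 1 (clearly consistent with the functionality) to "
        "5 (highly suspicious), how well does this API call fit the app? "
        "Answer with a single digit."
    )

def parse_score(reply):
    """Extract the first 1-5 digit from the model's reply; None if absent."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    return None

prompt = build_prompt(
    "A flashlight app that toggles the camera LED.",
    "android.telephony.SmsManager.sendTextMessage",
)
# A reply such as "5 - sending SMS is unrelated to a flashlight" parses to 5.
```

The prompt string would be sent to the OpenAI API; only the parsing of the returned score is exercised here, so the sketch runs without an API key.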