This repository provides the setup and results for generating OpenJML code contracts for Java source code by fine-tuning CodeT5 and CodeT5+ transformer models and applying the resulting models.
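As an illustration of what such a contract looks like, consider the following example (this snippet is ours and not taken from our dataset; the Account class is hypothetical): an OpenJML code contract consists of JML annotation comments placed above a Java method.

public class Account {
    // spec_public makes the private field visible to the public method specification
    private /*@ spec_public @*/ int balance;

    //@ requires amount > 0;
    //@ ensures balance == \old(balance) + amount;
    //@ ensures \result == balance;
    public int deposit(int amount) {
        balance += amount;
        return balance;
    }
}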
It contains all artifacts that we used to generate code contracts with the CodeT5 models and to evaluate their capabilities. Our code contract generation setup involved training the AI models and applying them, both of which we conducted with Python scripts. Furthermore, we analyzed the Java source code used in the experiment and the types of OpenJML compilation errors; both methods, together with their results, are provided as well. Concretely, this repository contains:
- Scripts: We include the following Python scripts, which we used for training the models and for adding the OpenJML code contracts to Java methods (a typical invocation sequence follows this list):
- 1_dataset.py collects the dataset from Sourcegraph
- 2_training.py performs the fine-tuning on the CodeT5 or CodeT5+ model (as specified)
- 3_application.py applies a fine-tuned model to the methods in a given Java project to generate code contracts
- 4_quick-gen.py serves as a quick validation tool: you can enter a method body and have the fine-tuned model generate the corresponding contract
- requirements.txt lists the dependencies for executing the above Python scripts
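A typical end-to-end run could look as follows. This is only a sketch: the concrete arguments are placeholders, so consult each script's -h output for the actual parameters.

$ python 1_dataset.py -h        # inspect the available arguments first
$ python 1_dataset.py ...       # collect and preprocess the dataset
$ python 2_training.py ...      # fine-tune CodeT5 or CodeT5+ on the dataset
$ python 3_application.py ...   # add contracts to a Java project
$ python 4_quick-gen.py ...     # optionally spot-check single methods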
Furthermore, we performed automated analyses of the studied source code classes and the types of compilation errors, which involved the following Java classes (an example invocation follows this list):
- JavaStatisticsScanner computes the number of contained annotations, LOC, etc. of the Java files in a given directory
- CompilationAnalysisScanner.java counts the number of OpenJML compilation errors in a given file
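Assuming each scanner is a self-contained Java class whose main method takes the target path as its first argument (an assumption; check the class documentation for the exact interface), the scanners can be compiled and run like this:

$ javac JavaStatisticsScanner.java
$ java JavaStatisticsScanner path/to/java/sources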
Finally, we add and link (https://doi.org/10.5281/zenodo.13351003) the results of our experiment:
- Results from our study, including
  - the gathered dataset and the fine-tuned models, with and without the weka project
  - the results of compiling the generated contracts with OpenJML
  - the results of analyzing the logical validity
To use the Python scripts, you need to have the following installed (see the version checks after this list):
- a recent version of Python 3
- srcML
- Sourcegraph CLI
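You can quickly verify that all three prerequisites are available on your PATH:

$ python --version
$ srcml --version
$ src version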
To use Sourcegraph, you need to create an access token by following the instructions here: https://docs.sourcegraph.com/cli/how-tos/creating_an_access_token
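The Sourcegraph CLI reads its endpoint and token from environment variables, so export them before running 1_dataset.py (shown here for a Unix shell; the token value is a placeholder):

$ export SRC_ENDPOINT=https://sourcegraph.com
$ export SRC_ACCESS_TOKEN=<your-access-token>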
- Clone the contents of this package to your local machine
- Create a Python virtual environment inside the package directory. This is usually done as follows, but might be slightly different depending on your OS or Python distribution:
  $ python -m venv .venv
- Activate your new virtual environment by executing one of the activation scripts (activate, activate.bat, or Activate.ps1) in the folder .venv/Scripts/ (Windows) or .venv/bin/ (Linux/macOS)
- Use pip to install the required Python packages for the scripts like this:
  $ pip install -r requirements.txt
You are now ready to use our scripts.
All scripts are well documented to explain their purpose and usage. The Python scripts include a "man page" that can be displayed using the -h argument;
e.g., python 1_dataset.py -h prints the description of the script, including all optional and required arguments.
For this reason we only provide a short summary of the main scripts' functionality here.
1_dataset.py: This script collects the Java methods with JML annotations from the Sourcegraph code search engine. It takes care of all the preprocessing needed to use the collected dataset for fine-tuning with 2_training.py.
2_training.py: This script is used to fine-tune CodeT5 or CodeT5+ using the dataset created by 1_dataset.py. A CUDA-compatible GPU is highly recommended to speed up this script.
3_application.py: This script is used to apply the fine-tuned model to a Java project. It will create a copy of the project with JML-annotated copies of the Java source files. A CUDA-compatible GPU is highly recommended to speed up this script.
Our results can be found in the Zenodo replication package (https://doi.org/10.5281/zenodo.13351003) that contains the following artifacts:
- sourcegraph-results.tar: the results of the Sourcegraph search queries
- dataset.tar: the dataset including the weka project, which contributes two-thirds of the contracts
- dataset-withoutweka.tar: the dataset without weka, which is significantly smaller and was used to examine the performance bias when training and testing without weka
- codet5-contracts.tar: the best-performing CodeT5 model, which was fine-tuned to create OpenJML annotations for methods
- codet5p-contracts.tar: the best-performing CodeT5+ model, which was fine-tuned to create OpenJML annotations for methods
- codet5p-contracts-withoutweka.tar: the CodeT5+ model which was trained without weka on the same task
- analysis-results.tar/compilability-analysis: the results of the compilability analysis
  - the subjects to which we applied the best-performing CodeT5+ model
  - the compilation results and their analysis
- analysis-results.tar/logical-analysis: the results of the logical analysis
  - the analysis of the logical validity of SimpleStack and SimpleTicTacToe
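The artifacts are plain tar archives and can be unpacked with the standard tar tool, for example:

$ tar -xf dataset.tar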