Type migration with large language models for code. Migrates JavaScript to TypeScript by predicting type annotations and generating type definitions.
This is the code repository for Chapter 5 of the dissertation Predicting TypeScript Type Annotations and Definitions With Machine Learning.
The training dataset is on Hugging Face. Parts of the code may refer to it as ts-training-get4. This is a preprocessed version of ts-training, revision v1.1p1.
The final StenoType model is on Hugging Face. You will need to accept the agreement to access the model. The code and results may refer to this model as stenotype-7b-a6d445d-ckpt1000, as it was fine-tuned based on commit a6d445d.
There are two evaluation datasets: stenotype-eval-ts (also called stenotype-eval-dataset-subset in the code and TS-Sourced in the dissertation) and stenotype-eval-js (also called typeweaver-bundle-filtered-subset in the code and JS-Sourced in the dissertation). To type check the stenotype-eval-js dataset, you will also need to download the tarball from Hugging Face.
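Because each evaluation dataset goes by three names, a small lookup table can help when reading the code, the results, and the dissertation side by side. The following is a minimal sketch and not part of the repository: `DATASET_ALIASES` and `canonical_name` are hypothetical names, and the aliases are taken only from the paragraph above.

```python
# Hypothetical helper, not part of the StenoType repository.
# Maps every alias mentioned above to the first name given for each dataset.
DATASET_ALIASES = {
    "stenotype-eval-ts": "stenotype-eval-ts",
    "stenotype-eval-dataset-subset": "stenotype-eval-ts",  # name used in the code
    "TS-Sourced": "stenotype-eval-ts",                     # name used in the dissertation
    "stenotype-eval-js": "stenotype-eval-js",
    "typeweaver-bundle-filtered-subset": "stenotype-eval-js",  # name used in the code
    "JS-Sourced": "stenotype-eval-js",                         # name used in the dissertation
}


def canonical_name(alias: str) -> str:
    """Return the canonical dataset name for any known alias."""
    try:
        return DATASET_ALIASES[alias]
    except KeyError:
        raise ValueError(f"unknown dataset alias: {alias!r}") from None
```

For example, `canonical_name("JS-Sourced")` and `canonical_name("typeweaver-bundle-filtered-subset")` both resolve to `stenotype-eval-js`.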
Figures and result summaries are in the results/ directory. Full results are on Hugging Face.
- Clone the repository:
git clone [email protected]:nuprl/StenoType.git
cd StenoType
git submodule update --init --recursive
- Follow the instructions to set up miniconda.
- Create a conda environment with Python 3.11 and install dependencies:
conda create -n gpu python=3.11
conda activate gpu
pip install -r requirements.txt
conda install -c conda-forge nodejs=20.8.1
npm install -g --no-save [email protected]
- Download the StarCoderBase-7b and StenoType models:
a. Ensure that you have a Hugging Face account.
b. Accept the agreements for StarCoderBase-7b and StenoType.
c. On the command line, log into Hugging Face with huggingface-cli login.
d. In a directory of your choosing, e.g. ../models, run git clone [email protected]:bigcode/starcoderbase-7b and git clone [email protected]:nuprl/stenotype.
e. To save space, you can delete the .git directory (and possibly pytorch_model*.bin if model*.safetensors already exists).
- Accept the agreement for the ts-eval evaluation dataset.
- Now you can run the experiments:
# See what configurations can be run
python src/main.py --show_configs
# To run inference on config 0 (this is very slow):
python src/main.py --infer 0
# To evaluate (this is CPU-bound):
python src/main.py --evaluate
# To generate dataset-level summaries (this is pretty fast):
python src/main.py --summarize
- To browse the results, you can use the viewer. Type "help" for help.
python src/viewer.py --dataset path/to/results/dataset.parquet
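The inference, evaluation, and summary steps above can be chained from a small driver script. This is a sketch under stated assumptions, not part of the repository: `build_commands` and `run_pipeline` are hypothetical names, the script is assumed to run from the repository root, and it assumes the three steps are meant to run in this order for a single configuration. Only the `--infer`, `--evaluate`, and `--summarize` flags come from the instructions above.

```python
import subprocess
import sys


def build_commands(config_index: int) -> list[list[str]]:
    """Build the inference, evaluation, and summary commands in order."""
    main = [sys.executable, "src/main.py"]
    return [
        main + ["--infer", str(config_index)],  # very slow
        main + ["--evaluate"],                  # CPU-bound
        main + ["--summarize"],                 # fairly fast
    ]


def run_pipeline(config_index: int, dry_run: bool = True) -> None:
    """Print each command, and execute it unless dry_run is set."""
    for cmd in build_commands(config_index):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(0)` only prints the three commands; pass `dry_run=False` to execute them. The dry-run default is a deliberate choice here, since inference is slow and an accidental run is costly.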
- git
- Python 3
Using Conda or virtual environments is recommended.
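Before starting the setup, it can help to confirm that the prerequisites are available. The sketch below is an illustration only: `missing_tools` and `python_is_recent` are hypothetical helper names, and the default minimum version of 3.11 is taken from the conda environment created in the setup steps.

```python
import shutil
import sys


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_recent(minimum: tuple[int, int] = (3, 11)) -> bool:
    """Check the running interpreter against a minimum (major, minor) version."""
    return sys.version_info[:2] >= minimum
```

For example, `missing_tools(["git", "conda", "node", "npm"])` lists whichever of those commands still need to be installed, and `python_is_recent()` reports whether the active interpreter is at least Python 3.11.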