Instruct #47

Closed
wants to merge 122 commits into from
Changes from all commits
122 commits
e265bf6
encodings scripts
sordonia Mar 21, 2023
f4813c3
faster encoding
sordonia Mar 21, 2023
6eefc1c
small model for debugging
pclucas14 Mar 22, 2023
882f6fd
cleaner version of clustering/encoding
sordonia Mar 22, 2023
0d0071b
Merge branch 'as/cluster-tuning' of github.com:microsoft/mttl into as…
pclucas14 Mar 22, 2023
543b277
fix tiny bugs
pclucas14 Mar 22, 2023
7df5c23
num workers
pclucas14 Mar 22, 2023
8342a1e
nit
sordonia Mar 22, 2023
07fd9d2
Merge branch 'as/cluster-tuning' of github.com:microsoft/mttl into as…
sordonia Mar 22, 2023
d3e62c9
integration instruction hashes
sordonia Mar 22, 2023
ae94099
pass instruction hash to valid t0 data
sordonia Mar 22, 2023
b4c1dbf
online zero-shot eval for t0
sordonia Mar 22, 2023
5790e33
option for t0 online zs eval
sordonia Mar 22, 2023
4a6ec48
input not inputs
sordonia Mar 23, 2023
ddfc57a
use norm by default
sordonia Mar 23, 2023
6ea7301
return input_text for finetune datasets, to fix encodings
sordonia Mar 23, 2023
b8a074b
regex fix
pclucas14 Mar 23, 2023
875ad89
verbose
pclucas14 Mar 24, 2023
e96299c
import fix
pclucas14 Mar 24, 2023
3564e92
poly lora defaults
pclucas14 Mar 24, 2023
2bf0820
online eval fix
sordonia Mar 24, 2023
1b91c71
potentially load model from cache
pclucas14 Mar 24, 2023
08da1d6
fix knn path bug
sordonia Mar 25, 2023
8f13642
fix error input type not propagated to knn clusters
sordonia Mar 25, 2023
ebfaa49
continue fixing instruction input_type for knn
sordonia Mar 25, 2023
7594de5
full 0-shot online eval
sordonia Mar 25, 2023
8d65ee0
Merge branch 'as/cluster-tuning' of github.com:microsoft/mttl into as…
pclucas14 Mar 26, 2023
2791830
load poly granularity
pclucas14 Mar 26, 2023
8ae6c2c
filtering by task name
sordonia Mar 26, 2023
56efef2
clean adapters a bit
sordonia Mar 28, 2023
234442c
Merge branch 'as/cluster-tuning' of github.com:microsoft/mttl into as…
pclucas14 Mar 28, 2023
e194a8a
clean instruction creation for ni
sordonia Mar 29, 2023
39a71b5
Merge branch 'as/cluster-tuning' of github.com:microsoft/mttl into as…
pclucas14 Mar 30, 2023
8e4c792
do not tokenize NI during example creation
sordonia Apr 1, 2023
ecf36f4
online zero shot for NI + refresh configs
sordonia Apr 3, 2023
e8818d2
Merge branch 'as/cluster-tuning' of github.com:microsoft/mttl into as…
pclucas14 Apr 3, 2023
efc93f7
lm adapt version of t5l
pclucas14 Apr 3, 2023
680b678
t5l lm adapt
pclucas14 Apr 3, 2023
81ff82e
try and cache model weights
pclucas14 Apr 3, 2023
74b2805
alpaca first commit
pclucas14 Apr 3, 2023
ae4c337
alpaca random val split
pclucas14 Apr 4, 2023
be18e7e
round 2 pr
pclucas14 Apr 4, 2023
ce09309
new branch containing llama adapter, code for training llamas and alp…
oleksost May 17, 2023
7a54f57
moved things around, nvm
oleksost May 24, 2023
a9569c1
using torch AdamW as in tloen
oleksost May 24, 2023
bc9824e
tloen does not seem to use eos
oleksost May 24, 2023
7249b8a
debug alpaca-lora
May 26, 2023
469dd1c
no training on instructionsof self-instruct
oleksost May 27, 2023
2c0d045
debug
May 27, 2023
e74062e
no train on source
oleksost May 27, 2023
94efda4
no train on source
oleksost May 27, 2023
977370a
no train on source
oleksost May 27, 2023
2983d6e
add eos
oleksost May 27, 2023
5ad1ecf
prune_unused_loras flag
oleksost May 29, 2023
f161f87
debug on the remote_machine
tangzhy May 30, 2023
3cd81d1
Merge branch 'instr_composition' of https://github.com/microsoft/mttl…
tangzhy May 30, 2023
71f3903
train_on_inputs option
oleksost May 31, 2023
e0db4be
cluster at different depth
oleksost Jun 2, 2023
745a373
cluster at different depth
oleksost Jun 2, 2023
5f6ccec
debug
tangzhy Jun 5, 2023
36af1f6
added longform dataset
oleksost Jun 6, 2023
60b129b
Merge branch 'instr_composition' of https://github.com/microsoft/mttl…
tangzhy Jun 7, 2023
59460f8
vm removal backup
oleksost Jun 8, 2023
df1891b
add gen
Jun 15, 2023
1a200ea
add config
Jun 15, 2023
17fb799
debug cluster on A100 succeed.
Jun 16, 2023
a36086b
debug cluster on A100 succeed.
Jun 16, 2023
3482a77
debug cluster on A100 succeed.
Jun 16, 2023
2e144c4
debug cluster on A100 succeed.
Jun 16, 2023
e536ab5
update config
Jun 17, 2023
af4b872
remove unused adapters
Jun 20, 2023
101592f
debug
Jun 21, 2023
5fc6c7c
debug
Jun 23, 2023
70fc993
merge
Jun 27, 2023
b1a307e
merge from instruct_composition
Jun 27, 2023
ade1f11
update ni predictions
Jun 27, 2023
36d8e7c
update dir
Jun 27, 2023
b59ec90
remote absolute path, make it more flexible
Jul 3, 2023
1d0bcc7
random split the code
Jul 4, 2023
21156f3
use full dataset
Jul 7, 2023
960839e
add poly-mu for alpaca
Jul 8, 2023
c369ffd
add config
Jul 8, 2023
af94d36
add poly for alpaca
Jul 9, 2023
4b1d36e
add wizard
Jul 9, 2023
139db36
add poly_mu
Jul 10, 2023
d2359f0
update cluster_alpaca, add json config
Jul 10, 2023
0c92d97
debuge mttl/config.py
Jul 10, 2023
81257f3
add enhanced alpaca
Jul 11, 2023
1aec37e
compare with/without glance
Jul 11, 2023
89abea6
update config file
Jul 11, 2023
ede53ec
change poly decache, add module logits to the poly_mu
Jul 16, 2023
5fcdbae
debug
Jul 16, 2023
649520c
debug cluster, make it uniform by 1/8
Jul 18, 2023
59c5961
debug
Jul 19, 2023
b5c9753
Merge branch 'instruct' of https://github.com/microsoft/mttl into ins…
Jul 19, 2023
d5030d2
added flan_v2 subset, wizzard, dolly
Jul 24, 2023
3bdda9f
configs
Jul 24, 2023
d471183
add MMLU evaluation
Jul 27, 2023
51ef184
Merge branch 'instruct' of https://github.com/microsoft/mttl into ins…
Jul 27, 2023
191491d
update readme file for instruction following superni and mmlu
Jul 31, 2023
6847d76
change the rank=4 lora config
Jul 31, 2023
4444adf
add BBH evaluation
Jul 31, 2023
f861c55
integarate the camels evaluation to the MTTL
Jul 31, 2023
acfcc20
add all benchmark from camels
Aug 1, 2023
81b4907
update config for the clustering routing
Aug 2, 2023
11dbedc
merge with inst_composition
Aug 2, 2023
ad82030
debug update merge with main
Aug 4, 2023
e926795
merge with main debug succeed
Aug 8, 2023
8634705
add tlora for llama
Aug 9, 2023
fcdf826
wip: learned router
Aug 9, 2023
5aa94d0
add poly with tlora
Aug 9, 2023
6ab85a8
(wip) learnable router
Aug 9, 2023
97d2d81
configs
Aug 11, 2023
1a1b5e4
merge with instr_decomposition
Aug 11, 2023
15dcb5d
Merge branch 'instr_composition' into instruct
Aug 11, 2023
d936017
cammels datasets
Aug 11, 2023
dfbcfdb
reacking entropy + routing adaptation parameters
Aug 13, 2023
5ca1b92
merge with instr_composition
Aug 16, 2023
2b7e22c
loose the datasets lib version
Aug 16, 2023
de9991c
debug upgrade it to pytorch_lightning2.0
Aug 16, 2023
2362156
debug requirements
Aug 16, 2023
d0474f8
add peft
Aug 16, 2023
7 changes: 7 additions & 0 deletions .gitignore
@@ -143,3 +143,10 @@ dmypy.json

# Pyre type checker
.pyre/

inst_follow/sandbox/
inst_follow/sandbox/*
*env_vars.sh

models_gcr/*
*[ignore]*
47 changes: 47 additions & 0 deletions README.md
@@ -54,6 +54,53 @@ You can check [scripts/pretrain](scripts/pretrain) for examples.
### Test Fine-Tuning

To perform finetuning for a test task, use the script `pl_finetune.py`
## Instruction Following Zero-Shot Evaluation

### Pretrain the model on the 52K Alpaca instructions

```
python -m finetune_llama -c llama/finetune_atlas_cluster_by_inst_16_my.json
```


### Zero-shot on SuperNI:

1. Clone the SuperNI tasks into the parent directory:

```
git clone https://github.com/allenai/natural-instructions.git ../
```

2. Preprocess the tasks as described in the SuperNI repo:

```
python reorder_instances_for_testing.py
```

3. Generate the model's predictions on the SuperNI tasks:

```
python gen_ni_predictions_my.py --out_prefix=alpaca_dense --model_path=llama_alpaca_finetune/loss=0.3825.ckpt
```

Here, `loss=0.3825.ckpt` is our pretrained model checkpoint.

### Zero-shot on MMLU:

1. Copy the processed MMLU data from the repo: https://github.com/allenai/open-instruct

2. Evaluate LLaMA-7B on MMLU:

```
python -m inst_follow.eval.mmlu.run_eval \
--ntrain 0 \
--data_dir data/eval/mmlu \
--save_dir results/mmlu/alpaca-lora-7B-0shot/ \
--eval_batch_size 2 \
--model_name_or_path tloen/alpaca-lora-7b \
--tokenizer_name_or_path yahma/llama-7b-hf
```
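
For intuition, 0-shot multiple-choice evaluation of this kind typically ranks the four options by the model's likelihood of the answer letter. The sketch below only illustrates that idea; it is not the `run_eval` implementation, and the helper name and model ids are assumptions:

```
# Illustrative sketch: pick an MMLU answer by comparing the model's
# next-token logits for the option letters "A"-"D".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_choice(model, tokenizer, question, choices):
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    ) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # token ids for " A", " B", " C", " D" (take the last piece in case of multi-token encodings)
    option_ids = [tokenizer.encode(f" {l}", add_special_tokens=False)[-1] for l in "ABCD"]
    return "ABCD"[int(torch.argmax(next_token_logits[option_ids]))]

# Usage (model id assumed for illustration):
# tokenizer = AutoTokenizer.from_pretrained("yahma/llama-7b-hf")
# model = AutoModelForCausalLM.from_pretrained("yahma/llama-7b-hf", torch_dtype=torch.float16).cuda()
# pick_choice(model, tokenizer, "What is 2 + 2?", ["3", "4", "5", "6"])
```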


### Hyper-parameter Search for Test Fine-Tuning

307 changes: 307 additions & 0 deletions cluster_alpaca.py
@@ -0,0 +1,307 @@
from typing import List
import pickle
import tqdm
import os
import time
import torch
from mttl.config import parse_config
from nomic import atlas, AtlasProject
import numpy as np
import argparse
import openai
from transformers import AutoTokenizer, AutoModel
from mttl.datamodule.alpaca_data_module import AlpacaDataModule, AlpacaDataset
from mttl.cluster_tuning.encodings import ClusterInfos
from finetune_llama import parse_config

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

openai.api_key_path = "api.openai"


def store_embeddings(embeddings, args):
global embeddings_path
with open(embeddings_path, "wb") as f:
pickle.dump(
{
"embeddings": embeddings,
"meta_data": {
"embedding_model": args.embedding_model,
"cluster_with": args.cluster_with,
},
},
f,
)


def get_embeddings(documents):
if args.embedding_model == "open_ai":
batch_size = 128
embeddings = []
for i in tqdm.tqdm(
range(0, len(documents), batch_size), desc="Computing embeddings"
):
batch_instructions = documents[i : i + batch_size]
response = openai.Embedding.create(
input=batch_instructions, engine="text-embedding-ada-002"
)
embeddings.extend([np.array(d["embedding"]) for d in response.data])
embeddings = np.stack(embeddings)
return embeddings
else:
model = AutoModel.from_pretrained(args.embedding_model).to(device)
tokenizer = AutoTokenizer.from_pretrained(args.embedding_model) # .to(device)
embeddings = []
with torch.no_grad():
batch_size = 128 # lower this if needed
for i in tqdm.tqdm(range(0, len(documents), batch_size)):
batch = [document for document in documents[i : i + batch_size]]
encoded_input = tokenizer(batch, return_tensors="pt", padding=True).to(
device
)
cls_embeddings = model(**encoded_input)["last_hidden_state"][:, 0]
embeddings.append(cls_embeddings)

embeddings = torch.cat(embeddings).cpu().numpy()
return embeddings


def get_atlas_map(dm: AlpacaDataModule, args):
def create_new_proj():
print("Creating new project and map")
dataset: AlpacaDataset = dm.get_dataset()
print(" Getting embeddings...")
if args.cluster_with == "instruction":
embeddings = get_embeddings([d["instruction"] for d in dataset.dataset])
elif args.cluster_with == "prompt":
embeddings = get_embeddings(
[
f"Instruction: {d['instruction']} \n Input: {d['input']}"
for d in dataset.dataset
]
)
# save embeddings to a file
store_embeddings(embeddings, args)
instructions_dicts = [
{
"full": f"Instruction: {d['instruction']} \n Input: {d['input']} \n Output: {d['output']}",
"instruction": d["instruction"],
"input": d["input"],
"output": d["output"],
"prompt": f"Instruction: {d['instruction']} \n Input: {d['input']} \n Output:",
}
for d in dataset.dataset
]
project = atlas.map_embeddings(
data=instructions_dicts, # [d[args.cluster_with] for d in instructions_dicts], #performs bag of words based topic modeling
# indexed_field=args.cluster_with,
embeddings=embeddings,
name=f"Alpaca_{args.cluster_with}",
reset_project_if_exists=True,
topic_label_field=args.cluster_with,
build_topic_model=True,
)
print(" waiting for project to be ready...")
# wait until project is ready
while not project.is_accepting_data:
time.sleep(5)

return project

if not args.rebuild_embeddings:
try:
project = AtlasProject(name=f"Alpaca_{args.cluster_with}")
map = project.get_map(name=f"Alpaca_{args.cluster_with}")
except Exception as e:
print(e)
project = create_new_proj()
map = project.get_map(name=f"Alpaca_{args.cluster_with}")
else:
project = create_new_proj()
map = project.get_map(name=f"Alpaca_{args.cluster_with}")
return map


def main(args, config):
dm = AlpacaDataModule(config)
dm.setup()
map = None
# # load embeddings from GPT
global embeddings_path
embeddings_path = (
args.embeddings_path + f"/embeddings_of_{args.cluster_with}_{args.depth}.pkl"
)
if args.use_atlas:
map = get_atlas_map(dm, args) if map is None else map
topics_all = map.get_topic_data()
# try loading embeddings file
if os.path.exists(embeddings_path):
with open(embeddings_path, "rb") as f:
embeddings_file = pickle.load(f)
else:
raise Exception(
"Embeddings file not found, try running the script with --rebuild_embeddings flag set to True"
)
# remove key from embeddings_file
depth = args.depth # 2
emb_column_name = f"atlas_topics_by_{args.cluster_with}_l{depth}"
if not emb_column_name in embeddings_file:
embeddings_file[emb_column_name] = []
batch_size = 28 # lower this if needed
for r in tqdm.tqdm(
range(0, len(embeddings_file["embeddings"]), batch_size)
):
batch = embeddings_file["embeddings"][r : r + batch_size]
# q = np.expand_dims(r, axis=0)
topics = map.vector_search_topics(queries=batch, depth=depth)["topics"]
embeddings_file[emb_column_name].extend(topics)
# save responses as pkl
with open(embeddings_path, "wb") as f:
pickle.dump(embeddings_file, f)

cluster_infos = ClusterInfos()
indices_train = dm.train_dataset.indices
print("Getting cluster infos for train set")
for i, example in tqdm.tqdm(enumerate(dm.train_dataset)):
topics = embeddings_file[emb_column_name][indices_train[i]]
hash = example.hash
probs = np.zeros((len(topics_all)))
for i, k in enumerate(topics):
probs[int(k) - 1] = topics[k]
assert sum(probs) == 1
main_t = np.argmax(probs)
cluster_infos.is_test.extend([0])
cluster_infos.task_names.extend(
[topics_all[int(main_t)]["topic_short_description"]]
)
cluster_infos.cluster_ids.extend([int(main_t)])
cluster_infos.hashes.extend([hash])
cluster_infos.cluster_dists.extend([probs.tolist()])

print("Getting cluster infos for dev set")
indices_dev = dm.dev_dataset.indices
for i, example in tqdm.tqdm(enumerate(dm.dev_dataset)):
topics = embeddings_file[emb_column_name][indices_dev[i]]
hash = example.hash
probs = np.zeros((len(topics_all)))
for i, k in enumerate(topics):
probs[int(k) - 1] = topics[k]
assert sum(probs) == 1
main_t = np.argmax(probs)
cluster_infos.is_test.extend([1])
cluster_infos.task_names.extend(
[topics_all[int(main_t)]["topic_short_description"]]
)
cluster_infos.cluster_ids.extend([int(main_t)])
cluster_infos.hashes.extend([hash])
cluster_infos.cluster_dists.extend([probs.tolist()])

cluster_infos.save(args.example_to_ids_path)
else:
raise NotImplementedError
# TODO: double check this
import faiss

if not args.rebuild_embeddings:
if os.path.exists(embeddings_path):
with open(embeddings_path, "rb") as f:
embeddings_file = pickle.load(f)
embeddings = embeddings_file["embeddings"]
else:
raise Exception(
"Embeddings file not found, try running the script with --rebuild_embeddings flag set to True"
)
else:
dataset: AlpacaDataset = dm.get_dataset()
embeddings = get_embeddings([d[args.cluster_with] for d in dataset.dataset])
# save embeddings to a file
store_embeddings(embeddings, args)

n_clusters = args.n_clusters
embeddings = np.asarray(embeddings, dtype="float32")  # faiss k-means expects float32
# for k in tqdm.tqdm(k_range):
kmeans = faiss.Kmeans(
embeddings.shape[-1],
k=n_clusters,
niter=10,
verbose=True,
gpu=True,
nredo=5,
max_points_per_centroid=10_000_000,
)
kmeans.train(embeddings)

cluster_infos = ClusterInfos()
# search all centroids: D holds sorted distances, I the matching centroid ids
D, I = kmeans.index.search(embeddings, n_clusters)
cluster_infos.centroids = kmeans.centroids
# Inference:
# kmeans.train(logits, init_centroids=centroids) # this ensures that kmeans.index is created
# assert np.sum(kmeans.centroids - centroids) == 0, "centroids are not the same" # sanity check
# cluster_distances, cluster_indices = kmeans.assign(new_data)
distances = np.zeros((D.shape[0], D.shape[1]))
for i in range(D.shape[0]):
for j in range(D.shape[1]):
distances[i, I[i, j]] = D[i, j]
indices_train = dm.train_dataset.indices
for i, example in tqdm.tqdm(enumerate(dm.train_dataset)):
hash = example.hash
idx = indices_train[i]
cluster_infos.hashes.append(hash)
cluster_infos.cluster_ids.append(I[idx, 0])
cluster_infos.is_test.append(0)
cluster_infos.cluster_dists.append(distances[idx].tolist())
# assert chunk_data.input_type == cluster_infos.input_type

indices_dev = dm.dev_dataset.indices
for i, example in tqdm.tqdm(enumerate(dm.dev_dataset)):
hash = example.hash
idx = indices_dev[i]
cluster_infos.hashes.append(hash)
cluster_infos.cluster_ids.append(I[idx, 0])
cluster_infos.is_test.append(1)
cluster_infos.cluster_dists.append(distances[idx].tolist())

assert len(cluster_infos.hashes) == len(cluster_infos.cluster_ids)
cluster_sizes = np.bincount(cluster_infos.cluster_ids)

print("Sorted cluster sizes:", sorted(cluster_sizes))
print(
"Bigger to smaller ratio:",
np.max(cluster_sizes) / (np.min(cluster_sizes) + 0.1),
)

cluster_infos.save(config.example_to_ids_path)


if __name__ == "__main__":
# add params with default values

parser = argparse.ArgumentParser()
parser.add_argument(
"--cluster_with",
type=str,
default="prompt",
choices=["full", "instruction", "output", "prompt"],
)
parser.add_argument("--n_clusters", type=int, default=20)
parser.add_argument("--use_atlas", type=bool, default=True)
parser.add_argument(
"--use_topic_modeling", type=bool, default=False
)  # if True, uses topic modeling for clustering; if False, uses OpenAI embeddings
parser.add_argument(
"--embeddings_path",
type=str,
default="inst_follow/data/self_instruct_GPT3_embeddings",
)
parser.add_argument("-c", "--config", type=str, default="config/mttl/mttl.yaml")
parser.add_argument(
"--example_to_ids_path",
type=str,
default="inst_follow/cluster_infos/atlas_by_instr_text-embedding-ada-002_ldalayer1.pkl",
)
parser.add_argument("--rebuild_embeddings", type=bool, default=True)
parser.add_argument("--embedding_model", type=str, default="open_ai")
parser.add_argument("--depth", type=int, default=1)
args = parser.parse_args()

config = parse_config()
main(args, config)
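
For reference, the pickle written by `store_embeddings` bundles the embedding matrix with its metadata. A minimal inspection sketch, assuming the argparse defaults above (`--cluster_with prompt --depth 1`):

```
import pickle

# Path implied by the defaults: embeddings_path + "/embeddings_of_{cluster_with}_{depth}.pkl"
path = "inst_follow/data/self_instruct_GPT3_embeddings/embeddings_of_prompt_1.pkl"

with open(path, "rb") as f:
    blob = pickle.load(f)

print(blob["meta_data"])         # {"embedding_model": ..., "cluster_with": ...}
print(blob["embeddings"].shape)  # (num_examples, embedding_dim)
# Topic assignments added by main() are stored under keys such as
# "atlas_topics_by_prompt_l1".
```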
14 changes: 14 additions & 0 deletions configs/alpaca/finetune.json
@@ -0,0 +1,14 @@
{
"dataset": "alpaca",
"optimizer": "adamw",
"train_dir": "$AP_DATA_DIR",
"warmup_proportion": 0.06,
"total_steps": 1000,
"learning_rate": 1e-3,
"max_grad_norm": 0.1,
"weight_decay": 0.01,
"train_batch_size": 8,
"predict_batch_size": 8,
"precision": "bf16",
"model": "t5-small"
}
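
A quick sketch of how the schedule-related fields combine, assuming `warmup_proportion` is interpreted as a fraction of `total_steps` (the usual convention):

```
import json

# Minimal sketch; assumes warmup_proportion is a fraction of total_steps.
with open("configs/alpaca/finetune.json") as f:
    cfg = json.load(f)

warmup_steps = int(cfg["warmup_proportion"] * cfg["total_steps"])
print(warmup_steps)  # 0.06 * 1000 = 60 warmup steps at learning_rate 1e-3
```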