Move in projects (#40)
* change callbacks

* pbar

* T0 works with torch==2.0.1 and pytorch-lightning==2.0.4

* accum inference outputs

* update config for finetuning and fix progress

* fix inference in t0

* update reqs

* move to projects/mhr

* create mhr config

* remove old import

* update t0-3b args

* added scripts

* added tests

* fix value input for Z tuning

* update setup instructions

* Loosen python dependencies.

* Fix setup and tests.

* Increase test verbosity. Removed dependabot pipeline.

* Bump pytorch-lightning from 2.0.4 to 2.0.5

* Fix T0 dataset creation script

* use /tmp for the datasets preparation

* move dataset scripts inside projects/mhr

* move some files around

* temporarily move _set_defaults back to mttl/config

* remove output folder and ignore it anywhere in the repo

* removed bb

* fix missing files in the setup bundle

* move finetune scripts to scripts/finetune

* remove pl_zeroshot

* Review scripts to use same envvar. Add instructions to readme. Save processed data inside the projects/mhr folder.

* Removed hardcoded train_dir

* add env var to load storycloze dataset

* add STORYCLOZE_DIR to readme

---------

Co-authored-by: Alessandro Sordoni <[email protected]>
Co-authored-by: Lucas Caccia <[email protected]>
Co-authored-by: matheper <[email protected]>
4 people authored Aug 1, 2023
1 parent f84f266 commit ce4ca51
Showing 71 changed files with 291 additions and 299 deletions.
11 changes: 0 additions & 11 deletions .github/dependabot.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -37,4 +37,4 @@ jobs:
# flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest
pytest -vv
6 changes: 6 additions & 0 deletions .gitignore
@@ -4,6 +4,12 @@ amulet_*/
wandb
.amltconfig
.amltignore
cache/
output/
**/output
data/
**/data
.vscode/

# Byte-compiled / optimized / DLL files
__pycache__/
68 changes: 58 additions & 10 deletions README.md
@@ -4,33 +4,81 @@ MTTL - Multi-Task Transfer Learning

## Setup

Install Python packages:
MTTL supports `Python 3.8` and `Python 3.9`. It is recommended to create a virtual environment for MTTL using `virtualenv` or `conda`. For example, with `conda`:

`pip install -r requirements.txt`
conda create -n mttl python=3.9
conda activate mttl

_The package `promptsource` currently requires Python 3.7. Alternative versions require local installations (see their [documentation](https://github.com/bigscience-workshop/promptsource#setup))._
Install the required Python packages:

Download the datasets:
pip install -e .

`bash scripts/create_datasets.sh`

## Multi-task Pre-training

The general command:
## Multi-Head Adapter Routing

`python pl_train.py -c $CONFIG_FILES -k $KWARGS`
Please ensure that you have navigated to the `projects/mhr` directory before running the Multi-Head Adapter Routing scripts:

cd projects/mhr


### Data Preparation

Download and prepare the datasets for the experiments using the following script:

bash datasets/create_datasets.sh


### Environment Variables

Based on your experiments, you may need to export one or more of the following environment variables (a minimal sketch of how some of them are picked up follows the list):

T0_DATA_DIR: `data/t0_data/processed` if you ran the `create_datasets.sh` script
NI_DATA_DIR: `data/ni_data/processed` if you ran the `create_datasets.sh` script
XFIT_DATA_DIR: `data/ni_data/processed` if you ran the `create_datasets.sh` script
CHECKPOINT_DIR
OUTPUT_DIR
CACHE_DIR
STORYCLOZE_DIR: path to your downloaded `.csv` files. See [the storycloze official website](https://cs.rochester.edu/nlp/rocstories/)
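
For reference, `mttl/config.py` (changed in this commit) reads a few of these variables through `os.getenv` with fallback defaults. The snippet below is a minimal sketch of that pattern; the dataset-specific variables (`T0_DATA_DIR`, `NI_DATA_DIR`, `XFIT_DATA_DIR`, `STORYCLOZE_DIR`) are consumed by the data readers rather than by `Config`, so the last line is only an illustrative assumption.

import os

# Defaults mirrored from mttl/config.py as changed in this commit.
cache_dir = os.getenv("CACHE_DIR", "./cache")      # cache for downloaded/intermediate data
train_dir = os.getenv("TRAIN_DIR", "/tmp/")        # where training data is looked up
output_dir = os.getenv("OUTPUT_DIR", "./output")   # where config.json and run outputs land

# Illustrative assumption: point the T0 reader at the data produced by create_datasets.sh.
os.environ.setdefault("T0_DATA_DIR", "data/t0_data/processed")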


### Multi-task Pre-training

The general command for pre-training a model is:

python pl_train.py -c $CONFIG_FILES -k $KWARGS

Multiple `CONFIG_FILES` can be concatenated as `file1+file2`. To modify defaults, `KWARGS` can be expressed as `key=value`.
You can check [scripts/pretrain](scripts/pretrain) for examples.
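
For concreteness, here is a minimal sketch (not part of the repository) of what the `-c`/`-k` flags do internally, based on `mttl/config.py` as it appears in this commit; the overridden key is just one of the config defaults picked as an example.

from mttl.config import Config

# Mirrors `python pl_train.py -k module_logits_dropout=0.1` with no config files.
# Passing filenames="file1.json+file2.json" would instead load and merge the JSON
# files left to right before applying the key=value overrides.
config = Config(filenames=None, kwargs={"module_logits_dropout": "0.1"})

print(config.module_logits_dropout)                    # 0.1, string values are literal-eval'd
print(config.was_overridden("module_logits_dropout"))  # True

# Note: constructing a Config also writes <output_dir>/config.json (./output by default).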

## Test Fine-Tuning
### Test Fine-Tuning

To perform fine-tuning for a test task, use the script `pl_finetune.py`.

## Hyper-parameter Search for Test Fine-Tuning
### Hyper-parameter Search for Test Fine-Tuning

To perform a hyperparameter search for a test task, use the script `pl_finetune_tune.py`.
It calls the functions in `pl_finetune.py` in a loop and defines the hyperparameter ranges for the different fine-tuning types.


### Pre-Configured Scripts

Alternatively, you can run the pre-configured scripts from the `scripts` folder. For example:

bash scripts/mhr_pretrain.sh

### Known Issues
If you run into protoc-related issues such as `TypeError: Descriptors cannot not be created directly.`, you can try downgrading protobuf to 3.20.*:

pip install protobuf==3.20.*


## Running Tests

pip install -e ".[test]"
pytest -vv tests


## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
3 changes: 0 additions & 3 deletions configs/t0/3b.json

This file was deleted.

200 changes: 102 additions & 98 deletions mttl/config.py
@@ -6,16 +6,113 @@


class Config:

def __init__(self, filenames=None, kwargs=None, raise_error=True):
# Stores personalization of the config file in a dict (json serializable)
self._updated_kwargs = {}
self.filenames = filenames
self._set_defaults()

if filenames:
for filename in filenames.split("+"):
if not os.path.exists(filename):
filename = os.path.join(os.getenv("CONFIG_PATH", default="configs"), filename)

self.update_kwargs(json.load(open(filename)), eval=False, raise_error=raise_error)

if kwargs:
self.update_kwargs(kwargs, raise_error=raise_error)

self.save_config(self.output_dir)

def was_overridden(self, key):
return key in self._updated_kwargs

def was_default(self, key):
return key not in self._updated_kwargs

def update_kwargs(self, kwargs, eval=True, raise_error=True):
for (k, v) in kwargs.items():
if eval:
try:
v = ast.literal_eval(v)
except (ValueError, SyntaxError):
v = v
else:
v = v
if not hasattr(self, k) and raise_error:
raise ValueError(f"{k} is not in the config")

if eval:
print("Overwriting {} to {}".format(k, v))

if k == 'finegrained':
k = 'poly_granularity'
v = 'finegrained' if v else 'coarsegrained'
elif k in ['train_dir', 'output_dir']:
# this raises an error if the env. var does not exist
v = Template(v).substitute(os.environ)

setattr(self, k, v)
self._updated_kwargs[k] = v

def __getitem__(self, item):
return getattr(self, item, None)

def to_json(self):
"""
Converts parameter values in config to json
:return: json
"""
import copy

to_save = copy.deepcopy(self.__dict__)
to_save.pop("_updated_kwargs")

return json.dumps(to_save, indent=4, sort_keys=False)

def save_config(self, output_dir):
"""
Saves the config
"""
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, "config.json"), "w+") as fout:
fout.write(self.to_json())
fout.write("\n")

@classmethod
def parse(cls, extra_kwargs=None, raise_error=True):
import itertools

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--config_files", required=False)
parser.add_argument("-k", "--kwargs", nargs="*", action='append')
args = parser.parse_args()

kwargs = {}
if args.kwargs:
kwargs_opts = list(itertools.chain(*args.kwargs))
for value in kwargs_opts:
key, _, value = value.partition('=')
kwargs[key] = value
args.kwargs = kwargs
if extra_kwargs:
args.kwargs.update(extra_kwargs)

config = cls(args.config_files, args.kwargs, raise_error=raise_error)

print(config.to_json())
return config

def _set_defaults(self):
self.cache_dir = os.getenv("CACHE_DIR", "./cache")
self.free_up_space = False
# Data config
self.dataset = None
self.custom_tasks_splits = None
self.train_dir = os.getenv("AMLT_DATA_DIR", "/tmp/")
self.output_dir = os.getenv("AMLT_OUTPUT_DIR", "./output")
self.train_dir = os.getenv("TRAIN_DIR", "/tmp/")
self.output_dir = os.getenv("OUTPUT_DIR", "./output")
self.finetune_task_name = None
self.example_to_ids_path = None # path to clustering of data
self.embeddings_path = None
@@ -103,12 +200,12 @@ def __init__(self, filenames=None, kwargs=None, raise_error=True):
self.poly_use_shared_skill = False # use one skill shared by all tasks

"""
poly_granularity : how granular is the module selection :
poly_granularity : how granular is the module selection :
coarsegrained : 1 single selector across all linear layers
coderwise : 2 selectors (1 for encoder, 1 for decoder)
blockwise : 1 selector for each block of K attention layers (and layernorm)
layerwise : 1 selector for each attention layer (and layernorm)
finegrained : 1 selector for every linear layer
layerwise : 1 selector for each attention layer (and layernorm)
finegrained : 1 selector for every linear layer
"""
self.poly_granularity = 'finegrained'

@@ -119,75 +216,6 @@ def __init__(self, filenames=None, kwargs=None, raise_error=True):
self.adapters_weight_decay = None
self.module_logits_dropout = 0.
self.module_logits_l2_norm = False
self.filenames = filenames

if filenames:
for filename in filenames.split("+"):
if not os.path.exists(filename):
filename = os.path.join(os.getenv("CONFIG_PATH", default="configs"), filename)

self.update_kwargs(json.load(open(filename)), eval=False, raise_error=raise_error)

if kwargs:
self.update_kwargs(kwargs, raise_error=raise_error)

self.save_config(self.output_dir)

def was_overridden(self, key):
return key in self._updated_kwargs

def was_default(self, key):
return key not in self._updated_kwargs

def update_kwargs(self, kwargs, eval=True, raise_error=True):
for (k, v) in kwargs.items():
if eval:
try:
v = ast.literal_eval(v)
except (ValueError, SyntaxError):
v = v
else:
v = v
if not hasattr(self, k) and raise_error:
raise ValueError(f"{k} is not in the config")

if eval:
print("Overwriting {} to {}".format(k, v))

if k == 'finegrained':
k = 'poly_granularity'
v = 'finegrained' if v else 'coarsegrained'
elif k in ['train_dir', 'output_dir']:
# this raises an error if the env. var does not exist
v = Template(v).substitute(os.environ)

setattr(self, k, v)
self._updated_kwargs[k] = v

def __getitem__(self, item):
return getattr(self, item, None)

def to_json(self):
"""
Converts parameter values in config to json
:return: json
"""
import copy

to_save = copy.deepcopy(self.__dict__)
to_save.pop("_updated_kwargs")

return json.dumps(to_save, indent=4, sort_keys=False)

def save_config(self, output_dir):
"""
Saves the config
"""
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, "config.json"), "w+") as fout:
fout.write(self.to_json())
fout.write("\n")


class ParseKwargs(argparse.Action):
@@ -196,27 +224,3 @@ def __call__(self, parser, namespace, values, option_string=None):
for value in values:
key, value = value.split('=')
getattr(namespace, self.dest)[key] = value


def parse_config(extra_kwargs=None, raise_error=True):
import itertools

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--config_files", required=False)
parser.add_argument("-k", "--kwargs", nargs="*", action='append')
args = parser.parse_args()

kwargs = {}
if args.kwargs:
kwargs_opts = list(itertools.chain(*args.kwargs))
for value in kwargs_opts:
key, _, value = value.partition('=')
kwargs[key] = value
args.kwargs = kwargs
if extra_kwargs:
args.kwargs.update(extra_kwargs)

config = Config(args.config_files, args.kwargs, raise_error=raise_error)

print(config.to_json())
return config