
Commit 4b76ae6

update readme for llama export with sharding

1 parent 7d34c93

1 file changed: +21 −2 lines changed


Quantization.md renamed to llama-readme.md

@@ -35,10 +35,29 @@ llama.cpp/build/bin/llama-quantize --pure <input_gguf> <output_gguf> <format> <t

For formats we should target `Q4_1` and `Q4_K`.
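
For example, a minimal invocation quantizing an f16 GGUF to `Q4_1` with 8 threads, following the `llama-quantize` signature shown above (the file names here are placeholders):

```
llama.cpp/build/bin/llama-quantize --pure llama-f16.gguf llama-q4_1.gguf Q4_1 8
```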

## IRPA Export
Converting a GGUF file to an IRPA file allows us to store the tensor data in our preferred data format.
```
python3 -m sharktank.tools.dump_gguf --gguf-file <gguf_file> --save <irpa_file>
```
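
For example, continuing from the quantized GGUF produced in the previous step (file names are placeholders):

```
python3 -m sharktank.tools.dump_gguf --gguf-file llama-q4_1.gguf --save llama-q4_1.irpa
```
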
## IRPA Sharding (optional)
If you want to run a sharded version of the model you need to shard the `irpa` file so that
the appropriate constants are replicated and sharded. We need to use a specially sharded `irpa`
file so that the loads occur separately. (This should be fixed in the future.)
```
python3 -m sharktank.models.llama.tools.shard_llama --irpa-file <unsharded-irpa> --output-file <sharded-irpa> --shard_count=<sharding>
```
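
For instance, a sketch producing an 8-way sharded file (the shard count and file names are illustrative):

```
python3 -m sharktank.models.llama.tools.shard_llama --irpa-file llama-q4_1.irpa --output-file llama-q4_1-tp8.irpa --shard_count=8
```
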
## MLIR Generation

Once we have a particular GGUF file we can export the representative IR. We choose to use
IRPA files as the loading process can be better optimized for quantized types.
```
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file <input_irpa> --output-mlir <output-mlir>
```
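
Putting it together, a hypothetical export of the unsharded IRPA from the earlier step (file names are placeholders; pass the sharded IRPA instead if you ran the optional sharding step):

```
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file llama-q4_1.irpa --output-mlir llama-q4_1.mlir
```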
