
Commit 4b76ae6

update readme for llama export with sharding

1 parent 7d34c93

1 file changed: +21 −2 lines changed


Quantization.md renamed to llama-readme.md

@@ -35,10 +35,29 @@ llama.cpp/build/bin/llama-quantize --pure <input_gguf> <output_gguf> <format> <t

For formats we should target `Q4_1` and `Q4_K`.
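
For example, a minimal invocation quantizing an f16 GGUF to `Q4_1` with 8 threads, following the `llama-quantize` signature shown above (the file names here are placeholders):

```
llama.cpp/build/bin/llama-quantize --pure llama-f16.gguf llama-q4_1.gguf Q4_1 8
```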

## IRPA Export
Converting a GGUF file to an IRPA file allows us to store the tensor data in our preferred data format.
```
python3 -m sharktank.tools.dump_gguf --gguf-file <gguf_file> --save <irpa_file>
```
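
For example, continuing from the quantized GGUF produced in the previous step (file names are placeholders):

```
python3 -m sharktank.tools.dump_gguf --gguf-file llama-q4_1.gguf --save llama-q4_1.irpa
```
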
## IRPA Sharding (optional)
If you want to run a sharded version of the model you need to shard the `irpa` file so that
the appropriate constants are replicated and sharded. We need to use a specially sharded `irpa`
file so that the loads occur separately. (This should be fixed in the future.)
```
python3 -m sharktank.models.llama.tools.shard_llama --irpa-file <unsharded-irpa> --output-file <sharded-irpa> --shard_count=<sharding>
```
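
For instance, a sketch producing an 8-way sharded file (the shard count and file names are illustrative):

```
python3 -m sharktank.models.llama.tools.shard_llama --irpa-file llama-q4_1.irpa --output-file llama-q4_1-tp8.irpa --shard_count=8
```
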
## MLIR Generation

Once we have a particular GGUF file we can export the representative IR. We choose to use
IRPA files as the loading process can be better optimized for quantized types.
```
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file <input_irpa> --output-mlir <output-mlir>
```
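
Putting it together, a hypothetical export of the unsharded IRPA from the earlier step (file names are placeholders; pass the sharded IRPA instead if you ran the optional sharding step):

```
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file llama-q4_1.irpa --output-mlir llama-q4_1.mlir
```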
