llama.cpp/build/bin/llama-quantize --pure <input_gguf> <output_gguf> <format> <t
For formats we should target `Q4_1` and `Q4_K`.
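As a sketch, both target formats can be produced in one pass. The `echo` makes this a dry run that only prints the commands, and the `model.gguf` input and output filenames are hypothetical; remove the `echo` to actually quantize once llama.cpp is built.

```shell
# Dry-run sketch: emit one llama-quantize invocation per target format.
# model.gguf and the output names are hypothetical placeholders.
for fmt in Q4_1 Q4_K; do
  echo llama.cpp/build/bin/llama-quantize --pure model.gguf "model-${fmt}.gguf" "${fmt}"
done
```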
## IRPA Export

Converting a GGUF file to an IRPA file allows us to store the tensor data in our preferred format.
```
python3 -m sharktank.tools.dump_gguf --gguf-file <gguf_file> --save <irpa_file>
```
## IRPA Sharding (optional)

If you want to run a sharded version of the model you need to shard the `irpa` file so that
the appropriate constants are replicated and sharded. We need to use a specially sharded `irpa`
so that loads occur separately. (This should be fixed in the future.)
```
python3 -m sharktank.models.llama.tools.shard_llama --irpa-file <unsharded-irpa> --output-file <sharded-irpa> --shard_count=<sharding>
```
## MLIR Generation
Once we have a particular gguf file we can export the representative IR. We choose to use
IRPA files as the loading process can be better optimized for quantized types.
```
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file <input_irpa> --output-mlir <output-mlir>
```
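The steps above can be chained into a single pipeline. The sketch below is a dry run: every filename and the shard count of 8 are hypothetical, and each `echo` only prints the command it would run; remove the `echo` prefixes to actually execute each stage with sharktank installed.

```shell
# Dry-run sketch of the full pipeline: GGUF -> IRPA -> sharded IRPA -> MLIR.
# All filenames and --shard_count=8 are hypothetical placeholders.
echo python3 -m sharktank.tools.dump_gguf --gguf-file model.gguf --save model.irpa
echo python3 -m sharktank.models.llama.tools.shard_llama --irpa-file model.irpa --output-file model.sharded.irpa --shard_count=8
echo python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file model.sharded.irpa --output-mlir model.mlir
```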