# ============================================================================
# NETWORK SELECTION
# ============================================================================
# Which Bittensor network to connect to (if BT_CHAIN_ENDPOINT is not set):
# - finney : mainnet (default) - SDK resolves to official mainnet endpoint
# - test : testnet - SDK resolves to official testnet endpoint
# - local : local node (typically ws://localhost:9944)
BT_NETWORK=finney
# Optional: Override with explicit websocket endpoint for custom/local nodes
# If set, this takes precedence over BT_NETWORK
# Examples:
# - Local node: ws://localhost:9944
# - Custom remote: wss://my-node.example.com:443
# Leave empty to use BT_NETWORK (recommended for production)
BT_CHAIN_ENDPOINT=
# Your target subnet NETUID (e.g., 81 on mainnet, 429 on your test subnet)
NETUID=81
# ============================================================================
# WALLET CONFIGURATION
# ============================================================================
# These are the *names* (aliases) you assigned when creating the coldkey and hotkey,
# not SS58 addresses or file paths - just the user-chosen wallet name.
# 1. BT_WALLET_COLD — the name you gave your coldkey (e.g., "mywallet")
# Create one for mining/validating with: btcli wallet new_coldkey --wallet.name mywallet
# This key stays offline and is used for staking, registration, and ownership.
BT_WALLET_COLD=
# 2. BT_WALLET_HOT — the name you gave your hotkey (e.g., "myhotkey")
# Create one with: btcli wallet new_hotkey --wallet.name mywallet --wallet.hotkey myhotkey
# (the --wallet.name here must match BT_WALLET_COLD above)
# This key is safe to keep online and is used for mining, validation, or inference.
BT_WALLET_HOT=default
# ============================================================================
# REQUIRED: STORAGE (Cloudflare R2) - DUAL CREDENTIAL SYSTEM
# ============================================================================
# IMPORTANT: GRAIL uses a dual-credential system for R2:
# - WRITE credentials: Private, never shared, used for uploading your data
# - READ credentials: Shared on-chain, allows others to read your data
# 1. R2_ACCOUNT_ID: Your Cloudflare account ID (used in API endpoints)
# ➤ Go to: https://dash.cloudflare.com > Click your account (top left) > Overview
# ➤ Copy the "Account ID" shown there.
R2_ACCOUNT_ID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# 2. R2_BUCKET_ID: The name of the R2 bucket you create
# ⚠️ IMPORTANT: The bucket name MUST be the same as your R2_ACCOUNT_ID above!
# ➤ Go to https://dash.cloudflare.com > R2 > "Create Bucket"
# ➤ Set bucket name to your Account ID (same value as R2_ACCOUNT_ID)
# ➤ Set region to ENAM (required)
# ➤ Example: If your account ID is "abc123def456", your bucket name should be "abc123def456"
R2_BUCKET_ID=""
# 3. WRITE CREDENTIALS (Private - NEVER shared on chain)
# These are your API credentials with full read/write access to R2
# ➤ Go to: https://dash.cloudflare.com > R2 > "Manage R2 API Tokens"
# ➤ Click "Create API Token"
# ➤ Name it something like "grail-write-access"
# ➤ Select **Edit Permissions**, and:
# - Scope: `Account.Cloudflare R2 Storage` (or select R2 bucket explicitly)
# - Permissions: `Edit` (for full read/write access)
# ➤ Generate and copy both keys.
R2_WRITE_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
R2_WRITE_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# 4. READ CREDENTIALS (Public - Will be posted to chain)
# These are API credentials with read-only access to R2
# ➤ Go to: https://dash.cloudflare.com > R2 > "Manage R2 API Tokens"
# ➤ Click "Create API Token" again
# ➤ Name it something like "grail-read-only"
# ➤ Select **Read Permissions**, and:
# - Scope: `Account.Cloudflare R2 Storage` (or select R2 bucket explicitly)
# - Permissions: `Read` (for read-only access)
# ➤ Generate and copy both keys.
# NOTE: These credentials will be committed to the blockchain for transparency
R2_READ_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
R2_READ_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
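# Optional sanity check (illustrative; assumes the AWS CLI is installed):
# verify the READ credentials really are read-only before they are committed
# to chain, using R2's S3-compatible endpoint:
#   AWS_ACCESS_KEY_ID=$R2_READ_ACCESS_KEY_ID \
#   AWS_SECRET_ACCESS_KEY=$R2_READ_SECRET_ACCESS_KEY \
#   aws s3 ls "s3://$R2_BUCKET_ID" \
#     --endpoint-url "https://$R2_ACCOUNT_ID.r2.cloudflarestorage.com"
# Listing should succeed; an upload attempt with the same keys should be denied.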
# ============================================================================
# OPTIONAL: HUGGING FACE
# ============================================================================
# 5. HF_TOKEN: Your Hugging Face access token for uploading datasets
# ➤ Go to: https://huggingface.co/settings/tokens
# ➤ Click "New token" and create a token with write permissions
# ➤ Copy the token (starts with hf_...)
# This is used to upload validated rollouts to the public Hugging Face dataset
HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# 6. HF_USERNAME: Your Hugging Face username (optional; e.g., "your-hf-username")
# ➤ The dataset will be created at: {HF_USERNAME}/grail-sat-rollouts
HF_USERNAME=
# ============================================================================
# GRAIL TRAINER MODEL LOADING CONFIGURATION
# ============================================================================
# The trainer supports flexible model loading at startup via environment variables.
# Both training and reference models can be loaded from different sources:
# - "latest": Download and use the latest checkpoint from R2 bucket
# - "hf": Load a model directly from Hugging Face Hub
# - "window": Load a specific checkpoint from a particular window
# Training Model Configuration
# Mode for loading the training model (required, no default)
# Options: latest | hf | window
GRAIL_TRAIN_MODEL_MODE=
# Required if GRAIL_TRAIN_MODEL_MODE=hf
# HuggingFace model ID (e.g., "Qwen/Qwen2.5-7B")
GRAIL_TRAIN_MODEL_ID=
# Required if GRAIL_TRAIN_MODEL_MODE=window
# Specific window number to load checkpoint from (e.g., 72000)
GRAIL_TRAIN_CHECKPOINT_WINDOW=
# Reference Model Configuration
# Mode for loading the reference model (required, no default)
# Options: latest | hf | window
GRAIL_REF_MODEL_MODE=
# Required if GRAIL_REF_MODEL_MODE=hf
# HuggingFace model ID (e.g., "Qwen/Qwen2.5-7B")
GRAIL_REF_MODEL_ID=
# Required if GRAIL_REF_MODEL_MODE=window
# Specific window number to load checkpoint from (e.g., 72000)
GRAIL_REF_CHECKPOINT_WINDOW=
# Example Configurations:
#
# 1. Train with latest checkpoint, reference with HF model:
# GRAIL_TRAIN_MODEL_MODE=latest
# GRAIL_REF_MODEL_MODE=hf
# GRAIL_REF_MODEL_ID=Qwen/Qwen2.5-7B
#
# 2. Train with HF model, reference with specific checkpoint:
# GRAIL_TRAIN_MODEL_MODE=hf
# GRAIL_TRAIN_MODEL_ID=Qwen/Qwen2.5-7B
# GRAIL_REF_MODEL_MODE=window
# GRAIL_REF_CHECKPOINT_WINDOW=72000
#
# 3. Both from latest checkpoints:
# GRAIL_TRAIN_MODEL_MODE=latest
# GRAIL_REF_MODEL_MODE=latest
# ============================================================================
# KERNEL EVALUATION CONFIGURATION (Miners and Validators — Triton Kernel environment)
# ============================================================================
# These settings control GPU-based kernel correctness evaluation for both
# miners (generating + evaluating kernels) and validators (re-executing
# miners' kernels to verify correctness). Only relevant when the active
# environment is triton_kernel.
# Enable GPU-based kernel evaluation (default: false)
# Must be true for triton_kernel environment to verify generated kernels on-GPU.
# When false, only compilation is checked (max reward capped at 0.35).
GRAIL_GPU_EVAL=true
# GPU device indices for kernel evaluation (comma-separated)
# IMPORTANT: These are PHYSICAL GPU indices, not relative to CUDA_VISIBLE_DEVICES.
# The eval subprocess overrides CUDA_VISIBLE_DEVICES internally, so always use
# the physical device number (as shown by nvidia-smi).
# These GPUs should be separate from decoding and proof GPUs.
#
# Miner layout (3 GPUs): GPU 0=generation, GPU 1=proof, GPU 2=kernel eval → set to "2"
# Validator layout (2 GPUs): GPU 0=model/proof, GPU 1=kernel eval → set to "1"
# Multiple GPUs: "2,3" for two GPUs in parallel
KERNEL_EVAL_GPU_IDS=2
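# Example (illustrative): list the physical GPU indices to pick from with
#   nvidia-smi --query-gpu=index,name --format=csv
# On a 4-GPU miner you might then set KERNEL_EVAL_GPU_IDS=2,3 and leave
# GPUs 0-1 for generation and proof.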
# Evaluation backend (default: persistent)
# - persistent: Long-lived worker per GPU, reuses CUDA context between evals.
# ~40x faster than subprocess. Auto-recovers from CUDA sticky errors.
# - subprocess: Each kernel runs in an isolated subprocess with its own CUDA context.
# Immune to CUDA sticky errors. Automatic retry on transient CUDA failures.
# - basilica: Cloud GPU workers via Basilica (no local GPU needed; not yet implemented)
KERNEL_EVAL_BACKEND=persistent
# Per-kernel evaluation timeout in seconds (default: 60)
KERNEL_EVAL_TIMEOUT=60
# ============================================================================
# PIPELINED MINING CONFIGURATION (mandatory; requires 2+ GPUs)
# ============================================================================
# The pipelined mining engine is the only generation path: the legacy
# single-GPU fallback was removed in the Stage 1 refactor and the
# GRAIL_PIPELINE_ENABLED flag was deleted (setting it has no effect).
#
# Pipeline mode overlaps proof computation (GPU 1) with kernel eval and
# generation (GPU 0), significantly improving per-window throughput.
#
# GPU layout:
# GPU 0 = SGLang/vLLM generation server
# GPU 1 = HuggingFace model for proof-of-work (logprobs + commitments)
# GPU 2 = Triton kernel evaluation (set via KERNEL_EVAL_GPU_IDS above)
#
# Requires at least 2 GPUs (gen + proof). On a 4-GPU machine, GPU 3 can be
# used for a second kernel eval GPU (KERNEL_EVAL_GPU_IDS=2,3).
# Generation backend: sglang (recommended, default) or vllm
# SGLang is the validated production path: it uses the native /generate
# endpoint with input_ids, avoiding text re-tokenization issues entirely.
GRAIL_PIPELINE_BACKEND=sglang
# Tensor-parallel size for the generation server (1 = single GPU, 4 = TP=4
# across consecutive GPUs starting at GRAIL_PIPELINE_VLLM_GPU). Increase
# only if you have spare GPUs and the model fits comfortably with TP=1.
GRAIL_PIPELINE_VLLM_TP=1
# GPU index for the SGLang/vLLM generation server (relative to CUDA_VISIBLE_DEVICES)
GRAIL_PIPELINE_VLLM_GPU=0
# GPU index for the HuggingFace proof-of-work model (relative to CUDA_VISIBLE_DEVICES)
GRAIL_PIPELINE_PROOF_GPU=1
# Use Flash Attention 2 for proof computation (default: false)
# WARNING: Must match the validator's attention implementation (SDPA).
# Setting this to true causes sketch divergence and proof failures.
GRAIL_PIPELINE_PROOF_FLASH_ATTN=false
# Generation server tuning (sensible defaults for A100 80GB)
GRAIL_PIPELINE_GPU_MEM_UTIL=0.90
GRAIL_PIPELINE_MAX_MODEL_LEN=12288
GRAIL_PIPELINE_MAX_NUM_SEQS=64
GRAIL_PIPELINE_MAX_CONCURRENT=48
GRAIL_PIPELINE_SERVER_TIMEOUT=300
# Symlink directory for vLLM weight reload (empty = auto, uses checkpoint parent dir)
# Set to an ephemeral/tmpfs path if main disk is small:
# GRAIL_PIPELINE_SYMLINK_DIR=/dev/shm/grail
# Only used by the vLLM backend; harmless when GRAIL_PIPELINE_BACKEND=sglang.
GRAIL_PIPELINE_SYMLINK_DIR=
# ============================================================================
# GRAIL MINER GENERATION CONFIGURATION
# ============================================================================
# Number of rollouts to generate in parallel per batch (default: 1)
# Higher values increase GPU utilization and throughput but require more VRAM.
# Must be <= ROLLOUTS_PER_PROBLEM (currently 16) and must divide evenly into it.
# Valid options: 1, 2, 4, 8, 16 (factors of ROLLOUTS_PER_PROBLEM)
# Recommended tuning: Start with 1, gradually increase to 2, 4, 8, or 16
# Monitor GPU memory with nvidia-smi to avoid OOM errors.
# Example values:
# - 1: Sequential generation (baseline, lowest memory)
# - 2: ~1.5-1.8x throughput
# - 4: ~3-4x throughput
# - 8: ~6-7x throughput
# - 16: ~8-10x throughput (requires significant VRAM, ideal for H100/H200 144GB or B200)
GRAIL_GENERATION_BATCH_SIZE=4
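# Worked example (assuming ROLLOUTS_PER_PROBLEM=16 as above):
# GRAIL_GENERATION_BATCH_SIZE=4 → 16 / 4 = 4 sequential batches per problem;
# a value like 3 would be invalid because 16 % 3 != 0.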
# ============================================================================
# GRAIL TRAINING CONFIGURATION
# ============================================================================
# Learning rate for GRPO training (default: 1e-6)
GRAIL_TRAINER_LR=1e-6
# Number of training epochs per window (default: 1)
GRAIL_TRAINER_EPOCHS=1
# Batch size for training (default: 4)
GRAIL_TRAINER_BATCH_SIZE=4
# Gradient accumulation steps (default: 128)
# Effective batch size = batch_size × grad_accum_steps = 4 × 128 = 512
# Optimizer steps per epoch = ceil(total_rollouts / effective_batch_size)
GRAIL_TRAINER_GRAD_ACCUM_STEPS=128
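# Worked example (the rollout count is illustrative): with batch_size=4 and
# grad_accum_steps=128, the effective batch size is 512; a window containing
# 1024 usable rollouts then yields ceil(1024 / 512) = 2 optimizer steps per epoch.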
# Total training windows (horizon) for LR scheduler (default: 1000)
# This determines the learning rate schedule duration
# Total optimizer steps across training = TOTAL_WINDOWS
# Warmup steps = TOTAL_WINDOWS × WARMUP_FRACTION (typically 5%)
# Example: With 1000 windows, you get ~50 warmup steps + 950 main phase steps
GRAIL_TRAINER_TOTAL_WINDOWS=1000
# LR scheduler type after warmup phase (default: "constant")
# Options:
# - "constant": Warmup then constant LR (recommended for GRPO/RL)
# - "cosine": Warmup then cosine annealing decay (common for SFT)
# Research suggests constant or linear decay often works better for policy gradient methods
# PPO/GRPO implementations typically use constant LR after warmup
GRAIL_LR_SCHEDULER_TYPE=constant
# Warmup fraction as portion of total windows (default: 0.05 = 5%)
# Warmup steps = total_windows × warmup_fraction
# With TOTAL_WINDOWS=1000 and WARMUP_FRACTION=0.05 → warmup_steps=50
GRAIL_WARMUP_FRACTION=0.05
# Minimum learning rate floor for cosine annealing (default: 1e-7)
# Only used when GRAIL_LR_SCHEDULER_TYPE=cosine
# Used as eta_min in CosineAnnealingLR scheduler
# Lower values allow training to continue with smaller updates at end of schedule
GRAIL_SCHEDULER_ETA_MIN=1e-7
# Maximum sequence length for training (default: 2048)
GRAIL_TRAINER_MAX_LENGTH=2048
# Gradient clipping threshold (default: 1.0)
GRAIL_TRAINER_GRAD_CLIP=1.0
# KL divergence coefficient for regularization (default: 0.00)
GRAIL_TRAINER_KL_COEF=0.00
# Entropy coefficient for exploration (default: 0.0005)
GRAIL_TRAINER_ENTROPY_COEF=0.0005
# Advantage clipping percentile (default: 99.0)
GRAIL_TRAINER_ADV_CLIP_PERCENTILE=99.0
# Group advantage sum tolerance for GRPO (default: 0.01)
GRAIL_TRAINER_GROUP_ADV_SUM_TOL=0.01
# Enable Flash Attention for training model optimization (default: 1/true)
# Flash Attention 2 provides 2-4x faster training with no quality loss
# Requires flash-attn package: uv pip install flash-attn --no-build-isolation
# Set to 0 to disable (useful for debugging or older GPUs without Flash Attention support)
GRAIL_TRAINER_USE_FLASH_ATTENTION=1
# Enable gradient checkpointing for memory efficiency (default: 1/true)
# Reduces activation memory by ~20-30% via recomputation on backward pass
# Trade-off: ~10-15% slower training (recomputation cost)
# Benefits: Enables larger effective batch sizes, longer sequences, or training with fewer GPUs
# Set to 0 to disable if you have plenty of memory and want fastest training speed
GRAIL_TRAINER_USE_GRADIENT_CHECKPOINTING=1
# ============================================================================
# GRPO GROUP FILTERING & RANKING CONFIGURATION
# ============================================================================
# Maximum number of groups to use for training (default: 32)
# When more groups are available after filtering, the top-ranked groups by
# combined efficiency score are selected
GRAIL_GRPO_MAX_GROUPS=32
# Maximum completion tokens per rollout (default: 512, 0 disables)
# Filters out groups with any rollout exceeding this length
GRAIL_GRPO_MAX_COMPLETION_TOKENS=512
# Minimum success fraction within a group (default: 0.0, range: 0.0-1.0)
# Filters out groups where fewer than this fraction of rollouts succeeded
# Example: 0.25 requires at least 25% of rollouts in a group to succeed
GRAIL_GRPO_MIN_SUCCESS_FRACTION=0.0
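# Worked example (assuming 16 rollouts per group): a fraction of 0.25 requires
# at least 0.25 × 16 = 4 successful rollouts, otherwise the whole group is
# dropped before training.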
# Minimum mean reward per token threshold (default: 0.0, <=0 disables)
# Filters out groups with reward/token below this threshold
# Implements GFPO (Group Filtered Policy Optimization) efficiency filtering
GRAIL_GRPO_MIN_REWARD_PER_TOKEN=0.0
# Drop lowest quantile by reward/token (default: 0.0, range: 0.0-1.0)
# Drops the bottom X% of groups ranked by reward/token before final selection
# Example: 0.1 drops the worst 10% of groups
GRAIL_GRPO_REWARD_PER_TOKEN_DROP_QUANTILE=0.0
# Group ranking weights for combined efficiency score (must sum to 1.0)
# These weights determine how groups are ranked when selecting top groups:
# - REWARD_WEIGHT: Prioritizes token-efficient solutions (GFPO principle)
# - VARIANCE_WEIGHT: Prioritizes strong learning signal (advantage spread)
# Default: 0.7/0.3 favors efficiency while considering signal strength
GRAIL_GRPO_RANKING_REWARD_WEIGHT=0.7
GRAIL_GRPO_RANKING_VARIANCE_WEIGHT=0.3
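# Sketch of the combined score (the exact normalization is an assumption;
# see the trainer source for the actual implementation):
#   score = 0.7 * norm(reward_per_token) + 0.3 * norm(advantage_variance)
# Groups are sorted by this score and the top GRAIL_GRPO_MAX_GROUPS are kept.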
# ============================================================================
# GRAIL CHECKPOINT CONFIGURATION
# ============================================================================
# Interval for milestone checkpoints (every N windows, default: 100)
GRAIL_CHECKPOINT_MILESTONE_INTERVAL=100
# Delta codec format for trainer uploads
# Options: sparse_codec_v2 (default), sparse_codec_v3, sparse_codec_v3.1
GRAIL_DELTA_FORMAT=sparse_codec_v2
# ============================================================================
# DOCKER COMPOSE - VALIDATOR DEPLOYMENT
# ============================================================================
# These variables are consumed by docker/docker-compose.validator.yml. They
# only matter when you run the validator via `docker compose` (the bare-metal
# `grail validate` command ignores them). Operators running on bare metal can
# leave them blank.
# Override the validator image. Defaults to the public ghcr.io tag published
# by .github/workflows/docker-publish.yml. Use a locally-built tag for
# pre-release testing, or a pinned semver/sha256 tag in production:
# GRAIL_VALIDATOR_IMAGE=ghcr.io/one-covenant/grail:v0.5.10
# GRAIL_VALIDATOR_IMAGE=grail:base # local build
GRAIL_VALIDATOR_IMAGE=
# Set to "true" to start the validator with --test-mode (it only validates
# its own files and does NOT publish weights on-chain). Required for local
# test deployments; MUST be unset/false in production.
GRAIL_VALIDATOR_TEST_MODE=false
# Comma-separated list of physical GPU indices to pin the validator to,
# matching docker compose's deploy.resources.reservations.devices.device_ids
# contract. Default "0" assigns the validator to GPU 0 alone (sufficient for
# the validator path when GRAIL_GPU_EVAL=false). Bump to e.g. "0,1" if you
# enable on-GPU kernel evaluation, or "2" to keep the validator off GPUs
# 0+1 when a co-located miner uses them.
GRAIL_VALIDATOR_GPU_IDS=0
# Shared-memory size for the validator container. HF dataloaders and
# multiprocessing IPC need more than the 64 MB Docker default.
GRAIL_VALIDATOR_SHM_SIZE=2g
# ─────────────────────────────────────────────────────────────────────────
# Trainer-side protocol configuration (set on the TRAINER host)
# ─────────────────────────────────────────────────────────────────────────
# These values are baked into every published checkpoint's metadata. The
# miner and validator override their own GRAIL_THINKING_MODE / env_id /
# generation params from the loaded checkpoint at runtime, so the trainer
# is the single source of truth for everything that affects prompt rendering
# or generation. You only need to set these on the TRAINER host; the miner
# and validator pick them up via R2 metadata. They are also passed through
# the validator docker compose so any pre-checkpoint code path (boot
# rendering, parser regex compile) sees the right value.
#
# GRAIL_THINKING_MODE controls the system prompt and chat template:
# "instructed" (DEFAULT) — custom <start_working_out>/</end_working_out>
# for thinking + <SOLUTION>/</SOLUTION> for the
# answer, injected via system prompt + a custom
# ChatML template (mode grail builds itself)
# "native" — model's built-in <think>/</think> tokens
# (e.g. Qwen3) + <SOLUTION> for the answer
# Miner and validator MUST agree, so the trainer publishes this and the
# consumer sides override their local env var from the checkpoint.
GRAIL_THINKING_MODE=instructed
# Environment selection (trainer-side; passed via checkpoint metadata)
GRAIL_ENV_ID=mbpp
GRAIL_ENV_SPLIT=train
# Generation parameters (trainer-side; passed via checkpoint metadata)
# Defaults match grail/trainer/checkpoint_publisher.py::get_default_generation_params
GRAIL_GEN_MAX_TOKENS=2048
GRAIL_GEN_TEMPERATURE=0.7
GRAIL_GEN_TOP_P=0.9
GRAIL_GEN_TOP_K=50
GRAIL_GEN_REPETITION_PENALTY=1.0
# Host storage path for Docker volume mount (Docker Compose only)
# If your machine has a large secondary disk (e.g., NVMe ephemeral storage),
# set this to its mount point so checkpoints and caches are stored there.
# The path is bind-mounted into the container at the same path.
#
# Examples:
# GRAIL_HOST_STORAGE_PATH=/ephemeral
# GRAIL_HOST_STORAGE_PATH=/mnt/data
#
# When set, also update GRAIL_CACHE_DIR below to point inside it:
# GRAIL_CACHE_DIR=/ephemeral/grail_cache
#
# Leave blank to use default (~/grail-storage):
GRAIL_HOST_STORAGE_PATH=
# Local cache directory for checkpoints (default: ~/.cache/grail)
#
# IMPORTANT:
# - If left blank, defaults to ~/.cache/grail (persistent on disk)
# - Empty strings should NOT be used as they can cause path resolution issues
#
# Recommended for faster checkpoint I/O (RAM disk / tmpfs):
# GRAIL_CACHE_DIR=/dev/shm/grail
#
# Example (persistent on disk):
# GRAIL_CACHE_DIR=/home/user/.cache/grail
#
# Leave blank to use default (~/.cache/grail):
GRAIL_CACHE_DIR=
# HuggingFace models cache directory (default: ~/.cache/huggingface)
#
# IMPORTANT:
# - Redirects all HuggingFace model downloads (AutoModel, AutoTokenizer)
# - If left blank, HuggingFace defaults to ~/.cache/huggingface
#
# Recommended for faster model I/O (RAM disk / tmpfs):
# HF_HOME=/dev/shm/huggingface
#
# Example (persistent on disk):
# HF_HOME=/home/user/.cache/huggingface
#
# Leave blank to use default (~/.cache/huggingface):
HF_HOME=
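# Worked example (paths are illustrative): a machine with a large secondary
# disk mounted at /ephemeral, keeping host storage, the grail cache, and the
# HuggingFace cache together:
#   GRAIL_HOST_STORAGE_PATH=/ephemeral
#   GRAIL_CACHE_DIR=/ephemeral/grail_cache
#   HF_HOME=/ephemeral/huggingface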
# ============================================================================
# MONITORING SYSTEM CONFIGURATION
# ============================================================================
# Backend type for monitoring ("wandb" for WandB, "null" to disable)
GRAIL_MONITORING_BACKEND=wandb
# ============================================================================
# WANDB (WEIGHTS & BIASES) CONFIGURATION
# ============================================================================
# WandB API key (create at https://wandb.ai/settings)
# Required for WANDB_MODE="online"
WANDB_API_KEY=
# WandB project name (creates a project in your WandB workspace)
WANDB_PROJECT=grail
# WandB entity/team name (your username or team name, optional)
WANDB_ENTITY=tplr
# WandB mode - controls how data is sent to WandB
# - "online": Send data to WandB cloud in real-time (production)
# - "offline": Store data locally, sync later with: wandb sync
# - "disabled": Disable WandB completely
WANDB_MODE=online
# Tags for organizing runs (comma-separated)
WANDB_TAGS=grail,bittensor,production
# Description/notes for runs
WANDB_NOTES=GRAIL production monitoring
# Resume behavior for interrupted runs
# - "allow": Resume if possible, create new otherwise
# - "must": Must resume existing run (fails if not found)
# - "never": Always create new run
# - "auto": Automatically resume based on run ID
WANDB_RESUME=allow
# WandB cache directories (optional - set if local storage is limited)
# Point to a location with more available space to prevent disk full errors
WANDB_CACHE_DIR=
WANDB_DATA_DIR=
# ============================================================================
# MONITORING PERFORMANCE TUNING
# ============================================================================
# Number of metrics to buffer before flushing to WandB (default: 100)
# Higher values = fewer network calls but more memory usage
GRAIL_METRIC_BUFFER_SIZE=100
# Interval in seconds between automatic metric flushes (default: 30.0)
# Lower values = more real-time data but more network overhead
GRAIL_METRIC_FLUSH_INTERVAL=30.0
# ============================================================================
# NETWORK TIMEOUT AND RETRY CONFIGURATION
# ============================================================================
# Timeout in seconds for Bittensor network calls (default: 15.0)
# Increase if you experience frequent timeouts on slow networks
BT_CALL_TIMEOUT=15.0
# Number of retries for failed Bittensor network calls (default: 3)
# Higher values provide more resilience but may increase latency
BT_CALL_RETRIES=3
# Backoff delay in seconds between retry attempts (default: 5.0)
# Exponential backoff helps avoid overwhelming busy nodes
BT_CALL_BACKOFF=5.0
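# Worked example (the exponential growth factor is an assumption): with
# BT_CALL_BACKOFF=5.0 and BT_CALL_RETRIES=3, retry delays would grow roughly
# as 5s, 10s, 20s before the call is finally reported as failed.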
# ============================================================================
# OBSERVABILITY (Grafana + Loki via Promtail) [Optional]
# ============================================================================
# IMPORTANT: This section is completely optional. Only configure if you want
# centralized logging and monitoring with Grafana + Loki.
#
# Use Promtail to ship logs to Loki for better reliability
# and performance. The app writes logs locally; Promtail tails and forwards them.
#
# SETUP REQUIREMENTS:
# 1. Deploy Grafana + Loki on a separate server using:
# docker-compose --env-file .env -f docker/compose.grafana.yaml up -d
# 2. Configure PROMTAIL_LOKI_URL to point to your Loki instance
# 3. Set GRAIL_ENV and ensure your wallet/network vars are configured
# 4. Set GRAIL_LOG_FILE to enable file logging for Promtail to tail
#
# Promtail will automatically use labels from your environment:
# - env (from GRAIL_ENV), service=grail, network (from BT_NETWORK)
# - netuid (from NETUID), wallet (from BT_WALLET_COLD), hotkey (from BT_WALLET_HOT)
#
# If you don't want observability, set PROMTAIL_ENABLE=false
# Enable Promtail-based log shipping (required for Grafana logging)
# For docker-compose.validator.yml:
# - PROMTAIL_ENABLE=false (default): Promtail will NOT start
# - To enable: docker-compose --profile promtail -f docker/docker-compose.validator.yml up -d
# - Or set: export COMPOSE_PROFILES=promtail
PROMTAIL_ENABLE=false
# Loki push endpoint URL for Promtail (replace with your Grafana/Loki server)
# Example: http://your-grafana-server.com:3100/loki/api/v1/push
PROMTAIL_LOKI_URL=http://loki:3100/loki/api/v1/push
# Job name for Promtail scraping
PROMTAIL_JOB=grail
# App log file path (required for Promtail tailing)
# Promtail will tail this file and ship logs to Loki
GRAIL_LOG_FILE=/var/log/grail/grail.log
# Optional: log rotation for file logging (used by RotatingFileHandler)
# Accepts bytes or units: KB, MB, GB (e.g., 100MB)
GRAIL_LOG_MAX_SIZE=100MB
# Number of rotated log files to keep
GRAIL_LOG_BACKUP_COUNT=5
# Environment label for promtail logs (dev, staging, prod, etc.)
GRAIL_ENV=prod
# Grafana server root URL (used when deploying Grafana with compose.grafana.yaml)
# This should match the public URL where your Grafana instance will be accessible
# Default: http://localhost:3000 (if not set)
# GF_SERVER_ROOT_URL=http://your-grafana-server.com:3000