Unified ML training infrastructure for Elixir/BEAM
CrucibleTrain provides a complete, platform-agnostic training infrastructure for ML workloads on the BEAM. It includes:
- Renderers: Message-to-token transformation for all major model families (Llama3, Qwen3, DeepSeek, etc.)
- Training Loops: Supervised learning, RL, DPO, and distillation
- Type System: Unified Datum, ModelInput, and related types
- Ports & Adapters: Pluggable backends for any training platform
- Logging: Multiplexed ML logging (JSON, console, custom backends)
- Crucible Integration: Stage implementations for pipeline composition
Add to your `mix.exs`:

```elixir
def deps do
  [
    {:crucible_train, "~> 0.3.0"}
  ]
end
```

A minimal supervised training run looks like this:

```elixir
alias CrucibleTrain.Supervised.{Train, Config}
alias CrucibleTrain.Renderers
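# Renderer handles message-to-token transformation for this model family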
renderer = Renderers.get_renderer("meta-llama/Llama-3.1-8B")
config = %Config{
  training_client: my_client,
  train_dataset: my_dataset,
  learning_rate: 1.0e-4,
  num_epochs: 3
}
{:ok, result} = Train.main(config)
```
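Renderers pair a model family with its chat template. As a rough sketch of the idea (the `render/2` function name here is hypothetical; check the `CrucibleTrain.Renderers` module docs for the actual entry point), turning chat messages into model input might look like:

```elixir
# Hypothetical call: the actual function name and arity may differ.
messages = [
  %{role: "system", content: "You are a helpful assistant."},
  %{role: "user", content: "Hello!"}
]
model_input = Renderers.render(renderer, messages)
```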
This package provides Crucible stages for ML training workflows:

| Stage | Name | Description |
|---|---|---|
| `SupervisedTrain` | `:supervised_train` | Standard supervised learning with configurable optimizer/loss |
| `DPOTrain` | `:dpo_train` | Direct Preference Optimization with beta parameter |
| `RLTrain` | `:rl_train` | Reinforcement Learning (PPO, DQN, A2C, REINFORCE) |
| `Distillation` | `:distillation` | Knowledge Distillation with temperature/alpha |
All stages implement the `Crucible.Stage` behaviour with full `describe/1` schemas for introspection.
```elixir
# View stage schema
schema = CrucibleTrain.Stages.SupervisedTrain.describe(%{})
# => %{
#      name: :supervised_train,
#      description: "Runs supervised learning training...",
#      required: [],
#      optional: [:epochs, :batch_size, :learning_rate, :optimizer, :loss_fn, :metrics],
#      types: %{epochs: :integer, batch_size: :integer, ...}
#    }
```

Use in Crucible pipelines:

```elixir
alias CrucibleIR.StageDef
stages = [
  %StageDef{name: :supervised_train, options: %{epochs: 3, batch_size: 32}}
]
```
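Stage options follow each stage's `describe/1` schema, so multiple stages can be composed in one pipeline. An illustrative sketch (the `beta` value is a placeholder, not a recommendation):

```elixir
# Illustrative: option keys come from each stage's describe/1 schema.
stages = [
  %StageDef{name: :supervised_train, options: %{epochs: 3, batch_size: 32}},
  %StageDef{name: :dpo_train, options: %{beta: 0.1}}  # placeholder beta
]
```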
CrucibleTrain supports multiple logging backends for experiment tracking:

```elixir
alias CrucibleTrain.Logging
# Local JSONL logging
{:ok, logger} = Logging.create_logger(:json, log_dir: "./logs")
# Console table output
{:ok, logger} = Logging.create_logger(:pretty)
# Log metrics and hyperparameters
Logging.log_hparams(logger, %{learning_rate: 1.0e-4})
Logging.log_metrics(logger, step, %{loss: 0.5, accuracy: 0.9})
Logging.close(logger)
```
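The multiplexed logging mentioned in the features list fans the same events out to several backends. Below is a minimal fan-out sketch using only the calls shown above; the dedicated multiplex backend demonstrated in examples/multiplex_logger_example.exs may offer a more direct API:

```elixir
# Fan the same events out to a JSONL file and the console.
{:ok, json_logger} = Logging.create_logger(:json, log_dir: "./logs")
{:ok, pretty_logger} = Logging.create_logger(:pretty)
loggers = [json_logger, pretty_logger]

Enum.each(loggers, &Logging.log_hparams(&1, %{learning_rate: 1.0e-4}))
Enum.each(loggers, &Logging.log_metrics(&1, step, %{loss: 0.5}))
Enum.each(loggers, &Logging.close/1)
```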
Full integration with Weights & Biases for experiment tracking:

```elixir
# Setup: export WANDB_API_KEY="your-api-key"
{:ok, logger} = Logging.create_logger(:wandb,
  api_key: System.get_env("WANDB_API_KEY"),
  project: "my-project",
  entity: "my-team",        # optional
  run_name: "experiment-1"  # optional
)
# Get run URL
url = Logging.get_url(logger)
# => "https://wandb.ai/my-team/my-project/runs/experiment-1"
# Log hyperparameters (nested maps supported)
Logging.log_hparams(logger, %{
  model: "llama-3.1-8b",
  optimizer: %{name: "adamw", lr: 1.0e-4}
})
# Log training metrics
Logging.log_metrics(logger, step, %{loss: 0.5, accuracy: 0.9})
# Log long-form text (summaries, notes)
Logging.log_long_text(logger, "eval_notes", "Model performed well on...")
Logging.close(logger)
```

Rate Limiting: W&B's free tier has strict rate limits (~60 requests/minute per run). Rate limiting is enabled by default with conservative settings:
- 500ms minimum interval between requests
- Automatic retry with exponential backoff on 429 errors
```elixir
# Custom rate limit settings
{:ok, logger} = Logging.create_logger(:wandb,
  project: "my-project",
  rate_limit: [min_interval_ms: 1000, max_retries: 5]
)
# Disable rate limiting (not recommended for free tier)
{:ok, logger} = Logging.create_logger(:wandb,
  project: "my-project",
  rate_limit: false
)
```
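On a 429, retries back off exponentially. As an illustration of the assumed doubling scheme behind the `base_backoff_ms`/`max_backoff_ms` options (see the option table below; this is not the library's exact code):

```elixir
# Illustrative exponential backoff with a cap (assumed scheme, not
# necessarily CrucibleTrain's exact implementation).
delay_ms = fn attempt, base, cap -> min(base * Integer.pow(2, attempt), cap) end

delay_ms.(0, 1_000, 30_000)  # => 1000
delay_ms.(1, 1_000, 30_000)  # => 2000
delay_ms.(5, 1_000, 30_000)  # => 30000 (32000 capped to the max)
```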
Full integration with Neptune.ai for experiment tracking:

```elixir
# Setup:
# export NEPTUNE_API_TOKEN="your-api-token"
# export NEPTUNE_PROJECT="workspace/project-name"
{:ok, logger} = Logging.create_logger(:neptune,
  api_token: System.get_env("NEPTUNE_API_TOKEN"),
  project: System.get_env("NEPTUNE_PROJECT")
)
# Get run URL
url = Logging.get_url(logger)
# => "https://app.neptune.ai/workspace/project-name/e/RUN-1"
# Same logging API as other backends
Logging.log_hparams(logger, %{model: "deepseek-v3", batch_size: 64})
Logging.log_metrics(logger, step, %{loss: 0.3, grad_norm: 1.2})
Logging.close(logger)
```

Rate Limiting: Enabled by default with a 200 ms minimum interval; configure it the same way as for W&B.
Both W&B and Neptune loggers support the following rate limit options:
| Option | Default (W&B) | Default (Neptune) | Description |
|---|---|---|---|
| `min_interval_ms` | 500 | 200 | Minimum ms between requests |
| `max_retries` | 3 | 3 | Retry attempts on 429 |
| `base_backoff_ms` | 1000 | 1000 | Initial backoff duration |
| `max_backoff_ms` | 30000 | 30000 | Maximum backoff cap |
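Putting all four options together in one configuration (the values here are illustrative, not recommendations):

```elixir
# All four rate-limit knobs in one keyword list.
{:ok, logger} = Logging.create_logger(:wandb,
  project: "my-project",
  rate_limit: [
    min_interval_ms: 750,
    max_retries: 5,
    base_backoff_ms: 2_000,
    max_backoff_ms: 60_000
  ]
)
```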
Pluggable scoring system for model evaluation:
```elixir
alias CrucibleTrain.Eval.{Scoring, BatchRunner}
# Score individual outputs
Scoring.score(:exact_match, "Paris", "Paris") # => 1.0
Scoring.score(:contains, "The answer is 42", "42") # => 1.0
# Streaming batch evaluation; `config` holds your evaluation settings
# (see examples/batch_runner_example.exs for a full setup)
results =
  samples
  |> BatchRunner.stream_evaluate(config, chunk_size: 25)
  |> Enum.to_list()
metrics = BatchRunner.aggregate_metrics(results)
# => %{mean_score: 0.85, total: 100, correct: 85}
```
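For quick spot checks without the batch runner, the `Scoring.score/3` calls shown above compose with plain `Enum` pipelines. A minimal sketch:

```elixir
# Score prediction/reference pairs directly and average by hand.
pairs = [{"Paris", "Paris"}, {"Rome", "Milan"}]

mean_score =
  pairs
  |> Enum.map(fn {prediction, reference} ->
    Scoring.score(:exact_match, prediction, reference)
  end)
  |> then(fn scores -> Enum.sum(scores) / length(scores) end)
# => 0.5
```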
Flexible LR schedules with warmup support:

```elixir
alias CrucibleTrain.Supervised.Config
# Cosine annealing with warmup
config = %Config{
  learning_rate: 1.0e-4,
  lr_schedule: {:warmup, 100, :cosine}
}
# Available schedules: :constant, :linear, :cosine
# Warmup: {:warmup, warmup_steps, base_schedule}
```
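Conceptually, `{:warmup, 100, :cosine}` ramps the LR linearly up to its base value over the first 100 steps, then follows cosine decay. The sketch below is illustrative math only; the `total_steps` horizon is assumed, and CrucibleTrain's actual implementation may differ:

```elixir
# Illustrative schedule math, not the library's implementation.
defmodule LRSketch do
  # Linear ramp from 0 to base_lr during the warmup window...
  def lr(step, base_lr, {:warmup, warmup, _}) when step < warmup,
    do: base_lr * step / warmup

  # ...then hand off to the base schedule, re-zeroed at the boundary.
  def lr(step, base_lr, {:warmup, warmup, sched}),
    do: lr(step - warmup, base_lr, sched)

  def lr(_step, base_lr, :constant), do: base_lr

  # Cosine decay to zero over an assumed horizon of 1_000 steps.
  def lr(step, base_lr, :cosine) do
    total_steps = 1_000
    progress = min(step, total_steps) / total_steps
    base_lr * 0.5 * (1 + :math.cos(:math.pi() * progress))
  end
end

LRSketch.lr(50, 1.0e-4, {:warmup, 100, :cosine})   # => 5.0e-5 (mid-warmup)
LRSketch.lr(600, 1.0e-4, {:warmup, 100, :cosine})  # => 5.0e-5 (cosine phase)
```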
See the examples/ directory for runnable demos:

```bash
# Run all local examples
./examples/run_all.sh
# Run individual examples
mix run --no-start examples/json_logger_example.exs
mix run --no-start examples/wandb_logger_example.exs
mix run --no-start examples/scoring_example.exs
```

| Example | Description |
|---|---|
| `json_logger_example.exs` | Local JSONL logging |
| `pretty_print_logger_example.exs` | Console table output |
| `multiplex_logger_example.exs` | Multiple backends |
| `wandb_logger_example.exs` | Weights & Biases |
| `neptune_logger_example.exs` | Neptune.ai |
| `scoring_example.exs` | Evaluation scoring |
| `batch_runner_example.exs` | Batch evaluation |
| `lr_scheduling_example.exs` | LR schedules |
See examples/README.md for setup instructions for cloud services.
Full documentation available at HexDocs.
MIT License - see LICENSE for details.