if judge_type == 'binary':
    # For binary judges, use MLflow's default SIMBAAlignmentOptimizer.
    # This is optimized for Pass/Fail classification.
    yield "Creating Binary SIMBA optimizer (using MLflow default)..."
    try:
        from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

        optimizer = SIMBAAlignmentOptimizer(
            model=optimizer_model_uri,
        )
        yield f"Binary optimizer created with model={optimizer_model_uri}"
        yield "Using MLflow's default SIMBA for binary Pass/Fail optimization"
    except ImportError as e:
        error_msg = f"MLflow SIMBA optimizer not available: {e}"
        yield f"ERROR: {error_msg}"
        yield {"error": error_msg, "success": False}
        return
else:
    # For Likert-scale judges (the default), use the custom
    # LikertSIMBAAlignmentOptimizer, which applies a custom agreement
    # metric for the 1-5 scale.
    yield "Creating Likert SIMBA optimizer..."
    optimizer = LikertSIMBAAlignmentOptimizer(
        model=optimizer_model_uri,
        batch_size=6,
        max_demos=0,
        verbose=True,
    )
    yield f"Likert optimizer created with model={optimizer_model_uri}, batch_size=6"
    yield "Using custom Likert agreement metric for 1-5 scale optimization"

yield f"Running alignment with {len(mlflow_traces)} traces... (this may take 20+ minutes)"

# Run alignment in a background thread so we can yield log lines periodically.
aligned_judge_container: Dict[str, Any] = {}
alignment_error: Optional[Exception] = None
last_status_emit = time.time()

def _alignment_worker():
    nonlocal alignment_error
    try:
        aligned_judge_container["judge"] = judge.align(mlflow_traces, optimizer)
    except Exception as exc:
        alignment_error = exc
        logger.exception("Alignment failed: %s", exc)
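The loop that drives this worker thread and emits the periodic status lines is not shown above. A minimal, self-contained sketch of that pattern (the `run_with_status` helper, the 30-second heartbeat interval, and the toy worker are all illustrative, not part of the original code):

```python
import threading
import time


def run_with_status(worker, container, interval=30.0, poll=1.0):
    """Run `worker` in a daemon thread, yielding a heartbeat message
    every `interval` seconds until the thread finishes."""
    thread = threading.Thread(target=worker, daemon=True)
    start = time.time()
    last_emit = start
    thread.start()
    while thread.is_alive():
        thread.join(timeout=poll)  # wake up periodically to emit status
        now = time.time()
        if now - last_emit >= interval:
            yield f"Alignment still running... ({int(now - start)}s elapsed)"
            last_emit = now
    yield f"Worker finished after {time.time() - start:.1f}s; result keys: {list(container)}"


# Usage: a toy worker standing in for the long-running judge.align() call.
results = {}

def _worker():
    time.sleep(0.2)
    results["judge"] = "aligned-judge"

for msg in run_with_status(_worker, results, interval=10.0, poll=0.05):
    print(msg)
```

Using `thread.join(timeout=...)` rather than a bare `time.sleep` means the generator wakes up as soon as the worker finishes, instead of sleeping out the full poll interval.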
Starting MLflow evaluation job...
Evaluation job started (ID: 7350916f...)
Evaluation job started
Initializing evaluation service...
Starting evaluation for judge: helpful_judge
Prepared 13 traces for evaluation
search_traces returned 13 tagged rows; evaluating 13 traces
Using MLflow experiment ID: 1880719540071822
Created evaluation DataFrame with 13 rows via search_traces
Using evaluation model: databricks:/databricks-gpt-5-1
Detected binary rubric - creating judge with feedback_value_type=float (expecting 0 or 1)
Created judge: helpful_judge
Preparing inputs/outputs columns from MLflow trace data...
WARNING: Missing inputs for 0 traces, missing outputs for 3 traces
Filtered out 3 traces with missing inputs/outputs
Running mlflow.genai.evaluate()...
Evaluation complete. Processing results...
Available columns in result_df: ['trace_id', 'user_satisfaction/value', 'helpful_judge/value', 'accuracy/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_1/value', 'helpful/value', 'safety/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_0/value', 'context_suffiency_judge/value', 'intent_recognition_judge/value', 'trace', 'client_request_id', 'state', 'request_time', 'execution_duration', 'request', 'response', 'trace_metadata', 'tags', 'spans', 'assessments']
Looking for column 'helpful_judge/value': found
WARNING: No reasoning/explanation column found. Available columns: ['trace_id', 'user_satisfaction/value', 'helpful_judge/value', 'accuracy/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_1/value', 'helpful/value', 'safety/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_0/value', 'context_suffiency_judge/value', 'intent_recognition_judge/value', 'trace', 'client_request_id', 'state', 'request_time', 'execution_duration', 'request', 'response', 'trace_metadata', 'tags', 'spans', 'assessments']
🔍 Judge type detection: judge_type='binary', is_binary=True
Detected binary rubric - will convert PASS/FAIL to 1/0 and reject any values not 0 or 1
🔍 Raw MLflow response for trace tr-4138d...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-4138d...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-4138d... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
(The same 🔍 / ⚠️ / ERROR triple repeats for the remaining nine traces: tr-6ec66..., tr-c7aff..., tr-e3e24..., tr-95267..., tr-35854..., tr-2f52e..., tr-de442..., tr-a4b06..., tr-0fb5f... — each with type=<class 'float'>, value=3.0, and each rejected.)
WARNING: Missing evaluation scores for 3 traces.
Extracted 0/10 evaluations with scores (null predictions: 0)
Computing metrics for judge type: binary
Evaluation results prepared for 10 traces
Saved 10 trace evaluations to database
Saved evaluation results for Judge Prompt (id=8a9c5225-46f9-4483-8296-0f98977989f6)
Evaluation completed successfully
The Problem
When using a binary rubric (expecting 0 or 1, i.e. PASS or FAIL), MLflow returns 3.0, a float in the Likert-scale range, instead of a binary value. All 10 evaluations are rejected because 3.0 is invalid for a binary judge. Note that the logs also disagree about the configured type: the judge is created with feedback_value_type=float, while the rejection message complains that feedback_value_type=bool is not working.
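As a stopgap while the upstream behavior is unresolved, the raw feedback value can be normalized defensively before the binary check, so that PASS/FAIL strings are mapped to 1/0 and anything else (such as the spurious 3.0 seen in the logs) is rejected explicitly. This is a hypothetical helper, not part of MLflow's API:

```python
from typing import Any, Optional


def normalize_binary_feedback(value: Any) -> Optional[int]:
    """Coerce a judge feedback value to 0/1, or return None if it is
    not recognizably binary (e.g. the spurious 3.0 in the logs)."""
    if isinstance(value, bool):  # check bool before int/float: bool is an int subclass
        return int(value)
    if isinstance(value, (int, float)):
        # Accept only exact 0/1 (and their float forms); reject 3.0 etc.
        return int(value) if value in (0, 1) else None
    if isinstance(value, str):
        mapping = {"pass": 1, "true": 1, "yes": 1, "1": 1,
                   "fail": 0, "false": 0, "no": 0, "0": 0}
        return mapping.get(value.strip().lower())
    return None


print(normalize_binary_feedback("PASS"))  # 1
print(normalize_binary_feedback(3.0))     # None -> reject the evaluation
print(normalize_binary_feedback(True))    # 1
```

Returning `None` rather than raising keeps the caller's existing "reject and log" path intact: a `None` result maps directly onto the "rejecting evaluation" branch in the logs.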
Relevant code pieces:
Judge tuning mlflow evaluate logs: