
Binary Scale Issue with SIMBA Optimizer / MLflow Evaluation #10

@vivian-xie-db

Description


The Problem
When using a binary rubric (expecting 0 or 1 / PASS or FAIL), MLflow returns 3.0 (a float in the Likert 1-5 range) instead of a binary value. All 10 evaluations are rejected because 3.0 is not a valid rating for a binary judge.

  1. Binary rubric detected → feedback_type = bool
  2. make_judge() called with feedback_value_type=bool
  3. MLflow evaluate() runs
  4. Model returns 3.0 (Likert-style) instead of 0/1
  5. No reasoning column available to parse text
  6. Validation rejects 3.0 as invalid binary value
  7. All evaluations rejected → 0/10 valid results
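The rejection in steps 6-7 can be sketched as a small validation helper (hypothetical name `coerce_binary_rating`; it only assumes, as described above, that PASS/FAIL map to 1/0 and anything outside {0, 1} is rejected):

```python
from typing import Any, Optional

def coerce_binary_rating(value: Any) -> Optional[int]:
    """Coerce a judge response to 0/1, or return None to reject it.

    PASS/FAIL-style strings map to 1/0, exact 0/1 values (bool, int,
    or float) pass through, and anything else -- e.g. the Likert-style
    3.0 MLflow returns here -- is rejected.
    """
    if isinstance(value, str):
        normalized = value.strip().upper()
        if normalized in ("PASS", "YES", "TRUE", "1"):
            return 1
        if normalized in ("FAIL", "NO", "FALSE", "0"):
            return 0
        return None
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return int(value)
    if isinstance(value, (int, float)) and value in (0, 1):
        return int(value)
    return None

# The 3.0 seen in the logs is rejected, so the trace yields no score:
assert coerce_binary_rating(3.0) is None
assert coerce_binary_rating("PASS") == 1
assert coerce_binary_rating(False) == 0
```

With no reasoning/text column available to fall back on (step 5), every trace whose value fails this check is dropped, which is how all 10 evaluations end up rejected.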

Relevant code pieces:

    if judge_type == 'binary':
        # For binary judges, use MLflow's default SIMBAAlignmentOptimizer
        # This is optimized for Pass/Fail classification
        yield "Creating Binary SIMBA optimizer (using MLflow default)..."

        try:
            from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

            optimizer = SIMBAAlignmentOptimizer(
                model=optimizer_model_uri,
            )
            yield f"Binary optimizer created with model={optimizer_model_uri}"
            yield "Using MLflow's default SIMBA for binary Pass/Fail optimization"
        except ImportError as e:
            error_msg = f"MLflow SIMBA optimizer not available: {e}"
            yield f"ERROR: {error_msg}"
            yield {"error": error_msg, "success": False}
            return
    else:
        # For Likert scale judges (default), use custom LikertSIMBAAlignmentOptimizer
        # This uses a custom agreement metric for the 1-5 scale
        yield "Creating Likert SIMBA optimizer..."

        optimizer = LikertSIMBAAlignmentOptimizer(
            model=optimizer_model_uri,
            batch_size=6,
            max_demos=0,
            verbose=True,
        )
        yield f"Likert optimizer created with model={optimizer_model_uri}, batch_size=6"
        yield "Using custom Likert agreement metric for 1-5 scale optimization"

    yield f"Running alignment with {len(mlflow_traces)} traces... (this may take 20+ minutes)"

    # Run alignment in a background thread so we can yield logs periodically
    aligned_judge_container: Dict[str, Any] = {}
    alignment_error: Optional[Exception] = None
    last_status_emit = time.time()

    def _alignment_worker():
        nonlocal alignment_error
        try:
            aligned_judge_container["judge"] = judge.align(mlflow_traces, optimizer)
        except Exception as exc:
            alignment_error = exc
            logger.exception("Alignment failed: %s", exc)
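The worker-thread pattern above normally pairs with a polling loop in the main generator that joins the thread in short intervals and emits heartbeat messages. A minimal self-contained sketch of that loop (hypothetical names, with a stub in place of `judge.align(mlflow_traces, optimizer)`):

```python
import threading
import time

def run_with_heartbeat(align_fn, interval=0.05, timeout=5.0):
    """Run align_fn in a background thread, yielding status strings
    periodically until it finishes; re-raise any worker exception."""
    result = {}
    errors = []

    def _worker():
        try:
            result["judge"] = align_fn()
        except Exception as exc:  # captured here, re-raised on the main side
            errors.append(exc)

    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    deadline = time.time() + timeout
    while t.is_alive() and time.time() < deadline:
        yield "Alignment still running..."
        t.join(interval)  # short join doubles as the heartbeat interval
    t.join()
    if errors:
        raise errors[0]
    yield f"Alignment finished: {result.get('judge')!r}"

# Usage with a stub alignment function:
messages = list(run_with_heartbeat(lambda: "aligned-judge"))
assert messages[-1] == "Alignment finished: 'aligned-judge'"
```

This keeps the generator responsive during a 20+ minute `align()` call while still surfacing the worker's exception to the caller instead of swallowing it.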

Judge tuning mlflow evaluate logs:

Starting MLflow evaluation job...
Evaluation job started (ID: 7350916f...)
Evaluation job started
Initializing evaluation service...
Starting evaluation for judge: helpful_judge
Prepared 13 traces for evaluation
search_traces returned 13 tagged rows; evaluating 13 traces
Using MLflow experiment ID: 1880719540071822
Created evaluation DataFrame with 13 rows via search_traces
Using evaluation model: databricks:/databricks-gpt-5-1
Detected binary rubric - creating judge with feedback_value_type=float (expecting 0 or 1)
Created judge: helpful_judge
Preparing inputs/outputs columns from MLflow trace data...
WARNING: Missing inputs for 0 traces, missing outputs for 3 traces
Filtered out 3 traces with missing inputs/outputs
Running mlflow.genai.evaluate()...
Evaluation complete. Processing results...
Available columns in result_df: ['trace_id', 'user_satisfaction/value', 'helpful_judge/value', 'accuracy/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_1/value', 'helpful/value', 'safety/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_0/value', 'context_suffiency_judge/value', 'intent_recognition_judge/value', 'trace', 'client_request_id', 'state', 'request_time', 'execution_duration', 'request', 'response', 'trace_metadata', 'tags', 'spans', 'assessments']
Looking for column 'helpful_judge/value': found
WARNING: No reasoning/explanation column found. Available columns: ['trace_id', 'user_satisfaction/value', 'helpful_judge/value', 'accuracy/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_1/value', 'helpful/value', 'safety/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_0/value', 'context_suffiency_judge/value', 'intent_recognition_judge/value', 'trace', 'client_request_id', 'state', 'request_time', 'execution_duration', 'request', 'response', 'trace_metadata', 'tags', 'spans', 'assessments']
🔍 Judge type detection: judge_type='binary', is_binary=True
Detected binary rubric - will convert PASS/FAIL to 1/0 and reject any values not 0 or 1
🔍 Raw MLflow response for trace tr-4138d...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-4138d...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-4138d... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-6ec66...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-6ec66...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-6ec66... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-c7aff...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-c7aff...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-c7aff... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-e3e24...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-e3e24...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-e3e24... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-95267...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-95267...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-95267... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-35854...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-35854...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-35854... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-2f52e...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-2f52e...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-2f52e... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-de442...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-de442...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-de442... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-a4b06...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-a4b06...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-a4b06... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-0fb5f...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-0fb5f...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-0fb5f... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
WARNING: Missing evaluation scores for 3 traces.
Extracted 0/10 evaluations with scores (null predictions: 0)
Computing metrics for judge type: binary
Evaluation results prepared for 10 traces
Saved 10 trace evaluations to database
Saved evaluation results for Judge Prompt (id=8a9c5225-46f9-4483-8296-0f98977989f6)
Evaluation completed successfully

