
Binary Scale Issue with SIMBA Optimizer / MLflow Evaluation #10

@vivian-xie-db

Description


The Problem
When using a binary rubric (expecting 0 or 1 / PASS or FAIL), MLflow returns 3.0 (a float in the Likert 1-5 range) instead of a binary value. All 10 evaluations are rejected because 3.0 is not a valid rating for a binary judge.

  1. Binary rubric detected → feedback_type = bool
  2. make_judge() called with feedback_value_type=bool
  3. MLflow evaluate() runs
  4. Model returns 3.0 (Likert-style) instead of 0/1
  5. No reasoning column available to parse text
  6. Validation rejects 3.0 as invalid binary value
  7. All evaluations rejected → 0/10 valid results
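The rejection in steps 6-7 can be sketched as a small validation helper (hypothetical name `coerce_binary_rating`; it only assumes, as described above, that PASS/FAIL map to 1/0 and anything outside {0, 1} is rejected):

```python
from typing import Any, Optional

def coerce_binary_rating(value: Any) -> Optional[int]:
    """Coerce a judge response to 0/1, or return None to reject it.

    PASS/FAIL-style strings map to 1/0, exact 0/1 values (bool, int,
    or float) pass through, and anything else -- e.g. the Likert-style
    3.0 MLflow returns here -- is rejected.
    """
    if isinstance(value, str):
        normalized = value.strip().upper()
        if normalized in ("PASS", "YES", "TRUE", "1"):
            return 1
        if normalized in ("FAIL", "NO", "FALSE", "0"):
            return 0
        return None
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return int(value)
    if isinstance(value, (int, float)) and value in (0, 1):
        return int(value)
    return None

# The 3.0 seen in the logs is rejected, so the trace yields no score:
assert coerce_binary_rating(3.0) is None
assert coerce_binary_rating("PASS") == 1
assert coerce_binary_rating(False) == 0
```

With no reasoning/text column available to fall back on (step 5), every trace whose value fails this check is dropped, which is how all 10 evaluations end up rejected.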

Relevant code pieces:

    if judge_type == 'binary':
        # For binary judges, use MLflow's default SIMBAAlignmentOptimizer
        # This is optimized for Pass/Fail classification
        yield "Creating Binary SIMBA optimizer (using MLflow default)..."

        try:
            from mlflow.genai.judges.optimizers import SIMBAAlignmentOptimizer

            optimizer = SIMBAAlignmentOptimizer(
                model=optimizer_model_uri,
            )
            yield f"Binary optimizer created with model={optimizer_model_uri}"
            yield "Using MLflow's default SIMBA for binary Pass/Fail optimization"
        except ImportError as e:
            error_msg = f"MLflow SIMBA optimizer not available: {e}"
            yield f"ERROR: {error_msg}"
            yield {"error": error_msg, "success": False}
            return
    else:
        # For Likert scale judges (default), use custom LikertSIMBAAlignmentOptimizer
        # This uses a custom agreement metric for the 1-5 scale
        yield "Creating Likert SIMBA optimizer..."

        optimizer = LikertSIMBAAlignmentOptimizer(
            model=optimizer_model_uri,
            batch_size=6,
            max_demos=0,
            verbose=True,
        )
        yield f"Likert optimizer created with model={optimizer_model_uri}, batch_size=6"
        yield "Using custom Likert agreement metric for 1-5 scale optimization"

    yield f"Running alignment with {len(mlflow_traces)} traces... (this may take 20+ minutes)"

    # Run alignment in a background thread so we can yield logs periodically
    aligned_judge_container: Dict[str, Any] = {}
    alignment_error: Optional[Exception] = None
    last_status_emit = time.time()

    def _alignment_worker():
        nonlocal alignment_error
        try:
            aligned_judge_container["judge"] = judge.align(mlflow_traces, optimizer)
        except Exception as exc:
            alignment_error = exc
            logger.exception("Alignment failed: %s", exc)
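The worker-thread pattern above normally pairs with a polling loop in the main generator that joins the thread in short intervals and emits heartbeat messages. A minimal self-contained sketch of that loop (hypothetical names, with a stub in place of `judge.align(mlflow_traces, optimizer)`):

```python
import threading
import time

def run_with_heartbeat(align_fn, interval=0.05, timeout=5.0):
    """Run align_fn in a background thread, yielding status strings
    periodically until it finishes; re-raise any worker exception."""
    result = {}
    errors = []

    def _worker():
        try:
            result["judge"] = align_fn()
        except Exception as exc:  # captured here, re-raised on the main side
            errors.append(exc)

    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    deadline = time.time() + timeout
    while t.is_alive() and time.time() < deadline:
        yield "Alignment still running..."
        t.join(interval)  # short join doubles as the heartbeat interval
    t.join()
    if errors:
        raise errors[0]
    yield f"Alignment finished: {result.get('judge')!r}"

# Usage with a stub alignment function:
messages = list(run_with_heartbeat(lambda: "aligned-judge"))
assert messages[-1] == "Alignment finished: 'aligned-judge'"
```

This keeps the generator responsive during a 20+ minute `align()` call while still surfacing the worker's exception to the caller instead of swallowing it.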

Judge tuning mlflow evaluate logs:

Starting MLflow evaluation job...
Evaluation job started (ID: 7350916f...)
Evaluation job started
Initializing evaluation service...
Starting evaluation for judge: helpful_judge
Prepared 13 traces for evaluation
search_traces returned 13 tagged rows; evaluating 13 traces
Using MLflow experiment ID: 1880719540071822
Created evaluation DataFrame with 13 rows via search_traces
Using evaluation model: databricks:/databricks-gpt-5-1
Detected binary rubric - creating judge with feedback_value_type=float (expecting 0 or 1)
Created judge: helpful_judge
Preparing inputs/outputs columns from MLflow trace data...
WARNING: Missing inputs for 0 traces, missing outputs for 3 traces
Filtered out 3 traces with missing inputs/outputs
Running mlflow.genai.evaluate()...
Evaluation complete. Processing results...
Available columns in result_df: ['trace_id', 'user_satisfaction/value', 'helpful_judge/value', 'accuracy/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_1/value', 'helpful/value', 'safety/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_0/value', 'context_suffiency_judge/value', 'intent_recognition_judge/value', 'trace', 'client_request_id', 'state', 'request_time', 'execution_duration', 'request', 'response', 'trace_metadata', 'tags', 'spans', 'assessments']
Looking for column 'helpful_judge/value': found
WARNING: No reasoning/explanation column found. Available columns: ['trace_id', 'user_satisfaction/value', 'helpful_judge/value', 'accuracy/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_1/value', 'helpful/value', 'safety/value', '4d2d5770-c18a-4462-96f2-1c924dd4980e_0/value', 'context_suffiency_judge/value', 'intent_recognition_judge/value', 'trace', 'client_request_id', 'state', 'request_time', 'execution_duration', 'request', 'response', 'trace_metadata', 'tags', 'spans', 'assessments']
🔍 Judge type detection: judge_type='binary', is_binary=True
Detected binary rubric - will convert PASS/FAIL to 1/0 and reject any values not 0 or 1
🔍 Raw MLflow response for trace tr-4138d...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-4138d...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-4138d... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-6ec66...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-6ec66...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-6ec66... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-c7aff...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-c7aff...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-c7aff... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-e3e24...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-e3e24...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-e3e24... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-95267...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-95267...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-95267... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-35854...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-35854...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-35854... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-2f52e...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-2f52e...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-2f52e... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-de442...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-de442...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-de442... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-a4b06...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-a4b06...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-a4b06... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
🔍 Raw MLflow response for trace tr-0fb5f...: type=<class 'float'>, value=3.0
⚠️ No raw text response available for trace tr-0fb5f...
ERROR: Invalid binary rating 3.0 (type: float) for trace tr-0fb5f... - must be 0 or 1, rejecting evaluation. MLflow incorrectly parsed as: 3.0. No raw text response available to parse. MLflow's feedback_value_type=bool is not working correctly - it's returning float instead of bool.
WARNING: Missing evaluation scores for 3 traces.
Extracted 0/10 evaluations with scores (null predictions: 0)
Computing metrics for judge type: binary
Evaluation results prepared for 10 traces
Saved 10 trace evaluations to database
Saved evaluation results for Judge Prompt (id=8a9c5225-46f9-4483-8296-0f98977989f6)
Evaluation completed successfully

