Bug: Candidate selection ignores score; retries reuse same seed (identical audio)

# Bug: Candidate selection ignores score; retries reuse same seed (identical audio)

## Summary

Two issues in `Chatter.py` lead to sub‑optimal selection and ineffective retries:

* After Whisper validation, the final pick is the candidate with the **shortest duration**; the **similarity score** is discarded.
* Retry rounds regenerate the **same audio** because seed derivation doesn’t include the retry round: `attempt` stays 0 and `this_seed` is reused, so `derive_seed` receives identical inputs on every round.

## Impact

* High‑quality, slightly longer candidates lose to shorter, worse ones.
* Retries don’t explore new samples; repeated identical outputs waste time/compute.

## Repro

1. Generate multiple candidates per chunk with varying durations and scores, then run validation.
2. Observe that the final selection favors the shortest clip even when a higher‑score clip exists.
3. Trigger a retry with `max_attempts_per_candidate=1`; the outputs across retry rounds are bit‑identical.

## Expected vs Actual

* **Expected:** Select by highest score (tie‑break by shortest duration). Retries should vary seeds.
* **Actual:** Selection by shortest duration; retries reuse the same seed → identical audio.
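
To make the difference concrete, here is a minimal sketch comparing the two selection rules on made-up data. The candidate tuples, file names, and the helper names `pick_shortest` and `pick_best` are illustrative assumptions, not code from `Chatter.py`.

```python
# Hypothetical validated candidates: (score, duration_seconds, path).
candidates = [
    (0.97, 4.2, "chunk0_cand1.wav"),
    (0.88, 3.1, "chunk0_cand2.wav"),
    (0.97, 3.9, "chunk0_cand3.wav"),
]

def pick_shortest(cands):
    """Actual behavior: duration is the only sort key."""
    return min(cands, key=lambda c: c[1])[2]

def pick_best(cands):
    """Expected behavior: highest score first, shortest duration breaks ties."""
    return min(cands, key=lambda c: (-c[0], c[1]))[2]

print(pick_shortest(candidates))  # chunk0_cand2.wav (lowest score, but shortest)
print(pick_best(candidates))      # chunk0_cand3.wav (top score, shorter of the tied pair)
```

Using `min` with a composite key is equivalent to `sorted(...)[0]` here and avoids sorting the whole list.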

## Proposed Fix

**A) Preserve score and select correctly**

```python
# when a candidate passes validation
chunk_validations[idx].append((score, cand['duration'], cand['path']))

# when selecting the winner
best_path = sorted(
    chunk_validations[idx],
    key=lambda x: (-x[0], x[1])  # score desc, duration asc
)[0][2]
```

**B) Vary seed across retry rounds**

Option 1 (no API change):

```python
# inside process_one_chunk and process_one_chunk_deterministic, before derive_seed(...)
salted_seed = this_seed ^ (0x9E3779B1 * int(retry_attempt_number))
candidate_seed = derive_seed(salted_seed, idx, cand_idx, attempt)
```

Option 2 (API change):

```python
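# mix_to_int is a placeholder for derive_seed's existing mixing arithmetic, extended with retry_round
def derive_seed(base_seed, chunk_idx, cand_idx, attempt_idx, retry_round=0):
    return mix_to_int(base_seed, chunk_idx, cand_idx, attempt_idx, retry_round)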

candidate_seed = derive_seed(this_seed, idx, cand_idx, attempt, retry_attempt_number)
```

## Nice‑to‑have

Introduce a separate `max_retry_rounds` (distinct from `max_attempts_per_candidate`) to avoid conflating generation attempts with retry cycles.

## Acceptance Criteria

* Selection prefers higher validation scores; duration only breaks ties.
* Consecutive retries produce different audio (non‑identical seeds).
* Optional: independent knob for retry rounds.

## File/Line References (Chatter.py)

### A) Candidate selection ignores score

* **Validation stores only duration & path**

  * L1210:

    ```py
    chunk_validations[chunk_idx].append((cand['duration'], cand['path']))
    ```
  * L1264:

    ```py
    chunk_validations[chunk_idx].append((cand['duration'], cand['path']))
    ```
* **Winner chosen by shortest duration**

  * L1277–L1279:

    ```py
    if chunk_validations[chunk_idx]:
        best_path = sorted(chunk_validations[chunk_idx], key=lambda x: x[0])[0][1]
    ```
* **Fix (where to patch)**

  * At **L1210** and **L1264**, append `(score, cand['duration'], cand['path'])` instead.
  * At **L1278**, select by highest score then shortest duration:

    ```py
    best_path = sorted(chunk_validations[chunk_idx], key=lambda x: (-x[0], x[1]))[0][2]
    ```

### B) Retry rounds reuse the same seed (identical audio)

* **Seed derivation omits retry round**

  * L0337–L0348 (`derive_seed`): no `retry_round` parameter.
* **Generation uses same base seed on retries**

  * L0734–L0737 and L0811–L0814: `candidate_seed = derive_seed(this_seed, idx, cand_idx, attempt)`.
* **Retry loop passes `retry_attempt_number` but it isn’t mixed into the seed**

  * L1234–L1245: `process_one_chunk_deterministic(..., this_seed, ..., 1, ..., chunk_attempts[chunk_idx] + 1)`.
* **Filenames show incrementing `try{}` but the `seed{}` stays the same**

  * L0751 and L0847: path pattern `..._try{retry_attempt_number}_seed{candidate_seed}.wav`.
* **Fix (two options)**

  1. *Local salt, no API change* — in both `process_one_chunk` and `process_one_chunk_deterministic`, immediately before calling `derive_seed` (around **L0736** and **L0813**):

     ```py
     salted_seed = this_seed ^ (0x9E3779B1 * int(retry_attempt_number))
     candidate_seed = derive_seed(salted_seed, idx, cand_idx, attempt)
     ```
  2. *API change* — extend `derive_seed` (around **L0337**) to accept `retry_round`, and call with it at **L0736** and **L0813**:

     ```py
     def derive_seed(base_seed, chunk_idx, cand_idx, attempt_idx, retry_round=0):
         mix = (np.uint64(base_seed) * np.uint64(1000003)
                + np.uint64(chunk_idx) * np.uint64(10007)
                + np.uint64(cand_idx) * np.uint64(10009)
                + np.uint64(attempt_idx) * np.uint64(101)
                + np.uint64(retry_round) * np.uint64(10037))
         s = int(mix & np.uint64(0xFFFFFFFF)) or 1
         return s

     # then use
     candidate_seed = derive_seed(this_seed, idx, cand_idx, attempt, retry_attempt_number)
     ```
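
The API-change option can be sanity-checked without NumPy. The sketch below mirrors the arithmetic of the extended `derive_seed` with plain integers (equivalent in the low 32 bits for non-negative inputs) and builds file names for three retry rounds. The `chunk{..}_cand{..}` prefix and the helper `candidate_filename` are hypothetical; only the `_try{..}_seed{..}.wav` suffix comes from the references above.

```python
def derive_seed(base_seed, chunk_idx, cand_idx, attempt_idx, retry_round=0):
    """Plain-int mirror of the extended derive_seed; keeps the low 32 bits."""
    mix = (base_seed * 1000003 + chunk_idx * 10007 + cand_idx * 10009
           + attempt_idx * 101 + retry_round * 10037)
    return (mix & 0xFFFFFFFF) or 1

def candidate_filename(chunk_idx, cand_idx, retry_attempt_number, seed):
    # Hypothetical prefix; the suffix follows the pattern cited in the issue.
    return f"chunk{chunk_idx}_cand{cand_idx}_try{retry_attempt_number}_seed{seed}.wav"

this_seed, idx, cand_idx, attempt = 1234, 3, 0, 0

# Omitting retry_round reproduces the round-0 seed, so existing callers are unaffected.
legacy = derive_seed(this_seed, idx, cand_idx, attempt)
round0 = derive_seed(this_seed, idx, cand_idx, attempt, 0)

names = [
    candidate_filename(idx, cand_idx, r, derive_seed(this_seed, idx, cand_idx, attempt, r))
    for r in range(3)
]
seeds_in_names = {name.rsplit("_seed", 1)[1] for name in names}

print(legacy == round0)      # True
print(len(seeds_in_names))   # 3
```

With the fix in place, both the `try` and the `seed` components of the file name change from one round to the next.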

### C) Optional: separate knobs

* The retry loop (starting around **L1218**) reuses `max_attempts_per_candidate` for the number of retry rounds; consider adding a distinct `max_retry_rounds` for clarity.
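
One possible shape for the separate knob is sketched below. `run_with_retries`, `generate`, and `validate` are hypothetical names standing in for the retry loop, the generation step, and the Whisper check; the real loop in `Chatter.py` is structured differently.

```python
def run_with_retries(generate, validate, max_attempts_per_candidate=1, max_retry_rounds=2):
    """Bound retry rounds and per-candidate attempts with separate settings.

    generate(retry_round, attempt) and validate(result) are hypothetical
    callables, not functions from Chatter.py.
    """
    for retry_round in range(max_retry_rounds + 1):  # round 0 is the initial pass
        for attempt in range(max_attempts_per_candidate):
            result = generate(retry_round, attempt)
            if validate(result):
                return result, retry_round
    return None, max_retry_rounds

calls = []

def fake_generate(retry_round, attempt):
    calls.append((retry_round, attempt))
    return retry_round  # pretend the "audio" is just the round number

# Validation only passes on the third pass (retry round 2).
result, rounds_used = run_with_retries(fake_generate, lambda r: r == 2)

print(calls)        # [(0, 0), (1, 0), (2, 0)]
print(rounds_used)  # 2
```

Keeping the two limits apart means raising the number of retry rounds no longer changes how many generation attempts each candidate gets within a round.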

## PR-ready patch (no API change)

Apply this unified diff to fix both issues (selection by score; seed varies on retries):

```diff
--- a/Chatter.py
+++ b/Chatter.py
@@ -733,7 +733,9 @@
 
         for cand_idx in range(num_candidates_per_chunk):
             for attempt in range(max_attempts_per_candidate):
-                candidate_seed = derive_seed(this_seed, idx, cand_idx, attempt)
+                salted_seed = this_seed ^ (0x9E3779B1 * int(retry_attempt_number))
+
+                candidate_seed = derive_seed(salted_seed, idx, cand_idx, attempt)
                 set_seed(candidate_seed)
                 try:
                     print(f"[32m[DEBUG] Generating candidate {cand_idx+1} attempt {attempt+1} for chunk {idx}...[0m")
@@ -810,7 +812,9 @@
 
         for cand_idx in range(num_candidates_per_chunk):
             for attempt in range(max_attempts_per_candidate):
-                candidate_seed = derive_seed(this_seed, idx, cand_idx, attempt)
+                salted_seed = this_seed ^ (0x9E3779B1 * int(retry_attempt_number))
+
+                candidate_seed = derive_seed(salted_seed, idx, cand_idx, attempt)
                 print(f"[32m[DEBUG] [DET] Generating cand {...pt {attempt+1} for chunk {idx} (seed={candidate_seed}).[0m")
 
                 try:
@@ -1207,7 +1211,7 @@
                         path, score, transcribed = whisper_chec...ndidate_path, sentence_group, whisper_model, use_faster_whisper)
                         print(f"[32m[DEBUG] [Chunk {chunk_i...: score={score:.3f}, transcript=[33m'{transcribed}'[0m")
                         if score >= 0.85:
-                            chunk_validations[chunk_idx].append((cand['duration'], cand['path']))
+                            chunk_validations[chunk_idx].append((score, cand['duration'], cand['path']))
                         else:
                             chunk_failed_candidates[chunk_idx].append((score, cand['path'], transcribed))
                     except Exception as e:
@@ -1261,7 +1265,7 @@
                                 path, score, transcribed = whisper_check_mp(candidate_path, sentence_group, whisper_model, use_faster_whisper)
                                 print(f"[32m[DEBUG] [Chunk ...: score={score:.3f}, transcript=[33m'{transcribed}'[0m")
                                 if score >= 0.95:
-                                    chunk_validations[chunk_idx].append((cand['duration'], cand['path']))
+                                    chunk_validations[chunk_idx].append((score, cand['duration'], cand['path']))
                                 else:
                                     chunk_failed_candidates[chunk_idx].append((score, cand['path'], transcribed))
                             except Exception as e:
@@ -1275,7 +1279,7 @@
                 # Assemble waveform list
                 for chunk_idx in sorted(chunk_candidate_map.keys()):
                     if chunk_validations[chunk_idx]:
-                        best_path = sorted(chunk_validations[chunk_idx], key=lambda x: x[0])[0][1]
+                        best_path = sorted(chunk_validations[chunk_idx], key=lambda x: (-x[0], x[1]))[0][2]
                         print(f"[32m[DEBUG] Selected {best_... for chunk {chunk_idx} [1;33m(PASSED Whisper check)[0m")
                         waveform, sr = torchaudio.load(best_path)
                         waveform_list.append(waveform)
```

Bug: Candidate selection ignores score; retries reuse same seed (identical audio) #45
