Accidental trimming of overlapping text cues

Let's consider the following WebVTT stream:

```webvtt
WEBVTT

00:00:00.000 --> 00:00:10.000 position:10%
Cue A

00:00:01.000 --> 00:00:02.000 position:80%
Cue B

00:00:05.000 --> 00:00:06.000 position:80%
Cue C
```

WebVTT cues can overlap, as they do in the above example. Graphically, this is what the above example looks like:

<img alt="diagram of the cues in a timeline" src="https://github.com/user-attachments/assets/a26ff24f-f735-4a2a-b092-b365633066f5" height="160">

Notably, as far as I've been able to find, the MSE spec has no concept of cues, only coded frames. Also, as far as I've been able to find, **the relation between cues and coded frames is not specified.**

For the remainder of this explanation, **let's assume a one-to-one mapping between cues and coded frames**: each cue is encoded and represented as one coded frame. As a consequence, the presentation intervals of the coded frames can overlap. **This is how WebVTT cues are represented in samples in WebM** (in both `S_TEXT/WEBVTT` and `D_WEBVTT/kind` formats).

> [!NOTE]
> This is not the only possible mapping that could be defined between cues and coded frames, nor the least problematic one.
>
> For comparison, when encoded in MP4/ISO BMFF, overlapping WebVTT cues are split into multiple non-overlapping samples. A different mapping could be 1 coded frame representing 1 ISO BMFF WebVTT sample, and this is what the macOS port of WebKit uses.

With the current *Coded Frame Processing* algorithm, this is what would happen for the WebVTT stream above:

* Initially:
    * last decode timestamp = unset
    * last frame duration = unset
    * highest end timestamp = unset
    * Samples: empty
* A coded frame representing Cue A with presentation interval [0, 10)s is processed:
    * last decode timestamp = 0
    * last frame duration = 10
    * highest end timestamp = 10
    * Samples:
        * Cue A: [0, 10)s
* A coded frame representing Cue B with presentation interval [1, 2)s is processed:
    * last decode timestamp = 1
    * last frame duration = 1
    * highest end timestamp = 10
    * Samples:
        * Cue A: [0, 10)s
        * Cue B: [1, 2)s
* A coded frame representing Cue C with presentation interval [5, 6)s is processed:
    * A new coded frame group is started because, quoting step 1.1.6:
    
        > last decode timestamp for track buffer is set and the difference between decode timestamp and last decode timestamp is greater than 2 times last frame duration

        As such, last decode timestamp, last frame duration and highest end timestamp are all unset.
    * **Splicing of Cue A occurs** because, quoting step 1.1.13:
        
        > last decode timestamp for track buffer is unset and presentation timestamp falls within the presentation interval of a coded frame in track buffer

        This changes the duration of Cue A from 10 seconds to 5 seconds.
    * last decode timestamp = 5
    * last frame duration = 1
    * highest end timestamp = 6
    * Samples:
        * Cue A: [0, 5)s
        * Cue B: [1, 2)s
        * Cue C: [5, 6)s
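
The walkthrough above can be reproduced with a small simulation. This is only a rough sketch, not the spec algorithm: it models just the two quoted conditions (the discontinuity check and the text splice), assumes decode timestamp equals presentation timestamp for text cues, and omits everything else (append windows, frame removal, random access points). The names `Frame`, `TrackBuffer` and `process` are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Frame:
    name: str
    start: float  # presentation timestamp, in seconds
    end: float    # frame end timestamp, in seconds


@dataclass
class TrackBuffer:
    last_decode_timestamp: Optional[float] = None
    last_frame_duration: Optional[float] = None
    highest_end_timestamp: Optional[float] = None
    samples: List[Frame] = field(default_factory=list)


def process(tb: TrackBuffer, name: str, pts: float, duration: float) -> None:
    dts = pts  # assumption: text frames have DTS == PTS

    # Discontinuity check (the condition quoted from step 1.1.6, simplified).
    if tb.last_decode_timestamp is not None and (
        dts < tb.last_decode_timestamp
        or dts - tb.last_decode_timestamp > 2 * tb.last_frame_duration
    ):
        tb.last_decode_timestamp = None
        tb.last_frame_duration = None
        tb.highest_end_timestamp = None

    # Splice check (the condition quoted from step 1.1.13, simplified):
    # an overlapped text frame is truncated to end at the new frame's PTS.
    if tb.last_decode_timestamp is None:
        for frame in tb.samples:
            if frame.start <= pts < frame.end:
                frame.end = pts

    end = pts + duration
    tb.samples.append(Frame(name, pts, end))
    tb.last_decode_timestamp = dts
    tb.last_frame_duration = duration
    if tb.highest_end_timestamp is None or end > tb.highest_end_timestamp:
        tb.highest_end_timestamp = end


tb = TrackBuffer()
process(tb, "Cue A", 0, 10)
process(tb, "Cue B", 1, 1)
process(tb, "Cue C", 5, 1)
for frame in tb.samples:
    print(f"{frame.name}: [{frame.start}, {frame.end})s")
```

Under these assumptions, Cue B is appended without any splicing because last decode timestamp is still set, while Cue C triggers the reset (5 − 1 > 2 × 1) and then truncates Cue A at 5 seconds.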

Graphically, this is the result:

<img alt="diagram of the cues in time where Cue A has been unexpectedly trimmed short" src="https://github.com/user-attachments/assets/942857f9-6153-4d6e-9219-299cf72b40a8" height="160">

A potential fix could be amending the check in step 1.1.6 so that it doesn't trigger when the presentation timestamp is less than or equal to the highest end timestamp.
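
To illustrate the effect of such an amendment, here is a minimal, self-contained sketch comparing the current check with the amended one on the three cues from the example. It is a simplification under stated assumptions: only the "greater than 2 times last frame duration" branch of the check and the text splice are modelled, decode timestamp is assumed equal to presentation timestamp, and `process_cues` and `amended_check` are hypothetical names, not spec terms.

```python
def process_cues(cues, amended_check):
    """Return {cue name: (start, end)} after processing cues in order."""
    last_dts = last_duration = highest_end = None
    samples = {}
    for name, start, end in cues:
        # Current check: a gap larger than twice the last frame duration.
        gap = last_dts is not None and start - last_dts > 2 * last_duration
        # Proposed amendment: do not treat it as a discontinuity when the
        # new frame starts at or before the highest end timestamp.
        if amended_check and gap and start <= highest_end:
            gap = False
        if gap:
            last_dts = last_duration = highest_end = None

        # Text splice: only runs when last decode timestamp is unset.
        if last_dts is None:
            for other, (s, e) in list(samples.items()):
                if s <= start < e:
                    samples[other] = (s, start)

        samples[name] = (start, end)
        last_dts = start
        last_duration = end - start
        highest_end = end if highest_end is None else max(highest_end, end)
    return samples


cues = [("Cue A", 0, 10), ("Cue B", 1, 2), ("Cue C", 5, 6)]
current = process_cues(cues, amended_check=False)
amended = process_cues(cues, amended_check=True)
print("current:", current["Cue A"])
print("amended:", amended["Cue A"])
```

With the amended check, Cue C (starting at 5 seconds) falls before the highest end timestamp of 10 seconds, so no new coded frame group is started, the splice step is skipped, and Cue A keeps its full [0, 10)s interval.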