Accidental trimming of overlapping text cues #363

@ntrrgc

Let's consider the following WebVTT stream:

WEBVTT

00:00:00.000 --> 00:00:10.000 position:10%
Cue A

00:00:01.000 --> 00:00:02.000 position:80%
Cue B

00:00:05.000 --> 00:00:06.000 position:80%
Cue C

WebVTT cues can overlap, as they do in the above example. Graphically, this is what the above example looks like:

[Image: diagram of the cues in a timeline]

Notably, as far as I've been able to find, the MSE spec has no concept of cues, only coded frames, and the relation between cues and coded frames is not specified.

For the remainder of this explanation, let's assume a one-to-one mapping between cues and coded frames: each cue is encoded and represented as one coded frame. As a consequence, the presentation intervals of the coded frames can overlap. This is how WebVTT cues are represented as samples in WebM (in both the S_TEXT/WEBVTT and D_WEBVTT/kind formats).

Note

This is not the only possible mapping that could be defined between cues and coded frames, nor the least problematic one.

For comparison, when encoded in MP4/ISO BMFF, overlapping WebVTT cues are split into multiple non-overlapping samples. A different mapping could be 1 coded frame representing 1 ISO BMFF WebVTT sample, and this is what the macOS port of WebKit uses.
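To make the alternative mapping concrete, here is a minimal sketch of splitting overlapping cues into non-overlapping samples, each listing the cues active during its interval. The function name `split_into_samples` and the tuple layout are hypothetical, chosen for illustration; the sketch skips intervals with no active cue and does not model the actual ISO BMFF box structure.

```python
def split_into_samples(cues):
    """Split possibly overlapping cues into non-overlapping samples.

    cues: list of (name, start, end) tuples, times in seconds.
    Returns a list of (start, end, [names of active cues]).
    """
    # Every cue start or end is a potential sample boundary.
    boundaries = sorted({t for _, start, end in cues for t in (start, end)})
    samples = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        # No cue boundary lies strictly inside (lo, hi), so a cue is
        # active for the whole interval iff it covers both endpoints.
        active = [name for name, start, end in cues if start <= lo and hi <= end]
        if active:  # simplification: gaps without cues are skipped
            samples.append((lo, hi, active))
    return samples


cues = [("Cue A", 0, 10), ("Cue B", 1, 2), ("Cue C", 5, 6)]
samples = split_into_samples(cues)
for start, end, names in samples:
    print(f"[{start}, {end})s: {', '.join(names)}")
```

For the stream above this yields five samples, and Cue A appears in all of them, so no sample's presentation interval overlaps another's.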

With the current Coded Frame Processing algorithm, this is what would happen for the WebVTT stream above:

  • Initially:
    • last decode timestamp = unset
    • last frame duration = unset
    • highest end timestamp = unset
    • Samples: empty
  • A coded frame representing Cue A with presentation interval [0, 10)s is processed:
    • last decode timestamp = 0
    • last frame duration = 10
    • highest end timestamp = 10
    • Samples:
      • Cue A: [0, 10)s
  • A coded frame representing Cue B with presentation interval [1, 2)s is processed:
    • last decode timestamp = 1
    • last frame duration = 1
    • highest end timestamp = 10
    • Samples:
      • Cue A: [0, 10)s
      • Cue B: [1, 2)s
  • A coded frame representing Cue C with presentation interval [5, 6)s is processed:
    • A new coded frame group is started because, quoting step 1.1.6:

      last decode timestamp for track buffer is set and the difference between decode timestamp and last decode timestamp is greater than 2 times last frame duration

      As such last decode timestamp, last frame duration and highest end timestamp are all unset.

    • Splicing of Cue A occurs because, quoting step 1.1.13:

      last decode timestamp for track buffer is unset and presentation timestamp falls within the presentation interval of a coded frame in track buffer

      This changes the duration of Cue A from 10 seconds to 5 seconds.

    • last decode timestamp = 5

    • last frame duration = 1

    • highest end timestamp = 6

    • Samples:

      • Cue A: [0, 5)s
      • Cue B: [1, 2)s
      • Cue C: [5, 6)s

Graphically, this is the result:

[Image: diagram of the cues in time where Cue A has been unexpectedly trimmed short]
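The walkthrough above can be reproduced with a small simulation. This is a heavily simplified sketch, not the full Coded Frame Processing algorithm: it models only the discontinuity check and the overlap handling at the start of a coded frame group, it treats the overlap as a trim of the earlier frame (as described above), and it assumes the decode timestamp equals the presentation timestamp. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    start: int  # presentation timestamp, seconds
    end: int    # frame end timestamp, seconds


class TrackBuffer:
    def __init__(self):
        self.last_dts = None       # last decode timestamp
        self.last_duration = None  # last frame duration
        self.highest_end = None    # highest end timestamp
        self.samples = []

    def append(self, name, start, end):
        dts = start  # assumption: DTS == PTS for text cues
        # Discontinuity check: a backwards jump or a gap larger than
        # twice the last frame duration starts a new coded frame group.
        if self.last_dts is not None and (
            dts < self.last_dts or dts - self.last_dts > 2 * self.last_duration
        ):
            self.last_dts = self.last_duration = self.highest_end = None
        # Overlap handling: only runs at the start of a coded frame group.
        if self.last_dts is None:
            for frame in self.samples:
                if frame.start <= start < frame.end:
                    frame.end = start  # the overlapped frame is cut short
        self.samples.append(Frame(name, start, end))
        self.last_dts = dts
        self.last_duration = end - start
        self.highest_end = end if self.highest_end is None else max(self.highest_end, end)


buffer = TrackBuffer()
for cue in [("Cue A", 0, 10), ("Cue B", 1, 2), ("Cue C", 5, 6)]:
    buffer.append(*cue)

result = [(f.name, f.start, f.end) for f in buffer.samples]
for name, start, end in result:
    print(f"{name}: [{start}, {end})s")
```

Appending Cue B leaves Cue A intact because the gap of 1 s is within twice Cue A's duration. Appending Cue C resets the state because the gap of 4 s exceeds twice Cue B's duration, and Cue A is then trimmed.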

A potential fix could be amending the check in step 1.1.6 so that it doesn't trigger when the presentation timestamp is less than or equal to the highest end timestamp.
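As a rough illustration of the proposed amendment, the sketch below expresses the discontinuity check as a predicate with and without the extra condition. The function name and parameters are hypothetical, and applying the exemption only to the gap condition (not to the backwards-jump condition) is one possible interpretation, not something the spec or this proposal settles.

```python
def starts_new_coded_frame_group(dts, pts, last_dts, last_duration,
                                 highest_end, amended=False):
    """Return True if the frame would start a new coded frame group."""
    if last_dts is None:
        return False  # nothing to compare against yet
    if dts < last_dts:
        return True   # backwards jump in decode order
    gap_too_large = dts - last_dts > 2 * last_duration
    if amended and highest_end is not None and pts <= highest_end:
        # Proposed exemption: the frame still falls inside the range
        # already covered by the current coded frame group.
        return False
    return gap_too_large


# State after Cue B, when Cue C ([5, 6)s) arrives.
state = dict(last_dts=1, last_duration=1, highest_end=10)
current = starts_new_coded_frame_group(dts=5, pts=5, **state)
proposed = starts_new_coded_frame_group(dts=5, pts=5, amended=True, **state)
print(current, proposed)
```

With the current check Cue C starts a new coded frame group; with the amended check it does not, so last decode timestamp stays set and the overlap step that trims Cue A is not reached. A frame beyond the highest end timestamp, for example one at 20 s, would still start a new group under both variants.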
