Let's consider the following WebVTT stream:

```
WEBVTT

0:00:00.000 --> 0:00:10.000 position:10%
Cue A

0:00:01.000 --> 0:00:02.000 position:80%
Cue B

0:00:05.000 --> 0:00:06.000 position:80%
Cue C
```
WebVTT cues can overlap, as they do in the above example. Graphically, this is what the above example looks like:
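The overlap can also be checked mechanically. A minimal sketch (the cue timings are taken from the stream above; the helper name is made up for illustration):

```python
# Cue intervals from the stream above, in seconds, as half-open [start, end).
cues = {"A": (0, 10), "B": (1, 2), "C": (5, 6)}

def overlaps(a, b):
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

print(overlaps(cues["A"], cues["B"]))  # True: B lies inside A
print(overlaps(cues["A"], cues["C"]))  # True: C lies inside A
print(overlaps(cues["B"], cues["C"]))  # False: B and C are disjoint
```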
Notably, as far as I've been able to find, the MSE spec has no concept of cues, only coded frames, and the relation between cues and coded frames is left unspecified.
For the remainder of this explanation, let's assume a 1 to 1 mapping between cues and coded frames: each cue would be encoded and represented as one coded frame. As a consequence, the presentation intervals of the coded frames can overlap. This is how WebVTT cues are represented in samples in WebM (in both `S_TEXT/WEBVTT` and `D_WEBVTT/kind` formats).
Note
This is not the only possible mapping that could be defined between cues and coded frames, nor the least problematic one.
For comparison, when encoded in MP4/ISO BMFF, overlapping WebVTT cues are split into multiple non-overlapping samples. A different mapping could be 1 coded frame representing 1 ISO BMFF WebVTT sample, and this is what the macOS port of WebKit uses.
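As a sketch of that splitting idea (an illustration of the non-overlapping-samples mapping, not WebKit's actual code; the function name is made up):

```python
# Hypothetical sketch of splitting overlapping cues into non-overlapping
# samples, as the MP4/ISO BMFF carriage of WebVTT requires.

def split_into_samples(cues):
    """cues: list of (start, end, name) with half-open [start, end) intervals.
    Returns non-overlapping (start, end, active_names) segments."""
    # Every cue start/end is a potential sample boundary.
    boundaries = sorted({t for start, end, _ in cues for t in (start, end)})
    samples = []
    for seg_start, seg_end in zip(boundaries, boundaries[1:]):
        # A cue is active in a segment iff the segment lies inside its interval.
        active = [name for start, end, name in cues
                  if start <= seg_start and seg_end <= end]
        if active:
            samples.append((seg_start, seg_end, active))
    return samples

# The three cues from the stream above:
print(split_into_samples([(0, 10, "A"), (1, 2, "B"), (5, 6, "C")]))
# → [(0, 1, ['A']), (1, 2, ['A', 'B']), (2, 5, ['A']),
#    (5, 6, ['A', 'C']), (6, 10, ['A'])]
```

Note how Cue A's single 10-second interval becomes five samples, each carrying the set of cues active during it.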
With the current Coded Frame Processing algorithm, this is what would happen for the WebVTT stream above:
Initially:
last decode timestamp = unset
last frame duration = unset
highest end timestamp = unset
Samples: empty
A coded frame representing Cue A with presentation interval [0, 10)s is processed:
last decode timestamp = 0
last frame duration = 10
highest end timestamp = 10
Samples:
Cue A: [0, 10)s
A coded frame representing Cue B with presentation interval [1, 2)s is processed:
last decode timestamp = 1
last frame duration = 1
highest end timestamp = 10
Samples:
Cue A: [0, 10)s
Cue B: [1, 2)s
A coded frame representing Cue C with presentation interval [5, 6)s is processed:
A new coded frame group is started because, quoting step 1.1.6:
last decode timestamp for track buffer is set and the difference between decode timestamp and last decode timestamp is greater than 2 times last frame duration
As such, last decode timestamp, last frame duration, and highest end timestamp are all unset.
Splicing of Cue A occurs because, quoting step 1.1.13:
last decode timestamp for track buffer is unset and presentation timestamp falls within the presentation interval of a coded frame in track buffer
This changes the duration of Cue A from 10 seconds to 5 seconds.
last decode timestamp = 5
last frame duration = 1
highest end timestamp = 6
Samples:
Cue A: [0, 5)s
Cue B: [1, 2)s
Cue C: [5, 6)s
Graphically, this is the result:
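The walkthrough above can be reproduced with a small simulation. This is a hypothetical model of the two spec steps involved, assuming the 1:1 cue-to-coded-frame mapping, not normative code:

```python
# Hypothetical model of the relevant Coded Frame Processing steps (not spec
# text). For text frames, decode timestamp == presentation timestamp.

def process_coded_frames(frames):
    last_dts = None        # last decode timestamp
    last_duration = None   # last frame duration
    highest_end = None     # highest end timestamp
    track_buffer = []      # buffered frames: [name, start, end]

    for name, pts, end in frames:
        dts, duration = pts, end - pts

        # Step 1.1.6: a timestamp jump starts a new coded frame group,
        # unsetting the per-track state.
        if last_dts is not None and (dts < last_dts or
                                     dts - last_dts > 2 * last_duration):
            last_dts = last_duration = highest_end = None

        # Step 1.1.13: with last decode timestamp unset, a buffered frame
        # whose presentation interval contains pts is spliced (truncated).
        if last_dts is None:
            for frame in track_buffer:
                if frame[1] <= pts < frame[2]:
                    frame[2] = pts

        track_buffer.append([name, pts, end])
        last_dts, last_duration = dts, duration
        highest_end = end if highest_end is None else max(highest_end, end)

    return [tuple(f) for f in track_buffer]

print(process_coded_frames([("A", 0, 10), ("B", 1, 2), ("C", 5, 6)]))
# → [('A', 0, 5), ('B', 1, 2), ('C', 5, 6)]: Cue A was truncated to 5 s.
```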
A potential fix could be amending the check in step 1.1.6 so that it doesn't trigger when the presentation timestamp is less than or equal to the highest end timestamp.
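A sketch of what the amended check could look like (an assumption about how the fix might be phrased, not proposed spec text; the function name is made up):

```python
# Amended step 1.1.6 (hypothetical): besides the existing discontinuity
# condition, a new coded frame group is only started when the presentation
# timestamp is past the highest end timestamp, i.e. not still covered by
# already-buffered frames.

def starts_new_coded_frame_group(dts, pts, last_dts, last_duration, highest_end):
    if last_dts is None:
        return False
    discontinuity = dts < last_dts or dts - last_dts > 2 * last_duration
    still_covered = highest_end is not None and pts <= highest_end
    return discontinuity and not still_covered

# Cue C (pts 5) arriving after Cue B (dts 1, duration 1) while the highest
# end timestamp is 10 (the end of Cue A):
print(starts_new_coded_frame_group(5, 5, 1, 1, 10))    # False: Cue A survives
print(starts_new_coded_frame_group(20, 20, 1, 1, 10))  # True: a real gap
```

With this change, Cue C no longer starts a new coded frame group, so step 1.1.13 never runs and Cue A keeps its full [0, 10)s interval, while a genuine discontinuity past the buffered range still starts a new group.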