Improve performance #52

Open · Luro02 opened this issue Apr 15, 2020 · 31 comments
Labels: A-parser (Area: parser), help wanted (Extra attention is needed)

@Luro02 (Collaborator) commented Apr 15, 2020

As mentioned in PR #51, parsing MediaPlaylist files with MediaPlaylist::from_str is quite slow, and this can definitely be improved :)


     Running target/release/deps/parse-0d26cc9e51751f86
Benchmarking parser/throughput: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 25.3s or reduce sample count to 30.
Benchmarking parser/throughput: Collecting 100 samples in estimated 25.304 s
parser/throughput       time:   [4.6647 ms 4.6793 ms 4.6968 ms]
                        thrpt:  [64.794 MiB/s 65.036 MiB/s 65.240 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Luro02 added the help wanted (Extra attention is needed) and A-parser (Area: parser) labels on Apr 15, 2020
@Luro02 (Collaborator, Author) commented Apr 15, 2020

An obvious solution would be to use Cow instead of String, which could avoid allocations. Note that this would not improve the MediaPlaylist::from_str performance, because FromStr gives the struct no way to borrow from the input, so it has to take ownership of the data; one would have to use TryFrom instead:

pub struct MediaPlaylist<'a> { /* ... */ }

impl<'a> TryFrom<&'a str> for MediaPlaylist<'a> {
    type Error = Error;
    fn try_from(input: &'a str) -> Result<Self, Self::Error> {
        parse_media_playlist(input, &mut MediaPlaylist::builder())
    }
}

One might also consider adding beef (a faster and smaller Cow implementation) as an optional dependency.
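A minimal sketch of the idea, assuming an illustrative Segment type rather than the crate's actual structs:

use std::borrow::Cow;

// Borrowing from the playlist text avoids allocating a fresh String per value.
struct Segment<'a> {
    uri: Cow<'a, str>,
}

fn parse_uri(line: &str) -> Segment<'_> {
    // no allocation here - the segment simply borrows from the input
    Segment {
        uri: Cow::Borrowed(line.trim()),
    }
}

fn parse_uri_owned(line: &str) -> Segment<'static> {
    // an owned variant remains possible when the input cannot be borrowed
    Segment {
        uri: Cow::Owned(line.trim().to_string()),
    }
}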

@Luro02 (Collaborator, Author) commented Apr 15, 2020

smallvec could also improve the performance, but I think it would be better to wait for const generics to stabilize before depending on it.

All places where it could be used (I think):

  • hls_m3u8::types::Codecs
  • hls_m3u8::types::KeyFormatVersions
  • hls_m3u8::types::Value::Hex

MediaPlaylist

  • hls_m3u8::tags::ExtXMap::keys
  • hls_m3u8::MediaPlaylist::unknown
  • hls_m3u8::MediaSegment::keys

MasterPlaylist:

  • hls_m3u8::MasterPlaylist::media
  • hls_m3u8::MasterPlaylist::variant_streams
  • hls_m3u8::MasterPlaylist::session_data
  • hls_m3u8::MasterPlaylist::session_keys
  • hls_m3u8::MasterPlaylist::unknown_tags (small note: this field should be called unknown)

but I am not sure if it should be used as a drop-in replacement for Vec or just for things that are known to be used infrequently (for example, right now the spec only allows a single DecryptionKey, but they are stored in a Vec, because the spec indicates that new keys might be added with different identities).

Why smallvec?

I like that it automatically allocates on the heap if the size exceeds the inline array, and it is used by the Rust compiler and Servo, so it is quite well supported.
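A minimal sketch of that spill behaviour, using a plain u32 payload instead of one of the crate's types:

use smallvec::{smallvec, SmallVec};

fn main() {
    // stays inline (no heap allocation) while it holds at most one element...
    let mut keys: SmallVec<[u32; 1]> = smallvec![1];
    assert!(!keys.spilled());

    // ...and transparently spills to the heap once the inline capacity is exceeded
    keys.push(2);
    assert!(keys.spilled());
}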

other crates that I know of:

@Luro02 (Collaborator, Author) commented Apr 15, 2020

Maybe it would be possible to parse tags in parallel? I think that would be really difficult, because the m3u8 file format seems to be designed to be parsed sequentially (some tags depend on the previous or next tag).

@Luro02 (Collaborator, Author) commented Apr 15, 2020

Most likely the culprit is the BTreeMap that contains all the MediaSegments, which is quite inefficient with many segments, so maybe it would be better to use a HashMap and then sort the segments before writing them out with fmt::Display?
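A rough sketch of that suggestion (the Playlist type and String payload are illustrative, not the crate's real types): segments live in a HashMap keyed by their sequence number and are only sorted when the playlist is written out.

use std::collections::HashMap;
use std::fmt;

struct Playlist {
    segments: HashMap<usize, String>, // sequence number -> segment line
}

impl fmt::Display for Playlist {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // sort once per write instead of keeping an ordered map
        let mut ordered: Vec<_> = self.segments.iter().collect();
        ordered.sort_by_key(|(number, _)| **number);

        for (_, segment) in ordered {
            writeln!(f, "{}", segment)?;
        }
        Ok(())
    }
}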

@dholroyd (Contributor) commented Apr 15, 2020

I made some experimental changes to try to get a speed-up. I'm afraid I gutted the code and will have broken lots of existing use cases / tests, so I can share a branch, but not provide a PR yet.

  • Changing the segments representation from BTreeMap<usize, MediaSegment> to Vec<MediaSegment> was a good win
  • Changing keys to a SmallVec<[ExtXKey; 1]> also helped
  • I just commented out the code handling the description from the EXTINF line. Not great, but it avoids another allocation per segment
  • Switched to hand-coded builder implementations for the types being exercised, since many of the #[derive(builder)] setter methods seem to clone their arguments (I did not investigate this much to see if there was another way)
  • And the big one (in terms of lines added): switched to a nom5-based parser for the subset of syntax I'm testing
    • it's single-pass for the most part (the current hls_m3u8 parser first scans for end-of-line and, having found a line, parses it - so it's more like a two-pass process)
    • the parser is byte-oriented and defers UTF-8 conversion to only those elements, like URIs, that might actually contain non-ASCII content (tag content like EXTINF stays in the ASCII subset of UTF-8 and can, I think, be safely parsed as bytes)
    • the parser is 'left factored' - i.e. a rule recognises #EXT and then sub-rules try to recognise e.g. the -X-MEDIA-SEQUENCE: suffix (the current parser re-tests the #EXT prefix when attempting to match each tag, unless the optimiser is doing very clever tricks!) - see the sketch after this list
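A hypothetical illustration of that left factoring (not the code from the branch): the #EXT prefix is matched once, and only the remaining suffix is compared per tag.

fn tag_name(line: &[u8]) -> Option<&'static str> {
    // match the shared "#EXT" prefix exactly once
    let rest = line.strip_prefix(b"#EXT")?;

    // then dispatch on the remaining suffix only
    if rest.starts_with(b"-X-MEDIA-SEQUENCE:") {
        Some("EXT-X-MEDIA-SEQUENCE")
    } else if rest.starts_with(b"-X-VERSION:") {
        Some("EXT-X-VERSION")
    } else if rest.starts_with(b"INF:") {
        Some("EXTINF")
    } else {
        None
    }
}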

With all that, I see the benchmark go from around 80 MB/s to:

Benchmarking parser/throughput: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s or reduce sample count to 60.
parser/throughput       time:   [1.0120 ms 1.0175 ms 1.0243 ms]                               
                        thrpt:  [297.09 MiB/s 299.10 MiB/s 300.72 MiB/s]

Will try to push the branch to share work in progress later.

@dholroyd (Contributor):

WIP over here: dholroyd@f601da1

Benchmarking parser/from_str parser: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s or reduce sample count to 50.
parser/from_str parser  time:   [1.6824 ms 1.6864 ms 1.6912 ms]                                    
                        thrpt:  [179.94 MiB/s 180.46 MiB/s 180.89 MiB/s]
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
Benchmarking parser/nom parser: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.4s or reduce sample count to 60.
parser/nom parser       time:   [1.0867 ms 1.0916 ms 1.0976 ms]                               
                        thrpt:  [277.27 MiB/s 278.78 MiB/s 280.03 MiB/s]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

It's not intended to be ready to commit yet. In particular, lots of MediaPlaylistBuilder::build() is commented out because I didn't understand it yet (and the switch to Vec<MediaSegment> breaks it). Sorry - I made a change to the data structure without really understanding its function! Could you explain why the sequence numbers are inside MediaSegment and also tracked via the map key?

I'm not particularly committed to using nom; it's just a tool I've tried out recently. The parser only handles a small portion of the syntax that the FromStr version covers; that would need loads more work. This is an opportunity to add 'Span' support, though.

I did not yet try your idea to handle uri / description with Cow (or beef) - could well help further.

@Luro02 (Collaborator, Author) commented Apr 16, 2020

Sorry - I made a change to the data structure without really understanding its function!

Do not worry, all help is appreciated 🤗

Could you explain why the sequence numbers are inside MediaSegment and also tracked via the Map key?

A MediaSegment describes a small chunk of the output stream, and the player might want to switch to a lower or higher quality stream. According to the RFC, one should not rely on the Duration; instead, each segment has a number assigned. The first segment has the discontinuity sequence number and all the following ones are previous + 1. This makes it possible to switch streams by fetching the next segment from another VariantStream with the last sequence number + 1. (I might be wrong, so maybe it would be better to look at the EXT-X-DISCONTINUITY documentation in the RFC.)

I had quite a difficult time writing code that keeps the insertion order of the provided segments vector and that also allows for explicitly numbered segments (the implicitly numbered segments should fill in the free spaces in the vector).

I chose a BTreeMap as the data structure because you can sort the segments by their number, it does not allow two segments with the same number, and you can iterate through them in the correct order. The problem is that a single insert into a BTreeMap costs O(log n), so inserting n segments costs O(n log n) overall (see here for a visualization of what this means: https://miro.medium.com/max/1400/1*5ZLci3SuR0zM_QlZOADv8Q.jpeg), which means for our case that the more segments you insert, the slower it becomes (a lot slower).

We could replace the BTreeMap with a custom data structure that takes advantage of the fact that the segment numbers are just indexes (number - discontinuity number). I will look into this a bit more later today and write a custom data structure for it.

Simply using a Vec is not an option, because inserting an element in the middle would shift all the following elements, which breaks the correspondence between segment numbers and positions.
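A quick sketch of the idea, not the eventual implementation: the segment number (minus the starting number) is used directly as an index into a Vec<Option<T>>, so explicitly numbered inserts land in their own slot without shifting any existing elements.

struct SlotVec<T> {
    slots: Vec<Option<T>>,
}

impl<T> SlotVec<T> {
    fn new() -> Self {
        Self { slots: Vec::new() }
    }

    /// Places `value` at `index`, growing the slot list with `None`s as needed.
    fn insert_at(&mut self, index: usize, value: T) {
        if index >= self.slots.len() {
            self.slots.resize_with(index + 1, || None);
        }
        self.slots[index] = Some(value);
    }

    /// Iterates over the filled slots in numeric order.
    fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        self.slots.iter().filter_map(Option::as_ref)
    }
}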

I'm not particularly committed to using nom

I used it quite a while ago, when you had to use macros for everything, which was quite confusing/difficult for me, and I think adding a parsing library would make it more difficult for people to contribute to this library, because they would first have to understand nom.

I would rather use a custom trait (Parser) and some kind of struct that can be used to get tokens from the stream (see #46).

How much did this improve performance?

This is an opportunity to add 'Span' support though.

I would hold off on that... this is how I ended up rewriting most of this library, because I had so many ideas that I wanted to apply 😅.

@dholroyd (Contributor) commented Apr 16, 2020

This makes it possible to switch the streams, by fetching the next segment from another VariantStream with the last sequence number+1

My reading of the spec is that,

  • media-sequence-number increments by 1 for each segment in the Media Manifest
  • discontinuity-sequence-number increments by 1 for each discontinuity in the Media Manifest

Is that your understanding too?

So, writing out explicitly the per-segment values that are implied by the manifest + spec-rules:

msn  dsn  m3u8
0    0    one.ts
1    0    two.ts
.    .    #EXT-X-DISCONTINUITY
2    1    three.ts
3    1    four.ts

Or with explicit headers,

msn  dsn  m3u8
.    .    #EXT-X-MEDIA-SEQUENCE:100
.    .    #EXT-X-DISCONTINUITY-SEQUENCE:5
100  5    one.ts
101  5    two.ts
.    .    #EXT-X-DISCONTINUITY
102  6    three.ts
103  6    four.ts
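A hypothetical walk-through of the numbering rules shown in the tables above: every URI line gets the next media sequence number, and each #EXT-X-DISCONTINUITY tag bumps the discontinuity sequence number.

fn number_segments(lines: &[&str], first_msn: u64, first_dsn: u64) -> Vec<(u64, u64, String)> {
    let mut msn = first_msn;
    let mut dsn = first_dsn;
    let mut numbered = Vec::new();

    for line in lines {
        if *line == "#EXT-X-DISCONTINUITY" {
            dsn += 1;
        } else if !line.starts_with('#') {
            numbered.push((msn, dsn, line.to_string()));
            msn += 1;
        }
    }

    numbered
}

// number_segments(&["one.ts", "two.ts", "#EXT-X-DISCONTINUITY", "three.ts", "four.ts"], 100, 5)
// reproduces the (msn, dsn) pairs from the second table.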

@dholroyd (Contributor):

Further to my last comment, I also recall that there are not supposed to be any gaps in the sequence of MSN values, but I can't seem to find where this is stated or implied in the spec 😨

(And on that point, the 2nd edition spec introduces the X-GAP tag allowing the server to 'fill time' where media is missing, while maintaining the requirement for monotonically increasing MSN.)

@Luro02 (Collaborator, Author) commented Apr 16, 2020

My reading of the spec is that,

  • media-sequence-number increments by 1 for each segment in the Media Manifest
  • discontinuity-sequence-number increments by 1 for each discontinuity in the Media Manifest

Is that your understanding too?

Yes, that is my understanding too. (Note that I somehow ignored the discontinuity number, which should be a field inside MediaSegment, e.g. MediaSegment::discontinuity_number.)

Further to my last comment, I also recall that there are not supposed to be any gaps in the sequence of MSN values, but I can't seem to find where this is stated or implied in the spec 😨

Each Media Segment MUST carry the continuation of the encoded
bitstream from the end of the segment with the previous Media
Sequence Number, where values in a series such as timestamps and
Continuity Counters MUST continue uninterrupted. The only exceptions
are the first Media Segment ever to appear in a Media Playlist and
Media Segments that are explicitly signaled as discontinuities
(Section 4.3.2.3). Unmarked media discontinuities can trigger
playback errors.

https://tools.ietf.org/html/rfc8216#section-3

The RFC seems to indicate that gaps are allowed?


My quick prototype collection (a wrapper around Vec<Option<T>>) seems to be 17x faster than the BTreeMap:

Benchmarking SortedVec::insert_at: Collecting 10000 samples in estimated 5.0312 s
SortedVec::insert_at    time:   [6.6438 ns 6.6571 ns 6.6705 ns]
                        change: [+4.8759% +5.4093% +6.1275%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1230 outliers among 10000 measurements (12.30%)
  562 (5.62%) high mild
  668 (6.68%) high severe

Benchmarking BTreeMap::insert: Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 8.5s or reduce sample count to 5420.
Benchmarking BTreeMap::insert: Collecting 10000 samples in estimated 8.4930 s
BTreeMap::insert        time:   [96.255 ns 96.414 ns 96.581 ns]
                        change: [-3.7088% -3.0317% -2.2841%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 916 outliers among 10000 measurements (9.16%)
  346 (3.46%) low severe
  291 (2.91%) low mild
  87 (0.87%) high mild
  192 (1.92%) high severe

Benchmarking Vec::push: Collecting 10000 samples in estimated 5.3234 s (450M iterations)
Vec::push               time:   [5.4963 ns 5.5109 ns 5.5257 ns]
                        change: [+8.6607% +10.698% +12.587%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 736 outliers among 10000 measurements (7.36%)
  373 (3.73%) high mild
  363 (3.63%) high severe

I will make a PR against your fork later today (or tomorrow) :)

@dholroyd (Contributor):

Simply using a Vec is not an option, because one could insert an element anywhere and all the following elements would be moved, which destroys the order of the segments.

So I guess I had naively assumed that callers could arrange for segments to be submitted 'in order'. Sounds like you have more advanced usages in mind? For me, it seems reasonable to require that insert-order == manifest-order, but I care far more about parsing today than I do about generating 😄

@dholroyd (Contributor):

Further to my last comment, I also recall that there are not supposed to be any gaps in the sequence of MSN values

The RFC seems to indicate that gaps are allowed?

I'm sorry, that earlier "not supposed to be any gaps" statement was just my confusion :)

@dholroyd (Contributor):

I think adding a parsing library would make it more difficult for people to contribute to this library, because they would first have to understand nom.

I would rather use a custom trait (Parser) and some kind of struct that can be used to get tokens from the stream

I've not done the latter part, but FWIW I pushed a change to the branch which drops nom, so it's 'just' a recursive descent parser in plain old Rust now. The overall structure is the same, but there are no higher-order functions etc. now. NB this parser still depends on the memchr (search for a u8) and lexical-core (parse numbers from &[u8]) crates. There may be a way to get equivalent performance to memchr using only std, but I haven't dug into that yet. The use of lexical-core rather than std is on the basis that the std functions would require a checked UTF-8 conversion up front, and I'm betting that skipping this is a win - I haven't profiled that change separately yet.

I can take a look at moving to some kind of mutable cursor, rather than threading &[u8] through all parser function inputs/outputs.
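A hypothetical illustration of the byte-oriented scanning mentioned above, using memchr to find the end of the current line without any UTF-8 checking (not the code from the branch):

fn next_line<'a>(input: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let rest = &input[*pos..];
    if rest.is_empty() {
        return None;
    }

    match memchr::memchr(b'\n', rest) {
        Some(end) => {
            // return the line without its terminator and advance past it
            *pos += end + 1;
            Some(&rest[..end])
        }
        None => {
            // last line without a trailing newline
            *pos = input.len();
            Some(rest)
        }
    }
}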

Luro02 added a commit to Luro02/hls_m3u8 that referenced this issue Apr 18, 2020
Luro02 added a commit that referenced this issue Apr 18, 2020
@Luro02 (Collaborator, Author) commented Apr 19, 2020

Small update from my side:

I just replaced almost all instances of String with Cow<'a, str>, which gave a huge performance boost:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.1s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.5804 ms 5.5952 ms 5.6146 ms]
                        thrpt:  [54.257 MiB/s 54.444 MiB/s 54.589 MiB/s]
                 change:
                        time:   [-5.6532% -4.4262% -2.9715%] (p = 0.00 < 0.05)
                        thrpt:  [+3.0625% +4.6311% +5.9920%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 13.0s or reduce sample count to 40.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.5043 ms 2.5105 ms 2.5175 ms]
                        thrpt:  [121.00 MiB/s 121.34 MiB/s 121.64 MiB/s]
                 change:
                        time:   [+0.2180% +0.9366% +1.6358%] (p = 0.01 < 0.05)
                        thrpt:  [-1.6094% -0.9279% -0.2175%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

The TryFrom::try_from version is ~44.8% faster than the old implementation.

@dholroyd (Contributor):

Nice!

I've made one attempt at moving my parser over to using a custom Cursor type, changing function signatures from approximately

fn foo(input: &[u8]) -> std::result::Result<(&[u8], T), E> {

to

fn foo(input: &mut Cursor) -> std::result::Result<T, E> {

but I was unable to overcome lifetime issues 😫

@Luro02 (Collaborator, Author) commented Apr 20, 2020

It would make more sense to change the function signatures to something like this:

fn foo<'a>(input: &mut Cursor<'a>) -> Result<T, E>;

and the Cursor could be

use std::marker::PhantomData;

struct Cursor<'a, B = &'a [u8]>
where
    B: Buffer,
{
    buffer: B,
    _p: PhantomData<&'a str>,
}

trait Buffer {
    /// Reads into `buf` and returns the number of bytes read.
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Error>;

    // insert the methods required for parsing here?
    // for example:
    fn tag(&mut self, expected: &[u8]) -> Result<(), Error> {
        // the buffer must be pre-sized, otherwise `read` has nothing to fill
        let mut buffer = vec![0; expected.len()];
        self.read(&mut buffer)?;

        if buffer == expected {
            Ok(())
        } else {
            unimplemented!("return error here")
        }
    }
}

impl<'a> Buffer for &'a [u8] {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Error> {
        unimplemented!()
    }
}

@dholroyd (Contributor):

So to demonstrate where I got to, I've extracted a small portion of the WIP code that shows the problem:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=405eb072639b4e96fb164ba3a3737d21

error[E0499]: cannot borrow `*input` as mutable more than once at a time
  --> src/main.rs:33:37
   |
28 | fn ext_tag<'a, 'b: 'a>(input: &'b mut Cursor<'a>) -> Result<Tag<'a>, ParseError<'a>> {
   |            -- lifetime `'a` defined here
...
31 |     let res = ext_inf(input).map(|(duration, title)| Tag::ExtInf { duration, title });
   |               --------------
   |               |       |
   |               |       first mutable borrow occurs here
   |               argument requires that `*input` is borrowed for `'a`
32 |     if res.is_ok() { return res; }
33 |     let res = ext_program_date_time(input).map(Tag::ExtXProgramDateTime);
   |                                     ^^^^^ second mutable borrow occurs here

As you can see from all the annotations, I've been trying to play lifetime whack-a-mole, and have lost! 😉

There are probably various strategies that can work around the lifetime issues by using additional types to defer ownership checks to runtime, however the goal of my exercise is to remove runtime costs!

@dholroyd (Contributor):

As soon as I posted that last comment, I thought... hmm, those functions look kinda like they want to be instance methods.. 🤔

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=611f33fbdb623986601a26639e6c7e1d

...and indeed this transformation addresses the lifetime issues. I will try this in the real codebase! 😁
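A rough illustration of why the method form helps (simplified, not the playground code): the returned slice borrows from the underlying input for 'a rather than from the &mut self borrow, so the cursor can immediately be used again.

struct Cursor<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Returns the next `n` bytes, tied to the input's lifetime, not to `&mut self`.
    fn take(&mut self, n: usize) -> Option<&'a [u8]> {
        let taken = self.input.get(self.pos..self.pos + n)?;
        self.pos += n;
        Some(taken)
    }
}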

@dholroyd (Contributor):

I believe the mutable Cursor is now working as of https://github.com/dholroyd/hls_m3u8/blob/bf87883501d16ab307010a978666443c25f3b3d7/src/parser.rs#L206

I'll rebase for your latest updates.

Luro02 added a commit to Luro02/hls_m3u8 that referenced this issue Apr 22, 2020
Luro02 added a commit that referenced this issue Apr 22, 2020
@dholroyd (Contributor):

I've rebased my branch on master, gaining the Cow<str> changes too; here are the current standings in the throughput benchmarks:

from_str: ▇▇▇▇▇▇▇▇▇ 65.59
try_from: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 140.44
parser  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 329.78

(MB per second; more is better)

@Luro02 (Collaborator, Author) commented Apr 26, 2020

I just ran some benchmarks on MediaPlaylistBuilder::build, and it seems like almost no time is spent verifying the input:

Current master branch:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 30.2s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.7206 ms 5.7604 ms 5.8021 ms]
                        thrpt:  [52.503 MiB/s 52.883 MiB/s 53.251 MiB/s]
                 change:
                        time:   [+1.8543% +3.5562% +5.2609%] (p = 0.00 < 0.05)
                        thrpt:  [-4.9979% -3.4341% -1.8205%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  8 (8.00%) high mild
  8 (8.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s or reduce sample count to 40.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.5498 ms 2.5571 ms 2.5655 ms]
                        thrpt:  [118.74 MiB/s 119.13 MiB/s 119.47 MiB/s]
                 change:
                        time:   [-6.6711% -5.3264% -4.0808%] (p = 0.00 < 0.05)
                        thrpt:  [+4.2544% +5.6261% +7.1480%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Warming up for 3.0000 s
Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Collecting 100 samples
MediaPlaylistBuilder/MediaPlaylistBuilder::build
                        time:   [623.68 us 647.98 us 677.56 us]
                        change: [+4.7927% +7.0447% +9.8695%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Without MediaPlaylistBuilder::validate:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 31.7s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.6478 ms 5.6973 ms 5.7631 ms]
                        thrpt:  [52.859 MiB/s 53.468 MiB/s 53.937 MiB/s]
                 change:
                        time:   [+5.4297% +7.0603% +8.8352%] (p = 0.00 < 0.05)
                        thrpt:  [-8.1179% -6.5947% -5.1501%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.1s or reduce sample count to 30.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.5807 ms 2.6233 ms 2.6721 ms]
                        thrpt:  [114.01 MiB/s 116.13 MiB/s 118.04 MiB/s]
                 change:
                        time:   [+1.8487% +3.5593% +5.4648%] (p = 0.00 < 0.05)
                        thrpt:  [-5.1817% -3.4370% -1.8151%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) high mild
  7 (7.00%) high severe

Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Warming up for 3.0000 s
Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Collecting 100 samples
MediaPlaylistBuilder/MediaPlaylistBuilder::build
                        time:   [566.60 us 568.95 us 572.01 us]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Without any kind of build verification:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 27.9s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.5241 ms 5.5325 ms 5.5415 ms]
                        thrpt:  [54.972 MiB/s 55.062 MiB/s 55.145 MiB/s]
                 change:
                        time:   [-8.4850% -6.9115% -5.4253%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7365% +7.4246% +9.2717%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.7s or reduce sample count to 40.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.4998 ms 2.5062 ms 2.5129 ms]
                        thrpt:  [121.23 MiB/s 121.55 MiB/s 121.86 MiB/s]
                 change:
                        time:   [-2.5795% -1.8418% -1.0523%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0635% +1.8764% +2.6478%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Warming up for 3.0000 s
Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Collecting 100 samples
MediaPlaylistBuilder/MediaPlaylistBuilder::build
                        time:   [571.44 us 589.05 us 608.32 us]
                        change: [-11.326% -7.9614% -4.7652%] (p = 0.00 < 0.05)
                        Performance has improved.

As can be seen, only a few MiB/s can be gained by skipping the builder validation.

@dholroyd (Contributor):

Yup, I think it's fair to say that validation isn't the biggest concern; I can see it's about 2-3% of this cargo flamegraph profile on master...

(flamegraph screenshot)

...much more significant on master is the amount of time spent in segments() (35-40% of runtime), where all the items collected in the Vec get moved into the StableVec instance. One of the changes in my branch is to have the parsing process construct the StableVec representation directly, so that this work can be avoided:

https://github.com/dholroyd/hls_m3u8/blob/58f354cb06af952037aec385b00aa2bbb6ccfe27/src/parser.rs#L205

For the avoidance of doubt, I do want manifests validated - this is one of my goals in using this crate - it's just that my priority in the short term is iterating on performance. My branch shouldn't be merged without validation working just as well as it does now, and all the other tidy-ups etc. being applied.

@dholroyd (Contributor):

FWIW, on my branch the main offender (according to linux perf) now seems to be memmove...

+   29.90%  throughput  libc-2.30.so       [.] __memmove_avx_unaligned_erms                                                                                                                                        
+    4.88%  throughput  [kernel.kallsyms]  [k] prepare_exit_to_usermode                                                                                                                                            
+    4.37%  throughput  throughput         [.] core::slice::memchr::memchr                                                                                                                                         
+    4.07%  throughput  throughput         [.] hls_m3u8::media_playlist::parse_media_playlist                                                                                                                      
+    3.62%  throughput  throughput         [.] <alloc::vec::Vec<T> as core::clone::Clone>::clone                                                                                                                   
+    2.73%  throughput  [kernel.kallsyms]  [k] native_irq_return_iret                                                                                                                                              
+    2.70%  throughput  [kernel.kallsyms]  [k] clear_page_erms                                                                                                                                                     
+    2.57%  throughput  throughput         [.] core::str::<impl str>::trim                                                                                                                                         
+    2.32%  throughput  throughput         [.] core::num::dec2flt::dec2flt                                                                                                                                         
+    2.29%  throughput  throughput         [.] hls_m3u8::media_playlist::MediaPlaylistBuilder::build                                                                                                               
+    2.13%  throughput  [kernel.kallsyms]  [k] sync_regs                                                                                                                                                           
+    2.08%  throughput  [kernel.kallsyms]  [k] error_entry                                                                                                                                                         
+    2.05%  throughput  [kernel.kallsyms]  [k] swapgs_restore_regs_and_return_to_usermode                                                                                                                          
+    1.83%  throughput  throughput         [.] <hls_m3u8::line::Lines as core::iter::traits::iterator::Iterator>::next                                                                                             
+    1.63%  throughput  throughput         [.] <hls_m3u8::line::Tag as core::convert::TryFrom<&str>>::try_from                                                                                                     
+    1.61%  throughput  throughput         [.] <hls_m3u8::tags::media_segment::inf::ExtInf as core::convert::TryFrom<&str>>::try_from                                                                              
+    1.60%  throughput  throughput         [.] <stable_vec::core::bitvec::BitVecCore<T> as core::clone::Clone>::clone                                                                                              
+    1.40%  throughput  throughput         [.] hls_m3u8::media_playlist::MediaPlaylistBuilder::segments                                                                                                            
+    1.26%  throughput  throughput         [.] hls_m3u8::media_segment::MediaSegmentBuilder::build                                                                                                                 
+    1.18%  throughput  throughput         [.] core::str::SplitInternal<P>::next                                                                                                                                   

Looking at the flamegraph, this cost seems to be spread out across a number of functions, with no obvious 'explicit' data movement that I can tie it to. I am guessing that it must be due to the remaining uses of intermediate representations that need to be copied to arrive at the final MediaSegment value in the vec.

I would like to see if there's a way to arrange things that will convince rustc to optimise more of these moves away / construct the final value in-place. Not sure how, but seems worth a try!

@dholroyd (Contributor):

I've added a couple of additional commits to my branch:

Originally, I had copied the Line and Tag types into the parser (with slightly different signatures to those in master) to decouple the 'parser' from the 'builders'. It turns out that wrapping the intermediate values from the parser in these types was the largest source of the memmove() cost showing up in profiling. dholroyd@da1a53b removes these types and has the parser call methods on the builder directly. This gives me around a 100% increase in throughput in the benchmark, vs. the performance of the parser-perf branch before this change.

Next, dholroyd@c855f48 alters MediaSegment so that all fields are optional (i.e. including the mandatory fields representing the URI and EXTINF values 😱). Then we can have our in-progress segment list end with an 'all-None' MediaSegment and fill in attributes of this segment as they are discovered. On top of the last change, this one gives an additional throughput increase of around 19%.
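A sketch of that 'all-None' in-progress segment idea (the names and fields are illustrative, not the branch's actual types): every field is optional while parsing, and a fresh blank entry is started once the current one has been completed by its URI line.

#[derive(Default)]
struct PartialSegment {
    duration: Option<f64>,
    title: Option<String>,
    uri: Option<String>,
}

#[derive(Default)]
struct Segments {
    list: Vec<PartialSegment>,
}

impl Segments {
    /// Returns the segment currently being filled in, starting a new one if the
    /// last segment was already completed by its URI line.
    fn current(&mut self) -> &mut PartialSegment {
        if self.list.last().map_or(true, |seg| seg.uri.is_some()) {
            self.list.push(PartialSegment::default());
        }
        self.list.last_mut().unwrap()
    }
}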

Current benchmark standings:

from_str: ▇▇▇▇ 63.58
try_from: ▇▇▇▇▇▇▇▇▇▇ 138.08
parser  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 641.73
(MB per second; more is better)

@Luro02 (Collaborator, Author) commented Apr 27, 2020

Wow that is amazing 😍

@dholroyd (Contributor):

dholroyd@21b7fdc puts some restrictions on the format of floating-point numbers when parsing the duration value from every EXTINF tag - another 10% or so of throughput (and it stops us accepting things like 123.0E+77, which the HLS spec does not allow anyway).
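A sketch of that kind of restriction (not the code from the linked commit): only plain digits[.digits] durations are accepted, so exponent forms such as 123.0E+77 are rejected before the float conversion.

fn parse_duration(s: &str) -> Option<f64> {
    let mut parts = s.splitn(2, '.');

    // the integer part must be non-empty ASCII digits
    let whole = parts.next()?;
    if whole.is_empty() || !whole.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }

    // an optional fractional part must also be non-empty ASCII digits
    if let Some(frac) = parts.next() {
        if frac.is_empty() || !frac.bytes().all(|b| b.is_ascii_digit()) {
            return None;
        }
    }

    s.parse::<f64>().ok()
}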

@dholroyd (Contributor):

On my branch right now, the costly activities that stand out in the profile are:

  • 24% - parsing EXTINF duration into f64
  • 15% - pushing new segments onto the StableVec
  • 10% - UTF-8 checks on URI values
  • 6% - DATERANGE parsing
  • 6% - Dropping the StableVec once we are done with a parsed MediaManifest instance

I see maybe one more major win for my usage, which is to start tracking state across multiple subsequent reloads of a live manifest, and to avoid the full parsing pipeline for sections of the manifest that are unchanged. Note that all segments are expected to be the same from one reload to the next, except that 1) one (or more) segment is added onto the end, and 2) one (or more) old segments may be removed from the start. For a very long manifest, the vast majority of segments will be the same, so this might be a profitable strategy (less so for very small live manifests, I expect). I think this is better tackled as a follow-up ticket / PR though.

@dholroyd (Contributor):

I'd like to work towards getting master to the same level of performance as my experimental branch. I think at least the following problems make the work on my branch unsuitable for merging right now:

  • The parser there only handles a tiny subset of HLS tags that are handled by master
  • There's a different interface to my parser (to enable comparative benchmarks vs. existing parser)
  • The code is not tested by any of the existing unit tests
  • The existing validation logic is being skipped
  • Error handling in some spots is missing (println!() etc.)

If these were addressed, would a PR using the design in https://github.com/dholroyd/hls_m3u8/tree/parser-perf be accepted? I'd like to address design concerns before writing lots more code.

@Luro02 (Collaborator, Author) commented Apr 28, 2020

I would prefer it if you could integrate your improvements into the existing types, for example by having a TryFrom<&[u8]> or TryFrom<Cursor> implementation, and the required MediaSegment fields should not be optional. The rest seems fine :)

@Luro02 (Collaborator, Author) commented Apr 28, 2020

It might make sense to create a new trait for this, e.g.

trait ParseFrom: Sized {
    type Error;

    fn try_from_cursor(cursor: &mut Cursor<'_>) -> Result<Self, Self::Error>;
}

// note: a blanket `impl<T: ParseFrom> TryFrom<&[u8]> for T` is ruled out by the
// coherence rules, so the forwarding impl would have to be written (or generated
// by a macro) per type, e.g.:
impl<'a> TryFrom<&'a [u8]> for MediaPlaylist<'a> {
    // ...
}

@Luro02 (Collaborator, Author) commented Apr 28, 2020

Maybe something like this would be better:

use std::io::Read;

pub enum Poll<T> {
    Ready(T),
    Missing(Option<usize>),
}

trait Parser: Sized {
    type Error;

    fn parse<R: Read>(reader: R) -> Poll<Result<Self, Self::Error>>;
}

I think it would currently be better not to expose your parser to the user, so it can be changed in the future.
