Improve performance #52

Open · Luro02 opened this issue Apr 15, 2020 · 31 comments
Labels: A-parser (Area: parser), help wanted (Extra attention is needed)

@Luro02 (Collaborator) commented Apr 15, 2020

As mentioned in PR #51, parsing MediaPlaylist files with MediaPlaylist::from_str is quite slow, and this can definitely be improved :)


     Running target/release/deps/parse-0d26cc9e51751f86
Benchmarking parser/throughput: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 25.3s or reduce sample count to 30.
Benchmarking parser/throughput: Collecting 100 samples in estimated 25.304 s
parser/throughput       time:   [4.6647 ms 4.6793 ms 4.6968 ms]
                        thrpt:  [64.794 MiB/s 65.036 MiB/s 65.240 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Luro02 added the help wanted (Extra attention is needed) and A-parser (Area: parser) labels on Apr 15, 2020
@Luro02 (Collaborator, Author) commented Apr 15, 2020

An obvious solution would be to use Cow instead of String, which could avoid allocations. Note that this would not improve the MediaPlaylist::from_str performance, because FromStr gives the struct no way to borrow from the input, so it has to take ownership of the data; one would have to use TryFrom instead:

pub struct MediaPlaylist<'a> { /* ... */ }

impl<'a> TryFrom<&'a str> for MediaPlaylist<'a> {
    type Error = Error;
    fn try_from(input: &'a str) -> Result<Self, Self::Error> {
        parse_media_playlist(input, &mut MediaPlaylist::builder())
    }
}

One might also consider adding beef (a faster and smaller Cow implementation) as an optional dependency.
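A minimal sketch of the idea, assuming an illustrative Segment type rather than the crate's actual structs:

use std::borrow::Cow;

// Borrowing from the playlist text avoids allocating a fresh String per value.
struct Segment<'a> {
    uri: Cow<'a, str>,
}

fn parse_uri(line: &str) -> Segment<'_> {
    // no allocation here - the segment simply borrows from the input
    Segment {
        uri: Cow::Borrowed(line.trim()),
    }
}

fn parse_uri_owned(line: &str) -> Segment<'static> {
    // an owned variant remains possible when the input cannot be borrowed
    Segment {
        uri: Cow::Owned(line.trim().to_string()),
    }
}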

@Luro02 (Collaborator, Author) commented Apr 15, 2020

smallvec could also improve the performance, but I think it would be better to wait for const generics to stabilize before depending on it.

All places where it could be used (I think):

  • hls_m3u8::types::Codecs
  • hls_m3u8::types::KeyFormatVersions
  • hls_m3u8::types::Value::Hex

MediaPlaylist

  • hls_m3u8::tags::ExtXMap::keys
  • hls_m3u8::MediaPlaylist::unknown
  • hls_m3u8::MediaSegment::keys

MasterPlaylist:

  • hls_m3u8::MasterPlaylist::media
  • hls_m3u8::MasterPlaylist::variant_streams
  • hls_m3u8::MasterPlaylist::session_data
  • hls_m3u8::MasterPlaylist::session_keys
  • hls_m3u8::MasterPlaylist::unknown_tags (small note: this field should be called unknown)

but I am not sure if it should be used as a drop-in replacement for Vec or just for things that are known to be used infrequently (for example, right now the spec only allows a single DecryptionKey, but they are stored in a Vec, because the spec indicates that new keys might be added with different identities).

Why smallvec?

I like that it automatically allocates on the heap if the size exceeds the inline array, and it is used by the Rust compiler and Servo, so it is quite well supported.
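A minimal sketch of that spill behaviour, using a plain u32 payload instead of one of the crate's types:

use smallvec::{smallvec, SmallVec};

fn main() {
    // stays inline (no heap allocation) while it holds at most one element...
    let mut keys: SmallVec<[u32; 1]> = smallvec![1];
    assert!(!keys.spilled());

    // ...and transparently spills to the heap once the inline capacity is exceeded
    keys.push(2);
    assert!(keys.spilled());
}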

other crates that I know of:

@Luro02 (Collaborator, Author) commented Apr 15, 2020

Maybe it would be possible to parse tags in parallel? I think that would be really difficult, because the m3u8 file format seems to be designed to be parsed sequentially (some tags depend on the previous or next tag).

@Luro02 (Collaborator, Author) commented Apr 15, 2020

Most likely the culprit is the BTreeMap that contains all the MediaSegments, which is quite inefficient with many segments, so maybe it would be better to use a HashMap and then sort the segments before writing them out with fmt::Display?
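A rough sketch of that suggestion (the Playlist type and String payload are illustrative, not the crate's real types): segments live in a HashMap keyed by their sequence number and are only sorted when the playlist is written out.

use std::collections::HashMap;
use std::fmt;

struct Playlist {
    segments: HashMap<usize, String>, // sequence number -> segment line
}

impl fmt::Display for Playlist {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // sort once per write instead of keeping an ordered map
        let mut ordered: Vec<_> = self.segments.iter().collect();
        ordered.sort_by_key(|(number, _)| **number);

        for (_, segment) in ordered {
            writeln!(f, "{}", segment)?;
        }
        Ok(())
    }
}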

@dholroyd (Contributor) commented Apr 15, 2020

I made some experimental changes to try to get a speed-up. I'm afraid I gutted the code and will have broken lots of existing use cases / tests, so I can share a branch, but not provide a PR yet.

  • Changing the segments representation from BTreeMap<usize, MediaSegment> to Vec<MediaSegment> was a good win
  • Changing keys to a SmallVec<[ExtXKey; 1]> also helped
  • I just commented out the code handling the description from the EXTINF line. Not great, but it avoids another allocation per segment
  • Switched to hand-coded builder implementations for the types being exercised, since many of the #[derive(builder)] setter methods seem to clone their arguments (I did not investigate this much to see if there was another way)
  • And the big one (in terms of lines added): switched to a nom5-based parser for the subset of syntax I'm testing
    • it's single-pass for the most part (the current hls_m3u8 parser first scans for end-of-line and, having found a line, parses it - so it's more like a two-pass process)
    • the parser is byte-oriented and defers UTF-8 conversion to only those elements, like URIs, that might actually contain non-ASCII content (tag content like EXTINF stays in the ASCII subset of UTF-8 and can, I think, be safely parsed as bytes)
    • the parser is 'left factored' - i.e. a rule recognises #EXT and then sub-rules try to recognise e.g. the -X-MEDIA-SEQUENCE: suffix (the current parser re-tests the #EXT prefix when attempting to match each tag, unless the optimiser is doing very clever tricks!) - see the sketch after this list
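A hypothetical illustration of that left factoring (not the code from the branch): the #EXT prefix is matched once, and only the remaining suffix is compared per tag.

fn tag_name(line: &[u8]) -> Option<&'static str> {
    // match the shared "#EXT" prefix exactly once
    let rest = line.strip_prefix(b"#EXT")?;

    // then dispatch on the remaining suffix only
    if rest.starts_with(b"-X-MEDIA-SEQUENCE:") {
        Some("EXT-X-MEDIA-SEQUENCE")
    } else if rest.starts_with(b"-X-VERSION:") {
        Some("EXT-X-VERSION")
    } else if rest.starts_with(b"INF:") {
        Some("EXTINF")
    } else {
        None
    }
}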

With all that, I see the benchmark go from around 80 MB/s to:

Benchmarking parser/throughput: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s or reduce sample count to 60.
parser/throughput       time:   [1.0120 ms 1.0175 ms 1.0243 ms]                               
                        thrpt:  [297.09 MiB/s 299.10 MiB/s 300.72 MiB/s]

Will try to push the branch to share work in progress later.

@dholroyd (Contributor):

WIP over here: dholroyd@f601da1

Benchmarking parser/from_str parser: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s or reduce sample count to 50.
parser/from_str parser  time:   [1.6824 ms 1.6864 ms 1.6912 ms]                                    
                        thrpt:  [179.94 MiB/s 180.46 MiB/s 180.89 MiB/s]
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
Benchmarking parser/nom parser: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.4s or reduce sample count to 60.
parser/nom parser       time:   [1.0867 ms 1.0916 ms 1.0976 ms]                               
                        thrpt:  [277.27 MiB/s 278.78 MiB/s 280.03 MiB/s]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

It's not intended to be ready to commit yet. In particular, lots of MediaPlaylistBuilder::build() is commented out because I didn't understand it yet (and the switch to Vec<MediaSegment> breaks it). Sorry - I made a change to the data structure without really understanding its function! Could you explain why the sequence numbers are inside MediaSegment and also tracked via the map key?

I'm not particularly committed to using nom; it's just a tool I've tried out recently. The parser only handles a small portion of the syntax that the FromStr version covers; that would need loads more work. This is an opportunity to add 'Span' support, though.

I did not yet try your idea to handle uri / description with Cow (or beef) - could well help further.

@Luro02 (Collaborator, Author) commented Apr 16, 2020

Sorry - I made a change to the data structure without really understanding its function!

Do not worry, all help is appreciated 🤗

Could you explain why the sequence numbers are inside MediaSegment and also tracked via the Map key?

A MediaSegment describes a small chunk of the output stream, and the player might want to switch to a lower or higher quality stream. According to the RFC, one should not rely on the Duration; instead, each segment has a number assigned. The first segment has the discontinuity sequence number and all the following ones are previous + 1. This makes it possible to switch streams by fetching the next segment from another VariantStream with the last sequence number + 1. (I might be wrong, so maybe it would be better to look at the EXT-X-DISCONTINUITY documentation in the RFC.)

I had quite a difficult time writing code that keeps the insertion order of the provided segments vector and that also allows for explicitly numbered segments (the implicitly numbered segments should fill in the free spaces in the vector).

I chose a BTreeMap as the data structure because you can sort the segments by their number, it does not allow two segments with the same number, and you can iterate through them in the correct order. The problem is that a single insert into a BTreeMap costs O(log n), so inserting n segments costs O(n log n) overall (see here for a visualization of what this means: https://miro.medium.com/max/1400/1*5ZLci3SuR0zM_QlZOADv8Q.jpeg), which means for our case that the more segments you insert, the slower it becomes (a lot slower).

We could replace the BTreeMap with a custom data structure that takes advantage of the fact that the segment numbers are just indexes (number - discontinuity number). I will look into this a bit more later today and write a custom data structure for it.

Simply using a Vec is not an option, because inserting an element in the middle would shift all the following elements, which breaks the correspondence between segment numbers and positions.
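A quick sketch of the idea, not the eventual implementation: the segment number (minus the starting number) is used directly as an index into a Vec<Option<T>>, so explicitly numbered inserts land in their own slot without shifting any existing elements.

struct SlotVec<T> {
    slots: Vec<Option<T>>,
}

impl<T> SlotVec<T> {
    fn new() -> Self {
        Self { slots: Vec::new() }
    }

    /// Places `value` at `index`, growing the slot list with `None`s as needed.
    fn insert_at(&mut self, index: usize, value: T) {
        if index >= self.slots.len() {
            self.slots.resize_with(index + 1, || None);
        }
        self.slots[index] = Some(value);
    }

    /// Iterates over the filled slots in numeric order.
    fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        self.slots.iter().filter_map(Option::as_ref)
    }
}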

I'm not particularly committed to using nom

I used it quite a while ago, when you had to use macros for everything, which was quite confusing/difficult for me, and I think adding a parsing library would make it more difficult for people to contribute to this library, because they would first have to understand nom.

I would rather use a custom trait (Parser) and some kind of struct that can be used to get tokens from the stream (see #46).

How much did this improve performance?

This is an opportunity to add 'Span' support though.

I would hold off on that... this is how I ended up rewriting most of this library, because I had so many ideas that I wanted to apply 😅.

@dholroyd (Contributor) commented Apr 16, 2020

This makes it possible to switch the streams, by fetching the next segment from another VariantStream with the last sequence number+1

My reading of the spec is that,

  • media-sequence-number increments by 1 for each segment in the Media Manifest
  • discontinuity-sequence-number increments by 1 for each discontinuity in the Media Manifest

Is that your understanding too?

So, writing out explicitly the per-segment values that are implied by the manifest + spec-rules:

msn  dsn  m3u8
0    0    one.ts
1    0    two.ts
.    .    #EXT-X-DISCONTINUITY
2    1    three.ts
3    1    four.ts

Or with explicit headers,

msn  dsn  m3u8
.    .    #EXT-X-MEDIA-SEQUENCE:100
.    .    #EXT-X-DISCONTINUITY-SEQUENCE:5
100  5    one.ts
101  5    two.ts
.    .    #EXT-X-DISCONTINUITY
102  6    three.ts
103  6    four.ts
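A hypothetical walk-through of the numbering rules shown in the tables above: every URI line gets the next media sequence number, and each #EXT-X-DISCONTINUITY tag bumps the discontinuity sequence number.

fn number_segments(lines: &[&str], first_msn: u64, first_dsn: u64) -> Vec<(u64, u64, String)> {
    let mut msn = first_msn;
    let mut dsn = first_dsn;
    let mut numbered = Vec::new();

    for line in lines {
        if *line == "#EXT-X-DISCONTINUITY" {
            dsn += 1;
        } else if !line.starts_with('#') {
            numbered.push((msn, dsn, line.to_string()));
            msn += 1;
        }
    }

    numbered
}

// number_segments(&["one.ts", "two.ts", "#EXT-X-DISCONTINUITY", "three.ts", "four.ts"], 100, 5)
// reproduces the (msn, dsn) pairs from the second table.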

@dholroyd (Contributor):

Further to my last comment, I also recall that there are not supposed to be any gaps in the sequence of MSN values, but I can't seem to find where this is stated or implied in the spec 😨

(And on that point, the 2nd edition spec introduces the X-GAP tag allowing the server to 'fill time' where media is missing, while maintaining the requirement for monotonically increasing MSN.)

@Luro02 (Collaborator, Author) commented Apr 16, 2020

My reading of the spec is that,

  • media-sequence-number increments by 1 for each segment in the Media Manifest
  • discontinuity-sequence-number increments by 1 for each discontinuity in the Media Manifest

Is that your understanding too?

Yes, that is my understanding too. (Note that I somehow ignored the discontinuity number, which should be a field inside MediaSegment, e.g. MediaSegment::discontinuity_number.)

Further to my last comment, I also recall that there are not supposed to be any gaps in the sequence of MSN values, but I can't seem to find where this is stated or implied in the spec 😨

Each Media Segment MUST carry the continuation of the encoded
bitstream from the end of the segment with the previous Media
Sequence Number, where values in a series such as timestamps and
Continuity Counters MUST continue uninterrupted. The only exceptions
are the first Media Segment ever to appear in a Media Playlist and
Media Segments that are explicitly signaled as discontinuities
(Section 4.3.2.3). Unmarked media discontinuities can trigger
playback errors.

https://tools.ietf.org/html/rfc8216#section-3

The RFC seems to indicate that gaps are allowed?


My quick prototype collection (a wrapper around Vec<Option<T>>) seems to be 17x faster than the BTreeMap:

Benchmarking SortedVec::insert_at: Collecting 10000 samples in estimated 5.0312 s
SortedVec::insert_at    time:   [6.6438 ns 6.6571 ns 6.6705 ns]
                        change: [+4.8759% +5.4093% +6.1275%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1230 outliers among 10000 measurements (12.30%)
  562 (5.62%) high mild
  668 (6.68%) high severe

Benchmarking BTreeMap::insert: Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 8.5s or reduce sample count to 5420.
Benchmarking BTreeMap::insert: Collecting 10000 samples in estimated 8.4930 s
BTreeMap::insert        time:   [96.255 ns 96.414 ns 96.581 ns]
                        change: [-3.7088% -3.0317% -2.2841%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 916 outliers among 10000 measurements (9.16%)
  346 (3.46%) low severe
  291 (2.91%) low mild
  87 (0.87%) high mild
  192 (1.92%) high severe

Benchmarking Vec::push: Collecting 10000 samples in estimated 5.3234 s (450M iterations)
Vec::push               time:   [5.4963 ns 5.5109 ns 5.5257 ns]
                        change: [+8.6607% +10.698% +12.587%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 736 outliers among 10000 measurements (7.36%)
  373 (3.73%) high mild
  363 (3.63%) high severe

I will make a PR against your fork later today (or tomorrow) :)

@dholroyd (Contributor):

Simply using a Vec is not an option, because one could insert an element anywhere and all the following elements would be moved, which destroys the order of the segments.

So I guess I had naively assumed that callers could arrange for segments to be submitted 'in order'. Sounds like you have more advanced usages in mind? For me, it seems reasonable to require that insert-order == manifest-order, but I care far more about parsing today than I do about generating 😄

@dholroyd (Contributor):

Further to my last comment, I also recall that there are not supposed to be any gaps in the sequence of MSN values

The RFC seems to indicate that gaps are allowed?

I'm sorry, that earlier "not supposed to be any gaps" statement was just my confusion :)

@dholroyd (Contributor):

I think adding a parsing library would make it more difficult for people to contribute to this library, because they would first have to understand nom.

I would rather use a custom trait (Parser) and some kind of struct that can be used to get tokens from the stream

I've not done the latter part, but FWIW I pushed a change to the branch which drops nom, so it's 'just' a recursive descent parser in plain old Rust now. The overall structure is the same, but there are no higher-order functions etc. now. NB this parser still depends on the memchr (search for a u8) and lexical-core (parse numbers from &[u8]) crates. There may be a way to get equivalent performance to memchr using only std, but I haven't dug into that yet. The use of lexical-core rather than std is on the basis that the std functions would require a checked UTF-8 conversion up front, and I'm betting that skipping this is a win - I haven't profiled that change separately yet.

I can take a look at moving to some kind of mutable cursor, rather than threading &[u8] through all parser function inputs/outputs.
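A hypothetical illustration of the byte-oriented scanning mentioned above, using memchr to find the end of the current line without any UTF-8 checking (not the code from the branch):

fn next_line<'a>(input: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let rest = &input[*pos..];
    if rest.is_empty() {
        return None;
    }

    match memchr::memchr(b'\n', rest) {
        Some(end) => {
            // return the line without its terminator and advance past it
            *pos += end + 1;
            Some(&rest[..end])
        }
        None => {
            // last line without a trailing newline
            *pos = input.len();
            Some(rest)
        }
    }
}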

Luro02 added a commit to Luro02/hls_m3u8 that referenced this issue Apr 18, 2020
Luro02 added a commit that referenced this issue Apr 18, 2020
@Luro02 (Collaborator, Author) commented Apr 19, 2020

Small update from my side:

I just replaced almost all instances of String with Cow<'a, str>, which gave a huge performance boost:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.1s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.5804 ms 5.5952 ms 5.6146 ms]
                        thrpt:  [54.257 MiB/s 54.444 MiB/s 54.589 MiB/s]
                 change:
                        time:   [-5.6532% -4.4262% -2.9715%] (p = 0.00 < 0.05)
                        thrpt:  [+3.0625% +4.6311% +5.9920%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 13.0s or reduce sample count to 40.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.5043 ms 2.5105 ms 2.5175 ms]
                        thrpt:  [121.00 MiB/s 121.34 MiB/s 121.64 MiB/s]
                 change:
                        time:   [+0.2180% +0.9366% +1.6358%] (p = 0.01 < 0.05)
                        thrpt:  [-1.6094% -0.9279% -0.2175%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

The TryFrom::try_from version is ~44.8% faster than the old implementation.

@dholroyd (Contributor):

Nice!

I've made one attempt at moving my parser over to using a custom Cursor type, changing function signatures from approximately

fn foo(input: &[u8]) -> std::result::Result<(&[u8], T), E> {

to

fn foo(input: &mut Cursor) -> std::result::Result<T, E> {

but I was unable to overcome lifetime issues 😫

@Luro02 (Collaborator, Author) commented Apr 20, 2020

It would make more sense to change the function signatures to something like this:

fn foo<'a>(input: &mut Cursor<'a>) -> Result<T, E>;

and the Cursor could be

use std::marker::PhantomData;

struct Cursor<'a, B = &'a [u8]>
where
    B: Buffer,
{
    buffer: B,
    _p: PhantomData<&'a str>,
}

trait Buffer {
    /// Reads into `buf` and returns the number of bytes read.
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Error>;

    // insert the methods required for parsing here?
    // for example:
    fn tag(&mut self, expected: &[u8]) -> Result<(), Error> {
        // the buffer must be pre-sized, otherwise `read` has nothing to fill
        let mut buffer = vec![0; expected.len()];
        self.read(&mut buffer)?;

        if buffer == expected {
            Ok(())
        } else {
            unimplemented!("return error here")
        }
    }
}

impl<'a> Buffer for &'a [u8] {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Error> {
        unimplemented!()
    }
}

@dholroyd (Contributor):

So to demonstrate where I got to, I've extracted a small portion of the WIP code that shows the problem:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=405eb072639b4e96fb164ba3a3737d21

error[E0499]: cannot borrow `*input` as mutable more than once at a time
  --> src/main.rs:33:37
   |
28 | fn ext_tag<'a, 'b: 'a>(input: &'b mut Cursor<'a>) -> Result<Tag<'a>, ParseError<'a>> {
   |            -- lifetime `'a` defined here
...
31 |     let res = ext_inf(input).map(|(duration, title)| Tag::ExtInf { duration, title });
   |               --------------
   |               |       |
   |               |       first mutable borrow occurs here
   |               argument requires that `*input` is borrowed for `'a`
32 |     if res.is_ok() { return res; }
33 |     let res = ext_program_date_time(input).map(Tag::ExtXProgramDateTime);
   |                                     ^^^^^ second mutable borrow occurs here

As you can see from all the annotations, I've been trying to play lifetime whack-a-mole, and have lost! 😉

There are probably various strategies that can work around the lifetime issues by using additional types to defer ownership checks to runtime, however the goal of my exercise is to remove runtime costs!

@dholroyd (Contributor):

As soon as I posted that last comment, I thought... hmm, those functions look kinda like they want to be instance methods.. 🤔

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=611f33fbdb623986601a26639e6c7e1d

...and indeed this transformation addresses the lifetime issues. I will try this in the real codebase! 😁
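A rough illustration of why the method form helps (simplified, not the playground code): the returned slice borrows from the underlying input for 'a rather than from the &mut self borrow, so the cursor can immediately be used again.

struct Cursor<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Returns the next `n` bytes, tied to the input's lifetime, not to `&mut self`.
    fn take(&mut self, n: usize) -> Option<&'a [u8]> {
        let taken = self.input.get(self.pos..self.pos + n)?;
        self.pos += n;
        Some(taken)
    }
}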

@dholroyd (Contributor):

I believe the mutable Cursor is now working as of https://github.com/dholroyd/hls_m3u8/blob/bf87883501d16ab307010a978666443c25f3b3d7/src/parser.rs#L206

I'll rebase for your latest updates.

Luro02 added a commit to Luro02/hls_m3u8 that referenced this issue Apr 22, 2020
Luro02 added a commit that referenced this issue Apr 22, 2020
@dholroyd (Contributor):

I've rebased my branch on master, gaining the Cow<str> changes too; here are the current standings in the throughput benchmarks:

from_str: ▇▇▇▇▇▇▇▇▇ 65.59
try_from: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 140.44
parser  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 329.78

(MB per second; more is better)

@Luro02 (Collaborator, Author) commented Apr 26, 2020

I just ran some benchmarks on MediaPlaylistBuilder::build, and it seems like almost no time is spent verifying the input:

Current master branch:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 30.2s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.7206 ms 5.7604 ms 5.8021 ms]
                        thrpt:  [52.503 MiB/s 52.883 MiB/s 53.251 MiB/s]
                 change:
                        time:   [+1.8543% +3.5562% +5.2609%] (p = 0.00 < 0.05)
                        thrpt:  [-4.9979% -3.4341% -1.8205%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  8 (8.00%) high mild
  8 (8.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s or reduce sample count to 40.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.5498 ms 2.5571 ms 2.5655 ms]
                        thrpt:  [118.74 MiB/s 119.13 MiB/s 119.47 MiB/s]
                 change:
                        time:   [-6.6711% -5.3264% -4.0808%] (p = 0.00 < 0.05)
                        thrpt:  [+4.2544% +5.6261% +7.1480%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Warming up for 3.0000 s
Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Collecting 100 samples
MediaPlaylistBuilder/MediaPlaylistBuilder::build
                        time:   [623.68 us 647.98 us 677.56 us]
                        change: [+4.7927% +7.0447% +9.8695%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Without MediaPlaylistBuilder::validate:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 31.7s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.6478 ms 5.6973 ms 5.7631 ms]
                        thrpt:  [52.859 MiB/s 53.468 MiB/s 53.937 MiB/s]
                 change:
                        time:   [+5.4297% +7.0603% +8.8352%] (p = 0.00 < 0.05)
                        thrpt:  [-8.1179% -6.5947% -5.1501%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.1s or reduce sample count to 30.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.5807 ms 2.6233 ms 2.6721 ms]
                        thrpt:  [114.01 MiB/s 116.13 MiB/s 118.04 MiB/s]
                 change:
                        time:   [+1.8487% +3.5593% +5.4648%] (p = 0.00 < 0.05)
                        thrpt:  [-5.1817% -3.4370% -1.8151%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) high mild
  7 (7.00%) high severe

Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Warming up for 3.0000 s
Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Collecting 100 samples
MediaPlaylistBuilder/MediaPlaylistBuilder::build
                        time:   [566.60 us 568.95 us 572.01 us]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Without any kind of build verification:

Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 27.9s or reduce sample count to 20.
Benchmarking MediaPlaylist::from_str/MediaPlaylist::from_str: Collecting 100 samples
MediaPlaylist::from_str/MediaPlaylist::from_str
                        time:   [5.5241 ms 5.5325 ms 5.5415 ms]
                        thrpt:  [54.972 MiB/s 55.062 MiB/s 55.145 MiB/s]
                 change:
                        time:   [-8.4850% -6.9115% -5.4253%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7365% +7.4246% +9.2717%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.7s or reduce sample count to 40.
Benchmarking MediaPlaylist::try_from/MediaPlaylist::try_from: Collecting 100 samples
MediaPlaylist::try_from/MediaPlaylist::try_from
                        time:   [2.4998 ms 2.5062 ms 2.5129 ms]
                        thrpt:  [121.23 MiB/s 121.55 MiB/s 121.86 MiB/s]
                 change:
                        time:   [-2.5795% -1.8418% -1.0523%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0635% +1.8764% +2.6478%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Warming up for 3.0000 s
Benchmarking MediaPlaylistBuilder/MediaPlaylistBuilder::build: Collecting 100 samples
MediaPlaylistBuilder/MediaPlaylistBuilder::build
                        time:   [571.44 us 589.05 us 608.32 us]
                        change: [-11.326% -7.9614% -4.7652%] (p = 0.00 < 0.05)
                        Performance has improved.

As can be seen, only a few MiB/s can be gained by skipping the builder validation.

@dholroyd (Contributor):

Yup, I think it's fair to say that validation isn't the biggest concern; I can see it's about 2-3% of this cargo flamegraph profile on master...

(flamegraph screenshot)

...much more significant on master is the amount of time spent in segments() (35-40% of runtime), where all the items collected in the Vec get moved into the StableVec instance. One of the changes in my branch is to have the parsing process construct the StableVec representation directly, so that this work can be avoided:

https://github.com/dholroyd/hls_m3u8/blob/58f354cb06af952037aec385b00aa2bbb6ccfe27/src/parser.rs#L205

For the avoidance of doubt, I do want manifests validated - this is one of my goals in using this crate - it's just that my priority in the short term is iterating on performance. My branch shouldn't be merged without validation working just as well as it does now, and all the other tidy-ups etc. being applied.

@dholroyd (Contributor):

FWIW, on my branch the main offender (according to linux perf) now seems to be memmove...

+   29.90%  throughput  libc-2.30.so       [.] __memmove_avx_unaligned_erms                                                                                                                                        
+    4.88%  throughput  [kernel.kallsyms]  [k] prepare_exit_to_usermode                                                                                                                                            
+    4.37%  throughput  throughput         [.] core::slice::memchr::memchr                                                                                                                                         
+    4.07%  throughput  throughput         [.] hls_m3u8::media_playlist::parse_media_playlist                                                                                                                      
+    3.62%  throughput  throughput         [.] <alloc::vec::Vec<T> as core::clone::Clone>::clone                                                                                                                   
+    2.73%  throughput  [kernel.kallsyms]  [k] native_irq_return_iret                                                                                                                                              
+    2.70%  throughput  [kernel.kallsyms]  [k] clear_page_erms                                                                                                                                                     
+    2.57%  throughput  throughput         [.] core::str::<impl str>::trim                                                                                                                                         
+    2.32%  throughput  throughput         [.] core::num::dec2flt::dec2flt                                                                                                                                         
+    2.29%  throughput  throughput         [.] hls_m3u8::media_playlist::MediaPlaylistBuilder::build                                                                                                               
+    2.13%  throughput  [kernel.kallsyms]  [k] sync_regs                                                                                                                                                           
+    2.08%  throughput  [kernel.kallsyms]  [k] error_entry                                                                                                                                                         
+    2.05%  throughput  [kernel.kallsyms]  [k] swapgs_restore_regs_and_return_to_usermode                                                                                                                          
+    1.83%  throughput  throughput         [.] <hls_m3u8::line::Lines as core::iter::traits::iterator::Iterator>::next                                                                                             
+    1.63%  throughput  throughput         [.] <hls_m3u8::line::Tag as core::convert::TryFrom<&str>>::try_from                                                                                                     
+    1.61%  throughput  throughput         [.] <hls_m3u8::tags::media_segment::inf::ExtInf as core::convert::TryFrom<&str>>::try_from                                                                              
+    1.60%  throughput  throughput         [.] <stable_vec::core::bitvec::BitVecCore<T> as core::clone::Clone>::clone                                                                                              
+    1.40%  throughput  throughput         [.] hls_m3u8::media_playlist::MediaPlaylistBuilder::segments                                                                                                            
+    1.26%  throughput  throughput         [.] hls_m3u8::media_segment::MediaSegmentBuilder::build                                                                                                                 
+    1.18%  throughput  throughput         [.] core::str::SplitInternal<P>::next                                                                                                                                   

Looking at the flamegraph, this cost seems to be spread out across a number of functions, with no obvious 'explicit' data movement that I can tie it to. I am guessing that it must be due to the remaining uses of intermediate representations that need to be copied to arrive at the final MediaSegment value in the vec.

I would like to see if there's a way to arrange things that will convince rustc to optimise more of these moves away / construct the final value in-place. Not sure how, but seems worth a try!

@dholroyd (Contributor):

I've added a couple of additional commits to my branch:

Originally, I had copied the Line and Tag types into the parser (with slightly different signatures to those in master) to decouple the 'parser' from the 'builders'. It turns out that wrapping the intermediate values from the parser in these types was the largest source of the memmove() cost showing up in profiling. dholroyd@da1a53b removes these types and has the parser call methods on the builder directly. This gives me around a 100% increase in throughput in the benchmark, vs. the performance of the parser-perf branch before this change.

Next, dholroyd@c855f48 alters MediaSegment so that all fields are optional (i.e. including the mandatory fields representing the URI and EXTINF values 😱). Then we can have our in-progress segment list end with an 'all-None' MediaSegment and fill in attributes of this segment as they are discovered. On top of the last change, this one gives an additional throughput increase of around 19%.
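A sketch of that 'all-None' in-progress segment idea (the names and fields are illustrative, not the branch's actual types): every field is optional while parsing, and a fresh blank entry is started once the current one has been completed by its URI line.

#[derive(Default)]
struct PartialSegment {
    duration: Option<f64>,
    title: Option<String>,
    uri: Option<String>,
}

#[derive(Default)]
struct Segments {
    list: Vec<PartialSegment>,
}

impl Segments {
    /// Returns the segment currently being filled in, starting a new one if the
    /// last segment was already completed by its URI line.
    fn current(&mut self) -> &mut PartialSegment {
        if self.list.last().map_or(true, |seg| seg.uri.is_some()) {
            self.list.push(PartialSegment::default());
        }
        self.list.last_mut().unwrap()
    }
}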

Current benchmark standings:

from_str: ▇▇▇▇ 63.58
try_from: ▇▇▇▇▇▇▇▇▇▇ 138.08
parser  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 641.73
(MB per second; more is better)

@Luro02 (Collaborator, Author) commented Apr 27, 2020

Wow that is amazing 😍

@dholroyd (Contributor):

dholroyd@21b7fdc puts some restrictions on the format of floating-point numbers when parsing the duration value from every EXTINF tag - another 10% or so of throughput (and it stops us accepting things like 123.0E+77, which the HLS spec does not allow anyway).
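A sketch of that kind of restriction (not the code from the linked commit): only plain digits[.digits] durations are accepted, so exponent forms such as 123.0E+77 are rejected before the float conversion.

fn parse_duration(s: &str) -> Option<f64> {
    let mut parts = s.splitn(2, '.');

    // the integer part must be non-empty ASCII digits
    let whole = parts.next()?;
    if whole.is_empty() || !whole.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }

    // an optional fractional part must also be non-empty ASCII digits
    if let Some(frac) = parts.next() {
        if frac.is_empty() || !frac.bytes().all(|b| b.is_ascii_digit()) {
            return None;
        }
    }

    s.parse::<f64>().ok()
}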

@dholroyd (Contributor):

On my branch right now, the costly activities that stand out in the profile are:

  • 24% - parsing EXTINF duration into f64
  • 15% - pushing new segments onto the StableVec
  • 10% - UTF-8 checks on URI values
  • 6% - DATERANGE parsing
  • 6% - Dropping the StableVec once we are done with a parsed MediaManifest instance

I see maybe one more major win for my usage, which is to start tracking state across multiple subsequent reloads of a live manifest, and to avoid the full parsing pipeline for sections of the manifest that are unchanged. Note that all segments are expected to be the same from one reload to the next, except that 1) one (or more) segment is added onto the end, and 2) one (or more) old segments may be removed from the start. For a very long manifest, the vast majority of segments will be the same, so this might be a profitable strategy (less so for very small live manifests, I expect). I think this is better tackled as a follow-up ticket / PR though.

@dholroyd (Contributor):

I'd like to work towards getting master to the same level of performance as my experimental branch. I think at least the following problems make the work on my branch unsuitable for merging right now:

  • The parser there only handles a tiny subset of HLS tags that are handled by master
  • There's a different interface to my parser (to enable comparative benchmarks vs. existing parser)
  • The code is not tested by any of the existing unit tests
  • The existing validation logic is being skipped
  • Error handling in some spots is missing (println!() etc.)

If these were addressed, would a PR using the design in https://github.com/dholroyd/hls_m3u8/tree/parser-perf be accepted? I'd like to address design concerns before writing lots more code.

@Luro02 (Collaborator, Author) commented Apr 28, 2020

I would prefer it if you could integrate your improvements into the existing types, for example by having a TryFrom<&[u8]> or TryFrom<Cursor> implementation, and the required MediaSegment fields should not be optional. The rest seems fine :)

@Luro02 (Collaborator, Author) commented Apr 28, 2020

It might make sense to create a new trait for this, e.g.

trait ParseFrom: Sized {
    type Error;

    fn try_from_cursor(cursor: &mut Cursor<'_>) -> Result<Self, Self::Error>;
}

// note: a blanket `impl<T: ParseFrom> TryFrom<&[u8]> for T` is ruled out by the
// coherence rules, so the forwarding impl would have to be written (or generated
// by a macro) per type, e.g.:
impl<'a> TryFrom<&'a [u8]> for MediaPlaylist<'a> {
    // ...
}

@Luro02 (Collaborator, Author) commented Apr 28, 2020

Maybe something like this would be better:

use std::io::Read;

pub enum Poll<T> {
    Ready(T),
    Missing(Option<usize>),
}

trait Parser: Sized {
    type Error;

    fn parse<R: Read>(reader: R) -> Poll<Result<Self, Self::Error>>;
}

I think it would currently be better not to expose your parser to the user, so it can be changed in the future.
