-
-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: improve aggregator event queue efficiency #457
base: main
Are you sure you want to change the base?
Conversation
This branch updates the console crates' dependency on `tonic` to v0.8.0 and `prost` to 0.11.0. In addition, I've added a [`cargo xtask`][xtask] command for manually regenerating the generated protobuf bindings. This is necessary as the current approach, regenerating the bindings in an integration test, does not work when the protos fail to compile (which they do after the Tonic update). Since running the crate's tests requires compiling the crate, if the proto bindings don't compile, we can't re-run the test. [xtask]: https://github.com/matklad/cargo-xtask
This changes the console CLI's error handling so that GitHub issues are only suggested for panics, not for recoverable errors (such as "no config file found", "couldn't connect to remote host", etc).
This way, if running a subcommand panics/errors, we still get nice `color-eyre` reports.
This commit adds `color_eyre`'s default backtrace frame filters, so we skip stuff that's not relevant.
…rs#365) This adds a dump of all the console's config options to the issue metadata for GitHub issues generated using `color_eyre`'s GitHub issue generation on panics. <details> <summary>Example issue Markdown:</summary> ## Error ``` lol ``` ## Metadata |key|value| |--|--| |**version**|0.1.6| |**config.subcmd**|`None`| |**config.target_addr**|`Some(http://127.0.0.1:6669/)`| |**config.env_filter**|`None`| |**config.log_directory**|`Some("/tmp/tokio-console/logs")`| |**config.retain_for**|`None`| |**config.view_options.no_colors**|`false`| |**config.view_options.lang**|`Some("en_US.UTF-8")`| |**config.view_options.ascii_only**|`Some(false)`| |**config.view_options.truecolor**|`Some(true)`| |**config.view_options.palette**|`Some(All)`| |**config.view_options.toggles.color_durations**|`Some(false)`| |**config.view_options.toggles.color_terminated**|`Some(false)`| |**location**|tokio-console/src/main.rs:36:5| </summary>
<a name="0.4.0"></a> ## 0.4.0 (2022-08-10) #### Breaking Changes * Update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f)) #### Features * Update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f))
<a name="0.1.7"></a> ## 0.1.7 (2022-08-10) #### Features * Update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f)) * Update `console-api` to `0.4` (tokio-rs#364) ([40e2f6f](40e2f6f))
<a name="0.1.7"></a> ## 0.1.7 (2022-08-10) #### Features * include config options in autogenerated issues (tokio-rs#365) ([fcb54df](fcb54df)) * filter out boring frames in backtraces (tokio-rs#365) ([523a44a](523a44a)) * init error handling before subcmds (tokio-rs#365) ([6646568](6646568)) * only suggest opening issues for panics (tokio-rs#365) ([23cb6bf](23cb6bf)) * update `tonic` to `0.8` (tokio-rs#364) ([40e2f6f](40e2f6f)) * update `console-api` to `0.4` (tokio-rs#364) ([40e2f6f](40e2f6f))
Needed for tokio-rs#374. This configures clippy to ignore most of the generated code in `console-api`.
Due to a change in the unstable task builder APIs, this no longer compiles with the latest version of Tokio. Fortunately, it's a simple fix.
<a name="0.1.7"></a> ## 0.1.8 (2022-09-04) #### Bug Fixes * fix build on tokio 1.21.0 (tokio-rs#374) ([0106407](0106407))
Fixes: tokio-rs#372 `--ascii-only true` shows ascii `--ascii-only false` shows emojis if `--ascii-only` is not passed, default value will be true, shows emojis
) Currently, the documented `cargo-run` in README no longer works, as now there are two binaries: ```console $ cargo run error: `cargo run` could not determine which binary to run. Use the `--bin` option t o specify a binary, or the `default-run` manifest key. available binaries: tokio-console, xtask ``` This branch declares the `tokio-console` binary as the [`default-run`] value in the `Cargo.toml`, so it is now run by default by `cargo run`. Alternatively, we can just update the README line with `cargo run --bin tokio-console` [`default-run`]: (https://doc.rust-lang.org/cargo/reference/manifest.html#the-default-run-field)
Found via these commands: codespell . markdownlint *.md --disable MD013
…o-rs#387) Currently, the help text for the `r` and `t` keystrokes to switch to the Resources and Tasks views is shown on the Task Details and Resource Details views, as well as on the Task List and Resource List views. See this screenshot for an example: ![image](https://user-images.githubusercontent.com/2345750/202058962-4a8f062c-78fd-47eb-a3b1-5de470976aa8.png) However, the keystrokes are not actually handled while the console is showing a details view. This branch changes the console to handle the `r` and `t` keystrokes on all views. This also simplifies the keyboard input code somewhat.
Build and add to the released binaries for `aarch64-unknown-linux-gnu`, `aarch64-apple-darwin` and `aarch64-pc-windows-msvc`
Hey I noticed this out of date in my `Cargo.lock`, and since I didn't see any PRs/issues open doing so already, figured I'd open one!
Signed-off-by: hi-rustin <[email protected]> Signed-off-by: hi-rustin <[email protected]>
Co-authored-by: Eliza Weisman <[email protected]>
Fixes tokio-rs#386 Signed-off-by: hi-rustin <[email protected]>
The durations in the tokio-console UI are shown with a unit, so the digits after the decimal separator are generally not relevant. Since space is at a premium in a terminal UI, it makes sense to cut down where possible. This change removes all digits after the decimal separator in the tasks table view. In the task detail view, 2 decimal places are kept. Additionally, 4 new duration formats are added to represent minutes-with-secondsi, hours-with-minutes, days-with-hours and days. These are only applied once the leading unit is greater than zero. For example, 59 seconds will be shown as seconds: `99s` and then 60 seconds will be shown as minutes-with-seconds: `1m00s`. New colors have been chosen for each format. NOTE: I had to make a small unrelated change in `task::Details::make_percentiles_widget` to make clippy happy. Fixes: tokio-rs#224
…kio-rs#404) The `color-eyre` crate was previously on version 0.5. This version depends on `tracing-subscriber` 0.2, meaning that we had 2 different veresions of `tracing-subscriber` in our dependency tree. This change updates `color-eyre` to 0.6, which depends on `tracing-subscriber` 0.3, the same as `console-subscriber`.
Tokio Console generates its own sequential Id for internal tracking and indexing of objects (tasks, resources, etc.). However, this Id will be recreated if Console is restarted. In order to provide more useful information to the user, the task Id generated by Tokio can be used in the task list and task details screens instead. If used in this way, the ID field in the task list and task detail views will be stable across restarts of Console (assuming the monitored application is not restarted). This also frees up horizontal space, as the `task.id` attribute doesn't need to be displayed separately. The disadvantage of using Tokio's task Id is that it is not guaranteed to be present by the type system. To avoid problems with missing task Ids, the Tokio task Id is store in addition to the `span::Id` (used to communicate with the `console-subscriber`) and the sequential console `Id` (used in the `store`). If a task is missing the `task.id` field for whatever reason it will still appear, but with an empty ID. If multiple runtimes are active, then duplicate ID values will appear. Fixes: tokio-rs#385 Co-authored-by: Eliza Weisman <[email protected]>
Add support for console connections that use Unix domain sockets rather than TCP. Closes tokio-rs#296. Co-authored-by: Eliza Weisman <[email protected]>
The Console API specifies sending task busy duration only for completed polls, it doesn't include the time spent in the current poll if the task is active. Tokio Console then calculates the busy time including the time spent in the current poll - based on the last poll start and poll end times sent by the Console Subscriber. However, there was an error in the logic which determined when a task is being polled for the purpose of calculating the busy time. The logic only considered the first poll, when there was no recorded end poll time at all. This change adapts the logic so that it also considers the case where the last recorded poll start is later than the last recorded poll end. This indicates that the task is being polled. Co-authored-by: Eliza Weisman <[email protected]>
…s#408) There are 2 widgets which display the poll times for a task in the detail view. The poll times percentiles are always displayed and if UTF-8 is available, then a sparkline histogram is also shown to the right. The logic for displaying these two widgets is quite long and is currently interspersed within the `render` function for the task detail view plus helper functions. Additionally, it is not easy to add a second set of widgets showing the time between waking and being polled for a task which is planned for tokio-rs#409. This change factors out that logic into separate widgets. There was already a separate widget `MiniHistogram`. Some of the logic that was previously in the task detail view has been moved here. A new widget `Percentiles` has been added to encapsulate the logic for preparing and displaying the percentiles. A top level `Durations` widget occupies the entire width of the task detail view and control the horizontal layout of the `Percentiles` and `MiniHistogram` widgets. The new widget will also supress the histogram if there isn't at least enough room to display the legend at the bottom
Each task displays the sum of the time it has been idle and busy, as well as the total. The idle time includes the time between when a task is woken, and when the runtime actually polls that task. There are cases where a task may be scheduled for a long time after being woken, before it is polled. Especially if many tasks are woken at the same time and don't yield back to the runtime quickly. To add visibility to this, the total time that a task is scheduled before being polled has been added. Additionally, a new task state `Scheduled` has been added. This is displayed in both the tasks table and in the task detail view. In the `console-api`, the total `scheduled_time` for the task has been added to the `TaskStats` message in `tasks.proto`. This is the first of two parts. In the second part (tokio-rs#409), a histogram for scheduled time will be added, the equivalent of the poll time histogram which is already available on the task detail screen. To show a pathological case which may lead to needing to see the scheduled time, a new example has been added to the `console-subscriber` ## PR Notes This PR does something adjacent to what is described in tokio-rs#50, but not quite. The unicode character used for a `SCHED` task is ⏫. The second PR (tokio-rs#409) will record a scheduled time histogram the same as it recorded for poll times. These two changes should go in together (and certainly shouldn't be released separately). However, this PR is already quite big, so they'll be separated out. The idea is that this PR isn't merged until the next one has also been reviewed and approved. It would be good to get some feedback at this stage though. The task list view with the new column for `Sched` time: <img width="1032" alt="a tasks table view for the long-scheduled example" src="https://user-images.githubusercontent.com/89589/232456977-2921f884-4673-420f-ba4f-3646627d44db.png"> The `Task` block in the task detail view showing the new `Scheduled` time entry. <img width="510" alt="The task block on the task detail view for the rx task in the long-scheduled example" src="https://user-images.githubusercontent.com/89589/232457332-e455e086-9468-42c9-8fda-7965d8d1e6f8.png">
Self-wakes were not being detected and displayed in the console. The `burn` task in the `app` example - which deliberately has many self-wakes - was not registering any. It appears that a logic error was present in the self-wake counting in `console-subscriber` since it was added in tokio-rs#238. When a self wake was detected, the `wakes` count was incremented a second time (the `wakes` count is incremented for all wakes before checking for a self wake), instead of increamenting the `self_wakes` count. This PR fixes the logic so that when a self wake is detected, the `self_wakes` count is incremented.
Each view in tokio-console has a widget up the top that lists the available controls for that view. There was a common implementation of this for table based views (tasks, resources, and async_ops) and separate implementations for the task and resource views. The resource view included two controls widgets, one for the Resource details at the top of the view, and another above the table of Async Ops at the bottom of the view. This change centralises the logic for the creation of this controls widget. This change is mostly a precursor to also displaying the controls in the help view (at which point we can revisit whether the entire list needs to be shown at the top of the screen). Controls (an action and the key or keys used to invoke it) are defined in structs so that their definition can be separated from the display logic (which includes whether or not UTF-8 is supported). This allows the problem of the text in the controls widget wrapping in the middle of a control definition to be fixed. Previously a subset of the controls would have wrapped like this: ```text controls: select column (sort) = ←→ or h, l, scroll = ↑↓ or k, j, view details = ↵, invert sort (highest/lowest) = i, ``` Notice how "view details = ↵," was split across multiple lines. The same list of controls will now wrap at a full control definition. ```text controls: select column (sort) = ←→ or h, l, scroll = ↑↓ or k, j, view details = ↵, invert sort (highest/lowest) = i, ``` Additionally, the list of controls on the Resource view has been consolidated up the top of the screen. Universal controls, those that are available in all views, are also defined centrally. As well as the quit action, using the space bar to pause has been added to that list. This was previously somewhat of an undocumented feature.
We often receive questions about the meaning of certain columns in the different views in Tokio Console. This change adds additional documentation to the `tokio-console` README - which is also used in the docs.rs main page - describing each column in the tasks, resources, and async ops tables. Where there are a fixed set of values possible in the column, those values are also described. Additionally, the 4 views are enumerated and separated into sections to aid readability and allow deep linking. Co-authored-by: Eliza Weisman <[email protected]>
…o-rs#443) * fix(console): remove trailing space from task/resource location When a file location is formatted, an extra space is added at the end. This appears to be the result of some refactoring from a case where we needed the extra space at the end. Currently, the extra space results in there being 2 spaces between the Location column and the Fields column in the tasks list view, and the same between the Location column and the Attributes column in the resources list view. It is also causing issues when attempting to truncate the first part of the location in tokio-rs#441 (during which this extra space was discovered). This change removes the trailing space.
…o-rs#440) When entering and exiting a span the old code was also updating the parent stats. This was causing excessive polling being reported for the parent tasks. See issue tokio-rs#378 for more details. The regression was introduced by the refactor in tokio-rs#238. This fixes the issue by limiting updates to the current span. Closes tokio-rs#378 Co-authored-by: Hayden Stainsby <[email protected]> Co-authored-by: Eliza Weisman <[email protected]>
This closes tokio-rs#411. This is how it looks like. <img width="514" alt="Screenshot 2023-07-03 at 12 22 24 PM" src="https://github.com/tokio-rs/console/assets/41150432/878a43e6-1512-49c0-85fd-0e245cc6ee04"> Co-authored-by: Hayden Stainsby <[email protected]>
This appears to be breaking it.
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
…-rs#453) This PR adds an explicit dependency to futures-core for `console-api` to fix broken builds. Without this fix I get the following errors: ``` error[E0433]: failed to resolve: use of undeclared crate or module `futures_core` --> /home/weiznich/.cargo/registry/src/index.crates.io-6f17d22bba15001f/console-api-0.5.0/src/generated/rs.tokio.console.instrument.rs:275:34 | 275 | type WatchUpdatesStream: futures_core::Stream< | ^^^^^^^^^^^^ use of undeclared crate or module `futures_core` error[E0433]: failed to resolve: use of undeclared crate or module `futures_core` --> /home/weiznich/.cargo/registry/src/index.crates.io-6f17d22bba15001f/console-api-0.5.0/src/generated/rs.tokio.console.instrument.rs:289:38 | 289 | type WatchTaskDetailsStream: futures_core::Stream< | ^^^^^^^^^^^^ use of undeclared crate or module `futures_core` error[E0433]: failed to resolve: use of undeclared crate or module `futures_core` --> /home/weiznich/.cargo/registry/src/index.crates.io-6f17d22bba15001f/console-api-0.5.0/src/generated/rs.tokio.console.trace.rs:212:27 | 212 | type WatchStream: futures_core::Stream< | ^^^^^^^^^^^^ use of undeclared crate or module `futures_core` ``` The somewhat funny thing is that only happened after I've updated some unrelated dependencies. It worked before, but after looking at the relevant code I'm really surprised that it ever compiled without error.
The `ConsoleLayer` builder provides the user with a console layer and a server, which is used to start the gRPC server. However, it may be desireable to expose the instrumentation server together with other services on the same Tonic router. This was requested explicitly in tokio-rs#449 428. Additionally, to add tests which make use of the instrumentation server (as part of improving test coverage for tokio-rs#450), more flexibility is needed than what is provided by the current API. Specifically we would like to connect a client and server via an in memory channel, rather than a TCP connection. This change adds an additional method to `console_subscriber::Server` called `into_parts` which allows the user to access the `InstrumentServer` directly. The `Aggregator` is also returned and must be set to run for at least as long as the instrument server. This allows the aggregator to be spawned wherever the user wishes. To facilitate the addition of functionality which would result in more "parts" in the future, `into_parts` returns a non-exhaustive struct, rather than a tuple of parts. Closes: tokio-rs#428
…okio-rs#449) Now it is possible while navigate in `async_ops_table` to go directly to details of the task that you are pointing on the list. This closes tokio-rs#448
The current logic present in `IdData::drop_closed` marks an item (task, resource, and async op stats) to be dropped in the case that the item **is** dirty and there **are** watchers: `(dirty && has_watchers)`. This causes a case where if an item is first received and then completes in between the aggregator push cycle, it will be discarded immediately and never sent. This logic has been in place since the concepts of watchers and dirty items was introduced in tokio-rs#77. However since an item that is created and then dropped within a single update cycle isn't likely to be missed in the UI, it may never have been noticed. Instead the logic should be to **retain** an item if **any** of the following is true: * there are watchers and the item is dirty: `(dirty && has_watchers)` * item has been dropped less time than the retention period: `dropped_for <= retention`.
I just came across this typo in the Readme when using `tokio-console` for the first time. This is a great tool 👍
- Implement queue drain pressure that is based on channel's capacity. Pressure usually leads to the system reaching equilibrium without piling the events up. - Allow configuring channel's target fill capacity, before flushing. - Lower event buffer capacity, because the current value let the buffer grow into gigabytes in size.
Thank you for working on this! The current queue sizes and other parameters were chosen pretty arbitrarily in the interest of "just getting something working", so I had been hoping to improve the queue performance for some time. I'm excited to give this PR a review! |
let fill_percentage = self.shared.fill_percentage.load(Relaxed) as u32; | ||
let sleep = tokio::time::sleep(self.publish_interval * (100 - fill_percentage) / 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this actually flush more frequently than the default publish interval? it seems like if the fill percentage is 100, this will be the same as the publish interval, otherwise it will be greater. is that the intention here, or is this supposed to fire more frequently than the publish interval when the queue is more full?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If fill percentage is 100, then the interval will be publish_interval * 0
, meaning there will not be any real sleeping happening.
@hawkw I have pushed requested changes (can squash it later, but I think it's easier to see changes during review) |
@h33p Just letting you know that we haven't forgotten this PR! I've got a test case that I'd like to run your changes against to compare to the current behaviour. It pretty easily produces a situation where the channel between the subscriber and the aggregator grows forever (at least on my local computer using a single thread). |
Details
This reverts #235, because I found that limit being unreasonably high - adding the following code would lead to tens of gigabytes of size events being piled up to the event queue:
The problem I found in testing is that the aggregator task would consistently fall behind the number of events being created and eventually the event queue would fill up, regardless of its size (increasing the size merely delays the problem). The reason why it fills up is rather subtle
recv().now_or_never()
eventually returnsPoll::Pending
due to task coop budget being used up. My immediate fix was to wrap the future withtokio::task::unconstrained
, but I figured there's a reason why it's not doingtry_recv()
and that is to avoid livelocking the task.What this PR does is expose the "fullness" level of the queue to the shared vars. This allows the aggregator task to have a secondary timer, alongside regular ticking, that adds pressure to clear out the event queue at the same rate as it's filling up.
Another important thing to note is that event queue filling up faster than it's being processed will lead to time dilation - the rate of events in the console would appear slower than it actually is. Adding pressure eventually leads to an equilibrium where events are processed at roughly equal rate.
One key change was adding a customizable variable
event_buffer_flush_fraction
- the higher the fraction the sooner this equilibrium will be reached. I set it to10
for now (before the effective value was2
), which, with the reduction of event buffer capacity, should lead to 50x+ faster reaction by the aggregator task, while keeping 9x extra space for random spikes. Note that this variable may be changed to something slightly easier to understand, likeevent_buffer_flush_percentage
, but that's just a user-facing detail I have no strong opinion about.I believe this should fix #184, but I haven't done an actual long-term test in production, will see in a couple days. EDIT: it doesn't seem to fix it