default log store backend to WAL and allow disabling verification #21700

Open
wants to merge 11 commits into
base: main

Collaborator

@dhiaayachi dhiaayachi commented Sep 4, 2024

Description

This PR changes the default log store config to use WAL when starting with a fresh database. If a BoltDB database already exists, BoltDB is used as the backend and a warning is logged.

It also allows the log verifier, which is enabled by default, to be disabled.

Testing & Reproduction steps

Added tests to verify combinations of configs.

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

@dhiaayachi dhiaayachi requested a review from a team as a code owner September 4, 2024 13:53
@github-actions github-actions bot added the theme/config Relating to Consul Agent configuration, including reloading label Sep 4, 2024
Member

@jmurret jmurret left a comment

:shipit: 🔥

Member

@zalimeni zalimeni left a comment

Few docs ❓ and minor suggestions, but otherwise LGTM! 🚀

.changelog/21700.txt (outdated, resolved)
Comment on lines 1059 to 1060
- if s.config.LogStoreConfig.Backend == LogStoreBackendDefault && !boltFileExists {
+ if (s.config.LogStoreConfig.Backend == LogStoreBackendDefault || s.config.LogStoreConfig.Backend == LogStoreBackendWAL) && !boltFileExists {
      s.config.LogStoreConfig.Backend = LogStoreBackendWAL
Member

@zalimeni zalimeni Sep 10, 2024

~ Should we consider moving the original

if s.config.LogStoreConfig.Backend == LogStoreBackendDefault && !boltFileExists {
  s.config.LogStoreConfig.Backend = LogStoreBackendWAL
}

bit up above the rest of this if block, and just check explicitly for WAL (not default) after?

Main thought that crossed my mind is we're treating default and WAL as equivalent in these checks once we get past the BoltDB detection gate, so normalizing in one place is less error-prone in case of future changes and separates the defaulting from the business logic.

Collaborator Author

Is that what you had in mind?

Member

I was more thinking we could simplify the checks following the first defaulting block, so that we aren't repeating stuff like "default or WAL" and "!boltDB". Maybe something like this? (also switches the warning to "using BoltDB" since "ignoring 'wal'" might be confusing when default is used)
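The suggested snippet itself is not preserved in this thread. As a rough illustration of the idea only (normalize "default" to WAL once, then check explicit values), here is a minimal sketch. The `resolveBackend` helper, the `LogStoreBackendBoltDB` constant name, the enum values, and the warning text are all assumptions made for this example, not code from the PR.

```go
package main

import "fmt"

// LogStoreBackend mirrors the enum referenced in the diff. The concrete
// values and the BoltDB constant name are assumptions for this sketch.
type LogStoreBackend int

const (
	LogStoreBackendDefault LogStoreBackend = iota
	LogStoreBackendBoltDB
	LogStoreBackendWAL
)

// resolveBackend is a hypothetical helper, not code from the PR.
// It returns the backend to use and an optional warning message.
func resolveBackend(configured LogStoreBackend, boltFileExists bool) (LogStoreBackend, string) {
	// Defaulting happens once, up front: "default" is treated as WAL.
	if configured == LogStoreBackendDefault {
		configured = LogStoreBackendWAL
	}

	// From here on only explicit values are checked. An existing BoltDB
	// file takes precedence over WAL, and the warning states what is used
	// rather than what is ignored.
	if configured == LogStoreBackendWAL && boltFileExists {
		return LogStoreBackendBoltDB, "BoltDB file found, using BoltDB as the log store backend"
	}
	return configured, ""
}

func main() {
	backend, warning := resolveBackend(LogStoreBackendDefault, true)
	fmt.Println(backend == LogStoreBackendBoltDB, warning)
}
```

Keeping the defaulting step separate means later checks never need to repeat "default or WAL", which is the error-proneness concern raised above.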

- Take a snapshot prior to testing.
- Monitor Consul server metrics and logs, and set an alert on specific log events that occur when WAL is enabled. Refer to [Monitor Raft metrics and logs for WAL](/consul/docs/agent/wal-logstore/monitoring) for more information.
- Enable WAL in a pre-production environment and run it for a several days before enabling it in production.
WAL LogStore is now enabled by default
Member

@zalimeni zalimeni Sep 10, 2024

We're still ignoring config if there's a BoltDB file found - should we call that out here (new installs only) similar to the main doc, and keep some instructions to transition existing servers if desired?

Collaborator Author

@boruszak Can you please check the wording in here?

website/content/docs/agent/wal-logstore/enable.mdx (outdated, resolved)
Comment on lines -52 to -74
## Enable log verification

You must enable log verification on all voting servers in Enterprise and all servers in CE because the leader writes verification checkpoints.

1. On each voting server, add the following to the server's configuration file:

```hcl
raft_logstore {
  verification {
    enabled = true
    interval = "60s"
  }
}
```

1. Restart the server to apply the changes. The `consul reload` command is not sufficient to apply `raft_logstore` configuration changes.
1. Run the `consul operator raft list-peers` command to wait for each server to become a healthy voter before moving on to the next. This may take a few minutes for large snapshots.

When complete, the server's logs should contain verifier reports that appear like the following example:

```log hideClipboard
2023-01-31T14:44:31.174Z [INFO] agent.server.raft.logstore.verifier: verification checksum OK: elapsed=488.463268ms leaderChecksum=f15db83976f2328c rangeEnd=357802 rangeStart=298132 readChecksum=f15db83976f2328c
```
Member

Isn't this section on log verification still relevant info even when WAL is defaulted on?

Collaborator Author

Log verification is now enabled by default as part of WAL. The reasoning is that it has minimal impact but great benefits in case of bugs.

I will double-check that it's properly documented as part of the logstore config.

website/content/docs/agent/wal-logstore/index.mdx (outdated, resolved)
website/content/docs/agent/wal-logstore/index.mdx (outdated, resolved)
Contributor

@boruszak boruszak left a comment

I don't think we should delete the instructions on the wal-logstore/enable page. We should rephrase the language around "experimental" and call out that it's the default. But what if someone changes from WAL to BoltDB and then wants to change back?

website/content/docs/agent/wal-logstore/index.mdx (outdated, resolved)
website/content/docs/agent/wal-logstore/monitoring.mdx (outdated, resolved)
website/content/docs/agent/wal-logstore/index.mdx (outdated, resolved)
Co-authored-by: Michael Zalimeni <[email protected]>
Co-authored-by: Jeff Boruszak <[email protected]>
@dhiaayachi
Collaborator Author

> I don't think we should delete the instructions on the wal-logstore/enable page. We should rephrase the language around "experimental" and call out that it's the default. But what if someone changes from WAL to BoltDB and then wants to change back?

@boruszak I'm not sure I get your point 🤔. The aim of that page is to help users enable an experimental feature and make sure that it works safely for them. Now that WAL is the default, that logic no longer holds: by making it the default, we implicitly state that it is stable enough.

I agree that reverting from WAL to BoltDB is important, but it's a simple configuration change and doesn't need any extra steps. The only thing I can think of that we should call out, and that we can probably document on that page, is:

  • If you activate WAL with an existing BoltDB database, the BoltDB database is used and WAL is not activated.
  • If you activate BoltDB while a WAL database is present, the WAL database is ignored and the server starts with a new database from scratch.
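The two cases above can be expressed as a small decision function. This is a simplified model for illustration, not Consul's startup code: the `backend` type, the `outcome` struct, and `startupOutcome` are hypothetical names, and the model only looks at which on-disk databases exist.

```go
package main

import "fmt"

type backend string

const (
	boltDB backend = "boltdb"
	wal    backend = "wal"
)

// outcome describes what a server does at startup in this simplified model.
type outcome struct {
	Effective   backend // backend that is actually used
	StartsEmpty bool    // true when the server starts with an empty log store
	Note        string  // explanation when existing data changes the result
}

// startupOutcome models the two cases described above.
func startupOutcome(configured backend, boltExists, walExists bool) outcome {
	switch configured {
	case wal:
		if boltExists {
			// Case 1: an existing BoltDB database wins over the WAL setting.
			return outcome{Effective: boltDB, Note: "existing BoltDB database is used; WAL is not activated"}
		}
		return outcome{Effective: wal, StartsEmpty: !walExists}
	default: // boltDB
		if !boltExists && walExists {
			// Case 2: existing WAL data is ignored and BoltDB starts from scratch.
			return outcome{Effective: boltDB, StartsEmpty: true, Note: "existing WAL database is ignored"}
		}
		return outcome{Effective: boltDB, StartsEmpty: !boltExists}
	}
}

func main() {
	fmt.Printf("%+v\n", startupOutcome(wal, true, false))
	fmt.Printf("%+v\n", startupOutcome(boltDB, false, true))
}
```

Both flagged cases are silent fallbacks from the operator's point of view, which is why the procedure that follows deletes the old database explicitly.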

So to sum it up, when changing the log store backend it's always recommended to:

  • Create a snapshot and verify it's not corrupted
  • Gracefully stop the server
  • Change the log store config
  • Delete the existing DB (WAL or BoltDB)
  • Start the server
  • Wait for it to get its data replicated or restore the snapshot

What do you think?

@JadhavPoonam
Contributor

@dhiaayachi For someone who already has BoltDB what steps would they have to take to make the switch? Do we need/have some migration docs somewhere? 🤔

@dhiaayachi
Collaborator Author

> @dhiaayachi For someone who already has BoltDB what steps would they have to take to make the switch? Do we need/have some migration docs somewhere? 🤔

@JadhavPoonam The procedure I outlined in the comment above would be needed. We can add that as documentation.

agent/consul/server.go (outdated, resolved)
agent/consul/server.go (outdated, resolved)
Member

@zalimeni zalimeni left a comment

Code changes LGTM! Will defer to you and docs team for remaining open questions about what to retain/drop/change.

@dhiaayachi
Collaborator Author

@boruszak This is ready from a code perspective. Can you please check what changes are needed to the docs to get this into a mergeable state?

Contributor

@boruszak boruszak left a comment

Small changes in these suggestions, and I'm approving to unblock this PR on my end.

@dhiaayachi I proposed keeping the "Enable WAL" instructions assuming that if someone had WAL and reverted to BoltDB, they might want to go back to WAL. But from your comments, it sounds like what we actually need instead is a page that describes the steps to migrate a datacenter running BoltDB to one that runs WAL (using the steps you describe with the snapshot agent).

- Take a snapshot prior to testing.
- Monitor Consul server metrics and logs, and set an alert on specific log events that occur when WAL is enabled. Refer to [Monitor Raft metrics and logs for WAL](/consul/docs/agent/wal-logstore/monitoring) for more information.
- Enable WAL in a pre-production environment and run it for a several days before enabling it in production.
WAL LogStore is now enabled by default
Contributor

Suggested change
WAL LogStore is now enabled by default
The WAL LogStore backend is now enabled in Consul by default.

@@ -7,30 +7,7 @@ description: >-

# Enable the experimental WAL LogStore backend
Contributor

Suggested change
# Enable the experimental WAL LogStore backend
# Enable the WAL LogStore backend


This topic provides an overview of the WAL (write-ahead log) LogStore backend.
The WAL backend is an experimental feature. Refer to
The WAL backend is now the default Consul LogStore when a boltdb database is not already in place. Refer to
Contributor

Suggested change
The WAL backend is now the default Consul LogStore when a boltdb database is not already in place. Refer to
The WAL backend is now the default LogStore for Consul server agents when a BoltDB database is not already in place. Refer to

website/content/docs/agent/wal-logstore/monitoring.mdx (outdated, resolved)
@dhiaayachi
Collaborator Author

> Small changes in these suggestions, and I'm approving to unblock this PR on my end.
>
> @dhiaayachi I proposed keeping the "Enable WAL" instructions assuming that if someone had WAL and reverted to BoltDB, they might want to go back to WAL. But from your comments, it sounds like what we actually need instead is a page that describes the steps to migrate a datacenter running BoltDB to one that runs WAL (using the steps you describe with the snapshot agent).

Thank you for the review @boruszak, but there is no need to rush this anymore, as we decided not to include it in the next release. I think we should have a page that describes both:

  • how to migrate to WAL when you already have a BoltDB database
  • how to migrate back to BoltDB if you are using WAL.

I will try to add that page and ping you for a review.

@digital-content-events

📄 Content Checks

Updated: Tue, 15 Oct 2024 20:10:33 GMT

Found 4 error(s)

content/docs/agent/wal-logstore/migrate-to-wal.mdx

Error parsing frontmatter: YAMLParseError: Implicit map keys need to be followed by map values at line 4, column 1:

description: >-
Learn how to migrate from boltDB to WAL with an existing boltDB database.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| Position | Description | Rule |
| --- | --- | --- |
| 1:1-1:1 | Document does not have a page_title key in its frontmatter. Add a page_title key at the top of the document. | ensure-valid-frontmatter |
| 1:1-1:1 | Document does not have a description key in its frontmatter. Add a description key at the top of the document. | ensure-valid-frontmatter |
| 1:1-1:1 | This file is not present in the nav data file at data/docs-nav-data.json. Either add a path that maps to this file in the nav data or remove the file. If you want the page to exist but not be linked in the navigation, add a hidden property to the associated nav node. | no-unlinked-pages |

Labels
pr/no-backport theme/config Relating to Consul Agent configuration, including reloading

5 participants