
Extend "take over" mode in filestream to cover the changing ID use-case #42884

Open
rdner opened this issue Feb 25, 2025 · 8 comments
Labels
Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
@rdner
Member

rdner commented Feb 25, 2025

Filestream "take over" mode

The "take over" mode was originally created for a single use-case – migrating state (current file offsets) from a log input to a filestream input.

The implementation is currently contained in this package: #42624

Short description of how it works:

  • When Filebeat starts, it detects every filestream that has take_over: true in its configuration
  • We open the registry and iterate through existing states of all log inputs
  • For each log input state we try to match its file path against paths of any of filestream inputs with take_over: true
  • If we have a match, we create a new filestream state for this matched file converting all the data from the log input state
  • We delete all the log input states that matched any filestream with take_over: true
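The steps above can be sketched roughly as follows. This is a minimal Python illustration, not Filebeat's actual (Go) implementation; the registry key layout (`log::<path>`, `filestream::<id>::<path>`) and the input config dict shape are assumptions made for the example.

```python
import fnmatch

def take_over_log_states(registry, filestream_inputs):
    """Migrate matching log-input states to filestream states at startup.

    registry: dict mapping hypothetical keys like "log::/var/log/app.log"
    to state dicts. filestream_inputs: list of dicts with "id", "paths",
    and "take_over" keys (illustrative shapes only).
    """
    migrated, to_delete = {}, []
    for key, state in registry.items():
        if not key.startswith("log::"):
            continue
        path = key[len("log::"):]
        for fs in filestream_inputs:
            if not fs.get("take_over"):
                continue
            # Match the log state's path against the input's path globs.
            if any(fnmatch.fnmatch(path, glob) for glob in fs["paths"]):
                # Convert the log state into a filestream state.
                migrated[f"filestream::{fs['id']}::{path}"] = dict(state)
                to_delete.append(key)
                break
    # Delete every log state that matched a take_over input.
    for key in to_delete:
        del registry[key]
    registry.update(migrated)
    return registry
```

Non-matching log states (the `/tmp/x.log` case below) are left untouched, mirroring the behavior described above.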

At the moment it has some limitations due to where this implementation lives: dynamically created inputs don't perform the state migration even though take_over: true is set. This means that running Filebeat under the Elastic Agent (which communicates input changes) or running Filebeat with autodiscover enabled would not migrate the states.

This issue was created to address that problem: #36777

A new use-case

Due to recent developments we found a new migration use-case similar to the existing take_over mode – migrating the state (file offsets) from one filestream input to another after an ID change.

The current solution introduces separate migration logic for just this use-case, along with a new configuration parameter, previous_ids: #42624

Proposed alternative

From the UX perspective I think it would be better to have only one way to migrate state between inputs – the already existing take_over: true mode.

For that reason we should consolidate the effort of working on #36777 and #42472 into one single solution:

  • We move the take_over logic into the filestream implementation
  • We treat filestream->filestream migration (due to an ID change) exactly like log->filestream migration, matching by paths (we might need additional checks like inode/fingerprint, and to reset the offset on mismatch)
  • We no longer delete previous states from the registry, so old and new inputs can co-exist if needed. That also removes the need for the registry backup step. The clean up should happen on later input removal anyway.
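The filestream->filestream part of this proposal could look roughly like the sketch below, which rewrites state keys from an old input ID to a new one without deleting the old entries. The `filestream::<id>::<path>` key format and function name are hypothetical, chosen only for illustration.

```python
def take_over_filestream_id(registry, old_id, new_id):
    """Copy filestream states from an old input ID to a new one.

    The old entries are kept, so the old and new inputs can co-exist
    and no registry backup step is needed; cleanup happens on later
    input removal.
    """
    prefix = f"filestream::{old_id}::"
    for key in list(registry):
        if key.startswith(prefix):
            path = key[len(prefix):]
            new_key = f"filestream::{new_id}::{path}"
            # setdefault: never overwrite a state the new input already has.
            registry.setdefault(new_key, dict(registry[key]))
    return registry
```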

cc @belimawr @cmacknz

@rdner rdner added Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Feb 25, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member

cmacknz commented Feb 25, 2025

For that reason we should consolidate the effort of working on #36777 and #42472 into one single solution:

This sounds reasonable to me, so there's only one way to do things.

@belimawr
Contributor

I also like the idea of having a single implementation.

There is an issue with migrating filestream -> filestream by matching only on path: multiple filestream inputs can harvest the same file and have their states tracked independently, which requires unique IDs. To support migration in this case, we still need to set previous_ids.

We no longer delete previous states from the registry, so old and new inputs can co-exist if needed.

For log -> filestream that works well and I like the idea. At some point we will need a way to remove the log input entries from the registry, but that also requires that the input no longer be running/harvesting the file.

For filestream -> filestream we need to remove the old entry to avoid the following corner case: filestream starts and migrates the state of /foo/bar.log, harvests a few lines, and updates the registry; then Filebeat stops (either gracefully or by crashing) before the registry cleanup kicks in. Whenever Filebeat is restarted, the migration code will run again and copy the old state over the new one, leading to some data duplication.

If we change the way we access the store, we can circumvent this limitation.

@rdner
Member Author

rdner commented Feb 26, 2025

@belimawr

There is an issue of migrating filestream -> filestream matching only by path: multiple Filestream inputs can harvest the same file and have their states independently tracked, which requires unique IDs. To support the migration on this case, we still need to set the previous_ids.

This is a valid point; however, I don't think it's a common use-case. It seems like we could allow migration without setting the IDs by default: we skip all files that are tracked by multiple inputs, printing a warning each time that recommends setting the IDs. Or we could even fail the migration with an error message (the input does not start, states don't change).

Another important point is that when I introduced take_over: true I didn't account for future options. If we introduce IDs, it should look like:

take_over:
  enabled: true
  # if empty, we try and fail if the same file is tracked by multiple inputs
  # if has items, we take over states ONLY from inputs with given IDs.
  from_ids: []

I think having previous_ids at the input level would be very confusing.

Whenever Filebeat is restarted, the migration code will run again, copy the old state overwriting the new, leading to some data duplication.

I might be wrong but it should be possible to detect that the "taking over" input already has a state for the file and just skip it without migrating again.

@belimawr
Contributor

@rdner and I talked; here is what we decided and some corner cases we identified.

Filestream and Log input are handled the same way

The state migration for log and filestream inputs will be handled the same way, using the same configuration.

During startup, the filestream input will copy the state of all files that match its path globs from both log and filestream inputs (this can be restricted to migrating only from certain input IDs).

take_over will become an object

The take_over feature is still in Beta, so we can still make breaking changes to it. If needed, we could release it only in 9.0.

The new format is:

take_over:
  enabled: true
  from_ids:
    - foo
    - bar

  • take_over.enabled: true enables the migration.
  • take_over.from_ids is optional; if set, it is an array of strings defining the input IDs to migrate state from. Because only the filestream input supports IDs, when it is set, states are only migrated from filestream inputs.
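Parsing the proposed object form (while still accepting the legacy boolean) could look like the sketch below. The field names come from the proposal above; the function itself is an illustration, not Filebeat's config code.

```python
def parse_take_over(cfg):
    """Accept both the legacy boolean and the proposed object form."""
    if isinstance(cfg, bool):  # legacy: take_over: true
        return {"enabled": cfg, "from_ids": []}
    if isinstance(cfg, dict):
        enabled = bool(cfg.get("enabled", False))
        from_ids = cfg.get("from_ids") or []
        if not all(isinstance(i, str) for i in from_ids):
            raise ValueError("take_over.from_ids must be a list of strings")
        return {"enabled": enabled, "from_ids": list(from_ids)}
    raise ValueError("take_over must be a bool or an object")
```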

File identity can be migrated at the same time

9.0 changes the filestream default file identity to fingerprint, which the log input does not support, so we also need to allow for the file identity migration at the same time. The file identity migration will follow the same restrictions and logic introduced by #41762.

Corner cases

There are several corner cases, here is a list of the key ones and how
we will handle them.

1. Same file harvested by log and filestream inputs

While this is technically possible, it is very unusual to have log
and filestream inputs actively harvesting the same file at the same
time.

If this is really the case, take_over.from_ids can be used to select
from which filestream input to read the state.

2. Log rotation: two or more states for the same path

That is the common log rotation case: even though there are multiple states for the path, there is only one "physical" file at the time of the migration. We will apply the file identity migration restrictions and logic from #41762 and only migrate the state that matches the file we found.
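The selection rule for this case can be sketched as follows: among several rotated states for one path, keep only the one whose stored identity matches the file currently on disk. The state dict shape and the `fingerprint` field name are assumptions for the example.

```python
def pick_state_for_rotation(states, current_fingerprint):
    """Return the single state whose stored file identity matches the
    file currently on disk, or None when there is no unambiguous match
    (in which case no state should be migrated)."""
    matches = [s for s in states if s.get("fingerprint") == current_fingerprint]
    return matches[0] if len(matches) == 1 else None
```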

3. Same file harvested by multiple inputs

This is a very unusual case; if it happens, the user can use
take_over.from_ids to select the source of state for the
migration. Listing multiple input IDs that were harvesting the same
file is not supported.

Not supported cases

Some situations are not supported; in those cases take over must not
be used. The side effect will be some data duplication, and while this
is not ideal, it is a reasonable trade-off in most cases.

Selecting states from the log input over filestream

In the unlikely case that Filebeat's registry contains states from the
log and filestream inputs for the same path, and the user wants to
migrate the state from the log input, this will not be supported.

@rdner do you have anything to add/have I missed something?

I was trying to write something about not deleting the old states, but
I cannot think about a safe way to keep them. Did we get to a solution
for this problem?

@rdner
Member Author

rdner commented Feb 27, 2025

@belimawr this is a great summary, thank you so much!

I was trying to write something about not deleting the old states, but
I cannot think about a safe way to keep them. Did we get to a solution
for this problem?

In theory we could detect that the "taking over" input already has a state for this file path and simply skip it during the next migration run. This should ensure a file state is migrated only once.
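The skip-if-already-migrated check could look like the sketch below, which makes the migration idempotent across restarts. Key names and the helper itself are hypothetical.

```python
def migrate_once(registry, new_key, old_key):
    """Copy the old state to the new key only if the new input has no
    state for this file yet, so re-running the migration after a restart
    never overwrites progress the new input has already made."""
    if new_key in registry:
        return False  # already migrated; keep the new input's progress
    if old_key not in registry:
        return False  # nothing to migrate
    registry[new_key] = dict(registry[old_key])
    return True
```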

However, this also means that the user would see data duplication until the old input is removed from the config.

This puts us in a tricky situation:

  • if we decide to delete state entries, we cannot really make a registry backup because we are running the migration on the in-memory state (no access to the registry file) and the registry file is in active use (sync discrepancies).
  • if we decide to keep the state entries, the users would have data duplication until they remove the old input

I think it's okay to delete the state entries if we are very clear about that in the docs and we explain how to make a registry backup manually if needed.

Few things I'd like to add to the summary:

  • we should lock the entire state against changes by other inputs during the migration; it's a critical section.
  • we should compute changes while iterating through the state entries, collect them, and apply them in one go at the end, so that if the migration fails in the middle we don't end up with an incomplete state (partial migration).
  • we should explicitly support and test the migration from a log input with the inode file identity to a filestream input with a fingerprint file identity.
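The compute-then-apply idea from the list above can be sketched as a two-phase step: plan all changes against a snapshot first, then apply them under a lock. The change-set dict shape (`set`/`delete`) is an assumption for the example, not an actual Filebeat API.

```python
import threading

def migrate_atomically(store, lock, plan_fn):
    """Compute the full change set first, then apply it in one locked
    step. If planning raises midway, the store is left untouched, so
    there is never a partial migration."""
    changes = plan_fn(dict(store))  # may raise; store is still unchanged
    with lock:  # critical section: no other input may touch the state
        for key, value in changes.get("set", {}).items():
            store[key] = value
        for key in changes.get("delete", []):
            store.pop(key, None)
    return store
```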

@belimawr
Contributor

I think it's okay to delete the state entries if we are very clear about that in the docs and we explain how to make a registry backup manually if needed.

I agree, thanks!

we should compute changes while iterating through the state entries, collect these changes and apply them in one go at the end. So if the migration fails in the middle we don't have an incomplete state (partial migration).

From a safety standpoint I like this idea, but I'm not sure how feasible it is at runtime, as it might mean locking the whole log and filestream input stores. Once I start working on it, I'll have better clarity on the possibilities and challenges.

(...) this also means that the user would see data duplication until the old input is removed from the config.

I believe we should only support the take_over if the "old inputs" have been removed from the config.

Honestly, I have a hard time picturing a legitimate use-case where the "old" and "new" inputs need to run at the same time.

We cannot really enforce it at runtime, but we can clearly document that the risk of running the "old" and "new" inputs at the same time is data duplication.
