Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add optional gazelle ConventionChecker interface to avoid indexing. #1907

Open
tyler-french opened this issue Sep 3, 2024 · 1 comment

Comments

@tyler-french
Copy link
Contributor

See implementation: #1900

Why Gazelle's index doesn't scale

Currently, gazelle fix-update runs with a configurable option of an --index. Certain extensions can use the index to properly resolve imports in a repo. The ruleIndex is essentially a large structure containing all of the rules that have an associated Resolver.

If the index is enabled, the rule index must load and store information from the entire repository. If disabled, gazelle can rely (in most cases) on the # gazelle:prefix directive or -go_prefix flag to resolve dependencies.

There are two reasons this does not scale:

  • With index=true every gazelle run must rely on O(size of repo) rather than O(size of changes) to run because it must have a fully-populated index.
  • WIth index=false any exceptions to the go_prefix or # gazelle:prefix will require manual # gazelle:resolve statements added, which almost removes the purpose of automatic build file generation. Also, not all plugins or gazelle language extensions might work without an index.

Running gazelle in a repository should almost always be doable by running on just the files changed since a recent run. If a user is only modifying a small project, their gazelle invocation should only use and care about that part of the repository. Otherwise, the incrementally of gazelle is not there.

Gazelle Convention Checker

At Uber, we've solved this problem using a gazelle ConventionChecker interface, which is patched onto our gazelle binary. This convention checker works in the following manner.

A language extension can implement a Convention interface.

// Convention should be implemented by langauge extensions in order to
// register language specific convention logic with the convention checker.
type Convention interface {
	// CheckConvention returns whether or not the rule information follows
	// a known convention.
	CheckConvention(c *config.Config, kind, imp, name, rel string) bool
}

If an import path follows the conventions expected by the Resolver of the interface, then the convention checker will return true. Otherwise, the ConventionChecker will add to list of rules which violate the convention.

Then, on Finish(, the ConventionChecker will write any rules that violate the convention as # gazelle:resolve directives in the top-level BUILD.bazel file. The "automatic resolves" are added to the top-level file to speed up lookup, and are added in a special location to avoid confusion with other parts of this file. They are added under a header "### AUTOMATIC RESOLVES ###" at the bottom of the file.

Optionally, these could be added somewhere else from some other file passed at runtime, but since they are technically part of the build configuration, using the BUILD.bazel file makes the most sense.

Why do this?

This has two clear benefits:

  • All gazelle runs are incremental, and do not need to have an index. Only the convention violations need to be loaded, and there is no manual intervention required for rules that don't follow convention.
  • The bottleneck of gazelle performance will either be the size of the change, or the number of # gazelle:resolve statements for convention violations. There can be 10000s of convention violations, but they will still be loaded much faster than walking 10000s of directories to load an index. Once they are loaded once, they exist in memory in a O(1) lookup map, so they can be easily retrieved.

The convention checker can also be enabled/disabled with a flag, so this optimization is optional for all users, and can be avoided if desired.

@alex-torok
Copy link

alex-torok commented Sep 10, 2024

This was mentioned in my proposal for lazy indexing (#1891) as an alternative method. I think that this conventions system offers a better way to get faster incremental results, as it allows repos to more easily implement their own custom plugins to form their own conventions. The "automatic resolves" act as a versioned partial index, which solves the cache consistency issue that #1181 faces. Nice idea here!

I have a few questions:

  1. How do you keep the automatic resolves up-to-date? I'm guessing that a CI job runs gazelle with --use_conventions and asserts no changes are made to the index?
    • It seems like this would work well, as any change to a specific directory would assert that the automatic resolves for that directory are updated in the root-level "automatic resolves" section. Have there been any issues with merge conflicts?
  2. Can you provide some example descriptions of conventions you've implemented and found use for? The Convention interface is somewhat abstract from what an actual import->label convention is, so having an example or two would help.
    • Specifically, I'm curious about what conventions you use for python. The problem of mapping import statements to target labels without full indexing seems like it would require enforcing a strict style for import statements or considering all python targets as not adhering to any conventions. The latter essentially would make the conventions index have every python target in it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants