[RFC] Path-walk API and applications #1786

derrickstolee · 2024-09-09T20:30:32Z

This RFC is ultimately about introducing a new way to walk objects, called the
"path-walk API" in the new path-walk.[ch] files. Before digging into the details
of the API, let's discuss the applications which will hint at the API's design.

APPLICATIONS OF THE PATH-WALK API

The applications of this API were discovered in the following order, though I
recommend reversing the order for the priority of their actual implementation in
future patch series:

git backfill: a builtin to download missing blobs in a blobless partial clone,
done in batches and grouped by the path they appear in to maximize delta
compression in each batch. Allows focusing on the paths of the sparse-checkout
to only get the blobs necessary for history queries in the current focus.
git survey: Jeff Hostetler built this feature [1] as a way to get functionality
similar to git-sizer [2], but using the internals of Git to do it faster. It
also displays information not available to git-sizer, like the on-disk size of
objects. This RFC presents a simplified version of the builtin focused on
listing the paths that contribute the most to the on-disk size of trees and
blobs.
git pack-objects --path-walk: In order to find a way to compute deltas among
objects of the same path, I applied the path-walk API to 'git pack-objects'
behind an optional flag. There are overlaps with the '--sparse' option [3], [4]
that can be used here. This provides perfect partitioning by path name, without
any possible collisions from the name-hash algorithm. It also allows using the
name-hash values to find cross-path delta chains in a second pass.
git repack --full-name-hash: If we are worried about name-hsah collisions, an
easier thing to implement is a different name-hash algorithm that is less
likely to have collisions. This feature was already sent to the mailing list
as a fully-reviewable series [5]. It is included here because this series
allows testing the --path-walk option against the --full-name-hash.

[1] microsoft#667

[2] https://github.com/github/git-sizer

[3] git/git@5d826e9...99dbbfa

[4] https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/

[5] https://lore.kernel.org/git/[email protected]

TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER

Here's the story about how these applications came about: I was tasked with
understanding why certain internal repositories were growing larger than
expected. (Feel free to skip. Otherwise, thank you for indulging me.)

I first prototyped 'git backfill' as a way to download some of the largest
repositories without being blocked on a full clone. This batched download
mechanism allowed me to essentially have a retryable clone, since the client
could restart the process from scratch and skip any objects that were already on
disk. It was natural to batch based on the path of the blobs in order to
theoretically save time and network bandwidth due to better delta calculations.

While investigating these repositories, I had some data hinting at the total
size of the objects by type. But I was most interested in learning what exactly
was causing this growth. I did have a hint that the "release" branches were
taking up much more space than the default branch, since cloning with
--single-branch resulted in ~55GB of data but then fetching the release branches
led to an additional ~125GB of data. Using "git diff" it was clear that these
branches stored some CHANGELOG.json and CHANGELOG.md files, but the diffs were
relatively small (but multiple such changes were batched together into a single
new commit). I needed to see why exactly these paths were taking up so much
space.

To this, I turned to Jeff Hostetler's new "git survey" command. This told me
information about the total size of trees and blobs and told me the path name
for the individual blobs that were the largest in the repository. These paths
were typically binary files that appeared only once or twice and did not account
for the scale issues. I modified 'git survey' to use the same path-batching
logic as in 'git backfill' to then consider each batch of objects and their size.
Thus, the "path-walk API" was born. (This RFC contains a version that looks like
it was created before 'git backfill'.)

The repository I was looking at had a clear pattern in its top 100 file paths by
on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md files. The .md
files surprised me, since they were always simple appends of the previous .md
file at the same path. The .json files were capped in how many versions were
being listed (at least in recent versions) but the data it stored was harder to
compress. So I went looking into how 'git push' was calculating these delta
bases. Adding some debug information to 'git pack-objects' demonstrated that the
previous file versions were not being matched as delta bases. Instead, other new
blobs in the push were being used as delta bases.

This meant that what should have been a trivial set of deltas bloated to 20-60
MB. (We will see later that it is possible for these to be 100-500 KB.)

Here is where I went on a little bit of a detour. (Come with me, it's important.)
I knew that 'git pack-objects' used a name-hash to group objects by path, so I
assumed that the reason these delta bases were not found was because the
UNINTERESTING objects were not being added to the packing list. I have since
discovered that this is incorrect, but I might have gotten stuck if I didn't
think this.

This seemed like a natural reason to extend the path-walk API to allow walking
commits and tags as part of 'git pack-objects' behind a new '--path-walk'
option. The idea here is to compute deltas among the objects that share a common
path and then later go through the (type, name-hash, size) sorting system to
find other delta bases across path boundaries. After a lot of testing, failing,
and testing again, the implementation in this RFC finally works to achieve the
goal. It's not pretty (especially with how it handles tags) but it gets the job
done.

In hindsight, I realized that the UNINTERESTING objects were being considered,
but due to collisions in the name-hash algorithm these objects were being sorted
outside of the delta computation window. For this reason, I thought to create a
new name-hash algorithm. Thus, the --full-name-hash option for 'git pack-objects'
and 'git repack' was born. This feature was split out and sent to the mailing
list independently from this RFC.

RFC GOALS

The goals of this RFC are:

To demonstrate potential applications of the path-walk API to motivate its
generality as these features are sent in full-quality patch series, but in
a different order.
To communicate the discoveries found during the --path-walk and
--full-name-hash features in 'git pack-objects' and 'git repack'. This
includes comparing and contrasting the effectiveness of these features.
To demonstrate the value of the path-based batching in the 'git survey'
feature, and to inspire others to think about what other statistics would
be valuable in that feature. (I anticipate that once a base is established,
multiple contributors will help expand its functionality long into the
future.)

RFC OUTLINE

The patches are grouped roughly by the application, in order of discovery:

PART I: 'git backfill'

These patches introduce the 'git backfill' builtin including its '--batch-size'
and '--sparse' options. While this is the first and simplest application, it is
also the lowest priority in terms of user need.

path-walk: introduce an object walk by path
backfill: add builtin boilerplate
backfill: basic functionality and tests
backfill: add --batch-size= option
backfill: add --sparse option
backfill: assume --sparse when sparse-checkout is enabled

PART II: 'git survey'

These patches reimplement a subset of the functionality of 'git survey' as well
as generalize some of the data structures that Jeff's implementation made. The
flexibility hopefully comes through to show the potential for future extensions.
These patches are quite rough and will need more attention before they can be
sent for full review.

path-walk: allow consumer to specify object types
path-walk: allow visiting tags
survey: stub in new experimental git-survey command
survey: add command line opts to select references
survey: collect the set of requested refs
survey: start pretty printing data in table form
survey: add object count summary
survey: summarize total sizes by object type
survey: show progress during object walk
survey: add ability to track prioritized lists
survey: add report of "largest" paths

PART III: 'git pack-objects --path-walk'

Here is where I think the meat of the RFC really lies. There are still some
rough edges, but the data will show that 'git pack-objects --path-walk' has the
potential to be an extremely effective way to pack objects. (Caveats will come
later in the analysis section.)

revision: create mark_trees_uninteresting_dense()
path-walk: add prune_all_uninteresting option
pack-objects: add --path-walk option
pack-objects: extract should_attempt_deltas()
pack-objects: introduce GIT_TEST_PACK_PATH_WALK
p5313: add size comparison test
repack: add --path-walk option
pack-objects: enable --path-walk via config
scalar: enable path-walk during push via config

One obvious issue with this current implementation is that it inlines much of
the delta calculation into the "Enumerate objects" phase, and thus makes it
single-threaded. This should be fixed before being considered for full review,
but even without threading this version can out-perform the standard packing
strategy in terms of end-to-end packing time!

PART IV: 'git repack --full-name-hash'

This is a simplified version of the patch series that was split out by itself
earlier for full review. This was split out on its own partly because it doesn't
actually use the path-walk API. This has benefits and drawbacks, but it seems
like a quick win for many scenarios.

pack-objects: add --full-name-hash option
test-name-hash: add helper to compute name-hash functions
p5314: add a size test for name-hash collisions
pack-objects: output debug info about deltas

This last patch is an add-on of the debugging information that I used to
discover issues with delta bases during 'git push'.

DISCUSSION OF NAME HASH

One thing to talk about before digging into --path-walk and --full-name-hash
features is the existing name-hash algorithm. This hash algorithm creates a
uint32_t based on the final 16 characters of the path name, weighing the last
characters more. There are multiple benefits to this:

Files of common types (.c, .txt, ...) may be grouped together.
Files that are renamed across directories may be grouped together.

(Thanks, Junio, for making this second benefit clear.)

The issue here is that some common patterns arise in repositories that use
common path names across directories, and those files are creating name-hash
collisions and making the sort less effective. One thing that can counteract
these collisions is to increase the --window setting, but this significantly
slows the delta computations.

Thus, the --path-walk and --full-name-hash features both attempt to combat these
name-hash collisions in very different ways. The --path-walk mechanism uses the
path-walk API to consider batches of objects that all share the same path. This
avoids storing every possible path names in memory while doing the object walk,
but still gives nice boundaries for delta compression possibilities. After the
path-walk is complete, the full packing list is still sorted via name-hash and
this allows for cross-path deltas. This is critical!

The --full-name-hash feature does the simpler choice of replacing the name-hash
method with one that has fewer collisions, but loses the benefits of "nearby"
paths having close hash values.

This naturally leads to these two main differences in the two approaches:

The --full-name-hash feature is not good for 'git push' or similarly small
pack-files. Since it limits the delta chains to objects with the same full
path and loses the benefit of "nearby" paths, this feature should be used
for larger repacks. In my testing, 'git push' simulations almost always have
poor packing but 'git repack -adf' simulations have packing rivaling the
--path-walk option.
The --path-walk option changes the object ordering significantly, meaning it
may not ever be appropriate to combine with advanced repacking features such
as delta islands or even reachability bitmaps. While my testing has shown
that the --path-walk repacks are the most efficient of all options, this
limitation makes me hesitate to recommend it wider than client repositories.

One natural question to consider is to think about storing both the name-hash
and the full-name-hash and doing two delta passes, each one sorting the objects
by a different hash function. The first issue is that we don't want to store two
hash values per object, as that will significantly increase the memory pressure
during repacking. This could be side-stepped by storing the full-name-hash in
the packing list and then a second mapping from full-name-hash to name-hash.
However, that still leads to the two passes taking extra time. The --path-walk
approach is faster than even a single pass. And in the right scenarios, the
--full-name-hash option is very close to the --path-walk results.

ANALYSIS OF PACKING STRATEGIES

Since I was focused on an internal monorepo that stored a large collection of
Javascript packages, it should be no surprise that I eventually found other
Javascript repositories that used similar tooling and thus had similar issues
with unexplained scale problems. I'll use these four repositories as examples
repeatedly, but one is actually public: microsoft/fluentui [6]. I'll use Repo B,
C, and D for the others, in increasing size.

[6] https://github.com/microsoft/fluentui

In each of these repositories, doing a full repack ('git repack -adf') right
after cloning presents a sizeable reduction in space. This is expected for
servers not being optimized for exactly the reachable set I'm cloning. So, I'll
focus on how much the default repacking parameters compare to using the
--full-name-hash or --path-walk options:

| Repo     | Standard Repack | With --full-name-hash | With --path-walk |
|----------|-----------------|-----------------------|------------------|
| fluentui |         438 MB  |               168 MB  |          148 MB  |
| Repo B   |       6,255 MB  |               829 MB  |          778 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |        6,158 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |        4,432 MB  |

Hopefully these reductions show how much these name-hash collisions are causing
issues in these repositories.

For the fluentui repo, I'm also able to share highlights from the 'git survey'
output immediately after cloning and after repacking with the --path-walk option.

First, we can consider the total reachable objects in each scenario:

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10443669 |      15092790
      Trees | 276503 |  40212070 |     244429615
      Blobs | 294500 | 635365661 |   10791187920

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10450605 |      15092790
      Trees | 276503 |  31136263 |     244429615
      Blobs | 294500 |  94442401 |   10791187920

Now, we can consider the top 10 file paths before and after repacking with the
--path-walk feature:

TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |  58060531 |     889120488
 ...-experiments/CHANGELOG.json |   505 |  28439452 |     252723999
 ...fabric-react/CHANGELOG.json |  1270 |  25556510 |     902756623
 ...t-components/CHANGELOG.json |   176 |  20366365 |     244936649
 ...act-charting/CHANGELOG.json |   590 |  20106422 |     208224460
 ...e-components/CHANGELOG.json |   559 |  15761271 |     189061764
 ...act-examples/CHANGELOG.json |   577 |  13615569 |     234949961
 ...react-charting/CHANGELOG.md |   564 |  11205840 |     104337986
 .../experiments/CHANGELOG.json |   569 |  10596377 |     123662770
 ...i-fabric-react/CHANGELOG.md |  1263 |   8154248 |     261494258

TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |   9909326 |     889120488
 ...iceBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
 ...out/src/images/download.jpg |     1 |   1845249 |       1846117
 ...fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
          .yarn/releases/cli.js |     1 |   1335657 |       6741614
 ...c/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
 ...ages/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
 ...nents/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
 ...ig/rush/npm-shrinkwrap.json |   138 |   1058531 |      89326567
 ...Accessibility_29Sep2016.pdf |     1 |    856621 |        927268

As we can see from this example, before the repack we are mainly seeing the disk
space be dominated by objects that appear at paths with CHANGELOG.json or
CHANGELOG.md, which are frequently hitting collisions with the default name-hash.
After the repack with the --path-walk feature, we see the largest paths are what
we expect: checked in binaries, frequently-edited yarn files, and other hard-to-
compress data.

The nice thing about this result is that we can point to files that are taking
up space because there is no other way around it, not that the naming convention
for the files is causing confusion during packing.

WHAT TO DO NOW?

Thank you for reading this far. I hope that this has added context on the patch
series for the --full-name-hash option, but also provided the right context for
when I come back in a week or so with a review-ready version of the --path-walk
option.

These patches are rough and I want to make sure everyone knows that. Reviewing
them for style or even basic organization may lead to some wasted time. I know
that at least one of the patches got mangled and its diff does not match its
description (I'm thinking specifically about "pack-objects: extract
should_attempt_deltas()" but this could apply elsewhere).

But I think that interested parties could take my branch, build it, and give
these features a try on their favorite repos. I'd love to hear feedback on the
usability and effectiveness, especially if someone finds a case where the
--path-walk option is less effective at packing data.

There are two things going on since I started working on this series that could
cause issues with applying this series on top of current 'master' or with topics
in 'seen':

The most obvious one is the collision with the --full-name-hash feature under
review. The in-review series takes precedent.
The unused parameter warnings are now errors! I'm glad that got in, but I
wasn't careful about it in these RFC patches.
John Cai is proposing to change the builtin interface to have builtin methods
take a repository pointer. I approve of this effort, but it collides with the
two new builtins introduced here. As discussed above, the 'git survey' and
'git backfill' builtins are the lowest priority items for getting merged, in
my opinion, so they can wait until that change has been made.

I look forward to any and all comments.

Thanks! -Stolee

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: Christian Couder [email protected]
cc: "Kristoffer Haugsbakk" [email protected]

In anticipation of a few planned applications, introduce the most basic form of a path-walk API. It currently assumes that there are no UNINTERESTING objects, and does not include any complicated filters. It calls a function pointer on groups of tree and blob objects as grouped by path. This only includes objects the first time they are discovered, so an object that appears at multiple paths will not be included in two batches. There are many future adaptations that could be made, but they are left for future updates when consumers are ready to take advantage of those features. RFC TODO: It would be helpful to create a test-tool that allows printing of each batch for strong testing. Signed-off-by: Derrick Stolee <[email protected]>

In anticipation of implementing 'git backfill', populate the necessary files with the boilerplate of a new builtin. RFC TODO: When preparing this for a full implementation, make sure it is based on the newest standards introduced by [1]. [1] https://lore.kernel.org/git/[email protected]/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7 [PATCH 1/3] builtin: add a repository parameter for builtin functions Signed-off-by: Derrick Stolee <[email protected]>

The default behavior of 'git backfill' is to fetch all missing blobs that are reachable from HEAD. Document and test this behavior. The implementation is a very simple use of the path-walk API, initializing the revision walk at HEAD to start the path-walk from all commits reachable from HEAD. Ignore the object arrays that correspond to tree entries, assuming that they are all present already. Signed-off-by: Derrick Stolee <[email protected]>

Users may want to specify a minimum batch size for their needs. This is only a minimum: the path-walk API provides a list of OIDs that correspond to the same path, and thus it is optimal to allow delta compression across those objects in a single server request. We could consider limiting the request to have a maximum batch size in the future. Signed-off-by: Derrick Stolee <[email protected]>

One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensie as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Signed-off-by: Derrick Stolee <[email protected]>

The previous change introduced the '--[no-]sparse' option for the 'git backfill' command, but did not assume it as enabled by default. However, this is likely the behavior that users will most often want to happen. Without this default, users with a small sparse-checkout may be confused when 'git backfill' downloads every version of every object in the full history. However, this is left as a separate change so this decision can be reviewed independently of the value of the '--[no-]sparse' option. Add a test of adding the '--sparse' option to a repo without sparse-checkout to make it clear that supplying it without a sparse-checkout is an error. Signed-off-by: Derrick Stolee <[email protected]>

This adds the ability to ask for the commits as a single list. This will also reduce the calls in 'git backfill' to be a BUG() statement if called with anything other than blobs. Signed-off-by: Derrick Stolee <[email protected]>

In anticipation of using the path-walk API to analyze tags or include them in a pack-file, add the ability to walk the tags that were included in the revision walk. Signed-off-by: Derrick Stolee <[email protected]>

Start work on a new `git survey` command to scan the repository for monorepo performance and scaling problems. The goal is to measure the various known "dimensions of scale" and serve as a foundation for adding additional measurements as we learn more about Git monorepo scaling problems. The initial goal is to complement the scanning and analysis performed by the GO-based `git-sizer` (https://github.com/github/git-sizer) tool. It is hoped that by creating a builtin command, we may be able to take advantage of internal Git data structures and code that is not accessible from GO to gain further insight into potential scaling problems. RFC TODO: Adapt this boilerplat to match the upcoming changes to builtin methods that include a 'struct repository' pointer. Co-authored-by: Derrick Stolee <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>

By default we will scan all references in "refs/heads/", "refs/tags/" and "refs/remotes/". Add command line opts let the use ask for all refs or a subset of them and to include a detached HEAD. Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>

Collect the set of requested branches, tags, and etc into a ref_array and collect the set of requested patterns into a strvec. RFC TODO: This patch has some changes that should be in the previous patch, to make the diff look a lot better. Co-authored-by: Derrick Stolee <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>

When 'git survey' provides information to the user, this will be presented in one of two formats: plaintext and JSON. The JSON implementation will be delayed until the functionality is complete for the plaintext format. The most important parts of the plaintext format are headers specifying the different sections of the report and tables providing concreted data. Create a custom table data structure that allows specifying a list of strings for the row values. When printing the table, check each column for the maximum width so we can create a table of the correct size from the start. The table structure is designed to be flexible to the different kinds of output that will be implemented in future changes. Signed-off-by: Derrick Stolee <[email protected]>

At the moment, nothing is obvious about the reason for the use of the path-walk API, but this will become more prevelant in future iterations. For now, use the path-walk API to sum up the counts of each kind of object. For example, this is the reachable object summary output for my local repo: REACHABLE OBJECT SUMMARY ======================== Object Type | Count ------------+------- Tags | 0 Commits | 178573 Trees | 312745 Blobs | 183035 (Note: the "Tags" are zero right now because the path-walk API has not been integrated to walk tags yet. This will be fixed in a later change.) RFC TODO: make sure tags are walked before this change. Signed-off-by: Derrick Stolee <[email protected]>

Now that we have explored objects by count, we can expand that a bit more to summarize the data for the on-disk and inflated size of those objects. This information is helpful for diagnosing both why disk space (and perhaps clone or fetch times) is growing but also why certain operations are slow because the inflated size of the abstract objects that must be processed is so large. Signed-off-by: Derrick Stolee <[email protected]>

Signed-off-by: Derrick Stolee <[email protected]>

In future changes, we will make use of these methods. The intention is to keep track of the top contributors according to some metric. We don't want to store all of the entries and do a sort at the end, so track a constant-size table and remove rows that get pushed out depending on the chosen sorting algorithm. Co-authored-by: Jeff Hostetler <[email protected]> Signed-off-by; Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>

Since we are already walking our reachable objects using the path-walk API, let's now collect lists of the paths that contribute most to different metrics. Specifically, we care about * Number of versions. * Total size on disk. * Total inflated size (no delta or zlib compression). This information can be critical to discovering which parts of the repository are causing the most growth, especially on-disk size. Different packing strategies might help compress data more efficiently, but the toal inflated size is a representation of the raw size of all snapshots of those paths. Even when stored efficiently on disk, that size represents how much information must be processed to complete a command such as 'git blame'. Since the on-disk size is likely to be fragile, stop testing the exact output of 'git survey' and check that the correct set of headers is output. Signed-off-by: Derrick Stolee <[email protected]>

The sparse tree walk algorithm was created in d5d2e93 (revision: implement sparse algorithm, 2019-01-16) and involves using the mark_trees_uninteresting_sparse() method. This method takes a repository and an oidset of tree IDs, some of which have the UNINTERESTING flag and some of which do not. Create a method that has an equivalent set of preconditions but uses a "dense" walk (recursively visits all reachable trees, as long as they have not previously been marked UNINTERESTING). This is an important difference from mark_tree_uninteresting(), which short-circuits if the given tree has the UNINTERESTING flag. A use of this method will be added in a later change, with a condition set whether the sparse or dense approach should be used. Signed-off-by: Derrick Stolee <[email protected]>

This option causes the path-walk API to act like the sparse tree-walk algorithm implemented by mark_trees_uninteresting_sparse() in list-objects.c. Starting from the commits marked as UNINTERESTING, their root trees and all objects reachable from those trees are UNINTERSTING, at least as we walk path-by-path. When we reach a path where all objects associated with that path are marked UNINTERESTING, then do no continue walking the children of that path. We need to be careful to pass the UNINTERESTING flag in a deep way on the UNINTERESTING objects before we start the path-walk, or else the depth-first search for the path-walk API may accidentally report some objects as interesting. Signed-off-by: Derrick Stolee <[email protected]>

In order to more easily compute delta bases among objects that appear at the exact same path, add a --path-walk option to 'git pack-objects'. This option will use the path-walk API instead of the object walk given by the revision machinery. Since objects will be provided in batches representing a common path, those objects can be tested for delta bases immediately instead of waiting for a sort of the full object list by name-hash. This has multiple benefits, including avoiding collisions by name-hash. The objects marked as UNINTERESTING are included in these batches, so we are guaranteeing some locality to find good delta bases. After the individual passes are done on a per-path basis, the default name-hash is used to find other opportunistic delta bases that did not match exactly by the full path name. RFC TODO: It is important to note that this option is inherently incompatible with using a bitmap index. This walk probably also does not work with other advanced features, such as delta islands. Getting ahead of myself, this option compares well with --full-name-hash when the packfile is large enough, but also performs at least as well as the default in all cases that I've seen. RFC TODO: this should probably be recording the batch locations to another list so they could be processed in a second phase using threads. RFC TODO: list some examples of how this outperforms previous pack-objects strategies. (This is coming in later commits that include performance test changes.) Signed-off-by: Derrick Stolee <[email protected]>

Signed-off-by: Derrick Stolee <[email protected]>

There are many tests that validate whether 'git pack-objects' works as expected. Instead of duplicating these tests, add a new test environment variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default when specified. This was useful in testing the implementation of the --path-walk implementation, especially in conjunction with test such as: - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of the --sparse option and how it combines with --path-walk. RFC TODO: list other helpful test cases, as well as the ones where the behavior breaks if this is enabled... Signed-off-by: Derrick Stolee <[email protected]>

To test the benefits of the new --path-walk option in 'git pack-objects', create a performance test that times the process but also compares the size of the output. Against the microsoft/fluentui repo [1] against a particular commit [2], this has reproducible results of a similar scale: Test this tree --------------------------------------------------------------- 5313.2: thin pack 0.39(0.48+0.03) 5313.3: thin pack size 1.2M 5313.4: thin pack with --path-walk 0.09(0.07+0.01) 5313.5: thin pack size with --path-walk 20.8K 5313.6: big recent pack 2.13(8.29+0.26) 5313.7: big recent pack size 17.7M 5313.8: big recent pack with --path-walk 3.18(4.21+0.22) 5313.9: big recent pack size with --path-walk 15.0M [1] https://github.com/microsoft/reactui [2] e70848ebac1cd720875bccaa3026f4a9ed700e08 RFC TODO: Note that the path-walk version is slower for the big case, but the delta calculation is single-threaded with the current implementation! It's still faster for the small case that mimics a typical push. Signed-off-by: Derrick Stolee <[email protected]>

Since 'git pack-objects' supports a --path-walk option, allow passing it through in 'git repack'. This presents interesting testing opportunities for comparing the different repacking strategies against each other. For the microsoft/fluentui repo [1], the results are very interesting: Test this tree ------------------------------------------------------------------- 5313.10: full repack 97.91(663.47+2.83) 5313.11: full repack size 449.1K 5313.12: full repack with --path-walk 105.42(120.49+0.95) 5313.13: full repack size with --path-walk 159.1K [1] https://github.com/microsoft/fluentui This repo suffers from having a lot of paths that collide in the name hash, so examining them in groups by path leads to better deltas. Also, in this case, the single-threaded implementation is competitive with the full repack. This is saving time diffing files that have significant differences from each other. Signed-off-by: Derrick Stolee <[email protected]>

Users may want to enable the --path-walk option for 'git pack-objects' by default, especially underneath commands like 'git push' or 'git repack'. This should be limited to client repositories, since the --path-walk option disables bitmap walks, so would be bad to include in Git servers when serving fetches and clones. There is potential that it may be helpful to consider when repacking the repository, to take advantage of improved deltas across historical versions of the same files. Much like how "pack.useSparse" was introduced and included in "feature.experimental" before being enabled by default, use the repository settings infrastructure to make the new "pack.usePathWalk" config enabled by "feature.experimental" and "feature.manyFiles". Signed-off-by: Derrick Stolee <[email protected]>

Repositories registered with Scalar are expected to be client-only repositories that are rather large. This means that they are more likely to be good candidates for using the --path-walk option when running 'git pack-objects', especially under the hood of 'git push'. Enable this config in Scalar repositories. Signed-off-by: Derrick Stolee <[email protected]>

RFC NOTE: this is essentially the same as the patch introduced independently of the RFC, but now is on top of the --path-walk option instead. This is included in the RFC for comparison purposes. RFC NOTE: As you can see from the details below, the --full-name-hash option essentially attempts to do similar things as the --path-walk option, but sometimes misses the mark. Collisions still happen with the --full-name-hash option, leading to some misses. However, in cases where the default name-hash algorithm has low collision rates and deltas are actually desired across objects with similar names but different full names, the --path-walk option can still take advantage of the default name hash approach. Here are the new performance details simulating a single push in an internal monorepo using a lot of paths that collide in the default name hash. We can see that --full-name-hash gets close to the --path-walk option's size. Test this tree -------------------------------------------------------------- 5313.2: thin pack 2.43(2.92+0.14) 5313.3: thin pack size 4.5M 5313.4: thin pack with --full-name-hash 0.31(0.49+0.12) 5313.5: thin pack size with --full-name-hash 15.5K 5313.6: thin pack with --path-walk 0.35(0.31+0.04) 5313.7: thin pack size with --path-walk 14.2K However, when simulating pushes on repositories that do not have issues with name-hash collisions, the --full-name-hash option presents a potential of worse delta calculations, such as this example using my local Git repository: Test this tree -------------------------------------------------------------- 5313.2: thin pack 0.03(0.01+0.01) 5313.3: thin pack size 475 5313.4: thin pack with --full-name-hash 0.02(0.01+0.01) 5313.5: thin pack size with --full-name-hash 14.8K 5313.6: thin pack with --path-walk 0.02(0.01+0.01) 5313.7: thin pack size with --path-walk 475 Note that the path-walk option found the same delta bases as the default options in this case. In the full repack case, the --full-name-hash option may be preferable because it interacts well with other advanced features, such as using bitmap indexes and tracking delta islands. Signed-off-by: Derrick Stolee <[email protected]>

Using this tool, we can count how many distinct name-hash values exist within a list of paths. Examples include git ls-tree -r --name-only HEAD | \ test-tool name-hash | \ awk "{print \$1;}" | \ sort -ns | uniq | wc -l which outputs the number of distinct name-hash values that appear at HEAD. Or, the following which presents the resulting name-hash values of maximum multiplicity: git ls-tree -r --name-only HEAD | \ test-tool name-hash | \ awk "{print \$1;}" | \ sort -n | uniq -c | sort -nr | head -n 25 For an internal monorepo with around a quarter million paths at HEAD, the highest multiplicity for the standard name-hash function was 14,424 while the full name-hash algorithm had only seven hash values with any collision, with a maximum multiplicity of two. Signed-off-by: Derrick Stolee <[email protected]>

This test helps inform someone as to the behavior of the name-hash algorithms for their repo based on the paths at HEAD. For example, the microsoft/fluentui repo had these statistics at time of committing: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 19.6K 5314.2: number of distinct name-hashes 8.2K 5314.3: number of distinct full-name-hashes 19.6K 5314.4: maximum multiplicity of name-hashes 279 5314.5: maximum multiplicity of fullname-hashes 1 That demonstrates that of the nearly twenty thousand path names, they are assigned around eight thousand distinct values. 279 paths are assigned to a single value, leading the packing algorithm to sort objects from those paths together, by size. In this repository, no collisions occur for the full-name-hash algorithm. In a more extreme example, an internal monorepo had a much worse collision rate: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 221.6K 5314.2: number of distinct name-hashes 72.0K 5314.3: number of distinct full-name-hashes 221.6K 5314.4: maximum multiplicity of name-hashes 14.4K 5314.5: maximum multiplicity of fullname-hashes 2 Even in this repository with many more paths at HEAD, the collision rate was low and the maximum number of paths being grouped into a single bucket by the full-path-name algorithm was two. Signed-off-by: Derrick Stolee <[email protected]>

In order to debug what is going on during delta calculations, add a --debug-file=<file> option to 'git pack-objects'. This leads to sending a JSON-formatted description of the delta information to that file. Signed-off-by: Derrick Stolee <[email protected]>

This is a follow up to #5157 as well as motivated by the RFC in gitgitgadget#1786. We have ways of walking all objects, but it is focused on visiting a single commit and then expanding the new trees and blobs reachable from that commit that have not been visited yet. This means that objects arrive without any locality based on their path. Add a new "path walk API" that focuses on walking objects in batches according to their type and path. This will walk all annotated tags, all commits, all root trees, and then start a depth-first search among all paths in the repo to collect trees and blobs in batches. The most important application for this is being fast-tracked to Git for Windows: `git pack-objects --path-walk`. This application of the path walk API discovers the objects to pack via this batched walk, and automatically groups objects that appear at a common path so they can be checked for delta comparisons. This use completely avoids any name-hash collisions (even the collisions that sometimes occur with the new `--full-name-hash` option) and can be much faster to compute since the first pass of delta calculations does not waste time on objects that are unlikely to be diffable. Some statistics are available in the commit messages.

derrickstolee and others added 30 commits September 8, 2024 14:04

path-walk: allow consumer to specify object types

2829fe3

This adds the ability to ask for the commits as a single list. This will also reduce the calls in 'git backfill' to be a BUG() statement if called with anything other than blobs. Signed-off-by: Derrick Stolee <[email protected]>

path-walk: allow visiting tags

d67679d

In anticipation of using the path-walk API to analyze tags or include them in a pack-file, add the ability to walk the tags that were included in the revision walk. Signed-off-by: Derrick Stolee <[email protected]>

survey: show progress during object walk

9c54c14

Signed-off-by: Derrick Stolee <[email protected]>

pack-objects: extract should_attempt_deltas()

502008b

Signed-off-by: Derrick Stolee <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] Path-walk API and applications #1786

[RFC] Path-walk API and applications #1786

Uh oh!

derrickstolee commented Sep 9, 2024 •

edited by gitgitgadget bot

Loading

Uh oh!

Uh oh!

[RFC] Path-walk API and applications #1786

[RFC] Path-walk API and applications #1786

Uh oh!

Conversation

derrickstolee commented Sep 9, 2024 • edited by gitgitgadget bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

APPLICATIONS OF THE PATH-WALK API

TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER

RFC GOALS

RFC OUTLINE

PART I: 'git backfill'

PART II: 'git survey'

PART III: 'git pack-objects --path-walk'

PART IV: 'git repack --full-name-hash'

DISCUSSION OF NAME HASH

ANALYSIS OF PACKING STRATEGIES

WHAT TO DO NOW?

Uh oh!

Uh oh!

derrickstolee commented Sep 9, 2024 •

edited by gitgitgadget bot

Loading