[RFC] Path-walk API and applications #1786

Draft: wants to merge 30 commits into base: master

Conversation


@derrickstolee commented Sep 9, 2024

This RFC is ultimately about introducing a new way to walk objects, called the
"path-walk API" in the new path-walk.[ch] files. Before digging into the details
of the API, let's discuss the applications which will hint at the API's design.

APPLICATIONS OF THE PATH-WALK API

The applications of this API were discovered in the following order, though I
recommend reversing the order for the priority of their actual implementation in
future patch series:

  • git backfill: a builtin to download missing blobs in a blobless partial clone,
    done in batches and grouped by the path they appear in to maximize delta
    compression in each batch. It can focus on the paths in the sparse-checkout
    to download only the blobs necessary for history queries in the current focus.

  • git survey: Jeff Hostetler built this feature [1] as a way to get functionality
    similar to git-sizer [2], but using the internals of Git to do it faster. It
    also displays information not available to git-sizer, like the on-disk size of
    objects. This RFC presents a simplified version of the builtin focused on
    listing the paths that contribute the most to the on-disk size of trees and
    blobs.

  • git pack-objects --path-walk: In order to find a way to compute deltas among
    objects of the same path, I applied the path-walk API to 'git pack-objects'
    behind an optional flag. There are overlaps with the '--sparse' option [3], [4]
    that can be used here. This provides perfect partitioning by path name, without
    any possible collisions from the name-hash algorithm. It also allows using the
    name-hash values to find cross-path delta chains in a second pass.

  • git repack --full-name-hash: If we are worried about name-hash collisions, an
    easier thing to implement is a different name-hash algorithm that is less
    likely to have collisions. This feature was already sent to the mailing list
    as a fully-reviewable series [5]. It is included here because this series
    allows testing the --path-walk option against the --full-name-hash.

[1] https://github.com/microsoft/git/pull/667

[2] https://github.com/github/git-sizer

[3] https://github.com/git/git/compare/5d826e972970a784bd7a7bdf587512510097b8c7...99dbbfa8ddbba2b620965d026d4ec199b8837a6f

[4] https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/

[5] https://lore.kernel.org/git/[email protected]

TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER

Here's the story about how these applications came about: I was tasked with
understanding why certain internal repositories were growing larger than
expected. (Feel free to skip. Otherwise, thank you for indulging me.)

I first prototyped 'git backfill' as a way to download some of the largest
repositories without being blocked on a full clone. This batched download
mechanism allowed me to essentially have a retryable clone, since the client
could restart the process from scratch and skip any objects that were already on
disk. It was natural to batch based on the path of the blobs in order to
theoretically save time and network bandwidth due to better delta calculations.

While investigating these repositories, I had some data hinting at the total
size of the objects by type. But I was most interested in learning what exactly
was causing this growth. I did have a hint that the "release" branches were
taking up much more space than the default branch, since cloning with
--single-branch resulted in ~55GB of data but then fetching the release branches
led to an additional ~125GB of data. Using "git diff" it was clear that these
branches stored some CHANGELOG.json and CHANGELOG.md files, but the diffs were
relatively small (but multiple such changes were batched together into a single
new commit). I needed to see why exactly these paths were taking up so much
space.

For this, I turned to Jeff Hostetler's new "git survey" command. This told me
information about the total size of trees and blobs and told me the path name
for the individual blobs that were the largest in the repository. These paths
were typically binary files that appeared only once or twice and did not account
for the scale issues. I modified 'git survey' to use the same path-batching
logic as in 'git backfill' to then consider each batch of objects and their size.
Thus, the "path-walk API" was born. (This RFC contains a version that looks like
it was created before 'git backfill'.)

The repository I was looking at had a clear pattern in its top 100 file paths by
on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md files. The .md
files surprised me, since they were always simple appends of the previous .md
file at the same path. The .json files were capped in how many versions were
being listed (at least in recent versions) but the data they stored was harder to
compress. So I went looking into how 'git push' was calculating these delta
bases. Adding some debug information to 'git pack-objects' demonstrated that the
previous file versions were not being matched as delta bases. Instead, other new
blobs in the push were being used as delta bases.

This meant that what should have been a trivial set of deltas bloated to 20-60
MB. (We will see later that it is possible for these to be 100-500 KB.)

Here is where I went on a little bit of a detour. (Come with me, it's important.)
I knew that 'git pack-objects' used a name-hash to group objects by path, so I
assumed that the reason these delta bases were not found was because the
UNINTERESTING objects were not being added to the packing list. I have since
discovered that this is incorrect, but I might have gotten stuck if I hadn't
believed it at the time.

This seemed like a natural reason to extend the path-walk API to allow walking
commits and tags as part of 'git pack-objects' behind a new '--path-walk'
option. The idea here is to compute deltas among the objects that share a common
path and then later go through the (type, name-hash, size) sorting system to
find other delta bases across path boundaries. After a lot of testing, failing,
and testing again, the implementation in this RFC finally works to achieve the
goal. It's not pretty (especially with how it handles tags) but it gets the job
done.

In hindsight, I realized that the UNINTERESTING objects were being considered,
but due to collisions in the name-hash algorithm these objects were being sorted
outside of the delta computation window. For this reason, I thought to create a
new name-hash algorithm. Thus, the --full-name-hash option for 'git pack-objects'
and 'git repack' was born. This feature was split out and sent to the mailing
list independently from this RFC.

RFC GOALS

The goals of this RFC are:

  1. To demonstrate potential applications of the path-walk API and motivate its
    generality before these features are sent as full-quality patch series (in a
    different order than presented here).

  2. To communicate the discoveries found during the --path-walk and
    --full-name-hash features in 'git pack-objects' and 'git repack'. This
    includes comparing and contrasting the effectiveness of these features.

  3. To demonstrate the value of the path-based batching in the 'git survey'
    feature, and to inspire others to think about what other statistics would
    be valuable in that feature. (I anticipate that once a base is established,
    multiple contributors will help expand its functionality long into the
    future.)

RFC OUTLINE

The patches are grouped roughly by the application, in order of discovery:

PART I: 'git backfill'

These patches introduce the 'git backfill' builtin including its '--batch-size'
and '--sparse' options. While this is the first and simplest application, it is
also the lowest priority in terms of user need.

  • path-walk: introduce an object walk by path
  • backfill: add builtin boilerplate
  • backfill: basic functionality and tests
  • backfill: add --batch-size= option
  • backfill: add --sparse option
  • backfill: assume --sparse when sparse-checkout is enabled

PART II: 'git survey'

These patches reimplement a subset of the functionality of 'git survey' and
generalize some of the data structures from Jeff's implementation. The
flexibility hopefully comes through to show the potential for future extensions.
These patches are quite rough and will need more attention before they can be
sent for full review.

  • path-walk: allow consumer to specify object types
  • path-walk: allow visiting tags
  • survey: stub in new experimental git-survey command
  • survey: add command line opts to select references
  • survey: collect the set of requested refs
  • survey: start pretty printing data in table form
  • survey: add object count summary
  • survey: summarize total sizes by object type
  • survey: show progress during object walk
  • survey: add ability to track prioritized lists
  • survey: add report of "largest" paths

PART III: 'git pack-objects --path-walk'

Here is where I think the meat of the RFC really lies. There are still some
rough edges, but the data will show that 'git pack-objects --path-walk' has the
potential to be an extremely effective way to pack objects. (Caveats will come
later in the analysis section.)

  • revision: create mark_trees_uninteresting_dense()
  • path-walk: add prune_all_uninteresting option
  • pack-objects: add --path-walk option
  • pack-objects: extract should_attempt_deltas()
  • pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  • p5313: add size comparison test
  • repack: add --path-walk option
  • pack-objects: enable --path-walk via config
  • scalar: enable path-walk during push via config

One obvious issue with this current implementation is that it inlines much of
the delta calculation into the "Enumerate objects" phase, and thus makes it
single-threaded. This should be fixed before being considered for full review,
but even without threading this version can out-perform the standard packing
strategy in terms of end-to-end packing time!

PART IV: 'git repack --full-name-hash'

This is a simplified version of the patch series that was split out by itself
earlier for full review. This was split out on its own partly because it doesn't
actually use the path-walk API. This has benefits and drawbacks, but it seems
like a quick win for many scenarios.

  • pack-objects: add --full-name-hash option
  • test-name-hash: add helper to compute name-hash functions
  • p5314: add a size test for name-hash collisions
  • pack-objects: output debug info about deltas

This last patch is an add-on of the debugging information that I used to
discover issues with delta bases during 'git push'.

DISCUSSION OF NAME HASH

One thing to talk about before digging into --path-walk and --full-name-hash
features is the existing name-hash algorithm. This hash algorithm creates a
uint32_t based on the final 16 characters of the path name, weighing the last
characters more. There are multiple benefits to this:

  1. Files of common types (.c, .txt, ...) may be grouped together.

  2. Files that are renamed across directories may be grouped together.

(Thanks, Junio, for making this second benefit clear.)
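
For reference, the existing algorithm is (approximately) the following function
from pack-objects.h, lightly annotated here:

    static uint32_t pack_name_hash(const char *name)
    {
            uint32_t c, hash = 0;

            if (!name)
                    return 0;

            /*
             * Each new character shifts the earlier contributions right by
             * two bits, so only roughly the last 16 non-whitespace characters
             * matter, with the final characters weighted most heavily. Paths
             * that end the same way (".c", "CHANGELOG.md", ...) therefore
             * land near each other in the sorted packing list.
             */
            while ((c = *name++) != 0) {
                    if (isspace(c))
                            continue;
                    hash = (hash >> 2) + (c << 24);
            }
            return hash;
    }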

The issue here is that some common patterns arise in repositories that use
common path names across directories, and those files are creating name-hash
collisions and making the sort less effective. One thing that can counteract
these collisions is to increase the --window setting, but this significantly
slows the delta computations.

Thus, the --path-walk and --full-name-hash features both attempt to combat these
name-hash collisions in very different ways. The --path-walk mechanism uses the
path-walk API to consider batches of objects that all share the same path. This
avoids storing every possible path name in memory while doing the object walk,
but still gives nice boundaries for delta compression possibilities. After the
path-walk is complete, the full packing list is still sorted via name-hash and
this allows for cross-path deltas. This is critical!

The --full-name-hash feature makes the simpler choice of replacing the name-hash
method with one that has fewer collisions, but loses the benefits of "nearby"
paths having close hash values.

This naturally leads to these two main differences in the two approaches:

  1. The --full-name-hash feature is not good for 'git push' or similarly small
    pack-files. Since it limits the delta chains to objects with the same full
    path and loses the benefit of "nearby" paths, this feature should be used
    for larger repacks. In my testing, 'git push' simulations almost always have
    poor packing but 'git repack -adf' simulations have packing rivaling the
    --path-walk option.

  2. The --path-walk option changes the object ordering significantly, meaning it
    may not ever be appropriate to combine with advanced repacking features such
    as delta islands or even reachability bitmaps. While my testing has shown
    that the --path-walk repacks are the most efficient of all options, this
    limitation makes me hesitate to recommend it more widely than for client
    repositories.

One natural question is whether to store both the name-hash and the
full-name-hash and do two delta passes, each sorting the objects by a different
hash function. The first issue is that we don't want to store two
hash values per object, as that will significantly increase the memory pressure
during repacking. This could be side-stepped by storing the full-name-hash in
the packing list and then a second mapping from full-name-hash to name-hash.
However, that still leads to the two passes taking extra time. The --path-walk
approach is faster than even a single pass. And in the right scenarios, the
--full-name-hash option is very close to the --path-walk results.

ANALYSIS OF PACKING STRATEGIES

Since I was focused on an internal monorepo that stored a large collection of
JavaScript packages, it should be no surprise that I eventually found other
JavaScript repositories that used similar tooling and thus had similar issues
with unexplained scale problems. I'll use these four repositories as examples
repeatedly, but one is actually public: microsoft/fluentui [6]. I'll use Repo B,
C, and D for the others, in increasing size.

[6] https://github.com/microsoft/fluentui

In each of these repositories, doing a full repack ('git repack -adf') right
after cloning presents a sizeable reduction in space. This is expected for
servers not being optimized for exactly the reachable set I'm cloning. So, I'll
focus on how the default repacking parameters compare to using the
--full-name-hash or --path-walk options:

| Repo     | Standard Repack | With --full-name-hash | With --path-walk |
|----------|-----------------|-----------------------|------------------|
| fluentui |         438 MB  |               168 MB  |          148 MB  |
| Repo B   |       6,255 MB  |               829 MB  |          778 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |        6,158 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |        4,432 MB  |

Hopefully these reductions show how much these name-hash collisions are causing
issues in these repositories.

For the fluentui repo, I'm also able to share highlights from the 'git survey'
output immediately after cloning and after repacking with the --path-walk option.

First, we can consider the total reachable objects in each scenario:

Before repacking (immediately after clone):

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10443669 |      15092790
      Trees | 276503 |  40212070 |     244429615
      Blobs | 294500 | 635365661 |   10791187920

After repacking with --path-walk:

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10450605 |      15092790
      Trees | 276503 |  31136263 |     244429615
      Blobs | 294500 |  94442401 |   10791187920

Now, we can consider the top 10 file paths before and after repacking with the
--path-walk feature:

Before repacking:

TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |  58060531 |     889120488
 ...-experiments/CHANGELOG.json |   505 |  28439452 |     252723999
 ...fabric-react/CHANGELOG.json |  1270 |  25556510 |     902756623
 ...t-components/CHANGELOG.json |   176 |  20366365 |     244936649
 ...act-charting/CHANGELOG.json |   590 |  20106422 |     208224460
 ...e-components/CHANGELOG.json |   559 |  15761271 |     189061764
 ...act-examples/CHANGELOG.json |   577 |  13615569 |     234949961
 ...react-charting/CHANGELOG.md |   564 |  11205840 |     104337986
 .../experiments/CHANGELOG.json |   569 |  10596377 |     123662770
 ...i-fabric-react/CHANGELOG.md |  1263 |   8154248 |     261494258

After repacking with --path-walk:

TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |   9909326 |     889120488
 ...iceBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
 ...out/src/images/download.jpg |     1 |   1845249 |       1846117
 ...fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
          .yarn/releases/cli.js |     1 |   1335657 |       6741614
 ...c/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
 ...ages/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
 ...nents/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
 ...ig/rush/npm-shrinkwrap.json |   138 |   1058531 |      89326567
 ...Accessibility_29Sep2016.pdf |     1 |    856621 |        927268

As we can see from this example, before the repack the disk space is dominated
by objects that appear at paths named CHANGELOG.json or CHANGELOG.md, which
frequently collide under the default name-hash. After the repack with the
--path-walk feature, the largest paths are what we expect: checked-in binaries,
frequently-edited yarn files, and other hard-to-compress data.

The nice thing about this result is that we can point to files that are taking
up space because there is no way around it, rather than because the naming
convention for the files is causing confusion during packing.

WHAT TO DO NOW?

Thank you for reading this far. I hope that this has added context on the patch
series for the --full-name-hash option, but also provided the right context for
when I come back in a week or so with a review-ready version of the --path-walk
option.

These patches are rough and I want to make sure everyone knows that. Reviewing
them for style or even basic organization may lead to some wasted time. I know
that at least one of the patches got mangled and its diff does not match its
description (I'm thinking specifically about "pack-objects: extract
should_attempt_deltas()" but this could apply elsewhere).

But I think that interested parties could take my branch, build it, and give
these features a try on their favorite repos. I'd love to hear feedback on the
usability and effectiveness, especially if someone finds a case where the
--path-walk option is less effective at packing data.

There are a few things going on since I started working on this series that could
cause issues with applying this series on top of current 'master' or with topics
in 'seen':

  • The most obvious one is the collision with the --full-name-hash feature under
    review. The in-review series takes precedence.

  • The unused parameter warnings are now errors! I'm glad that got in, but I
    wasn't careful about it in these RFC patches.

  • John Cai is proposing to change the builtin interface to have builtin methods
    take a repository pointer. I approve of this effort, but it collides with the
    two new builtins introduced here. As discussed above, the 'git survey' and
    'git backfill' builtins are the lowest priority items for getting merged, in
    my opinion, so they can wait until that change has been made.

I look forward to any and all comments.

Thanks! -Stolee

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: Christian Couder [email protected]

derrickstolee and others added 30 commits September 8, 2024 14:04
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

RFC TODO: It would be helpful to create a test-tool that allows printing of
each batch for strong testing.

Signed-off-by: Derrick Stolee <[email protected]>
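
To make the shape of this more concrete, here is a rough sketch of how a
consumer could look. All names and signatures below are hypothetical and chosen
only to mirror the description above; they are not necessarily what this patch
implements:

    #include "git-compat-util.h"
    #include "object.h"
    #include "oid-array.h"
    #include "revision.h"

    /* Called once per discovered path with every object found at that path. */
    typedef int (*path_fn)(const char *path, struct oid_array *oids,
                           enum object_type type, void *data);

    struct path_walk_info {
            struct rev_info *revs;  /* commits that seed the walk */
            path_fn path_fn;        /* callback for each (path, objects) batch */
            void *path_fn_data;     /* opaque pointer handed to the callback */
    };

    /* Example consumer: count how many blobs are reachable, over all paths. */
    static int count_blobs(const char *path, struct oid_array *oids,
                           enum object_type type, void *data)
    {
            size_t *count = data;

            if (type == OBJ_BLOB)
                    *count += oids->nr;
            return 0; /* a non-zero return would abort the walk */
    }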
In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

RFC TODO: When preparing this for a full implementation, make sure it is
based on the newest standards introduced by [1].

[1] https://lore.kernel.org/git/[email protected]/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7
    [PATCH 1/3] builtin: add a repository parameter for builtin functions

Signed-off-by: Derrick Stolee <[email protected]>
The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future.

Signed-off-by: Derrick Stolee <[email protected]>
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensive as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and non-cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.

Signed-off-by: Derrick Stolee <[email protected]>
The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <[email protected]>
This adds the ability to ask for the commits as a single list. This also
reduces the callback in 'git backfill' to a BUG() statement if it is called
with anything other than blobs.

Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.

Signed-off-by: Derrick Stolee <[email protected]>
Start work on a new `git survey` command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the Go-based `git-sizer` (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from Go to gain further insight into potential scaling
problems.

RFC TODO: Adapt this boilerplate to match the upcoming changes to builtin
methods that include a 'struct repository' pointer.

Co-authored-by: Derrick Stolee <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line opts let the use ask for all refs or a subset of them
and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
Collect the set of requested branches, tags, etc. into a ref_array and
collect the set of requested patterns into a strvec.

RFC TODO: This patch has some changes that should be in the previous patch,
to make the diff look a lot better.

Co-authored-by: Derrick Stolee <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concrete data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.

The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <[email protected]>
At the moment, the reason for using the path-walk API is not obvious, but it
will become more apparent in future iterations. For now, use the path-walk API
to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |      0
    Commits | 178573
      Trees | 312745
      Blobs | 183035

(Note: the "Tags" are zero right now because the path-walk API has not been
integrated to walk tags yet. This will be fixed in a later change.)

RFC TODO: make sure tags are walked before this change.

Signed-off-by: Derrick Stolee <[email protected]>
Now that we have explored objects by count, we can expand that a bit more to
summarize the data for the on-disk and inflated size of those objects. This
information is helpful for diagnosing not only why disk space (and perhaps
clone or fetch times) is growing, but also why certain operations are slow
when the inflated size of the abstract objects that must be processed is
so large.

Signed-off-by: Derrick Stolee <[email protected]>
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.

Co-authored-by: Jeff Hostetler <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
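
Here is a minimal sketch of the idea, with hypothetical names rather than the
structures used in this series: keep a small array sorted by the metric and
only insert an entry when it beats the current minimum.

    #include "git-compat-util.h"

    struct top_entry {
            char *path;
            uint64_t metric;
    };

    struct top_table {
            struct top_entry entries[100]; /* constant capacity */
            size_t nr;
    };

    static void top_table_insert(struct top_table *t, const char *path,
                                 uint64_t metric)
    {
            size_t i = t->nr;

            if (t->nr == ARRAY_SIZE(t->entries)) {
                    /* Once full, ignore entries that do not beat the minimum. */
                    if (metric <= t->entries[t->nr - 1].metric)
                            return;
                    free(t->entries[--i].path); /* evict the smallest entry */
            } else {
                    t->nr++;
            }

            /* Shift smaller entries down to keep the table sorted descending. */
            while (i > 0 && t->entries[i - 1].metric < metric) {
                    t->entries[i] = t->entries[i - 1];
                    i--;
            }
            t->entries[i].path = xstrdup(path);
            t->entries[i].metric = metric;
    }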
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the total
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.

Since the on-disk size is likely to be fragile, stop testing the exact
output of 'git survey' and check that the correct set of headers is
output.

Signed-off-by: Derrick Stolee <[email protected]>
The sparse tree walk algorithm was created in d5d2e93 (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.

Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.

A use of this method will be added in a later change, with a condition
set whether the sparse or dense approach should be used.

Signed-off-by: Derrick Stolee <[email protected]>
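
A minimal sketch of what that could look like, living next to the sparse
version in revision.c; the details here are an assumption based on the
description above, not necessarily the code in this patch:

    void mark_trees_uninteresting_dense(struct repository *r,
                                        struct oidset *trees)
    {
            struct object_id *oid;
            struct oidset_iter iter;

            oidset_iter_init(trees, &iter);
            while ((oid = oidset_iter_next(&iter))) {
                    struct tree *tree = lookup_tree(r, oid);

                    /*
                     * Recurse into every UNINTERESTING tree, even when the
                     * tree itself already carries the flag: walk its entries
                     * instead of short-circuiting the way
                     * mark_tree_uninteresting() does.
                     */
                    if (tree && (tree->object.flags & UNINTERESTING))
                            mark_tree_contents_uninteresting(r, tree);
            }
    }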
This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
list-objects.c.

Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERSTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do no continue walking the
children of that path.

We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.

Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

RFC TODO: It is important to note that this option is inherently
incompatible with using a bitmap index. This walk probably also does not
work with other advanced features, such as delta islands.

Getting ahead of myself, this option compares well with --full-name-hash
when the packfile is large enough, but also performs at least as well as
the default in all cases that I've seen.

RFC TODO: this should probably be recording the batch locations to another
list so they could be processed in a second phase using threads.

RFC TODO: list some examples of how this outperforms previous pack-objects
strategies. (This is coming in later commits that include performance
test changes.)

Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the --path-walk implementation, especially in
conjunction with tests such as:

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

RFC TODO: list other helpful test cases, as well as the ones where the
behavior breaks if this is enabled...

Signed-off-by: Derrick Stolee <[email protected]>
To test the benefits of the new --path-walk option in 'git
pack-objects', create a performance test that times the process but also
compares the size of the output.

For the microsoft/fluentui repo [1] at a particular commit [2],
this has reproducible results of a similar scale:

Test                                            this tree
---------------------------------------------------------------
5313.2: thin pack                               0.39(0.48+0.03)
5313.3: thin pack size                                     1.2M
5313.4: thin pack with --path-walk              0.09(0.07+0.01)
5313.5: thin pack size with --path-walk                   20.8K
5313.6: big recent pack                         2.13(8.29+0.26)
5313.7: big recent pack size                              17.7M
5313.8: big recent pack with --path-walk        3.18(4.21+0.22)
5313.9: big recent pack size with --path-walk             15.0M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

RFC TODO: Note that the path-walk version is slower for the big case,
but the delta calculation is single-threaded with the current
implementation! It's still faster for the small case that mimics a
typical push.

Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

For the microsoft/fluentui repo [1], the results are very interesting:

Test                                            this tree
-------------------------------------------------------------------
5313.10: full repack                            97.91(663.47+2.83)
5313.11: full repack size                                449.1K
5313.12: full repack with --path-walk           105.42(120.49+0.95)
5313.13: full repack size with --path-walk               159.1K

[1] https://github.com/microsoft/fluentui

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. The savings come from not spending time computing deltas between
files that have significant differences from each other.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks and so would be a poor fit for Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <[email protected]>
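
As a rough illustration of the repository-settings plumbing described above
(the helper name and settings field are hypothetical; the config keys come
from this description):

    #include "git-compat-util.h"
    #include "repository.h"
    #include "config.h"

    static void resolve_pack_use_path_walk(struct repository *r)
    {
            int experimental = 0, manyfiles = 0, value;

            repo_config_get_bool(r, "feature.experimental", &experimental);
            repo_config_get_bool(r, "feature.manyfiles", &manyfiles);

            /* The feature.* umbrella settings flip the default on... */
            if (experimental || manyfiles)
                    r->settings.pack_use_path_walk = 1; /* hypothetical field */

            /* ...but an explicit pack.usePathWalk setting always wins. */
            if (!repo_config_get_bool(r, "pack.usepathwalk", &value))
                    r->settings.pack_use_path_walk = value;
    }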
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <[email protected]>
RFC NOTE: this is essentially the same as the patch introduced
independently of the RFC, but now is on top of the --path-walk option
instead. This is included in the RFC for comparison purposes.

RFC NOTE: As you can see from the details below, the --full-name-hash
option essentially attempts to do similar things as the --path-walk
option, but sometimes misses the mark. Collisions still happen with the
--full-name-hash option, leading to some misses. However, in cases where
the default name-hash algorithm has low collision rates and deltas are
actually desired across objects with similar names but different full
names, the --path-walk option can still take advantage of the default
name hash approach.

Here are the new performance details simulating a single push in an
internal monorepo using a lot of paths that collide in the default name
hash. We can see that --full-name-hash gets close to the --path-walk
option's size.

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              2.43(2.92+0.14)
5313.3: thin pack size                                    4.5M
5313.4: thin pack with --full-name-hash        0.31(0.49+0.12)
5313.5: thin pack size with --full-name-hash             15.5K
5313.6: thin pack with --path-walk             0.35(0.31+0.04)
5313.7: thin pack size with --path-walk                  14.2K

However, when simulating pushes on repositories that do not have issues
with name-hash collisions, the --full-name-hash option presents a
potential of worse delta calculations, such as this example using my
local Git repository:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              0.03(0.01+0.01)
5313.3: thin pack size                                     475
5313.4: thin pack with --full-name-hash        0.02(0.01+0.01)
5313.5: thin pack size with --full-name-hash             14.8K
5313.6: thin pack with --path-walk             0.02(0.01+0.01)
5313.7: thin pack size with --path-walk                    475

Note that the path-walk option found the same delta bases as the default
options in this case.

In the full repack case, the --full-name-hash option may be preferable
because it interacts well with other advanced features, such as using
bitmap indexes and tracking delta islands.

Signed-off-by: Derrick Stolee <[email protected]>
Using this tool, we can count how many distinct name-hash values exist
within a list of paths. Examples include

 git ls-tree -r --name-only HEAD | \
     test-tool name-hash | \
     awk "{print \$1;}" | \
     sort -ns | uniq | wc -l

which outputs the number of distinct name-hash values that appear at
HEAD. Or, the following which presents the resulting name-hash values of
maximum multiplicity:

 git ls-tree -r --name-only HEAD | \
     test-tool name-hash | \
     awk "{print \$1;}" | \
     sort -n | uniq -c | sort -nr | head -n 25

For an internal monorepo with around a quarter million paths at HEAD,
the highest multiplicity for the standard name-hash function was 14,424
while the full name-hash algorithm had only seven hash values with any
collision, with a maximum multiplicity of two.

Signed-off-by: Derrick Stolee <[email protected]>
This test helps inform someone as to the behavior of the name-hash
algorithms for their repo based on the paths at HEAD.

For example, the microsoft/fluentui repo had these statistics at the time of
committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

That demonstrates that the nearly twenty thousand path names are assigned
only around eight thousand distinct values. Up to 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
for the full-name-hash algorithm was low, and the maximum number of paths
grouped into a single bucket was two.

Signed-off-by: Derrick Stolee <[email protected]>
In order to debug what is going on during delta calculations, add a
--debug-file=<file> option to 'git pack-objects'. This leads to sending
a JSON-formatted description of the delta information to that file.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee self-assigned this Sep 9, 2024

@derrickstolee (Author) commented:

/submit


gitgitgadget bot commented Sep 10, 2024

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1786/derrickstolee/path-walk-rfc-v1

To fetch this version to local tag pr-1786/derrickstolee/path-walk-rfc-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1786/derrickstolee/path-walk-rfc-v1


gitgitgadget bot commented Sep 11, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> One obvious issue with this current implementation is that it inlines much
> of the delta calculation into the "Enumerate objects" phase, and thus makes
> it single-threaded.

Naïvely, traversal of history partitioned by paths smells
embarrassingly parallelizable; it may need some post processing to
make sure that the same object only appears once, though, and the
devil probably is in the details ;-).

Thanks for an enjoyable cover letter that pulls readers in.


gitgitgadget bot commented Sep 17, 2024

On the Git mailing list, Christian Couder wrote (reply to this):

On Tue, Sep 10, 2024 at 4:29 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> This RFC is ultimately about introducing a new way to walk objects, called
> the "path-walk API" in the new path-walk.[ch] files. Before digging into the
> details of the API, let's discuss the applications which will hint at the
> API's design.
>
>
> APPLICATIONS OF THE PATH-WALK API
> =================================
>
> The applications of this API were discovered in the following order, though
> I recommend reversing the order for the priority of their actual
> implementation in future patch series:
>
>  * git backfill: a builtin to download missing blobs in a blobless partial
>    clone, done in batches and grouped by the path they appear in to maximize
>    delta compression in each batch. Allows focusing on the paths of the
>    sparse-checkout to only get the blobs necessary for history queries in
>    the current focus.

It's not very clear if this would be useful when doing a `git
sparse-checkout add`, or a `git blame` on a file path not covered by
the current sparse-checkout, or both. I think it would be clearer if
there were a few examples.

>  * git survey: Jeff Hostetler built this feature [1] as a way to get
>    functionality similar to git-sizer [2], but using the internals of Git to
>    do it faster. It also displays information not available to git-sizer,
>    like the on-disk size of objects. This RFC presents a simplified version
>    of the builtin focused on listing the paths that contribute the most to
>    the on-disk size of trees and blobs.

Not sure how `git survey` works, but `git sizer` works on a whole
repo, so, if they work in the same way, I am not sure I see what a new
path oriented way to walk repos would bring to the tool.

>  * git pack-objects --path-walk: In order to find a way to compute deltas
>    among objects of the same path, I applied the path-walk API to 'git
>    pack-objects' behind an optional flag. There are overlaps with the
>    '--sparse' option [3], [4] that can be used here. This provides perfect
>    partitioning by path name, without any possible collisions from the
>    name-hash algorithm. It also allows using the name-hash values to find
>    cross-path delta chains in a second pass.

Do you mean that the new way to walk repos allows pack-object to
perform better in the general case or maybe only in the case where
partial clone without blobs and sparse-checkout are used as described
in the git backfill related point above?

>  * git repack --full-name-hash: If we are worried about name-hsah

s/name-hsah/name-hash/

>    collisions, an easier thing to implement is a different name-hash
>    algorithm that is less likely to have collisions.

Maybe it could help to remind people a bit (perhaps in a note or using
a link) what name-hash collisions are and why we could be worried
about them.

Actually there is a "DISCUSSION OF NAME HASH" section below with
explanations, so maybe a note could just tell about that section.

> This feature was
>    already sent to the mailing list as a fully-reviewable series [5]. It is
>    included here because this series allows testing the --path-walk option
>    against the --full-name-hash.

Do you mean that the new way to walk repos can be used along with `git
repack --full-name-hash` (maybe with `git repack --full-name-hash
--path-walk` or a config option or otherwise?) and that it brings some
performance or other kind (better packs or which ones?) of
improvements?

> [1] https://github.com/microsoft/git/pull/667 [2]
> https://github.com/github/git-sizer [3]
> https://github.com/git/git/compare/5d826e972970a784bd7a7bdf587512510097b8c7...99dbbfa8ddbba2b620965d026d4ec199b8837a6f
> [4]
> https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/
> [5]
> https://lore.kernel.org/git/[email protected]

It looks like adding line breaks could help make the above link list
more readable.

> TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER
> ======================================================
>
> Here's the story about how these applications came about: I was tasked with
> understanding why certain internal repositories were growing larger than
> expected. (Feel free to skip. Otherwise, thank you for indulging me.)
>
> I first prototyped 'git backfill' as a way to download some of the largest
> repositories without being blocked on a full clone. This batched download
> mechanism allowed me to essentially have a retryable clone, since the client
> could restart the process from scratch and skip any objects that were
> already on disk. It was natural to batch based on the path of the blobs in
> order to theoretically save time and network bandwidth due to better delta
> calculations.
>
> While investigating these repositories, I had some data hinting at the total
> size of the objects by type.

By type (blob, tree, commit, tag?) or by path? Why would the size of
the objects by type be interesting? Could trees delta poorly?

> But I was most interested in learning what
> exactly was causing this growth. I did have a hint that the "release"
> branches were taking up much more space than the default branch, since
> cloning with --single-branch resulted in ~55GB of data but then fetching the
> release branches led to an additional ~125GB of data. Using "git diff" it
> was clear that these branches stored some CHANGELOG.json and CHANGELOG.md
> files, but the diffs were relatively small (but multiple such changes were
> batched together into a single new commit). I needed to see why exactly
> these paths were taking up so much space.
>
> To this, I turned to Jeff Hostetler's new "git survey" command. This told me
> information about the total size of trees and blobs and told me the path
> name for the individual blobs that were the largest in the repository.

Nice.

> These
> paths were typically binary files that appeared only once or twice and did
> not account for the scale issues.

You mean they couldn't account for a 55GB to 125GB change in data size?

> I modified 'git survey' to use the same
> path-batching logic as in 'git backfill' to then consider each batch of
> objects and their size. Thus, the "path-walk API" was born. (This RFC
> contains a version that looks like it was created before 'git backfill'.)
>
> The repository I was looking at had a clear pattern in its top 100 file
> paths by on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md
> files. The .md files surprised me, since they were always simple appends of
> the previous .md file at the same path. The .json files were capped in how
> many versions were being listed (at least in recent versions) but the data
> it stored was harder to compress. So I went looking into how 'git push' was
> calculating these delta bases. Adding some debug information to 'git
> pack-objects' demonstrated that the previous file versions were not being
> matched as delta bases. Instead, other new blobs in the push were being used
> as delta bases.

Interesting.

> This meant that what should have been a trivial set of deltas bloated to
> 20-60 MB. (We will see later that it is possible for these to be 100-500
> KB.)
>
> Here is where I went on a little bit of a detour. (Come with me, it's
> important.) I knew that 'git pack-objects' used a name-hash to group objects
> by path, so I assumed that the reason these delta bases were not found was
> because the UNINTERESTING objects were not being added to the packing list.
> I have since discovered that this is incorrect, but I might have gotten
> stuck if I didn't think this.
>
> This seemed like a natural reason to extend the path-walk API to allow
> walking commits and tags as part of 'git pack-objects' behind a new
> '--path-walk' option. The idea here is to compute deltas among the objects
> that share a common path and then later go through the (type, name-hash,
> size) sorting system to find other delta bases across path boundaries. After
> a lot of testing, failing, and testing again, the implementation in this RFC
> finally works to achieve the goal. It's not pretty (especially with how it
> handles tags) but it gets the job done.

Yeah, if it can significantly improve the generated packfiles (without
drawbacks), it looks like a great improvement.

> In hindsight, I realized that the UNINTERESTING objects were being
> considered, but due to collisions in the name-hash algorithm these objects
> were being sorted outside of the delta computation window. For this reason,
> I thought to create a new name-hash algorithm. Thus, the --full-name-hash
> option for 'git pack-objects' and 'git repack' was born. This feature was
> split out and sent to the mailing list independently from this RFC.

So the question is "Is the problem fully solved by the new name-hash
algorithm or are there still benefits to using a new way to walk
repos?"

> RFC GOALS
> =========
>
> The goals of this RFC are:
>
>  1. To demonstrate potential applications of the path-walk API to motivate
>     its generality

Yeah, the "Path-walk API and applications" title summarizes this well.

> as these features are sent in full-quality patch series,
>     but in a different order.

I wonder if the patch series could have been separated and not all
sent in a 30 patch long series.

>  2. To communicate the discoveries found during the --path-walk and
>     --full-name-hash features in 'git pack-objects' and 'git repack'. This
>     includes comparing and contrasting the effectiveness of these features.

Thanks for communicating that.

>  3. To demonstrate the value of the path-based batching in the 'git survey'
>     feature, and to inspire others to think about what other statistics
>     would be valuable in that feature. (I anticipate that once a base is
>     established, multiple contributors will help expand its functionality
>     long into the future.)

Yeah, a better git sizer would be valuable for GitLab too and probably
everyone hosting a significant number of repos.

> RFC OUTLINE
> ===========
>
> The patches are grouped roughly by the application, in order of discovery:
>
>
> PART I: 'git backfill'
> ======================
>
> These patches introduce the 'git backfill' builtin including its
> '--batch-size' and '--sparse' options. While this is the first and simplest
> application, it is also the lowest priority in terms of user need.
>
>  * path-walk: introduce an object walk by path
>  * backfill: add builtin boilerplate
>  * backfill: basic functionality and tests
>  * backfill: add --batch-size= option
>  * backfill: add --sparse option
>  * backfill: assume --sparse when sparse-checkout is enabled

I would prefer a first patch series with all the above and the first
patch creating a technical doc called maybe
Documentation/technical/path-walk.txt which could contain a lot of
information from this RFC and perhaps technical details of how the
path-walk works and how it is different from a regular walk.

> DISCUSSION OF NAME HASH
> =======================
>
> One thing to talk about before digging into --path-walk and --full-name-hash
> features is the existing name-hash algorithm. This hash algorithm creates a
> uint32_t based on the final 16 characters of the path name, weighing the
> last characters more. There are multiple benefits to this:
>
>  1. Files of common types (.c, .txt, ...) may be grouped together.
>
>  2. Files that are renamed across directories may be grouped together.
>
> (Thanks, Junio, for making this second benefit clear.)
>
> The issue here is that some common patterns arise in repositories that use
> common path names across directories, and those files are creating name-hash
> collisions and making the sort less effective. One thing that can counteract
> these collisions is to increase the --window setting, but this significantly
> slows the delta computations.
>
> Thus, the --path-walk and --full-name-hash features both attempt to combat
> these name-hash collisions in very different ways. The --path-walk mechanism
> uses the path-walk API to consider batches of objects that all share the
> same path. This avoids storing every possible path name in memory while
> doing the object walk, but still gives nice boundaries for delta compression
> possibilities. After the path-walk is complete, the full packing list is
> still sorted via name-hash and this allows for cross-path deltas. This is
> critical!
>
> The --full-name-hash feature makes the simpler choice of replacing the
> name-hash method with one that has fewer collisions, but loses the benefits
> of "nearby" paths having close hash values.

The benefits of "nearby" paths are only available with --path-walk,
not in the current way object packing works.
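
For readers who haven't seen it, the existing algorithm is roughly the
following. This is a from-memory sketch of pack_name_hash() in
pack-objects.h, so details may differ from the in-tree version, but it
shows why only the tail of a path matters:

#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of the existing name-hash: each new character pushes the
 * previous bits down, so only roughly the last 16 non-whitespace
 * characters survive, and the final characters carry the most weight.
 * Paths sharing a long suffix (e.g. the many ".../CHANGELOG.json"
 * files mentioned below) end up with identical or nearly identical
 * hash values, which is exactly the collision problem described above.
 */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;

	while ((c = *name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}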

> This naturally leads to these two main differences in the two approaches:
>
>  1. The --full-name-hash feature is not good for 'git push' or similarly
>     small pack-files. Since it limits the delta chains to objects with the
>     same full path and loses the benefit of "nearby" paths, this feature
>     should be used for larger repacks. In my testing, 'git push' simulations
>     almost always have poor packing but 'git repack -adf' simulations have
>     packing rivaling the --path-walk option.

Interesting.

>  2. The --path-walk option changes the object ordering significantly,
>     meaning it may not ever be appropriate to combine with advanced
>     repacking features such as delta islands or even reachability bitmaps.
>     While my testing has shown that the --path-walk repacks are the most
>     efficient of all options, this limitation makes me hesitate to recommend
>     it beyond client repositories.

Might still be interesting to have it for client repos.

> One natural question to consider is to think about storing both the
> name-hash and the full-name-hash and doing two delta passes, each one
> sorting the objects by a different hash function. The first issue is that we
> don't want to store two hash values per object, as that will significantly
> increase the memory pressure during repacking.

Is one more uint32_t per object really increasing the memory pressure
so much? Isn't there a way to pack bits together to avoid that?
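
(For a rough sense of scale, using the fluentui counts from the survey
output below: ~592K reachable objects × 4 bytes is only about 2.3 MB,
though the cost does grow linearly with the number of objects being
packed.)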

> This could be side-stepped by
> storing the full-name-hash in the packing list and then a second mapping
> from full-name-hash to name-hash. However, that still leads to the two
> passes taking extra time.

Isn't there a way to sort using both hash functions in a single pass?
(That is to perform a single pass and when considering an object sort
it using both hash functions before considering a different object.)

> The --path-walk approach is faster than even a
> single pass.

Is there a reason for that?

> And in the right scenarios, the --full-name-hash option is very
> close to the --path-walk results.
>
>
> ANALYSIS OF PACKING STRATEGIES
> ==============================
>
> Since I was focused on an internal monorepo that stored a large collection
> of Javascript packages, it should be no surprise that I eventually found
> other Javascript repositories that used similar tooling and thus had similar
> issues with unexplained scale problems. I'll use these four repositories as
> examples repeatedly, but one is actually public: microsoft/fluentui [6].
> I'll use Repo B, C, and D for the others, in increasing size.
>
> [6] https://github.com/microsoft/fluentui
>
> In each of these repositories, doing a full repack ('git repack -adf') right
> after cloning presents a sizeable reduction in space. This is expected for
> servers not being optimized for exactly the reachable set I'm cloning. So,
> I'll focus on how much the default repacking parameters compare to using the
> --full-name-hash or --path-walk options:
>
> | Repo     | Standard Repack | With --full-name-hash | With --path-walk |
> |----------|-----------------|-----------------------|------------------|
> | fluentui |         438 MB  |               168 MB  |          148 MB  |
> | Repo B   |       6,255 MB  |               829 MB  |          778 MB  |
> | Repo C   |      37,737 MB  |             7,125 MB  |        6,158 MB  |
> | Repo D   |     130,049 MB  |             6,190 MB  |        4,432 MB  |
>
>
> Hopefully these reductions show how much these name-hash collisions are
> causing issues in these repositories.

Yeah, right. It might be interesting to see the size reduction in % too.
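
(Computed from the table above, the reductions relative to the standard
repack come out to roughly: fluentui 62% / 66%, Repo B 87% / 88%,
Repo C 81% / 84%, and Repo D 95% / 97%, for --full-name-hash /
--path-walk respectively.)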

> For the fluentui repo, I'm also able to share highlights from the 'git
> survey' output immediately after cloning and after repacking with the
> --path-walk option.
>
> First, we can consider the total reachable objects in each scenario:
>
> TOTAL OBJECT SIZES BY TYPE
> ================================================
> Object Type |  Count | Disk Size | Inflated Size
> ------------+--------+-----------+--------------
>     Commits |  20579 |  10443669 |      15092790
>       Trees | 276503 |  40212070 |     244429615
>       Blobs | 294500 | 635365661 |   10791187920
>
> TOTAL OBJECT SIZES BY TYPE
> ================================================
> Object Type |  Count | Disk Size | Inflated Size
> ------------+--------+-----------+--------------
>     Commits |  20579 |  10450605 |      15092790
>       Trees | 276503 |  31136263 |     244429615
>       Blobs | 294500 |  94442401 |   10791187920

Using % of size reduction or increase might help make it a bit clearer here too.
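
(From the numbers above: commit disk size grows very slightly (about
+0.07%), tree disk size shrinks by about 23%, and blob disk size shrinks
by about 85%, while the inflated sizes are identical, as expected.)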

> Now, we can consider the top 10 file paths before and after repacking with
> the --path-walk feature:
>
> TOP FILES BY DISK SIZE
> ===================================================================
>                            Path | Count | Disk Size | Inflated Size
> --------------------------------+-------+-----------+--------------
>                       yarn.lock |   802 |  58060531 |     889120488
>  ...-experiments/CHANGELOG.json |   505 |  28439452 |     252723999
>  ...fabric-react/CHANGELOG.json |  1270 |  25556510 |     902756623
>  ...t-components/CHANGELOG.json |   176 |  20366365 |     244936649
>  ...act-charting/CHANGELOG.json |   590 |  20106422 |     208224460
>  ...e-components/CHANGELOG.json |   559 |  15761271 |     189061764
>  ...act-examples/CHANGELOG.json |   577 |  13615569 |     234949961
>  ...react-charting/CHANGELOG.md |   564 |  11205840 |     104337986
>  .../experiments/CHANGELOG.json |   569 |  10596377 |     123662770
>  ...i-fabric-react/CHANGELOG.md |  1263 |   8154248 |     261494258
>
>
> TOP FILES BY DISK SIZE
> ===================================================================
>                            Path | Count | Disk Size | Inflated Size
> --------------------------------+-------+-----------+--------------
>                       yarn.lock |   802 |   9909326 |     889120488
>  ...iceBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
>  ...out/src/images/download.jpg |     1 |   1845249 |       1846117
>  ...fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
>           .yarn/releases/cli.js |     1 |   1335657 |       6741614
>  ...c/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
>  ...ages/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
>  ...nents/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
>  ...ig/rush/npm-shrinkwrap.json |   138 |   1058531 |      89326567
>  ...Accessibility_29Sep2016.pdf |     1 |    856621 |        927268
>
>
> As we can see from this example, before the repack we are mainly seeing the
> disk space be dominated by objects that appear at paths with CHANGELOG.json
> or CHANGELOG.md, which are frequently hitting collisions with the default
> name-hash. After the repack with the --path-walk feature, we see the largest
> paths are what we expect: checked-in binaries, frequently-edited yarn files,
> and other hard-to-compress data.
>
> The nice thing about this result is that we can point to files that are
> taking up space because there is no other way around it, not that the naming
> convention for the files is causing confusion during packing.

Yeah, nice result. I guess the result is the same or very similar when
the improved name-hash algorithm is used.

> WHAT TO DO NOW?
> ===============
>
> Thank you for reading this far. I hope that this has added context on the
> patch series for the --full-name-hash option, but also provided the right
> context for when I come back in a week or so with a review-ready version of
> the --path-walk option.
>
> These patches are rough and I want to make sure everyone knows that.
> Reviewing them for style or even basic organization may lead to some wasted
> time. I know that at least one of the patches got mangled and its diff does
> not match its description (I'm thinking specifically about "pack-objects:
> extract should_attempt_deltas()" but this could apply elsewhere).

Ok, I will wait for the next iteration before reviewing the patches then.

> But I think that interested parties could take my branch, build it, and give
> these features a try on their favorite repos. I'd love to hear feedback on
> the usability and effectiveness, especially if someone finds a case where
> the --path-walk option is less effective at packing data.

I might do that when reviewing the patch series later. I think it can
still be valuable for you to get some first feedback from just reading
this RFC.

Thanks.

gitgitgadget bot commented Sep 17, 2024

User Christian Couder <[email protected]> has been added to the cc: list.

gitgitgadget bot commented Sep 18, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/17/24 6:41 AM, Christian Couder wrote:
> On Tue, Sep 10, 2024 at 4:29 AM Derrick Stolee via GitGitGadget
> <[email protected]> wrote:

>>   * git backfill: a builtin to download missing blobs in a blobless partial
>>     clone, done in batches and grouped by the path they appear in to maximize
>>     delta compression in each batch. Allows focusing on the paths of the
>>     sparse-checkout to only get the blobs necessary for history queries in
>>     the current focus.
>
> It's not very clear if this would be useful when doing a `git
> sparse-checkout add`, or a `git blame` on a file path not covered by
> the current sparse-checkout, or both. I think it would be clearer if
> there were a few examples.

You are correct, this doesn't help with either of those examples, as
they both require blobs outside of the current sparse-checkout. The
idea is for users who are most often in a given sparse-checkout to have
the experience as if they were in a non-partial Git clone, as long as
they restrict their activity within their sparse-checkout.

If they increase their sparse-checkout, then they can re-run 'git
backfill --sparse' to get the still-missing objects in the new paths.

>>   * git survey: Jeff Hostetler built this feature [1] as a way to get
>>     functionality similar to git-sizer [2], but using the internals of Git to
>>     do it faster. It also displays information not available to git-sizer,
>>     like the on-disk size of objects. This RFC presents a simplified version
>>     of the builtin focused on listing the paths that contribute the most to
>>     the on-disk size of trees and blobs.
>
> Not sure how `git survey` works, but `git sizer` works on a whole
> repo, so, if they work in the same way, I am not sure I see what a new
> path oriented way to walk repos would bring to the tool.

`git survey` also works on a whole repo, but with the path-walk API the
implementation can talk about a group of objects that appear at a common
path instead of only talking about single extreme objects, such as one
large binary (that never changes) being singled out over a small file
that changes frequently and, across all of its versions, contributes more
to the repo's disk size.

The main benefit of using `git survey` over `git-sizer` is that it can
report on things internal to Git's storage that are not accessible to
`git-sizer` through the `git cat-file` output it uses.

>>   * git pack-objects --path-walk: In order to find a way to compute deltas
>>     among objects of the same path, I applied the path-walk API to 'git
>>     pack-objects' behind an optional flag. There are overlaps with the
>>     '--sparse' option [3], [4] that can be used here. This provides perfect
>>     partitioning by path name, without any possible collisions from the
>>     name-hash algorithm. It also allows using the name-hash values to find
>>     cross-path delta chains in a second pass.
>
> Do you mean that the new way to walk repos allows pack-object to
> perform better in the general case or maybe only in the case where
> partial clone without blobs and sparse-checkout are used as described
> in the git backfill related point above?

From what I can see, `git pack-objects --path-walk` outperforms all other
repacking strategies that I have found, but it does have some weaknesses in
how it interacts with things like delta islands. There may be other concerns,
and I _was_ able to find a repo that compressed slightly better with `git
pack-objects --full-name-hash`, but it was also storing computer-generated,
single-line JSON files representing configuration of production systems. I
would like to have a more general statement to make about this, but it will
vary with the shape of the data.

>> This feature was
>>     already sent to the mailing list as a fully-reviewable series [5]. It is
>>     included here because this series allows testing the --path-walk option
>>     against the --full-name-hash.
>
> Do you mean that the new way to walk repos can be used along with `git
> repack --full-name-hash` (maybe with `git repack --full-name-hash
> --path-walk` or a config option or otherwise?) and that it brings some
> performance or other kind (better packs or which ones?) of
> improvements?

Combining the two features actually ends up with very similar performance
to what `--full-name-hash` already does. It's actually important that the
`--path-walk` option does a full pass of the objects via the standard
name-hash after its first pass in groups based on the path.

It's more that I include performance tests that compare how effective the
two features are relative to the existing repacking strategy, across a few
different situations. This helps with both time and space performance
measurements.

>> TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER
>> ======================================================
>>
>> Here's the story about how these applications came about: I was tasked with
>> understanding why certain internal repositories were growing larger than
>> expected. (Feel free to skip. Otherwise, thank you for indulging me.)
>>
>> I first prototyped 'git backfill' as a way to download some of the largest
>> repositories without being blocked on a full clone. This batched download
>> mechanism allowed me to essentially have a retryable clone, since the client
>> could restart the process from scratch and skip any objects that were
>> already on disk. It was natural to batch based on the path of the blobs in
>> order to theoretically save time and network bandwidth due to better delta
>> calculations.
>>
>> While investigating these repositories, I had some data hinting at the total
>> size of the objects by type.
>
> By type (blob, tree, commit, tag?) or by path? Why would the size of
> the objects by type be interesting? Could trees delta poorly?

The previously-existing data could only group objects by type (commit,
tree, blob) and report on the size of the reachable objects in those
categories. The per-path data was not available as no tool that I knew
about had the capability to expose that information.

>> These
>> paths were typically binary files that appeared only once or twice and did
>> not account for the scale issues.
>
> You mean they couldn't account for a 55GB to 125GB change in data size?

The numbers were so small they didn't seem like they could have accounted
for any issues at all, let alone a hundred gigabytes of data in the repo.

>> RFC GOALS
>> =========
>>
>> The goals of this RFC are:
>>
>>   1. To demonstrate potential applications of the path-walk API to motivate
>>      its generality
>
> Yeah, the "Path-walk API and applications" title summarizes this well.
>
>> as these features are sent in full-quality patch series,
>>      but in a different order.
>
> I wonder if the patch series could have been split up rather than all
> sent as a single 30-patch series.

I was not clear about this: the RFC is 30 patches so it's possible to see
the big picture, but I will be breaking it into at least four series,
submitted in sequence, for actual review. They match the four sections
described above, but will be in the opposite order:

 A. `git repack --full-name-hash`
 B. `git pack-objects --path-walk`
 C. `git survey`
 D. `git backfill`

(It's possible that `git survey` and `git backfill` may be orthogonal enough
that they could be under review at the same time. Alternatively, `git backfill`
may jump the line because it's so simple to implement once the path-walk API
is established.)

>>   3. To demonstrate the value of the path-based batching in the 'git survey'
>>      feature, and to inspire others to think about what other statistics
>>      would be valuable in that feature. (I anticipate that once a base is
>>      established, multiple contributors will help expand its functionality
>>      long into the future.)
>
> Yeah, a better git sizer would be valuable for GitLab too and probably
> everyone hosting a significant number of repos.

This was definitely Jeff's intention, and I agree. Hopefully we can establish
a baseline quickly and then make room for extensions as people discover new
ways to investigate repo size or performance issues.

>> PART I: 'git backfill'
>> ======================
>>
>> These patches introduce the 'git backfill' builtin including its
>> '--batch-size' and '--sparse' options. While this is the first and simplest
>> application, it is also the lowest priority in terms of user need.
>>
>>   * path-walk: introduce an object walk by path
>>   * backfill: add builtin boilerplate
>>   * backfill: basic functionality and tests
>>   * backfill: add --batch-size= option
>>   * backfill: add --sparse option
>>   * backfill: assume --sparse when sparse-checkout is enabled
>
> I would prefer a first patch series with all the above and the first
> patch creating a technical doc called maybe
> Documentation/technical/path-walk.txt which could contain a lot of
> information from this RFC and perhaps technical details of how the
> path-walk works and how it is different from a regular walk.

My rework of the `git pack-objects --path-walk` series has a draft of a
technical document as you suggest [1]. I regret not having time to get
it done before this RFC.

[1] https://github.com/derrickstolee/git/pull/28/files#diff-d8a8f04540e5fac6727529dcabf05501ba2447a7de340b540f2601e071b45260

>> DISCUSSION OF NAME HASH
>> =======================
...
>> The --full-name-hash feature makes the simpler choice of replacing the
>> name-hash method with one that has fewer collisions, but loses the benefits
>> of "nearby" paths having close hash values.
>
> The benefits of "nearby" paths are only available with --path-walk,
> not in the current way object packing works.

I suppose this depends on your definition of "nearby" and I think we are
using it differently.

The --full-name-hash mechanism hashes objects by their full path name, with
only a small probability of colliding with another path, so we can think of
it as grouping objects by their full path. However, this lack of collisions
comes at a cost: having unequal but "near" hash values does not imply any
similarity in the full path.

The standard name-hash algorithm has a form of "locality" that provides a
nice property: paths with unequal but "near" hash values will have some
similarities at the end of their path names.

Between these two mechanisms, the collisions of the standard name-hash
can cause objects that should delta together to be separated in the object
order due to too many objects with the same hash value, even though they
have very different full path names. While the full-name-hash mechanism
helps keep objects from the same path close together in the order, it
fails to compute good deltas across objects from different paths, even
if they may end similarly (implying similar file types or even renames
across directories).

The --path-walk feature has the nice benefit that it first attempts to
compute delta bases within groups containing exactly the set of objects
that appear at a given path (no more, no less), but then it _also_ does
the standard name-hash sort to help look for delta bases using the
name-hash locality heuristic.

>> One natural question to consider is to think about storing both the
>> name-hash and the full-name-hash and doing two delta passes, each one
>> sorting the objects by a different hash function. The first issue is that we
>> don't want to store two hash values per object, as that will significantly
>> increase the memory pressure during repacking.
>
> Is one more uint32_t per object really increasing the memory pressure
> so much? Isn't there a way to pack bits together to avoid that?

I hesitate to conclude that four bytes per object will not have a meaningful
impact. One thing that I'll be reporting on in my v2 of the --full-name-hash
feature is that I've gone and tried to prototype what would happen with this
kind of approach.

The short version is that I failed to make the addition of more data to this
struct helpful in outperforming the --full-name-hash option. I thought that
a two-pass approach would work, but something about how the two sorts worked
caused problems that I could not overcome. (Maybe someone with a stronger
handle on this could make it work.) My WIP branch is here [2] for anyone
willing to try to succeed where I failed.

[2] https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip

>> This could be side-stepped by
>> storing the full-name-hash in the packing list and then a second mapping
>> from full-name-hash to name-hash. However, that still leads to the two
>> passes taking extra time.
>
> Isn't there a way to sort using both hash functions in a single pass?
> (That is to perform a single pass and when considering an object sort
> it using both hash functions before considering a different object.)

We could sort the objects using both hash functions (say, name-hash then
full-name-hash to break ties) but then the full-name-hash values keep
the similar-sized objects with the same name-hash from being sorted near
each other (at least, within the short default window, which I use for
all of my testing). So this ends up being ineffective. In [2] above, this
is something I tested along the way and am pretty confident that it does
not work as well as we'd like it to.

>> The --path-walk approach is faster than even a
>> single pass.
>
> Is there a reason for that?

My guess here is that we are finding the "best" deltas quickly, allowing
us to short-circuit the possibility of other deltas due to file size
differences. (If my object has size X, with a current delta of size D,
then any object smaller than X - D will not be a good base.)
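
(Concretely: if the target object is 100 KB and we already hold a 10 KB
delta for it, any candidate base smaller than 90 KB cannot possibly
produce a smaller delta, so that pair can be skipped without running the
delta computation at all.)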

>> First, we can consider the total reachable objects in each scenario:
>>
>> TOTAL OBJECT SIZES BY TYPE
>> ================================================
>> Object Type |  Count | Disk Size | Inflated Size
>> ------------+--------+-----------+--------------
>>      Commits |  20579 |  10443669 |      15092790
>>        Trees | 276503 |  40212070 |     244429615
>>        Blobs | 294500 | 635365661 |   10791187920
>>
>> TOTAL OBJECT SIZES BY TYPE
>> ================================================
>> Object Type |  Count | Disk Size | Inflated Size
>> ------------+--------+-----------+--------------
>>      Commits |  20579 |  10450605 |      15092790
>>        Trees | 276503 |  31136263 |     244429615
>>        Blobs | 294500 |  94442401 |   10791187920
>
> Using % of size reduction or increase might help make it a bit clearer here too.

True. I could compute that manually, as this data is not available in the
same process. I'll try to do that in the future.

>> Now, we can consider the top 10 file paths before and after repacking with
>> the --path-walk feature:
...
>> The nice thing about this result is that we can point to files that are
>> taking up space because there is no other way around it, not that the naming
>> convention for the files is causing confusion during packing.
>
> Yeah, nice result. I guess the result is the same or very similar when
> the improved name-hash algorithm is used.

Yes, similar results happen there. I haven't dug into them very much to see
if it helps point out anything about inefficiencies in --full-name-hash, but
I don't expect any of the big differences between --full-name-hash and
--path-walk to appear in these tables.

>> WHAT TO DO NOW?
>> ===============
>>
>> Thank you for reading this far. I hope that this has added context on the
>> patch series for the --full-name-hash option, but also provided the right
>> context for when I come back in a week or so with a review-ready version of
>> the --path-walk option.
>>
>> These patches are rough and I want to make sure everyone knows that.
>> Reviewing them for style or even basic organization may lead to some wasted
>> time. I know that at least one of the patches got mangled and its diff does
>> not match its description (I'm thinking specifically about "pack-objects:
>> extract should_attempt_deltas()" but this could apply elsewhere).
>
> Ok, I will wait for the next iteration before reviewing the patches then.

Yes, I only meant for them to be available for those who were curious to
explore. Please do review v2 of the --full-name-hash series [3].

[3] https://lore.kernel.org/git/[email protected]/

>> But I think that interested parties could take my branch, build it, and give
>> these features a try on their favorite repos. I'd love to hear feedback on
>> the usability and effectiveness, especially if someone finds a case where
>> the --path-walk option is less effective at packing data.
>
> I might do that when reviewing the patch series later. I think it can
> still be valuable for you to get some first feedback from just reading
> this RFC.

Thank you! And thanks for the thoughts on this very long cover letter.

-Stolee
