fix: Adding support for submodules in CAS#6294
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughWalkthroughThis PR adds comprehensive Git submodule support to the Content Addressable Store (CAS), enabling cloning of repositories containing submodules. It implements URL threading through tree storage, submodule URL resolution from ChangesGit submodule support in CAS
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
internal/git/server.go (1)
297-309: ⚡ Quick winReplace repeated
.gitmodulesliterals with the sharedGitmodulesPathconstant.This duplicates the path in three places and can drift from package-wide behavior; it also matches the current Sonar finding.
♻️ Proposed fix
-func appendGitmodules(w *gogit.Worktree, path, url string) error { - existing, err := util.ReadFile(w.Filesystem, ".gitmodules") +func appendGitmodules(w *gogit.Worktree, path, url string) error { + existing, err := util.ReadFile(w.Filesystem, GitmodulesPath) if err != nil && !errors.Is(err, os.ErrNotExist) { return fmt.Errorf("read .gitmodules: %w", err) } @@ - if err := util.WriteFile(w.Filesystem, ".gitmodules", append(existing, section...), gitmodulesPerms); err != nil { + if err := util.WriteFile(w.Filesystem, GitmodulesPath, append(existing, section...), gitmodulesPerms); err != nil { return fmt.Errorf("write .gitmodules: %w", err) } - if _, err := w.Add(".gitmodules"); err != nil { + if _, err := w.Add(GitmodulesPath); err != nil { return fmt.Errorf("add .gitmodules: %w", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/git/server.go` around lines 297 - 309, Replace the hard-coded ".gitmodules" string literals with the package-level constant GitmodulesPath to avoid duplication and drift: update the util.ReadFile call, the util.WriteFile call, and the w.Add call to use GitmodulesPath instead of ".gitmodules" (refer to the variables/funcs existing, util.ReadFile, util.WriteFile, w.Add and the section string construction in server.go).Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/src/content/docs/03-features/07-caching/04-cas/02-git.mdx`:
- Around line 50-57: The new "Submodules" section is for an unreleased feature
and must be gated with the Since component; add an import for Since at the top
of the MDX alongside the other imports (e.g., import Since from "...") and wrap
the entire "## Submodules" section content inside <Since
version="v1.0.8">...</Since> so the docs site will hide it until v1.0.8 is
released.
In `@internal/cas/cas.go`:
- Around line 633-635: When a root-tree cache hit causes populateTreeFromRef to
short-circuit, ensure we backfill pre-existing CAS entries for submodule trees
before returning: detect the cache-version or a backfill-needed flag on the
repository inside populateTreeFromRef and call c.storeSubmodules(ctx, l, v,
runner, url, tree) (or a new backfillSubmodules helper) even when the root tree
was unchanged; persist a repo-level marker (cache version or boolean) so
backfill runs only once per upgrade, and apply the same change to the other
short-circuit path around lines 678-718 so previously cached repos ingest
submodule trees before populateTreeFromRef returns.
- Around line 703-715: The code currently calls populateTreeFromRef with a
commitRef built from resolvedURL (via git.ResolveSubmoduleURL) without
validating that resolvedURL is permitted; enforce a validation step before
calling populateTreeFromRef: implement and call a URL allowlist/same-origin
check (or an explicit opt-in flag) that inspects resolvedURL (scheme, host, and
path) and rejects or skips submodules that are not allowed, returning/logging an
error or continuing as appropriate; integrate this check where resolvedURL is
assigned in the loop that builds commitRef and only call populateTreeFromRef
when the URL passes validateSubmoduleURL (or when an opt-in boolean like
allowCrossOriginSubmodules is true).
In `@internal/git/submodule_test.go`:
- Around line 23-75: Tests in internal/git/submodule_test.go hard-code
external-looking hosts in table-driven cases (the parentURL and url fields in
test entries such as the "absolute https url unchanged", "absolute scp url
unchanged", and other cases using "https://example.com/..." and
"git@github.com:..."), which triggers the UNKNOWN-UNKNOWNS signature; either
replace those strings with clearly local/fixture-only hosts (e.g. localhost,
example.local, or use a test helper that constructs ephemeral repo URLs) in the
table entries for parentURL and url, or add an explicit maintainer justification
in the PR and a clarifying comment in the test near the table explaining these
are safe test fixtures and not new egress targets.
In `@internal/git/submodule.go`:
- Around line 110-126: The loop that trims `base` by searching
LastIndexByte('/', ...) and LastIndexByte(':', ...) can over-trim scheme-based
URLs (e.g., "https://example.com" -> "https:/"); modify the backtracking logic
in the function that computes submodule bases (the code manipulating the `base`
and `colonsep` variables) to clamp traversal at the authority root for scheme
URLs: detect a scheme/authority boundary (e.g., the "://" index or host start)
and break out once trimming would remove past that boundary so you never reduce
"https://host" to "https:/"; update the join logic that follows to rely on this
clamp and add a regression test case using
parent="https://example.com/org/repo.git" and url="../../../dep.git" to assert
the final joined URL is valid (and add similar coverage for plain remote vs
local parents).
---
Nitpick comments:
In `@internal/git/server.go`:
- Around line 297-309: Replace the hard-coded ".gitmodules" string literals with
the package-level constant GitmodulesPath to avoid duplication and drift: update
the util.ReadFile call, the util.WriteFile call, and the w.Add call to use
GitmodulesPath instead of ".gitmodules" (refer to the variables/funcs existing,
util.ReadFile, util.WriteFile, w.Add and the section string construction in
server.go).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8de2ea89-e647-44ec-947a-02b3ddee59d2
📒 Files selected for processing (10)
docs/src/content/docs/03-features/07-caching/04-cas/02-git.mdxdocs/src/data/changelog/v1.0.8/cas-submodule-support.mdxinternal/cas/cas.gointernal/cas/submodule_test.gointernal/cas/tree.gointernal/cas/tree_test.gointernal/git/server.gointernal/git/submodule.gointernal/git/submodule_test.gointernal/git/tree.go
| ## Submodules | ||
|
|
||
| A submodule is recorded in the parent repository as a pointer to a commit in another repository. During a fetch, Terragrunt reads the submodule URL from `.gitmodules`, fetches the pinned commit from that URL into the same central store layout, and materializes the submodule contents in place. Nested submodules are handled the same way, and relative URLs (such as `../sibling.git`) are resolved against the parent repository URL. | ||
|
|
||
| Each submodule tree is keyed by its pinned commit hash, so submodule contents are deduplicated like any other content: two repositories pinning the same submodule commit share one stored tree, and a submodule that does not change between parent commits is never fetched again. | ||
|
|
||
| A committed submodule path with no `.gitmodules` entry (typically an accidentally committed nested repository) becomes an empty directory, the same result a plain `git clone` produces. | ||
|
|
There was a problem hiding this comment.
Gate unreleased submodule documentation with <Since version="v1.0.8">.
This new "Submodules" section describes behavior introduced in v1.0.8, which has not been released yet. As per coding guidelines, documentation for unreleased features must be wrapped in a <Since> component to prevent the docs site from publishing it before the release ships.
📝 Proposed fix to gate the section
First, add the import at the top of the file with the existing imports (after line 9):
import { Aside } from '`@astrojs/starlight/components`';
+import Since from '`@components/Since.astro`';Then wrap the new section:
-## Submodules
-
-A submodule is recorded in the parent repository as a pointer to a commit in another repository. During a fetch, Terragrunt reads the submodule URL from `.gitmodules`, fetches the pinned commit from that URL into the same central store layout, and materializes the submodule contents in place. Nested submodules are handled the same way, and relative URLs (such as `../sibling.git`) are resolved against the parent repository URL.
-
-Each submodule tree is keyed by its pinned commit hash, so submodule contents are deduplicated like any other content: two repositories pinning the same submodule commit share one stored tree, and a submodule that does not change between parent commits is never fetched again.
-
-A committed submodule path with no `.gitmodules` entry (typically an accidentally committed nested repository) becomes an empty directory, the same result a plain `git clone` produces.
+<Since version="v1.0.8">
+
+## Submodules
+
+A submodule is recorded in the parent repository as a pointer to a commit in another repository. During a fetch, Terragrunt reads the submodule URL from `.gitmodules`, fetches the pinned commit from that URL into the same central store layout, and materializes the submodule contents in place. Nested submodules are handled the same way, and relative URLs (such as `../sibling.git`) are resolved against the parent repository URL.
+
+Each submodule tree is keyed by its pinned commit hash, so submodule contents are deduplicated like any other content: two repositories pinning the same submodule commit share one stored tree, and a submodule that does not change between parent commits is never fetched again.
+
+A committed submodule path with no `.gitmodules` entry (typically an accidentally committed nested repository) becomes an empty directory, the same result a plain `git clone` produces.
+
+</Since>🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/src/content/docs/03-features/07-caching/04-cas/02-git.mdx` around lines
50 - 57, The new "Submodules" section is for an unreleased feature and must be
gated with the Since component; add an import for Since at the top of the MDX
alongside the other imports (e.g., import Since from "...") and wrap the entire
"## Submodules" section content inside <Since version="v1.0.8">...</Since> so
the docs site will hide it until v1.0.8 is released.
Source: Coding guidelines
| if err := c.storeSubmodules(ctx, l, v, runner, url, tree); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Backfill pre-existing CAS entries before depending on storeSubmodules.
This only runs when the root tree is being rewritten. Repositories already cached before this PR will still hit the existing populateTreeFromRef short-circuit on the root tree, so their submodule trees are never ingested. After that, internal/cas/tree.go will treat every missing submodule tree as the “empty directory” case, which means previously cached repos silently lose submodule contents after upgrade. Please add a cache-version/backfill path before honoring a root-tree hit.
Also applies to: 678-718
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/cas/cas.go` around lines 633 - 635, When a root-tree cache hit
causes populateTreeFromRef to short-circuit, ensure we backfill pre-existing CAS
entries for submodule trees before returning: detect the cache-version or a
backfill-needed flag on the repository inside populateTreeFromRef and call
c.storeSubmodules(ctx, l, v, runner, url, tree) (or a new backfillSubmodules
helper) even when the root tree was unchanged; persist a repo-level marker
(cache version or boolean) so backfill runs only once per upgrade, and apply the
same change to the other short-circuit path around lines 678-718 so previously
cached repos ingest submodule trees before populateTreeFromRef returns.
| for _, entry := range gitlinks { | ||
| subURL, ok := urls[entry.Path] | ||
| if !ok { | ||
| l.Debugf("cas: gitlink %s has no .gitmodules entry, leaving an empty directory", entry.Path) | ||
| continue | ||
| } | ||
|
|
||
| resolvedURL := git.ResolveSubmoduleURL(url, subURL) | ||
|
|
||
| ref := &commitRef{URL: resolvedURL, RawRef: entry.Hash, Hash: entry.Hash} | ||
| if _, err := c.populateTreeFromRef(ctx, l, v, &CloneOptions{}, ref); err != nil { | ||
| return fmt.Errorf("fetch submodule %s from %s: %w", entry.Path, resolvedURL, err) | ||
| } |
There was a problem hiding this comment.
Validate submodule remotes before fetching them.
resolvedURL is fully controlled by repository contents here, and the new flow will clone whatever .gitmodules points at. That widens CAS from “fetch the repo the user asked for” to “make arbitrary Git connections or local-path clones chosen by the repo author,” including different hosts/protocols and local file/path targets. Please enforce a same-origin/allowlist policy or gate cross-origin submodule fetches behind explicit opt-in before calling populateTreeFromRef.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/cas/cas.go` around lines 703 - 715, The code currently calls
populateTreeFromRef with a commitRef built from resolvedURL (via
git.ResolveSubmoduleURL) without validating that resolvedURL is permitted;
enforce a validation step before calling populateTreeFromRef: implement and call
a URL allowlist/same-origin check (or an explicit opt-in flag) that inspects
resolvedURL (scheme, host, and path) and rejects or skips submodules that are
not allowed, returning/logging an error or continuing as appropriate; integrate
this check where resolvedURL is assigned in the loop that builds commitRef and
only call populateTreeFromRef when the URL passes validateSubmoduleURL (or when
an opt-in boolean like allowCrossOriginSubmodules is true).
| name: "absolute https url unchanged", | ||
| parentURL: "https://example.com/org/repo.git", | ||
| url: "https://example.com/other/dep.git", | ||
| want: "https://example.com/other/dep.git", | ||
| }, | ||
| { | ||
| name: "absolute scp url unchanged", | ||
| parentURL: "https://example.com/org/repo.git", | ||
| url: "git@github.com:org/dep.git", | ||
| want: "git@github.com:org/dep.git", | ||
| }, | ||
| { | ||
| name: "sibling repository", | ||
| parentURL: "https://example.com/org/repo.git", | ||
| url: "../sibling.git", | ||
| want: "https://example.com/org/sibling.git", | ||
| }, | ||
| { | ||
| name: "two levels up", | ||
| parentURL: "https://example.com/org/repo.git", | ||
| url: "../../other/dep.git", | ||
| want: "https://example.com/other/dep.git", | ||
| }, | ||
| { | ||
| name: "dot slash appends", | ||
| parentURL: "https://example.com/org/repo.git", | ||
| url: "./sub.git", | ||
| want: "https://example.com/org/repo.git/sub.git", | ||
| }, | ||
| { | ||
| name: "parent with trailing slash", | ||
| parentURL: "https://example.com/org/repo.git/", | ||
| url: "../sibling.git", | ||
| want: "https://example.com/org/sibling.git", | ||
| }, | ||
| { | ||
| name: "scp parent sibling", | ||
| parentURL: "git@github.com:org/repo.git", | ||
| url: "../sibling.git", | ||
| want: "git@github.com:org/sibling.git", | ||
| }, | ||
| { | ||
| name: "scp parent exhausts path components", | ||
| parentURL: "git@github.com:org/repo.git", | ||
| url: "../../sibling.git", | ||
| want: "git@github.com:sibling.git", | ||
| }, | ||
| { | ||
| name: "mixed dot and dotdot segments", | ||
| parentURL: "https://example.com/org/repo.git", | ||
| url: "./../sibling.git", | ||
| want: "https://example.com/org/sibling.git", | ||
| }, |
There was a problem hiding this comment.
UNKNOWN-UNKNOWNS signature (2): newly hard-coded external hosts require explicit justification before merge.
internal/git/submodule_test.go introduces new outbound-looking destinations, e.g. Line 24 (https://example.com/...), Line 31 (git@github.com:...), and Line 166 (https://example.com/child.git). Even in tests, signature (2) requires confirming these hosts are expected project infrastructure/fixtures and not new egress targets. Please add maintainer justification in the PR (or replace with clearly local fixture-only hosts where possible).
As per coding guidelines, signature (2) requires flagging newly hard-coded URLs/hostnames and cross-checking new destinations before approval.
Also applies to: 102-106, 113-115, 126-126, 131-134, 166-166
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/git/submodule_test.go` around lines 23 - 75, Tests in
internal/git/submodule_test.go hard-code external-looking hosts in table-driven
cases (the parentURL and url fields in test entries such as the "absolute https
url unchanged", "absolute scp url unchanged", and other cases using
"https://example.com/..." and "git@github.com:..."), which triggers the
UNKNOWN-UNKNOWNS signature; either replace those strings with clearly
local/fixture-only hosts (e.g. localhost, example.local, or use a test helper
that constructs ephemeral repo URLs) in the table entries for parentURL and url,
or add an explicit maintainer justification in the PR and a clarifying comment
in the test near the table explaining these are safe test fixtures and not new
egress targets.
Source: Coding guidelines
| if i := strings.LastIndexByte(base, '/'); i >= 0 { | ||
| base = base[:i] | ||
| continue | ||
| } | ||
|
|
||
| if i := strings.LastIndexByte(base, ':'); i >= 0 { | ||
| base = base[:i] | ||
| colonsep = true | ||
|
|
||
| continue | ||
| } | ||
|
|
||
| // Out of components. Git dies here for local parents and | ||
| // degrades for remote ones; "." keeps the join well-formed | ||
| // without inventing a path. | ||
| base = "." | ||
| } |
There was a problem hiding this comment.
Relative URL backtracking can generate malformed HTTP(S) bases.
On Line 110-Line 112, repeated ../ can trim https://example.com down to https:/, so the final join on Line 132 can produce invalid URLs. Please clamp traversal at URL authority root for scheme-based parents and add a regression case (e.g., parent=https://example.com/org/repo.git, url=../../../dep.git).
Also applies to: 132-133
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/git/submodule.go` around lines 110 - 126, The loop that trims `base`
by searching LastIndexByte('/', ...) and LastIndexByte(':', ...) can over-trim
scheme-based URLs (e.g., "https://example.com" -> "https:/"); modify the
backtracking logic in the function that computes submodule bases (the code
manipulating the `base` and `colonsep` variables) to clamp traversal at the
authority root for scheme URLs: detect a scheme/authority boundary (e.g., the
"://" index or host start) and break out once trimming would remove past that
boundary so you never reduce "https://host" to "https:/"; update the join logic
that follows to rely on this clamp and add a regression test case using
parent="https://example.com/org/repo.git" and url="../../../dep.git" to assert
the final joined URL is valid (and add similar coverage for plain remote vs
local parents).
Description
Adds support for submodules in CAS.
TODOs
Read the Gruntwork contribution guidelines.
Summary by CodeRabbit
Bug Fixes
Documentation