fix!: return simple stats or extended stats #760

Merged: achingbrain merged 12 commits into main from fix/report-directory-sizes-properly on Mar 20, 2025

@achingbrain achingbrain commented Mar 11, 2025

Splits the unixfs/mfs stat command return types into two types - "regular" stats (these only return stats available on the root node of the DAG) and "extended" - these collect stats from the DAG, potentially going to the network to fetch missing blocks.

There are separate simple and extended types for files and dirs, but raw/leaf nodes only get one type since there are never any nodes linked to from them.

As a bonus we can now calculate the size of directories correctly (as the combination of all child files/directories), assuming all blocks are present in the blockstore or fetchable from the network.

Fixes #580

BREAKING CHANGE: Fields that would involve DAG traversal have been removed from the output of fs.stat - pass the extended option to have them returned
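
For illustration, a minimal sketch of how the simple/extended split could look to a caller. `ToyDag`, `SimpleStats`, `ExtendedStats` and this `stat` function are hypothetical stand-ins, not the types in this PR; only the field names `size`, `localSize`, `dagSize` and `blocks` and the `extended` option come from the discussion.

```typescript
// Hypothetical shapes for illustration; the real definitions are in the PR diff.
type NodeType = 'file' | 'directory' | 'raw'

interface SimpleStats {
  type: NodeType
  size: bigint
}

interface ExtendedStats extends SimpleStats {
  localSize: bigint
  dagSize: bigint
  blocks: bigint
}

// In-memory stand-in for a DAG whose blocks are all present locally
interface ToyDag {
  type: NodeType
  size: bigint // size recorded on the root node
  blockSizes: bigint[] // byte length of every block in the DAG
}

function stat (dag: ToyDag): SimpleStats
function stat (dag: ToyDag, options: { extended: true }): ExtendedStats
function stat (dag: ToyDag, options?: { extended?: boolean }): SimpleStats | ExtendedStats {
  const simple: SimpleStats = { type: dag.type, size: dag.size }

  if (options?.extended !== true) {
    // cheap path: only the root node is consulted
    return simple
  }

  // extended path: visit every block (real code may go to the network here)
  let dagSize = 0n
  for (const blockSize of dag.blockSizes) {
    dagSize += blockSize
  }

  return {
    ...simple,
    localSize: dag.size, // every block is local in this toy DAG
    dagSize,
    blocks: BigInt(dag.blockSizes.length)
  }
}
```

The overloads let the compiler hand back the narrower type unless the caller explicitly asks for the traversal-derived fields.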

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • I have added tests that prove my fix is effective or that my feature works

@achingbrain achingbrain requested a review from a team as a code owner March 11, 2025 18:18
@achingbrain
Member Author

Check out the changes to the returned types - https://github.com/ipfs/helia/pull/760/files#diff-026d52cd110beb421b24bb2d95961f27f3f3d083df76eac8302eb4eab783c680 - most of this PR is noise dealing with that.

@achingbrain
Member Author

Removing the mtime/mode fields from the stat results might be a bit much, since there are supposed to be default values if they aren't specified in the UnixFS metadata. I'll probably reinstate them, which will make this PR a lot quieter.

@achingbrain achingbrain force-pushed the fix/report-directory-sizes-properly branch 4 times, most recently from 9f2e7e0 to 5cd760c Compare March 14, 2025 11:55
BREAKING CHANGE: The return type from fs.stat varies by DAG type - inspect the `.type` property to ensure type safety
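
A minimal sketch of what inspecting `.type` might look like on the consuming side. The interfaces below are hypothetical; in particular, treating raw results as having no `unixfs` metadata is an assumption based on the optional `unixfs?.fileSize()` access in the tests, not a statement of the actual types.

```typescript
// Hypothetical result shapes: only the `type` discriminator and the
// `unixfs.fileSize()` accessor are taken from the PR.
interface UnixFsMeta {
  fileSize (): bigint
}

interface FileStats {
  type: 'file'
  unixfs: UnixFsMeta
}

interface DirectoryStats {
  type: 'directory'
  unixfs: UnixFsMeta
}

interface RawStats {
  type: 'raw'
  size: bigint
}

type StatResult = FileStats | DirectoryStats | RawStats

function summarize (stats: StatResult): string {
  // switching on the discriminator narrows `stats` in each branch
  switch (stats.type) {
    case 'file':
      return `file of ${stats.unixfs.fileSize()} bytes`
    case 'directory':
      return 'directory'
    case 'raw':
      return `raw block of ${stats.size} bytes`
  }
}
```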
@achingbrain achingbrain force-pushed the fix/report-directory-sizes-properly branch from 5cd760c to 4ebc680 Compare March 14, 2025 11:56
Member

@SgtPooki SgtPooki left a comment

the only thing confusing me here is that in the tests, we aren't validating that "network requests" will fill out dagSize more when extended=true.

It would be good to have a test for that. Maybe have a wrapped blockstore that calls through to a "network" blockstore that we load the bytes into?

Comment on lines 59 to 68
  await blockstore.delete(node.Links[0].Hash)

  // block count and local file/dag sizes should be smaller
- await expect(fs.stat(filePath)).to.eventually.include({
-   fileSize: 5242880n,
-   blocks: 5,
-   localFileSize: 4194304n,
-   localDagSize: 4194563n
+ const updatedStats = await fs.stat(filePath, {
+   extended: true
  })
+
+ expect(updatedStats.unixfs?.fileSize()).to.equal(5242880n)
+ expect(updatedStats.blocks).to.equal(5n)
+ expect(updatedStats.dagSize).to.equal(4194563n)
Member

I was expecting dagSize to be the same here because extended=true but I guess there's no networking done in the test, so it doesn't retrieve that block?

Member Author

That's correct, yes. The blockstore is a MemoryBlockstore that isn't wrapped in a NetworkedBlockstore.
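
The wrapped-blockstore idea suggested above could be sketched roughly as follows. `MapStore` and `FallbackStore` are hypothetical test helpers, not Helia classes, and they are synchronous and keyed by string for brevity, whereas real blockstores are async and keyed by CID.

```typescript
// Minimal synchronous store; real blockstores are async and keyed by CID.
class MapStore {
  private readonly blocks = new Map<string, Uint8Array>()

  has (key: string): boolean {
    return this.blocks.has(key)
  }

  get (key: string): Uint8Array {
    const block = this.blocks.get(key)
    if (block == null) {
      throw new Error(`Block ${key} not found`)
    }
    return block
  }

  put (key: string, block: Uint8Array): void {
    this.blocks.set(key, block)
  }
}

// Hypothetical wrapper standing in for a networked blockstore in tests
class FallbackStore {
  private readonly local: MapStore
  private readonly network: MapStore
  public fetched = 0 // number of blocks pulled from the "network"

  constructor (local: MapStore, network: MapStore) {
    this.local = local
    this.network = network
  }

  // local-only check, never consults the "network"
  has (key: string): boolean {
    return this.local.has(key)
  }

  // serve from the local store, otherwise "fetch" and cache the block
  get (key: string): Uint8Array {
    if (this.local.has(key)) {
      return this.local.get(key)
    }
    const block = this.network.get(key)
    this.local.put(key, block)
    this.fetched++
    return block
  }
}
```

A test could delete a block from the local store, load it into the "network" store, and then assert that an extended stat reports the full DAG size and that `fetched` increased.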

*/
type: 'file' | 'directory' | 'raw'
localSize: bigint
Member

Since ExtendedStats can trigger network fetching, I suppose if this is successfully returned, it's accurate, because the blocks had to be fetched for this to return. Is that right?

Member Author

Yes, unless the offline option was true, in which case it won't pull down any missing blocks, so localSize will only be what's in the local blockstore.

@2color
Member

2color commented Mar 17, 2025

the only thing confusing me here is that in the tests, we aren't validating that "network requests" will fill out dagSize more when extended=true.

I was confused by the same thing.

If you have a NetworkedBlockstore, will await blockstore.has(cid, options) trigger a chain leading to find-providers calls and a const block = await blockstore.get(cid, options) call to fetch the block?

async function inspectDag (cid: CID, blockstore: GetStore & HasStore, options: AbortOptions): Promise<InspectDagResults> {
  const results: InspectDagResults = {
    localSize: 0n,
    dagSize: 0n,
    blocks: 0n
  }
  if (await blockstore.has(cid, options)) {
    const block = await blockstore.get(cid, options)
    results.blocks++
    results.dagSize += BigInt(block.byteLength)

unixfs: entry.unixfs,
mode: entry.unixfs.mode ?? (entry.unixfs.isDirectory() ? DEFAULT_DIR_MODE : DEFAULT_FILE_MODE),
mtime: entry.unixfs.mtime,
size: entry.unixfs.fileSize()
Member

What do we expect fileSize() to return for directories?

Member Author

It will return 0n.
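
As a small illustration of the mode fallback in the snippet under review: the identifiers `DEFAULT_FILE_MODE` and `DEFAULT_DIR_MODE` appear in the diff, but the numeric values below are assumed from the UnixFS defaults, and `UnixFsLike` and `resolveMode` are hypothetical.

```typescript
// Values assumed from the UnixFS defaults, not copied from this PR
const DEFAULT_FILE_MODE = 0o644
const DEFAULT_DIR_MODE = 0o755

// Hypothetical subset of the UnixFS metadata object
interface UnixFsLike {
  mode?: number
  isDirectory (): boolean
}

// Use the stored mode when present, otherwise pick a default by node kind
function resolveMode (unixfs: UnixFsLike): number {
  return unixfs.mode ?? (unixfs.isDirectory() ? DEFAULT_DIR_MODE : DEFAULT_FILE_MODE)
}
```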

@achingbrain
Member Author

If you have a NetworkedBlockstore, will await blockstore.has(cid, options) trigger a chain leading to find-providers calls and a const block = await blockstore.get(cid, options) call to fetch the block?

No, actually this was a bug. I've updated the code and added an interop test that tests pulling the missing block from a network peer.
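
To make the problem concrete, here is a rough sketch of why a `has()` guard can hide missing blocks from a store that only fetches on `get()`. This is not the fix that landed in the PR; `ToyNetworkedStore`, `Block`, `DagSummary` and this `inspectDag` are hypothetical, synchronous and string-keyed, and the assumption that `has()` is local-only is mine.

```typescript
interface Block {
  bytes: Uint8Array
  links: string[] // keys of child blocks
}

interface DagSummary {
  dagSize: bigint
  blocks: bigint
}

// Hypothetical store: has() is local-only, get() falls back to a remote map
class ToyNetworkedStore {
  private readonly local: Map<string, Block>
  private readonly remote: Map<string, Block>

  constructor (local: Map<string, Block>, remote: Map<string, Block>) {
    this.local = local
    this.remote = remote
  }

  has (key: string): boolean {
    return this.local.has(key)
  }

  get (key: string): Block {
    const block = this.local.get(key) ?? this.remote.get(key)
    if (block == null) {
      throw new Error(`Block ${key} not found`)
    }
    this.local.set(key, block) // cache whatever was fetched
    return block
  }
}

function inspectDag (key: string, store: ToyNetworkedStore, offline: boolean): DagSummary {
  const summary: DagSummary = { dagSize: 0n, blocks: 0n }

  // Offline, a missing block is skipped. Online, get() is called directly so
  // the store can fetch it; guarding with has() would skip it in both modes.
  if (offline && !store.has(key)) {
    return summary
  }

  const block = store.get(key)
  summary.blocks += 1n
  summary.dagSize += BigInt(block.bytes.byteLength)

  for (const link of block.links) {
    const child = inspectDag(link, store, offline)
    summary.blocks += child.blocks
    summary.dagSize += child.dagSize
  }

  return summary
}
```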

achingbrain and others added 2 commits March 19, 2025 11:51
@2color
Member

2color commented Mar 19, 2025

To better understand and document the nuances around stats and extended stats, I created this table:

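stat options and result

| input CID type | offline | extended | fetched block(s) | DAG traversal | size | localSize | dagSize | blocks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| file | | | | | file size | n/a | n/a | n/a |
| directory | | | | | 0 | n/a | n/a | n/a |
| raw | | | | n/a | bytelength | n/a | n/a | n/a |
| file | | | | | file size | n/a | n/a | n/a |
| directory | | | | | 0 | n/a | n/a | n/a |
| raw | | | | n/a | bytelength | n/a | n/a | n/a |
| file | | | | | file size | file size | file size + pb overhead | blocks |
| directory | | | | | local dir size | 0 | local dir size + pb overhead | blocks |
| raw | | | | n/a | bytelength | bytelength | bytelength | 1 |
| file | | | | | file size | file size | file size + pb overhead | blocks |
| directory | | | | | local dir size | 0 | local dir size + pb overhead | blocks |
| raw | | | | n/a | bytelength | bytelength | bytelength | 1 |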

Notes

  • With { extended: true, offline: true }, the difference between size and localSize is the delta between local blocks and total blocks of the dag.
  • For files with normal stats, size will be calculated from the unixfs metadata (which can in theory be spoofed)
  • All sizes are computed without deduplication, so a block that appears more than once in the DAG is counted each time.
  • For directories with normal stats, size will be 0n.
  • For directories with extended stats, size and localSize will be the same. If some blocks are missing, both will be inaccurate.
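
As a worked example of the first and third notes, a small sketch using the figures from the test discussed earlier in this review (a 5242880-byte file with 4194304 bytes held locally). The helper names `missingBytes` and `sumWithoutDedup` are hypothetical.

```typescript
// Delta between the advertised size and what is held locally
function missingBytes (size: bigint, localSize: bigint): bigint {
  return size - localSize
}

// Sum block sizes link by link, counting repeated links every time they occur
function sumWithoutDedup (links: string[], sizes: Map<string, bigint>): bigint {
  let total = 0n
  for (const link of links) {
    total += sizes.get(link) ?? 0n
  }
  return total
}

// 5242880n - 4194304n leaves 1048576n bytes (1 MiB) still to fetch
const stillMissing = missingBytes(5242880n, 4194304n)

// block 'a' is linked twice, so its 1024 bytes are added twice
const countedTwice = sumWithoutDedup(['a', 'b', 'a'], new Map([['a', 1024n], ['b', 2048n]]))
```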

I'll drop it here for now, and can incorporate it into the READMEs in a follow up to this.

Would be great if you can verify this.

@achingbrain
Member Author

achingbrain commented Mar 19, 2025

Would be great if you can verify this.

Looks fine. I've made a change so that when statting a directory with extended: true, size is the recursive sum of the size property of all child files (where the file's DAG root is in the blockstore) and not 0 (we were just using localSize for size before).

This is as close to the "true" or expected directory size as we can get, I think.

For the same stat result, localSize is the recursive sum of the localSize property of all child files.
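
The recursive sums described above could be sketched like this. `Entry` and `directorySizes` are hypothetical and operate on an in-memory tree rather than a blockstore; `rootPresent` models whether a file's DAG root is available, and skipping files whose root is missing is my reading of the description.

```typescript
// Hypothetical in-memory tree; the real code walks UnixFS DAG nodes
type Entry =
  | { type: 'file', size: bigint, localSize: bigint, rootPresent: boolean }
  | { type: 'directory', children: Entry[] }

interface Sizes {
  size: bigint
  localSize: bigint
}

function directorySizes (entry: Entry): Sizes {
  if (entry.type === 'file') {
    // a file whose DAG root is missing contributes nothing: its size is unknown
    return entry.rootPresent
      ? { size: entry.size, localSize: entry.localSize }
      : { size: 0n, localSize: 0n }
  }

  let size = 0n
  let localSize = 0n

  // recurse into children, which may themselves be directories
  for (const child of entry.children) {
    const childSizes = directorySizes(child)
    size += childSizes.size
    localSize += childSizes.localSize
  }

  return { size, localSize }
}
```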

@achingbrain achingbrain merged commit 325b36f into main Mar 20, 2025
18 checks passed
@achingbrain achingbrain deleted the fix/report-directory-sizes-properly branch March 20, 2025 16:56
@achingbrain achingbrain mentioned this pull request Mar 20, 2025

Successfully merging this pull request may close these issues.

UnixFs Stat Command Returns Invalid Data
3 participants