Experiment with a hybrid bitfield + range encoding for Span / DefId.

Roughly, if you have a "container" (file/crate/etc.), and sequential indices in it:
* you can use `(container_index, intra_container_index)` (but that takes 2x space)
* you can split an integer's *bitwidth* into two bitfields, one for each half of the pair above
  * the point where you choose to split is a tradeoff and you can run out of either half
* you can split an integer's *range*, with each container having its sequential range
  * `Span` does this currently, where the files are effectively "concatenated"
  * requires binary search to translate into the pair representation

An improvement on all of those is to choose an arbitrary chunk size (e.g. `2^17 = 128kB` for files), and then split each container into a number of chunks (ideally just 1 in the common case).
You can then use bitfields for `(chunk, intra_chunk_index)` (e.g. `15` and `17` bits of `u32`).

The difference is that to translate `chunk` to `container`, we *don't need* to use binary search, because `chunk` is several orders of magnitude smaller than the index space as a whole, and we can use arrays.

That is, `chunk -> container` can be an array, but also, if there is per-container data that would be accessed through `chunk`, we can optimize that by building a `chunk -> Rc<ContainerData>` array.

Translating `intra_chunk_index` to `intra_container_index` is similarly easy, if you can look up per-container data, you can subtract its overall start (if each container is a contiguous range of chunks).

<hr/>

Another reason this might be useful is translating (an unified) `DefId` or `Span` between crates or between incremental (re)compilation sessions - we can have a bitset of changed chunks: if a chunk is unchanged, the index is identical, otherwise we can have an intra-chunk/container binary search for changed ranges (or just a map of changes).

We can grow the number indices within the last chunk of a container, and if we run out of space, we can relocate the container's chunks without a significant cost. Alternatively, another tradeoff we can make is to fragment a container's chunks.

<hr/>

The first step in experimenting with this would have to be take `Span`, and round up the start/end of each file's range to a multiple of a power of `2` (e.g. `2^17` - but an optimal value would require gathering some real-world file-size statistics).
This way we can see if there's a negative performance impact from having unused gaps in the index space, everything else should be an improvement.
We can also try to replace the binary searches to find the `SourceFile` a `Span` is from.

cc @nikomatsakis @michaelwoerister 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Experiment with a hybrid bitfield + range encoding for Span / DefId. #53560

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Experiment with a hybrid bitfield + range encoding for Span / DefId. #53560

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions