vcf_zarr_spec.md: 14 additions & 10 deletions
@@ -5,7 +5,7 @@
 This document is a technical specification for VCF Zarr, a means of encoding VCF data in chunked-columnar form using the Zarr format.

 This specification depends on definitions and terminology from [The Variant Call Format Specification, VCFv4.3 and BCFv2.2](https://samtools.github.io/hts-specs/VCFv4.3.pdf),
-and [Zarr storage specification version 2](https://zarr.readthedocs.io/en/stable/spec/v2.html).
+and [Zarr storage specification version 2](https://zarr.readthedocs.io/en/stable/spec/v2.html) or [Zarr core specification version 3](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html)[experimental].

 ## Compatibility with VCF and BCF

@@ -42,17 +42,19 @@ Each VCF field is stored in a separate Zarr array. This specification only manda

 This document uses a shorthand notation to refer to Zarr data types (dtypes). The following table shows the mapping to VCF types.
+|`float`|`<f4`, `<f8` or `>f4`, `>f8`|`float32`, `float64`|Float |
+|`char`|`\<U1` or `\>U1`|`string`|Character |
+|`str`|`\|O`|`string`|String |

 This specification does not mandate a byte order for numeric types: little-endian (e.g. `<i4`) or big-endian (`>i4`) are both permitted.

-The `str` dtype is used to represent [variable-length strings](https://zarr.readthedocs.io/en/stable/tutorial.html#string-arrays). In this case a Zarr array filter with an `id` of `vlen-utf8` must be specified for the array.
+*[Zarr 2 only]* The `str` dtype is used to represent [variable-length strings](https://zarr.readthedocs.io/en/stable/tutorial.html#string-arrays). In this case a Zarr array filter with an `id` of `vlen-utf8` must be specified for the array.
+
+*[Zarr 3 only]* The `str` dtype is used to represent [variable-length strings](https://github.com/zarr-developers/zarr-extensions/tree/main/data-types/string). In this case a Zarr array [`vlen-utf8`](https://github.com/zarr-developers/zarr-extensions/blob/main/codecs/vlen-utf8/README.md) codec must be specified for the array.

 ### Missing and fill values

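To make the two `str` rules in the hunk above concrete, here is a minimal sketch that writes out the array metadata by hand using only the Python standard library. The shape, chunk size, empty-string fill value and the helper names are assumptions chosen for illustration; the document layouts follow the Zarr v2 `.zarray` and Zarr v3 `zarr.json` conventions as understood here, so this is an illustration rather than normative metadata.

```python
import json

# Hypothetical one-dimensional string field; shape and chunks are made up.
SHAPE = [1000]
CHUNKS = [100]


def str_array_metadata_v2():
    """Sketch of a Zarr v2 `.zarray` document for a `str` array."""
    return {
        "zarr_format": 2,
        "shape": SHAPE,
        "chunks": CHUNKS,
        "dtype": "|O",  # the `str` shorthand maps to the object dtype
        "fill_value": "",  # assumed fill value, not taken from the hunk
        "order": "C",
        "compressor": None,
        "filters": [{"id": "vlen-utf8"}],  # filter required for `str`
    }


def str_array_metadata_v3():
    """Sketch of a Zarr v3 `zarr.json` document for a `str` array."""
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": SHAPE,
        "data_type": "string",  # Zarr 3 data type listed in the table
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": CHUNKS},
        },
        "chunk_key_encoding": {
            "name": "default",
            "configuration": {"separator": "/"},
        },
        "fill_value": "",  # assumed fill value, not taken from the hunk
        "codecs": [{"name": "vlen-utf8"}],  # codec required for `str`
        "attributes": {},
    }


if __name__ == "__main__":
    print(json.dumps(str_array_metadata_v2(), indent=2))
    print(json.dumps(str_array_metadata_v3(), indent=2))
```

The practical difference is where the string encoding is declared: as an entry in `filters` keyed by `id` for Zarr 2, and as an entry in `codecs` keyed by `name` for Zarr 3.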
@@ -73,7 +75,9 @@ An array called `<name>` may have an accompanying array called `<name>_mask` wit

 ### Array dimension names

-Following [Xarray conventions](http://xarray.pydata.org/en/stable/internals/zarr-encoding-spec.html), each Zarr array has an attribute `_ARRAY_DIMENSIONS`, which is a list of strings naming the dimensions.
+*[Zarr 2 only]* Following [Xarray conventions](http://xarray.pydata.org/en/stable/internals/zarr-encoding-spec.html), each Zarr array has an attribute `_ARRAY_DIMENSIONS`, which is a list of strings naming the dimensions.
+
+*[Zarr 3 only]* The Zarr array metadata must include `dimension_names`, which is a list of strings naming the dimensions.

 The reserved dimension names and their sizes are listed in the following table, along with the corresponding VCF Number value, if applicable.
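As a rough sketch of how a reader might handle both conventions from the hunk above, the helper below looks up dimension names from already-parsed metadata. The function name `dimension_names` and the example dimension names are hypothetical; the only details taken from the text are the `_ARRAY_DIMENSIONS` attribute for Zarr 2 and the `dimension_names` metadata key for Zarr 3.

```python
def dimension_names(array_metadata, attributes=None):
    """Return the list of dimension names for a Zarr 2 or Zarr 3 array.

    array_metadata: parsed `.zarray` (Zarr 2) or `zarr.json` (Zarr 3) document.
    attributes: parsed `.zattrs` document, only consulted for Zarr 2.
    """
    if array_metadata.get("zarr_format") == 3:
        # Zarr 3: the names live in the array metadata itself.
        return list(array_metadata["dimension_names"])
    # Zarr 2: the names live in a user attribute (Xarray convention).
    return list((attributes or {})["_ARRAY_DIMENSIONS"])


# Illustrative dimension names for a two-dimensional field (assumed values).
v2_metadata = {"zarr_format": 2, "shape": [10, 3]}
v2_attributes = {"_ARRAY_DIMENSIONS": ["variants", "samples"]}
v3_metadata = {
    "zarr_format": 3,
    "shape": [10, 3],
    "dimension_names": ["variants", "samples"],
}

names_v2 = dimension_names(v2_metadata, v2_attributes)
names_v3 = dimension_names(v3_metadata)
```

A missing key raises `KeyError` in either branch, which matches the text's requirement that the names be present.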