feat: add in-depth demo for htsget-rs #8
content/post/2025-03-17-htsget-rs-in-depth/2025-03-17-htsget-rs-in-depth.md
---
title: "htsget-rs in depth"
authors:
- marko-malenic
- roman-valls-guimera
date: "2025-03-17"
slug: htsget-rs-in-depth
layout: post
categories:
- rust
- bioinformatics
- htsget
tags:
- rust
- ga4gh
- bioinformatics
summary: "In depth htsget-rs and the htsget protocol"
---
# The details of the htsget protocol using htsget-rs and Crypt4GH

Following on from the [first] blog post about htsget-rs and Crypt4GH, this post goes into further detail about how htsget works and illustrates more complex use cases.

Let's start by querying an example file. The deployed GA4GH htsget instance has access to [example files][example-files] from the htsget-rs
repository. Recall from the first htsget-rs blog post that the reads endpoint can serve BAM files, for example:

```sh
curl "https://htsget.ga4gh-demo.org/reads/htsnexus_test_NA12878"
```

This returns a set of URL "tickets" inside the "urls" field of the JSON response. These tickets contain URLs that
should be fetched and concatenated, in order, to produce the response. Additionally, each ticket can have a "headers"
field that contains HTTP headers that should be included when requesting its URL. Take a look at the [htsget] spec for
more details.

To simplify fetching and concatenating URL tickets, use a htsget client, such as the [GA4GH client][client].

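As a rough illustration of what such a client does, the following Python sketch resolves a ticket by fetching each URL
with its headers and concatenating the bytes in order. The `resolve_ticket` and `http_fetch` names and the injectable
`fetch` parameter are assumptions made for this example and are not part of any htsget client API. Inline `data:` URIs,
which the spec also allows, are left out for brevity.

```python
import urllib.request
from typing import Callable, Mapping

# Hypothetical helper type: takes a URL and headers, returns the response body.
Fetch = Callable[[str, Mapping[str, str]], bytes]


def http_fetch(url: str, headers: Mapping[str, str]) -> bytes:
    """Fetch one ticket URL, sending the headers the ticket asks for."""
    request = urllib.request.Request(url, headers=dict(headers))
    with urllib.request.urlopen(request) as response:
        return response.read()


def resolve_ticket(ticket: dict, fetch: Fetch = http_fetch) -> bytes:
    """Fetch every entry in "urls" in order and concatenate the bytes."""
    parts = []
    for entry in ticket["htsget"]["urls"]:
        # "headers" is optional, so fall back to an empty mapping.
        parts.append(fetch(entry["url"], entry.get("headers", {})))
    return b"".join(parts)
```

Passing `fetch` as a parameter keeps the concatenation logic separate from the HTTP layer, so it can be exercised
without network access.
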
## Querying the header

As a simple example, query the header of the example file by passing `class=header`:

```sh
htsget "https://htsget.ga4gh-demo.org/reads/htsnexus_test_NA12878?class=header" > out.bam
```

Internally, this yields a JSON response with a URL that can be fetched along with a "Range" header:

```json
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {
        "url": "...",
        "headers": {
          "Range": "bytes=0-4667"
        },
        "class": "header"
      }
    ]
  }
}
```

The client takes care of fetching the URLs and concatenating the bytes.

A strength of the htsget protocol is that the output represents a small part of the full file, allowing the user to
query specific regions of a file without needing to obtain the entire file.

In this case, the output represents the BAM header of the file:

```sh
samtools view -H out.bam
```

Note that "..." inside the JSON example responses represents some data or a URL. This will be different when executing
the query.

## Querying reference names with start and end ranges

A more interesting query would involve selecting a specific region, for example chromosome 11. This can be accomplished
by using the `referenceName` parameter. Viewing the output will show data for that specific region:

```sh
htsget "https://htsget.ga4gh-demo.org/reads/htsnexus_test_NA12878?referenceName=11" | samtools view
```

Similarly, the query can be refined further by specifying `start` and `end` positions, so that only data for that
range is returned:

```sh
htsget "https://htsget.ga4gh-demo.org/reads/htsnexus_test_NA12878?referenceName=11&start=500000&end=5001000" | samtools view
```

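When queries like these are generated from a script, the URL can be assembled rather than typed by hand. The sketch
below is a minimal example using Python's standard library; the `reads_url` helper and `BASE_URL` constant are
hypothetical names introduced here. In the htsget spec, `start` is 0-based inclusive and `end` is 0-based exclusive.

```python
from urllib.parse import urlencode

# Deployed instance used throughout this post.
BASE_URL = "https://htsget.ga4gh-demo.org"


def reads_url(file_id, reference_name=None, start=None, end=None):
    """Build a reads query URL, leaving out parameters that are not set."""
    params = {"referenceName": reference_name, "start": start, "end": end}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/reads/{file_id}"
    return f"{url}?{query}" if query else url


# Reproduces the region query shown above.
print(reads_url("htsnexus_test_NA12878", "11", 500000, 5001000))
```
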
Internally, the output from htsget-rs will contain multiple URL tickets that represent the specific data queried:

```json
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {
        "url": "...",
        "headers": {
          "Range": "bytes=0-273085"
        }
      },
      {
        "url": "...",
        "headers": {
          "Range": "bytes=499249-574358"
        }
      },
      {
        "url": "...",
        "headers": {
          "Range": "bytes=627987-647345"
        }
      },
      {
        "url": "...",
        "headers": {
          "Range": "bytes=824361-842100"
        }
      },
      {
        "url": "...",
        "headers": {
          "Range": "bytes=977196-996014"
        }
      },
      {
        "url": "...",
        "headers": {
          "Range": "bytes=2596771-2596798"
        }
      }
    ]
  }
}
```

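To get a feel for how much data such a ticket transfers, the byte ranges can be added up. The sketch below is only an
illustration: `range_length` is a hypothetical helper, and it assumes every header has the simple `bytes=first-last`
form seen in the response above. HTTP byte ranges are inclusive at both ends, hence the `+ 1`.

```python
import re

# "Range" header values copied from the ticket above.
RANGES = [
    "bytes=0-273085",
    "bytes=499249-574358",
    "bytes=627987-647345",
    "bytes=824361-842100",
    "bytes=977196-996014",
    "bytes=2596771-2596798",
]


def range_length(value: str) -> int:
    """Return the number of bytes covered by a 'bytes=first-last' range."""
    match = re.fullmatch(r"bytes=(\d+)-(\d+)", value)
    if match is None:
        raise ValueError(f"unsupported Range value: {value!r}")
    first, last = int(match.group(1)), int(match.group(2))
    return last - first + 1


total = sum(range_length(value) for value in RANGES)
print(total)  # 404142
```

The last range ends at byte 2596798, so the file is at least 2596799 bytes long, while the ticket asks for about 404 KB
of it. The final range is 28 bytes, which matches the size of a BGZF end-of-file marker block.
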
## Querying Crypt4GH files

Moving on to a more complex example, we will now incorporate querying [Crypt4GH][c4gh] encrypted files from htsget-rs.
To decrypt Crypt4GH files, install the Crypt4GH [CLI][c4gh-cli] and get the [keys] from the htsget-rs repository.

Then, query like before, except add the `encryptionScheme` parameter:

```sh
curl "https://htsget.ga4gh-demo.org/reads/htsnexus_test_NA12878?class=header&encryptionScheme=C4GH"
```

This returns a JSON ticket whose URLs, once fetched and concatenated, produce Crypt4GH-encrypted data. Here, some of the
URLs are data URIs with a base64-encoded payload. These represent data inlined in the JSON ticket, and only need to be
decoded to obtain the bytes. They follow the same semantics as the other URLs and should be concatenated in order after
decoding.

```json
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {
        "url": "data:;base64,..."
      },
      {
        "url": "...",
        "headers": {
          "Range": "bytes=16-123"
        }
      },
      {
        "url": "data:;base64,..."
      },
      {
        "url": "...",
        "headers": {
          "Range": "bytes=124-65687"
        }
      }
    ]
  }
}
```

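Decoding one of these inline URLs takes only a few lines. The following sketch assumes the URL has the
`data:...;base64,` shape shown above; `decode_data_uri` is a hypothetical helper, and the payload in the example is made
up for illustration, since the real ticket data is elided in this post.

```python
import base64


def decode_data_uri(url: str) -> bytes:
    """Return the bytes carried inline by a base64 `data:` URI."""
    prefix, _, payload = url.partition(",")
    if not prefix.startswith("data:") or not prefix.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


# Hypothetical payload for illustration only; real tickets carry Crypt4GH bytes.
example = "data:;base64," + base64.b64encode(b"inline bytes").decode("ascii")
print(decode_data_uri(example))  # b'inline bytes'
```
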
Putting it all together, and using the keys from the htsget-rs repo, the data can be accessed by running:

```sh
htsget "https://htsget.ga4gh-demo.org/reads/htsnexus_test_NA12878?class=header&encryptionScheme=C4GH" | crypt4gh decrypt --sk bob.sec | samtools view -H
```

## Running the htsget-compliance test suite

As an extra section, the htsget protocol has a compliance suite that can be run against htsget-rs. It contains tests
that check that htsget-rs behaves as expected.

To run the compliance tests, follow the installation instructions in the [htsget-compliance] repository and
then run the following against the deployed htsget-rs instance:

```sh
htsget-compliance https://htsget.ga4gh-demo.org | jq '.["summary"]'
```

[first]: https://umccr.org/blog/htsget-rs-crypt4gh/
[example-files]: https://github.com/umccr/htsget-rs/tree/main/data
[client]: https://htsget.readthedocs.io/en/latest/quickstart.html#installation
[htsget]: https://samtools.github.io/hts-specs/htsget.html
[c4gh]: https://samtools.github.io/hts-specs/crypt4gh.pdf
[c4gh-cli]: https://github.com/EGA-archive/crypt4gh-rust
[keys]: https://github.com/umccr/htsget-rs/tree/main/data/c4gh/keys
[htsget-compliance]: https://github.com/ga4gh/htsget-compliance?tab=readme-ov-file#installation
Remind readers that this is an experimental feature, and reference the spec PR: samtools/hts-specs#808