Rendering filament pages using NCBI dataset API #157

Open
2 tasks
nekrut opened this issue Nov 5, 2024 · 17 comments

@nekrut
Contributor

nekrut commented Nov 5, 2024

This issue illustrates how the NCBI Datasets API can be used to generate the JSON blobs necessary for rendering filament pages (#130).

Linked Tickets

Data!

The initial set of taxa will be limited to these species: https://docs.google.com/spreadsheets/d/1Gg9sw2Qw765tOx2To53XkTAn-RAMiBtqYrfItlLXXrc/edit?usp=sharing

There is a separate issue on developing a data format for initializing the site: #201

List view

Image

To populate this view we call the NCBI Datasets API to get additional information that is not provided in the initialization JSON dataset:

curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2/taxonomy/dataset_report" \
 -H 'accept: application/json' \
 -H 'content-type: application/json' \
 -d '{"taxons":["Plasmodium falciparum","Plasmodium vivax","Plasmodium yoelii","Plasmodium vinckei","Culex pipiens","Anopheles gambiae","Toxoplasma gondii","Mycobacterium tuberculosis","Coccidioides posadasii","Coccidioides immitis"],"children":false,"ranks":["genus"]}' 

This generates the following response:

Click to see JSON response
{
  "reports": [
    {
      "taxonomy": {
        "tax_id": 7165,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Anopheles gambiae",
          "authority": "Giles, 1902"
        },
        "curator_common_name": "African malaria mosquito",
        "group_name": "mosquitos",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Metazoa",
            "id": 33208
          },
          "phylum": {
            "name": "Arthropoda",
            "id": 6656
          },
          "class": {
            "name": "Insecta",
            "id": 50557
          },
          "order": {
            "name": "Diptera",
            "id": 7147
          },
          "family": {
            "name": "Culicidae",
            "id": 7157
          },
          "genus": {
            "name": "Anopheles",
            "id": 7164
          },
          "species": {
            "name": "Anopheles gambiae",
            "id": 7165
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          33208,
          6072,
          33213,
          33317,
          1206794,
          88770,
          6656,
          197563,
          197562,
          6960,
          50557,
          85512,
          7496,
          33340,
          33392,
          7147,
          7148,
          43786,
          41827,
          7157,
          43816,
          7164,
          44534,
          44537,
          44542
        ],
        "children": [
          180454
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 7
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 15164
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 422
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 615
          },
          {
            "type": "COUNT_TYPE_snRNA",
            "count": 27
          },
          {
            "type": "COUNT_TYPE_snoRNA",
            "count": 11
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 12518
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1209
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Anopheles gambiae"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5501,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Coccidioides immitis",
          "authority": "G.W. Stiles, 1896"
        },
        "group_name": "ascomycete fungi",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Fungi",
            "id": 4751
          },
          "phylum": {
            "name": "Ascomycota",
            "id": 4890
          },
          "class": {
            "name": "Eurotiomycetes",
            "id": 147545
          },
          "order": {
            "name": "Onygenales",
            "id": 33183
          },
          "family": {
            "name": "Onygenaceae",
            "id": 33184
          },
          "genus": {
            "name": "Coccidioides",
            "id": 5500
          },
          "species": {
            "name": "Coccidioides immitis",
            "id": 5501
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          4751,
          451864,
          4890,
          716545,
          147538,
          716546,
          147545,
          451871,
          33183,
          33184,
          5500
        ],
        "children": [
          246410,
          454286,
          404692,
          396776
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 5
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 9974
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 147
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 29
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 9797
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Coccidioides immitis"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 199306,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Coccidioides posadasii",
          "authority": "M.C. Fisher, G.L. Koenig, T.J. White & J.W. Taylor, 2002"
        },
        "group_name": "ascomycete fungi",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Fungi",
            "id": 4751
          },
          "phylum": {
            "name": "Ascomycota",
            "id": 4890
          },
          "class": {
            "name": "Eurotiomycetes",
            "id": 147545
          },
          "order": {
            "name": "Onygenales",
            "id": 33183
          },
          "family": {
            "name": "Onygenaceae",
            "id": 33184
          },
          "genus": {
            "name": "Coccidioides",
            "id": 5500
          },
          "species": {
            "name": "Coccidioides posadasii",
            "id": 199306
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          4751,
          451864,
          4890,
          716545,
          147538,
          716546,
          147545,
          451871,
          33183,
          33184,
          5500
        ],
        "children": [
          443226,
          469471
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 13
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 8510
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 163
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 2
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 8342
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Coccidioides posadasii"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 7175,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Culex pipiens",
          "authority": "Linnaeus, 1758"
        },
        "curator_common_name": "northern house mosquito",
        "group_name": "mosquitos",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Metazoa",
            "id": 33208
          },
          "phylum": {
            "name": "Arthropoda",
            "id": 6656
          },
          "class": {
            "name": "Insecta",
            "id": 50557
          },
          "order": {
            "name": "Diptera",
            "id": 7147
          },
          "family": {
            "name": "Culicidae",
            "id": 7157
          },
          "genus": {
            "name": "Culex",
            "id": 7174
          },
          "species": {
            "name": "Culex pipiens",
            "id": 7175
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          33208,
          6072,
          33213,
          33317,
          1206794,
          88770,
          6656,
          197563,
          197562,
          6960,
          50557,
          85512,
          7496,
          33340,
          33392,
          7147,
          7148,
          43786,
          41827,
          7157,
          43817,
          53550,
          7174,
          53527,
          518105
        ],
        "children": [
          1833972,
          38569,
          42434,
          233155
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 5
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 19673
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 686
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 155
          },
          {
            "type": "COUNT_TYPE_snRNA",
            "count": 58
          },
          {
            "type": "COUNT_TYPE_snoRNA",
            "count": 9
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 16298
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1620
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Culex pipiens"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 1773,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Mycobacterium tuberculosis",
          "authority": "(Zopf 1883) Lehmann and Neumann 1896 (Approved Lists 1980)",
          "basionym": {
            "name": "\"Bacterium tuberculosis\"",
            "authority": "Zopf 1883",
            "notes": [
              {
                "name": "Effective Name",
                "note": "This is an effectively published name.",
                "note_classifier": "effective_name"
              }
            ]
          }
        },
        "group_name": "high G+C Gram-positive bacteria",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Bacteria",
            "id": 2
          },
          "kingdom": {
            "name": "Bacillati",
            "id": 1783272
          },
          "phylum": {
            "name": "Actinomycetota",
            "id": 201174
          },
          "class": {
            "name": "Actinomycetes",
            "id": 1760
          },
          "order": {
            "name": "Mycobacteriales",
            "id": 85007
          },
          "family": {
            "name": "Mycobacteriaceae",
            "id": 1762
          },
          "genus": {
            "name": "Mycobacterium",
            "id": 1763
          },
          "species": {
            "name": "Mycobacterium tuberculosis",
            "id": 1773
          }
        },
        "parents": [
          1,
          131567,
          2,
          1783272,
          201174,
          1760,
          85007,
          1762,
          1763,
          77643
        ],
        "children": [
          1427330,
          1427329
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 7819
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 4008
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 45
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 3
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 3906
          },
          {
            "type": "COUNT_TYPE_miscRNA",
            "count": 2
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 20
          },
          {
            "type": "COUNT_TYPE_OTHER",
            "count": 2
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Mycobacterium tuberculosis"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5833,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium falciparum"
        },
        "curator_common_name": "malaria parasite P. falciparum",
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium falciparum",
            "id": 5833
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418107
        ],
        "children": [
          478864,
          1036723
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 67
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5618
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 45
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 28
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5285
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 102
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium falciparum"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5860,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium vinckei",
          "authority": "(Rodhain, 1952)"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium vinckei",
            "id": 5860
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418101
        ],
        "children": [
          54757,
          138298,
          138297,
          119398
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 10
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5147
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 67
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 11
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5050
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium vinckei"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5855,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium vivax",
          "authority": "(Grassi & Feletti, 1890)"
        },
        "curator_common_name": "malaria parasite P. vivax",
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium vivax",
            "id": 5855
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418103
        ],
        "children": [
          31273,
          126793,
          1035514,
          1035515,
          882766,
          1077284,
          1033975
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 19
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5513
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 44
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 22
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5395
          },
          {
            "type": "COUNT_TYPE_miscRNA",
            "count": 10
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium vivax"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5861,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium yoelii"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium yoelii",
            "id": 5861
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418101
        ],
        "children": [
          73239,
          1050261,
          31274,
          283801,
          1323249,
          1050262
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 15
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 6233
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 52
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 39
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 6037
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 47
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium yoelii"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5811,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Toxoplasma gondii"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Conoidasida",
            "id": 1280412
          },
          "order": {
            "name": "Eucoccidiorida",
            "id": 75739
          },
          "family": {
            "name": "Sarcocystidae",
            "id": 5809
          },
          "genus": {
            "name": "Toxoplasma",
            "id": 5810
          },
          "species": {
            "name": "Toxoplasma gondii",
            "id": 5811
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          1280412,
          5796,
          75739,
          423054,
          5809,
          5810
        ],
        "children": [
          933077,
          398031
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 29
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 8925
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 183
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 424
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 8318
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Toxoplasma gondii"
      ]
    }
  ],
  "total_count": 10
}

From this response we would like to render the following fields on a page (only two rows are shown):

| [ ] | Taxon | TaxId | # Assemblies | Tags |
|-----|-------|-------|--------------|------|
| [ ] | Anopheles gambiae | 7165 | 7 | Vector |
| [ ] | Coccidioides immitis | 5501 | 5 | Fungi |

These are populated from the JSON response:

  • Taxon = (reports -> taxonomy -> current_scientific_name -> name)
  • TaxId = (reports -> taxonomy -> tax_id)
  • # Assemblies = (reports -> taxonomy -> counts[0] -> count, i.e. the entry whose type is COUNT_TYPE_ASSEMBLY)
  • Tags = custom, added by us
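
The mapping above can be sketched in Python as follows. This is a minimal sketch, not the project's implementation: the function names and the `tags_by_tax_id` lookup are hypothetical, and the request body simply mirrors the curl call shown earlier. The assembly count is looked up by its `type` rather than by position, on the assumption that the order of `counts` is not guaranteed.

```python
import json
import urllib.request

TAXONOMY_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/taxonomy/dataset_report"


def fetch_taxonomy_reports(taxa):
    """POST the taxon names to the endpoint used in the curl call above."""
    payload = json.dumps({"taxons": taxa, "children": False, "ranks": ["genus"]})
    request = urllib.request.Request(
        TAXONOMY_URL,
        data=payload.encode("utf-8"),
        headers={"accept": "application/json", "content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def assembly_count(taxonomy):
    """Find the assembly count by type instead of relying on counts[0]."""
    for entry in taxonomy.get("counts", []):
        if entry.get("type") == "COUNT_TYPE_ASSEMBLY":
            return entry.get("count", 0)
    return 0


def list_view_rows(response, tags_by_tax_id=None):
    """Turn a taxonomy dataset_report response into list-view rows.

    tags_by_tax_id is a hypothetical lookup for the custom tags,
    e.g. {7165: ["Vector"]}; it is not part of the NCBI response.
    """
    tags_by_tax_id = tags_by_tax_id or {}
    rows = []
    for report in response.get("reports", []):
        taxonomy = report["taxonomy"]
        rows.append({
            "taxon": taxonomy["current_scientific_name"]["name"],
            "tax_id": taxonomy["tax_id"],
            "assemblies": assembly_count(taxonomy),
            "tags": tags_by_tax_id.get(taxonomy["tax_id"], []),
        })
    return rows
```

Calling `list_view_rows(fetch_taxonomy_reports(["Anopheles gambiae"]), {7165: ["Vector"]})` would produce one row per report in the response.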

Genomes page

image

Now suppose that on the previous page a user ticked both the Anopheles gambiae and Coccidioides immitis checkboxes and clicked the "Go to Genomes" button.

This is equivalent to sending the following GET request:

https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/7165%2C5501/dataset_report?filters.assembly_source=refseq&filters.has_annotation=true&filters.exclude_paired_reports=true&filters.exclude_atypical=true&filters.assembly_level=scaffold&filters.assembly_level=chromosome&filters.assembly_level=complete_genome

The response will be rendered as the following genome page:

| [ ] | Taxon | TaxId | Accession | IsRef | Level | # Chr | Len | # Scaffolds | Scaffold N50 | Scaffold L50 | Coverage | GC% | Ann Status |
|-----|-------|-------|-----------|-------|-------|-------|-----|-------------|--------------|--------------|----------|-----|------------|
| [ ] | Anopheles gambiae | 7165 | GCF_943734735.2 | Yes | Chromosome | 3 | 264451381 | 190 | 99149756 | 2 | 54.0x | 44.5 | Full annotation |

  • Taxon = organism -> organism_name
  • TaxId = organism -> tax_id
  • Accession = accession
  • IsRef = assembly_info -> refseq_category
  • Level = assembly_info -> assembly_level
  • # Chr = assembly_stats -> total_number_of_chromosomes
  • Len = assembly_stats -> total_sequence_length
  • # Scaffolds = assembly_stats -> number_of_scaffolds
  • Scaffold N50 = assembly_stats -> scaffold_n50
  • Scaffold L50 = assembly_stats -> scaffold_l50
  • GC% = assembly_stats -> gc_percent
  • Ann Status = annotation_info -> status
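
As a rough illustration of the two steps above, the following Python sketch builds the GET URL from the selected tax IDs and flattens one report into a table row. The function names and row keys are hypothetical. The sketch passes `refseq_category` through unchanged, since how it should be turned into the Yes/No shown in the IsRef column is not specified here, and it omits Coverage because no source field is listed for that column.

```python
from urllib.parse import quote, urlencode

GENOME_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon"

# Filters copied from the GET request above; assembly_level is repeated
# once per accepted level.
GENOME_FILTERS = [
    ("filters.assembly_source", "refseq"),
    ("filters.has_annotation", "true"),
    ("filters.exclude_paired_reports", "true"),
    ("filters.exclude_atypical", "true"),
    ("filters.assembly_level", "scaffold"),
    ("filters.assembly_level", "chromosome"),
    ("filters.assembly_level", "complete_genome"),
]


def genome_report_url(tax_ids):
    """Build the dataset_report URL for a list of selected tax IDs."""
    # safe="" makes quote() encode the comma as %2C, as in the URL above.
    taxa = quote(",".join(str(tax_id) for tax_id in tax_ids), safe="")
    return f"{GENOME_BASE}/{taxa}/dataset_report?{urlencode(GENOME_FILTERS)}"


def genome_row(report):
    """Flatten one entry of the genome report into the columns listed above."""
    organism = report.get("organism", {})
    info = report.get("assembly_info", {})
    stats = report.get("assembly_stats", {})
    annotation = report.get("annotation_info", {})
    return {
        "taxon": organism.get("organism_name"),
        "tax_id": organism.get("tax_id"),
        "accession": report.get("accession"),
        "is_ref": info.get("refseq_category"),
        "level": info.get("assembly_level"),
        "chromosomes": stats.get("total_number_of_chromosomes"),
        "length": stats.get("total_sequence_length"),
        "scaffolds": stats.get("number_of_scaffolds"),
        "scaffold_n50": stats.get("scaffold_n50"),
        "scaffold_l50": stats.get("scaffold_l50"),
        "gc_percent": stats.get("gc_percent"),
        "annotation_status": annotation.get("status"),
    }
```

Using `.get()` throughout means a report that lacks, say, `annotation_info` yields `None` in that column instead of raising an error.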
nekrut converted this from a draft issue on Nov 5, 2024
nekrut changed the title from "Rendering filament pages using NCBI dataset commands" to "Rendering filament pages using NCBI dataset API" on Nov 5, 2024
@NoopDog
Collaborator

NoopDog commented Nov 6, 2024

Ok thx @nekrut we will start on this and collect the tables from NCBI...

@NoopDog
Collaborator

NoopDog commented Nov 6, 2024

Also link to UCSC genome browser in the genome file.

nekrut moved this to In Progress in BRC development tasks on Nov 14, 2024
@d-callan
Collaborator

Sry, not sure this is the right place for this comment.. but were it me I'd seriously consider adding some kinetoplastids to that list of initial taxa.

@d-callan
Collaborator

T. cruzi
T. brucei
Leish major
Leish donovani
Leish braziliensis

Those are the ones coming to me off the top of my head, though I feel like that's maybe missing a big leish species or two. I might not have the spelling quite right either.. it'd give you Chagas, African sleeping sickness and iirc all three forms of leish though I need to double check that. Considering the popularity of tritrypdb and the impact of these diseases, these species would be a very notable omission.

Also, pretty sure we now have a few locally acquired cases of mucosal leish in Texas, as the sandfly habitat expands, so there's 'local' relevance.. thanks global warming

@nekrut
Contributor Author

nekrut commented Nov 15, 2024

@hunterckx
Collaborator

@nekrut Question -- how can we map the genomes returned by NCBI to the UCSC Browser URLs specified in assemblyList.json? Previously we matched Genome Version/Assembly ID from this genomes spreadsheet with either genBank or refSeq from the assembly list, but I'm not familiar enough with what the fields mean to determine which ID(s) from the NCBI API would be necessary to match with the ones in the assembly list.

Thanks!

@nekrut
Contributor Author

nekrut commented Nov 16, 2024

> @nekrut Question -- how can we map the genomes returned by NCBI to the UCSC Browser URLs specified in assemblyList.json? Previously we matched Genome Version/Assembly ID from this genomes spreadsheet with either genBank or refSeq from the assembly list, but I'm not familiar enough with what the fields mean to determine which ID(s) from the NCBI API would be necessary to match with the ones in the assembly list.
>
> Thanks!

Good point. They need to be built first. I will initiate the process over the weekend. This can happen very quickly, but for now let's not link them to UCSC yet.

@NoopDog
Collaborator

NoopDog commented Dec 5, 2024

@hunterckx, can you use the accession field from the NCBI response, e.g., "accession": "GCF_943734735.2", and provide a report on any that do not match either GenBank or RefSeq in the assemblyList.json?

Cheers,
D

@NoopDog
Collaborator

NoopDog commented Dec 5, 2024

Also, @hunterckx, please re-import the species list so we can get the latest updates. Thanks!

@NoopDog
Collaborator

NoopDog commented Dec 5, 2024

@hunterckx, the "Search all filters" option on the Genomes page throws an error. Can you please fix it? Thanks!

@hunterckx
Collaborator

> @hunterckx, can you use the accession field from the NCBI response, e.g., "accession": "GCF_943734735.2", and provide a report on any that do not match either GenBank or RefSeq in the assemblyList.json?
>
> Cheers, D

I've set it up to report separately on matches between pairedAccession, accession, genBank, and refSeq, since the matching here is only done between pairedAccession and genBank, and accession and refSeq. Here's what it reports (parentheses around column pairs that are not currently used for matching):

3 values from pairedAccession absent in genBank: GCA_000277735.2, GCA_030566675.1, GCA_963525475.1

(20 values from pairedAccession absent in refSeq: GCA_000195955.2, GCA_000002765.3, GCA_000002725.2, GCA_900002385.2, GCA_018416015.2, GCA_900681995.1, GCA_000227135.2, GCA_000006565.2, GCA_000002445.1, GCA_943734735.2, GCA_000002415.2, GCA_016801865.2, GCA_000002845.2, GCA_000209065.1, GCA_000149335.2, GCA_000277735.2, GCA_009858895.3, GCA_000857045.1, GCA_030566675.1, GCA_963525475.1)

(20 values from accession absent in genBank: GCF_000195955.2, GCF_000002765.6, GCF_000002725.2, GCF_900002385.2, GCF_018416015.2, GCF_900681995.1, GCF_000227135.1, GCF_000006565.2, GCF_000002445.2, GCF_943734735.2, GCF_000002415.2, GCF_016801865.2, GCF_000002845.2, GCF_000209065.1, GCF_000149335.2, GCF_000277735.2, GCF_009858895.2, GCF_000857045.1, GCF_030566675.1, GCF_963525475.1)

3 values from accession absent in refSeq: GCF_000277735.2, GCF_030566675.1, GCF_963525475.1

Looks like the ones missing for the used column pairs are also missing for the unused column pairs (i.e. we're not missing anything extra by not using those pairs).
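
The kind of report described above could be produced along these lines. This is a minimal sketch under stated assumptions, not the code on the branch: it assumes assemblyList.json parses to a list of objects with `genBank` and `refSeq` keys, and that each NCBI row has already been flattened to `accession` and `pairedAccession`. The function names are hypothetical.

```python
def unmatched(values, known):
    """Return the values missing from `known`, keeping input order."""
    known = set(known)
    return [value for value in values if value not in known]


def match_report(ncbi_rows, assembly_list):
    """Report accessions with no counterpart in the assembly list.

    Only the two column pairs used for matching are checked:
    accession vs refSeq and pairedAccession vs genBank.
    """
    ref_seq = [entry.get("refSeq") for entry in assembly_list]
    gen_bank = [entry.get("genBank") for entry in assembly_list]
    accessions = [row["accession"] for row in ncbi_rows]
    paired = [row["pairedAccession"] for row in ncbi_rows if row.get("pairedAccession")]
    return {
        "accession absent in refSeq": unmatched(accessions, ref_seq),
        "pairedAccession absent in genBank": unmatched(paired, gen_bank),
    }
```

Each value in the returned dictionary is a list, so its length gives the count at the start of each report line.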

@hunterckx
Collaborator

Updated output after switching to the algorithm initially proposed in #194:

3 accessions had no match in assembly list: GCF_000277735.2, GCF_030566675.1, GCF_963525475.1

(The same as above when just matching with refSeq)

I'll also note that this appears to have led to one UCSC Browser URL being left out, but that may be what we want if it was an erroneous match.

@NoopDog
Collaborator

NoopDog commented Dec 11, 2024

@nekrut now that we are using the spreadsheet to identify the curated list of assemblies to include, I suppose we could still call the taxon API and filter out all but the IDs that are given in the spreadsheet.

curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2/taxonomy/dataset_report" \
 -H 'accept: application/json' \
 -H 'content-type: application/json' \
 -d '{"taxons":["Plasmodium falciparum","Plasmodium vivax","Plasmodium yoelii","Plasmodium vinckei","Culex pipiens","Anopheles gambiae","Toxoplasma gondii","Mycobacterium tuberculosis","Coccidioides posadasii","Coccidioides immitis"],"children":false,"ranks":["genus"]}' 

Or is there a different API we should use to look up the genome by ID?
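
The filtering idea could look roughly like this in Python. It is only a sketch: it assumes the IDs in the spreadsheet are NCBI tax IDs (they may be read in as strings) and that `reports` is the `reports` array from the taxonomy response. The function name is hypothetical.

```python
def keep_curated(reports, curated_tax_ids):
    """Keep only the taxonomy reports whose tax_id is in the curated list."""
    # Spreadsheet cells may arrive as strings, so normalise to int first.
    curated = {int(tax_id) for tax_id in curated_tax_ids}
    return [report for report in reports if report["taxonomy"]["tax_id"] in curated]
```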

@NoopDog
Collaborator

NoopDog commented Dec 11, 2024

@hunterckx

The assemblies

GCF_000277735.2_ASM27773v2
GCF_963525475.1_MtbRf
GCF_030566675.1_ASM3056667v1

have now been added to the UCSC system. Can you check whether we now match on the three above, or whether the _ASM27773v2 etc. suffixes are causing a mismatch?
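
If the suffixes do turn out to matter, one way to compare the two forms is to split the UCSC-style identifier into its accession and assembly-name parts before matching. A minimal sketch, assuming identifiers follow the `GCF_<nine digits>.<version>_<name>` shape seen in the three examples above (the function name and the regular expression are illustrative, not taken from the codebase):

```python
import re

# GCA_/GCF_ prefix, nine digits, a version number, then an optional _<assembly name>.
ACCESSION_RE = re.compile(r"^(GC[AF]_\d{9}\.\d+)(?:_(.+))?$")


def split_assembly_id(identifier):
    """Return (accession, assembly_name); assembly_name is None if absent."""
    match = ACCESSION_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a GCA/GCF identifier: {identifier!r}")
    return match.group(1), match.group(2)
```

For example, `split_assembly_id("GCF_000277735.2_ASM27773v2")` yields the bare accession `GCF_000277735.2` plus the name `ASM27773v2`, so the accession part can be compared directly with the `accession` field from NCBI.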

@hunterckx
Collaborator

@NoopDog Seems like everything matches up now -- I've pushed the latest to the branch for #178!

@Smeds
Collaborator

Smeds commented Dec 11, 2024

@NoopDog Hi! I'm currently working on migrating the GenomeArk project, with the goal of displaying it using BRC. I've written a script to generate a JSON file containing genome-related data, and I've been experimenting with the BRC code to display this information. The displayed columns will largely remain the same, though some modifications might be necessary. Would you prefer we continue the discussion in this task or create a new one?

@NoopDog
Collaborator

NoopDog commented Dec 12, 2024 via email
