missing sample at NCBI #2

Open
sformel-usgs opened this issue Jul 14, 2023 · 1 comment

Comments

@sformel-usgs
Collaborator

@McAllister-NOAA It looks like the sample_metadata files include sample E272_2B_NO20 but NCBI does not. There are also some mismatches where NCBI has "1B" as the middle part of the sample name and sample_metadata has "2B".

Here is how I checked:

# Compare sample names in sample_metadata to NCBI

library(xml2)
library(dplyr)

# Load data

# NCBI sample names; XML downloaded by hand from NCBI.
NCBI <- read_xml(x = "documentation/PRJNA982176_biosample_result.xml") %>%
  xml_find_all(xpath = "//Id[@db_label='Sample name']") %>%
  xml_text() %>%
  sort()

sample_metadata <- read.table(file = "data/sample_metadata/sample_metadata_16S.txt",
                              sep = "\t",
                              header = TRUE) %>%
  pull(Sample) %>%
  sort()

# Compare sample names

NCBI
sample_metadata

# Indices of NCBI names that do not appear in sample_metadata
which(!NCBI %in% sample_metadata)

# Many of the mismatches differ only in the middle string (1B vs. 2B), so normalize 1B to 2B in both vectors and compare again

NCBI <- sub(pattern = "1B", replacement = "2B", x = NCBI)
sample_metadata <- sub(pattern = "1B", replacement = "2B", x = sample_metadata)

# After normalizing, everything in NCBI is in sample_metadata
NCBI[!NCBI %in% sample_metadata] %>% sort()

# One sample in sample_metadata is missing from NCBI
sample_metadata[!sample_metadata %in% NCBI] %>% sort()
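
The same two-way check can also be sketched with base R's setdiff(). This is only an illustration: the sample names below are made up (they are not the project's data), the `normalize()` helper is hypothetical, and it assumes the replicate label is always delimited by underscores, as in E272_2B_NO20.

```r
# Hypothetical sample names, for illustration only (not the real project data)
ncbi_names     <- c("E100_1B_NO20", "E101_2B_NO20")
metadata_names <- c("E100_2B_NO20", "E101_2B_NO20", "E102_2B_NO20")

# Hypothetical helper: rewrite the replicate label so 1B/2B differences are ignored.
# Assumes the label is surrounded by underscores; fixed = TRUE matches it literally.
normalize <- function(x) {
  sub(pattern = "_1B_", replacement = "_2B_", x = x, fixed = TRUE)
}

# Names present in one source but not the other
only_in_ncbi     <- setdiff(normalize(ncbi_names), normalize(metadata_names))
only_in_metadata <- setdiff(normalize(metadata_names), normalize(ncbi_names))

only_in_ncbi      # character(0)
only_in_metadata  # "E102_2B_NO20"
```

Anchoring the pattern on the underscores avoids accidentally rewriting a "1B" that happens to occur elsewhere in a name.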
@McAllister-NOAA
Collaborator

McAllister-NOAA commented Sep 8, 2023

Thanks for the info, Steve. It was very helpful for tracking down the problem, which was primarily due to incorrect sequences being submitted to the SRA. Long story short: the submission contained both 16S and 18S. The 18S is all correct, but the 16S had some different replicates chosen (the 1B/2B errors) and one additional sample (the missing one). I have submitted a second round to NCBI to correct these errors, and I will update and close this issue when I have a public accession to share.
