Add sce #3
Conversation
config/config-example.yaml workflow/Snakefile
# r and library modules must be installed in snakemake conda environment
# https://anaconda.org/conda-forge/r-base
# https://anaconda.org/bioconda/bioconductor-dropletutils
# https://anaconda.org/r/r-tidyverse
It would be great if the R dependencies were resolved automatically by snakemake using their own conda environment. See the example here: https://snakemake.readthedocs.io/en/v3.10.0/snakefiles/deployment.html#integrated-package-management
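For reference, a minimal sketch of what this integrated package management could look like; the rule name, paths, and env file location are assumptions for illustration, not the pipeline's actual layout:

```
rule construct_sce_objects:
    input:
        "results/cellranger/{individual}"
    output:
        "results/sce/{individual}.rds"
    conda:
        # R, DropletUtils and tidyverse would be pinned here instead of
        # in the environment that launches snakemake itself
        "envs/construct_sce_objects.yaml"
    script:
        "scripts/construct_sce_objects.R"
```

With `snakemake --use-conda`, the environment is then built on demand for this rule only.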
I implemented this using the construct_sce_objects.yaml environment with the required R packages. I still recommend installing the packages directly into the snakemake environment because it is more stable and faster: snakemake frequently rebuilds the environment, and old environments must be removed manually.
Interesting that it frequently rebuilds the env. This should only happen if dependencies (not listed in the environment.yaml) are updated. The separation is definitely preferable. One way to "fix" the env more firmly is to store the fully resolved environment: activate it and export all definitions with conda env export.
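That would look along these lines (the output file name is a placeholder):

```
# inside the activated environment: write out all resolved dependencies,
# including transitive ones, with exact builds pinned
conda env export > environment.full.yaml
```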
as.is=TRUE,
colClasses = "character")

wildcard_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample
Why is this variable called wildcard_curr? The code reads like it contains the individual label.
workflow/scripts/samples.py
Outdated
# list identifiers identically to config identifiers and in same order
IDENTIFIERS = ["Species_ID", "Age_ID", "Fraction_ID", "Sample_NR"] # find a way to lift directly from config
The config is an argument to the Samples class below, so every element of the config can be accessed via config[key1][key2].
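For illustration, a hedged sketch of lifting the identifiers from the config instead of hard-coding them; the "identifiers" key name is an assumption, not necessarily the key used in config-example.yaml:

```python
# Hypothetical config layout; the "identifiers" key name is invented
# for this example.
config = {
    "identifiers": ["Species_ID", "Age_ID", "Fraction_ID", "Sample_NR"],
}

# Read the identifier list from the config that is already passed to
# the Samples class, instead of hard-coding it in samples.py:
IDENTIFIERS = config["identifiers"]
```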
workflow/scripts/samples.py
Outdated
"""Make "individual" column | ||
|
||
A single identifier col is copied and renamed to "individual" but | ||
entries from multiple identifier cols are concatenated to "individual". | ||
|
||
Entries in the "individual" col are used as wildcards. | ||
|
||
For metadata_full it is used for downstream functions (?). | ||
|
||
This may cause a warning that I still have to take care of but that doesn't seem critical. | ||
""" | ||
if not "individual" in metadata_full.columns: | ||
print('establishing "individual" column') | ||
for i in IDENTIFIERS: | ||
if i == IDENTIFIERS[0]: | ||
metadata_full["individual"] = metadata_full[i].map(str) | ||
else: | ||
metadata_full["individual"] = metadata_full["individual"] + '_' + metadata_full[i].map(str) | ||
|
I would suggest making this a function inside the class to improve readability.
I shortened this part a lot for better readability, but without making it a function.
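For future maintainers, the per-identifier concatenation loop can also be collapsed into a single vectorised pandas call; a sketch under assumptions (this is not the exact code from the commit, and the column values are invented):

```python
import pandas as pd

IDENTIFIERS = ["Species_ID", "Age_ID", "Fraction_ID", "Sample_NR"]

# invented example row for demonstration
metadata_full = pd.DataFrame({
    "Species_ID": ["mmus"],
    "Age_ID": ["young"],
    "Fraction_ID": ["HSC"],
    "Sample_NR": [1],
})

if "individual" not in metadata_full.columns:
    # cast every identifier column to string and join each row with "_"
    metadata_full["individual"] = (
        metadata_full[IDENTIFIERS].astype(str).agg("_".join, axis=1)
    )
```

This yields the same "individual" values as the loop (e.g. "mmus_young_HSC_1") without the special case for the first identifier.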
config/config-test-lea.yaml
Outdated
# Enable / Disable rules and specify rule-specific parameters
rules:
  cellranger_count:
    # extra: "" # set additional arguments for cellranger count
Why the comment here? An empty string should already function as "no extra arguments given".
workflow/Snakefile
Outdated
"--transcriptome {params.genome} " | ||
"--fastqs {input} " | ||
"--localcores={threads} " | ||
"{params.extra} " | ||
"--sample {wildcards.individual}_{wildcards.sample_type} " | ||
#"{params.extra} " |
Not sure it is necessary to keep this commented out.
I made some minor comments. I hope they make sense and help to clarify some things for future maintainers.
One thing that I would like to do is make changes to scripts/samples.py. Specifically, I would like to extract the code that adjusts metadata (i.e. DATE_OF_BIRTH). At the same time we should make clear what the specification of the pipeline's metadata sheets is. In general this is Species, Sample, and FastQ Path; for your case, Age and Fraction are additionally required. Is that correct?
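For concreteness, a metadata sheet following that specification might look like this (all values invented for illustration):

```
Species,Sample,FastQ Path,Age,Fraction
mmus,sample_01,/path/to/fastqs/sample_01,young,HSC
mmus,sample_02,/path/to/fastqs/sample_02,old,HSC
```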
README.md
Outdated
```

to the snakemake command.
For multiple runs, it is recommended to install these packages directly into the snakemake environment as well.
What is meant by "multiple runs" - repeated runs? Why would you install them into the base snakemake environment, too? In general, I'm in favour of smaller, isolated environments. The readme should state why you recommend this approach.
references:
  all_masked: "/omics/groups/OE0538/internal/shared_data/CellRangerReferences/GRCm38_masked_allStrains/"
metadata:
  raw:
If we say that the pipeline reads a defined set of metadata columns, naming it raw is not really accurate anymore. Could we omit it and just state metadata: path/to/file.yaml? This would also apply to config/config-example.yaml and (multiple) occurrences in the code.
I changed it to table to keep it clearly separated from identifiers and single_cell_object_metadata_fields below.
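A sketch of the resulting config layout under this naming; the file path and field values are placeholders, only the key names follow this thread:

```
metadata:
  table: path/to/metadata_table.csv
  identifiers:
    - Species_ID
    - Age_ID
    - Fraction_ID
    - Sample_NR
  single_cell_object_metadata_fields:
    - DATE_OF_BIRTH  # placeholder example field
```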
workflow/Snakefile
Outdated
@@ -6,27 +6,24 @@ from scripts.samples import Samples

samples = Samples(config)

REFERENCES = config["references"]
METADATA_PATH = config["metadata"]["raw"]
Possibly omit raw. See above.
- readline=8.1.2=h0f457ee_0
- frozenlist=1.3.3=py310h5764c6d_0
- filelock=3.8.0=pyhd8ed1ab_0
- ruamel_yaml=0.15.80=py310h5764c6d_1008
- pycosat=0.6.4=py310h5764c6d_1
- logmuse=0.2.6=pyh8c360ce_0
- boto3=1.26.5=pyhd8ed1ab_0
- pyparsing=3.0.9=pyhd8ed1ab_0
- backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
- pyu2f=0.1.5=pyhd8ed1ab_0
- protobuf=4.21.9=py310hd8f1fbe_0
- conda-package-handling=1.9.0=py310h5764c6d_1
- pysftp=0.2.9=py_1
- pkgutil-resolve-name=1.3.10=pyhd8ed1ab_0
- python-fastjsonschema=2.16.2=pyhd8ed1ab_0
- ld_impl_linux-64=2.39=hc81fddc_0
- ncurses=6.3=h27087fc_1
- googleapis-common-protos=1.56.4=py310hff52083_1
- google-cloud-storage=2.6.0=pyh1a96a4e_0
- slacker=0.14.0=py_0
- libffi=3.4.2=h7f98852_5
- _openmp_mutex=4.5=2_gnu
- libzlib=1.2.13=h166bdaf_4
- coin-or-utils=2.11.6=h202d8b1_2
- libabseil=20220623.0=cxx17_h48a1fff_5
- prettytable=3.4.1=pyhd8ed1ab_0
- cffi=1.15.1=py310h255011f_2
- typing-extensions=4.4.0=hd8ed1ab_0
- pyyaml=6.0=py310h5764c6d_5
- dropbox=11.35.0=pyhd8ed1ab_0
- python_abi=3.10=2_cp310
- nbformat=5.7.0=pyhd8ed1ab_0
- libnsl=2.0.0=h7f98852_0
- stone=3.3.1=pyhd8ed1ab_0
- libblas=3.9.0=16_linux64_openblas
- cryptography=38.0.3=py310h600f1e7_0
- google-auth-httplib2=0.1.0=pyhd8ed1ab_1
- tk=8.6.12=h27826a3_0
- libopenblas=0.3.21=pthreads_h78a6416_3
- pyasn1-modules=0.2.7=py_0
- jsonschema=4.17.0=pyhd8ed1ab_0
- coin-or-cgl=0.60.6=h6f57e76_2
- dpath=2.0.6=py310hff52083_2
- google-api-python-client=2.65.0=pyhd8ed1ab_0
- setuptools-scm=7.0.5=pyhd8ed1ab_1
- rsa=4.9=pyhd8ed1ab_0
- pyasn1=0.4.8=py_0
- wcwidth=0.2.5=pyh9f0ad1d_2
- tqdm=4.64.1=pyhd8ed1ab_0
- traitlets=5.5.0=pyhd8ed1ab_0
- wrapt=1.14.1=py310h5764c6d_1
- zipp=3.10.0=pyhd8ed1ab_0
- botocore=1.29.5=pyhd8ed1ab_0
- idna=3.4=pyhd8ed1ab_0
- google-resumable-media=2.4.0=pyhd8ed1ab_0
- bcrypt=3.2.2=py310h5764c6d_1
- attmap=0.13.2=pyhd8ed1ab_0
- requests=2.28.1=pyhd8ed1ab_1
- xz=5.2.6=h166bdaf_0
- grpcio=1.49.1=py310hc32fa93_1
- libprotobuf=3.21.9=h6239696_0
- gitpython=3.1.29=pyhd8ed1ab_0
- s3transfer=0.6.0=pyhd8ed1ab_0
- yaml=0.2.5=h7f98852_2
- ratelimiter=1.2.0=pyhd8ed1ab_1003
- bzip2=1.0.8=h7f98852_4
- ca-certificates=2022.9.24=ha878542_0
- uritemplate=4.1.1=pyhd8ed1ab_0
- future=0.18.2=pyhd8ed1ab_6
- jupyter_core=4.11.2=py310hff52083_0
- pluggy=1.0.0=pyhd8ed1ab_5
- brotlipy=0.7.0=py310h5764c6d_1005
- jmespath=1.0.1=pyhd8ed1ab_0
- setuptools=65.5.1=pyhd8ed1ab_0
- libcrc32c=1.1.2=h9c3ff4c_0
- libstdcxx-ng=12.2.0=h46fd767_19
- commonmark=0.9.1=py_0
- zlib=1.2.13=h166bdaf_4
- tabulate=0.9.0=pyhd8ed1ab_1
- re2=2022.06.01=h27087fc_0
- appdirs=1.4.4=pyh9f0ad1d_0
- aiosignal=1.3.1=pyhd8ed1ab_0
- paramiko=2.12.0=pyhd8ed1ab_0
- filechunkio=1.8=py_2
- charset-normalizer=2.1.1=pyhd8ed1ab_0
- python-irodsclient=1.1.5=pyhd8ed1ab_0
- rich=12.6.0=pyhd8ed1ab_0
- async-timeout=4.0.2=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- coin-or-osi=0.108.7=h2720bb7_2
- pyopenssl=22.1.0=pyhd8ed1ab_0
- libuuid=2.32.1=h7f98852_1000
- pysocks=1.7.1=pyha2e5f31_6
- openssl=3.0.7=h166bdaf_0
- typing_extensions=4.4.0=pyha770c72_0
- python=3.10.6=ha86cf86_0_cpython
- pynacl=1.5.0=py310h5764c6d_2
- smart_open=6.2.0=pyha770c72_0
- tomli=2.0.1=pyhd8ed1ab_0
- libgcc-ng=12.2.0=h65d4601_19
I think this file should be in the project's root directory because it is not activated by snakemake's --use-conda directive, but is used to start snakemake in the first place.
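In that setup, the workflow would be launched along these lines (the environment file name, env name, and core count are placeholders):

```
# create the launch environment from the file in the repo root
conda env create -f environment.yaml -n snakemake
conda activate snakemake

# rule-specific environments (e.g. for the R script) are then
# built on demand via --use-conda
snakemake --use-conda --cores 8
```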
individual_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample
IDENTIFIERS <- snakemake@params[["identifiers"]]
I would suggest moving the definitions of constants to the top of the script to improve readability.
IDENTIFIERS <- snakemake@params[["identifiers"]]

# if necessary, concatenate identifiers again to obtain all possible wildcards
metadata_curr <- metadata
Is this reassignment necessary? :)
}

# subset data as specified by wildcard and single_cell_object_metadata_fields
metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),]
Why not stick to the tidyverse? (Sorry for being nitty-gritty.)
If you're using R >= 4.1, it is also possible to use the native pipe (|> instead of magrittr's %>%).

Suggested change:
- metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),]
+ # not tested
+ metadata_curr <- metadata_curr |> filter(individual == individual_curr)
implemented in commit 47c9b94
workflow/scripts/samples.py
Outdated
metadata_full = self.get_cellranger_filename(metadata_full)

self.metadata = self.select_columns(metadata_full)
self.metadata = self.metadata.rename(self.columns_map, axis="columns")
self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS)
Suggested change:
- self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS)
+ self.metadata = self.select_columns(metadata_full, custom_columns = IDENTIFIERS)
workflow/scripts/samples.py
Outdated
 def select_columns(self,
                    df: pd.DataFrame,
-                   columns: list = None):
+                   columns: list = None,
+                   identifiers: str = None):
     """Select/Subset columns from DataFrame to reduce
     DataFrame dimensions. """

     if not columns:
-        columns = self.columns
+        columns = self.columns + identifiers
     return df[columns]
The functionality of this method is not super clear to me. What should it do?
Currently, it subsets the dataframe to columns (defined as class attributes inside Samples).
If no columns argument is provided, the class's default columns plus the contents of identifiers will be used for subsetting. If identifiers is not set (it defaults to None), self.columns + identifiers will raise a TypeError.
Wouldn't it be clearer to define the method as follows? Then combining columns and identifiers needs to be done when calling the method, i.e. in line 43.

def select_columns(self,
                   df: pd.DataFrame,
                   custom_columns: list = None):
    """Select/Subset columns from DataFrame to reduce
    DataFrame dimensions. A list of column names can be provided via `custom_columns`."""
    if custom_columns:
        df_subset = df[custom_columns]
    else:
        df_subset = df[self.columns]
    return df_subset
Implemented in commit b2d5b8c. It works if the defined columns are moved below def __init__(self, config).
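To illustrate the suggested signature in use, a minimal, hypothetical sketch; the Samples stand-in and column names are invented for this example and are not the pipeline's actual class:

```python
import pandas as pd

class Samples:
    """Minimal stand-in for the pipeline's Samples class, only to
    demonstrate the suggested select_columns signature."""

    def __init__(self, config):
        # default columns, defined inside __init__ as noted above
        self.columns = ["Sample", "FastQ Path"]

    def select_columns(self, df: pd.DataFrame, custom_columns: list = None):
        """Subset df to custom_columns if given, else to self.columns."""
        return df[custom_columns] if custom_columns else df[self.columns]

# invented example data
df = pd.DataFrame({
    "Sample": ["s1"],
    "FastQ Path": ["/data/s1"],
    "Species_ID": ["mmus"],
})
samples = Samples(config={})

# combining defaults with extra identifiers now happens at the call site
subset = samples.select_columns(df, custom_columns=["Sample", "Species_ID"])
```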
Co-authored-by: Fritjof Lammers <[email protected]>
- added rule and script for construction of SingleCellExperiment objects from cellranger output
- added functionality to automatically obtain individual sample names from custom identifiers, to be used as wildcards
- minor changes to cellranger output, extra, and log paths; used a separate, specific config file for remote functionality