Skip to content

Conversation

leawlb
Copy link
Collaborator

@leawlb leawlb commented Oct 12, 2022

-added rule and script for construction of Single Cell Experiment objects from cellranger output
-added functionality to automatically obtain individual sample names from custom identifiers, to be used as wildcards
-minor changes to cellranger output paths, extra and log paths and used separate, specific config file for remote functionality

Comment on lines 5 to 8
# r and library modules must be installed in snakemake conda environment
# https://anaconda.org/conda-forge/r-base
# https://anaconda.org/bioconda/bioconductor-dropletutils
# https://anaconda.org/r/r-tidyverse
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if the R dependencies would be automatically resolved by snakemake using an own conda environment. See the example here https://snakemake.readthedocs.io/en/v3.10.0/snakefiles/deployment.html#integrated-package-management

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented using the construct_sce_objects.yaml environment with the required R packages. I still recommend directly installing the packages into the snakemake environment because it is more stable and faster. snakemake frequently rebuilds the environment and old environments must be removed manually.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting that it frequently rebuilds the env. This should only happen if dependencies (not listed in the environment.yaml) are updated. The separation is definitively preferable. One way to "fix" the env better is to store the full env by activating it and export all definitions with conda env export.

as.is=TRUE,
colClasses = "character")

wildcard_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this variable called wildcard_curr, the code reads like it contains the individual label

Comment on lines 5 to 6
# list identifiers identically to config identifiers and in same order
IDENTIFIERS = ["Species_ID", "Age_ID", "Fraction_ID", "Sample_NR"] # find a way to lift directly from config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config is an argument to the Samples-class below. Thereby every element of the config can be accessed via config[key1][key2]

Comment on lines 49 to 67
"""Make "individual" column

A single identifier col is copied and renamed to "individual" but
entries from multiple identifier cols are concatenated to "individual".

Entries in the "individual" col are used as wildcards.

For metadata_full it is used for downstream functions (?).

This may cause a warning that I still have to take care of but that doesn't seem critical.
"""
if not "individual" in metadata_full.columns:
print('establishing "individual" column')
for i in IDENTIFIERS:
if i == IDENTIFIERS[0]:
metadata_full["individual"] = metadata_full[i].map(str)
else:
metadata_full["individual"] = metadata_full["individual"] + '_' + metadata_full[i].map(str)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to make this a function side the class to improve readability

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shortened this part a lot for better readability but without making a function.

# Enable / Disable rules and specifiy rule-specific parameters
rules:
cellranger_count:
# extra: "" # set additional arguments for cellranger count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the comment here? the empty string should function as no extra arguments given.

"--transcriptome {params.genome} "
"--fastqs {input} "
"--localcores={threads} "
"{params.extra} "
"--sample {wildcards.individual}_{wildcards.sample_type} "
#"{params.extra} "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if necessary to comment out

Copy link
Collaborator

@fritjoflammers fritjoflammers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some minor comments. I hope they make sense and help to clarify some things to future maintainers.

One thing, that I would like to do is make changes to scripts/samples.py. Specifically, I would like to extract the code to adjust metadata (i.e. DATE_OF_BIRTH). At the same time we should make clear what the specifications of the pipeline's metadata sheets is. In general this is Species , Sample, and FastQ Path. For your case in addition Age and Fraction are required. Is that correct?

README.md Outdated
```

to the snakemake command.
For multiple runs, it is recommended to install these packages directly into the snakemake environment as well.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant by "multipe runs" - repeated runs? Why would you install them into the base snakemake environment, too? In general, I'm in favour of smaller, isolated environments. The readme here should state why you recommend this approach.

references:
all_masked: "/omics/groups/OE0538/internal/shared_data/CellRangerReferences/GRCm38_masked_allStrains/"
metadata:
raw:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we say that the pipeline reads a defined set of metadata columns, naming it raw is not really accurate anymore. Could be omit it and just state metadata: path/to/file.yaml ? This would also apply to the the config/config-example.yaml and (multiple) occurrences in the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to table to keep it easily separated from identifiers and single_cell_object_metadata_fields below

@@ -6,27 +6,24 @@ from scripts.samples import Samples

samples = Samples(config)

REFERENCES = config["references"]
METADATA_PATH = config["metadata"]["raw"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly omit raw. See above.

Comment on lines 1 to 181
- readline=8.1.2=h0f457ee_0
- frozenlist=1.3.3=py310h5764c6d_0
- filelock=3.8.0=pyhd8ed1ab_0
- ruamel_yaml=0.15.80=py310h5764c6d_1008
- pycosat=0.6.4=py310h5764c6d_1
- logmuse=0.2.6=pyh8c360ce_0
- boto3=1.26.5=pyhd8ed1ab_0
- pyparsing=3.0.9=pyhd8ed1ab_0
- backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
- pyu2f=0.1.5=pyhd8ed1ab_0
- protobuf=4.21.9=py310hd8f1fbe_0
- conda-package-handling=1.9.0=py310h5764c6d_1
- pysftp=0.2.9=py_1
- pkgutil-resolve-name=1.3.10=pyhd8ed1ab_0
- python-fastjsonschema=2.16.2=pyhd8ed1ab_0
- ld_impl_linux-64=2.39=hc81fddc_0
- ncurses=6.3=h27087fc_1
- googleapis-common-protos=1.56.4=py310hff52083_1
- google-cloud-storage=2.6.0=pyh1a96a4e_0
- slacker=0.14.0=py_0
- libffi=3.4.2=h7f98852_5
- _openmp_mutex=4.5=2_gnu
- libzlib=1.2.13=h166bdaf_4
- coin-or-utils=2.11.6=h202d8b1_2
- libabseil=20220623.0=cxx17_h48a1fff_5
- prettytable=3.4.1=pyhd8ed1ab_0
- cffi=1.15.1=py310h255011f_2
- typing-extensions=4.4.0=hd8ed1ab_0
- pyyaml=6.0=py310h5764c6d_5
- dropbox=11.35.0=pyhd8ed1ab_0
- python_abi=3.10=2_cp310
- nbformat=5.7.0=pyhd8ed1ab_0
- libnsl=2.0.0=h7f98852_0
- stone=3.3.1=pyhd8ed1ab_0
- libblas=3.9.0=16_linux64_openblas
- cryptography=38.0.3=py310h600f1e7_0
- google-auth-httplib2=0.1.0=pyhd8ed1ab_1
- tk=8.6.12=h27826a3_0
- libopenblas=0.3.21=pthreads_h78a6416_3
- pyasn1-modules=0.2.7=py_0
- jsonschema=4.17.0=pyhd8ed1ab_0
- coin-or-cgl=0.60.6=h6f57e76_2
- dpath=2.0.6=py310hff52083_2
- google-api-python-client=2.65.0=pyhd8ed1ab_0
- setuptools-scm=7.0.5=pyhd8ed1ab_1
- rsa=4.9=pyhd8ed1ab_0
- pyasn1=0.4.8=py_0
- wcwidth=0.2.5=pyh9f0ad1d_2
- tqdm=4.64.1=pyhd8ed1ab_0
- traitlets=5.5.0=pyhd8ed1ab_0
- wrapt=1.14.1=py310h5764c6d_1
- zipp=3.10.0=pyhd8ed1ab_0
- botocore=1.29.5=pyhd8ed1ab_0
- idna=3.4=pyhd8ed1ab_0
- google-resumable-media=2.4.0=pyhd8ed1ab_0
- bcrypt=3.2.2=py310h5764c6d_1
- attmap=0.13.2=pyhd8ed1ab_0
- requests=2.28.1=pyhd8ed1ab_1
- xz=5.2.6=h166bdaf_0
- grpcio=1.49.1=py310hc32fa93_1
- libprotobuf=3.21.9=h6239696_0
- gitpython=3.1.29=pyhd8ed1ab_0
- s3transfer=0.6.0=pyhd8ed1ab_0
- yaml=0.2.5=h7f98852_2
- ratelimiter=1.2.0=pyhd8ed1ab_1003
- bzip2=1.0.8=h7f98852_4
- ca-certificates=2022.9.24=ha878542_0
- uritemplate=4.1.1=pyhd8ed1ab_0
- future=0.18.2=pyhd8ed1ab_6
- jupyter_core=4.11.2=py310hff52083_0
- pluggy=1.0.0=pyhd8ed1ab_5
- brotlipy=0.7.0=py310h5764c6d_1005
- jmespath=1.0.1=pyhd8ed1ab_0
- setuptools=65.5.1=pyhd8ed1ab_0
- libcrc32c=1.1.2=h9c3ff4c_0
- libstdcxx-ng=12.2.0=h46fd767_19
- commonmark=0.9.1=py_0
- zlib=1.2.13=h166bdaf_4
- tabulate=0.9.0=pyhd8ed1ab_1
- re2=2022.06.01=h27087fc_0
- appdirs=1.4.4=pyh9f0ad1d_0
- aiosignal=1.3.1=pyhd8ed1ab_0
- paramiko=2.12.0=pyhd8ed1ab_0
- filechunkio=1.8=py_2
- charset-normalizer=2.1.1=pyhd8ed1ab_0
- python-irodsclient=1.1.5=pyhd8ed1ab_0
- rich=12.6.0=pyhd8ed1ab_0
- async-timeout=4.0.2=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- coin-or-osi=0.108.7=h2720bb7_2
- pyopenssl=22.1.0=pyhd8ed1ab_0
- libuuid=2.32.1=h7f98852_1000
- pysocks=1.7.1=pyha2e5f31_6
- openssl=3.0.7=h166bdaf_0
- typing_extensions=4.4.0=pyha770c72_0
- python=3.10.6=ha86cf86_0_cpython
- pynacl=1.5.0=py310h5764c6d_2
- smart_open=6.2.0=pyha770c72_0
- tomli=2.0.1=pyhd8ed1ab_0
- libgcc-ng=12.2.0=h65d4601_19
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file should be in the projects root directory because it is not activated by snakemake's --use-conda directive, but used to start snakemake in the first place.

Comment on lines 23 to 24
individual_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample
IDENTIFIERS <- snakemake@params[["identifiers"]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to move the definitions of constants to the top of the script to improve readability.

IDENTIFIERS <- snakemake@params[["identifiers"]]

# if necessary, concatenate identifiers again to obtain all possible wildcards
metadata_curr <- metadata
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this reassignment necessary? :)

}

# subset data as specified by wildcard and single_cell_object_metadata_fields
metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not stick to the tidyverse? (sorry for being nitty gritty?

If you're using >= R 4.1, it is also possible to use native pipes (|> instead of magrittrs %>%)

Suggested change
metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),]
# not tested
metadata_curr <- metadata_curr |> filter(individual == individual_curr)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implemented in commit 47c9b94

metadata_full = self.get_cellranger_filename(metadata_full)

self.metadata = self.select_columns(metadata_full)
self.metadata = self.metadata.rename(self.columns_map, axis="columns")
self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS)
self.metadata = self.select_columns(metadata_full, custom_columns = IDENTIFIERS)

Comment on lines 81 to 91
def select_columns(self,
df: pd.DataFrame,
columns: list = None):
columns: list = None,
identifiers: str = None):
"""Select/Subset columns from DataFrame to reduce
DataFrame dimensions. """

if not columns:
columns = self.columns
columns = self.columns + identifiers
return df[columns]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality of this method is not super clear to me. What should it do?
Currently, it subsets the dataframe to columns (defined as class attributes inside Samples).
If a columns attributes is provided, this list of column names plus the contents of identifiers will be used for subsetting. If individual is not set (defaults to None), combining columns + identifiers will raise a TypeError.

Wouldn't a more clearer functionality be use define the method as follows? Then combining columns and identifiers need to be done when calling the method, i.e. in line 43

    def select_columns(self,
                       df: pd.DataFrame,
                       custom_columns: list = None):
        """Select/Subset columns from DataFrame to reduce
        DataFrame dimensions. A list of column names can be provided by `custom_columns` """
        
        if custom_columns:
            df_subset =  df[custom_columns]
        else: 
            df_subset =  df[columns]
        return df_subset
            

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implemented in commit b2d5b8c. It works if the defined columns are moved below def init(self, config)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants