Major changes in the documentation and schemas created #99
Actionable comments posted: 40
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
quantmsio/utils/system.py (1)
17-35: Do not assume "enough disk" on error; handle a non-existent output_dir and improve logging. Returning True on exceptions can lead to partial writes or data loss. Also, shutil.disk_usage requires an existing path; a new output_dir may not exist yet. Check the nearest existing parent, log details, and make the "assume on error" behavior explicit.
Apply:
-def check_disk_space(output_dir: Path, required_gb: float = 1.0) -> bool: +def check_disk_space(output_dir: Path, required_gb: float = 1.0, assume_enough_on_error: bool = False) -> bool: """ - Check if there's sufficient disk space for processing. + Check if there's sufficient disk space for processing. Args: - output_dir: Directory to check space for - required_gb: Required space in GB (default: 1GB) + output_dir: Directory to check space for (may not exist yet). + required_gb: Required space in GiB (default: 1.0). + assume_enough_on_error: If True, return True when the check fails; otherwise return False. Returns: - True if sufficient space is available, False otherwise + True if sufficient space is available, False otherwise. """ try: - total, used, free = shutil.disk_usage(output_dir) - free_gb = free / (1024**3) # Convert to GB - return free_gb >= required_gb + path = output_dir if output_dir.exists() else output_dir.parent + total, used, free = shutil.disk_usage(path) + free_gib = free / (1024 ** 3) # Convert to GiB + has_space = free_gib >= required_gb + logging.getLogger(__name__).debug( + "Disk space check on %s: free=%.2f GiB, required=%.2f GiB, ok=%s", + path, free_gib, required_gb, has_space + ) + return has_space except Exception as e: - logging.warning(f"Could not check disk space: {e}") - return True # Assume sufficient space if we can't check + logging.warning("Could not check disk space for %s: %s", output_dir, e) + return assume_enough_on_errordocs/pg.avsc (2)
44-56
: Avro types: prefer float over float32; struct → record. Replace float32 with Avro float
and define a named record for array items.- {"name": "intensities", "type": ["null",{ - "type": "array", - "items": { - "type": "struct", - "fields": [ - {"name": "sample_accession", "type": "string"}, - {"name": "channel", "type": "string"}, - {"name": "intensity", "type": "float32"} - ] - } - }], + {"name": "intensities", "type": ["null",{ + "type": "array", + "items": { + "type": "record", + "name": "pg_intensity", + "fields": [ + {"name": "sample_accession", "type": "string"}, + {"name": "channel", "type": "string"}, + {"name": "intensity", "type": "float"} + ] + } + }],
57-78
: Nested arrays: define records and use float. Bring additional_intensities
into Avro-compliant form.- {"name": "additional_intensities", "type": ["null",{ - "type": "array", - "items": { - "type": "struct", - "fields": [ - {"name": "sample_accession", "type": "string"}, - {"name": "channel", "type": "string"}, - {"name": "intensities", "type": { - "type": "array", - "items": { - "type": "struct", - "fields": [ - {"name": "intensity_name", "type": "string"}, - {"name": "intensity_value", "type": "float32"} - ] - } - }} - ] - } - }], + {"name": "additional_intensities", "type": ["null",{ + "type": "array", + "items": { + "type": "record", + "name": "pg_additional_intensity", + "fields": [ + {"name": "sample_accession", "type": "string"}, + {"name": "channel", "type": "string"}, + {"name": "intensities", "type": { + "type": "array", + "items": { + "type": "record", + "name": "pg_intensity_kv", + "fields": [ + {"name": "intensity_name", "type": "string"}, + {"name": "intensity_value", "type": "float"} + ] + } + }} + ] + } + }],quantmsio/utils/pride_utils.py (1)
457-466
: Guard against a None modifications_definition to avoid runtime errors. When modifications exist but modifications_definition is None, the current code will raise. Fall back to the raw sequence in that case.
- if "opt_global_cv_MS:1000889_peptidoform_sequence" in es: - psm["peptidoform"] = es["opt_global_cv_MS:1000889_peptidoform_sequence"] - else: - psm["peptidoform"] = get_petidoform_msstats_notation( - psm["sequence"], psm["modifications"], modifications_definition - ) + if "opt_global_cv_MS:1000889_peptidoform_sequence" in es: + psm["peptidoform"] = es["opt_global_cv_MS:1000889_peptidoform_sequence"] + else: + psm["peptidoform"] = ( + get_petidoform_msstats_notation( + psm["sequence"], psm["modifications"], modifications_definition + ) + if modifications_definition + else psm["sequence"] + )- psm["proforma_peptidoform"] = get_peptidoform_proforma_version_in_mztab( - psm["sequence"], psm["modifications"], modifications_definition - ) + psm["proforma_peptidoform"] = ( + get_peptidoform_proforma_version_in_mztab( + psm["sequence"], psm["modifications"], modifications_definition + ) + if modifications_definition + else psm["sequence"] + )
♻️ Duplicate comments (4)
quantmsio/core/fragpipe/fragpipe.py (1)
371-387
: Reduce the default batch size and restore pre-flight checks (duplicate of prior feedback). A batch size of 1e6 rows risks memory pressure; also check that the input file exists and ensure the output directory is created.
- def write_psms_to_parquet( + def write_psms_to_parquet( self, - msms_file: Path, - batch_size: int = 1000000, + msms_file: Path, + batch_size: int = 100000, output_prefix: Optional[str] = None, ) -> None: @@ - if not output_prefix: + if not msms_file.exists(): + raise FileNotFoundError(f"PSM file not found: {msms_file}") + self.output_directory.mkdir(parents=True, exist_ok=True) + if not output_prefix: output_prefix = "psm"quantmsio/core/quantms/pg.py (1)
543-548
: DuckDB cannot read a Python DataFrame by name in SQL; it must be registered. As written, this creates an empty table or fails silently. Register/unregister the DataFrame before CREATE TABLE.
- if protein_data: - protein_df = pd.DataFrame(protein_data) - msstats_indexer._duckdb.execute("DROP TABLE IF EXISTS protein_groups") - msstats_indexer._duckdb.execute( - "CREATE TABLE protein_groups AS SELECT * FROM protein_df" - ) + if protein_data: + protein_df = pd.DataFrame(protein_data) + msstats_indexer._duckdb.execute("DROP TABLE IF EXISTS protein_groups") + msstats_indexer._duckdb.register("protein_df_view", protein_df) + try: + msstats_indexer._duckdb.execute( + "CREATE TABLE protein_groups AS SELECT * FROM protein_df_view" + ) + finally: + msstats_indexer._duckdb.unregister("protein_df_view")quantmsio/core/diann/diann.py (1)
151-167
: Consider aligning field names for clarity. Based on past review comments, the field names in peptide_counts could be more consistent. Consider the following (sketched in code after the list):
- In peptide_counts: use unique_peptides and total_peptides
- Keep feature_counts as is, with unique_features and total_features
This would make the distinction between peptides and features clearer.
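A minimal sketch of the renamed keys, assuming peptide_counts and feature_counts are plain Python dicts assembled in diann.py; the function and count variables below are placeholders, not the module's real names:

def build_counts(unique_peptides: int, total_peptides: int,
                 unique_features: int, total_features: int) -> tuple[dict, dict]:
    # Peptide-level counts use "peptides" in the key names ...
    peptide_counts = {
        "unique_peptides": unique_peptides,
        "total_peptides": total_peptides,
    }
    # ... while feature-level counts keep the existing "features" naming.
    feature_counts = {
        "unique_features": unique_features,
        "total_features": total_features,
    }
    return peptide_counts, feature_counts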
quantmsio/core/quantms/psm.py (1)
58-69
: Use a context manager for file operations. The manual file open/close should be replaced with a context manager for better resource management and exception safety. A minimal sketch follows.
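A sketch of the pattern meant here, not the actual psm.py code; the filtering inside the with-block is only illustrative:

from pathlib import Path

def read_metadata_lines(mztab_path: Path) -> list[str]:
    # The with-block closes the handle even if parsing raises, so no manual close() is needed.
    with open(mztab_path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.startswith("MTD")]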
🧹 Nitpick comments (80)
quantmsio/utils/system.py (1)
8-15
: Add type hints and tighten the log output (units, PID). Minor polish: type the logger parameter and the return value, and log MiB together with the PID for easier tracing.
Apply:
-def log_memory_usage(logger, step: str): - """Log memory usage for a given step.""" - process = psutil.Process(os.getpid()) - mem_info = process.memory_info() - logger.debug( - f"Memory usage after {step}: RSS={mem_info.rss / 1024 / 1024:.1f}MB, VMS={mem_info.vms / 1024 / 1024:.1f}MB" - ) +def log_memory_usage(logger: logging.Logger, step: str) -> None: + """Log current process memory usage (RSS/VMS in MiB) for the given step.""" + process = psutil.Process(os.getpid()) + mem = process.memory_info() + rss = mem.rss / (1024 ** 2) + vms = mem.vms / (1024 ** 2) + logger.debug(f"[pid={process.pid}] Memory usage after {step}: RSS={rss:.1f} MiB, VMS={vms:.1f} MiB")Optional micro-opt (outside range): cache the Process at module load to avoid re-creating it per call:
# at module top
_PROCESS = psutil.Process(os.getpid())
# in function
process = _PROCESS
tests/core/test_project.py (4)
20-36
: Patch the correct requests symbol and simplify the mock response.
Targets the import site to avoid future import-style changes breaking the patch; Mock is lighter than constructing a Response.- @patch("requests.get") + @patch("quantmsio.core.project.requests.get") def test_populate_from_pride_archive_successful(self, mock_get): - # Mock the API response for a successful request - mock_response = requests.models.Response() - mock_response.status_code = 200 - mock_response.json = lambda: { - "title": "Test Project", - "projectDescription": "Test description", - "sampleProcessingProtocol": "Test sample processing protocol", - "dataProcessingProtocol": "Test data processing protocol", - "references": [{"pubmedId": "12345678"}], - "keywords": ["keyword1", "keyword2"], - "projectTags": ["tag1", "tag2"], - # Add other mock data as needed - } - mock_get.return_value = mock_response + # Mock the API response for a successful request + mock_get.return_value = Mock( + status_code=200, + json=lambda: { + "title": "Test Project", + "projectDescription": "Test description", + "sampleProcessingProtocol": "Test sample processing protocol", + "dataProcessingProtocol": "Test data processing protocol", + "references": [{"pubmedId": "12345678"}], + "keywords": ["keyword1", "keyword2"], + "projectTags": ["tag1", "tag2"], + }, + )Additional change needed outside this hunk:
-from unittest.mock import patch +from unittest.mock import patch, Mock
50-61
: Avoid live network in unit tests; gate as integration and remove print.
Prevents flakiness and noisy output; allow opt-in via env flag.def test_populate_from_pride_archive_api(self): + if os.environ.get("PRIDE_ONLINE") != "1": + self.skipTest("Integration test requires PRIDE_ONLINE=1") project_accession = "PXD020453" project_manager = ProjectHandler(project_accession) project_manager.populate_from_pride_archive() # Assert that the project_info has been updated self.assertEqual( project_manager.project.project_info["project_title"], "Structural insights into Cullin4-RING ubiquitin ligase remodelling by Vpr from simian immunodeficiency viruses", ) - print(project_manager.project.project_info)Additional change needed outside this hunk:
+import os
If you prefer keeping this as a pure unit test, patch requests here too (same target as above) and assert on stable fields rather than a long, mutable title.
6-6
: ddt is imported but unused.
Either add data-driven cases or remove the decorator/import to reduce test overhead.-from ddt import ddt +# from ddt import ddt # Uncomment when adding data-driven tests @@ -@ddt class TestProjectHandler(TestCase):Alternatively, parameterize via ddt using test_dataset.
Also applies to: 17-18
62-69
: Enhance test_save_project_info with SDRF file-existence and ingestion assertions. The populate_from_sdrf call populates project_manager.project.project_info with keys like "organisms", "organism_parts", "diseases", "cell_lines", "instruments", "enzymes" and "acquisition_properties". Strengthen the test to:
• Verify the SDRF test file is present
• Assert that at least one SDRF-derived property is added to project_info
In tests/core/test_project.py around lines 66–69, update:
def test_save_project_info(self): project_accession = test_dataset[0] sdrf_file = test_dataset[1] project_manager = ProjectHandler(project_accession) project_manager.populate_from_pride_archive() - project_manager.populate_from_sdrf(sdrf_file) + # Ensure the SDRF test file exists + self.assertTrue(sdrf_file.exists(), f"Missing SDRF test file: {sdrf_file}") + project_manager.populate_from_sdrf(sdrf_file) + # Verify SDRF ingestion added properties to project_info + info = project_manager.project.project_info + self.assertIn("organisms", info) + self.assertIsInstance(info["organisms"], list) + self.assertGreater(len(info["organisms"]), 0)tests/core/test_large_mztab_performance.py (1)
20-22
: Make the CI skip robust to different CI values. Skip if CI is set to any truthy string, not just "true".
-@pytest.mark.skipif( - os.environ.get("CI") == "true", reason="Skipping performance test in CI environment" -) +@pytest.mark.skipif( + bool(os.environ.get("CI")), reason="Skipping performance test in CI environment" +)quantmsio/operate/tools.py (4)
29-33
: Tighten the return typing for init_save_info. The second element can become a ParquetWriter; advertise the Optional type.
-def init_save_info(parquet_path: str) -> tuple[dict, None, str]: +def init_save_info(parquet_path: str) -> tuple[dict, Optional[pq.ParquetWriter], str]:
124-134
: Fix a typo in the docstring: "retrun" → "return". Minor documentation fix.
- retrun: None + return: None
159-175
: Consider using sets in peptide_map to avoid O(n) duplicate checks. Using a set avoids repeated membership scans; convert to a list at the end if needed.
-def get_peptide_map(unique_peptides: list, fasta: str) -> defaultdict: - peptide_map: defaultdict = defaultdict(list) +def get_peptide_map(unique_peptides: list, fasta: str) -> defaultdict: + peptide_map: defaultdict = defaultdict(set) @@ - if accession not in peptide_map[peptide]: - peptide_map[peptide].append(accession) - return peptide_map + peptide_map[peptide].add(accession) + # downstream expects lists + return defaultdict(list, {k: list(v) for k, v in peptide_map.items()})
261-268
: Make the column drop resilient when the label is absent. Avoid a KeyError if label/reference aren't present in the merge result.
- df.drop( + df.drop( [ "reference", "label", ], - axis=1, - inplace=True, + axis=1, + inplace=True, + errors="ignore", )tests/core/operate/test_query.py (4)
13-17
: Replace prints with minimal assertions. Ensure the generators yield at least one batch; avoid noisy output.
- for _ in q.iter_samples(): - print("ok") + assert next(q.iter_samples(), None) is not None
23-27
: Assert the iterator yields. Same rationale as above.
- for _ in q.iter_chunk(): - print("ok") + assert next(q.iter_chunk(), None) is not None
33-37
: Assert the iterator yields. Same rationale as above.
- for _ in q.iter_file(): - print("ok") + assert next(q.iter_file(), None) is not None
43-48
: Validate the result of inject_position_msg. Guard against silent no-ops.
protein_dict = q.get_protein_dict(fasta) - q.inject_position_msg(df, protein_dict) + out = q.inject_position_msg(df, protein_dict) + assert "pg_positions" in out.columnsquantmsio/utils/file_utils.py (1)
41-56
: Simplify the existence check using all(...). Matches the Ruff SIM110 hint.
- for file_name in file_names: - # Now we have Path / str which works correctly on all platforms - if not (path_obj / file_name).exists(): - return False - return True + return all((path_obj / file_name).exists() for file_name in file_names)quantmsio/core/fragpipe/fragpipe.py (3)
5-5
: Remove the unused import date. Per static analysis hint.
-from datetime import date
11-11
: Remove the unused import pyarrow.parquet as pq. Per static analysis hint.
-import pyarrow.parquet as pq
405-407
: Optionally pass a string path to ParquetBatchWriter for wider compatibility. Minor portability improvement.
- batch_writer = ParquetBatchWriter( - output_path, - schema=batch.schema, - ) + batch_writer = ParquetBatchWriter(str(output_path), schema=batch.schema)quantmsio/utils/constants.py (1)
106-117
: Consider adding validation for mzTab protein columns. The MZTAB_PROTEIN_COLUMNS list contains a mix of referenced constants (ACCESSION, MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE
+AMBIGUITY_MEMBERS_COLUMN = "ambiguity_members" +DESCRIPTION_COLUMN = "description" +MODIFICATIONS_COLUMN = "modifications" +PROTEIN_COVERAGE_COLUMN = "protein_coverage" + MZTAB_PROTEIN_COLUMNS = [ ACCESSION, - "description", + DESCRIPTION_COLUMN, MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE, - "ambiguity_members", + AMBIGUITY_MEMBERS_COLUMN, - "modifications", + MODIFICATIONS_COLUMN, - "protein_coverage", + PROTEIN_COVERAGE_COLUMN, "opt_global_Posterior_Probability_score", "opt_global_cv_PRIDE:0000303_decoy_hit", "opt_global_result_type", ]quantmsio/utils/mztab_utils.py (4)
1-4
: Add a module-level docstring for better documentation. The module lacks a docstring describing its purpose and contents.
Add a module-level docstring:
+""" +Utilities for parsing and processing mzTab format files. + +This module provides functions for parsing mzTab metadata, headers, protein and PSM lines, +as well as handling modifications and peptidoform conversions from OpenMS format. +""" import re from typing import Optional, Dict, List
33-43
: Simplify the nested loop iteration. The nested loop for finding the accession can be simplified.
elif "site" in line_parts[1]: index = line_parts[1].split("[")[1].split("]")[0] accession = None - for ( - key, - value, - ) in ( - _modifications.items() - ): # for name, age in dictionary.iteritems(): (for Python 2.x) + for key, value in _modifications.items(): if value[1] == index: accession = key
280-289
: Use a ternary operator for cleaner code. The if-else block can be simplified to a ternary operator, as suggested by static analysis.
def _clean_file_path(file_path: str) -> str: """ Clean a file path to extract just the filename without extension. This helper function removes the file:// protocol and extracts the filename without the file extension from various file path formats. Args: file_path: File path that may include file:// protocol and extensions Returns: Clean filename without protocol and extension Example: >>> _clean_file_path("file:///path/to/file.mzML") 'file' >>> _clean_file_path("/data/sample_001.mzML") 'sample_001' >>> _clean_file_path("relative/path/file.txt") 'file' """ - if file_path.startswith("file://"): - # Remove file:// protocol - clean_path = file_path[7:] - else: - clean_path = file_path + clean_path = file_path[7:] if file_path.startswith("file://") else file_path # Extract filename without extension filename = clean_path.split("/")[-1].split(".")[0] return filename
533-658
: Consider refactoring the complex modification parsing function. The parse_modifications_openms function is quite complex, with deeply nested conditionals. Break it down into smaller, more focused helpers for readability and maintainability, for example by extracting the terminal-modification handling into separate functions:
def _handle_n_terminal_mod(peptidoform_to_process: str, mod_name_to_accession: dict, parsed_mods: dict) -> str: """Handle N-terminal modification parsing.""" if peptidoform_to_process.startswith(".("): match = re.match(r"\.\(([^)]+)\)", peptidoform_to_process) if match: mod_name = match.group(1) accession = mod_name_to_accession.get(mod_name, mod_name) if accession not in parsed_mods: parsed_mods[accession] = {"name": mod_name, "positions": []} parsed_mods[accession]["positions"].append( {"position": "N-term.0", "scores": None} ) return peptidoform_to_process[len(match.group(0)):] return peptidoform_to_process def _handle_c_terminal_mod(peptidoform_to_process: str, pure_sequence: str, mod_name_to_accession: dict, parsed_mods: dict) -> tuple[str, Optional[str]]: """Handle C-terminal modification parsing.""" c_term_mod_name = None if peptidoform_to_process.endswith(")") and peptidoform_to_process.rfind(".(") > 0: match = re.search(r"\.\(([^)]+)\)$", peptidoform_to_process) if match: c_term_mod_name = match.group(1) peptidoform_to_process = peptidoform_to_process[:-len(match.group(0))] return peptidoform_to_process, c_term_mod_nameThis would make the main function cleaner and easier to test.
quantmsio/core/common.py (2)
41-41
: Consider using a compiled regex for better performance. The OPENMS_PROTEIN_QVALUE_WORDS list suggests these words will be used for pattern matching. If this is used frequently, consider pre-compiling a regex pattern:
import re

# Regular expression that ensures the protein search engine score contains required words
OPENMS_PROTEIN_QVALUE_PATTERN = re.compile(
    r"(?=.*openms)(?=.*target-decoy)(?=.*protein)(?=.*q-value)", re.IGNORECASE
)
98-103
: Add documentation for removed fields. The comments mention that certain fields were removed in DIA-NN version 2.0. Consider adding version-compatibility information or migration guidance.
Consider adding a version compatibility note at the module level or creating a separate compatibility matrix documentation:
# DIA-NN Version Compatibility:
# - Version 2.0+: Removed "File.Name", "PG.Quantity", "PG.Normalised" from main report
# - Version 2.0+: Added "Decoy" to main report
# - Version 2.0+: Added "N.Sequences", "N.Proteotypic.Sequences" to report.pg_matrix.tsv
debug_test_convert_to_parquet.py (3)
9-9
: Remove the unused import. os isn't used.
-import os
44-45
: Drop the unnecessary f-string. No placeholders are present.
- print(f"=== Full traceback: ===") + print("=== Full traceback: ===")
16-24
: Make the logging setup resilient and Unicode-safe. Avoid duplicate handlers when re-running, and set the encoding for the file handler.
logging.basicConfig( level=logging.DEBUG, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", handlers=[ logging.StreamHandler(sys.stdout), - logging.FileHandler("debug_test_convert_to_parquet.log"), + logging.FileHandler("debug_test_convert_to_parquet.log", encoding="utf-8"), ], + force=True, )tests/core/quantms/psms/test_mztab_functions.py (2)
36-49
: Consider adding a probability-bearing case. Add a case like 2[0.7]|5[0.3]-UNIMOD:4 to assert probabilities are stripped while positions remain integers. I can add a parametrized test covering probability annotations if you want; a sketch follows.
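One possible shape for such a case; parse_positions below is a self-contained stand-in for whichever helper this module actually tests, so swap in the real import before use:

import re
import pytest

def parse_positions(raw: str) -> list[int]:
    # Placeholder parser: strip bracketed probabilities such as "[0.7]" before reading positions.
    cleaned = re.sub(r"\[[^\]]*\]", "", raw.split("-")[0])
    return [int(p) for p in cleaned.split("|")]

@pytest.mark.parametrize(
    "raw, expected_positions",
    [("2[0.7]|5[0.3]-UNIMOD:4", [2, 5]), ("2|5-UNIMOD:4", [2, 5])],
)
def test_probabilities_are_stripped(raw, expected_positions):
    # Positions must survive as integers regardless of probability annotations.
    assert parse_positions(raw) == expected_positions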
51-57
: Reduce flakiness by making the check order-agnostic (optional). The current assertion depends on dict insertion order; if upstream changes order, the test could fail even with correct content.
- assert modification_list == ["2|5-UNIMOD:4", "1|3|7-UNIMOD:21"] + assert set(modification_list) == {"2|5-UNIMOD:4", "1|3|7-UNIMOD:21"}tests/core/quantms/test_feature.py (2)
16-25
: Gracefully skip when example data is missing. Mirror the idXML pattern to avoid brittle failures in environments without the example data.
+import pytest @@ - # Initialize Feature + # Skip if data files don't exist + if not (mztab_file.exists() and msstats_file.exists() and sdrf_file.exists()): + pytest.skip("Example data files not found") + # Initialize Feature feature = Feature(mztab_file, sdrf_file, msstats_file)Apply the same guard in
test_extract_psm_msg and test_generate_feature.
83-89
: Optional: cap iterations to keep tests fast.If generators can yield many batches, consider breaking after N iterations.
- for feature_table in feature.generate_feature(): + for feature_table in feature.generate_feature(): # Add assertions to verify the result assert feature_table is not None assert len(feature_table) > 0 assert "peptidoform" in feature_table.column_names count += 1 + if count >= 3: + breakdocs/pg.avsc (1)
79-79
: Consider boolean for decoy flag.
is_decoy
fitsboolean
better thanint
.- {"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the protein group is a decoy, 0 if target"}, + {"name": "is_decoy", "type": "boolean", "doc": "Decoy indicator: true if decoy, false if target"},tests/core/test_sdrf.py (3)
15-15
: Remove redundant private call.
__init__
already loads the SDRF; calling_load_sdrf_info
again is unnecessary and touches a private API.- sdrf_handler._load_sdrf_info(file)
34-43
: Align allowed experiment types with implementation (optional).
get_experiment_type_from_sdrf
currently returns LFQ/TMT*/ITRAQ* only. Including "SILAC" here may be misleading.If SILAC support is intended, please confirm and update
quantmsio/core/sdrf.py
accordingly; otherwise remove it from the allowed list in tests.
110-115
: Use direct dict membership.Simpler and faster than checking
.keys()
.- for mod in expected_mods: - assert any( - mod in key for key in mods_dict.keys() - ), f"Expected modification {mod} not found in mods_dict" + for mod in expected_mods: + assert any(mod in key for key in mods_dict), ( + f"Expected modification {mod} not found in mods_dict" + )quantmsio/quantmsioc.py (2)
103-114
: Register DuckDB and project conversion commands or drop imports.
duckdb_creator
andquantms_project_convert
are imported but unused (Ruff F401). If intended, register them under convert; otherwise remove the imports.convert.add_command(convert_idxml_file, name="idxml") +convert.add_command(duckdb_creator, name="duckdb") +convert.add_command(quantms_project_convert, name="quantms-project")Run flake8 after this change to confirm F401 is resolved.
116-124
: Give transform subcommands stable, human-friendly names.Right now Click will default to function names (long/unstable). Use explicit names matching the PR surface.
-transform.add_command(ae_transform) -transform.add_command(de_transform) -transform.add_command(gene_transform) -transform.add_command(ibaq_transform) -transform.add_command(spectra_transform) -transform.add_command(uniprot_transform) -transform.add_command(anndata_transform) +transform.add_command(ae_transform, name="ae") +transform.add_command(de_transform, name="de") +transform.add_command(gene_transform, name="gene") +transform.add_command(ibaq_transform, name="ibaq") +transform.add_command(spectra_transform, name="spectra") +transform.add_command(uniprot_transform, name="uniprot") +transform.add_command(anndata_transform, name="anndata")tests/core/idxml/test_idxml_integration.py (2)
2-2
: Remove unused import.
os
is imported but never used.-import os
82-88
: Optional: Align with the new CLI entry (future-proof).This test invokes a Click command function directly from
convert.quantms
. Given the CLI now exposesconvert idxml
, consider testing the top-level CLI too for integration coverage. No action required if you want to keep this as a unit-style CLI test.tests/core/quantms/test_diann.py (1)
16-29
: LGTM on the feature path; consider asserting outputs to strengthen the test. You exercise the full path but don't validate shapes or schema. Adding minimal assertions (e.g., a non-empty DataFrame after transform) would catch silent regressions; see the sketch below.
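A minimal illustration of such assertions; the column name is an assumption, not the transform's documented schema:

import pandas as pd

def assert_non_trivial(df: pd.DataFrame) -> None:
    # Non-empty output plus one expected column is enough to catch silent regressions.
    assert df is not None
    assert not df.empty
    assert "pg_accessions" in df.columns  # illustrative; adjust to the real output schema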
tests/core/idxml/test_idxml_psm.py (1)
198-201
: Assert a specific exception type.Catching bare
Exception
hides failures. If the loader raises a file-not-found error, make it explicit.- with pytest.raises(Exception): + with pytest.raises(FileNotFoundError): IdXmlPsm("non_existent_file.idXML")If the actual exception differs (e.g., OSError), update the type accordingly.
tests/core/quantms/test_diann_duckdb_performance.py (1)
3-8
: Remove unused imports.
os
andpandas
are not used in this test module.Apply:
-import os ... -import pandas as pdtests/core/quantms/psms/test_mztab_indexer.py (1)
129-162
: Reduce log spam and turn referential checks into actionable assertions. The current loop logs every accession and a boolean. Aggregate missing items and assert, or at least summarize.
Example refactor:
- for psm_acc_group in psm_accessions: - for psm_acc in psm_acc_group.split(","): - logging.info(f"PSM accession {psm_acc} protein group not found") - logging.info( - any(psm_acc in prot_acc for prot_acc in all_protein_accessions) - ) + missing_psm = [] + for psm_acc_group in psm_accessions: + for psm_acc in psm_acc_group.split(","): + if not any(psm_acc in prot_acc for prot_acc in all_protein_accessions): + missing_psm.append(psm_acc) + assert not missing_psm, f"{len(missing_psm)} PSM accessions not found in proteins: {missing_psm[:10]}" - for msstats_acc_group in msstats_accessions: - for msstats_acc in msstats_acc_group.split(","): - logging.info(f"MSstats accession {msstats_acc} protein group not found") - logging.info( - any(msstats_acc in prot_acc for prot_acc in all_protein_accessions) - ) + missing_ms = [] + for msstats_acc_group in msstats_accessions: + for msstats_acc in msstats_acc_group.split(","): + if not any(msstats_acc in prot_acc for prot_acc in all_protein_accessions): + missing_ms.append(msstats_acc) + assert not missing_ms, f"{len(missing_ms)} MSstats accessions not found in proteins: {missing_ms[:10]}"Also applies to: 269-293
tests/core/quantms/test_maxquant.py (1)
1-2
: Prune unused imports.
os
,tempfile
, andMzTabProteinGroups
are unused.Apply:
-import os -import tempfile ... -from quantmsio.core.quantms.pg import MzTabProteinGroupsAlso applies to: 10-12
tests/core/quantms/psms/test_duckdb_creation.py (5)
1264-1280
: Use contextlib.suppress for cleanup helpers. This simplifies error-tolerant cleanup and satisfies linters (Ruff SIM105, CodeFactor B110).
Apply:
+import contextlib ... def cleanup_mztab_indexer(mztab_indexer): """Clean up MzTabIndexer instance and its connections.""" if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb: - try: - mztab_indexer._duckdb.close() - except Exception: - pass # Ignore errors during cleanup + with contextlib.suppress(Exception): + mztab_indexer._duckdb.close() def cleanup_duckdb_connection(con): """Clean up DuckDB connection.""" if con: - try: - con.close() - except Exception: - pass # Ignore errors during cleanup + with contextlib.suppress(Exception): + con.close()
559-563
: Docstring/log text doesn't match behavior (MSstats is included in Step 1). Either exclude MSstats in Step 1 or update the wording; given the assertions expect msstats.parquet to be present, update the text.
Apply:
- """ - Test the two-step process with Parquet backend using TMT data: - 1. Create parquet store with mzTab data only - 2. Add MSstats data to the existing parquet store - """ + """ + Test Parquet backend with TMT data: + 1. Create parquet store with mzTab + MSstats + 2. Open existing store and re-add MSstats (idempotent replace) + """ ... - print("Step 1: Creating parquet store with mzTab data only...") + print("Step 1: Creating parquet store with mzTab + MSstats...")Also applies to: 582-589
586-596
: Close indexer instances to release resources. Avoid leaked DuckDB connections and silence "assigned but unused" warnings by using the objects in cleanup.
Apply:
- mztab_step1 = MzTabIndexer.create( + mztab_step1 = MzTabIndexer.create( mztab_path=mztab_path, - msstats_path=msstats_path, + msstats_path=msstats_path, backend="parquet", database_path=output_path, ) ... - mztab_existing = MzTabIndexer.open( + mztab_existing = MzTabIndexer.open( database_path=output_path, backend="parquet", ) ... - # Clean up - con.close() + # Clean up + cleanup_mztab_indexer(mztab_step1) + cleanup_mztab_indexer(mztab_existing) + con.close()Also applies to: 628-632, 694-696
1013-1018
: Ensure parity in parquet create/open test: close the created indexer.This both releases resources and makes
mztab_new
a used variable.Apply:
mztab_new = MzTabIndexer.create( mztab_path=mztab_path, msstats_path=msstats_path, backend="parquet", database_path=output_path, ) +cleanup_mztab_indexer(mztab_new) ... - # Clean up - con.close() + # Clean up + con.close()Also applies to: 1121-1123
184-184
: Remove extraneous f-strings without placeholders.Several
print(f"...")
have no{}
. Drop thef
to satisfy Ruff F541.You can auto-fix across this file:
#!/bin/bash # Remove f-prefix when no {} placeholders python - <<'PY' import re, pathlib p = pathlib.Path("tests/core/quantms/psms/test_duckdb_creation.py") s = p.read_text() s = re.sub(r'print\(f"([^"{]*?)"\)', r'print("\1")', s) p.write_text(s) PYAlso applies to: 214-214, 274-274, 310-310, 341-341, 401-401, 436-436, 475-475, 542-542, 580-580, 615-615, 662-662, 689-689, 719-719, 753-753, 806-806, 833-833, 863-863, 897-897, 950-950, 977-977, 1007-1007, 1042-1042, 1089-1089, 1116-1116, 1146-1146, 1181-1181, 1228-1228, 1255-1255
tests/core/quantms/psms/test_psm.py (5)
259-266
: Avoid comparisons to True/False in tests. Use truthiness for clarity and to satisfy linters (E712).
- assert ( - _is_ms_run_location_line( - ["MTD", "ms_run[1]-location", "file:///path/file.mzML"] - ) - == True - ) - assert _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"]) == False - assert _is_ms_run_location_line(["PRT", "accession", "P12345"]) == False + assert _is_ms_run_location_line(["MTD", "ms_run[1]-location", "file:///path/file.mzML"]) + assert not _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"]) + assert not _is_ms_run_location_line(["PRT", "accession", "P12345"])
520-525
: Prefer iterating dict directly (SIM118).- for protein_acc in protein_qvalue_map.keys(): + for protein_acc in protein_qvalue_map:
552-556
: Remove redundant f-prefix in string.- logger.info(f"\nTop 10 protein groups with lowest q-values:") + logger.info("\nTop 10 protein groups with lowest q-values:")
558-566
: Drop unused loop variable.- for i, (accession, qvalue) in enumerate(sorted_groups[:10], 1): + for accession, qvalue in sorted_groups[:10]:
414-593
: This test is doing too much; consider parametrization/splitting. Split by concern (stats, bounds, groups) or parametrize over backends to reduce complexity and noise.
quantmsio/utils/pride_utils.py (2)
quantmsio/utils/pride_utils.py (2)
424-451
: Condense optionals with dict.get and reduce branching.- if "opt_global_cv_MS:1002217_decoy_peptide" in es: - psm["is_decoy"] = es["opt_global_cv_MS:1002217_decoy_peptide"] - else: - psm["is_decoy"] = None + psm["is_decoy"] = es.get("opt_global_cv_MS:1002217_decoy_peptide") - if "opt_global_Posterior_Error_Probability_score" in es: - psm["posterior_error_probability"] = es[ - "opt_global_Posterior_Error_Probability_score" - ] - else: - psm["posterior_error_probability"] = None + psm["posterior_error_probability"] = es.get("opt_global_Posterior_Error_Probability_score") - if "opt_global_q-value" in es: - psm["global_qvalue"] = es["opt_global_q-value"] - else: - psm["global_qvalue"] = None + psm["global_qvalue"] = es.get("opt_global_q-value") - if "opt_global_consensus_support" in es: - psm["consensus_support"] = es["opt_global_consensus_support"] - else: - psm["consensus_support"] = None + psm["consensus_support"] = es.get("opt_global_consensus_support")
469-481
: Raise ValueError for a missing ms_run mapping; include context. Use a more specific exception type and a concise message for callers.
- if ms_runs is not None: - if ms_run in ms_runs: - psm["ms_run"] = ms_runs[ms_run] - else: - raise Exception(f"The ms run {ms_run} is not in the ms runs index") + if ms_runs is not None: + if ms_run in ms_runs: + psm["ms_run"] = ms_runs[ms_run] + else: + raise ValueError(f"ms_run '{ms_run}' not found in provided ms_runs index")quantmsio/core/openms.py (2)
3-3
: Remove unused imports.-from typing import List, Dict, Any, Optional, Iterator, Union +from typing import List, Dict, Any, Optional, Union-from quantmsio.utils.mztab_utils import parse_peptidoform_openms +# parse_peptidoform_openms not used here; remove importAlso applies to: 11-11
217-226
: Broaden numeric check to handle non-floatables safely.- except ValueError: + except (TypeError, ValueError): return Falsetests/core/quantms/test_diann_duckdb.py (3)
3-7
: Remove unused imports.-import os import time from pathlib import Path -import pandas as pd import pytest
40-41
: Drop redundant f-strings without placeholders.- print(f"\nSmall dataset statistics:") + print("\nSmall dataset statistics:")- print(f"\nFull dataset statistics:") + print("\nFull dataset statistics:")- print(f"\nFiltered protein groups matrix statistics (5 proteins):") + print("\nFiltered protein groups matrix statistics (5 proteins):")- print(f"\nCaching performance:") + print("\nCaching performance:")Also applies to: 125-126, 232-233, 278-281
106-249
: Consolidate the repeated stats logic via parametrization/fixtures (optional). The small and full tests duplicate large chunks; consider pytest parametrization over dataset and thresholds to trim runtime and code size, as sketched below.
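A rough sketch of what that parametrization could look like; the fixture names, row thresholds, and parameters are placeholders rather than the repo's real fixtures:

import pytest

@pytest.mark.parametrize(
    "dataset_fixture, min_rows",
    [("small_diann_report", 10), ("full_diann_report", 1000)],
)
def test_dataset_statistics(dataset_fixture, min_rows, request):
    report = request.getfixturevalue(dataset_fixture)  # resolve the named fixture at runtime
    # The shared assertions from the current small/full tests would live here.
    assert len(report) >= min_rows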
quantmsio/core/maxquant/maxquant.py (1)
779-798
: Simplify type checking for better maintainability.The explicit type comparisons violate Python best practices. Consider using
isinstance()
for type checking or a dictionary-based approach.Apply this refactor for cleaner type checking:
- for field, dtype in required_fields.items(): - if field not in df.columns: - if dtype == list: - df[field] = [[] for _ in range(len(df))] - elif dtype == str: - df[field] = None - elif dtype == float: - df[field] = 0.0 - elif dtype == int: - df[field] = 0 - else: - # Handle type conversions - if dtype == float: - df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0) - elif dtype == int: - df[field] = ( - pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int) - ) - elif dtype == str: - df[field] = df[field].astype(str).fillna("") + default_values = { + list: lambda: [[] for _ in range(len(df))], + str: lambda: None, + float: lambda: 0.0, + int: lambda: 0 + } + + for field, dtype in required_fields.items(): + if field not in df.columns: + df[field] = default_values[dtype]() + else: + # Handle type conversions + if dtype == float: + df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0) + elif dtype == int: + df[field] = ( + pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int) + ) + elif dtype == str: + df[field] = df[field].astype(str).fillna("")tests/core/quantms/test_msstats_in.py (4)
1-17
: Remove unused imports to clean up the module.The
gzip
andtempfile
imports are not used in the code.-import gzip -import tempfile from pathlib import Path
74-74
: Remove extraneous f-string prefixes. These strings don't contain any placeholders, so the f prefix is unnecessary.
- print(f"Intensity distribution analysis completed") + print("Intensity distribution analysis completed") - print(f"\nLFQ test completed successfully!") + print("\nLFQ test completed successfully!") - print(f"\nTMT test completed successfully!") + print("\nTMT test completed successfully!") - print(f"Intensity distribution analysis completed") + print("Intensity distribution analysis completed") - print(f"\nDataset Comparison:") + print("\nDataset Comparison:")Also applies to: 117-117, 465-465, 536-536, 589-589
272-319
: Consider breaking down the complex channel processing function.The
_process_channel_statistics
function handles both direct channel counts and intensities-based channel counts with nested logic. Consider splitting into two separate functions for clarity.def _process_channel_statistics( msstats_batch, channel_counts, file_channel_matrix, channel_column="Channel" ): """Process channel statistics from batch data.""" # Count channels from the specified channel column chan_counts = msstats_batch[channel_column].value_counts() for channel, count in chan_counts.items(): if channel not in channel_counts: channel_counts[channel] = 0 channel_counts[channel] += count + # Process intensities separately + _process_intensities_channel_statistics( + msstats_batch, channel_counts, file_channel_matrix + ) + +def _process_intensities_channel_statistics( + msstats_batch, channel_counts, file_channel_matrix +): + """Process channel statistics from intensities column.""" # Count channels from the 'intensities' column intensities_channel_counts = {} file_channel_matrix_from_intensities = {} for _, row in msstats_batch.iterrows(): ref_file = row["reference_file_name"] if "intensities" in row and row["intensities"] is not None: for intensity_info in row["intensities"]: channel = intensity_info["channel"] # Count channels from intensities if channel not in intensities_channel_counts: intensities_channel_counts[channel] = 0 intensities_channel_counts[channel] += 1 # Count file-channel combinations from intensities if ref_file not in file_channel_matrix_from_intensities: file_channel_matrix_from_intensities[ref_file] = {} if channel not in file_channel_matrix_from_intensities[ref_file]: file_channel_matrix_from_intensities[ref_file][channel] = 0 file_channel_matrix_from_intensities[ref_file][channel] += 1 # Update the main counting with intensities data if intensities_channel_counts: - for channel, count in intensities_channel_counts.items(): - if channel not in channel_counts: - channel_counts[channel] = 0 - # Use intensities-based matrix for ref_file, channels in file_channel_matrix_from_intensities.items(): if ref_file not in file_channel_matrix: file_channel_matrix[ref_file] = {} for channel, count in channels.items(): if channel not in file_channel_matrix[ref_file]: file_channel_matrix[ref_file][channel] = 0 file_channel_matrix[ref_file][channel] += count
357-357
: Simplify the dictionary key check. Use the simpler syntax for checking whether a key exists in a dictionary.
- for channel in file_channels.keys() + for channel in file_channelsquantmsio/commands/convert/quantms.py (1)
7-7
: Remove unused import.The
tempfile
import at the module level is unused since you're importing it locally where needed.-import tempfile from typing import Optional
quantmsio/core/idxml/idxml.py (1)
397-397
: Remove unnecessary f-string prefix.The string doesn't contain any placeholders.
- logger.info(f"=== DEBUG: convert_to_parquet started ===") + logger.info("=== DEBUG: convert_to_parquet started ===")quantmsio/core/diann/diann.py (2)
7-7
: Remove unused imports. Several imports are unused in the code.
-from typing import Union, Optional, List +from typing import Union ... -from quantmsio.core.duckdb import DiannDuckDB -from quantmsio.core.project import create_uuid_filename from quantmsio.core.quantms.feature import Feature from quantmsio.core.quantms.mztab import MzTab from quantmsio.core.sdrf import SDRFHandler -from quantmsio.operate.tools import get_ahocorasickAlso applies to: 23-24, 28-28
343-345
: Add error handling for the type conversion. The conversion could fail if the data is not in the expected format.
group = group.copy() - if group["rt"].dtype != "float64": - group["rt"] = group["rt"].astype("float64") + try: + if group["rt"].dtype != "float64": + group["rt"] = pd.to_numeric(group["rt"], errors="coerce") + except Exception as e: + logging.warning(f"Failed to convert retention time to float: {e}")quantmsio/core/quantms/psm.py (2)
242-255
: Remove unused loop variable to improve code clarity.The
index
variable is not used in the loop body and should be replaced with_
to indicate it's intentionally unused.# Now we need to convert the temp_modifications to the final modifications dictionary where key # of the dictionary is accession, and indexes are moved as second element of the tuple. It could be that # one accession has mutliple indexes, then we need to group them and add then also site and position. modifications = {} -for index, ( +for _, ( name, accession, sites, positions, ) in temp_modifications.items():
506-507
: Remove extraneous f-string prefix.The f-string has no placeholders and doesn't need the
f
prefix.- self.logger.warning(f"Could not convert global_qvalue to float") + self.logger.warning("Could not convert global_qvalue to float")quantmsio/core/duckdb.py (1)
197-197
: Remove unused variable assignment.The
result
variable is assigned but never used in the performance stats method.try: - result = self.query_to_df(query) + self.query_to_df(query) end_time = time.time()quantmsio/core/quantms/feature.py (6)
2-8
: Remove unused imports. Several imports are not used in the code and should be removed for cleaner code.
import logging -import tempfile -from pathlib import Path -from typing import Union import pandas as pd import pyarrow as pa -import pyarrow.parquet as pq
14-16
: Remove unused imports from the module.The
Psm
andget_protein_accession
imports are not directly used at the module level.-from quantmsio.core.quantms.psm import Psm from quantmsio.core.sdrf import SDRFHandler -from quantmsio.operate.tools import get_protein_accession
229-238
: Simplify nested if statements.The nested if statements can be combined into a single condition for better readability.
for _, row in metadata_df.iterrows(): - if "fixed_mod[" in row["key"] or "var_mod[" in row["key"]: - if "site" not in row["key"] and "position" not in row["key"]: + if ("fixed_mod[" in row["key"] or "var_mod[" in row["key"]) and "site" not in row["key"] and "position" not in row["key"]: values = ( row["value"].replace("[", "").replace("]", "").split(",") )
245-246
: Fix unused loop variables. Multiple loop variables are not used and should be prefixed with an underscore.
- for accession, (name, index, site, position) in modifications.items(): + for accession, (name, _, site, position) in modifications.items(): mods_map[name] = [accession, site or "X"]
317-317
: Remove unused variable assignment.The
charge_col
variable is assigned but never used.# Add charge column if available, otherwise use default if "Charge" in msstats_batch.columns: grouping_cols.append("Charge") - charge_col = "Charge" elif "PrecursorCharge" in msstats_batch.columns: grouping_cols.append("PrecursorCharge") - charge_col = "PrecursorCharge" else: # Add a default charge if not available msstats_batch["Charge"] = 3 grouping_cols.append("Charge") - charge_col = "Charge"
647-763
: Consider breaking down the complex conversion method. The convert_to_parquet_format method is quite long and handles many different column types; consider refactoring it into smaller helper methods for better maintainability. Would you like me to help refactor this method into smaller, more focused helpers like _convert_float_columns, _convert_int_columns, _convert_list_columns, etc.? This would improve readability and make the code easier to test and maintain. A rough sketch follows.
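An illustrative decomposition only; the helper names, column lists, and dtypes below are assumptions, not the actual contents of quantmsio/core/quantms/feature.py:

import pandas as pd

def _convert_float_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    # Coerce selected columns to float, turning unparseable values into NaN.
    for col in columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

def _convert_int_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    # Coerce selected columns to int, defaulting missing values to 0.
    for col in columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
    return df

def convert_to_parquet_format_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # The real method would chain one helper per column family.
    df = _convert_float_columns(df, ["observed_mz", "calculated_mz", "rt"])
    df = _convert_int_columns(df, ["precursor_charge", "is_decoy"])
    return df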
docs/feature.avsc
@@ -15,47 +15,72 @@
{"name": "file_type", "type": "string", "doc": "Type of the file (feature_file)"},
{"name": "creation_date", "type": "string", "doc": "Date when the file was created"},
{"name": "uuid", "type": "string", "doc": "Unique identifier for the file"},
{"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},
💡 Verification agent
🧩 Analysis chain
Verify the scan_format field values are validated.
The scan_format field documentation mentions possible values (scan, index, nativeId), but the schema doesn't enforce these values through an enum type.
Consider using an Avro enum type to enforce valid values:
-{"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},
+{"name": "scan_format", "type": {"type": "enum", "name": "ScanFormat", "symbols": ["scan", "index", "nativeId"]}, "doc": "The format of the scan"},
🏁 Script executed:
#!/bin/bash
# Search for validation of scan_format values in the codebase
rg -n "scan_format.*(?:scan|index|nativeId)" --type py
# Check if there's any enum or validation for scan_format
ast-grep --pattern 'scan_format == $_'
Length of output: 377
Enforce scan_format values via an Avro enum
I didn’t find any code that validates scan_format
against the documented values (“scan”, “index”, “nativeId”), nor does the Avro schema enforce them. Without schema-level enforcement or runtime checks, invalid values could slip through and cause downstream failures.
Suggested changes:
- Update the Avro schema (
docs/feature.avsc
) to use anenum
forscan_format
. - Review all code that serializes or deserializes this field—e.g.
quantmsio/core/idxml/idxml.py:366
quantmsio/core/quantms/psm.py:326
quantmsio/core/quantms/feature.py:264
to handle the enum type or validate string values accordingly.
Diff for docs/feature.avsc
:
- {"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},
+ {"name": "scan_format",
+ "type": {
+ "type": "enum",
+ "name": "ScanFormat",
+ "symbols": ["scan", "index", "nativeId"]
+ },
+ "doc": "The format of the scan"}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
{"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."}, | |
{"name": "scan_format", | |
"type": { | |
"type": "enum", | |
"name": "ScanFormat", | |
"symbols": ["scan", "index", "nativeId"] | |
}, | |
"doc": "The format of the scan"}, |
🤖 Prompt for AI Agents
In docs/feature.avsc around line 18, replace the current string-typed
"scan_format" field with an Avro enum type that defines the allowed symbols
["scan","index","nativeId"] so the schema enforces valid values; then
update/inspect the serialization/deserialization and validation sites
(quantmsio/core/idxml/idxml.py:366, quantmsio/core/quantms/psm.py:326,
quantmsio/core/quantms/feature.py:264) to handle the enum (either by mapping
enum values to/from strings or by adding explicit runtime validation that
accepts only these three values), and run schema compatibility tests to ensure
no breaking changes for existing data.
"type": "struct", | ||
"fields": [ |
Fix inconsistent field type definition.
The best_id_score
field uses "struct"
type which is not standard in Avro. Avro uses "record"
for complex types.
Apply this fix to use the correct Avro type:
{"name": "best_id_score", "type": ["null", {
"type": "array",
"items": {
- "type": "struct",
+ "type": "record",
+ "name": "best_id_score_entry",
"fields": [
{"name": "score_name", "type": "string"},
{"name": "score_value", "type": "float32"}
]
}
}],
"doc": "A named score type and value representing an identification's measure of confidence or input feature"
},
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"type": "struct", | |
"fields": [ | |
{"name": "best_id_score", "type": ["null", { | |
"type": "array", | |
"items": { | |
"type": "record", | |
"name": "best_id_score_entry", | |
"fields": [ | |
{"name": "score_name", "type": "string"}, | |
{"name": "score_value", "type": "float32"} | |
] | |
} | |
}], | |
"doc": "A named score type and value representing an identification's measure of confidence or input feature" | |
}, |
🤖 Prompt for AI Agents
In docs/peptide.avsc around lines 74 to 75, the complex type for the
best_id_score field is incorrectly declared as "struct"; change that to the Avro
standard "record" (i.e., replace "type": "struct" with "type": "record") and
ensure the record includes a proper "name" and "fields" array so the schema
remains valid.
{"name": "peptide_counts", "type": ["null", { | ||
"type": "struct", | ||
"fields": [ | ||
{"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"}, | ||
{"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"} | ||
] | ||
}], "doc": "Peptide sequence counts for this protein group in this specific file"}, | ||
{"name": "feature_counts", "type": ["null", { | ||
"type": "struct", | ||
"fields": [ | ||
{"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"}, | ||
{"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"} | ||
] | ||
}], "doc": "Peptide feature counts for this protein group in this specific file"}, |
🛠️ Refactor suggestion
Avro type: use record, not struct; add names.
Avro JSON doesn’t support struct
; nested types must be record
with a name
.
- {"name": "peptide_counts", "type": ["null", {
- "type": "struct",
- "fields": [
+ {"name": "peptide_counts", "type": ["null", {
+ "type": "record",
+ "name": "pg_peptide_counts",
+ "fields": [
{"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"},
{"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"}
]
}], "doc": "Peptide sequence counts for this protein group in this specific file"},
- {"name": "feature_counts", "type": ["null", {
- "type": "struct",
- "fields": [
+ {"name": "feature_counts", "type": ["null", {
+ "type": "record",
+ "name": "pg_feature_counts",
+ "fields": [
{"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"},
{"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"}
]
}], "doc": "Peptide feature counts for this protein group in this specific file"},
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
{"name": "peptide_counts", "type": ["null", { | |
"type": "struct", | |
"fields": [ | |
{"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"}, | |
{"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"} | |
] | |
}], "doc": "Peptide sequence counts for this protein group in this specific file"}, | |
{"name": "feature_counts", "type": ["null", { | |
"type": "struct", | |
"fields": [ | |
{"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"}, | |
{"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"} | |
] | |
}], "doc": "Peptide feature counts for this protein group in this specific file"}, | |
{"name": "peptide_counts", "type": ["null", { | |
"type": "record", | |
"name": "pg_peptide_counts", | |
"fields": [ | |
{"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"}, | |
{"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"} | |
] | |
}], "doc": "Peptide sequence counts for this protein group in this specific file"}, | |
{"name": "feature_counts", "type": ["null", { | |
"type": "record", | |
"name": "pg_feature_counts", | |
"fields": [ | |
{"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"}, | |
{"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"} | |
] | |
}], "doc": "Peptide feature counts for this protein group in this specific file"}, |
🤖 Prompt for AI Agents
In docs/pg.avsc around lines 28-41, the Avro schema uses "struct" for nested
types which is invalid JSON Avro; replace each nested "struct" with a "record"
object and add a unique "name" property (e.g., PeptideCounts, FeatureCounts)
inside each record, keeping the "fields" arrays intact and preserving the
surrounding nullable union (["null", { ... }]); ensure the record names are
valid Avro identifiers and update any references if needed.
{ "name": "score_value", "type": "float32", "doc": "Value of the score for this specific position" } | ||
] |
🛠️ Refactor suggestion
Use valid Avro primitive types (int/float), not int32/float32.
Avro primitives are: null, boolean, int, long, float, double, bytes, string. int32/float32 are invalid.
Apply:
- { "name": "score_value", "type": "float32", "doc": "Value of the score for this specific position" }
+ { "name": "score_value", "type": "float", "doc": "Value of the score for this specific position" }
- {"name": "precursor_charge", "type": "int32", "doc": "Precursor charge"},
+ {"name": "precursor_charge", "type": "int", "doc": "Precursor charge"},
- {"name": "is_decoy", "type": "int32", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
+ {"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
- {"name": "number_peaks", "type": ["null", "int32"], "doc": "Number of peaks in the spectrum used for the peptide spectrum match"},
+ {"name": "number_peaks", "type": ["null", "int"], "doc": "Number of peaks in the spectrum used for the peptide spectrum match"},
Also update other occurrences (posterior_error_probability, predicted_rt, rt, ion_mobility, mz_array, intensity_array, calculated_mz, observed_mz) to use "float".
To fix all in this file:
#!/bin/bash
# Replace non-Avro primitives in-place
sed -i -E 's/"int32"/"int"/g; s/"float32"/"float"/g' docs/psm.avsc
Also applies to: 82-82, 84-84, 120-120
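A schema-level check can catch these before review; a minimal sketch, assuming the fastavro package is available (it is not necessarily a dependency of this project):
from fastavro.schema import load_schema

try:
    # load_schema parses the .avsc and raises on unknown types such as "float32"
    load_schema("docs/psm.avsc")
    print("psm.avsc is valid Avro")
except Exception as err:
    print(f"Schema error: {err}")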
🤖 Prompt for AI Agents
In docs/psm.avsc around lines 67-68 (also occurrences at 82, 84 and 120), the
schema uses invalid Avro primitive types "int32" and "float32"; update the field
definition for score_value to use "float" and change any other fields listed
(posterior_error_probability, predicted_rt, rt, ion_mobility, mz_array,
intensity_array, calculated_mz, observed_mz) that use "int32"/"float32" to the
correct Avro primitives "int"/"float" respectively; perform a project-wide
replace of "int32" → "int" and "float32" → "float" to ensure all instances are
fixed and validate the AVSC file afterwards.
evidence_file = test_data[1]
sdrf_file = test_data[2]
|
🛠️ Refactor suggestion
Guard feature test with file existence checks.
Apply:
evidence_file = test_data[1]
sdrf_file = test_data[2]
+if not evidence_file.exists() or not sdrf_file.exists():
+ pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
📝 Committable suggestion
evidence_file = test_data[1]
sdrf_file = test_data[2]
evidence_file = test_data[1]
sdrf_file = test_data[2]
if not evidence_file.exists() or not sdrf_file.exists():
pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
🤖 Prompt for AI Agents
In tests/core/quantms/test_maxquant.py around lines 51 to 53, the test assumes
test_data[1] and test_data[2] (evidence_file and sdrf_file) exist; add guards to
check os.path.exists for each file and call pytest.skip with a clear message if
either file is missing so the test is skipped rather than failing; ensure you
import os and pytest at the top of the file if not already present.
evidence_file = test_data[1]
sdrf_file = test_data[2]
|
🛠️ Refactor suggestion
Guard feature slicing test with file existence checks.
Apply:
evidence_file = test_data[1]
sdrf_file = test_data[2]
+if not evidence_file.exists() or not sdrf_file.exists():
+ pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
📝 Committable suggestion
evidence_file = test_data[1]
sdrf_file = test_data[2]
evidence_file = test_data[1]
sdrf_file = test_data[2]
if not evidence_file.exists() or not sdrf_file.exists():
pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
🤖 Prompt for AI Agents
In tests/core/quantms/test_maxquant.py around lines 81 to 83, the test uses
evidence_file = test_data[1] and sdrf_file = test_data[2] without verifying
those paths exist; add existence guards so the test is skipped when required
input files are missing. Update the test to check os.path.exists (or
Path.exists) for each file and call pytest.skip with a clear message if either
file is absent (or use pytest.mark.skipif at module/function level with the same
condition), so the test won't fail on CI or local runs when those fixture files
are not present.
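An alternative to repeating the guard in each test is a module-level skip; a sketch under the assumption that the data lives under a single root (the path below is illustrative, not the repository's actual layout):
import pytest
from pathlib import Path

TEST_DATA_ROOT = Path(__file__).parent / "examples"  # hypothetical data location

# Applies to every test in this module: skip instead of failing when data is missing.
pytestmark = pytest.mark.skipif(
    not TEST_DATA_ROOT.exists(),
    reason=f"Test data not found under {TEST_DATA_ROOT}",
)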
"PXD007683", | ||
TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv", | ||
) |
💡 Verification agent
🧩 Analysis chain
Dataset switch to PXD007683 looks good; verify test asset path exists in repo.
Ensures the SDRF file is present so the test doesn’t silently skip behavior.
🏁 Script executed:
#!/bin/bash
# Verify the SDRF file exists where the test expects it.
set -euo pipefail
echo "Searching for PXD007683TMT.sdrf.tsv under tests/ ..."
fd -a 'PXD007683TMT.sdrf.tsv' tests | tee /dev/stderr
# Fail if not found
test -n "$(fd -a 'PXD007683TMT.sdrf.tsv' tests)" || { echo 'Missing SDRF test file'; exit 1; }
Length of output: 282
🏁 Script executed:
#!/bin/bash
# Re-verify the presence of PXD007683TMT.sdrf.tsv using find, avoiding unavailable commands.
set -euo pipefail
echo "Searching for PXD007683TMT.sdrf.tsv under tests/ using find..."
RESULT=$(find tests -type f -name 'PXD007683TMT.sdrf.tsv' || true)
echo "$RESULT"
# Fail if not found
if [ -z "$RESULT" ]; then
echo "Missing SDRF test file"
exit 1
fi
Length of output: 305
Add or update the missing SDRF test asset
The test in tests/core/test_project.py is pointing to PXD007683TMT.sdrf.tsv under the TEST_DATA_ROOT, but that file was not found in the repo. Without it, the test will error out or silently skip important validation.
• Location: tests/core/test_project.py (lines 12–14)
• Expected asset: tests/.../quantms/dda-plex-full/PXD007683TMT.sdrf.tsv
Please ensure that the SDRF file is committed at the correct path, or adjust the test to reference the existing file location.
🤖 Prompt for AI Agents
In tests/core/test_project.py around lines 12 to 14, the test references
TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv" which is missing
from the repository; either add the SDRF file at
tests/.../quantms/dda-plex-full/PXD007683TMT.sdrf.tsv so the path resolves, or
update the test to point to the correct existing SDRF file path (or a fixture)
and adjust any expected contents accordingly so the test no longer depends on a
non-existent asset.
Update idxml.py and test_idxml.py to improve the modification extraction structure.
Actionable comments posted: 0
♻️ Duplicate comments (1)
quantmsio/core/idxml.py (1)
627-629
: Ensure parent directory exists and compress Parquet output. Prevents write failures when dirs are missing and reduces file size. Matches prior review guidance.
- table = pa.Table.from_pandas(df, schema=PSM_SCHEMA) - pq.write_table(table, output_path) - logging.info(f"Successfully converted IdXML to parquet: {output_path}") + output_path = Path(output_path) + output_path.parent.mkdir(parents=True, exist_ok=True) + table = pa.Table.from_pandas(df, schema=PSM_SCHEMA) + pq.write_table(table, str(output_path), compression="zstd") + logging.info(f"Successfully converted IdXML to parquet: {output_path}")
🧹 Nitpick comments (7)
quantmsio/core/idxml.py (7)
55-57
: Log full traceback when IdXML parsing fails. Use logging.exception to include the stack trace; avoid interpolating the exception object redundantly.
- except Exception as e: - logging.error(f"Error parsing IdXML file with pyopenms: {e}") - raise + except Exception: + logging.exception("Error parsing IdXML file with pyopenms") + raise
170-172
: Emit traceback when a peptide hit cannot be parsed. This improves diagnosability and aligns with TRY400 guidance.
- except Exception as e: - logging.warning(f"Error parsing peptide hit with pyopenms: {e}") - return None + except Exception: + logging.exception("Error parsing peptide hit with pyopenms") + return None
512-520
: Broadened catch is fine here; include traceback to aid debugging. Given this wraps two alternative attachment strategies, keep the broad catch but log with traceback.
- except Exception as e: - method_name = "OnDiscExperiment" if self._use_ondisc else "traditional" - logging.warning(f"Failed to attach mzML spectra using {method_name}: {e}") + except Exception: + method_name = "OnDiscExperiment" if self._use_ondisc else "traditional" + logging.exception(f"Failed to attach mzML spectra using {method_name}")
529-533
: OnDisc path: prefer openFile when available (true on-disk access). Using OnDiscMSExperiment.openFile avoids loading the full file into memory. Fall back to MzMLFile.load only if needed.
- ondisc_exp = oms.OnDiscMSExperiment() - oms.MzMLFile().load(str(self._mzml_path), ondisc_exp) + ondisc_exp = oms.OnDiscMSExperiment() + # Prefer true on-disk access when available + if hasattr(ondisc_exp, "openFile"): + ondisc_exp.openFile(str(self._mzml_path)) + else: + # Fallback for older pyopenms bindings + oms.MzMLFile().load(str(self._mzml_path), ondisc_exp)
562-564
: Remove unnecessary int() cast. num_peaks is already an int.
- psm["number_peaks"] = int(num_peaks) + psm["number_peaks"] = num_peaks
584-586
: Remove unnecessary int() cast. len(...) returns an int.
- psm["number_peaks"] = int(len(spectrum_mz)) + psm["number_peaks"] = len(spectrum_mz)
433-470
: Avoid double numpy→list normalization. to_dataframe already normalizes list-like fields; the extra conversion in to_parquet duplicates work. Consider keeping normalization in one place (preferably to_dataframe) and removing it from to_parquet.
- def convert_numpy_to_list(obj): - """Recursively convert numpy arrays to Python lists""" - if hasattr(obj, "tolist"): - return obj.tolist() - elif isinstance(obj, list): - return [convert_numpy_to_list(item) for item in obj] - elif isinstance(obj, dict): - return {key: convert_numpy_to_list(value) for key, value in obj.items()} - else: - return obj - - for col in df.columns: - if col in [ - "modifications", - "protein_accessions", - "additional_scores", - "cv_params", - "mz_array", - "intensity_array", - ]: - df[col] = df[col].apply(convert_numpy_to_list) + # Data is already normalized in to_dataframe(); avoid re-normalizing here.Also applies to: 605-626
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
quantmsio/core/idxml.py
(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
quantmsio/core/idxml.py (1)
quantmsio/core/openms.py (2)
OpenMSHandler
(14-344)get_spectrum_from_scan
(21-43)
🪛 Ruff (0.12.2)
quantmsio/core/idxml.py
56-56: Use logging.exception
instead of logging.error
Replace with exception
(TRY400)
147-168: Consider moving this statement to an else
block
(TRY300)
170-170: Do not catch blind exception: Exception
(BLE001)
517-517: Do not catch blind exception: Exception
(BLE001)
584-584: Value being cast to int
is already an integer
Remove unnecessary int
call
(RUF046)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: build
🔇 Additional comments (1)
quantmsio/core/idxml.py (1)
557-567
: Heads-up: upstream return shape may throw here; verify OpenMSHandler contract. OpenMSHandler.get_spectrum_from_scan appears to return only two values on some error paths (see openms.py snippet), which triggers ValueError during unpacking. Your except already handles it, but the root cause should be fixed upstream to always return a consistent 3-tuple.
fix the bugs in quantms
Pull Request Overview
This PR implements a major restructuring of the quantms.io CLI with organized command groups and enhanced functionality. The changes introduce a hierarchical command structure with groups for conversion, transformation, visualization, statistics, and project management, while adding comprehensive logging and improving error handling.
- Major CLI restructuring from flat commands to organized groups (convert, transform, visualize, stats, project)
- Enhanced logging system with structured output support and comprehensive documentation
- New quantms workflow conversion command with support for multiple proteomics tools (DIA-NN, MaxQuant, FragPipe)
Reviewed Changes
Copilot reviewed 106 out of 152 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
quantmsioc.py | Restructure main CLI with organized command groups and enhanced error handling |
diann.py, maxquant.py, fragpipe.py | Add conversion commands for different proteomics tools |
psm.py, feature.py, plot.py | Add PSM/feature conversion and visualization commands |
ae.py, de.py, ibaq.py, gene.py | Add transformation commands for various analyses |
project.py, stats.py | Add project management and statistical analysis commands |
logger.py | Add centralized logging utility module |
Test files | Comprehensive test coverage for new functionality and performance testing |
Schema files | Update Avro schemas with nullable fields and new differential/absolute expression schemas |
Documentation | Major README update with usage examples and comprehensive logging documentation |
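For orientation, the group layout described in this overview typically looks like the following click sketch; command and option names here are assumptions based on the summary above, not the exact code in this PR:
import click

@click.group()
def cli():
    """quantms.io command line interface (illustrative)."""

@cli.group()
def convert():
    """Conversion commands, e.g. diann, maxquant, fragpipe."""

@convert.command("diann")
@click.option("--report-path", required=True, help="DIA-NN report file")
def convert_diann(report_path):
    click.echo(f"Converting DIA-NN report: {report_path}")

if __name__ == "__main__":
    cli()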
"MSV000079033", | ||
TEST_DATA_ROOT / "DDA-plex/MSV000079033-Blood-Plasma-iTRAQ.sdrf.tsv", | ||
"PXD007683", | ||
TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv", |
The filename contains "PXD007683TMT.sdrf.tsv" but should likely be "PXD007683-TMT.sdrf.tsv" to match the pattern used elsewhere in the codebase (with hyphen between dataset ID and experiment type).
TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv", | |
TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683-TMT.sdrf.tsv", |
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
quantmsio/core/diann/diann.py (3)
414-424
: Incorrect call to Feature.generate_modifications_details (signature mismatch). Feature.generate_modifications_details is not a static method and expects a modifications map, not sample map + select mods. This will error.
Minimal safe fix: map peptidoform to a canonical string and set modifications to None.
- report[["peptidoform", "modifications"]] = report[["peptidoform"]].apply( - lambda row: Feature.generate_modifications_details( - row["peptidoform"], self._sample_map, select_mods - ), - axis=1, - result_type="expand", - ) + _, modifications_map = self.get_masses_and_modifications_map() + report["peptidoform"] = report["peptidoform"].map(modifications_map) + report["modifications"] = NoneIf you need full modification details, refactor to use an instance method on Feature that has access to a proper modifications_dict.
426-439
: Same LFQ mapping issue in add_additional_msg(). Apply the same fallback to avoid KeyError.
- { - "sample_accession": self._sample_map[ - rows["reference_file_name"] + "-" + rows["channel"] - ], + { + "sample_accession": self._sample_map.get( + rows["reference_file_name"] + "-" + rows["channel"], + self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]), + ), "channel": rows["channel"], "intensity": rows["intensity"], }
449-464
: Apply LFQ mapping fallback to additional_intensities as well. Mirror the robust lookup.
- { - "sample_accession": self._sample_map[ - rows["reference_file_name"] + "-" + rows["channel"] - ], + { + "sample_accession": self._sample_map.get( + rows["reference_file_name"] + "-" + rows["channel"], + self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]), + ), "channel": rows["channel"], "intensities": [ {"intensity_name": "lfq", "intensity_value": rows["lfq"]}, ], }
♻️ Duplicate comments (16)
quantmsio/core/sdrf.py (1)
129-131
: Add proper exception chaining (ruff B904).Preserve original context or explicitly suppress it.
- except FileNotFoundError: - raise FileNotFoundError( - "The SDRF file provided not found: " + str(sdrf_file) - ) + except FileNotFoundError as err: + raise FileNotFoundError( + "The SDRF file provided not found: " + str(sdrf_file) + ) from errquantmsio/utils/file_utils.py (3)
17-18
: Library modules shouldn’t call basicConfig.Don’t configure global logging in a library; attach a NullHandler instead.
-# Configure logging -logging.basicConfig(level=logging.INFO) -logger = logging.getLogger(__name__) +# Configure logging +logger = logging.getLogger(__name__) +logger.addHandler(logging.NullHandler())
30-39
: Support multi-suffix and case-insensitive extensions.Current suffix equality fails for “.mzTab.gz” and case variants.
def validate_extension(file_path: Union[str, Path], extension: str) -> bool: """Validate that the file has the correct extension.""" if not Path(file_path).exists(): raise FileNotFoundError(f"File not found: {file_path}") - if not Path(file_path).suffix == extension: + path = Path(file_path) + ext = extension.lower() + # Normalize extension to start with dot + if not ext.startswith("."): + ext = "." + ext + # Handle multi-suffix like ".mzTab.gz" and case-insensitive compare + path_ext = "".join(path.suffixes).lower() or path.suffix.lower() + if path_ext != ext: raise ValueError( f"File {file_path} does not have the correct extension {extension}" ) return True
390-405
: Fix Parquet metadata and ensure output directory exists.
- PyArrow expects metadata keys/values as bytes.
- Create parent directory before ParquetWriter.
- schema_metadata = self.schema.metadata or {} - - # Convert file_metadata to string format for parquet metadata - if self.file_metadata: - for key, value in self.file_metadata.items(): - schema_metadata[f"file_metadata.{key}"] = str(value) - - # Update schema with metadata - schema_with_metadata = self.schema.with_metadata(schema_metadata) + # Start from existing metadata (bytes->bytes) + schema_metadata = dict(self.schema.metadata or {}) + # Merge file-level metadata as bytes + if self.file_metadata: + for key, value in self.file_metadata.items(): + k = f"file_metadata.{key}".encode("utf-8") + v = str(value).encode("utf-8") + schema_metadata[k] = v + schema_with_metadata = self.schema.with_metadata(schema_metadata) @@ - self.parquet_writer = pq.ParquetWriter( + # Ensure parent directory exists + Path(self.output_path).parent.mkdir(parents=True, exist_ok=True) + self.parquet_writer = pq.ParquetWriter( - where=self.output_path, + where=str(self.output_path), schema=schema_with_metadata, compression=self.compression, )quantmsio/operate/tools.py (2)
335-346
: Ensure output dir exists and pass str path to ParquetWriter.Avoids failures when parent dir is missing and improves PyArrow compatibility.
pqwriter = None for feature in generate_ibaq_feature(sdrf_path, parquet_path): if not pqwriter: - pqwriter = pq.ParquetWriter(output_path, feature.schema) + Path(output_path).parent.mkdir(parents=True, exist_ok=True) + pqwriter = pq.ParquetWriter(str(output_path), feature.schema)
291-299
: Fix typo in public API: rename genereate_ibaq_feature → generate_ibaq_feature and update call site.Prevents leaking a misspelling and improves discoverability.
-def genereate_ibaq_feature( +def generate_ibaq_feature( sdrf_path: Union[Path, str], parquet_path: Union[Path, str] ) -> Generator[pa.Table, None, None]: @@ - for feature in genereate_ibaq_feature(sdrf_path, parquet_path): + for feature in generate_ibaq_feature(sdrf_path, parquet_path):Also applies to: 342-348
quantmsio/core/quantms/pg.py (1)
650-686
: Fix DuckDB filter: ANY([...]) is unreliable; use a quoted IN list.This pattern has caused failures before. Build a safe IN (...) list. (Same concern was raised earlier; re-flagging.)
- FROM processed_msstats_with_pg - WHERE reference_file_name = ANY({file_batch}) + FROM processed_msstats_with_pg + WHERE reference_file_name IN ({", ".join(f"'{f.replace(\"'\",\"''\")}'" for f in file_batch)})quantmsio/core/diann/diann.py (2)
152-185
: Naming consistency for peptide_counts vs feature_counts.Prior review suggested using “unique_peptides/total_peptides” in peptide_counts and “unique_features/total_features” in feature_counts to avoid mixing “sequences” terminology.
46-56
: super().init() will raise — class no longer inherits. Initialize DiannDuckDB instead.This is a runtime blocker; _duckdb is never set and all DB calls will fail.
- super(DiaNNConvert, self).__init__( - diann_report, duckdb_max_memory, duckdb_threads - ) + # Initialize DuckDB connection + self._duckdb = DiannDuckDB( + diann_report, + max_memory=duckdb_max_memory, + worker_threads=duckdb_threads, + pg_matrix_path=pg_matrix_path, + )quantmsio/commands/convert/quantms.py (3)
191-191
: Replace insecuremktemp()
withNamedTemporaryFile
.- temp_db_path = tempfile.mktemp(suffix=".duckdb") + from tempfile import NamedTemporaryFile + temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False) + temp_db_path = temp_file.name + temp_file.close()
379-379
: Replace insecuremktemp()
withNamedTemporaryFile
.- temp_db_path = tempfile.mktemp(suffix=".duckdb") + from tempfile import NamedTemporaryFile + temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False) + temp_db_path = temp_file.name + temp_file.close()
104-104
: Replace insecuremktemp()
withNamedTemporaryFile
.Using
tempfile.mktemp()
is deprecated and insecure as it creates a predictable filename that attackers could potentially exploit.- temp_db_path = tempfile.mktemp(suffix=".duckdb") + from tempfile import NamedTemporaryFile + temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False) + temp_db_path = temp_file.name + temp_file.close()quantmsio/core/quantms/psm.py (2)
526-527
: Fix typo in CV parameter name.The CV parameter name contains a typo: "consesus_support" should be "consensus_support".
struct = { - "cv_name": "consesus_support", + "cv_name": "consensus_support", "cv_value": str(consensus_support), }
664-665
: Fix variable shadowing in generator.The loop variable
df
shadows the iterable it iterates over.- for key, df in df.groupby(partitions): - yield key, df + for key, group_df in df.groupby(partitions): + yield key, group_dfquantmsio/core/quantms/feature.py (2)
472-472
: Fix closure issue in lambda function.The lambda function doesn't properly bind the loop variable
i
, which could lead to incorrect behavior if the lambda is evaluated after the loop completes.for i, feature in enumerate(map_features): msstats.loc[:, feature] = msstats[ ["reference_file_name", "peptidoform", "precursor_charge"] ].apply( - lambda rows: merge_psm(rows, i), + lambda rows, idx=i: merge_psm(rows, idx), axis=1, )
541-542
: Fix variable shadowing in generator.The loop variable
df
shadows the iterable it iterates over.- for key, df in df.groupby(partitions): - yield key, df + for key, group_df in df.groupby(partitions): + yield key, group_df
🧹 Nitpick comments (36)
tests/core/quantms/psms/test_duckdb_creation.py (7)
9-9
: Use a robust, normalized path for TEST_DATA_ROOTAvoid unresolved ".." segments; use resolve()+parents for clarity and portability.
-TEST_DATA_ROOT = Path(__file__).parent / "../.." / ".." / ".." / "tests" / "examples" +TEST_DATA_ROOT = Path(__file__).resolve().parents[4] / "tests" / "examples"
96-98
: Simplify resource management; avoid double-closing and use context managersClose the indexer once and set it to None to prevent a second close in finally. Use a context manager for DuckDB instead of manual connect/close.
- cleanup_mztab_indexer(mztab) + cleanup_mztab_indexer(mztab) + mztab = None assert output_path.exists(), f"DuckDB file not created: {output_path}" - con = duckdb.connect(str(output_path), read_only=True) - tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()] + with duckdb.connect(str(output_path), read_only=True) as con: + tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()] - assert ( - "metadata" in tables and "proteins" in tables and "psms" in tables - ) - metadata_count = con.execute( - "SELECT COUNT(*) FROM metadata" - ).fetchone()[0] - proteins_count = con.execute( - "SELECT COUNT(*) FROM proteins" - ).fetchone()[0] - psms_count = con.execute("SELECT COUNT(*) FROM psms").fetchone()[0] - if msstats_path: - msstats_count = con.execute( - "SELECT COUNT(*) FROM msstats" - ).fetchone()[0] + assert ( + "metadata" in tables and "proteins" in tables and "psms" in tables + ) + metadata_count = con.execute( + "SELECT COUNT(*) FROM metadata" + ).fetchone()[0] + proteins_count = con.execute( + "SELECT COUNT(*) FROM proteins" + ).fetchone()[0] + psms_count = con.execute("SELECT COUNT(*) FROM psms").fetchone()[0] + if msstats_path: + msstats_count = con.execute( + "SELECT COUNT(*) FROM msstats" + ).fetchone()[0]- cleanup_mztab_indexer(mztab) - cleanup_duckdb_connection(con) + cleanup_mztab_indexer(mztab)Also applies to: 99-114, 153-156
4-7
: Replace bare try/except-pass with contextlib.suppressEliminates Ruff BLE001/S110 and keeps intent clear.
+from contextlib import suppress
def cleanup_mztab_indexer(mztab_indexer): """Clean up MzTabIndexer instance and its connections.""" - if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb: - try: - mztab_indexer._duckdb.close() - except Exception: - pass # Ignore errors during cleanup + if mztab_indexer and getattr(mztab_indexer, "_duckdb", None): + with suppress(Exception): + mztab_indexer._duckdb.close() def cleanup_duckdb_connection(con): """Clean up DuckDB connection.""" - if con: - try: - con.close() - except Exception: - pass # Ignore errors during cleanup + if con: + with suppress(Exception): + con.close()Also applies to: 1251-1267
181-181
: Drop redundant f-prefixes (no placeholders) to satisfy Ruff F541Also helps the Black job by keeping strings tidy.
-print(f"\n=== Testing Two-Step Creation Process ===") +print("\n=== Testing Two-Step Creation Process ===") @@ -print(f"Initial database created with:") +print("Initial database created with:") @@ -print(f"\nFinal database contains:") +print("\nFinal database contains:") @@ -print(f"\n=== Testing Two-Step Creation Process (TMT) ===") +print("\n=== Testing Two-Step Creation Process (TMT) ===") @@ -print(f"Initial database created with:") +print("Initial database created with:") @@ -print(f"\nFinal database contains:") +print("\nFinal database contains:") @@ -print(f"\n=== Testing Create vs Open Database with Class Methods ===") +print("\n=== Testing Create vs Open Database with Class Methods ===") @@ -print(f"New database created with:") +print("New database created with:") @@ -print(f"Existing database opened with:") +print("Existing database opened with:") @@ -print(f"\nSummary:") +print("\nSummary:") @@ -print(f"\n=== Testing Create vs Open Database with Class Methods (TMT) ===") +print("\n=== Testing Create vs Open Database with Class Methods (TMT) ===") @@ -print(f"New database created with:") +print("New database created with:") @@ -print(f"Existing database opened with:") +print("Existing database opened with:") @@ -print(f"\nSummary:") +print("\nSummary:")Also applies to: 210-210, 269-269, 305-305, 335-335, 394-394, 712-712, 745-745, 797-797, 824-824, 854-854, 887-887, 939-939, 966-966
62-62
: Rename test to reflect actual behavior (DuckDB only)Removes “parquet” from the name to avoid confusion while parquet paths are commented.
-def test_create_duckdb_and_parquet(test_name, mztab_path, msstats_path): +def test_create_duckdb_database(test_name, mztab_path, msstats_path):
193-198
: Avoid relying on private attribute _duckdb in testsAccessing internals couples tests to implementation details. Prefer opening a temporary read-only DuckDB connection for assertions (as in the first test) or add a small public helper on MzTabIndexer (e.g., tables(), count(table)) and use it here.
Also applies to: 231-236, 317-323, 356-361, 725-731, 767-774, 867-872, 908-915
1-7
: Optionally silence Ruff S101 for testsPytest-style asserts are fine in tests. Either configure per-file ignores in pyproject.toml or add a file header.
pyproject.toml:
"""
[tool.ruff]
per-file-ignores = { "tests/**" = ["S101"] }
"""Or add at the top of this file:
"""ruff: noqa: S101
"""
quantmsio/core/sdrf.py (3)
104-112
: Remove commented-out SUPOORTED_LABELING block.Dead/commented code adds noise; delete or move to docs if needed.
- # SUPOORTED_LABELING = [ - # "LABEL FREE", - # "TMT10", - # "TMT11", - # "TMT16", - # "TMT6", - # "ITRAQ4", - # "ITRAQ8", - # ]
452-452
: Tiny style: avoid list concatenation in drop.Use iterable unpacking for readability (ruff RUF005).
- sdrf.drop( - ["comment[technical replicate]"] + factor, + sdrf.drop( + ["comment[technical replicate]", *factor], axis=1, inplace=True, )
457-457
: Remove commented-out return.Avoid stale commented code.
- # return sdrf[["reference_file_name", "channel", "condition"]].copy()
quantmsio/utils/file_utils.py (2)
341-351
: Type hint nit: Optional for file_metadata.Use Optional[Dict[str, Any]] (ruff RUF013).
-from typing import Dict, Iterator, List, Tuple, Union +from typing import Any, Dict, Iterator, List, Tuple, Union, Optional @@ - def __init__( + def __init__( self, output_path: str, schema: pa.Schema, batch_size: int = 1000000, compression: str = "gzip", - file_metadata: dict = None, + file_metadata: Optional[Dict[str, Any]] = None, ):
414-416
: Use logging.exception to preserve tracebacks.Provides full stack traces for easier debugging (ruff TRY400).
- self.logger.error( + self.logger.exception( f"Error during batch writing: {e}, file path: {self.output_path}" ) @@ - self.logger.error(f"Error closing writer: {e}") + self.logger.exception(f"Error closing writer: {e}")Also applies to: 431-431
quantmsio/operate/tools.py (2)
159-167
: Remove “Before Code” banner; keep helper or move to utils.The heading looks like leftover. If MaxQuant relies on this (see self._automaton), keep it here or import from a shared utils module.
-# Before Code: def get_ahocorasick(mods_dict: dict) -> ahocorasick.Automaton:
169-212
: Guard against select_mods=None in get_modification_details.Avoid TypeError when select_mods is None; also minor cleanup.
-def get_modification_details( +def get_modification_details( seq: str, mods_dict: dict, automaton: ahocorasick.Automaton, - select_mods: list | None = None, + select_mods: list | None = None, ) -> tuple[str, list]: @@ - elif modification in select_mods: + elif (select_mods is None) or (modification in select_mods): modifications.append(modification) modification_details.append(quantmsio/core/idxml_utils/idxml.py (4)
86-96
: Avoid bare except; catch Exception explicitly.Also keep calc_mz fallback.
- except: + except Exception: calc_mz = 0.0
304-315
: Use logger.exception and trim excessive debug noise.Preserve traceback; consider downgrading DEBUG banners to debug level.
- except Exception as e: - self.logger.error(f"=== DEBUG: Error transforming PSM record: {e} ===") + except Exception as e: + self.logger.exception("Error transforming PSM record") self.logger.error(f"=== DEBUG: Record keys: {list(record.keys())} ===") self.logger.error( f"=== DEBUG: Record sample: {dict(list(record.items())[:5])} ===" ) import traceback - - self.logger.error( - f"=== DEBUG: Transformation traceback: {traceback.format_exc()} ===" - ) + self.logger.debug(f"Transformation traceback: {traceback.format_exc()}") return None
397-401
: Remove redundant f-string and keep message at info level.Also, minor copy edit.
- logger.info(f"=== DEBUG: convert_to_parquet started ===") + logger.info("convert_to_parquet started")
445-449
: Use logging.exception to include stack trace.Easier troubleshooting for failures.
- except Exception as e: - logger.error(f"=== DEBUG: Error during parquet conversion: {e} ===") - logger.error(f"=== DEBUG: Exception type: {type(e)} ===") + except Exception as e: + logger.exception("Error during parquet conversion") + logger.error(f"Exception type: {type(e)}") import traceback - - logger.error(f"=== DEBUG: Traceback: {traceback.format_exc()} ===") + logger.debug(f"Traceback: {traceback.format_exc()}") raisequantmsio/core/maxquant/maxquant.py (2)
191-233
: get_modification_details dependency must be present.With the import restored and automaton initialized, this block will work as intended. Consider adding a quick guard if self._automaton is missing.
- peptidoform, other_modification_details = get_modification_details( - seq, self.mods_map, self._automaton, other_mods - ) + peptidoform, other_modification_details = get_modification_details( + seq, self.mods_map, getattr(self, "_automaton", get_ahocorasick(self.mods_map)), other_mods + )
761-799
: Prefer isinstance/type-safe checks in required fields loop (ruff E721).Use identity or isinstance to avoid equality on types; also simplifies conversions.
- for field, dtype in required_fields.items(): + for field, dtype in required_fields.items(): if field not in df.columns: - if dtype == list: + if dtype is list: df[field] = [[] for _ in range(len(df))] - elif dtype == str: + elif dtype is str: df[field] = None - elif dtype == float: + elif dtype is float: df[field] = 0.0 - elif dtype == int: + elif dtype is int: df[field] = 0 else: # Handle type conversions - if dtype == float: + if dtype is float: df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0) - elif dtype == int: + elif dtype is int: df[field] = ( pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int) ) - elif dtype == str: + elif dtype is str: df[field] = df[field].astype(str).fillna("")quantmsio/utils/constants.py (3)
91-101
: Consolidate MSstats “Reference” constants to avoid ambiguity.You define both MSSTATS_REFERENCE and MSSTATS_QUANTIFICATION_ID with the same value ("Reference"). Keeping two names for the same column is confusing and invites drift. Use one, and if you want an alias, assign it explicitly with a comment.
- MSSTATS_REFERENCE = "Reference" + MSSTATS_REFERENCE = "Reference" # primary name used across the codebase ... - MSSTATS_QUANTIFICATION_ID = "Reference" + # Alias kept for backward compatibility; prefer MSSTATS_REFERENCE + MSSTATS_QUANTIFICATION_ID = MSSTATS_REFERENCE
109-111
: Prefer an immutable container for shared prefixes.DECOY_PREFIXES is a global shared across modules. Make it a tuple to prevent accidental mutation.
-DECOY_PREFIXES = ["DECOY", "REV", "RANDOM"] +DECOY_PREFIXES = ("DECOY", "REV", "RANDOM")
114-125
: Add a brief note on mzTab column set origin.The MZTAB_PROTEIN_COLUMNS set mixes fixed columns and instrument-dependent optional columns. A short comment that this is the minimal/expected set used by downstream mapping (see core/common.py MZTAB_PG_MAP) will help maintainers align changes.
quantmsio/core/common.py (2)
40-42
: Clarify comment: this is not a regex.OPENMS_PROTEIN_QVALUE_WORDS is a list of required tokens, not a regular expression. Update the comment to avoid confusion.
-# Regular expression that make sure the protein search engine score contains the following words: openms target-decoy protein and q-value +# Required tokens for matching OpenMS protein q-value score column names
209-209
: Ruff RUF005: prefer list unpacking for readability.Minor style: apply Ruff’s suggestion.
-MZTAB_PG_USECOLS = list(MZTAB_PG_MAP.keys()) + ["description"] +MZTAB_PG_USECOLS = [*list(MZTAB_PG_MAP.keys()), "description"]quantmsio/core/quantms/pg.py (4)
20-33
: Constructor docstring and unused variable.
- Docstring still references mztab_path; update to indexer.
- protein_details_df is unused; drop to avoid unnecessary data load.
- def __init__(self, mztab_indexer): - """Initialize MzTabProteinGroups with an MzTabIndexer instance. - - Args: - mztab_path: Path to mzTab file - """ + def __init__(self, mztab_indexer): + """Initialize MzTabProteinGroups with an MzTabIndexer instance.""" self._indexer: MzTabIndexer = mztab_indexer - - # Initialize tracking lists first self._temp_files = [] # Track any temporary files created self._file_handles = [] # Track any open file handles - protein_details_df = self._indexer.get_protein_details() self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
401-404
: Use logging.exception to preserve traceback for SQL batch failures.You’re catching broad exceptions; at minimum, include the traceback.
- except Exception as e: - logger.warning(f"[ERROR] SQL batch failed: {e}, skipping batch") + except Exception: + logger.exception("[ERROR] SQL batch failed; skipping batch")
448-452
: Avoid destroying a shared indexer; delegate lifecycle to the caller.destroy_database() here will close the indexer even if the caller wants to reuse it. Since the CLI wraps you in a context manager already, drop this call and let the owner manage lifetime.
- # Clean up the temporary MzTabIndexer - self._indexer.destroy_database() - - # Cleanup any temporary files created during processing + # Cleanup any temporary files created during processing self.cleanup()
691-699
: Parameter ‘topn’ is ignored.Either implement TopN properly (needs peptide-level detail) or drop the parameter and flag approximation in docs.
quantmsio/core/diann/diann.py (2)
440-443
: Use numeric/boolean for is_decoy.String "0" risks type inconsistency downstream.
- report.loc[:, "is_decoy"] = "0" + report.loc[:, "is_decoy"] = 0
80-92
: Prefer DiannDuckDB helper for query construction.To avoid ad-hoc IN (...) formatting and improve reuse, delegate to DiannDuckDB.get_report_from_database.
- report = self._duckdb.query_to_df( - """ - select {} - from report - where Run IN {} - """.format( - sql, tuple(runs) - ) - ) + report = self._duckdb.get_report_from_database(runs, columns=sql)quantmsio/quantmsioc.py (1)
60-73
: Logger init once at CLI entry is good. Consider honoring --verbose globally.If you plan a global --verbose, add a click option on cli() and set level accordingly, instead of per-subcommand.
quantmsio/commands/convert/quantms.py (1)
512-512
: Remove extraneous f-string prefix.The f-string prefix is unnecessary here as there are no placeholders.
- logger.info(f"Successfully processed:") + logger.info("Successfully processed:")quantmsio/core/quantms/psm.py (1)
658-663
: Improve exception handling.Consider using more specific exception types for better error handling and clarity.
if not isinstance(partitions, list): - raise Exception(f"{partitions} is not a list") + raise TypeError(f"Expected list, got {type(partitions).__name__}") if len(partitions) == 0: - raise Exception(f"{partitions} is empty") + raise ValueError("Partitions list cannot be empty") for partion in partitions: if partion not in cols: - raise Exception(f"{partion} does not exist") + raise ValueError(f"Column '{partion}' does not exist in dataframe")quantmsio/core/quantms/feature.py (2)
535-540
: Improve exception handling.Consider using more specific exception types for better error handling and clarity.
if not isinstance(partitions, list): - raise Exception(f"{partitions} is not a list") + raise TypeError(f"Expected list, got {type(partitions).__name__}") if len(partitions) == 0: - raise Exception(f"{partitions} is empty") + raise ValueError("Partitions list cannot be empty") for partion in partitions: if partion not in cols: - raise Exception(f"{partion} does not exist") + raise ValueError(f"Column '{partion}' does not exist in dataframe")
201-201
: Address TODO comments.The code contains unresolved TODO comments that need attention:
- Line 201: Need to determine the data source for score_name and score_value
- Line 248: Need to confirm whether the proteins table contains "opt_global_q-value"
These TODOs indicate missing implementation details. Would you like me to help investigate the correct data sources or open issues to track these tasks?
Also applies to: 248-249
except Exception as e:
logger.exception(f"Error in feature conversion: {str(e)}")
raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")
🛠️ Refactor suggestion
Use proper exception chaining to preserve traceback.
When re-raising exceptions within an except clause, use from e
to preserve the original traceback context for better debugging.
- raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")
+ raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.") from e
Apply this change to all five occurrences in the file at lines 128, 208, 264, 410, and 528.
Also applies to: 208-208, 264-264, 410-410, 528-528
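A standalone illustration of the difference (not project code): with "from e" the original error is kept as __cause__, so both tracebacks are reported instead of only the ClickException.
import click

def convert_row():
    raise ValueError("bad input row")

try:
    convert_row()
except Exception as e:
    # "from e" chains the original exception; without it the root cause is easy to lose.
    raise click.ClickException(f"Error: {e}\nCheck the logs for more details.") from e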
🧰 Tools
🪛 Ruff (0.12.2)
128-128: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
128-128: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
In quantmsio/commands/convert/quantms.py around lines 128, 208, 264, 410, and
528 the code re-raises ClickException inside except blocks without exception
chaining; update each raise to use "from e" (e.g., raise
click.ClickException(...) from e) so the original traceback is preserved,
applying the same change at all five specified line locations.
"Run": "reference_file_name", | ||
} |
💡 Verification agent
🧩 Analysis chain
Be explicit about DIA-NN “Run” → “reference_file_name” rename across the codebase.
You remap “Run” to “reference_file_name”. Please confirm all call sites downstream (e.g., joins, filters, and CLI code) use the canonical “reference_file_name”. Mixed usage is a common source of bugs when reading older report exports.
Run to check remaining usage:
🏁 Script executed:
#!/bin/bash
# Find usages of raw "Run" column outside mapping dicts and tests
rg -nP -C2 '(?<!DIANN_MAP)\bRun\b' --type=py -g '!**/test/**' -g '!**/tests/**'
Length of output: 4408
Be explicit about renaming “Run” → “reference_file_name” in all DIA-NN call sites
- In quantmsio/core/diann/diann.py update:
  - lines 85–87 (WHERE Run IN …) → use "reference_file_name" or DIANN_MAP["Run"]
  - lines 110–114 (SELECT "Run") → select "reference_file_name"
  - line 122 (peptide_df.to_dict()["Run"]) → index by "reference_file_name"
  - lines 520–522 (get_unique_references("Run")) → call get_unique_references("reference_file_name")
- In quantmsio/utils/constants.py reconsider keeping RUN = "Run" and MSSTATS_RUN_NAME = "Run" alongside MSSTATS_REFERENCE_NAME = "reference_file_name" to avoid mixed-key usage.
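One way to keep call sites consistent is to apply the mapping once, right after the report is read, and only use the canonical column afterwards; a small self-contained sketch (the mapping shown is an illustrative subset of the real DIANN_MAP):
import pandas as pd

DIANN_MAP = {"Run": "reference_file_name"}  # illustrative subset of the real mapping

report_df = pd.DataFrame({"Run": ["f1.mzML", "f2.mzML"], "Precursor.Quantity": [1.0, 2.0]})
report_df = report_df.rename(columns=DIANN_MAP)

# Downstream code only ever touches the canonical name.
runs = report_df["reference_file_name"].unique()
print(runs)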
🤖 Prompt for AI Agents
In quantmsio/core/common.py around lines 95–96, ensure the mapping explicitly
renames "Run" to "reference_file_name" and propagate that change: update
quantmsio/core/diann/diann.py so that lines 85–87 use "reference_file_name" (or
DIANN_MAP["Run"]) instead of WHERE Run IN …, lines 110–114 SELECT
"reference_file_name" instead of SELECT "Run", line 122 indexes
peptide_df.to_dict() by "reference_file_name" instead of "Run", and lines
520–522 call get_unique_references("reference_file_name") instead of
get_unique_references("Run"); also review quantmsio/utils/constants.py and
either keep RUN = "Run" and MSSTATS_RUN_NAME = "Run" alongside
MSSTATS_REFERENCE_NAME = "reference_file_name" or consolidate keys to avoid
mixed-key usage so all call sites consistently use the new reference_file_name
key.
report.loc[:, "intensities"] = report[ | ||
["reference_file_name", "pg_quantity"] | ||
].apply( | ||
lambda rows: [ | ||
{ | ||
"sample_accession": self._sample_map[ | ||
rows["reference_file_name"] + "-LFQ" | ||
], | ||
"channel": "LFQ", | ||
"intensity": rows["pg_quantity"], | ||
} | ||
], | ||
axis=1, | ||
) |
🛠️ Refactor suggestion
LFQ sample mapping likely wrong: SDRF LFQ keys don’t include “-LFQ”. Add a fallback.
SDRFHandler.get_sample_map_run() maps LFQ as “file” (no “-LFQ” suffix). Your lookup uses “file-LFQ”.
- lambda rows: [
- {
- "sample_accession": self._sample_map[
- rows["reference_file_name"] + "-LFQ"
- ],
+ lambda rows: [
+ {
+ "sample_accession": self._sample_map.get(
+ rows["reference_file_name"] + "-LFQ",
+ self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]),
+ ),
"channel": "LFQ",
"intensity": rows["pg_quantity"],
}
],
📝 Committable suggestion
report.loc[:, "intensities"] = report[ | |
["reference_file_name", "pg_quantity"] | |
].apply( | |
lambda rows: [ | |
{ | |
"sample_accession": self._sample_map[ | |
rows["reference_file_name"] + "-LFQ" | |
], | |
"channel": "LFQ", | |
"intensity": rows["pg_quantity"], | |
} | |
], | |
axis=1, | |
) | |
report.loc[:, "intensities"] = report[ | |
["reference_file_name", "pg_quantity"] | |
].apply( | |
lambda rows: [ | |
{ | |
"sample_accession": self._sample_map.get( | |
rows["reference_file_name"] + "-LFQ", | |
self._sample_map.get( | |
rows["reference_file_name"], | |
rows["reference_file_name"] | |
) | |
), | |
"channel": "LFQ", | |
"intensity": rows["pg_quantity"], | |
} | |
], | |
axis=1, | |
) |
🤖 Prompt for AI Agents
In quantmsio/core/diann/diann.py around lines 190 to 203, the code looks up
sample IDs using keys like reference_file_name + "-LFQ" but
SDRFHandler.get_sample_map_run() stores LFQ keys without the "-LFQ" suffix;
update the lookup to first try self._sample_map.get(reference_file_name +
"-LFQ") and if that returns None fall back to
self._sample_map.get(reference_file_name) (or vice versa if preferred), and
handle missing mappings by raising a clear error or skipping the row; ensure the
lambda uses .get to avoid KeyError and produces the same dict structure.
def __init__(self, idxml_path: str):
"""Initialize IdXML PSM processor.

Args:
idxml_path: Path to idXML file
"""
self.idxml_path = Path(idxml_path)
self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
prot_ids = []
pep_ids = []
oms.IdXMLFile().load(idxml_path, prot_ids, pep_ids)
self.prot_ids = prot_ids
self.pep_ids = pep_ids
self.openms_handler = OpenMSHandler()
self.logger.info(
f"Loaded {len(self.prot_ids)} proteins and {len(self.pep_ids)} peptides from {idxml_path}"
)
|
🛠️ Refactor suggestion
Seed missing attributes and log summary.
Initialize search_metadata to avoid AttributeError later.
self.pep_ids = pep_ids
self.openms_handler = OpenMSHandler()
+ # Optional metadata injected by callers; default to empty dict
+ self.search_metadata: dict = {}
📝 Committable suggestion
def __init__(self, idxml_path: str):
"""Initialize IdXML PSM processor.
Args:
idxml_path: Path to idXML file
"""
self.idxml_path = Path(idxml_path)
self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
prot_ids = []
pep_ids = []
oms.IdXMLFile().load(idxml_path, prot_ids, pep_ids)
self.prot_ids = prot_ids
self.pep_ids = pep_ids
self.openms_handler = OpenMSHandler()
self.logger.info(
f"Loaded {len(self.prot_ids)} proteins and {len(self.pep_ids)} peptides from {idxml_path}"
)
def __init__(self, idxml_path: str):
"""Initialize IdXML PSM processor.
Args:
idxml_path: Path to idXML file
"""
self.idxml_path = Path(idxml_path)
self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
prot_ids = []
pep_ids = []
oms.IdXMLFile().load(idxml_path, prot_ids, pep_ids)
self.prot_ids = prot_ids
self.pep_ids = pep_ids
self.openms_handler = OpenMSHandler()
# Optional metadata injected by callers; default to empty dict
self.search_metadata: dict = {}
self.logger.info(
f"Loaded {len(self.prot_ids)} proteins and {len(self.pep_ids)} peptides from {idxml_path}"
)
🤖 Prompt for AI Agents
In quantmsio/core/idxml_utils/idxml.py around lines 26 to 43, the IdXML class
does not initialize the attribute search_metadata which causes an AttributeError
later; seed this attribute in __init__ (e.g. self.search_metadata = {} or
appropriate default) before any method may reference it, and keep the existing
load and logging logic unchanged so the instance always has a defined
search_metadata attribute after construction.
# Extract protein accessions
protein_accessions = [
acc.decode() for acc in hit.extractProteinAccessionsSet()
]
Protein accessions decoding may crash.
extractProteinAccessionsSet often yields strings; decoding blindly can fail. Decode bytes only.
- protein_accessions = [
- acc.decode() for acc in hit.extractProteinAccessionsSet()
- ]
+ protein_accessions = [
+ acc.decode() if isinstance(acc, (bytes, bytearray)) else str(acc)
+ for acc in hit.extractProteinAccessionsSet()
+ ]
📝 Committable suggestion
# Extract protein accessions
protein_accessions = [
acc.decode() for acc in hit.extractProteinAccessionsSet()
]
# Extract protein accessions
protein_accessions = [
acc.decode() if isinstance(acc, (bytes, bytearray)) else str(acc)
for acc in hit.extractProteinAccessionsSet()
]
🤖 Prompt for AI Agents
In quantmsio/core/idxml_utils/idxml.py around lines 104 to 107, the code
unconditionally calls .decode() on each accession from
hit.extractProteinAccessionsSet(), which can raise if items are already strings;
change the comprehension to decode only when an item is a bytes instance (e.g.,
acc.decode()) and otherwise keep the item as-is (or str(acc) if you want to
coerce non-str types), ensuring None or unexpected types are handled safely to
avoid crashes.
from quantmsio.operate.tools import (
get_ahocorasick,
get_modification_details,
# get_ahocorasick,
# get_modification_details,
get_protein_accession,
)
from pyopenms.Constants import PROTON_MASS_U
from quantmsio.utils.constants import ITRAQ_CHANNEL, TMT_CHANNELS
Re-import helpers required by this module.
get_modification_details is used below; import it (and get_ahocorasick if you keep the automaton).
-from quantmsio.operate.tools import (
- # get_ahocorasick,
- # get_modification_details,
- get_protein_accession,
-)
+from quantmsio.operate.tools import (
+ get_ahocorasick,
+ get_modification_details,
+ get_protein_accession,
+)
📝 Committable suggestion
from quantmsio.operate.tools import (
get_ahocorasick,
get_modification_details,
# get_ahocorasick,
# get_modification_details,
get_protein_accession,
)
from pyopenms.Constants import PROTON_MASS_U
from quantmsio.utils.constants import ITRAQ_CHANNEL, TMT_CHANNELS
from quantmsio.operate.tools import (
get_ahocorasick,
get_modification_details,
get_protein_accession,
)
from quantmsio.utils.constants import ITRAQ_CHANNEL, TMT_CHANNELS
🤖 Prompt for AI Agents
In quantmsio/core/maxquant/maxquant.py around lines 26 to 31, re-import the
helper(s) currently commented out: add get_modification_details to the import
list from quantmsio.operate.tools (and also add get_ahocorasick if the
Aho–Corasick automaton is used elsewhere in this module); update the import
statement to include these names so the functions used later are available, or
remove/comment any references below if you decide not to import them.
def extract_col_msg(self, col_df, label: str = "feature"):
line = "\t".join(col_df.columns)
self.mods_map = self.get_mods_map(line)
self._automaton = get_ahocorasick(self.mods_map)
# self._automaton = get_ahocorasick(self.mods_map)
if label == "feature":
intensity_normalize_names = []
intensity_names = []
Restore automaton initialization to avoid AttributeError.
self._automaton is referenced later; initialize it.
self.mods_map = self.get_mods_map(line)
- # self._automaton = get_ahocorasick(self.mods_map)
+ self._automaton = get_ahocorasick(self.mods_map)
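For context on what get_ahocorasick is expected to provide, a minimal pyahocorasick sketch of registering modification names and scanning a peptidoform string; the dictionary contents here are illustrative and may differ from the project's mods_map:
import ahocorasick

mods_map = {"Oxidation (M)": "UNIMOD:35", "Acetyl (Protein N-term)": "UNIMOD:1"}

automaton = ahocorasick.Automaton()
for name, accession in mods_map.items():
    automaton.add_word(name, (name, accession))
automaton.make_automaton()

peptidoform = "AM(Oxidation (M))DEK"
for end_index, (name, accession) in automaton.iter(peptidoform):
    print(name, accession, "ends at", end_index)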
📝 Committable suggestion
def extract_col_msg(self, col_df, label: str = "feature"):
line = "\t".join(col_df.columns)
self.mods_map = self.get_mods_map(line)
self._automaton = get_ahocorasick(self.mods_map)
# self._automaton = get_ahocorasick(self.mods_map)
if label == "feature":
intensity_normalize_names = []
intensity_names = []
def extract_col_msg(self, col_df, label: str = "feature"):
line = "\t".join(col_df.columns)
self.mods_map = self.get_mods_map(line)
self._automaton = get_ahocorasick(self.mods_map)
if label == "feature":
intensity_normalize_names = []
intensity_names = []
🤖 Prompt for AI Agents
In quantmsio/core/maxquant/maxquant.py around lines 63 to 69, restore
initialization of self._automaton after self.mods_map is set to prevent later
AttributeError: call and assign self._automaton = get_ahocorasick(self.mods_map)
(or the appropriate factory function) immediately after self.mods_map =
self.get_mods_map(line); if get_ahocorasick may return None, ensure you create a
valid automaton or raise/log a clear error so downstream references are safe.
msstats.loc[:, feature] = msstats[
["reference_file_name", "peptidoform", "precursor_charge"]
].apply(
lambda rows: merge_psm(rows, i),
Fix closure issue in lambda function.
The lambda function at line 527 has the same closure issue with the loop variable i.
for i, feature in enumerate(map_features):
if feature not in msstats.columns:
msstats.loc[:, feature] = msstats[
["reference_file_name", "peptidoform", "precursor_charge"]
].apply(
- lambda rows: merge_psm(rows, i),
+ lambda rows, idx=i: merge_psm(rows, idx),
axis=1,
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        for i, feature in enumerate(map_features):
            if feature not in msstats.columns:
                msstats.loc[:, feature] = msstats[
                    ["reference_file_name", "peptidoform", "precursor_charge"]
                ].apply(
                    lambda rows, idx=i: merge_psm(rows, idx),
                    axis=1,
                )
🧰 Tools
🪛 Ruff (0.12.2)
527-527: Function definition does not bind loop variable i
(B023)
🤖 Prompt for AI Agents
In quantmsio/core/quantms/feature.py around line 527 the lambda passed (lambda
rows: merge_psm(rows, i)) captures the loop variable i by reference causing all
lambdas to use the final i; fix by binding the current i into the lambda's
default arguments (e.g. lambda rows, i=i: merge_psm(rows, i)) or replace the
lambda with functools.partial(merge_psm, i=i) so each closure uses the intended
i value.
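The underlying Python behaviour, shown standalone and independent of the quantmsio code: closures look up the loop variable when they are called, not when they are defined, while a default argument freezes the value per iteration.

```python
# Late binding: each closure reads i when it runs, so all see the final value.
callbacks = [lambda: i for i in range(3)]
print([f() for f in callbacks])  # [2, 2, 2]

# Default argument: the current value of i is captured at definition time.
callbacks = [lambda i=i: i for i in range(3)]
print([f() for f in callbacks])  # [0, 1, 2]
```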
        if len(factor_columns) != 1:
            raise ValueError("The number of factor columns should be 1")

            # Rename the factor value column with condition as name
            sdrf_pd = sdrf_pd.rename(columns={factor_columns[0]: "condition"})
            sdrf_pd["condition"] = (
                sdrf_pd[factor_columns].astype(str).agg("|".join, axis=1)
            )
        else:
            sdrf_pd["condition"] = sdrf_pd[factor_columns]
Bug: assigning a DataFrame to a Series column.
When exactly one factor column exists, sdrf_pd[factor_columns] is a DataFrame; assign the single Series instead.
- else:
- sdrf_pd["condition"] = sdrf_pd[factor_columns]
+ else:
+ # factor_columns has exactly 1 element here
+ sdrf_pd["condition"] = sdrf_pd[factor_columns[0]].astype(str)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        if len(factor_columns) != 1:
            sdrf_pd["condition"] = (
                sdrf_pd[factor_columns].astype(str).agg("|".join, axis=1)
            )
        else:
            # factor_columns has exactly 1 element here
            sdrf_pd["condition"] = sdrf_pd[factor_columns[0]].astype(str)
🤖 Prompt for AI Agents
In quantmsio/core/sdrf.py around lines 224 to 230, when there is exactly one
factor column the code assigns sdrf_pd[factor_columns] (a one-column DataFrame)
to the "condition" column; change this to assign the single Series instead by
indexing the single column name (e.g., use factor_columns[0]) and ensure you
cast it to string to match the multi-column path before assigning to
sdrf_pd["condition"].
Fix the quantms LFQ/TMT modules
Actionable comments posted: 16
♻️ Duplicate comments (15)
tests/core/quantms/test_full_conversion_suite.py (3)
17-20
: Fix workspace root to repository root. Matches prior feedback; avoids path issues on CI.
def get_workspace_root(): """Get the workspace root directory""" - return Path(__file__).parent.parent.parent + return Path(__file__).resolve().parents[3]
7-15
: Use current interpreter instead of hardcoding 'python'Prevents environment mismatches; add
sys
import.import os import subprocess import tempfile import time from pathlib import Path +import sys
119-131
: Replace 'python' withsys.executable
in all CLI invocationsEnsures consistency across platforms and virtualenvs.
- cmd = [ - "python", + cmd = [ + sys.executable,(Apply to all six command builders in these ranges.)
Also applies to: 171-187, 227-243, 281-293, 333-349, 389-405
quantmsio/core/quantms/feature.py (3)
458-464
: Bind loop variable in lambda to avoid late-binding bug (regression of prior comment).Each lambda closes over i; all rows get the last value. Bind i at definition time.
Apply:
for i, feature in enumerate(map_features): - msstats.loc[:, feature] = msstats[ + index = i # bind current index + msstats.loc[:, feature] = msstats[ ["reference_file_name", "peptidoform", "precursor_charge"] ].apply( - lambda rows: merge_psm(rows, i), + lambda rows, idx=index: merge_psm(rows, idx), axis=1, )
512-519
: Bind loop variable in lambda to avoid late-binding bug (regression of prior comment).Same closure issue here; bind i.
Apply:
for i, feature in enumerate(map_features): if feature not in msstats.columns: - msstats.loc[:, feature] = msstats[ + index = i + msstats.loc[:, feature] = msstats[ ["reference_file_name", "peptidoform", "precursor_charge"] ].apply( - lambda rows: merge_psm(rows, i), + lambda rows, idx=index: merge_psm(rows, idx), axis=1, )
521-533
: Loop variable shadows outer df; yields confusing bugs.Rename inner variable.
- for key, df in df.groupby(partitions): - yield key, df + for key, group_df in df.groupby(partitions): + yield key, group_dftests/core/quantms/psms/test_psm.py (2)
102-121
: Create output parquet safely inside TemporaryDirectory.Avoid mktemp for files; use a path inside a temp dir.
- temp_db_path = tempfile.mktemp(suffix=".duckdb") - temp_output_path = tempfile.mktemp(suffix=".parquet") + with tempfile.TemporaryDirectory() as tmpdir: + temp_db_path = Path(tmpdir) / "psm.duckdb" + temp_output_path = Path(tmpdir) / "psm.parquet"
43-60
: Replace insecure mktemp with TemporaryDirectory; remove manual unlink.Avoid CWE-377 and let the directory handle cleanup.
- # Create a temporary database path (don't create the file, let DuckDB do it) - temp_db_path = tempfile.mktemp(suffix=".duckdb") + with tempfile.TemporaryDirectory() as tmpdir: + temp_db_path = Path(tmpdir) / "psm.duckdb" ... - finally: - # Clean up temporary database - if os.path.exists(temp_db_path): - os.unlink(temp_db_path) + finally: + passApply the same pattern to Lines 69-93, 102-129 (also for temp_output_path: create inside tmpdir), 141-159, 304-382, 394-536, and 548-585.
quantmsio/core/quantms/pg.py (1)
180-191
: Dead method references undefined attribute_mztab_path
.Calling
_is_protein_line
would raiseAttributeError
. Remove or guard it.- def _is_protein_line(self, line_num: int) -> bool: - """Check if a line is a protein data line.""" - try: - # Use safe file opening with automatic cleanup - with self._safe_file_open(self._mztab_path, "r") as file: - for i, line in enumerate(file): - if i == line_num: - return line.startswith("PRT") - return False - except Exception: - return False + def _is_protein_line(self, line_num: int) -> bool: + """Deprecated: no raw-file parsing in this class.""" + raise NotImplementedError("Not supported. Use indexer-backed access.")quantmsio/core/quantms/psm.py (2)
378-384
: Typo in CV param key: “consesus_support” → “consensus_support”Align the CV name with the intended spelling and downstream readers.
- cv_list = [] - if consensus_support and consensus_support != "null": - struct = { - "cv_name": "consesus_support", - "cv_value": str(consensus_support), - } - cv_list.append(struct) + cv_list = [] + if consensus_support and consensus_support != "null": + struct = { + "cv_name": "consensus_support", + "cv_value": str(consensus_support), + } + cv_list.append(struct)Also applies to: 523-531
654-666
: Fix variable shadowing in generatorLoop variable
df
shadows the outerdf
.- for key, df in df.groupby(partitions): - yield key, df + for key, group_df in df.groupby(partitions): + yield key, group_dfquantmsio/core/duckdb.py (4)
151-159
: Fix table-existence check (variable shadowing bug)The method shadows
tables
and checks each table against itself; always True. Use a separate variable.def check_tables_in_db(self, tables: List[str]) -> bool: - """Check if the database connection is valid.""" - tables = [t[0] for t in duckdb.execute("SHOW TABLES").fetchall()] - for table in tables: - if table not in tables: + """Check if required tables exist in the database.""" + existing_tables = [t[0] for t in self._duckdb.execute("SHOW TABLES").fetchall()] + for table in tables: + if table not in existing_tables: logging.info(f"Table {table} not in database") return False return True
507-512
: Parameterizeiter_file
batch queryInline stringified list and quotes are incorrect and unsafe.
- query = f""" - SELECT {cols} - FROM report - WHERE "{field}" = ANY('{refs}') - """ - yield refs, self.query_to_df(query) + placeholders = ",".join("?" for _ in refs) + query = f'SELECT {cols} FROM report WHERE "{field}" IN ({placeholders})' + yield refs, self._duckdb.execute(query, refs).df()
477-483
: Parameterizeget_report_from_database
refs filterAvoid interpolating lists; use
IN (?,...,?)
.- query = f""" - SELECT {columns} - FROM report - WHERE "{RUN}" = ANY({refs}) - """ - return self.query_to_df(query) + placeholders = ",".join("?" for _ in refs) + query = f'SELECT {columns} FROM report WHERE "{RUN}" IN ({placeholders})' + return self._duckdb.execute(query, refs).df()
354-358
: Parameterize filtering (avoid ANY(...) with inline list)Inline Python lists in SQL is unsafe and brittle. Use placeholders.
query = "SELECT * FROM pg_matrix" if protein_groups_tuple: - query += f' WHERE "{PROTEIN_GROUP}" = ANY({list(protein_groups_tuple)})' - - return self.query_to_df(query) + placeholders = ",".join("?" for _ in protein_groups_tuple) + query += f' WHERE "{PROTEIN_GROUP}" IN ({placeholders})' + return self._duckdb.execute(query, list(protein_groups_tuple)).df() + return self._duckdb.execute(query).df()
🧹 Nitpick comments (37)
tests/core/quantms/test_msstats_in.py (6)
287-301
: Handle both reference file column names when building the matrixHard-coding
reference_file_name
can break with legacy inputs usingReference_Name
.- for _, row in msstats_batch.iterrows(): - ref_file = row["reference_file_name"] + for _, row in msstats_batch.iterrows(): + ref_col = ( + "Reference_Name" + if "Reference_Name" in msstats_batch.columns + else "reference_file_name" + ) + ref_file = row[ref_col]
165-179
: Prefer public API over private_duckdb
and use normalized column namesUse
query_to_df
and the normalized lower-case columns to avoid brittle tests.- raw_channels = indexer._duckdb.sql( - "SELECT DISTINCT Channel FROM msstats ORDER BY Channel" - ).df() + raw_channels = indexer.query_to_df( + "SELECT DISTINCT channel FROM msstats ORDER BY channel" + ) @@ - sample_raw = indexer._duckdb.sql( - """ - SELECT Channel, Reference, ProteinName, PeptideSequence, Intensity - FROM msstats - WHERE Reference LIKE '%a05058%' - AND ProteinName LIKE '%P55011%' - LIMIT 15 - """ - ).df() + sample_raw = indexer.query_to_df( + """ + SELECT channel, reference_file_name, pg_accessions, peptidoform, intensity + FROM msstats + WHERE reference_file_name LIKE '%a05058%' + AND pg_accessions LIKE '%P55011%' + LIMIT 15 + """ + )
349-349
: Replace ambiguous multiplication signAvoid the Unicode “×” to prevent encoding/test log issues.
- print("\nFeatures by File × Channel Matrix:") + print("\nFeatures by File x Channel Matrix:")
117-117
: Remove extraneous f-strings without placeholdersMinor cleanliness; keeps linters quiet (Ruff F541).
- print(f"\nLFQ test completed successfully!") + print("\nLFQ test completed successfully!") @@ - print(f"\nTMT test completed successfully!") + print("\nTMT test completed successfully!") @@ - print(f"Intensity distribution analysis completed") + print("Intensity distribution analysis completed") @@ - print(f"\nDataset Comparison:") + print("\nDataset Comparison:")Also applies to: 465-465, 536-536, 589-589
9-10
: Drop unused imports
gzip
andtempfile
are not used in this module.-import gzip -import tempfile +import tempfile # if you still need it elsewhere, keep; otherwise removeIf
tempfile
is also unused, remove it as well.
181-182
: Avoid catching bare Exception in debug helperNarrow the exception or re-raise after logging to keep failures visible in CI.
- except Exception as e: + except (RuntimeError, ValueError) as e:quantmsio/utils/constants.py (3)
94-95
: Alias peptide name to the sequence constant to prevent driftAvoid duplicating the same string literal in two places.
-MSSTATS_PEPTIDE_NAME = "PeptideSequence" +MSSTATS_PEPTIDE_NAME = MSSTATS_PEPTIDE_SEQUENCE
111-114
: Make MSstats columns order deterministicUsing
set()
makes column order non-deterministic, which can lead to unstable schemas across chunks.-MSSTATS_USECOLS = list( - set(MSSTATS_NAME_MAP.values()) - | {"sample_accession", "reference_file_name", "channel"} -) +MSSTATS_USECOLS = [ + "pg_accessions", + "peptidoform", + "charge", + "intensity", + "sample_accession", + "reference_file_name", + "channel", +]
121-133
: Use declared constants withinMZTAB_PROTEIN_COLUMNS
for consistencyKeeps identifiers centralized and reduces typos.
MZTAB_PROTEIN_COLUMNS = [ ACCESSION, "description", MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE, - "ambiguity_members", + AMBIGUITY_MEMBERS, "modifications", "protein_coverage", "opt_global_nr_found_peptides", "opt_global_Posterior_Probability_score", "opt_global_cv_PRIDE:0000303_decoy_hit", - "opt_global_result_type", + OPT_GLOBAL_RESULT_TYPE, ]tests/core/quantms/test_full_conversion_suite.py (3)
1-1
: Remove shebang or mark file executableThe shebang isn’t needed for a pytest module and triggers Ruff EXE001.
-#!/usr/bin/env python3
91-101
: Avoid loading full Parquet into memory for simple statsUse metadata/schema to get rows/cols and column names.
- try: - df = pq.read_table(file_path).to_pandas() - size_mb = file_path.stat().st_size / (1024 * 1024) - - return { - "rows": len(df), - "columns": len(df.columns), - "size_mb": size_mb, - "column_names": list(df.columns)[:10], # First 10 columns - } + try: + pf = pq.ParquetFile(file_path) + size_mb = file_path.stat().st_size / (1024 * 1024) + names = list(pf.schema_arrow.names) + return { + "rows": pf.metadata.num_rows, + "columns": len(names), + "size_mb": size_mb, + "column_names": names[:10], + }
81-84
: Narrow broad exception or re-raise after loggingCatching
Exception
hides real failures during CI; consider narrowing toOSError
/CalledProcessError
or re-raising.quantmsio/core/quantms/feature.py (4)
488-496
: Column name “mp_accessions” looks unintended; likely “accession” or “pg_accessions”.“mp_accessions” is not used elsewhere and may not exist in FEATURE_SCHEMA; risk of silent drop/mismatch. Please confirm the intended field and rename accordingly.
Example:
- "mp_accessions", + "accession", # or "pg_accessions" if you intend PG group accessionsIf you intend PGs, consider deriving from existing pg_accessions instead of the PSM accession string.
660-664
: anchor_protein type inconsistency with schema.add_additional_msg converts anchor_protein to list, but convert_to_parquet_format later casts anchor_protein to str. Confirm FEATURE_SCHEMA: if anchor_protein is string, keep the list in a new field (anchor_proteins) and leave anchor_protein as string; if it’s a list, remove it from string_columns and add to list_columns.
Option A (keep string + add list):
- msstats["anchor_protein"] = msstats["anchor_protein"].apply( - get_protein_accession - ) + msstats["anchor_proteins"] = msstats["anchor_protein"].apply(get_protein_accession)string_columns = [ - "sequence","peptidoform","reference_file_name","anchor_protein","scan","scan_reference_file_name", + "sequence","peptidoform","reference_file_name","anchor_protein","scan","scan_reference_file_name", ]-list_columns = ["pg_accessions", "gg_accessions", "gg_names"] +list_columns = ["pg_accessions", "gg_accessions", "gg_names", "anchor_proteins"]Option B (schema expects list): remove anchor_protein from string_columns and add it to list_columns.
Also applies to: 739-747, 751-763
114-116
: Overbroad exception handlers hide real errors.Catching bare Exception and immediately returning empty structures makes debugging hard. Narrow exceptions and re-raise unexpected errors, or at least log with stack_info/exec_info.
Example:
- except Exception as e: - self.logger.warning(f"Could not extract PSM messages: {e}") - return {}, {} + except (KeyError, AttributeError, ValueError) as e: + self.logger.warning("Could not extract PSM messages", exc_info=e) + return {}, {}Repeat pattern for other helpers.
Also applies to: 233-235, 257-259, 270-272, 327-331
134-223
: generate_modifications_details: simplify score assignment and guard indexing.
- scores selection uses “mod_name in select_mods” which is always True (select_mods = keys).
- Accessing sequence[aa_pos-1] needs a length guard.
- aa = sequence[aa_pos - 1] + if 0 < aa_pos <= len(sequence): + aa = sequence[aa_pos - 1] + else: + continue ... - "scores": ( - [ - {"score_name": "localization_probability","score_value": None} - ] if mod_name in select_mods else None - ), + "scores": None, # until a real source is availableAlso drop the unused select_mods variable.
tests/core/quantms/test_feature.py (1)
36-40
: Prefer context-managing the indexer.MzTabIndexer supports enter/exit. Using “with MzTabIndexer.create(...) as indexer” guarantees cleanup even on failure.
- indexer = MzTabIndexer.create( + with MzTabIndexer.create( mztab_path=mztab_file, msstats_path=msstats_file, sdrf_path=sdrf_file, - database_path=temp_db_path, - ) + database_path=temp_db_path, + ) as indexer:tests/core/quantms/psms/test_psm.py (1)
246-254
: Use Pythonic truth checks in tests.E712 flags “== True/False”. Prefer direct truthiness.
- ) == True + ) ... - assert _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"]) == False + assert not _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"])tests/core/quantms/psms/test_duckdb_creation.py (4)
137-139
: Remove unnecessary f-strings without placeholders.These prints don’t interpolate variables and trigger F541 in Ruff. Drop the
f
prefix.Apply across the listed lines, e.g.:
- print(f"\n=== Testing Two-Step Creation Process ===") + print("\n=== Testing Two-Step Creation Process ===")Also applies to: 166-177, 226-234, 263-268, 293-304, 350-361, 385-392, 419-431, 472-480, 499-503, 530-537, 564-569, 617-625, 644-648
149-154
: Avoid accessing private attribute_duckdb
in tests.Use a read-only DuckDB connection for verification to keep tests decoupled from internals.
Example for one block:
- tables = [t[0] for t in mztab_step1._duckdb.execute("SHOW TABLES").fetchall()] + with duckdb.connect(str(output_path), read_only=True) as con_verify: + tables = [t[0] for t in con_verify.execute("SHOW TABLES").fetchall()]Also applies to: 276-281, 311-317, 399-404, 442-449, 586-594
653-660
: Replace bare except/pass in cleanup helpers.Use
contextlib.suppress
or log at debug to avoid silencing unexpected errors and S110/BLE001 warnings.Apply:
+import contextlib @@ - if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb: - try: - mztab_indexer._duckdb.close() - except Exception: - pass # Ignore errors during cleanup + if mztab_indexer and getattr(mztab_indexer, "_duckdb", None): + with contextlib.suppress(Exception): + mztab_indexer._duckdb.close() @@ - if con: - try: - con.close() - except Exception: - pass # Ignore errors during cleanup + if con: + with contextlib.suppress(Exception): + con.close()Also applies to: 662-669
46-50
: Test name/docs mention Parquet but only DuckDB is exercised.Either add Parquet path coverage or rename to avoid confusion.
quantmsio/core/quantms/pg.py (3)
45-55
: Use a consistent logger; prefer the instance logger.
cleanup()
uses a different logger name than the class logger, fragmenting log control.- logger = logging.getLogger("quantmsio.core.mztab") + logger = self.loggerAlso replace other
logging.getLogger("quantmsio.core.mztab")
/quantmsio.core.quantms.pg
occurrences in this method withself.logger
.Also applies to: 57-66
266-271
: Preferlogging.exception
in except blocks.Leverage stack traces directly; current error messages duplicate
str(e)
.- logging.getLogger("quantmsio.core.mztab").error( - f"Error opening file {file_path}: {e}" - ) + logging.getLogger("quantmsio.core.mztab").exception( + "Error opening file %s", file_path + ) @@ - logger.error(f"Error loading protein groups table: {e}") + logger.exception("Error loading protein groups table") @@ - logger.error(f"Error creating msstats protein join: {e}") + logger.exception("Error creating msstats protein join")Also applies to: 283-289, 485-487, 555-557
425-491
: Fix logging and remove unusedprotein_groups
var.You build
protein_data
but loglen(protein_groups)
, which is always 0.- protein_groups = {} - protein_data = list() + protein_data: list[dict] = [] @@ - logger.info( - f"Loaded {len(protein_groups)} protein groups from {total_rows} total rows" - ) + logger.info( + "Loaded %d protein groups from %d total rows", + len(protein_data), + total_rows, + )quantmsio/commands/convert/quantms.py (5)
129-132
: Preserve traceback and avoid redundantstr(e)
in logs.Use
logger.exception("...")
and chain exceptions withfrom e
.- except Exception as e: - logger.exception(f"Error in feature conversion: {str(e)}") - raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.") + except Exception as e: + logger.exception("Error in feature conversion") + raise click.ClickException(f"Error: {e}\nCheck the logs for more details.") from eRepeat similarly for PSM, idXML-PSM, PG, and create-duckdb handlers.
Also applies to: 211-214, 267-270, 416-419, 534-537
324-335
: Annotate optional parameters explicitly withOptional[...]
.PEP 484 discourages implicit Optional typing; update annotations.
-def convert_quantms_pg_cmd( - mztab_path: Path = None, - database_path: Path = None, - msstats_file: Path = None, - sdrf_file: Path = None, - output_folder: Path = None, +def convert_quantms_pg_cmd( + mztab_path: Optional[Path] = None, + database_path: Optional[Path] = None, + msstats_file: Optional[Path] = None, + sdrf_file: Optional[Path] = None, + output_folder: Optional[Path] = None,Ensure
from typing import Optional
is present (it already is).
520-520
: Drop extraneous f-prefix.Minor logging polish: remove
f
when no placeholders are used.- logger.info(f"Successfully processed:") + logger.info("Successfully processed:")
88-95
: Minor: avoid redundantPath()
wrapping.
database_path
is already aPath
; use it directly for.exists()
andstr()
conversions.Example:
- if database_path and Path(database_path).exists(): + if database_path and database_path.exists():Also applies to: 181-194, 369-385, 501-509
402-415
: Use public conversion helper to write Parquet.If possible, expose a writer method on
MzTabProteinGroups
similar toFeature.write_feature_to_file
to avoid calling a private_convert_to_parquet_format
.quantmsio/core/quantms/psm.py (6)
386-392
: Avoid KeyError; reuse computed spectra_ref and handle NoneUse the local
spectra_ref
and guard for None to prevent KeyError.- transformed_record["scan"] = generate_scan_number(record["spectra_ref"]) + transformed_record["scan"] = ( + generate_scan_number(spectra_ref) if spectra_ref else None + )
430-436
: Remove duplicate assignments
mz_array
andintensity_array
are set twice.transformed_record["number_peaks"] = None transformed_record["ion_mobility"] = None - transformed_record["mz_array"] = None - transformed_record["intensity_array"] = None
439-441
: Log full traceback on transform failuresUse
logger.exception
for better diagnostics.- except Exception as e: - self.logger.warning(f"Error transforming PSM record: {e}") + except Exception: + self.logger.exception("Error transforming PSM record") return None
478-491
: Use.get()
to avoid KeyError on missing score columnsPSM rows may not carry all
search_engine_score[i]
fields.- for score_index, score_name in self._score_names["psms"].items(): - value = record[f"search_engine_score[{score_index}]"] + for score_index, score_name in self._score_names["psms"].items(): + value = record.get(f"search_engine_score[{score_index}]") if value is not None and value != "null":
507-509
: Minor: remove unnecessary f-stringNo placeholders present.
- self.logger.warning(f"Could not convert global_qvalue to float") + self.logger.warning("Could not convert global_qvalue to float")
580-582
: Preferlogger.exception
when catching unexpected errorsPreserves stack trace for debugging.
- except Exception as e: - self.logger.error(f"Could not extract peptide columns: {e}") + except Exception: + self.logger.exception("Could not extract peptide columns") self._pep_columns = []quantmsio/core/duckdb.py (1)
223-225
: Uselogger.exception
to capture stack trace on benchmark failures- except Exception as e: - self.logger.error(f"Query failed on iteration {i+1}: {e}") + except Exception: + self.logger.exception(f"Query failed on iteration {i+1}") return {"error": str(e)}
    elif mztab_path:
        # No database_path provided, create a temporary one
        temp_db_path = tempfile.mktemp(suffix=".duckdb")
        logger.info(
            f"Creating temporary MzTabIndexer at {temp_db_path} from {mztab_path}"
        )
        indexer = MzTabIndexer.create(
            mztab_path=str(mztab_path),
            msstats_path=msstats_file,
            sdrf_path=sdrf_file,
            database_path=temp_db_path,
        )
Replace insecure tempfile.mktemp() with NamedTemporaryFile(delete=False).
mktemp is vulnerable to race conditions (CWE-377/B306). Use a secure temp file.
- temp_db_path = tempfile.mktemp(suffix=".duckdb")
+ from tempfile import NamedTemporaryFile
+ with NamedTemporaryFile(suffix=".duckdb", delete=False) as tf:
+ temp_db_path = tf.name
Apply in all three places.
Also applies to: 195-204, 386-396
🧰 Tools
🪛 ast-grep (0.38.6)
[warning] 106-106: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(suffix=".duckdb")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-mktemp-python)
🪛 Ruff (0.12.2)
107-107: Use of insecure and deprecated function (mktemp)
(S306)
🪛 GitHub Check: CodeFactor
[warning] 110-110: quantmsio/commands/convert/quantms.py#L110
Use of insecure and deprecated function (mktemp). (B306)
🤖 Prompt for AI Agents
In quantmsio/commands/convert/quantms.py around lines 105 to 116 (and also apply
the same change at 195-204 and 386-396), replace the insecure tempfile.mktemp()
calls with a secure tempfile.NamedTemporaryFile(delete=False, suffix=".duckdb")
usage: create a NamedTemporaryFile with delete=False, capture its .name for
database_path, close the file descriptor immediately (so it can be opened by
other code) and pass the .name to MzTabIndexer.create; ensure the temporary file
is cleaned up later if needed.
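A minimal sketch of the suggested pattern, assuming the caller only needs a unique path that survives past the with block; whether the consumer tolerates a pre-existing (empty) file is an assumption to verify:

```python
import os
from tempfile import NamedTemporaryFile

# Reserve a unique path securely; delete=False keeps the file after the handle closes.
with NamedTemporaryFile(suffix=".duckdb", delete=False) as tf:
    temp_db_path = tf.name

try:
    # Hand temp_db_path to the component that creates the database here.
    # Note: the file already exists (empty); remove it first if the consumer
    # requires a non-existent path.
    print(f"Temporary database path: {temp_db_path}")
finally:
    if os.path.exists(temp_db_path):
        os.unlink(temp_db_path)
```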
    def __init__(
        self,
        report_path: Union[Path, str],
        duckdb_max_memory: str = "16GB",
        duckdb_threads: int = 4,
        database_path: Optional[Union[Path, str]] = None,
        max_memory: str = "16GB",
        worker_threads: int = 8,
    ):
        self.parquet_db: duckdb.DuckDBPyConnection | None = None
        self._report_path = report_path
        self._duckdb_name = create_uuid_filename("report-duckdb", ".db")
        self._duckdb = self.create_duckdb_from_diann_report(
            duckdb_max_memory, duckdb_threads
        """Initialize DuckDB connection.

        Args:
            database_path: Optional path to DuckDB database file
            max_memory: Maximum memory to use
            worker_threads: Number of worker threads
        """
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
        self.logger.info(f"duckdb uses {worker_threads} threads.")
        self.logger.info(f"duckdb uses {max_memory} of memory.")

        start_time = time.time()
        self._duckdb = duckdb.connect(
            database=str(database_path) if database_path else ":memory:",
            config={
                "memory_limit": max_memory,
                "threads": worker_threads,
            },
        )
        self.logger.info(
            f"Time to initialize duckdb {(time.time() - start_time):.2f} seconds"
        )
🛠️ Refactor suggestion
Persist database path and apply resource settings via constructor
Store database_path for cleanup and pass max_memory/worker_threads through __init__. Otherwise, cleanup_duckdb can't locate the file and subclass workarounds appear (see DiannDuckDB).
def __init__(
self,
database_path: Optional[Union[Path, str]] = None,
max_memory: str = "16GB",
worker_threads: int = 8,
):
"""Initialize DuckDB connection.
@@
- self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
+ self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
self.logger.info(f"duckdb uses {worker_threads} threads.")
self.logger.info(f"duckdb uses {max_memory} of memory.")
start_time = time.time()
- self._duckdb = duckdb.connect(
- database=str(database_path) if database_path else ":memory:",
+ # Track database path for cleanup
+ self._database_path = Path(database_path) if database_path else None
+ self._duckdb = duckdb.connect(
+ database=str(self._database_path) if self._database_path else ":memory:",
config={
"memory_limit": max_memory,
"threads": worker_threads,
},
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def __init__(
        self,
        database_path: Optional[Union[Path, str]] = None,
        max_memory: str = "16GB",
        worker_threads: int = 8,
    ):
        """Initialize DuckDB connection.

        Args:
            database_path: Optional path to DuckDB database file
            max_memory: Maximum memory to use
            worker_threads: Number of worker threads
        """
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
        self.logger.info(f"duckdb uses {worker_threads} threads.")
        self.logger.info(f"duckdb uses {max_memory} of memory.")

        start_time = time.time()
        # Track database path for cleanup
        self._database_path = Path(database_path) if database_path else None
        self._duckdb = duckdb.connect(
            database=str(self._database_path) if self._database_path else ":memory:",
            config={
                "memory_limit": max_memory,
                "threads": worker_threads,
            },
        )
        self.logger.info(
            f"Time to initialize duckdb {(time.time() - start_time):.2f} seconds"
        )
🤖 Prompt for AI Agents
In quantmsio/core/duckdb.py around lines 24 to 51, the constructor currently
does not persist the provided database_path or the resource settings on the
instance, which prevents cleanup_duckdb and subclasses from locating the DB file
or reading configured resources; fix by storing the incoming parameters on self
(e.g., self.database_path = Path(database_path) if provided else None,
self.max_memory = max_memory, self.worker_threads = worker_threads) and continue
to use those stored values when connecting (and later when cleaning up) so
cleanup_duckdb and subclasses can access the actual path and resource settings.
    def cleanup_duckdb(self):
        """Check if DuckDB connection is closed, then delete the database file."""
        # Close connection if it is still open
        if self._duckdb:
            try:
                self._duckdb.close()
                self.logger.info("[Check] DuckDB connection closed.")
            except Exception as e:
                self.logger.info(f"Failed to close DuckDB connection: {e}")
            finally:
                self._duckdb = None

        db_file = Path(self._database_path)
        # Delete the database file using pathlib
        if db_file.exists():
            try:
                db_file.unlink()
                self.logger.info(f"[CleanUp] Database file deleted: {db_file}")
            except Exception as e:
                self.logger.info(f"Failed to delete database file: {e}")
cleanup_duckdb breaks when _database_path is unset
When using an in-memory DB, _database_path is None; current code raises. Guard for None before deletion.
- db_file = Path(self._database_path)
- # Delete the database file using pathlib
- if db_file.exists():
+ if not getattr(self, "_database_path", None):
+ return
+ db_file = Path(self._database_path)
+ # Delete the database file using pathlib
+ if db_file.exists():
try:
db_file.unlink()
self.logger.info(f"[CleanUp] Database file deleted: {db_file}")
except Exception as e:
self.logger.info(f"Failed to delete database file: {e}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def cleanup_duckdb(self):
        """Check if DuckDB connection is closed, then delete the database file."""
        # Close connection if it is still open
        if self._duckdb:
            try:
                self._duckdb.close()
                self.logger.info("[Check] DuckDB connection closed.")
            except Exception as e:
                self.logger.info(f"Failed to close DuckDB connection: {e}")
            finally:
                self._duckdb = None

        # If no on-disk database path is set (e.g. in-memory DB), skip file deletion
        if not getattr(self, "_database_path", None):
            return
        db_file = Path(self._database_path)
        # Delete the database file using pathlib
        if db_file.exists():
            try:
                db_file.unlink()
                self.logger.info(f"[CleanUp] Database file deleted: {db_file}")
            except Exception as e:
                self.logger.info(f"Failed to delete database file: {e}")
🧰 Tools
🪛 Ruff (0.12.2)
66-66: Do not catch blind exception: Exception
(BLE001)
77-77: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In quantmsio/core/duckdb.py around lines 59 to 78, cleanup_duckdb assumes
self._database_path is set and constructs a Path which raises when it's None;
guard before attempting deletion by checking if self._database_path is truthy
(not None/empty) and only then create Path(self._database_path) and call
exists()/unlink(); if it's None or represents an in-memory DB skip file deletion
and log that no file was present; preserve existing exception handling and set
self._duckdb = None in the finally block.
    def secure_query_to_df(
        self, query: str, params: Optional[Dict[str, Any]] = None
    ) -> pd.DataFrame:
        """Execute parameterized query safely and return result as DataFrame.

        Args:
            query: SQL query with placeholders (e.g., 'SELECT * FROM table WHERE id = ?')
            params: Dictionary of parameters to safely substitute

        if max_memory is not None:
            database.execute("SET max_memory='{}'".format(max_memory))
        if worker_threads is not None:
            database.execute("SET worker_threads='{}'".format(worker_threads))
        Returns:
            DataFrame with query results

        msg = database.execute(
            "SELECT * FROM duckdb_settings() where name in ('worker_threads', 'max_memory')"
        ).df()
        logging.info("duckdb uses {} threads.".format(str(msg["value"][0])))
        logging.info("duckdb uses {} of memory.".format(str(msg["value"][1])))
        Example:
            >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", {"accession": "P12345"})
        """
        if params:
            # DuckDB uses positional parameters with ?
            param_values = list(params.values())
            return self._duckdb.execute(query, param_values).df()
        else:
            return self._duckdb.execute(query).df()
🛠️ Refactor suggestion
Parameter order bug and API mismatch in secure_query_to_df
DuckDB uses positional ? placeholders. Passing an unordered dict can bind parameters in an unintended order. Accept a sequence and pass through.
- def secure_query_to_df(
- self, query: str, params: Optional[Dict[str, Any]] = None
- ) -> pd.DataFrame:
+ def secure_query_to_df(
+ self, query: str, params: Optional[List[Any]] = None
+ ) -> pd.DataFrame:
@@
- Example:
- >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", {"accession": "P12345"})
+ Example:
+ >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", ["P12345"])
@@
- if params:
- # DuckDB uses positional parameters with ?
- param_values = list(params.values())
- return self._duckdb.execute(query, param_values).df()
+ if params:
+ # DuckDB uses positional parameters with ?
+ return self._duckdb.execute(query, params).df()
else:
return self._duckdb.execute(query).df()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def secure_query_to_df(
        self, query: str, params: Optional[List[Any]] = None
    ) -> pd.DataFrame:
        """Execute parameterized query safely and return result as DataFrame.

        Args:
            query: SQL query with placeholders (e.g., 'SELECT * FROM table WHERE id = ?')
            params: List of parameters to safely substitute

        Returns:
            DataFrame with query results

        Example:
            >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", ["P12345"])
        """
        if params:
            # DuckDB uses positional parameters with ?
            return self._duckdb.execute(query, params).df()
        else:
            return self._duckdb.execute(query).df()
🤖 Prompt for AI Agents
In quantmsio/core/duckdb.py around lines 84 to 105, secure_query_to_df currently
accepts params: Optional[Dict[str, Any]] which can reorder positional ?
bindings; change the API to accept a sequence (Optional[Sequence[Any]] or
List[Any]) instead, update the docstring/example to show a positional list/tuple
(e.g., ["P12345"]), and when params is provided pass it (or list(params))
directly to self._duckdb.execute(query, params) so binding order matches the
query; keep the no-params branch unchanged.
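To make the positional-binding point concrete, a standalone sketch against an in-memory DuckDB database (not the quantmsio wrapper itself):

```python
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE proteins (accession VARCHAR, score DOUBLE)")
con.execute("INSERT INTO proteins VALUES ('P12345', 0.9), ('Q67890', 0.4)")

# Placeholders are bound strictly by position, so pass an ordered sequence.
df = con.execute(
    "SELECT * FROM proteins WHERE accession = ? AND score > ?",
    ["P12345", 0.5],
).df()
print(df)
```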
        database_name = create_uuid_filename("diann-report", ".duckdb")
        super().__init__(database_name)

    def get_unique_references(self, field: str) -> list:
        # Initialize database and create report table
        self.initialize_database(max_memory, worker_threads)
        self.create_table_from_file("report", self._report_path, [PROTEIN_GROUP, RUN])
Call to undefined initialize_database and unused resource args
DiannDuckDB.__init__ calls a non-existent method. Pass max_memory/worker_threads to super().__init__ and remove the call.
- database_name = create_uuid_filename("diann-report", ".duckdb")
- super().__init__(database_name)
-
- # Initialize database and create report table
- self.initialize_database(max_memory, worker_threads)
+ database_name = create_uuid_filename("diann-report", ".duckdb")
+ super().__init__(database_name, max_memory=max_memory, worker_threads=worker_threads)
self.create_table_from_file("report", self._report_path, [PROTEIN_GROUP, RUN])
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        database_name = create_uuid_filename("diann-report", ".duckdb")
        super().__init__(
            database_name,
            max_memory=max_memory,
            worker_threads=worker_threads
        )
        self.create_table_from_file("report", self._report_path, [PROTEIN_GROUP, RUN])
            # Get reference file name
            ref_file = None
            if pd.notna(row["spectra_ref"]):
                try:
                    ms_run_id = row["spectra_ref"].split(":")[0]
                    ref_file = self._ms_runs.get(ms_run_id)
                except (ValueError, AttributeError):
                    pass
MS run mapping bug: wrong key type used for _ms_runs
self._ms_runs is keyed by integer ms_run IDs, but here you use the raw "ms_run[X]" string; this always yields None. Reuse your parser.
- if pd.notna(row["spectra_ref"]):
- try:
- ms_run_id = row["spectra_ref"].split(":")[0]
- ref_file = self._ms_runs.get(ms_run_id)
- except (ValueError, AttributeError):
- pass
+ if pd.notna(row["spectra_ref"]):
+ ref_file = self._get_reference_file_name(row["spectra_ref"])
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
            # Get reference file name
            ref_file = None
            if pd.notna(row["spectra_ref"]):
                ref_file = self._get_reference_file_name(row["spectra_ref"])
🤖 Prompt for AI Agents
In quantmsio/core/quantms/psm.py around lines 633 to 641, the code looks up
self._ms_runs using the raw string token from row["spectra_ref"] which yields
None because _ms_runs is keyed by integer ms_run IDs; parse the spectra_ref to
extract the numeric ms_run id exactly as the existing parser does (convert to
int) and use that integer as the key when calling self._ms_runs.get(...),
preserving the current try/except behavior to handle malformed values.
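For illustration, a hedged sketch of the kind of parser the comment refers to, assuming spectra_ref values look like 'ms_run[2]:scan=1234' and that _ms_runs is keyed by the integer run index; the project's actual helper may differ:

```python
import re
from typing import Optional


def parse_ms_run_id(spectra_ref: str) -> Optional[int]:
    """Extract the integer run index from a spectra_ref such as 'ms_run[2]:scan=1234'."""
    match = re.match(r"ms_run\[(\d+)\]", spectra_ref)
    return int(match.group(1)) if match else None


ms_runs = {2: "sample_a"}  # illustrative mapping keyed by integer run index
print(ms_runs.get(parse_ms_run_id("ms_run[2]:scan=1234")))  # sample_a
```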
            # Update map with best score
            if key not in pep_map or row["score"] < pep_map[key][0]:
                pep_map[key] = [row["score"], ref_file, scan_number]
🛠️ Refactor suggestion
Guard None and normalize score to float before comparison
Avoid TypeError when score is missing; store numeric consistently.
- if key not in pep_map or row["score"] < pep_map[key][0]:
- pep_map[key] = [row["score"], ref_file, scan_number]
+ try:
+ score_val = float(row["score"])
+ except (TypeError, ValueError):
+ continue
+ if key not in pep_map or score_val < pep_map[key][0]:
+ pep_map[key] = [score_val, ref_file, scan_number]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
            # Update map with best score
            try:
                score_val = float(row["score"])
            except (TypeError, ValueError):
                continue
            if key not in pep_map or score_val < pep_map[key][0]:
                pep_map[key] = [score_val, ref_file, scan_number]
🤖 Prompt for AI Agents
In quantmsio/core/quantms/psm.py around lines 642 to 645, the code compares
row["score"] directly which can be None or a non-numeric string; normalize by
parsing score to a float (e.g. try: score = float(row.get("score")) except
(TypeError, ValueError): score = float("inf")) and then use that numeric score
in the comparison and when storing: if key not in pep_map or score <
pep_map[key][0]: pep_map[key] = [score, ref_file, scan_number].
    temp_db_path = tempfile.mktemp(suffix=".duckdb")

    # Initialize Feature
    indexer = MzTabIndexer.create(
        mztab_path=mztab_file,
        msstats_path=msstats_file,
        sdrf_path=sdrf_file,
        database_path=temp_db_path,
    )
🛠️ Refactor suggestion
Replace insecure tempfile.mktemp with TemporaryDirectory path (security, reliability).
mktemp is vulnerable (CWE-377) and can race; also MzTabIndexer expects a non-existent DB path. Use a temp directory and create the filename within it.
- # Use a temporary directory for the backend database
- temp_db_path = tempfile.mktemp(suffix=".duckdb")
-
- # Initialize Feature
- indexer = MzTabIndexer.create(
- mztab_path=mztab_file,
- msstats_path=msstats_file,
- sdrf_path=sdrf_file,
- database_path=temp_db_path,
- )
+ with tempfile.TemporaryDirectory() as tmpdir:
+ temp_db_path = Path(tmpdir) / "feature_test.duckdb"
+ indexer = MzTabIndexer.create(
+ mztab_path=mztab_file,
+ msstats_path=msstats_file,
+ sdrf_path=sdrf_file,
+ database_path=temp_db_path,
+ )
+ feature = Feature(mztab_indexer=indexer)
+ ...
Repeat the TemporaryDirectory pattern for test_extract_psm_msg (Lines 66-74) and test_generate_feature (Lines 106-114).
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.12.2)
26-26: Use of insecure and deprecated function (mktemp)
(S306)
🪛 GitHub Check: CodeFactor
[warning] 26-26: tests/core/quantms/test_feature.py#L26
Use of insecure and deprecated function (mktemp). (B306)
🤖 Prompt for AI Agents
In tests/core/quantms/test_feature.py around lines 26-35 (and similarly update
the patterns used in test_extract_psm_msg at lines 66-74 and
test_generate_feature at lines 106-114), replace the insecure use of
tempfile.mktemp with a context-managed tempfile.TemporaryDirectory and construct
the database filename inside that directory (e.g., join the tempdir path with a
desired filename ending in .duckdb) so the path is created safely and avoids
race conditions; ensure the created path is non-existent when passed to
MzTabIndexer.create and that the TemporaryDirectory context scope covers the
test operations so the temp is cleaned up after the test.
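A generic, self-contained version of that pattern with illustrative names (the real fixtures live in the tests above):

```python
import tempfile
from pathlib import Path


def test_database_path_is_cleaned_up():
    with tempfile.TemporaryDirectory() as tmpdir:
        # Unique path that does not exist yet, so the component under test may create it.
        db_path = Path(tmpdir) / "feature_test.duckdb"
        assert not db_path.exists()

        db_path.write_bytes(b"")  # stand-in for the code that creates the database
        assert db_path.exists()

    # Directory and contents are removed when the block exits.
    assert not db_path.exists()
```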
    if intensities_channel_counts:
        for channel, count in intensities_channel_counts.items():
            if channel not in channel_counts:
                channel_counts[channel] = 0

    # Use intensities-based matrix
    for ref_file, channels in file_channel_matrix_from_intensities.items():
        if ref_file not in file_channel_matrix:
            file_channel_matrix[ref_file] = {}
        for channel, count in channels.items():
            if channel not in file_channel_matrix[ref_file]:
                file_channel_matrix[ref_file][channel] = 0
            file_channel_matrix[ref_file][channel] += count
🛠️ Refactor suggestion
Fix dropped channel counts when merging intensity-derived stats
Counts from intensities_channel_counts are not added to channel_counts. Add the accumulation.
- if intensities_channel_counts:
- for channel, count in intensities_channel_counts.items():
- if channel not in channel_counts:
- channel_counts[channel] = 0
+ if intensities_channel_counts:
+ for channel, count in intensities_channel_counts.items():
+ channel_counts[channel] = channel_counts.get(channel, 0) + count
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    if intensities_channel_counts:
        for channel, count in intensities_channel_counts.items():
            channel_counts[channel] = channel_counts.get(channel, 0) + count

    # Use intensities-based matrix
    for ref_file, channels in file_channel_matrix_from_intensities.items():
        if ref_file not in file_channel_matrix:
            file_channel_matrix[ref_file] = {}
        for channel, count in channels.items():
            if channel not in file_channel_matrix[ref_file]:
                file_channel_matrix[ref_file][channel] = 0
            file_channel_matrix[ref_file][channel] += count
🧰 Tools
🪛 Ruff (0.12.2)
307-307: Loop control variable count not used within loop body
(B007)
🪛 GitHub Check: CodeFactor
[notice] 272-318: tests/core/quantms/test_msstats_in.py#L272-L318
Complex Method
🤖 Prompt for AI Agents
In tests/core/quantms/test_msstats_in.py around lines 306 to 318, the loop that
handles intensities_channel_counts currently ensures keys exist in
channel_counts but never adds the counts themselves; update the code to add each
intensities_channel_counts[channel] value into channel_counts (initializing
channel_counts[channel] to 0 if missing) so the intensity-derived counts are
accumulated into channel_counts before proceeding to build the
file_channel_matrix.
    # Use a temporary directory for the backend database
    temp_db_path = tempfile.mktemp(suffix=".duckdb")

    with MzTabIndexer(
        mztab_path=LARGE_DATASET["mztab"],
        msstats_path=LARGE_DATASET["msstats"],
        sdrf_path=LARGE_DATASET["sdrf"],
        database_path=temp_db_path,
        max_memory="16GB",  # Adjust as needed for your system
        worker_threads=4,  # Adjust as needed for your system
        batch_size=100000,  # Using a larger batch size for a larger file
    ) as indexer:
🛠️ Refactor suggestion
Replace insecure mktemp with TemporaryDirectory; thread-safe and matches DB creation semantics.
mktemp creates a race and may be pre-created by attackers. Also the DB must not exist before create.
- # Use a temporary directory for the backend database
- temp_db_path = tempfile.mktemp(suffix=".duckdb")
-
- with MzTabIndexer(
+ with tempfile.TemporaryDirectory() as tmpdir:
+ temp_db_path = Path(tmpdir) / "perf.duckdb"
+ with MzTabIndexer(
mztab_path=LARGE_DATASET["mztab"],
msstats_path=LARGE_DATASET["msstats"],
sdrf_path=LARGE_DATASET["sdrf"],
database_path=temp_db_path,
max_memory="16GB", # Adjust as needed for your system
worker_threads=4, # Adjust as needed for your system
batch_size=100000, # Using a larger batch size for a larger file
- ) as indexer:
+ ) as indexer:
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 ast-grep (0.38.6)
[warning] 38-38: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp(suffix=".duckdb")
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-mktemp-python)
🪛 Ruff (0.12.2)
39-39: Use of insecure and deprecated function (mktemp)
(S306)
🪛 GitHub Check: CodeFactor
[warning] 39-39: tests/core/test_large_mztab_performance.py#L39
Use of insecure and deprecated function (mktemp). (B306)
🤖 Prompt for AI Agents
In tests/core/test_large_mztab_performance.py around lines 38–49, replace the
insecure tempfile.mktemp usage with a tempfile.TemporaryDirectory context so the
DB path is created inside a fresh, thread-safe temp directory and does not
pre-exist; inside the TemporaryDirectory context construct a db filepath (e.g.,
os.path.join(tmpdir, "test.duckdb") or with your desired suffix), pass that path
to MzTabIndexer, and ensure you import tempfile and os if not already imported.
PR Type
Enhancement
Description
Major CLI restructuring with organized command groups
Enhanced logging system with structured output support
New quantms workflow conversion command
Improved error handling and verbose logging
Changes walkthrough 📝
28 files
Add quantms workflow conversion command
Add DIA-NN conversion commands
Add PSM conversion and comparison commands
Add feature conversion command
Add visualization plotting commands
Restructure CLI with organized command groups
Add differential expression transformation command
Add statistical analysis commands
Update MaxQuant conversion commands
Add enhanced logging to PSM processing
Add enhanced logging to feature processing
Add project management commands
Add absolute expression transformation command
Update schema with nullable fields
Update FragPipe PSM conversion method
Add gene mapping transformation command
Update project save methods
Update FragPipe conversion command
Add UniProt mapping transformation command
Add IBAQ transformation command
Add spectra transformation command
Update AnnData merge command
Update file attachment command
Add logging to MzTab class
Add logger utility module
Add convert commands package
Add transform commands package
Add utils commands package
1 file
Fix regex patterns with raw strings
7 files
Update feature schema with scan format
Major documentation update with usage examples
Update PSM schema with record types
Add comprehensive logging documentation
Add differential expression schema
Update IBAQ use case documentation
Add absolute expression schema
2 files
Update DuckDB version requirement
Update Python and PyOpenMS version requirements
13 files
Summary by CodeRabbit
New Features
Documentation
Tests
Chores