
Conversation

@ypriverol (Member) commented Jun 24, 2025

PR Type

Enhancement


Description

  • Major CLI restructuring with organized command groups

  • Enhanced logging system with structured output support

  • New quantms workflow conversion command

  • Improved error handling and verbose logging


Changes walkthrough 📝

Relevant files
Enhancement (28 files)
  • quantms.py: Add quantms workflow conversion command (+318/-0)
  • diann.py: Add DIA-NN conversion commands (+256/-0)
  • psm.py: Add PSM conversion and comparison commands (+184/-0)
  • feature.py: Add feature conversion command (+167/-0)
  • plot.py: Add visualization plotting commands (+207/-0)
  • quantmsioc.py: Restructure CLI with organized command groups (+87/-42)
  • de.py: Add differential expression transformation command (+128/-0)
  • stats.py: Add statistical analysis commands (+114/-0)
  • maxquant.py: Update MaxQuant conversion commands (+31/-21)
  • psm.py: Add enhanced logging to PSM processing (+17/-6)
  • feature.py: Add enhanced logging to feature processing (+19/-1)
  • project.py: Add project management commands (+84/-0)
  • ae.py: Add absolute expression transformation command (+82/-0)
  • format.py: Update schema with nullable fields (+6/-0)
  • fragpipe.py: Update FragPipe PSM conversion method (+24/-21)
  • gene.py: Add gene mapping transformation command (+66/-0)
  • project.py: Update project save methods (+22/-7)
  • fragpipe.py: Update FragPipe conversion command (+15/-7)
  • uniprot.py: Add UniProt mapping transformation command (+56/-0)
  • ibaq.py: Add IBAQ transformation command (+55/-0)
  • spectra.py: Add spectra transformation command (+58/-0)
  • anndata.py: Update AnnData merge command (+20/-9)
  • attach.py: Update file attachment command (+6/-6)
  • mztab.py: Add logging to MzTab class (+3/-1)
  • logger.py: Add logger utility module (+10/-0)
  • __init__.py: Add convert commands package (+3/-0)
  • __init__.py: Add transform commands package (+3/-0)
  • __init__.py: Add utils commands package (+3/-0)
Bug fix (1 file)
  • pride_utils.py: Fix regex patterns with raw strings (+2/-2)
Documentation (7 files)
  • feature.avsc: Update feature schema with scan format (+18/-10)
  • README.md: Major documentation update with usage examples (+168/-1)
  • psm.avsc: Update PSM schema with record types (+11/-7)
  • README_LOGGING.md: Add comprehensive logging documentation (+127/-0)
  • differential.avsc: Add differential expression schema (+51/-0)
  • ibaq_usecase.adoc: Update IBAQ use case documentation (+7/-7)
  • absolute.avsc: Add absolute expression schema (+46/-0)
Dependencies (2 files)
  • enviroment.yml: Update DuckDB version requirement (+1/-1)
  • pyproject.toml: Update Python and PyOpenMS version requirements (+2/-2)
Additional files (13 files)
  • ae_command.py (+0/-60)
  • de_command.py (+0/-67)
  • diann_command.py (+0/-191)
  • feature_command.py (+0/-102)
  • generate_gene_message_command.py (+0/-49)
  • generate_spectra_message_command.py (+0/-46)
  • ibaq_command.py (+0/-52)
  • map_latest_uniport_command.py (+0/-44)
  • plot_command.py (+0/-126)
  • project_command.py (+0/-81)
  • psm_command.py (+0/-79)
  • quantms_command.py (+0/-140)
  • statistic_command.py (+0/-104)

Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.

Summary by CodeRabbit

    • New Features

      • Reorganized CLI into grouped commands (convert/transform/visualize/stats/project) with many new conversion & transform subcommands (MaxQuant, DIA‑NN, IdXML, project workflows, mapping/transform utilities) and improved plotting/stats tools.
      • Batch‑oriented Parquet export and faster DuckDB-backed workflows for large datasets.
    • Documentation

      • Major README overhaul and updated docs/examples for new CLI and workflows.
    • Tests

      • Large expansion of unit, integration and performance tests and new CI workflows.
    • Chores

      • Minimum Python raised to 3.10; dependency adjustments and expanded .gitignore.

    @coderabbitai bot (Contributor) left a comment

    Actionable comments posted: 40

    Caution

    Some comments are outside the diff and can’t be posted inline due to platform limitations.

    ⚠️ Outside diff range comments (4)
    quantmsio/utils/system.py (1)

    17-35: Do not assume “enough disk” on error; handle non-existent output_dir and improve logging

    Returning True on exceptions can lead to partial writes/data loss. Also, shutil.disk_usage requires an existing path—new output_dir may not exist yet. Check the nearest existing parent, log details, and make the “assume on error” behavior explicit.

    Apply:

    -def check_disk_space(output_dir: Path, required_gb: float = 1.0) -> bool:
    +def check_disk_space(output_dir: Path, required_gb: float = 1.0, assume_enough_on_error: bool = False) -> bool:
         """
    -    Check if there's sufficient disk space for processing.
    +    Check if there's sufficient disk space for processing.
     
         Args:
    -        output_dir: Directory to check space for
    -        required_gb: Required space in GB (default: 1GB)
    +        output_dir: Directory to check space for (may not exist yet).
    +        required_gb: Required space in GiB (default: 1.0).
    +        assume_enough_on_error: If True, return True when the check fails; otherwise return False.
     
         Returns:
    -        True if sufficient space is available, False otherwise
    +        True if sufficient space is available, False otherwise.
         """
         try:
    -        total, used, free = shutil.disk_usage(output_dir)
    -        free_gb = free / (1024**3)  # Convert to GB
    -        return free_gb >= required_gb
    +        path = output_dir if output_dir.exists() else output_dir.parent
    +        total, used, free = shutil.disk_usage(path)
    +        free_gib = free / (1024 ** 3)  # Convert to GiB
    +        has_space = free_gib >= required_gb
    +        logging.getLogger(__name__).debug(
    +            "Disk space check on %s: free=%.2f GiB, required=%.2f GiB, ok=%s",
    +            path, free_gib, required_gb, has_space
    +        )
    +        return has_space
         except Exception as e:
    -        logging.warning(f"Could not check disk space: {e}")
    -        return True  # Assume sufficient space if we can't check
    +        logging.warning("Could not check disk space for %s: %s", output_dir, e)
    +        return assume_enough_on_error
    docs/pg.avsc (2)

    44-56: Avro types: prefer float over float32; struct → record.

    Replace float32 with Avro float and define a named record for array items.

    -        {"name": "intensities", "type": ["null",{
    -            "type": "array",
    -            "items": {
    -                "type": "struct",
    -                "fields": [
    -                    {"name": "sample_accession", "type": "string"},
    -                    {"name": "channel", "type": "string"},
    -                    {"name": "intensity", "type": "float32"}
    -                ]
    -            }
    -            }],
    +        {"name": "intensities", "type": ["null",{
    +            "type": "array",
    +            "items": {
    +                "type": "record",
    +                "name": "pg_intensity",
    +                "fields": [
    +                    {"name": "sample_accession", "type": "string"},
    +                    {"name": "channel", "type": "string"},
    +                    {"name": "intensity", "type": "float"}
    +                ]
    +            }
    +        }],

    57-78: Nested arrays: define records and use float.

    Bring additional_intensities into Avro-compliant form.

    -        {"name": "additional_intensities", "type": ["null",{
    -            "type": "array",
    -            "items": {
    -                "type": "struct",
    -                "fields": [
    -                    {"name": "sample_accession", "type": "string"},
    -                    {"name": "channel", "type": "string"},
    -                    {"name": "intensities", "type": {
    -                        "type": "array",
    -                        "items": {
    -                            "type": "struct",
    -                            "fields": [
    -                                {"name": "intensity_name", "type": "string"},
    -                                {"name": "intensity_value", "type": "float32"}
    -                            ]
    -                        }
    -                    }}
    -                ]
    -            }
    -            }],
    +        {"name": "additional_intensities", "type": ["null",{
    +            "type": "array",
    +            "items": {
    +                "type": "record",
    +                "name": "pg_additional_intensity",
    +                "fields": [
    +                    {"name": "sample_accession", "type": "string"},
    +                    {"name": "channel", "type": "string"},
    +                    {"name": "intensities", "type": {
    +                        "type": "array",
    +                        "items": {
    +                            "type": "record",
    +                            "name": "pg_intensity_kv",
    +                            "fields": [
    +                                {"name": "intensity_name", "type": "string"},
    +                                {"name": "intensity_value", "type": "float"}
    +                            ]
    +                        }
    +                    }}
    +                ]
    +            }
    +        }],
    quantmsio/utils/pride_utils.py (1)

    457-466: Guard against None modifications_definition to avoid runtime errors.

    When modifications exist but modifications_definition is None, the current code will raise. Fall back to the raw sequence in that case.

    -    if "opt_global_cv_MS:1000889_peptidoform_sequence" in es:
    -        psm["peptidoform"] = es["opt_global_cv_MS:1000889_peptidoform_sequence"]
    -    else:
    -        psm["peptidoform"] = get_petidoform_msstats_notation(
    -            psm["sequence"], psm["modifications"], modifications_definition
    -        )
    +    if "opt_global_cv_MS:1000889_peptidoform_sequence" in es:
    +        psm["peptidoform"] = es["opt_global_cv_MS:1000889_peptidoform_sequence"]
    +    else:
    +        psm["peptidoform"] = (
    +            get_petidoform_msstats_notation(
    +                psm["sequence"], psm["modifications"], modifications_definition
    +            )
    +            if modifications_definition
    +            else psm["sequence"]
    +        )
    -    psm["proforma_peptidoform"] = get_peptidoform_proforma_version_in_mztab(
    -        psm["sequence"], psm["modifications"], modifications_definition
    -    )
    +    psm["proforma_peptidoform"] = (
    +        get_peptidoform_proforma_version_in_mztab(
    +            psm["sequence"], psm["modifications"], modifications_definition
    +        )
    +        if modifications_definition
    +        else psm["sequence"]
    +    )
    ♻️ Duplicate comments (4)
    quantmsio/core/fragpipe/fragpipe.py (1)

    371-387: Reduce default batch size and restore pre-flight checks (duplicate of prior feedback)

    A default of 1e6 rows risks memory pressure; also check that the input file exists and create the output directory before writing.

    -    def write_psms_to_parquet(
    +    def write_psms_to_parquet(
             self,
    -        msms_file: Path,
    -        batch_size: int = 1000000,
    +        msms_file: Path,
    +        batch_size: int = 100000,
             output_prefix: Optional[str] = None,
         ) -> None:
    @@
    -        if not output_prefix:
    +        if not msms_file.exists():
    +            raise FileNotFoundError(f"PSM file not found: {msms_file}")
    +        self.output_directory.mkdir(parents=True, exist_ok=True)
    +        if not output_prefix:
                 output_prefix = "psm"
    quantmsio/core/quantms/pg.py (1)

    543-548: DuckDB cannot read a Python DataFrame by name in SQL; must register it.

    This creates an empty table or fails silently. Register/unregister the DataFrame before CREATE TABLE.

    -            if protein_data:
    -                protein_df = pd.DataFrame(protein_data)
    -                msstats_indexer._duckdb.execute("DROP TABLE IF EXISTS protein_groups")
    -                msstats_indexer._duckdb.execute(
    -                    "CREATE TABLE protein_groups AS SELECT * FROM protein_df"
    -                )
    +            if protein_data:
    +                protein_df = pd.DataFrame(protein_data)
    +                msstats_indexer._duckdb.execute("DROP TABLE IF EXISTS protein_groups")
    +                msstats_indexer._duckdb.register("protein_df_view", protein_df)
    +                try:
    +                    msstats_indexer._duckdb.execute(
    +                        "CREATE TABLE protein_groups AS SELECT * FROM protein_df_view"
    +                    )
    +                finally:
    +                    msstats_indexer._duckdb.unregister("protein_df_view")
    quantmsio/core/diann/diann.py (1)

    151-167: Consider aligning field names for clarity.

    Based on the past review comments, the field names in peptide_counts could be more consistent.

    Consider using more consistent field naming:

    • In peptide_counts: use unique_peptides and total_peptides
    • Keep feature_counts as is with unique_features and total_features

    This would make the distinction between peptides and features clearer.
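
    A tiny sketch of the suggested shape; the values below are placeholders, not taken from the code:

    # Hypothetical illustration of the renamed keys (counts are made up)
    peptide_counts = {"unique_peptides": 1234, "total_peptides": 5678}
    feature_counts = {"unique_features": 4321, "total_features": 8765}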

    quantmsio/core/quantms/psm.py (1)

    58-69: Use context manager for file operations.

    The file handling needs improvement for better resource management and exception safety. The manual file opening and closing should use a context manager.

    🧹 Nitpick comments (80)
    quantmsio/utils/system.py (1)

    8-15: Add type hints and tighten the log output (units, PID)

    Minor polish: type the logger param/return, and log MiB with PID for easier tracing.

    Apply:

    -def log_memory_usage(logger, step: str):
    -    """Log memory usage for a given step."""
    -    process = psutil.Process(os.getpid())
    -    mem_info = process.memory_info()
    -    logger.debug(
    -        f"Memory usage after {step}: RSS={mem_info.rss / 1024 / 1024:.1f}MB, VMS={mem_info.vms / 1024 / 1024:.1f}MB"
    -    )
    +def log_memory_usage(logger: logging.Logger, step: str) -> None:
    +    """Log current process memory usage (RSS/VMS in MiB) for the given step."""
    +    process = psutil.Process(os.getpid())
    +    mem = process.memory_info()
    +    rss = mem.rss / (1024 ** 2)
    +    vms = mem.vms / (1024 ** 2)
    +    logger.debug(f"[pid={process.pid}] Memory usage after {step}: RSS={rss:.1f} MiB, VMS={vms:.1f} MiB")

    Optional micro-opt (outside range): cache the Process at module load to avoid re-creating it per call:

    # at module top
    _PROCESS = psutil.Process(os.getpid())
    # in function
    process = _PROCESS
    tests/core/test_project.py (4)

    20-36: Patch the correct requests symbol and simplify the mock response.
    Targets the import site to avoid future import-style changes breaking the patch; Mock is lighter than constructing a Response.

    -    @patch("requests.get")
    +    @patch("quantmsio.core.project.requests.get")
         def test_populate_from_pride_archive_successful(self, mock_get):
    -        # Mock the API response for a successful request
    -        mock_response = requests.models.Response()
    -        mock_response.status_code = 200
    -        mock_response.json = lambda: {
    -            "title": "Test Project",
    -            "projectDescription": "Test description",
    -            "sampleProcessingProtocol": "Test sample processing protocol",
    -            "dataProcessingProtocol": "Test data processing protocol",
    -            "references": [{"pubmedId": "12345678"}],
    -            "keywords": ["keyword1", "keyword2"],
    -            "projectTags": ["tag1", "tag2"],
    -            # Add other mock data as needed
    -        }
    -        mock_get.return_value = mock_response
    +        # Mock the API response for a successful request
    +        mock_get.return_value = Mock(
    +            status_code=200,
    +            json=lambda: {
    +                "title": "Test Project",
    +                "projectDescription": "Test description",
    +                "sampleProcessingProtocol": "Test sample processing protocol",
    +                "dataProcessingProtocol": "Test data processing protocol",
    +                "references": [{"pubmedId": "12345678"}],
    +                "keywords": ["keyword1", "keyword2"],
    +                "projectTags": ["tag1", "tag2"],
    +            },
    +        )

    Additional change needed outside this hunk:

    -from unittest.mock import patch
    +from unittest.mock import patch, Mock

    50-61: Avoid live network in unit tests; gate as integration and remove print.
    Prevents flakiness and noisy output; allows opt-in via an env flag.

         def test_populate_from_pride_archive_api(self):
    +        if os.environ.get("PRIDE_ONLINE") != "1":
    +            self.skipTest("Integration test requires PRIDE_ONLINE=1")
             project_accession = "PXD020453"
             project_manager = ProjectHandler(project_accession)
             project_manager.populate_from_pride_archive()
     
             # Assert that the project_info has been updated
             self.assertEqual(
                 project_manager.project.project_info["project_title"],
                 "Structural insights into Cullin4-RING ubiquitin ligase remodelling by Vpr from simian immunodeficiency viruses",
             )
    -        print(project_manager.project.project_info)

    Additional change needed outside this hunk:

    +import os

    If you prefer keeping this as a pure unit test, patch requests here too (same target as above) and assert on stable fields rather than a long, mutable title.


    6-6: ddt is imported but unused.
    Either add data-driven cases or remove the decorator/import to reduce test overhead.

    -from ddt import ddt
    +# from ddt import ddt  # Uncomment when adding data-driven tests
    @@
    -@ddt
     class TestProjectHandler(TestCase):

    Alternatively, parameterize via ddt using test_dataset.

    Also applies to: 17-18


    62-69: Enhance test_save_project_info with SDRF file existence and ingestion assertions

    The populate_from_sdrf call populates project_manager.project.project_info with keys like "organisms", "organism_parts", "diseases", "cell_lines", "instruments", "enzymes" and "acquisition_properties". Strengthen the test to:

    • Verify the SDRF test file is present
    • Assert that at least one SDRF-derived property is added to project_info

    In tests/core/test_project.py around lines 66–69, update:

         def test_save_project_info(self):
             project_accession = test_dataset[0]
             sdrf_file = test_dataset[1]
    
             project_manager = ProjectHandler(project_accession)
             project_manager.populate_from_pride_archive()
    -        project_manager.populate_from_sdrf(sdrf_file)
    +        # Ensure the SDRF test file exists
    +        self.assertTrue(sdrf_file.exists(), f"Missing SDRF test file: {sdrf_file}")
    +        project_manager.populate_from_sdrf(sdrf_file)
    +        # Verify SDRF ingestion added properties to project_info
    +        info = project_manager.project.project_info
    +        self.assertIn("organisms", info)
    +        self.assertIsInstance(info["organisms"], list)
    +        self.assertGreater(len(info["organisms"]), 0)
    tests/core/test_large_mztab_performance.py (1)

    20-22: Make CI skip robust to different CI values

    Skip if CI is set to any truthy string, not just "true".

    -@pytest.mark.skipif(
    -    os.environ.get("CI") == "true", reason="Skipping performance test in CI environment"
    -)
    +@pytest.mark.skipif(
    +    bool(os.environ.get("CI")), reason="Skipping performance test in CI environment"
    +)
    quantmsio/operate/tools.py (4)

    29-33: Tighten return typing for init_save_info

    Second element can become a ParquetWriter; advertise Optional type.

    -def init_save_info(parquet_path: str) -> tuple[dict, None, str]:
    +def init_save_info(parquet_path: str) -> tuple[dict, Optional[pq.ParquetWriter], str]:

    124-134: Fix typo in docstring: “retrun” → “return”

    Minor documentation fix.

    -    retrun: None
    +    return: None

    159-175: Consider using sets in peptide_map to avoid O(n) duplicate checks

    Using a set avoids repeated membership scans; convert to list at the end if needed.

    -def get_peptide_map(unique_peptides: list, fasta: str) -> defaultdict:
    -    peptide_map: defaultdict = defaultdict(list)
    +def get_peptide_map(unique_peptides: list, fasta: str) -> defaultdict:
    +    peptide_map: defaultdict = defaultdict(set)
    @@
    -            if accession not in peptide_map[peptide]:
    -                peptide_map[peptide].append(accession)
    -    return peptide_map
    +            peptide_map[peptide].add(accession)
    +    # downstream expects lists
    +    return defaultdict(list, {k: list(v) for k, v in peptide_map.items()})

    261-268: Make column drop resilient when label is absent

    Avoid KeyError if label/reference aren’t present in the merge result.

    -        df.drop(
    +        df.drop(
                 [
                     "reference",
                     "label",
                 ],
    -            axis=1,
    -            inplace=True,
    +            axis=1,
    +            inplace=True,
    +            errors="ignore",
             )
    tests/core/operate/test_query.py (4)

    13-17: Replace prints with minimal assertions

    Ensure generators yield at least one batch; avoid noisy output.

    -        for _ in q.iter_samples():
    -            print("ok")
    +        assert next(q.iter_samples(), None) is not None

    23-27: Assert iterator yields

    Same rationale as above.

    -        for _ in q.iter_chunk():
    -            print("ok")
    +        assert next(q.iter_chunk(), None) is not None

    33-37: Assert iterator yields

    Same rationale as above.

    -        for _ in q.iter_file():
    -            print("ok")
    +        assert next(q.iter_file(), None) is not None

    43-48: Validate result of inject_position_msg

    Guard against silent no-ops.

             protein_dict = q.get_protein_dict(fasta)
    -        q.inject_position_msg(df, protein_dict)
    +        out = q.inject_position_msg(df, protein_dict)
    +        assert "pg_positions" in out.columns
    quantmsio/utils/file_utils.py (1)

    41-56: Simplify existence check using all(...)

    Matches Ruff SIM110 hint.

    -    for file_name in file_names:
    -        # Now we have Path / str which works correctly on all platforms
    -        if not (path_obj / file_name).exists():
    -            return False
    -    return True
    +    return all((path_obj / file_name).exists() for file_name in file_names)
    quantmsio/core/fragpipe/fragpipe.py (3)

    5-5: Remove unused import date

    Per static analysis hint.

    -from datetime import date

    11-11: Remove unused import pyarrow.parquet as pq

    Per static analysis hint.

    -import pyarrow.parquet as pq

    405-407: Optionally pass string path to ParquetBatchWriter for wider compatibility

    Minor portability improvement.

    -                    batch_writer = ParquetBatchWriter(
    -                        output_path,
    -                        schema=batch.schema,
    -                    )
    +                    batch_writer = ParquetBatchWriter(str(output_path), schema=batch.schema)
    quantmsio/utils/constants.py (1)

    106-117: Consider adding validation for mzTab protein columns.

    The MZTAB_PROTEIN_COLUMNS list contains a mix of referenced constants (ACCESSION, MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE) and literal strings. This is good for maintainability, but consider whether some of the literal strings (e.g., "ambiguity_members") should also be constants if they're used elsewhere.

    If any of these literal strings are referenced in other parts of the codebase, consider extracting them as constants:

    +AMBIGUITY_MEMBERS_COLUMN = "ambiguity_members"
    +DESCRIPTION_COLUMN = "description"
    +MODIFICATIONS_COLUMN = "modifications"
    +PROTEIN_COVERAGE_COLUMN = "protein_coverage"
    +
     MZTAB_PROTEIN_COLUMNS = [
         ACCESSION,
    -    "description",
    +    DESCRIPTION_COLUMN,
         MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE,
    -    "ambiguity_members",
    +    AMBIGUITY_MEMBERS_COLUMN,
    -    "modifications",
    +    MODIFICATIONS_COLUMN,
    -    "protein_coverage",
    +    PROTEIN_COVERAGE_COLUMN,
         "opt_global_Posterior_Probability_score",
         "opt_global_cv_PRIDE:0000303_decoy_hit",
         "opt_global_result_type",
     ]
    quantmsio/utils/mztab_utils.py (4)

    1-4: Add module-level docstring for better documentation.

    The module lacks a docstring describing its purpose and contents.

    Add a module-level docstring:

    +"""
    +Utilities for parsing and processing mzTab format files.
    +
    +This module provides functions for parsing mzTab metadata, headers, protein and PSM lines,
    +as well as handling modifications and peptidoform conversions from OpenMS format.
    +"""
     import re
     from typing import Optional, Dict, List

    33-43: Simplify nested loop iteration.

    The nested loop for finding the accession can be simplified.

             elif "site" in line_parts[1]:
                 index = line_parts[1].split("[")[1].split("]")[0]
                 accession = None
    -            for (
    -                key,
    -                value,
    -            ) in (
    -                _modifications.items()
    -            ):  # for name, age in dictionary.iteritems():  (for Python 2.x)
    +            for key, value in _modifications.items():
                     if value[1] == index:
                         accession = key

    280-289: Use ternary operator for cleaner code.

    The if-else block can be simplified to a ternary operator as suggested by static analysis.

     def _clean_file_path(file_path: str) -> str:
         """
         Clean a file path to extract just the filename without extension.
     
         This helper function removes the file:// protocol and extracts the filename
         without the file extension from various file path formats.
     
         Args:
             file_path: File path that may include file:// protocol and extensions
     
         Returns:
             Clean filename without protocol and extension
     
         Example:
             >>> _clean_file_path("file:///path/to/file.mzML")
             'file'
             >>> _clean_file_path("/data/sample_001.mzML")
             'sample_001'
             >>> _clean_file_path("relative/path/file.txt")
             'file'
         """
    -    if file_path.startswith("file://"):
    -        # Remove file:// protocol
    -        clean_path = file_path[7:]
    -    else:
    -        clean_path = file_path
    +    clean_path = file_path[7:] if file_path.startswith("file://") else file_path
     
         # Extract filename without extension
         filename = clean_path.split("/")[-1].split(".")[0]
         return filename

    533-658: Consider refactoring the complex modification parsing function.

    The parse_modifications_openms function is quite complex with deeply nested conditionals. Consider breaking it down into smaller, more focused helper functions for better readability and maintainability.

    Consider extracting the terminal modification handling logic into separate helper functions:

    def _handle_n_terminal_mod(peptidoform_to_process: str, mod_name_to_accession: dict, parsed_mods: dict) -> str:
        """Handle N-terminal modification parsing."""
        if peptidoform_to_process.startswith(".("):
            match = re.match(r"\.\(([^)]+)\)", peptidoform_to_process)
            if match:
                mod_name = match.group(1)
                accession = mod_name_to_accession.get(mod_name, mod_name)
                if accession not in parsed_mods:
                    parsed_mods[accession] = {"name": mod_name, "positions": []}
                parsed_mods[accession]["positions"].append(
                    {"position": "N-term.0", "scores": None}
                )
                return peptidoform_to_process[len(match.group(0)):]
        return peptidoform_to_process
    
    def _handle_c_terminal_mod(peptidoform_to_process: str, pure_sequence: str, mod_name_to_accession: dict, parsed_mods: dict) -> tuple[str, Optional[str]]:
        """Handle C-terminal modification parsing."""
        c_term_mod_name = None
        if peptidoform_to_process.endswith(")") and peptidoform_to_process.rfind(".(") > 0:
            match = re.search(r"\.\(([^)]+)\)$", peptidoform_to_process)
            if match:
                c_term_mod_name = match.group(1)
                peptidoform_to_process = peptidoform_to_process[:-len(match.group(0))]
        return peptidoform_to_process, c_term_mod_name

    This would make the main function cleaner and easier to test.

    quantmsio/core/common.py (2)

    41-41: Consider using compiled regex for better performance.

    The OPENMS_PROTEIN_QVALUE_WORDS list suggests these words will be used for pattern matching. If this is used frequently, consider pre-compiling a regex pattern.

    import re
    
    # Regular expression that ensures the protein search engine score contains required words
    OPENMS_PROTEIN_QVALUE_PATTERN = re.compile(
        r"(?=.*openms)(?=.*target-decoy)(?=.*protein)(?=.*q-value)",
        re.IGNORECASE
    )

    98-103: Add documentation for removed fields.

    The comments mention that certain fields were removed in DIA-NN version 2.0. Consider adding version compatibility information or migration guidance.

    Consider adding a version compatibility note at the module level or creating a separate compatibility matrix documentation:

    # DIA-NN Version Compatibility:
    # - Version 2.0+: Removed "File.Name", "PG.Quantity", "PG.Normalised" from main report
    # - Version 2.0+: Added "Decoy" to main report
    # - Version 2.0+: Added "N.Sequences", "N.Proteotypic.Sequences" to report.pg_matrix.tsv
    debug_test_convert_to_parquet.py (3)

    9-9: Remove unused import.

    os isn’t used.

    -import os

    44-45: Drop unnecessary f-string.

    No placeholders present.

    -        print(f"=== Full traceback: ===")
    +        print("=== Full traceback: ===")

    16-24: Make logging setup resilient and Unicode-safe.

    Avoid duplicate handlers when re-running, and set encoding for the file handler.

     logging.basicConfig(
         level=logging.DEBUG,
         format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
         handlers=[
             logging.StreamHandler(sys.stdout),
    -        logging.FileHandler("debug_test_convert_to_parquet.log"),
    +        logging.FileHandler("debug_test_convert_to_parquet.log", encoding="utf-8"),
         ],
    +    force=True,
     )
    tests/core/quantms/psms/test_mztab_functions.py (2)

    36-49: Consider adding a probability-bearing case.

    Add a case like 2[0.7]|5[0.3]-UNIMOD:4 to assert probabilities are stripped while positions remain integers.

    I can add a parametrized test covering probability annotations if you want.
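
    A hedged sketch of such a case; normalize_modification_positions is a hypothetical helper name and should be replaced with the function this test module actually exercises:

    import pytest

    @pytest.mark.parametrize(
        "raw, expected",
        [
            ("2|5-UNIMOD:4", "2|5-UNIMOD:4"),
            ("2[0.7]|5[0.3]-UNIMOD:4", "2|5-UNIMOD:4"),  # probabilities stripped, positions kept
        ],
    )
    def test_positions_with_probabilities(raw, expected):
        assert normalize_modification_positions(raw) == expected  # hypothetical helper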


    51-57: Reduce flakiness by making order-agnostic (optional).

    Current assertion depends on dict insertion order. If upstream changes order, the test could fail even with correct content.

    -    assert modification_list == ["2|5-UNIMOD:4", "1|3|7-UNIMOD:21"]
    +    assert set(modification_list) == {"2|5-UNIMOD:4", "1|3|7-UNIMOD:21"}
    tests/core/quantms/test_feature.py (2)

    16-25: Gracefully skip when example data is missing.

    Mirror the idXML pattern to avoid brittle failures in environments without the example data.

    +import pytest
    @@
    -    # Initialize Feature
    +    # Skip if data files don't exist
    +    if not (mztab_file.exists() and msstats_file.exists() and sdrf_file.exists()):
    +        pytest.skip("Example data files not found")
    +    # Initialize Feature
         feature = Feature(mztab_file, sdrf_file, msstats_file)

    Apply the same guard in test_extract_psm_msg and test_generate_feature.

    Also applies to: 43-52, 70-79


    83-89: Optional: cap iterations to keep tests fast.

    If generators can yield many batches, consider breaking after N iterations.

    -        for feature_table in feature.generate_feature():
    +        for feature_table in feature.generate_feature():
                 # Add assertions to verify the result
                 assert feature_table is not None
                 assert len(feature_table) > 0
                 assert "peptidoform" in feature_table.column_names
                 count += 1
    +            if count >= 3:
    +                break
    docs/pg.avsc (1)

    79-79: Consider boolean for decoy flag.

    is_decoy fits boolean better than int.

    -        {"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the protein group is a decoy, 0 if target"},
    +        {"name": "is_decoy", "type": "boolean", "doc": "Decoy indicator: true if decoy, false if target"},
    tests/core/test_sdrf.py (3)

    15-15: Remove redundant private call.

    __init__ already loads the SDRF; calling _load_sdrf_info again is unnecessary and touches a private API.

    -    sdrf_handler._load_sdrf_info(file)

    34-43: Align allowed experiment types with implementation (optional).

    get_experiment_type_from_sdrf currently returns LFQ/TMT*/ITRAQ* only. Including "SILAC" here may be misleading.

    If SILAC support is intended, please confirm and update quantmsio/core/sdrf.py accordingly; otherwise remove it from the allowed list in tests.


    110-115: Use direct dict membership.

    Simpler and faster than checking .keys().

    -    for mod in expected_mods:
    -        assert any(
    -            mod in key for key in mods_dict.keys()
    -        ), f"Expected modification {mod} not found in mods_dict"
    +    for mod in expected_mods:
    +        assert any(mod in key for key in mods_dict), (
    +            f"Expected modification {mod} not found in mods_dict"
    +        )
    quantmsio/quantmsioc.py (2)

    103-114: Register DuckDB and project conversion commands or drop imports.

    duckdb_creator and quantms_project_convert are imported but unused (Ruff F401). If intended, register them under convert; otherwise remove the imports.

     convert.add_command(convert_idxml_file, name="idxml")
    +convert.add_command(duckdb_creator, name="duckdb")
    +convert.add_command(quantms_project_convert, name="quantms-project")

    Run flake8 after this change to confirm F401 is resolved.


    116-124: Give transform subcommands stable, human-friendly names.

    Right now Click will default to function names (long/unstable). Use explicit names matching the PR surface.

    -transform.add_command(ae_transform)
    -transform.add_command(de_transform)
    -transform.add_command(gene_transform)
    -transform.add_command(ibaq_transform)
    -transform.add_command(spectra_transform)
    -transform.add_command(uniprot_transform)
    -transform.add_command(anndata_transform)
    +transform.add_command(ae_transform, name="ae")
    +transform.add_command(de_transform, name="de")
    +transform.add_command(gene_transform, name="gene")
    +transform.add_command(ibaq_transform, name="ibaq")
    +transform.add_command(spectra_transform, name="spectra")
    +transform.add_command(uniprot_transform, name="uniprot")
    +transform.add_command(anndata_transform, name="anndata")
    tests/core/idxml/test_idxml_integration.py (2)

    2-2: Remove unused import.

    os is imported but never used.

    -import os

    82-88: Optional: Align with the new CLI entry (future-proof).

    This test invokes a Click command function directly from convert.quantms. Given the CLI now exposes convert idxml, consider testing the top-level CLI too for integration coverage. No action required if you want to keep this as a unit-style CLI test.
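
    A sketch of what that could look like with Click's test runner; the top-level group name cli is an assumption about quantmsio.quantmsioc, while the convert/idxml names follow the registration shown earlier in this review:

    from click.testing import CliRunner
    from quantmsio.quantmsioc import cli  # assumed entry-point group name

    def test_convert_idxml_help():
        runner = CliRunner()
        result = runner.invoke(cli, ["convert", "idxml", "--help"])
        assert result.exit_code == 0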

    tests/core/quantms/test_diann.py (1)

    16-29: LGTM on feature path; consider asserting outputs to strengthen the test.

    You exercise the full path but don’t validate shapes or schema. Adding minimal assertions (e.g., non-empty DataFrame after transform) would catch silent regressions.
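
    For example (the variable name below is a stand-in for whatever the existing test produces after the transform, and the column is assumed from the feature schema):

    assert feature_df is not None and len(feature_df) > 0  # non-empty after transform
    assert "peptidoform" in feature_df.columns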

    tests/core/idxml/test_idxml_psm.py (1)

    198-201: Assert a specific exception type.

    Catching bare Exception hides failures. If the loader raises a file-not-found error, make it explicit.

    -    with pytest.raises(Exception):
    +    with pytest.raises(FileNotFoundError):
             IdXmlPsm("non_existent_file.idXML")

    If the actual exception differs (e.g., OSError), update the type accordingly.

    tests/core/quantms/test_diann_duckdb_performance.py (1)

    3-8: Remove unused imports.

    os and pandas are not used in this test module.

    Apply:

    -import os
    ...
    -import pandas as pd
    tests/core/quantms/psms/test_mztab_indexer.py (1)

    129-162: Reduce log spam and turn referential checks into actionable assertions.

    Current loop logs every accession and a boolean. Aggregate missing items and assert, or at least summarize.

    Example refactor:

    -        for psm_acc_group in psm_accessions:
    -            for psm_acc in psm_acc_group.split(","):
    -                logging.info(f"PSM accession {psm_acc} protein group not found")
    -                logging.info(
    -                    any(psm_acc in prot_acc for prot_acc in all_protein_accessions)
    -                )
    +        missing_psm = []
    +        for psm_acc_group in psm_accessions:
    +            for psm_acc in psm_acc_group.split(","):
    +                if not any(psm_acc in prot_acc for prot_acc in all_protein_accessions):
    +                    missing_psm.append(psm_acc)
    +        assert not missing_psm, f"{len(missing_psm)} PSM accessions not found in proteins: {missing_psm[:10]}"
    
    -        for msstats_acc_group in msstats_accessions:
    -            for msstats_acc in msstats_acc_group.split(","):
    -                logging.info(f"MSstats accession {msstats_acc} protein group not found")
    -                logging.info(
    -                    any(msstats_acc in prot_acc for prot_acc in all_protein_accessions)
    -                )
    +        missing_ms = []
    +        for msstats_acc_group in msstats_accessions:
    +            for msstats_acc in msstats_acc_group.split(","):
    +                if not any(msstats_acc in prot_acc for prot_acc in all_protein_accessions):
    +                    missing_ms.append(msstats_acc)
    +        assert not missing_ms, f"{len(missing_ms)} MSstats accessions not found in proteins: {missing_ms[:10]}"

    Also applies to: 269-293

    tests/core/quantms/test_maxquant.py (1)

    1-2: Prune unused imports.

    os, tempfile, and MzTabProteinGroups are unused.

    Apply:

    -import os
    -import tempfile
    ...
    -from quantmsio.core.quantms.pg import MzTabProteinGroups

    Also applies to: 10-12

    tests/core/quantms/psms/test_duckdb_creation.py (5)

    1264-1280: Use contextlib.suppress for cleanup helpers.

    Simplifies error-tolerant cleanup and satisfies linters (Ruff SIM105, CodeFactor B110).

    Apply:

    +import contextlib
    ...
     def cleanup_mztab_indexer(mztab_indexer):
         """Clean up MzTabIndexer instance and its connections."""
         if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb:
    -        try:
    -            mztab_indexer._duckdb.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +        with contextlib.suppress(Exception):
    +            mztab_indexer._duckdb.close()
     
     def cleanup_duckdb_connection(con):
         """Clean up DuckDB connection."""
         if con:
    -        try:
    -            con.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +        with contextlib.suppress(Exception):
    +            con.close()

    559-563: Docstring/log text doesn’t match behavior (MSstats is included in Step 1).

    Either exclude MSstats in Step 1 or update wording. Given assertions expect msstats.parquet present, update text.

    Apply:

    -    """
    -    Test the two-step process with Parquet backend using TMT data:
    -    1. Create parquet store with mzTab data only
    -    2. Add MSstats data to the existing parquet store
    -    """
    +    """
    +    Test Parquet backend with TMT data:
    +    1. Create parquet store with mzTab + MSstats
    +    2. Open existing store and re-add MSstats (idempotent replace)
    +    """
    ...
    -        print("Step 1: Creating parquet store with mzTab data only...")
    +        print("Step 1: Creating parquet store with mzTab + MSstats...")

    Also applies to: 582-589


    586-596: Close indexer instances to release resources.

    Avoid leaked DuckDB connections and silence “assigned but unused” warnings by using the objects in cleanup.

    Apply:

    -        mztab_step1 = MzTabIndexer.create(
    +        mztab_step1 = MzTabIndexer.create(
                 mztab_path=mztab_path,
    -            msstats_path=msstats_path,
    +            msstats_path=msstats_path,
                 backend="parquet",
                 database_path=output_path,
             )
    ...
    -        mztab_existing = MzTabIndexer.open(
    +        mztab_existing = MzTabIndexer.open(
                 database_path=output_path,
                 backend="parquet",
             )
    ...
    -        # Clean up
    -        con.close()
    +        # Clean up
    +        cleanup_mztab_indexer(mztab_step1)
    +        cleanup_mztab_indexer(mztab_existing)
    +        con.close()

    Also applies to: 628-632, 694-696


    1013-1018: Ensure parity in parquet create/open test: close the created indexer.

    This both releases resources and makes mztab_new a used variable.

    Apply:

     mztab_new = MzTabIndexer.create(
         mztab_path=mztab_path,
         msstats_path=msstats_path,
         backend="parquet",
         database_path=output_path,
     )
    +cleanup_mztab_indexer(mztab_new)
    ...
    -        # Clean up
    -        con.close()
    +        # Clean up
    +        con.close()

    Also applies to: 1121-1123


    184-184: Remove extraneous f-strings without placeholders.

    Several print(f"...") have no {}. Drop the f to satisfy Ruff F541.

    You can auto-fix across this file:

    #!/bin/bash
    # Remove f-prefix when no {} placeholders
    python - <<'PY'
    import re, pathlib
    p = pathlib.Path("tests/core/quantms/psms/test_duckdb_creation.py")
    s = p.read_text()
    s = re.sub(r'print\(f"([^"{]*?)"\)', r'print("\1")', s)
    p.write_text(s)
    PY

    Also applies to: 214-214, 274-274, 310-310, 341-341, 401-401, 436-436, 475-475, 542-542, 580-580, 615-615, 662-662, 689-689, 719-719, 753-753, 806-806, 833-833, 863-863, 897-897, 950-950, 977-977, 1007-1007, 1042-1042, 1089-1089, 1116-1116, 1146-1146, 1181-1181, 1228-1228, 1255-1255

    tests/core/quantms/psms/test_psm.py (5)

    259-266: Avoid comparisons to True/False in tests.

    Use truthiness for clarity and to satisfy linters (E712).

    -    assert (
    -        _is_ms_run_location_line(
    -            ["MTD", "ms_run[1]-location", "file:///path/file.mzML"]
    -        )
    -        == True
    -    )
    -    assert _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"]) == False
    -    assert _is_ms_run_location_line(["PRT", "accession", "P12345"]) == False
    +    assert _is_ms_run_location_line(["MTD", "ms_run[1]-location", "file:///path/file.mzML"])
    +    assert not _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"])
    +    assert not _is_ms_run_location_line(["PRT", "accession", "P12345"])

    520-525: Prefer iterating dict directly (SIM118).

    -                for protein_acc in protein_qvalue_map.keys():
    +                for protein_acc in protein_qvalue_map:

    552-556: Remove redundant f-prefix in string.

    -                    logger.info(f"\nTop 10 protein groups with lowest q-values:")
    +                    logger.info("\nTop 10 protein groups with lowest q-values:")

    558-566: Drop unused loop variable.

    -                    for i, (accession, qvalue) in enumerate(sorted_groups[:10], 1):
    +                    for accession, qvalue in sorted_groups[:10]:

    414-593: This test is doing too much; consider parametrization/splitting.

    Split by concern (stats, bounds, groups) or parametrize over backends to reduce complexity and noise.
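
    A hedged sketch of one way to slice it; the backend values follow the two backends used elsewhere in this review, and the test body is left as a stub:

    import pytest

    @pytest.mark.parametrize("backend", ["duckdb", "parquet"])
    def test_protein_qvalue_statistics(backend, tmp_path):
        # Build the indexer once per backend and keep the assertions focused on
        # q-value statistics; bounds and group checks get their own tests.
        ...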

    quantmsio/utils/pride_utils.py (2)

    424-451: Condense optionals with dict.get and reduce branching.

    -    if "opt_global_cv_MS:1002217_decoy_peptide" in es:
    -        psm["is_decoy"] = es["opt_global_cv_MS:1002217_decoy_peptide"]
    -    else:
    -        psm["is_decoy"] = None
    +    psm["is_decoy"] = es.get("opt_global_cv_MS:1002217_decoy_peptide")
    
    -    if "opt_global_Posterior_Error_Probability_score" in es:
    -        psm["posterior_error_probability"] = es[
    -            "opt_global_Posterior_Error_Probability_score"
    -        ]
    -    else:
    -        psm["posterior_error_probability"] = None
    +    psm["posterior_error_probability"] = es.get("opt_global_Posterior_Error_Probability_score")
    
    -    if "opt_global_q-value" in es:
    -        psm["global_qvalue"] = es["opt_global_q-value"]
    -    else:
    -        psm["global_qvalue"] = None
    +    psm["global_qvalue"] = es.get("opt_global_q-value")
    
    -    if "opt_global_consensus_support" in es:
    -        psm["consensus_support"] = es["opt_global_consensus_support"]
    -    else:
    -        psm["consensus_support"] = None
    +    psm["consensus_support"] = es.get("opt_global_consensus_support")

    469-481: Raise ValueError for missing ms_run mapping; include context.

    Use a more specific exception type and a concise message for callers.

    -        if ms_runs is not None:
    -            if ms_run in ms_runs:
    -                psm["ms_run"] = ms_runs[ms_run]
    -            else:
    -                raise Exception(f"The ms run {ms_run} is not in the ms runs index")
    +        if ms_runs is not None:
    +            if ms_run in ms_runs:
    +                psm["ms_run"] = ms_runs[ms_run]
    +            else:
    +                raise ValueError(f"ms_run '{ms_run}' not found in provided ms_runs index")
    quantmsio/core/openms.py (2)

    3-3: Remove unused imports.

    -from typing import List, Dict, Any, Optional, Iterator, Union
    +from typing import List, Dict, Any, Optional, Union
    -from quantmsio.utils.mztab_utils import parse_peptidoform_openms
    +# parse_peptidoform_openms not used here; remove import

    Also applies to: 11-11


    217-226: Broaden numeric check to handle non-floatables safely.

    -        except ValueError:
    +        except (TypeError, ValueError):
                 return False
    tests/core/quantms/test_diann_duckdb.py (3)

    3-7: Remove unused imports.

    -import os
     import time
     from pathlib import Path
     
    -import pandas as pd
     import pytest

    40-41: Drop redundant f-strings without placeholders.

    -        print(f"\nSmall dataset statistics:")
    +        print("\nSmall dataset statistics:")
    -        print(f"\nFull dataset statistics:")
    +        print("\nFull dataset statistics:")
    -        print(f"\nFiltered protein groups matrix statistics (5 proteins):")
    +        print("\nFiltered protein groups matrix statistics (5 proteins):")
    -        print(f"\nCaching performance:")
    +        print("\nCaching performance:")

    Also applies to: 125-126, 232-233, 278-281


    106-249: Consolidate repeated stats logic via parametrization/fixtures (optional).

    Small and full tests duplicate large chunks. Consider pytest parametrization for dataset and thresholds to trim runtime and code size.
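
    A possible shape for that, with dataset paths and thresholds as placeholders rather than the real test inputs:

    import pytest

    @pytest.mark.parametrize(
        "report_path, min_rows",
        [("small_report.tsv", 10), ("full_report.tsv", 1000)],  # placeholder datasets/thresholds
    )
    def test_report_statistics(report_path, min_rows):
        ...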

    quantmsio/core/maxquant/maxquant.py (1)

    779-798: Simplify type checking for better maintainability.

    The explicit type comparisons violate Python best practices. Consider using isinstance() for type checking or a dictionary-based approach.

    Apply this refactor for cleaner type checking:

    -        for field, dtype in required_fields.items():
    -            if field not in df.columns:
    -                if dtype == list:
    -                    df[field] = [[] for _ in range(len(df))]
    -                elif dtype == str:
    -                    df[field] = None
    -                elif dtype == float:
    -                    df[field] = 0.0
    -                elif dtype == int:
    -                    df[field] = 0
    -            else:
    -                # Handle type conversions
    -                if dtype == float:
    -                    df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0)
    -                elif dtype == int:
    -                    df[field] = (
    -                        pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int)
    -                    )
    -                elif dtype == str:
    -                    df[field] = df[field].astype(str).fillna("")
    +        default_values = {
    +            list: lambda: [[] for _ in range(len(df))],
    +            str: lambda: None,
    +            float: lambda: 0.0,
    +            int: lambda: 0
    +        }
    +        
    +        for field, dtype in required_fields.items():
    +            if field not in df.columns:
    +                df[field] = default_values[dtype]()
    +            else:
    +                # Handle type conversions
    +                if dtype == float:
    +                    df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0)
    +                elif dtype == int:
    +                    df[field] = (
    +                        pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int)
    +                    )
    +                elif dtype == str:
    +                    df[field] = df[field].astype(str).fillna("")
    tests/core/quantms/test_msstats_in.py (4)

    1-17: Remove unused imports to clean up the module.

    The gzip and tempfile imports are not used in the code.

    -import gzip
    -import tempfile
     from pathlib import Path

    74-74: Remove extraneous f-string prefixes.

    These strings don't contain any placeholders, so the f-prefix is unnecessary.

    -            print(f"Intensity distribution analysis completed")
    +            print("Intensity distribution analysis completed")
    
    -        print(f"\nLFQ test completed successfully!")
    +        print("\nLFQ test completed successfully!")
    
    -    print(f"\nTMT test completed successfully!")
    +    print("\nTMT test completed successfully!")
    
    -                print(f"Intensity distribution analysis completed")
    +                print("Intensity distribution analysis completed")
    
    -    print(f"\nDataset Comparison:")
    +    print("\nDataset Comparison:")

    Also applies to: 117-117, 465-465, 536-536, 589-589


    272-319: Consider breaking down the complex channel processing function.

    The _process_channel_statistics function handles both direct channel counts and intensities-based channel counts with nested logic. Consider splitting into two separate functions for clarity.

     def _process_channel_statistics(
         msstats_batch, channel_counts, file_channel_matrix, channel_column="Channel"
     ):
         """Process channel statistics from batch data."""
         # Count channels from the specified channel column
         chan_counts = msstats_batch[channel_column].value_counts()
         for channel, count in chan_counts.items():
             if channel not in channel_counts:
                 channel_counts[channel] = 0
             channel_counts[channel] += count
     
    +    # Process intensities separately
    +    _process_intensities_channel_statistics(
    +        msstats_batch, channel_counts, file_channel_matrix
    +    )
    +
    +def _process_intensities_channel_statistics(
    +    msstats_batch, channel_counts, file_channel_matrix
    +):
    +    """Process channel statistics from intensities column."""
         # Count channels from the 'intensities' column
         intensities_channel_counts = {}
         file_channel_matrix_from_intensities = {}
     
         for _, row in msstats_batch.iterrows():
             ref_file = row["reference_file_name"]
             if "intensities" in row and row["intensities"] is not None:
                 for intensity_info in row["intensities"]:
                     channel = intensity_info["channel"]
     
                     # Count channels from intensities
                     if channel not in intensities_channel_counts:
                         intensities_channel_counts[channel] = 0
                     intensities_channel_counts[channel] += 1
     
                     # Count file-channel combinations from intensities
                     if ref_file not in file_channel_matrix_from_intensities:
                         file_channel_matrix_from_intensities[ref_file] = {}
                     if channel not in file_channel_matrix_from_intensities[ref_file]:
                         file_channel_matrix_from_intensities[ref_file][channel] = 0
                     file_channel_matrix_from_intensities[ref_file][channel] += 1
     
         # Update the main counting with intensities data
         if intensities_channel_counts:
    -        for channel, count in intensities_channel_counts.items():
    -            if channel not in channel_counts:
    -                channel_counts[channel] = 0
    -
             # Use intensities-based matrix
             for ref_file, channels in file_channel_matrix_from_intensities.items():
                 if ref_file not in file_channel_matrix:
                     file_channel_matrix[ref_file] = {}
                 for channel, count in channels.items():
                     if channel not in file_channel_matrix[ref_file]:
                         file_channel_matrix[ref_file][channel] = 0
                     file_channel_matrix[ref_file][channel] += count

    357-357: Simplify dictionary key check.

    Use simpler syntax for checking if a key exists in a dictionary.

    -            for channel in file_channels.keys()
    +            for channel in file_channels
    quantmsio/commands/convert/quantms.py (1)

    7-7: Remove unused import.

    The tempfile import at the module level is unused since you're importing it locally where needed.

    -import tempfile
     from typing import Optional
    quantmsio/core/idxml/idxml.py (1)

    397-397: Remove unnecessary f-string prefix.

    The string doesn't contain any placeholders.

    -        logger.info(f"=== DEBUG: convert_to_parquet started ===")
    +        logger.info("=== DEBUG: convert_to_parquet started ===")
    quantmsio/core/diann/diann.py (2)

    7-7: Remove unused imports.

    Several imports are unused in the code.

    -from typing import Union, Optional, List
    +from typing import Union
     
     ...
     
    -from quantmsio.core.duckdb import DiannDuckDB
    -from quantmsio.core.project import create_uuid_filename
     from quantmsio.core.quantms.feature import Feature
     from quantmsio.core.quantms.mztab import MzTab
     from quantmsio.core.sdrf import SDRFHandler
    -from quantmsio.operate.tools import get_ahocorasick

    Also applies to: 23-24, 28-28


    343-345: Add error handling for type conversion.

    The type conversion could fail if the data is not in the expected format.

             group = group.copy()
    -        if group["rt"].dtype != "float64":
    -            group["rt"] = group["rt"].astype("float64")
    +        try:
    +            if group["rt"].dtype != "float64":
    +                group["rt"] = pd.to_numeric(group["rt"], errors="coerce")
    +        except Exception as e:
    +            logging.warning(f"Failed to convert retention time to float: {e}")
    quantmsio/core/quantms/psm.py (2)

    242-255: Remove unused loop variable to improve code clarity.

    The index variable is not used in the loop body and should be replaced with _ to indicate it's intentionally unused.

     # Now we need to convert the temp_modifications to the final modifications dictionary where key
     # of the dictionary is accession, and indexes are moved as second element of the tuple. It could be that
     # one accession has mutliple indexes, then we need to group them and add then also site and position.
     modifications = {}
    -for index, (
    +for _, (
         name,
         accession,
         sites,
         positions,
     ) in temp_modifications.items():

    506-507: Remove extraneous f-string prefix.

    The f-string has no placeholders and doesn't need the f prefix.

    -                self.logger.warning(f"Could not convert global_qvalue to float")
    +                self.logger.warning("Could not convert global_qvalue to float")
    quantmsio/core/duckdb.py (1)

    197-197: Remove unused variable assignment.

    The result variable is assigned but never used in the performance stats method.

                 try:
    -                result = self.query_to_df(query)
    +                self.query_to_df(query)
                     end_time = time.time()
    quantmsio/core/quantms/feature.py (6)

    2-8: Remove unused imports.

    Several imports are not used in the code and should be removed for cleaner code.

     import logging
    -import tempfile
    -from pathlib import Path
    -from typing import Union
     
     import pandas as pd
     import pyarrow as pa
    -import pyarrow.parquet as pq

    14-16: Remove unused imports from the module.

    The Psm and get_protein_accession imports are not directly used at the module level.

    -from quantmsio.core.quantms.psm import Psm
     from quantmsio.core.sdrf import SDRFHandler
    -from quantmsio.operate.tools import get_protein_accession

    229-238: Simplify nested if statements.

    The nested if statements can be combined into a single condition for better readability.

                 for _, row in metadata_df.iterrows():
    -                if "fixed_mod[" in row["key"] or "var_mod[" in row["key"]:
    -                    if "site" not in row["key"] and "position" not in row["key"]:
    +                if ("fixed_mod[" in row["key"] or "var_mod[" in row["key"]) and "site" not in row["key"] and "position" not in row["key"]:
                             values = (
                                 row["value"].replace("[", "").replace("]", "").split(",")
                             )

    245-246: Fix unused loop variables.

Loop variables that are not used should be prefixed with an underscore to signal intent.

    -            for accession, (name, index, site, position) in modifications.items():
    +            for accession, (name, _, site, position) in modifications.items():
                     mods_map[name] = [accession, site or "X"]

    317-317: Remove unused variable assignment.

    The charge_col variable is assigned but never used.

             # Add charge column if available, otherwise use default
             if "Charge" in msstats_batch.columns:
                 grouping_cols.append("Charge")
    -            charge_col = "Charge"
             elif "PrecursorCharge" in msstats_batch.columns:
                 grouping_cols.append("PrecursorCharge")
    -            charge_col = "PrecursorCharge"
             else:
                 # Add a default charge if not available
                 msstats_batch["Charge"] = 3
                 grouping_cols.append("Charge")
    -            charge_col = "Charge"

    647-763: Consider breaking down complex conversion method.

    The convert_to_parquet_format method is quite long and handles many different column types. Consider refactoring into smaller helper methods for better maintainability.

    Would you like me to help refactor this method into smaller, more focused helper methods like _convert_float_columns, _convert_int_columns, _convert_list_columns, etc.? This would improve readability and make the code easier to test and maintain.
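
    For illustration, a minimal sketch of what that split could look like, assuming the method mostly coerces column dtypes; the helper names and column lists below are hypothetical and not taken from feature.py.

    import pandas as pd

    # Hypothetical helpers; names and column lists are illustrative only.
    def _convert_float_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Coerce the given columns to float64, turning bad values into NaN."""
        for col in columns:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce").astype("float64")
        return df

    def _convert_int_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Coerce the given columns to a nullable integer dtype."""
        for col in columns:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
        return df

    def _convert_list_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Ensure the given columns hold Python lists (None stays None)."""
        for col in columns:
            if col in df.columns:
                df[col] = df[col].apply(
                    lambda v: v if v is None or isinstance(v, list) else list(v)
                )
        return df

    def convert_to_parquet_format(df: pd.DataFrame) -> pd.DataFrame:
        """The top-level method becomes a thin pipeline over focused helpers."""
        df = _convert_float_columns(df, ["rt", "observed_mz", "calculated_mz"])
        df = _convert_int_columns(df, ["precursor_charge", "number_peaks"])
        df = _convert_list_columns(df, ["mz_array", "intensity_array"])
        return df

    Each helper can then be unit-tested in isolation, which is the maintainability gain this comment is after.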

    @@ -15,47 +15,72 @@
    {"name": "file_type", "type": "string", "doc": "Type of the file (feature_file)"},
    {"name": "creation_date", "type": "string", "doc": "Date when the file was created"},
    {"name": "uuid", "type": "string", "doc": "Unique identifier for the file"},
    {"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},

    💡 Verification agent

    🧩 Analysis chain

    Verify the scan_format field values are validated.

    The scan_format field documentation mentions possible values (scan, index, nativeId) but the schema doesn't enforce these values through an enum type.

    Consider using an Avro enum type to enforce valid values:

    -{"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},
    +{"name": "scan_format", "type": {"type": "enum", "name": "ScanFormat", "symbols": ["scan", "index", "nativeId"]}, "doc": "The format of the scan"},

    🏁 Script executed:

    #!/bin/bash
    # Search for validation of scan_format values in the codebase
    rg -n "scan_format.*(?:scan|index|nativeId)" --type py
    
    # Check if there's any enum or validation for scan_format
    ast-grep --pattern 'scan_format == $_' 

    Length of output: 377


    Enforce scan_format values via an Avro enum

    I didn’t find any code that validates scan_format against the documented values (“scan”, “index”, “nativeId”), nor does the Avro schema enforce them. Without schema-level enforcement or runtime checks, invalid values could slip through and cause downstream failures.

    Suggested changes:

    • Update the Avro schema (docs/feature.avsc) to use an enum for scan_format.
    • Review all code that serializes or deserializes this field—e.g.
      • quantmsio/core/idxml/idxml.py:366
      • quantmsio/core/quantms/psm.py:326
      • quantmsio/core/quantms/feature.py:264
        to handle the enum type or validate string values accordingly.

    Diff for docs/feature.avsc:

    - {"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},
    + {"name": "scan_format",
    +  "type": {
    +    "type": "enum",
    +    "name": "ScanFormat",
    +    "symbols": ["scan", "index", "nativeId"]
    +  },
    +  "doc": "The format of the scan"}
    🤖 Prompt for AI Agents
    In docs/feature.avsc around line 18, replace the current string-typed
    "scan_format" field with an Avro enum type that defines the allowed symbols
    ["scan","index","nativeId"] so the schema enforces valid values; then
    update/inspect the serialization/deserialization and validation sites
    (quantmsio/core/idxml/idxml.py:366, quantmsio/core/quantms/psm.py:326,
    quantmsio/core/quantms/feature.py:264) to handle the enum (either by mapping
    enum values to/from strings or by adding explicit runtime validation that
    accepts only these three values), and run schema compatibility tests to ensure
    no breaking changes for existing data.
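
    As a complement to the schema change, a minimal sketch of the runtime check described above; the constant and function names are hypothetical, not existing quantms.io API.

    VALID_SCAN_FORMATS = {"scan", "index", "nativeId"}  # values documented in the schema

    def validate_scan_format(value: str) -> str:
        """Return the value unchanged if it is an allowed scan_format, else raise."""
        if value not in VALID_SCAN_FORMATS:
            raise ValueError(
                f"Invalid scan_format {value!r}; expected one of {sorted(VALID_SCAN_FORMATS)}"
            )
        return value

    # Example use at a serialization site:
    file_metadata = {"scan_format": validate_scan_format("scan")}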
    

    Comment on lines +74 to +75
    "type": "struct",
    "fields": [

    ⚠️ Potential issue

    Fix inconsistent field type definition.

    The best_id_score field uses "struct" type which is not standard in Avro. Avro uses "record" for complex types.

    Apply this fix to use the correct Avro type:

     {"name": "best_id_score", "type": ["null", {
         "type": "array",
         "items": {
    -        "type": "struct",
    +        "type": "record",
    +        "name": "best_id_score_entry",
             "fields": [
                 {"name": "score_name", "type": "string"},
                 {"name": "score_value", "type": "float32"}
             ]
             }
         }],
     "doc": "A named score type and value representing an identification's measure of confidence or input feature"
     },
    🤖 Prompt for AI Agents
    In docs/peptide.avsc around lines 74 to 75, the complex type for the
    best_id_score field is incorrectly declared as "struct"; change that to the Avro
    standard "record" (i.e., replace "type": "struct" with "type": "record") and
    ensure the record includes a proper "name" and "fields" array so the schema
    remains valid.
    

    Comment on lines +28 to +41
    {"name": "peptide_counts", "type": ["null", {
    "type": "struct",
    "fields": [
    {"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"},
    {"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"}
    ]
    }], "doc": "Peptide sequence counts for this protein group in this specific file"},
    {"name": "feature_counts", "type": ["null", {
    "type": "struct",
    "fields": [
    {"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"},
    {"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"}
    ]
    }], "doc": "Peptide feature counts for this protein group in this specific file"},

    🛠️ Refactor suggestion

    Avro type: use record, not struct; add names.

    Avro JSON doesn’t support struct; nested types must be record with a name.

    -        {"name": "peptide_counts", "type": ["null", {
    -            "type": "struct",
    -            "fields": [
    +        {"name": "peptide_counts", "type": ["null", {
    +            "type": "record",
    +            "name": "pg_peptide_counts",
    +            "fields": [
                     {"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"},
                     {"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"}
                 ]
             }], "doc": "Peptide sequence counts for this protein group in this specific file"},
    -        {"name": "feature_counts", "type": ["null", {
    -            "type": "struct",
    -            "fields": [
    +        {"name": "feature_counts", "type": ["null", {
    +            "type": "record",
    +            "name": "pg_feature_counts",
    +            "fields": [
                     {"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"},
                     {"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"}
                 ]
             }], "doc": "Peptide feature counts for this protein group in this specific file"},
    🤖 Prompt for AI Agents
    In docs/pg.avsc around lines 28-41, the Avro schema uses "struct" for nested
    types which is invalid JSON Avro; replace each nested "struct" with a "record"
    object and add a unique "name" property (e.g., PeptideCounts, FeatureCounts)
    inside each record, keeping the "fields" arrays intact and preserving the
    surrounding nullable union (["null", { ... }]); ensure the record names are
    valid Avro identifiers and update any references if needed.
    

    Comment on lines +67 to +68
    { "name": "score_value", "type": "float32", "doc": "Value of the score for this specific position" }
    ]

    🛠️ Refactor suggestion

    ⚠️ Potential issue

    Use valid Avro primitive types (int/float), not int32/float32.

    Avro primitives are: null, boolean, int, long, float, double, bytes, string. int32/float32 are invalid.

    Apply:

    -                                    { "name": "score_value", "type": "float32", "doc": "Value of the score for this specific position" }
    +                                    { "name": "score_value", "type": "float", "doc": "Value of the score for this specific position" }
    -            {"name": "precursor_charge", "type": "int32", "doc": "Precursor charge"},
    +            {"name": "precursor_charge", "type": "int", "doc": "Precursor charge"},
    -            {"name": "is_decoy", "type": "int32", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
    +            {"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
    -            {"name": "number_peaks", "type": ["null", "int32"], "doc": "Number of peaks in the spectrum used for the peptide spectrum match"},
    +            {"name": "number_peaks", "type": ["null", "int"], "doc": "Number of peaks in the spectrum used for the peptide spectrum match"},

    Also update other occurrences (posterior_error_probability, predicted_rt, rt, ion_mobility, mz_array, intensity_array, calculated_mz, observed_mz) to use "float".

    To fix all in this file:

    #!/bin/bash
    # Replace non-Avro primitives in-place
    sed -i -E 's/"int32"/"int"/g; s/"float32"/"float"/g' docs/psm.avsc

    Also applies to: 82-82, 84-84, 120-120

    🤖 Prompt for AI Agents
    In docs/psm.avsc around lines 67-68 (also occurrences at 82, 84 and 120), the
    schema uses invalid Avro primitive types "int32" and "float32"; update the field
    definition for score_value to use "float" and change any other fields listed
    (posterior_error_probability, predicted_rt, rt, ion_mobility, mz_array,
    intensity_array, calculated_mz, observed_mz) that use "int32"/"float32" to the
    correct Avro primitives "int"/"float" respectively; perform a project-wide
    replace of "int32" → "int" and "float32" → "float" to ensure all instances are
    fixed and validate the AVSC file afterwards.
    

    Comment on lines +51 to +53
    evidence_file = test_data[1]
    sdrf_file = test_data[2]


    🛠️ Refactor suggestion

    Guard feature test with file existence checks.

    Apply:

     evidence_file = test_data[1]
     sdrf_file = test_data[2]
    +if not evidence_file.exists() or not sdrf_file.exists():
    +    pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
    🤖 Prompt for AI Agents
    In tests/core/quantms/test_maxquant.py around lines 51 to 53, the test assumes
    test_data[1] and test_data[2] (evidence_file and sdrf_file) exist; add guards to
    check os.path.exists for each file and call pytest.skip with a clear message if
    either file is missing so the test is skipped rather than failing; ensure you
    import os and pytest at the top of the file if not already present.
    

    Comment on lines +81 to +83
    evidence_file = test_data[1]
    sdrf_file = test_data[2]


    🛠️ Refactor suggestion

    Guard feature slicing test with file existence checks.

    Apply:

     evidence_file = test_data[1]
     sdrf_file = test_data[2]
    +if not evidence_file.exists() or not sdrf_file.exists():
    +    pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
    🤖 Prompt for AI Agents
    In tests/core/quantms/test_maxquant.py around lines 81 to 83, the test uses
    evidence_file = test_data[1] and sdrf_file = test_data[2] without verifying
    those paths exist; add existence guards so the test is skipped when required
    input files are missing. Update the test to check os.path.exists (or
    Path.exists) for each file and call pytest.skip with a clear message if either
    file is absent (or use pytest.mark.skipif at module/function level with the same
    condition), so the test won't fail on CI or local runs when those fixture files
    are not present.
    

    Comment on lines +12 to 14
    "PXD007683",
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv",
    )

    💡 Verification agent

    🧩 Analysis chain

    Dataset switch to PXD007683 looks good; verify test asset path exists in repo.
    This ensures the SDRF file is present so the test doesn’t silently skip the behavior it is meant to cover.


    🏁 Script executed:

    #!/bin/bash
    # Verify the SDRF file exists where the test expects it.
    set -euo pipefail
    echo "Searching for PXD007683TMT.sdrf.tsv under tests/ ..."
    fd -a 'PXD007683TMT.sdrf.tsv' tests | tee /dev/stderr
    # Fail if not found
    test -n "$(fd -a 'PXD007683TMT.sdrf.tsv' tests)" || { echo 'Missing SDRF test file'; exit 1; }

    Length of output: 282


    🏁 Script executed:

    #!/bin/bash
    # Re-verify the presence of PXD007683TMT.sdrf.tsv using find, avoiding unavailable commands.
    set -euo pipefail
    echo "Searching for PXD007683TMT.sdrf.tsv under tests/ using find..."
    RESULT=$(find tests -type f -name 'PXD007683TMT.sdrf.tsv' || true)
    echo "$RESULT"
    # Fail if not found
    if [ -z "$RESULT" ]; then
      echo "Missing SDRF test file"
      exit 1
    fi

    Length of output: 305


    Add or update the missing SDRF test asset

    The test in tests/core/test_project.py is pointing to PXD007683TMT.sdrf.tsv under the TEST_DATA_ROOT, but that file was not found in the repo. Without it, the test will error out or silently skip important validation.

    • Location: tests/core/test_project.py (lines 12–14)
    • Expected asset: tests/.../quantms/dda-plex-full/PXD007683TMT.sdrf.tsv

    Please ensure that the SDRF file is committed at the correct path, or adjust the test to reference the existing file location.

    🤖 Prompt for AI Agents
    In tests/core/test_project.py around lines 12 to 14, the test references
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv" which is missing
    from the repository; either add the SDRF file at
    tests/.../quantms/dda-plex-full/PXD007683TMT.sdrf.tsv so the path resolves, or
    update the test to point to the correct existing SDRF file path (or a fixture)
    and adjust any expected contents accordingly so the test no longer depends on a
    non-existent asset.
    

    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 0

    ♻️ Duplicate comments (1)
    quantmsio/core/idxml.py (1)

    627-629: Ensure parent directory exists and compress Parquet output

    Prevents write failures when dirs are missing and reduces file size. Matches prior review guidance.

    -        table = pa.Table.from_pandas(df, schema=PSM_SCHEMA)
    -        pq.write_table(table, output_path)
    -        logging.info(f"Successfully converted IdXML to parquet: {output_path}")
    +        output_path = Path(output_path)
    +        output_path.parent.mkdir(parents=True, exist_ok=True)
    +        table = pa.Table.from_pandas(df, schema=PSM_SCHEMA)
    +        pq.write_table(table, str(output_path), compression="zstd")
    +        logging.info(f"Successfully converted IdXML to parquet: {output_path}")
    🧹 Nitpick comments (7)
    quantmsio/core/idxml.py (7)

    55-57: Log full traceback when IdXML parsing fails

    Use logging.exception to include the stack trace; avoid interpolating the exception object redundantly.

    -        except Exception as e:
    -            logging.error(f"Error parsing IdXML file with pyopenms: {e}")
    -            raise
    +        except Exception:
    +            logging.exception("Error parsing IdXML file with pyopenms")
    +            raise

    170-172: Emit traceback when a peptide hit cannot be parsed

    This improves diagnosability and aligns with TRY400 guidance.

    -        except Exception as e:
    -            logging.warning(f"Error parsing peptide hit with pyopenms: {e}")
    -            return None
    +        except Exception:
    +            logging.exception("Error parsing peptide hit with pyopenms")
    +            return None

    512-520: Broadened catch is fine here; include traceback to aid debugging

    Given this wraps two alternative attachment strategies, keep the broad catch but log with traceback.

    -        except Exception as e:
    -            method_name = "OnDiscExperiment" if self._use_ondisc else "traditional"
    -            logging.warning(f"Failed to attach mzML spectra using {method_name}: {e}")
    +        except Exception:
    +            method_name = "OnDiscExperiment" if self._use_ondisc else "traditional"
    +            logging.exception(f"Failed to attach mzML spectra using {method_name}")

    529-533: OnDisc path: prefer openFile when available (true on-disk access)

    Using OnDiscMSExperiment.openFile avoids loading the full file into memory. Fall back to MzMLFile.load only if needed.

    -        ondisc_exp = oms.OnDiscMSExperiment()
    -        oms.MzMLFile().load(str(self._mzml_path), ondisc_exp)
    +        ondisc_exp = oms.OnDiscMSExperiment()
    +        # Prefer true on-disk access when available
    +        if hasattr(ondisc_exp, "openFile"):
    +            ondisc_exp.openFile(str(self._mzml_path))
    +        else:
    +            # Fallback for older pyopenms bindings
    +            oms.MzMLFile().load(str(self._mzml_path), ondisc_exp)

    562-564: Remove unnecessary int() cast

    num_peaks is already an int.

    -            psm["number_peaks"] = int(num_peaks)
    +            psm["number_peaks"] = num_peaks

    584-586: Remove unnecessary int() cast

    len(...) returns an int.

    -                psm["number_peaks"] = int(len(spectrum_mz))
    +                psm["number_peaks"] = len(spectrum_mz)

    433-470: Avoid double numpy→list normalization

    to_dataframe already normalizes list-like fields; the extra conversion in to_parquet duplicates work. Consider keeping normalization in one place (preferably to_dataframe) and removing it from to_parquet.

    -        def convert_numpy_to_list(obj):
    -            """Recursively convert numpy arrays to Python lists"""
    -            if hasattr(obj, "tolist"):
    -                return obj.tolist()
    -            elif isinstance(obj, list):
    -                return [convert_numpy_to_list(item) for item in obj]
    -            elif isinstance(obj, dict):
    -                return {key: convert_numpy_to_list(value) for key, value in obj.items()}
    -            else:
    -                return obj
    -
    -        for col in df.columns:
    -            if col in [
    -                "modifications",
    -                "protein_accessions",
    -                "additional_scores",
    -                "cv_params",
    -                "mz_array",
    -                "intensity_array",
    -            ]:
    -                df[col] = df[col].apply(convert_numpy_to_list)
    +        # Data is already normalized in to_dataframe(); avoid re-normalizing here.

    Also applies to: 605-626

    📜 Review details

    Configuration used: CodeRabbit UI

    Review profile: CHILL

    Plan: Pro

    💡 Knowledge Base configuration:

    • MCP integration is disabled by default for public repositories
    • Jira integration is disabled by default for public repositories
    • Linear integration is disabled by default for public repositories

    You can enable these sources in your CodeRabbit configuration.

    📥 Commits

    Reviewing files that changed from the base of the PR and between b0b4d76 and 1896847.

    📒 Files selected for processing (1)
    • quantmsio/core/idxml.py (1 hunks)
    🧰 Additional context used
    🧬 Code graph analysis (1)
    quantmsio/core/idxml.py (1)
    quantmsio/core/openms.py (2)
    • OpenMSHandler (14-344)
    • get_spectrum_from_scan (21-43)
    🪛 Ruff (0.12.2)
    quantmsio/core/idxml.py

    56-56: Use logging.exception instead of logging.error

    Replace with exception

    (TRY400)


    147-168: Consider moving this statement to an else block

    (TRY300)


    170-170: Do not catch blind exception: Exception

    (BLE001)


    517-517: Do not catch blind exception: Exception

    (BLE001)


    584-584: Value being cast to int is already an integer

    Remove unnecessary int call

    (RUF046)

    ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
    • GitHub Check: Codacy Static Code Analysis
    • GitHub Check: build
    🔇 Additional comments (1)
    quantmsio/core/idxml.py (1)

    557-567: Heads-up: upstream return shape may throw here; verify OpenMSHandler contract

    OpenMSHandler.get_spectrum_from_scan appears to return only two values on some error paths (see openms.py snippet), which triggers ValueError during unpacking. Your except already handles it, but the root cause should be fixed upstream to always return a consistent 3-tuple.
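
    For illustration, a sketch of a consistent return contract with a placeholder reader; this is an assumption about how the upstream fix could look, not the current OpenMSHandler code.

    from typing import List, Optional, Tuple

    def _read_spectrum(mzml_path: str, scan: str) -> Tuple[List[float], List[float]]:
        """Placeholder for the actual pyopenms lookup used by OpenMSHandler."""
        raise KeyError(f"scan {scan!r} not found in {mzml_path}")

    def get_spectrum_from_scan(
        mzml_path: str, scan: str
    ) -> Tuple[Optional[List[float]], Optional[List[float]], int]:
        """Always return (mz_array, intensity_array, number_peaks), even on error paths."""
        try:
            mz_array, intensity_array = _read_spectrum(mzml_path, scan)
        except Exception:
            # The error path keeps the same shape, so callers can unpack unconditionally.
            return None, None, 0
        return mz_array, intensity_array, len(mz_array)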

    @ypriverol ypriverol requested a review from Copilot August 31, 2025 09:27
    Copilot

    This comment was marked as outdated.

    @ypriverol ypriverol requested a review from Copilot August 31, 2025 15:19
    @Copilot Copilot AI left a comment

    Pull Request Overview

    This PR implements a major restructuring of the quantms.io CLI with organized command groups and enhanced functionality. The changes introduce a hierarchical command structure with groups for conversion, transformation, visualization, statistics, and project management, while adding comprehensive logging and improving error handling.

    • Major CLI restructuring from flat commands to organized groups (convert, transform, visualize, stats, project)
    • Enhanced logging system with structured output support and comprehensive documentation
    • New quantms workflow conversion command with support for multiple proteomics tools (DIA-NN, MaxQuant, FragPipe)

    Reviewed Changes

    Copilot reviewed 106 out of 152 changed files in this pull request and generated 1 comment.

    Show a summary per file
    File Description
    quantmsioc.py Restructure main CLI with organized command groups and enhanced error handling
    diann.py, maxquant.py, fragpipe.py Add conversion commands for different proteomics tools
    psm.py, feature.py, plot.py Add PSM/feature conversion and visualization commands
    ae.py, de.py, ibaq.py, gene.py Add transformation commands for various analyses
    project.py, stats.py Add project management and statistical analysis commands
    logger.py Add centralized logging utility module
    Test files Comprehensive test coverage for new functionality and performance testing
    Schema files Update Avro schemas with nullable fields and new differential/absolute expression schemas
    Documentation Major README update with usage examples and comprehensive logging documentation


    "MSV000079033",
    TEST_DATA_ROOT / "DDA-plex/MSV000079033-Blood-Plasma-iTRAQ.sdrf.tsv",
    "PXD007683",
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv",
    Copilot AI Aug 31, 2025

    The filename contains "PXD007683TMT.sdrf.tsv" but should likely be "PXD007683-TMT.sdrf.tsv" to match the pattern used elsewhere in the codebase (with a hyphen between the dataset ID and the experiment type).

    Suggested change
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv",
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683-TMT.sdrf.tsv",


    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 12

    Caution

    Some comments are outside the diff and can’t be posted inline due to platform limitations.

    ⚠️ Outside diff range comments (3)
    quantmsio/core/diann/diann.py (3)

    414-424: Incorrect call to Feature.generate_modifications_details (signature mismatch).

    Feature.generate_modifications_details is not a static method and expects a modifications map, not a sample map plus select_mods. This call will raise at runtime.

    Minimal safe fix: map peptidoform to a canonical string and set modifications to None.

    -        report[["peptidoform", "modifications"]] = report[["peptidoform"]].apply(
    -            lambda row: Feature.generate_modifications_details(
    -                row["peptidoform"], self._sample_map, select_mods
    -            ),
    -            axis=1,
    -            result_type="expand",
    -        )
    +        _, modifications_map = self.get_masses_and_modifications_map()
    +        report["peptidoform"] = report["peptidoform"].map(modifications_map)
    +        report["modifications"] = None

    If you need full modification details, refactor to use an instance method on Feature that has access to a proper modifications_dict.


    426-439: Same LFQ mapping issue in add_additional_msg().

    Apply the same fallback to avoid KeyError.

    -                {
    -                    "sample_accession": self._sample_map[
    -                        rows["reference_file_name"] + "-" + rows["channel"]
    -                    ],
    +                {
    +                    "sample_accession": self._sample_map.get(
    +                        rows["reference_file_name"] + "-" + rows["channel"],
    +                        self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]),
    +                    ),
                         "channel": rows["channel"],
                         "intensity": rows["intensity"],
                     }

    449-464: Apply LFQ mapping fallback to additional_intensities as well.

    Mirror the robust lookup.

    -                {
    -                    "sample_accession": self._sample_map[
    -                        rows["reference_file_name"] + "-" + rows["channel"]
    -                    ],
    +                {
    +                    "sample_accession": self._sample_map.get(
    +                        rows["reference_file_name"] + "-" + rows["channel"],
    +                        self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]),
    +                    ),
                         "channel": rows["channel"],
                         "intensities": [
                             {"intensity_name": "lfq", "intensity_value": rows["lfq"]},
                         ],
                     }
    ♻️ Duplicate comments (16)
    quantmsio/core/sdrf.py (1)

    129-131: Add proper exception chaining (ruff B904).

    Preserve original context or explicitly suppress it.

    -        except FileNotFoundError:
    -            raise FileNotFoundError(
    -                "The SDRF file provided not found: " + str(sdrf_file)
    -            )
    +        except FileNotFoundError as err:
    +            raise FileNotFoundError(
    +                "The SDRF file provided not found: " + str(sdrf_file)
    +            ) from err
    quantmsio/utils/file_utils.py (3)

    17-18: Library modules shouldn’t call basicConfig.

    Don’t configure global logging in a library; attach a NullHandler instead.

    -# Configure logging
    -logging.basicConfig(level=logging.INFO)
    -logger = logging.getLogger(__name__)
    +# Configure logging
    +logger = logging.getLogger(__name__)
    +logger.addHandler(logging.NullHandler())

    30-39: Support multi-suffix and case-insensitive extensions.

    Current suffix equality fails for “.mzTab.gz” and case variants.

     def validate_extension(file_path: Union[str, Path], extension: str) -> bool:
         """Validate that the file has the correct extension."""
         if not Path(file_path).exists():
             raise FileNotFoundError(f"File not found: {file_path}")
    -    if not Path(file_path).suffix == extension:
    +    path = Path(file_path)
    +    ext = extension.lower()
    +    # Normalize extension to start with dot
    +    if not ext.startswith("."):
    +        ext = "." + ext
    +    # Handle multi-suffix like ".mzTab.gz" and case-insensitive compare
    +    path_ext = "".join(path.suffixes).lower() or path.suffix.lower()
    +    if path_ext != ext:
             raise ValueError(
                 f"File {file_path} does not have the correct extension {extension}"
             )
         return True

    390-405: Fix Parquet metadata and ensure output directory exists.

    • PyArrow expects metadata keys/values as bytes.
    • Create parent directory before ParquetWriter.
    -                    schema_metadata = self.schema.metadata or {}
    -
    -                    # Convert file_metadata to string format for parquet metadata
    -                    if self.file_metadata:
    -                        for key, value in self.file_metadata.items():
    -                            schema_metadata[f"file_metadata.{key}"] = str(value)
    -
    -                    # Update schema with metadata
    -                    schema_with_metadata = self.schema.with_metadata(schema_metadata)
    +                    # Start from existing metadata (bytes->bytes)
    +                    schema_metadata = dict(self.schema.metadata or {})
    +                    # Merge file-level metadata as bytes
    +                    if self.file_metadata:
    +                        for key, value in self.file_metadata.items():
    +                            k = f"file_metadata.{key}".encode("utf-8")
    +                            v = str(value).encode("utf-8")
    +                            schema_metadata[k] = v
    +                    schema_with_metadata = self.schema.with_metadata(schema_metadata)
    @@
    -                    self.parquet_writer = pq.ParquetWriter(
    +                    # Ensure parent directory exists
    +                    Path(self.output_path).parent.mkdir(parents=True, exist_ok=True)
    +                    self.parquet_writer = pq.ParquetWriter(
    -                        where=self.output_path,
    +                        where=str(self.output_path),
                             schema=schema_with_metadata,
                             compression=self.compression,
                         )
    quantmsio/operate/tools.py (2)

    335-346: Ensure output dir exists and pass str path to ParquetWriter.

    Avoids failures when parent dir is missing and improves PyArrow compatibility.

         pqwriter = None
         for feature in generate_ibaq_feature(sdrf_path, parquet_path):
             if not pqwriter:
    -            pqwriter = pq.ParquetWriter(output_path, feature.schema)
    +            Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    +            pqwriter = pq.ParquetWriter(str(output_path), feature.schema)

    291-299: Fix typo in public API: rename genereate_ibaq_feature → generate_ibaq_feature and update call site.

    Prevents leaking a misspelling and improves discoverability.

    -def genereate_ibaq_feature(
    +def generate_ibaq_feature(
         sdrf_path: Union[Path, str], parquet_path: Union[Path, str]
     ) -> Generator[pa.Table, None, None]:
    @@
    -    for feature in genereate_ibaq_feature(sdrf_path, parquet_path):
    +    for feature in generate_ibaq_feature(sdrf_path, parquet_path):

    Also applies to: 342-348

    quantmsio/core/quantms/pg.py (1)

    650-686: Fix DuckDB filter: ANY([...]) is unreliable; use a quoted IN list.

    This pattern has caused failures before. Build a safe IN (...) list. (Same concern was raised earlier; re-flagging.)

    -        FROM processed_msstats_with_pg 
    -        WHERE reference_file_name = ANY({file_batch})
    +        FROM processed_msstats_with_pg 
    +        WHERE reference_file_name IN ({", ".join(f"'{f.replace(\"'\",\"''\")}'" for f in file_batch)})
    quantmsio/core/diann/diann.py (2)

    152-185: Naming consistency for peptide_counts vs feature_counts.

    Prior review suggested using “unique_peptides/total_peptides” in peptide_counts and “unique_features/total_features” in feature_counts to avoid mixing “sequences” terminology.
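
    For clarity, the field naming that suggestion amounts to (the values below are made up):

    peptide_counts = {
        "unique_peptides": 12,  # peptides unique to this protein group in this file
        "total_peptides": 34,   # all peptides observed for the group in this file
    }
    feature_counts = {
        "unique_features": 15,  # unique peptide + charge combinations
        "total_features": 40,   # all features observed for the group in this file
    }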


    46-56: super().__init__() will raise because the class no longer inherits from a DuckDB base class. Initialize DiannDuckDB instead.

    This is a runtime blocker; _duckdb is never set and all DB calls will fail.

    -        super(DiaNNConvert, self).__init__(
    -            diann_report, duckdb_max_memory, duckdb_threads
    -        )
    +        # Initialize DuckDB connection
    +        self._duckdb = DiannDuckDB(
    +            diann_report,
    +            max_memory=duckdb_max_memory,
    +            worker_threads=duckdb_threads,
    +            pg_matrix_path=pg_matrix_path,
    +        )
    quantmsio/commands/convert/quantms.py (3)

    191-191: Replace insecure mktemp() with NamedTemporaryFile.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False)
    +            temp_db_path = temp_file.name
    +            temp_file.close()

    379-379: Replace insecure mktemp() with NamedTemporaryFile.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False)
    +            temp_db_path = temp_file.name
    +            temp_file.close()

    104-104: Replace insecure mktemp() with NamedTemporaryFile.

    Using tempfile.mktemp() is deprecated and insecure as it creates a predictable filename that attackers could potentially exploit.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False)
    +            temp_db_path = temp_file.name
    +            temp_file.close()
    quantmsio/core/quantms/psm.py (2)

    526-527: Fix typo in CV parameter name.

    The CV parameter name contains a typo: "consesus_support" should be "consensus_support".

                 struct = {
    -                "cv_name": "consesus_support",
    +                "cv_name": "consensus_support",
                     "cv_value": str(consensus_support),
                 }

    664-665: Fix variable shadowing in generator.

    The loop variable df shadows the iterable it iterates over.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    quantmsio/core/quantms/feature.py (2)

    472-472: Fix closure issue in lambda function.

    The lambda function doesn't properly bind the loop variable i, which could lead to incorrect behavior if the lambda is evaluated after the loop completes.

             for i, feature in enumerate(map_features):
                 msstats.loc[:, feature] = msstats[
                     ["reference_file_name", "peptidoform", "precursor_charge"]
                 ].apply(
    -                lambda rows: merge_psm(rows, i),
    +                lambda rows, idx=i: merge_psm(rows, idx),
                     axis=1,
                 )

    541-542: Fix variable shadowing in generator.

    The loop variable df shadows the iterable it iterates over.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    🧹 Nitpick comments (36)
    tests/core/quantms/psms/test_duckdb_creation.py (7)

    9-9: Use a robust, normalized path for TEST_DATA_ROOT

    Avoid unresolved ".." segments; use resolve()+parents for clarity and portability.

    -TEST_DATA_ROOT = Path(__file__).parent / "../.." / ".." / ".." / "tests" / "examples"
    +TEST_DATA_ROOT = Path(__file__).resolve().parents[4] / "tests" / "examples"

    96-98: Simplify resource management; avoid double-closing and use context managers

    Close the indexer once and set it to None to prevent a second close in finally. Use a context manager for DuckDB instead of manual connect/close.

    -            cleanup_mztab_indexer(mztab)
    +            cleanup_mztab_indexer(mztab)
    +            mztab = None
                 assert output_path.exists(), f"DuckDB file not created: {output_path}"
    -            con = duckdb.connect(str(output_path), read_only=True)
    -            tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()]
    +            with duckdb.connect(str(output_path), read_only=True) as con:
    +                tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()]
    -            assert (
    -                "metadata" in tables and "proteins" in tables and "psms" in tables
    -            )
    -            metadata_count = con.execute(
    -                "SELECT COUNT(*) FROM metadata"
    -            ).fetchone()[0]
    -            proteins_count = con.execute(
    -                "SELECT COUNT(*) FROM proteins"
    -            ).fetchone()[0]
    -            psms_count = con.execute("SELECT COUNT(*) FROM psms").fetchone()[0]
    -            if msstats_path:
    -                msstats_count = con.execute(
    -                    "SELECT COUNT(*) FROM msstats"
    -                ).fetchone()[0]
    +                assert (
    +                    "metadata" in tables and "proteins" in tables and "psms" in tables
    +                )
    +                metadata_count = con.execute(
    +                    "SELECT COUNT(*) FROM metadata"
    +                ).fetchone()[0]
    +                proteins_count = con.execute(
    +                    "SELECT COUNT(*) FROM proteins"
    +                ).fetchone()[0]
    +                psms_count = con.execute("SELECT COUNT(*) FROM psms").fetchone()[0]
    +                if msstats_path:
    +                    msstats_count = con.execute(
    +                        "SELECT COUNT(*) FROM msstats"
    +                    ).fetchone()[0]
    -            cleanup_mztab_indexer(mztab)
    -            cleanup_duckdb_connection(con)
    +            cleanup_mztab_indexer(mztab)

    Also applies to: 99-114, 153-156


    4-7: Replace bare try/except-pass with contextlib.suppress

    Eliminates Ruff BLE001/S110 and keeps intent clear.

    +from contextlib import suppress
     def cleanup_mztab_indexer(mztab_indexer):
         """Clean up MzTabIndexer instance and its connections."""
    -    if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb:
    -        try:
    -            mztab_indexer._duckdb.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if mztab_indexer and getattr(mztab_indexer, "_duckdb", None):
    +        with suppress(Exception):
    +            mztab_indexer._duckdb.close()
     
     def cleanup_duckdb_connection(con):
         """Clean up DuckDB connection."""
    -    if con:
    -        try:
    -            con.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if con:
    +        with suppress(Exception):
    +            con.close()

    Also applies to: 1251-1267


    181-181: Drop redundant f-prefixes (no placeholders) to satisfy Ruff F541

    Also helps the Black job by keeping strings tidy.

    -print(f"\n=== Testing Two-Step Creation Process ===")
    +print("\n=== Testing Two-Step Creation Process ===")
    @@
    -print(f"Initial database created with:")
    +print("Initial database created with:")
    @@
    -print(f"\nFinal database contains:")
    +print("\nFinal database contains:")
    @@
    -print(f"\n=== Testing Two-Step Creation Process (TMT) ===")
    +print("\n=== Testing Two-Step Creation Process (TMT) ===")
    @@
    -print(f"Initial database created with:")
    +print("Initial database created with:")
    @@
    -print(f"\nFinal database contains:")
    +print("\nFinal database contains:")
    @@
    -print(f"\n=== Testing Create vs Open Database with Class Methods ===")
    +print("\n=== Testing Create vs Open Database with Class Methods ===")
    @@
    -print(f"New database created with:")
    +print("New database created with:")
    @@
    -print(f"Existing database opened with:")
    +print("Existing database opened with:")
    @@
    -print(f"\nSummary:")
    +print("\nSummary:")
    @@
    -print(f"\n=== Testing Create vs Open Database with Class Methods (TMT) ===")
    +print("\n=== Testing Create vs Open Database with Class Methods (TMT) ===")
    @@
    -print(f"New database created with:")
    +print("New database created with:")
    @@
    -print(f"Existing database opened with:")
    +print("Existing database opened with:")
    @@
    -print(f"\nSummary:")
    +print("\nSummary:")

    Also applies to: 210-210, 269-269, 305-305, 335-335, 394-394, 712-712, 745-745, 797-797, 824-824, 854-854, 887-887, 939-939, 966-966


    62-62: Rename test to reflect actual behavior (DuckDB only)

    Removes "parquet" from the name to avoid confusion while the parquet code paths are commented out.

    -def test_create_duckdb_and_parquet(test_name, mztab_path, msstats_path):
    +def test_create_duckdb_database(test_name, mztab_path, msstats_path):

    193-198: Avoid relying on private attribute _duckdb in tests

    Accessing internals couples tests to implementation details. Prefer opening a temporary read-only DuckDB connection for assertions (as in the first test) or add a small public helper on MzTabIndexer (e.g., tables(), count(table)) and use it here.

    Also applies to: 231-236, 317-323, 356-361, 725-731, 767-774, 867-872, 908-915
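
    A hedged sketch of the kind of small public helpers this comment suggests; the method names follow the reviewer's example (tables(), count(table)), but the class body is hypothetical and the real MzTabIndexer internals may differ.

    from typing import List
    import duckdb

    class MzTabIndexer:
        def __init__(self, database_path: str):
            self._duckdb = duckdb.connect(database_path, read_only=True)

        def tables(self) -> List[str]:
            """Public accessor so tests never need to touch self._duckdb directly."""
            return [row[0] for row in self._duckdb.execute("SHOW TABLES").fetchall()]

        def count(self, table: str) -> int:
            """Row count for a table name taken from tables(), not from user input."""
            return self._duckdb.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    Tests could then assert on indexer.count("psms") instead of reaching into the private connection.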


    1-7: Optionally silence Ruff S101 for tests

    Pytest-style asserts are fine in tests. Either configure per-file ignores in pyproject.toml or add a file header.

    • pyproject.toml:

      [tool.ruff]
      per-file-ignores = { "tests/**" = ["S101"] }

    • Or add at the top of this file:

      # ruff: noqa: S101

    quantmsio/core/sdrf.py (3)

    104-112: Remove commented-out SUPOORTED_LABELING block.

    Dead/commented code adds noise; delete or move to docs if needed.

    -    # SUPOORTED_LABELING = [
    -    #     "LABEL FREE",
    -    #     "TMT10",
    -    #     "TMT11",
    -    #     "TMT16",
    -    #     "TMT6",
    -    #     "ITRAQ4",
    -    #     "ITRAQ8",
    -    # ]

    452-452: Tiny style: avoid list concatenation in drop.

    Use iterable unpacking for readability (ruff RUF005).

    -        sdrf.drop(
    -            ["comment[technical replicate]"] + factor,
    +        sdrf.drop(
    +            ["comment[technical replicate]", *factor],
                 axis=1,
                 inplace=True,
             )

    457-457: Remove commented-out return.

    Avoid stale commented code.

    -        # return sdrf[["reference_file_name", "channel", "condition"]].copy()
    quantmsio/utils/file_utils.py (2)

    341-351: Type hint nit: Optional for file_metadata.

    Use Optional[Dict[str, Any]] (ruff RUF013).

    -from typing import Dict, Iterator, List, Tuple, Union
    +from typing import Any, Dict, Iterator, List, Tuple, Union, Optional
    @@
    -    def __init__(
    +    def __init__(
             self,
             output_path: str,
             schema: pa.Schema,
             batch_size: int = 1000000,
             compression: str = "gzip",
    -        file_metadata: dict = None,
    +        file_metadata: Optional[Dict[str, Any]] = None,
         ):

    414-416: Use logging.exception to preserve tracebacks.

    Provides full stack traces for easier debugging (ruff TRY400).

    -            self.logger.error(
    +            self.logger.exception(
                     f"Error during batch writing: {e}, file path: {self.output_path}"
                 )
    @@
    -            self.logger.error(f"Error closing writer: {e}")
    +            self.logger.exception(f"Error closing writer: {e}")

    Also applies to: 431-431

    quantmsio/operate/tools.py (2)

    159-167: Remove “Before Code” banner; keep helper or move to utils.

    The heading looks like a leftover. If MaxQuant relies on this helper (see self._automaton), keep it here or import it from a shared utils module.

    -# Before Code:
     def get_ahocorasick(mods_dict: dict) -> ahocorasick.Automaton:

    169-212: Guard against select_mods=None in get_modification_details.

    Avoid TypeError when select_mods is None; also minor cleanup.

    -def get_modification_details(
    +def get_modification_details(
         seq: str,
         mods_dict: dict,
         automaton: ahocorasick.Automaton,
    -    select_mods: list | None = None,
    +    select_mods: list | None = None,
     ) -> tuple[str, list]:
    @@
    -        elif modification in select_mods:
    +        elif (select_mods is None) or (modification in select_mods):
                 modifications.append(modification)
                 modification_details.append(
    quantmsio/core/idxml_utils/idxml.py (4)

    86-96: Avoid bare except; catch Exception explicitly.

    Also keep calc_mz fallback.

    -                except:
    +                except Exception:
                         calc_mz = 0.0

    304-315: Use logger.exception and trim excessive debug noise.

    Preserve traceback; consider downgrading DEBUG banners to debug level.

    -        except Exception as e:
    -            self.logger.error(f"=== DEBUG: Error transforming PSM record: {e} ===")
    +        except Exception as e:
    +            self.logger.exception("Error transforming PSM record")
                 self.logger.error(f"=== DEBUG: Record keys: {list(record.keys())} ===")
                 self.logger.error(
                     f"=== DEBUG: Record sample: {dict(list(record.items())[:5])} ==="
                 )
                 import traceback
    -
    -            self.logger.error(
    -                f"=== DEBUG: Transformation traceback: {traceback.format_exc()} ==="
    -            )
    +            self.logger.debug(f"Transformation traceback: {traceback.format_exc()}")
                 return None

    397-401: Remove redundant f-string and keep message at info level.

    Also, minor copy edit.

    -        logger.info(f"=== DEBUG: convert_to_parquet started ===")
    +        logger.info("convert_to_parquet started")

    445-449: Use logging.exception to include stack trace.

    Easier troubleshooting for failures.

    -        except Exception as e:
    -            logger.error(f"=== DEBUG: Error during parquet conversion: {e} ===")
    -            logger.error(f"=== DEBUG: Exception type: {type(e)} ===")
    +        except Exception as e:
    +            logger.exception("Error during parquet conversion")
    +            logger.error(f"Exception type: {type(e)}")
                 import traceback
    -
    -            logger.error(f"=== DEBUG: Traceback: {traceback.format_exc()} ===")
    +            logger.debug(f"Traceback: {traceback.format_exc()}")
                 raise
    quantmsio/core/maxquant/maxquant.py (2)

    191-233: get_modification_details dependency must be present.

    With the import restored and automaton initialized, this block will work as intended. Consider adding a quick guard if self._automaton is missing.

    -            peptidoform, other_modification_details = get_modification_details(
    -                seq, self.mods_map, self._automaton, other_mods
    -            )
    +            peptidoform, other_modification_details = get_modification_details(
    +                seq, self.mods_map, getattr(self, "_automaton", get_ahocorasick(self.mods_map)), other_mods
    +            )

    761-799: Prefer isinstance/type-safe checks in required fields loop (ruff E721).

    Use identity or isinstance to avoid equality on types; also simplifies conversions.

    -        for field, dtype in required_fields.items():
    +        for field, dtype in required_fields.items():
                 if field not in df.columns:
    -                if dtype == list:
    +                if dtype is list:
                         df[field] = [[] for _ in range(len(df))]
    -                elif dtype == str:
    +                elif dtype is str:
                         df[field] = None
    -                elif dtype == float:
    +                elif dtype is float:
                         df[field] = 0.0
    -                elif dtype == int:
    +                elif dtype is int:
                         df[field] = 0
                 else:
                     # Handle type conversions
    -                if dtype == float:
    +                if dtype is float:
                         df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0)
    -                elif dtype == int:
    +                elif dtype is int:
                         df[field] = (
                             pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int)
                         )
    -                elif dtype == str:
    +                elif dtype is str:
                         df[field] = df[field].astype(str).fillna("")
    quantmsio/utils/constants.py (3)

    91-101: Consolidate MSstats “Reference” constants to avoid ambiguity.

    You define both MSSTATS_REFERENCE and MSSTATS_QUANTIFICATION_ID with the same value ("Reference"). Keeping two names for the same column is confusing and invites drift. Use one, and if you want an alias, assign it explicitly with a comment.

    - MSSTATS_REFERENCE = "Reference"
    + MSSTATS_REFERENCE = "Reference"  # primary name used across the codebase
    ...
    - MSSTATS_QUANTIFICATION_ID = "Reference"
    + # Alias kept for backward compatibility; prefer MSSTATS_REFERENCE
    + MSSTATS_QUANTIFICATION_ID = MSSTATS_REFERENCE

    109-111: Prefer an immutable container for shared prefixes.

    DECOY_PREFIXES is a global shared across modules. Make it a tuple to prevent accidental mutation.

    -DECOY_PREFIXES = ["DECOY", "REV", "RANDOM"]
    +DECOY_PREFIXES = ("DECOY", "REV", "RANDOM")

    114-125: Add a brief note on mzTab column set origin.

    The MZTAB_PROTEIN_COLUMNS set mixes fixed columns and instrument-dependent optional columns. A short comment that this is the minimal/expected set used by downstream mapping (see core/common.py MZTAB_PG_MAP) will help maintainers align changes.
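
    For example, a short header comment along these lines (wording illustrative, not a proposed final text):

        # Minimal/expected mzTab protein-section columns consumed downstream;
        # keep in sync with MZTAB_PG_MAP in quantmsio/core/common.py.
        MZTAB_PROTEIN_COLUMNS = [
            ...
        ]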

    quantmsio/core/common.py (2)

    40-42: Clarify comment: this is not a regex.

    OPENMS_PROTEIN_QVALUE_WORDS is a list of required tokens, not a regular expression. Update the comment to avoid confusion.

    -# Regular expression that make sure the protein search engine score contains the following words: openms target-decoy protein and q-value
    +# Required tokens for matching OpenMS protein q-value score column names

    209-209: Ruff RUF005: prefer list unpacking for readability.

    Minor style: apply Ruff’s suggestion.

    -MZTAB_PG_USECOLS = list(MZTAB_PG_MAP.keys()) + ["description"]
    +MZTAB_PG_USECOLS = [*list(MZTAB_PG_MAP.keys()), "description"]
    quantmsio/core/quantms/pg.py (4)

    20-33: Constructor docstring and unused variable.

    • Docstring still references mztab_path; update to indexer.
    • protein_details_df is unused; drop to avoid unnecessary data load.
    -    def __init__(self, mztab_indexer):
    -        """Initialize MzTabProteinGroups with an MzTabIndexer instance.
    -
    -        Args:
    -            mztab_path: Path to mzTab file
    -        """
    +    def __init__(self, mztab_indexer):
    +        """Initialize MzTabProteinGroups with an MzTabIndexer instance."""
             self._indexer: MzTabIndexer = mztab_indexer
    -
    -        # Initialize tracking lists first
             self._temp_files = []  # Track any temporary files created
             self._file_handles = []  # Track any open file handles
    -        protein_details_df = self._indexer.get_protein_details()
             self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

    401-404: Use logging.exception to preserve traceback for SQL batch failures.

    You’re catching broad exceptions; at minimum, include the traceback.

    -            except Exception as e:
    -                logger.warning(f"[ERROR] SQL batch failed: {e}, skipping batch")
    +            except Exception:
    +                logger.exception("[ERROR] SQL batch failed; skipping batch")

    448-452: Avoid destroying a shared indexer; delegate lifecycle to the caller.

    destroy_database() here will close the indexer even if the caller wants to reuse it. Since the CLI wraps you in a context manager already, drop this call and let the owner manage lifetime.

    -        # Clean up the temporary MzTabIndexer
    -        self._indexer.destroy_database()
    -
    -        # Cleanup any temporary files created during processing
    +        # Cleanup any temporary files created during processing
             self.cleanup()

    691-699: Parameter ‘topn’ is ignored.

    Either implement TopN properly (needs peptide-level detail) or drop the parameter and flag approximation in docs.
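
    For reference, a minimal sketch of the TopN idea on peptide-level data, assuming illustrative column names ("pg_accessions", "intensity") rather than the module's actual schema:

        import pandas as pd

        def topn_intensity(peptides: pd.DataFrame, topn: int = 3) -> pd.Series:
            # Mean of the top-N most intense peptides per protein group.
            return (
                peptides.groupby("pg_accessions")["intensity"]
                .apply(lambda s: s.nlargest(topn).mean())
            )

        demo = pd.DataFrame(
            {
                "pg_accessions": ["P1", "P1", "P1", "P1", "P2"],
                "intensity": [10.0, 8.0, 6.0, 1.0, 5.0],
            }
        )
        print(topn_intensity(demo))  # P1 -> 8.0, P2 -> 5.0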

    quantmsio/core/diann/diann.py (2)

    440-443: Use numeric/boolean for is_decoy.

    String "0" risks type inconsistency downstream.

    -        report.loc[:, "is_decoy"] = "0"
    +        report.loc[:, "is_decoy"] = 0

    80-92: Prefer DiannDuckDB helper for query construction.

    To avoid ad-hoc IN (...) formatting and improve reuse, delegate to DiannDuckDB.get_report_from_database.

    -        report = self._duckdb.query_to_df(
    -            """
    -            select {}
    -            from report
    -            where Run IN {}
    -            """.format(
    -                sql, tuple(runs)
    -            )
    -        )
    +        report = self._duckdb.get_report_from_database(runs, columns=sql)
    quantmsio/quantmsioc.py (1)

    60-73: Logger init once at CLI entry is good. Consider honoring --verbose globally.

    If you plan a global --verbose, add a click option on cli() and set level accordingly, instead of per-subcommand.
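
    A minimal sketch of what a group-level flag could look like, assuming a click group named cli and the standard logging module (wiring is illustrative, not the project's actual CLI code):

        import logging

        import click

        @click.group()
        @click.option("--verbose", is_flag=True, help="Enable debug logging.")
        def cli(verbose: bool) -> None:
            # Configure logging once at CLI entry; subcommands inherit the level.
            logging.basicConfig(
                level=logging.DEBUG if verbose else logging.INFO,
                format="%(asctime)s %(name)s %(levelname)s %(message)s",
            )

        @cli.command()
        def convert() -> None:
            logging.getLogger(__name__).debug("only shown with --verbose")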

    quantmsio/commands/convert/quantms.py (1)

    512-512: Remove extraneous f-string prefix.

    The f-string prefix is unnecessary here as there are no placeholders.

    -            logger.info(f"Successfully processed:")
    +            logger.info("Successfully processed:")
    quantmsio/core/quantms/psm.py (1)

    658-663: Improve exception handling.

    Consider using more specific exception types for better error handling and clarity.

             if not isinstance(partitions, list):
    -            raise Exception(f"{partitions} is not a list")
    +            raise TypeError(f"Expected list, got {type(partitions).__name__}")
             if len(partitions) == 0:
    -            raise Exception(f"{partitions} is empty")
    +            raise ValueError("Partitions list cannot be empty")
             for partion in partitions:
                 if partion not in cols:
    -                raise Exception(f"{partion} does not exist")
    +                raise ValueError(f"Column '{partion}' does not exist in dataframe")
    quantmsio/core/quantms/feature.py (2)

    535-540: Improve exception handling.

    Consider using more specific exception types for better error handling and clarity.

             if not isinstance(partitions, list):
    -            raise Exception(f"{partitions} is not a list")
    +            raise TypeError(f"Expected list, got {type(partitions).__name__}")
             if len(partitions) == 0:
    -            raise Exception(f"{partitions} is empty")
    +            raise ValueError("Partitions list cannot be empty")
             for partion in partitions:
                 if partion not in cols:
    -                raise Exception(f"{partion} does not exist")
    +                raise ValueError(f"Column '{partion}' does not exist in dataframe")

    201-201: Address TODO comments.

    The code contains unresolved TODO comments that need attention:

    1. Line 201: Need to determine the data source for score_name and score_value
    2. Line 248: Need to confirm whether the proteins table contains "opt_global_q-value"

    These TODOs indicate missing implementation details. Would you like me to help investigate the correct data sources or open issues to track these tasks?

    Also applies to: 248-249


    except Exception as e:
        logger.exception(f"Error in feature conversion: {str(e)}")
        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")

    🛠️ Refactor suggestion

    Use proper exception chaining to preserve traceback.

    When re-raising exceptions within an except clause, use from e to preserve the original traceback context for better debugging.

    -        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")
    +        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.") from e

    Apply this change to all five occurrences in the file at lines 128, 208, 264, 410, and 528.

    Also applies to: 208-208, 264-264, 410-410, 528-528
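
    For context, a small generic illustration of what "from e" preserves (plain Python, unrelated to the project code):

        try:
            try:
                int("not a number")
            except ValueError as e:
                raise RuntimeError("conversion failed") from e
        except RuntimeError as err:
            # The original ValueError is kept on __cause__, so the full chain
            # shows up in the traceback instead of being lost.
            print(type(err.__cause__).__name__)  # ValueError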

    🧰 Tools
    🪛 Ruff (0.12.2)

    128-128: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

    (B904)


    128-128: Avoid specifying long messages outside the exception class

    (TRY003)


    128-128: Use explicit conversion flag

    Replace with conversion flag

    (RUF010)

    🤖 Prompt for AI Agents
    In quantmsio/commands/convert/quantms.py around lines 128, 208, 264, 410, and
    528 the code re-raises ClickException inside except blocks without exception
    chaining; update each raise to use "from e" (e.g., raise
    click.ClickException(...) from e) so the original traceback is preserved,
    applying the same change at all five specified line locations.
    

    Comment on lines +95 to 96
    "Run": "reference_file_name",
    }

    💡 Verification agent

    🧩 Analysis chain

    Be explicit about DIA-NN “Run” → “reference_file_name” rename across the codebase.

    You remap “Run” to “reference_file_name”. Please confirm all call sites downstream (e.g., joins, filters, and CLI code) use the canonical “reference_file_name”. Mixed usage is a common source of bugs when reading older report exports.

    Run to check remaining usage:


    🏁 Script executed:

    #!/bin/bash
    # Find usages of raw "Run" column outside mapping dicts and tests
    rg -nP -C2 '(?<!DIANN_MAP)\bRun\b' --type=py -g '!**/test/**' -g '!**/tests/**'

    Length of output: 4408


    Be explicit about renaming “Run” → “reference_file_name” in all DIA-NN call sites

    • In quantmsio/core/diann/diann.py update:
      • lines 85–87 (WHERE Run IN …) → use "reference_file_name" or DIANN_MAP["Run"]
      • lines 110–114 (SELECT "Run") → select "reference_file_name"
      • line 122 (peptide_df.to_dict()["Run"]) → index by "reference_file_name"
      • lines 520–522 (get_unique_references("Run")) → call get_unique_references("reference_file_name")
    • In quantmsio/utils/constants.py reconsider keeping RUN = "Run" and MSSTATS_RUN_NAME = "Run" alongside MSSTATS_REFERENCE_NAME = "reference_file_name" to avoid mixed-key usage.
    🤖 Prompt for AI Agents
    In quantmsio/core/common.py around lines 95–96, ensure the mapping explicitly
    renames "Run" to "reference_file_name" and propagate that change: update
    quantmsio/core/diann/diann.py so that lines 85–87 use "reference_file_name" (or
    DIANN_MAP["Run"]) instead of WHERE Run IN …, lines 110–114 SELECT
    "reference_file_name" instead of SELECT "Run", line 122 indexes
    peptide_df.to_dict() by "reference_file_name" instead of "Run", and lines
    520–522 call get_unique_references("reference_file_name") instead of
    get_unique_references("Run"); also review quantmsio/utils/constants.py and
    either keep RUN = "Run" and MSSTATS_RUN_NAME = "Run" alongside
    MSSTATS_REFERENCE_NAME = "reference_file_name" or consolidate keys to avoid
    mixed-key usage so all call sites consistently use the new reference_file_name
    key.
    

    Comment on lines +190 to +203
    report.loc[:, "intensities"] = report[
        ["reference_file_name", "pg_quantity"]
    ].apply(
        lambda rows: [
            {
                "sample_accession": self._sample_map[
                    rows["reference_file_name"] + "-LFQ"
                ],
                "channel": "LFQ",
                "intensity": rows["pg_quantity"],
            }
        ],
        axis=1,
    )

    🛠️ Refactor suggestion

    LFQ sample mapping likely wrong: SDRF LFQ keys don’t include “-LFQ”. Add a fallback.

    SDRFHandler.get_sample_map_run() maps LFQ as “file” (no “-LFQ” suffix). Your lookup uses “file-LFQ”.

    -            lambda rows: [
    -                {
    -                    "sample_accession": self._sample_map[
    -                        rows["reference_file_name"] + "-LFQ"
    -                    ],
    +            lambda rows: [
    +                {
    +                    "sample_accession": self._sample_map.get(
    +                        rows["reference_file_name"] + "-LFQ",
    +                        self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]),
    +                    ),
                         "channel": "LFQ",
                         "intensity": rows["pg_quantity"],
                     }
                 ],
    🤖 Prompt for AI Agents
    In quantmsio/core/diann/diann.py around lines 190 to 203, the code looks up
    sample IDs using keys like reference_file_name + "-LFQ" but
    SDRFHandler.get_sample_map_run() stores LFQ keys without the "-LFQ" suffix;
    update the lookup to first try self._sample_map.get(reference_file_name +
    "-LFQ") and if that returns None fall back to
    self._sample_map.get(reference_file_name) (or vice versa if preferred), and
    handle missing mappings by raising a clear error or skipping the row; ensure the
    lambda uses .get to avoid KeyError and produces the same dict structure.
    

    Comment on lines +26 to +43
    def __init__(self, idxml_path: str):
        """Initialize IdXML PSM processor.

        Args:
            idxml_path: Path to idXML file
        """
        self.idxml_path = Path(idxml_path)
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
        prot_ids = []
        pep_ids = []
        oms.IdXMLFile().load(idxml_path, prot_ids, pep_ids)
        self.prot_ids = prot_ids
        self.pep_ids = pep_ids
        self.openms_handler = OpenMSHandler()
        self.logger.info(
            f"Loaded {len(self.prot_ids)} proteins and {len(self.pep_ids)} peptides from {idxml_path}"
        )


    🛠️ Refactor suggestion

    Seed missing attributes and log summary.

    Initialize search_metadata to avoid AttributeError later.

             self.pep_ids = pep_ids
             self.openms_handler = OpenMSHandler()
    +        # Optional metadata injected by callers; default to empty dict
    +        self.search_metadata: dict = {}
    🤖 Prompt for AI Agents
    In quantmsio/core/idxml_utils/idxml.py around lines 26 to 43, the IdXML class
    does not initialize the attribute search_metadata which causes an AttributeError
    later; seed this attribute in __init__ (e.g. self.search_metadata = {} or
    appropriate default) before any method may reference it, and keep the existing
    load and logging logic unchanged so the instance always has a defined
    search_metadata attribute after construction.
    

    Comment on lines +104 to +107
    # Extract protein accessions
    protein_accessions = [
        acc.decode() for acc in hit.extractProteinAccessionsSet()
    ]

    ⚠️ Potential issue

    Protein accessions decoding may crash.

    extractProteinAccessionsSet often yields strings; decoding blindly can fail. Decode bytes only.

    -                protein_accessions = [
    -                    acc.decode() for acc in hit.extractProteinAccessionsSet()
    -                ]
    +                protein_accessions = [
    +                    acc.decode() if isinstance(acc, (bytes, bytearray)) else str(acc)
    +                    for acc in hit.extractProteinAccessionsSet()
    +                ]
    🤖 Prompt for AI Agents
    In quantmsio/core/idxml_utils/idxml.py around lines 104 to 107, the code
    unconditionally calls .decode() on each accession from
    hit.extractProteinAccessionsSet(), which can raise if items are already strings;
    change the comprehension to decode only when an item is a bytes instance (e.g.,
    acc.decode()) and otherwise keep the item as-is (or str(acc) if you want to
    coerce non-str types), ensuring None or unexpected types are handled safely to
    avoid crashes.
    

    Comment on lines 26 to 31
    from quantmsio.operate.tools import (
        get_ahocorasick,
        get_modification_details,
        # get_ahocorasick,
        # get_modification_details,
        get_protein_accession,
    )
    from pyopenms.Constants import PROTON_MASS_U
    from quantmsio.utils.constants import ITRAQ_CHANNEL, TMT_CHANNELS

    ⚠️ Potential issue

    Re-import helpers required by this module.

    get_modification_details is used below; import it (and get_ahocorasick if you keep the automaton).

    -from quantmsio.operate.tools import (
    -    # get_ahocorasick,
    -    # get_modification_details,
    -    get_protein_accession,
    -)
    +from quantmsio.operate.tools import (
    +    get_ahocorasick,
    +    get_modification_details,
    +    get_protein_accession,
    +)
    🤖 Prompt for AI Agents
    In quantmsio/core/maxquant/maxquant.py around lines 26 to 31, re-import the
    helper(s) currently commented out: add get_modification_details to the import
    list from quantmsio.operate.tools (and also add get_ahocorasick if the
    Aho–Corasick automaton is used elsewhere in this module); update the import
    statement to include these names so the functions used later are available, or
    remove/comment any references below if you decide not to import them.
    

    Comment on lines 63 to 69
    def extract_col_msg(self, col_df, label: str = "feature"):
        line = "\t".join(col_df.columns)
        self.mods_map = self.get_mods_map(line)
        self._automaton = get_ahocorasick(self.mods_map)
        # self._automaton = get_ahocorasick(self.mods_map)
        if label == "feature":
            intensity_normalize_names = []
            intensity_names = []

    ⚠️ Potential issue

    Restore automaton initialization to avoid AttributeError.

    self._automaton is referenced later; initialize it.

             self.mods_map = self.get_mods_map(line)
    -        # self._automaton = get_ahocorasick(self.mods_map)
    +        self._automaton = get_ahocorasick(self.mods_map)
    🤖 Prompt for AI Agents
    In quantmsio/core/maxquant/maxquant.py around lines 63 to 69, restore
    initialization of self._automaton after self.mods_map is set to prevent later
    AttributeError: call and assign self._automaton = get_ahocorasick(self.mods_map)
    (or the appropriate factory function) immediately after self.mods_map =
    self.get_mods_map(line); if get_ahocorasick may return None, ensure you create a
    valid automaton or raise/log a clear error so downstream references are safe.
    

    msstats.loc[:, feature] = msstats[
        ["reference_file_name", "peptidoform", "precursor_charge"]
    ].apply(
        lambda rows: merge_psm(rows, i),

    ⚠️ Potential issue

    Fix closure issue in lambda function.

    The lambda function at line 527 has the same closure issue with the loop variable i.

             for i, feature in enumerate(map_features):
                 if feature not in msstats.columns:
                     msstats.loc[:, feature] = msstats[
                         ["reference_file_name", "peptidoform", "precursor_charge"]
                     ].apply(
    -                    lambda rows: merge_psm(rows, i),
    +                    lambda rows, idx=i: merge_psm(rows, idx),
                         axis=1,
                     )
    🧰 Tools
    🪛 Ruff (0.12.2)

    527-527: Function definition does not bind loop variable i

    (B023)

    🤖 Prompt for AI Agents
    In quantmsio/core/quantms/feature.py around line 527 the lambda passed (lambda
    rows: merge_psm(rows, i)) captures the loop variable i by reference causing all
    lambdas to use the final i; fix by binding the current i into the lambda's
    default arguments (e.g. lambda rows, i=i: merge_psm(rows, i)) or replace the
    lambda with functools.partial(merge_psm, i=i) so each closure uses the intended
    i value.
    

    Comment on lines 224 to 230
    if len(factor_columns) != 1:
    raise ValueError("The number of factor columns should be 1")

    # Rename the factor value column with condition as name
    sdrf_pd = sdrf_pd.rename(columns={factor_columns[0]: "condition"})
    sdrf_pd["condition"] = (
    sdrf_pd[factor_columns].astype(str).agg("|".join, axis=1)
    )
    else:
    sdrf_pd["condition"] = sdrf_pd[factor_columns]


    ⚠️ Potential issue

    Bug: assigning a DataFrame to a Series column.

    When exactly one factor column exists, sdrf_pd[factor_columns] is a DataFrame; assign the single Series instead.

    -        else:
    -            sdrf_pd["condition"] = sdrf_pd[factor_columns]
    +        else:
    +            # factor_columns has exactly 1 element here
    +            sdrf_pd["condition"] = sdrf_pd[factor_columns[0]].astype(str)
    🤖 Prompt for AI Agents
    In quantmsio/core/sdrf.py around lines 224 to 230, when there is exactly one
    factor column the code assigns sdrf_pd[factor_columns] (a one-column DataFrame)
    to the "condition" column; change this to assign the single Series instead by
    indexing the single column name (e.g., use factor_columns[0]) and ensure you
    cast it to string to match the multi-column path before assigning to
    sdrf_pd["condition"].
    

    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 16

    ♻️ Duplicate comments (15)
    tests/core/quantms/test_full_conversion_suite.py (3)

    17-20: Fix workspace root to repository root

    Matches prior feedback; avoids path issues on CI.

     def get_workspace_root():
         """Get the workspace root directory"""
    -    return Path(__file__).parent.parent.parent
    +    return Path(__file__).resolve().parents[3]

    7-15: Use current interpreter instead of hardcoding 'python'

    Prevents environment mismatches; add sys import.

     import os
     import subprocess
     import tempfile
     import time
     from pathlib import Path
    +import sys

    119-131: Replace 'python' with sys.executable in all CLI invocations

    Ensures consistency across platforms and virtualenvs.

    -        cmd = [
    -            "python",
    +        cmd = [
    +            sys.executable,

    (Apply to all six command builders in these ranges.)

    Also applies to: 171-187, 227-243, 281-293, 333-349, 389-405

    quantmsio/core/quantms/feature.py (3)

    458-464: Bind loop variable in lambda to avoid late-binding bug (regression of prior comment).

    Each lambda closes over i; all rows get the last value. Bind i at definition time.

    Apply:

     for i, feature in enumerate(map_features):
    -    msstats.loc[:, feature] = msstats[
    +    index = i  # bind current index
    +    msstats.loc[:, feature] = msstats[
             ["reference_file_name", "peptidoform", "precursor_charge"]
         ].apply(
    -        lambda rows: merge_psm(rows, i),
    +        lambda rows, idx=index: merge_psm(rows, idx),
             axis=1,
         )
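
    A tiny generic illustration of the late-binding pitfall being fixed here (plain Python, not project code):

        callbacks = [lambda: i for i in range(3)]
        print([f() for f in callbacks])  # [2, 2, 2]: every lambda sees the final i

        callbacks = [lambda i=i: i for i in range(3)]
        print([f() for f in callbacks])  # [0, 1, 2]: default argument binds i per iteration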

    512-519: Bind loop variable in lambda to avoid late-binding bug (regression of prior comment).

    Same closure issue here; bind i.

    Apply:

     for i, feature in enumerate(map_features):
         if feature not in msstats.columns:
    -        msstats.loc[:, feature] = msstats[
    +        index = i
    +        msstats.loc[:, feature] = msstats[
                 ["reference_file_name", "peptidoform", "precursor_charge"]
             ].apply(
    -            lambda rows: merge_psm(rows, i),
    +            lambda rows, idx=index: merge_psm(rows, idx),
                 axis=1,
             )

    521-533: Loop variable shadows outer df; yields confusing bugs.

    Rename inner variable.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    tests/core/quantms/psms/test_psm.py (2)

    102-121: Create output parquet safely inside TemporaryDirectory.

    Avoid mktemp for files; use a path inside a temp dir.

    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    -    temp_output_path = tempfile.mktemp(suffix=".parquet")
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "psm.duckdb"
    +        temp_output_path = Path(tmpdir) / "psm.parquet"

    43-60: Replace insecure mktemp with TemporaryDirectory; remove manual unlink.

    Avoid CWE-377 and let the directory handle cleanup.

    -    # Create a temporary database path (don't create the file, let DuckDB do it)
    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "psm.duckdb"
     ...
    -    finally:
    -        # Clean up temporary database
    -        if os.path.exists(temp_db_path):
    -            os.unlink(temp_db_path)
    +    finally:
    +        pass

    Apply the same pattern to Lines 69-93, 102-129 (also for temp_output_path: create inside tmpdir), 141-159, 304-382, 394-536, and 548-585.

    quantmsio/core/quantms/pg.py (1)

    180-191: Dead method references undefined attribute _mztab_path.

    Calling _is_protein_line would raise AttributeError. Remove or guard it.

    -    def _is_protein_line(self, line_num: int) -> bool:
    -        """Check if a line is a protein data line."""
    -        try:
    -            # Use safe file opening with automatic cleanup
    -            with self._safe_file_open(self._mztab_path, "r") as file:
    -                for i, line in enumerate(file):
    -                    if i == line_num:
    -                        return line.startswith("PRT")
    -            return False
    -        except Exception:
    -            return False
    +    def _is_protein_line(self, line_num: int) -> bool:
    +        """Deprecated: no raw-file parsing in this class."""
    +        raise NotImplementedError("Not supported. Use indexer-backed access.")
    quantmsio/core/quantms/psm.py (2)

    378-384: Typo in CV param key: “consesus_support” → “consensus_support”

    Align the CV name with the intended spelling and downstream readers.

    -        cv_list = []
    -        if consensus_support and consensus_support != "null":
    -            struct = {
    -                "cv_name": "consesus_support",
    -                "cv_value": str(consensus_support),
    -            }
    -            cv_list.append(struct)
    +        cv_list = []
    +        if consensus_support and consensus_support != "null":
    +            struct = {
    +                "cv_name": "consensus_support",
    +                "cv_value": str(consensus_support),
    +            }
    +            cv_list.append(struct)

    Also applies to: 523-531


    654-666: Fix variable shadowing in generator

    Loop variable df shadows the outer df.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    quantmsio/core/duckdb.py (4)

    151-159: Fix table-existence check (variable shadowing bug)

    The method shadows tables and checks each table against itself, so the result is always True. Use a separate variable.

         def check_tables_in_db(self, tables: List[str]) -> bool:
    -        """Check if the database connection is valid."""
    -        tables = [t[0] for t in duckdb.execute("SHOW TABLES").fetchall()]
    -        for table in tables:
    -            if table not in tables:
    +        """Check if required tables exist in the database."""
    +        existing_tables = [t[0] for t in self._duckdb.execute("SHOW TABLES").fetchall()]
    +        for table in tables:
    +            if table not in existing_tables:
                     logging.info(f"Table {table} not in database")
                     return False
             return True

    507-512: Parameterize iter_file batch query

    Inline stringified list and quotes are incorrect and unsafe.

    -            query = f"""
    -            SELECT {cols}
    -            FROM report
    -            WHERE "{field}" = ANY('{refs}')
    -            """
    -            yield refs, self.query_to_df(query)
    +            placeholders = ",".join("?" for _ in refs)
    +            query = f'SELECT {cols} FROM report WHERE "{field}" IN ({placeholders})'
    +            yield refs, self._duckdb.execute(query, refs).df()

    477-483: Parameterize get_report_from_database refs filter

    Avoid interpolating lists; use IN (?,...,?).

    -        query = f"""
    -        SELECT {columns}
    -        FROM report
    -        WHERE "{RUN}" = ANY({refs})
    -        """
    -        return self.query_to_df(query)
    +        placeholders = ",".join("?" for _ in refs)
    +        query = f'SELECT {columns} FROM report WHERE "{RUN}" IN ({placeholders})'
    +        return self._duckdb.execute(query, refs).df()

    354-358: Parameterize filtering (avoid ANY(...) with inline list)

    Inline Python lists in SQL is unsafe and brittle. Use placeholders.

             query = "SELECT * FROM pg_matrix"
             if protein_groups_tuple:
    -            query += f' WHERE "{PROTEIN_GROUP}" = ANY({list(protein_groups_tuple)})'
    -
    -        return self.query_to_df(query)
    +            placeholders = ",".join("?" for _ in protein_groups_tuple)
    +            query += f' WHERE "{PROTEIN_GROUP}" IN ({placeholders})'
    +            return self._duckdb.execute(query, list(protein_groups_tuple)).df()
    +        return self._duckdb.execute(query).df()
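
    A minimal, standalone sketch of the placeholder pattern suggested above, using an in-memory DuckDB connection and a toy "report" table (table and column names are illustrative only):

        import duckdb

        con = duckdb.connect()  # in-memory database
        con.execute("CREATE TABLE report (run VARCHAR, intensity DOUBLE)")
        con.execute(
            "INSERT INTO report VALUES ('run_a', 1.0), ('run_b', 2.0), ('run_c', 3.0)"
        )

        refs = ["run_a", "run_c"]
        placeholders = ",".join("?" for _ in refs)
        query = f'SELECT * FROM report WHERE "run" IN ({placeholders})'
        # Values travel as bound parameters, never interpolated into the SQL text.
        print(con.execute(query, refs).df())
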
    🧹 Nitpick comments (37)
    tests/core/quantms/test_msstats_in.py (6)

    287-301: Handle both reference file column names when building the matrix

    Hard-coding reference_file_name can break with legacy inputs using Reference_Name.

    -    for _, row in msstats_batch.iterrows():
    -        ref_file = row["reference_file_name"]
    +    for _, row in msstats_batch.iterrows():
    +        ref_col = (
    +            "Reference_Name"
    +            if "Reference_Name" in msstats_batch.columns
    +            else "reference_file_name"
    +        )
    +        ref_file = row[ref_col]

    165-179: Prefer public API over private _duckdb and use normalized column names

    Use query_to_df and the normalized lower-case columns to avoid brittle tests.

    -        raw_channels = indexer._duckdb.sql(
    -            "SELECT DISTINCT Channel FROM msstats ORDER BY Channel"
    -        ).df()
    +        raw_channels = indexer.query_to_df(
    +            "SELECT DISTINCT channel FROM msstats ORDER BY channel"
    +        )
    @@
    -        sample_raw = indexer._duckdb.sql(
    -            """
    -            SELECT Channel, Reference, ProteinName, PeptideSequence, Intensity 
    -            FROM msstats 
    -            WHERE Reference LIKE '%a05058%' 
    -            AND ProteinName LIKE '%P55011%' 
    -            LIMIT 15
    -        """
    -        ).df()
    +        sample_raw = indexer.query_to_df(
    +            """
    +            SELECT channel, reference_file_name, pg_accessions, peptidoform, intensity
    +            FROM msstats
    +            WHERE reference_file_name LIKE '%a05058%'
    +              AND pg_accessions LIKE '%P55011%'
    +            LIMIT 15
    +            """
    +        )

    349-349: Replace ambiguous multiplication sign

    Avoid the Unicode “×” to prevent encoding/test log issues.

    -    print("\nFeatures by File × Channel Matrix:")
    +    print("\nFeatures by File x Channel Matrix:")

    117-117: Remove extraneous f-strings without placeholders

    Minor cleanliness; keeps linters quiet (Ruff F541).

    -        print(f"\nLFQ test completed successfully!")
    +        print("\nLFQ test completed successfully!")
    @@
    -    print(f"\nTMT test completed successfully!")
    +    print("\nTMT test completed successfully!")
    @@
    -                print(f"Intensity distribution analysis completed")
    +                print("Intensity distribution analysis completed")
    @@
    -    print(f"\nDataset Comparison:")
    +    print("\nDataset Comparison:")

    Also applies to: 465-465, 536-536, 589-589


    9-10: Drop unused imports

    gzip and tempfile are not used in this module.

    -import gzip
    -import tempfile
    +import tempfile  # if you still need it elsewhere, keep; otherwise remove

    If tempfile is also unused, remove it as well.


    181-182: Avoid catching bare Exception in debug helper

    Narrow the exception or re-raise after logging to keep failures visible in CI.

    -    except Exception as e:
    +    except (RuntimeError, ValueError) as e:
    quantmsio/utils/constants.py (3)

    94-95: Alias peptide name to the sequence constant to prevent drift

    Avoid duplicating the same string literal in two places.

    -MSSTATS_PEPTIDE_NAME = "PeptideSequence"
    +MSSTATS_PEPTIDE_NAME = MSSTATS_PEPTIDE_SEQUENCE

    111-114: Make MSstats columns order deterministic

    Using set() makes column order non-deterministic, which can lead to unstable schemas across chunks.

    -MSSTATS_USECOLS = list(
    -    set(MSSTATS_NAME_MAP.values())
    -    | {"sample_accession", "reference_file_name", "channel"}
    -)
    +MSSTATS_USECOLS = [
    +    "pg_accessions",
    +    "peptidoform",
    +    "charge",
    +    "intensity",
    +    "sample_accession",
    +    "reference_file_name",
    +    "channel",
    +]

    121-133: Use declared constants within MZTAB_PROTEIN_COLUMNS for consistency

    Keeps identifiers centralized and reduces typos.

     MZTAB_PROTEIN_COLUMNS = [
         ACCESSION,
         "description",
         MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE,
    -    "ambiguity_members",
    +    AMBIGUITY_MEMBERS,
         "modifications",
         "protein_coverage",
         "opt_global_nr_found_peptides",
         "opt_global_Posterior_Probability_score",
         "opt_global_cv_PRIDE:0000303_decoy_hit",
    -    "opt_global_result_type",
    +    OPT_GLOBAL_RESULT_TYPE,
     ]
    tests/core/quantms/test_full_conversion_suite.py (3)

    1-1: Remove shebang or mark file executable

    The shebang isn’t needed for a pytest module and triggers Ruff EXE001.

    -#!/usr/bin/env python3

    91-101: Avoid loading full Parquet into memory for simple stats

    Use metadata/schema to get rows/cols and column names.

    -    try:
    -        df = pq.read_table(file_path).to_pandas()
    -        size_mb = file_path.stat().st_size / (1024 * 1024)
    -
    -        return {
    -            "rows": len(df),
    -            "columns": len(df.columns),
    -            "size_mb": size_mb,
    -            "column_names": list(df.columns)[:10],  # First 10 columns
    -        }
    +    try:
    +        pf = pq.ParquetFile(file_path)
    +        size_mb = file_path.stat().st_size / (1024 * 1024)
    +        names = list(pf.schema_arrow.names)
    +        return {
    +            "rows": pf.metadata.num_rows,
    +            "columns": len(names),
    +            "size_mb": size_mb,
    +            "column_names": names[:10],
    +        }

    81-84: Narrow broad exception or re-raise after logging

    Catching Exception hides real failures during CI; consider narrowing to OSError/CalledProcessError or re-raising.

    quantmsio/core/quantms/feature.py (4)

    488-496: Column name “mp_accessions” looks unintended; likely “accession” or “pg_accessions”.

    “mp_accessions” is not used elsewhere and may not exist in FEATURE_SCHEMA; risk of silent drop/mismatch. Please confirm the intended field and rename accordingly.

    Example:

    -    "mp_accessions",
    +    "accession",  # or "pg_accessions" if you intend PG group accessions

    If you intend PGs, consider deriving from existing pg_accessions instead of the PSM accession string.


    660-664: anchor_protein type inconsistency with schema.

    add_additional_msg converts anchor_protein to list, but convert_to_parquet_format later casts anchor_protein to str. Confirm FEATURE_SCHEMA: if anchor_protein is string, keep the list in a new field (anchor_proteins) and leave anchor_protein as string; if it’s a list, remove it from string_columns and add to list_columns.

    Option A (keep string + add list):

    -    msstats["anchor_protein"] = msstats["anchor_protein"].apply(
    -        get_protein_accession
    -    )
    +    msstats["anchor_proteins"] = msstats["anchor_protein"].apply(get_protein_accession)
     string_columns = [
    -    "sequence","peptidoform","reference_file_name","anchor_protein","scan","scan_reference_file_name",
    +    "sequence","peptidoform","reference_file_name","anchor_protein","scan","scan_reference_file_name",
     ]
    -list_columns = ["pg_accessions", "gg_accessions", "gg_names"]
    +list_columns = ["pg_accessions", "gg_accessions", "gg_names", "anchor_proteins"]

    Option B (schema expects list): remove anchor_protein from string_columns and add it to list_columns.

    Also applies to: 739-747, 751-763


    114-116: Overbroad exception handlers hide real errors.

    Catching bare Exception and immediately returning empty structures makes debugging hard. Narrow exceptions and re-raise unexpected errors, or at least log with stack_info/exec_info.

    Example:

    -        except Exception as e:
    -            self.logger.warning(f"Could not extract PSM messages: {e}")
    -            return {}, {}
    +        except (KeyError, AttributeError, ValueError) as e:
    +            self.logger.warning("Could not extract PSM messages", exc_info=e)
    +            return {}, {}

    Repeat pattern for other helpers.

    Also applies to: 233-235, 257-259, 270-272, 327-331


    134-223: generate_modifications_details: simplify score assignment and guard indexing.

    • scores selection uses “mod_name in select_mods” which is always True (select_mods = keys).
    • Accessing sequence[aa_pos-1] needs a length guard.
    -                        aa = sequence[aa_pos - 1]
    +                        if 0 < aa_pos <= len(sequence):
    +                            aa = sequence[aa_pos - 1]
    +                        else:
    +                            continue
    ...
    -                            "scores": (
    -                                [
    -                                    {"score_name": "localization_probability","score_value": None}
    -                                ] if mod_name in select_mods else None
    -                            ),
    +                            "scores": None,  # until a real source is available

    Also drop the unused select_mods variable.

    tests/core/quantms/test_feature.py (1)

    36-40: Prefer context-managing the indexer.

    MzTabIndexer supports enter/exit. Using “with MzTabIndexer.create(...) as indexer” guarantees cleanup even on failure.

    -    indexer = MzTabIndexer.create(
    +    with MzTabIndexer.create(
             mztab_path=mztab_file,
             msstats_path=msstats_file,
             sdrf_path=sdrf_file,
    -        database_path=temp_db_path,
    -    )
    +        database_path=temp_db_path,
    +    ) as indexer:
    tests/core/quantms/psms/test_psm.py (1)

    246-254: Use Pythonic truth checks in tests.

    E712 flags “== True/False”. Prefer direct truthiness.

    -    ) == True
    +    )
    ...
    -    assert _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"]) == False
    +    assert not _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"])
    tests/core/quantms/psms/test_duckdb_creation.py (4)

    137-139: Remove unnecessary f-strings without placeholders.

    These prints don’t interpolate variables and trigger F541 in Ruff. Drop the f prefix.

    Apply across the listed lines, e.g.:

    - print(f"\n=== Testing Two-Step Creation Process ===")
    + print("\n=== Testing Two-Step Creation Process ===")

    Also applies to: 166-177, 226-234, 263-268, 293-304, 350-361, 385-392, 419-431, 472-480, 499-503, 530-537, 564-569, 617-625, 644-648


    149-154: Avoid accessing private attribute _duckdb in tests.

    Use a read-only DuckDB connection for verification to keep tests decoupled from internals.

    Example for one block:

    - tables = [t[0] for t in mztab_step1._duckdb.execute("SHOW TABLES").fetchall()]
    + with duckdb.connect(str(output_path), read_only=True) as con_verify:
    +     tables = [t[0] for t in con_verify.execute("SHOW TABLES").fetchall()]

    Also applies to: 276-281, 311-317, 399-404, 442-449, 586-594


    653-660: Replace bare except/pass in cleanup helpers.

    Use contextlib.suppress or log at debug to avoid silencing unexpected errors and S110/BLE001 warnings.

    Apply:

    +import contextlib
    @@
    -    if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb:
    -        try:
    -            mztab_indexer._duckdb.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if mztab_indexer and getattr(mztab_indexer, "_duckdb", None):
    +        with contextlib.suppress(Exception):
    +            mztab_indexer._duckdb.close()
    @@
    -    if con:
    -        try:
    -            con.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if con:
    +        with contextlib.suppress(Exception):
    +            con.close()

    Also applies to: 662-669


    46-50: Test name/docs mention Parquet but only DuckDB is exercised.

    Either add Parquet path coverage or rename to avoid confusion.

    quantmsio/core/quantms/pg.py (3)

    45-55: Use a consistent logger; prefer the instance logger.

    cleanup() uses a different logger name than the class logger, fragmenting log control.

    -        logger = logging.getLogger("quantmsio.core.mztab")
    +        logger = self.logger

    Also replace other logging.getLogger("quantmsio.core.mztab")/quantmsio.core.quantms.pg occurrences in this method with self.logger.

    Also applies to: 57-66


    266-271: Prefer logging.exception in except blocks.

    Leverage stack traces directly; current error messages duplicate str(e).

    -                logging.getLogger("quantmsio.core.mztab").error(
    -                    f"Error opening file {file_path}: {e}"
    -                )
    +                logging.getLogger("quantmsio.core.mztab").exception(
    +                    "Error opening file %s", file_path
    +                )
    @@
    -            logger.error(f"Error loading protein groups table: {e}")
    +            logger.exception("Error loading protein groups table")
    @@
    -            logger.error(f"Error creating msstats protein join: {e}")
    +            logger.exception("Error creating msstats protein join")

    Also applies to: 283-289, 485-487, 555-557


    425-491: Fix logging and remove unused protein_groups var.

    You build protein_data but log len(protein_groups), which is always 0.

    -        protein_groups = {}
    -        protein_data = list()
    +        protein_data: list[dict] = []
    @@
    -        logger.info(
    -            f"Loaded {len(protein_groups)} protein groups from {total_rows} total rows"
    -        )
    +        logger.info(
    +            "Loaded %d protein groups from %d total rows",
    +            len(protein_data),
    +            total_rows,
    +        )
    quantmsio/commands/convert/quantms.py (5)

    129-132: Preserve traceback and avoid redundant str(e) in logs.

    Use logger.exception("...") and chain exceptions with from e.

    -    except Exception as e:
    -        logger.exception(f"Error in feature conversion: {str(e)}")
    -        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")
    +    except Exception as e:
    +        logger.exception("Error in feature conversion")
    +        raise click.ClickException(f"Error: {e}\nCheck the logs for more details.") from e

    Repeat similarly for PSM, idXML-PSM, PG, and create-duckdb handlers.

    Also applies to: 211-214, 267-270, 416-419, 534-537


    324-335: Annotate optional parameters explicitly with Optional[...].

    PEP 484 discourages implicit Optional typing; update annotations.

    -def convert_quantms_pg_cmd(
    -    mztab_path: Path = None,
    -    database_path: Path = None,
    -    msstats_file: Path = None,
    -    sdrf_file: Path = None,
    -    output_folder: Path = None,
    +def convert_quantms_pg_cmd(
    +    mztab_path: Optional[Path] = None,
    +    database_path: Optional[Path] = None,
    +    msstats_file: Optional[Path] = None,
    +    sdrf_file: Optional[Path] = None,
    +    output_folder: Optional[Path] = None,

    Ensure from typing import Optional is present (it already is).


    520-520: Drop extraneous f-prefix.

    Minor logging polish: remove f when no placeholders are used.

    -            logger.info(f"Successfully processed:")
    +            logger.info("Successfully processed:")

    88-95: Minor: avoid redundant Path() wrapping.

    database_path is already a Path; use it directly for .exists() and str() conversions.

    Example:

    -        if database_path and Path(database_path).exists():
    +        if database_path and database_path.exists():

    Also applies to: 181-194, 369-385, 501-509


    402-415: Use public conversion helper to write Parquet.

    If possible, expose a writer method on MzTabProteinGroups similar to Feature.write_feature_to_file to avoid calling a private _convert_to_parquet_format.
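
    A minimal sketch of what such a public writer could look like, assuming _convert_to_parquet_format() returns a pyarrow.Table and the class keeps its instance logger; the method name write_pg_to_file and its exact signature are hypothetical, chosen only to mirror Feature.write_feature_to_file:

        import pyarrow.parquet as pq

        class MzTabProteinGroups:
            def write_pg_to_file(self, output_path: str) -> None:
                """Convert the loaded protein groups and write them to Parquet.

                Wraps the private _convert_to_parquet_format() so callers such as
                the CLI handler never touch implementation details.
                """
                table = self._convert_to_parquet_format()  # existing private helper (assumed signature)
                pq.write_table(table, output_path)
                self.logger.info("Wrote protein groups to %s", output_path)

    The PG conversion handler would then call the public method instead of reaching into the private helper.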

    quantmsio/core/quantms/psm.py (6)

    386-392: Avoid KeyError; reuse computed spectra_ref and handle None

    Use the local spectra_ref and guard for None to prevent KeyError.

    -            transformed_record["scan"] = generate_scan_number(record["spectra_ref"])
    +            transformed_record["scan"] = (
    +                generate_scan_number(spectra_ref) if spectra_ref else None
    +            )

    430-436: Remove duplicate assignments

    mz_array and intensity_array are set twice.

                 transformed_record["number_peaks"] = None
                 transformed_record["ion_mobility"] = None
    -            transformed_record["mz_array"] = None
    -            transformed_record["intensity_array"] = None

    439-441: Log full traceback on transform failures

    Use logger.exception for better diagnostics.

    -        except Exception as e:
    -            self.logger.warning(f"Error transforming PSM record: {e}")
    +        except Exception:
    +            self.logger.exception("Error transforming PSM record")
                 return None

    478-491: Use .get() to avoid KeyError on missing score columns

    PSM rows may not carry all search_engine_score[i] fields.

    -        for score_index, score_name in self._score_names["psms"].items():
    -            value = record[f"search_engine_score[{score_index}]"]
    +        for score_index, score_name in self._score_names["psms"].items():
    +            value = record.get(f"search_engine_score[{score_index}]")
                 if value is not None and value != "null":

    507-509: Minor: remove unnecessary f-string

    No placeholders present.

    -                self.logger.warning(f"Could not convert global_qvalue to float")
    +                self.logger.warning("Could not convert global_qvalue to float")

    580-582: Prefer logger.exception when catching unexpected errors

    Preserves stack trace for debugging.

    -        except Exception as e:
    -            self.logger.error(f"Could not extract peptide columns: {e}")
    +        except Exception:
    +            self.logger.exception("Could not extract peptide columns")
                 self._pep_columns = []
    quantmsio/core/duckdb.py (1)

    223-225: Use logger.exception to capture stack trace on benchmark failures

    -            except Exception as e:
    -                self.logger.error(f"Query failed on iteration {i+1}: {e}")
    +            except Exception as e:
    +                self.logger.exception(f"Query failed on iteration {i+1}")
                     return {"error": str(e)}

    Keep the as e binding here: the retained return statement still needs str(e).

    Comment on lines +105 to +116
            elif mztab_path:
                # No database_path provided, create a temporary one
                temp_db_path = tempfile.mktemp(suffix=".duckdb")
                logger.info(
                    f"Creating temporary MzTabIndexer at {temp_db_path} from {mztab_path}"
                )
                indexer = MzTabIndexer.create(
                    mztab_path=str(mztab_path),
                    msstats_path=msstats_file,
                    sdrf_path=sdrf_file,
                    database_path=temp_db_path,
                )

    ⚠️ Potential issue

    Replace insecure tempfile.mktemp() with NamedTemporaryFile(delete=False).

    mktemp is vulnerable to race conditions (CWE-377/B306). Use a secure temp file.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            with NamedTemporaryFile(suffix=".duckdb", delete=False) as tf:
    +                temp_db_path = tf.name

    Apply in all three places.

    Also applies to: 195-204, 386-396
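
    Since the same pattern repeats at all three call sites, one option is a small module-level helper; a minimal sketch, assuming the command module can own it (the name _make_temp_duckdb_path is hypothetical):

        from pathlib import Path
        from tempfile import NamedTemporaryFile

        def _make_temp_duckdb_path() -> Path:
            """Create a unique temporary DuckDB path without the mktemp race.

            The file is created and closed immediately so it can be reopened by
            name; the caller stays responsible for deleting it when done.
            """
            with NamedTemporaryFile(suffix=".duckdb", delete=False) as tf:
                return Path(tf.name)

    Each branch would then call temp_db_path = _make_temp_duckdb_path() before MzTabIndexer.create(...). Note this leaves an empty file on disk; if MzTabIndexer.create requires a path that does not yet exist, prefer the TemporaryDirectory variant suggested for the tests and build the filename inside that directory.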

    🧰 Tools
    🪛 ast-grep (0.38.6)

    [warning] 106-106: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
    Context: tempfile.mktemp(suffix=".duckdb")
    Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
    https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
    https://owasp.org/Top10/A01_2021-Broken_Access_Control

    (avoid-mktemp-python)

    🪛 Ruff (0.12.2)

    107-107: Use of insecure and deprecated function (mktemp)

    (S306)

    🪛 GitHub Check: CodeFactor

    [warning] 110-110: quantmsio/commands/convert/quantms.py#L110
    Use of insecure and deprecated function (mktemp). (B306)

    🤖 Prompt for AI Agents
    In quantmsio/commands/convert/quantms.py around lines 105 to 116 (and also apply
    the same change at 195-204 and 386-396), replace the insecure tempfile.mktemp()
    calls with a secure tempfile.NamedTemporaryFile(delete=False, suffix=".duckdb")
    usage: create a NamedTemporaryFile with delete=False, capture its .name for
    database_path, close the file descriptor immediately (so it can be opened by
    other code) and pass the .name to MzTabIndexer.create; ensure the temporary file
    is cleaned up later if needed.
    

    Comment on lines 24 to +51
        def __init__(
            self,
            report_path: Union[Path, str],
            duckdb_max_memory: str = "16GB",
            duckdb_threads: int = 4,
            database_path: Optional[Union[Path, str]] = None,
            max_memory: str = "16GB",
            worker_threads: int = 8,
        ):
            self.parquet_db: duckdb.DuckDBPyConnection | None = None
            self._report_path = report_path
            self._duckdb_name = create_uuid_filename("report-duckdb", ".db")
            self._duckdb = self.create_duckdb_from_diann_report(
                duckdb_max_memory, duckdb_threads
            """Initialize DuckDB connection.

            Args:
                database_path: Optional path to DuckDB database file
                max_memory: Maximum memory to use
                worker_threads: Number of worker threads
            """
            self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
            self.logger.info(f"duckdb uses {worker_threads} threads.")
            self.logger.info(f"duckdb uses {max_memory} of memory.")

            start_time = time.time()
            self._duckdb = duckdb.connect(
                database=str(database_path) if database_path else ":memory:",
                config={
                    "memory_limit": max_memory,
                    "threads": worker_threads,
                },
            )
            self.logger.info(
                f"Time to initialize duckdb {(time.time() - start_time):.2f} seconds"
            )

    🛠️ Refactor suggestion

    Persist database path and apply resource settings via constructor

    Store database_path for cleanup and pass max_memory/worker_threads through __init__. Otherwise, cleanup_duckdb can’t locate the file and subclass workarounds appear (see DiannDuckDB).

         def __init__(
             self,
             database_path: Optional[Union[Path, str]] = None,
             max_memory: str = "16GB",
             worker_threads: int = 8,
         ):
             """Initialize DuckDB connection.
    @@
    -        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
    +        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
             self.logger.info(f"duckdb uses {worker_threads} threads.")
             self.logger.info(f"duckdb uses {max_memory} of memory.")
     
             start_time = time.time()
    -        self._duckdb = duckdb.connect(
    -            database=str(database_path) if database_path else ":memory:",
    +        # Track database path for cleanup
    +        self._database_path = Path(database_path) if database_path else None
    +        self._duckdb = duckdb.connect(
    +            database=str(self._database_path) if self._database_path else ":memory:",
                 config={
                     "memory_limit": max_memory,
                     "threads": worker_threads,
                 },
             )
    🤖 Prompt for AI Agents
    In quantmsio/core/duckdb.py around lines 24 to 51, the constructor currently
    does not persist the provided database_path or the resource settings on the
    instance, which prevents cleanup_duckdb and subclasses from locating the DB file
    or reading configured resources; fix by storing the incoming parameters on self
    (e.g., self.database_path = Path(database_path) if provided else None,
    self.max_memory = max_memory, self.worker_threads = worker_threads) and continue
    to use those stored values when connecting (and later when cleaning up) so
    cleanup_duckdb and subclasses can access the actual path and resource settings.
    

    Comment on lines +59 to +78
        def cleanup_duckdb(self):
            """Check if DuckDB connection is closed, then delete the database file."""
            # Close connection if it is still open
            if self._duckdb:
                try:
                    self._duckdb.close()
                    self.logger.info("[Check] DuckDB connection closed.")
                except Exception as e:
                    self.logger.info(f"Failed to close DuckDB connection: {e}")
                finally:
                    self._duckdb = None

            db_file = Path(self._database_path)
            # Delete the database file using pathlib
            if db_file.exists():
                try:
                    db_file.unlink()
                    self.logger.info(f"[CleanUp] Database file deleted: {db_file}")
                except Exception as e:
                    self.logger.info(f"Failed to delete database file: {e}")

    ⚠️ Potential issue

    cleanup_duckdb breaks when _database_path is unset

    When using an in-memory DB, _database_path is None; current code raises. Guard for None before deletion.

    -        db_file = Path(self._database_path)
    -        # Delete the database file using pathlib
    -        if db_file.exists():
    +        if not getattr(self, "_database_path", None):
    +            return
    +        db_file = Path(self._database_path)
    +        # Delete the database file using pathlib
    +        if db_file.exists():
                 try:
                     db_file.unlink()
                     self.logger.info(f"[CleanUp] Database file deleted: {db_file}")
                 except Exception as e:
                     self.logger.info(f"Failed to delete database file: {e}")
    🧰 Tools
    🪛 Ruff (0.12.2)

    66-66: Do not catch blind exception: Exception

    (BLE001)


    77-77: Do not catch blind exception: Exception

    (BLE001)

    🤖 Prompt for AI Agents
    In quantmsio/core/duckdb.py around lines 59 to 78, cleanup_duckdb assumes
    self._database_path is set and constructs a Path which raises when it's None;
    guard before attempting deletion by checking if self._database_path is truthy
    (not None/empty) and only then create Path(self._database_path) and call
    exists()/unlink(); if it's None or represents an in-memory DB skip file deletion
    and log that no file was present; preserve existing exception handling and set
    self._duckdb = None in the finally block.
    

    Comment on lines +84 to 105
        def secure_query_to_df(
            self, query: str, params: Optional[Dict[str, Any]] = None
        ) -> pd.DataFrame:
            """Execute parameterized query safely and return result as DataFrame.

            Args:
                query: SQL query with placeholders (e.g., 'SELECT * FROM table WHERE id = ?')
                params: Dictionary of parameters to safely substitute

            if max_memory is not None:
                database.execute("SET max_memory='{}'".format(max_memory))
            if worker_threads is not None:
                database.execute("SET worker_threads='{}'".format(worker_threads))
            Returns:
                DataFrame with query results

            msg = database.execute(
                "SELECT * FROM duckdb_settings() where name in ('worker_threads', 'max_memory')"
            ).df()
            logging.info("duckdb uses {} threads.".format(str(msg["value"][0])))
            logging.info("duckdb uses {} of memory.".format(str(msg["value"][1])))
            Example:
                >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", {"accession": "P12345"})
            """
            if params:
                # DuckDB uses positional parameters with ?
                param_values = list(params.values())
                return self._duckdb.execute(query, param_values).df()
            else:
                return self._duckdb.execute(query).df()


    🛠️ Refactor suggestion

    Parameter order bug and API mismatch in secure_query_to_df

    DuckDB uses positional ? placeholders. A dict is bound purely by insertion order, not by key, so the keys are silently ignored and callers can easily bind parameters in the wrong order. Accept a sequence and pass it through.

    -    def secure_query_to_df(
    -        self, query: str, params: Optional[Dict[str, Any]] = None
    -    ) -> pd.DataFrame:
    +    def secure_query_to_df(
    +        self, query: str, params: Optional[List[Any]] = None
    +    ) -> pd.DataFrame:
    @@
    -        Example:
    -            >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", {"accession": "P12345"})
    +        Example:
    +            >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", ["P12345"])
    @@
    -        if params:
    -            # DuckDB uses positional parameters with ?
    -            param_values = list(params.values())
    -            return self._duckdb.execute(query, param_values).df()
    +        if params:
    +            # DuckDB uses positional parameters with ?
    +            return self._duckdb.execute(query, params).df()
             else:
                 return self._duckdb.execute(query).df()
    🤖 Prompt for AI Agents
    In quantmsio/core/duckdb.py around lines 84 to 105, secure_query_to_df currently
    accepts params: Optional[Dict[str, Any]] which can reorder positional ?
    bindings; change the API to accept a sequence (Optional[Sequence[Any]] or
    List[Any]) instead, update the docstring/example to show a positional list/tuple
    (e.g., ["P12345"]), and when params is provided pass it (or list(params))
    directly to self._duckdb.execute(query, params) so binding order matches the
    query; keep the no-params branch unchanged.
    

    Comment on lines +305 to 311
            database_name = create_uuid_filename("diann-report", ".duckdb")
            super().__init__(database_name)

        def get_unique_references(self, field: str) -> list:
            # Initialize database and create report table
            self.initialize_database(max_memory, worker_threads)
            self.create_table_from_file("report", self._report_path, [PROTEIN_GROUP, RUN])


    ⚠️ Potential issue

    Call to undefined initialize_database and unused resource args

    DiannDuckDB.__init__ calls a non-existent method. Pass max_memory/worker_threads to super().__init__ and remove the call.

    -        database_name = create_uuid_filename("diann-report", ".duckdb")
    -        super().__init__(database_name)
    -
    -        # Initialize database and create report table
    -        self.initialize_database(max_memory, worker_threads)
    +        database_name = create_uuid_filename("diann-report", ".duckdb")
    +        super().__init__(database_name, max_memory=max_memory, worker_threads=worker_threads)
     
             self.create_table_from_file("report", self._report_path, [PROTEIN_GROUP, RUN])

    Comment on lines +633 to +641
                        # Get reference file name
                        ref_file = None
                        if pd.notna(row["spectra_ref"]):
                            try:
                                ms_run_id = row["spectra_ref"].split(":")[0]
                                ref_file = self._ms_runs.get(ms_run_id)
                            except (ValueError, AttributeError):
                                pass


    ⚠️ Potential issue

    MS run mapping bug: wrong key type used for _ms_runs

    self._ms_runs is keyed by integer ms_run IDs, but here you use the raw "ms_run[X]" string; this always yields None. Reuse your parser.

    -                    if pd.notna(row["spectra_ref"]):
    -                        try:
    -                            ms_run_id = row["spectra_ref"].split(":")[0]
    -                            ref_file = self._ms_runs.get(ms_run_id)
    -                        except (ValueError, AttributeError):
    -                            pass
    +                    if pd.notna(row["spectra_ref"]):
    +                        ref_file = self._get_reference_file_name(row["spectra_ref"])
    🤖 Prompt for AI Agents
    In quantmsio/core/quantms/psm.py around lines 633 to 641, the code looks up
    self._ms_runs using the raw string token from row["spectra_ref"] which yields
    None because _ms_runs is keyed by integer ms_run IDs; parse the spectra_ref to
    extract the numeric ms_run id exactly as the existing parser does (convert to
    int) and use that integer as the key when calling self._ms_runs.get(...),
    preserving the current try/except behavior to handle malformed values.
    

    Comment on lines +642 to +645
                        # Update map with best score
                        if key not in pep_map or row["score"] < pep_map[key][0]:
                            pep_map[key] = [row["score"], ref_file, scan_number]


    🛠️ Refactor suggestion

    Guard None and normalize score to float before comparison

    Avoid TypeError when score is missing; store numeric consistently.

    -                    if key not in pep_map or row["score"] < pep_map[key][0]:
    -                        pep_map[key] = [row["score"], ref_file, scan_number]
    +                    try:
    +                        score_val = float(row["score"])
    +                    except (TypeError, ValueError):
    +                        continue
    +                    if key not in pep_map or score_val < pep_map[key][0]:
    +                        pep_map[key] = [score_val, ref_file, scan_number]
    🤖 Prompt for AI Agents
    In quantmsio/core/quantms/psm.py around lines 642 to 645, the code compares
    row["score"] directly which can be None or a non-numeric string; normalize by
    parsing score to a float (e.g. try: score = float(row.get("score")) except
    (TypeError, ValueError): score = float("inf")) and then use that numeric score
    in the comparison and when storing: if key not in pep_map or score <
    pep_map[key][0]: pep_map[key] = [score, ref_file, scan_number].
    

    Comment on lines +26 to +35
        temp_db_path = tempfile.mktemp(suffix=".duckdb")

        # Initialize Feature
        indexer = MzTabIndexer.create(
            mztab_path=mztab_file,
            msstats_path=msstats_file,
            sdrf_path=sdrf_file,
            database_path=temp_db_path,
        )


    🛠️ Refactor suggestion

    Replace insecure tempfile.mktemp with TemporaryDirectory path (security, reliability).

    mktemp is vulnerable (CWE-377) and can race; also MzTabIndexer expects a non-existent DB path. Use a temp directory and create the filename within it.

    -    # Use a temporary directory for the backend database
    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    -
    -    # Initialize Feature
    -    indexer = MzTabIndexer.create(
    -        mztab_path=mztab_file,
    -        msstats_path=msstats_file,
    -        sdrf_path=sdrf_file,
    -        database_path=temp_db_path,
    -    )
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "feature_test.duckdb"
    +        indexer = MzTabIndexer.create(
    +            mztab_path=mztab_file,
    +            msstats_path=msstats_file,
    +            sdrf_path=sdrf_file,
    +            database_path=temp_db_path,
    +        )
    +        feature = Feature(mztab_indexer=indexer)
    +        ...

    Repeat the TemporaryDirectory pattern for test_extract_psm_msg (Lines 66-74) and test_generate_feature (Lines 106-114).
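
    If these tests run under pytest, the built-in tmp_path fixture gives the same isolation with automatic cleanup and no explicit context manager; a minimal sketch, reusing the mztab_file, msstats_file and sdrf_file names from the snippet above (the test name here is illustrative):

        def test_feature_conversion(tmp_path):
            # tmp_path is a per-test pathlib.Path directory managed by pytest
            temp_db_path = tmp_path / "feature_test.duckdb"

            indexer = MzTabIndexer.create(
                mztab_path=mztab_file,
                msstats_path=msstats_file,
                sdrf_path=sdrf_file,
                database_path=temp_db_path,
            )
            feature = Feature(mztab_indexer=indexer)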

    Committable suggestion skipped: line range outside the PR's diff.

    🧰 Tools
    🪛 Ruff (0.12.2)

    26-26: Use of insecure and deprecated function (mktemp)

    (S306)

    🪛 GitHub Check: CodeFactor

    [warning] 26-26: tests/core/quantms/test_feature.py#L26
    Use of insecure and deprecated function (mktemp). (B306)

    🤖 Prompt for AI Agents
    In tests/core/quantms/test_feature.py around lines 26-35 (and similarly update
    the patterns used in test_extract_psm_msg at lines 66-74 and
    test_generate_feature at lines 106-114), replace the insecure use of
    tempfile.mktemp with a context-managed tempfile.TemporaryDirectory and construct
    the database filename inside that directory (e.g., join the tempdir path with a
    desired filename ending in .duckdb) so the path is created safely and avoids
    race conditions; ensure the created path is non-existent when passed to
    MzTabIndexer.create and that the TemporaryDirectory context scope covers the
    test operations so the temp is cleaned up after the test.
    

    Comment on lines +306 to +318
        if intensities_channel_counts:
            for channel, count in intensities_channel_counts.items():
                if channel not in channel_counts:
                    channel_counts[channel] = 0

        # Use intensities-based matrix
        for ref_file, channels in file_channel_matrix_from_intensities.items():
            if ref_file not in file_channel_matrix:
                file_channel_matrix[ref_file] = {}
            for channel, count in channels.items():
                if channel not in file_channel_matrix[ref_file]:
                    file_channel_matrix[ref_file][channel] = 0
                file_channel_matrix[ref_file][channel] += count

    🛠️ Refactor suggestion

    Fix dropped channel counts when merging intensity-derived stats

    Counts from intensities_channel_counts are not added to channel_counts. Add the accumulation.

    -    if intensities_channel_counts:
    -        for channel, count in intensities_channel_counts.items():
    -            if channel not in channel_counts:
    -                channel_counts[channel] = 0
    +    if intensities_channel_counts:
    +        for channel, count in intensities_channel_counts.items():
    +            channel_counts[channel] = channel_counts.get(channel, 0) + count
    🧰 Tools
    🪛 Ruff (0.12.2)

    307-307: Loop control variable count not used within loop body

    (B007)

    🪛 GitHub Check: CodeFactor

    [notice] 272-318: tests/core/quantms/test_msstats_in.py#L272-L318
    Complex Method

    🤖 Prompt for AI Agents
    In tests/core/quantms/test_msstats_in.py around lines 306 to 318, the loop that
    handles intensities_channel_counts currently ensures keys exist in
    channel_counts but never adds the counts themselves; update the code to add each
    intensities_channel_counts[channel] value into channel_counts (initializing
    channel_counts[channel] to 0 if missing) so the intensity-derived counts are
    accumulated into channel_counts before proceeding to build the
    file_channel_matrix.
    

    Comment on lines +38 to +49
        # Use a temporary directory for the backend database
        temp_db_path = tempfile.mktemp(suffix=".duckdb")

        with MzTabIndexer(
            mztab_path=LARGE_DATASET["mztab"],
            msstats_path=LARGE_DATASET["msstats"],
            sdrf_path=LARGE_DATASET["sdrf"],
            database_path=temp_db_path,
            max_memory="16GB",  # Adjust as needed for your system
            worker_threads=4,  # Adjust as needed for your system
            batch_size=100000,  # Using a larger batch size for a larger file
        ) as indexer:

    🛠️ Refactor suggestion

    Replace insecure mktemp with TemporaryDirectory; thread-safe and matches DB creation semantics.

    mktemp creates a race and may be pre-created by attackers. Also the DB must not exist before create.

    -    # Use a temporary directory for the backend database
    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    -
    -    with MzTabIndexer(
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "perf.duckdb"
    +        with MzTabIndexer(
                 mztab_path=LARGE_DATASET["mztab"],
                 msstats_path=LARGE_DATASET["msstats"],
                 sdrf_path=LARGE_DATASET["sdrf"],
                 database_path=temp_db_path,
                 max_memory="16GB",  # Adjust as needed for your system
                 worker_threads=4,  # Adjust as needed for your system
                 batch_size=100000,  # Using a larger batch size for a larger file
    -    ) as indexer:
    +        ) as indexer:

    Committable suggestion skipped: line range outside the PR's diff.

    🧰 Tools
    🪛 ast-grep (0.38.6)

    [warning] 38-38: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
    Context: tempfile.mktemp(suffix=".duckdb")
    Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
    https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
    https://owasp.org/Top10/A01_2021-Broken_Access_Control

    (avoid-mktemp-python)

    🪛 Ruff (0.12.2)

    39-39: Use of insecure and deprecated function (mktemp)

    (S306)

    🪛 GitHub Check: CodeFactor

    [warning] 39-39: tests/core/test_large_mztab_performance.py#L39
    Use of insecure and deprecated function (mktemp). (B306)

    🤖 Prompt for AI Agents
    In tests/core/test_large_mztab_performance.py around lines 38–49, replace the
    insecure tempfile.mktemp usage with a tempfile.TemporaryDirectory context so the
    DB path is created inside a fresh, thread-safe temp directory and does not
    pre-exist; inside the TemporaryDirectory context construct a db filepath (e.g.,
    os.path.join(tmpdir, "test.duckdb") or with your desired suffix), pass that path
    to MzTabIndexer, and ensure you import tempfile and os if not already imported.
    
