
Conversation

@ypriverol (Member) commented Jun 24, 2025

PR Type

Enhancement


Description

  • Major CLI restructuring with organized command groups

  • Enhanced logging system with structured output support

  • New quantms workflow conversion command

  • Improved error handling and verbose logging


Changes walkthrough 📝

Relevant files
Enhancement (28 files)
  • quantms.py: Add quantms workflow conversion command (+318/-0)
  • diann.py: Add DIA-NN conversion commands (+256/-0)
  • psm.py: Add PSM conversion and comparison commands (+184/-0)
  • feature.py: Add feature conversion command (+167/-0)
  • plot.py: Add visualization plotting commands (+207/-0)
  • quantmsioc.py: Restructure CLI with organized command groups (+87/-42)
  • de.py: Add differential expression transformation command (+128/-0)
  • stats.py: Add statistical analysis commands (+114/-0)
  • maxquant.py: Update MaxQuant conversion commands (+31/-21)
  • psm.py: Add enhanced logging to PSM processing (+17/-6)
  • feature.py: Add enhanced logging to feature processing (+19/-1)
  • project.py: Add project management commands (+84/-0)
  • ae.py: Add absolute expression transformation command (+82/-0)
  • format.py: Update schema with nullable fields (+6/-0)
  • fragpipe.py: Update FragPipe PSM conversion method (+24/-21)
  • gene.py: Add gene mapping transformation command (+66/-0)
  • project.py: Update project save methods (+22/-7)
  • fragpipe.py: Update FragPipe conversion command (+15/-7)
  • uniprot.py: Add UniProt mapping transformation command (+56/-0)
  • ibaq.py: Add IBAQ transformation command (+55/-0)
  • spectra.py: Add spectra transformation command (+58/-0)
  • anndata.py: Update AnnData merge command (+20/-9)
  • attach.py: Update file attachment command (+6/-6)
  • mztab.py: Add logging to MzTab class (+3/-1)
  • logger.py: Add logger utility module (+10/-0)
  • __init__.py: Add convert commands package (+3/-0)
  • __init__.py: Add transform commands package (+3/-0)
  • __init__.py: Add utils commands package (+3/-0)
Bug fix (1 file)
  • pride_utils.py: Fix regex patterns with raw strings (+2/-2)
Documentation (7 files)
  • feature.avsc: Update feature schema with scan format (+18/-10)
  • README.md: Major documentation update with usage examples (+168/-1)
  • psm.avsc: Update PSM schema with record types (+11/-7)
  • README_LOGGING.md: Add comprehensive logging documentation (+127/-0)
  • differential.avsc: Add differential expression schema (+51/-0)
  • ibaq_usecase.adoc: Update IBAQ use case documentation (+7/-7)
  • absolute.avsc: Add absolute expression schema (+46/-0)
Dependencies (2 files)
  • enviroment.yml: Update DuckDB version requirement (+1/-1)
  • pyproject.toml: Update Python and PyOpenMS version requirements (+2/-2)
Additional files (13 files)
  • ae_command.py (+0/-60)
  • de_command.py (+0/-67)
  • diann_command.py (+0/-191)
  • feature_command.py (+0/-102)
  • generate_gene_message_command.py (+0/-49)
  • generate_spectra_message_command.py (+0/-46)
  • ibaq_command.py (+0/-52)
  • map_latest_uniport_command.py (+0/-44)
  • plot_command.py (+0/-126)
  • project_command.py (+0/-81)
  • psm_command.py (+0/-79)
  • quantms_command.py (+0/-140)
  • statistic_command.py (+0/-104)

Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.

Summary by CodeRabbit

    • New Features

      • Reorganized CLI into grouped commands (convert/transform/visualize/stats/project) with many new conversion & transform subcommands (MaxQuant, DIA‑NN, IdXML, project workflows, mapping/transform utilities) and improved plotting/stats tools.
      • Batch‑oriented Parquet export and faster DuckDB-backed workflows for large datasets.
    • Documentation

      • Major README overhaul and updated docs/examples for new CLI and workflows.
    • Tests

      • Large expansion of unit, integration and performance tests and new CI workflows.
    • Chores

      • Minimum Python raised to 3.10; dependency adjustments and expanded .gitignore.

    @coderabbitai bot (Contributor) left a comment

    Actionable comments posted: 40

    Caution

    Some comments are outside the diff and can’t be posted inline due to platform limitations.

    ⚠️ Outside diff range comments (4)
    quantmsio/utils/system.py (1)

    17-35: Do not assume “enough disk” on error; handle non-existent output_dir and improve logging

    Returning True on exceptions can lead to partial writes/data loss. Also, shutil.disk_usage requires an existing path—new output_dir may not exist yet. Check the nearest existing parent, log details, and make the “assume on error” behavior explicit.

    Apply:

    -def check_disk_space(output_dir: Path, required_gb: float = 1.0) -> bool:
    +def check_disk_space(output_dir: Path, required_gb: float = 1.0, assume_enough_on_error: bool = False) -> bool:
         """
    -    Check if there's sufficient disk space for processing.
    +    Check if there's sufficient disk space for processing.
     
         Args:
    -        output_dir: Directory to check space for
    -        required_gb: Required space in GB (default: 1GB)
    +        output_dir: Directory to check space for (may not exist yet).
    +        required_gb: Required space in GiB (default: 1.0).
    +        assume_enough_on_error: If True, return True when the check fails; otherwise return False.
     
         Returns:
    -        True if sufficient space is available, False otherwise
    +        True if sufficient space is available, False otherwise.
         """
         try:
    -        total, used, free = shutil.disk_usage(output_dir)
    -        free_gb = free / (1024**3)  # Convert to GB
    -        return free_gb >= required_gb
    +        path = output_dir if output_dir.exists() else output_dir.parent
    +        total, used, free = shutil.disk_usage(path)
    +        free_gib = free / (1024 ** 3)  # Convert to GiB
    +        has_space = free_gib >= required_gb
    +        logging.getLogger(__name__).debug(
    +            "Disk space check on %s: free=%.2f GiB, required=%.2f GiB, ok=%s",
    +            path, free_gib, required_gb, has_space
    +        )
    +        return has_space
         except Exception as e:
    -        logging.warning(f"Could not check disk space: {e}")
    -        return True  # Assume sufficient space if we can't check
    +        logging.warning("Could not check disk space for %s: %s", output_dir, e)
    +        return assume_enough_on_error
    docs/pg.avsc (2)

    44-56: Avro types: prefer float over float32; struct → record.

    Replace float32 with Avro float and define a named record for array items.

    -        {"name": "intensities", "type": ["null",{
    -            "type": "array",
    -            "items": {
    -                "type": "struct",
    -                "fields": [
    -                    {"name": "sample_accession", "type": "string"},
    -                    {"name": "channel", "type": "string"},
    -                    {"name": "intensity", "type": "float32"}
    -                ]
    -            }
    -            }],
    +        {"name": "intensities", "type": ["null",{
    +            "type": "array",
    +            "items": {
    +                "type": "record",
    +                "name": "pg_intensity",
    +                "fields": [
    +                    {"name": "sample_accession", "type": "string"},
    +                    {"name": "channel", "type": "string"},
    +                    {"name": "intensity", "type": "float"}
    +                ]
    +            }
    +        }],

    57-78: Nested arrays: define records and use float.

    Bring additional_intensities into Avro-compliant form.

    -        {"name": "additional_intensities", "type": ["null",{
    -            "type": "array",
    -            "items": {
    -                "type": "struct",
    -                "fields": [
    -                    {"name": "sample_accession", "type": "string"},
    -                    {"name": "channel", "type": "string"},
    -                    {"name": "intensities", "type": {
    -                        "type": "array",
    -                        "items": {
    -                            "type": "struct",
    -                            "fields": [
    -                                {"name": "intensity_name", "type": "string"},
    -                                {"name": "intensity_value", "type": "float32"}
    -                            ]
    -                        }
    -                    }}
    -                ]
    -            }
    -            }],
    +        {"name": "additional_intensities", "type": ["null",{
    +            "type": "array",
    +            "items": {
    +                "type": "record",
    +                "name": "pg_additional_intensity",
    +                "fields": [
    +                    {"name": "sample_accession", "type": "string"},
    +                    {"name": "channel", "type": "string"},
    +                    {"name": "intensities", "type": {
    +                        "type": "array",
    +                        "items": {
    +                            "type": "record",
    +                            "name": "pg_intensity_kv",
    +                            "fields": [
    +                                {"name": "intensity_name", "type": "string"},
    +                                {"name": "intensity_value", "type": "float"}
    +                            ]
    +                        }
    +                    }}
    +                ]
    +            }
    +        }],
    quantmsio/utils/pride_utils.py (1)

    457-466: Guard against None modifications_definition to avoid runtime errors.

    When modifications exist but modifications_definition is None, the current code will raise. Fall back to the raw sequence in that case.

    -    if "opt_global_cv_MS:1000889_peptidoform_sequence" in es:
    -        psm["peptidoform"] = es["opt_global_cv_MS:1000889_peptidoform_sequence"]
    -    else:
    -        psm["peptidoform"] = get_petidoform_msstats_notation(
    -            psm["sequence"], psm["modifications"], modifications_definition
    -        )
    +    if "opt_global_cv_MS:1000889_peptidoform_sequence" in es:
    +        psm["peptidoform"] = es["opt_global_cv_MS:1000889_peptidoform_sequence"]
    +    else:
    +        psm["peptidoform"] = (
    +            get_petidoform_msstats_notation(
    +                psm["sequence"], psm["modifications"], modifications_definition
    +            )
    +            if modifications_definition
    +            else psm["sequence"]
    +        )
    -    psm["proforma_peptidoform"] = get_peptidoform_proforma_version_in_mztab(
    -        psm["sequence"], psm["modifications"], modifications_definition
    -    )
    +    psm["proforma_peptidoform"] = (
    +        get_peptidoform_proforma_version_in_mztab(
    +            psm["sequence"], psm["modifications"], modifications_definition
    +        )
    +        if modifications_definition
    +        else psm["sequence"]
    +    )
    ♻️ Duplicate comments (4)
    quantmsio/core/fragpipe/fragpipe.py (1)

    371-387: Reduce default batch size and restore pre-flight checks (duplicate of prior feedback)

    A default of 1e6 rows risks memory pressure; also check that the input file exists and create the output directory before writing.

    -    def write_psms_to_parquet(
    +    def write_psms_to_parquet(
             self,
    -        msms_file: Path,
    -        batch_size: int = 1000000,
    +        msms_file: Path,
    +        batch_size: int = 100000,
             output_prefix: Optional[str] = None,
         ) -> None:
    @@
    -        if not output_prefix:
    +        if not msms_file.exists():
    +            raise FileNotFoundError(f"PSM file not found: {msms_file}")
    +        self.output_directory.mkdir(parents=True, exist_ok=True)
    +        if not output_prefix:
                 output_prefix = "psm"
    quantmsio/core/quantms/pg.py (1)

    543-548: DuckDB cannot read a Python DataFrame by name in SQL; must register it.

    This creates an empty table or fails silently. Register/unregister the DataFrame before CREATE TABLE.

    -            if protein_data:
    -                protein_df = pd.DataFrame(protein_data)
    -                msstats_indexer._duckdb.execute("DROP TABLE IF EXISTS protein_groups")
    -                msstats_indexer._duckdb.execute(
    -                    "CREATE TABLE protein_groups AS SELECT * FROM protein_df"
    -                )
    +            if protein_data:
    +                protein_df = pd.DataFrame(protein_data)
    +                msstats_indexer._duckdb.execute("DROP TABLE IF EXISTS protein_groups")
    +                msstats_indexer._duckdb.register("protein_df_view", protein_df)
    +                try:
    +                    msstats_indexer._duckdb.execute(
    +                        "CREATE TABLE protein_groups AS SELECT * FROM protein_df_view"
    +                    )
    +                finally:
    +                    msstats_indexer._duckdb.unregister("protein_df_view")
    quantmsio/core/diann/diann.py (1)

    151-167: Consider aligning field names for clarity.

    Based on the past review comments, the field names in peptide_counts could be more consistent.

    Consider using more consistent field naming:

    • In peptide_counts: use unique_peptides and total_peptides
    • Keep feature_counts as is with unique_features and total_features

    This would make the distinction between peptides and features clearer.
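
    A tiny sketch of the suggested shape; the values below are placeholders, not taken from the code:

    # Hypothetical illustration of the renamed keys (counts are made up)
    peptide_counts = {"unique_peptides": 1234, "total_peptides": 5678}
    feature_counts = {"unique_features": 4321, "total_features": 8765}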

    quantmsio/core/quantms/psm.py (1)

    58-69: Use context manager for file operations.

    The file handling needs improvement for better resource management and exception safety. The manual file opening and closing should use a context manager.

    🧹 Nitpick comments (80)
    quantmsio/utils/system.py (1)

    8-15: Add type hints and tighten the log output (units, PID)

    Minor polish: type the logger param/return, and log MiB with PID for easier tracing.

    Apply:

    -def log_memory_usage(logger, step: str):
    -    """Log memory usage for a given step."""
    -    process = psutil.Process(os.getpid())
    -    mem_info = process.memory_info()
    -    logger.debug(
    -        f"Memory usage after {step}: RSS={mem_info.rss / 1024 / 1024:.1f}MB, VMS={mem_info.vms / 1024 / 1024:.1f}MB"
    -    )
    +def log_memory_usage(logger: logging.Logger, step: str) -> None:
    +    """Log current process memory usage (RSS/VMS in MiB) for the given step."""
    +    process = psutil.Process(os.getpid())
    +    mem = process.memory_info()
    +    rss = mem.rss / (1024 ** 2)
    +    vms = mem.vms / (1024 ** 2)
    +    logger.debug(f"[pid={process.pid}] Memory usage after {step}: RSS={rss:.1f} MiB, VMS={vms:.1f} MiB")

    Optional micro-opt (outside range): cache the Process at module load to avoid re-creating it per call:

    # at module top
    _PROCESS = psutil.Process(os.getpid())
    # in function
    process = _PROCESS
    tests/core/test_project.py (4)

    20-36: Patch the correct requests symbol and simplify the mock response.
    Targets the import site to avoid future import-style changes breaking the patch; Mock is lighter than constructing a Response.

    -    @patch("requests.get")
    +    @patch("quantmsio.core.project.requests.get")
         def test_populate_from_pride_archive_successful(self, mock_get):
    -        # Mock the API response for a successful request
    -        mock_response = requests.models.Response()
    -        mock_response.status_code = 200
    -        mock_response.json = lambda: {
    -            "title": "Test Project",
    -            "projectDescription": "Test description",
    -            "sampleProcessingProtocol": "Test sample processing protocol",
    -            "dataProcessingProtocol": "Test data processing protocol",
    -            "references": [{"pubmedId": "12345678"}],
    -            "keywords": ["keyword1", "keyword2"],
    -            "projectTags": ["tag1", "tag2"],
    -            # Add other mock data as needed
    -        }
    -        mock_get.return_value = mock_response
    +        # Mock the API response for a successful request
    +        mock_get.return_value = Mock(
    +            status_code=200,
    +            json=lambda: {
    +                "title": "Test Project",
    +                "projectDescription": "Test description",
    +                "sampleProcessingProtocol": "Test sample processing protocol",
    +                "dataProcessingProtocol": "Test data processing protocol",
    +                "references": [{"pubmedId": "12345678"}],
    +                "keywords": ["keyword1", "keyword2"],
    +                "projectTags": ["tag1", "tag2"],
    +            },
    +        )

    Additional change needed outside this hunk:

    -from unittest.mock import patch
    +from unittest.mock import patch, Mock

    50-61: Avoid live network in unit tests; gate as integration and remove print.
    Prevents flakiness and noisy output; allows opt-in via an env flag.

         def test_populate_from_pride_archive_api(self):
    +        if os.environ.get("PRIDE_ONLINE") != "1":
    +            self.skipTest("Integration test requires PRIDE_ONLINE=1")
             project_accession = "PXD020453"
             project_manager = ProjectHandler(project_accession)
             project_manager.populate_from_pride_archive()
     
             # Assert that the project_info has been updated
             self.assertEqual(
                 project_manager.project.project_info["project_title"],
                 "Structural insights into Cullin4-RING ubiquitin ligase remodelling by Vpr from simian immunodeficiency viruses",
             )
    -        print(project_manager.project.project_info)

    Additional change needed outside this hunk:

    +import os

    If you prefer keeping this as a pure unit test, patch requests here too (same target as above) and assert on stable fields rather than a long, mutable title.


    6-6: ddt is imported but unused.
    Either add data-driven cases or remove the decorator/import to reduce test overhead.

    -from ddt import ddt
    +# from ddt import ddt  # Uncomment when adding data-driven tests
    @@
    -@ddt
     class TestProjectHandler(TestCase):

    Alternatively, parameterize via ddt using test_dataset.

    Also applies to: 17-18


    62-69: Enhance test_save_project_info with SDRF file existence and ingestion assertions

    The populate_from_sdrf call populates project_manager.project.project_info with keys like "organisms", "organism_parts", "diseases", "cell_lines", "instruments", "enzymes" and "acquisition_properties". Strengthen the test to:

    • Verify the SDRF test file is present
    • Assert that at least one SDRF-derived property is added to project_info

    In tests/core/test_project.py around lines 66–69, update:

         def test_save_project_info(self):
             project_accession = test_dataset[0]
             sdrf_file = test_dataset[1]
    
             project_manager = ProjectHandler(project_accession)
             project_manager.populate_from_pride_archive()
    -        project_manager.populate_from_sdrf(sdrf_file)
    +        # Ensure the SDRF test file exists
    +        self.assertTrue(sdrf_file.exists(), f"Missing SDRF test file: {sdrf_file}")
    +        project_manager.populate_from_sdrf(sdrf_file)
    +        # Verify SDRF ingestion added properties to project_info
    +        info = project_manager.project.project_info
    +        self.assertIn("organisms", info)
    +        self.assertIsInstance(info["organisms"], list)
    +        self.assertGreater(len(info["organisms"]), 0)
    tests/core/test_large_mztab_performance.py (1)

    20-22: Make CI skip robust to different CI values

    Skip if CI is set to any truthy string, not just "true".

    -@pytest.mark.skipif(
    -    os.environ.get("CI") == "true", reason="Skipping performance test in CI environment"
    -)
    +@pytest.mark.skipif(
    +    bool(os.environ.get("CI")), reason="Skipping performance test in CI environment"
    +)
    quantmsio/operate/tools.py (4)

    29-33: Tighten return typing for init_save_info

    Second element can become a ParquetWriter; advertise Optional type.

    -def init_save_info(parquet_path: str) -> tuple[dict, None, str]:
    +def init_save_info(parquet_path: str) -> tuple[dict, Optional[pq.ParquetWriter], str]:

    124-134: Fix typo in docstring: “retrun” → “return”

    Minor documentation fix.

    -    retrun: None
    +    return: None

    159-175: Consider using sets in peptide_map to avoid O(n) duplicate checks

    Using a set avoids repeated membership scans; convert to list at the end if needed.

    -def get_peptide_map(unique_peptides: list, fasta: str) -> defaultdict:
    -    peptide_map: defaultdict = defaultdict(list)
    +def get_peptide_map(unique_peptides: list, fasta: str) -> defaultdict:
    +    peptide_map: defaultdict = defaultdict(set)
    @@
    -            if accession not in peptide_map[peptide]:
    -                peptide_map[peptide].append(accession)
    -    return peptide_map
    +            peptide_map[peptide].add(accession)
    +    # downstream expects lists
    +    return defaultdict(list, {k: list(v) for k, v in peptide_map.items()})

    261-268: Make column drop resilient when label is absent

    Avoid KeyError if label/reference aren’t present in the merge result.

    -        df.drop(
    +        df.drop(
                 [
                     "reference",
                     "label",
                 ],
    -            axis=1,
    -            inplace=True,
    +            axis=1,
    +            inplace=True,
    +            errors="ignore",
             )
    tests/core/operate/test_query.py (4)

    13-17: Replace prints with minimal assertions

    Ensure generators yield at least one batch; avoid noisy output.

    -        for _ in q.iter_samples():
    -            print("ok")
    +        assert next(q.iter_samples(), None) is not None

    23-27: Assert iterator yields

    Same rationale as above.

    -        for _ in q.iter_chunk():
    -            print("ok")
    +        assert next(q.iter_chunk(), None) is not None

    33-37: Assert iterator yields

    Same rationale as above.

    -        for _ in q.iter_file():
    -            print("ok")
    +        assert next(q.iter_file(), None) is not None

    43-48: Validate result of inject_position_msg

    Guard against silent no-ops.

             protein_dict = q.get_protein_dict(fasta)
    -        q.inject_position_msg(df, protein_dict)
    +        out = q.inject_position_msg(df, protein_dict)
    +        assert "pg_positions" in out.columns
    quantmsio/utils/file_utils.py (1)

    41-56: Simplify existence check using all(...)

    Matches Ruff SIM110 hint.

    -    for file_name in file_names:
    -        # Now we have Path / str which works correctly on all platforms
    -        if not (path_obj / file_name).exists():
    -            return False
    -    return True
    +    return all((path_obj / file_name).exists() for file_name in file_names)
    quantmsio/core/fragpipe/fragpipe.py (3)

    5-5: Remove unused import date

    Per static analysis hint.

    -from datetime import date

    11-11: Remove unused import pyarrow.parquet as pq

    Per static analysis hint.

    -import pyarrow.parquet as pq

    405-407: Optionally pass string path to ParquetBatchWriter for wider compatibility

    Minor portability improvement.

    -                    batch_writer = ParquetBatchWriter(
    -                        output_path,
    -                        schema=batch.schema,
    -                    )
    +                    batch_writer = ParquetBatchWriter(str(output_path), schema=batch.schema)
    quantmsio/utils/constants.py (1)

    106-117: Consider adding validation for mzTab protein columns.

    The MZTAB_PROTEIN_COLUMNS list contains a mix of referenced constants (ACCESSION, MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE) and literal strings. This is good for maintainability, but consider whether some of the literal strings (e.g., "ambiguity_members") should also be constants if they're used elsewhere.

    If any of these literal strings are referenced in other parts of the codebase, consider extracting them as constants:

    +AMBIGUITY_MEMBERS_COLUMN = "ambiguity_members"
    +DESCRIPTION_COLUMN = "description"
    +MODIFICATIONS_COLUMN = "modifications"
    +PROTEIN_COVERAGE_COLUMN = "protein_coverage"
    +
     MZTAB_PROTEIN_COLUMNS = [
         ACCESSION,
    -    "description",
    +    DESCRIPTION_COLUMN,
         MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE,
    -    "ambiguity_members",
    +    AMBIGUITY_MEMBERS_COLUMN,
    -    "modifications",
    +    MODIFICATIONS_COLUMN,
    -    "protein_coverage",
    +    PROTEIN_COVERAGE_COLUMN,
         "opt_global_Posterior_Probability_score",
         "opt_global_cv_PRIDE:0000303_decoy_hit",
         "opt_global_result_type",
     ]
    quantmsio/utils/mztab_utils.py (4)

    1-4: Add module-level docstring for better documentation.

    The module lacks a docstring describing its purpose and contents.

    Add a module-level docstring:

    +"""
    +Utilities for parsing and processing mzTab format files.
    +
    +This module provides functions for parsing mzTab metadata, headers, protein and PSM lines,
    +as well as handling modifications and peptidoform conversions from OpenMS format.
    +"""
     import re
     from typing import Optional, Dict, List

    33-43: Simplify nested loop iteration.

    The nested loop for finding the accession can be simplified.

             elif "site" in line_parts[1]:
                 index = line_parts[1].split("[")[1].split("]")[0]
                 accession = None
    -            for (
    -                key,
    -                value,
    -            ) in (
    -                _modifications.items()
    -            ):  # for name, age in dictionary.iteritems():  (for Python 2.x)
    +            for key, value in _modifications.items():
                     if value[1] == index:
                         accession = key

    280-289: Use ternary operator for cleaner code.

    The if-else block can be simplified to a ternary operator as suggested by static analysis.

     def _clean_file_path(file_path: str) -> str:
         """
         Clean a file path to extract just the filename without extension.
     
         This helper function removes the file:// protocol and extracts the filename
         without the file extension from various file path formats.
     
         Args:
             file_path: File path that may include file:// protocol and extensions
     
         Returns:
             Clean filename without protocol and extension
     
         Example:
             >>> _clean_file_path("file:///path/to/file.mzML")
             'file'
             >>> _clean_file_path("/data/sample_001.mzML")
             'sample_001'
             >>> _clean_file_path("relative/path/file.txt")
             'file'
         """
    -    if file_path.startswith("file://"):
    -        # Remove file:// protocol
    -        clean_path = file_path[7:]
    -    else:
    -        clean_path = file_path
    +    clean_path = file_path[7:] if file_path.startswith("file://") else file_path
     
         # Extract filename without extension
         filename = clean_path.split("/")[-1].split(".")[0]
         return filename

    533-658: Consider refactoring the complex modification parsing function.

    The parse_modifications_openms function is quite complex with deeply nested conditionals. Consider breaking it down into smaller, more focused helper functions for better readability and maintainability.

    Consider extracting the terminal modification handling logic into separate helper functions:

    def _handle_n_terminal_mod(peptidoform_to_process: str, mod_name_to_accession: dict, parsed_mods: dict) -> str:
        """Handle N-terminal modification parsing."""
        if peptidoform_to_process.startswith(".("):
            match = re.match(r"\.\(([^)]+)\)", peptidoform_to_process)
            if match:
                mod_name = match.group(1)
                accession = mod_name_to_accession.get(mod_name, mod_name)
                if accession not in parsed_mods:
                    parsed_mods[accession] = {"name": mod_name, "positions": []}
                parsed_mods[accession]["positions"].append(
                    {"position": "N-term.0", "scores": None}
                )
                return peptidoform_to_process[len(match.group(0)):]
        return peptidoform_to_process
    
    def _handle_c_terminal_mod(peptidoform_to_process: str, pure_sequence: str, mod_name_to_accession: dict, parsed_mods: dict) -> tuple[str, Optional[str]]:
        """Handle C-terminal modification parsing."""
        c_term_mod_name = None
        if peptidoform_to_process.endswith(")") and peptidoform_to_process.rfind(".(") > 0:
            match = re.search(r"\.\(([^)]+)\)$", peptidoform_to_process)
            if match:
                c_term_mod_name = match.group(1)
                peptidoform_to_process = peptidoform_to_process[:-len(match.group(0))]
        return peptidoform_to_process, c_term_mod_name

    This would make the main function cleaner and easier to test.

    quantmsio/core/common.py (2)

    41-41: Consider using compiled regex for better performance.

    The OPENMS_PROTEIN_QVALUE_WORDS list suggests these words will be used for pattern matching. If this is used frequently, consider pre-compiling a regex pattern.

    import re
    
    # Regular expression that ensures the protein search engine score contains required words
    OPENMS_PROTEIN_QVALUE_PATTERN = re.compile(
        r"(?=.*openms)(?=.*target-decoy)(?=.*protein)(?=.*q-value)",
        re.IGNORECASE
    )

    98-103: Add documentation for removed fields.

    The comments mention that certain fields were removed in DIA-NN version 2.0. Consider adding version compatibility information or migration guidance.

    Consider adding a version compatibility note at the module level or creating a separate compatibility matrix documentation:

    # DIA-NN Version Compatibility:
    # - Version 2.0+: Removed "File.Name", "PG.Quantity", "PG.Normalised" from main report
    # - Version 2.0+: Added "Decoy" to main report
    # - Version 2.0+: Added "N.Sequences", "N.Proteotypic.Sequences" to report.pg_matrix.tsv
    debug_test_convert_to_parquet.py (3)

    9-9: Remove unused import.

    os isn’t used.

    -import os

    44-45: Drop unnecessary f-string.

    No placeholders present.

    -        print(f"=== Full traceback: ===")
    +        print("=== Full traceback: ===")

    16-24: Make logging setup resilient and Unicode-safe.

    Avoid duplicate handlers when re-running, and set encoding for the file handler.

     logging.basicConfig(
         level=logging.DEBUG,
         format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
         handlers=[
             logging.StreamHandler(sys.stdout),
    -        logging.FileHandler("debug_test_convert_to_parquet.log"),
    +        logging.FileHandler("debug_test_convert_to_parquet.log", encoding="utf-8"),
         ],
    +    force=True,
     )
    tests/core/quantms/psms/test_mztab_functions.py (2)

    36-49: Consider adding a probability-bearing case.

    Add a case like 2[0.7]|5[0.3]-UNIMOD:4 to assert probabilities are stripped while positions remain integers.

    I can add a parametrized test covering probability annotations if you want.
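
    A hedged sketch of such a case; normalize_modification_positions is a hypothetical helper name and should be replaced with the function this test module actually exercises:

    import pytest

    @pytest.mark.parametrize(
        "raw, expected",
        [
            ("2|5-UNIMOD:4", "2|5-UNIMOD:4"),
            ("2[0.7]|5[0.3]-UNIMOD:4", "2|5-UNIMOD:4"),  # probabilities stripped, positions kept
        ],
    )
    def test_positions_with_probabilities(raw, expected):
        assert normalize_modification_positions(raw) == expected  # hypothetical helper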


    51-57: Reduce flakiness by making order-agnostic (optional).

    Current assertion depends on dict insertion order. If upstream changes order, the test could fail even with correct content.

    -    assert modification_list == ["2|5-UNIMOD:4", "1|3|7-UNIMOD:21"]
    +    assert set(modification_list) == {"2|5-UNIMOD:4", "1|3|7-UNIMOD:21"}
    tests/core/quantms/test_feature.py (2)

    16-25: Gracefully skip when example data is missing.

    Mirror the idXML pattern to avoid brittle failures in environments without the example data.

    +import pytest
    @@
    -    # Initialize Feature
    +    # Skip if data files don't exist
    +    if not (mztab_file.exists() and msstats_file.exists() and sdrf_file.exists()):
    +        pytest.skip("Example data files not found")
    +    # Initialize Feature
         feature = Feature(mztab_file, sdrf_file, msstats_file)

    Apply the same guard in test_extract_psm_msg and test_generate_feature.

    Also applies to: 43-52, 70-79


    83-89: Optional: cap iterations to keep tests fast.

    If generators can yield many batches, consider breaking after N iterations.

    -        for feature_table in feature.generate_feature():
    +        for feature_table in feature.generate_feature():
                 # Add assertions to verify the result
                 assert feature_table is not None
                 assert len(feature_table) > 0
                 assert "peptidoform" in feature_table.column_names
                 count += 1
    +            if count >= 3:
    +                break
    docs/pg.avsc (1)

    79-79: Consider boolean for decoy flag.

    is_decoy fits boolean better than int.

    -        {"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the protein group is a decoy, 0 if target"},
    +        {"name": "is_decoy", "type": "boolean", "doc": "Decoy indicator: true if decoy, false if target"},
    tests/core/test_sdrf.py (3)

    15-15: Remove redundant private call.

    __init__ already loads the SDRF; calling _load_sdrf_info again is unnecessary and touches a private API.

    -    sdrf_handler._load_sdrf_info(file)

    34-43: Align allowed experiment types with implementation (optional).

    get_experiment_type_from_sdrf currently returns LFQ/TMT*/ITRAQ* only. Including "SILAC" here may be misleading.

    If SILAC support is intended, please confirm and update quantmsio/core/sdrf.py accordingly; otherwise remove it from the allowed list in tests.


    110-115: Use direct dict membership.

    Simpler and faster than checking .keys().

    -    for mod in expected_mods:
    -        assert any(
    -            mod in key for key in mods_dict.keys()
    -        ), f"Expected modification {mod} not found in mods_dict"
    +    for mod in expected_mods:
    +        assert any(mod in key for key in mods_dict), (
    +            f"Expected modification {mod} not found in mods_dict"
    +        )
    quantmsio/quantmsioc.py (2)

    103-114: Register DuckDB and project conversion commands or drop imports.

    duckdb_creator and quantms_project_convert are imported but unused (Ruff F401). If intended, register them under convert; otherwise remove the imports.

     convert.add_command(convert_idxml_file, name="idxml")
    +convert.add_command(duckdb_creator, name="duckdb")
    +convert.add_command(quantms_project_convert, name="quantms-project")

    Run flake8 after this change to confirm F401 is resolved.


    116-124: Give transform subcommands stable, human-friendly names.

    Right now Click will default to function names (long/unstable). Use explicit names matching the PR surface.

    -transform.add_command(ae_transform)
    -transform.add_command(de_transform)
    -transform.add_command(gene_transform)
    -transform.add_command(ibaq_transform)
    -transform.add_command(spectra_transform)
    -transform.add_command(uniprot_transform)
    -transform.add_command(anndata_transform)
    +transform.add_command(ae_transform, name="ae")
    +transform.add_command(de_transform, name="de")
    +transform.add_command(gene_transform, name="gene")
    +transform.add_command(ibaq_transform, name="ibaq")
    +transform.add_command(spectra_transform, name="spectra")
    +transform.add_command(uniprot_transform, name="uniprot")
    +transform.add_command(anndata_transform, name="anndata")
    tests/core/idxml/test_idxml_integration.py (2)

    2-2: Remove unused import.

    os is imported but never used.

    -import os

    82-88: Optional: Align with the new CLI entry (future-proof).

    This test invokes a Click command function directly from convert.quantms. Given the CLI now exposes convert idxml, consider testing the top-level CLI too for integration coverage. No action required if you want to keep this as a unit-style CLI test.
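
    A sketch of what that could look like with Click's test runner; the top-level group name cli is an assumption about quantmsio.quantmsioc, while the convert/idxml names follow the registration shown earlier in this review:

    from click.testing import CliRunner
    from quantmsio.quantmsioc import cli  # assumed entry-point group name

    def test_convert_idxml_help():
        runner = CliRunner()
        result = runner.invoke(cli, ["convert", "idxml", "--help"])
        assert result.exit_code == 0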

    tests/core/quantms/test_diann.py (1)

    16-29: LGTM on feature path; consider asserting outputs to strengthen the test.

    You exercise the full path but don’t validate shapes or schema. Adding minimal assertions (e.g., non-empty DataFrame after transform) would catch silent regressions.
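
    For example (the variable name below is a stand-in for whatever the existing test produces after the transform, and the column is assumed from the feature schema):

    assert feature_df is not None and len(feature_df) > 0  # non-empty after transform
    assert "peptidoform" in feature_df.columns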

    tests/core/idxml/test_idxml_psm.py (1)

    198-201: Assert a specific exception type.

    Catching bare Exception hides failures. If the loader raises a file-not-found error, make it explicit.

    -    with pytest.raises(Exception):
    +    with pytest.raises(FileNotFoundError):
             IdXmlPsm("non_existent_file.idXML")

    If the actual exception differs (e.g., OSError), update the type accordingly.

    tests/core/quantms/test_diann_duckdb_performance.py (1)

    3-8: Remove unused imports.

    os and pandas are not used in this test module.

    Apply:

    -import os
    ...
    -import pandas as pd
    tests/core/quantms/psms/test_mztab_indexer.py (1)

    129-162: Reduce log spam and turn referential checks into actionable assertions.

    Current loop logs every accession and a boolean. Aggregate missing items and assert, or at least summarize.

    Example refactor:

    -        for psm_acc_group in psm_accessions:
    -            for psm_acc in psm_acc_group.split(","):
    -                logging.info(f"PSM accession {psm_acc} protein group not found")
    -                logging.info(
    -                    any(psm_acc in prot_acc for prot_acc in all_protein_accessions)
    -                )
    +        missing_psm = []
    +        for psm_acc_group in psm_accessions:
    +            for psm_acc in psm_acc_group.split(","):
    +                if not any(psm_acc in prot_acc for prot_acc in all_protein_accessions):
    +                    missing_psm.append(psm_acc)
    +        assert not missing_psm, f"{len(missing_psm)} PSM accessions not found in proteins: {missing_psm[:10]}"
    
    -        for msstats_acc_group in msstats_accessions:
    -            for msstats_acc in msstats_acc_group.split(","):
    -                logging.info(f"MSstats accession {msstats_acc} protein group not found")
    -                logging.info(
    -                    any(msstats_acc in prot_acc for prot_acc in all_protein_accessions)
    -                )
    +        missing_ms = []
    +        for msstats_acc_group in msstats_accessions:
    +            for msstats_acc in msstats_acc_group.split(","):
    +                if not any(msstats_acc in prot_acc for prot_acc in all_protein_accessions):
    +                    missing_ms.append(msstats_acc)
    +        assert not missing_ms, f"{len(missing_ms)} MSstats accessions not found in proteins: {missing_ms[:10]}"

    Also applies to: 269-293

    tests/core/quantms/test_maxquant.py (1)

    1-2: Prune unused imports.

    os, tempfile, and MzTabProteinGroups are unused.

    Apply:

    -import os
    -import tempfile
    ...
    -from quantmsio.core.quantms.pg import MzTabProteinGroups

    Also applies to: 10-12

    tests/core/quantms/psms/test_duckdb_creation.py (5)

    1264-1280: Use contextlib.suppress for cleanup helpers.

    Simplifies error-tolerant cleanup and satisfies linters (Ruff SIM105, CodeFactor B110).

    Apply:

    +import contextlib
    ...
     def cleanup_mztab_indexer(mztab_indexer):
         """Clean up MzTabIndexer instance and its connections."""
         if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb:
    -        try:
    -            mztab_indexer._duckdb.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +        with contextlib.suppress(Exception):
    +            mztab_indexer._duckdb.close()
     
     def cleanup_duckdb_connection(con):
         """Clean up DuckDB connection."""
         if con:
    -        try:
    -            con.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +        with contextlib.suppress(Exception):
    +            con.close()

    559-563: Docstring/log text doesn’t match behavior (MSstats is included in Step 1).

    Either exclude MSstats in Step 1 or update wording. Given assertions expect msstats.parquet present, update text.

    Apply:

    -    """
    -    Test the two-step process with Parquet backend using TMT data:
    -    1. Create parquet store with mzTab data only
    -    2. Add MSstats data to the existing parquet store
    -    """
    +    """
    +    Test Parquet backend with TMT data:
    +    1. Create parquet store with mzTab + MSstats
    +    2. Open existing store and re-add MSstats (idempotent replace)
    +    """
    ...
    -        print("Step 1: Creating parquet store with mzTab data only...")
    +        print("Step 1: Creating parquet store with mzTab + MSstats...")

    Also applies to: 582-589


    586-596: Close indexer instances to release resources.

    Avoid leaked DuckDB connections and silence “assigned but unused” warnings by using the objects in cleanup.

    Apply:

    -        mztab_step1 = MzTabIndexer.create(
    +        mztab_step1 = MzTabIndexer.create(
                 mztab_path=mztab_path,
    -            msstats_path=msstats_path,
    +            msstats_path=msstats_path,
                 backend="parquet",
                 database_path=output_path,
             )
    ...
    -        mztab_existing = MzTabIndexer.open(
    +        mztab_existing = MzTabIndexer.open(
                 database_path=output_path,
                 backend="parquet",
             )
    ...
    -        # Clean up
    -        con.close()
    +        # Clean up
    +        cleanup_mztab_indexer(mztab_step1)
    +        cleanup_mztab_indexer(mztab_existing)
    +        con.close()

    Also applies to: 628-632, 694-696


    1013-1018: Ensure parity in parquet create/open test: close the created indexer.

    This both releases resources and makes mztab_new a used variable.

    Apply:

     mztab_new = MzTabIndexer.create(
         mztab_path=mztab_path,
         msstats_path=msstats_path,
         backend="parquet",
         database_path=output_path,
     )
    +cleanup_mztab_indexer(mztab_new)
    ...
    -        # Clean up
    -        con.close()
    +        # Clean up
    +        con.close()

    Also applies to: 1121-1123


    184-184: Remove extraneous f-strings without placeholders.

    Several print(f"...") have no {}. Drop the f to satisfy Ruff F541.

    You can auto-fix across this file:

    #!/bin/bash
    # Remove f-prefix when no {} placeholders
    python - <<'PY'
    import re, pathlib
    p = pathlib.Path("tests/core/quantms/psms/test_duckdb_creation.py")
    s = p.read_text()
    s = re.sub(r'print\(f"([^"{]*?)"\)', r'print("\1")', s)
    p.write_text(s)
    PY

    Also applies to: 214-214, 274-274, 310-310, 341-341, 401-401, 436-436, 475-475, 542-542, 580-580, 615-615, 662-662, 689-689, 719-719, 753-753, 806-806, 833-833, 863-863, 897-897, 950-950, 977-977, 1007-1007, 1042-1042, 1089-1089, 1116-1116, 1146-1146, 1181-1181, 1228-1228, 1255-1255

    tests/core/quantms/psms/test_psm.py (5)

    259-266: Avoid comparisons to True/False in tests.

    Use truthiness for clarity and to satisfy linters (E712).

    -    assert (
    -        _is_ms_run_location_line(
    -            ["MTD", "ms_run[1]-location", "file:///path/file.mzML"]
    -        )
    -        == True
    -    )
    -    assert _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"]) == False
    -    assert _is_ms_run_location_line(["PRT", "accession", "P12345"]) == False
    +    assert _is_ms_run_location_line(["MTD", "ms_run[1]-location", "file:///path/file.mzML"])
    +    assert not _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"])
    +    assert not _is_ms_run_location_line(["PRT", "accession", "P12345"])

    520-525: Prefer iterating dict directly (SIM118).

    -                for protein_acc in protein_qvalue_map.keys():
    +                for protein_acc in protein_qvalue_map:

    552-556: Remove redundant f-prefix in string.

    -                    logger.info(f"\nTop 10 protein groups with lowest q-values:")
    +                    logger.info("\nTop 10 protein groups with lowest q-values:")

    558-566: Drop unused loop variable.

    -                    for i, (accession, qvalue) in enumerate(sorted_groups[:10], 1):
    +                    for accession, qvalue in sorted_groups[:10]:

    414-593: This test is doing too much; consider parametrization/splitting.

    Split by concern (stats, bounds, groups) or parametrize over backends to reduce complexity and noise.
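
    A hedged sketch of one way to slice it; the backend values follow the two backends used elsewhere in this review, and the test body is left as a stub:

    import pytest

    @pytest.mark.parametrize("backend", ["duckdb", "parquet"])
    def test_protein_qvalue_statistics(backend, tmp_path):
        # Build the indexer once per backend and keep the assertions focused on
        # q-value statistics; bounds and group checks get their own tests.
        ...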

    quantmsio/utils/pride_utils.py (2)

    424-451: Condense optionals with dict.get and reduce branching.

    -    if "opt_global_cv_MS:1002217_decoy_peptide" in es:
    -        psm["is_decoy"] = es["opt_global_cv_MS:1002217_decoy_peptide"]
    -    else:
    -        psm["is_decoy"] = None
    +    psm["is_decoy"] = es.get("opt_global_cv_MS:1002217_decoy_peptide")
    
    -    if "opt_global_Posterior_Error_Probability_score" in es:
    -        psm["posterior_error_probability"] = es[
    -            "opt_global_Posterior_Error_Probability_score"
    -        ]
    -    else:
    -        psm["posterior_error_probability"] = None
    +    psm["posterior_error_probability"] = es.get("opt_global_Posterior_Error_Probability_score")
    
    -    if "opt_global_q-value" in es:
    -        psm["global_qvalue"] = es["opt_global_q-value"]
    -    else:
    -        psm["global_qvalue"] = None
    +    psm["global_qvalue"] = es.get("opt_global_q-value")
    
    -    if "opt_global_consensus_support" in es:
    -        psm["consensus_support"] = es["opt_global_consensus_support"]
    -    else:
    -        psm["consensus_support"] = None
    +    psm["consensus_support"] = es.get("opt_global_consensus_support")

    469-481: Raise ValueError for missing ms_run mapping; include context.

    Use a more specific exception type and a concise message for callers.

    -        if ms_runs is not None:
    -            if ms_run in ms_runs:
    -                psm["ms_run"] = ms_runs[ms_run]
    -            else:
    -                raise Exception(f"The ms run {ms_run} is not in the ms runs index")
    +        if ms_runs is not None:
    +            if ms_run in ms_runs:
    +                psm["ms_run"] = ms_runs[ms_run]
    +            else:
    +                raise ValueError(f"ms_run '{ms_run}' not found in provided ms_runs index")
    quantmsio/core/openms.py (2)

    3-3: Remove unused imports.

    -from typing import List, Dict, Any, Optional, Iterator, Union
    +from typing import List, Dict, Any, Optional, Union
    -from quantmsio.utils.mztab_utils import parse_peptidoform_openms
    +# parse_peptidoform_openms not used here; remove import

    Also applies to: 11-11


    217-226: Broaden numeric check to handle non-floatables safely.

    -        except ValueError:
    +        except (TypeError, ValueError):
                 return False
    tests/core/quantms/test_diann_duckdb.py (3)

    3-7: Remove unused imports.

    -import os
     import time
     from pathlib import Path
     
    -import pandas as pd
     import pytest

    40-41: Drop redundant f-strings without placeholders.

    -        print(f"\nSmall dataset statistics:")
    +        print("\nSmall dataset statistics:")
    -        print(f"\nFull dataset statistics:")
    +        print("\nFull dataset statistics:")
    -        print(f"\nFiltered protein groups matrix statistics (5 proteins):")
    +        print("\nFiltered protein groups matrix statistics (5 proteins):")
    -        print(f"\nCaching performance:")
    +        print("\nCaching performance:")

    Also applies to: 125-126, 232-233, 278-281


    106-249: Consolidate repeated stats logic via parametrization/fixtures (optional).

    Small and full tests duplicate large chunks. Consider pytest parametrization for dataset and thresholds to trim runtime and code size.
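
    A possible shape for that, with dataset paths and thresholds as placeholders rather than the real test inputs:

    import pytest

    @pytest.mark.parametrize(
        "report_path, min_rows",
        [("small_report.tsv", 10), ("full_report.tsv", 1000)],  # placeholder datasets/thresholds
    )
    def test_report_statistics(report_path, min_rows):
        ...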

    quantmsio/core/maxquant/maxquant.py (1)

    779-798: Simplify type checking for better maintainability.

    The explicit type comparisons violate Python best practices. Consider using isinstance() for type checking or a dictionary-based approach.

    Apply this refactor for cleaner type checking:

    -        for field, dtype in required_fields.items():
    -            if field not in df.columns:
    -                if dtype == list:
    -                    df[field] = [[] for _ in range(len(df))]
    -                elif dtype == str:
    -                    df[field] = None
    -                elif dtype == float:
    -                    df[field] = 0.0
    -                elif dtype == int:
    -                    df[field] = 0
    -            else:
    -                # Handle type conversions
    -                if dtype == float:
    -                    df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0)
    -                elif dtype == int:
    -                    df[field] = (
    -                        pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int)
    -                    )
    -                elif dtype == str:
    -                    df[field] = df[field].astype(str).fillna("")
    +        default_values = {
    +            list: lambda: [[] for _ in range(len(df))],
    +            str: lambda: None,
    +            float: lambda: 0.0,
    +            int: lambda: 0
    +        }
    +        
    +        for field, dtype in required_fields.items():
    +            if field not in df.columns:
    +                df[field] = default_values[dtype]()
    +            else:
    +                # Handle type conversions
    +                if dtype == float:
    +                    df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0)
    +                elif dtype == int:
    +                    df[field] = (
    +                        pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int)
    +                    )
    +                elif dtype == str:
    +                    df[field] = df[field].astype(str).fillna("")
    tests/core/quantms/test_msstats_in.py (4)

    1-17: Remove unused imports to clean up the module.

    The gzip and tempfile imports are not used in the code.

    -import gzip
    -import tempfile
     from pathlib import Path

    74-74: Remove extraneous f-string prefixes.

    These strings don't contain any placeholders, so the f-prefix is unnecessary.

    -            print(f"Intensity distribution analysis completed")
    +            print("Intensity distribution analysis completed")
    
    -        print(f"\nLFQ test completed successfully!")
    +        print("\nLFQ test completed successfully!")
    
    -    print(f"\nTMT test completed successfully!")
    +    print("\nTMT test completed successfully!")
    
    -                print(f"Intensity distribution analysis completed")
    +                print("Intensity distribution analysis completed")
    
    -    print(f"\nDataset Comparison:")
    +    print("\nDataset Comparison:")

    Also applies to: 117-117, 465-465, 536-536, 589-589


    272-319: Consider breaking down the complex channel processing function.

    The _process_channel_statistics function handles both direct channel counts and intensities-based channel counts with nested logic. Consider splitting into two separate functions for clarity.

     def _process_channel_statistics(
         msstats_batch, channel_counts, file_channel_matrix, channel_column="Channel"
     ):
         """Process channel statistics from batch data."""
         # Count channels from the specified channel column
         chan_counts = msstats_batch[channel_column].value_counts()
         for channel, count in chan_counts.items():
             if channel not in channel_counts:
                 channel_counts[channel] = 0
             channel_counts[channel] += count
     
    +    # Process intensities separately
    +    _process_intensities_channel_statistics(
    +        msstats_batch, channel_counts, file_channel_matrix
    +    )
    +
    +def _process_intensities_channel_statistics(
    +    msstats_batch, channel_counts, file_channel_matrix
    +):
    +    """Process channel statistics from intensities column."""
         # Count channels from the 'intensities' column
         intensities_channel_counts = {}
         file_channel_matrix_from_intensities = {}
     
         for _, row in msstats_batch.iterrows():
             ref_file = row["reference_file_name"]
             if "intensities" in row and row["intensities"] is not None:
                 for intensity_info in row["intensities"]:
                     channel = intensity_info["channel"]
     
                     # Count channels from intensities
                     if channel not in intensities_channel_counts:
                         intensities_channel_counts[channel] = 0
                     intensities_channel_counts[channel] += 1
     
                     # Count file-channel combinations from intensities
                     if ref_file not in file_channel_matrix_from_intensities:
                         file_channel_matrix_from_intensities[ref_file] = {}
                     if channel not in file_channel_matrix_from_intensities[ref_file]:
                         file_channel_matrix_from_intensities[ref_file][channel] = 0
                     file_channel_matrix_from_intensities[ref_file][channel] += 1
     
         # Update the main counting with intensities data
         if intensities_channel_counts:
    -        for channel, count in intensities_channel_counts.items():
    -            if channel not in channel_counts:
    -                channel_counts[channel] = 0
    -
             # Use intensities-based matrix
             for ref_file, channels in file_channel_matrix_from_intensities.items():
                 if ref_file not in file_channel_matrix:
                     file_channel_matrix[ref_file] = {}
                 for channel, count in channels.items():
                     if channel not in file_channel_matrix[ref_file]:
                         file_channel_matrix[ref_file][channel] = 0
                     file_channel_matrix[ref_file][channel] += count

    357-357: Simplify dictionary key check.

    Use simpler syntax for checking if a key exists in a dictionary.

    -            for channel in file_channels.keys()
    +            for channel in file_channels
    quantmsio/commands/convert/quantms.py (1)

    7-7: Remove unused import.

    The tempfile import at the module level is unused since you're importing it locally where needed.

    -import tempfile
     from typing import Optional
    quantmsio/core/idxml/idxml.py (1)

    397-397: Remove unnecessary f-string prefix.

    The string doesn't contain any placeholders.

    -        logger.info(f"=== DEBUG: convert_to_parquet started ===")
    +        logger.info("=== DEBUG: convert_to_parquet started ===")
    quantmsio/core/diann/diann.py (2)

    7-7: Remove unused imports.

    Several imports are unused in the code.

    -from typing import Union, Optional, List
    +from typing import Union
     
     ...
     
    -from quantmsio.core.duckdb import DiannDuckDB
    -from quantmsio.core.project import create_uuid_filename
     from quantmsio.core.quantms.feature import Feature
     from quantmsio.core.quantms.mztab import MzTab
     from quantmsio.core.sdrf import SDRFHandler
    -from quantmsio.operate.tools import get_ahocorasick

    Also applies to: 23-24, 28-28


    343-345: Add error handling for type conversion.

    The type conversion could fail if the data is not in the expected format.

             group = group.copy()
    -        if group["rt"].dtype != "float64":
    -            group["rt"] = group["rt"].astype("float64")
    +        try:
    +            if group["rt"].dtype != "float64":
    +                group["rt"] = pd.to_numeric(group["rt"], errors="coerce")
    +        except Exception as e:
    +            logging.warning(f"Failed to convert retention time to float: {e}")
    quantmsio/core/quantms/psm.py (2)

    242-255: Remove unused loop variable to improve code clarity.

    The index variable is not used in the loop body and should be replaced with _ to indicate it's intentionally unused.

     # Now we need to convert the temp_modifications to the final modifications dictionary where key
     # of the dictionary is accession, and indexes are moved as second element of the tuple. It could be that
     # one accession has mutliple indexes, then we need to group them and add then also site and position.
     modifications = {}
    -for index, (
    +for _, (
         name,
         accession,
         sites,
         positions,
     ) in temp_modifications.items():

    506-507: Remove extraneous f-string prefix.

    The f-string has no placeholders and doesn't need the f prefix.

    -                self.logger.warning(f"Could not convert global_qvalue to float")
    +                self.logger.warning("Could not convert global_qvalue to float")
    quantmsio/core/duckdb.py (1)

    197-197: Remove unused variable assignment.

    The result variable is assigned but never used in the performance stats method.

                 try:
    -                result = self.query_to_df(query)
    +                self.query_to_df(query)
                     end_time = time.time()
    quantmsio/core/quantms/feature.py (6)

    2-8: Remove unused imports.

    Several imports are not used in the code and should be removed for cleaner code.

     import logging
    -import tempfile
    -from pathlib import Path
    -from typing import Union
     
     import pandas as pd
     import pyarrow as pa
    -import pyarrow.parquet as pq

    14-16: Remove unused imports from the module.

    The Psm and get_protein_accession imports are not directly used at the module level.

    -from quantmsio.core.quantms.psm import Psm
     from quantmsio.core.sdrf import SDRFHandler
    -from quantmsio.operate.tools import get_protein_accession

    229-238: Simplify nested if statements.

    The nested if statements can be combined into a single condition for better readability.

                 for _, row in metadata_df.iterrows():
    -                if "fixed_mod[" in row["key"] or "var_mod[" in row["key"]:
    -                    if "site" not in row["key"] and "position" not in row["key"]:
    +                if ("fixed_mod[" in row["key"] or "var_mod[" in row["key"]) and "site" not in row["key"] and "position" not in row["key"]:
                             values = (
                                 row["value"].replace("[", "").replace("]", "").split(",")
                             )

    245-246: Fix unused loop variables.

Loop variables that are not used should be prefixed with an underscore to signal intent.

    -            for accession, (name, index, site, position) in modifications.items():
    +            for accession, (name, _, site, position) in modifications.items():
                     mods_map[name] = [accession, site or "X"]

    317-317: Remove unused variable assignment.

    The charge_col variable is assigned but never used.

             # Add charge column if available, otherwise use default
             if "Charge" in msstats_batch.columns:
                 grouping_cols.append("Charge")
    -            charge_col = "Charge"
             elif "PrecursorCharge" in msstats_batch.columns:
                 grouping_cols.append("PrecursorCharge")
    -            charge_col = "PrecursorCharge"
             else:
                 # Add a default charge if not available
                 msstats_batch["Charge"] = 3
                 grouping_cols.append("Charge")
    -            charge_col = "Charge"

    647-763: Consider breaking down complex conversion method.

    The convert_to_parquet_format method is quite long and handles many different column types. Consider refactoring into smaller helper methods for better maintainability.

    Would you like me to help refactor this method into smaller, more focused helper methods like _convert_float_columns, _convert_int_columns, _convert_list_columns, etc.? This would improve readability and make the code easier to test and maintain.
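
    For illustration, a minimal sketch of what that split could look like, assuming the method mostly coerces column dtypes; the helper names and column lists below are hypothetical and not taken from feature.py.

    import pandas as pd

    # Hypothetical helpers; names and column lists are illustrative only.
    def _convert_float_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Coerce the given columns to float64, turning bad values into NaN."""
        for col in columns:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce").astype("float64")
        return df

    def _convert_int_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Coerce the given columns to a nullable integer dtype."""
        for col in columns:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
        return df

    def _convert_list_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
        """Ensure the given columns hold Python lists (None stays None)."""
        for col in columns:
            if col in df.columns:
                df[col] = df[col].apply(
                    lambda v: v if v is None or isinstance(v, list) else list(v)
                )
        return df

    def convert_to_parquet_format(df: pd.DataFrame) -> pd.DataFrame:
        """The top-level method becomes a thin pipeline over focused helpers."""
        df = _convert_float_columns(df, ["rt", "observed_mz", "calculated_mz"])
        df = _convert_int_columns(df, ["precursor_charge", "number_peaks"])
        df = _convert_list_columns(df, ["mz_array", "intensity_array"])
        return df

    Each helper can then be unit-tested in isolation, which is the maintainability gain this comment is after.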

    @@ -15,47 +15,72 @@
    {"name": "file_type", "type": "string", "doc": "Type of the file (feature_file)"},
    {"name": "creation_date", "type": "string", "doc": "Date when the file was created"},
    {"name": "uuid", "type": "string", "doc": "Unique identifier for the file"},
    {"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},

    💡 Verification agent

    🧩 Analysis chain

    Verify the scan_format field values are validated.

    The scan_format field documentation mentions possible values (scan, index, nativeId) but the schema doesn't enforce these values through an enum type.

    Consider using an Avro enum type to enforce valid values:

    -{"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},
    +{"name": "scan_format", "type": {"type": "enum", "name": "ScanFormat", "symbols": ["scan", "index", "nativeId"]}, "doc": "The format of the scan"},

    🏁 Script executed:

    #!/bin/bash
    # Search for validation of scan_format values in the codebase
    rg -n "scan_format.*(?:scan|index|nativeId)" --type py
    
    # Check if there's any enum or validation for scan_format
    ast-grep --pattern 'scan_format == $_' 

    Length of output: 377


    Enforce scan_format values via an Avro enum

    I didn’t find any code that validates scan_format against the documented values (“scan”, “index”, “nativeId”), nor does the Avro schema enforce them. Without schema-level enforcement or runtime checks, invalid values could slip through and cause downstream failures.

    Suggested changes:

    • Update the Avro schema (docs/feature.avsc) to use an enum for scan_format.
    • Review all code that serializes or deserializes this field—e.g.
      • quantmsio/core/idxml/idxml.py:366
      • quantmsio/core/quantms/psm.py:326
      • quantmsio/core/quantms/feature.py:264
        to handle the enum type or validate string values accordingly.

    Diff for docs/feature.avsc:

    - {"name": "scan_format", "type": "string", "doc": "The format of the scan, with possible values: scan, index, nativeId."},
    + {"name": "scan_format",
    +  "type": {
    +    "type": "enum",
    +    "name": "ScanFormat",
    +    "symbols": ["scan", "index", "nativeId"]
    +  },
    +  "doc": "The format of the scan"}
    🤖 Prompt for AI Agents
    In docs/feature.avsc around line 18, replace the current string-typed
    "scan_format" field with an Avro enum type that defines the allowed symbols
    ["scan","index","nativeId"] so the schema enforces valid values; then
    update/inspect the serialization/deserialization and validation sites
    (quantmsio/core/idxml/idxml.py:366, quantmsio/core/quantms/psm.py:326,
    quantmsio/core/quantms/feature.py:264) to handle the enum (either by mapping
    enum values to/from strings or by adding explicit runtime validation that
    accepts only these three values), and run schema compatibility tests to ensure
    no breaking changes for existing data.
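
    As a complement to the schema change, a minimal sketch of the runtime check described above; the constant and function names are hypothetical, not existing quantms.io API.

    VALID_SCAN_FORMATS = {"scan", "index", "nativeId"}  # values documented in the schema

    def validate_scan_format(value: str) -> str:
        """Return the value unchanged if it is an allowed scan_format, else raise."""
        if value not in VALID_SCAN_FORMATS:
            raise ValueError(
                f"Invalid scan_format {value!r}; expected one of {sorted(VALID_SCAN_FORMATS)}"
            )
        return value

    # Example use at a serialization site:
    file_metadata = {"scan_format": validate_scan_format("scan")}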
    

    Comment on lines +74 to +75
    "type": "struct",
    "fields": [

    ⚠️ Potential issue

    Fix inconsistent field type definition.

    The best_id_score field uses "struct" type which is not standard in Avro. Avro uses "record" for complex types.

    Apply this fix to use the correct Avro type:

     {"name": "best_id_score", "type": ["null", {
         "type": "array",
         "items": {
    -        "type": "struct",
    +        "type": "record",
    +        "name": "best_id_score_entry",
             "fields": [
                 {"name": "score_name", "type": "string"},
                 {"name": "score_value", "type": "float32"}
             ]
             }
         }],
     "doc": "A named score type and value representing an identification's measure of confidence or input feature"
     },
    🤖 Prompt for AI Agents
    In docs/peptide.avsc around lines 74 to 75, the complex type for the
    best_id_score field is incorrectly declared as "struct"; change that to the Avro
    standard "record" (i.e., replace "type": "struct" with "type": "record") and
    ensure the record includes a proper "name" and "fields" array so the schema
    remains valid.
    

    Comment on lines +28 to +41
    {"name": "peptide_counts", "type": ["null", {
    "type": "struct",
    "fields": [
    {"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"},
    {"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"}
    ]
    }], "doc": "Peptide sequence counts for this protein group in this specific file"},
    {"name": "feature_counts", "type": ["null", {
    "type": "struct",
    "fields": [
    {"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"},
    {"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"}
    ]
    }], "doc": "Peptide feature counts for this protein group in this specific file"},

    🛠️ Refactor suggestion

    Avro type: use record, not struct; add names.

    Avro JSON doesn’t support struct; nested types must be record with a name.

    -        {"name": "peptide_counts", "type": ["null", {
    -            "type": "struct",
    -            "fields": [
    +        {"name": "peptide_counts", "type": ["null", {
    +            "type": "record",
    +            "name": "pg_peptide_counts",
    +            "fields": [
                     {"name": "unique_sequences", "type": "int", "doc": "Number of peptide sequences unique to this protein group within this file"},
                     {"name": "total_sequences", "type": "int", "doc": "Total number of peptide sequences identified for this protein group in this file"}
                 ]
             }], "doc": "Peptide sequence counts for this protein group in this specific file"},
    -        {"name": "feature_counts", "type": ["null", {
    -            "type": "struct",
    -            "fields": [
    +        {"name": "feature_counts", "type": ["null", {
    +            "type": "record",
    +            "name": "pg_feature_counts",
    +            "fields": [
                     {"name": "unique_features", "type": "int", "doc": "Number of unique peptide features (peptide charge combinations) specific to this protein group within this file"},
                     {"name": "total_features", "type": "int", "doc": "Total number of peptide features identified for this protein group in this file"}
                 ]
             }], "doc": "Peptide feature counts for this protein group in this specific file"},
    🤖 Prompt for AI Agents
    In docs/pg.avsc around lines 28-41, the Avro schema uses "struct" for nested
    types which is invalid JSON Avro; replace each nested "struct" with a "record"
    object and add a unique "name" property (e.g., PeptideCounts, FeatureCounts)
    inside each record, keeping the "fields" arrays intact and preserving the
    surrounding nullable union (["null", { ... }]); ensure the record names are
    valid Avro identifiers and update any references if needed.
    

    Comment on lines +67 to +68
    { "name": "score_value", "type": "float32", "doc": "Value of the score for this specific position" }
    ]

    🛠️ Refactor suggestion

    ⚠️ Potential issue

    Use valid Avro primitive types (int/float), not int32/float32.

    Avro primitives are: null, boolean, int, long, float, double, bytes, string. int32/float32 are invalid.

    Apply:

    -                                    { "name": "score_value", "type": "float32", "doc": "Value of the score for this specific position" }
    +                                    { "name": "score_value", "type": "float", "doc": "Value of the score for this specific position" }
    -            {"name": "precursor_charge", "type": "int32", "doc": "Precursor charge"},
    +            {"name": "precursor_charge", "type": "int", "doc": "Precursor charge"},
    -            {"name": "is_decoy", "type": "int32", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
    +            {"name": "is_decoy", "type": "int", "doc": "Decoy indicator, 1 if the PSM is a decoy, 0 target"},
    -            {"name": "number_peaks", "type": ["null", "int32"], "doc": "Number of peaks in the spectrum used for the peptide spectrum match"},
    +            {"name": "number_peaks", "type": ["null", "int"], "doc": "Number of peaks in the spectrum used for the peptide spectrum match"},

    Also update other occurrences (posterior_error_probability, predicted_rt, rt, ion_mobility, mz_array, intensity_array, calculated_mz, observed_mz) to use "float".

    To fix all in this file:

    #!/bin/bash
    # Replace non-Avro primitives in-place
    sed -i -E 's/"int32"/"int"/g; s/"float32"/"float"/g' docs/psm.avsc

    Also applies to: 82-82, 84-84, 120-120

    🤖 Prompt for AI Agents
    In docs/psm.avsc around lines 67-68 (also occurrences at 82, 84 and 120), the
    schema uses invalid Avro primitive types "int32" and "float32"; update the field
    definition for score_value to use "float" and change any other fields listed
    (posterior_error_probability, predicted_rt, rt, ion_mobility, mz_array,
    intensity_array, calculated_mz, observed_mz) that use "int32"/"float32" to the
    correct Avro primitives "int"/"float" respectively; perform a project-wide
    replace of "int32" → "int" and "float32" → "float" to ensure all instances are
    fixed and validate the AVSC file afterwards.
    

    Comment on lines +51 to +53
    evidence_file = test_data[1]
    sdrf_file = test_data[2]


    🛠️ Refactor suggestion

    Guard feature test with file existence checks.

    Apply:

     evidence_file = test_data[1]
     sdrf_file = test_data[2]
    +if not evidence_file.exists() or not sdrf_file.exists():
    +    pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
    🤖 Prompt for AI Agents
    In tests/core/quantms/test_maxquant.py around lines 51 to 53, the test assumes
    test_data[1] and test_data[2] (evidence_file and sdrf_file) exist; add guards to
    check os.path.exists for each file and call pytest.skip with a clear message if
    either file is missing so the test is skipped rather than failing; ensure you
    import os and pytest at the top of the file if not already present.
    

    Comment on lines +81 to +83
    evidence_file = test_data[1]
    sdrf_file = test_data[2]


    🛠️ Refactor suggestion

    Guard feature slicing test with file existence checks.

    Apply:

     evidence_file = test_data[1]
     sdrf_file = test_data[2]
    +if not evidence_file.exists() or not sdrf_file.exists():
    +    pytest.skip(f"Test data not found: {evidence_file} or {sdrf_file}")
    🤖 Prompt for AI Agents
    In tests/core/quantms/test_maxquant.py around lines 81 to 83, the test uses
    evidence_file = test_data[1] and sdrf_file = test_data[2] without verifying
    those paths exist; add existence guards so the test is skipped when required
    input files are missing. Update the test to check os.path.exists (or
    Path.exists) for each file and call pytest.skip with a clear message if either
    file is absent (or use pytest.mark.skipif at module/function level with the same
    condition), so the test won't fail on CI or local runs when those fixture files
    are not present.
    

    Comment on lines +12 to 14
    "PXD007683",
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv",
    )

    💡 Verification agent

    🧩 Analysis chain

    Dataset switch to PXD007683 looks good; verify test asset path exists in repo.
    This ensures the SDRF file is present so the test doesn’t silently skip the behavior it is meant to cover.


    🏁 Script executed:

    #!/bin/bash
    # Verify the SDRF file exists where the test expects it.
    set -euo pipefail
    echo "Searching for PXD007683TMT.sdrf.tsv under tests/ ..."
    fd -a 'PXD007683TMT.sdrf.tsv' tests | tee /dev/stderr
    # Fail if not found
    test -n "$(fd -a 'PXD007683TMT.sdrf.tsv' tests)" || { echo 'Missing SDRF test file'; exit 1; }

    Length of output: 282


    🏁 Script executed:

    #!/bin/bash
    # Re-verify the presence of PXD007683TMT.sdrf.tsv using find, avoiding unavailable commands.
    set -euo pipefail
    echo "Searching for PXD007683TMT.sdrf.tsv under tests/ using find..."
    RESULT=$(find tests -type f -name 'PXD007683TMT.sdrf.tsv' || true)
    echo "$RESULT"
    # Fail if not found
    if [ -z "$RESULT" ]; then
      echo "Missing SDRF test file"
      exit 1
    fi

    Length of output: 305


    Add or update the missing SDRF test asset

    The test in tests/core/test_project.py is pointing to PXD007683TMT.sdrf.tsv under the TEST_DATA_ROOT, but that file was not found in the repo. Without it, the test will error out or silently skip important validation.

    • Location: tests/core/test_project.py (lines 12–14)
    • Expected asset: tests/.../quantms/dda-plex-full/PXD007683TMT.sdrf.tsv

    Please ensure that the SDRF file is committed at the correct path, or adjust the test to reference the existing file location.

    🤖 Prompt for AI Agents
    In tests/core/test_project.py around lines 12 to 14, the test references
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv" which is missing
    from the repository; either add the SDRF file at
    tests/.../quantms/dda-plex-full/PXD007683TMT.sdrf.tsv so the path resolves, or
    update the test to point to the correct existing SDRF file path (or a fixture)
    and adjust any expected contents accordingly so the test no longer depends on a
    non-existent asset.
    

    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 0

    ♻️ Duplicate comments (1)
    quantmsio/core/idxml.py (1)

    627-629: Ensure parent directory exists and compress Parquet output

    Prevents write failures when dirs are missing and reduces file size. Matches prior review guidance.

    -        table = pa.Table.from_pandas(df, schema=PSM_SCHEMA)
    -        pq.write_table(table, output_path)
    -        logging.info(f"Successfully converted IdXML to parquet: {output_path}")
    +        output_path = Path(output_path)
    +        output_path.parent.mkdir(parents=True, exist_ok=True)
    +        table = pa.Table.from_pandas(df, schema=PSM_SCHEMA)
    +        pq.write_table(table, str(output_path), compression="zstd")
    +        logging.info(f"Successfully converted IdXML to parquet: {output_path}")
    🧹 Nitpick comments (7)
    quantmsio/core/idxml.py (7)

    55-57: Log full traceback when IdXML parsing fails

    Use logging.exception to include the stack trace; avoid interpolating the exception object redundantly.

    -        except Exception as e:
    -            logging.error(f"Error parsing IdXML file with pyopenms: {e}")
    -            raise
    +        except Exception:
    +            logging.exception("Error parsing IdXML file with pyopenms")
    +            raise

    170-172: Emit traceback when a peptide hit cannot be parsed

    This improves diagnosability and aligns with TRY400 guidance.

    -        except Exception as e:
    -            logging.warning(f"Error parsing peptide hit with pyopenms: {e}")
    -            return None
    +        except Exception:
    +            logging.exception("Error parsing peptide hit with pyopenms")
    +            return None

    512-520: Broadened catch is fine here; include traceback to aid debugging

    Given this wraps two alternative attachment strategies, keep the broad catch but log with traceback.

    -        except Exception as e:
    -            method_name = "OnDiscExperiment" if self._use_ondisc else "traditional"
    -            logging.warning(f"Failed to attach mzML spectra using {method_name}: {e}")
    +        except Exception:
    +            method_name = "OnDiscExperiment" if self._use_ondisc else "traditional"
    +            logging.exception(f"Failed to attach mzML spectra using {method_name}")

    529-533: OnDisc path: prefer openFile when available (true on-disk access)

    Using OnDiscMSExperiment.openFile avoids loading the full file into memory. Fall back to MzMLFile.load only if needed.

    -        ondisc_exp = oms.OnDiscMSExperiment()
    -        oms.MzMLFile().load(str(self._mzml_path), ondisc_exp)
    +        ondisc_exp = oms.OnDiscMSExperiment()
    +        # Prefer true on-disk access when available
    +        if hasattr(ondisc_exp, "openFile"):
    +            ondisc_exp.openFile(str(self._mzml_path))
    +        else:
    +            # Fallback for older pyopenms bindings
    +            oms.MzMLFile().load(str(self._mzml_path), ondisc_exp)

    562-564: Remove unnecessary int() cast

    num_peaks is already an int.

    -            psm["number_peaks"] = int(num_peaks)
    +            psm["number_peaks"] = num_peaks

    584-586: Remove unnecessary int() cast

    len(...) returns an int.

    -                psm["number_peaks"] = int(len(spectrum_mz))
    +                psm["number_peaks"] = len(spectrum_mz)

    433-470: Avoid double numpy→list normalization

    to_dataframe already normalizes list-like fields; the extra conversion in to_parquet duplicates work. Consider keeping normalization in one place (preferably to_dataframe) and removing it from to_parquet.

    -        def convert_numpy_to_list(obj):
    -            """Recursively convert numpy arrays to Python lists"""
    -            if hasattr(obj, "tolist"):
    -                return obj.tolist()
    -            elif isinstance(obj, list):
    -                return [convert_numpy_to_list(item) for item in obj]
    -            elif isinstance(obj, dict):
    -                return {key: convert_numpy_to_list(value) for key, value in obj.items()}
    -            else:
    -                return obj
    -
    -        for col in df.columns:
    -            if col in [
    -                "modifications",
    -                "protein_accessions",
    -                "additional_scores",
    -                "cv_params",
    -                "mz_array",
    -                "intensity_array",
    -            ]:
    -                df[col] = df[col].apply(convert_numpy_to_list)
    +        # Data is already normalized in to_dataframe(); avoid re-normalizing here.

    Also applies to: 605-626

    📜 Review details

    Configuration used: CodeRabbit UI

    Review profile: CHILL

    Plan: Pro

    💡 Knowledge Base configuration:

    • MCP integration is disabled by default for public repositories
    • Jira integration is disabled by default for public repositories
    • Linear integration is disabled by default for public repositories

    You can enable these sources in your CodeRabbit configuration.

    📥 Commits

    Reviewing files that changed from the base of the PR and between b0b4d76 and 1896847.

    📒 Files selected for processing (1)
    • quantmsio/core/idxml.py (1 hunks)
    🧰 Additional context used
    🧬 Code graph analysis (1)
    quantmsio/core/idxml.py (1)
    quantmsio/core/openms.py (2)
    • OpenMSHandler (14-344)
    • get_spectrum_from_scan (21-43)
    🪛 Ruff (0.12.2)
    quantmsio/core/idxml.py

    56-56: Use logging.exception instead of logging.error

    Replace with exception

    (TRY400)


    147-168: Consider moving this statement to an else block

    (TRY300)


    170-170: Do not catch blind exception: Exception

    (BLE001)


    517-517: Do not catch blind exception: Exception

    (BLE001)


    584-584: Value being cast to int is already an integer

    Remove unnecessary int call

    (RUF046)

    ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
    • GitHub Check: Codacy Static Code Analysis
    • GitHub Check: build
    🔇 Additional comments (1)
    quantmsio/core/idxml.py (1)

    557-567: Heads-up: upstream return shape may throw here; verify OpenMSHandler contract

    OpenMSHandler.get_spectrum_from_scan appears to return only two values on some error paths (see openms.py snippet), which triggers ValueError during unpacking. Your except already handles it, but the root cause should be fixed upstream to always return a consistent 3-tuple.
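
    For illustration, a sketch of a consistent return contract with a placeholder reader; this is an assumption about how the upstream fix could look, not the current OpenMSHandler code.

    from typing import List, Optional, Tuple

    def _read_spectrum(mzml_path: str, scan: str) -> Tuple[List[float], List[float]]:
        """Placeholder for the actual pyopenms lookup used by OpenMSHandler."""
        raise KeyError(f"scan {scan!r} not found in {mzml_path}")

    def get_spectrum_from_scan(
        mzml_path: str, scan: str
    ) -> Tuple[Optional[List[float]], Optional[List[float]], int]:
        """Always return (mz_array, intensity_array, number_peaks), even on error paths."""
        try:
            mz_array, intensity_array = _read_spectrum(mzml_path, scan)
        except Exception:
            # The error path keeps the same shape, so callers can unpack unconditionally.
            return None, None, 0
        return mz_array, intensity_array, len(mz_array)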

    @ypriverol ypriverol requested a review from Copilot August 31, 2025 09:27
    Copilot

    This comment was marked as outdated.

    @ypriverol ypriverol requested a review from Copilot August 31, 2025 15:19
    @Copilot Copilot AI left a comment

    Pull Request Overview

    This PR implements a major restructuring of the quantms.io CLI with organized command groups and enhanced functionality. The changes introduce a hierarchical command structure with groups for conversion, transformation, visualization, statistics, and project management, while adding comprehensive logging and improving error handling.

    • Major CLI restructuring from flat commands to organized groups (convert, transform, visualize, stats, project)
    • Enhanced logging system with structured output support and comprehensive documentation
    • New quantms workflow conversion command with support for multiple proteomics tools (DIA-NN, MaxQuant, FragPipe)

    Reviewed Changes

    Copilot reviewed 106 out of 152 changed files in this pull request and generated 1 comment.

    Show a summary per file
    File Description
    quantmsioc.py Restructure main CLI with organized command groups and enhanced error handling
    diann.py, maxquant.py, fragpipe.py Add conversion commands for different proteomics tools
    psm.py, feature.py, plot.py Add PSM/feature conversion and visualization commands
    ae.py, de.py, ibaq.py, gene.py Add transformation commands for various analyses
    project.py, stats.py Add project management and statistical analysis commands
    logger.py Add centralized logging utility module
    Test files Comprehensive test coverage for new functionality and performance testing
    Schema files Update Avro schemas with nullable fields and new differential/absolute expression schemas
    Documentation Major README update with usage examples and comprehensive logging documentation


    "MSV000079033",
    TEST_DATA_ROOT / "DDA-plex/MSV000079033-Blood-Plasma-iTRAQ.sdrf.tsv",
    "PXD007683",
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv",
    Copilot AI Aug 31, 2025

    The filename contains "PXD007683TMT.sdrf.tsv" but should likely be "PXD007683-TMT.sdrf.tsv" to match the pattern used elsewhere in the codebase (with a hyphen between the dataset ID and the experiment type).

    Suggested change
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683TMT.sdrf.tsv",
    TEST_DATA_ROOT / "quantms/dda-plex-full/PXD007683-TMT.sdrf.tsv",


    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 12

    Caution

    Some comments are outside the diff and can’t be posted inline due to platform limitations.

    ⚠️ Outside diff range comments (3)
    quantmsio/core/diann/diann.py (3)

    414-424: Incorrect call to Feature.generate_modifications_details (signature mismatch).

    Feature.generate_modifications_details is not a static method and expects a modifications map, not a sample map plus select_mods. This call will raise at runtime.

    Minimal safe fix: map peptidoform to a canonical string and set modifications to None.

    -        report[["peptidoform", "modifications"]] = report[["peptidoform"]].apply(
    -            lambda row: Feature.generate_modifications_details(
    -                row["peptidoform"], self._sample_map, select_mods
    -            ),
    -            axis=1,
    -            result_type="expand",
    -        )
    +        _, modifications_map = self.get_masses_and_modifications_map()
    +        report["peptidoform"] = report["peptidoform"].map(modifications_map)
    +        report["modifications"] = None

    If you need full modification details, refactor to use an instance method on Feature that has access to a proper modifications_dict.


    426-439: Same LFQ mapping issue in add_additional_msg().

    Apply the same fallback to avoid KeyError.

    -                {
    -                    "sample_accession": self._sample_map[
    -                        rows["reference_file_name"] + "-" + rows["channel"]
    -                    ],
    +                {
    +                    "sample_accession": self._sample_map.get(
    +                        rows["reference_file_name"] + "-" + rows["channel"],
    +                        self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]),
    +                    ),
                         "channel": rows["channel"],
                         "intensity": rows["intensity"],
                     }

    449-464: Apply LFQ mapping fallback to additional_intensities as well.

    Mirror the robust lookup.

    -                {
    -                    "sample_accession": self._sample_map[
    -                        rows["reference_file_name"] + "-" + rows["channel"]
    -                    ],
    +                {
    +                    "sample_accession": self._sample_map.get(
    +                        rows["reference_file_name"] + "-" + rows["channel"],
    +                        self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]),
    +                    ),
                         "channel": rows["channel"],
                         "intensities": [
                             {"intensity_name": "lfq", "intensity_value": rows["lfq"]},
                         ],
                     }
    ♻️ Duplicate comments (16)
    quantmsio/core/sdrf.py (1)

    129-131: Add proper exception chaining (ruff B904).

    Preserve original context or explicitly suppress it.

    -        except FileNotFoundError:
    -            raise FileNotFoundError(
    -                "The SDRF file provided not found: " + str(sdrf_file)
    -            )
    +        except FileNotFoundError as err:
    +            raise FileNotFoundError(
    +                "The SDRF file provided not found: " + str(sdrf_file)
    +            ) from err
    quantmsio/utils/file_utils.py (3)

    17-18: Library modules shouldn’t call basicConfig.

    Don’t configure global logging in a library; attach a NullHandler instead.

    -# Configure logging
    -logging.basicConfig(level=logging.INFO)
    -logger = logging.getLogger(__name__)
    +# Configure logging
    +logger = logging.getLogger(__name__)
    +logger.addHandler(logging.NullHandler())

    30-39: Support multi-suffix and case-insensitive extensions.

    Current suffix equality fails for “.mzTab.gz” and case variants.

     def validate_extension(file_path: Union[str, Path], extension: str) -> bool:
         """Validate that the file has the correct extension."""
         if not Path(file_path).exists():
             raise FileNotFoundError(f"File not found: {file_path}")
    -    if not Path(file_path).suffix == extension:
    +    path = Path(file_path)
    +    ext = extension.lower()
    +    # Normalize extension to start with dot
    +    if not ext.startswith("."):
    +        ext = "." + ext
    +    # Handle multi-suffix like ".mzTab.gz" and case-insensitive compare
    +    path_ext = "".join(path.suffixes).lower() or path.suffix.lower()
    +    if path_ext != ext:
             raise ValueError(
                 f"File {file_path} does not have the correct extension {extension}"
             )
         return True

    390-405: Fix Parquet metadata and ensure output directory exists.

    • PyArrow expects metadata keys/values as bytes.
    • Create parent directory before ParquetWriter.
    -                    schema_metadata = self.schema.metadata or {}
    -
    -                    # Convert file_metadata to string format for parquet metadata
    -                    if self.file_metadata:
    -                        for key, value in self.file_metadata.items():
    -                            schema_metadata[f"file_metadata.{key}"] = str(value)
    -
    -                    # Update schema with metadata
    -                    schema_with_metadata = self.schema.with_metadata(schema_metadata)
    +                    # Start from existing metadata (bytes->bytes)
    +                    schema_metadata = dict(self.schema.metadata or {})
    +                    # Merge file-level metadata as bytes
    +                    if self.file_metadata:
    +                        for key, value in self.file_metadata.items():
    +                            k = f"file_metadata.{key}".encode("utf-8")
    +                            v = str(value).encode("utf-8")
    +                            schema_metadata[k] = v
    +                    schema_with_metadata = self.schema.with_metadata(schema_metadata)
    @@
    -                    self.parquet_writer = pq.ParquetWriter(
    +                    # Ensure parent directory exists
    +                    Path(self.output_path).parent.mkdir(parents=True, exist_ok=True)
    +                    self.parquet_writer = pq.ParquetWriter(
    -                        where=self.output_path,
    +                        where=str(self.output_path),
                             schema=schema_with_metadata,
                             compression=self.compression,
                         )
    quantmsio/operate/tools.py (2)

    335-346: Ensure output dir exists and pass str path to ParquetWriter.

    Avoids failures when parent dir is missing and improves PyArrow compatibility.

         pqwriter = None
         for feature in generate_ibaq_feature(sdrf_path, parquet_path):
             if not pqwriter:
    -            pqwriter = pq.ParquetWriter(output_path, feature.schema)
    +            Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    +            pqwriter = pq.ParquetWriter(str(output_path), feature.schema)

    291-299: Fix typo in public API: rename genereate_ibaq_feature → generate_ibaq_feature and update call site.

    Prevents leaking a misspelling and improves discoverability.

    -def genereate_ibaq_feature(
    +def generate_ibaq_feature(
         sdrf_path: Union[Path, str], parquet_path: Union[Path, str]
     ) -> Generator[pa.Table, None, None]:
    @@
    -    for feature in genereate_ibaq_feature(sdrf_path, parquet_path):
    +    for feature in generate_ibaq_feature(sdrf_path, parquet_path):

    Also applies to: 342-348

    quantmsio/core/quantms/pg.py (1)

    650-686: Fix DuckDB filter: ANY([...]) is unreliable; use a quoted IN list.

    This pattern has caused failures before. Build a safe IN (...) list. (Same concern was raised earlier; re-flagging.)

    -        FROM processed_msstats_with_pg 
    -        WHERE reference_file_name = ANY({file_batch})
    +        FROM processed_msstats_with_pg 
    +        WHERE reference_file_name IN ({", ".join(f"'{f.replace(\"'\",\"''\")}'" for f in file_batch)})
    quantmsio/core/diann/diann.py (2)

    152-185: Naming consistency for peptide_counts vs feature_counts.

    Prior review suggested using “unique_peptides/total_peptides” in peptide_counts and “unique_features/total_features” in feature_counts to avoid mixing “sequences” terminology.
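
    For clarity, the field naming that suggestion amounts to (the values below are made up):

    peptide_counts = {
        "unique_peptides": 12,  # peptides unique to this protein group in this file
        "total_peptides": 34,   # all peptides observed for the group in this file
    }
    feature_counts = {
        "unique_features": 15,  # unique peptide + charge combinations
        "total_features": 40,   # all features observed for the group in this file
    }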


    46-56: super().__init__() will raise because the class no longer inherits from a DuckDB base class. Initialize DiannDuckDB instead.

    This is a runtime blocker; _duckdb is never set and all DB calls will fail.

    -        super(DiaNNConvert, self).__init__(
    -            diann_report, duckdb_max_memory, duckdb_threads
    -        )
    +        # Initialize DuckDB connection
    +        self._duckdb = DiannDuckDB(
    +            diann_report,
    +            max_memory=duckdb_max_memory,
    +            worker_threads=duckdb_threads,
    +            pg_matrix_path=pg_matrix_path,
    +        )
    quantmsio/commands/convert/quantms.py (3)

    191-191: Replace insecure mktemp() with NamedTemporaryFile.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False)
    +            temp_db_path = temp_file.name
    +            temp_file.close()

    379-379: Replace insecure mktemp() with NamedTemporaryFile.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False)
    +            temp_db_path = temp_file.name
    +            temp_file.close()

    104-104: Replace insecure mktemp() with NamedTemporaryFile.

    Using tempfile.mktemp() is deprecated and insecure as it creates a predictable filename that attackers could potentially exploit.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            temp_file = NamedTemporaryFile(suffix=".duckdb", delete=False)
    +            temp_db_path = temp_file.name
    +            temp_file.close()
    quantmsio/core/quantms/psm.py (2)

    526-527: Fix typo in CV parameter name.

    The CV parameter name contains a typo: "consesus_support" should be "consensus_support".

                 struct = {
    -                "cv_name": "consesus_support",
    +                "cv_name": "consensus_support",
                     "cv_value": str(consensus_support),
                 }

    664-665: Fix variable shadowing in generator.

    The loop variable df shadows the iterable it iterates over.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    quantmsio/core/quantms/feature.py (2)

    472-472: Fix closure issue in lambda function.

    The lambda function doesn't properly bind the loop variable i, which could lead to incorrect behavior if the lambda is evaluated after the loop completes.

             for i, feature in enumerate(map_features):
                 msstats.loc[:, feature] = msstats[
                     ["reference_file_name", "peptidoform", "precursor_charge"]
                 ].apply(
    -                lambda rows: merge_psm(rows, i),
    +                lambda rows, idx=i: merge_psm(rows, idx),
                     axis=1,
                 )

    541-542: Fix variable shadowing in generator.

    The loop variable df shadows the iterable it iterates over.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    🧹 Nitpick comments (36)
    tests/core/quantms/psms/test_duckdb_creation.py (7)

    9-9: Use a robust, normalized path for TEST_DATA_ROOT

    Avoid unresolved ".." segments; use resolve()+parents for clarity and portability.

    -TEST_DATA_ROOT = Path(__file__).parent / "../.." / ".." / ".." / "tests" / "examples"
    +TEST_DATA_ROOT = Path(__file__).resolve().parents[4] / "tests" / "examples"

    96-98: Simplify resource management; avoid double-closing and use context managers

    Close the indexer once and set it to None to prevent a second close in finally. Use a context manager for DuckDB instead of manual connect/close.

    -            cleanup_mztab_indexer(mztab)
    +            cleanup_mztab_indexer(mztab)
    +            mztab = None
                 assert output_path.exists(), f"DuckDB file not created: {output_path}"
    -            con = duckdb.connect(str(output_path), read_only=True)
    -            tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()]
    +            with duckdb.connect(str(output_path), read_only=True) as con:
    +                tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()]
    -            assert (
    -                "metadata" in tables and "proteins" in tables and "psms" in tables
    -            )
    -            metadata_count = con.execute(
    -                "SELECT COUNT(*) FROM metadata"
    -            ).fetchone()[0]
    -            proteins_count = con.execute(
    -                "SELECT COUNT(*) FROM proteins"
    -            ).fetchone()[0]
    -            psms_count = con.execute("SELECT COUNT(*) FROM psms").fetchone()[0]
    -            if msstats_path:
    -                msstats_count = con.execute(
    -                    "SELECT COUNT(*) FROM msstats"
    -                ).fetchone()[0]
    +                assert (
    +                    "metadata" in tables and "proteins" in tables and "psms" in tables
    +                )
    +                metadata_count = con.execute(
    +                    "SELECT COUNT(*) FROM metadata"
    +                ).fetchone()[0]
    +                proteins_count = con.execute(
    +                    "SELECT COUNT(*) FROM proteins"
    +                ).fetchone()[0]
    +                psms_count = con.execute("SELECT COUNT(*) FROM psms").fetchone()[0]
    +                if msstats_path:
    +                    msstats_count = con.execute(
    +                        "SELECT COUNT(*) FROM msstats"
    +                    ).fetchone()[0]
    -            cleanup_mztab_indexer(mztab)
    -            cleanup_duckdb_connection(con)
    +            cleanup_mztab_indexer(mztab)

    Also applies to: 99-114, 153-156


    4-7: Replace bare try/except-pass with contextlib.suppress

    Eliminates Ruff BLE001/S110 and keeps intent clear.

    +from contextlib import suppress
     def cleanup_mztab_indexer(mztab_indexer):
         """Clean up MzTabIndexer instance and its connections."""
    -    if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb:
    -        try:
    -            mztab_indexer._duckdb.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if mztab_indexer and getattr(mztab_indexer, "_duckdb", None):
    +        with suppress(Exception):
    +            mztab_indexer._duckdb.close()
     
     def cleanup_duckdb_connection(con):
         """Clean up DuckDB connection."""
    -    if con:
    -        try:
    -            con.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if con:
    +        with suppress(Exception):
    +            con.close()

    Also applies to: 1251-1267


    181-181: Drop redundant f-prefixes (no placeholders) to satisfy Ruff F541

    Also helps the Black job by keeping strings tidy.

    -print(f"\n=== Testing Two-Step Creation Process ===")
    +print("\n=== Testing Two-Step Creation Process ===")
    @@
    -print(f"Initial database created with:")
    +print("Initial database created with:")
    @@
    -print(f"\nFinal database contains:")
    +print("\nFinal database contains:")
    @@
    -print(f"\n=== Testing Two-Step Creation Process (TMT) ===")
    +print("\n=== Testing Two-Step Creation Process (TMT) ===")
    @@
    -print(f"Initial database created with:")
    +print("Initial database created with:")
    @@
    -print(f"\nFinal database contains:")
    +print("\nFinal database contains:")
    @@
    -print(f"\n=== Testing Create vs Open Database with Class Methods ===")
    +print("\n=== Testing Create vs Open Database with Class Methods ===")
    @@
    -print(f"New database created with:")
    +print("New database created with:")
    @@
    -print(f"Existing database opened with:")
    +print("Existing database opened with:")
    @@
    -print(f"\nSummary:")
    +print("\nSummary:")
    @@
    -print(f"\n=== Testing Create vs Open Database with Class Methods (TMT) ===")
    +print("\n=== Testing Create vs Open Database with Class Methods (TMT) ===")
    @@
    -print(f"New database created with:")
    +print("New database created with:")
    @@
    -print(f"Existing database opened with:")
    +print("Existing database opened with:")
    @@
    -print(f"\nSummary:")
    +print("\nSummary:")

    Also applies to: 210-210, 269-269, 305-305, 335-335, 394-394, 712-712, 745-745, 797-797, 824-824, 854-854, 887-887, 939-939, 966-966


    62-62: Rename test to reflect actual behavior (DuckDB only)

    Removes "parquet" from the name to avoid confusion while the parquet code paths are commented out.

    -def test_create_duckdb_and_parquet(test_name, mztab_path, msstats_path):
    +def test_create_duckdb_database(test_name, mztab_path, msstats_path):

    193-198: Avoid relying on private attribute _duckdb in tests

    Accessing internals couples tests to implementation details. Prefer opening a temporary read-only DuckDB connection for assertions (as in the first test) or add a small public helper on MzTabIndexer (e.g., tables(), count(table)) and use it here.

    Also applies to: 231-236, 317-323, 356-361, 725-731, 767-774, 867-872, 908-915
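
    A hedged sketch of the kind of small public helpers this comment suggests; the method names follow the reviewer's example (tables(), count(table)), but the class body is hypothetical and the real MzTabIndexer internals may differ.

    from typing import List
    import duckdb

    class MzTabIndexer:
        def __init__(self, database_path: str):
            self._duckdb = duckdb.connect(database_path, read_only=True)

        def tables(self) -> List[str]:
            """Public accessor so tests never need to touch self._duckdb directly."""
            return [row[0] for row in self._duckdb.execute("SHOW TABLES").fetchall()]

        def count(self, table: str) -> int:
            """Row count for a table name taken from tables(), not from user input."""
            return self._duckdb.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    Tests could then assert on indexer.count("psms") instead of reaching into the private connection.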


    1-7: Optionally silence Ruff S101 for tests

    Pytest-style asserts are fine in tests. Either configure per-file ignores in pyproject.toml or add a file header.

    • pyproject.toml:

      [tool.ruff]
      per-file-ignores = { "tests/**" = ["S101"] }

    • Or add at the top of this file:

      # ruff: noqa: S101

    quantmsio/core/sdrf.py (3)

    104-112: Remove commented-out SUPOORTED_LABELING block.

    Dead/commented code adds noise; delete or move to docs if needed.

    -    # SUPOORTED_LABELING = [
    -    #     "LABEL FREE",
    -    #     "TMT10",
    -    #     "TMT11",
    -    #     "TMT16",
    -    #     "TMT6",
    -    #     "ITRAQ4",
    -    #     "ITRAQ8",
    -    # ]

    452-452: Tiny style: avoid list concatenation in drop.

    Use iterable unpacking for readability (ruff RUF005).

    -        sdrf.drop(
    -            ["comment[technical replicate]"] + factor,
    +        sdrf.drop(
    +            ["comment[technical replicate]", *factor],
                 axis=1,
                 inplace=True,
             )

    457-457: Remove commented-out return.

    Avoid stale commented code.

    -        # return sdrf[["reference_file_name", "channel", "condition"]].copy()
    quantmsio/utils/file_utils.py (2)

    341-351: Type hint nit: Optional for file_metadata.

    Use Optional[Dict[str, Any]] (ruff RUF013).

    -from typing import Dict, Iterator, List, Tuple, Union
    +from typing import Any, Dict, Iterator, List, Tuple, Union, Optional
    @@
    -    def __init__(
    +    def __init__(
             self,
             output_path: str,
             schema: pa.Schema,
             batch_size: int = 1000000,
             compression: str = "gzip",
    -        file_metadata: dict = None,
    +        file_metadata: Optional[Dict[str, Any]] = None,
         ):

    414-416: Use logging.exception to preserve tracebacks.

    Provides full stack traces for easier debugging (ruff TRY400).

    -            self.logger.error(
    +            self.logger.exception(
                     f"Error during batch writing: {e}, file path: {self.output_path}"
                 )
    @@
    -            self.logger.error(f"Error closing writer: {e}")
    +            self.logger.exception(f"Error closing writer: {e}")

    Also applies to: 431-431

    quantmsio/operate/tools.py (2)

    159-167: Remove “Before Code” banner; keep helper or move to utils.

    The heading looks like a leftover. If MaxQuant relies on this helper (see self._automaton), keep it here or import it from a shared utils module.

    -# Before Code:
     def get_ahocorasick(mods_dict: dict) -> ahocorasick.Automaton:

    169-212: Guard against select_mods=None in get_modification_details.

    Avoid TypeError when select_mods is None; also minor cleanup.

    -def get_modification_details(
    +def get_modification_details(
         seq: str,
         mods_dict: dict,
         automaton: ahocorasick.Automaton,
    -    select_mods: list | None = None,
    +    select_mods: list | None = None,
     ) -> tuple[str, list]:
    @@
    -        elif modification in select_mods:
    +        elif (select_mods is None) or (modification in select_mods):
                 modifications.append(modification)
                 modification_details.append(
    quantmsio/core/idxml_utils/idxml.py (4)

    86-96: Avoid bare except; catch Exception explicitly.

    Also keep calc_mz fallback.

    -                except:
    +                except Exception:
                         calc_mz = 0.0

    304-315: Use logger.exception and trim excessive debug noise.

    Preserve traceback; consider downgrading DEBUG banners to debug level.

    -        except Exception as e:
    -            self.logger.error(f"=== DEBUG: Error transforming PSM record: {e} ===")
    +        except Exception as e:
    +            self.logger.exception("Error transforming PSM record")
                 self.logger.error(f"=== DEBUG: Record keys: {list(record.keys())} ===")
                 self.logger.error(
                     f"=== DEBUG: Record sample: {dict(list(record.items())[:5])} ==="
                 )
                 import traceback
    -
    -            self.logger.error(
    -                f"=== DEBUG: Transformation traceback: {traceback.format_exc()} ==="
    -            )
    +            self.logger.debug(f"Transformation traceback: {traceback.format_exc()}")
                 return None

    397-401: Remove redundant f-string and keep message at info level.

    Also, minor copy edit.

    -        logger.info(f"=== DEBUG: convert_to_parquet started ===")
    +        logger.info("convert_to_parquet started")

    445-449: Use logging.exception to include stack trace.

    Easier troubleshooting for failures.

    -        except Exception as e:
    -            logger.error(f"=== DEBUG: Error during parquet conversion: {e} ===")
    -            logger.error(f"=== DEBUG: Exception type: {type(e)} ===")
    +        except Exception as e:
    +            logger.exception("Error during parquet conversion")
    +            logger.error(f"Exception type: {type(e)}")
                 import traceback
    -
    -            logger.error(f"=== DEBUG: Traceback: {traceback.format_exc()} ===")
    +            logger.debug(f"Traceback: {traceback.format_exc()}")
                 raise
    quantmsio/core/maxquant/maxquant.py (2)

    191-233: get_modification_details dependency must be present.

    With the import restored and automaton initialized, this block will work as intended. Consider adding a quick guard if self._automaton is missing.

    -            peptidoform, other_modification_details = get_modification_details(
    -                seq, self.mods_map, self._automaton, other_mods
    -            )
    +            peptidoform, other_modification_details = get_modification_details(
    +                seq, self.mods_map, getattr(self, "_automaton", get_ahocorasick(self.mods_map)), other_mods
    +            )

    761-799: Prefer isinstance/type-safe checks in required fields loop (ruff E721).

    Use identity or isinstance to avoid equality on types; also simplifies conversions.

    -        for field, dtype in required_fields.items():
    +        for field, dtype in required_fields.items():
                 if field not in df.columns:
    -                if dtype == list:
    +                if dtype is list:
                         df[field] = [[] for _ in range(len(df))]
    -                elif dtype == str:
    +                elif dtype is str:
                         df[field] = None
    -                elif dtype == float:
    +                elif dtype is float:
                         df[field] = 0.0
    -                elif dtype == int:
    +                elif dtype is int:
                         df[field] = 0
                 else:
                     # Handle type conversions
    -                if dtype == float:
    +                if dtype is float:
                         df[field] = pd.to_numeric(df[field], errors="coerce").fillna(0.0)
    -                elif dtype == int:
    +                elif dtype is int:
                         df[field] = (
                             pd.to_numeric(df[field], errors="coerce").fillna(0).astype(int)
                         )
    -                elif dtype == str:
    +                elif dtype is str:
                         df[field] = df[field].astype(str).fillna("")
    quantmsio/utils/constants.py (3)

    91-101: Consolidate MSstats “Reference” constants to avoid ambiguity.

    You define both MSSTATS_REFERENCE and MSSTATS_QUANTIFICATION_ID with the same value ("Reference"). Keeping two names for the same column is confusing and invites drift. Use one, and if you want an alias, assign it explicitly with a comment.

    - MSSTATS_REFERENCE = "Reference"
    + MSSTATS_REFERENCE = "Reference"  # primary name used across the codebase
    ...
    - MSSTATS_QUANTIFICATION_ID = "Reference"
    + # Alias kept for backward compatibility; prefer MSSTATS_REFERENCE
    + MSSTATS_QUANTIFICATION_ID = MSSTATS_REFERENCE

    109-111: Prefer an immutable container for shared prefixes.

    DECOY_PREFIXES is a global shared across modules. Make it a tuple to prevent accidental mutation.

    -DECOY_PREFIXES = ["DECOY", "REV", "RANDOM"]
    +DECOY_PREFIXES = ("DECOY", "REV", "RANDOM")

    114-125: Add a brief note on mzTab column set origin.

    The MZTAB_PROTEIN_COLUMNS set mixes fixed columns and instrument-dependent optional columns. A short comment that this is the minimal/expected set used by downstream mapping (see core/common.py MZTAB_PG_MAP) will help maintainers align changes.
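
    For example, a short header comment along these lines (wording illustrative, not a proposed final text):

        # Minimal/expected mzTab protein-section columns consumed downstream;
        # keep in sync with MZTAB_PG_MAP in quantmsio/core/common.py.
        MZTAB_PROTEIN_COLUMNS = [
            ...
        ]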

    quantmsio/core/common.py (2)

    40-42: Clarify comment: this is not a regex.

    OPENMS_PROTEIN_QVALUE_WORDS is a list of required tokens, not a regular expression. Update the comment to avoid confusion.

    -# Regular expression that make sure the protein search engine score contains the following words: openms target-decoy protein and q-value
    +# Required tokens for matching OpenMS protein q-value score column names

    209-209: Ruff RUF005: prefer list unpacking for readability.

    Minor style: apply Ruff’s suggestion.

    -MZTAB_PG_USECOLS = list(MZTAB_PG_MAP.keys()) + ["description"]
    +MZTAB_PG_USECOLS = [*list(MZTAB_PG_MAP.keys()), "description"]
    quantmsio/core/quantms/pg.py (4)

    20-33: Constructor docstring and unused variable.

    • Docstring still references mztab_path; update to indexer.
    • protein_details_df is unused; drop to avoid unnecessary data load.
    -    def __init__(self, mztab_indexer):
    -        """Initialize MzTabProteinGroups with an MzTabIndexer instance.
    -
    -        Args:
    -            mztab_path: Path to mzTab file
    -        """
    +    def __init__(self, mztab_indexer):
    +        """Initialize MzTabProteinGroups with an MzTabIndexer instance."""
             self._indexer: MzTabIndexer = mztab_indexer
    -
    -        # Initialize tracking lists first
             self._temp_files = []  # Track any temporary files created
             self._file_handles = []  # Track any open file handles
    -        protein_details_df = self._indexer.get_protein_details()
             self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")

    401-404: Use logging.exception to preserve traceback for SQL batch failures.

    You’re catching broad exceptions; at minimum, include the traceback.

    -            except Exception as e:
    -                logger.warning(f"[ERROR] SQL batch failed: {e}, skipping batch")
    +            except Exception:
    +                logger.exception("[ERROR] SQL batch failed; skipping batch")

    448-452: Avoid destroying a shared indexer; delegate lifecycle to the caller.

    destroy_database() here will close the indexer even if the caller wants to reuse it. Since the CLI wraps you in a context manager already, drop this call and let the owner manage lifetime.

    -        # Clean up the temporary MzTabIndexer
    -        self._indexer.destroy_database()
    -
    -        # Cleanup any temporary files created during processing
    +        # Cleanup any temporary files created during processing
             self.cleanup()

    691-699: Parameter ‘topn’ is ignored.

    Either implement TopN properly (needs peptide-level detail) or drop the parameter and flag approximation in docs.
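
    For reference, a minimal sketch of the TopN idea on peptide-level data, assuming illustrative column names ("pg_accessions", "intensity") rather than the module's actual schema:

        import pandas as pd

        def topn_intensity(peptides: pd.DataFrame, topn: int = 3) -> pd.Series:
            # Mean of the top-N most intense peptides per protein group.
            return (
                peptides.groupby("pg_accessions")["intensity"]
                .apply(lambda s: s.nlargest(topn).mean())
            )

        demo = pd.DataFrame(
            {
                "pg_accessions": ["P1", "P1", "P1", "P1", "P2"],
                "intensity": [10.0, 8.0, 6.0, 1.0, 5.0],
            }
        )
        print(topn_intensity(demo))  # P1 -> 8.0, P2 -> 5.0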

    quantmsio/core/diann/diann.py (2)

    440-443: Use numeric/boolean for is_decoy.

    String "0" risks type inconsistency downstream.

    -        report.loc[:, "is_decoy"] = "0"
    +        report.loc[:, "is_decoy"] = 0

    80-92: Prefer DiannDuckDB helper for query construction.

    To avoid ad-hoc IN (...) formatting and improve reuse, delegate to DiannDuckDB.get_report_from_database.

    -        report = self._duckdb.query_to_df(
    -            """
    -            select {}
    -            from report
    -            where Run IN {}
    -            """.format(
    -                sql, tuple(runs)
    -            )
    -        )
    +        report = self._duckdb.get_report_from_database(runs, columns=sql)
    quantmsio/quantmsioc.py (1)

    60-73: Logger init once at CLI entry is good. Consider honoring --verbose globally.

    If you plan a global --verbose, add a click option on cli() and set level accordingly, instead of per-subcommand.
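
    A minimal sketch of what a group-level flag could look like, assuming a click group named cli and the standard logging module (wiring is illustrative, not the project's actual CLI code):

        import logging

        import click

        @click.group()
        @click.option("--verbose", is_flag=True, help="Enable debug logging.")
        def cli(verbose: bool) -> None:
            # Configure logging once at CLI entry; subcommands inherit the level.
            logging.basicConfig(
                level=logging.DEBUG if verbose else logging.INFO,
                format="%(asctime)s %(name)s %(levelname)s %(message)s",
            )

        @cli.command()
        def convert() -> None:
            logging.getLogger(__name__).debug("only shown with --verbose")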

    quantmsio/commands/convert/quantms.py (1)

    512-512: Remove extraneous f-string prefix.

    The f-string prefix is unnecessary here as there are no placeholders.

    -            logger.info(f"Successfully processed:")
    +            logger.info("Successfully processed:")
    quantmsio/core/quantms/psm.py (1)

    658-663: Improve exception handling.

    Consider using more specific exception types for better error handling and clarity.

             if not isinstance(partitions, list):
    -            raise Exception(f"{partitions} is not a list")
    +            raise TypeError(f"Expected list, got {type(partitions).__name__}")
             if len(partitions) == 0:
    -            raise Exception(f"{partitions} is empty")
    +            raise ValueError("Partitions list cannot be empty")
             for partion in partitions:
                 if partion not in cols:
    -                raise Exception(f"{partion} does not exist")
    +                raise ValueError(f"Column '{partion}' does not exist in dataframe")
    quantmsio/core/quantms/feature.py (2)

    535-540: Improve exception handling.

    Consider using more specific exception types for better error handling and clarity.

             if not isinstance(partitions, list):
    -            raise Exception(f"{partitions} is not a list")
    +            raise TypeError(f"Expected list, got {type(partitions).__name__}")
             if len(partitions) == 0:
    -            raise Exception(f"{partitions} is empty")
    +            raise ValueError("Partitions list cannot be empty")
             for partion in partitions:
                 if partion not in cols:
    -                raise Exception(f"{partion} does not exist")
    +                raise ValueError(f"Column '{partion}' does not exist in dataframe")

    201-201: Address TODO comments.

    The code contains unresolved TODO comments that need attention:

    1. Line 201: Need to determine the data source for score_name and score_value
    2. Line 248: Need to confirm whether the proteins table contains "opt_global_q-value"

    These TODOs indicate missing implementation details. Would you like me to help investigate the correct data sources or open issues to track these tasks?

    Also applies to: 248-249


    except Exception as e:
        logger.exception(f"Error in feature conversion: {str(e)}")
        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")

    🛠️ Refactor suggestion

    Use proper exception chaining to preserve traceback.

    When re-raising exceptions within an except clause, use from e to preserve the original traceback context for better debugging.

    -        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")
    +        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.") from e

    Apply this change to all five occurrences in the file at lines 128, 208, 264, 410, and 528.

    Also applies to: 208-208, 264-264, 410-410, 528-528
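
    For context, a small generic illustration of what "from e" preserves (plain Python, unrelated to the project code):

        try:
            try:
                int("not a number")
            except ValueError as e:
                raise RuntimeError("conversion failed") from e
        except RuntimeError as err:
            # The original ValueError is kept on __cause__, so the full chain
            # shows up in the traceback instead of being lost.
            print(type(err.__cause__).__name__)  # ValueError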

    🧰 Tools
    🪛 Ruff (0.12.2)

    128-128: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

    (B904)


    128-128: Avoid specifying long messages outside the exception class

    (TRY003)


    128-128: Use explicit conversion flag

    Replace with conversion flag

    (RUF010)

    🤖 Prompt for AI Agents
    In quantmsio/commands/convert/quantms.py around lines 128, 208, 264, 410, and
    528 the code re-raises ClickException inside except blocks without exception
    chaining; update each raise to use "from e" (e.g., raise
    click.ClickException(...) from e) so the original traceback is preserved,
    applying the same change at all five specified line locations.
    

    Comment on lines +95 to 96
    "Run": "reference_file_name",
    }

    💡 Verification agent

    🧩 Analysis chain

    Be explicit about DIA-NN “Run” → “reference_file_name” rename across the codebase.

    You remap “Run” to “reference_file_name”. Please confirm all call sites downstream (e.g., joins, filters, and CLI code) use the canonical “reference_file_name”. Mixed usage is a common source of bugs when reading older report exports.

    Run to check remaining usage:


    🏁 Script executed:

    #!/bin/bash
    # Find usages of raw "Run" column outside mapping dicts and tests
    rg -nP -C2 '(?<!DIANN_MAP)\bRun\b' --type=py -g '!**/test/**' -g '!**/tests/**'

    Length of output: 4408


    Be explicit about renaming “Run” → “reference_file_name” in all DIA-NN call sites

    • In quantmsio/core/diann/diann.py update:
      • lines 85–87 (WHERE Run IN …) → use "reference_file_name" or DIANN_MAP["Run"]
      • lines 110–114 (SELECT "Run") → select "reference_file_name"
      • line 122 (peptide_df.to_dict()["Run"]) → index by "reference_file_name"
      • lines 520–522 (get_unique_references("Run")) → call get_unique_references("reference_file_name")
    • In quantmsio/utils/constants.py reconsider keeping RUN = "Run" and MSSTATS_RUN_NAME = "Run" alongside MSSTATS_REFERENCE_NAME = "reference_file_name" to avoid mixed-key usage.
    🤖 Prompt for AI Agents
    In quantmsio/core/common.py around lines 95–96, ensure the mapping explicitly
    renames "Run" to "reference_file_name" and propagate that change: update
    quantmsio/core/diann/diann.py so that lines 85–87 use "reference_file_name" (or
    DIANN_MAP["Run"]) instead of WHERE Run IN …, lines 110–114 SELECT
    "reference_file_name" instead of SELECT "Run", line 122 indexes
    peptide_df.to_dict() by "reference_file_name" instead of "Run", and lines
    520–522 call get_unique_references("reference_file_name") instead of
    get_unique_references("Run"); also review quantmsio/utils/constants.py and
    either keep RUN = "Run" and MSSTATS_RUN_NAME = "Run" alongside
    MSSTATS_REFERENCE_NAME = "reference_file_name" or consolidate keys to avoid
    mixed-key usage so all call sites consistently use the new reference_file_name
    key.
    

    Comment on lines +190 to +203
    report.loc[:, "intensities"] = report[
        ["reference_file_name", "pg_quantity"]
    ].apply(
        lambda rows: [
            {
                "sample_accession": self._sample_map[
                    rows["reference_file_name"] + "-LFQ"
                ],
                "channel": "LFQ",
                "intensity": rows["pg_quantity"],
            }
        ],
        axis=1,
    )

    🛠️ Refactor suggestion

    LFQ sample mapping likely wrong: SDRF LFQ keys don’t include “-LFQ”. Add a fallback.

    SDRFHandler.get_sample_map_run() maps LFQ as “file” (no “-LFQ” suffix). Your lookup uses “file-LFQ”.

    -            lambda rows: [
    -                {
    -                    "sample_accession": self._sample_map[
    -                        rows["reference_file_name"] + "-LFQ"
    -                    ],
    +            lambda rows: [
    +                {
    +                    "sample_accession": self._sample_map.get(
    +                        rows["reference_file_name"] + "-LFQ",
    +                        self._sample_map.get(rows["reference_file_name"], rows["reference_file_name"]),
    +                    ),
                         "channel": "LFQ",
                         "intensity": rows["pg_quantity"],
                     }
                 ],
    🤖 Prompt for AI Agents
    In quantmsio/core/diann/diann.py around lines 190 to 203, the code looks up
    sample IDs using keys like reference_file_name + "-LFQ" but
    SDRFHandler.get_sample_map_run() stores LFQ keys without the "-LFQ" suffix;
    update the lookup to first try self._sample_map.get(reference_file_name +
    "-LFQ") and if that returns None fall back to
    self._sample_map.get(reference_file_name) (or vice versa if preferred), and
    handle missing mappings by raising a clear error or skipping the row; ensure the
    lambda uses .get to avoid KeyError and produces the same dict structure.
    

    Comment on lines +26 to +43
    def __init__(self, idxml_path: str):
        """Initialize IdXML PSM processor.

        Args:
            idxml_path: Path to idXML file
        """
        self.idxml_path = Path(idxml_path)
        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
        prot_ids = []
        pep_ids = []
        oms.IdXMLFile().load(idxml_path, prot_ids, pep_ids)
        self.prot_ids = prot_ids
        self.pep_ids = pep_ids
        self.openms_handler = OpenMSHandler()
        self.logger.info(
            f"Loaded {len(self.prot_ids)} proteins and {len(self.pep_ids)} peptides from {idxml_path}"
        )


    🛠️ Refactor suggestion

    Seed missing attributes and log summary.

    Initialize search_metadata to avoid AttributeError later.

             self.pep_ids = pep_ids
             self.openms_handler = OpenMSHandler()
    +        # Optional metadata injected by callers; default to empty dict
    +        self.search_metadata: dict = {}
    🤖 Prompt for AI Agents
    In quantmsio/core/idxml_utils/idxml.py around lines 26 to 43, the IdXML class
    does not initialize the attribute search_metadata which causes an AttributeError
    later; seed this attribute in __init__ (e.g. self.search_metadata = {} or
    appropriate default) before any method may reference it, and keep the existing
    load and logging logic unchanged so the instance always has a defined
    search_metadata attribute after construction.
    

    Comment on lines +104 to +107
    # Extract protein accessions
    protein_accessions = [
        acc.decode() for acc in hit.extractProteinAccessionsSet()
    ]

    ⚠️ Potential issue

    Protein accessions decoding may crash.

    extractProteinAccessionsSet often yields strings; decoding blindly can fail. Decode bytes only.

    -                protein_accessions = [
    -                    acc.decode() for acc in hit.extractProteinAccessionsSet()
    -                ]
    +                protein_accessions = [
    +                    acc.decode() if isinstance(acc, (bytes, bytearray)) else str(acc)
    +                    for acc in hit.extractProteinAccessionsSet()
    +                ]
    🤖 Prompt for AI Agents
    In quantmsio/core/idxml_utils/idxml.py around lines 104 to 107, the code
    unconditionally calls .decode() on each accession from
    hit.extractProteinAccessionsSet(), which can raise if items are already strings;
    change the comprehension to decode only when an item is a bytes instance (e.g.,
    acc.decode()) and otherwise keep the item as-is (or str(acc) if you want to
    coerce non-str types), ensuring None or unexpected types are handled safely to
    avoid crashes.
    

    Comment on lines 26 to 31
    from quantmsio.operate.tools import (
        get_ahocorasick,
        get_modification_details,
        # get_ahocorasick,
        # get_modification_details,
        get_protein_accession,
    )
    from pyopenms.Constants import PROTON_MASS_U
    from quantmsio.utils.constants import ITRAQ_CHANNEL, TMT_CHANNELS

    ⚠️ Potential issue

    Re-import helpers required by this module.

    get_modification_details is used below; import it (and get_ahocorasick if you keep the automaton).

    -from quantmsio.operate.tools import (
    -    # get_ahocorasick,
    -    # get_modification_details,
    -    get_protein_accession,
    -)
    +from quantmsio.operate.tools import (
    +    get_ahocorasick,
    +    get_modification_details,
    +    get_protein_accession,
    +)
    🤖 Prompt for AI Agents
    In quantmsio/core/maxquant/maxquant.py around lines 26 to 31, re-import the
    helper(s) currently commented out: add get_modification_details to the import
    list from quantmsio.operate.tools (and also add get_ahocorasick if the
    Aho–Corasick automaton is used elsewhere in this module); update the import
    statement to include these names so the functions used later are available, or
    remove/comment any references below if you decide not to import them.
    

    Comment on lines 63 to 69
    def extract_col_msg(self, col_df, label: str = "feature"):
        line = "\t".join(col_df.columns)
        self.mods_map = self.get_mods_map(line)
        self._automaton = get_ahocorasick(self.mods_map)
        # self._automaton = get_ahocorasick(self.mods_map)
        if label == "feature":
            intensity_normalize_names = []
            intensity_names = []

    ⚠️ Potential issue

    Restore automaton initialization to avoid AttributeError.

    self._automaton is referenced later; initialize it.

             self.mods_map = self.get_mods_map(line)
    -        # self._automaton = get_ahocorasick(self.mods_map)
    +        self._automaton = get_ahocorasick(self.mods_map)
    🤖 Prompt for AI Agents
    In quantmsio/core/maxquant/maxquant.py around lines 63 to 69, restore
    initialization of self._automaton after self.mods_map is set to prevent later
    AttributeError: call and assign self._automaton = get_ahocorasick(self.mods_map)
    (or the appropriate factory function) immediately after self.mods_map =
    self.get_mods_map(line); if get_ahocorasick may return None, ensure you create a
    valid automaton or raise/log a clear error so downstream references are safe.
    

    msstats.loc[:, feature] = msstats[
        ["reference_file_name", "peptidoform", "precursor_charge"]
    ].apply(
        lambda rows: merge_psm(rows, i),

    ⚠️ Potential issue

    Fix closure issue in lambda function.

    The lambda function at line 527 has the same closure issue with the loop variable i.

             for i, feature in enumerate(map_features):
                 if feature not in msstats.columns:
                     msstats.loc[:, feature] = msstats[
                         ["reference_file_name", "peptidoform", "precursor_charge"]
                     ].apply(
    -                    lambda rows: merge_psm(rows, i),
    +                    lambda rows, idx=i: merge_psm(rows, idx),
                         axis=1,
                     )
    🧰 Tools
    🪛 Ruff (0.12.2)

    527-527: Function definition does not bind loop variable i

    (B023)

    🤖 Prompt for AI Agents
    In quantmsio/core/quantms/feature.py around line 527 the lambda passed (lambda
    rows: merge_psm(rows, i)) captures the loop variable i by reference causing all
    lambdas to use the final i; fix by binding the current i into the lambda's
    default arguments (e.g. lambda rows, i=i: merge_psm(rows, i)) or replace the
    lambda with functools.partial(merge_psm, i=i) so each closure uses the intended
    i value.
    

    Comment on lines 224 to 230
    if len(factor_columns) != 1:
    raise ValueError("The number of factor columns should be 1")

    # Rename the factor value column with condition as name
    sdrf_pd = sdrf_pd.rename(columns={factor_columns[0]: "condition"})
    sdrf_pd["condition"] = (
    sdrf_pd[factor_columns].astype(str).agg("|".join, axis=1)
    )
    else:
    sdrf_pd["condition"] = sdrf_pd[factor_columns]


    ⚠️ Potential issue

    Bug: assigning a DataFrame to a Series column.

    When exactly one factor column exists, sdrf_pd[factor_columns] is a DataFrame; assign the single Series instead.

    -        else:
    -            sdrf_pd["condition"] = sdrf_pd[factor_columns]
    +        else:
    +            # factor_columns has exactly 1 element here
    +            sdrf_pd["condition"] = sdrf_pd[factor_columns[0]].astype(str)
    🤖 Prompt for AI Agents
    In quantmsio/core/sdrf.py around lines 224 to 230, when there is exactly one
    factor column the code assigns sdrf_pd[factor_columns] (a one-column DataFrame)
    to the "condition" column; change this to assign the single Series instead by
    indexing the single column name (e.g., use factor_columns[0]) and ensure you
    cast it to string to match the multi-column path before assigning to
    sdrf_pd["condition"].
    

    @coderabbitai coderabbitai bot left a comment

    Actionable comments posted: 16

    ♻️ Duplicate comments (15)
    tests/core/quantms/test_full_conversion_suite.py (3)

    17-20: Fix workspace root to repository root

    Matches prior feedback; avoids path issues on CI.

     def get_workspace_root():
         """Get the workspace root directory"""
    -    return Path(__file__).parent.parent.parent
    +    return Path(__file__).resolve().parents[3]

    7-15: Use current interpreter instead of hardcoding 'python'

    Prevents environment mismatches; add sys import.

     import os
     import subprocess
     import tempfile
     import time
     from pathlib import Path
    +import sys

    119-131: Replace 'python' with sys.executable in all CLI invocations

    Ensures consistency across platforms and virtualenvs.

    -        cmd = [
    -            "python",
    +        cmd = [
    +            sys.executable,

    (Apply to all six command builders in these ranges.)

    Also applies to: 171-187, 227-243, 281-293, 333-349, 389-405

    quantmsio/core/quantms/feature.py (3)

    458-464: Bind loop variable in lambda to avoid late-binding bug (regression of prior comment).

    Each lambda closes over i; all rows get the last value. Bind i at definition time.

    Apply:

     for i, feature in enumerate(map_features):
    -    msstats.loc[:, feature] = msstats[
    +    index = i  # bind current index
    +    msstats.loc[:, feature] = msstats[
             ["reference_file_name", "peptidoform", "precursor_charge"]
         ].apply(
    -        lambda rows: merge_psm(rows, i),
    +        lambda rows, idx=index: merge_psm(rows, idx),
             axis=1,
         )
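
    A tiny generic illustration of the late-binding pitfall being fixed here (plain Python, not project code):

        callbacks = [lambda: i for i in range(3)]
        print([f() for f in callbacks])  # [2, 2, 2]: every lambda sees the final i

        callbacks = [lambda i=i: i for i in range(3)]
        print([f() for f in callbacks])  # [0, 1, 2]: default argument binds i per iteration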

    512-519: Bind loop variable in lambda to avoid late-binding bug (regression of prior comment).

    Same closure issue here; bind i.

    Apply:

     for i, feature in enumerate(map_features):
         if feature not in msstats.columns:
    -        msstats.loc[:, feature] = msstats[
    +        index = i
    +        msstats.loc[:, feature] = msstats[
                 ["reference_file_name", "peptidoform", "precursor_charge"]
             ].apply(
    -            lambda rows: merge_psm(rows, i),
    +            lambda rows, idx=index: merge_psm(rows, idx),
                 axis=1,
             )

    521-533: Loop variable shadows outer df; yields confusing bugs.

    Rename inner variable.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    tests/core/quantms/psms/test_psm.py (2)

    102-121: Create output parquet safely inside TemporaryDirectory.

    Avoid mktemp for files; use a path inside a temp dir.

    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    -    temp_output_path = tempfile.mktemp(suffix=".parquet")
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "psm.duckdb"
    +        temp_output_path = Path(tmpdir) / "psm.parquet"

    43-60: Replace insecure mktemp with TemporaryDirectory; remove manual unlink.

    Avoid CWE-377 and let the directory handle cleanup.

    -    # Create a temporary database path (don't create the file, let DuckDB do it)
    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "psm.duckdb"
     ...
    -    finally:
    -        # Clean up temporary database
    -        if os.path.exists(temp_db_path):
    -            os.unlink(temp_db_path)
    +    finally:
    +        pass

    Apply the same pattern to Lines 69-93, 102-129 (also for temp_output_path: create inside tmpdir), 141-159, 304-382, 394-536, and 548-585.

    quantmsio/core/quantms/pg.py (1)

    180-191: Dead method references undefined attribute _mztab_path.

    Calling _is_protein_line would raise AttributeError. Remove or guard it.

    -    def _is_protein_line(self, line_num: int) -> bool:
    -        """Check if a line is a protein data line."""
    -        try:
    -            # Use safe file opening with automatic cleanup
    -            with self._safe_file_open(self._mztab_path, "r") as file:
    -                for i, line in enumerate(file):
    -                    if i == line_num:
    -                        return line.startswith("PRT")
    -            return False
    -        except Exception:
    -            return False
    +    def _is_protein_line(self, line_num: int) -> bool:
    +        """Deprecated: no raw-file parsing in this class."""
    +        raise NotImplementedError("Not supported. Use indexer-backed access.")
    quantmsio/core/quantms/psm.py (2)

    378-384: Typo in CV param key: “consesus_support” → “consensus_support”

    Align the CV name with the intended spelling and downstream readers.

    -        cv_list = []
    -        if consensus_support and consensus_support != "null":
    -            struct = {
    -                "cv_name": "consesus_support",
    -                "cv_value": str(consensus_support),
    -            }
    -            cv_list.append(struct)
    +        cv_list = []
    +        if consensus_support and consensus_support != "null":
    +            struct = {
    +                "cv_name": "consensus_support",
    +                "cv_value": str(consensus_support),
    +            }
    +            cv_list.append(struct)

    Also applies to: 523-531


    654-666: Fix variable shadowing in generator

    Loop variable df shadows the outer df.

    -        for key, df in df.groupby(partitions):
    -            yield key, df
    +        for key, group_df in df.groupby(partitions):
    +            yield key, group_df
    quantmsio/core/duckdb.py (4)

    151-159: Fix table-existence check (variable shadowing bug)

    The method shadows tables and checks each table against itself, so the result is always True. Use a separate variable.

         def check_tables_in_db(self, tables: List[str]) -> bool:
    -        """Check if the database connection is valid."""
    -        tables = [t[0] for t in duckdb.execute("SHOW TABLES").fetchall()]
    -        for table in tables:
    -            if table not in tables:
    +        """Check if required tables exist in the database."""
    +        existing_tables = [t[0] for t in self._duckdb.execute("SHOW TABLES").fetchall()]
    +        for table in tables:
    +            if table not in existing_tables:
                     logging.info(f"Table {table} not in database")
                     return False
             return True

    507-512: Parameterize iter_file batch query

    Inline stringified list and quotes are incorrect and unsafe.

    -            query = f"""
    -            SELECT {cols}
    -            FROM report
    -            WHERE "{field}" = ANY('{refs}')
    -            """
    -            yield refs, self.query_to_df(query)
    +            placeholders = ",".join("?" for _ in refs)
    +            query = f'SELECT {cols} FROM report WHERE "{field}" IN ({placeholders})'
    +            yield refs, self._duckdb.execute(query, refs).df()

    477-483: Parameterize get_report_from_database refs filter

    Avoid interpolating lists; use IN (?,...,?).

    -        query = f"""
    -        SELECT {columns}
    -        FROM report
    -        WHERE "{RUN}" = ANY({refs})
    -        """
    -        return self.query_to_df(query)
    +        placeholders = ",".join("?" for _ in refs)
    +        query = f'SELECT {columns} FROM report WHERE "{RUN}" IN ({placeholders})'
    +        return self._duckdb.execute(query, refs).df()

    354-358: Parameterize filtering (avoid ANY(...) with inline list)

    Inline Python lists in SQL is unsafe and brittle. Use placeholders.

             query = "SELECT * FROM pg_matrix"
             if protein_groups_tuple:
    -            query += f' WHERE "{PROTEIN_GROUP}" = ANY({list(protein_groups_tuple)})'
    -
    -        return self.query_to_df(query)
    +            placeholders = ",".join("?" for _ in protein_groups_tuple)
    +            query += f' WHERE "{PROTEIN_GROUP}" IN ({placeholders})'
    +            return self._duckdb.execute(query, list(protein_groups_tuple)).df()
    +        return self._duckdb.execute(query).df()
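
    A minimal, standalone sketch of the placeholder pattern suggested above, using an in-memory DuckDB connection and a toy "report" table (table and column names are illustrative only):

        import duckdb

        con = duckdb.connect()  # in-memory database
        con.execute("CREATE TABLE report (run VARCHAR, intensity DOUBLE)")
        con.execute(
            "INSERT INTO report VALUES ('run_a', 1.0), ('run_b', 2.0), ('run_c', 3.0)"
        )

        refs = ["run_a", "run_c"]
        placeholders = ",".join("?" for _ in refs)
        query = f'SELECT * FROM report WHERE "run" IN ({placeholders})'
        # Values travel as bound parameters, never interpolated into the SQL text.
        print(con.execute(query, refs).df())
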
    🧹 Nitpick comments (37)
    tests/core/quantms/test_msstats_in.py (6)

    287-301: Handle both reference file column names when building the matrix

    Hard-coding reference_file_name can break with legacy inputs using Reference_Name.

    -    for _, row in msstats_batch.iterrows():
    -        ref_file = row["reference_file_name"]
    +    for _, row in msstats_batch.iterrows():
    +        ref_col = (
    +            "Reference_Name"
    +            if "Reference_Name" in msstats_batch.columns
    +            else "reference_file_name"
    +        )
    +        ref_file = row[ref_col]

    165-179: Prefer public API over private _duckdb and use normalized column names

    Use query_to_df and the normalized lower-case columns to avoid brittle tests.

    -        raw_channels = indexer._duckdb.sql(
    -            "SELECT DISTINCT Channel FROM msstats ORDER BY Channel"
    -        ).df()
    +        raw_channels = indexer.query_to_df(
    +            "SELECT DISTINCT channel FROM msstats ORDER BY channel"
    +        )
    @@
    -        sample_raw = indexer._duckdb.sql(
    -            """
    -            SELECT Channel, Reference, ProteinName, PeptideSequence, Intensity 
    -            FROM msstats 
    -            WHERE Reference LIKE '%a05058%' 
    -            AND ProteinName LIKE '%P55011%' 
    -            LIMIT 15
    -        """
    -        ).df()
    +        sample_raw = indexer.query_to_df(
    +            """
    +            SELECT channel, reference_file_name, pg_accessions, peptidoform, intensity
    +            FROM msstats
    +            WHERE reference_file_name LIKE '%a05058%'
    +              AND pg_accessions LIKE '%P55011%'
    +            LIMIT 15
    +            """
    +        )

    349-349: Replace ambiguous multiplication sign

    Avoid the Unicode “×” to prevent encoding/test log issues.

    -    print("\nFeatures by File × Channel Matrix:")
    +    print("\nFeatures by File x Channel Matrix:")

    117-117: Remove extraneous f-strings without placeholders

    Minor cleanliness; keeps linters quiet (Ruff F541).

    -        print(f"\nLFQ test completed successfully!")
    +        print("\nLFQ test completed successfully!")
    @@
    -    print(f"\nTMT test completed successfully!")
    +    print("\nTMT test completed successfully!")
    @@
    -                print(f"Intensity distribution analysis completed")
    +                print("Intensity distribution analysis completed")
    @@
    -    print(f"\nDataset Comparison:")
    +    print("\nDataset Comparison:")

    Also applies to: 465-465, 536-536, 589-589


    9-10: Drop unused imports

    gzip and tempfile are not used in this module.

    -import gzip
    -import tempfile
    +import tempfile  # if you still need it elsewhere, keep; otherwise remove

    If tempfile is also unused, remove it as well.


    181-182: Avoid catching bare Exception in debug helper

    Narrow the exception or re-raise after logging to keep failures visible in CI.

    -    except Exception as e:
    +    except (RuntimeError, ValueError) as e:
    quantmsio/utils/constants.py (3)

    94-95: Alias peptide name to the sequence constant to prevent drift

    Avoid duplicating the same string literal in two places.

    -MSSTATS_PEPTIDE_NAME = "PeptideSequence"
    +MSSTATS_PEPTIDE_NAME = MSSTATS_PEPTIDE_SEQUENCE

    111-114: Make MSstats columns order deterministic

    Using set() makes column order non-deterministic, which can lead to unstable schemas across chunks.

    -MSSTATS_USECOLS = list(
    -    set(MSSTATS_NAME_MAP.values())
    -    | {"sample_accession", "reference_file_name", "channel"}
    -)
    +MSSTATS_USECOLS = [
    +    "pg_accessions",
    +    "peptidoform",
    +    "charge",
    +    "intensity",
    +    "sample_accession",
    +    "reference_file_name",
    +    "channel",
    +]

    121-133: Use declared constants within MZTAB_PROTEIN_COLUMNS for consistency

    Keeps identifiers centralized and reduces typos.

     MZTAB_PROTEIN_COLUMNS = [
         ACCESSION,
         "description",
         MZTAB_PROTEIN_BEST_SEARCH_ENGINE_SCORE,
    -    "ambiguity_members",
    +    AMBIGUITY_MEMBERS,
         "modifications",
         "protein_coverage",
         "opt_global_nr_found_peptides",
         "opt_global_Posterior_Probability_score",
         "opt_global_cv_PRIDE:0000303_decoy_hit",
    -    "opt_global_result_type",
    +    OPT_GLOBAL_RESULT_TYPE,
     ]
    tests/core/quantms/test_full_conversion_suite.py (3)

    1-1: Remove shebang or mark file executable

    The shebang isn’t needed for a pytest module and triggers Ruff EXE001.

    -#!/usr/bin/env python3

    91-101: Avoid loading full Parquet into memory for simple stats

    Use metadata/schema to get rows/cols and column names.

    -    try:
    -        df = pq.read_table(file_path).to_pandas()
    -        size_mb = file_path.stat().st_size / (1024 * 1024)
    -
    -        return {
    -            "rows": len(df),
    -            "columns": len(df.columns),
    -            "size_mb": size_mb,
    -            "column_names": list(df.columns)[:10],  # First 10 columns
    -        }
    +    try:
    +        pf = pq.ParquetFile(file_path)
    +        size_mb = file_path.stat().st_size / (1024 * 1024)
    +        names = list(pf.schema_arrow.names)
    +        return {
    +            "rows": pf.metadata.num_rows,
    +            "columns": len(names),
    +            "size_mb": size_mb,
    +            "column_names": names[:10],
    +        }

    81-84: Narrow broad exception or re-raise after logging

    Catching Exception hides real failures during CI; consider narrowing to OSError/CalledProcessError or re-raising.

    quantmsio/core/quantms/feature.py (4)

    488-496: Column name “mp_accessions” looks unintended; likely “accession” or “pg_accessions”.

    “mp_accessions” is not used elsewhere and may not exist in FEATURE_SCHEMA; risk of silent drop/mismatch. Please confirm the intended field and rename accordingly.

    Example:

    -    "mp_accessions",
    +    "accession",  # or "pg_accessions" if you intend PG group accessions

    If you intend PGs, consider deriving from existing pg_accessions instead of the PSM accession string.


    660-664: anchor_protein type inconsistency with schema.

    add_additional_msg converts anchor_protein to list, but convert_to_parquet_format later casts anchor_protein to str. Confirm FEATURE_SCHEMA: if anchor_protein is string, keep the list in a new field (anchor_proteins) and leave anchor_protein as string; if it’s a list, remove it from string_columns and add to list_columns.

    Option A (keep string + add list):

    -    msstats["anchor_protein"] = msstats["anchor_protein"].apply(
    -        get_protein_accession
    -    )
    +    msstats["anchor_proteins"] = msstats["anchor_protein"].apply(get_protein_accession)
     string_columns = [
    -    "sequence","peptidoform","reference_file_name","anchor_protein","scan","scan_reference_file_name",
    +    "sequence","peptidoform","reference_file_name","anchor_protein","scan","scan_reference_file_name",
     ]
    -list_columns = ["pg_accessions", "gg_accessions", "gg_names"]
    +list_columns = ["pg_accessions", "gg_accessions", "gg_names", "anchor_proteins"]

    Option B (schema expects list): remove anchor_protein from string_columns and add it to list_columns.

    Also applies to: 739-747, 751-763


    114-116: Overbroad exception handlers hide real errors.

    Catching bare Exception and immediately returning empty structures makes debugging hard. Narrow exceptions and re-raise unexpected errors, or at least log with stack_info/exec_info.

    Example:

    -        except Exception as e:
    -            self.logger.warning(f"Could not extract PSM messages: {e}")
    -            return {}, {}
    +        except (KeyError, AttributeError, ValueError) as e:
    +            self.logger.warning("Could not extract PSM messages", exc_info=e)
    +            return {}, {}

    Repeat pattern for other helpers.

    Also applies to: 233-235, 257-259, 270-272, 327-331


    134-223: generate_modifications_details: simplify score assignment and guard indexing.

    • scores selection uses “mod_name in select_mods” which is always True (select_mods = keys).
    • Accessing sequence[aa_pos-1] needs a length guard.
    -                        aa = sequence[aa_pos - 1]
    +                        if 0 < aa_pos <= len(sequence):
    +                            aa = sequence[aa_pos - 1]
    +                        else:
    +                            continue
    ...
    -                            "scores": (
    -                                [
    -                                    {"score_name": "localization_probability","score_value": None}
    -                                ] if mod_name in select_mods else None
    -                            ),
    +                            "scores": None,  # until a real source is available

    Also drop the unused select_mods variable.

    tests/core/quantms/test_feature.py (1)

    36-40: Prefer context-managing the indexer.

    MzTabIndexer supports enter/exit. Using “with MzTabIndexer.create(...) as indexer” guarantees cleanup even on failure.

    -    indexer = MzTabIndexer.create(
    +    with MzTabIndexer.create(
             mztab_path=mztab_file,
             msstats_path=msstats_file,
             sdrf_path=sdrf_file,
    -        database_path=temp_db_path,
    -    )
    +        database_path=temp_db_path,
    +    ) as indexer:
    tests/core/quantms/psms/test_psm.py (1)

    246-254: Use Pythonic truth checks in tests.

    E712 flags “== True/False”. Prefer direct truthiness.

    -    ) == True
    +    )
    ...
    -    assert _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"]) == False
    +    assert not _is_ms_run_location_line(["MTD", "ms_run[1]-format", "mzML"])
    tests/core/quantms/psms/test_duckdb_creation.py (4)

    137-139: Remove unnecessary f-strings without placeholders.

    These prints don’t interpolate variables and trigger F541 in Ruff. Drop the f prefix.

    Apply across the listed lines, e.g.:

    - print(f"\n=== Testing Two-Step Creation Process ===")
    + print("\n=== Testing Two-Step Creation Process ===")

    Also applies to: 166-177, 226-234, 263-268, 293-304, 350-361, 385-392, 419-431, 472-480, 499-503, 530-537, 564-569, 617-625, 644-648


    149-154: Avoid accessing private attribute _duckdb in tests.

    Use a read-only DuckDB connection for verification to keep tests decoupled from internals.

    Example for one block:

    - tables = [t[0] for t in mztab_step1._duckdb.execute("SHOW TABLES").fetchall()]
    + with duckdb.connect(str(output_path), read_only=True) as con_verify:
    +     tables = [t[0] for t in con_verify.execute("SHOW TABLES").fetchall()]

    Also applies to: 276-281, 311-317, 399-404, 442-449, 586-594


    653-660: Replace bare except/pass in cleanup helpers.

    Use contextlib.suppress or log at debug to avoid silencing unexpected errors and S110/BLE001 warnings.

    Apply:

    +import contextlib
    @@
    -    if mztab_indexer and hasattr(mztab_indexer, "_duckdb") and mztab_indexer._duckdb:
    -        try:
    -            mztab_indexer._duckdb.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if mztab_indexer and getattr(mztab_indexer, "_duckdb", None):
    +        with contextlib.suppress(Exception):
    +            mztab_indexer._duckdb.close()
    @@
    -    if con:
    -        try:
    -            con.close()
    -        except Exception:
    -            pass  # Ignore errors during cleanup
    +    if con:
    +        with contextlib.suppress(Exception):
    +            con.close()

    Also applies to: 662-669


    46-50: Test name/docs mention Parquet but only DuckDB is exercised.

    Either add Parquet path coverage or rename to avoid confusion.

    quantmsio/core/quantms/pg.py (3)

    45-55: Use a consistent logger; prefer the instance logger.

    cleanup() uses a different logger name than the class logger, fragmenting log control.

    -        logger = logging.getLogger("quantmsio.core.mztab")
    +        logger = self.logger

    Also replace other logging.getLogger("quantmsio.core.mztab")/quantmsio.core.quantms.pg occurrences in this method with self.logger.

    Also applies to: 57-66


    266-271: Prefer logging.exception in except blocks.

    Leverage stack traces directly; current error messages duplicate str(e).

    -                logging.getLogger("quantmsio.core.mztab").error(
    -                    f"Error opening file {file_path}: {e}"
    -                )
    +                logging.getLogger("quantmsio.core.mztab").exception(
    +                    "Error opening file %s", file_path
    +                )
    @@
    -            logger.error(f"Error loading protein groups table: {e}")
    +            logger.exception("Error loading protein groups table")
    @@
    -            logger.error(f"Error creating msstats protein join: {e}")
    +            logger.exception("Error creating msstats protein join")

    Also applies to: 283-289, 485-487, 555-557


    425-491: Fix logging and remove unused protein_groups var.

    You build protein_data but log len(protein_groups), which is always 0.

    -        protein_groups = {}
    -        protein_data = list()
    +        protein_data: list[dict] = []
    @@
    -        logger.info(
    -            f"Loaded {len(protein_groups)} protein groups from {total_rows} total rows"
    -        )
    +        logger.info(
    +            "Loaded %d protein groups from %d total rows",
    +            len(protein_data),
    +            total_rows,
    +        )
    quantmsio/commands/convert/quantms.py (5)

    129-132: Preserve traceback and avoid redundant str(e) in logs.

    Use logger.exception("...") and chain exceptions with from e.

    -    except Exception as e:
    -        logger.exception(f"Error in feature conversion: {str(e)}")
    -        raise click.ClickException(f"Error: {str(e)}\nCheck the logs for more details.")
    +    except Exception as e:
    +        logger.exception("Error in feature conversion")
    +        raise click.ClickException(f"Error: {e}\nCheck the logs for more details.") from e

    Repeat similarly for PSM, idXML-PSM, PG, and create-duckdb handlers.

    Also applies to: 211-214, 267-270, 416-419, 534-537


    324-335: Annotate optional parameters explicitly with Optional[...].

    PEP 484 discourages implicit Optional typing; update annotations.

    -def convert_quantms_pg_cmd(
    -    mztab_path: Path = None,
    -    database_path: Path = None,
    -    msstats_file: Path = None,
    -    sdrf_file: Path = None,
    -    output_folder: Path = None,
    +def convert_quantms_pg_cmd(
    +    mztab_path: Optional[Path] = None,
    +    database_path: Optional[Path] = None,
    +    msstats_file: Optional[Path] = None,
    +    sdrf_file: Optional[Path] = None,
    +    output_folder: Optional[Path] = None,

    Ensure from typing import Optional is present (it already is).


    520-520: Drop extraneous f-prefix.

    Minor logging polish: remove f when no placeholders are used.

    -            logger.info(f"Successfully processed:")
    +            logger.info("Successfully processed:")

    88-95: Minor: avoid redundant Path() wrapping.

    database_path is already a Path; use it directly for .exists() and str() conversions.

    Example:

    -        if database_path and Path(database_path).exists():
    +        if database_path and database_path.exists():

    Also applies to: 181-194, 369-385, 501-509


    402-415: Use public conversion helper to write Parquet.

    If possible, expose a writer method on MzTabProteinGroups similar to Feature.write_feature_to_file to avoid calling a private _convert_to_parquet_format.
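
    A minimal sketch of what such a public writer could look like, assuming _convert_to_parquet_format() returns a pyarrow.Table and the class keeps its instance logger; the method name write_pg_to_file and its exact signature are hypothetical, chosen only to mirror Feature.write_feature_to_file:

        import pyarrow.parquet as pq

        class MzTabProteinGroups:
            def write_pg_to_file(self, output_path: str) -> None:
                """Convert the loaded protein groups and write them to Parquet.

                Wraps the private _convert_to_parquet_format() so callers such as
                the CLI handler never touch implementation details.
                """
                table = self._convert_to_parquet_format()  # existing private helper (assumed signature)
                pq.write_table(table, output_path)
                self.logger.info("Wrote protein groups to %s", output_path)

    The PG conversion handler would then call the public method instead of reaching into the private helper.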

    quantmsio/core/quantms/psm.py (6)

    386-392: Avoid KeyError; reuse computed spectra_ref and handle None

    Use the local spectra_ref and guard for None to prevent KeyError.

    -            transformed_record["scan"] = generate_scan_number(record["spectra_ref"])
    +            transformed_record["scan"] = (
    +                generate_scan_number(spectra_ref) if spectra_ref else None
    +            )

    430-436: Remove duplicate assignments

    mz_array and intensity_array are set twice.

                 transformed_record["number_peaks"] = None
                 transformed_record["ion_mobility"] = None
    -            transformed_record["mz_array"] = None
    -            transformed_record["intensity_array"] = None

    439-441: Log full traceback on transform failures

    Use logger.exception for better diagnostics.

    -        except Exception as e:
    -            self.logger.warning(f"Error transforming PSM record: {e}")
    +        except Exception:
    +            self.logger.exception("Error transforming PSM record")
                 return None

    478-491: Use .get() to avoid KeyError on missing score columns

    PSM rows may not carry all search_engine_score[i] fields.

    -        for score_index, score_name in self._score_names["psms"].items():
    -            value = record[f"search_engine_score[{score_index}]"]
    +        for score_index, score_name in self._score_names["psms"].items():
    +            value = record.get(f"search_engine_score[{score_index}]")
                 if value is not None and value != "null":

    507-509: Minor: remove unnecessary f-string

    No placeholders present.

    -                self.logger.warning(f"Could not convert global_qvalue to float")
    +                self.logger.warning("Could not convert global_qvalue to float")

    580-582: Prefer logger.exception when catching unexpected errors

    Preserves stack trace for debugging.

    -        except Exception as e:
    -            self.logger.error(f"Could not extract peptide columns: {e}")
    +        except Exception:
    +            self.logger.exception("Could not extract peptide columns")
                 self._pep_columns = []
    quantmsio/core/duckdb.py (1)

    223-225: Use logger.exception to capture stack trace on benchmark failures

    -            except Exception as e:
    -                self.logger.error(f"Query failed on iteration {i+1}: {e}")
    +            except Exception as e:
    +                self.logger.exception(f"Query failed on iteration {i+1}")
                     return {"error": str(e)}

    Keep the as e binding here: the retained return statement still needs str(e).

    Comment on lines +105 to +116
            elif mztab_path:
                # No database_path provided, create a temporary one
                temp_db_path = tempfile.mktemp(suffix=".duckdb")
                logger.info(
                    f"Creating temporary MzTabIndexer at {temp_db_path} from {mztab_path}"
                )
                indexer = MzTabIndexer.create(
                    mztab_path=str(mztab_path),
                    msstats_path=msstats_file,
                    sdrf_path=sdrf_file,
                    database_path=temp_db_path,
                )

    ⚠️ Potential issue

    Replace insecure tempfile.mktemp() with NamedTemporaryFile(delete=False).

    mktemp is vulnerable to race conditions (CWE-377/B306). Use a secure temp file.

    -            temp_db_path = tempfile.mktemp(suffix=".duckdb")
    +            from tempfile import NamedTemporaryFile
    +            with NamedTemporaryFile(suffix=".duckdb", delete=False) as tf:
    +                temp_db_path = tf.name

    Apply in all three places.

    Also applies to: 195-204, 386-396
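
    Since the same pattern repeats at all three call sites, one option is a small module-level helper; a minimal sketch, assuming the command module can own it (the name _make_temp_duckdb_path is hypothetical):

        from pathlib import Path
        from tempfile import NamedTemporaryFile

        def _make_temp_duckdb_path() -> Path:
            """Create a unique temporary DuckDB path without the mktemp race.

            The file is created and closed immediately so it can be reopened by
            name; the caller stays responsible for deleting it when done.
            """
            with NamedTemporaryFile(suffix=".duckdb", delete=False) as tf:
                return Path(tf.name)

    Each branch would then call temp_db_path = _make_temp_duckdb_path() before MzTabIndexer.create(...). Note this leaves an empty file on disk; if MzTabIndexer.create requires a path that does not yet exist, prefer the TemporaryDirectory variant suggested for the tests and build the filename inside that directory.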

    🧰 Tools
    🪛 ast-grep (0.38.6)

    [warning] 106-106: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
    Context: tempfile.mktemp(suffix=".duckdb")
    Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
    https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
    https://owasp.org/Top10/A01_2021-Broken_Access_Control

    (avoid-mktemp-python)

    🪛 Ruff (0.12.2)

    107-107: Use of insecure and deprecated function (mktemp)

    (S306)

    🪛 GitHub Check: CodeFactor

    [warning] 110-110: quantmsio/commands/convert/quantms.py#L110
    Use of insecure and deprecated function (mktemp). (B306)

    🤖 Prompt for AI Agents
    In quantmsio/commands/convert/quantms.py around lines 105 to 116 (and also apply
    the same change at 195-204 and 386-396), replace the insecure tempfile.mktemp()
    calls with a secure tempfile.NamedTemporaryFile(delete=False, suffix=".duckdb")
    usage: create a NamedTemporaryFile with delete=False, capture its .name for
    database_path, close the file descriptor immediately (so it can be opened by
    other code) and pass the .name to MzTabIndexer.create; ensure the temporary file
    is cleaned up later if needed.
    

    Comment on lines 24 to +51
        def __init__(
            self,
            report_path: Union[Path, str],
            duckdb_max_memory: str = "16GB",
            duckdb_threads: int = 4,
            database_path: Optional[Union[Path, str]] = None,
            max_memory: str = "16GB",
            worker_threads: int = 8,
        ):
            self.parquet_db: duckdb.DuckDBPyConnection | None = None
            self._report_path = report_path
            self._duckdb_name = create_uuid_filename("report-duckdb", ".db")
            self._duckdb = self.create_duckdb_from_diann_report(
                duckdb_max_memory, duckdb_threads
            """Initialize DuckDB connection.

            Args:
                database_path: Optional path to DuckDB database file
                max_memory: Maximum memory to use
                worker_threads: Number of worker threads
            """
            self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
            self.logger.info(f"duckdb uses {worker_threads} threads.")
            self.logger.info(f"duckdb uses {max_memory} of memory.")

            start_time = time.time()
            self._duckdb = duckdb.connect(
                database=str(database_path) if database_path else ":memory:",
                config={
                    "memory_limit": max_memory,
                    "threads": worker_threads,
                },
            )
            self.logger.info(
                f"Time to initialize duckdb {(time.time() - start_time):.2f} seconds"
            )

    🛠️ Refactor suggestion

    Persist database path and apply resource settings via constructor

    Store database_path for cleanup and pass max_memory/worker_threads through __init__. Otherwise, cleanup_duckdb can’t locate the file and subclass workarounds appear (see DiannDuckDB).

         def __init__(
             self,
             database_path: Optional[Union[Path, str]] = None,
             max_memory: str = "16GB",
             worker_threads: int = 8,
         ):
             """Initialize DuckDB connection.
    @@
    -        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
    +        self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
             self.logger.info(f"duckdb uses {worker_threads} threads.")
             self.logger.info(f"duckdb uses {max_memory} of memory.")
     
             start_time = time.time()
    -        self._duckdb = duckdb.connect(
    -            database=str(database_path) if database_path else ":memory:",
    +        # Track database path for cleanup
    +        self._database_path = Path(database_path) if database_path else None
    +        self._duckdb = duckdb.connect(
    +            database=str(self._database_path) if self._database_path else ":memory:",
                 config={
                     "memory_limit": max_memory,
                     "threads": worker_threads,
                 },
             )
    🤖 Prompt for AI Agents
    In quantmsio/core/duckdb.py around lines 24 to 51, the constructor currently
    does not persist the provided database_path or the resource settings on the
    instance, which prevents cleanup_duckdb and subclasses from locating the DB file
    or reading configured resources; fix by storing the incoming parameters on self
    (e.g., self.database_path = Path(database_path) if provided else None,
    self.max_memory = max_memory, self.worker_threads = worker_threads) and continue
    to use those stored values when connecting (and later when cleaning up) so
    cleanup_duckdb and subclasses can access the actual path and resource settings.
    

    Comment on lines +59 to +78
        def cleanup_duckdb(self):
            """Check if DuckDB connection is closed, then delete the database file."""
            # Close connection if it is still open
            if self._duckdb:
                try:
                    self._duckdb.close()
                    self.logger.info("[Check] DuckDB connection closed.")
                except Exception as e:
                    self.logger.info(f"Failed to close DuckDB connection: {e}")
                finally:
                    self._duckdb = None

            db_file = Path(self._database_path)
            # Delete the database file using pathlib
            if db_file.exists():
                try:
                    db_file.unlink()
                    self.logger.info(f"[CleanUp] Database file deleted: {db_file}")
                except Exception as e:
                    self.logger.info(f"Failed to delete database file: {e}")

    ⚠️ Potential issue

    cleanup_duckdb breaks when _database_path is unset

    When using an in-memory DB, _database_path is None; current code raises. Guard for None before deletion.

    -        db_file = Path(self._database_path)
    -        # Delete the database file using pathlib
    -        if db_file.exists():
    +        if not getattr(self, "_database_path", None):
    +            return
    +        db_file = Path(self._database_path)
    +        # Delete the database file using pathlib
    +        if db_file.exists():
                 try:
                     db_file.unlink()
                     self.logger.info(f"[CleanUp] Database file deleted: {db_file}")
                 except Exception as e:
                     self.logger.info(f"Failed to delete database file: {e}")
    🧰 Tools
    🪛 Ruff (0.12.2)

    66-66: Do not catch blind exception: Exception

    (BLE001)


    77-77: Do not catch blind exception: Exception

    (BLE001)

    🤖 Prompt for AI Agents
    In quantmsio/core/duckdb.py around lines 59 to 78, cleanup_duckdb assumes
    self._database_path is set and constructs a Path which raises when it's None;
    guard before attempting deletion by checking if self._database_path is truthy
    (not None/empty) and only then create Path(self._database_path) and call
    exists()/unlink(); if it's None or represents an in-memory DB skip file deletion
    and log that no file was present; preserve existing exception handling and set
    self._duckdb = None in the finally block.
    

    Comment on lines +84 to 105
        def secure_query_to_df(
            self, query: str, params: Optional[Dict[str, Any]] = None
        ) -> pd.DataFrame:
            """Execute parameterized query safely and return result as DataFrame.

            Args:
                query: SQL query with placeholders (e.g., 'SELECT * FROM table WHERE id = ?')
                params: Dictionary of parameters to safely substitute

            if max_memory is not None:
                database.execute("SET max_memory='{}'".format(max_memory))
            if worker_threads is not None:
                database.execute("SET worker_threads='{}'".format(worker_threads))
            Returns:
                DataFrame with query results

            msg = database.execute(
                "SELECT * FROM duckdb_settings() where name in ('worker_threads', 'max_memory')"
            ).df()
            logging.info("duckdb uses {} threads.".format(str(msg["value"][0])))
            logging.info("duckdb uses {} of memory.".format(str(msg["value"][1])))
            Example:
                >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", {"accession": "P12345"})
            """
            if params:
                # DuckDB uses positional parameters with ?
                param_values = list(params.values())
                return self._duckdb.execute(query, param_values).df()
            else:
                return self._duckdb.execute(query).df()


    🛠️ Refactor suggestion

    Parameter order bug and API mismatch in secure_query_to_df

    DuckDB uses positional ? placeholders. A dict is bound purely by insertion order, not by key, so the keys are silently ignored and callers can easily bind parameters in the wrong order. Accept a sequence and pass it through.

    -    def secure_query_to_df(
    -        self, query: str, params: Optional[Dict[str, Any]] = None
    -    ) -> pd.DataFrame:
    +    def secure_query_to_df(
    +        self, query: str, params: Optional[List[Any]] = None
    +    ) -> pd.DataFrame:
    @@
    -        Example:
    -            >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", {"accession": "P12345"})
    +        Example:
    +            >>> secure_query_to_df("SELECT * FROM proteins WHERE accession = ?", ["P12345"])
    @@
    -        if params:
    -            # DuckDB uses positional parameters with ?
    -            param_values = list(params.values())
    -            return self._duckdb.execute(query, param_values).df()
    +        if params:
    +            # DuckDB uses positional parameters with ?
    +            return self._duckdb.execute(query, params).df()
             else:
                 return self._duckdb.execute(query).df()
    🤖 Prompt for AI Agents
    In quantmsio/core/duckdb.py around lines 84 to 105, secure_query_to_df currently
    accepts params: Optional[Dict[str, Any]] which can reorder positional ?
    bindings; change the API to accept a sequence (Optional[Sequence[Any]] or
    List[Any]) instead, update the docstring/example to show a positional list/tuple
    (e.g., ["P12345"]), and when params is provided pass it (or list(params))
    directly to self._duckdb.execute(query, params) so binding order matches the
    query; keep the no-params branch unchanged.
    

    Comment on lines +305 to 311
            database_name = create_uuid_filename("diann-report", ".duckdb")
            super().__init__(database_name)

        def get_unique_references(self, field: str) -> list:
            # Initialize database and create report table
            self.initialize_database(max_memory, worker_threads)
            self.create_table_from_file("report", self._report_path, [PROTEIN_GROUP, RUN])


    ⚠️ Potential issue

    Call to undefined initialize_database and unused resource args

    DiannDuckDB.__init__ calls a non-existent method. Pass max_memory/worker_threads to super().__init__ and remove the call.

    -        database_name = create_uuid_filename("diann-report", ".duckdb")
    -        super().__init__(database_name)
    -
    -        # Initialize database and create report table
    -        self.initialize_database(max_memory, worker_threads)
    +        database_name = create_uuid_filename("diann-report", ".duckdb")
    +        super().__init__(database_name, max_memory=max_memory, worker_threads=worker_threads)
     
             self.create_table_from_file("report", self._report_path, [PROTEIN_GROUP, RUN])

    Comment on lines +633 to +641
                        # Get reference file name
                        ref_file = None
                        if pd.notna(row["spectra_ref"]):
                            try:
                                ms_run_id = row["spectra_ref"].split(":")[0]
                                ref_file = self._ms_runs.get(ms_run_id)
                            except (ValueError, AttributeError):
                                pass


    ⚠️ Potential issue

    MS run mapping bug: wrong key type used for _ms_runs

    self._ms_runs is keyed by integer ms_run IDs, but here you use the raw "ms_run[X]" string; this always yields None. Reuse your parser.

    -                    if pd.notna(row["spectra_ref"]):
    -                        try:
    -                            ms_run_id = row["spectra_ref"].split(":")[0]
    -                            ref_file = self._ms_runs.get(ms_run_id)
    -                        except (ValueError, AttributeError):
    -                            pass
    +                    if pd.notna(row["spectra_ref"]):
    +                        ref_file = self._get_reference_file_name(row["spectra_ref"])
    🤖 Prompt for AI Agents
    In quantmsio/core/quantms/psm.py around lines 633 to 641, the code looks up
    self._ms_runs using the raw string token from row["spectra_ref"] which yields
    None because _ms_runs is keyed by integer ms_run IDs; parse the spectra_ref to
    extract the numeric ms_run id exactly as the existing parser does (convert to
    int) and use that integer as the key when calling self._ms_runs.get(...),
    preserving the current try/except behavior to handle malformed values.
    

    Comment on lines +642 to +645
                        # Update map with best score
                        if key not in pep_map or row["score"] < pep_map[key][0]:
                            pep_map[key] = [row["score"], ref_file, scan_number]


    🛠️ Refactor suggestion

    Guard None and normalize score to float before comparison

    Avoid TypeError when score is missing; store numeric consistently.

    -                    if key not in pep_map or row["score"] < pep_map[key][0]:
    -                        pep_map[key] = [row["score"], ref_file, scan_number]
    +                    try:
    +                        score_val = float(row["score"])
    +                    except (TypeError, ValueError):
    +                        continue
    +                    if key not in pep_map or score_val < pep_map[key][0]:
    +                        pep_map[key] = [score_val, ref_file, scan_number]
    🤖 Prompt for AI Agents
    In quantmsio/core/quantms/psm.py around lines 642 to 645, the code compares
    row["score"] directly which can be None or a non-numeric string; normalize by
    parsing score to a float (e.g. try: score = float(row.get("score")) except
    (TypeError, ValueError): score = float("inf")) and then use that numeric score
    in the comparison and when storing: if key not in pep_map or score <
    pep_map[key][0]: pep_map[key] = [score, ref_file, scan_number].
    

    Comment on lines +26 to +35
        temp_db_path = tempfile.mktemp(suffix=".duckdb")

        # Initialize Feature
        indexer = MzTabIndexer.create(
            mztab_path=mztab_file,
            msstats_path=msstats_file,
            sdrf_path=sdrf_file,
            database_path=temp_db_path,
        )


    🛠️ Refactor suggestion

    Replace insecure tempfile.mktemp with TemporaryDirectory path (security, reliability).

    mktemp is vulnerable (CWE-377) and can race; also MzTabIndexer expects a non-existent DB path. Use a temp directory and create the filename within it.

    -    # Use a temporary directory for the backend database
    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    -
    -    # Initialize Feature
    -    indexer = MzTabIndexer.create(
    -        mztab_path=mztab_file,
    -        msstats_path=msstats_file,
    -        sdrf_path=sdrf_file,
    -        database_path=temp_db_path,
    -    )
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "feature_test.duckdb"
    +        indexer = MzTabIndexer.create(
    +            mztab_path=mztab_file,
    +            msstats_path=msstats_file,
    +            sdrf_path=sdrf_file,
    +            database_path=temp_db_path,
    +        )
    +        feature = Feature(mztab_indexer=indexer)
    +        ...

    Repeat the TemporaryDirectory pattern for test_extract_psm_msg (Lines 66-74) and test_generate_feature (Lines 106-114).
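
    If these tests run under pytest, the built-in tmp_path fixture gives the same isolation with automatic cleanup and no explicit context manager; a minimal sketch, reusing the mztab_file, msstats_file and sdrf_file names from the snippet above (the test name here is illustrative):

        def test_feature_conversion(tmp_path):
            # tmp_path is a per-test pathlib.Path directory managed by pytest
            temp_db_path = tmp_path / "feature_test.duckdb"

            indexer = MzTabIndexer.create(
                mztab_path=mztab_file,
                msstats_path=msstats_file,
                sdrf_path=sdrf_file,
                database_path=temp_db_path,
            )
            feature = Feature(mztab_indexer=indexer)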

    Committable suggestion skipped: line range outside the PR's diff.

    🧰 Tools
    🪛 Ruff (0.12.2)

    26-26: Use of insecure and deprecated function (mktemp)

    (S306)

    🪛 GitHub Check: CodeFactor

    [warning] 26-26: tests/core/quantms/test_feature.py#L26
    Use of insecure and deprecated function (mktemp). (B306)

    🤖 Prompt for AI Agents
    In tests/core/quantms/test_feature.py around lines 26-35 (and similarly update
    the patterns used in test_extract_psm_msg at lines 66-74 and
    test_generate_feature at lines 106-114), replace the insecure use of
    tempfile.mktemp with a context-managed tempfile.TemporaryDirectory and construct
    the database filename inside that directory (e.g., join the tempdir path with a
    desired filename ending in .duckdb) so the path is created safely and avoids
    race conditions; ensure the created path is non-existent when passed to
    MzTabIndexer.create and that the TemporaryDirectory context scope covers the
    test operations so the temp is cleaned up after the test.
    

    Comment on lines +306 to +318
        if intensities_channel_counts:
            for channel, count in intensities_channel_counts.items():
                if channel not in channel_counts:
                    channel_counts[channel] = 0

        # Use intensities-based matrix
        for ref_file, channels in file_channel_matrix_from_intensities.items():
            if ref_file not in file_channel_matrix:
                file_channel_matrix[ref_file] = {}
            for channel, count in channels.items():
                if channel not in file_channel_matrix[ref_file]:
                    file_channel_matrix[ref_file][channel] = 0
                file_channel_matrix[ref_file][channel] += count

    🛠️ Refactor suggestion

    Fix dropped channel counts when merging intensity-derived stats

    Counts from intensities_channel_counts are not added to channel_counts. Add the accumulation.

    -    if intensities_channel_counts:
    -        for channel, count in intensities_channel_counts.items():
    -            if channel not in channel_counts:
    -                channel_counts[channel] = 0
    +    if intensities_channel_counts:
    +        for channel, count in intensities_channel_counts.items():
    +            channel_counts[channel] = channel_counts.get(channel, 0) + count
    🧰 Tools
    🪛 Ruff (0.12.2)

    307-307: Loop control variable count not used within loop body

    (B007)

    🪛 GitHub Check: CodeFactor

    [notice] 272-318: tests/core/quantms/test_msstats_in.py#L272-L318
    Complex Method

    🤖 Prompt for AI Agents
    In tests/core/quantms/test_msstats_in.py around lines 306 to 318, the loop that
    handles intensities_channel_counts currently ensures keys exist in
    channel_counts but never adds the counts themselves; update the code to add each
    intensities_channel_counts[channel] value into channel_counts (initializing
    channel_counts[channel] to 0 if missing) so the intensity-derived counts are
    accumulated into channel_counts before proceeding to build the
    file_channel_matrix.
    

    Comment on lines +38 to +49
        # Use a temporary directory for the backend database
        temp_db_path = tempfile.mktemp(suffix=".duckdb")

        with MzTabIndexer(
            mztab_path=LARGE_DATASET["mztab"],
            msstats_path=LARGE_DATASET["msstats"],
            sdrf_path=LARGE_DATASET["sdrf"],
            database_path=temp_db_path,
            max_memory="16GB",  # Adjust as needed for your system
            worker_threads=4,  # Adjust as needed for your system
            batch_size=100000,  # Using a larger batch size for a larger file
        ) as indexer:

    🛠️ Refactor suggestion

    Replace insecure mktemp with TemporaryDirectory; thread-safe and matches DB creation semantics.

    mktemp creates a race and may be pre-created by attackers. Also the DB must not exist before create.

    -    # Use a temporary directory for the backend database
    -    temp_db_path = tempfile.mktemp(suffix=".duckdb")
    -
    -    with MzTabIndexer(
    +    with tempfile.TemporaryDirectory() as tmpdir:
    +        temp_db_path = Path(tmpdir) / "perf.duckdb"
    +        with MzTabIndexer(
                 mztab_path=LARGE_DATASET["mztab"],
                 msstats_path=LARGE_DATASET["msstats"],
                 sdrf_path=LARGE_DATASET["sdrf"],
                 database_path=temp_db_path,
                 max_memory="16GB",  # Adjust as needed for your system
                 worker_threads=4,  # Adjust as needed for your system
                 batch_size=100000,  # Using a larger batch size for a larger file
    -    ) as indexer:
    +        ) as indexer:

    Committable suggestion skipped: line range outside the PR's diff.

    🧰 Tools
    🪛 ast-grep (0.38.6)

    [warning] 38-38: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
    Context: tempfile.mktemp(suffix=".duckdb")
    Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
    https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
    https://owasp.org/Top10/A01_2021-Broken_Access_Control

    (avoid-mktemp-python)

    🪛 Ruff (0.12.2)

    39-39: Use of insecure and deprecated function (mktemp)

    (S306)

    🪛 GitHub Check: CodeFactor

    [warning] 39-39: tests/core/test_large_mztab_performance.py#L39
    Use of insecure and deprecated function (mktemp). (B306)

    🤖 Prompt for AI Agents
    In tests/core/test_large_mztab_performance.py around lines 38–49, replace the
    insecure tempfile.mktemp usage with a tempfile.TemporaryDirectory context so the
    DB path is created inside a fresh, thread-safe temp directory and does not
    pre-exist; inside the TemporaryDirectory context construct a db filepath (e.g.,
    os.path.join(tmpdir, "test.duckdb") or with your desired suffix), pass that path
    to MzTabIndexer, and ensure you import tempfile and os if not already imported.
    
