Skip to content

Conversation

@rdemko2332
Copy link
Contributor

No description provided.

@rdemko2332 rdemko2332 requested a review from jbrestel November 25, 2025 14:24
@jbrestel
Copy link
Member

I asked claude-code to review this. Here is the response:

Code Review: PR #100 - Longest Transcript

Overview

This PR introduces functionality to filter FASTA files for the longest transcript per gene, adds a plugin to update ortholog group member counts, and includes several bug fixes:

  1. New script filterForLongestTranscript to process FASTA files
  2. New plugin AddNumberOfMembers.pm to update ortholog group statistics
  3. Fixes to InsertPhylogeneticProfile.pm to use source_id instead of parsing protein IDs
  4. Merge conflict resolution in orthomclEcPrediction for table name changes

Critical Issues

  1. Syntax Error in AddNumberOfMembers.pm ❌

Location: Load/plugin/perl/AddNumberOfMembers.pm:80

og.core_peripheral = 'peripheral'

This line starts with a SQL fragment but has no context. The code appears to be truncated or has a copy-paste error. The getUnfinishedOrthologGroups() method is called but never
defined, and the SQL query starting at line 80 is incomplete.

Impact: This plugin will fail to compile and run.

  1. Duplicate Variable Declarations in AddNumberOfMembers.pm ❌

Location: Load/plugin/perl/AddNumberOfMembers.pm:146-148

my $sql = "UPDATE apidb.OrthologGroup SET number_of_members = -1";
my $sql = "UPDATE apidb.OrthologGroup SET number_of_core_members = -1";
my $sql = "UPDATE apidb.OrthologGroup SET number_of_peripheral_members = -1";

Three my $sql declarations overwrite each other. Only the last one will execute during undo.

Recommendation: Execute all three statements:
my $sql1 = "UPDATE apidb.OrthologGroup SET number_of_members = -1";
my $sql2 = "UPDATE apidb.OrthologGroup SET number_of_core_members = -1";
my $sql3 = "UPDATE apidb.OrthologGroup SET number_of_peripheral_members = -1";
$dbh->prepareAndExecute($sql1);
$dbh->prepareAndExecute($sql2);
$dbh->prepareAndExecute($sql3);


Code Quality Issues

  1. Missing Error Handling in filterForLongestTranscript

Location: Load/bin/filterForLongestTranscript:33

The file move operation could fail partway through, leaving the directory in an inconsistent state:
move($src, $dest) or die "Failed to overwrite $dest: $!\n";

Recommendation: Consider creating backups before overwriting or implementing atomic operations.

  1. Hardcoded Regex Pattern

Location: Load/bin/filterForLongestTranscript:72

my ($gene_id) = $header =~ /gene=([^\s]+)/;

The script assumes all FASTA headers contain gene= which is acknowledged in the comment but fragile. If the format changes, all files will be silently skipped with only warnings.

Recommendation: Make the regex pattern configurable via command-line argument with gene= as default.

  1. Performance Concern: Memory Usage

Location: Load/bin/filterForLongestTranscript:42-56

The entire FASTA file is loaded into memory via the %genes hash. For very large files with many genes, this could consume significant memory.

Recommendation: Document memory requirements or consider streaming approaches for very large datasets.


Improvements in InsertPhylogeneticProfile.pm ✅

  1. Excellent Fix: Proper Protein ID Resolution

Location: Load/plugin/perl/InsertPhylogeneticProfile.pm:94-100

The change from parsing protein IDs with regex to using database lookups is a significant improvement:

OLD: my ($taxonCode, $sourceId) = split(/|/, $member);

NEW: Preload mapping from database

my $taxonCode = $proteinIdToOrthomclAbbrev{$member};

Benefits:

  • More robust and maintainable
  • Eliminates dependency on protein ID format
  • Single source of truth from the database
  1. Whitespace/Indentation Cleanup ✅

The indentation fixes improve readability and follow Perl conventions.


Security Considerations

  1. Path Traversal Risk (Low)

Location: Load/bin/filterForLongestTranscript:22-24

While using File::Spec->catfile() is good practice, there's no validation that files in the directory are actually owned by the expected user or haven't been symlinked.

Risk Level: Low (assuming trusted environment)
Recommendation: Add check for symlinks if processing untrusted directories.


Testing Gaps

  1. No Test Coverage

The PR adds:

  • 88 lines of new Perl script
  • 156 lines of new plugin code
  • Significant logic changes to existing plugin

Recommendation: Add test cases for:

  • FASTA files with no gene= tags
  • FASTA files with multiple transcripts per gene
  • Empty FASTA files
  • Plugin behavior with zero core/peripheral members
  • Plugin restart capability

Style & Convention

  1. Inconsistent Tab/Space Usage

Location: Load/plugin/perl/InsertPhylogeneticProfile.pm:82

while (my ($orthomclAbbrev) = $sth->fetchrow_array()) {
$allTaxa{$orthomclAbbrev} = 1; # tabs
}

The file mixes tabs and spaces. Should follow project conventions consistently.

  1. Plugin Documentation Quality ✅

The new plugin follows VEuPathDB conventions with proper documentation hash structure, though the purpose and notes are empty placeholders.

Recommendation: Fill in documentation:
my $purpose = <<PURPOSE;
Updates number_of_members, number_of_core_members, and number_of_peripheral_members
columns in ApiDB::OrthologGroup based on counts from OrthologGroupAaSequence.
PURPOSE


Summary

Must Fix Before Merge ❌

  1. AddNumberOfMembers.pm is broken - incomplete SQL query starting at line 80
  2. Duplicate variable declarations in undoPreprocess()
  3. Missing method getUnfinishedOrthologGroups()

Should Fix

  1. Improve error handling in file operations
  2. Make regex pattern configurable
  3. Complete plugin documentation
  4. Add test coverage

Good Changes ✅

  • The InsertPhylogeneticProfile.pm fix is solid and well-architected
  • filterForLongestTranscript logic is sound (once gene= pattern is present)
  • Proper use of File::Temp and File::Spec modules

Recommendation: Request changes - the AddNumberOfMembers.pm plugin has critical errors that will prevent it from running. The other changes look good once these are fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants