Longest transcript #100

rdemko2332 · 2025-11-25T14:22:59Z

No description provided.

jbrestel · 2025-11-26T14:26:34Z

I asked claude-code to review this. Here is the response:

Code Review: PR #100 - Longest Transcript

Overview

This PR introduces functionality to filter FASTA files for the longest transcript per gene, adds a plugin to update ortholog group member counts, and includes several bug fixes:

New script filterForLongestTranscript to process FASTA files
New plugin AddNumberOfMembers.pm to update ortholog group statistics
Fixes to InsertPhylogeneticProfile.pm to use source_id instead of parsing protein IDs
Merge conflict resolution in orthomclEcPrediction for table name changes

Critical Issues

Syntax Error in AddNumberOfMembers.pm ❌

Location: Load/plugin/perl/AddNumberOfMembers.pm:80

og.core_peripheral = 'peripheral'

This line starts with a SQL fragment but has no context. The code appears to be truncated or has a copy-paste error. The getUnfinishedOrthologGroups() method is called but never
defined, and the SQL query starting at line 80 is incomplete.

Impact: This plugin will fail to compile and run.

Duplicate Variable Declarations in AddNumberOfMembers.pm ❌

Location: Load/plugin/perl/AddNumberOfMembers.pm:146-148

my $sql = "UPDATE apidb.OrthologGroup SET number_of_members = -1";
my $sql = "UPDATE apidb.OrthologGroup SET number_of_core_members = -1";
my $sql = "UPDATE apidb.OrthologGroup SET number_of_peripheral_members = -1";

Three my $sql declarations overwrite each other. Only the last one will execute during undo.

Recommendation: Execute all three statements:
my $sql1 = "UPDATE apidb.OrthologGroup SET number_of_members = -1";
my $sql2 = "UPDATE apidb.OrthologGroup SET number_of_core_members = -1";
my $sql3 = "UPDATE apidb.OrthologGroup SET number_of_peripheral_members = -1";
$dbh->prepareAndExecute($sql1);
$dbh->prepareAndExecute($sql2);
$dbh->prepareAndExecute($sql3);

Code Quality Issues

Missing Error Handling in filterForLongestTranscript

Location: Load/bin/filterForLongestTranscript:33

The file move operation could fail partway through, leaving the directory in an inconsistent state:
move($src, $dest) or die "Failed to overwrite $dest: $!\n";

Recommendation: Consider creating backups before overwriting or implementing atomic operations.

Hardcoded Regex Pattern

Location: Load/bin/filterForLongestTranscript:72

my ($gene_id) = $header =~ /gene=([^\s]+)/;

The script assumes all FASTA headers contain gene= which is acknowledged in the comment but fragile. If the format changes, all files will be silently skipped with only warnings.

Recommendation: Make the regex pattern configurable via command-line argument with gene= as default.

Performance Concern: Memory Usage

Location: Load/bin/filterForLongestTranscript:42-56

The entire FASTA file is loaded into memory via the %genes hash. For very large files with many genes, this could consume significant memory.

Recommendation: Document memory requirements or consider streaming approaches for very large datasets.

Improvements in InsertPhylogeneticProfile.pm ✅

Excellent Fix: Proper Protein ID Resolution

Location: Load/plugin/perl/InsertPhylogeneticProfile.pm:94-100

The change from parsing protein IDs with regex to using database lookups is a significant improvement:

OLD: my ($taxonCode, $sourceId) = split(/|/, $member);

NEW: Preload mapping from database

my $taxonCode = $proteinIdToOrthomclAbbrev{$member};

Benefits:

More robust and maintainable
Eliminates dependency on protein ID format
Single source of truth from the database

Whitespace/Indentation Cleanup ✅

The indentation fixes improve readability and follow Perl conventions.

Security Considerations

Path Traversal Risk (Low)

Location: Load/bin/filterForLongestTranscript:22-24

While using File::Spec->catfile() is good practice, there's no validation that files in the directory are actually owned by the expected user or haven't been symlinked.

Risk Level: Low (assuming trusted environment)
Recommendation: Add check for symlinks if processing untrusted directories.

Testing Gaps

No Test Coverage

The PR adds:

88 lines of new Perl script
156 lines of new plugin code
Significant logic changes to existing plugin

Recommendation: Add test cases for:

FASTA files with no gene= tags
FASTA files with multiple transcripts per gene
Empty FASTA files
Plugin behavior with zero core/peripheral members
Plugin restart capability

Style & Convention

Inconsistent Tab/Space Usage

Location: Load/plugin/perl/InsertPhylogeneticProfile.pm:82

while (my ($orthomclAbbrev) = $sth->fetchrow_array()) {
$allTaxa{$orthomclAbbrev} = 1; # tabs
}

The file mixes tabs and spaces. Should follow project conventions consistently.

Plugin Documentation Quality ✅

The new plugin follows VEuPathDB conventions with proper documentation hash structure, though the purpose and notes are empty placeholders.

Recommendation: Fill in documentation:
my $purpose = <<PURPOSE;
Updates number_of_members, number_of_core_members, and number_of_peripheral_members
columns in ApiDB::OrthologGroup based on counts from OrthologGroupAaSequence.
PURPOSE

Summary

Must Fix Before Merge ❌

AddNumberOfMembers.pm is broken - incomplete SQL query starting at line 80
Duplicate variable declarations in undoPreprocess()
Missing method getUnfinishedOrthologGroups()

Should Fix

Improve error handling in file operations
Make regex pattern configurable
Complete plugin documentation
Add test coverage

Good Changes ✅

The InsertPhylogeneticProfile.pm fix is solid and well-architected
filterForLongestTranscript logic is sound (once gene= pattern is present)
Proper use of File::Temp and File::Spec modules

Recommendation: Request changes - the AddNumberOfMembers.pm plugin has critical errors that will prevent it from running. The other changes look good once these are fixed.

rdemko2332 and others added 4 commits November 11, 2025 12:58

Adding script for filteringForLongestTranscript

8de3f1a

Fixing Phylogenetic profile issue

0495ca7

Adding addnummembers plugin

789c9da

change ProteinSequenceGroup_p to ProteinSequenceGroup, without -p

caea51c

rdemko2332 requested a review from jbrestel November 25, 2025 14:24

Removing unneeded perl script

a0f6f23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Longest transcript #100

Longest transcript #100

Uh oh!

rdemko2332 commented Nov 25, 2025

Uh oh!

jbrestel commented Nov 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Longest transcript #100

Are you sure you want to change the base?

Longest transcript #100

Uh oh!

Conversation

rdemko2332 commented Nov 25, 2025

Uh oh!

jbrestel commented Nov 26, 2025

OLD: my ($taxonCode, $sourceId) = split(/|/, $member);

NEW: Preload mapping from database

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants