Port database loading scripts to AllianceMine #1120

dondi · 2024-10-02T20:30:17Z

Now that InterMine is no more, we need to port our loading scripts to AllianceMine

ntran18 · 2024-10-09T17:46:53Z

Add documents to the Wiki. I wanted to take a break from it last week, so I didn't investigate PPI. I will do it this week.

ntran18 · 2024-10-10T18:49:19Z

I got response from SGD support about the API or service they have. I will test look at it and see if it works.

Update: Good news! There is a web service for AllianceMIne.

ntran18 · 2024-10-15T21:22:00Z

I updated the Wiki with all analyses for GRN and PPI and the proposed solutions for each.

dondi · 2024-10-16T20:18:56Z

@ntran18’s PR to query the gene source rather than network in order to find ERT1 has been merged but appears to still not find ERT1 on the installed beta deployment. @ntran18 will investigate the issue

Further @ntran18 will meet with @kdahlquist to clarify the updated queries on AllianceMine and other content in the wiki writeup

ntran18 · 2024-10-16T20:48:33Z

The ERT1 issue should be fixed now. The issue was that when we updated the database, we forgot to update the database's GRN gene table.

ntran18 · 2025-01-15T16:40:49Z

@kdahlquist https://www.alliancegenome.org/bluegenes/alliancemine/templates. Currently AllianceMine has some templates for Yeastmine. Could you please check over them?

kdahlquist · 2025-01-16T22:11:05Z

I looked at the Organism > All Genes query results from the link given in the above comment. This looks like what we need to retrieve all the genes in yeast. There are a couple of observations:

The default query as written does not retrieve the systematic names, only the standard names. In order to get the systematic names, you need to click the "Edit Query" button and select systematic name in the model browser panel on the far left. I don't know whether the revised query can then be converted into a web service URL like the default query can.
The All Genes query retrieves 7261 records. We only need the Gene > Featuretype "ORF" records (6700 records). You could probably play with the query to only retrieve those instead of filtering them afterwards.

The Literature > Regulation template is what we need to use to get the network information.

We should capture this entire table.
Regulator > Standard Name or Systematic name is the regulator
Gene > Standard Name or Systematic name is the target
I think for this one, you can use the default query

One more thing to note. I looked at the downloaded results files and it looks like when a gene doesn't have a standard name, they leave that field blank. In that case, we need to copy the systematic name and put it in the standard name field.

Let me know if you have more questions.

kdahlquist · 2025-01-16T22:19:16Z

Literature > Regulation is for the GRNs.

For PPI, use Literature > Interaction and just use the "physical" interaction type.

ntran18 · 2025-01-17T23:55:42Z

I will take a look at these. Thank you!

ntran18 · 2025-01-19T08:06:12Z

@kdahlquist for GRN network table, does the regulators and target genes also have "ORF" feature type?

kdahlquist · 2025-01-19T22:26:28Z

Yes, they should. I don't see where the GRN network table actually includes an ORF field, but visual inspection of the table I can download from the web interface shows that they are all ORFs.

One thing to note is that there is a "strain" field. We should only capture "S288c" data. We could explore in the future whether to make an option to filter by strain, but for now, we just want to capture the "S288c" data.

ntran18 · 2025-01-22T02:42:02Z

ORF

I modified the query to filter only the ORF feature type, which now returns 28,364 interactions compared to the original query, which returned 29,510 interactions. Could you please confirm which query I should be using? @kdahlquist

Adding Feature Type and filtering with ORF feature type and "s288c" strain

Original query with "s288c" strain

Database Update Issue

I encountered an issue while attempting to update the protein standard name in the database. When trying to change the standard name in the protein table, an error occurs stating that the standard name is referenced in the physical_interactions table, which prevents the update.

After discussing the issue with @dondi, we discovered that the schema for the protein and physical_interactions tables is incorrect. The physical_interactions table uses protein.standard_name as a foreign key, but it should be using gene_systematic_name.

Possible Solutions:
1. Fixing the Schema: Adjust the foreign key relationship in the physical_interactions table to reference the gene_systematic_name from the protein table, which requires updating the entire physical_interactions table.
2. Temporary Workaround: Disable the foreign key constraints while performing the update, and re-enable them afterward.
Question: How often does the protein standard name need to be changed? Understanding this will help determine whether to implement a permanent schema fix or use a temporary solution.

Missing `time_stamp` in Tables

I noticed that there is no time_stamp field in the gene or protein tables. Without this field, if a gene exists in an older timestamp, it will still be possible to query it in any time_stamp.

Question: Is it acceptable for users to query genes at any time_stamp, or should we enforce the use of specific timestamps when querying genes?

kdahlquist · 2025-01-22T16:27:31Z

For the first question:

"I modified the query to filter only the ORF feature type, which now returns 28,364 interactions compared to the original query, which returned 29,510 interactions. Could you please confirm which query I should be using? @kdahlquist"
Answer: yes, filter to only the ORF feature type. That is a good catch.

kdahlquist · 2025-01-22T16:32:02Z

For the second question:

"Question: How often does the protein standard name need to be changed? Understanding this will help determine whether to implement a permanent schema fix or use a temporary solution."
We need to permanently fix this, because that is the correct way to do things. The number of times that this will happen will be variable over time as proteins gain standard names, but discoveries of new genes/proteins that have systematic names, but not standard names could still happen. It's best if our schema is an accurate representation of the relationships going forward, so in the future we don't have any confusion when new developers come along.

kdahlquist · 2025-01-22T16:35:28Z

For the third question:

Question: Is it acceptable for users to query genes at any time_stamp, or should we enforce the use of specific timestamps when querying genes?
I'm not sure I 100% understand this question. However, we are keeping distinct versions of the database based on the date we refresh the data from (now) AllianceMine. Users should not be able to query a mixture of the data from different dates. If they select "2024" data, then it should only query 2024 data. If they select "2023" or "2025" data, they should only be able to query data from those downloads.

kdahlquist · 2025-01-22T16:41:08Z

Question: are week keeping the "Annotation Type" field in our database for GRNs and/or PPI? If we are not, can we keep it? I can envision a feature request where users can select whether they want to look at just manually curated, just high-throughput, or both types of interactions on GRNs and PPIs that are loaded from the database.

kdahlquist · 2025-01-22T18:47:11Z

I looked at the schemas and have a question. I see on the PPI database schema, there is this in the protein_protein_interactions.physical_interactions table:

CONSTRAINT unique_physical_interaction UNIQUE (protein1, protein2, interaction_detection_methods_identifier, experiment_name, time_stamp, source)

What does this mean? I think it means to exclude duplicate protein-protein interactions. However, by including both manually curated and high-throughput, there might actually be duplicates if an interaction was annotated in both ways.

Also, would this same constraint be needed in the GRN database?

ntran18 · 2025-01-22T19:04:29Z

My current implementation already filter out the duplicate interactions. So it doesn't care what annoted type is currently. Do you want to save duplicate interactions or is there a priority for selecting the preferred annotation type?

kdahlquist · 2025-01-22T19:58:33Z

We would need to save them. The user will choose from "manually curated only", "high throughput only" or "both manually curated and "high throughput". We wouldn't want to lose an interaction from either category.

ntran18 · 2025-01-25T05:48:45Z

I edited the new schema for the database, but I have a few questions:

For the physical_interactions and network tables, should I use a UNIQUE constraint or a PRIMARY KEY to ensure that each row is unique? If I use a PRIMARY KEY, the columns must be non-null, whereas with a UNIQUE constraint, the columns can have null values.
Should the time_stamp foreign key in the network and physical_interactions tables reference the gene table or the source table?

kdahlquist · 2025-01-27T16:56:38Z

I'm not sure about question 1.
Are the time_stamp's going to be different in the two different tables?

dondi · 2025-01-29T17:06:36Z

@ntran18 and @dondi discussed over the week and had the following conclusions:

Although the “true” primary key of the source table is a composite of the source identifier + timestamp, practically speaking we won’t be loading sources at the exact same moment, so we will retain just timestamp as the reference value that the other tables will use for database integrity. This deviation will be noted in the documentation
For the interactions tables, the entire row must be considered in order to be unique and so we will apply that constraint consistently to all of them

dondi · 2025-01-29T17:08:19Z

After @ceciliazaragoza finishes work on #1150, she can join up with @ntran18 for the database work since this will be a heavier lift

dondi added the priority 0 label Oct 2, 2024

dondi assigned ntran18 Oct 2, 2024

dondi added priority 1 and removed priority 0 labels Oct 30, 2024

This was referenced Jan 22, 2025

Attach timestamps to genes #1164

Open

Add annotation_type to our database #1165

Open

kdahlquist added priority 0 database and removed priority 1 labels Jan 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port database loading scripts to AllianceMine #1120

Port database loading scripts to AllianceMine #1120

dondi commented Oct 2, 2024

ntran18 commented Oct 9, 2024

ntran18 commented Oct 10, 2024 •

edited

Loading

ntran18 commented Oct 15, 2024

dondi commented Oct 16, 2024

ntran18 commented Oct 16, 2024

ntran18 commented Jan 15, 2025

kdahlquist commented Jan 16, 2025

kdahlquist commented Jan 16, 2025

ntran18 commented Jan 17, 2025 •

edited

Loading

ntran18 commented Jan 19, 2025

kdahlquist commented Jan 19, 2025

ntran18 commented Jan 22, 2025 •

edited

Loading

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

ntran18 commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

ntran18 commented Jan 25, 2025

kdahlquist commented Jan 27, 2025

dondi commented Jan 29, 2025

dondi commented Jan 29, 2025

Port database loading scripts to AllianceMine #1120

Port database loading scripts to AllianceMine #1120

Comments

dondi commented Oct 2, 2024

ntran18 commented Oct 9, 2024

ntran18 commented Oct 10, 2024 • edited Loading

ntran18 commented Oct 15, 2024

dondi commented Oct 16, 2024

ntran18 commented Oct 16, 2024

ntran18 commented Jan 15, 2025

kdahlquist commented Jan 16, 2025

kdahlquist commented Jan 16, 2025

ntran18 commented Jan 17, 2025 • edited Loading

ntran18 commented Jan 19, 2025

kdahlquist commented Jan 19, 2025

ntran18 commented Jan 22, 2025 • edited Loading

ORF

Database Update Issue

Missing time_stamp in Tables

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

ntran18 commented Jan 22, 2025

kdahlquist commented Jan 22, 2025

ntran18 commented Jan 25, 2025

kdahlquist commented Jan 27, 2025

dondi commented Jan 29, 2025

dondi commented Jan 29, 2025

ntran18 commented Oct 10, 2024 •

edited

Loading

ntran18 commented Jan 17, 2025 •

edited

Loading

ntran18 commented Jan 22, 2025 •

edited

Loading

Missing `time_stamp` in Tables