Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port database loading scripts to AllianceMine #1120

Open
dondi opened this issue Oct 2, 2024 · 23 comments
Open

Port database loading scripts to AllianceMine #1120

dondi opened this issue Oct 2, 2024 · 23 comments

Comments

@dondi
Copy link
Owner

dondi commented Oct 2, 2024

Now that InterMine is no more, we need to port our loading scripts to AllianceMine

@ntran18
Copy link
Collaborator

ntran18 commented Oct 9, 2024

Add documents to the Wiki. I wanted to take a break from it last week, so I didn't investigate PPI. I will do it this week.

@ntran18
Copy link
Collaborator

ntran18 commented Oct 10, 2024

I got response from SGD support about the API or service they have. I will test look at it and see if it works.

image

Update: Good news! There is a web service for AllianceMIne.

@ntran18
Copy link
Collaborator

ntran18 commented Oct 15, 2024

I updated the Wiki with all analyses for GRN and PPI and the proposed solutions for each.

@dondi
Copy link
Owner Author

dondi commented Oct 16, 2024

@ntran18’s PR to query the gene source rather than network in order to find ERT1 has been merged but appears to still not find ERT1 on the installed beta deployment. @ntran18 will investigate the issue

Further @ntran18 will meet with @kdahlquist to clarify the updated queries on AllianceMine and other content in the wiki writeup

@ntran18
Copy link
Collaborator

ntran18 commented Oct 16, 2024

The ERT1 issue should be fixed now. The issue was that when we updated the database, we forgot to update the database's GRN gene table.

@ntran18
Copy link
Collaborator

ntran18 commented Jan 15, 2025

@kdahlquist https://www.alliancegenome.org/bluegenes/alliancemine/templates. Currently AllianceMine has some templates for Yeastmine. Could you please check over them?

@kdahlquist
Copy link
Collaborator

I looked at the Organism > All Genes query results from the link given in the above comment. This looks like what we need to retrieve all the genes in yeast. There are a couple of observations:

  • The default query as written does not retrieve the systematic names, only the standard names. In order to get the systematic names, you need to click the "Edit Query" button and select systematic name in the model browser panel on the far left. I don't know whether the revised query can then be converted into a web service URL like the default query can.
  • The All Genes query retrieves 7261 records. We only need the Gene > Featuretype "ORF" records (6700 records). You could probably play with the query to only retrieve those instead of filtering them afterwards.

The Literature > Regulation template is what we need to use to get the network information.

  • We should capture this entire table.
  • Regulator > Standard Name or Systematic name is the regulator
  • Gene > Standard Name or Systematic name is the target
  • I think for this one, you can use the default query

One more thing to note. I looked at the downloaded results files and it looks like when a gene doesn't have a standard name, they leave that field blank. In that case, we need to copy the systematic name and put it in the standard name field.

Let me know if you have more questions.

@kdahlquist
Copy link
Collaborator

Literature > Regulation is for the GRNs.

For PPI, use Literature > Interaction and just use the "physical" interaction type.

@ntran18
Copy link
Collaborator

ntran18 commented Jan 17, 2025

I will take a look at these. Thank you!

@ntran18
Copy link
Collaborator

ntran18 commented Jan 19, 2025

@kdahlquist for GRN network table, does the regulators and target genes also have "ORF" feature type?

@kdahlquist
Copy link
Collaborator

Yes, they should. I don't see where the GRN network table actually includes an ORF field, but visual inspection of the table I can download from the web interface shows that they are all ORFs.

One thing to note is that there is a "strain" field. We should only capture "S288c" data. We could explore in the future whether to make an option to filter by strain, but for now, we just want to capture the "S288c" data.

@ntran18
Copy link
Collaborator

ntran18 commented Jan 22, 2025

ORF

I modified the query to filter only the ORF feature type, which now returns 28,364 interactions compared to the original query, which returned 29,510 interactions. Could you please confirm which query I should be using? @kdahlquist

Image
Adding Feature Type and filtering with ORF feature type and "s288c" strain

Image
Original query with "s288c" strain

Database Update Issue

I encountered an issue while attempting to update the protein standard name in the database. When trying to change the standard name in the protein table, an error occurs stating that the standard name is referenced in the physical_interactions table, which prevents the update.

After discussing the issue with @dondi, we discovered that the schema for the protein and physical_interactions tables is incorrect. The physical_interactions table uses protein.standard_name as a foreign key, but it should be using gene_systematic_name.

Image
  • Possible Solutions:

    1. Fixing the Schema: Adjust the foreign key relationship in the physical_interactions table to reference the gene_systematic_name from the protein table, which requires updating the entire physical_interactions table.
    2. Temporary Workaround: Disable the foreign key constraints while performing the update, and re-enable them afterward.
  • Question: How often does the protein standard name need to be changed? Understanding this will help determine whether to implement a permanent schema fix or use a temporary solution.

Missing time_stamp in Tables

I noticed that there is no time_stamp field in the gene or protein tables. Without this field, if a gene exists in an older timestamp, it will still be possible to query it in any time_stamp.

  • Question: Is it acceptable for users to query genes at any time_stamp, or should we enforce the use of specific timestamps when querying genes?

@kdahlquist
Copy link
Collaborator

For the first question:

  • "I modified the query to filter only the ORF feature type, which now returns 28,364 interactions compared to the original query, which returned 29,510 interactions. Could you please confirm which query I should be using? @kdahlquist"
  • Answer: yes, filter to only the ORF feature type. That is a good catch.

@kdahlquist
Copy link
Collaborator

For the second question:

  • "Question: How often does the protein standard name need to be changed? Understanding this will help determine whether to implement a permanent schema fix or use a temporary solution."
  • We need to permanently fix this, because that is the correct way to do things. The number of times that this will happen will be variable over time as proteins gain standard names, but discoveries of new genes/proteins that have systematic names, but not standard names could still happen. It's best if our schema is an accurate representation of the relationships going forward, so in the future we don't have any confusion when new developers come along.

@kdahlquist
Copy link
Collaborator

For the third question:

  • Question: Is it acceptable for users to query genes at any time_stamp, or should we enforce the use of specific timestamps when querying genes?
  • I'm not sure I 100% understand this question. However, we are keeping distinct versions of the database based on the date we refresh the data from (now) AllianceMine. Users should not be able to query a mixture of the data from different dates. If they select "2024" data, then it should only query 2024 data. If they select "2023" or "2025" data, they should only be able to query data from those downloads.

@kdahlquist
Copy link
Collaborator

Question: are week keeping the "Annotation Type" field in our database for GRNs and/or PPI? If we are not, can we keep it? I can envision a feature request where users can select whether they want to look at just manually curated, just high-throughput, or both types of interactions on GRNs and PPIs that are loaded from the database.

@kdahlquist
Copy link
Collaborator

I looked at the schemas and have a question. I see on the PPI database schema, there is this in the protein_protein_interactions.physical_interactions table:

CONSTRAINT unique_physical_interaction UNIQUE (protein1, protein2, interaction_detection_methods_identifier, experiment_name, time_stamp, source)

What does this mean? I think it means to exclude duplicate protein-protein interactions. However, by including both manually curated and high-throughput, there might actually be duplicates if an interaction was annotated in both ways.

Also, would this same constraint be needed in the GRN database?

@ntran18
Copy link
Collaborator

ntran18 commented Jan 22, 2025

My current implementation already filter out the duplicate interactions. So it doesn't care what annoted type is currently. Do you want to save duplicate interactions or is there a priority for selecting the preferred annotation type?

@kdahlquist
Copy link
Collaborator

We would need to save them. The user will choose from "manually curated only", "high throughput only" or "both manually curated and "high throughput". We wouldn't want to lose an interaction from either category.

@ntran18
Copy link
Collaborator

ntran18 commented Jan 25, 2025

Image

Image

I edited the new schema for the database, but I have a few questions:

  1. For the physical_interactions and network tables, should I use a UNIQUE constraint or a PRIMARY KEY to ensure that each row is unique? If I use a PRIMARY KEY, the columns must be non-null, whereas with a UNIQUE constraint, the columns can have null values.

  2. Should the time_stamp foreign key in the network and physical_interactions tables reference the gene table or the source table?

@kdahlquist
Copy link
Collaborator

  1. I'm not sure about question 1.
  2. Are the time_stamp's going to be different in the two different tables?

@dondi
Copy link
Owner Author

dondi commented Jan 29, 2025

@ntran18 and @dondi discussed over the week and had the following conclusions:

  • Although the “true” primary key of the source table is a composite of the source identifier + timestamp, practically speaking we won’t be loading sources at the exact same moment, so we will retain just timestamp as the reference value that the other tables will use for database integrity. This deviation will be noted in the documentation
  • For the interactions tables, the entire row must be considered in order to be unique and so we will apply that constraint consistently to all of them

@dondi
Copy link
Owner Author

dondi commented Jan 29, 2025

After @ceciliazaragoza finishes work on #1150, she can join up with @ntran18 for the database work since this will be a heavier lift

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants