francisco-ebi
Contributor

This branch contains all the migrated tasks derived from the original impc_web_api_mapper.py task.
It also adds a shared utils module.

…b_apiimpc_gene_diseases_mapperpy-to-use-airflow

399 Migrate impc_etl/jobs/load/impc_web_api/impc_gene_diseases_mapper.py to use Airflow
…b_apiimpc_batch_query_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_batch_query_mapper to use Airflow
…b_apiimpc_gene_search_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_gene_search_mapper to use Airflow
…b_apiimpc_idg_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_idg_mapper.py to use Airflow
@francisco-ebi changed the title from "IMPC web api migration" to "IMPC web api tasks migration" on Sep 18, 2025
…b_apiimpc_external_links_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_external_links_mapper to use Airflow
…b_apiimpc_gene_histopathology_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_external_links_mapper to use Airflow
…b_apiimpc_gene_images_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_gene_images_mapper to use Airflow
…b_apiimpc_phenotype_pleiotropy_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_phenotype_pleiotropy_mapper.py to use Airflow
…b_apiimpc_embryo_landing_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_embryo_landing_mapper.py to use Airflow
"associationCurated", col("associationCurated").astype(BooleanType())
)

max_disease_df.coalesce(100).write.option("ignoreNullFields", "false").json(
Contributor Author

Changed from repartition(500) to coalesce(100) for local development

impc_images_df.repartition(500).write.option("ignoreNullFields", "false").json(
output_path
)
impc_images_df.coalesce(100).write.option("ignoreNullFields", "false").json(
Contributor Author

Changed from repartition(500) to coalesce(100) for local development

stats_results_df = stats_results_df.withColumn(
"femaleMutantCount", col("femaleMutantCount").astype(IntegerType())
)
stats_results_df.distinct().coalesce(5).write.option(
Contributor Author

Changed from repartition(1000) to coalesce(5) for local development

…b_apiimpc_gene_summary_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_gene_summary_mapper.py to use Airflow
"embryoExpressionObservationsAverage"
),
)
gene_avg_df.coalesce(10).write.option("ignoreNullFields", "false").json(
Contributor Author

Changed from repartition(1) to coalesce(10) for local development

gene_avg_df.coalesce(10).write.option("ignoreNullFields", "false").json(
output_path + "_avgs"
)
gene_df.coalesce(10).write.option("ignoreNullFields", "false").json(
Contributor Author

Changed from repartition(100) to coalesce(10) for local development

…b_apiimpc_phenotype_search_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_phenotype_search_mapper.py to use Airflow
…b_apiimpc_phenotype_summary_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_phenotype_summary_mapper.py to use Airflow
…b_apiimpc_images_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_images_mapper.py to use Airflow
…b_apiimpc_histopathology_datasets_mapperpy-to-use-airflow

Migrate impc_etl/jobs/load/impc_web_api/impc_histopathology_datasets_mapper.py to use Airflow
Contributor

@ficolo left a comment

For any transformation that is needed only in development, use:

environment = Variable.get("environment", "development")

And then:

if environment == 'development':
    ...
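
Applied to the partitioning changes discussed in this PR, the suggestion could look roughly like the sketch below. The helper name `partition_for_environment` and its parameters are hypothetical and do not come from the PR; the partition counts in the usage comment are the ones quoted in the review comments above. The `Variable` import path depends on the Airflow version in use, so it is shown only in a comment.

```python
def partition_for_environment(df, environment, prod_partitions, dev_partitions):
    """Pick a partitioning strategy based on the environment name.

    Hypothetical helper (not part of the PR): in development the DataFrame
    is coalesced to a small number of partitions; in any other environment
    it is repartitioned to the production value.
    """
    if environment == "development":
        # coalesce() reduces the partition count without a full shuffle
        return df.coalesce(dev_partitions)
    # repartition() performs a full shuffle to the requested partition count
    return df.repartition(prod_partitions)


# Intended use inside a task (not executed here; assumes Airflow and Spark):
#   from airflow.models import Variable  # import path may differ by version
#   environment = Variable.get("environment", "development")
#   out_df = partition_for_environment(max_disease_df, environment, 500, 100)
#   out_df.write.option("ignoreNullFields", "false").json(output_path)
```

Keeping the branch in one helper would leave the production partition counts unchanged, while the smaller development values stay visible at each call site.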

@ficolo merged commit 42790d9 into dev on Sep 26, 2025