Shortage data scraping fails due to Cloudflare protection #398

@eddie-cosma

Problem Statement

A few weeks ago, the ASHP website appears to have put Cloudflare protection in place that blocks scraping attempts. With the Airflow log level set to DEBUG, the ASHP DAG log shows the following:

[2025-07-05, 03:32:17 UTC] {dag.py:48} INFO - Checking ASHP website for updates
[2025-07-05, 03:32:17 UTC] {connectionpool.py:1007} DEBUG - Starting new HTTPS connection (1): www.ashp.org:443
[2025-07-05, 03:32:17 UTC] {connectionpool.py:465} DEBUG - https://www.ashp.org:443 "GET /drug-shortages/current-shortages/drug-shortages-list?page=CurrentShortages HTTP/1.1" 403 None
[2025-07-05, 03:32:17 UTC] {dag.py:52} ERROR - ASHP website unreachable
[2025-07-05, 03:32:17 UTC] {cli_action_loggers.py:83} DEBUG - Calling callbacks: []
[2025-07-05, 03:32:17 UTC] {local_task_job.py:208} INFO - Task exited with return code 1

Running curl "https://www.ashp.org/drug-shortages/current-shortages/drug-shortages-list?page=CurrentShortages" on any of my machines returns a 403 response whose body contains the following message:

This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.
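For anyone who wants to reproduce this from Python rather than curl, here is a minimal sketch that tells the security block apart from other failures, so the DAG could log something more specific than "ASHP website unreachable". The names classify_response, check_ashp, and BLOCK_MARKER are hypothetical and not part of sagerx; the marker string is copied from the block page quoted above, and the sketch uses only the standard library.

```python
import urllib.error
import urllib.request

ASHP_URL = (
    "https://www.ashp.org/drug-shortages/current-shortages/"
    "drug-shortages-list?page=CurrentShortages"
)

# Phrase copied from the block page quoted in this issue (assumed to be stable).
BLOCK_MARKER = "This website is using a security service to protect itself"


def classify_response(status: int, body: str) -> str:
    """Return 'ok', 'blocked', or 'error' for an HTTP status code and body."""
    if status == 200:
        return "ok"
    if status == 403 and BLOCK_MARKER in body:
        return "blocked"
    return "error"


def check_ashp(url: str = ASHP_URL, timeout: float = 10.0) -> str:
    """Fetch the page and classify the result (performs a real request)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return classify_response(resp.status, body)
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; the body is still readable from the exception.
        return classify_response(exc.code, exc.read().decode("utf-8", "replace"))
    except OSError:
        # Covers URLError (DNS, refused connection) and socket timeouts.
        return "error"
```

Calling check_ashp() from an affected machine should make it easy to confirm whether a given network or client is being blocked or is failing for some other reason.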

Criteria for Success

Find a way to reliably import ASHP drug shortage data.

Additional Information

I need to research further. A quick search indicates that some people have had success using cloudscraper, but this would result in an additional dependency. I'm wondering if a better solution would be to build an API separate from sagerx that could handle scraping and act as an intermediary for this data, then use sagerx to simply poll the API. Open to suggestions/ideas.
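To make the cloudscraper idea concrete, here is a rough sketch of how it could be kept optional. cloudscraper.create_scraper() returns a requests-compatible session, so the fetch code only needs an object with a get method. The helper names make_session and fetch_shortage_html are hypothetical, and I have not verified that cloudscraper actually gets past the block on the ASHP site.

```python
from typing import Any, Protocol

ASHP_URL = (
    "https://www.ashp.org/drug-shortages/current-shortages/"
    "drug-shortages-list?page=CurrentShortages"
)


class Session(Protocol):
    """Anything with a requests-style get() method."""

    def get(self, url: str, **kwargs: Any) -> Any: ...


def make_session() -> Session:
    """Prefer cloudscraper when it is installed; otherwise use plain requests."""
    try:
        import cloudscraper  # optional third-party dependency

        return cloudscraper.create_scraper()
    except ImportError:
        import requests

        return requests.Session()


def fetch_shortage_html(
    session: Session, url: str = ASHP_URL, timeout: float = 30.0
) -> str:
    """Return the page HTML, raising if the site answers with a non-200 status."""
    response = session.get(url, timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(f"ASHP request failed with HTTP {response.status_code}")
    return response.text
```

Usage would be fetch_shortage_html(make_session()). Passing the session in keeps the extra dependency confined to one function and lets the fetch logic be tested with a fake session.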

Metadata

Labels

bug (Something isn't working)
