Questions regarding automatic size checker #1385
@JLO64 firstly, thanks for all the great work on this - yay for open source! So a few comments from me, and I'll also let @garritfra opine too.

Broken sites

We definitely need to do something with these. We currently have the site checker, which spits out a list of any status that isn't 200, and then we manually go through the list. We haven't run it for a while, which is probably why you're seeing lots of broken sites. Having something automated to clean these up would be better for obvious reasons, but I think we need to consider some kind of safety net so that valid entries aren't accidentally deleted. Maybe a cron job that runs regularly, and if a site keeps failing for, say, 7 days, it gets deleted. I'm just spit-balling ideas here, as this is all way out of my skill set. If that can't be done easily, we will continue with the manual script and cleanup.

Cloudflare auth and
No objections! An idea regarding the automatic size check: Renovate, the Dependabot competitor, opens an issue in each repository it's active in and lists all dependencies that are out of date. Maybe that approach works here as well? We could have an issue that is regularly updated by an action, listing all domains that could be removed, alongside how long each has been down or above 512 kB. To avoid having to maintain a backend, the body of the issue should be in machine-readable form. If we want, we can even add checkboxes next to the domains to schedule removal, or remove any domain that was unhealthy for n days (or n iterations of the check). These are just my unfiltered thoughts; I'm totally open to remarks or alternative solutions.
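For what it's worth, a minimal sketch of how an action could render such a machine-readable issue body might look like this (the record fields, the 7-day cutoff, and the HTML-comment marker are all illustrative assumptions, not an agreed format):

```python
# Sketch: render the health report as a Markdown issue body whose rows an
# action can also parse back. Field names and example data are illustrative.
from datetime import date

unhealthy = [
    {"domain": "example.org", "reason": "HTTP 404", "days_unhealthy": 3},
    {"domain": "example.net", "reason": "size 812 kB > 512 kB", "days_unhealthy": 9},
]

def render_issue_body(records, max_days=7):
    lines = [
        f"<!-- site-health-report {date.today().isoformat()} -->",
        "| Remove? | Domain | Reason | Days unhealthy |",
        "| --- | --- | --- | --- |",
    ]
    for r in records:
        # Pre-check the box once a domain has been unhealthy past the cutoff.
        box = "x" if r["days_unhealthy"] >= max_days else " "
        lines.append(f"| [{box}] | {r['domain']} | {r['reason']} | {r['days_unhealthy']} |")
    return "\n".join(lines)

print(render_issue_body(unhealthy))
```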
Sorry for not responding sooner! I've had a hectic weekend and haven't been able to touch VS Code or GitHub, but I'll definitely be able to submit a PR based on the above comments soon. I wish I could contribute more on the GitHub Actions side of things, but I've never used it before now. I'm gonna try giving myself a crash course on how that all works this week, but for now I'll abstain from commenting on that stuff. That said, switching to GitHub Secrets seems like a good idea! Regarding broken links and websites larger than 512 KB, maybe a
Sorry for the lack of updates, I got really sick on Monday and am still recovering.

Quick update on the script: I tried running it overnight, but it errored out after ~250 entries. I'm not quite certain why that happened, since when I reran it from the entry it failed at, it ran properly. Additionally, I think there's some form of rate limiting going on. There's a pattern throughout the table where, every 60-ish entries, it fails to scan roughly 10 entries in a row. At most the script queries the API twice per 20 seconds, which is well within the limits of the overall API and that particular endpoint. Additionally, a handful of sites fail seemingly at random despite being accessible via a browser and Cloudflare. The frequency of this is low, roughly 1 per 40 entries. Thankfully, I can just have the script retest these entries by having it sort

Long term I don't think this will be an issue if the script is run periodically via GitHub Actions for just a couple of sites at a time, but it will be a problem if we ever have to rerun the entire list. For now I'll PR the script as it is. I haven't changed the API token stuff, documentation, or comments; I'll do that in later PRs.
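Since the sporadic failures described above seem transient, a retry wrapper is one way to retest entries automatically. This is just a sketch, with `scan_site` standing in for whatever function in `site_size_rechecker.py` actually queries one entry:

```python
# Sketch of a retry wrapper for transient scan failures. `scan_site` is a
# stand-in for the real per-entry Cloudflare query; names are assumptions.
import time

def scan_with_retries(scan_site, url, attempts=3, wait_seconds=20):
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return scan_site(url)
        except Exception as err:  # the real script could catch something narrower
            last_error = err
            time.sleep(wait_seconds)  # stay under the ~2 requests / 20 s pace
    raise RuntimeError(f"{url} failed after {attempts} attempts") from last_error
```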
This looks fantastic!
So, as mentioned at the end of #1366, the script that automatically checks website size no longer works due to its reliance on GTmetrix. As such, I've been working on updating it to use the Cloudflare API. It's working pretty well now (I replaced all the GTmetrix stuff with Cloudflare API calls and improved error handling), but before I PR my changes I'd like to communicate and get clarification on a few things.
Broken URLs
A decent number of URLs on the list are currently broken and can't be accessed by either Cloudflare or a web browser. With the error handling I currently have in place, the sizes in their entries get set to "1000" so that they don't show up on the website. However, this will add a bunch of empty entries to the site list, which will cause extra, unnecessary processing whenever this script is run in the future.
I could have these entries deleted automatically, but I don't like the idea of removing other people's entries like that. Maybe it would be better for them to be removed manually instead?
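If manual removal wins out, one middle ground is to have the script flag broken entries rather than hiding them behind a fake size, so a human can review and delete them. A rough sketch, assuming PyYAML and guessing at the `sites.yml` schema (the `broken` field is hypothetical):

```python
# Sketch of a "flag, don't delete" pass over sites.yml. Assumes the file is
# a list of dicts with a "domain" key; adjust to the real schema.
import yaml

with open("sites.yml") as f:
    sites = yaml.safe_load(f)

broken = {"example.org"}  # stand-in for URLs the checker could not reach

for site in sites:
    if site.get("domain") in broken:
        site["broken"] = True  # hypothetical field reviewers can filter on

with open("sites.yml", "w") as f:
    yaml.safe_dump(sites, f, sort_keys=False)
```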
Cloudflare Authorization
Just as a heads up, the Cloudflare API isn't open; you need to register for it. Registering for an account is easy and straightforward. However, in order to access the `accountId` needed for this script, you need to tie a domain to that account. I did so with mine, and while it prompts you to change your domain's DNS settings, you can just skip past that.

Thankfully, I haven't yet run into any rate limits for the API. At one point I ran just over 100 consecutive URLs through it and had no issues.
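For anyone following along, the request shape looks roughly like the sketch below. The Bearer-token header and account-scoped URL are how Cloudflare's v4 API authenticates; the exact URL Scanner endpoint path is an assumption from memory, so double-check it against Cloudflare's docs:

```python
# Sketch of an authenticated Cloudflare API request. The endpoint path is an
# assumption and should be verified; the auth mechanics are the point here.
import os
import requests

ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
API_TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/urlscanner/scan",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"url": "https://example.org"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```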
.env file?
Speaking of authorization stuff, I currently have the API token and account ID stored the same way the GTmetrix credentials were: in a separate file called `myauth.py`. I think I understand why this method was used, but I'd argue that storing them in a `.env` file is better, since it allows this script to be run more easily via GitHub Actions or something similar. That said, I'm willing to stick with the current method for my PR.
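A minimal sketch of the `.env` variant, assuming the python-dotenv package and hypothetical variable names; in CI, the same names could be injected directly from GitHub Secrets without any file on disk:

```python
# Sketch: load credentials from a local .env file during development.
# In GitHub Actions, the same variables would come from repository secrets.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads KEY=value pairs from .env into the environment

API_TOKEN = os.getenv("CLOUDFLARE_API_TOKEN")
ACCOUNT_ID = os.getenv("CLOUDFLARE_ACCOUNT_ID")
```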
Markdown file output
I'm currently planning on submitting my PR with two files changed, `site_size_rechecker.py` and `sites.yml`. A good chunk of the entries are outdated, and given how long it takes to run the whole list (a minimum of 20 seconds per entry), I'm gonna leave the script running on my server overnight before I PR.

Is it alright if I modify the script to output the markdown table to a file? Since I'm gonna be redoing all the entries, it's gonna be a truly massive list that I doubt will fit in a PR, so I'd rather link to a file.
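Something as simple as the sketch below would cover it; the row format here is illustrative, not the script's actual output:

```python
# Sketch: write the results table to a Markdown file the PR can link to,
# instead of printing it. Rows are stand-ins for real results.
rows = [("example.org", "412 kB"), ("example.net", "96 kB")]

with open("size_recheck_results.md", "w") as f:
    f.write("| Domain | Size |\n| --- | --- |\n")
    for domain, size in rows:
        f.write(f"| {domain} | {size} |\n")
```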
Documentation
`docs_site_size_rechecker.md` is a really handy file that documents how to use `site_size_rechecker.py`. That said, it requires a significant rewrite due to the changes I've made. Would you mind if I PR my changes first and then update this file?