
Setting to disable expensive endpoints for anonymous users #33966

Closed
Victorious3 opened this issue Mar 21, 2025 · 9 comments · Fixed by #34024
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@Victorious3

Victorious3 commented Mar 21, 2025

Feature Description

Since AI scrapers are terrorizing the web and flooding innocent Gitea instances, it would make sense to have an option that restricts expensive endpoints (like /src/commit or /blame) to logged-in users.

What I have observed is that crawlers like ClaudeBot and Bytespider don't respect my robots.txt and decide to crawl every single file from every single commit. For big repositories this becomes a massive performance hit, since Gitea has to run git to serve these requests, which has a lot of overhead. I even enabled a Redis cache, but since the crawlers hit new files all the time it didn't help much.

As a workaround I have configured my nginx reverse proxy to route these endpoints through an Anubis instance (https://anubis.techaro.lol/), which seems to stop most of the scrapers, or at least wastes their time long enough to make their DDoS (because that's what it is, really!) less annoying.
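For reference, a rough sketch of the kind of nginx rule I mean (a sketch only; the location pattern and the Anubis/Gitea upstream addresses and ports are assumptions to adapt to your own setup):

```
# Route expensive Gitea endpoints through Anubis; everything else goes straight to Gitea.
location ~ ^/[^/]+/[^/]+/(src/commit|blame|commits)/ {
    proxy_pass http://127.0.0.1:8923;  # Anubis, which in turn proxies back to Gitea
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

location / {
    proxy_pass http://127.0.0.1:3000;  # Gitea's default HTTP port
    proxy_set_header Host $host;
}
```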

However, since this solution works by proxying with nginx, every user sees the Anubis challenge before being able to look at commits, even if they are logged in. Therefore it would be preferable to just have an option to disallow these endpoints. If someone external wants to look at the commits, they can check out the repository and browse the history locally.

Screenshots

No response

@Victorious3 Victorious3 added the type/proposal The new feature has not been accepted yet but needs to be discussed first. label Mar 21, 2025
@qwertfisch

qwertfisch commented Mar 24, 2025

This feature would be very welcome. I have the same problem: AI scrapers that do not even set an appropriate user-agent string terrorize my Gitea instance, which hosts a small open source project. The vast majority of accesses go to /pulls? and /issues?.

Edit: my proposal to block queries with GET parameters seems pointless, as even navigating through the issues needs a page parameter. So it would indeed help to disable some endpoints or (better) to provide permission settings for anonymous users, with possible values read/write, read, and none.

@lunny
Member

lunny commented Mar 25, 2025

Or perhaps anonymous users could get limited access or rate-limited traffic.

@wxiaoguang
Contributor

What do you think about this?

-> Add a config option to block "expensive" pages #34024

@delvh
Member

delvh commented Mar 26, 2025

@wxiaoguang isn't that just another solution for #33951?
That PR sounds much less intrusive to me.
Instead of outright crippling Gitea (the routes you describe are the ones I myself use most often), we could use that approach to degrade gracefully only when the request queue grows too large…
That would fix the same problem and would probably be much more user-friendly.

@wxiaoguang
Contributor

The two are different:

  • Blocking "expensive" pages (including issues/PRs): no CPU/memory resources are spent on the AI crawlers at all.
  • Git content QoS: it focuses only on "git content"; it still consumes CPU/memory to render expensive pages.

The two approaches could also co-exist without conflict.

And since #33951 has seen no progress recently, I think we can at least implement this issue's proposal as a quick solution to help users under "AI crawler attack".

wxiaoguang added a commit to wxiaoguang/gitea that referenced this issue Mar 30, 2025
…o-gitea#34024)

Fix go-gitea#33966

```
;; User must sign in to view anything.
;; It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources,
;; for example: block anonymous AI crawlers from accessing repo code pages.
;; The "expensive" mode is experimental and subject to change.
;REQUIRE_SIGNIN_VIEW = false
```
# Conflicts:
#	routers/api/v1/api.go
#	tests/integration/api_org_test.go
@wxiaoguang
Contributor

The 1.23 nightly build is ready (it is from the stable release branch and will become 1.23.7 soon).

It has a new config option:

```
[service]
;; User must sign in to view anything.
;; It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources,
;; for example: block anonymous AI crawlers from accessing repo code pages.
;; The "expensive" mode is experimental and subject to change.
;REQUIRE_SIGNIN_VIEW = false
```
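For example, to block only the expensive pages for anonymous users while leaving the rest public (as described in the option's comment above):

```
[service]
REQUIRE_SIGNIN_VIEW = expensive
```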

You're welcome to try it and provide feedback.

@qwertfisch

Works wonderfully, thank you. It has already reduced both the load and the number of accesses.

@wxiaoguang Is it possible to configure which endpoints are considered "expensive"? I would like to give anonymous users access to the source code, because the root directory is already visible on the main repository page anyway. Also, if users can check out the complete repo anonymously, it doesn't make much sense to restrict the /src paths.

@wxiaoguang
Contributor

wxiaoguang commented Mar 31, 2025

/src is quite special: it is indeed one of the most expensive endpoints and the direct victim of AI crawlers, so exposing it to anonymous users would just bring the problem back.


Since REQUIRE_SIGNIN_VIEW = expensive is experimental and it does work, maybe you could fine-tune the logic and build your own binary to find the right balance. And #33951 could actually do better by providing QoS for some endpoints.
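For illustration, a minimal sketch of the kind of middleware logic involved, written against plain net/http rather than Gitea's actual router code; the path list, the cookie-based sign-in check, and the port are assumptions you would need to adapt before building:

```
package main

import (
	"net/http"
	"strings"
)

// expensivePrefixes lists path fragments treated as "expensive".
// These values are illustrative; tune them to your own traffic.
var expensivePrefixes = []string{"/src/commit/", "/blame/", "/commits/"}

// isSignedIn is a rough stand-in for a real session lookup: it only
// checks that a session cookie exists ("i_like_gitea" is Gitea's
// default session cookie name, configurable in app.ini).
func isSignedIn(r *http.Request) bool {
	_, err := r.Cookie("i_like_gitea")
	return err == nil
}

// requireSignInForExpensive redirects anonymous requests for
// expensive paths to the login page and passes everything else on.
func requireSignInForExpensive(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !isSignedIn(r) {
			for _, p := range expensivePrefixes {
				if strings.Contains(r.URL.Path, p) {
					http.Redirect(w, r, "/user/login", http.StatusSeeOther)
					return
				}
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n")) // placeholder for the real app
	})
	http.ListenAndServe(":8080", requireSignInForExpensive(mux))
}
```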

@qwertfisch

Normally I would say /src is an expensive endpoint. It's just that in my case the bots hammered the /pulls endpoint with 98% of all requests, whereas /src was hardly frequented. I will try to have a look at the code and build a custom binary. At least the raid has stopped (down from 2 million requests per day to 60k) thanks to this measure. And interested real people can still check out the repo or register, so it's not too restrictive or inconvenient.
