Description
The `.tpf` function currently stores all Atoms in memory. This leads to OOM (out-of-memory) crashes for large collections.
Some thoughts on this:
Rust OOM tools
Currently, we don't get any useful errors in the log. We can't do a stack trace, and there's no unwind, which makes debugging OOM issues hard.
This may also have something to do with Linux overcommitting memory.
- The RFC for `try_reserve` may help prevent panics / the OS killing Atomic-Server (see the sketch below). `oom=panic` might help give prettier error messages, but it's not implemented in stable Rust yet.
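As an illustration, here is a minimal sketch of how `try_reserve` could turn a failed allocation into a loggable error instead of an abort. The `Atom` struct and `collect_atoms` function are placeholders for this example, not the actual atomic-server code.

```rust
use std::collections::TryReserveError;

// Placeholder type for illustration only.
#[derive(Debug)]
struct Atom {
    subject: String,
    property: String,
    value: String,
}

fn collect_atoms(estimated_len: usize) -> Result<Vec<Atom>, TryReserveError> {
    let mut atoms: Vec<Atom> = Vec::new();
    // Unlike `Vec::with_capacity`, which aborts the process on allocation
    // failure, `try_reserve` returns an error we can surface in the logs.
    atoms.try_reserve(estimated_len)?;
    // ... push atoms here ...
    Ok(atoms)
}

fn main() {
    match collect_atoms(1_000_000) {
        Ok(atoms) => println!("reserved space for {} atoms", atoms.capacity()),
        Err(e) => eprintln!("could not allocate memory for collection: {e}"),
    }
}
```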
Index all the TPF queries
Let's go over the types of TPF queries we use, and how we can index these:
- All the queries with a known `subject` are not relevant.
- By far, most queries have a known `property` and `value`.
- The queries with a known `property` probably need a `property-value-subject` index. We don't have that as of now. That would also help us create really performant queries for new, unindexed query filters (see the sketch after this list).
- The queries with only a known `value` are indexed by the `reference_index`.
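To show what such an index could look like, here is a hedged sketch of a `property-value-subject` key layout on top of a key-value store. The key encoding and function names are hypothetical and not the actual atomic-server implementation.

```rust
// Keys are encoded as "property\0value\0subject", so a prefix scan over
// "property\0value\0" returns every matching subject without loading the
// whole collection into memory.
fn key_for(property: &str, value: &str, subject: &str) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(property.as_bytes());
    key.push(0);
    key.extend_from_slice(value.as_bytes());
    key.push(0);
    key.extend_from_slice(subject.as_bytes());
    key
}

/// Prefix used to find all subjects for a known property + value.
fn prefix_for(property: &str, value: &str) -> Vec<u8> {
    let mut prefix = Vec::new();
    prefix.extend_from_slice(property.as_bytes());
    prefix.push(0);
    prefix.extend_from_slice(value.as_bytes());
    prefix.push(0);
    prefix
}

fn main() {
    // A prefix scan on the store (e.g. `sled::Tree::scan_prefix`) would then
    // stream matching subjects one at a time.
    let prefix = prefix_for("https://atomicdata.dev/properties/isA", "Commit");
    println!("scan prefix: {:?}", String::from_utf8_lossy(&prefix));
    let _full_key = key_for(
        "https://atomicdata.dev/properties/isA",
        "Commit",
        "https://atomicdata.dev/commits/example",
    );
}
```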
How I found the issue
- Go to atomicdata.dev/collections
- Scroll down
- See `loading...`
The problem is that the WebSocket requests get no response.
Sometimes (but not always) the WebSocket connection seems to fail:
The connection to wss://atomicdata.dev/ws was interrupted while the page was loading. [websockets.js:23:19](https://atomicdata.dev/lib/dist/src/websockets.js)
websocket error:
error { target: WebSocket, isTrusted: true, srcElement: WebSocket, currentTarget: WebSocket, eventPhase: 2, bubbles: false, cancelable: false, returnValue: true, defaultPrevented: false, composed: false, … }
[bugsnag.js:2579:15](https://atomicdata.dev/node_modules/.pnpm/@bugsnag+browser@7.16.5/node_modules/@bugsnag/browser/dist/bugsnag.js)
On the server, I see this every time:
Oct 29 10:50:49 vultr.guest atomic-server[2965299]: Visit https://atomicdata.dev
Oct 29 10:50:49 vultr.guest atomic-server[2965299]: 2022-10-29T10:50:49.596753Z INFO actix_server::builder: Starting 1 workers
Oct 29 10:50:49 vultr.guest atomic-server[2965299]: 2022-10-29T10:50:49.596978Z INFO actix_server::server: Actix runtime found; starting in Actix runtime
Oct 29 10:51:13 vultr.guest systemd[1]: atomic.service: Main process exited, code=killed, status=9/KILL
Oct 29 10:51:13 vultr.guest systemd[1]: atomic.service: Failed with result 'signal'.
Oct 29 10:51:14 vultr.guest systemd[1]: atomic.service: Scheduled restart job, restart counter is at 27.
Oct 29 10:51:14 vultr.guest systemd[1]: Stopped Atomic-Server.
Oct 29 10:51:14 vultr.guest systemd[1]: Started Atomic-Server.
What killed our process?
dmesg -T | grep -E -i -B100 'killed process'
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/atomic.service,task=atomic-server,pid=2965353,uid=0
[Sat Oct 29 10:51:59 2022] Out of memory: Killed process 2965353 (atomic-server) total-vm:891908kB, anon-rss:278920kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:776kB oom_score_adj:0
An out-of-memory issue...
Since we can correctly see most of the Collections, but not all of them, I think one specific collection is causing this.
After checking them one by one, the culprit seems to be `/commits`. That makes sense: it is by far the largest collection!
I think the problem has to do with `.tpf` not being iterable.
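To make that concrete, here is a minimal sketch (with a placeholder `Atom` type, not the real atomic-server types) contrasting a collect-everything query with a lazy, iterator-based one that keeps memory bounded.

```rust
// Placeholder type for illustration only.
struct Atom {
    property: String,
    value: String,
}

// Eager variant: memory usage grows with the size of the result set,
// which is what makes very large collections such as /commits blow up.
fn tpf_collected<'a>(all: &'a [Atom], property: &str, value: &str) -> Vec<&'a Atom> {
    all.iter()
        .filter(|a| a.property == property && a.value == value)
        .collect()
}

// Lazy variant: callers can paginate or stop early without ever holding
// the full result set in memory.
fn tpf_iter<'a>(
    all: &'a [Atom],
    property: &'a str,
    value: &'a str,
) -> impl Iterator<Item = &'a Atom> + 'a {
    all.iter()
        .filter(move |a| a.property == property && a.value == value)
}
```

The iterator-based query work in #532 is presumably along these lines; this sketch only shows the general shape of that change.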
Activity
- Changed title from "Collections stuck on loading" to "Out of memory when dealing with large collections"
- #529 WIP propvalsub index
- Add property index #529
- #529 add property index, speed up queries
- Remove TPF #529
- Add migration for #529
- Query to use iterators #532