Zenodeo Status and Strategy
The query-making portion of Zenodeo, in conjunction with the backend parts that perform the queries, is the most complex. First, a quick background on how the querying happens:
The ‘data’ query, the one that returns records from the resource (let’s just consider treatments for simplicity), is always performed, and so is the ‘count’ query. The ‘data’ query returns a page of 30 records and allows pagination back and forth. The ‘count’ query returns the total number of records matching that query.
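As a rough sketch, the pair of queries for a page looks something like this (the table and the filter are made up for illustration; this is not Zenodeo’s actual SQL):

```sql
-- 'data' query: one page of 30 matching treatments
SELECT treatmentId, treatmentTitle
FROM treatments
WHERE journalTitle = 'Zootaxa'      -- whatever constraint the URL asked for
LIMIT 30 OFFSET 0;                  -- OFFSET 30, 60, … for subsequent pages

-- 'count' query: total number of matching records
SELECT Count(*) AS numOfRecords
FROM treatments
WHERE journalTitle = 'Zootaxa';     -- same constraint as above
```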
If the treatmentId is provided, then only one record is returned in the ‘data’ query, the value of the ‘count’ query is 1, and, in addition, a bunch of related queries are performed for records related to that specific treatment. But no facets or stats are returned.
If no treatmentId is provided, then zero or more records are returned and no related records are returned, but facets and stats are returned if explicitly requested.
The query itself is constructed from the query string, aka the search portion of the URL, and the same constraint (the WHERE clause in SQL) is used for all the queries: data, count, related records, facets, stats.
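To illustrate (again with made-up names, not Zenodeo’s actual SQL), the facet and stats queries reuse the very same constraint:

```sql
-- the same hypothetical constraint built from the query string:
--   WHERE journalTitle = 'Zootaxa'

-- a facet query: counts of matching treatments grouped by some column
SELECT journalYear, Count(*) AS facetCount
FROM treatments
WHERE journalTitle = 'Zootaxa'
GROUP BY journalYear
ORDER BY facetCount DESC;

-- a stats query: an aggregate over the same matching set
SELECT Count(DISTINCT journalYear) AS numOfYears
FROM treatments
WHERE journalTitle = 'Zootaxa';
```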
The difficulty comes from the fact that the queries are unpredictable, which makes keeping them all performant a challenge. Some queries are very fast, and some can be slow.
Since Zenodeo is a general-purpose API (and it will always remain a general-purpose API), it has to have routes that serve everyone. Since not everyone is going to be interested in facets and stats, they are optional, triggered by BOOLEAN flags in the query string, for example facets=true&stats=true, or any combination thereof. By default, these flags are absent or off.
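For example (the flags are the real ones named above; the rest of the query string is hypothetical):

```
?q=maratus                          → data + count only (the default)
?q=maratus&facets=true              → data + count + facets
?q=maratus&stats=true               → data + count + stats
?q=maratus&facets=true&stats=true   → data + count + facets + stats
```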
I have been working hard to write a program that constructs all the SQL queries for any requested search (sometimes these queries can number almost a dozen), and to tune the db so each SQL query returns a result in under a second. Ideally a SQL query will return results in tens of milliseconds, but many queries take 100s of ms. Up until a few days ago, some queries were taking tens of seconds, which was unacceptable. Then some help from the db wizards at SQLite allowed me to change the db so all the queries became quite fast.
Now, keep in mind, SQL queries are fast if they use an index. But an index is used only if it matches what is being asked for. Additionally, indexes work great on columns, but not on parts of columns. This is why fuzzy queries (the LIKE operator in SQL) are slower than the equality operator (=). Full-text search (FTS) queries also perform quasi-fuzzy searches, but they are fast because of a lot of TF/IDF indexes that are used. And (I haven’t tested it) even FTS queries may be slower if performed with parts of words, unless those words are stems.
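A sketch of the difference (hypothetical column names; the FTS example assumes an FTS5 virtual table exists over the full text):

```sql
-- equality can use a plain index on the column
CREATE INDEX ix_treatments_journalTitle ON treatments (journalTitle);

SELECT Count(*) FROM treatments
WHERE journalTitle = 'Zootaxa';           -- fast: index lookup

-- a fuzzy match on part of the column can't use that index
SELECT Count(*) FROM treatments
WHERE journalTitle LIKE '%taxa%';         -- slow: scans every row

-- full-text search is quasi-fuzzy but backed by its own index
SELECT Count(*) FROM treatmentsFts
WHERE treatmentsFts MATCH 'carabidae';    -- fast: uses the FTS index
```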
SQL queries are also slowed down by pagination. When I request only a certain number of records, the SQL engine finds all the matching records and then throws away everything before and after the requested page. As one can imagine, this gets more and more wasteful the deeper one pages into the result set. There is a way to make pagination faster by using an internally sortable ID, but I haven’t figured out an easy way to implement it yet. I am working on it.
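A sketch of the two approaches (the id column stands in for the internally sortable ID mentioned above):

```sql
-- OFFSET pagination: the engine still has to find and step over
-- the first 9000 matching rows just to return 30 of them
SELECT id, treatmentId, treatmentTitle
FROM treatments
WHERE journalTitle = 'Zootaxa'
ORDER BY id
LIMIT 30 OFFSET 9000;

-- keyset pagination: remember the last id of the previous page
-- and seek straight past it via the index
SELECT id, treatmentId, treatmentTitle
FROM treatments
WHERE journalTitle = 'Zootaxa' AND id > :lastIdOfPreviousPage
ORDER BY id
LIMIT 30;
```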
And, finally, sorting the result set also slows things down. By default, the results are sorted by the primary key (PK). Imposing a different sort order and direction means an extra step has to be performed. Add this as a prerequisite before pagination is performed and you get a double whammy.
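In SQLite, EXPLAIN QUERY PLAN makes that extra step visible; a sketch (hypothetical columns, plan output paraphrased):

```sql
EXPLAIN QUERY PLAN
SELECT treatmentId, treatmentTitle
FROM treatments
WHERE journalTitle = 'Zootaxa'
ORDER BY journalYear DESC            -- not the PK, no matching index
LIMIT 30 OFFSET 300;

-- the plan typically contains a line like
--   USE TEMP B-TREE FOR ORDER BY
-- i.e. the whole matching set is sorted before the requested
-- page can be cut out of it
```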
So, if you request fuzzy searches, please keep them to a minimum, only for those columns where they absolutely make sense, not on every column.
Going forward, I am in the process of rewriting my query-making engine once again, to make it even more flexible and to cover all sorts of edge cases, and I continue to work on making the SQL queries themselves much faster. Of course, as more queries are performed and cached, more queries will return the answer instantly. So, the system will get better the more it is used.
Also, one big enhancement that is planned is including the Zenodo metadata in Zenodeo. Right now the only access I have to Zenodo is over REST. But when I suck in the entire Zenodo metadata into Zenodeo, all kinds of rich queries will be possible against “Zenodo” (really against Zenodeo), much like for treatments. Even better facet and stats searches will be possible.