kiwix-serve indicates that the served item is marked "is_front" #1026
Comments
Oh and we would probably also need the item title. Is it possible as well? |
It's not, this is a creator only info. |
Close it right away to make it look dramatic 😂 |
You may want to look at https://libzim.readthedocs.io/en/latest/ |
Actually the front (not really but similar, see libzim doc) articles can be found in the Listings |
Usually logs don't save HTTP response headers; do you plan to tweak the reverse proxy logging feature? As a workaround, can we just save log entries only for requests whose path ends with ".html" or has no extension at all? |
Caddy logs do contain request and response headers by default (when using the proper structured format): https://caddyserver.com/docs/logging No tweaking needed, this is the default / recommended configuration. We will mostly not store these logs, only process them on the fly. For simplicity and resilience reasons, we will in fact store Caddy log files, but we will keep only 2 rotating files of 1 MB each, and delete the "not current" file after 48h if it has not already been rotated again (numbers could still change, this has not been heavily discussed so far, but this is the "spirit"). |
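To illustrate the on-the-fly processing described above, here is a minimal sketch. It assumes Caddy's JSON access log format, where each line is a JSON object with a `resp_headers` mapping of header names to lists of values. The sample line is made up and the helper name `response_header` is hypothetical, not part of offspot/metrics.

```python
import json
from typing import Optional


def response_header(log_line: str, name: str) -> Optional[str]:
    """Return the first value of a response header from one JSON log line."""
    entry = json.loads(log_line)
    headers = entry.get("resp_headers") or {}
    # Header names are case-insensitive, so compare lowercased names.
    for key, values in headers.items():
        if key.lower() == name.lower() and values:
            return values[0]
    return None


# Made-up sample line, trimmed to the fields used here.
sample = json.dumps({
    "request": {"method": "GET", "uri": "/content/wikipedia/A/Kiwix"},
    "status": 200,
    "resp_headers": {"Content-Type": ["text/html; charset=utf-8"]},
})
```

Calling `response_header(sample, "content-type")` reads the Content-Type of the response without any change to the reverse proxy configuration.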
I guess this feature request can be implemented... but how will you do it for other kinds of content (not based on kiwix-serve)? I wonder if we could have something different and more generic to handle this properly. |
I think you misunderstood something: everything is proxied, so we have CT for all requests to all apps |
But you cannot set this specific header for all the content (Edupi for example). So how will the front/resource distinction be made then? |
I thought the discussion was to use CT instead of is_front. I must have missed some comments |
I don't know what you mean with "CT" |
From my understanding, there are basically two approaches to detect "Package page" views for offspot/metrics, based on reverse proxy logs:
My question was first to check there was not something already feasible / implemented without code changes. And then to investigate if it is meaningful to make a change (or if we "live with what we have", at least for now). I already have an answer to the first part of my question, which is great, thank you. |
Versatile doesn't prevent us from having better support for ZIM, where we have control. Regarding urls in scrapers, you are well aware that we have ways to make them human readable. Sure, we could also embed the entry title in headers, but sending it both in the body and in the headers is not appealing |
@rgaudin @benoit74 Thx, seems very clear to me now. Your proposal of a header seems OK to me; I don't really have a better idea. Considering:
What would be a very concrete proposal of header name/value(s)? |
Being a dinosaur, I'd use the
There is no chance for ours to ever become standard so I think we can use
|
While it is technically true, it may not be the best way to get the information. If we go this way, it would be better to move towards supporting generic headers (which could be used by zimit2). Depending on how we implement it, we would still have to do a second entry lookup, but it would at least be generic and not only for us. It may also be merged with the generic metadata features (partly explained, but never implemented, in openzim/libzim#325).
Another idea (relevant or not): |
I initially thought that was what @benoit74 wanted to do. We could export a list of front articles for every ZIM in a sorted list or another fast-access format that metrics could query. |
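The sorted-list idea above could look like the following minimal sketch. It assumes the front-article paths of a ZIM have already been exported as plain strings by some other tool; `build_front_index` and `is_front` are hypothetical helper names, not libzim or offspot/metrics APIs.

```python
from bisect import bisect_left
from typing import Iterable, List


def build_front_index(paths: Iterable[str]) -> List[str]:
    """Deduplicate and sort exported front-article paths once, at load time."""
    return sorted(set(paths))


def is_front(index: List[str], path: str) -> bool:
    """Binary search the sorted list: O(log n) per requested path."""
    i = bisect_left(index, path)
    return i < len(index) and index[i] == path


# Made-up paths standing in for an exported list.
index = build_front_index(["A/Kiwix", "A/Main_Page", "A/Offline"])
```

With such an index, metrics would only need the requested path from the log line, and no extra response header from kiwix-serve.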
Header naming
I second the idea of keeping the
Regarding naming the header(s), we might also consider that:
Then I would propose to add only one header
Who does what
I don't mind if we decide that it is preferable to not implement this in libkiwix and only export the list of front articles. In any case, the computation will be done somewhere.
More global insight
As mentioned in offspot/metrics#33 (comment), the more we dive into this issue, the more doubts I have about the real user need for this. |
@mgautierfr Can we not just efficiently return the article title in an HTTP response header from the dirent (as the dirent is read anyway if you return the content)? At this stage I believe this is the way forward: always return the article title as an HTTP header, cheap and straightforward. If there is no title in the header, then metrics can consider that this is a resource and not a front article. |
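On the metrics side, the rule proposed above (a title header means a page, no title means a resource) could be sketched as follows. The header name `x-entry-title` is purely hypothetical, since the thread had not settled on a name, and the input shape (a day string plus a mapping of header names to lists of values) is an assumption made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical header name; the actual name was still under discussion.
TITLE_HEADER = "x-entry-title"

Headers = Dict[str, List[str]]


def is_page(resp_headers: Headers) -> bool:
    """Treat a response as a page only if it carries a non-empty title header."""
    for key, values in resp_headers.items():
        if key.lower() == TITLE_HEADER and any(v.strip() for v in values):
            return True
    return False


def count_pages(entries: Iterable[Tuple[str, Headers]]) -> Counter:
    """Count page views per day from (day, response headers) pairs."""
    counts: Counter = Counter()
    for day, headers in entries:
        if is_page(headers):
            counts[day] += 1
    return counts
```

An empty title is treated the same as a missing one here, so a response with a blank header value still counts as a resource.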
I like the idea, and it is pretty straightforward. But there is a catch here (which may or may not be a problem; if you don't care, I don't care either): the way we store titles in ZIM files, we cannot know whether an entry has no title (image, video) or whether its title is the same as its url. So we have two ways to proceed:
|
This is an optimisation hack for the title index search. We should be able to know that directly from the dirent. If that is not possible, we should allow it. |
I don't get why we should mind about the situation where the title is the same as the URL. If something (scraper, whatever) decided to use the URL as a title, then this is the title. And I don't mind if we blindly return this title in an HTTP response header. The user of this information will hence be able to apply their own logic to decide whether a title identical to the URL is acceptable or not. Typically we could imagine using this information as a heuristic in offspot/metrics to detect which requests are most probably a "front matter" and which ones aren't (even if I'm still not convinced that we won't have scrapers which set a title equal to the file name, for instance on some assets ... but this is something we could have control over). |
This is a space optimization hack. If the title is the same as the url, we only store the url in the dirent. So, at reading time, if the dirent contains only a url, we don't know whether the dirent was created with a title equal to the url or without a title (an empty title counts as no title). |
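The ambiguity described above can be reproduced with a toy model. This is not libzim code: `Dirent`, `write_dirent` and `read_title` are hypothetical names that only mimic the behaviour described in this comment, including the assumption that a reader falls back to the url when no title is stored.

```python
from typing import NamedTuple


class Dirent(NamedTuple):
    url: str
    stored_title: str  # empty when no title was written


def write_dirent(url: str, title: str = "") -> Dirent:
    # Space optimization: a title equal to the url is not stored at all.
    stored = "" if title == url else title
    return Dirent(url, stored)


def read_title(dirent: Dirent) -> str:
    # Assumed reader behaviour: fall back to the url when nothing is stored.
    return dirent.stored_title or dirent.url


untitled = write_dirent("logo.png")                  # asset added without a title
same_as_url = write_dirent("logo.png", "logo.png")   # title explicitly set to the url
```

Both calls produce the same stored dirent, which is why a reader cannot tell the two cases apart without extra information in the file.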
I have opened a ticket at libzim to get this feature: |
I apologize in advance for my limited libzim / libkiwix expertise, you might have to rephrase everything below.
In python-scraperlib, when adding items to the ZIM we can mark them as "is_front" so that they are used for suggestions / searches while other items are ignored (and if not passed, this property is also computed dynamically based on the content type).
For the offspot/metrics project, we might need to detect if the HTTP web response of kiwix-serve is for an "asset" (is_front is False) or for a "page" (is_front is True). Because we would like for instance to count the number of pages visited per period.
This would typically be possible if the is_front property is stored in the ZIM (not sure it is the case) and returned by libkiwix / kiwix-serve as a response header.
Is this or would this be possible?