kiwix-serve indicates that the served item is marked "is_front" #1026

Open
benoit74 opened this issue Nov 24, 2023 · 25 comments

Comments

@benoit74

I apologize in advance for my limited libzim / libkiwix expertise, you might have to rephrase everything below.

In python-scraperlib, when adding items to the ZIM we can mark them as "is_front" so that they are used for suggestions / searches while other items are ignored (and if not passed, this property is also computed dynamically based on the content type).

For the offspot/metrics project, we might need to detect whether the HTTP response from kiwix-serve is for an "asset" (is_front is False) or for a "page" (is_front is True), because we would like, for instance, to count the number of pages visited per period.

This would typically be possible if the is_front property is stored in the ZIM (not sure it is the case) and returned by libkiwix / kiwix-serve as a response header.

Is this or would this be possible?

@benoit74
Author

Oh and we would probably also need the item title. Is it possible as well?

@rgaudin
Member

rgaudin commented Nov 24, 2023

It's not; this is creator-only info.

@rgaudin rgaudin closed this as completed Nov 24, 2023
@rgaudin
Member

rgaudin commented Nov 24, 2023

Close it right away to make it look dramatic 😂

@rgaudin
Member

rgaudin commented Nov 24, 2023

You may want to look at https://libzim.readthedocs.io/en/latest/

@rgaudin
Member

rgaudin commented Nov 24, 2023

Actually the front (not really but similar, see libzim doc) articles can be found in the Listings

@rgaudin rgaudin reopened this Nov 24, 2023
@kelson42
Collaborator

Usually logs don't save HTTP response headers. Do you plan to tweak the reverse proxy logging feature?

As a workaround, could we save a log entry only when the URL ends with ".html" or has no extension at all?

@benoit74
Author

Caddy logs do contain request and response headers by default (when using the proper structured format): https://caddyserver.com/docs/logging

No tweaking needed, this is the default / recommended configuration.

We will mostly not keep these logs, only process them on the fly. For simplicity and resilience reasons we will in fact write Caddy log files, but we will keep only 2 rotating files of 1 MB each, and delete the "not current" file after 48h if it has not been rotated again by then (numbers could still change; this has not been heavily discussed so far, but this is the "spirit").
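
To illustrate what processing such logs on the fly could look like, here is a minimal Python sketch that reads one JSON access-log line and pulls out the request URI, status and response headers. The sample line is made up, and the field names (`request.uri`, `status`, `resp_headers`, with header values stored as lists) are assumed to match Caddy's structured format; they should be checked against real log output.

```python
import json

# Hypothetical sample line; real entries carry many more fields.
SAMPLE_LINE = json.dumps({
    "request": {"method": "GET", "uri": "/content/wikipedia/A/Paris"},
    "status": 200,
    "resp_headers": {"Content-Type": ["text/html; charset=utf-8"]},
})


def parse_access_line(line):
    """Return (uri, status, headers) with header names lowercased.

    Header values are assumed to be lists of strings; only the first
    value of each header is kept.
    """
    entry = json.loads(line)
    headers = {
        name.lower(): values[0]
        for name, values in entry.get("resp_headers", {}).items()
        if values
    }
    return entry["request"]["uri"], entry.get("status"), headers


uri, status, headers = parse_access_line(SAMPLE_LINE)
print(uri, status, headers["content-type"])
```

Lowercasing the header names keeps later lookups independent of how the server happens to capitalize them.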

@kelson42
Collaborator

kelson42 commented Nov 26, 2023

I guess this feature request can be implemented... but how will you handle other kinds of content (not based on kiwix-serve)? I wonder whether we could have something different and more generic to handle this properly.

@rgaudin
Member

rgaudin commented Nov 26, 2023

I think you misunderstood something: everything is proxied, so we have CT for all requests to all apps

@kelson42
Collaborator

I think you misunderstood something: everything is proxied, so we have CT for all requests to all apps

But you cannot set this specific header for all content (EduPi for example). How will the front/resource distinction be made then?

@rgaudin
Member

rgaudin commented Nov 26, 2023

I thought the discussion was to use CT instead of is_front. I must have missed some comments

@kelson42
Collaborator

I thought the discussion was to use CT instead of is_front. I must have missed some comments

I don't know what you mean by "CT"

@benoit74
Author

From my understanding, there are basically two approaches to detect "Package page" views for offspot/metrics, based on reverse proxy logs:

  • rely on content-type to decide that an HTTP request/response is for a "Package page", i.e. something that we want to track in metrics in terms of number of views (for now)
    • this is what we have planned / implemented so far
    • pro:
      • versatile, could be implemented for all apps
    • cons:
      • we only have the URL, which is not very readable and sometimes not useful at all
        • e.g. for EduPi the URL is only a technical ID of the document retrieved, starting at 1 for the first document uploaded, ...
        • ZIM URLs are not always very explicit, ...
      • it is hard to decide what is a "Package page" based only on content-type (e.g. should we include PDFs? ePubs? Videos?)
      • for ZIMs, we have somehow already decided what is a "Package page" at ZIM creation time with the is_front property and we do not benefit from it
  • find another alternative
    • this is the topic of this ticket
    • pros:
      • reuse something which is already decided in scrapers (is_front), and benefit from their logic (will always be more specific / fine-tuned than just a content-type)
      • probably possible to also access the real "Title" of the "Package Page", instead of just a technical URL
        • it would even be insensitive to scraper changes in terms of URL structure in the ZIM
      • no business logic in metrics to decide what is a "Package Page"
      • business logic is tied to the ZIM, so probably easier to evolve if needed
    • cons:
      • only works for ZIMs
      • obviously a change is needed in libkiwix / kiwix-serve ^^
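
As a rough illustration of the first (content-type) approach, the decision could look like the Python sketch below. The set of types treated as a "Package page" is purely an assumption for the example; choosing it is exactly the policy question raised in the cons above.

```python
# Assumption: which content types count as a "Package page" is a policy
# choice; this example set deliberately leaves out PDF, ePub and video.
PAGE_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}


def is_page_by_content_type(content_type):
    """Return True when the media type (parameters stripped) is in the page set."""
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in PAGE_CONTENT_TYPES


print(is_page_by_content_type("text/html; charset=utf-8"))
print(is_page_by_content_type("image/png"))
```

This works for any proxied app, but it only ever yields a URL to report on, never a title.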

My question was first to check there was not something already feasible / implemented without code changes. And then to investigate if it is meaningful to make a change (or if we "live with what we have", at least for now).

I already have an answer to the first part of my question, which is great, thank you.

@rgaudin
Member

rgaudin commented Nov 26, 2023

Versatile doesn't prevent us from having a better support for ZIM where we have control.

Regarding URLs in scrapers, you are well aware that we find ways to make them human-readable.

Sure, we could also embed the entry title in headers, but sending it both in the body and in headers is not appealing.

@kelson42
Collaborator

@rgaudin @benoit74 Thanks, this seems very clear to me now. Your header proposal seems OK to me; I don't really have a better idea.

Considering:

What would be a very concrete proposal of header name/value(s)?

@rgaudin
Member

rgaudin commented Nov 27, 2023

Being a dinosaur, I'd use the X- prefix. The MDN docs say:

this convention was deprecated in June 2012 because of the inconveniences it caused when nonstandard fields became standard

There is no chance for ours to ever become standard so I think we can use X- prefix or not. Whatever you prefer

X-ZIM-Title: xxx
X-ZIM-FrontArticle: true/false
openZIM-Title: xxx
openZIM-FrontArticle: true/false
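
For illustration only, a consumer such as offspot/metrics could read either naming variant along these lines. Both header names are proposals from this thread and are not implemented in kiwix-serve; the function name and its `prefix` parameter are hypothetical.

```python
def read_zim_headers(resp_headers, prefix="X-ZIM-"):
    """Extract (title, is_front) from response headers, case-insensitively.

    The header names are proposals from this thread, not an existing
    kiwix-serve feature. A missing header yields None for that field.
    """
    lowered = {name.lower(): value for name, value in resp_headers.items()}
    title = lowered.get((prefix + "Title").lower())
    raw_front = lowered.get((prefix + "FrontArticle").lower())
    is_front = None if raw_front is None else raw_front.strip().lower() == "true"
    return title, is_front


print(read_zim_headers({"X-ZIM-Title": "Paris", "X-ZIM-FrontArticle": "true"}))
print(read_zim_headers({"openZIM-Title": "Logo", "openZIM-FrontArticle": "false"},
                       prefix="openZIM-"))
```

Returning None rather than False for a missing header lets the caller tell "not a front article" apart from "this app does not send the header". A real implementation would also have to settle how non-ASCII titles are encoded in a header value.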

@mgautierfr
Member

Actually the front (not really but similar, see libzim doc) articles can be found in the Listings

While it is technically true, it may not be the best way to get the information.
Having the articles in a list (and getting the information from there) would mean that we search this list for every resource to know whether it is front or not. It roughly doubles the work (and time) needed to locate a resource (not including decompression). If this information is used only for (our) metrics/stats, I'm not sure it is worth it.

If we go this way, it would be better to move toward supporting generic headers (which could be used by zimit2). Depending on how we implement it, we would still have to do a second entry lookup, but it would at least be generic and not only for us.

It could also be merged with the generic metadata feature (partly explained, but never implemented, in openzim/libzim#325).

find another alternative

  • this is the topic of this ticket

Another idea (relevant or not):
Make metrics ask the "zim file" if the url is a front or not. When metric detects (by heuristics) that a url may be front article, it opens the zim file itself and searches for information in it.

@rgaudin
Member

rgaudin commented Nov 27, 2023

Make metrics ask the "zim file" if the url is a front or not. When metric detects (by heuristics) that a url may be front article, it opens the zim file itself and searches for information in it.

I initially thought that was what @benoit74 wanted to do. We could export a list of front articles for every ZIM in a sorted list or another fast-access format that metrics could query.
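
A minimal sketch of that "sorted list" idea in Python, assuming the front-article paths of one ZIM have already been exported somehow (the export step itself is not shown, and the function names and sample paths are hypothetical):

```python
from bisect import bisect_left


def build_front_index(front_paths):
    """Return a sorted, de-duplicated list of front-article paths."""
    return sorted(set(front_paths))


def is_front_article(index, path):
    """Binary-search the sorted index for an exact path match."""
    pos = bisect_left(index, path)
    return pos < len(index) and index[pos] == path


# Hypothetical export for one ZIM.
index = build_front_index(["A/Paris", "A/Berlin", "A/Paris", "A/Rome"])
print(is_front_article(index, "A/Berlin"))    # True
print(is_front_article(index, "I/logo.png"))  # False
```

Each lookup costs O(log n) comparisons, and the list can be stored as a plain text file next to the ZIM.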

@benoit74
Author

Header naming

I second the idea of keeping the X- prefix; these headers will never make it into an international standard.

Regarding naming the header(s), we might also consider that:

  • the need comes from offspot/metrics
  • we would like to be able to track page views of other applications than kiwix-serve

Then I would propose to add only one header X-Offspot-Page-Viewed-Name which:

  • will contain a user-friendly label of the viewed page name
  • will only be set when a page has to be tracked (i.e. when it is a front matter for ZIMs, not for JS/CSS, and usually not for PDF/ePub/...)
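
To make the intent concrete, here is a small Python sketch of how metrics could count page views from such a header. The header is only proposed here and does not exist anywhere yet; the input is assumed to be one dict of lowercased response header names per request.

```python
from collections import Counter

# Proposed in this thread, not implemented anywhere.
HEADER = "x-offspot-page-viewed-name"


def count_page_views(responses):
    """Count views per page label.

    `responses` is an iterable of dicts mapping lowercased header names to
    values; responses without the header (assets) are ignored.
    """
    views = Counter()
    for headers in responses:
        label = headers.get(HEADER)
        if label:
            views[label] += 1
    return views


views = count_page_views([
    {HEADER: "Paris", "content-type": "text/html"},
    {"content-type": "text/css"},
    {HEADER: "Berlin", "content-type": "text/html"},
    {HEADER: "Paris", "content-type": "text/html"},
])
print(views.most_common())
```

With a single header, the absence of the header is the "do not track" signal, so metrics needs no content-type logic of its own.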

Who does what

I don't mind if we decide that it is preferable to not implement this in libkiwix and only export the list of front articles. In any case, the computation will be done somewhere.

More global insight

As mentioned in offspot/metrics#33 (comment), the more we dive into this issue, the more doubts I have about the real user need for this.

@kelson42
Collaborator

kelson42 commented Dec 17, 2023

@mgautierfr Can we not just efficiently return the article title in an HTTP response header from the dirent (as the dirent is read anyway if you return the content)?

At this stage I believe this is the way forward: always return the article title as an HTTP header, which is cheap and straightforward. If there is no title in the header, metrics can consider that this is a resource and not a front article.

@mgautierfr
Member

I like the idea, and it is pretty straightforward. But there is a catch (which may or may not be a problem; if you don't care, I don't care either):

The way we store titles in the ZIM file, we cannot know whether an entry has no title (image, video) or whether its title is the same as its URL. So we have two options:

  • Always set a title header (and set it to the URL for entries without a title)
  • Set a title header only if the title is different from the URL (and so lose track of entries with title == URL)

@kelson42 kelson42 added this to the 13.1.0 milestone Dec 19, 2023
@kelson42 kelson42 self-assigned this Dec 19, 2023
@kelson42
Collaborator

The way we store titles in the ZIM file, we cannot know whether an entry has no title (image, video) or whether its title is the same as its URL. So we have two options

This is an optimisation hack for the title index search. We should be able to know that directly from the dirent. If that is not possible, we should make it possible.

@benoit74
Author

I don't get why we should care about the situation where the title is the same as the URL. If something (scraper, whatever) decided to use the URL as a title, then this is the title. And I don't mind if we blindly return this title in an HTTP response header.

The consumer of this information can then apply its own logic to decide whether a title identical to the URL is acceptable or not.

Typically, we could imagine using this information as a heuristic in offspot/metrics to detect which requests are most probably "front matter" and which ones aren't (even if I'm still not convinced that we won't have a scraper which sets a title equal to the file name on some assets, for instance ... but this is something we could have control over).
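
As a sketch of such a consumer-side heuristic, in Python and under the assumption that some title header is available: the function name and the `accept_same_as_path` switch are hypothetical, and the comparison against the last path segment is just one possible policy.

```python
from urllib.parse import unquote


def looks_like_front(path, title, accept_same_as_path=False):
    """Guess whether a response is a page, from a title header value.

    Without a title the answer is False. A title that merely repeats the
    path or its last segment is accepted only if accept_same_as_path is set.
    """
    if not title:
        return False
    decoded = unquote(path).strip("/")
    last_segment = decoded.rsplit("/", 1)[-1]
    if title in (decoded, last_segment):
        return accept_same_as_path
    return True


print(looks_like_front("/content/wiki/A/Paris", "Paris, capital of France"))
print(looks_like_front("/content/wiki/I/logo.png", "logo.png"))
```

With the strict default, an article whose title really is identical to its path is treated as an asset, which is why the policy is left as a switch for the consumer.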

@mgautierfr
Member

This is a space optimization hack. If the title is the same as the URL, we only store the URL in the dirent. So, at reading time, if the dirent contains only a URL, we don't know whether it was created with a title identical to the URL or without a title (an empty title counts as no title).
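
The ambiguity can be reproduced with a toy model in Python. This is not libzim's actual dirent layout or API, only an illustration of the behaviour described above, and the URL fallback on the reader side is an assumption of the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDirent:
    """Toy stand-in for a stored dirent, not libzim's real layout."""
    url: str
    title: str  # empty when the creator's title was empty or equal to url


def store(url, title=""):
    """Writer side: drop the title when it adds nothing over the URL."""
    return StoredDirent(url, "" if title == url else title)


def read_title(dirent):
    """Reader side (assumed): fall back to the URL when no title was stored."""
    return dirent.title or dirent.url


with_same_title = store("Paris", "Paris")
without_title = store("Paris")
print(with_same_title == without_title)  # True: the two cases are indistinguishable
```

Once stored, both entries are byte-for-byte identical in this model, so no reader-side logic can recover which one the creator intended.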

@kelson42
Collaborator

I have opened a ticket at libzim to get this feature:
openzim/libzim#885

@kelson42 kelson42 modified the milestones: 13.2.0, 13.3.0 May 20, 2024