Description
Issue #3090 asks whether matches might be excerpted in results from the search API, to avoid a performance-killing situation such as returning a line that is a gigabyte in length. There is the open PR #2732 to convert `SearchEngine` to use the modern Lucene unified highlighter. With that PR's new `HitFormatter`, it would be fairly straightforward to refactor to use the same excerpting that `LineHighlight` applies for UI search.
Huge text files, however, present additional problems for OpenGrok.
The Lucene `uhighlight` API makes it ultimately impossible to avoid loading full, indexed source content into memory. In some places the API permits content to be represented as a `CharSequence`, which would allow (with a bit of work) lazily loading source content; but the final formatting via a Lucene `PassageFormatter` is done with a method, `format(Passage[] passages, String content)`, where a `String` is demanded.
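
To make the constraint concrete, here is a minimal, hypothetical excerpting formatter; the class and excerpt logic are illustrative, but the abstract `format(Passage[], String)` signature is Lucene's, and it forces the entire field content to be materialized as one `String` no matter how small the excerpts are:

```java
import org.apache.lucene.search.uhighlight.Passage;
import org.apache.lucene.search.uhighlight.PassageFormatter;

public class ExcerptingFormatter extends PassageFormatter {
    @Override
    public Object format(Passage[] passages, String content) {
        // Each passage only carries offsets into the full content, yet the
        // full content must already be in memory as a String to get here.
        StringBuilder sb = new StringBuilder();
        for (Passage p : passages) {
            sb.append(content, p.getStartOffset(), p.getEndOffset());
            sb.append("\n...\n");
        }
        return sb.toString();
    }
}
```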
Keep in mind as well that Lucene postings have an offset datatype of `int`, so content past an offset of 2,147,483,647 cannot be indexed for OpenGrok to present context, since OpenGrok chooses to store postings-with-offsets so that later context presentation does not re-analyze files. (Currently OpenGrok does not limit the number of characters read, which results in issues like #2560. The latest JFlex 1.8.x has revised its `yychar` to be a `long`, but Lucene would still have an `int` limit for offsets.)
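
For illustration, this is roughly how a field carrying postings with offsets is configured using the standard Lucene API (a sketch, not necessarily OpenGrok's exact field setup):

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public final class FullContentFieldTypes {
    /**
     * Postings with offsets let context be presented later without
     * re-analyzing files, but every offset must fit in a Java int,
     * i.e. no character position beyond 2,147,483,647.
     */
    public static FieldType withOffsets() {
        FieldType t = new FieldType();
        t.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        t.setTokenized(true);
        t.freeze();
        return t;
    }
}
```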
For huge text files, then, I can think of a few possible choices:
- Allow setting an upper limit on the characters to be read from files so that "full, indexed source content" is capped, and continue to use `PassageFormatter` (see the capped-reader sketch after this list). This means, however, that some content from very large files would be missing from the index. (Currently all content from >2GB files is missing from the index.)
or
- Index the content fully, but do not store postings with offsets, and do not enable any showing of context. OpenGrok would merely be able to report whether or not a huge text file was matched by a particular query.
or
- Break up very large documents into virtual, partial documents (fitting within `int`, and likely fitting within, say, `short` to make the pieces very manageable), fully index the pieces, and allow presenting context for each piece separately.
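
As a sketch of how the first option's cap on characters read might look (the class name and wiring are hypothetical, not an existing OpenGrok API):

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Stops reading after a configured number of characters so that
 *  "full, indexed source content" stays within a manageable bound. */
public class CappedReader extends FilterReader {
    private final long limit;
    private long count;

    public CappedReader(Reader in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (count >= limit) {
            return -1;
        }
        int n = in.read(cbuf, off, (int) Math.min(len, limit - count));
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public int read() throws IOException {
        if (count >= limit) {
            return -1;
        }
        int c = in.read();
        if (c >= 0) {
            count++;
        }
        return c;
    }
}
```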
I generally think the second option might be satisfactory. Is there truly much utility to excerpting from a 1GB JSON file? What does "context" mean within such a file? I don't expect realizing that option would be too difficult. I suppose it could be done by reclassifying huge `Genre.PLAIN` files as `Genre.DATA`; but still using the plain-text analyzer and, where applicable, a language-specific symbol tokenizer; and also avoiding XREF generation (by virtue of being `Genre.DATA`).
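
A rough sketch of that reclassification idea; the `Genre` enum and the call site here are illustrative stand-ins, not OpenGrok's actual internals:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HugeTextReclassifier {
    /** Illustrative stand-in for OpenGrok's genre classification. */
    public enum Genre { PLAIN, DATA }

    // Cap chosen so postings offsets would stay within Lucene's int range;
    // a lower, configurable threshold could also be used.
    private static final long HUGE_TEXT_THRESHOLD = Integer.MAX_VALUE;

    public Genre classify(Path file, Genre detected) throws IOException {
        if (detected == Genre.PLAIN && Files.size(file) > HUGE_TEXT_THRESHOLD) {
            // Treated as DATA: still tokenized and indexed for search,
            // but no XREF generation and no context presentation.
            return Genre.DATA;
        }
        return detected;
    }
}
```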