"Product" formats/natures (formats that are both X and Y) #103

@julik

We have formats that parse ambiguously. For example, a Keynote document is a JPEG "at the head" and a ZIP with a specific structure "at the tail". A CR2 is a TIFF until considered otherwise. A TIFF is somewhat CR2-ish until considered otherwise. An Office document is a ZIP initially...

The number of these is only ever going to increase (see the library grounding principles). Currently we are at the stage where we litter the code with workarounds like "if this is also a CR2, bail out", "if this is also a ZIP, it is a Keynote file so bail out..." and so forth. What if, instead of doing this, we were to do the following:

  • Apply all the low-level parsers, always.
  • Apply some "folder" or "matcher" strategy to the flat list of results. For example, if something is matched as both a JPEG and a ZIP, and the ZIP has a specific file structure, we can assume it is Keynote. We then take the two results and smash them together into one which states the Keynote file type unambiguously. Likewise, if we see the Office ZIP filenames in the file, we convert the ZIP result into a Word file result.
  • Return the folded list to the caller.

So the procedure would look somewhat like this:

initial_results = parsers.map { |p| p.call(io) } # => [JPEG, ZIP]
results_with_complex_types = fold_complex_filetypes(initial_results) # => [Keynote]
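
To make the folding step more concrete, here is a rough sketch of what `fold_complex_filetypes` could look like. Everything in it is an assumption for illustration: the `Result` and `FoldRule` structs, the `file_type` and `filenames` fields, the `FOLD_RULES` table and the marker filenames are hypothetical and do not reflect the library's actual result objects or any existing API.

```ruby
# Hypothetical result object: the detected type plus, for archives,
# the list of filenames found inside. Not the library's real type.
Result = Struct.new(:file_type, :filenames, keyword_init: true)

# Hypothetical rule: which low-level types it consumes, a predicate over
# the matched results, and the type of the single result it produces.
FoldRule = Struct.new(:requires, :matches, :produces, keyword_init: true)

# Marker filenames below are placeholders chosen for illustration only.
FOLD_RULES = [
  FoldRule.new(
    requires: [:jpg, :zip],
    matches: ->(by_type) { by_type[:zip].filenames.include?("index.apxl") },
    produces: :keynote
  ),
  FoldRule.new(
    requires: [:zip],
    matches: ->(by_type) { by_type[:zip].filenames.include?("word/document.xml") },
    produces: :docx
  )
].freeze

def fold_complex_filetypes(results)
  # Parsers that did not match are assumed to have returned nil.
  by_type = results.compact.to_h { |r| [r.file_type, r] }

  FOLD_RULES.each do |rule|
    parts = by_type.values_at(*rule.requires)
    next if parts.any?(&:nil?)            # some required low-level type is absent
    next unless rule.matches.call(by_type)

    # Replace the consumed low-level results with one unambiguous result.
    rule.requires.each { |t| by_type.delete(t) }
    by_type[rule.produces] = Result.new(
      file_type: rule.produces,
      filenames: parts.flat_map(&:filenames)
    )
  end

  by_type.values
end

jpeg = Result.new(file_type: :jpg, filenames: [])
zip  = Result.new(file_type: :zip, filenames: ["index.apxl", "preview.jpg"])
folded = fold_complex_filetypes([jpeg, zip])
puts folded.map(&:file_type).inspect
```

Keeping the rules in a flat table would mean that a new "product" format is one more entry rather than one more "bail out" check inside an individual parser. Rule order matters in this sketch: a rule that consumes the ZIP result prevents later ZIP-based rules from firing.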

This does clash with the idea of running "at most as many parsers as were requested", but we would get much more intuitive operation in return, and we could remove quite a few hacks.
