JtR next generation #4260

Open
magnumripper opened this issue May 6, 2020 · 7 comments

@magnumripper
Member

magnumripper commented May 6, 2020

While I'm not sure we'll ever actually manage to give birth to the "next generation" re-write of JtR, I thought we should at least have an issue here for brainstorming. Here are some of my thoughts, quoted (and edited) from some other place nearby:

We'd do a total re-write but heavily re-use existing code (after careful considerations on a case-by-case basis). We'd not add a single line of code until (re-)defining lots of things, like code style (tho' actually the code style could be not to have one [except we'd mostly keep the current one for core, as in non-plugs]), plug-in interfaces (we should have mode plugs as well), source tree structure and so on.

Some must-haves (IMHO):

  • All formats (except, say, -stdout) are plug-ins. We should strive to make the -stdout option as close to just any format as possible, from a code point of view.
  • All modes (except, say, -stdin) are plug-ins.
  • Ditch OpenSSL in favor of own code that can be more or less optimised (first prio is merely to have it working at all, performance isn't a priority until near a release).
  • Candidate strings should be UTF-32 and/or UTF-8-32 all the way from generation (or reading from a file) to set_key(). Perhaps also (in lots of places but not everywhere) use a "pascal string" struct for strings in many parts of core.
  • The use of threads (OpenMP or not) should be carefully planned so modes can use it where more beneficial than in the format.
  • The core should obviously be written with node/fork/MPI in mind as well - including the option to split a job by salts as opposed to by candidates.
  • New format interface(s) should be very carefully considered. This includes OpenCL/FPGA/other interfaces (eg. all main OpenCL kernels should have the same interface).
  • On-device mask mode should be much easier to implement. Unified interfaces, shared CPU code, shared GPU code.
  • An OpenCL format plugin (for example) should ideally not contain a single line of OpenCL code. Maybe we should also ditch the idea of having separate source files for CPU/OpenCL/ZTEX and so on - instead use a single one, calling abstraction layers in core for the non-CPU stuff.
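
As a rough illustration of the "split a job by salts as opposed to by candidates" item above, here is a minimal sketch. Everything in it (`split_mode`, `node_takes`, the round-robin rule) is a hypothetical name or assumption made for illustration, not existing JtR code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: what a node/fork/MPI job is divided on. */
enum split_mode { SPLIT_BY_CANDIDATES, SPLIT_BY_SALTS };

/*
 * Return 1 if node `node` (0-based) out of `total` should process the
 * given (candidate, salt) pair. Round-robin assignment is assumed here
 * only to keep the sketch short.
 */
int node_takes(enum split_mode mode, unsigned node, unsigned total,
               size_t candidate_idx, size_t salt_idx)
{
    assert(total > 0 && node < total);
    if (mode == SPLIT_BY_CANDIDATES)
        return candidate_idx % total == node; /* all salts, some candidates */
    return salt_idx % total == node;          /* all candidates, some salts */
}
```

With a split by salts, every node sees the full candidate stream but only attacks its own share of the salts, which might suit slow salted hashes with many distinct salts.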

Optional features:

  • I came up with this idea recently but am not sure it will fly: We could try making some compatibility layer for using legacy plugin formats as-is. That way we won't need to upgrade all 321 plugins at once (we might even opt to never fix some of the most obscure).
  • Re-write dynamic mode from scratch (using only the "dynamic compiler" mode, nothing else) and add OpenCL support for it.

I think the concept of "format tags" and aliases should be a core thing.

@solardiz
Member

Comment on improving multi-thread scalability for fast hashes within the current formats interface or with minor changes to it: #5435 (comment)

@AlekseyCherepanov
Member

I would add an idea about pulling third-party code into john: it would be nice to keep it structured in a way that allows updates from upstream.

I guess git submodule is quite popular to use third party repo without changes. But it makes me uneasy when I see it (submodule binds to commit hash, but I am not sure if the hash is checked by default and I am too lazy to investigate it; and I am not happy just downloading software from Internet randomly, john is the only exception, for everything else I have Debian).

An alternative that seems nicer: git subtree, it includes repo as subtree together with history and ability to update (and maybe even make changes and rebase them over updates).

@AlekseyCherepanov
Member

For pascal strings, there is the sds library. It is interesting because it holds both '\0' and length. It is used in redis. So trade-offs should be documented somewhere.
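
To make the "both '\0' and length" trade-off concrete, here is a minimal sketch of a counted string that also keeps a terminator. This is not the sds API; `struct pstr` and `pstr_new` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: explicit length plus a trailing '\0'. */
struct pstr {
    size_t len;  /* number of bytes, not counting the terminator */
    char data[]; /* len bytes followed by '\0' */
};

/* Copy `len` bytes from `src`; embedded '\0' bytes are preserved. */
struct pstr *pstr_new(const void *src, size_t len)
{
    struct pstr *s;

    assert(src != NULL);
    s = malloc(sizeof(*s) + len + 1);
    if (!s)
        return NULL;
    s->len = len;
    memcpy(s->data, src, len);
    s->data[len] = '\0'; /* keeps the buffer usable with C string APIs */
    return s;
}
```

The stored length makes length queries constant-time and the buffer binary-safe, while the terminator costs one extra byte and lets `data` be passed to functions expecting a C string (which then stop at the first embedded '\0').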

@AlekseyCherepanov
Member

From attempts to hack on code quickly during contests, I would say that the format interface is too detailed for many simple formats, so there is a lot of boilerplate code in simple cases. The same applies to OpenMP: support for it inside a format adds code that complicates reading. New interfaces should reduce code in formats and take care of housekeeping such as parallelization.

Some formats really need the flexibility of current interface. So I guess there might be some pluggable adapters for different interfaces.

I have a few considerations about multi-threading. Formats use global variables and are not compatible with threading as is. But there are the following variants (and probably some more that I did not think about):

  • thread local variables instead of global
  • extracting format into loadable module and loading it multiple times to get multiple instances of global variables
  • abandoning global variables in favor of true OO-style with self passed to every method
  • keep formats as is and use fork together with shared memory for db_main

Some formats use global arrays for their data, which gives a small speed-up. But I think it works only with code inlined into crypt_all(). Many complex formats use shared functions for hashing. Even raw formats like raw-sha1 and raw-md5 use shared functions. Also in many formats, global arrays were replaced with pointers (and I guess it is not as fast as true arrays). So abandoning global variables should not be a problem in most cases. OTOH it would be a pity to lose the possibility of such optimization for some special cases.
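
A minimal sketch of the "self passed to every method" variant, assuming a made-up `struct fmt_ctx` in place of a format's global key buffer. The names and sizes are hypothetical and do not match the current formats interface.

```c
#include <assert.h>
#include <string.h>

#define MAX_KEYS 4
#define KEY_LEN  32

/* Hypothetical per-instance state replacing a format's global arrays. */
struct fmt_ctx {
    char keys[MAX_KEYS][KEY_LEN + 1];
    int count;
};

void ctx_init(struct fmt_ctx *self)
{
    memset(self, 0, sizeof(*self));
}

/* Same role as a set_key() method, but with an explicit self. */
void ctx_set_key(struct fmt_ctx *self, const char *key, int index)
{
    assert(index >= 0 && index < MAX_KEYS);
    strncpy(self->keys[index], key, KEY_LEN);
    self->keys[index][KEY_LEN] = '\0'; /* strncpy may not terminate */
    if (index >= self->count)
        self->count = index + 1;
}

const char *ctx_get_key(const struct fmt_ctx *self, int index)
{
    assert(index >= 0 && index < self->count);
    return self->keys[index];
}
```

Each thread could own one `struct fmt_ctx`, so two instances never share key buffers. The thread-local variant from the list would instead keep the globals and mark them with C11 `_Thread_local`.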

@AlekseyCherepanov
Member

Another idea: some objects for parsers to pack specific code like tag+hex with parameters like actual tag and length of hex part. So a format could say: raw hash as 64 hex lowercase only with required tag "$...$", postprocess binary with this callback. And there would be a function to parse hex hashes using such parameters. Then other formats would declare use of custom parser. So we would be able to implement more common parsers and switch formats onto them gradually.
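
The tag+hex idea could look roughly like the sketch below: a format fills in a descriptor, and shared code does the validation and decoding. `struct hex_parser`, `hex_valid` and `hex_parse` are hypothetical names invented for this illustration, not part of the existing code base.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a "tag + hex" ciphertext. */
struct hex_parser {
    const char *tag; /* required prefix, "" for none */
    size_t hex_len;  /* exact number of hex digits after the tag */
    int lower_only;  /* nonzero: reject 'A'-'F' */
    /* optional hook run on the decoded binary (e.g. reversing rounds) */
    void (*postprocess)(unsigned char *binary, size_t size);
};

static int hex_value(char c, int lower_only)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (!lower_only && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Shared validity check driven only by the descriptor. */
int hex_valid(const struct hex_parser *p, const char *ciphertext)
{
    size_t tag_len = strlen(p->tag);

    if (strncmp(ciphertext, p->tag, tag_len) != 0)
        return 0;
    ciphertext += tag_len;
    if (strlen(ciphertext) != p->hex_len)
        return 0;
    for (size_t i = 0; i < p->hex_len; i++)
        if (hex_value(ciphertext[i], p->lower_only) < 0)
            return 0;
    return 1;
}

/* Decode into `out` (hex_len / 2 bytes); returns 0 on invalid input. */
int hex_parse(const struct hex_parser *p, const char *ciphertext,
              unsigned char *out)
{
    assert(p->hex_len % 2 == 0);
    if (!hex_valid(p, ciphertext))
        return 0;
    ciphertext += strlen(p->tag);
    for (size_t i = 0; i < p->hex_len / 2; i++)
        out[i] = (unsigned char)
            (hex_value(ciphertext[2 * i], p->lower_only) * 16 +
             hex_value(ciphertext[2 * i + 1], p->lower_only));
    if (p->postprocess)
        p->postprocess(out, p->hex_len / 2);
    return 1;
}
```

A format matching the example in the text would set `hex_len` to 64, `lower_only` to 1, its own tag, and a `postprocess` callback. Formats with unusual syntax would keep their custom parser.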

Postprocessing like reversing rounds is the thing that makes parsers inseparable from hashing code.

Another interesting problem is that some groups of formats like cpu+opencl have different valid()s in their formats because the supported flavours differ between implementations (e.g. argon-opencl can crack argon2id while the cpu format cannot).

I have a prototype of a flexible dynamic parser where each parser is described by a string. The string is parsed into a tree and there is an interpreter for this tree. I don't know the speed though. Strings would allow parsers to be defined in john.conf like dynamic formats. But for static use, the next step would be to develop macros with similar syntax that expand into a tree using struct literals at compile time. Thinking about macros, it is possible to expand into code directly. But all these approaches make tens of tag+hex parsers take separate space in john's binary. Parsers as data might be cool because some analysis over them could be performed, e.g. the range of the full expected length could be computed in some cases and for instance raw-sha512 would not be tried for short hashes like nt based just on length.
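
The length-based analysis mentioned above might reduce to something like this sketch. `struct len_range` and `length_may_match` are hypothetical names. The 128 and 32 used in the usage note are the hex digest lengths of SHA-512 and of an NT (MD4) hash, with optional tags ignored for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary derived from a parser description. */
struct len_range {
    const char *format; /* label only */
    size_t min_len;     /* shortest ciphertext the parser can accept */
    size_t max_len;     /* longest ciphertext the parser can accept */
};

/* Cheap pre-filter: can this format possibly accept the input? */
int length_may_match(const struct len_range *r, const char *ciphertext)
{
    size_t n = strlen(ciphertext);

    assert(r->min_len <= r->max_len);
    return n >= r->min_len && n <= r->max_len;
}
```

With ranges of `{128, 128}` for raw-sha512 and `{32, 32}` for nt, a 32-character hash passes the nt filter and is rejected for raw-sha512 before any real parsing happens.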

(Trees are needed to handle variants of forms. Other approach is bytecode-like array with jumps. Jumps are a bit harder to populate from macros.)

@AlekseyCherepanov
Member

Hm, valid() takes self. There are binary_size and signature[]. So it is already possible to make shared valid() for all tag+hex formats.

@AlekseyCherepanov
Member

Another idea: length-based queues or even full data-driven branching (codenamed "continuous feeding"). Writing down #5534 reminded me of that. Some aspects of the idea were discussed on john-dev.

3 participants