Is your feature request related to a problem? Please describe.
String detection should be reliable and transparent. If the encoding is not known, it should give the user a number of possible strings at a given offset. The user can then choose which makes the most sense.
This is not done yet. It simply chooses the encoding most similar to ASCII, which is not a very reliable metric.
Describe the solution you'd like
The following is already implemented here: https://github.com/rizinorg/rizin/tree/string-detection
Problems of the previous heuristics
The UTF-16 and UTF-32 checks only accepted code units whose upper bytes are 0. This effectively excludes all code points > 0xff.
But those are exactly the code points UTF-16/32 is used for in the first place, so the heuristic is off (see the sketch below).
The testing happened mostly sequentially and returned early. If UTF-32 matched before IBM037 was tried, the code simply assumed UTF-32 and never performed the IBM037 test at all, leaving that string invisible.
This also makes the testing hard to extend: for every newly added encoding check, all previous checks must somehow fail first, which rules out the possibility of a byte sequence being valid in two encodings. (Judging from the code, @imbillow ran into exactly this problem.)
There was no easy way to test for multiple encodings at the same offset, except running the scan again after switching encodings.
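For illustration, the old upper-byte check amounts to something like the sketch below. This is a paraphrase for this issue, not the actual rizin code: a UTF-16 unit is only accepted when its high byte is zero, so any code point above 0xff, e.g. U+4F60 (你), is rejected even though it is perfectly printable.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Paraphrased illustration of the old heuristic (not the actual rizin code):
 * a UTF-16LE unit is only accepted if its high byte is zero, i.e. the code
 * point must fit into the Latin-1 range. */
static bool old_style_utf16le_check(const uint8_t unit[2]) {
	return unit[1] == 0x00 && unit[0] >= 0x20;
}

int main(void) {
	const uint8_t latin[2] = { 0x41, 0x00 }; /* U+0041 'A' -> accepted */
	const uint8_t cjk[2] = { 0x60, 0x4f };   /* U+4F60 '你' -> rejected, although printable */
	printf("A:  %d\n", old_style_utf16le_check(latin));
	printf("你: %d\n", old_style_utf16le_check(cjk));
	return 0;
}
```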
New encoding detection
The new solution is somewhat less performant (I think; I haven't profiled it yet, but I don't think it is much worse).
But I think it is well worth it, because:
We can actually control how much of a performance hit we want.
Check only X bytes of the buffer? Not a big deal to implement.
Only check for a subset of encodings? Just some if conditions.
At the same time we get way more flexibility in how the detection works via the biases (a possible shape for these options is sketched after this list).
If we have a big-endian binary, we can give big-endian strings a strong bias.
If we want to limit the string detection to strings of a certain length, we just change the bias.
If we need the detected strings to align with even addresses, again we just change the bias.
In general, we can give the user a selection of choices. If they see that strings of encoding X make the most sense, they can assume this is the string encoding used in the file, set it, and no longer pay any performance penalty.
Most generally, we reduce the number of false negatives to 0 (within the range of the detection parameters and biases).
I think this trade-off is very good, because we should always be able to provide all statically knowable information to the user.
What they do with it is their choice, not ours.
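To make these knobs concrete, a detection-options struct could look roughly like this; the type and field names are hypothetical, not the actual API on the string-detection branch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knobs for the detection (illustrative names only). */
typedef struct {
	size_t max_scan_bytes;  /* check only the first X bytes of the buffer */
	uint32_t encoding_mask; /* subset of encodings to test (a few if-conditions) */
	double bias_utf16be;    /* > 1.0 favors big-endian strings, e.g. for BE binaries */
	double bias_utf16le;
	double min_score;       /* drop candidates scoring below this (too-short strings) */
	size_t required_align;  /* 0 = none; 2 = detected strings must start at even addresses */
} StrDetectOpts;
```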
The algorithm
It is rather straightforward. The detection function gets a buffer and iterates over it twice (O(2n)). At each index it decodes a code point for each possible encoding and checks its printability (O(|Encodings| * log(I)), where I is the number of invalid code point ranges).
If the decoding gives a printable code point, it gets a score equal to the number of bytes it uses (4 for UTF-32, 2-4 for UTF-16, 1 for IBM037 etc.).
If there is no valid decoding of the bytes, the score is 0.0.
This is done for every offset.
Then we iterate backwards over the buffer and accumulate the scores, except where a cell has a score of 0.0. This means that strings of printable characters have a high score at the beginning of the string (namely the number of consecutive printable characters) and a low score at the end.
Complexity should be around: O(2n * |Encodings| * log(I)), which collapses to O(2n * log(I)) because the number of encodings is constant. So we get way more flexibility, precision and user friendliness for a factor of log(I) (and I hope I have not overlooked something :D).
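Here is a minimal sketch of the two passes for a single encoding column (illustrative code, not the branch's implementation; a real run keeps one such column per encoding and uses the project's decoders):

```c
#include <stdint.h>
#include <stdio.h>

/* Returns the number of bytes the printable code point at buf[0] occupies,
 * or 0 if the bytes do not decode to a printable code point.
 * Here: plain printable ASCII, one byte per code point. */
static size_t printable_len_ascii(const uint8_t *buf, size_t len) {
	return (len > 0 && buf[0] >= 0x20 && buf[0] < 0x7f) ? 1 : 0;
}

/* Pass 1: score[i] = byte length of the printable code point at offset i (0.0 if none).
 * Pass 2: walk backwards and add score[i + used] to score[i], so the start of a
 * run of printable characters accumulates the score of the whole run. */
static void score_encoding(const uint8_t *buf, size_t len,
		size_t (*printable_len)(const uint8_t *, size_t), double *score) {
	for (size_t i = 0; i < len; i++) {
		score[i] = (double)printable_len(buf + i, len - i);
	}
	for (size_t i = len; i-- > 0;) {
		size_t used = (size_t)score[i]; /* still the pass-1 value at this point */
		if (used > 0 && i + used < len) {
			score[i] += score[i + used];
		}
	}
}

int main(void) {
	const uint8_t buf[] = { 'H', 'i', 0x00, '!', '?' };
	double score[sizeof(buf)];
	score_encoding(buf, sizeof(buf), printable_len_ascii, score);
	for (size_t i = 0; i < sizeof(buf); i++) {
		printf("offset %zu: %.1f\n", i, score[i]); /* prints 2.0 1.0 0.0 2.0 1.0 */
	}
	return 0;
}
```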
Step by step we do:
Example
Step I (calculate code points at each offset):
Step II (accumulate backwards):
Summing up always takes the current cell score at i plus the score of the i + score cell. So for UTF-8 we do:
The resulting table is:
Now we can query this table at different offsets and get the string encoding with the highest score.
At offset 0 both UTF-16 and UTF-8 have the same score. But before we query this table, we apply a bias to each score.
Let's say we know the binary is from Windows (and we suspect the strings are encoded in UTF-16).
If we query at offset 0, to get the most likely encoding, we have two possibilities (UTF-8 and UTF-16be).
But because we assume UTF-16 is more likely, we multiply the UTF-16be score by a bias (6 * 1.2 = 7.2) and now UTF-16be is the most likely one.
We can apply these biases for everything.
E.g. string is too short? => If the initial score is < 4.0 (it uses fewer than 4 bytes) we apply a multiplier of 0.0 (ignoring it).
But also for address alignment and whatever else comes to mind (e.g. code points should be < 0x7f for ASCII-only search).
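Query-time biasing then boils down to a multiplier per candidate; a rough sketch (hypothetical helper, not the actual code) covering the examples above:

```c
#include <stddef.h>

/* Hypothetical query-time bias (illustrative only): combine the raw accumulated
 * score with the user's preferences before picking the most likely encoding. */
static double biased_score(double raw, double encoding_bias,
		size_t offset, size_t required_align, double min_score) {
	if (raw < min_score) {
		return 0.0; /* string too short: multiplier of 0.0 ignores it */
	}
	if (required_align && (offset % required_align) != 0) {
		return 0.0; /* does not start at the required alignment */
	}
	/* e.g. UTF-16be: 6 * 1.2 = 7.2, which now beats UTF-8's unbiased 6 */
	return raw * encoding_bias;
}
```

Picking the maximum over all encodings then gives the "most likely" string, while every other non-zero candidate stays available for a guess list.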
In practice we discover a lot more possible string encodings.
The clearest cases are in test_str_search.c. Previously, a lot of the valid encodings were not detected, because the detection stopped too early, and because the tests made an implicit assumption about which encoding to expect (namely code points of the Latin alphabet, in the ASCII range).
E.g. 0xff bytes were assumed not to be a valid part of UTF-16 code points, but they are.
Mostly, characters with ff in the code point were not discovered in those test cases (0xff = ÿ).
Another example:
The byte sequence 0x49,0x00,0x20,0x00,0x61,0x00,0x6d,0x00 is valid for UTF-16LE and BE.
Once it decodes to I am (LE: U+0049 U+0020 U+0061 U+006D) and once to a different but still printable sequence (BE: U+4900 U+2000 U+6100 U+6D00). And so forth. But only one of them was expected in the tests.
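The ambiguity is easy to reproduce outside rizin (plain C, just re-interpreting the byte pairs):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
	const uint8_t buf[] = { 0x49, 0x00, 0x20, 0x00, 0x61, 0x00, 0x6d, 0x00 };
	for (size_t i = 0; i + 1 < sizeof(buf); i += 2) {
		uint16_t le = (uint16_t)(buf[i] | (buf[i + 1] << 8)); /* U+0049 U+0020 U+0061 U+006D = "I am" */
		uint16_t be = (uint16_t)((buf[i] << 8) | buf[i + 1]); /* U+4900 U+2000 U+6100 U+6D00, all printable */
		printf("LE U+%04X  BE U+%04X\n", le, be);
	}
	return 0;
}
```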
Also check out the cmd_ps test file for an example listing all possible strings at an offset.
Places for improvement (before merge?)
Get rid of the unrolled loops.
Limit the default search to only UTF-X. Set a strong bias for endianness.
Make ps list alternatives with guess-list.
Later improvements
Different search bias profiles.
Set a sensible selection of encodings to search based on binary information.
Implement statistical analysis on the code points to detect language.
Let users choose what is "printable" and what is not. E.g. '\n' is a Unicode control character and therefore not printable, but for our use case (search) it is. Maybe exclude only undefined Unicode code points? (See the sketch below.)
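A user-tunable printability check could look something like this sketch (hypothetical options struct, not what the branch currently does):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, user-configurable printability for string search:
 * '\n' and '\t' are control characters, but for searching they are often useful. */
typedef struct {
	bool allow_newline_tab;
	bool ascii_only;
} PrintableOpts;

static bool is_search_printable(uint32_t cp, const PrintableOpts *opts) {
	if (opts->allow_newline_tab && (cp == '\n' || cp == '\t')) {
		return true;
	}
	if (cp < 0x20 || cp == 0x7f) {
		return false; /* remaining C0 controls and DEL */
	}
	if (opts->ascii_only && cp > 0x7e) {
		return false;
	}
	/* a full implementation would also reject unassigned and other control code points */
	return true;
}
```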
Describe alternatives you've considered
A decent guess about the encoding should already be made based on the known binary information (Windows = UTF-16LE etc.).
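For instance (hypothetical types, not the actual rizin bin API), the default bias profile could be derived from the binary's metadata:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical mapping from binary metadata to a default encoding guess. */
typedef struct {
	const char *os; /* e.g. "windows", "linux" */
	bool big_endian;
} BinInfoSketch;

static const char *default_encoding_guess(const BinInfoSketch *info) {
	if (info->os && !strcmp(info->os, "windows")) {
		return "utf16le"; /* Windows binaries usually carry UTF-16LE strings */
	}
	if (info->big_endian) {
		return "utf16be"; /* prefer the big-endian variants on BE targets */
	}
	return "ascii";
}
```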
Additional context
https://github.com/rizinorg/rizin/tree/string-detection