Commit c596280

blog: add patterns and ghidra decompiler guide
Signed-off-by: Shreeyash Pandey <[email protected]>
1 parent afd6469 commit c596280

19 files changed

+1117
-3
lines changed
70.5 KB

docs/_images/zoomed-out-vdb.png

42.2 KB
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,233 @@
1+
Ghidra Decompiler - CLI guide
2+
#############################
3+
4+
`Ghidra <https://ghidra-sre.org/>`_ has a decompiler that, unlike the rest of the
5+
program (written in Java), is written in C++. This caught my attention, so I
6+
started to hack on it. Unfortunately, there isn't much written on the decompiler
7+
if one wants to use it standalone, in the terminal, without the Ghidra GUI. This
8+
article tries to fill that void.
9+
10+
Building The Decompiler
11+
***********************
12+
13+
Fetch and unzip the Ghidra package from `their GitHub releases page
14+
<https://github.com/NationalSecurityAgency/ghidra/releases>`_:
15+
16+
.. code::
17+
18+
$ unzip ghidra_11.1.2_PUBLIC_20240709.zip
19+
20+
`cd` into the decompiler directory and build it:
21+
22+
.. code::
23+
24+
$ cd ghidra_11.1.2_PUBLIC/Ghidra/Features/Decompiler/src/decompile/cpp
25+
$ make decomp_opt -j $(nproc --all)
26+
27+
You should end up with an executable called `decomp_opt`.
28+
29+
Running the Decompiler
30+
**********************
31+
32+
While inside the directory, export the SLEIGHHOME environment variable so that the
33+
decompiler can find the Ghidra installation, then run the executable.
34+
35+
.. code::
36+
37+
$ export SLEIGHHOME=/home/shreeyash/ghidra_11.1.2_PUBLIC
38+
$ ./decomp_opt
39+
[decomp]>
40+
41+
The decompiler is now running, waiting for commands.
42+
43+
.. note::
44+
45+
Remember to always export the environment variable before running decomp_opt.
46+
You could consider tossing the two commands into a script, making life easier
47+
for you.
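The wrapper suggested in the note could be sketched in Python as follows. This is a minimal sketch, not part of Ghidra: the helper names ``decompiler_env`` and ``run_decompiler`` are hypothetical, the build directory is the one used earlier in this guide, and feeding commands through standard input is an assumption about how ``decomp_opt`` reads its commands.

```python
import os
import subprocess
from pathlib import Path


def decompiler_env(ghidra_root, base_env=None):
    """Return a copy of the environment with SLEIGHHOME set to the Ghidra root."""
    env = dict(os.environ if base_env is None else base_env)
    env["SLEIGHHOME"] = str(ghidra_root)
    return env


def run_decompiler(ghidra_root, commands=None):
    """Start decomp_opt with SLEIGHHOME exported.

    If `commands` is given, it is passed on standard input (assumption:
    decomp_opt reads its commands from stdin); otherwise the session
    stays interactive.
    """
    ghidra_root = Path(ghidra_root).expanduser().resolve()
    # Build directory used in the "Building The Decompiler" section.
    decomp_dir = ghidra_root / "Ghidra/Features/Decompiler/src/decompile/cpp"
    return subprocess.run(
        [str(decomp_dir / "decomp_opt")],
        env=decompiler_env(ghidra_root),
        input=commands,
        text=True,
    )
```

Calling ``run_decompiler("/home/shreeyash/ghidra_11.1.2_PUBLIC")`` would then replace the two manual steps above.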
48+
49+
Decompile and view an ELF executable
50+
************************************
51+
52+
Let's start with a trivial C++ program with some control flow, compile it into an
53+
executable (ELF) and decompile it.
54+
55+
Here's the program, save and compile it:
56+
57+
.. code::
58+
59+
$ cat a.cpp
60+
#include <iostream>
61+
#define THRESHOLD 20
62+
int foo() {
63+
return 10;
64+
}
65+
int main() {
66+
int b = foo();
67+
std::cout << "The threshold is " << THRESHOLD << '\n';
68+
std::cout << "You returned " << b << '\n';
69+
if (b < THRESHOLD) {
70+
std::cout << "get in\n";
71+
} else {
72+
std::cout << "get out!\n";
73+
}
74+
}
75+
$ g++ -no-pie a.cpp -o a
76+
$ ./a
77+
The threshold is 20
78+
You returned 10
79+
get in
80+
81+
The executable is ready; what's left now is decompilation.
82+
83+
Let's start the decompiler, and load our file:
84+
85+
.. code::
86+
87+
$ ./decomp_opt
88+
[decomp]> load file a
89+
a successfully loaded: Intel/AMD 64-bit x86
90+
91+
92+
We've loaded our executable into the decompiler. C++ is an abstract language with
93+
constructs that do not mean anything to a CPU. These include, but are not
94+
limited to, functions, structs and loops. To implement these, the
95+
compiler has to translate the abstractions into a concrete implementation, which
96+
manifests itself in the form of control flow instructions like branch, compare,
97+
and jump. If we peep into an executable, we'll notice that what we called functions
98+
are now 'addresses', i.e. numbers that represent locations in memory.
99+
Functions are run by jumping (i.e. setting the program counter) to an address.
100+
So, if we wish to decompile a function we had in the source, we'll have to
101+
find the address at which it resides. `a.cpp` has two functions:
102+
`main` and `foo`. To find the address where a function resides in the
103+
executable, we can use `objdump`.
104+
105+
.. code::
106+
107+
$ objdump -C -D a
108+
...
109+
00000000004011c5 <main>:
110+
4011c5: f3 0f 1e fa endbr64
111+
4011c9: 55 push %rbp
112+
4011ca: 48 89 e5 mov %rsp,%rbp
113+
4011cd: 48 83 ec 10 sub $0x10,%rsp
114+
4011d1: e8 e0 ff ff ff call 4011b6 <_Z3foov>
115+
4011d6: 89 45 fc mov %eax,-0x4(%rbp)
116+
4011d9: 48 8d 05 24 0e 00 00 lea 0xe24(%rip),%rax # 402004 <_IO_stdin_used+0x4>
117+
4011e0: 48 89 c6 mov %rax,%rsi
118+
4011e3: 48 8d 05 96 2e 00 00 lea 0x2e96(%rip),%rax # 404080 <_ZSt4cout@GLIBCXX_3.4>
119+
4011ea: 48 89 c7 mov %rax,%rdi
120+
4011ed: e8 9e fe ff ff call 401090 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt>
121+
4011f2: 48 89 c2 mov %rax,%rdx
122+
4011f5: 8b 45 fc mov -0x4(%rbp),%eax
123+
...
124+
125+
Searching for 'main' reveals its label which resides at address `0x4011c5`.
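Searching the disassembly by eye works for one function, but the lookup can also be scripted. The sketch below assumes label lines shaped like ``00000000004011c5 <main>:`` as in the listing above; ``symbol_addresses`` is a hypothetical helper name, and the text it parses is whatever ``objdump`` printed.

```python
import re

# A label line looks like: 00000000004011c5 <main>:
LABEL = re.compile(r"^([0-9a-f]+) <(.+)>:$")


def symbol_addresses(disassembly):
    """Map each label found in objdump disassembly text to its address."""
    symbols = {}
    for line in disassembly.splitlines():
        match = LABEL.match(line.strip())
        if match:
            symbols[match.group(2)] = int(match.group(1), 16)
    return symbols
```

With the output saved to a string, ``hex(symbol_addresses(text)["main"])`` gives the address to pass to ``load addr``.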
126+
127+
.. code::
128+
129+
[decomp]> load addr 0x4011c5 main
130+
Function main: 0x004011c5
131+
132+
`load addr` takes an address and an optional 'label'. The label is essentially a
133+
name that we assign to that address. In this case it was 'main', but it could've been
134+
anything, for what it's worth.
135+
136+
.. code::
137+
138+
[decomp]> decompile
139+
Decompiling main
140+
Decompilation complete
141+
[decomp]> print C
142+
143+
xunknown8 main(void)
144+
145+
{
146+
int4 iVar1;
147+
xunknown8 xVar2;
148+
149+
iVar1 = func_0x004011b6();
150+
xVar2 = func_0x00401090(0x404080,0x402004);
151+
xVar2 = func_0x004010c0(xVar2,0x14);
152+
func_0x004010a0(xVar2,10);
153+
xVar2 = func_0x00401090(0x404080,0x402016);
154+
xVar2 = func_0x004010c0(xVar2,iVar1);
155+
func_0x004010a0(xVar2,10);
156+
if (iVar1 < 0x14) {
157+
func_0x00401090(0x404080,0x402024);
158+
}
159+
else {
160+
func_0x00401090(0x404080,0x40202c);
161+
}
162+
return 0;
163+
}
164+
[decomp]>
165+
166+
Just like that, we've decompiled our program. Notice how the names are garbled.
167+
This is because names (of variables and functions) aren't really necessary to
168+
execute a program.
169+
170+
Let's analyze the decompiled output. The latter part of each function name is
171+
its address, which means we can look it up in the `objdump` output. Moreover,
172+
if the set of commands that got us `main`'s decompilation were to be repeated
173+
for every function present in the output, the resulting decompilation
174+
of `main` would replace each address with the label we assigned to it. Looking
175+
up `func_0x004011b6` in the `objdump` output, we find it to be `foo`:
176+
177+
.. code::
178+
179+
...
180+
00000000004011b6 <foo()>:
181+
4011b6: f3 0f 1e fa endbr64
182+
4011ba: 55 push %rbp
183+
4011bb: 48 89 e5 mov %rsp,%rbp
184+
4011be: b8 0a 00 00 00 mov $0xa,%eax
185+
...
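Repeating ``load addr`` for every callee gets tedious, so the addresses embedded in the ``func_0x...`` names can be collected mechanically. The following is a rough sketch: ``called_addresses`` and ``load_commands`` are hypothetical helpers, and the name table has to be filled in by hand (or from an ``objdump`` lookup). It only generates the ``load addr`` lines described above and does not talk to the decompiler itself.

```python
import re

# Placeholder names in the decompiler output embed the callee address.
CALL_NAME = re.compile(r"\bfunc_0x([0-9a-f]+)\b")


def called_addresses(decompiled_c):
    """Return the unique func_0x... addresses, in order of first appearance."""
    seen = []
    for match in CALL_NAME.finditer(decompiled_c):
        address = int(match.group(1), 16)
        if address not in seen:
            seen.append(address)
    return seen


def load_commands(addresses, names):
    """Build one `load addr` line per address; the label is optional."""
    lines = []
    for address in addresses:
        label = names.get(address)
        if label:
            lines.append(f"load addr {address:#x} {label}")
        else:
            lines.append(f"load addr {address:#x}")
    return lines
```

For the output above, ``called_addresses`` yields ``0x4011b6``, ``0x401090``, ``0x4010c0`` and ``0x4010a0``; pairing ``0x4011b6`` with the name ``foo`` produces ``load addr 0x4011b6 foo``.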
186+
187+
`func_0x00401090` is not defined in the executable; however, calls to this
188+
function show up in the objdump output like so:
189+
190+
.. code::
191+
192+
4011ed: e8 9e fe ff ff call 401090 <std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)@plt>
193+
194+
It's quite obvious from the hint that `func_0x00401090` is the operator `<<`
195+
overloaded to accept a `std::basic_ostream` object and a `const char *`. The
196+
`@plt` at the end indicates that this function is reached through the `.plt`
197+
section of the executable. `.plt`, which stands for Procedure Linkage Table,
198+
is a redirection table for external functions that live in shared
199+
objects. So, `func_0x00401090` is the `operator<<` found in `libstdc++.so`, which
200+
the program is linked against. It takes two arguments, both addresses of
201+
objects. A search reveals that the first argument is the object `std::cout`,
202+
whose definition resides in an external library (`libstdc++.so`), and
203+
the other argument is a string literal that can be found in the `.rodata`
204+
section of the executable.
205+
206+
.. code::
207+
208+
$ objdump -s -j .rodata a
209+
Contents of section .rodata:
210+
402000 01000200 54686520 74687265 73686f6c ....The threshol
211+
402010 64206973 2000596f 75207265 7475726e d is .You return
212+
402020 65642000 67657420 696e0a00 67657420 ed .get in..get
213+
402030 6f757421 0a00 out!..
214+
215+
Indeed, the string `"The threshold is "` is present at address `0x0402004`.
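The same lookup works for the other string arguments in the decompiled output. Here is a small sketch that turns the hex dump above into bytes and reads the NUL-terminated string at a given address. ``parse_hex_dump`` and ``read_cstring`` are hypothetical helpers, and the parser is only a heuristic for lines shaped like the dump shown here.

```python
import re

# One hex column of the dump: one to four bytes written as hex digits.
HEX_GROUP = re.compile(r"^(?:[0-9a-f]{2}){1,4}$")


def parse_hex_dump(dump):
    """Map address -> byte value for lines shaped like `objdump -s` output."""
    memory = {}
    for line in dump.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        try:
            address = int(tokens[0], 16)
        except ValueError:
            continue  # e.g. the "Contents of section .rodata:" header
        # Up to four hex groups follow the address; the ASCII column comes
        # after them. Heuristic: on a short last row, ASCII text that happens
        # to look like hex would be misread as data.
        for token in tokens[1:5]:
            if not HEX_GROUP.match(token):
                break
            for byte in bytes.fromhex(token):
                memory[address] = byte
                address += 1
    return memory


def read_cstring(memory, address):
    """Read a NUL-terminated string starting at `address`."""
    chars = []
    while memory.get(address, 0) != 0:
        chars.append(chr(memory[address]))
        address += 1
    return "".join(chars)


RODATA = """\
Contents of section .rodata:
 402000 01000200 54686520 74687265 73686f6c  ....The threshol
 402010 64206973 2000596f 75207265 7475726e  d is .You return
 402020 65642000 67657420 696e0a00 67657420  ed .get in..get
 402030 6f757421 0a00                        out!..
"""
```

Reading at ``0x402016``, ``0x402024`` and ``0x40202c`` (the other addresses passed to ``func_0x00401090``) recovers the remaining literals from ``a.cpp``.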
216+
217+
Likewise, the remaining functions up to `func_0x004010a0` are overloads of
218+
`operator<<` that handle different types of data. What remains is the control
219+
flow. It checks whether `iVar1`, which is `b` in the original source, is less than
220+
`0x14` (`THRESHOLD`, i.e. 20) and calls the familiar `func_0x00401090`
221+
(`operator<<`) with the matching string.
222+
223+
Conclusion
224+
**********
225+
226+
Our work was made much easier by the fact that the executable was not
227+
'stripped'. Stripping is a process that gets rid of all the symbols that are
228+
not absolutely necessary for execution (greatly reducing executable size). In
229+
the real world, especially if we are dealing with proprietary software,
230+
executables might be stripped. Unstripped executables allow us to move
231+
faster by simply searching for symbols, like we did to find `main`. Stripped
232+
executables require us to trace, find and deduce what we need. In a later
233+
article, I may demo decompilation of stripped executables.

docs/_sources/blog/index.rst.txt

+3
@@ -4,4 +4,7 @@ Blog
44
.. toctree::
55
:titlesonly:
66

7+
8+
When Reverse Engineering, Your Pattern Seeking Brain Is Your Friend <pattern_seeking_brain>
9+
Ghidra Decompiler - Standalone CLI Guide <ghidra_decompiler_cli_guide>
710
How to remove a vertex from a boost graph? <boost_graphs_remove_vertex>
@@ -0,0 +1,78 @@
1+
When Reverse Engineering, Your Pattern Seeking Brain Is Your Friend
2+
###################################################################
3+
4+
At work, I've been working on reverse engineering a proprietary file format that
5+
is used to represent a synthesized `netlist
6+
<https://en.wikipedia.org/wiki/Netlist>`_ for FPGAs by our vendor's EDA tools.
7+
8+
It's a binary file, and here's a sample of the hexdump:
9+
10+
.. code::
11+
12+
00000000 12 6c 1f 03 0b 00 00 a0 00 00 00 00 40 e4 7a 1f |.l..........@.z.|
13+
00000010 03 08 b6 84 a3 66 00 00 00 00 01 00 20 20 00 02 |.....f......  ..|
14+
00000020 40 31 00 ab 00 0e 43 45 4e 4e 41 48 45 68 65 46 |@1....CENNAHEheF|
15+
00000030 62 66 4b 49 03 8f 8d a3 03 8f 8d a3 02 00 08 aa |bfKI............|
16+
00000040 ba b9 9e 40 b6 9b 99 00 00 00 00 00 00 00 00 01 |...@............|
17+
00000050 0a 49 4d 41 4f 40 4e 4e 49 48 82 08 aa ba b9 9e |.IMAO@NNIH......|
18+
00000060 40 b6 9b 99 00 40 31 00 ab 00 00 b6 84 a3 66 00 |@....@1.......f.|
19+
00000070 00 00 00 ff 00 00 00 00 00 00 00 00 03 8f 8d a3 |................|
20+
00000080 00 00 00 00 00 00 00 00 00 00 40 31 00 ab 00 00 |..........@1....|
21+
00000090 b6 84 a3 66 00 00 00 00 ff 06 00 14 00 05 00 01 |...f............|
22+
23+
Without any information about the file, this looks like a wall of random bytes.
24+
Although complex, there's a lot that can be deduced by looking for patterns.
25+
File formats are often divided into sections. The bytes may look random, but in
26+
reality, they ought to be very structured. The first step in dealing with this
27+
is to **extract the structure**.
28+
29+
One trick I use is to zoom out on the hexdump. This visually separates the zeros from all other bytes.
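If zooming out in a viewer is not an option, a similar effect can be approximated in text. The sketch below is only an illustration of the idea, not the tool used for the image in this post: ``zero_map`` is a hypothetical helper that prints one character per byte, ``.`` for a zero byte and ``#`` for anything else.

```python
def zero_map(data, width=64):
    """Render bytes as rows of '.' (zero byte) and '#' (non-zero byte)."""
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        rows.append("".join("." if byte == 0 else "#" for byte in chunk))
    return rows
```

Printing ``"\n".join(zero_map(data, width=128))`` for the contents of the file, read in binary mode, gives a coarse picture in which long zero stretches stand out as runs of dots.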
30+
31+
Here's an image of a hexdump of the same file, zoomed out:
32+
33+
.. image:: /_static/zoomed-out-vdb.png
34+
:align: center
35+
36+
Do you notice any patterns?
37+
38+
There are alternating strips of dark and light patterns. The light patterns are
39+
just zeros and the darker ones appear to be 'data'. Here's the same image with
40+
the patterns highlighted. White rectangles mark the dark parts and green ones mark the
41+
zeros.
42+
43+
.. image:: /_static/zoomed-out-vdb-highlighted.png
44+
:align: center
45+
46+
Since this repeats, it's likely encoding the same 'type' of information. The pattern
47+
starts with a dark section and ends with a light section. They
48+
always come in pairs. So we can deduce that representing one item of this type of
49+
data takes a dark part followed by a light part.
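The dark/light pairing can also be checked programmatically instead of by eye. The following sketch splits a byte string into alternating runs and pairs each data run with the zero run that follows it. The helper names and the ``min_zero_run`` threshold are assumptions for illustration; the right threshold depends on the file, and this is not the fuzzer mentioned later.

```python
from itertools import groupby


def byte_runs(data, min_zero_run=4):
    """Split data into ('data' | 'zeros', start, length) runs.

    Zero stretches shorter than `min_zero_run` are treated as part of the
    surrounding data, since real data contains the occasional zero byte.
    """
    runs = []
    offset = 0
    for is_zero, group in groupby(data, key=lambda byte: byte == 0):
        length = sum(1 for _ in group)
        kind = "zeros" if is_zero and length >= min_zero_run else "data"
        if runs and runs[-1][0] == kind:
            # Merge with the previous run of the same kind.
            _, prev_start, prev_length = runs[-1]
            runs[-1] = (kind, prev_start, prev_length + length)
        else:
            runs.append((kind, offset, length))
        offset += length
    return runs


def paired_records(runs):
    """Pair every data run with the zero run that immediately follows it."""
    return [
        (first, second)
        for first, second in zip(runs, runs[1:])
        if first[0] == "data" and second[0] == "zeros"
    ]
```

The number of pairs returned by ``paired_records`` is the count of repetitions of the pattern, which can then be compared against other counts known about the file.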
50+
51+
Now, what could this be?
52+
53+
This is a question that falls into the 'content' part. What we did above was the
54+
'structure' part. As it turns out, getting meaning out of this is much more
55+
tedious.
56+
57+
A `fuzzer <https://en.wikipedia.org/wiki/Fuzzing>`_ is the right tool for this
58+
job. As fuzzers tend to be very special purpose, I wrote one for myself.
59+
Extracting the details with the fuzzer confirms our suspicion. The dark and
60+
light parts are indeed part of the structure. The dark part is sort of a
61+
preamble to the light part. The light part is a port reference list for one of
62+
the black-box modules present in the netlist.
63+
64+
That this pattern represents black-box modules can be deduced by counting the
65+
number of times the pattern occurs and checking what else occurs just as
66+
many times in the original source file from which this was generated. Inspecting
67+
the source, which is just a Verilog file, confirms that these are indeed the
68+
black-box modules.
69+
70+
Conclusion
71+
**********
72+
73+
In conclusion, file formats or any other type of data that are supposed to be
74+
regular can be brute-forced by our pattern-seeking brains to reveal their
75+
structures.
76+
77+
PS: I'll write a full description of the fuzzer and document other details as
78+
this project progresses.

docs/_sources/links.rst.txt

+2
@@ -37,6 +37,8 @@ Programming/Hacking
3737
- `Continuations for Curmudgeons
3838
<https://intertwingly.net/blog/2005/04/13/Continuations-for-Curmudgeons>`__
3939
- `Video Lan (VLC) Hacker's Guide <https://wiki.videolan.org/Hacker_Guide/Audio_Filters/>`__
40+
- `Fast IO on Unixes by way of the 'yes' command
41+
<https://www.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/>`__
4042

4143
Books
4244
-----
70.5 KB

docs/_static/zoomed-out-vdb.png

42.2 KB