Intro to Reverse Engineering

Henceforth RE == Reverse Engineering, because I simply refuse to write it long form over and over again.

Why does this site exist?

There is a wealth of resources online which can teach reverse engineering, so why does this site deserve to exist? I'm fairly confident that this site provides a novel method of teaching the subject, because it approaches the problem backwards. I first provide readers with a small C application which demonstrates a topic (loops, if statements, function calls) and then I reverse engineer the application to demonstrate and reinforce to readers how the ASM ties back to the original C code.

Teaching ASM and reverse engineering in this way means that readers are only ever confronted with small and relevant ASM samples, which takes away a lot of the overwhelm that RE beginners typically feel. This site will teach you ASM fundamentals in small spoonfuls, which easily and obviously map back to the original provided C code.

What's reverse engineering?

When application source code is compiled into an executable program, the source code is not included in the executable. Reverse engineering is the methodical process of trying to understand the application's functionality by analyzing the executable's ASM instructions.

What's ASM, and what's x64?

This course will serve as a foundational introduction to writing and reading 64 bit (AKA x64) Intel Assembly (AKA ASM) code, which will in turn give you the tools to go about reverse engineering 64 bit applications (and x86 / 32 bit applications by extension 😊).

Learning to reverse engineer applications opens up a wealth of possibilities for you. You’ll gain the ability to –

With literally just a few basic nuggets of information, you'll be able to understand the inner workings of so many applications.

The final point which I'd like to raise about assembly and RE is that, once you know the fundamentals, you can modify an executable (binary) in any way that you please. This is exactly how "micropatching" is done: when a developer loses the source code for one of their binaries but wishes to make a modification (for security reasons, for example), they simply modify the target binary's assembly.

What does 64 bit mean and why does it matter?

A 64 bit CPU is capable of working with 64 bit wide values, and because of this is capable of working with extremely large numbers. For example, a 64 bit CPU can count from either -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (signed integer, from negative to positive) or from 0 to 18,446,744,073,709,551,615 (unsigned integer, from 0 to max value). These values are important, as the two maximums map to 2^63 - 1 and 2^64 - 1 respectively.

Essentially, 64 bit means that the CPU is capable of addressing a value which is 8 bytes wide. This huge increase in the addressable range is the reason why 64 bit CPUs can, in theory, address 16 exabytes of RAM, whereas 32 bit CPUs can only address 4 gigabytes of RAM maximum (the maximum unsigned 32 bit integer is 4,294,967,295, or 2^32 - 1).

Memory addresses in 64 bit computing are 8 bytes wide, whereas in 32 bit computing they are 4 bytes wide. For example, an address in 64 bit computing might look like 0x4142434445464748, whereas in 32 bit computing it might look like 0x41424344. 64 bit addresses are double the width of 32 bit addresses. This is an important distinction, and a dead giveaway when looking at a memory address.

What's assembly?

Basically assembly, or more formally Intel's ASM, is a programming language made up of a large set of mnemonics. These mnemonics are assembled into opcodes, which are the raw instructions that your computer's processor (CPU) reads in order to execute an application.

I realise that this is a confusing, scary statement but it's actually very simple. We frequently discuss how computers talk in "ones and zeroes", which is true at the very, very lowest level. One layer above that though, your CPU is capable of reading a sequence of opcodes (operation codes) from RAM which give the CPU instructions. These opcodes are raw bytes, which we normally write down in hexadecimal rather than as ones and zeroes.

For example, telling the computer to move the value 0 into a register (eax in this case) looks like the following in assembly -

mov eax, 0x0

That's the mnemonic for this particular opcode -

b8 00 00 00 00

Then the assembly instructions (the mnemonics) are assembled using an assembler into the opcodes listed above.

A disassembler converts those opcodes back into human readable mnemonics again.

I realise that this is a little scary for a budding RE professional, but we're going to see some simple examples soon which explain things even further.

A key takeaway here should be that you don't need to understand opcodes at all, and you don't need to be able to read swathes of ones and zeroes to be a good reverse engineer. That's just nonsense perpetuated by the movies. If you can read mnemonics like mov eax, 0x0 then you can reverse engineer, easy as that.

How is ASM created?

This is the key question.

You typically write your programs in a high level language such as C, C++, Go, Rust etc.

Programs written using these languages are compiled into an executable binary file, such as firefox, google-chrome, apt-get, cat.

This compilation process takes high-level-language and essentially reduces it down to ASM / opcodes (low-level-language) which the CPU can execute.

When you compile a C program with GCC, four stages happen internally: preprocessing, compilation into ASM, assembly into an object file, and linking into the final executable.

This link explains the above 4 stages in much more detail, I highly recommend taking a look.

This is the typical process for generating ASM. It's also possible to manually write, assemble and link your own program using the ASM language. We're going to get into this in a couple of lessons' time.

A practical example

Let's generate some assembly. We're not going to understand what it all means yet, but it will highlight how assembly is made.

Write a basic C program on any 64 bit Linux [virtual] machine and call it 0xff-hello-world.c, like the following -

#include <stdio.h>

int main(int argc, char** argv){
    printf("learnreverseengineering.com is the best!");
}

Nothing fancy, it just prints a factual statement out to the terminal.

Next up, run the following command (replace 0xff-hello-world.c with whatever you named your file, obviously) -

gcc -save-temps -masm=intel 0xff-hello-world.c -o 0xff-hello-world

This command should generate four new files -

The file without any file extension is the final assembled binary. You can execute this to see learnreverseengineering.com is the best!

The .i file is the pre-processed file. It's essentially the same as the .c file but with added info to help the compiler. Enter cat 0xff-hello-world.i to see how this looks.

The .o file is the object file. Every 'unit' (file) which GCC compiles gets a .o file, and these are eventually linked together to form the final executable. This is essentially an unlinked binary.

The .s file is the compiled set of ASM mnemonics. The contents of this file are -

	.file	"0xff-hello-world.c"
	.intel_syntax noprefix
	.text
	.section	.rodata
.LC0:
	.string	"learnreverseengineering.com is the best!"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	mov	rbp, rsp
	.cfi_def_cfa_register 6
	sub	rsp, 16
	mov	DWORD PTR -4[rbp], edi
	mov	QWORD PTR -16[rbp], rsi
	lea	rdi, .LC0[rip]
	mov	eax, 0
	call	printf@PLT
	mov	eax, 0
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
	.section	.note.GNU-stack,"",@progbits

The above is x64 ASM language, in all of its terrifying and verbose glory. The vast majority of this code is honestly irrelevant, and we'll cover why shortly. Only this little section is relevant to our interests -

lea	rdi, .LC0[rip]
mov	eax, 0
call	printf@PLT
mov	eax, 0
leave

These 5 lines tell the CPU to Load the Effective Address of our "learnreverseengineering.com is the best" string into a register (essentially a variable!), move the number 0 into another register (another variable..!) and call the printf function.

After calling printf, the code leaves the current function, with a return value of 0.

When it's broken down and focused like that it's much simpler, right?

Just to prove that this isn't magic, and I'm not lying, we're going to disassemble (reverse) the executable file we just created to prove that the ASM above is what the application is comprised of.

On a Linux box, run objdump -M intel-mnemonic -d 0xff-hello-world and scroll down until you see the main function.

0000000000001139 <main>:
        1139:       55                      push   rbp
        113a:       48 89 e5                mov    rbp,rsp
        113d:       48 83 ec 10             sub    rsp,0x10
        1141:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
        1144:       48 89 75 f0             mov    QWORD PTR [rbp-0x10],rsi
        1148:       48 8d 05 b9 0e 00 00    lea    rax,[rip+0xeb9]        # 2008 <_IO_stdin_used+0x8>
        114f:       48 89 c7                mov    rdi,rax
        1152:       b8 00 00 00 00          mov    eax,0x0
        1157:       e8 d4 fe ff ff          call   1030 <printf@plt>
        115c:       b8 00 00 00 00          mov    eax,0x0
        1161:       c9                      leave  
        1162:       c3                      ret    
    

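Notice that the assembly closely matches what we printed above? The one visible difference is that this build loads the string's address into rax and then copies it into rdi, rather than loading it straight into rdi, most likely because this binary was built with a different GCC version than the .s listing. This stuff isn't magic.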

Also notice those weird characters to the left of our mnemonics? Those are the opcodes which the CPU executes. We can prove that by running xxd on the file and scrolling down to offset 0x0000000000001139 (where the main function starts, displayed at the top of the objdump output above)

xxd 0xff-hello-world

        00001130: f30f 1efa e977 ffff ff55 4889 e548 83ec  .....w...UH..H..
        00001140: 1089 7dfc 4889 75f0 488d 05b9 0e00 0048  ..}.H.u.H......H
        00001150: 89c7 b800 0000 00e8 d4fe ffff b800 0000  ................
        00001160: 00c9 c366 2e0f 1f84 0000 0000 000f 1f00  ...f............
        00001170: 4157 4c8d 3d6f 2c00 0041 5649 89d6 4155  AWL.=o,..AVI..AU
        00001180: 4989 f541 5441 89fc 5548 8d2d 602c 0000  I..ATA..UH.-`,..
        00001190: 534c 29fd 4883 ec08 e863 feff ff48 c1fd  SL).H....c...H..
        

See, as mentioned above, starting at offset 0x1139 (that's 00001130+9 in the first row) we can see the opcodes listed above! 55, 48 89 e5, 48 83 ec 10, etc.

Again, this stuff isn't magic by any means.

Closing thoughts

If you've stuck with it this far then you're probably going to have a good time with the rest of the material. If you've struggled to keep up then that's OK too - we're going to ramp up slowly and I'm always available to give pointers (contact details in footer).

This stuff is difficult in the beginning. There's no getting around it. Some very smart people thought up the beautiful, elegant and efficient assembly language many years ago, and there's definitely a learning curve involved when trying to learn to read AND write it.

Don't panic if you didn't understand this introduction. The next 5 or so lessons will start at the very, very basics of the assembly language and slowly ramp up to reverse engineering more complicated (and exciting) programs. It's OK to not understand everything yet.

Knowledge of this topic is important in my biased opinion, and will make you a more valuable InfoSec professional going forward. Keep at it, if you're confused then you can email me at oliver.brooks@ learnreverseengineering.com or DM me on the infosec.exchange Mastodon at @computerproblemhaver. You've got this.

Thanks for reading.