Show HN: Assembly to C code Decompiler

emcq · on May 27, 2016

Just this week I was exploring C decompilers and stumbled upon the open source Snowman[0], which worked well for my purposes and can run in a self-contained mode with a dependency on Qt5.

[0] https://github.com/yegord/snowman

ultramancool · on May 27, 2016

Interesting, Snowman is also integrated into x64dbg.

openasocket · on May 27, 2016

How do decompilers work in general? I'm imagining the normal compiler pipeline in reverse: convert the machine code into some intermediate representation, add some 'de-optimization' passes to make the control flow more clear, then a back end which converts that into a C AST, which is then printed out into valid C code.

T-R · on May 27, 2016

That's more or less it. Some things you can do:

- Dataflow analysis, to get an idea of the types/range of values you're working with

- Pattern matching to try to identify higher level constructs from math to loops to jump tables.

- Compiler-specific pattern matching to identify things like function entry points

- Signature pattern matching common functions (like C standard library calls), along the lines of what's used in High Level Emulation.
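To make the pattern-matching bullet concrete, here is a minimal Python sketch of idiom recognition. It assumes a toy instruction format of `(op, register, operand)` tuples and invented helper names (`fold_shifts`, `to_c`). None of this comes from a real decompiler, and it ignores flags and overflow entirely.

```python
def fold_shifts(instrs):
    """Rewrite left shifts as multiplies, merging runs on the same register."""
    out = []
    for op, reg, n in instrs:
        if op == "shl" and out and out[-1][0] == "mul" and out[-1][1] == reg:
            # Extend the multiply already emitted for this register.
            out[-1] = ("mul", reg, out[-1][2] * (1 << n))
        elif op == "shl":
            # A left shift by n is a multiply by 2**n (flags ignored here).
            out.append(("mul", reg, 1 << n))
        else:
            out.append((op, reg, n))
    return out


def to_c(instrs):
    """Print the toy instructions as C statements."""
    fmt = {"mul": "{} *= {};", "add": "{} += {};", "mov": "{} = {};"}
    return [fmt[op].format(reg, n) for op, reg, n in instrs]


program = [("mov", "a", 3), ("shl", "a", 1), ("shl", "a", 1), ("add", "a", 1)]
lines = to_c(fold_shifts(program))
```

Here the two single-bit shifts come out as one `a *= 4;` statement rather than two shift statements.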

cosinetau · on May 27, 2016

My uni runs a systems class that does the SIC/XE architecture linker/loader in forward and reverse depending on the time of year.

From what I know from classmates who made the decompiler, your post is the gist of it.

zzzcpan · on May 27, 2016

I don't think this can get you to meaningful code, just assembly-looking C. It would require some sort of machine learning to guess code fragments with proper variable names, properly nested loops, etc.

openasocket · on May 27, 2016

Variable names are almost certainly a lost cause, but IDA Pro is pretty good at recovering control flow, which could be turned into C constructs (didn't Dijkstra prove something about expressing any control flow using structured programming constructs?). You'd probably need a bunch of heuristics to tell the difference between a while loop and a for loop. More ambitious would be trying to un-inline functions, which would require liberal use of the de-optimizer and recognizing common sub-structures.
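As a rough illustration of the control-flow recovery step, the sketch below finds loop back edges in a control-flow graph with a depth-first search. The graph, the block names, and `find_back_edges` are all made up for the example, and this is one textbook approach, not a description of what IDA Pro does internally.

```python
def find_back_edges(cfg, entry):
    """Return edges (src, dst) whose dst is still on the current DFS path."""
    back_edges = []
    state = {}  # node -> "active" (on the DFS path) or "done"

    def visit(node):
        state[node] = "active"
        for succ in cfg.get(node, []):
            if state.get(succ) == "active":
                # Jumping back to a block we are still inside: a loop.
                back_edges.append((node, succ))
            elif succ not in state:
                visit(succ)
        state[node] = "done"

    visit(entry)
    return back_edges


# Hypothetical basic blocks: "head" tests a condition, "body" jumps back.
cfg = {
    "entry": ["head"],
    "head": ["body", "exit"],
    "body": ["head"],
    "exit": [],
}
loops = find_back_edges(cfg, "entry")
```

The target of each back edge is a loop header. A header that tests its condition before running the body, as `head` does here, is a natural fit for a `while`. Choosing between `while` and `for` would still need heuristics on top of this.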

bluetomcat · on May 27, 2016

And with any non-trivial amount of optimization, it would lead to horrid-looking HLL code. Compilation is a very "lossy" process when optimizations are turned on.

bluetomcat · on May 27, 2016

And if the symbols have been stripped, good luck with generating meaningful identifiers.

umanwizard · on May 27, 2016

Anyone know how this compares to the industry-standard Hex-Rays Decompiler? (Sold as an add-on to IDA Pro)

CraigJPerry · on May 27, 2016

Can you actually buy it though? I have it in the back of my mind that they restrict sales of IDA Pro and the decompiler as an anti-piracy measure.

slrz · on May 27, 2016

What do they think the people who want this are going to do if they can't buy it legally? I can't imagine it didn't cross their mind that those folks are likely to just...pirate it. Awesome "anti-piracy" measure.

ultramancool · on May 27, 2016

There's some level of restriction, but copies still leak fairly often. Usually you're a few versions behind running pirated copies, rarely more than that.

Many people are, however, used to working with its output, not because they purchased copies but because they pirated them.

khedoros · on May 27, 2016

They certainly don't have a process where you can just "add to cart" on their website and hand them a couple thousand dollars. I'm not sure what extra checks they do through their sales team; maybe they're just trying to make sure that large businesses aren't buying up single-dev "Named" licenses and installing it across their organization, or maybe they're trying to verify that buyers are known security researchers, or something.

cynix · on May 28, 2016

Umm, I bought my license by doing exactly that — add to cart and fork over a few thousand dollars. I don't remember them doing any verification other than requiring a business email address.

ryanlol · on May 27, 2016

Pirated IDA+hex-rays is the industry standard.

umanwizard · on May 27, 2016

My employer has a copy. I'm confident they acquired it legally.

zxv · on May 27, 2016

Closed source, Windows only, disassembles 6502 only, and expires next month. No thanks.

zandorg · on May 27, 2016

I don't like to reply to my own posts, but here goes.

It's closed source just at the moment. HexRays is closed too.

It can run on Linux or MacOSX as CLisp runs on those platforms. I just haven't started work on porting to new platforms.

It decompiles 6502 as a proof of concept. It can decompile other CPUs, but not fully.

The expiration is temporary. I intend to eliminate this.

Thanks for your comments, it's great to get feedback.

Just another thing: Check out the 'Samples' page to see what it can do with different CPUs.

dang · on May 27, 2016

> I don't like to reply to my own posts

Please do! That kind of discussion is a big part of HN.

(Since this is your own work, we added "Show HN" to the title.)

khedoros · on May 27, 2016

I have a few technical questions, as well as a few about your intentions with the software.

How does it handle interrupt calls to the OS? It's not an issue for Windows (because it's all done through library calls, right?), but what about DOS int 21h and Linux int 80h, for example?

With the x86 work, is the logic all built around protected mode? I've been using IDA to examine/document the assembly of a DOS game, so I'd be interested in the behavior if it's fed real mode code. Further (and tying in to my previous question), the game uses Borland overlays through int3f (it seeks in the binary itself and loads new sections of code into memory, while running, before jumping into the newly-retrieved code). Would that kind of thing be possible to handle automatically? IDA seems to be hard-coded to look for the offset+length tables that are used, and finds the function entry points that way.

More on the business side, you've got a way to request a quote, and the impression I get is that your aim is to run a decompilation business. Where does that leave the software itself? As a proprietary technology that lets you differentiate your business? Or is it your plan to sell the software, release binaries, release code, or some combination? My perspective is that of a hobbyist with a curiosity for reverse engineering and a (strictly non-commercial) project to apply it to, and I'm trying to figure out where this software fits into my world.

zandorg · on May 27, 2016

About interrupt calls... If it finds such a call, all it has to do is look at the call's input registers (assuming they are in registers) and output the call with the 'logical' contents of the registers.
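A minimal sketch of that idea in Python, using DOS int 21h as the example. The service numbers and their input registers are the standard DOS ones, but the C-style names (`dos_print_string` and so on), the `regs` dictionary format, and `render_int21` are invented for illustration and are not how this decompiler is implemented.

```python
# DOS int 21h services keyed by the value in AH, with their input registers.
# The C-like names are hypothetical, chosen only for this sketch.
DOS_SERVICES = {
    0x09: ("dos_print_string", ["ds:dx"]),     # '$'-terminated string
    0x3D: ("dos_open", ["ds:dx", "al"]),       # filename, access mode
    0x3F: ("dos_read", ["bx", "ds:dx", "cx"]), # handle, buffer, byte count
    0x4C: ("dos_exit", ["al"]),                # return code
}


def render_int21(regs):
    """Print an int 21h call using the 'logical' contents of its registers."""
    ah = regs["ah"]
    if ah not in DOS_SERVICES:
        return "int21(0x{:02X}); /* unrecognised service */".format(ah)
    name, arg_regs = DOS_SERVICES[ah]
    # Fall back to the raw register name when its contents are unknown.
    args = ", ".join(str(regs.get(r, r)) for r in arg_regs)
    return "{}({});".format(name, args)


exit_call = render_int21({"ah": 0x4C, "al": 0})
print_call = render_int21({"ah": 0x09, "ds:dx": "greeting"})
```

The `regs` values stand for whatever the earlier dataflow analysis worked out for each register just before the interrupt.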

The x86 code is basically 32-bit Windows. I haven't got a 16-bit compiler. However, it should work for 16-bit. The key thing is that you can write specialised modules for each CPU, and the rest (loops, variables) are standardised.

On the business side... Basically, my 6502 decompiler works the best so I thought I could sell that. And as for x86, there are issues to do with structs & arrays, that I haven't had time to figure out.

As for the decompiler software, I plan to finish the x86 and ARM decompilers and sell them for a reasonable price (say, $150 for all CPUs, not just one). As noted, arrays & structs are a problem.

As for 'release code', anyone can write a CPU module, but I need to document it first.

Thanks for your comments.

mike986 · on May 27, 2016

Can yours be used to reverse engineer SNES games? Feed it the ROM and have it spit out C code?

zandorg · on May 27, 2016

Absolutely, although some 6502 code is more friendly than others. SNES code can be very optimised compared to C64 code.

If you make sure the binary starts at address 0, and put in the right entry point (which will be somewhere in the SNES ROM), then it should decompile it fine.

Your first step will be to use the left-hand portion of the window to disassemble the NES ROM, and then once assembly files have been created, you decompile them on the right of the window. (this is all in the documentation of course).

khedoros · on May 27, 2016

How close is the SNES 65C816(-ish) code to NES 6502(-ish) code supported in the demo version of the program? I understand that they're from the same family, but I'd expect some incompatible extensions in the more-advanced chip, at least.

zandorg · on May 27, 2016

I haven't done any work on 65C816. Just 6502 so far.

It should be possible to make a 'CPU module' for the decompiler which takes into account the features of the 65C816 (such as 16-bit registers).

ratboy666 · on May 28, 2016

How do you deal with self-modifying code? My old FORTRAN compiler generated that for its 8080 target.

milcron · on May 27, 2016

Oh, this is written in Common Lisp? Very interesting!

Frondo · on May 27, 2016

I'd be curious to see what C it produced if you fed it hand-written assembly.

tptacek · on May 27, 2016

Decompilers don't promise to recreate the original C code; they just promise to transform assembly instructions into some kind of C program.

colejohnson66 · on May 28, 2016

Decompilers don't "deoptimize" programs. Compilation is a very lossy process (reused variables, moving variables to other registers, etc.). The best you'll get is some C code that, when compiled, would do the same thing as the original assembly (but probably slower). Interesting idea, nonetheless.

khedoros · on May 27, 2016

The game that I mentioned elsewhere in this thread seems to be a mix of C and assembly (as one would expect from an early-90s game), so I'd be curious about the same thing.

kw71 · on May 28, 2016

Many instructions of machine code can be written as one C statement. There are some cases of things the CPU can do that are not standardized in C, like "arithmetic shift vs. logical shift." Maybe these instructions would be inlined.
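For reference, the shift difference can be shown on 8-bit values with a short Python sketch (the function names are just for illustration). In C, `>>` on an unsigned operand is a logical shift, while right-shifting a negative signed value is implementation-defined, which is why a decompiler can't map both instructions onto one portable operator.

```python
def logical_shift_right(value, count, bits=8):
    """Shift right, filling the vacated high bits with zeros."""
    mask = (1 << bits) - 1
    return (value & mask) >> count


def arithmetic_shift_right(value, count, bits=8):
    """Shift right, filling the vacated high bits with copies of the sign bit."""
    mask = (1 << bits) - 1
    value &= mask
    if value & (1 << (bits - 1)):
        value -= 1 << bits  # reinterpret as a negative two's-complement number
    # Python's >> on a negative int is arithmetic; mask back down to `bits`.
    return (value >> count) & mask


lsr = logical_shift_right(0xF0, 2)
asr = arithmetic_shift_right(0xF0, 2)
```

With the sign bit set, `lsr` is `0x3C` and `asr` is `0xFC`. For inputs with a clear sign bit the two functions agree.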

khedoros · on May 28, 2016

> Many instructions of machine code can be written as one C statement.

I'm quite aware =) Nested loops, some pointer calculations (dealing with real mode pointers), conditionals with more than one condition, and multi-step math statements seem particularly verbose, compared to their higher-level equivalents. If a piece of code gets kind of complicated, I usually hand-decompile it, and it's usually much shorter, even in fairly-naive C code.
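As a small example of the real mode pointer arithmetic, here is the segment:offset calculation in Python. The function name is made up, and the 20-bit wraparound assumes original-8086 behaviour.

```python
def real_mode_linear(segment, offset):
    """Linear address of segment:offset, wrapping at 20 bits (1 MiB)."""
    return ((segment << 4) + offset) & 0xFFFFF


text_buffer = real_mode_linear(0xB800, 0x0000)
alias_a = real_mode_linear(0x1234, 0x0010)
alias_b = real_mode_linear(0x1235, 0x0000)
```

Different segment:offset pairs can name the same byte, as `alias_a` and `alias_b` do, which is part of why the disassembly for pointer math gets so long-winded.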