Just this week I was exploring C decompilers and stumbled upon the open-source Snowman[0], which worked well for my purposes and can run in a self-contained mode with a dependency on Qt5.
How do decompilers work in general? I'm imagining the normal compiler pipeline in reverse: convert the machine code into some intermediate representation, add some 'de-optimization' passes to make the control flow more clear, then a back end which converts that into a C AST, which is then printed out into valid C code.
I don't think this can get you to meaningful code, but rather to assembly-looking C. It would take some sort of machine learning to guess code fragments with proper variable names, properly nested loops, etc.
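To make "assembly-looking C" concrete, here is a hypothetical sketch. It is not real output from Snowman or any other tool, and every name in it is invented. It shows the same routine twice: roughly as a decompiler might print it, and as a person would have written it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decompiler-style output: register-like names,
   gotos instead of loops, raw byte-offset pointer arithmetic. */
int32_t sub_401000(int32_t *a1, int32_t a2)
{
    int32_t eax = 0;
    int32_t ecx = 0;
    assert(a1 != NULL || a2 <= 0);
loc_loop:
    if (ecx >= a2)
        goto loc_done;
    eax += *(int32_t *)((char *)a1 + 4 * ecx);
    ecx += 1;
    goto loc_loop;
loc_done:
    return eax;
}

/* What the original source plausibly looked like (an assumption). */
int32_t sum_array(const int32_t *values, int32_t count)
{
    int32_t total = 0;
    assert(values != NULL || count <= 0);
    for (int32_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Both functions compute the same sum; only the second tells you what the code is for.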
Variable names are almost certainly a lost cause, but IDA Pro is pretty good at recovering control flow, which could be turned into C constructs (didn't Dijkstra prove something about expressing any control flow using structured programming constructs?). You'd probably need a bunch of heuristics to tell the difference between a while loop and a for loop. More ambitious would be trying to un-inline functions, which would require liberal use of the de-optimizer and recognizing common sub-structures.
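A common first step in recovering loops is finding back edges in the control-flow graph. The sketch below is the generic textbook depth-first approach, not a description of how IDA Pro does it; the `Block` and `Edge` types and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 16
#define NO_EDGE (-1)

/* A basic block with at most two successors (fall-through and branch target). */
typedef struct {
    int succ[2];
} Block;

typedef struct {
    int from;
    int to;
} Edge;

static int find_back_edges_from(const Block *cfg, int node,
                                bool *visited, bool *on_stack,
                                Edge *out, int found)
{
    visited[node] = true;
    on_stack[node] = true;
    for (int i = 0; i < 2; i++) {
        int next = cfg[node].succ[i];
        if (next == NO_EDGE)
            continue;
        if (on_stack[next]) {
            /* Edge to a block still on the DFS stack: this closes a loop. */
            out[found].from = node;
            out[found].to = next;
            found++;
        } else if (!visited[next]) {
            found = find_back_edges_from(cfg, next, visited, on_stack, out, found);
        }
    }
    on_stack[node] = false;
    return found;
}

/* Returns the number of back edges reachable from block 0.
   `out` must have room for 2 * n edges. */
int find_back_edges(const Block *cfg, int n, Edge *out)
{
    bool visited[MAX_BLOCKS] = { false };
    bool on_stack[MAX_BLOCKS] = { false };
    assert(n > 0 && n <= MAX_BLOCKS);
    return find_back_edges_from(cfg, 0, visited, on_stack, out, 0);
}
```

The target of a back edge is the loop header. One plausible heuristic from there: a conditional exit at the header suggests a while loop, and one at the tail suggests a do-while. Telling a while from a for is then mostly a matter of spotting an init/step pattern around a single variable.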
And with any non-trivial amount of optimization, it would lead to horrid-looking HLL code. Compilation is a very "lossy" process when optimizations are turned on.
What do they think the people who want this are going to do if they can't buy it legally? I can't imagine it didn't cross their mind that those folks are likely to just...pirate it. Awesome "anti-piracy" measure.
There's some level of restriction, but copies still leak fairly often. Usually you're a few versions behind running pirated copies, rarely more than that.
Many people are, however, used to working with its output, not because they purchased copies but because they pirated them.
They certainly don't have a process where you can just "add to cart" on their website and hand them a couple thousand dollars. I'm not sure what extra checks they do through their sales team; maybe they're just trying to make sure that large businesses aren't buying up single-dev "Named" licenses and installing it across their organization, or maybe they're trying to verify that buyers are known security researchers, or something.
Umm, I bought my license by doing exactly that — add to cart and fork over a few thousand dollars. I don't remember them doing any verification other than requiring a business email address.
I have a few technical questions, as well as a few about your intentions with the software.
How does it handle interrupt calls to the OS? It's not an issue for Windows (because it's all done through library calls, right?), but what about DOS int21 and Linux int80, for example?
With the x86 work, is the logic all built around protected mode? I've been using IDA to examine/document the assembly of a DOS game, so I'd be interested in the behavior if it's fed real mode code. Further (and tying in to my previous question), the game uses Borland overlays through int3f (it seeks in the binary itself and loads new sections of code into memory, while running, before jumping into the newly-retrieved code). Would that kind of thing be possible to handle automatically? IDA seems to be hard-coded to look for the offset+length tables that are used, and finds the function entry points that way.
More on the business side, you've got a way to request a quote, and the impression I get is that your aim is to run a decompilation business. Where does that leave the software itself? As a proprietary technology that lets you differentiate your business? Or is it your plan to sell the software, release binaries, release code, or some combination? My perspective is that of a hobbyist with a curiosity for reverse engineering and a (strictly non-commercial) project to apply it to, and I'm trying to figure out where this software fits into my world.
About interrupt calls... If it finds such a call, all it has to do is look at the call's input registers (assuming they are in registers) and output the call with the 'logical' contents of the registers.
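As a rough illustration of that idea (not the decompiler's actual code; the function names and output format are invented), the service number found in the input register can be mapped to a readable call. The tables cover only a handful of well-known DOS int 21h services (selected by AH) and i386 Linux int 0x80 system calls (selected by EAX).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A few DOS int 21h services, selected by the value in AH. */
const char *dos_int21_name(uint8_t ah)
{
    switch (ah) {
    case 0x09: return "print_string"; /* DS:DX -> '$'-terminated string */
    case 0x3D: return "open_file";
    case 0x3E: return "close_file";
    case 0x3F: return "read_file";
    case 0x4C: return "exit";         /* AL = return code */
    default:   return NULL;           /* not in this toy table */
    }
}

/* A few i386 Linux int 0x80 system calls, selected by the value in EAX. */
const char *linux_int80_name(uint32_t eax)
{
    switch (eax) {
    case 1:  return "exit";
    case 3:  return "read";
    case 4:  return "write";
    case 5:  return "open";
    case 6:  return "close";
    default: return NULL;
    }
}

/* Render an int 21h call as text; returns what snprintf returns.
   Only exit's argument (AL) is decoded in this sketch. */
int render_int21(char *buf, size_t size, uint8_t ah, uint8_t al)
{
    const char *name = dos_int21_name(ah);
    assert(buf != NULL && size > 0);
    if (name == NULL)
        return snprintf(buf, size, "int21(ah=0x%02X)", (unsigned)ah);
    if (ah == 0x4C)
        return snprintf(buf, size, "%s(%u)", name, (unsigned)al);
    return snprintf(buf, size, "%s(...)", name);
}
```

The hard part in practice is the "assuming they are in registers" caveat: the decompiler has to know, by data-flow analysis, what value AH or EAX holds at the point of the interrupt.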
The x86 code is basically 32-bit Windows. I haven't got a 16-bit compiler. However, it should work for 16-bit. The key thing is that you can write specialised modules for each CPU, and the rest (loops, variables) are standardised.
On the business side... Basically, my 6502 decompiler works the best so I thought I could sell that. And as for x86, there are issues with structs & arrays that I haven't had time to figure out.
As for the decompiler software, I plan to finish the x86 and ARM decompilers and sell them for a reasonable price (say, $150 for all CPUs, not just one). As noted, arrays & structs are a problem.
As for 'release code', anyone can write a CPU module, but I need to document it first.
Absolutely, although some 6502 code is more friendly than others. SNES code can be very optimised compared to C64 code.
If you make sure the binary starts at address 0, and put in the right entry point (which will be somewhere in the SNES ROM), then it should decompile it fine.
Your first step will be to use the left-hand portion of the window to disassemble the NES ROM, and then once assembly files have been created, you decompile them on the right of the window. (This is all in the documentation, of course.)
How close is the SNES 65C816(-ish) code to NES 6502(-ish) code supported in the demo version of the program? I understand that they're from the same family, but I'd expect some incompatible extensions in the more-advanced chip, at least.
Decompilers don't "deoptimize" programs. Compilation is a very lossy process (reused variables, moving variables to other registers, etc.). The best you'll get is some C code that, when compiled, would do the same thing as the original assembly (but probably slower). Interesting idea, nonetheless.
The game that I mentioned elsewhere in this thread seems to be a mix of C and assembly (as one would expect from an early-90s game), so I'd be curious about the same thing.
Many instructions of machine code can be written as one C statement. There are some cases of things the CPU can do that are not standardized in C, like "arithmetic shift vs. logical shift." Maybe these instructions would be inlined.
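On the shift example: in C, `>>` on an unsigned value is a logical shift, but `>>` on a negative signed value is implementation-defined, so a decompiler cannot simply emit `x >> n` for an arithmetic shift and call it portable. A minimal sketch of both, with invented function names:

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift right: vacated high bits are filled with zeros. */
uint32_t shift_right_logical(uint32_t value, unsigned count)
{
    assert(count < 32);
    return value >> count;
}

/* Arithmetic shift right: vacated high bits copy the sign bit.
   Avoids `>>` on a negative value by shifting the (non-negative)
   complement and complementing the result again. */
int32_t shift_right_arithmetic(int32_t value, unsigned count)
{
    assert(count < 32);
    if (value >= 0)
        return value >> count;
    return ~(~value >> count);
}
```

Most compilers do turn a signed `>>` into the CPU's arithmetic shift instruction, which is why decompiler output often just uses it anyway.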
> Many instructions of machine code can be written as one C statement.
I'm quite aware =) Nested loops, some pointer calculations (dealing with real mode pointers), conditionals with more than one condition, and multi-step math statements seem particularly verbose, compared to their higher-level equivalents. If a piece of code gets kind of complicated, I usually hand-decompile it, and it's usually much shorter, even in fairly-naive C code.
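For anyone who hasn't dealt with real mode pointers, this is the calculation that tends to show up spelled out in assembly: a 16-bit segment and a 16-bit offset combine into a linear address as segment * 16 + offset. A small sketch, with type and function names made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t segment;
    uint16_t offset;
} FarPtr;

/* Real-mode address translation: linear = segment * 16 + offset. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Split a linear address (within the first 1 MiB) into the
   normalised form whose offset is always below 16. */
FarPtr real_mode_normalise(uint32_t linear)
{
    FarPtr p;
    assert(linear <= 0xFFFFFu);
    p.segment = (uint16_t)(linear >> 4);
    p.offset = (uint16_t)(linear & 0xFu);
    return p;
}
```

Because many segment:offset pairs name the same byte (1234:5678 and 179B:0008 both land on 0x179B8), comparing or incrementing far pointers takes several instructions where C source has a single expression.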
[0] https://github.com/yegord/snowman