A "smart" disassembler

By zPasi

Champion (499)

zPasi's picture

05-10-2019, 12:43

You know, when disassembling a Z80 binary, you'll often get nonsense like this:

LBFF9:  CP      00h
        RET     NZ
        JP      NZ,LC000
LC000:  LD      DE,0C136h

When in fact that should be:

        DB      0FEh
        DW      0C000h
        DW      0C21Fh
        DW      0C000h
LC000:  LD      DE,0C136h

In the first version, the first 7 bytes of a Basic bload file are interpreted as instructions. Just by chance, these "instructions" don't overlap the real enty point, so that LD DE,0C136h is decoded right. Of course we know that the 7 first bytes of a bload binary are always data, so they are easy to skip, but often there are data bytes later on that make the disassembly a mess.

I was wondering, why disassemblers don't track how the program is executed, so they would know which addresses are code? You'd just specify the entry point.

Of course I wasn't the first thinking about this. There even already is such a smart assembler! YAZD

Of course tricks exist that confuse even YAZD, like JP (HL) and self-modifying code, but when you read the first version, you can add more entry points and get even better results.

The program is made with C# so it's a Windows EXE, but also runs on Mono.

E.g. mono yazd.exe --addr:0xbff9 --entry:0xc000 mybind.bin mybin.lst --xref

Login or register to post comments

By Sandy Brand

Master (253)

Sandy Brand's picture

05-10-2019, 14:34

This is an interesting question you are posing. Smile

From a pure theoretical standpoint, it can be proven that writing something like that would be impossible as it is related to the Halting Problem.

In practice, of course, we can make the assumption in that the programs we try to disassemble are never so 'ill behaved' to run into these theoretical 'infinities'. However, your average program can still have a lot of branching and you would need to simulate through all possible combinations in order to see which part of the assembly is code that is executed, which part is just data, and which part is maybe superfluous data that is never accessed.

This can become difficult because branching can be dependent on many things that are 'outside of the code'. Think about things such as hardware behavior, timings, lots of data being read and/or written to other storage media, etc. A program that would read something as simple as a text string from user input can have an enormous amount of possible different results depending on what is computed. The theoretical term for this is a Combinatorial Explosion.

A lot of this is just theoretical blah blah, of course, but you will run into these problems quite quickly.
From a practical stand-point the problem is that a lot of the assumptions, limits and short-cuts that a programmer had in his/her head when writing the code, are not explicitly stored in the assembly code and thus a disassembler will have a very hard time reconstructing the same 'mind map' and be smart enough to reduce the amount of work that needs to be done to decompile the binary data.

A simple example of this can be a jump table such as this:

; A = Jump offset.
            LD   (LABEL + 1),A
            JP    FUNCTION2
            JP    FUNCTION3
; Followed by some completely unrelated code...
            LD    B,123   ; Etc...

The original author of the code somehow knew that the only possible values for register A for this part of the program were 0, 3 and 6 and that this code is thus safe to run. The disassembler though doesn't have this knowledge and would either need to extract this from the code that ran before it, or somehow do a full simulation and see what possible outcomes make sense. This will be very hard, because the jump table itself is just embedded somewhere in the large blob of binary data and there is no way the disassembler can tell that this was a jump table with just 3 entries.

Of course, you can help the disassembler by making a lot of assumptions and trying to steer the disassembly with clever heuristics and recognition of common programming patterns. But to get this to work at any decent level is no small feat :)

By sd_snatcher

Prophet (3498)

sd_snatcher's picture

05-10-2019, 14:47

Awareness of memory banking would also be very welcome on a modern disassembler.

By akumajo

Resident (43)

akumajo's picture

06-10-2019, 08:55

A few years ago I wanted to reconstruct the source code of a game program I had assembled on paper in the 80's, I had only the binary on a floppy disk.
I had then created a dis-compiler, a program whose specifications are as follows :

The file to be processed is an MSX program running under DOS (.COM extension), or executable from BASIC (.BIN extension), or from a cartridge (.ROM extension).

recognize BIN header...................OK
recognize ROM header...................OK
recognize COM file ...................OK
simply dissasembling...................OK
determine relative jump................OK
set labels.............................OK
recognize ROM jump functions...........OK
recognize RAM system data ...........OK
distinguish opcode / data..............OK
follow code processing.................90%
produce source code for asMSX..........OK

All in all this has worked well, but to go further, I think it would be necessary to start from an emulator and play all the possible code segments to practically reconstruct the initial code.
For most programs this works because there is no self-generated code very often.