# Assembler Optimizer

Страница 34/42
27 | 28 | 29 | 30 | 31 | 32 | 33 | | 35 | 36 | 37 | 38 | 39

Earlier in this thread there has been a lot of discussion on compression algorithms. I'm not on top of all this, but I was intrigued by the concept, so I tried to bake up some homemade packing of vgm-data (I first extracted the vgm-data I wanted and this data is the basis for compression). For what it's worth, I'll just add the info here, maybe it can be of interrest.

I achieve around 70% reduction in size (and up 90% reduction when (unfairly)comparing to vgm-files). A number I like! But...

Algorithm: I think it is like lz77, as it looks back at the encoded buffer already built-up and sees if it can match a similar sequence of numbers. Up to 127 of them. Beforehand, the stream is prepared as 6-bit only as the 7th bit is used to mark a (jump-)block. I had to make a drawing for myself, which illustrates the concept:

Explanation to the image:

• Each box is one byte
• Ln = Length n = 0-127
• Ann = Address n = 0-32767
• Nn = Nested offset n = 0-127
• ' means that 7th bit is set to indicate something, 'L: Start of jump-block. 'A: This jump block has one extra byte (offset-info)
• Any character in the packed buffer without a number after it (white boxes and the green 'X') are plain values, and the first occurence of the value in the buffer.
• In the drawing there is a minimum of 4 matched values in sequence to qualify for a "jump". I use 5 in the final tool.

I made the deflate/inflate-tool in python. My idea was to keep the compressed datablock in memory and just make a getNextValue-function (as music is linear anyways). Naïve as I am, I thought this could work. The strength (and the problem) with the algorithm is that it is recursive. In a "vgm-redux"-file of 20kB (that is only pairs of (register/wait + data)-bytes), I have found out that there is a lot of recursion-calls. Unfortunately I now realized that the getNextValue calls itself like this:

```C:\source\repos\noname_test\project>python packtool.py -inflate clip.mus clip.unpacked
Number of additional calls to getNextValue: 0 call(s), happens 1117 time(s) in clip.
Number of additional calls to getNextValue: 1 call(s), happens 388 time(s) in clip.
Number of additional calls to getNextValue: 2 call(s), happens 726 time(s) in clip.
Number of additional calls to getNextValue: 3 call(s), happens 1001 time(s) in clip.
Number of additional calls to getNextValue: 4 call(s), happens 1195 time(s) in clip.
Number of additional calls to getNextValue: 5 call(s), happens 1340 time(s) in clip.
Number of additional calls to getNextValue: 6 call(s), happens 1406 time(s) in clip.
Number of additional calls to getNextValue: 7 call(s), happens 1402 time(s) in clip.
Number of additional calls to getNextValue: 8 call(s), happens 1387 time(s) in clip.
Number of additional calls to getNextValue: 9 call(s), happens 1428 time(s) in clip.
Number of additional calls to getNextValue: 10 call(s), happens 1328 time(s) in clip.
Number of additional calls to getNextValue: 11 call(s), happens 1191 time(s) in clip.
Number of additional calls to getNextValue: 12 call(s), happens 1082 time(s) in clip.
Number of additional calls to getNextValue: 13 call(s), happens 970 time(s) in clip.
Number of additional calls to getNextValue: 14 call(s), happens 892 time(s) in clip.
Number of additional calls to getNextValue: 15 call(s), happens 789 time(s) in clip.
Number of additional calls to getNextValue: 16 call(s), happens 709 time(s) in clip.
Number of additional calls to getNextValue: 17 call(s), happens 645 time(s) in clip.
Number of additional calls to getNextValue: 18 call(s), happens 521 time(s) in clip.
Number of additional calls to getNextValue: 19 call(s), happens 462 time(s) in clip.
Number of additional calls to getNextValue: 20 call(s), happens 434 time(s) in clip.
Number of additional calls to getNextValue: 21 call(s), happens 390 time(s) in clip.
Number of additional calls to getNextValue: 22 call(s), happens 347 time(s) in clip.
Number of additional calls to getNextValue: 23 call(s), happens 300 time(s) in clip.
Number of additional calls to getNextValue: 24 call(s), happens 275 time(s) in clip.
Number of additional calls to getNextValue: 25 call(s), happens 250 time(s) in clip.
Number of additional calls to getNextValue: 26 call(s), happens 228 time(s) in clip.
Number of additional calls to getNextValue: 27 call(s), happens 191 time(s) in clip.
Number of additional calls to getNextValue: 28 call(s), happens 180 time(s) in clip.
Number of additional calls to getNextValue: 29 call(s), happens 156 time(s) in clip.
Number of additional calls to getNextValue: 30 call(s), happens 139 time(s) in clip.
Number of additional calls to getNextValue: 31 call(s), happens 115 time(s) in clip.
Number of additional calls to getNextValue: 32 call(s), happens 97 time(s) in clip.
Number of additional calls to getNextValue: 33 call(s), happens 85 time(s) in clip.
Number of additional calls to getNextValue: 34 call(s), happens 74 time(s) in clip.
Number of additional calls to getNextValue: 35 call(s), happens 62 time(s) in clip.
Number of additional calls to getNextValue: 36 call(s), happens 41 time(s) in clip.
Number of additional calls to getNextValue: 37 call(s), happens 37 time(s) in clip.
Number of additional calls to getNextValue: 38 call(s), happens 34 time(s) in clip.
Number of additional calls to getNextValue: 39 call(s), happens 33 time(s) in clip.
Number of additional calls to getNextValue: 40 call(s), happens 24 time(s) in clip.
Number of additional calls to getNextValue: 41 call(s), happens 13 time(s) in clip.
Number of additional calls to getNextValue: 42 call(s), happens 5 time(s) in clip.
Number of additional calls to getNextValue: 43 call(s), happens 7 time(s) in clip.
Number of additional calls to getNextValue: 44 call(s), happens 9 time(s) in clip.
Number of additional calls to getNextValue: 45 call(s), happens 11 time(s) in clip.
Number of additional calls to getNextValue: 46 call(s), happens 11 time(s) in clip.
Number of additional calls to getNextValue: 47 call(s), happens 7 time(s) in clip.
Number of additional calls to getNextValue: 48 call(s), happens 7 time(s) in clip.
Number of additional calls to getNextValue: 49 call(s), happens 5 time(s) in clip.
Number of additional calls to getNextValue: 50 call(s), happens 5 time(s) in clip.
Number of additional calls to getNextValue: 51 call(s), happens 4 time(s) in clip.
Number of additional calls to getNextValue: 52 call(s), happens 5 time(s) in clip.
Number of additional calls to getNextValue: 53 call(s), happens 4 time(s) in clip.
Number of additional calls to getNextValue: 54 call(s), happens 3 time(s) in clip.
Number of additional calls to getNextValue: 60 call(s), happens 1 time(s) in clip.
Number of additional calls to getNextValue: 62 call(s), happens 1 time(s) in clip.
```

So, if I am lucky enough to get one single getNextValue to cost 200 on average (which may be way off), we are seeing a total of 12400 cycles. And as numbers are stored as 6-bits, any value bigger than 63 is stored as two values. So, a possible 24800 cycles. Ha ha ha, not suitable for a game like mine :-)

So, I might not even start implementing this in asm - although it would have been kind of cool.

That said. I did some tests with the same concept, but without recursion. This ended up 30% size reduction. Not sure if I'll follow that path.

Bengalack: not sure why you posted it here in MDL topic, if I missed the relation or not, but in case of Z80 compression, you may want to take a look at two tools released just couple of weeks ago:
and
https://github.com/einar-saukas/ZX0 by Einar Saukas

Both tools are polishing LZ77 ad absurdum and produce quite similar results (winner depends on particular data, but if you don't care about +- couple of bytes, ZX0 may become probably more popular due to author's fame, if you care about every last byte gained from compression, you should get familiar with both tools I guess).

To be somewhat on topic with this:
santiontanon: maybe you can try those depackers against MDL to see if it does pass through them (although I think lzxpack02 conditional blocks may be not supported yet?)... anyway, if MDL reports some optimization as possible, I would suggest to first ensure it really is possible, those depackers were reviewed by several people and trimmed down before release a lot. :)
(the point is not as much to optimize them, but to make MDL survive projects which include the depackers source - maybe I'm misunderstanding MDL tool, but I think this has some value for users?)

Thanks for sharing the results Bengalack! The connection to MDL comes from an earlier discussion (the thread is becoming quite long haha) about using compression techniques to identify repeated sequences of instructions in an assembler program that could be separated away as functions, to optimize a program for size. We would still have to add the "call/ret" costs when applying it to code, but still definitively related!

@Ped7g: and yes! nice suggestion, I started transitioning from pletter/aplib to ZX0 for some personal projects last month when I saw the announcements, and ZX0 indeed achieves better compression than both in my data! I will take a look at lzxpack02 (I wasn't aware of it!). I did run MDL just for fun on the zx0 decompressor, and it did not find any optimization (it's probably been very optimized already haha)! But I'll try it in lzxpack02 and see!

Working fine here!. In fact I've added the optimized route in the AGD toolchain and it optimized the sources successfully. The game runs at first try just adding the comment, "push hl ; mdl:no-opt". Good work!
I'm now analyzing the output of the biggest test, with double extra outputs and multiple pages syntax. It also optimizes the output!; but the game doesn't works...
Seems that there're multiple optimizations, reorganizations and optimizations undone so I need to check which one doing bad.

Bengalack: Yep that’s like LZ77, though it does not have that nesting concept.

My interest in the CFG (context-free grammar) dictionary compression was because the repeated parts can be processed like subroutines, you can model them with “call” and “return” commands and a stack. Because this does not need to keep a decoding buffer in RAM it works straight from ROM. Overlapping (like the 'L4 A2 A2 in your picture) is not possible, but repeats can nest. Since all it needs to do is to push or pop a pointer when a call or return command is encountered, the processing speed is pretty quick.

But maybe it’s better to continue this topic in your FM register log thread .

@jltursan: oh, interesting about the second biggest test. Is that the "test02" you shared earlier?

santiontanon wrote:

We would still have to add the "call/ret" costs when applying it to code, but still definitively related!

Yes that logic has to be added. My approach was initially about finding the patterns, and reducing the size (and being able to use the data in its compact form in memory (or ROM), just as I understand MDL would need to as well). My post won't solve things, I just shared in case it could spark some ideas or provide some reference.

Grauw wrote:

Overlapping (like the 'L4 A2 A2 in your picture) is not possible, but repeats can nest.

Not sure how much is gained from the overlapping or "flow-into", but it's an integral part of the algorithm, and I'm sure it is important. And I agree, I don't see how this can be utilised with code-"data" that should be lightweight with only call+ret (if I understand correctly).

Grauw wrote:

Since all it needs to do is to push or pop a pointer when a call or return command is encountered, the processing speed is pretty quick.

Given that I understand this correctly (I assume that a repated code block would be replaced by `call block_name` by MDL (+ a `ret` is added at the end of this `block_name`): But if the amount of repeats are excessive it might be a certain cost? call+ret is 29 "msx-cycles", so 10 jumps are almost 290. But I guess that's the tradeoff one needs to look into?

Grauw wrote:

But maybe it’s better to continue this topic in your FM register log thread .

Not my thread, though But yes, any concerns going further with my current algorithm I'll take somewhere else.

Bengalack wrote:

Given that I understand this correctly (I assume that a repated code block would be replaced by `call block_name` by MDL (+ a `ret` is added at the end of this `block_name`): But if the amount of repeats are excessive it might be a certain cost? call+ret is 29 "msx-cycles", so 10 jumps are almost 290. But I guess that's the tradeoff one needs to look into?

Yeah, the size does not come for free . But in the constrained environment of a 16K or 32K ROM, you may be ok with that. Especially if you exclude hot loops from the optimisation.

Quote:

@jltursan: oh, interesting about the second biggest test. Is that the "test02" you shared earlier?

Not really, think of it as "test03". As it's somewhat close to the last one, I've tried and mostly worked. I'll narrow the reasons why the game doesn't works, then I'll be back to you.
And no, I lied, there's not the biggest one, I've also tried it with a monster game (64KB used RAM) and it failed with a "java.lang.IndexOutOfBoundsException"; but better to go step by step

@bangalack/@grauw: I don't mind it at all to use this thread, up to you haha. Reading all your progress on these compression algorithms gives me lots of ideas for what to try next

@jltursan: great! thanks a lot! Looking forward to the next challenge haha. Maybe let's look at "test03" first, but I'd be curious about the stack trace of that "IndexOutOfBoundsException", I wonder where could that be coming from

Страница 34/42
27 | 28 | 29 | 30 | 31 | 32 | 33 | | 35 | 36 | 37 | 38 | 39