Multiply A * 48

Pagina 1/3
| 2 | 3

Door jltursan

Prophet (2532)

afbeelding van jltursan

22-03-2021, 18:59

That's it, if I want to multiply A * 48 I need a 16 bit register as output; so this code is probably the obvious approach:

ld b,0
ld c,a
ld h,b
ld l,c
add hl,hl
add hl,bc
add hl,hl
add hl,hl
add hl,hl
add hl,hl
ld b,h
ld c,l               ; BC = A * 48

...but seems a bit heavy, 105Tstates Sad

Has anyone found a faster solution?

Aangemeld of registreer om reacties te plaatsen

Van Grauw

Ascended (10004)

afbeelding van Grauw

22-03-2021, 19:26

Can’t think of anything faster than that other than using a lookup table.

Multiply48:
    ld l,a
    ld h,Multiply48_lut >> 8
    ld c,(hl)
    inc h
    ld b,(hl)
    ret

    ALIGN 100H
Multiply48_lut:
    REPT 100H, ?i
    db ?i & 0FFH
    ENDM
    REPT 100H, ?i
    db ?i >> 8
    ENDM

34 cycles at the expense of 512 bytes of memory.

Van jltursan

Prophet (2532)

afbeelding van jltursan

22-03-2021, 19:43

Indeed, it has a huge speedup at a big RAM cost,; but really nice alternative, thanks Grauw!. Btw, these directives are Glass-compatible, isn't it?

I'm thinking on a pretty obvious optimization if I return the result in HL, saving the copy to BC (10 Tstates); but it can impact on how it's the result used afterwards...

Van Grauw

Ascended (10004)

afbeelding van Grauw

22-03-2021, 19:58

jltursan wrote:

Indeed, it has a huge speedup at a big RAM cost,; but really nice alternative, thanks Grauw!. Btw, these directives are Glass-compatible, isn't it?

Yeah they are. Key point is to first have 256 entries with the LSBs of the result value, and then another 256 with the MSBs, so that you can calculate the index faster and use inc h.

As for the non-LUT multiplication, changing the input and output registers is indeed the obvious optimisation, and if the input range is limited to 0-85 then 8-bit additions could be used for the x3.

Van NYYRIKKI

Enlighted (5836)

afbeelding van NYYRIKKI

22-03-2021, 20:48

Maybe 11(+2) cycles slower version with 272-byte table?

Multiply48:
    ld l,a
    ld h,Multiply48_lut >> 8
    ld b,(hl)
    inc h
    and 15
    ld l,a
    ld c,(hl)
    ret

Van Manuel

Ascended (18070)

afbeelding van Manuel

22-03-2021, 20:55

I have no experience at all with this, but do these things make sense (just thinking a bit about it):
- 48 = 2 x 2 x 2 x 2 x 3 = left shift of 4, plus some adds?
- or: 48 = 32 x 1.5: left shift of 5 on A + one right shift of A?
I'm sure I'm stating the obvious here, but I'm interested to know why it's not a good solution Smile

Van NYYRIKKI

Enlighted (5836)

afbeelding van NYYRIKKI

22-03-2021, 21:13

Manuel wrote:

I'm sure I'm stating the obvious here, but I'm interested to know why it's not a good solution Smile

The original question was how to do it faster than *3*16 and there is absolutely nothing wrong with doing it like that... We just try to think, how to do it faster.

Van Manuel

Ascended (18070)

afbeelding van Manuel

22-03-2021, 21:19

oh, repeated add hl, hl is faster than shifting?

Van RvS

Resident (53)

afbeelding van RvS

22-03-2021, 21:27

Provided the value is in (hl)

xor a
rld
ld l,(hl)
ld h,a
ld d,h
ld e,l
add hl,hl
add hl,de

In 72 cycles. Not as fast, but no memory usage.

Van jltursan

Prophet (2532)

afbeelding van jltursan

22-03-2021, 21:35

When operating with 16 bits register, yes, it's faster. As there're no 16-bits shifting instructions in the Z80, you'll need to use two 8-bit shifts through carry to left-shift HL; so in the end, it's slower by every single shift.

Maybe Grauw has guessed how is this routine being used, but yes, "A" values can be constrained between 0-31 (you've enough clues now to guess a specific use for the routine).

So, sacrificing the general purpose of the routine and changing a bit how the registers are used, I've ended with the following:

ld l,a			
add a,a			
add a,l			
ld h,0			
ld l,a			
add hl,hl		
add hl,hl		
add hl,hl		
add hl,hl		; HL = A * 48 (76T)

Not bad, a bit faster and only A and HL modified.

Now, best of all, working with only 32 possible A values, LUT based routines can reduce significantly the used RAM Smile

Van NYYRIKKI

Enlighted (5836)

afbeelding van NYYRIKKI

22-03-2021, 21:38

Manuel wrote:

oh, repeated add hl, hl is faster than shifting?

Yes, ADD HL,HL is 11(+1) cycles while SLA L & RL H is 16(+2) cycles.

Pagina 1/3
| 2 | 3