# Fast addition on a list of values

Hello everyone,

I need to add a constant 8-bit value to a big list of 16-bit values stored in RAM, update them with the result and then check if that value is within a given range :

```loop:	(value) = (value)+constant
if value < range_start then
if value > range_stop then
...
goto loop```

But I need to make it fast, as fast as possible in fact.

This is the best solution I found so far.
If 'hl' contains the address of the list of 16-bit values, and 'e' contains the 8-bit constant value :

```.loop	ld	a,e			;  5 | e = 8-bit constant
exx				;  5 (std)
ld	c,a			;  5
ld	(hl),c			;  8
inc	hl			;  7
sub	c			;  5
ld	b,a			;  5
ld	(hl),b			;  8

;-------------------------------------------------------------------------------
; test range
;-------------------------------------------------------------------------------
ex	de,hl				;  5
ld	hl,(range_stop)		; 17 | range_stop is a negative value
jp	ns,.off_range			; 11

ld	bc,range_stop-range_start	; 11
jp	s,.off_range			; 11
(...)```

The use of 'adc hl,bc' has the advantage of giving me the use of the 's' flag, which 'add hl,bc' does not. I loose the a little bit on the precision of the result, because I do not know the value of the carry, but that's OK.

I really need to make this subroutine go as fast as possible.

Would you have some ideas ?

```CONSTANT:      EQU   123

RANGE_START:   DW    1234
RANGE_END:     DW    5678

; (Initialization can be done faster most likely, keeping it readable for now).
LD   A,(RANGE_START + 0)
LD   (DATA2 + 1),A
LD   A,(RANGE_START + 1)
LD   B,A          ; B = High byte of (RANGE_START).

LD   A,(RANGE_END + 0)
LD   (DATA1 + 1),A
LD   A,(RANGE_END + 1)
LD   C,A          ; C = High byte of (RANGE_END).

LD   DE,CONSTANT  ; E = CONSTANT (8 bits).
; D = 0 (Note: if CONSTANT were 16 bit, the algorithm would still work).

LOOP:
LD   A,E          ; 5
LD   (HL),A       ; 8
INC  HL           ; 7
EX   AF,AF'       ; 5   Keep the low byte for later when needed.

; Add to high byte and include carry.
LD   A,D          ; 5   Set A to 0 but we don't loose the carry.
LD   (HL),A       ; 8
INC  HL           ; 7

; Check with start of range (high byte).
CP   B            ; 5
RET  C            ; 6 or 12  If carry then value is definitely smaller than (RANGE_START).
JR   Z,JUMP1      ; 8 or 13  If equal to high byte then check low byte as well.
;          Otherwise value is definitely larger than (RANGE_START).

JUMP2:
; Check with end of range (high byte).
CP   C            ; 5
JP   C,LOOP       ; 11       If carry then value is definitely smaller than (RANGE_END).
RET  NZ           ; 6 or 12  If not equal to high byte then value is definitely larger than (RANGE_END).

; Check with end of range (low byte).
EX   AF,AF'       ; 5        A = Low byte.

DATA1:     CP   0            ; 8        0 becomes low byte of (RANGE_END).

JP   C,LOOP       ; 11       If carry then value is definitely smaller than (RANGE_END).
JP   Z,LOOP       ; 11       If zero then value is equal to (RANGE_END).
RET

JUMP1:     ; Check with start of range (low byte).
EX   AF,AF'       ; 5        A = Low byte.
DATA2:     CP   0            ; 8        0 becomes low byte of (RANGE_START).
RET  C            ; 6 or 12  If carry then value is definitely smaller than (RANGE_START).
;          Otherwise it is equal or larger.
EX   AF,AF'       ; 5
JP   JUMP2        ; 11
```

WARNING: not tested code!

The main idea behind it is that you don't need to do a full 16 bit comparison on the range values, but first inspect the high-byte to see if the value can actually be outside of the ranges or not. Only when the high byte matches would you then need to check the low byte, otherwise you can just skip it and move on.

This is not only a bit faster, it also means you need less Z80 registers for storing 16 bit computation results and you can thus use them for something else and/or don't need to switch to shadow registers and back.

Rough performance comparison:

Old solution:

Adding value = 64 T-states (but I think your code snippet is incomplete and you need an additional INC HL and EXX somewhere, which would bring it to 76 T-states).
Range check: 72 T-states

New solution:

Range check: 41 T-states (best-case), 75 T-states (worst-case).

As you can see my solution performs worse in worst-case scenarios, however, those should be much rarer (depending on the data of course).

I would say, give it a try and see if it works better for you or not (Ps.: This example uses some self-modifying code tricks, which will not work if your code is stored in ROM of course).

With stack-abuse and seting up the range variables...

```; param sp: address of the list of 16-bit values,
; param bc: the 16-bit constant value
; param bc': -(range_start)
; param de': -(range_stop - range_start)
loop:
; (value) = (value) + constant
pop	hl
push	hl
; if value < range_start then
exx
pop	hl
jp	m, .off_range ; (value - range_start < 0)
; if value > range_stop then
jp	p, .off_range ; (value - range_start - -(range_stop - range_start) >= 0 ==> value - range_start + range_start - range_stop >= 0 ==> value-range_stop >= 0)
exx
; ...
; goto loop
jp	loop
```

(Notes: untested; stack manipulation not included)

Very nice solutions Thanks to both of you, now is the time to do some testing !

When I need to quickly test if a value is in a range, I use a technique that we could call "low resolution equality".

It only works if the value of the range is a power-of-2 and is constant. But that is the kind of constraint you need to be able to get really fast process.

I'll explain it in pseudo-code, but it is simple to implement in assembler.

If we take `r` as the range, `x` for the value to be tested and `tab[i]` for the value of the array, the technique is just to test the equality: `x / r == (tab[i] - (r / 2)) / r`.
If we take a range of 16, e.g., then we will test: `x >> 4 == (tab[i] - 8) >> 4`.
And if we pre-compute the table not to be `tab[i]`, but `(tab[i] - 8) >> 4` and `x` to be `x >> 4`, then the test is reduced to `x == tab[i]`.

I think it's hard to do it faster. Note also that by offsetting the values and the ranges by 8000h, they become unsigned, and you can then test with the carry flag instead of the sign flag. Adding 8000h just needs flipping bit 15. The gain is small in this case, though, just 2*(17-12)-8 = 2 cycles.

Here's an example using your original code. Prepare range_start and range_end by flipping bit 15, then:

```.loop	ld	a,e			;  5 | e = 8-bit constant
exx				;  5 (std)
ld	c,a			;  5
ld	(hl),c			;  8
inc	hl			;  7
sub	c			;  5
ld	(hl),a			;  8
xor	128			;  8 | flip bit 15 of BC
ld	b,a			;  5

;-------------------------------------------------------------------------------
; test range
;-------------------------------------------------------------------------------
ex	de,hl				;  5
ld	hl,(range_stop)		; 17 | range_stop is a negative value