03-23-2011, 12:26 AM
(where /should/ I be putting this sort of thing?)
Plug this in for mul16.asm
I think it saves on average about 110 T states per multiply, according to my tests. If I counted correctly, it's 10 bytes longer.
Why it's faster:
SLA C is a slow opcode (2 bytes, 8 T states) compared to just doing the RLA (1 byte, 4 T states). It's faster to loop twice and roll the A register through each half of the multiplier in turn than it is to shift the 16-bit pair.
Also in this case, JR is a better choice than the original JP instruction. Not only is it a byte shorter, but it's faster on average. Probably.
16 JP NC instructions = 160 T states.
JR is 7 if the condition fails, 12 if it passes. We can assume that, of the 16 bits, half will be 1 and half will be 0. So that's an average of (8*12)+(8*7)=152 T states. It's worth it for the saved byte too, which compensates for the double loop being a few extra bytes.
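To sanity-check the JR vs JP arithmetic, here is a small Python sketch that counts only the T states spent on the conditional branch, using the standard Z80 timings (JP cc = 10, JR cc = 12 taken / 7 not taken). The function name `branch_cost` is made up for illustration; it is not part of any library.

```python
# Standard Z80 timings for the conditional branch only.
JP_CC = 10         # JP cc,nn costs 10 T states whether or not it jumps
JR_TAKEN = 12      # JR cc,e when the condition holds
JR_NOT_TAKEN = 7   # JR cc,e when it falls through


def branch_cost(multiplier, bits=16):
    """Return (jp_total, jr_total) T states for the NC branch over all bits."""
    ones = bin(multiplier & ((1 << bits) - 1)).count("1")
    zeros = bits - ones
    # The branch is taken (skipping the add) when the shifted-out bit is 0.
    jr_total = zeros * JR_TAKEN + ones * JR_NOT_TAKEN
    jp_total = bits * JP_CC
    return jp_total, jr_total


print(branch_cost(0x00FF))  # 8 ones, 8 zeros: the "half and half" case
```

With these numbers the half-and-half case comes to 152 against 160. JR only wins on the branch itself when the multiplier has 7 or more set bits; for small multipliers with lots of zero bits, JP would be quicker, so the gain depends on the data.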
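I also wondered about replacing the DJNZ with dec b / jp nz _mul16loop, since that will jump most times (it would have to be NZ, because DEC B doesn't touch the carry flag). But that's 4 + 10 = 14 T states against 13 for a taken DJNZ, plus three extra bytes, so it's not worth it. Having two short loops actually speeds up the DJNZ a little too, since it falls through (8 T states instead of 13) twice rather than once.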
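The loop-control cost can be counted the same way. This is only a sketch using the standard timings (DJNZ = 13 taken / 8 on exit, DEC B = 4, JP cc = 10); the helper names are hypothetical.

```python
DJNZ_TAKEN = 13   # DJNZ when B is still non-zero after the decrement
DJNZ_EXIT = 8     # DJNZ on the final pass, when it falls through
DEC_B = 4
JP_CC = 10        # JP cc,nn costs the same whether or not it jumps


def djnz_loop(iterations):
    """T states spent on DJNZ alone for a loop of the given length."""
    return (iterations - 1) * DJNZ_TAKEN + DJNZ_EXIT


def dec_jp_loop(iterations):
    """T states spent on a DEC B / JP NZ pair for the same loop."""
    return iterations * (DEC_B + JP_CC)


print(djnz_loop(16))      # one 16-pass loop
print(2 * djnz_loop(8))   # two 8-pass loops
print(dec_jp_loop(16))    # DEC B / JP NZ instead of DJNZ
```

So splitting into two loops saves 5 T states on DJNZ itself (before counting the extra ld a,c and ld b,8 between the loops), and DEC B / JP NZ comes out slower than DJNZ.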
Code:
__MUL16: ; Multiplies HL with the last value stored on the stack
; Works for both signed and unsigned
PROC
LOCAL __MUL16LOOP1
LOCAL __MUL16NOADD1
LOCAL __MUL16LOOP2
LOCAL __MUL16NOADD2
ex de, hl
pop hl ; Return address
ex (sp), hl ; CALLEE caller convention
;;__MUL16_FAST: ; __FASTCALL ENTRY: HL = 1st operand, DE = 2nd Operand
;; ld c, h
;; ld a, l ; C,A => 1st Operand
;;
;; ld hl, 0 ; Accumulator
;; ld b, 16
;;
;;__MUL16LOOP:
;; sra c ; C,A >> 1 (Arithmetic)
;; rra
;;
;; jr nc, __MUL16NOADD
;; add hl, de
;;
;;__MUL16NOADD:
;; sla e
;; rl d
;;
;; djnz __MUL16LOOP
__MUL16_FAST:
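ld b, 8 ; 8 bits per loop
ld a, d ; A = high byte of 2nd operand (multiplier)
ld c, e ; C = low byte of multiplier, kept for the second loop
ex de, hl ; DE = 1st operand (multiplicand)
ld hl, 0 ; Accumulator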
__MUL16LOOP1:
add hl, hl ; hl << 1
;sla c
rla ; a << 1, top bit of the multiplier's high byte -> carry
jr nc, __MUL16NOADD1
add hl, de
__MUL16NOADD1:
djnz __MUL16LOOP1
ld a, c ; Now do the low byte of the multiplier
ld b, 8
__MUL16LOOP2:
add hl, hl ; hl << 1
rla ; a << 1, top bit of the multiplier's low byte -> carry
jr nc, __MUL16NOADD2
add hl, de
__MUL16NOADD2:
djnz __MUL16LOOP2
ret ; Result in hl (16 lower bits)
ENDP
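For anyone who wants to check the logic without an emulator, here is a rough Python model of the two loops above. It is only a sketch: `mul16_model` is a made-up name, and it mirrors the register moves step by step rather than being efficient.

```python
def mul16_model(first, second):
    """Mirror the two 8-pass loops of __MUL16_FAST; returns the low 16 bits."""
    de = first & 0xFFFF          # DE = 1st operand after ex de, hl
    hl = 0                       # ld hl, 0
    # First loop works on the high byte (ld a, d), second on the low (ld a, c).
    for a in ((second >> 8) & 0xFF, second & 0xFF):
        for _ in range(8):       # ld b, 8 ... djnz
            hl <<= 1             # add hl, hl
            carry = hl >> 16     # carry out of the 16-bit add
            hl &= 0xFFFF
            a = (a << 1) | carry # rla: old carry enters bit 0
            carry = a >> 8       # old bit 7 becomes the new carry
            a &= 0xFF
            if carry:            # jr nc skips the add when carry is clear
                hl = (hl + de) & 0xFFFF  # add hl, de
    return hl


print(hex(mul16_model(0x1234, 0x0056)))
print(mul16_model(0xFFFF, 0xFFFF))  # -1 * -1 as 16-bit two's complement
```

The carry that add hl, hl pushes into the bottom of A never gets far enough up to be tested within 8 passes, which is why the routine can skip clearing it.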
