## Try this for faster multiply?

Posts: 771

Joined: Mon Apr 27, 2009 7:26 pm

Location: Slough, Berkshire, UK

### Try this for faster multiply?

(where /should/ I be putting this sort of thing?)

Plug this in for mul16.asm

Code:
```asm
__MUL16:   ; Multiplies HL by the last value pushed onto the stack
           ; Works for both signed and unsigned
      PROC
      LOCAL __MUL16LOOP1
      LOCAL __MUL16NOADD1
      LOCAL __MUL16LOOP2
      LOCAL __MUL16NOADD2

      ex de, hl
      pop hl      ; Return address
      ex (sp), hl ; CALLEE caller convention

;;__MUL16_FAST:   ; __FASTCALL ENTRY: HL = 1st operand, DE = 2nd Operand
;;
;;      ld c, h
;;      ld a, l    ; C,A => 1st Operand
;;
;;      ld hl, 0 ; Accumulator
;;      ld b, 16
;;
;;__MUL16LOOP:
;;      sra c   ; C,A >> 1  (Arithmetic)
;;      rra
;;
;;      jr nc, __MUL16NOADD
;;      add hl, de
;;
;;__MUL16NOADD:
;;      sla e
;;      rl d
;;
;;      djnz __MUL16LOOP

__MUL16_FAST:   ; __FASTCALL ENTRY: HL = 1st operand, DE = 2nd Operand
        ld b, 8
        ld a, d     ; A = high byte of 2nd operand
        ld c, e     ; C = low byte, kept for the second loop
        ex de, hl   ; DE = 1st operand
        ld hl, 0    ; Accumulator

__MUL16LOOP1:
        add hl, hl  ; hl << 1
        ;sla c
        rla         ; a << 1, top bit goes to carry
        jr nc, __MUL16NOADD1
        add hl, de

__MUL16NOADD1:
        djnz __MUL16LOOP1

        ld a, c     ; now walk the low byte
        ld b, 8

__MUL16LOOP2:
        add hl, hl  ; hl << 1
        rla         ; a << 1, top bit goes to carry
        jr nc, __MUL16NOADD2
        add hl, de

__MUL16NOADD2:
        djnz __MUL16LOOP2

      ret   ; Result in hl (16 lower bits)

      ENDP
```

I think it saves on average about 110 T states per multiply, according to my tests. If I counted correctly, it's 10 bytes longer.

Why it's faster:

SLA C is a long, slow opcode (two bytes, 8 T states) compared to just doing the RLA (one byte, 4 T states). It's faster to loop twice and roll the A register through the two halves in turn than it is to roll the 16-bit pair.
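For anyone who wants to convince themselves the split loop still gives the right answer, here is a rough Python model of it. It's only a sketch: `mul16_split` and its variable names are made up for illustration, and it models just the registers and the carry flag that the routine touches.

```python
def mul16_split(hl: int, de: int) -> int:
    """Model of the two-loop routine: low 16 bits of hl * de."""
    multiplier_hi = (de >> 8) & 0xFF   # ld a, d
    multiplier_lo = de & 0xFF          # ld c, e
    multiplicand = hl & 0xFFFF         # ex de, hl leaves the 1st operand in DE
    acc = 0                            # ld hl, 0

    # First pass walks the high byte, second pass models 'ld a, c'.
    for a in (multiplier_hi, multiplier_lo):
        for _ in range(8):                   # ld b, 8 ... djnz
            doubled = acc << 1               # add hl, hl
            carry = doubled >> 16            # carry out of the 16-bit add
            acc = doubled & 0xFFFF
            rotated = (a << 1) | carry       # rla rotates A through carry
            carry = rotated >> 8             # old bit 7 of A
            a = rotated & 0xFF
            if carry:                        # jr nc skips the add when clear
                acc = (acc + multiplicand) & 0xFFFF
    return acc


if __name__ == "__main__":
    print(mul16_split(300, 7))
    print(hex(mul16_split(0xFFFE, 3)))  # -2 * 3 in two's complement
```

The stray carry from ADD HL,HL does get rotated into the bottom of A, but only the eight original bits are shifted out before A is reloaded or the loop ends, so it never reaches the test.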

Also in this case, JR is a better choice than the original JP instruction. Not only is it a byte shorter, but it's faster on average. Probably.

16 JP NC instructions = 160 T states (10 each, whether or not the jump is taken).
JR is 7 if the condition fails, 12 if it passes. We can assume that, for bits, half will be 1 and half will be 0. So that's an average of (8*12)+(8*7)=152 T states. It's worth saving the byte, which compensates for the double loop being a few extra bytes.
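The JP versus JR sums are easy to redo for other bit patterns. A quick sketch, assuming the standard documented Z80 timings (the function names are just mine for illustration):

```python
JP_CC = 10        # JP cc,nn: 10 T states, taken or not
JR_TAKEN = 12     # JR cc,e when the jump is taken
JR_NOT_TAKEN = 7  # JR cc,e when it falls through


def jp_cost(bits: int = 16) -> int:
    """Total T states spent on the conditional jump using JP NC."""
    return bits * JP_CC


def jr_cost(set_bits: int, bits: int = 16) -> int:
    """Total T states using JR NC for a multiplier with set_bits ones.

    JR NC is taken for every 0 bit (the add is skipped) and falls
    through for every 1 bit.
    """
    clear_bits = bits - set_bits
    return clear_bits * JR_TAKEN + set_bits * JR_NOT_TAKEN


if __name__ == "__main__":
    for set_bits in (0, 6, 7, 8, 16):
        print(set_bits, jr_cost(set_bits), jp_cost())
```

On these numbers JR only comes out ahead when the multiplier (the operand that ends up in A and C) has seven or more bits set, so multiplying by small constants would actually favour JP. The byte saved is still a byte saved.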

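I also wondered about replacing DJNZ with DEC B / JP NZ to the loop label, since that jump is taken most of the time, but DEC B (4) plus JP NZ (10) comes to 14 T states against 13 for a taken DJNZ, so it wouldn't shave anything and isn't worth the bytes. Having two short loops actually speeds up the DJNZ a little too: it falls through twice instead of once, and a fall-through is 8 T states rather than 13.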
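The loop-counter overhead can be totted up the same way. Again this is only a sketch, assuming the standard documented timings (DJNZ 13 when it loops back and 8 when it falls through, DEC B 4, JP cc 10); the helper names are hypothetical.

```python
DJNZ_TAKEN = 13  # DJNZ when it loops back
DJNZ_EXIT = 8    # DJNZ on the final pass, falling through
DEC_B = 4        # DEC B
JP_CC = 10       # JP NZ,nn, taken or not


def djnz_cost(passes: int) -> int:
    """T states spent in DJNZ for a loop that runs the given number of passes."""
    return (passes - 1) * DJNZ_TAKEN + DJNZ_EXIT


def dec_jp_cost(passes: int) -> int:
    """T states spent in a DEC B / JP NZ pair for the same loop."""
    return passes * (DEC_B + JP_CC)


if __name__ == "__main__":
    print("one 16-pass DJNZ loop:  ", djnz_cost(16))
    print("two 8-pass DJNZ loops:  ", 2 * djnz_cost(8))
    print("DEC B / JP NZ, 16 passes:", dec_jp_cost(16))
```

So splitting the loop saves 5 T states of DJNZ time, while the LD A,C and LD B,8 between the loops cost 11. The real gain comes from dropping the SLA C, not from the counter.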

### Re: Try this for faster multiply?

I'm wondering if similar optimizations could be made with other 16 bit operations?
