Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Faster multiply?
#1
Okay, I confess: I'm not an assembler user. This is why I need a compiler.

That said, I love to nosey into things I don't understand. I often learn from that.

I was looking at an old piece of assembler code I found, when trying (fairly unsuccessfully) to learn Z80 code:

8 Bit number multiply

Rotates Num 1 round to examine it. Rotates Num2 up. Adds HL and DE if the bit is set.
Result is Num1 * Num2 in HL.

LD HL,0
LD DE, (NUM2)
LD A, (NUM1)
LOOP RR A (Divide A by 2 - copying the 1's column bit into the carry flag.)
JR NC, JP1 (Jump over the add if we have to)
ADD HL,DE
JP1 RET Z (Leave when we finish - A has gone to zero)
SLA E }
RL D } Multiply DE*2
JR LOOP


naturally, when I found this again, I couldn't resist comparing it to the 8 bit multiply assembler in the library. I'm not sure which is faster (Boriel's code is a little unclear in the fastcall case which registers are set with parameters; I think it's A and HL tho. I have a suspicion this code is faster, if only because it doesn't loop for a whole 8 bits if A is small - it quits as soon as A hits zero, which might optimize a few loops out.

Anyway, just for your perusal. Like I said, I'm a little beyond my depth on this one; but I love to optimize where I can!

I'm glad LCD mentioned the HISOFT compiler - because I think it does an awesome job. The limitation with it is that it's ON the spectrum, so you need room for basic + compiler + compiled code [though it can be clever and delete the basic as it goes to make more room]. I'm hoping Boriel's compiler will eventually Exceed the speed of Hisoft's - because it can be much cleverer outside the Spectrum than one locked into it!
Reply
#2
Ok, this is an interesting issue. I tried a supposed faster 8-bit multiplication taken from internet, but it was failing, so I implemented another by myself.

Let me repost your code here, for better clarification:
Code:
LD HL,0
    LD DE, (NUM2)
    LD A, (NUM1)
LOOP:
    RR A ; (Divide A by 2 - copying the 1's column bit into the carry flag.)
    JR NC, JP1; (Jump over the add if we have to)
    ADD HL,DE
JP1:
    RET Z ; (Leave when we finish - A has gone to zero)
    SLA E
    RL D   ; Multiply DE*2
    JR LOOP

I will later study this case.
Reply
#3
I have tested the code, and found it to be slightly faster (40% faster) than the 8 bit multiply that's the default.

It's also, unless I miscounted, slightly smaller, which is a bonus.

I recommend using this code, if you find it to work for you.
Reply
#4
You know, when I posted to this thread originally, a year ago, I could program in basic, and had a smattering of an idea about assembler. I had never really learned z80 assembly language.

Oddly, I've spent more of the last year coding in machine code than I have in basic. *chuckle*

Boriel - your compiler has made me learn something I wanted to learn as a kid. I'm now passable (not really excellent), but passable at machine code. Because I focused on it, and because ZX BASIC made it SO EASY to play with bits of M/C and program basic around it.

:-) But you know, thanks again. It's very cool to realise I reached the point of coding routines and even wizard magic like Interrupt Mode 2 and copying memory using the stack. And just now I sat for half an hour and coded a putchar routine. Okay, it doesn't /quite/ work. Something stops it returning, but I know I'll debug it. And best of all - it did indeed put the character on the screen right where I wanted. Just got to fix a minor bug. In other words, if I want to, I can code in machine code :-)

Wow.

It only took 27 years....
Reply
#5
Okay, today I've managed to get a little time for 8 bit mutiplication. This is the current (new) code:
Code:
ld b, 8
    ld l, a
    xor a

__MUL8LOOP:
    add a, a ; a *= 2
    sla l
    jp nc, __MUL8B
    add a, h

__MUL8B:
    djnz __MUL8LOOP
    ret

And this is yours, replacing JR with JP (2 T-states faster, each). I've also commented it a lot, because some instructions has been removed (unneeded).
Code:
LD E, H  ; H is the 2nd factor
    LD HL,0
    LD D, L  ; DE => H

    ;LD A, (NUM1) ;; Not needed, already done
LOOP:
    ;; RR A ; (Divide A by 2 - copying the 1's column bit into the carry flag.)
    RRA ; 1 byte, 4 T-states; RR A => 2 bytes, 8T-States and are equivalent!

    ; NOTE: JR is 3 T-States faster than JP when the condition is not met
    ; In this case, it's most likely numbers will be little ones (containing more 0s than 1s), so JP
    JP NC, JP1; (Jump over the add if we have to) ;
    ADD HL,DE ; 11 T-states

JP1:
    RET Z ; (Leave when we finish - A has gone to zero) ; 5 T-States if condition not met, 10 if met
    SLA E  ; 8 T-states
    ; RL D   ;; Multiply DE*2 ; Needless in 8 bit, only E is needed!
    JP LOOP ; 2 T-States faster
Note, your routine returns result in L not in A. Perhaps this routine could be rearranged?

Here's the benchmark I've used to time MUL8:
Code:
DIM a as Ubyte = 8
DIM t as Uinteger AT 23672 ' REM t = Frames
DIM q as UByte
DIM tmp as UInteger

POKE t, 0 : ' Sets the clock to 0 in a single instruction

FOR tmp = 0 to 65534
    q = a * 165
NEXT

Print CAST(Fixed, t) / 50

END ' End the program OK instead of an STOP error (STOP is an "error")
PRINT q ' Avoid -O3 variable removal

I haven't timed your routine. To do so, edit mul8.asm in library-asm replacing mul8 code with yours.
Also remember you have the old mul8, not the new one. With the new one, this benchmark gives 8.11 segs.
Reply
#6
Oh, I so didn't write it :-)

I came up with it from a Machine Code tutorial I was studying back then. And of course, I compared it to the library version. Nowadays, I actually am a little better equipped to understand it.

Has it really been that long? Wow.
Reply
#7
boriel Wrote:Note, your routine returns result in L not in A. Perhaps this routine could be rearranged?

Probably. But how can you expect to be able to hold the result of 8 bit * 8 Bit in an 8 bit register? The result should be in something like HL surely?
Reply
#8
britlion Wrote:
boriel Wrote:Note, your routine returns result in L not in A. Perhaps this routine could be rearranged?

Probably. But how can you expect to be able to hold the result of 8 bit * 8 Bit in an 8 bit register? The result should be in something like HL surely?
That's right. In fact, the old mul8 routine did a * h => DE, returning E. When doing 8bit x 8bit, the compiler returns the result in 8 bit (all are bytes). So if you multiply 2 bytes, you will probably overflow if they're greater than 12 each (12 x 12 = 144). There's a wide range of combinations which do not overflow, though.

The same applies to any other mutiplication (float, fixed, etc...). Most programmers are so used to types that they unconsciously choose the right type when they expect a multiplication (e.g. choose uinteger if they're going to multiply 15 x 15 instead of byte).

When the factors have not the same type, the smallest one is expanded, so 8 x 257 will CAST 8 to Uinteger first.
Reply
#9
Just making a note to myself more than anything, but Boriel may find it usefule - there's a Multiply HL*DE -> HL routine in rom at 30A9 if I'm reading this right. Could be handy.

So call 30A9 should be all you need. Corrupts the A register, though.

Not sure if it's fast, but it sure is short!
Reply


Forum Jump:


Users browsing this thread: 1 Guest(s)