Compiler Speed Trials britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-28-2010, 04:58 AM Boriel, I owe you an apology - there is an issue with your hisoft basic test. I absolutely should have made it clearer - and I would have if I thought you were going to replicate the tests! If you recall, I noted that Hisoft ran far more slowly than I expected. The code listed above that you used does indeed take 4.8 seconds. But it's using floating point math to calculate k/2 on line 120. As specified in the Hisoft Basic Manual, and as I noted, I had to tweak it to read: Code:`120 LET V=INT(k/2)*3+4-5` in order to use Integer math, there. I should have made that MUCH clearer. It's also a little misleading to say "using the same code" ZX Basic doesn't run that program in two seconds unless you use DIM to specify Integers. That's fair of course, since the REM does the same thing for Hisoft Basic. But it is worth noting the code does have to be very slightly different. An END IF if nothing else! Try comparing this version (arranged so Hisoft Basic uses Integer math): Code:```7 REM :INT +a,k,v,i,m() 8 REM : OPEN# 9 CLS 10 POKE 23672,0: POKE 23673,0 90 POKE 23672,0 100 LET a=0: LET k=5: LET v=0 110 LET a=a+1 120 LET v=INT (k/2) *3+4-5 130 GO SUB 1000 140 DIM m(5) 150 FOR i=1 TO 5 : REM change to 0 to 4 for ZX BASIC 160 LET m(i)=a 170 NEXT i 200 IF a<1000 THEN GO TO 110 : REM NEEDS AN END IF FOR ZX BASIC 210 PRINT (PEEK 23672+256*PEEK 23673)/50 220 PRINT m(1),k,i,var 999 STOP 1000 RETURN``` And you'll find the Hisoft basic version does indeed return a 0.5 second time. What version did you use to get it to work on ZX BASIC precisely? I would guess something like (arranged so ZX BASIC uses Integer math): Code:```7 REM :INT +a,k,v,i,m() 8 REM : OPEN# DIM a,k,v,i as uInteger 9 CLS 10 POKE 23672,0: POKE 23673,0 90 POKE 23672,0 100 LET a=0: LET k=5: LET v=0 110 LET a=a+1 120 LET v=INT (k/2)*3+4-5 130 GO SUB 1000 140 DIM m(5) as uInteger 150 FOR i=0 TO 4 160 LET m(i)=a 170 NEXT i 200 IF a<1000 THEN GO TO 110 : END IF 210 PRINT (PEEK 23672+256*PEEK 23673)/CAST (FIXED, 50) 220 PRINT m(1),k,i,var 999 STOP 1000 RETURN``` Under -O3, I get this to return in 2.06 seconds. These two are as near as I can get to source code designed to do the same thing, making the same optimizations in each compiler - that is using integer values where possible. I think it's fair to compare them both in fully integer mode. I'm sorry to say that the numbers used in the table at the start of this thread are ones I can stand by, certainly for Hisoft and ZX BASIC, because I tested them personally; and have edited the post to show the best times I achieved in all cases. Here's a snapshot to make it easy to look at: http://sites.google.com/site/britlion/Bm7Hisoft.z80 All set to do *x to clear, *c to compile and *r to run. boriel Administrator Posts: 1,715 Threads: 54 Joined: Aug 2019 Reputation: 16 05-28-2010, 07:21 AM britlion Wrote:Boriel, I owe you an apology - there is an issue with your hisoft basic test. I absolutely should have made it clearer - and I would have if I thought you were going to replicate the tests! If you recall, I noted that Hisoft ran far more slowly than I expected. The code listed above that you used does indeed take 4.8 seconds. But it's using floating point math to calculate k/2 on line 120. As specified in the Hisoft Basic Manual, and as I noted, I had to tweak it to read: Code:`120 LET V=INT(k/2)*3+4-5` in order to use Integer math, there. I should have made that MUCH clearer. It's also a little misleading to say "using the same code" ZX Basic doesn't run that program in two seconds unless you use DIM to specify Integers. That's fair of course, since the REM does the same thing for Hisoft Basic. But it is worth noting the code does have to be very slightly different. An END IF if nothing else!That's why I was asking which code you were using exactly! The code I use for both tests was this: Code:```6 DIM a,k,v,i as UInteger: DIM m(5) as UInteger: REM comment this line in HiSoft Basic 7 REM :INT +a,k,v,i,m() 8 REM : OPEN# 9 CLS 10 POKE 23672,0: POKE 23673,0 90 POKE 23672,0 100 LET a=0: LET k=5: LET v=0 110 LET a=a+1 120 LET v=k/2*3+4-5 : REM Int(K/2) rounds to ILong 130 GO SUB 1000 140 DIM m(5) 150 FOR i=1 TO 5 : REM change to 0 to 4 for ZX BASIC 160 LET m(i)=a 170 NEXT i 200 IF a<1000 THEN GO TO 110 : REM NEEDS AN END IF FOR ZX BASIC 210 PRINT (PEEK 23672+256*PEEK 23673)/50 999 STOP 1000 RETURN 1010 PRINT m(1),k,i,var : REM Never executed```Britlion Wrote:And you'll find the Hisoft basic version does indeed return a 0.5 second time.Then the compiler might be even doing a "Unused var removal" optimization. I need to get the generated ASM CODE, to see the Hisoft Routines, but It's my bet the program is being optimized that way or the m(i) access is optimized. BTW I've optimized the array access a little (just 8 T-states per dimension). I've created a benchmark directory in the compiler source and put this one, so benchmarks we create will be there. britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-28-2010, 03:40 PM The .z80 snapshot I listed above has the following asm from 64994 - the code is 347 bytes long: Code:```    CALL L_FF31         ; FDE2 CD 31 FF     LD HL, L_FF3D         ; FDE5 21 3D FF     LD (\$5C6C), HL         ; FDE8 22 6C 5C     CALL \$0D6B         ; FDEB CD 6B 0D     CALL L_FF31         ; FDEE CD 31 FF     LD HL, \$5C78         ; FDF1 21 78 5C     LD (HL), \$00         ; FDF4 36 00     LD HL, \$5C79         ; FDF6 21 79 5C     LD (HL), \$00         ; FDF9 36 00     LD HL, \$5C78         ; FDFB 21 78 5C     LD (HL), \$00         ; FDFE 36 00     LD HL, \$0000         ; FE00 21 00 00     LD (L_FF3D), HL         ; FE03 22 3D FF     LD HL, \$0005         ; FE06 21 05 00     LD (\$FF3F), HL         ; FE09 22 3F FF     LD HL, \$0000         ; FE0C 21 00 00     LD (\$FF41), HL         ; FE0F 22 41 FF L_FE12: LD HL, (L_FF3D)         ; FE12 2A 3D FF     INC HL             ; FE15 23     LD (L_FF3D), HL         ; FE16 22 3D FF     LD HL, (\$FF3F)         ; FE19 2A 3F FF     SRL H             ; FE1C CB 3C     RR L             ; FE1E CB 1D     LD D, H             ; FE20 54     LD E, L             ; FE21 5D     ADD HL, HL         ; FE22 29     ADD HL, DE         ; FE23 19     LD DE, \$0004         ; FE24 11 04 00     ADD HL, DE         ; FE27 19     LD DE, \$0005         ; FE28 11 05 00     AND A             ; FE2B A7     SBC HL, DE         ; FE2C ED 52     LD (\$FF41), HL         ; FE2E 22 41 FF     CALL L_FED7         ; FE31 CD D7 FE     LD HL, \$FF4E         ; FE34 21 4E FF     LD BC, \$0009         ; FE37 01 09 00     LD A, \$00         ; FE3A 3E 00     CALL L_FF36         ; FE3C CD 36 FF     LD HL, \$0001         ; FE3F 21 01 00     LD (\$FF43), HL         ; FE42 22 43 FF     PUSH HL             ; FE45 E5     LD HL, \$0005         ; FE46 21 05 00     LD (\$FF4A), HL         ; FE49 22 4A FF     POP HL             ; FE4C E1     JP L_FE69         ; FE4D C3 69 FE L_FE50: LD HL, (\$FF43)         ; FE50 2A 43 FF     LD DE, \$FF4E         ; FE53 11 4E FF     DEC HL             ; FE56 2B     ADD HL, HL         ; FE57 29     ADD HL, DE         ; FE58 19     PUSH HL             ; FE59 E5     LD HL, (L_FF3D)         ; FE5A 2A 3D FF     EX DE, HL         ; FE5D EB     POP HL             ; FE5E E1     LD (HL), E         ; FE5F 73     INC HL             ; FE60 23     LD (HL), D         ; FE61 72     LD HL, (\$FF43)         ; FE62 2A 43 FF     INC HL             ; FE65 23     LD (\$FF43), HL         ; FE66 22 43 FF L_FE69: LD DE, (\$FF4A)         ; FE69 ED 5B 4A FF     EX DE, HL         ; FE6D EB     AND A             ; FE6E A7     SBC HL, DE         ; FE6F ED 52     JP NC, L_FE50         ; FE71 D2 50 FE     LD HL, (L_FF3D)         ; FE74 2A 3D FF     LD DE, \$03E8         ; FE77 11 E8 03     CALL L_FEDD         ; FE7A CD DD FE     LD A, H             ; FE7D 7C     OR L             ; FE7E B5     JP NZ, L_FE12         ; FE7F C2 12 FE     LD HL, \$5C78         ; FE82 21 78 5C     LD L, (HL)         ; FE85 6E     LD H, \$00         ; FE86 26 00     PUSH HL             ; FE88 E5     LD HL, \$0100         ; FE89 21 00 01     PUSH HL             ; FE8C E5     LD HL, \$5C79         ; FE8D 21 79 5C     LD L, (HL)         ; FE90 6E     LD H, \$00         ; FE91 26 00     POP DE             ; FE93 D1     CALL L_FEF6         ; FE94 CD F6 FE     POP DE             ; FE97 D1     ADD HL, DE         ; FE98 19     CALL L_FF1E         ; FE99 CD 1E FF     LD HL, \$0032         ; FE9C 21 32 00     CALL L_FF1E         ; FE9F CD 1E FF     CALL L_FEEB         ; FEA2 CD EB FE     CALL \$2DE3         ; FEA5 CD E3 2D     LD A, \$0D         ; FEA8 3E 0D     RST \$10             ; FEAA D7     LD HL, (\$FF4E)         ; FEAB 2A 4E FF     CALL L_FEE5         ; FEAE CD E5 FE     LD A, \$06         ; FEB1 3E 06     RST \$10             ; FEB3 D7     LD HL, (\$FF3F)         ; FEB4 2A 3F FF     CALL L_FEE5         ; FEB7 CD E5 FE     LD A, \$06         ; FEBA 3E 06     RST \$10             ; FEBC D7     LD HL, (\$FF43)         ; FEBD 2A 43 FF     CALL L_FEE5         ; FEC0 CD E5 FE     LD A, \$06         ; FEC3 3E 06     RST \$10             ; FEC5 D7     LD HL, \$FF45         ; FEC6 21 45 FF     CALL \$33B4         ; FEC9 CD B4 33     CALL \$2DE3         ; FECC CD E3 2D     LD A, \$0D         ; FECF 3E 0D     RST \$10             ; FED1 D7     LD HL, \$2758         ; FED2 21 58 27     EXX             ; FED5 D9     RET             ; FED6 C9 L_FED7: RET             ; FED7 C9     LD HL, \$2758         ; FED8 21 58 27     EXX             ; FEDB D9     RET             ; FEDC C9 L_FEDD: XOR A             ; FEDD AF     SBC HL, DE         ; FEDE ED 52     LD H, A             ; FEE0 67     LD L, A             ; FEE1 6F     RET NC             ; FEE2 D0     INC L             ; FEE3 2C     RET             ; FEE4 C9 L_FEE5: CALL L_FF1E         ; FEE5 CD 1E FF     JP \$2DE3         ; FEE8 C3 E3 2D L_FEEB: CALL L_FF26         ; FEEB CD 26 FF     CALL \$31AF         ; FEEE CD AF 31     LD (\$5C65), DE         ; FEF1 ED 53 65 5C     RET             ; FEF5 C9 L_FEF6: LD A, \$20         ; FEF6 3E 20     CP E             ; FEF8 BB     JR C, L_FF08         ; FEF9 38 0D     LD A, D             ; FEFB 7A     AND A             ; FEFC A7     JR NZ, L_FF08         ; FEFD 20 09     LD B, E             ; FEFF 43     EX DE, HL         ; FF00 EB     LD L, H             ; FF01 6C     CP B             ; FF02 B8     RET Z             ; FF03 C8 L_FF04: ADD HL, DE         ; FF04 19     DJNZ L_FF04         ; FF05 10 FD     RET             ; FF07 C9 L_FF08: LD C, L             ; FF08 4D     LD B, H             ; FF09 44     LD HL, \$0000         ; FF0A 21 00 00     LD A, \$0F         ; FF0D 3E 0F L_FF0F: SLA E             ; FF0F CB 23     RL D             ; FF11 CB 12     JR NC, L_FF16         ; FF13 30 01     ADD HL, BC         ; FF15 09 L_FF16: ADD HL, HL         ; FF16 29     DEC A             ; FF17 3D     JR NZ, L_FF0F         ; FF18 20 F5     OR D             ; FF1A B2     RET P             ; FF1B F0     ADD HL, BC         ; FF1C 09     RET             ; FF1D C9 L_FF1E: XOR A             ; FF1E AF     LD E, A             ; FF1F 5F     LD D, L             ; FF20 55     LD C, H             ; FF21 4C     LD B, A             ; FF22 47     JP \$2AB6         ; FF23 C3 B6 2A L_FF26: LD HL, (\$5C65)         ; FF26 2A 65 5C     LD BC, \$FFFB         ; FF29 01 FB FF     ADD HL, BC         ; FF2C 09     LD D, H             ; FF2D 54     LD E, L             ; FF2E 5D     ADD HL, BC         ; FF2F 09     RET             ; FF30 C9 L_FF31: LD A, \$02         ; FF31 3E 02     JP \$1601         ; FF33 C3 01 16 L_FF36: LD D, H             ; FF36 54     LD E, L             ; FF37 5D     LD (HL), A         ; FF38 77     INC DE             ; FF39 13     LDIR             ; FF3A ED B0     RET             ; FF3C C9 L_FF3D: RLA             ; FF3D 1``` boriel Administrator Posts: 1,715 Threads: 54 Joined: Aug 2019 Reputation: 16 05-29-2010, 12:53 AM Okay, I've reverse engineered the hisoft CM you put above (see attached file). The routine is fair, and it's doing even an array initialization on *each pass* :!: :o The bad news first: it's effectively working on a vector. Doing this might break the ZX BASIC compiler, since it's supposed to be multi-architectural. The good news: Most of the time is gone in the 16bit multiplication used for array accesses. Since most of the arrays and element sizes are near 0, this routine uses it's own array-multiplication (HISoft Basic is doing the same for some multiplications). This will reduce the execution time down to 1.11 segs. Even better, using multidimensional arrays does only add a little overhead (about 1.39 for 2 dims). The FOR...NEXT does the comparison at the reverse. This is 10 T-states faster, so I change the FOR...NEXT scheme that way to. Now this REALLY needs intensive testing (array and FOR...NEXT loops). I'm uploading a new 1.2.6 beta-r1571, if someone is interested. Attached Files   hisoft.asm (Size: 5.17 KB / Downloads: 575) britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-29-2010, 07:00 AM Cool! I'll look into trying to break that. Shaving most of a second off a 2 second benchmark is still impressive. Curiously, I can see the odd optimization in Hisoft routines too: Code:```LD DE, \$0005          AND A                  SBC HL, DE         ; HL = k / 2 * 3 + 4 - 5``` Is probably better done as: Code:```LD DE, -5 ADD HL,DE``` Which doesn't need the AND A so shaves one byte and 4 T states off every fixed 16 bit subtract. britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-29-2010, 07:05 AM Oh yes - this: Code:```LD HL, \$2758      ; ??    EXX              RET``` I'm pretty sure this is setting HL' to the required value for BASIC before exiting. The Basic interpreter (and in particular the Interrupt service routine) assume that IY is pointed at the system variables and HL' is set to a specific value. I assume this one. Otherwise the spectrum can crash hard when dropping to BASIC from machine code. boriel Administrator Posts: 1,715 Threads: 54 Joined: Aug 2019 Reputation: 16 05-29-2010, 09:32 AM britlion Wrote:Oh yes - this: Code:```LD HL, \$2758      ; ??    EXX              RET``` I'm pretty sure this is setting HL' to the required value for BASIC before exiting. The Basic interpreter (and in particular the Interrupt service routine) assume that IY is pointed at the system variables and HL' is set to a specific value. I assume this one. Otherwise the spectrum can crash hard when dropping to BASIC from machine code.Yes, I guess it's just that. If you look to ZX BASIC generated code, it stores IX (not needed), IY and HL', and recovers them on exit (unless returning with an error, RST #8). It's a pity IY is used for TIME interrupt, because IY could be very handy for managing data structures like Objects and structs. I think It would be a good idea to use IM2 for people wanting time-frames counting, without using IY. boriel Administrator Posts: 1,715 Threads: 54 Joined: Aug 2019 Reputation: 16 05-29-2010, 09:54 AM britlion Wrote:Cool! I'll look into trying to break that. Shaving most of a second off a 2 second benchmark is still impressive. Curiously, I can see the odd optimization in Hisoft routines too: Code:```LD DE, \$0005          AND A                  SBC HL, DE         ; HL = k / 2 * 3 + 4 - 5``` Is probably better done as: Code:```LD DE, -5 ADD HL,DE``` Which doesn't need the AND A so shaves one byte and 4 T states off every fixed 16 bit subtract.It is. Even funnier: ZX BASIC doesn't do that either for 16 bits :mrgreen: So I've just put it. :wink: britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-29-2010, 03:06 PM Ha! I've been looking at Z80 optimization pages lately. Under math tricks at http://wikiti.brandonw.net/index.php?ti ... timization there are some good ones, including that one. I like the alternatives to using NEG - which is awkward, being an extended instruction: Code:```;Instead of     neg     add a,N   ;you want to calculate N-A ;Do it this way:     cpl     add a,N+1    ;neg is practically equivalent to cpl \ inc a ; -> save 1 byte and 4 T-states``` And this one, when you learn it, is solid gold: Looping with 16 bit counter There are two ways to make loops with a 16bit counter : * the naive one, which results in smaller code but increased loop overhead (24 * n T-states) and destroys a Code:```ld bc, ... loop:   ; loop body here   dec bc   ld  a, b   or  c   jp  nz,loop``` * the slightly trickier one, which takes a couple more bytes but has a much lower overhead (12 * n + 14 * (n / 16) T-states) (This is harder to understand why it works, but is MUCH faster - almost as fast as djnz with an 8 bit counter (and oddly written, such that it uses B and D as loop counters; personally I'd use B and C, I think) Code:```dec  de   ld  b, e   inc  b   inc  d loop2:   ; loop body here      djnz loop2   dec  d   jp  nz,loop2``` How it works is because if b=0 DJNZ will loop 256 times. So we start with B set to the number of loops past a multiple of 256, and then loop D lots of 256 times. It saves a lot of time over a reasonably large loop - for example, it's about half the overhead, just for something as small as a 1000 loop. There are quite a few other similar tricks listed. You might find some are handy. boriel Administrator Posts: 1,715 Threads: 54 Joined: Aug 2019 Reputation: 16 05-29-2010, 03:44 PM :o The last one is really nice :!: I could use it for string management, maybe... I've improved the DRAW routine (slightly, but only a nano-bit faster). britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-29-2010, 03:50 PM Incidentally, Boriel - is the compiler correctly removing all the un-needed run time routines? I noted that the Hisoft version of BM7 was around 350 bytes, and the ZX BASIC version was almost 1,800 bytes, with -O3 enabled. I'm not sure if that's a problem or not, given that most of them would be needed for reasonably large programs; but if we're considering an option of a 45 byte overhead for a faster SQR function, for example, then not having the runtime packages we don't need becomes a little more important. boriel Administrator Posts: 1,715 Threads: 54 Joined: Aug 2019 Reputation: 16 05-29-2010, 03:57 PM britlion Wrote:Incidentally, Boriel - is the compiler correctly removing all the un-needed run time routines? I noted that the Hisoft version of BM7 was around 350 bytes, and the ZX BASIC version was almost 1,800 bytes, with -O3 enabled. I'm not sure if that's a problem or not, given that most of them would be needed for reasonably large programs; but if we're considering an option of a 45 byte overhead for a faster SQR function, for example, then not having the runtime packages we don't need becomes a little more important.It's OK. Because of the PRINT. Print routine is about 580bytes (ITALIC, BOLD, ATRIBUTES, etc...). But a --use-rom-print is on the way if you needn't ITALIC/BOLD nor speed (I wonder if the PRINT routine could be optimized more, BTW) britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-29-2010, 04:02 PM I think Hisoft optimized down to whether ink or paper changes were made, and didn't include code for them if not used. I suppose you could arrange for the print routine to not include the BOLD and ITALIC (and INVERSE and OVER) routines, if it was made fairly modular? britlion Posting Freak Posts: 805 Threads: 135 Joined: Apr 2009 Reputation: 5 05-29-2010, 04:06 PM britlion Wrote:I think Hisoft optimized down to whether ink or paper changes were made, and didn't include code for them if not used. I suppose you could arrange for the print routine to not include the BOLD and ITALIC (and INVERSE and OVER) routines, if it was made fairly modular? I just looked at the print.asm - and it looks like it IS modular. All those #includes at the start - could the compiler make them optional, if not needed? boriel Administrator Posts: 1,715 Threads: 54 Joined: Aug 2019 Reputation: 16 05-29-2010, 04:10 PM britlion Wrote:britlion Wrote:I think Hisoft optimized down to whether ink or paper changes were made, and didn't include code for them if not used. I suppose you could arrange for the print routine to not include the BOLD and ITALIC (and INVERSE and OVER) routines, if it was made fairly modular? I just looked at the print.asm - and it looks like it IS modular. All those #includes at the start - could the compiler make them optional, if not needed?Because people asked me to support attribute control codes (e.g. CHR\$(22, 1, 10) = AT 1, 10; etc...) Since PRINT a\$ could also mean PRINT INVERSE 1; ITALIC 1; "xxx" in a single string, all must be included. So only ROM print would fix that. « Next Oldest | Next Newest »