ehehe, the sdcc compiler has inverted the loop, this is why :-)
With Hitech C you have to do it manually, but once you do that you get 13 seconds on a plain Z80 (837 ticks)
;-)
This is the code:
unsigned int i;

void testCode(void)
{
    register unsigned int j,s;

    s = 0;
    for (i=0;i<10000;i++)
        for(j=100;j>0;j--)
            s++;
    printf("%d\n",s);
}

corresponding to this:

0000'            _testCode:
0000' FD E5          push iy
                 ;main.c: 8: register unsigned int j,s;
                 ; _s allocated to iy
0002' FD 21 0000     ld iy,0
                 ;main.c: 10: for (i=0;i<10000;i++)
0006' 21 0000        ld hl,0
0009' 18 0E          jp L1

000B'            l5:
                 ;main.c: 11: for(j=100;j>0;j--)
                 ; _j allocated to bc
000B' 01 0064        ld bc,064h
000E'            l9:
                 ;main.c: 12: s++;
000E' FD 23          inc iy
0010' 0B             dec bc
0011' 78             ld a,b
0012' B1             or c
0013' 20 F9          jp nz,l9
0015' 2A 0000'       ld hl,(_i)
0018' 23             inc hl
0019'            L1:
0019' 22 0000'       ld (_i),hl
001C' 01 2710        ld bc,02710h
001F' B7             or a
0020' ED 42          sbc hl,bc
0022' 38 E7          jp c,l5
                 ;main.c: 13: printf("%d\n",s);
0024' FD E5          push iy
0026' 21 0000'       ld hl,u19
0029' E5             push hl
002A' CD 0000*       call _printf
002D' C1             pop bc
002E' C1             pop bc
                 ;main.c: 14: }
002F' FD E1          pop iy
0031' C9             ret
Maybe the original test only tests one very narrow thing: can the compiler figure out that, on an 8-bit processor, it's faster to count to 100 than to 10000, and that in this particular example it's possible to swap the order of the for loops without changing the output. Based on this test alone, I wouldn't draw too many conclusions.
I mean, if the compiler was really smart, it would leave out the for loops entirely and just print the result. The loop counter variables aren't needed for anything. And the test also assumes that the programmer is clueless.
Actually, the fact that SDCC is able to reverse the loop to gain speed, relying on the fact that the counter variable is not used, is a very good feature. It makes the asm closer to what a human would write.
After all, SDCC is still in development, while Hitech C stopped its development in 2001.
Another possible optimization for SDCC could be to narrow the counter variable j to unsigned char: again, since it is not used in the output, you can represent it internally in a more efficient way.
@Marq
Are you involved in SDCC development?
[edit]
@ yzi, I agree, the smartest thing is to compute the result offline and print it ;-)
Anyway, I did a quick test on gcc for x86, compiling the sample code.
It seems to execute the loops as written, so I wouldn't ever dare to ask for this level of optimization from a Z80 cross-compiler.

void testCode(void) {
    0x00401334 push %ebp
    0x00401335 mov %esp,%ebp
    0x00401337 push %edi
    0x00401338 push %esi
    0x00401339 push %ebx
    0x0040133A sub $0x1c,%esp
register unsigned int i,j,s;
s = 0;
    0x0040133D mov $0x0,%ebx
for (i=0;i<10000;i++)
    0x00401342 mov $0x0,%edi
    0x00401347 jmp 0x401357 [testCode+35]
    0x00401356 inc %edi
    0x00401357 cmp $0x270f,%edi
    0x0040135D jbe 0x401349 [testCode+21]
for(j=100;j>0;j--)
    0x00401349 mov $0x64,%esi
    0x0040134E jmp 0x401352 [testCode+30]
    0x00401351 dec %esi
    0x00401352 test %esi,%esi
    0x00401354 jne 0x401350 [testCode+28]
s++;
    0x00401350 inc %ebx
printf("%d\n",s);
    0x0040135F mov %ebx,0x4(%esp)
    0x00401363 movl $0x403024,(%esp)
    0x0040136A call 0x401bb0 [printf]
}
    0x0040136F add $0x1c,%esp
    0x00401372 pop %ebx
    0x00401373 pop %esi
    0x00401374 pop %edi
    0x00401375 pop %ebp
    0x00401376 ret
Another smart trick to improve performance (without resorting to assembly):

void testCode()
{
    unsigned int i,j,s;

    s = 0;
    for (i = 10000; i; i--)
        for(j = 100; j; j-- )
            s++;
    printf("%d\n",s);
}
868 vdp interrupts = 14,96 sec(s).
Or even better:
void testCode()
{
    unsigned int i,s;
    /*unsigned*/ char j;

    s = 0;
    for (i = 10000; i; i--)
        for(j = 100; j; j-- )
            s++;
    printf("%d\n",s);
}
762 vdp interrupts = 12,7 sec(s).
(I used char instead of its unsigned counterpart because unsigned char is not supported by MSX-C.)
Quoting myself from a previous post:
"Well, I didn't mean this experiment as any conclusive test – just a fun one-afternoon test run of different compilers."
There are plenty of better test cases available all around the web, although many of them might rely on functionality (long ints, multiplications and so on) that aren't native to Z80. Tweaking the code to get better performance is kind of beside the point. For the sake of it, we could implement a bit more realistic open task or a set of tasks which would include at least:
- Nested loops
- Conditional jumps
- Array indexing
- Basic math
- Function calls
Can't make it awfully complex if it's to work with BASIC, too, so need to leave out structs etc.
By the way, SDCC doesn't seem to benefit at all from that char counter trick. Same 13 s with that.
If you publish a test suite, I can run it on the Hitech cross compiler.
PS
The unsigned char trick works very well for the Hitech cross compiler. This code:

void testCode(void)
{
    register unsigned char j;
    register unsigned int i,s;

    s = 0;
    for (i=10000;i;i--)
        for(j=100;j;j--)
            s++;
    printf("%d\n",s);
}

runs in 11 seconds (696 ticks) and corresponds to this asm:

0000'            _testCode:
                 ;main.c: 8: register unsigned char j;
                 ; _s allocated to de
0000' 11 0000        ld de,0
                 ;main.c: 11: for (i=10000;i;i--)
                 ; _i allocated to bc
0003' 01 2710        ld bc,02710h
0006' 18 0B          jp l8
0008'            l5:
                 ;main.c: 12: for(j=100;j;j--)
                 ; _j allocated to l
0008' 2E 64          ld l,064h
000A' 18 02          jp l12
000C'            l9:
                 ;main.c: 13: s++;
000C' 13             inc de
000D' 2D             dec l
000E'            l12:
000E' 7D             ld a,l
000F' B7             or a
0010' 20 FA          jp nz,l9
0012' 0B             dec bc
0013'            l8:
0013' 78             ld a,b
0014' B1             or c
0015' 20 F1          jp nz,l5
                 ;main.c: 14: printf("%d\n",s);
0017' D5             push de
0018' 21 0000'       ld hl,u19
001B' E5             push hl
001C' CD 0000*       call _printf
001F' C1             pop bc
0020' C1             pop bc
                 ;main.c: 15: }
0021' C9             ret
@ARTRAG: does Hitech C support the volatile keyword?
Yes, it forces the compiler not to reuse the value of a variable held in a register but to retrieve it from RAM on every access.
PS
a large collection of C compilers for z80
http://www.z80.eu/c-compiler.html