"local map", is this a RAM nametable. along with it could come a faster scroll border fill.
Yes, that's another way of looking at it, another motivation for me to do it. Though one slightly tricky thing is, 17 tiles can be visible at once in the x direction when scrolled. I think I'll have to cheat a little there, to keep the local map at a convenient 16x16 size.
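A minimal sketch of the 16x16 local map idea, assuming (this is not stated in the post) that it is addressed as a ring buffer using only the low 4 bits of the world tile coordinates. `local_index` is a hypothetical name. It illustrates why the 17th visible column is awkward: it lands on the same slot as the first one.

```python
LOCAL_SIZE = 16  # the local map is 16x16 tiles, as in the post

def local_index(tile_x, tile_y):
    """Map world tile coordinates to a slot in the 16x16 local map.

    Assumption: the local map wraps around, so only the low 4 bits
    of each coordinate are used.
    """
    return (tile_y % LOCAL_SIZE) * LOCAL_SIZE + (tile_x % LOCAL_SIZE)

# With a partially scrolled screen, 17 columns are visible at once.
first_visible = 5
visible_columns = range(first_visible, first_visible + 17)
slots = [local_index(x, 0) for x in visible_columns]

# The leftmost and rightmost visible columns share one slot.
print(slots[0] == slots[-1])
```

So with this addressing some "cheat" is needed for whichever of the two edge columns is not stored.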
ha! That's a neat trick indeed! The one tricky thing would be to make sure objects going through the edge of the screen do not reappear on the other side though
Yeah, in principle, carry checks in the right places... but the reality will probably be more stubborn.
4: Drawing new map tiles (44.42%, 7.4 ms)
Yesterday I optimised the tile drawing code, it now takes 31% CPU time per frame (5.2 ms), which also maxes out the VDP copy speed for sixteen 4x16 tiles, so this is about as fast as it’ll get :).
It’s mostly due to optimising the tile lookup code, previously I would calculate the tile data address for every piece copied, now I calculate the position once and only increment the address between draws. (I’m not using the “local map” I talked about earlier yet.)
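A rough sketch of that change, assuming a hypothetical row-major map with one byte per tile. `MAP_BASE`, `MAP_WIDTH` and both function names are made up for illustration; the real data layout is not described in the post.

```python
MAP_BASE = 0x8000   # hypothetical start address of the map data
MAP_WIDTH = 64      # hypothetical map width in tiles (one byte per tile)

def column_addresses_recomputed(tile_x, tile_y, count):
    """Old approach: a full address calculation for every piece."""
    return [MAP_BASE + (tile_y + i) * MAP_WIDTH + tile_x for i in range(count)]

def column_addresses_incremental(tile_x, tile_y, count):
    """New approach: calculate once, then only add the row stride."""
    address = MAP_BASE + tile_y * MAP_WIDTH + tile_x
    result = []
    for _ in range(count):
        result.append(address)
        address += MAP_WIDTH  # step one tile down the column
    return result

# Both approaches visit the same addresses; the second avoids the multiply.
print(column_addresses_recomputed(3, 7, 16) == column_addresses_incremental(3, 7, 16))
```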
I also simplified the code to always draw a full 4 pixel wide column / row (16 copies) rather than breaking that up in 4 parts (4 copies). This was originally done to minimise the copying time when scrolling 1 pixel per frame, but since nowadays I scroll 2 pixels per frame anyway, with the copies spread out across two frames, there’s no reason to optimise for 1 pixel scrolling anymore.
I think the main problem is really having the CPU idle while copying. You could calculate the next tile address for free while the current copy runs, but in any case the CPU is stopped waiting for the command to finish. So that CPU usage is high when measured as time, but in real terms the CPU is doing nothing; it is just limited by how long the VDP command takes to finish.
Maybe there could be something to do in the gap of time during each 4x16px copy.
Also, in the linked post you mention starting at VBLANK. I don't know if you are doing the same thing in this case, but it is not needed: you could start copying right after the logic has finished, even during rendering time, since the copies are supposed to be off-screen.
To make the most of the VRAM access slots, you can also use async drawing (that is the system we are going to use):
- It requires some working flags.
- Enqueue the operations. An operation (something like a gfxop structure) consists of a source (x, y, width, height) and a destination (x, y). If you also want to put sprite operations there, add a type field to the structure plus the fields those operations require.
- At VBLANK, immediately start dispatching the queue. Set the working flag and call the dispatch function, which keeps dispatching while there are elements in the queue AND the working flag is set.
- At line 0 (line interrupt), stop. Reset the working flag so the dispatch loop ends even if there are more items.
- When the frame logic ends in the main logic loop, continue dispatching. Set the working flag and call the function.
- So, at VBLANK, if the working flag is already set, return; do not call the function twice.
That is the summary; it still has to be implemented. This way we always maximize VRAM access, because we use it while the border is being drawn (max speed). You can even try disabling/enabling sprites at the border, but I don't know if that has any effect, as I think that during the border it is the same as having the VDP disabled.
In other words, do not do CPU computing work during the border; concentrate the graphics work in that gap. You can dispatch the sound and other timing tasks at line 0 instead of at VBLANK and get the same result.
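The queue scheme described above could be modelled roughly like this. It is only a simulation sketch in Python, not MSX code: `GfxOp` and `CopyQueue` and their methods are hypothetical names, the interrupt handlers are replaced by plain method calls, and clearing the flag when the queue runs empty is my assumption.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class GfxOp:
    """One queued copy: source rectangle and destination position."""
    sx: int
    sy: int
    width: int
    height: int
    dx: int
    dy: int

class CopyQueue:
    def __init__(self):
        self.ops = deque()
        self.working = False
        self.issued = []          # stands in for commands sent to the VDP

    def enqueue(self, op):
        self.ops.append(op)

    def start(self):
        """Called at VBLANK and when the frame logic ends.

        Returns False if dispatching is already running, so the
        dispatch function is never entered twice.
        """
        if self.working:
            return False
        self.working = True
        return True

    def stop(self):
        """Called from the line-0 interrupt: end dispatching for now."""
        self.working = False

    def step(self):
        """Issue one queued operation if allowed; True if one was issued."""
        if not self.working:
            return False
        if not self.ops:
            self.working = False  # assumption: an empty queue clears the flag
            return False
        self.issued.append(self.ops.popleft())
        return True

queue = CopyQueue()
for i in range(5):
    queue.enqueue(GfxOp(sx=0, sy=16 * i, width=4, height=16, dx=252, dy=16 * i))

queue.start()            # VBLANK
queue.step(); queue.step()
queue.stop()             # line 0: display starts, stop dispatching
queue.start()            # frame logic done: continue
while queue.step():
    pass
print(len(queue.issued), queue.working)
```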
I do have something to do in the gap of time between each 4x16 copy: prepare the next copy. Previously this preparation was taking too long, so the VDP was idling between copies. Now the CPU is done in time to keep the VDP busy.
But I don’t think there’s much time left to do anything else. Switching contexts to another task alone would probably already take more time than it takes the VDP to complete such a small copy. The copy itself takes about 0.3 ms (1074 cycles), and that’s including set-up which takes 443 cycles already.
Interesting thing about focusing copies into the blanking period. Leaves some access slots wasted though, maybe you should also try to write sprite data to VRAM in between the copies ;).
I myself am not starting these copies during blanking or anything fancy like that. Luckily these column copies happen in the masked 8x212 area, so they are not visible, so I can do them whenever. For tile animations I also don’t think I have to worry much about tearing.
Unfortunately the V9958 does not have a CE interrupt like the V9990 does (my no.1 wish they had)... I did consider making “copy slots” on line interrupts and processing queued copies there. It leaves the VDP a little idle, but the CPU can continue unhindered, and the idle time can be minimised by careful tuning of the split distances. However, queueing copy commands also takes extra time, plus line interrupt overhead... I may still try to do it for the larger (16x16) copies for the tile animations, but I’m not sure if it’s going to be worthwhile.
But I don’t think there’s much time left to do anything else. Switching contexts to another task alone would probably already take more time than it takes the VDP to complete such a small copy. The copy itself takes about 0.3 ms (1074 cycles), and that’s including set-up which takes 443 cycles already.
So since the VDP command execution itself only takes 640 cycles at a minimum, probably something like 700 or so in practice considering waiting for access slots with sprites enabled, I guess in theory I could speed this up to take only 20% (3.3 ms) or so by skipping the CE check. But then of course you get CPU speed dependent timing, so that’s a no-go.
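As a quick sanity check of those figures, here is a small calculation. It assumes the standard 3.579545 MHz MSX CPU clock and a 60 Hz display (an assumption, though it matches the percentages quoted in the thread) and sixteen copies per frame.

```python
CPU_HZ = 3_579_545     # MSX Z80 clock
FRAME_HZ = 60          # assumption: 60 Hz display
COPIES_PER_FRAME = 16  # sixteen 4x16 copies, as in the post

def cycles_to_ms(cycles):
    return cycles * 1000 / CPU_HZ

def frame_share(cycles):
    """Fraction of one frame taken by the given number of CPU cycles."""
    return cycles * FRAME_HZ / CPU_HZ

for label, per_copy in (("measured, incl. set-up", 1074),
                        ("VDP minimum", 640),
                        ("VDP estimate with sprites", 700)):
    total = per_copy * COPIES_PER_FRAME
    print(f"{label}: {cycles_to_ms(total):.1f} ms, {frame_share(total):.1%} of a frame")
```

The results come out in the same range as the 31% and 20% quoted above; anything outside the copies themselves is not included in this calculation.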
I did consider making “copy slots” on line interrupts and processing queued copies there. It leaves the VDP a little idle, but the CPU can continue unhindered, and the idle time can be minimised by careful tuning of the split distances. However, queueing copy commands also takes extra time, plus line interrupt overhead... I may still try to do it for the larger (16x16) copies for the tile animations, but I’m not sure if it’s going to be worthwhile.
Addendum: if I specify the maximum execution time of the copy in advance, I could set line interrupts exactly when they complete. Let’s say for a 16x16 HMMM it takes 2048 cycles + 10% margin for access slots, I could set line interrupts every 10 lines or so (228 cycles per line).
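The spacing works out as follows. This is just the arithmetic from the addendum, with `lines_between_interrupts` as a hypothetical helper name.

```python
CYCLES_PER_LINE = 228  # CPU cycles per display line, as quoted in the post

def lines_between_interrupts(copy_cycles, margin_percent=10):
    """Whole number of lines that covers a copy plus a safety margin."""
    budget = -(-copy_cycles * (100 + margin_percent) // 100)  # round up
    return -(-budget // CYCLES_PER_LINE)                      # round up

# 2048 cycles + 10% is about 2253 cycles, just under 10 lines.
print(lines_between_interrupts(2048))
```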
It will be troublesome though when other screensplits are active, especially when they change the vertical offset, which will be the case for me. I think it’s going to be much smarter to avoid those kinds of too-clever optimisations entirely and keep the code flow simple and work within the limitations, so I can focus my efforts on game features. I’m not making a tech demo, I’m making a game. Repeating this to myself often ;).
Small copies are indeed really bad, but what can you do. The CPU access itself takes longer than the copy. Maybe the CPU access could be minimized: all the copies in the same row or column (depending on horizontal/vertical scrolling) share many attributes, so those could be set only for the first command, and afterwards only the varying ones changed.
I.e. let's copy the column at one side of the screen (multiple 4x16 copies).
- On the first one, set all the parameters.
- Use indirect access to set the registers; as they are consecutive, you can use the auto-increment feature.
- From the 2nd onwards, set only the required ones, which are SX and SY; there is no need even to set DY (see below). There is also no need to set R#45 again. That means less CPU access to the master bus (which is slow) and fewer accesses per block copy.
Look at section 4.6 of the V9938 manual to see which registers change and how. What is not clear to me is the "# Count (NY*)" entry in the table. I think it only changes if the copy reaches the screen border, and then it contains the number of lines copied. So no problem, as that is the last copy.
Doing that at screen border (VBLANK) time could increase the speed a lot.
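To make the idea above concrete, here is a small model of the port writes involved. It is a sketch in Python rather than Z80 code; `full_setup` and `partial_update` are hypothetical names, and it does not model waiting for the CE bit. The register numbers (R#32 to R#46 for the command, R#17 as the indirect pointer) follow the V9938 register map. Whether re-setting only SX and SY is really enough is exactly the open question here, so the second call is an assumption.

```python
PORT_REG = 0x99        # VDP port #1: register setup
PORT_INDIRECT = 0x9B   # VDP port #3: indirect register write
R_SX, R_SY, R_CMD = 32, 34, 46   # low bytes of SX and SY, and the command register
HMMM = 0xD0

def full_setup(values):
    """All 15 command registers R#32..R#46 via R#17 with auto-increment."""
    assert len(values) == 15
    writes = [(PORT_REG, 32), (PORT_REG, 0x80 | 17)]   # R#17 := 32
    writes += [(PORT_INDIRECT, v) for v in values]
    return writes

def partial_update(changes):
    """Direct writes for only the registers that vary between copies."""
    writes = []
    for reg, value in changes:
        writes += [(PORT_REG, value), (PORT_REG, 0x80 | reg)]
    return writes

# First 4x16 copy of a column: SX=0, SY=0, DX=252, DY=0, NX=4, NY=16.
first = full_setup([0, 0, 0, 0, 252, 0, 0, 0, 4, 0, 16, 0, 0, 0, HMMM])
# Later copies, assuming only SX and SY (low bytes) and CMD need writing.
later = partial_update([(R_SX, 16), (R_SY, 32), (R_CMD, HMMM)])
print(len(first), len(later))
```

In 256px-wide modes the high bytes simply stay 0, which is why the same full set-up also works unchanged in the wider modes.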
Maybe the CPU access could be minimized: all the copies in the same row or column (depending on horizontal/vertical scrolling) share many attributes, so those could be set only for the first command, and afterwards only the varying ones changed.
Yeah I thought about that, but I’d have to re-set SX, SY, DY, DYH, NY and CMD for the column copies and SX, SY, DX, DY, NY and CMD for the row copies. E.g. with 6x ld a,(hl); inc hl; out (99H),a; ld a,reg; out (99H),a that takes 282 cycles. Alternatively, 12 OUTIs to 99H for 216 cycles.
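Those two totals can be reproduced from the per-instruction timings. The sketch below uses standard Z80 T-states plus the one wait state per M1 cycle that MSX adds, and assumes that `ld a,reg` in the sequence means loading an immediate register number.

```python
# Z80 cycle counts on MSX: standard T-states plus one wait state per M1 cycle.
CYCLES = {
    "ld a,(hl)": 7 + 1,
    "inc hl": 6 + 1,
    "out (n),a": 11 + 1,
    "ld a,n": 7 + 1,   # assumption: `ld a,reg` loads an immediate register number
    "outi": 16 + 2,    # ED-prefixed instruction: two M1 cycles
}

per_register = (CYCLES["ld a,(hl)"] + CYCLES["inc hl"] + CYCLES["out (n),a"]
                + CYCLES["ld a,n"] + CYCLES["out (n),a"])
direct_total = 6 * per_register      # six registers, written one by one
outi_total = 12 * CYCLES["outi"]     # six value/register pairs from a table
print(direct_total, outi_total)      # 282 216
```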
Additionally I won’t need to set up indirect access and with the former probably can avoid the RAM buffer as well. So maybe I’ll try this, it doesn’t seem like too much trouble and if it shaves off a few % of the frame time...
Look at section 4.6 of the V9938 manual to see which registers change and how. What is not clear to me is the "# Count (NY*)" entry in the table. I think it only changes if the copy reaches the screen border, and then it contains the number of lines copied. So no problem, as that is the last copy.
That’s also not entirely clear to me, but I think NY behaves just like SY and DY, and changes for every line processed. It’s probably just describing an edge case when you try to copy across the maximum y coordinate (Y > 1024?).
I think some tests would be good, because maybe there is no need to re-set DY and NY.
* Coordinate at the command end (SY*, DY*)
I don't know if that means the last line processed or the next one; if it is the next one, consecutive copies could be done without updating DY.
Also some tests for NY.
Make several copies in a row without re-setting DY and NY each time and see the results.
In the best scenario, you would only have to re-set SX and SY for a column, as DX, DY, NX and NY would already be set, and then the CMD. For a row it would be SX, SY, DX, DY and NY, because the tile would reach the screen border, modifying NY, but we want a full 16px copy for the next one for the scroll.
The good thing about indirect register access is that it already prepares the system for working in hi-res modes. When working in 256px-wide modes you simply put 0 in the high part of the operation values. So if you later change your mind and use SC7, there is no need to change the code.
What is DYH?