In an attempt to save a few cycles I've jumped on the bandwagon and looked into this. Seemingly there isn't much to gain while the screen is on, but I wanted to do something during vertical blank, immediately at the start of the interrupt. From what I can see, norakomi's question was not about vblank but about when the screen is on, so this is slightly off to the side but still very much related.
Both the wiki and the MAP pages mention that there is no speed limit in this case. Result:
It works in openmsx17.
It doesn't work on physical machines.
Tested on a1-wsx (T9769C chipset) and svi-738-msx2.
My test: updating the sprite attribute table from a list of bytes holding y/x values, like this: y, x, y, x, y, x ...
Code: an unrolled macro like this:
.macro writeSpriteAttrTableRamToVramMasksOne
    outi
    outi
    in a, ( VDPIO )
    in a, ( VDPIO )
.endm

==>

.rept 8
    writeSpriteAttrTableRamToVramMasksOne
.endm
Unless I misunderstood something or did something wrong, openMSX does not emulate this 100% accurately, and there actually is a speed limit here.
Did you emulate the same machine as the real machine you tested on? Can you share something executable that can be easily tested?
Did you emulate the same machine as the real machine you tested on?
Yes. During this development, I have tested on these machines in openmsx:
* Sanyo_phc-70FD2 (+2 wait-states)
* Panasonic_FS-A1WSX (+1 wait-states) - this is the one I tested on physically
* Philips_NMS_8250
All of the above work in openMSX. I don't have an openMSX machine "profile" for my modded SVI-738 MSX2.
Can you share something executable that can be easily tested?
I know, I know, to fix this we must have an isolated test case! I don't have that; it may be a bit of work to set up, which will take some time. My current results come from Lilly's Saga, which comes with a lot of code that is not ready for sharing at this point in time. Assuming I'm not mistaken here, and this (edge) case really is an error, how interesting is it for the openMSX team to fix it? It would help me to know your priorities, so I can decide how to prioritise providing an isolated test case.
It is always interesting to fix things that do not match real hardware behaviour. But I can't promise that it will be fixed within a given time. It may take a lot of time to find out what is wrong exactly and it is very well possible that no one has that time for a (long) while.
All I can say is that a test case makes the probability that it will be investigated a LOT higher.
A copy of the SAT takes 3.2% CPU. And then with the two INs it is 16% faster. You are spending a huge effort trying to save 16% of 3.2% CPU.
So, this thread is practically saying the typical "the MSX VDP is slow" for something that takes 0.5% CPU... But the time is not taken by the VDP but by some unknown code.
It's not my game, so I'm not sure. But 0.5% CPU might mean having one or two extra enemies/items in the game without slowdowns, which severely impacts level design and gameplay. So I think all these optimization efforts are always worth it.
But something got broken. With this "optimization" of copying only 2 of 4 bytes, pattern animation is lost.
When you need 0.5% CPU, why not search in the other 97% (!)? Why insist on searching in that tiny 3% and then break something?
haha, yeah, you have a point. I'm sure Bengalack has searched everywhere else too. But I'm putting words in his mouth, so, I'll let him comment
Effectively...
And again the usual mantra: "VDP is slow".
The VDP is not so slow in most operations. The very high overhead is the VRAM address pointer setup, which requires a lot of ceremony, especially on MSX2 machines.
For some contiguous block operations (VRAM operations tend to be block operations) the speed, even in the active area, is not that bad.
A typical OTIR takes the same amount of time as an LDIR, and thanks to the auto-increment feature of the VDP it can save some Z80 registers and increments that could be used for extra things.
There is still a speed limit; it is just lower than the fastest the Z80 can do (10 cycles).
But this looks to me more specific than just a wrongly documented speed limit during vertical blank. It seems like it may be caused by the interplay between mixed OUTs and INs to the VRAM transfer port. I think there is not much information known about how these relate to each other timing-wise, how they’re handled by the VDP precisely.
Can you still reproduce the issue if you replace the INs by OUTs?
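To make that concrete, a minimal sketch of what such an OUT-only variant might look like, in the same macro style as the original post. The macro name `writeSpriteAttrTableOutOnly` is hypothetical; `VDPIO` is assumed to be the same port constant as in the original code. It uses `out (n), a` because that instruction has the same base timing as `in a, (n)` (11 T-states on a Z80), so the spacing between VRAM accesses should stay the same and only the direction changes. Note that this writes whatever is in A over the pattern and colour bytes, so it is only useful as a diagnostic, not as a replacement.

```
; Diagnostic only: same access spacing as the OUTI/OUTI/IN/IN macro,
; but every access to the VRAM data port is a write.
; A is assumed to hold a harmless filler value chosen by the caller.
.macro writeSpriteAttrTableOutOnly
    outi                    ; y  (from the RAM list at HL)
    outi                    ; x  (from the RAM list at HL)
    out ( VDPIO ), a        ; overwrites the pattern byte with A
    out ( VDPIO ), a        ; overwrites the colour byte with A
.endm

.rept 8
    writeSpriteAttrTableOutOnly
.endm
```

If the corruption seen on real hardware disappears with this variant, that would point at the mixed OUT/IN sequence rather than at the access rate itself.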
Also, are you certain the transfer always starts and ends during vertical blanking? It’s quite sensitive to timing. If it’s occasionally pushed out of the vertical blank period by a long music player frame in the interrupt handler, or by interrupts that are kept disabled for too long in the main loop, that would cause issues.