It seems like it may be caused by the interplay between mixed OUTs and INs to the VRAM transfer port. I think there is not much information known about how these relate to each other timing-wise, how they’re handled by the VDP precisely.
Right. Actually I asked for this earlier, but no one answered
Can you still reproduce the issue if you replace the INs by OUTs?
Let me test and come back on this. Anyways, my plan was to, at some point in time, set up a simple, isolated test case, so we can see this clearer.
Also, are you certain the transfer always starts and ends during vertical blanking? It's quite sensitive to timing: if it's occasionally pushed out of the vertical blank period by a long music player frame in the interrupt handler, or by interrupts that are kept disabled for too long in the main loop, that would cause issues.
Yes. I am certain.
Ok, just verified that OUT behaves exactly like IN, in my case.
The VDP has a byte buffer. After an OUT, that byte is still waiting to get written to VRAM. If you then do a quick IN, there is no extra logic or state bits that say "oh, I'm going to memorize this extra request for later". The pending write gets destroyed.
So even if you don't care that the IN is reading trash, something has still been destroyed.
The question is whether an IN puts the VDP into read mode. Maybe it does not go into read mode but just does an address register increment.
Then you can do it this way
outi outi nop in in outi outi nop in in
Then maybe the IN gets the value from the byte buffer, the recently written OUT.
A copy of the SAT takes 3.2% CPU. With the two INs it is 16% faster. You are spending a huge effort trying to save 16% of 3.2% CPU.
So this thread is practically saying the typical "the MSX VDP is slow", for something that took 0.5% CPU... But the time is not taken by the VDP but by some unknown code.
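For reference, the arithmetic behind the 0.5% figure can be sketched in a few lines of C. This is only an illustration of the numbers quoted in the post; the function name `saved_cpu_percent` is made up for this example.

```c
/* Share of total CPU time saved when a routine that costs
   `routine_percent` of the CPU becomes `speedup_percent` faster.
   Hypothetical helper, only meant to illustrate the numbers in the post. */
double saved_cpu_percent(double routine_percent, double speedup_percent)
{
    return routine_percent * speedup_percent / 100.0;
}
```

Calling `saved_cpu_percent(3.2, 16.0)` gives about 0.51, which is the roughly 0.5% of CPU time mentioned in the post.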
I see your point wrt to the numbers... But I guess, people are different?
Three things:
1. I just question why this doesn't work, when documentation says it should
2. Some of us are obsessed with slashing cycles (can't help it)
3. If I can help getting the emulator better, I will try
If my post could be read as "VDP is slow", I'm sorry, it was not the intention.
I tested on a physical a1-wsx. It does not work, same as the others.
I can add: if I add a NOP (or any 5-cycle op) before the second IN as well, then things work. And of course, this is the crux of the whole thing. Supposedly, these two extra ops should not be required.
Some things came across wrong. I am saying the MSX can do it, while the 2-byte RAM SAT implies that there is no time to copy the 3rd byte, the P byte for pattern animation.
I've made a simple program that shows the behaviour. I used C (sdcc) and fusion-c to get the setup done with only a few lines. I also put a hook directly on 0x38 to remove any doubts about where the cycles go. The normal build script in fusion-c makes a DOS .com file, i.e. RAM is already present in page 0 when you run it.
I put .c/.com-files here: https://drive.google.com/drive/folders/1ByJARr_sqUUB4XKAImWE...
What it does:
* It sets screen 4, page 0, 16x16 pixel sprites, white background. 32 sprite patterns are prepared. They are different, but only pattern 0 is used in the sprite attribute table, and it looks like a "full block". All 32 sprites are initially placed at position (0,0) and made black.
* The interrupt routine takes the first 8 sprites and attempts to put them at x=75, starting at the top of the screen and then every 16 pixels downwards. The interrupt routine can easily be swapped between IN and OUT to advance the VRAM address pointer, but neither gives a better result than the other.
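The positions described above (x=75, one sprite every 16 pixels down) are what the hard-coded `yx_array` in the C file contains. As a small sketch, such a table could also be generated. `fill_yx_table` is a hypothetical helper, not part of fusion-c or the test program.

```c
#include <stddef.h>

/* Fill `table` with (y, x) byte pairs, one pair per sprite:
   y starts at 0 and grows by `y_step` per sprite, x is constant.
   `table` must have room for 2 * sprites bytes.
   Hypothetical helper, not part of fusion-c. */
void fill_yx_table(unsigned char *table, size_t sprites,
                   unsigned char x, unsigned char y_step)
{
    size_t i;
    for (i = 0; i < sprites; i++) {
        table[2 * i]     = (unsigned char)(i * y_step); /* Y coordinate */
        table[2 * i + 1] = x;                           /* X coordinate */
    }
}
```

With 8 sprites, x=75 and a step of 16, this produces the same 16 bytes as the table in the test program.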
Result: It looks different on physical machine than in emulator, and hence I question the no/minimum delay when in vblank.
Here is the c-file with inline asm:
#include "fusion-c/header/msx_fusion.h"
#include "fusion-c/header/vdp_sprites.h"

#define COLOR_BLACK 0x01
#define COLOR_WHITE 0x0F
#define SCR4_PAGE0_SPRITE_ATTR_TABLE_ADDRESS 0x1E00
#define SCR4_PAGE0_SPRITE_COLOR_TABLE_ADDRESS (SCR4_PAGE0_SPRITE_ATTR_TABLE_ADDRESS-512)
#define SCR4_PAGE0_SPRITE_PATTERN_TABLE_ADDRESS 0x3800
#define SPRITES_NUM 32
#define SPRITE_PATTERN_BYTES 32
#define SPRITE_COLOR_BYTES 16
#define SPRITE_ATTR_ENTRY_LEN 4
#define VDPIO 0x98
#define VDPPORT1 0x99

const unsigned char yx_array[] = { 0, 75, 16, 75, 32, 75, 48, 75, 64, 75, 80, 75, 96, 75, 112, 75 };

// ---------------------------------
void putFirst8SpritesAtPos()
{
__asm
    .macro macroSetVdpWrite ; writeaddress in AHL
        rlc h
        rla
        rlc h
        rla
        srl h
        srl h
        out ( VDPPORT1 ), a ; // set bits 15-17
        ld a,#14 | #0x80    ; // sets write bit
        out ( VDPPORT1 ), a
        ld a, l             ; // set bits 0-7
        out ( VDPPORT1 ), a
        ld a, h             ; // set bits 8-14
        or #64              ; // + write access
        out ( VDPPORT1 ), a
    .endm

    .macro macroWriteSATEntry
        outi
        outi
        out ( VDPIO ), a
        out ( VDPIO ), a
        ; in a, ( VDPIO )
        ; in a, ( VDPIO )
    .endm

    xor a
    ld hl, #SCR4_PAGE0_SPRITE_ATTR_TABLE_ADDRESS
    macroSetVdpWrite

    xor a
    ld c, #VDPIO
    ld hl, #_yx_array
    macroWriteSATEntry
    macroWriteSATEntry
    macroWriteSATEntry
    macroWriteSATEntry
    macroWriteSATEntry
    macroWriteSATEntry
    macroWriteSATEntry
    macroWriteSATEntry
__endasm;
}

// ---------------------------------
void myInterrupt() __naked
{
__asm
    push af
    push bc
    push hl

    xor a                       ; // read stats register 0 to make sure further processing happens
    out ( VDPPORT1 ), a         ; // status register number
    ld a, #0x8F                 ; // VDP register R#15
    out ( VDPPORT1 ), a         ; // out VDP register number
    in a, ( VDPPORT1 )          ; // read VDP S#0

    call _putFirst8SpritesAtPos ; // our "main routine"

    pop hl
    pop bc
    pop af
    ei
    ret
__endasm;
}

// --------------------------------------
// MSX2 only. Assumes default palette.
// Interrupt is hi-jacked, so you need to
// reset after this to re-gain control
// --------------------------------------
void main(void)
{
    SetColors( COLOR_BLACK, COLOR_WHITE, COLOR_WHITE ); // white background
    Screen( 4 );
    SetDisplayPage( 0 ); // prolly default, but anyway
    SetActivePage( 0 );  // prolly default, but anyway
    Sprite16();

    unsigned char i;
    for( i=0;i<SPRITES_NUM;i++ ) // Define 32 sprite patterns uniquely. Index/pattern #0 is a full, square block (16x16 pix)
        FillVram( SCR4_PAGE0_SPRITE_PATTERN_TABLE_ADDRESS+i*SPRITE_PATTERN_BYTES, 0xFF-i, SPRITE_PATTERN_BYTES );

    for( i=0;i<SPRITES_NUM;i++ ) // Set Color for all 16 lines in all 32 sprites to Black
        FillVram( SCR4_PAGE0_SPRITE_COLOR_TABLE_ADDRESS+i*SPRITE_COLOR_BYTES, COLOR_BLACK, SPRITE_COLOR_BYTES );

    for( i=0;i<SPRITES_NUM;i++ ) // Put all 32 sprites at pos (0,0), using pattern 0. "unused"-attribute is also set to 0
        FillVram( SCR4_PAGE0_SPRITE_ATTR_TABLE_ADDRESS+i*SPRITE_ATTR_ENTRY_LEN, 0, SPRITE_ATTR_ENTRY_LEN );

__asm
    di
    ld a, #0xC3           ; // "jp" opcode
    ld hl, #_myInterrupt
    ld ( 0x0038 ), a      ; // Short-circuit the system: Hi-jack the interrupt all-together
    ld ( 0x0039 ), hl     ; // to remove any doubts that we run our code in vblank
    ei
__endasm;

    while( 1 ) // Loop forever
        Halt();
}
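To make it easier to check what `macroSetVdpWrite` sends to port 0x99, here is a plain C sketch that computes the same four bytes from a 17-bit VRAM address. It assumes the usual V9938 sequence (address bits 14-16 into R#14, then the low byte, then bits 8-13 with the write flag set). The function name `vdp_write_setup_bytes` is made up for this example and is not part of fusion-c.

```c
#include <stdint.h>

/* Compute the four bytes written to VDP port 1 (0x99) to set up a VRAM
   write at the 17-bit address `addr`, mirroring macroSetVdpWrite.
   Hypothetical helper for checking values on paper, not part of fusion-c. */
void vdp_write_setup_bytes(uint32_t addr, uint8_t out[4])
{
    out[0] = (uint8_t)((addr >> 14) & 0x07);         /* address bits 14-16, value for R#14 */
    out[1] = (uint8_t)(14 | 0x80);                   /* register number 14 + register-write bit */
    out[2] = (uint8_t)(addr & 0xFF);                 /* address bits 0-7 */
    out[3] = (uint8_t)(((addr >> 8) & 0x3F) | 0x40); /* address bits 8-13 + write access */
}
```

For the sprite attribute table at 0x1E00 this yields 0x00, 0x8E, 0x00, 0x5E.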
Edited twice: "ld a, #0x0F" was changed to "ld a, #0x8F ; // VDP register R#15" as I fooled around with both values, and made a mistake initially.
Here are some images of the results:
openmsx17 - and the way the code was intended to behave:
Real a1-wsx, "with two out-commands", (one other run gave a different output):
Real a1-wsx, "with two in-commands":
Real a1-wsx, "with two in-commands", another run:
Real svi-738, MSX2, "with two out-commands" (blinking):
Real svi-738, MSX2, "with two in-commands" (blinking):
Great work, can you create an openMSX ticket for this on GitHub as well?
If you change the out (VDPIO),a to out (c),a, giving it two extra cycles, does the problem disappear? Just to give a ballpark idea of how big the timing error is.
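To put rough numbers on this, here is a small C sketch of the cycle count per SAT entry for the different variants. The cycle values are an assumption: standard Z80 T-states plus one wait state per M1 cycle as on MSX, which matches the "5-cycle" NOP mentioned earlier in the thread. The names are made up for this example.

```c
/* Assumed cycle counts on MSX (Z80 T-states plus one wait per M1 cycle). */
enum {
    CYC_OUTI    = 18, /* outi */
    CYC_OUT_N_A = 12, /* out (n),a */
    CYC_OUT_C_A = 14, /* out (c),a */
    CYC_IN_A_N  = 12, /* in a,(n) */
    CYC_NOP     = 5   /* nop */
};

/* Cycles for one SAT entry: two OUTIs for Y and X, followed by two
   accesses of `advance_op_cycles` each to step over the other two bytes. */
int entry_cycles(int advance_op_cycles)
{
    return 2 * CYC_OUTI + 2 * advance_op_cycles;
}
```

Under these assumptions an entry takes 60 cycles with out (n),a or in a,(n), and 64 cycles with out (c),a, i.e. two extra cycles per VRAM port access.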
As a confirmation, in the openMSX debugger I hacked in a change where the border colour is set to red during the I/O, and to blue otherwise, and indeed it happens neatly after the blanking starts. (Could be a nice addition to the test.)