I'm pretty sure the explanation isn't much more complicated than: that was the number of transistors available.
Re: interrupts: if we're talking about a hypothetical loop that omits fetch and decode, then it must already include refresh, so also having it check for the two (or three, depending on how you count reset) types of interrupt doesn't add a great deal of further complexity. It's also still valid to enter the interrupt with the pushed PC pointing at the start of the LDxR, exactly as in the current implementation.
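To make the pushed-PC point concrete, here is a rough Python model of LDIR as "one transfer, then repeat the same instruction while BC is non-zero". It's only a sketch: `ldir_step`, `run_ldir`, the `regs` dictionary and the `interrupt_after` parameter are names I've made up for illustration, and it models register and memory effects only, not timing or refresh.

```python
def ldir_step(mem, regs):
    """Do one LDIR iteration; return True if the instruction will repeat."""
    mem[regs["DE"]] = mem[regs["HL"]]          # (DE) <- (HL)
    regs["HL"] = (regs["HL"] + 1) & 0xFFFF
    regs["DE"] = (regs["DE"] + 1) & 0xFFFF
    regs["BC"] = (regs["BC"] - 1) & 0xFFFF
    if regs["BC"] != 0:
        return True                            # PC still points at the LDIR
    regs["PC"] = (regs["PC"] + 2) & 0xFFFF     # step past the two-byte opcode
    return False


def run_ldir(mem, regs, interrupt_after=None):
    """Run the LDIR at regs["PC"].

    If interrupt_after is given (hypothetical test hook), pretend an
    interrupt is accepted after that many iterations and return the PC
    that would be pushed. Return None if the copy ran to completion.
    """
    done = 0
    while True:
        repeats = ldir_step(mem, regs)
        done += 1
        if not repeats:
            return None
        if interrupt_after is not None and done == interrupt_after:
            return regs["PC"]                  # start of the LDIR itself


mem = bytearray(65536)
mem[0x4000:0x4004] = b"ABCD"
regs = {"PC": 0x0100, "HL": 0x4000, "DE": 0x5000, "BC": 4}

pushed = run_ldir(mem, regs, interrupt_after=2)  # interrupted mid-copy
resumed = run_ldir(mem, regs)                    # "RETI" back to pushed PC
```

Because HL, DE and BC already hold the partial progress, returning to the pushed address just carries on with the copy, which is why a loop that skips fetch and decode could use the same return address.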
From that final observation, you can even imagine intermediate implementations that might have worked well: do two iterations of read and write, then exit to the regular fetch-decode-execute path. In that case you probably don't need to worry about performing refresh inside your special loop.
But then again, if you're just throwing extra transistors around, why not expand the ALU to 8-bit? That would halve the cost of all 16-bit arithmetic and leave LDxR as they are, while hopefully bringing them closer to their hypothetical maximum speed of 13 cycles/iteration (14 on MSX).
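For a sense of scale, a back-of-envelope comparison in Python. The 21 and 16 T-state figures are the documented LDIR timings for a repeating and a final iteration on a stock Z80 with no wait states; the 13 is the hypothetical figure quoted above, and applying it to the final iteration as well is purely my simplifying assumption. `block_copy_cycles` is a made-up helper name.

```python
def block_copy_cycles(n, repeat_cost, final_cost):
    """Total T-states for an n-byte LDIR-style copy (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return repeat_cost * (n - 1) + final_cost


STOCK_REPEAT = 21    # documented: iteration that repeats (BC != 0 afterwards)
STOCK_FINAL = 16     # documented: last iteration (BC reaches 0)
HYPOTHETICAL = 13    # figure from the post; assumed for every iteration

n = 256
stock = block_copy_cycles(n, STOCK_REPEAT, STOCK_FINAL)
ideal = block_copy_cycles(n, HYPOTHETICAL, HYPOTHETICAL)
print(f"{n} bytes: stock {stock} T-states, hypothetical {ideal} T-states")
```

Under those assumptions a 256-byte copy drops from 5371 to 3328 T-states, so roughly 38% fewer cycles; the real saving would depend on how the final iteration and any wait states are handled.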