Voice synthesis on ISR

Page 36/36

By Grauw

Ascended (10581)

14-12-2021, 12:33

ARTRAG wrote:

Perhaps someone has a better idea after looking at the real data that now has to be represented using SCC features.

https://github.com/artrag/voicenc_scc/blob/master/unvoiced%2...

The unvoiced segments are those where the lower black line (voiced probability) goes below 50%.
There are two kinds of examples in this picture:
The one at about 2.75 sec, where the spectrum is concentrated around 5 kHz
The two at about 2.4 sec and at 3.9 sec, where the spectrum is low-pass and has low energy

Proposals?

It seems at least the 2-8 kHz noise will not be very well represented by a tone with 60 Hz fundamental…

While if I look at the low noise’s waveform at 2.4 and 3.2 sec it does seem fairly periodic, even roughly matching the period of the waves that come before it. Perhaps the period is less than 2x the window of the FFT so it’s not captured well? Or the power does not exceed a threshold. It could be that the pitch detection algorithm is tailored to match speech, which has many harmonics, while this is more sinusoidal. Sinusoids are captured pretty well by autocorrelation-based algorithms, so maybe fall back on them when PEFAC doesn’t give a good match?
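A rough sketch of what such an autocorrelation fallback could look like, in plain Python. The function name, the 50 to 500 Hz search range and the returned strength value are assumptions for illustration; none of this comes from voicenc_scc or from PEFAC.

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental of a frame from its autocorrelation peak.

    Returns (frequency_hz, strength). strength is the autocorrelation at
    the chosen lag divided by the frame energy (close to 1.0 for a very
    periodic frame). fmin/fmax are illustrative defaults.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]            # remove DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0, 0.0                      # silent frame: no estimate
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(n - 1, int(sample_rate / fmin))
    best_lag, best_r = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag == 0:
        return 0.0, 0.0                      # no positive correlation found
    return sample_rate / best_lag, best_r

# Usage: a pure 100 Hz sinusoid sampled at 8 kHz, 50 ms long.
sr = 8000
sine = [math.sin(2 * math.pi * 100 * i / sr) for i in range(400)]
freq, strength = autocorr_pitch(sine, sr)
print(f"estimated {freq:.1f} Hz, strength {strength:.2f}")
```

The strength value could serve as the "good match" test: only trust the fallback when it is above some threshold that would have to be tuned by ear.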

Perhaps one approach would be to do some experiments first by manually selecting frequencies for those segments and seeing whether one sounds particularly good, before moving to the question of how to generate it.

By ARTRAG

Enlighted (6845)

14-12-2021, 18:27

The low noise waveforms at 2.4 and 3.2 sec are correctly encoded, as if they were voiced. I think that, as you said, the pitch detection algorithm is working in a critical area due to the lower energy of the input signal.
If you look at the signals
https://github.com/artrag/voicenc_scc/blob/master/unvoiced_r...
it seems that the strategy of choosing the maximum of the spectrum instead of the pitch works.

What is badly encoded is the frame around 2.75 sec...

The unvoiced signal is a passband signal... maybe if I try to translate it to a lower frequency before sampling...

By Grauw

Ascended (10581)

14-12-2021, 18:55

As I understood the paper, PEFAC has a stage which filters white noise (see section 2.1) which is probably part of why it doesn’t perform so well at 2.75s. The paper boasts about its ability to pick up speech frequencies under negative SNR conditions in the very first sentence, but in our case we generally feed it relatively clean signals and to us the noise is also interesting.

I don’t know if this can or needs to be tweaked, or can be taken into account in another way, but it may be good to know about at least.

By ARTRAG

Enlighted (6845)

14-12-2021, 20:12

Yes, the frame is definitely unvoiced, so it is correct that PEFAC doesn't return a pitch. The voicebox implementation of PEFAC returns the probability that the frame is voiced/unvoiced, so now I change encoding strategy according to that parameter.
Now unvoiced frames are encoded as periodic, the same as voiced frames, but the pitch is assumed to be the maximum of the spectrum. This seems to work fine for the waveforms at 2.4 and 3.2 sec but is definitely bad for the frame at 2.75 sec.
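To make the switch concrete, here is a minimal Python sketch of the idea: use the tracked pitch when the frame is probably voiced, otherwise take the frequency of the strongest spectral bin. The function names and the naive DFT are assumptions for illustration and are not the actual voicenc_scc code; the 0.5 threshold mirrors the 50% voiced probability mentioned earlier in the thread.

```python
import cmath
import math

def spectrum_peak_hz(frame, sample_rate):
    """Frequency of the strongest non-DC bin of a naive DFT."""
    n = len(frame)
    best_bin, best_mag = 1, -1.0
    for k in range(1, n // 2):
        acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n

def choose_base_hz(frame, sample_rate, pitch_hz, voiced_prob, threshold=0.5):
    """Tracked pitch for voiced frames, spectral peak for unvoiced ones."""
    if voiced_prob >= threshold:
        return pitch_hz
    return spectrum_peak_hz(frame, sample_rate)

# Usage: a 500 Hz sinusoid, 20 ms at 8 kHz, treated as voiced then unvoiced.
sr = 8000
frame = [math.sin(2 * math.pi * 500 * i / sr) for i in range(160)]
print(choose_base_hz(frame, sr, pitch_hz=120.0, voiced_prob=0.9))
print(choose_base_hz(frame, sr, pitch_hz=120.0, voiced_prob=0.2))
```

For a frame like the one at 2.75 sec the peak would land in the kHz range, which is exactly where a single periodic wave stops being a good model.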

Maybe I should discriminate unvoiced frames a bit further: regular periodic waves with low energy could be encoded as they are now, while pass-band noise at high frequency should be encoded with a different strategy.
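One cheap way to make that distinction might be the zero-crossing rate, sketched below in Python. This is a swapped-in heuristic, not something the encoder does today; the function names and the 2 kHz cutoff are hypothetical and would need tuning against the real data.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def classify_unvoiced(frame, sample_rate, cutoff_hz=2000.0):
    """Label an unvoiced frame 'low_periodic' or 'high_noise'.

    A sinusoid at f Hz crosses zero about 2*f times per second, so the
    zero-crossing rate gives a rough dominant-frequency estimate.
    cutoff_hz is an arbitrary, illustrative threshold.
    """
    est_hz = zero_crossing_rate(frame) * sample_rate / 2.0
    return "high_noise" if est_hz >= cutoff_hz else "low_periodic"

# Usage with two synthetic 20 ms frames at 16 kHz.
sr = 16000
low = [math.sin(2 * math.pi * 200 * i / sr + 0.1) for i in range(320)]
high = [math.sin(2 * math.pi * 5000 * i / sr + 0.1) for i in range(320)]
print(classify_unvoiced(low, sr), classify_unvoiced(high, sr))
```

Frames labelled low_periodic would keep the current encoding, and only the high_noise ones would need the new strategy.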

By ARTRAG

Enlighted (6845)

23-01-2022, 11:32

Having seen what kind of magic GhostwriterP has done in Realfun 3
https://www.youtube.com/watch?v=9HObgSBSByg
and here
https://www.msx.org/forum/msx-talk/development/realfun-3?pag...
I was thinking of ways to raise the speech quality for singing voices... here are a couple of almost random ideas:
1) double the ISR rate using a line interrupt
This would halve the frame length: it would go from 20 ms/17 ms to 10 ms/8 ms. The effect on the accuracy of the pitch tracking should be positive. Moreover, consecutive waveforms should be closer to each other, and the MSE matching should give better results. The implementation burden could be quite heavy given the current architecture of the music players... @GhostwriterP would it be feasible?
2) use two SCC channels with a time offset
This implies a new encoding algorithm. The two channels should use the same pitch we have now, but the two waves should be played with a delay equal to half of the pitch period. The sum of the two channels is equivalent to a wave of 64 bins that can be optimised to match the signal period in time. It would give double the resolution in frequency at the cost of a player with a variable delay in it.
3) use two SCC channels with a frequency offset
The two channels should use different base frequencies: the first could use the pitch we have now, the second a higher harmonic, in theory 16 times the pitch (but a lower value might work better). The waves should be optimised using a low-pass and a high-pass filter with the cutoff at 16 times the pitch (or whatever multiple is used above). The sum of the two channels should give a better reconstruction of the higher part of the spectrum, though I have difficulty imagining how to optimise the wave phases between frames.

All the above solutions need twice the CPU bandwidth and twice the ROM/RAM storage, and they can be combined (at N times the cost in terms of resources...)
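For reference, the frame lengths in idea 1 follow directly from the interrupt rate. A tiny Python check, assuming the usual 50 Hz and 60 Hz vertical interrupt rates (the function name is just for illustration):

```python
def frame_ms(isr_hz):
    """Frame length in milliseconds for a given interrupt rate."""
    return 1000.0 / isr_hz

for vsync_hz in (50, 60):
    single = frame_ms(vsync_hz)
    doubled = frame_ms(2 * vsync_hz)   # extra line interrupt mid-frame
    print(f"{vsync_hz} Hz: {single:.1f} ms -> {doubled:.1f} ms")
```

This assumes the line interrupt is placed so that the two updates per video frame are evenly spaced; otherwise the two half-frames would have different lengths.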

By GhostwriterP

Hero (663)

23-01-2022, 12:32

Doubling the ISR rate is technically feasible (for MSX 2 and higher).

For Realfun 3 it will require quite some work to make it happen though. Several issues and solutions come to mind, but you never know what else you might run into once you start ;-)

Double size means half the singing time... I think the 128 KB sample buffer is not going to be enough then (only just 64 s as it is now at 60 Hz). But I suppose one could compose the song without singing (or use lower quality first) and use the "signals" to manually trigger the (high quality) samples in your own program.
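A quick back-of-the-envelope check of those numbers in Python. The per-frame size is only inferred from the quoted figures (128 KB, about 64 s, 60 Hz), so treat it as approximate; the function names are made up for the example.

```python
BUFFER_BYTES = 128 * 1024  # sample buffer size quoted above

def implied_bytes_per_frame(buffer_bytes, seconds, frame_rate_hz):
    """Average bytes stored per frame if the buffer lasts `seconds`."""
    return buffer_bytes / (seconds * frame_rate_hz)

def playable_seconds(buffer_bytes, bytes_per_frame, frame_rate_hz):
    """How long the buffer lasts at a given frame rate."""
    return buffer_bytes / (bytes_per_frame * frame_rate_hz)

per_frame = implied_bytes_per_frame(BUFFER_BYTES, 64, 60)
print(f"about {per_frame:.1f} bytes per frame")
print(f"{playable_seconds(BUFFER_BYTES, per_frame, 120):.0f} s at 120 Hz")
```

So with the same frame size, doubling the rate leaves roughly half a minute of singing in the buffer.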

By Grauw

Ascended (10581)

23-01-2022, 22:22

It would need to be tested, but I’m not sure the options above would give a really significant improvement in quality. I think each will only have a modest effect. Also, to me the beauty of the current method is that it is very cheap on player performance and costs just 1 channel.
