# Realfun 3

ARTRAG wrote:

The voice encoder is tailored for human voice. The pitch tracking is specific for speech and implements a quite complex algorithm from the voice-box library.

So choir voices or singing (oooh, aaah, iiih, eeeh, etc.) are not speech-like enough to be picked up by the algorithm? Or will it just not always work out?

Send me the wav files you have tested and I'll try to analyze the problem.
The algorithm works for a single voice.
Choirs or echo should not be expected to work.

Ok, choir samples and instruments will not work with the voice encoder... so we need another tool for those.

As far as I understand, the encoder is based on taking the dominant frequency (or frequencies, if multiple channels are used) at each small time interval and putting that in the data. When played back, that resembles the input signal. That probably works best if the dominant frequency changes a lot (as in speech), as these changes then help the brain to recognise it as well.

No, that was the previous version
This one estimates the pitch in each segment
Then it tries to approximate the period of the vocal segment using 32 samples (the SCC wave)

It means that each segment is effectively sampled at 32 x pitch, so it can represent harmonics up to about 16 x pitch.

OK, so you still try to recreate the sample in SCC RAM that best matches the piece of audio in the segment, if repeated. With that you can determine the sample to use and the frequency to play it at.
How long is a segment? 1/intfreq seconds?

If I understand correctly, each segment represents 1/intfreq seconds. For each such segment we have a pitch (or actually a period, 2 bytes) and an SCC waveform (32 bytes), so 34 bytes in total per segment.
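To put a number on that, here is a minimal Python sketch of the resulting data rate. The 50 Hz and 60 Hz interrupt rates are assumptions for illustration (the usual PAL/NTSC values); the thread itself only speaks of `intfreq`.

```python
def bytes_per_second(intfreq_hz, period_bytes=2, wave_bytes=32):
    """Data rate of the encoded voice: one period + one SCC waveform per interrupt."""
    return intfreq_hz * (period_bytes + wave_bytes)

# Assumed interrupt rates, not taken from the thread.
for intfreq in (50, 60):
    print(intfreq, "Hz ->", bytes_per_second(intfreq), "bytes/s")
```

So at an assumed 50 Hz, one second of voice would take 1700 bytes.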

It is also interesting to note that, because the voices have pitch data, changing the pitch on the fly is a little more tricky, as the period scale is not linear. At the moment I have a routine in place that recalculates the shifted pitch depending on the fundamental of each segment by means of two multiplications (one 10-bit x 12-bit and one 12-bit x 16-bit, plus some rolling before and after). On Z80 this takes about 1300 cycles (rough estimate); on R800 we have muluw, so it goes quite a bit faster.

I do not know if there is a smarter (or, more importantly, faster) way to do so, but voice pitching does come with an additional cost.

For voices or samples with a fixed fundamental, like instruments, one could just write the period directly, which is of course a lot faster.
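As an illustration of why the period scale is awkward, here is a floating-point Python sketch of shifting one segment's period by a number of semitones. It is not the Z80 routine described above. The clock value and the `clock / (32 * (period + 1))` relation are assumptions taken from commonly cited SCC documentation, and all function names are made up for this example.

```python
SCC_CLOCK_HZ = 3579545  # assumed SCC input clock

def period_to_hz(period):
    """Assumed SCC relation between the 12-bit period register and the tone frequency."""
    return SCC_CLOCK_HZ / (32 * (period + 1))

def shift_period(period, semitones):
    """Return the period that sounds `semitones` higher (negative = lower)."""
    ratio = 2 ** (semitones / 12)          # frequency ratio of the shift
    shifted = round((period + 1) / ratio) - 1
    return max(0, min(0xFFF, shifted))     # clamp to the 12-bit register range

# The same +1 semitone shift costs a different number of period steps
# depending on where the segment's fundamental sits:
for p in (253, 126):
    print(p, "->", shift_period(p, 1))
```

Because the shift is a multiplication rather than a fixed offset, every segment needs its own calculation, which is where the per-segment cost comes from.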

Manuel wrote:

OK, so you still try to recreate the sample in SCC RAM that best matches the piece of audio in the segment, if repeated. With that you can determine the sample to use and the frequency to play it at.
How long is a segment? 1/intfreq seconds?

The signal is split into segments of 1/intfreq seconds.
For each segment, the voice encoder computes a pitch and a probability that the current segment is a voiced segment.
The probability is computed using heuristics based on features of the human voice.
Pitch is also computed for unvoiced segments, but there the values are not meaningful.

For voiced tracts, the segment is filtered and resampled at 32*pitch.
The result is averaged on the pitch period and sampled to get 32 values for the SCC.
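The voiced branch can be sketched roughly as follows in Python. This is only an illustration under assumptions: the filtering step is left out, linear interpolation stands in for whatever resampler the encoder really uses, and the function names are invented here.

```python
import math

def average_period(segment, sample_rate, pitch_hz, points=32):
    """Fold a segment onto one pitch period and sample it at `points` phases."""
    period = sample_rate / pitch_hz               # period in input samples, may be fractional
    n_periods = int((len(segment) - 1) / period)  # complete periods that fit in the segment
    if n_periods < 1:
        raise ValueError("segment shorter than one pitch period")
    wave = []
    for k in range(points):
        total = 0.0
        for p in range(n_periods):
            pos = (p + k / points) * period       # fractional index into the segment
            i = int(pos)
            frac = pos - i
            nxt = segment[min(i + 1, len(segment) - 1)]
            total += segment[i] * (1.0 - frac) + nxt * frac  # linear interpolation
        wave.append(total / n_periods)            # average over all periods
    return wave

def to_scc_bytes(wave):
    """Scale the averaged period to signed 8-bit values for the SCC waveform RAM."""
    peak = max(abs(v) for v in wave) or 1.0
    return [max(-128, min(127, round(v / peak * 127))) for v in wave]
```

For example, a 200 Hz tone sampled at 8000 Hz in a 1/50 s segment (160 samples) folds back into a single sine period of 32 values.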

For unvoiced tracts, the pitch is ignored. The encoder locates the frequency where the signal has its maximum energy, and the segment is filtered and resampled at 32 times the frequency of that energy peak.
The result is averaged and resampled to get 32 values for the SCC.
This strategy is used to represent high-pass noisy segments, but maybe there is something better to do.
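Locating the energy peak could look something like this in Python. A plain DFT is used here purely for illustration; the thread does not say how the encoder actually finds the peak, and `peak_frequency` is a made-up name.

```python
import cmath
import math

def peak_frequency(segment, sample_rate):
    """Return the frequency (Hz) of the DFT bin with the most energy, ignoring DC."""
    n = len(segment)
    best_bin, best_energy = 1, -1.0
    for k in range(1, n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(segment))
        energy = abs(acc) ** 2
        if energy > best_energy:
            best_bin, best_energy = k, energy
    return best_bin * sample_rate / n
```

The unvoiced segment would then be resampled at 32 times the returned frequency, in the same way the voiced branch uses 32 times the pitch.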

About the problems with instruments, my guess is that the pitch is estimated correctly.
What could be incorrect is the voiced/unvoiced probability, thus the encoder chooses the wrong branch for representing the signal.
If this is the problem, the workaround would only be a matter of setting the threshold that discriminates between voiced and unvoiced segments to 0.
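In other words, something along these lines (a hypothetical Python sketch; the default threshold of 0.5 and the `>=` comparison are assumptions, not the encoder's real values):

```python
def choose_branch(voiced_probability, threshold=0.5):
    """Pick the encoding branch for one segment from its voiced probability."""
    return "voiced" if voiced_probability >= threshold else "unvoiced"

# With the threshold at 0 every segment takes the voiced (pitch-based) branch,
# which is what an instrument sample with a correctly estimated pitch would want.
print(choose_branch(0.1))        # below the assumed default threshold
print(choose_branch(0.1, 0.0))   # threshold forced to 0
```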

GhostwriterP wrote:

If I understand correctly, each segment represents 1/intfreq seconds. For each such segment we have a pitch (or actually a period, 2 bytes) and an SCC waveform (32 bytes), so 34 bytes in total per segment.

It is also interesting to note that, because the voices have pitch data, changing the pitch on the fly is a little more tricky, as the period scale is not linear. At the moment I have a routine in place that recalculates the shifted pitch depending on the fundamental of each segment by means of two multiplications (one 10-bit x 12-bit and one 12-bit x 16-bit, plus some rolling before and after). On Z80 this takes about 1300 cycles (rough estimate); on R800 we have muluw, so it goes quite a bit faster.

I do not know if there is a smarter (or, more importantly, faster) way to do so, but voice pitching does come with an additional cost.

For voices or samples with a fixed fundamental, like instruments, one could just write the period directly, which is of course a lot faster.

For pitching, the encoder tries to put in the header the note (as a period) closest to the average pitch of the first half second of the sample. You can use that base note as a reference: with a bit of approximation, the pitching can be obtained by using a table of notes and the differences between the frequency of the base note and the pitch of each segment.

To me this sounds like the encoder is compromising on quality to make the sample slightly more pitch-adjustable? Or does that approach also improve quality overall?

Quote:

approximation ... using a table of notes and the differences between the frequency of the base note and the pitch of each segment.

Well, this only works if you stay reasonably close to the base frequency; the further away, the bigger the error. But indeed, for speech that may not need to be in tune, and where only a limited range of adjustment is usable anyway, this may not necessarily be a problem.
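One possible reading of the table approach, and of why its error grows with the distance from the base note, sketched in Python. Working in period units, adding the per-segment offset to the target note, and computing the target note on the fly instead of reading it from a real note table are all assumptions of this sketch, not the encoder's or player's actual code.

```python
def exact_shift(seg_period, semitones):
    """Reference: scale the segment's period by the exact frequency ratio."""
    return seg_period / 2 ** (semitones / 12)

def table_shift(seg_period, base_period, semitones):
    """Approximation: target note (as from a table) plus the segment's offset from the base note."""
    target_note = base_period / 2 ** (semitones / 12)  # stands in for a table lookup
    return target_note + (seg_period - base_period)

def shift_error(seg_period, base_period, semitones):
    """Absolute difference, in period units, between the approximation and the exact value."""
    return abs(table_shift(seg_period, base_period, semitones)
               - exact_shift(seg_period, semitones))

# Hypothetical segment slightly below a base note of period 500.
for st in (0, 2, 7, 12):
    print(st, "semitones -> error", shift_error(520, 500, st))
```

The error is zero when no shift is applied and grows with the size of the shift, since the segment's offset from the base note is carried over unscaled.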
