First, we have to understand that, despite looking like UTAU’s CVVC, DeepVocal’s CVVX is quite different. To start configuring it, we have to know how to tell voiced consonants from unvoiced consonants. Voiced consonants are produced with the vocal cords; examples are m, n, b, j, r, v. Unvoiced consonants do not use the voice at all, only the tongue, teeth and air; examples are s, sh, ts, k, p. The configuration for voiced and unvoiced consonants is slightly different, so you need to be able to tell them apart.
Consonant information, voiced or unvoiced, must be organized in the DeepVocal Toolbox dictionary, in the respective tabs.
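If you want to sort your consonant list before filling in the dictionary tabs, a small helper like the one below can do it. This is only a sketch and not part of DeepVocal or the Toolbox: the function name `consonant_kind` is made up for illustration, and the two sets contain only the consonants mentioned in this guide (with "t" added because "ta" is used later as an unvoiced example), so extend them for your own language.

```python
# Consonant symbols taken from the examples in this guide (assumption: extend as needed).
VOICED = {"m", "n", "b", "j", "r", "v"}
UNVOICED = {"s", "sh", "ts", "k", "p", "t"}


def consonant_kind(consonant: str) -> str:
    """Return 'voiced', 'unvoiced', or 'unknown' for a consonant symbol."""
    if consonant in VOICED:
        return "voiced"
    if consonant in UNVOICED:
        return "unvoiced"
    # Anything not listed above has to be classified by ear.
    return "unknown"


for c in ["m", "sh", "k"]:
    print(c, consonant_kind(c))
```

Anything that comes back as "unknown" still has to be checked by ear: if the vocal cords vibrate while you hold the consonant, it is voiced.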
DeepVocal requires 3 types of settings for each phoneme:
-CV, CV and V_X
An example:
-sa, sa, a_s
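The naming pattern in this example can be written as a tiny function. This is just a sketch to make the pattern explicit; `cv_entries` is a hypothetical helper name, not something provided by DeepVocal.

```python
def cv_entries(consonant: str, vowel: str) -> list[str]:
    """Return the -CV, CV and V_X symbols for one consonant + vowel pair."""
    cv = consonant + vowel
    return [
        "-" + cv,                # -CV: used at the beginning of a phrase
        cv,                      # CV: used after another phoneme
        vowel + "_" + consonant, # V_X: transition from the vowel into the consonant
    ]


print(cv_entries("s", "a"))  # ['-sa', 'sa', 'a_s']
```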
“-sa” is the phoneme used at the beginning of phrases that start with “sa”. It must be configured so that the beginning of the consonant is preserved, which keeps the start of a phrase containing “sa” natural and smooth.
For the vowel to sound clean, it is very important that the vowel is stable between points 3 and 4, without any kind of deformation or interference. It must not fluctuate, so take great care with vowel stability when recording.
The next phoneme is “sa” (without the hyphen). This is the transition phoneme, the one that connects to other phonemes inside a phrase. It has an unvoiced consonant. Many people configure it wrong and put point 1 in the middle, but it must be placed at the beginning.
For unvoiced-consonant phonemes whose consonant is short, like “ka” and “ta”, the silence in the transition must be preserved in the CV and not in the V_X.
Note: when recording, be careful not to stretch the unvoiced consonants too much, otherwise the transition will sound strange and unnatural.
Point 1 should be placed at the beginning because the V_X “a_s” should not contain the whole “s” consonant, just a small beginning of it, as stated in the DVTB manual.
In phonemes whose consonant is unvoiced, the consonant is part of the CV and not the V_X.
It works differently in voiced-consonant phonemes: the consonant is already part of the V_X, and point 2 is placed in the middle of it.
The demonstration above refers to the V_X phoneme “a_m”, taken from the sample “ma_ma”.
“CV” phonemes without the hyphen whose consonant is voiced must be configured so that they continue from the part of the consonant that was preserved in the V_X.
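To keep the two cases apart while configuring, the rule can be summarized as a small lookup. This is only a restatement of the text above in code form, under the assumption that the consonant sets match the examples in this guide; `consonant_placement` is a hypothetical name and nothing here talks to DeepVocal itself.

```python
# Consonant symbols from the examples in this guide (assumption: extend as needed).
VOICED = {"m", "n", "b", "j", "r", "v"}
UNVOICED = {"s", "sh", "ts", "k", "p", "t"}


def consonant_placement(consonant: str) -> dict[str, str]:
    """Summarize which unit (V_X or CV) holds which part of the consonant."""
    if consonant in UNVOICED:
        return {
            "V_X": "only a small beginning of the consonant",
            "CV": "the whole consonant, including any short silence before it",
        }
    if consonant in VOICED:
        return {
            "V_X": "the first part of the consonant",
            "CV": "the rest of the consonant, continuing where the V_X stopped",
        }
    raise ValueError(f"unclassified consonant: {consonant!r}")


print(consonant_placement("s")["CV"])
print(consonant_placement("m")["V_X"])
```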
The vowel phonemes required by DeepVocal are -V, V_X and V_-.
-V phonemes are the beginning phonemes for vowels.
The example used will be “-a”.
The phoneme “a” without a hyphen plays a role similar to the CV of ordinary phonemes: it is used after a transition from another vowel. This sample must be taken from a V_V transition.
The sample above contains only “a” from beginning to end.
The V_X vowel transition phonemes connect to the vowels without a hyphen, creating a CVVX connection between vowels. The following example is the connection “a_i”.
Only a small fragment of the transition is used to make the connection. For example, from “a” to “i” it looks like this:
aaaaa a_i iiiiii
The V_X fragment “a_i” acts as a bridge that connects the two separate vowels “a” and “i”.
The V_X transitions between vowels must be very compact and contain only the transition itself. They cannot contain large parts of either vowel, so that the transition stays natural even when it is fast.
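The bridge idea can be sketched in a few lines: between every pair of neighbouring vowels, one V_X symbol is inserted. The function name `vowel_chain` is made up for this illustration, and the sketch only builds the symbol names; it says nothing about how DeepVocal itself picks samples.

```python
def vowel_chain(vowels: list[str]) -> list[str]:
    """Insert a V_X bridge symbol between each pair of neighbouring vowels."""
    chain: list[str] = []
    for index, vowel in enumerate(vowels):
        if index > 0:
            # Bridge from the previous vowel into this one, e.g. "a_i".
            chain.append(f"{vowels[index - 1]}_{vowel}")
        chain.append(vowel)
    return chain


print(vowel_chain(["a", "i"]))  # ['a', 'a_i', 'i']
```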
V_- phonemes are important for giving vowels a natural ending. They contain the information about how the vowel ends: as shown below, point 1 is almost at the end of the vowel, and point 2 is where it ends, where there is no trace of the vowel sound left.
An example is the ending of “a”, with the phoneme “a_-”.
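Putting all the phoneme types together, here is a rough sketch of how a whole phrase could be spelled out in CVVX symbols. It is based only on my reading of the descriptions in this guide (a hyphenated symbol at the start, a V_X bridge before each following syllable, and V_- at the end), so treat it as an assumption and not as the exact behaviour of the DeepVocal editor. `expand_phrase` is a hypothetical name, and each syllable is given as a (consonant, vowel) pair, with an empty consonant for a plain vowel.

```python
def expand_phrase(syllables: list[tuple[str, str]]) -> list[str]:
    """Spell out a phrase as CVVX symbols: start unit, bridges, and vowel ending."""
    units: list[str] = []
    previous_vowel = None
    for consonant, vowel in syllables:
        symbol = consonant + vowel
        if previous_vowel is None:
            # First syllable of the phrase: -CV or -V.
            units.append("-" + symbol)
        else:
            # Bridge from the previous vowel into this consonant (or vowel),
            # followed by the unhyphenated CV or V.
            target = consonant if consonant else vowel
            units.append(f"{previous_vowel}_{target}")
            units.append(symbol)
        previous_vowel = vowel
    if previous_vowel is not None:
        # Phrase ending: V_- closes the last vowel.
        units.append(previous_vowel + "_-")
    return units


print(expand_phrase([("s", "a"), ("m", "a")]))  # ['-sa', 'a_m', 'ma', 'a_-']
print(expand_phrase([("", "a"), ("", "i")]))    # ['-a', 'a_i', 'i', 'i_-']
```

A list like this is handy as a checklist: every symbol it produces for your test phrases should exist in the voicebank.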
This is a lot of information, but it is the basics for anyone who wants to set up voicebanks for DeepVocal.