How does CVVX work in DeepVocal? Plus any new info about the release for DV2?

Like the title says, I downloaded the Toolbox and the editor to see how DV voicebanks work, but because DV libraries can be encoded, I can’t take a look at them myself. There aren’t a lot of resources for making a DeepVocal voicebank in English, and most ported DeepVocal voicebanks sound choppy and really unnatural (for example, Ritsu Namine DV vs. Rabia).

Plus, the new DV2 engine still seems to be out of reach and lacks any proper context. Why is Boxstar absent? I haven’t seen them on Twitter or YouTube. It’s a bit discouraging and sad, considering how excited I am for its release.

CVVX: Consonant Vowel - Vowel X, where “X” is either a Consonant or a Vowel.
A note is either a CV, a V, or an ending C. The VX is an automatic blending function that lets a CV note blend into a V or an ending C. Unfortunately, a consonant cannot blend into another consonant using the VX function, so standalone consonants may sound choppy depending on the usage. A complete explanation of the DeepVocal Toolbox functions can be viewed here.

Regarding English voicebanks in DeepVocal, Bunny DV’s English reclist is available for download, which makes English possible in DeepVocal. Alternatively, you may use Bukimi’s or WADE’s reclist for your English voicebank, though those reclists aren’t exactly ‘public’.

DV2’s release includes a better engine that reduces engine noise and makes voicebanks sound brighter. There have been reports in the DeepVocal fandom Discord server that it may cause popping noises in particular voicebanks. Additionally, DV2 was released for download 4 hours ago as of the making of this post. You can download it at the DeepVocal download page.

First, we have to understand that, despite looking like UTAU’s CVVC, DeepVocal’s CVVX is quite different. To start configuring it, we have to know how to tell voiced consonants from unvoiced consonants. Voiced consonants use the vocal cords; examples are m, n, b, j, r, v. Unvoiced consonants do not use the voice, only the tongue, teeth and air; examples are s, sh, ts, k, p. The configuration for voiced and unvoiced consonants is slightly different, so you need to be able to tell them apart.
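As a rough illustration of that distinction, here is a minimal Python sketch that sorts consonants into the two groups. The sets contain only the examples listed above, and the name `consonant_kind` is made up for this post; it is not part of DeepVocal or the Toolbox.

```python
# Consonant groups taken only from the examples in this post.
# Extend them to match the consonants in your own reclist.
VOICED = {"m", "n", "b", "j", "r", "v"}
UNVOICED = {"s", "sh", "ts", "k", "p"}


def consonant_kind(consonant):
    """Return "voiced" or "unvoiced" for a consonant symbol (hypothetical helper)."""
    symbol = consonant.lower()
    if symbol in VOICED:
        return "voiced"
    if symbol in UNVOICED:
        return "unvoiced"
    raise ValueError("unknown consonant: " + repr(consonant))


print(consonant_kind("m"))   # voiced
print(consonant_kind("sh"))  # unvoiced
```

Knowing the group up front tells you which of the two configuration styles described below applies to a given phoneme.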

Consonant information, voiced or unvoiced, must be organized in the DeepVocal Toolbox dictionary, in its respective tabs.

DeepVocal requires 3 types of settings for each phoneme:
-CV, CV and V_X
An example:
-sa, sa, a_s
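To make the naming pattern concrete, here is a small Python sketch that builds the three entry names for one consonant and vowel pair. The function name `cv_entries` is hypothetical and only illustrates the naming used in this post; it does not read or write any Toolbox file.

```python
def cv_entries(consonant, vowel):
    """Return the -CV, CV and V_X entry names for one consonant + vowel pair."""
    cv = consonant + vowel
    return ["-" + cv, cv, vowel + "_" + consonant]


print(cv_entries("s", "a"))  # ['-sa', 'sa', 'a_s']
print(cv_entries("m", "a"))  # ['-ma', 'ma', 'a_m']
```

Running it over every consonant and vowel in your reclist gives a quick checklist of the entries you still have to configure.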
“-sa” is the phoneme for phrases that begin with “sa”. It must be configured so that the beginning of the consonant is preserved, so that a phrase starting with “sa” sounds natural and smooth.

For the vowel to sound clean, it is very important that the vowel is stable between points 3 and 4, without any kind of deformation or interference. It must not fluctuate, so be very careful with vowel stability when recording.

The next phoneme is “sa” (without the hyphen). It is the transition phoneme, the one used inside a phrase when connecting from other phonemes. This phoneme has an unvoiced consonant, and many people configure it wrong by putting point 1 in the middle; it must be placed at the beginning.

For unvoiced-consonant phonemes whose consonants are short, like “ka” and “ta”, the silence in the transition must be preserved in the CV and not in the V_X.

Note: be careful when recording not to extend the unvoiced consonants too much, otherwise the transition will sound strange and unnatural.

Point 1 should be placed at the beginning because the VX “a_s” should not contain the “s” consonant, just a small beginning of it, as stated in the DVTB manual.


In phonemes whose consonant is unvoiced, the consonant is part of the CV and not the VX.

It works differently in voiced-consonant phonemes: the consonant is already part of the VX, and point 2 is in the middle of it.


The above demonstration refers to a VX phoneme “a_m” taken from the sample “ma_ma”.

“CV” phonemes without the transition hyphen that have a voiced consonant must be configured so that they continue from the part of the consonant that was preserved in the V_X.

The vowel phonemes required by DeepVocal are -V, V_X, V_-

-V are the beginning phonemes for vowels.
The example used will be “-a”.


The phoneme “a” without a hyphen plays a role similar to the CV of regular phonemes: it serves as the transition from other vowels. This sample must be taken from a V_V transition.

The sample above contains only “a” from beginning to end.

The V_X vowel transition phonemes connect to the vowels without a hyphen, creating a CVVX connection between vowels. The following example is the connection “a_i”.


Only a small fragment of the transition is used to make the connection. For example, from “a” to “i” it looks like this:
aaaaa a_i iiiiii

The VX fragment a_i acts as a bridge that connects the two loose vowels “a” and “i”.
The V_X transitions between vowels must be very compact and contain only the transition. They cannot contain large parts of the vowels, so that the transition stays natural even when it is fast.
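As a quick way to see which vowel entries a voicebank needs, here is a minimal Python sketch that lists them for one vowel: the -V start, the plain V, one V_X bridge per following sound, and the V_- ending from the list above. The function name `vowel_entries` and the `next_sounds` parameter are made up for illustration.

```python
def vowel_entries(vowel, next_sounds):
    """List the entry names for one vowel (hypothetical helper).

    next_sounds: the vowels or consonants this vowel should be able to
    bridge into through a V_X entry.
    """
    names = ["-" + vowel, vowel]                      # -V start and plain V
    names += [vowel + "_" + x for x in next_sounds]   # V_X bridges
    names.append(vowel + "_-")                        # V_- ending
    return names


print(vowel_entries("a", ["i"]))  # ['-a', 'a', 'a_i', 'a_-']
```

Passing both vowels and consonants in `next_sounds` gives entries such as a_i and a_s in a single list.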

V_- phonemes are important for giving vowels a natural ending. They contain the information on how the vowel finishes: as shown below, point 1 is almost at the end of the vowel, and point 2 is where it ends, where there is no trace of the vowel sound left.
An example is the ending of “a”, with the phoneme a_-
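Putting the pieces together, the following Python sketch shows how the entries described in this post would line up for a whole phrase: a hyphenated start, a V_X bridge before each following syllable, and a V_- ending. It only follows the naming described here; the function name `chain_entries` is hypothetical, the real engine does its own selection and blending internally, and ending consonants are not handled.

```python
def chain_entries(syllables):
    """Return the sequence of entry names for a phrase.

    syllables: list of (consonant, vowel) pairs; use "" as the consonant
    for a bare vowel. Hypothetical helper for illustration only.
    """
    if not syllables:
        return []
    entries = []
    prev_vowel = None
    for consonant, vowel in syllables:
        if prev_vowel is None:
            # Phrase start: -CV or -V
            entries.append("-" + consonant + vowel)
        else:
            # V_X bridge from the previous vowel into this syllable
            target = consonant if consonant else vowel
            entries.append(prev_vowel + "_" + target)
            # Plain CV or V without the hyphen
            entries.append(consonant + vowel)
        prev_vowel = vowel
    # V_- ending for the last vowel
    entries.append(prev_vowel + "_-")
    return entries


print(chain_entries([("s", "a"), ("m", "a")]))  # ['-sa', 'a_m', 'ma', 'a_-']
print(chain_entries([("", "a"), ("", "i")]))    # ['-a', 'a_i', 'i', 'i_-']
```

The second call mirrors the “aaaaa a_i iiiiii” example: the a_i fragment sits between the two vowels as the bridge.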

This is a lot of information, but it covers the basics for anyone who wants to set up voicebanks for DeepVocal.
