According to Microsoft, VALL-E is a “neural codec language model” built on EnCodec, which Meta introduced in October 2022. Instead of synthesizing speech by manipulating waveforms, as most other text-to-speech systems do, VALL-E generates discrete audio codec codes from text and acoustic prompts. It uses EnCodec to break a short sample of a person’s voice into discrete components called “tokens”, capturing tone and intonation, and then draws on its training data to predict the remaining tokens.
In this way, the system learns the speaker’s voice and manner of speaking from the sample, and can then read any written text aloud in that person’s voice and style.
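The tokens EnCodec produces come from residual vector quantization (RVQ): each quantizer stage encodes the residual left over by the previous stage, so one audio frame becomes a small stack of discrete indices. Here is a minimal sketch of that idea; the codebook sizes and the frame are toy values chosen for illustration, not EnCodec’s actual parameters:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each stage picks the codebook
    entry nearest the current residual, emitting one discrete token
    (a codebook index) per stage."""
    tokens = []
    residual = frame.astype(float)
    for cb in codebooks:
        # distance from the residual to every entry in this codebook
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        # subtract the chosen entry; the next stage quantizes what remains
        residual = residual - cb[idx]
    return tokens

rng = np.random.default_rng(0)
# toy setup: 2 quantizer stages, each with a 4-entry codebook of 8-dim vectors
codebooks = [rng.normal(size=(4, 8)) for _ in range(2)]
frame = rng.normal(size=8)  # stand-in for one encoded audio frame
tokens = rvq_encode(frame, codebooks)
print(tokens)  # one discrete token index per stage
```

A language model like VALL-E then treats these token sequences the way a text model treats words, predicting the codec tokens for new speech conditioned on the text and the short voice prompt.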
Microsoft trained VALL-E’s speech synthesis capabilities on Meta’s LibriLight audio dataset, which contains over 60,000 hours of English speech from more than 7,000 speakers, derived primarily from LibriVox public-domain audiobooks. For VALL-E to produce a good result, the voice in the three-second prompt must resemble a voice in its training data.
Microsoft has not released the VALL-E code, in order to prevent the technology from being misused. The researchers appear to be aware of the potential social harm it could cause.