1. EP3776530 - SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS

Office
European Patent Office
Application Number 19728819
Application Date 17.05.2019
Publication Number 3776530
Publication Date 17.02.2021
Publication Kind A1
IPC
G10L 13/033
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
033 Voice editing, e.g. manipulating the voice of the synthesiser
G10L 13/04
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 25/30
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
27 characterised by the analysis technique
30 using neural networks
CPC
G10L 13/033
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
033 Voice editing, e.g. manipulating the voice of the synthesiser
G10L 13/04
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 25/30
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
27 characterised by the analysis technique
30 using neural networks
G06N 3/08
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
3 Computer systems based on biological models
02 using neural network models
08 Learning methods
G10L 25/18
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
03 characterised by the type of extracted parameters
18 the extracted parameters being spectral information of each sub-band
G10L 17/04
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
17 Speaker identification or verification
04 Training, enrolment or model building
Applicants GOOGLE LLC
Inventors JIA YE
CHEN ZHIFENG
WU YONGHUI
SHEN JONATHAN
PANG RUOMING
WEISS RON J
MORENO IGNACIO LOPEZ
REN FEI
ZHANG YU
WANG QUAN
NGUYEN PATRICK AN PHU
Designated States
Priority Data 201862672835 17.05.2018 US
Title
(DE) SYNTHESE VON SPRACHE AUS TEXT IN EINER STIMME EINES ZIELSPRECHERS UNTER VERWENDUNG NEURONALER NETZE
(EN) SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS
(FR) SYNTHÈSE DE LA PAROLE D'UN TEXTE EN UNE VOIX D'UN LOCUTEUR CIBLE À L'AIDE DE RÉSEAUX NEURONAUX
Abstract
(EN) Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
(FR) La présente invention concerne des procédés, des systèmes et un appareil, incluant des programmes d'ordinateur codés sur un support d'enregistrement informatique, pour une synthèse de la parole. Les procédés, les systèmes et l'appareil comprennent des actions consistant à obtenir une représentation audio de paroles d'un locuteur cible, obtenir un texte d'entrée pour lequel des paroles doivent être synthétisées en une voix du locuteur cible, générer un vecteur de locuteur en fournissant la représentation audio à un moteur de codage de locuteur qui est entraîné pour distinguer des locuteurs les uns des autres, générer une représentation audio du texte d'entrée parlé dans la voix du locuteur cible par la fourniture du texte d'entrée et du vecteur de locuteur à un moteur de génération de spectrogrammes qui est entraîné à utiliser des voix de locuteurs de référence pour générer des représentations audio, et fournir la représentation audio du texte d'entrée parlé dans la voix du locuteur cible aux fins de sortie.
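
The data flow named in the English abstract above (reference audio to speaker vector, then text plus speaker vector to an audio representation) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions: the function names `speaker_encoder`, `spectrogram_generator`, and `synthesize` are hypothetical, and both "engines" are simple arithmetic stand-ins for the trained neural networks the abstract refers to, not the applicants' implementation.

```python
import math
from typing import List

Vector = List[float]


def speaker_encoder(audio: Vector, dim: int = 4) -> Vector:
    """Hypothetical stand-in for the trained speaker encoder engine.

    Maps variable-length audio to a fixed-size, unit-length speaker vector.
    A real encoder would be a neural network trained to tell speakers apart.
    """
    sums = [0.0] * dim
    for i, sample in enumerate(audio):
        sums[i % dim] += sample
    norm = math.sqrt(sum(s * s for s in sums)) or 1.0  # avoid dividing by zero
    return [s / norm for s in sums]


def spectrogram_generator(text: str, speaker_vector: Vector) -> List[Vector]:
    """Hypothetical stand-in for the trained spectrogram generation engine.

    Emits one toy "frame" per input character, shifted by the speaker vector
    so that the output depends on both the text and the speaker.
    """
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        frames.append([base + v for v in speaker_vector])
    return frames


def synthesize(reference_audio: Vector, text: str) -> List[Vector]:
    """Chains the two steps in the order the abstract lists them."""
    speaker_vector = speaker_encoder(reference_audio)
    return spectrogram_generator(text, speaker_vector)


# Toy input values, chosen only for illustration.
spectrogram = synthesize([0.1, -0.2, 0.3, 0.05, 0.0, 0.4], "hello")
print(len(spectrogram), len(spectrogram[0]))
```

The point of the sketch is the separation of concerns: the speaker vector is computed once from the target speaker's audio and can then condition the generator on any input text.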