1. WO2022072936 - TEXT-TO-SPEECH USING DURATION PREDICTION

Publication Number WO/2022/072936
Publication Date 07.04.2022
International Application No. PCT/US2021/053417
International Filing Date 04.10.2021
IPC
G10L 13/10 2013.1
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
10 Prosody rules derived from text; Stress or intonation
G10L 25/30 2013.1
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/-G10L21/129
27 characterised by the analysis technique
30 using neural networks
CPC
G10L 13/027
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
G10L 13/04
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 13/10
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
10 Prosody rules derived from text; Stress or intonation
G10L 2013/105
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
10 Prosody rules derived from text; Stress or intonation
105 Duration
G10L 25/30
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
27 characterised by the analysis technique
30 using neural networks
Applicants
  • GOOGLE LLC [US]/[US]
Inventors
  • ZHANG, Yu
  • ELIAS, Isaac
  • CHUN, Byungha
  • JIA, Ye
  • WU, Yonghui
  • CHRZANOWSKI, Mike
  • SHEN, Jonathan
Agents
  • PORTNOV, Michael
Priority Data
63/087,162 02.10.2020 US
Publication Language English (en)
Filing Language English (EN)
Designated States
Title
(EN) TEXT-TO-SPEECH USING DURATION PREDICTION
(FR) SYNTHÈSE TEXTE-PAROLE À L'AIDE D'UNE PRÉDICTION DE DURÉE
Abstract
(EN) Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
(FR) L'invention concerne des procédés, des systèmes et un appareil, comprenant des programmes informatiques codés sur des supports de stockage informatiques, permettant la synthèse de données audio à partir de données de texte à l'aide d'une prédiction de durée. L'un des procédés comprend le traitement d'une séquence de texte d'entrée qui comprend un élément de texte respectif à chacune de multiples étapes temporelles d'entrée à l'aide d'un premier réseau neuronal afin de générer une séquence d'entrée modifiée comprenant, pour chaque étape temporelle d'entrée, une représentation de l'élément de texte correspondant dans la séquence de texte d'entrée; le traitement de la séquence d'entrée modifiée à l'aide d'un second réseau neuronal afin de générer, pour chaque étape temporelle d'entrée, une durée prédite de l'élément de texte correspondant dans la séquence audio de sortie; le sur-échantillonnage de la séquence d'entrée modifiée en fonction des durées prédites afin de générer une séquence intermédiaire comprenant un élément intermédiaire respectif au niveau de chacune d'une pluralité d'étapes temporelles intermédiaires; et la génération d'une séquence audio de sortie à l'aide de la séquence intermédiaire.
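The upsampling step named in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible scheme: each per-element representation is repeated for a rounded number of output frames. The abstract does not specify how the upsampling is performed, so repetition is an illustrative stand-in rather than the claimed method. The function name `upsample_by_duration`, the toy vectors, and the use of frames as the duration unit are all hypothetical. The two neural networks are replaced by hard-coded lists.

```python
from typing import List, Sequence

Vector = List[float]


def upsample_by_duration(
    representations: Sequence[Vector],
    durations: Sequence[float],
) -> List[Vector]:
    """Repeat each input representation for its predicted number of frames.

    Assumption: durations are expressed in output frames and are rounded to
    the nearest integer. This is a simple repetition scheme chosen for
    illustration only.
    """
    if len(representations) != len(durations):
        raise ValueError("need exactly one duration per input time step")

    intermediate: List[Vector] = []
    for vec, dur in zip(representations, durations):
        n_frames = max(0, int(round(dur)))  # never emit a negative count
        # Copy the vector so intermediate elements do not alias each other.
        intermediate.extend(list(vec) for _ in range(n_frames))
    return intermediate


# Toy stand-ins for the outputs of the first network (representations)
# and the second network (predicted durations), one per text element.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2.0, 1.0, 3.0]

intermediate_sequence = upsample_by_duration(encoded, predicted_durations)
print(len(intermediate_sequence))  # 6 intermediate time steps
```

In this sketch the intermediate sequence has one element per output frame, so its length is the sum of the rounded durations. A downstream model would then consume it to generate the output audio sequence.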
Latest bibliographic data on file with the International Bureau