1. CN112689871 - SYNTHESIS OF SPEECH FROM TEXT IN VOICE OF TARGET SPEAKER USING NEURAL NETWORKS

Office
China
Application Number 201980033235.1
Application Date 17.05.2019
Publication Number 112689871
Publication Date 20.04.2021
Publication Kind A
IPC
G10L 13/033
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
033 Voice editing, e.g. manipulating the voice of the synthesiser
G10L 13/04
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 25/30
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
27 characterised by the analysis technique
30 using neural networks
CPC
G10L 13/033
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
033 Voice editing, e.g. manipulating the voice of the synthesiser
G10L 13/04
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
13 Speech synthesis; Text to speech systems
02 Methods for producing synthetic speech; Speech synthesisers
04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
G10L 25/30
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
27 characterised by the analysis technique
30 using neural networks
G06N 3/08
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
3 Computer systems based on biological models
02 using neural network models
08 Learning methods
G10L 25/18
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
25 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
03 characterised by the type of extracted parameters
18 the extracted parameters being spectral information of each sub-band
G10L 17/04
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
17 Speaker identification or verification
04 Training, enrolment or model building
Applicants GOOGLE INC.
谷歌有限责任公司
Inventors JIA YE
贾晔
CHEN ZHIFENG
陈智峰
WU YONGHUI
吴永辉
SHEN JONATHAN
乔纳森·沈
PANG RUOMING
庞若鸣
WEISS RON J.
罗恩·J·韦斯
MORENO IGNACIO LOPEZ
伊格纳西奥·洛佩斯·莫雷诺
REN FEI
任飞
ZHANG YU
张羽
WANG QUAN
王泉
NGUYEN PATRICK AN PHU
帕特里克·安·蒲·阮
Agents 中原信达知识产权代理有限责任公司 11219
Priority Data 62/672,835 17.05.2018 US
Title
(EN) SYNTHESIS OF SPEECH FROM TEXT IN VOICE OF TARGET SPEAKER USING NEURAL NETWORKS
(ZH) 使用神经网络以目标讲话者的话音从文本合成语音
Abstract
(EN) Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
(ZH) 用于语音合成的方法、系统和装置,包括在计算机存储介质上编码的计算机程序。所述方法、系统和装置包括以下动作:获得目标讲话者的语音的音频表示;获得将要以目标讲话者的话音合成语音的输入文本;通过将所述音频表示提供给被训练以将讲话者彼此区分开的讲话者编码器引擎来生成讲话者矢量;通过将所述输入文本和讲话者矢量提供给已使用参考讲话者的话音进行训练以生成音频表示的声谱图生成引擎来生成以所述目标讲话者的话音讲出的所述输入文本的音频表示;以及提供以所述目标讲话者的话音讲出的所述输入文本的所述音频表示以进行输出。
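
Illustrative sketch (not part of the patent record): the data flow described in the abstract can be outlined as a two-stage pipeline, in which a speaker encoder turns reference audio into a speaker vector and a spectrogram generator is conditioned on that vector together with the input text. Everything below is a hypothetical stand-in. The names `speaker_encoder`, `spectrogram_generator`, and `synthesize`, the list-of-floats frame representation, and the arithmetic inside each function are assumptions chosen for illustration only. In the described system both stages are trained neural networks; here they are replaced by trivial arithmetic so that the example runs on its own.

```python
import math
from typing import List

Frame = List[float]  # hypothetical representation of one spectrogram frame


def speaker_encoder(reference_audio: List[Frame]) -> List[float]:
    """Toy stand-in for the trained speaker encoder engine.

    Averages the reference frames and L2-normalises the result to give a
    fixed-size speaker vector. A real encoder would be a neural network
    trained to distinguish speakers from one another.
    """
    if not reference_audio:
        raise ValueError("reference audio must contain at least one frame")
    dim = len(reference_audio[0])
    mean = [sum(frame[i] for frame in reference_audio) / len(reference_audio)
            for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid dividing by zero
    return [x / norm for x in mean]


def spectrogram_generator(text: str, speaker_vector: List[float]) -> List[Frame]:
    """Toy stand-in for the trained spectrogram generation engine.

    Emits one frame per input character, offset by the speaker vector, so the
    output depends on both the text and the speaker.
    """
    frames = []
    for ch in text:
        text_feature = (ord(ch) % 32) / 32.0  # arbitrary toy text feature
        frames.append([text_feature + s for s in speaker_vector])
    return frames


def synthesize(reference_audio: List[Frame], text: str) -> List[Frame]:
    """Chain the two stages: reference audio -> speaker vector -> output frames."""
    speaker_vector = speaker_encoder(reference_audio)
    return spectrogram_generator(text, speaker_vector)


# Usage: two identical reference frames and a two-character input text.
reference = [[0.0, 3.0, 4.0], [0.0, 3.0, 4.0]]
output = synthesize(reference, "hi")
```

The point of the sketch is the separation of concerns: the speaker vector is computed once from the target speaker's audio and then passed, alongside arbitrary text, to a generator that never needs to have been trained on that speaker.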