WO2020141108 - METHOD, APPARATUS AND SYSTEM FOR HYBRID SPEECH SYNTHESIS

Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

CLAIMS

1. A method of decoding an original speech signal for hybrid adversarial-parametric speech synthesis, wherein the method includes the steps of:

(a) receiving quantized original linear prediction coding parameters estimated by applying linear prediction coding analysis filtering to an original speech signal and a quantized compressed representation of a residual of the original speech signal;

(b) dequantizing the original linear prediction coding parameters and the compressed representation of the residual;

(c) inputting the dequantized compressed representation of the residual into a decoder part of a Generator for applying adversarial mapping from the compressed residual domain to a fake (first) signal domain;

(d) outputting, by the decoder part of the Generator, a fake speech signal;

(e) applying linear prediction coding analysis filtering to the fake speech signal for obtaining a corresponding fake residual; and

(f) reconstructing the original speech signal by applying linear prediction coding cross-synthesis filtering to the fake residual and the dequantized original linear prediction coding analysis parameters.
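Steps (e) and (f) of claim 1 can be illustrated with a minimal pure-Python sketch. It is not taken from the application: the function names (`lpc_analysis_filter`, `lpc_synthesis_filter`, `cross_synthesis`) and the coefficient convention a = [1, a1, ..., ap] are assumptions made for illustration, and the coefficients of the fake signal are passed in rather than estimated.

```python
def lpc_analysis_filter(signal, a):
    """FIR inverse filter: e[n] = sum_k a[k] * x[n-k], with a[0] == 1."""
    residual = []
    for n in range(len(signal)):
        acc = 0.0
        for k, coeff in enumerate(a):
            if n - k >= 0:
                acc += coeff * signal[n - k]
        residual.append(acc)
    return residual


def lpc_synthesis_filter(residual, a):
    """All-pole filter: y[n] = e[n] - sum_{k>=1} a[k] * y[n-k]."""
    out = []
    for n in range(len(residual)):
        acc = residual[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


def cross_synthesis(fake_speech, fake_a, original_a):
    """Hypothetical helper combining steps (e) and (f).

    fake_a: coefficients describing the fake signal's own envelope (assumed given).
    original_a: dequantized coefficients of the original signal.
    """
    fake_residual = lpc_analysis_filter(fake_speech, fake_a)  # step (e)
    return lpc_synthesis_filter(fake_residual, original_a)    # step (f)
```

When `fake_a` and `original_a` are identical, the two filters are inverses of each other and the input comes back unchanged; with different coefficient sets, the excitation of the fake signal is shaped by the spectral envelope of the original.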

2. The method according to claim 1, wherein the order used for linear prediction coding analysis filtering in step (e) is the same as the order used for estimating the original linear prediction coding parameters.

3. The method according to claim 1 or claim 2, wherein the Generator is a Generator trained in an Adversarial Network setting including the Generator and a Discriminator, and wherein training of the Generator and the Discriminator is based on one or more loss functions.

4. The method according to claim 3, wherein the decoder part of the Generator includes an adversarial generation segment including L layers with N filters in each layer, wherein L is a natural number > 1 and wherein N is a natural number > 1, wherein the N filters operate with a stride of 2 and the size of the N filters is the same in each of the L layers, and wherein in at least one of the L layers a transposed convolution is performed followed by a gated tanh unit, and wherein an output layer subsequently follows the last of the L layers of the adversarial generation segment, wherein the output layer includes N filters operating with a stride of 1, and wherein a 1D convolution operation is performed in the output layer followed by a tanh operation.
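One layer of the kind recited in claim 4 (a stride-2 transposed convolution followed by a gated tanh unit) might look roughly like the single-channel sketch below. The claim does not define the gate, so the common form tanh(filter) * sigmoid(gate) is assumed here; the function names, the absence of padding and the kernel values in the usage note are likewise illustrative assumptions.

```python
import math


def transposed_conv1d(x, kernel, stride=2):
    """Single-channel transposed 1D convolution without padding."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            # Each input sample scatters a scaled copy of the kernel.
            out[i * stride + j] += value * weight
    return out


def gated_tanh(filter_act, gate_act):
    """Element-wise tanh(filter) * sigmoid(gate); the sigmoid gate is an assumption."""
    return [math.tanh(f) / (1.0 + math.exp(-g))
            for f, g in zip(filter_act, gate_act)]


def upsampling_layer(x, filter_kernel, gate_kernel):
    """Hypothetical layer: two stride-2 transposed convolutions feed the gate."""
    f = transposed_conv1d(x, filter_kernel, stride=2)
    g = transposed_conv1d(x, gate_kernel, stride=2)
    return gated_tanh(f, g)
```

For example, `upsampling_layer([0.5, -0.5, 0.25, 1.0], [0.3, 0.2], [0.1, -0.1])` returns 8 values for 4 inputs: with a kernel of size 2 and a stride of 2, each layer doubles the length of its input.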

5. The method according to claim 4, wherein the decoder part of the Generator further includes a context decoding segment prior to the adversarial generation segment.

6. The method according to claim 5, wherein the context decoding segment includes L = 1 layers with N filters, wherein N is a natural number > 1, followed by one or more blocks of softmax gated tanh units, wherein the size of the N filters is 1 and the N filters operate with a stride of 1, and wherein a 1D convolution operation is performed in the L = 1 layers and wherein the output of the one or more blocks of softmax gated tanh units of the context decoding segment is concatenated with a random noise vector (z).

7. A method of encoding an original speech signal for hybrid adversarial-parametric speech synthesis, wherein the method includes the steps of:

(a) receiving the original speech signal;

(b) applying linear prediction coding analysis filtering to the original speech signal for obtaining a corresponding residual;

(c) inputting the obtained residual into an encoder part of a Generator for encoding the residual;

(d) outputting, by the encoder part of the Generator, a compressed representation of the residual;

(e) applying linear prediction coding analysis filtering to the original speech signal for estimating original linear prediction coding parameters; and

(f) quantizing and transmitting the original linear prediction coding parameters and the compressed representation of the residual,

wherein the order used for linear prediction coding analysis filtering in step (e) is higher than in step (b).

8. The method according to claim 7, wherein the order used for linear prediction coding analysis filtering in step (b) is 16 and in step (e) is in a range between 16 to 50.
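The two analysis orders in claims 7 and 8 can be made concrete with a short sketch that estimates linear prediction coefficients at a chosen order. The claims do not say how the parameters are estimated; the autocorrelation method with the Levinson-Durbin recursion is assumed here, and all function names as well as the default parameter order of 32 are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """Biased autocorrelation r[0..max_lag] of one frame (no windowing)."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return a = [1, a1, ..., a_order] for the predictor error filter A(z)."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate frame (e.g. silence): stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        error *= (1.0 - k * k)
    return a


def estimate_lpc(signal, order):
    return levinson_durbin(autocorrelation(signal, order), order)


def encoder_lpc_orders(frame, residual_order=16, parameter_order=32):
    """Hypothetical helper: coefficients for step (b) and for step (e)."""
    return estimate_lpc(frame, residual_order), estimate_lpc(frame, parameter_order)
```

Under these assumptions, the order-16 coefficients would drive the analysis filter that produces the residual in step (b), while the higher-order set is what gets quantized and transmitted in step (f).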

9. The method according to claim 7 or claim 8, wherein the Generator is a Generator trained in an Adversarial Network setting including the Generator and a Discriminator, and wherein training of the Generator and the Discriminator is based on one or more loss functions.

10. The method according to claim 9, wherein the encoder part of the Generator includes L layers with N filters in each layer, wherein L is a natural number > 1 and wherein N is a natural number > 1, wherein the size of the N filters is the same in each of the L layers and the N filters operate with a stride of 2, and wherein in at least one layer of the L layers, a 1D convolution operation is performed followed by a non-linear operation including one or more of a parametric rectified linear unit (PReLU), a rectified linear unit (ReLU), a leaky rectified linear unit (LReLU), an exponential linear unit (eLU) and a scaled exponential linear unit (SeLU).
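An encoder layer as described in claim 10 (a stride-2 1D convolution followed by a non-linearity) can be sketched for a single channel as follows. PReLU is picked from the listed options; the function names, the lack of padding and the fixed slope of 0.25 (learned during training in a real PReLU) are assumptions for illustration only.

```python
def conv1d(x, kernel, stride=2):
    """Single-channel 1D convolution without padding.

    Uses the cross-correlation form (no kernel flip), as deep learning
    frameworks commonly do.
    """
    span = len(kernel)
    return [sum(kernel[j] * x[start + j] for j in range(span))
            for start in range(0, len(x) - span + 1, stride)]


def prelu(values, slope=0.25):
    """Parametric ReLU with a fixed slope for negative inputs."""
    return [v if v >= 0.0 else slope * v for v in values]


def encoder_layer(x, kernel, slope=0.25):
    """Hypothetical layer: stride-2 convolution followed by PReLU."""
    return prelu(conv1d(x, kernel, stride=2), slope)
```

For instance, `encoder_layer([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, -1.0])` yields three values for six inputs, which shows how each stride-2 layer roughly halves the temporal resolution on the way to the compressed representation.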

11. The method according to claim 10, wherein an output layer subsequently follows the last of the L layers of the encoder part of the Generator, wherein the output layer includes N filters operating with a stride of 1 and wherein a 1D convolution operation is performed in the output layer followed by a non-linear operation including one or more of a parametric rectified linear unit (PReLU), a rectified linear unit (ReLU), a leaky rectified linear unit (LReLU), an exponential linear unit (eLU) and a scaled exponential linear unit (SeLU).

12. An apparatus for encoding an original speech signal for hybrid adversarial-parametric speech synthesis, wherein the apparatus includes:

(a) a receiver for receiving the original speech signal;

(b) a linear prediction coding analysis filter for applying linear prediction coding analysis filtering to the original speech signal for obtaining a corresponding residual;

(c) an encoder part of a Generator configured to receive at an input of the encoder part the obtained residual and to output at an output of the encoder part a compressed representation of the residual, for encoding the residual;

(d) a linear prediction coding analysis filter for applying linear prediction coding analysis filtering to the original speech signal for estimating original linear prediction coding parameters; and

(e) means for quantizing and transmitting the original linear prediction coding parameters and the compressed representation of the residual,

wherein the order used for linear prediction coding analysis filtering in step (d) is higher than the order used for linear prediction analysis filtering in step (b).

13. An apparatus for decoding an original speech signal for hybrid adversarial-parametric speech synthesis, wherein the apparatus includes:

(a) a receiver for receiving quantized original linear prediction coding parameters estimated by applying linear prediction coding analysis filtering to an original speech signal and a quantized compressed representation of a residual of the original speech signal;

(b) means for dequantizing the original linear prediction coding parameters and the compressed representation of the residual;

(c) a decoder part of a Generator for generating a fake speech signal;

(d) a linear prediction analysis filter for applying linear prediction coding analysis filtering to the fake speech signal for obtaining a corresponding fake residual; and

(e) a linear prediction coding synthesis filter for reconstructing the original speech signal by applying linear prediction coding cross-synthesis filtering to the fake residual and the dequantized original linear prediction coding analysis parameters.

14. A computer program product comprising a computer-readable storage medium with instructions adapted to cause the device to carry out the method according to any of the claims 1 to 11 when executed by a device having processing capability.