1. (WO2019027531) NEURAL NETWORKS FOR SPEAKER VERIFICATION
Note: Text based on automatic Optical Character Recognition processes. Please use the PDF version for legal matters

CLAIMS

What is claimed is:

1. A computer-implemented method, comprising:

receiving, by a computing device, data that characterizes a first utterance;

providing, by the computing device, the data that characterizes the first utterance to a speaker verification neural network, wherein the speaker verification neural network is trained on batches of training utterances using a respective training loss for each batch that is based on, for each of multiple training speakers represented in the batch:

(i) differences among speaker representations generated by the speaker verification neural network from training utterances of the training speaker within the batch, and

(ii) for each first speaker representation generated from a training utterance of the training speaker within the batch, a similarity between the first speaker representation and a second speaker representation for a particular different training speaker represented in the batch;

obtaining, by the computing device, a speaker representation that indicates speaking characteristics of a speaker of the first utterance, wherein the speaker representation was generated by processing the data that characterizes the first utterance with the speaker verification neural network;

determining, by the computing device and based on the speaker representation, whether the first utterance is classified as an utterance of a registered user of the computing device; and

in response to determining that the first utterance is classified as an utterance of the registered user of the computing device, performing, by the computing device, an action for the registered user of the computing device.

2. The computer-implemented method of claim 1, wherein determining whether the first utterance is classified as an utterance of the registered user of the computing device comprises comparing the speaker representation for the first utterance to a speaker signature for the registered user, wherein the speaker signature is based on one or more speaker representations derived from one or more enrollment utterances of the registered user.

3. The computer-implemented method of claim 1 or claim 2, wherein the registered user is a first registered user;

the method comprising:

comparing the speaker representation for the first utterance to respective speaker signatures for multiple registered users of the computing device including the first registered user to determine a respective distance between the speaker representation for the first utterance and the respective speaker signatures for the multiple registered users; and

determining that the first utterance is classified as an utterance of the first registered user of the computing device based on the respective distance between the speaker representation for the first utterance and the respective speaker signature for the first registered user being less than a threshold distance.
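The verification step of claims 2 and 3 can be illustrated with a minimal sketch. The claims do not specify a distance metric; cosine distance and the function and variable names below are assumptions for illustration only.

```python
import numpy as np

def classify_speaker(representation, signatures, threshold):
    """Return the registered user whose speaker signature is closest to the
    utterance's speaker representation, provided that distance is below the
    threshold; otherwise return None (utterance not classified to any user).

    `signatures` maps a user id to an enrolled speaker signature, e.g. an
    average of representations derived from enrollment utterances.
    """
    best_user, best_dist = None, float("inf")
    for user, signature in signatures.items():
        # Cosine distance between the representation and the signature.
        dist = 1.0 - np.dot(representation, signature) / (
            np.linalg.norm(representation) * np.linalg.norm(signature))
        if dist < best_dist:
            best_user, best_dist = user, dist
    return best_user if best_dist < threshold else None
```

With multiple registered users, the sketch compares the representation against every signature and accepts only the nearest one, and only when it falls inside the threshold, matching the structure of claim 3.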

4. The computer-implemented method of any preceding claim, wherein:

the speaker verification neural network is stored locally on the computing device; and obtaining the speaker representation comprises processing the data that characterizes the first utterance with the speaker verification neural network on the computing device.

5. The computer-implemented method of any preceding claim, wherein for each first speaker representation generated from a training utterance of the training speaker within the batch, the particular different training speaker is selected from among multiple different training speakers represented in the batch based on a distance between the first speaker representation generated from the training utterance of the training speaker and the second speaker representation for the particular different training speaker,

wherein the second speaker representation is an averaged speaker representation generated from multiple training utterances of the particular different training speaker.

6. The computer-implemented method of any preceding claim, wherein for each training speaker of multiple training speakers represented in a batch, the differences among speaker representations generated by the speaker verification neural network from training utterances of the training speaker within the batch are determined based on distances of the speaker representations of the training speaker to an averaged speaker representation generated from two or more training utterances of the training speaker.
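The within-speaker term of claim 6 can be sketched as the mean distance of a speaker's in-batch representations to their averaged ("centroid") representation. Euclidean distance and the function name are assumptions; the claims leave the metric unspecified.

```python
import numpy as np

def within_speaker_spread(representations):
    """Mean distance of one training speaker's representations in a batch
    to the averaged speaker representation of that speaker (claim 6)."""
    reps = np.asarray(representations, dtype=float)
    centroid = reps.mean(axis=0)           # averaged speaker representation
    return float(np.mean(np.linalg.norm(reps - centroid, axis=1)))
```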

7. The computer-implemented method of any preceding claim, wherein the speaker verification neural network is a long short-term memory (LSTM) neural network.

8. The computer-implemented method of any preceding claim, wherein the data that characterizes the first utterance is feature data that characterizes acoustic features of the first utterance; and

the method further comprises generating the feature data for the first utterance from audio data for the first utterance that characterizes an audio waveform of the first utterance.

9. The computer-implemented method of any preceding claim, wherein performing the action for the registered user of the computing device comprises transitioning the computing device from a locked state to an unlocked state.

10. The computer-implemented method of any of claims 1 to 8, wherein performing the action for the registered user of the computing device comprises accessing user data from a user account of the registered user of the computing device.

11. A computer-implemented method for training a speaker verification neural network, comprising:

obtaining, by a computing system, a training batch that includes a plurality of groups of training samples, wherein:

(i) each training sample in the training batch characterizes a respective training utterance for the training sample, and

(ii) each of the plurality of groups of training samples corresponds to a different speaker such that each group consists of training samples that characterize training utterances of a same speaker that is different from the speakers of training utterances characterized by training samples in other ones of the plurality of groups of training samples;

for each training sample in the training batch, processing the training sample with the speaker verification neural network in accordance with current values of internal parameters of the speaker verification neural network to generate a speaker representation for the training sample that indicates speaker characteristics of a speaker of the respective training utterance characterized by the training sample;

for each group of training samples, averaging the speaker representations for training samples in the group to generate an averaged speaker representation for the group;

for each training sample in the training batch, determining a loss component for the speaker representation for the training sample based on:

(i) a distance between the speaker representation for the training sample and the averaged speaker representation for the group to which the training sample belongs, and

(ii) a distance between the speaker representation for the training sample and a closest averaged speaker representation among the averaged speaker representations for the groups to which the training sample does not belong; and

updating the current values of the internal parameters of the speaker verification neural network using the loss components for the speaker representations for at least some of the training samples in the training batch.
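The per-sample loss component of claim 11 can be sketched as follows. The claim specifies only two distances per sample: to the sample's own group centroid, and to the closest centroid of any other group (the hard-negative selection of claims 16 and 17). Euclidean distance and the simple difference-of-distances form below are illustrative assumptions; practical systems often use scaled cosine similarity instead.

```python
import numpy as np

def batch_loss_components(batch_reps):
    """Per-sample loss components for one training batch (claim 11).

    `batch_reps` maps a speaker id to an array of speaker representations,
    one per training sample of that speaker in the batch. Each component
    is small when the representation is near its own group's averaged
    representation and far from the closest other group's average.
    """
    # Averaged speaker representation for each group (claim 11, averaging step).
    centroids = {s: np.mean(reps, axis=0) for s, reps in batch_reps.items()}
    components = []
    for speaker, reps in batch_reps.items():
        for rep in reps:
            own = np.linalg.norm(rep - centroids[speaker])
            # Closest averaged representation among the other groups only
            # (claims 16-17: all other groups' distances are ignored).
            closest_other = min(
                np.linalg.norm(rep - c)
                for s, c in centroids.items() if s != speaker)
            components.append(own - closest_other)
    return components
```

The components would then be combined into a batch loss and back-propagated (claim 18) to update the network's internal parameters.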

12. The computer-implemented method of claim 11, further comprising iteratively updating the current values of the internal parameters of the speaker verification neural network over a plurality of training iterations,

wherein the computing system trains the speaker verification neural network on different training batches in each of at least some of the plurality of training iterations.

13. The computer-implemented method of claim 11 or claim 12, further comprising generating the training batch by:

determining criteria for the training batch, the criteria specifying (i) a total number of speakers to be represented in the training batch and (ii) a total number of training samples per speaker to include in the training batch; and

selecting training samples for inclusion in the training batch according to the criteria.
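Batch construction per claim 13 fixes the number of speakers and the number of samples per speaker. A minimal sketch, with all names assumed for illustration:

```python
import random

def build_training_batch(utterances_by_speaker, num_speakers,
                         samples_per_speaker, rng=None):
    """Select a training batch per claim 13: a specified total number of
    speakers, each contributing a specified number of training samples."""
    rng = rng or random.Random()
    # Only speakers with enough utterances can satisfy the criteria.
    eligible = [s for s, utts in utterances_by_speaker.items()
                if len(utts) >= samples_per_speaker]
    speakers = rng.sample(eligible, num_speakers)
    return {s: rng.sample(utterances_by_speaker[s], samples_per_speaker)
            for s in speakers}
```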

14. The computer-implemented method of claim 13, wherein the criteria further include a specified length for training utterances characterized by training samples in the training batch; and

the method further comprises extracting segments of the specified length from random locations of the training utterances,

wherein each training sample in the training batch characterizes the segment of the respective training utterance for the training sample to the exclusion of a portion of the respective training utterance located outside of the segment that was extracted from the respective training utterance.
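The segment extraction of claim 14 can be sketched as slicing a fixed-length window from a random position of the utterance and discarding everything outside it. The frame-list representation and function name are assumptions:

```python
import random

def extract_segment(utterance_frames, segment_length, rng=None):
    """Extract a segment of the specified length from a random location of
    the training utterance (claim 14); frames outside the segment are
    excluded from the resulting training sample."""
    rng = rng or random.Random()
    if len(utterance_frames) < segment_length:
        raise ValueError("utterance shorter than the specified length")
    start = rng.randrange(len(utterance_frames) - segment_length + 1)
    return utterance_frames[start:start + segment_length]
```

Varying `segment_length` between batches, as in claim 15, lets different training iterations see different-length segments of the same underlying utterances.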

15. The computer-implemented method of claim 14, wherein the training batch is a first training batch that is used to train the speaker verification neural network in a first training iteration; and

the method further comprises:

determining second criteria for a second training batch that is for training the speaker verification neural network in a second training iteration, the second criteria specifying a second length for training utterances characterized by training samples in the second training batch, the second length being different from the length specified by the criteria for the first training batch; and

selecting training samples for inclusion in the second training batch according to the second criteria, wherein at least one training sample selected for inclusion in the second training batch characterizes a different segment of a same training utterance that is characterized by a training sample in the first training batch.

16. The computer-implemented method of any of claims 11 to 15, further comprising, for each training sample in the training batch:

determining a respective distance between the speaker representation for the training sample and a respective averaged speaker representation for each group to which the training sample does not belong; and

selecting the closest averaged speaker representation from the respective averaged speaker representations for the groups to which the training sample does not belong based on the respective distance between the speaker representation for the training sample and the closest averaged speaker representation being less than the respective distances between the speaker representation and the respective averaged speaker representation for each other group to which the training sample does not belong.

17. The computer-implemented method of claim 16, wherein determining the loss component for the speaker representation for each training sample in the training batch comprises determining the loss component according to a loss function that does not account for the respective distances between the speaker representation and the respective averaged speaker representation for each group to which the training sample does not belong other than the group that corresponds to the closest averaged speaker representation.

18. The computer-implemented method of any of claims 11 to 17, wherein updating the current values of the internal parameters of the speaker verification neural network comprises back-propagating a batch loss that is based on the loss components for the speaker representations for the at least some of the training samples using stochastic gradient descent.

19. The computer-implemented method of any of claims 11 to 18, wherein the speaker verification neural network is a long short-term memory (LSTM) neural network.

20. One or more non-transitory computer-readable media having instructions stored thereon that, when executed by one or more processors of a computing device, cause the one or more processors to perform operations comprising:

receiving, by the computing device, data that characterizes a first utterance;

providing, by the computing device, the data that characterizes the first utterance to a speaker verification neural network, wherein the speaker verification neural network is trained on batches of training utterances using a respective training loss for each batch that is based on, for each of multiple training speakers represented in the batch:

(i) differences among speaker representations generated by the speaker verification neural network from training utterances of the training speaker within the batch, and

(ii) for each first speaker representation generated from a training utterance of the training speaker within the batch, a similarity between the first speaker representation and a second speaker representation for a different training speaker represented in the batch;

obtaining, by the computing device, a speaker representation that indicates speaking characteristics of a speaker of the first utterance, wherein the speaker representation was generated by processing the data that characterizes the first utterance with the speaker verification neural network;

determining, by the computing device and based on the speaker representation, whether the first utterance is classified as an utterance of a registered user of the computing device; and

in response to determining that the first utterance is classified as an utterance of the registered user of the computing device, performing an action for the registered user of the computing device.

21. A computer program comprising machine-readable instructions which, when executed by computing apparatus, cause it to perform the method of any of claims 1 to 19.

22. Apparatus configured to perform the method of any of claims 1 to 19.