(WO2018106971) SYSTEM AND METHOD FOR NEURAL NETWORK BASED SPEAKER CLASSIFICATION
Note: Text obtained by automatic optical character recognition.
Only the version in PDF format has legal value.

WHAT IS CLAIMED IS:

1. A method for classifying speakers comprises:

receiving, by a speaker recognition system comprising a processor and memory, input audio comprising speech from a speaker;

extracting, by the speaker recognition system, a plurality of speech frames containing voiced speech from the input audio;

computing, by the speaker recognition system, a plurality of features for each of the speech frames of the input audio;

computing, by the speaker recognition system, a plurality of recognition scores for the plurality of features;

computing, by the speaker recognition system, a speaker classification result in accordance with the recognition scores; and

outputting, by the speaker recognition system, the speaker classification result.

2. The method of claim 1, wherein the extracting the speech frames comprises:

dividing the input audio into the plurality of speech frames;

computing a short term energy of each speech frame;

computing a spectral centroid of each speech frame;

classifying a speech frame as a voiced frame in response to determining that the short term energy of the speech frame exceeds a short term energy threshold and that the spectral centroid of the speech frame exceeds a spectral centroid threshold, and classifying the speech frame as an unvoiced frame otherwise;

retaining the voiced frames and removing the unvoiced frames; and

outputting the retained voiced frames as the speech frames containing voiced speech.
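
For illustration, a minimal Python sketch of the voiced-frame selection in claim 2; the frame length and both threshold values are assumptions, since the claim does not fix them:

```python
import numpy as np

def extract_voiced_frames(audio, sample_rate, frame_sec=0.025,
                          energy_thresh=1e-4, centroid_thresh=1000.0):
    """Voiced-frame selection per claim 2: keep a frame only if its
    short term energy and its spectral centroid both exceed their
    thresholds; drop it as unvoiced otherwise."""
    n = int(frame_sec * sample_rate)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    voiced = []
    for start in range(0, len(audio) - n + 1, n):
        frame = audio[start:start + n]
        energy = np.mean(frame ** 2)                      # short term energy
        mag = np.abs(np.fft.rfft(frame))
        centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
        if energy > energy_thresh and centroid > centroid_thresh:
            voiced.append(frame)                          # retained voiced frame
    return np.asarray(voiced)                             # unvoiced frames removed
```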

3. The method of claim 1, wherein the computing the plurality of features for each of the speech frames comprises:

dividing the speech frames into overlapping windows of audio;

normalizing each of the windows of audio;

computing mel-frequency cepstral coefficients, deltas, and double deltas for each window; and

computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window.

4. The method of claim 3, wherein the normalizing each of the windows of audio comprises applying speaker-level mean-variance normalization.

5. The method of claim 3, wherein the computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window comprises:

grouping the windows into a plurality of overlapping frames, each of the overlapping frames comprising a plurality of adjacent windows;

for each overlapping frame of the overlapping frames, concatenating the mel-frequency cepstral coefficients, the deltas, and the double deltas of the adjacent windows to generate a plurality of features of the overlapping frame; and

outputting the features of the overlapping frames as the plurality of features.
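
For illustration, a sketch of claims 3 through 5 combined, assuming librosa for the MFCC and delta computations (the claims name no library); the 25 ms window, 10 ms hop, 13 coefficients, and 5-window context are illustrative values only:

```python
import numpy as np
import librosa  # assumed library; the claims do not name one

def compute_features(audio, sample_rate, n_mfcc=13, context=5):
    # Claim 4: speaker-level mean-variance normalization, applied here
    # to all of the speaker's audio at once.
    audio = (audio - audio.mean()) / (audio.std() + 1e-12)
    # Claim 3: MFCCs per overlapping window, plus deltas and double deltas.
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sample_rate),
                                hop_length=int(0.010 * sample_rate))
    per_window = np.vstack([mfcc,
                            librosa.feature.delta(mfcc),           # deltas
                            librosa.feature.delta(mfcc, order=2)]  # double deltas
                           ).T                                     # (windows, 3 * n_mfcc)
    # Claim 5: concatenate each run of `context` adjacent windows into
    # one overlapping frame.
    stacked = [per_window[i:i + context].ravel()
               for i in range(len(per_window) - context + 1)]
    return np.asarray(stacked)                 # (frames, context * 3 * n_mfcc)
```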

6. The method of claim 1, wherein the computing the speaker classification result comprises forward propagating the plurality of features through a trained multi-class neural network, the trained multi-class neural network being trained to compute the recognition scores, each of the recognition scores corresponding to a confidence that the speech of the input audio corresponds to speech from one of a plurality of enrolled speakers.
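
A minimal numpy sketch of the forward propagation in claim 6; the ReLU hidden activations and softmax output layer are assumptions, since the claim specifies only a trained multi-class neural network:

```python
import numpy as np

def recognition_scores(features, weights, biases):
    """Forward-propagate feature vectors through the trained network;
    row i, column j is the confidence that frame i was spoken by
    enrolled speaker j."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)               # hidden layers (ReLU assumed)
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)          # softmax confidences
```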

7. The method of claim 6, wherein the trained multi-class neural network is trained by:

receiving training data comprising audio comprising speech from a plurality of enrolled speakers, the audio being labeled with the speakers;

extracting a plurality of features from the audio for each of the enrolled speakers;

applying speaker-level mean-variance normalization to the features extracted from the audio for each of the enrolled speakers; and

training the multi-class neural network to classify an input feature vector as one of the plurality of enrolled speakers.

8. The method of claim 7, wherein the training the multi-class neural network comprises iteratively reducing a regularization parameter of a cost function.
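
A sketch of the training procedure of claims 7 and 8, with a single softmax layer standing in for the full network; the learning rate, epoch count, and exponential decay schedule for the regularization parameter are assumptions:

```python
import numpy as np

def train_classifier(X, y, n_speakers, epochs=50, lr=0.1,
                     lam0=1.0, decay=0.9):
    # Claim 7: speaker-level mean-variance normalization of the
    # features extracted for each enrolled speaker.
    for s in np.unique(y):
        Xs = X[y == s]
        X[y == s] = (Xs - Xs.mean(axis=0)) / (Xs.std(axis=0) + 1e-12)
    W = np.zeros((X.shape[1], n_speakers))
    lam = lam0
    for _ in range(epochs):
        logits = X @ W
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = e / e.sum(axis=1, keepdims=True)
        p[np.arange(len(y)), y] -= 1.0               # softmax cross-entropy gradient
        W -= lr * (X.T @ p / len(y) + lam * W)       # L2-regularized cost
        lam *= decay                                 # claim 8: iteratively reduce
                                                     # the regularization parameter
    return W
```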

9. The method of claim 1, wherein the speaker classification result comprises an identification of a particular speaker of a plurality of enrolled speakers, and

wherein the identification of the particular speaker is computed by identifying a highest recognition score of the plurality of recognition scores and by identifying the particular speaker associated with the highest recognition score.
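
Claim 9's identification step reduces to an argmax over the recognition scores; pooling frame-level scores by averaging is an assumption, since the claim does not say how per-frame scores are combined:

```python
import numpy as np

def identify_speaker(scores, speaker_ids):
    """Return the enrolled speaker whose recognition score is highest."""
    return speaker_ids[int(np.argmax(scores.mean(axis=0)))]
```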

10. The method of claim 1, further comprising receiving an allegation that the speaker is a particular enrolled speaker of a plurality of enrolled speakers,

wherein the speaker classification result is a speaker verification indicating whether the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers.

11. The method of claim 10, further comprising computing the speaker verification by:

comparing the recognition score corresponding to the particular enrolled speaker to a threshold value; and

outputting a speaker verification indicating that the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers in response to determining that the recognition score exceeds the threshold value and determining that the recognition score is higher than the recognition scores of all other enrolled speakers.

12. The method of claim 11, wherein the threshold comprises a speaker-specific threshold, and wherein the speaker-specific threshold is computed by solving for an intersection between a first Gaussian distribution representing the probability that the speaker of the input audio is one of the enrolled speakers and a second Gaussian distribution representing the probability that the speaker of the input audio is not one of the enrolled speakers.
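
A sketch of claims 11 and 12: setting the two Gaussian densities equal yields a quadratic in the score, and the root lying between the two means is the speaker-specific threshold; frame-averaged score pooling is again an assumption:

```python
import numpy as np

def speaker_threshold(mu_enrolled, sigma_enrolled, mu_impostor, sigma_impostor):
    """Claim 12: intersection of the 'is the enrolled speaker' Gaussian
    and the 'is not the enrolled speaker' Gaussian, found by equating
    the two densities and solving the resulting quadratic."""
    a = 1.0 / sigma_enrolled**2 - 1.0 / sigma_impostor**2
    b = 2.0 * (mu_impostor / sigma_impostor**2 - mu_enrolled / sigma_enrolled**2)
    c = (mu_enrolled**2 / sigma_enrolled**2
         - mu_impostor**2 / sigma_impostor**2
         + 2.0 * np.log(sigma_enrolled / sigma_impostor))
    lo, hi = sorted((mu_impostor, mu_enrolled))
    for r in np.roots([a, b, c]):
        if abs(r.imag) < 1e-9 and lo <= r.real <= hi:
            return float(r.real)        # the root between the two means
    raise ValueError("no intersection between the two means")

def verify(scores, claimed_idx, threshold):
    """Claim 11: accept only if the claimed speaker's score exceeds the
    speaker-specific threshold and is the highest of all scores."""
    s = scores.mean(axis=0)
    return s[claimed_idx] > threshold and claimed_idx == int(np.argmax(s))
```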

13. A system for classifying speakers comprising:

a processor; and

memory storing instructions that, when executed by the processor, cause the processor to:

receive input audio comprising speech from a speaker;

extract a plurality of speech frames containing voiced speech from the input audio;

compute a plurality of features for each of the speech frames of the input audio;

compute a plurality of recognition scores for the plurality of features;

compute a speaker classification result in accordance with the recognition scores; and

output the speaker classification result.

14. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to extract the speech frames by:

dividing the input audio into the plurality of speech frames;

computing a short term energy of each speech frame;

computing a spectral centroid of each speech frame;

classifying a speech frame as a voiced frame in response to determining that the short term energy of the speech frame exceeds a short term energy threshold and that the spectral centroid of the speech frame exceeds a spectral centroid threshold, and classifying the speech frame as an unvoiced frame otherwise;

retaining the voiced frames and removing the unvoiced frames; and

outputting the retained voiced frames as the speech frames containing voiced speech.

15. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the plurality of features for each of the speech frames by:

dividing the speech frames into overlapping windows of audio;

normalizing each of the windows of audio;

computing mel-frequency cepstral coefficients, deltas, and double deltas for each window; and

computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window.

16. The system of claim 15, wherein the normalizing each of the windows of audio comprises applying speaker-level mean-variance normalization.

17. The system of claim 15, wherein the computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window comprises:

grouping the windows into a plurality of overlapping frames, each of the overlapping frames comprising a plurality of adjacent windows;

for each overlapping frame of the overlapping frames, concatenating the mel-frequency cepstral coefficients, the deltas, and the double deltas of the adjacent windows to generate a plurality of features of the overlapping frame; and

outputting the features of the overlapping frames as the plurality of features.

18. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the speaker classification result by forward propagating the plurality of features through a trained multi-class neural network, the trained multi-class neural network being trained to compute the recognition scores, each of the recognition scores corresponding to a confidence that the speech of the input audio corresponds to speech from one of a plurality of enrolled speakers.

19. The system of claim 18, wherein the trained multi-class neural network is trained by:

receiving training data comprising audio comprising speech from a plurality of enrolled speakers, the audio being labeled with the speakers;

extracting a plurality of features from the audio for each of the enrolled speakers;

applying speaker-level mean-variance normalization to the features extracted from the audio for each of the enrolled speakers; and

training the multi-class neural network to classify an input feature vector as one of the plurality of enrolled speakers.

20. The system of claim 19, wherein the training the multi-class neural network comprises iteratively reducing a regularization parameter of a cost function.

21. The system of claim 13, wherein the speaker classification result comprises an identification of a particular speaker of a plurality of enrolled speakers, and

wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the identification of the particular speaker by identifying a highest recognition score of the plurality of recognition scores and by identifying the particular speaker associated with the highest recognition score.

22. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to receive an allegation that the speaker is a particular enrolled speaker of a plurality of enrolled speakers,

wherein the speaker classification result is a speaker verification indicating whether the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers.

23. The system of claim 22, wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the speaker verification by:

comparing the recognition score corresponding to the particular enrolled speaker to a threshold value; and

outputting a speaker verification indicating that the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers in response to determining that the recognition score exceeds the threshold value and determining that the recognition score is higher than the recognition scores of all other enrolled speakers.

24. The system of claim 23, wherein the threshold comprises a speaker-specific threshold, and wherein the speaker-specific threshold is computed by solving for an intersection between a first Gaussian distribution representing the probability that the speaker of the input audio is one of the enrolled speakers and a second Gaussian distribution representing the probability that the speaker of the input audio is not one of the enrolled speakers.