(WO2018106971) SYSTEM AND METHOD FOR NEURAL NETWORK BASED SPEAKER CLASSIFICATION
Note: This text was produced by automatic optical character recognition (OCR). For legal purposes, please refer to the PDF version.

WHAT IS CLAIMED IS:

1. A method for classifying speakers comprises:

receiving, by a speaker recognition system comprising a processor and memory, input audio comprising speech from a speaker;

extracting, by the speaker recognition system, a plurality of speech frames containing voiced speech from the input audio;

computing, by the speaker recognition system, a plurality of features for each of the speech frames of the input audio;

computing, by the speaker recognition system, a plurality of recognition scores for the plurality of features;

computing, by the speaker recognition system, a speaker classification result in accordance with the recognition scores; and

outputting, by the speaker recognition system, the speaker classification result.
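The overall flow of claim 1 (which the system claims 13–24 mirror) can be sketched as follows. The three callables are hypothetical stand-ins for the extraction, feature-computation, and scoring steps recited in the claim; the patent does not name any such functions.

```python
import numpy as np

def classify_speaker(input_audio, extract_voiced, compute_features, score):
    """End-to-end sketch of claim 1 using caller-supplied helpers."""
    frames = extract_voiced(input_audio)   # speech frames containing voiced speech
    features = compute_features(frames)    # plurality of features per frame
    scores = score(features)               # one recognition score per enrolled speaker
    return int(np.argmax(scores))          # speaker classification result
```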

2. The method of claim 1, wherein the extracting the speech frames comprises:

dividing the input audio into the plurality of speech frames;

computing a short term energy of each speech frame;

computing a spectral centroid of each speech frame;

classifying a speech frame as a voiced frame in response to determining that the short term energy of the speech frame exceeds a short term energy threshold and that the spectral centroid of the speech frame exceeds a spectral centroid threshold, and classifying the speech frame as an unvoiced frame otherwise;

retaining the voiced frames and removing the unvoiced frames; and

outputting the retained voiced frames as the speech frames containing voiced speech.
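The voiced/unvoiced test of claim 2 can be sketched as below. The frame length and both thresholds are illustrative assumptions; the claim does not specify concrete values.

```python
import numpy as np

def extract_voiced_frames(audio, frame_len=400, energy_thresh=0.01,
                          centroid_thresh=0.15):
    """Divide audio into frames and retain only the voiced ones."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-term energy: mean squared amplitude of each frame.
    energy = np.mean(frames ** 2, axis=1)

    # Spectral centroid: magnitude-weighted mean of normalized frequency
    # (0 = DC, 1 = Nyquist).
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.linspace(0.0, 1.0, spectrum.shape[1])
    centroid = (spectrum * freqs).sum(axis=1) / (spectrum.sum(axis=1) + 1e-12)

    # A frame is voiced only when BOTH measures exceed their thresholds.
    voiced = (energy > energy_thresh) & (centroid > centroid_thresh)
    return frames[voiced]
```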

3. The method of claim 1, wherein the computing the plurality of features for each of the speech frames comprises:

dividing the speech frames into overlapping windows of audio;

normalizing each of the windows of audio;

computing mel-frequency cepstral coefficients, deltas, and double deltas for each window; and

computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window.

4. The method of claim 3, wherein the normalizing each of the windows of audio comprises applying speaker-level mean-variance normalization.

5. The method of claim 3, wherein the computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window comprises:

grouping the windows into a plurality of overlapping frames, each of the overlapping frames comprising a plurality of adjacent windows;

for each overlapping frame of the overlapping frames, concatenating the mel-frequency cepstral coefficients, the deltas, and the double deltas of the adjacent windows to generate a plurality of features of the overlapping frame; and

outputting the features of the overlapping frames as the plurality of features.
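The grouping-and-concatenation step of claim 5 can be sketched as follows, assuming the per-window MFCCs, deltas, and double deltas have already been computed. The group size and hop are illustrative assumptions, not values from the patent.

```python
import numpy as np

def stack_windows(mfcc, deltas, ddeltas, group=5, hop=2):
    """Group per-window features into overlapping frames by concatenation.

    Each input is an (n_windows, n_coeffs) array; `group` adjacent
    windows form one overlapping frame, advancing by `hop` windows.
    """
    per_window = np.hstack([mfcc, deltas, ddeltas])  # (n_windows, 3 * n_coeffs)
    frames = [per_window[i : i + group].ravel()
              for i in range(0, len(per_window) - group + 1, hop)]
    return np.array(frames)  # (n_frames, group * 3 * n_coeffs)
```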

6. The method of claim 1, wherein the computing the speaker classification result comprises forward propagating the plurality of features through a trained multi-class neural network, the trained multi-class neural network being trained to compute the recognition scores, each of the recognition scores corresponding to a confidence that the speech of the input audio corresponds to speech from one of a plurality of enrolled speakers.

7. The method of claim 6, wherein the trained multi-class neural network is trained by:

receiving training data comprising audio comprising speech from a plurality of enrolled speakers, the audio being labeled with the speakers;

extracting a plurality of features from the audio for each of the enrolled speakers;

applying speaker-level mean-variance normalization to the features extracted from the audio for each of the enrolled speakers; and

training the multi-class neural network to classify an input feature vector as one of the plurality of enrolled speakers.

8. The method of claim 7, wherein the training the multi-class neural network comprises iteratively reducing a regularization parameter of a cost function.
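The training schedule of claim 8 can be sketched with a softmax classifier whose L2 regularization weight shrinks each epoch. A single linear layer is used purely for brevity (the claim covers a multi-class neural network), and the learning rate, epoch count, and decay factor are illustrative assumptions.

```python
import numpy as np

def train_multiclass(X, y, n_classes, epochs=200, lr=0.5, reg0=1.0, decay=0.97):
    """Train a softmax classifier while iteratively reducing the
    regularization parameter of the cost function."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    reg = reg0
    for _ in range(epochs):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = X.T @ (p - onehot) / len(X) + reg * W  # cross-entropy + L2
        W -= lr * grad
        reg *= decay  # iteratively reduce the regularization parameter
    return W
```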

9. The method of claim 1, wherein the speaker classification result comprises an identification of a particular speaker of a plurality of enrolled speakers, and

wherein the identification of the particular speaker is computed by identifying a highest recognition score of the plurality of recognition scores and by identifying the particular speaker associated with the highest recognition score.

10. The method of claim 1, further comprising receiving an allegation that the speaker is a particular enrolled speaker of a plurality of enrolled speakers,

wherein the speaker classification result is a speaker verification indicating whether the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers.

11. The method of claim 10, further comprising computing the speaker verification by:

comparing the recognition score corresponding to the particular speaker to a threshold value; and

outputting a speaker verification indicating that the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers in response to determining that the recognition score exceeds the threshold value and determining that the recognition score is higher than the recognition scores of all other enrolled speakers.
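The two-part acceptance condition of claim 11 can be sketched directly; the function name and argument layout are assumptions for illustration.

```python
def verify_speaker(scores, claimed, threshold):
    """Accept the identity claim only when the claimed speaker's
    recognition score exceeds the threshold AND is higher than the
    scores of all other enrolled speakers."""
    s = scores[claimed]
    others = [v for i, v in enumerate(scores) if i != claimed]
    return s > threshold and all(s > v for v in others)
```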

12. The method of claim 1 1 , wherein the threshold comprises a speaker-specific threshold, and wherein the speaker-specific threshold is computed by solving for an intersection between a first Gaussian distribution representing the probability that the speaker of the input audio is one of the enrolled speakers and a second Gaussian distribution representing the probability that the speaker of the input audio is not one of the enrolled speakers.
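The speaker-specific threshold of claim 12 can be sketched by equating the two Gaussian densities and solving the resulting quadratic for the crossing point between the means. The function name and the interpretation of the two distributions as genuine-speaker and impostor score distributions are assumptions for illustration.

```python
import numpy as np

def speaker_threshold(mu_true, sd_true, mu_imp, sd_imp):
    """Return the score x where N(x; mu_true, sd_true) = N(x; mu_imp, sd_imp).

    Equating the two Gaussian densities and taking logs yields the
    quadratic a*x^2 + b*x + c = 0 below; the root lying between the two
    means serves as the speaker-specific threshold.
    """
    a = 1 / (2 * sd_imp**2) - 1 / (2 * sd_true**2)
    b = mu_true / sd_true**2 - mu_imp / sd_imp**2
    c = (mu_imp**2 / (2 * sd_imp**2)
         - mu_true**2 / (2 * sd_true**2)
         + np.log(sd_imp / sd_true))
    if abs(a) < 1e-12:  # equal variances: densities cross at the midpoint
        return (mu_true + mu_imp) / 2
    lo, hi = sorted((mu_true, mu_imp))
    for r in np.roots([a, b, c]):
        if abs(r.imag) < 1e-9 and lo <= r.real <= hi:
            return float(r.real)
    return (mu_true + mu_imp) / 2
```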

13. A system for classifying speakers comprising:

a processor; and

memory storing instructions that, when executed by the processor, cause the processor to:

receive input audio comprising speech from a speaker;

extract a plurality of speech frames containing voiced speech from the input audio;

compute a plurality of features for each of the speech frames of the input audio;

compute a plurality of recognition scores for the plurality of features;

compute a speaker classification result in accordance with the recognition scores; and

output the speaker classification result.

14. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to extract the speech frames by:

dividing the input audio into the plurality of speech frames;

computing a short term energy of each speech frame;

computing a spectral centroid of each speech frame;

classifying a speech frame as a voiced frame in response to determining that the short term energy of the speech frame exceeds a short term energy threshold and that the spectral centroid of the speech frame exceeds a spectral centroid threshold, and classifying the speech frame as an unvoiced frame otherwise;

retaining the voiced frames and removing the unvoiced frames; and

outputting the retained voiced frames as the speech frames containing voiced speech.

15. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the plurality of features for each of the speech frames by:

dividing the speech frames into overlapping windows of audio;

normalizing each of the windows of audio;

computing mel-frequency cepstral coefficients, deltas, and double deltas for each window; and

computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window.

16. The system of claim 15, wherein the normalizing each of the windows of audio comprises applying speaker-level mean-variance normalization.

17. The system of claim 15, wherein the computing the plurality of features from the mel-frequency cepstral coefficients, deltas, and double deltas for each window comprises:

grouping the windows into a plurality of overlapping frames, each of the overlapping frames comprising a plurality of adjacent windows;

for each overlapping frame of the overlapping frames, concatenating the mel-frequency cepstral coefficients, the deltas, and the double deltas of the adjacent windows to generate a plurality of features of the overlapping frame; and

outputting the features of the overlapping frames as the plurality of features.

18. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the speaker classification result by forward propagating the plurality of features through a trained multi-class neural network, the trained multi-class neural network being trained to compute the recognition scores, each of the recognition scores corresponding to a confidence that the speech of the input audio corresponds to speech from one of a plurality of enrolled speakers.

19. The system of claim 18, wherein the trained multi-class neural network is trained by:

receiving training data comprising audio comprising speech from a plurality of enrolled speakers, the audio being labeled with the speakers;

extracting a plurality of features from the audio for each of the enrolled speakers;

applying speaker-level mean-variance normalization to the features extracted from the audio for each of the enrolled speakers; and

training the multi-class neural network to classify an input feature vector as one of the plurality of enrolled speakers.

20. The system of claim 19, wherein the training the multi-class neural network comprises iteratively reducing a regularization parameter of a cost function.

21. The system of claim 13, wherein the speaker classification result comprises an identification of a particular speaker of a plurality of enrolled speakers, and

wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the identification of the particular speaker by identifying a highest recognition score of the plurality of recognition scores and by identifying the particular speaker associated with the highest recognition score.

22. The system of claim 13, wherein the memory further stores instructions that, when executed by the processor, cause the processor to receive an allegation that the speaker is a particular enrolled speaker of a plurality of enrolled speakers,

wherein the speaker classification result is a speaker verification indicating whether the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers.

23. The system of claim 22, wherein the memory further stores instructions that, when executed by the processor, cause the processor to compute the speaker verification by:

comparing the recognition score corresponding to the particular speaker to a threshold value; and

outputting a speaker verification indicating that the speaker of the speech of the input audio corresponds to the particular enrolled speaker of the plurality of enrolled speakers in response to determining that the recognition score exceeds the threshold value and determining that the recognition score is higher than the recognition scores of all other enrolled speakers.

24. The system of claim 23, wherein the threshold comprises a speaker-specific threshold, and wherein the speaker-specific threshold is computed by solving for an intersection between a first Gaussian distribution representing the probability that the speaker of the input audio is one of the enrolled speakers and a second Gaussian distribution representing the probability that the speaker of the input audio is not one of the enrolled speakers.