1. US20140195236 - Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination

Note: This text was produced by automatic optical character recognition (OCR). For legal purposes, use the PDF version.

Claims

1. A method comprising:
storing, by a computer system, speech data for a plurality of speakers, the speech data including a plurality of feature vectors and, for each feature vector, an associated sub-phonetic class;
building, by the computer system based on the speech data, an artificial neural network (ANN) for modeling speech of a target speaker in the plurality of speakers, the ANN being configured to discriminate between instances of sub-phonetic classes uttered by the target speaker and instances of sub-phonetic classes uttered by other speakers in the plurality of speakers;
wherein building the ANN comprises:
retrieving an existing ANN that comprises a plurality of existing output nodes, each existing output node corresponding to a sub-phonetic class and being configured to output a probability that a feature vector input to the existing ANN is an instance of the sub-phonetic class uttered by one of the other speakers in the plurality of speakers; and
modifying the existing ANN to generate the ANN, wherein the modifying causes the ANN to include an output layer that comprises the plurality of existing output nodes and, for each existing output node, a new output node that corresponds to the sub-phonetic class of the existing output node, the new output node being configured to output a probability that a feature vector input to the ANN is an instance of the sub-phonetic class uttered by the target speaker; and
verifying or identifying the target speaker using the ANN.
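The output-layer modification recited in claim 1 can be illustrated with a minimal sketch: the existing hidden-to-output weight matrix (one row per sub-phonetic class) is duplicated, so the modified network emits 2K posteriors per frame, the first K from the existing ("other speakers") nodes and the last K from the new target-speaker nodes. All sizes, the copy-initialization of the new rows, and the softmax output are assumptions for illustration, not details fixed by the claim.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over one frame's output activations.
    e = np.exp(z - z.max())
    return e / e.sum()

def extend_output_layer(w_out, b_out):
    """Given the hidden-to-output weights of an existing ANN with K
    sub-phonetic output nodes, return weights for a 2K-node output
    layer: rows 0..K-1 keep the original ("other speakers") nodes,
    rows K..2K-1 are new target-speaker nodes initialized as copies."""
    w_new = np.vstack([w_out, w_out.copy()])
    b_new = np.concatenate([b_out, b_out.copy()])
    return w_new, b_new

# Hypothetical sizes: H hidden units, K sub-phonetic classes.
H, K = 8, 4
rng = np.random.default_rng(0)
w_out, b_out = rng.normal(size=(K, H)), np.zeros(K)
w2, b2 = extend_output_layer(w_out, b_out)

h = rng.normal(size=H)        # hidden-layer activations for one frame
p = softmax(w2 @ h + b2)      # 2K posteriors: existing nodes, then new nodes
```

Because the new rows start as copies, each new node initially mirrors its paired existing node; the training recited in claims 2 and 3 then pulls the two apart.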
2. The method of claim 1 wherein building the ANN further comprises:
training the ANN using a first portion of the speech data originating from the target speaker such that, for each feature vector and associated sub-phonetic class in the first portion, the new output node that corresponds to the associated sub-phonetic class is tuned to output a relatively higher probability and the existing output node that corresponds to the associated sub-phonetic class is tuned to output a relatively lower probability.
3. The method of claim 2 wherein building the ANN further comprises:
training the ANN using a second portion of the speech data originating from the other speakers such that, for each feature vector and associated sub-phonetic class in the second portion, the new output node corresponding to the associated sub-phonetic class is tuned to output a relatively lower probability and the existing output node corresponding to the associated sub-phonetic class is tuned to output a relatively higher probability.
4. The method of claim 3 wherein the ANN further includes an input layer comprising a plurality of input nodes and one or more hidden layers comprising one or more hidden nodes.
5. The method of claim 4 wherein the input layer is connected to a lowest hidden layer in the one or more hidden layers via a first set of connections having a first set of weights, wherein the one or more hidden layers are connected from a lower layer to a higher layer via a second set of connections having a second set of weights, wherein a highest hidden layer in the one or more hidden layers is connected to the output layer via a third set of connections having a third set of weights, and wherein training the ANN using the first and second portions of the speech data comprises applying a back-propagation algorithm to modify the third set of weights without modifying the first or second set of weights.
6. The method of claim 1 wherein verifying or identifying the target speaker using the ANN comprises:
receiving an acoustic signal corresponding to an utterance of a target phrase;
extracting feature vectors from a plurality of frames in the acoustic signal; and
generating, based on the feature vectors and the ANN, a speaker-verification score indicating a likelihood that the target phrase was uttered by the target speaker.
7. The method of claim 6 wherein generating the speaker-verification score comprises performing a Viterbi search using a Hidden Markov Model (HMM) that lexically models the target phrase.
8. The method of claim 7 wherein, at each frame in the plurality of frames, the Viterbi search passes a feature vector corresponding to the frame as input to the ANN and generates a per-frame speaker-verification score based on one or more probabilities output by the ANN.
9. The method of claim 8 wherein the per-frame speaker-verification score for a state in the HMM corresponds to the probability output by the new output node in the ANN which is associated with that state in the HMM.
10. The method of claim 8 wherein the per-frame speaker-verification score for a state in the HMM corresponds to the probability output by a new output node in the ANN which is associated with that state in the HMM, less the probability output by the existing output node associated with the new output node in the ANN.
11. The method of claim 8 wherein generating the speaker-verification score further comprises applying a function to one or more of the per-frame speaker-verification scores to calculate the speaker-verification score.
12. The method of claim 11 wherein the function is an average function or a summation function.
13. The method of claim 11 wherein the function is applied to a subset of the per-frame speaker-verification scores that correspond to relevant frames in the plurality of frames, each relevant frame being classified, via the Viterbi search, as being part of a phoneme that is considered useful for speaker verification or identification.
14. A non-transitory computer readable storage medium having stored thereon program code executable by a computer system, the program code comprising:
code that causes the computer system to store speech data for a plurality of speakers, the speech data including a plurality of feature vectors and, for each feature vector, an associated sub-phonetic class;
code that causes the computer system to build, based on the speech data, an artificial neural network (ANN) for modeling speech of a target speaker in the plurality of speakers, the ANN being configured to discriminate between instances of sub-phonetic classes uttered by the target speaker and instances of sub-phonetic classes uttered by other speakers in the plurality of speakers;
wherein the code that causes the computer system to build the ANN comprises:
code that causes the computer system to retrieve an existing ANN that comprises a plurality of existing output nodes, each existing output node corresponding to a sub-phonetic class and being configured to output a probability that a feature vector input to the existing ANN is an instance of the sub-phonetic class uttered by one of the other speakers in the plurality of speakers; and
code that causes the computer system to modify the existing ANN to generate the ANN, wherein the modifying causes the ANN to include an output layer that comprises the plurality of existing output nodes and, for each existing output node, a new output node that corresponds to the sub-phonetic class of the existing output node, the new output node being configured to output a probability that a feature vector input to the ANN is an instance of the sub-phonetic class uttered by the target speaker; and
code that causes the computer system to verify or identify the target speaker using the ANN.
15. The non-transitory computer readable storage medium of claim 14 wherein the code that causes the computer system to build the ANN further comprises:
code that causes the computer system to train the ANN using a first portion of the speech data originating from the target speaker such that, for each feature vector and associated sub-phonetic class in the first portion, the new output node that corresponds to the associated sub-phonetic class is tuned to output a relatively higher probability and the existing output node that corresponds to the associated sub-phonetic class is tuned to output a relatively lower probability; and
code that causes the computer system to train the ANN using a second portion of the speech data originating from the other speakers such that, for each feature vector and associated sub-phonetic class in the second portion, the new output node corresponding to the associated sub-phonetic class is tuned to output a relatively lower probability and the existing output node corresponding to the associated sub-phonetic class is tuned to output a relatively higher probability.
16. The non-transitory computer readable storage medium of claim 15 wherein the ANN further includes an input layer comprising a plurality of input nodes and one or more hidden layers comprising one or more hidden nodes.
17. The non-transitory computer readable storage medium of claim 16 wherein the input layer is connected to a lowest hidden layer in the one or more hidden layers via a first set of connections having a first set of weights, wherein the one or more hidden layers are connected from a lower layer to a higher layer via a second set of connections having a second set of weights, wherein a highest hidden layer in the one or more hidden layers is connected to the output layer via a third set of connections having a third set of weights, and wherein training the ANN using the first and second portions of the speech data comprises applying a back-propagation algorithm to modify the third set of weights without modifying the first or second set of weights.
18. The non-transitory computer readable storage medium of claim 14 wherein the code that causes the computer system to verify or identify the target speaker using the ANN comprises:
code that causes the computer system to receive an acoustic signal corresponding to an utterance of a target phrase;
code that causes the computer system to extract feature vectors from a plurality of frames in the acoustic signal; and
code that causes the computer system to generate, based on the feature vectors and the ANN, a speaker-verification score indicating a likelihood that the target phrase was uttered by the target speaker.
19. The non-transitory computer readable storage medium of claim 18 wherein generating the speaker-verification score comprises performing a Viterbi search using a Hidden Markov Model (HMM) that lexically models the target phrase.
20. The non-transitory computer readable storage medium of claim 19 wherein, at each frame in the plurality of frames, the Viterbi search passes a feature vector corresponding to the frame as input to the ANN and generates a per-frame speaker-verification score based on one or more probabilities output by the ANN.
21. The non-transitory computer readable storage medium of claim 20 wherein the per-frame speaker-verification score for a state in the HMM corresponds to the probability output by the new output node in the ANN which is associated with that state in the HMM.
22. The non-transitory computer readable storage medium of claim 20 wherein the per-frame speaker-verification score for a state in the HMM corresponds to the probability output by a new output node in the ANN which is associated with that state in the HMM, less the probability output by the existing output node associated with the new output node in the ANN.
23. The non-transitory computer readable storage medium of claim 20 wherein generating the speaker-verification score further comprises applying a function to one or more of the per-frame speaker-verification scores to calculate the speaker-verification score.
24. The non-transitory computer readable storage medium of claim 23 wherein the function is an average function or a summation function.
25. The non-transitory computer readable storage medium of claim 23 wherein the function is applied to a subset of the per-frame speaker-verification scores that correspond to relevant frames in the plurality of frames, each relevant frame being classified, via the Viterbi search, as being part of a phoneme that is considered useful for speaker verification or identification.
26. A system comprising:
a processor configured to:
store speech data for a plurality of speakers, the speech data including a plurality of feature vectors and, for each feature vector, an associated sub-phonetic class;
build, based on the speech data, an artificial neural network (ANN) for modeling speech of a target speaker in the plurality of speakers, the ANN being configured to discriminate between instances of sub-phonetic classes uttered by the target speaker and instances of sub-phonetic classes uttered by other speakers in the plurality of speakers;
wherein building the ANN comprises:
retrieving an existing ANN that comprises a plurality of existing output nodes, each existing output node corresponding to a sub-phonetic class and being configured to output a probability that a feature vector input to the existing ANN is an instance of the sub-phonetic class uttered by one of the other speakers in the plurality of speakers; and
modifying the existing ANN to generate the ANN, wherein the modifying causes the ANN to include an output layer that comprises the plurality of existing output nodes and, for each existing output node, a new output node that corresponds to the sub-phonetic class of the existing output node, the new output node being configured to output a probability that a feature vector input to the ANN is an instance of the sub-phonetic class uttered by the target speaker; and
verify or identify the target speaker using the ANN.
27. The system of claim 26 wherein building the ANN further comprises:
training the ANN using a first portion of the speech data originating from the target speaker such that, for each feature vector and associated sub-phonetic class in the first portion, the new output node that corresponds to the associated sub-phonetic class is tuned to output a relatively higher probability and the existing output node that corresponds to the associated sub-phonetic class is tuned to output a relatively lower probability; and
training the ANN using a second portion of the speech data originating from the other speakers such that, for each feature vector and associated sub-phonetic class in the second portion, the new output node corresponding to the associated sub-phonetic class is tuned to output a relatively lower probability and the existing output node corresponding to the associated sub-phonetic class is tuned to output a relatively higher probability.
28. The system of claim 27 wherein the ANN further includes an input layer comprising a plurality of input nodes and one or more hidden layers comprising one or more hidden nodes.
29. The system of claim 28 wherein the input layer is connected to a lowest hidden layer in the one or more hidden layers via a first set of connections having a first set of weights, wherein the one or more hidden layers are connected from a lower layer to a higher layer via a second set of connections having a second set of weights, wherein a highest hidden layer in the one or more hidden layers is connected to the output layer via a third set of connections having a third set of weights, and wherein training the ANN using the first and second portions of the speech data comprises applying a back-propagation algorithm to modify the third set of weights without modifying the first or second set of weights.
30. The system of claim 26 wherein verifying or identifying the target speaker using the ANN comprises:
receiving an acoustic signal corresponding to an utterance of a target phrase;
extracting feature vectors from a plurality of frames in the acoustic signal; and
generating, based on the feature vectors and the ANN, a speaker-verification score indicating a likelihood that the target phrase was uttered by the target speaker.
31. The system of claim 30 wherein generating the speaker-verification score comprises performing a Viterbi search using a Hidden Markov Model (HMM) that lexically models the target phrase.
32. The system of claim 31 wherein, at each frame in the plurality of frames, the Viterbi search passes a feature vector corresponding to the frame as input to the ANN and generates a per-frame speaker-verification score based on one or more probabilities output by the ANN.
33. The system of claim 32 wherein the per-frame speaker-verification score for a state in the HMM corresponds to the probability output by the new output node in the ANN which is associated with that state in the HMM.
34. The system of claim 32 wherein the per-frame speaker-verification score for a state in the HMM corresponds to the probability output by a new output node in the ANN which is associated with that state in the HMM, less the probability output by the existing output node associated with the new output node in the ANN.
35. The system of claim 32 wherein generating the speaker-verification score further comprises applying a function to one or more of the per-frame speaker-verification scores to calculate the speaker-verification score.
36. The system of claim 35 wherein the function is an average function or a summation function.
37. The system of claim 35 wherein the function is applied to a subset of the per-frame speaker-verification scores that correspond to relevant frames in the plurality of frames, each relevant frame being classified, via the Viterbi search, as being part of a phoneme that is considered useful for speaker verification or identification.