WO2020195068 - SYSTEM AND METHOD FOR END-TO-END SPEECH RECOGNITION WITH TRIGGERED ATTENTION

Publication Number WO/2020/195068
Publication Date 01.10.2020
International Application No. PCT/JP2020/002201
International Filing Date 16.01.2020
IPC
G10L 15/16 2006.01
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
15 Speech recognition
08 Speech classification or search
16 using artificial neural networks
G06N 3/04 2006.01
G PHYSICS
06 COMPUTING; CALCULATING OR COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
3 Computer systems based on biological models
02 using neural network models
04 Architecture, e.g. interconnection topology
G10L 15/32 2013.01
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
15 Speech recognition
28 Constructional details of speech recognition systems
32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
CPC
G06N 3/0445
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
3 Computer systems based on biological models
02 using neural network models
04 Architectures, e.g. interconnection topology
0445 Feedback networks, e.g. hopfield nets, associative networks
G06N 3/0454
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
3 Computer systems based on biological models
02 using neural network models
04 Architectures, e.g. interconnection topology
0454 using a combination of multiple neural nets
G06N 3/0481
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
3 Computer systems based on biological models
02 using neural network models
04 Architectures, e.g. interconnection topology
0481 Non-linear activation functions, e.g. sigmoids, thresholds
G06N 3/08
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
3 Computer systems based on biological models
02 using neural network models
08 Learning methods
G06N 7/005
G PHYSICS
06 COMPUTING; CALCULATING; COUNTING
N COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
7 Computer systems based on specific mathematical models
005 Probabilistic networks
G10L 15/16
G PHYSICS
10 MUSICAL INSTRUMENTS; ACOUSTICS
L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
15 Speech recognition
08 Speech classification or search
16 using artificial neural networks
Applicants
  • MITSUBISHI ELECTRIC CORPORATION [JP]/[JP]
Inventors
  • MORITZ, Niko
  • HORI, Takaaki
  • LE ROUX, Jonathan
Agents
  • FUKAMI PATENT OFFICE, P.C.
Priority Data
16/363,021 25.03.2019 US
Publication Language English (EN)
Filing Language English (EN)
Title
(EN) SYSTEM AND METHOD FOR END-TO-END SPEECH RECOGNITION WITH TRIGGERED ATTENTION
(FR) SYSTÈME ET PROCÉDÉ DE RECONNAISSANCE VOCALE DE BOUT EN BOUT AVEC ATTENTION DÉCLENCHÉE
Abstract
(EN)
A speech recognition system includes an encoder to convert an input acoustic signal into a sequence of encoder states, an alignment decoder to identify locations of encoder states in the sequence of encoder states that encode transcription outputs, a partition module to partition the sequence of encoder states into a set of partitions based on the locations of the identified encoder states, and an attention-based decoder to determine the transcription outputs for each partition of encoder states submitted to the attention-based decoder as an input. Upon receiving the acoustic signal, the system uses the encoder to produce the sequence of encoder states, partitions the sequence of encoder states into the set of partitions based on the locations of the encoder states identified by the alignment decoder, and submits the set of partitions sequentially into the attention-based decoder to produce a transcription output for each of the submitted partitions.
(FR)
Un système de reconnaissance vocale comprend un codeur pour convertir un signal acoustique d'entrée en une séquence d'états de codeur, un décodeur d'alignement pour identifier des emplacements d'états de codeur dans la séquence d'états de codeur qui codent des sorties de transcription, un module de partition pour diviser la séquence d'états de codeur en un ensemble de partitions sur la base des emplacements des états de codeur identifiés, et un décodeur basé sur l'attention pour déterminer les sorties de transcription pour chaque partition d'états de codeur soumise au décodeur basé sur l'attention en tant qu'entrée. Lors de la réception du signal acoustique, le système utilise le codeur pour produire la séquence d'états de codeur, divise la séquence d'états de codeur en l'ensemble de partitions sur la base des emplacements des états de codeur identifiés par le décodeur d'alignement, et soumet l'ensemble de partitions séquentiellement dans le décodeur basé sur l'attention pour produire une sortie de transcription pour chacune des partitions soumises.
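The data flow described in the abstract (encoder states, alignment decoder, partition module, attention-based decoder) can be illustrated with a minimal Python sketch. This is not the implementation claimed in the application. It makes several assumptions that the abstract does not state: the alignment decoder is assumed to emit one label per encoder state with a reserved "blank" label (as in CTC-style alignment), each partition is assumed to run from the first encoder state up to an identified location plus an optional look-ahead, and all names (BLANK, find_trigger_locations, partition_states, transcribe, decode_partition) are hypothetical.

```python
from typing import Callable, List, Sequence

BLANK = 0  # hypothetical index of the "no output" label (assumption)


def find_trigger_locations(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Return indices of encoder states where a new non-blank label starts.

    Stands in for the alignment decoder: frame_labels holds one label per
    encoder state (assumed CTC-style, repeated labels collapse to one).
    """
    triggers = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            triggers.append(t)
        prev = label
    return triggers


def partition_states(states: Sequence, triggers: Sequence[int], look_ahead: int = 0) -> List[list]:
    """Build one partition per identified location.

    Assumption: partition i covers the states from the start of the
    sequence up to the i-th location, plus look_ahead extra states.
    """
    n = len(states)
    return [list(states[: min(t + 1 + look_ahead, n)]) for t in triggers]


def transcribe(
    states: Sequence,
    frame_labels: Sequence[int],
    decode_partition: Callable[[list, list], object],
    look_ahead: int = 0,
) -> list:
    """Submit the partitions sequentially; collect one output per partition.

    decode_partition is a placeholder for the attention-based decoder; it
    receives the current partition and the outputs produced so far.
    """
    triggers = find_trigger_locations(frame_labels)
    outputs: list = []
    for part in partition_states(states, triggers, look_ahead):
        outputs.append(decode_partition(part, outputs))
    return outputs


if __name__ == "__main__":
    states = list(range(7))               # stand-in encoder states
    labels = [0, 3, 3, 0, 5, 0, 5]        # stand-in alignment decoder output
    # Dummy decoder: reports how many encoder states it was allowed to see.
    print(transcribe(states, labels, lambda part, prev: len(part)))
```

In this sketch the attention-based decoder never sees encoder states beyond the current partition, so each output could be produced as soon as the corresponding states are available; whether and how the actual system bounds the partitions is defined by the claims, not by this example.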