Welcome to the deep learning in speech recognition series. This is the second part of a three-part series. In the first part, we discussed how to represent and encode audio.
One obvious, fundamental problem for speech recognition is that the length of the input is not the same as the length of the transcription. If I say "hello" very slowly, I produce a very long audio signal even though the transcription hasn't changed length; if I say "hello" very quickly, I produce a very short piece of audio.
That means the output of my neural network changes length, and we need some way to map that output onto the fixed-length transcription, and to do it in a way that lets us actually train the pipeline.
There are multiple ways to do it. There is current research on approaches like attention-based sequence-to-sequence models. We'll focus on something called CTC (Connectionist Temporal Classification), which is more or less the current state of the art for this problem.
Our RNN has output neurons C, and the job of these output neurons is to encode a distribution over the output symbols. Because of the structure of the recurrent network, the length of this symbol sequence C is the same as the length of the audio input.
If my audio input was 2 seconds long, it might have a hundred audio frames, which would mean that C also has a hundred values. We're trying to predict the characters of the language directly from the audio.
Once the RNN gives us a distribution over these symbols C, we define some kind of mapping that converts this long sequence C into the final transcription Y. Recognizing that C is itself a probabilistic object, there is a distribution over the choices of C that correspond to the audio.
Once you apply this function, that also means there is a distribution over Y: a distribution over the possible transcriptions I could get. To train the network, we maximize the probability of the correct transcription given the audio. Those are the steps we have to accomplish to make CTC work. Let's start with the first one.
1. Encode distribution over symbols
We have output neurons C, and they represent a distribution over the different symbols that could appear in the audio.
I've got some audio signal down here; you can see the spectrogram frames poking up. This is processed by the recurrent neural network, and the output is a big bank of softmax neurons.
For each frame of audio, there is a neuron corresponding to each symbol that C could represent. This set of softmax neurons, with outputs summing to 1, represents the probability of, say, c1 having the value A, B, C, or the special blank.
For example, if I pick the neuron in the row that represents the character B and in the 17th column, which is the 17th frame in time, it represents the probability that c17 is the character B, given the audio.
Now I can define a distribution not just over individual characters: if I assume that all of the characters are independent, which is a somewhat naive assumption, and bake that into the system, I can define a distribution over all possible sequences of characters in this alphabet.
Suppose I gave you a specific character string in this alphabet. For instance, I could represent the string "hello" as "HHH E E LL LO" followed by a bunch of blanks. This is a string in the alphabet for C, and I can just use this formula, the product of the per-frame probabilities, to compute the probability of this specific sequence of characters. That's how we compute the probability of a sequence of characters that has the same length as the audio input.
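The per-frame independence assumption can be sketched in a few lines. This is a toy illustration, not a real network: `frame_probs` and `sequence_probability` are hypothetical names, and the per-frame distributions here stand in for the softmax rows an RNN would actually produce.

```python
# Toy sketch: under the independence assumption, the probability of a whole
# symbol sequence c is just the product of the per-frame softmax outputs.

def sequence_probability(frame_probs, symbols):
    """P(c_1..c_T | x) = product over t of P(c_t | x)."""
    assert len(frame_probs) == len(symbols)
    p = 1.0
    for dist, sym in zip(frame_probs, symbols):
        p *= dist[sym]
    return p

# Hypothetical output for 4 audio frames over the alphabet {'h', 'i', '-'},
# where '-' is the special blank symbol.
frame_probs = [
    {'h': 0.7, 'i': 0.2, '-': 0.1},
    {'h': 0.6, 'i': 0.3, '-': 0.1},
    {'h': 0.1, 'i': 0.8, '-': 0.1},
    {'h': 0.1, 'i': 0.1, '-': 0.8},
]
p = sequence_probability(frame_probs, ['h', 'h', 'i', '-'])  # 0.7 * 0.6 * 0.8 * 0.8
```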
2. Define mapping β(c) -> y
The second step is to define a mapping from this long, frame-level encoding of the audio into symbols, one that crunches it down to the actual transcription we're trying to predict.
This operator takes the character sequence, collapses any run of adjacent repeated characters down to a single character, and then drops all of the blanks.
One key thing to note: when two different characters sit right next to each other, both end up in the output; but if you ever have a doubled character, like the LL in "hello", you need a blank character in between to keep the two from being collapsed into one.
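The collapsing operator β is simple enough to sketch directly. This is a minimal illustration, assuming `'-'` stands for the blank symbol; the function name `beta` is just a stand-in for the operator in the text.

```python
# Sketch of the collapsing operator beta: merge adjacent repeated symbols,
# then drop the blanks ('-' is the blank here).

BLANK = '-'

def beta(c):
    """Map a frame-level symbol sequence to its final transcription."""
    out = []
    prev = None
    for sym in c:
        if sym != prev:          # keep only the first symbol of any repeated run
            out.append(sym)
        prev = sym
    return ''.join(s for s in out if s != BLANK)  # then remove blanks

t1 = beta("hhheelll-lo--")   # blank between the L runs preserves the double L
t2 = beta("h-e-l-l-o")       # blanks between every character also work
t3 = beta("hheelllloo")      # no blank between the Ls: the double L collapses
```

Note the third case: without a blank separating the two L runs, β collapses them into a single L and the output is "helo", which is exactly why the encoding of "hello" needs that interior blank.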
Now we have a way to define a distribution over sequences of symbols that are the same length as the audio, and a mapping from those sequences to transcriptions. Together, these give us a probability distribution over the possible final transcriptions.
There are many character sequences that all map to the same transcription. To compute the probability of "hello", go through all of the possible character sequences that correspond to the transcription "hello" and add up their probabilities. You have to sum over all possible choices of C that could give that transcription in the end.
You can think of this as searching through all the possible alignments. You could shift characters around a little bit, move them forward or backward, or expand them by adding duplicates depending on how fast someone is talking. That covers every possible alignment between the audio and the characters I want to transcribe, which is what solves the variable-length problem. The way I get the probability of a specific transcription is to sum, that is, to marginalize, over all the feasible alignments. The equation just says: sum over all character sequences C that, after applying the mapping operator, end up as the transcription Y.
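For a tiny example, this marginalization can be done by brute force: enumerate every possible sequence C and keep the ones that collapse to the target transcription. This is purely illustrative (real systems use a dynamic program instead, since the number of sequences grows exponentially with the number of frames); `frame_probs` and the helper names are hypothetical.

```python
# Brute-force sketch of marginalizing over alignments: sum the probability of
# every frame-level sequence C whose collapsed form beta(C) equals y.
import itertools

BLANK = '-'

def beta(c):
    """Collapse adjacent repeats, then drop blanks."""
    out, prev = [], None
    for sym in c:
        if sym != prev:
            out.append(sym)
        prev = sym
    return ''.join(s for s in out if s != BLANK)

def transcription_probability(frame_probs, y, alphabet):
    total = 0.0
    T = len(frame_probs)
    for c in itertools.product(alphabet, repeat=T):  # every possible alignment
        if beta(c) == y:
            p = 1.0
            for dist, sym in zip(frame_probs, c):
                p *= dist[sym]
            total += p
    return total

# Hypothetical softmax outputs for 3 audio frames.
frame_probs = [
    {'h': 0.6, 'i': 0.2, BLANK: 0.2},
    {'h': 0.2, 'i': 0.6, BLANK: 0.2},
    {'h': 0.1, 'i': 0.2, BLANK: 0.7},
]
# Sums over the alignments "hhi", "hii", "hi-", "h-i", and "-hi".
p_hi = transcription_probability(frame_probs, 'hi', ['h', 'i', BLANK])
```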
Now we are given the correct transcription, and the job is to tune the neural network to maximize the probability of that transcription under the model we have defined.
In equations: for a given example, maximize the log probability of y*, the correct transcription, given the audio x, and sum over all the examples.
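Written out, with θ standing in for the network parameters (a symbol not named in the original) and β for the collapsing operator from step 2, the objective is:

```latex
\max_{\theta} \; \sum_{i} \log P(y^{*}_{i} \mid x_{i}),
\qquad \text{where} \quad
P(y \mid x) \;=\; \sum_{c \,:\, \beta(c) = y} \; \prod_{t=1}^{T} P(c_t \mid x)
```

The inner sum runs over every frame-level sequence c that collapses to y, and the product comes from the per-frame independence assumption of step 1.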
You can use an ML library that will efficiently calculate this CTC loss function for you: it can calculate the likelihood and also give you back the gradient. There are a whole bunch of implementations on the web that you can use as part of deep learning packages. One of them, from Baidu, implements CTC on the GPU and is called warp-CTC. Stanford has a CTC implementation, and there are also CTC losses implemented in packages like TensorFlow. This is sufficiently widely distributed that you can use these algorithms off the shelf.
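Under the hood, these libraries avoid enumerating alignments by using a dynamic program (the CTC "forward" recursion). Here is a minimal pure-Python sketch of that idea, assuming `'-'` is the blank and `frame_probs` is a hypothetical toy softmax output; real implementations work in log space and on GPUs.

```python
# Sketch of the CTC forward recursion: compute P(y | x) by summing over all
# alignments with a dynamic program instead of brute-force enumeration.

BLANK = '-'

def ctc_forward(frame_probs, y):
    """Return the marginal probability P(y | x)."""
    # Interleave blanks around the target: "hi" -> [-, h, -, i, -]
    ext = [BLANK]
    for ch in y:
        ext += [ch, BLANK]
    S, T = len(ext), len(frame_probs)

    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = frame_probs[0][BLANK]     # start with a blank ...
    if S > 1:
        alpha[0][1] = frame_probs[0][y[0]]  # ... or with the first character
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                  # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1][s - 1]         # advance by one position
            # skip over a blank, unless that would merge a doubled character
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * frame_probs[t][ext[s]]
    # finish on the last character or on the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

# Hypothetical softmax outputs for 3 audio frames.
frame_probs = [
    {'h': 0.6, 'i': 0.2, BLANK: 0.2},
    {'h': 0.2, 'i': 0.6, BLANK: 0.2},
    {'h': 0.1, 'i': 0.2, BLANK: 0.7},
]
p = ctc_forward(frame_probs, 'hi')  # same value brute-force enumeration gives
```

The recursion runs in O(T · |y|) time rather than exponential time, which is what makes CTC training practical; the loss you would minimize is then just -log of this value.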
Training starts with an audio spectrogram. We have our neural network structure, where you get to choose how it's put together, and it outputs this bank of softmax neurons. Off-the-shelf software will then compute the CTC cost function for you: the log-likelihood, given a transcription and the output neurons of your recurrent network. The software can also tell you the gradient with respect to the output neurons. Once you've got that, you're set: you can feed it back into the rest of your code and get the gradient with respect to all of the parameters.
All of this is available in efficient off-the-shelf software, so you don't have to do the work yourself. That's pretty much all there is to the high-level algorithm. In the next post, we'll discuss how to decode the final transcription Y and use a language model to improve speech recognition.
Part 1: Deep Learning in Speech Recognition: Encoding
Part 3: Speech Recognition: Decoding and Language Model