Welcome to the deep learning in speech recognition series. This is the third post in a three-part series. In the first post, we discussed how to represent audio and encode it. In the second post, we discussed CTC, which handles the case where the length of the input is not the same as the length of the transcription. In this post, we discuss how to decode the network's outputs and improve speech recognition using a language model.

Encode distribution over symbols

Our neural network now spits out this big bank of softmax neurons, and we have a training algorithm that is doing gradient descent. We still need to decode these outputs to get the actual transcription. The simplest idea is to pick the most likely symbol c at each time step and then apply our little squeeze operator to get back the transcription, the way that we defined it.
Softmax output of speech recognition

This actually doesn't give you the most likely transcription, because it doesn't account for the fact that every transcription can be produced by multiple different sequences of C's. Still, you can approximate it this way, using what is called max decoding.
Max decoding
In the figure, little red dots mark the most likely C at each time step. If you apply the little squeeze operator, you just get the word "CAB". In practice, though, this is often terrible: it will often give you a very strange transcription that doesn't look like English.

Max decoding is an approximate way of going from these probability vectors to a transcription Y. If you want to find the actual most likely transcription Y, there is in general no algorithm that can give you the exact solution efficiently.
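To make max decoding concrete, here is a minimal sketch in Python. The alphabet mapping and the blank index are assumptions for the example, not fixed by the post:

```python
import numpy as np

def max_decode(probs, alphabet, blank=0):
    """Greedy (max) decoding: pick the most likely symbol at every frame,
    then 'squeeze' by collapsing repeated symbols and dropping blanks."""
    best_path = np.argmax(probs, axis=1)    # most likely symbol index per frame
    output, prev = [], None
    for s in best_path:
        if s != prev and s != blank:        # collapse repeats, drop blanks
            output.append(alphabet[s])
        prev = s
    return "".join(output)

# Hypothetical usage: probs is a (time x symbols) matrix of softmax outputs,
# and alphabet = ["_", "a", "b", "c", ...] with "_" at the blank index.
```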

Language Model


We can use knowledge about the language, taking a small step backward from a perfect end-to-end system, to make these transcriptions better. The real problem here is that you don't have enough audio available to learn all of this from scratch. If we had millions and millions of hours of audio sitting around, you could probably learn all these transcriptions, because if you hear enough words you learn how to spell them all, maybe the way a human does. Unfortunately, we just don't have enough audio for that, so we have to find a way to get around that data problem.

There are certain names in the world, proper names especially, that if you've never heard them before you have no idea how they're spelled. The only way to know is to have seen the word in text before and to have seen it in context. Part of the purpose of these language models is to get examples like this correct.

There are a couple of solutions. One would be to step back to a more traditional pipeline that uses phonemes. We can then bake new words in along with their phonetic pronunciations, and the system will just get them right.

In this post, I want to focus on fusing in a traditional language model, which gives us the a priori probability of any sequence of words. The reason this is helpful is that we can train language models from massive text corpora; we have far more text in the world than we have transcribed audio. That makes it possible to train giant language models with a huge vocabulary.

Traditional n-gram models are especially interesting for speech applications. You can use a package like KenLM to build yourself a giant n-gram language model. These models are really simple and well supported, which makes them easy to get working, and they let you train from lots of corpora.
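For reference, an n-gram model approximates the probability of a word sequence by conditioning each word only on the previous n-1 words; for a trigram model, for example:

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})
```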

For speech recognition in practice, one of the nice things about an n-gram model, as opposed to an RNN model, is that it can be updated very quickly. If you have a big distributed cluster, you can re-estimate the n-gram model very rapidly in parallel from new data, to keep track of whatever the trending words are today that your speech engine might need to deal with. We also need to query this model very rapidly inside our decoding loop, and being able to just look up probabilities in a table, the way an n-gram model is structured, is very valuable.
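As a rough sketch of what that lookup feels like in practice, here is how you might query a KenLM model from Python, assuming you have already built an ARPA file with KenLM's lmplz tool and installed its Python bindings (the lm.arpa path is a placeholder):

```python
import kenlm

# Load a pre-built n-gram model (ARPA or KenLM binary format).
model = kenlm.Model("lm.arpa")

# Total log10 probability of the sentence, with begin/end-of-sentence tokens.
print(model.score("the cat sat on the mat", bos=True, eos=True))
```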

To fuse this into the system, recall that we want the most likely transcription: the probability of Y given X, maximized over Y. We need to use a generic search algorithm for this anyway, and once we're using a generic search scheme to do our decoding and find the most likely transcription, we can add some extra cost terms.

Decoding with Language Model


You take the probability of a given word sequence from your audio and multiply it by some extra terms: the probability of the word sequence according to your language model, raised to some power, and then the length of the transcription, raised to another power.

Language Model

Taking logs, you get the log probability from your original objective, plus alpha times the log probability under the language model, plus beta times the log of the length. The alpha and beta parameters let you trade off the importance of getting a transcription that makes sense to your language model versus getting a transcription that makes sense to your acoustic model and actually sounds like the thing you heard.
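Written out, the combined objective from the two paragraphs above looks roughly like this, where |Y| denotes the number of words in the transcription:

```latex
\hat{Y} = \arg\max_{Y} \; \log P(Y \mid X) \;+\; \alpha \, \log P_{\text{LM}}(Y) \;+\; \beta \, \log |Y|
```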

The reason for the extra length term is that, as you multiply in all of these terms, you tend to penalize long transcriptions a bit too much. Having a little bonus or penalty at the end that you can tweak to get the transcription length right is very helpful.

Beam Search


The basic idea is to use beam search. It is a really popular search algorithm with a whole bunch of variants. The rough strategy is this:

Start from t = 1, at the very beginning of your audio input, with an empty list that you are going to populate with prefixes. These prefixes are just partial transcriptions that represent the audio up to the current time. The search proceeds by taking, at the current time step, each candidate prefix out of this list and trying all of the possible characters from the softmax output that could follow it.

For example, consider adding a blank. If the next element of C is supposed to be blank, then I don't change my prefix, because the blanks are just going to get dropped later. I do need to incorporate the probability of that blank character into the probability of this prefix, since it represents one of the ways I could reach that prefix, so I sum that probability into the candidate. Likewise, whenever you add a space to the end of a prefix, that signals that the prefix has just finished a word.

In addition to adding the probability of the space into the current estimate, this gives you the chance to look up the finished word in the language model and fold that into the current score. If you add an ordinary new character onto the prefix, it's straightforward: you just update the probability based on the probability of that character. At the end of this, you have a huge list of possible prefixes that could be generated.

This is where you would normally get an exponential blow-up from trying all possible prefixes to find the best one. What beam search does is simply keep the K most probable prefixes after you remove all the duplicates. If you use a really large K, the algorithm will be more accurate at finding the best possible solution to this maximization problem, but it will be slower.
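Here is a simplified sketch of this prefix beam search in Python. It keeps separate "ends in blank" and "ends in non-blank" probabilities for each prefix, folds in a word-level language model score whenever a space is added, and keeps only the K best prefixes per time step. The lm_score hook, the word-count length bonus, and all the names are assumptions for illustration, not a production decoder:

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Add probabilities that are stored in the log domain."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def prefix_beam_search(log_probs, alphabet, blank=0, beam_size=10,
                       lm_score=None, alpha=0.5, beta=1.5):
    """log_probs: T x V per-frame log probabilities from the softmax.
    alphabet:  list mapping symbol index -> character (' ' ends a word).
    lm_score:  optional function word -> log P(word) from an n-gram model."""
    # Each prefix keeps two scores: probability of ending in blank / non-blank.
    beam = {(): (0.0, NEG_INF)}  # empty prefix, log P(ending in blank) = 0

    for t in range(len(log_probs)):
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam.items():
            for s, lp in enumerate(log_probs[t]):
                if s == blank:
                    # A blank never changes the prefix; fold its probability in.
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (logsumexp(nb_b, p_b + lp, p_nb + lp), nb_nb)
                    continue
                ch = alphabet[s]
                lp_ext = lp
                if ch == " " and lm_score is not None:
                    # A space finishes a word: fold in the language model term.
                    word = "".join(prefix).split(" ")[-1]
                    if word:
                        lp_ext = lp + alpha * lm_score(word)
                new_prefix = prefix + (ch,)
                nb_b, nb_nb = next_beam[new_prefix]
                if prefix and prefix[-1] == ch:
                    # Repeated character: extending requires a blank in between...
                    next_beam[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + lp_ext))
                    # ...otherwise the repeat collapses back onto the same prefix.
                    ob_b, ob_nb = next_beam[prefix]
                    next_beam[prefix] = (ob_b, logsumexp(ob_nb, p_nb + lp))
                else:
                    next_beam[new_prefix] = (
                        nb_b, logsumexp(nb_nb, p_b + lp_ext, p_nb + lp_ext))

        # Duplicates were merged above (same dict key); keep the K best prefixes,
        # ranked with a word-count bonus so long transcriptions are not crushed.
        def rank(item):
            prefix, (p_b, p_nb) = item
            words = max(1, len("".join(prefix).split()))
            return logsumexp(p_b, p_nb) + beta * math.log(words)

        beam = dict(sorted(next_beam.items(), key=rank, reverse=True)[:beam_size])

    best = max(beam.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(best)
```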

Rescoring with Neural Language Model


If you're not happy with your n-gram model, one really easy way to use a better model without changing your current pipeline is rescoring. When the decoding strategy finishes, it can give you not only the most probable transcription but also the big list of the top K transcriptions ranked by probability. You can then take a recurrent network and rescore all of these, basically reordering them according to the new model.

In the case of a neural language model, say this is the N-best list: the N candidates output by my decoding strategy. A neural language model trained on a large amount of text can correctly reorder these and figure out which beam candidate is actually the correct one, even where the n-gram model didn't help.
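A minimal sketch of that rescoring step, assuming you already have the N-best list as (transcription, decoder log-probability) pairs and some function that returns a sentence's log-probability under your neural language model (the names and the interpolation weight are illustrative assumptions):

```python
def rescore_nbest(nbest, neural_lm_logprob, lam=0.5):
    """Rerank an N-best list from the beam-search decoder with a neural LM.

    nbest: list of (transcription, decoder_log_prob) pairs.
    neural_lm_logprob: stand-in for any trained RNN/transformer LM scorer.
    lam: interpolation weight between the decoder score and the neural LM."""
    rescored = [
        (text, (1 - lam) * decoder_lp + lam * neural_lm_logprob(text))
        for text, decoder_lp in nbest
    ]
    # Highest combined score first; rescored[0][0] is the new best transcription.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```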

That is really the full set of concepts that you need to get a working speech recognition engine based on deep learning.

Related Post

Part 1. Deep Learning in Speech Recognition: Encoding
Part 2. Speech Recognition: Connectionist Temporal Classification