Deep learning has been playing an increasingly large role in speech recognition. A human can effortlessly turn audio into words and words into meaning. For machines, this has historically been hard; you can think of it as one of those consummate AI tasks.
The goal of building a speech pipeline is to take a raw audio wave as input and build a speech recognizer that can do the very simple task of printing out "Hello World" when given a "Hello World" waveform.
We have divided the speech recognition process into three parts. The first part introduces pre-processing and encoding. Second, Connectionist Temporal Classification (CTC) is the most mature piece of sequence-learning technology for deep learning right now. One of the fundamental problems of speech recognition is how to build a neural network that can map an audio signal to a transcription of variable length, and CTC is one highly mature method for doing this. Finally, there is a bit about decoding and language models, which is an addendum to the acoustic models we can build that makes them perform a lot better.
How Is Audio Represented?
This should be pretty straightforward: unlike a two-dimensional image, where we normally have a 2D grid of pixels, audio is just a 1D signal.
There are a bunch of different formats for audio, but typically this one-dimensional wave is sampled at something like 8k or 16k samples per second.
Each sample is quantized to 8 or 16 bits. That is how we represent the audio signal that goes into our pipeline:
1D vector: X = [x1, x2, …]
You can just think of that as a 1D vector X representing the audio signal, broken down into samples x1, x2, and so forth. If you had a one-second audio clip, this vector would have a length of either 8,000 or 16,000 samples. Each element is a floating-point number extracted from the 8- or 16-bit sample.
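As a concrete illustration, here is a minimal sketch of loading a clip and turning its integer samples into that 1D float vector. It assumes NumPy/SciPy and a hypothetical 16-bit WAV file named hello_world.wav; the details are not from the original post.

```python
# Minimal sketch: load a WAV file and convert its 16-bit integer samples
# into the 1D float vector X described above. File name is hypothetical.
import numpy as np
from scipy.io import wavfile

sample_rate, samples = wavfile.read("hello_world.wav")
print(sample_rate)  # e.g. 8000 or 16000 samples per second

# 16-bit PCM samples lie in [-32768, 32767]; scale them to floats in [-1, 1].
X = samples.astype(np.float32) / 32768.0
print(X.shape)      # a one-second clip gives ~8000 or ~16000 samples
```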
Pre-Processing Audio Data
Once you have an audio clip, you'll do a little bit of pre-processing. There are a couple of ways to start; the first is to convert the audio to a simple spectrogram. You lose a little bit of information when you do this.
Spectrogram
The spectrogram is sort of like a frequency-domain representation, but instead of representing the entire signal in terms of frequencies, we represent a small window in terms of frequencies.
To process the audio clip, the first thing we do is cut out a little window that is typically about 20 ms long. These audio signals are made up of a combination of sine waves at different frequencies.
The FFT converts this little window into the frequency domain, and then we take the log of the power at each frequency. For every sine-wave frequency, this gives the amount of power that frequency contributes to the original signal. So instead of representing this little 20 ms slice as a sequence of audio samples, we can think of it as a vector in which each element represents the strength of one frequency in the window.
Apply this to a whole bunch of windows across the entire piece of audio, and that gives you what we call a spectrogram. You can use disjoint windows that are simply adjacent, or you can apply them to overlapping windows if you like. There is a little bit of parameter tuning there, but this is an alternative representation of the audio signal that happens to be easier to use for a lot of purposes.
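The sketch below shows one way to compute such a spectrogram with NumPy, assuming the vector X and sample_rate from the earlier snippet. The 20 ms window and 10 ms hop (overlapping windows) are typical choices, not values prescribed by the post.

```python
# Rough sketch of the spectrogram described above: slide a 20 ms window
# over X, take the FFT of each window, and keep the log power per frequency.
import numpy as np

def spectrogram(X, sample_rate, window_ms=20, hop_ms=10):
    window = int(sample_rate * window_ms / 1000)  # samples per 20 ms slice
    hop = int(sample_rate * hop_ms / 1000)        # step between window starts
    frames = []
    for start in range(0, len(X) - window + 1, hop):
        piece = X[start:start + window] * np.hanning(window)  # taper the edges
        spectrum = np.fft.rfft(piece)             # window -> frequency domain
        power = np.abs(spectrum) ** 2             # power at each frequency
        frames.append(np.log(power + 1e-10))      # log power (epsilon avoids log(0))
    # Rows are windows (time), columns are frequency bins.
    return np.stack(frames)

S = spectrogram(X, sample_rate)
print(S.shape)  # (number of windows, window // 2 + 1 frequency bins)
```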
Acoustic Model
Our goal, starting from the spectrogram representation of the raw audio, is to build an acoustic model: an entire speech engine represented by a neural network.
We want to build a neural net that we can train from a whole bunch of pairs: X, the original audio that we turn into a spectrogram, and Y*, the ground-truth transcription.
If you train this big neural network, it produces some kind of output, represented by the characters C, from which we can later extract the correct transcription, denoted by Y.
If I say "hello", the first thing I'm going to do is run pre-processing to get all these spectrogram frames, and then I'm going to have a recurrent neural network that consumes each frame and processes it into some new representation C. You can engineer the network in such a way that you can just read the transcription off these output neurons. That's the intuitive picture of what we want to accomplish.
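To make that picture concrete, here is an illustrative sketch in PyTorch of such an acoustic model: a recurrent network that reads one spectrogram frame per time step and emits scores over output characters C at each step. The layer sizes, character set, and class name are assumptions for illustration, not the architecture from the post.

```python
# Illustrative sketch of an acoustic model: spectrogram frames in,
# per-frame character scores out.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, num_freq_bins, num_chars, hidden_size=256):
        super().__init__()
        self.rnn = nn.GRU(num_freq_bins, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, num_chars)

    def forward(self, frames):
        # frames: (batch, num_windows, num_freq_bins)
        hidden, _ = self.rnn(frames)
        # One score vector over characters per spectrogram frame.
        return self.output(hidden)   # (batch, num_windows, num_chars)

# Example: one utterance with 100 frames of 161 frequency bins (a 20 ms window
# at 16 kHz gives 161 bins), scored over an assumed 29-symbol character set.
model = AcousticModel(num_freq_bins=161, num_chars=29)
frames = torch.randn(1, 100, 161)
char_scores = model(frames)
print(char_scores.shape)  # torch.Size([1, 100, 29])
```

Reading a transcription off those per-frame scores is exactly where the length-mismatch problem below comes in.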
One fundamental problem is that the length of the input is not the same as the length of the transcription. If I say "hello" very slowly, I get a very long audio signal even though the length of the transcription doesn't change; if I say "hello" very quickly, I get a very short audio signal. In the next post, we will discuss how to solve this problem using Connectionist Temporal Classification (CTC).
Related Posts
Part 2. Speech Recognition: Connectionist Temporal Classification
Part 3. Speech Recognition: Decoding and Language Model