This tutorial shows you how to run a simple speech recognition TensorFlow model built using the audio training. The app listens for a small set of words and displays them in the UI when they are recognized. Once you’ve completed this tutorial, you’ll have an application that tries to classify a one-second audio clip as either silence, an unknown word, “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, or “go”.
TensorFlow speech recognition model

1. Preparation


You can train your model on a desktop, a laptop, or a server and then use that pre-trained model on your mobile device. No training happens on the device itself; it all happens on the bigger machine. You can download a pretrained model from tensorflow.org.

2. Adding Dependencies


The TensorFlow Inference Interface is available as a JCenter package and can be included in your Android project with a couple of lines in the project’s build.gradle file:

allprojects {
    repositories {
        jcenter()
    }
}

Add the following dependency to the app’s build.gradle:

dependencies {
    ....
    compile 'org.tensorflow:tensorflow-android:+'
}

This will tell Gradle to use the latest version of the TensorFlow AAR that has been released to https://bintray.com/google/tensorflow/tensorflow-android. You may replace the + with an explicit version label if you wish to use a specific release of TensorFlow in your app.

3. Add Pre-trained Model to Project


You need the pre-trained model and the label file. You can download the model from here. Unzip the zip file to get conv_actions_labels.txt (the labels for the recognized words) and conv_actions_frozen.pb (the pre-trained model). Put conv_actions_labels.txt and conv_actions_frozen.pb into the android/assets directory.
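
As a rough illustration of how the label file can be consumed, the sketch below reads one label per line and skips blank lines. LabelLoader and readLabels are hypothetical names introduced here, not part of the demo app, and the one-label-per-line layout is an assumption about the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the demo app): reads one label per line. */
final class LabelLoader {
    private LabelLoader() {}

    /** Returns the non-empty, trimmed lines of the source in file order. */
    static List<String> readLabels(Reader source) {
        List<String> labels = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    labels.add(trimmed);
                }
            }
        } catch (IOException e) {
            // Surface read failures as unchecked so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
        return labels;
    }
}
```

On Android, the Reader would typically wrap the asset stream, for example new InputStreamReader(getAssets().open("conv_actions_labels.txt")).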

4. Microphone Permission


To use the microphone, request the RECORD_AUDIO permission in your manifest file as below:

<uses-permission android:name="android.permission.RECORD_AUDIO"/>

Since Android 6.0 Marshmallow, an application is no longer granted dangerous permissions such as RECORD_AUDIO at installation time. Instead, the application has to ask the user for each of these permissions at runtime.

private void requestMicrophonePermission() {
    ActivityCompat.requestPermissions(MainActivity.this,
            new String[]{android.Manifest.permission.RECORD_AUDIO}, REQUEST_RECORD_AUDIO);
}

@Override
public void onRequestPermissionsResult(int requestCode, String[] permissions, int[] grantResults) {
    if (requestCode == REQUEST_RECORD_AUDIO && grantResults.length > 0
            && grantResults[0] == PackageManager.PERMISSION_GRANTED) {
        startRecording();
        startRecognition();
    }
}

5. Recording Audio


The AudioRecord class manages the audio resources for Java applications to record audio from the audio input hardware of the platform. This is achieved by reading the data from the AudioRecord object. The application is responsible for polling the AudioRecord object in time using read(short[], int, int).

 private void record() {
        android.os.Process.setThreadPriority(android.os.Process.THREAD_PRIORITY_AUDIO);
        // Estimate the buffer size we'll need for this device.
        int bufferSize =
                AudioRecord.getMinBufferSize(
                        SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        if (bufferSize == AudioRecord.ERROR || bufferSize == AudioRecord.ERROR_BAD_VALUE) {
            bufferSize = SAMPLE_RATE * 2;
        }
        short[] audioBuffer = new short[bufferSize / 2];
        AudioRecord record =
                new AudioRecord(
                        MediaRecorder.AudioSource.DEFAULT,
                        SAMPLE_RATE,
                        AudioFormat.CHANNEL_IN_MONO,
                        AudioFormat.ENCODING_PCM_16BIT,
                        bufferSize);
        if (record.getState() != AudioRecord.STATE_INITIALIZED) {
            Log.e(LOG_TAG, "Audio Record can't initialize!");
            return;
        }
        record.startRecording();
        Log.v(LOG_TAG, "Start recording");
        // Loop, gathering audio data and copying it to a round-robin buffer.
        while (shouldContinue) {
            int numberRead = record.read(audioBuffer, 0, audioBuffer.length);
            int maxLength = recordingBuffer.length;
            int newRecordingOffset = recordingOffset + numberRead;
            int secondCopyLength = Math.max(0, newRecordingOffset - maxLength);
            int firstCopyLength = numberRead - secondCopyLength;
            // We store off all the data for the recognition thread to access. The ML
            // thread will copy out of this buffer into its own, while holding the
            // lock, so this should be thread safe.
            recordingBufferLock.lock();
            try {
                System.arraycopy(audioBuffer, 0, recordingBuffer, recordingOffset, firstCopyLength);
                System.arraycopy(audioBuffer, firstCopyLength, recordingBuffer, 0, secondCopyLength);
                recordingOffset = newRecordingOffset % maxLength;
            } finally {
                recordingBufferLock.unlock();
            }
        }
        record.stop();
        record.release();
    }
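
The wrap-around copy in the loop above can be easier to follow in isolation. The following is a minimal sketch with no Android dependencies; RingBufferSketch, write, and snapshot are hypothetical names, and the class only mirrors the two-step arraycopy logic rather than reproducing the demo app's code.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-alone version of the round-robin recording buffer. */
final class RingBufferSketch {
    private final short[] buffer;
    private final ReentrantLock lock = new ReentrantLock();
    private int offset = 0; // index where the next write starts

    RingBufferSketch(int capacity) {
        buffer = new short[capacity];
    }

    /** Appends count samples, wrapping to index 0 when the end is reached. */
    void write(short[] samples, int count) {
        if (count <= 0) {
            return; // AudioRecord.read can return a negative error code
        }
        if (count > buffer.length) {
            throw new IllegalArgumentException("chunk larger than buffer");
        }
        lock.lock();
        try {
            int newOffset = offset + count;
            int secondCopyLength = Math.max(0, newOffset - buffer.length);
            int firstCopyLength = count - secondCopyLength;
            // First part fills up to the end, second part wraps to the start.
            System.arraycopy(samples, 0, buffer, offset, firstCopyLength);
            System.arraycopy(samples, firstCopyLength, buffer, 0, secondCopyLength);
            offset = newOffset % buffer.length;
        } finally {
            lock.unlock();
        }
    }

    /** Copies the whole buffer out, oldest sample first. */
    short[] snapshot() {
        short[] out = new short[buffer.length];
        lock.lock();
        try {
            int firstCopyLength = buffer.length - offset;
            System.arraycopy(buffer, offset, out, 0, firstCopyLength);
            System.arraycopy(buffer, 0, out, firstCopyLength, offset);
        } finally {
            lock.unlock();
        }
        return out;
    }
}
```

With a capacity of 5, writing the samples 1, 2, 3 and then 4, 5, 6 overwrites the oldest sample, and snapshot() returns 2, 3, 4, 5, 6. The guard on count is an addition in this sketch; the listing above passes the return value of read() straight through.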

6. Run TensorFlow Model


The TensorFlowInferenceInterface class provides a smaller API surface suitable for inference and for summarizing the performance of model execution.

private void recognize() {
    Log.v(LOG_TAG, "Start recognition");
    short[] inputBuffer = new short[RECORDING_LENGTH];
    float[] floatInputBuffer = new float[RECORDING_LENGTH];
    float[] outputScores = new float[labels.size()];
    String[] outputScoresNames = new String[]{OUTPUT_SCORES_NAME};
    int[] sampleRateList = new int[]{SAMPLE_RATE};
    // Loop, grabbing recorded data and running the recognition model on it.
    while (shouldContinueRecognition) {
        // The recording thread places data in this round-robin buffer, so lock to
        // make sure there's no writing happening and then copy it to our own
        // local version.
        recordingBufferLock.lock();
        try {
            int maxLength = recordingBuffer.length;
            int firstCopyLength = maxLength - recordingOffset;
            int secondCopyLength = recordingOffset;
            System.arraycopy(recordingBuffer, recordingOffset, inputBuffer, 0, firstCopyLength);
            System.arraycopy(recordingBuffer, 0, inputBuffer, firstCopyLength, secondCopyLength);
        } finally {
            recordingBufferLock.unlock();
        }
        // We need to feed in float values between -1.0f and 1.0f, so divide the
        // signed 16-bit inputs.
        for (int i = 0; i < RECORDING_LENGTH; ++i) {
            floatInputBuffer[i] = inputBuffer[i] / 32767.0f;
        }
        // Run the model.
        inferenceInterface.feed(SAMPLE_RATE_NAME, sampleRateList);
        inferenceInterface.feed(INPUT_DATA_NAME, floatInputBuffer, RECORDING_LENGTH, 1);
        inferenceInterface.run(outputScoresNames);
        inferenceInterface.fetch(OUTPUT_SCORES_NAME, outputScores);
        // Use the smoother to figure out if we've had a real recognition event.
        long currentTime = System.currentTimeMillis();
        final RecognizeCommands.RecognitionResult result =
                recognizeCommands.processLatestResults(outputScores, currentTime);
        runOnUiThread(
                new Runnable() {
                    @Override
                    public void run() {
                        // If we do have a new command, highlight the right list entry.
                        if (!result.foundCommand.startsWith("_") && result.isNewCommand) {
                            int labelIndex = -1;
                            for (int i = 0; i < labels.size(); ++i) {
                                if (labels.get(i).equals(result.foundCommand)) {
                                    labelIndex = i;
                                }
                            }
                            label.setText(result.foundCommand);
                        }
                    }
                });
        try {
            // We don't need to run too frequently, so snooze for a bit.
            Thread.sleep(MINIMUM_TIME_BETWEEN_SAMPLES_MS);
        } catch (InterruptedException e) {
            // Ignore
        }
    }
    Log.v(LOG_TAG, "End recognition");
}
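
The scaling step that turns 16-bit samples into floats can be tried on its own. The sketch below uses the same divisor as the listing above; PcmConverter and toFloat are hypothetical names introduced only for this illustration.

```java
/** Hypothetical helper: scales signed 16-bit PCM samples to floats. */
final class PcmConverter {
    private PcmConverter() {}

    /** Divides each sample by 32767 so that full-scale input maps to about 1.0. */
    static float[] toFloat(short[] pcm) {
        float[] out = new float[pcm.length];
        for (int i = 0; i < pcm.length; ++i) {
            out[i] = pcm[i] / 32767.0f;
        }
        return out;
    }
}
```

Because the divisor is 32767, the most negative sample (-32768) lands very slightly below -1.0f. That is usually harmless, but it is worth knowing if a model or a later processing step assumes a strict [-1, 1] range.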

7. Recognize Commands


The RecognizeCommands class is fed the output of running the TensorFlow model over time. It averages the signals and returns information about a label when it has enough evidence to think that a recognized word has been found. The implementation is fairly small, just keeping track of the last few predictions and averaging them.

public RecognitionResult processLatestResults(float[] currentResults, long currentTimeMS) {
        if (currentResults.length != labelsCount) {
            throw new RuntimeException(
                    "The results for recognition should contain "
                            + labelsCount
                            + " elements, but there are "
                            + currentResults.length);
        }
        if ((!previousResults.isEmpty()) && (currentTimeMS < previousResults.getFirst().first)) {
            throw new RuntimeException(
                    "You must feed results in increasing time order, but received a timestamp of "
                            + currentTimeMS
                            + " that was earlier than the previous one of "
                            + previousResults.getFirst().first);
        }
        final int howManyResults = previousResults.size();
        // Ignore any results that are coming in too frequently.
        if (howManyResults > 1) {
            final long timeSinceMostRecent = currentTimeMS - previousResults.getLast().first;
            if (timeSinceMostRecent < minimumTimeBetweenSamplesMs) {
                return new RecognitionResult(previousTopLabel, previousTopLabelScore, false);
            }
        }
        // Add the latest results to the tail of the queue (the newest entry is last).
        previousResults.addLast(new Pair<Long, float[]>(currentTimeMS, currentResults));
        Log.d(TAG, currentResults + " " + currentTimeMS);
        // Prune any earlier results that are too old for the averaging window.
        final long timeLimit = currentTimeMS - averageWindowDurationMs;
        while (previousResults.getFirst().first < timeLimit) {
            previousResults.removeFirst();
        }
        // If there are too few results, assume the result will be unreliable and
        // bail.
        final long earliestTime = previousResults.getFirst().first;
        final long samplesDuration = currentTimeMS - earliestTime;
        if ((howManyResults < minimumCount)
                || (samplesDuration < (averageWindowDurationMs / MINIMUM_TIME_FRACTION))) {
            Log.v("RecognizeResult", "Too few results");
            return new RecognitionResult(previousTopLabel, 0.0f, false);
        }
        // Calculate the average score across all the results in the window.
        float[] averageScores = new float[labelsCount];
        for (Pair<Long, float[]> previousResult : previousResults) {
            final float[] scoresTensor = previousResult.second;
            int i = 0;
            while (i < scoresTensor.length) {
                averageScores[i] += scoresTensor[i] / howManyResults;
                ++i;
            }
        }
        // Sort the averaged results in descending score order.
        ScoreForSorting[] sortedAverageScores = new ScoreForSorting[labelsCount];
        for (int i = 0; i < labelsCount; ++i) {
            sortedAverageScores[i] = new ScoreForSorting(averageScores[i], i);
        }
        Arrays.sort(sortedAverageScores);
        // See if the latest top score is enough to trigger a detection.
        final int currentTopIndex = sortedAverageScores[0].index;
        final String currentTopLabel = labels.get(currentTopIndex);
        final float currentTopScore = sortedAverageScores[0].score;
        // If we've recently had another label trigger, assume one that occurs too
        // soon afterwards is a bad result.
        long timeSinceLastTop;
        if (previousTopLabel.equals(SILENCE_LABEL) || (previousTopLabelTime == Long.MIN_VALUE)) {
            timeSinceLastTop = Long.MAX_VALUE;
        } else {
            timeSinceLastTop = currentTimeMS - previousTopLabelTime;
        }
        boolean isNewCommand;
        if ((currentTopScore > detectionThreshold) && (timeSinceLastTop > suppressionMs)) {
            previousTopLabel = currentTopLabel;
            previousTopLabelTime = currentTimeMS;
            previousTopLabelScore = currentTopScore;
            isNewCommand = true;
        } else {
            isNewCommand = false;
        }
        return new RecognitionResult(currentTopLabel, currentTopScore, isNewCommand);
    }
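
The core idea of the method above, averaging scores over a sliding time window, can be sketched without the suppression and threshold logic. ScoreSmootherSketch, addAndAverage, and argMax are hypothetical names. Note one deliberate difference: this sketch divides by the number of entries left in the window after pruning, whereas the listing above reads howManyResults before the new result is added.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified smoother: averages scores over a time window. */
final class ScoreSmootherSketch {
    private static final class Entry {
        final long timeMs;
        final float[] scores;

        Entry(long timeMs, float[] scores) {
            this.timeMs = timeMs;
            this.scores = scores;
        }
    }

    private final Deque<Entry> window = new ArrayDeque<>();
    private final int labelCount;
    private final long windowMs;

    ScoreSmootherSketch(int labelCount, long windowMs) {
        if (windowMs < 0) {
            throw new IllegalArgumentException("windowMs must be non-negative");
        }
        this.labelCount = labelCount;
        this.windowMs = windowMs;
    }

    /** Adds the newest scores, drops stale entries, returns per-label averages. */
    float[] addAndAverage(float[] scores, long nowMs) {
        if (scores.length != labelCount) {
            throw new IllegalArgumentException("expected " + labelCount + " scores");
        }
        window.addLast(new Entry(nowMs, scores.clone()));
        long timeLimit = nowMs - windowMs;
        // The entry just added is never older than the limit, so this terminates.
        while (window.getFirst().timeMs < timeLimit) {
            window.removeFirst();
        }
        float[] average = new float[labelCount];
        for (Entry entry : window) {
            for (int i = 0; i < labelCount; ++i) {
                average[i] += entry.scores[i] / window.size();
            }
        }
        return average;
    }

    /** Returns the index of the largest value (the first one on ties). */
    static int argMax(float[] values) {
        int best = 0;
        for (int i = 1; i < values.length; ++i) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

A caller would compare the top averaged score against a detection threshold before reporting a command, as the full method does with detectionThreshold and suppressionMs.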

The demo app updates its UI list of results automatically based on the labels text file you copy into assets alongside your frozen graph, which means you can easily try out different models without needing to make any code changes. If you change the paths, though, you will need to update LABEL_FILENAME and MODEL_FILENAME to point to the files you’ve added.

8. Conclusion


You can easily replace the pre-trained model with one you’ve trained yourself. If you do this, make sure that the constants in the MainActivity Java source file, such as SAMPLE_RATE and SAMPLE_DURATION, match any changes you’ve made to the defaults while training. There is also a Java version of the RecognizeCommands module that’s very similar to the C++ version. If you’ve tweaked parameters for that, you can also update them in MainActivity to get the same results as in your server testing.
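
To make the relationship between these constants concrete, here is a small sketch of how the buffer length follows from the sample rate and clip duration. The value 16000 Hz is an assumption chosen for illustration, SAMPLE_DURATION_MS stands in for the SAMPLE_DURATION constant with milliseconds assumed as its unit, and AudioConstantsSketch is a hypothetical class name.

```java
/** Hypothetical constants holder showing how the buffer length is derived. */
final class AudioConstantsSketch {
    static final int SAMPLE_RATE = 16000;       // assumed value, in Hz
    static final int SAMPLE_DURATION_MS = 1000; // one-second clip, unit assumed
    static final int RECORDING_LENGTH =
            recordingLength(SAMPLE_RATE, SAMPLE_DURATION_MS);

    private AudioConstantsSketch() {}

    /** Number of samples needed to hold durationMs of audio at sampleRateHz. */
    static int recordingLength(int sampleRateHz, int durationMs) {
        // Widen to long before multiplying to avoid int overflow.
        return (int) ((long) sampleRateHz * durationMs / 1000);
    }
}
```

If a retrained model expects a different sample rate or clip length, both values have to change together with the training settings, because the input buffer fed to the model is sized from them.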

Download this project from GitHub

