
The last part of my speech recognition series: finally training my network. Here's the dataset I did it with (self-generated, small I know), and the code I used.

After running this code (takes about an hour on my Mac), I get a validation accuracy of roughly 30%... not spectacular. Any ideas on how to improve the training speed, or the neural network's accuracy? Any other suggestions in general?

import os
import numpy as np
import tflearn

def main():
    LABELED_DIR = 'labeled_data'
    width = 512
    height = 512
    classes = 26  # characters
    learning_rate = 0.0001
    batch_size = 25

    # load data
    print('Loading data')
    X, Y = tflearn.data_utils.image_preloader(
        LABELED_DIR, image_shape=(width, height), mode='folder',
        normalize=True, grayscale=True, categorical_labels=True,
        files_extension=None, filter_channel=False)
    X_shaped = np.squeeze(X)
    trainX, trainY = X_shaped, Y

    # Network building
    print('Building network')
    net = tflearn.input_data(shape=[None, width, height])
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, classes, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate, loss='categorical_crossentropy')

    model = tflearn.DNN(net, tensorboard_verbose=3)
    print('Training network')
    model.fit(trainX, trainY, validation_set=0.15, n_epoch=100, show_metric=True, batch_size=batch_size)

if __name__ == '__main__':
    main()

  You admit you have a small dataset, yet you're wondering how to improve the accuracy. If you halved your dataset, the accuracy should drop significantly. You can test this. Now, with that difference in mind, guess what the new accuracy would be if you doubled your dataset instead.
    – Mast
    Commented May 8, 2017 at 8:11
  Is there a reason you went with images as dataset instead of audio files? I'd imagine there's a small error margin on the pictures, adding to the trouble.
    – Mast
    Commented May 8, 2017 at 8:15
  @Mast It's not necessarily a linear relationship, though. When I was first setting up the network I had only three letters, with several images for each of them, and was achieving roughly 90% accuracy. I do agree that, in general, more data would lead to better accuracy.
    – syb0rg
    Commented May 8, 2017 at 13:44
  Those were probably easy letters. For combinations, like ch, ph, th and the likes, you may just as well need a new dataset.
    – Mast
    Commented May 8, 2017 at 13:58
  How do you calculate your accuracy?
    – Mast
    Commented May 8, 2017 at 15:53

1 Answer


First things first: you can get far better results by fine-tuning the arguments.


Yes, the accuracy goes over 52% at times and I'm sure it can go even higher. Let's take a look at which colour corresponds to which settings.


default settings
Epoch 200 instead of 100
Dropout 0.9 instead of 0.8
learning_rate 1e-3 instead of 1e-4
256 instead of 128
Epoch 300, dropout 0.9, learning_rate 1e-3, 256 instead of 128
learning_rate 1e-3, epoch 400

I've rewritten 0.0001 as 1e-4, which is easier on the eyes. You made a good start by putting it in a variable (which, according to PEP 8, should be named in UPPER_CASE since it's a pseudo-constant). So why didn't you put the others in variables as well? Look at how the functions you call have named their arguments and use this as inspiration for your variable names.
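A minimal sketch of what that could look like, assuming the settings live at module level and are passed on by keyword. The constant names mirror the arguments of the tflearn calls in the question (`n_units`, `dropout`, `n_epoch`, `batch_size`, `validation_set`); `fit_kwargs` is a hypothetical helper for illustration, not part of tflearn.

```python
# Pseudo-constants, named after the arguments they feed into.
WIDTH = 512
HEIGHT = 512
N_CLASSES = 26          # characters
LEARNING_RATE = 1e-4
BATCH_SIZE = 25
N_UNITS = 128           # number of LSTM units
DROPOUT = 0.8           # value passed as `dropout` to tflearn.lstm
N_EPOCH = 100
VALIDATION_SET = 0.15


def fit_kwargs():
    """Hypothetical helper: collect the keyword arguments for model.fit()."""
    return {
        'validation_set': VALIDATION_SET,
        'n_epoch': N_EPOCH,
        'show_metric': True,
        'batch_size': BATCH_SIZE,
    }
```

The training call would then read `model.fit(trainX, trainY, **fit_kwargs())`, and changing a setting means touching exactly one line at the top of the file.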

Keep in mind you're using spectrograms as input. Spectrograms have some downsides, since they only measure the intensity of frequencies and not the phase. You may have heard this problem described as the phase problem. This means every spectrogram has broadband noise, impacting the overall effectiveness of your output. The measured effectiveness might not even be the real effectiveness, since it probably assumes you actually like the noise.

So, not only could you use more data to achieve a higher accuracy, you may eventually need more complete data. As in, less noise and with phase information.
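To illustrate the missing phase information, here is a small standard-library sketch (it is not the code that produced the question's spectrograms). It computes a single DFT bin for a cosine and a sine of the same frequency. The frame length `N` and bin index `K` are arbitrary values picked for illustration.

```python
import cmath
import math

N = 64   # samples per frame (illustrative value)
K = 4    # frequency bin under inspection (illustrative value)


def dft_bin(signal, k):
    """Return the complex DFT coefficient of `signal` at bin `k`."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * k * t / n)
               for t, x in enumerate(signal))


cosine = [math.cos(2 * math.pi * K * t / N) for t in range(N)]
sine = [math.sin(2 * math.pi * K * t / N) for t in range(N)]

cos_bin = dft_bin(cosine, K)
sin_bin = dft_bin(sine, K)

# The magnitudes match (both are about N / 2), so a spectrogram,
# which stores magnitudes only, cannot tell the two signals apart.
print(abs(cos_bin), abs(sin_bin))

# The phases differ by a quarter period; this is the part that is lost.
print(cmath.phase(cos_bin), cmath.phase(sin_bin))
```

Two different waveforms end up as the same spectrogram column, which is why a magnitude-only representation is an incomplete description of the audio.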

As for performance, there's not much you can do. Your code runs significantly faster on my laptop than on your Mac (the original set-up finishes in under 15 minutes), even without using a GPU for acceleration. TensorFlow is pretty well optimized to use multiple cores.

Keep in mind the X-axis displays steps. The time it takes to reach a certain number of steps can vary wildly depending on the arguments you provide. 0TWRK8 took 3 times as long to reach step 500 as H57Z4I, while the latter appears to be scoring better. Figure out which arguments are 'worth their weight' and which simply slow you down for little to no gain.
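One way to put numbers on the cost per step is to time each run yourself. The sketch below is only an illustration: `timed` is a hypothetical wrapper, and `fake_train` is a stand-in that sleeps instead of training. In practice you would wrap whatever function runs `model.fit(...)`.

```python
import time


def timed(train, *args, **kwargs):
    """Call train(*args, **kwargs); return (its result, elapsed seconds)."""
    start = time.perf_counter()
    result = train(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_train(n_steps):
    """Stand-in for a real training run (assumption): returns steps completed."""
    time.sleep(0.001 * n_steps)
    return n_steps


steps, seconds = timed(fake_train, 20)
print('{} steps in {:.3f}s ({:.4f}s per step)'.format(
    steps, seconds, seconds / steps))
```

Comparing seconds per step next to the accuracy curves makes it easier to see which settings are slow for little gain.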

My advice? Experiment! After a couple hundred epochs the data will just about flatline, so going above 200 isn't particularly useful when going for sample runs.
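A rough sketch of how such experiments could be organised, assuming a small grid built from the values tried in the runs listed earlier. `settings_grid`, `best_settings` and `dummy_score` are hypothetical names. `dummy_score` is only a placeholder so the example runs; it would be replaced by a short training run that returns validation accuracy.

```python
import itertools

# Candidate values, taken from the runs listed earlier in this answer.
GRID = {
    'n_units': [128, 256],
    'dropout': [0.8, 0.9],
    'learning_rate': [1e-4, 1e-3],
}


def settings_grid(grid):
    """Yield one settings dict per combination of the grid values."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def best_settings(grid, score):
    """Return the settings dict for which score(settings) is highest."""
    return max(settings_grid(grid), key=score)


def dummy_score(settings):
    """Placeholder (assumption): swap in a short training run that
    returns the validation accuracy for these settings."""
    return settings['n_units'] * settings['learning_rate']


print(len(list(settings_grid(GRID))))
print(best_settings(GRID, dummy_score))
```

With short sample runs per combination, the big early-game differences show up quickly, and the longer runs can be reserved for the most promising settings.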

Fiddling with the input reminded me of a game I played a long while back: Foldit.

There's the early game, the mid game and the end game. In the early game, you're looking for the big changes. The later you get, the more your focus shifts to different aspects to fine-tune your approach. However, if you had an inefficient start, you couldn't fine-tune enough to reach the score you wanted. The score would flat-line.

Consider developing this machine in the same manner. Don't rush the development to make it go fast if that will hurt its accuracy in the end. After all, nothing is as annoying as speech recognition that only works half the time. If you need certain functions to keep your output at good quality, don't optimize them away only to regret it later.

Something else your dataset doesn't take into account is combinations of characters. ch isn't exactly pronounced as a combination of c and h. The same goes for ph, th and other combinations. Keep this in mind when you start field testing your network.

  Would you mind sharing the name of the tool/framework that you used to create the charts? :)
    – Yoryo
    Commented Mar 29, 2018 at 22:21
    @Yoryo Load the logs with Tensorboard and it does all the plotting automagically.
    – Mast
    Commented Mar 29, 2018 at 22:22

