First things first: you can get far better results by fine-tuning the arguments.
Yes, it goes over 52% at times and I'm sure it can go even higher. Let's take a look at which colour corresponds to which set of arguments.
| Run | Settings |
| --- | --- |
| ORE1X4 | default settings |
| EAPX5J | epoch 200 instead of 100 |
| 6DHKQJ | dropout 0.9 instead of 0.8 |
| 53O25D | learning_rate 1e-3 instead of 1e-4 |
| 0TWRK8 | 256 instead of 128 |
| QCAZN8 | epoch 300, dropout 0.9, learning_rate 1e-3, 256 instead of 128 |
| H57Z4I | learning_rate 1e-3, epoch 400 |
I've rewritten `0.0001` as `1e-4`, which is easier on the eyes. You made a good start by putting it in a variable (which, according to PEP 8, should be `CAPITAL_CASED`, since it's a pseudo-constant). So why didn't you put the others in variables as well? Look at how the functions you call name their arguments and use that as inspiration for your variable names.
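As a sketch of what that could look like (the constant and function names here are my own, illustrative choices; match them to whatever keyword arguments your training code actually takes):

```python
# Hyperparameters as module-level pseudo-constants (PEP 8: UPPER_CASE).
LEARNING_RATE = 1e-4
EPOCHS = 100
DROPOUT = 0.8
HIDDEN_UNITS = 128


def train(learning_rate=LEARNING_RATE, epochs=EPOCHS,
          dropout=DROPOUT, hidden_units=HIDDEN_UNITS):
    """Stand-in for your actual training entry point."""
    return {
        "learning_rate": learning_rate,
        "epochs": epochs,
        "dropout": dropout,
        "hidden_units": hidden_units,
    }


# Tweaking a run now means changing one argument,
# not hunting for magic numbers scattered through the script:
config = train(epochs=200, dropout=0.9)
```

Every run in the table above then becomes a one-line change, which also makes it trivial to log exactly which settings produced which curve.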
Keep in mind you're using spectrograms as input. Spectrograms have a downside: they only measure the intensity of frequencies, not the phase. You may have heard this described as the phase problem. It means every spectrogram carries broadband noise, which limits the overall effectiveness of your output. The measured effectiveness might not even be the real effectiveness, since the metric implicitly treats that noise as part of the signal.
So, not only could you use more data to achieve a higher accuracy, you may eventually need more complete data. As in, less noise and with phase information.
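To make the phase problem concrete, here's a minimal SciPy sketch (sample rate and signal are made up for illustration): taking the magnitude of an STFT throws the phase away, so completely different waveforms can share the exact same spectrogram.

```python
import numpy as np
from scipy import signal

fs = 16000  # assumed sample rate in Hz; use your recordings' actual rate
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)  # one second of stand-in audio

# The STFT is complex: it keeps both magnitude and phase.
_, _, Z = signal.stft(x, fs=fs)
mag = np.abs(Z)  # the spectrogram is just this magnitude

# Scramble the phase while keeping the magnitude identical...
Z_scrambled = mag * np.exp(1j * rng.uniform(0, 2 * np.pi, Z.shape))
print(np.allclose(np.abs(Z_scrambled), mag))  # True: same spectrogram

# ...yet inverting it gives a very different waveform.
_, x_back = signal.istft(Z_scrambled, fs=fs)
print(np.allclose(x, x_back[:len(x)]))  # False: the audio is not the same
```

Since the network only ever sees `mag`, everything the phase carried is invisible to it, which is why "more complete data" eventually means keeping phase information around.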
As for performance, there's not much you can do. Your code already runs significantly faster on my laptop than on your Mac (original set-up in under 15 minutes), even without GPU acceleration. TensorFlow is pretty well optimized to use multiple cores.
Keep in mind the x-axis displays steps. The time it takes to reach a given step count can vary wildly depending on the arguments you provide: 0TWRK8 took three times as long to reach step 500 as H57Z4I did, while the latter appears to score better. Figure out which arguments are 'worth their weight' and which simply slow you down for little to no gain.
My advice? Experiment! After a couple hundred epochs the data will just about flatline, so going above 200 isn't particularly useful when going for sample runs.
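You can even automate the "it's flatlined" call instead of eyeballing the curve. A minimal sketch (the function name, window size and threshold are my own illustrative choices; Keras users get this for free via its `EarlyStopping` callback):

```python
def has_flatlined(history, window=5, min_delta=0.005):
    """True when the metric hasn't improved by at least `min_delta`
    over the last `window` epochs, compared to everything before them."""
    if len(history) <= window:
        return False
    best_before = max(history[:-window])
    best_recent = max(history[-window:])
    return best_recent - best_before < min_delta


# A run that has levelled off around 51%:
plateaued = [0.30, 0.38, 0.44, 0.48, 0.50, 0.51,
             0.512, 0.513, 0.513, 0.514, 0.514]
print(has_flatlined(plateaued))  # True

# A run that is still climbing:
improving = [0.10, 0.20, 0.30, 0.40, 0.45, 0.50,
             0.55, 0.60, 0.65, 0.70, 0.75]
print(has_flatlined(improving))  # False
```

Checking this after every epoch lets you cut a run short the moment extra epochs stop paying for themselves.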
Fiddling with the input reminded me of a game I played a long while back: Foldit. There's the early game, the mid game and the end game. In the early game, you're looking for the big changes. The later you get, the more your focus shifts to fine-tuning your approach. However, if you had an inefficient start, no amount of fine-tuning would reach the score you wanted; the score would flat-line.
Consider developing this machine in the same manner. Don't rush development to make it go fast if that will hurt its accuracy in the end. After all, nothing is as annoying as speech recognition that only works half the time. If you need certain functions to keep your output quality high, don't optimize them away only to regret it later.
Something else your dataset doesn't take into account is combinations of characters. `ch` isn't exactly pronounced as a combination of `c` and `h`. The same goes for `ph`, `th` and other combinations. Keep this in mind when you start field testing your network.
If your labels don't account for combinations like `ch`, `ph`, `th` and the likes, you may just as well need a new dataset.
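One way to handle this on the label side is to treat common digraphs as single units when splitting words. A rough sketch (the digraph list is deliberately incomplete and the function name is my own; real English pronunciation needs a proper phoneme dictionary):

```python
# Digraphs that should stay together as one pronunciation unit.
# Illustrative only; extend or replace with a real phoneme inventory.
DIGRAPHS = ("ch", "ph", "th", "sh", "wh")


def to_units(word):
    """Split a word into units, keeping known digraphs together."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


print(to_units("church"))   # ['ch', 'u', 'r', 'ch']
print(to_units("phantom"))  # ['ph', 'a', 'n', 't', 'o', 'm']
```

Even this naive greedy split shows the labelling difference: a network trained on per-letter labels has no way to learn that `church` contains two `ch` sounds rather than four separate letters.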