Speech Recognition Part 2: Classifying Data

Question

Now that I have generated training data, I need to classify each example with a label to train a TensorFlow neural net (first building a suitable dataset). To streamline the process, I wrote this little Python script to help me. Any suggestions for improvement?

classify.py:

# Builtin modules
import glob
import sys
import os
import shutil
import wave
import time
import re
from threading import Thread

# 3rd party modules
import scipy.io.wavfile
import pyaudio

DATA_DIR = 'raw_data'
LABELED_DIR = 'labeled_data'
answer = None

def classify_files():
    global answer
    # instantiate PyAudio
    p = pyaudio.PyAudio()

    for filename in glob.glob('{}/*.wav'.format(DATA_DIR)):
        # define stream chunk
        chunk = 1024

        #open a wav format music
        wf = wave.open(filename, 'rb')
        #open stream
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)
        #read data
        data = wf.readframes(chunk)

        #play stream
        while answer is None:
            stream.write(data)
            data = wf.readframes(chunk)
            if data == b'': # if file is over then rewind
                wf.rewind()
                time.sleep(1)
                data = wf.readframes(chunk)

        # don't know how to classify, skip sample
        if answer == '.':
            answer = None
            continue

        # sort spectogram based on input
        spec_filename = 'spec{}.jpeg'.format(str(re.findall(r'\d+', filename)[0]))
        os.makedirs('{}/{}'.format(LABELED_DIR, answer), exist_ok=True)
        shutil.copyfile('{}/{}'.format(DATA_DIR, spec_filename), '{}/{}/{}'.format(LABELED_DIR, answer, spec_filename))

        # reset answer field
        answer = None

        #stop stream
        stream.stop_stream()
        stream.close()

    #close PyAudio
    p.terminate()

if __name__ == '__main__':
    try:
        # exclude file from glob
        os.remove('{}/ALL.wav'.format(DATA_DIR))

        num_files = len(glob.glob('{}/*.wav'.format(DATA_DIR)))
        Thread(target = classify_files).start()
        for i in range(0, num_files):
            answer = input("Enter letter of sound heard: ")
    except KeyboardInterrupt:
        sys.exit()

Peilonrayz · Accepted Answer · 2017-04-30 16:10:11Z

Most of your comments don't add much. Comments about PEP8 compliance (such as labelling the import groups) shouldn't be needed, and stating that you're instantiating an object right before doing so doubles what we have to read for no gain.
os.path.join is a better way to build file paths than '{}/{}'.format, as it uses the OS's path separator. Please use it instead.

An alternative in Python 3.4+ is pathlib, which lets you extend a path with the / operator. However, I have not tested that it works with the functions you're using.

Here's an example of using it: (untested)
```
import os
import pathlib

DATA_DIR = pathlib.PurePath('raw_data')

...

# os functions accept path objects from Python 3.6 onwards
os.remove(DATA_DIR / 'ALL.wav')
```
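As a rough sketch of how pathlib could cover the rest of the script's path handling, assuming Python 3.6+: the helper names `wav_files` and `labeled_path` below are made up for illustration and are not part of the original script.

```python
import pathlib

DATA_DIR = pathlib.Path('raw_data')
LABELED_DIR = pathlib.Path('labeled_data')


def wav_files(data_dir, exclude=('ALL.wav',)):
    """Return the .wav files in data_dir, skipping the excluded names."""
    return sorted(p for p in data_dir.glob('*.wav') if p.name not in exclude)


def labeled_path(labeled_dir, label, spec_filename):
    """Build the destination path for a spectrogram filed under label."""
    return labeled_dir / label / spec_filename
```

Filtering on `p.name` means the excluded file never has to be deleted from disk.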
You should move chunk out of the for loop; making it a function argument may be a good idea too.
Making a function that reads your wf indefinitely may ease reading slightly, and a good name such as cycle_wave tells people what it does, since it works roughly the same way as itertools.cycle. This could be implemented as:
```
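def cycle_wave(wf, chunk=1024):
    # chunk is a parameter so the generator doesn't rely on a global name
    while True:
        data = wf.readframes(chunk)
        if data == b'':
            # end of file: rewind, pause, then start again
            wf.rewind()
            time.sleep(1)
            data = wf.readframes(chunk)
        yield data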
```
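To see the wrap-around behaviour without any audio hardware, here is a small self-contained sketch that runs the generator over an in-memory WAV file. The `pause` parameter and the `make_test_wave` helper are additions for the demonstration only, not something the original script has.

```python
import io
import itertools
import time
import wave


def cycle_wave(wf, chunk=1024, pause=1.0):
    """Yield chunks of frames from wf forever, rewinding at the end."""
    while True:
        data = wf.readframes(chunk)
        if data == b'':
            wf.rewind()
            time.sleep(pause)
            data = wf.readframes(chunk)
        yield data


def make_test_wave(n_frames=4):
    """Build a tiny mono 16-bit WAV in memory (illustrative helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, 'wb') as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(8000)
        out.writeframes(bytes(range(2 * n_frames)))
    buffer.seek(0)
    return buffer


with wave.open(make_test_wave(), 'rb') as wf:
    # Four frames read two at a time: the third chunk wraps back to the start.
    chunks = list(itertools.islice(cycle_wave(wf, chunk=2, pause=0), 3))
```

Because the generator never ends on its own, the caller decides when to stop, here with `itertools.islice` and in the script by breaking out of the loop once an answer arrives.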
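For your spec_filename you can use re.search to get only the first match, rather than finding all the numbers in the file name. (re.match would not work here, as it only matches at the start of the string, and the path starts with the directory name.) You also don't need to call str on the result, as format does that by default.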
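A quick illustration of the difference between the two regex functions when pulling the number out of a path. The file name used here is a made-up example; the real naming scheme may differ.

```python
import re

filename = 'raw_data/42.wav'  # hypothetical example path

# re.match only matches at the very start of the string, so the leading
# directory name makes it fail here.
assert re.match(r'\d+', filename) is None

# re.search scans the whole string and returns the first match it finds.
number = re.search(r'\d+', filename).group()
spec_filename = 'spec{}.jpeg'.format(number)  # 'spec42.jpeg'
```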
Rather than deleting a file from your directory and then searching the directory, you can remove that entry from the list glob.glob returns. Since it returns a normal list, you can do this the same way you would with any other list.

One way you can do this is as follows:
```
files = glob.glob('D:/*')
try:
    files.remove('D:/$RECYCLE.BIN')
except ValueError:
    pass
```
If you have multiple files you want to exclude, you could use sets instead:
```
files = set(glob.glob('D:/*')) - {'D:/$RECYCLE.BIN'}
```

All of this together can get you:

import glob
import sys
import os
import shutil
import wave
import time
import re
from threading import Thread

import scipy.io.wavfile
import pyaudio

DATA_DIR = 'raw_data'
LABELED_DIR = 'labeled_data'
answer = None

def cycle_wave(wf, chunk=1024):
    while True:
        data = wf.readframes(chunk)
        if data == b'':
            wf.rewind()
            time.sleep(1)
            data = wf.readframes(chunk)
        yield data

def classify_files(chunk=1024):
    global answer
    join = os.path.join

    p = pyaudio.PyAudio()
    files = set(glob.glob(join(DATA_DIR, '*.wav'))) - {join(DATA_DIR, 'ALL.wav')}
    for filename in files:
        wf = wave.open(filename, 'rb')
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)

        for data in cycle_wave(wf, chunk):
            if answer is not None:
                break
            stream.write(data)

        # don't know how to classify, skip sample
        if answer == '.':
            answer = None
            continue

        # sort spectrogram based on input
        spec_filename = 'spec{}.jpeg'.format(re.search(r'\d+', filename).group())
        os.makedirs(join(LABELED_DIR, answer), exist_ok=True)
        shutil.copyfile(
            join(DATA_DIR, spec_filename),
            join(LABELED_DIR, answer, spec_filename)
        )

        # reset answer field
        answer = None

        #stop stream
        stream.stop_stream()
        stream.close()

    #close PyAudio
    p.terminate()

if __name__ == '__main__':
    join = os.path.join
    try:
        # exclude file from glob
        files = set(glob.glob(join(DATA_DIR, '*.wav'))) - {join(DATA_DIR, 'ALL.wav')}
        num_files = len(files)
        Thread(target = classify_files).start()
        for _ in range(0, num_files):
            answer = input("Enter letter of sound heard: ")
    except KeyboardInterrupt:
        sys.exit()

But I've left out proper handling of streams. In most languages that I've used streams in, it's recommended to always close the stream, and Python is no different. You can normally do this in two ways:

Use with. This hides a lot of the code, so it makes using streams seamless. It also makes the lifetime of the stream clear, so people won't try to use it after it's been closed.

Here's an example of using it:
```
with wave.open('<file location>') as wf:
    print(wf.readframes(1024))
```
Use a try-finally. You don't need to add an except clause, as you may not want to handle the error here; the finally is there to ensure that the stream is closed either way.

Here's an example of using it:
```
p = pyaudio.PyAudio()
try:
    stream = p.open(...)
    try:
        pass  # do some stuff
    finally:
        stream.stop_stream()
        stream.close()
finally:
    p.terminate()
```

I'd personally recommend using one of the above in your code. I'd really recommend with over try-finally, but pyaudio doesn't support that interface, so you'd have to add it yourself if you wanted to go that way.
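One possible way to get with-style handling without touching pyaudio itself is a small wrapper built on contextlib. This is only a sketch: `open_stream` is a hypothetical helper name, and it assumes nothing more than that the object it is given has an `open` method returning something with `stop_stream` and `close`, as `pyaudio.PyAudio` does.

```python
import contextlib


@contextlib.contextmanager
def open_stream(audio, **kwargs):
    """Open a stream on audio and always stop and close it afterwards."""
    stream = audio.open(**kwargs)
    try:
        yield stream
    finally:
        # Runs whether the with-body finished normally or raised.
        stream.stop_stream()
        stream.close()


# Intended use with PyAudio (not run here, as it needs an audio device):
#
#     p = pyaudio.PyAudio()
#     try:
#         with open_stream(p, format=fmt, channels=1, rate=44100,
#                          output=True) as stream:
#             stream.write(data)
#     finally:
#         p.terminate()
```

With this in place, a `continue` inside the with block would no longer skip the cleanup.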

Do you have a less destructive way for excluding a file from the glob? — syb0rg, Commented Apr 29, 2017 at 15:03
For some alternatives related to using glob: stackoverflow.com/questions/8931099/quicker-to-os-walk-or-glob — holroy, Commented Apr 30, 2017 at 19:46
