I have a task of parsing/processing "large" raw data generated by a unix shell command. The raw data needs to be cleaned of some special characters. What I ultimately want is to avoid the big temporary file and do the processing on the fly instead.
Way 1 generates a big 8 GB temporary text file (not desired) but is fast (8 minutes total execution). First I dump the shell output into a raw text file, and then parse it with the following code. Execution time: 8 minutes; output file size: 800 MB:
f = open(filepath, 'r')
fOut = open(filepathOut, "w+")
rncflag = 0
lineNumOne = 1
for line in f:
    if len(line) > 0:
        if "RNC" in line:
            rncflag = 1
        # remove the newline, strip tabs, turn dots into commas,
        # drop spaces and leading semicolons
        currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        if rncflag == 1:
            if lineNumOne == 1:
                processedline = currline
                lineNumOne = 0
            else:
                processedline = '\n' + currline
            rncflag = 0
        else:
            processedline = currline
        fOut.write(processedline)
f.close()
fOut.close()
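As an aside, the cleaning chain is the same for RNC and non-RNC lines, so it could be factored into a small helper. A sketch, using the exact chain of `str` methods from the listing above (`clean_line` is a name I made up):

```python
def clean_line(line):
    # Strip the trailing newline and tabs, turn decimal points into
    # commas, drop all spaces, and remove leading semicolons.
    return line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
```

This keeps the RNC flag handling as the only logic in the loop body.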
Way 2, on the fly directly from stdout (~1.5 hours total execution). This is the one I would prefer, since I don't need to generate the raw file first. I use the subprocess library to parse/process the unix shell stdout directly, line by line, while it is being generated (as if it were the lines of the txt file). The problem is that it is vastly slower than the previous way: execution time is more than 1.5 hours to get the same 800 MB output file:
import subprocess

cmd = subprocess.Popen(isqlCmd, shell=True, stdout=subprocess.PIPE)
fOut = open(filepathOut, "w+")
rncflag = 0
lineNumOne = 1
for line in cmd.stdout:
    if len(line) > 0:
        if "RNC" in line:
            rncflag = 1
        # remove the newline, strip tabs, turn dots into commas,
        # drop spaces and leading semicolons
        currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        if rncflag == 1:
            if lineNumOne == 1:
                processedline = currline
                lineNumOne = 0
            else:
                processedline = '\n' + currline
            rncflag = 0
        else:
            processedline = currline
        fOut.write(processedline)
fOut.close()
I am not a Python expert, but I'm sure there must be a way to speed up processing the unix stdout on the fly, instead of generating the raw file first and parsing it once generated.
The purpose of the program is to clean up / parse a Sybase isql query output. Note: the Sybase library cannot be installed.
The Python version is 2.6.4 and cannot be changed.
Thanks in advance; any improvement is welcome.
Comments:
`Line` and `RNCflag` should be lower-case; see python.org/dev/peps/pep-0008.
Consider switching from the subprocess module to reading from sys.stdin, and then piping in from the program that generates your SQL. That lets you use identical Python code in both cases (just changing the invocation, as in ./yourPythonProgram < pregenerated.sql vs generateSql | ./yourPythonProgram), and lets you put programs in the pipeline to adjust and measure buffering behavior (I strongly recommend pv for the purpose).
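A sketch of that suggestion, assuming the filter is saved as a standalone script (the name filter.py is made up, and the RNC/line-joining logic is omitted here to show only the stdin pattern): the per-line clean-up lives in a function that accepts any iterable of lines, so the same code serves both the pregenerated-file and the live-pipe case.

```python
import sys

def process(stream, out):
    # Apply the same per-line clean-up regardless of whether `stream`
    # is an open file or a pipe from another process.
    for line in stream:
        out.write(line.strip('\n').strip('\t').replace(".", ",")
                      .replace(" ", "").lstrip(";") + '\n')

if __name__ == '__main__':
    # ./filter.py < pregenerated.txt    or    generateSql | ./filter.py
    process(sys.stdin, sys.stdout)
```

Buffering of both stdin and stdout is then the shell pipeline's concern, which is exactly what tools like pv let you observe and tune.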