I am parsing/processing a "large" chunk of raw data generated by a unix shell command. This raw data needs to be parsed to strip out certain special characters.

What I ultimately want is to avoid the need for a big temporary file and do the processing on the fly instead.

Way 1 generates a big 8 GB temporary text file (not desired) but is fast (8 minutes total execution): first I dump the shell output into a temporary raw text file, and then parse it with the following code. Execution time: 8 minutes; output file size: 800 MB:

rncflag = 0      # set while the current line contains "RNC"
lineNumOne = 1   # true only until the first output line has been written

f = open(filepath, 'r')
fOut = open(filepathOut, "w+")
for line in f:
    if len(line) > 0:
        if "RNC" in line:
            rncflag = 1
            # strip the newline and tabs, remove spaces, swap '.' for ',' and drop leading ';'
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        else:
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")

        if rncflag == 1:
            if lineNumOne == 1:
                processedline = currline
                lineNumOne = 0
            else:
                processedline = '\n' + currline
            rncflag = 0
        else:
            processedline = currline

    fOut.write(processedline)
f.close()
fOut.close()

Way 2 works on the fly, directly from stdout (~1.5 hours total execution): this is the one I would prefer, since I don't need to generate the raw file before parsing it. I use the subprocess library to parse/process the unix shell's stdout line by line while it is being generated (as if they were the lines of the txt file). The problem is that it is vastly slower than the previous way. Execution time: more than 1.5 hours to get the same output file (size 800 MB):

import subprocess

rncflag = 0      # set while the current line contains "RNC"
lineNumOne = 1   # true only until the first output line has been written

fOut = open(filepathOut, "w+")
cmd = subprocess.Popen(isqlCmd, shell=True, stdout=subprocess.PIPE)
for line in cmd.stdout:
    if len(line) > 0:
        if "RNC" in line:
            rncflag = 1
            # strip the newline and tabs, remove spaces, swap '.' for ',' and drop leading ';'
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        else:
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")

        if rncflag == 1:
            if lineNumOne == 1:
                processedline = currline
                lineNumOne = 0
            else:
                processedline = '\n' + currline
            rncflag = 0
        else:
            processedline = currline

    fOut.write(processedline)

cmd.wait()
fOut.close()

I am not a Python expert, but I'm sure there must be a way to speed up the on-the-fly processing of the unix stdout, instead of first generating the raw file and parsing it afterwards.

The purpose of the program is to clean up / parse the output of a Sybase isql query. Note: the Sybase library cannot be installed.

The Python version is 2.6.4 and it cannot be changed.

Thanks in advance; any improvement is welcome.

  • BTW -- it makes your code easier to understand at a glance for people used to Python if you follow PEP-8 naming conventions. That means that a word that starts with an upper-case letter is assumed to be a class name -- so Line and RNCflag should be lower-case. See python.org/dev/peps/pep-0008 Commented Apr 1, 2020 at 13:05
  • One place I would start, by the way, is looking at buffering behavior of the program that's actually writing your SQL file. There are a few different possible ways that could be harming your performance: The SQL program could be blocking (and not feeding new content from the database server) while it's waiting for the Python script to run; and the SQL program could be buffering content in-memory and not writing it until it's generated a larger chunk, leaving the Python script waiting. Until you've measured where your bottlenecks are, one can't say how to fix them. Commented Apr 1, 2020 at 13:10
  • ...to do full-system tracing so you can watch the data flow between actively executing software, I strongly recommend sysdig -- think of it like strace but system-wide and with minimal (single-digit percentage) performance impact. Commented Apr 1, 2020 at 13:11
  • BTW, the easiest way to address some of those buffering problems might be to switch from using the subprocess module to reading from sys.stdin, and then piping from a program that generates your SQL. That lets you use identical Python code in both cases (just changing the usage, as in ./yourPythonProgram <pregenerated.sql vs generateSql | ./yourPythonProgram), and lets you put programs to adjust and measure buffering behavior in the pipeline (I strongly recommend pv for the purpose). Commented Apr 1, 2020 at 13:13
  • One final note: The Stack Exchange site best suited to requests for general-purpose improvements to already-working code is not Stack Overflow (which is explicitly focused on narrow questions about specific problems) but Code Review. That said, there are some changes that would be needed; among others, whereas we want a minimal reproducible example, they want complete and working code. Should this question be closed as "too broad" here, do consider modifying it per the guidelines in A Guide To Code Review For Stack Overflow Users and then posting there. Commented Apr 1, 2020 at 13:17

1 Answer

Without the ability to reproduce the problem, a canonical answer isn't feasible -- but it's possible to provide the tools needed to narrow in on the problem.

If you switch from using subprocess.Popen(..., stdout=subprocess.PIPE) to just reading from sys.stdin unconditionally, that means we can use the same code in both the reading-from-a-file case (in which case you'll want to run ./yourscript <inputfile), and in the pipe-from-a-process case (./runIsqlCommand | ./yourscript), so we can be confident that we're testing like-for-like.
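
For illustration, a minimal sketch of that sys.stdin variant might look like the following (it is only a sketch: the cleaning logic is taken from the question as-is, with the duplicated branch collapsed, and filepathOut stands in for whatever output path the script already uses):

import sys

# flag variables from the question, initialised before the loop
rncflag = 0
lineNumOne = 1

fOut = open(filepathOut, "w+")   # filepathOut: the existing output path (assumed)
for line in sys.stdin:           # works the same whether stdin is a file or a pipe
    if len(line) > 0:
        if "RNC" in line:
            rncflag = 1
        # same cleanup as in the question
        currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")

        if rncflag == 1:
            if lineNumOne == 1:
                processedline = currline
                lineNumOne = 0
            else:
                processedline = '\n' + currline
            rncflag = 0
        else:
            processedline = currline

    fOut.write(processedline)
fOut.close()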

Once that's done, it also gives us room to put buffering in place, to prevent the sides of the pipeline from blocking on each other unnecessarily. Doing so might look like:

./runIsqlCommand | pv | ./yourscript

...where pv is Pipe Viewer, a tool that provides a progress bar (when the total amount of content is known), a throughput indicator, and -- critically for our purposes -- a much larger buffer than the operating system's default, along with room to adjust that size further (and monitor its consumption).

To determine whether the Python script is running slower than the SQL code, tell pv to display buffer consumption with the -T argument. (If this shows ----, then pv is using the splice() syscall to transfer content between the processes directly without actually performing buffering; the -C argument will increase pv's overhead, but ensure that it's actually able to perform buffering and report on buffer content). If the buffer is 100% full almost all of the time, then we know that the SQL is being generated faster than the Python can read it; if it's usually empty, we know the Python is keeping up.
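
As a concrete example, the measuring pipeline might be invoked as follows (-C and -T are the pv flags described above; if the default buffer proves too small, a larger one can additionally be requested with pv's -B option):

./runIsqlCommand | pv -C -T | ./yourscript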
