
I have a function (written in Python) that processes a loosely structured log file.

The log file has begin/end markers for different sections, with each section describing different things (e.g., errors, assertions, warnings, property dumps). Sections can repeat for different operations or different objects that are logged.

The function has only one top-level block, viz. a with open(filename, ...) as log: statement.

That block contains only one block within it, viz. a for line in log: loop.

Now, that for-loop has 90+ lines, causing PyLint to report a cyclomatic complexity of 38 ;-(

The loop body has sections like so:

if not any([inThisSection, inThatSection, inAssertion, ...]):
  # Code to look for Begin Section markers.
  # When one is found, mark the matching flag as True.
  # E.g., if the Begin Assertion marker is identified:
  inAssertion = True
  continue

if inAssertion:
  # Keep track of which line of the Assertion section we are on.
  # When the End Assertion marker is detected:
  inAssertion = False
  continue

There are multiple calls to startswith or endswith, and at times regular expressions are used.

It is messy but then so is the log file. Considerable engineering went into writing this function.

If I could rewrite it, how would I implement this function (or how could I break it into multiple functions)?

I understand Python's iterator protocol, but I do not necessarily see how it could help here. How could I call a function:

  • passing the current state of the iterable to that function,
  • having that function consume a few elements of the iterable,
  • and, when that function detects an element that it should not be dealing with,
  • returning to the loop with the iteration state at the line of the file that it should process next?

Thank you for any insights you can provide.

  • I could be wrong but this code example reads like the author discovered a way to hack around the fact that the language doesn't have a goto command without discovering why the language doesn't have a goto command. I think we need to know more about the log file structure to be able to answer this. Commented Apr 5, 2023 at 18:29
  • Do you have a reason to be worried about the specific measure "cyclomatic complexity" rather than more subjective judgments such as "engineers who work on this code have no problem understanding it" vs "have to study it for a few hours to work on it" vs "need a couple of weeks to work on it, then they quit"?
    – davidbak
    Commented Apr 5, 2023 at 20:54
  • I said that (previous comment) specifically because typically a state machine is coded according to strict patterns, repeated, with small per-state modifications. The global thing - the whole state machine - generally violates most software engineering "rules of thumb" or "best practices", but each little piece - each state - is easily understood in isolation. The problem with a state machine is whether it correctly encodes a state transition diagram - and is that state transition diagram itself correct? Cyclomatic complexity and other usual measures just don't matter here.
    – davidbak
    Commented Apr 5, 2023 at 20:57
  • You did not indicate why you care about cyclomatic complexity. Why do you care about cyclomatic complexity? Commented Apr 5, 2023 at 23:01
  • The interpreter pattern is your friend :-).
    – Laiv
    Commented Apr 6, 2023 at 11:06

4 Answers


use pipelines

OP describes a common problem. Here is a motivating use case I have often encountered:

$ grep "Smith" /var/log/messages

But, alas! The Smith customer appears in both errors and property dumps, and I only care about error occurrences. Oh, no! We could resort to sed '/err_start/,/err_end/', but it's probably better to reformat the error log just once in order to support running diverse reporting tasks against it later. Perhaps by having awk (or even sed) prepend a "type tag" field to each line.

Then we can conveniently query:

$ grep "^err.*Smith" tagged_messages.txt

separation of concerns via generators

In Python we'd implement the first stage of the pipeline with a generator.

from enum import Enum, auto

class Tag(Enum):
    MISC = auto()
    ERR = auto()
    ASSERT = auto()
    WARN = auto()
    PROP_DUMP = auto()
    SEPARATOR = auto()

MESSAGES = "/var/log/messages"

def get_tagged_syslogs(infile=MESSAGES):
    tag = Tag.MISC
    with open(infile) as fin:
        for line in fin:
            tag = _classify(tag, line)
            yield tag, line

def consume(infile=MESSAGES):
    for tag, line in get_tagged_syslogs(infile):
        ...

Now the consuming code is in the same position as grep above, and can conveniently test if tag == Tag.ERR and "Smith" in line:
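
For instance, a consumer that reports only the error lines mentioning a given customer might look like this (a sketch; report_customer_errors is a name invented here):

def report_customer_errors(name="Smith", infile=MESSAGES):
    """Print only error-section lines that mention the customer."""
    for tag, line in get_tagged_syslogs(infile):
        if tag == Tag.ERR and name in line:
            print(line, end="")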


choose a simpler representation

The English part of the problem description suggests we have mutually exclusive sections, of about five distinct types. The code description, however, would use five independent booleans, admitting of 2^5 == 32 arbitrarily nested possibilities, a much more complex situation.

This line abstracts away potential section conflicts by forcing the choice of a single tag:

            tag = _classify(tag, line)

A match on any section-ending marker lets us revert to Tag.MISC. But even if the input file violates the "balanced parentheses" or "balanced section marker" grammar, we still have an opportunity to close out the current section by reporting a tag for the newly observed section-start marker.

Rather than N if clauses for N section types, the classifier will scan a vector of N section-start markers and immediately report a new tag upon seeing a match. Then it will scan N section-end markers and report Tag.MISC for a match. Finally it will default to preserving the current tag unchanged.
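
A minimal sketch of such a classifier, reusing the err_start / err_end markers from the motivating example (the remaining marker names are invented for illustration):

START_MARKERS = {
    "err_start": Tag.ERR,
    "assert_start": Tag.ASSERT,
    "warn_start": Tag.WARN,
    "prop_start": Tag.PROP_DUMP,
}
END_MARKERS = ("err_end", "assert_end", "warn_end", "prop_end")

def _classify(tag, line):
    stripped = line.strip()
    # A section-start marker always wins, even if the previous
    # section never closed properly.
    for marker, new_tag in START_MARKERS.items():
        if stripped.startswith(marker):
            return new_tag
    # A section-end marker reverts to MISC.
    if any(stripped.startswith(marker) for marker in END_MARKERS):
        return Tag.MISC
    # Otherwise the current section continues.
    return tag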


tagging enables standard tools

Perhaps a failed customer transaction always emits a stanza like this: (err_start, line1, line2, err_end), with customer name appearing only in line2.

Then we can

from itertools import groupby
from operator import itemgetter

for tag, lines in groupby(get_tagged_syslogs(), key=itemgetter(0)):
    if tag == Tag.ERR and any("Smith" in line for line in lines):
        ...

To separate consecutive error records we will probably want to

        yield Tag.SEPARATOR, ""

after each section-ending line, such as an err_end line. That way groupby won't aggregate unrelated error sections.
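
Building on the _classify sketch above, one way to wire that in is to emit the separator whenever the tag transitions back to Tag.MISC, which is exactly what happens on a section-ending line:

def get_tagged_syslogs(infile=MESSAGES):
    tag = Tag.MISC
    with open(infile) as fin:
        for line in fin:
            prev = tag
            tag = _classify(tag, line)
            yield tag, line
            if prev is not Tag.MISC and tag is Tag.MISC:
                # We just emitted a section-ending line; mark the boundary.
                yield Tag.SEPARATOR, ""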

  • grep is usually not very well suited to handling something that spreads over several lines. Commented Apr 9, 2023 at 5:29
  • Right. Hence the tagging, which brings the problem back to examining single lines in isolation. Plus, having once identified the details of the relevant line, I will often $ grep -B4 -A4 keyword file.txt to reveal context. It works well with syslogs, and with source code files.
    – J_H
    Commented Apr 9, 2023 at 13:38
  • Thank you @J_H. When I started off on this project, I used awk. Had I stuck with it, I would not be in this situation (few people at work are even aware of awk, let alone know how to read awk code). I started to use Python only because I wanted to present the information in the log file as multiple worksheets within a spreadsheet (with hyperlinks between them). Commented Apr 13, 2023 at 17:49

If the sections have a non-overlapping structure, then strongly consider encoding the state machine via ordinary control flow. Regardless of whether that is possible, you can likely separate low-level scanning of lines from their higher-level grouping.

What I mean by non-overlapping structure

Some data description formats like XML require proper nesting – annotated regions can be consecutive or nested, but they cannot be overlapping.

For example, <a> <b>...</b></a> is allowed, as is <a>...</a> <b>...</b>, but not <a> <b> ... </a> </b>.

Such non-overlapping structures are much simpler to handle programmatically, for computer science reasons involving terms such as "Chomsky hierarchy", "stack automata", and "context-free languages".

Problem statement

I'm assuming you have log file lines that look like this:

foo
#begin SECTION
x
y
#end SECTION
bar

It is possible to maintain a stack or a variable that indicates whether we are currently within that section, e.g.:

section = None
for line in input:
  # handle state transitions
  if section is not None and line == "#end SECTION":
    print(f"completed section: {section}")
    section = None
    continue

  if section is None and line == "#begin SECTION":
    section = []
    continue

  # process line according to current state
  if section is not None:
    section.append(line)
    continue

  # default action
  print(f"ordinary line: {line}")

This would produce output like:

ordinary line: foo
completed section: ['x', 'y']
ordinary line: bar

Possible approach: one function (or class) per state

One potential refactoring is to identify the state you're in, and to then call a function that handles the input in that state. Here, we would select the state based on the section is None condition. However, modifying persistent variables is trickier across separate functions, so you'd likely want to extract them into an object. Incidentally, this also makes it possible to represent the current state simply via that object's class. Here:

from __future__ import annotations

from dataclasses import dataclass
from typing import TypeAlias

@dataclass
class OrdinaryState:
  def handle_line(self, line: str) -> State:
    if line == "#begin SECTION":
      return InSectionState(contents=[])

    # default action
    print(f"ordinary line: {line}")
    return self

@dataclass
class InSectionState:
  contents: list[str]

  def handle_line(self, line: str) -> State:
    if line == "#end SECTION":
      print(f"completed section: {self.contents}")
      return OrdinaryState()

    # default action
    self.contents.append(line)
    return self

State: TypeAlias = OrdinaryState | InSectionState

state = OrdinaryState()
for line in input:
  state = state.handle_line(line)

Possible approach: recursive functions

Instead of maintaining an explicit state machine, you could call one function per state – which works here because we are assuming at most a nested structure. So there might be a function like parse_section(lines) that will consume the contents of the section, and then return.

This is where iterators become relevant. As long as we don't have to "look ahead" to the next line, then passing an iterator to these functions is an elegant way to handle progress through the file. Each function can consume zero or more lines from that iterator, and the next function will continue where another left off.

Here, a solution might look as follows:

from typing import Iterator

def parse_logfile(lines: Iterator[str]):
  for line in lines:
    if line == "#begin SECTION":
      parse_section(lines)
      continue

    print(f"ordinary line: {line}")

def parse_section(lines: Iterator[str]):
  contents = []

  for line in lines:
    if line == "#end SECTION":
      break

    contents.append(line)

  print(f"completed section: {contents}")

parse_logfile(iter(input))

Whenever possible, this kind of design would be my preferred approach. It is highly debuggable, because you're just writing code, not trying to encode a state machine. The code tends to be simpler, shorter, and less susceptible to tricky bugs.

Drawbacks of extracting state machine parts

Both of these suggested refactorings have the issue that information about the state machine is now spread across multiple places. For example, the code that starts a section is part of one state, the code that ends it is in another. There are techniques to avoid this, but they require lookahead – being able to inspect the next line without consuming it.
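
For completeness, here is one possible sketch of such a lookahead wrapper (the Peekable name and its API are invented; the third-party more-itertools package offers a similar peekable):

class Peekable:
  """Wrap an iterator so the next element can be inspected without consuming it."""
  _SENTINEL = object()

  def __init__(self, iterable):
    self._it = iter(iterable)
    self._buffered = self._SENTINEL

  def peek(self):
    if self._buffered is self._SENTINEL:
      self._buffered = next(self._it)  # may raise StopIteration at the end
    return self._buffered

  def __iter__(self):
    return self

  def __next__(self):
    if self._buffered is self._SENTINEL:
      return next(self._it)
    value, self._buffered = self._buffered, self._SENTINEL
    return value

With this, parse_section() could peek() at the upcoming line and return without consuming it once it sees a line that belongs to the enclosing context.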

General idea: separate low-level parsing from high-level structure/grouping

If the syntax of these lines is more involved, especially if complex regexes are needed, it can make sense to extract some of those low-level tasks from the main state machine.

In my above example, I might extract is_section_start(line) and is_section_end(line) functions – though it's probably not worth it here since I'm just comparing constant strings.
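
For concreteness, those helpers would be trivial here, which is exactly why they're hardly worth extracting:

def is_section_start(line: str) -> bool:
  return line == "#begin SECTION"

def is_section_end(line: str) -> bool:
  return line == "#end SECTION"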

Another alternative is to first parse each line individually, and to then run your state machine. This is similar to a "lexer", "scanner", or "tokenizer". I don't think it would be worth it in this example, but we might define:

from dataclasses import dataclass
from typing import Iterable, TypeAlias

@dataclass
class BeginSection: pass

@dataclass
class EndSection: pass

@dataclass
class OrdinaryLine:
  contents: str

Line: TypeAlias = BeginSection | EndSection | OrdinaryLine

def tokenize(lines: Iterable[str]) -> Iterable[Line]:
  for line in lines:
    if line == "#begin SECTION":
      yield BeginSection()
      continue
    if line == "#end SECTION":
      yield EndSection()
      continue
    yield OrdinaryLine(contents=line)

This can be an especially good strategy if the syntax includes irrelevant parts that should be skipped, like comments or empty lines.
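
For example, the tokenizer could drop such lines before the state machine ever sees them (a sketch, assuming hypothetical //-style comment lines):

def tokenize(lines: Iterable[str]) -> Iterable[Line]:
  for line in lines:
    if not line.strip() or line.startswith("//"):
      continue  # blanks and comments never reach the state machine
    if line == "#begin SECTION":
      yield BeginSection()
      continue
    if line == "#end SECTION":
      yield EndSection()
      continue
    yield OrdinaryLine(contents=line)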

In this particular instance the state machine wouldn't be simplified a lot, but this might be an opportunity to use Python 3.10's shiny new pattern matching feature:

section = None
for line in tokenize(input):
  match line:
    case EndSection():
      print(f"completed section: {section}")
      section = None

    case BeginSection():
      section = []

    case OrdinaryLine(contents=contents) if section is not None:
      section.append(contents)

    case OrdinaryLine(contents=contents):
      print(f"ordinary line: {contents}")

Whether this is better is mainly a matter of taste, but if we introduce some intermediate representation then we can separate different concerns (like recognizing lines vs grouping them). This may or may not make the code more maintainable.

Caveat: this strategy only works if each line has the same meaning in all contexts. If the interpretation of a line depends on the current state, this separation is going to be much more difficult, and the appropriate response would be to extract the low-level details into separate functions instead, as discussed above.

Conclusion

State machines are complex, but there are different strategies for simplifying them. In particular:

  • extracting low-level concerns so that the state machine only consists of high-level operations
  • using different functions or classes for handling each state
  • encoding states in the control flow of (recursive) functions

Depending on context, one or another technique may be preferable.

It's also worth noting that it is difficult to write code that would behave exactly the same in all variants. Even my code examples here have notable differences in how they handle "incorrect" inputs, for example:

#end SECTION
foo
#begin SECTION
x
#begin SECTION
y
  • Sincere thanks @amon for your elaborate reply. Much appreciated. Regarding your query about non-overlapping structure: there are instances where the log file is malformed, in that the end-marker can go missing. Also, the property section can have some nesting. In other words, to use your tag example, <a><a></a><a> is possible, although nesting is never more than 1~2 levels. Commented Apr 6, 2023 at 14:32
  • I am not familiar with Line: TypeAlias = BeginSection | EndSection | OrdinaryLine syntax. Could you point me to Python documentation pertaining to this? Commented Apr 6, 2023 at 14:32
  • Hi @Happy! I wrote the code examples using Python's type annotations, which are not strictly necessary but can make it clearer what is happening (and help with autocomplete). The Line is a type alias. The A | B syntax describes a union type. Together, this means that a value of type Line is either a BeginSection, an EndSection, or an OrdinaryLine object.
    – amon
    Commented Apr 6, 2023 at 15:37

You seem to find the code reasonable; it doesn't seem crazy to me (but I'm not the one working on it) and you only care about this because a coworker ran some third-party code analysis tool which raised it as a warning.

The solution: stop caring.

No, really. If the code is not difficult to work with, then what is the problem? There isn't one. The code analysis tool thinks this code might be difficult to work with, but you say it's not, so the code analysis tool is wrong in this case.

Most other solutions add a bunch of extra complexity to please some arbitrary criterion in some stupid tool. Complexity is the enemy in software development. This tool is saying that you have too much "complexity" in one function, so amon's first solution is to... add even more complexity but spread it across several functions so it doesn't trip the warning? No, that's utter nonsense. amon's second solution may be an improvement, depending on how you feel about it, but I classify it as a refactor which doesn't fundamentally change much about the code.

It's conceivable that rewriting the parser using some tool designed for writing parsers (such as ANTLR) could be an improvement. However, I don't think this will be the case, as the language you are parsing is quite simple. I think that any complexity saved by writing the parser with a parser-writing tool will be matched or exceeded by the added complexity of interfacing with the tool.

  • I guess the problem is he is paid to care about this measure. Which tbh is quite reasonable, it's a common measure of how complex your code is for other people to understand.
    – Ewan
    Commented Apr 6, 2023 at 23:28
  • @Ewan It seemed like it is not a request from a boss but from a peer Commented Apr 6, 2023 at 23:29
  • "My boss got wind of it and wants it to be public" but i mean even if it's not your boss, people work in teams, if you can rearrange the code a bit and tick this box, why not
    – Ewan
    Commented Apr 6, 2023 at 23:32
  • @Ewan if rearranging the code to tick the box makes it worse, then it's better to get the co-worker to agree on that. Actually, speak to the co-worker. Maybe they have an idea to make it simpler and better instead of worse, or maybe they can agree the warning is not helpful Commented Apr 6, 2023 at 23:34
  • "some other worker...." - this is what code review is all about Commented Apr 9, 2023 at 5:30

This is a prototype

You have figured out how to do what you need on a trial-and-error basis and now know what you have and what you need.

If your boss agrees with the people who think the PyLint metric is important, then simply say "this means it is a prototype that is not of production quality" and have resources allocated to create one that is.

It is most likely important that this is not done by you, because you would just reimplement what you have; instead, you need to create the material that the coworker can implement from. Additionally, it is most likely the continue and similar constructs that cause PyLint to report what it does, so these should not be used. Well-named methods and appropriate ifs will most likely produce code PyLint likes better.
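
For illustration only, "well-named methods and appropriate ifs" might mean something along these lines (all names are invented, and the #begin/#end markers borrow from amon's answer):

class LogParser:
    def __init__(self):
        self.section = None  # name of the section we are currently inside, if any

    def parse(self, lines):
        # Each helper is small and named for what it handles, which keeps
        # the per-function complexity metric low.
        for line in lines:
            if self.section is not None:
                self.handle_section_line(line)
            elif line.startswith("#begin "):
                self.enter_section(line)
            else:
                self.handle_ordinary_line(line)

    def enter_section(self, line):
        self.section = line.removeprefix("#begin ").strip()

    def handle_section_line(self, line):
        if line.startswith("#end "):
            self.section = None
        # else: record or process the in-section line here

    def handle_ordinary_line(self, line):
        pass  # whatever processing ordinary lines need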
