55

Let me explain what I mean.

I have made a complex PHP framework/library for my own use, highly polished over the years. I very aggressively log even the smallest notice and deal with it as soon as it pops up, always trying to predict potential errors in my code so that they never occur even in rare situations, but rather get handled automatically before they are ever logged.

However, in spite of all my efforts, inevitably, I wake up (such as today) to find that some third-party service has fiddled around with the file format of one of the CSV files of data that they provide on their website and which my system fetches and imports every day.

Then I get a flood of ugly PHP errors. Ouch.

Even though it looks scary at first, it's typically just a pretty simple fix, and it's typically really just ONE error, which cascades into tons of apparent errors because the chain of function calls "falls apart" as each one expects something that it no longer gets.

I fix the issue, clear the errors, re-run the logic, verify that it no longer causes any errors, and then it's fixed. For now. Until the same thing happens again, with some other part of the system.

I can personally "deal with" this, but it really bothers me in terms of giving away my system to somebody else to run on their machines. If/when the same kind of thing happens for them, they will doubtlessly blame me and think I'm incompetent (which may be true).

But even for myself, this is quite annoying and makes me feel as if my system is a very fragile house of cards waiting to fall apart, in spite of there normally not being a single little notice or warning logged during "normal operation".

Short of predicting every possible change and writing enormous amounts of extra "checking" code to verify that all data is always exactly what is expected, is there anything I can do to fix this general problem? Or is this like asking for a pill that cures any disease instantly?

Please don't get hung up on the fact that I mentioned PHP. I'd say that this question applies regardless of the programming language or environment. It's really more of a philosophical question than a technical one IMO.

I fear that the answer will be: "There is no way. You have to bite the bullet and verify, verify and verify everything all the time!"

15
  • 71
    Depending on external formats outside your control is a source of errors that cannot be fixed. But floods of errors in reaction to a single root cause is a sign of inadequate error handling on your part. Detect trouble as early as possible and bail out rather than assuming that previous steps succeeded. Commented Dec 2, 2020 at 9:36
  • 67
    Please don't get hung up on the fact that I mentioned PHP. One of the features of PHP is that it keeps trying to work even when there are errors. Most other languages stop the process flow on an error. That's a big reason for a cascade of error messages.
    – Pieter B
    Commented Dec 2, 2020 at 12:33
  • 6
    @PieterB Historically that has been true, but it's a lot less so now, especially if you use PHP programming practices and settings designed to avoid that - e.g. specifying types wherever possible, strict mode, maybe choosing libraries such as thecodingmachine/safe etc.
    – bdsl
    Commented Dec 2, 2020 at 13:39
  • 3
    This is actually one of the hardest parts of production code: making it work even when an assumption fails. Netflix has done a lot of work on making their controlled environments fail, so their code has seen just about anything, and proper error handling has been written. Commented Dec 2, 2020 at 21:52
  • 5
    I notice that you mention two problems in this post. First, something occasionally (or frequently) goes wrong with your software. Second, when something goes wrong, your software produces lots of error messages instead of just one. Are you asking for help with both of these problems, or only one or the other? Commented Dec 3, 2020 at 0:00

11 Answers

100

An improvement would be to design your system to fail gracefully. If the first step of parsing a file fails, then stop with an error. Don't carry on passing bad data from one step to the next.

The other thing to check is that you are implementing the file handling correctly and robustly. CSV is quite complicated when you encounter quoted strings with embedded commas in them. If the supplier has actually changed the file format, then you should stop processing. If they have used a feature of CSV that you haven't implemented right, you need to fix that robustly.
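
For illustration, here is a minimal fail-fast sketch in PHP (the file path, the expected header names, and the exception choices are assumptions, not part of the question):

```php
<?php
// Fail-fast import sketch: validate the structure first and abort the whole
// run on the first sign of trouble, instead of passing bad rows downstream.

function importDailyCsv(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $expected = ['date', 'symbol', 'price'];   // hypothetical columns
    $header = fgetcsv($handle);
    if ($header !== $expected) {
        fclose($handle);
        // Stop here: the supplier changed the format. Nothing downstream runs.
        throw new UnexpectedValueException(
            'CSV header changed: got [' . implode(',', (array) $header) . ']'
        );
    }

    $rows = [];
    while (($row = fgetcsv($handle)) !== false) {
        if (count($row) !== count($expected)) {
            fclose($handle);
            throw new UnexpectedValueException('Malformed row: ' . implode(',', $row));
        }
        $rows[] = array_combine($expected, $row);
    }
    fclose($handle);
    return $rows;
}
```

One thrown exception at the first broken row replaces the cascade of notices that appears when half-parsed data leaks into later steps.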

7
  • 14
    This (called "fail fast"), with "early returns", will help a lot on processing data. Imagine that you have to fetch from a database, and it succeeds, but the next line fails. Returning as soon as it fails will prevent you from doing additional processing for something you won't be running (say, fetching 2-3 other things from the database or somewhere else). Commented Dec 3, 2020 at 12:39
  • 8
    And failing fast may sometimes avoid actual damage, such as overwriting a valid (albeit perhaps dated) file with invalid data. Commented Dec 3, 2020 at 13:16
  • 28
    @IsmaelMiguel I usually amend that to "fail fast and fail loud". I.e. I don't want any errors just quietly swept under the rug. I want a big glaring error message as close as possible to the initial point of failure, with as much information about the problem as possible, to make it easier to track down the source. Commented Dec 3, 2020 at 17:29
  • 1
    @BenButterworth then it depends on whether you require 1 million processed images, or as many processed images as you can get. And you may want to stop if there are (10? 100? 1000?) corrupt images
    – Caleth
    Commented Jan 3, 2021 at 1:00
  • 1
    @BenButterworth your "contract" should specify what you do when one image fails (it could even be a configuration parameter) as part of the business logic. If you prefer to keep going then treat each image like you'd treat one atomic process: I'd expect a report with the list of failed images and probably logs explaining the "earliest point of failure" for each of them.
    – Rad80
    Commented Feb 9, 2021 at 12:03
77

There was a popular blog post on this topic last year called Parse, don't validate. It's an excellent read that's difficult to paraphrase, but the essence is that, as early as possible, you should get your input data into a form where illegal states are unrepresentable.

For reading from an external CSV file, following this advice would mean:

  • Use a proper CSV parsing library, not a regex or a split or something.
  • Use the header names, not a column number to get a specific field.
  • Put it into an object with only the fields you use, already validated that ints are ints, dates are dates, etc.
  • Pass only that object down to the lower layers of the program. You know all the fields in there are valid (see the sketch after this list).
  • Use your type system as much as possible to your advantage. I haven't written any php in decades, so I'm not familiar with its current capabilities, but I know it has improved in that area.
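
As a rough sketch of what that could look like in PHP (assuming PHP 8.1+ for readonly properties; the record fields are invented for illustration):

```php
<?php
// "Parse, don't validate": one constructor path that either yields a fully
// valid object or throws. Code below this boundary never re-checks fields.

final class TradeRecord
{
    public function __construct(
        public readonly DateTimeImmutable $date,
        public readonly string $symbol,
        public readonly float $price,
    ) {}

    /** @param array<string,string> $row keyed by header name, not column index */
    public static function fromCsvRow(array $row): self
    {
        foreach (['date', 'symbol', 'price'] as $field) {
            if (!isset($row[$field]) || $row[$field] === '') {
                throw new UnexpectedValueException("Missing field: $field");
            }
        }
        $date = DateTimeImmutable::createFromFormat('Y-m-d', $row['date']);
        if ($date === false || !is_numeric($row['price'])) {
            throw new UnexpectedValueException('Unparseable row: ' . json_encode($row));
        }
        return new self($date, $row['symbol'], (float) $row['price']);
    }
}
```

Once a `TradeRecord` exists, an illegal state (a missing price, a garbled date) is simply unrepresentable, so the lower layers cannot cascade.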

I generally expect the following from reputable data providers:

  • Make only backward-compatible changes if possible.
  • If not possible, provide some sort of version to indicate backward-incompatible changes.
  • Announce schema changes in advance, so I can test before they are needed.
  • If possible, provide the schema in a standard format I can use to automatically adapt my parsing in most cases.
  • If practical, allow me to customize what fields I am retrieving.

I don't know what sort of relationship you have with your data provider, but if they are not doing these things, I would try to influence them to start. If they are doing those things, make sure you are taking advantage of it.

7
  • 8
    This is good advice. I would like to point out that "a proper CSV parsing library" just means that someone else has already worked out all the bugs in the string processing. Commented Dec 3, 2020 at 2:28
  • 8
    @RobertDodier Even if "a proper library" just means "someone else fixed the bugs", it still means you don't have to do all that work again. Programmers should have an innate distaste for reinventing the wheel in my opinion, especially for non-throwaway code.
    – Nzall
    Commented Dec 3, 2020 at 10:09
  • 7
    I think the first half of this answer is great, but the second half doesn't match my experience at all. I've had to integrate feeds from companies 100 times the size of mine, where just finding someone who knows we exist is a challenge; I've also had integrations where the technical contact simply refuses to acknowledge the problems I demonstrate.
    – IMSoP
    Commented Dec 3, 2020 at 10:34
  • @RobertDodier s/all/many of/ Commented Dec 3, 2020 at 10:54
  • 4
    "Use a proper CSV parsing library, not a regex or a split or something." though be aware that the source of the data may not be using a "proper csv generating library". So you may well end up writing a specific parser to handle whatever broken "csv" you receive. Commented Dec 4, 2020 at 5:57
10

There is no general solution that fixes this. When integrating with outside systems, you have very little control. From what you describe, you are including a lot of defensive programming — this is good. As others have mentioned, you need to fail more gracefully. If a chain of operations requires data from an outside source, you'll need additional defensive programming to ensure downstream operations do not get triggered when a failure occurs. End users should also be presented with a reasonable error message.

Beyond that, setting up automated integration tests between your application and the outside provider can help you find issues before they hit production. Many outside services have a "test" or "beta" environment, where they deploy new releases. This allows you to identify breaking changes in their upcoming releases before it hits their production environment (and therefore takes down your production environment). Furthermore, any time a breaking change occurs, add that to your automated integration test suite to guard against that change moving forward.
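
As a sketch of such an automated check (using PHPUnit; the feed URL and the expected header names are placeholders):

```php
<?php
// A scheduled smoke test against the provider's live or beta feed, run
// separately from the unit test suite, so header changes are caught before
// the production import consumes them.

use PHPUnit\Framework\TestCase;

final class ProviderFeedTest extends TestCase
{
    public function testFeedStillHasExpectedHeader(): void
    {
        $csv = file_get_contents('https://provider.example.com/daily.csv');
        $this->assertNotFalse($csv, 'Feed could not be fetched');

        $header = str_getcsv(strtok($csv, "\n"));
        $this->assertSame(
            ['date', 'symbol', 'price'],
            $header,
            'Provider changed the CSV header; adjust the importer first'
        );
    }
}
```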

When integrating with outside services, you absolutely must keep up to date on their changes. Consider subscribing to mailing lists or periodically checking their developer sites for upcoming releases. Integrating with external services is never something you can build and forget. You'll have continuing maintenance work to stay on top of this, which will include regular maintenance releases for your application and/or code.

8
  1. Validate your data early.

As soon as you can, check that your input falls within your required range.

  2. Fuzz test within the domain of your data.

Your system should seek to handle all data that passes validation gracefully. Fuzzing refers to generating random data within the range in question.

The fuzz data is on the border of nonsense, but matches the minimal structure required by your validator. If you find it hard to generate random data that passes validation, you might need to clean up your validation logic, making it either more strict or less strict.

  3. Fuzz test your validators

Your system should sharply and reliably distinguish valid from invalid data.

  4. Fail early on invalid data

If your data doesn't pass validation, do not hobble along. Fail fast and fail gracefully.

Once you have invalid data, your assumption that your processing is meaningful has failed. Barging on and continuing to work will both generate a flood of errors and can result in output that is not just missing, but wrong.

Garbage In, Garbage Out can only be prevented by detecting garbage and stopping before you generate garbage.
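
A minimal hand-rolled fuzz loop might look like this in PHP (`validateRow()` and `process()` are hypothetical stand-ins for your real validator and pipeline):

```php
<?php
// Generate rows on the border of nonsense and assert one property:
// anything the validator accepts must pass through processing cleanly.

function randomRow(): array
{
    $dates  = ['2020-12-02', '2020-13-40', '', 'yesterday'];   // some invalid on purpose
    $prices = ['19.99', '-1', 'NaN', ''];
    return [
        'date'   => $dates[array_rand($dates)],
        'symbol' => substr(str_shuffle('ABCDEFG;,"'), 0, 3),   // CSV-hostile characters
        'price'  => $prices[array_rand($prices)],
    ];
}

for ($i = 0; $i < 10000; $i++) {
    $row = randomRow();
    if (!validateRow($row)) {
        continue;          // rejected at the boundary: exactly what we want
    }
    process($row);         // must never warn or error on validated data
}
```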

5

When reading data from an external source (and that includes data written by your application in a previous run), it is a given that sooner or later the data you read will not match exactly the data you expect.

If the format is specified externally, then the specification can change at any time. Besides that, the program generating the data could have a bug, or some glitch in storage or communication could corrupt the data.

This is an interoperability problem that has existed as long as multiple machines communicate with each other and has given rise to the adage: "Be strict in what you send, but lenient in what you receive", meaning that when producing data you should try to adhere to the specified formats as much as you can, but when receiving data you should try to make sense of it (without reporting an error) even if it does not exactly match the prescribed format.
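
In practice, "lenient" is best kept bounded: tolerate the known, harmless variations and still reject everything else. A small PHP sketch (the tolerated variations, a UTF-8 BOM, Windows line endings, and an alternate delimiter, are assumptions for illustration):

```php
<?php
// Bounded leniency when receiving: normalize known-harmless variations,
// but still fail loudly if the input is outside everything we recognize.

function normalizeFeed(string $raw): string
{
    $raw = preg_replace('/^\xEF\xBB\xBF/', '', $raw);   // strip a UTF-8 BOM
    return str_replace("\r\n", "\n", $raw);             // normalize line endings
}

function detectDelimiter(string $firstLine): string
{
    foreach ([',', ';', "\t"] as $candidate) {
        if (substr_count($firstLine, $candidate) >= 2) {
            return $candidate;
        }
    }
    throw new UnexpectedValueException('No known delimiter found');
}
```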

8
  • 8
    I like the sound of an interoperability adagio, but I suspect you meant 'adage' here :) Commented Dec 2, 2020 at 18:37
  • 8
    @StuartJCuthbertson "Molto lento in what you receive" is the interoperability adagio. Commented Dec 3, 2020 at 2:26
  • 4
    @Robert Nice :-) Anyway, I never liked that particular adagio/adage. Being lenient in what you accept means you wind up having to maintain support for all sorts of nonstandard inputs, and it subverts your ability to complain when an external source gives you nonstandard data, even though it's their fault for not following the standard. (Then again, sometimes you just have to deal with nonstandard data that you have no power to change, but still, I prefer to be annoyed about it.)
    – David Z
    Commented Dec 3, 2020 at 5:19
  • 2
    Attempting to process an unspecified format is a programmer error - this is a prominent example of "failing late". There should be a clear set of accepted formats, including all the workarounds for particular providers.
    – Basilevs
    Commented Dec 3, 2020 at 6:07
  • 1
    @DavidZ: A difficulty with the adage is that most standards are poorly written. A good standard should include separate specifications for producers, consumers, validators, and format converters. Among other things, this will greatly facilitate the deprecation of constructs which are bothersome to support, in favor of better alternatives that are easier to support. If a spec for consumers requires support for a deprecated construct, but the spec for validators requires rejection thereof, then deprecation of the construct will not prevent the producer from working with consumers, but...
    – supercat
    Commented Dec 4, 2020 at 20:31
2

The ultimate in "general solutions" is to treat your error-cascade problem not as a program design problem but as a specification problem--specifically, having a missing or inadequate specification. Michael Jackson did this in 1975 in his book, "Principles of Program Design", which treats this subject thoroughly. Although the examples are written in COBOL, the principles are the same for processing linear sequences of inputs, whether it is tokens in a programming language, commands in a shell, a .csv file of billing entries for a job, or keystrokes in a word-processor:

  1. Define the grammar of a valid input stream (valid input)
  2. Define the grammar for each kind of erroneous input stream (error input)
  3. All other input structures are by default "invalid"
  4. Define the program's response to valid input, creating test cases for each equivalence class of valid input
  5. Define the program's response to error input, creating test cases as before
  6. Define the program's response to invalid input, creating test cases as before

What most of us often do (myself included) is let external actors teach us by example about error inputs (step 2 above) after we have deployed the system; then we have to react with a patch, and mollify unhappy users in the meantime. By treating this as a specification problem, you avoid this situation entirely.

Jackson shows program structures for responding to valid, error, and invalid data sequences, using COBOL. Of course, now we have all kinds of different programming constructs for handling errors, but defining the errors and your program's expected response to them helps you create a design which meets your needs rather than trying to play catch-up with an inadequate design.

In summary, there is a general solution, but it is at the specification level: define all the kinds of meaningful input you will provide meaningful responses to, and engineer for each of them. The rest are simply rejected with some sort of error indication.
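
A toy PHP (8+) rendering of that triage follows. The grammar patterns and `processLine()` are invented for illustration; a real specification would be far richer:

```php
<?php
// Every input line falls into exactly one specified class, and each class
// has a designed, tested response: no input is handled "by accident".

function classifyLine(string $line): string
{
    if (preg_match('/^\d{4}-\d{2}-\d{2},[A-Z]+,\d+(\.\d+)?$/', $line)) {
        return 'valid';                      // grammar of valid input
    }
    if (trim($line) === '' || str_starts_with($line, '#')) {
        return 'error';                      // specified, expected deviations
    }
    return 'invalid';                        // everything else, by default
}

foreach (file('feed.csv', FILE_IGNORE_NEW_LINES) as $line) {
    match (classifyLine($line)) {
        'valid'   => processLine($line),     // designed response, tested
        'error'   => null,                   // designed response: skip
        'invalid' => throw new UnexpectedValueException("Unspecified input: $line"),
    };
}
```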

1

Basically, I would argue you should write the checking code (and maybe offer a "performance" mode that doesn't run the checks). I would recommend using assert statements to ensure that the input is in the expected format, and putting a comment next to each assert statement stating its semantic meaning. That way, when your code fails, it is obvious to an outside developer that your code has not failed due to an internal fault, but because its assumptions have been violated.
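
A small sketch of that idea (in PHP, assert() can be compiled out in production with zend.assertions=-1, which gives you the "performance" mode; the row shape here is a made-up example):

```php
<?php
// Asserts document the assumptions; the comments state their semantics so a
// failure reads as "the input changed", not "the importer is buggy".

function importRow(array $row): void
{
    // Semantic meaning: the provider guarantees exactly three columns.
    assert(count($row) === 3, 'Provider row no longer has 3 columns');

    // Semantic meaning: the first column is an ISO-8601 date.
    assert((bool) preg_match('/^\d{4}-\d{2}-\d{2}$/', $row[0]),
           'First column is no longer an ISO-8601 date');

    // ... actual processing ...
}
```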

1

When you have a mature codebase, you have seen a lot of different error scenarios and implemented all the code needed to handle them appropriately.

This means that if your code encounters an unexpected error now, you are in a situation where your world is broken (because it is something you have never seen before, or you would already have handled it), and the only sensible approach from here on is for your code to stop what it is doing and ask for emergency help.

Your cascade of errors comes from the fact that you are not prepared for this. If you aren't, then your code cannot be either.

I would suggest you read "Release it!" as it contains a lot of useful advice for writing more robust code. https://pragprog.com/titles/mnee2/release-it-second-edition/

0

Writing enormous amounts of extra "checking" code is pretty much needed, unfortunately. The checking code is usually enough to help, as you can spot the changes that broke your code by printing what made the code fail, and where it failed. This is also useful to the user if they gave the program bad input. Failing with decent error messages during checks is the easiest way to debug bad input.

One of the ways to validate data is to have a builder. You give the builder the pieces of data you have and then have it build an object consisting of that data. The builder can produce fuzzy logic (Yakk's idea), or it can throw an error if any data is missing when you tell it to build. You can also add methods to the builder to check whether the data was fuzz-generated or is valid. Each piece of data fed into the builder can likewise be checked on input, throwing a helpful error if it is invalid.
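
One possible shape for such a builder in PHP (the names and fields are illustrative, not prescriptive):

```php
<?php
// Each setter validates its own piece; build() refuses to produce a record
// until every required field has been supplied and checked.

final class RecordBuilder
{
    private ?string $date = null;
    private ?float $price = null;

    public function withDate(string $date): self
    {
        if (!preg_match('/^\d{4}-\d{2}-\d{2}$/', $date)) {
            throw new InvalidArgumentException("Bad date: $date");
        }
        $this->date = $date;
        return $this;
    }

    public function withPrice(string $price): self
    {
        if (!is_numeric($price)) {
            throw new InvalidArgumentException("Bad price: $price");
        }
        $this->price = (float) $price;
        return $this;
    }

    public function build(): array
    {
        if ($this->date === null || $this->price === null) {
            throw new LogicException('Record is incomplete');
        }
        return ['date' => $this->date, 'price' => $this->price];
    }
}

// Usage: any invalid or missing piece fails here, at the boundary.
$record = (new RecordBuilder())->withDate('2020-12-02')->withPrice('19.99')->build();
```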

Anticipating bad input is one way to deal with it, like you say in your question. You can write code that checks whether data should be treated as equal (a simple example being Hello and hello being the same word despite capitalization). Except for simple cases like that, though, this is really something to wait for an error for. If a user really needs you to support a format, you will get an error message with details, provided you wrote good checking code. Then you can add support for the format they want. This can be easier said than done.

If you do need to add support, using a base interface can help if you need to change a lot of code. Say one customer has a different CSV format: you can create code on top of the original interface that is labelled for that customer. With your CSV example, say one customer uses ;s instead of ,s. The base interface would deal with that, and you could label the code on top as semicolonSeparatedValues or something like that. This does take some thought as to what is needed in the base interface, and it comes with the disadvantage of a lot of refactoring if there are poor design choices early on, but it can help prevent duplicate code and bloated program files. (A sketch follows below.)
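
A sketch of that interface approach (hypothetical names; only the delimiter differs between the two implementations):

```php
<?php
// One contract, one implementation per customer format. New formats are new
// classes; the rest of the program only ever sees the RowSource interface.

interface RowSource
{
    /** @return iterable<array<string,string>> rows keyed by field name */
    public function rows(string $path): iterable;
}

final class CommaSeparatedSource implements RowSource
{
    public function rows(string $path): iterable
    {
        $h = fopen($path, 'r');
        $header = fgetcsv($h, 0, ',');
        while (($row = fgetcsv($h, 0, ',')) !== false) {
            yield array_combine($header, $row);
        }
        fclose($h);
    }
}

final class SemicolonSeparatedSource implements RowSource
{
    public function rows(string $path): iterable
    {
        $h = fopen($path, 'r');
        $header = fgetcsv($h, 0, ';');              // only the delimiter differs
        while (($row = fgetcsv($h, 0, ';')) !== false) {
            yield array_combine($header, $row);
        }
        fclose($h);
    }
}
```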

You can also ask the users of your software what their format is. Make sure that if there is an error, you print it along with the input that produced it, so they can fix the input; they can also send you a decent error message that helps you write more robust code.

As far as decent error messaging goes, as long as the error has helpful information and doesn't give the user an ugly crash or exception, you are good. Going with the csv example, if the user has a bad file, you should display an error that says what file, what line, and why that line is bad. Also, make sure not to change the state of data you are reporting the error on, otherwise you will be left with a potentially very obscure bug, and could confuse the user.

Try to avoid triggering exceptions. An example in Java is a NullPointerException. You can pass null around, but unless you are checking for null everywhere it gets passed, eventually a NullPointerException will be thrown. In Java, one way to avoid this is to use empty containers instead of null. If a method you are using throws an exception, you want to write code that will never trigger that exception.

Also, very important, do not ignore exceptions as a way of error handling. You will cover up what could potentially cause errors far away from their source.

Minimizing variable scope also helps with errors. Having a global variable that multiple parts of a program depend on is a good way to introduce a bug. Giving each part its own variable is much safer. Even safer are method-local variables.

In multithreaded environments, using immutable classes helps avoid a lot of potential headaches. In fact, multithreading is best avoided unless the performance is needed, because debugging errors is a lot harder in a multithreaded environment.

You want to give users only as much control as they need, and no more. This will prevent a user from messing up and getting frustrated at you for what they perceive as being your fault.

Using constructs designed to do the work for you is also a good way to avoid errors. A simple example is a for-each loop in Java versus a normal indexed loop, where it is much easier to get an IndexOutOfBoundsException.

Getting familiar with the programming language you are using is also a great way to avoid errors. Find some reading material and exercises and do them.

Also, in multithreaded environments, make sure to synchronize access to shared data. This is a complex subject in and of itself, with entire books written on it. Once again, limit the control of the user: they absolutely should not be able to mutate synchronized data while it is being synchronized.

I have tried to make this list as general as possible, but the ideas are from reading Effective Java.

0

Q : I get a flood of ugly PHP errors...

Short of predicting every possible change ... is there anything I can do to fix this general problem?

A: Yes, definitely. As Simon B suggested above, you want to fail fast, and fail loud. This is excellent advice :)

I'm surprised nobody mentioned using exceptions (thrown where the problem is first detected) and try/catch/finally blocks (at a higher level, where you can intelligently handle, and/or recover).
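
For instance, a minimal sketch (importDailyCsv(), storeRows(), notifyAdmin(), and releaseImportLock() are hypothetical names):

```php
<?php
// Throw where the problem is first detected; catch once, at a level that can
// respond sensibly; clean up in finally regardless of outcome.

try {
    $rows = importDailyCsv('/data/provider/daily.csv');   // throws on bad format
    storeRows($rows);
} catch (UnexpectedValueException $e) {
    // One loud, actionable report instead of a cascade of downstream errors.
    error_log('Import aborted: ' . $e->getMessage());
    notifyAdmin($e);
} finally {
    releaseImportLock();
}
```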

If you're interested, please take a look at these articles:

-2

The general counter-technique to cascading errors is resilience and compartmentalization.

Resilience - if part of a data stream is broken, ignore the broken part and work with the part that is okay if possible, otherwise abort the affected process then and there, but only the affected process. Have fallback options available.

Compartmentalization - separate your resources. Assign threads, memory, and access rights to different parts of the system and keep them separate. If one part fails, make sure the others are not affected. For instance, if you call another component (internal or external) and it repeatedly fails or produces errors, stop calling it (the circuit breaker concept) - at least for a while - use fallback values, and don't spread faulty data through your system.
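
A toy version of that circuit breaker in PHP (the threshold and cooldown values are arbitrary, for illustration only):

```php
<?php
// After N consecutive failures, stop calling the component for a cooldown
// period and serve the fallback instead; a later success closes the circuit.

final class CircuitBreaker
{
    private int $failures = 0;
    private int $openedAt = 0;

    public function __construct(
        private readonly int $threshold = 3,
        private readonly int $cooldownSeconds = 600,
    ) {}

    public function call(callable $operation, callable $fallback): mixed
    {
        if ($this->failures >= $this->threshold
            && time() - $this->openedAt < $this->cooldownSeconds) {
            return $fallback();                // circuit open: don't even try
        }
        try {
            $result = $operation();
            $this->failures = 0;               // success closes the circuit
            return $result;
        } catch (Throwable $e) {
            if (++$this->failures >= $this->threshold) {
                $this->openedAt = time();
            }
            return $fallback();
        }
    }
}
```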

7
  • 4
    Partially parsing data and using fallback values in case of failure is a way to get more random errors in other parts of the application, not fewer. My advice is to NEVER use fallback values unless you really understand the situation in which the fallback would happen. This achieves the opposite of what OP wants.
    – Helena
    Commented Dec 5, 2020 at 9:03
  • @Helena I think you all focus too much on the data parsing part. Sure, identify if a source is broken, but in a more complex system the key to keeping errors confined to the affected place is to isolate them while keeping the rest running. That is: identify the issue, then, if the use-case allows, go with default values for that part, or mark that part as broken and adjust functionality. Commented Dec 5, 2020 at 17:14
  • Failing fast makes sure that the problem is isolated to the csv parsing component. Using data from a broken importer downstream allows the issue to propagate through the system and lead to more problems popping up elsewhere like OP described. Graceful degradation is great if you care about availability of your service, but not necessarily if you care more about maintainability.
    – Helena
    Commented Dec 5, 2020 at 17:22
  • @Helena if the whole system is down, that means you need to take immediate action; if only part of it is down, you or your users may be able to live with that for quite a bit until you can take care of it. I'm all for direct degradation of the affected piece of information parsed, but my approach lets you keep the system running, not have to rush into it, have fewer errors in the log diluting the issue, and, in cases of temporary problems, do nothing at all. Sure, sometimes no default data makes sense, which is when you fail that component, not everything. Commented Dec 5, 2020 at 17:41
  • In any case I didn't have much time to go into it, but I feel the focus on how to handle parsing of a source right (sure, fail it if it looks iffy) is off-track from the real problem OP is having. But we seem to disagree on some fundamental understanding or evaluation, which is fine. Commented Dec 5, 2020 at 17:43
