32

Given a web application where user data must be properly escaped to avoid XSS, is it better to try to remove the "bad stuff" before it enters the database, or is it best to allow it in the database but be careful about escaping output when it is displayed on the page?

I see some applications where the input is stored raw in the database but output is always escaped (at least so far -- I'm still hunting!). It makes me uncomfortable to see malicious data in the db, because the safety of the output relies on the developers remembering to escape the strings every time they make an output... (Some kind of framework would be better, at least it collects the output code and filtering/escaping into a common location.)

Edit

For clarity: I'm auditing existing web applications, not developing. (At least for the purposes of this question -- when I do web dev, I reach for a framework.) A lot of what I see uses ad hoc filtering and/or escaping on input and/or output. @D.W.'s answer hit the nail on the head -- getting to the essence of what I was asking.

7 Answers

31

Great question! You are asking the right questions.

Short answer. In most cases, escaping at the output side is the most important thing to do. The best solution is to use a web development framework (such as Google ctemplate) that provides context-dependent automatic escaping and automatic defenses against other injection attacks (like prepared statements to avoid SQL injection). This is likely to be more effective than sanitization on the input side.

Explanation. Here we have a flow of untrusted data from some input source (e.g., a URL parameter), through a complex chain of computations (e.g., through the database), and finally out to some output sink (e.g., dynamic content in a HTML template). Where should we put the sanitization/escaping? We could put it near the input, or near the output, or somewhere in the middle. How do we decide where is best to put it? I think that's what you are asking.

The first part of the puzzle is to realize that it is better to have a consistent policy. It is better to put everything at the input, or everything at the output, than to sanitize 50% of the inputs and 50% of the outputs (if you do the latter, then it is too hard to check that your policy has been followed consistently, and it is too easy to end up with a flow of data from untrusted source to output sink that never gets sanitized/escaped). It is better to have a policy that "everything in the database is already sanitized and escaped, and it can all be treated as trusted" or "nothing in the database is sanitized/escaped, and it should all be treated as untrusted" (or to have a policy which documents which fields in the database can be trusted to have already been sanitized/escaped, and which ones are untrusted) than to have no documented policy.

The second part of the puzzle is to ask: What extra information do I need to know, to do the sanitization/escaping correctly? Do I need to know some information about where the untrusted input came from? Do I need to know some information about where it will be used (what part of the output it will be inserted into)?

In most cases, it turns out that the answer is: we need to know where the untrusted data will be used (where it will appear in the HTML output), but not where it came from. We need to know where in the HTML document it will be inserted, because this determines the choice of escaping function: if it is inserted in between tags, then you should use HTML escaping to escape <, >, and &; if it is inserted inside an attribute, then you need to escape quotes as well; if it is inserted as a URL, then you also need to check the protocol scheme (to make sure it is not a javascript: URL). This information is readily available at the output sink, but not at the input source: when you insert dynamic data into an HTML document, you have all the information you need about what parse context it will be inserted in, at your fingertips. On the other hand, if you try to sanitize at the input source, it is not clear where the data might be used, so it is hard to know how it needs to be escaped. So this suggests escaping at the output sink, rather than sanitizing at the input source.
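To make the three contexts above concrete, here is a minimal sketch using only the Python standard library. The function names, the scheme allowlist, and the "#" fallback are assumptions made for illustration, not part of any framework; a real application would normally let an auto-escaping template engine make these choices.

```python
import html
from urllib.parse import urlsplit

# Assumed allowlist for this sketch; adjust to what the application really needs.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def escape_text(value: str) -> str:
    """Escape for insertion between tags: <, > and & are replaced."""
    return html.escape(value, quote=False)


def escape_attr(value: str) -> str:
    """Escape for a quoted attribute value: also replaces " and '."""
    return html.escape(value, quote=True)


def safe_href(url: str) -> str:
    """Check the URL scheme against the allowlist, then escape as an attribute.

    Relative URLs have no scheme, so this strict sketch rejects them too.
    """
    candidate = url.strip()
    try:
        scheme = urlsplit(candidate).scheme.lower()
    except ValueError:
        return "#"  # unparsable URL; hypothetical fallback value
    if scheme not in ALLOWED_SCHEMES:
        return "#"  # e.g. javascript: or data: URLs end up here
    return escape_attr(candidate)
```

Note that none of these functions needs to know where the value came from, only where it is going, which is why they sit naturally at the output side.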

The third piece of the puzzle is that there are web programming frameworks that do context-sensitive auto-escaping. Typically, they use a template system, and for each value that will be dynamically inserted into the template, they look at the HTML context where it will be inserted (is it between tags? inside an attribute? a URL value? inside Javascript?), figure out what escaping function needs to be used, and then automatically apply that escaping function. This is a big win, because it ensures that the proper escaping function is used, and eliminates vulnerabilities where you forgot to escape some value. Today, both of those kinds of vulnerabilities are common: developers often forget to escape some value, and when they do remember to escape, they often apply the wrong escaping function for the context where the value will be used. Context-sensitive auto-escaping should essentially eliminate those vulnerabilities.

Discussion. That said, the best defense is to use both context-sensitive escaping at the output, and input validation/sanitization at the input. I consider context-sensitive escaping your most important line of defense. But sanitizing values at the input (based upon your expectation of what valid data should look like) is also a good idea, as a form of defense-in-depth. It can eliminate or mitigate some kinds of programming errors, making it harder or impossible to exploit them.

2
  • 1
    This is an excellent answer. You managed to read between the lines of my question and covered all of the relevant points. I especially appreciate the "why" -- I knew this intuitively but hadn't formulated in the way you have.
    – bstpierre
    Commented Dec 5, 2011 at 3:45
  • @DW - have just highlighted the sentence in your last paragraph as I think you have hit the nail on the head with this answer.
    – Rory Alsop
    Commented Dec 7, 2011 at 13:47
8

I strongly suggest using existing frameworks to do validation on input, and escaping on output.

Escaping on input has three big issues:

  • You have to un-escape and re-escape for a different output medium, such as a PDF file instead of an HTML page.
  • It is more complicated to introduce fixes because the existing (incompletely escaped) data needs to be vetted.
  • Looking at the data with low level database diagnostic tools will only show you the escaped data.

More important: escaping on input gains you very little, since it is as easy to forget on input as it is on output. And doing both obviously does not work, as it would destroy the data.
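As a small illustration of why escaping on both sides destroys the data, here is a sketch using Python's standard html module. The sample string is made up for the example.

```python
import html

raw = "Tom & Jerry <3"

stored = html.escape(raw)       # escaped once, on input
rendered = html.escape(stored)  # escaped again, on output

print(stored)    # Tom &amp; Jerry &lt;3
print(rendered)  # Tom &amp;amp; Jerry &amp;lt;3
```

A browser decodes only one level of entities, so the user would see the literal text "Tom &amp; Jerry &lt;3" on the page instead of what was originally typed.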

You should use a framework (such as JSF for example) that frees the domain developers from the burden of having to keep escaping in mind. This leaves only a very small amount of code, in the technical parts of the display components, that needs to be fully aware of proper escaping. A small amount of critical code is good because it hugely reduces the chance of bugs and simplifies auditing and education.

1
  • Good list of the issues with escaping input. FWIW, this isn't my code -- I'd definitely use a framework. At least with a framework, it's easier to inspect for places where the developers either use low-level functions, or circumvent the framework.
    – bstpierre
    Commented Dec 3, 2011 at 18:29
5

You always filter out the "bad stuff" that affects the place where you're putting it.

So if you're sending data to the database, you escape for database injection. If you're sending to HTML, you escape for HTML injection. This means you're escaping twice, but for different things.
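Here is a rough sketch of that idea in Python, assuming an in-memory SQLite database and a hypothetical comments table. Note that it uses parameter binding rather than manual escaping for the database side, which is the usual way to handle that sink; the HTML side uses html.escape for text between tags.

```python
import html
import sqlite3


def save_comment(conn: sqlite3.Connection, body: str) -> None:
    # Database sink: the ? placeholder keeps the value out of the SQL text.
    conn.execute("INSERT INTO comments (body) VALUES (?)", (body,))


def render_comments(conn: sqlite3.Connection) -> str:
    # HTML sink: the raw stored value is escaped only now, for this context.
    rows = conn.execute("SELECT body FROM comments ORDER BY rowid").fetchall()
    return "".join(f"<p>{html.escape(body, quote=False)}</p>" for (body,) in rows)


# Usage with a throwaway in-memory database (table name is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")
save_comment(conn, "'); DROP TABLE comments; --<script>alert(1)</script>")
page = render_comments(conn)
```

The value is stored exactly as the user typed it, and each sink applies only the protection that matters for that sink.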

This isn't just a matter of preference; it keeps your application safe as it changes. For example, if you HTML-escape data only as it's inserted into the database and a malicious user finds some other way to get HTML into the DB, then the database becomes a potential vector for scripting and other HTML-based attacks.

In other words, by escaping as close as possible to the point of exploit, you decrease your overall attack surface.

1
  • You always filter out the "bad stuff" that affects the place where you're putting it. I think that rule-of-thumb will take you a very long way. Nice simple answer.
    – jmrah
    Commented Feb 17, 2022 at 13:54
2

Ideally all three. Assume that, at some point, the developer working on the input screen will leave out a check (or more likely that some new exploit will come along that breaks the check). The difficulty with validating at input is that if a new exploit comes along, you may have already accepted malicious data before you can fix your validation. Assume that, at some point, a developer working on the output will leave out the filtering (or the new exploit breaks the filter).

So if possible, have validation in the input, in the output and as a database constraint. Whenever the constraint is changed, all existing data will be revalidated against the new rule.

If I had to choose one, it would be the output because enhancements will be applied to new and existing data.

1
  • Indeed. About an hour after I posted the question I found a spot where an output variable wasn't escaped.
    – bstpierre
    Commented Dec 3, 2011 at 18:30
1

Hopefully when you say "input is stored raw in the database" it's already been sanitized to protect against SQL injection attacks.

IMHO, if you're already doing that work, you might as well do the best you can to filter out XSS naughtiness, too.

1
  • Good point. I'm hunting for SQL injection too. It's ad-hoc sanitized so I may be able to find something.
    – bstpierre
    Commented Dec 3, 2011 at 18:24
0

The method and placement of input filtering really depends on the specific situation and requirements for the processed data, but I would consider it with the following four principles in mind:

1) use as tight and strict filtering and validation as possible

2) use multiple / redundant levels of defense

3) make sure the data sanitizing is appropriate for the context in which it will be used!!

4) if using a framework or other automation, understand what it is doing and its edge cases

The application of the first principle - and the answer to your question regarding whether to allow unfiltered data in the database - often depends on the application functional requirements. For example, if you collect first or last names that should never have HTML tags inside them, you can strip the tags at the earliest point of processing them. If you have an age or price field, you can enforce the value being numeric before doing anything else with it.
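A minimal sketch of that kind of strict, early validation in Python follows. The field names, the length limit, and the age range are assumptions chosen for the example, and this version rejects names containing angle brackets rather than stripping tags. It complements output escaping and does not replace it.

```python
from decimal import Decimal, InvalidOperation


def parse_age(raw: str) -> int:
    """Accept only a whole number in an assumed plausible range."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError("age must be a whole number")
    age = int(text)
    if not 0 <= age <= 150:
        raise ValueError("age out of range")
    return age


def parse_price(raw: str) -> Decimal:
    """Accept only a finite, non-negative decimal number."""
    try:
        price = Decimal(raw.strip())
    except InvalidOperation:
        raise ValueError("price must be numeric") from None
    if not price.is_finite() or price < 0:
        raise ValueError("price must be a non-negative number")
    return price


def check_name(raw: str) -> str:
    """Reject names that contain angle brackets or are empty or too long."""
    name = raw.strip()
    if not name or len(name) > 100 or "<" in name or ">" in name:
        raise ValueError("invalid name")
    return name
```

Rejecting bad input outright, rather than trying to repair it, keeps the rules easy to audit: a value either matches what the field is supposed to hold or it never reaches the database.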

But there are cases where you have to work with arbitrary user data - for example, user postings on this site that may include tags or script snippets :) - and in some cases you need access to the original data. Then, several approaches are possible: store unsafe data and filter/encode it before displaying; store sanitized version and convert back to the original if needed; and finally, to store both - original data ("red stream") for the applications that need it as well as the sanitized version ("green stream") to be used for display purposes.

Unless there are strong guarantees that the data in the database has been properly sanitized, it can be worth treating the DB as an external source and applying input filtering to data coming from it. Validating at several points in the data flow (initial input processing, data read back from the database, and possibly output encoding) provides multiple defenses against XSS & Co.

The crucial issue in selecting an appropriate method of input/output sanitization is to know where, and in which HTML context, the data will go on the page. Stripping or entity-encoding the HTML tags is usually sufficient if the data goes into the "normal" position (the so-called HTML context), but is inadequate if you insert it as the value of the href attribute of an <a> tag: even with all the tags stripped from the input, you can end up with <a href="javascript:...">click here</a> and a "one-click XSS" vulnerability. Study carefully the rules on https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet

Finally, a word of caution regarding the frameworks or sanitization libraries - they can be dangerous exactly due to being very effective. Usually they work wonderfully in 95% of "normal" / easier cases and can lead developers to forget doing appropriate things for the edge cases, such as sanitization for special contexts (attributes, Javascript etc).

0

There is no golden rule when talking about validating & sanitizing data.

I learned this over the years developing my own apps. Why, you ask?

Because there are so many ways your app can work. There are so many ways you will be manipulating input/output. There will be many functions and mechanics your data will be treated with. It's mostly up to your imagination. So instead of asking for the golden rule you should:

  1. Read a lot about, and understand, all types of risks and hacking techniques: XSS, code injection risks (PHP, SQL, RCE), path traversal etc. Only when you understand these risks and their mechanics may you proceed to step 2.
  2. Learn about the techniques that try to prevent all these different types of attacks. There is no bulletproof method. Like in real life, there will always be a lockpicking method for every lock. But even the worst lock is better than leaving your home unprotected.
  3. Know exactly what you do with the data. What are the exact paths the data may go through, and where are the inputs and outputs? This should include later usage when the data is retrieved from the database. For example, there is a big difference between data which is taken from the user and then passed through high-risk functions like eval() or exec(), and non-critical data which is only used for cosmetic reasons. Sometimes an attacker can, without any deep knowledge, simply swap your JS variables when their values are not validated server-side.

Only once you have gone through these 3 steps can you protect yourself and your users. Of course there are some simple basics, but there are no golden rules if you go deeper. Don't blindly use any technique until you fully understand what the risks and the methods are, and how your "machine" works.
