Great question! You are asking the right questions.
Short answer. In most cases, escaping at the output side is the most important thing to do. The best solution is to use a web development framework (such as Google ctemplate) that provides context-dependent automatic escaping and automatic defenses against other injection attacks (like prepared statements to avoid SQL injection). This is likely to be more effective than sanitization on the input side.
Explanation. Here we have a flow of untrusted data from some input source (e.g., a URL parameter), through a complex chain of computations (e.g., through the database), and finally out to some output sink (e.g., dynamic content in an HTML template). Where should we put the sanitization/escaping? We could put it near the input, or near the output, or somewhere in the middle. How do we decide where is best to put it? I think that's what you are asking.
The first part of the puzzle is to realize that it is better to have a consistent policy. It is better to put everything at the input, or everything at the output, than to sanitize 50% of the inputs and 50% of the outputs. If you do the latter, it is too hard to check that your policy has been followed consistently, and too easy to end up with a flow of data from untrusted source to output sink that never gets sanitized/escaped. It is better to have a policy that "everything in the database is already sanitized and escaped, and it can all be treated as trusted" or "nothing in the database is sanitized/escaped, and it should all be treated as untrusted" (or a policy which documents which fields in the database can be trusted to have already been sanitized/escaped, and which ones are untrusted) than to have no documented policy.
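One way to make such a policy checkable is to encode it in types. The following is only a minimal sketch, assuming the "nothing in the database is escaped" policy; the names SafeHtml, escape, and paragraph are hypothetical and not taken from any particular framework.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeHtml:
    """Hypothetical marker type: markup that has already been escaped."""
    markup: str


def escape(untrusted: str) -> SafeHtml:
    """The only intended way to turn an untrusted string into SafeHtml."""
    return SafeHtml(html.escape(untrusted, quote=True))


def paragraph(content: SafeHtml) -> SafeHtml:
    """An output helper that refuses plain (untrusted) strings."""
    if not isinstance(content, SafeHtml):
        raise TypeError("paragraph() accepts only SafeHtml; escape untrusted strings first")
    return SafeHtml(f"<p>{content.markup}</p>")


# Everything read from the database is a plain str, hence untrusted by policy.
comment_from_db = "<b>hello</b>"
print(paragraph(escape(comment_from_db)).markup)
```

With this arrangement, a forgotten escape shows up as a TypeError at the output sink instead of as a silent flow of untrusted data into the page.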
The second part of the puzzle is to ask: What extra information do I need to know, to do the sanitization/escaping correctly? Do I need to know some information about where the untrusted input came from? Do I need to know some information about where it will be used (what part of the output it will be inserted into)?
In most cases, it turns out that the answer is: we need to know where the untrusted data will be used (where it will appear in the HTML output), but not where it came from. We need to know where in the HTML document it will be inserted, because this determines the choice of escaping function: if it is inserted in between tags, then you should use HTML escaping to escape <, >, and &; if it is inserted inside an attribute, then you need to escape quotes as well; if it is inserted as a URL, then you also need to check the protocol scheme (to make sure it is not a javascript: URL). This information is readily available at the output sink, but not at the input source: when you insert dynamic data into an HTML document, you have all the information you need about what parse context it will be inserted in, at your fingertips. On the other hand, if you try to sanitize at the input source, it is not clear where the data might be used, so it is hard to know how it needs to be escaped. So this suggests escaping at the output sink, rather than sanitizing at the input source.
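To make the three cases concrete, here is a rough sketch using only the Python standard library. The function names and the ALLOWED_SCHEMES allow-list are my own assumptions for illustration, and a real application would probably need more contexts than these three.

```python
import html
from urllib.parse import urlsplit

# Assumption: an example allow-list; adjust to what your application needs.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def escape_text(value: str) -> str:
    """For data inserted between tags: escapes <, > and &."""
    return html.escape(value, quote=False)


def escape_attribute(value: str) -> str:
    """For data inserted inside a quoted attribute: also escapes quotes."""
    return html.escape(value, quote=True)


def escape_url_attribute(value: str) -> str:
    """For data used as a URL: check the scheme first, then escape as an attribute."""
    scheme = urlsplit(value.strip()).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        # Rejects javascript: URLs, and (conservatively) scheme-less URLs too.
        raise ValueError(f"URL scheme not allowed: {scheme!r}")
    return escape_attribute(value)


print(escape_text("<b>Tom & Jerry</b>"))
print(escape_attribute('" onmouseover="x'))
print(escape_url_attribute("https://example.com/?a=1&b=2"))
```

Note that the same input string needs a different treatment in each function, which is exactly why the choice cannot be made at the input source.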
The third piece of the puzzle is that there are web programming frameworks that do context-sensitive auto-escaping. Typically, they use a template system, and for each value that will be dynamically inserted into the template, they look at the HTML context where it will be inserted (is it between tags? inside an attribute? a URL value? inside JavaScript?), figure out what escaping function needs to be used, and then automatically apply that escaping function. This is a big win, because it ensures that the proper escaping function is used, and eliminates vulnerabilities where you forgot to escape some value. Today, both of those kinds of vulnerabilities are common: developers often forget to escape some value, and when they do remember to escape, they often apply the wrong escaping function for the context where the value will be used. Context-sensitive auto-escaping should essentially eliminate those vulnerabilities.
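To give a feel for the idea, here is a deliberately tiny toy, not how any real framework is implemented. It guesses the context from the template text before each {{name}} placeholder and picks the escaper itself. Everything in it (the placeholder syntax, detect_context, render) is hypothetical. It assumes double-quoted attributes and no literal < or > inside attribute values in the template, treats a URL placeholder as a complete http/https URL, and simply refuses contexts it does not understand, such as script and style blocks.

```python
import html
import re
from urllib.parse import urlsplit

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
URL_ATTRS = ("href", "src")


def detect_context(prefix: str) -> str:
    """Guess the HTML context from the template text before a placeholder."""
    lowered = prefix.lower()
    for tag in ("script", "style"):
        if lowered.count("<" + tag) > lowered.count("</" + tag):
            raise ValueError(f"placeholders inside <{tag}> are not supported")
    if prefix.rfind("<") <= prefix.rfind(">"):
        return "text"  # not inside a tag
    # Inside a tag: only an open double-quoted attribute value is supported.
    match = re.search(r'([\w-]+)\s*=\s*"[^"]*$', prefix[prefix.rfind("<"):])
    if match is None:
        raise ValueError("unsupported placeholder position inside a tag")
    return "url" if match.group(1).lower() in URL_ATTRS else "attr"


def _escape_url(value: str) -> str:
    if urlsplit(value.strip()).scheme.lower() not in {"http", "https"}:
        raise ValueError("URL scheme not allowed")
    return html.escape(value, quote=True)


ESCAPERS = {
    "text": lambda v: html.escape(v, quote=False),
    "attr": lambda v: html.escape(v, quote=True),
    "url": _escape_url,
}


def render(template: str, values: dict[str, str]) -> str:
    """Fill in placeholders, escaping each value for its detected context."""
    out, pos = [], 0
    for m in PLACEHOLDER.finditer(template):
        out.append(template[pos:m.start()])
        context = detect_context(template[:m.start()])
        out.append(ESCAPERS[context](str(values[m.group(1)])))
        pos = m.end()
    out.append(template[pos:])
    return "".join(out)


page = render(
    '<a href="{{link}}" title="{{title}}">{{label}}</a>',
    {"link": "https://example.com/?q=a&b", "title": 'say "hi"',
     "label": "<script>alert(1)</script>"},
)
print(page)
```

The template author never names an escaping function, so there is nothing to forget and nothing to get wrong. A production system does the same job with a real HTML parser rather than these string heuristics.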
Discussion. That said, the best defense is to use both context-sensitive escaping at the output, and input validation/sanitization at the input. I consider context-sensitive escaping your most important line of defense. But sanitizing values at the input (based upon your expectation of what valid data should look like) is also a good idea, as a form of defense-in-depth. It can eliminate or mitigate some kinds of programming errors, making it harder or impossible to exploit them.
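For the input side, validation against what valid data should look like might be sketched as below. The field names and limits (a 3 to 20 character username, an age between 0 and 150) are assumptions chosen purely for illustration.

```python
import re

# Assumption: usernames are 3-20 ASCII letters, digits or underscores.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,20}")


def validate_username(raw: str) -> str:
    """Accept only strings that match the expected shape; reject everything else."""
    if USERNAME_RE.fullmatch(raw) is None:
        raise ValueError("invalid username")
    return raw


def validate_age(raw: str) -> int:
    """Parse a numeric field and check that it falls in a plausible range."""
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError("age must be a non-negative integer")
    age = int(raw)
    if not 0 <= age <= 150:
        raise ValueError("age out of range")
    return age


print(validate_username("alice_01"))
print(validate_age("42"))
```

This does not replace escaping at the output: a validated username is still passed through the escaper for whatever context it ends up in.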