I am creating an API proxy that acts as a bridge between our frontend application and an AWS OpenSearch server. The proxy adds features such as retries and timeouts.

One of the features I'm considering is request body validation. This API accepts two types of data: JSON when Content-Type is set to application/json, and NDJSON when Content-Type is set to application/ndjson. I am wondering whether it is a good idea to validate the request payload before it reaches the OpenSearch endpoint (i.e. check that the payload is valid JSON when Content-Type is application/json).
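
For concreteness, validation keyed on the Content-Type header could look roughly like the sketch below (Python used for illustration; the function name and the reject-unknown-types policy are my assumptions, not an OpenSearch API):

```python
import json

def is_valid_payload(content_type: str, body: bytes) -> bool:
    """Hypothetical proxy-side check: validate the body according to
    the declared Content-Type before forwarding to OpenSearch."""
    if content_type == "application/json":
        try:
            json.loads(body)          # must be one well-formed JSON document
            return True
        except ValueError:
            return False
    if content_type == "application/ndjson":
        try:
            for line in body.splitlines():
                if line.strip():      # blank lines (e.g. trailing newline) are fine
                    json.loads(line)  # every non-blank line must parse as JSON
            return True
        except ValueError:
            return False
    return False  # unknown content type: rejecting is one possible policy
```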

Arguments for request body validation:

  • The OpenSearch server is less likely to get overloaded, because the proxy can reject requests with malformed data without ever calling the OpenSearch endpoint.
  • For very large NDJSON payloads containing malformed JSON (e.g. _bulk requests), the proxy can validate while streaming, so not all of the data needs to be loaded into memory at once.
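
That streaming argument can be sketched as follows: a hypothetical helper that consumes the NDJSON body one line at a time, so a huge _bulk payload is never held in memory whole (Python for illustration; the helper name is mine):

```python
import json
from typing import Iterable, Optional

def first_invalid_line(lines: Iterable[bytes]) -> Optional[int]:
    """Validate an NDJSON body lazily, line by line.

    `lines` can be any iterable of byte strings, e.g. a streamed request
    body split on newlines. Returns the 1-based number of the first line
    that is not valid JSON, or None if every non-blank line parses.
    """
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue                  # tolerate blank/trailing lines
        try:
            json.loads(raw)
        except ValueError:
            return lineno             # fail fast; stop reading the stream
    return None
```

Because the iterator is consumed lazily, the proxy can reject a malformed _bulk request as soon as the first bad line arrives, without buffering the rest.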

Arguments against validation:

  • With no validation on the proxy side, the payload gets sent directly to OpenSearch (which performs its own validation), giving potentially better overall performance.
  • If another content type is added later, no proxy change is needed; it will keep working as expected.

So now I am unsure which approach to use. What do you suggest? Is there anything I missed regarding validation or the lack of it? Which is the better approach? Or is there a better way I haven't considered?

2 Answers

It sounds like you accept search requests from the general Internet.

On the "pro" side you list "won't get overloaded easily", avoiding user-visible interruptions or slowdowns. But shouldn't we let horizontal scaling worry about that kind of thing?

I'm not seeing any real motivation in OP to implement a pre-validator feature that no one is asking for and that doesn't correspond to a current pain point. So don't do it. Save the effort, and spend it coding up features that are being requested.


OTOH, if general internet clients can sometimes send a few "unwanted" queries (no business value to you), or many such queries, then a proxy that parses authentication headers and suppresses unauthorized queries offers direct value. It can suppress the forwarding of queries to an AWS service which bills you for its usage costs.

Prioritize current or anticipated AWS costs from unwanted queries, and if that bubbles up as more important than other priorities, sure, go ahead and implement such filtering.

  • Ah. I didn't know about horizontal scaling. Thanks. What about content types? If I remove all data validation, does that mean validating content types is now meaningless? As of now the only validation remaining is the Authorization header. Commented Jan 12 at 19:00

It seems the general issue here is the potential overloading of the backend services.

Sure, you might use horizontal autoscaling to "absorb" the extra load, but I would not recommend relying on it if:

  1. Your organisation is very cost-aware with regard to cloud spend.
  2. Your services are prone to scraping attacks, for example if you are in the business of aggregating some sort of niche data, such as vehicle histories or similar.
  3. Your service is a well-known public service that is prone to being misused (sometimes intentionally, sometimes unintentionally).

If you have any of the risks listed above, then implementing some validation is always a good idea. I would suggest going for it; however, keep in mind that your validation service needs to be significantly faster than your backends so that it does not add too much extra load (10-15% overhead is generally not noticeable).

Also keep in mind that you will need a smart strategy for handling the proxying to your backend: if you just rely on simple stateless HTTPS, it may cause trouble down the road if your proxy service ends up being a bottleneck.
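
As one illustration of such a strategy (all names and numbers here are made up), a semaphore can cap how many requests are in flight to the backend at once, so a burst queues inside the proxy instead of piling open connections onto OpenSearch:

```python
import asyncio

MAX_UPSTREAM = 4  # tune to what the backend can comfortably handle

async def call_opensearch(body: bytes) -> bytes:
    """Stand-in for the real upstream call; not a real client API."""
    await asyncio.sleep(0)            # placeholder for network I/O
    return b"ok:" + body

async def proxy_demo(bodies):
    """Forward each body, but never more than MAX_UPSTREAM at a time."""
    slots = asyncio.Semaphore(MAX_UPSTREAM)

    async def forward(body: bytes) -> bytes:
        async with slots:             # excess requests wait here, in the proxy
            return await call_opensearch(body)

    return await asyncio.gather(*(forward(b) for b in bodies))
```

The point of the semaphore is where the queueing happens: clients beyond the cap wait inside the proxy rather than holding sockets open against the backend.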

  • Can you expand a little more on "Also keep in mind that you will need a smart strategy for handling the proxying to your backend - if you just rely on simple stateless https, it may cause trouble down the road if your proxy service ends up being a bottleneck"? An example of a strategy would be great. Thanks! Commented Jan 13 at 18:42
  • Well, it's really a multidimensional problem, but in short what I mean is that you do not want to hold up your backend services with thousands of open connections for each query coming in, so you might need to somehow pool those validations together and probably build some kind of async messaging in between, so you guarantee that requests don't get lost between your proxy and your backends, which is a realistic scenario here. It is really quite a complex problem to solve, with a multitude of other smaller factors at play.
    – Jas
    Commented Jan 13 at 20:00
