The kind of requirements you have for storing data is very different from the kind of requirements you have for running computationally complex calculations on that data. While this is not the only use case, it especially applies if you're running into exponentially expanding complexity (or close to it) due to the interactions within your large data set.
To give a concrete example: it's fairly straightforward to store the data for customers, products, stores (including inventory) and purchases. Those are 4 CRUD endpoints that hardly interact with one another. CPU-wise, you don't need much.
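A minimal sketch of what that storage side might look like (the field names are mine, purely for illustration):

```python
from dataclasses import dataclass

# Four entities that barely interact; each one maps to a plain CRUD endpoint.
@dataclass
class Customer:
    id: str
    home_location: tuple[float, float]  # (lat, lon)

@dataclass
class Product:
    id: str
    name: str

@dataclass
class Store:
    id: str
    location: tuple[float, float]
    inventory: dict[str, int]  # product_id -> units in stock

@dataclass
class Purchase:
    id: str
    customer_id: str
    product_id: str
    store_id: str
```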
But let's ask ourselves whether the customers bought their products optimally, i.e. in a way that minimizes the total commute distance from each customer's home to the store where they made a purchase (across all purchases made).
This requires a lot of computational complexity. Not only do you have to calculate distances, you also have to consider that if Customer A lives closer to store B than to store A, and both stores sell the same product, you have to account for whether store B's inventory has one of those products to spare. And if there are multiple customers who could go to store B instead, which ones should we shift around to maximize the improvement?
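To make that blow-up concrete, here's a deliberately naive sketch (toy data, hypothetical names) that brute-forces the purchase-to-store assignment while respecting inventory. The candidate space grows as stores^purchases, which is exactly the kind of workload the CRUD hardware above was never sized for:

```python
from itertools import product

# distance[customer][store] -> commute distance (assumed precomputed)
distance = {
    "alice": {"store_a": 2.0, "store_b": 5.0},
    "bob":   {"store_a": 4.0, "store_b": 1.0},
}
# inventory[store][product] -> units available
inventory = {
    "store_a": {"widget": 1},
    "store_b": {"widget": 1},
}
# Each purchase is (customer, product); we ask which store it *should* go to.
purchases = [("alice", "widget"), ("bob", "widget")]
stores = list(inventory)

def total_distance(assignment):
    """Sum of commute distances for one candidate store assignment."""
    return sum(distance[cust][store]
               for (cust, _), store in zip(purchases, assignment))

def respects_inventory(assignment):
    """No store may hand out more units of a product than it stocks."""
    used = {}
    for (_, prod), store in zip(purchases, assignment):
        used[(store, prod)] = used.get((store, prod), 0) + 1
        if used[(store, prod)] > inventory[store].get(prod, 0):
            return False
    return True

# Brute force over |stores| ** |purchases| candidates: fine for a 2x2 toy,
# hopeless for millions of purchases -- hence the need for dedicated
# analytical horsepower (or a much smarter algorithm).
best = min(
    (a for a in product(stores, repeat=len(purchases)) if respects_inventory(a)),
    key=total_distance,
)
print(best, total_distance(best))  # ('store_a', 'store_b') 3.0
```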
I've intentionally picked an example that leads to many different possible subcomplexities and calculations, to show you the difference between simply storing data and operating on it per entity, compared to the kind of interaction and complexity that reporting on that same data can bring with it.
AWS Redshift or Google BigQuery are tailored towards big data operations, and therefore will be able to run your reporting logic better and faster than your usual hardware can.
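For instance, once the raw data has been loaded into the warehouse, the reporting query runs over there instead of on your operational hardware. A sketch using the BigQuery Python client (project, dataset, table and column names are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()  # project and credentials picked up from the environment

# The heavy lifting happens inside BigQuery; we only stream back the results.
query = """
    SELECT customer_id, SUM(distance_km) AS total_commute_km
    FROM `my_project.analytics.purchases`
    GROUP BY customer_id
    ORDER BY total_commute_km DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.customer_id, row.total_commute_km)
```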
There are also a few side points:
- What if your usual hardware doesn't have the spare capacity for running reports, i.e. your normal use case cannot accept any kind of performance degradation?
- The point made above is not too dissimilar from asking why search engines exist as a remote third-party service (as opposed to being run locally by either the searcher or the websites themselves), though in that case it's more blatantly obvious because of the network overhead of needing to call every site in existence.
> wouldn't a failed assertion on data in the data warehouse mean that something is not right with the data from the data source(s)?
Be very careful about negating that statement. Yes, if an assertion fails, there's clearly something wrong. But if none of your assertions fail, does that prove that your data is therefore all correct? No. And that's the more important consideration of the two.
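To put it another way: an assertion like this hypothetical post-load check can only catch the specific defect it was written for; it passing does not certify the data as a whole.

```python
def assert_no_orphan_purchases(purchases, customer_ids):
    """Every purchase loaded into the warehouse must reference a known customer.

    If this fails, something is definitely wrong upstream. If it passes, all
    we know is that this particular defect is absent -- nothing more.
    """
    orphans = [p for p in purchases if p["customer_id"] not in customer_ids]
    assert not orphans, f"{len(orphans)} purchases reference unknown customers"
```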
> wouldn't it be better if the ETL used my app's API to extract the data instead of directly connecting to the operational (in this case, Mongo) database?
I generally favor considering the datastore a private implementation detail of the service and therefore routing everything via the service. However, this decision is scoped to application use. I do make exceptions for infrastructural operations, e.g. datastore backups. Your ETL can similarly be considered an infrastructural operation that gets special access to the raw data. There are justifications for this:
- Maybe routing the extraction through the service would put an undue burden on it and degrade its performance.
- Maybe the service has a complex business layer which is plainly irrelevant for the purposes of the ETL.
- Maybe the service is only tailored towards end-user interactions (per-entity) as opposed to bulk data operations.
- Maybe your datastore contains audit logs that are absent from the service's output by design.
I don't know which one would apply in your case since I don't know your case.
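To illustrate the "infrastructural operation" angle, here's a bare-bones extraction sketch (pymongo, with a placeholder connection string and collection name) that reads the raw documents in bulk, bypassing the service and its business layer entirely:

```python
from pymongo import MongoClient

# Placeholder connection string; in practice this would be a read-only
# credential reserved for the ETL.
client = MongoClient("mongodb://localhost:27017")
db = client["appdb"]

def extract_purchases():
    """Stream raw purchase documents straight into the ETL's transform step."""
    for doc in db["purchases"].find({}).batch_size(1000):
        yield doc
```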