
I'm a software developer and new to data engineering, so this may be a newbie question, but I'm wondering why data integrity checks (for instance, dbt tests) are run on the data warehouse, rather than on the data sources themselves.

For example, I have an app that runs on MongoDB. Let's say I want to do reporting on the activity of my users. From what I understand, I would have something like an ETL pipeline that extracts the data from MongoDB, transforms it and loads it into some data warehouse, like AWS Redshift or Google BigQuery.
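
To make my understanding concrete, here is roughly what I picture the extract/transform/load step looking like (a hypothetical sketch only, assuming pymongo and the google-cloud-bigquery client; the collection, table and field names are made up):

```python
# Hypothetical extract/load step: pull user activity events from MongoDB
# and append them to a BigQuery table. All names are illustrative only.
from pymongo import MongoClient
from google.cloud import bigquery

def extract_user_activity(mongo_uri: str):
    """Read raw activity documents from the operational MongoDB database."""
    client = MongoClient(mongo_uri)
    for doc in client["app_db"]["user_activity"].find({}):
        # Light transform: flatten the document into warehouse-friendly rows.
        yield {
            "user_id": str(doc.get("userId")),
            "event_type": doc.get("type"),
            "occurred_at": doc["timestamp"].isoformat() if doc.get("timestamp") else None,
        }

def load_into_warehouse(rows, table_id: str = "my_project.analytics.user_activity"):
    """Append the transformed rows to the warehouse table."""
    bq = bigquery.Client()
    errors = bq.insert_rows_json(table_id, list(rows))
    if errors:
        raise RuntimeError(f"Load failed: {errors}")

if __name__ == "__main__":
    load_into_warehouse(extract_user_activity("mongodb://localhost:27017"))
```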

Then, I would have a tool, like DBT, that runs checks/assertions on the data in the data warehouse to make sure it makes sense for analysis. But I don't understand why that is. Provided that the ETL is tested properly and does its job well, wouldn't a failed assertion on data in the data warehouse mean that something is not right with the data from the data source(s)? In this case, wouldn't it be better to run the assertions against the original data since this would mean the app runs on some invalid data? This is the part that I don't get.
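
To illustrate what I mean by checks/assertions, this is roughly the kind of thing I have in mind (just a hypothetical sketch in Python; dbt itself would express this as a SQL test, and the table and column names here are invented):

```python
# Hypothetical warehouse-side assertion, similar in spirit to a dbt
# "not null" test. Table and column names are invented for illustration.
from google.cloud import bigquery

def assert_no_null_user_ids(table_id: str = "my_project.analytics.user_activity"):
    bq = bigquery.Client()
    query = f"SELECT COUNT(*) AS bad_rows FROM `{table_id}` WHERE user_id IS NULL"
    bad_rows = next(iter(bq.query(query).result())).bad_rows
    if bad_rows:
        raise AssertionError(f"{bad_rows} rows have a NULL user_id")

if __name__ == "__main__":
    assert_no_null_user_ids()
```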

Thank you for your help!

Side question: wouldn't it be better if the ETL used my app's API to extract the data instead of directly connecting to the operational (in this case, Mongo) database?

  • It seems you found a tutorial or recommendation for making data integrity checks. Your question would be way easier to answer if you would tell us where exactly you found that recommendation, ideally with an online reference, to give readers a chance to check the original source.
    – Doc Brown
    Commented Oct 22, 2023 at 20:45
  • @DocBrown There's no specific article. All the articles I've read seem to say that it's the way to do things. Another argument to support that is that there does not seem to be a dbt-testing equivalent for NoSQL databases.
    – samdouble
    Commented Oct 22, 2023 at 22:46
  • Some data only starts to make sense when aggregated over a considerable time period.
    – S.D.
    Commented Oct 23, 2023 at 7:34

3 Answers


In this case, wouldn't it be better to run the assertions against the original data since this would mean the app runs on some invalid data? This is the part that I don't get.

There are several possible explanations I can think of.

Firstly, there may be conventional constraints on the data which are not strictly enforced by the source (at least not right down to the database level), and the source cannot be safely altered to enforce those constraints. The checks therefore get done in the data warehouse. If a violation were detected, the source data would be adjusted (a sketch of such a check follows the fourth point below).

Secondly, and related to the first, your data warehouse may simply have the processing horsepower available to perform certain checks that do apply to the source system, checks the source could theoretically be altered to enforce but cannot practically afford to run itself.

Thirdly, your data warehouse might ultimately transform the data in a way that assumes certain constraints for the time being (for example, a limited understanding of the source may force inferences to be made about latent constraints), but these inferences may later be invalidated. If that happens and a violation is detected, the data warehouse should be reconfigured to adjust - the source system itself should not be interrupted, and no fault in the source data is implied.

Fourthly, your data warehouse may analyse and select data from the source in such a way that new constraints can be asserted which couldn't be asserted against the source.
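
To make the first point concrete, here is a minimal sketch of what such a warehouse-side check could look like (assuming a BigQuery warehouse as mentioned in the question, with invented orders/customers tables; dbt would typically express the same assertion as a SQL test):

```python
# Minimal sketch of a warehouse-side integrity check for a constraint the
# source system does not enforce (e.g. "every order references a known
# customer"). The project, dataset, table and column names are assumptions.
from google.cloud import bigquery

def count_orphan_orders(project: str = "my_project") -> int:
    """Return the number of orders whose customer_id has no match in customers."""
    bq = bigquery.Client(project=project)
    query = """
        SELECT COUNT(*) AS orphans
        FROM `my_project.analytics.orders` o
        LEFT JOIN `my_project.analytics.customers` c
          ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """
    return next(iter(bq.query(query).result())).orphans

if __name__ == "__main__":
    orphans = count_orphan_orders()
    if orphans:
        # A failure here flags the source data for correction; it does not
        # interrupt the source system itself.
        print(f"Constraint violated: {orphans} orders reference unknown customers")
```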

Side question: wouldn't it be better if the ETL used my app's API to extract the data instead of directly connecting to the operational (in this case, Mongo) database?

This can hardly be answered in the abstract. Generally speaking, database technologies are natively well-tailored for high volume data processing and transport - usually far better tailored than any custom API - but there could be legitimate architectural reasons to pipe everything through an API, if the advantage (such as simplifying certain security arrangements or deduplicating certain application logic) exceeded the performance penalties and special development effort.

  • Thank you for your answer. Your second point makes sense. I think the concept of a data warehouse was born because reporting and analysis used to be done on operational databases and sometimes made them crash. Doing the validity assertions on the data warehouse could be common practice to avoid that exact issue.
    – samdouble
    Commented Oct 22, 2023 at 22:50

The kind of requirements you need to store data is very different from the kind of requirements you need to run computationally complex calculations on a data set. While this is not the only use case, it especially applies if you're running into exponentially expanding complexity (or close to it) due to the interactions in your large data set.

Just to create an example, it's fairly straightforward to store the data for customers, products, stores (including inventory) and purchases. Those are 4 CRUD endpoints that hardly interact with one another. CPU-wise, you don't need much.

But let's ask ourselves whether the customer bought products optimally, i.e. in a way that the total commute distance from the customer's home to the stores where they made their purchases is minimal (across all purchases made).
This requires a lot of computational complexity. Not only do you have to calculate distances, you also have to consider that if Customer A lives closer to store B than to store A, and these stores both sell the same product, you have to account for whether store B's inventory has one of these products to spare. And if there are multiple customers who could go to store B instead, which ones should we shift around to maximize our improvements?
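
To make the contrast tangible, here is a deliberately naive sketch of that kind of calculation (hypothetical data structures and a greedy heuristic only, not a real optimizer), which already shows how the interactions pile up compared with plain per-entity CRUD storage:

```python
# Naive sketch of the "did customers buy optimally?" question: reassign each
# purchase to the nearest store that still has inventory, then compare the
# total distance against what actually happened. Everything here is invented
# for illustration; a real version would need a proper optimization model.
from dataclasses import dataclass
from math import dist

@dataclass
class Store:
    name: str
    location: tuple[float, float]
    inventory: dict[str, int]          # product -> units available

@dataclass
class Purchase:
    customer_home: tuple[float, float]
    product: str
    store_used: str

def total_actual_distance(purchases: list[Purchase], stores: dict[str, Store]) -> float:
    return sum(dist(p.customer_home, stores[p.store_used].location) for p in purchases)

def greedy_optimal_distance(purchases: list[Purchase], stores: dict[str, Store]) -> float:
    """Greedily send each purchase to the nearest store with stock left.

    Order matters and inventory is shared between customers, which is exactly
    the interaction that makes the real problem expensive to solve.
    """
    remaining = {s.name: dict(s.inventory) for s in stores.values()}
    total = 0.0
    for p in purchases:
        # Assumes at least one store always stocks the product.
        candidates = [s for s in stores.values() if remaining[s.name].get(p.product, 0) > 0]
        best = min(candidates, key=lambda s: dist(p.customer_home, s.location))
        remaining[best.name][p.product] -= 1
        total += dist(p.customer_home, best.location)
    return total
```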

I've intentionally picked an example that leads to many different possible sub-complexities and calculations, to show you the difference between simply storing data and operating on it on a per-entity basis, compared to the kind of interaction and complexity that reporting on that same data can bring with it.

AWS Redshift and Google BigQuery are tailored towards big data operations, and therefore will be able to run your reporting logic better and faster than your usual hardware can.


There are also some side points:

  • What if your usual hardware does not have the kind of spare capacity for running reports, i.e. if your normal use case does not accept any kind of performance degradation?
  • The point made above is not too dissimilar from asking why search engines exist as a remote third-party service (as opposed to being run locally by either the searcher or the websites themselves), although there it is more blatantly obvious due to the network overhead of needing to call every site in existence.

wouldn't a failed assertion on data in the data warehouse mean that something is not right with the data from the data source(s)?

Be very careful about negating that statement. Yes, if your extraction already fails, there's clearly something wrong. But if your extraction doesn't fail, does that prove that your data is therefore all correct? No. And that's the more important consideration of the two.

wouldn't it be better if the ETL used my app's API to extract the data instead of directly connecting to the operational (in this case, Mongo) database?

I generally favor considering the datastore a private implementation detail of the service and therefore routing everything via the service. However, this decision is scoped to application use. I do make exceptions for infrastructural operations, e.g. datastore backups. Your ETL can similarly be considered an infrastructural operation that gets special access to the raw data. There are justifications for this:

  • Maybe going through the service would put an undue burden on it and degrade its performance.
  • Maybe the service has a complex business layer which is plainly irrelevant for the purposes of the ETL.
  • Maybe the service is only tailored towards end-user interactions (per-entity) as opposed to bulk data operations.
  • Maybe your datastore contains audit logs that are absent from the service's output by design.

I don't know which one would apply in your case since I don't know your case.


And the most practical answer from the warehousing point of view: because production systems contain bad data (in all possible meanings) more often than not. First of all, data warehousing solutions are mostly for old and very large companies, which by definition have very, very legacy production systems (e.g. those ugly blobs in DB fields and many, many other wonders). Secondly, production system devs are prone to client/user pressure, which usually ends up with numerous special cases, which inevitably leads to ... (just in case: not the devs' fault). Thirdly, there is aggregation, which allows you to pinpoint business data errors by comparing the same data from different sources. Fourthly, mistakes just happen. In a way, warehousing is another level of validation and production debugging.

P.S. Warehousing people have no say over production systems, which by itself is OK, yet see the points above.
