I created an ETL pipeline in GCP that processes XML files from a Cloud Storage bucket and loads them into BigQuery.

Sometimes we find that some files were not processed, or that their data is missing from the BigQuery dataset.

I created a metric table that contains metadata about the processed files. However, I want to automate checks against it (for example, checking that every file in storage has an entry in the metric table).

EDIT

In short, I want to compare the source and target environments, that is, compare the data before it enters the ETL with the data after it exits, so I can confirm that nothing was missed. I could write some scripts to do that, but I wonder whether an existing tool already covers this.

1 Answer

In GCP you have two tools that can help you organize your pipeline, clean the data and send alerts in case of errors:

You can also open a ticket with GCP support; they may be able to suggest the most suitable solution for your case.

EDIT: What you are looking for is called Fuzzy Lookup:

The Fuzzy Lookup transformation performs data cleaning tasks such as standardizing data, correcting data, and providing missing values.

It is available in SSIS (SQL Server Integration Services).
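
For the specific check described in the question (every file in the bucket should have an entry in the metric table), a small reconciliation script may be enough. The following is only a minimal sketch: the `file_name` column and the way the metric table is referenced are assumptions, not details given in the question, and the two GCP helpers assume the `google-cloud-storage` and `google-cloud-bigquery` client libraries are installed and authenticated.

```python
from typing import Iterable, Set


def find_unprocessed(source_files: Iterable[str],
                     processed_files: Iterable[str]) -> Set[str]:
    """Return the files present in the source but absent from the metric table."""
    return set(source_files) - set(processed_files)


def list_bucket_files(bucket_name: str, prefix: str = "") -> Set[str]:
    """Collect the object names in a Cloud Storage bucket."""
    # Imported lazily so the comparison logic works without the GCP libraries.
    from google.cloud import storage

    client = storage.Client()
    return {blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)}


def list_processed_files(metric_table: str, column: str = "file_name") -> Set[str]:
    """Collect the file names recorded in the metric table.

    `metric_table` is a fully qualified name such as project.dataset.table;
    `column` is a hypothetical column name holding the processed file's name.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    query = f"SELECT DISTINCT {column} FROM `{metric_table}`"
    return {row[column] for row in client.query(query).result()}
```

Calling `find_unprocessed(list_bucket_files(...), list_processed_files(...))` returns the set of files that never reached the metric table. Running it on a schedule and alerting when the set is non-empty would automate the check.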
