I need to build a data pipeline to populate a database from various files. This is a common scenario, but I would like expert opinions on implementing a pipeline that is robust, modular, and resilient to future variations.
- There are several clients
- Each client shares daily data in different formats:
  - Excel workbook (each worksheet will go into a different DB table)
  - Multiple CSV files (each CSV will go into a different DB table)
  - Multiple XML files (each file will go into a different DB table)
- File names (worksheet names in the case of Excel) are consistent and represent the DB tables the data will be pushed to.
I need to come up with a pipeline to automate the daily data push to a PostgreSQL DB (a separate schema for each client). The question is about how to design the pipeline. My current plan is:
- Make a folder hierarchy for each client:
  - Client1 (a separate folder for each client)
    - source
      - processed
    - transformed
      - processed
- Each client will upload daily data into their respective source folder.
- A custom script (possibly a different one for each client) will act on the source folder (sketched after this list):
  - Transform the client data into a consistent CSV format and place it in the transformed folder
  - Move the original file to the source/processed folder
- A generic script (shared across clients) will act on the transformed folder (also sketched below):
  - Push the CSVs to their respective DB tables
  - Move each CSV file to the transformed/processed folder
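Roughly what I have in mind for the per-client transform script (step 3), sketched in Python with pandas. The `/data/Client1` paths are placeholders, reading Excel/XML assumes openpyxl and lxml are installed, and this is a rough sketch rather than a final implementation:

```python
"""Per-client transform script (step 3) -- a rough sketch, paths are placeholders."""
from pathlib import Path
import shutil

import pandas as pd  # assumes openpyxl (Excel) and lxml (XML) are also installed

CLIENT_DIR = Path("/data/Client1")      # one copy (or config entry) per client
SOURCE = CLIENT_DIR / "source"
SOURCE_PROCESSED = SOURCE / "processed"
TRANSFORMED = CLIENT_DIR / "transformed"


def transform_file(path: Path) -> None:
    """Convert one uploaded file into one or more CSVs named after the target tables."""
    suffix = path.suffix.lower()
    if suffix in (".xlsx", ".xls"):
        # Each worksheet becomes its own CSV; the sheet name is the table name.
        for sheet, df in pd.read_excel(path, sheet_name=None).items():
            df.to_csv(TRANSFORMED / f"{sheet}.csv", index=False)
    elif suffix == ".xml":
        pd.read_xml(path).to_csv(TRANSFORMED / f"{path.stem}.csv", index=False)
    elif suffix == ".csv":
        # Re-write through pandas so encoding and quoting come out consistent.
        pd.read_csv(path).to_csv(TRANSFORMED / f"{path.stem}.csv", index=False)
    else:
        raise ValueError(f"Unexpected file type: {path.name}")


def main() -> None:
    SOURCE_PROCESSED.mkdir(parents=True, exist_ok=True)
    TRANSFORMED.mkdir(parents=True, exist_ok=True)
    for path in sorted(p for p in SOURCE.iterdir() if p.is_file()):
        transform_file(path)
        shutil.move(str(path), SOURCE_PROCESSED / path.name)  # keep the original for auditing


if __name__ == "__main__":
    main()
```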
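The generic load script (step 4) could then look something like the following, using psycopg2's COPY. The DSN, the "schema name = client folder name" rule, and committing per file are my assumptions, and it presumes the target tables already exist:

```python
"""Generic load script (step 4) -- a sketch; DSN and schema naming are assumptions."""
from pathlib import Path
import shutil

import psycopg2
from psycopg2 import sql

DATA_ROOT = Path("/data")                 # contains Client1/, Client2/, ...
DSN = "dbname=warehouse user=etl"         # placeholder connection string


def load_client(conn, client_dir: Path) -> None:
    schema = client_dir.name.lower()       # e.g. Client1 -> schema "client1"
    transformed = client_dir / "transformed"
    processed = transformed / "processed"
    processed.mkdir(parents=True, exist_ok=True)

    for csv_path in sorted(transformed.glob("*.csv")):
        table = csv_path.stem              # file name == table name, per the requirements
        copy_stmt = sql.SQL(
            "COPY {}.{} FROM STDIN WITH (FORMAT csv, HEADER true)"
        ).format(sql.Identifier(schema), sql.Identifier(table)).as_string(conn)

        with conn.cursor() as cur, open(csv_path, newline="") as fh:
            cur.copy_expert(copy_stmt, fh)
        conn.commit()                      # commit per file so one bad file doesn't undo the rest
        shutil.move(str(csv_path), processed / csv_path.name)


def main() -> None:
    conn = psycopg2.connect(DSN)
    try:
        for client_dir in sorted(p for p in DATA_ROOT.iterdir() if p.is_dir()):
            load_client(conn, client_dir)
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```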
Steps 3 and 4 (the custom and generic scripts) will be independent of each other. Step 3 will trigger at a fixed time each day and act on the files in the source folder. Step 4 will trigger an hour later and act on the files in the transformed folder. The files in the processed folders will be archived every month.
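For the triggers I'm currently thinking of plain cron entries, one per step, along these lines (the mechanism is an assumption on my part; times, paths, and script names are placeholders):

```
# m h dom mon dow  command
0 2 * * *  python3 /opt/pipeline/transform_client1.py >> /var/log/pipeline/transform_client1.log 2>&1
0 3 * * *  python3 /opt/pipeline/load_all.py          >> /var/log/pipeline/load_all.log 2>&1
```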
Any improvements to make this robust, modular, and able to accommodate additional clients with possibly different data formats would be appreciated.