
I need to build a data pipeline to populate a database from various files. This is a common scenario. However, I would like expert opinions on designing a pipeline that is robust, modular, and resilient to future variations.

  • There are several clients
  • Each client shares daily data, in one of the following formats:
    • Excel workbook (each worksheet will go into a separate DB table)
    • Multiple CSV files (each CSV will go into a separate DB table)
    • Multiple XML files (each file will go into a separate DB table)
  • File names (worksheet names in the case of Excel) are consistent and represent the names of the DB tables the data will be pushed to.

I need to come up with a pipeline that automates the daily data push to a PostgreSQL DB (a separate schema for each client). The question is about how to design the pipeline. My current plan is:

  1. Make a folder hierarchy for each client
    • Client1 (separate folder for each client)
      • source
        • processed
      • transformed
        • processed
  2. Each client will upload daily data in their respective source folder.
  3. A custom script (possibly different for each client) will act on the source folder:
    • Transform the client data into a consistent CSV format and place the result in the transformed folder (a sketch of this step follows the list below).
    • Move the original file to the source/processed folder.
  4. A generic script (shared across clients) will act on the transformed folder:
    • Push the CSVs to the respective DB tables (a loader sketch follows the scheduling paragraph below).
    • Move each CSV file to the transformed/processed folder.
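
As an illustration of step 3, here is a minimal sketch for the Excel case. It assumes pandas (with openpyxl) is installed and follows the folder layout above; the client name and paths are placeholders, and the CSV/XML variants would follow the same pattern with a different parsing step:

```python
from pathlib import Path

import pandas as pd  # assumption: pandas + openpyxl are installed

CLIENT_ROOT = Path("/data/Client1")       # hypothetical client folder
SOURCE = CLIENT_ROOT / "source"
TRANSFORMED = CLIENT_ROOT / "transformed"
PROCESSED = SOURCE / "processed"          # assumed to exist already


def transform_excel(workbook: Path) -> None:
    """Write each worksheet to a CSV named after the worksheet (= the target table)."""
    sheets = pd.read_excel(workbook, sheet_name=None)  # dict of {sheet name: DataFrame}
    for sheet_name, frame in sheets.items():
        frame.to_csv(TRANSFORMED / f"{sheet_name}.csv", index=False)
    workbook.rename(PROCESSED / workbook.name)         # archive the original file


if __name__ == "__main__":
    for workbook in sorted(SOURCE.glob("*.xlsx")):
        transform_excel(workbook)
```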

Steps 3 and 4 will be independent of each other. Step 3 will trigger at a fixed time each day and act upon files in the source folder. Step 4 will trigger later (an hour after Step 3) and act upon files in the transformed folder. The files in the processed folders will be archived every month.
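
For the generic step 4 loader, a minimal sketch could look like the following. It assumes psycopg2 is available, that each client's schema is named after its folder, and that every CSV has a header row whose columns match the target table; the root path and connection string are placeholders:

```python
import csv
from pathlib import Path

import psycopg2  # assumption: psycopg2 is installed

DATA_ROOT = Path("/data")          # hypothetical root holding one folder per client
DSN = "dbname=warehouse user=etl"  # placeholder connection string


def load_csv(cur, schema: str, csv_path: Path) -> None:
    """COPY one CSV into <schema>.<table>, where the table name is the file name."""
    table = csv_path.stem
    with csv_path.open(newline="") as f:
        columns = ", ".join(f'"{c}"' for c in next(csv.reader(f)))  # header row = column list
        cur.copy_expert(
            f'COPY "{schema}"."{table}" ({columns}) FROM STDIN WITH (FORMAT csv)', f
        )


def main() -> None:
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for client_dir in DATA_ROOT.iterdir():
            transformed = client_dir / "transformed"
            if not transformed.is_dir():
                continue
            for csv_path in sorted(transformed.glob("*.csv")):
                load_csv(cur, schema=client_dir.name.lower(), csv_path=csv_path)
                conn.commit()                                               # commit per file
                csv_path.rename(transformed / "processed" / csv_path.name)  # move only after success


if __name__ == "__main__":
    main()
```

Committing and moving each file only after a successful COPY means a failed load leaves that CSV in the transformed folder to be retried on the next run.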

Any improvements that make it robust, modular, and able to incorporate additional clients with possibly different data formats are appreciated.

  • There doesn't seem to be much to add. You might want to think about how the system reacts to duff data or duplicate submissions, how errors would be notified (especially with the time lags between the processing steps), and how to ensure data in the database can be traced back to its source when necessary.
    – Steve
    Commented Jun 4, 2023 at 11:18
  • Excellent points, @Steve. I surely need to give some thought to the points you raised.
    – Imtiaz
    Commented Jun 4, 2023 at 14:00

1 Answer

Your flow seems reasonable, but it has a potential hole. Given that you're polling for new work rather than being event-driven, you need an additional directory, transfer, into which new files are uploaded. Uploading takes a non-zero amount of time and can fail in the middle. Only after a successful transfer should you move the transferred file to source (a move within the same filesystem is effectively atomic, so the file appears there all at once). That way you never attempt to process an incomplete file.

The equivalent effect may be had by renaming files in source after they're transferred, and not processing files that don't conform to the post-transfer name. Your choice.
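
A minimal sketch of the rename-based variant, assuming the convention (a placeholder, not anything mandated) that uploads arrive with a trailing .part suffix which is removed only once the transfer has completed:

```python
import os
from pathlib import Path

SOURCE = Path("/data/Client1/source")  # hypothetical source folder


def finalize_upload(partial: Path) -> Path:
    """Called by the upload side once the transfer has fully completed."""
    final = partial.with_suffix("")    # "orders.csv.part" -> "orders.csv"
    os.replace(partial, final)         # atomic rename on the same filesystem
    return final


def files_ready_for_processing() -> list[Path]:
    """The polling side only picks up files whose transfer has been finalized."""
    return [p for p in SOURCE.iterdir() if p.is_file() and p.suffix != ".part"]
```

Because os.replace is atomic on the same filesystem, the polling script either sees the complete file under its final name or does not see it at all.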
