
I have thousands of .csv files with the same structure, and in most cases the same column values recur across files. Each file is a report on some structures, with numeric attributes and a date.

I want this data to be structured and suitable for cleaning missing values, visualizations, predictions, etc. I know that several ETL tools are designed to provide this, although I never learned to use one.

But I cannot figure out what my "integrated" data will look like: the names of my structures recur, and the csv reports are only a few minutes apart from one another. What commonly happens in an ETL system in those scenarios? Is ETL really the right thing? Do I have to know in advance which operations I will perform on the "structured" data (e.g. predictions, temporal roll-ups, etc.) in order to choose, let's say, an SQL-like output, or is the "philosophy" just a different one?

2 Answers


An ETL is just a procedure. You:

  1. Extract the data from any source. In your case, this would be the CSV files, but it could be anything: data from an external service, data stored in some proprietary format, etc.

  2. Transform the data somehow. The idea is that the raw data you got in the first step may not be exactly what you want, or may not be in the exact format you want. In terms of format, the transformations can be as simple as renaming a column or as complex as joining data from multiple sources. In terms of the data itself, the transformation may involve fixing incorrect data, removing what you don't need, etc.

  3. Load what you got in the previous step into some destination. Here again, it can be anything: a plain text file, a JSON file, some proprietary format, a relational database, a NoSQL database, an Amazon S3 bucket, a web service, you name it.

Since your goal is to grab a bunch of CSV files, do some transformation on them, and dump the result in a database, the procedure you will be doing is indeed called ETL.
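If you go the custom-script route, the three steps above can be sketched roughly as follows. This is only a minimal illustration using Python's standard library, not a recommendation of a particular tool. The column names `structure`, `reported_at` and `value`, the table name `report`, and the choice of SQLite as the destination are all assumptions made for the example; your files will have their own columns.

```python
import csv
import sqlite3
from datetime import datetime
from pathlib import Path


def extract(csv_dir):
    """Extract: yield raw rows (as dicts) from every CSV file in csv_dir."""
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            yield from csv.DictReader(handle)


def transform(rows):
    """Transform: normalise types; keep missing values as None for later cleaning."""
    for row in rows:
        # Hypothetical column names; adapt them to the real CSV header.
        raw_value = (row.get("value") or "").strip()
        yield (
            row["structure"].strip(),
            datetime.fromisoformat(row["reported_at"].strip()).isoformat(),
            float(raw_value) if raw_value else None,
        )


def load(records, connection):
    """Load: one row per (structure, timestamp); re-loading a report overwrites it."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS report ("
        " structure TEXT NOT NULL,"
        " reported_at TEXT NOT NULL,"
        " value REAL,"
        " PRIMARY KEY (structure, reported_at))"
    )
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR REPLACE INTO report VALUES (?, ?, ?)", records
        )


def run_etl(csv_dir, db_path):
    """Run the three steps in order and return the open connection."""
    connection = sqlite3.connect(db_path)
    load(transform(extract(csv_dir)), connection)
    return connection
```

With this layout, recurring structure names are not a problem: each report contributes new rows to one long table, and the pair of structure name and timestamp identifies a row. Whether such a table is the right shape still depends on what you want to do with the data afterwards.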

Now, whether a given ETL software product is a good fit for you is a different question. Some may help you do your job faster; others won't. It's up to you to try a few, or maybe you'll prefer doing all three steps of the ETL yourself through a custom script.


ETL is not an end in itself; it is a means to an end. It transfers and transforms data coming from one or many sources into a target system. The target will have some requirements for the data quality, depending on the processes it supports. Typical target systems can be

  • business, scientific or health analytics databases (for reporting, statistics, OLAP)

  • geographic information systems

  • a system which supports implementation of more business processes on a technology very different from the one the source data is coming from. (Maybe the source is some 40-year-old financial system written in COBOL, maybe it is a sensor data log coming from some specialized hardware, and the processes of the target system are to be implemented with software technology from the Oracle / Google / Microsoft / Apple world).

(and this list is neither disjoint nor complete).

So what you need to do here is a requirements analysis. Look at your specific kind of target and find out which requirements it places on the data quality. There is no general answer to this: ETL always serves a purpose, and if you don't know the purpose (yet), you had better find it out before building something that does not match it.
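To make "requirements on the data quality" a bit more concrete, here is a small sketch in Python that writes a few requirements down as checks and counts how many records violate each one. The requirements themselves, and the field names `structure`, `reported_at` and `value`, are purely hypothetical; they stand in for whatever an imagined target that does temporal roll-ups might demand, and your own target will dictate different ones.

```python
from datetime import datetime


def _parses_as_timestamp(text):
    """Return True if text is an ISO 8601 timestamp Python can parse."""
    try:
        datetime.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical requirements of an imagined target system.
REQUIREMENTS = {
    "has_structure_name": lambda r: bool(r.get("structure")),
    "timestamp_parses": lambda r: _parses_as_timestamp(r.get("reported_at")),
    "value_present": lambda r: r.get("value") is not None,
}


def count_violations(records):
    """Count, per requirement, how many records fail it."""
    counts = {name: 0 for name in REQUIREMENTS}
    for record in records:
        for name, check in REQUIREMENTS.items():
            if not check(record):
                counts[name] += 1
    return counts
```

Running something like this over a sample of the source files shows how far the raw data is from what the target needs, which in turn tells you how much work the transformation step has to do.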
