
Questions tagged [etl]

Extract, Transform, Load - process in a database

0 votes
1 answer
354 views

Data Integration Design Using Microsoft SSIS

I am working on a data integration project where I need to extract data from an Oracle source and load it into an XML file. The requirement is to get the list of customers and, for each customer, create an XML ...
7 votes
5 answers
376 views

What can I do to get a message processor to slow down the rate of writes that it is making to a database?

We have this architecture: queue -> message processor (horizontal scaling) -> RDBMS. Sometimes external systems dump 10k messages onto the queue and the message processor of course dutifully ...
0 votes
3 answers
221 views

In data engineering, why is data integrity checked on the DW rather than on the data sources?

I'm a software developer and new to data engineering, so this may be a newbie question, but I'm wondering why data integrity checks (for instance, dbt tests) are run on the data warehouse, rather than ...
1 vote
1 answer
4k views

Best practice for sharing code and data between airflow worker nodes?

I'm new to Apache Airflow and curious about how code and data are expected to be used across worker nodes in a multi-node Airflow setup. When considering if ETL logic should be in the DAGs or in separate ...
-1 votes
1 answer
190 views

Modeling a CSV file: What is the standard? Python or SQL?

I have a wide CSV file of about 350 MB, and want to load it into a SQL database and properly model the data to make it easier to use for analysis. I could split the data into tables with Python and ...
3 votes
1 answer
407 views

What are some design ideas for a data mapping and transformation application?

Here is a high-level outline of the project: We frequently need to convert data from a new incoming system to our in-house system (sort of a basic ETL process). We would prefer to do this dynamically, ...
1 vote
1 answer
126 views

Data pipeline design - robust and resilient to future variations

I need to build a data pipeline to populate a database from various files. This is a common scenario. However, I want to have expert opinions for implementing a pipeline that is robust, modular and ...
0 votes
1 answer
83 views

Better design for a REST import into web store

I have an import that needs to grab data from a REST service and import it into a web store. It's basically an ETL type of service, but because the REST service can be slow and I don't want to call it ...
1 vote
1 answer
449 views

Is microservice approach always best fit for ETL processes?

In our project we are using Django and Django Rest Framework as the main application to get/query the data from the database and send it to the frontend. Those endpoints are very fast, as they should be. ...
2 votes
2 answers
509 views

Reading a large CSV file and then loading data to a DB

I have a Django application of 2 GB running and I need to receive a CSV file of more than 1 GB, read it and load the data to a PostgreSQL DB in IBM Cloud. The problem is that if I receive the file, it ...
0 votes
1 answer
74 views

Running ad hoc queries on JSON log files

I have a situation where let's say I have a folder called logs which has N folders. Each folder contains events for a specific event type and each folder has N .log files where each file has multiple ...
-1 votes
1 answer
554 views

Agile approach in ETL/ELT development

What are the pros and cons of using an agile/iterative approach when developing ETL/ELT (Extract Transform Load or Extract Load Transform) systems for data warehouses, data lakes, and lakehouses? I often find that ...
-2 votes
2 answers
405 views

What happens after the ETL process?

I have thousands of .csv files with the same structure and, in most cases, the same column values recur. Each file represents a report on some structures, with numeric ...
4 votes
2 answers
116 views

Designing an ETL where there are a few points of entry

I'm trying to think of a scalable solution for my current system. The current system is 3 microscopes and 1 processing machine. 1. 60-100 GB files come from 2-3 microscopes every 30 minutes. 2. That data ...
-1 votes
1 answer
33 views

Duplicating API implementations for declaring intention

I'm developing an ETL process in Python and Pandas to pull data from a REST API, and then dump it into a relational database. A few of the fields that come back contain sensitive data that I do not want to ...
