Questions tagged [etl]

Extract, Transform, Load - process in a database

7 votes
5 answers
376 views

What can I do to get a message processor to slow down the rate of writes that it is making to a database?

We have this architecture: queue -> message processor (horizontal scaling) -> RDBMS. Sometimes external systems dump 10k messages onto the queue and the message processor of course dutifully ...
jcollum's user avatar
  • 229
-1 votes
1 answer
190 views

Modeling a CSV file: What is the standard? Python or SQL?

I have a wide CSV file of about 350 MB, and want to load it into a SQL database and properly model the data to make it easier to use for analysis. I could split the data into tables with Python and ...
HappilyCoding's user avatar
0 votes
3 answers
221 views

In data engineering, why is data integrity checked on the DW rather than on the data sources?

I'm a software developer and new to data engineering, so this may be a newbie question, but I'm wondering why data integrity checks (for instance, dbt tests) are run on the data warehouse, rather than ...
samdouble's user avatar
  • 243
1 vote
1 answer
126 views

Data pipeline design - robust and resilient to future variations

I need to build a data pipeline to populate a database from various files. This is a common scenario. However, I want to have expert opinions for implementing a pipeline that is robust, modular and ...
Imtiaz's user avatar
  • 23
0 votes
1 answer
83 views

Better design for a REST import into web store

I have an import that needs to grab data from a REST service and import it into a web store. It's basically an ETL type of service, but because the REST service can be slow and I don't want to call it ...
user204588's user avatar
1 vote
1 answer
449 views

Is microservice approach always best fit for ETL processes?

In our project we are using Django and Django Rest Framework as main application to get/query the data from database and send it to the frontend. Those endpoints are very fast as they should be. ...
Alex T's user avatar
  • 161
2 votes
2 answers
509 views

Reading a large CSV file and then loading data to a DB

I have a Django application of 2 GB running and I need to receive a CSV file of more than 1 GB, read it and load the data to a PostgreSQL DB in IBM Cloud. The problem is that if I receive the file, it ...
Elvin Quero's user avatar
0 votes
1 answer
74 views

Running ad hoc queries on JSON log files

I have a situation where let's say I have a folder called logs which has N folders. Each folder contains events for a specific event type and each folder has N .log files where each file has multiple ...
Sriram R's user avatar
-1 votes
1 answer
554 views

Agile approach in ETL/ELT development

What are the pros and cons of using agile/iterative approach in ETL/ELT (Extract Transform Load or Extract Load Transform) data warehouses/data lakes/lakehouses systems development? I often find that ...
Eugene Lycenok's user avatar
-2 votes
2 answers
405 views

What happens after the ETL process?

I have thousands of .csv files with the same structure and, in most of the cases, some column values are the same ones recurring. Each file represents a report on some structures, with numeric ...
BoardsOfConsulting's user avatar
4 votes
2 answers
116 views

Designing an ETL where there are a few points of entry

I'm trying to think of a scalable solution for my current system. The current system is 3 microscopes and 1 processing machine. 1. 60-100 GB files come from 2-3 microscopes every 30 minutes. 2. That data ...
user3145912's user avatar
-1 votes
1 answer
33 views

Duplicating API implementations for declaring intention

I'm developing an ETL process in Python and Pandas to pull data from a REST API, and then dump it into a relational database. A few of the fields that come back contain sensitive data that I do not want to ...
ADataGMan's user avatar
  • 181
1 vote
2 answers
300 views

How to handle manual corrections to data in ETL pipeline

We receive product data from vendors on a regular basis to be incorporated into our catalog. The data looks like this: [ { id: 123, collection: Spring, name: New Beginnings, size: 8, price:...
user2468842's user avatar
0 votes
1 answer
38 views

Are there any general guidelines to allocate table space quota to different layers in ETL?

I am looking for any general guidelines to allocate table space quota to different layers/schemas in ETL flow of a data warehouse (% of total space in each layer). As per my research, ETL flow can ...
Curious_Mind's user avatar
0 votes
1 answer
268 views

Do Data Warehouse standards allow foreign key constraints at a dimensional model?

Is it true that we never enable foreign key constraints in the dimensional model of a data warehouse? If yes, then what is the rationale behind that? As per my research: Some experts told me in a ...
Curious_Mind's user avatar
-3 votes
1 answer
86 views

What is event-based data integration? [closed]

Please help me understand what event-based data integration is, in simple layman's terms, with some examples. How is it different from other forms of data integration? Some sample use cases will be ...
Rajneesh Shukla's user avatar
-5 votes
1 answer
563 views

Sync local database with remote

My client has a business which works mostly in remote areas where internet access is limited. We have a central database and the branches in remote areas need to connect to the central database. We ...
Jisson's user avatar
  • 93
4 votes
2 answers
995 views

How should a data warehouse be maintained for a quickly changing schema

I am currently in the process of maintaining a data warehouse for a quickly growing startup company. There are a lot of reporting demands from the clients, and these are usually handled by a data ...
Yong Jun Kim's user avatar
2 votes
0 answers
42 views

How to manage scheduled ETL jobs that are time sensitive?

We have some ETL jobs that are scheduled to run every day, and some that are scheduled to run every week via Control-M. These types of jobs tag data with the date the job was run and perform filter ...
Igneous01's user avatar
  • 2,333
1 vote
0 answers
822 views

Parsing a JSON file from S3 using Airflow

I'm new to Airflow and I'm working on a proof of concept. The project is fairly simple... every day some 10,000 JSON files are loaded onto a folder on AWS S3. I have to get each one of them, parse ...
Gabe's user avatar
  • 143
1 vote
1 answer
4k views

Best practice for sharing code and data between airflow worker nodes?

New to Apache Airflow and curious about how code and data are expected to be used across worker nodes in a multinode airflow setup. When considering if ETL logic should be in the dags or in separate ...
lampShadesDrifter's user avatar
2 votes
0 answers
393 views

Data pipeline architecture: airflow triggered by message broker

Let us say we have: a web app with a Postgres DB that produces data over time, another DB optimized for analytics that we would like to populate over time. My goal is to build and monitor an ETL ...
sunless's user avatar
  • 151
2 votes
2 answers
1k views

Micro-services architecture for Data Ingestion/Transformation pipeline project

I am working on designing a brand new Data Ingestion Pipeline. The key highlights of the new project are as follows: Download and Update data to/from SharePoint using SharePoint APIs Download and ...
Nanu's user avatar
  • 121
1 vote
2 answers
81 views

Should data be pre-processed before being handled by an ETL framework?

So I was discussing coding with an associate of mine at work, and was mentioning how I was working on a project where I'd need to transform the data that was provided into a standardized format before ...
canadiancreed's user avatar
1 vote
1 answer
1k views

How/when to normalize during ETL?

Let's say you're loading a denormalized flat file of purchase transactions that looks like this: | location_name | location_zip | product | product_price | |---------------|--------------|---------|--...
seriestoo2's user avatar
-1 votes
1 answer
279 views

WebApp for ETL with visual mapping - read csv and map it to data model

A few years ago I wrote a Python script for reading CSV, handling the headers, filtering data, renaming stuff via RegEx... basically to do various ETL stuff. This was done using an exhaustive ...
and0r's user avatar
  • 109
-1 votes
1 answer
152 views

How to incrementally update value of features in a machine learning pipeline?

I am working on a machine learning pipeline where we have to compute certain measures on streaming data. Every day, new raw data enters our pipeline. To update our features, we have to run an ETL that ...
spoderman's user avatar
-3 votes
2 answers
700 views

Why are multiple backends in this system? [closed]

I am trying to understand the architecture of the system described in this patent about aggregating and analyzing confidential data: https://patents.justia.com/patent/20180089196. The general ...
MikiBelavista's user avatar
0 votes
2 answers
102 views

How to automatically test the result of an ETL tool?

If an ETL tool is being used to move data from an OLTP database into a "business intelligence reporting" database, is there any standard way of automatically testing that the data in the reporting ...
binarylegit's user avatar
0 votes
1 answer
354 views

Data Integration Design Using Microsoft SSIS

I am working on a data integration project, where I need to extract data from an Oracle source and load it into an XML file. The requirement is to get the list of customers and for each customer create an XML ...
sab's user avatar
  • 109
3 votes
1 answer
407 views

What are some design ideas for a data mapping and transformation application?

Here is a high-level outline of the project: We frequently need to convert data from a new incoming system to our in-house system (sort of a basic ETL process). We would prefer to do this dynamically, ...
dpberry178's user avatar
1 vote
1 answer
124 views

What does 'data coverage' mean when talking about ETL processes?

I was watching this talk about ETL's shortcomings and the solutions provided by the Kafka platform but I don't quite understand what the speaker is referring to when she says ETL tools have been ...
Indaco789's user avatar
1 vote
0 answers
836 views

Is MapReduce a correct framework for Extract, Transform, Load of data?

EDIT I am working on a project to update a legacy ETL infrastructure that supports a number of clients, each with a slightly different setup. Constraints that cannot be changed: Source data can ...
Noah Goodrich's user avatar
2 votes
2 answers
405 views

Automating tests for ETL flow

I created an ETL that parses various files, transforms each file's data and then pushes the lines into a DB. Until now I did manual tests and checked that all the values were parsed correctly and all the lines (...
Michael's user avatar
  • 197
1 vote
2 answers
314 views

Data integration from heterogeneous sources

A client has requested that we build a platform for integrating data from partners to their central data store. This will not be "Big Data" scale. The data from each partner will be accessible through ...
Elad Lachmi's user avatar
1 vote
1 answer
1k views

How to implement ETL with MySQL?

I have a legacy MySQL database (A), and a new reviewed structure for a MySQL database (B). Problem number one is that the database has to stay alive and keep receiving data from legacy apps. What I need is ...
koalaok's user avatar
  • 513
3 votes
2 answers
253 views

Translate data between inconsistently-matched data structures

How can my program best represent a translation between imperfectly-matched data structures? I am tasked with a one-way translation of data from one system to another. Both systems are established, I ...
bignose's user avatar
  • 191
0 votes
1 answer
670 views

Exporting data to file share vs calling a web service to handle the export

I would like to hear opinions/best practices for the following scenario: I have an application A (C# app from a vendor that does not have an accessible database, it has a SQLite db but we don't have ...
TuSabesTuSabes's user avatar
0 votes
2 answers
53 views

What is a better approach to replicating data from one table to another: triggers or a third-party ETL tool?

We have multiple tables which we need to retrieve data from and dump to one centralized table. Currently what we are doing is running an ETL job made from Pentaho, retrieve the records from the source ...
chip's user avatar
  • 239
3 votes
1 answer
1k views

How are one or more aggregate functions implemented in most SQL engines?

In the book Database Fundamentals by Silberschatz, it is explained that aggregate functions can be calculated on the fly. This makes sense. What it means is that for calculating the maximum, average ...
jgomo3's user avatar
  • 336
5 votes
1 answer
15k views

Difference between ESB and ETL

When should an ESB vs. an ETL tool be used? I have worked on ESB projects using Tibco BusinessWorks quite a few years ago. The message bus that we built used to consume messages from a source system, ...
Punter Vicky's user avatar
2 votes
2 answers
1k views

Using ESB for database synchronisation / replication

We're starting to look at implementing an ESB / Microservices architecture. I (think) I know about the concepts, but there's one thing I don't seem to be able to get a good idea about: data ...
Gabriël's user avatar
  • 205
4 votes
2 answers
2k views

Is ReST useful in Read/Write Operations that involve over 100 Gig

I work in Healthcare and we use SAS to Extract and Transform medical and pharmacy claims data for use in downstream reporting applications. For a given Report Request (usually 40 are running at a time)...
Charlie Bastnagel's user avatar
0 votes
1 answer
359 views

Coordinating a complicated data migration process

A project I'm involved in has suffered a change in scope, and before I set about trying to cook up some homegrown solution, I'm wondering if there is something out there -- some framework, for example ...
Mario's user avatar
  • 151
4 votes
1 answer
3k views

What is the right way to process inconsistent data files?

I'm working at a company that uses Excel files to store product data, specifically, test results from products before they are shipped out. There are a few thousand spreadsheets with anywhere from 50-...
Tahabi's user avatar
  • 61
6 votes
2 answers
322 views

Disagreement Concerning Data Integration (I may not understand enterprise ETL tools)

I have been in an ongoing conversation concerning a project we are about to undertake at my place of work. The project concerns data integration. Our customers want to be able to integrate our data ...
Josh's user avatar
  • 321
1 vote
1 answer
192 views

Production or Custom Test Data for Unit Testing?

I've recently had a little disagreement with fellow developers. We're transforming various ontologies from the original source format (Pica+, RDF, etc) into our data format and have several converters ...
IAE's user avatar
  • 1,460
3 votes
2 answers
1k views

Data warehouse architecture for mutating schema

I am setting up an ETL process and small data warehouse for querying the data in a few different dimensions. One issue is that the schema for the objects can mutate over time - mainly that some fields ...
Rex M's user avatar
  • 231
3 votes
1 answer
221 views

Enterprise Wide Keys [closed]

I have for a long time been working on an ODS as well as Data Warehouse. Both are integrating a wide variety of data sources from stove pipe applications. One of the uses of the ODS is to provide ...
AaronLS's user avatar
  • 206
4 votes
1 answer
3k views

Designing a Content-Based ETL Process with .NET and SFDC

As my firm makes the transition to using SFDC as our main operational system, we've spun together a couple of SFDC portals where we can post customer-specific documents to be viewed at will. As such, ...
Patrick's user avatar
  • 165
