Questions tagged [big-data]

The tag has no usage guidance.

2 votes
1 answer
149 views

What is an optimal system design for tracking product views per user that is scalable?

I have a web application that contains products and users. There are 10,000+ products and 100,000+ users, to give a sense of the scale that's required. For some application-specific reasons, I need to ...
kitkat's user avatar
  • 29
0 votes
1 answer
63 views

Data file ingestion with minio and kafka

I want to collect a lot of files (file data + metadata) from local servers to a central server. The files are important, so I need to ensure that none are lost. Local servers: implement a collector to ...
kietheros's user avatar
  • 109
3 votes
1 answer
805 views

How to store a huge volume of time-series datapoints in an efficient way?

We have an application producing 5k-10k datapoints per second. Each datapoint has more than one metric, alongside its time of creation. We are looking for an efficient, scalable way to store this huge ...
Paul Benn's user avatar
  • 147
5 votes
1 answer
1k views

How do you perform accumulation on large data sets and pass the results as a response to REST API?

I have around 125 million event records on s3. The s3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: ...
Namah's user avatar
  • 61
1 vote
0 answers
393 views

How to (simply) architect a way to ingest multiple types of large files, process them, and send data in chunks to web services?

Note: All of this would be in AWS. Hi everyone, what would you suggest for building something that takes in several different input file types (e.g. csv, json, jsonl, xml, .gz, ...), that can be ...
0 votes
0 answers
58 views

Should aggregated data include meta data?

I want to create an aggregation job that executes a big DB query and flushes the result into BigQuery. My question is: should I include only the IDs of the entities (campaign id, advertiser id, user id) or should ...
Avi L's user avatar
  • 109
0 votes
1 answer
92 views

A program design question: is it a good idea to use HDFS in C for reading large data?

I have three main groups of CSV files (each file is divided into several small files). The first group of CSV files totals 600+ GB (maybe 200+ GB if stored as int, because CSV size is counted per character, right?), ...
heisthere's user avatar
  • 101
2 votes
2 answers
3k views

From Oracle to Apache Parquet: how to handle eventual consistency?

I have an existing production Oracle database. However, there are performance issues for certain kinds of operations, because of the volume of the data or the complexity of the queries. That's why I ...
Klun's user avatar
  • 31
1 vote
1 answer
762 views

Load for Date dimension table of a warehouse

I have a general question about loading data into a data warehouse (DW). This is basically a follow-up to an older question of mine. I have trouble understanding how to fill the [Date] ...
Steffen Mangold's user avatar
3 votes
2 answers
152 views

Enterprise application warehousing and relational database

I have a general question about design patterns for an enterprise application. I have read a lot about it, but it's actually hard to find an answer because most of what you find is rather about how to design a data ...
Steffen Mangold's user avatar
3 votes
2 answers
1k views

Aggregation and storage system design for user event processing?

I have an eCommerce-like system which produces 5000 user events (of different kinds, like product search/product view/profile view) per second. Now, for reporting, business users would like to view the ...
M Sach's user avatar
  • 267
1 vote
3 answers
291 views

Query 30 million HTML documents

I have 30-ish million HTML documents in a file system. There is no emergency: the files are in a reasonable directory tree and it's not breaking the file system. But I'd like to be able to organize and ...
Martin K's user avatar
  • 2,917
0 votes
0 answers
84 views

Generating a fake number for a 25-digit PII number in a file containing millions of rows

I have to expose some sensitive data containing a PII column that has a 25-digit number. The rest of the columns aren't PII data. This is done so that the data can be safely shared with the larger ...
stormfield's user avatar
2 votes
0 answers
27 views

How to design a report processing model using Spark in the most efficient way

I have a reporting system which gets time-series data from numerous meters (referred to here as raw_data). I need to generate several reports based on different combinations of the incoming ...
Remis Haroon - رامز's user avatar
2 votes
2 answers
1k views

Designing a big data web app

How do you design a website that allows users to query a large amount of user data? More specifically: there are ~100 million users with ~100 TB of data; data is stored in HDFS (not a database); number ...
Minh Thai's user avatar
  • 141