Questions tagged [big-data]
The big-data tag has no usage guidance.
75 questions
2 votes, 1 answer, 149 views
What is an optimal system design for tracking product views per user that is scalable?
I have a web application that contains products and users. To give a sense of the required scale, there are 10,000+ products and 100,000+ users.
For some application-specific reasons, I need to ...
0 votes, 1 answer, 63 views
Data file ingestion with MinIO and Kafka
I want to collect a lot of files (file data + metadata) from local servers to a central server. The files are important, so I need to ensure that none are lost.
Local servers: implement a collector to ...
3 votes, 1 answer, 805 views
How to store a huge volume of time-series datapoints in an efficient way?
We have an application producing 5k-10k datapoints per second. Each datapoint has more than one metric, alongside its time of creation.
We are looking for an efficient, scalable way to store this huge ...
5 votes, 1 answer, 1k views
How do you perform accumulation on large data sets and pass the results as a response to REST API?
I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: ...
1 vote, 0 answers, 393 views
How to (simply) architect a way to ingest multiple types of large files, process them, and send data in chunks to web services?
Note: All of this would be in AWS
Hi everyone,
What would you guys suggest for building something that:
Takes in several different input file types (e.g. csv, json, jsonl, xml, .gz, ...)
That can be ...
0 votes, 0 answers, 58 views
Should aggregated data include metadata?
I want to create an aggregation job that executes a big DB query and flushes the result into BigQuery.
My question is: should I include only the IDs of the entities (campaign id, advertiser id, user id) or should ...
0 votes, 1 answer, 92 views
A program design question: is it a good idea to use HDFS in C for reading large data?
I have mainly three groups of CSV files (each file is divided into several small files): the first group of CSV files has 600+ GB in total (maybe 200+ GB if stored as int, because CSV counts by char, right?), ...
2 votes, 2 answers, 3k views
From Oracle to Apache Parquet: how to handle eventual consistency?
I have an existing production Oracle database. However, there are performance issues for certain kinds of operations, because of the volume of the data or the complexity of the queries.
That's why I ...
1 vote, 1 answer, 762 views
Load for Date dimension table of a warehouse
I have a general question about loading data into a data warehouse (DW).
This is basically a follow-up to an older question of mine.
I have a general problem understanding how to fill the [Date] ...
3 votes, 2 answers, 152 views
Enterprise application warehousing and relational database
I have a general question about design patterns for an enterprise application.
I have read a lot about it, but it's actually hard to find an answer because most of what you find is rather about how to design a data ...
3 votes, 2 answers, 1k views
Aggregation and storage system design for user event processing?
I have an eCommerce-like system which produces 5000 user events (of different kinds, like product search/product view/profile view) per second.
Now, for reporting, business users would like to view the ...
1 vote, 3 answers, 291 views
Query 30 million HTML documents
I have 30-ish million HTML documents in a file system. There is no emergency: the files are in a reasonable directory tree, and it's not breaking the file system. But I'd like to be able to organize and ...
0 votes, 0 answers, 84 views
Generating a fake number for a 25-digit PII number in a file containing millions of rows
I have to expose some sensitive data containing a PII column that holds a 25-digit number. The rest of the columns aren't PII. This is done so that the data can be safely shared with the larger ...
2 votes, 0 answers, 27 views
How to design a report processing model using Spark in the most efficient way
I have a reporting system which gets time-series data from numerous meters (referred to here as raw_data).
I need to generate several reports based on different combinations of the incoming ...
2 votes, 2 answers, 1k views
Designing a big data web app
How do you design a website that allows users to query a large amount of user data, more specifically:
there are ~100 million users with ~100TB of data, data is stored in HDFS (not a database)
number ...