
Let's say I have a folder called logs that contains N subfolders. Each subfolder holds the events for one specific event type and contains N .log files, where each file has multiple lines of JSON (one JSON object per line).

Example:

event1.1.log

{"id":1, "name": "ABCD"}
{"id":2, "name": "EFGH"}
{"id":5, "name": "IJKL"}
{"id":7, "name": "MNOP"}

event1.2.log

{"id":3, "name": "ABCD"}
{"id":4, "name": "EFGH"}
{"id":6, "name": "IFKL"}
{"id":8, "name": "ABED"}

Each event type can have its own structure, but it's guaranteed that every log line for the same event type always has the same structure.

Now, I need a way to run ad hoc queries on these files: get a list of students, get the top ten students, etc.

I thought of loading them into a temporary table and then running queries against it, but I was wondering whether there is another way to do this.
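For what it's worth, here is a minimal sketch of that temporary-table idea using SQLite from Python. The folder layout, table name, and columns (id, name) are only illustrative and would need to match the real event structure; the point is that the database file, once built, can be re-queried later without re-parsing the logs.

import glob
import json
import sqlite3

# Build (or reuse) an on-disk database so later queries don't re-parse the logs.
con = sqlite3.connect("events.db")
con.execute("CREATE TABLE IF NOT EXISTS event1 (id INTEGER, name TEXT)")

# Load every log line of one event type; each line is a single JSON object.
for path in glob.glob("logs/event1/*.log"):
    with open(path) as f:
        rows = (json.loads(line) for line in f if line.strip())
        con.executemany("INSERT INTO event1 (id, name) VALUES (:id, :name)", rows)
con.commit()

# Ad hoc queries are then plain SQL, e.g. a "top ten" query.
top_ten = con.execute("SELECT id, name FROM event1 ORDER BY id DESC LIMIT 10").fetchall()
print(top_ten)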

I could write an application that parses the files in memory, but the amount of data could be too large to hold and compute on in memory. And every time I want to run a different query on the same dataset within the next few days, it would have to parse all the files into memory again.
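(To be fair, an application doesn't necessarily have to hold the whole dataset at once; the files can be streamed line by line and only the running result kept. A rough sketch of the "top ten" case, with illustrative paths and field names:)

import glob
import heapq
import json

def records(pattern):
    # Yield one parsed JSON object at a time; nothing is kept in memory
    # beyond the current line and the running result.
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

# Top ten records by id without materializing the whole dataset.
top_ten = heapq.nlargest(10, records("logs/event1/*.log"), key=lambda r: r["id"])
print(top_ten)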

Any approaches on this?

  • There are tools such as jq, but it only supports comparatively simple queries.
    – amon
    Commented May 9, 2021 at 21:19
  • What amount of data do you consider "huge"? Is reading the data into memory really not an option? For ad hoc queries, flexibility is often more important than performance, so some Python code that reads the files, performs the selection and sorting, and writes out results would most likely be the simplest solution. Commented May 10, 2021 at 12:51

1 Answer


You want to upload these to a "query-time schema" (schema-on-read) database such as Splunk, or use the ELK stack if the data has some structure.

https://aws.amazon.com/elasticsearch-service/the-elk-stack/#:~:text=The%20ELK%20stack%20is%20an,Elasticsearch%2C%20Logstash%2C%20and%20Kibana.
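As a rough illustration of the Elasticsearch route, the log lines can be bulk-indexed and then queried through Kibana or the search API. This is only a sketch using the Python client (elasticsearch-py); the host, index name, and sort field are assumptions, and the exact keyword arguments can vary between client versions.

import glob
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def actions(pattern, index):
    # One bulk action per JSON log line.
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                if line.strip():
                    yield {"_index": index, "_source": json.loads(line)}

helpers.bulk(es, actions("logs/event1/*.log", "event1"))

# Example ad hoc query: top ten documents sorted by id.
hits = es.search(index="event1", sort=[{"id": "desc"}], size=10)
print(hits["hits"]["hits"])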
