
Question

What is a good storage layer, coding paradigm, and query language for computing over realtime data?

Use-case

For example, stock and options prices are (essentially) realtime data streams. I want to load that data into my storage layer every N ms, compute a set of strategies from the data, and surface strategies that meet some criteria (price, transactions, delta, any attribute, etc.) so that I can manually inspect and validate. The transformation of data into strategies and strategy filtering should happen each time the underlying data changes so that my view doesn't become stale--almost like a dashboard.

I'm looking for generic tools and not something bespoke to financial data. Potentially another use-case would be for evaluating next moves in a board game with discrete actions (not an RTS). For example, imagine you're playing a game of chess and it's your opponent's turn. During your opponent's turn, assume they continuously signal to you the move they're currently considering / calculating over. In this case, the move is stored/written somewhere, once written a process computes the world of possible next moves you could take, calculates a heuristic function for each, filters for highly valued moves, and presents the top X. Each time the opponent's considered move changes (which you can see / receive the data for), the process reacts (reactive paradigm) and updates the candidate move heuristic and selection function (declarative query language).

Thoughts

Naively I think I can have some process that regularly polls for the data, writes to a relational db, and I run a query against the db with SQL. However, although I'm not deeply familiar with all of SQL's capabilities, I don't think SQL results will update in reaction to the next db write--so the results will become stale. Nonetheless, I'm looking for something like SQL because I want to declaratively specify my criteria. I want to be able to say, "select all the aggregations of the data such that X". I don't want to imperatively and explicitly for-loop through all the data. (As an aside, what I'm looking for reminds me of something like a logic programming language in which I only need to describe the output in terms of constraints.)
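The naive poll-write-query loop described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the `prices` table, the threshold criterion, and the function names are all hypothetical stand-ins, not a recommendation:

```python
import sqlite3

# In-memory database standing in for the relational storage layer.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prices (symbol TEXT, price REAL)")

def poll_and_write(rows):
    """Simulate one polling cycle: load the latest data into storage."""
    db.executemany("INSERT INTO prices VALUES (?, ?)", rows)
    db.commit()

def query_strategies(min_avg):
    """Declaratively filter: symbols whose average price exceeds a threshold."""
    return db.execute(
        "SELECT symbol, AVG(price) FROM prices "
        "GROUP BY symbol HAVING AVG(price) > ?",
        (min_avg,),
    ).fetchall()

poll_and_write([("AAPL", 150.0), ("AAPL", 152.0), ("XYZ", 10.0)])
print(query_strategies(100.0))  # → [('AAPL', 151.0)]
```

The staleness problem is visible here: `query_strategies` must be re-run by hand after every `poll_and_write`; nothing pushes updated results to the caller when the data changes.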

Re: paradigm, I'm imagining something like reactive programming that will execute the transformation function on the data each time the data changes. Something that abstracts away any polling against the database.

Re: storage layer, I assumed that a relational database would be best because I was imagining a query language like SQL (or some other declarative lang) but perhaps there's another database type that I'm not familiar with.

Recap

To reiterate, what are the right generic tools that compose a system design to retrieve live query results over realtime data?

  • Hi geofflittle. IMHO this question misses a very central recommendation on our don't ask page: "Your questions should be reasonably scoped. If you can imagine an entire book that answers your question, you're asking too much." Moreover, you won't find "one right set of generic tools" for this domain. There are way too many tools and possibilities to pick from, all with different tradeoffs. So I question your whole approach to this. ...
    – Doc Brown
    Commented Oct 22, 2022 at 9:00
  • ... here is a better one: pick one restricted real-world task, like your "stock and options" example, find a data source where you can get the data from and try to implement a small prototype, using some tools you already know or want to learn. Then see where it takes you, where obstacles occur and refine your technological choices from that experience.
    – Doc Brown
    Commented Oct 22, 2022 at 9:07

1 Answer


OK so, the main issues you have to deal with in these types of problems are:

  1. Too much data
  2. Incoming data arriving faster than your computation can run

Let's take a simple example: your stock calculation works out a rolling average of the price over the last month.

This is fairly simple to do in SQL: SELECT avg(price) FROM stock WHERE date > now() - INTERVAL '30 days'

However, suppose you have one price point per second: you are then writing new data to that database every second, multiplied by the number of stocks. The volume of data the query needs to iterate over becomes too high for the calculation to be performed quickly, and it will lock out new writes while it's running.

To solve these issues you have to:

  1. move the calculation to a different machine from your database
  2. keep the values you need for the calculation in memory; don't fetch the full set each time

So instead of our SQL database, we use a push messaging system such as a message queue.

  • As prices come in, they are separated by stock symbol and sent to a calculation machine for that symbol.

  • The calculation machine keeps track of the last 10,000 prices it has seen.

  • When a new price comes in, you add it to the list, forget the oldest value, and recalculate.

  • A different machine also receives the price and saves it to a database.
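The in-memory calculation machine described in the steps above can be sketched like this. A minimal sketch in Python, assuming a fixed-size window per symbol and an incremental running total; the class and method names are hypothetical:

```python
from collections import deque

class RollingAverage:
    """Keeps the last N prices for one symbol in memory and updates the
    average incrementally, instead of re-reading the full history."""

    def __init__(self, window=10_000):
        self.prices = deque(maxlen=window)
        self.total = 0.0

    def on_price(self, price):
        # Forget the oldest value once the window is full.
        if len(self.prices) == self.prices.maxlen:
            self.total -= self.prices[0]
        self.prices.append(price)  # deque drops the oldest automatically
        self.total += price
        return self.total / len(self.prices)

calc = RollingAverage(window=3)
for p in [10.0, 20.0, 30.0, 40.0]:
    avg = calc.on_price(p)
print(avg)  # average of the last 3 prices: (20 + 30 + 40) / 3 = 30.0
```

Each incoming message costs O(1) work and memory stays bounded by the window size, which is what makes per-second updates feasible where the full-table SQL scan was not.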

In doing this we have split the problem over several computers, which gives us more scope to scale up the computational power we can throw at any given algorithm.

You can also see how a complex algorithm could be split into several sub-problems, i.e. the rolling-average calculation might push its result to another queue, which is in turn consumed by a second calculation.
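That chaining idea can be sketched with in-process queues standing in for a real message broker. The stage functions, window size, and threshold here are all hypothetical, and a real system would run each stage as a long-lived consumer rather than draining the queues once:

```python
from queue import Queue

prices, averages, signals = Queue(), Queue(), Queue()

def averaging_stage():
    """First consumer: rolling average over the last 3 prices,
    pushing each result onto the next queue."""
    window = []
    while not prices.empty():
        window.append(prices.get())
        window = window[-3:]  # keep only the last 3 prices
        averages.put(sum(window) / len(window))

def threshold_stage():
    """Second consumer: filters averages against a criterion."""
    while not averages.empty():
        avg = averages.get()
        if avg > 15.0:
            signals.put(avg)

for p in [10.0, 20.0, 30.0]:
    prices.put(p)
averaging_stage()
threshold_stage()
print(list(signals.queue))  # → [20.0]
```

Because each stage only talks to its input and output queues, stages can be moved to separate machines or scaled independently without changing the logic.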
