
I have a web application that contains products and users. There are 10,000+ products and 100,000+ users, to give a sense of the scale required.

For some application-specific reasons, I need to track which products have been seen by which users. For example, ProductId1 was seen by User1, but ProductId3000 was not.

I'm thinking of a solution using MySQL where I would add another table called "SeenProducts" with two columns: ProductId and UserId. Any time a user sees a product, I would insert a new row into this table. To find a product a user has not seen, I would then look for the first ProductId in my Product table that has no matching row in the SeenProducts table for that user.

However, I'm worried about the scalability of running this query many times (50+) per user session. Would MySQL tables be fast and scalable enough for this task? What would be an optimal way to achieve this requirement?

Edit: Updating this post to add that this data will be used to display a subset (~10) of products that the user has not been shown before. So it is a user-facing feature rather than analytics. The database "reads" happen to answer the question "What are 10 products that this user has not seen before?", and this query runs during the user's session to show those products.
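To make the access pattern concrete, here is a minimal sketch of the design I have in mind, using Python's built-in sqlite3 as a stand-in for MySQL (the table and column names follow my description above; the anti-join with LIMIT 10 is one assumed way to phrase the read):

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the schema mirrors the proposed design.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Product (ProductId INTEGER PRIMARY KEY);
CREATE TABLE SeenProducts (
    UserId    INTEGER NOT NULL,
    ProductId INTEGER NOT NULL,
    PRIMARY KEY (UserId, ProductId)
);
""")
db.executemany("INSERT INTO Product VALUES (?)", [(i,) for i in range(1, 21)])
# Example data: user 1 has already seen products 1-5.
db.executemany("INSERT INTO SeenProducts VALUES (1, ?)", [(i,) for i in range(1, 6)])

# "What are 10 products this user has not seen before?" as an anti-join.
unseen = db.execute("""
    SELECT p.ProductId
    FROM Product p
    WHERE NOT EXISTS (
        SELECT 1 FROM SeenProducts s
        WHERE s.UserId = ? AND s.ProductId = p.ProductId
    )
    LIMIT 10
""", (1,)).fetchall()
print([r[0] for r in unseen])
```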

  • If real time is not necessary, and the potential loss of a small amount of data is acceptable, then just let the server buffer that data for some time and push it to the database all in one go. You might also consider some other database to store it, for example a NoSQL store like ScyllaDB.
    – freakish
    Commented Jun 10 at 5:37
  • Asking strangers on the internet will not get you much further. The only way to get reliable information here is to implement a small proof of concept on the hardware available for your budget, generate some fake data, and run measurements for realistic access patterns (reads as well as writes). Don't forget proper indexing. The results will also depend on how many users you expect to be online at the same time; if that becomes a bottleneck, read up on horizontal sharding.
    – Doc Brown
    Commented Jun 10 at 5:46
  • What's optimal depends heavily on how the data is going to be read. There's a world of difference between "the only query I ever need to make is to work out if user X has seen product Y" and "I want to run arbitrary analytic queries on this data". Commented Jun 10 at 8:45
  • I have a draft answer written, but I don't think it is what you are aiming for. You are asking about tracking product views, which to me seems like "analytics", but how will this information be used? Can you edit your question to include that? Commented Jun 10 at 15:19
  • Edited my question to answer this feedback!
    – kitkat
    Commented Jun 10 at 17:02

1 Answer


add another table called "Seen Products" having two columns

Yes! That's perfect; it is the standard design solution.

ProductId and UserId

Well, I would amend that to be UserId and ProductId. Why? Think about the output of EXPLAIN SELECT ....

We'll be asking about unseen products for a specific user ID, so we don't want a query plan that requires a table (or index) scan to answer that. We want to zip right to the portion of the on-disk data structure that can concisely answer our question. Create a composite PK in that order, FTW. Yes, the order really does matter; it affects the query plan.
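Here's a sketch of that seek-vs-scan difference, with sqlite3 standing in for MySQL (MySQL's EXPLAIN output looks different, but the principle carries over): with PRIMARY KEY (UserId, ProductId), a lookup by user is an index search, not a full scan.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Composite PK with UserId first, because every query filters by user.
db.execute("""
    CREATE TABLE SeenProducts (
        UserId    INTEGER NOT NULL,
        ProductId INTEGER NOT NULL,
        PRIMARY KEY (UserId, ProductId)
    ) WITHOUT ROWID
""")
db.executemany("INSERT INTO SeenProducts VALUES (?, ?)",
               [(u, p) for u in range(1, 4) for p in range(1, 4)])

# The planner can seek straight to one user's contiguous rows via the PK.
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT ProductId FROM SeenProducts WHERE UserId = ?", (2,)
).fetchall()
print(plan)  # reports a SEARCH using the primary key

seen = db.execute(
    "SELECT ProductId FROM SeenProducts WHERE UserId = ? ORDER BY ProductId", (2,)
).fetchall()
print([p for (p,) in seen])
```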


If someone views e.g. four products in a single page view, definitely bundle up four INSERTs in a single COMMIT. That's much more efficient than submitting four separate transactions.
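A sketch of that batched write, again with sqlite3 standing in for MySQL (with MySQL the same pattern applies through your driver or ORM):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE SeenProducts (
    UserId INTEGER NOT NULL, ProductId INTEGER NOT NULL,
    PRIMARY KEY (UserId, ProductId))""")

viewed = [101, 102, 103, 104]  # products seen in one page view (example data)

# One transaction: all four rows commit together instead of in four round trips.
with db:  # the connection as a context manager wraps this in BEGIN ... COMMIT
    db.executemany("INSERT INTO SeenProducts (UserId, ProductId) VALUES (?, ?)",
                   [(1, pid) for pid in viewed])

count = db.execute("SELECT COUNT(*) FROM SeenProducts").fetchone()[0]
print(count)  # 4
```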

Consider using a write-through RAM cache, so you can delay writing a user's "seen" products for several page views, and so you can maybe put several users' updates into a single transaction. We will want a load balancer that hashes on client address when picking an origin webserver, in order for this to work smoothly. Then a given user is typically routed to a single server.
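A minimal sketch of that buffering idea (the SeenBuffer name and its parameters are made up for illustration; a production version would also need thread-safety and a decision about what happens to buffered rows on a crash):

```python
import sqlite3

class SeenBuffer:
    """Buffers (user, product) pairs in RAM and flushes them in one transaction.

    Illustrative sketch only: not thread-safe, and buffered rows are lost
    if the process dies before flush() runs.
    """
    def __init__(self, db, flush_every=8):
        self.db = db
        self.flush_every = flush_every
        self.pending = set()

    def mark_seen(self, user_id, product_id):
        self.pending.add((user_id, product_id))
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        with self.db:  # one COMMIT for the whole batch, possibly many users
            self.db.executemany(
                "INSERT OR IGNORE INTO SeenProducts (UserId, ProductId) VALUES (?, ?)",
                sorted(self.pending))
        self.pending.clear()

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE SeenProducts (
    UserId INTEGER NOT NULL, ProductId INTEGER NOT NULL,
    PRIMARY KEY (UserId, ProductId))""")

buf = SeenBuffer(db, flush_every=3)
buf.mark_seen(1, 10)
buf.mark_seen(1, 11)   # still buffered: nothing written yet
buf.mark_seen(2, 10)   # third pair triggers a flush
count = db.execute("SELECT COUNT(*) FROM SeenProducts").fetchone()[0]
print(count)  # 3
```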


A two-column PK on a two-column table suffices for your use case. It models a mathematical set.

But perhaps the true use case is slightly more complex than what you described. Minimally, tack on a timestamp column. This will help you judge relevance when a user has seen a dozen or more products. And it will prove invaluable to a midnight cron job that prunes month-old records, so we don't retain "seen" rows until the end of time.
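A sketch of the timestamped row and the pruning job (sqlite3 standing in for MySQL; in MySQL the column might be a TIMESTAMP with DEFAULT CURRENT_TIMESTAMP, and cron would run the DELETE nightly; dates below are example data):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE SeenProducts (
    UserId    INTEGER NOT NULL,
    ProductId INTEGER NOT NULL,
    SeenAt    TEXT    NOT NULL,   -- ISO-8601 UTC timestamp
    PRIMARY KEY (UserId, ProductId))""")

rows = [(1, 10, "2025-05-01 12:00:00"),   # more than a month old
        (1, 11, "2025-06-09 08:30:00")]   # recent
db.executemany("INSERT INTO SeenProducts VALUES (?, ?, ?)", rows)

# Nightly prune: drop "seen" records older than the cutoff.
db.execute("DELETE FROM SeenProducts WHERE SeenAt < ?", ("2025-05-10 00:00:00",))
remaining = db.execute("SELECT ProductId FROM SeenProducts").fetchall()
print([p for (p,) in remaining])  # [11]
```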

Whether you choose to interpret the timestamp as created or updated is up to you. You might even have sort of a lazy updated stamp which you'll only re-write if it's more than K hours old, in order to cut down on useless writes.

Some folks might wish to have a count column, as well.
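If you do add a count, an upsert keeps it current in one statement. A sketch using SQLite's ON CONFLICT ... DO UPDATE syntax (MySQL's equivalent is INSERT ... ON DUPLICATE KEY UPDATE; the ViewCount name is just for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE SeenProducts (
    UserId    INTEGER NOT NULL,
    ProductId INTEGER NOT NULL,
    ViewCount INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (UserId, ProductId))""")

def record_view(user_id, product_id):
    # Insert on first view; bump the counter on repeat views.
    db.execute("""
        INSERT INTO SeenProducts (UserId, ProductId) VALUES (?, ?)
        ON CONFLICT (UserId, ProductId)
        DO UPDATE SET ViewCount = ViewCount + 1
    """, (user_id, product_id))

record_view(1, 10)
record_view(1, 10)  # second view of the same product
count = db.execute(
    "SELECT ViewCount FROM SeenProducts WHERE UserId = 1 AND ProductId = 10"
).fetchone()[0]
print(count)  # 2
```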

A relational database is a terrific repository for such information, and it might be all you need. If you want lower latency and are reluctant to pay the ACID tax, front-ending with a Valkey / Redis cache might be attractive. Buffering writes via a Kafka channel might help reduce user-perceived interactive latencies.
