
I have a Kubernetes deployment which is fielding expensive (but cacheable) requests, let's say a website scraping service (not really) which takes about 15 seconds to scrape a website. In my backend I also have a Postgres database running which I can use to cache request data. A typical pattern of requests to my service is a burst of requests to scan the same URL, which slowly tapers off. To be considerate of the website owner, I don't want every pod which receives a request to go off and scrape the website. I'd like to somehow ensure that the website is only scraped once, no matter how many requests arrive for it.

For the table schema, I was thinking of the following:

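-- Postgres has no inline ENUM(...) column type; the enum is created as a named type first.
CREATE TYPE scrape_status_t AS ENUM ('in_progress', 'failed', 'completed');

CREATE TABLE url_scans (
    url VARCHAR(2048) PRIMARY KEY,
    scrape_status scrape_status_t NOT NULL DEFAULT 'in_progress',
    insert_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    scrape_result JSONB
);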

When a scan request comes in, pods will first search the database to see if the result has been cached (row is marked 'completed'). If no result is cached, pods could do an insert on this table:

  • On insert success, this pod would be designated the "scraper", and would perform the scraping operation, and update the row in the table accordingly.
  • On insert failure, the pod is now in a "waiting" state (this is the same case as when a pod selects from the database and the row is in the 'in_progress' state).
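The insert-as-claim step described above can be sketched roughly as follows. This is only an illustration, not the asker's code: it uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere, and the names `make_demo_db` and `try_claim` are hypothetical. The primary key on `url` is what guarantees that only one insert succeeds.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the url_scans table (assumption: SQLite, not Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE url_scans ("
        " url TEXT PRIMARY KEY,"
        " scrape_status TEXT NOT NULL DEFAULT 'in_progress',"
        " scrape_result TEXT)"
    )
    return conn


def try_claim(conn: sqlite3.Connection, url: str) -> bool:
    """Return True if this caller inserted the row and is therefore the scraper."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO url_scans (url, scrape_status) VALUES (?, 'in_progress')",
                (url,),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key conflict: another caller already claimed this URL.
        return False


if __name__ == "__main__":
    db = make_demo_db()
    print(try_claim(db, "https://example.com"))  # first caller becomes the scraper
    print(try_claim(db, "https://example.com"))  # later callers become waiters
```

With a real Postgres driver the idea is the same: a unique-violation error (or an insert that reports zero affected rows when written with ON CONFLICT DO NOTHING) means the pod should wait instead of scraping.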

In the database I am planning on defining a function as follows:

CREATE OR REPLACE FUNCTION notify_scrape_status_change()
RETURNS TRIGGER AS $$
BEGIN
  IF NEW.scrape_status != OLD.scrape_status THEN
    -- Channel names are identifiers, so the URL is quoted with %I and the payload with %L.
    EXECUTE format('NOTIFY %I, %L', NEW.url, 'Scrape status changed to ' || NEW.scrape_status);
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

With a corresponding trigger on the table:

CREATE TRIGGER scrape_status_change_trigger
AFTER UPDATE OF scrape_status ON url_scans
FOR EACH ROW EXECUTE FUNCTION notify_scrape_status_change();

Now the listener pods will listen on a channel named after the URL for when the URL is complete.

However, I have three concerns which I cannot resolve.

  • If a pod is killed in the middle of a scrape, what then? How do I pick the next "scraper"?
  • If a pod selects from the database and finds the URL it is looking for in the in_progress state, it will proceed to listen on the channel for a transition to completion. But what happens if the URL transitions to complete between those two operations?
  • With all these uniquely named NOTIFY channels, should I be concerned about memory usage? Does Postgres automatically clear them out at some point?

1 Answer

  • if a pod is killed in the middle of a scrape... what then? How do I pick the next "scraper"
  • If a pod selects from the database and finds the URL it is looking for in the in_progress state, it will proceed to listen on the channel for a transition to completion. But what happens if the URL transitions to complete between those two operations?

Both of these can be handled with a timeout and retry mechanism.

  1. Have a cleanup process that searches the database for cache entries that were added more than X ago and are still in the status "in_progress". Remove those entries from the database.
  2. Have each pod that is waiting for a notification poll the database every Y to check whether the cache entry has become available (something went wrong with the notifications) or was deleted (the previous worker pod timed out and this one can become the new worker).
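As a rough illustration of step 1, the cleanup could look something like the sketch below. It again uses sqlite3 as a stand-in for Postgres and stores `insert_timestamp` as epoch seconds to keep the example small. The function names are made up for this example, and `max_age_seconds` plays the role of X.

```python
import sqlite3
import time
from typing import Optional


def make_demo_db_with_timestamps() -> sqlite3.Connection:
    """In-memory stand-in for url_scans (assumption: timestamps as epoch seconds)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE url_scans ("
        " url TEXT PRIMARY KEY,"
        " scrape_status TEXT NOT NULL DEFAULT 'in_progress',"
        " insert_timestamp REAL NOT NULL)"
    )
    return conn


def delete_stale_claims(
    conn: sqlite3.Connection,
    max_age_seconds: float,
    now: Optional[float] = None,
) -> int:
    """Delete 'in_progress' rows older than max_age_seconds; return how many were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    with conn:  # run the delete in one committed transaction
        cur = conn.execute(
            "DELETE FROM url_scans"
            " WHERE scrape_status = 'in_progress' AND insert_timestamp < ?",
            (cutoff,),
        )
    return cur.rowcount
```

Completed rows are left alone, so only abandoned claims are released. In Postgres itself the same condition would presumably compare `insert_timestamp` against the current time minus an interval.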

You can play around with the times X and Y to tune the behavior. Time X should be long enough that even a worst-case successful scraping attempt is finished within time X. Time Y should be long enough that the polling doesn't put a real load on the system and that most requests are handled through the NOTIFY channels before a pod starts polling.
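The waiting side (step 2) might be sketched like this. To keep it independent of any particular database driver, the two database interactions are passed in as callables. `fetch_status` and `wait_for_notification` are hypothetical hooks that a real pod would implement with a SELECT and a LISTEN with timeout, and `poll_interval` corresponds to Y.

```python
import time
from typing import Callable, Optional


def wait_for_result(
    fetch_status: Callable[[], Optional[str]],
    wait_for_notification: Callable[[float], bool],
    poll_interval: float,
    deadline: float,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return 'completed', 'failed', 'take_over', or 'timed_out'.

    fetch_status returns the row's status, or None if the row no longer exists.
    wait_for_notification blocks for at most the given number of seconds.
    """
    while True:
        status = fetch_status()
        if status is None:
            # Row was removed by the cleanup process: the caller may try to claim it.
            return "take_over"
        if status in ("completed", "failed"):
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            return "timed_out"
        # Whether or not a notification arrives, the loop re-checks the database,
        # so a missed NOTIFY only costs at most one poll interval.
        wait_for_notification(min(poll_interval, remaining))
```

Because the status is re-read after every wait, a notification that fired between the initial SELECT and the LISTEN is picked up on the next poll at the latest.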

  • With all these uniquely named NOTIFY channels, should I be concerned about memory usage? Does postgres automatically clear them out at some point?

I am not familiar enough with Postgres to say how it deals with NOTIFY channels that nobody listens to. Beyond that, I see nothing that would trigger the database engine to clean up NOTIFY channels, so I would assume that you have to do that manually.
