I have a Kubernetes deployment which is fielding expensive (but cacheable) requests, let's say a website scraping service (not really) which takes about 15 seconds to scrape a website. In my backend I also have a Postgres database running which I can use to cache request data. A typical pattern of requests to my service is a burst of requests to scan the same URL, which slowly tapers off. To be considerate of the website owner, I don't want every pod which receives a request to go off and scrape the website. I'd like to somehow ensure that the website is only scraped once for an arbitrary number of requests.
For the table schema, I was thinking of the following:
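-- Postgres has no inline ENUM(...) column syntax; the enum is created as its own type first.
CREATE TYPE scrape_status_enum AS ENUM ('in_progress', 'failed', 'completed');

CREATE TABLE url_scans (
    url VARCHAR(2048) PRIMARY KEY,
    scrape_status scrape_status_enum NOT NULL DEFAULT 'in_progress',
    insert_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    scrape_result JSONB
);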
When a scan request comes in, pods will first search the database to see if the result has been cached (row is marked 'completed'). If no result is cached, pods could do an insert on this table:
- On insert success, this pod would be designated the "scraper", and would perform the scraping operation, and update the row in the table accordingly.
- On insert failure, the pod is now in a "waiting" state (the same state it ends up in when it selects from the database and finds the row in the 'in_progress' state).
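A minimal sketch of the select-then-insert flow described above, assuming a psycopg2-style DB-API cursor. The function name `claim_or_wait` and the role strings are hypothetical. Instead of relying on the insert raising a unique-violation error, the sketch swaps in `INSERT ... ON CONFLICT DO NOTHING` (Postgres 9.5+) and checks the affected row count, which keeps the transaction usable when another pod wins the race. How a 'failed' row should be handled is not decided here; the sketch simply treats it like 'in_progress'.

```python
from typing import Any, Optional, Tuple

SELECT_SQL = "SELECT scrape_status, scrape_result FROM url_scans WHERE url = %s"
# ON CONFLICT DO NOTHING turns a duplicate-key error into "0 rows inserted".
CLAIM_SQL = "INSERT INTO url_scans (url) VALUES (%s) ON CONFLICT (url) DO NOTHING"


def claim_or_wait(cur: Any, url: str) -> Tuple[str, Optional[Any]]:
    """Return (role, cached_result); role is 'cached', 'scraper' or 'waiting'."""
    cur.execute(SELECT_SQL, (url,))
    row = cur.fetchone()
    if row is not None:
        status, result = row
        if status == "completed":
            return "cached", result
        # 'in_progress' (and, in this sketch, 'failed') rows belong to another pod.
        return "waiting", None

    # No row yet: try to claim the URL by inserting it.
    cur.execute(CLAIM_SQL, (url,))
    if cur.rowcount == 1:
        # Our insert created the row, so this pod is the scraper.
        return "scraper", None
    # Another pod inserted the row between our SELECT and our INSERT.
    return "waiting", None
```

The caller would commit after the insert so that other pods can see the claim before the 15-second scrape starts.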
In the database I am planning on defining a function as follows:
CREATE OR REPLACE FUNCTION notify_scrape_status_change()
RETURNS TRIGGER AS $$
BEGIN
IF NEW.scrape_status != OLD.scrape_status THEN
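-- A raw URL is not a valid unquoted channel name, so format() quotes it as an identifier (%I) and the message as a literal (%L). Identifiers are truncated to 63 bytes by default.
EXECUTE format('NOTIFY %I, %L', NEW.url, 'Scrape status changed to ' || NEW.scrape_status);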
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
With a corresponding trigger on the url_scans table:
CREATE TRIGGER scrape_status_change_trigger
AFTER UPDATE OF scrape_status ON url_scans
FOR EACH ROW EXECUTE FUNCTION notify_scrape_status_change();
The waiting pods will then LISTEN on a channel named after the URL to find out when the scrape of that URL is complete.
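One wrinkle with naming a channel after the URL: channel names are SQL identifiers, which Postgres limits to 63 bytes by default, while the url column allows up to 2048 characters. A possible workaround, sketched here as an assumption rather than a settled design, is to derive a short, fixed-length channel name from a hash of the URL. The helper name `channel_for` and the `scan_` prefix are hypothetical, and MD5 is used only for naming, not for security.

```python
import hashlib


def channel_for(url: str) -> str:
    """Map a URL of any length to a short channel name that is a valid identifier."""
    # 32 hex characters plus the prefix stays well under the 63-byte limit.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "scan_" + digest
```

For this to work, the trigger would have to notify on the same name, for example 'scan_' || md5(NEW.url), which should match the Python digest as long as the database encoding is UTF-8.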
However, I have three concerns which I cannot resolve.
- If a pod is killed in the middle of a scrape, what then? How do I pick the next "scraper"?
- If a pod selects from the database and finds the URL it is looking for in the in_progress state, it will proceed to listen on the channel for a transition to completion. But what happens if the URL transitions to complete between those two operations?
- With all these uniquely named NOTIFY channels, should I be concerned about memory usage? Does postgres automatically clear them out at some point?
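One possible way to approach the first two concerns, sketched under assumptions: a waiting pod subscribes first and only then re-reads the row, so a transition that lands between the initial SELECT and the LISTEN is caught by the re-read, and it gives up after a timeout so that a dead scraper does not block everyone forever. The function `wait_for_result`, its callable parameters, the 60-second default, and `TAKEOVER_SQL` are all hypothetical; the callables stand in for whatever the database driver provides for LISTEN, for selecting the row, and for blocking on notifications.

```python
import time
from typing import Callable, Optional

# Hypothetical takeover statement: the pod whose UPDATE affects one row
# becomes the new scraper; the timestamp bump makes other pods' attempts match nothing.
TAKEOVER_SQL = """
UPDATE url_scans
   SET insert_timestamp = now()
 WHERE url = %s
   AND scrape_status = 'in_progress'
   AND insert_timestamp < now() - %s * interval '1 second'
"""


def wait_for_result(
    listen: Callable[[], None],
    fetch_status: Callable[[], Optional[str]],
    wait_for_notify: Callable[[float], bool],
    timeout_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return 'completed', 'failed', or 'stale' if the deadline passes first."""
    # 1. Subscribe before checking, so no later notification can be missed.
    listen()
    deadline = clock() + timeout_s
    while True:
        # 2. Re-read the row after subscribing; this covers a status change
        #    that happened between the first SELECT and the LISTEN.
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            # 3. The scraper may have died; the caller can try TAKEOVER_SQL.
            return "stale"
        # Block until a notification arrives or the time runs out, then re-check.
        wait_for_notify(remaining)
```

On a 'stale' result, a pod could run something like `TAKEOVER_SQL` with the URL and the timeout in seconds, and treat an affected row count of 1 as having become the new scraper.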