92

I have a PHP script that takes a long time (5-30 minutes) to complete. Just in case it matters, the script is using curl to scrape data from another server. This is the reason it's taking so long; it has to wait for each page to load before processing it and moving to the next.

I want to be able to initiate the script and let it be until it's done, which will set a flag in a database table.

What I need to know is how to be able to end the http request before the script is finished running. Also, is a php script the best way to do this?

2
  • 1
    Although you did not mention it in the languages supported by your server, I'm gonna guess that if you have the ability to run Ruby and Perl, you probably could add Node.js, and this sounds to me like a perfect use case for JavaScript: your script will spend most of its time waiting for requests to complete, which is an area the async paradigm excels in. No threads means easy synchronization, concurrency means speed.
    – djfm
    Commented Jan 24, 2015 at 10:26
  • You can do this with PHP. I would use Goutte and Guzzle to implement concurrent requests. You can also take a look at Gearman to launch parallel requests in the form of workers. Commented Jan 17, 2017 at 20:22

16 Answers

125

Update +12 years - Security Note

While this is still a good way to invoke a long-running bit of code, it is good for security to limit or even disable the ability of the PHP running in the webserver to launch other executables. And since this decouples the behaviour of the long-running thing from that which started it, in many cases it may be more appropriate to use a daemon or a cron job.

Original Answer

Certainly it can be done with PHP, however you should NOT do this as a background task - the new process has to be dissociated from the process group where it is initiated.

Since people keep giving the same wrong answer to this FAQ, I've written a fuller answer here:

http://symcbean.blogspot.com/2010/02/php-and-long-running-processes.html

From the comments:

The short version is shell_exec('echo /usr/bin/php -q longThing.php | at now'); but the reasons why are a bit long for inclusion here.
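For illustration only, here is a minimal sketch of how that one-liner might be wired up. The longThing.php contents, the jobs table, its done column and the PDO connection details are all invented for the example and not part of the answer:

    <?php
    // kickoff.php - the web request returns immediately; at(1) runs the job detached.
    // Note: in practice you will likely need the full path to longThing.php (see the
    // comments below about PATH, /etc/at.allow, etc.).
    shell_exec('echo /usr/bin/php -q longThing.php | at now');
    echo 'Scrape queued; check back later.';

    <?php
    // longThing.php - runs under at(1), dissociated from the web server process group.
    // ... curl-based scraping loop goes here (5-30 minutes) ...

    // When finished, set the flag mentioned in the question (hypothetical table/column).
    $db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');
    $db->exec('UPDATE jobs SET done = 1 WHERE id = 1');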

15
  • 3
    Any chance of copying the relevant details into the answer? There are too many old answers that link to dead blogs. That blog isn't dead (yet) but will be one day.
    – Murphy
    Commented Jun 2, 2015 at 14:22
  • 5
    The short version is shell_exec('echo /usr/bin/php -q longThing.php | at now'); but the reasons why are a bit long for inclusion here.
    – symcbean
    Commented Jun 2, 2015 at 14:44
  • 2
    Works with this at the end though: > /dev/null 2>&1 &
    – BSUK
    Commented Apr 22, 2017 at 2:50
  • 2
    may I know what this -q option is doing?
    – Kiren S
    Commented Feb 14, 2019 at 7:08
  • 2
    More discussion at the linked article - user needs to be included in /etc/at.allow, selinux/apparmor needs specific considerations, won't work if the PHP is running chroot, permissions must allow for running of at and the target process, the syntax of the implementation depends on the path being set correctly.
    – symcbean
    Commented Feb 17, 2019 at 22:23
11

The quick and dirty way would be to use the ignore_user_abort function in PHP. This basically says: don't care what the user does, run this script until it is finished. This is somewhat dangerous if it is a public-facing site (because it is possible that you end up having 20+ copies of the script running at the same time if it is initiated 20 times).

The "clean" way (at least IMHO) is to set a flag (in the db for example) when you want to initiate the process and run a cronjob every hour (or so) to check if that flag is set. If it IS set, the long running script starts; if it is NOT set, nothing happens.
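A minimal sketch of that cron-driven variant, with an invented flags table and connection details purely for illustration:

    <?php
    // check_flag.php - run from cron, e.g. "0 * * * * /usr/bin/php /path/to/check_flag.php"
    // Assumes a hypothetical `flags` table with `name` and `value` columns.
    $db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');

    $flag = $db->query("SELECT value FROM flags WHERE name = 'run_scraper'")->fetchColumn();

    if ($flag) {
        // Clear the flag first so the next cron run does not start a second copy.
        $db->exec("UPDATE flags SET value = 0 WHERE name = 'run_scraper'");

        // ... the long-running scraping work goes here ...
    }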

3
  • So the "ignore_user_abort" method would allow the user to close the browser window, but is there something I could do to have it return an HTTP response to the client before it is finished running?
    – kbanman
    Commented Feb 6, 2010 at 16:07
  • 1
    @kbanman Yep. You need to close the connection: header("Connection: close", true);. And don't forget to flush()
    – Benubird
    Commented Jul 10, 2013 at 14:57
  • This answer should be combined with set_time_limit(0) to keep the script from timing out (or use set_time_limit with an appropriate timeout value). Commented Jan 18 at 5:20
8

You could use exec or system to start a background job, and then do the work in that.

Also, there are better approaches to scraping the web than the one you're using. You could use a threaded approach (multiple threads doing one page at a time), or one using an event loop (one thread doing multiple pages at a time). My personal approach, using Perl, would be AnyEvent::HTTP.

ETA: symcbean explained how to detach the background process properly here.

1
  • 6
    Nearly right. Just using exec or system will come back to bite you on the ass. See my reply for details.
    – symcbean
    Commented Feb 6, 2010 at 23:33
6

Yes, you can do it in PHP. But in addition to PHP it would be wise to use a Queue Manager. Here's the strategy:

  1. Break up your large task into smaller tasks. In your case, each task could be loading a single page.

  2. Send each small task to the queue.

  3. Run your queue workers somewhere.

Using this strategy has the following advantages:

  1. For long running tasks it has the ability to recover in case a fatal problem occurs in the middle of the run -- no need to start from the beginning.

  2. If your tasks do not have to be run sequentially, you can run multiple workers to run tasks simultaneously.

You have a variety of options (this is just a few):

  1. RabbitMQ (https://www.rabbitmq.com/tutorials/tutorial-one-php.html); a minimal publisher sketch follows this list
  2. ZeroMQ (http://zeromq.org/bindings:php)
  3. If you're using the Laravel framework, queues are built-in (https://laravel.com/docs/5.4/queues), with drivers for AWS SQS, Redis, Beanstalkd
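As one concrete illustration of steps 1 and 2, a minimal publisher following the RabbitMQ tutorial linked above might look like the sketch below. The queue name, the URL list and the broker credentials are made up, and it assumes the php-amqplib package is installed via Composer:

    <?php
    // Publish one queue message per page to scrape; workers consume them independently.
    require_once __DIR__ . '/vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;
    use PhpAmqpLib\Message\AMQPMessage;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channel = $connection->channel();

    // Durable queue so tasks survive a broker restart (hypothetical queue name).
    $channel->queue_declare('scrape_tasks', false, true, false, false);

    $urls = array('http://example.com/page1', 'http://example.com/page2'); // made-up list
    foreach ($urls as $url) {
        $msg = new AMQPMessage($url, array('delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT));
        $channel->basic_publish($msg, '', 'scrape_tasks');
    }

    $channel->close();
    $connection->close();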
1
  • A cron job is another good option if this is happening at a smaller scale and you don't already have something like RabbitMQ available in your production environment. Commented Jan 18 at 4:48
5

No, PHP is not the best solution.

I'm not sure about Ruby or Perl, but with Python you could rewrite your page scraper to be multi-threaded and it would probably run at least 20x faster. Writing multi-threaded apps can be somewhat of a challenge, but the very first Python app I wrote was a multi-threaded page scraper. And you could simply call the Python script from within your PHP page by using one of the shell execution functions.

5
  • The actual processing part of my scraping is very efficient. As I mentioned above, it's the loading of each page that kills me. What I was wondering is if PHP is meant to be run for such long periods.
    – kbanman
    Commented Feb 6, 2010 at 16:05
  • I'm a bit biased because since learning Python I outright loathe PHP. However, if you're scraping more than one page (in series), you're almost certain to get better performance by doing it in parallel with a multithreaded app.
    – jamieb
    Commented Feb 6, 2010 at 21:34
  • 1
    Any chance you could send me an example of such a page scraper? It would help me out aplenty seeing as I haven't yet touched Python.
    – kbanman
    Commented Feb 9, 2010 at 8:52
  • If I had to rewrite it, I'd just use eventlet. It'd make my code about 10x simpler: eventlet.net/doc
    – jamieb
    Commented Feb 12, 2010 at 8:40
  • Or you could rewrite your PHP script to be multi-threaded. Python is indeed probably the best language for scraping the web (if you don't need an admin portal to go along with the scraper), however, PHP would be faster not slower than Python. Commented Jan 18 at 5:31
3

PHP may or may not be the best tool, but you know how to use it, and the rest of your application is written using it. These two qualities, combined with the fact that PHP is "good enough" make a pretty strong case for using it, instead of Perl, Ruby, or Python.

If your goal is to learn another language, then pick one and use it. Any language you mentioned will do the job, no problem. I happen to like Perl, but what you like may be different.

Symcbean has some good advice about how to manage background processes at his link.

In short, write a CLI PHP script to handle the long bits. Make sure that it reports status in some way. Make a PHP page to handle status updates, either using AJAX or traditional methods. Your kickoff script will then start the process running in its own session, and return confirmation that the process is going.
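The status page mentioned above could be as small as this sketch; the table and column names are hypothetical, not anything defined in this answer:

    <?php
    // status.php - polled via AJAX; reads the progress written by the CLI worker.
    // Assumes a hypothetical `jobs` table with `id`, `status` and `progress` columns.
    $db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');

    $stmt = $db->prepare('SELECT status, progress FROM jobs WHERE id = ?');
    $stmt->execute(array((int) $_GET['id']));

    header('Content-Type: application/json');
    echo json_encode($stmt->fetch(PDO::FETCH_ASSOC));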

Good luck.

1

I agree with the answers that say this should be run in a background process. But it's also important that you report on the status so the user knows that the work is being done.

When receiving the PHP request to kick off the process, you could store in a database a representation of the task with a unique identifier. Then, start the screen-scraping process, passing it the unique identifier. Report back to the iPhone app that the task has been started and that it should check a specified URL, containing the new task ID, to get the latest status. The iPhone application can now poll (or even "long poll") this URL. In the meantime, the background process would update the database representation of the task as it worked with a completion percentage, current step, or whatever other status indicators you'd like. And when it has finished, it would set a completed flag.
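The background half of that flow could look roughly like the sketch below; the tasks table, its columns and the page list are all invented for the example:

    <?php
    // scrape_worker.php - launched with the task id, e.g. "php scrape_worker.php 42".
    $taskId = (int) $argv[1];
    $db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');

    $urls = array('http://example.com/a', 'http://example.com/b'); // made-up page list
    $total = count($urls);

    foreach ($urls as $i => $url) {
        // ... curl fetch and processing of $url here ...

        // Report the completion percentage as we go.
        $stmt = $db->prepare('UPDATE tasks SET percent_complete = ? WHERE id = ?');
        $stmt->execute(array((int) round(($i + 1) / $total * 100), $taskId));
    }

    // Set the completed flag when everything is done.
    $db->prepare('UPDATE tasks SET completed = 1 WHERE id = ?')->execute(array($taskId));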

1

You can send it as an XHR (Ajax) request. Clients don't usually have any timeout for XHRs, unlike normal HTTP requests.

1

I realize this is quite an old question but would like to give it a shot. This script tries both to make the initial kick-off call finish quickly and to chop the heavy load down into smaller chunks. I haven't tested this solution.

<?php
/**
 * crawler.php located at http://mysite.com/crawler.php
 */

// Make sure this script will keep on running after we close the connection to
// it.
ignore_user_abort(TRUE);


function get_remote_sources_to_crawl() {
  // Do a database or a log file query here.

  $query_result = array (
    1 => 'http://exemple.com',
    2 => 'http://exemple1.com',
    3 => 'http://exemple2.com',
    4 => 'http://exemple3.com',
    // ... and so on.
  );

  // Return the first one on the list as array($id, $url).
  foreach ($query_result as $id => $url) {
    return array($id, $url);
  }
  return FALSE;
}

function update_remote_sources_to_crawl($id) {
  // Update my database or log file list so the $id record won't show up
  // on my next call to get_remote_sources_to_crawl().
}

$crawling_source = get_remote_sources_to_crawl();

if ($crawling_source) {

  list($id, $url) = $crawling_source;

  // Run your scraping code on $url here.

  if ($your_scraping_has_finished) { // set this flag from your scraping code
    // Update your database or log file.
    update_remote_sources_to_crawl($id);

    $ctx = stream_context_create(array(
      'http' => array(
        // I am not quite sure, but I reckon the timeout set here actually
        // starts counting after the connection to the remote server is made,
        // limiting only how long the downloading of the remote content may take.
        // As we are only interested in triggering this script again, 5 seconds
        // should be plenty of time.
        'timeout' => 5,
      )
    ));

    // Open a new connection to this script and close it after 5 seconds in.
    file_get_contents('http://' . $_SERVER['HTTP_HOST'] . '/crawler.php', FALSE, $ctx);

    print 'The cronjob kick off has been initiated.';
  }
}
else {
  print 'Yay! The whole thing is done.';
}
5
  • @symcbean I read the post you suggested and would like to hear your thoughts on this alternative solution. Commented Jun 27, 2013 at 1:51
  • Firstly, you've given me a starting idea for my first bot (teehee). Secondly, how did you find the performance of your solution? Have you worked with it further and learnt anything more? I'm interested in implementing something similar to dredge through 26,000 images (1,3GB), perform various operations, etc. It's going to take a while. Yours is the only solution that doesn't seem hacky, use exec() shudder or require Linux (some of us losers still have to use Windows). I prefer to learn from your headbashing, rather than my own :P
    – user651390
    Commented Nov 22, 2013 at 9:45
  • @HighPriestessofTheTech Hi mate, I haven't gone any further. At the time I wrote this I was just putting out a thought experiment. Commented Nov 23, 2013 at 14:13
  • 1
    Oh dear... So I'll be learning from my own headbashing... I'll let you know how it goes ;)
    – user651390
    Commented Nov 23, 2013 at 16:07
  • 1
    I did try this and I find it quite useful.
    – Alex
    Commented Feb 20, 2014 at 13:29
1

I would like to propose a solution that is a little different from symcbean's, mainly because I have an additional requirement that the long-running process needs to run as another user, and not as the apache / www-data user.

First solution using cron to poll a background task table:

  • PHP web page inserts into a background task table, state 'SUBMITTED'
  • cron runs once every 3 minutes, as another user, running a PHP CLI script that checks the background task table for 'SUBMITTED' rows
  • the PHP CLI script updates the state column of the row to 'PROCESSING' and begins processing; after completion it is updated to 'COMPLETED'
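A minimal sketch of that CLI worker; the background_task table, its columns and the connection details are invented for illustration:

    <?php
    // worker.php - run by cron every 3 minutes as the dedicated (non-www) user.
    // Assumes a hypothetical `background_task` table with `id`, `state` and `params` columns.
    $db = new PDO('mysql:host=localhost;dbname=mydb', 'worker_user', 'password');

    $rows = $db->query("SELECT id, params FROM background_task WHERE state = 'SUBMITTED'");

    foreach ($rows as $row) {
        $db->prepare("UPDATE background_task SET state = 'PROCESSING' WHERE id = ?")
           ->execute(array($row['id']));

        // ... do the long-running work described by $row['params'] here ...

        $db->prepare("UPDATE background_task SET state = 'COMPLETED' WHERE id = ?")
           ->execute(array($row['id']));
    }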

Second solution using Linux inotify facility:

  • PHP web page updates a control file with the parameters set by user, and also giving a task id
  • shell script (as a non-www user) running inotifywait will wait for the control file to be written
  • after the control file is written, a close_write event will be raised and the shell script will continue
  • shell script executes PHP CLI to do the long running process
  • PHP CLI writes the output to a log file identified by task id, or alternatively updates progress in a status table
  • PHP web page could poll the log file (based on task id) to show progress of the long running process, or it could also query status table

Some additional info can be found in my post: http://inventorsparadox.blogspot.co.id/2016/01/long-running-process-in-linux-using-php.html

0

I have done similar things with Perl, using a double fork() and detaching from the parent process. All HTTP fetching work should be done in the forked process.
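The same detach-by-double-fork idea can be sketched in PHP as well (CLI only; it needs the pcntl and posix extensions, which may not be enabled in your build):

    <?php
    // Double fork and detach, roughly mirroring the Perl approach described above.
    $pid = pcntl_fork();
    if ($pid < 0) {
        die('first fork failed');
    }
    if ($pid > 0) {
        exit(0); // parent: return to the caller immediately
    }

    // First child: become a session leader, detaching from the parent's process group.
    posix_setsid();

    $pid = pcntl_fork();
    if ($pid < 0) {
        die('second fork failed');
    }
    if ($pid > 0) {
        exit(0); // first child exits; the grandchild cannot reacquire a controlling terminal
    }

    // Grandchild: do all the HTTP fetching work here.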

0

Use a proxy to delegate the request.

0

What I ALWAYS use is one of these variants (because different flavors of Linux have different rules about handling output, and some programs output differently):

Variant I @exec('./myscript.php 1>/dev/null 2>/dev/null &');

Variant II @exec('php -f myscript.php 1>/dev/null 2>/dev/null &');

Variant III @exec('nohup myscript.php 1>/dev/null 2>/dev/null &');

You might have to install "nohup". But for example, when I was automating FFMPEG video conversions, the output somehow wasn't 100% handled by redirecting output streams 1 & 2, so I used nohup AND redirected the output.

0

If you have a long script, then divide the page work with the help of an input parameter for each task (then each page acts like a thread), i.e. if the page has a long processing loop over 100,000 product_keywords, then instead of the loop, write the logic for one keyword and pass that keyword from cornjobpage.php (in the following example).

And for the background worker I think you should try this technique. It will help to call as many pages as you like: all pages will run at once independently, without waiting for each page's response, asynchronously.

cornjobpage.php //mainpage

    <?php

    post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue");
    //post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue2");
    //post_async("http://localhost/projectname/otherpage.php", "Keywordname=anyValue");
    // Call as many pages as you like; all pages will run at once, independently,
    // without waiting for each page's response (asynchronously).

    /*
     * Executes a PHP page asynchronously so the current page does not have to
     * wait for it to finish running.
     */
    function post_async($url, $params)
    {
        $post_string = $params;

        $parts = parse_url($url);

        $fp = fsockopen($parts['host'],
            isset($parts['port']) ? $parts['port'] : 80,
            $errno, $errstr, 30);

        if (!$fp) {
            return;
        }

        // You can use POST instead of GET if you like (add Content-Type and
        // Content-Length headers plus a body in that case).
        $out  = "GET " . $parts['path'] . "?$post_string" . " HTTP/1.1\r\n";
        $out .= "Host: " . $parts['host'] . "\r\n";
        $out .= "Connection: Close\r\n\r\n";

        // Write the request and close the socket without waiting for a response.
        fwrite($fp, $out);
        fclose($fp);
    }
    ?>

testpage.php

    <?php
    echo $_REQUEST["Keywordname"]; // case1 output > testValue
    ?>

PS: if you want to send URL parameters in a loop then follow this answer: https://stackoverflow.com/a/41225209/6295712

0

Not the best approach, as many stated here, but this might help:

ignore_user_abort(1); // run script in background even if user closes browser
set_time_limit(1800); // run it for 30 minutes

// Long running script here
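If you also want to hand a response back to the client before the long work starts (as asked in the question), a commonly used companion sketch is the following; fastcgi_finish_request() only exists under PHP-FPM, hence the function_exists() guard:

ignore_user_abort(true);
set_time_limit(0);

ob_start();
echo 'Job started'; // whatever the client should see right away
header('Connection: close');
header('Content-Length: ' . ob_get_length());
ob_end_flush();
flush();

if (function_exists('fastcgi_finish_request')) {
    fastcgi_finish_request(); // PHP-FPM: send the response, keep the script running
}

// ... long running work continues here ...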
0

If the desired output of your script is some processing, not a webpage, then I believe the desired solution is to run your script from the shell, simply as

php my_script.php
