This document summarizes techniques for building scalable websites with Perl, including caching whole pages, chunks of HTML/data, and using job queuing. Caching helps performance by reducing workload and scalability by lowering database load. Large sites like Yahoo cache aggressively. Job queuing prevents overloading resources and keeps websites responsive under high demand by lining requests up in a queue.
Scalable talk notes
Building Scalable Websites with Perl
by Perrin Harkins
Who is doing it?
First, let's establish some credit with any doubters in the audience. I shouldn't have to tell you this, but Perl
runs some of the largest websites in the world. Take a look at some of the better-known examples:
Yahoo.com uses Perl in nearly all of their properties, in particular the personalized My Yahoo service. On
the whole, Yahoo serves three billion page views per day, and about 100 million unique users. Yahoo owns
Overture, the largest sponsored search company. According to their posting on the Perl jobs list at
http://jobs.perl.org/, they handle "more than 10 billion transactions per month!"
Amazon.com, the company that pretty much defines e-commerce, uses Perl on their main site and partner
sites. Amazon also operates the popular Internet Movie Database, IMDB.com, which is built in Perl.
Ticketmaster.com, the largest on-line ticket retailer, is built almost entirely with Perl. So is its sister
company, CitySearch.com, which operates the most widely-used city guide sites in the US.
Nielsen NetRatings says that Yahoo, Amazon, and InterActiveCorp, which owns Ticketmaster Online and
CitySearch, are all in the top 10 in terms of overall web traffic. We're talking about phenomenal numbers of
users and page views here. By comparison, Slashdot.org, which people frequently point to as a high traffic
site using Perl, is barely a drop in the bucket.
How are they doing it?
Okay, so your company probably doesn't get as much traffic as Yahoo. Still, you may be wondering, what
is it that really large sites do that allows them to scale so big, and is it something you could apply to your
own sites?
Obviously, these are all very different applications. There is no single solution for scaling all of them. Even
buying a lot of hardware isn't a magic bullet, since it just isn't feasible to buy enough computing power to
prop up a slow application at these levels of traffic. However, what you discover when you talk to people
who work at these sites, is that there are a few common techniques that tend to get used by almost everyone
in one form or another. These are fundamental software techniques that have been around for ages, not
some kind of newly invented Internet magic. Feel free to refer to them as design patterns if it will raise your
salary. Today we're going to talk about a couple of these and how they apply to web development
problems.
Things we won't be covering
I should also mention what we're not going to talk about.
We're not going to talk about mod_perl tuning: httpd.conf settings, reverse proxy configurations, increasing
copy-on-write memory sharing, running the profiler... This stuff is very well-documented in the mod_perl
books and the on-line documentation at http://perl.apache.org/. If you're serious about building a scalable
site and you haven't read these resources yet, get on it!
We're not going to talk about DBI tuning. Tim Bunce has detailed slides from his talks available on CPAN
(http://search.cpan.org/~timb/), and there is more in the mod_perl documentation and books.
We're not going to talk about hardware because, well, I'm not very interested in hardware. That's for
cheaters. (However, I'm willing to cut the sites I mentioned above a little slack on this...)
Caching
Caching helps performance by reducing the amount of work that needs to be done, and helps scalability by
reducing the load on shared resources like databases. All of the sites I mentioned above cache like mad
wherever they can. Page caching, object caching, de-normalized database tables - all of these are variations
on a theme. Even if your data is so volatile that it changes every 30 seconds, if it only takes 1 second to
generate it you will still get to serve it from cache for the other 29.
Whole Pages
If you can possibly get away with it, cache entire HTML pages and serve them as static files. This is simply
unbeatable from a performance standpoint. Web servers and operating systems have been tuned to serve
static files with incredible efficiency. When I worked at eToys.com, we were caching all of the
non-interactive pages (i.e. the ones that people just browsing the catalog would see) as static files, and serving
those pages was about ten times as fast as generating the same page on the fly, even when all of the data
needed to create the page was cached in our mod_perl servers.
There are a few ways to make this happen. One of them is to simply write out all of the possible pages on
your site on a regular basis. You can write a big batch job that generates all the files for your website,
probably by reading a database and then pounding the data through templates. Sometimes people write
elaborate versions of this, with dependency checking and make-like functionality. See the ttree program that
comes with Template Toolkit for one take on it.
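For a rough sense of what such a batch job looks like, here is a minimal sketch (not from the talk) that
assumes a hypothetical products table and a product_page.tt template:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Template;

# Hypothetical database and template names, purely for illustration.
my $dbh = DBI->connect('dbi:mysql:shop', 'user', 'password', { RaiseError => 1 });
my $tt  = Template->new({ INCLUDE_PATH => 'templates' });

my $products = $dbh->selectall_arrayref(
    'SELECT id, name, description, price FROM products',
    { Slice => {} },
);

foreach my $product (@$products) {
    # Pound each row through the template and write out a static file.
    $tt->process('product_page.tt', { product => $product },
                 "htdocs/products/$product->{id}.html")
        or die $tt->error;
}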
However, you can also do this for a site that was not built to be pre-published. Many tools exist for
spidering websites to local copies, so all you have to do is point one at your dynamic site and dump it out as
static files.
wget --mirror --convert-links --html-extension --reject gif,jpg,png \
     --no-parent http://app-server/dynamic/pages/
In reality, most sites would end up needing something more customized than this, but a simple tool like this
can give you something to do benchmarks on at least.
This kind of approach is only feasible if your site is small enough to write out the whole thing on a regular
basis. If you have a site which is a front-end to a large database of some kind, you might have potentially
millions of different pages to publish. There might be a few that get the vast majority of the hits though, and
are thus worth caching. Rather than try to figure out which ones to pre-publish, you can use a
generate-on-demand approach. This is what most people think of when they hear talk about caching web pages.
The simplest way to do that is with a caching proxy server. If you've read the mod_perl documentation you
should be familiar with the idea of a reverse proxy, sometimes called an HTTP accelerator. It's an HTTP
proxy that sits in front of your server, passing through requests for dynamic pages. You can configure it to
cache the pages and then tell it how long to keep them by setting the Expires and Cache-Control
headers during page generation.
ProxyRequests Off
ProxyPass /dynamic/stuff http://app-server/
ProxyPassReverse /dynamic/stuff http://app-server/
CacheRoot "/mnt/proxy-cache"
CacheSize 500000
CacheGcInterval 12
CacheMaxExpire 36
CacheDefaultExpire 2
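On the generation side, telling the proxy how long to keep a page is just a matter of sending the right
headers. A minimal mod_perl 1.x sketch (not from the talk; the ten-minute lifetime is an arbitrary example):

use Apache::Util qw(ht_time);

# Mark the response as cacheable for ten minutes before sending the body.
$r->content_type('text/html');
$r->header_out('Cache-Control' => 'max-age=600');
$r->header_out('Expires'       => ht_time(time + 600));
$r->send_http_header;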
These pages are not quite as fast as regular static ones -- mod_proxy checks the headers at the top of the file
to make sure it hasn't expired before serving it. However, they are much faster than dynamic generation.
Note that this will only work for pages which you can generate on the fly in a reasonable amount of time. If
you have a page that takes two minutes to generate, you need to generate it before users ask for it. Of course
you can still use this approach, and seed it with some artificial requests beforehand, which will basically
give you a mix between the generate-on-demand and pre-generation approaches.
One final variation worth mentioning is intercepting the 404 error. It works like this: you set up your
program as the handler for 404 "NOT FOUND" errors on the site. When a page is requested that is not
found on the file system, that triggers a 404 and sends the request over to you. You then generate the
requested page, and write it out to the file system so that it will be there the next time someone comes
looking for it.
This is the approach that Vignette StoryServer uses for caching, or at least it did, back in the early days
when it was spun off from cnet.com. It's easy to configure an Apache server to do this:
ErrorDocument 404 /page/generator
This will make Apache do an internal redirect to the program at /page/generator, passing information
about the URL originally requested as environment variables. This program writes out the file, and then,
if you're using mod_perl, you can just do an internal redirect to the newly generated page and let Apache
handle it like any other file.
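A stripped-down mod_perl sketch of such a 404 handler might look like this (generate_page() is a stand-in
for whatever builds the HTML for a given URL):

use Apache::Constants qw(OK);
use File::Basename qw(dirname);
use File::Path qw(mkpath);

sub handler {
    my $r = shift;
    # The ErrorDocument redirect gives us the originally requested URL.
    my $uri  = $r->prev ? $r->prev->uri : $r->uri;
    my $file = $r->document_root . $uri;

    my $html = generate_page($uri);    # hypothetical page builder
    mkpath(dirname($file));
    open my $fh, '>', $file or die "can't write $file: $!";
    print $fh $html;
    close $fh;

    # Hand the new file back to Apache as if it had been there all along.
    $r->internal_redirect($uri);
    return OK;
}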
The upside is great performance, since the pages are served as normal static files. The downside of this is
that you then have to manage expiring these pages yourself, probably by writing a cron job that will check
for ones that are too old and delete them. You run the risk of serving a file a little after its expiration time if
the cron doesn't do its job frequently enough. In general, I think the caching proxy approach is easier to
manage, but if you are using something other than mod_perl -- like FastCGI, which already separates the
Perl interpreters from the web server -- there is not as much incentive to run a proxy.
Chunks of HTML or data
Many of you were probably thinking during that last part "That sounds great, but my web designers insisted
on putting the current user's name on every page. I can't cache the whole thing." Obviously sites like
Amazon or My Yahoo can't cache the whole page either. They can cache pieces of pages though, and
reduce the page generation to little more than knitting the pieces together, like server-side includes. Yahoo
uses this technique quite a bit, generating the pieces of content for the portal in advance, and building a
custom template for each user based on their preferences that includes the appropriate pieces at request-time.
By the way, you may be aware that PHP is being used at Yahoo now and assumed that this meant it was
replacing Perl. That's not the case. PHP is mostly being used for this sort of include-template work,
replacing some older in-house solutions that Yahoo used to use. The content generation that was done in
Perl is still being done in Perl.
The caching built into the Mason web development framework is a good example of caching pieces. It
allows you to cache arbitrary content with a key and an expiration time and then retrieve it later.
my $result = $m->cache->get($search_term);
if (!defined($result)) {
    $result = run_search($search_term);
    $m->cache->set($search_term, $result, '30 min');
}
You can cache generated HTML, or you can cache data which you've fetched from a database or
elsewhere. Caching the generated HTML gives better performance, because it allows you to skip more
work when you get a cache hit (the HTML generation), but caching at the data level means you get to reuse
the cached content if it shows up in multiple different layouts. That increases your chances of getting a
cache hit. Rent.com, one of the top apartment listing services on the web, uses Mason's cache to store
results on a commonly used search page. Since there is a fair amount of repetition in these searches, they are
able to serve 55% of the search hits from cache instead of going to the database. That also frees up database
resources for other things.
I created a simple plugin module for Template Toolkit that adds partial-page caching, which is available on
CPAN as Template::Plugin::Cache. It's only really useful if you have templates that do a lot of work,
fetching data and the like inside the template itself, which is generally not the best way to use Template
Toolkit. When using a model-view-controller style of development, you will typically be caching data and
doing it before you get to the templates.
If you want to add caching to your application, there are several good options on CPAN. For a local cache
on a single machine, I would recommend Rob Mueller's Cache::FastMmap. BerkeleyDB is about the same
speed if you use the OO interface and built-in locking, but you'd have to build the cache expiration code
yourself. Both of these are several times as fast as the popular Cache::FileCache module and hundreds of
times faster than any of the modules built on top of IPC::ShareLite.
our $Cache = Cache::FastMmap->new(
    cache_size  => '500m',
    expire_time => '30m',
);
$Cache->set($key, $value);
my $value = $Cache->get($key);
My only real complaint about Cache::FastMmap is that it doesn't provide a way to set different expiration
times for individual items. You could add this yourself in a wrapper around Cache::FastMmap, but at that
point it loses its main advantage over BerkeleyDB, which is the built-in expiration and purging
functionality.
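One way such a wrapper could work is to store each value's deadline alongside it and treat anything past
its deadline as a cache miss. A sketch (a hypothetical package, not a drop-in module):

package My::ExpiringCache;
use strict;
use warnings;
use Cache::FastMmap;

sub new {
    my $class = shift;
    return bless { cache => Cache::FastMmap->new(@_) }, $class;
}

# Store the expiration timestamp next to the value.
sub set {
    my ($self, $key, $value, $ttl) = @_;
    $self->{cache}->set($key, [ time + $ttl, $value ]);
}

# An entry past its own deadline counts as a miss.
sub get {
    my ($self, $key) = @_;
    my $entry = $self->{cache}->get($key) or return undef;
    return time > $entry->[0] ? undef : $entry->[1];
}

1;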
For a cache that needs to be shared across a whole cluster of machines, you need something different.
Memcached (http://www.danga.com/memcached/) is a cache server that you can access over the network. It
keeps the cached items in RAM, but can be scaled for large amounts of data by running it on multiple
servers. Requests are automatically hashed across the available servers, spreading the data set out across all
of them. It uses some recent advances like the epoll system call in the Linux 2.6 kernel to offer impressive
scalability. The livejournal.com website is currently using memcached.
my $memd = Cache::Memcached->new({
    'servers' => [ "10.0.0.15:11211", "10.0.0.15:11212",
                   "10.0.0.17:11211", [ "10.0.0.17:11211", 3 ] ],
    'debug' => 0,
    'compress_threshold' => 10_000,
});
$memd->set($key, $value, 5*60);
my $value = $memd->get($key);
If that sounds like more than you want to deal with, you can make something simple with MySQL. Because
MySQL has an option to use a lightweight non-transactional table type, it is a good choice for this kind of
application. Just create a simple table with key, value, and expiration time columns and use it the way you
would use a hash. If you follow DBI best practices, you can get performance that beats most of the cache
modules on CPAN except the ones I mentioned here.
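A sketch of that approach (the table and function names are hypothetical; prepare_cached() is the DBI
best practice the previous sentence alludes to):

# Hypothetical table, using MySQL's non-transactional MyISAM type:
#   CREATE TABLE cache (
#       ckey    VARCHAR(250) NOT NULL PRIMARY KEY,
#       cvalue  BLOB,
#       expires INT UNSIGNED NOT NULL
#   ) TYPE=MyISAM;

sub cache_set {
    my ($dbh, $key, $value, $ttl) = @_;
    my $sth = $dbh->prepare_cached(
        'REPLACE INTO cache (ckey, cvalue, expires) VALUES (?, ?, ?)');
    $sth->execute($key, $value, time + $ttl);
}

sub cache_get {
    my ($dbh, $key) = @_;
    my $sth = $dbh->prepare_cached(
        'SELECT cvalue FROM cache WHERE ckey = ? AND expires > ?');
    $sth->execute($key, time);
    my ($value) = $sth->fetchrow_array;
    $sth->finish;
    return $value;
}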
Job Queuing
I could go on for hours about caching, but there are other important things to cover.
Let's say you run a website that sells concert tickets. That means that at a specific, publicly-announced time,
Madonna tickets will go on sale. That, in turn, means that a staggering number of people will all be waiting
at 11am on Sunday morning with their fingers poised above the mouse button ready to click "buy" until
they get a ticket. But wait, it gets worse! In order to give people who are trying to buy tickets by phone or in
person a fair shot at the action, you are only allowed to put holds on a certain number of tickets at a time,
meaning that only that number of people can be in the process of actually buying a ticket at once. Does this
sound like a good way to ruin your weekend? This is the sort of thing that the ticketmaster.com site has to
deal with routinely.
How do you handle excessive demand for a limited resource? The same way you do it in real life: you
make people line up for it. Queues are a common approach for preventing overloading and making efficient
use of resources.
[ queue diagram ]
So, what have we accomplished with our queue? First of all, we have control of how many processes are
handling requests in parallel, so we won't overwhelm our backend systems. Second, since it hardly takes
any time at all to queue a request or check status, we are keeping our web server processes free to handle
more users. The site will be responsive even when there are far more users on it sending in requests than we
can actually handle at one time. Finally, we are providing frequently updated status information to users, so
they won't leave or try to resubmit their requests.
Queues are also useful when you have long-running jobs. For example, suppose you're building a site that
compares prices on hotel rooms by making price quote requests to a bunch of remote servers and comparing
them. That could take some time, even if you send the requests in parallel.
You can keep the browser from timing out by using the standard forking technique, where you fork off a
process to do the work and return an "in progress" page. When the forked process finishes handling the
request, it writes the results to a shared data location, like a database or session file. Meanwhile, the page
reloads, and until the results are available it just keeps sending back the "in progress" page. Randal Schwartz
has an article on-line that demonstrates this technique. It's located at
http://www.stonehenge.com/merlyn/WebTechniques/col20.html.
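In outline, the technique looks something like this (a sketch only; run_quotes() and the session helpers
are stand-ins, and real mod_perl code needs more care when forking, which the referenced article covers):

my $session = get_session($r);    # hypothetical session lookup

if (!$session->{started}) {
    $session->{started} = 1;
    defined(my $pid = fork) or die "fork failed: $!";
    if (!$pid) {
        # Child: do the slow work, then save the results where the
        # parent's later requests can find them.
        my $results = run_quotes();    # hypothetical long-running job
        save_results($session->{id}, $results);
        exit 0;
    }
}

if (my $results = load_results($session->{id})) {
    show_results($r, $results);       # done: render the real page
} else {
    show_in_progress($r);             # not yet: page reloads and tries again
}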
However, this doesn't completely solve the problem. Say these jobs take 15 seconds to complete. What
happens if 1000 people come in and submit jobs in those 15 seconds? You'll have 1000 new processes
forked! A queue approach avoids this, by just dropping the requests onto the queue and letting the
already-running job processors handle them at a fixed rate.
Modules to Use
Now that you know what queues are good for, where do you get one? The Ticketmaster code is closely tied
to their backend systems, so it's not open source. There are some other options. One that you can grab from
CPAN is Jason May's Spread::Queue. This is built on top of the Spread toolkit (http://spread.org/) for
reliable multicast messaging. What Spread provides is a scalable way to send messages out across a cluster
of machines and make sure they are received reliably and in order. It actually provides other things too, but
this is the part that Spread::Queue is using.
The system consists of three parts: a client library, a queue manager, and a worker library. The client library
is called from your code when you want to add a request to the queue. That sends a request to the queue
manager using Spread. You define your job processing code in a worker class. You can start as many
worker processes as you like and they can be on any machine in the cluster. They will register themselves
and begin accepting jobs.
In the client process:
use Spread::Queue::Sender;
my $sender = Spread::Queue::Sender->new("myqueue");
$sender->submit("myfunc", { name => "value" });
my $response = $sender->receive;
In the worker process:
use Spread::Queue::Worker;

my $worker = Spread::Queue::Worker->new("myqueue");
$worker->callbacks(
    myfunc => \&myfunc,
);
$SIG{INT} = \&signal_handler;
$worker->run;

sub myfunc {
    my ($worker, $originator, $input) = @_;
    my $result = {
        response => "I heard you!",
    };
    $worker->respond($originator, $result);
}
The Spread::Queue system looks very attractive, but there are a few things it could use. There doesn't seem
to be a way to check where a particular job is in the queue, or even to ask if that job is done yet or not
without blocking until it is done. Also, the queue is not stored in a durable way: it's just in the memory of
the queue manager process, so if that process dies, the entire state of the queue is lost. Adding these features
would make a good project for someone, and someone may be me if I need them before someone else does.
Where to Learn More
If some of these concepts are new to you, and you want to learn more about them, the good news is that
there is lots of good technical writing on these subjects. The Perl Journal, including the "best of" collection
that O'Reilly has been publishing, is a good resource, and so is the "Mastering Algorithms with Perl" book.
The bad news is that some of the most interesting stuff is written for a Java audience. My advice is that if
you want to learn how to do this scalable web development well, you can't be trapped in one community or
one language -- you need to see what other people are doing. I like Martin Fowler's books, because he
doesn't have an agenda to push and isn't trying to sell you on a particular tool or API. Similarly, the O'Reilly
sites at http://oreillynet.com/, including http://onjava.com/, get some good stuff. The Java content is mostly
open-source oriented so it's much less fluffy than most Java sites.
Acknowledgements
I'd like to thank Craig McLane and Adam Sussman of Ticketmaster, and Zack Steinkamp of Yahoo for
being very generous with their time in answering my questions while I was working on this talk.