Performance is good,
Understanding performance is better
               Peter HJ van Eijk
              Chairman NLCMG
         A non-profit community of professionals

                      Feb 11, 2012
CMG 101
                   Computer Cloud Measurement Group
Understand:
• Definitions of availability and response time
• Psychological and business effect of delay/response time. User
  interfaces, cost of downtime
• Transactions, and their structure.
• Waterfall diagrams for transactions and web page downloads
• Performance measures (seconds, bytes, bits per second, IOPS, etc).
• Reporting measures / metrics.
• How to visualize quantitative data
• Resources (CPU, memory, disk, network, software)
• Elementary queuing theory
• Phases in development and how to incorporate performance and capacity
  (analysis, design, etc.), performance engineering
• Typical free and commercial tools, or at least their functionality
    – monitoring, reporting, alerting, analysis, modelling
Availability and Response Time
• Availability: Ability of a
  Configuration Item or IT
  Service to perform its
  agreed Function when
  required. […] Availability is
  usually calculated as a
  percentage.
• Response Time: A
  measure of the time taken
  to complete an Operation
  or Transaction
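A minimal sketch of the "calculated as a percentage" part, assuming the common convention that availability is agreed service time minus downtime, divided by agreed service time. The function name and the example figures are hypothetical, not taken from the slides.

```python
def availability_pct(agreed_service_minutes: float, downtime_minutes: float) -> float:
    """Availability as a percentage of the agreed service time."""
    if agreed_service_minutes <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_minutes <= agreed_service_minutes:
        raise ValueError("downtime must lie between 0 and the agreed service time")
    uptime = agreed_service_minutes - downtime_minutes
    return 100.0 * uptime / agreed_service_minutes


if __name__ == "__main__":
    # Hypothetical example: a 30-day month of 24x7 service with 43.2 minutes of downtime.
    month_minutes = 30 * 24 * 60
    print(f"{availability_pct(month_minutes, 43.2):.2f}%")
```

Note that the percentage only has meaning relative to the agreed service time: downtime outside the agreed hours does not count against it.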
Graphs of availability and response time
Psychological and business cost of downtime
€+$+£
Sudden surges can kill you
[Chart: pageviews over time, 1 Jan 2008 to 19 May 2009, y-axis 0 to 700,000, annotated "IceSave failure". Source: SiteStat]
KNMI.nl
[Chart: pageviews per hour by hour of day (1 to 23), y-axis 0 to 180,000; series for 30-dec and 31-dec, annotated "Weather alarm day" and "Ordinary day"]
Transactions and their structure
waterfall diagrams
A single user-level transaction decomposes into
multiple transactions on components
[Diagram: client/server message sequence Query, Ack, Reply, Ack, with network latency and server turnaround time marked between the messages; alongside it, a YSlow detail view]
Transactions: from visits to bandwidth
• Visits: 1.7 visits/sec (6,380 per hour). Source: SiteStat measurement
  – 7.42 pageviews per visit (according to SiteStat), though lower during the crisis
  – 79 GETs per visit according to the log file and SiteStat
• Pageviews: 13 pageviews/sec (47,338 per hour). Source: SiteStat measurement, server logs, page composition via Firebug
  – 10.6 (= 79/7.42) GETs per pageview effectively
  – 32 GETs for the homepage (according to the browser)
• GET requests: 140 requests/sec. Source: HTTP server logs
  – About 6,800 bytes per request on average
• Bandwidth: 0.95 MByte/sec, or 7.6 Megabit/sec. Source: HTTP server logs
© Digital Infrastructures
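The chain from visits to bandwidth is plain multiplication, and can be sketched as below. The input figures are the ones on the slide; the function and key names are illustrative only.

```python
def visits_to_bandwidth(visits_per_hour: float,
                        pageviews_per_visit: float,
                        gets_per_visit: float,
                        bytes_per_request: float) -> dict:
    """Convert a visit rate into pageview, request and bandwidth rates."""
    visits_per_sec = visits_per_hour / 3600
    requests_per_sec = visits_per_sec * gets_per_visit
    bytes_per_sec = requests_per_sec * bytes_per_request
    return {
        "visits_per_sec": visits_per_sec,
        "pageviews_per_sec": visits_per_sec * pageviews_per_visit,
        "gets_per_pageview": gets_per_visit / pageviews_per_visit,
        "requests_per_sec": requests_per_sec,
        "mbyte_per_sec": bytes_per_sec / 1e6,
        "mbit_per_sec": bytes_per_sec * 8 / 1e6,
    }


if __name__ == "__main__":
    # Figures from the slide: 6,380 visits/hour, 7.42 pageviews/visit,
    # 79 GETs/visit, about 6,800 bytes per request.
    for name, value in visits_to_bandwidth(6380, 7.42, 79, 6800).items():
        print(f"{name:>18}: {value:.2f}")
```

With the slide's inputs this lands close to the slide's rounded figures (about 140 requests/sec and 7.6 Megabit/sec). These are hourly averages, so peaks within the hour will be higher.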
How to diagnose a problem, where to look? Resource = capacity
[Diagram: end-to-end chain of resources: (Test) client, WAN link, router/switch (CPE), firewall and proxy, LAN switches, load balancer, HTTP front end, MySQL DB, NAS, SAN]
[Diagram: example breakdowns: Users, Application, Server, Network, Network lines]
Resource contribution to response time,
    modeling different resource allocations
Modelling the effect of different network bandwidths on response time
[Chart: bars for 64K, 256K, ICTRO 2Mb and GBO; x-axis 0 to 500 seconds; legend: server time (sec), client time (sec), network time due to delay (sec), network time due to bandwidth (sec). Based on a 50 ms round trip on the WAN.]
After the employee numbers have been retrieved (there are 373 Janssens), the employment details are requested one at a time (612 in total). On the GBO LAN this results in a measured elapsed time of 30 seconds.
Excessive client/server chatter leads to a user interaction time of more than 7 minutes!
How much faster will this be with:
• Very fast network?
• Very fast client?
• Very fast server?
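A rough sketch of the kind of model behind such a breakdown, assuming response time is simply the sum of server time, client time, round-trip delay and transfer time. Only the 612 requests and the 50 ms round trip come from the slide; the payload size and the link speeds in bits per second are hypothetical, so the output will not reproduce the chart.

```python
def response_time_s(round_trips: int, rtt_s: float,
                    payload_bytes: float, bandwidth_bps: float,
                    server_s: float = 0.0, client_s: float = 0.0) -> dict:
    """Split a chatty transaction's response time into its contributions."""
    delay_s = round_trips * rtt_s                   # cost of waiting per round trip
    transfer_s = payload_bytes * 8 / bandwidth_bps  # cost of pushing bytes through the link
    return {
        "server": server_s,
        "client": client_s,
        "network_delay": delay_s,
        "network_bandwidth": transfer_s,
        "total": server_s + client_s + delay_s + transfer_s,
    }


if __name__ == "__main__":
    ROUND_TRIPS = 612             # from the slide: details requested one at a time
    RTT_S = 0.050                 # from the slide: 50 ms round trip on the WAN
    PAYLOAD_BYTES = 612 * 2_000   # assumption: about 2 kB per reply
    for label, bps in [("64K", 64_000), ("256K", 256_000), ("2Mb", 2_000_000)]:
        parts = response_time_s(ROUND_TRIPS, RTT_S, PAYLOAD_BYTES, bps)
        print(label, {k: round(v, 1) for k, v in parts.items()})
```

The delay term does not depend on bandwidth: 612 sequential round trips at 50 ms cost about 30 seconds on any link speed. A faster link only shrinks the transfer term, which is why reducing chatter matters more than adding bandwidth in a case like this.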
Queuing theory
Response depends on capacity
[Chart: delay factor (0 to 12) against utilisation (10% to 90%), with the sweet spot marked]
At higher loads, congestion can set in
[Chart: actual throughput against traffic load, with the perfect line, the sweet spot and the congestion region marked]
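The shape of the delay-factor curve can be sketched with the simplest queuing model, a single-server M/M/1 queue, where response time divided by service time equals 1 / (1 - utilisation). The slide does not say which model its chart is based on, so M/M/1 is an assumption here.

```python
def delay_factor(utilisation: float) -> float:
    """Response time as a multiple of service time in an M/M/1 queue."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return 1.0 / (1.0 - utilisation)


if __name__ == "__main__":
    # Tabulate the same utilisation range as the chart's x-axis.
    for pct in range(10, 100, 10):
        print(f"{pct:3d}%  delay factor {delay_factor(pct / 100):5.2f}")
```

Under this model the delay factor is 2 at 50% utilisation and 10 at 90%: the curve stays flat for a long time and then climbs steeply, which is why the sweet spot sits well below full utilisation.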
So what was the bottleneck?
• KNMI: static page served from database
  1000/sec
• Ministry: very chatty client/server interaction
• DNB: JSP application server serves static
  content
• Anne Frank: many, large digital assets, no use
  of CDN
• Hospital information system: client (front-end)
  code
How to incorporate performance in
  development and operations
Typical free and commercial tools
         and their functionality
Functionality   Example tools
• Monitoring    • Nagios
• Reporting     • Cacti
• Alerting      • WatchMouse
• Analysis      • PDQ
• Modelling     • R
• Etc …         • YSlow
                • …
CMG 101
• We want to develop a ‘standard’ body of
  knowledge
  – To educate our people
  – Speak more of the same language
  – Enable tool vendors to more easily express their
    offerings
• Note: defining what is in the course is not the
  same as developing a course
Call for Action
•   Want to know more?
•   Want to collaborate, contribute?
•   Want to get a course?
•   Want to sponsor?

• Talk to me
                    Peter HJ van Eijk
                    @petersgriddle
               inbox@peterhjvaneijk.nl
                     +31 2268 4939
       www.nlcmg.nl NLCMG is a chapter of CMG.org
Some of my performance projects
• KNMI (Weather service): website meltdown after
  weather emergency (“weeralarm”)
• DNB (Dutch Banks Authority): website meltdown
  during 2008 financial crisis
• Unnamed Ministry: information system with
  multi-minute response times
• Crisis.nl: ….
• Anne Frank website: … anticipated surge after
  major redesign
• Hospital information system: storage sizing
Achtung alles Lookenspeepers! Nur watchen das Cloud.




          http://zoom.nl/foto/1713577/portret/cloudwatch.html
What does a financial IT crisis look like?
Fernando’s office (bank’s capacity planner)
