Insight without Interference
Monitoring with Scala, Swagger, MongoDB and Wordnik OSS
Tony Tam
@fehguy
Nagios Dashboard
Monitoring?

• Disk space
• Host checks
• Network
• System load
• IT Ops 101: necessary (but insufficient)
Why Insufficient?

• What about Services?
 •   Database running?
 •   HTTP traffic?
• Install Munin Node!
 •   Some (good) service-level insight
Your boss LOVES charts

• “OH pretty colors!”
• “up and to the right!”
• “it MUST be important!”
Good vs. Bad?

• Database calls avg 1ms?
 •   Great! DB working well
 •   But called 1M times per page load/user?
• Most tools are for system, not your app
• By the time you know, it’s too late
• Need business metrics monitoring!
Enter APM

• Application Performance Monitoring
• Many flavors, degrees of integration
 •   Heavy: transaction monitoring, code performance,
     heap, memory analysis
 •   Medium: home-grown profiling
 •   Light: digest your logs (failure forensics)
• What you need depends on architecture,
  business + technology stage
APM @ Wordnik

• Micro Services make the System (diagram: monolithic application)
• API calls are the unit of work!
Monitoring API Calls

• Every API must be
  profiled
• Other logic as needed
 •   Database calls
 •   Connection manager
 •   etc...
• Anything that might
  matter!
How?

• Wordnik-OSS Profiler for Scala
  •   Apache 2.0 License, available in Maven Central
• Profiling Arbitrary code block:
import com.wordnik.util.perf.Profile
Profile("create a cat", {/* do something */})

• Profiling an API call:
Profile("/store/purchase", {/* do something */})
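To make the idea concrete, here is a minimal sketch of what a block profiler with simple aggregation could look like. It is a stand-in written for illustration only: `SketchProfile` and `SketchCounter` are hypothetical names, and nothing here is claimed to match the Wordnik-OSS Profiler's internals or API.

```scala
import scala.collection.mutable

// Hypothetical aggregate for one profiled name (not the Wordnik-OSS ProfileCounter).
final case class SketchCounter(name: String, count: Long, totalMillis: Long) {
  def avgMillis: Double = if (count == 0) 0.0 else totalMillis.toDouble / count
}

// Hypothetical profiler: times a block and aggregates per name.
object SketchProfile {
  private val counters = mutable.Map.empty[String, SketchCounter]

  // Runs the block, records its elapsed time under `name`, and returns the block's result.
  def apply[T](name: String)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally {
      val elapsedMillis = (System.nanoTime() - start) / 1000000L
      this.synchronized {
        val c = counters.getOrElse(name, SketchCounter(name, 0L, 0L))
        counters(name) = c.copy(count = c.count + 1, totalMillis = c.totalMillis + elapsedMillis)
      }
    }
  }

  // Immutable copy of the current aggregates, safe to print or serve from an API.
  def snapshot: Map[String, SketchCounter] = this.synchronized(counters.toMap)
}
```

Usage would look like `SketchProfile("/store/purchase") { /* do something */ }`, after which `SketchProfile.snapshot` holds the call count and total time for that name.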
Profiler gives you…

• Nearly free*** tracking
• Simple aggregation
• Trigger mechanism
  •   Actions on time spent “doing things”:

Profile.triggers += new Function1[ProfileCounter, Unit] {
  // Fires when a "getDb" call reports a duration above 5000
  def apply(counter: ProfileCounter): Unit = {
    if (counter.name == "getDb" && counter.duration > 5000)
      wakeUpSysAdminAndEveryoneWhoCanFixShit(Urgency.NOW)
  }
}
• This is intrusive on your codebase
Accessing Profile Data

• Easy to get in code
  ProfileScreenPrinter.dump
• Output where you want
  logger.info(ProfileScreenPrinter.toString)
• Send to logs, email, etc.
Accessing Profile Data

• Easier to get via API with Swagger-JAXRS
import com.wordnik.resource.util

@Path("/activity.json")
@Api("/activity")
@Produces(Array("application/json"))
class ProfileResource extends ProfileTrait
• Inspect without bugging devs!
Is Aggregate Data Enough?

• Probably not
• Not Actionable
 •   Have calls increased? Decreased?
 •   Faster response? Slower?
Make it Actionable

• “In a 3 hour window, I expect 300,000
  views per server”
  •   Poll & persist the counters
  •   Example: Log page views, every min

{
      "_id" : "web1-word-page-view-20120625151812",
      "host" : "web1",
      "count" : 627172,
      "timestamp" : NumberLong("1340637492247")
},{
      "_id" : "web1-word-page-view-20120625151912",
      "host" : "web1",
      "count" : 627372,
      "timestamp" : NumberLong("1340637552778")
}
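Persisted counters become useful once you compare consecutive samples. The sketch below assumes the `count` field is cumulative (the two records above suggest this, but the slides do not say so); `CounterSample` and `CounterRates` are hypothetical names.

```scala
// One persisted counter record, mirroring the fields shown above.
final case class CounterSample(host: String, count: Long, timestamp: Long)

object CounterRates {
  // Events recorded between two samples, assuming a cumulative counter.
  def delta(earlier: CounterSample, later: CounterSample): Long =
    later.count - earlier.count

  // Average events per minute over the interval between the two samples.
  def perMinute(earlier: CounterSample, later: CounterSample): Double = {
    val elapsedMillis = later.timestamp - earlier.timestamp
    require(elapsedMillis > 0, "samples must be in chronological order")
    delta(earlier, later) * 60000.0 / elapsedMillis
  }
}
```

With the two records above, `delta` is 200 views over roughly one minute.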
Make it Actionable

• Your boss LOVES charts
That’s not Actionable!

• But it’s pretty
• What’s missing?
 •   Custom time window
 •   APIs to track?
 •   Low + high watermarks
 •   Too much custom engineering
• Call to Action!
Make it Actionable

• Swagger + a tiny bit of engineering
 •   Let your *product* people create monitors, set
     goals
• A Check: specific API call mapped to a
  service function
 {
     "name": "word-page-view",
     "path": "/word/*/wordView (post)",
     "checkInterval": 60,
     "healthSpan": 300,
     "minCount": 300,
     "maxCount": 100000
 }
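One way to read a Check is as a low/high watermark over a time window. The sketch below assumes that `healthSpan` is a window in seconds, that `minCount` and `maxCount` bound the number of calls seen in that window, and that samples are cumulative; the slides do not spell out these semantics, and `CheckSample` and `CheckEvaluator` are hypothetical names.

```scala
// Mirrors the Check document shown above.
final case class Check(name: String, path: String, checkInterval: Int,
                       healthSpan: Int, minCount: Long, maxCount: Long)

// A polled, cumulative call counter for the check's API path.
final case class CheckSample(count: Long, timestampMillis: Long)

object CheckEvaluator {
  // Calls observed within the last `healthSpan` seconds; None if fewer than two samples fall in the window.
  def callsInSpan(check: Check, samples: Seq[CheckSample], nowMillis: Long): Option[Long] = {
    val cutoff = nowMillis - check.healthSpan * 1000L
    val inSpan = samples
      .filter(s => s.timestampMillis >= cutoff && s.timestampMillis <= nowMillis)
      .sortBy(_.timestampMillis)
    if (inSpan.size < 2) None else Some(inSpan.last.count - inSpan.head.count)
  }

  // Healthy only when the windowed call count sits between the low and high watermarks.
  def isHealthy(check: Check, samples: Seq[CheckSample], nowMillis: Long): Boolean =
    callsInSpan(check, samples, nowMillis)
      .exists(n => n >= check.minCount && n <= check.maxCount)
}
```

Treating "not enough data" as unhealthy is a deliberate choice in this sketch: a host that has stopped reporting should not look fine.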
Make it Actionable

• A Service Type: a collection of checks
  which make a functional unit
  {
      "name": "www-api",
      "checks": [
        "word-of-the-day",
        "word-page-view",
        "word-definitions",
        "user-login",
        "api-account-signup",
        "api-account-activated"
      ]
  }
Make it Actionable

• A Host: “directions” to get to the checks
{
  "host": "ip-10-132-43-114",
  "path": "/v4/health.json/profile?api_key=XYZ",
  "serviceType": "www-api"
},
{
  "host": "ip-10-130-134-82",
  "path": "/v4/health.json/profile?api_key=XYZ",
  "serviceType": "www-api"
}
Make it Actionable

• And finally, a simple GUI
Make it Actionable

• Point Nagios at this!
  serviceHealth.json/status/www-api?explodeOnFailure=true
• Get a 500, get an alert
 •   Metrics from Product, based on YOUR app
 •   Treat like system failure
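The status endpoint boils down to folding the check results for a service type into one HTTP status. A minimal sketch, assuming the endpoint answers 200 unless `explodeOnFailure` is set and at least one check fails (the slides only state the 500 case); `ServiceStatus` is a hypothetical name.

```scala
object ServiceStatus {
  // checkResults maps each check name in the service type to whether it is healthy.
  def httpStatus(checkResults: Map[String, Boolean], explodeOnFailure: Boolean): Int = {
    val allHealthy = checkResults.values.forall(identity)
    if (!allHealthy && explodeOnFailure) 500 else 200
  }

  // Names of the failing checks, sorted, for the response body or the alert text.
  def failing(checkResults: Map[String, Boolean]): Seq[String] =
    checkResults.collect { case (name, false) => name }.toSeq.sorted
}
```

Nagios then only needs a plain HTTP check: any non-200 answer raises the alert, exactly as it would for a system failure.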
Is this Enough?

• System monitoring
• Aggregate monitoring
• Windowed monitoring
• Object monitoring?
 •   Action on a specific event/object
 •   Why!?
Object-level Actions

• Any back-end engineer can build this
 •   But shouldn’t
• ETL to a cube?
• Run BI queries against production?
• Best way to “siphon” data from production
  w/o intrusive engineering?
Avoiding Code Invasion

• We use MongoDB everywhere
• We use > 1 server wherever we use
  MongoDB
• We have an opLog record against
  everything we do
What is the OpLog?

• All participating members have one
• Capped collection of all write ops
 •   (diagram: timeline t0 to t3 across primary, replica, replica)
So What?

• It’s a “pseudo-durable global topic
  message bus” (PDGTMB)
  •   WTF?
• All DB transactions in there
• It’s persistent (cyclic collection)
• It’s fast (as fast as your writes)
• It’s non-blocking
• It’s easily accessible
More about this
{
    "ts" : {
         "t" : 1340948921000, "i" : 1
    },
    "h" : NumberLong("5674919573577531409"),
    "op" : "i",
    "ns" : "test.animals",
    "o" : {"_id" : "fred", "type" : "cat"
    }
}, {
    "ts" : {
         "t" : 1340948935000, "i" : 1
    },
    "h" : NumberLong("7701120461899338740"),
    "op" : "i",
    "ns" : "test.animals",
    "o" : {
         "_id" : "bill", "type" : "rat"
    }
}
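Each record carries the operation type (`op`, where "i" is an insert), the namespace (`ns`) and the written document (`o`), so picking out the writes you care about is a simple filter. A minimal sketch over plain Scala values rather than driver objects; `OplogEntry` and `OplogFilter` are hypothetical names.

```scala
// Simplified view of an oplog record: timestamp, operation, namespace, document.
final case class OplogEntry(ts: Long, op: String, ns: String, o: Map[String, Any])

object OplogFilter {
  // Documents inserted ("i") into the given namespace, in oplog order.
  def insertsInto(ns: String, entries: Seq[OplogEntry]): Seq[Map[String, Any]] =
    entries.collect { case OplogEntry(_, "i", `ns`, doc) => doc }
}
```

Applied to the two records above with `ns = "test.animals"`, this yields the documents for "fred" and "bill".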
Tapping into the Oplog

• Made easy for you!
  https://github.com/wordnik/wordnik-oss
• Same technique as:
 •   Incremental backup
 •   Snapshots
 •   Replication
Tapping into the Oplog

• Create an OplogRecordProcessor
import java.io.IOException
import scala.collection.mutable.HashSet
import com.mongodb.BasicDBObject

class OpLogReader extends OplogRecordProcessor {
  // Observer functions invoked for every oplog record
  val recordTriggers =
      new HashSet[Function1[BasicDBObject, Unit]]
  @throws(classOf[Exception])
  def processRecord(dbo: BasicDBObject) = {
    recordTriggers.foreach(t => t(dbo))
  }
  @throws(classOf[IOException])
  def close(string: String) = {}
}
Tapping into the Oplog

• Attach it to an OplogTailThread
val util = new OpLogReader
val coll: DBCollection =
  MongoDBConnectionManager.getOplog("oplog",
    "localhost", None, None).get
val tailThread = new OplogTailThread(util, coll)
tailThread.start
Tapping into the Oplog

• Add some observer functions
util.recordTriggers +=
  new Function1[BasicDBObject, Unit] {
    def apply(e: BasicDBObject): Unit =
      Profile("inspectObject", {
        totalExamined += 1
        /* do something here */
      })
  }
/* do something here */

• Like?
• Convert to business objects and act!
 •   OpLog to domain object is EASY
 •   Just process the ns that you care about
     "ns" : "test.animals"
• How?
Converting OpLog to Object

• Jackson makes this trivial
case class User(username: String, email: String,
  createdAt: Date)

// "o" is for "Object": the oplog field that holds the written document
val user = jacksonMapper.convertValue(
  dbo.get("o").asInstanceOf[DBObject],
  classOf[User])

• Reuse your DAOs? Bonus points!
• Got your objects! Now What?
Use Case 1: Alert on Action

• New account!
obj match {
  case newAccount: UserAccount => {
    /* ring the bell! */
  }
  case _ => {
    /* ignore it */
  }
}
Use case 2: What’s Trending?

• Real-time activity
case o: VisitLog =>
 Profile("ActivityMonitor:processVisit", {
   wordTracker.add(o.word)
 })
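The slides do not show `wordTracker`, so the following is only a guess at the simplest thing that could sit behind `wordTracker.add(o.word)`: a thread-safe counter with a top-N view. `WordTracker` and `top` are hypothetical, and a real trending tracker would probably also age out old visits.

```scala
import scala.collection.mutable

// Hypothetical tracker: counts words and reports the most frequent ones.
class WordTracker {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Record one visit for the word.
  def add(word: String): Unit = this.synchronized { counts(word) += 1 }

  // The n most-visited words, highest count first; ties are broken alphabetically.
  def top(n: Int): Seq[(String, Long)] = this.synchronized {
    counts.toSeq.sortBy { case (word, count) => (-count, word) }.take(n)
  }
}
```

Because the oplog tail thread feeds `add`, the web tier never has to know the tracker exists.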
Use case 3: External Analytics
case o: UserProfile => {
    getSqlDatabase().executeSql(
      "insert into user_profile values(?,?,?)",
       o.username, o.email, o.createdAt)
}

• Your Data pushes to Relational!
• Don’t mix runtime & OLAP!
Use case 4: Cloud analysis
case o: NewUserAccount => {
    getSalesforceConnector().create(
      Lead(Account.ID, o.firstName, o.lastName,
         o.company, o.email, o.phone))
}

• Pushed directly to Salesforce!
• We didn’t interrupt core engineering!
Examples

• Polling profile APIs cross cluster
• Siphoning hashtags from opLog
• Page view activity from opLog
• Health check w/o engineering
Summary

• Don’t mix up monitoring servers & your
  application
• Leave core engineering alone
• Make a tiny engineering investment now
• Let your product folks set metrics
• FOSS tools are available (and well tested!)
• The opLog is incredibly powerful
 •   Hack it!
Find out more

• Wordnik: developer.wordnik.com
• Swagger: swagger.wordnik.com
• Wordnik OSS: github.com/wordnik/wordnik-oss
• Atmosphere: github.com/Atmosphere/atmosphere
• MongoDB: www.mongodb.org
