System insight without Interference
- 4. Monitoring?
• Disk space
• Host checks
• System load
• Network
• Necessary (but insufficient)
- 5. Why Insufficient?
• What about Services?
• Database running?
• HTTP traffic?
• Install Munin Node!
• Some (good) service-level insight
- 7. Your boss LOVES charts
• “OH pretty colors!”
• “up and to the right!”
• “it MUST be important!”
- 8. Good vs. Bad?
• Database calls avg 1ms?
• Great! DB working well
• But called 1M times per page load/user?
• Most tools are for system, not your app
• By the time you know, it’s too late
• Need business metrics monitoring!
- 9. Enter APM
• Application Performance Monitoring
• Many flavors, degrees of integration
• Heavy: transaction monitoring, code performance,
heap, memory analysis
• Medium: home-grown profiling
• Light: digest your logs (failure forensics)
• What you need depends on architecture,
business + technology stage
- 11. APM @ Wordnik
• Micro services make the system
• API calls are the unit of work!
[Diagram label: Monolithic application]
- 12. Monitoring API Calls
• Every API must be profiled
• Other logic as needed
• Database calls
• Connection manager
• etc...
• Anything that might matter!
- 13. How?
• Wordnik-OSS Profiler for Scala
• Apache 2.0 License, available in Maven Central
• Profiling Arbitrary code block:
import com.wordnik.util.perf.Profile
Profile("create a cat", {/* do something */})
• Profiling an API call:
Profile("/store/purchase", {/* do something */})
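A minimal sketch of what a call like the ones above might do underneath, assuming the profiler times a block and aggregates the result per name. MiniProfile, Counter, record and get are hypothetical names for illustration, not the Wordnik-OSS API, and the curried call shape differs from Profile(name, block).

```scala
import scala.collection.mutable

// Hypothetical aggregate for one named block: call count and total time.
final case class Counter(name: String, count: Long, totalMillis: Long) {
  def avgMillis: Double = if (count == 0) 0.0 else totalMillis.toDouble / count
}

// Hypothetical stand-in for a profiler; not the Wordnik-OSS implementation.
object MiniProfile {
  private val counters = mutable.Map.empty[String, Counter]

  // Runs `block`, records its wall-clock duration under `name`, returns its result.
  def apply[T](name: String)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally {
      val elapsedMillis = (System.nanoTime() - start) / 1000000L
      record(name, elapsedMillis)
    }
  }

  // Adds one observation to the named counter.
  def record(name: String, elapsedMillis: Long): Unit = synchronized {
    val c = counters.getOrElse(name, Counter(name, 0L, 0L))
    counters(name) = c.copy(count = c.count + 1, totalMillis = c.totalMillis + elapsedMillis)
  }

  // Reads the current aggregate for a name, if any call was recorded.
  def get(name: String): Option[Counter] = synchronized { counters.get(name) }
}
```

Usage: MiniProfile("/store/purchase") { /* do something */ } wraps the call; the duration is recorded even if the block throws, because recording happens in the finally clause.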
- 14. Profiler gives you…
• Nearly free*** tracking
• Simple aggregation
• Trigger mechanism
• Actions on time spent “doing things”:
Profile.triggers += new Function1[ProfileCounter, Unit] {
  // Called with each profile counter; the function returns Unit, so nothing is returned
  def apply(counter: ProfileCounter): Unit = {
    if (counter.name == "getDb" && counter.duration > 5000)
      wakeUpSysAdminAndEveryoneWhoCanFixShit(Urgency.NOW)
  }
}
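A minimal sketch of a trigger mechanism, assuming triggers are plain functions called with every finished sample. Sample, Triggers, register and fire are hypothetical names, not the library's own types.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical record of one finished, timed call.
final case class Sample(name: String, durationMillis: Long)

// Hypothetical trigger registry: every registered function sees every sample.
object Triggers {
  private val triggers = ListBuffer.empty[Sample => Unit]

  // Registers a function to run on each sample.
  def register(trigger: Sample => Unit): Unit = synchronized {
    triggers += trigger
    ()
  }

  // Passes the sample to all registered triggers, in registration order.
  def fire(sample: Sample): Unit = synchronized {
    triggers.foreach(t => t(sample))
  }
}
```

Usage: register a function that checks sample.name and sample.durationMillis against a threshold, then call fire from wherever timings are recorded.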
- 15. Profiler gives you…
• Same trigger as the previous slide
• This is intrusive on your codebase
- 16. Accessing Profile Data
• Easy to get in code
ProfileScreenPrinter.dump
• Output where you want
logger.info(ProfileScreenPrinter.toString)
• Send to logs, email, etc.
- 17. Accessing Profile Data
• Easier to get via API with Swagger-JAXRS
import com.wordnik.resource.util
@Path("/activity.json")
@Api("/activity")
@Produces(Array("application/json"))
class ProfileResource extends ProfileTrait
- 20. Is Aggregate Data Enough?
• Probably not
• Not Actionable
• Have calls increased? Decreased?
• Faster response? Slower?
- 21. Make it Actionable
• “In a 3 hour window, I expect 300,000
views per server”
• Poll & persist the counters
• Example: Log page views, every min
{
"_id" : "web1-word-page-view-20120625151812",
"host" : "web1",
"count" : 627172,
"timestamp" : NumberLong("1340637492247")
},{
"_id" : "web1-word-page-view-20120625151912",
"host" : "web1",
"count" : 627372,
"timestamp" : NumberLong("1340637552778")
}
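A minimal sketch of turning two persisted snapshots into a rate, assuming the count field is a running total that only increases. Snapshot and CounterRates are hypothetical names for illustration.

```scala
// Hypothetical view of one persisted counter document.
final case class Snapshot(host: String, count: Long, timestampMillis: Long)

object CounterRates {
  // Events per minute between two snapshots of an increasing counter.
  def perMinute(earlier: Snapshot, later: Snapshot): Double = {
    require(later.timestampMillis > earlier.timestampMillis, "snapshots must be ordered in time")
    val delta = later.count - earlier.count
    val minutes = (later.timestampMillis - earlier.timestampMillis) / 60000.0
    delta / minutes
  }

  // Per-minute rate implied by a target stated over a window of whole hours.
  def expectedPerMinute(target: Long, windowHours: Int): Double =
    target.toDouble / (windowHours * 60)
}
```

With the two documents above, the counter grew by 200 in a little over a minute, so the rate comes out just under 200 views per minute. The stated expectation of 300,000 views in 3 hours works out to roughly 1,667 per minute, which is the kind of comparison a windowed monitor can act on.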
- 24. That’s not Actionable!
• But it’s pretty
• What’s missing?
• Custom time window
• APIs to track?
• Low + high watermarks
• Too much custom engineering
- 26. Make it Actionable
• Swagger + a tiny bit of engineering
• Let your *product* people create monitors, set
goals
• A Check: specific API call mapped to a
service function
{
"name": "word-page-view",
"path": "/word/*/wordView (post)",
"checkInterval": 60,
"healthSpan": 300,
"minCount": 300,
"maxCount": 100000
}
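A minimal sketch of evaluating such a check, assuming checkInterval is the polling period in seconds, healthSpan is the window in seconds, and minCount/maxCount bound the calls seen in that window. These interpretations, and the names Check, Health and CheckEvaluator, are assumptions rather than the deck's implementation.

```scala
// Mirrors the JSON fields of a check; field meanings are assumed (see lead-in).
final case class Check(
  name: String,
  path: String,
  checkInterval: Int,
  healthSpan: Int,
  minCount: Long,
  maxCount: Long
)

sealed trait Health
case object Healthy extends Health
final case class Unhealthy(reason: String) extends Health

object CheckEvaluator {
  // `observed` is the number of calls seen during the last `healthSpan` seconds.
  def evaluate(check: Check, observed: Long): Health =
    if (observed < check.minCount)
      Unhealthy(s"${check.name}: $observed calls is below minCount ${check.minCount}")
    else if (observed > check.maxCount)
      Unhealthy(s"${check.name}: $observed calls is above maxCount ${check.maxCount}")
    else Healthy
}
```

Both watermarks matter: too few calls suggests the feature is broken or unreachable, too many suggests abuse or a runaway client.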
- 27. Make it Actionable
• A Service Type: a collection of checks
which make a functional unit
{
"name": "www-api",
"checks": [
"word-of-the-day",
"word-page-view",
"word-definitions",
"user-login",
"api-account-signup",
"api-account-activated"
]
}
- 28. Make it Actionable
• A Host: “directions” to get to the checks
{
"host": "ip-10-132-43-114",
"path": "/v4/health.json/profile?api_key=XYZ",
"serviceType": "www-api"
},
{
"host": "ip-10-130-134-82",
"path": "/v4/health.json/profile?api_key=XYZ",
"serviceType": "www-api"
}
- 31. Make it Actionable
• Point Nagios at this!
serviceHealth.json/status/www-api?explodeOnFailure=true
• Get a 500, get an alert
• Metrics from Product
• Based on YOUR app
• Treat like system failure
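A minimal sketch of the status decision, assuming explodeOnFailure=true means "answer with HTTP 500 when any check in the service type is failing" so that a plain HTTP probe can alert on it. ServiceStatus and its methods are hypothetical names.

```scala
object ServiceStatus {
  // Maps per-check results (check name -> healthy?) to an HTTP status code.
  def httpStatus(results: Map[String, Boolean], explodeOnFailure: Boolean): Int = {
    val allHealthy = results.values.forall(identity)
    if (!allHealthy && explodeOnFailure) 500 else 200
  }

  // Names of the failing checks, sorted, for the response body or the alert text.
  def failing(results: Map[String, Boolean]): List[String] =
    results.collect { case (name, false) => name }.toList.sorted
}
```

Without the flag the endpoint can still report failures in its body while returning 200, which suits dashboards; with the flag it behaves like a broken host as far as the monitoring system is concerned.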
- 33. Is this Enough?
• System monitoring
• Aggregate monitoring
• Windowed monitoring
• Object monitoring?
• Action on a specific event/object
• Why!?
- 34. Object-level Actions
• Any back-end engineer can build this
• But shouldn’t
• ETL to a cube?
• Run BI queries against production?
• Best way to “siphon” data from production
w/o intrusive engineering?
- 35. Avoiding Code Invasion
• We use MongoDB everywhere
• We use > 1 server wherever we use
MongoDB
• We have an opLog record against
everything we do
- 36. What is the OpLog
• All participating members have one
• Capped collection of all write ops
[Diagram: timeline t0 to t3 across a primary and two replicas]
- 37. So What?
• It’s a “pseudo-durable global topic
message bus” (PDGTMB)
• WTF?
• All DB transactions in there
• It’s persistent (cyclic collection)
• It’s fast (as fast as your writes)
• It’s non-blocking
• It’s easily accessible
- 38. More about this
{
"ts" : {
"t" : 1340948921000, "i" : 1
},
"h" : NumberLong("5674919573577531409"),
"op" : "i",
"ns" : "test.animals",
"o" : {"_id" : "fred", "type" : "cat"
}
}, {
"ts" : {
"t" : 1340948935000, "i" : 1
},
"h" : NumberLong("7701120461899338740"),
"op" : "i",
"ns" : "test.animals",
"o" : {
"_id" : "bill", "type" : "rat"
}
}
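A minimal sketch of picking out the records of interest, using a plain case class in place of the driver's BasicDBObject so the example stays self-contained. OplogEntry and OplogFilter are hypothetical names; the op codes "i", "u" and "d" are the oplog's insert, update and delete markers.

```scala
// Simplified view of an oplog record: operation, namespace, and the object.
final case class OplogEntry(op: String, ns: String, o: Map[String, Any])

object OplogFilter {
  // Keeps only inserts ("i") into the namespaces we care about.
  def insertsInto(namespaces: Set[String])(entries: Seq[OplogEntry]): Seq[OplogEntry] =
    entries.filter(e => e.op == "i" && namespaces.contains(e.ns))
}
```

Filtering on ns first keeps the observer cheap: everything outside the chosen collections is dropped before any conversion to business objects happens.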
- 39. Tapping into the Oplog
• Made easy for you!
https://github.com/wordnik/wordnik-oss
- 40. Tapping into the Oplog
• Incremental backup, snapshots, replication: same technique!
- 41. Tapping into the Oplog
• Create an OpLogProcessor
class OpLogReader extends OplogRecordProcessor {
val recordTriggers =
new HashSet[Function1[BasicDBObject, Unit]]
@throws(classOf[Exception])
def processRecord(dbo: BasicDBObject) = {
recordTriggers.foreach(t => t(dbo))
}
@throws(classOf[IOException])
def close(string: String) = {}
}
- 42. Tapping into the Oplog
• Attach it to an OpLogTailThread
val util = new OpLogReader
val coll: DBCollection =
(MongoDBConnectionManager.getOplog("oplog",
"localhost", None, None)).get
val tailThread = new OplogTailThread(util, coll)
tailThread.start
- 43. Tapping into the Oplog
• Add some observer functions
util.recordTriggers +=
  new Function1[BasicDBObject, Unit] {
    def apply(e: BasicDBObject): Unit =
      Profile("inspectObject", {
        totalExamined += 1
        /* do something here */
      })
  }
- 44. /* do something here */
• Like?
• Convert to business objects and act!
• OpLog to domain object is EASY
• Just process the ns that you care about
"ns" : "test.animals"
• How?
- 45. Converting OpLog to Object
• Jackson makes this trivial
case class User(username: String, email: String,
createdAt: Date)
val user = jacksonMapper.convertValue(
dbo.get("o").asInstanceOf[DBObject],
classOf[User])
• Reuse your DAOs? Bonus points!
• Got your objects!
- 46. Converting OpLog to Object
• “o” is for “Object”
• Got your objects! Now what?
- 47. Use Case 1: Alert on Action
• New account!
obj match {
case newAccount: UserAccount => {
/* ring the bell! */
}
case _ => {
/* ignore it */
}
}
- 48. Use case 2: What’s Trending?
• Real-time activity
case o: VisitLog =>
Profile("ActivityMonitor:processVisit", {
wordTracker.add(o.word)
})
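A minimal sketch of a tracker behind a call like wordTracker.add, assuming it simply counts occurrences and can report the most frequent words. The deck does not show its tracker; WordTracker and top are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical in-memory tracker: counts words and reports the most frequent.
final class WordTracker {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Records one visit for the given word.
  def add(word: String): Unit = synchronized {
    counts(word) += 1
  }

  // Top `n` words by count, ties broken alphabetically.
  def top(n: Int): List[(String, Long)] = synchronized {
    counts.toList.sortBy { case (word, count) => (-count, word) }.take(n)
  }
}
```

A real trending view would also age out old visits, for example by keeping one tracker per time bucket and dropping the oldest bucket.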
- 49. Use case 3: External Analytics
case o: UserProfile => {
getSqlDatabase().executeSql(
"insert into user_profile values(?,?,?)",
o.username, o.email, o.createdAt)
}
- 50. Use case 3: External Analytics
• Same insert as the previous slide: your data pushes to relational!
• Don’t mix runtime & OLAP!
- 51. Use case 4: Cloud analysis
case o: NewUserAccount => {
getSalesforceConnector().create(
Lead(Account.ID, o.firstName, o.lastName,
o.company, o.email, o.phone))
}
- 52. Use case 4: Cloud analysis
• Same code as the previous slide: pushed directly to Salesforce!
• We didn’t interrupt core engineering!
- 53. Examples
• Polling profile APIs cross cluster
- 54. Examples
• Siphoning hashtags from opLog
- 55. Examples
• Page view activity from opLog
- 56. Examples
• Health check w/o engineering
- 57. Summary
• Don’t mix up monitoring servers & your
application
• Leave core engineering alone
• Make a tiny engineering investment now
• Let your product folks set metrics
• FOSS tools are available (and well tested!)
• The opLog is incredibly powerful
• Hack it!
- 58. Find out more
• Wordnik: developer.wordnik.com
• Swagger: swagger.wordnik.com
• Wordnik OSS: github.com/wordnik/wordnik-oss
• Atmosphere: github.com/Atmosphere/atmosphere
• MongoDB: www.mongodb.org