C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 1
1. Intro
2. Problem to solve?
3. How does Flume/Solr help?
4. Syslog indexing example
5. HA, DR & scalability
Ops Architect at Cisco CCATG (WebEx)
Ensure operational readiness for complex distributed services
HA, DR, monitoring, config, deployment
Previously eBay, Excite@Home, IBM, VISA
Operations architecture, monitoring, event correlation
Cisco WebEx Meetings
• Voice, video, desktop sharing
• Meeting/Event/Support/Training Centers
• Integration with TelePresence
Cisco WebEx Social
• Social networking
• Content creation
• Integrated IM
Cisco WebEx Messenger
• IM, presence
• Integrate with voice, video
• XMPP
Participants from over 231 countries, 52% market share
2.2 billion meeting minutes per month
40.5 million meeting attendees per month
9.4 million registered hosts worldwide
4 million mobile downloads
[Figure: map of datacenters and network links. Legend: Datacenter / PoP; leased network link]
Global scale: 13 datacenters & iPoPs around the globe
Dedicated network: dual-path 10G circuits between DCs
Multi-tenant: 95k sites
Real-time collaboration: voice, desktop sharing, video, chat
People make mistakes
Hardware fails
Software fails
Even failovers sometimes fail
“If a problem has no solution, it may not be a problem,
but a fact, not to be solved, but to be coped with over time”
— Shimon Peres (“Peres’s Law”)
People/HW/SW failures are facts, not problems
Operations' main goal is to maintain high service availability
• Recovery/repair is how we cope with the above facts
• Improving recovery/repair improves availability
Unavailability = MTTR / MTBF
1/10th the MTTR is just as valuable as 10x the MTBF
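A rough worked example of the arithmetic behind the last two lines. It assumes the common steady-state definition, unavailability = MTTR / (MTBF + MTTR), which is close to MTTR / MTBF whenever repairs are much shorter than the time between failures. The MTBF and MTTR figures are made up for illustration and do not come from the talk.

```python
def unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time the service is down."""
    return mttr_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1000 hours, 1 hour to repair.
baseline = unavailability(1000.0, 1.0)
ten_x_mtbf = unavailability(10000.0, 1.0)  # failures become 10x rarer
tenth_mttr = unavailability(1000.0, 0.1)   # repairs become 10x faster

for label, value in [
    ("baseline", baseline),
    ("10x MTBF", ten_x_mtbf),
    ("1/10th MTTR", tenth_mttr),
]:
    print(f"{label:12s} unavailability = {value:.6f}")
```

Under this definition both improvements land on the same unavailability, which is the point of the slide: faster recovery buys as much as rarer failures.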
Even better: proactive
Good: reactive
Your search – What is the root cause of the outage? – did not match any documents.
[Figure: data ingestion architecture. Labels recoverable from the diagram:]
• Unstructured/semi-structured data into Flume: Log4j, File, Avro, Syslog, Thrift, AMQP, HTTP/REST
• Application state & APIs
• Flume sinks: HDFS Sink (raw data in HDFS), Solr Sink (Solr index in SolrCloud), other sinks
• Structured data: RDBMS (MySQL) via Sqoop
• Cisco UCS C240 M3 servers: 12 x 3 TB = 36 TB / server
[Figure: two-tier Flume topology. Collector tier: Flume agents in DC 1 … DC N, fed by syslog, log4j and file sources. Storage tier: Flume, HDFS and SolrCloud in DC 1 and DC 2.]
[Figure: Flume Collector server. Failover & load balancing agents send to an Avro source; a replicating fan-out flow writes to File Channel 1 and File Channel 2, which feed the DC1 and DC2 Avro sinks towards the Flume Storage tier in DC1 and DC2.]
All events replicated to both Channels
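A toy model of the replicating fan-out described on this slide, written in plain Python rather than as Flume configuration. The class and channel names are hypothetical; the only behavior taken from the slide is that every event is copied to both channels, one per destination datacenter.

```python
from collections import deque


class ReplicatingSource:
    """Toy replicating fan-out: every event is copied to every channel."""

    def __init__(self, channels):
        self.channels = channels

    def put(self, event):
        for channel in self.channels.values():
            channel.append(event)


# Hypothetical channel names, one per destination datacenter.
channels = {"dc1": deque(), "dc2": deque()}
source = ReplicatingSource(channels)

for event in ["event-a", "event-b"]:
    source.put(event)

# Drain only the DC1 channel, as if DC2 were unreachable:
# the DC2 channel keeps buffering its own copy of each event.
delivered_to_dc1 = [channels["dc1"].popleft() for _ in range(len(channels["dc1"]))]
print(delivered_to_dc1, list(channels["dc2"]))
```

Because each destination has its own channel, a slow or unavailable datacenter only grows its own buffer and does not hold back delivery to the other one.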
[Figure: Flume Storage tier server. Flume Collectors (failover & load balancing agents) send to an Avro source; a multiplexing fan-out flow writes to File Channel 1 and File Channel 2, drained by the Solr Sink into SolrCloud and the HDFS sink into HDFS.]
Routing to Solr by Flume event header
All events to HDFS
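The routing rule on this slide can be sketched in a few lines of plain Python. This is not Flume's selector API; the header name `index`, its value, and the channel names are assumptions chosen for illustration.

```python
def route(event, solr_header="index", solr_value="true"):
    """Return the channels an event is written to.

    Every event goes to the HDFS channel; only events whose header
    matches also go to the Solr channel. Header name and value are
    hypothetical.
    """
    targets = ["hdfs_channel"]
    if event["headers"].get(solr_header) == solr_value:
        targets.append("solr_channel")
    return targets


syslog_event = {"headers": {"index": "true"}, "body": "example syslog line"}
trace_event = {"headers": {}, "body": "verbose application trace"}

print(route(syslog_event))
print(route(trace_event))
```

Keeping HDFS as the unconditional target means the raw data is always retained, so anything not indexed in near real time can still be indexed later in batch.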
Isn’t Big Data “schema on read”?
• Why does Solr require a schema on write?
• Dirty little secret: there’s always a schema
• Performance & functionality vs flexibility
• Optimize operations and storage based on field type - that's how you get sub-second response times
There’s always a schema
• Application code vs. central location
Cloudera Morphlines
• Framework to simplify event transformation
• Compatible with existing grok patterns
• Reusable across multiple index workloads: Flume & M/R
[Figure: SolrSink morphline pipeline. A Flume event (headers + body) becomes a Record that is passed through a chain of commands (readLine, grok, tryRules, addValues, …, loadSolr) and loaded into Solr as a document matching schema.xml.]
Convert syslog message ...
<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234
... into Solr schema fields
Severity=[3]
Facility=[22]
host=[colo01-wxp00-ace01b-connect.webex.com]
timestamp=[2013-06-16T04:36:49.000Z]
syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
severity_label=[error]
access_token=[54asdf654]
id=[b2f839c3-dece-404f-a535-e0141ad549bf]
cisco_product=[ACE]
cisco_level=[3]
cisco_id=[251008]
cisco_code=[%ACE-3-251008]
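The Severity, Facility and severity_label values above follow from the `<179>` priority prefix of the message. A minimal sketch of that decoding, assuming the standard syslog rule PRI = facility * 8 + severity and the standard severity names; the function name is made up for this example.

```python
# Standard syslog severity names, indexed by severity code 0..7.
SEVERITY_LABELS = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def decode_pri(pri: int):
    """Split a syslog PRI value into (facility, severity, severity label)."""
    facility, severity = divmod(pri, 8)
    return facility, severity, SEVERITY_LABELS[severity]


print(decode_pri(179))  # (22, 3, 'error')
```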
Convert syslog message
<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234
Step 1: readLine reads in Flume event headers and body
Headers:
timestamp=[1371357409000]
host=[colo01-wxp00-ace01b-connect.webex.com]
category=[545f5sfsd5sf]
Severity=[3]
Facility=[22]
Body:
message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
Step 2: convertTimestamp converts epoch to ISO 8601 format
timestamp=[2013-06-16T04:36:49.000Z]
host=[colo01-wxp00-ace01b-connect.webex.com]
category=[545f5sfsd5sf]
message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
Severity=[3]
Facility=[22]
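The conversion in this step can be reproduced with the Python standard library. This sketch only illustrates the arithmetic (epoch milliseconds to a UTC ISO 8601 string); it is not the morphline command itself, and the function name is an assumption.

```python
from datetime import datetime, timezone


def epoch_millis_to_iso8601(millis: int) -> str:
    """Format epoch milliseconds as a UTC ISO 8601 timestamp with millisecond precision."""
    seconds, ms = divmod(millis, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{ms:03d}Z"


print(epoch_millis_to_iso8601(1371357409000))  # 2013-06-16T04:36:49.000Z
```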
Step 3: addValues creates new field access_token
timestamp=[2013-06-16T04:36:49.000Z]
category=[545f5sfsd5sf]
access_token=[545f5sfsd5sf]
message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
host=[colo01-wxp00-ace01b-connect.webex.com]
Severity=[3]
Facility=[22]
Step 4: tryRules creates field severity_label for severity
timestamp=[2013-06-16T04:36:49.000Z]
severity_label=[error]
access_token=[545f5sfsd5sf]
message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
host=[colo01-wxp00-ace01b-connect.webex.com]
category=[545f5sfsd5sf]
Severity=[3]
Facility=[22]
Step 5: tryRules creates new fields
syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
cisco_product=[ACE]
cisco_level=[3]
cisco_id=[251008]
cisco_code=[%ACE-3-251008]
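A sketch of the kind of extraction this step performs, using a plain Python regular expression instead of the grok patterns used in the talk. The pattern is an assumption that fits the sample message only; other Cisco message codes may not have a numeric id and would need a different pattern.

```python
import re

# Hypothetical pattern for codes shaped like %PRODUCT-LEVEL-ID, e.g. %ACE-3-251008.
CISCO_PATTERN = re.compile(
    r"(?P<syslog_message>"
    r"(?P<cisco_code>%(?P<cisco_product>[A-Z0-9_]+)-(?P<cisco_level>\d)-(?P<cisco_id>\d+))"
    r":.*)$"
)

MESSAGE = (
    "<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com "
    "Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed "
    "for server 10.240.22.111 on port 1234"
)


def extract_cisco_fields(message: str):
    """Return the extracted fields as a dict, or None if the message does not match."""
    match = CISCO_PATTERN.search(message)
    return match.groupdict() if match else None


fields = extract_cisco_fields(MESSAGE)
print(fields)
```

Returning None for non-matching messages mirrors the idea of trying one rule and falling through to the next when it does not apply.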
Step 6: sanitizeUnknownSolrFields drops non-schema fields
timestamp=[2013-06-16T04:36:49.000Z]
syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
severity_label=[error]
access_token=[545f5sfsd5sf]
host=[colo01-wxp00-ace01b-connect.webex.com]
cisco_product=[ACE]
cisco_level=[3]
cisco_id=[251008]
cisco_code=[%ACE-3-251008]
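The effect of this step amounts to a filter over the record's keys. In the sketch below the set of schema fields is an assumption, inferred from the fields that survive on this slide plus the id added in the next step; the real list would come from schema.xml.

```python
# Assumed schema fields, inferred from the slides rather than from a real schema.xml.
SCHEMA_FIELDS = {
    "id", "timestamp", "host", "syslog_message", "severity_label",
    "access_token", "cisco_product", "cisco_level", "cisco_id", "cisco_code",
}


def drop_unknown_fields(record: dict, schema_fields=SCHEMA_FIELDS) -> dict:
    """Keep only the fields that the index schema knows about."""
    return {name: value for name, value in record.items() if name in schema_fields}


record = {
    "timestamp": "2013-06-16T04:36:49.000Z",
    "host": "colo01-wxp00-ace01b-connect.webex.com",
    "category": "545f5sfsd5sf",
    "access_token": "545f5sfsd5sf",
    "Severity": "3",
    "Facility": "22",
    "severity_label": "error",
    "message": "raw syslog line",
}

clean = drop_unknown_fields(record)
print(sorted(clean))
```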
Step 7: generateUUID creates a unique id for the document
timestamp=[2013-06-16T04:36:49.000Z]
syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]
severity_label=[error]
access_token=[545f5sfsd5sf]
id=[b2f839c3-dece-404f-a535-e0141ad549bf]
host=[colo01-wxp00-ace01b-connect.webex.com]
cisco_product=[ACE]
cisco_level=[3]
cisco_id=[251008]
cisco_code=[%ACE-3-251008]
Step 8: loadSolr loads a record into a Solr server
[Figure: SolrSink morphline pipeline for syslog. A Flume syslog event (headers + body) becomes a Record that is passed through a chain of commands (readLine, grok, tryRules, addValues, …, loadSolr) and loaded into SolrCloud as a document matching schema.xml.]
[Figure: SolrCloud cluster. ZooKeeper ensemble (zk1, zk2, zk3); one Collection with Shard1, Shard2 and Shard3, each with a leader and a replica; pluggable filesystem (local, HDFS). Callouts: "Where can I index data?", "leader3", "Add doc to syslog index".]
• Collections, shards & replicas
• Pluggable file system
• Central config & coordination with ZK
• Full HA, automatic fail-over
• NRT indexing
• Automatic routing
Collection “syslog” with three shards
Special case of search
• Logs are time series data: timestamp + data
• High indexing rate, no updates
• New data is more frequently searched than old
Collection aliases
• Time partitioned collections – e.g. one collection per day
• Reduces the workload to near-real-time data only
• One-to-many collection mapping: queries go to a logical representation mapped to multiple same-schema collections
• Simplifies hot-warm-cold migration of data
Index expiration
• Old data is aged out by Collection Aliases
• Remap only the latest collection to an alias
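A small sketch of how time-partitioned collections and aliases could fit together, assuming one collection per day. The collection prefix, the alias names and the three-day retention are hypothetical; in a real cluster the resulting mapping would be applied through Solr's collection alias mechanism rather than held in a Python dict.

```python
from datetime import date, timedelta


def daily_collections(today: date, days: int, prefix: str = "syslog_") -> list:
    """Names of the collections covering the last `days` days, newest first."""
    return [
        prefix + (today - timedelta(days=n)).strftime("%Y%m%d")
        for n in range(days)
    ]


def build_aliases(today: date, retention_days: int) -> dict:
    """Map hypothetical alias names to the collections they should point at."""
    collections = daily_collections(today, retention_days)
    return {
        "syslog_write": collections[:1],  # new documents go to today's collection only
        "syslog_search": collections,     # queries fan out over the retained days
    }


aliases = build_aliases(date(2013, 6, 16), retention_days=3)
print(aliases)
```

Rebuilding the mapping each day drops the oldest collection from the search alias, which is how old data ages out without any per-document deletes.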
Solr
• No multi-datacenter cluster support
HDFS
• No multi-datacenter cluster support
Options?
• All our services must survive DC outage
• ... so should logging and indexing
[Figure: DR scenario with events replicated to both datacenters. Collector tier: Flume agents in DC 1, DC 2, … DC N, fed by syslog, log4j and file sources. Storage tier: Flume, HDFS and SolrCloud in DC 1 and DC 2. Callouts:]
• Planned or unplanned outage
• Flume Collector disk channel buffering DC1 events
• DC1 Hadoop cluster back online after outage
• Replicate aggregate data
[Figure: DR scenario with data sent to a single datacenter. Collector tier: Flume agents in DC 1, DC 2, … DC N, fed by syslog, log4j and file sources. Storage tier: Flume, HDFS and SolrCloud in DC 1 and DC 2, linked by distcp. Callouts:]
• Manual CNAME change to DC2
• DC1 back online, sync data from DC2
• Data sent only to a single DC
• DNS CNAME change back to DC1
• Flip distcp the other way
• Flume buffering events at collector tier
• Create indexes with M/R
Tiers to scale
• Flume Collector tier
• Flume Storage tier
• SolrCloud
100–5000 servers per datacenter
[Figure: scaling the Flume Collector tier. More agents and data are handled by adding Flume Collector servers, each with an Avro source, a replicating fan-out flow, File Channels 1 and 2, and DC1 and DC2 Avro sinks. Callouts: FileChannel: 14 MB/sec; NIC: 100 MB/sec (shown twice).]
Max per server:
14 MB/s
1.2 TB/day
70k events/s
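The three per-server figures are consistent with each other, as this back-of-the-envelope check shows. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the sizing helper and the 250k events/s target are hypothetical additions for illustration.

```python
import math

MB = 10**6
TB = 10**12

rate_bytes_per_sec = 14 * MB  # FileChannel throughput from the slide
events_per_sec = 70_000       # per-server event rate from the slide

per_day_tb = rate_bytes_per_sec * 86_400 / TB
avg_event_bytes = rate_bytes_per_sec / events_per_sec


def collectors_needed(total_events_per_sec: int, per_server: int = events_per_sec) -> int:
    """Servers needed for a target event rate, ignoring headroom and redundancy."""
    return math.ceil(total_events_per_sec / per_server)


print(f"{per_day_tb:.2f} TB/day, {avg_event_bytes:.0f} bytes/event")
print(collectors_needed(250_000))
```

The figures imply an average event of about 200 bytes, and that the file channel rather than the 100 MB/sec NIC is the limit on a single server.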
[Figure: scaling the Flume Storage tier. Avro sinks 1 … N on the DC 1, DC 2, … DC N collectors feed several storage-tier Flume servers in DC 1 and DC 2; each server has an Avro source, a multiplexing fan-out flow, File Chan1 and File Chan2, an HDFS sink and a Solr sink.]
Max per server:
14 MB/s
1.2 TB/day
70k events/s
[Figure: scaling SolrCloud. ZooKeeper ensemble (zk1, zk2, zk3); SolrCloud cluster with Shard1, Shard2 and Shard3, each with a leader and a replica; pluggable filesystem (local, HDFS). Inputs: new logs to index, search queries.]
1000 tx/sec/core
2 x 8 cores: 16k tx/sec
3 shards: 3 x 16k = 48k tx/sec
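The capacity estimate on this slide is a straight multiplication, sketched below. It assumes indexing throughput scales linearly with cores and shards, which the slide implies but a real cluster may not deliver; the 70k events/s target in the usage line is borrowed from the Flume per-server figure purely as an example.

```python
import math


def cluster_index_rate(tx_per_core: int, cores_per_server: int, shards: int) -> int:
    """Estimated indexing rate, assuming perfectly linear scaling."""
    return tx_per_core * cores_per_server * shards


def shards_needed(target_tx_per_sec: int, tx_per_core: int = 1000, cores_per_server: int = 16) -> int:
    """Shards required for a target rate under the same linear assumption."""
    return math.ceil(target_tx_per_sec / (tx_per_core * cores_per_server))


print(cluster_index_rate(1000, 2 * 8, 3))  # 48000
print(shards_needed(70_000))               # 5
```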
Central syslog servers
• Network and OS system messages forwarded to several central syslog servers
Forward syslog to Solr using Flume Morphline SolrSink
• Parse messages with Morphline and grok patterns
SolrCloud
• Index log lines as documents into a Collection (i.e. index)
HUE Solr search
• Simple UI to build a customized search page layout with faceting and sorting
• Easy drill-down with multiple facets: severity, datacenter, hostname, etc.
Screen shots
Search by time
Sort by selected field
Facets by selected fields
Wildcard query by field
Highlight the query keywords
Data sources: REST/JSON, log4j, syslog, Avro, Thrift
Parsing: Cloudera Morphlines
NRT Indexing: SolrCloud embedded in CDH
Batch indexing: MapReduce
Analytics: use your favorite tool; raw detailed data is stored in HDFS
email: ari.flink@webex.com
twitter: @raaka
Thank you.

Large scale near real-time log indexing with Flume and SolrCloud


Editor's Notes

  1. As of Feb 2013
  4. CEP: Complex Event Processing