1

Ingesting HDFS data into Solr using Spark

Wolfgang Hoschek (whoschek@cloudera.com)
Software Engineer @ Cloudera Search
Lucene/Solr Meetup, Jan 2015
The Enterprise Data Hub

[Diagram: Unified scale-out storage for any type of data. Elastic, fault-tolerant, self-healing, in-memory capabilities. Resource management spanning Online NoSQL DBMS, Analytic MPP DBMS, Search Engine, Batch Processing, Stream Processing, Machine Learning, SQL, Streaming, and File System (NFS). System management and data management: metadata, security, audit, lineage.]

•  Multiple processing frameworks
•  One pool of data
•  One set of system resources
•  One management interface
•  One security framework

2
Apache Spark

•  Mission
  •  Fast and general engine for large-scale data processing
•  Speed
  •  Advanced DAG execution engine that supports cyclic data flow and in-memory computing
•  Ease of Use
  •  Write applications quickly in Java, Scala or Python
•  Generality
  •  Combine batch, streaming, and complex analytics
•  Successor to MapReduce

3
What is Cloudera Search?

Interactive search for Hadoop
•  Full-text and faceted navigation
•  Batch, near real-time, and on-demand indexing

Open Source
•  100% Apache, 100% Solr
•  Standard Solr APIs

Apache Solr integrated with CDH
•  Established, mature search with vibrant community
•  Incorporated as part of the Hadoop ecosystem
  •  Apache Flume, Apache HBase
  •  Apache MapReduce, Kite Morphlines
  •  Apache Spark, Apache Crunch

4
Cloudera Search Architecture Overview

[Diagram: Online streaming data flows through Flume (raw, filtered, or annotated data; NRT data indexed w/ Morphlines) into HDFS and SolrCloud cluster(s). Spark & MapReduce batch indexing w/ Morphlines reads data from HDFS and applies GoLive updates to SolrCloud. An HBase cluster holds OLTP data; events are NRT-replicated and indexed w/ Morphlines. End-user client apps (e.g. Hue) issue search queries; Cloudera Manager manages the system.]

5
Customizable Hue UI

•  Navigated, faceted drill down
•  Full text search, standard Solr API and query language

http://gethue.com

6
Scalable Batch ETL & Indexing

[Diagram: Files or HBase tables in HDFS feed indexers w/ Morphlines, which build index shards served by Solr servers.]

Solr and MapReduce
•  Flexible, scalable, reliable batch indexing
•  On-demand indexing, cost-efficient re-indexing
•  Start serving new indices without downtime
•  “MapReduceIndexerTool”
•  “HBaseMapReduceIndexerTool”
•  “CrunchIndexerTool on MR”

Solr and Spark
•  “CrunchIndexerTool on Spark”

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

7
Streaming ETL (Extract, Transform, Load)

Kite Morphlines
•  Consume any kind of data from any kind of data source, process it, and load it into Solr, HDFS, HBase or anything else
•  Simple and flexible data transformation
•  Extensible set of transformation commands
•  Reusable across multiple workloads
•  For Batch & Near Real Time
•  Configuration over coding
  •  reduces time & skills
•  ASL licensed on Github
  https://github.com/kite-sdk/kite

[Diagram: syslog → Flume Agent (Solr sink) → Command: readLine → Command: grok → Command: loadSolr → Solr; an Event becomes a Record, then a Document.]

8
Morphline Example – syslog with grok

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22

9
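To make the grok step above concrete, here is a plain-Java sketch (not from the deck) that extracts the same output-record fields from the example input using java.util.regex. The pattern is a hand-expanded approximation of the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), and the class and method names are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyslogGrokSketch {

    // Hand-expanded approximation of the grok patterns used in the morphline:
    // POSINT -> \d+, SYSLOGTIMESTAMP -> \w{3}\s+\d+ \d{2}:\d{2}:\d{2},
    // SYSLOGHOST and DATA -> \S+ (simplified), GREEDYDATA -> .*
    private static final Pattern SYSLOG = Pattern.compile(
        "<(\\d+)>(\\w{3}\\s+\\d+ \\d{2}:\\d{2}:\\d{2}) (\\S+) (\\S+?)(?:\\[(\\d+)\\])?: (.*)");

    /** Returns the extracted fields, analogous to the morphline output record. */
    public static Map<String, String> parse(String line) {
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> record = new LinkedHashMap<>();
        record.put("syslog_pri", m.group(1));
        record.put("syslog_timestamp", m.group(2));
        record.put("syslog_hostname", m.group(3));
        record.put("syslog_program", m.group(4));
        record.put("syslog_pid", m.group(5));
        record.put("syslog_message", m.group(6));
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> record =
            parse("<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22");
        record.forEach((k, v) -> System.out.println(k + ":" + v));
    }
}
```

Running main prints the same field:value pairs as the Output Record above.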
Example Java Driver Program –
Can be wrapped into Spark functions

/** Usage: java ... <morphline.conf> <dataFile1> ... <dataFileN> */
// (imports from java.io and org.kitesdk.morphline.{api,base} omitted for brevity)
public static void main(String[] args) throws IOException {
  // compile the morphline.conf file on the fly
  File conf = new File(args[0]);
  MorphlineContext ctx = new MorphlineContext.Builder().build();
  Command morphline = new Compiler().compile(conf, null, ctx, null);
  // process each input data file
  Notifications.notifyBeginTransaction(morphline);
  for (int i = 1; i < args.length; i++) {
    InputStream in = new FileInputStream(new File(args[i]));
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, in);
    morphline.process(record);
    in.close();
  }
  Notifications.notifyCommitTransaction(morphline);
}

10
Scalable Batch Indexing

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

[Diagram: Input files → Extractors (Mappers) → Leaf shards (Reducers: S0_0_0, S0_0_1, S0_1_0, S0_1_1, S1_0_0, S1_0_1, S1_1_0, S1_1_1) → Root shards (Mappers: S0, S1) → GoLive into live SolrCloud.]

•  Morphline runs inside Mapper
•  Reducers build local Solr indexes
•  Mappers merge microshards
•  GoLive merges into live SolrCloud
•  Can exploit all reducer slots even if #reducers >> #solrShards
•  Great throughput but poor latency
•  Only inserts, no updates & deletes!
•  Want to migrate from MR to Spark

11
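The microshard trick in the bullets (using R reducers for S Solr shards even when R >> S) can be sketched as a routing invariant: each document is deterministically assigned to one of R leaf shards, grouped so that leaf shards merge cleanly into their final shard. This is a hypothetical illustration, not the MapReduceIndexerTool source; the method names and hash layout are assumptions:

```java
public class MicroshardRoutingSketch {

    // With S final Solr shards and K microshards per shard (so R = S * K reducers),
    // route each document to leaf shard (shard * K + micro). Every reducer gets work,
    // and the later merge phase only combines leaf shards of the same final shard.
    public static int leafShard(String docId, int numShards, int microshardsPerShard) {
        int h = docId.hashCode();
        int shard = Math.floorMod(h, numShards);
        int micro = Math.floorMod(h / numShards, microshardsPerShard);
        return shard * microshardsPerShard + micro;
    }

    // The merge phase recovers the final Solr shard from the leaf shard id.
    public static int finalShard(int leafShard, int microshardsPerShard) {
        return leafShard / microshardsPerShard;
    }

    public static void main(String[] args) {
        // 2 Solr shards, 4 microshards each: 8 reducers can run in parallel
        int leaf = leafShard("doc-42", 2, 4);
        System.out.println("leaf=" + leaf + " final=" + finalShard(leaf, 4));
    }
}
```

The invariant is that finalShard(leafShard(id, S, K), K) always equals the plain hash routing hash(id) mod S, so merging microshards never moves a document across shard boundaries.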
  
Batch Indexing with CrunchIndexerTool

spark-submit ... CrunchIndexerTool --morphline-file morphline.conf ...
or
hadoop ... CrunchIndexerTool --morphline-file morphline.conf ...

•  Morphline runs inside Spark executors
•  Morphline sends docs to live SolrCloud
•  Good throughput and good latency
•  Supports inserts, updates & deletes
•  Flag to run on Spark or MapReduce

[Diagram: Input files → Extractors (Executors/Mappers) → SolrCloud shards S0, S1.]

12
More CrunchIndexerTool features (1/2)

•  Implemented with the Apache Crunch library
•  Eases migration from the MapReduce execution engine to the Spark execution engine – can run on either engine
•  Supported Spark modes
  •  Local (for testing)
  •  YARN client
  •  YARN cluster (for production)
•  Efficient batching of Solr updates, deleteById, and deleteByQuery
•  Efficient locality-aware processing for splittable HDFS files
  •  avro, parquet, text lines

13
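The batching bullet boils down to buffering updates and sending them in fixed-size groups instead of one round trip per document. Here is a minimal sketch with assumed names (not the CrunchIndexerTool implementation), where a Consumer stands in for the Solr client call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class UpdateBatcherSketch {

    private final int batchSize;
    private final Consumer<List<String>> sender;  // stand-in for a SolrClient add/delete call
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    public UpdateBatcherSketch(int batchSize, Consumer<List<String>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Buffers one update; sends a whole batch once batchSize updates accumulated. */
    public void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends any buffered updates; called once more at the end of the task. */
    public void flush() {
        if (!buffer.isEmpty()) {
            sender.accept(new ArrayList<>(buffer));
            buffer.clear();
            flushes++;
        }
    }

    public int getFlushes() { return flushes; }

    public static void main(String[] args) {
        UpdateBatcherSketch batcher =
            new UpdateBatcherSketch(100, batch -> System.out.println("sent " + batch.size()));
        for (int i = 0; i < 250; i++) {
            batcher.add("doc" + i);
        }
        batcher.flush();  // 250 docs cost 3 round trips instead of 250
        System.out.println("flushes=" + batcher.getFlushes());
    }
}
```

The same buffering shape applies to deleteById and deleteByQuery batches; the trade-off is one knob (batch size) between per-request overhead and memory held per executor.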
More CrunchIndexerTool features (2/2)

•  Dry-run mode for rapid prototyping
•  Sends commit to Solr on job success
•  Inherits fault tolerance & retry from Spark (and MR)
•  Security in progress: Kerberos token delegation, SSL
•  ASL licensed on Github
  https://github.com/cloudera/search/tree/cdh5-1.0.0_5.3.0/search-crunch

14
Conclusions

•  Easy migration from MapReduce to Spark
•  Also supports updates & deletes & good latency
•  Recommendation
  •  Use MapReduceIndexerTool for large-scale batch ingestion use cases where updates or deletes of existing documents in Solr are not required
  •  Use CrunchIndexerTool for all other use cases
•  Shipping in CDH 5.2 and above

15
©2014 Cloudera, Inc. All rights reserved.