Apache Hive Hook
2013. 8
Minwoo Kim
michael.kim@nexr.com
Apache Hive Hook
• I made this deck because Ryan asked me about Hive hooks and I
could not find any information about them in the Hive wiki.
• I hope it helps when you develop applications on Hive and want
to get extra information while a query is executing.
• This document is based on the release-0.11 tag.
• Source:
- https://github.com/apache/hive (mirror of apache hive)
What is a hook?
• As you know, this is a general computer programming technique,
but as a quick recap:
• Hooking
- Techniques for intercepting function calls or
messages or events in an operating system, applications,
and other software components.
• Hook
- Code that handles intercepted function calls, events or
messages
Hive provides some hooking points
• pre-execution
• post-execution
• execution-failure
• pre- and post-driver-run
• pre- and post-semantic-analyze
• metastore-initialize
How to set up hooks in Hive
hive-site.xml
• Setting the hook property:
<property>
  <name>hive.exec.pre.hooks</name>
  <value></value>
  <description>
    Comma-separated list of pre-execution hooks to be invoked for each statement.
    A pre-execution hook is specified as the name of a Java class which implements
    the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
  </description>
</property>
• Setting the path of the jars that contain the implementations of the hook interfaces or abstract classes:
<property>
  <name>hive.aux.jars.path</name>
  <value></value>
</property>
• You can use hive.added.jars.path instead of hive.aux.jars.path.
Hive hook properties and interfaces
Property | Interface or abstract class
hive.exec.pre.hooks | org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PreExecute is deprecated)
hive.exec.post.hooks | org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PostExecute is deprecated)
hive.exec.failure.hooks | org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
hive.metastore.init.hooks | org.apache.hadoop.hive.metastore.MetaStoreInitListener
hive.exec.driver.run.hooks | org.apache.hadoop.hive.ql.HiveDriverRunHook
hive.semantic.analyzer.hook | org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook
When do those hooks fire?
• You can submit a query to Hive through the
following entry points
- CliDriver main method (called by shell script)
- HCatCli main method (called by shell script)
- HiveServer (called by thrift client)
- HiveServer2 (called by thrift client or beeline)
CliDriver
CliDriver.main() ➔ run() ➔ executeDriver() ➔ processLine() ➔ processCmd() ➔ is remote?
  yes: ↳ CliSessionState.getClient() ↳ HiveClient.execute() ➠
  no: ➔ processLocalCmd() ➔ Driver.run() ➠
HCatCli
HCatCli.main() ➔ processLine() ➔ processCmd()
➔ HCatDriver.run() ⤇ Driver.run() ➠
HiveServer
HiveServer.execute() ➔ Driver.run() ➠
HiveServer2
ThriftCLIService.ExecuteStatement() ➔ CLIService.executeStatement()
  ↳ SessionManager.getSession()
  ↳ HiveSession.executeStatement()
  ↳ OperationManager.newExecuteStatementOperation()
  ↳ SQLOperation.run() ➔ Driver.run() ➠
• OperationManager.newExecuteStatementOperation() acts as a kind of factory for operations:
- AddResourceOperation, DeleteResourceOperation, DfsOperation,
GetCatalogsOperation, GetColumnsOperation, GetFunctionsOperation,
GetSchemasOperation, GetTablesOperation, GetTableTypesOperation,
GetTypeInfoOperation, SetOperation, SQLOperation
➠ Driver.run()
➔ Driver.runInternal()
  ↳ Driver.compile()
    ↳ ParseDriver.parse() ↝ HiveParser
• HiveParser.g
- SelectClauseParser.g
- FromClauseParser.g
- IdentifiersParser.g
• ParseDriver.parse()
- Command String ➡ root of the AST
↳ SemanticAnalyzer.analyze()
• SemanticAnalyzerFactory.get(conf, ast)
- SemanticAnalyzer, ColumnStatsSemanticAnalyzer, ExplainSemanticAnalyzer,
ExportSemanticAnalyzer, FunctionSemanticAnalyzer,
ImportSemanticAnalyzer, LoadSemanticAnalyzer, MacroSemanticAnalyzer
➔ analyzeInternal()
• processPositionAlias()
• doPhase1()
• getMetaData()
• genPlan()
• Optimizer.optimize()
• MapReduceCompiler.compile()
genPlan() builds the operator tree from operators such as:
• FilterOperator
• SelectOperator
• ForwardOperator
• FileSinkOperator
• ScriptOperator
• PTFOperator
• ReduceSinkOperator
• ExtractOperator
• GroupByOperator
• JoinOperator
• MapJoinOperator
• SMBMapJoinOperator
• LimitOperator
• TableScanOperator
• UnionOperator
• UDTFOperator
• LateralViewJoinOperator
• LateralViewForwardOperator
• HashTableDummyOperator
• HashTableSinkOperator
• DummyStoreOperator
• DemuxOperator
• MuxOperator
Optimizer.optimize() applies transformations such as:
• PredicateTransitivePropagate
• PredicatePushDown
• PartitionPruner
• PartitionConditionRemover
• ListBucketingPruner
• ColumnPruner
• SkewJoinOptimizer
• RewriteGBUsingIndex
• GroupByOptimizer
• SamplePruner
• MapJoinProcessor
• BucketMapJoinOptimizer
• SortedMergeBucketMapJoinOptimizer
• BucketingSortingReduceSinkOptimizer
• UnionProcessor
• JoinReorder
• ReduceSinkDeDuplication
• NonBlockingOpDeDupProc
• GlobalLimitOptimizer
• CorrelationOptimizer
• SimpleFetchOptimizer
MapReduceCompiler.compile() generates tasks such as:
• MapRedTask
• FetchTask
• ConditionalTask
• ExplainTask
• CopyTask
• DDLTask
• MoveTask
• FunctionTask
• StatsTask
• ColumnStatsTask
• DependencyCollectionTask
↳ Driver.execute()
➔ loop (List<Task>)
⟳ Driver.launchTask()
➔ TaskRunner.runSequential() ➔ Task.executeTask()
➔ Task.execute()
• e.g. MapRedTask.execute() ⤇ ExecDriver.execute() ➔ JobClient.submitJob() (ExecMapper, ExecReducer)
➠ Driver.run()
➔ Driver.runInternal()   ⇐ PRE- and POST-DRIVER-RUN
  ↳ Driver.compile()
    ↳ ParseDriver.parse()
    ↳ SemanticAnalyzer.analyze()   ⇐ PRE- and POST-SEMANTIC-ANALYZE
  ↳ Driver.execute()   ⇐ PRE-, POST-EXEC and ON-FAILURE
    ➔ loop (List<Task>)
    ⟳ Driver.launchTask()
    ➔ TaskRunner.runSequential() ➔ Task.executeTask()
    ➔ Task.execute()
HiveServer2.main() ➔ HiveServer2.start()
➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠
CLIService.executeStatement()
⇒ GetColumnsOperation.run(), GetSchemasOperation.run(), GetTablesOperation.run()
➔ HiveSession.getMetaStoreClient()
➔ new HiveMetaStoreClient() ➠
⇒ SemanticAnalyzer ↝ Hive ↝ getMSC() (getMSC() is invoked by many other methods in the Hive object)
Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠
➠ new HiveMetaStoreClient()
➔ HiveMetaStore.newHMSHandler()
➔ RetryingHMSHandler.getProxy()
➔ new RetryingHMSHandler()
➔ new HMSHandler() ➔ HMSHandler.init()
➔ HiveMetaStore.init()   ⇐ METASTORE-INIT
How Hive executes hooks
• Hive can execute multiple hooks at each hook point.
ex. Driver.runInternal()
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,
      HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
    driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  // ~ ellipsis ~ (error handling is cut off on the slide)
}
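The snippet above depends on a helper that turns a comma-separated property value into hook instances. The following is a minimal, self-contained sketch of that general idea and does not use Hive's classes: RunHook, AuditHook, TimingHook and loadHooks are hypothetical names standing in for HiveDriverRunHook and getHooks(), and the reflection details are an assumption about the technique, not a copy of Hive's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Hive hook interface such as HiveDriverRunHook. */
interface RunHook {
    String preRun(String command);
}

/** Two illustrative hooks; the names are made up for this sketch. */
class AuditHook implements RunHook {
    public String preRun(String command) { return "audit:" + command; }
}

class TimingHook implements RunHook {
    public String preRun(String command) { return "timing:" + command; }
}

class HookLoadingSketch {
    /** Instantiates every class named in a comma-separated list, keeping the listed order. */
    static <T> List<T> loadHooks(String csv, Class<T> type) {
        List<T> hooks = new ArrayList<>();
        if (csv == null || csv.trim().isEmpty()) {
            return hooks; // property not set: no hooks to run
        }
        for (String name : csv.split(",")) {
            try {
                Class<?> cls = Class.forName(name.trim());
                hooks.add(type.cast(cls.getDeclaredConstructor().newInstance()));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot load hook " + name.trim(), e);
            }
        }
        return hooks;
    }

    public static void main(String[] args) {
        // Plays the role of a property value such as hive.exec.driver.run.hooks.
        String csv = AuditHook.class.getName() + "," + TimingHook.class.getName();
        for (RunHook hook : loadHooks(csv, RunHook.class)) {
            System.out.println(hook.preRun("SELECT 1"));
        }
    }
}
```

In this sketch a class that cannot be loaded aborts the whole run, which is one reason the jar containing the hook has to be on the classpath before the property is set.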
1. MetaStoreInitListener
public abstract class MetaStoreInitListener implements Configurable {
  private Configuration conf;

  public MetaStoreInitListener(Configuration config) {
    this.conf = config;
  }

  public abstract void onInit(MetaStoreInitContext context) throws MetaException;

  @Override
  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void setConf(Configuration config) {
    this.conf = config;
  }
}
What MetaStoreInitContext contains
• Nothing!
- This hook only notifies you when the metastore initializes.
(You can, of course, still get the Configuration by calling getConf().)
public class MetaStoreInitContext {
}
2. HiveDriverRunHook
• preDriverRun
- Invoked before Hive begins any processing of a command in the Driver,
before compilation
• postDriverRun
- Invoked after Hive performs any processing of a command,
just before a response is returned to the entity calling the Driver.run()
public interface HiveDriverRunHook extends Hook {
  public void preDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;

  public void postDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;
}
What HiveDriverRunHookContext contains
• You can get the command string from this hook context.
- It is the only thing that HiveDriverRunHookContext holds.
public interface HiveDriverRunHookContext extends Configurable {
  public String getCommand();
  public void setCommand(String command);
}
3. AbstractSemanticAnalyzerHook
• You can get
- HiveSemanticAnalyzerHookContext and ASTNode (root node of the
abstract syntax tree) before analysis.
- HiveSemanticAnalyzerHookContext and List<Task> after analysis.
public abstract class AbstractSemanticAnalyzerHook implements
    HiveSemanticAnalyzerHook {
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context,
      ASTNode ast) throws SemanticException {
    return ast;
  }

  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
  }
}
What HiveSemanticAnalyzerHookContext contains
• Hive object
- contains information about a set of data in HDFS organized for query
processing. (from the source comment)
• ReadEntity, WriteEntity
• The update method is invoked after the semantic analyzer completes.
public interface HiveSemanticAnalyzerHookContext extends Configurable {
  public Hive getHive() throws HiveException;
  public void update(BaseSemanticAnalyzer sem);
  public Set<ReadEntity> getInputs();
  public Set<WriteEntity> getOutputs();
}
How Hive executes analyzer hooks
List<AbstractSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK, AbstractSemanticAnalyzerHook.class);
// ~ ellipsis ~
HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
hookCtx.setConf(conf);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  tree = hook.preAnalyze(hookCtx, tree);
}
sem.analyze(tree, ctx);
hookCtx.update(sem);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  hook.postAnalyze(hookCtx, sem.getRootTasks());
}
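The order of calls in the excerpt above can be modeled without Hive on the classpath. This is a simplified sketch under stated assumptions: AnalyzerHookSketch, UppercaseHook, TaskCountHook and AnalyzerFlowSketch are hypothetical names, the tree is a plain String instead of an ASTNode, and analyze() is a placeholder for sem.analyze(). It illustrates that each preAnalyze result is handed to the next hook and that postAnalyze only runs after analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified stand-in for AbstractSemanticAnalyzerHook. */
abstract class AnalyzerHookSketch {
    String preAnalyze(String tree) { return tree; }   // default: pass the tree through
    void postAnalyze(List<String> rootTasks) { }      // default: do nothing
}

/** Rewrites the tree before analysis. */
class UppercaseHook extends AnalyzerHookSketch {
    @Override
    String preAnalyze(String tree) { return tree.toUpperCase(Locale.ROOT); }
}

/** Only observes what the analyzer produced. */
class TaskCountHook extends AnalyzerHookSketch {
    int seenTasks = -1;

    @Override
    void postAnalyze(List<String> rootTasks) { seenTasks = rootTasks.size(); }
}

class AnalyzerFlowSketch {
    /** Same order as the excerpt: all preAnalyze calls, analysis, then all postAnalyze calls. */
    static List<String> compile(String tree, List<AnalyzerHookSketch> hooks) {
        for (AnalyzerHookSketch hook : hooks) {
            tree = hook.preAnalyze(tree); // each hook sees the previous hook's result
        }
        List<String> rootTasks = analyze(tree);
        for (AnalyzerHookSketch hook : hooks) {
            hook.postAnalyze(rootTasks);
        }
        return rootTasks;
    }

    /** Placeholder for the real analysis: one fake task per input tree. */
    static List<String> analyze(String tree) {
        List<String> tasks = new ArrayList<>();
        tasks.add("Task(" + tree + ")");
        return tasks;
    }

    public static void main(String[] args) {
        List<AnalyzerHookSketch> hooks = new ArrayList<>();
        hooks.add(new UppercaseHook());
        hooks.add(new TaskCountHook());
        System.out.println(compile("select 1", hooks));
    }
}
```

Because a later hook receives whatever an earlier hook returned, the order in which hooks are listed matters whenever more than one of them rewrites the tree.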
4. ExecuteWithHookContext
• Can be used with the following properties
- hive.exec.pre.hooks
- hive.exec.post.hooks
- hive.exec.failure.hooks
public interface ExecuteWithHookContext extends Hook {
  /**
   * @param hookContext
   *          The hook context passed to each hook.
   * @throws Exception
   */
  void run(HookContext hookContext) throws Exception;
}
What HookContext contains
• HookType
- PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK
• QueryPlan
• HiveConf
• LineageInfo
• UserGroupInformation
• OperationName
• List<TaskRunner> completeTaskList
• Set<ReadEntity> inputs
• Set<WriteEntity> outputs
• Map<String, ContentSummary> inputPathToContentSummary
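Since the same interface serves all three properties, one class can be registered at several hook points and branch on the hook type it receives. The sketch below models that idea with stand-ins only: HookTypeSketch, HookContextSketch and QueryLifecycleHook are hypothetical names, and the two fields are a cut-down assumption of what a real HookContext carries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mirror of the three HookType values listed above. */
enum HookTypeSketch { PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK }

/** Hypothetical, cut-down stand-in for HookContext. */
class HookContextSketch {
    final HookTypeSketch hookType;
    final String operationName;

    HookContextSketch(HookTypeSketch hookType, String operationName) {
        this.hookType = hookType;
        this.operationName = operationName;
    }
}

/** One hook class that could be listed under the pre, post and failure properties. */
class QueryLifecycleHook {
    final List<String> events = new ArrayList<>();

    void run(HookContextSketch ctx) {
        switch (ctx.hookType) {
            case PRE_EXEC_HOOK:
                events.add("start " + ctx.operationName);
                break;
            case POST_EXEC_HOOK:
                events.add("done " + ctx.operationName);
                break;
            case ON_FAILURE_HOOK:
                events.add("failed " + ctx.operationName);
                break;
        }
    }

    public static void main(String[] args) {
        QueryLifecycleHook hook = new QueryLifecycleHook();
        hook.run(new HookContextSketch(HookTypeSketch.PRE_EXEC_HOOK, "QUERY"));
        hook.run(new HookContextSketch(HookTypeSketch.POST_EXEC_HOOK, "QUERY"));
        System.out.println(hook.events);
    }
}
```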
How Hive fires hooks without physically executing a query
• The following statement has the effect of causing the pre/post execute hooks to fire.
ALTER TABLE table_name TOUCH [PARTITION partitionSpec];
MetaStore Event Listeners
Property | Abstract class
hive.metastore.pre.event.listeners | MetaStorePreEventListener
hive.metastore.end.function.listeners | MetaStoreEndFunctionListener
hive.metastore.event.listeners | MetaStoreEventListener
package: org.apache.hadoop.hive.metastore
• I think these listeners look like hooks.
• From a quick look I could not find any major differences between listeners and hooks.
The only one I noticed is that listeners cannot affect query processing. They can only read.
• Either way, they look useful for being notified when the metastore does something.
MetaStoreEventListener
• The following methods are invoked when the corresponding event occurs on the
metastore.
- onCreateTable
- onDropTable
- onAlterTable
- onDropPartition
- onAlterPartition
- onCreateDatabase
- onDropDatabase
- onLoadPartitionDone
If you need more details, see org.apache.hadoop.hive.metastore.MetaStoreEventListener
Be careful!
• Hooks
- can be a critical failure point!
(you had better catch runtime exceptions)
- are performed synchronously.
- can affect query processing time.
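One way to act on these warnings is to wrap the hook body so that failures are logged instead of propagated, and to time it. The following is a minimal sketch with hypothetical names (SimpleHook, GuardedHook, GuardedHookDemo) rather than Hive's own interfaces. Whether swallowing the exception is right depends on the hook: this assumes a monitoring-style hook that should never fail the query.

```java
/** Hypothetical stand-in for a hook interface with a single run() method. */
interface SimpleHook {
    void run(String command) throws Exception;
}

/** Wraps another hook so that its failures are logged instead of propagated. */
class GuardedHook implements SimpleHook {
    private final SimpleHook delegate;
    int failures = 0;

    GuardedHook(SimpleHook delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run(String command) {
        long start = System.nanoTime();
        try {
            delegate.run(command);
        } catch (Exception e) { // also covers RuntimeException
            failures++;
            System.err.println("hook failed, query continues: " + e);
        } finally {
            // The hook runs synchronously, so this time is added to the query.
            long micros = (System.nanoTime() - start) / 1_000;
            System.err.println("hook took " + micros + " us");
        }
    }
}

class GuardedHookDemo {
    public static void main(String[] args) {
        GuardedHook hook = new GuardedHook(cmd -> {
            throw new IllegalStateException("boom");
        });
        hook.run("SELECT 1"); // does not throw
        System.out.println("failures = " + hook.failures);
    }
}
```

A hook that is meant to block a query, for example an authorization check, would rethrow instead of swallowing the exception.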
Let's try it out
• Demo
- Don’t be surprised if it doesn’t work.
- That’s the way the demo is...
Thanks!
• Questions?
• Resources
- https://cwiki.apache.org/confluence/display/Hive/
- https://github.com/apache/hive
