Apache Hive Hook
- 2. Apache Hive Hook
• I put this together because Ryan asked me about Hive hooks, and I couldn’t find any information about hooks on the Hive wiki.
• I hope it helps when you are developing applications on Hive and want extra information while a query is executing.
• This document is based on the release-0.11 tag.
• Source:
- https://github.com/apache/hive (mirror of apache hive)
- 3. What is a hook?
• As you probably know, this is a general programming technique, but to recap:
• Hooking
- Techniques for intercepting function calls, messages, or events in an operating system, applications, and other software components.
• Hook
- Code that handles the intercepted function calls, events, or messages.
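To make the two definitions concrete, here is a minimal, generic sketch of hooking in plain Java. It has nothing to do with Hive's own classes: `HookedFunction`, `Hook`, and `HookingDemo` are hypothetical names invented for this illustration. The wrapper intercepts a call (hooking) and hands it to registered handlers (hooks) before and after the real function runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of hooking; none of these names come from Hive. */
final class HookedFunction<A, R> {
    /** A hook: code that handles an intercepted call. */
    interface Hook<A, R> {
        void before(A argument);
        void after(A argument, R result);
    }

    private final Function<A, R> target;
    private final List<Hook<A, R>> hooks = new ArrayList<>();

    HookedFunction(Function<A, R> target) {
        this.target = target;
    }

    void register(Hook<A, R> hook) {
        hooks.add(hook);
    }

    /** Hooking: the call is intercepted so every hook sees it before and after. */
    R call(A argument) {
        for (Hook<A, R> hook : hooks) {
            hook.before(argument);
        }
        R result = target.apply(argument);
        for (Hook<A, R> hook : hooks) {
            hook.after(argument, result);
        }
        return result;
    }
}

final class HookingDemo {
    /** Runs a hooked upper-casing function and returns what the hook observed. */
    static List<String> trace(String input) {
        List<String> events = new ArrayList<>();
        HookedFunction<String, String> upper = new HookedFunction<>(String::toUpperCase);
        upper.register(new HookedFunction.Hook<String, String>() {
            @Override public void before(String argument) {
                events.add("before:" + argument);
            }
            @Override public void after(String argument, String result) {
                events.add("after:" + result);
            }
        });
        upper.call(input);
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace("select 1"));
    }
}
```

Hive's hook points follow the same shape: the Driver and the metastore play the role of the wrapper, and the classes you configure play the role of the hooks.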
- 4. Hive provides some hooking points
• pre-execution
• post-execution
• execution-failure
• pre- and post-driver-run
• pre- and post-semantic-analyze
• metastore-initialize
- 5. How to set up hooks in Hive
hive-site.xml
Setting the hook property:
<property>
<name>hive.exec.pre.hooks</name>
<value></value>
<description>
Comma-separated list of pre-execution hooks to be invoked for each statement.
A pre-execution hook is specified as the name of a Java class which implements
the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
</description>
</property>
Setting the path of the jars that contain the implementations of the hook interfaces or abstract classes:
<property>
<name>hive.aux.jars.path</name>
<value></value>
</property>
You can use hive.added.jars.path instead of hive.aux.jars.path.
- 6. Hive hook properties and interfaces
Property Interface or Abstract class
hive.exec.pre.hooks org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PreExecute is deprecated)
hive.exec.post.hooks org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PostExecute is deprecated)
hive.exec.failure.hooks org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
hive.metastore.init.hooks org.apache.hadoop.hive.metastore.MetaStoreInitListener
hive.exec.driver.run.hooks org.apache.hadoop.hive.ql.HiveDriverRunHook
hive.semantic.analyzer.hook org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook
- 7. When do those hooks fire?
• You can submit a query to Hive through the following entry points:
- CLIDriver main method (called by shell script)
- HCatCli main method (called by shell script)
- HiveServer (called by thrift client)
- HiveServer2 (called by thrift client or beeline)
- 8. CLIDriver
CLIDriver.main() ➔ run() ➔ executeDriver() ➔ processLine() ➔ processCmd() ➔ is remote ?
yes ↳ CliSessionState.getClient() ↳ HiveClient.execute() ➠
no ➔ processLocalCmd() ➔ Driver.run() ➠
- 9. HCatCli
HCatCli.main() ➔ processLine() ➔ processCmd()
➔ HCatDriver.run() ⤇ Driver.run() ➠
- 10. HiveServer
HiveServer.execute() ➔ Driver.run() ➠
- 12. HiveServer2
ThriftCLIService.ExecuteStatement() ➔ CLIService.executeStatement()
CLIService.executeStatement()
↳ SessionManager.getSession()
↳ HiveSession.executeStatement()
↳ OperationManager.newExecuteStatementOperation()
↳ SQLOperation.run() ➔ Driver.run() ➠
• OperationManager.newExecuteStatementOperation() acts as a kind of factory
- AddResourceOperation, DeleteResourceOperation, DfsOperation, GetCatalogsOperation, GetColumnsOperation, GetFunctionsOperation, GetSchemasOperation, GetTablesOperation, GetTableTypesOperation, GetTypeInfoOperation, SetOperation, SQLOperation
- 14. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse() ↝ HiveParser
• HiveParser.g
- SelectClauseParser.g
- FromClauseParser.g
- IdentifiersParser.g
• ParseDriver.parse()
- Command string ➡ root of the AST
- 15. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze()
• SemanticAnalyzerFactory.get(conf, ast)
- SemanticAnalyzer, ColumnStatsSemanticAnalyzer, ExplainSemanticAnalyzer,
ExportSemanticAnalyzer, FunctionSemanticAnalyzer,
ImportSemanticAnalyzer, LoadSemanticAnalyzer, MacroSemanticAnalyzer
- 16. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze()
➔ analyzeInternal()
• processPositionAlias()
• doPhase1()
• getMetaData()
• genPlan()
• Optimizer.optimize()
• MapReduceCompiler.compile()
- 17. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze()
Operators that genPlan() can put into the operator tree:
• FilterOperator
• SelectOperator
• ForwardOperator
• FileSinkOperator
• ScriptOperator
• PTFOperator
• ReduceSinkOperator
• ExtractOperator
• GroupByOperator
• JoinOperator
• MapJoinOperator
• SMBMapJoinOperator
• LimitOperator
• TableScanOperator
• UnionOperator
• UDTFOperator
• LateralViewJoinOperator
• LateralViewForwardOperator
• HashTableDummyOperator
• HashTableSinkOperator
• DummyStoreOperator
• DemuxOperator
• MuxOperator
- 18. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze()
Transformations that Optimizer.optimize() can apply:
• PredicateTransitivePropagate
• PredicatePushDown
• PartitionPruner
• PartitionConditionRemover
• ListBucketingPruner
• ColumnPruner
• SkewJoinOptimizer
• RewriteGBUsingIndex
• GroupByOptimizer
• SamplePruner
• MapJoinProcessor
• BucketMapJoinOptimizer
• SortedMergeBucketMapJoinOptimizer
• BucketingSortingReduceSinkOptimizer
• UnionProcessor
• JoinReorder
• ReduceSinkDeDuplication
• NonBlockingOpDeDupProc
• GlobalLimitOptimizer
• CorrelationOptimizer
• SimpleFetchOptimizer
- 19. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze()
Task types that the compiled plan can contain:
• MapRedTask
• FetchTask
• ConditionalTask
• ExplainTask
• CopyTask
• DDLTask
• MoveTask
• FunctionTask
• StatsTask
• ColumnStatsTask
• DependencyCollectionTask
- 20. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze()
↳ Driver.execute()
➔ loop (List<Task>)
⟳ Driver.launchTask()
➔ TaskRunner.runSequential() ➔ Task.executeTask()
➔ Task.execute()
- 21. ➠ Driver.run()
➔ Driver.runInternal()
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze()
↳ Driver.execute()
➔ loop (List<Task>)
⟳ Driver.launchTask()
➔ TaskRunner.runSequential() ➔ Task.executeTask()
➔ Task.execute()
• e.g. MapRedTask.execute() ⤇ ExecDriver.execute() ➔ JobClient.submitJob(); the submitted job runs ExecMapper and ExecReducer
- 22. ➠ Driver.run()
➔ Driver.runInternal() ← PRE- and POST-DRIVER-RUN hooks
↳ Driver.compile()
↳ ParseDriver.parse()
↳ SemanticAnalyzer.analyze() ← PRE- and POST-SEMANTIC-ANALYZE hooks
↳ Driver.execute() ← PRE-, POST-EXEC and ON-FAILURE hooks
➔ loop (List<Task>)
⟳ Driver.launchTask()
➔ TaskRunner.runSequential() ➔ Task.executeTask()
➔ Task.execute()
- 26. HiveServer2.main() ➔ HiveServer2.start()
➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠
CLIService.executeStatement()
⇒ GetColumnsOperation.run(), GetSchemasOperation.run(), GetTablesOperation.run()
➔ HiveSession.getMetaStoreClient()
➔ new HiveMetaStoreClient() ➠
SemanticAnalyzer ↝ Hive ↝ getMSC() (getMSC() is invoked by many other methods of the Hive object)
Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠
➠ new HiveMetaStoreClient()
➔ HiveMetaStore.newHMSHandler()
➔ RetryingHMSHandler.getProxy()
➔ new RetryingHMSHandler()
➔ new HMSHandler() ➔ HMSHandler.init()
➔ HiveMetaStore.init()
METASTORE-INIT hooks fire during this initialization.
- 27. How Hive executes hooks
• Hive can run multiple hooks at each hook point,
e.g. in Driver.runInternal():
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,
      HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
    driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  // ~ ellipsis ~
}
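The getHooks call turns a comma-separated property value into a list of hook objects. The sketch below shows one plausible way such a loader can work, using plain reflection. It is not Hive's implementation: `NamedHook`, `AuditHook`, `MetricsHook`, and `HookLoader` are hypothetical names, and details such as which class loader Hive uses are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a hook interface; not a Hive type. */
interface NamedHook {
    String name();
}

class AuditHook implements NamedHook {
    @Override public String name() { return "audit"; }
}

class MetricsHook implements NamedHook {
    @Override public String name() { return "metrics"; }
}

final class HookLoader {
    /**
     * Turns a comma-separated list of class names (the shape of a hook
     * property value) into instances, preserving the listed order.
     */
    static <T> List<T> loadHooks(String csvClassNames, Class<T> hookType) {
        List<T> hooks = new ArrayList<>();
        if (csvClassNames == null || csvClassNames.trim().isEmpty()) {
            return hooks; // an empty property value means "no hooks"
        }
        for (String className : csvClassNames.split(",")) {
            try {
                Object instance = Class.forName(className.trim())
                        .getDeclaredConstructor()
                        .newInstance();
                // cast() throws ClassCastException if the class is not a T
                hooks.add(hookType.cast(instance));
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("cannot load hook " + className.trim(), e);
            }
        }
        return hooks;
    }

    /** Loads the hooks and returns their names, in configuration order. */
    static List<String> loadedNames(String csvClassNames) {
        List<String> names = new ArrayList<>();
        for (NamedHook hook : loadHooks(csvClassNames, NamedHook.class)) {
            names.add(hook.name());
        }
        return names;
    }

    public static void main(String[] args) {
        String property = AuditHook.class.getName() + ", " + MetricsHook.class.getName();
        System.out.println(loadedNames(property));
    }
}
```

The point to take away is that hooks run in the order they are listed, and that a hook class needs a no-argument constructor to be created this way.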
- 28. 1. MetaStoreInitListener
public abstract class MetaStoreInitListener implements Configurable {
  private Configuration conf;

  public MetaStoreInitListener(Configuration config) {
    this.conf = config;
  }

  public abstract void onInit(MetaStoreInitContext context) throws MetaException;

  @Override
  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void setConf(Configuration config) {
    this.conf = config;
  }
}
- 30. What MetaStoreInitContext provides
• Nothing!
- This hook only notifies you when the metastore initializes.
(You can, of course, still get the HiveConf by calling getConf().)
public class MetaStoreInitContext {
}
- 31. 2. HiveDriverRunHook
• preDriverRun
- Invoked before Hive begins any processing of a command in the Driver,
before compilation
• postDriverRun
- Invoked after Hive performs any processing of a command,
just before a response is returned to the entity calling the Driver.run()
public interface HiveDriverRunHook extends Hook {
  public void preDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;
  public void postDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;
}
- 32. What HiveDriverRunHookContext provides
• You can get the command string from this hook context.
- It is the only thing HiveDriverRunHookContext exposes.
public interface HiveDriverRunHookContext extends Configurable {
  public String getCommand();
  public void setCommand(String command);
}
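As an illustration of what a driver-run hook might look like, the sketch below records each command and how long the Driver spent on it. To keep it compilable without Hive on the classpath, it declares trimmed stand-ins for the two interfaces quoted above (the real ones also extend Hook and Configurable). `CommandTimingHook` and `DriverRunHookDemo` are hypothetical names. The hook keeps its start time in a field, which assumes one hook instance per Driver.run() call; I did not verify that assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring the interfaces quoted above, trimmed so the sketch
// compiles on its own. With Hive available you would import the real ones.
interface HiveDriverRunHookContext {
    String getCommand();
    void setCommand(String command);
}

interface HiveDriverRunHook {
    void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
    void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
}

/** Hypothetical hook: logs each command and the time between pre and post. */
final class CommandTimingHook implements HiveDriverRunHook {
    final List<String> log = new ArrayList<>();
    private long startNanos;

    @Override
    public void preDriverRun(HiveDriverRunHookContext hookContext) {
        startNanos = System.nanoTime();
        log.add("start: " + hookContext.getCommand());
    }

    @Override
    public void postDriverRun(HiveDriverRunHookContext hookContext) {
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
        log.add("done: " + hookContext.getCommand());
        System.out.println(hookContext.getCommand() + " took " + elapsedMillis + " ms");
    }
}

final class DriverRunHookDemo {
    /** Minimal context implementation, used only to exercise the hook. */
    static HiveDriverRunHookContext contextFor(String initialCommand) {
        return new HiveDriverRunHookContext() {
            private String command = initialCommand;
            @Override public String getCommand() { return command; }
            @Override public void setCommand(String command) { this.command = command; }
        };
    }

    /** Calls the hook the way a driver would: pre, (work), post. */
    static List<String> simulate(String command) {
        CommandTimingHook hook = new CommandTimingHook();
        HiveDriverRunHookContext context = contextFor(command);
        hook.preDriverRun(context);
        // Hive would compile and execute the command at this point.
        hook.postDriverRun(context);
        return hook.log;
    }

    public static void main(String[] args) {
        System.out.println(simulate("SELECT 1"));
    }
}
```

To try the real thing, a class like this would be compiled against Hive, placed on the auxiliary jar path, and listed under hive.exec.driver.run.hooks.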
- 33. 3. AbstractSemanticAnalyzerHook
• You can get
- HiveSemanticAnalyzerHookContext and ASTNode (root node of the abstract syntax tree) before analysis.
- HiveSemanticAnalyzerHookContext and List<Task> after analysis.
public abstract class AbstractSemanticAnalyzerHook implements
    HiveSemanticAnalyzerHook {
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context,
      ASTNode ast) throws SemanticException {
    return ast;
  }

  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
  }
}
- 34. What HiveSemanticAnalyzerHookContext provides
• Hive object
- contains information about a set of data in HDFS organized for query processing (from the class comment)
• ReadEntity, WriteEntity
• The update method is invoked after the semantic analyzer completes.
public interface HiveSemanticAnalyzerHookContext extends Configurable {
  public Hive getHive() throws HiveException;
  public void update(BaseSemanticAnalyzer sem);
  public Set<ReadEntity> getInputs();
  public Set<WriteEntity> getOutputs();
}
- 35. How Hive executes analyzer hooks
List<AbstractSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK, AbstractSemanticAnalyzerHook.class);
// ~ ellipsis ~
HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
hookCtx.setConf(conf);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  tree = hook.preAnalyze(hookCtx, tree);
}
sem.analyze(tree, ctx);
hookCtx.update(sem);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  hook.postAnalyze(hookCtx, sem.getRootTasks());
}
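The loop above has two consequences worth seeing in action: each pre-analyze hook may replace the tree handed to the next one, and a hook can stop compilation by throwing. The sketch below imitates that flow with simplified stand-ins so it runs without Hive. `Ast`, `AnalyzerHook`, `RejectDropTableHook`, and `AnalyzerHookDemo` are hypothetical, the local `SemanticException` only borrows the name of Hive's exception, and the tree is reduced to its root token name (TOK_DROPTABLE and TOK_QUERY are token names from Hive's grammar).

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for Hive's SemanticException so the sketch compiles alone. */
final class SemanticException extends Exception {
    SemanticException(String message) {
        super(message);
    }
}

/** Stand-in for ASTNode: only the root token name is modelled. */
final class Ast {
    final String rootToken;

    Ast(String rootToken) {
        this.rootToken = rootToken;
    }
}

/** Simplified mirror of AbstractSemanticAnalyzerHook's no-op defaults. */
abstract class AnalyzerHook {
    Ast preAnalyze(Ast ast) throws SemanticException {
        return ast;
    }

    void postAnalyze(List<String> rootTasks) throws SemanticException {
    }
}

/** Hypothetical policy hook: vetoes DROP TABLE before analysis starts. */
final class RejectDropTableHook extends AnalyzerHook {
    @Override
    Ast preAnalyze(Ast ast) throws SemanticException {
        if ("TOK_DROPTABLE".equals(ast.rootToken)) {
            throw new SemanticException("DROP TABLE is not allowed here");
        }
        return ast;
    }
}

final class AnalyzerHookDemo {
    /** Imitates the pre-hooks, analyze, post-hooks sequence shown above. */
    static String compile(String rootToken) {
        List<AnalyzerHook> hooks = Arrays.<AnalyzerHook>asList(new RejectDropTableHook());
        Ast tree = new Ast(rootToken);
        try {
            for (AnalyzerHook hook : hooks) {
                tree = hook.preAnalyze(tree); // a hook may return a different tree
            }
            // Placeholder for sem.analyze(): pretend analysis yields one root task.
            List<String> rootTasks = Arrays.asList("task-for-" + tree.rootToken);
            for (AnalyzerHook hook : hooks) {
                hook.postAnalyze(rootTasks);
            }
            return "compiled";
        } catch (SemanticException e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(compile("TOK_QUERY"));
        System.out.println(compile("TOK_DROPTABLE"));
    }
}
```

A real hook would extend AbstractSemanticAnalyzerHook and inspect the actual ASTNode and hook context instead of a string.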
- 40. 4. ExecuteWithHookContext
• Can be used with the following properties
- hive.exec.pre.hooks
- hive.exec.post.hooks
- hive.exec.failure.hooks
public interface ExecuteWithHookContext extends Hook {
  /**
   * @param hookContext
   *          The hook context passed to each hook.
   * @throws Exception
   */
  void run(HookContext hookContext) throws Exception;
}
- 41. What HookContext provides
• HookType
- PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK
• QueryPlan
• HiveConf
• LineageInfo
• UserGroupInformation
• OperationName
• List<TaskRunner> completeTaskList
• Set<ReadEntity> inputs
• Set<WriteEntity> outputs
• Map<String, ContentSummary> inputPathToContentSummary
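Because the same interface serves the pre, post, and failure properties, one class can be registered under all three and branch on the hook type. The sketch below shows that idea with stand-ins that compile without Hive. The stand-in `HookContext` carries only three of the items listed above, its getter names are assumptions, and the inputs are plain strings rather than ReadEntity objects. `AuditLogHook` is a hypothetical name.

```java
import java.util.Collections;
import java.util.Set;

/** Stand-in for the hook context: a small subset, with assumed getter names. */
final class HookContext {
    enum HookType { PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK }

    private final HookType hookType;
    private final String operationName;
    private final Set<String> inputs; // the real context holds Set<ReadEntity>

    HookContext(HookType hookType, String operationName, Set<String> inputs) {
        this.hookType = hookType;
        this.operationName = operationName;
        this.inputs = inputs;
    }

    HookType getHookType() { return hookType; }
    String getOperationName() { return operationName; }
    Set<String> getInputs() { return inputs; }
}

/** Stand-in mirroring the interface quoted on the previous slide. */
interface ExecuteWithHookContext {
    void run(HookContext hookContext) throws Exception;
}

/** Hypothetical hook that could be listed under all three hive.exec.*.hooks properties. */
final class AuditLogHook implements ExecuteWithHookContext {
    /** Builds the audit line; kept apart from run() so it is easy to test. */
    static String describe(HookContext context) {
        switch (context.getHookType()) {
            case PRE_EXEC_HOOK:
                return "about to run " + context.getOperationName()
                        + " reading " + context.getInputs();
            case POST_EXEC_HOOK:
                return "finished " + context.getOperationName();
            case ON_FAILURE_HOOK:
                return "failed " + context.getOperationName();
            default:
                throw new IllegalStateException("unexpected hook type");
        }
    }

    @Override
    public void run(HookContext hookContext) {
        System.out.println(describe(hookContext));
    }

    public static void main(String[] args) {
        Set<String> inputs = Collections.singleton("default@sales");
        new AuditLogHook().run(
                new HookContext(HookContext.HookType.PRE_EXEC_HOOK, "QUERY", inputs));
        new AuditLogHook().run(
                new HookContext(HookContext.HookType.ON_FAILURE_HOOK, "QUERY", inputs));
    }
}
```

Branching on the hook type keeps the audit format in one place, at the cost of a class that does three jobs. Separate classes per property work just as well.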
- 42. How to fire hooks without physically executing a query
• The following statement has the effect of causing the pre/post execute hooks to fire:
ALTER TABLE table_name TOUCH [PARTITION partitionSpec];
- 43. MetaStore Event Listeners
Property Abstract Class
hive.metastore.pre.event.listeners MetaStorePreEventListener
hive.metastore.end.function.listeners MetaStoreEndFunctionListener
hive.metastore.event.listeners MetaStoreEventListener
package: org.apache.hadoop.hive.metastore
• These listeners look a lot like hooks.
• From a quick look I couldn’t find any particular difference between listeners and hooks. The only one I noticed is that listeners can’t affect query processing; they can only read.
• Either way, they are useful for being notified when the metastore does something.
- 44. MetaStoreEventListener
• The following callbacks are invoked when the corresponding event occurs on the metastore.
- onCreateTable
- onDropTable
- onAlterTable
- onDropPartition
- onAlterPartition
- onCreateDatabase
- onDropDatabase
- onLoadPartitionDone
If you need more details, see org.apache.hadoop.hive.metastore.MetaStoreEventListener
- 45. Be careful!
• Hooks
- can be a critical point of failure!
(you had better catch runtime exceptions)
- are performed synchronously.
- can affect query processing time.
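One way to act on these cautions is to wrap a hook in a guard that catches its exceptions and reports slow runs. The sketch below is a minimal version of that idea under simplifying assumptions: `QueryHook` is a hypothetical one-method interface standing in for any of the hook interfaces, and `GuardedHook`, `GuardedHookDemo`, and the 100 ms threshold are invented for this example.

```java
/** Hypothetical minimal hook interface standing in for a real hook interface. */
interface QueryHook {
    void run(String command) throws Exception;
}

/** Wraps another hook so that its failures and slowness stay contained. */
final class GuardedHook implements QueryHook {
    private final QueryHook delegate;
    private final long warnAfterMillis;
    private int failures;

    GuardedHook(QueryHook delegate, long warnAfterMillis) {
        this.delegate = delegate;
        this.warnAfterMillis = warnAfterMillis;
    }

    @Override
    public void run(String command) {
        long startNanos = System.nanoTime();
        try {
            delegate.run(command);
        } catch (Exception e) { // also covers RuntimeException
            failures++;
            System.err.println("hook failed, query continues: " + e);
        } finally {
            // Hooks run synchronously, so their time is added to the query's time.
            long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
            if (elapsedMillis > warnAfterMillis) {
                System.err.println("slow hook: " + elapsedMillis + " ms");
            }
        }
    }

    int failureCount() {
        return failures;
    }
}

final class GuardedHookDemo {
    /** Runs the given hook once behind a guard and reports how often it failed. */
    static int failuresAfterRunning(QueryHook hook) {
        GuardedHook guarded = new GuardedHook(hook, 100);
        guarded.run("SELECT 1");
        return guarded.failureCount();
    }

    public static void main(String[] args) {
        System.out.println(failuresAfterRunning(command -> {
            throw new IllegalStateException("boom");
        }));
        System.out.println(failuresAfterRunning(command -> { }));
    }
}
```

Swallowing exceptions is a trade-off: it suits logging or metrics hooks, but a hook whose purpose is to block a query should let its exception propagate.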
- 46. Let's try it out
• Demo
- Don’t be surprised if it doesn’t work.
- That’s the way the demo is...