28

Hive documentation lacking again:

I'd like to write the results of a query to a local file as well as the names of the columns.

Does Hive support this?

Insert overwrite local directory 'tmp/blah.blah' select * from table_name;

Also, a separate question: is Stack Overflow the best place to get Hive help? @Nija has been very helpful, but I don't want to keep bothering them...

7 Answers

64

Try

set hive.cli.print.header=true;
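
As the comments below note, that setting only adds the header to CLI/stdout output, not to files written by insert overwrite local directory. One common pattern that does get the header into a local file (a sketch; table_name and the output paths here are placeholders) is to run the query through hive -e and redirect stdout:

```shell
# Placeholder table and paths; assumes the hive CLI is on the PATH.
# With print.header=true, the first line of stdout is the tab-separated header.
hive -e 'set hive.cli.print.header=true; select * from table_name' > /tmp/out.tsv

# Optionally convert the tab-separated output to CSV (GNU sed understands \t):
sed 's/\t/,/g' /tmp/out.tsv > /tmp/out.csv
```

The header columns come out tab-separated just like the data rows, so the same delimiter conversion applies to both.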
7
  • 1
    Is there a way to permanently have this as the default instead of having to specify this setting upon each hive shell and/or command invocation?
    – J.D.
    Commented Oct 1, 2012 at 22:10
  • 21
    I tried this; it causes the header to output to the console, not to the local file . . .
    – maverick
    Commented Nov 9, 2012 at 21:42
  • 7
    @JD Yes, just put it into .hiverc file in your home directory
    – wlk
    Commented Sep 16, 2013 at 14:38
  • It appears to work in the CLI only; however, it has no effect when running a SQL file or from Oozie
    – Pasha
    Commented Aug 24, 2015 at 22:03
  • 3
    This does not answer the OP question
    Commented Apr 3, 2017 at 12:48
15

Yes, you can. Put set hive.cli.print.header=true; in a .hiverc file in your home directory, or in any of the other Hive user properties files.

Vague Warning: be careful, since this has crashed queries of mine in the past (but I can't remember the reason).
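
For reference, a .hiverc is just a file of HQL statements that the Hive CLI runs at startup; a minimal one enabling headers would contain only:

```sql
-- ~/.hiverc: executed by the Hive CLI at startup
set hive.cli.print.header=true;
```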

1
  • 6
    The property hive.cli.print.header=true won't work for the 'insert overwrite local directory' command. It does work if we run hive -e 'select ..' > Out.tsv
    – Munesh
    Commented Jul 30, 2016 at 0:52
9

Indeed, @nija's answer is correct - at least as far as I know. There isn't any way to write the column names when doing an insert overwrite [local] directory ... (whether you use local or not).

With regards to the crashes described by @user1735861, there is a known bug in hive 0.7.1 (fixed in 0.8.0) that, after doing set hive.cli.print.header=true;, causes a NullPointerException for any HQL command/query that produces no output. For example:

$ hive -S
hive> use default; 
hive> set hive.cli.print.header=true;
hive> use default;
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:222)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:287)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:517)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

Whereas this is fine:

$ hive -S
hive> set hive.cli.print.header=true;
hive> select * from dual;
c
c
hive> 

Non-HQL commands are fine though (set, dfs, !, etc.)

More info here: https://issues.apache.org/jira/browse/HIVE-2334

7

Hive does support writing to the local directory. Your syntax looks right for it as well.
Check out the docs on SELECTS and FILTERS for additional information.

I don't think Hive has a way to write the names of the columns to a file for the query you're running . . . I can't say for sure it doesn't, but I do not know of a way.

I think the only place better than SO for Hive questions would be the mailing list.

4

I ran into this problem today and was able to get what I needed by doing a UNION ALL between the original query and a new dummy query that generates the header row. I added a sort column to each section, set it to 0 for the header row and 1 for the data rows, so I could sort by that field and ensure the header row came out on top.

create table new_table as
select 
  field1,
  field2,
  field3
from
(
  select
    0 as sort_col,  --header row gets lowest number
    'field1_name' as field1,
    'field2_name' as field2,
    'field3_name' as field3
  from
    some_small_table  --table needs at least 1 row
  limit 1  --only need 1 header row
  union all
  select
    1 as sort_col,  --original query goes here
    field1,
    field2,
    field3
  from
    main_table
) a
order by 
  sort_col  --make sure header row is first

It's a little bulky, but at least you can get what you need with a single query.

Hope this helps!

2
  • This will fail if the col values are boolean, array etc.
    – amrk7
    Commented Sep 12, 2016 at 14:30
  • Basically a good solution, but 1) you no longer need 'from some_small_table' and limit 2) you have to include sort_col into main select 3) you need a semicolon at the end
    Commented Feb 26, 2019 at 12:55
3

Not a great solution, but here is what I do:

create table test_dat
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS 
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" 
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" 
LOCATION '/tmp/test_dat' as select * from YOUR_TABLE;

hive -e 'set hive.cli.print.header=true;select * from YOUR_TABLE limit 0' > /tmp/test_dat/header.txt

cd /tmp/test_dat && cat header.txt 000* > all.dat
1
  • 1
    This can be very slow.
    Commented Oct 1, 2014 at 21:33
2

Here's my take on it. Note: I'm not very well versed in bash, so suggestions for improvement are welcome :)

#!/usr/bin/env bash

# works like this:
# ./get_data.sh database.table > data.csv

INPUT=$1
TABLE=${INPUT##*.}
DB=${INPUT%.*}

HEADER=`hive -e "
  set hive.cli.print.header=true;
  use $DB;
  INSERT OVERWRITE LOCAL DIRECTORY '$TABLE'
  row format delimited
  fields terminated by ','
  SELECT * FROM $TABLE;"`

HEADER_WITHOUT_TABLE_NAME=${HEADER//$TABLE./}
echo ${HEADER_WITHOUT_TABLE_NAME//[[:space:]]/,}
cat $TABLE/*
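
The two parameter expansions at the top are what split the database.table argument into its parts; a quick sketch of that idiom (the values here are hypothetical):

```shell
INPUT="mydb.mytable"   # hypothetical database.table argument
TABLE=${INPUT##*.}     # strip everything up to and including the last '.' -> mytable
DB=${INPUT%.*}         # strip the last '.' and everything after it -> mydb
echo "$DB $TABLE"      # prints: mydb mytable
```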
