
I am going to work on a data set that contains information about 311 calls in the United States. This data set is available publicly in BigQuery. I would like to copy this directly to my bucket. However, I am clueless about how to do this as I am a novice.

Here is a screenshot of the public location of the dataset on Google Cloud:

Screenshot showing the available dataset

I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I transfer the data directly, without having to download the 12 GB file and upload it again through my VM instance?

2 Answers


If you select the 311_service_requests table from the list on the left, an "Export" button will appear:

BigQuery Export

Then you can select Export to GCS, select your bucket, type a filename, choose the format (CSV or JSON) and choose whether you want the export file to be compressed (GZIP).
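If you would rather script the export than click through the Console, here is a minimal sketch using the BigQuery Python client library (the bucket name and file prefix are placeholders to replace with your own; the public dataset is in the US multi-region):

from google.cloud import bigquery

client = bigquery.Client()

# Fully qualified reference to the public table
table_ref = "bigquery-public-data.new_york_311.311_service_requests"

# Export as GZIP-compressed CSV
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)

# The wildcard lets BigQuery split the export into multiple files,
# which is required for tables larger than 1 GB (see the limitations below)
destination_uri = "gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv.gz"

extract_job = client.extract_table(
    table_ref, destination_uri, job_config=job_config, location="US"
)
extract_job.result()  # wait for the export job to complete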

However, there are some limitations in BigQuery exports. Here are the ones from the documentation that apply to your case:

  • You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
  • When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
  • You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.

EDIT:

A simple way to merge the output files together is to use the gsutil compose command. However, if you do this, the header row with the column names will appear multiple times in the resulting file, because it appears in every file extracted from BigQuery.

To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:

bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv

and then create the composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv

Now the all_data.csv file has no headers at all. If you still need the column names to appear in the first row, you have to create another CSV file with the column names and create a composite of these two. This can be done either manually, by pasting the following (the column names of the 311_service_requests table) into a new file:

unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location

or with the following simple Python script (useful when the table has so many columns that typing them by hand is impractical), which queries the column names of the table and writes them into a CSV file:

from google.cloud import bigquery

client = bigquery.Client()

# Query the column names of the table, in their schema (ordinal) order
query = """
    SELECT column_name
    FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = '311_service_requests'
    ORDER BY ordinal_position
"""
query_job = client.query(query)

# Write the column names as a single comma-separated header row
columns = [row["column_name"] for row in query_job]
with open("headers.csv", "w") as f:
    print(','.join(columns), file=f)

Note that for the above script to run you need to have the BigQuery Python Client library installed:

pip install --upgrade google-cloud-bigquery 

Upload the headers.csv file to your bucket:

gsutil cp headers.csv gs://<YOUR_BUCKET_NAME>/headers.csv
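If you would rather do the upload from Python as well, here is a one-line sketch with the Cloud Storage client library (assuming google-cloud-storage is installed and <YOUR_BUCKET_NAME> is replaced with your bucket name):

from google.cloud import storage

storage_client = storage.Client()

# Upload the local headers.csv created by the script above
storage_client.bucket("<YOUR_BUCKET_NAME>").blob("headers.csv").upload_from_filename("headers.csv")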

And now you are ready to create the final composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv

Alternatively, if you want the headers, you can skip creating the first composite and create the final one directly from all the sources:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
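For completeness, the same composition can be sketched with the Cloud Storage Python client, which may be handier if you are already scripting the previous steps (assuming the headers.csv object and the exported shards already exist in your bucket):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("<YOUR_BUCKET_NAME>")

# The header object first, then every exported shard (the wildcard expanded manually)
sources = [bucket.blob("headers.csv")]
sources += list(storage_client.list_blobs("<YOUR_BUCKET_NAME>", prefix="nyc_311_"))

# compose() accepts at most 32 source objects per call
bucket.blob("all_data_with_headers.csv").compose(sources)

Keep in mind the 32-source limit of compose: if the export produced more than 31 shards, you have to compose them in smaller batches first.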
  • Hi thanks for the answer. If I use a wildcard, how would I be able to combine those files into one file that I can work on effectively? Commented Dec 19, 2019 at 2:35
  • @KaustubhMulay I have edited my answer providing as many details as I could on this. If you have any questions please let me known.
    – itroulli
    Commented Dec 19, 2019 at 17:08
  • Thanks much for the detailed answer. I am grateful to you. If I have any questions, I will definitely let you know. Thanks again!! Commented Dec 21, 2019 at 3:19
  • @KaustubhMulay Since it was helpful, it would be nice if you could accept my answer!
    – itroulli
    Commented Dec 21, 2019 at 5:21
  • Done. Is there a way to limit the blobs to 32? Commented Dec 21, 2019 at 23:04

You can also use the Cloud SDK command-line tools (bq and gsutil):

  1. Create a bucket:

    gsutil mb gs://my-bigquery-temp  
    
  2. Extract the data set:

    bq extract --destination_format CSV --compression GZIP 'bigquery-public-data:new_york_311.311_service_requests' gs://my-bigquery-temp/dataset*
    

Please note that you have to use gs://my-bigquery-temp/dataset* because the dataset is too large to be exported to a single file.

  3. Check the bucket:

    gsutil ls gs://my-bigquery-temp
    
    gs://my-bigquery-temp/dataset000000000000
    
    ......................................
    
    gs://my-bigquery-temp/dataset000000000045
    

    You can find more information in Exporting table data.
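If you prefer to check the bucket from Python rather than with gsutil ls, here is a small sketch with the Cloud Storage client library (assuming google-cloud-storage is installed):

from google.cloud import storage

storage_client = storage.Client()

# List every exported shard in the temporary bucket
for blob in storage_client.list_blobs("my-bigquery-temp", prefix="dataset"):
    print(blob.name, blob.size)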

Edit:

To compose an object from the exported dataset files you can use gsutil tool:

 gsutil compose gs://my-bigquery-temp/dataset*  gs://my-bigquery-temp/composite-object

Please keep in mind that you cannot use more than 32 blobs (files) to compose the object.
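If the export produced more than 32 files, one possible workaround (just a sketch, not something from the documentation above; the part-* object names are placeholders) is to compose the shards in batches of at most 32 and then compose the intermediate objects:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("my-bigquery-temp")

# All exported shards, possibly more than 32
blobs = list(storage_client.list_blobs("my-bigquery-temp", prefix="dataset"))

# First pass: compose groups of at most 32 shards into intermediate objects
intermediates = []
for i in range(0, len(blobs), 32):
    part = bucket.blob("part-{:03d}".format(i // 32))
    part.compose(blobs[i:i + 32])
    intermediates.append(part)

# Second pass: compose the intermediates (fine as long as there are at most 32 of them)
bucket.blob("composite-object").compose(intermediates)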

Related SO question: Google Cloud Storage Joining multiple csv files

  • Hi thanks for the answer. If I use a wildcard, how would I be able to combine those files into one file that I can work on effectively? Commented Dec 19, 2019 at 2:36
