
I am going to work on a data set that contains information about 311 calls in the United States. This data set is available publicly in BigQuery. I would like to copy this directly to my bucket. However, I am clueless about how to do this as I am a novice.

Here is a screenshot of the public location of the dataset on Google Cloud:

Screenshot showing the available dataset

I have already created a bucket named 311_nyc in my Google Cloud Storage. How can I transfer the data directly, without having to download the 12 GB file and upload it again through my VM instance?

2 Answers


If you select the 311_service_requests table from the list on the left, an "Export" button will appear:

BigQuery Export

Then you can select Export to GCS, select your bucket, type a filename, choose the format (CSV or JSON) and choose whether you want the export file to be compressed (GZIP).
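If you would rather script the export than click through the Console, here is a minimal sketch using the BigQuery Python client library (the bucket name and file prefix are placeholders to replace with your own; the public dataset is in the US multi-region):

from google.cloud import bigquery

client = bigquery.Client()

# Fully qualified reference to the public table
table_ref = "bigquery-public-data.new_york_311.311_service_requests"

# Export as GZIP-compressed CSV
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)

# The wildcard lets BigQuery split the export into multiple files,
# which is required for tables larger than 1 GB (see the limitations below)
destination_uri = "gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv.gz"

extract_job = client.extract_table(
    table_ref, destination_uri, job_config=job_config, location="US"
)
extract_job.result()  # wait for the export job to complete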

However, there are some limitations in BigQuery exports. Here are the ones from the documentation that apply to your case:

  • You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
  • When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
  • You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.

EDIT:

A simple way to merge the output files together is to use the gsutil compose command. However, if you do this, the header row with the column names will appear multiple times in the resulting file, because it appears in every file extracted from BigQuery.

To avoid this, you should perform the BigQuery Export by setting the print_header parameter to False:

bq extract --destination_format CSV --print_header=False bigquery-public-data:new_york_311.311_service_requests gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv

and then create the composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/nyc_311_* gs://<YOUR_BUCKET_NAME>/all_data.csv

Now the all_data.csv file has no headers at all. If you still need the column names to appear in the first row, you have to create another CSV file with the column names and create a composite of these two. This can be done either manually, by pasting the following (the column names of the 311_service_requests table) into a new file:

unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,cross_street_1,cross_street_2,intersection_street_1,intersection_street_2,address_type,city,landmark,facility_type,status,due_date,resolution_description,resolution_action_updated_date,community_board,borough,x_coordinate,y_coordinate,park_facility_name,park_borough,bbl,open_data_channel_type,vehicle_type,taxi_company_borough,taxi_pickup_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location

or with the following simple Python script (useful when the table has so many columns that typing them by hand is impractical), which queries the column names of the table and writes them into a CSV file:

from google.cloud import bigquery

client = bigquery.Client()

# Query the column names of the table, in their schema (ordinal) order
query = """
    SELECT column_name
    FROM `bigquery-public-data`.new_york_311.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = '311_service_requests'
    ORDER BY ordinal_position
"""
query_job = client.query(query)

# Write the column names as a single comma-separated header row
columns = [row["column_name"] for row in query_job]
with open("headers.csv", "w") as f:
    print(','.join(columns), file=f)

Note that for the above script to run you need to have the BigQuery Python Client library installed:

pip install --upgrade google-cloud-bigquery 

Upload the headers.csv file to your bucket:

gsutil cp headers.csv gs://<YOUR_BUCKET_NAME>/headers.csv
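If you would rather do the upload from Python as well, here is a one-line sketch with the Cloud Storage client library (assuming google-cloud-storage is installed and <YOUR_BUCKET_NAME> is replaced with your bucket name):

from google.cloud import storage

storage_client = storage.Client()

# Upload the local headers.csv created by the script above
storage_client.bucket("<YOUR_BUCKET_NAME>").blob("headers.csv").upload_from_filename("headers.csv")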

And now you are ready to create the final composite:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/all_data.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv

Alternatively, if you want the headers, you can skip creating the first composite and create the final one directly from all the sources:

gsutil compose gs://<YOUR_BUCKET_NAME>/headers.csv gs://<YOUR_BUCKET_NAME>/nyc_311_*.csv gs://<YOUR_BUCKET_NAME>/all_data_with_headers.csv
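For completeness, the same composition can be sketched with the Cloud Storage Python client, which may be handier if you are already scripting the previous steps (assuming the headers.csv object and the exported shards already exist in your bucket):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("<YOUR_BUCKET_NAME>")

# The header object first, then every exported shard (the wildcard expanded manually)
sources = [bucket.blob("headers.csv")]
sources += list(storage_client.list_blobs("<YOUR_BUCKET_NAME>", prefix="nyc_311_"))

# compose() accepts at most 32 source objects per call
bucket.blob("all_data_with_headers.csv").compose(sources)

Keep in mind the 32-source limit of compose: if the export produced more than 31 shards, you have to compose them in smaller batches first.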
  • Hi thanks for the answer. If I use a wildcard, how would I be able to combine those files into one file that I can work on effectively? Commented Dec 19, 2019 at 2:35
  • @KaustubhMulay I have edited my answer providing as many details as I could on this. If you have any questions please let me known.
    – itroulli
    Commented Dec 19, 2019 at 17:08
  • Thanks much for the detailed answer. I am grateful to you. If I have any questions, I will definitely let you know. Thanks again!! Commented Dec 21, 2019 at 3:19
  • @KaustubhMulay Since it was helpful, it would be nice if you could accept my answer!
    – itroulli
    Commented Dec 21, 2019 at 5:21
  • Done. Is there a way to limit the blobs to 32? Commented Dec 21, 2019 at 23:04

You can also use the Cloud SDK command-line tools (bq and gsutil):

  1. Create a bucket:

    gsutil mb gs://my-bigquery-temp  
    
  2. Extract the data set:

    bq extract --destination_format CSV --compression GZIP 'bigquery-public-data:new_york_311.311_service_requests' gs://my-bigquery-temp/dataset*
    

Please note that you have to use gs://my-bigquery-temp/dataset* because the dataset is too large to be exported to a single file.

  3. Check the bucket:

    gsutil ls gs://my-bigquery-temp
    
    gs://my-bigquery-temp/dataset000000000000
    
    ......................................
    
    gs://my-bigquery-temp/dataset000000000045
    

    You can find more information in Exporting table data.
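If you prefer to check the bucket from Python rather than with gsutil ls, here is a small sketch with the Cloud Storage client library (assuming google-cloud-storage is installed):

from google.cloud import storage

storage_client = storage.Client()

# List every exported shard in the temporary bucket
for blob in storage_client.list_blobs("my-bigquery-temp", prefix="dataset"):
    print(blob.name, blob.size)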

Edit:

To compose an object from the exported dataset files you can use gsutil tool:

 gsutil compose gs://my-bigquery-temp/dataset*  gs://my-bigquery-temp/composite-object

Please keep in mind that you cannot use more than 32 blobs (files) to compose the object.
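If the export produced more than 32 files, one possible workaround (just a sketch, not something from the documentation above; the part-* object names are placeholders) is to compose the shards in batches of at most 32 and then compose the intermediate objects:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("my-bigquery-temp")

# All exported shards, possibly more than 32
blobs = list(storage_client.list_blobs("my-bigquery-temp", prefix="dataset"))

# First pass: compose groups of at most 32 shards into intermediate objects
intermediates = []
for i in range(0, len(blobs), 32):
    part = bucket.blob("part-{:03d}".format(i // 32))
    part.compose(blobs[i:i + 32])
    intermediates.append(part)

# Second pass: compose the intermediates (fine as long as there are at most 32 of them)
bucket.blob("composite-object").compose(intermediates)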

Related SO question: Google Cloud Storage Joining multiple csv files

  • Hi thanks for the answer. If I use a wildcard, how would I be able to combine those files into one file that I can work on effectively? Commented Dec 19, 2019 at 2:36
