
What's the best way to export data from BigQuery to Google Cloud Storage? Note, I need to run a query against BigQuery, not export all the data. Essentially, I need to run a custom query against BigQuery (like select * from mytable where code=foo) and the results of the query need to be written to a CSV stored on Google Cloud Storage. I believe the best way to do this is via Google Dataflow. Let me know if there are other options. Also, I am looking for some samples on how to accomplish this. Is there somewhere I can find some examples?

This is what I have so far:

    PipelineOptions pipelineOptions = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(pipelineOptions);

    Date date = new Date();

    p.getOptions().setTempLocation("gs://mybucket/tmp"+date.getTime());

    PCollection<TableRow> rowPCollection = p.apply(BigQueryIO.Read.named("promos")
            .fromQuery("SELECT * FROM [projectid:mydataset.mytable] where id = 256 LIMIT 1000"));

    PCollection<String> stringPCollection = rowPCollection.apply(ParDo.named("Extract").of(new DoFn<TableRow, String>() {
        @Override
        public void processElement(ProcessContext c) {
            TableRow tableRow = c.element();
            try {
                String prettyString = tableRow.toPrettyString();
                c.output(prettyString);
            } catch (IOException e) {
                log.error("Exception occurred:" + e.getMessage());
            }
        }
    }));

    stringPCollection.apply(TextIO.Write.named("WriteOutput").to("gs://mybucket/avexport").withSuffix(".csv"));

    p.run();

When this is run, an exception is thrown at the creation of the ParDo:

Caused by: java.io.NotSerializableException: com.my.validation.CommonValidator
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:50)

2 Answers


I'm guessing that your anonymous DoFn is pulling in something from the enclosing class (CommonValidator) which is failing to serialize. If you create a static class for your DoFn implementation, does that fix the problem?

For more information, please see NotSerializableException on anonymous class.
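
For illustration, here is a minimal sketch of that fix, using the Dataflow SDK 1.x API from your snippet (FormatFn is a hypothetical name):

    // A static nested class carries no implicit reference to the enclosing
    // instance, so serializing the DoFn does not pull in CommonValidator.
    static class FormatFn extends DoFn<TableRow, String> {
        @Override
        public void processElement(ProcessContext c) throws IOException {
            c.output(c.element().toPrettyString());
        }
    }

and in the pipeline:

    PCollection<String> stringPCollection =
            rowPCollection.apply(ParDo.named("Extract").of(new FormatFn()));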


The error aside, you don't have to use Dataflow to export BigQuery data to GCS unless you're doing some complex transformations in your Dataflow pipeline (which you could almost certainly do in SQL/UDFs anyway, but I digress). From your code snippet and description, you don't seem to be doing any type of transform on the data.

You could just:

  1. Run your SQL and save the results to a BigQuery table.
  2. Export the table to GCS as described here (both steps are sketched below).
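
If you want to drive those two steps from Java instead of the console, here is a rough sketch, assuming the com.google.cloud.bigquery client library (dataset, table, and bucket names are placeholders):

    import com.google.cloud.bigquery.*;

    public class ExportQueryResults {
        public static void main(String[] args) throws InterruptedException {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

            // 1. Run the query and save the results to a destination table.
            TableId destination = TableId.of("mydataset", "mytable_export");
            QueryJobConfiguration queryConfig = QueryJobConfiguration
                    .newBuilder("SELECT * FROM mydataset.mytable WHERE id = 256")
                    .setDestinationTable(destination)
                    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                    .build();
            bigquery.create(JobInfo.of(queryConfig)).waitFor();

            // 2. Export the destination table to GCS as sharded CSV files.
            Table table = bigquery.getTable(destination);
            table.extract("CSV", "gs://mybucket/avexport-*.csv").waitFor();
        }
    }

BigQuery's extract job handles the CSV writing and shards the output across files matching the wildcard URI, so no Dataflow pipeline is needed.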
  • Thanks for that suggestion. The data I have is in Google Cloud Datastore and also in BigQuery, so the data is in both places. So the options are: Option 1: 1. Run a SQL query against datastore and write to bigQuery. 2. Then do the export from BigQuery to Storage. Option 2: 1. Run a SQL query against BigQuery and write to another table in BigQuery. 2. Then do the export from BigQuery to Storage. Is Option 2.1 doable without Dataflow?
    – verma
    Commented Jan 5, 2017 at 1:49
  • @verma - you never mentioned anything about Cloud Datastore in your question. Secondly, Cloud Datastore is a NoSQL solution, so you would not be able to "Run a SQL query against datastore and write to bigQuery". Commented Jan 5, 2017 at 4:33
  • Yeah. We have data getting written in both places. Cloud Datastore is our primary database and we are replicating data into BigQuery just for this use case. So based on what you said, here is what I am thinking: 1. Execute a query against the primary database (Cloud Datastore). 2. Write the data in BigQuery to a new table 'mytable-uuid'. 3. Execute an export from the table created in step 2 to Cloud Storage. How should I go about executing all these steps? Isn't Cloud Dataflow the best tool to use here?
    – verma
    Commented Jan 5, 2017 at 15:53
  • Is there a way I can export data as CSV from Datastore to Storage? Essentially, I want to run a query against the Datastore and then export the results as a CSV?
    – verma
    Commented Jan 5, 2017 at 20:15
