I'd like to explore Google Analytics 360 data with bigrquery using dplyr syntax (rather than SQL), if possible. The gist is that I want to understand user journeys: I'm interested in finding the most common sequences of pages at the user level (even across sessions).

I thought I could do it this way:

sample_query <- ga_sample %>%
  select(fullVisitorId, date, visitStartTime, totals, channelGrouping,
  hits.page.pagePath) %>% 
  collect()

But I get an error that hits.page.pagePath was not found. Then I tried:

sample_query <- ga_sample %>%
  select(fullVisitorId, date, visitStartTime, totals, channelGrouping, hits) %>% 
  collect() %>% 
  unnest_wider(hits)

But the result is Error: Requested Resource Too Large to Return [responseTooLarge], which makes perfect sense.

From what I've gathered, with the SQL syntax, the workaround is to unnest remotely, and select only the hits.page.pagePath field (rather than the entire hits top-level field).

E.g., something like this (which is a different query, but conveys the point):

SELECT
  hits.page.pagePath
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20160801` AS GA,
  UNNEST(GA.hits) AS hits
GROUP BY
  hits.page.pagePath

Is it possible to do something similar with dplyr syntax? If it's not possible, what's the best approach with SQL?

Thanks!

Update: Actual query/code

SELECT DISTINCT
fullVisitorId, visitId, date, visitStartTime, hits.page.pagePath, hits.time, geoNetwork.networkDomain
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE _TABLE_SUFFIX BETWEEN "20191101" AND "20191102"
AND geoNetwork.networkDomain NOT LIKE "%google%"
  • Do you have the bigquery code/syntax that would execute your query?
    – Simon.S.A.
    Commented Dec 16, 2019 at 19:59
  • Hey @Simon.S.A.: I've added the full query at the end of the question.
    – Khashir
    Commented Dec 17, 2019 at 20:32

2 Answers

The kinds of queries dbplyr can create when translating from R to BigQuery (or whatever database language you are using) depend on the translations that have been defined between R and BigQuery. I cannot find any example that suggests a translation is defined for UNNEST in the existing dbplyr package. Reference 1, Reference 2

One workaround is to define a custom function that does not do the translation within dbplyr, but acts as a translator alongside dbplyr. I have used this approach with success before, when I needed PIVOT in SQL but could not find a translation for tidyr::spread.

The approach works because remote tables in dbplyr are defined by two things: (1) the connection to the remote database, and (2) the code/query that returns the current view of the table. Hence, whenever dbplyr translates R to BigQuery or SQL, it is only updating the second half of the definition.
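To make the "connection plus query" idea concrete, here is a minimal sketch that needs no database. It assumes dplyr and dbplyr are installed and uses dbplyr's simulated connection (simulate_dbi()) instead of BigQuery; the column names id and value are made up for illustration. It only shows that the query half of a lazy table can be pulled out as text and wrapped in hand-written SQL.

```r
library(dplyr)
library(dbplyr)

# A lazy table on a simulated connection: nothing is ever executed,
# dbplyr only tracks the query that would be sent.
# (The columns `id` and `value` are illustrative assumptions.)
lazy_tbl <- lazy_frame(id = 1L, value = "a", con = simulate_dbi())

# dplyr verbs only change the query half of the table's definition.
filtered <- lazy_tbl %>% filter(id > 0L)

# Extract the query dbplyr has built so far ...
inner_sql <- as.character(sql_render(filtered))

# ... and wrap it in hand-written SQL that dbplyr would not generate itself.
wrapped_sql <- sql(paste0("SELECT src.* FROM (\n", inner_sql, "\n) AS src"))

print(wrapped_sql)
```

With a real connection, an object like wrapped_sql can be handed back to dplyr::tbl(connection, wrapped_sql) to obtain a new lazy table, which is the pattern the custom function relies on.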

We can do this using a custom function:

unnest <- function(input_tbl, select_columns, array_column, unnested_columns){

  # extract connection
  db_connection <- input_tbl$src$con

  # collapse the column vectors into comma-separated strings
  select_columns <- paste0(select_columns, collapse = ", ")
  unnested_columns <- paste0(paste0("un.", unnested_columns), collapse = ", ")

  # build SQL unnest query
  # sql() marks interpolated values as raw SQL so build_sql does not quote them as string literals
  sql_query <- dbplyr::build_sql(
    con = db_connection
    ,"SELECT ", dbplyr::sql(select_columns), ", position, ", dbplyr::sql(unnested_columns), "\n"
    ,"FROM (\n"
    ,dbplyr::sql_render(input_tbl)
    ,"\n) AS src\n"
    ,"CROSS JOIN UNNEST(", dbplyr::sql(array_column), ") AS un WITH OFFSET position"
  )

  return(dplyr::tbl(db_connection, dbplyr::sql(sql_query)))
}

Note that I am a dbplyr user, but not a BigQuery user, so my syntax in the above may not be quite perfect. I have followed this question and this one for syntax.

Example use:

remote_table = tbl(bigquery_connection, from = "table_name")
unnested_table = unnest(remote_table, "ID", "array_col", "list")

# check syntax of dbplyr query
unnested_table %>% show_query()
# if this is not a valid bigquery query then next command will error

# view top 10 rows
unnested_table %>% head(10)

If remote_table looks like:

ID ARRAY_COL
01 list = [a,b,c]
02 list = [d,e]
03 list = [q]

Then unnested_table should look like:

ID POSITION un.list
01    0        a
01    1        b
01    2        c
02    0        d
02    1        e
03    0        q
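For intuition only, the same reshaping can be reproduced locally in base R. This is a sketch of what CROSS JOIN UNNEST ... WITH OFFSET is expected to do to the toy table above, not something that runs on BigQuery; the object names remote_like and unnested are illustrative.

```r
# Local stand-in for the remote table: one array (list element) per ID.
remote_like <- data.frame(ID = c("01", "02", "03"), stringsAsFactors = FALSE)
remote_like$array_col <- list(c("a", "b", "c"), c("d", "e"), "q")

# Number of elements in each array.
lens <- lengths(remote_like$array_col)

# One output row per array element; position is the zero-based offset
# of the element within its array, mirroring WITH OFFSET.
unnested <- data.frame(
  ID       = rep(remote_like$ID, times = lens),
  position = sequence(lens) - 1L,
  list     = unlist(remote_like$array_col),
  stringsAsFactors = FALSE
)

print(unnested)
```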

And unnested_table %>% show_query() should look something like:

<SQL>
SELECT ID, position, un.list
FROM (
    SELECT *
    FROM table_name
) AS src
CROSS JOIN UNNEST(ARRAY_COL) AS un WITH OFFSET position

Update to match target query

I am not aware of any dbplyr feature that will translate _TABLE_SUFFIX BETWEEN "20191101" AND "20191102" easily, so you will have to handle this another way, perhaps by looping over a list of dates in R.

The first step is to get dbplyr to render the query prior to unnesting. Probably something like:

for(date in c("20191101", "20191102")){
    table_name = paste0("bigquery-public-data.google_analytics_sample.ga_sessions_",date)

    remote_table = tbl(bigquery_connection, from = table_name)

    remote_table = remote_table %>%
        filter(! (geoNetwork.networkDomain %like% "%google%")) %>%
        select(fullVisitorId, visitId, date, visitStartTime, hits, geoNetwork.networkDomain) %>%
        distinct()
}

Calling show_query(remote_table) should then produce something equivalent to the following, although it will not be exactly identical because dbplyr writes code differently from a human.

SELECT DISTINCT fullVisitorId, visitId, date, visitStartTime, hits, geoNetwork.networkDomain
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20191101`
WHERE NOT(geoNetwork.networkDomain LIKE "%google%")
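For longer date ranges, the list of dates does not have to be typed out by hand. The following base R sketch builds one table name per day; ga_table_names is a hypothetical helper introduced here for illustration, and its default prefix is simply the table prefix used in the loop above.

```r
# Hypothetical helper: one table name per day between two dates (inclusive).
ga_table_names <- function(start_date, end_date,
                           prefix = "bigquery-public-data.google_analytics_sample.ga_sessions_") {
  # Sequence of calendar days, then formatted as the YYYYMMDD table suffix.
  days <- seq(as.Date(start_date), as.Date(end_date), by = "day")
  paste0(prefix, format(days, "%Y%m%d"))
}

table_names <- ga_table_names("2019-11-01", "2019-11-02")
print(table_names)
```

Each name can then be passed to tbl() as in the loop. If a single result is wanted, the per-date lazy tables could presumably be stored in a list and combined with dplyr::union_all, though I have not tested that against BigQuery.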

The second step is to call the custom unnest function:

remote_table = unnest(remote_table,
                      select_columns = c("fullVisitorId", "visitId", "date", "visitStartTime", "geoNetwork.networkDomain"),
                      array_column = "hits",
                      unnested_columns = c("page.pagePath", "time")
               )

Calling show_query(remote_table) should then produce the following:

SELECT fullVisitorId, visitId, date, visitStartTime, geoNetwork.networkDomain, position, un.page.pagePath, un.time
FROM (

the_query_from_the_first_step

) AS src
CROSS JOIN UNNEST(hits) AS un WITH OFFSET position

That is probably as far as I can assist as I do not have a bigquery environment to test this in myself. You may have to adjust the custom unnest function to get it to exactly match your context. Hopefully the above is enough to get you started.

  • Hey @Simon.S.A.: I tried pasting the unnest function in R and got a few errors that I couldn't quickly fix (I don't know much about functions). I'll try a few fixes and send you a screenshot with the errors.
    – Khashir
    Commented Dec 18, 2019 at 20:24
  • There is a close quote missing before the "\n" fixed above.
    – Simon.S.A.
    Commented Dec 18, 2019 at 20:49
  • What's your environment like? I now get "Error: unexpected string constant in: " ,"\n) AS src\n" ,"CROSS JOIN UNNEST(", array_column ") AS un WITH OFFSET position""
    – Khashir
    Commented Dec 18, 2019 at 20:51
  • There was a comma missing after array_column and before ")... fixed above
    – Simon.S.A.
    Commented Dec 18, 2019 at 21:27
  • Thanks! The function works but when I try it with actual BigQuery fields, I get an error: `object 'position' not found`. Not sure if you want to take a stab with the Sandbox: cloud.google.com/bigquery/docs/sandbox
    – Khashir
    Commented Dec 18, 2019 at 22:49

As noted in the comment, the function given by Simon.S.A. doesn't work (valiant attempt, but was not familiar with bigquery).

I made some alterations to create a function that does work with a single nested variable.

library(magrittr)
library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(bigrquery)

bq_deauth()
bq_auth(email="[email protected]")

bq_conn = dbConnect(
  bigquery(),
  project = "elite-magpie-257717",
  dataset = "test_dataset"
)

df = tibble(
  chr =   c(1,1,1,2,2,3),
  start = c(0, 10, 12, 0, 5, 1),
  end =   c(2, 11, 15, 1, 8, 3)
)

df %>%
  rowwise() %>% mutate(range = list(seq(start, end)))
#> # A tibble: 6 x 4
#> # Rowwise: 
#>     chr start   end range    
#>   <dbl> <dbl> <dbl> <list>   
#> 1     1     0     2 <int [3]>
#> 2     1    10    11 <int [2]>
#> 3     1    12    15 <int [4]>
#> 4     2     0     1 <int [2]>
#> 5     2     5     8 <int [4]>
#> 6     3     1     3 <int [3]>

df %>%
  rowwise() %>% mutate(range = list(seq(start, end))) %>%
  unnest(range)
#> # A tibble: 18 x 4
#>      chr start   end range
#>    <dbl> <dbl> <dbl> <int>
#>  1     1     0     2     0
#>  2     1     0     2     1
#>  3     1     0     2     2
#>  4     1    10    11    10
#>  5     1    10    11    11
#>  6     1    12    15    12
#>  7     1    12    15    13
#>  8     1    12    15    14
#>  9     1    12    15    15
#> 10     2     0     1     0
#> 11     2     0     1     1
#> 12     2     5     8     5
#> 13     2     5     8     6
#> 14     2     5     8     7
#> 15     2     5     8     8
#> 16     3     1     3     1
#> 17     3     1     3     2
#> 18     3     1     3     3

dbWriteTable(
  bq_conn,
  name = "test_dataset.range_test",
  value = df,
  overwrite = T
)

df_bq = tbl(bq_conn, "test_dataset.range_test")

df_bq %>%
  mutate(range = generate_array(start, end, 1))
#> # Source:   lazy query [?? x 4]
#> # Database: BigQueryConnection
#>     end start   chr range    
#>   <int> <int> <int> <list>   
#> 1     2     0     1 <dbl [3]>
#> 2    11    10     1 <dbl [2]>
#> 3    15    12     1 <dbl [4]>
#> 4     1     0     2 <dbl [2]>
#> 5     8     5     2 <dbl [4]>
#> 6     3     1     3 <dbl [3]>

df_bq %>%
  mutate(range = generate_array(start, end, 1)) %>%
  unnest_wider(range)
#> Error: `x` must be a vector, not a `tbl_BigQueryConnection/tbl_dbi/tbl_sql/tbl_lazy/tbl` object.


my_unnest = function(input_tbl, array_column)
{

  ### extract connection
  db_connection = input_tbl$src$con

  ### column names surrounded by `` and separated by commas
  all_cols =
    colnames(input_tbl) %>%
    sprintf("`%s`", .) %>%
    paste(., collapse=", ")

  ### Build sql string (newlines flattened so the printed query stays on one line)
  sql_string =
    paste0(
      "SELECT ", all_cols, " ",
      "FROM (", dbplyr::sql_render(input_tbl), ") ",
      "CROSS JOIN UNNEST(`", array_column, "`) AS `", array_column, "`"
    ) %>%
    str_replace_all("\n", " ")

  ### Build query object
  sql_query = dbplyr::sql(sql_string)

  print(sql_query)

  ### Return a lazy table defined by the hand-built query
  return(dplyr::tbl(db_connection, sql_query))
}


df_bq %>%
  mutate(range = generate_array(start, end, 1)) %>%
  my_unnest("range")
#> <SQL> SELECT `end`, `start`, `chr`, `range` FROM (SELECT `end`, `start`, `chr`, generate_array(`start`, `end`, 1.0) AS `range` FROM `test_dataset.range_test`) CROSS JOIN UNNEST(`range`) AS `range`
#> # Source:   SQL [?? x 4]
#> # Database: BigQueryConnection
#>      end start   chr range
#>    <int> <int> <int> <dbl>
#>  1     2     0     1     0
#>  2     2     0     1     1
#>  3     2     0     1     2
#>  4    11    10     1    10
#>  5    11    10     1    11
#>  6    15    12     1    12
#>  7    15    12     1    13
#>  8    15    12     1    14
#>  9    15    12     1    15
#> 10     1     0     2     0
#> # ... with more rows

Created on 2021-02-18 by the reprex package (v1.0.0)

Note, it is important to make sure you specify the dataset in the connection (not just the project) or it will throw an error for missing dataset.

Furthermore, if you call the function unnest you will clobber tidyr::unnest which you may not want to do.
