
A simple question:

Is the data processed by Google BigQuery stored on Google Cloud Storage and merely segmented for BigQuery's purposes, or does BigQuery have its own storage mechanism?

I'm trying to learn the architecture. The diagrams I've seen show arrows pointing back and forth between the two services, but they don't say where BigQuery's own storage sits.

Thanks.

1 Answer

From "BigQuery under the hood":

Colossus - Distributed Storage

BigQuery relies on Colossus, Google’s latest generation distributed file system. Each Google datacenter has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash) and distributed management (so there is no single point of failure). Colossus is fast enough to allow BigQuery to provide similar performance to many in-memory databases, but leveraging much cheaper yet highly parallelized, scalable, durable and performant infrastructure.

BigQuery leverages the ColumnIO columnar storage format and compression algorithm to store data in Colossus in the most optimal way for reading large amounts of structured data. Colossus allows BigQuery users to scale to dozens of Petabytes in storage seamlessly, without paying the penalty of attaching much more expensive compute resources — typical with most traditional databases.

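The part about ColumnIO is outdated, since BigQuery now uses the Capacitor format, but the rest is still relevant. In short, BigQuery keeps its managed data in its own columnar format on Colossus rather than in Google Cloud Storage.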
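To make the "columnar storage format" point a bit more concrete, here is a toy sketch in plain Python. It is not ColumnIO or Capacitor, whose internals are not described in the quoted text; the table, the column names, and the helpers `to_columns` and `run_length_encode` are all hypothetical. It only illustrates the general idea: storing each column separately lets a query read just the columns it needs, and repeated values within a column tend to compress well.

```python
# Toy illustration of column-oriented storage. NOT BigQuery's real format.
# The table contents and helper names below are made up for this sketch.

rows = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]


def to_columns(records):
    """Pivot row records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one column touches only that column's values,
# never the "user" or "country" data.
total_bytes = sum(columns["bytes"])

# Sorting a low-cardinality column produces long runs that encode compactly.
country_runs = run_length_encode(sorted(columns["country"]))
```

A real columnar format adds much more (nested and repeated fields, dictionary encoding, block-level metadata), so treat this only as a mental model of why a column layout suits scans over large structured datasets.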

  • Is Colossus part of Google Cloud Storage, meaning both services use it? Or are GCS and Colossus separate architectures?
    – arcee123
    Commented Aug 10, 2017 at 20:46
  • GCS is built on top of Colossus. Colossus provides a lower-level storage API for Google's own services.
    Commented Aug 10, 2017 at 20:49
  • Thank you! That's the piece I needed to know.
    – arcee123
    Commented Aug 10, 2017 at 20:59
