Cassandra Data Model

the cassandra data model
eben hewitt
cassandra summit - san francisco 8.10.2010

the data model no sql? design patterns

things we know about cassandra
it’s eventually consistent
it’s column-oriented

PACELC (Daniel Abadi)
if Partition: trade some Consistency for Availability
Else (normal circumstances): balance Consistency & Latency

cassandra is consistent when read replica count + write replica count > replication factor
cassandra is consistent if you read & write at CL.QUORUM (once Q nodes are up); good place to start
R + W > N
r = # of nodes consulted during read op
w = # of replicas consulted during write op
n = # of replicas (replication factor)
r + w > n = consistent
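The R + W > N rule can be checked with a few lines of arithmetic. This is a minimal sketch, not Cassandra code: the function names `quorum` and `is_consistent` are made up for illustration, and quorum is assumed to mean a simple majority of the N replicas.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (assumed definition of QUORUM)."""
    return n // 2 + 1


def is_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set: r + w > n."""
    return r + w > n


# With a replication factor of 3, reading and writing at quorum overlaps.
n = 3
q = quorum(n)
quorum_reads_and_writes = is_consistent(q, q, n)

# Reading and writing a single replica does not guarantee overlap.
one_and_one = is_consistent(1, 1, n)
```

With n = 3, quorum is 2, so 2 + 2 > 3 holds, while 1 + 1 > 3 does not.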

cassandra is row-oriented
each row is uniquely identifiable by key
rows group columns and super columns

column
atomic unit
name : value : timestamp
email : alison@foo.com : 12578123685

column family
User {
  123 : { email: alison@foo.com, img: },
  456 : { email: eben@bar.com, username: The Situation }
}
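One way to picture a column family is as a map of row keys to maps of columns. The sketch below uses plain Python structures, not a Cassandra client. The `Column` class and `get_column` helper are hypothetical names, the `img` column is left out because its value is not shown on the slide, and the timestamps on row 456 are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """The atomic unit: name, value, timestamp."""
    name: str
    value: str
    timestamp: int


# Row key -> {column name -> Column}. Rows need not share the same columns.
user_cf = {
    "123": {
        "email": Column("email", "alison@foo.com", 12578123685),
    },
    "456": {
        # Timestamps below are assumed, not from the slide.
        "email": Column("email", "eben@bar.com", 12578123685),
        "username": Column("username", "The Situation", 12578123685),
    },
}


def get_column(cf, row_key, column_name):
    """Look up one column value by row key, then by column name."""
    return cf[row_key][column_name].value


email = get_column(user_cf, "123", "email")
```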

super column super columns group columns under a common name

about super column families
sub-column names inside a SCF are not indexed
top level columns (SCF Name) are always indexed
often used for denormalizing data from standard CFs

super column family
PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx
  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
  } //end nyc
}
callouts: key, super column, column, flexible schema
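The PointOfInterest example adds one more level of nesting: row key, then super column name, then sub-column. A rough sketch with nested dicts follows; the helper names `super_column_names` and `sub_column` are hypothetical and this is not a Cassandra API.

```python
# Row key -> super column name -> {sub-column name -> value}
point_of_interest = {
    "85255": {
        "Phoenix Zoo": {"phone": "480-555-5555", "desc": "They have animals here."},
        "Spring Training": {"phone": "623-333-3333", "desc": "Fun for baseball fans."},
    },
    "10019": {
        "Central Park": {"desc": "Walk around. It's pretty."},
        "Empire State Building": {
            "phone": "212-777-7777",
            "desc": "Great view from 102nd floor.",
        },
    },
}


def super_column_names(scf, row_key):
    """Names of the super columns stored under one row key, sorted."""
    return sorted(scf[row_key])


def sub_column(scf, row_key, super_name, sub_name, default=None):
    """Value of one sub-column, or default when that row does not have it."""
    return scf[row_key][super_name].get(sub_name, default)


nyc = super_column_names(point_of_interest, "10019")
zoo_phone = sub_column(point_of_interest, "85255", "Phoenix Zoo", "phone")
# Flexible schema: Central Park simply has no phone sub-column.
park_phone = sub_column(point_of_interest, "10019", "Central Park", "phone")
```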

what about… SELECT WHERE ORDER BY JOIN ON GROUP ?

rdbms : domain-based model
what answers do I have?
cassandra : query-based model
what questions do I have?

Questions
• Find hotels in a given area
• Find information about a given hotel, such as its name and location
• Find Points of Interest near a given hotel
• Find an available room in a given date range
• Find the rate and amenities for a room
• Book the selected room by entering guest information

SELECT WHERE
<<cf>> USER
Key: UserID
Cols: username, email, birth date, city, state
To support a query like SELECT * FROM User WHERE city = 'Austin':
Create a new CF called UserCity:
<<cf>> USERCITY
Key: city
Cols: IDs of the users in that city.
Also uses the Valueless Column pattern.
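The User and UserCity pair can be sketched with two in-memory maps standing in for the two column families. This is an illustration under stated assumptions, not client code: `write_user` and `users_in_city` are hypothetical helpers, the sample users are invented, and the index columns carry an empty byte string as their value (the Valueless Column idea).

```python
user = {}       # User CF: user ID -> columns
user_city = {}  # UserCity CF: city -> {user ID: b""} (column name is the data)


def write_user(user_id, columns):
    """Write to the primary CF and keep the UserCity index in step."""
    old = user.get(user_id)
    if old is not None and old.get("city") != columns.get("city"):
        # The user moved: drop the stale index column.
        user_city.get(old.get("city"), {}).pop(user_id, None)
    user[user_id] = dict(columns)
    city = columns.get("city")
    if city is not None:
        user_city.setdefault(city, {})[user_id] = b""


def users_in_city(city):
    """Equivalent of SELECT * FROM User WHERE city = ?, answered by key."""
    return [user[uid] for uid in sorted(user_city.get(city, {}))]


write_user("u1", {"username": "alison", "city": "Austin", "state": "TX"})
write_user("u2", {"username": "eben", "city": "Austin", "state": "TX"})
write_user("u3", {"username": "sam", "city": "Dallas", "state": "TX"})
austin = [u["username"] for u in users_in_city("Austin")]
```

The cost shows up on the write path: every write to User also touches UserCity.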

SELECT WHERE pt 2
Use an aggregate row key
state:city: { user1, user2}
Get rows between TX: & TX; for all Texas users
Get rows between TX:Austin & TX:Austin1 for all Austin users
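The range bounds work because `;` is the character right after `:` in ASCII, so every key beginning with `TX:` sorts before `TX;`. A small sketch over a sorted key list follows. It assumes keys are stored in key order (as with an order-preserving partitioner), uses a half-open range for simplicity, and the keys other than the Texas ones are invented.

```python
import bisect

# Aggregate row keys of the form state:city, kept in sorted order.
keys = sorted([
    "NY:New York",
    "TX:Austin",
    "TX:Dallas",
    "TX:Houston",
    "UT:Provo",
])


def range_scan(sorted_keys, start, end):
    """Return keys k with start <= k < end (half-open, an assumption here)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


texas = range_scan(keys, "TX:", "TX;")
austin = range_scan(keys, "TX:Austin", "TX:Austin1")
```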

ORDER BY
Rows are placed according to their Partitioner:
Random: MD5 of key
Order-Preserving: actual key
Rows are sorted by key, regardless of partitioner
Columns are sorted according to CompareWith or CompareSubcolumnsWith
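To see why the partitioner matters for placement, compare the two token schemes on a few keys. This is a simplified sketch: `random_token` treats the MD5 digest as an unsigned 128-bit integer, which approximates but may not exactly match how Cassandra derives its tokens, and the sample keys are made up.

```python
import hashlib


def random_token(key: str) -> int:
    """Token for the Random scheme: MD5 of the key, read as an integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def order_preserving_token(key: str) -> str:
    """Token for the Order-Preserving scheme: the key itself."""
    return key


keys = ["austin", "boston", "chicago"]

# Placement order around the ring under each scheme.
by_opp = sorted(keys, key=order_preserving_token)
by_random = sorted(keys, key=random_token)
```

Under the order-preserving scheme neighbouring keys land next to each other, which is what makes key range scans meaningful; under the random scheme the placement order bears no relation to the key text.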

JOIN ON
“join” means “create a relationship”
rdbms : pay at runtime
cassandra : pay at design time
representing the relationship
rdbms : opaque, in query
cassandra : transparent, first-class citizen

GROUP
SELECT COUNT(*) FROM Hotel GROUP BY ZipCode
calculated column value
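A calculated column value means the count is maintained when data is written instead of being computed when it is read. The sketch below is one way to picture that with in-memory maps; `add_hotel`, the hotel IDs, and the hotel names are all hypothetical.

```python
from collections import defaultdict

hotel = {}                            # Hotel CF: hotel ID -> columns
hotel_count_by_zip = defaultdict(int)  # zip code -> calculated count


def add_hotel(hotel_id, name, zip_code):
    """Store the hotel and update the per-zip count in the same step."""
    if hotel_id in hotel:
        # Overwrite: take the hotel out of its previous zip's count first.
        hotel_count_by_zip[hotel[hotel_id]["zip"]] -= 1
    hotel[hotel_id] = {"name": name, "zip": zip_code}
    hotel_count_by_zip[zip_code] += 1


add_hotel("h1", "Hotel A", "85255")
add_hotel("h2", "Hotel B", "85255")
add_hotel("h3", "Hotel C", "10019")

# Reading the "GROUP BY" result is now a plain lookup.
phoenix_count = hotel_count_by_zip["85255"]
```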

materialized view
problem: You need to perform SELECT FROM WHERE queries.
solution: Create a new CF. Use the WHERE idea as the row key.
impacts: Must also write to the index every time you write to the primary CF. Or run as a job.

valueless column
problem: Indexes require repeating columns from other column families.
solution: Treat the name of the column as the value. Use a byte[0] as the column ‘value’.
impacts: Only works with <= 2B columns in 0.7

composite key
problem: Keys must support references and uniqueness.
solution: Fuse multiple values with a separator.
impacts: Can substitute for Super Column. Use a Custom Comparator if necessary.
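Fusing values with a separator is easy to sketch, with one caveat: the separator must not appear inside any part, or the key cannot be split back unambiguously. The helpers `make_key` and `split_key` and the choice of `:` as separator are assumptions for illustration.

```python
SEPARATOR = ":"  # assumed separator; any character absent from the parts works


def make_key(*parts: str) -> str:
    """Fuse several values into one row key."""
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"part {part!r} contains the separator")
    return SEPARATOR.join(parts)


def split_key(key: str) -> list:
    """Recover the original values from a composite key."""
    return key.split(SEPARATOR)


key = make_key("TX", "Austin")
parts = split_key(key)
```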

semantic key
problem: Keys must be unique. You use OPP.
solution: Use a key that is meaningful in your application.
impacts: Harder to get right. Can proliferate Indexes. Range scans over keys less efficient.

client clock sync
problem: You need to keep clocks on different clients synchronized to support read repair.
solution: System.nanoTime() used in StorageProxy. NTP.
impacts: Consider geographic dispersion.
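Clock sync matters because conflicting versions of a column are reconciled by timestamp, and the timestamp comes from the client. The sketch below shows a skewed client making an older write win. It is a simplification: `resolve` is a hypothetical helper, ties are broken arbitrarily, and the values and skew are invented.

```python
def resolve(a, b):
    """Pick the winning (value, timestamp) pair: highest timestamp wins."""
    return a if a[1] >= b[1] else b


# Timestamps in microseconds. Client A's clock runs 5 seconds fast (assumed).
SKEW = 5_000_000
write_from_a = ("alison@old.example", 1_000_000 + SKEW)  # happened first
write_from_b = ("alison@new.example", 1_500_000)         # happened later

# The earlier write carries the larger timestamp, so it wins.
winner = resolve(write_from_a, write_from_b)
```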

is cassandra a good fit?
very fast writes
need to be always writeable
lots of data
evolving schema

custom comparators i want to compare with float lat/long
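The need for a custom comparator shows up as soon as numeric names are compared as text. The sketch below illustrates the ordering difference in plain Python rather than as a Cassandra comparator class. The `"lat,long"` column-name encoding, the `lat_long_key` function, and the coordinates are assumptions for illustration.

```python
def lat_long_key(column_name: str):
    """Sort key that compares a 'lat,long' column name as two floats."""
    lat, lon = column_name.split(",")
    return (float(lat), float(lon))


names = ["9.5,-100.0", "33.45,-112.07", "-10.25,20.0", "40.78,-73.97"]

# Text comparison puts "9.5" after "40.78" because "9" > "4" as characters.
as_text = sorted(names)

# Float comparison gives the numeric order by latitude, then longitude.
as_floats = sorted(names, key=lat_long_key)
```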

choose keys carefully
key-based routing to find data
queries are executed by key
keys need to be easily discoverable
you rely on keys for referential integrity

new use cases geographic data weather data rfid travel schedules data services in soa hotel reservations CEP

see also
http://dfeatherston.com/cassandra-adf-uiuc-su10.pdf
http://github.com/dietrichf/brireme

@ebenhewitt cassandraguide.com
