the cassandra data model
eben hewitt
cassandra summit - san francisco, 8.10.2010
 
the data model · no sql? · design patterns
things we know about cassandra
• it’s eventually consistent
• it’s column-oriented
these things are wrong.
?...
1. consistency: eventual?
1. consistency: tune-able?
PACELC (Daniel Abadi)
• if Partition: trade some Consistency for Availability
• Else (normal circumstances): balance Consistency & Latency
cassandra is consistent when
read replica count + write replica count > replication factor

cassandra is consistent if you read & write at CL.QUORUM (once Q nodes are up); a good place to start

R + W > N
• r = # of replicas consulted during a read op
• w = # of replicas consulted during a write op
• n = replication factor (# of replicas)
• r + w > n = consistent
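The R + W > N rule above can be checked with a few lines of arithmetic. This is a minimal sketch, not Cassandra code: `is_consistent` and `quorum` are hypothetical helper names, and the quorum size is assumed to be a simple majority of the N replicas.

```python
def quorum(n: int) -> int:
    """Majority of n replicas (assumed definition of Q for this sketch)."""
    return n // 2 + 1


def is_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set, i.e. R + W > N."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


n = 3
q = quorum(n)
print(q, is_consistent(q, q, n))   # quorum reads + quorum writes
print(is_consistent(1, 1, n))      # one replica on each side
```

With N = 3, reading and writing at quorum consults 2 replicas each, so at least one replica in any read set holds the latest write. Consulting a single replica on both sides gives no such overlap.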
2. cassandra is row-oriented
• each row is uniquely identifiable by key
• rows group columns and super columns
column: the atomic unit
name : value : timestamp
email : alison@foo.com : 12578123685
column family
User {
  123 : { email: alison@foo.com, img: },
  456 : { email: eben@bar.com, username: The Situation }
}
column family
super column: super columns group columns under a common name
super column family
about super column families
• sub-column names inside a SCF are not indexed
• top-level columns (SCF name) are always indexed
• often used for denormalizing data from standard CFs
PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx
  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
  } //end nyc
}
diagram labels: key, super column, column, super column family, flexible schema
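One way to picture the PointOfInterest super column family is as three levels of nested maps: row key, then super column name, then sub-column name. The sketch below models it with plain Python dicts. It is only an illustration of the shape, not a client API, and `get_subcolumn` is a hypothetical helper.

```python
# row key -> super column name -> sub-column name -> value
point_of_interest = {
    "85255": {
        "Phoenix Zoo": {"phone": "480-555-5555", "desc": "They have animals here."},
        "Spring Training": {"phone": "623-333-3333", "desc": "Fun for baseball fans."},
    },
    "10019": {
        "Central Park": {"desc": "Walk around. It's pretty."},
        "Empire State Building": {
            "phone": "212-777-7777",
            "desc": "Great view from 102nd floor.",
        },
    },
}


def get_subcolumn(scf, row_key, super_column, name, default=None):
    """Look up one sub-column; any missing level yields the default."""
    return scf.get(row_key, {}).get(super_column, {}).get(name, default)


print(get_subcolumn(point_of_interest, "85255", "Phoenix Zoo", "phone"))
print(get_subcolumn(point_of_interest, "10019", "Central Park", "phone"))
```

The second lookup returns the default because Central Park has no phone sub-column, which is the flexible schema at work: rows and super columns need not share the same set of names.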
the data model · no sql? · design patterns
what about… SELECT WHERE, ORDER BY, JOIN ON, GROUP?
rdbms: domain-based model. what answers do I have?
cassandra: query-based model. what questions do I have?
Questions
• Find hotels in a given area
• Find information about a given hotel, such as its name and location
• Find Points of Interest near a given hotel
• Find an available room in a given date range
• Find the rate and amenities for a room
• Book the selected room by entering guest information
SELECT WHERE

<<cf>> USER
Key: UserID
Cols: username, email, birth date, city, state

To support a query like SELECT * FROM User WHERE city = 'Austin', create a new CF called UserCity:

<<cf>> USERCITY
Key: city
Cols: IDs of the users in that city.
Also uses the Valueless Column pattern.
SELECT WHERE pt 2

Use an aggregate row key
state:city: { user1, user2 }
• Get rows between TX: & TX; for all Texas users
• Get rows between TX:Austin & TX:Austin1 for all Austin users
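The range bounds work because `;` is the character immediately after `:` in ASCII, so every key that starts with `TX:` sorts before `TX;`. The sketch below simulates the scan over a sorted list of keys, which assumes an order-preserving partitioner. `range_scan` is a hypothetical helper, the sample keys are made up, and the half-open interval (start inclusive, end exclusive) is a simplification chosen for this sketch.

```python
from bisect import bisect_left


def range_scan(sorted_keys, start, end):
    """Return keys k with start <= k < end from an already sorted list."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


# Made-up aggregate row keys of the form state:city
keys = sorted(["AZ:Phoenix", "TX:Austin", "TX:Dallas", "TX:Houston", "UT:Provo"])

texas = range_scan(keys, "TX:", "TX;")
austin = range_scan(keys, "TX:Austin", "TX:Austin1")
print(texas)
print(austin)
```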
ORDER BY

Rows
• are placed according to their Partitioner:
  • Random: MD5 of key
  • Order-Preserving: actual key
• are sorted by key, regardless of partitioner

Columns
• are sorted according to CompareWith or CompareSubcolumnsWith
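To see why the partitioner matters for range queries, compare the order of a few keys by their MD5 digest with their natural order. This is a simplified sketch: `random_partitioner_token` is a hypothetical name, and treating the digest as an unsigned 128-bit integer is an assumption made for illustration rather than the exact token computation Cassandra performs.

```python
import hashlib


def random_partitioner_token(key: str) -> int:
    """Simplified token: the MD5 digest of the key read as an unsigned integer."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


keys = ["alice", "bob", "carol", "dave"]

by_key = sorted(keys)                                   # order-preserving placement
by_token = sorted(keys, key=random_partitioner_token)   # random (MD5) placement

print(by_key)
print(by_token)
```

Placement by token spreads neighbouring keys around the ring, which balances load but means a range of keys is no longer a contiguous range of tokens.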
JOIN ON

“join” means “create a relationship”
• rdbms: pay at runtime
• cassandra: pay at design time

representing the relationship
• rdbms: opaque, in query
• cassandra: transparent, first-class citizen
GROUP

SELECT COUNT(*) FROM Hotel GROUP BY ZipCode
calculated column value
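A calculated column value can be read as: keep the aggregate up to date when you write, so the GROUP BY never has to run. The sketch below uses in-memory dicts as stand-ins for two column families. The names `hotel`, `hotel_count_by_zip`, and `add_hotel` and the sample hotels are hypothetical, and the read-modify-write on the count ignores concurrent writers.

```python
hotel = {}               # primary CF stand-in: hotel_id -> columns
hotel_count_by_zip = {}  # calculated values: ZipCode -> number of hotels


def add_hotel(hotel_id, name, zip_code):
    """Write the hotel and adjust the per-zip count in the same operation."""
    old = hotel.get(hotel_id)
    if old is not None:
        # The hotel already existed: remove it from its previous zip's count.
        hotel_count_by_zip[old["zip"]] -= 1
    hotel_count_by_zip[zip_code] = hotel_count_by_zip.get(zip_code, 0) + 1
    hotel[hotel_id] = {"name": name, "zip": zip_code}


add_hotel("h1", "Hotel One", "85255")
add_hotel("h2", "Hotel Two", "85255")
add_hotel("h3", "Hotel Three", "10019")
print(hotel_count_by_zip)
```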
the data model · no sql? · design patterns
1.  materialized view
problem: You need to perform SELECT FROM WHERE queries.
solution: Create a new CF. Use the WHERE idea as the row key.
impacts: Must also write to the index every time you write to the primary CF. Or run as a job.
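Applied to the User and UserCity column families from the SELECT WHERE slide, the pattern and its write-time cost might look like the following. This is an in-memory sketch with dicts standing in for column families. `save_user` and `users_in_city` are hypothetical helpers, the city values are made-up sample data, and the empty byte string follows the valueless column idea.

```python
user = {}       # primary CF stand-in: UserID -> columns
user_city = {}  # materialized view stand-in: city -> {UserID: b""}


def save_user(user_id, columns):
    """Write to the primary CF and keep the UserCity view in step with it."""
    new_city = columns.get("city")
    old_city = user.get(user_id, {}).get("city")
    if old_city is not None and old_city != new_city:
        # The user moved: drop the stale entry from the old city's row.
        user_city[old_city].pop(user_id, None)
    user[user_id] = dict(columns)
    if new_city is not None:
        # The column name carries the data; the value is left empty.
        user_city.setdefault(new_city, {})[user_id] = b""


def users_in_city(city):
    """Answer SELECT * FROM User WHERE city = ? with a single row lookup."""
    return sorted(user_city.get(city, {}))


save_user("123", {"email": "alison@foo.com", "city": "Austin"})
save_user("456", {"email": "eben@bar.com", "city": "Austin"})
print(users_in_city("Austin"))
```

Every write touches two column families, which is the impact the slide names. The read, in exchange, becomes a lookup by key.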
2.  valueless column
problem: Indexes require repeating columns from other column families.
solution: Treat the name of the column as the value. Use a byte[0] as the column ‘value’.
impacts: Only works with <= 2B columns in 0.7
3.  composite key
problem: Keys must support references and uniqueness.
solution: Fuse multiple values with a separator.
impacts: Can substitute for Super Column. Use a Custom Comparator if necessary.
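Fusing values with a separator is only reversible if no part contains the separator, so a small guard helps. A minimal sketch, assuming `:` as the separator (as in the state:city keys earlier). `make_key` and `split_key` are hypothetical helper names.

```python
SEP = ":"


def make_key(*parts):
    """Fuse the parts into one row key, rejecting parts that contain the separator."""
    for part in parts:
        if SEP in part:
            raise ValueError("key part %r contains the separator %r" % (part, SEP))
    return SEP.join(parts)


def split_key(key):
    """Recover the parts of a composite key."""
    return key.split(SEP)


key = make_key("TX", "Austin")
print(key, split_key(key))
```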
4.  semantic key
problem: Keys must be unique. You use OPP.
solution: Use a key that is meaningful in your application.
impacts: Harder to get right. Can proliferate indexes. Range scans over keys less efficient.
6.  client clock sync
problem: You need to keep clocks on different clients synchronized to support read repair.
solution: System.nanoTime() used in StorageProxy. NTP.
impacts: Consider geographic dispersion.
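Clock sync matters because each column carries a timestamp, and when replicas disagree the version with the higher timestamp is kept. The sketch below shows how a client with a slow clock can lose a write that actually happened later. `Column` and `reconcile` are hypothetical names, the second timestamp and value are made up, and keeping the first argument on a tie is a simplification.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # supplied by the writing client


def reconcile(a: Column, b: Column) -> Column:
    """Keep the version with the higher timestamp (first argument on a tie)."""
    return a if a.timestamp >= b.timestamp else b


# First write, from a client whose clock is correct.
first = Column("email", "alison@foo.com", 12578123685)

# Later write in real time, but from a client whose clock runs behind.
later = Column("email", "alison@new.example", 12578123000)

print(reconcile(first, later).value)
```

The later write is discarded because its timestamp is lower, which is why clients spread across regions need their clocks kept close together.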
EXAMPLE
 
 
is cassandra a good fit?
• very fast writes
• need to be always writeable
• lots of data
• evolving schema
custom comparators
i want to compare with float lat/long
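The motivation for a float comparator shows up as soon as coordinates are sorted as text. The sketch below only illustrates the ordering difference in Python with made-up coordinate strings. It is not a Cassandra comparator, which would be written against Cassandra's own comparator types.

```python
# Made-up latitude/longitude values stored as column names
names = ["9.5", "10.25", "-112.07", "33.45"]

text_order = sorted(names)              # character-by-character comparison
float_order = sorted(names, key=float)  # numeric comparison

print(text_order)
print(float_order)
```

Text ordering puts 10.25 and 33.45 ahead of 9.5, so a column slice over a numeric range would return the wrong columns unless the comparison is numeric.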
choose keys carefully
• key-based routing to find data
• queries are executed by key
• keys need to be easily discoverable
• you rely on keys for referential integrity
new use cases
• geographic data
• weather data
• rfid
• travel schedules
• data services in soa
• hotel reservations
• CEP
see also
http://dfeatherston.com/cassandra-adf-uiuc-su10.pdf
http://github.com/dietrichf/brireme
@ebenhewitt cassandraguide.com
