Cassandra's data model is more flexible than typically assumed.
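Cassandra lets you tune consistency levels to balance availability against consistency. It behaves consistently when the number of replicas consulted on reads plus the number consulted on writes exceeds the replication factor (R + W > N), for example when reading and writing at quorum.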
Cassandra uses a row-oriented model where rows are uniquely identified by keys and group columns and super columns. Super column families allow grouping columns under a common name and are often used for denormalizing data.
Cassandra's data model is query-based rather than domain-based. It focuses on answering questions through flexible querying rather than storing predefined objects. Design patterns like materialized views and composite keys can help support different types of queries.
Cassandra Data Model
1. the cassandra data model eben hewitt cassandra summit - san francisco 8.10.2010
9. PACELC Daniel Abadi if Partition Trade some Consistency for Availability Else (normal circumstances) Balance Consistency & Latency
10. cassandra is consistent when read replica count + write replica count > replication factor. cassandra is consistent if you read & write at CL.QUORUM (once Q nodes are up); good place to start. R + W > N, where r = # of replicas consulted during read op, w = # of replicas consulted during write op, n = # of replicas (replication factor). r + w > n = consistent
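The R + W > N rule on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch, not Cassandra code: the function names `quorum` and `is_consistent` are hypothetical and chosen only for illustration.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (hypothetical helper)."""
    return n // 2 + 1


def is_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set, i.e. R + W > N."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


n = 3                                 # replication factor
q = quorum(n)                         # quorum size for n = 3
print(is_consistent(q, q, n))         # read and write at quorum
print(is_consistent(1, 1, n))         # read and write at a single replica
```

With a replication factor of 3, quorum reads and writes touch 2 replicas each, so any read overlaps the latest write; reading and writing at a single replica does not give that guarantee.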
11. cassandra is row-oriented each row is uniquely identifiable by key rows group columns and super columns
12. column atomic unit name : value : timestamp email : alison@foo.com : 12578123685
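The name : value : timestamp triple can be modeled as a small immutable record. This is a sketch under stated assumptions: the `Column` class and the `reconcile` helper are hypothetical names, and `reconcile` illustrates the general idea that the copy with the newer timestamp wins when two replicas disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column: name, value, timestamp."""
    name: str
    value: str
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Pick the copy with the newer timestamp (ties keep the first argument)."""
    if a.name != b.name:
        raise ValueError("can only reconcile copies of the same column")
    return a if a.timestamp >= b.timestamp else b


stale = Column("email", "alison@foo.com", 12578123685)
fresh = Column("email", "alison@bar.com", 12578123686)  # made-up newer write
print(reconcile(stale, fresh).value)
```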
17. about super column families sub-column names inside a SCF are not indexed top level columns (SCF Name) are always indexed often used for denormalizing data from standard CFs
18. super column family example (slide callouts: key, column, super column, super column family, flexible schema)
PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx
  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777, desc: Great view from 102nd floor. }
  } //end nyc
}
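One way to picture the PointOfInterest super column family is as three levels of nested maps: row key, then super column name, then sub-column name. The sketch below uses plain Python dictionaries and a hypothetical `get_subcolumn` helper; it is an illustration of the shape of the data, not a client API.

```python
# row key -> super column name -> sub-column name -> value
point_of_interest = {
    "85255": {  # phx
        "Phoenix Zoo": {"phone": "480-555-5555", "desc": "They have animals here."},
        "Spring Training": {"phone": "623-333-3333", "desc": "Fun for baseball fans."},
    },
    "10019": {  # nyc
        "Central Park": {"desc": "Walk around. It's pretty."},
        "Empire State Building": {
            "phone": "212-777-7777",
            "desc": "Great view from 102nd floor.",
        },
    },
}


def get_subcolumn(scf, row_key, super_column, name, default=None):
    """Look up one sub-column; a missing level returns the default (flexible schema)."""
    return scf.get(row_key, {}).get(super_column, {}).get(name, default)


print(get_subcolumn(point_of_interest, "85255", "Phoenix Zoo", "phone"))
print(get_subcolumn(point_of_interest, "10019", "Central Park", "phone"))
```

Central Park has no phone sub-column, which is allowed: rows and super columns need not share the same set of columns.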
21. rdbms : domain-based model what answers do I have? cassandra : query-based model what questions do I have?
22. Questions • Find hotels in a given area • Find information about a given hotel, such as its name and location • Find Points of Interest near a given hotel • Find an available room in a given date range • Find the rate and amenities about a room • Book the selected room by entering guest information
23. SELECT WHERE <<cf>> USER Key: UserID Cols: username, email, birth date, city, state To support a query like SELECT * FROM User WHERE city = ‘Austin’: Create a new CF called UserCity: <<cf>> USERCITY Key: city Cols: IDs of the users in that city. Also uses the Valueless Column pattern.
24. SELECT WHERE pt 2 Use an aggregate row key state:city: { user1, user2} Get rows between TX: & TX; for all Texas users Get rows between TX:Austin & TX:Austin1 for all Austin users
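The range trick works because ";" is the character immediately after ":" in ASCII, so every key starting with "TX:" sorts between "TX:" and "TX;". A minimal sketch of such a range scan over sorted keys follows; the `rows` data (other than the Austin row) and the `range_scan` helper are hypothetical, and the end bound is treated as exclusive here.

```python
from bisect import bisect_left

# Aggregate row keys of the form state:city, each holding user IDs.
rows = {
    "CA:Fresno": ["user7"],
    "TX:Austin": ["user1", "user2"],
    "TX:Dallas": ["user3"],
    "WA:Seattle": ["user4"],
}
sorted_keys = sorted(rows)  # stands in for key order under an order-preserving partitioner


def range_scan(start: str, end: str):
    """Return (key, users) pairs with start <= key < end."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, end)
    return [(key, rows[key]) for key in sorted_keys[lo:hi]]


print(range_scan("TX:", "TX;"))               # all Texas rows
print(range_scan("TX:Austin", "TX:Austin1"))  # the Austin row only
```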
25. ORDER BY Rows are placed according to their Partitioner: Random: MD5 of key Order-Preserving: actual key are sorted by key, regardless of partitioner Columns are sorted according to CompareWith or CompareSubcolumnsWith
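The difference between the two partitioners can be illustrated by sorting the same keys two ways. This is a simplified sketch: `random_token` is a hypothetical helper that interprets the MD5 digest of the key as an integer, which conveys the idea of hash-based placement but is not claimed to match Cassandra's exact token computation.

```python
import hashlib


def random_token(key: str) -> int:
    """Simplified stand-in for hash-based placement: MD5 digest of the key as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


keys = ["TX:Austin", "TX:Dallas", "CA:Fresno", "WA:Seattle"]

by_hash = sorted(keys, key=random_token)  # placement order when hashing the key
by_key = sorted(keys)                     # placement order when using the actual key

print(by_hash)
print(by_key)
```

Hash-based placement scatters neighboring keys across the ring, which is why range scans over keys only make sense with an order-preserving partitioner.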
26. JOIN ON “join” means “create a relationship” rdbms : pay at runtime cassandra : pay at design time representing the relationship rdbms : opaque, in query cassandra : transparent, first-class citizen
30. problem You need to perform SELECT FROM WHERE queries. solution Create a new CF. Use the WHERE idea as the row key. impacts Must also write to the index every time you write to the primary CF. Or run as a job. materialized view
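The materialized view pattern, applied to the earlier User and UserCity example, might look like the sketch below. It uses in-memory dictionaries in place of column families; `write_user` and `users_in_city` are hypothetical helpers, and the empty byte string as the index value follows the valueless column idea. The point is the cost named under impacts: every write to the primary CF also has to maintain the index CF.

```python
users = {}      # primary CF "User": user ID -> columns
user_city = {}  # index CF "UserCity": city -> {user ID: b""}


def write_user(user_id: str, columns: dict) -> None:
    """Write to the primary CF and keep the UserCity index in step."""
    previous = users.get(user_id)
    if previous is not None and previous.get("city") != columns.get("city"):
        # The user moved: drop the stale entry from the old city's index row.
        user_city.get(previous.get("city"), {}).pop(user_id, None)
    users[user_id] = dict(columns)
    city = columns.get("city")
    if city is not None:
        # Column name carries the data; the value is empty.
        user_city.setdefault(city, {})[user_id] = b""


def users_in_city(city: str) -> list:
    """Answer 'WHERE city = ...' with a single lookup by row key."""
    return sorted(user_city.get(city, {}))


write_user("user1", {"username": "alison", "city": "Austin", "state": "TX"})
print(users_in_city("Austin"))
```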
32. problem Indexes require repeating columns from other column families. solution Treat the name of the column as the value. Use a byte[0] as the column ‘value’. impacts Only works with <= 2B columns in 0.7 valueless column
34. problem Keys must support references and uniqueness. solution Fuse multiple values with a separator. impacts Can substitute for Super Column. Use a Custom Comparator if necessary. composite key
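Fusing values with a separator is simple, but the separator must not appear inside any part or the key can no longer be split back unambiguously. A minimal sketch, assuming ":" as the separator (as in the state:city example) and hypothetical helpers `make_key` and `split_key`:

```python
SEPARATOR = ":"  # assumed separator, matching the state:city example


def make_key(*parts: str) -> str:
    """Fuse several values into one row key."""
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"part {part!r} contains the separator {SEPARATOR!r}")
    return SEPARATOR.join(parts)


def split_key(key: str) -> list:
    """Recover the original values from a composite key."""
    return key.split(SEPARATOR)


key = make_key("TX", "Austin")
print(key)
print(split_key(key))
```

Rejecting parts that contain the separator is one design choice; escaping the separator would be another.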
36. problem Keys must be unique. You use OPP. solution Use a key that is meaningful in your application. impacts Harder to get right. Can proliferate Indexes. Range scans over keys less efficient. semantic key
38. problem You need to keep clocks on different clients synchronized to support read repair. solution System.nanoTime() used in StorageProxy NTP. impacts Consider geographic dispersement. client clock sync
44. choose keys carefully key-based routing to find data queries are executed by key keys need to be easily discoverable you rely on keys for referential integrity
45. new use cases geographic data weather data rfid travel schedules data services in soa hotel reservations CEP