
Docs and blog posts describe what row version columns do, but rarely get into the decision process for when it's appropriate to use them. I suspect that many developers just add them to every table without considering whether they're actually needed. That saves development effort, but it comes at a cost I may want to avoid.

What are the use cases for SQL row version columns? When should they not be used?

(There's another kind of row versioning that I understand to be related in purpose but unrelated in implementation. This question is not about that.)

Perhaps the clearest case for row versioning is when an entity has a property with read-increment-write semantics. In favor of row versioning everywhere, you could argue that any property could be used this way, and this usage might be difficult to foresee. Even a boolean property could have read-modify-write semantics if the intention is to toggle it rather than set it.
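For instance, here is a minimal sketch (assuming a hypothetical Feature table with a plain integer Version column) of how a version check turns a lost toggle into a detectable failure:

    -- Both clients read Enabled = 0 and intend to toggle it; without the
    -- version check, both would write 1 and one toggle would be silently lost.
    UPDATE Feature
    SET    Enabled = 1 - @EnabledAsRead,  -- the toggle, computed from the value read earlier
           Version = Version + 1
    WHERE  Id = @Id
      AND  Version = @VersionAsRead;      -- 0 rows affected => someone else toggled first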

At the other extreme, if all of an entity's properties are independent and "last write wins" semantics are acceptable, then row versioning seems pointless.

I've used APIs where entire entities had to be read and written to change a single property, even when that property was independent. This made update conflicts frequent, so these APIs relied heavily on row versioning. In that scenario, I view row versioning as a necessary evil mitigating a bad API design.

On the other hand, if update DTOs contain only changed properties to merge or coalesce, and those properties are independent or groups of dependent properties are sent together, then row versioning might be unnecessary. Or maybe a row version could exist but be optional in the DTO, making it the client's responsibility to include it if it wants safe read-modify-write semantics.

From the perspective of API design, another consideration is that adding a row version column would be a breaking change. But then again, so would adding columns or business rules that make row versioning necessary where it wasn't before. This goes back to the effort of thinking about whether row versioning is needed. Maybe the cost of using it everywhere is worth eliminating the risk of failing to recognize where it's necessary.

4 Answers


Whenever a last-commit-wins scenario is acceptable, or when you use a very high transaction isolation level (i.e. serializable), you generally don't need a version column. You do need it when you cannot tolerate lost updates or serialization anomalies, such as with financial transactions.

An example is a situation where you withdraw a certain amount from a bank account in two concurrent transactions. Both may see a starting value x, and both may try to update the column to z = x - y, where y is the amount to withdraw. If these two transactions are not isolated with serializable, both can succeed, with the result that the balance is x - y instead of x - 2y. A version column ensures this scenario cannot occur, since the update only succeeds if the version has not changed since it was read within the same transaction (otherwise it results in an optimistic locking failure).
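A minimal sketch of the guarded update, assuming a hypothetical Account table with a plain integer Version column that the application increments itself (SQL Server's rowversion type would bump the column automatically instead):

    -- @AccountId and @Amount would be procedure parameters in practice.
    DECLARE @Balance money, @Version int;

    SELECT @Balance = Balance, @Version = Version
    FROM   Account
    WHERE  Id = @AccountId;

    UPDATE Account
    SET    Balance = @Balance - @Amount,
           Version = Version + 1
    WHERE  Id = @AccountId
      AND  Version = @Version;  -- matches zero rows if another transaction got there first

    IF @@ROWCOUNT = 0
        THROW 50000, 'Optimistic locking failure: balance changed concurrently.', 1;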

From my own experience I know there are other situations where anomalies might occur if you rely on ORM (Object Relational Mapping) frameworks. An example would be a merge action on an entity where some of the fields it contains are outdated. The merge may then revert those fields to their old values in the database, which may be unintentional and can result in strange bugs.


The old name for the ROWVERSION type in SQL Server was TIMESTAMP, and mentioning this is a useful lever for explaining it, although 'timestamp' was (rightly) regarded as a confusing term.

To recap how it works: it is similar to an INT IDENTITY column in that it assigns a unique number to a row according to the order in which the row was inserted, except 'rowversion' is database-scoped rather than table-scoped, and (unlike 'int identity') it increments on every update to an existing row as well as on every insert of a new row.

The effect of this is that every row in the database, and every changed version of that row, is assigned a globally-unique sequence number according to the relative order in time in which the insert or update occurred. Hence the original term 'timestamp'.

As with 'identity', there can be gaps in the sequence number where transactions seized a number but were then rolled back, but the sequence is always ascending over time.
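A quick way to see these mechanics in action (a throwaway example in SQL Server):

    CREATE TABLE Demo (Id int PRIMARY KEY, Name varchar(50), RowVer rowversion);

    INSERT INTO Demo (Id, Name) VALUES (1, 'a'), (2, 'b');
    SELECT Id, RowVer FROM Demo;   -- two distinct, ascending values

    UPDATE Demo SET Name = 'a2' WHERE Id = 1;
    SELECT Id, RowVer FROM Demo;   -- row 1 now carries a higher value than row 2

    SELECT @@DBTS;                 -- the last rowversion value used in this database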

What isn't clear about the term 'timestamp' is that we are here referring to a scant, topological notion of time, in which it is possible to discern what event came before and what event came after - it is possible to place each row and each change to a row in an unambiguous time order - but not to discern the actual size of the gaps (or quantity of time) between each event.

The ordinary DATETIME, DATE, and TIME types (which all record the more familiar, geometric notion of time) are not actually capable of performing this ordering function, because they have a defined resolution. Two rows (even from separate transactions) can be assigned to the same point in geometric time.

This is quite obvious if the resolution is as coarse as a whole day (where obviously many rows can be updated on the same day), but it remains technically true all the way down to the finest resolution possible (and, for either technical or behavioural reasons, such collisions at finer resolutions are never as unlikely as they may seem at first consideration).

So you use 'rowversion' in any case where you need to know unambiguously which rows (and versions of those rows) came before a certain point in time, and which came after.

The most typical application of this functionality is in comparing some sort of cached data to master data, and finding out whether anything has changed since you last looked. The danger with using 'datetime' types is that you can look, and then something new can appear with a recorded time before (or most confoundingly, exactly at) the threshold at which you last looked. With 'rowversion', two such timings can never be the same, and the database tracks the effect of pending transactions so that you never risk older transactions being committed later than newer ones.

Note the 'rowversion' functionality does not just consist of the column type, but also a supporting set of functions including @@DBTS and MIN_ACTIVE_ROWVERSION(), which would often be used as part of a correct solution.
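For example, a sketch of that cache-refresh pattern, assuming a source table with a rowversion column named RowVer and a client that remembers the upper bound from its previous refresh as @LastSync:

    -- @LastSync: binary(8) value the client persisted after its previous refresh.
    DECLARE @UpperBound binary(8) = MIN_ACTIVE_ROWVERSION();

    SELECT Id, Name, RowVer
    FROM   Source
    WHERE  RowVer >= @LastSync
      AND  RowVer <  @UpperBound;  -- excludes rows touched by still-pending transactions

    -- Persist @UpperBound as the new @LastSync for the next refresh.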

Note also that 'rowversion' is probably not the only way certain things can be done, like comparing cached values. Hand-rolled solutions may be more obvious and flexible in typical business applications - 'rowversion' is something that will have been designed to scale under extreme volumes and high frequencies.


A rowversion column is simply a 64-bit number that the database increments automatically on each insert or update. Since the value changes with DML operations, a rowversion column should not be used as a primary key. And while I cannot imagine a business use for the value itself, it is the backbone of optimistic locking: a technique that allows you to issue an UPDATE statement which only succeeds as long as that row has not changed since the last time you fetched it. The rowversion column is used as a discriminator value in the WHERE clause along with the primary key value. If the row version has not changed, the row gets updated. If the row version did change, the database simply reports back that no rows were updated.
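For example (a sketch assuming a Customer table whose rowversion column is named RowVer, with @RowVerAsRead holding the value fetched earlier):

    UPDATE Customer
    SET    Email = @NewEmail        -- the RowVer column is bumped automatically
    WHERE  Id = @Id
      AND  RowVer = @RowVerAsRead;

    IF @@ROWCOUNT = 0
        -- The row changed since we read it; surface the conflict to the caller.
        THROW 50001, 'Row was modified by another user.', 1;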

While this doesn't cover every possible use case (an answer like that would never be comprehensive), it should give you some guidance on why you would use a rowversion column: basically, any time you might want a discriminator value to determine if anything has changed since you last retrieved that data. Note that this is different from determining whether data has been corrupted or tampered with; a checksum would be appropriate in that case.

  • "Basically any time you might want a discriminator value to determine if anything has changed since you last retrieved that data." - I get that. The question is, what is the decision process for whether you want or need that? Commented Jul 9, 2022 at 2:59
  • @KevinKrumwiede: there is no standard decision process. You must understand that particular feature of the database and analyze the requirements of your application. If the requirements need some sort of auto-incrementing value on insert or update, use a row version column. It is a subjective decision that you as the engineer must make. Commented Jul 9, 2022 at 3:03
  • I don't think it's subjective. I gave some example scenarios that might be part of a decision tree. Whether a heavy-handed approach is "worth it" is subjective, but the costs and benefits and reasons to choose one approach or another are objective. Commented Jul 9, 2022 at 3:20
  • You are asking a fuzzy question and expecting a concrete answer. The decision to use it or not is very subjective and there is no definitive list of when you should or should not use it. It totally depends on your application's needs. If these questions were easy, computers would write our programs for us.
    – user10489
    Commented Jul 9, 2022 at 5:09
  • @user10489, the question doesn't seem that fuzzy: "What do you people use 'rowversion' for? Give me some examples!"
    – Steve
    Commented Jul 9, 2022 at 9:35

My opinion is that row versioning at the application level is mainly useful for the bad practice of not using transactions. If you read a value from the database outside of a transaction and then go on to modify it, it's helpful to know whether the row you're modifying has been modified in the meantime.

However, that's not how you should implement your application. You should select a database that supports true serializability: a transaction isolation level which guarantees there is some serial ordering of the transactions that would produce the same results as running them in parallel. So if you can prove that a transaction does the correct thing when running on its own, with no other transactions, then with true serializability you know any parallel execution does the correct thing too, and no timing can change that.

There are two ways to achieve true serializability. One is placing lots of locks on the database, not only when writing to it but also when reading from it. Unfortunately, that scales poorly. I'm not sure if any database actually implements this form of true serializability, but at least the SQL standard transaction isolation levels were created on the flawed assumption that locking would be the chosen strategy.

Another way is to check for potential serializability violations on the fly and fail transactions when you encounter a real or potential violation. This is the optimistic strategy (as distinguished from the pessimistic locking strategy), also called serializable snapshot isolation. It means you have to be prepared for transactions failing.

However, in any complex application you are already prepared for that, right? Even without serializable snapshot isolation, you can encounter deadlocks. Although some deadlocks are easy to prevent (example: when doing an account transfer, always modify the lower-numbered account first), in a complex application there are so many tables and transactions that it's not realistic to create a good strategy for preventing all deadlocks. Thus, you have to be prepared to retry a deadlocked transaction. Serializable snapshot isolation only adds a second reason for a transaction to fail, and since you are prepared for transactions failing, that doesn't change anything. (Besides, if you use row versioning and check the row version in an update, you are already prepared for the update failing, right?)

As far as I know, the only database that today supports serializable snapshot isolation is PostgreSQL. So in business critical applications, you should select that database.
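For illustration, the shape of that pattern in PostgreSQL (a sketch; the retry loop itself belongs in the client):

    BEGIN ISOLATION LEVEL SERIALIZABLE;

    UPDATE account SET balance = balance - 100 WHERE id = 1;
    UPDATE account SET balance = balance + 100 WHERE id = 2;

    COMMIT;  -- can fail with SQLSTATE 40001 (serialization_failure),
             -- in which case the client reruns the whole transaction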

  • The difference is that snapshot isolation requires all conflicting users to have a sustained online session with the transaction coordinator (the database). Row versioning allows transactions to resolve correctly after an offline period without a sustained session, which is often a criterion of remote caching, for example. Moreover, row versioning can have a secondary use as part of an audit trail; the transaction coordinator only keeps enough information to determine the correct resolution of active transactions, not to allow users to review the conflicting changes that caused a failure.
    – Steve
    Commented Jul 9, 2022 at 11:32
  • In my experience, Row Versioning values are nothing whatever to do with /Database/ Change Concurrency (scale of microseconds). Instead, they handle the [completely different] problem of /Business Process/ Change Concurrency (scale of minutes or even hours). This is particularly relevant given the "disconnected" world in which our applications now exist - we no longer have persistent database connections through which any kind of database-provided Pessimistic Locking can be employed.
    – Phill W.
    Commented Jan 13, 2023 at 13:04
