2

The scenario involves maintaining versions of a 'Product' object (e.g., a row in a Product table) and creating a new record whenever the product changes to a newer version (e.g., v2 from v1). It is crucial not to update existing product records, as other objects might reference them. The goal is to update the referring objects that need v2 while preserving the old version (v1) for objects still using it. At the same time, we need to keep track of how the product has evolved i.e. v1->v2->v3.

I've considered two potential approaches:

  1. Versioning: Assigning version labels (e.g., V1, V2) to the object to track and reference old versions if necessary.
  2. LinkedList (or doubly LinkedList): Each new object includes an 'old_object_ID' pointing to the previous object's ID, enabling tracking of all past objects.

Of course, both approaches have pros and cons, such as data redundancy, query complexity, and ease of navigation.

We also need to consider query performance, concurrency, referential integrity, auditing, and history.

Are there alternative or more effective strategies/patterns for achieving this versioning and referencing mechanism in relational databases?

1
  • 1
    Hi, welcome to dba.se! The question is very broad and needs an essay-length answer to cover such a wide topic. I suggest that you go to the wiki for a general overview and then to the excellent oracle-base.com site (and the links within) to see how the concepts are applied to Oracle. Oracle has two mechanisms: temporal tables and a Flashback facility. Come back with specific questions if you have specific problems!
    – Vérace
    Commented Mar 4 at 12:10

3 Answers

3

If you're designing bespoke software, then I would first warn that the requirements for any kind of record versioning are often highly specific to a context, and (in contrast to a non-versioning application) impose significant overheads on both the developers and the users to keep things correct.

we need to keep track of how the product has evolved i.e. v1->v2->v3

Most products don't in fact evolve like this. Just talking about software as a product, anyone who has ever tried to use more sophisticated source control packages will see that the evolution of the software when diagrammed often has a more web-like form. The commit events may all be linear in time, but the lineage of the underlying design concepts (and the timeline of activities of developers who work on the software design, including its modules and subsystems, and the prototypes they test) are not linear like that, and it's usually the latter kind of web-like lineage which is actually important or most useful.

With products that consist of physical manufacturing, in many businesses the design and manufacturing process is not always sufficiently orderly that all physical variations are even recorded.

The knowledge about the relationships between different design versions may exist, for the time being, only in the heads of the staff in the design office: they may know that v9 was in fact a fork for a special project, and that v10 is the real successor of v8.

Moreover, there's an additional distinction between the versioning of the product design and the versioning of the records about the product. To some applications in some businesses, both may be pertinent and have to be distinguished. In other applications, neither may matter.

I've considered two potential approaches:

  1. Versioning: ...
  2. LinkedList: ...

Those are certainly two approaches, and I would say (1) is the best approach when storing purely sequential versions. (2) is likely to perform worse and be more difficult to code.

There's also a third common approach, which is to copy the relevant data from the master list at the time of a transaction, so that the data about the transaction stores a snapshot of how things were set at the time of the transaction.

To academics who teach theory, or at least to their students who receive only a simplified understanding, this is often a prime example of "denormalisation", the storage of redundant data, and a cause of "update anomalies".

In fact, it is not only the most straightforward and reliable way of keeping an audit trail, but it is also not necessarily redundant at all, since it attests to the state of the system at a point in time. And because it is supposed to be a snapshot, it is not supposed to be updated when the master lists change.

To give you a common example of where this may apply, when a customer places an order, with a product description and the price noted, you want to record: (a) the actual agreement with the customer, and (b) the same details as any order acknowledgement issued to the customer. You certainly don't want the price on the order to change, just because the master price list has changed since the order was placed. You probably don't want the description to change either - especially if it was hand-altered to reflect a special request.
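The order example above can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module; the table names (`product`, `order_line`), the column names, and the `place_order` helper are assumptions made up for the sketch, not something prescribed by this answer.

```python
import sqlite3

def place_order(conn, product_id, quantity):
    """Copy description and price from the master list into the order line."""
    description, unit_price_cents = conn.execute(
        "SELECT description, unit_price_cents FROM product WHERE product_id = ?",
        (product_id,),
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO order_line "
        "(product_id, description, unit_price_cents, quantity) "
        "VALUES (?, ?, ?, ?)",
        (product_id, description, unit_price_cents, quantity),
    )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (            -- hypothetical master list
        product_id       INTEGER PRIMARY KEY,
        description      TEXT NOT NULL,
        unit_price_cents INTEGER NOT NULL
    );
    -- The order line keeps its own copy of description and price.
    -- product_id is informational only, deliberately not a foreign key.
    CREATE TABLE order_line (
        order_line_id    INTEGER PRIMARY KEY,
        product_id       INTEGER,
        description      TEXT NOT NULL,
        unit_price_cents INTEGER NOT NULL,
        quantity         INTEGER NOT NULL
    );
    INSERT INTO product VALUES (1, 'Widget', 1000);
""")

line_id = place_order(conn, 1, 3)

# The master price changes after the order was placed.
conn.execute("UPDATE product SET unit_price_cents = 1250 WHERE product_id = 1")

snapshot = conn.execute(
    "SELECT description, unit_price_cents FROM order_line WHERE order_line_id = ?",
    (line_id,),
).fetchone()
print(snapshot)  # ('Widget', 1000): the order still shows the agreed price
```

Because the order line carries its own copy, the master row could later be changed or even deleted without touching what the order records.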

There are also legal regulations that concern some records. In the case of invoices, for example, it is a criminal offence to alter your record of the invoice, after you have issued the invoice to the customer. So you would certainly never write a computer application that automatically updates those records, or links the crucial parts to any other data that may change.

This approach also means it's possible for things to be recorded on old orders which no longer reflect things available on master lists. This means the business can throw away certain kinds of old data without fearing how it might affect linked records, and it doesn't need to keep current master lists cluttered with long-obsolete entries that have no further general use and exist only to support linkages from old data.

Conclusion

My advice would be not to design a system of versioning in the abstract, but to start with a deep familiarity of the business which apparently needs a system of versioning in its records, and design any software application to match that need.

If you come to ask somebody how to implement a system of versioning in record-keeping software, then come armed with a considerably detailed analysis of the business which requires those records to be kept.

Whilst obviously it is possible to discuss various data structures in the abstract, it's not possible to say which one is "best" without understanding the context to which it is being applied.

2

I assume by "object" you mean "row".

This is a classic "slowly changing dimension" scenario. There are several solutions commonly used, most of them variations on a method known as "type-2", which uses dates to historically version rows. Variations exist on whether to use surrogate keys that change with each version or that remain the same across versions, whether the start/end dates are inclusive or exclusive, whether a "current indicator" flag is added, etc. But the basic idea is:

PRODUCT
-------
SURROGATE_PRODUCT_ID // primary key
BUSINESS_PRODUCT_ID // alternate key when combined with effective_end_date
PRODUCT_DETAIL_A
PRODUCT_DETAIL_B
PRODUCT_DETAIL_C, etc.
EFFECTIVE_START_DATE
EFFECTIVE_END_DATE
CURRENT_INDICATOR

Your ETL code joins on the alternate key where the ETL refresh date is between the effective start and end dates, or, alternatively, simply where current_indicator = 'Y', and looks for changes in the non-key attributes (product_detail_*...). If found, the old row gets its effective end date set to the current refresh date and the current indicator flag is set to N, then a new row is inserted with a start date of the current refresh date and set to current. You can use either NULL for effective_end_date or a bogus "end of time" value for the current record.
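The refresh step described above could look roughly like this. It is a minimal sketch using Python's built-in sqlite3 module, under several assumptions: a single tracked attribute (`product_detail_a`), ISO date strings, NULL as the end date of the current row, and a hypothetical `refresh_product` helper standing in for the real ETL code.

```python
import sqlite3

SCHEMA = """
CREATE TABLE product (
    surrogate_product_id INTEGER PRIMARY KEY,
    business_product_id  TEXT NOT NULL,
    product_detail_a     TEXT,
    effective_start_date TEXT NOT NULL,
    effective_end_date   TEXT,            -- NULL marks the current row
    current_indicator    TEXT NOT NULL CHECK (current_indicator IN ('Y', 'N'))
);
"""

def refresh_product(conn, business_id, detail_a, refresh_date):
    """Apply one incoming product record; return the current surrogate id."""
    row = conn.execute(
        "SELECT surrogate_product_id, product_detail_a FROM product "
        "WHERE business_product_id = ? AND current_indicator = 'Y'",
        (business_id,),
    ).fetchone()
    if row is not None and row[1] == detail_a:
        return row[0]  # no change in the non-key attributes
    with conn:  # close the old row and insert the new one in one transaction
        if row is not None:
            conn.execute(
                "UPDATE product "
                "SET effective_end_date = ?, current_indicator = 'N' "
                "WHERE surrogate_product_id = ?",
                (refresh_date, row[0]),
            )
        cur = conn.execute(
            "INSERT INTO product (business_product_id, product_detail_a, "
            "effective_start_date, effective_end_date, current_indicator) "
            "VALUES (?, ?, ?, NULL, 'Y')",
            (business_id, detail_a, refresh_date),
        )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
v1 = refresh_product(conn, "P-100", "red", "2024-01-01")
same = refresh_product(conn, "P-100", "red", "2024-02-01")   # unchanged
v2 = refresh_product(conn, "P-100", "blue", "2024-03-01")    # new version
```

After these three calls the table holds two rows for P-100: the first closed on 2024-03-01 and flagged 'N', the second open-ended and flagged 'Y'.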

Your child tables would refer to surrogate_product_id, the PK, and so pinpoint exactly one historical version of the product. And none of your child tables ever need to get updated.

You can then see the whole history of changes simply by querying against the business key without filtering on the dates. You can see the reality as of any date you wish by filtering on the start/end dates using inequality (>=, < or between) predicates. You can find the current one the same way if using a dummy end-of-time date, or by effective_end_date IS NULL (if using the NULL method) or current_indicator = 'Y'. I prefer to use only dates, as I don't like having two ways of indicating current: there could then be ambiguity if they get out of whack due to buggy ETL code. Anyway, there is lots of material on the internet about this that you can read up on.
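As a rough illustration of those lookups, here is a sketch using Python's built-in sqlite3 module. The sample rows, the `order_line` child table, and the helper names are all assumptions for the example; it uses the NULL method with an inclusive start date and an exclusive end date, which is only one of the variations mentioned earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    surrogate_product_id INTEGER PRIMARY KEY,
    business_product_id  TEXT NOT NULL,
    product_detail_a     TEXT,
    effective_start_date TEXT NOT NULL,   -- inclusive (assumed convention)
    effective_end_date   TEXT             -- exclusive; NULL = current
);
INSERT INTO product VALUES
    (1, 'P-100', 'red',   '2024-01-01', '2024-03-01'),
    (2, 'P-100', 'blue',  '2024-03-01', '2024-06-01'),
    (3, 'P-100', 'green', '2024-06-01', NULL);
-- Hypothetical child table pinned to one historical version.
CREATE TABLE order_line (
    order_line_id        INTEGER PRIMARY KEY,
    surrogate_product_id INTEGER NOT NULL
        REFERENCES product (surrogate_product_id)
);
INSERT INTO order_line VALUES (10, 1);
""")

COLS = "p.surrogate_product_id, p.product_detail_a"

def history(conn, business_id):
    """All versions of a product: business key only, no date filter."""
    return conn.execute(
        f"SELECT {COLS} FROM product AS p WHERE p.business_product_id = ? "
        "ORDER BY p.effective_start_date", (business_id,)).fetchall()

def as_of(conn, business_id, date):
    """The version in effect on a given ISO date."""
    return conn.execute(
        f"SELECT {COLS} FROM product AS p WHERE p.business_product_id = ? "
        "AND p.effective_start_date <= ? "
        "AND (p.effective_end_date IS NULL OR p.effective_end_date > ?)",
        (business_id, date, date)).fetchone()

def current(conn, business_id):
    """The current version, found without visiting the older rows' dates."""
    return conn.execute(
        f"SELECT {COLS} FROM product AS p WHERE p.business_product_id = ? "
        "AND p.effective_end_date IS NULL", (business_id,)).fetchone()

def product_for_order_line(conn, order_line_id):
    """The exact version a child row was created against."""
    return conn.execute(
        f"SELECT {COLS} FROM order_line AS o JOIN product AS p "
        "ON p.surrogate_product_id = o.surrogate_product_id "
        "WHERE o.order_line_id = ?", (order_line_id,)).fetchone()

print(as_of(conn, "P-100", "2024-02-15"))   # (1, 'red')
print(current(conn, "P-100"))               # (3, 'green')
```

ISO-formatted date strings compare correctly as text, which is why plain TEXT columns are enough for this sketch.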

A note on your suggestions:

  1. the problem with sequence versioning (v1, v2, etc.) is that every lookup of "current" requires a nested subquery to find the maximum value. It's critical that you know how to find the current row without having to visit all the other rows.

  2. A linked list is very inefficient. What if you had a product that changed daily for two years? To find the current row would require 730 row lookups using recursion/hierarchical query. Very unpleasant. The date-range historical table, or type-2 slowly changing dimension, is pretty industry standard and does the job well.
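To make the linked-list cost concrete, here is a small sketch using Python's built-in sqlite3 module and a recursive common table expression. The table layout, the `old_product_id` column (named after the question's 'old_object_ID'), and the `latest_version` helper are assumptions for illustration, and the chain is assumed to be strictly linear with no forks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        product_id     INTEGER PRIMARY KEY,
        old_product_id INTEGER REFERENCES product (product_id),
        detail         TEXT
    )
""")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, None, "v1"), (2, 1, "v2"), (3, 2, "v3")],
)

def latest_version(conn, start_id):
    """Walk the chain forward from start_id; return (latest id, hops taken)."""
    return conn.execute(
        """
        WITH RECURSIVE chain(product_id, hops) AS (
            SELECT product_id, 0 FROM product WHERE product_id = ?
            UNION ALL
            SELECT p.product_id, c.hops + 1
            FROM product AS p
            JOIN chain AS c ON p.old_product_id = c.product_id
        )
        SELECT product_id, hops FROM chain ORDER BY hops DESC LIMIT 1
        """,
        (start_id,),
    ).fetchone()

print(latest_version(conn, 1))  # (3, 2): two hops to get from v1 to v3
```

The number of hops grows with the number of versions, whereas a date-range table can find the current row with a single predicate.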

2
  • How about using Type-4 (en.wikipedia.org/wiki/…) and creating a history table for changed products and backtracking them with dates? Do you see any problem here ?
    – Enigma
    Commented Mar 6 at 6:33
  • 1
    @Enigma, type-4 is simply a type-2 history table shadowing a type-1 current table. I find it unnecessary to have this complexity, but it would work.
    – Paul W
    Commented Mar 6 at 11:09
1

Re "the problem with sequence versioning (v1, v2, etc.) is that every lookup of "current" requires a nested subquery to find the maximum value. It's critical that you know how to find the current row without having to visit all the other rows."

In past implementations of row versioning, we have used "current version = NULL". That is, when a row is inserted for the first time, its version is set to NULL. When the same row is amended, update the current row (the one with version = NULL), setting its version to the next number (1 for the first amendment), then insert the new version of the row with its version set to NULL.
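A minimal sketch of that scheme, using Python's built-in sqlite3 module, might look like this. The table layout, the `save` helper, the partial unique index, and the rule of numbering superseded rows 1, 2, 3, ... are assumptions filled in for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        row_id              INTEGER PRIMARY KEY,
        business_product_id TEXT NOT NULL,
        version             INTEGER,        -- NULL = current version
        detail              TEXT
    );
    -- Assumed safeguard: at most one current row per product.
    CREATE UNIQUE INDEX one_current_per_product
        ON product (business_product_id) WHERE version IS NULL;
""")

def save(conn, business_id, detail):
    """Insert a new current row, numbering the row it supersedes (if any)."""
    with conn:  # both statements succeed or fail together
        (next_version,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) + 1 FROM product "
            "WHERE business_product_id = ?",
            (business_id,),
        ).fetchone()
        # Give the old current row its number; a no-op on first insert.
        conn.execute(
            "UPDATE product SET version = ? "
            "WHERE business_product_id = ? AND version IS NULL",
            (next_version, business_id),
        )
        cur = conn.execute(
            "INSERT INTO product (business_product_id, version, detail) "
            "VALUES (?, NULL, ?)",
            (business_id, detail),
        )
    return cur.lastrowid

def current(conn, business_id):
    """Find the current row directly, with no MAX() subquery."""
    return conn.execute(
        "SELECT detail FROM product "
        "WHERE business_product_id = ? AND version IS NULL",
        (business_id,),
    ).fetchone()

for detail in ("a", "b", "c"):
    save(conn, "P-100", detail)
print(current(conn, "P-100"))  # ('c',)
```

The lookup of the current row is a simple equality plus IS NULL predicate; the MAX() is paid only once, at write time.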
