11

I've inherited a system with an Oracle relational database with a couple of tables modeled like I've sketched below, where there's an entire child table that only stores a single status code in a one-to-one relationship with its parent.

Here's a generalized sketch of the schema I'm talking about: DB Schema Diagram

The ITEM table is quite large, with hundreds of millions of rows. COLLECTION is a few hundred rows, and its COLLECTION_STATUS does not actually appear to be used anywhere in the application (it is always initialized with the same value, which is never changed).

What would be the benefits or motivations for designing a schema like this? If I were designing this schema, I would have just made COLLECTION_STATUS_CODE a column on COLLECTION, and would have not bothered with the COLLECTION_STATUS table. However, there's a pattern here, and it seems like it dates to the earliest history of this application. Unfortunately, the system is more than a decade old, and all the original developers have long since left the company (in a site closure).

Could it be something to do with performance? The system regularly creates new COLLECTIONs that are mostly full copies of previous COLLECTIONs, and I believe that process is IO bound in the DB (on the SELECT not the INSERT part), most of which happens in one DB transaction. My naive gut feel is the creating the need for a join in this way wouldn't be worth it, through.

We need to add a new kind of "status code" to COLLECTION, so I'm faced with the choice of whether to add a new column to COLLECTION for it, or follow the unused COLLECTION_STATUS design and add it there.


Edit 1: Just to clarify something: as best I can tell, the COLLECTION, COLLECTION_STATUS, ITEM and ITEM_STATUS tables were all created around at the same time very early in the system's history (before it was ever in production) as part of some kind of refactor. This particular case is not a situation of trying to graft new functionality on to an old crusty system, since it was like this almost from the start.

Also, no applications, besides two that my team controls (that were originally one app), connect directly to this database.

Edit 2: Also none of these tables are very wide. COLLECTION and ITEM have only about 5 non-key columns. COLLECTION_STATUS and ITEM_STATUS only have one or two non-key columns.

13
  • I've seen some weird database schemas that make no sense. For example, I found one recently in a vendor system that was anti-normalized, for lack a better term. That is, it used a separate reference table with a new row for every row on the child table where many thousands were the same logical value. Good question but the title shows your frustration a little.
    – JimmyJames
    Commented Oct 17, 2022 at 20:33
  • This feels like there is (or was) an external reporting tool or direct-to-database integration using this table and the DBA made things easier for himself by breaking out only the necessary data away, so they could simplify the queries/permissions to just that splinter table. Not sure if the best design possible, but certainly not an uncommon one. That, or someone told them to add a new column but the DB had too much data/too bad of a hardware on it that it was impossible to do so on the main table. This is more of a freak situation but also possible.
    – T. Sar
    Commented Oct 18, 2022 at 11:15
  • 1
    I have a table with almost 140 columns. When I saw that, I chopped the table into 3 tables 1:1. Basically I moved columns to different tables based on the life cycle of the data. Not all the data change at the same time and due to the same reasons. Now I can manage changes in this data and limit the impact of changes. I can decouple developments and bug fixing. I can expand the data within a reasonable number of columns. That makes life a lot easier for my coworkers too. The data model and mappings are simple and the date is easier to reason about.
    – Laiv
    Commented Oct 18, 2022 at 14:15
  • 2
    It is obviously not the case here, but if you have some very wide columns that aren't often queried, it makes sense to move those big columns into separate table, so that the DB engine doesn't need to go through them every time it looks up for something unrelated in the table.
    – J. Doe
    Commented Oct 18, 2022 at 15:33
  • 3
    " My naive git feel" - that must be the programmers' equivalent of other people's gut feeling, right? (Please don't edit your post, this is brilliant!) Commented Oct 18, 2022 at 18:38

10 Answers 10

15

There are several reasons why one want to keep two tables:

  • separation of concerns (e.g. if the list of items is related to inventory an status management is related to sales catalogue)
  • authorisations (e.g. if access grants needs to be different for the two tables)
  • practicalities (e.g. former limits on number of columns in a table, optimisation of db triggers).

The main question here: if it’s not used anywhere, can it be removed?

8

In-keeping with other answers, there's no common explanation for why such a table may be designed like this with a one-to-one relationship.

What you see now may be only a vestige of some different design in the distant past, with the key parts already ripped out.

Or it may have been designed to allow extension somehow (such as a history of statuses), which was never used. Or it might have something to do with how audit trails are kept for each table, and dividing the tables allows alterations to rows in the main table to be audited separately from status changes.

What you see may also have started originally as a 1-to-(0..1) relation, where items could have statuses but did not have to have them. Perhaps to allow a retrofit and transition, without disrupting the existing design.

Either way, the explanatory information will probably have left with the old developers - it's sometimes easy to spend a lot of time analysing design choices that the original developers may have regarded as arbitrary rather than crucial.

It may also be purely a logical grouping of related columns. Experienced database developers would not split a table like this for just a few columns, and if needs be would use a naming prefix to relate groups of columns together logically.

But those more familiar with other languages may tend to draw an analogy between database tables on the one hand, and classes or structures on the other, and then try and assemble a kind of hierarchy out of tables and columns, which is the wrong way to go about it.

Also some UI front ends or tools might have made it easier (or seem easier) to split the table.

As others have also mentioned, permissions can be granted differently on different tables, and with extreme numbers of rows there may be performance gains in some workloads (at the expense of others).

Sometimes, people carry on with certain old performance optimisations for a long time after they become irrelevant to modern technologies, and never revisit the justification.

As for whether to add a new column to the main table or the status table...

This is entirely a matter of judgment for you.

If you are absolutely sure the COLLECTION_STATUS table is unused throughout the application, and neither you nor any colleagues can understand any rationale, and if there are not so many other status tables that you'd be disrupting a very fundamental design pattern, then now may be a good time to rip the COLLECTION_STATUS table out altogether.

2
  • 5
    There are common explanations for this pattern, you have even identified a few. The key is often the application that the database is or was managed by. Finding that information will unlock many mysteries of existing database schemas. Commented Oct 18, 2022 at 16:55
  • 1
    Good point on permissions. I hadn't considered that.
    – JimmyJames
    Commented Oct 18, 2022 at 17:33
5

Not all database structure decisions can be made or even understood without a System Level point of View at the moment the decision was made. If it's no longer needed refactor it and move on.

Recently I implemented something very similar in a project I'm working on. The existing system is very old and VERY fragile, I don't have the source code for all of it and the source I do have is a plinko board of spaghetti code. There is no way to know if part of the system is tightly coupled to the table structure. Accessing columns by ordinal position was actually a thing at one time.

dr = Select * from table;
value = dr[1];   

Since I can't see the existing code and don't dare change anything that code relies on, the safest thing to do is add a one to one table instead of adding a new column to an existing table. Of course, in a couple years we hope to turn off the old system and I'm sure whomever inherits this system will ask WTF was that idiot thinking to create a structure like this?

2
  • That's good point, but that's not the case in my situation, at least. The best I can tell, the COLLECTION, COLLECTION_STATUS, ITEM, and ITEM_STATUS tables were all added a the same time when the application was being stood up. The history is hard to follow, since the original developers liked move scripts around as they worked on them (between "development" and "released" folders), then later consolidate them.
    – Kaypro II
    Commented Oct 19, 2022 at 15:17
  • Understood, But the longer I work this industry the more I give the last guy the benefit of the doubt. No coder is a screw up in their own eyes. Whomever did it thought they had a perfectly logically reason for it at the time. If you really need to know why then You'll have to try and crawl back into their head at the moment the decision was made.
    – David
    Commented Oct 19, 2022 at 15:22
4

So this might be a better question on the DBA stack and there might be some aspect to this that depends on the specifics of the DB engine etc. At the risk of getting a nerd smackdown, I'm going to say this doesn't make any sense from a relational design perspective. I think a lot of times when you see things like this, there was probably a plan that didn't work out as expected or the designer wasn't as experienced or capable as they thought they were.

The only thing I can think of is that (in the ancient times of 10 years ago or more) when 100 million rows was considered to be a 'large' table, perhaps there was some benefit to splitting things off like this. Maybe for disaster recovery or something.

I'm stretching here. My guess is that you are looking at an error. Most of my career has been spent working around people's mistakes. A significant portion of those are my own.

2
  • 4
    Most of my career has been spent working around people's mistakes. A significant portion of those are my own. This summarizes so well my experience that depresses me just thinking about it
    – Laiv
    Commented Oct 18, 2022 at 14:23
  • 1
    @Laiv That's life, right? I live in a country that designed many/most of its cities around automobiles. Not much we can do about that decision, we just have to figure something out.
    – JimmyJames
    Commented Oct 18, 2022 at 21:15
4

Change Data Capture or Logical Delete implementation would look like this, especially if you do not want the logical Delete Flag or Record Status to be part of the domain model. Using a separate table for the logical status of the physical records allows for status definitions such as Active, ReadOnly, Archived, Exported, Restricted, Deleted or whatever your business requirements might be, without having to pollute your domain model with this information. It is possible to use a similar strategy to this to implement row-level security or multi-tenancy.

  • Some implementations of this scenario use the {TABLE}_STATUS table to include other audit metadata such as created and modified timestamps and user or source information.

  • The idea of separating this information into it's own table can be an implementation aspect of Compositional design patterns. Especially if the feature is designed to be bolted on and potentially removed from any tables in the model.

  • Table-Per-Type (TPT) database strategies implemented in ORMs like Entity Framework can also result in this type of structure if the logical status of the records is encapsulated in a base class, though modern implementations of this would not use an Identity Key at all for the table and just use the PK from the principal table (or the Status table would be the principal... either way they would share the same column and only in the principal table the Primary Key would be an Identity column).

I have seen databases that were designed to support Record Status in this way but the application not ever using this logical delete behaviour due to it being a feature of the framework that is only enabled when Change Data Capture is enabled in the business layer and/or the record status has been changed to indicate the record needs to be synchronized with an external source. In this scenario we don't need logical deletes if the data is not being synchronized, data is deleted physically. Only when that record is being synchronized with another database, then we need to use a logical delete to know that the record in the other database also needs to be deleted.

  • You might be looking at a live integration scenario where the records are logically deleted, then the synchronization or ETL processes the delete in the other system and then physically deletes the source record. You might be sampling the data at a time where there are no live deletes in progress.

This could also have been a feature that the developers deliberately chose not to use, this is common in code bases that are reused or repurposed from a larger or generic suite or product.

Change Data Capture like this can be bolted into an existing schema without the application even being aware of it! Through the use of triggers to intercept INSERT,UPDATE,DELETE statements from the application interactions we can create the status records and set the modified or deleted states. Perhaps your logic requires the status be changed to a specific value that enables the soft delete functionality. This form of Change Data Capture at the database level was common in older application schemas before native solutions for the same concept were implemented in modern RDBMS.

  • Look for any triggers on the COLLECTION or ITEM tables to determine if this is a custom CDC implementation.
  • If there are no triggers this just means it is intended to be managed from the application logic.

Either of these scenarios would match your observation where every record has the same initial value in this status column that is not changed for the lifetime of the associated record.

What would be the benefits or motivations for designing a schema like this?

While it can be a counter-intuitive database design pattern, there these and many other application design patterns implemented for performance or security reasons that can lead to database schemas like this.

  • I am not suggesting that 1:1 directly increases performance or security, but it can be an artifact of application design patterns that can be used to achieve greater overall application performance or specific security or behavioural protocols.

To accurately assess why the schema was designed in this way you should inspect the application code, for instance what is the meaning to the business domain of the COLLECTION_STATUS and what are the possible COLLECTION_STATUS_CODE values that can be used?

We need to add a new kind of "status code" to COLLECTION, so I'm faced with the choice of whether to add a new column to COLLECTION for it, or follow the unused COLLECTION_STATUS design and add it there.

If the current Status implementation is the logical status or state of the row, and your new status is not that kind of status then I would definitely NOT use or try to re-purpose this table for your new status.

In existing databases with existing applications/systems it can be technically challenging or cost prohibitive to try and engineer a new field into the database first and then try to manipulate the system to recognise that field. We usually need to consult the operating codebase to determine an appropriate structural location and naming convention for the new field. The conventions from the existing code should be able guide you on the appropriate way to include new members or values into the schema. If you just put a new field in any arbitrary table, existing applications will generally need to be modified in some way to interact with these fields anyway, it can save a lot of time and frustration if you at least consult the codebase first.

1

I agree with the other answers here, on why you may or may not want a 1:1 relationship. But I don't think that's what drove this.

My take on this is that the original DB designer was confused. A status table is a standard pattern, but it's almost always going to be a 1:many relationship and the FK should be the other way (you should have a FK COLLECTION_STATUS_ID on the COLLECTION table). That way every collection can have a specific status given by the COLLECTION_STATUS table.

Having a separate table for the status allows for translations of the displayed status, easier validations, and less data in the primary table.

I would keep the status table, but refactor the relationship so the status table is 1:many - add a FK COLLECTION_STATUS_ID on the COLLECTION table.

1

I have experience with postgres but I guess its common for both databases.

It makes sense and in some cases gives huge performance burst. In most cases such structure is used for optimizing IO load for MVCC(multiversion concurrency control) db model. Updating of 1 column in db means:

  • copying row and saving it with new value
  • deleting old row(by the vacuum process)

So by writing status in separate table you save less data on disk. You benefit depends on row size.

The problem is also known as 'table/index bloating' and it's the way to improve situation.

'table/index bloating' affects overall performance and efficiency of autovacuum

1
  • 1
    This is what one of my first thoughts was, MVCC and Lock handling. If the separate table needs to be updated very often, but it doesn't have to be immediate, while the main table updates rarely but must finish at once, then this breakout totally makes sense.
    – Joe
    Commented Oct 22, 2022 at 10:14
0

Not sure about Oracle, but in some other database systems this could be related to performance, especially if the status changes often and the rest of the columns are quite large.

When you update even a single fixed-width column, most systems will not just update in place: due to concurrency and transaction isolation, you need to be able to serve to each client the version that is relevant based on transaction isolation level, respective start of the transactions, etc.

So they will instead create a new row, with all the data from the original row + the change requested (and store somewhere who can see which version). After a while, some maintenance task will clean up the old versions which are no longer visible by anyone (garbage collection, vacuum, optimize...).

This means that if a row is relatively large and contains a lot of data that changes very little and a few items that change very often, you're better off splitting those frequently changing items to a separate table. It will take less space (because the copied rows are smaller), it will be faster to vacuum/optimize, etc.

Alternatively, the opposite may be true (not in your case apparently): you have a column with a large amount of data, but you don't need it very often. It may make sense to split that column to a separate table so the main table remains small. Small means easier to cache, faster to iterate through, and so on. For tables that need to be searched with complex conditions which cannot use indices, this can dramatically improve performance.

Whether this is really useful/appropriate in you case is difficult to know. Of course if the table is never updated it may be useless, but maybe at design time it was thought this could be useful. Or the designers apply the same pattern everywhere: large amounts of data that change little in one table, small amounts of data which (they expect to) change often in another.

What is slightly more concerning is the presence of the collection_status_id and item_status_id columns. They could (or should) have used collection_id and item_id as primary keys (as well as foreign keys). The additional id is useless.

-1

My answer goes from the assumption that although there is a one-to-one relation in the database design, the data is not really one-to-one, i.e. there are collections and items without status.

There is one simple explanation for splitting the information this way: it avoids NULL values. This, in turn, ensures that the database is normalized. The arguably controversial stance tha NULL values in database prevent the database from being fully normalized seems to be largely forgotten nowadays. However, e.g. database expert Christopher J. Date argues in his book "Date on Database" (ISBN 978-1-4842-2029-0) that a table with NULL values violates the first normal form. See https://en.wikipedia.org/wiki/First_normal_form#1NF_tables_as_representations_of_relations

Some benefits of avoiding NULLs:

  • less columns in one table
  • no need for separate handling of NULL values in WHERE clauses (IS NULL)
  • possibly better performance when reading table without NULL values
  • less visual clutter in the parent table
  • making database design more granular allows more freedom in storing data, e.g. separate tablespaces for some of the tables
  • no 1NF violation

Some disbenefits of splitting the data to several tables:

  • more complex queries requiring JOINs
  • possibly worse performance because of JOINs
  • general unfamiliriaty of design style, leading to possibly having to justify the controversial design decision
7
  • 1
    If the relationship is 1:1 then having two tables does not eliminate any null values. If the relationship var 1 to zero-or-one, then it would make sense, but this does not seem to be the case.
    – JacquesB
    Commented Oct 18, 2022 at 9:38
  • I'm afraid I don't agree with this, or with the linked "reference" which is nothing more than a controversial post by an individual. There is far too little information given to justify the claims. No sensible person would claim that reducing a five-column table by one column (at the expense of creating another table with two more columns caused only by its own existence) was self-evidently any kind of "practical" benefit. In fact the upper atmosphere of normalisation is often the exemplar of impracticality.
    – Steve
    Commented Oct 18, 2022 at 10:03
  • @Steve Reference changed for more authority. I have personally designed databases in the same way, causing some discussion with my colleagues. But then again, maybe I am not a sensible person. On the other hand, "no sensible person" is just the opinion of you, a random invidual. ;)
    – simon
    Commented Oct 18, 2022 at 10:48
  • @simon, well you are not being sensible by assuming the "practical benefit" is self-evident - which is what I said. You have altered the references (to a book which most will not have to hand, and to a Wiki which notes Date's view to be controversial and inconsistent with others like Codd), but you have still not added any explanation or rationale for why "less columns in one table" is a "practical" benefit in the OPs case, or for any of the other bullet points.
    – Steve
    Commented Oct 18, 2022 at 10:59
  • For me the practical benefits for "less columns in one table" is pretty self-evident and I did not expect to have to justify or explain this. The advantages are: faster table scans, less memory consumption, less disk space consumption, less visual clutter when reading the tables. But you are right that the last point (about 1NF) is maybe not directly a practical benefit, so I edited the description of the list to remove the word "practical".
    – simon
    Commented Oct 18, 2022 at 11:06
-3

If the tables are truly 1:1 then there is no logical reason to have them separate, and I see a few downsides:

  1. The schema allows an item without a corresponding item_status. This introduces an unnecessary risk of invalid data, which would not be possible if it was a single table (with non-nullable columns). The risk can be alleviated by defining the PK on item to also be a FK to item_stus. But still, this is an ugly workaround for a problem which should not exist in the fist place.

  2. Two tables take up more space than one and a join is usually slower than just reading from a single table. This might not be a problem given the performance profile of the database, but still there is no reason for this complication.

  3. Inserting a new item requires a transaction because item and item_status should be inserted at the same time. If it was a single table, an item could safely be inserted with a single insert.

It is hard to say why the database have been designed like this, but there may have been valid historical reasons, e.g. at one point the relationship might have though to be one-to-many or one-to-zero-or-one (although this is hard to imagine with a status column).

It might also have been in anticipation of an extension. For example, they might have planned to track all state changes over time by adding a time stamp.

Another reason might have been at the application level: If status was managed by a separate application or subsystem, it might have been easier to use a separate table. Maybe some microservice-nut thought up the design originally?

There might also have been performance considerations. Eg if item_status is often queried without using any of the item columns (or vice versa) it might be faster in some database configurations to extract the column to a separate table. Such optimization might become obsolete with improved database engines, so it is possible such an optimization made sense a decade ago but not any more.

In any case, the original reason for the choice is irrelevant. What matters is if the design makes sense for you now.

1
  • In any case, the original reason for the choice is irrelevant this is the takeaway for me, is this a standalone database, or is there an associated application or system that is still using this database? Commented Oct 18, 2022 at 17:02

Not the answer you're looking for? Browse other questions tagged or ask your own question.