5

Problem: We have a very big table, and growing. Most of its entries (say 80%) are historical data (with "DATE" field past current date) that are seldom queried, while small part of it (say 20%) are current data ("DATE" field after current date), most queries search these current entries.

Consider two possible scenarios, which one would be better (considering the overall implementation difficulty and performance,...)

  1. Breaking the big table into two table: Historical and Current data. And on daily basis I move the records with expired date from Current table to Historical table.

  2. Keeping record in one table (the DATA field is defined as INDEXED).

The scenario A would indicate more hustle in implementation and maintenance, and overload on daily bases for moving date between tables, while scenario B would indicate searching a big database (though indexed). Does it impose memory problems? Which scenario is recommended? IS there any other recommendations?

2
  • 1
    Completely depends on the data, hardware and indexing. Overly partitioned tables can slow down performance in certain scenarios (like to little open files allowed, to small number of open tables allowed), instead of increasing performance. In general a well normalized table with >100GB of data in it, should not be a problem.
    – Norbert
    Commented Oct 10, 2015 at 16:35
  • 1
    Depending on what you need to do with the archived data afterwards, you might want to look into the ARCHIVE storage engine as well. dev.mysql.com/doc/refman/5.6/en/archive-storage-engine.html
    – CBroe
    Commented Oct 10, 2015 at 16:40

3 Answers 3

4

You usually don't want to break a big table into multiple tables, although having a current and historical table is totally reasonable. Your process makes sense. You can then optimize the current table for your query needs. I would probably go for two tables (given the limited information you provide), because it allows such optimization.

However, don't split the historical data. Instead, use partitioning. See the documentation. One caveat: queries need to specify the partitioning key in the where clause to take advantage of the partitions. With a large table, this is typical anyway.

3
  • Thank Gordon for reply. So you recommend approach A. In this case, I should have a daily maintenance operation nd DB such as: "checking CURRENT table, finding records with DATA field past current date, and moving those fields to the HISTORICAL table". Would not this maintenance operation be a serious load on server? Commented Oct 11, 2015 at 20:43
  • Can you explain what you mean by "Don't split the historical data". Why would I? My question concerns splitting all my data into CURRENT and HISTORICAL. Is this what you saying: adopting approach B, and partitioning based on DATE, and then re-partitioning.? Commented Oct 11, 2015 at 20:47
  • 1
    @cybergeek654 . . . If you wanted to think about splitting the historical data, don't bother. Use partitioning. When I first read the question, I thought you might be tempting to use even more tables. Commented Oct 12, 2015 at 2:00
2

Question: is the historical data necessary for system functionality or are these records stored for other purposes (e.g. audits)? It may be time to clean house by moving the historical data to an archive.

2
  • No, historical data are not used for system level functionality. t Commented Oct 11, 2015 at 20:31
  • If I go with approach A, then historical records are records which has DATE field past current date, they make big portion of the overall database, and very few queries are about them. Commented Oct 11, 2015 at 20:37
2

In my experience, most systems with big data have historical tables. In most cases that I have been, both the current data and historical data have different user-groups. The current data are used by the front end users to deal with customers with their current or recent transactions. The historical data are usually used by the user groups who do not have to talk with customers/clients directly.

Do not worry much about the issue of implementation and maintenance as I think your main consideration is all about performance. Implementation is only a one-time deal that will run on a specified frequency (like weekly, monthly or yearly archival) after you moved the program/s in production. Maintenance is very small and you can just forget about it once it is already implemented. You just have to make sure that you test the programs thoroughly.

For a normalized historical tables, tables have the same structure and field names which makes the data copy much easier. This way, one can just to a table join between the tables.

If you choose to not split the data, you will continue to add index after index. But somewhere down the road, you will still encounter the same issue again.

2
  • Thanks Eddie for the answer. In my case, both current and historical data can be accessed by same user groups. Though historical data queries make a much smaller percentage of queries. Commented Oct 11, 2015 at 20:51
  • 1
    one of the reasons why most companies who have big data sets (as in size is no longer counted by records but by gigabytes) split or archive old data. most of the old data are not relevant with the current day-to-day business. you and your team will have to identify the frequency or the relevance of those historical data on a day-to-day basis. i think you and your team will have to do a very thorough research to convince the user groups and management to have or not to have an archive tables.
    – Eddie Banz
    Commented Oct 13, 2015 at 1:31

Not the answer you're looking for? Browse other questions tagged or ask your own question.