Designing Historical Tables

Introduction

Historical tables are rarely taught in any Computer Science curriculum, but they are incredibly useful in industry

However, they are usually implemented on an ad-hoc basis, which often introduces data anomalies and very poor performance

This paper presents a generalized pattern for implementing “Historical Tables” as part of a properly normalized and designed relational data model. Historical tables are implemented using sequence generators, which are also usually ignored in academia

This pattern is an original development by David Chaves1

Sequence Generators

A sequence generator… generates unique numbers

Sequence numbers are quite simple, useful and usually ignored

• Confirmation numbers
• Transaction IDs
• Reference numbers
• Authorization numbers
• Order numbers

Have you seen one recently? I bet you have… just look around

SQL Syntax

Sequences already exist in Oracle, PostgreSQL, IBM Db2, and Informix

CREATE SEQUENCE TheSeq START WITH 100 ;

For example, to generate the next value in PostgreSQL,

SELECT nextval('TheSeq') INTO … ;

And in Oracle,

SELECT TheSeq.NEXTVAL INTO … FROM DUAL ;

Sequence Characteristics

• Never, NEVER repeat numbers
• No need to roll back (push back) unused values
• Sometimes strictly incremental, but usually only mostly incremental
• Holes are just fine
• Work like a global virtual clock
• Very fast – very low overhead
• Easy to implement by storing the next value in a tiny file, if your database does not provide it for free
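That last point can be sketched concretely. Below is a minimal file-backed sequence generator in Python; the class name, file layout, and start value are illustrative assumptions, not part of the original pattern, and a production version would need file locking (e.g. fcntl) for concurrent access:

```python
import os
import tempfile

class FileSequence:
    """A tiny file-backed sequence: the file holds the next value to hand out.
    Illustrative sketch only; real concurrent use needs file locking."""

    def __init__(self, path, start=100):
        self.path = path
        if not os.path.exists(path):
            with open(path, "w") as f:
                f.write(str(start))

    def nextval(self):
        # Read the current value, then persist the incremented one.
        with open(self.path, "r+") as f:
            n = int(f.read())
            f.seek(0)
            f.write(str(n + 1))
            f.truncate()
        return n

path = os.path.join(tempfile.mkdtemp(), "theseq.txt")
seq = FileSequence(path, start=100)
print(seq.nextval())  # 100
print(seq.nextval())  # 101
```

Holes appear naturally: if a caller takes a value and never uses it, nothing is pushed back, which matches the characteristics above.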

Sequences in MySQL

Even though MySQL does not implement sequences natively, they are very easy to emulate. For example, to get the next value,

update TheSeq set N = LAST_INSERT_ID( N + 1 ) ;
select LAST_INSERT_ID() into … ;
commit ;

Assuming that TheSeq was created as follows

create table TheSeq ( N integer ) ;
insert into TheSeq ( N ) values ( 100 ) ;
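The same single-row-table trick can be sketched outside MySQL. Here is an equivalent in SQLite via Python's sqlite3 (SQLite also lacks native sequences; it has no LAST_INSERT_ID, so the value is read back inside the same transaction instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table TheSeq ( N integer )")
conn.execute("insert into TheSeq ( N ) values ( 100 )")
conn.commit()

def nextval(conn):
    # Bump and read the counter inside one transaction;
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute("update TheSeq set N = N + 1")
        (n,) = conn.execute("select N from TheSeq").fetchone()
    return n

print(nextval(conn))  # 101
print(nextval(conn))  # 102
```

The transaction is what makes the update-then-read pair atomic, just as LAST_INSERT_ID makes it atomic in the MySQL version.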

Tip

• Use a dedicated connection, so the sequence update does not interfere with your transactions

Questions…?

• Sequences are important: they make your design and code simpler if you use them
• Sequences are usually not shown explicitly in any database design drawing: they are external sources of data from the design point of view
• Sequence values are not hidden from end-users: they also have a meaning for tracking transactions and handling exceptions
• You will usually need only one sequence generator application-wide, or even database-wide

Relational Modeling

• Based on tables representing relations, and rows (“tuples”) representing facts
• Do not try to think in terms of objects and containers: relational databases are about data; objects are about actions. Do not introduce RecordIds to simulate pointers between rows. You should read “The Vietnam of Computer Science”
• Show me your database design and throw away the executable: start there if multiple teams are developing in parallel different pieces, to minimize dependencies and miscommunication
• Doug's Oracle: Data, not programs, is the only thing that matters — applications are transient and have no value except to acquire, manipulate, and display data. Data is the only thing with value
• Eric Raymond: Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious
• Rajaraman’s First Law: Adding more, independent data usually beats designing ever-better algorithms to analyze an existing data set
• The design can be automated completely if substantial raw data is available — for example,
• my friend Ronald Argüello got a PhD on this matter
• FDEP is a well-documented GNU C program for inducing functional dependencies from relations — however, the fdep complexity can be exponential in the size of the input relation instance
• TANE is another C program for discovering functional dependencies from large databases; additional instructions are available
• Anthony Aaby's normal.pl is a database design tool in SWI-Prolog to construct 3NF and BCNF schemas from functional dependencies

Useful Tips

• Use master tables and detail tables to represent $1\to N$ relations, where the detail table's primary key includes all the primary-key columns from the master table plus a discriminating value, like a detail or line number (or a sequence number)
• Use relation tables to store $N\leftrightarrow M$ relations, whose primary keys contain the complete primary keys of the (other) related tables
• A primary key is exactly the same as a unique index where all columns have the not-null constraint
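The master/detail tip can be illustrated with a small SQLite schema (the Orders/OrderLines tables and their column names are invented for this example): the detail table's primary key repeats the master's key plus a line number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Master table: one row per order.
create table Orders (
    OrderNo  integer not null,
    Customer text    not null,
    primary key ( OrderNo )
);
-- Detail table: primary key = the master's key plus a discriminating
-- line number, so each line is a distinct fact about its order.
create table OrderLines (
    OrderNo integer not null,
    LineNo  integer not null,
    Product text    not null,
    primary key ( OrderNo, LineNo ),
    foreign key ( OrderNo ) references Orders ( OrderNo )
);
""")
conn.execute("insert into Orders values (100, 'Jane')")
conn.execute("insert into OrderLines values (100, 1, 'Budweiser')")
conn.execute("insert into OrderLines values (100, 2, 'Imperial')")
conn.commit()
```

A relation table for an $N\leftrightarrow M$ relation would follow the same shape, with its primary key formed from the complete keys of both related tables.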

There are always update/delete anomalies when the model is not normalized

Database Design Anomalies

• Anomaly = relation breakdown and corruption during transactions
• Some anomalies only show up when the concurrency-level and database-server's workload are both high

Design goals

• The goal is to avoid anomalies and redundancy
• Update anomaly: one occurrence of a fact is changed, but not all occurrences
• Deletion anomaly: valid fact is lost when a tuple is deleted
• You need to normalize when you update or delete in a relational database and you don't want anomalies
• When don't you need to normalize? When you are not updating or deleting, when you are dealing with a non-relational database, or when you don't care about anomalies

Consider a relation Drinkers(Name, Address, BeersLiked, Manufacturer, FavoriteBeer ) with the following data,

Name   Address   BeersLiked  Manufacturer           FavoriteBeer
Jane   Toronto   Budweiser   Anheuser-Busch         Wicked Ale
Jane   ???       Imperial    Cervecería Costa Rica  ???
David  Waterloo  Budweiser   ???                    Root Beer

This is redundant: the ???s can also be figured out by using the following functional dependencies

• $\underline{Name}\to Address+FavoriteBeer$
• $\underline{BeersLiked}\to Manufacturer$

What is a functional dependency?
Why are Name and BeersLiked underlined above?
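For reference, the standard definitions behind those two questions (these are textbook definitions, not specific to this paper): a functional dependency holds when agreement on the left-hand side forces agreement on the right-hand side, and underlining marks that left-hand side, the determinant.

```latex
% A functional dependency X -> Y holds on a relation R when
% any two tuples that agree on X also agree on Y:
\forall\, t_1, t_2 \in R:\quad t_1[X] = t_2[X] \;\implies\; t_1[Y] = t_2[Y]
```

In the example, Name determines Address and FavoriteBeer, and BeersLiked determines Manufacturer, which is exactly why the ??? values are recoverable.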

Anomalies in last example

• Update anomaly: if Jane is transferred to a new address, Vancouver… Will we remember to change each of her tuples?
• Deletion anomaly: If nobody likes Bud, we lose track of the fact that Anheuser-Busch manufactures Budweiser

• We hire architects and civil engineers who use mathematical principles to design buildings and bridges… Why do we sometimes not apply mathematical principles to our database designs, even when they are also available?
• A normalized database is mathematically beautiful; a non-normalized database is messy and unpredictable

Need help?

Historical Data

• Financial institutions and governments have strict auditing requirements, for example
• Historical tables are useful to track all factual changes applied to relations, forward and backward from any point of time
• Same pattern can be applied over and over again

Characteristics

• Only INSERT and RETRIEVE operations
• DELETE very, very old data only: for PURGING, real DBAs usually put those tables off-line first, then back them up to tape, and finally delete them
• Never UPDATE, never update historical data — this might have legal implications!
• Large storage, based on auditing requirements
• Help to keep non-historical tables very manageable
• Don't need to be normalized because no anomalies are possible — there are no UPDATE operations!
• Can use a star/snowflake data model if you want
• Can also use materialized views to store the result of joined tables instead of the individual (historical) tables themselves
• Don't use constraints: not-null's, foreign keys, check expressions
• Don't use default values: use plain NULL values
• Don't use triggers nor validations: data was already validated before since it came from “current” tables
• Represent the fixed state of something at a given point of time in the past
• All “possible” foreign keys (references) to non-historical tables are not true foreign keys, since they do not really reference the most current values in the “current” tables
• Usually have lots, LOTS of non-unique secondary indexes for searching on multiple criteria

The Historical Tables Pattern

1. Start with a table for “current” data (facts): Tab_k(Col_1, Col_2, Col_3, …)
2. Generate a unique confirmation number and timestamp every time you are going to insert, update or delete anything in Tab_k
3. Have a historical table for Tab_k: Tab.Hist_k(Col_1, Col_2, Col_3, …, Hist#, HistTime)
4. You might extend Tab_k (and so Tab.Hist_k) to include the latest confirmation number and update timestamp as well: Tab_k(Col_1, Col_2, Col_3, …, Confirmation#, LastUpdate)
5. Notation: underlined columns form the primary key; here Col_1 and Col_2, plus Hist# in the historical table

Insert/Update/Delete Algorithm

Generate new {Seqno} using a sequence generator ;

Load current {Timestamp} from the system clock ;

START TRANSACTION ;

insert into Tab.Hist_k ( Col_1, Col_2, Col_3, …, Hist#, HistTime )
select Col_1, Col_2, Col_3, …, {Seqno}, {Timestamp}
from Tab_k where Col_1 = … and Col_2 = … ;

insert into / update / delete from Tab_k
where Col_1 = … and Col_2 = … ;

update Tab_k
set Confirmation# = {Seqno}, LastUpdate = {Timestamp}
where Col_1 = … and Col_2 = … ;  (skip this step when deleting)

COMMIT ;
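The algorithm above can be sketched as runnable code, here using SQLite through Python's sqlite3. The Accounts table, its columns, and the balance-update operation are invented for the example, and the paper's Hist# and Confirmation# become HistNo and ConfNo because # is not a legal identifier character here:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Accounts (          -- the "current" table
    Id integer primary key, Balance integer,
    ConfNo integer, LastUpdate text );
create table AccountsHist (      -- the historical table: no constraints on purpose
    Id integer, Balance integer, ConfNo integer, LastUpdate text,
    HistNo integer, HistTime text );
create table TheSeq ( N integer );
insert into TheSeq ( N ) values ( 100 );
""")

def nextval():
    with conn:
        conn.execute("update TheSeq set N = N + 1")
        return conn.execute("select N from TheSeq").fetchone()[0]

def update_balance(acct_id, new_balance):
    seqno = nextval()                        # confirmation number first
    ts = time.strftime("%Y-%m-%d %H:%M:%S")  # one timestamp for the action
    with conn:                               # single transaction
        # Archive the current row (its OLD values) before touching it.
        conn.execute("""insert into AccountsHist
                        select Id, Balance, ConfNo, LastUpdate, ?, ?
                        from Accounts where Id = ?""", (seqno, ts, acct_id))
        # Apply the change and stamp the row with the new confirmation.
        conn.execute("""update Accounts
                        set Balance = ?, ConfNo = ?, LastUpdate = ?
                        where Id = ?""", (new_balance, seqno, ts, acct_id))
    return seqno

conn.execute("insert into Accounts values (1, 500, null, null)")
conn.commit()
update_balance(1, 450)
```

After the call, Accounts holds the new balance tagged with the confirmation number, while AccountsHist holds the pre-change row, so the history can be replayed forward or backward from any point in time.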

Extending the Pattern

• It can also be used when multiple tables or multiple rows are being inserted, updated or deleted at the same time
• Use the same Confirmation# and Timestamp values when you are touching multiple tables or multiple rows in the same transaction — it is a single action from the end-user point of view
• The same Confirmation# value in multiple tables indicates that all these changes were applied in the same transaction submitted by the end-user
• Populate the historical tables before updating any “current” table
• The confirmation number generation is outside the transaction itself, to emphasize that you need it before you start changing your tables

• Not all tables are equal — there are important differences between them!
• It is important to know their usage pattern
• Historical tables are really simple to use
• Historical tables are really powerful, as you can see in the following sections

Sample uses of Historical Tables

• Easy storage & retrieval of current or historical content, to process it based on any cut date and time
• Easy roll-back of imported data, if you have an import#
• Easy tracking of calendar changes

Easy-to-rollback Imports

• This is the Import algorithm:

Generate new {Seqno} using a sequence generator ;

START TRANSACTION ;

import data and update historical tables as usual ;

COMMIT ;

Return Seqno as the Import# to the end-user

• Exercise: write the Undo Import algorithm
• What to do if something has been changed since the Import#?

Rolling back Huge or Complex Imports

• This is the Rollback Import algorithm to undo Import#:

Generate new {Seqno} using a sequence generator ;

Load current {Timestamp} from the system clock ;

START TRANSACTION ;

insert into HistoricalTables ( …, Hist#, HistTime )
select …, {Seqno}, {Timestamp}
from CurrentTables
where Confirmation# = {Import#} ;

replace into CurrentTables
select … from HistoricalTables
where Hist# = {Import#} ;

delete from CurrentTables
where Confirmation# = {Import#} ;

COMMIT ;

Return Seqno as the Undo# to the end-user
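The rollback algorithm above can be sketched end-to-end in SQLite via Python's sqlite3. The Prices table, its columns, and the import shape are invented for the example (HistNo and ConfNo stand in for Hist# and Confirmation#); note that restoring pre-import rows must happen before deleting rows still tagged with the Import#, so that only rows newly created by the import are deleted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Prices ( Sku text primary key, Price integer, ConfNo integer );
create table PricesHist ( Sku text, Price integer, ConfNo integer,
                          HistNo integer );
create table TheSeq ( N integer );
insert into TheSeq ( N ) values ( 100 );
""")

def nextval():
    with conn:
        conn.execute("update TheSeq set N = N + 1")
        return conn.execute("select N from TheSeq").fetchone()[0]

def import_prices(rows):
    importno = nextval()
    with conn:
        for sku, price in rows:
            # Archive the pre-import version (if any) under this Import#.
            conn.execute("""insert into PricesHist
                            select Sku, Price, ConfNo, ? from Prices
                            where Sku = ?""", (importno, sku))
            conn.execute("insert or replace into Prices values (?, ?, ?)",
                         (sku, price, importno))
    return importno

def undo_import(importno):
    undono = nextval()
    with conn:
        # 1. Archive the imported versions under the new Undo#.
        conn.execute("""insert into PricesHist
                        select Sku, Price, ConfNo, ? from Prices
                        where ConfNo = ?""", (undono, importno))
        # 2. Restore the pre-import versions saved under the Import#.
        conn.execute("""insert or replace into Prices
                        select Sku, Price, ConfNo from PricesHist
                        where HistNo = ?""", (importno,))
        # 3. Rows still tagged with the Import# had no pre-image: delete them.
        conn.execute("delete from Prices where ConfNo = ?", (importno,))
    return undono

conn.execute("insert into Prices values ('A', 10, null)")
conn.commit()
imp = import_prices([("A", 12), ("B", 20)])  # updates A, inserts B
undo_import(imp)
```

After the undo, A is back at its pre-import price and B is gone, while the historical table keeps a complete trail of both the import and its rollback.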

Tracking Calendar Changes

• This is really an exercise…
• Assume that calendar tables have associated historical tables as well
• Every time the calendar changes due to transaction#, we update the calendar using the transaction# as the confirmation# and update the respective historical tables
• The calendar tables can also be used by end-users to track their personal activities
• Notice that we can always provide a valid explanation for any calendar change, if the end-user complains about having lost his/her personal activities!