Data Engineering

fritz_astro

•

Airflow Summit 2023 - Recordings Now Available

https://www.youtube.com/playlist?list=PLGudixcDaxY29qXIXhd90htHp_BFk-Bqf

•

The 6 columns essential to a $6B/year database table

When you have a database with 6 Billion dollars (US) flowing through it every year, you need to be able to account and prove exactly every single penny and every single action that occurred. So you better have _A tables for all of the main tables and have these columns to boot.

Create_user_id – Who/what created the record
Create_Dt – When exactly was this record created
Update_user_id – If updated, who updated this record (default null)
Update_Dt – when was it last updated (default null)
Archive_Dt – When can we legally destroy these records
Unique_Trans_id – So that tracing down everything that occurred becomes even easier.

It isn't sexy but it'll be handy if you ever need to trace down things in your database too.

2 comments

•

ARCHIVE_DT or when you can finally delete some shit

Knowing when a record can be disposed of in the future is deeply valuable to keeping database tables clean and containing only useful data.

31-DEC-2999 is pretty common for records that don't have an easily known value.

If you are in an organization that is trying to monetize your data, treat this value as the date when the user will no longer be able to see this record.

If you are in a more sane industry, treat it as the records retention date minus X number of days that match your _A table storage duration or if the record is already in your _A table, the date it will be deleted and no longer recoverable.

You may wish to put some thought to your database backup and rotation schedule so that those records cleared by that date as well but I leave that as an exercise to the reader.

•

Reference table design

There are 2 ways of doing reference tables:

unique hand written tables that perfectly match your desired data

The RT_ tables pattern mixed with cached views which will give a useful versioned reference table with an effective begin date, meaningful descriptions, version number, the effective end date (If it is set). With the ability to get previous version values if needed, who created the values, when the values were created, who updated the values and when they were updated (And if you follow _A table best practices, all of the previous updates too); not that you would likely need to update the values without doing a version update as well.

Insert in the following order to avoid constraint violations:

RT_TABLE

RT_FIELD_DOMAIN (only need to add entries when creating new reference table views or adding columns to reference tables)

RT_TABLE_FIELD (duplicate old RT_FIELD_DOMAIN values with new table to keep old column names)

RT_FIELD_VALUES (Easiest to do 1 row or column at a time)

Or just insert them all in a single transaction

RT_TABLE design

This is the master reference table for finding what reference tables exist and the versions that exist for them.

| Name            | Null     | Type           |
|-----------------+----------+----------------|
| REF_TABLE_ID    | NOT NULL | NUMBER         |
| TABLE_ID        |          | NUMBER         |
| VERSION         |          | NUMBER         |
| NAME            |          | VARCHAR2(30)   |
| DESCRIPTION     |          | VARCHAR2(255)  |
| COMMENTS        |          | VARCHAR2(255)  |
| STATUS          |          | CHAR(1)        |
| CREATE_USER_ID  | NOT NULL | VARCHAR2(20)   |
| UPDATE_USER_ID  |          | VARCHAR2(20)   |
| CREATE_DT       | NOT NULL | DATE           |
| UPDATE_DT       |          | DATE           |
| UNIQUE_TRANS_ID | NOT NULL | NUMBER         |
| EFF_BEGIN_DT    | NOT NULL | DATE           |
| EFF_END_DT      |          | DATE           |
| ARCHIVE_DT      | NOT NULL | DATE           |

REF_TABLE_ID is the primary key

RT_FIELD_VALUES design

The actual reference table values

| Name               | Null     | Type          |
|--------------------+----------+---------------|
| REF_TABLE_FIELD_ID | NOT NULL | NUMBER        |
| FIELD_ROW_ID       | NOT NULL | NUMBER        |
| FIELD_VALUE        |          | VARCHAR2(255) |
| CREATE_USER_ID     | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID     |          | VARCHAR2(20)  |
| CREATE_DT          | NOT NULL | DATE          |
| UPDATE_DT          |          | DATE          |
| UNIQUE_TRANS_ID    | NOT NULL | NUMBER        |
| ARCHIVE_DT         | NOT NULL | DATE          |

REF_TABLE_FIELD_ID has a foreign key with RT_TABLE_FIELD.REF_TABLE_FIELD_ID FIELD_ROW_ID a sequence value used for all entries on a row

RT_TABLE_FIELD design

This is the glue table for all of the reference tables

| Name               | Null     | Type          |
|--------------------+----------+---------------|
| REF_TABLE_FIELD_ID | NOT NULL | NUMBER        |
| REF_TABLE_ID       | NOT NULL | NUMBER        |
| FIELD_ID           | NOT NULL | NUMBER        |
| CREATE_USER_ID     | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID     |          | VARCHAR2(20)  |
| CREATE_DT          | NOT NULL | DATE          |
| UPDATE_DT          |          | DATE          |
| UNIQUE_TRANS_ID    | NOT NULL | NUMBER        |
| ARCHIVE_DT         | NOT NULL | DATE          |

REF_TABLE_FIELD_ID is the primary key (sequence or uuid) REF_TABLE_ID is a foreign key to RT_TABLE.REF_TABLE_ID FIELD_ID is a foreign key to RT_FIELD_DOMAIN.FIELD_ID

RT_FIELD_DOMAIN design

The actual column names for the reference tables

| Name            | Null     | Type          |
|-----------------+----------+---------------|
| FIELD_ID        | NOT NULL | NUMBER        |
| NAME            |          | VARCHAR2(50)  |
| DATA_TYPE       |          | CHAR(1)       |
| MAX_LENGTH      |          | NUMBER(5)     |
| NULLS_ALLOWED   |          | CHAR(1)       |
| CREATE_USER_ID  | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID  |          | VARCHAR2(20)  |
| CREATE_DT       | NOT NULL | DATE          |
| UPDATE_DT       |          | DATE          |
| UNIQUE_TRANS_ID | NOT NULL | NUMBER        |
| ARCHIVE_DT      | NOT NULL | DATE          |

FIELD_ID is the primary key (sequence or uuid)

RT_ALL_MV design

The master query behind all of the reference tables (keep it cached)

CREATE VIEW IF NOT EXISTS RT_ALL_MV AS
SELECT
   A.NAME AS TABLENAME
  ,A.VERSION AS VERSION
  ,D.FIELD_ID AS FIELDID
  ,A.EFF_BEGIN_DT AS EFFBEGDATE
  ,A.EFF_END_DT AS EFFENDDATE
  ,B.FIELD_ROW_ID AS ROW_ID
  ,D.NAME AS COLUMNNAME
  ,B.FIELD_VALUE AS COLUMNVALUE
FROM
   RT_TABLE A
  ,RT_FIELD_VALUES B
  ,RT_TABLE_FIELD C
  ,RT_FIELD_DOMAIN D
WHERE
  A.REF_TABLE_ID = C.REF_TABLE_ID                AND
  B.REF_TABLE_FIELD_ID = C.REF_TABLE_FIELD_ID    AND
  C.FIELD_ID = D.FIELD_ID;

Example RT_ view

Current values can be just: SELECT * FROM RT_example_MV; For figuring out previous values or making a view:

For sqls that support DECODE

SELECT
   MAX(DECODE(COLUMNNAME, 'CODE', COLUMNVALUE)) AS CODE
  ,MAX(DECODE(COLUMNNAME, 'DESCRIPTION', COLUMNVALUE)) AS DESCRIPTION
  ,MAX(VERSION) AS VERSION
  ,MAX(EFFBEGDATE) AS EFF_BEGIN_DT
  ,MAX(EFFENDDATE) AS EFF_END_DT
FROM FROM RT_ALL_MV
WHERE
  TABLENAME LIKE '%STATUS_IND%' AND VERSION=3
GROUP BY ROW_ID
ORDER BY CODE;

For sqls without

SELECT
   MAX(CASE COLUMNNAME WHEN 'Code' THEN COLUMNVALUE END) AS 'Code'
  ,MAX(CASE COLUMNNAME WHEN 'S0_Rate' THEN COLUMNVALUE END) AS 'S0 Rate'
  ,MAX(CASE COLUMNNAME WHEN 'S1_Rate' THEN COLUMNVALUE END) AS 'S1 Rate'
  ,MAX(VERSION) AS VERSION
  ,MAX(EFFBEGDATE) AS EFF_BEGIN_DT
  ,MAX(EFFENDDATE) AS EFF_END_DT
FROM RT_ALL_MV
WHERE
  TABLENAME LIKE '%example%' AND VERSION=1
GROUP BY ROW_ID
ORDER BY CODE;

•

Rate histories or cleanly storing history

The HIST_NAV_IND column:

When you want a history of values (such as ratings) in the main table for some business requirement, add this column and use the following values:

S => When you have only 1 record

F => The first record when you have more than 1 record

P => The current primary record when you have more than 1

M => The previous P records that have been surpassed.

The EFF_BEGIN_DT and EFF_END_DT columns:

In case you might need to do reprocessing of old records you will want an easy way to figure out which rate history that you would want to use; EFF_BEGIN_DT and EFF_END_DT make that simple.

EFF_BEGIN_DT is always set in every record (generally it should match the create date but there are business reasons why you want it separate)

EFF_END_DT should be NULL for the current primary record (unless you are organized enough to always know the future rate change date in advance [unlikely]) and should always be set for the M and F records to the day [or hour, minute or second] prior to the EFF_BEGIN_DT of the new P record. The EFF_END_DT of one record should never overlap with the EFF_BEGIN_DT of the next and you can use TRUNC("TimeStamp", DATE) to ensure that your select driver will always either get 1 [normally] or zero [They shouldn't have been included] records.

•

UNIQUE_TRANS_ID or letting you track what occurred together.

You will find 2 different implementations for this, the first (very wrong) is a unique sequence for every table and it serves the purpose of a HIST_SEQ column.

The second (correct) is a global sequence which will be the same for all records in all tables which are updated by a single transaction. The purpose is to make it trivial to find all records (inserted, updated [ and deleted if using _A tables]) in a single transaction. [You'll want to add an AUDIT_UNIQUE_TRANS_ID column to your _A tables for that linkage]

In simple environments this can be just a simple sequence and in more advanced environments this can be a UUID. The key is it must be unique on every transaction but its value should not be used to provide any information about the order of events in a table (that is the job of a HISTORY_SEQ column).

•

HISTORY_SEQ column or sanity checking basic mode

If you might need to store multiple duplicate records or want a sequence number for the order of created/updated records in your table.

This is what you need, the big annoying bit is you need to also update this column on EVERY SINGLE UPDATE to that table and you'll want _A tables if you want to figure out historical ordering of events. And you will be creating a unique sequence for every single table where this column exists. but just shove that functionality in a trigger.

This also would be quite handy if you want a unique key handle for picking which records are being manually deleted and you have the solution when one person updates a record at the same time someone else is trying to delete a record.

•

_A tables or how not to accidentally lose your shit

Sometimes called journal or audit tables. _A tables do the following magic trick: you can't screw up or delete your data in a way you can't recover.

In the most simple version possible you take your table foo, duplicate it's structure in a table named foo_A and add 2 columns: audit_dt and audit_user_id. Then you create triggers for update and deletes on the table foo to first write the old values as a new insert in the foo_A table.

Now even if you screw up your select and delete all of the contents of table foo. everything will still be in table foo_A. If you accidentally overwrite everything in foo with garbage data, the good data will still be in foo_A

The application nor any of the users need to know about the _A tables (unless you want to leverage stored procedures instead of triggers to create the _A table entries)