Data Engineering

!dataengineering

@lemm.ee
Help Branching Out / Upgrading Skillset

I've been a data (reporting) analyst for nearly a decade. I got my start writing complex SQL queries to develop reports and expanded into visualization (Tableau, Power BI). Lately I spent time as a data modeler, designing DBs and helping create ETLs using ADF (I wasn't hands-on in that process myself, but I'm familiar enough with it to make it work).

I'm looking for my next opportunity and finding that my lack of knowledge in Python is creating a blocker. Also, other skills I don't have seem to crop up, like Spark, Hadoop, etc. It seems like the Data Engineer role has been folded into my Data Analyst role and I just can't compete any more.

Would anyone have any suggestions for paths I might take to remedy this? I can work with some online Python courses but I feel I'm not getting the full experience needed to really support the new needs being asked of those in my role. I'm hoping someone might have some suggestions for directions I might take to up-skill myself and be more prepared for the emerging needs of this changing industry.

Airflow Summit 2023 - Recordings Now Available

https://www.youtube.com/playlist?list=PLGudixcDaxY29qXIXhd90htHp_BFk-Bqf

The 6 columns essential to a $6B/year database table

When you have a database with 6 billion US dollars flowing through it every year, you need to be able to account for and prove every single penny and every single action that occurred. So you had better have _A tables for all of the main tables, and these columns to boot.

  1. Create_user_id – Who/what created the record
  2. Create_Dt – When exactly was this record created
  3. Update_user_id – If updated, who updated this record (default null)
  4. Update_Dt – when was it last updated (default null)
  5. Archive_Dt – When can we legally destroy these records
  6. Unique_Trans_id – So that tracing down everything that occurred becomes even easier.

It isn't sexy but it'll be handy if you ever need to trace down things in your database too.
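As a rough illustration of how the six columns sit on a table, here is a minimal sketch using Python's built-in sqlite3 module. The `payment` table, its `amount_cents` column, the user names, and the use of a UUID for the transaction id are all assumptions made for the example; only the six audit columns come from the post.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical "payment" table; only the six audit columns come from the post.
DDL = """
CREATE TABLE payment (
    payment_id      INTEGER PRIMARY KEY,
    amount_cents    INTEGER NOT NULL,
    create_user_id  TEXT NOT NULL,   -- who/what created the record
    create_dt       TEXT NOT NULL,   -- when it was created
    update_user_id  TEXT,            -- NULL until the first update
    update_dt       TEXT,            -- NULL until the first update
    archive_dt      TEXT NOT NULL,   -- when the record may be destroyed
    unique_trans_id TEXT NOT NULL    -- ties together rows written together
)
"""


def now_iso() -> str:
    """Current UTC time as an ISO-8601 string (SQLite has no DATE type)."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

trans_id = str(uuid.uuid4())
with conn:
    conn.execute(
        "INSERT INTO payment (amount_cents, create_user_id, create_dt,"
        " archive_dt, unique_trans_id) VALUES (?, ?, ?, ?, ?)",
        (1999, "batch_loader", now_iso(), "2999-12-31", trans_id),
    )
    # An update fills in the two update columns and leaves the create ones alone.
    conn.execute(
        "UPDATE payment SET amount_cents = ?, update_user_id = ?, update_dt = ?"
        " WHERE payment_id = 1",
        (2099, "jdoe", now_iso()),
    )

row = conn.execute(
    "SELECT create_user_id, update_user_id, archive_dt FROM payment"
).fetchone()
print(row)
```

In a real system the update columns would more likely be filled by a trigger or stored procedure than by every caller remembering to do it.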

ARCHIVE_DT or when you can finally delete some shit

Knowing when a record can be disposed of in the future is deeply valuable to keeping database tables clean and containing only useful data.

31-DEC-2999 is pretty common for records that don't have an easily known value.

If you are in an organization that is trying to monetize your data, treat this value as the date when the user will no longer be able to see this record.

If you are in a more sane industry, treat it as the records retention date minus X days, where X matches your _A table storage duration. If the record is already in your _A table, treat it as the date the record will be deleted and no longer recoverable.

You may wish to give some thought to your database backup and rotation schedule so that those records are cleared by that date as well, but I leave that as an exercise for the reader.
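The date arithmetic described above can be sketched in a few lines of Python. The function names and the 90-day storage duration are hypothetical, and the reading that a record spends exactly the _A storage duration in the _A table is an assumption; the 31-DEC-2999 placeholder is the one mentioned in the post.

```python
from datetime import date, timedelta
from typing import Optional

# Placeholder from the post for records with no easily known disposal date.
NO_KNOWN_DATE = date(2999, 12, 31)


def main_table_archive_dt(retention_date: Optional[date],
                          audit_storage_days: int) -> date:
    """ARCHIVE_DT for a row in the main table.

    Assumption: the row leaves the main table early enough to spend
    `audit_storage_days` in the _A table before the retention date.
    """
    if retention_date is None:
        return NO_KNOWN_DATE
    return retention_date - timedelta(days=audit_storage_days)


def audit_table_archive_dt(moved_on: date, audit_storage_days: int) -> date:
    """ARCHIVE_DT for a row already in the _A table: the final deletion date."""
    return moved_on + timedelta(days=audit_storage_days)


retention = date(2030, 1, 1)
main_dt = main_table_archive_dt(retention, audit_storage_days=90)
final_dt = audit_table_archive_dt(main_dt, audit_storage_days=90)
print(main_dt, final_dt)
```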

Reference table design

There are 2 ways of doing reference tables:

Unique hand-written tables that perfectly match your desired data

or

The RT_ tables pattern mixed with cached views, which gives you a useful versioned reference table with an effective begin date, meaningful descriptions, a version number, and the effective end date (if it is set). You also get the ability to look up previous version values if needed, plus who created the values, when they were created, who updated them, and when they were updated (and, if you follow _A table best practices, all of the previous updates too). Not that you would likely need to update the values without doing a version update as well.

Insert in the following order to avoid constraint violations:

  1. RT_TABLE
  2. RT_FIELD_DOMAIN (only need to add entries when creating new reference table views or adding columns to reference tables)
  3. RT_TABLE_FIELD (duplicate old RT_FIELD_DOMAIN values with new table to keep old column names)
  4. RT_FIELD_VALUES (easiest to do 1 row or column at a time)

Or just insert them all in a single transaction.

RT_TABLE design

This is the master reference table for finding what reference tables exist and the versions that exist for them.

| Name            | Null     | Type           |
|-----------------+----------+----------------|
| REF_TABLE_ID    | NOT NULL | NUMBER         |
| TABLE_ID        |          | NUMBER         |
| VERSION         |          | NUMBER         |
| NAME            |          | VARCHAR2(30)   |
| DESCRIPTION     |          | VARCHAR2(255)  |
| COMMENTS        |          | VARCHAR2(255)  |
| STATUS          |          | CHAR(1)        |
| CREATE_USER_ID  | NOT NULL | VARCHAR2(20)   |
| UPDATE_USER_ID  |          | VARCHAR2(20)   |
| CREATE_DT       | NOT NULL | DATE           |
| UPDATE_DT       |          | DATE           |
| UNIQUE_TRANS_ID | NOT NULL | NUMBER         |
| EFF_BEGIN_DT    | NOT NULL | DATE           |
| EFF_END_DT      |          | DATE           |
| ARCHIVE_DT      | NOT NULL | DATE           | 

REF_TABLE_ID is the primary key

RT_FIELD_VALUES design

The actual reference table values

| Name               | Null     | Type          |
|--------------------+----------+---------------|
| REF_TABLE_FIELD_ID | NOT NULL | NUMBER        |
| FIELD_ROW_ID       | NOT NULL | NUMBER        |
| FIELD_VALUE        |          | VARCHAR2(255) |
| CREATE_USER_ID     | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID     |          | VARCHAR2(20)  |
| CREATE_DT          | NOT NULL | DATE          |
| UPDATE_DT          |          | DATE          |
| UNIQUE_TRANS_ID    | NOT NULL | NUMBER        |
| ARCHIVE_DT         | NOT NULL | DATE          |

REF_TABLE_FIELD_ID is a foreign key to RT_TABLE_FIELD.REF_TABLE_FIELD_ID

FIELD_ROW_ID is a sequence value shared by all entries on a row

RT_TABLE_FIELD design

This is the glue table for all of the reference tables

| Name               | Null     | Type          |
|--------------------+----------+---------------|
| REF_TABLE_FIELD_ID | NOT NULL | NUMBER        |
| REF_TABLE_ID       | NOT NULL | NUMBER        |
| FIELD_ID           | NOT NULL | NUMBER        |
| CREATE_USER_ID     | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID     |          | VARCHAR2(20)  |
| CREATE_DT          | NOT NULL | DATE          |
| UPDATE_DT          |          | DATE          |
| UNIQUE_TRANS_ID    | NOT NULL | NUMBER        |
| ARCHIVE_DT         | NOT NULL | DATE          |

REF_TABLE_FIELD_ID is the primary key (sequence or uuid)

REF_TABLE_ID is a foreign key to RT_TABLE.REF_TABLE_ID

FIELD_ID is a foreign key to RT_FIELD_DOMAIN.FIELD_ID

RT_FIELD_DOMAIN design

The actual column names for the reference tables

| Name            | Null     | Type          |
|-----------------+----------+---------------|
| FIELD_ID        | NOT NULL | NUMBER        |
| NAME            |          | VARCHAR2(50)  |
| DATA_TYPE       |          | CHAR(1)       |
| MAX_LENGTH      |          | NUMBER(5)     |
| NULLS_ALLOWED   |          | CHAR(1)       |
| CREATE_USER_ID  | NOT NULL | VARCHAR2(20)  |
| UPDATE_USER_ID  |          | VARCHAR2(20)  |
| CREATE_DT       | NOT NULL | DATE          |
| UPDATE_DT       |          | DATE          |
| UNIQUE_TRANS_ID | NOT NULL | NUMBER        |
| ARCHIVE_DT      | NOT NULL | DATE          |

FIELD_ID is the primary key (sequence or uuid)

RT_ALL_MV design

The master query behind all of the reference tables (keep it cached)

CREATE VIEW IF NOT EXISTS RT_ALL_MV AS
SELECT
   A.NAME AS TABLENAME
  ,A.VERSION AS VERSION
  ,D.FIELD_ID AS FIELDID
  ,A.EFF_BEGIN_DT AS EFFBEGDATE
  ,A.EFF_END_DT AS EFFENDDATE
  ,B.FIELD_ROW_ID AS ROW_ID
  ,D.NAME AS COLUMNNAME
  ,B.FIELD_VALUE AS COLUMNVALUE
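FROM
   RT_TABLE A
   -- glue table: which fields belong to which reference table version
   JOIN RT_TABLE_FIELD C  ON A.REF_TABLE_ID = C.REF_TABLE_ID
   -- the stored values for each table field
   JOIN RT_FIELD_VALUES B ON B.REF_TABLE_FIELD_ID = C.REF_TABLE_FIELD_ID
   -- the column names
   JOIN RT_FIELD_DOMAIN D ON C.FIELD_ID = D.FIELD_ID;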
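To see how the four tables and the view fit together, here is a minimal end-to-end sketch in Python with the built-in sqlite3 module. It is a sketch under assumptions, not the original schema: the audit columns are left out for brevity, SQLite types stand in for the Oracle-style ones in the tables above, the `STATUS_IND` sample rows are invented, and SQLite's view is a plain (uncached) view. The final query pivots the rows back into a normal-looking reference table using CASE, since SQLite has no DECODE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs unless asked
conn.executescript("""
CREATE TABLE RT_TABLE (
    REF_TABLE_ID INTEGER PRIMARY KEY, NAME TEXT, VERSION INTEGER,
    EFF_BEGIN_DT TEXT NOT NULL, EFF_END_DT TEXT);
CREATE TABLE RT_FIELD_DOMAIN (FIELD_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE RT_TABLE_FIELD (
    REF_TABLE_FIELD_ID INTEGER PRIMARY KEY,
    REF_TABLE_ID INTEGER NOT NULL REFERENCES RT_TABLE (REF_TABLE_ID),
    FIELD_ID INTEGER NOT NULL REFERENCES RT_FIELD_DOMAIN (FIELD_ID));
CREATE TABLE RT_FIELD_VALUES (
    REF_TABLE_FIELD_ID INTEGER NOT NULL
        REFERENCES RT_TABLE_FIELD (REF_TABLE_FIELD_ID),
    FIELD_ROW_ID INTEGER NOT NULL, FIELD_VALUE TEXT);
CREATE VIEW RT_ALL_MV AS
SELECT A.NAME AS TABLENAME, A.VERSION AS VERSION, D.FIELD_ID AS FIELDID,
       A.EFF_BEGIN_DT AS EFFBEGDATE, A.EFF_END_DT AS EFFENDDATE,
       B.FIELD_ROW_ID AS ROW_ID, D.NAME AS COLUMNNAME,
       B.FIELD_VALUE AS COLUMNVALUE
FROM RT_TABLE A
JOIN RT_TABLE_FIELD C  ON A.REF_TABLE_ID = C.REF_TABLE_ID
JOIN RT_FIELD_VALUES B ON B.REF_TABLE_FIELD_ID = C.REF_TABLE_FIELD_ID
JOIN RT_FIELD_DOMAIN D ON C.FIELD_ID = D.FIELD_ID;
""")

# Inserting a value before its parent rows exist violates the foreign key.
try:
    conn.execute("INSERT INTO RT_FIELD_VALUES VALUES (999, 1, 'orphan')")
    out_of_order_rejected = False
except sqlite3.IntegrityError:
    out_of_order_rejected = True
conn.rollback()

# Same order as the post: table, field domain, table field, field values.
with conn:
    conn.execute(
        "INSERT INTO RT_TABLE VALUES (1, 'STATUS_IND', 1, '2024-01-01', NULL)")
    conn.executemany("INSERT INTO RT_FIELD_DOMAIN VALUES (?, ?)",
                     [(1, "CODE"), (2, "DESCRIPTION")])
    conn.executemany("INSERT INTO RT_TABLE_FIELD VALUES (?, ?, ?)",
                     [(10, 1, 1), (11, 1, 2)])
    conn.executemany("INSERT INTO RT_FIELD_VALUES VALUES (?, ?, ?)",
                     [(10, 1, "A"), (11, 1, "Active"),
                      (10, 2, "C"), (11, 2, "Closed")])

# Pivot the one-value-per-row view back into one row per FIELD_ROW_ID.
rows = conn.execute("""
SELECT MAX(CASE COLUMNNAME WHEN 'CODE' THEN COLUMNVALUE END) AS CODE,
       MAX(CASE COLUMNNAME WHEN 'DESCRIPTION' THEN COLUMNVALUE END) AS DESCRIPTION
FROM RT_ALL_MV
WHERE TABLENAME = 'STATUS_IND' AND VERSION = 1
GROUP BY ROW_ID
ORDER BY CODE
""").fetchall()
print(out_of_order_rejected, rows)
```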

Example RT_ view

Current values can be just: SELECT * FROM RT_example_MV;

For figuring out previous values or making a view:

For SQL dialects that support DECODE:

SELECT
   MAX(DECODE(COLUMNNAME, 'CODE', COLUMNVALUE)) AS CODE
  ,MAX(DECODE(COLUMNNAME, 'DESCRIPTION', COLUMNVALUE)) AS DESCRIPTION
  ,MAX(VERSION) AS VERSION
  ,MAX(EFFBEGDATE) AS EFF_BEGIN_DT
  ,MAX(EFFENDDATE) AS EFF_END_DT
FROM RT_ALL_MV
WHERE
  TABLENAME LIKE '%STATUS_IND%' AND VERSION=3
GROUP BY ROW_ID
ORDER BY CODE;

For SQL dialects without DECODE:

-- Unquoted aliases: single-quoted aliases are string literals in standard SQL
SELECT
   MAX(CASE COLUMNNAME WHEN 'Code' THEN COLUMNVALUE END) AS CODE
  ,MAX(CASE COLUMNNAME WHEN 'S0_Rate' THEN COLUMNVALUE END) AS S0_RATE
  ,MAX(CASE COLUMNNAME WHEN 'S1_Rate' THEN COLUMNVALUE END) AS S1_RATE
  ,MAX(VERSION) AS VERSION
  ,MAX(EFFBEGDATE) AS EFF_BEGIN_DT
  ,MAX(EFFENDDATE) AS EFF_END_DT
FROM RT_ALL_MV
WHERE
  TABLENAME LIKE '%example%' AND VERSION=1
GROUP BY ROW_ID
ORDER BY CODE;

Rate histories or cleanly storing history

The HIST_NAV_IND column:

When you want a history of values (such as ratings) in the main table for some business requirement, add this column and use the following values:

S => When you have only 1 record

F => The first record when you have more than 1 record

P => The current primary record when you have more than 1

M => The previous P records that have been surpassed.

The EFF_BEGIN_DT and EFF_END_DT columns:

If you ever need to reprocess old records, you will want an easy way to figure out which rate history record to use; EFF_BEGIN_DT and EFF_END_DT make that simple.

EFF_BEGIN_DT is always set in every record (generally it should match the create date, but there are business reasons why you might want it separate).

EFF_END_DT should be NULL for the current primary record (unless you are organized enough to always know the future rate change date in advance [unlikely]). For the M and F records it should always be set to the day [or hour, minute or second] prior to the EFF_BEGIN_DT of the new P record. The EFF_END_DT of one record should never overlap with the EFF_BEGIN_DT of the next, and you can use TRUNC("TimeStamp", DATE) to ensure that your select driver will always get either 1 record [normally] or zero records [they shouldn't have been included].
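The indicator transitions and the non-overlapping date ranges can be sketched in plain Python. This is an illustration under assumptions, not the original implementation: `RateRecord`, `add_rate`, and `rate_at` are hypothetical names, the history is kept as an in-memory list ordered by EFF_BEGIN_DT, and one second is used as the gap between one record's end and the next record's begin.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RateRecord:
    rate: float
    eff_begin_dt: datetime
    eff_end_dt: Optional[datetime] = None  # NULL for the current record
    hist_nav_ind: str = "S"


def add_rate(history: List[RateRecord], rate: float, eff_begin_dt: datetime,
             step: timedelta = timedelta(seconds=1)) -> None:
    """Append a new current rate and close out the previous one."""
    if not history:
        history.append(RateRecord(rate, eff_begin_dt, None, "S"))
        return
    previous = history[-1]
    # End the old record just before the new one begins, so ranges never overlap.
    previous.eff_end_dt = eff_begin_dt - step
    # The lone S record becomes F; an old P record becomes M.
    previous.hist_nav_ind = "F" if len(history) == 1 else "M"
    history.append(RateRecord(rate, eff_begin_dt, None, "P"))


def rate_at(history: List[RateRecord], when: datetime) -> Optional[RateRecord]:
    """Return the single record in effect at `when`, or None."""
    matches = [r for r in history
               if r.eff_begin_dt <= when
               and (r.eff_end_dt is None or when <= r.eff_end_dt)]
    assert len(matches) <= 1, "effective date ranges overlap"
    return matches[0] if matches else None


history: List[RateRecord] = []
add_rate(history, 1.00, datetime(2023, 1, 1))
add_rate(history, 1.25, datetime(2023, 7, 1))
add_rate(history, 1.50, datetime(2024, 1, 1))
print([r.hist_nav_ind for r in history])
print(rate_at(history, datetime(2023, 8, 15)).rate)
```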

UNIQUE_TRANS_ID or letting you track what occurred together.

You will find 2 different implementations for this. The first (very wrong) is a unique sequence for every table, which really serves the purpose of a HISTORY_SEQ column.

The second (correct) is a global sequence whose value is the same for all records in all tables that are updated by a single transaction. The purpose is to make it trivial to find all records (inserted, updated [and deleted if using _A tables]) in a single transaction. [You'll want to add an AUDIT_UNIQUE_TRANS_ID column to your _A tables for that linkage.]

In simple environments this can be just a simple sequence and in more advanced environments this can be a UUID. The key is it must be unique on every transaction but its value should not be used to provide any information about the order of events in a table (that is the job of a HISTORY_SEQ column).
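A small sketch of the "correct" variant, using Python's sqlite3 and a UUID per transaction. The `account` and `ledger` tables and the helper names are hypothetical; the point is only that every row written by one transaction carries the same id, so one lookup finds them all.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (
    account_id INTEGER PRIMARY KEY, balance INTEGER,
    unique_trans_id TEXT NOT NULL);
CREATE TABLE ledger (
    ledger_id INTEGER PRIMARY KEY, account_id INTEGER, delta INTEGER,
    unique_trans_id TEXT NOT NULL);
""")


def open_account(conn: sqlite3.Connection, account_id: int, deposit: int) -> str:
    """Write to both tables in one transaction, stamped with one id."""
    trans_id = str(uuid.uuid4())  # unique, but says nothing about ordering
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO account VALUES (?, ?, ?)",
                     (account_id, deposit, trans_id))
        conn.execute(
            "INSERT INTO ledger (account_id, delta, unique_trans_id)"
            " VALUES (?, ?, ?)",
            (account_id, deposit, trans_id))
    return trans_id


def rows_touched_by(conn: sqlite3.Connection, trans_id: str) -> list:
    """Find every row, in any table, written by the given transaction."""
    return conn.execute("""
        SELECT 'account' AS table_name, account_id AS row_key
        FROM account WHERE unique_trans_id = ?
        UNION ALL
        SELECT 'ledger', ledger_id FROM ledger WHERE unique_trans_id = ?
        ORDER BY table_name, row_key
    """, (trans_id, trans_id)).fetchall()


first = open_account(conn, 1, 100)
second = open_account(conn, 2, 50)
print(rows_touched_by(conn, first))
```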

HISTORY_SEQ column or sanity checking basic mode

Use this if you might need to store multiple duplicate records or want a sequence number for the order of created/updated records in your table.

The big annoying bit is that you also need to update this column on EVERY SINGLE UPDATE to that table, and you'll want _A tables if you want to figure out the historical ordering of events. You will also be creating a unique sequence for every single table where this column exists, but just shove that functionality in a trigger.

This is also quite handy as a unique key handle for picking which records are being manually deleted, and it gives you a solution for when one person updates a record at the same time someone else is trying to delete it.
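The update-versus-delete case can be sketched with sqlite3 as follows. Several things here are assumptions for the sake of a short example: the table `foo` and its `val` column, a plain per-row counter standing in for the per-table sequence, and a trigger that only watches `val` (a real table would list every business column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (
    id INTEGER PRIMARY KEY,
    val TEXT,
    history_seq INTEGER NOT NULL DEFAULT 1);

-- Bump history_seq whenever the data changes, so callers cannot forget.
-- Assumption: a simple increment stands in for a per-table sequence.
CREATE TRIGGER foo_bump_seq AFTER UPDATE OF val ON foo
BEGIN
    UPDATE foo SET history_seq = OLD.history_seq + 1 WHERE id = NEW.id;
END;

INSERT INTO foo (id, val) VALUES (1, 'original');
""")


def current_seq(conn: sqlite3.Connection, row_id: int) -> int:
    return conn.execute(
        "SELECT history_seq FROM foo WHERE id = ?", (row_id,)).fetchone()[0]


def delete_if_unchanged(conn: sqlite3.Connection, row_id: int,
                        seen_seq: int) -> bool:
    """Delete only if the row still has the history_seq the caller saw."""
    with conn:
        cur = conn.execute(
            "DELETE FROM foo WHERE id = ? AND history_seq = ?",
            (row_id, seen_seq))
    return cur.rowcount == 1


# One user reads the row, then someone else updates it before the delete.
seen = current_seq(conn, 1)
with conn:
    conn.execute("UPDATE foo SET val = 'edited by someone else' WHERE id = 1")

stale_delete = delete_if_unchanged(conn, 1, seen)                  # refused
fresh_delete = delete_if_unchanged(conn, 1, current_seq(conn, 1))  # succeeds
print(stale_delete, fresh_delete)
```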

_A tables or how not to accidentally lose your shit

Sometimes called journal or audit tables, _A tables do the following magic trick: you can't screw up or delete your data in a way you can't recover.

In the simplest version possible, you take your table foo, duplicate its structure in a table named foo_A and add 2 columns: audit_dt and audit_user_id. Then you create triggers for updates and deletes on the table foo that first write the old values as a new insert in the foo_A table.

Now even if you screw up your select and delete all of the contents of table foo, everything will still be in table foo_A. If you accidentally overwrite everything in foo with garbage data, the good data will still be in foo_A.

Neither the application nor any of the users need to know about the _A tables (unless you want to leverage stored procedures instead of triggers to create the _A table entries).
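Here is a minimal sketch of that simplest version, run through Python's sqlite3. The `val` column and the hard-coded `'app_user'` are assumptions: SQLite has no session user, whereas a server database would normally record the connected user in audit_user_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo (id INTEGER PRIMARY KEY, val TEXT);

-- Same structure as foo, plus the two audit columns.
CREATE TABLE foo_A (
    id INTEGER, val TEXT,
    audit_dt TEXT NOT NULL, audit_user_id TEXT NOT NULL);

-- Copy the OLD values into foo_A before any update or delete on foo.
CREATE TRIGGER foo_audit_update BEFORE UPDATE ON foo
BEGIN
    INSERT INTO foo_A VALUES (OLD.id, OLD.val, datetime('now'), 'app_user');
END;
CREATE TRIGGER foo_audit_delete BEFORE DELETE ON foo
BEGIN
    INSERT INTO foo_A VALUES (OLD.id, OLD.val, datetime('now'), 'app_user');
END;
""")

with conn:
    conn.execute("INSERT INTO foo VALUES (1, 'good data')")
    conn.execute("UPDATE foo SET val = 'garbage'")  # accidental overwrite
    conn.execute("DELETE FROM foo")                 # accidental delete-all

remaining = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
audit_vals = [r[0] for r in
              conn.execute("SELECT val FROM foo_A ORDER BY rowid")]
print(remaining, audit_vals)
```

The application code only ever touches foo; the triggers do the rest, which is why nothing else needs to know the _A table exists.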

How do I convince my data engineer to not modify data before including it in our db?

Our data engineer insists on lowercasing everything and removing some other formatting, like new lines in free-text fields.

They say it's "better for Elasticsearch".

To me that makes no sense, and it loses information that can't be added back. But I couldn't really convince them otherwise. So far no real problem has come out of it, but it makes for a worse experience for the user. For example, company names that are acronyms show up in all lowercase (ibm, llc, etc.), and in free-text fields we lose the places where the user wrote in caps or added paragraphs.

What are your thoughts on this?

Disclaimer: I'm not a data engineer, just a PM on a data-related product.