When developing databases, it is often necessary to provide support for versioning and storing object history. For example, an employee may change position, the position in turn may change salary - in multidimensional modeling this is called Slowly changing dimensions (hereinafter SCD) - rarely changing dimensions, that is, dimensions whose non-key attributes tend to change over time. In total there are 6 main types (methods) of
SCD , which determine how the history of changes can be reflected in the model.
Type 0
It lies in the fact that the data after the first hit in the table further never change. This method is practically not used by anyone, because It does not support versioning. It is needed only as a zero point of reference for the SCD methodology.
')
Type 1
Type 1 is the usual rewriting of old data with new ones. In its pure form, this method also does not contain versioning and is used only where history is not actually needed. However, in some DBMS for this type, it is possible to add limited support for versioning by the means of the DBMS itself (for example, Flashback query in Oracle) or by tracking changes through triggers.
Advantages:
- No redundancy added
- Very simple structure
Disadvantages:
Type 2
This method consists in creating for each version a separate entry in the table with the addition of a key attribute field of this version, for example: version number, date of change, or date of beginning and end of the period of version existence.
Example:
ID | NAME | POSITION_ID | DEPT | DATE_START | DATE_END |
---|
one | Kolya | 21 | 2 | 08/11/2010 10:42:25 | 01/01/9999 |
2 | Denis | 23 | 3 | 08/11/2010 10:42:25 | 01/01/9999 |
3 | Boris | 26 | 2 | 08/11/2010 10:42:25 | 01/01/9999 |
four | Sheldon | 22 | 3 | 08/11/2010 10:42:25 | 01/01/9999 |
five | a penny | 25 | 2 | 08/11/2010 10:42:25 | 01/01/9999 |
In this example, the default end date of the version is '01 .01.9999 ', instead of which one could specify, say, null, but then there would be a problem with creating the primary key from ID, DATE_START and DATE_END, and, moreover, The sampling condition for a specific date is simplified ("
where snapshot_date between DATE_START and DATE_END
" instead of "
where snapshot_date>DATE_START and (snapshot_date < DATE_END or DATE_END is null)
".
With such an implementation, when an employee leaves the company, it will be possible to simply change the end date of the current version to the date of dismissal instead of deleting employee records.
Advantages:
- Stores full and unlimited version history
- Convenient and easy access to the data of the required period
Disadvantages:
- Provokes redundancy or the establishment of additional tables to store variable dimension attributes
- Complicates the structure or adds redundancy in cases where the analyst will need to reconcile the data in the fact table with specific measurement versions and the fact may not be consistent with the current version of the measurement for this fact. (For example, the client has changed audits or address, but operation / delivery by old values)
Type 3
The record itself contains additional fields for the previous attribute values. When new data is received, old data is overwritten by current values.
| ID | UPDATE_TIME | LAST_STATE | CURRENT_STATE |
---|
one | one | 08/11/2010 12:58:48 | 0 | one |
2 | 2 | 08/11/2010 12:29:16 | one | one |
Advantages:
- Small amount of data
- Easy and quick access to history
Disadvantages:
Type 4
The change history is contained in a separate table: the main table is always overwritten with current data with old data transferred to another table. Usually this type is used to audit changes or create archive tables (as I said, in Oracle, this same 4th type can be obtained from the 1st using the flashback archive). A subtype or a hybrid of this option (with the second type), it seems to me, should be considered sectioned on the basis of the current version with the allowed movement of lines, but this is beyond the limits of modeling and rather refers to administration.
Example:
select * from emp
ID | NAME | POSITION_ID | DEPT |
---|
one | Kolya | 21 | 2 |
2 | Denis | 23 | 3 |
3 | Boris | 26 | 2 |
four | Sheldon | 22 | 3 |
five | a penny | 25 | 2 |
select * from emp_history
ID | NAME | POSITION_ID | DEPT | DATE |
---|
one | Kolya | 21 | one | 08/11/2010 2:12:13 PM |
2 | Denis | 23 | 2 | 08/11/2010 2:12:13 PM |
3 | Boris | 26 | one | 08/11/2010 2:12:13 PM |
four | Sheldon | 22 | 2 | 08/11/2010 2:12:13 PM |
Advantages:
- Quick work with current versions
Disadvantages:
- Splitting a single entity into different tables
Hybrid Type / Type 6 (1 + 2 + 3)
Type 6 was coined by
Ralph Kimball as a combination of the above methods and is designed for situations that they do not take into account or for greater convenience of working with data. It consists in introducing additional redundancy: a type 2 is taken as a basis, a surrogate attribute is added for an alternative version review (type 3), and one or all previous versions (type 1) are overwritten.
Example:
VERSION | ID | NAME | POSITION_ID | DEPT | DATE_START | DATE_END | CURRENT |
---|
one | one | Kolya | 21 | 2 | 08/11/2010 10:42:25 | 01/01/9999 | one |
one | 2 | Denis | 23 | 3 | 08/11/2010 10:42:25 | 01/01/9999 | one |
one | 3 | Boris | 26 | 2 | 08/11/2010 10:42:25 | 08/11/2010 11:42:25 | 0 |
2 | 3 | Boris | 26 | 2 | 08/11/2010 11:42:26 | 01/01/9999 | one |
In this example, for example, adding a surrogate key adds the ability to refer from the fact tables to a specific version of the dimension, which may not belong to the lifetime of the fact itself, and the indicator of the current version may help to partition by the current versions (although it would be better to call the latest version sectioning since the version may become outdated without changing the record itself). However, the indicator of the current version can be created as a virtual computable field, without worsening normalization, if it is necessary in the table (if the DBMS supports such fields, they appeared in Oracle in the 11th version), and as a field in the view from this table.
In general, any combination of basic types of SCD refers to a hybrid type, so both their disadvantages and advantages depend on your specific implementation, but one thing is sure - the choice of a hybrid type can only be due to the complexity of your model and almost always (in any case, I don’t know cases when it can be otherwise) you can get by with the main 4 types.
Let me add a few tips on how to implement SCD:
- Try to implement a mechanism for changing records in stored procedures - it is absolutely undesirable that the change code be scattered in different places, even if the change code is stored in clearly defined places in your external application;
- If you want to make a smooth transition from the 1st model to the second, you can do this:
1) change the table by type 2 SCD with renaming, for example, in table_name_scd2
2) create an updated view with the name of the old table, which will produce data in the same structure as the old table;
3) if you do not conduct all the changes in stored procedures (I hope this is temporary :)), which have already been changed, then create triggers that will fill in new fields in cases where they are not set by the query (when: new.start_date is null. ..) and log it in to make sure you changed everything - In cases where the start and end version fields are used, in addition to using the primary key, which includes the object identifier and the version start and end dates, you will need to control the integrity to create a limit on non-intersection of version dates. It is very good if your DBMS supports check constraints based on non-deterministic functions that allow you to do this (by the way, I would like to know which DBMS supports this), but if this is not the case, then you can check the condition in the trigger before creating or modifying and throw an exception in case of violation. Example for Oracle:
create or replace trigger T_EMP_CHECK before insert or update on emp for each row declare f_ok number; begin select count(*) into f_ok from dual where exists( select null from emp e where e.id = :new.id and e.date_start <= :new.date_end and e.date_end >= :new.date_start ); if f_ok>0 then raise DUP_VAL_ON_INDEX; end if; end T_EMP_CHECK;
- When moving from type 1 to type 4, you just need to create a before update trigger, in which you will store the records in a new table for archive records
Ps. Habralyudi, share what interesting hybrid implementations you met?