Data warehouse projects have long been part of the IT infrastructure of most large enterprises. ETL processes are part of these projects, yet developers keep making the same mistakes when designing and maintaining them. Some of these mistakes are described in this post.

I would like to narrow the scope of the discussion right away and agree on terminology:
  • Data warehouse (Datawarehouse, DWH) means a traditional SQL DWH (Oracle Database, MS SQL Server, etc.);
  • When modeling the DWH, the concepts of a single version of truth and historical truth are usually implied;
  • An ETL process (Extraction-Transformation-Loading) means a process that loads data from one or several source systems into the DWH;
  • The DWH was not created yesterday, and at the moment several project teams work on it independently, each with its own projects.

The term ETL is often interpreted in different ways, thanks to the "simple" reading of the abbreviation. In fact, ETL tasks are only a subset of Data Movement tasks. In his book "The Data Warehouse ETL Toolkit", Kimball identifies three things an ETL process has to do:
  1. Load the data in the form most convenient for analytics applications;
  2. Enrich the data with additional information during loading;
  3. Record and document the lineage (origin) of the data.

The first point is fairly obvious, so I will skip it. The second point says that the data should not simply be moved from one place to another, but also enriched along the way, for example with new calculated attributes and technical attributes (load session id, load date, source system, etc.). The third says that for any record it must be possible to trace where and when that record appeared in the DWH, and when and by which process it was changed.
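As a small illustration of the second and third points, a load statement can carry such technical attributes alongside the business data. This is only a minimal sketch; the table and column names (stg_orders, dwh_orders, load_session_id, etc.) are made up for the example and not taken from any real system:

-- minimal sketch: enriching records with technical attributes during the load
-- (all table and column names here are illustrative)
insert into dwh_orders (order_id, close_date,
                        load_session_id, load_datetime, source_system)
select o.order_id,
       o.close_date,
       :load_session_id,      -- id of the current load session, passed in by the ETL tool
       sysdate,               -- technical load timestamp
       'CRM'                  -- code of the source system the record came from
from stg_orders o;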

In general, the essence of most mistakes made by ETL developers can be explained by ignoring the vital rule from this picture.

7 errors of the ETL developer

The examples below are based on a DWH built on Oracle 11g. So let's start.

1. Use of system date (or similar function) in business logic.


One of the simplest and most frequent mistakes, especially among inexperienced developers. Suppose there is a business rule: during the "night loading window", unload the orders that were closed that day (based on the close_date field). The result is sometimes a SQL statement roughly like this:

insert into target_table (<….>)
select <….> from orders
where close_date >= sysdate - 1


Even if we forget that sysdate contains not only the date but also the time, we will still have problems with this script as soon as the regular run of the ETL process is broken for quite banal reasons (the source system is upgraded to a new version, the connection to the source system is lost, a new ETL process uses up the space in the temporary tablespace, etc.), i.e. at the moment when our ETL process has to be restarted, or suspended for a while and then started again. Something interesting may also happen if for some reason this process is started twice in one day ;)
The solution to this mistake is usually simple: parametrize the call of this process and, if needed, use sysdate as the default value with the possibility of overriding it. That said, using a datetime field to process the delta is not really optimal from the point of view of DWH maintenance; it is better to base the delta on some discrete field instead (for example, an integer load session id or something similar).
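A minimal sketch of such a parameterized call might look as follows; the procedure, table and column names are made up for the example, and sysdate only supplies the default that can be overridden on a restart:

-- sketch of a parameterized load: the business date is a parameter,
-- sysdate is only the default and can be overridden when the process is rerun
create or replace procedure load_closed_orders(
    p_business_date in date default trunc(sysdate) - 1
) as
begin
    insert into target_table (order_id, close_date)
    select order_id, close_date
    from orders
    where close_date >= p_business_date
      and close_date <  p_business_date + 1;
    commit;
end;
/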

2. Data was not profiled before development


Even the most thoroughly documented source system, developed by the book, usually contains malformed or inconsistent data, despite numerous assurances from its developers or support team. Relying on assurances of correctness from the other side of the barricades is usually fraught with problems at the end of development. Any data source (a table, a file, xml, json, etc.) has to be checked for compliance with the logical model of the DWH. There are different tools for data profiling, both built into ETL tools and standalone. I will list the most commonly needed checks:

Check #1: Uniqueness of identifiers and natural keys in the source data.


The difference between an identifier and a natural key is that an identifier is usually some surrogate value that technically identifies a row, while a natural key is a value or a combination of values that has business meaning.

Table order_details:

order_details_id | document_position | order_id
-----------------+-------------------+---------
35346346         | 10                | 1224114
35346365         | 20                | 1224114
…                | …                 | …
35345464         | 10                | 1224438
In this example, order_details_id is the identifier, and the combination of document_position + order_id is the natural key.
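A simple profiling query for this check, written against the example table above, could look like this:

-- duplicates of the natural key document_position + order_id, if any
select document_position, order_id, count(*) as cnt
from order_details
group by document_position, order_id
having count(*) > 1;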

Example: I took part in a project that loaded data into the DWH from a distributed (instance-based) system that kept track of network infrastructure objects. The developers of that system assured us with a straight face that the object id was unique, and even showed a unique index on the table in the source system to back up their words. The catch did not surface immediately: the ids were unique only within a single instance of the system, and when we tried to load all data from all instances, a uniqueness problem appeared. As a result, we had to change the data model and extend the natural key of the "network object" entity with an additional "instance" field to guarantee uniqueness.

Check #2: Data types.


If a field is called Order_nr, it does not necessarily contain only numeric values: alphanumeric sequences are quite possible. It is also always worth checking field lengths. This problem is typical of file data sources; database tables are usually well typed.
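For instance, a quick check that such a "numeric-looking" text field really contains only digits of an acceptable length might look like the following; the table name, the column and the length limit of 10 are assumptions for the example:

-- rows where order_nr is not purely numeric or is longer than the target column allows
select order_nr
from source_orders
where not regexp_like(order_nr, '^[0-9]+$')
   or length(order_nr) > 10;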

Check #3: Referential integrity (FK checks)


The fact that the developer shows you ER diagrams of the source system, demonstrates the existing FKs between tables on his DEV environment, and generally swears that he has everything under control is not a reason to skip checking for "hanging" records. He may simply not be aware that on the production environment the DBA has already switched that constraint off to improve performance (having, of course, agreed it with the developer's manager, so nobody is to blame). Problems with referential integrity are also very common in file data sources. And do not forget about the late-arriving-data scenario (for example, if the data arrives consistent today, that does not mean it will still be the case in half a year).
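A straightforward way to look for such "hanging" records is an anti-join against the parent table; the table names below are reused from the earlier example and are illustrative:

-- order_details rows that reference an order_id missing from the orders table
select d.*
from order_details d
left join orders o on o.order_id = d.order_id
where o.order_id is null;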

Check #4: NULL values


The main problem with NULL values is that NULL is not equal to NULL (any comparison with NULL yields unknown), so any query with joins on a field that may contain NULL will return unpredictable results. Therefore all important fields should be wrapped in nvl(). There is a separate holy war about whether to load NULL into non-key fields or to replace it with certain default values. I lean towards the wholesale replacement of NULLs for the sake of a more standardized use of the DWH, but I would not insist that it must always be done this way.
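For example, a join on a nullable field can be made deterministic by wrapping both sides in nvl() with a value that cannot occur in the data; the tables, the segment_code column and the '#' placeholder below are assumptions for the sketch:

-- joining on a nullable text column: rows with NULL codes would otherwise never match
-- (assumes segment_code is a varchar column and '#' never occurs in the real data)
select a.customer_id, b.segment_name
from customers a
join customer_segments b
  on nvl(a.segment_code, '#') = nvl(b.segment_code, '#');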

Check #5: Dates


Checks of date fields are usually the most complicated, since in addition to the standard checks you have to remember that not every date that is valid from the database's point of view is valid from the DWH's point of view: the date "21-07-1007" is hardly acceptable as the signing date of a mobile services contract. When modeling the DWH there are usually so-called "beginning of time" and "end of time" dates (other names are possible), and any date that does not fall into this range has to be replaced with some default value.
Cases where data types like varchar(8) are used to store dates (in a format such as '20151201') deserve a separate mention, since the number of checks has to be even larger there.
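A profiling query for dates can simply list the values falling outside the range adopted for the DWH; the contracts table and the boundary dates below are assumed for the example and should be replaced with whatever "beginning of time" and "end of time" your DWH uses:

-- contract dates outside the date range accepted by the DWH
select contract_id, signed_date
from contracts
where signed_date < date '1900-01-01'   -- assumed "beginning of time" for this DWH
   or signed_date > date '2199-12-31';  -- assumed "end of time"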

3. Removing duplicates with GROUP BY or DISTINCT


Even though all data that comes from a source is usually loaded into the DWH, there are scenarios where obviously duplicated data arrives at the input, while the uniqueness of the natural key requires keeping only one record out of the duplicates. There are two wrong ways to remove duplicates:

Wrong method #1: GROUP BY


Let's say we load client addresses and we know that, in theory, several records with address information can arrive for one client (usually they are complete duplicates caused, for example, by synchronization problems). Giving in to the temptation to solve the problem head-on, the developer may write a query like this:

insert into customer_address
select customer_id, max(street_name), max(house_nr)
from source_table
group by customer_id


Problems will begin when two genuinely different records arrive for one client (for example, the operator made an input error and corrected it, but both versions of the record ended up in the data source):
customer_id | street_name      | house_nr
------------+------------------+---------
1321        | Moskovskaya str  | 127
1321        | Pushkinskaya str | 34
The query may return a result like this (depending on the locale/collation settings):
customer_id | street_name      | house_nr
------------+------------------+---------
1321        | Pushkinskaya str | 127
There was no such record in the source data, and DWH users may reasonably ask what on earth this is. In fact, the third requirement for an ETL process is violated here: a record that cannot be traced back to the source system, in other words a record that never existed there, has been loaded into the DWH. And that is unambiguously the ETL developer's fault.

Wrong method #2: DISTINCT


The second head-on "solution" for the scenario described above is to remove the duplicated records with DISTINCT:

insert into customer_address
select distinct customer_id, street_name, house_nr
from source_table


In this case the pair of duplicated records with different attributes will be detected earlier: instead of one record there will be two, the uniqueness of the natural key will be violated, and the ETL process will fail with an error.

One of the correct methods


So how should the problem of two records with an identical natural key but different attributes be solved? Obviously, if the data model is not going to be changed, the single correct record has to be chosen from all of them, and it should be chosen according to a predetermined criterion: if the information is critical, various Data Quality scenarios can be implemented; if not, take the most recently loaded record as the correct one.

insert into customer_address
select customer_id, street_name, house_nr from (
    select customer_id, street_name, house_nr,
           row_number() over (partition by customer_id order by change_datetime desc) as row_num
    from source_table)
where row_num = 1


In general, remember that any record in the DWH must be traceable back to its data source(s), depending on the business rule, and that no "inexplicable" records should be created.

4. Using "static" scripts from source systems


Very often the business logic for DWH entities comes from the developers or analysts of the source systems in the form of SQL scripts. This is a great help for the ETL developer, but, as they say, "beware of Greeks bearing gifts": as a rule these scripts capture a certain conditionally "static" state of the source systems at some point in time, while the ETL developer is usually concerned with tracking changes in the data and loading only the changes (the "delta"). What should raise a flag in these "static" SQL scripts? Here are a few things:
  • aggregate functions (SUM, AVG, COUNT, etc.)
  • operators IN and EXISTS
  • window functions (OVER(PARTITION BY …))

An example of such a script:

insert into orders_from_calls (order_id)
select order_id from orders
where order_id in (select order_id from calls where order_id <> -1)
and changed_date > $last_loaded_date

It seems logical: load into our orders_from_calls table all orders that are referenced in the calls table and whose last modification date is greater than the date of the last load. Now imagine that the calls table in the DWH was not refreshed (for example, it is loaded from a different source system and the connection to it was down for some reason), and this query missed some order ids. Afterwards the calls table is loaded correctly and the missed order ids appear in it, but we will never load them into orders_from_calls, because nothing has changed in the orders table and new runs of this query will return nothing. So in this case the delta has to be tracked not only by the orders table, but also by the calls table.
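One possible way to fix the script is to extend the delta condition so that a change in either table pulls the order in; this is only a sketch, and it assumes the calls table also has a changed_date column, which a real source may not provide:

-- sketch: track the delta by both orders and calls
-- (assumes calls also has a changed_date column)
insert into orders_from_calls (order_id)
select o.order_id
from orders o
where o.order_id in (select c.order_id from calls c where c.order_id <> -1)
  and (   o.changed_date > $last_loaded_date
       or o.order_id in (select c.order_id
                         from calls c
                         where c.changed_date > $last_loaded_date))
  -- guard against re-inserting orders already picked up by an earlier run
  and not exists (select 1 from orders_from_calls t where t.order_id = o.order_id)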

5. Developing on a small data volume


As a rule, the ETL developer gets a small subset of the production data unloaded to the DEV environment, and is expected to develop and debug the ETL processes on it. Unfortunately, solutions developed on such a small data volume usually lead to various problems on the production system: insufficient performance, lack of space for intermediate tables (for example, the developer decided to elegantly split the business logic into a chain of intermediate tables, loading data from one into the next, and on production there turned out to be too much data and the tablespace for temporary tables suddenly ran out).
Unfortunately, the ETL developer cannot always resolve this problem on his own, because of various regulations and policies, the lack of budget for a full-fledged DEV environment with the same data volume as production, etc. So it should be treated as a project risk.
One way out is to split the project stages into smaller ones and release more often, so that such problems are identified not at the end of the project, but at least in the middle.

6. Misusing technical and business dates


There are two types of dates in the DWH: business dates and technical dates. The difference is in their origin: a business date is a date that came from the data source or was created by business rules; a technical date is a date generated by the ETL process or the DWH. And very often they are used incorrectly:

#1 Business dates are used as technical dates:


If an entity is historized as SCD2 (Slowly Changing Dimension type 2) and the data source has "_from" and "_to" fields that the ETL developer is invited to use as the validity ranges of the data, then he needs cast-iron guarantees that all validity ranges for each natural key will be: 1) non-overlapping, 2) without gaps between ranges, 3) such that the union of these date ranges matches the "beginning of time" to "end of time" range defined for your DWH (it can be, for example, the pair of dates "01.01.1000" and "31.12.9999", or "11.11.1111" and "09.09.9999"). As a rule, the developers of source systems do not bother much, and while the "non-overlapping ranges" rule is usually followed, there are usually problems with points 2 and 3. In any case, the general recommendation is not to use business dates for SCD2, and to generate your own technical dates instead.
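Whether the source really delivers gap-free ranges is easy to verify with a window function before trusting the "_from"/"_to" fields. The sketch below assumes day-level granularity with closed ranges (the next valid_from should equal the previous valid_to plus one day); the table and column names are illustrative:

-- sketch: find gaps between consecutive validity ranges of a natural key
-- (assumes day granularity and closed ranges)
select *
from (
    select customer_id,
           valid_from,
           valid_to,
           lead(valid_from) over (partition by customer_id
                                  order by valid_from) as next_valid_from
    from customer_address_hist
)
where next_valid_from is not null
  and next_valid_from <> valid_to + 1;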

#2 Technical dates are used as business dates:


Very often data sources do not provide a field for tracking certain control dates: for example, a document only has a "closed" status, but no timestamp of when that event took place, and as a solution it is suggested to use the technical "_from" and "_to" dates generated by the ETL process. However, this solution only works until the first ETL failure (for example, a two-day outage of the ETL processes): the failure happened on Monday, recovery came on Wednesday, and since the source system kept working all this time, every document created in between will be loaded as if it were created on Wednesday. In general, we cannot implement the "historical truth" scenario if the data source does not deliver all the dates the users need; it can only be simulated (by means of technical dates), but in that case this has to be discussed and described in the documentation, so that a year later users are not surprised by zero closed documents on Monday and Tuesday and three times as many on Wednesday.

7. "Mechanical" implementation


This is one of the hardest mistakes to detect and, to be honest, it is not really the ETL developer's mistake but rather the DWH architect's. But the team works on the project together, and colleagues should be helped out too.
Sometimes a target entity in the DWH is modeled incorrectly because of terminology mismatches between the source system developer and the architect. The source system developer thinks in the categories of his own system, while the DWH architect has to think through various integration schemes for connecting a multitude of objects from diverse source systems into a single DWH.
I will use the "client" entity as a typical example of such a problem: the data source has a "customer" table with a unique natural key and referential integrity in order. Based on this table, a "customer" entity was created in the DWH. Judging by the name, it is logical to assume that one record in this table corresponds to one client, but it turned out that the same real client could have several records with the same attributes but different natural keys. This led to an unpleasant collision for the DWH users who relied on this entity, for example, to calculate the total number of the company's clients. In the end it was decided to split the entity into two: "customer_record" and "customer", connected through an FK with an M:1 relation.
If the ETL developer had implemented everything "mechanically" according to the specification, he would of course not be guilty, but he had the opportunity to notice the problem, since, compared with the architect, he works, so to speak, "on the ground", unlike the architect with his head in the clouds ;)
Some symptoms of a "mechanical" implementation:
  • "Inheriting" the table names of the source system
  • Copying business logic from existing data flows into new ones
  • Reusing the jobs of the "pilot" project wholesale

What can be done to minimize the risks of a "mechanical" implementation:
  • Carefully analyze business rules containing any joins that can potentially "cut off" part of the data (a left outer join is usually preferable to an inner join; see the sketch below)
  • On the other hand, check "suspicious" joins that can "multiply" the data because of an incorrect or incomplete condition
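As a small illustration of the first point, the difference between the two joins is easy to see on the "customer" example above; the table and column names are assumed:

-- an inner join would silently drop orders whose customer record is missing;
-- the left outer join keeps them and exposes the gap as NULLs
select o.order_id, c.customer_name
from orders o
left outer join customers c
  on c.customer_id = o.customer_id;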

Summing up this point: you should always understand what you are loading into the DWH, whether the name matches the contents, and load neither more nor less data than required.

Conclusion


Certainly, this list is not complete, but I hope this article can bring some order to heads already cluttered with deadlines, milestones, releases and bugfixes ;)

This article is a translation of the original post at habrahabr.ru/post/273243/
