Authors: Hjusein Tjurkmen, Mariyana Hristova, Musala Soft
Nowadays we are surrounded by enormous amounts of information, and our lives are flooded with data that we analyze, consciously or not, day and night. More and more public and private organizations appreciate and measure the quality of their data. Within a company, data quality is crucial to all operational processes and to the reliability of business analytics (BA) and business intelligence (BI) reporting. When it comes to important business decisions, the same question always arises: can you really trust your data?
Before answering this question, the term “data quality” should be clarified. Many definitions exist; here are some of them:
- Data are of high quality “if they are fit for their intended uses in operations, decision making and planning.” (J. M. Juran)
- “The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.”
- “The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.”
- “The degree to which information consistently meets the requirements and expectations of all knowledge workers who require it to perform their processes.” (Larry English)
Common to all definitions is that data has quality if it satisfies the requirements of its intended use. To satisfy the intended use, the data must be accurate, relevant, complete, understood, and trusted.
Inevitably, the next question that arises is “Why should we care about data quality?” Listed below are four main problems that a lack of data quality may lead to:
• Poor data costs money and reduces productivity – the time spent fixing large amounts of bad data is expensive and lowers productivity.
• Wrong business decisions – strategic decisions in companies are made based on large amounts of data stored in data warehouses or databases. What if the data is wrong? This may lead to serious financial losses.
• Poor client satisfaction – providing clients and third-party organizations with incorrect or inconsistent data erodes partner trust and leads to loss of market share.
• Expensive and slow implementation – solutions built on poor data are unreliable, unstable, and in need of constant maintenance. This affects the project’s price and delivery time.
History is full of stories about the high cost of poor data quality. Here are some interesting facts from TDWI (The Data Warehousing Institute) related to the problems mentioned above:
1. Poor quality customer data costs U.S. businesses $611 billion a year.
2. The cost of poor data may be 10-25 percent of total revenues.
3. 40% of companies have suffered losses due to poor data quality.
4. One company is generating $130,000 annually in cost savings from a data quality project.
Data quality dimensions
The analysis of predefined dimensions is a very important part of the data quality process. A data quality dimension is an aspect or feature of information and a way to classify it; dimensions are used to define measures and to manage the quality of the data. Different books and articles on data quality present different sets of dimensions. Here are some of them:
• Data Specifications – a measure of the existence, completeness, quality, and documentation of data standards, data models, business rules, metadata, and reference data
• Data Integrity – a measure of the existence, validity, structure, content, and other fundamental characteristics of the data
• Duplication – a measure of unwanted duplication existing within or across systems for a particular field, record, or data set
• Accuracy – a measure of the correctness of the content of the data (which requires an authoritative source of reference to be identified and accessible)
• Consistency and Synchronization – a measure of the equivalence of information stored or used in various data stores, applications, and systems, and the processes for making data equivalent
• Timeliness and Availability – a measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected
Other dimensions also exist, such as ease of use and maintainability, data coverage, presentation quality, perception, relevance and trust, data decay, and transactability, but the six described in more detail above are the ones most often applied in practice.
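As a simple illustration of how such a dimension can be measured in practice, the hypothetical SQL query below checks the Duplication dimension for a customer table; the table and column names are assumptions made for the example.

    -- Measure the Duplication dimension: find business keys that
    -- appear more than once in a hypothetical customer table.
    SELECT customer_number,
           COUNT(*) AS occurrences
    FROM dim_customer
    GROUP BY customer_number
    HAVING COUNT(*) > 1;

The number of such rows, relative to the total number of records, gives a concrete figure by which the Duplication dimension can be measured and tracked over time.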
Why is data warehouse testing important for ensuring data quality?
Most organizations use a data warehouse (DWH) as their main repository of data and run their businesses relying on that data for strategic decision-making. One of the greatest risks to the success of any company implementing a business intelligence system is rushing the data warehouse into service without testing it effectively.
DWH testing should always be aligned with the data model underlying the data warehouse. Bugs detected at a late stage of the testing cycle are never welcome, because they can easily lead to very high financial losses for the project. DWH testing focuses on data quality; test execution itself has multiple phases and is carried out throughout the whole life cycle of the DWH implementation.
ETL tools as part of improving data quality in a data warehouse
ETL stands for extract, transform, and load, and it is the building block of a data warehouse. During the ETL phase, data quality testing is of utmost importance, because any defect that slips through this phase will be very costly to rectify later. ETL tools are widely used to improve data quality. As can be seen in Figure 1 below, ETL tools first combine data from different external data sources into a single repository (extract phase), transform the data to the desired format based on predefined business rules in order to optimize reporting and analysis (transform phase), and then load the data into a data warehouse or another system (load phase), where end users can rely on it for strategic decision-making. Numerous tools support the ETL process; which one to use depends on the team and the environment.
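In its simplest form, the transform and load steps can be expressed directly in SQL. The sketch below assumes a raw staging table and a cleansed warehouse fact table; all schema, table, and column names are illustrative.

    -- Minimal illustration of the transform and load steps: cleanse
    -- raw payment rows from staging and load them into the warehouse.
    INSERT INTO dwh.fact_payment (payment_id, customer_id, amount, payment_date)
    SELECT s.payment_id,
           s.customer_id,
           CAST(s.amount AS DECIMAL(18, 2)),   -- enforce a numeric type
           CAST(s.payment_date AS DATE)        -- normalize the date format
    FROM staging.payments AS s
    WHERE s.payment_id IS NOT NULL;            -- basic integrity rule

In practice, a dedicated ETL tool adds scheduling, logging, and error handling on top of such transformations.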
Data quality integrated with the data warehouse
Data quality is an integral part of building successful data warehouse solutions. One possible data quality solution is presented and described below (Figure 2).
The first step is to define a set of data quality (DQ) and business rules, which are loaded into the Rules engine. Each rule has an associated validity threshold based on the users’ expectations of quality. The input data is first loaded into the Rules engine and tested against the data quality rules.
- If all rules are met, the data is valid and can be successfully loaded into the data warehouse.
- If any rule is not met, the data is sent to the reconciliation system together with information about which rule was violated. The violating records can also be passed through to the data warehouse, but they will be marked as “not conformed to users’ expectations”.
The count of failures and successes is maintained for each rule. When the data import process is finished, each rule’s validity value is computed as the ratio of valid records to total records, and a report is generated. If all validity percentages exceed the associated thresholds, the warehouse is certified as conforming to the users’ data quality requirements; otherwise, the data warehouse is not certified until the percentages reach the conformance level. The records loaded into the reconciliation system must be analyzed for the root cause of the failure. After reconciliation, the data is reloaded into the Rules engine and the flow is repeated.
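A minimal sketch of how this per-rule validity computation might look in SQL, assuming a results table that records a pass/fail flag for each rule and record; the schema is an assumption made for illustration.

    -- Compute each rule's validity as the ratio of valid records to
    -- total records and compare it against the rule's threshold.
    SELECT r.rule_name,
           SUM(CASE WHEN c.passed = 1 THEN 1 ELSE 0 END) * 100.0
               / COUNT(*)              AS validity_pct,
           r.threshold_pct,
           CASE WHEN SUM(CASE WHEN c.passed = 1 THEN 1 ELSE 0 END) * 100.0
                     / COUNT(*) >= r.threshold_pct
                THEN 'conformed' ELSE 'not conformed' END AS rule_status
    FROM dq_rule AS r
    JOIN dq_check_result AS c ON c.rule_id = r.rule_id
    GROUP BY r.rule_id, r.rule_name, r.threshold_pct;

The warehouse would be certified only when every rule reports a validity percentage at or above its threshold.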
Data quality testing: approaches, automated solutions, and real-world experience
Business case 1:
The data is stored in different systems/layers and needs to be verified and validated against predefined data quality dimensions and business rules. Customers rely on it for strategic decision-making, which makes testing it very important.
In this example of a data quality solution, there is a data warehouse with three layers, into which the input data comes from different source systems (Figure 3):
• Staging layer – where data from the source systems is loaded without any transformations
• Logical layer – where the data taken from the staging layer is transformed and aggregated
• Data presentation layer – where data is loaded and can be seen by the end users
The data quality team’s testing activities can be separated into three categories:
1. Data loading from different source systems in the data warehouse staging layer
2. Data loading, transformations and aggregations between the different data warehouse layers
3. Providing data from the different data warehouse layers to the business departments or external customers
Quality testing approach: The general concept of the applied data quality testing is a comparison between the data in two or more systems/layers, where one or more systems/layers are the source and one or more are the destination. For the purpose of testing, rules measuring the data quality dimensions are defined and agreed upon. Additionally, a set of documents is provided – data flow documentation, technical documentation, business requirements, etc.
When data quality must be assured between a source system and the DWH staging layer, ETL tools are used to extract the data from both sides in parallel and load it into the environment where the data quality checks are performed. After that, based on the provided documentation and the agreed rules, the data loaded from the source system is compared with the data loaded from the DWH staging area. However, ETL tools are not necessary when validating data between the different layers of the DWH, because both the source and the destination are located in the data warehouse. In that case, a comparison of the data based only on the business requirements and analysis of the data quality dimensions is applied.
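When both sides live in the same database, such a comparison can be as simple as a set difference between the two layers. The sketch below, with assumed schema and table names, returns records that are present in the staging layer but missing or different in the logical layer, for columns expected to pass through unchanged:

    -- Records present in the staging layer but missing (or different)
    -- in the logical layer; the reverse check is run the same way
    -- with the two queries swapped.
    SELECT invoice_id, customer_id, amount
    FROM staging.invoices
    EXCEPT
    SELECT invoice_id, customer_id, amount
    FROM logical_layer.invoices;

An empty result in both directions indicates that the two layers are consistent for the compared columns.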
Development and maintenance of an internal data quality framework: An internal data quality framework was developed to support the process of data quality testing and to supply business departments with daily deviation reports. The framework provides the following functionality (a sketch of the metadata that could drive such measures follows the list):
a) creation of automated measures between two or more systems/layers
b) configuration of the desired regularity of the created measures (daily, weekly, monthly, etc.)
c) configuration of the desired date range for the measure calculation
d) automated generation of reports containing the results of the automated measurements
e) automated sending of the reports by e-mail
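One possible shape for the measure configuration metadata mentioned above, assuming a single SQL Server table drives the framework; all names and columns are illustrative, not the actual implementation.

    -- Hypothetical metadata table driving the automated measures:
    -- one row per measure, describing what to compare and how often.
    CREATE TABLE dq_measure_config (
        measure_id      INT IDENTITY(1, 1) PRIMARY KEY,
        measure_name    NVARCHAR(200) NOT NULL, -- e.g. 'payments count: source vs. staging'
        source_query    NVARCHAR(MAX) NOT NULL, -- query run against the source system
        target_query    NVARCHAR(MAX) NOT NULL, -- query run against the DWH layer
        schedule        NVARCHAR(20)  NOT NULL, -- 'daily', 'weekly', 'monthly'
        date_range_days INT           NOT NULL, -- how far back the measure looks
        recipients      NVARCHAR(MAX) NOT NULL  -- e-mail list for the deviation report
    );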
Some examples of useful automated measures created with the data quality framework are presented below:
- Daily measures of payment counts (for the previous day) between the source system and the staging layer
- Daily measures of invoice counts (for the previous day) between the source system and the staging layer
After the calculation of the above-mentioned measures, a report with the results is generated and sent to a predefined e-mail list (Figure 4).
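A single measure of this kind boils down to two counts and a comparison. A minimal sketch in T-SQL, assuming both data sets have already been loaded into the data quality environment under illustrative schema names:

    -- Daily payments-count measure for the previous day: compare the
    -- count loaded from the source system with the staging layer's count.
    DECLARE @business_date DATE = DATEADD(DAY, -1, CAST(GETDATE() AS DATE));

    SELECT @business_date AS business_date,
           (SELECT COUNT(*) FROM source_system.payments
            WHERE payment_date = @business_date) AS source_count,
           (SELECT COUNT(*) FROM staging.payments
            WHERE payment_date = @business_date) AS staging_count;

Any deviation between the two counts would then be flagged in the daily report.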
In the given example, the technologies used to build the data quality framework are Microsoft SQL Server Integration Services, Microsoft SQL Server Reporting Services, Microsoft SQL Server, and C#. Integration Services is used to extract, transform, and load data from the source and destination systems. Reporting Services, combined with C#, is used for report generation and for the automated sending of the reports by e-mail. SQL Server stores the results of the automated measure calculations and the metadata for the measure configuration.
Business case 2:
Assuring data correctness and quality when an application deals with voluminous data coming from different external data sources is an even more challenging task. This is exactly the real-life business case described next.
The business case is a dashboard displaying reports that contain important historical data, statistics, and forecast metrics. The data displayed in the reports comes from external sources; it is extracted, transformed according to predefined business rules, and finally loaded to the appropriate destination by several ETL jobs. The challenge in assuring the quality and correctness of the data comes from the fact that these ETL jobs deal with large amounts of data that must be verified against the business requirements.
Quality testing approach: Different tools were used to implement an automated and fast solution to the problem. The data model and business rules were defined in IBM InfoSphere Data Architect (IDA). With the help of IDA, SQL scripts and expected test data were generated based on the predefined rules and transformations. Furthermore, automated jobs were developed using IBM InfoSphere DataStage and QualityStage Designer and IBM InfoSphere Design Studio that quickly compared the data produced by the ETLs with the generated expected test data. The main purpose of those automated jobs is to validate the correct processing of the developed ETL jobs and to check whether the produced data matches the expected data. Creating the automated jobs helped find data defects earlier in the project lifecycle and made checking the correctness and quality of the data easier. Errors resulting from wrong implementation of the business mapping or the data model were also caught. This saved time and effort when validating large amounts of data and significantly decreased the effort needed for retesting after changes in the business transformations.
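The core of such a comparison job can again be expressed as a two-way set difference between the produced and the expected data sets. A minimal sketch, assuming both sets are available as tables with the same structure and illustrative names:

    -- Two-way comparison between the ETL output and the expected test data.
    -- Rows produced by the ETL that were not expected:
    (SELECT metric_id, region, forecast_value FROM etl_output
     EXCEPT
     SELECT metric_id, region, forecast_value FROM expected_data)
    UNION ALL
    -- Expected rows that are missing from the ETL output:
    (SELECT metric_id, region, forecast_value FROM expected_data
     EXCEPT
     SELECT metric_id, region, forecast_value FROM etl_output);

An empty result indicates that, for the compared columns, the ETL jobs produced exactly the expected data.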
Data quality is a fast-growing area of IT. Data quality assurance (DQA) is the process of verifying the reliability, quality, and effectiveness of data. Ensuring good data quality is not a one-off effort; it takes continuous process management to be successful. The companies that have integrated the data quality process into their business already feel the difference. Be a part of this and enjoy your trusted data!
References
1. David Loshin, Enterprise Knowledge Management: The Data Quality Approach, Elsevier Science & Technology, 31 January 2001