Much has been said about what data mining might deliver to users: But what about using this technology to improve the quality of data itself?
by Kamran Parsaye and Mark Chignell
Almost all decisions rely on data in some way. The quality of the data affects the quality of the decisions - at times dramatically so. As the size and complexity of databases grow, two things begin to happen:
In our experience, most large databases are riddled with errors and inconsistencies. However, in some cases, there is often a cavalier disregard for data quality, with people assuming that if large amounts of data have been collected, it must be good. All too often this is wishful thinking, because poor-quality data has a way of staying undiscovered especially when insufficient efforts are made to identify and root it out.
Data is the foundation of knowledge, and it must be mined and distilled very carefully. Thus data quality is a key element of intelligent databases. An effective data quality program is a must in any large database application. A data quality program requires:
Almost any large database will have some data problems. This issue becomes especially critical when data from multiple sources that relate to each other are maintained by different departments or divisions. An intelligent database quality management system solves most data quality problems. An intelligent database system for data quality management is made up of three different components: basic quality enforcement, rule-based constraints, and automatic error detection.
The basic quality system checks for type consistency, range validity, and so on. The rule-based constraint system applies constraints known and supplied by the user to the data. The automatic anomaly detection system discovers more rules than the user could be aware of and finds unusual patterns by using these discoveries.
The basic quality enforcement component handles the bread-and-butter quality problems that must be faced whenever an organization has a few departments with interrelated databases. Some of the simpler problems that should be monitored include:
This is just the tip of the iceberg, and many more types of errors are possible. A good quality program must address all of these issues.
In database applications of the 1970s and 1980s, integrity was encoded in terms of procedural programs when the application was developed. Often, integrity constraints were "spread over" a program: Each constraint was coded as a specific set of instructions. As the constraints (inevitably) change, the program must be "reopened" for further surgery, which can cause other serious problems.
Although most constraints take the form of "should not be allowed" rules, their procedural implementation in this old-style approach is often not obvious since the rules have to be "coded" into procedures. Sometimes the constraints are not those that you want since their meaning is lost in the translation to procedures.
The rule-based constraints approach allows comprehensive integrity constraints to be expressed much more easily in terms of IF/THEN statements. This approach allows the user to change the integrity constraints without changing the application program. Moreover, the systems can automatically check for anomalies and deviations in values, based on a set of criteria provided by the user.
Thus in the modern approach, constraints are expressed in terms of a set of rules that are "separate" from the program. Then, as the constraints change, the rules are changed without having to reopen the program. This procedure has two advantages:
With rules, you may also enter a constraint such as the one that follows:
Confidence = 99%
IF
DATE_OF_SALE > 1991
and
TRUCKER = Transway
THEN
Plant = Cincinnati
This constraint would then be checked against the data. Any instances in which the date of sale was after 1991, the trucker was Transway, but the plant was not Cincinnati would be flagged as anomalies, although this situation may be tolerated one percent of the time.
Using constraints, anomaly detection may be based on rules that are input by the user. If the rule in a constraint is completely specified, then for each record that satisfies the IF condition, anomaly detection can check whether the THEN part is also satisfied. Anomaly detection will then report all the records that satisfy the IF condition but fail the THEN part. For example, the following constraint:
IF
Department = "SALES"
THEN
Salary > 30,000;
states that the minimum salary in the sales department is $30,000. If the database contains a record showing that a person works for the sales department but only has a salary of $25,000, this record will be caught and reported by anomaly detection. This type of checking is useful when you know the relationships among database fields, and these relationships can be represented as rules.
The basic way in which rulebased constraints are applied is shown in Figure 1. Either a human or an automatic anomaly detection system provides rules that should be enforced in the database. The rules are then applied to the data, and the errors are signaled.
Errors often manifest themselves as anomalies; that is, exceptions to expected patterns. Anomalies occur for many reasons. However, in most large databases, a fair proportion of the anomalies will be due to errors in the data.
Automatic anomaly detection also provides a set of statistical features for measuring data quality. For instance, deviations from the standard and from typical correlations are detected. The system thus relies on both statistical and deductive methods for testing hypotheses provided by the user and for enforcing integrity constraints. However, the key contribution here often comes from automatic discovery, which automatically generates hypotheses. This process is shown in Figure 2.
Anomalies are the exceptions to rules. Say that if a rule is correct 99 times out of 100, then a case that doesn't obey the rule is surprising and possibly anomalous. Using this principle, anomaly detection is an intelligent database tool that automatically analyzes databases and finds anomalous data items and errors.
Anomalies are the flip side of rules. Anomalous data contradicts expectations. In general, we can distinguish between different types of anomalies:
For one reason or other, anomalies are almost always of interest. Thus, anomaly detection is a useful form of information discovery that supplements rule discovery. In practice, anomaly detection follows from rule discovery. Once a set of rules has been extracted from a large database, the rules can then be applied to the same data (or to new data) to see which data is anomalous.
Consider the problems at a large corporation such as Global Motors Corp. Global Motors makes and sells about 1,000,000 automobiles yearly through a network of several thousand dealers. Parts for these automobiles are manufactured at company plants and by outside suppliers, assembled at various locations, and the finished products are shipped to thousands of dealers by several trucking firms. All of this data is kept in various databases. The company maintains many databases, consisting of customer, dealer, shipper, plant, and product information. There are so many of these databases and they are so large that no single person at Global Motors could possibly know what they contain. At times these databases contain errors and anomalies that no one can detect. These errors have a high cost. Still, some of these problems may be uncovered.
Not surprisingly, Global's large databases contain errors, typos, and incorrect entries, which cause problems with shipping, receiving, and data analysis. By using anomaly detection, however, the company can discover errors hidden in these databases. For example, anomaly detection may point out the following as anomalies, or unusual cases:
Anomaly detection may also find the following errors in the database:
Using an anomaly detection tool, you may set thresholds for the level of error tolerated by anomaly detection or use a built-in tolerance. For instance, you may specify the tolerance level of 90, which means the system will find less anomalies, or a tolerance level of 10, which means that more anomalies will be found.
Just as discovery parameters can be set in rule generation, anomalv detection allows you to specify the tolerance level of errors to be discovered. The higher the tolerance level is, the fewer anomalies will be found.
Tolerance level can be represented as values in the range of 0 to 100. Tolerance level 100 means all data values are tolerable and, therefore, no anomaly will be reported by anomaly detection. Tolerance level 0 means you do not tolerate any discrepancies in the data values. In most cases, if you specify tolerance level 0, you will get all the data values reported. A reasonable default tolerance level is 80.
For scalar analysis, anomaly detection calculates the average value for scalar fields, separating the data values into different value ranges according to the statistical analysis and screening out those values that fall beyond the tolerance level.
For nonscalar analysis (that is, using category variables such as type of occupation or job), anomaly detection uses the occurrence frequencies of values to determine the anomalies. If a value occurs unusually often or unusually seldom, then it is possibly an error. Unlike scalar analysis, where anomaly detection uses values of scalar fields in order to perform statistical analysis, anomaly detection performs statistical analysis on the occurrence frequencies of the values for each field. The field can be either scalar or nonscalar. For each field, anomaly detection calculates the average frequency, separates data values according to different frequency ranges according to the statistical analysis, and screens out those values whose occurrence frequencies fall beyond the tolerance level.
From a methodological point of view, when an organization does have data quality problems (which is more often that not), it is tempting to guess which method (that is, basic quality enforcement, constraints, or automatic error detection) is most important or practical at first.
Our experience indicates that often the best course of action to begin with is using a moderate dose of each component, followed by further analysis to determine the key areas of concern. By neglecting any component initially, you may be overlooking the most important problem. The third component, say, automatic error detection, is particularly important since errors are usually where we don't expect them to be.
As databases grow, organizations typically institute a data quality program using one of two paths:
Those who choose the second option generally win.
Intelligent database applications for data quality control are essential in all database applications, ensuring the integrity of the data on which critical decisions are based.
(This article originally appeared in the May 1993 issue of AI Expert, a Miller Freeman publication.)
Kamran Parsaye is chairman of lntelligenceWare Inc., a database tools company in Los Angeles. Mark Chignell is an assistant professor at the University of Toronto in Ontario, Canada.
References
1. Parsaye, K., and M. Chignell. Intelligent Database Tools and Applications. John Wiley and Sons Inc., 1993.
2. Parsaye, K., M. Chignell, et al. Intelligent Databases: Object Oriented, Deductive Hypermedia Technologies. John Wiley and Sons Inc., 1989.
3. Parsaye, K., and M. Chignell. "Information Made Visual Using HyperData." AI Expert, September 1992.