Much has been said about what data mining might deliver to users: But what about using this technology to improve the quality of data itself?

Quality Unbound

by Kamran Parsaye and Mark Chignell

Almost all decisions rely on data in some way. The quality of the data affects the quality of the decisions - at times dramatically so. As the size and complexity of databases grow, two things begin to happen:

  • We rely more and more on the database, at times becoming fully dependent on it.
  • The number of errors in the data increases, and the errors become harder to trace.

    In our experience, most large databases are riddled with errors and inconsistencies. However, in some cases, there is often a cavalier disregard for data quality, with people assuming that if large amounts of data have been collected, it must be good. All too often this is wishful thinking, because poor-quality data has a way of staying undiscovered especially when insufficient efforts are made to identify and root it out.

    Data is the foundation of knowledge, and it must be mined and distilled very carefully. Thus data quality is a key element of intelligent databases. An effective data quality program is a must in any large database application. A data quality program requires:

  • Continuous data quality audits with appropriate tools
  • Cleanup and restructuring of data schemas
  • Standards and software for maintaining data quality in a continuous manner.

    Almost any large database will have some data problems. This issue becomes especially critical when data from multiple sources that relate to each other are maintained by different departments or divisions. An intelligent database quality management system solves most data quality problems. An intelligent database system for data quality management is made up of three different components: basic quality enforcement, rule-based constraints, and automatic error detection.

    The basic quality system checks for type consistency, range validity, and so on. The rule-based constraint system applies constraints known and supplied by the user to the data. The automatic anomaly detection system discovers more rules than the user could be aware of and finds unusual patterns by using these discoveries.

    Basic Quality Enforcement

    The basic quality enforcement component handles the bread-and-butter quality problems that must be faced whenever an organization has a few departments with interrelated databases. Some of the simpler problems that should be monitored include:

  • Inconsistent file naming. The file names are sometimes used to indicate the file content, but may sometimes be named arbitrarily, causing confusion. Even worse, programs may rely on the file names to access appropriate data.
  • Inconsistent field lengths. Field lengths in different files may be different.
  • Inconsistent field orders. Different files may have different orderings for the fields, making the system nonrelational.
  • Inconsistent descriptions. The documentation that describes fields and files may be inconsistent or misleading.
  • Incomplete entries. The database may contain many missing entries; even worse, unknown values may not be distinguishable from known but zero values.
  • Inconsistent identities. Usage of levels assigned to a category attribute may be inconsistent (for example, the same product may sometimes be referred to as a cordless phone but at other times may be referred to as a portable phone).
  • Inconsistent value assignment. Assignment of values to numerical attributes may be inconsistent (for example, a ZIP code may be represented as either five or nine digits; phone numbers may or may not include the area code).

    This is just the tip of the iceberg, and many more types of errors are possible. A good quality program must address all of these issues.

    Integrity Constraint

    In database applications of the 1970s and 1980s, integrity was encoded in terms of procedural programs when the application was developed. Often, integrity constraints were "spread over" a program: Each constraint was coded as a specific set of instructions. As the constraints (inevitably) change, the program must be "reopened" for further surgery, which can cause other serious problems.

    Although most constraints take the form of "should not be allowed" rules, their procedural implementation in this old-style approach is often not obvious since the rules have to be "coded" into procedures. Sometimes the constraints are not those that you want since their meaning is lost in the translation to procedures.

    The rule-based constraints approach allows comprehensive integrity constraints to be expressed much more easily in terms of IF/THEN statements. This approach allows the user to change the integrity constraints without changing the application program. Moreover, the systems can automatically check for anomalies and deviations in values, based on a set of criteria provided by the user.

    Thus in the modern approach, constraints are expressed in terms of a set of rules that are "separate" from the program. Then, as the constraints change, the rules are changed without having to reopen the program. This procedure has two advantages:

  • The chance that the constraints are accurate increases.
  • The ability to change the constraints improves.

    With rules, you may also enter a constraint such as the one that follows:

    Confidence = 99%
    IF
        DATE_OF_SALE > 1991
    and
        TRUCKER = Transway
    THEN
        Plant = Cincinnati
    

    This constraint would then be checked against the data. Any instances in which the date of sale was after 1991, the trucker was Transway, but the plant was not Cincinnati would be flagged as anomalies, although this situation may be tolerated one percent of the time.

    Using constraints, anomaly detection may be based on rules that are input by the user. If the rule in a constraint is completely specified, then for each record that satisfies the IF condition, anomaly detection can check whether the THEN part is also satisfied. Anomaly detection will then report all the records that satisfy the IF condition but fail the THEN part. For example, the following constraint:

    IF
        Department = "SALES"
    THEN
        Salary > 30,000;
    

    states that the minimum salary in the sales department is $30,000. If the database contains a record showing that a person works for the sales department but only has a salary of $25,000, this record will be caught and reported by anomaly detection. This type of checking is useful when you know the relationships among database fields, and these relationships can be represented as rules.

    The basic way in which rulebased constraints are applied is shown in Figure 1. Either a human or an automatic anomaly detection system provides rules that should be enforced in the database. The rules are then applied to the data, and the errors are signaled.

    Automatic Error Detection

    Errors often manifest themselves as anomalies; that is, exceptions to expected patterns. Anomalies occur for many reasons. However, in most large databases, a fair proportion of the anomalies will be due to errors in the data.

    Automatic anomaly detection also provides a set of statistical features for measuring data quality. For instance, deviations from the standard and from typical correlations are detected. The system thus relies on both statistical and deductive methods for testing hypotheses provided by the user and for enforcing integrity constraints. However, the key contribution here often comes from automatic discovery, which automatically generates hypotheses. This process is shown in Figure 2.

    Anomalies are the exceptions to rules. Say that if a rule is correct 99 times out of 100, then a case that doesn't obey the rule is surprising and possibly anomalous. Using this principle, anomaly detection is an intelligent database tool that automatically analyzes databases and finds anomalous data items and errors.

    Anomalies are the flip side of rules. Anomalous data contradicts expectations. In general, we can distinguish between different types of anomalies:

  • An anomaly that indicates an error in the data
  • An anomaly that indicates a rule is faulty
  • An anomaly that indicates an interesting data point.

    For one reason or other, anomalies are almost always of interest. Thus, anomaly detection is a useful form of information discovery that supplements rule discovery. In practice, anomaly detection follows from rule discovery. Once a set of rules has been extracted from a large database, the rules can then be applied to the same data (or to new data) to see which data is anomalous.

    Consider the problems at a large corporation such as Global Motors Corp. Global Motors makes and sells about 1,000,000 automobiles yearly through a network of several thousand dealers. Parts for these automobiles are manufactured at company plants and by outside suppliers, assembled at various locations, and the finished products are shipped to thousands of dealers by several trucking firms. All of this data is kept in various databases. The company maintains many databases, consisting of customer, dealer, shipper, plant, and product information. There are so many of these databases and they are so large that no single person at Global Motors could possibly know what they contain. At times these databases contain errors and anomalies that no one can detect. These errors have a high cost. Still, some of these problems may be uncovered.

    Not surprisingly, Global's large databases contain errors, typos, and incorrect entries, which cause problems with shipping, receiving, and data analysis. By using anomaly detection, however, the company can discover errors hidden in these databases. For example, anomaly detection may point out the following as anomalies, or unusual cases:

  • One dealer has a very high percentage of cars returned with battery problems.
  • One dealer has never received a single complaint (perhaps because the dealer is a fictitious database entry).

    Anomaly detection may also find the following errors in the database:

  • A trucking company is transporting luxury cars, which it is not authorized to do.
  • The trucker picking up some cars is not the same as the trucker delivering them.

    Using an anomaly detection tool, you may set thresholds for the level of error tolerated by anomaly detection or use a built-in tolerance. For instance, you may specify the tolerance level of 90, which means the system will find less anomalies, or a tolerance level of 10, which means that more anomalies will be found.

    Just as discovery parameters can be set in rule generation, anomalv detection allows you to specify the tolerance level of errors to be discovered. The higher the tolerance level is, the fewer anomalies will be found.

    Tolerance level can be represented as values in the range of 0 to 100. Tolerance level 100 means all data values are tolerable and, therefore, no anomaly will be reported by anomaly detection. Tolerance level 0 means you do not tolerate any discrepancies in the data values. In most cases, if you specify tolerance level 0, you will get all the data values reported. A reasonable default tolerance level is 80.

    For scalar analysis, anomaly detection calculates the average value for scalar fields, separating the data values into different value ranges according to the statistical analysis and screening out those values that fall beyond the tolerance level.

    For nonscalar analysis (that is, using category variables such as type of occupation or job), anomaly detection uses the occurrence frequencies of values to determine the anomalies. If a value occurs unusually often or unusually seldom, then it is possibly an error. Unlike scalar analysis, where anomaly detection uses values of scalar fields in order to perform statistical analysis, anomaly detection performs statistical analysis on the occurrence frequencies of the values for each field. The field can be either scalar or nonscalar. For each field, anomaly detection calculates the average frequency, separates data values according to different frequency ranges according to the statistical analysis, and screens out those values whose occurrence frequencies fall beyond the tolerance level.

    Which Tool is Best?

    From a methodological point of view, when an organization does have data quality problems (which is more often that not), it is tempting to guess which method (that is, basic quality enforcement, constraints, or automatic error detection) is most important or practical at first.

    Our experience indicates that often the best course of action to begin with is using a moderate dose of each component, followed by further analysis to determine the key areas of concern. By neglecting any component initially, you may be overlooking the most important problem. The third component, say, automatic error detection, is particularly important since errors are usually where we don't expect them to be.

    Two Paths

    As databases grow, organizations typically institute a data quality program using one of two paths:

  • Do nothing and wait until disaster strikes, then discover the importance of data quality.
  • Use data quality as a key element of a database implementation effort.

    Those who choose the second option generally win.

    Intelligent database applications for data quality control are essential in all database applications, ensuring the integrity of the data on which critical decisions are based.

    (This article originally appeared in the May 1993 issue of AI Expert, a Miller Freeman publication.)

    Kamran Parsaye is chairman of lntelligenceWare Inc., a database tools company in Los Angeles. Mark Chignell is an assistant professor at the University of Toronto in Ontario, Canada.


    References

    1. Parsaye, K., and M. Chignell. Intelligent Database Tools and Applications. John Wiley and Sons Inc., 1993.

    2. Parsaye, K., M. Chignell, et al. Intelligent Databases: Object Oriented, Deductive Hypermedia Technologies. John Wiley and Sons Inc., 1989.

    3. Parsaye, K., and M. Chignell. "Information Made Visual Using HyperData." AI Expert, September 1992.