by Lisa Lewinson, PC AI, November/December 1993
Different data mining tools can help business in different ways. To find out exactly how each one applies, I spoke to a number of tool vendors. In order to inject concreteness into our conversations, I asked each one how his or her organization's product could help a particular business. To keep things comprehensible, I asked them to consider a simple business operation - a fictional neighborhood lemonade stand.
Imagine a group of neighborhood children who call themselves the Sidewalk Drink Company (SDC). SDC ran two lemonade stands this summer, and they ran these two stands last summer as well. Like many modern businesses (but unlike most neighborhood lemonade stands) they kept a database, amassing records on 4,000 transactions over the 182 days they operated.
These young entrepreneurs are as observant as they are industrious. They've tracked weather conditions, customer transportation, customer age range, and a number of other variables. They've left coupons in mailboxes according to their perception of the resident's wealth. Figure 1 shows all the variables in their database.
The kids of SDC are ambitious. They want to run their lemonade stands next summer, and they want to use the data in their database to help them plan ahead. How can data mining tools - intelligent software packages that find hidden patterns in data - help SDC use their data to predict total daily sales? Vendors of these tools looked at SDC's variables and graciously consented to tell them.
Neural networks are designed for pattern recognition. Hence, they're a popular class of data mining tools. Here are the network makers who offered suggestions for SDC.
Talon
Talon Development Corporation produces @Brain. Talon President Mike Staub suggests that SDC analysts build two backpropagation neural nets, one for each lemonade stand. "With 182 examples (from the two summers of sales), they are probably undersampled, so they should run a number of neural nets to get good results," he says. He recommends that they winnow down the number of inputs to ten, using Lotus 1-2-3 or Excel to prepare their data.
For training, he suggests that SDC partition their data, using 10% as a test set while training and another 10% as an "out of sample validation set." He recommends that they use @Brain's Best Test command, which stops the training when the test data has minimum error with no human intervention. He points out that test data becomes biased when you use Best Test to terminate training, which is why he recommends a third data set.
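The three-way split Staub describes can be sketched in a few lines of Python. This is only an illustration, assuming 182 daily records held in a plain list; the function names `partition` and `best_test_epoch` and the fixed random seed are hypothetical and are not part of @Brain.

```python
import random

def partition(records, test_frac=0.10, valid_frac=0.10, seed=0):
    """Shuffle the records and split them into training, test, and
    out-of-sample validation lists (fractions are assumptions)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_valid = round(len(shuffled) * valid_frac)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, test, valid

def best_test_epoch(test_errors):
    """Return the index of the training pass with the lowest test error.
    Because the test set picks the stopping point, it is no longer an
    unbiased judge, which is why the validation set is held back."""
    return min(range(len(test_errors)), key=test_errors.__getitem__)

train, test, valid = partition(range(182))
print(len(train), len(test), len(valid))  # 146 18 18
```

The validation list is touched only once, after training has stopped, to estimate how the chosen network behaves on data it has never influenced.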
"Building a neural net is a development process," Staub says. "SDC may have to make changes to their data, take ratios of numbers, or cluster attributes, all of which are done easily in a spreadsheet." During training, @Brain automatically sets the number of hidden layers of neurons as it analyzes the input. After they've run their first training set (with the ten winnowed-down inputs), Staub recommends that they winnow down the number of inputs again. "@Brain gives the relative influence of the inputs," he says. "The SDC analysts should remove the inputs that had an order of magnitude less influence over the results, then retrain the neural network."
Staub points out that with @Brain, every iteration through the training data is better than the iteration before. Thus, each training session reaches an optimum point. Once the kids have decided upon the optimal neural network for each lemonade stand, they can then run the network in real time with what-if questions about total daily sales. They do this by changing elements in the spreadsheet via the @Function. "They can plug in temperature, weather condition, salesperson, or any combination of facts in the input variables and see what the outcome will be," says Staub. "Given a predicted weather forecast, they can decide if they should sleep in late the next day."
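Staub's advice to drop inputs with an order of magnitude less influence can be expressed as a simple filter. The sketch below assumes the relative influences have been copied out of the tool into a dictionary and that "an order of magnitude less" is measured against the strongest input; both are assumptions, and the influence scores shown are made up for illustration.

```python
def prune_inputs(influence, factor=10.0):
    """Keep inputs whose influence is within `factor` of the strongest one."""
    strongest = max(influence.values())
    return [name for name, score in influence.items()
            if score * factor > strongest]

# Hypothetical influence scores, not output from any real training run.
influence = {"temperature": 0.40, "weekday": 0.25, "stand_color": 0.03}
print(prune_inputs(influence))  # ['temperature', 'weekday']
```

The network would then be retrained on the surviving inputs only.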
California Scientific Software
Jeannette Lawrence is the General Manager of California Scientific Software, the producers of BrainMaker. Reviewing SDC's variable list, she advises: "Asking for total daily sales is viable, but a number of other analyses are possible. We could look at how much a customer will buy. Right now, this is limited by the fact that we have no information on who didn't buy.
"If the SDC analysts access new demographic categories - like income and education - through the zip code, we have the start of a good marketing design. Another good prediction is to look at how promotions affect sales. Total daily sales amount would be the output of the neural net, and each promotion type would be a separate input. If there were five promotions, the SDC analysts would run a BrainMaker sensitivity analysis to determine the effects of inputs on outputs. They could also ask when to open the stand or which 'refreshment consultant' will attract the most sales."
Lawrence explains that to predict total daily sales, SDC would have to collate the records into daily totals before giving the data to BrainMaker. BrainMaker would do the rest of the data transformation, like turning day-of-the-week into a symbol and adding past sales data as input.
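Collating transactions into daily totals is a small job in any language. Here is a minimal Python sketch, assuming each transaction is a (date, stand, amount in cents) tuple; that layout is a guess for illustration, not SDC's actual database schema.

```python
from collections import defaultdict

def daily_totals(transactions):
    """Sum sale amounts (in cents) for each (day, stand) pair."""
    totals = defaultdict(int)
    for day, stand, cents in transactions:
        totals[(day, stand)] += cents
    return dict(totals)

# Made-up sample records.
transactions = [
    ("1993-07-04", "A", 50),
    ("1993-07-04", "A", 25),
    ("1993-07-04", "B", 40),
    ("1993-07-05", "A", 30),
]
print(daily_totals(transactions))
# {('1993-07-04', 'A'): 75, ('1993-07-04', 'B'): 40, ('1993-07-05', 'A'): 30}
```

Each resulting row, joined with that day's weather and promotion data, would become one training fact.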
BrainMaker asks the user how much to take out as a test set. "The default is every 10th fact," says Lawrence, "and we suggest shuffling the data before this is done. With BrainMaker, they can tune the learning rate and work with the weights of each connection node. They can ask for a specific degree of accuracy. I would suggest starting with about 85% accuracy, which means that 16 out of 18 facts will be correct. One hundred eighty-two days of data should be sufficient, but if the training does fail, we would suggest reducing the number of hidden neurons and loosening the tolerance [which means the net will not train as deeply]."
Once the network is trained, the kids can look at the results through a four-color contour plot or a sensitivity file which shows the effect of each input or output. They can analyze all the inputs and all the facts via a report which identifies the attributes which had the greatest overall effect on total daily sales, and the ones with the least overall effect. The report also shows the average effect, the standard deviation, and a number of other statistics.
"If the SDC kids are considering changing the price of the lemonade based on daily temperature," says Lawrence, "the contour plot can show them which value produces the largest gross income. The sensitivity file can tell them the most important factor that affects the sale amount as well as the least important factor."
Lawrence advises SDC to run a variety of neural networks and rate them against each other. Running approximately ten neural nets, she predicts, would give SDC the ability to accurately forecast the next day's sales and would provide a guideline for the best marketing campaigns.
HNC
HNC created the DataBase Mining Workstation. Applications Development Director Bruce Harris points out that the SDC data covers two problems: target marketing and retail sales forecasting. "One of the most beneficial things they can do," he says, "is to expand and link the data into other data sources and enrich the data as much as they can."
To upgrade the quality of their data, Harris recommends that the SDC kids use a specific customer age, even if it's a judgement. He also suggests adding in whether the date is a weekday or weekend, an important factor in retail sales. He thinks the kids should use the four-digit zip code extension, which allows them to get into CensusTrack and other services.
"Add more personal data on the customer," Harris advises. "Include dress - suits, playclothes, sweats. With this kind of database, you should treat people as individuals."
When the data is ready, Harris explains that a Data Set Module removes test records, maps data from the original database, and assigns types. Once the data is loaded into the DataBase Mining Workstation (DMW), Harris recommends using the Relationship Discovery tool, which measures all non-random relationships between all elements in the database. For example, he says, the SDC analysts can determine which variables are strongly related to rollerblades. They can look at these two-way relationships through scatterplots, bar charts, or 3D surface plots.
The SDC analysts would build their neural nets using the Modeler tool. The Modeler automatically builds a half-dozen neural nets and identifies the one with the best characteristics. They would then look at this net's distribution of errors. For example, the net may yield accurate information on high-volume sales or low-volume sales but may not give good information on the middle volumes. According to Harris, this might mean that high-volume sales occur on weekends and are more predictive. Often, Harris notes, inaccuracies in one part of a range of data reveal inadequate information.
The kids would use the DMW's Sensitivity Analysis tool to determine each fact's sensitivity. "Sensitivity Analysis detects each fact's power to predict," Harris explains, "and even finds unusual interactions such as weekend vs. weekday bicycle customers, who may have totally different buying patterns."
Because sensitivity analysis goes through every variable of every record, it is computationally intensive. To speed up the process, HNC offers its own Balboa 860 board.
The Automatic Variable Selection Module analyzes two-way relationships and finds highly related elements that may cluster together. "For instance," says Harris, "the kids may find that bikes, rollerblades, and foot traffic are closely related. If the same kid is always at the same stand, they may find that stand and salesperson are correlated, which means that you don't need both attributes. Too much information with the same content clouds the picture. Using the DMW's tools, you can reduce the number of variables by more than half. Then you rebuild the model and discover it became better because you got rid of noisy data.
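The redundancy Harris describes, where stand and salesperson carry the same information, shows up as a correlation near 1 or -1. The sketch below is a generic Pearson-correlation check in Python, not HNC's module; the 0.95 threshold and the sample columns are assumptions, and a constant column would need special handling because its correlation is undefined.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.95):
    """List pairs of columns whose absolute correlation meets the threshold."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(pearson(columns[a], columns[b])) >= threshold]

# Made-up data: the same kid always works the same stand.
columns = {
    "stand":       [0, 0, 1, 1, 0, 1],
    "salesperson": [0, 0, 1, 1, 0, 1],
    "temperature": [70, 85, 72, 90, 65, 80],
}
print(redundant_pairs(columns))  # [('stand', 'salesperson')]
```

One variable from each flagged pair can be dropped before the model is rebuilt.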
"In order to predict customers accurately, SDC has to know who the non-responders are. I would advise them to change their method of couponing. They should get data on everyone who receives a coupon and make a new target variable of everyone who retums a coupon. The Sensitivity Analysis will be able to show them the kind of person who responds to a coupon, and they'll have to figure out a different campaign to target non-responders. They can also build a retail forecasting model to predict how many lemons to buy."
Scientific Consultant Services
Another marketing approach comes from Jeffrey Owen Katz of Scientific Consultant Services, makers of N-Train. "The first thing to do is convert symbolic data into numeric values," he advises. "For example, 'day of week' may be one through seven. 'Stand type' may be 0 or 1. Each salesperson is given a different input which is assigned a 0 or 1 (one-event coding). More complex massaging of the data would involve overlaying zip codes with actual demographics or constructing ratios among some of the data elements. Then the kids would use a spreadsheet to compute simple statistics, like means, standard deviations, and correlations between input and output variables. This would enable them to weed out variables. A rarely used input variable should be discarded."
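Katz's coding scheme translates directly into code. The Python sketch below assumes a record is a small dictionary; the field names, the stand labels "A" and "B", and the salesperson names are hypothetical.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
SALESPEOPLE = ["Suzie", "Max", "Ana"]  # hypothetical roster

def one_of_n(value, categories):
    """One 0/1 input per category, with a 1 marking the matching category."""
    return [1 if value == c else 0 for c in categories]

def encode(record):
    """Turn a symbolic record into a flat list of numbers."""
    day_number = DAYS.index(record["day"]) + 1         # one through seven
    stand_flag = 1 if record["stand"] == "B" else 0    # 0 or 1
    return [day_number, stand_flag] + one_of_n(record["salesperson"], SALESPEOPLE)

print(encode({"day": "Sat", "stand": "B", "salesperson": "Max"}))
# [6, 1, 0, 1, 0]
```

Giving each salesperson a separate 0/1 input avoids implying an order among them, which a single numbered column would do.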
Katz recommends using N-Train to first build a simple logistic regression model, with 22 inputs and one output. This type of model, he points out, is often used in marketing to create a baseline. After training, the kids would divide the output into ten equally probable intervals (deciles), another common marketing technique. "Look at the content of each range of outputs," he advises, "and exclude the weakest sectors."
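The decile step can be sketched as follows. This Python example assumes a list of model outputs paired with actual outcomes and at least ten records; the data is synthetic and the function name `decile_table` is illustrative, not an N-Train feature.

```python
def decile_table(scores, outcomes):
    """Rank records by model output (highest first), cut them into ten
    equal-count groups, and return the mean actual outcome per group."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    table = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        table.append(sum(outcome for _, outcome in chunk) / len(chunk))
    return table

# Synthetic example: higher scores really do mean more responses.
scores = [i / 20 for i in range(20)]
outcomes = [1 if i >= 10 else 0 for i in range(20)]
print(decile_table(scores, outcomes))
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Deciles at the bottom of the table are the "weakest sectors" that a marketer would exclude.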
After building the logistic regression model, the SDC executives should build a non-linear model, says Katz. "For this neural net, I would introduce a sigmoid middle layer. Since there is a small amount of data, I would start small, with 22 inputs and ten middle layer neurons. Another way to overcome the small size of the dataset is to use OptiTrain, which allows the user to save the best networks generated in a training series." Katz advises SDC to use N-Train's default settings when they first build the net. When they're ready, the software offers finer control of learning rates and transfer functions on a per-layer basis. The software also offers selection of several forms of error functions.
Once the second model is built, Katz says, if it does not perform well because of the small dataset, they might use factor analysis to build a data reduction network. This reduces the input data to a smaller network. Another way to reduce the number of inputs is to use the Relative Contribution module, which looks at each input variable and replaces it with a synthetic variable, then tests to see how much performance declines with the synthetic variable. This module reveals each variable's contribution to the model. Katz recommends that SDC look at all the weights leading from various inputs to second-layer neurons. Small weights from an input suggest that the associated variable is not important.
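The idea behind a relative-contribution test can be illustrated generically. In the Python sketch below, each input column is replaced in turn by its own mean (one possible choice of "synthetic variable", assumed here for illustration) and the rise in mean squared error is recorded. The stand-in model and data are invented, and this is not a description of how N-Train's module is implemented.

```python
def mse(model, rows, targets):
    """Mean squared error of a model (any callable) over the rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def relative_contribution(model, rows, targets):
    """Error increase when each input column is flattened to its mean."""
    base = mse(model, rows, targets)
    increases = []
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / len(rows)
        altered = [r[:j] + [mean_j] + r[j + 1:] for r in rows]
        increases.append(mse(model, altered, targets) - base)
    return increases

# Stand-in "trained model": sales depend on input 0 only.
model = lambda r: 2 * r[0]
rows = [[1, 0], [2, 1], [3, 0], [4, 1]]
targets = [2, 4, 6, 8]
print(relative_contribution(model, rows, targets))  # [5.0, 0.0]
```

An input whose replacement barely changes the error, like the second column here, is a candidate for removal.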
"By going through this process," Katz concludes, "the SDC execs will have a model that shows the best way of increasing sales."
An abductive network, like a neural network, is a set of interconnected nodes. In most types of neural nets, each node performs the same kind of simple computation. In an abductive network, the nodes' computations can differ from one another and become mathematically complex. Abductive modeling is the search for the types of nodes and the architecture of their interconnections that minimize the predictive error in a set of training data. The goal is to build a model that generalizes well to a set of test data.
AbTech Corporation produces AIM (Abductory Inference Mechanism), a tool for building abductive networks. Dr. Keith C. Drake, AbTech's Director of Research and Development, says that many of his customers want to solve huge problems all at once. Instead, he focuses them on smaller sets of problems. He explained that defining their problem correctly is the most challenging task that SDC faces. While predicting total daily sales is certainly an option, a better approach would be to ask, what do we really want to do? He says that the SDC data holds a wealth of information on maximizing sales.
Preprocessing
To use AIM, Drake explained, the SDC kids would prepare their data - making dummy variables out of type columns and amplifying the zip code data - in a spreadsheet of their choice. "After the data is imported into AIM, it is automatically and randomly split into separate training and testing subsets. AIM allows the user to set modeling criteria to fit data more tightly or to generalize more during model-building."
Modeling
Drake recommends that the kids use AIM to build a number of what-if models. "What is the best stand-color? They may find that on cold days, the red stand works better - or there may be no correlation. What promotion type works best within each zip code? The answers to this will reduce their costs and maximize returns on direct mail. Who is the most likely customer on a sunny day? Who is more likely to buy more than one cup of lemonade? Which sales person attracts which kind of customer? If they've been moving the stand around, what is the optimal location? What are the predicted daily sales, or hourly sales? In each case, the overall objective is to run their operation intelligently rather than guess. SDC can use the same database to instantly set up models that answer these questions and others."
During the course of a learning session, AIM starts out with simple models. AIM automatically programs network solutions from examples of input and output variables and learns what minimizes the error between the expected output and the derived output. The tool goes through an iterative process, automatically using as building blocks previous models that worked well. The final model might contain only a subset of the original inputs - the important input variables.
Examining the Model
When the model is built, AIM generates a set of graphical representations. Each graph provides insights as to which variables are important and displays the relationships among variables. AIM can analyze its own performance on an independent database: For each set of input values in the independent database, AIM compares the database outputs to the corresponding outputs computed by the AIM model and calculates a set of evaluation statistics.
"AIM is known for its simplicity and directness," Drake says. "Once the models are built, you can use them to fine-tune just about every aspect of business operations."
Suppose you're in the initial stages of buying a home. You gather detailed information on a number of houses. Some have characteristics which make them seem desirable. Some have characteristics that make you rule them out right away. Others are borderline: You look at the features they offer and you just can't decide whether you'd like any of them or not. Wouldn't it help if you had rules of thumb to decide, on the basis of their characteristics, which house falls into which category?
Rough sets analysis does just that. Such an analysis examines a database, finds patterns and dependencies, and develops if-then rules that explain the patterns and dependencies. The if-then rules can become the foundation of a knowledge base.
The name comes from the theory of rough sets. This theory extends the ideas of classical set theory. In a classical set, the boundary is a line that separates elements inside the set (members) from elements outside the set (non-members). In a rough set, the boundary is broader than a line - it's a region that can contain elements. The larger the boundary region, the "rougher" the set.
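The boundary region has a concrete definition in rough set theory. Objects that look identical on the chosen attributes are grouped together; the lower approximation collects the groups that lie entirely inside the target set, the upper approximation collects the groups that overlap it at all, and the boundary is the difference between the two. The Python sketch below applies this to the house-hunting example, with made-up houses and attributes.

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of `target` with respect to
    the objects' values on `attributes`."""
    groups = defaultdict(set)
    for name, description in objects.items():
        groups[tuple(description[a] for a in attributes)].add(name)
    lower, upper = set(), set()
    for members in groups.values():
        if members <= target:   # group lies wholly inside the target set
            lower |= members
        if members & target:    # group overlaps the target set
            upper |= members
    return lower, upper

# Hypothetical houses and the ones the buyer liked.
houses = {
    "h1": {"garage": "yes", "yard": "big"},
    "h2": {"garage": "yes", "yard": "big"},
    "h3": {"garage": "no",  "yard": "small"},
    "h4": {"garage": "yes", "yard": "small"},
    "h5": {"garage": "yes", "yard": "small"},
}
liked = {"h1", "h2", "h4"}

lower, upper = approximations(houses, ["garage", "yard"], liked)
print(sorted(lower))          # ['h1', 'h2']
print(sorted(upper - lower))  # ['h4', 'h5']
```

Houses h4 and h5 share the same description but only one was liked, so both fall in the boundary: the available attributes cannot decide between them.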
DataLogic/R, a creation of Reduct Systems, Inc. is based on the theory of rough sets. Reduct President Adam Szladow offers a three-phase approach to the Sidewalk Drink Company.
Phase 1: Knowledge Discovery, Data Analysis, Knowledge Validation
DataLogic/R searches for important patterns (rules) that describe how outcomes such as total daily sales or sales per customer depend on conditions such as location, time of day, and promotion type.
This analysis
Figure 2 shows a sample of the rules that SDC might expect to see. Szladow explains that each attribute in the rules is ranked for strength. The SDC kids should notice when more than one rule describes the same outcome (for example, different market segments within the group of customers who spend more than 40 cents) because such rules may help them come up with new, different promotions for each customer group. This is in contrast to using one promotion for all customers who spend more than 40 cents. DataLogic/R's rule report identifies which transactions were used to derive the rules and how many customers fell within each market segment.
A validation report shows the percent of cases predicted correctly, the percent misclassified, and the percent of cases where no decision was possible. Szladow points out that "no decision" is in itself an important answer - it may indicate poor quality data or lack of patterns.
Phase 2: Strategy Development
Szladow advises that SDC study the rules and try to identify the conditions which resulted in high sales. They should also look at the conditions which resulted in low sales. "Just by changing one or two conditions," he advises, "low-volume customers may buy more."
Phase 3: Strategy Testing
DataLogic/R offers expert system facilities which will allow SDC to use the rules to forecast outcomes for different combinations of input variables. The tool's case-based-like reasoning retrieves and summarizes the common features of similar cases in the database.
SDC can decrease or increase rule "roughness," says Szladow, which will give them different levels of knowledge. They can control the level of risk, for instance, deriving marketing rules that have higher risk and bigger payoffs (lower probability of success but larger market segments) or rules that have lower risk and smaller payoffs (higher probability of success but smaller market segments).
His final words for SDC: "Acquire the knowledge, evaluate it, and explore its use."
Universal Process Modeling (UPM) is a proprietary algorithm of TERANET. UPM is built to model systems, and you can apply it to anything that takes a set of inputs and produces a set of outputs. The idea is to give UPM a "reference library" of examples of a system's behavior. Each example consists of values of a number of variables. UPM learns the relationships among the variables. Given a new set of system observations, UPM provides a prediction for all the variables in each new observation.
TERANET produces ModelWare, a package based on the UPM algorithm. According to TERANET Vice President Paul O'Sullivan, the SDC execs should use a spreadsheet of their choice to change their symbolic data into numeric data. "Also, it's not necessary to place the forecasted value in a column," he explains. "We use freeform data, and all the user has to do is load it into the software."
Preprocessing
One of ModelWare's unique features, says O'Sullivan, is that it predicts the results of all variables. "By using the software, you can see immediately if the system is one that can be modeled accurately," he notes. "In addition, if there were any missing data, our Patch tool can fill in missing values intelligently, using the same UPM algorithm that drives the model-building process. Other preprocessing tools remove redundant records and smooth the data."
ModelWare creates test and training (reference) files, allowing the user to split data randomly or sequentially. In Leave One Out modeling mode, O'Sullivan says, you can use a reference file to model itself. Each row provides output, and every value is estimated. "You'd do this," O'Sullivan advises, "to get a feel for the data. That sounds unscientific, and I'm an engineer, but you have to have a fundamental understanding of the problem. With problems such as lemonade stand sales, the underlying mathematical model is usually unknown. Perhaps there is no sensible model to be obtained. By using the reference file to model itself, I may find that some variables don't model very well at all, and that others model very well.
"The SDC execs should also apply our Drivers Tool. Select the column of interest (in this case, total daily sales), and the Drivers Tool reports on the best predictive variables. The results are quantified. The tool may find that if all the variables were used, there would be eight percent error. Using a subset of the variables may drop the error down to two percent. The SDC analysts can also use the Influence Tool, which gives correlations between all pairs of variables. These tools provide a specific view and a general view of how the data is correlated."
Modeling the Test File
UPM ranks all the records in the database of examples by their similarity to the input. The algorithm "generates a similarity coefficient between 0 and 1," says O'Sullivan, "combining the records together to give a predicted result.
"The software runs quickly and is self-optimizing. You can fine-tune the model, but the adjustments won't be drastic. If the model doesn't work with the defaults, it's unlikely to work with the adjustments. You can turn features on which explain the examples that are pulled from the reference library and detail how the examples are combined to produce a result."
Examining the Model
You can print out a model, export it to a spreadsheet, or view it on-line. Color graphics plot trends and deviations.
"If previous models have been run, the user can evaluate the current model by overplotting it with two previous models and comparing the results," O'Sullivan advises. "SDC can look at single data files as well as complete models. They can visually tag a particular record they find useful, and a marker will appear on the screen. Tagged records can be written out to a file. In fact, after the model is built and the SDC analysts feel comfortable with it, they can create a DOS batch file to run the program automatically at the end of each week. They've isolated the lemonade stand phenomena that will help them the most and they can easily access the information throughout the summer."
Data mining tools have become relatively easy to use, if you take the time to learn and experiment with them. They will show you the value of enriching your data, and they can be integral parts of a business operation of any size.
So think ahead to next summer and imagine this scenario: You roll up to a local lemonade stand on your bicycle and redeem an addressed and nine-placed zip-coded coupon flyer for ten cents off a glass of lemonade. Freckle-faced Suzie sits with a laptop, clicking away while her cohort fetches the brew. You catch a peek at the array of variables. Instantly, you know you've been tagged. She's entered your clothing type, sneaker type and brand, transportation type and brand, and made an entry in a special column for extra yuppie gear. For all the moms in the neighborhood, she's got stroller type and brand, designer kids clothes flag and brand, and more.
Suzie tells you that they're considering a market niche of bicycle accessories. And that they'll soon be adding peppermint iced tea to their product mix (it appeals to people your age). They also plan to run their stand in winter to sell hot chocolate. Their database grows each day.
As you ride off into the balmy midday, you wonder - is their cache of data, no matter how rich, going to give them enough product sales to get these techno-kids to college? It's hard to know. But from now on, at the end of every season, they're going to sell their customer list.
Lisa Lewinson is president of Northstar Consulting Services, Inc. (708-786-3922), a Chicago-based consulting firm. Northstar develops user- and developer-friendly intelligent applications and offers seminars