DANIEL H. WAGNER ASSOCIATES, INC

A Leader in Applying Mathematics and Computer Science to Industry

 

Operations Research - Mathematics - Software Development


Data Mining

Our Data Mining algorithms and tools provide extensive, highly automated Knowledge Discovery in Databases (KDD) capabilities.

Their key feature is the ability to automatically determine the sensitivity of results/events to many different factors.  Even perfect knowledge of the environment or other important factors is only useful if we know how sensitive mission success/operational results are to these factors.

Data Mining Knowledge Discovery in Databases (KDD) Tools

Our Data Mining KDD tools provide automated statistical analysis employable in a wide variety of applications, including detailed analysis of initial data mining results.  The software produces answers to several key questions such as:

  • Which factors are most influential and how do they affect the probability of a given event or the number of occurrences of a given event?  Example:  How do age range, gender, education level, and work experience affect the probability that a person’s income will fall into a particular salary range?  What is the most determining factor for women in particular?

  • How do different factors affect the average number of occurrences of a specific event?  Example:  How many traffic accidents (per 5 year period) should we expect from a specific category of insured drivers?  Which factor most heavily affects the expected number of accidents?

In mine warfare we are using this tool to automatically determine which environmental/operational factors each MCM system is most sensitive to and how these factors affect system performance.  We are also using it to automatically determine the best model to predict the probability of detection for each combination of these factors.  However, there are countless situations in which businesses and researchers need answers to these or similar questions.  In general, they have only limited sample data available to estimate the desired probabilities or average number of events.  Unless generated from a large number of test cases, these sample statistics tend to reflect anomalies in the data rather than true probability trends.  For this reason we use advanced statistical methods to smooth the data, including models that extend to cases in which the factors of interest fall into discrete categories.

These methods resemble linear regression but use more sophisticated models, which must be solved and analyzed with more computationally intensive approaches.
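As a rough illustration of what smoothing sparse detection data with a regression-like model can look like, the sketch below fits a one-factor logistic regression by Newton-Raphson. This page does not name the model family the tool uses, so logistic regression is only a familiar stand-in here, and the data values, the names DATA and fit_logistic, and the scaling of CPA to hundreds of yards are all assumptions made for the example.

```python
import math

# Hypothetical sparse test data (not from the tool):
# (CPA bin midpoint in yards, number of test runs, number of detections).
DATA = [(5, 4, 4), (15, 2, 1), (35, 0, 0), (75, 5, 3), (125, 3, 1), (175, 6, 0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, iterations=25):
    """Fit logit(p) = a + b*x by Newton-Raphson on the binomial likelihood.

    x is the CPA midpoint in hundreds of yards (scaled for numerical stability).
    """
    a = b = 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x_yards, n, y in data:
            x = x_yards / 100.0
            p = sigmoid(a + b * x)
            resid = y - n * p          # observed minus expected detections
            w = n * p * (1.0 - p)      # binomial variance weight
            g_a += resid
            g_b += resid * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


a, b = fit_logistic(DATA)
for x_yards, n, y in DATA:
    raw = f"{y / n:.2f}" if n else "n/a"   # raw proportion is undefined with no runs
    fitted = sigmoid(a + b * x_yards / 100.0)
    print(f"CPA {x_yards:>3} yd  raw={raw:>4}  fitted={fitted:.2f}")
```

The raw proportions jump between bins and are undefined where no runs were made, while the fitted curve gives a value for every bin, including the one with no data.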

The statistical and numerical tools required for this analysis are not at all straightforward.  Even advanced statistical software, while taking away some of the burden of computation, still requires the user to form and interpret individual models.  In addition, the number of models to choose from grows exponentially in the number of factors that are considered.

Our Data Mining KDD tool completely automates this statistical process.  Once the user has entered the sample data, the software does all of the work, forming and solving tens, hundreds, or thousands of possible statistical models and determining the quality of each model.  The user can view a ranking of which factors are statistically most important and can view a variety of graphs showing how different factors affect the overall outcome.  The user can also view the results in a variety of formats.

Mine Warfare KDD Example

A company interested in underwater exploration needs to determine the search effectiveness of their sonar in a variety of scenarios.  They suspect that the probability of detecting a certain underwater object depends on one or more of the following factors:

  • The Closest Point of Approach (CPA) of the sonar to the object (binned as 0-10, 10-20, 20-50, 50-100, 100-150, and 150-200 yards)

  • The Bottom Type in the area (scaled as A, B, C, or D)

  • The Clutter Density in the area (scaled as 1, 2, or 3)

  • The Sound Speed Profile (SSP) of the water (positive indicates that sound speed is increasing with depth and negative indicates that sound speed is decreasing with depth)

The company performs a number of test runs, recording when their sonar finds or fails to find a known object.  The number of tests may vary for each scenario (each scenario represents a different combination of the factors) and there may be scenarios for which no tests were run.  They would like to use their limited test data to answer the following questions:

Question 1:    Which of the factors most heavily affects the probability of detection?

Question 2:    For any combination of Bottom Type, Clutter, and SSP, what is the probability of detection as a function of CPA?

To illustrate this example, we generated simulated test results.  We first randomly determined the number of test runs (uniformly distributed from 0 to 9) for each of the 192 scenarios:

  • 6 CPA Bins

  • 4 Bottom Types

  • 4 Clutter Densities

  • 2 SSPs

Then, for each test run, we randomly determined whether the run was successful, using the actual probability of detection curve (as a function of CPA) that we built for each scenario.  The measured data was then entered into the fully automated data mining KDD tool in a simple ASCII format, and the module determined the best model.
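The simulation procedure described above can be sketched in a few lines of Python. The actual probability of detection curves are not given on this page, so true_pd below is a made-up placeholder, and all names are hypothetical. The sketch also assumes four clutter levels, which is what the 192-scenario count implies (6 x 4 x 4 x 2), although the factor list earlier names only three.

```python
import math
import random
from itertools import product

CPA_BINS = [(0, 10), (10, 20), (20, 50), (50, 100), (100, 150), (150, 200)]  # yards
BOTTOM_TYPES = ["A", "B", "C", "D"]
CLUTTER_LEVELS = [1, 2, 3, 4]   # assumption: four levels, to match 192 scenarios
SSP_SIGNS = ["positive", "negative"]


def true_pd(cpa_bin, bottom, clutter, ssp):
    """Hypothetical ground-truth probability of detection (placeholder curve)."""
    midpoint = sum(cpa_bin) / 2
    score = 2.5 - 0.03 * midpoint
    score -= 0.4 * BOTTOM_TYPES.index(bottom)
    score -= 0.3 * (clutter - 1)
    score -= 0.2 if ssp == "negative" else 0.0
    return 1.0 / (1.0 + math.exp(-score))


def simulate(seed=0):
    """Return {scenario: (runs, detections)} for every factor combination."""
    rng = random.Random(seed)
    results = {}
    for scenario in product(CPA_BINS, BOTTOM_TYPES, CLUTTER_LEVELS, SSP_SIGNS):
        runs = rng.randint(0, 9)                 # uniform on 0..9, inclusive
        p = true_pd(*scenario)
        detections = sum(rng.random() < p for _ in range(runs))
        results[scenario] = (runs, detections)
    return results


results = simulate()
untested = sum(1 for runs, _ in results.values() if runs == 0)
print(len(results), "scenarios,", untested, "with no test runs")
```

Because the run count can be zero, some scenarios end up with no data at all, which is exactly the situation the text says the tool must handle.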

To illustrate the interaction between the user and our data mining KDD tool, we assume that the user requested that the probabilities be graphed as functions of CPA and proceeded to view the results:

User:            Using the graphical user interface (GUI), the user requests to see the three non-CPA factors ranked based on which one has the greatest effect on the probability of detection.

Software:     Ranks the three factors.  Calculates and displays the relative importance of each factor as shown in Figure 1.

 

Figure 1.  Ranking of Most Important Factors in Modeled Data

User:            Notes that Bottom Type seems slightly more important than Clutter.

               Requests the specific probability curve for Bottom Type = C, Clutter Density = 3, and Negative SSP (6 scenarios).

Software:     Provides the modeled curve (the red curve) in Figure 2.  For illustrative purposes, we have also graphed the probability curve formed from the measured data, with no modeling (black curve), along with the actual probability of detection curve, which we used to generate the measured data results (blue curve).

Figure 2.  Probability Curves for a Specific Scenario (Bottom Type = C, Clutter Density = 3, Negative SSP) (Y-Axis Indicates Probability of Detection)

As can be seen in Figure 2, the sparse measured data gives a very poor indication of the true shape of the probability of detection curve, but the model built from the measured data using our data mining KDD tool is very similar to the actual probability of detection curve.

Biotechnology KDD Example

A pharmaceutical company is studying the distribution of a certain drug.  The company has enlisted the help of a sample group of physicians to determine the average number of prescriptions written based on the following list of factors:  patient gender, patient age bracket, physician specialization, and geographic region within the country.  The following table shows how the different factors were broken down:

Table 1.  Factors Utilized in Data Mining KDD Tool Example

Factor                  Levels

Geographic Region       Northeast, Mid-Atlantic, South, Southwest, Midwest, Northwest, Pacific

Patient Gender          Male, Female

Specialization          Pediatrics, General, Internal Medicine, Geriatrics

Patient Age Bracket     0-11, 12-18, 19-35, 36-65, 65+

In a real-world test, each physician would report the number of prescriptions written for patients in each category during the test period, and the results would be combined into a sample table.  We simulated this test data by taking a random sample from a predetermined distribution, so we know the “true average” number of prescriptions for each combination of factors.  We use the randomly generated data set as the measured data.  We can then compare both the sample and modeled data sets to this “true average,” which gives us an idea of the benefits of using the modeled data as opposed to the sample data.
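A minimal sketch of this kind of simulation follows. The page says only that the sample came from "a predetermined distribution", so the Poisson distribution, the true_mean function, and every numeric value below are assumptions chosen for illustration. The factor levels are those in Table 1.

```python
import math
import random
from itertools import product

REGIONS = ["Northeast", "Mid-Atlantic", "South", "Southwest",
           "Midwest", "Northwest", "Pacific"]
GENDERS = ["Male", "Female"]
SPECIALTIES = ["Pediatrics", "General", "Internal Medicine", "Geriatrics"]
AGE_BRACKETS = ["0-11", "12-18", "19-35", "36-65", "65+"]


def true_mean(region, gender, specialty, age):
    """Hypothetical 'true average' prescriptions for one category."""
    mean = 3.0 + 0.5 * REGIONS.index(region)
    mean += 1.0 * AGE_BRACKETS.index(age)
    if gender == "Female" and age == "19-35":
        mean += 1.5                      # made-up gender-by-age interaction
    if specialty == "Geriatrics" and age == "65+":
        mean += 4.0                      # made-up specialty-by-age interaction
    return mean


def poisson_sample(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, running = 0, rng.random()
    while running > limit:
        count += 1
        running *= rng.random()
    return count


rng = random.Random(1)
cells = list(product(REGIONS, GENDERS, SPECIALTIES, AGE_BRACKETS))
measured = {cell: poisson_sample(rng, true_mean(*cell)) for cell in cells}

# Average gap between one noisy sample per cell and the known true mean.
gap = sum(abs(measured[c] - true_mean(*c)) for c in cells) / len(cells)
print(len(cells), "categories; mean absolute sample error:", round(gap, 2))
```

With 7 regions, 2 genders, 4 specializations and 5 age brackets there are 280 categories, and each holds a single noisy count, which is why the raw sample table tracks the true averages poorly.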

Our Data Mining KDD tool used this sample table to recommend a best statistical model, which gives information about which factors are most important in determining how many prescriptions will be written, which groups of factors seem to operate together and which seem to operate independently, and what the true average number of prescriptions for each category seems to be.

In this particular example, the Data Mining KDD Tool must choose from among 166 possible nominal and 66,558 possible ordinal models.  Our Data Mining KDD tool can automatically generate and analyze all of these models, or we can ask for a quicker, heuristic search, in which case the tool efficiently searches the space of all possible models to arrive at a “Best” model.  It then displays the groups of factors that seem to interact together (see Figure 3):

Figure 3.  Groups of Interdependent Factors

From this we see that we can consider region independently, but that gender and patient age should be viewed together, as should physician specialization and patient age.  In other words, we could well make a statement such as “More prescriptions tend to be written in the Southwest than in the Midwest.”  Such a statement could be made without specifying a gender, age bracket or physician type.  However, we could not say, “More prescriptions tend to be written for males than for females” unless we also include more detailed information about age, as in, “For patients over the age of 65, more prescriptions tend to be written for males than for females.”
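The figure of 166 nominal models quoted above for four factors can be reproduced under one plausible reading: each model is a hierarchical model identified by its set of highest-order interaction terms, where no term is contained in another. Whether the tool counts models this way is an assumption; the sketch below simply enumerates such generating sets by brute force, and the function names are hypothetical.

```python
from itertools import combinations

FACTORS = ("region", "gender", "specialization", "age")


def nonempty_subsets(items):
    """All non-empty subsets of the factors, i.e. all candidate terms."""
    return [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)]


def is_antichain(terms):
    """True if no term is a proper subset of another term."""
    return not any(a < b for a in terms for b in terms)


def hierarchical_models(factors):
    """Enumerate every non-empty set of terms in which none contains another."""
    terms = nonempty_subsets(factors)
    models = []
    for mask in range(1, 1 << len(terms)):
        chosen = [t for i, t in enumerate(terms) if (mask >> i) & 1]
        if is_antichain(chosen):
            models.append(chosen)
    return models


models = hierarchical_models(FACTORS)
print(len(models))  # 166 for four factors

# The model read off Figure 3: region alone, gender with age, specialization with age.
figure3 = {frozenset({"region"}), frozenset({"gender", "age"}),
           frozenset({"specialization", "age"})}
print(any(set(m) == figure3 for m in models))
```

The same enumeration gives 4 models for two factors and 18 for three, which shows how quickly the model space grows and why an automated search is attractive.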

We can get an idea of which of the four factors seem to be most influential by asking the tool to rank the factor effects.  The result is the display shown in Figure 4.

 

 Figure 4.  Ranking of Most Influential Factors

We proceed to analyze the data in more detail.  The following graphs show the number of prescriptions written for one specific group – males between the ages of 19 and 35, living in the Southwest.

Figure 5.  Number of Prescriptions to Males Aged 19-35 in the Southwest (Y-Axis Indicates Total Number of Prescriptions Written)

The jagged measured data curve bears the marks of the variance of the distribution, but the modeled curve matches the true mean number of prescriptions almost exactly.  It is at this level of detail that we most clearly see the advantages of using the generated “Best” statistical model to analyze sample data, rather than simply analyzing the sample data directly.


 


© 2005 Daniel H. Wagner Associates, Inc.  - All rights reserved.