Data Mining
Our Data Mining algorithms and tools
provide extensive, highly automated Knowledge Discovery in
Databases (KDD) capabilities.
Their key feature is the ability
to automatically determine how sensitive
results or events are to many different factors. Even perfect knowledge of the
environment or other important factors is only useful if we know how
sensitive mission success and operational results are to these factors.
Data Mining Knowledge Discovery in Databases (KDD) Tools
Our Data Mining KDD tools provide
automated statistical analysis employable in a wide variety of applications,
including detailed analysis of initial data mining results. The software
produces answers to several key questions such as:
· Which factors are most influential and how do they affect the probability
of a given event or the number of occurrences of a given event? Example: How
do age range, gender, education level, and work experience affect the
probability that a person’s income will fall into a particular salary range?
What is the most determining factor for women in particular?
· How do different factors affect the average number of occurrences of a
specific event? Example: How many traffic accidents (per 5 year period)
should we expect from a specific category of insured drivers? Which factor
most heavily affects the expected number of accidents?
In mine warfare we are using this
tool to automatically determine which environmental/operational factors each
MCM system is most sensitive to and how these factors affect system
performance. We are also using it to automatically determine the best model
to predict the probability of detection for each combination of these
factors. However, there are countless situations in which businesses and
researchers need answers to these or similar questions. In general, they
have only limited sample data available to estimate the desired
probabilities or average number of events. Unless generated using a large
number of test cases, these sample statistics tend to reflect anomalies in
the data rather than true probability trends. It is for this reason that we
use advanced statistical methods to smooth out the data, including models
that extend to cases when the factors of interest fall into discrete
categories.
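To put a number on how unreliable small samples are, the sketch below computes the standard error of an observed proportion, sqrt(p(1-p)/n), for a few sample sizes. The probability of 0.7 and the sample sizes are made-up values chosen only for illustration.

```python
import math

def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion from n independent trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical true probability; not taken from any real data set.
true_p = 0.7
for n in (5, 50, 500):
    se = proportion_standard_error(true_p, n)
    print(f"n={n:4d}  standard error ~ {se:.3f}")
```

With only a handful of trials the standard error is a large fraction of the probability itself, which is why raw sample proportions from sparse tests can be misleading.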
These methods are similar to
linear regression but use more sophisticated models, which must be solved
and analyzed with more computationally intensive approaches.
The statistical and numerical
tools required for this analysis are not at all straightforward. Even
advanced statistical software, while taking away some of the burden of
computation, still requires the user to form and interpret individual
models. In addition, the number of models to choose from grows
exponentially in the number of factors that are considered.
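A rough sense of this growth: even if we only decide which factors to include as main effects, ignoring interactions entirely, k factors give 2^k candidate models. The sketch below enumerates them. The factor names are borrowed from the mine warfare example later on this page, and the restriction to main effects is a simplifying assumption, not a description of the model space the tool actually searches.

```python
from itertools import combinations

def main_effect_models(factors):
    """Return every subset of factors, each one a candidate main-effects model."""
    models = []
    for size in range(len(factors) + 1):
        models.extend(combinations(factors, size))
    return models

# Illustrative factor names only.
factors = ["CPA", "Bottom Type", "Clutter Density", "SSP"]
models = main_effect_models(factors)
print(len(models), "main-effects models for", len(factors), "factors")

# The count doubles with every added factor.
for k in (4, 8, 16):
    print(k, "factors ->", 2 ** k, "main-effects models")
```

Allowing interactions between factors makes the model space larger still.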
Our Data Mining KDD tool
completely automates this statistical process. Once the user has entered
the sample data, the software does all of the work, forming and solving
tens, hundreds, or thousands of possible statistical models and determining
the quality of each model. The user can view a ranking of which factors are
statistically most important and can view a variety of graphs showing how
different factors affect the overall outcome. The user can also view the
results in a variety of formats.
Mine Warfare KDD Example
A company interested in
underwater exploration needs to determine the search effectiveness of their
sonar in a variety of scenarios. They suspect that the probability of
detecting a certain underwater object depends on one or more of the
following factors:
- The Closest Point of Approach (CPA) of the sonar to the object (binned as
0-10, 10-20, 20-50, 50-100, 100-150, and 150-200 yards)
- The Bottom Type in the area (scaled as A, B, C, or D)
- The Clutter Density in the area (scaled as 1, 2, or 3)
- The Sound Speed Profile (SSP) of the water (positive indicates that sound
speed is increasing with depth and negative indicates that sound speed is
decreasing with depth)
The company performs a number of
test runs, recording when their sonar finds or fails to find a known
object. The number of tests may vary for each scenario (each scenario
represents a different combination of the factors) and there may be
scenarios for which no tests were run. They would like to use their limited
test data to answer the following questions:
Question 1:
Which of the factors most heavily affects the probability of detection?
Question 2:
For any combination of Bottom Type, Clutter, and SSP, what is the
probability of detection as a function of CPA?
To illustrate this example, we
generated simulated test results. We first randomly determined (uniformly
distributed from 0 to 9) the number of test runs for each of the 192
scenarios:
- 6 CPA Bins
- 4 Bottom Types
- 4 Clutter Densities
- 2 SSP Profiles
Then, for each test run, we
randomly determined whether the run was successful, using the actual
probability of detection as a function of CPA that we built for
each scenario. The measured data were then entered into the fully automated
data mining KDD tool in a simple ASCII format, and the module determined the
best model.
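The simulation procedure just described can be sketched as follows. The shape of the "actual" detection curve and every coefficient in it are invented for this illustration, and the clutter levels are placeholders chosen so that the factor counts multiply to the 192 scenarios quoted above; none of it reflects the curves used to produce the figures.

```python
import math
import random
from itertools import product

CPA_BINS = [(0, 10), (10, 20), (20, 50), (50, 100), (100, 150), (150, 200)]  # yards
BOTTOM_TYPES = ["A", "B", "C", "D"]
CLUTTER_LEVELS = [1, 2, 3, 4]   # placeholder labels, so that 6 * 4 * 4 * 2 = 192
SSP_SIGNS = ["positive", "negative"]

def true_detection_probability(cpa_bin, bottom, clutter, ssp):
    """Hypothetical 'actual' curve: detection falls off with CPA; all terms are invented."""
    midpoint = (cpa_bin[0] + cpa_bin[1]) / 2.0
    score = 2.5 - 0.03 * midpoint
    score -= 0.3 * BOTTOM_TYPES.index(bottom)
    score -= 0.2 * (clutter - 1)
    score -= 0.4 if ssp == "negative" else 0.0
    return 1.0 / (1.0 + math.exp(-score))

def simulate_tests(rng):
    """Return {scenario: (runs, detections)} for every combination of factors."""
    results = {}
    for scenario in product(CPA_BINS, BOTTOM_TYPES, CLUTTER_LEVELS, SSP_SIGNS):
        runs = rng.randint(0, 9)  # uniform on 0..9, so some scenarios get no tests
        p = true_detection_probability(*scenario)
        detections = sum(rng.random() < p for _ in range(runs))
        results[scenario] = (runs, detections)
    return results

data = simulate_tests(random.Random(0))
print(len(data), "scenarios,", sum(r for r, _ in data.values()), "test runs in total")
```

Each scenario ends up with a run count and a detection count, which is the kind of sparse table the tool takes as input.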
To illustrate the interaction
between the user and our data mining KDD tool, we assume that the user
requested that the probabilities be graphed as functions of CPA and
proceeded to view the results:
User: Using the graphical user interface (GUI), the user requests to see
the three non-CPA factors ranked by which has the greatest effect
on the probability of detection.
Software: Ranks the three factors. Calculates and displays the relative
importance of each factor as shown in Figure 1.

Figure 1. Ranking of Most Important Factors in Modeled Data
User: Notes that Bottom Type seems slightly more important than Clutter.
Requests the specific probability curve for Bottom Type = C,
Clutter Density = 3, and Negative SSP (6 scenarios).
Software: Provides the modeled curve (the red curve) in Figure 2.
For illustrative purposes, we have also graphed the probability curve formed
from the measured data, with no modeling (black curve), along with the
actual probability of detection curve, which we used to generate the
measured data results (blue curve).
Figure 2. Probability Curves for a Specific Scenario (Bottom Type = C,
Clutter Density = 3, Negative SSP) (Y-Axis Indicates Probability of
Detection)
As can be seen in Figure 2, the sparse measured data gives a very poor
indication of the true shape of the probability of detection curve, but the
model built from the measured data using our data mining KDD tool is very
similar to the actual probability of detection curve.
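The gap between raw and modeled curves can be reproduced in miniature. The sketch below fits a two-parameter logistic curve in CPA to sparse detection counts by Newton's method and prints the raw hit fraction next to the fitted probability. Logistic regression is a standard model of this general kind and is used here purely as an assumed stand-in; the counts are invented, so neither should be read as the model or the data behind Figure 2.

```python
import math

def fit_logistic(xs, runs, hits, iterations=50, tol=1e-10):
    """Fit p(x) = 1 / (1 + exp(-(a + b*x))) to binomial counts by Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, n, y in zip(xs, runs, hits):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = n * p * (1.0 - p)
            g_a += y - n * p          # gradient of the log-likelihood
            g_b += (y - n * p) * x
            h_aa += w                 # information matrix (negative Hessian)
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + step_a, b + step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b

# Made-up sparse counts per CPA bin (midpoints in hundreds of yards).
cpa = [0.05, 0.15, 0.35, 0.75, 1.25, 1.75]
runs = [3, 0, 5, 2, 4, 1]
hits = [3, 0, 4, 1, 1, 0]

a, b = fit_logistic(cpa, runs, hits)
for x, n, y in zip(cpa, runs, hits):
    raw = y / n if n else float("nan")   # no estimate at all when a bin has no tests
    fitted = 1.0 / (1.0 + math.exp(-(a + b * x)))
    print(f"CPA midpoint {100 * x:5.0f} yd  raw={raw:.2f}  fitted={fitted:.2f}")
```

The fitted curve gives an estimate even for the bin with no test runs, because it pools information across all bins.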
Biotechnology KDD Example
A pharmaceutical company is
studying the distribution of a certain drug. The company has enlisted the
help of a sample group of physicians to determine the average number of
prescriptions written based on the following list of factors: patient
gender, patient age bracket, physician specialization, and geographic region
within the country. The following table shows how the different factors
were broken down:
Table 1. Factors Utilized in Data Mining KDD Tool Example

| Geographic Region | Patient Gender | Specialization | Patient Age Bracket |
|---|---|---|---|
| Northeast | Male | Pediatrics | 0-11 |
| Mid-Atlantic | Female | General | 12-18 |
| South | | Internal Medicine | 19-35 |
| Southwest | | Geriatrics | 36-65 |
| Midwest | | | 65+ |
| Northwest | | | |
| Pacific | | | |

In a real-world test each
physician would report the number of prescriptions written to patients in
each category during the test period and the results would be combined into
a sample table. We have simulated this test data by taking a random sample
from a predetermined distribution. Thus, we know the “true average” number
of prescriptions for each combination of factors. We use this randomly generated
simulated data set as the measured data. We can compare both the sample and
modeled data sets to this “true average,” giving us an idea of the benefits
of using the modeled data as opposed to the sample data.
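A simulated sample table of this kind can be sketched as follows. The Poisson distribution is used below only because it is a common choice for count data; the distribution, the cell means, and the interactions in this sketch are all invented and are not the predetermined distribution behind the figures on this page.

```python
import math
import random
from itertools import product

REGIONS = ["Northeast", "Mid-Atlantic", "South", "Southwest",
           "Midwest", "Northwest", "Pacific"]
GENDERS = ["Male", "Female"]
SPECIALIZATIONS = ["Pediatrics", "General", "Internal Medicine", "Geriatrics"]
AGE_BRACKETS = ["0-11", "12-18", "19-35", "36-65", "65+"]

def true_mean(region, gender, specialization, age):
    """Invented 'true average' for a cell; any positive function would do here."""
    base = 3.0 + 0.5 * REGIONS.index(region)
    if specialization == "Pediatrics" and age in ("0-11", "12-18"):
        base += 4.0   # made-up specialization-by-age interaction
    if gender == "Male" and age == "65+":
        base += 1.5   # made-up gender-by-age interaction
    return base

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, running = 0, rng.random()
    while running > threshold:
        count += 1
        running *= rng.random()
    return count

def simulate_table(rng):
    """Return {cell: simulated prescription count} for all 7 * 2 * 4 * 5 = 280 cells."""
    return {
        cell: poisson_sample(true_mean(*cell), rng)
        for cell in product(REGIONS, GENDERS, SPECIALIZATIONS, AGE_BRACKETS)
    }

table = simulate_table(random.Random(1))
cell = ("Southwest", "Male", "General", "19-35")
print(len(table), "cells; sample count for", cell, "=", table[cell],
      "vs. true mean", true_mean(*cell))
```

Because the true mean of every cell is known by construction, both the raw counts and any fitted model can be scored against it.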
Our Data Mining KDD tool used
this sample table to recommend a best statistical model, which gives
information about which factors are most important in determining how many
prescriptions will be written, which groups of factors seem to operate
together and which seem to operate independently, and what the true average
number of prescriptions for each category seems to be.
In this particular example, the Data Mining KDD Tool must choose from among 166
possible nominal and 66,558 possible ordinal models. Our Data Mining KDD
tool can automatically generate and analyze all of these models, or we can
ask for a quicker, heuristic search, in which case the tool efficiently
searches the space of all possible models to arrive at a “Best” model. It
then displays the groups of factors that seem to interact together (see
Figure 3):

Figure 3. Groups of Interdependent Factors
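The figure of 166 nominal models agrees with one standard way of counting: describe each model by its highest-order interaction terms, require that no term be contained in another, and count every non-empty choice of such terms over the four factors. The sketch below performs that count by brute force. Whether the tool enumerates nominal models in exactly this way is an assumption made for this illustration, and the count of 66,558 ordinal models is not reproduced here.

```python
from itertools import combinations

def nonempty_subsets(items):
    """All non-empty subsets of items, as frozensets."""
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def count_generating_classes(factors):
    """Count non-empty sets of interaction terms in which no term contains another.

    Assumption: each such set stands for one hierarchical model.
    """
    terms = nonempty_subsets(factors)
    count = 0
    for mask in range(1, 2 ** len(terms)):
        chosen = [t for i, t in enumerate(terms) if (mask >> i) & 1]
        # Keep the selection only if no chosen term is a proper subset of another.
        if all(not (s < t or t < s) for s, t in combinations(chosen, 2)):
            count += 1
    return count

factors = ["region", "gender", "specialization", "age"]
print(count_generating_classes(factors))  # 166 for four factors
```

The same function returns 4 for two factors and 18 for three, which shows how quickly the space grows as factors are added.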
From this we see that we can
consider region independently, but that gender and patient age should be
viewed together, as should physician specialization and patient age. In
other words, we could well make a statement such as “More prescriptions tend
to be written in the Southwest than in the Midwest.” Such a statement could
be made without specifying a gender, age bracket or physician type.
However, we could not say, “More prescriptions tend to be written for males
than for females” unless we also include more detailed information about
age, as in, “For patients over the age of 65, more prescriptions tend to be
written for males than for females.”
We can get an idea of which of
the four factors seem to be most influential by asking the tool to rank the
factor effects. The result is the display shown in Figure 4.

Figure 4. Ranking of Most Influential Factors
We proceed to analyze the data in
more detail. The following graphs show the number of prescriptions written
for one specific group – males between the ages of 19 and 35, living in the
Southwest.

Figure 5. Number of Prescriptions to Males Aged 19-35 in the Southwest
(Y-Axis Indicates Total Number of Prescriptions Written)
The jagged measured data curve bears the marks of the variance of the
distribution, but the modeled curve matches the true mean number of
prescriptions almost exactly. It is at this level of detail that we most
clearly see the advantages of using the generated “Best” statistical model
to analyze sample data, rather than simply analyzing the sample data
directly.