What Is Data Mining?
by Josh Patterson ~ November 17th, 2010
A man who carries a cat by the tail learns something he can learn in no other way.
– Mark Twain
Data Mining is defined as:
“the process of extracting patterns from data.”
Any collection of data to be mined has a set of examples or instances that can be grouped together in a binary file, a text file, or a database and examined electronically by a data miner and their tools. Some people mine data to gain knowledge about it; others take the instances and derive sets of rules about how the data behaves, or build more complicated mathematical models that describe trends that are very hard to see. But how do we actually “mine” this data? How can we break this process of pattern extraction down into more discrete steps?
…
The Process of Data Mining
The process of data mining has 4 major components used for knowledge construction:
- Concepts
- Examples
- Instances
- Attributes
where these components work together to form a process in which we condense patterns from the vapor of nuances in our data. These patterns can later be used as a valuable resource.
…
Concepts
A concept is the thing to be learned, what we are after, the model of the structural pattern within the data. Some examples of concepts are
- abnormal vs normal system operation
- the canonical “Bell, Funnel, Cylinder” dataset - understanding the difference between these shapes in noisy timeseries data
- recognizing how similar or different two pieces of information are
- the type of books I would like to be shown based on my past purchases on Amazon (recommendation engines)
Readers can think of a concept the same way they learn a “concept” themselves: as a topic we study through descriptions, examples, and other information in order to understand it better.
…
Examples
Examples are defined as:
- inputs to a learning scheme ( clustering, classification, association, etc )
- a set of instances ( described below )
- a dataset ( example: The waveform dataset at the UCI Dataset repository )
Examples are a rather restricted form of input that we use to educate our system. As examples, we use input data that we know corresponds to the concepts we want our system to learn.
…
Instances
Instances (which are specific types of examples) are records in a database or text file. Examples of instances include:
- recorded symptoms of patients as they enter an emergency room as well as what type of ailment they actually had
- a window of timeseries data from a temperature sensor
- a line in a log file
Preparing instances for the data mining process usually takes the bulk of our time.
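As a rough illustration of the "window of timeseries data" case, here is a minimal sketch of how a stream of sensor readings might be cut into fixed-size instances. The function name `sliding_windows`, its parameters, and the temperature values are assumptions made up for this example, not part of any particular tool.

```python
from typing import List


def sliding_windows(readings: List[float], size: int, step: int = 1) -> List[List[float]]:
    """Cut a series of readings into fixed-size windows, one instance per window."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # Stop early enough that every window holds exactly `size` readings.
    return [readings[i:i + size] for i in range(0, len(readings) - size + 1, step)]


# Hypothetical temperature readings, e.g. one per minute.
temperatures = [20.1, 20.3, 20.2, 20.8, 21.5, 21.4]

# Each window of three consecutive readings becomes one instance.
instances = sliding_windows(temperatures, size=3, step=1)
```

With a step of 1 the windows overlap; a step equal to the window size would produce non-overlapping instances instead. Which choice is right depends on the concept we are trying to learn.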
…
Attributes
Attributes are the fields in the record/instance that contain the actual data. Each attribute can have different types of data in it. Types include
- Numeric - Integers, Longs, any type of numeric data
- Nominal - distinct symbols generally serving as labels or names. (ex: “sunny”, “overcast”, “rainy” )
- Ordinal - impose an order on values, yet no distance between the values is defined ( example: “hot” > “mild” > “cold” )
- Interval - values are not only ordered but measured in fixed and equal units (example: temperature, or year).
- Ratio - quantities for which the measurement defines a zero point (example: distance); they are treated as real numbers.
No relation is implied among nominal values, but equality tests can be performed. Also, the distinction between nominal and ordinal values is not always clear.
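To make the attribute types a bit more concrete, here is a small sketch of one way they could be modeled in Python. The `WeatherInstance` record, its field names, and the values are all hypothetical; the point is only which operations make sense for each type.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Outlook(Enum):
    """Nominal: distinct labels with no order among them."""
    SUNNY = "sunny"
    OVERCAST = "overcast"
    RAINY = "rainy"


class Warmth(IntEnum):
    """Ordinal: ordered, but the gaps between values carry no meaning."""
    COLD = 0
    MILD = 1
    HOT = 2


@dataclass
class WeatherInstance:
    outlook: Outlook        # nominal attribute
    warmth: Warmth          # ordinal attribute
    temperature_c: float    # interval attribute: differences are meaningful
    visibility_km: float    # ratio attribute: has a true zero point


a = WeatherInstance(Outlook.SUNNY, Warmth.HOT, 31.0, 10.0)
b = WeatherInstance(Outlook.RAINY, Warmth.MILD, 18.0, 5.0)

same_outlook = a.outlook == b.outlook                   # nominal: only equality tests
warmer = a.warmth > b.warmth                            # ordinal: ordering is allowed
temperature_gap = a.temperature_c - b.temperature_c     # interval: subtraction is meaningful
visibility_ratio = a.visibility_km / b.visibility_km    # ratio: "twice as far" is meaningful
```

A plain `Enum` refuses ordering comparisons such as `Outlook.SUNNY < Outlook.RAINY`, which matches the idea that no relation is implied among nominal values.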
The question beyond “what are the components of data mining?” is “how do we use these concepts?”. In the following section, we take a very general look at how we can use the above components to build a process in which we extract patterns from collections of data.
…
Putting the Components Together
In order to learn concepts, we need examples. These examples are made up of instances that were selected to show a specific piece of data corresponding to a class or type. If a system that is looking for patterns sees similar instances resulting in the same output again and again, the system may conclude that this general class of instance corresponds to this outcome. It could be said that the system has now “learned” this “concept”.
An interesting illustration of this is a classifier. A classifier is a system that takes input data and predicts the input data’s class or type. In this example scenario we show a classifier 2 sets of timeseries data:
- A set of timeseries data that shows operation of a wind turbine under normal operation.
- A separate set of timeseries data that shows the wind turbine operating under duress or abnormally.
With proper training, our classifier can learn the difference and classify future wind turbine timeseries data it hasn’t seen before. Of course these classifiers aren’t perfect, but they can be trained to get far more answers right than wrong. To see an example of a similar classifier in real life, check out the timeseries classifier open sourced in the openPDC project. In future articles we’ll talk more about how the openPDC timeseries classifier is built, trained, and classifies terabytes of timeseries data. For now, let’s move on to some general ideas about how we build the datasets that power our data mining processes.
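To give a feel for the idea, here is a deliberately tiny sketch of a classifier trained on two labeled sets of windows. It is not the openPDC classifier and makes no claim about how that one works. It is a simple nearest-centroid scheme, assuming each window is summarized by just its mean and standard deviation, and all of the readings are made up.

```python
from math import dist
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Features = Tuple[float, float]


def features(window: List[float]) -> Features:
    """Summarize a window by its mean and (population) standard deviation."""
    return (mean(window), pstdev(window))


def train(examples: Dict[str, List[List[float]]]) -> Dict[str, Features]:
    """Learn one centroid per class by averaging the features of its windows."""
    centroids = {}
    for label, windows in examples.items():
        feats = [features(w) for w in windows]
        centroids[label] = (mean(f[0] for f in feats), mean(f[1] for f in feats))
    return centroids


def classify(centroids: Dict[str, Features], window: List[float]) -> str:
    """Predict the class whose centroid is closest to the window's features."""
    f = features(window)
    return min(centroids, key=lambda label: dist(f, centroids[label]))


# Hypothetical training examples: windows of turbine sensor readings.
training_examples = {
    "normal": [[10.0, 10.2, 9.9, 10.1], [10.1, 9.8, 10.0, 10.2]],
    "abnormal": [[10.0, 14.0, 6.5, 13.0], [9.0, 15.0, 5.0, 12.5]],
}
model = train(training_examples)

# Windows the classifier has not seen before.
steady = classify(model, [10.0, 10.1, 9.9, 10.0])
erratic = classify(model, [9.5, 14.5, 6.0, 13.5])
```

Here the steady window lands near the "normal" centroid and the erratic one near the "abnormal" centroid. A real classifier would need far more examples, better features, and some care about feature scaling.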
…
Dataset Construction
Beyond the above components, to construct a dataset, we have to do 4 things:
- Gather the data - (ex: by hand, Mechanical Turk, ETL)
- Review the content and context of the data - (ex: open data in Excel, graph some values)
- Verify, Validate, and Plug the data - (does the data make sense? are values in bounds?)
- Get to know the data - (get comfortable with this type of data)
When we gather our data, we have to ask questions like
- Where does the data come from?
- Did it come from a single source or from multiple sources?
- Is it clean data? Are there missing or bad values in the instances?
The content of the data gathered can vary greatly, and has to be checked thoroughly to protect against invalid data. If our data is bad or noisy, our results can be rendered useless or, even worse, dangerous. Noisy data can have misplaced decimal points, missing values, missing attributes, or values recorded in different formats or types within the same attribute field.
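As a sketch of what checking for these problems might look like, the function below flags missing values, values that are not numeric, and values that fall outside expected bounds. The function name `check_instance`, the attribute names, and the bounds are assumptions chosen for this illustration.

```python
from typing import Any, Dict, List, Tuple


def check_instance(instance: Dict[str, Any],
                   bounds: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return a list of problems found in one instance (empty if it looks clean)."""
    problems = []
    for name, (low, high) in bounds.items():
        value = instance.get(name)
        if value is None or value == "":
            problems.append(f"{name}: missing")
            continue
        try:
            # Accept numbers stored as text, a common format mismatch.
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{name}: not numeric ({value!r})")
            continue
        if not low <= number <= high:
            problems.append(f"{name}: {number} outside [{low}, {high}]")
    return problems


# Hypothetical bounds for two numeric attributes.
bounds = {"temperature_c": (-50.0, 60.0), "humidity_pct": (0.0, 100.0)}

clean = check_instance({"temperature_c": "21.5", "humidity_pct": 40}, bounds)
shifted = check_instance({"temperature_c": 215.0, "humidity_pct": None}, bounds)
garbled = check_instance({"temperature_c": "n/a", "humidity_pct": 55}, bounds)
```

The second instance shows the "decimal place off" case: 215.0 where 21.5 was probably meant. A bounds check can flag it, but deciding how to repair it still takes a human who knows the data.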
The data cleansing process can easily take up the bulk of a data miner’s time since there can be millions of records to review. Getting to know the data takes time and much energy, but its results cannot be replaced. Many hours are generally expected to be invested in this process. Graphing a single attribute to look for any statistical outliers, taking the mean, and taking the standard deviation are some of the ways we can get to know the data.
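Here is a minimal sketch of that kind of first look at a single attribute: take the mean and the sample standard deviation, then flag values that sit unusually far from the mean. The threshold of 2 standard deviations and the sample readings are assumptions for this example; other cutoffs and other outlier tests are just as reasonable.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(values: List[float], z_threshold: float = 2.0) -> Tuple[float, float, List[float]]:
    """Return the mean, sample standard deviation, and values far from the mean.

    Needs at least two values, since the sample standard deviation is
    undefined for a single point.
    """
    mu = mean(values)
    sigma = stdev(values)
    outliers = [v for v in values if sigma > 0 and abs(v - mu) / sigma > z_threshold]
    return mu, sigma, outliers


# Hypothetical readings for one attribute, with one suspicious value.
readings = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 19.8, 20.0, 20.2, 35.0]
mu, sigma, outliers = summarize(readings)
```

Note that a single extreme value also inflates the mean and standard deviation it is measured against, so on small samples a lower threshold (or a median-based test) tends to work better than the textbook cutoff of 3.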
Another important thing to consider is that as we get to know the data, we have access to many sensitive and private pieces of information. What is our responsibility in handling this data? In the next post on machine learning I’m going to examine some of the implications of ethics in data mining.
…
Conclusion
Data mining is the process by which we extract patterns from data and store these patterns for later use. We’ve taken a look at the components of data mining and given a very broad description of how to begin mining this data. Although we now know a little bit about the process of data mining, we still need to know how to represent and store the patterns extracted during the process. In the next installment of this series, we’ll take a more in-depth look at how we model and represent the patterns, or knowledge, we have mined from our data.
…
References
To learn more about these topics in depth, please take a look at Ian Witten’s Data Mining book:
http://www.cs.waikato.ac.nz/~ml/weka/book.html




