Data Mining Primitives - A common misconception is that data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
(Read also -> What is Data mining?)
Without user intervention, the system would uncover a large set of patterns and insights that may even surpass the size of the database. Hence, user guidance is required.
This communication between the user and the system is provided by a set of data mining primitives.
Data Mining Primitives
1. Task-relevant data: What is the data set I want to mine?
2. Type of knowledge to be mined: What kind of knowledge do I want to mine?
3. Background knowledge: What background knowledge could be useful here?
4. Pattern interestingness measurements: What measures can be useful to estimate pattern interestingness?
5. Visualization of discovered patterns: How do I want the discovered patterns to be presented?
Task-Relevant Data
The first primitive is the specification of the data on which mining is to be performed. Typically, a user is interested in only a subset of the database. It is impractical to mine the entire database, particularly since the number of patterns generated could be exponential with respect to the database size.
Furthermore, many of the patterns found would be irrelevant to the interests of the user.
In a relational database, the set of task-relevant data can be collected via a relational query involving operations like selection, projection, join and aggregation.
This retrieval of data can be thought of as a “subtask” of the data mining task. The data collection process results in a new data relation called the initial data relation.
The initial data relation can be ordered or grouped according to the conditions specified in the query.
The data may be cleaned or transformed (e.g. aggregated on certain attributes) before applying data mining analysis.
This initial relation may or may not correspond to a physical relation in the database.
Since virtual relations are called views in the field of databases, the set of task-relevant data for data mining is called a minable view.
The task-relevant data is specified by the following:
- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria
Example
If a data mining task is to study associations between items frequently purchased at AllElectronics by customers in Canada, the task-relevant data can be specified by providing the following information:
- Name of the database or data warehouse to be used (e.g., AllElectronics_db)
- Names of the tables or data cubes containing relevant data (e.g., item, customer, purchases, and items_sold)
- Conditions for selecting the relevant data (e.g., retrieve data about purchases made in Canada for the current year)
- The relevant attributes or dimensions (e.g., name and price from the item table and income and age from the customer table)
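The specification above can be sketched as a single relational query. The following is a minimal illustration using Python's built-in sqlite3 module. The table names come from the example, but the column layouts, the sample rows, and the helper name minable_view are assumptions made up for this sketch, not the actual AllElectronics schema.

```python
import sqlite3

# Hypothetical, simplified schema: the example names the tables,
# but the columns and rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer   (cust_id INTEGER PRIMARY KEY, income INTEGER, age INTEGER);
    CREATE TABLE item       (item_id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE purchases  (trans_id INTEGER PRIMARY KEY, cust_id INTEGER,
                             country TEXT, year INTEGER);
    CREATE TABLE items_sold (trans_id INTEGER, item_id INTEGER);

    INSERT INTO customer VALUES (1, 45000, 34), (2, 30000, 22);
    INSERT INTO item VALUES (10, 'VCR', 199.0), (11, 'computer', 899.0);
    INSERT INTO purchases VALUES (100, 1, 'Canada', 2024),
                                 (101, 2, 'USA', 2024),
                                 (102, 2, 'Canada', 2023);
    INSERT INTO items_sold VALUES (100, 10), (101, 11), (102, 11);
""")

def minable_view(conn, country, year):
    """Build the initial data relation with join, selection, and projection."""
    query = """
        SELECT i.name, i.price, c.income, c.age        -- projection
        FROM purchases p
        JOIN items_sold s ON s.trans_id = p.trans_id   -- joins
        JOIN item i       ON i.item_id = s.item_id
        JOIN customer c   ON c.cust_id = p.cust_id
        WHERE p.country = ? AND p.year = ?             -- selection
        ORDER BY c.income                              -- ordering
    """
    return conn.execute(query, (country, year)).fetchall()

rows = minable_view(conn, "Canada", 2024)
print(rows)
```

The result is computed on demand and never stored as a table, which is what makes it a view rather than a physical relation.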
Knowledge To Be Mined
It is important to specify the kind of knowledge to be mined, as this determines the data mining functions to be performed. The kinds of knowledge include concept description (characterization and discrimination), association, classification, prediction, clustering, and evolution analysis.
(We will be discussing those in the upcoming articles)
In addition to specifying the kind of knowledge to be mined for a given data mining task, the user can be more specific and provide pattern templates that all discovered patterns must match.
These templates, or meta patterns (also called metarules or meta queries), can be used to guide the discovery process. The use of meta patterns is illustrated in the following example.
A user studying the buying habits of AllElectronics customers may choose to mine association rules of the form:
P (X:customer,W) ^ Q (X,Y) => buys (X,Z)
Here X is a key of the customer relation.
P and Q are predicate variables that can be instantiated to the relevant attributes or dimensions specified as part of the task-relevant data, and W, Y, and Z are object variables.
The search for association rules is confined to those matching the given metarule, such as
age (X, “30...39”) ^ income (X, “40K...49K”) => buys (X, “VCR”) [2.2%, 60%]
and occupation (X, “student”) ^ age (X, “20...29”) => buys (X, “computer”) [1.4%, 70%]
The former rule states that customers in their thirties, with an annual income of between 40K and 49K, are likely (with 60% confidence) to purchase a VCR, and such cases represent about 2.2% of the total number of transactions.
The latter rule states that customers who are students and in their twenties are likely (with 70% confidence) to purchase a computer, and such cases represent about 1.4% of the total number of transactions.
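To make the role of the metarule concrete, here is a minimal sketch of how candidate rules could be filtered against the template P (X, W) ^ Q (X, Y) => buys (X, Z). The rule representation, the set RELEVANT_ATTRIBUTES, and the function matches_metarule are all hypothetical names chosen for this illustration. Requiring P and Q to be two different attributes is also an assumption of the sketch.

```python
# Attributes treated as task-relevant in this sketch (an assumption).
RELEVANT_ATTRIBUTES = {"age", "income", "occupation"}

def matches_metarule(antecedent, consequent, relevant=RELEVANT_ATTRIBUTES):
    """Check a rule against the template P(X, W) ^ Q(X, Y) => buys(X, Z).

    antecedent: list of (predicate, value) pairs
    consequent: a single (predicate, value) pair
    """
    # The template has exactly two antecedent predicates and a 'buys' consequent.
    if len(antecedent) != 2 or consequent[0] != "buys":
        return False
    p, q = antecedent[0][0], antecedent[1][0]
    # P and Q must be instantiated to two distinct task-relevant attributes.
    return p != q and p in relevant and q in relevant

rule_1 = ([("age", "30...39"), ("income", "40K...49K")], ("buys", "VCR"))
rule_2 = ([("occupation", "student"), ("age", "20...29")], ("buys", "computer"))
rule_3 = ([("age", "30...39")], ("buys", "VCR"))  # only one antecedent predicate

for antecedent, consequent in (rule_1, rule_2, rule_3):
    print(antecedent, "=>", consequent, matches_metarule(antecedent, consequent))
```

The two rules from the example pass the check, while a rule with a single antecedent predicate does not, so it would be excluded from the search.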
Background Knowledge
Background knowledge is information about the domain to be mined. A concept hierarchy is a powerful form of background knowledge: it allows the discovery of knowledge at multiple levels of abstraction.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. A concept hierarchy for the dimension location is shown in the figure, mapping low-level concepts (i.e., cities) to more general concepts (i.e., countries).
This concept hierarchy consists of four levels, numbered from the top starting with level 0 for the most general node. In our example, level 1 represents the concept country, while levels 2 and 3 represent the concepts province_or_state and city respectively.
Rolling Up - Generalization of data
- Allows data to be viewed at more meaningful and explicit abstractions
- Makes the data easier to understand
- Compresses the data
- Requires fewer input/output operations
Drilling Down - Specialization of data
- Concept values are replaced by lower-level concepts
There may be more than one concept hierarchy for a given attribute or dimension based on different user viewpoints.
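Rolling up along a concept hierarchy can be sketched with two lookup tables. This is a minimal illustration only: the specific cities, the unit counts, and the function names roll_up and aggregate are assumptions chosen for the example, and the hierarchy covers just the city, province_or_state, and country levels.

```python
from collections import Counter

# Illustrative location hierarchy: city -> province_or_state -> country.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}

def roll_up(city, level):
    """Generalize a city value to the requested level of the hierarchy."""
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province_or_state":
        return province
    if level == "country":
        return PROVINCE_TO_COUNTRY[province]
    raise ValueError(f"unknown level: {level}")

def aggregate(sales, level):
    """Sum per-city counts after replacing each city by its higher-level concept."""
    totals = Counter()
    for city, units in sales.items():
        totals[roll_up(city, level)] += units
    return dict(totals)

sales_by_city = {"Vancouver": 120, "Toronto": 200, "Chicago": 90}  # made-up counts

print(aggregate(sales_by_city, "province_or_state"))
print(aggregate(sales_by_city, "country"))
```

At the country level the three city rows collapse into two, which is the compression effect described above. Drilling down is the reverse direction: going back to the finer-grained city rows.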
Interestingness Measure
Simplicity: A factor contributing to the interestingness of a pattern is the pattern’s overall simplicity for human comprehension.
Objective measures of pattern simplicity can be viewed as functions of the pattern structure, defined in terms of the pattern size in bits, or the number of attributes or operators appearing in the pattern.
For example, the more complex the structure of a rule is, the more difficult it is to interpret, and hence, the less interesting it is likely to be.
Certainty (Confidence): Each discovered pattern should have a measure of certainty associated with it that assesses the validity or “trustworthiness” of the pattern.
A certainty measure for association rules of the form “A => B”, where A and B are sets of items, is confidence. Given a set of task-relevant data tuples, the confidence of “A => B” is defined as
confidence (A => B) = # tuples containing both A and B / # tuples containing A
Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true.
Utility (support): usefulness of a pattern
support (A => B) = # tuples containing both A and B / total # of tuples
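As a worked illustration, support and confidence can be computed directly from a list of transactions. The transactions below are made up, and the function names are hypothetical. Confidence is taken here in its standard sense: the fraction of tuples containing A that also contain B.

```python
# Made-up transactions; each is the set of items in one tuple.
transactions = [
    {"computer", "printer"},
    {"computer", "software"},
    {"computer", "printer", "software"},
    {"VCR"},
]

def support(a, b, data):
    """Fraction of all tuples that contain both item sets A and B."""
    both = sum(1 for t in data if a <= t and b <= t)
    return both / len(data)

def confidence(a, b, data):
    """Fraction of the tuples containing A that also contain B."""
    with_a = sum(1 for t in data if a <= t)
    both = sum(1 for t in data if a <= t and b <= t)
    return both / with_a if with_a else 0.0

a, b = {"computer"}, {"printer"}
print(f"support    = {support(a, b, transactions):.2f}")
print(f"confidence = {confidence(a, b, transactions):.2f}")
```

Here two of the four transactions contain both a computer and a printer (support 2/4), and two of the three transactions containing a computer also contain a printer (confidence 2/3).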
Novelty: Novel patterns are those that contribute new information or increased performance to the given pattern set.
For example, a data exception.
Another strategy for detecting novelty is to remove redundant patterns.
Presentation And Visualization
For data mining to be effective, data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, cross tabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations.
Users must be able to specify the forms of presentation to be used for displaying the discovered patterns.
(We will be discussing it separately)
Some representation forms may be better suited than others for particular kinds of knowledge.
For example, generalized relations and their corresponding cross tabs or pie/bar charts are good for presenting characteristic descriptions, whereas decision trees are a common choice for classification.
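As a small illustration of one presentation form, the sketch below turns a generalized relation into a cross tab with row totals. The relation contents and the function names cross_tab and render are assumptions made up for this example.

```python
from collections import Counter

# Made-up generalized relation: (location, item, count) tuples.
generalized = [
    ("Canada", "TV", 3),
    ("Canada", "computer", 5),
    ("USA", "TV", 4),
    ("USA", "computer", 2),
]

def cross_tab(relation):
    """Accumulate counts into (row, column) cells and collect the axis labels."""
    cells = Counter()
    for row_key, col_key, count in relation:
        cells[(row_key, col_key)] += count
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    return rows, cols, cells

def render(rows, cols, cells):
    """Format the cross tab as tab-separated text with a total per row."""
    lines = ["\t".join(["location"] + cols + ["total"])]
    for r in rows:
        values = [cells[(r, c)] for c in cols]
        lines.append("\t".join([r] + [str(v) for v in values] + [str(sum(values))]))
    return "\n".join(lines)

rows, cols, cells = cross_tab(generalized)
print(render(rows, cols, cells))
```

The same cells could feed a bar or pie chart instead. The point is that the presentation form is chosen separately from the mined result.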
Summary
Data Mining Primitives -> Task-relevant data, Type of knowledge to be mined, Background knowledge, Pattern interestingness measurements, Visualization of discovered patterns.