Data Mining

From PPDM Wiki

Introduction

Traditional data analysis is done by inserting data into standard or customized models. In either case, it is assumed that the relationships among the system variables are well known and can be expressed mathematically. In many cases, however, these relationships are not known. Modeling is then not possible, and a data mining approach may be attempted instead.

Data mining (DM) is a term used to describe knowledge discovery in databases. DM is a process that uses statistical, mathematical, artificial intelligence and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases. Formerly the term was used to describe the process through which undiscovered patterns in data were identified. However, over time, the original definition has been modified to include most types of (automated) data analysis. According to the Gartner Group, data mining is the process of engineering mathematical patterns from usually large sets of data. These patterns can be rules, affinities, correlations, trends, or prediction models.

DM lies at the interface of computer science and statistics, using advances in both disciplines to extract information from large databases. It is an emerging field that has attracted much attention in a very short time. DM includes tasks known as knowledge extraction, data archaeology, data exploration, data pattern processing, data dredging, and information harvesting. All these activities are conducted automatically and allow quick discovery even by non-programmers.

The major characteristics and objectives of data mining

  • Data are often buried deep within very large databases, which sometimes contain data from several years. In many cases, the data are cleaned and consolidated in a data warehouse.
  • The data mining environment is usually a client/server architecture or a Web-based architecture.
  • Sophisticated new tools, including advanced visualization tools, help to extract the information ore buried in corporate files or archival public records.
  • The miner is often an end-user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly with little or no programming skill.
  • Striking it rich often involves finding an unexpected result and requires end users to think creatively.
  • Data mining tools are readily combined with spreadsheets and other software development tools. Thus, the mined data can be analyzed and processed quickly and easily.
  • Because of the large amounts of data and massive search efforts, it is sometimes necessary to use parallel processing for data mining.

Effectively leveraging data mining tools and technologies can lead to acquiring and maintaining a strategic competitive advantage. DM offers organizations an indispensable decision-enhancing environment to exploit new opportunities by transforming data into a strategic weapon.


How Data Mining Works

Intelligent data mining discovers information within data warehouses that queries and reports cannot effectively reveal. DM tools find patterns in data and may even infer rules from them. Three types of methods are used to identify patterns in data:

  • Simple models (SQL-based query, OLAP, human judgment)
  • Intermediate models (regression, decision trees, clustering)
  • Complex models (neural networks, other rule induction)

These patterns and rules can be used to guide decision-making and forecast the effect of decisions. DM can speed analysis by focusing attention on the most important variables. The dramatic drop in the cost/performance ratio of computer systems has enabled many organizations to start applying the complex algorithms of data mining techniques. Each data mining application class is supported by a set of algorithmic approaches to extract the relevant relationships in the data. These approaches differ in the classes of problems they are able to solve.
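
As an illustration of an intermediate model, the following is a minimal sketch of a least-squares linear regression in plain Python. The function name fit_line and the advertising/sales figures are hypothetical and chosen only for the example; a real data mining tool would normally rely on a statistics library and far larger data sets.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: advertising spend (x) and units sold (y).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)

# Use the fitted line to forecast the effect of a decision (spend = 6.0).
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

The fitted slope shows how strongly the outcome depends on the input variable, which is one simple way a model can focus attention on the most important variables.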

Classes

  • Classification: infers the defining characteristics of a certain group (e.g., customers who have been lost to competitors). These methods involve seeding a set of data with a known set of classes (perhaps found by clustering), and mapping all other items (customers) into these sets. Decision trees and neural networks are useful techniques.
  • Clustering: identifies groups of items that share a certain characteristic (clustering differs from classification in that no predefining characteristic is given). Clustering approaches address segmentation problems. Clustering algorithms can be used to identify classes of customers with certain needs to be met.
  • Association: identifies relationships between events that occur at one time. Association approaches address a class of problems typified by market basket analysis. In retailing, the aim is to identify which products sell together with which others, and to what degree. Statistical methods are typically used.
  • Sequencing: similar to association, except that the relationship occurs over a period of time (e.g., repeat visits to a supermarket, use of a financial planning product). Purchases can be tracked because the purchaser can be identified by an account number or some other means.
  • Regression: used to map data to a prediction value. Linear and nonlinear techniques are used. This is a form of estimation. It often involves identifying metrics and evaluating items (e.g., customers) along those metrics by assigning scores. Sales predictions may be made in this way as well.
  • Forecasting: estimates future values based on patterns within large sets of data (e.g., demand forecasting). This is another form of estimation. Statistical time-series methods are used to predict future sales.
  • Other techniques: these are typically based on advanced artificial intelligence methods. They include case-based reasoning, fuzzy logic, genetic algorithms, and fractal-based transforms.
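
The association class above can be sketched with a small market basket example. The code below is a minimal illustration, not a production algorithm: it counts item pairs and reports two commonly used measures, support (the share of baskets containing both items) and confidence (the share of baskets containing the first item that also contain the second). The function name pair_rules, the min_support threshold, and the baskets are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to report
        # Each frequent pair yields one rule in each direction.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical baskets, one list of products per purchase.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]

rules = pair_rules(baskets, min_support=0.5)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket with butter also contains bread, so the rule from butter to bread has the highest confidence. Sequencing could be approached in a similar way if each basket also carried a customer identifier and a date.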