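Data Mining Technology: Graduation Thesis Literature Review with Chinese-English Parallel Translation

Introduction to Data Mining Technology

Chinese-English Parallel Translation of Foreign Literature

English Original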

Introduction to Data Mining

Abstract: Microsoft® SQL Server™ 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios (targeted mailing, forecasting, market basket, and sequence clustering) to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools that are included in this release of SQL Server.

Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.

Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool you can compare the predictive accuracy of your models and determine the best model.
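As a rough illustration of what comparing models means, the following Python sketch scores two candidate models against the same holdout cases and picks the more accurate one. It is a simplification under stated assumptions and does not reproduce the Mining Accuracy Chart itself; the function names, the attribute names (age, commute_km), and the data are all hypothetical.

```python
def accuracy(predict, cases):
    """Fraction of cases for which predict(inputs) equals the actual outcome."""
    correct = sum(1 for inputs, actual in cases if predict(inputs) == actual)
    return correct / len(cases)


def best_model(models, cases):
    """Return (name, accuracy) of the most accurate model on the given cases."""
    scores = {name: accuracy(fn, cases) for name, fn in models.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical holdout cases: (input attributes, did the customer buy a bike?)
holdout = [
    ({"age": 25, "commute_km": 2}, True),
    ({"age": 61, "commute_km": 30}, False),
    ({"age": 34, "commute_km": 5}, True),
    ({"age": 47, "commute_km": 12}, False),
]

# Two hypothetical models, each a function from inputs to a predicted outcome.
models = {
    "age_rule": lambda c: c["age"] < 40,
    "always_yes": lambda c: True,
}

print(best_model(models, holdout))
```

Scoring every model on the same cases that were held back from training is what makes the comparison fair.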

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.

Just as important as the tools that you use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it, and it translates them into a mining model — it is the engine behind the process.

Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.

To demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with SQL Server 2005, and it supports OLAP and data mining functionality. To make the sample database available, select it during installation in the “Advanced” dialog for component selection.

Adventure Works

AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base.

Adventure Works sells products wholesale to specialty shops and to individuals through the Internet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.

For more information on Adventure Works Cycles see "Sample Databases and Business Scenarios" in SQL Server Books Online.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

North America (83%)

Europe (12%)

Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004.

The products in the database are broken down by subcategory, model, and product.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.

Working in an IDE is beneficial for the following reasons:

The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.

If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected to the server. In this way, changes are reflected directly on the server without having to deploy the solution.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed within Business Intelligence Development Studio. Using the Business Intelligence Development Studio tools, you develop and test the data mining solution, using an iterative process to determine which models work best for a given situation. When the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus to SQL Server Management Studio. Using SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing mining models and creating predictions from them.

Data Transformation Services

Data Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools in SQL Server 2005. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data before using the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy the packages to a server and run them on a regularly scheduled basis. This is useful if, for example, you collect data weekly and want to perform the same cleaning transformations each time in an automated fashion.
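To make the idea of a repeatable cleaning step concrete, here is a minimal Python sketch of the kind of transformation you might apply to each weekly batch. It is not DTS and uses no DTS API; the function name, the column names (customer_id, region, income), and the cleaning rules are assumptions chosen for illustration.

```python
def clean_rows(rows):
    """Trim text, drop rows without a customer id, and drop duplicate ids."""
    seen = set()
    cleaned = []
    for row in rows:
        cid = (row.get("customer_id") or "").strip()
        if not cid or cid in seen:
            continue  # skip rows with a missing or already-seen id
        seen.add(cid)
        income = row.get("income")
        cleaned.append({
            "customer_id": cid,
            # Normalize region capitalization; label missing regions explicitly.
            "region": (row.get("region") or "unknown").strip().title(),
            # Convert income to a number; keep missing values as None.
            "income": float(income) if income not in (None, "") else None,
        })
    return cleaned


# Hypothetical weekly batch with typical problems: padding, duplicates, gaps.
weekly_batch = [
    {"customer_id": " C001 ", "region": "europe", "income": "52000"},
    {"customer_id": "C001", "region": "Europe", "income": "52000"},
    {"customer_id": "", "region": "Australia", "income": "41000"},
    {"customer_id": "C002", "region": None, "income": ""},
]

for row in clean_rows(weekly_batch):
    print(row)
```

Because the same function runs unchanged on every batch, the cleaning is consistent from week to week, which is the point of packaging the transformations.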

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution, by adding each project to a solution in Business Intelligence Development Studio.

Mining Model Algorithms

Data mining algorithms are the foundation from which mining models are created. The variety of algorithms included in SQL Server 2005 allows you to perform many types of analysis. For more specific information about the algorithms and how they can be adjusted using parameters, see "Data Mining Algorithms" in SQL Server Books Online.

Microsoft Decision Trees

The Microsoft Decision Trees algorithm supports both classification and regression and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. The model seeks to find a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, therefore allowing you to predict the outcome of the predicted attribute.
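The choice of which attribute to split on can be sketched in a few lines of Python. The sketch below scores each input attribute by information gain and picks the highest-scoring one. It illustrates the general technique only and makes no claim about the scoring used inside the Microsoft Decision Trees algorithm; the function names, attribute names, and cases are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(cases, attribute, target):
    """Reduction in entropy of the target after splitting on the attribute."""
    base = entropy([c[target] for c in cases])
    groups = {}
    for c in cases:
        groups.setdefault(c[attribute], []).append(c[target])
    remainder = sum(len(g) / len(cases) * entropy(g) for g in groups.values())
    return base - remainder


def best_split(cases, attributes, target):
    """Input attribute with the strongest relationship to the target."""
    return max(attributes, key=lambda a: information_gain(cases, a, target))


# Hypothetical cases: car ownership separates buyers perfectly, region does not.
cases = [
    {"owns_car": "no", "region": "Europe", "buys_bike": "yes"},
    {"owns_car": "no", "region": "Pacific", "buys_bike": "yes"},
    {"owns_car": "yes", "region": "Europe", "buys_bike": "no"},
    {"owns_car": "yes", "region": "Pacific", "buys_bike": "no"},
]

print(best_split(cases, ["owns_car", "region"], "buys_bike"))
```

A full tree builder would apply the same selection recursively to each group of cases and stop when no attribute improves the prediction.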

Microsoft Clustering

The Microsoft Clustering algorithm uses iterative techniques to group records from a dataset into clusters containing similar characteristics. Using these clusters, you can explore the data, learning more about the relationships that exist, which may not be easy to derive logically through casual observation. Additionally, you can create predictions from the clustering model created by the algorithm. For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. Observing how these clusters are distributed, you can better understand how the records in a dataset interact, as well as how that interaction affects the outcome of a predicted attribute.
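The iterative grouping can be illustrated with plain k-means, used here as a stand-in for the general idea and not as a description of how Microsoft Clustering is implemented. The sketch is in Python; the function name, the naive choice of the first k points as starting centers, and the sample data (income in thousands, vacations per year) are assumptions.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly reassigning and re-centering."""
    centroids = [points[i] for i in range(k)]  # naive start: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical records: (income in thousands, vacations abroad per year).
people = [(30, 0), (32, 1), (31, 0), (80, 2), (85, 2), (82, 3)]

centroids, clusters = kmeans(people, k=2)
for center, members in zip(centroids, clusters):
    print(center, members)
```

Each pass moves the centers closer to the natural groups in the data, which is the sense in which the technique is iterative.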

Microsoft Naïve Bayes

The Microsoft Naïve Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. The Microsoft Naïve Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how various input attributes are distributed in the different states of the predicted attribute.
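The probability calculation described above can be sketched in Python as follows. Training counts how often each input state occurs with each state of the predictable attribute, and prediction multiplies the resulting conditional probabilities under the independence assumption. This is a generic naive Bayes sketch and not the product's implementation; the add-one smoothing, the function names, and the data are assumptions added for illustration.

```python
from collections import Counter, defaultdict


def train(cases, target):
    """Count class frequencies and per-class frequencies of each input state."""
    class_counts = Counter(c[target] for c in cases)
    value_counts = defaultdict(Counter)  # (attribute, class) -> state counts
    values = defaultdict(set)            # attribute -> observed states
    for c in cases:
        for attr, val in c.items():
            if attr != target:
                value_counts[(attr, c[target])][val] += 1
                values[attr].add(val)
    return class_counts, value_counts, values


def predict(model, inputs):
    """Return the normalized probability of each class for the given inputs."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for attr, val in inputs.items():
            # Conditional probability with add-one smoothing (an assumption).
            score *= (value_counts[(attr, cls)][val] + 1) / (n + len(values[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Hypothetical discrete cases.
cases = [
    {"commute": "short", "owns_car": "no", "buys_bike": "yes"},
    {"commute": "short", "owns_car": "yes", "buys_bike": "yes"},
    {"commute": "long", "owns_car": "yes", "buys_bike": "no"},
    {"commute": "long", "owns_car": "no", "buys_bike": "no"},
    {"commute": "short", "owns_car": "no", "buys_bike": "yes"},
]

model = train(cases, "buys_bike")
print(predict(model, {"commute": "short", "owns_car": "no"}))
```

Because training is only counting, the model is cheap to build, which matches the description of Naïve Bayes as a quick starting point.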

Microsoft Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.

Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a length of several months or years.

A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.
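As a very small illustration of predicting a continuous value from its own history, the Python sketch below fits a first-order autoregression by least squares and projects it forward. It is a deliberately simple stand-in and not the Microsoft Time Series algorithm; it also ignores cross-variable correlations. The function names and the sales figures are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a + b * x[t-1]; returns (a, b)."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    a = mean_y - b * mean_x
    return a, b


def forecast(series, steps):
    """Project the fitted relationship forward for the given number of steps."""
    a, b = fit_ar1(series)
    last, out = series[-1], []
    for _ in range(steps):
        last = a + b * last
        out.append(last)
    return out


# Hypothetical monthly sales growing by 10 percent each month.
monthly_sales = [100.0, 110.0, 121.0, 133.1, 146.41]

print(forecast(monthly_sales, steps=2))
```

The position in the list plays the role of the case series here: it fixes the order of the observations, and the values themselves are the continuous variable being predicted.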

Microsoft Neural Network

In Microsoft SQL Server 2005 Analysis Services, the Microsoft Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm provider, given each state of the predictable attribute, the algorithm calculates probabilities for each possible state of the input attribute. The algorithm provider processes the entire set of cases, iteratively comparing the predicted classification of the cases with the known actual classification of the cases. The errors from the initial classification of the first iteration of the entire set of cases are fed back into the network and used to modify the network's performance for the next iteration, and so on. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes. One of the primary differences between this algorithm and the Microsoft Decision Trees algorithm, however, is that its learning process optimizes network parameters toward minimizing the error, while the Microsoft Decision Trees algorithm splits rules in order to maximize information gain. The algorithm supports the prediction of both discrete and continuous attributes.

Microsoft Linear Regression

The Microsoft Linear Regression algorithm is a particular configuration of the Microsoft Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.

Microsoft Logistic Regression

The Microsoft Logistic Regression algorithm is a particular configuration of the Microsoft Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes.
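A network with no hidden layer reduces to a weighted sum of the inputs passed through a sigmoid, which is what the following Python sketch trains. The prediction error on each case is fed back to adjust the weights, echoing the iterative process described for the neural network. This is a generic logistic regression sketch and not the product's implementation; the function names, learning rate, epoch count, and data are assumptions.

```python
from math import exp


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def train_logistic(cases, labels, epochs=2000, rate=0.1):
    """Fit weights and bias by feeding each prediction error back into them."""
    weights = [0.0] * len(cases[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(cases, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = y - p  # difference between actual and predicted outcome
            weights = [w + rate * error * v for w, v in zip(weights, x)]
            bias += rate * error
    return weights, bias


def predict_proba(model, x):
    """Probability that the outcome is 1 for the input vector x."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Hypothetical inputs: (commute distance in tens of km, owns a car: 0 or 1).
cases = [(0.2, 0.0), (0.5, 0.0), (1.0, 1.0), (3.0, 1.0), (2.5, 1.0), (0.4, 0.0)]
labels = [1, 1, 0, 0, 0, 1]  # 1 = bought a bike

model = train_logistic(cases, labels)
print(predict_proba(model, (0.3, 0.0)), predict_proba(model, (2.0, 1.0)))
```

Adding a hidden layer between the inputs and the output would turn this into the multilayer perceptron that the Microsoft Neural Network section describes.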

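Chinese Translation

Introduction to Data Mining Technology

Abstract: Microsoft® SQL Server™ 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios (targeted mailing, forecasting, market basket, and sequence clustering) to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools.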

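Introduction

The data mining tutorial walks you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build comprehensive solutions for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in more detail later in the tutorial.

The most visible parts of SQL Server 2005 are the workspaces used to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. With Business Intelligence Development Studio, you can build an Analysis Services project while disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in detail later. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.

All of the data mining tools are found in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you create a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.

Your project will often contain several mining models, so before you can use a model to create predictions, you need to determine which model is the most accurate. For this reason, the editor contains a model comparison tool, the Mining Accuracy Chart tab. Using this tool, you can compare the predictive accuracy of your models and determine the best one.

To create predictions, you use the DMX language. DMX extends traditional SQL syntax and contains commands for creating, modifying, and predicting against mining models. For more information about DMX, see the "Data Mining Extensions (DMX) Reference" section in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor includes a tool called "Prediction Query Builder", which lets you build DMX queries in a graphical interface. You can also view the automatically generated DMX statements in the tool.

Beyond the tools described above, it is equally important to understand how data mining models themselves are built. The key to building a model is the data mining algorithm, which finds patterns in the data you supply and converts them into a usable mining model.

Some of the most important steps in building a data mining solution are consolidating and preparing the data used to build the models. SQL Server 2005 includes a DTS working environment and DTS tools for cleaning, validating, and preparing data. For more information about DTS, see the "DTS Data Mining Tasks and Transformations" section in SQL Server Books Online.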

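Adventure Works Database

The AdventureWorksDW database is based on a fictional bicycle manufacturing company named "Adventure Works Cycles" (AW for short). AW produces metal and composite bicycles and sells them to commercial markets in North America, Europe, and Asia. Its main operations are based in Bothell, Washington, with 500 employees, and regional sales teams are located throughout its markets.

AW sells its products wholesale and to individual customers through the Internet. The data mining examples in this tutorial use this Internet sales data as their source.

For more information about the AW database, see the following section in SQL Server Books Online: "Sample Databases and Business Scenarios".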

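Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

North America (83%)

Europe (12%)

Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004. The products in the database are classified by subcategory, model, and product.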

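Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE, you can build a complete solution in it while disconnected from the server. You can change your data mining objects as much as you like, but the changes are not reflected on the server until you deploy the project.

An Analysis Services (SSAS) database integrates several technologies and serves as the foundation for data mining models and OLAP. You can use Business Intelligence Development Studio to create and edit an SSAS project and deploy it to one or more SSAS servers. If you are developing an SSAS project, you can also use Business Intelligence Development Studio to connect directly to the database, so that your changes take effect in the database immediately.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you work in a connected environment, where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks involved in creating a data mining solution are performed in Business Intelligence Development Studio. Using Business Intelligence Development Studio, you develop and test the data mining solution, using an iterative process to determine the best models for a given situation. Once the developer is satisfied with the solution, it can be deployed to an Analysis Services server.

From this point, the focus shifts from development to maintenance and use, and thus to SQL Server Management Studio. In SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing mining models and creating predictions from them.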

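Data Transformation Services

In SQL Server 2005, Data Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data and then use that data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy packages to a server and run them on a regular schedule. This is useful if, for example, you collect data weekly and want to perform the same cleaning transformations automatically each time.

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution by adding each project to a solution in Business Intelligence Development Studio.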

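Mining Model Algorithms

Data mining algorithms are the foundation on which mining models are created. The variety of algorithms in SQL Server 2005 lets you perform many types of analysis. For more information about the algorithms and how to adjust their parameters, see "Data Mining Algorithms" in SQL Server Books Online.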

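Decision Trees

The Decision Trees algorithm supports both classification and regression and works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes creates a split that improves on the prediction of the existing node. The model seeks a combination of attributes and their states that produces a disproportionate distribution of states in the predicted attribute, which allows you to predict the outcome of the predicted attribute.

Clustering

The Clustering algorithm uses iterative techniques to group records from a dataset into clusters with similar characteristics. Using these clusters, you can explore the data and learn more about the relationships that exist, which may not be easy to derive through casual observation. You can also create predictions from the clustering model created by the algorithm. For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation abroad twice a year. By observing how these clusters are distributed, you can better understand how the records interact and how that interaction affects the outcome of a predicted attribute.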

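Naïve Bayes

The Naïve Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. The Naïve Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how various input attributes are distributed in the different states of the predicted attribute.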

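Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.

Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a period of several months or years.

A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.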

Neural Network

In Microsoft SQL Server 2005 Analysis Services, the Microsoft Neural Network algorithm creates, by constructing
