Clustering Intrusion Detection Alarms to Support Root Cause Analysis

KLAUS JULISCH

IBM Research, Zurich Research Laboratory

It is a well-known problem that intrusion detection systems overload their human operators by triggering thousands of alarms per day. This article presents a new approach for handling intrusion detection alarms more efficiently. Central to this approach is the notion that each alarm occurs for a reason, which is referred to as the alarm's root cause. This article observes that a few dozen rather persistent root causes generally account for over 90% of the alarms that an intrusion detection system triggers. Therefore, we argue that alarms should be handled by identifying and removing the most predominant and persistent root causes. To make this paradigm practicable, we propose a novel alarm clustering method that supports the human analyst in identifying root causes. We present experiments with real-world intrusion detection alarms to show how alarm clustering helped us identify root causes. Moreover, we show that the alarm load decreases quite substantially if the identified root causes are eliminated so that they can no longer trigger alarms in the future.

Categories and Subject Descriptors: C.2.0 [Computer-Communication Networks]: General—Security and Protection (e.g., Firewalls); C.2.3 [Computer-Communication Networks]: Network Operations—Network Monitoring; D.4.6 [Operating Systems]: Security and Protection; H.2.8 [Database Management]: Database Applications—Data Mining; I.2.6 [Artificial Intelligence]: Learning—Concept Learning

Additional Key Words and Phrases: Intrusion detection, root cause analysis, data mining, cluster analysis, false positives

1. INTRODUCTION

Over the past 10 years, the number as well as the severity of network-based computer attacks have significantly increased [Allen et al. 2000]. As a consequence, classic information security technologies such as authentication and cryptography have gained in importance. Simultaneously, intrusion detection has emerged as a new and potent approach to protect information systems [Bace 2000; Debar et al. 2000]. In this approach, so-called Intrusion Detection Systems (IDSs) are used to monitor information systems for signs of security violations. Having detected such signs, IDSs trigger alarms to report them. These alarms are presented to a human operator who evaluates them and initiates an adequate response.

An earlier version of this article appeared as "Mining Alarm Clusters to Improve Alarm Handling Efficiency" in Proceedings of the 17th Annual Computer Security Applications Conference (ACSAC 2001), New Orleans, LA, December 2001. IEEE Computer Society, 2001.

Author's address: K. Julisch, IBM Zurich Research Laboratory, Säumerstrasse 4, 8803 Rüschlikon, Switzerland; e-mail: kju@zurich.ibm.com.

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© 2002 ACM 0000-0000/2002/0000-0111 $5.00


Examples of possible responses include lawsuits, firewall reconfigurations, and the fixing of discovered vulnerabilities.

Evaluating intrusion detection alarms and conceiving an appropriate response has been found to be a challenging task. In fact, practitioners [Broderick–Editor 1998; Manganaris et al. 2000] as well as researchers [Axelsson 2000; Bloedorn et al. 2000; Clifton and Gengo 2000; Julisch 2001] have observed that IDSs can easily trigger thousands of alarms per day, up to 99% of which are false positives (i.e., alarms that were mistakenly triggered by benign events). This flood of mostly false alarms makes it very difficult to identify the hidden true positives (i.e., those alarms that correctly flag attacks). For example, the manual investigation of alarms has been found to be labor-intensive and error-prone [Broderick–Editor 1998; Dain and Cunningham 2002; Manganaris et al. 2000]. Tools to automate alarm investigation are being developed [Dain and Cunningham 2002; Debar and Wespi 2001; Valdes and Skinner 2001], but there is currently no silver-bullet solution to this problem.

This article presents a novel semi-automatic approach for handling intrusion detection alarms efficiently. Central to this approach is the notion of alarm root causes. Intuitively, the root cause of an alarm is the reason for which it occurs. We have made the key observation that in most environments, there is a relatively small number of highly predominant root causes. Specifically, we have observed that a few dozen root causes generally account for over 90% of all alarms. Moreover, these root causes tend to be persistent, i.e., they do not disappear unless someone removes them. Predominant and persistent root causes are problematic because they trigger an alarm flood that distracts the intrusion detection analyst from spotting real attacks, which tend to be more subtle. We have therefore investigated techniques to efficiently handle such large groups of redundant alarms.

The outcome of our research was a semi-automatic process that consists of two steps: Step one, which is conventionally called root cause analysis, identifies root causes that account for large numbers of alarms. Step two removes these root causes and thereby significantly reduces the future alarm load. The experiments presented in this article show how the one-time effort of identifying and removing root causes pays off by reducing the future alarm load by 87%. This is significant because it enables the intrusion detection analyst to henceforth concentrate on the remaining 13% of alarms. As a general rule, root cause identification and removal should be done roughly once a month in order to keep up with changes in the computing environment.

This article focuses on the first of the above two steps, i.e., on the identification of root causes. To support this step, we have developed a novel alarm clustering method. The motivation for this method stems from the observation that the alarms of a given root cause are generally "similar" (Section 4.2 will formally define our notion of similarity). Our alarm clustering method reverses this implication and groups similar alarms together, assuming that these alarms also share the same root cause. For each alarm cluster, a single so-called generalized alarm is derived. Intuitively, a generalized alarm is a pattern that an alarm must match in order to belong to the respective cluster. We show that knowledge of generalized alarms vastly simplifies root cause analysis.


An example will clarify the relationship between root causes, alarm clustering, and root cause analysis. Let us consider the root cause of a broken TCP/IP stack, which fragments all outgoing IP traffic and thereby triggers "Fragmented IP" alarms. Moreover, let us assume that the broken TCP/IP stack belongs to a popular Web server that is primarily used on workdays. Clearly, all "Fragmented IP" alarms have the same source IP address (namely the IP address of the Web server) and the same source port (namely 80). The targets of the alarms are non-privileged ports of various Web clients. Given that the Web server is mostly used on workdays, it follows that the majority of alarms occur on workdays. Finally, note that "Fragmented IP" alarms are triggered each time the Web server responds to a client request. Given the assumption that the Web server is popular and therefore heavily used, it follows that we are flooded by a large number of "Fragmented IP" alarms.

Our alarm clustering method groups the "Fragmented IP" alarms together and reports them by a single generalized alarm. This generalized alarm states that "source port 80 of the Web server triggers many 'Fragmented IP' alarms on workdays against non-privileged ports of Web clients". Clearly, a generalized alarm like this facilitates the identification of root causes, but human expertise is still needed. Therefore, alarm clustering only supports root cause analysis, but does not completely automate it. Moreover, recall that root cause analysis is only the first step in a two-step process. The second step is to act upon the identified root causes. Ideally, one would always remove root causes (e.g., by fixing the broken TCP/IP stack). Unfortunately, some root causes are not under our control, or they are expensive to remove. Then, custom-made filtering rules (which automatically discard alarms) or correlation rules (which intelligently group and summarize alarms) can be an alternative. Clearly, this second step requires care and human judgment as well.

The novel contribution of this article is fourfold: First, we show that a few root causes generally account for the majority of alarms in an alarm log. Second, we present a novel alarm clustering method that groups similar alarms together and summarizes them by a single generalized alarm. Third, we show that knowledge of generalized alarms vastly simplifies the identification of root causes. Finally, we show that removing root causes can significantly reduce the future alarm load, thus enabling a more thorough analysis of the remaining alarms.

Article Overview. The remainder of this article is organized as follows: Section 2 discusses related work. Section 3 defines our terminology. Section 4 formalizes the problem that alarm clustering should solve. Section 5 proposes an algorithm to solve this problem. Section 6 presents our experience with alarm clustering and root cause analysis. Section 7 contains concluding remarks.

2. RELATED WORK

Related work towards "better" IDSs, which trigger fewer false positives, is described in Section 2.1. Section 2.2 discusses real-time correlation systems for intrusion detection alarms. Section 2.3 surveys how data mining has been used to support the investigation of alarms. Section 2.4 reviews alarm correlation and root cause analysis in the context of network fault management.


2.1 Towards Better IDSs

The intuitively most appealing way of dealing with false positives is to build "better" IDSs, which trigger fewer false positives. This is a challenging endeavor because false positives are the result of multiple problems, including a lack of suitable audit sources [Price 1997; Ptacek and Newsham 1998], harsh real-time requirements (which preclude a thorough analysis of the audit data) [Ilgun 1993; Ptacek and Newsham 1998], the problem that for some events (e.g., failed logins) it is undecidable whether they constitute attacks [Bellovin 1993; Paxson 1999], and the inherent difficulty of writing correct intrusion detection signatures [Kumar 1995; Lee and Stolfo 2000; Mounji 1997; Ning et al. 2001]. A "better" IDS would have to address all these issues, and a small number of research projects have attempted to do so. Examples of IDSs that are less prone to false positives include the embedded detectors technology by Zamboni [2001], a lightweight tool for detecting Web server attacks by Almgren et al. [2000], and a network-based IDS that focuses exclusively on low-level network attacks [Sekar et al. 1999]. Interestingly, all three IDSs share two commonalities: First, they have public signatures that can be tuned to a given environment, and second, they are special-purpose. Special-purpose IDSs are tailored towards detecting one class of attacks (e.g., Web server attacks), and they monitor audit sources that are particularly suitable for this task. A drawback of special-purpose IDSs is that they must be combined with other complementary IDSs to obtain comprehensive coverage.

2.2 Alarm Correlation

Alarm Correlation Systems (ACSs) [Cuppens 2001; Cuppens and Miège 2002; Dain and Cunningham 2002; Debar and Wespi 2001; Staniford et al. 2000; Valdes and Skinner 2001] try to group alarms so that the alarms of the same group pertain to the same phenomenon (e.g., the same attack). In that way, they offer a more condensed view on the security issues raised by an IDS. In particular, they make it easier to distinguish false positives from real security threats. As ACSs operate in real-time, they can be viewed as real-time clustering systems (after all, they also group/cluster alarms). However, the clustering method proposed in this article contrasts along three dimensions with prior work:

Depth of analysis. Being conceived for off-line usage, our clustering algorithm performs a more thorough analysis than ACSs. For example, consider a phenomenon that only occurs on Saturdays (e.g., false positives due to weekly system backups). Our clustering algorithm is able to correctly group and report the resulting alarms, whereas ACSs do not have this capability because it is difficult to implement in real-time. Moreover, to reliably identify a weekly alarm pattern, one must observe at least several weeks of alarms. Clearly, delaying correlation results for weeks defeats the very purpose of real-time alarm correlation.

Ease of use. Some ACSs have dozens of configuration parameters, which take experience to set [Debar and Wespi 2001; Valdes and Skinner 2001]. Other ACSs face a knowledge engineering bottleneck because they require the user to specify correlation rules [Cuppens 2001; Cuppens and Miège 2002]. The ACS by Dain and Cunningham [2002] learns correlation rules from the user. To this end, the user has to manually correlate alarms, so the system can learn his or her ability. Unfortunately, manual alarm correlation is difficult and time-consuming. As will become apparent throughout this article, our clustering algorithm is easy and intuitive to use.

Bias. ACSs are generally optimized to build alarm groups that correspond to attacks. For example, some ACSs implement special techniques to deal with spoofed source addresses and with alarms that belong to larger multi-stage attacks [Cuppens and Miège 2002; Dain and Cunningham 2002; Valdes and Skinner 2001]. Other ACSs reassess the severity of alarm groups and discard alarm groups that are considered benign [Debar and Wespi 2001; Staniford et al. 2000]. Virtually all publications use only attacks to test the proposed ACSs. By contrast, our clustering method is designed to find large groups of mostly false positives. It is questionable whether today's ACSs, with their attack-centric bias, are suitable for finding such groups.

2.3 Data Mining

The idea of using data mining to support alarm investigation is not new. Manganaris et al. [2000] mine association rules over alarm bursts. Subsequently, alarms that are consistent with these association rules are deemed "normal" and get discarded. The risk of discarding true positives is not considered in this work, whereas bounding this risk is central to our approach. Clifton and Gengo [2000] use episode mining to guide the construction of custom-made filtering rules. The present article pursues a similar idea, but mines alarm clusters rather than episode rules. Finally, the previously mentioned work by Dain and Cunningham [2002] is also relevant in this context as it uses data mining techniques to learn correlation rules from hand-labeled training examples.

Some researchers have used data mining to build IDSs more systematically. For example, Barbará et al. [2001] use incremental data mining techniques to detect anomalous network traffic patterns in real time. Lee and Stolfo [2000] use data mining for feature construction and training of classifiers that detect intrusions.

A recent book edited by Barbará and Jajodia [2002] surveys these projects and offers a general treatment of data mining in computer security. Data mining for fraud detection is investigated by Fawcett and Provost [1997], and by Chan and Stolfo [1998]. Finally, in the world of telecommunication networks, Klemettinen [1999] uses association rules and episode rules to support the development of alarm correlation systems. Hellerstein and Ma [2000] pursue the same goal by means of visualization, periodicity analysis, and m-patterns (a variant of association rules requiring mutual implication).

2.4 Network Fault Management

In network fault management [Bouloutas et al. 1994; Houck et al. 1995; Jakobson and Weissman 1993; 1995; Lewis 1993; Nygate 1995; Ohsie 1998; Yemini et al. 1996], alarms indicate problems in a network's operation, such as hardware or software failures, performance degradations, or misconfigurations. As network components are highly inter-dependent, a problem in one component propagates to all transitively dependent components. As a consequence, a problem affecting any single component can impair many other components, most of which report their impairment by means of alarms. The goal of network fault management is to evaluate these alarms and to pinpoint the original problems (the so-called root causes).


Identifying root causes is an instance of the general problem of abductive inference [Ohsie 1998; Peng and Reggia 1987a; 1987b]. Abductive inference is the process of reasoning from effects (i.e., alarms) to causes. Many network management systems do abductive inference in two steps: First, they model the cause-effect propagation in networks, and then, they heuristically search this model for plausible root causes that explain the observed alarms [Bouloutas et al. 1994; Houck et al. 1995]. Other systems require the user to encode his or her knowledge about root cause analysis in expert system rules [Jakobson and Weissman 1993; 1995; Nygate 1995]. Thereby, the problem of abductive inference is off-loaded upon the user. Yet other systems implement root cause analysis by means of case-based reasoning [Lewis 1993] or codebooks [Yemini et al. 1996].

All of the above research is not directly applicable to intrusion detection because the notions of dependence and cause-effect propagation are not easily transferable. Moreover, network management systems can only diagnose known root causes. By contrast, this article describes a clustering method that supports the discovery of new, previously unknown root causes.

3. DEFINITIONS AND NOTATIONS

This section defines concepts that are central to this article, including the notions of root causes, root cause analysis, alarms, alarm logs, generalized alarms, and generalization hierarchies.

Root Causes & Root Cause Analysis. The root cause of an alarm, as defined in the introduction, is the reason for which it is triggered. Another useful mental construct is that root causes are problems that affect components and cause them to trigger alarms. For example, a failure can affect the implementation of a TCP/IP stack and cause it to fragment all outbound IP traffic, which triggers "Fragmented IP" alarms. Similarly, a worm is a root cause that affects a set of hosts and causes them to trigger alarms when the worm spreads. Root cause analysis is the task of identifying root causes as well as the components they affect. We do not attempt to formally define root causes or root cause analysis, because such an attempt seems fruitless. In fact, in the dependability field, the term fault denotes a concept that is very similar to a root cause, and the dependability notion of fault diagnosis corresponds to root cause analysis [Laprie 1992; Powell and Stroud 2001]. Neither faults nor fault diagnosis have been formally defined. Therefore, we consider it unlikely that the intrusion detection equivalents of these terms possess formal definitions.

Alarms, Alarm Logs, & Generalized Alarms. Intrusion detection systems trigger alarms to report presumed security violations. This article models alarms as tuples over the Cartesian product dom(A_1) × ... × dom(A_n), where {A_1, ..., A_n} is the set of alarm attributes and dom(A_i) is the domain (i.e., the range of possible values) of attribute A_i. The alarm attributes (attributes for short) capture intrinsic alarm properties, such as the source IP address of an alarm, its destination IP address, its alarm type (which encodes the observed attack), and its time-stamp. The value that attribute A_i assumes in alarm a is denoted by a[A_i]. This article models alarm logs as sets of alarms. This model is correct because alarms are implicitly ordered by virtue of their time-stamps; moreover, unique alarm identifiers can be used to guarantee that all alarms are pairwise distinct.

A generalized attribute value is a concept name that represents a subset of an attribute domain dom(A_i), i ∈ {1, ..., n}. For example, the generalized attribute value Web-server might represent the subset of IP addresses that host Web servers. Similarly, the generalized attribute value Weekdays can be defined to comprise the subset of time-stamps that fall on weekdays. The extended domain Dom(A_i) of an attribute A_i is the union of the domain dom(A_i) and the set of generalized attribute values that have been defined for A_i. Finally, a generalized alarm is a tuple in [Dom(A_1) × ... × Dom(A_n)] \ [dom(A_1) × ... × dom(A_n)]. Note that a generalized alarm g models the set {a | ∀A_i: (a[A_i] = g[A_i] ∨ a[A_i] ∈ g[A_i])} of ordinary (i.e., ungeneralized) alarms a.

Generalization Hierarchies. For each attribute A_i, let G_i be a single-rooted and connected directed acyclic graph (DAG) on the elements of the extended domain Dom(A_i). The graph G_i is called a generalization hierarchy (a.k.a. is-a hierarchy or taxonomy). For example, Figure 1(c) shows sample generalization hierarchies for IP addresses and port numbers. For two elements x, x̂ ∈ Dom(A_i), we call x̂ a parent of x if the generalization hierarchy G_i contains a directed path from x̂ to x (in symbols: x ⊑ x̂). To extend these definitions from attributes to alarms, let a, â ∈ Dom(A_1) × ... × Dom(A_n) denote two (possibly generalized) alarms. The alarm â is called a parent of alarm a if and only if a[A_i] ⊑ â[A_i] holds for all attributes A_i. This is denoted by a ⊑ â, and the alarm a is said to be more specific than â, while â is called more abstract or general than a.

Example. By way of illustration, Figure 1 shows a network topology, a sample alarm log, and sample generalization hierarchies for IP addresses and port numbers. Let us verify that the following holds:

—Let a be the first alarm in the alarm log of Figure 1(b), i.e., a := (ip1, 80). Then, a[Dest-IP] = ip1 holds.

—The extended domain Dom(IP) of IP addresses is the union of elementary IP addresses (i.e., the set dom(IP) = {p.q.r.s | p, q, r, s ∈ {0, ..., 255}}) and generalized IP addresses (i.e., the set {HTTP/FTP, FIREWALL, DMZ, INTERNET, ANY-IP} of generalized attribute values). Similarly, the extended domain of port numbers is {1, ..., 65535} ∪ {PRIV, NON-PRIV, ANY-PORT}.

—According to Figure 1(c), the IP address ip1 is a firewall, which is a machine in the DMZ, which, more generally, is any IP address. More succinctly, this relationship can be expressed as ip1 ⊑ FIREWALL ⊑ DMZ ⊑ ANY-IP. Moreover, note that ip1 ⊑ ip1 and (ip1, 80) ⊑ (FIREWALL, PRIV) hold.
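To make these definitions concrete, the following Python sketch (our own illustration, not from the paper; the hierarchy encoding and function names are hypothetical) encodes a tree-shaped fragment of the Figure 1(c) hierarchies and checks the relations verified above:

    # Hypothetical encoding of a fragment of Figure 1(c): child -> single parent.
    PARENT = {
        "ip1": "FIREWALL", "FIREWALL": "DMZ", "DMZ": "ANY-IP",
        "ip2": "HTTP/FTP", "HTTP/FTP": "DMZ",
        "80": "PRIV", "PRIV": "ANY-PORT",
    }

    def value_leq(x, x_hat):
        """True iff x ⊑ x_hat, i.e. x_hat lies on the upward path from x (or x = x_hat)."""
        while True:
            if x == x_hat:
                return True
            if x not in PARENT:           # reached the root without meeting x_hat
                return False
            x = PARENT[x]

    def alarm_leq(a, a_hat):
        """a ⊑ â holds iff a[A_i] ⊑ â[A_i] holds for every attribute A_i."""
        return all(value_leq(x, x_hat) for x, x_hat in zip(a, a_hat))

    assert value_leq("ip1", "FIREWALL")                    # ip1 ⊑ FIREWALL
    assert value_leq("ip1", "ANY-IP")                      # ip1 ⊑ DMZ ⊑ ANY-IP
    assert value_leq("ip1", "ip1")                         # ip1 ⊑ ip1
    assert alarm_leq(("ip1", "80"), ("FIREWALL", "PRIV"))  # (ip1,80) ⊑ (FIREWALL,PRIV)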

4. ALARM CLUSTERING PROBLEMS

In this article, we use alarm clustering to extract alarm groups, called clusters, that a skilled user can interpret in terms of root causes. Because alarm clusters can be huge, we additionally summarize them by means of generalized alarms, which facilitates their interpretation. It is intuitively clear that the way in which we group alarms has a strong influence on how useful the resulting alarm clusters are. Ideally, alarm clustering should solve the following problem:

[Figure 1: (a) the network topology of the running example, (b) a sample alarm log with Dest-IP and Dest-Port columns, and (c) generalization hierarchies for IP addresses (ip1, ..., ip7, HTTP/FTP, FIREWALL, DMZ, INTERNET, ANY-IP) and for port numbers (PRIV, NON-PRIV, ANY-PORT).]
Fig. 1. Network, alarm log, and generalization hierarchies of the running example.

Definition 1 (Exact Alarm Clustering Problem). Given an alarm log, the exact alarm clustering problem is to group alarms into clusters such that all alarms of a given cluster share the same root cause.

An algorithm that solves the exact alarm clustering problem eliminates redundancy by grouping alarms that share the same root cause. Furthermore, if one understands the root cause of any alarm in an alarm cluster, then one understands the root causes of all alarms in the cluster.

Unfortunately, there is no algorithmic solution to the exact alarm clustering problem. In fact, root causes are a model-world concept, which only exists in the world of our minds. Computer programs are not aware of root causes and can therefore not enforce the requirement that all alarms of an alarm cluster must share the same root cause. To illustrate this, let us consider a machine whose broken TCP/IP stack fragments most IP traffic. Suppose that this machine is behind a router that itself fragments a substantial fraction of the traffic passing through it. Now, let an IDS in front of the router trigger a "Fragmented IP" alarm for a packet from said machine. Unless substantial background knowledge about the state of the system is available, there is no way of deciding if the alarm's root cause is the broken TCP/IP stack or the fragmenting router. More generally, if only an alarm log is given, it is not possible to decide whether two or more alarms have the same root cause. Hence the impossibility of solving the exact alarm clustering problem.

Given this negative result, we will next define a variant of the exact alarm clustering problem, which is soluble and yet useful in practice. Section 4.1 uses examples to derive this variant, and Section 4.2 refines it.

4.1 Approximate Alarm Clustering Problem

This section uses examples to show that many root causes manifest themselves in alarm clusters that have certain structural properties. We are interested in these structural properties because they will lead to the approximate alarm clustering problem, which is to find alarm clusters that have these structural properties. Following the logic of abductive inference (cf. Section 2.4), we then argue that alarm clusters with these structural properties are most likely the result of root causes. Abductive arguments like this are very common. For example, attack signatures are generally derived by arguing that if an attack has the manifestation M, then detection of M implies the attack. Analogously, we argue that if root causes typically induce alarm clusters having the structural property P, then detection of an alarm cluster having property P is likely to be the result of a root cause. Finally, to determine said structural properties, we consider the following sample root causes:

(1) An HTTP server with a broken TCP/IP stack that fragments outgoing traffic. "Fragmented IP" alarms ensue when the server responds to client requests.

(2) At one site, a misconfigured secondary DNS server performed half-hourly DNS zone transfers from the primary DNS server. The resulting "DNS Zone Transfer" alarms are no surprise.

(3) A Real Audio server whose traffic remotely resembles TCP hijacking attacks. This caused our commercial IDS to trigger countless "TCP Hijacking" alarms.

(4) A firewall that has Network Address Translation (NAT) enabled funnels the traffic of many users and thereby occasionally seems to perform host scans. In detail, a NAT-enabled firewall acts as proxy for its users. When these users simultaneously request external services, then the firewall will proxy these requests, and the resulting SYN packets resemble SYN host sweeps.

(5) A load-balancing reverse proxy such as Cisco LocalDirector that dispatches Web client requests to the least busy server. The resulting traffic patterns resemble host scans that trigger alarms on most IDSs.

(6) Many network management tools query sensitive MIB variables and thereby trigger IDS alarms. (Other network management tasks such as vulnerability scanning or network mapping offer further examples of root causes.)

(7) Macintosh FTP clients, which issue the SYST command on every FTP connection, trigger an abundance of "FTP SYST command" alarms. The FTP SYST command is reported by some IDSs because it provides reconnaissance information about the FTP server.

(8) A distributed denial-of-service (DDoS) attack launched from an external network against a Web hosting site triggered "SYN Flooding" alarms.

(9) External machines infected with Code Red [CERT 2001] scanning the internal network for vulnerable servers.

Note that the alarms of the first root cause originate from source port 80 of the HTTP server. Moreover, all of these alarms are targeted at HTTP clients on non-privileged ports. In addition, these alarms always have "Fragmented IP" as alarm type. Therefore, the first root cause manifests itself in an alarm cluster that can be modeled by the generalized alarm shown in the first row of Table I. Similarly, the second row of the table shows that the second root cause can be modeled by the generalized alarm of "'DNS zone transfer' alarms being triggered from a non-privileged port of the secondary DNS server against port 53 of the primary DNS server". Analogously, the remaining rows of Table I show the generalized alarms that the other root causes induce. The "RC" (root cause) column of the table refers to the item numbers in the above enumeration of root causes; the entry "Non-priv." denotes the set {1025, ..., 65535} of non-privileged ports, and the entry "Privileged" stands for the set of privileged ports below 1025.

Table I. The generalized alarms induced by nine sample root causes.

RC  Source-IP          Src-Port   Destination-IP      Dst-Port    Alarm Type
1   HTTP server        80         HTTP clients        Non-priv.   Fragmented IP
2   Sec. DNS server    Non-priv.  Prim. DNS server    53          DNS zone transfer
3   Real Audio server  7070       Real Audio clients  Non-priv.   TCP hijacking
4   Firewall           Non-priv.  External network    Privileged  Host scan
5   Reverse proxy      Non-priv.  HTTP servers        80          Host scan
6   Mgmt. console      Non-priv.  SNMP clients        161         Suspicious GET
7   Mac FTP clients    Non-priv.  FTP server          21          FTP SYST
8   External network   Non-priv.  HTTP servers        80          SYN flood
9   External network   Non-priv.  Internal network    80          Code Red

Another important observation is that most of the above root causes are extremely persistent in the sense that they keep generating alarms until someone removes them. As a consequence, these root causes are likely to manifest themselves in large alarm clusters. For example, the first root cause triggers a "Fragmented IP" alarm whenever the HTTP server responds to a client request. Typically, HTTP servers are heavily used, and consequently, "Fragmented IP" alarms will abound. Similarly, the misconfigured secondary DNS server triggers one "DNS zone transfer" alarm every thirty minutes. This makes approximately 1440 alarms a month. Using analogous arguments, it becomes clear that all of the above root causes can be expected to trigger large amounts of alarms.

Based on the above examples, we postulate that many root causes manifest themselves in large alarm clusters that are adequately modeled by generalized alarms. By adequately, we mean that generalized alarms are capable of capturing and representing the main features of alarm clusters. In other words, little information is lost when modeling alarm clusters by generalized alarms. To summarize:

Proposition 1 (Alarm Cluster Hypothesis). Root causes frequently manifest themselves in large alarm clusters that are adequately modeled by generalized alarms.

The alarm cluster hypothesis is something like a "natural law" of intrusion detection, and can probably not be proven. However, the alarm cluster hypothesis is not completely unexpected, either. In fact, recall from Section 1 that most intrusion detection alarms are false positives. In other words, most alarms are triggered by benign root causes. Now, note that anomaly detection systems only work because benign (or normal) behavior is systematic and repetitive. In fact, if benign/normal behavior was not systematic, then a model could not capture it, and if it was not repetitive, then models would become outdated too fast. Analogously, benign root causes tend to be systematic (which makes generalized alarms an adequate model for their alarms) and repetitive (which causes them to trigger many alarms). Given that benign root causes account for the majority of alarms, we now see that the alarm cluster hypothesis is plausible. Using the abductive argument given at the beginning of this section, we can finally define:

Definition 2 (Approximate Alarm Clustering Problem). Given an alarm log, the approximate alarm clustering problem is to find large alarm clusters that are adequately modeled by generalized alarms.

4.2 A Framework for Alarm Clustering

A problem with Definition 2 is that it leaves the terms "large" and "adequately" undefined. This section addresses this problem and formalizes these terms. Before reading on, please recall the definitions of Section 3. In particular, recall that alarms are tuples over the n-dimensional attribute space dom(A_1) × ... × dom(A_n), that a[A_i] denotes the A_i-value of alarm a, that generalization hierarchies G_i are single-rooted directed acyclic graphs over the extended attribute domains Dom(A_i) := dom(A_i) ∪ {generalized attribute values for A_i}, and that the relation ⊑ orders attribute values and alarms by their generality.

To formalize the notion of adequacy, we define the dissimilarity d(·,·), which takes two alarms as input and returns a numerical measure of how adequately these alarms can be modeled by a single generalized alarm. Dissimilarity is inversely related to similarity, i.e., the alarms a1 and a2 are similar (and can be adequately modeled by a generalized alarm) if d(a1, a2) is small. To define dissimilarity, we need user-defined generalization hierarchies G_i for all attributes A_i, i = 1, ..., n. Section 6.1 considers the problem of defining such generalization hierarchies. For the time being, we simply assume that suitable generalization hierarchies have been defined. For example, one could use the generalization hierarchies of Figure 1(c) for IP addresses and port numbers.

We begin by defining dissimilarity for individual attributes and then generalize it to alarms. To this end, let A_i be an attribute, and let G_i be the aforementioned generalization hierarchy for A_i, i = 1, ..., n. The dissimilarity d(x1, x2) between any two elements x1, x2 ∈ Dom(A_i) is the length of the shortest path in G_i that connects x1 and x2 via a common parent p, i.e., d(x1, x2) := min{δ(x1, p) + δ(x2, p) | p ∈ G_i, x1 ⊑ p, x2 ⊑ p}, where δ(·,·) measures the length of the shortest path between two nodes in G_i. For example, in Figure 1(c), we have d(ip1, ip1) = 0, d(ip1, ip4) = 4, and d(PRIV, NON-PRIV) = 2. The dissimilarity measure d(·,·) has its origin in the field of information retrieval, where similar measures have been studied [Rada and Bicknell 1989; Rada et al. 1989; Lin 1998; Resnik 1999].

Next, we extend our dissimilarity measure from attributes to alarms. To this end, let a1, a2 ∈ Dom(A_1) × ... × Dom(A_n) denote two (possibly generalized) alarms. The dissimilarity d(a1, a2) between any two alarms a1 and a2 is defined as the sum of attribute dissimilarities, i.e., d(a1, a2) := Σ_{i=1}^{n} d(a1[A_i], a2[A_i]). An obvious generalization that we do not pursue further in this article is to use the weighted sum of attribute dissimilarities as the inter-alarm dissimilarity. By way of illustration, in the sample log of Figure 1(b), we have d((ip1, 80), (ip2, 1234)) = d(ip1, ip2) + d(80, 1234) = 2 + 4 = 6.
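The following Python sketch (our own illustration; since Figure 1 is not fully reproduced here, the hierarchy below is a made-up fragment, and the printed values illustrate the mechanics rather than the paper's exact numbers) computes δ by an upward breadth-first search and then minimizes over common parents:

    from collections import deque

    # Hypothetical child -> parents encoding (a DAG in general; here a tree fragment).
    PARENTS = {
        "ip1": ["FIREWALL"], "FIREWALL": ["DMZ"], "DMZ": ["ANY-IP"],
        "ip2": ["HTTP/FTP"], "HTTP/FTP": ["DMZ"],
        "80": ["PRIV"], "PRIV": ["ANY-PORT"],
        "1234": ["NON-PRIV"], "NON-PRIV": ["ANY-PORT"],
    }

    def delta_to_ancestors(x):
        """Map each node reachable upward from x to its shortest distance delta(x, .)."""
        dist, queue = {x: 0}, deque([x])
        while queue:
            node = queue.popleft()
            for p in PARENTS.get(node, []):
                if p not in dist:
                    dist[p] = dist[node] + 1
                    queue.append(p)
        return dist

    def d_attr(x1, x2):
        """d(x1, x2) = min over common parents p of delta(x1, p) + delta(x2, p)."""
        d1, d2 = delta_to_ancestors(x1), delta_to_ancestors(x2)
        return min(d1[p] + d2[p] for p in set(d1) & set(d2))

    def d_alarm(a1, a2):
        """Inter-alarm dissimilarity: the sum of attribute dissimilarities."""
        return sum(d_attr(x1, x2) for x1, x2 in zip(a1, a2))

    print(d_attr("PRIV", "NON-PRIV"))               # 2, via ANY-PORT (as in the text)
    print(d_attr("ip1", "ip2"))                     # 4 in this fragment, via DMZ
    print(d_alarm(("ip1", "80"), ("ip2", "1234")))  # 4 + 4 = 8 in this fragment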

Why does d(a1, a2) measure how adequately the alarms a1 and a2 can be modeled by a generalized alarm g? To answer this question, let g ∈ Dom(A_1) × ... × Dom(A_n) be a generalized alarm to which both alarms can be generalized, i.e., a1, a2 ⊑ g. Note that d_i := d(g, a_i), i = 1, 2, is the number of times that an attribute in alarm a_i must be generalized to transform a_i into g. Therefore, the smaller the sum d_1 + d_2 is, the smaller the number of generalization steps that separate g from a1 and a2, and the more adequately g models the alarms a1 and a2. Conversely, if the sum d_1 + d_2 is large, then g is an inadequate model because it is too abstract, and insufficiently captures the detailed information of the alarms a1 and a2. We can therefore use the sum d_1 + d_2 to measure the adequacy of g. Given that the dissimilarity d(a1, a2) equals the minimum value that the sum d_1 + d_2 can possibly assume for any g, we see that d(a1, a2) measures the adequacy of the most adequate generalized alarm. Hence, a small dissimilarity value implies that an adequate model exists.

In generalization of the dissimilarity d(·,·), we now define the heterogeneity H(C) of an alarm cluster C. Heterogeneity is a function that returns a small value when the cluster C can be adequately modeled by a generalized alarm. To formally define heterogeneity, let g be a generalized alarm that is a parent of all alarms in C, i.e., ∀a ∈ C: a ⊑ g. The average dissimilarity d̄(g, C) between g and C, and the heterogeneity H(C) are defined as follows:

    d̄(g, C) := 1/|C| × Σ_{a ∈ C} d(g, a)                                       (1)

    H(C) := min{ d̄(g, C) | g ∈ Dom(A_1) × ... × Dom(A_n), ∀a ∈ C: a ⊑ g }      (2)

Intuitively, the average dissimilarity measures how adequately the generalized alarm g models the cluster C on the average. Moreover, a small heterogeneity value implies that there exists a generalized alarm that models C adequately. A generalized alarm g with ∀a ∈ C: a ⊑ g and d̄(g, C) = H(C) is called a cover of C. A cover of C is a maximally adequate model for C. If all generalization hierarchies are trees, rather than DAGs, then there exists exactly one cover for each alarm cluster C. Finally, for C = {a1, a2}, we have H(C) = 1/2 × d(a1, a2). The data mining problem to be solved now becomes:

Definition 3 (Alarm Clustering Problem). Let L be an alarm log, min_size ∈ ℕ an integer, and G_i, i = 1, ..., n, a generalization hierarchy for each attribute A_i. The alarm clustering problem (L, min_size, G_1, ..., G_n) is to find a set C ⊆ L that minimizes the heterogeneity H(C), subject to the constraint that |C| ≥ min_size holds. We call C an alarm cluster, or cluster for short.

In other words, among all sets C ⊆ L that satisfy |C| ≥ min_size, a set with minimum heterogeneity has to be found. If there are several such sets, then any one of them can be picked. By minimizing heterogeneity, we maximize the adequacy of the generalized alarms that model C. Note that the min_size parameter formalizes our hitherto intuitive notion of "largeness", which first appeared in Proposition 1. Finally, once the cluster C has been found, the remaining alarms in L \ C can be searched for additional clusters.
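For tree-shaped hierarchies, a cover and H(C) can be computed by brute force: enumerate, per attribute, the common ancestors of all cluster members, and take their Cartesian product as the candidate generalized alarms. The Python sketch below (our own toy illustration with hypothetical names; a real log needs the heuristic algorithm of Section 5) evaluates equations (1) and (2) this way:

    from itertools import product

    # Hypothetical tree hierarchy: child -> single parent.
    PARENT = {"ip1": "DMZ", "ip2": "DMZ", "DMZ": "ANY-IP",
              "80": "PRIV", "21": "PRIV", "PRIV": "ANY-PORT"}

    def ancestors(x):
        """x and everything above it, with upward distances (tree case)."""
        dist, d = {}, 0
        while True:
            dist[x] = d
            if x not in PARENT:
                return dist
            x, d = PARENT[x], d + 1

    def cover_and_heterogeneity(cluster):
        """Return a cover g of the cluster and H(C), by exhaustive search."""
        n = len(cluster[0])
        dist_maps = [[ancestors(a[i]) for a in cluster] for i in range(n)]
        candidates = [set.intersection(*(set(m) for m in maps)) for maps in dist_maps]
        best_g, best_h = None, float("inf")
        for g in product(*candidates):          # every g with a ⊑ g for all a in C
            avg = sum(dist_maps[i][k][g[i]]     # average dissimilarity, equation (1)
                      for i in range(n) for k in range(len(cluster))) / len(cluster)
            if avg < best_h:
                best_g, best_h = g, avg
        return best_g, best_h                   # heterogeneity H(C), equation (2)

    C = [("ip1", "80"), ("ip2", "21")]
    print(cover_and_heterogeneity(C))  # (('DMZ', 'PRIV'), 2.0) = 1/2 * d(a1, a2)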


Some concluding remarks are in order. First, note that we initially defined (alarm) clusters to be groups of alarms that share the same root cause (cf. Definition 1). Henceforth, an (alarm) cluster is a set of alarms as defined in Definition 3. Second, note that stealthy attacks that trigger fewer than min_size alarms do not yield any clusters. More generally, solving the alarm clustering problem will not enable us to discover all kinds of root causes. However, based on its derivation, we know that solving the alarm clustering problem will enable us to discover a large and practically relevant class of root causes (cf. Proposition 1). Finally, it is worth pointing out that alarm clusters have an intensional description in the form of covers. Therefore, covers can be used to summarize and report alarm clusters. Because alarm clusters tend to be huge, it is frequently more practicable to report alarm clusters by means of their covers, rather than by listing their constituent alarms.

5. PRACTICAL ALARM CLUSTERING

This section describes a heuristic algorithm for solving the alarm clustering problem. In addition, it addresses the issue of setting the min_size parameter. First, however, the following result is of importance:

Proposition 2. Let L be an alarm log, min_size ∈ ℕ an integer, and G_i, i = 1, ..., n, a family of generalization hierarchies. The alarm clustering problem (L, min_size, G_1, ..., G_n) is NP-complete.

Proof. The proof is obtained by reducing the CLIQUE problem [Papadimitriou 1994] to the alarm clustering problem. In the CLIQUE problem, we are given a graph G and an integer k. The goal is to decide whether G contains a k-clique, i.e., a fully connected subgraph of size k. To reduce the CLIQUE problem to the alarm clustering problem, we assign a separate attribute to each node in G. For each edge in G, we create an alarm. The two attributes that correspond to the end points of the edge are set to 1, while all the other attributes are set to zero. The set of all alarms constitutes the alarm log, min_size is set to (k choose 2) = k(k−1)/2, and the generalization hierarchies define 1 to be the single parent of 0, i.e., G_i := 1 → 0 for i = 1, ..., n. Let C be a solution of the alarm clustering problem that has just been defined. Set δ_i := max{a[A_i] | a ∈ C} (note that δ_i ∈ {0, 1} holds). Then, the graph G contains a k-clique if and only if Σ_{i=1}^{n} δ_i = k. Details are given elsewhere [Julisch 2001].

Given the need for a scalable solution, Section 5.1 describes a heuristic algorithm for solving the alarm clustering problem. Section 5.2 discusses extensions to the heuristic algorithm, and Section 5.3 proposes algorithmic support for setting the min_size parameter.

5.1 A Heuristic Alarm Clustering Method

Given the NP-completeness of the alarm clustering problem, we have developed a heuristic algorithm. This algorithm finds clusters C ⊆ L that satisfy |C| ≥ min_size, but do not necessarily minimize the heterogeneity H(C). The heuristic algorithm is a variant of Attribute-Oriented Induction (AOI) [Han et al. 1992; 1993], a well-established technique in the field of data mining. Our modifications over the classical AOI are twofold: First, we generalize attributes more conservatively than classical AOI. Second, we use a different termination criterion, which is reminiscent of density-based clustering [Agrawal et al. 1998; Han and Kamber 2000].

For the sake of simplicity, we will assume that all generalization hierarchies are trees (we will revisit this assumption in Section 5.2). It follows from this assumption that each alarm cluster possesses a unique cover. Our heuristic algorithm constructs this cover directly, i.e., it does not make the detour of first finding an alarm cluster and then deriving its cover. Rather, alarm clusters must be determined after the fact by determining for each cover the set of alarms that match it. In more detail, the algorithm starts with the alarm log L and repeatedly generalizes the alarms in L. Generalizing alarms is done by choosing an attribute A_i and replacing the A_i values of all alarms in L by their parents in G_i. This process continues until an alarm has been found to which at least min_size of the original alarms can be generalized. This alarm constitutes the output of the algorithm.

Figure 2 shows the pseudo-code of the alarm clustering method. Line 1 copies the alarm log L into a relational database table T. This table has one attribute (or column) for each alarm attribute A_i. In addition, the table T possesses the integer-valued attribute count, which is used for bookkeeping only. Line 2 sets the count attributes of all alarms to 1. In lines 3 to 9, the algorithm loops until an alarm a has been found whose count value is at least min_size. Line 4 selects an alarm attribute A_i according to a heuristic, which we will describe in a moment. Lines 5 and 6 replace the A_i values of all alarms in T by their parent values in G_i. By doing so, previously distinct alarms can become identical. Two alarms a and a' are identical if a[A_i] = a'[A_i] holds for all attributes A_i, while a[count] and a'[count] are allowed to differ. Steps 7 and 8 merge identical alarms into a single generalized alarm whose count value equals the sum of the individual counts. In this way, the count attribute always reflects the number of original alarms that are summarized by a given generalized alarm. Moreover, each generalized alarm a represents an alarm cluster of size a[count]. To conclude the discussion, we now specify the heuristic that we use in line 4:

Definition 4 (Attribute Selection Heuristic). For each attribute A_i, let F_i := max{f_i(v) | v ∈ Dom(A_i)} be the maximum of the function

    f_i(v) := SELECT sum(count) FROM T WHERE A_i = v,

which sums the counts of all alarms a ∈ T with a[A_i] = v. Line 4 of Figure 2 selects any attribute A_i whose F_i value is minimal, i.e., the F_i value must satisfy ∀j: F_i ≤ F_j.

The intuition behind this heuristic is that if there is an alarm a that satisfies a[count] ≥ min_size, then F_i ≥ f_i(a[A_i]) ≥ min_size holds for all alarm attributes A_i, i = 1, ..., n. In other words, an alarm a with a[count] ≥ min_size cannot exist (and the algorithm cannot terminate) unless F_i ≥ min_size holds for all attributes A_i. We therefore use it as a heuristic to increase any of the smallest F_i values by generalizing its corresponding attribute A_i. Other heuristics are clearly possible, but the one above performed favorably in informal experiments. Moreover, rather than merely varying the heuristic, one could even conceive a completely different clustering algorithm, for example one that is based on partitioning or hierarchical clustering [Han and Kamber 2000; Jain and Dubes 1988]. It is a well-known problem of clustering methods that there are few guidelines for choosing the "right" method, let alone for proving the "correctness" of clustering results [Anderberg 1973; Jain and Dubes 1988; Milligan 1996]. We therefore make no formal claims about the above clustering method except that it works well in practice (cf. Section 6).

Input: An alarm clustering problem (L, min_size, G_1, ..., G_n)
Output: A heuristic solution for (L, min_size, G_1, ..., G_n)
Algorithm:
1:  T := L;                                    // Store log L in table T.
2:  for all alarms a in T do a[count] := 1;    // Initialize counts.
3:  while ∀a ∈ T: a[count] < min_size do {
4:      Use heuristics to select an attribute A_i, i ∈ {1, ..., n};
5:      for all alarms a in T do               // Generalize attribute A_i.
6:          a[A_i] := father of a[A_i] in G_i;
7:      while identical alarms a, a' exist do  // Merge identical alarms.
8:          Set a[count] := a[count] + a'[count] and delete a' from T;
9:  }
10: Output all generalized alarms a ∈ T with a[count] ≥ min_size;

Fig. 2. Heuristic alarm clustering algorithm.
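For illustration, the following Python sketch (our own rendering; the paper's implementation operates on a relational database table, whereas this one uses in-memory counters, and all names are ours) implements Figure 2 together with the attribute selection heuristic of Definition 4 for tree hierarchies:

    from collections import Counter

    def cluster_alarms(log, min_size, parent):
        """Heuristic alarm clustering per Figure 2. `log` is a list of attribute
        tuples; `parent[i]` maps values of attribute A_i to their father in G_i."""
        n = len(log[0])
        table = Counter(log)                               # lines 1-2: tuple -> count
        while all(c < min_size for c in table.values()):   # line 3
            def F(i):                                      # Definition 4: F_i = max_v f_i(v)
                per_value = Counter()
                for alarm, count in table.items():
                    per_value[alarm[i]] += count
                return max(per_value.values())
            i = min(range(n), key=F)                       # line 4: minimal F_i
            generalized = Counter()
            for alarm, count in table.items():             # lines 5-8: generalize, merge
                a = list(alarm)
                a[i] = parent[i].get(a[i], a[i])           # father in G_i (roots stay put)
                generalized[tuple(a)] += count
            if generalized == table:                       # all roots reached: give up
                break
            table = generalized
        return [(a, c) for a, c in table.items() if c >= min_size]   # line 10

    parent = [{"ip1": "DMZ", "ip2": "DMZ", "DMZ": "ANY-IP"},
              {"80": "PRIV", "21": "PRIV", "PRIV": "ANY-PORT"}]
    log = [("ip1", "80"), ("ip1", "80"), ("ip2", "80"), ("ip2", "21")]
    print(cluster_alarms(log, min_size=3, parent=parent))
    # [(('DMZ', '80'), 3)]: the IP attribute has the smaller F_i and is generalized first.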


5.2 DAG-Structured Generalization Hierarchies

This section extends the heuristic alarm clustering algorithm to support DAG-structured generalization hierarchies. When generalization hierarchies are DAGs rather than trees, any node can potentially have multiple parents. Consequently, there is no longer a unique parent to which an attribute value can be generalized. There are two basic strategies for resolving this issue:

Choose-one. This strategy employs user-defined rules to resolve ambiguities. For example, consider an IP address ip that simultaneously runs an HTTP server and an FTP server. Accordingly, HTTP-server and FTP-server are two possible generalizations of ip. The following rule assumes that ip is the destination IP of alarm a, and generalizes ip according to the value of the destination port:

    if a[Destination-port] = 80
    then generalize ip to HTTP-server
    else generalize ip to FTP-server;

A similar rule is conceivable for the case that ip is the source IP address of an alarm.

Explore-all. This strategy pursues all possible generalizations in parallel and retains the one that first leads to a generalized alarm of size min_size or larger.

Both strategies have been studied in the context of classic AOI [Cheung et al. 2000; Han et al. 1992], and we can directly reuse the solutions proposed there. In the choose-one strategy, the problem of induction anomaly needs to be solved [Cheung et al. 2000]. Induction anomalies arise when a user-defined rule tries to access attribute values that have been "abstracted away" in a previous generalization step. For example, the above rule for the HTTP/FTP server is only applicable when the destination port number has not previously been generalized. The solution to the induction anomaly problem is to determine the generalization path for all attribute values before the first attribute is actually generalized [Cheung et al. 2000]. The explore-all strategy is conceptually easy to implement, as well [Han et al. 1992]. First, line 6 of Figure 2 is replaced by:

6.1:    T := T \ {a};
6.2:    for all parents p that a[A_i] has in G_i do
6.3:        { a' := a;  a'[A_i] := p;  T := T ∪ {a'}; }

In other words, the attribute A_i of alarm a is generalized in all possible ways, and the resulting alarms a' are added to T. Now, however, the clusters modeled by the generalized alarms in T are no longer disjoint. It is therefore incorrect to merge generalized alarms by adding their counts as is done in line 8 of Figure 2. The easiest way to determine the correct count values of generalized alarms is to rescan the original alarm log and to determine the number of original alarms that match them. More efficient implementations are possible.
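Under explore-all, a sketch of this rescanning step could look as follows (our own illustration; `matches(a, g)` stands for any attribute-wise test of a ⊑ g, such as the one sketched in Section 3):

    def recount(generalized_alarms, log, matches):
        """Correct the counts of explore-all covers by rescanning the original log,
        since overlapping clusters must not be counted by summing merged counts."""
        return {g: sum(1 for a in log if matches(a, g)) for g in generalized_alarms}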

5.3 Algorithmic Support for Setting the min_size Parameter

Currently, the user is the only authority over the min_size parameter. If the user chooses an excessively large min_size value, then the quest for a cluster of at least this size can force the clustering algorithm to merge alarms with different root causes. This is undesirable because the resulting alarm clusters can be hard or even misleading to interpret. On the other hand, if min_size is set to an overly small value, then clustering can end prematurely, and alarms that have the same root cause can end up in different clusters. This can inflate the number of clusters and the time needed to interpret them. Both situations are undesirable. To mitigate this problem, we next describe an algorithm that assists the user in setting the min_size parameter.

Before investigating ways of setting the parameter min_size, it is important to note that other clustering methods have similar parameters [Anderberg 1973; Han and Kamber 2000; Jain and Dubes 1988]. For example, partitional clustering methods require the user to specify the number of clusters that the dataset is supposed to contain. Similarly, density-based clustering methods expect the user to define the minimum number of data objects that a data volume must contain to count as "dense". In general, most clustering methods have parameters like these, which allow the user to control the desired amount of clustering. Intuitively, these parameters select an "operating point" between the two extremes of no clustering (i.e., all objects form clusters by themselves) and maximum clustering (i.e., all objects are grouped into a single cluster). Aside from a few ad hoc rules [Anderberg 1973; Jain and Dubes 1988], there exist no guidelines for choosing the operating point.

We now describe the revised scheme for setting the min_size parameter. As before, min_size is an input parameter that the user has to set. However, the user-defined min_size value is no longer used as is, but rather serves as the seed value of an iterative process that converges towards a robust min_size value. This robust value is finally used for clustering. Intuitively, a min_size value is robust if slightly smaller or slightly larger values still yield the same clustering result. Robustness is an important property because it limits the effects that spurious or arbitrary decisions can have on the clustering results.


Formally, let ε be a small fraction of 1, e.g., ε = 0.05. A given min_size value ms is ε-robust if it yields the same alarm cluster as the values (1−ε) × ms and (1+ε) × ms. We can test for ε-robustness by simulating the alarm clustering method for the values ms, (1−ε) × ms, and (1+ε) × ms. The value ms is ε-robust if all three simulations yield the same result. Note that ε-robustness is always relative to a particular alarm log. Thus, a given min_size value can be ε-robust with respect to one alarm log, while being non-robust for another alarm log.

Let ms_0 be the min_size value that the user originally keyed in. This value is used for clustering if and only if it is ε-robust (with respect to the alarm log at hand). Otherwise, it is depreciated according to the formula ms_{i+1} := (1−ε) × ms_i. This test-and-depreciate cycle is iterated until an ε-robust min_size value has been found. In the worst case, termination occurs after O(log(ms_0)) iterations. Our decision to progressively decrease rather than increase the min_size value stems from our conviction that it is better to have too many alarm clusters than to mix unrelated alarms in a single cluster.
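A Python sketch of this test-and-depreciate loop (our own illustration; `cluster_fn` stands for any implementation of the Section 5.1 method, and the rounding details are a choice of ours):

    def robust_min_size(log, ms0, cluster_fn, eps=0.05):
        """Depreciate ms until an epsilon-robust value is found (Section 5.3).
        `cluster_fn(log, ms)` returns the clustering result for a given min_size."""
        ms = ms0
        while ms >= 1:
            results = [cluster_fn(log, max(1, round(f * ms)))
                       for f in (1 - eps, 1.0, 1 + eps)]
            if results[0] == results[1] == results[2]:
                return ms                        # epsilon-robust: use it for clustering
            ms = int((1 - eps) * ms)             # depreciate: ms_{i+1} = (1-eps) * ms_i
        return 1                                 # fallback for degenerate inputs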

6. EXPERIENCE WITH ALARM CLUSTERING

This section summarizes our experience with alarm clustering. Section 6.1 investigates the choice of generalization hierarchies. Section 6.2 presents a detailed example to illustrate how alarm clustering and root cause analysis work in practice. Section 6.3 discusses the risk of discarding true positives when filtering rules are derived based on one's understanding of root causes.

6.1 Definition of Generalization Hierarchies

Our alarm clustering framework assumes that meaningful generalization hierarchies have been defined for all alarm attributes. Figure 1(c) shows that such generalization hierarchies exist for IP addresses and port numbers. This section suggests further generalization hierarchies for numerical, time, and string-valued attributes. In fact, intrusion detection alarms can contain all of these attribute types:

Numerical attributes. Examples of numerical attributes include counters (e.g., for the number of SYN packets in a SYN flooding alarm) and size fields (e.g., for the packet size in "Large ICMP Traffic" alarms [CERT 1996]).

Time attributes. All alarms are time-stamped. However, time should not be treated as a numerical attribute, because doing so would mean ignoring its unique semantics, including notions such as periodicity, workdays versus weekends, etc.

String attributes. String attributes assume arbitrary and unpredictable text values. A typical string attribute is the context, which stores the raw audit records or network packets that the IDS believes to constitute an attack. Not all IDSs set the context, but when available, it can significantly facilitate the investigation of alarms. Note that attributes such as "Attack Name" or "Recommended Action" are not string attributes because they assume values out of a small and predetermined domain.

By defining generalization hierarchies, the user encodes his or her background knowledge about the application domain. Like most knowledge engineering tasks, there is no single best way to do this. The pros and cons of different generalization hierarchies must be weighed, and a savvy decision must be taken. Therefore, we do not intend to propose the "right" generalization hierarchies. Rather, the generalization hierarchies of this section should be seen as examples that demonstrate the usefulness and versatility of our alarm clustering framework. Other generalization hierarchies are possible and might prove useful in practice.

Numerical Attributes. Generalization hierarchies for numerical attributes are obtained by discretizing the attribute domain into a hierarchically nested sequence of intervals. In the simplest case, a human expert manually discretizes each numerical attribute. In doing so, the expert might be bound by regulations, which, for example, might stipulate that "children", "adolescents", "adults", and "senior citizens" are (per definition) people in the age ranges [1,10[, [10,20[, [20,65[, and [65,∞[, respectively. In other cases, the human expert is free to exercise her judgment. For example, suppose that the severity of an alarm is measured by a real number in the range from 1 to 97. Assuming that all severity values in this range are equally likely, the expert might decide that a tree of height four and fan-out four constitutes a useful generalization hierarchy. The leaves of this tree are the values in [1,97[; the next higher level consists of the intervals [1+6×i, 7+6×i[, i = 0, ..., 15, followed by the intervals [1+24×i, 25+24×i[, i = 0, ..., 3. The interval [1,97[ constitutes the root of the generalization hierarchy.
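The severity hierarchy just described can be generated mechanically. A small Python sketch (our own; the function name is illustrative) builds the nested half-open intervals for any domain [lo, hi[ and fan-out:

    def interval_hierarchy(lo, hi, fanout, levels):
        """Level 0 is the root [lo, hi[; each interval is split into `fanout`
        equal-width children on the next level."""
        hierarchy = [[(lo, hi)]]
        for _ in range(levels - 1):
            children = []
            for a, b in hierarchy[-1]:
                width = (b - a) / fanout
                children.extend((a + k * width, a + (k + 1) * width)
                                for k in range(fanout))
            hierarchy.append(children)
        return hierarchy

    # The severity example: splitting [1, 97[ twice with fan-out four reproduces
    # the intervals [1+24i, 25+24i[ and [1+6i, 7+6i[ from the text.
    for level in interval_hierarchy(1, 97, 4, 3):
        print(level)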

A drawback of user-defined generalization hierarchies is that they are static. In fact, their inability to adjust to the actual distribution of data values can make them a rather unnatural choice. For example, consider a case where 90% of severity values fall into the range [1,20]. The aforementioned balanced generalization hierarchy is not particularly suitable for this case, because it is too coarse-grained in the range [1,20], while being too detailed in the range ]20,97[. Dynamic generalization hierarchies, which are constructed at run-time to fit the actual data distribution, are a pragmatic way to mitigate this shortcoming [Dougherty et al. 1995; Han and Fu 1994; Lu 1997].

In the simplest case, algorithms for the construction of dynamic generalization hierarchies construct hierarchies such that all intervals at a given level of the hierarchy contain the same number of data values [Han and Fu 1994]. This type of algorithm has the drawback that it assigns close values (e.g. 4 and 5) to distinct intervals, while putting distant values (e.g. 22 and 87) into the same interval, if this is necessary to equalize the number of data values per interval. The resulting generalization hierarchies can be rather counter-intuitive. To mitigate this problem, clustering has been used to find intervals that reflect the natural grouping of the data values [Lu 1997; Miller and Yang 1997]. Other algorithms for the construction of dynamic generalization hierarchies are surveyed by Dougherty et al. [1995]. This is not the place to explore these algorithms further, nor is it our intention to advocate any one of them. Instead, we conclude that the construction of generalization hierarchies for numerical attributes is a well-understood problem that has many practical solutions.
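As a rough illustration of the simplest dynamic scheme, the sketch below (our own simplification, not the algorithm of Han and Fu [1994]) derives equal-frequency intervals from the observed values; applying it recursively per level would yield a dynamic generalization hierarchy:

    import numpy as np

    def equal_frequency_bins(values, n_bins):
        # Split the observed values into n_bins intervals that each contain
        # roughly the same number of values (equal-frequency discretization).
        edges = np.quantile(np.asarray(values, dtype=float),
                            np.linspace(0.0, 1.0, n_bins + 1))
        # Adjacent quantiles delimit the bins; for heavily skewed data some
        # edges may coincide and would have to be merged in practice.
        return list(zip(edges[:-1], edges[1:]))

    # Hypothetical severity values: the distant values 22 and 87 end up in
    # the same interval, which is exactly the drawback described above.
    severities = [3, 4, 4, 5, 7, 9, 12, 18, 22, 87]
    print(equal_frequency_bins(severities, 4))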

Time Attributes. For time attributes one typically wishes to capture temporal information such as the distinction between weekends and workdays, between business hours and off hours, or between the beginning of the month and the end of the month. To make the clustering method aware of concepts like these, one can use generalization hierarchies such as the ones in Figure 3. For example, the left-hand generalization hierarchy shows that the time-stamp ts1 can be generalized to the concepts SAT, WEEKEND, and ultimately, ANY-DAY-OF-WEEK. The right-hand generalization hierarchy shows how time-stamps can be generalized according to when during a month they were generated.

Fig. 3. Sample generalization hierarchies for time attributes. (Left: time-stamps generalize to days of the week MON, ..., SUN, then to WORKDAY or WEEKEND, and finally to ANY-DAY-OF-WEEK. Right: time-stamps generalize to days of the month 1, ..., 31, then to BEGINNING or END, and finally to ANY-DAY-OF-MONTH.)

It can be desirable to use both generalization hierarchies of Figure 3 simultaneously. To this end, one could combine the two hierarchies into a single one, in which each time-stamp has two parents: one that corresponds to its day of the week, and one that corresponds to its day of the month. A drawback of this approach is that the resulting generalization hierarchy does not contain concepts such as "Mon 2nd", which specify a day of the week plus a day of the month. To mitigate this problem, one could construct the "full cross-product" of the two generalization hierarchies. The resulting generalization hierarchy contains all concept pairs, including "Mon 2nd". Unfortunately, the generalization hierarchy resulting from a full cross-product is generally complex. It is therefore easier to replicate the time-stamp attribute and assign each generalization hierarchy to a separate replica. That way, one replica plays the role "day of the week", whereas the other plays the role "day of the month". Further, each replica is generalized according to its own generalization hierarchy. Because of its simplicity, we use this second option in our experiments.
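The replica idea is straightforward to realize. The sketch below (our illustration; the concept names follow Figure 3, whereas the exact boundary between BEGINNING and END is our assumption) computes the two generalization paths of a time-stamp, one per replica:

    from datetime import datetime

    def day_of_week_concepts(ts):
        # Generalization path of the "day of the week" replica.
        day = ts.strftime("%a").upper()                      # e.g. 'SAT'
        kind = "WEEKEND" if ts.weekday() >= 5 else "WORKDAY"
        return [day, kind, "ANY-DAY-OF-WEEK"]

    def day_of_month_concepts(ts, boundary=15):
        # Generalization path of the "day of the month" replica. The split
        # between BEGINNING and END after day 15 is our assumption.
        part = "BEGINNING" if ts.day <= boundary else "END"
        return [str(ts.day), part, "ANY-DAY-OF-MONTH"]

    ts = datetime(2002, 9, 7, 14, 30)    # a Saturday, the 7th of the month
    print(day_of_week_concepts(ts))      # ['SAT', 'WEEKEND', 'ANY-DAY-OF-WEEK']
    print(day_of_month_concepts(ts))     # ['7', 'BEGINNING', 'ANY-DAY-OF-MONTH']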

String Attributes. String attributes can assume arbitrary and completely unforeseeable text values. Therefore, the challenge lies in tapping the semantic information of these strings. One possible solution to this problem is to use a feature extraction step that precedes the actual alarm clustering. Features are crisp bits of semantic information that, once extracted, replace the original strings. Thus, each string is replaced by the set of its features. Note that subset inclusion defines a natural generalization hierarchy on feature sets. For example, the feature set {f1, f2, f3} can be generalized to the sets {f1, f2}, {f1, f3}, or {f2, f3}, which in turn can be generalized to {f1}, {f2}, or {f3}. The next level is the empty set, which corresponds to "ANY-FEATURE". Note that the feature extraction process constructs a generalization hierarchy at run-time. Hence, string attributes are another instance where dynamic generalization hierarchies are useful.
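For illustration, one generalization step on feature sets can be sketched as follows (our code; under subset inclusion, a feature set is generalized by dropping one of its features):

    from itertools import combinations

    def generalizations(feature_set):
        # Immediate generalizations of a feature set under subset inclusion:
        # all subsets with one feature removed. The empty set plays the role
        # of ANY-FEATURE.
        fs = sorted(feature_set)
        return [set(c) for c in combinations(fs, len(fs) - 1)]

    print(generalizations({"f1", "f2", "f3"}))
    # -> the three 2-subsets {f1,f2}, {f1,f3}, {f2,f3}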

A potential problem of this feature-based approach is that it can result in very large generalization hierarchies. Specifically, if the feature extraction step identifies m features, then the complete generalization hierarchy has 2^m − 1 nodes, namely one node for each non-empty subset of the feature set. To keep the size of the generalization hierarchy at a manageable level, we use a variant of the above idea in our own experiments. Specifically, let L be an alarm log, let A be a string attribute, and let V be the multi-set (a.k.a. bag or collection) of values that attribute A assumes in the alarm log L. We run the Teiresias algorithm [Rigoutsos and Floratos 1998] on V in order to find all substrings that have a user-defined minimum length and minimum frequency. These substrings are the features, and each original string s is replaced by the (single) most frequent feature that is also a substring of s. Thus, all feature sets have size one. Finally, each feature set can only be generalized to the "ANY-FEATURE" level. The resulting generalization hierarchy is simple, but frequent substrings have the advantage of being rather understandable features. Therefore, they contribute to the overall understandability of covers.
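The sketch below mimics this featurization step in Python. It is a naive stand-in rather than the Teiresias algorithm itself, and the thresholds as well as the sample contexts are hypothetical:

    from collections import Counter

    def frequent_substrings(strings, min_len, min_freq):
        # Count substrings of at least min_len characters, each counted at
        # most once per string, and keep those reaching min_freq occurrences.
        # This brute-force enumeration is quadratic per string; Teiresias
        # is far more efficient and finds maximal patterns.
        counts = Counter()
        for s in strings:
            subs = {s[i:j] for i in range(len(s))
                           for j in range(i + min_len, len(s) + 1)}
            counts.update(subs)
        return {f: c for f, c in counts.items() if c >= min_freq}

    def featurize(strings, min_len=5, min_freq=2):
        # Replace each string by the single most frequent feature that is
        # also one of its substrings, or ANY-FEATURE if no feature matches.
        feats = frequent_substrings(strings, min_len, min_freq)
        def best(s):
            matching = [f for f in feats if f in s]
            return max(matching, key=feats.get) if matching else "ANY-FEATURE"
        return [best(s) for s in strings]

    contexts = ["GET /index.html", "GET /admin.php", "SYN flood packet"]
    print(featurize(contexts))   # -> ['GET /', 'GET /', 'ANY-FEATURE']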

6.2 An Illustrative Example

This section describes a clustering experiment that we have conducted on a log of historical alarms. The alarm log is taken from a commercial network-based IDS, spans a time period of one month, and contains 156,380 alarm messages. The IDS sensor was deployed in a network that is isomorphic to the one in Figure 1(a). Please note that this experiment involves real alarm data that was collected during the day-to-day operation of a commercially used network. Furthermore, it is worth mentioning that we have applied the same algorithm to many more data sets of various environments. The results obtained with these different data sets are consistent with the results presented here [Julisch and Dacier 2002]. However, this section focuses on a single data set in order to allow a detailed discussion of the results.

For the purpose of our experiment, we model alarms as 7-tuples. In detail, the individual alarm attributes are the source and destination IP address, the source and destination port, the alarm type (which encodes the observed attack), the time-stamp, and the context. Please recall that the context is optional, but when present, contains the suspicious network packet. For IP addresses and port numbers, we use the generalization hierarchies in Figure 1(c). For time-stamps, we use the generalization hierarchies in Figure 3. As described at the end of the previous section, we use generalization hierarchies based on frequent substrings for the context. Finally, the alarm clustering method is configured to enforce ε-robustness (cf. Section 5.3) with ε = 0.05.
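As a minimal sketch of this alarm model (the field names are ours; the attribute set is the one just listed):

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class Alarm:
        # The 7-tuple alarm model of this experiment; field names are ours.
        src_ip: str
        dst_ip: str
        src_port: Optional[int]      # undefined, e.g., for ICMP alarms
        dst_port: Optional[int]
        alarm_type: str              # encodes the observed attack
        timestamp: datetime
        context: Optional[str]       # suspicious packet, when the IDS stores it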

Table II shows the generalized alarms of the thirteen largest alarm clusters that we found. Each line of the table represents one alarm cluster, and the "Size" column indicates the cluster's size. Throughout the table, "any" is generically written for attributes that have been generalized to the root of their generalization hierarchies. The value "undefined" in the "Context" column indicates that the IDS did not store any value for the context attribute. Similarly, the port attributes are occasionally undefined. For example, the ICMP protocol has no notion of ports [Tanenbaum 1996]. As a consequence, the port attributes of "Fragmented ICMP Traffic" alarms are undefined. Finally, recall that the names ip1, ip2, ... refer to the machines in Figure 1(a).
