
Review Article


Small sample issues for microarray-based classification

Edward R. Dougherty*

Department of Electrical Engineering, Texas A&M University, College Station, TX, USA

*Correspondence to:
E. R. Dougherty, Department of Electrical Engineering, Texas A&M University, College Station, TX 77843-3128, USA.
E-mail: e-dougherty@

Abstract

In order to study the molecular biological differences between normal and diseased tissues, it is desirable to perform classification among diseases and stages of disease using microarray-based gene-expression values. Owing to the limited number of microarrays typically used in these studies, serious issues arise with respect to the design, performance and analysis of classifiers based on microarray data. This paper reviews some fundamental issues facing small-sample classification: classification rules, constrained classifiers, error estimation and feature selection. It discusses both unconstrained and constrained classifier design from sample data, and the contributions to classifier error from constrained optimization and lack of optimality owing to design from sample data. The difficulty with estimating classifier error when confined to small samples is addressed, particularly estimating the error from training data. The impact of small samples on the ability to include more than a few variables as classifier features is explained. Copyright © 2001 John Wiley & Sons, Ltd.

Keywords: classification; gene expression; genomics; microarrays; pattern recognition

Introduction

cDNA microarrays can provide expression measurements for thousands of genes at once [2,3,7]. A key goal is to perform classification via different expression patterns, e.g. cancer classification [4]. This requires designing a classifier (decision function) that takes a vector of gene expression levels as input, and outputs a class label, which predicts the class containing the input vector. Classification can be between different kinds of cancer, different stages of tumour development or a host of such differences. Classifiers are designed from a sample of expression vectors. This requires assessing expression levels from RNA obtained from the different tissues with microarrays, determining genes whose expression levels can be used as classifier variables, and then applying some rule to design the classifier from the sample microarray data. Expression values have randomness arising from both biological and experimental variability. Design, performance evaluation and application of classifiers must take this randomness into account.

Three critical issues arise. First, given a set of variables, how does one design a classifier from the sample data that provides good classification over the general population? Second, how does one estimate the error of a designed classifier when data is limited? Third, given a large set of potential variables, such as the large number of expression level determinations provided by microarrays, how does one select a set of variables as the input vector to the classifier? The problem of error estimation impacts variable selection in a devilish way. An error estimator may be unbiased but have a large variance, and therefore often be low. This can produce a large number of gene (variable) sets and classifiers with low error estimates. For a small sample, we can end up with thousands of gene sets for which the error estimate from the data at hand is zero.

For at least the near future, small samples are likely to be a critical issue for microarray-based classification. The irony is that, while microarray technology yields information on very large gene sets, it is just these large sets that demand experimental replication. If detectors for each gene are not duplicated on an array, then one microarray yields a single sample point per gene. In this case, a study using 30 arrays provides a very small sampling of gene behaviour. This paper discusses classification issues, with particular attention to the perplexing effect of small samples.

Classification rules

Classification involves a classifier, y, a feature vector, X = (X1, X2, ..., Xd), composed of random variables, and a binary random variable, Y, to be predicted by y(X). The values, 0 or 1, of Y are treated as class labels. The error, e(y), of y is the probability, P(y(X) ≠ Y), that the classification is erroneous. It equals the expected (mean) absolute difference, E(|Y − y(X)|), between the label and the classification. X1, X2, ..., Xd can be discrete or real-valued. In the latter case, the domain of y is d-dimensional Euclidean space R^d. An optimal classifier, y*, is one having minimal error, e*, among all binary functions on R^d. y* and e* are called the Bayes classifier and Bayes error, respectively. Classification accuracy, and thus the error, depends on the probability distribution of the feature–label pair (X, Y): how well the labels are distributed among the variables (gene expression levels) being used to discriminate them, and how the variables are distributed in R^d.
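To make these definitions concrete, the following Python sketch estimates e(y) = P(y(X) ≠ Y) as the mean of |Y − y(X)| by Monte Carlo. The one-dimensional Gaussian class-conditional model and the threshold classifier are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical model: X|Y=0 ~ N(0,1), X|Y=1 ~ N(1,1), equally likely labels.
rng = np.random.default_rng(0)
n = 100_000
Y = rng.integers(0, 2, size=n)              # class labels
X = rng.normal(loc=Y, scale=1.0, size=n)    # one expression-like feature

y_of_X = (X > 0.5).astype(int)              # threshold classifier
error = np.mean(np.abs(Y - y_of_X))         # E(|Y - y(X)|) = P(y(X) != Y)
print(f"estimated error: {error:.3f}")      # ~0.309 for this model
```

For this particular model the threshold at 0.5 happens to be the Bayes classifier, so the printed value approximates the Bayes error e* = Φ(−0.5) ≈ 0.309.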

The Bayes classifier is defined in a natural way: for any specific vector x, y*(x) = 1 if the expected value of Y given x, E(Y|x), exceeds 1/2, and y*(x) = 0 otherwise. Formulated in terms of probabilities, y*(x) = 1 if the conditional probability of Y = 1 given x exceeds the conditional probability of Y = 0 given x, and y*(x) = 0 otherwise; that is, y*(x) = 1 if and only if P(Y = 1|x) > P(Y = 0|x). This is most intuitive: the label 1 is predicted upon observation of x if the probability that x lies in class 1 exceeds the probability that x lies in class 0. Since the sum of the probabilities is 1, P(Y = 1|x) > P(Y = 0|x) if and only if P(Y = 1|x) > 1/2. The problem is that we do not know these conditional probabilities, and therefore must design a classifier from sample data.

Supervised classifier design uses a sample S_n = [(X1, Y1), (X2, Y2), ..., (Xn, Yn)] of feature–label pairs and a classification rule to construct a classifier y_n whose error is hopefully close to the Bayes error. The Bayes error e* is estimated by the error e_n of y_n. Because e* is minimal, e_n ≥ e*, and there is a design error (cost of estimation), D_n = e_n − e*. Since it depends on the sample, e_n is a random variable, as is D_n. Hopefully, D_n gets closer to 0 as the sample size grows. This will depend on the classification rule and the distribution of the feature–label pair (X, Y).

A classification rule is said to be consistent for the distribution of (X, Y) if E(D_n) → 0 as n → ∞, where the expectation is relative to the distribution of the sample. The expected design error goes to zero as the sample size goes to infinity. This is equivalent to P(D_n > t) → 0 as n → ∞ for any t > 0, which says that the probability of the design error exceeding t goes to 0. As stated, consistency depends upon the relation between the classification rule and the joint feature–label distribution. If E(D_n) → 0 for any distribution, then the classification rule is said to be universally consistent. Since we often lack an estimate of the distribution, universal consistency is desirable.

Since the Bayes classifier is defined by y*(x) = 1 if and only if P(Y = 1|x) > 1/2, an obvious way to proceed is to obtain an estimate P_n(Y = 1|x) of P(Y = 1|x) from the sample S_n. The plug-in rule designs a classifier by y_n(x) = 1 if and only if P_n(Y = 1|x) > 1/2. If the data is discrete, then there is a finite number of vectors and P_n(Y = 1|x) can be defined to be the number of times the pair (x, 1) is observed in the sample divided by the number of times x is observed. The problem is that, if x is observed very few times, then P_n(Y = 1|x) is not a good estimate. Even worse, if x is never observed, then y_n(x) must be defined by some convention. The rule is consistent, but depending on the number of variables, may require a large sample to have E(D_n) close to 0, or equivalently, e_n close to the Bayes error. Consistency is of little consequence for small samples.
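A minimal sketch of the discrete plug-in rule follows; the tuple-keyed counts and the default-label fallback for never-observed vectors are illustration choices, the latter standing in for the convention mentioned above:

```python
from collections import Counter

def plugin_classifier(sample, default_label=0):
    """Discrete plug-in rule: y_n(x) = 1 iff P_n(Y=1|x) > 1/2, where
    P_n(Y=1|x) = (#times (x,1) observed) / (#times x observed)."""
    n_x = Counter(x for x, _ in sample)             # occurrences of each vector x
    n_x1 = Counter(x for x, y in sample if y == 1)  # occurrences of the pair (x, 1)

    def y_n(x):
        if n_x[x] == 0:                     # x never observed: use the convention
            return default_label
        return 1 if n_x1[x] / n_x[x] > 0.5 else 0
    return y_n

# Usage with binary 2-vectors (tuples, so they can serve as dictionary keys).
sample = [((0, 1), 1), ((0, 1), 1), ((0, 1), 0), ((1, 1), 0)]
y_n = plugin_classifier(sample)
print(y_n((0, 1)), y_n((1, 1)), y_n((0, 0)))  # -> 1 0 0
```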

For continuous data, many classification rules partition R^d into a disjoint union of cells. P_n(Y = 1|x) is the number of 1-labelled sample points in the cell containing x divided by the total number of points in the cell. A histogram rule is defined by the plug-in rule: y_n(x) is 0 or 1 according to which is the majority label in the cell. The cells may change with n and may depend on the sample points. They do not depend on the labels. To obtain consistency for a distribution, two conditions are sufficient when stated with the appropriate mathematical rigour: (1) the partition should be fine enough to take into account local structure of the distribution, and (2) there should be enough labels in each cell so that the majority decision reflects the decision based on the true conditional probabilities. The cubic histogram rule partitions R^d into same-size cubes. These can remain the same or vary with sample size n. If the cube edge length approaches 0 and n times the common volume approaches infinity as n → ∞, then the rule is universally consistent. For discrete data, the cubic histogram rule reduces to the plug-in rule for discrete data if the cubes are sufficiently small.
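The cubic histogram rule admits an equally short sketch; the floor-based cube indexing and the empty-cell fallback are implementation choices, not prescriptions from the text:

```python
import numpy as np
from collections import Counter, defaultdict

def cubic_histogram_classifier(X, Y, edge):
    """Cubic histogram rule: partition R^d into cubes of side `edge` and
    classify by the majority label within each cube (the plug-in rule)."""
    votes = defaultdict(Counter)
    for x, y in zip(X, Y):
        cell = tuple(np.floor(np.asarray(x) / edge).astype(int))  # cube holding x
        votes[cell][y] += 1

    def y_n(x, default_label=0):
        cell = tuple(np.floor(np.asarray(x) / edge).astype(int))
        counts = votes.get(cell)
        if counts is None:                  # empty cell: fall back to a convention
            return default_label
        return 1 if counts[1] > counts[0] else 0
    return y_n
```

Consistency requires shrinking `edge` with n while keeping n times the cube volume growing, per the conditions above.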

Another popular rule is the nearest-neighbour (NN) rule: y_n(x) is the label of the sample point closest to x. This rule is simple, but not consistent. An extension of this rule is the k-nearest-neighbour (kNN) rule. For k odd, the k points closest to x are selected and y_n(x) is defined to be 0 or 1 according to which is the majority among the labels of the chosen points. The kNN rule is universally consistent if k → ∞ in such a way that k/n → 0 as n → ∞.
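A direct kNN sketch, using brute-force distance computation (adequate for the small samples under discussion):

```python
import numpy as np

def knn_classifier(X, Y, k):
    """k-nearest-neighbour rule (k odd): y_n(x) is the majority label among
    the k training points closest to x in Euclidean distance."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)

    def y_n(x):
        d = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(d)[:k]           # indices of the k closest points
        return int(Y[nearest].sum() > k / 2)  # majority vote over their labels
    return y_n
```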

Constrained Classifiers

To reduce design error, one can restrict the functions from which an optimal classifier must be chosen to a class C. This leads to trying to find an optimal constrained classifier, y_C ∈ C, having error e_C. Constraining the classifier can reduce the expected design error, but at the cost of increasing the error of the best possible classifier. Since optimization in C is over a subclass of classifiers, the error, e_C, of y_C will typically exceed the Bayes error, unless the Bayes classifier happens to be in C. This cost of constraint (approximation) is D_C = e_C − e*. A classification rule yields a classifier y_n,C ∈ C with error e_n,C, and e_n,C ≥ e_C ≥ e*. Design error for constrained classification is D_n,C = e_n,C − e_C. For small samples, this can be substantially less than D_n, depending on C and the rule. The error of the designed constrained classifier is decomposed as e_n,C = e* + D_C + D_n,C. The expected error of the designed classifier from C can be decomposed as:

E(e_n,C) = e* + D_C + E(D_n,C)    (1)

The constraint is beneficial if and only if E(e_n,C) < E(e_n). The point N0 at which the decreasing lines cross (Figure 1) is the cut-off: for n > N0, the constraint is detrimental; for n < N0, it is beneficial.

There are many kinds of constrained classifiers. Perceptrons form a constrained class with some attractive properties: simplicity, a linear-like structure, and contributions of individual variables that can be easily appreciated. Savings in sample size (in comparison to unconstrained classification) accelerate as the number of variables increases.

Figure 1. Relation between sample size and constraint


A perceptron is defined by:

y(X) = T(a1X1 + a2X2 + ... + amXm + b)    (2)

where T is a threshold function: T(z) = 0 if z ≤ 0, and T(z) = 1 if z > 0. A perceptron splits R^d into two by the hyperplane defined by setting the sum in the preceding equation to 0. Design of a perceptron requires estimating the coefficients a1, a2, ..., am, and b.
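In code, Eq. 2 is a one-liner; the coefficients below are arbitrary placeholders, since estimating them from data is the actual design problem:

```python
import numpy as np

def perceptron(a, b):
    """Perceptron per Eq. 2: y(X) = T(a1*X1 + ... + am*Xm + b), where
    T(z) = 1 if z > 0 and T(z) = 0 if z <= 0."""
    a = np.asarray(a, dtype=float)
    return lambda X: int(np.dot(a, X) + b > 0)

y = perceptron(a=[1.0, -2.0], b=0.5)        # hypothetical coefficients
print(y([1.0, 0.0]), y([0.0, 1.0]))         # -> 1 0
```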

Neural networks are multi-layer perceptrons. A basic two-layer neural network takes the outputs of k perceptrons (called neurons) and inputs these outputs into a final perceptron. More general networks exist. Neural networks offer an advantage over perceptrons because, by increasing the number of neurons, one can arbitrarily decrease the constraint. But this makes neural networks tricky to use, because decreasing the constraint increases the expected design cost. One faces the inevitable conundrum of balancing the contributions to E(e_n,C) in Eq. 1. The data requirement grows rapidly as the number of neurons is increased.

Error Estimation

The error of a designed classifier needs to be estimated. If there is an abundance of data, then it can be split into training and test data. A classifier is designed on the training data. Its estimated error is the proportion of errors it makes on the test data. The estimate is unbiased, and its variance tends to zero as the amount of test data goes to infinity.
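A holdout sketch follows; `design_rule` is a placeholder for any rule constructor in the style of the earlier sketches, i.e. a function taking (X, Y) and returning a classifier:

```python
import numpy as np

def holdout_error(design_rule, X, Y, n_test, seed=0):
    """Design on a random training split; estimate the error as the fraction
    of mistakes on the held-out test pairs (unbiased test-data estimate)."""
    X, Y = np.asarray(X), np.asarray(Y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    test, train = idx[:n_test], idx[n_test:]
    y_n = design_rule(X[train], Y[train])   # e.g. lambda X, Y: knn_classifier(X, Y, 3)
    errors = sum(int(y_n(x) != y) for x, y in zip(X[test], Y[test]))
    return errors / n_test
```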

A problem arises when data are limited. One approach is to use all sample data to design a classifier y_n, and estimate e_n by applying y_n to the same data. The resubstitution estimate, ē_n, is the fraction of errors made by y_n on the sample. For histogram rules, ē_n is biased low, meaning E(ē_n) ≤ E(e_n). For small samples, the bias can be severe. It improves for large samples. For binary features, an upper bound for the mean-square error of ē_n as an estimator of e_n is given by E(|ē_n − e_n|^2) ≤ 6(2^d)/n. Note the exponential contribution of the number of variables. Figure 2 shows a generic situation for the inequality E(ē_n) ≤ e* ≤ E(e_n) for increasing sample size.

To appreciate the problem with resubstitution, consider the plug-in rule for discrete data. For any vector x, let n(x) be the number of occurrences of x in the sample data, n(Y = 1|x) be the number of times x has label 1, and P_n(Y = 1|x) = n(Y = 1|x)/n(x). There are three possibilities: (1) x is observed in training, n(Y = 1|x) > n(x)/2, P_n(Y = 1|x) > 1/2, and y_n(x) = 1; (2) x is observed in training, n(Y = 1|x) ≤ n(x)/2, P_n(Y = 1|x) ≤ 1/2, and y_n(x) = 0; or (3) x is not observed in training and y_n(x) is defined by a convention. Each x in the first category contributes n(Y = 0|x) errors. Each x in the second category contributes n(Y = 1|x) errors. For a small sample, there may be an enormous number of vectors in the third category. These contribute nothing to ē_n, but may contribute substantially to e_n. Moreover, there may be many vectors in the first and second categories observed only once, and they also contribute nothing to ē_n.

Another small-sample approach is cross-validation. Classifiers are designed from parts of the sample, each is tested on the remaining data, and e_n is estimated by averaging the errors. For leave-one-out estimation, n classifiers are designed from sample subsets formed by leaving out one sample pair. Each is applied to the left-out pair, and the estimator ê_n is 1/n times the number of errors made by the n classifiers. Since the classifiers are designed on sample sizes of n − 1, ê_n actually estimates the error e_{n−1}. It is an unbiased estimator of e_{n−1}, meaning that E(ê_n) = E(e_{n−1}). Unbiasedness is important, but of critical concern is the variance of the estimator for small n.
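Leave-one-out estimation in the same style, with the same `design_rule` interface assumed as in the holdout sketch; note the n classifier designs, which is also why the estimate tracks e_{n−1} rather than e_n:

```python
import numpy as np

def leave_one_out_error(design_rule, X, Y):
    """Design n classifiers, each on n-1 sample pairs, test each on its
    left-out pair, and average the errors (nearly unbiased, high variance)."""
    X, Y = np.asarray(X), np.asarray(Y)
    n = len(X)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i            # boolean mask leaving out pair i
        y_i = design_rule(X[keep], Y[keep])
        errors += int(y_i(X[i]) != Y[i])
    return errors / n
```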

For a sample of size n, ê_n estimates e_n based on the same sample. Performance depends on the classification rule. For the k-nearest-neighbour rule, E(|ê_n − e_n|^2) ≤ (6k + 1)/n. Given that ê_n is approximately an unbiased estimator of e_n, this inequality bounds the variance of ê_n − e_n. Although an upper bound does not say how bad the situation is, but only how bad it can at most be, it can be instructive to look at its order of magnitude. For k = 1 and n = 175, upon taking the square root, this bound only ensures that the standard deviation of ê_n − e_n is less than 0.2.

Figure 2. Expected design error vs. expected resubstitution error

It is informative to compare the resubstitution and leave-one-out estimates for the histogram rule. The variance of the resubstitution estimator is bounded above by 1/n, and if the partition on which it is based contains N cells, then E(|ē_n − e_n|^2) ≤ 6N/n. For the leave-one-out estimator:

E(|ê_n − e_n|^2) ≤ (1 + 6e^(−1))/n + 6/√(π(n − 1))    (3)

(see [1] for bounds). The √(n − 1), as opposed to the n in the denominator for ē_n, shows greater variance for ê_n. There is a certain tightness to this bound. For any partition there is a distribution for which:

E(|ê_n − e_n|^2) ≥ e^(−1/12)/√(2πn)    (4)

Performance can be very bad for small n. Unbiasedness comes with increased variance.

To appreciate the difficulties inherent in the leave-one-out bounds, we will simplify them in a way that makes them more favourable to precise estimation. The performance of ê_n guaranteed by Eq. 3 becomes better if we lower the bound. A lower bound than the one in Eq. 3 is 1.8/√(n − 1). The corresponding standard-deviation bounds for n = 50 and 100 exceed 0.5 and 0.42, respectively. These are essentially useless. The minimum worst-case-performance bound of Eq. 4 would be better if it were lower. A lower bound than the one given is 0.35/√n. The corresponding standard-deviation bounds for n = 50 and 100 exceed 0.22 and 0.18, respectively.

Returning to the situation in which the data is split into training and test data, if the test-data error estimate is ẽ_n and there are m sample pairs in the test data, then E(|ẽ_n − e_n|^2) ≤ 1/(4m). The problem is that, for small samples, one would like to use all the data for design. It is necessary to use 25 sample pairs for test data to get the corresponding standard-deviation bound down to 0.1.
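The standard-deviation figures quoted in this section follow directly from the stated bounds; a quick numerical check, with the formulas taken from the text above:

```python
import math

print(math.sqrt((6 * 1 + 1) / 175))              # kNN, k=1, n=175 -> 0.2
for n in (50, 100):
    print(n, math.sqrt(1.8 / math.sqrt(n - 1)))  # Eq. 3 simplification -> ~0.51, ~0.43
    print(n, math.sqrt(0.35 / math.sqrt(n)))     # Eq. 4 simplification -> ~0.22, ~0.19
print(math.sqrt(1 / (4 * 25)))                   # holdout with m=25 test pairs -> 0.1
```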

Feature Selection

Given a large set of potential features, such as the set of all genes on a microarray, it is necessary to find a small subset with which to classify. There are various methods of choosing feature sets, each having advantages and disadvantages. The typical intent is to choose a set of variables that provide good classification. The basic idea is to choose variables that are not redundant.

A critical problem arises with small samples. Given a large set of variables, every subset is a potential feature set. For v variables, there are 2^v − 1 possible feature vectors. Even for choosing from among 200 variables and allowing at most 20 variables, the number of possible vectors is astronomical. One cannot apply a classification rule to all of these; nonetheless, even if the classes are moderately separated, one may find many thousands of vectors for which ê_n ≈ 0. It would be wrong to conclude that the Bayes errors of all the corresponding classifiers are small.
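The combinatorics are easy to verify; counting the subsets of at most 20 from 200 variables:

```python
import math

total = sum(math.comb(200, k) for k in range(1, 21))
print(f"{total:.2e}")                       # ~1.8e+27 candidate feature sets
```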

Adjoining variables stepwise to the feature vector decreases the Bayes error but can increase design error. For fixed sample size n and different numbers of variables d, Figure 3 shows a generic situation for the Bayes error e*(d) and the expected error E[e_n(d)] of the designed classifier as functions of d: e*(d) decreases; E[e_n(d)] decreases and then increases. Were E[e_n(d)] known, then we could conclude that e*(d) is no worse than E[e_n(d)]; however, we have only an estimate of e_n(d), which for small samples can be well below (or above) e*(d). Thus, the estimate curve ê_n(d) might drop far below the Bayes-error curve e*(d), even being 0 over a fairly long interval.

Figure 3. Effect of increasing numbers of variables

We confront the general issue of the number of variables. The expected design error is written in terms of n and C in Eq. 1, but C depends on d. A celebrated theorem of pattern recognition provides bounds for E(D_n,C) [8]. The empirical-error rule chooses the classifier in C that makes the least number of errors on the sample data. For this (intuitive) rule, E(D_n,C) satisfies the bound:

E(D_n,C) ≤ 4√((V_C log n + 4)/(2n))    (5)

where V_C is the VC (Vapnik–Chervonenkis) dimension of C. Details of the VC dimension are outside the scope of this paper. Nonetheless, it is clear from Eq. 5 that n must greatly exceed V_C for the bound to be small. The VC dimension of a perceptron is d + 1. For a neural network with an even number, k, of neurons, the VC dimension has the lower bound V_C ≥ dk. If k is odd, then V_C ≥ d(k − 1). To appreciate the implications, suppose d = k = 10. Setting V_C = 100 and n = 5000 in Eq. 5 yields a bound exceeding 1, which says nothing. Admittedly, the bound of Eq. 5 is worst-case, because there are no distributional assumptions. The situation may not be nearly so bad. Still, one must proceed with care, especially in the absence of distributional knowledge. Adding variables and neurons is often counterproductive unless there is a large sample available. Otherwise, one could end up with a very bad classifier whose error estimate is very small!
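Eq. 5 is easy to evaluate for the worked example; the natural logarithm is assumed here, which reproduces the "exceeding 1" figure quoted above:

```python
import math

def vc_design_error_bound(vc_dim, n):
    """Eq. 5 bound on E(D_n,C) for the empirical-error rule."""
    return 4 * math.sqrt((vc_dim * math.log(n) + 4) / (2 * n))

# The worked example above: d = k = 10 neurons, so V_C = 100, with n = 5000.
print(vc_design_error_bound(100, 5000))     # ~1.17: vacuous, as noted
```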

Conclusion

The purpose of this review has been to provide the general microarray community with some basic guideposts in its effort to design expression-based classifiers. There are many more implications of the kind discussed here. In some sense, we have been discussing a worst-case setting: no assumptions on the distribution of features and labels, and real-valued variables. The data requirement can be significantly reduced if some prior knowledge concerning the distribution is applied, or if a strong constraint based on biological knowledge is imposed. The data problem can also be mitigated if the classifier variables are discrete and limited in their possible values. Two possibilities naturally arise. The Boolean model has been suggested for genomic networks, and could be used here instead of considering raw expression values [5]. In it, a gene is either on (1) or off (0). Ternary values are also appropriate for microarray ratio data: a gene is upregulated (1), downregulated (−1), or invariant (0). This model has been used to measure gene interaction via expression ratios [6]. One might reasonably argue that compression of the continuous data gives up too much information; however, given the data variability, it might be safer only to consider genes that change significantly, and base classification on an up–down model of control.

Most likely, it will not be possible to design a classifier from a single set of microarray experiments. Separation of the sample data by designed classifiers will likely have to be taken as evidence that the corresponding gene sets are potential variable sets for classification. Their effectiveness will have to be checked by large-replicate experiments designed to estimate their classification error, perhaps in conjunction with biological input or phenotype evidence. There may, in fact, be many gene sets that provide accurate classification of a given pathology. Of these, some sets may provide mechanistic insights into the molecular aetiology of the disease, while other sets may be indecipherable. This listing of difficulties in producing accurate classifiers based on measurements of the expression profiles of small samples is not intended to persuade researchers to cease doing experiments and subsequent analysis to arrive at indications that certain conditions can be discriminated via gene expression. Rather, it is intended to focus attention on the need to find classification screening algorithms that provide reasonable collections of gene sets to be tested with new experiments.

References

1. Devroye L, Györfi L, Lugosi G. 1996. A Probabilistic Theory of Pattern Recognition. Springer-Verlag: New York.
2. DeRisi JL, Iyer VR, Brown PO. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686.
3. Duggan DJ, Bittner ML, Chen Y, Meltzer PS, Trent JM. 1999. Expression profiling using cDNA microarrays. Nature Genet 21: 10–14.
4. Golub TR, Slonim DK, Tamayo P, et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–537.
5. Kauffman SA. 1993. The Origins of Order. Oxford University Press: New York.
6. Kim S, Dougherty ER, Bittner M, et al. 2000. General framework for the analysis of multivariate gene interaction via expression arrays. J Biomed Opt 5: 411–424.
7. Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470.
8. Vapnik V, Chervonenkis A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theor Prob Appl 16: 264–280.

