
Efficiency-Quality Tradeoffs for Vector Score Aggregation

Pavan Kumar C. Singitham (Stanford University, Stanford, USA)

Mahathi S. Mahabhashyam (Stanford University, Stanford, USA)

Prabhakar Raghavan (Verity Inc., Sunnyvale, USA)

Abstract

Finding the nearest neighbors to a query in a vector space is an important primitive in text and image retrieval. Here we study an extension of this problem, with applications to XML and image retrieval: we have multiple vector spaces, and the query places a weight on each space. Match scores from the spaces are weighted by these weights to determine the overall match between each record and the query; this is a case of score aggregation. We study approximation algorithms that use a small fraction of the computation of exhaustive search through all records, while returning nearly the best matches. We focus on the tradeoff between the computation and the quality of the results. We develop two approaches to retrieval from such multiple vector spaces. The first is inspired by resource allocation. The second, inspired by computational geometry, combines the multiple vector spaces, together with all possible query weights, into a single larger space. While mathematically elegant, this abstraction is intractable for implementation. We therefore devise an approximation of this combined space. Experiments show that all our approaches (to varying extents) enable retrieval quality comparable to exhaustive search, while avoiding its heavy computational cost.

with Aho in the author and algorithm in the title, with the author score being twice as important as the title score": the notion is that the author and title fields each contribute a non-negative score that is weighted and summed for the overall score). Already a component of enterprise information retrieval platforms, such functionality becomes even more critical in content-oriented XML retrieval.

1.2 Vector score aggregation

No better algorithm is known for general score aggregation short of an exhaustive search. We focus here on an important case raised by the two examples above. Suppose that record e_j is represented by an s-tuple of vectors V_{j,i}, 1 ≤ i ≤ s, in s vector spaces. For example, if each record is a semi-structured document, we would have one vector space for author, one for title, etc. Each vector space is built from the terms in that field, as in classic information retrieval [30].

Definition 1. A composite vector query is a pair Q = (q, w) where q is an s-tuple of query vectors (q_1, ..., q_s) in the corresponding vector spaces, while w is an s-vector of non-negative real weights w_1, ..., w_s. The weight w_i represents the importance assigned by the user to the i-th field; without loss of generality, we henceforth assume that Σ_{i=1}^s w_i = 1. We further assume (as is typical in these applications) that the query and record vectors are all normalized within their respective fields, i.e., ||q_i||_2 = ||V_{j,i}||_2 = 1.

Definition 2. The match score between query Q = (q, w) and record e_j = (V_{j,i}) is given by

Match(Q, e_j) = Σ_{i=1}^{s} w_i (q_i · V_{j,i}),    (1)

where q_i · V_{j,i} represents the dot product (a.k.a. cosine similarity) between the query and record vector V_{j,i} in the i-th field.
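As a concrete illustration of Equation (1), here is a minimal sketch in Python (our own helper names, not the paper's code), assuming each field vector is stored sparsely as a dict from term to L2-normalized weight:

def dot(u, v):
    # Sparse dot product; iterate over the shorter dict for efficiency.
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

def match_score(q, w, record_vectors):
    # Equation (1): Match(Q, e_j) = sum_i w_i * (q_i . V_{j,i})
    return sum(w_i * dot(q_i, v_i)
               for w_i, q_i, v_i in zip(w, q, record_vectors))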

Our problem then becomes: given a composite vector query, can we retrieve the ℓ records of highest match score? In other words, how can we exploit the fact that the s source scores are vector cosine similarities? Note that for s = 1, this becomes the traditional ℓ-nearest-neighbor problem. Even for this special case of computing the nearest neighbors in arbitrary dimensions, there appears to be no algorithm that in the worst case avoids exhaustively computing the similarity of the query to every record [1, 10, 23, 25]. This prompts the question: can we find records that are "almost as good" as the exact nearest neighbors, while paying significantly less than exhaustive similarity computation? In application settings, an approximation is generally acceptable provided the quality is high enough. For instance, a document or image scoring 0.83 is not likely to be much worse than one scoring (say) 0.87; there is already a (perhaps bigger) approximation in using cosine similarity as a proxy for the user's perception of quality.

We therefore study efficiency-quality tradeoffs: suppose that an algorithm A outputs a candidate set C = A_ℓ(Q, E) of records.¹ Can we trade off the computational effort of A against the quality of C?

To study this question, we must first pinpoint the answers to two questions: (1) how do we quantify the computational effort of A in a principled manner, independent of the scheme A? (2) how do we measure the goodness of a candidate set C = A_ℓ(Q, E) computed by A? Once we address these questions, we have a basis for comparing various algorithms.

1.3 Metrics

Computational cost: We seek a measure of computational effort that is independent of a particular runtime environment. For any algorithm A, a basic operation is the query-to-record score computation in equation (1) – specifically, this involves s inner product computations. We therefore adopt the number of such query-record score computations by A as the fundamental measure of work; we denote it by CC(A, Q, E). We can thus speak of the work done by A on a query, a query suite, etc. For exhaustive search, CC(Exhaustive, Q, E) is always n. Our interest is in algorithms A for which CC(A, Q, E) ≪ n, while delivering candidate sets of high quality.

Quality of results: To evaluate the performance of an approximate retrieval scheme A on a given dataset E and query suite Q_1, ..., Q_m, we use a benchmark called the ground truth. For each query Q, let the true set of ℓ highest scoring records be GT_ℓ(Q, E). We compare the quality of a candidate set of records output by an algorithm A against GT_ℓ(Q, E). To this end, we employ two measures of quality. (For our experiments in Section 4 we use exhaustive search to compute GT_ℓ(Q, E).)

1. The aggregate goodness measure

AG_ℓ(A, Q, E) = Σ_{e ∈ A_ℓ(Q,E)} Match(Q, e).

Simply put, this adds up the match scores of the records returned by A. The idea is that if this net score is suitably high, then the user has been given a set of images/documents almost as good as the ground truth. By itself, AG does not tell the whole story; for instance, Q may be a query for which the ground truth does not contain good matches. Rather, we will typically compare AG with the aggregate goodness of the ground truth, measured by Σ_{e ∈ GT_ℓ(Q,E)} Match(Q, e); in fact our experiments will compare these quantities averaged over a query ensemble rather than on a single query.

2. The competitive recall of the top ℓ results

CR_ℓ(A, Q, E) = |A_ℓ(Q, E) ∩ GT_ℓ(Q, E)| / ℓ.

This computes the fraction of the ground truth included in A's candidate list of ℓ best records. It is more stringent than aggregate goodness in that it gives no credit for a document that may be almost as good as those in GT_ℓ(Q, E). Note that it hinges on comparison with the ℓ best records for each query, rather than on the Boolean notion of relevance commonplace in defining precision and recall in information retrieval. In this sense, our notion of competitive recall is related to the competitive analysis of algorithms [29] and is also related to measures used in [20, 32].

All of the above definitions can be extended to an average over a query suite in the natural way.
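As a sketch of how the two metrics can be computed (reusing match_score from the earlier snippet; the helper names and the normalization by ℓ follow our reading of the definitions above):

def aggregate_goodness(returned, q, w, vectors_by_id):
    # AG: total match score of the records that algorithm A returned.
    return sum(match_score(q, w, vectors_by_id[e]) for e in returned)

def competitive_recall(returned, ground_truth):
    # CR: fraction of the true top-l set that appears in A's candidate list.
    return len(set(returned) & set(ground_truth)) / len(ground_truth)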

2 Summary of contributions

We begin by summarizing related prior work in two broad areas: score aggregation and nearest neighbors. We do this in some depth (Section 2.1.1) for a particular approach to nearest neighbors in vector spaces that we call cluster pruning. We do so because cluster pruning is a basic building block for the subsequent development of our approaches.

2.1 Related prior work

A series of papers [14, 16, 17] have looked at the problem of retrieving the best records from combining source scores. They consider the general (not vector) score aggregation problem and insist on finding the best results, rather than good results as we do. Their focus is on comparing, for a given instance (records, score function and query), the computational cost of an algorithm to that of the best algorithm on that instance. This in the worst case could mean a computational cost of n; we instead seek ways of spending far less computation and getting good matches. Rank aggregation – the special case in which each source orders the records without assigning scores – owes its roots to voting theory, but has enjoyed a modern renaissance with the advent of metasearch engines [11, 18].

Nearest neighbor problems in vector spaces are the special case s = 1 of vector score aggregation. A series of index structures have been developed for this problem in various settings [2, 3, 21, 24, 26, 34]. These studies use the CPU and disk I/O times during query processing as a measure of speed, in contrast to our higher-level measure of the number of cosine computations. ClusterTree [35] creates an index over the data set that is a hierarchy of clusters and subclusters. The nearest neighbors to a given query are obtained by performing a depth-first search in this hierarchy. This approach effectively prunes the search space. They examine the number of such clusters to be probed in order to find all the nearest neighbors – this can be viewed as one extreme in our tradeoff space (with no approximate near-neighbors). This experimental approach is instructive but may be hard to use directly – in practice we do not have a "stopping condition" that informs us the instant we have found the correct nearest neighbors. In theoretical work related to approximate nearest neighbors, [23, 25] reduce the problem to point location in equal balls and suggest bucketing and locality-sensitive hashing algorithms.

More recently, [20] show how simple k-means clustering can do well at approximate nearest neighbor retrieval in multimedia databases. They evaluate quality by metrics that are the complement of competitive recall, and by a matching distance measure. They focus on "progressive processing" of approximate nearest neighbor searching: the user looks at the results for a query one page at a time. They use approximation techniques with exact nearest-neighbor algorithms to progressively improve results quality as the user keeps looking at more results.

2.1.1 Cluster pruning

We build on a class of schemes for the ℓ-nearest neighbors problem (the special case of our problem where s = 1) that make use of clustering. The goal is to avoid paying a cost of n cosine similarity computations, while still retrieving "reasonably near" neighbors for any query. The generic idea is to first cluster the vectors in the dataset E, in the process appointing a representative for each of the K clusters [5, 22, 32]. Given a query, we first find the m ≪ K centroids nearest to the query and then compute cosine similarities from the query only to the records in the clusters represented by these m centroids. All records in all other clusters are ignored. The hope (with no absolute guarantee, of course) is that many of the near neighbors are in these m clusters. Thus we get near neighbors while avoiding similarity computations with the majority of the vectors in the dataset.
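A minimal sketch of cluster pruning for a single vector space, assuming dense unit row-vectors in a NumPy array and using scikit-learn's KMeans for the offline clustering (our illustration, not the paper's implementation):

import numpy as np
from sklearn.cluster import KMeans

def build_index(X, K):
    # Offline: cluster the n record vectors (rows of X) into K clusters.
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    centers = km.cluster_centers_ / np.linalg.norm(
        km.cluster_centers_, axis=1, keepdims=True)   # unit-length centroids
    members = [np.flatnonzero(km.labels_ == c) for c in range(K)]
    return centers, members

def query_index(q, centers, members, X, m, ell):
    # Online: probe only the m << K clusters whose centroids are nearest to q,
    # then score the query against just the records in those clusters.
    nearest = np.argsort(centers @ q)[::-1][:m]
    cand = np.concatenate([members[c] for c in nearest])
    scores = X[cand] @ q     # cosine similarity, since vectors are unit length
    return cand[np.argsort(scores)[::-1][:ell]]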

The clean nature of cluster pruning raises the hope that it can be extended to s > 1; while this is the idea underlying our approaches, some interesting challenges and design decisions arise.

2.2 Contributions of this paper

• Concrete, usable metrics for cost-quality tradeoffs that do not demand human relevance judgements as in the TREC evaluations [36]. The idea of quantifying the cost-quality tradeoff for scoring has not been systematically studied, even for the traditional ℓ-nearest neighbor problem. All our metrics can be applied to the general setting at the beginning of Section 1.

• Two broad approaches to vector score aggregation: one inspired by resource allocation (Section 3.1) and the other by ideas from computational geometry [12] (Section 3.2).

• Experiments with two variants of our scheme based on resource allocation (Section 4), as well as with the scheme inspired by geometry. We find that all the schemes attain close to the quality of results in the ground truth, at a computational cost dramatically lower than exhaustive search.

• A comparison of the two families of schemes. While the geometric indexes are larger than those from resource allocation, they offer better retrieval quality for a given amount of computation on a query.

3 Two approaches

3.1 Resource allocation schemes

The technical development of our schemes inspired by resource allocation is cleaner if we think in terms of a fixed budget B of computational effort (number of cosine similarities) that we can use to answer a query. We can then ask how well we perform on the quality of retrieved results for the given budget. We begin with the general idea.

Consider again s vector spaces, one for each field. In seeking a candidate set of records for a composite vector query Q = (q, w), we instead retrieve a set C_i of candidate records from the i-th field, for each i ∈ [1, s]. Finally, we return the best matches from the records in ∪_{i=1}^s C_i.

These retrievals C_i for each i use the cluster pruning scheme of Section 2.1.1; the precise implementation details and parameter choices are deferred for now. Essentially, we first retrieve nearly best matches from each field, then pick the best matches from among these candidates. An important question arises: given our budget of B for computational effort, how do we invest this budget across the s vector spaces? This is a resource allocation problem, and we study two natural schemes for this investment. For example, consider a simple allocation between two fields, author and title. Suppose that a query places a high weight on the author field and relatively little weight on the title field. We could, on the one hand, spread our budget equally across the author and title vector spaces. On the other hand, we could invest more of our budget into retrieving candidates from the author space rather than from the title space, as this might give us better score-aggregated quality.

Figure 1: Mapping from query weights to points in the simplex: resource allocation.

Note that at query time we could make use of the weights in w to determine this allocation. Interesting questions arise: should we? If so, how do the weights govern the allocation? This question may be viewed at a slightly higher level (for conceptual development only; eventually we will use the number of cosine similarities as the measure of computational effort). Rather than a budget B of cosine similarities, imagine a budget P of the number of probes: a probe is a decision to evaluate the query against all the records in any single cluster in any of the s spaces. This view reflects the working of our algorithms built on cluster pruning: the query is always evaluated against all vectors in a cluster, or none.

A weight vector w in a query can naturally be viewed as a point on the s-dimensional simplex Σ_{i=1}^s w_i = 1. Further, any point x = (x_1, ..., x_i, ..., x_s) on this simplex represents an allocation as follows: given a budget of P probes, we dispatch x_i · P probes into field i. Figure 1 shows this idea for s = 3, where the fields are author, title and abstract.

Definition 3. A probe resource allocation is a mapping P from the simplex onto itself, P : w → x.

What properties should hold for an allocation function P? We propose these below; of these, the first two should clearly hold for all P, while the remainder are plausible but (as we detail in Section 5) may not hold in all situations.

1. P should map each vertex of the simplex into itself. This simply says that a query that places all its weight on one field demands that all the probes go into that vector space index.

2. P should map each edge (in general, each lower-dimensional sub-simplex) of the simplex into itself. This says that if a field gets no weight in the query, then no probes should be dispatched into that field's index.

Algorithm 2 Transparent
1: number of cluster probes available = P; Query Q = (q, w);
2: SearchSet = ∅;
3: for i ← 1, 2, ..., s do
4:   NSet = set of w_i · P nearest clusters to Q taken from field i;
5:   SearchSet = SearchSet ∪ records in NSet;
6: end for
7: for all record ∈ SearchSet do
8:   compute Match(Q, record)
9: end for
10: Rank SearchSet based on Match
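A sketch of both allocation policies on top of per-field cluster-pruning indexes: Uniform dispatches P/s probes per field, while Transparent passes the query weights through and dispatches w_i·P. Here records_in_nearest_clusters is a hypothetical index method, match_score is from the earlier sketch, and the rounding is our simplification:

def allocate_probes(w, P, policy):
    # Map the query weights w to a point x on the simplex, then split P probes.
    s = len(w)
    x = [1.0 / s] * s if policy == "uniform" else list(w)  # "transparent": x = w
    return [round(x_i * P) for x_i in x]

def composite_query(field_indexes, q, w, P, policy, ell, vectors_by_id):
    probes = allocate_probes(w, P, policy)
    search_set = set()
    for i, index in enumerate(field_indexes):
        # Probe the allotted number of nearest clusters in field i's index.
        search_set |= index.records_in_nearest_clusters(q[i], probes[i])
    # Rank the candidate union by the full match score of Equation (1).
    ranked = sorted(search_set,
                    key=lambda e: match_score(q, w, vectors_by_id[e]),
                    reverse=True)
    return ranked[:ell]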

3.2 Cell decomposition

Consider first the standard ℓ-nearest neighbors problem, i.e., s = 1. The Voronoi decomposition [12] partitions space into n polyhedral cells, one for each record e_i. The crucial property: for any query point within e_i's cell, the nearest neighbor is e_i.² Given a query, the nearest-neighbor problem reduces to locating which cell the query lies in. Tight bounds exist on the total number of facets in Voronoi decompositions as a function of n and the number of dimensions d; these bounds are exponential in d. Consequently, the computational costs of building the decomposition and of point location are high when d is large; but for d ≤ 3 this approach leads to pragmatic nearest neighbor retrieval.

The notion of a Voronoi decomposition has been generalized [12] to ℓ-nearest neighbors. Instead of one cell per record e_i, we now have one cell for every subset of ℓ records that is a valid answer to some query. The cell decomposition now has the following property: the same set of ℓ records constitutes the nearest neighbors for any query point within a given cell. Thus identifying the ℓ nearest neighbors again reduces to identifying the cell containing the query. Here too the cells are known to be polyhedra (for cosine similarities between unit vectors), and bounds are known for the facet complexities. Despite its impracticality in high dimensions, we nevertheless pursue this view a little further, as it leads to our eventual index structure.

Let us extend the above notions to composite vector retrieval: denote by D_1, ..., D_s the s vector spaces and by d_1, ..., d_s the corresponding dimensionalities. Note additionally that the set of all possible query weightings w can be viewed as a vector space W in s dimensions (in fact, since Σ_{i=1}^s w_i = 1, a simplex). Consider a new vector space U = W × D_1 × ··· × D_s, having u = s + Σ_{i=1}^s d_i dimensions. Any query Q = (q, w) can be represented as a single point in U; note however that a record is not a single point in U. Nevertheless, U can still be partitioned into cells such that for any query (point in U), the set of nearest records is invariant. In this cell structure, it suffices to locate the cell containing the query point and then read off the nearest neighbors for any query in that cell. Besides the immense number of cell boundaries due to the high dimensionality, there is an added difficulty here: the cell boundaries are no longer polyhedral, but rather are described by (nonlinear) algebraic functions. Point location thus becomes highly non-trivial.

We give two generic ideas to overcome these difficulties, leading to experimentation with a very basic implementation of these ideas in Section 5.4.

1. We can group together cells in the decomposition that are "close together", coarsening the decomposition into a small number of coarse cells with similar (rather than the same) answers. In the process, we may project U down to a low-dimensional space.

2. We can approximate the cell boundaries by linear functions.

For any such coarse approximation U′ of U, we now have to address (1) point location in a coarse cell of U′; and (2) for each coarse cell, an index tuned to efficiently retrieve high-quality records for that cell. This is necessary since there is no longer a unique set of answers within a coarse cell.

Example 1. We begin with an extremely simple manifestation of the above ideas. Suppose we have three vector spaces author, title and abstract, as in Figure 1. Each query has three weights w_author, w_title and w_abstract, together with corresponding query vectors q_author, q_title and q_abstract. For any query in which w_author ≥ 0.34, we simply find the ℓ nearest neighbors to q_author in the author vector space alone, ignoring the other fields. Similar rules can be invoked for the title and abstract fields.

The intuition behind this simplistic scheme: if the query places the greatest weight on a field, we run a vector-space query for that field alone and ignore the rest of the query. Notice that this can be viewed as a projection of the huge vector space developed above down to a simplex in three dimensions, a coarsening of this simplex into three regions, and finally an efficient (if perhaps not high-quality) retrieval scheme for all queries falling in each one of these regions. Just as we did for allocation maps P in Section 3.1, we can enumerate basic symmetry requirements for any version of this scheme; we omit these for brevity. In Section 5.4 we experiment with a slightly more sophisticated version of this scheme. The generic cell decomposition retrieval algorithm is given below as Algorithm 3, CellDec.

Algorithm 3 CellDec
1: number of cluster probes available = P;
2: Query Q = (q, w);
3: Identify the cell decomposition index of the coarse cell i based on the query template.
4: NSet = set of P nearest clusters to Q from the index of coarse cell i;
5: SearchSet = union of records in NSet;
6: for all record ∈ SearchSet do
7:   compute Match(Q, record)
8: end for
9: Rank SearchSet based on Match

3.3 Comparing the schemes

In this section we compare the resource usages of the resource allocation and cell decomposition schemes. For the allocation schemes, we would need to maintain one cluster-pruning index per field; the cell decomposition schemes instead maintain one index per coarse cell, which is why (as noted in Section 2.2) the geometric indexes are larger.

4 Experimental setup

We now describe the data used in our experiments, followed by the query suite. In Section 5 we describe our findings on the computation-quality tradeoff.

4.1 Data set and preparation

We perform our experiments on a data set obtained by crawling CiteSeer [37]. This data consists of 480,000 documents; for each document, we have three fields – author, title and abstract. This data is processed by stemming and stop-word elimination (standard data preparation steps in information retrieval [30]) and inserted into three base tables in a MySQL database. For each of the fields, the frequency of each term (tf) is computed and normalized; thus, within each field of each document, the squares of the frequencies of the various terms add up to one. At this point we have three vectors for each document, one for each field.

Thus if a vector has m features with term frequencies {tf_1, ..., tf_m}, the weight w_i of the i-th term is tf_i / √(Σ_{j=1}^m tf_j²).
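For instance, a minimal sketch of this normalization (sparse dict representation; the helper name is ours):

import math

def l2_normalize(tf):
    # tf: dict mapping term -> raw frequency; returned weights have unit L2 norm.
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {term: f / norm for term, f in tf.items()}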

Template  w_author  w_title  w_abstract
1         0.33      0.33     0.34
2         0.4       0.4      0.2
3         0.4       0.2      0.4
4         0.2       0.4      0.4
5         0.6       0.2      0.2
6         0.2       0.6      0.2
7         0.2       0.2      0.6

Table 1: Weight templates.

search for. This also ensures that there are likely to be at least some documents in the corpus that are high-quality matches for each query.

For Set B, we use a set of 250 randomly generated queries on a new synthetic data set. This new data set has three fields f_1, f_2 and f_3 that are all generated from the title field of our original data set. Each synthetic document is composed of 3 random original document titles, each title forming a field f_i. Given title_i, i ∈ 1, 2, ..., n, of documents OriginalDoc_i in the original collection, the documents SyntheticDoc_j, j ∈ 1, 2, ..., n/3, in the synthetic data set are

f_1 = title_j,  f_2 = title_{n/3+j},  f_3 = title_{2n/3+j}.

In this data set the document vector lengths in the three field spaces become comparable. Each query in Set B consists of two terms from each field, generated uniformly at random.
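A sketch of this construction (our helper name; n is assumed to be divisible by 3):

def build_synthetic(titles):
    # titles: list of n original title field-vectors (or strings).
    k = len(titles) // 3
    return [{"f1": titles[j], "f2": titles[k + j], "f3": titles[2 * k + j]}
            for j in range(k)]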

4.2.2 Weight templates

For each query prototype we apply seven weight templates, each a triplet of weights. The weights in a template sum to one and model skewed user weighting. The templates are given in Table 1.

Note that templates 2-4 are rotations (around the fields) of each other; likewise for templates 5-7. The first template is meant to model an unbiased query (the user does not emphasize any field); note that for such queries an alternative approach would be to treat the entire document (with all its fields) as one "bag of words" (a single vector) and treat the user query terms also as a single vector. Templates 2-4 model situations where the user emphasizes two fields but is less certain or demanding about the third. Similarly, templates 5-7 model situations where the user emphasizes a single field at the expense of the other two. These broad situations span the gamut of symmetric user needs. The rotations are meant to elicit the effects of asymmetries between the three fields. Templates 5, 6 and 7 are especially useful to study: they can be viewed as a "basis" using which an arbitrary w can be expressed as a linear combination; thus results on allocation for these templates can be combined to devise allocations for arbitrary weight vectors.

n = 50K:  AG = 68.37, 71.47, 67.09

Table 2: Performance for different values of K for collection size 50,000.

n = 100K:  AG = 64.97, 66.82, 65.27

Table 3: Performance for different values of K for collection size 100,000.

5 Results and analysis

We implement the traditional K-means algorithm to cluster each of the 3 fields. To represent the centroid of a cluster, we use the mean of the document vectors within the cluster. Each cluster centroid is implemented as a hashtable of terms and term weights, so that lookup is much faster while performing a similarity computation between the query and the centroid. In order to nullify the difference in index size between the cell decomposition and allocation schemes, we do the following:

• Use the same number of clusters K for both kinds of indexes.

• Store in the centroid only the top 1000 highest-weight terms of the mean of the document vectors.

This ensures that centroid lengths and index sizes for both schemes are the same, and we can invoke Section 3.3 to determine CC(A, Q, E) and CC(CellDec, Q, E); here A represents either allocation algorithm.

5.1 Choosing the right value of K

Theoretically, the optimal value of K can be estimated as follows. From Equation 2, the computational effort for one cluster probe, CC(Uniform, Q, E) with 3 vector spaces, is given by 3K + n/K. This is minimized when K = √(n/3), which for n = 480,000 is 400. To further validate this estimate for K, we conduct some experiments: we measure the performance of Uniform against the ground truth for different values of K. We experiment with subsets of our document set of 50,000 and 100,000 documents, and cluster them using various values of K. In each case we fix the computational cost at roughly 2500 (there is some variation because when we decide to probe a set of clusters, their sizes may not add up to exactly 2500). The results (for both metrics) are shown in Tables 2 and 3.
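Spelling out the minimization behind this estimate (3K centroid comparisons across the three fields, plus roughly n/K records in the probed cluster, as the text models it):

\[
\frac{d}{dK}\left(3K + \frac{n}{K}\right) = 3 - \frac{n}{K^2} = 0
\quad\Longrightarrow\quad
K = \sqrt{n/3}; \qquad
n = 480{,}000 \;\Rightarrow\; K = \sqrt{160{,}000} = 400.
\]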

We see that the quality peaks around K = 256 for both of the sample corpora.

n = 480K:  AG = 80.55, 80.88, 83.67, 83.98, 75.12

Table 4: Performance for different values of K for collection size 480,000.

Figure 2: Uniform vs. the ground truth – aggregate goodness.

Figure 3: Uniform vs. the ground truth – competitive recall.

5.2 Uniform vs. the ground truth

We observe that for templates with high weight on the author field (Templates 2 and 5), the retrieval quality is much higher than for the other templates. This happens because of two biases:

• The asymmetry of the fields: each document is represented by fewer nonzero vector components in the author field than in (for instance) the abstract field. Consequently, any match on authors tends to dominate the similarity score more than a similar match on abstracts.

• The query generation scheme for Set A is conditioned on an author pair chosen from the author field. The significance is not that our query generation is skewed or misleading. Rather, an application using resource allocation should bias the mapping P towards the dominant mode by which users think of query tasks (e.g., if they begin by thinking of titles as the primary driver of their queries, P should invest disproportionately additional work in the title index).

5.3 Uniform vs. Transparent

Transparent gives only marginal improvements over Uniform on Set A. This stems from our mode of generation of the queries in Set A; so we studied Set B instead, to see whether a different class of queries would highlight the differences between Uniform and Transparent. The results are shown in Figures 4 and 5. For these comparisons we show the number of cluster probes P invested on the x-axis (the computational costs of both allocation schemes are linear in and proportional to the number of probes invested). We observe that Transparent performs consistently better than Uniform for all the templates. In particular, it is interesting to note that it beats Uniform especially on highly skewed templates. This suggests that when user needs come from a more homogeneous setting such as Set B, non-uniform allocation makes a difference. An interesting area now opens up: can more sophisticated allocation than Transparent make a bigger difference? How do we design the optimal policy P?

Figure 4: Transparent vs. Uniform – aggregate goodness.

Figure 5: Transparent vs. Uniform – competitive recall.

5.4 Cell decomposition indexes

We now describe experiments with a slightly more sophisticated cell decomposition than the simple scheme of Example 1 in Section 3.2. Consider the unit simplex of query weights for the three fields, shown in Figure 6. This simplex is partitioned into four cells, labeled 1, 2, 3 and 4. Each cell corresponds to a range of weights that a query can take. We maintain one optimized index for each cell; whenever the weights in a query fall into cell i, 1 ≤ i ≤ 4, we use index i. Recall Table 1 listing the weight templates for our experiments; we note that Templates 1-4 fall in region 1, with Templates 5, 6 and 7 falling respectively in regions 2, 3 and 4.

Figure 6: Regions of the triangular simplex covered by each index.

Next, we describe the index for each region:

• Region 1: For 1 ≤ j ≤ n and 1 ≤ i ≤ 3, let V_{j,i} denote the vector for record j in field i. For each record j we compute a composite vector

V_j = Σ_{i=1}^{3} V_{j,i}.

We now build an index based on cluster pruning on the single vector space spanned by the V_j's. Intuitively: in region 1 the query weights are "roughly the same", so we simply treat all the contents of a record – authors, title, abstract – as one bag of words and use cluster pruning on the resulting vectors.

• Region 2: The index for region 2 is created by a linear combination of the author vector and the vectors from the other two fields – title and abstract – the latter two each multiplied by a squeeze factor θ. Thus

V_j = V_{j,1} + θ·V_{j,2} + θ·V_{j,3}.

The indices for regions 3 and 4 are created similarly, by squeezing a different pair of axes. Intuitively, we are attenuating the vectors from fields that are de-emphasized in the queries using the particular region.

Note that these indexes are created up front; when a query specifies a particular weight vector w, it is sent to the index that is likely to yield the best quality results for that weight vector.
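A sketch of building the composite vectors for the region indexes (sparse dict representation; squeeze_combine is our helper name, and region 1 corresponds to θ = 1, i.e., a plain bag of words):

def squeeze_combine(field_vecs, emphasized, theta=0.5):
    # field_vecs: dict field-name -> {term: weight}; the emphasized field keeps
    # full scale, the other fields are attenuated by the squeeze factor theta.
    combined = {}
    for field, vec in field_vecs.items():
        scale = 1.0 if field == emphasized else theta
        for term, weight in vec.items():
            combined[term] = combined.get(term, 0.0) + scale * weight
    return combined

# Region 2 index vectors, for example:
# V_j = squeeze_combine({"author": a, "title": t, "abstract": b}, "author")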

While creating the clusters for cluster pruning, we run K-means clustering over the vectors thus obtained, just as in our earlier schemes. The only difference comes in the computation of the centroid of each cluster: while calculating the mean of the documents within a cluster, we apply an L2-normalization to the terms of each field in a document before calculating the centroid. This ensures that fields with very few dimensions are not under-represented.

To estimate a good value for the squeeze factor θ, we use a sampled subset of 10,000 random documents and values of θ ∈ [0.1, 1]. The results are shown in Figure 7. We observe that for θ = 0.5, queries from all three templates do best. Hence we choose this as our squeeze factor and compare the performance of the cell decomposition scheme with that of both the uniform and transparent allocation schemes, using the same data set of 10,000 documents. The results are shown in Figures 8 and 9.

Figure 7: Performance for different squeeze factors.

Figure 8: Aggregate goodness vs. computational cost.

We observe that even our simple cell decomposition scheme consistently outperforms both Uniform and Transparent, whether for a fixed cost or for a fixed quality.

Figure 9: Competitive recall vs. computational cost.

6 Conclusions and further work

Our work (particularly with the aggregate goodness measure) suggests that we can find high-quality results for vector score aggregation at a small fraction of the computation of exhaustive search. Our experiments motivate the pursuit of more sophisticated allocation schemes. This becomes especially intriguing with recursive cluster pruning schemes, where the allocation at higher levels can depend on what lies deeper in each sub-tree. The second area for further work is more sophisticated cell decomposition schemes: given an application, how do we determine the best cell decomposition scheme based on system parameters? How (for either class of schemes) should the algorithm parameters be data-dependent? Empirically studying cost-quality tradeoffs in more general settings [13, 6] is an exciting direction.

References

[1] P. K. Agarwal, J. Erickson. Geometric Range Searching and Its Relatives. In CRC Handbook of Computational Geometry, 1997.

[2] N. Beckmann, H.-P. Kriegel, R. Schneider and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. Proceedings of ACM SIGMOD, 322–331, 1990.

[3] S. Berchtold, D. A. Keim and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. Proc. of the 22nd VLDB Conference, 1996.

[4] M. W. Berry, S. Dumais, G. W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval. SIAM Review 37:4, 1995.

[5] S. Bhatia, J. Deogun. Cluster characterization in information retrieval. ACM-SAC 1993, Indiana, USA, 721–727.

[6] M. Charikar, R. Fagin, V. Guruswami, J. Kleinberg, P. Raghavan and A. Sahai. Query strategies for priced information. Journal of Computer and System Sciences 64(4):785–819, 2002.

[7] W. Cody, L. Haas, W. Niblack, M. Arya, M. Carey, M. Flickner, D. Lee, D. Petkovic, P. Schwarz, J. Thomas, M. Tork Roth, J. Williams, R. Fagin and E. Wimmers. Querying multimedia data from multiple repositories by content: the Garlic project. IFIP 2.6 3rd Working Conference on Visual Database Systems (VDB-3), 1995.

[8] M.-J. Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix, 1785.

[9] T. Cover, P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13 (1967), 21–27.

[10] D. Dobkin, R. Lipton. Multidimensional Search Problems. SIAM Journal on Computing, 5 (1976), 181–186.

[11] C. Dwork, R. Kumar, M. Naor and D. Sivakumar. Rank aggregation methods for the web. Proceedings of WWW10, 2001.

[12] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer-Verlag, 1987.

[13] O. Etzioni, S. Hanks, T. Jiang, R. M. Karp, O. Madani and O. Waarts. Efficient Information Gathering on the Internet. 37th Annual Symposium on Foundations of Computer Science, 1996.

[14] R. Fagin. Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58(1):83–99, 1999.

[15] R. Fagin and Y. Maarek. Allowing users to weight search terms. RIAO (Recherche d'Informations Assistée par Ordinateur), 682–700, 2000.

[16] R. Fagin and E. Wimmers. A formula for incorporating weights into scoring rules. Theoretical Computer Science 239, 2000.

[17] R. Fagin, A. Lotem and M. Naor. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences 66, 2003.

[18] R. Fagin, R. Kumar, D. Sivakumar. Efficient similarity search and classification via rank aggregation. Proceedings of ACM SIGMOD, 2003.

[19] C. Faloutsos, W. Equitz, M. Flickner, W. Niblack, D. Petkovic and R. Barber. Efficient and Effective Querying by Image Content. Journal of Intelligent Information Systems, 1994.

[20] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal and A. E. Abbadi. Approximate Nearest Neighbor Searching in Multimedia Databases. Technical Report TRCS00-24, Comp. Sci. Dept., UC Santa Barbara, 2000.

[21] A. Guttman. R-trees: a dynamic index structure for spatial searching. Proceedings of ACM SIGMOD, 47–57, 1984.

[22] J. Hafner, N. Megiddo, et al. US Patent 5848404: Fast query search in large dimension database, 1998.

[23] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proc. of 30th STOC, 604–613, 1998.

[24] N. Katayama and S. Satoh. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. Proceedings of ACM SIGMOD, 1997.

[25] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. Proc. 29th ACM Symposium on Theory of Computing, 1997.

[26] K.-I. Lin, H. V. Jagadish and C. Faloutsos. The TV-tree: An Index Structure for High Dimensional Data. VLDB Journal, 3(4):517–542, 1992.

[27] X. Long and T. Suel. Optimized Query Execution in Large Search Engines with Global Page Ordering. Proceedings of VLDB, 2003.

[28] D. G. Luenberger. Investment Science. Oxford University Press, 1997.

[29] M. Manasse, L. McGeoch and D. Sleator. Competitive algorithms for on-line problems. Journal of Algorithms, 11:208–230, 1990.

[30] G. Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice Hall Inc., Englewood Cliffs, 1971.

[31] T. Sellis, N. Roussopoulos and C. Faloutsos. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. Proceedings of VLDB, 1987.

[32] S. Sitarama, U. Mahadevan, M. Abrol. Efficient cluster representation in similar document search. Proceedings of the WWW Conference, 2004.

[33] I. H. Witten, A. Moffat, T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, 1994.

[34] D. A. White and R. Jain. Similarity Indexing with the SS-tree. In Proceedings of the 12th Intl. Conf. on Data Engineering, 1996.

[35] D. Yu, A. Zhang. ClusterTree: Integration of Cluster Representation and Nearest Neighbor Search for Large Datasets with High Dimensionality. IEEE International Conference on Multimedia and Expo, 2000.

[36] trec.nist.gov: Text REtrieval Conference series.

[37] citeseer.ist.psu.edu: CiteSeer Scientific Digital Library.
