
Technical Report Number 679

Computer Laboratory

UCAM-CL-TR-679
ISSN 1476-2986

Automatic summarising: a review and discussion of the state of the art

Karen Spärck Jones

January 2007

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500

© 2007 Karen Spärck Jones

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/

Abstract

This paper reviews research on automatic summarising over the last decade. This period has seen a rapid growth of work in the area stimulated by technology and by several system evaluation programmes. The review makes use of several frameworks to organise the review, for summarising, for systems, for the task factors affecting summarising, and for evaluation design and practice.

The review considers the evaluation strategies that have been applied to summarising and the issues they raise, and the major summary evaluation programmes. It examines the input, purpose and output factors that have been investigated in summarising research in the last decade, and discusses the classes of strategy, both extractive and non-extractive, that have been explored, illustrating the range of systems that have been built. This analysis of strategies is amplified by accounts of specific exemplar systems.

The conclusions drawn from the review are that automatic summarisation research has made valuable progress in the last decade, with some practically useful approaches, better evaluation, and more understanding of the task. However, as the review also makes clear, summarising systems are often poorly motivated in relation to the factors affecting summaries, and evaluation needs to be taken significantly further so as to engage with the purposes for which summaries are intended and the contexts in which they are used.

A reduced version of this report, entitled 'Automatic summarising: the state of the art', will appear in Information Processing and Management, 2007.

Automatic summarising: a review and discussion of the state of the art

Karen Spärck Jones

1 Introduction

In the last decade there has been a surge of interest in automatic summarising. This paper reviews salient notions and developments, and seeks to assess the state of the art for this challenging natural language information processing (NLIP) task. The review shows that some useful summarising for various purposes can already be done but also, not surprisingly, that there is a huge amount more to do.

This review is not intended as a tutorial, and has somewhat different goals from such valuable earlier publications as Mani and Maybury (1999) and Mani (2001). As a state of the art review it is designed to consider the nature and results of the very extensive work on, and experience of, summary system evaluation since e.g. Mani (2001), though to motivate this analysis the review takes into account the large growth of summarising research since the mid 1990s. Thus the review approaches the status of summarising research first from the point of view of recent evaluation programmes and the factors affecting summarising that need to be taken into account in system evaluation. Then, to complement this discussion, the review examines system strategies (for convenience using fairly conventional strategy classes) to see both how these strategies interpret a general model of the summarising process and what evidence there is for the strategies' effectiveness, insofar as the evaluations to date have stress tested them: it is in fact hard to make solid comparisons or draw general conclusions about correlations between task conditions and strategy choice.

The paper is organised as follows. The remainder of the Introduction notes the stimuli to summarising research in the last decade. Section 2 presents basic frameworks for characterising summarising systems, for evaluation in general, and for summary evaluation, that are used in the sections that follow. Section 3 considers summary evaluation in more detail, and analyses the evaluations that have been done so far. Section 4 examines the coverage of factors affecting summarising in systems and tests so far. Section 5 reviews implemented system design classes, with exemplar illustrations. Finally, Section 6 offers an assessment of overall progress both in understanding summary task requirements and in building systems to meet them.

The Dagstuhl Seminar in 1993 (Endres-Niggemeyer et al. 1993) represented a first community attempt to promote research on a task that had, apart from scattered efforts, seemed too hard to attempt. The 1997 ACL Workshop (ACL-97) can be seen as a definite starting point for major research effort. Since then there has been a rapid growth of work on automatic summarising, worldwide, illustrated by a large literature including two books (Mani and Maybury 1999; Mani 2001). This research has been fostered by many workshops and further encouraged by the Document Understanding Conferences (DUCs), now in their sixth cycle (DUC). The DUC programme, actually, despite its name, about summarising, owes much to the style and lessons of the Text REtrieval Conferences (TRECs; see Voorhees and Harman 2005). It has addressed the very difficult issues of summary evaluation through road maps designed to specify versions of the task, and performance criteria for these, in a way that is realistic given the state of the art at any time, but promotes a coherent advance.

Research on summarising since the mid-90s has been driven not only by ideas going back to the beginning of automatic summarising in Luhn's work (Luhn 1958), but also by the general development of statistical approaches to NLIP, as illustrated by Language Modelling, and by successes with hybrid symbolic and statistical approaches to other complex NLIP tasks like information extraction (IE) and question answering (QA). The more recent QA evaluations within TREC have included questions seeking extended answers that are a form of summary, and teams participating in the DUC programme have also been participants in the earlier IE evaluations in the Message Understanding Conferences (MUC) programme (see Chinchor 1998) and in the QA evaluations. The performance levels reached with statistical and hybrid techniques in other areas, though not always high, have been sufficiently respectable to suggest that they offer a practical approach to useful summarising, where more ambitious strategies that exploit semantic and discourse information, of the kind discussed at Dagstuhl, can only be long-term goals. The general improvements in NLP technology, for example in fast and robust parsing (Appelt et al. 1993), and the arrival of solid public tools, like part-of-speech taggers, have made it much easier to put together experimental rigs for exploring new tasks and strategies for tackling them.

At the same time, the huge growth in digital material, and especially full text, has naturally stimulated a demand for systems that can produce derivatives that are highly concentrated on particular themes and topics, whether by selecting particularly informative initial text material, or by producing wholly new text to a more compact form, or by some combination of the two. This explosion of digital material has occurred in both the public and the non-public domain, but with different consequences for summarising work.

Much of the material in the non-public domain, for example proprietary journal databases with subscription access, is of the kind with which the original automatic summarising work was concerned; but the difficulty of obtaining open test material, and the proprietors' focus on other concerns, together with the technical opacity of much journal material (e.g. in chemistry), have meant that more recent summarising research has in general not tackled this type of text material. It is equally difficult to get test collections for other non-public material like enterprise data, which are often heterogeneous enough to be a different kind of challenge for summarising; and enterprise systems for managing these data have also concentrated more on improving other facilities like indexing, categorisation and search. The public material, on the other hand, and in particular the news material from which recent test collections have been predominantly drawn, presents its own distinctive challenges and opportunities for summarising, especially through the extensive repetition of text content: for example, this repetition may make it easier to identify salient content and ensure that even quite coarse summarising techniques will pick up anything important from one or another similar source.

This flood of digital text, and notably open Web text, has thus been a stimulus to work on summarising in multiple ways. It has encouraged a demand for summarising, including summarising for material in different languages and in multi-media contexts, i.e. for speech as well as 'born' text and for language associated with image data. It has also emphasised the multiple roles that summarising can have, i.e. the different forms that summarising as an NLIP task can take, even if the classically dominant roles, namely prejudge or preview vis-a-vis a larger source text, remain important: this is closely associated with the browsing, 'cut-and-paste' model of information management that IT has encouraged. The text flood has at the same time made it easier to develop or enhance NLIP strategies that rely on statistical data about language use.

The demand for automatic summarising has been matched by the NLIP research community's confidence (or at any rate belief) that, compared with twenty years ago, they are much better equipped with techniques and tools, and by their experience with information extraction in particular, to make a non-derisory attack on summarising. Potential clients, notably the 'military-security complex', have reactively, as well as proactively, raised their performance expectations. In a more general way, the rampant march of the Web and Web engines has encouraged the perception that NLIP can do amazing things, and thus the expectation that new and more powerful facilities will come on stream all the time: the summary snippets that engines now offer with search results are normally extremely crudely extracted, but this does not mean they are not useful, and they illustrate the continuously improving facilities that the engines offer.

For all of these reasons, the status, and state, of automatic summarising has been transformed in the last decade. Thus even though most of the work done has been on shallow rather than deep techniques, the summaries produced have been defective in many ways, and progress in relation to 'quality' summarising has been very limited, something has been learnt about the task, a good deal has been learnt about some summarising needs and some summarising technologies, and a useful start has been made on coherent experimental work in the field.

2 Discussion framework

As indicated, this review will consider both the character of the summarising techniques and systems that have been explored in the last decade, and such task performance results as have been obtained in evaluation studies. Since the work reported has been very varied, I will exploit some earlier description schemes as ways of analysing approaches to summarising and of examining system performance. (The specific publications cited in this framework presentation are used because they provide concrete handles for the subsequent review, not as claims to exclusive originality.)

System structure

As a very general framework for characterising summarising systems I will use that presented in Sparck Jones (1999). This defines a summary, taking text as the classic though not essential form of input and output, as

a reductive transformation of source text to summary text through content condensation by selection and/or generalisation on what is important in the source.

Sparck Jones (1999) then assumes a tripartite processing model distinguishing three stages, as shown in Figure 1: interpretation of the source text to obtain a source representation, transformation of the source representation to obtain a summary representation, and generation of the summary text.
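
To make the tripartite model concrete, the following sketch (in Python, which is not used in the report itself) treats the three stages as pluggable functions. The representation types and function names are illustrative placeholders, not part of the published framework.

    from typing import Any, Callable

    # Minimal sketch of the tripartite model. SourceRep and SummaryRep stand
    # in for whatever representation a given system uses, from tagged surface
    # text to deep meaning structures; the stage functions are supplied by
    # the particular system.
    SourceRep = Any
    SummaryRep = Any

    def summarise(source_text: str,
                  interpret: Callable[[str], SourceRep],
                  transform: Callable[[SourceRep], SummaryRep],
                  generate: Callable[[SummaryRep], str]) -> str:
        source_rep = interpret(source_text)    # interpretation of the source
        summary_rep = transform(source_rep)    # transformation (condensation)
        return generate(summary_rep)           # generation of the summary text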

Definition and framework seem obvious, but are deliberately intended to allow for more variety in what constitutes a summary and in how it is derived than is now too frequently assumed. Thus the definition refers both to a summary's conceptual content and to its linguistic expression, and the framework allows for both minimal surface processing, transferring some given source text to summary text, and much more radical, deeper operations that create and transform meaning representations. The amount of work done at the different stages can also vary greatly, not merely between but within systems. Much current work is focused, under the label extractive summarising, on various approaches to surface processing, for instance by choosing different source-text selection functions; but where such extractive summarising seems to be inadequate for some summary purpose, a shift to abstractive summarising is proposed. This is intended to identify and re-present source content, following what is taken to be the generic style of conventional abstracts, as for academic papers, i.e. to be informative rather than indicative, to use the same language as the source, perhaps a similar content ordering, etc. However there are other summary forms that include digests and reviews, and range from query-oriented quotations to populated templates, which satisfy the definition and to which the framework can be applied, for both analysis and comparison purposes.

Summarising factors

The simple framework of Figure 1 applies just to summarising systems in themselves, as processors. But summarising systems are not context free. It is essential, as discussed in Sparck Jones (1999) and further developed in Sparck Jones (2001), to make the task for which summarising is intended explicit: there is no natural or best summary of a source regardless of what summarising is for. As Endres-Niggemeyer (1998) for example makes clear, professionals develop summaries on the basis of knowing what they are for. The design, and evaluation, of a summarising system has therefore to be related to three classes of context factor, as shown in a condensed and slightly modified version of Figures 2-4 of Sparck Jones (2001) in Figure 2. These are the input factors that characterise properties of the source material, e.g. language, style, units, etc; the purpose factors that bear on summaries, including their intended use and audience; and the choices for output factors, like degree of reduction and format, that depend on the input and purpose features of any particular summarising case. (The sketchy factor characterisation given in the figure will be filled out using specific system examples in later sections.) There is no point in comparing the mechanisms used in different systems without regard for the summarising purposes for which they are intended and the nature of the source material to which these mechanisms are being applied. Equally, there is no legitimacy in system assessment for output without regard to purpose and input data. For proper evaluation, of course, the purpose factors have to be solidly enough specified to ground the actual evaluation methodology used.
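
As a hypothetical illustration only (the field names simply echo the examples given above; a real specification would need far more detail), the three factor classes can be written down as a small structured record:

    from dataclasses import dataclass

    @dataclass
    class InputFactors:          # properties of the source material
        language: str            # e.g. "English"
        style: str               # e.g. "newswire"
        units: str               # e.g. "single document" or "document set"

    @dataclass
    class PurposeFactors:        # what the summary is for
        use: str                 # e.g. "relevance filtering for retrieval"
        audience: str            # e.g. "busy researchers"

    @dataclass
    class OutputFactors:         # choices conditioned on input and purpose
        reduction: float         # e.g. 0.1 for a summary 10% of source length
        format: str              # e.g. "running text" or "headed fields"

    @dataclass
    class SummarisingContext:
        input: InputFactors
        purpose: PurposeFactors
        output: OutputFactors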

Evaluation elements and levels

There are however many choices to be made in designing and conducting an evaluation. These can be developed using the decomposition framework for evaluation developed in Sparck Jones and Galliers (1996). This covers the evaluation remit and the evaluation design intended to meet this remit. Thus, as shown in Figure 3, the remit has to establish the evaluation motivation and goal, and a set of choices collectively determining what may be labelled the nature of the evaluation. The evaluation design then has to locate, or position, the summarising system appropriately in relation to the remit. The input and purpose factors of Figure 2 define the environment variables, along with the output factors insofar as their generic character is clearly implied by the system's purpose or, indeed, is explicitly stated, perhaps even in detail. The system parameters and their settings reflect the processor structure of Figure 1. To complete the evaluation design, this view of the system in relation to the evaluation remit has to be filled out with choices of performance criteria, evaluation data, and evaluation procedure. Again, the discussion of actual evaluations will fill out the brief evaluation characterisation given in Figure 3.

One feature of evaluations has been particularly important for language processing tasks, and has figured in summary evaluation. This is the distinction between intrinsic evaluation, where a system is assessed with respect to its own declared objectives, and extrinsic evaluation, where a system is assessed by how well it functions in relation to its encompassing setup. Thus for example summaries may be intrinsically evaluated against a system objective of delivering well-formed discourse, and extrinsically against a setup requirement for summaries that can replace full scientific articles for information reviews for busy researchers. Experience with summary evaluation since Sparck Jones and Galliers (1996) has suggested a finer granularity is needed, from semi- through quasi- and pseudo- to full-purpose evaluation, as shown in Figure 4 and further discussed below. This emphasises the point that evaluation without any reference to purpose is of extremely limited value, and it is more sensible to think of a continuum from the more intrinsic to the more extrinsic. Thus even an apparently intrinsic assessment of text well-formedness presupposes well-formedness is required in the task context. The gradations in the figure may seem over-refined, but as the discussion later illustrates, they are grounded in experience.

As summarising overall is so rich and complicated, I will use what has been done in summary evaluation as a route into my analysis of systems work. Thus I will take the way researchers have tackled evaluation as a way of addressing what summarising is all about, considering first evaluation over the last decade in the next section and then, in the following section, the major approaches to summarising that have figured in more recent research on automated summarising. Though this strategy is the reverse of the more conventional one, which begins by considering summarising models and then how successfully they have been applied, it may supply a better picture of the state of the art.

The context for summary evaluation has also been influenced by the development of NLIP system evaluation in general in the last fifteen years. Evaluation methodology and practice have been seriously addressed for different NLIP tasks, notably speech transcription, translation, information extraction, text retrieval and question answering. These tasks variously share characteristics and technologies. This has encouraged transfers of evaluation notions from one to another, including ones from earlier-addressed tasks like translation to later ones like summarising. These transfers are not always well taken, specifically by failing to distinguish evaluation against system objectives from evaluation against larger setup purposes, so meeting objectives is taken to mean satisfying purposes. Experience with summary evaluation in the last decade shows the distance is greater than earlier believed.

3 Summary evaluation

Some of the earlier research on automatic summarising included evaluation, sometimes of a fairly informal kind for single systems (e.g. Pollock and Zamora 1975), sometimes more organised (e.g. Edmundson 1969; Earl 1970), also with comparisons between variant systems (e.g. Edmundson 1969) or against baselines (Brandow et al. 1995). The growth of interest in summarising during the nineties prompted the more ambitious SUMMAC cross-system evaluation (SUMMAC 1998, Mani et al. 2002). The DUC programme (DUC) in turn represents a more sustained effort to evaluate summarising systems. It has been of value directly in providing information about the capabilities of the systems tested, with some additions from other tests using the DUC materials though not formally part of the programme. But it has been of more value so far in forcing researchers in the field to pay attention to the realities of evaluation for such a complex NLIP task, both in terms of how concepts like intrinsic and extrinsic evaluation are to be interpreted and hence how useful they are, and of how sufficiently detailed evaluation designs can be formulated.

The DUC programme's original, and revised, road maps envisaged a progression from essentially internal system-oriented evaluation to external purpose-oriented evaluation. But the challenge of devising a true task-oriented evaluation for summarising, engaging with the contextual task for which summaries are to be used, has, not surprisingly, proved far more difficult than for other NLIP tasks where evaluation programmes have been able, in one way or another, to limit evaluation scope. Thus for speech recognition, for example, evaluation has normally been limited to transcription, and for retrieval to a system's ability to deliver relevant documents, especially at high ranks, in both cases ignoring larger task interests on the basis that doing better with these core components automatically assists the larger task. Information extraction has de facto followed a similar core-focused strategy.

It is much harder to pin down a summarising core component, certainly in a form which offers system developers much insight into its parameters or from which useful predictions about larger task performance can be made. This difficulty has been compounded by the fact that the researchers involved have come from very different summarising starting points and by the fact that, on the measures so far applied, system performance has been far from high. This makes it hard to develop task-oriented evaluations that are both related to researchers' interests and not too far beyond their systems' capabilities.

The DUC programme, along with some related programmes, has nevertheless played a significant role in encouraging work on summarising. The next sections review the major evaluation concepts, over the intrinsic/extrinsic spectrum, i.e. from least to most involved with purpose, that have been deployed in summary evaluation, and consider DUC and other evaluation programmes.

Summary evaluation concepts

The problems of summary evaluation, and some common evaluation strategies, already figure in Pollock and Zamora (1975). Much of what has been done in the last decade can be seen as an attempt to firm up and, as importantly, to scale up, earlier approaches and to move from the kind of approach used in Edmundson (1969), which explicitly eschewed any evaluation for the literature screening purpose for which such summaries were intended, to task effectiveness testing.

In the earlier work on summarising, it was evident, first, that producing coherent discourse was in itself an NLP challenge: thus the sentences in a Luhn (1958) abstract, however individually grammatical, did not when concatenated give a coherent summary text, syntactically, semantically, or referentially. Moreover even if sentence-extractive approaches like Luhn's do in general deliver syntactically well-formed sentences, there is no good reason to limit automatic summarising to these methods, and there is therefore a general summarising requirement to produce both well-formed sentences and well-formed discourse. It was also evident, second, that capturing key source concepts for a summary is hard, given we are dealing with complex conceptual structure, even on some 'simple' reflective view of a summary as a source writ small for the source text readers' preview. It was further evident, third, that measuring success in coherent delivery and concept capture is a tough problem, again even on some simple reflective view of the source-summary relationship.

Text quality

There is no absolute requirement that summarising output must consist of running text: it can consist of, e.g., a sequence of phrases, or a tabular format with phrasal fillers. But the need to produce running text is sufficiently common that it seems reasonable to start evaluation simply by considering whether the system can produce 'proper' sentences and 'properly connected' discourse. NLP has advanced sufficiently to produce both proper sentences and locally cohesive, even globally coherent, discourse. Thus for this kind of 'preliminary filtering' evaluation it is appropriate to apply a series of text quality questions or checks, e.g. 'It should be easy to identify what the pronouns and noun phrases in the summary are referring to.' It may well be the case in practice that users in some particular contexts can tolerate a good deal of ill-formedness, but text quality evaluation is still valuable, especially for system developers, and it has played a significant role in DUC.

However quality questions are easiest to answer for local phenomena, within individual sentences or between adjacent ones. When they refer to a summary as a whole they are bound to be either restricted to specific phenomena, e.g. anaphoric references, or rather impressionistic. In particular, it may be hard to establish true semantic coherence for technical subject matter. Of course summaries may be misleadingly coherent, e.g. suggesting links between entities that do not hold in the source, but even human summaries can be defective in this.

Unfortunately, as Marcu and Gerber (2001) point out, quality assessment is too weak to be a system discriminator. The more substantive point about text quality evaluation, however, is that it does in fact, even if only in a low-key way, refer to summary purposes: the system's objective, to produce well-formed discourse, or even just phrases, is geared to what this output is for. The convention that refers to text quality assessment as intrinsic evaluation should re-label it as the semi-purpose evaluation of Figure 4, and recognise this in any evaluation detail.

Concept capture

The second question, does the summary capture the key (appropriate) concepts of the source, is much harder to answer or, more particularly, to answer to measurable effect. Even for a 'plain' reflective version of summarising, establishing that a summary has this relationship to its source is extremely challenging, not only because it involves judgements about conceptual importance in the source but because concepts, especially complex relational ones, are not clear cut and they may be variably expressed. We may wish to specify precisely what should appear in the summary, but this is impossible in other than unusually constrained contexts. In general, asking for important source concept markup while leaving open how this should appear in the summary is too vague to support evaluation by direct source-summary pairing. Trying to control the process leads naturally to the model summary evaluation strategy considered below.

The basic problem is that humans do not agree about what is important in a source. As Rath et al. (1961) showed, even when given the relatively restricted requirement to pick out a specific number of the most representative sentences in a source, agreement between their human subjects was low. It would in principle be possible to handle this, and hence evaluation, via a degrees-of-agreement strategy, but multiple source markup would be very costly, and there is still a problem about whether the markup specification could be made sufficiently robust, without being unduly prescriptive, to support useful system evaluation by direct source(markup)-summary comparison.

The literature for and on professional abstracters (e.g. Rowley 1982; Endres-Niggemeyer 1998) suggests that important source content markup is a key practical process, but only as one element in the summarising process. Using source markup as a basis for evaluation is thus problematic for this reason regardless of the others. It nevertheless seems that proper summary evaluation should consider the relation between summaries and their sources. Some other evaluation methods have therefore been used which refer to this, albeit indirectly rather than directly.

Edmundson's (1969) and Brandow et al.'s (1995) checking of summaries for acceptability against source was a weak procedure. But there are problems with the tighter or more focused form of summary-source comparison that reading comprehension tests appear to offer. Morris et al. (1992) used standard educational assessment comprehension tests to investigate the effects of summary condensation, but do not discuss the implications of the type of question used. Minel et al. (1997) and Teufel (2001) use more sophisticated questions referring to source argument structure. However using questions about significant source points that a summary should also be able to answer, as in SUMMAC (SUMMAC 1998, Mani et al. 2002), is bound to be somewhat hit-and-miss where rich sources are concerned, and Kolluru and Gotoh's (2005) experiment is too small to support their claim that the method is robust despite human subjectivity. More generally, as Minel et al. point out, this strategy again involves an implicit reference to context and purpose: just as the notion of reflective summary implies that this is the sort of summary that is required for some reason, the same applies, in sharper form, to the reading comprehension model. This point is addressed in Farzindar and Lapalme's (2005) lawyer-oriented questions. But more generally, reading comprehension is a typically underspecified variety of quasi-purpose evaluation.

In general, therefore, direct source-summary comparison has not figured largely in summary evaluation. It seems a plausible strategy in principle. But it is methodologically unsound when divorced from knowledge of summary purpose, which could mandate source content that should appear in any summary. The main reasons it has not been used in practice, however, appear to be the effort involved, except in the weaker versions just considered, rather than recognition of its methodological weakness.

Gold standards

In practice, direct source-summary pairing has been replaced by the use of human reference, or gold standard, summaries, so comparison for agreement on significant source content can be considered without the complication introduced by the source-summary condensation. Most of the automatic summary evaluation for content capture done in the last decade has been on this more restricted, summary-summary pairing basis. It can be applied rather straightforwardly to extracted sentences, when the human summarisers are instructed that this is the form of summary required, and also to the more usual form of newly-written summary through a content 'nugget' markup process. With non-extractive summaries human assessors are still required, both to do the reference summary markup and to judge whether, and how far, the reference nuggets are captured in the system summaries (cf. the SEE program used in DUC (Over and Yen 2004)).

Unfortunately different human beings do not agree even on what sentences to extract to form a summary, let alone write identical or, often, very similar summaries, especially where there are no heavily constraining summarising requirements, e.g. specifying precisely which types of information are to be given, as noted in Rath et al.'s (1961) study when viewed as gold-standard extractive summary evaluation, and considered recently by, e.g., Daumé and Marcu (2004) and Harman and Over (2004). Thus model summary variations may swamp system variations (McKeown et al. 2001), and comparisons between reference and system summaries are likely to show many differences, but without any indication of how far these affect summary utility for the end-user. One way round this, as with source-summary pairing, is to have multiple human summaries, with the reference extracted sentences, or nuggets, ranked by their mutually agreed status, as in the Pyramid scheme (Passonneau et al. 2006). However this increases the evaluation effort, especially when the need for many reference summaries to counteract the effects of variation is fully recognised (van Halteren and Teufel 2003). Moreover where human assessors are required, as with nugget comparison, variation can be large (Lin and Hovy 2002b; Daumé and Marcu 2004; Harman and Over 2004), which implies many judges are needed for evaluation to be really useful.
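
The weighting idea behind this kind of scheme can be sketched as follows. This is a simplified reading (nugget identification and matching are assumed already done, and normalisation details differ in the published method), not the Pyramid scheme as defined by Passonneau et al.; the nugget labels in the toy example are invented.

    from collections import Counter

    def pyramid_style_score(reference_nuggets: list, system_nuggets: set) -> float:
        # Weight each nugget by the number of reference summaries containing it.
        weights = Counter()
        for ref in reference_nuggets:
            for nugget in ref:
                weights[nugget] += 1
        # Weight captured by the system summary.
        captured = sum(weights[n] for n in system_nuggets if n in weights)
        # Best weight obtainable with the same number of nuggets.
        ideal = sum(sorted(weights.values(), reverse=True)[:len(system_nuggets)])
        return captured / ideal if ideal else 0.0

    # Toy example: three reference summaries over hypothetical nugget labels.
    refs = [{"earthquake", "casualties", "aid"},
            {"earthquake", "aid"},
            {"earthquake", "rebuilding"}]
    print(pyramid_style_score(refs, {"earthquake", "rebuilding"}))  # 0.8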

One advantage, in principle, of the gold-standard strategy is that appropriate deference to summary purpose can be built in. Thus as long as the human summarisers write summaries for the specified use and take account of other purpose factors like those shown in Figure 2, evaluating automatic summaries by comparison with human summaries ought to indicate the automatic summaries' fitness for the purpose in question. However there are two problems with this. The first (as painfully learnt in document indexing) is that human output is not necessarily well-fitted to purpose. The second, more important, point is that alternative outputs may in fact be as, or even more, satisfactory for the purpose in question. It is true that while unobvious index terms may work well for autonomous system searches, summaries have to be comprehensible to people. However this still allows for summaries very different from given human ones, and for effective summaries that do not fit closely even with the most agreed human content. (Fitness for purpose also applies in principle to the earlier source-markup strategy, but is even harder to manage than in the comparatively 'packaged' reference summary case.)

With extractive summaries in particular, automatic comparison between reference and system summaries is perfectly feasible, and the technology for ngram comparison, originally applied to machine translation, has been developed in the ROUGE program and applied to summary evaluation (cf. ROUGE, Lin 2004). It can allow for multiple reference summaries, and indeed for evaluation against other system summaries. It can also be used to compare non-extractive summaries, though clearly lexical agreement is likely to be lower. As a technique for evaluating summaries it is however much less informative than for translations, since with translations it is quite reasonable to bound comparisons, e.g. by sentences, or at any rate to expect local rather than global variation. With whole-summary comparisons more variation can be expected, so similarity is likely to be much lower. As the method is applicable not just to individual words but to strings, it can implicitly take some account of well-formedness and not just lexical similarity, but only (de facto) in a limited way. It is thus evident that ROUGE-style evaluation for summarising is a very coarse mode of performance assessment except for specific, tightly defined summary requirements. Proposals have been made for more sophisticated forms of automatic comparison designed to capture cohesion, by Hori et al. (2004), or concept structures via graphs, by Santos et al. (2004), but these do not escape the fundamental problems about the gold standard evaluation model. There is the problem of model summary variation and, as Daumé and Marcu, and Harman and Over, point out, of variation in human assessors in e.g. identifying nuggets or comparing them. The implication is that multiple measures of performance are needed, especially since, as McKeown et al. (2001) show, they rank systems differently, and that wherever human judges are required, measures of inter-judge agreement should be applied.
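
For concreteness, a minimal single-reference version of the ROUGE-N idea is sketched below; the actual ROUGE package adds stemming and stopword options, several distinct measures, and aggregation over multiple reference summaries, none of which is shown here.

    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(reference: str, candidate: str, n: int = 2) -> float:
        # Fraction of reference n-grams also found in the candidate summary,
        # with counts clipped so repeated n-grams are not over-credited.
        ref = ngrams(reference.lower().split(), n)
        cand = ngrams(candidate.lower().split(), n)
        overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
        total = sum(ref.values())
        return overlap / total if total else 0.0

    print(rouge_n_recall("the quake killed hundreds in the region",
                         "hundreds killed in the region by the quake"))  # 0.5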

In this spirit, Amigo et al. (2005) put forward a more ambitious gold standard methodology, using probabilistic techniques to assess, choose among, or combine different similarity metrics for comparing automatic and model summaries. But as they acknowledge, it all depends on having satisfactory gold standard summaries (though possibly many alternative ones), and there has to be independent validation for the gold standards. The gold standard model, however inadequate, is thus more correctly labelled quasi-purpose evaluation, as in Figure 4, than, as usually hitherto, intrinsic evaluation; and as with the previous evaluation concepts the real status of the method deserves more examination in any particular case: specifically, what is the assumed purpose and what grounds are there for supposing the gold standard summaries, especially newly written rather than existing ones, satisfy it.

The foregoing implies there are relatively early limits to what can be learnt about summary merit independent of demonstrated utility in a task context. System developers may indeed find any of the forms of evaluation mentioned extremely useful in helping them to get some idea of whether their systems are doing the sort of thing they would like them to do, but encouraging system developers in their beliefs is not the same as showing their beliefs are well-grounded. However, as the specific evaluation examples described in Sparck Jones (2001) and also Sparck Jones and Galliers (1996) imply, genuine purpose-based evaluation for some task systems is bound to be extremely expensive, and it is natural for those trying to build systems with any significant capabilities to start by working only with much cheaper and simpler test protocols. Moreover given some widely available data and evaluation conventions, it is natural to adopt a 'suck it and see' approach, trying new ideas out using existing evaluation rigs, particularly since this allows obvious comparisons with others' systems as well as reducing costs. This has been long-established practice with text retrieval. But it has the major disadvantage that it emphasises mechanics, and diverts attention from the contextual conditions within which the original data collection and evaluation apparatus were based and which should properly be reassessed as appropriate for new systems. (This 'suck it and see' strategy is of course quite different from 'suck it and see' out there with real users, considered further below.)

There are other problems with the forms of assessment just considered, which have not been sufficiently recognised. One is evaluation scale. Though DUC and other programmes have increased test scale, summary evaluation has generally been modest in scale, and in some cases very limited in all respects. As Jing et al. (1998) show, evaluation may be very sensitive to specific data and context conditions, for example required summary length. Though the range of environment variable values and system parameter settings covered in comparisons has slowly increased, sensitivity analysis is still too rare.

Baselines and benchmarks

Gold-standard summaries, specifically manual ones, have been taken as defining a target level for automatic summarising systems. Direct comparisons with them do not themselves define upper bound task performance, but they may be used to obtain a target task performance level, just as in retrieval the performance obtained with careful, manually formulated search queries is de facto a target for automatic systems. It has also become standard practice to define a baseline level of summary satisfactoriness or task performance, that any automatic system worth its salt ought to outdo.

One such baseline, for extractive summarising, has been random sentence selection, and indeed any system that cannot do better than this has problems. Perhaps more instructive baselines for news material have been taken, as in Brandow et al. (1995) and later in DUC, as a suitable length of opening, lead, source text. This baseline strategy depends on the particular properties of news, where sources are typically opened with a summary sentence or two. It will not necessarily work for other kinds of source, and it would be useful to develop a more generally-applicable form of baseline, or rather benchmark, analogous to that given by the basic 'tf*idf'-type weighting with stemmed terms in retrieval: with sensible interpretations of tf*idf this gives respectable performance. Thus one possibility for summarising would be a 'basic Luhn' using sentences ranked by a similar form of weighting. This could be justified as a simple, but motivated, approach to automatic summarising that it ought to be possible, but is not trivial, to outdo; and indeed this form of benchmark has been used in practice. It is motivated, in particular, as delivering summaries that could have some practical utility in task contexts. The strategy could also be applied, with suitable adjustment, to multi-document as well as single-document summarising. However since tf*idf weighting varies in detail, it could be useful for researchers to adopt a common standard as a benchmark.
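
A 'basic Luhn' benchmark of the kind suggested here might look like the following sketch, which scores sentences by summed tf*idf weight and returns the top-scoring ones in source order. The tokenisation, the particular tf*idf variant and the absence of stemming are all simplifying assumptions for illustration, not a recommended common standard.

    import math
    import re
    from collections import Counter

    def basic_luhn_summary(document: str, corpus: list, k: int = 3) -> str:
        tokenize = lambda text: re.findall(r"[a-z]+", text.lower())
        # Document frequencies over a background corpus plus the document itself.
        docs = [set(tokenize(d)) for d in corpus + [document]]
        df = Counter(term for d in docs for term in d)
        n_docs = len(docs)
        tf = Counter(tokenize(document))
        sentences = re.split(r"(?<=[.!?])\s+", document.strip())

        def score(sentence):
            return sum(tf[t] * math.log(n_docs / df[t]) for t in tokenize(sentence))

        # Rank sentences by score, keep the top k, and restore source order.
        top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                     reverse=True)[:k]
        return " ".join(sentences[i] for i in sorted(top))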

With some particular summarising strategies it may be possible to define upper bound performance (Lin and Hovy 2003), but such upper bounds, while useful, cannot be of general application.

Recognising purpose

The need to address evaluation conditions explicitly and afresh, and to cross what has been taken as the intrinsic/extrinsic boundary, is illustrated in Figure 5, which shows a reduced version of Sparck Jones (2001). This instantiates the evaluation specification shown in Figure 3 for the particular case where police reports about potential road obstructions from sleeping animals are the base for alerting summaries to a local population. The evaluation is intended, by questioning the townspeople, to show whether the summaries published in the local newspaper have been more effective as warnings than mobile police radio vans. It is evident that there are many other possible ways of establishing the alerts' effectiveness, and equally that different evaluations would be required to show, for instance, whether alerts with graphics were more effective than simple text alone, and also, whatever these evaluations might seem to show about the summarising system's effectiveness, whether police time spent on sending round warning vans and attending to accidents due to this type of obstruction was actually reduced.

At the same time, the original police reports might be taken as source material for a quite different purpose, namely as information sources for a biological researchers' database, for which a different sort of summary would be required and, of course, quite different evaluation.

This example about wombats may appear frivolous. But all of the elements have serious analogues: thus police reports about traffic were used to produce alerts in the POETIC project (Evans et al. 1995). Summaries, rather than just extracts, as potential material for a database are illustrated by BMJ (the British Medical Journal). Here editorials and news items are summarised by lead-text extracts, but research papers have formatted abstracts with headings subsuming separate mini-abstracts, which may be phrasal or telegraphese (see Figure 6). Questionnaires are a standard evaluation device. The examples also emphasise, crucially for the present context, the point that though both the alerting evaluations can be labelled extrinsic ones, they illustrate different levels of evaluation from Figure 4. Thus the questionnaire-based evaluation is a form of pseudo-purpose evaluation, addressing the real summary audience but asking about putative influences on their driving behaviour rather than finding out about their actual driving. The police time analysis, on the other hand, if done with a suitable before-the-alerts and after-the-alerts comparison, embeds the police activities that are responses to the alerts' causes within a larger setup evaluation. This is thus a full-purpose evaluation.

Research on automatic summarising so far has, as noted, touched on only a few and scattered choices among the many output possibilities, though they have been given larger scope by suppressing context detail and assuming that same-size-fits-many is a respectable summarising strategy. But the examples in Figure 5 imply that it is important, before taking over existing evaluation datasets and performance measures, to check that their context implications for a system are recognised and acceptable. Thus even for two cases which are quite similar, namely producing alerting summaries from input incident reports as done in POETIC and imagined in Figure 5, it would be easier to assess summaries for the former for factual accuracy and appropriate timing, because derived from particular traffic accidents, than the latter, which are generalisations.

The ramifications of context are well illustrated by the real BMJ case. BMJ offers two types of summary, for different materials. Editorials and news items are summarised using lead-text extracts, while research papers have formatted abstracts with subsidiary mini-abstracts, which are sometimes phrasal or telegraphic in style, per field. These differences are partly attributable to differences of the sources themselves, but much more to the purposes that the summaries are intended to serve. They presumably reflect different combinations of readership and reader interest, i.e. use, including multiple uses which may apply both to the extracts and the abstracts. In this they also illustrate the fact that even a single notional use, e.g., say, scanning for general background knowledge, has many variations: thus one researcher may note that yet another study has been done on condition C, another researcher that this is a study on C in the elderly.

Purpose evaluations

The DUC programme has had proper purpose evaluation as an eventual goal (DUC). More generally, it has been recognised that it is necessary to address the task for which summarising is intended (e.g. Hand 1997), not least because, as Okurowski et al. (2000) make clear, what happens in real world situations introduces desiderata and complexities that make focusing on the summarising system per se a recipe for inadequate or inappropriate systems, as well as wasted effort.

Thus while summary evaluation so far has mostly been of the kinds already considered, some purposes have been envisaged, though without any associated evaluation, and there have been serious purpose-oriented evaluations. These have normally, however, not been full-purpose evaluations, but only pseudo-purpose ones, with varying degrees of simplification of or abstraction from full working environments.

The main summary use considered so far has been for relevance filtering for full documents in retrieval. This was assumed, for example, in Pollock and Zamora (1975), and tested in Brandow et al. (1995), Mani and Bloedorn (1997) and Jing et al. (1998), in SUMMAC (SUMMAC 1998, Mani et al. 2002), and by Wasson (2002). These tests all used a protocol where relevance assessments on summaries are compared with those on their full sources. This seems simple, but is methodologically challenging. Thus comparing subjects' assessments on summaries with reference assessments of sources (i.e. against what Dorr et al. (2005) call gold-standard annotations) improperly changes the people involved. However it may be difficult to avoid untoward priming effects if the same users judge both sources and summaries. Dorr et al.'s experiments avoided this, but for a rather special document set situation. As is the case with retrieval testing generally, large samples are needed for stable comparative results when user assessments are individual. Earlier tests used existing rather than task-tailored summaries, but query-independent ones. Tombros et al. (1998) compared these with dynamic query-biased summaries, to the latter's advantage.
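
The core comparison in this protocol can be sketched as follows, with hypothetical data structures: relevance decisions made from summaries are set against decisions made from the full sources for the same documents, and the rates of missed relevant documents and of false alarms are what matter for the screening use. The function and field names are illustrative, not drawn from any of the cited studies.

    def relevance_agreement(source_judgements: dict, summary_judgements: dict) -> dict:
        # Both arguments map document identifiers to a boolean relevance decision.
        docs = source_judgements.keys() & summary_judgements.keys()
        agree = sum(source_judgements[d] == summary_judgements[d] for d in docs)
        relevant = [d for d in docs if source_judgements[d]]
        non_relevant = [d for d in docs if not source_judgements[d]]
        missed = sum(not summary_judgements[d] for d in relevant)
        false_alarms = sum(summary_judgements[d] for d in non_relevant)
        return {
            "agreement": agree / len(docs) if docs else 0.0,
            "miss_rate": missed / len(relevant) if relevant else 0.0,
            "false_alarm_rate": false_alarms / len(non_relevant) if non_relevant else 0.0,
        }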

Overall, there have been enough evaluations, both individual and within programmes, of summaries envisaged as or designed for full-document screening to support, despite the limitations of individual evaluations, some general conclusions about summarising for this use. Thus just for this one component function in information seeking, it appears that summaries are respectable predictors of source relevance. But it is also the case that very different summaries are equally effective, because the task is not a demanding one, so performance in this task cannot be taken as a useful indicator for others, nor the retrieval task as a convenient evaluation proxy for other tasks.

Retrieval is an independently existing task with an established evaluation protocol. Other purposes for which summarising systems have been intended are clearly legitimate in that we observe that they exist in real life, though there is no established test protocol (or handy data) for them, and we can see that they either already involve summarising or might benefit from it, say by digesting masses of material. However just as with retrieval, as a task that can take many different forms depending on local context, other tasks can come in many guises and may indeed be just one functional element in a larger whole as, for example, question answering, so the particular evaluation task has an artificial independence about it. This line may be pushed further when the evaluation task is based on a hypothesised function for summaries. Thus Minel et al. (1997) evaluated summaries as potential support for writing syntheses of sources.

Summary roles related to the retrieval one that have had some evaluation include support for browsing, to preempt or facilitate source review. Browsing is hard to evaluate: Miike et al. (1994) report a simple time-based evaluation. This group of purposes includes making question answering more efficient, as in Hirao et al. (2001). Summarising has also been used as an internal system module to improve document indexing and retrieval, as in Strzalkowski et al. (1998), Sakai and Sparck Jones (2001), Lam-Adesina and Jones (2001) and Wasson (2002); however here evaluation is by standard retrieval methods.

The other major class of purposes evaluated so far has been report generation, whether these are viewed as digests or briefings. However there have not been many purpose-oriented evaluations here. McKeown et al. (1998) report only an informal user-oriented study on the value of patient-oriented medical literature summaries, but one in a real application context. Jordan et al.'s (2004) evaluation for data-derived summary briefings in a similar clinical setting is yet more embedded in a real context, as well as being a more substantive evaluation. In McKeown et al.'s (2005) evaluation of Newsblaster, report writing was used as a means of evaluating summaries as sources of facts: system-generated reports were not the subject of evaluation.

Summary evaluation programmes

The DUC evaluations

DUC has been the first sustained evaluation programme for automatic summarising. But as it has been considered in detail elsewhere (e.g. in Over's overviews), I shall consider only its salient features here, focusing on what we can learn about the general state of automatic summarising from it and regretfully ignoring the substantial and instructive detail to be found in its workshop proceedings and website (DUC).

The programme was based on a broad road map (Road Map 1) that envisaged a gradual advance from less to more challenging summarising along major dimensions: for input, from monolingual to translingual, single document to multi-document, unspecialised material, like news, to technical material; for purpose, from 'generic' or general-purpose reflective summaries to specific-purpose ones typically requiring more transformation, of one kind or another, of the source material, or longer rather than very brief summaries (and hence informative rather than indicative ones); and for output, as moving from summaries where fully cohesive text might not be mandatory to properly integrated and coherent ones. As importantly, it was envisaged that evaluation would progress from the less demanding, but sufficient for the early phases, intrinsic evaluation to, eventually, serious task-based extrinsic evaluation.

In fact, changes were made even in the initial stages: thus multi-document summarising figured from the beginning, largely because participants were already working on it, and because it may not, for the reason indicated earlier, be harder than single-document summarising. On the other hand it proved difficult to move from news source material, through a lack of participant interest and background resources. Moreover formulating and implementing satisfactory evaluation protocols proved to be extremely difficult. Thus the programme has so far had two stages: an initial one from 2000 onwards, covering DUC 2001-DUC 2004, and the second from 2004 onwards. The main features of the DUC cycles to date are shown in Figure 7. The mode of evaluation as labelled within the programme itself, intrinsic vs extrinsic, is shown on the left, with annotations reflecting the finer granularity proposed in Figure 4 on the right. It should be noted that within the DUC literature the term "task" refers to the specific different summarising requirements, e.g. produce short headline summaries of less than 12 words, rather than to task in the rather broader sense used in this paper.

The first phase, following the first road map, was devoted to news material, with evaluation primarily for output text quality and by comparison with human reference summaries, but with a slightly more overt reference to envisaged system output use in 2003 and 2004. The specific evaluation tasks covered a wide range of summary lengths, and even cross-language summarising; and they introduced some modest variation on purpose factors through specified different summary uses. These were sometimes only implicit, as in asking for summaries focused on events or from viewpoints, but were sometimes explicit, as in seeking summaries for utility in indicating source value, or as responsive to question topics. The participants explored a range of strategies, all of an essentially extractive character, but ranging from wholly statistical approaches to ones combining statistical and symbolic techniques, and with such hybrid methods applied to both source material selection and output generation. Particular groups sometimes used the same, or very similar, methods for different DUC tasks, but sometimes quite distinct ones, for example Lite-GISTexter (Lacatusu et al. 2003) as opposed to GISTexter (Harabagiu and Lacatusu 2002). However as Figure 7 shows, the results in all four cycles up to 2004, while better than the first-sentence(s) baseline, were consistently inferior to the human reference summaries. In particular, coverage of reference content was low.

But this would not necessarily mean that the automatic summaries were of no utility for specific purposes. The problem with the initial moves towards task-oriented evaluation attempted in the various styles of summary sought in these evaluations in 2003 and 2004 was that the constraints they imposed on summaries, for example creating summaries geared to a topic, were already rather weak, so evaluation via comparison with a supposedly appropriate human summary was very undemanding indeed. Thus while this version of quasi-evaluation was intended to be more taxing than comparison against general-purpose summaries, it was not noticeably discriminating. At the same time, the content-oriented nugget evaluation by human judges was both expensive and not unequivocal, while the ROUGE-style automatic evaluation was not very informative. Moreover the first attempts, in 2003 and 2004, to address summary task roles more explicitly by asking human judges about summary utility (as a guide to potential source value) or responsiveness to a prior question, in a kind of minimal pseudo-evaluation task, were not at all revealing. It is difficult to judge utility or responsiveness 'stone cold' in the absence of an actual working task context.

The difficulties and costs of the evaluations, as they became evident in DUC 2003, stimulated the emphasis on ROUGE as the means of coverage evaluation in DUC 2004, to see whether this fully automatic process could replace the semi-automatic nugget comparisons. The results showed fair correlation, but perhaps not surprisingly given the dominance of extractive approaches. It was certainly not clear that such a convenient mode of evaluation would serve as the sole useful one. There were also, in phase one as a whole, problems with the early limits of the text-quality questions for extractive summaries: they could function as a filter below which summary quality should not fall, rather than a first-rank evaluator.

Some of the complexities of the DUC programme are shown in Figure 8, which gives the tasks and their evaluation methods for DUC 2003 and DUC 2004 in more detail. The programme has sought to advance the state of the art in tandem with appropriate measures of performance, but this has in practice meant much more ad hoc than systematic change, so that while it is possible to see some development in broad terms, it is almost impossible to make systematic comparisons. Thus while DUC 2004 attempted to tackle some of the problems that previous DUCs raised, it did not resolve them, and brought new ones with the work on ROUGE. It also brought new complexities by introducing wholly new tasks, namely deriving summaries from (mechanically or manually) translated Arabic source documents, as well as by other changes of detail. Overall, there was a general feeling that the evaluations were not providing enough leverage: thus cruder extractive methods performed poorly, but not significantly worse than rather more sophisticated ones. There were other causes of dissatisfaction too. Thus the focus on news, though valuable from some practical and funding points of view, meant that issues that other types of source present were never tackled.

All of these considerations led to a revised road map, Road Map 2, intended to move more decisively towards serious purpose-based evaluation, and to the first evaluation of this second DUC phase in 2005. This was much more tightly focused than previous DUCs, with a single task, creating short, multi-document summaries focused on a rather carefully-specified user topic with associated questions, and also at either general or particular level, as illustrated in Figure 9. Evaluation was on the same lines as for DUC 2004, but to compensate for human vagaries, both the ROUGE-based coverage assessment and the responsiveness assessment were made against multiple human reference summaries. These sets of human summaries were also used in a parallel study (Passonneau et al. 2005) to see whether nugget-based evaluation could be improved by weighting nuggets by their human capture. At the same time, it was hoped that the more carefully developed user questions, as well as checking for responsive content rather than expression, would support more discriminating and also useful responsiveness evaluation.
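The weighting idea investigated by Passonneau et al. can be sketched as follows: a nugget's weight is the number of human reference summaries that express it, and a system summary is scored by the total weight of the nuggets it covers relative to the best total obtainable with the same number of nuggets. This is only an illustrative outline, not the published pyramid implementation, and it assumes nuggets have already been matched by hand and represented as identifiers.

def pyramid_style_score(system_nuggets, reference_nugget_sets):
    # weight each nugget by the number of human summaries expressing it
    weights = {}
    for ref in reference_nugget_sets:
        for nugget in ref:
            weights[nugget] = weights.get(nugget, 0) + 1

    covered = set(system_nuggets)
    captured = sum(weights.get(n, 0) for n in covered)

    # ideal score: the same number of nuggets drawn from the most
    # heavily weighted ones available
    best = sum(sorted(weights.values(), reverse=True)[:len(covered)])
    return captured / best if best else 0.0

# e.g. three human summaries, with a system covering two lightly weighted nuggets
print(pyramid_style_score({"n2", "n3"},
                          [{"n1", "n2"}, {"n1"}, {"n1", "n3"}]))   # 0.5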

In general, the more concentrated character of the DUC 2005 evaluation, with its methodology emphasis, was an advantage: for example, the particular quality questions distinguished baseline from human from system summaries in different ways. Overall, however, for all of the evaluation methods, the general comparative performance continued as before, with many different systems performing much the same, edging above the baseline but clearly inferior to humans. However it is worth noting that the ROUGE scores were strongly correlated with responsiveness performance: this could imply that with well- (i.e. purpose-) oriented reference summaries, a good deal might be learnt from quasi-purpose comparative evaluation, though this has to be qualified if absolute scores are low.
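Such correlations between evaluation measures are normally reported as rank correlations over per-system averages. The sketch below computes Spearman's rho from two parallel score lists; the scores in the usage example are invented purely for illustration.

def rankdata(values):
    # average ranks (1-based), with ties sharing their mean rank
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman rank correlation: Pearson correlation of the ranks
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# per-system average ROUGE vs. responsiveness scores (made-up numbers)
print(spearman_rho([0.32, 0.35, 0.28, 0.40, 0.31], [2.1, 2.4, 1.9, 2.8, 2.2]))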

DUC 2006 continues the DUC 2005 model, with only minor modifications, apart from the official inclusion of the pyramid nugget-scoring mode of evaluation.

Overall, as inspection of the detailed results for DUC 2005 shows, while many systems perform equally well as members of indistinguishable blocks in relation to the performance measures used, systems that do relatively well on one measure tend to do so on another; however, individual systems vary both for individual quality questions and for quality versus responsiveness. The lessons to be learnt for the systems themselves, and for the types of approach to summarising they represent, are considered further in Section 6.

Other programmes

The second major evaluation programme for summarising has been the NTCIR one (NTCIR), over three cycles of the Text Summarisation Challenge (TSC-1 to TSC-3) from NTCIR-2 in 2001 to NTCIR-4 in 2004 (NTCIR has other tasks as well). The programme in general resembled DUC, but with an institutionalised distinction between extracts and abstracts, implying different evaluation methods. However the programme involved explicit extrinsic task evaluation from the beginning, following SUMMAC models, i.e. pseudo-purpose evaluation, as well as intrinsic evaluation including both semi-purpose evaluation by text quality and quasi-purpose model comparisons. The tests used Japanese news material, so lengths for abstracts were specified in characters. Like DUC, the details became more careful over time.

TSC-1 had three tasks, all single-document summarising. The first, extracting different numbers of important sentences, was evaluated by comparison with professional human extracting on Recall, Precision and F measures. The second, aimed at producing plain text summaries of different character lengths (abstracts in principle, though they could be extracts), was evaluated in two ways, by word-stem comparisons with human summary vocabularies, and on content coverage against the source and readability. The third, aimed at producing summaries as guides for retrieval, was evaluated against full sources in SUMMAC style. TSC-2 had two tasks: producing single-document summaries at different character lengths, and producing short or long multi-document summaries. The evaluation here used the same coverage and readability assessment as in TSC-1 for both single and multi-document summaries, and also a ‘degree of revision’ measure for the single-document summaries. There was no form of extrinsic evaluation. TSC-3 was the most careful evaluation, for example with source documents marked up both for important sentences for extracts, and for useful sentences providing matter for abstracts. The tests were on multi-document summarising, again treating extracts and abstracts separately. The former were intrinsically assessed for Precision and for coverage, the latter for coverage and readability. The abstracts were also evaluated extrinsically, following another SUMMAC model, for question answering, but in modified “pseudo-QA” form checking only for presence of an “answer” using string matching and edit distance.
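For the sentence-extraction tasks, the Recall, Precision and F comparison reduces to set overlap between the sentences a system selects and those a human marked. A minimal sketch, assuming both selections are available as sets of sentence identifiers:

def extract_prf(system_ids, gold_ids):
    # Precision, Recall and F1 for sentence extraction, comparing the
    # sentences a system selected against a human reference selection
    system_ids, gold_ids = set(system_ids), set(gold_ids)
    hits = len(system_ids & gold_ids)
    precision = hits / len(system_ids) if system_ids else 0.0
    recall = hits / len(gold_ids) if gold_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. a system that picked sentences 1, 3, 7 where humans marked 1, 2, 3, 4
print(extract_prf({1, 3, 7}, {1, 2, 3, 4}))   # roughly (0.67, 0.50, 0.57)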

In considering the results, it is most useful to take those for TSC-3. Performance for the extracts showed low coverage (implying uneliminated redundancy) with middling Precision, with many systems performing similarly, though for this data (news material, but presumably with different conventions in Japanese media) noticeably better than the lead baseline. The content evaluation scores for abstracts were low, with human abstracts much better, and there were generally low scores for readability, though with variations for the many different specific questions. The “pseudo-QA” scores look rather better numerically than the content ones, but it is difficult to determine their real significance and informativeness.

The NTCIR programme was similar in many ways to the earlier DUCs, in design, results, and also in the dominance of essentially extractive systems. Thus the tests used the same kind of material and reproduced the single-document/multi-document modes of summarising at various lengths. The evaluations were primarily intrinsic, including both semi-purpose quality assessment and quasi-purpose comparisons with manual summaries (presumably of a reflective kind). The results similarly exhibit relatively low performance levels, inferior to manual summaries. The systems were typically variations on sentence extraction using statistical and location criteria, perhaps with light parsing so subordinate material could be eliminated or, in multi-document summarising, to make it easier to compare sentences for similar and hence redundant material.
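The flavour of such extractive systems can be conveyed by a small sketch that scores sentences by the corpus frequency of their content words plus a position bonus, and returns the top-scoring sentences in source order. The stoplist and weighting constant are illustrative assumptions, not any particular participant's settings.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on"}

def score_sentences(sentences, position_weight=0.3):
    # score each sentence by the average frequency of its content words,
    # plus a bonus that decays with sentence position (lead bias)
    tokenised = [[w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
                 for s in sentences]
    freqs = Counter(w for toks in tokenised for w in toks)
    scores = []
    for i, toks in enumerate(tokenised):
        tf_score = sum(freqs[w] for w in toks) / len(toks) if toks else 0.0
        scores.append(tf_score + position_weight / (i + 1))
    return scores

def extract(sentences, n=3):
    # return the n top-scoring sentences in their original order
    scores = score_sentences(sentences)
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n])
    return [sentences[i] for i in top]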

The topic-with-questions model for summarising adopted for DUC 2005 and 2006 clearly has a close relationship with one of the forms of question answering studied in the TREC Question Answering (QA) track (Voorhees 2005a, 2005b). Thus the QA tests have included ones for so-called definition questions, like “Who is X?” or “What is a Y?”, which appear to seek, or at any rate justify, rather more extensive responses than factoid questions like “How long is the Mississippi River?”: in the tests the responses were treated as a set of information nuggets. In a later development taking questions in series, similar responses were supplied to supplement the answers to earlier specific questions in the series. Evaluation depended on assessors formulating, partly a priori and partly a posteriori, a set of appropriate nuggets and, more specifically, on identifying some of these as vital. System performance could then be assessed for returning (a) vital and (b) acceptable nuggets. A similar strategy was adopted for the series response case.
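The nugget scoring used for these definition questions combined recall over vital nuggets with a length-based approximation to precision, weighted towards recall. The sketch below follows the published descriptions in outline, but the character allowance and beta value should be treated as illustrative assumptions rather than the exact track parameters.

def nugget_f(vital_matched, okay_matched, total_vital, response_length,
             allowance_per_nugget=100, beta=3.0):
    # recall counts only vital nuggets; precision is a length-based
    # approximation (a character allowance per nugget matched);
    # beta > 1 weights the F measure towards recall
    recall = vital_matched / total_vital if total_vital else 0.0

    allowance = allowance_per_nugget * (vital_matched + okay_matched)
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length

    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)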

These extended question responses do fall under the general heading of summaries. But the specific task differs from the DUC one in that there is no prior specification of the source, or set of sources, to be summarised; the documents from which a set of nuggets is drawn need have no relationship with one another in the way that the members of a DUC set of multiple
