
Chapter 6

Discourse Relations and Document Structure

Harald Lüngen, Maja Bärenfänger, Mirco Hilbert, Henning Lobin, and Csilla Puskás

Abstract This chapter addresses the requirements and linguistic foundations of automatic relational discourse analysis of complex text types such as scientific journal articles. It is argued that besides lexical and grammatical discourse markers, which have traditionally been employed in discourse parsing, cues derived from the logical and generic document structure and the thematic structure of a text must be taken into account. An approach to modelling such types of linguistic information in terms of XML-based multi-layer annotations and to a text-technological representation of additional knowledge sources is presented. By means of quantitative and qualitative corpus analyses, cues and constraints for automatic discourse analysis can be derived. Furthermore, the proposed representations are used as the input sources for discourse parsing. A short overview of the projected parsing architecture is given.

Keywords Discourse parsing · Discourse relations · Document structure · Text technology · Linguistic annotations · XML

6.1 Introduction

In the past, several approaches to automatic discourse analysis have been developed as applications of relational discourse theories which describe the semantics of discourse. These approaches are often based on the analysis of discourse connectives as well as morphological and syntactic features. Such surface-oriented strategies are adequate and have yielded good results when applied to the analysis of simple text types like newspaper articles, which are characterised by a limited size and a relatively simple document and syntactic structure. When dealing with more complex text types, however, an analysis of lexis and grammar is not sufficient. Sources of knowledge about discourse and document semantics have to be considered as well.

H. Lüngen (corresponding author)
Justus-Liebig-Universität Gießen, Gießen, Germany
e-mail: luengen@uni-giessen.de

A. Witt, D. Metzing (eds.), Linguistic Modeling of Information and Markup Languages, Text, Speech and Language Technology 40, DOI 10.1007/978-90-481-3331-4_6, © Springer Science+Business Media B.V. 2010


This chapter deals with the linguistic foundations of discourse analysis for a complex text type, using scientific journal articles as an example. Its focus is on the contribution of logical document structure, generic document structure and thematic structure to discourse parsing. The modelling and representation of linguistic structures and knowledge sources based on text-technological (XML-based) formalisms and methods is addressed. The representations are used in investigating correlations and interactions between different types of linguistic information and serve as an input to a discourse parsing system.

In the project SemDok, which is part of the Research Group Text-technological modelling of information funded by the German Research Foundation (DFG) and scheduled to run in its second phase for three years (2005–2008), a discourse parser for the complex text type “scientific research article” is being developed. Scientific articles exhibit a highly complex document structure (both logical document structures and relational discourse structures are deeply nested) and a relatively large average size in terms of word count. The discourse parser is envisaged in a specific application scenario: it shall be part of an explorative reading system which supports novice students in learning to adopt adequate strategies for reading scientific articles. The system shall have two dimensions: firstly, it shall provide a tool to support selective and explorative reading and, secondly, it shall function as a learning environment where students can acquire knowledge about the genre “scientific article”, its generic text type structure (with categories such as introduction, method, results and discussion) and possible argumentative strategies and thematic structures. Support for explorative and selective reading shall be based on two mechanisms: highlighting text structures and providing automatically generated link lists to different structural nodes as navigation elements. Highlighting and linking both serve as starting points for the exploration of an article. By offering link lists or by directing attention to highlighted passages, readers are guided to thematically or rhetorically significant parts of a text. Additionally, access to the different structural levels of the text is simplified, as the building plan of the text is made explicit.

Highlighting and linking require the preprocessing of articles. They must be automatically analysed and annotated on the levels of document structure, text type structure, rhetorical structure and thematic structure. The automatisation of analysis and annotation is necessary to enable users of the system to upload articles that they themselves consider relevant. The discourse parser developed in the SemDok project will automatically add discourse structure annotations and thus allow students a personalised use of the system.

The present chapter is structured as follows: Section 6.2 gives a theoretical overview of the different linguistic levels relevant for the analysis of the relational discourse structure of a scientific article: logical document structure, thematic structure (referential structure, lexical cohesion), and generic document structure. Furthermore, our notion of relational discourse structure, which refers to Rhetorical Structure Theory (RST, Mann and Thompson 1988), is introduced. In Sections 6.3.1 and 6.3.2, the corpus and the layers of annotations that we employ in developing and evaluating the parser are characterised. Section 6.3.3 addresses additional resources such as the discourse marker lexicon and the inventory of rhetorical relations and describes their representation in XML. The chapter is concluded by a short overview of the architecture of the projected discourse parser and an outlook on future work.

6.2 Linguistic Foundations

6.2.1 Document Structure

The research described in this chapter is based on the assumption that documents can be regarded as complex signs. As complex signs they are built up from smaller units, where the units themselves and their connections are constituted by linguistic and visual mechanisms. These units of a document are complex and elementary segments. Elementary segments are usually rectangular areas which can be delimited clearly according to certain features and are not themselves composed of segments (e.g. paragraphs or headings). Complex segments are adjacent combinations of segments to which a common document function can be assigned (e.g. sections).

Documents can be regarded as signs with respect to their syntagmatic, their semantic and their pragmatic dimensions. In a syntagmatic perspective, documents can be described by grammars which define the way in which segments can be combined to yield valid documents of a certain type. In a semantic perspective, the meaning of a document is a function of the meanings of its parts and its document type. The combination of elementary segments to form complex segments follows compositional principles. These, however, are activated by the document type assumptions and expectations, which complete the compositionally formed document meaning.

The constitutive units of documents are 2D objects, called segments. Segments can almost always be geometrically described as rectangles which cover parts of the document area. Examples of segments are text blocks, tables, headings and address fields, but also graphics and illustrations, i.e. flat objects which have a recognisable coherent structure and cannot be described by linguistic means alone. Tables and lists contain linguistically definable structures on the one hand (e.g. lists can be interpreted as coordination), but on the other hand they are also specified by geometric and graphic properties.

Only text blocks, which do not show any further geometric properties apart from the line break, represent purely linguistic objects. Text blocks form the transition between the one-dimensionality of the language and the two-dimensionality of the document by being split up mechanically into lines which fill the segment from top to bottom.

Segments are aggregated in the document area, in which semantic connections between the segments are encoded by topology and graphic design. The document area is restricted, though; it is defined by the restrictions of the medium (size of the printable paper, screen or window etc.). If this does not suffice, document parts are formed so that they can be read in a temporal order one after the other (successive pages of a book, window content which can be scrolled, or window content which is replaced by the activation of a link). In this respect, many documents also have a temporal dimension besides the two spatial ones, so that one can speak of documents as having a 2.5-dimensionality.

The syntagmatic structure of text segments has been examined quite extensively in text linguistics. Dependencies between the sentences are established by means of cohesion of different types. The linguistic properties of the syntagmatic level of text segments can be described by rules which permit the sentence syntax to continue above the sentence boundaries. The syntagmatic structure of segments with graphic elements, such as tables, is given by the iconic properties of lines, columns and boxes. These relations can also be described by rules that are based primarily on the cognitive processes of [...]. Complex segments and whole documents are formed by the aggregation of segments. A typical case is the aggregation of several text segments (paragraphs) to form a text body that is provided with a heading to yield a section. The formation of a complex segment is defined by the adjacent aggregation of the segments in the text area. These syntagmatic properties of documents can be described by rules which resemble those for the formation of sentences; they can be collected in a document grammar. In document grammars, the media-specific conditions of a document are omitted systematically. The necessary page breaks are included in a document grammar no more than line breaks are in the descriptions of segments in a text grammar.
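The notion of a document grammar can be illustrated with a minimal sketch. The single rule checked here (a section consists of a heading, one or more paragraphs, and optionally nested sections) and the names `Segment` and `is_valid_section` are illustrative assumptions, not the grammar used in the project. In line with the text, the rule makes no reference to pages or line breaks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    kind: str  # assumed kinds: "heading", "paragraph", "section"
    children: List["Segment"] = field(default_factory=list)


def is_valid_section(seg: Segment) -> bool:
    """Check the assumed rule: section -> heading paragraph+ section*."""
    if seg.kind != "section" or not seg.children:
        return False
    if seg.children[0].kind != "heading":
        return False
    body = seg.children[1:]
    # Count the run of paragraphs that directly follows the heading.
    n_par = 0
    while n_par < len(body) and body[n_par].kind == "paragraph":
        n_par += 1
    if n_par == 0:
        return False
    # Everything after the paragraphs must be a valid nested section.
    return all(c.kind == "section" and is_valid_section(c) for c in body[n_par:])
```

A real document grammar would typically be stated declaratively, for instance as an XML schema, rather than as a hand-written checker.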

Grammatical dependencies indicate semantic relations. The syntactic structure of a sentence licenses the construction or representation of its meaning in a suitable formalism. There are different approaches to text semantics which presuppose the availability of meaning representations for the individual sentences as well as cohesive means for the representation of the meaning of segments. An example of this is the logical text representation in terms of (S)DRT (Asher and Lascarides 2003). The meaning of a document arises from the composite meanings of the segments contained in it in connection with predefined meaning structures which are activated by document type and text type. To combine the meaning of segments it has to be decided which semantic relations are encoded by a certain configuration of segments (e.g. the semantic relationship between heading and text body). By the document type, a text type is activated which specifies a semantic structure which is valid for all instances of this type, regardless of the meanings specified by the segments. So it is clear from the start, e.g. for a scientific article, that the state of the art, methodological questions or results are represented in certain sections of the document.

Based on speech act theory, different expansions have been suggested on the textual level. Motsch and Viehweger (1991) describe the construction of complex illocutions in texts, and Schröder (2003) examines the action structure of texts with the same aim. Following this line of research, document functions can be described in a similar way as complex illocutions.

6.2.2 Relational Discourse Structure

Current text-type independent linguistic discourse theories such as the Unified Linguistic Discourse Model (ULDM, Polanyi et al. 2004a, b), Segmented Discourse Representation Theory (SDRT, Asher and Lascarides 2003, Asher and Vieu 2005), and Rhetorical Structure Theory (RST, Mann and Thompson 1988, Marcu 2000) describe discourse structures as a system of discourse coherence relations that hold between adjacent discourse constituents (spans). Discourse constituents can be either elementary discourse segments or complex discourse segments, the latter being relationally structured themselves. It seems to be generally acknowledged that discourse is structured hierarchically, but it is controversial whether the basic information structure for discourse representation should be a tree or a graph. While SDRT employs graph structures, in ULDM and RST, discourse trees with labelled nodes and edges are constructed. Recently, Wolf and Gibson (2005) have put forward linguistic arguments for a graph representation of discourse structures.

In the present project, we adopt the view that a discourse representation is basically a tree structure, which may be enhanced to include re-entrant edges in certain well-defined cases (cf. Lüngen et al. 2006a).

It is also generally accepted that there are two main structural types of discourse relations under which all other relations can be subsumed, namely subordinating vs. coordinating relations. In RST, these types are called mononuclear (or sometimes hypotactic) and multinuclear (paratactic) relations. In a mononuclear relation, one of the elements (text spans) involved has the status of being the nucleus, the “more salient, essential piece of information” (Carlson et al. 2001) of the relation. The other ones are labelled the satellites, which contain “supporting or background information” (Carlson et al. 2001). Like many authors (e.g. Corston-Oliver 1998, Marcu 2000, Egg and Redeker 2005), we restrict the representation of mononuclear relations to binary trees, i.e. with exactly one nucleus and one satellite. In multinuclear relations, all elements (possibly more than two) are labelled as nuclei.
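The two structural types can be sketched as a small tree data structure. The class names, the helper `segment_ids` and the segment IDs used in any example are hypothetical; only the binary restriction on mononuclear relations and the "two or more nuclei" shape of multinuclear relations are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class EDS:
    seg_id: str  # ID of an elementary discourse segment (hypothetical IDs)


@dataclass
class Mononuclear:
    relation: str
    nucleus: "Node"    # exactly one nucleus ...
    satellite: "Node"  # ... and exactly one satellite (binary restriction)


@dataclass
class Multinuclear:
    relation: str
    nuclei: List["Node"]  # all elements are nuclei, possibly more than two

    def __post_init__(self) -> None:
        if len(self.nuclei) < 2:
            raise ValueError("a multinuclear relation needs at least two nuclei")


Node = Union[EDS, Mononuclear, Multinuclear]


def segment_ids(node: Node) -> List[str]:
    """Collect segment IDs, nucleus before satellite (not necessarily text order)."""
    if isinstance(node, EDS):
        return [node.seg_id]
    if isinstance(node, Mononuclear):
        return segment_ids(node.nucleus) + segment_ids(node.satellite)
    return [i for child in node.nuclei for i in segment_ids(child)]
```

Because the sketch stores nuclearity but not linear position, a fuller representation would also record text offsets for each span.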

While in ULDM subordinating and coordinating relations are the only types of relations, the original RST is actually a theory about the nature and diversity of mono- and multinuclear discourse relations; accordingly, a set of 26 so-called rhetorical relations and their definitions is introduced in Mann and Thompson (1988).

The fact that all rhetorical relations are either mononuclear or multinuclear and that some (such as EVALUATION and INTERPRETATION) are rhetorically similar, and furthermore that some relations are special cases of other relations (e.g. NON-VOLITIONAL-CAUSE and CAUSE), can be accounted for by grouping relations into classes and constructing taxonomies over these classes. This has previously been done e.g. by Hovy and Maier (1995) and Carlson and Marcu (2001); see also Goecke et al. (2005). On the one hand, Mann and Thompson (1988) have provided a relation set which is supposed to be text type- and application-independent; on the other hand, they stress that the set is open to extension. In practice, depending on the text type and application (e.g. discourse analysis vs. generation), specific subsets or extended sets of relations have been chosen (cf. Hovy and Maier 1995). Many of the RST rhetorical relation types examined in the literature, such as EVIDENCE or INTERPRETATION, are immediately relevant for our text type, which was one factor that led us to opt for RST-based text parsing. Based on relation sets previously described in the literature as well as on corpus investigations, we have defined an extended relation taxonomy for the SemDok project, see Section 6.3.3.2.
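Grouping relations into a taxonomy can be sketched as a parent map with a subsumption test. The map below is a toy fragment and not the SemDok taxonomy: the link from NON-VOLITIONAL-CAUSE to CAUSE follows the special-case example in the text, while the class name `ASSESSMENT` for the rhetorically similar relations EVALUATION and INTERPRETATION is a hypothetical label introduced only for illustration.

```python
from typing import Dict, Optional

# Child -> parent links of an assumed, acyclic toy taxonomy.
PARENT: Dict[str, str] = {
    "NON-VOLITIONAL-CAUSE": "CAUSE",   # special case, as in the text
    "EVALUATION": "ASSESSMENT",        # hypothetical class label
    "INTERPRETATION": "ASSESSMENT",    # hypothetical class label
}


def subsumes(general: str, specific: str, parent: Dict[str, str] = PARENT) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    current: Optional[str] = specific
    while current is not None:
        if current == general:
            return True
        current = parent.get(current)  # None once the root is passed
    return False
```

Such a test lets a parser or annotator fall back to a more general relation when the evidence for a specific subtype is weak.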

Discourse theories also differ in their strategies of discourse interpretation, that is, the question of how discourse analysis and the construction of a formal representation of a specific discourse is achieved. In a theory like SDRT, a full-fledged semantic representation of discourse segments is required to perform discourse analysis. Its output then is a logical form, too. In the original conception of RST, text spans comprise plain text, not logical forms. Relational analysis as designed in Mann and Thompson (1988), however, also presupposes knowledge about the meaning of discourse segments as well as goals and beliefs of authors and readers about these meanings. Since a complete and robust automatic semantic analysis of input segments does not seem feasible, computational analysis of discourse has often relied on linguistic properties that are more easily obtainable, such as discourse connectives and syntactic and morphological features derived from (deep or shallow) grammatical analysis; see the projects described in Corston-Oliver (1998), Marcu (2000), Reitter (2003b), Polanyi et al. (2004a), and cf. also the argumentation in Egg and Redeker (2005). This is also the path that is taken in the SemDok project. But since we are dealing with a complex text type, we are also investigating cues for the more global (or macro) discourse structure such as thematic structure and lexical cohesion (lexical chains and anaphoric structure, see Section 6.2.3), logical document structure, and text type structure (Section 6.2.4).

In the extract from our corpus in Listing 6.1,¹ the adverbial discourse connective z.B. introduces a mononuclear ELABORATION-EXAMPLE relation where the segment that contains the connective is the satellite. This relation defines a complex discourse segment which is related to the previous segment, which contains the discourse marking conjunction und, introducing a multinuclear LIST-COORDINATION relation. The corresponding RST tree is shown in Fig. 6.1.² An equivalent discourse dependency tree representation according to Danlos (2005), which better corresponds to data structures preferred in computational linguistics, is shown in Fig. 6.2. The involved segments are represented by IDs that refer to the textual content of elementary discourse segments as shown in Listing 6.1.

Listing 6.1 Discourse segments and discourse markers

Fig. 6.1 RST tree

Elementary discourse segments (EDSs) in our project are based on syntax (syntactic tagging), punctuation and logical document structure. The basic idea is that elementary discourse segments correspond to clauses as in most theories, but may also correspond to other kinds of phrases ([...]) when they are especially marked by punctuation (e.g. bracketing) or logical document structure (e.g. a [...] element). Moreover, a minimal unit of discourse is supposed to be part of a discourse relation where the nucleus is semantically independent enough so that the satellite can potentially be omitted. This means that e.g. complement clauses, conditional clauses, and restricting relative clauses cannot be EDSs in our scheme. Since in these respects we deviate from the definition of English elementary discourse units in Marcu (1999), we did not adopt his technical term edu for our minimal segments.

We developed a discourse segmenter that is able to perform EDS segmentation automatically based on the input of the syntactic and logical document structure annotations (annotation layers CNX and DOC, cf. Section 6.3.2) of an input text. It outputs a new annotation layer called SEG, where besides EDSs, also SDSs (sentential discourse segments, i.e. sentences) from the text, and CDSs (complex discourse segments, which correspond to DOC elements) are marked, as can be seen in Listing 6.1. The criteria for EDSs as well as the discourse segmenter algorithm are described in Lüngen et al. (2006b).

Fig. 6.2 Discourse dependency tree according to Danlos (2005)

Among CDSs, we further distinguish three types (cf. Bärenfänger et al. 2006): First, CDS type=“block” corresponds to paragraphs and 2D objects that are on a par with paragraphs, such as titles, captions, and images, i.e. the elementary element types from the document structure that contain only text or non-textual 2D objects like images or diagrams, cf. Section 6.2.1. Second, CDS type=“division” corresponds to the lowest section level or elements that are on a par with it in terms of DOC markup, e.g. titles and paragraphs that are sisters of section elements. Finally, CDS type=“document” comprises all residual section elements, i.e. those which are on a higher level than CDS type=“division”. In our approach to discourse parsing, these segment types serve to constrain the extent to which discourse segments can be relationally combined, e.g. a CDS type=“block” can only be related to another CDS type=“block”, but not to a CDS type=“division”. In practice this means that the core parser module is called several times in a cascade architecture, starting out with EDSs, and each time using the next higher one of the segment types sketched above as its base segment type.
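The segment-type constraint and the cascade can be sketched as follows. This is a simplification under stated assumptions: the level list `CASCADE`, the representation of a segment as an `(id, type)` pair and both function names are hypothetical, sentential segments are left out, and the core parser is reduced to a callback.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, str]  # (segment id, segment type); hypothetical encoding

# Assumed order of base segment types, from lowest to highest.
CASCADE: List[str] = ["eds", "block", "division", "document"]


def candidate_pairs(segments: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Adjacent pairs that may be related: both segments must have the same type."""
    return [(a, b) for a, b in zip(segments, segments[1:]) if a[1] == b[1]]


def run_cascade(core_parser: Callable[[str], None],
                levels: List[str] = CASCADE) -> List[str]:
    """Call the core parser once per level, lowest level first."""
    visited = []
    for level in levels:
        core_parser(level)  # stands in for one pass of the core parser module
        visited.append(level)
    return visited
```

For example, with the segments `[("p1", "block"), ("p2", "block"), ("d1", "division")]`, only the pair `p1`/`p2` is offered for relational combination.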

6.2.3 Thematic Structure

The thematic structure of a text constitutes its thematic coherence in that it is responsible for the thematic connections between micro- and macrosegments of the text, and for their connection to an overall discourse topic, which serves as a frame for integrating the subtopics with regard to content. These connections between discourse topics and subtopics (and the thematically homogeneous macrosegments of the text, respectively) can be either semantic or functional or schema-based/associative. They constitute global thematic coherence.

Apart from thematic coherence on the global level, coherence can also be manifested by a relationship between adjacent sentences or clauses (i.e. elementary discourse segments). Such local relations are often signalled by explicit grammatical connections, which are formally realised by recurrence (e.g. coreference, anaphora) or by means of connectivity (e.g. conjunctions). These forms of connections are also called cohesion. Existing frameworks which model these local connections between elementary discourse segments operate on one of the different levels of discourse structure, i.e. referential structure (anaphoric relations), thematic structure (thematic development) and relational discourse structure (rhetorical relations).

The best known model for the description of local thematic development (i.e. the thematic relations between elementary discourse segments) is the model of thematic progression by Daneš (1970). Another, similar, model of thematic organisation was proposed by Zifonun et al. (1997). Their proposed major patterns of local thematic development can be summarised as follows: 1. Continuation (of theme or rheme³), 2. Derived Theme (a. derived from a hypertheme, b. derived from a preceding theme or rheme), 3. Associated Theme. Apart from associated theme, all connections between two adjacent topics are based on semantic relations like part-of or identity and are often explicitly signalled by means of coreference. But such connections are not sufficient to describe all possible thematic relations. As Brinker (1997) points out, models like the one by Daneš (1970) do not cover anything that cannot be covered by an analysis of the referential structure alone.

Research investigating functional and associative connections between topics is therefore important to overcome limitations of models which solely focus on semantic or referential ties between sentences to describe patterns of thematic development. Examples of more functionally oriented research are Lötscher (1987), Brinker (1997) and Schröder (2003), who propose functional relations like reason, justify, or exemplification to model thematic connections. The integration of functional relations in the analysis of the thematic structure seems quite natural, because an elaboration of a topic not only comprises the elaboration of its parts (which could be modelled by semantic relations like hyperonymy) but also the specification of functionally connected aspects of the topic, which could be modelled by RST relations.

To be able to model both kinds of relations (semantic and functional) in one discourse representation framework, we interpret the RST relation ELABORATION to represent coherence relations between discourse segments that are induced by the semantically motivated relations between discourse referents contained in them. For a detailed modelling of patterns as described in Daneš (1970) or Zifonun et al. (1997), an extension of the ELABORATION relation with different subtypes was necessary. Figure 6.3 shows the subtypes that we defined for discourse annotation in the project SemDok.⁴

ELABORATION-DERIVATION comprises all relations between a nucleus and a satellite which are based on topic derivation, or ontological subordination. The subtypes of this relation are all mentioned in various publications but have not been grouped together before (cf. Mann and Thompson 1988, Hovy and Maier 1995, Carlson and Marcu 2001). ELABORATION-IDENTITY holds between a nucleus and a satellite that share a referential identity, i.e. that are about the same discourse referent. On the one hand we distinguish between forms of theme-theme or rheme-theme chaining (cf. Polanyi et al. 2003), on the other hand between assignment (of a technical term or an abbreviation) and other forms of specification where the meaning of the topic in the nucleus is expanded, restricted or specified by a syntactically incomplete satellite.

With this extension of the set of rhetorical relations we can capture all patterns of thematic development by means of RST (Table 6.1). It must be emphasised that ELABORATION has some special characteristics compared with other discourse relations: First, it is a relation that potentially holds between all thematically connected discourse segments. It is therefore one of the “most prevalent forms of modification of a nucleus” and “extremely common at all levels of the discourse structure” (Carlson et al. 2001); in our corpus, ELABORATION is the second most frequent relation (about 25% of all relation instances in the presently annotated subcorpus). In an annotation process ELABORATION can be overridden by more specific discourse relations, i.e. whenever there are signals for a more specific discourse relation to hold between two discourse segments, this more specific relation is annotated. Second, ELABORATION is seldom signalled by syntactic or lexical discourse markers. Instead, ELABORATION may be identified by means of those linguistic features that signal thematic development: lexical-semantic and referential (anaphoric) relations between the central discourse entities of two discourse segments as well as lexical chains (Morris and Hirst 1991). As shown in Table 6.1, ELABORATION-DERIVATION and the converse relation ELABORATION-INTEGRATION are theoretically signalled by semantic relations like hyponymy, hyperonymy, holonymy etc., ELABORATION-IDENTITY by relations like synonymy, identity etc. Figures 6.4 and 6.5 show two examples where holonymy induces ELABORATION-DERIVATION, and pertonymy ELABORATION-DRIFT.⁵

Fig. 6.3 SemDok hierarchy of ELABORATION relations (node labels: Elaboration, Elaboration-derivation, Elaboration-whole-part, Elaboration-class-subclass, Elaboration-class-instance, Elaboration-set-member, Elaboration-process-step, Elaboration-integration, Elaboration-identity, Elaboration-continuation, Elaboration-theme-theme, Elaboration-rheme-theme, Elaboration-continuation-other, Elaboration-specification, Elaboration-specification-other, Elaboration-assign, Elaboration-assign-abbreviation, Elaboration-assign-other, Elaboration-drift)

Table 6.1 Thematic relations (for each pattern of thematic development, the thematic connections are given as semantic relations and rhetorical relations)

(Referential) Continuation
  Semantic relations: synonymy, identity, paraphrase
  Rhetorical relations: ELABORATION-IDENTITY, ELABORATION-CONTINUATION, ELABORATION-SPECIFICATION, ELABORATION-RESTATEMENT, ELABORATION-EXAMPLE, ELABORATION-DEFINITION

(Ontological) Derivation
  Semantic relations: hyponymy, hyperonymy, partonymy, meronymy
  Rhetorical relations: ELABORATION-DERIVATION, ELABORATION-SET-MEMBER, ELABORATION-PROCESS-STEP, ELABORATION-CLASS-SUBCLASS, ELABORATION-CLASS-INSTANCE, ELABORATION-WHOLE-PART, ELABORATION-INTEGRATION

(Functional) Supplementation/Association
  Rhetorical relations: BACKGROUND, CIRCUMSTANCE, CAUSE, RESULT, CONSEQUENCE, PURPOSE, CONDITION, CONTRAST, INTERPRETATION, EVALUATION, ...

These semantic relations (and the corresponding ELABORATION subtypes) can in principle be identified by consulting a lexico-semantic resource like GermaNet (cf. Kunze 2001). However, the coverage of GermaNet 5.0 is not sufficient for our corpus of scientific articles: only 69.3% of all noun tokens and 41.8% of all noun types in our corpus can be found in it (cf. Bärenfänger et al. 2007). We therefore primarily focus on the identification of ELABORATION and its subtypes by means of (annotations of) anaphoric relations and lexical chains as supplied by our project partners.
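The difference between token and type coverage can be made concrete with a small sketch. It assumes the lexicon is available as a plain set of noun lemmas and that the noun list is non-empty; the function name `coverage` and the toy data in the usage note are illustrative, and no GermaNet API is used.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def coverage(noun_tokens: Iterable[str], lexicon: Set[str]) -> Tuple[float, float]:
    """Return (token coverage, type coverage) of a lexicon over a noun list."""
    counts = Counter(noun_tokens)
    total_tokens = sum(counts.values())
    # Token coverage weights every noun by its frequency in the corpus.
    covered_tokens = sum(n for word, n in counts.items() if word in lexicon)
    # Type coverage counts every distinct noun exactly once.
    covered_types = sum(1 for word in counts if word in lexicon)
    return covered_tokens / total_tokens, covered_types / len(counts)
```

With the toy input `["Text", "Text", "Text", "Parser"]` and the lexicon `{"Text"}`, token coverage is 0.75 and type coverage is 0.5. Token coverage exceeds type coverage whenever the covered nouns are the frequent ones, which is the pattern reported for the corpus.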

In various studies it has been pointed out that thematic development is closely connected with referential continuity, and that anaphoric relations may be used as signals for thematic continuity (cf. Daneš 1970, Givón 1983, Zifonun et al. 1997). For the utilisation of anaphoric relations as cues for ELABORATION we cooperate with the Sekimo project, where our corpus was annotated according to a schema for anaphoric relations (CHS, cf. Holler 2004). Two types of intra-textual anaphoric relations are distinguished: bridging and cospecification relations. In cospecification relations (COSPEC), anaphor and antecedent are referentially identical, while bridging relations (BRIDGING) are based on semantic relations like meronymy, set-membership, and associative relations between anaphor and antecedent which have to be inferred from context.

Fig. 6.4 Holonymy as a cue for ELABORATION-DERIVATION

Analyses of our corpus have shown that the presence of an anaphoric relation between discourse entities in two discourse segments is (approximately) a necessary condition for ELABORATION to hold between them. Yet, it is not a sufficient condition; this is, among other things, due to the status of ELABORATION as a default relation. However, correlations between certain subtypes of ELABORATION and specific anaphoric relations could be found as well, e.g. in 66.7% of all occurrences of bridging:has-member, ELABORATION-INTEGRATION holds, and in 82% of all ELABORATION-CONTINUATION occurrences, cospec:ident holds. An example of the latter is shown in Listing 6.2.⁶

Fig. 6.5 Pertonymy as a cue for ELABORATION-DRIFT

Listing 6.2 Correspondence of COSPEC:IDENTITY and ELABORATION-CONTINUATION
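The two percentages quoted above are conditional proportions that run in opposite directions: the first conditions on the anaphoric relation, the second on the rhetorical relation. A minimal sketch of how such figures could be computed from aligned annotations follows; the pair encoding, the function name `share` and any example data are assumptions for illustration, not the corpus data.

```python
from typing import List, Optional, Tuple

# Each pair: (anaphoric relation, rhetorical relation) for the same segment pair.
Pair = Tuple[str, str]


def share(pairs: List[Pair], given_index: int, given_value: str,
          target_value: str) -> Optional[float]:
    """Proportion of pairs with `target_value` in the other slot,
    among the pairs that have `given_value` in slot `given_index`."""
    selected = [p for p in pairs if p[given_index] == given_value]
    if not selected:
        return None  # the conditioning value never occurs
    hits = sum(1 for p in selected if p[1 - given_index] == target_value)
    return hits / len(selected)
```

Calling `share(pairs, 0, "bridging:has-member", "ELABORATION-INTEGRATION")` conditions on the anaphoric relation, whereas `share(pairs, 1, "ELABORATION-CONTINUATION", "cospec:ident")` conditions on the rhetorical relation.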

Another approach to identifying thematically connected discourse segments is based on lexical cohesion, or, more specifically, the presence of lexical chains between discourse segments. “Lexical chains tend to indicate the topicality of segments” (Morris and Hirst 1991). This suggests that lexical chains can be employed to identify pairs of thematically homogeneous segments and, conversely, thematic breaks within logically defined segments. Lexical chains could thus also be used to revise the segment boundaries defined by the logical document structure. Cases where discourse or thematic structure deviates from the logical document structure defined by the author of a text have sometimes been observed (cf. Stein 2003, Sporleder and Lapata 2004). In the two partner projects HyTex (see Storrer in this volume; Lenz in this volume) and IndoGram (Mehler in this volume), algorithms for the automatic construction of lexical chains have been implemented.
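How lexical cohesion might expose a thematic break can be illustrated with a deliberately crude stand-in. The sketch below is not a lexical chain algorithm and not the method of HyTex or IndoGram: it only measures word overlap between adjacent segments, whereas real chaining also links semantically related words. The function names and the threshold value are assumptions.

```python
from typing import List, Sequence


def overlap(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard overlap of the word sets of two segments."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def thematic_breaks(segments: List[Sequence[str]], threshold: float = 0.1) -> List[int]:
    """Indices i such that segment i shares too few words with segment i - 1."""
    return [
        i + 1
        for i, (left, right) in enumerate(zip(segments, segments[1:]))
        if overlap(left, right) < threshold
    ]
```

A break index returned inside a logically defined segment would be a candidate for revising the boundary given by the logical document structure.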

As emphasised above, thematic structure can be split into a local and a global level. Using RST, it is possible to analyse and represent both levels, the local level by annotating the relations between adjacent elementary discourse segments and the global level by relating complex discourse segments. Particularly for the analysis of the latter relations across larger spans of text, the relation ELABORATION and its subtypes are beneficial (cf. also Carlson et al. 2001). The goal of our approach to thematic structure is thus not to identify and label discourse topics, but to integrate semantic and functional thematic relations in one discourse representation model.

6.2.4 Generic Document Structure

Genre-specific superstructure or text type structure (van Dijk 1980, Swales 1990) is an aspect of global discourse structure. An analysis of our corpus showed that most scientific articles are sequentially structured along the text type-specific categories problem, evidence, answers, although deviations are possible and commonly found (cf. Bärenfänger et al. 2006). These text type-specific functional categories (also e.g. method, results, and discussion) can be hierarchically organised in a text type structure schema. One such schema (cf. Fig. 6.6) was designed in the first phase of the present project and is used for the text type structure corpus annotation level (TTS) described in Section 6.3.2. Previous approaches to text parsing of scientific articles have focussed on automatically assigning text type-specific functional categories (or zones, after Teufel 1999) from the text type structure to text segments using automatic text categorisation methods (Kando 1999, Teufel and Moens 2002, Langer et al. 2004a).

Fig. 6.6 Text type structure (TTS) schema (23 categories)

One aim of the present project, however, is to formulate a method to integrate text type structure and overall relational discourse structure. Text structural categories are functions of text parts within the whole text, i.e. they represent a mapping from pairs consisting of one text span and the whole text to the set of textual category labels. RST analyses can be viewed as functions that map pairs of text spans onto a rhetorical relation label. Several of the category names used in previously proposed text type schemas (Kando 1999, Teufel and Moens 2002, Langer et al. 2004a) such as problem, results, conclusion suggest that text type structure and rhetorical structure can actually be interleaved (cf. Gruber and Muntigl 2005). This hypothesis is supported by the results of an empirical analysis of our corpus, which showed significant correlations between generic and rhetorical structure. An interpretation constituent in a text type structure schema instantiation of an article (Fig. 6.7) can, for example, very often be characterised as an RST satellite to a nucleus, the two being related through INTERPRETATION (Fig. 6.8). The distribution of RST relations over the different TTS categories shows clear deviations from a normal distribution: some TTS and RST pairings are much more likely to occur than other pairings, e.g. the TTS category OthersWork significantly correlates with the RST relation BACKGROUND, and ResearchTopic with ELABORATION. The overall findings of the corpus study are described in full length in Bärenfänger et al. (2006).

Fig. 6.7 Possible instantiation of text structural categories (node labels: content, answers, evidence, problem, results, interpretation)
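How such TTS/RST pairings might be inspected can be sketched as a small contingency table that compares the observed count of each pairing with the count expected if TTS category and RST relation were independent. This is only an illustration: the counts below are invented, and the statistical procedure actually used in the corpus study is the one described in Bärenfänger et al. (2006), not this sketch.

```python
from collections import Counter


def association_table(pairs):
    """Map each (tts, rst) pairing to (observed count, expected count).

    The expected count assumes independence of the two annotation levels:
    row total * column total / number of observations.
    """
    n = len(pairs)
    joint = Counter(pairs)
    tts_totals = Counter(tts for tts, _ in pairs)
    rst_totals = Counter(rst for _, rst in pairs)
    return {
        (tts, rst): (observed, tts_totals[tts] * rst_totals[rst] / n)
        for (tts, rst), observed in joint.items()
    }


# Invented toy observations: one (TTS category, RST relation) pair per segment.
TOY_PAIRS = (
    [("OthersWork", "BACKGROUND")] * 3
    + [("OthersWork", "ELABORATION")]
    + [("ResearchTopic", "ELABORATION")] * 3
    + [("ResearchTopic", "BACKGROUND")]
)
```

In the toy data, OthersWork co-occurs with BACKGROUND three times where independence would predict two, which is the kind of over-representation that can later serve as a cue for the parser.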

6.3 Resources

6.3.1 Corpus

For the development of the knowledge sources and the preprocessing components of the discourse parser, we work with a corpus that was compiled and annotated during the first project phase (2001–2004). The corpus comprises 120 scientific articles from two different disciplines (psychology and linguistics), languages (English and German) and sub-genres (experimental and review). English psychological and linguistic documents were taken from electronically available journals which were ranked highly in the listings of the Institute for Scientific Information (ISI) and published in the years 2000–2002. German linguistic articles were compiled from the online journal “Linguistik Online” (volumes 2000–2003).

6.3.2 Annotation Levels

Our approach to corpus annotation was based on the assumption of four annotation levels that play a role in discourse analysis: (a) logical document structure (as e.g. encoded in DocBook, cf. Walsh and Muellner 1999, or ldoc, cf. Stede and Suriyawongkul in this volume), (b) genre-specific text type structure (as described in van Dijk 1980, Swales 1990, Kando 1999, Teufel and Moens 2002), (c) rhetorical structure (Mann and Thompson 1988), and (d) syntactic structure. To examine dependencies between these levels, the corpus was analysed on all of them, and the analyses themselves were represented as XML-based multi-layer annotations (Witt et al. 2005). In the multi-layer annotation approach, each information level is realised as an independent XML annotation layer and stored in a separate file. Thus, we distinguish between annotation levels (abstract information levels such as the syntax and morphology level of a linguistic grammar) and annotation layers (their realisations in XML) (cf. Goecke et al. in this volume). In the following, the levels and XML layers of logical document structure, text type structure, and rhetorical structure are described in more detail.

Logical document structure (DOC): The logical document structure is an abstraction of the physical layout structure. The annotation of the logical document structure (abbreviated DOC), i.e. the hierarchical division of the text into sections, titles, paragraphs, footnotes, lists etc., was provided using a subset of the DocBook DTD, extended by 13 elements relevant for the corpus (such as).

Text type structure (TTS): To represent the canonical text type structure of a scientific article (see Section 6.2.4), an XML schema was created which contains 135 functional categories such as framework, method, or dataCollection. The creation of the text type schema was based on an empirical analysis of the corpus and on an evaluation of similar approaches regarding so-called rhetorical zones (Teufel and Moens 2002) and text-level constituents (Kando 1999). The categories are arranged hierarchically in the schema. The resulting tree structure was also used to generate a reduced schema with 23 categories, which is more suitable for an efficient and consistent annotation. Besides, as linguistic articles show a variety of orders of functional categories, a flat schema version was derived from the hierarchical one by means of an XSLT style sheet. Articles annotated according to the flat schema still contain information about the original hierarchical structure encoded using the ID/IDREF mechanism of XML (cf. Bayerl et al. 2003a, Langer et al. 2004a).

Rhetorical structure (RST): The rhetorical structure describes functional-argumentative relations (e.g. CONCESSION or EVIDENCE) between discourse segments, cf. Section 6.2.2. The set of rhetorical relations used for the annotation of the corpus is basically the one proposed by Mann and Thompson (1988) in the framework of Rhetorical Structure Theory (RST). We employed the RSTTool developed by O’Donnell (2000) to manually annotate the rhetorical structure. By means of a Perl program, we can convert the flat XML output of the RSTTool to our hierarchical RST-HP format, which, together with some extensions, will be the format of the target structure of our discourse parser, cf. Lüngen et al. (2006a). From the English psychological articles, 15 sections (2–3 pages each) were annotated starting from elementary discourse segments, and 10 German linguistic articles were annotated completely but starting from paragraphs as smallest units. Currently, the rhetorical annotations are being extended using the more scenario-specific relation set RRSET described in Section 6.3.3.2. The RST annotations serve as training and evaluation material for the discourse parser.

Syntactic structure (CNX): The morphology/syntax layer was created automatically using the commercial Machinese Syntax tagger software from Connexor Oy.
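The ID/IDREF encoding mentioned for the flat TTS schema can be illustrated by the reverse step, i.e. rebuilding a hierarchy from a flat list of annotated elements. The sketch below is an assumption-laden toy: the element name `segment`, the attributes `cat`, `id` and `parent`, and the particular nesting of categories are hypothetical and do not reproduce the project's schema or its XSLT style sheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat TTS layer: every element carries an ID, and all but the
# top element point to their former parent through an IDREF-like attribute.
FLAT_TTS = """
<tts>
  <segment id="s1" cat="content"/>
  <segment id="s2" cat="problem" parent="s1"/>
  <segment id="s3" cat="evidence" parent="s1"/>
  <segment id="s4" cat="method" parent="s3"/>
  <segment id="s5" cat="results" parent="s3"/>
</tts>
""".strip()


def rebuild_hierarchy(flat_xml):
    """Return the list of top-level nodes, each a dict with 'cat' and 'children'."""
    root = ET.fromstring(flat_xml)
    segments = list(root.iter("segment"))
    # First pass: one node per ID, so that references can be resolved in any order.
    nodes = {seg.get("id"): {"cat": seg.get("cat"), "children": []} for seg in segments}
    top_level = []
    # Second pass: attach every node to the node its reference points to.
    for seg in segments:
        node = nodes[seg.get("id")]
        parent_id = seg.get("parent")
        if parent_id is None:
            top_level.append(node)
        else:
            nodes[parent_id]["children"].append(node)
    return top_level
```

Calling `rebuild_hierarchy(FLAT_TTS)` yields a single `content` node whose children are `problem` and `evidence`, the latter dominating `method` and `results`. The two-pass design keeps the function independent of the order in which the flat elements appear.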

During the annotation process, the quality of the manual annotations was supervised in two ways: inter-rater reliability and intra-individual consistency (coder drift) were checked for the manually created annotations (cf. Bayerl et al. 2003b) using κ as a measure of agreement (Cohen 1960). The results of the tests for inter-rater reliability show that the quality of the TTS annotation was “substantial” (average κ = .64). κ for the RST annotations was .77 for the intra-sentential relations. The quality of the DOC annotation (κ = .98) is “nearly perfect” (cf. Landis and Koch 1977).
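For reference, Cohen's κ relates the observed agreement p_o between two raters to the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The following minimal sketch computes it for two label sequences and maps the result onto the verbal scale of Landis and Koch (1977). The function names and the toy labels are illustrative, and the sketch makes no claim about the tooling used in the project.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[cat] * freq_b[cat] for cat in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)


def landis_koch_label(kappa):
    """Verbal interpretation of kappa following Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

For two raters labelling four segments as `["A", "A", "B", "B"]` and `["A", "B", "B", "B"]`, observed agreement is 0.75 and chance agreement 0.5, so κ is 0.5.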

Table 6.2 Corpus annotations

Columns: TTS (135) | TTS (23) | DOC | RST | CNX
English psychological articles: 73 | 73 (automatically generated) | 73 | 15 (several sections) | 73
German linguistic articles: 47 | 47 | 3+10 CDS-block | 47

The extensive XML-based multi-layer-annotated corpus gives us the possibility to examine interrelations between these levels and to identify cues for rhetorical relations, e.g. cues on the level of document structure (such as an occurrence of the element) or syntactic or topical cues (e.g. the occurrence of the text type category dataCollection). Moreover, cues from different annotation levels can be combined to form complex conditions for the assignment of a specific rhetorical relation.
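One way to relate separately stored layers is to compare the character spans their elements cover in the shared primary text. The sketch below assumes that two layers annotate exactly the same character sequence and finds elements of one layer that lie inside elements of another. The layer contents and tag names are hypothetical, and the procedure is a simplification rather than the implementation of Witt et al. (2005).

```python
import xml.etree.ElementTree as ET

# Two hypothetical layers over the same primary text "IntroWe test. It works."
DOC_LAYER = "<article><title>Intro</title><para>We test. It works.</para></article>"
RST_LAYER = "<rst>Intro<segment>We test.</segment> <segment>It works.</segment></rst>"


def element_spans(layer_xml):
    """Return (tag, start, end) character offsets for every element of a layer."""
    spans = []

    def walk(elem, offset):
        start = offset
        offset += len(elem.text or "")
        for child in elem:
            offset = walk(child, offset)
            offset += len(child.tail or "")
        spans.append((elem.tag, start, offset))
        return offset

    walk(ET.fromstring(layer_xml), 0)
    return spans


def cooccurring(inner_layer, outer_layer, inner_tag, outer_tag):
    """Spans of inner_tag elements that lie inside some outer_tag element."""
    outers = [(s, e) for tag, s, e in element_spans(outer_layer) if tag == outer_tag]
    return [
        (s, e)
        for tag, s, e in element_spans(inner_layer)
        if tag == inner_tag
        and any(o_start <= s and e <= o_end for o_start, o_end in outers)
    ]
```

Here `cooccurring(RST_LAYER, DOC_LAYER, "segment", "para")` returns both discourse segments, since each falls within the paragraph. A condition of this kind could then be combined with cues from further layers.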

6.3.3 Additional Resources

6.3.3.1 Discourse Marker Lexicon

Discourse markers are functional elements that can be regarded as signals for a rhetorical relation (coherence relation) between two text segments. As we have indicated above, there are different types of discourse markers: Firstly, there are lexical discourse markers, or connectives. These are syntactically mostly adverbs or conjunctions. They may consist of one word (weil, “because”), multiple adjacent parts (so dass, “so that”) or multiple discontinuous parts (wenn ... dann ... sonst ..., “if ... then ... else ...”). Secondly, configurations of grammatical and/or document type-related features can function as (more abstract) discourse markers. An occurrence of a -environment on the logical document structure level would indicate one nucleus of a multinuclear LIST or SEQUENCE relation, would induce the nucleus of an ELABORATION-DEFINITION relation, its satellite, and the satellite of a PREPARATION relation. In the present stage of the project, the lexicon comprises lexical discourse markers; other discourse markers are currently treated in the rule component of the parser.

Many lexical discourse connectives are highly ambiguous. Frequently they do not clearly denote an individual rhetorical relation; rather, the same marker signals different relations depending on its context. Our intention was to provide an XML-encoded inventory of German discourse connectives which resolves these ambiguities.

First, we extracted a list of discourse connectives from our corpus and developed a suitable representational format in XML. The definition and validation of the XML data was implemented in XML Schema. The dictionary contains orthographic and syntactic characteristics of the respective discourse markers. The syntactic information included is based on the annotation generated by the Machinese Syntax Tagger from Connexor Oy, the descriptions in the Handbuch der deutschen Konnektoren of the IDS Mannheim (Pasch et al. 2003) and the grammar by Helbig and Buscha (1998). The encoding of the topological fields resembles the format employed in DiMLex (Stede and Umbach 1998).

Listing 6.3 Entry for “wenn” in the discourse marker lexicon

Each entry in the dictionary is represented by a -element (see Listing 6.3 for a sample entry). A -entry generally consists of three main parts: an identification unit, a filter unit, and an allocation unit. The identification unit identifies a lexical discourse marker (word or phrase) by its form (), by the word stem () and its part of speech (@pos). The optional filter unit allows for disambiguation of discourse markers by providing hypotheses about possible contexts and their associated specific rhetorical relations. Obligatory combinations of features (of the current segment and the reference segment) are combined to form hypotheses. The attributes of a hypothesis are supposed to override the general attribute values given in the allocation unit with their specific values in the current context. In the allocation unit all relations expressed by the discourse marker are specified. The @score attribute contains the conditional score for the relation given the discourse marker. It is presently based on an assumption of equal distribution but will eventually be estimated from our corpus. The attributes @beds-richtung and @skopus determine the position of the segment in comparison to the reference segment and the scope of the segment. If a segment offers several competing relations signalled by different discourse markers, a hierarchy of relations can be expressed on the basis of the attribute @skopus, so that the discourse parsing engine has criteria for a decision about the order in which the individual relations are applied and promoted (cf. Corston-Oliver 1998). The allocation unit can contain additional conditions for the segment and the reference segment, which provide the discourse parsing engine with further indicators to confirm a relation or to find the corresponding reference segment.

The discourse marker lexicon currently contains 92 entries. A Perl program exists that tags lexical discourse markers in texts on the basis of a CNX annotation of the text (see Section 6.3.2) and the discourse marker lexicon.
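The interplay of allocation unit and filter unit described above can be sketched with a small data model. Everything in it is an assumption made for illustration: the class and field names, the relations assigned to “wenn”, the part-of-speech value, the feature string and the overriding score are hypothetical and are not taken from Listing 6.3. The sketch only shows two points from the text, namely scores initialised under an equal distribution and hypothesis attributes overriding the general allocation values when their required features are present.

```python
from dataclasses import dataclass, field


@dataclass
class Allocation:
    relation: str   # rhetorical relation signalled by the marker
    score: float    # stand-in for the @score attribute
    scope: int = 1  # hypothetical numeric stand-in for @skopus


@dataclass
class Hypothesis:
    relation: str                # allocation this hypothesis refines
    required_features: frozenset  # features that must all be present
    overrides: dict              # attribute values replacing the general ones


@dataclass
class LexiconEntry:
    form: str
    pos: str
    allocations: list
    hypotheses: list = field(default_factory=list)


def uniform_allocations(relations):
    """Give every relation the same score (assumption of equal distribution)."""
    share = 1.0 / len(relations)
    return [Allocation(rel, share) for rel in relations]


def candidate_relations(entry, context_features):
    """Relations of an entry, best score first, after applying matching hypotheses."""
    candidates = []
    for alloc in entry.allocations:
        values = {"relation": alloc.relation, "score": alloc.score, "scope": alloc.scope}
        for hyp in entry.hypotheses:
            if hyp.relation == alloc.relation and hyp.required_features <= context_features:
                values.update(hyp.overrides)
        candidates.append(values)
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


# Illustrative entry only; relations, POS tag and feature are invented.
WENN = LexiconEntry(
    form="wenn",
    pos="CS",
    allocations=uniform_allocations(["CONDITION", "CIRCUMSTANCE"]),
    hypotheses=[Hypothesis("CIRCUMSTANCE", frozenset({"tense=past"}), {"score": 0.9})],
)
```

Without context features both relations keep the score 0.5. If the hypothetical feature `tense=past` is present, `candidate_relations(WENN, {"tense=past"})` ranks CIRCUMSTANCE first with the overriding score.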

6.3.3.2 Set of Rhetorical Relations

One goal of the present project was to develop a set of rhetorical relations suitable for analysing scientific articles in our explorative reading scenario, cf. Section 6.1. Our strategy was as follows: We took the extended classical MT (Mann/Thompson) relation set of 34 relation types as a starting point (cf. Mann and Taboada 2005); additionally we reviewed the comprehensive relation taxonomies previously suggested for English by Carlson et al. (2001) (96 relation types, 78 of which are at the base level of the taxonomy, which were employed in the rhetorical analysis of newspaper articles) and Hovy and Maier (1995) (65 relation types, 43 of which are at base level, which were designed mostly from the perspective of natural language generation and are not RST-specific) and chose candidate relations for extending the MT relation set. We then evaluated the RST annotations that were available from the first project phase (see Section 6.3.1) to determine the relevance of each relation in our corpus. Subsequently, we designed our relation set (called the RRSET) along the following criteria:

• we introduced subrelations when we found strong associations with certain discourse markers that seemed highly scenario-relevant; for instance we wanted to distinguish between LIST-COORDINATION relations that come about by syntactic coordination vs. LIST-DM_OTHER relations that come about through discourse markers on the logical document structure level such as the elements. Similarly, we introduced PREPARATION-TITLE, PREPARATION-QUESTION, PREPARATION-OTHER, CITATION-EVIDENCE, and CITATION-ATTRIBUTION;

Fig. 6.9 SemDok RRSET ontology (save the subclasses of ELABORATION) (edges from MONONUCLEARRELATION and MULTINUCLEARRELATION are not shown). Class labels shown in the figure: RhetoricalRelation, IdeationalRelation, InterpersonalRelation, TextualRelation, Contrast, Contrast-multi, CauseResult, CausePurpose, Cause, Purpose-s, CauseResult-multi, ResultPurpose, Purpose-n, Result, Circumstance, Consequence, Consequence-multi, Consequence-n, Consequence-s, Elaboration, Sequence, Antithesis, Background, Concession, InterpretationEvaluation, Evaluation, Interpretation, List, List-coordination, List-dm_other, Means, Preparation, Preparation-other, Preparation-question, Preparation-title, ProblemSolution, ProblemSolution-multi, ProblemSolution-n, ProblemSolution-s, Summary, Support, Attribution, Citation, Citation-attribution, Citation-evidence, Citation-self, Support-other, Evidence, Justify, Extra, Joint, SameSegment, Schema, ArticleTopLevelSchema, MononuclearRelation, MultinuclearRelation.
