文档视界 最新最全的文档下载
当前位置:文档视界 > A model of multimedia information retrieval

A model of multimedia information retrieval

A Model of Multimedia Information Retrieval


Abstract.Research on multimedia information retrieval(MIR)has recently witnessed a booming interest.A prominent feature of this research trend is its simultaneous but independent materialization within several?elds of computer science.The resulting richness of paradigms,methods and systems may,on the long run,result in a fragmentation of efforts and slow down progress.The primary goal of this study is to promote an integration of methods and techniques for MIR by contributing a conceptual model that encompasses in a uni?ed and coherent perspective the many efforts that are being produced under the label of MIR.The model offers a retrieval capability that spans two media,text and images, but also several dimensions:form,content and structure.In this way,it reconciles similarity-based methods with semantics-based ones,providing the guidelines for the design of systems that are able to provide a generalized multimedia retrieval service,in which the existing forms of retrieval not only coexist,but can be combined in any desired manner.The model is formulated in terms of a fuzzy description logic,which plays a twofold role:(1)it directly models semantics-based retrieval,and (2)it offers an ideal framework for the integration of the multimedia and multidimensional aspects of retrieval mentioned above.The model also accounts for relevance feedback in both text and image retrieval,integrating known techniques for taking into account user judgments.The implementation of the model is addressed by presenting a decomposition technique that reduces query evaluation to the processing of simpler requests,each of which can be solved by means of widely known methods for text and image retrieval,and semantic processing.A prototype for multidimensional image retrieval is presented that shows this decomposition technique at work in a signi?cant case.

Categories and Subject Descriptors:H.3.3[Information Storage and Retrieval]:Information Se-arch and Retrieval—query formulation;relevance feedback;retrieval models;I.2.4[Arti?cial In-telligence]:Knowledge Representation Formalisms and Methods—representation languages;I.4.10 [Image Processing and Computer Vision]:Image Representation—multidimensional

General Terms:Design,Languages,Theory

Additional Key Words and Phrases:Description logics,fuzzy logics,multimedia information retrieval


The central concern of multimedia information retrieval(MIR)is easily stated: given a collection of multimedia documents(i.e.,a complex information object, with components of different kinds,such as text,images,video and sound,all in digital form),?nd those that are relevant to information need of the user.An Partial support from the European Union was kindly provided under the MIRO Working Group (n.6175)and the FERMI Action(n.8134),both part of the ESPRIT Basic Research Programme.We thank our teammates in these two actions for the many stimulating discussions.

Authors’address:Istituto di Elaborazione dell’Informazione,Consiglo Nazionale delle Ricershe,Via Siuseppe Moruzzi;1–56124Pisa,Italy,e-mail:{meghini,fabrizio,straccia}@iei.pi.cnr.it. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for pro?t or commercial advantage,the copyright notice,the title of the publication,and its date appear,and notice is given that copying is by permission of the ACM,Inc.To copy otherwise,to republish,to post on servers, or to redistribute to lists,requires prior speci?c permission and/or a fee.

C 2001ACM0004-5411/01/0900-0909$5.00

Journal of the ACM,V ol.48,No.5,September2001,pp.909–970.

910C.MEGHINI ET AL. ever-growing amount of people are accessing collections of such documents every day in order to satisfy the most disparate information needs.Indeed,the number and types of multimedia information providers are steadily increasing:Currently, it includes publishing houses;cultural,scienti?c and academic institutions;cinema and TV studios;as well as commercial enterprises offering their catalogs on-line. The potential users of the provided information range from highly skilled special-ists to laymen.All these users share the same expectation:to be able to retrieve the documents that are relevant to their information needs.While commercial prod-ucts offering generalized MIR are still to come,research on MIR has witnessed a booming interest during the last years.The most striking feature of this research is its simultaneous but independent materialization within several?elds of com-puter science.To mention only the main streams,there has been work on MIR carried out within the multimedia,the information retrieval,the digital signal pro-cessing,and the database communities,while signi?cant signs of interests may be observed in other sectors,notably arti?cial intelligence.This also reveals that there are many different aspects involved in MIR,each requiring a speci?c background and methodology to be successfully tackled,and also that there may be complemen-tary approaches to the same problems,not only within the same discipline(such as different index structures for multimedia data),but also across different disci-plines(such as similarity-versus semantic-based image retrieval).Such a richness of paradigms,methods and systems is somewhat inherent in the early stages of development of a new discipline,when empirical studies are needed to understand the nature of a phenomenon and try out different ways of capturing it.However,on the long run,this very same richness may ultimately result in a fragmentation of efforts that may slow down progress.

We believe that MIR has reached the stage in which uni?cation of the existing approaches is called for,given that the basic concepts underlying the discipline are understood suf?ciently well.The primary goal of this study is to promote such uni?cation by contributing a conceptual model of MIR,which encompasses in a uni?ed and coherent perspective the many efforts and results that are being pro-duced under the label of MIR in the above-mentioned?elds.

The basic feature of our model is that it captures the kinds of retrieval investigated in the various areas of computer science mentioned above.These kinds of retrieval can be broadly classi?ed on the basis of the aspect of documents that each of them addresses.Thus,we have retrieval based on syntactic similarity,on semantics,on structure,and on pro?le.This categorization is indicative of the method that we have followed in deriving our view of MIR.Rather than proceeding from the bottom up by connecting together existing models,our approach has been a top-down one, since it models the various forms of retrieval as the projections of a basic operation on the different dimensions that make up the structure of the information space. For better clarity,our conceptual model will be speci?ed both at the informal and at the formal level,by letting each introduced notion be followed by a corresponding de?nition in a suitable formalism.The formalism we have chosen is mathemati-cal logic.The rationale of this choice lies primarily in the fact that mathematical logic has proven most successful in capturing the essential nature of information systems,and a MIR system,as it is to be expected,is no exception.The partic-ular logic we adopt is Description Logic(DL).DLs are a family of contractions of the predicate calculus that,in the recent past,have been subject to thorough investigations both from the logical and the computational viewpoint.In choosing

A Model of Multimedia Information Retrieval911 a speci?c DL,we have been guided by the need,typical of information mod-elling,of?nding a formalism that represents a good compromise between expres-sive power and the computational complexity of the reasoning problems involved in MIR.

Besides the foundational goal pointed out above,the aim of the present study is to bring concrete bene?ts to various players in the MIR arena.In particular:

—To the designers of MIR systems,the model provides guidelines for the design of systems that are able to provide a generalized retrieval service,in which the existing forms of retrieval not only co-exist,but can be combined in any desired manner.This feature places our model beyond the current state of the art. Following these guidelines,and using an appropriate tool that implements the model,a designer can quickly build a prototype of the system to be developed, and use such prototype to test its adequacy to the user’s functional requirements. We have prototyped a signi?cant portion of such a tool,as illustrated later in this paper.

—To users of MIR systems,the model is a tool for an effective communication with the application designers,enabling them to state precisely what they expect from the system.To this end,the model provides a terminology that is at the same time rigorous and shared with the system designers.Once a prototype of the application has been realized,the dialogue between users and designers moves to an operational level,where the prototype is evaluated and,if necessary, re?ned.

—To researchers who offer various contributions to MIR,the model gives the possibility to see these contributions as part of a larger endeavor.Hopefully,this will increase awareness of the limitations of current approaches and will stimulate improvement by integration of complementary approaches.As a further bene?t to researchers,the formal speci?cation of the model,by viewing MIR as a special form of implication,may be used as a basis for formal investigations on speci?c aspects of MIR,including extensions to the present model.

The paper is structured as follows:The conceptual framework underlying our model,and a corresponding terminology,is laid down in Section2,and subsequently used to review a signi?cant part of related work,in Section3.The rest of the paper presents the technical development of the model,starting with Section4, which concisely introduces the description logic that will constitute our main tool throughout the paper.Sections5to7deal with the aspects of documents that our model addresses;for each of them,we?rst discuss issues related to modelling and then switch to the semantics of the related query facilities.Section8presents a uni?ed,hierarchically structured query language which brings together all the issues discussed in Sections5to7.In Section9,we deal with retrieval and show how the degree of relevance of a document to a query may be seen in terms of the fuzzy DL that underlies both the representation and query languages.Section10 discusses the implementation of the model,and presents a general technique for query evaluation.Relevance feedback is tackled in the following section,where it is shown how this important stage of retrieval is incorporated into the previously presented model.Section12brie?y sketches the main traits of the tool that we have implemented to support the development of prototypical MIR systems.Section13 concludes this article.


2.A View of Multimedia Information Retrieval

After settling for such ambitious goals,we limit the scope of our work to the treat-ment of two media:text and images.These media are by far the most investigated and therefore best understood ones,therefore they suit the foundational work we are presenting here.Throughout the paper,it is argued that the principles that inspire our approach can also be applied to other media,such as audio and video.However, we have preferred to deal only with text and images in order to be able to discuss the relevant issues with the necessary depth,while relieving the reader from the burden of getting acquainted with a large formal lexicon.

2.1.I NFORMATION R ETRIEVAL.The notion of information retrieval(IR,for short)has much evolved from the late50’s,when it attracted signi?cant scienti?c interest in the context of textual document retrieval.Early characterizations of IR simply relied on an“objective”notion of topic-relatedness(of a document to a query).Later,the essentially subjective concept of relevance gained ground,and eventually became the cornerstone of IR[Saracevic1975].Nowadays,everybody in the IR community would agree that IR is synonymous with“determination of relevance.”Unfortunately,relevance is itself a vaguely de?ned concept.In quite a different context,philosophers of language[Anderson and Belnap1975]and cognitive scientists[Sperber and Wilson1986]have proposed alternative charac-terizations of this notion,all making the subject a controversial one.Not surprisingly then,the debate on what“relevance”may mean in the context of IR is still open (a recent account of such a debate can be found in Mizzaro[1997]).

In the meantime,the area of multimedia documents came into existence and demanded an IR functionality that no classical method was able to answer,due to the medium mismatch problem(in the image database?eld,this is often called the medium clash problem).This problem refers to the fact that,when documents and queries are expressed in different media,matching is dif?cult,as there is an inherent intermediate mapping process that needs to reformulate the concepts ex-pressed in the medium used for queries(e.g.,text)in terms of the other medium (e.g.,images).In response to this demand,a wide range of methods for achiev-ing IR on multimedia documents has been produced,often based on techniques largely foreign to the IR?eld.Our basic motivation in conducting the research presented here is that a reconciliation between these new developments and tradi-tional IR is needed,in order to foster cross-fertilization and promote the develop-ment of a more mature technology,able to enhance the respective approaches via their integration.

As a?rst step towards this reconciliation,we now propose a general de?nition of IR that preserves the meaning underlying the tradition,while being consistent with new developments in multimedia.We regard information retrieval as the task of identifying documents in a collection on the basis of properties ascribed to the documents by the user requesting the retrieval.That is,a document d is to be retrieved in response to a request r issued by a certain user if(and only if)that user would recognize d as having the property expressed by r.The many different types of retrieval that can be envisaged on a multimedia document can then be seen as special cases of the operation just de?ned,via an appropriate categorization of the properties that may be ascribed to a document.This categorization is outlined in the next section.

A Model of Multimedia Information Retrieval913

2.2.D IMENSIONS OF M ULTIMEDIA D OCUMENTS.We call a document simple if it cannot be further decomposed into other documents.In the present context, images and pieces of text are(the only)simple documents.A simple document is an arrangement of symbols that carry information via meaning,thus concurring in forming what is called the content of the document.In the case of text,the symbols are words(or their semantically signi?cant fractions,such as stems,pre?xes or suf?xes),whereas for images the symbols are colors and textures.

We thus characterize simple documents as having two parallel dimensions,that of form(or syntax,or symbol)and that of content(or semantics,or meaning).As we have just remarked,the form of a simple document is dependent on the medium that carries the document.On the contrary,we take the content dimension to be medium-independent,as we assume an objective view of meaning:the meaning of a simple document is the set of states of affairs(or“worlds”)in which it is true.For instance,the meaning of a piece of text is the set of(spatio-temporally determined) states of affairs in which the assertions made are true,and the meaning of an image is the set of such states of affairs in which the scene portrayed in the image indeed occurs.

Complex documents(or simply documents)are structured sets of simple doc-uments.This leads to the identi?cation of structure as the third dimension of our characterization of documents.For the sake of simplicity,we assume the structure of documents to be hierarchical,but this is no serious limitation,as extensions to other structures are well known and can be included in this framework without any major conceptual shift.Note that this notion of structure only applies to complex documents and should not be confused with the notion of structure that is embodied in the syntax of simple documents,such as the structure of an image as the particular arrangement of color regions that the image exhibits.

Finally,documents,whether simple or complex,exist as independent entities characterized by(meta-)attributes(often called metadata in the recent literature on digital libraries),which describe the relevant properties of such entities.The set of such attributes is usually called the pro?le of a document,and constitutes the fourth (and last)document dimension that our model considers.

Corresponding to the four dimensions of documents just introduced,there can be four categories of retrieval,each one being a projection of the general notion of retrieval de?ned in Section2.1onto a speci?c dimension.The usefulness of retrieval based on document structure or pro?le mostly lies in the possibility of using these categories of retrieval in conjunction with the other categories,which are discussed in the following section.

2.3.F ORM-AND C ONTENT-B ASED M ULTIMEDIA I NFORMATION R ETRIEVAL. The retrieval of information based on form addresses the syntactic properties of documents.In particular,form-based retrieval methods automatically create the document representations to be used in retrieval by extracting low-level features from documents,such as the number of occurrences of a certain word in a text, or the energy level in a certain region of an image.The resulting representations are abstractions which retain that part of the information originally present in the document that is considered suf?cient to characterize the document for retrieval purposes.User queries to form-based retrieval engines may be documents them-selves(this is especially true in the nontextual case,as this allows to overcome the

914C.MEGHINI ET AL. medium mismatch problem),from which the system builds abstractions analogous to those of documents.Document and query abstractions are then compared by an appropriate function,aiming at assessing their degree of relatedness.A document ranking results from these comparisons,in which the documents with the highest scores occur?rst.

In the case of text,form-based retrieval includes most of the traditional IR meth-ods,ranging from simple string matching(as used in popular Web search engines) to the classical tf-idf term weighting method,to the most sophisticated algorithms for similarity measurement.Some of these methods make use of information struc-tures,such as thesauri,for increasing retrieval effectiveness[Sch¨a uble and Knaus 1992];however,what makes them form-based retrieval methods is their relying on a fully automatic indexing.In the case of images,form-based retrieval includes all similarity-based image retrieval methods,such as those employed by system like Q BIC[Faloutsos et al.1994]or V IRAGE[Bach et al.1996].

On the contrary,semantic-based retrieval methods rely on symbolic representa-tions of the meaning of documents,that is descriptions formulated in some suitable knowledge representation language,spelling out the truth conditions of the involved document.Various languages have been employed to this end,ranging from net-based to logical.Retrieval by reasoning on the spatial relationships between image objects(see,e.g.,Gudivada and Raghavan[1995a])falls under this category.For reasons of space,this kind of retrieval is not dealt with in this paper.However, integration of spatial retrieval in the present framework can be accomplished as il-lustrated in Aiello et al.[1999].Typically,meaning representations are constructed manually,perhaps with the assistance of some automatic tool;as a consequence, their usage on text in not viable because of the remarkable size(up to millions of documents)that collections of textual documents may reach.

While semantic-based methods explicitly apply when a connection in meaning between documents and queries is sought,the status of form-based methods is, in this sense,ambiguous.On one hand,these methods may be viewed as pattern recognition tools that assist an information seeker by providing associative access to a collection of signals.On the other hand,form-based methods may be viewed as an alternative way to approach the same problem addressed by semantic-based methods,that is deciding relevance,in the sense of connection in meaning,between documents and queries.This latter,much more ambitious view,can be justi?ed only by relying on the assumption that there be a systematic correlation between“same-ness”in low-level signal features and“sameness”in meaning.Establishing the systematic correlation between the expressions of a language and their meaning is precisely the goal of a theory of meaning(see,e.g.,Davidson[1994]),a subject of the philosophy of language that is still controversial,at least as far as the mean-ing of natural languages is concerned.So,pushed to its extreme consequences, the ambitious view of form-based retrieval leads to viewing a MIR system as an algorithmic simulation of a theory of meaning,in force of the fact that the sameness assumption is relied upon in every circumstance,not just in the few,happy cases in which everybody’s intuition would bet on its truth.At present,this assumption seems more warranted in the case of text than in the case of nontextual media,as the representations employed by form-based textual retrieval methods(i.e.,vec-tors of weighted words)come much closer to a semantic representation than the feature vectors employed by similarity-based image retrieval methods.Anyway, irrespectively of the tenability of the sameness assumption,the identi?cation of the

A Model of Multimedia Information Retrieval915

alleged syntactic-semantic correlation is at the moment a remote possibility,so we subscribe to the weaker view of form-based retrieval and present a model where this is just one of the possible ways to access multimedia information.

3.Related Work

A model of structured multimedia documents?nalized to retrieval is proposed in Ounis and Chevallet[1996]and subsequently used for building the P RIME in-formation retrieval system[Berrut et al.1998].Although subscribing to the logical view of information retrieval,this model is indeed expressed in terms of Sowa’s [1984]conceptual graphs.This fact makes the comparison with our model hard, due to the more algebraic/operational,as opposed to logical/declarative,nature of conceptual graphs.From a pragmatic point of view,the two models can be con-sidered to share some basic intuitions,notably a categorization of documents into orthogonal dimensions.However,the fully formal status of our model enables us to address fundamental questions such as the computational complexity of the retrieval problem and the relation between form-and content-based retrieval.

Logical models(in a more orthodox sense)of information retrieval,?rst pro-posed in van Rijsbergen[1986],have been actively investigated in the last ten years [Crestani et al.1998;Sebastiani1998,1999].Within this research trend,the work closest in spirit to this is the work by Fuhr and colleagues on Probabilistic Datalog (see,e.g.,R¨o lleke and Fuhr[1998]),an extension of Strati?ed Datalog by prob-abilistic features.It is similar in that a logic with a model-theoretic semantics is used as a deductive tool in which to encode independently de?ned models of IR, but it is different in most other respects.For instance,Fuhr and colleagues do not de?ne independently motivated“mereologies”for the various types of media(see Sections5.1and5.3),and do not cater for the integration,within their formalism,of approaches to MIR originating from different?elds(such as those based on digital signal processing),as we instead do.

In the area of image processing,a considerable effort has been devoted to the investigation of effective methods for form-based image retrieval.1This effort has led to the development of a number of systems(see,e.g.,Gudivada and Raghavan [1995b],Orphanoudakis et al.[1994],and Smith and Chang[1996])which,in some cases,have also been turned into commercial products[Bach et al.1996;Faloutsos et al.1994].These systems,and in general the methods found in the literature, differ in the aspect of image form that is considered,in the features that are used to capture each aspect,in the algorithms employed to compute features,in the function adopted to match features for assessing similarity.While an exhaustive review is outside the scope of this paper,a general account of form-based image retrieval is given later,in Section5.2,when we will consider the representation and retrieval of image forms.A survey of current trends can be found in Gupta et al.[1997]. Given the context in which Gupta et al.[1997]originated,it is not surprising that 1The kind of image retrieval that we call“form-based”to contrast it with the retrieval based on the semantics of images,has been called“content-based retrieval”by the image processing community, to contrast it with the retrieval based on externally attributed properties of images(“metadata”).This terminological mismatch is a typical inconvenient of the integration between different worlds.We have decided to live with it in order to preserve the connotation underlying our reference model,as discussed in Section2.2.

916C.MEGHINI ET AL. its unifying trait is the lack of a proper representation and use of(what we call)the content of images.

Indexing methods based on statistical[Salton and Buckley1988]or probabilis-tic[Fuhr and Buckley1991]methods may be viewed as attempts to capture(perhaps shallow)semantic properties of the document,as they abstract from supposedly useless properties of a word(e.g.,its contexts of occurrence in the document)to concentrate on properties of it that are deemed more signi?cant from a semantic point of view(e.g.,the number of times it occurs in a document).

More daring approaches to capturing more document semantics are those based on natural language processing(NLP).So far,no real attempt has been made to capture document semantics through NLP-based understanding of the document. This is unsurprising,given the substantial distance that still separates NLP from achieving real text“understanding.”More surprisingly,also less ambitious NLP-based efforts to capture document semantics have failed to yield improved retrieval effectiveness.The paradigmatic case is the attempt to index documents by noun phrases rather than by individual words;although intuitively capturing“more se-mantics”than the latter,the former have been experimentally shown to be less effective indexing units,perhaps because of their different statistical properties (see,e.g.,Lewis[1992]).

Methods based on either statistical or probabilistic indexing have been applied to the retrieval of images via textual annotations(see,e.g.,Liu and Sun[1997],Srihari [1995],and Smeaton and Quigley[1996]),in some cases supported by the use of thesauri to semantically connect the terms occurring in the text[Alp Aslandogan et al.1997;Guglielmo and Rowe1996;Pentland et al.1994].The resulting meth-ods have proved effective only in very narrow contexts,and do not fully exploit the capabilities of human memory and the potentiality of the visual language in supporting such capabilities.

Image retrieval methods based on both textual annotation and visual similarity have also been investigated as a way of enhancing retrieval performance and system usability[Smith and Chang1997].While very naive in the representation of image semantics,the resulting systems“sell”form-based text retrieval for retrieval based on image semantics.As a consequence,they face the problem of how to combine the results of two sources of imprecision each addressing the same aspect,that is, document form,in a different way.

Models and methods for MIR developed in the database community tend to focus exclusively on the symbolic representation and usage of the content of images, regarding form-based retrieval just as a challenging application area for fast access methods to secondary storage[Faloutsos1996].In the classi?cation of multimedia systems outlined in Gudivada[1995],this would fall under the category of retrieval termed as“based on logical data units,”where a logical data unit is a multimedia data structure determined a priori.A paradigmatic case is the model presented in Marcus and Subrahmanian[1996],where visual aspects of images are treated at the symbolic level as semantic properties,and visual similarity is not provided by the model’s query language.Incidentally,this view of MIR has been pursued also by relying on a description logic as modelling tool[Goble et al.1996].

As already argued,our view of MIR is that the requirements of applications do not mirror the partition of the?eld that has been induced in practice by the different backgrounds of researchers.The application areas that have been mentioned at the beginning of this paper do require an information retrieval functionality able to

A Model of Multimedia Information Retrieval917 address all the document dimensions and,most importantly,able to address each dimension in its own modality.In order to ful?ll this goal,integration is the key concept,and,indeed,the basis of our approach.In this sense,our model can be seen as a generalization of current MIR methods.This does not mean that every functionality found in any text or image retrieval model/system is also provided by the model being presented.Rather,it means that our model provides the basis for integrating retrieval methods pertaining to different media and document dimen-sions.In so doing,it relies on a standard set of services for the various kinds of retrieval considered.It will be shown in due course that these services can be made signi?cantly more sophisticated without altering the architecture of the model. This model is the result of a research effort spanning almost a decade.The requirements of a suitable MIR model were outlined in Meghini et al.[1991],also based on the insights gained through the MULTOS Project[Thanos1990].A?rst formulation of this model based on a description logic was given in Meghini et al. [1993].Starting from that formulation,two parallel streams of development have been undertaken.On the one hand,we have worked on the tool,aiming at the de?nition of a description logic more tightly coupled with the task of information retrieval[Buongarzoni et al.1995;Meghini et al.1998;Meghini and Straccia1996; Sebastiani1994;Straccia1996,1997a,1997b,1998,2000].On the other hand,we have worked on the application of this tool to image retrieval[Meghini1995; Meghini et al.1997a],and successively generalized the model to MIR[Meghini et al.1997b].In order to simplify the presentation,the present paper does not include the full-blown logical tool resulting from the former stream of research, but rather focuses on the results of the latter stream.A preliminary version of this model can be found in Meghini et al.[1997b].

4.A Fuzzy Description Logic

Description Logics(DLs,for short—see,e.g.,Borgida[1995])are contractions of the predicate calculus that descend from the formalization of early seman-tic network-or frame-based knowledge representation languages.DLs have an “object-oriented”character that makes them especially suitable for reasoning about hierarchies of structured objects.

DL systems have been used for building a variety of applications including systems supporting con?guration management[McGuinness and Wright1998], software management[Devanbu et al.1991],browsing and querying of networked information sources[Duschka and Levy1997],data archaeology[Brachman et al. 1992],plan recognition[Weida and Litman1992],natural language understand-ing[Bollinger and Pletat1991]and multimedia data description and classi?ca-tion[Goble et al.1996].The grandfather of DL systems was KL-ONE[Brachman and Schmolze1985].Nowadays,a whole family of DL-based knowledge represen-tation systems has been built,like BACK[Peltason1991],CLASSIC[Brachman et al.1991],KRIS[Baader and Hollunder1991],and LOOM[MacGregor1991]. For most of these systems,the computational complexity of the underlying rea-soning problems is known.The systems mostly differ for the expressiveness of the language and the completeness of the inferential algorithms.

The speci?c DL that we use to express our model is ALC[Schmidt-Schau?and Smolka1991].ALC is universally considered a signi?cant representative of a family of expressive DLs and is therefore regarded as a convenient workbench


A model of multimedia information retrieval

F IG.1.Grammar rules for ALC concepts.

for carrying out any kind of logical work of an experimental nature.However,we stress that our model is not tied in any way to this particular choice.Indeed,most implemented systems rely on DLs that are less expressive than ALC.On the other hand,IR models based on more expressive DLs have been also studied[Meghini et al.1993].

4.1.A Q UICK L OOK AT ALC.Concepts,roles and individual constants are the basic building blocks of ALC.Concepts describe sets of objects,such as“Italian musician,”or,in ALC notation,Italian Musician.Roles give binary properties, such as Friend.Individual constants are simple names for individuals,such as tim.From a data modelling point of view,concepts correspond to classes,roles to attributes and individual constants to basic objects.From a logical point of view, concepts can be seen as(possibly complex)unary predicate symbols obtained by lambda-abstraction,roles as binary predicate symbols and individual constants as constant symbols.

Formally,we assume three alphabets of symbols,called primitive con-cepts,primitive roles and individual constants.The concepts of the language ALC are formed out of primitive concepts according to the syntax rules given

in Figure1.In ALC,roles are always primitive.Other DLs have instead also role-forming operators.

For example,the complex concept Musician ?Plays.?ElectricInstrument is obtained by combining the primitive concepts Musician and ElectricInstrument and the primitive role Plays by the conjunction( ),universal quanti?cation(?) and negation(?)constructors.Under the intended interpretation and in a way that will be formalized soon,such concept denotes the musicians who do not play any electric instrument.

It is immediate to verify that ALC is a notational variant of a(conservative) contraction of predicate calculus,determined by the very limited usage of quanti?ers and variables,the latter always implicit.2Accordingly,the semantics of DLs is the restriction of the Tarskian semantics for the predicate calculus corresponding to this syntactical contraction.An interpretation I is a pair I=( I,·I)consisting of a nonempty set I(called the domain)and of an interpretation function·I.Following the intuitive meaning of constants,roles and concepts,given at the beginning of this section,·I maps different individual constants into different elements of I,

2A calculus C is a contraction of a calculus C when the language L and the set of valid formulas V of the former are,respectively,subsets of the language L and of the set of valid formulas V of the latter. The contraction is conservative when(V ?V)does not contain any formula which is expressible solely by means of L.

A Model of Multimedia Information Retrieval919 primitive concepts into subsets of I and primitive roles into subsets of I× I. The interpretation of complex concepts is de?ned by structural induction via the following rules,where C,C1,C2stand for concepts and R for roles:

I= I


(C1 C2)I=C I1∩C I2

(C1 C2)I=C I1∪C I2

(?C)I= I\C I

(?R.C)I={d∈ I|?d ∈ I.(d,d )∈R I implies d ∈C I}

(?R.C)I={d∈ I|?d ∈ I.(d,d )∈R I and d ∈C I}.

For instance,(?R.C)I is the result of viewing?R.C as the?rst order formula ?y.R(x,y)→C(y).Similarly,(?R.C)I(d)is the result of viewing?R.C as the formula?y.R(x,y)∧C(y).As a consequence,it can be veri?ed that for all inter-pretations I

(?(C1 C2))I=(?C1 ?C2)I


ALC concepts and roles can be used for making crisp assertions about individual constants(metavariables a,a1,a2),that is,expressions belonging to one of the following categories:

(1)C(a),asserting that a is an instance of C;for example,(Musician

Teacher)(tim)makes the individual constant tim a Musician and a T eacher;

(2)R(a1,a2),asserting that a1is related to a2by means of R(e.g.,


(3)C1 C2,asserting that C1is more speci?c than C2(for instance,Pianist

(Artist ?Plays.Piano)).

Assertions of type1and2are called simple assertions,and have identical analogues in the predicate calculus.An assertion of type3is called an axiom,and its predicate calculus analogue is the sentence?x.C1(x)→C2(x).By stating both C1 C2and C2 C1,the primitive concept C1is de?ned to be equivalent to C2. Semantically,the assertion C(a)(respectively,R(a,b)and C1 C2)is satis?ed by I iff a I∈C I(respectively,(a I,b I)∈R I and C1I?C2I).A set of assertions will be called a knowledge base(KB).An interpretation I satis?es(is a model of) a KB iff I satis?es each element in .A KB entails an assertionα(written |=α)iff every model of also satis?esα.For instance,if is: ={Italian European,(Italian Pianist)(tom),Friend(tim,tom)}, then

|=(?Friend.(Pianist European))(tim),

that is,tim has a friend who is a European pianist.

4.2.F UZZY ALC.In order to deal with the imprecision inherent in information retrieval,we extend ALC with fuzzy capabilities[Straccia1998].The exten-sion of DLs to this end is not new.Yen[1991]was the?rst to introduce impreci-sion into a simple DL;the resulting language has interesting features:it allows the

920C.MEGHINI ET AL. de?nition of imprecise concepts by means of explicit membership functions over a domain,and it introduces concept modi?ers,like Very or Slightly,by means of which concepts like“very low pressure”can be de?ned.This last idea has been generalized to ALC in Tresp and Molitor[1998],where a certain type of concept modi?ers are allowed.The result is a more expressive language than just fuzzy

ALC,with radically different computational properties,though.

From a syntactical point of view,fuzzy ALC provides fuzzy assertions,that is,expressions of type α,n ,whereαis a crisp assertion and n∈[0,1].We use the terms fuzzy simple assertion,fuzzy axiom,and fuzzy KB with the obvious meaning.Then, ?About.Piano(i),.7 is a fuzzy simple assertion with intended meaning(see below)“the membership degree of individual constant i to concept ?About.Piano is at least.7,”or,in less technical terms,“i is likely to be about a piano.”

According to Zadeh[1965],a fuzzy set X with respect to a set S is characterized by a membership functionμX:S→[0,1],assigning a X-membership degree,μX(s),to each element s in S.This membership degree gives us an estimation of how much s belongs to X.Typically,ifμX(s)=1then s de?nitely belongs to X, whileμX(x)=.7means that s is“likely”(with degree of likelihood.7)to be an element of X.Membership functions have to satisfy the following three restrictions (for all s∈S and for all fuzzy sets X,Y with respect to S):




whereˉX is the complement of X in S,that is,S\X.Other membership functions have been proposed in the literature(the interested reader can consult,for example, Dubois and Prade[1980]or Kundu and Chen[1994]).

In fuzzy ALC,concepts and roles are interpreted as fuzzy sets,thus becoming imprecise.Formally,a fuzzy interpretation is a pair I=( I,·I),where I is,as for the crisp case,the domain,whereas·I is an interpretation function mapping:

(1)different individual constants to different elements of I,as for the crisp case:

(2)ALC concepts into a membership degree function I→[0,1],and

(3)ALC roles into a membership degree function I× I→[0,1]. Therefore,if C is a concept then C I will be interpreted as the membership degree function of the fuzzy set which is denoted by C with respect to I,that is,if d is an object of the domain I then C I(d)gives us the degree of membership of d in the denotation of C under the interpretation I.Similarly for roles.In order to re?ect intuition,·I has to satisfy the following equations(for all d∈ I):



(C1 C2)I(d)=min{C1I(d),C2I(d)}

(C1 C2)I(d)=max{C1I(d),C2I(d)}

(?C)I(d)=1?C I(d)

(?R.C)I(d)=min d ∈ I{max{1?R I(d,d ),C I(d )}}

(?R.C)I(d)=max d ∈ I{min{R I(d,d ),C I(d )}}

A Model of Multimedia Information Retrieval921 Again,(?R.C)I(d)is the result of viewing?R.C as the formula?y.R(x,y)→C(y),where F→G is?F∨G and the universal quanti?er?is viewed as a pos-sibly in?nite conjunction over the elements of the domain(in the literature,several different de?nitions of the fuzzy implication connective→have been proposed—see,for example,Kundu and Chen[1994]for a discussion).The overall effect is, informally:


d ∈ I ?R(d,d )∨C(d )≡

d ∈ I

{?R(d,d ),C(d )}.

By applying(in this order)the rules for conjunction,disjunction and negation to the last formula,the min-max expression above yields.Similarly,(?R.C)I(d)is the result of viewing?R.C as?y.R(x,y)∧C(y),where the existential quanti?er?is considered as a possibly in?nite disjunction over the elements of the domain(see also Lee[1972]).

Also in this case,it is veri?ed that for all interpretations I and indi-vidual constants d∈ I,(?(C1 C2))I(d)=(?C1 ?C2)I(d)and(?(?R.C))I (d)=(?R.?C)I(d).Moreover,every“crisp”interpretation I can be seen as a fuzzy interpretation with membership degree in{0,1}rather than in[0,1].

An important operator of DLs concerns number restrictions on role?llers [Buchheit et al.1993].The interested reader can?nd the fuzzy semantic rules for number restriction operators in Yen[1991].

An interpretation I satis?es(is a model of)a fuzzy assertion C(a),n (respectively, R(a1,a2),n and C1 C2,n )iff C I(a1I)≥n(respectively, R I(a1I,a2I)≥n and min d∈ I{max{1?C1I(d),C2I(d)}}≥n).Note that the sat-is?ability condition min d∈ I{max{1?C1I(d),C2I(d)}}≥n for C1 C2,n is a consequence of viewing C1 C2as the formula?x.C1(x)→C2(x),or?x.?C1(x)∨C2(x).

An interpretation I satis?es(is a model of)a fuzzy KB iff I satis?es each element of .A fuzzy KB entails a fuzzy assertionγ(written |=γ)iff every model of also satis?esγ.Given a fuzzy KB and a fuzzy assertionα,we de?ne the maximal degree of truth ofαwith respect to (written Maxdeg( ,α))as max{n>0| |= α,n }(max?=0).Note that |= α,n iff Maxdeg( ,α)≥n. For example,suppose we have two images i1and i2indexed as follows:

i1={ About(i1,tim),.9 , T all(tim),.8 , About(i1,tom),.6 , T all(tom),.7 } i2={ About(i2,joe),.6 , T all(joe),.9 }

Moreover,let the background KB be

B={ Image(i1),1 , Image(i2),1 ,

Musician(tim),1 , Musician(tom),1 , Musician(joe),1 ,

T all Adult,.9 },

and let 1= i1∪ B and 2= i2∪ B.Our intention is to retrieve all images in which there is an adult musician.This can be formalized by the query concept C=Image ?About.(Adult Musician).

It can be veri?ed that Maxdeg( 1,C(i1))=.9,whereas Maxdeg( 1,C(i2))=.6. Therefore,in retrieving both images we rank i1before i2.

922C.MEGHINI ET AL. The pivotal role that fuzzy ALC has in the context of our model will become clear in the next sections.The connection between logical reasoning in fuzzy ALC and non-logical computation through medium-speci?c document process-ing techniques will be realized by identifying a number of special ALC individual constants and predicate symbols and imposing that their semantics be not a generic subset of I(or I× I)but one that complies with the results of the document processing analysis.


We now proceed to discussing the“form”dimension of simple documents.We present models for image layouts and text layouts,which consist of the symbolic representations of the form-related aspects of an image or text,respectively.Each notion is endowed with a mereology,that is,a theory of parts,based on notions such as atomic region,region and grounded region.Each of these three notions will be de?ned twice,once for images and once for text.The context will tell which notion is meant from time to time.Note also that the term“layout”is used elsewhere in the literature in a different sense,namely to denote the rendering of a document on a display device.In order to query image and text models,we introduce special predicate symbols,which will be used in the uni?ed query language discussed in Section8.The reader will note the evident parallelism,even down to many details, in our treatment of image form and text form;given that form is the only medium-speci?c aspect of documents,this shows the potential of this model for extension to other media.

5.1.M ODELLING I MAGE L AYOUTS.In order to make the paper self-contained, some elementary notions from digital geometry are brie?y recalled below(see also, for example,Rosenfeld and Kak[1982,Chap.11]).

Let I N be the set of natural numbers.A zone is any subset of I N2,that is,a set of points.A zone S is aligned if

min{x| x,y ∈S}=0and min{y| x,y ∈S}=0.

The neighbors of a point P= x,y ,when both x and y are nonzero,are the points x?1,y , x,y?1 , x,y+1 ,and x+1,y .If only one of P’s coordinates is 0,then P has only three neighbors; 0,0 has only two neighbors.Two zones are said to be neighbor to each other if they are disjoint and a point in one of them is a neighbor of a point in the other one.A path of length n from point P to point P is a sequence of points P=P0,P1,...,P n=P such that P0=P,P n=P and P i is a neighbor of P i?1,1≤i≤n.Let S be a zone and P and P points of S:P is connected to P in S if there is a path from P to P consisting entirely of points of S.For any P in S,the set of points that are connected to P in S is called a connected component of S.If S has only one connected component,it is called a connected zone.

We now formalize the notion of an image layout.Given a set of colors C,an image layout is a triple i= A i,πi,f i ,where:

—A i,the domain,is a?nite,aligned,rectangular zone(“rectangular”zone being de?ned in the obvious way);

—πi is a partition of A i into nonempty connected zones{T1,...,T n},called atomic regions;

A Model of Multimedia Information Retrieval 923

A model of multimedia information retrieval


G .2.The atomic regions of a simple image.

—f i is a total function (the color function )from πi to C assigning a color to each atomic region in such a way that no two neighbor atomic regions have the same color.

Figure 2shows the partition πunderlying a simple image layout.From an image processing point of view,an image layout is a segmentation,based on a color uni-formity criterion.As a consequence,in our model “atomic region”is synonymous to “color region.”The reason behind this choice is that the mereology induced by this criterion permits us:

—to model a widely used class of queries on image form,namely those addressing single color regions as well as shapes;

—to model regions corresponding to objects within images,as these regions can be seen as aggregations of color regions.

This choice,on the other hand,has no impact on global image similarity,such as color-or texture-based similarity,because these kinds of similarity are not expressed nor computed in terms of single image regions.Also,it is impor-tant to note that other image segmentation criteria,and the consequent mere-ologies,can be accommodated into the model by generalizing the de?nition of image layout in the following way:i = A i , πi 1,f i 1 , πi 2,f i 2 ,..., πi n ,f i n ,where each pair πi j ,f i j captures a different segmentation criterion.To keep the model simple,and the paper readable,we consider only one segmenta-tion criterion.

The calculation of the partition πi from a pixel representation is notoriously dif?cult and,in fact,still an open problem.For this reason,we next provide a more general notion of image region,which is de?ned on top of the notion of atomic region,but which could as well be assumed as primitive,thus reconciling the model with the current status of image processing.

The regions of an image layout i = A i ,πi ,f i are de?ned to be the set:πi e = S |?T 1,...,T k ∈πi ,k ≥1,S =k j =1

T j ,S connected

A region is a connected zone obtained by the union of one or more atomic regions.The fact that we allow S to have holes enables the model to deal with partial occlusion (e.g.,the area of an image showing a goal-keeper partly covered by an approaching ball counts as a region).Accordingly,the extended color function of an image layout i = A i ,πi ,f i is de?ned as the function f i e that assigns to each region S the color distribution induced by the atomic regions that make up S .Technically,if S =∪k i =1T i ,f i e (S )is a mapping from C to [0,1]de?ned as follows

924C .MEGHINI ET AL .(for all c ∈C and letting |A |be the cardinality of A ):f i e (S )(c )=

T ∈Z c |T ||S |

,where Z c ={T ∈{T 1,...,T k }|f i (T )=c }.Since Z c is the subset of the atomic regions making up S that have color c ,f i e (S )(c )gives the percentage of S that has color c .In fact,it is immediately veri?ed that: c ∈C

f i e (S )(c )=1,for all regions S .

For any given region S ,let φ(S )stand for the shape of the region,that is the closed curve that delimits S .We let S be the set of closed curves in I N 2.

By de?nition,a region may occur in more than one image,since it is de?ned in a purely extensional way.For instance,whenever an object shows up in two images in exactly the same way,those two images will share at least one region,namely the portion of the image domain containing the object.In general,the same region will occur in all images having at least one atomic region in common.In order to evaluate image queries,though,we need a more selective notion of region,bound to the speci?c image where a region belong.To this end,we introduce the notion of grounded image region ,which we de?ne as a pair i ,S where i = A i ,πi ,f i is an image layout and S ∈πi e .For simplicity of notation,in what follows we directly refer to the shape of a grounded image region i ,S ,actually meaning the shape of its component region S .Formally,we extend the function φby stipulating that φ( i ,S )=φ(S ).Analogously,we de?ne the function f e on grounded image regions i ,S as follows:

f e ( i ,S )(c )=f i e (S )(c ).

Finally,we de?ne the image universe IU as the set of all possible image layouts of any domain.The set of all grounded image regions,denoted as R ,is de?ned on top of the image universe as:R = i ,S ∈ IU ×2

I N 2 S ∈πi e .5.2.Q UERYING I MAGE L AYOUTS .Queries referring to the form dimension of images are called visual queries,and can be partitioned as follows (for a taxonomy of approaches to image retrieval,see,e.g.,Gudivada and Raghavan [1997]):

(1)Concrete Visual Queries.These consist of full-?edged images that are sub-

mitted to the system as a way to indicate a request to retrieve “similar”im-ages;the addressed aspect of similarity may concern color [Bach et al.1996;Faloutsos et al.1994;Stricker and Orengo 1995;Swain and Ballard 1991],texture [Liu and Picard 1996;Pentland et al.1994;Smith and Chang 1994],appearance [Ravela and Manmatha 1997]:combinations of these are gaining ground [Rui et al.1998;Ciocca and Schettini 1999];

(2)Abstract Visual Queries.These are arti?cially constructed image elements

(hence,“abstractions”of image layouts)that address speci?c aspects of image similarity;they can be further categorized into:

(a)Color Queries.Speci?cations of color patches,used to indicate a request

to retrieve those images in which a similar color patch occurs [Del Bimbo et al.1997;Faloutsos et al.1994];

A Model of Multimedia Information Retrieval925

(b)Shape Queries.Speci?cations of one or more shapes(closed simple

curves in the2D space),used to indicate a request to retrieve those

images in which the speci?ed shapes occur as contours of signi?cant objects[Del Bimbo and Pala1997;Hirata and Kato1992;Petrakis and Faloutsos1997];

(c)combinations of the above[Jain and Vailaya1996].

As mentioned in Section2.3,visual queries are processed by matching a vector of features extracted from the query image,with each of the homologous vectors extracted from the images candidate for retrieval.For concrete visual queries,the features are computed on the whole image,while for abstract visual queries only the features indicated in the query(such as shape or color)are represented in the vectors involved.For each of the above categories of visual queries,a number of different techniques have been proposed for performing image matching,depending on the features used to capture the aspect addressed by the category,or the method used to compute such features,or the function used to assess similarity.For instance, a color similarity criterion for concrete visual query is captured by the following function[Stricker and Orengo1995]:

s(i,i )=

j∈HSB w1j

m i1j?m i 1j


m i2j?m i 2j


m i3j?m i 3j



—H S B is the set of color channels,H S B={H,S,B};

—m g

k j is the k th moment of the true color histogram of image layout g,for the

channel j of the HBS color space,that is:

m g

1j =1




p g


m g

2j =





p g


?m g1j


m g

3j =3





p g


?m g1j


where p g


is the value of the channel j in the point l of the image layout g;—w i j≥0,(1≤j,i≤3)are user speci?ed weights.

As the last example shows,the inference carried out by a similarity retrieval en-gine is heavily based on numerical techniques,hence the least apt to be captured by logic.For this reason,the model does not provide the machinery for de?ning simi-larity functions,and indeed assumes that similarity reasoning will not be performed by means of logic.

However,against current practice,we argue that visual queries are expressions of a formal language,and that logic is a most suitable tool for the speci?cation of such a language.Of course,the primitive elements of the language for querying image forms we are about to introduce will be speci?able in a visual way,being visual in nature.So,at the surface level,any system based on this model can be made to look identical to the similarity retrieval engines in current practice nowadays.But

926C.MEGHINI ET AL. the existence of the query language marks an important difference,since it permits,

among other things,the integration of form-based image retrieval with other kinds

of retrieval,on images as well as on documents pertaining to any other medium.

The categorization of visual queries pointed out above will be used as the guide-

line for the de?nition of the query language on image layouts.The building blocks of

the query language are,as anticipated at the end of Section4.2,special ALC individ-

ual constants and predicate symbols(abbreviated in SICs and SPSs,respectively).

In particular,in order to express visual queries,two kinds of symbols are called for: (1)symbols for denoting image layouts,their component regions,the properties of

regions(such as colors and shapes),and the appropriate relationships among

these entities;we call these symbols mereological SICs and SPSs;

(2)symbols for denoting similarity between whole images(for concrete visual

queries)and image components(for abstract visual queries);we call these

symbols similarity SPSs.

As far as mereological SICs are concerned,the following disjoint countable alpha-

bets are introduced,each consisting of description logic individual constants:— I,the names of image layouts(metavariable i,optionally followed by a natural


— R,the names of grounded image regions(r);

— C,the names of colors(c);

— S,the names of shapes(s).

Even though,at?rst sight,these alphabets may seem to be an unnecessary compli-

cation imposed by the formalism in which this model is stated,it is worthwhile to

observe that they are in everyday use.For instance,one could think of I as the set

of image URLs. C can be thought of as naming the elements of one of the many

color spaces proposed in the literature;for instance,in the RGB space each such

name is a triple giving the level of energy of a pixel on the corresponding chan-

nel.Analogously, S may be understood as any suitable notation for representing

contours,such as,for instance,the8-contour notation,given by elements of the set {0,1,...,7}+.Finally,each element in R could consist of the composition of an image name,a region shape and a point for locating the shape within the image

range,thereby uniquely identifying a region of the image.Formally,the intended

semantics of these SICs is given by conditions which constraint the generic fuzzy

interpretation I to use a speci?c function to interpret each of them.In particular,

we treat each of the above introduced individual constants as a rigid designator

(i.e.,interpretation-independent name,compliant with the given intuitive account

of the alphabets)for a corresponding semantic entity.These conditions require·I to

be a total bijective mapping from I to IU,from R to R,from C to C,and from S to S.In this way,for instance,each image layout has a name in the language, and each name in the language names a different layout.

Furthermore,the following mereological SPSs are assumed,each having the

syntactical status of a description logic role:

—HAIR(i,r)(standing for Has Atomic Image Region).Relates the image layout

i to one of its grounded atomic regions r;

A Model of Multimedia Information Retrieval927—HIR(i,r)(Has Image Region).Relates the image layout i to one of its grounded regions r;

—HS(r,s)(Has Shape).Relates a grounded image region r to its shape s;—HC(r,c)(Has Color).Relates a grounded image region r to its color c.

The semantic conditions on mereological SPSs are as follows(for all image layouts i,i ,regions S,shapes s and colors c):


IU×2I N2


such that HAIR I(i, i ,S )=

1if i=i and S∈πi




IU×2I N2


such that HIR I(i, i ,S )=

1if i=i and S∈πi e




IU×2I N2


such that HS I( i,S ,s)=

1if S∈πi e and s=φ(S)




IU×2I N2


such that HC I( i,S ,c)=f i e(S)(c)(5) HAIR and HIR give raise,as expected,only to crisp assertions,that are valuated as true(i.e.,1)just in case the grounded image region given as second argument belongs to the image layout given as?rst argument.The reason why we have included both these SPSs in the language,despite the obvious subsumption relation that links them(i.e.,HAIR HIR),is that color and shape queries,while always allowed on atomic image regions,will be allowed on extended regions only in restricted cases.The rationale for this choice is of a computational nature and will be discussed in detail later,in Section8.2.Also HS is a crisp role,true only if the given closed curve is the contour of the given grounded image region.Finally, HC is a truly fuzzy role,assigning to each pair(grounded image region,color)the percentage of the latter that occurs in the former.Note that,in order to compute that percentage,the color function must be known,hence it is mandatory to have grounded image regions,in this case.Clearly,HC behaves as a crisp role on atomic regions,on which it is true just in case the given color is indeed the color of the region,otherwise HC is false,that is0.

As far as similarity SPSs,these are categorized into two groups,mirroring the categorization of visual queries:

—global similarity SPSs.In general,there will be a family of such SPSs,each capturing a speci?c similarity criterion.Since from the conceptual viewpoint these SPSs form a uniform class,this model provides just one of them,to be understood as a representative of the whole class.Any other symbol of the same sort can be added without altering the structure and philosophy of the language. So,for global similarity matching we use the role: