Using Natural Language Interfaces

Handbook of Human-Computer Interaction, M. Helander (ed.)

© Elsevier Science Publishers B.V. (North-Holland), 1996

William C. Ogden and Philip Bernick

Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico 88003

1.0 Introduction
    Habitability
2.0 Evaluation Issues
3.0 Evaluations of Prototype and Commercial Systems
    Laboratory Evaluations
    Field Studies
    Natural Language Versus Other Interface Designs
4.0 Design Issues
    What is Natural?
    Restrictions on Vocabulary
    Restrictions on Syntax
    Functional Restrictions
    Effects of Feedback
    Empirically Derived Grammars
5.0 Design Recommendations
6.0 Conclusion
7.0 Acknowledgments
8.0 References

1.0 Introduction

A goal of human factors research with computer systems is to develop human-computer communication modes that are both error tolerant and easily learned. Since people already have extensive communication skills through their own native or natural language (e.g. English, French, Japanese, etc.), many believe that natural language interfaces (NLIs) can provide the most useful and efficient way for people to interact with computers.

Although some authors express a belief that computers will never be able to understand natural language (e.g. Winograd & Flores, 1986), others feel that natural language processing technology needs only to advance sufficiently to make general purpose NLIs possible. Indeed, there have been several attempts to produce commercial systems.

The goal for most natural language systems is to provide an interface that minimizes the training required for users. To most, this means a system that uses the words and syntax of a natural language such as English. There is, however, some disagreement as to the amount of "understanding" or flexibility required in the system.

Systems have been proposed that permit users to construct English sentences by selecting words from menus (Tennant et al. 1983). However, Woods (1977) rejects the idea that a system using English words in an artificial format should be considered a natural language system, and assumes that the system should have an awareness of discourse rules that make it possible to omit easily inferred details. In further contrast, Perlman (1984) suggests that "naturalness" be determined by the context of the current application and urges the design of restricted "natural artificial" languages.

Philosophical issues about the plausibility of computers understanding and generating natural language aside, it was widely believed that the number of NLI applications would continue to grow (Waltz, 1983). Since then, work in graphical user interfaces (GUIs) has solved many of the problems that NLIs were expected to solve. As a result, NLIs have not grown at the rate first anticipated, and those that have been produced are designed to use constrained language in limited domains.

Research into NLIs continues, and a frequently asked question is how effective these interfaces are for human-computer communication. The focus of this chapter is not to address the philosophical issues of natural language processing. Rather, it is to review empirical methods that have been applied to the evaluation of these limited NLIs and to review the results of user studies. We have not included the evaluation of speech systems, primarily due to space limitations. Readers interested in speech system evaluations should consult Hirschman et al. (1992) and Goodine et al. (1992).

This discussion of empirical results is also not limited to a single definition of natural language. Instead, it uses the most liberal definition of natural language and looks at systems that seek to provide flexible input languages that minimize training requirements. The discussion is limited to two study categories that report empirical results obtained from observing users interacting with these systems. The first category consists of prototype system studies developed in research environments. These studies have been conducted both in laboratory settings and in field settings. The second category consists of simulated systems studied in the laboratory that are designed to help identify desirable attributes of natural language systems. Before reviewing these studies, some criteria for evaluating NLIs are presented.

Habitability

Habitability is a term coined by Watt (1968) to indicate how easily, naturally, and effectively users can use language to express themselves within the constraints of a system language. A language is considered habitable if users can express everything that is needed for a task using language they would expect the system to understand. For example, if there are 26 ways that a user population would be likely to describe an operation, a habitable system will process all 26.

In this review of studies, at least four domains in which a language can be habitable should be considered: conceptual, functional, syntactic, and lexical. Users of an NLI must learn to stay within the limits of all four domains.

Conceptual: The conceptual domain of a language describes the language's total area of coverage, and defines the complete set of objects and actions covered by the interface. Users may only reference those objects and actions processable by the system. For example, a user should not ask about staff members who are managers if the computer system has no information about managers. Such a system would not understand the sentence:

1. What is the salary of John Smith's manager?

Users are limited to only those concepts the system has information about. There is a difference between the conceptual domain of a language and the conceptual domain of the underlying system. The conceptual domain of a language can be expanded by recognizing concepts (e.g. manager) that exceed the system's coverage and responding appropriately (Codd, 1974), e.g. "There is no information on managers." The query could then be said to be part of the language's conceptual domain, but not supported by the system's. However, there will always be a limit to the number of concepts expressible within the language, and users must learn to refer to only these concepts.
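As a rough illustration of Codd's point, the sketch below (hypothetical data and names, not taken from any system discussed in this chapter) shows one way an interface can keep a concept such as "manager" inside the language's conceptual domain while acknowledging that the underlying system has no data for it.

    # Hypothetical sketch: concepts the database covers vs. concepts the
    # language merely recognizes so it can respond appropriately.
    KNOWN_CONCEPTS = {"employee", "salary", "department"}   # covered by the database
    RECOGNIZED_ONLY = {"manager"}                           # understood, but no data

    def respond(concepts_in_query):
        """Return an answer strategy for the concepts mentioned in a query."""
        unsupported = [c for c in concepts_in_query if c in RECOGNIZED_ONLY]
        unknown = [c for c in concepts_in_query
                   if c not in KNOWN_CONCEPTS | RECOGNIZED_ONLY]
        if unknown:
            return f"Sorry, I do not understand: {', '.join(unknown)}."
        if unsupported:
            # In the language's conceptual domain, but not the system's.
            return f"There is no information on {unsupported[0]}s."
        return "QUERY CAN BE ANSWERED"

    print(respond({"manager", "salary"}))   # -> "There is no information on managers."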

Functional: The functional domain is defined by constraints on what can be expressed within the language without elaboration, and determines the processing details that users may leave out of their expressions. While conceptual coverage determines what can be expressed, functional coverage determines how it can be expressed. Natural language allows speakers to reference concepts in many ways depending on listener knowledge and context. The functional domain is determined by the number of built-in functions or the knowledge the system has available. For example, although a database may have salary information about managers and staff, a natural language interface still may not understand the question expressed in Sentence 1 if the procedure for getting the answer is too complicated to be expressed in one question. For example, the answer to Sentence 1 may require two steps: one to retrieve the name of the manager and another to retrieve the salary associated with that name. Thus, the system may allow the user to get the answer with two questions:

2a. Who is the manager of John Smith?
    System: MARY JONES
2b. What is the salary of Mary Jones?

With these questions, the user essentially specifies procedures that the system is capable of performing. The question in Sentence 1 does not exceed the conceptual domain of the language because salaries of managers are available. Instead, it expresses a function that does not exist (i.e. a function that combines two retrievals in one question). Nor is it a syntactic limitation, since the system might understand a question with the same syntactic structure as Sentence 1, but that can be answered with a single database retrieval:

3. What is the name of John Smith's manager?
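The functional limitation can be made concrete with a small sketch (toy data and function names assumed for illustration only): the interface exposes single-step retrievals, so Sentence 1 can only be answered by chaining questions 2a and 2b, even though every concept involved is covered.

    # Hypothetical single-step retrieval functions; no composed
    # "salary of X's manager" function exists in this toy interface.
    EMPLOYEES = {
        "John Smith": {"manager": "Mary Jones", "salary": 40000},
        "Mary Jones": {"manager": None, "salary": 55000},
    }

    def manager_of(name):          # supported function: one retrieval
        return EMPLOYEES[name]["manager"]

    def salary_of(name):           # supported function: one retrieval
        return EMPLOYEES[name]["salary"]

    # Sentence 1 would need a composed function the interface does not
    # provide, so the user performs the steps of 2a and 2b explicitly.
    boss = manager_of("John Smith")     # 2a -> "Mary Jones"
    print(salary_of(boss))              # 2b -> 55000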

Other, more formal, languages vary in functional coverage as well. Concepts like square root can be expressed directly in some languages but must be computed in others. A habitable system provides the functions that users expect in the interface language. Since there will be a limit on the functional domain of the language, users must learn to refer to only those functions contained in the language.

Syntactic: The syntactic domain of a language refers to the number of paraphrases of a single command that the system understands. A system that did not allow possessives might not understand Sentence 1, but would understand Sentence 4:

4. What is the salary of the manager of John Smith?

A habitable system must provide the syntactic coverage users expect.

Lexical: The lexical domain of a language refers to the words contained in the system's lexicon. Sentence 1 might not be understood if the language does not accept the word "salary" but accepts "earnings." Thus, Sentence 5 would be understood:

5. What are the earnings of the manager of John Smith?

A natural language system must be made habitable in all four domains because it will be difficult for users to learn which domain is violated when the system rejects an expression. A user entering a command like Sentence 1 that is rejected by the system, but does not violate the conceptual domain, might be successful with any of the paraphrases in Sentences 2, 4, or 5. Which one works depends upon the functional, syntactic, or lexical coverage of the language. When evaluating the results of user studies, it is important to keep the distinctions between habitable domains in mind.

NLIs attempt to cover each domain by meeting the expectations of the user, and interface habitability is determined by how well these expectations are met. Of course, the most habitable NLI would be one capable of passing a Turing Test or winning the Loebner Prize¹.

It can be difficult to measure the habitability of a language. Determining how much coverage of each domain is adequate for a given task requires good evaluation methods. The next section reviews some of the methodological issues to consider when evaluating natural language interfaces.

2.0 Evaluation Issues

When evaluating the studies presented in this chapter, there are several methodological issues that need to be considered: user selection and training, task selection and presentation, so-called Wizard-of-Oz simulations (Kelley, 1984), parsing success rates, and interface customization. Here we present a brief discussion of these issues.

User selection and training: Evaluating any user-system interface requires test participants who represent the intended user population. For natural language evaluations, this is a particularly important factor. A language's habitability depends on how well it matches user knowledge about the domain of discourse. Therefore, test participants should have knowledge similar to that held by the actual users in the target domain. Some studies select carefully, but most have tried to train participants in the domain to be tested. It is likely, however, that participants employed for an experiment will be less motivated to use the NLI productively than existing users of the database. On the other hand, a measure of control is provided when participants are trained to have a common understanding of the domain.

¹ In the 1950s Turing proposed a test designed to challenge our beliefs about what it means to think (Turing, 1950). The Loebner variant involves computer programs with NLIs that have been designed to converse with users (Epstein, 1993). Users do not know whether they are communicating with a computer or another user. Winning the prize involves fooling the users into thinking they are conversing with another human, when, in fact, they are communicating with a computer.

Training ranges from a minimal introduction to the domain to extensive training on the interface language that can include instructions and practice on how to avoid common traps. Obviously, the quality of training given to users significantly affects user performance. Therefore, the kind and amount of training users receive should represent the training that users are expected to have with the actual product.

Task generation and presentation: The goal of studies designed to evaluate the use of natural language is to collect unbiased user expressions as they are engaged in computer-related tasks. These tasks should be representative of the type of work the user would be expected to accomplish with the interface and should be presented in a way that does not influence the form of expression. In most studies, tasks are generated by the experimenter, who attempts to cover the range of functions available in the interface. Letting users generate their own tasks is an alternative method, but it results in a reduction of experimenter control. This method also requires that users be motivated to ask appropriate questions.

Experimenter-generated tasks are necessary if actual users are not available or need prior training. These tasks can simulate a hypothesized level of user knowledge and experience by presenting tasks assumed to be representative of questions that would be asked of the real system. The disadvantage of experimenter-generated tasks is that they do not allow for assessment of the language's conceptual habitability, because experimenters usually generate only solvable tasks.

User-generated tasks have the advantage of being able to assess the language's conceptual and functional coverage because users are free to express problems that may be beyond the capabilities of the system. The disadvantage of user-generated tasks is that a study's results may not generalize beyond the set of questions asked by the selected set of users; no attempt at covering all of the capabilities of the interface will have been made. Another disadvantage is that actual users of the proposed system must be available.

How experimenter-generated or user-generated tasks are presented to test participants has a strong influence on the expressions test participants generate. In an extreme case, participants would be able to solve the task merely by entering the task instructions as they were presented. At the other extreme, task instructions would encourage participants to generate invalid expressions.

Researchers usually choose one of two methods for overcoming these extremes. One method presents the task as a large, generally stated problem that requires several steps to solve. This method tests not only the habitability of the language but also the problem-solving ability of the participants. Participants are free to use whatever strategy seems natural and to use whatever functions they expect the interface to have. Like user-generated tasks, this method does not allow researchers to test all of the anticipated uses of the system because participants may not ask sufficiently complicated questions.

An alternative method has been used to test more of a system's functions. In this method, participants are given items like tables or graphs with some information missing, and are then asked to complete these items by asking the system for the missing information. This method gives an experimenter the most control over the complexity of expressions that participants would be expected to enter. However, some expressions may be difficult to represent non-linguistically, so coverage may not be as complete as desired (cf. Zoeppritz, 1986). An independent measure of how participants interpreted the questions should also be used to determine whether participants understood the task. For example, participants may be asked to do the requested task manually before asking the computer to do it.

Wizard-of-Oz (WOz) simulations: WOz studies simulate a natural language system by using a human to interpret participants' commands. In a typical experiment, a participant types a natural language command on one terminal, and it appears on a terminal monitored by an operator (the Wizard) hidden in another location. The Wizard interprets the command and takes appropriate actions that result in messages appearing on the participant's terminal. Usually the Wizard makes decisions about what a real system would or would not understand. It is likely that the Wizard will not be as consistent in responding as a computer would be, and this problem should be taken into account when reviewing these studies. However, WOz simulations are useful for quickly evaluating potential designs.

Evaluation of parsing success rates: An often reported measure of habitability is the proportion of expressions that can be successfully parsed by the language processor and that return a result. But studies that report parsing success rates as a global indicator of how well the system is doing assume that all commands are equally complex when they are not (Tennant, 1980).

A high success rate may be due to participants in the study repeating a simple request many times. For example, Tennant observed a participant asking "How many NOR hours did plane 4 have in Jan of 1973," and then repeating this request for each of the 12 months. This yields 12 correctly parsed questions. On the other hand, another participant trying to get the information for all 12 months in one request had two incorrectly parsed requests before correctly requesting "List the NOR hours in each month of 1973 for plane 4." The second participant had a lower parse rate percentage, but obtained the desired information with less work than the first participant. In fact, Tennant found that successful task solution scores did not correlate with parsing success rates. Therefore, a user's ability to enter allowable commands does not guarantee that they can accomplish their tasks.

Parsing success rates need to be interpreted in light of other measures such as the number of requests per task, task solution success, and solution time. Since these measures depend on the tasks users are given, which vary from study to study, it would be inappropriate to compare systems across studies.
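The sketch below (with an invented, simplified log format) illustrates why the three measures should be read together: a log dominated by easy repeated requests can show a high parse rate while saying little about whether tasks were actually solved or how much work they took.

    # Hypothetical interaction log: (task_id, parsed_ok, task_solved)
    log = [
        (1, True, True), (1, True, True), (1, True, True),    # easy request repeated
        (2, False, False), (2, False, False), (2, True, True) # two failures, then success
    ]

    parse_rate = sum(ok for _, ok, _ in log) / len(log)
    tasks = {t for t, _, _ in log}
    solved = {t for t, _, s in log if s}
    task_success = len(solved) / len(tasks)
    requests_per_task = len(log) / len(tasks)

    print(f"parse rate: {parse_rate:.0%}")            # high or low, says little by itself
    print(f"task success: {task_success:.0%}")        # what the user actually achieved
    print(f"requests per task: {requests_per_task}")  # how much work it took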

Interface customization: Finally, how a system is customized for the application being evaluated is an important methodological issue. Each NLI requires that semantic and pragmatic information about the task domain be encoded and entered. Evaluation results are significantly affected if researchers capture and enter this information poorly. Since most evaluations have been conducted on systems that were customized by system developers, these systems represent ideally adapted interfaces. However, most operational systems will not have the advantage of being customized by an expert, so performance with an operational system may be worse.

3.0 Evaluations of Prototype and Commercial Systems

The feasibility of the natural language approach is usually shown by building and demonstrating a prototype system prior to delivering it to the marketplace. Very few of these prototype systems have been evaluated by actually measuring user performance. The few that have been evaluated are reviewed here, beginning with a review of evaluations done under controlled laboratory conditions, followed by a review of field studies.

Laboratory Evaluations

LADDER: Hershman et al. (1979) studied ten Navy officers using LADDER, a natural language query system designed to provide easy access to a naval database. The goal of this study was to simulate as closely as possible the actual operational environment in which LADDER would be implemented. To accomplish this, Navy officers were trained to be intermediaries between a hypothetical decision maker and the computer's database in a simulated search and rescue operation. Officers were given global requests for information and were asked to use LADDER to obtain the necessary information. Training consisted of a 30-minute tutorial session that included a lengthy discussion of LADDER's syntax and vocabulary, followed by one hour of practice that involved typing in canned queries and solving some simple problems. Compared to participants in other studies, these participants were moderately well trained.

Participants were largely successful at obtaining the necessary information from the database, and were able to avoid requests for information not relevant to their task. Thus, it seems that participants were easily able to stay within the conceptual domain of the language. Hershman et al., however, report that LADDER parsed only 70.5 percent of the 366 queries submitted. Participants also used twice the number of queries that would have been required by an expert LADDER user. Almost 80 percent of the rejected queries were due to syntax errors. Apparently, LADDER's syntactic coverage was too limited for these moderately trained users. However, a contributing factor may have been the wording of the information requests given to the participants. These requests were designed not to be understood by LADDER, and this could have influenced the questions typed by the participants.

Hershman et al. concluded that the system could benefit from expanded syntactic and lexical coverage, but that users would still require training. Apparently, the training that was given to participants in this study was adequate for teaching the system's functional and conceptual coverage, but not for teaching its syntactic and lexical coverage.

PLANES: Tennant (1979) conducted several studies of prototype systems and came to similar conclusions. However, Tennant also provides evidence that users have trouble staying within the conceptual limits of a natural language query interface. Tennant studied users of two natural language question answering systems: PLANES and the Automatic Advisor. PLANES was used with a relational database of flight and maintenance records for naval aircraft, and the Automatic Advisor was used to provide information about engineering courses offered at a university. Participants were university students who were unfamiliar with the database domains of the two systems. Participants using PLANES were given a 600-word script that described the information contained in the database. Participants using the Automatic Advisor were given only a few sentences describing the domain. Participants received no other training. Problems were presented either in the form of partially completed tables and charts or in the form of long descriptions of high-level problems that users had to decompose.

Problems were generated either by people familiar with the databases or by people who had received only the brief introduction given to the participants. The purpose of having problems generated by people who had no experience with the system was to test the conceptual completeness of the natural language systems. If all of the problems generated by inexperienced users could be solved, then the system could be considered conceptually complete. However, this was not the case. Tennant does not report statistics, but claims that some of the problems generated by the inexperienced users could not have been solved using the natural language system, and consequently participants were less able to solve these problems than the problems generated by people familiar with the database.

Tennant concluded that the systems were not conceptually or functionally complete and that, without extending the conceptual coverage beyond the limits of the database contents, natural language systems would be as difficult to use as formal language systems.

NLC: Other laboratory evaluations of prototype systems have been conducted using a natural language programming system called NLC (Biermann et al., 1983). The NLC system allows users to display and manipulate numerical tables or matrices. Users are limited to commands that begin with an imperative verb and can only refer to items shown on their display terminals. Thus, the user is directly aware of some of the language's syntactic and conceptual limitations.

A study conducted by Biermann et al. (1983) compared NLC to a formal programming language, PL/C. Participants were asked to solve a linear algebra problem and a "grade-book" problem using either NLC or PL/C. Participants using NLC were given a written tutorial, a practice session, the problems, and some brief instructions on using an interactive terminal. Participants using PL/C were given the problems and used a batch card reading system. The 23 participants were just completing a course in PL/C and were considered to be in the top one-third of the class. Each participant solved one of the problems using NLC and the other using PL/C. Problems were equally divided among languages.

Results show that 10 of 12 (83 percent) participants using NLC correctly completed the linear algebra problem in an average of 34 minutes. This performance compared favorably to that of the PL/C group, in which 5 of 11 (45 percent) participants correctly completed this problem in an average of 165 minutes. For the "grade-book" problem, 8 of 11 (73 percent) NLC participants completed the problem correctly in an average of 68 minutes, while 9 of 12 (75 percent) PL/C participants correctly completed the problem in an average of 125 minutes. The reliability of these differences was not tested statistically, but it was clear that participants with 50 minutes of self-paced training could use a natural language programming tool on problems generated by the system designers. These participants also did at least as well as similar participants who used a formal language which they had just learned.

The system was able to process 81 percent of the natural language commands correctly. Most of the incorrect commands were judged to be the result of "user sloppiness" and non-implemented functions. Users stayed within the conceptual domain when they were given an explicit model of the domain (as items on the display terminal) and were given problems generated by the system designers. However, users seemed to have difficulty staying within the functional limitations of the system and were not always perfect in their syntactic and lexical performance.

The idea of referring to an explicit conceptual model that can be displayed on the screen is a good one. Biermann et al. also pointed out the necessity of providing immediate feedback via an on-line display to show users how their commands were interpreted. If there was a misinterpretation, it would be very obvious, and the command could be corrected with an UNDO instruction.

In another study of NLC, Fink et al. (1985) examined the training issue. Eighteen participants with little or no computer experience were given problems to solve with NLC. To solve these problems, participants had to formulate conditional statements that the system was capable of understanding. Participants received no training, nor were they given examples of how to express these conditions. In other respects the experimental methodology was the same as Biermann et al. (1983). The following is an example of an allowable conditional statement in NLC:

For i = 1 to 4, double row i if it contains a positive entry.
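For readers unfamiliar with this style of table manipulation, the sketch below shows the operation that command describes, written here with NumPy purely as an illustration (NumPy postdates NLC and the data are invented).

    # Illustrative only: the matrix operation named by the NLC example above.
    import numpy as np

    table = np.array([[-1, -2,  0],
                      [ 3, -1,  0],
                      [-5, -6, -7],
                      [ 0,  2,  1]])

    for i in range(4):                  # "For i = 1 to 4"
        if (table[i] > 0).any():        # "if it contains a positive entry"
            table[i] = table[i] * 2     # "double row i"

    print(table)   # rows 2 and 4 (1-indexed) are doubled; rows 1 and 3 are unchanged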

Fink et al. reported large individual differences in the participants' abilities to discover rules for generating conditional statements. One participant made only one error in solving 13 problems, whereas another participant could not solve any problems. In general, participants made large numbers of errors solving the first few problems and made few errors after discovering a method that worked. These findings support the conclusion that training is required for these kinds of natural language systems.

A commercial system: Customization can make it difficult to study prototype natural language interfaces. The flexibility of commercial systems demands customization, but a system's usability depends on its natural language processor and on how well the interface has been customized for the application. When usability problems occur, evaluators may have trouble determining whether the natural language processor or the customization is responsible.

Ogden & Sorknes (1987) evaluated a PC-based natural language query product that allowed users to do their own customizing. The evaluation goal was to assess how well a commercially available NLI would meet the needs of a database user who was responsible for customization but who had no formal query training. The interface was evaluated by observing seven participants as they learned and used the product. They were given the product's documentation and asked to solve a set of 47 query-writing problems. Problems were presented as data tables with some missing items, and participants were asked to enter a question that would retrieve only the missing data.

The interface had difficulty interpreting participants' queries, with first attempts producing a correct result only 28 percent of the time. After an average of 3.6 attempts, 72 percent of the problems were answered correctly. Another 16 percent of the problems that users thought were correctly answered were not. This undetected error frequency would be unacceptable to most database users.

The results also showed that participants frequently used the system to view the database structure: an average of 17 times for each participant (36 percent of the tasks). This indicates that the system's natural language user needed to have specific knowledge of the database to use the interface. Clarification dialog occurred when the parser prompted users for more information. This occurred an average of 20 times for each participant (42 percent of the tasks). The high frequency of undetected errors, coupled with the frequent need for clarification dialogs, suggests that users were struggling to be understood.

The following is an example of how undetected errors can occur:

*User: How many credits does "David Lee" have?
System: count: 2

*User: What are the total credits for "David Lee"?
System: total credits: 7

Here the system gives different answers to paraphrases of the same question. In the first case, the phrase "How many" invoked a count function, and "2" is the number of courses the student took. In the second case, the word "total" invoked a sum function. The system did not contain the semantic information about credits needed to determine that they should be summed in both cases. Thus, to use this system, users would need to know the functional differences between saying "How many..." and "What are the total..." The language was not habitable for the participants in the study, who did not know these differences, due to limitations in the system's functional coverage. The conclusion is that users need specific training on the functional characteristics of the language and database in much the same way as users of formal languages do. Without this training, users cannot be expected to customize their own language interface.
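One plausible way such an error can arise is sketched below; this is a hypothetical reconstruction with toy data, not the actual product's logic. Surface keywords choose the aggregate function, and nothing in the lexicon says that "credits" should always be summed.

    # Hypothetical keyword-to-function mapping that reproduces the dialog above.
    COURSES = {"David Lee": [3, 4]}   # credit hours of two courses (toy data)

    def answer(question, student):
        values = COURSES[student]
        if "how many" in question.lower():
            return ("count", len(values))     # "How many..." -> count function
        if "total" in question.lower():
            return ("total", sum(values))     # "...total..."  -> sum function
        return ("unknown", None)

    print(answer('How many credits does "David Lee" have?', "David Lee"))    # ('count', 2)
    print(answer('What are the total credits for "David Lee"?', "David Lee"))  # ('total', 7)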

Harman and Candela (1990) report an evaluation of an information retrieval system's NLI. The prototype system, which used a very simple natural language processing (NLP) model, allowed users to enter unrestricted natural language questions, phrases, or a set of terms. The system would respond with a set of text document titles ranked according to relevance to the question. This system's statistical ranking mechanism considered each word in a query independently. Together these words were statistically compared to the records in an information file. This comparison was used to estimate the likelihood that a record was relevant to the question.
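The sketch below conveys the flavor of this word-by-word statistical ranking; the scoring here is a plain shared-term count over invented documents, not Harman and Candela's actual model.

    # Rough sketch of term-based ranking: each query word is scored
    # independently against each record, and records are ranked by score.
    def rank(query, documents):
        q_terms = set(query.lower().split())
        scores = []
        for title, text in documents.items():
            overlap = q_terms & set(text.lower().split())
            scores.append((len(overlap), title))
        return [t for n, t in sorted(scores, reverse=True) if n > 0]

    docs = {
        "Doc A": "ranking documents by query term statistics",
        "Doc B": "boolean retrieval with AND OR NOT operators",
    }
    print(rank("statistical ranking of documents", docs))   # Doc A ranks first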

The system contained test data already in use by the study's more than 40 test participants. The questions submitted to the system were user-generated and relevant to the participants' current or recent research interests. Nine participants were proficient Boolean retrieval system users, and five others had limited experience. The rest of the participants were not familiar with any retrieval systems. All participants were very familiar with the data contained in the test sets.

Evaluating results from this and other text retrieval systems is problematic in that the measure of success is often a subjective evaluation of how relevant the retrieved documents are, and this makes it impossible to determine how successful the interaction was. However, Harman and Candela report that for a select set of text queries, 53 out of 68 queries (77 percent) retrieved at least one relevant record. The other reported results were qualitative judgements made by Harman and Candela based on the comments of the test participants. Generally, this study's participants found useful information very quickly, and first-time users seemed to be just as successful as the experienced Boolean system users. While the study did not test a Boolean system, Harman and Candela point out that, in contrast, first-time users of Boolean systems have little success.

Harman and Candela conclude that this type of NLI for information retrieval is a good solution to a difficult Boolean interface. Whereas some queries, such as queries requiring the NOT operator, are not handled correctly by statistically based approaches, these problems could be overcome. A study by Turtle (1994) that directly compares the results of a Boolean and an NLI information retrieval system is reviewed later in this chapter.

Summary: Laboratory studies of natural language prototypes make possible the observation that users do relatively well if they 1) are knowledgeable about the domain or are given good feedback about the domain, 2) are given language-specific training, and 3) are given tasks that have been generated by the experimenters. Users perform poorly when 1) training is absent, 2) domain knowledge is limited, or 3) the system is functionally impoverished.

Tennant’s studies suggest that user-gener-ated tasks will be much more dif?cult to per-form than experimenter-generated tasks. Since actually using an interface will involve user-generated tasks, it is important to evaluate NLIs under ?eld conditions.

Field Studies

Field studies of prototype natural language systems have focused on learning how the language was used, what language facilities were used, and identifying system requirements in an operational environment with real users working with genuine data. Two of these evaluations (Krause, 1980; Damerau, 1981) are discussed here. Harris (1977) has reported a field test that resulted in a 90 percent parsing success rate, but since no other details were reported the study is hard to evaluate. Three other studies will also be examined: one is a field study of a commercial system (Capindale and Crawford, 1990), a second compares a prototype natural language system with a formal language system (Jarke et al., 1985), and a third describes a field test of a conversational hypertext natural language information system (Patrick and Whalen, 1992).

USL: Krause (1980) studied the use of the User Specialty Language (USL) system for database query answering in the context of an actual application. The USL system was installed as a German-language interface to a computer database. The database contained grade and other information about 430 students attending a German Gymnasium. The users were teachers in the school who wanted to analyze data on student development. For example, they wanted to know if early grades predicted later success. Users were highly motivated and understood the application domain well. Although the amount of training users received is not reported, the system was installed under optimal conditions: by its developers, after they interviewed users to understand the kinds of questions that would be asked.

The system was used over a one-year period. During this time, about 7300 questions were asked in 46 different sessions. For each session, a user would come to a laboratory with a set of questions and would use the system in the presence of an observer. Study data consisted of the session logs, observer's notes, and user questionnaires. The observer did not provide any user assistance.

The results reported by Krause come from an analysis of 2121 questions asked by one of the users. Generally, this user successfully entered questions into USL. Overall, only 6.9 percent of the user's questions could be classified as errors, and most of these (4.4 percent) were correctable typing errors. Krause attributes part of this low error rate to the observation that the user was so involved in the task and the data being analyzed that little effort was spent learning more about the USL system. This user may have found some simple question structures that worked well and used them over and over again. This may be very indicative of how natural language systems will be used. It is unclear whether the user actually got the wanted answers, since Krause provides no data on this. However, Krause reports two observations that suggest the user did get satisfactory answers: 1) the user remained very motivated, and 2) a research report based on the data obtained with USL was published.

A major finding by Krause was that syntactic errors gave the user more difficulty than semantic errors. It was easier to recover from errors resulting from "which students go to class Y?" if the recovery required a synonym change resulting in "which students attend class Y?" than if it required a syntactic change resulting in "List the class Y students." From these observations, Krause concludes that broad syntactic coverage is needed even when the semantics of the database are well understood by the users.

TQA: Damerau (1981) presents a statistical summary of the use of another natural language query interface called the Transformational Question Answering (TQA) system. Over a one-year period, a city planning department used TQA to access a database consisting of records of each parcel of land in the city. Users, who were very familiar with the database, received training on the TQA language. Access was available to users whenever needed via a computer terminal connected to the TQA system. Session logs were collected automatically and consisted of a trace of all the output received at the terminal as well as a trace of the system's performance.

The results come primarily from one user, although other users entered some requests. A total of 788 queries were entered during the study year, and 65 percent of these resulted in an answer from the database. There is no way to know what proportion of these answers were actually useful to the users. Clarification was required when TQA did not recognize a word or when it recognized an ambiguous question, and thirty percent of the questions entered required clarification. In these cases, the user could re-key the word or select among alternate interpretations of the ambiguous question.

Damerau also reports instances of users echoing back the system's responses, which was not allowable input in the version of TQA being tested. The TQA system would repeat a question after transforming phrases found in the lexicon. Thus, the phrase "gas station" would be echoed back to the user as "GAS_STATION." Users would create errors by entering the echoed version. TQA was subsequently changed to echo some variant of what the user entered which would be allowable if entered.

The results reported by Damerau are mainly descriptive, but the researchers were encouraged by them and reported that users had positive attitudes toward the system. Evaluating this study is difficult because no measures of user success are available.

INTELLECT: Capindale and Crawford (1990) report a field evaluation of INTELLECT, the first commercial NLI to appear on the market. INTELLECT is an NLI to existing relational database systems. Nineteen users of the data, who had previously accessed it through a menu system, were given a one-hour introduction to INTELLECT. They were then free to use INTELLECT to access the data for a period of ten weeks. Users ranged in their familiarity with the database but were all true end-users of the data. They generated their own questions, presumably to solve particular job-related problems, although Capindale and Crawford did not analyze the nature of these questions. The capabilities and limitations of INTELLECT are summarized well by Capindale and Crawford, who make special note of Martin's (1985) observation that the success of an INTELLECT installation depends on building a custom lexicon and that "To build a good lexicon requires considerable work" (p. 219). Surprisingly, no mention is made of the effort or level of customization that went into installing INTELLECT for their study. This makes it difficult to evaluate Capindale and Crawford's results.

Most of the reported results are questionnaire data obtained from users after the ten-week period and are of little consequence to the present discussion, except to say that the users were mostly pleased with the idea of using an NLI and rated many of INTELLECT's features highly. Objective data were also recorded in transaction logs, which Capindale and Crawford analyzed by defining success as the parse success rate. The parse success rate was 88.5 percent. There was no attempt to determine task success rate, as is often the case in field studies.

Comparison to Formal Language: Since no system can cover all possible utterances of a natural language, such systems are in some sense formal computer languages. Therefore, they must be compared against other formal language systems in regard to function, ease of learning and recall, etc. (Zoeppritz, 1986). In a comprehensive field study that compared a natural language system (USL) with a formal language system (SQL), Jarke et al. (1985) used paid participants to serve as "advisors" or surrogates to the principal users of a database. The database contained university alumni records, and the principal users were university alumni officers. This could be considered a field study because USL and SQL were used on a relational database containing data from an actual application, and the participants' tasks were generated by the principal users of these data. However, it could also be considered a laboratory evaluation because the participants were paid to learn and use both languages, and they were recruited solely to participate in the study. Unfortunately, it lacked the control of a laboratory study since 1) participants were given different tasks (although some tasks were given to both language groups), and 2) the USL system was modified during the study. Also, the database management system was running on a large time-shared system; response times and system availability were poor and varied between language conditions.

Eight participants were selected non-randomly from a pool of 20 applicants who, in the experimenters' judgment, represented a homogeneous group of young business professionals. Applicants were familiar with computers but had only limited experience with them. Classroom instruction was given for both SQL and USL, and each participant learned and used both. Instruction for USL, the natural language system, was extensive and specific, and identified the language's restrictions and strategies to overcome them.

Analysis of the tasks generated by the principal users for use by the participants indicated that 15.6 percent of the SQL tasks and 26.2 percent of the USL tasks were unanswerable. The proportion of these tasks that exceeded the database's conceptual coverage versus the proportion that exceeded the query language's functional coverage is not reported. Nevertheless, Jarke et al. conclude that SQL is functionally more powerful than USL. The important point is, however, that principal users of the database (who knew the conceptual domain very well) generated many tasks that could not be solved by either query language. Thus, this study supports the findings suggested by Tennant (1979): users who know the conceptual domain but who have had no experience with computers may still ask questions that cannot be answered.

SQL users solved more than twice as many tasks as USL users. Of the fully solvable tasks, 52.4 percent were solved using SQL versus 23.6 percent using USL. A fairer comparison is to look at the paired tasks (tasks that were given to both language groups), but Jarke et al. do not present these data clearly. They report that SQL was "better" on 60.7 percent, that USL was "better" on 17.9 percent, and that SQL and USL were "equal" on 21.4 percent of the paired tasks. They do not indicate how "better" or "equal" performance was determined. Nevertheless, the results indicate that it was difficult for participants to obtain responses to the requests of actual users regardless of which language was used. Furthermore, the natural language system tested in this study was not used more effectively than the formal language system.

In trying to explain the difficulty users had with natural language, Jarke et al. cited lack of functionality as one main reason for task failure. Of the solvable tasks, 24 percent were not solved because participants tried to invoke unavailable functions. Apparently, the USL system tested in this study did not provide the conceptual and/or functional coverage that was needed for tasks generated by the actual users of the database.

Many task failures had to do with the system's hardware and operating environment. System unavailability and interface problems contributed to 29 percent of the failures. In contrast, system and interface problems contributed to only 7 percent of the task failures when SQL was used. This represents a source of confounding between the two language conditions and weakens the comparison that can be made. Therefore, little can be said about the advantage of natural language over formal languages based on this study.

It is clear, however, that the prototype USL system studied in this evaluation could not be used effectively to answer the actual questions raised by the principal users of the database. It should be noted that the system was installed and customized by the experimenters and not by the system developers. Thus, a sub-optimal customization procedure might have been a major contributor to the system's performance.

COMODA: Patrick and Whalen (1992) conducted a large field test of COMODA, their conversational hypertext natural language information system for distributing information about the disease AIDS to the public. In this test, users with computers and modems could call a dial-up AIDS information system and use natural language to ask questions or just browse.

Whalen and Patrick report that during a two-month period the COMODA system received nearly 500 calls. The average call lasted approximately 10 minutes and involved an average of 27 exchanges (query-response, request) between the user and the system. Of these, approximately 45 percent were direct natural language queries, and though they provide no specific numbers, Whalen and Patrick report that the system successfully answered many of them.

Users were recruited by advertising in local newspapers, radio, and television in Alberta, Canada. A close analysis of the calls from the final three weeks of data collection evaluated not only those inputs from users that the system could parse, but also the success rates of system responses. Correct answers were those that provided the information requested by the user or gave a response of "I don't know about that." when no information responding to the request was available. Incorrect responses were those that provided the wrong information when correct information was available, provided wrong information when correct information was not available, gave a response of "I don't know about that." when correct information was available, or responded when the query was ambiguous so that a correct answer could not be identified. Whalen and Patrick report a 70 percent correct response rate.

Patrick and Whalen were surprised by the number of requests to browse, since the system was designed to enable users to easily obtain information about a particular topic area. However, since topic focus requires knowledge of the domain by a user, and since there is no way for the experimenter to know how knowledgeable users were, the requests for browsing might be explained by the novelty of the system and users' interest in exploring the system in conjunction with learning about the information it contained.

It also isn’t clear from the study whether users thought they were interacting with a human or a computer. Previously, Whalen and Patrick have reported that their system does not lead users to believe that they are interact-ing with a human (Whalen and Patrick, 1989). However, Whalen went on to enter a variant of this system in the 1994 Loebner competition where he took ?rst prize. Though his system fooled none of the judges into thinking it was a human (which is the goal of the competition) he did receive the highest median score of all the computer entrants. An important difference between Whalen’s entry and other systems is that it contains no natural language under-standing component. Like COMODA it is lim-ited to recognizing actual words and phrases people use to discuss a topic.

The result of this work contributes significantly to the notion that NLP is not an essential component of successful NLIs, and that NLIs are useful for databases other than relational databases.

Summary: Field studies are not usually intended to be generalized beyond their limited application. Only the relative success of the implementations can be assessed. The studies that have been presented offer mixed results. In general, the results of the field studies tend to agree with the laboratory study results. If users are very familiar with the database, their major difficulties are caused by syntactic limitations of the language. However, if the system does not provide the conceptual or functional coverage the user expects, performance will suffer dramatically. If this is the case, it appears that training will be required to instruct users about the functional capabilities of the system, and the language must provide broad syntactic coverage. The type of training that is required has not been established.

Natural Language Versus Other Interface Designs

A first issue concerns the question of whether a natural language would really be any better for an interface than a formal artificial language designed to do the same task. The previously discussed studies that compared prototype languages with artificial languages reported mixed results. In the case of a database application, Jarke et al. (1985) showed an advantage for the artificial language, whereas in a programming application Biermann et al. (1983) showed an advantage for the natural language.

This section reviews other studies that compare natural and artificial languages.

Small and Weldon (1983) simulated two database query languages using WOz. One language was based on a formal language (SQL) and had a fixed syntax and vocabulary. The other allowed unrestricted syntax and free use of synonyms. However, users of both languages had to follow the same conceptual and functional restrictions. Thus, users of the natural language had to specify the database tables, columns, and search criteria to be used to answer the query. For example, the request "Find the doctors whose age is over 35." would not be allowed because the database table that contains doctors is not mentioned. Thus, a valid request would have been "Find the doctors on the staff whose age is over 35." Because Small and Weldon were attempting to control for the information content necessary in each language, this study compared unrestricted syntax and vocabulary to restricted syntax and vocabulary while trying to control for functional capabilities.

The participant's task was to view a subset of the data and write a query that could retrieve that subset. Although it is unclear how much of the required functional information (e.g. table and column names) was contained in these answer sets, this method of problem presentation may have helped the natural language participants include this information in their requests. Participants used both languages in a counterbalanced order. Ten participants used natural language first and ten participants used the artificial language first. The natural language users received no training (although they presumably were given the names of the database tables and columns), and the artificial language users were given a self-paced study guide to the language. The participants were then given four practice problems followed by 16 experimental problems with each language.

Results showed that there was no difference in the number of language errors between the two languages. It appears that the difficulty the natural language users had in remembering to mention table and column names was roughly equivalent to the difficulty artificial language participants had in remembering to mention the table and column names while following the syntactic and lexical restrictions. The structured order of the formal language must have helped the participants remember to include the column and table names. Thus, it is likely that it was more difficult for the natural language participants to remember to include table and column information than it was for the formal language participants. This analysis is based on the assumption that the participants in the formal language condition made more syntactic and lexical errors than the participants using natural language. However, Small and Weldon only present overall error rates, so this assumption may be incorrect.

The results also show that participants using the structured language could enter their queries faster than those using the natural language, especially for simple problems. Small and Weldon use this result to conclude that formal languages are superior to natural languages. However, the tested set of SQL was limited in function compared to what is available in most implementations of database query languages, and the speed advantage reported for SQL was not as pronounced for more complicated problems. Thus, a better conclusion is that NLIs should provide more function than their formal language counterparts if they are going to be easier to use than formal languages. Providing a flexible syntax and vocabulary may not be enough.

Shneiderman (1978) also compared a natural language to a formal relational query language. However, unlike Small and Weldon (1983), Shneiderman chose not to impose any limits on the participants' use of natural language. Participants were told about a department store employee database and were instructed to ask questions that would lead to information about which department they would want to work in. One group of participants first asked questions in natural language and then used the formal language. Another group used the formal language first and then natural language. In the formal query language condition, participants had to know the structure and content of the database, but in the natural language condition they were not given this information. The results showed that the number of requests that could not be answered with data in the database was higher using natural language than when using the formal language. This was especially true for participants in the natural-language-first condition. This should not be surprising given that the participants did not know what was in the database. However, the result highlights the fact that users' expectations about the functional capabilities of a database will probably exceed what is available in current systems.

In another laboratory experiment, Borenstein (1986) compared several methods for obtaining on-line help, including a human tutor and a simulated natural language help system. Both allowed for unrestricted natural language, but the simulated natural language help system required users to type queries on a keyboard. He compared these to two other traditional methods, the standard UNIX "man" and "key" help system and a prototype window and menu help system. The UNIX help system was also modified to provide the same help texts that were provided by the menu and natural language systems. The participants were asked to accomplish a set of tasks using a UNIX-based system but had prior experience only with other computer systems. As a measure of the effectiveness of the help systems, Borenstein measured the time these participants needed to complete the tasks.

The results showed that participants completed tasks fastest when they had a human tutor to help them, and slowest when they used the standard UNIX help system. But Borenstein found little difference between the modified UNIX command interface, the window/menu system, and the natural language system. Because all of the methods provided the same help texts, Borenstein concluded that the quality of the information provided by the help system is more important than the interface.

Hauptmann and Green (1983) compared an NLI with a command language and a menu-based interface for a program that generated simple graphs. Participants were given hand-sketched graphs to reproduce using the program. The NLI was embedded in a mixed-initiative dialog in which either the computer or the users could initiate the dialog. Hauptmann and Green report no differences between the three interface styles in the time to complete the task or in the number of errors. However, they do report many usability problems with all three interfaces. One such problem, which may have been more critical for the NLI, was the restrictive order in which operations could be performed with the system. Also, the NLI was a simple keyword system that was customized on the basis of what may have been too small a sample. The authors concluded that NLIs may give no advantage over command and menu systems unless they can also overcome rigid system constraints by adding flexibility not contained in the underlying program.
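As a rough illustration, a keyword system of the kind described here can be approximated in a few lines. The command names and keyword vocabulary below are invented for illustration and are not Hauptmann and Green's actual system; note that such an interpreter inherits whatever ordering constraints the underlying program imposes.

    # A minimal keyword-matching interpreter; the vocabulary is invented.
    GRAPH_COMMANDS = {
        "draw_bar_chart":  {"bar", "bars", "histogram"},
        "draw_line_graph": {"line", "curve", "plot"},
        "set_title":       {"title", "label", "caption"},
    }

    def interpret(utterance):
        """Return the first command whose keywords appear in the utterance."""
        words = set(utterance.lower().split())
        for command, keywords in GRAPH_COMMANDS.items():
            if words & keywords:
                return command
        return None  # unrecognized; a fuller system would ask for clarification

    print(interpret("Please draw a bar chart of the sales data"))  # draw_bar_chart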

Turtle (1994) also compared an NLI to an artificial language interface. He compared the performance of several information retrieval systems that accept natural language queries to the performance of expert users of a Boolean retrieval system when searching full-text legal materials. In contrast to all other studies described in this section, Turtle found a clear advantage for the NLI.

In Turtle's study, experienced attorneys developed a set of natural language issue statements to represent the types of problems lawyers would research. These natural language statements were then used as input to several commercial and prototype search systems. The top 20 documents retrieved by each system were independently rated for relevance. The issue statements were also given to experienced users of a Boolean query system (WESTLAW). Users wrote Boolean queries and were allowed to iterate each against a test database until they were satisfied with the results. The set of documents obtained using these queries contained fewer relevant ones than the sets obtained by the NLI systems.
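One reason ranked, natural-language retrieval can behave this way is that it does not discard partially matching documents. The sketch below makes the contrast with invented documents and a deliberately crude overlap score; it is not the ranking model used by the systems Turtle tested.

    # Strict Boolean AND retrieval versus simple term-overlap ranking;
    # hypothetical documents, not drawn from Turtle (1994).
    docs = {
        1: "court holds employer liable for negligent hiring of driver",
        2: "jury finds trucking company negligent in hiring unlicensed driver",
        3: "contract dispute over delivery schedule",
    }

    def boolean_and(terms, docs):
        # Every term must appear, or the document is excluded outright.
        return [d for d, text in docs.items()
                if all(t in text.split() for t in terms)]

    def ranked(terms, docs, k=2):
        # Partial matches still earn a score and can be returned.
        scores = {d: sum(t in text.split() for t in terms)
                  for d, text in docs.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    issue = ["employer", "liable", "negligent", "hiring"]
    print(boolean_and(issue, docs))  # [1]    -- document 2 is missed entirely
    print(ranked(issue, docs))       # [1, 2] -- document 2 still surfaces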

This is impressive support for using an NLI for searching full-text information, and it amplifies the earlier results of Harman and Candela (1990). The weakness of this study, however, is that it did not actually consider or measure user interactions with the systems. There is little detail on how the issue statements were generated, and one can only guess how different these might have been had they been generated by users interacting with a system.

Another study presents evidence that an NLI may be superior to a formal language equivalent. Napier et al. (1989) compared the performance of novices using Lotus HAL, a restricted NLI, with Lotus 1-2-3, a menu/command interface. Different groups of participants were each given a day and a half of training on the respective spreadsheet interfaces and then solved sets of spreadsheet problems. The Lotus HAL users consistently solved more problems than did the Lotus 1-2-3 users. Napier et al. suggest that the HAL users were more successful because the language allowed reference to spreadsheet cells by column names. It should be pointed out that Lotus HAL has a very restricted syntax with English-like commands. HAL does, however, provide some flexibility, and it clearly provides more functionality than the menu/command interface of Lotus 1-2-3.
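The difference between referring to data by cell address and referring to it by column name can be sketched as follows. The worksheet layout and lookup code are hypothetical and are not HAL's implementation; the Lotus 1-2-3 formula in the comment is shown only for comparison.

    # Referring to data by column name rather than by cell address;
    # the worksheet contents are hypothetical.
    sheet = {
        "A1": "Region", "B1": "Sales",
        "A2": "East",   "B2": 120,
        "A3": "West",   "B3": 90,
    }

    def column_values(header, last_row=3):
        """Resolve a column name such as 'Sales' to the values below its header."""
        for cell, value in sheet.items():
            if value == header and cell.endswith("1"):
                col = cell[0]
                return [sheet[col + str(r)] for r in range(2, last_row + 1)]
        raise KeyError(header)

    # 1-2-3 style: the user must know the addresses, e.g. @SUM(B2..B3).
    print(sheet["B2"] + sheet["B3"])    # 210
    # HAL-like style: "total sales" can resolve through the column name.
    print(sum(column_values("Sales")))  # 210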

Walker and Whittaker (1989) conducted a field study that compared a menu-based and a natural language database query language. The results they report regarding the usefulness of and problems with a restricted NLI for database access are similar to those of the studies previously summarized (e.g. high task failure rates due to lexical and syntactic errors). An interesting aspect of their study was the finding that a set of users persisted in using the NLI despite a high frequency of errors. Although this set of users was small (9 of 50), this finding suggests the NLI provided a necessary functionality that was not available in the menu system. However, the primary function used by these users was a sort function (e.g. "... by department"). This function is typically found in most formal database languages and may reflect a limitation of the menu system rather than an inherent NLI capability. Walker and Whittaker also found a persistent use of coordination, which suggests another menu system limitation. Coordination allows the same operation on more than one entity at a time (e.g. "List sales to Apple AND Microsoft"), and is also typically found in formal database languages. Thus, it seems that this study primarily shows that the menu system better met the needs of most of the users.

Summary: With the exception of Turtle (1994) and Napier et al. (1989), there is no convincing evidence that interfaces that allow natural language have any advantage over those restricted to artificial languages. It could be effectively argued that some of these laboratory investigations put natural language at a disadvantage. In the Small and Weldon (1983) study, natural language was functionally restricted, and in the Shneiderman (1978) study users were uninformed about the application domain. These are unrealistic constraints for an actual natural language system.

When the NLI provides more functionality than the traditional interface, then, as in the case of Lotus HAL, an advantage can clearly be demonstrated. What needs to be clarified is whether the added functionality is inherently due to properties of natural language, or whether it can be engineered as part of a more traditional GUI. Walker (1989) discusses a taxonomy of communicative features that contribute to the efficiency of natural language and suggests, along with others (e.g. Cohen et al., 1989), that an NLI could be combined with a direct manipulation GUI for a more effective interface. The evidence suggests that a well-designed restricted language may be just as effective as a flexible natural language. But what is a well-designed language? The next section will look at what users expect a natural language to be.

4.0 Design Issues

Several laboratory experiments have been conducted to answer particular design questions concerning NLIs. The remainder of the chapter will address these design issues.

What is Natural?

Early research into NLIs was based on the premise that unconstrained natural language would prove to be the most habitable and easy-to-use method for people to interact with computers. Later, Chin (1984), Krause (1990), and others observed that people converse differently with computers (or when they believe that their counterpart is a computer) than they do when their counterpart is (or they believe their counterpart to be) another person. For example, Chin (1984) discovered that users communicating with each other about a topic will rely heavily on context. In contrast to this, users who believed they were talking to a computer relied less on context. This result, as well as those of Guindon (1987), suggests that context is shared poorly between users and computers. Users may frame communication based upon notions about what computers can and cannot understand. Thus, 'register' might play an important role in human-computer interaction via NLIs.

Register: Fraser (1993) reminds us that register can be minimally described as a "variety according to use." In human-human conversation, participants discover linguistic and cognitive features (context, levels of interest, attention, formality, vocabulary, and syntax) that affect communication. Combined, these features can be referred to as register. Fraser points out that people may begin communicating using one register but, as the conversation continues, the register may change. Convergence in this context refers to the phenomenon of humans adapting to, or adopting the characteristics of, each other's speech in ways that facilitate communication. Fraser suggests that for successful NLIs it is the task that should constrain the user and the language used, not the sublanguage or set of available commands. In other words, it would be inappropriate for an NLI to constrain a user to a limited vocabulary and syntax. Rather, users should be constrained in their use of language by the task and domain.

Register refers to the ways language use changes in differing communication situations, and is determined by speaker beliefs about the listener and the context of the communication. Several laboratory studies have been conducted to investigate how users would naturally communicate with a computer in an unrestricted language. For example, Malhotra (1975) and Malhotra and Sheridan (1976) reported a set of WOz studies that analyzed users' inputs to a simulated natural language system. They found that a large portion of the input could be classified into a fairly small number of simple syntactic types. In the case of a simulated database retrieval application, 78 percent of the utterances were parsed into ten sentence types, with three types accounting for 81 percent of the parsed sentences (Malhotra, 1975). It seems that users are reluctant to put demands on the system that they feel might be too taxing.

Malhotra's tasks were global, open-ended problems that would encourage a variety of expressions. In contrast, Ogden and Brooks (1983) conducted a WOz simulation study in which the tasks were more focused and controlled. They presented tables of information with missing data, and participants were to type one question to retrieve the missing data. In an unrestricted condition, Ogden and Brooks found that 89 percent of the questions could be classified into one global syntactic category (see the section 'Restrictions on Syntax' for a description of this type). Thus, participants seem to naturally use a somewhat limited subset of natural language. These results were replicated by Capindale and Crawford's (1990) study of INTELLECT. Using the same syntactic analysis, they found that 94 percent of the questions fell into the same category identified by Ogden and Brooks. Burton and Steward (1993), also using the same method, report 95 percent of the questions to be of the same type.

Ringle and Halstead-Nussloch (1989) conducted a series of experiments to explore the possibilities for reducing the complexity of natural language processing. They were interested in determining whether user input could be channeled toward a form of English that was easier to process but that retained the qualities of natural language. The study was designed to test whether feedback could be used to shape user input by introducing an alternative to an ordinary human-human conversation model that would maintain users' perception of a natural and effective question-and-answer dialogue. Nine college undergraduates who were classified as casual computer users of email and word processors were participants in this study. The task was to use an unfamiliar electronic text processor to edit and format an electronic text file to produce a printed document that was identical to a hard-copy version they had been given. The computer terminal was split into two areas: one for editing the electronic document, and a second dialog window for communicating with a human tutor. Human tutors used two modes of interaction: a natural mode, in which tutors could answer questions in any appropriate manner, and a formal mode, in which tutors were instructed to simulate the logical formalism of an augmented transition network (ATN). For example, in simulating the ATN, response times were to be longer when input was ill formed or contained multiple questions. Each participant had two sessions, one in formal mode and one in natural mode. Four participants began with formal mode sessions, and five with natural. The study's two measurement factors were tractability (how easily the simulated ATN could correctly parse, extract relevant semantic information, identify the correct query category, and provide a useful reply) and naturalness as perceived by the user. Tractability was measured by analyzing and comparing the transcripts of natural versus formal user-tutor dialogs for fragmentation, parsing complexity, and query category. Perceived naturalness was determined by looking at users' subjective assessments of the usability and flexibility of the natural versus formal help modes.
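The flavor of the formalism the tutors were simulating can be conveyed by a small transition network that accepts well-formed queries and rejects fragments. The states, word classes, and vocabulary below are invented for illustration and are not the grammar used in the study.

    # A minimal transition-network acceptor; hypothetical grammar and lexicon.
    WORD_CLASS = {
        "how": "WH", "what": "WH", "where": "WH",
        "do": "AUX", "does": "AUX", "can": "AUX",
        "i": "PRON", "you": "PRON",
        "delete": "VERB", "print": "VERB", "save": "VERB",
        "a": "DET", "the": "DET",
        "line": "NOUN", "file": "NOUN", "paragraph": "NOUN",
    }

    # state -> {arc label: next state}; "END" marks an accepting state.
    NETWORK = {
        "START": {"WH": "S1", "AUX": "S2"},
        "S1":    {"AUX": "S2"},
        "S2":    {"PRON": "S3"},
        "S3":    {"VERB": "S4"},
        "S4":    {"DET": "S5", "NOUN": "END"},
        "S5":    {"NOUN": "END"},
    }

    def accepts(utterance):
        state = "START"
        for word in utterance.lower().strip("?").split():
            label = WORD_CLASS.get(word)
            state = NETWORK.get(state, {}).get(label)
            if state is None:
                return False  # ill formed: no arc to follow
        return state == "END"

    print(accepts("How do I delete a line?"))  # True  -- tractable input
    print(accepts("delete line broken help"))  # False -- fragment, rejected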

Of the 480 user-tutor exchanges in the experiment, 49 percent were in the natural mode and 51 percent in the formal. Dialogue analysis for fragmented sentences found a rate of 21 percent in natural mode versus 8 percent in formal mode. This suggests that feedback in the formal mode may have motivated queries that were syntactically well formed. A five-point scale was used for evaluating utterance