Automation Foreign Literature Translation --- Improved Speech Recognition Method for an Intelligent Robot

Measurement and Control Technology and Instruments

2 Overview of speech recognition

3 Theory and method

3.1 Linear Predictive Cepstral Coefficients

LPC can be used to estimate the cepstrum of the speech signal. This is a special processing method in the short-time cepstral analysis of speech. The system function of the channel model can be obtained by linear prediction analysis as follows:

The impulse response is denoted h(n); suppose the cepstrum of h(n) is ĥ(n). Then Eq. (1) can be expanded into Eq. (2):

The cepstrum coefficients calculated in (5) are called LPCC, where n denotes the LPCC order.

3.2 Speech fractal dimension computation

3.3 Improved feature extraction method

★ A visual signal lets learners compare their intonation with that produced by the model speaker.

★ The accuracy of a learner's pronunciation is usually rated on a 7-point scale (the higher the better).

★ Words whose pronunciation is distorted are identified and clearly marked.

Improved speech recognition method for intelligent robot

2 Overview of speech recognition

Speech recognition has received more and more attention recently due to its important theoretical meaning and practical value [5]. Up to now, most speech recognition has been based on conventional linear system theory, such as the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW). With deeper study of speech recognition, it has been found that the speech signal is a complex nonlinear process. If the study of speech recognition is to break through, nonlinear-system theory must be introduced to it. Recently, with the development of nonlinear-system theories such as artificial neural networks (ANN), chaos and fractals, it has become possible to apply these theories to speech recognition. Therefore, the study in this paper is based on ANN, and chaos and fractal theories are introduced to process speech recognition.

Speech recognition is divided into two types: speaker dependent and speaker independent. Speaker dependent means the pronunciation model is trained by a single person; the identification rate for the training person's orders is high, while other people's orders are recognized at a low rate or cannot be recognized at all. Speaker independent means the pronunciation model is trained by persons of different ages, sexes and regions, so it can identify the orders of a group of persons. Generally, the speaker independent system is more widely used, since the user is not required to conduct the training. So extraction of speaker independent features from the speech signal is the fundamental problem of a speaker recognition system.

Speech recognition can be viewed as a pattern recognition task, which includes training and recognition. Generally, the speech signal can be viewed as a time sequence and characterized by the powerful hidden Markov model (HMM). Through feature extraction, the speech signal is transformed into feature vectors that act as observations. In the training procedure, these observations are fed in to estimate the model parameters of the HMM. These parameters include the probability density functions of the observations and their corresponding states, the transition probabilities between the states, etc. After parameter estimation, the trained models can be used for the recognition task. The input observations are recognized as the resulting words and the accuracy can be evaluated. The whole process is illustrated in Fig. 1.
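The recognition step of this pipeline scores each trained word model by the likelihood it assigns to the observation sequence, and the highest-scoring model wins. A minimal sketch of that scoring using the forward algorithm in log space; the toy 2-state model and all its probability values are invented for illustration, not taken from the paper:

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B):
    """Forward algorithm: log P(observation sequence | HMM).

    log_pi : (S,)    log initial state probabilities
    log_A  : (S, S)  log transition probabilities, A[i, j] = P(state j | state i)
    log_B  : (T, S)  log emission probability of each observation frame per state
    """
    alpha = log_pi + log_B[0]                      # initialise with the first frame
    for t in range(1, len(log_B)):
        # log-sum-exp over predecessor states, then add the emission term
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)              # sum over final states

# Toy 2-state word model scored against 3 observation frames
log_pi = np.log([0.6, 0.4])
log_A  = np.log([[0.7, 0.3],
                 [0.4, 0.6]])
log_B  = np.log([[0.9, 0.2],
                 [0.8, 0.3],
                 [0.1, 0.7]])
score = forward_log_likelihood(log_pi, log_A, log_B)
```

In a full recognizer this score would be computed for every word model and the input labeled with the best-scoring word.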

Fig. 1 Block diagram of speech recognition system

3 Theory and method

Extraction of speaker independent features from the speech signal is the fundamental problem of a speaker recognition system. The standard methodology for solving this problem uses Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). Both methods are linear procedures based on the assumption that speaker features have properties caused by the vocal tract resonances. These features form the basic spectral structure of the speech signal. However, the non-linear information in speech signals is not easily extracted by present feature extraction methodologies, so we use fractal dimension to measure non-linear speech turbulence.

This paper investigates and implements speaker identification system using both traditional LPCC and non-linear multiscaled fractal dimension feature extraction.

3.1 Linear Predictive Cepstral Coefficients

Linear prediction coefficients (LPC) are a parameter set obtained by linear prediction analysis of speech, describing correlation characteristics between adjacent speech samples. Linear prediction analysis is based on the following basic concept: a speech sample can be approximated by a linear combination of some past speech samples. By minimizing the sum of squared differences between the real speech samples in a short-time analysis frame and the predicted samples, a unique group of prediction coefficients can be determined.

LPC coefficients can be used to estimate the speech signal cepstrum. This is a special processing method in short-time cepstral analysis of the speech signal. The system function of the channel model is obtained by linear prediction analysis as follows.

where p represents the linear prediction order, a_k (k = 1, 2, …, p) represent the prediction coefficients, and the impulse response is represented by h(n). Suppose the cepstrum of h(n) is represented by ĥ(n); then (1) can be expanded as (2).

The cepstrum coefficients calculated in the way of (5) are called LPCC, where n represents the LPCC order.
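The equations numbered (1)-(5) did not survive in this copy. For reference, a hedged sketch of the standard LPC-to-cepstrum derivation, using the symbols defined above (p the prediction order, a_k the prediction coefficients, ĥ(n) the cepstrum of the impulse response h(n)); the exact form in the original paper may differ:

```latex
% Channel model system function from linear prediction (cf. Eq. (1))
H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}

% Standard recursion from prediction coefficients to cepstrum (cf. Eq. (5))
\hat{h}(1) = a_1, \qquad
\hat{h}(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\,\hat{h}(k)\,a_{n-k}, \quad 1 < n \le p,
\qquad
\hat{h}(n) = \sum_{k=n-p}^{n-1} \frac{k}{n}\,\hat{h}(k)\,a_{n-k}, \quad n > p.
```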

Before extracting the LPCC parameters, we must carry out pre-emphasis, framing, windowing and endpoint detection on the speech signal. The endpoint detection of the Chinese command word "Forward" is shown in Fig. 2; the speech waveform of the Chinese command word "Forward" and the LPCC parameter waveform after endpoint detection are shown in Fig. 3.
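The pre-processing chain named here can be sketched as follows. This is a minimal NumPy sketch: the frame length, hop size, pre-emphasis coefficient and energy threshold are illustrative choices, not values from the paper, and the endpoint detector is a deliberately crude energy gate:

```python
import numpy as np

def preprocess(signal, frame_len=256, hop=128, alpha=0.95):
    """Pre-emphasis -> framing -> Hamming windowing (front end before LPCC)."""
    # Pre-emphasis: first-order high-pass to flatten the spectral tilt
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Hamming window on each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)

def detect_endpoints(frames, threshold_ratio=0.1):
    """Crude energy-based endpoint detection: keep frames above a threshold."""
    energy = (frames ** 2).sum(axis=1)
    active = energy > threshold_ratio * energy.max()
    idx = np.flatnonzero(active)
    return idx[0], idx[-1]          # first and last active frame

# Toy usage: silence + 440 Hz tone + silence at an 8 kHz sample rate
t = np.arange(4096) / 8000.0
sig = np.concatenate([np.zeros(1024), np.sin(2 * np.pi * 440 * t[:2048]),
                      np.zeros(1024)])
frames = preprocess(sig)
start, end = detect_endpoints(frames)
```

Only the frames between `start` and `end` would then be passed on to LPCC (and fractal dimension) extraction.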

3.2 Speech Fractal Dimension Computation

Fractal dimension is a quantitative value derived from the scaling relation in the fractal sense, and a measure of the self-similarity of a structure; the measure of a fractal is its fractal dimension [6-7]. From the viewpoint of measure, fractal dimension extends dimension from integers to fractions, breaking the limit that the dimension of a general topological set must be an integer. Fractal dimension, mostly fractional, is thus an extension of dimension in Euclidean geometry.

There are many definitions of fractal dimension, e.g., the similarity dimension, Hausdorff dimension, information dimension, correlation dimension, capacity dimension, box-counting dimension, etc. Among these, the Hausdorff dimension is the oldest and also the most important; for any set, it is defined as [3]:

where M_ε(F) denotes how many units of size ε are needed to cover the subset F.

In this paper, the box-counting dimension (D_B) of F is obtained by partitioning the plane with square grids of side ε and counting the number of squares that intersect the set, N(ε); it is defined as [8].
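The box-counting dimension is conventionally D_B = lim_{ε→0} log N(ε) / log(1/ε), estimated in practice as the slope of a log-log regression over several grid sizes. A sketch under that assumption; the grid sizes and the straight-line test case are illustrative, and the paper's own estimator may differ in detail:

```python
import numpy as np

def box_counting_dimension(points, sizes=(1/4, 1/8, 1/16, 1/32, 1/64)):
    """Estimate the box-counting dimension D_B of a point set in [0, 1]^2.

    N(eps) = number of eps-sided grid squares intersecting the set;
    D_B is the slope of log N(eps) against log(1/eps).
    """
    logs_n, logs_inv = [], []
    for eps in sizes:
        # Map each point to the index of the grid square containing it
        cells = set(map(tuple, np.floor(points / eps).astype(int)))
        logs_n.append(np.log(len(cells)))
        logs_inv.append(np.log(1.0 / eps))
    slope, _ = np.polyfit(logs_inv, logs_n, 1)   # least-squares slope
    return slope

# Sanity check: a straight line should have dimension close to 1
line = np.column_stack([np.linspace(0, 1, 5000), np.linspace(0, 1, 5000)])
d_line = box_counting_dimension(line)
```

For speech, the same idea is applied per frame to the sampled waveform (treated as a curve), giving one fractal dimension value per frame.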

The speech waveform of the Chinese command word "Forward" and the fractal dimension waveform after endpoint detection are shown in Fig. 4.

3.3 Improved feature extraction method

Considering the respective advantages of LPCC and fractal dimension in expressing the speech signal, we mix both into the feature signal: the fractal dimension captures the self-similarity, periodicity and randomness of the speech waveform, while the LPCC feature preserves speech quality and gives a high identification rate.

Due to ANN's obvious advantages of nonlinearity, self-adaptability, robustness and self-learning, its good classification and input-output mapping abilities are well suited to the speech recognition problem.

Because the number of ANN input nodes is fixed, time regularization is carried out on the feature parameters before they are input to the neural network [9]. In our experiments, the LPCC and the fractal dimension of each sample are put through time regularization separately: the LPCC is regularized to 4 frames of data (LPCC1, LPCC2, LPCC3, LPCC4, each frame parameter being 14-D) and the fractal dimension to 12 frames of data (FD1, FD2, …, FD12, each frame parameter being 1-D), so that the feature vector of each sample has 4×14 + 1×12 = 68 dimensions, ordered so that the first 56 dimensions are LPCC and the remaining 12 are fractal dimensions. Such a mixed feature parameter thus captures both the linear and the nonlinear characteristics of speech.
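The regularize-and-concatenate step can be sketched as follows. Linear interpolation along the time axis is assumed for the regularization, which the paper does not specify; the frame counts and dimensions (4×14-D LPCC plus 12×1-D fractal dimension = 68-D) follow the text:

```python
import numpy as np

def time_regularize(frames, target_frames):
    """Resample a (n_frames, dim) parameter sequence to a fixed number of
    frames by linear interpolation along the time axis."""
    n, dim = frames.shape
    src = np.linspace(0.0, n - 1.0, target_frames)   # where to sample
    out = np.empty((target_frames, dim))
    for d in range(dim):
        out[:, d] = np.interp(src, np.arange(n), frames[:, d])
    return out

def build_feature_vector(lpcc, fd):
    """4 frames x 14-D LPCC followed by 12 frames x 1-D fractal dimension
    -> one 68-D vector (first 56 dims LPCC, last 12 dims FD)."""
    lpcc4 = time_regularize(lpcc, 4)                 # (4, 14)
    fd12 = time_regularize(fd.reshape(-1, 1), 12)    # (12, 1)
    return np.concatenate([lpcc4.ravel(), fd12.ravel()])

# Toy sample: 37 frames of 14-D LPCC and 37 per-frame fractal dimensions
lpcc = np.random.rand(37, 14)
fd = np.random.rand(37)
vec = build_feature_vector(lpcc, fd)                 # 68-D mixed feature
```

The resulting 68-D vector is what would be presented to the ANN's fixed input layer.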

Architectures and Features of ASR

ASR is a cutting-edge technology that allows a computer or even a hand-held PDA (Myers, 2000) to identify words that are read aloud or spoken into any sound-recording device. The ultimate purpose of ASR technology is to allow 100% accuracy with all words that are intelligibly spoken by any person, regardless of vocabulary size, background noise, or speaker variables (CSLU, 2002). However, most ASR engineers admit that the current accuracy level for a large vocabulary unit of speech (e.g., the sentence) remains less than 90%. Dragon's Naturally Speaking or IBM's ViaVoice, for example, show a baseline recognition accuracy of only 60% to 80%, depending upon accent, background noise, type of utterance, etc. (Ehsani & Knodt, 1998). More expensive systems that are reported to outperform these two are Subarashii (Bernstein, et al., 1999), EduSpeak (Franco, et al., 2001), Phonepass (Hinks, 2001), ISLE Project (Menzel, et al., 2001) and RAD (CSLU, 2003). ASR accuracy is expected to improve.

Among several types of speech recognizers used in ASR products, both implemented and proposed, the Hidden Markov Model (HMM) is one of the most dominant algorithms and has proven to be an effective method of dealing with large units of speech (Ehsani & Knodt, 1998). Detailed descriptions of how the HMM model works go beyond the scope of this paper and can be found in any text concerned with language processing; among the best are Jurafsky & Martin (2000) and Hosom, Cole, and Fanty (2003). Put simply, HMM computes the probable match between the input it receives and phonemes contained in a database of hundreds of native speaker recordings (Hinks, 2003, p. 5). That is, a speech recognizer based on HMM computes how close the phonemes of a spoken input are to a corresponding model, based on probability theory. High likelihood represents good pronunciation; low likelihood represents poor pronunciation (Larocca, et al., 1991).

While ASR has been commonly used for such purposes as business dictation and special needs accessibility, its market presence for language learning has increased dramatically in recent years (Aist, 1999; Eskenazi, 1999; Hinks, 2003). Early ASR-based software programs adopted template-based recognition systems which perform pattern matching using dynamic programming or other time normalization techniques (Dalby & Kewley-Port, 1999). These programs include Talk to Me (Auralog, 1995), the Tell Me More Series (Auralog, 2000), Triple-Play Plus (Mackey & Choi, 1998), New Dynamic English (DynEd, 1997), English Discoveries (Edusoft, 1998), and See it, Hear It, SAY IT! (CPI, 1997). Most of these programs do not provide any feedback on pronunciation accuracy beyond simply indicating which written dialogue choice the user has made, based on the closest pattern match; learners are not told the accuracy of their pronunciation. In particular, Neri et al. (2002) criticize the graphical wave forms presented in products such as Talk to Me and Tell Me More because