Why MFCC Is Used in Speech Recognition

The prediction error is given in [16, 25]; the standard form is reconstructed below. Each frame of the windowed signal is then autocorrelated, and the highest autocorrelation value determines the order of the linear prediction analysis. A summary of the procedure for obtaining the LPC coefficients is shown in Figure 2 (block diagram of the LPC processor), and the coefficients can be derived as in [7]. Linear predictive analysis efficiently extracts the vocal tract information from a given speech signal [16].
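
The equations themselves were lost in extraction; in the standard formulation of pth-order linear prediction, the predicted sample and the prediction error take the form:

```latex
% Standard pth-order linear predictor and its prediction error
% (reconstruction of the dropped equations)
\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k)
e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k \, s(n-k)
```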

It is known for its computational speed and accuracy [18]. LPC represents source behaviors that are steady and consistent extremely well [23]. Furthermore, it is also used in speaker recognition systems, where the main purpose is to extract the vocal tract properties [25].

It gives very accurate estimates of speech parameters and is computationally efficient [14, 26]. On the other hand, traditional linear prediction suffers from aliased autocorrelation coefficients [29], LPC estimates are highly sensitive to quantization noise [30], and the method might not be well suited for generalization [23].
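
To make the computation concrete, here is a minimal Python sketch of the autocorrelation method (the function name and test frame are illustrative assumptions, not from the source): it autocorrelates a windowed frame and solves the normal equations with the Levinson-Durbin recursion.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Estimate LPC coefficients a_1..a_p of s(n) ~ sum_k a_k s(n-k)
    by the autocorrelation method; `frame` is assumed pre-windowed."""
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]                       # prediction error energy E_0
    for i in range(order):
        # Reflection coefficient for model order i+1
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        if i > 0:
            a[:i] = a[:i] - k * a[i - 1::-1]  # Levinson order update
        a[i] = k
        err *= 1.0 - k * k           # E_{i+1} = E_i (1 - k_i^2)
    return a, err

# Hypothetical usage on a synthetic 30 ms frame at 8 kHz
fs, N = 8000, 240
n = np.arange(N)
frame = np.hamming(N) * np.sin(2 * np.pi * 440 * n / fs)
a, err = lpc_autocorr(frame, order=10)
```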

Cepstral analysis is commonly applied in speech processing because of its ability to compactly represent speech waveforms and characteristics with a small feature set [31]. Rosenberg and Sambur observed that adjacent predictor coefficients are highly correlated, so representations with less correlated features are more efficient; the linear prediction cepstral coefficients (LPCC) are a typical example.

In speech processing, LPCC, analogous to LPC, are computed from the sample points of a speech waveform, where the horizontal axis is time and the vertical axis is amplitude [31]. Figure 3 (block diagram of the LPCC processor) pictorially explains the process of obtaining LPCC, and the coefficients can be calculated by the recursion of [7, 15, 33], reconstructed below. LPCC have low vulnerability to noise [30]. Higher-order cepstral coefficients, however, are mathematically limited, resulting in an extremely wide range of variances when moving from the lower-order to the higher-order cepstral coefficients [34].
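
The dropped formula is presumably the standard LPC-to-cepstrum recursion, which converts the predictor coefficients a_m into cepstral coefficients c_m:

```latex
% Standard LPC-to-cepstrum recursion (reconstruction of the dropped equation)
c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m} \, c_k \, a_{m-k}, \qquad 1 \le m \le p
c_m = \sum_{k=m-p}^{m-1} \frac{k}{m} \, c_k \, a_{m-k}, \qquad m > p
```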

Similarly, LPCC estimates are notorious for their high sensitivity to quantization noise [35]. Cepstral analysis of high-pitched speech signals gives poor source-filter separability in the quefrency domain [29]. Lower-order cepstral coefficients are sensitive to the spectral slope, while higher-order cepstral coefficients are sensitive to noise [15].

Line spectral frequencies (LSF) describe the two resonance conditions arising in the interconnected-tube model of the human vocal tract.

The model accounts for the nasal cavity and the mouth shape, which provides the basis for the fundamental physiological relevance of the linear prediction representation. The two resonance conditions describe the vocal tract as either completely open or completely closed at the glottis [36]. The two conditions give rise to two groups of resonant frequencies, with the number of resonances in each group determined by the number of connected tubes.

The resonances of the two conditions are the odd and even line spectra, respectively, and are interleaved into a single, monotonically increasing set of LSF [36].

The LSF representation was proposed by Itakura [37, 38] as an alternative to the direct linear prediction parameterization. In speech coding, this representation has been found to have better quantization properties than other linear prediction parameterizations such as the log-area ratios (LAR) and reflection coefficients (RC).

Apart from quantization, the LSF representation of the predictor is also well suited for interpolation. Theoretically, this is motivated by the fact that the sensitivity matrix linking the LSF-domain squared quantization error to the perceptually relevant log spectrum is diagonal [41, 42].

LP is based on the premise that a speech signal can be described by the predictor equation given earlier. The a_k coefficients are determined so as to minimize the prediction error, using either the autocorrelation or the covariance method.

As such, a short segment of the speech signal is modeled as the output of the all-pole filter H(z) = 1/A(z). To compute the LSF coefficients, the inverse polynomial filter A(z) is split into two polynomials, P(z) and Q(z) [36, 38, 40, 41], reconstructed below.
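
The dropped equations follow the standard LSF construction:

```latex
% Standard LSF decomposition (reconstruction of the dropped equations)
A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}
P(z) = A(z) + z^{-(p+1)} A(z^{-1}), \qquad
Q(z) = A(z) - z^{-(p+1)} A(z^{-1})
% A(z) is recovered as [P(z) + Q(z)]/2; the LSF are the angles of the
% roots of P(z) and Q(z), which lie interleaved on the unit circle.
```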

The block diagram of the LSF processor is shown in Figure 4. The most prominent application of LSF is speech compression, with extensions into speaker recognition and speech recognition. The technique has also found limited use in other fields: LSF have been investigated for musical instrument recognition and coding, and have been applied to animal noise identification and to financial market analysis.

The advantages of LSF include their ability to localize spectral sensitivities, the fact that they characterize bandwidths and resonance locations, and their emphasis on the important property of spectral peak location. In most instances, the LSF representation provides a near-minimal data set for subsequent classification [36].

Since LSF represent spectral shape information at a lower data rate than raw input samples, careful use of processing and analysis methods in the LSP domain can reduce complexity compared with alternative techniques that operate on the raw input data itself.

LSF play an important role in the transmission of vocal tract information from the speech coder to the decoder, and their widespread use is a result of their excellent quantization properties. LSP parameters can be generated by several methods of varying complexity.

The major problem revolves around finding the roots of the P(z) and Q(z) polynomials defined above. These can be obtained through standard root-finding methods or through more specialized ones, and the computation is often performed in the cosine domain [36].
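
As a concrete illustration of root finding on P(z) and Q(z), here is a minimal Python sketch (the function name, sign convention, and test predictor are assumptions, not the source's method); it uses numpy's generic root finder rather than the cosine-domain techniques mentioned above:

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert predictor coefficients a_1..a_p of A(z) = 1 - sum_k a_k z^-k
    into line spectral frequencies (angles in radians, ascending)."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    # P(z) = A(z) + z^-(p+1) A(1/z),  Q(z) = A(z) - z^-(p+1) A(1/z)
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    lsf = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # keep one of each conjugate pair; drop trivial roots at z = +/-1
        lsf.extend(w for w in ang if 1e-9 < w < np.pi - 1e-9)
    return np.sort(np.array(lsf))

# Hypothetical stable 4th-order predictor built from two damped resonances
r, th1, th2 = 0.9, 0.3 * np.pi, 0.6 * np.pi
A_poly = np.polymul([1, -2 * r * np.cos(th1), r ** 2],
                    [1, -2 * r * np.cos(th2), r ** 2])
a = -A_poly[1:]            # a_1..a_4, since A(z) = 1 - sum a_k z^-k
print(lpc_to_lsf(a))       # four interleaved frequencies in (0, pi)
```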

Wavelet transform (WT) theory is centered on signal analysis using varying scales in the time and frequency domains [45]. With the support of the theoretical physicist Alex Grossmann, Jean Morlet introduced the wavelet transform, which permits the identification of high-frequency events with enhanced temporal resolution [45, 46, 47]. A wavelet is a waveform of effectively limited duration with an average value of zero.

Many wavelets also display orthogonality, an ideal property for compact signal representation [46]. The WT is a signal processing technique that can represent real-life non-stationary signals with high efficiency [33, 46]. It has the ability to extract information from transient signals concurrently in both the time and frequency domains [33, 45, 48].

The continuous wavelet transform (CWT) is used to decompose a continuous-time function into wavelets. However, the resulting information is highly redundant, and an enormous computational effort is required to calculate all possible scales and translations, which restricts the CWT's use [45].

The discrete wavelet transform (DWT) is an extension of the WT that adds flexibility to the decomposition process [48].

It was introduced as a highly flexible and efficient method for sub-band decomposition of signals [46, 49]. In earlier applications, linear discretization was used to discretize the CWT. Daubechies and others have developed an orthogonal DWT specially designed for analyzing a finite set of observations over a dyadic set of scales (dyadic discretization) [47]. The wavelet transform decomposes a signal into a group of basis functions called wavelets.

Wavelets are obtained from a single prototype wavelet, called the mother wavelet, by dilation and shifting. The main characteristic of the WT is that it uses a variable window to scan the frequency spectrum, increasing the temporal resolution of the analysis [45, 46, 50]. The WT decomposes signals over translated and dilated mother wavelets.

The mother wavelet is a time function with finite energy and fast decay, and the different versions of the single wavelet are orthogonal to each other. The continuous wavelet transform is given in [33, 45, 50] (reconstructed below).
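
The dropped definition is, in its standard form:

```latex
% Standard CWT (reconstruction of the dropped equation); a is the
% dilation (scale), b the translation (shift), \psi the mother wavelet
\mathrm{CWT}(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty}
x(t) \, \psi^{*}\!\left(\frac{t-b}{a}\right) dt
```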

The WT coefficient at a particular dilation and shift represents how well the original signal corresponds to the translated and dilated mother wavelet. Thus, the set of coefficients CWT(a, b) associated with a particular signal is the wavelet representation of the original signal with respect to the mother wavelet [45].

Since the CWT is highly redundant, the signal can instead be analyzed using a small number of scales with a varying number of translations at each scale, i.e., with discretized scale and translation parameters; this leads to the DWT.

DWT theory requires two sets of related functions, the scaling function and the wavelet function, given in [33] and reconstructed below. There are several ways to discretize a CWT, and the DWT of a continuous signal can likewise be written as in [45]. In the filter-bank view, the input signal is passed through a low-pass filter and a high-pass filter to obtain the approximation components and the detail components, respectively.
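
The dropped equations are, in their standard form, the two-scale relations and the dyadic DWT:

```latex
% Two-scale relations for the scaling function \phi and wavelet \psi
% (reconstruction of the dropped equations); h and g are the low-pass
% and high-pass filter coefficients
\phi(t) = \sum_{n} h(n) \, \sqrt{2} \, \phi(2t-n), \qquad
\psi(t) = \sum_{n} g(n) \, \sqrt{2} \, \phi(2t-n)
% Dyadic DWT coefficients of a continuous signal x(t)
d_{j,k} = \int_{-\infty}^{\infty} x(t) \, 2^{j/2} \, \psi\!\left(2^{j} t - k\right) dt
```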

This is summarized in Figure 5. The approximation signal at each stage is further decomposed using the same low-pass and high-pass filters to obtain the approximation and detail components of the next stage. This type of decomposition is called dyadic decomposition [33].
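
To make the dyadic cascade concrete, here is a minimal sketch using the third-party PyWavelets package (the package choice and the toy signal are assumptions, not from the source); each level splits the current approximation into a new approximation and detail band, exactly the Figure 5 cascade:

```python
import numpy as np
import pywt  # PyWavelets: pip install PyWavelets

# Toy "speech-like" test signal: 1 s at 16 kHz with a low and a high tone
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)

# Three-level dyadic decomposition with a Daubechies wavelet
coeffs = pywt.wavedec(x, 'db4', level=3)
cA3, cD3, cD2, cD1 = coeffs  # final approximation + details, coarse to fine
for name, c in zip(['cA3', 'cD3', 'cD2', 'cD1'], coeffs):
    print(name, len(c))  # lengths roughly halve at each level
```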

The DWT parameters contain information from different frequency scales, which enhances the speech information obtained in the corresponding frequency band [33]. The ability of the DWT to partition the variance of the input elements on a scale-by-scale basis is an added advantage.

This partitioning leads to the notion of the scale-dependent wavelet variance, which in many ways is analogous to the more familiar frequency-dependent Fourier power spectrum [47]. Classic discrete decomposition schemes, which are dyadic, do not fulfill all the requirements for direct use in parameterization, as the dyadic DWT alone does not provide an adequate number of frequency bands for effective speech analysis [51]. Moreover, since the input signals are of finite length, the wavelet coefficients exhibit undesirably large variations at the boundaries because of the discontinuities there [50].

The perceptual linear prediction (PLP) technique combines critical-band analysis, intensity-to-loudness compression, and equal-loudness pre-emphasis in extracting relevant information from speech. It is rooted in the nonlinear Bark scale and was initially intended for use in speech recognition tasks through the elimination of speaker-dependent features [11].

PLP gives a representation corresponding to a smoothed short-term spectrum that has been equalized and compressed in a manner similar to human hearing, which makes it comparable to MFCC. In the PLP approach, several prominent characteristics of hearing are replicated, and the resulting auditory-like spectrum of speech is approximated by an autoregressive all-pole model [52]. PLP gives reduced resolution at high frequencies, reflecting its auditory filter-bank basis, yet yields orthogonal outputs similar to those of cepstral analysis.

It uses linear prediction for spectral smoothing; hence the name perceptual linear prediction [28]. PLP combines spectral analysis with linear prediction analysis: spectral analysis of the windowed speech first gives the power spectral estimates. A trapezoidal filter is then applied at 1-Bark intervals to integrate the overlapping critical-band filter responses in the power spectrum.

This effectively compresses the higher frequencies into a narrow band. The symmetric frequency-domain convolution on the Bark-warped frequency scale then permits low frequencies to mask high frequencies, simultaneously smoothing the spectrum. The spectrum is subsequently pre-emphasized to approximate the unequal sensitivity of human hearing at different frequencies.

The spectral amplitude is compressed, which reduces the amplitude variation of the spectral resonances. Spectral smoothing is performed by solving the autoregressive equations, and the autoregressive coefficients are converted to cepstral variables [28]. The equation for computing the Bark-scale frequency is reconstructed below.
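
The dropped formula is commonly given, as in Hermansky's PLP formulation, by:

```latex
% Bark-scale warping used in PLP (reconstruction of the dropped
% equation); f is the frequency in Hz
\Omega(f) = 6 \ln\!\left(\frac{f}{600} +
\sqrt{\left(\frac{f}{600}\right)^{2} + 1}\right)
```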

The identification achieved by PLP is better than that of LPC [28]: PLP improves on conventional LPC by effectively suppressing speaker-dependent information [52]. It also offers enhanced speaker-independent recognition performance and is robust to noise and to variations in the channel and microphone [53].

PLP reconstructs the autoregressive noise component accurately [54]. A PLP-based front end, however, is sensitive to any change in formant frequency. PLP has low sensitivity to spectral tilt, consistent with findings that phonetic judgments are relatively insensitive to spectral tilt. At the same time, PLP analysis depends on the overall spectral balance of the formant amplitudes, which are easily affected by factors such as the recording equipment, the communication channel, and additive noise [52].

Furthermore, the time-frequency resolution and efficient sampling of the short-term representation are addressed in an ad hoc way [54]. The block diagram of the PLP processor summarizes the steps above.
