Preprocessing

The preprocessing stage of a speech recognition system increases the efficiency of the subsequent feature extraction and classification stages and thereby improves overall recognition performance. Preprocessing commonly comprises a sampling step, a windowing step and a denoising step. At the end of the preprocessing, the compressed and filtered speech frames are forwarded to the feature extraction stage. The general preprocessing pipeline is depicted in the following figure.

With the increasing use of mobile devices, speech recognition systems need to be robust with respect to their acoustic environment. Together with the feature extraction stage, the preprocessing aims to generate a parametric representation of the speech signal that is as compact as possible while still retaining all the information necessary for automatic speech recognition.

1. Sampling

Before a computer can process a speech signal, the signal has to be digitized: the time-continuous speech signal is sampled and quantized, yielding a time- and value-discrete signal. According to the Nyquist-Shannon sampling theorem [1], a time-continuous signal $x(t)$ that is bandlimited to a finite frequency $f_{max}$ must be sampled with a sampling frequency of at least $2f_{max}$; the time-discrete signal $x[n]$ then allows perfect reconstruction of $x(t)$. Studies by Sanderson et al. [2] have shown that the sampling frequency, in combination with the feature vector size, has a direct effect on recognition accuracy.

Since human speech has a relatively low bandwidth (mostly between 100 Hz and 8 kHz - see the chapter on speech for detailed information), a sampling frequency of 16 kHz is sufficient for speech recognition tasks.
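The effect of violating the sampling theorem can be illustrated numerically. The following sketch (the tone frequency and sampling rates are chosen for illustration) samples a 5 kHz tone - well inside the speech band - once at 16 kHz and once at 8 kHz; in the second case the condition $f_s \geq 2f_{max}$ is violated and the tone aliases to 3 kHz:

```python
import numpy as np

def dominant_frequency(signal, fs):
    """Return the frequency (in Hz) of the largest magnitude-spectrum bin."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

f_tone = 5000.0  # a 5 kHz tone, within the speech band

# Sampled at 16 kHz (above 2 * 5 kHz): the tone is recovered correctly.
fs_ok = 16000
t = np.arange(fs_ok) / fs_ok               # one second of samples
x_ok = np.sin(2 * np.pi * f_tone * t)

# Sampled at 8 kHz (below 2 * 5 kHz): the tone aliases to 8000 - 5000 = 3000 Hz.
fs_low = 8000
t = np.arange(fs_low) / fs_low
x_alias = np.sin(2 * np.pi * f_tone * t)

print(dominant_frequency(x_ok, fs_ok))      # ~5000 Hz
print(dominant_frequency(x_alias, fs_low))  # ~3000 Hz
```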

To obtain a value-discrete signal, the sampled values are quantized, which leads to a significant reduction of data. Speech recognition systems usually encode the samples with 8 or 16 bits per sample, depending on the available processing power: 8 bits per sample yield $2^8 = 256$ quantization levels, while 16 bits per sample provide $2^{16} = 65536$ levels. If enough processing power is available, the higher bit resolution is preferable.
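The trade-off can be made concrete with a minimal uniform-quantizer sketch (the mid-rise quantizer and the uniform test signal are illustrative assumptions, not a prescribed implementation); each extra bit roughly adds 6 dB of signal-to-quantization-noise ratio:

```python
import numpy as np

def quantize(x, bits):
    """Uniformly quantize samples in [-1, 1) to 2**bits levels (mid-rise)."""
    levels = 2 ** bits
    step = 2.0 / levels                        # quantization step size
    indices = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (indices + 0.5) * step              # reconstruct at cell midpoints

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10000)                  # stand-in for a speech signal

results = {}
for bits in (8, 16):
    err = x - quantize(x, bits)                # quantization noise
    snr_db = 10 * np.log10(np.mean(x**2) / np.mean(err**2))
    results[bits] = snr_db
    print(f"{bits} bit: {2**bits} levels, SNR = {snr_db:.1f} dB")
```

With this full-scale uniform input, 8 bits give roughly 48 dB and 16 bits roughly 96 dB, matching the ~6 dB-per-bit rule of thumb.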

2. Windowing and frame formation

Speech is a non-stationary, time-variant signal. A short but precise explanation of stationary and non-stationary signals, as well as of the non-stationary nature of speech, is given here. In brief, a signal is considered stationary if its frequency or spectral components do not change over time. We assume that human speech is built from a dictionary of phonemes, and that for most phonemes the properties of speech remain invariant over a short period of time (~5-100 ms) [ref.]. We therefore assume (and hope) that the signal behaves stationarily within such time frames. To obtain frames, we multiply the speech signal with a windowing function, which weights the signal in the time domain and divides it into a sequence of partial signals. In doing so we retain the time information of every partial signal, keeping in mind that an important step of the preprocessing and feature extraction is a spectral analysis of each frame.

Windowing can be expressed as

$s_K(n,q) = s(n)\,w(q-n)$

where

• $s(n)$ denotes the sampled speech signal,
• $w(n)$ is the windowing function,
• $Q$ is the frame length,
• $K$ is the window length,
• $q$ is the sample point at which the window is applied,
• and $s_K(n,q)$ is the resulting short-time signal.

As can be seen in the figure above, the windows may overlap. The frame length and window length depend on the application and the algorithms used. In speech processing, the frame length $Q$ typically varies between 5 and 25 ms and the window length $K$ between 20 and 25 ms [ref.]. Smaller overlap means a larger time shift in the signal and therefore lower processing demand, but the parameter values (e.g. feature vectors) of neighbouring frames can differ more strongly. Larger overlap results in a smoother change of the parameter values across frames, at the cost of higher processing power.
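Frame formation can be sketched in a few lines. The concrete numbers below are illustrative assumptions consistent with the ranges above: 16 kHz sampling, a 25 ms window ($K = 400$ samples) and a 10 ms shift between frame starts, i.e. 15 ms overlap:

```python
import numpy as np

fs = 16000
window_len = int(0.025 * fs)   # K = 400 samples (25 ms)
frame_shift = int(0.010 * fs)  # 160 samples (10 ms) between frame starts

def frame_signal(s, win_len, shift):
    """Split s into overlapping frames of length win_len, one every shift samples."""
    n_frames = 1 + (len(s) - win_len) // shift
    return np.stack([s[i * shift : i * shift + win_len]
                     for i in range(n_frames)])

s = np.random.randn(fs)                    # one second of dummy signal
frames = frame_signal(s, window_len, frame_shift)
print(frames.shape)                        # (number of frames, K)
```

With these values, one second of signal yields 98 frames of 400 samples each, and each frame shares 240 samples with its predecessor.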

There are various windowing functions, each with different characteristics, weighting the original signal differently and therefore producing different windowed signals. In speech processing, however, the shape of the window function is not that crucial; usually a soft window such as the von Hann or Hamming window is used [ref.] in order to reduce discontinuities of the speech signal at the edges of each frame. The Hamming window is described by $w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{K-1}\right)$ with $n = 0, \dots, K-1$, and it is depicted below.
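The taper towards the frame edges is easy to verify numerically. A minimal sketch (assuming $K = 400$, i.e. a 25 ms window at 16 kHz) evaluates the formula directly and compares it with NumPy's built-in `np.hamming`:

```python
import numpy as np

K = 400  # window length in samples (25 ms at 16 kHz)
n = np.arange(K)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (K - 1))

# The window tapers towards the frame edges, reducing edge discontinuities:
print(hamming[0], hamming[K // 2], hamming[-1])   # ~0.08, ~1.0, ~0.08

# Identical to NumPy's built-in implementation:
print(np.allclose(hamming, np.hamming(K)))        # True
```

Multiplying a frame element-wise by this array (`frame * hamming`) yields the windowed short-time signal whose spectrum is then analyzed.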

From now on, each frame can be analyzed independently and, later on in the feature extraction stage, represented by a single feature vector.

3. Denoising & speech enhancement

The denoising or noise reduction stage, also referred to as enhancement of speech degraded by noise, aims to improve the quality of the speech signal. The objective is to improve its intelligibility, a measure of how comprehensible speech is. Noise corrupting speech signals can be grouped coarsely into the following three classes:

1. Microphone related noise
2. Electrical noise (e.g. electromagnetically induced or radiated noise)
3. Environmental noise

The first two types of noise can easily be compensated for by training the speech recognizer on correspondingly noisy speech samples; compensating for environmental noise is not that elementary, due to its high variability. The basic problem of noise reduction is to reduce the external noise without disturbing the unvoiced and low-intensity noise-like components of the speech signal itself [4].

Noise reduction algorithms can be grouped into three fundamental classes. In the following, exemplary algorithms of each class are briefly described.

1. Filtering Techniques
2. Spectral Restoration (speech enhancement)
3. Speech-Model-Based

Filtering Techniques. Prominent algorithms based on filtering techniques are adaptive Wiener filtering and the spectral subtraction methods. Adaptive Wiener filtering adapts the filter transfer function from sample to sample based on the speech signal statistics (mean and variance). Spectral subtraction methods estimate the spectrum of the clean signal by subtracting the estimated noise magnitude spectrum from the noisy signal's magnitude spectrum while keeping the phase spectrum of the noisy signal [6].
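The spectral subtraction idea can be sketched in a few lines. This is a minimal, non-overlapping-frame toy version, not a production enhancer: the frame length, the synthetic 440 Hz "speech" signal, and the use of the true noise as the noise-only segment (in practice the noise spectrum is estimated during speech pauses) are all simplifying assumptions:

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=256):
    """Basic magnitude spectral subtraction, frame by frame.

    Subtracts an estimated mean noise magnitude spectrum from each noisy
    frame's magnitude spectrum, keeps the noisy phase, and resynthesizes
    the frame by inverse FFT.
    """
    # Estimate the noise magnitude spectrum from a noise-only segment.
    usable = len(noise_only) // frame_len * frame_len
    noise_mag = np.abs(
        np.fft.rfft(noise_only[:usable].reshape(-1, frame_len), axis=1)
    ).mean(axis=0)

    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        spec = np.fft.rfft(noisy[start:start + frame_len])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # half-wave rectify
        phase = np.angle(spec)                            # keep noisy phase
        out[start:start + frame_len] = np.fft.irfft(mag * np.exp(1j * phase),
                                                    frame_len)
    return out

rng = np.random.default_rng(1)
fs, frame_len = 8000, 256
t = np.arange(2 * fs) / fs
clean = np.sin(2 * np.pi * 440 * t)                  # stand-in for "speech"
noise = 0.3 * rng.standard_normal(2 * fs)
enhanced = spectral_subtraction(clean + noise, noise, frame_len)

def snr(ref, sig):
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - sig)**2))

print(snr(clean, clean + noise), snr(clean, enhanced))
```

Even this crude version raises the SNR of the noisy signal; real implementations add overlapping windows, oversubtraction factors and a spectral floor to suppress the "musical noise" that half-wave rectification introduces.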

Spectral Restoration. Spectral restoration refers to inducing missing spectral components of nonverbal sounds by adding noise, in order to increase intelligibility [7].

Speech-Model-Based. Harmonic decomposition refers to a denoising technique that uses a harmonic-plus-noise model of speech, assuming that the speech signal is composed of a periodic/voiced and a random/unvoiced part. By processing the components separately and recombining them, the speech signal can be enhanced. An exemplary realization is described in greater detail in the article on harmonic decomposition. Nonnegative matrix factorization (NMF) algorithms factorize the Mel-magnitude spectra of noisy speech into a nonnegative weighted linear combination of speech and noise basis functions; the weights are then used to obtain an estimate of the clean speech [8]. Nonnegative matrix factorization may also be used as a feature extraction technique in speech processing.
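The factorization at the core of the NMF approach can be sketched with the classic Lee-Seung multiplicative updates. This toy example factorizes a synthetic nonnegative "spectrogram" (64 frequency bins by 100 frames, built from 4 known basis spectra - all illustrative assumptions) and is only the decomposition step, not the full enhancement system of [8]:

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0):
    """Factorize a nonnegative matrix V into W @ H (both nonnegative)
    using Lee-Seung multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.uniform(0.1, 1.0, (n, rank))       # basis spectra (columns)
    H = rng.uniform(0.1, 1.0, (rank, m))       # per-frame activation weights
    eps = 1e-12                                # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "magnitude spectrogram": 64 bins x 100 frames from 4 basis spectra,
# so a rank-4 factorization can fit it almost exactly.
rng = np.random.default_rng(1)
V = rng.uniform(0, 1, (64, 4)) @ rng.uniform(0, 1, (4, 100))

W, H = nmf(V, rank=4)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_err)  # small reconstruction residual
```

In the denoising setting, some columns of W would be speech bases and others noise bases; the corresponding rows of H then indicate how much speech versus noise is active in each frame, which yields the clean-speech estimate.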

In conclusion, the general measure of quality of a noise reduction system is its improvement in signal-to-noise ratio (SNR); with respect to speech recognition, however, the best measure is the improvement in recognition performance.

References

[1] C. E. Shannon. Communication in the presence of noise. In Proc. Institute of Radio Engineers, vol. 37, no. 1, pp. 10–21, Jan. 1949.

[2] C. Sanderson, K. K. Paliwal. Effect of different sampling rates and feature vector sizes on speech recognition performance. In TENCON '97. IEEE Region 10 Annual Conference. Speech and Image Technologies for Computing and Telecommunications., Proceedings of IEEE, vol. 1, pp. 161-164, Dec. 1997.

[3] N. A. Meseguer. Speech Analysis for Automatic Speech Recognition. Norwegian University of Science and Technology, Department of Electronics and Telecommunications, July 2009.

[4] A. G. Maher, R. W. Kind, J.G. Rathmell. A Comparison of Noise Reduction Techniques for Speech Recognition in Telecommunications Environments. In The Institution of Engineers Australia Communications Conference, Sydney, October, 1992.

[5] J. Benesty, M. M. Sondhi, Y. Huang. Springer Handbook of Speech Processing, pp.843-869. Springer, 2007.

[6] M. A. Abd El-Fattah, M. I. Dessouky, S. M. Diab, F. E. Abd El-samie. Adaptive Wiener Filtering Approach for Speech Enhancement. In Ubiquitous Computing and Communication Journal, Vol. 3, No. 2, pp. 1-8. April 2008.

[7] R. M. Warren, K. R. Hainsworth, B. S. Brubaker, J. A. Bashford, E. W. Healy. Spectral restoration of speech: intelligibility is increased by inserting noise in spectral gaps. In Perception and psychophysics, 59 (2) (1997), pp. 275-283.

[8] J. T. Geiger, J. F. Gemmeke, B. Schuller, G. Rigoll. Investigating NMF Speech Enhancement for Neural Network based Acoustic Models. In Proc. Interspeech 2014, ISCA, Singapore, 2014.