Speech recognition is the translation of human speech
into text by computers. In this research review we
examine three different methods used in the speech
recognition field and investigate their performance in
different cases. We analyze state-of-the-art deep
neural networks (DNNs), which have evolved into complex
architectures and achieve significant results on a
variety of speech benchmarks. Afterward, we explain
convolutional neural networks (CNNs) and explore
their potential in this field. Finally, we present
the recent research in highway deep neural networks
(HDNNs), which seem to be more flexible for
resource-constrained platforms. Overall, we critically
compare these methods and show their strengths and
limitations. We conclude that each method has its
advantages but also its weaknesses, and that they are
used for different purposes.
I. Introduction
Machine Learning (ML) is a field of computer science
that gives computers the ability to learn through
different algorithms and techniques without being explicitly
programmed. Automatic speech recognition (ASR) is closely
related to ML because it uses methodologies and procedures
of ML [1, 2, 3]. ASR has been around for decades,
but it was not until recently that it underwent tremendous
development, because of the advances in both machine learning
methods and computer hardware. New ML techniques
made speech recognition accurate enough to be useful outside
of carefully controlled environments, so nowadays it can easily
be deployed in many electronic devices (i.e.
computers, smart-phones) and used in many applications,
such as identifying and authenticating a user via his/her voice.
Speech is the most important mode of communication
between human beings, and that is why, from the early part
of the previous century, efforts have been made to
make computers do what only humans could perceive.
Research has been conducted through the past five decades,
and the main reason was the desire to automate tasks
using machines [2]. Many motivations using different
theories, such as probabilistic modeling and reasoning,
pattern recognition, and artificial neural networks, affected
the researchers and helped to advance ASR.
The first major advance in the history of ASR occurred
in the mid-1970s with the introduction of the
expectation-maximization (EM) algorithm [4] for training
hidden Markov models (HMMs). The EM technique made it
possible to develop the first speech recognition systems
using Gaussian mixture models (GMMs). Despite all
the advantages of GMMs, they are not able to model
efficiently data that lie on or near a nonlinear surface in
the data space (i.e. a sphere). This problem could be solved
by artificial neural networks, but the computer hardware of
that era did not allow building complex neural networks.
As a result, in the beginning most speech recognition systems
were based on HMMs. Later they used the neural
network and hidden Markov model (NN/HMM) hybrid architecture,
first investigated in the early 1990s [5]. Since the
2000s, the improvement of computer hardware and the
invention of new machine learning algorithms have made
the training of DNNs possible. DNNs with
many hidden layers have been shown to outperform GMMs
on many different speech databases [6]. After
the huge success of DNNs, researchers have tried other, more
complex neural architectures, such as recurrent neural networks
with long short-term memory units (LSTM-RNNs) [7] and
CNNs, and it seems that each one of them has its own
benefits and applications.
In this literature review we present three types of artificial
neural networks (DNNs, CNNs, and HDNNs). We
analyze each method, explain how it is used for
training, and describe its advantages and disadvantages.
We then compare these methods in the context of ASR,
identifying where each one is more suitable and
what its limitations are. Finally, we draw some conclusions
from these comparisons and carefully suggest
some probable future directions.
II. Methods
 A. Deep Neural Networks
DNNs are feed-forward artificial neural networks with
more than one layer of hidden units. Each hidden
layer has a number of units (or neurons), each of which
takes all outputs of the lower layer as input, multiplies them
by a weight vector, sums the result, and passes it through
a non-linear activation function (i.e. the sigmoid function, the
hyperbolic tangent function, some kind of rectified linear unit
function (ReLU [8, 9]), or the exponential linear unit function
(ELU [10])). For a multi-class classification problem, the
posterior probability of each class can be estimated using
an output softmax layer. DNNs can be discriminatively
Informatics Research Review (s1736880)
trained by propagating derivatives of a cost function that
measures the discrepancy between the target outputs and
the actual outputs. For large training sets, it is typically
more convenient and efficient to compute derivatives on a
mini-batch of the training set rather than on the whole training
set (this is called stochastic gradient descent). As the cost
function we often use the cross-entropy (CE), but this actually
depends on the case.
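The forward pass and the mini-batch cross-entropy loss described above can be sketched as follows. This is a minimal NumPy illustration under assumed, arbitrary layer sizes and random data; it is not the configuration of any system cited in this review.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# A tiny DNN: two hidden ReLU layers and a softmax output layer.
sizes = [20, 32, 32, 5]          # input dim, two hidden widths, 5 classes
Ws = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    """Each hidden unit takes all outputs of the layer below,
    applies a weighted sum plus bias, then a non-linearity."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(h @ W + b)
    # the softmax layer turns scores into class posteriors
    return softmax(h @ Ws[-1] + bs[-1])

def cross_entropy(p, y):
    # y holds the integer target label of each mini-batch example
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

x = rng.normal(size=(8, 20))     # a mini-batch of 8 feature vectors
y = rng.integers(0, 5, size=8)
p = forward(x)
print(p.shape)                   # (8, 5): one posterior per class
```

In stochastic gradient descent, the derivatives of this loss with respect to `Ws` and `bs` would be computed on each such mini-batch and used to update the parameters.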
The difficulty of optimizing DNNs with many hidden layers,
along with the overfitting problem, forces us to use pretraining
methods. One popular method is the restricted
Boltzmann machine (RBM), as the authors describe in the
overview paper [6]. If we use a stack of RBMs, then we can
construct a deep belief network (DBN) (not to be
confused with a dynamic Bayesian network). The purpose
of this is to add an initial stage of generative pretraining.
Pretraining is very important for DNNs because it reduces
overfitting, and it also reduces the time required for
discriminative fine-tuning with backpropagation.
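To make the generative pretraining stage concrete, the following sketch shows one contrastive-divergence (CD-1) update for a single binary RBM layer, the building block that is stacked to form a DBN. It is an illustrative toy (sizes, learning rate, and data are invented), not the training recipe of the overview paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One RBM layer with binary visible and hidden units.
n_vis, n_hid = 16, 8
W = rng.normal(0, 0.01, (n_vis, n_hid))
a = np.zeros(n_vis)              # visible biases
b = np.zeros(n_hid)              # hidden biases

def cd1_update(v0, lr=0.1):
    """One contrastive-divergence (CD-1) step: up, down, up again,
    then update weights toward the data statistics and away from
    the reconstruction statistics."""
    global W, a, b
    ph0 = sigmoid(v0 @ W + b)                     # hidden probabilities
    h0 = (rng.random(ph0.shape) < ph0) * 1.0      # sampled hidden states
    pv1 = sigmoid(h0 @ W.T + a)                   # reconstructed visibles
    ph1 = sigmoid(pv1 @ W + b)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return ((v0 - pv1) ** 2).mean()               # reconstruction error

data = (rng.random((64, n_vis)) < 0.3) * 1.0      # toy binary data
errs = [cd1_update(data) for _ in range(50)]
```

After such a layer is trained, its hidden activations become the "data" for the next RBM in the stack; the resulting DBN weights then initialize the DNN before discriminative fine-tuning.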
DNNs play a major role in the context of ASR. Many
architectures have been used by different research groups in
order to gain better and better accuracy in acoustic models.
Several methodologies are presented in the overview article [6],
which reports significant results and shows that DNNs in
general achieve higher speech recognition accuracy than
GMMs on a variety of speech recognition benchmarks. The
main reason is that they benefit from the fact that
they can learn much better models of data that lie on or
near a non-linear surface. However, we have to mention
that they use many model parameters in order to achieve
good enough speech accuracy, and this is sometimes a
drawback. Furthermore, they are quite complex and
need many computational resources. Finally, they have
been criticized because they do not preserve some structure,
they are difficult to interpret, and they possess limited adaptability.
 B. Convolutional Neural Networks
Convolutional neural networks (CNNs) can be regarded
as DNNs with the main difference that, instead of using
fully connected hidden layers (as happens in DNNs),
they use a special network structure, which consists of convolution
and pooling layers [11, 12, 13]. The basic rule is that
the data have to be organized as a number of feature maps
(CNNs were first used for image recognition) in order to be
passed into the CNN. One significant problem when
we want to train our speech data with CNNs concerns frequency,
because we are not able to use the conventional
mel-frequency cepstral coefficient (MFCC) technique. The
reason is that this technique does not preserve the locality
of our data, whereas we want to preserve locality along
both the frequency and time axes. Hence, a solution is the usage
of mel-frequency spectral coefficients (MFSC features)
[13]. The use of the convolution process and the
pooling layers is illustrated in the same paper [13]. Our main target is
to learn the weights that are shared among the convolutional
layers. Moreover, as happens for DNNs with
RBMs, there is a corresponding procedure, the CRBM [14], for
CNNs that allows us to pretrain our data. In the paper [13]
the authors also examine the case of a CNN with limited
weight sharing for ASR (the LWS model), and they propose to
pretrain it by modifying the CRBM model.
CNNs have three major properties: locality, weight
sharing, and pooling. Each one of them has the potential
to improve speech recognition performance. We care
about locality because it adds robustness against non-white
noise and also reduces the number of network weights
to be learned. The second property, weight sharing, is
also important in CNNs because it can improve model
robustness and reduce overfitting. Besides that, it also
reduces the number of weights that need to be learned. Both
locality and weight sharing are significant factors for the
property of pooling, which is very helpful in handling small
frequency shifts that are common in speech signals. These
shifts may result from differences in vocal tract lengths
among different speakers. In general, CNNs seem to achieve
relatively better performance in ASR by taking advantage
of their special network structure.
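The three properties above can be seen in a minimal sketch of one convolution-plus-pooling stage applied along the frequency axis of a single MFSC-like feature map. The band count, kernel shapes, and pooling size are assumptions for illustration, not values from the cited papers.

```python
import numpy as np

def conv1d_freq(fmap, kernels):
    """Valid 1-D convolution along the frequency axis.
    fmap: (n_freq,) band energies; kernels: (n_kern, k) shared weights."""
    n_freq, = fmap.shape
    n_kern, k = kernels.shape
    out = np.empty((n_kern, n_freq - k + 1))
    for j in range(n_freq - k + 1):
        # locality: each output unit sees only k neighbouring bands;
        # weight sharing: the same kernels are applied at every position
        out[:, j] = kernels @ fmap[j:j + k]
    return out

def max_pool(x, size):
    """Non-overlapping max-pooling along the last axis; this is what
    tolerates small frequency shifts (e.g. vocal-tract differences)."""
    n = x.shape[-1] // size
    return x[..., :n * size].reshape(*x.shape[:-1], n, size).max(axis=-1)

rng = np.random.default_rng(2)
bands = rng.random(40)                 # 40 mel filter-bank (MFSC) bands
kernels = rng.normal(size=(3, 8))      # 3 learned kernels of width 8
features = max_pool(np.maximum(conv1d_freq(bands, kernels), 0.0), 3)
print(features.shape)                  # (3, 11)
```

Note how the pooling size trades resolution for shift tolerance: a larger `size` blurs exactly where in frequency an activation occurred, which is the trade-off discussed in the comparison section below.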
 C. Highway Deep Neural Networks
HDNNs are depth-gated feed-forward neural networks
[15]. They are distinguished from conventional
DNNs for two main reasons: firstly, they use far fewer
model parameters, and secondly, they use two types of gate
functions to facilitate the information flow through the hidden
layers.
HDNNs are multi-layer networks with many hidden
layers. In each layer, the initial input or the output of the
previous hidden layer is transformed with the corresponding
parameters of the current layer, followed by a
non-linear activation function (i.e. the sigmoid function). The
output layer has its own parameters, and we
usually use the softmax function as the output function in
order to obtain the posterior probability of each class given
the input features. Afterward, given the target labels, the
network is usually trained by gradient descent to minimize
a loss function such as the cross-entropy. The architecture and
the process are the same as in DNNs.
The difference from standard DNNs is that highway
deep neural networks (HDNNs) were proposed to enable
very deep networks to be trained by augmenting the hidden
layers with gate functions [16]. These are the transform
gate, which scales the original hidden activations, and the carry
gate, which scales the input before passing it directly to the
next hidden layer, as the authors describe in the paper [15].
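A single highway layer can be sketched as below. Here the carry gate is coupled to the transform gate as C = 1 - T, which is one common simplification; the cited formulation keeps separate gate parameters, and the widths and initial values used here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W, b, W_T, b_T):
    """One highway layer: the transform gate T scales the ordinary
    hidden activation, and the carry gate (coupled here as 1 - T)
    scales the input and passes it straight to the next layer."""
    h = np.tanh(x @ W + b)          # ordinary hidden activation
    T = sigmoid(x @ W_T + b_T)      # transform gate, in (0, 1)
    C = 1.0 - T                     # coupled carry gate
    return h * T + x * C

rng = np.random.default_rng(3)
d = 16                              # input and output widths must match
x = rng.normal(size=(4, d))         # a mini-batch of 4 feature vectors
W = rng.normal(0, 0.1, (d, d))
W_T = rng.normal(0, 0.1, (d, d))
b = np.zeros(d)
b_T = np.full(d, -2.0)              # negative bias: start near carry behaviour
y = highway_layer(x, W, b, W_T, b_T)
print(y.shape)                      # (4, 16)
```

Because the gates are tiny relative to the weight matrices, adapting or retraining only `W_T` and `b_T` (and the carry parameters, when uncoupled) steers the whole network with very few parameters, which is the controllability argument made for HDNNs.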
In the same paper [15], three main methods for training
are presented: sequence training, an adaptation
technique, and teacher-student training. Combining
these methodologies with the two gates, the authors demonstrate
how important a role the carry and transform gates
play in training. They allow us to achieve speech
recognition accuracy comparable to that of classic DNNs but
with far fewer model parameters. This result is crucial for
resource-constrained platforms such as mobile devices (i.e.
voice recognition on mobiles).
 D. Comparison of the Methods
These methods have their benefits and
weaknesses. In general, DNNs behave very well and in
many cases perform considerably better than GMMs
on a variety of speech recognition benchmarks.
The main reason is that they benefit from the fact
that they can learn much better models of data that lie on or
near a non-linear surface. On the other hand, their biggest
drawback compared with GMMs is that it is much harder
to make good use of large cluster machines to train them
on massive amounts of data [6].
As far as CNNs are concerned, they can handle
frequency shifts that are difficult to handle within other
models such as GMMs and DNNs. Furthermore, it is also
difficult to learn an operation such as max-pooling in standard
artificial neural networks. Moreover, CNNs can handle
the temporal variability in the speech features as well. On
the other hand, the pooling or shift size may affect the fine
resolution, since CNNs are required to compute an
output for each frame for decoding. As a result, a large
pooling size may affect the locality of the labels. This may
cause phonetic confusion, especially at segment boundaries.
Hence, we have to be careful and choose an
appropriate pooling size [13].
HDNNs are considered to be more compact than regular
DNNs due to the fact that they can achieve similar
recognition accuracy with many fewer model parameters.
Furthermore, they are more controllable than DNNs,
because through the gate functions we can control the
behavior of the whole network using a very small number
of model parameters (the parameters of the gates). Moreover,
HDNNs are more adaptable: the authors
in paper [15] show that, simply by updating the gate functions
using adaptation data, they can gain considerably in speech
recognition accuracy. We cannot conclude much about their
general performance, because they are a recent proposal
and more research is needed to see their overall benefits
and limitations. However, the main idea is to use them in
order to achieve ASR accuracy comparable to DNNs while
simultaneously reducing the number of model parameters.