Specify Layers of Convolutional Neural Network

The first step of creating and training a new convolutional neural network (ConvNet) is to define the network architecture. This topic explains the details of ConvNet layers and the order they appear in a ConvNet. For a complete list of deep learning layers and how to create them, see List of Deep Learning Layers. To learn about LSTM networks for sequence classification and regression, see Long Short-Term Memory Networks. To learn how to create your own custom layers, see Define Custom Deep Learning Layers.

The network architecture can vary depending on the types and numbers of layers included, which in turn depend on the particular application or data. For example, if you have categorical responses, you must have a softmax layer and a classification layer, whereas if your response is continuous, you must have a regression layer at the end of the network. A smaller network with only one or two convolutional layers might be sufficient to learn from a small amount of grayscale image data. On the other hand, for more complex data with millions of colored images, you might need a more complicated network with multiple convolutional and fully connected layers.

To specify the architecture of a deep network with all layers connected sequentially, create an array of layers directly. For example, to create a deep network that classifies 28-by-28 grayscale images into 10 classes, specify the layer array

layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3,16,'Padding',1)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(3,32,'Padding',1)
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];
layers is an array of Layer objects. You can then use layers as an input to the training function trainNetwork.

To specify the architecture of a neural network with all layers connected sequentially, create an array of layers directly. To specify the architecture of a network where layers can have multiple inputs or outputs, use a LayerGraph object.

Image Input Layer

Create an image input layer using imageInputLayer.

An image input layer inputs images to a network and applies data normalization.

Specify the image size using the inputSize argument. The size of an image corresponds to the height, width, and the number of color channels of that image. For example, for a grayscale image, the number of channels is 1, and for a color image it is 3.
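As a sketch, an input layer for the 28-by-28 grayscale images used in the earlier example can be created as follows ('Normalization' is set explicitly here for illustration; 'zerocenter' is the default):

```matlab
% Input layer for 28-by-28 grayscale images (1 channel)
% with zero-center data normalization
inputLayer = imageInputLayer([28 28 1],'Normalization','zerocenter');
```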

Convolutional Layer

A 2-D convolutional layer applies sliding convolutional filters to the input. Create a 2-D convolutional layer using convolution2dLayer.

The convolutional layer consists of various components.[1]

Filters and Stride

A convolutional layer consists of neurons that connect to subregions of the input images or the outputs of the previous layer. The layer learns the features localized by these regions while scanning through an image. When creating a layer using the convolution2dLayer function, you can specify the size of these regions using the filterSize input argument.

For each region, the trainNetwork function computes a dot product of the weights and the input, and then adds a bias term. A set of weights that is applied to a region in the image is called a filter. The filter moves along the input image vertically and horizontally, repeating the same computation for each region. In other words, the filter convolves the input.

This image shows a 3-by-3 filter scanning through the input. The lower map represents the input and the upper map represents the output.

The step size with which the filter moves is called a stride. You can specify the step size with the Stride name-value pair argument. The local regions that the neurons connect to can overlap depending on the filterSize and 'Stride' values.

This image shows a 3-by-3 filter scanning through the input with a stride of 2. The lower map represents the input and the upper map represents the output.

The number of weights in a filter is h*w*c, where h is the height and w is the width of the filter, and c is the number of channels in the input. For example, if the input is a color image, the number of color channels is 3. The number of filters determines the number of channels in the output of a convolutional layer. Specify the number of filters using the numFilters argument with the convolution2dLayer function.
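For illustration, a layer with 3-by-3 filters, 16 filters, and a stride of 2 (these particular values are chosen arbitrarily) can be created as:

```matlab
% 16 filters of size 3-by-3, moving in steps of 2 pixels
convLayer = convolution2dLayer(3,16,'Stride',2);
```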

Dilated Convolution

A dilated convolution is a convolution in which the filters are expanded by spaces inserted between the elements of the filter. Specify the dilation factor using the 'DilationFactor' property.

Use dilated convolutions to increase the receptive field (the area of the input which the layer can see) of the layer without increasing the number of parameters or computation.

The layer expands the filters by inserting zeros between each filter element. The dilation factor determines the step size for sampling the input or, equivalently, the upsampling factor of the filter. It corresponds to an effective filter size of (Filter Size – 1) .* Dilation Factor + 1. For example, a 3-by-3 filter with the dilation factor [2 2] is equivalent to a 5-by-5 filter with zeros between the elements.

This image shows a 3-by-3 filter dilated by a factor of two scanning through the input. The lower map represents the input and the upper map represents the output.
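A dilated layer matching the example above (3-by-3 filters, dilation factor [2 2]; the number of filters, 16, is arbitrary here) can be sketched as:

```matlab
% 3-by-3 filters dilated by a factor of 2 in each dimension,
% giving an effective filter size of 5-by-5
dilatedConv = convolution2dLayer(3,16,'DilationFactor',[2 2]);
```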

Feature Maps

As a filter moves along the input, it uses the same set of weights and the same bias for the convolution, forming a feature map. Each feature map is the result of a convolution using a different set of weights and a different bias. Hence, the number of feature maps is equal to the number of filters. The total number of parameters in a convolutional layer is ((h*w*c + 1)*Number of Filters), where 1 is the bias.

Zero Padding

You can also apply zero padding to input image borders vertically and horizontally using the 'Padding' name-value pair argument. Padding is rows or columns of zeros added to the borders of an image input. By adjusting the padding, you can control the output size of the layer.

This image shows a 3-by-3 filter scanning through the input with padding of size 1. The lower map represents the input and the upper map represents the output.

Output Size

The output height and width of a convolutional layer is (Input Size – ((Filter Size – 1)*Dilation Factor + 1) + 2*Padding)/Stride + 1. This value must be an integer for the whole image to be fully covered. If the combination of these options does not lead the image to be fully covered, the software by default ignores the remaining part of the image along the right and bottom edges in the convolution.

Number of Neurons

The product of the output height and width gives the total number of neurons in a feature map, say Map Size. The total number of neurons (output size) in a convolutional layer is Map Size*Number of Filters.

For example, suppose that the input image is a 32-by-32-by-3 color image. For a convolutional layer with eight filters and a filter size of 5-by-5, the number of weights per filter is 5 * 5 * 3 = 75, and the total number of parameters in the layer is (75 + 1) * 8 = 608. If the stride is 2 in each direction and padding of size 2 is specified, then each feature map is 16-by-16. This is because (32 – 5 + 2 * 2)/2 + 1 = 16.5, and some of the outermost zero padding to the right and bottom of the image is discarded. Finally, the total number of neurons in the layer is 16 * 16 * 8 = 2048.
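The output-size arithmetic can be checked with a small helper (convOutputSize is a hypothetical name, not a toolbox function; floor models the discarded partial regions at the right and bottom edges):

```matlab
% (Input Size - ((Filter Size - 1)*Dilation + 1) + 2*Padding)/Stride + 1,
% rounded down because partial regions at the edges are discarded
convOutputSize = @(in,f,d,p,s) floor((in - ((f-1)*d + 1) + 2*p)/s) + 1;

convOutputSize(32,5,1,2,2)   % returns 16 for the 32-by-32 input example
```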

Usually, the results from these neurons pass through some form of nonlinearity, such as rectified linear units (ReLU).

Learning Parameters

You can adjust the learning rates and regularization options for the layer using name-value pair arguments while defining the convolutional layer. If you choose not to specify these options, then trainNetwork uses the global training options defined with the trainingOptions function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.
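As a sketch, per-layer learn-rate factors can be set when creating the layer (the factor value of 2, meaning twice the global learning rate, is arbitrary here):

```matlab
% Double the global learning rate for this layer's weights and biases
convLayer = convolution2dLayer(3,16, ...
    'WeightLearnRateFactor',2,'BiasLearnRateFactor',2);
```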

Number of Layers

A convolutional neural network can consist of one or multiple convolutional layers. The number of convolutional layers depends on the amount and complexity of the data.

Batch Normalization Layer

Create a batch normalization layer using batchNormalizationLayer.

A batch normalization layer normalizes each input channel across a mini-batch. To speed up training of convolutional neural networks and reduce the sensitivity to network initialization, use batch normalization layers between convolutional layers and nonlinearities, such as ReLU layers.

The layer first normalizes the activations of each channel by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Then, the layer shifts the input by a learnable offset β and scales it by a learnable scale factor γ. β and γ are themselves learnable parameters that are updated during network training.
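The two steps just described can be written in the standard batch normalization notation (ε here is a small constant added for numerical stability, an assumption not stated in the text above):

```latex
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta,
```

where μ_B and σ_B² are the mean and variance computed over the mini-batch for each channel.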

Batch normalization layers normalize the activations and gradients propagating through a neural network, making network training an easier optimization problem. To take full advantage of this fact, you can try increasing the learning rate. Since the optimization problem is easier, the parameter updates can be larger and the network can learn faster. You can also try reducing the L2 and dropout regularization. With batch normalization layers, the activations of a specific image during training depend on which images happen to appear in the same mini-batch. To take full advantage of this regularizing effect, try shuffling the training data before every training epoch. To specify how often to shuffle the data during training, use the 'Shuffle' name-value pair argument of trainingOptions.

ReLU Layer

Create a ReLU layer using reluLayer.

A ReLU layer performs a threshold operation on each element of the input, where any value less than zero is set to zero.

Convolutional and batch normalization layers are usually followed by a nonlinear activation function such as a rectified linear unit (ReLU), specified by a ReLU layer. A ReLU layer performs a threshold operation on each element, where any input value less than zero is set to zero, that is,

$$f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0. \end{cases}$$

The ReLU layer does not change the size of its input.

There are other nonlinear activation layers that perform different operations and can improve the network accuracy for some applications. For a list of activation layers, see Activation Layers.

Cross Channel Normalization (Local Response Normalization) Layer

Create a cross channel normalization layer using crossChannelNormalizationLayer.

A channel-wise local response (cross-channel) normalization layer carries out channel-wise normalization.

This layer performs a channel-wise local response normalization. It usually follows the ReLU activation layer. This layer replaces each element with a normalized value it obtains using the elements from a certain number of neighboring channels (elements in the normalization window). That is, for each element x in the input, trainNetwork computes a normalized value x' using

$$x' = \frac{x}{\left(K + \dfrac{\alpha * ss}{\textit{windowChannelSize}}\right)^{\beta}},$$

where K, α, and β are the hyperparameters in the normalization, and ss is the sum of squares of the elements in the normalization window [2]. You must specify the size of the normalization window using the windowChannelSize argument of the crossChannelNormalizationLayer function. You can also specify the hyperparameters using the Alpha, Beta, and K name-value pair arguments.

The previous normalization formula is slightly different from what is presented in [2]. You can obtain the equivalent formula by multiplying the alpha value by the windowChannelSize.
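As an illustration, a cross channel normalization layer with a window of 5 channels and explicitly set hyperparameters (these particular values are assumptions chosen for the sketch, not taken from the text above) can be created as:

```matlab
% Normalize each element using a window of 5 neighboring channels
lrnLayer = crossChannelNormalizationLayer(5, ...
    'Alpha',1e-4,'Beta',0.75,'K',2);
```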

Max and Average Pooling Layers

A max pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the maximum of each region. Create a max pooling layer using maxPooling2dLayer.

An average pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the average values of each region. Create an average pooling layer using averagePooling2dLayer.

Pooling layers follow the convolutional layers for down-sampling, hence reducing the number of connections to the following layers. They do not perform any learning themselves, but reduce the number of parameters to be learned in the following layers. They also help reduce overfitting.

A max pooling layer returns the maximum values of rectangular regions of its input. The size of the rectangular regions is determined by the poolSize argument of maxPooling2dLayer. For example, if poolSize equals [2,3], then the layer returns the maximum value in regions of height 2 and width 3.

An average pooling layer outputs the average values of rectangular regions of its input. The size of the rectangular regions is determined by the poolSize argument of averagePooling2dLayer. For example, if poolSize is [2,3], then the layer returns the average value of regions of height 2 and width 3.

Pooling layers scan through the input horizontally and vertically in step sizes you can specify using the 'Stride' name-value pair argument. If the pool size is smaller than or equal to the stride, then the pooling regions do not overlap.

For nonoverlapping regions (Pool Size and Stride are equal), if the input to the pooling layer is n-by-n, and the pooling region size is h-by-h, then the pooling layer down-samples the regions by h [6]. That is, the output of a max or average pooling layer for one channel of a convolutional layer is n/h-by-n/h. For overlapping regions, the output of a pooling layer is (Input Size – Pool Size + 2*Padding)/Stride + 1.
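A nonoverlapping 2-by-2 max pooling layer, as used in the example network at the start of this topic, halves the height and width of its input:

```matlab
% 2-by-2 pooling regions moving in steps of 2: no overlap,
% so an n-by-n input produces an (n/2)-by-(n/2) output
poolLayer = maxPooling2dLayer(2,'Stride',2);
```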

Dropout Layer

Create a dropout layer using dropoutLayer.

A dropout layer randomly sets input elements to zero with a given probability.

At training time, the layer randomly sets input elements to zero given by the dropout mask rand(size(X)) < Probability, where X is the layer input, and then scales the remaining elements by 1/(1-Probability). This operation effectively changes the underlying network architecture between iterations and helps prevent the network from overfitting [7], [2]. A higher number results in more elements being dropped during training. At prediction time, the output of the layer is equal to its input.

Similar to max or average pooling layers, no learning takes place in this layer.
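The mask-and-scale operation described above can be sketched directly (an illustration only, with Probability = 0.5 chosen arbitrarily; the toolbox applies this inside dropoutLayer during training):

```matlab
% Training-time dropout: zero out elements where the mask is true,
% then rescale the survivors so the expected value is unchanged
X = rand(4);                          % example layer input
Probability = 0.5;
mask = rand(size(X)) < Probability;   % elements to drop
Y = X .* ~mask / (1 - Probability);   % dropped-and-rescaled output
```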

Fully Connected Layer

Create a fully connected layer using fullyConnectedLayer.

A fully connected layer multiplies the input by a weight matrix and then adds a bias vector.

The convolutional (and down-sampling) layers are followed by one or more fully connected layers.

As the name suggests, all neurons in a fully connected layer connect to all the neurons in the previous layer. This layer combines all of the features (local information) learned by the previous layers across the image to identify the larger patterns. For classification problems, the last fully connected layer combines the features to classify the images. This is the reason that the outputSize argument of the last fully connected layer of the network is equal to the number of classes of the data set. For regression problems, the output size must be equal to the number of response variables.

You can also adjust the learning rate and the regularization parameters for this layer using the related name-value pair arguments when creating the fully connected layer. If you choose not to adjust them, then trainNetwork uses the global training parameters defined by the trainingOptions function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.

A fully connected layer multiplies the input by a weight matrix W and then adds a bias vector b.

If the input to the layer is a sequence (for example, in an LSTM network), then the fully connected layer acts independently on each time step. For example, if the layer before the fully connected layer outputs an array X of size D-by-N-by-S, then the fully connected layer outputs an array Z of size outputSize-by-N-by-S. At time step t, the corresponding entry of Z is $W X_t + b$, where $X_t$ denotes time step t of X.
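The weight-matrix-plus-bias operation can be sketched for a single input vector (the sizes, 20 inputs and 10 outputs, are arbitrary):

```matlab
% Fully connected layer as a matrix multiply plus bias
W = randn(10,20);   % outputSize-by-inputSize weight matrix
b = randn(10,1);    % bias vector
x = randn(20,1);    % one input observation
z = W*x + b;        % 10-by-1 output
```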

Output Layers

Softmax and Classification Layers

A softmax layer applies a softmax function to the input. Create a softmax layer using softmaxLayer.

A classification layer computes the cross entropy loss for multi-class classification problems with mutually exclusive classes. Create a classification layer using classificationLayer.

For classification problems, a softmax layer and then a classification layer must follow the final fully connected layer.

The output unit activation function is the softmax function:

$$y_r(x) = \frac{\exp(a_r(x))}{\sum_{j=1}^{k}\exp(a_j(x))},$$

where $0 \le y_r \le 1$ and $\sum_{j=1}^{k} y_j = 1$.

The softmax function is the output unit activation function after the last fully connected layer for multi-class classification problems:

$$P(c_r \mid x,\theta) = \frac{P(x,\theta \mid c_r)\,P(c_r)}{\sum_{j=1}^{k} P(x,\theta \mid c_j)\,P(c_j)} = \frac{\exp(a_r(x,\theta))}{\sum_{j=1}^{k}\exp(a_j(x,\theta))},$$

where $0 \le P(c_r \mid x,\theta) \le 1$ and $\sum_{j=1}^{k} P(c_j \mid x,\theta) = 1$. Moreover, $a_r = \ln\!\big(P(x,\theta \mid c_r)\,P(c_r)\big)$, where $P(x,\theta \mid c_r)$ is the conditional probability of the sample given class r, and $P(c_r)$ is the class prior probability.

The softmax function is also known as the normalized exponential and can be considered the multi-class generalization of the logistic sigmoid function [8].

For typical classification networks, the classification layer must follow the softmax layer. In the classification layer, trainNetwork takes the values from the softmax function and assigns each input to one of the K mutually exclusive classes using the cross entropy function for a 1-of-K coding scheme [8]:

$$\text{loss} = -\sum_{i=1}^{N}\sum_{j=1}^{K} t_{ij} \ln y_{ij},$$

where N is the number of samples, K is the number of classes, $t_{ij}$ is the indicator that the ith sample belongs to the jth class, and $y_{ij}$ is the output for sample i for class j, which in this case is the value from the softmax function. That is, it is the probability that the network associates the ith input with class j.
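The softmax and cross-entropy computations above can be sketched for one sample (the activation and target values are arbitrary):

```matlab
% Softmax over the activations of the last fully connected layer,
% followed by the cross entropy for a 1-of-K coded target
a = [2; 1; 0.1];              % activations a_j for k = 3 classes
y = exp(a) / sum(exp(a));     % softmax probabilities, sum to 1
t = [1; 0; 0];                % 1-of-K coded target class
loss = -sum(t .* log(y));     % cross entropy for this sample
```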

Regression Layer

Create a regression layer using regressionLayer.

A regression layer computes the half-mean-squared-error loss for regression problems. For typical regression problems, a regression layer must follow the final fully connected layer.

For a single observation, the mean-squared-error is given by:

$$\text{MSE} = \frac{\sum_{i=1}^{R}(t_i - y_i)^2}{R},$$

where R is the number of responses, $t_i$ is the target output, and $y_i$ is the network's prediction for response i.

For image and sequence-to-one regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses, not normalized by R:

$$\text{loss} = \frac{1}{2}\sum_{i=1}^{R}(t_i - y_i)^2.$$

For image-to-image regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses for each pixel, not normalized by R:

$$\text{loss} = \frac{1}{2}\sum_{p=1}^{HWC}(t_p - y_p)^2,$$

where H, W, and C denote the height, width, and number of channels of the output respectively, and p indexes into each element (pixel) of t and y linearly.

For sequence-to-sequence regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses for each time step, not normalized by R:

$$\text{loss} = \frac{1}{2S}\sum_{i=1}^{S}\sum_{j=1}^{R}(t_{ij} - y_{ij})^2,$$

where S is the sequence length.

When training, the software calculates the mean loss over the observations in the mini-batch.
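The half-mean-squared-error for a single observation can be sketched as follows (the target and prediction values are arbitrary):

```matlab
% Half-MSE over R = 3 responses, not normalized by R
t = [1.0; 2.0; 3.0];              % target responses
y = [0.9; 2.1; 2.7];              % predicted responses
loss = 0.5 * sum((t - y).^2);     % 0.5*(0.01 + 0.01 + 0.09) = 0.055
```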

References

[1] Murphy, K. P. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press, 2012.

[2] Krizhevsky, A., I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems. Vol. 25, 2012.

[3] LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, et al. "Handwritten Digit Recognition with a Back-propagation Network." In Advances in Neural Information Processing Systems, 1990.

[4] LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based Learning Applied to Document Recognition." Proceedings of the IEEE. Vol. 86, pp. 2278–2324, 1998.

[5] Nair, V., and G. E. Hinton. "Rectified Linear Units Improve Restricted Boltzmann Machines." In Proc. 27th International Conference on Machine Learning, 2010.

[6] Nagi, J., F. Ducatelle, G. A. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella. "Max-Pooling Convolutional Neural Networks for Vision-based Hand Gesture Recognition." IEEE International Conference on Signal and Image Processing Applications (ICSIPA2011), 2011.

[7] Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research. Vol. 15, pp. 1929–1958, 2014.

[8] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.

[9] Ioffe, Sergey, and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Preprint, arXiv:1502.03167, 2015.

Image credit: Convolution arithmetic (License).