
Anchor Boxes for Object Detection

Object detection using deep learning neural networks can provide a fast and accurate means to predict the location and size of an object in an image. Ideally, the network returns valid objects in a timely manner, regardless of the scale of the objects. The use of anchor boxes improves the speed and efficiency for the detection portion of a deep learning neural network framework.

What Is an Anchor Box?

Anchor boxes are a set of predefined bounding boxes of a certain height and width. These boxes are defined to capture the scale and aspect ratio of specific object classes you want to detect and are typically chosen based on object sizes in your training datasets. During detection, the predefined anchor boxes are tiled across the image. The network predicts the probability and other attributes, such as background, intersection over union (IoU), and offsets, for every tiled anchor box. The predictions are used to refine each individual anchor box. You can define several anchor boxes, each for a different object size. Anchor boxes are fixed initial bounding box guesses.

The network does not directly predict bounding boxes, but rather predicts the probabilities and refinements that correspond to the tiled anchor boxes. The network returns a unique set of predictions for every anchor box defined. The final feature map represents object detections for each class. The use of anchor boxes enables a network to detect multiple objects, objects of different scales, and overlapping objects.

Advantage of Using Anchor Boxes

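When using anchor boxes, you can evaluate all object predictions at once. Anchor boxes eliminate the need to scan an image with a sliding window that computes a separate prediction at every potential position. Examples of detectors that use a sliding window are those based on aggregate channel features (ACF) or histogram of gradients (HOG) features. An object detector that uses anchor boxes can process an entire image at once, making real-time object detection systems possible.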

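Because a convolutional neural network (CNN) can process an input image in a convolutional manner, a spatial location in the input can be related to a spatial location in the output. This convolutional correspondence means that a CNN can extract image features for an entire image at once. The extracted features can then be associated back to their location in that image. The use of anchor boxes replaces and drastically reduces the cost of the sliding window approach for extracting features from an image. Using anchor boxes, you can design efficient deep learning object detectors that encompass all three stages (detect, feature encode, and classify) of a sliding-window based object detector.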

How Do Anchor Boxes Work?

The position of an anchor box is determined by mapping the location of the network output back to the input image. The process is replicated for every network output. The result produces a set of tiled anchor boxes across the entire image. Each anchor box represents a specific prediction of a class. For example, there are two anchor boxes to make two predictions per location in the image below.

Each anchor box is tiled across the image. The number of network outputs equals the number of tiled anchor boxes. The network produces predictions for all outputs.
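The tiling described above can be sketched in a few lines of Python. This is a minimal illustration, not the toolbox implementation: the function name `tile_anchor_boxes`, the `(cx, cy, w, h)` box layout, and the choice to center each anchor on the image patch that corresponds to an output location are all assumptions made for the example.

```python
def tile_anchor_boxes(image_size, stride, anchor_sizes):
    """Tile anchor boxes across an image (hypothetical helper).

    image_size   -- (height, width) of the input image in pixels
    stride       -- downsampling factor of the network, assumed equal on both axes
    anchor_sizes -- list of (height, width) anchor box sizes in pixels
    Returns a list of (cx, cy, w, h) boxes: one per anchor size per output location.
    """
    img_h, img_w = image_size
    out_h, out_w = img_h // stride, img_w // stride  # size of the network output
    boxes = []
    for row in range(out_h):
        for col in range(out_w):
            # Map the output location back to the center of its patch in the input image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for (h, w) in anchor_sizes:
                boxes.append((cx, cy, w, h))
    return boxes


# Two anchor boxes per location on a 256-by-256 image with a stride of 16.
anchors = tile_anchor_boxes((256, 256), 16, [(64, 64), (128, 64)])
print(len(anchors))  # 16 * 16 locations * 2 anchors = 512
print(anchors[0])    # (8.0, 8.0, 64, 64)
```

The number of tiled boxes equals the number of output locations times the number of anchor sizes, which matches the statement that the network produces one set of predictions per tiled anchor box.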

Localization Errors and Refinement

The distance, or stride, between the tiled anchor boxes is a function of the amount of downsampling present in the CNN. Downsampling factors between 4 and 16 are common. These downsampling factors produce coarsely tiled anchor boxes, which can lead to localization errors.

To fix localization errors, deep learning object detectors learn offsets to apply to each tiled anchor box, refining the anchor box position and size.
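The text does not say how the learned offsets are encoded, so the sketch below assumes one widely used parameterization: the center shifts are scaled by the anchor size, and the width and height are scaled exponentially. The function name `refine_anchor` and the `(tx, ty, tw, th)` offset layout are illustrative assumptions, not the toolbox API.

```python
import math


def refine_anchor(anchor, offsets):
    """Apply predicted offsets to one anchor box (assumed encoding).

    anchor  -- (cx, cy, w, h) of the tiled anchor box in pixels
    offsets -- (tx, ty, tw, th) predicted by the network for that anchor
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    new_cx = cx + tx * w       # shift the center by a fraction of the anchor width
    new_cy = cy + ty * h       # shift the center by a fraction of the anchor height
    new_w = w * math.exp(tw)   # scale the width; tw = 0 leaves it unchanged
    new_h = h * math.exp(th)   # scale the height; th = 0 leaves it unchanged
    return (new_cx, new_cy, new_w, new_h)


# Move a 64-by-64 anchor right by a quarter of its width, up by an eighth
# of its height, and double its height.
refined = refine_anchor((8.0, 8.0, 64.0, 64.0), (0.25, -0.125, 0.0, math.log(2.0)))
print(refined)  # approximately (24.0, 0.0, 64.0, 128.0)
```

With all offsets equal to zero, the refined box is the anchor itself, which is why the anchor can be read as the network's initial guess.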

Downsampling can be reduced by removing downsampling layers. To reduce downsampling, lower the 'Stride' property of the convolution or max pooling layers (such as convolution2dLayer (Deep Learning Toolbox) and maxPooling2dLayer (Deep Learning Toolbox)). You can also choose a feature extraction layer earlier in the network. Feature extraction layers from earlier in the network have higher spatial resolution.
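As a rough illustration of how layer strides determine the anchor stride, the sketch below multiplies the strides of the downsampling layers that precede the feature extraction layer. The helper names and the example stride lists are hypothetical, and the size calculation assumes that every layer divides the spatial size evenly.

```python
def total_downsampling(layer_strides):
    """Product of the strides of all layers before the feature extraction layer."""
    factor = 1
    for stride in layer_strides:
        factor *= stride
    return factor


def feature_map_size(image_size, layer_strides):
    """Spatial size of the network output, assuming even division at every layer."""
    factor = total_downsampling(layer_strides)
    return (image_size[0] // factor, image_size[1] // factor)


strides_original = [2, 2, 2, 2]  # four layers that each halve the resolution
strides_reduced = [2, 2, 2, 1]   # last layer's stride lowered from 2 to 1

print(total_downsampling(strides_original), feature_map_size((224, 224), strides_original))  # 16 (14, 14)
print(total_downsampling(strides_reduced), feature_map_size((224, 224), strides_reduced))    # 8 (28, 28)
```

Lowering one stride from 2 to 1 halves the distance between tiled anchor boxes and quadruples the number of output locations in this example.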

Generate Object Detections

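To generate the final object detections, tiled anchor boxes that belong to the background class are removed, and the remaining ones are filtered by their confidence score. Anchor boxes with the greatest confidence score are selected using nonmaximum suppression (NMS). For more details about NMS, see the selectStrongestBboxMulticlass function.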
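The filtering and selection steps can be sketched as a greedy, single-class nonmaximum suppression. This is not the selectStrongestBboxMulticlass implementation: the function names, the `(x1, y1, x2, y2)` corner format, and both threshold values are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_detections(boxes, scores, score_threshold=0.5, overlap_threshold=0.5):
    """Return indices of the boxes kept after score filtering and greedy NMS."""
    # Step 1: drop boxes whose confidence score is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit the remaining boxes from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a stronger kept box.
        if all(iou(boxes[i], boxes[j]) <= overlap_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = select_detections(boxes, scores)
print(kept)  # [0, 2]
```

Box 1 overlaps the stronger box 0 with an IoU of about 0.82 and is suppressed, and box 3 is dropped by the score threshold. A multiclass version would run the same selection separately for each class.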

Anchor Box Size

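Multiscale processing enables the network to detect objects of varying size. To achieve multiscale detection, you must specify anchor boxes of varying size, such as 64-by-64, 128-by-128, and 256-by-256. Specify sizes that closely represent the scale and aspect ratio of objects in your training data. For an example of estimating sizes, see Estimate Anchor Boxes From Training Data.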
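One common way to pick sizes that match the training data, assumed here rather than taken from the text, is to cluster the ground-truth box sizes with k-means and use IoU as the similarity measure. The sketch below is a simplified, deterministic version; the function names, the initialization scheme, and the sample sizes are all hypothetical.

```python
def size_iou(a, b):
    """IoU of two (width, height) boxes aligned at the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def estimate_anchor_sizes(box_sizes, k, iterations=20):
    """Cluster (width, height) pairs into k anchor sizes using IoU similarity."""
    # Deterministic start: pick k boxes evenly spaced through the area-sorted list.
    ordered = sorted(box_sizes, key=lambda s: s[0] * s[1])
    step = len(ordered) / k
    centers = [ordered[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iterations):
        # Assign every box to the center it overlaps most.
        clusters = [[] for _ in range(k)]
        for s in box_sizes:
            best = max(range(k), key=lambda c: size_iou(s, centers[c]))
            clusters[best].append(s)
        # Move each center to the mean width and height of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                mean_w = sum(m[0] for m in members) / len(members)
                mean_h = sum(m[1] for m in members) / len(members)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return sorted(centers, key=lambda s: s[0] * s[1])


# Hypothetical ground-truth box sizes (width, height) in pixels.
sizes = [(60, 62), (64, 66), (68, 64), (120, 130), (128, 126), (136, 128)]
anchor_sizes = estimate_anchor_sizes(sizes, k=2)
print(anchor_sizes)  # [(64.0, 64.0), (128.0, 128.0)]
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, because the similarity depends on relative overlap and not on absolute pixel differences.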
