Local keypoint-based Faster R-CNN

Region-based Convolutional Neural Network (R-CNN) detectors have achieved state-of-the-art results on various challenging benchmarks. Although R-CNN has achieved high detection performance, the use of local information in producing candidates remains insufficiently explored. In this paper, we design a Keypoint-based Faster R-CNN (K-Faster) method for object detection. K-Faster incorporates local keypoints in Faster R-CNN to improve detection performance. In detail, a sparse descriptor, which first detects the points of interest in a given image and then samples a local patch and describes its invariant features, is employed to produce keypoints. Next, all 2-combinations of the produced keypoints are selected to generate keypoint anchors, which are helpful for object detection. The heterogeneously distributed anchors are then encoded in feature maps based on their areas and center coordinates. Finally, the keypoint anchors are coupled with the anchors produced by Faster R-CNN, and the coupled anchors are used for Region Proposal Network (RPN) training. Comparison experiments are implemented on PASCAL VOC 07/12 and MS COCO. The experimental results show that our K-Faster approach not only increases the mean Average Precision (mAP) performance but also improves the positioning precision of the detected boxes.


Introduction
General object detection is a complex problem. One of the main tasks of object detection is the localization problem, i.e., assigning accurate bounding boxes to different objects [1]. In the last two decades, object detectors based on Convolutional Neural Networks (CNNs) [2][3][4][5] have achieved state-of-the-art results on various challenging benchmarks [6][7][8]. As two representative Region-based CNN (R-CNN) approaches, both Fast/Faster R-CNN [3,4] and the Region-based Fully Convolutional Network (R-FCN) [9] use a Region Proposal Network (RPN) to generate region proposals. The RPN initializes anchors of different scales and aspect ratios at each convolutional feature map location [4]. Although such an anchor potentially covers the object of interest, it does not focus on local information. When a human identifies an object, both global structural information and local individual information are used in the identification [10].
Our work is motivated by the following two questions. First, is it possible to use local information to generate region proposals? Second, if the answer is positive, are the generated proposals helpful for object detection performance? To answer these questions, we employ keypoints as a kind of local information and fuse them into the RPN in this study. Figure 1 shows an example of object detection that combines Faster and keypoint anchors. Faster R-CNN cannot produce proposals for the bus, partially due to the error between the ground truth and the initialized anchor box. Although Faster R-CNN uses bounding-box regression to adjust the error from an anchor box to a ground-truth box, the regression may be powerless when the error between them is too large.
By combining the two kinds of anchors, both the bus and car in Fig. 1a can be correctly detected.
In this work, we propose the Keypoint-based Faster R-CNN method (K-Faster), which incorporates local keypoints in Faster R-CNN for object detection. All 2-combinations of the keypoints produced on an image are selected to generate bounding boxes, which serve as keypoint anchors (Fig. 1c). Every keypoint anchor is arranged in a convolutional network based on the Area Ratio of the Anchor to the Image (ARoAI) and the center coordinates of the anchor. On the one hand, a feature map with a lower index is designed to encode an anchor with a greater ARoAI, while a feature map with a higher index is used to encode an anchor with a smaller ARoAI. On the other hand, a grid point in a feature map is used to encode an anchor if the center coordinates of the anchor are convolved to that grid point. The keypoint anchors, together with the Faster anchors, which are boxes with fixed scales and aspect ratios [4], are used to train the RPN to produce proposals. Our approach employs keypoints as a kind of local information and fuses them with Faster R-CNN for region proposal. The keypoints around an object may produce a keypoint anchor with an aspect ratio or size that cannot be covered by the anchors of Faster R-CNN. Such keypoint anchors are helpful for region proposals, and the coupled anchors give the model a more powerful covering ability. Extensive experiments on PASCAL VOC 07/12 and MS COCO demonstrate that K-Faster improves detection performance.
Our main contributions are summarized as follows:
- We incorporate local keypoints in Faster R-CNN to improve object detection. The designed keypoint anchors are coupled with Faster anchors to cover the deficiency of Faster R-CNN, which initializes anchor boxes with preset aspect ratios and sizes.
- We design an area-based technique to encode anchors with a heterogeneous distribution. Because keypoints are attracted to pixels with heterogeneous intensity, the heterogeneously distributed anchors are partitioned into groups using the ARoAI and encoded in feature maps for RPN training.
- Compared with Faster R-CNN, our K-Faster approach not only increases the mean Average Precision (mAP) performance but also improves the positioning precision of the detected boxes.

Related work

Two-stage method
R-CNN [17], Fast R-CNN [3], Faster R-CNN [4], R-FCN [9], and Mask R-CNN [12] are representative two-stage methods. Although R-CNN can be considered a milestone of the two-stage approach to object detection, it classifies every region proposal separately and is time-consuming [14,16]. Fast R-CNN shares computation between the proposal extraction and classification steps using ROI-Pooling and therefore greatly improves efficiency. Faster R-CNN designs the RPN to generate proposals from anchor boxes and achieves further increases in speed and precision. R-FCN further improves speed and accuracy by removing the fully connected layers and adopting position-sensitive score maps for final detection [9,16]. Mask R-CNN, which may be used for object instance segmentation and bounding box detection, extends Faster R-CNN by adding a branch that outputs the object mask [12].
Recently, many techniques have been designed to improve object detection. In order to combine multiple convolutional features, multi-scale techniques are employed [18,19]. Bell et al. designed an Inside-Outside Net (ION) [18], which concatenates multiple convolutional features to improve object detection. In [19], a Multi-scale Location-aware Kernel Representation (MLKP) is proposed to capture high-order statistics of deep features in proposals. Li et al. proposed a Zoom-out-and-In network for object Proposals (ZIP) and employed a Map Attention Decision (MAD) unit to search for neuron activations [20].
Since knowledge may be helpful for object detection, some studies focus on exploiting it [10,21,22]. As shown in [10], the authors designed CoupleNet to couple the global structure with local parts for object detection. As a corner-based detector, DeNet evaluates the distribution of box corners to refine bounding boxes [21]. Jiang et al. introduced an object co-detection method (CRFNet) that exploits contextual information among multiple images through a higher-order conditional random field [22].

One-stage method
Alternatively, one-stage detectors, such as You Only Look Once (YOLO) [5], the Single Shot MultiBox Detector (SSD) [23], and CornerNet [24], detect objects in a single network and are usually more computationally efficient than two-stage detectors [1,24]. Compared with two-stage methods, YOLO does not require proposals; it forwards the input image once through a convolutional network to directly predict object classes and locations, and it is therefore extremely fast [5]. Combining dense anchor boxes and pyramid features, SSD directly classifies anchor boxes [23]. CornerNet, which uses an hourglass network as its backbone, detects each object bounding box as a pair of keypoints [24].
Many one-stage detectors share similarities with SSD. Similar to the framework of SSD, the network structure of the Deeply Supervised Object Detector (DSOD) is divided into two parts: a backbone sub-network for feature extraction and a front-end sub-network for prediction over multi-scale response maps [16]. RetinaNet uses feature pyramids, as in SSD, to address the extreme foreground-background class imbalance encountered when training dense detectors [25]. By attaching Context Enhancement Blocks (CEBs) to the shallow layers of SSD, an advanced one-stage detector, CEBNet, is proposed in [26].
In this study, we focus on improving Faster R-CNN using local keypoints. Although DeNet [21] and CornerNet [24] are point-based CNN methods, their pipelines are complex. As a two-stage method, DeNet needs to jointly optimize the costs of both stages, i.e., the corner probability distribution, the final classification distribution, and the bounding box regression cost [21]. The network of CornerNet may be divided into three modules: an hourglass backbone network and two prediction streams [24]. In our design, the network modules are similar to those of Faster R-CNN. The main task of K-Faster is to couple keypoint anchors with Faster anchors to cover the deficiency of Faster R-CNN, which initializes anchor boxes with preset aspect ratios and sizes.

Overview of methodology
The main design of our proposed method is to use local information to generate candidates. In this study, we employ keypoint anchors to achieve this design. Figure 2 shows the main design of our method. Figure 2a shows the input image I with a size of W × H. Figure 2b shows the produced keypoints. Figure 2c shows all the resulting anchor boxes, which total C_N^2 = N(N − 1)/2, where N is the total number of keypoints. To train an RPN, a positive or negative anchor with a high or low Intersection over Union (IoU) overlap with a ground-truth box, called an Anchor of Interest (AoI), should be assigned a positive or negative label. The bias between the anchor box and the ground truth box, called the ground bias, should be calculated to supervise bounding box regression. To arrange the labels and ground biases into 3-Dimensional (3-D) tensors, which serve as supervision information for the RPN, the anchors in Fig. 2c are partitioned into k_S groups based on the ARoAI, as shown in Fig. 2d. All labels of the AoIs in each group are arranged on a corresponding 2-Dimensional (2-D) feature map, which indicates the convolution results of one channel [27] (Fig. 2e). For every AoI, its label is mapped to a layer position based on the coordinates of the anchor center (Fig. 2e). Every ground bias is parameterized by 4 coordinates and is therefore encoded on 4 feature maps in a similar way, as shown in Fig. 2f. Figure 2g and h are the label and ground bias tensors obtained from Faster R-CNN, respectively. Figure 2i shows the supervision information for the RPN, and Fig. 2i1 and 2i2 show the coupled tensors. Based on the integrated supervision information, our method follows a network similar to that of Faster R-CNN.

Keypoint detection
In order to improve the quality of region proposals, keypoints, which carry local information, are employed to improve the covering ability of the anchors. Generally, a keypoint is coupled with a descriptor. In this work, a sparse descriptor, which first detects the points of interest in a given image and then samples a local patch and describes its invariant features [28], is employed to produce the keypoints.
During the last two decades, a large variety of sparse descriptors have been developed, including the classical Scale-Invariant Feature Transform (SIFT) [29] and Speeded-Up Robust Feature (SURF) [30] descriptors, binary descriptors [31], and learning-based descriptors [32,33]. Although learning-based methods can achieve high accuracy, they usually produce high-dimensional features and are computationally expensive [32,33]; in addition, preparing substantial training data is not a trivial task. In order to produce as many AoIs as possible, a robust keypoint detector that detects as many keypoints as possible on the image is needed. Although binary descriptors usually have a compact presentation, they may be insufficient for producing massive numbers of keypoints, based on the experiments in [34]. Although SURF is more efficient for registration types of tasks, it is also insufficient for producing enough keypoints [34]. In this paper, a Global OPtimized SIFT (GOP), which was demonstrated in [34] to produce many more keypoints, is employed for keypoint detection.
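The paper produces keypoints with GOP [34], for which we do not assume a public implementation; as a rough stand-in, the sketch below detects keypoints with OpenCV's SIFT and merges duplicate detections at the same location, illustrating the interface (an image in, a deduplicated list of (x, y) locations out). The function name and the nfeatures cap are illustrative assumptions, not the paper's.

    # Keypoint detection sketch; SIFT stands in for GOP, which is not
    # assumed to be publicly available.
    import cv2

    def detect_keypoints(image_bgr, n_keypoints=400):
        """Detect up to n_keypoints interest points (N_G in the paper)."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        sift = cv2.SIFT_create(nfeatures=n_keypoints)
        kps = sift.detect(gray, None)
        # A detector may fire several times at one location (e.g., one
        # response per dominant orientation); merge such duplicates so
        # that each location appears once, as described in Section 3.3.
        unique = {(int(round(kp.pt[0])), int(round(kp.pt[1]))) for kp in kps}
        return sorted(unique)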

Generating keypoint anchors
All 2-combinations of the produced keypoints are enumerated to generate anchors. Let the number of keypoints produced by GOP be N_G. Because multiple descriptors may be produced at one keypoint [29], identical keypoints are merged into one keypoint. Let the keypoints after merging be P_K = {p′_0, p′_1, ⋯, p′_{N−1}}. For a 2-combination of points (p′_i, p′_j) extracted from P_K, let its corresponding anchor be a′_ij, the axis-aligned box with p′_i and p′_j as opposite corners. An anchor a′_ij with a small area should be ignored because proposals with small areas are filtered in the stage of proposal refinement [4]. Let A = A_fas ∪ A_des be the final set of anchors, where A_fas is the set of anchors produced by Faster R-CNN and A_des = {a′_ij | s(a′_ij) ≥ T_sa} is the set of anchors produced by keypoints; s(·) denotes the pixel area of its argument, and T_sa is a threshold.
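As a minimal sketch of this rule (assuming, as reconstructed above, that T_sa thresholds the pixel area of the box spanned by the two keypoints), the anchor generation can be written as follows; the names are ours, not the paper's.

    # Enumerate all 2-combinations of keypoints and keep the spanned
    # boxes whose area reaches the threshold T_sa.
    from itertools import combinations

    def make_keypoint_anchors(keypoints, t_sa=16):
        """keypoints: list of (x, y); returns anchors as (x1, y1, x2, y2)."""
        anchors = []
        for (xi, yi), (xj, yj) in combinations(keypoints, 2):
            x1, x2 = min(xi, xj), max(xi, xj)
            y1, y2 = min(yi, yj), max(yi, yj)
            if (x2 - x1) * (y2 - y1) >= t_sa:  # s(a'_ij) >= T_sa
                anchors.append((x1, y1, x2, y2))
        return anchors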

Encoding keypoint anchors
To generate region proposals, a box-regression layer (reg) and a box-classification layer (cls) are used to map the coordinates and scores of the proposals in the framework of Faster R-CNN [4]. In this study, we integrate keypoint and Faster R-CNN anchors to design reg and cls. The labels and ground biases induced by the keypoints should be mapped into 3-D tensors as the supervision information to train reg and cls. Let fm_RPN = {(r, c) | r = 0, 1, ⋯, h − 1; c = 0, 1, ⋯, w − 1} be the 2-D feature map of the last shared convolutional layer, where h and w are respectively the numbers of rows and columns of the map. Let the labels fed to cls be a 3-D tensor, denoted as Lt = {(l, r, c) | l = 0, 1, ⋯, k_c − 1}, where k_c = k_f + k_S, and k_S and k_f are respectively the maximum numbers of possible proposals induced by keypoints and by Faster R-CNN at each point of fm_RPN. Lt may be decomposed into two sub-tensors, as shown in (1):

Lt = Lt_fas ∪ Lt_des,    (1)

where Lt_fas = Lt(l < k_f) is the part of the tensor induced by Faster R-CNN (Fig. 2g) and Lt_des = Lt(k_f ≤ l < k_c) is the part induced by the keypoints (Fig. 2e). Furthermore, the 3-D Lt_des may be expressed as a set of 2-D feature maps, as shown in (2):
Lt_des = {fm_des^t | t = 0, 1, ⋯, k_S − 1},    (2)

where fm_des^t is the feature map of the t-th channel. In order to arrange the labels and ground biases in 3-D tensors, we partition the set of keypoint anchors A_des into k_S groups, as shown in (3):

A_des = ∪_{t=0}^{k_S − 1} A_des^t,  where a_ij ∈ A_des^t if S_T/2^{t+1} < s(a_ij) ≤ S_T/2^t,    (3)

where S_T is a threshold that is the base size of the image set.
Combining (2) and (3), we map a_ij = (x_i, y_i, x_j, y_j) ∈ A_des^t to the t-th feature map fm_des^t at the position p_ij^c, i.e., the grid point that the center of a_ij is convolved to. The ground bias tensor Bt = {(l_b, r, c) | l_b = 0, 1, ⋯, 4k_c − 1} may be decomposed into two bias tensors, as shown in (4):

Bt = Bt_fas ∪ Bt_des,    (4)
where Bt_fas = Bt(l_b < 4k_f) is the part of the tensor induced by Faster R-CNN (Fig. 2h) and Bt_des = Bt(4k_f ≤ l_b < 4k_c) is the part induced by the keypoints (Fig. 2f). Bt_des may be expressed as (5):

Bt_des = {fm_des^{4t}, fm_des^{4t+1}, fm_des^{4t+2}, fm_des^{4t+3} | t = 0, 1, ⋯, k_S − 1}.    (5)

Combining (3) and (5), for any a_ij ∈ A_des^t, we map its ground biases to fm_des^{4t}, fm_des^{4t+1}, fm_des^{4t+2}, and fm_des^{4t+3} at the position p_ij^c. For a_ij ∈ A_fas, the mappings of Lt_fas and Bt_fas are similar to those in [4]. After the labels and ground biases of all a_ij are mapped into their corresponding tensors, the 3-D tensors Lt and Bt serve as supervision information to train the RPN.
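The sketch below illustrates the label encoding under the assumptions reconstructed above: the group index t follows the halving rule of (3), and the anchor center is mapped to the feature-map grid by the backbone stride (16 for VGG16). The rounding details and the stride value are our assumptions.

    import numpy as np

    def encode_keypoint_labels(anchors, labels, w, h, k_S,
                               s_T=600 * 1000, stride=16):
        """Arrange anchor labels in a (k_S, h, w) tensor Lt_des."""
        lt_des = -np.ones((k_S, h, w), dtype=np.int64)  # -1: not an AoI
        for (x1, y1, x2, y2), label in zip(anchors, labels):
            area = max((x2 - x1) * (y2 - y1), 1)
            # Halving rule of (3): group t holds areas in
            # (S_T/2^(t+1), S_T/2^t], so a greater ARoAI lands on a
            # lower channel index.
            t = min(max(int(np.floor(np.log2(s_T / area))), 0), k_S - 1)
            # Map the anchor center to the grid of the last shared layer.
            col = min(((x1 + x2) // 2) // stride, w - 1)
            row = min(((y1 + y2) // 2) // stride, h - 1)
            lt_des[t, row, col] = label
        return lt_des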

The encoding steps
In this Section, we present the encoding steps of our proposed method. Table 1 lists the input variables and comments used in this Section. P_K and k_S are two key parameters of our method, k_f is the parameter for the anchors of Faster R-CNN, and b_gt, w, and h are parameters produced by the input image. The remaining four variables are thresholds for RPN training.

Table 1 Input variables:
P_K: the set of merged keypoints
k_S: the number of channels of the label tensor induced by keypoints
k_f: the number of channels of the label tensor induced by Faster R-CNN
b_gt: the array of ground truth boxes, with the size of 4 × N_gt
w: the number of columns of the last shared convolutional tensor
h: the number of rows of the last shared convolutional tensor
r_RPN_fg: the ratio of the anchors that are labeled as foreground
Table 1 also lists the IoU thresholds for positive and negative anchors and the threshold on the number of anchors for RPN training.
For convenience, Fig. 3 illustrates our mapping. Let the Faster anchors be an array A_fas ∈ N^{N_af × 4} and the keypoint anchors be an array A_des ∈ N^{N_ad × 4}. Then the array of all the anchors, coupled vertically from A_fas and A_des (Fig. 3a), is given in (6):

A = [A_fas; A_des].    (6)

Let A′ be the array that results from A after removing the anchors that are not inside the image (Fig. 3b), as given in (7):

A′ = {a ∈ A | a lies inside the image}.    (7)

Since the keypoints lie inside the image, only Faster anchors are removed, leaving n_af of them. Let a query list between A and A′ be qA, where the k-th anchor in A′ corresponds to the qA_k-th anchor in A (Fig. 3a and b). The labels of A′ are initialized to lA′ = (lA′_fas, lA_des), where lA′_fas ∈ Z^{n_af} and lA_des ∈ Z^{N_ad}.
Let Lt¹_fas ∈ Z^{whk_f} and Lt¹_des ∈ Z^{whk_S} be 1-Dimensional (1-D) label tensors, which are respectively reshaped from the 3-D tensors Lt_fas and Lt_des in the dimension order of channel, width, and height, as shown by the red arrowed line in Fig. 3d. Then the 1-D label tensor of Lt is Lt¹ = (Lt¹_fas, Lt¹_des). In order to map the anchors in the list A_des to the 1-D tensor Lt¹_des introduced in Section 3.4, a query list lq is designed to map the ind(a_{i,j})-th anchor in A_des to the lq(ind(a_{i,j}))-th label in the 1-D tensor, as given in (8). Figure 3c-e illustrate the mapping from the list of keypoint anchors to the label tensors. Take the 0-th keypoint anchor a_01 (Fig. 3c) for example: its label is arranged in the 3-D tensor Lt_des at (x^c_01, y^c_01, t), shown as the yellow square in Fig. 3d, where (x^c_01, y^c_01) = p^c_01 corresponds to the center coordinates of a_01 and t is obtained from the ARoAI. After fetching the elements along the red arrowed line in Fig. 3d, the 3-D tensor Lt_des (Fig. 3d) is reshaped to the 1-D tensor Lt¹_des (Fig. 3e), and the label of the 0-th keypoint anchor (the blue item in Fig. 3c) is arranged at the lq(0)-th position of Lt¹_des (Fig. 3e).

Let the overlap array generated by the anchors and the ground truth boxes be O ∈ R^{(n_af + N_ad) × N_gt}, where o_ij is the IoU overlap between the i-th anchor of A′ and the j-th ground truth box in b_gt. Let the row maximums of O be a vector r_max ∈ N^{n_af + N_ad}, i.e., the k-th element r_k of r_max is the maximum of the k-th row of O; let the anchor indexes of the column maximums of O be the vector c_max ∈ N^{N_gt}, i.e., the k-th element c_k of c_max is the index of the maximum of the k-th column of O. The indexes pa(i) of the anchors with positive labels in A′ for RPN training are obtained in steps (a)-(c): all the indexes of the anchors with the positive label 1, i.e., ind(lA′ = 1), are collected from the anchors whose row maximum exceeds the positive IoU threshold, together with the column-maximum anchors in c_max. The indexes na(i) of the anchors with negative labels in A′ are obtained analogously in steps (d)-(f) using the negative IoU threshold. Based on pa_des(i) and na_des(i), i.e., the positive and negative indexes of A_des in A′ (Fig. 3b), the 1-D label tensor Lt¹_des is obtained using the query list lq:

i. For every pa_des(i), the label of the (pa_des(i) − n_af)-th anchor in A_des is 1; therefore, let the lq(pa_des(i) − n_af)-th label in Lt¹_des be 1, i.e., Lt¹_des(lq(pa_des(i) − n_af)) = 1.
j. For every na_des(i), Lt¹_des(lq(na_des(i) − n_af)) = 0.
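As a concrete illustration, the sketch below builds the overlap array O and applies the standard Faster R-CNN assignment rule, which we assume for the partially lost steps (a)-(f); the threshold values 0.7 and 0.3 are the defaults of [4], not necessarily the paper's.

    import numpy as np

    def assign_rpn_labels(anchors, gt_boxes, pos_iou=0.7, neg_iou=0.3):
        """anchors: (n, 4) array; gt_boxes: (m, 4) array; labels in {1, 0, -1}."""
        # Pairwise IoU overlap array O of shape (n, m).
        x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
        y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
        x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
        y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
        area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
        O = inter / (area_a[:, None] + area_g[None, :] - inter + 1e-10)

        labels = -np.ones(len(anchors), dtype=np.int64)  # -1: ignored
        r_max = O.max(axis=1)         # best overlap per anchor (r_max)
        c_max = O.argmax(axis=0)      # best anchor per ground truth (c_max)
        labels[r_max < neg_iou] = 0   # negatives first ...
        labels[r_max >= pos_iou] = 1  # ... then positives override
        labels[c_max] = 1             # column maximums are always positive
        return labels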
For the 2-D array of ground biases Bt² ∈ R^{wh(k_f + k_S) × 4}, it is initialized to Bt² = 0. In order to evaluate the loss of the bounding box regression, the regressions of the anchors with positive and negative labels are weighted; let W²_in and W²_out be the corresponding 2-D weight arrays. The encoding steps of the supervision information for RPN training are as follows, with Lt, Bt, W_in, and W_out as outputs: Lt and Bt are the label and ground bias tensors, respectively, and W_in and W_out are weight tensors used to evaluate the loss of the bounding box regression.
1. Generate the array of Faster anchors A_fas.
2. Generate the keypoint anchors A_des based on Section 3.3.
3. Obtain A and A′ based on (6) and (7), respectively.
4. Build a query list qA between A and A′.
5. Build a query list lq(ind(a_{i,j})) from the anchors in A_des to the labels in the 1-D tensor Lt¹_des based on (8).
6. Compute the overlap array O, and obtain r_max and c_max.
7. Obtain the indexes of the anchors with positive and negative labels in A′ for RPN training based on (a)-(c) and (d)-(f), respectively.
8. Assign the anchor labels to Lt¹_fas and Lt¹_des based on (g)-(h) and (i)-(j), respectively.
9. Reshape Lt¹_fas and Lt¹_des to the 3-D tensors Lt_fas and Lt_des, respectively, in the dimension order of channel, width, and height. Couple the two tensors to Lt, as shown in (1).
10. Compute the ground biases of all the anchors in A′ and assign them to Bt² using the query lists qA and lq.
11. Obtain W²_in and W²_out.
12. Separate Bt², W²_in, and W²_out vertically into two parts of sizes whk_f × 4 and whk_S × 4, respectively. Reshape and couple them to the 3-D tensors Bt, W_in, and W_out similarly to Step 9.
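Steps 9 and 12 amount to reshape-and-concatenate operations; a minimal sketch is given below. The (channel, height, width) axis order and the row ordering of Bt² are assumptions, since the exact memory layout is framework-dependent.

    import numpy as np

    def couple_label_tensors(lt1_fas, lt1_des, w, h, k_f, k_S):
        """Step 9: reshape the 1-D label tensors to 3-D and couple them."""
        lt_fas = lt1_fas.reshape(k_f, h, w)
        lt_des = lt1_des.reshape(k_S, h, w)
        return np.concatenate([lt_fas, lt_des], axis=0)  # Lt, as in (1)

    def couple_bias_tensors(bt2, w, h, k_f, k_S):
        """Step 12: split Bt^2 vertically, reshape, and couple to Bt."""
        bt_fas = bt2[: w * h * k_f].reshape(4 * k_f, h, w)
        bt_des = bt2[w * h * k_f:].reshape(4 * k_S, h, w)
        return np.concatenate([bt_fas, bt_des], axis=0)  # Bt, as in (4)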

Experiments
We evaluate our method on three datasets: PASCAL VOC 2007 [8], PASCAL VOC 2012 [8], and MS COCO [6]. Our experiments are implemented within the framework of Faster R-CNN [4]. Both VGG16 [11] and ResNet101 [13] are employed as backbone networks. The VGG16-based and ResNet101-based experiments are carried out on Caffe [35] and TensorFlow [36], respectively. We train and test the networks on images of a single scale in which the shorter side is s = 600 pixels [4]. The publicly available VGG16 and ResNet101 models pre-trained on ImageNet [2,7] are used for initialization. We use a 1-GPU implementation, and thus the mini-batch size of the RPN is 1. The models are trained starting from conv3_1 using an end-to-end schedule. The momentum is set to 0.9. The weight decay of K-Faster based on VGG16 is set to 0.0005, and that of K-Faster based on ResNet101 is set to 0.0001. The mAP is primarily used to evaluate detection performance.

Implementation detail
We tune the parameters on PASCAL VOC. The models are trained on the union set of VOC 2007 trainval and VOC 2012 trainval ("07+12") and evaluated on the VOC 2007 test set. We initialize the learning rate to 0.001 and decay it by a factor of 10 every 50k iterations on the 07+12 dataset. A total of 140k training iterations are run.
In addition to the thresholds T_sa and S_T, two main parameters, N_G and k_S, are introduced in this work. N_G and k_S are the number of keypoints and the number of channels of the keypoint-based label tensor, respectively. In our design, T_sa is set to 16 to filter out small keypoint anchors, and S_T is set to 600 × 1000, the product of the lower and upper boundaries of the input image size. For k_S, it is appropriate to set it to 16 when S_T = 600 × 1000 because anchors smaller than 16 × 16 pixels are convolved to a single point in the last shared convolutional layer; in detail, the anchors with an area less than 600 × 1000/2^15 fall into the last group.

Table 2 summarizes the key results on the seven models. The test results with the greatest mAP are listed in the mAP line of Table 2, and the lines Rec, Pre, Nt_pre_NMS, and Nt_post_NMS are respectively the recalls, precisions, and testing parameters that correspond to the greatest mAP. Together with the ground truth, the true positive and false positive samples are used to calculate the recall and precision for every class. A predicted detection is regarded as a true positive if the predicted class label is the same as the ground truth label and the IoU between the predicted bounding box and the ground truth box is greater than 0.5; otherwise, the detection is a false positive. The Rec and Pre results listed in Table 2 are the averages of the recalls and precisions over all the classes.
As N_G grows, the recall rate ranges roughly from 88.4% down to 86.1%, while the precision rises from 19.9% to 30.0%. A greater N_G thus favors precision at the cost of a lower recall rate. Although N_G = 160, 200, and 400 all obtain a similar greatest mAP of 76.2%, N_G = 400 brings the greatest precision (Table 2). Therefore, N_G = 400 is chosen for K-Faster.

RPN parameters
In this Section, we implement experiments on 07+12 using the VGG16-based network to tune N_RPN_BS, N_RoI, N_pre_NMS, and N_post_NMS. For N_G = 400, the number of induced keypoint anchors ranges roughly from 40k to 60k, and the number of keypoint anchors with IoU ≥ 0.7 ranges from zero to several hundred. We tune the four parameters of K-Faster in the range of one to two times those used in Faster R-CNN. For simplicity, the parameters Nt_pre_NMS and Nt_post_NMS used for NMS in the detection stage are set to 9k and 300, respectively. Table 3 shows our main experimental results.
As shown in the first and second lines of Table 3, the experimental levels of N_RPN_BS differ while the other three parameters are the same. Both the recall and precision obtained with N_RPN_BS = 448 are greater than those obtained with N_RPN_BS = 512; therefore, N_RPN_BS = 448 yields a higher mAP. The one-factor-at-a-time experiments on N_RoI are listed in the first and third lines of Table 3, showing that N_RoI = 256 is more appropriate than 196. Similarly, N_pre_NMS = 24k and N_post_NMS = 4k yield higher mAPs. Extended experiments on the same four parameters based on ResNet101 show similar results, which is partially due to the identical structure of their RPNs. Overall, the RPN parameters are set to N_RPN_BS = 448, N_RoI = 256, N_pre_NMS = 24k, and N_post_NMS = 4k.

Results on PASCAL VOC
In this Section, we demonstrate that local information is helpful for object detection. We evaluate K-Faster on the PASCAL VOC 2007 detection benchmark [8]. We initialize the learning rate to 0.001 and decay it by a factor of 10 every 80k iterations on the 07+12 dataset. The networks based on VGG16 and ResNet101 are run for 180k and 200k training iterations, respectively. Table 4 shows our experimental results on the test set of VOC 2007. The rows K-Faster16 and K-Faster101 show the results of our method using VGG16 and ResNet101 as the backbone networks, respectively. The notation "-" indicates that the corresponding result is unavailable.
Keypoint anchors provide extra auxiliary discrimination. As shown in Table 4, our K-Faster16 and K-Faster101 achieve mAPs of 77.2% and 80.5% on the PASCAL VOC 2007 test set, respectively. Compared with the baseline Faster R-CNN, the corresponding improvements in mAP are 4.0% and 4.1%, respectively. Because K-Faster shares a similar framework with Faster R-CNN, we attribute the gains to the proposed keypoint anchors. In addition, K-Faster outperforms many state-of-the-art models, including ION [18], SSD512 [23], and YOLOv2 [5]. Our ResNet101-based results are comparable to R-FCN [9], which also uses ResNet101 as the backbone network. Although CoupleNet [10], CEBNet [26], and MLKP101 [19] show superior mAPs, the margin of MLKP over K-Faster is insignificant. As for CoupleNet, it uses two branches, i.e., a local FCN and a global FCN, to improve ResNet101-based R-FCN. Compared with R-FCN, which achieves an mAP of 80.5%, CoupleNet obtains an mAP of 82.7%, as shown in Table 4; this improvement of 2.2% is less than the improvement from Faster101 to K-Faster101. CEBNet embeds six layers into the SSD detector, including four CEB modules, each consisting of two sub-networks. Compared with SSD512, CEBNet512 improves the mAP by 5.7%; however, the network of CEBNet is much more complicated than SSD. Figure 4 shows some results on the PASCAL VOC 2007 test set. The implementation model is K-Faster16 (77.2% mAP), and a score threshold of 0.6 is used to draw the detection bounding boxes. The blue and red colors respectively show the detections launched by Faster and keypoint anchors. Figure 5 shows the detection ratios launched by keypoint anchors on the VOC 2007 test set using the K-Faster16 model; the detection ratio is the proportion of the objects detected by keypoint anchors to the total detections. Both Figs. 4 and 5 demonstrate that keypoint anchors are helpful for object detection.
Because the main task of object detection involves not only classification but also localization, we add an evaluation of localization. In order to investigate the positioning precision of the detections, we evaluate the mean IoUs of the true positive detections (IoU ≥ 0.5). Figure 6 shows the positioning precision on the 20 classes of the PASCAL VOC 2007 test set; Fig. 6a and b respectively show the mean IoUs obtained with the VGG16 and ResNet101 backbones. The results of K-Faster and Faster R-CNN are shown in red and gray, respectively. The mean IoUs of K-Faster and Faster R-CNN over the 20 classes in Fig. 6a are 78.2% and 77.8%, respectively, and those in Fig. 6b are 81.7% and 81.5%, respectively. Both mean IoUs of K-Faster are greater than those of Faster R-CNN. Overall, our K-Faster improves the positioning precision of the detections.
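For reference, this positioning-precision measure reduces to a mean over matched detections. The sketch below assumes hypothetical inputs: each detection's IoU with its matched ground truth box, plus the predicted and ground-truth class labels.

    import numpy as np

    def mean_tp_iou(ious, pred_labels, gt_labels, thr=0.5):
        """Mean IoU over true positives (correct class and IoU >= thr)."""
        ious = np.asarray(ious, dtype=float)
        tp = (ious >= thr) & (np.asarray(pred_labels) == np.asarray(gt_labels))
        return float(ious[tp].mean()) if tp.any() else 0.0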
Taking Faster R-CNN as a benchmark, we evaluate the runtime of K-Faster using a dual-core i3-4160 CPU and an NVIDIA GTX1080 GPU. We summarize the test time (ms per image) of Faster R-CNN and K-Faster on the PASCAL VOC 2007 test set in Table 5. "Keypoint" and "Proposal" are respectively the runtimes for generating keypoints and proposals on the CPU, and "GPU" is the runtime of the convolution, pooling, fully connected, and softmax layers. As shown by the total runtime in Table 5, K-Faster is slower than Faster R-CNN for the same Nt_post_NMS. The inferiority is largely due to the time cost of Keypoint: it takes K-Faster about 100 ms to generate keypoints on the CPU. However, K-Faster has an advantage in GPU runtime. As shown in Table 5, the runtimes of K-Faster on the GPU are 80.6, 202.0, 136.7, and 386.1 ms, and the first three are less than those of Faster R-CNN. This is partially due to the improved quality of the proposals. Overall, K-Faster is competitive with Faster R-CNN in terms of GPU runtime.
As shown in Table 6, we also compare our method with the state-of-the-art methods on the test set of VOC 2012.
The K-Faster16 and K-Faster101 models are run for 260k and 280k training iterations, respectively, and the learning rate is decayed by a factor of 10 every 100k iterations on the 07++12 dataset. Because the labels of the VOC 2012 test set are not released, the evaluation results in Table 6 are produced by the PASCAL VOC evaluation server. Except for CoupleNet, K-Faster101 achieves the greatest mAP of 77.7%. Compared with the standard Faster R-CNN, K-Faster16 improves the mAP by 3.5% and K-Faster101 by 3.9%. MLKP, which outperforms K-Faster on the PASCAL VOC 2007 test set, is inferior to K-Faster on the PASCAL VOC 2012 test set. Keypoint anchors are thus also helpful for object detection on the PASCAL VOC 2012 test set.

Results on MS COCO
In this Section, we present experimental results on the Microsoft COCO object detection dataset. COCO involves 80 object classes. The dataset consists of 80k images for training (train2014), 40k images for validation (val2014), and 20k images for testing (test-dev2015). We use the train + val (trainval) split to train our model and report COCO AP on the test-dev set, which has no public labels and requires evaluation from the COCO server. The COCO standard metric is denoted as AP, the average precision evaluated at IoUs in [0.5 : 0.05 : 0.95]. AP_50 and AP_75 are the APs evaluated at IoU thresholds of 0.5 and 0.75, respectively. AR_1, AR_10, and AR_100 are the average recalls given 1, 10, and 100 detections per image, respectively. AP_s, AP_m, and AP_l are the APs for small (area ≤ 32^2), medium (32^2 < area ≤ 96^2), and large (area > 96^2) objects, respectively; AR_s, AR_m, and AR_l are the analogous notations for average recall. The learning rate is initialized to 0.001 and is decayed by a factor of 10 every 550k iterations until the iterations reach 1400k. Table 7 shows our results on COCO. The training set trainval35k is the union of the 80k train images and a random 35k subset of the val images. All the results are reported on the test-dev split except for Faster-Res101*, which is reported on the val split. For a fair comparison, we re-implement ResNet101-based Faster R-CNN [13] on the trainval split; the learning rate is initialized to 0.001 and reduced by a factor of 10 after 600k iterations until the iterations reach 800k [37]. The test results are shown as Faster-Res101† in Table 7. Although Table 7 lists results on three different training sets (train, trainval, trainval35k), comparisons are reasonable within groups.
As shown in Table 7, K-Faster16 and K-Faster101 respectively achieve the APs of 26.5% and 34.7% on the trainval set.
Both of them outperform the corresponding standard Faster R-CNN; their APs are 3.6 and 3.4 points higher, respectively. Although K-Faster101 is inferior to CEBNet512, CoupleNet, and MLKP on the test sets of VOC 2007 and VOC 2012 (Tables 4 and 6), K-Faster101 outperforms them on the test set of COCO (Table 7). It should be noted that K-Faster cannot outperform Mask R-CNN and CornerNet on COCO. Mask R-CNN is a segmentation method designed for pixel-to-pixel alignment using FPN-based ResNet101; its advantage in box detection is partially due to the benefits of the segmentation branch and multi-task training [12]. CornerNet integrates many techniques in training, which favors its high performance: besides its multi-stream design, data augmentation techniques and Principal Component Analysis (PCA) are applied to the input images, and an optimized training loss is used. It seems that a method integrating several techniques is more likely to achieve high performance. Although K-Faster cannot outperform all the listed state-of-the-art methods, keypoint anchors are helpful for object detection on the more challenging COCO dataset.

Conclusion

In this paper, a local keypoint-based Faster R-CNN is proposed. The 2-combinations of the produced keypoints are selected to generate anchors, and an area-based technique is designed to encode the keypoint anchors with a heterogeneous distribution. The keypoint anchors are coupled with Faster anchors to improve object detection. With the coupled anchors, our K-Faster approach not only increases the mAP performance but also improves the positioning precision of the detected boxes. In future work, we first plan to improve detection performance using geometric knowledge, since knowledge such as global structure and context may be helpful for detection. Second, we plan to improve CNN-based methods with the help of intuitionistic fuzzy sets; we hope to annotate objects using membership and non-membership classifications and to design a dual network for object detection.