Yolo-like network for vehicle detection using KITTI dataset
Vivek Yadav, PhD
Disclaimer: This series of posts outlines steps for implementing YOLO9000 (or YOLOv2) from scratch in TensorFlow. YOLOv2 has become my go-to algorithm because the authors correctly identified the majority of the shortcomings of the original YOLO model and made specific changes to address them. Further, YOLOv2 borrows several ideas from other network designs, which makes it more powerful than models like Single Shot Detection (SSD). Please note, I did not use any previously published implementation of YOLOv2 or YOLO; all the code and features implemented here are based on my reading of the object detection papers. I will point out the differences between YOLOv2 and my implementation as they come up. Also, if you spot any errors, please be patient and let me know so I can correct them. I will share notebooks as and when I get time.
Previously I wrote a post on using a Single Shot Detection network for vehicle detection, after which I got a lot of requests to share the code. While compiling that code, I learned about YOLO9000 and found it very interesting and refreshingly simple, so I decided to implement it. This series of posts outlines the work I did over the last 2 weeks, and the work I will do in the coming weeks.
YOLO9000 is a combined classification and detection framework that makes predictions in real time and is on par with state-of-the-art detection frameworks. In YOLO9000, a high-resolution image is passed through convolution layers that learn dataset-specific features. The architecture uses an anchor box (or prior) approach, where bounding box corrections are predicted for 5 anchor boxes located at each superpixel of the down-sampled convolution feature map. In this series of posts, I will describe the steps I took to implement YOLO9000 from scratch in TensorFlow. This will be a series of multiple posts, and I will cover as many concepts of the object detection framework as I can. The coming posts are organized as follows (the actual number may vary).
- Getting anchor boxes (Link here): YOLO9000, like SSD, predicts corrections on top of anchor boxes of different shapes and aspect ratios. However, in YOLO9000 the anchor boxes are computed directly from the data instead of being hand-picked beforehand. One way to generate these anchor boxes is clustering: YOLO9000 uses k-means with 5 centroids, with 1 minus intersection over union (IOU) as the distance measure. The number 5 was chosen by varying the number of centroids and picking the value that gave the best trade-off between mean IOU and number of centroids. In this post, I will go over the steps needed to load the image data and compute candidate anchor boxes using k-means clustering.
- Preprocessing ground truth bounding box and image data, making a test network (to write): In YOLO9000, the output of the network is a large convolution map in which each filter corresponds to a specific prediction. For each anchor box, 5 values are predicted describing the quality of the anchor box: 4 correspond to the error between the ground truth bounding box and the anchor box, and 1 is a prediction of the IOU between the predicted box and the ground truth. In addition, class scores are predicted for the object at each grid point.
- The next step is to preprocess the ground truth bounding box labels and image data to generate target predictions (to write). Note that in YOLO9000 all the layers are convolution layers. I have seen implementations where users predict a fully connected layer and then reshape it into a convolution-type layer to compute losses and predictions, but I will stay true to the original, fully convolutional implementation.
- Overfitting a deep learning framework for detection and localization (to write): This is perhaps the most crucial, and most overlooked, step in designing a neural network model. The objective here is to design a neural network and fit the input-output mapping for a training set comprised of a single entry. If a neural network cannot overfit and predict the ground truth for a single image, it is most likely to fail on the full dataset; note that the converse is not true. It is always easier to overfit a model on your dataset and then add regularization to avoid overfitting than to start with a network with poor representation power and modify it.
- Data augmentation (to write): Data augmentation can be used to generate new 'unseen' data. Augmentation for detection is more difficult than for classification, because any geometric transform applied to the image must also be applied to the bounding box labels.
- Using a pretrained classifier (to write): The last step is to use a pretrained classifier, like VGG16 or an Inception model, to precompute bottleneck features (the output of the last layer of the base model), and then use these features to train only the convolution layers for the final bounding box prediction and classification. This makes training much faster, and because the bottleneck features are precomputed, larger batch sizes can be used.
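To make the output layout described above concrete, here is a small sketch of the shape of the final convolution map. The grid size and class count are my own illustrative assumptions (13x13 is common for YOLOv2 at 416x416 input; 3 classes is a plausible KITTI subset such as car, pedestrian, cyclist), not values taken from this post:

```python
import numpy as np

# Hypothetical sizes for illustration only; the real values depend on the
# input resolution and the number of KITTI classes used.
GRID_H, GRID_W = 13, 13   # spatial size of the final convolution map
NUM_ANCHORS = 5           # anchor boxes per grid cell, as in YOLO9000
NUM_CLASSES = 3           # e.g. car, pedestrian, cyclist (assumption)

# Per anchor: 4 box corrections + 1 IOU/objectness score + class scores
depth = NUM_ANCHORS * (5 + NUM_CLASSES)
output = np.zeros((GRID_H, GRID_W, depth), dtype=np.float32)
print(output.shape)  # (13, 13, 40)
```

Every prediction the network makes thus lives at a fixed offset inside this single tensor, which is why the whole pipeline can stay fully convolutional.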
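As a minimal illustration of why detection augmentation is harder than classification augmentation, here is a horizontal flip that also mirrors the box labels. The helper name and box format ((N, 4) arrays of [x1, y1, x2, y2] in pixels) are my own assumptions, not code from this post:

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and mirror its bounding boxes.

    image: H x W x C array; boxes: (N, 4) array of [x1, y1, x2, y2] in pixels.
    """
    W = image.shape[1]
    flipped = image[:, ::-1, :]          # flip the pixel columns
    fb = boxes.astype(np.float64).copy()
    fb[:, 0] = W - boxes[:, 2]           # new x1 = W - old x2
    fb[:, 2] = W - boxes[:, 0]           # new x2 = W - old x1
    return flipped, fb
```

A classification augmentation would stop after flipping the pixels; here the labels must be transformed consistently, and transforms like rotation or cropping require even more care (boxes can be clipped or fall outside the image entirely).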
The next post will detail the steps involved in computing the anchor boxes.
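As a preview, the clustering idea can be sketched as below. This is my own minimal version under stated assumptions: boxes are represented as (width, height) pairs (e.g. normalized by image size), the distance is 1 minus IOU computed as if all boxes shared the same center, and the helper names (`iou_wh`, `kmeans_iou`) are hypothetical, not from the post:

```python
import numpy as np

def iou_wh(box, clusters):
    """IOU between one (w, h) box and k cluster (w, h) centroids,
    treating all boxes as if centered at the same point."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_iou(boxes, k=5, seed=0, iters=100):
    """k-means on (w, h) pairs with distance = 1 - IOU."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the nearest centroid (smallest 1 - IOU)
        assign = np.array([np.argmin(1 - iou_wh(b, clusters)) for b in boxes])
        # move each centroid to the mean of its assigned boxes
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else clusters[j] for j in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters
```

Using shape-only IOU as the distance, rather than Euclidean distance on (w, h), keeps large boxes from dominating the clustering, which is the point of the 1 minus IOU measure.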
PS: I am writing these posts while traveling. As a result, I don't have access to my Linux machine, so some details may be missing. If you find any place that needs more clarification, please let me know and I will add it. Also, I will release all the code once I get my Linux computer back.
PPS: I have completed all the coding and testing of the model, so I am hoping to have all of these posts out soon. I have included the status of the work in brackets, and will keep adding links as I write more.