Vehicle detection using LIDAR: EDA, augmentation and feature extraction (Udacity/Didi challenge)

Vivek Yadav
Published in Chatbots Life · Apr 7, 2017

### THIS DOCUMENT IS NOT COMPLETE YET. I NEED TO ADD MORE GRAPHS AND DATA PREPROCESSING STEPS, WILL UPDATE THIS AS I GO ###

This is the first in a series of posts that will detail my approach to real-time vehicle detection using LIDAR data. The challenge is hosted by Udacity and Didi, and more details can be found here. One of the primary objectives of the challenge is to have detection run in real time (10 Hz) on an i7 CPU with one Titan X GPU. Before getting into models and designing algorithms for any application, it is crucial to understand the data itself. As I had never worked with LIDAR data, I chose to spend time understanding LIDAR data and the computational nuances involved. In this post, we will go over visualization of the KITTI data, LIDAR augmentation methods and some preprocessing to generate feature maps.

As I did not have direct access to a LIDAR, I chose to work with the KITTI dataset first. Therefore, the bulk of my initial work focuses on the KITTI dataset.

KITTI dataset

The KITTI dataset is freely available data collected from a car driving in an urban environment. The KITTI car has 4 cameras (2 stereo color and 2 stereo grayscale), Velodyne's 64-laser HDL-64E LIDAR and a GPS/IMU. In addition, calibration data are provided so that transformations between the Velodyne (LIDAR), IMU and camera frames can be computed.

KITTI car

The KITTI data set is composed of images from the 4 cameras, annotated 3D boxes, LIDAR data and telemetry data from the GPS/IMU. Several benchmark problems are posed using the KITTI data set. More details can be found in the video below,

Before getting into exploratory data analysis, I will first define what LIDAR is and how LIDAR works.

LIDAR

LIDAR can be defined as follows: LIDAR, which stands for Light Detection and Ranging, is a remote sensing method that uses light in the form of a pulsed laser to measure ranges (variable distances) to the Earth. These light pulses, combined with other data recorded by the airborne system, generate precise, three-dimensional information about the shape of the Earth and its surface characteristics (reference).

LIDAR is a time-of-flight sensor: a laser pulse is fired, the reflected pulse is recorded by a diode, and the distance is then estimated as half the product of the speed of light and the time it takes for the pulse to return. It is also not uncommon to have a single laser unit and rotate/swivel it to construct a full 3D view. This however is very slow, and applicable only in static environments. A nice illustration of how a single laser unit can be used for 3D scanning can be found in the video below,
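As a quick sanity check on this range equation, here is a minimal sketch (my own illustration, not code from KITTI or the challenge) that converts a round-trip time into a range:

```python
# Toy example: convert a time-of-flight measurement into a range.
SPEED_OF_LIGHT = 299792458.0  # m/s

def tof_to_range(round_trip_time_s):
    # The pulse travels to the target and back, so the range is half the
    # distance light covers during the round trip.
    return 0.5 * SPEED_OF_LIGHT * round_trip_time_s

print(tof_to_range(0.33e-6))  # a ~0.33 microsecond round trip is roughly 50 m
```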

The above solution however has very poor temporal resolution. As we want to map the whole environment faster, a quick solution is to stack a bunch of these laser units and rotate them to get a 360-degree view. This is precisely the working principle of the LIDARs used in self-driving cars. The LIDAR on the KITTI car is Velodyne's HDL-64E; the 64 indicates that an array of 64 laser units is used. Velodyne also makes 16- and 32-laser variants, and the Udacity/Didi challenge dataset is collected using the 16-laser VLP-16. A typical LIDAR is composed of an array of laser units, offset by an angle, that fire laser pulses following a predefined protocol. Typically the laser units take turns firing so that only one unit is active at any given time. As each laser unit sends out its pulse at a different time, the LIDAR has poor temporal (time) resolution, especially in scenarios where objects are moving faster than the cycling frequency of the LIDAR units.

Although LIDAR gives a good solution for getting a 3D view, it is prone to errors. Of particular interest is the scenario where there are multiple LIDARs in an environment. Consider an urban scenario where there are several cars, each equipped with its own 64-laser LIDAR. As LIDAR measurements are based on the signal that is reflected back, a crowded environment can make it difficult to determine whether the returned signal is the same as the one that was transmitted. There are ways to work around this, whereby each laser unit sends out an encoded pulse (say 0101001) and waits for that pattern to return. This however degrades the temporal resolution of the LIDAR.
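The encoding idea can be illustrated with a toy matched-filter sketch: correlate the received signal against the transmitted pattern and take the peak as the return time. This is only an illustration of the principle, not how any particular LIDAR implements it; the signal lengths and noise level below are made up.

```python
import numpy as np

# Transmitted pattern (the example 0101001 pulse) buried in a noisy received signal.
code = np.array([0, 1, 0, 1, 0, 0, 1], dtype=float)
received = np.concatenate([np.zeros(20), code, np.zeros(10)])
received += 0.1 * np.random.randn(received.size)

# Matched filtering: the correlation peaks where the received signal best
# matches the transmitted code, which gives the return delay in samples.
corr = np.correlate(received - received.mean(), code - code.mean(), mode='valid')
delay_samples = int(np.argmax(corr))  # ~20 for this toy signal
```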

As a final note, the LIDAR output is a set of points, each with x, y, z coordinates and a reflective intensity. Therefore, two LIDAR 'snapshots' of the same scene may contain different data, because the exact x, y, z values of the returns differ. We will therefore use a voxel-based method to first discretize the space and then use statistics of the points in each voxel to construct feature maps.

Now that we understand more about LIDAR and how it works, we can delve into the KITTI dataset.

Exploratory data analysis (EDA) of KITTI dataset

The first step is to perform exploratory data analysis to better understand how data is represented in the KITTI dataset. It is crucial to note that the 3D bounding box data in the KITTI dataset are provided as tracklet objects in the LIDAR's coordinate frame. However, as the cameras and other sensors are not all located at the same position, there is an offset between these 3D boxes in different frames. The transformations from one frame to another are provided in the KITTI dataset via calibration files. Each calibration file contains a transformation matrix between the camera's and the Velodyne's coordinate axes. These transformation matrices are called homogeneous transformation matrices in robotics; when multiplied by the coordinates of a point, they translate and rotate that point. For more details refer to the Robot Modeling and Control book by Spong and Vidyasagar.
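As a sketch of how these calibration matrices are used, the snippet below applies a 4x4 homogeneous transform to LIDAR points and then projects them into the image plane. The names `T_velo_to_cam` and `P` are placeholders for the matrices read from the KITTI calibration files; this illustrates the math rather than reproducing the exact code I used.

```python
import numpy as np

def velo_to_image(points_velo, T_velo_to_cam, P):
    """Transform Nx3 LIDAR points into the camera frame and project to pixels.

    points_velo   : (N, 3) x, y, z in the velodyne frame
    T_velo_to_cam : (4, 4) homogeneous transform (rotation + translation)
    P             : (3, 4) camera projection matrix
    """
    n = points_velo.shape[0]
    # Append a column of ones so the translation is applied by the matrix multiply.
    points_h = np.hstack([points_velo, np.ones((n, 1))])   # (N, 4)
    points_cam = (T_velo_to_cam @ points_h.T).T            # (N, 4)
    # Keep only points in front of the camera before projecting.
    points_cam = points_cam[points_cam[:, 2] > 0]
    pixels_h = (P @ points_cam.T).T                        # (N, 3)
    return pixels_h[:, :2] / pixels_h[:, 2:3]              # divide by depth
```

The 3D bounding boxes shown in the figures below are projected the same way, by transforming their eight corner points.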

Stereo images from left and right grayscale and color cameras

The figure above presents a scene from the KITTI data set. In the figure below, the ground-truth 3D tracklets are projected from the LIDAR frame onto the camera image.

3D Bounding boxes projected from velodyne to camera frame

We will next plot the LIDAR data from the KITTI data set with the bounding boxes imposed on top. The figure below presents the LIDAR point cloud with the 3D bounding boxes superimposed.

Original KITTI data: vehicles behind the car are not marked

As is evident, the KITTI dataset only has the vehicles in front of the car marked; the cars to the side and behind are not. Therefore, a model trained on such a dataset has the potential to completely ignore vehicles that are not in the field of view of the front camera. To fix this issue, I extracted the forward-facing 120-degree view of the LIDAR data (the region covered by the front camera's annotations) from multiple frames, and stitched 3 of these pieces together to generate a full 360-degree view. A representative example of the augmented LIDAR data is presented below,

Augmented LIDAR data
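A minimal sketch of the stitching step described above, assuming the annotated wedge spans 120 degrees and can simply be rotated about the vertical axis (function and variable names are mine):

```python
import numpy as np

def rot_z(points, angle_deg):
    """Rotate Nx3 (or Nx4 with intensity) points about the vertical (z) axis."""
    a = np.radians(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
    out = points.copy()
    out[:, :3] = points[:, :3] @ R.T
    return out

def stitch_360(wedges):
    """Stitch three forward-facing 120-degree wedges into one 360-degree cloud."""
    return np.vstack([rot_z(w, ang) for w, ang in zip(wedges, (0, 120, 240))])
```

The ground-truth boxes belonging to each wedge have to be rotated by the same angle so the labels stay aligned with the points.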

Once the full 360-degree view of the LIDAR is obtained, the augmented LIDAR data can be further rotated, translated and zoomed to generate different views. I have however not decided whether zooming makes sense in this scenario, as a car is not expected to be twice the size of another car.
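A sketch of that rotation/translation augmentation, applying the same random yaw and planar shift to the point cloud and to the box corners (again an illustration, with made-up parameter names):

```python
import numpy as np

def augment(points, box_corners, max_shift=2.0):
    """Apply one random yaw rotation and x-y shift to points and box corners.

    points      : (N, 4) array of x, y, z, intensity
    box_corners : (B, 8, 3) array of ground-truth 3D box corners
    """
    a = np.radians(np.random.uniform(0.0, 360.0))
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
    shift = np.random.uniform(-max_shift, max_shift, size=2)

    points = points.copy()
    points[:, :3] = points[:, :3] @ R.T
    points[:, :2] += shift

    box_corners = box_corners @ R.T        # broadcasts over the box dimension
    box_corners[:, :, :2] += shift
    return points, box_corners
```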

LIDAR Feature extraction

The feature extraction method proposed here is applied to the LIDAR data.

a) Height Features

As the LIDAR output is a set of points with x, y, z coordinates and reflective intensities, two LIDAR 'snapshots' of the same scene may contain different data, because the exact x, y, z values differ. I therefore chose to use a voxel-based method to first discretize the space and then use statistics of the points in each voxel to construct feature maps. A voxel is a volume unit in space, similar to a pixel in 2D images. I first constrained the space so that the x-dimension (front) and y-dimension (left-right) varied between -30 and 30 m, and the vertical dimension varied between -1.5 and 1 m. I next constructed voxels of width and length 0.1 m and height 0.5 m. I then computed the maximum height in each voxel and used this value as the height of the point cloud in that voxel. This gave a height map of 600x600x5 features. I specifically chose 5 height maps because Udacity's data uses the VLP-16 LIDAR, and a finer discretization could result in height slices without any points.
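A minimal sketch of this height-map computation under the assumptions above (60 m x 60 m grid, 0.1 m cells, 5 vertical slices of 0.5 m; the function name is mine):

```python
import numpy as np

def height_maps(points, x_range=(-30.0, 30.0), y_range=(-30.0, 30.0),
                z_range=(-1.5, 1.0), cell=0.1, n_slices=5):
    """Return a (600, 600, n_slices) map of the maximum height in each voxel."""
    nx = int(round((x_range[1] - x_range[0]) / cell))          # 600
    ny = int(round((y_range[1] - y_range[0]) / cell))          # 600
    slice_h = (z_range[1] - z_range[0]) / n_slices             # 0.5 m per slice

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[keep], y[keep], z[keep]

    # Voxel indices for every remaining point, clamped to the grid.
    ix = np.minimum(((x - x_range[0]) / cell).astype(int), nx - 1)
    iy = np.minimum(((y - y_range[0]) / cell).astype(int), ny - 1)
    iz = np.minimum(((z - z_range[0]) / slice_h).astype(int), n_slices - 1)

    maps = np.zeros((nx, ny, n_slices), dtype=np.float32)
    # Keep the maximum height (above the lower z bound) seen in each voxel.
    np.maximum.at(maps, (ix, iy, iz), z - z_range[0])
    return maps
```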

The height maps above show that the pixels corresponding to the cars appear only in the first two maps, whereas in the other maps these features are absent.

Some random stuff I tried and rejected

1. Adding LIDAR-cars

Another approach I tried was to crop cars out of the point cloud and paste them at different locations to generate more augmented data. Although this seems to be a good way to generate more LIDAR data, it may be overkill, so I will hold off on it until I am at the model-training stage.

Representative cropped cars to paste at different locations to generate more augmented data
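For completeness, a sketch of what the crop-and-paste idea could look like for an axis-aligned box (a hypothetical helper, not code I actually kept):

```python
import numpy as np

def copy_paste_car(points, box_min, box_max, offset):
    """Copy the points inside an axis-aligned 3D box and paste them at an offset.

    points  : (N, 4) x, y, z, intensity
    box_min : (3,) lower corner of the car's bounding box
    box_max : (3,) upper corner of the car's bounding box
    offset  : (3,) translation applied to the copied car
    """
    inside = np.all((points[:, :3] >= box_min) & (points[:, :3] <= box_max), axis=1)
    car = points[inside].copy()
    car[:, :3] += offset
    return np.vstack([points, car])
```

A corresponding ground-truth box, shifted by the same offset, would also need to be added to the labels.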

2. 3D reconstruction using stereo images

I initially thought of fusing LIDAR data with 3D scenes reconstructed from stereo images. However, as the Udacity/Didi data had only one front-facing camera, I quickly rejected this idea. Before rejecting it, though, I got some cool plots, so I would like to share them. I used OpenCV's stereo image processing methods to generate disparity maps, and used them to build a set of 3D points in which the x-y of each pixel gave its location and the depth was computed from the disparity map. I further colored each point by the color of the corresponding pixel. Although I completely discarded this approach, I learned new plotting methods and got to work with point cloud data.

3D reconstruction from 2D stereo images on the KITTI dataset (useless but cool)
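For reference, a rough sketch of this stereo pipeline using OpenCV's semi-global block matcher. The matcher parameters, file names and the Q matrix below are placeholders; in practice Q comes from the KITTI stereo calibration.

```python
import cv2
import numpy as np

# Left/right images of one stereo pair (file names are placeholders).
left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; parameters are illustrative, not tuned.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=9)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # SGBM output is fixed-point

# Q is the 4x4 disparity-to-depth matrix from the stereo calibration.
Q = np.eye(4, dtype=np.float32)  # placeholder; use the calibrated Q in practice
points_3d = cv2.reprojectImageTo3D(disparity, Q)  # (H, W, 3) of X, Y, Z per pixel
colors = cv2.imread('left.png')  # color each 3D point by its pixel value
```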

Learning from EDA and issues with data

  1. Working with LIDAR data is not easy. LIDAR data is sparse, and an empty region does not mean the same thing at different locations in the scan.
  2. The KITTI dataset uses a 64-laser LIDAR (HDL-64E) and the Udacity data uses the 16-laser VLP-16. Therefore, some type of upsampling may be needed to generate more points for processing.
  3. As about 7000 different LIDAR front-views were combined to generate 360-degree views, there are a total of 343 billion (7000³) combinations of unique scenes that can be generated; this, combined with rotation and translation of the LIDAR point clouds, has the potential to give a very large number of different point clouds and ground-truth labels. As the number of scenes far exceeds the size of any anticipated deep learning model, I anticipate little to no overfitting during training.

PS: I extracted some more features, will include them later.
