Aaron Guan, zhongg@andrew.cmu.edu
Yi Gu, yig2@andrew.cmu.edu
Wanzhi Zhang, wanzhiz@andrew.cmu.edu
Robotics Institute
Carnegie Mellon University
The next-generation transportation system is one of the most important research topics for resolving problems such as traffic accidents, traffic congestion, and environmental degradation. Among the components of such a system, autonomous vehicles (AVs) have drawn increasing attention over the last decade from academia, industry, and government. In order for an AV to plan good future trajectories, it must predict the next movements of other agents, such as vehicles and pedestrians. In this project, we designed a CNN regression pipeline to predict a surrounding agent's motion over the next 5 seconds given 5 seconds of historical bird's-eye-view images. ResNet28 [1] and XceptionNet41 [2] are used as the backbone, and Negative Log-Likelihood (NLL) is used as the loss function.
Related Works
Motion Prediction Models
Traditionally, motion models were used to predict motion and assess risk for autonomous vehicles. These models can be categorized into three classes with an increasing degree of abstraction. The most basic approach is physics-based motion models, which assume that the motion of a vehicle depends only on the laws of physics. Maneuver-based motion models are more advanced in that they consider the maneuver that the driver intends to perform when predicting a vehicle's motion. Finally, there are interaction-aware motion models, which take into account the inter-dependencies between different vehicles' maneuvers.
Convolutional Neural Networks
Considering the complex nature of the task, prior works have introduced CNNs to predict the intention of surrounding vehicles. [4] introduced a convolution-deconvolution architecture that predicts the behavior of surrounding vehicles using a six-layer CNN with convolutional and fully connected layers. It uses two backbone CNNs to extract features from lidar and dynamic maps, and another three to detect agents and predict their motion.
In our work, we used ResNet28 and XceptionNet41 as the backbones. Deep residual networks were introduced by He et al. in [1]; they apply shortcut connections to preserve gradients and allow for deeper networks. They have been empirically shown to improve performance on ImageNet classification and have been widely used since. Xception [2], introduced by Google, builds upon Inception and residual connections; it uses depthwise separable convolutions (a depthwise convolution followed by a pointwise convolution) in computer vision architectures, allowing a more efficient use of model parameters.
Dataset
In this project, we selected the Lyft dataset [3] as our training dataset. It is the largest collection of traffic agent motion data to date, including 1,000+ hours of traffic agent movement, 16k miles of data from 23 vehicles, and 15k semantic map annotations. The dataset consists of 170,000 scenes capturing the environment around the autonomous vehicle, where each scene encodes the state of the vehicle's surroundings at a given point in time. One example of an agent and its history on the semantic map is shown in Figure 2.
To process the dataset, we utilize the l5kit library, a Python library for preprocessing and visualizing the Lyft dataset. We extract the data from .zarr files, which are already split into training, validation, and test sets. In our project, we used the training set to train our models and the validation set to evaluate their performance. The dataset contains four arrays: agents, frames, scenes, and traffic light faces. l5kit provides several PyTorch-ready dataset classes, two of which can be used here: EgoDataset, which iterates over the AV's annotations, and AgentDataset, which iterates over the other agents' annotations. Both can be iterated to return multi-channel images and future trajectory offsets from the rasterizer. Since our task is to predict surrounding agents' motions, we only used AgentDataset to train and evaluate our models. We also used the ChunkedDataset class, which builds the zarr dataset object, as sketched below.
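The snippet below is a minimal sketch of this data-loading setup, assuming the standard l5kit (v1.x) API; the config file name and dataset path are placeholders for our actual setup.

```python
import os
from l5kit.configs import load_config_data
from l5kit.data import ChunkedDataset, LocalDataManager
from l5kit.dataset import AgentDataset
from l5kit.rasterization import build_rasterizer

# Placeholder paths: point l5kit at the downloaded Lyft dataset and config.
os.environ["L5KIT_DATA_FOLDER"] = "/path/to/lyft_dataset"
dm = LocalDataManager(None)
cfg = load_config_data("./agent_motion_config.yaml")

# ChunkedDataset builds the zarr dataset object from the .zarr files.
zarr_dataset = ChunkedDataset(dm.require("scenes/train.zarr")).open()

# AgentDataset iterates over surrounding agents' annotations and returns
# the rasterized multi-channel image plus future trajectory offsets.
rasterizer = build_rasterizer(cfg, dm)
train_dataset = AgentDataset(cfg, zarr_dataset, rasterizer)

sample = train_dataset[0]
print(sample["image"].shape)             # (C, H, W) rasterized input
print(sample["target_positions"].shape)  # (future_num_frames, 2) offsets
```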
Method
We used ResNet and Xception as the backbone. Our goal is to predict the motion of the agents surrounding the autonomous vehicle over a 5-second horizon, given their current position and their positions over the previous 5 seconds. This requires 50 historical frames and 1 current frame, with consecutive frames spaced 0.1 seconds apart. Each frame contains the ego car and the agents on separate channels, so our input can be represented as an image with 3 + (50 + 1) × 2 = 105 channels. The first 3 channels are the RGB semantic map. The remaining channels cover the 50 history time steps and one current time step, where every step is represented by two channels: (1) a mask marking the location of the current agent, and (2) a mask marking all other nearby agents. Because we want to output the agent's motion over the next 5 seconds, we need 50 future frames, each represented by a coordinate on two axes. Since we only predict 1 trajectory, our output size is 50 × 2 = 100.
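As a concrete illustration, the following is a minimal PyTorch sketch of such a regression model; torchvision's stock resnet50 stands in for our ResNet28 backbone, and the class name is our own, but the channel and output sizes follow the text above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TrajectoryPredictor(nn.Module):
    """Backbone + one fully connected layer predicting a single trajectory."""

    def __init__(self, in_channels: int = 105, future_len: int = 50):
        super().__init__()
        self.backbone = resnet50()
        # Replace the stem so it accepts the 105-channel rasterized input.
        self.backbone.conv1 = nn.Conv2d(
            in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
        # Replace the classifier with a regression head: 50 x 2 = 100 outputs.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, future_len * 2)
        self.future_len = future_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.backbone(x)                    # (B, 100)
        return out.view(-1, self.future_len, 2)   # (B, 50, 2) trajectory
```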
Our model uses the ResNet or Xception backbone followed by one fully connected layer, which takes an input image with C channels and predicts 1 trajectory. We used the negative log-likelihood (NLL) of the ground truth under the distribution defined by the prediction as our evaluation metric. Given the ground truth trajectory GT and K predicted trajectory hypotheses, we compute the likelihood of the ground truth trajectory under a mixture of Gaussians with means equal to the predicted trajectories and the identity matrix as the covariance. The likelihood is given by:
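$$
p\left(x_{1,\dots,T},\, y_{1,\dots,T} \,\middle|\, c^{1,\dots,K},\, \bar{x}^{1,\dots,K}_{1,\dots,T},\, \bar{y}^{1,\dots,K}_{1,\dots,T}\right) = \sum_{k=1}^{K} c^{k} \prod_{t=1}^{T} \mathcal{N}\!\left(x_t \,\middle|\, \bar{x}^{k}_{t}, \sigma = 1\right) \mathcal{N}\!\left(y_t \,\middle|\, \bar{y}^{k}_{t}, \sigma = 1\right)
$$

where $(x_t, y_t)$ is the ground truth position at time step $t$, $(\bar{x}^{k}_{t}, \bar{y}^{k}_{t})$ is the position of the $k$-th predicted hypothesis at time step $t$, and $c^{k}$ is the confidence of that hypothesis; this follows the formulation of the l5kit evaluation metric.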
Therefore, the NLL loss can be computed as:
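$$
\mathcal{L}_{\mathrm{NLL}} = -\log \sum_{k=1}^{K} \exp\!\left(\log c^{k} - \frac{1}{2} \sum_{t=1}^{T} \left(\left(\bar{x}^{k}_{t} - x_t\right)^2 + \left(\bar{y}^{k}_{t} - y_t\right)^2\right)\right)
$$

i.e., a log-sum-exp over the K hypotheses of their confidence-weighted squared trajectory errors.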
The above loss function is originally designed for multiple hypotheses, but in this project we output only one trajectory hypothesis with a confidence score of 1. Therefore, we set K = 1 and c¹ = 1 in the loss function.
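A minimal PyTorch sketch of this loss, written for the general K-hypothesis case (the function name and the availability mask, which zeroes out padded ground-truth frames in the Lyft data, are our own additions):

```python
import torch

def pytorch_neg_multi_log_likelihood(
    gt: torch.Tensor,           # (B, T, 2) ground-truth positions
    pred: torch.Tensor,         # (B, K, T, 2) K predicted hypotheses
    confidences: torch.Tensor,  # (B, K) hypothesis confidences, summing to 1
    avails: torch.Tensor,       # (B, T) 1.0 where the GT frame is valid
) -> torch.Tensor:
    gt = gt.unsqueeze(1)                # (B, 1, T, 2), broadcast against K
    avails = avails[:, None, :, None]   # (B, 1, T, 1)
    # Squared error per hypothesis and time step, masking invalid frames.
    error = torch.sum(((gt - pred) * avails) ** 2, dim=-1)  # (B, K, T)
    # log(c_k) - 0.5 * sum_t ||GT_t - mu_t^k||^2, then log-sum-exp over K.
    error = torch.log(confidences) - 0.5 * torch.sum(error, dim=-1)  # (B, K)
    return torch.mean(-torch.logsumexp(error, dim=-1))
```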
Results
We resized the input image to 224 × 224 and used a batch size of 32. An SGD optimizer with a learning rate of 0.001 was used to train the network, which we implemented in PyTorch. We did not have time to train all the models to convergence: given the limited time and resources, we trained each backbone on 1 Tesla T4 GPU on AWS for roughly 280,000 iterations (about 100 hours). Figure 4 shows the training loss of both ResNet28 and XceptionNet41; as shown in the figure, XceptionNet41 outperforms ResNet28.
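A minimal training-loop sketch with these settings, reusing the hypothetical TrajectoryPredictor, pytorch_neg_multi_log_likelihood, and train_dataset pieces sketched above (the 224 × 224 raster size is assumed to be set in the l5kit config):

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TrajectoryPredictor().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

model.train()
for batch in loader:
    image = batch["image"].to(device)                   # (B, 105, 224, 224)
    targets = batch["target_positions"].to(device)      # (B, 50, 2)
    avails = batch["target_availabilities"].to(device)  # (B, 50)

    preds = model(image).unsqueeze(1)  # (B, 1, 50, 2): single hypothesis
    conf = torch.ones(preds.shape[0], 1, device=device)  # confidence c = 1
    loss = pytorch_neg_multi_log_likelihood(targets, preds, conf, avails)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```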
To better validate our trained networks, we visualize the ground truth and predicted trajectories on the semantic road map for the validation dataset. In these visualizations, the magenta trajectory is the ground truth, the cyan trajectory is the prediction, the blue boxes are the agent cars, and the green box is the ego car. Both ResNet and Xception can predict the correct trajectory for simple scenarios where the ground truth trajectory is almost a straight line.
For more complicated scenarios, we can clearly see that Xception generates almost the same trajectory as the ground truth, while ResNet fails to predict the correct trajectories. For scenarios involving negotiation at an intersection, we can observe some agent cars stopped at the intersection waiting for the traffic light while other cars make their turns. Both ResNet and Xception produce predicted trajectories similar to the ground truth, but the key difference here is that Xception better predicts the turning direction and angle of the agent cars.
Conclusion
We developed a simple yet effective deep learning regression pipeline to predict the motion of the traffic agents surrounding an autonomous vehicle. We used ResNet and Xception as the backbones, and Xception outperforms ResNet in our experiments. We used the negative log-likelihood as the loss function, and the prediction results are promising. We also visualized the predicted trajectories on the semantic map of the Lyft dataset: the predicted trajectories are almost the same as the ground truth, and the system can predict reasonable trajectories even in complicated scenarios involving negotiation at intersections.
In the future, we will train more models using different backbones and ensemble the trained models to generate better prediction trajectories. In this project, we generated only 1 predicted trajectory; in reality, however, there are multiple possible trajectories for an agent car, so we aim to generate multiple trajectories along with their confidence scores. We also only used the NLL loss; we could try other loss functions, such as an L1/L2 loss. Once we can generate robust predicted trajectories for the agent cars, we can develop a module that plans the ego car's future trajectory based on our prediction results.
References
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[2] François Chollet. Xception: Deep learning with depthwise separable convolutions, 2017.
[3] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset, 2020.
[4] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.