\maketitle
\begin{abstract}
Lane detection is a critical and challenging task in autonomous driving, particularly in real-world scenarios where traffic lanes are slender, long, and often partially occluded by other vehicles. Existing anchor-based methods typically rely on prior straight-line anchors to extract features and refine lane location and shape. Although these methods achieve high performance, manually setting prior anchors is cumbersome, and ensuring sufficient anchor coverage across diverse datasets requires a large number of dense anchors. Furthermore, NMS postprocessing must be applied to suppress redundant predictions. In this study, we introduce PolarRCNN, a two-stage, NMS-free, anchor-based method for lane detection. By introducing a local polar head, anchors are proposed dynamically, and the number of anchors is greatly reduced without sacrificing performance. In addition, a GNN-based NMS-free head is proposed to make the model end-to-end, which is deployment friendly. Our model yields competitive results on five popular lane detection benchmarks (TuSimple, CULane, LLAMAS, Curvelanes and DL-Rail) while maintaining a lightweight size and a simple structure.
\end{abstract}
\section{Introduction}
\IEEEPARstart{L}{ane} detection is a significant problem in computer vision and autonomous driving, forming the basis for accurately perceiving the driving environment in intelligent driving systems. While extensive research has been conducted in ideal environments, it remains a challenging task in adverse scenarios such as night driving, glare, crowded traffic, and rain, where lanes may be occluded or damaged. Moreover, the slender shapes, complex topologies, and global nature of lanes add to the difficulty of detection. An effective lane detection method should take into account both global high-level semantic features and local low-level features to address these varied conditions and ensure robust performance in real-time applications such as autonomous driving.
Traditional methods predominantly concentrate on handcrafted local feature extraction and lane shape modeling. Techniques such as the Canny edge detector\cite{canny1986computational}, Hough transform\cite{houghtransform}, and deformable templates for lane fitting\cite{kluge1995deformable} have been extensively utilized. Nevertheless, these approaches often encounter limitations in practical settings, particularly when low-level and local features lack clarity or distinctiveness.
In recent years, fueled by advancements in deep learning and the availability of large datasets, significant strides have been made in lane detection. Deep models, including convolutional neural networks (CNNs) and transformer-based architectures, have propelled progress in this domain. Earlier approaches often treated lane detection as a segmentation task, which is conceptually simple but computationally expensive. Other methods rely on parameter-based models that directly output lane curve parameters instead of pixel locations. These models offer end-to-end solutions, but the sensitivity of the lane shape to the curve parameters compromises robustness.
\caption{Comparison of anchor settings with other methods. (a) The initial anchor settings of CLRNet. (b) The learned anchor settings of CLRNet trained on CULane. (c) The proposed anchors of our method. (d) The ground truth.}
\caption{Comparison of different NMS thresholds in different scenarios. (a) Ground truth in a dense scenario. (b) Predictions with a large NMS threshold in a dense scenario. (c) Ground truth in a sparse scenario. (d) Predictions with a small NMS threshold in a sparse scenario.}
Drawing inspiration from object detection methods such as YOLO and Fast RCNN, several anchor-based approaches have been introduced for lane detection; representative works include LaneATT and CLRNet. These methods have demonstrated superior performance by leveraging anchor priors and enabling larger receptive fields for feature extraction. However, anchor-based lane detection methods share the following drawbacks with anchor-based general object detection methods:
\begin{itemize}
\item[(1)] A large number of dense anchors must be configured to ensure the recall of the detection results, because lane distributions (i.e., direction and location) are complex in real scenarios, as Fig. \ref{anchor setting}(a) shows.
\item[(2)] Because so many anchors are used, redundant predictions must be removed by postprocessing such as NMS \cite{} and FastNMS \cite{}, which complicates deployment and requires the NMS threshold to be set manually.
\end{itemize}
To solve the first problem, CLRNet uses learned anchors whose locations are optimized during training to adapt to real lane distributions (see Fig. \ref{anchor setting}(b)), and it uses cascaded cross-layer anchor refinement to bring the anchors closer to the ground truth. However, CLRNet still needs numerous anchors to cover the potential distributions of lanes. ADNet \cite{} instead uses a start point generate unit to propose flexible anchors for each image rather than using the same set of anchors for all images. However, the start points of lanes are subjective and lack clear visual evidence because of the global nature of lanes, so the performance of ADNet is not ideal. SRLane uses a local angle map to propose sketch anchors according to the direction of the ground truth. This method considers only the direction and ignores the accurate location of the anchors, leading to worse performance without cascaded anchor refinement. Moreover, none of the methods mentioned above avoids the redundant predictions described in the second problem.
To address the issues above better than previous work, we analyze their causes and propose a new lane detection method called PolarRCNN, a two-stage, NMS-free, anchor-based model. PolarRCNN uses local and global polar coordinates to describe the anchors, and the number of proposed anchors is much smaller than in previous work, as shown in Fig. \ref{anchor setting}(c). Moreover, a heuristic graph neural network block is proposed to make the model NMS-free. The model architecture is simple and avoids the complex mechanisms used in previous work (e.g., attention and cascade refinement), which makes the model easier to deploy and faster. In addition, the simple architecture helps us identify the key factors behind the performance of anchor-based lane detection methods.
We conducted experiments on five mainstream benchmarks: TuSimple \cite{}, CULane \cite{}, LLAMAS\cite{}, Curvelanes\cite{} and DL-Rail\cite{}. Our proposed method achieves performance competitive with state-of-the-art methods.
Our main contributions are summarized as follows:
\begin{itemize}
\item We simplify the anchor parameters with local and global polar coordinate systems and apply them to a two-stage lane detection framework. Compared with other sparse two-stage methods, the number of proposed anchors is greatly reduced while performance improves.
\item We propose a novel heuristic graph neural network (GNN) head to implement an NMS-free paradigm. The GNN architecture is designed after Fast NMS, which makes it interpretable. The whole training and testing process of our model is end-to-end.
\item Our method uses a simple model architecture and achieves performance competitive with other state-of-the-art methods on five datasets. The high performance with fewer anchors and an NMS-free paradigm demonstrates the effectiveness of our method.
\end{itemize}
\section{Related Works}
Lane detection aims to detect lane instances in an image. In this section, we only introduce deep-learning-based methods for lane detection, which can be categorized into segmentation-based, parameter-based, and anchor-based methods.
\textbf{Segmentation-based Methods.} Segmentation-based methods focus on pixel-wise prediction. They classify each pixel into categories corresponding to different lane instances and the background\cite{} and predict information pixel by pixel. However, these methods focus too heavily on low-level and local features, neglecting global semantic information and real-time detection. SCNN uses a larger receptive field to overcome this problem. Some methods such as UFLDv1 and v2\cite{}\cite{} and CondLaneNet\cite{} utilize row-wise or column-wise classification instead of pixel classification to improve detection speed. Another issue with these methods is that the lane instance prior is learned by the model itself, leading to a lack of prior knowledge. LaneNet uses post-clustering to distinguish each lane instance. UFLD divides lane instances by angles and locations and can only detect a fixed number of lanes. CondLaneNet utilizes different conditional dynamic kernels to predict different lane instances. Some methods such as FOLOLane\cite{} and GANet\cite{} use bottom-up strategies to detect a few key points and model their global relations to form lane instances.
\textbf{Parameter-based Methods.} Instead of predicting a series of point locations or pixel classes, parameter-based methods directly generate the curve parameters of lane instances. PolyLaneNet\cite{} and LSTR\cite{} consider the lane instance as a polynomial curve and output the polynomial coefficients directly. BézierLaneNet\cite{} treats the lane instance as a Bézier curve and generates the locations of the control points of the curve. BSLane uses B-splines to describe the lane, and its curve parameters focus on the local shapes of lanes. Parameter-based methods are mostly end-to-end without postprocessing, which makes them fast. However, since the final lane shapes are sensitive to the predicted curve parameters, the robustness and generalization of parameter-based methods may be less than ideal.
\textbf{Anchor-Based Methods.} Inspired by methods in general object detection such as YOLO \cite{} and DETR \cite{}, anchor-based methods have been proposed for lane detection. Line-CNN is the earliest work, to our knowledge, that utilizes line anchors to detect lanes. The lines are designed as rays emitted from three edges (left, bottom, and right) of an image. However, the receptive field of the model focuses only on the edges, and the model is slower than some other methods. LaneATT \cite{} employs anchor-based feature pooling to aggregate features along the whole line anchor, achieving faster speed with better performance. Nevertheless, its grid sampling strategy and label assignment limit its potential. CLRNet \cite{} utilizes cross-layer refinement strategies, SimOTA label assignment \cite{}, and the LIoU loss to push anchor-based performance beyond most methods. The main advantage of anchor-based methods is that many strategies from anchor-based general object detection can be easily applied to lane detection, such as label assignment, bounding box refinement, and the GIoU loss. However, the disadvantages of existing anchor-based lane detection methods are also evident: the line anchors need to be handcrafted, the number of anchors is large, and NMS postprocessing is needed, resulting in high computational cost.
Some works, such as ADNet\cite{}, SRLane\cite{} and Sparse Laneformer\cite{}, attempt to reduce the number of anchors by generating proposals.
\textbf{NMS-Free Object Detection.} NMS is an important postprocessing step in most general object detection methods. DETR \cite{} uses one-to-one label assignment to avoid redundant predictions without NMS. Other NMS-free methods \cite{} have since been proposed. These methods analyze the issue from two aspects: the model architecture and the label assignment. \cite{}\cite{} hold the view that one-to-one assignment is the key to NMS-free prediction. Other works also consider the expressive ability of the model to produce non-redundant predictions. However, few anchor-based lane detection methods analyze the NMS-free paradigm as general object detection does, and they still rely on NMS postprocessing. In our work, we find that both the label assignment and the expressive ability of the NMS-free module (e.g., its architecture and inputs) play an important role in NMS-free lane detection for anchor-based models.
This paper aims to address the two issues mentioned above (reducing the number of anchors and removing NMS) for anchor-based lane detection methods.
\section{Method}
The overall architecture of PolarRCNN is illustrated in Fig. \ref{overall_architecture}. Our model consists of a backbone with FPN, a local polar head, and a global polar head. Only simple network layers such as convolution, MLP, and pooling operations are used in each block (rather than attention, dynamic kernels, etc.).
\caption{The overall pipeline of PolarRCNN. The architecture is simple and lightweight. The backbone (e.g., ResNet18) and FPN extract features from the image, and the local polar head proposes sparse line anchors. After the features sampled along the line anchors are pooled, the global polar head gives the final predictions. The global polar head contains three subheads: a one-to-one classification head (O2O cls head), a one-to-many classification head (O2M cls head), and a one-to-many regression head (O2M reg head). The O2O cls head replaces NMS postprocessing by selecting only one positive prediction for each ground truth from the redundant predictions of the O2M heads.}
\label{overall_architecture}
\end{figure*}
\subsection{Lane and Line Anchor Representation}
Lanes are thin, long curves, and a suitable lane prior helps the model extract features, predict locations, and model the shapes of lane curves more accurately. Following previous works\cite{}\cite{}, the lane priors (also called lane anchors) in our work are straight lines, and we sample a sequence of 2D points on each line anchor, i.e., $ P\doteq\left\{\left( x_1, y_1\right) , \left( x_2, y_2\right) , \ldots,\left( x_N, y_N \right)\right\}$, where $N$ is the number of sampled points. The $y$ coordinates of the points are sampled uniformly along the vertical direction of the image, i.e., $y_i=\frac{H}{N-1}\cdot i$, where $H$ is the image height. Points with the same $y$ coordinates are also sampled from the ground truth lane, and the model regresses the $x$-coordinate offsets from the line anchor to the ground truth lane instance. The only difference between PolarRCNN and previous work is the description of the straight line anchors, which is introduced below.
\textbf{Polar Coordinate System.} Since lane anchors are straight by default, they can be described by straight-line parameters. Previous work uses a ray to describe a 2D line anchor, and the parameters of a ray contain the coordinates of the start point and the orientation/angle, i.e., $\left\{\theta, P_{xy}\right\}$, as shown in Figure \ref{coord}(a). \cite{}\cite{} define the start points to lie on three image boundaries, and \cite{} points out that this is not reasonable because the real start point of a lane could be at any location within an image. In our analysis, using a ray may cause ambiguity in describing a line, because a line has infinitely many candidate start points and the start point of a lane is subjective. As illustrated in Figure \ref{coord}(a), the yellow and dark green start points with the same orientation $\theta$ describe the same line, and either of them could be chosen in different datasets. This ambiguity arises because a straight line has two degrees of freedom while a ray has three degrees of freedom. To address this issue, as shown in Figure \ref{coord}(b), we use a polar coordinate system to describe a lane anchor with two parameters, the angle and the radius $\left\{\theta, r\right\}$, where $\theta\in\left[-\frac{\pi}{2}, \frac{\pi}{2}\right)$ and $r \in\left(-\infty, +\infty\right)$.
\caption{Different descriptions of anchor parameters. (a) Ray: start point and orientation. (b) Polar: radius and angle.}
\label{coord}
\end{figure}
We define two kinds of polar coordinate systems, called the global coordinate system and the local coordinate system, with the origin points denoted as the global origin point $P_{0}^{\text{global}}$ and the local origin point $P_{0}^{\text{local}}$, respectively. For convenience, the global origin point is set around the static vanishing point of the whole lane image dataset, while the local origin points are set as a lattice within the image. From Figure \ref{coord}, it is easy to see that only the radius parameter is influenced by the choice of the origin point, while the angle/orientation parameter remains unchanged.
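As a minimal sketch of these two properties (not the authors' implementation), the following Python snippet converts a ray into polar parameters $(\theta, r)$. It assumes that $\theta$ is the angle of the line's normal and $r$ is the signed distance from the origin to the line along that normal; the function name \texttt{ray\_to\_polar} and the direction angle \texttt{alpha} are hypothetical names introduced for illustration.

```python
import math

def ray_to_polar(start, alpha, origin=(0.0, 0.0)):
    """Convert a ray (start point, direction angle alpha) to (theta, r).

    Assumption: theta is the angle of the line normal, wrapped to
    [-pi/2, pi/2), and r is the signed distance from `origin` to the
    line measured along that normal.
    """
    # The normal of a line is perpendicular to its direction.
    theta = alpha + math.pi / 2.0
    # Wrap into [-pi/2, pi/2); r below is computed with the wrapped
    # normal, so its sign stays consistent with theta.
    theta = (theta + math.pi / 2.0) % math.pi - math.pi / 2.0
    dx = start[0] - origin[0]
    dy = start[1] - origin[1]
    r = dx * math.cos(theta) + dy * math.sin(theta)
    return theta, r

# Two different start points on the same line with the same direction.
alpha = math.pi / 3
p1 = (100.0, 200.0)
p2 = (p1[0] + 50 * math.cos(alpha), p1[1] + 50 * math.sin(alpha))
theta1, r1 = ray_to_polar(p1, alpha)
theta2, r2 = ray_to_polar(p2, alpha)
# Same line seen from a different origin (e.g. a local lattice point).
theta3, r3 = ray_to_polar(p1, alpha, origin=(50.0, 80.0))
```

Under these assumptions the two rays collapse to the same pair $(\theta, r)$, and moving the origin changes only $r$.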
\subsection{Local Polar Head}
Inspired by the region proposal network in Faster RCNN \cite{}, the local polar head (LPH) aims to propose flexible, high-quality anchors for each image, as shown in Fig. \ref{lph} and Fig. \ref{overall_architecture}. The highest-level (P3) FPN feature map $F \in\mathbb{R}^{C_{f}\times H_{f}\times W_{f}}$ is chosen as the input of the LPH. After a downsampling operation, the feature map is fed into two branches, namely the regression branch and the classification branch:
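\begin{equation}
\begin{aligned}
&F_d\gets downsample\left( F \right), \,F_d\in\mathbb{R}^{C_f\times H_l\times W_l}\\
&F_{reg}\gets \phi_{reg}^{lph}\left( F_d \right), \,F_{reg}\in\mathbb{R}^{2\times H_l\times W_l}\\
&F_{cls}\gets \phi_{cls}^{lph}\left( F_d \right), \,F_{cls}\in\mathbb{R}^{H_l\times W_l}
\end{aligned}
\end{equation}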
The regression branch proposes lane anchors by predicting two parameters, $F_{reg}\equiv\left[\mathbf{\Theta}^{H_{l}\times W_{l}}, \mathbf{\xi}^{H_{l}\times W_{l}}\right]$, under the local polar coordinate system, which denote the angles and the radii. The classification branch predicts a heat map over the grid of local polar origin points. By removing the local origin points with low confidence, the potential positive lane anchors around the ground truth are more likely to be chosen, while the background lane anchors are removed. To keep the design simple, the regression branch $\phi_{reg}^{lph}\left(\cdot\right)$ and the classification branch $\phi_{cls}^{lph}\left(\cdot\right)$ consist of one $1\times 1$ convolution layer and two $1\times 1$ convolution layers, respectively.
During the training stage, as shown in Fig. \ref{lphlabel}, the ground truth labels of the local polar head are constructed as follows. The radius ground truth is defined as the shortest distance from a grid point (a local polar origin point) to the ground truth lane curve. The angle ground truth is defined as the orientation of the segment from the grid point to its nearest point on the curve. Only grid points whose radius label is less than a threshold $\tau$ are set as positive samples, while the others are set as negative samples. Once the regression and classification labels are constructed, the LPH can be trained with the smooth-L1 loss and the binary cross-entropy (BCE) loss. The LPH loss function is defined as follows:
where $BCE\left(\cdot , \cdot\right)$ denotes the binary cross-entropy loss and $d\left(\cdot\right)$ denotes the smooth-L1 loss. To keep backbone training stable, the gradients from the classification (confidence) branch to the backbone feature map are detached.
\caption{Label construction for local polar proposal module.}
\label{lphlabel}
\end{figure}
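The label construction for the local polar head described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the ground truth lane is given as a densely sampled list of points, the nearest sampled point approximates the closest point on the curve, only a single lane is handled, and the angle is left in the range of \texttt{atan2} instead of being wrapped to $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right)$ with a signed radius. The function name \texttt{lph\_labels} is hypothetical.

```python
import math

def lph_labels(grid_points, lane_points, tau):
    """Build (radius, angle, is_positive) labels for each local origin.

    grid_points: (x, y) local polar origin points (the lattice).
    lane_points: densely sampled (x, y) points on one ground-truth lane.
    tau: radius threshold below which a grid point is a positive sample.
    """
    labels = []
    for gx, gy in grid_points:
        # Nearest sampled lane point approximates the closest curve point.
        nx, ny = min(lane_points,
                     key=lambda p: (p[0] - gx) ** 2 + (p[1] - gy) ** 2)
        radius = math.hypot(nx - gx, ny - gy)
        # Orientation of the segment from the grid point to the lane.
        angle = math.atan2(ny - gy, nx - gx)
        labels.append((radius, angle, radius < tau))
    return labels

# A vertical lane at x = 10 and two grid points, one near and one far.
lane = [(10.0, float(y)) for y in range(21)]
labels = lph_labels([(7.0, 5.0), (30.0, 5.0)], lane, tau=5.0)
```

In this toy case the first grid point lies 3 pixels from the lane and becomes a positive sample, while the second lies 20 pixels away and becomes a negative sample.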
\subsection{Global Polar Head}
The global polar head serves as the second stage of PolarRCNN; it accepts the line pooling features as input and predicts the accurate lane shape and location. The global polar head consists of three parts.
Once the local polar parameters of a line anchor are provided, they can be transformed to global polar coordinates with the following equation:
where $\left( x^{local}, y^{local}\right)$ and $\left( x^{global}, y^{global}\right)$ are the Cartesian coordinates of the local and global origin points, respectively.
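One form consistent with these definitions, assuming that $\theta$ is the angle of the anchor's normal and $r$ is the signed distance from the origin to the anchor along that normal (a convention assumed here for illustration), is
\begin{equation}
\begin{aligned}
\theta^{global} &= \theta^{local},\\
r^{global} &= r^{local} + \left( x^{local}-x^{global}\right)\cos\theta^{local} + \left( y^{local}-y^{global}\right)\sin\theta^{local}.
\end{aligned}
\end{equation}
Under this convention every point $\left( x, y\right)$ on the anchor satisfies $\left( x-x^{local}\right)\cos\theta^{local}+\left( y-y^{local}\right)\sin\theta^{local}=r^{local}$, so moving the origin changes only the constant term. This agrees with the observation that the angle does not depend on the choice of the origin point.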
Feature points can then be sampled along the line anchor. The $y$ coordinates of the points are sampled uniformly along the vertical direction of the image as mentioned before, and $x_{i}$ is calculated in the global polar coordinate system by the following equation:
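A minimal sketch of this sampling step is given below. It assumes that an anchor with global parameters $(\theta, r)$ and global origin $(x_0, y_0)$ is the set of points satisfying $(x-x_0)\cos\theta+(y-y_0)\sin\theta=r$, so that $x_i = x_0 + \left(r-(y_i-y_0)\sin\theta\right)/\cos\theta$. The function name \texttt{sample\_anchor\_points}, the index range $i=0,\ldots,N-1$ (chosen so that $y_i$ covers $[0, H]$), and the guard against horizontal anchors are assumptions made for illustration.

```python
import math

def sample_anchor_points(theta, r, origin, height, n, eps=1e-6):
    """Sample n points on (x - x0)*cos(theta) + (y - y0)*sin(theta) = r.

    origin is the global origin (x0, y0); height is the image height H.
    """
    x0, y0 = origin
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    if abs(cos_t) < eps:
        # A horizontal anchor has no unique x for a given y.
        raise ValueError("anchor is (nearly) horizontal")
    points = []
    for i in range(n):  # assumption: i = 0..N-1 so that y covers [0, H]
        y = height / (n - 1) * i
        x = x0 + (r - (y - y0) * sin_t) / cos_t
        points.append((x, y))
    return points

# A vertical anchor 20 pixels to the right of the origin.
vertical = sample_anchor_points(0.0, 20.0, (100.0, 50.0), 320.0, 5)
# A slanted anchor passing through the origin.
slanted = sample_anchor_points(math.pi / 4, 0.0, (0.0, 0.0), 320.0, 5)
```

The sampled $x_i$ then serve as the reference positions from which the second stage regresses offsets.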
The RCNN module consists of several MLP layers and predicts the confidence and the coordinate offset of $x_{i}$. During the training stage, all the $H_{l}\times W_{l}$ proposed anchors participate, and the SimOTA \cite{} label assignment strategy is used for the RCNN module to determine which anchors are positive, irrespective of the confidence predicted by the local polar head. These strategies are employed because the negative/background anchors are also crucial for the adaptability of the RCNN module.
where $\mathcal{L}_{cls}$ is the focal loss and $\mathcal{L}_{loc}$ is the LaneIoU loss\cite{}.
In the testing stage, the anchors with the top-$k_{l}$ confidences are chosen as the proposal anchors, and these $k_{l}$ anchors are fed into the RCNN module to obtain the final predictions.
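The test-time selection of proposal anchors can be sketched as a plain top-$k$ filter over the flattened heat map. This is a sketch under assumptions rather than the authors' code: the confidences, angles, and radii are given as flat lists over the $H_{l}\times W_{l}$ grid, and the function name \texttt{select\_topk\_anchors} is hypothetical.

```python
def select_topk_anchors(confidences, thetas, radii, k):
    """Return (theta, r, confidence) for the k most confident anchors.

    All three inputs are flat lists over the grid of local origins.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    return [(thetas[i], radii[i], confidences[i]) for i in order[:k]]

# Four local anchors, keep the two most confident ones.
proposals = select_topk_anchors(
    [0.1, 0.9, 0.5, 0.7],   # heat-map confidences
    [0.0, 0.1, 0.2, 0.3],   # predicted angles
    [5.0, 6.0, 7.0, 8.0],   # predicted radii
    k=2,
)
```

Only the selected $k_{l}$ anchors need line pooling and second-stage inference, which is where the reduction in anchor count pays off.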
\end{IEEEbiography}
\begin{IEEEbiographynophoto}{John Doe}