This commit is contained in:
王老板 2024-10-30 02:33:35 +08:00
parent 4080ac0d71
commit 631a201472

main.tex
\begin{abstract}
Lane detection is a critical and challenging task in autonomous driving, particularly in real-world scenarios where traffic lanes can be slender, lengthy, and often obscured by other vehicles, complicating detection efforts. Existing anchor-based methods typically rely on prior lane anchors to extract features and subsequently refine the location and shape of lanes. While these methods achieve high performance, manually setting prior anchors is cumbersome, and ensuring sufficient coverage across diverse datasets often requires a large amount of dense anchors. Furthermore,
the use of \textit{Non-Maximum Suppression} (NMS) to eliminate redundant predictions complicates real-world deployment and may underperform in complex scenarios. In this paper, we propose \textit{Polar R-CNN}, an end-to-end anchor-based method for lane detection. By incorporating both local and global polar coordinate systems, Polar R-CNN facilitates flexible anchor proposals and significantly reduces the number of anchors required without compromising performance. Additionally, we introduce a triplet head with a heuristic structure that supports an NMS-free paradigm, enhancing deployment efficiency and performance in scenarios with dense lanes. Our method achieves competitive results on five popular lane detection benchmarks—\textit{Tusimple}, \textit{CULane}, \textit{LLAMAS}, \textit{CurveLanes}, and \textit{DL-Rail}—while maintaining a lightweight design and straightforward structure. Our source code is available at \href{https://github.com/ShqWW/PolarRCNN}{\textit{https://github.com/ShqWW/PolarRCNN}}.
\end{abstract}
\begin{IEEEkeywords}
Lane Detection, NMS-Free, Graph Neural Network, Polar Coordinate System.
\end{IEEEkeywords}
\section{Introduction}
\IEEEPARstart{L}{ane} detection is a critical task in computer vision and autonomous driving, aimed at identifying and tracking lane markings on the road \cite{adas}. While extensive research has been conducted in ideal environments, it remains challenging in adverse scenarios such as night driving, glare, crowded roads, and rainy conditions, where lanes may be occluded or damaged \cite{scnn}. Moreover, the slender shapes and complex topologies of lanes further complicate detection efforts \cite{polylanenet}.
\par
Over the past few decades, methods have primarily focused on handcrafted local feature extraction and lane shape modeling. Techniques such as the \textit{Canny edge detector} \cite{cannyedge}, \textit{Hough transform} \cite{houghtransform}, and \textit{deformable templates} \cite{kluge1995deformable} have been widely employed for lane fitting. However, these approaches often face limitations in real-world scenarios, especially when low-level and local features lack clarity and distinctiveness.
\par
Drawing inspiration from object detection methods such as \textit{YOLO} \cite{yolov10} and \textit{Faster R-CNN} \cite{fasterrcnn}, several anchor-based approaches have been introduced for lane detection, with representative works including \textit{LaneATT} \cite{laneatt} and \textit{CLRNet} \cite{clrnet}. These methods have shown superior performance by leveraging anchor \textit{priors} (as shown in Fig. \ref{anchor setting}) and enabling larger receptive fields for feature extraction. However, anchor-based methods encounter similar drawbacks to those in general object detection, including the following:
\begin{itemize}
\item As shown in Fig. \ref{anchor setting}(a), a large number of lane anchors are predefined in the image, even in \textbf{\textit{sparse scenarios}}---situations where lanes are distributed widely and located far apart from each other, as illustrated in Fig. \ref{anchor setting}(d).
\item \textit{Non-Maximum Suppression} (NMS) \cite{nms} post-processing is required to eliminate redundant predictions but may struggle in \textbf{\textit{dense scenarios}} where lanes are close to each other, such as forked lanes and double lanes, as illustrated in Fig. \ref{NMS setting}(a).
\end{itemize}
\par
Regarding the first issue, \cite{clrnet} introduced learned anchors that optimize the anchor parameters during training to better adapt to lane distributions, as shown in Fig. \ref{anchor setting}(b). However, the number of anchors remains excessive to adequately cover the diverse potential distributions of lanes. Furthermore, \cite{adnet} proposes flexible anchors for each image by generating start points with directions, rather than using a fixed set of anchors. Nevertheless, these start points of lanes are subjective and lack clear visual evidence due to the global nature of lanes. In contrast, \cite{srlane} uses a local angle map to propose sketch anchors according to the direction of ground truth. While this approach considers directional alignment, it neglects precise anchor positioning, resulting in suboptimal performance. Overall, the abundance of anchors is unnecessary in sparse scenarios.% where lane ground truths are sparse. The trend in new methodologies is to reduce the number of anchors while offering more flexible anchor configurations.%, which negatively impacts its performance. They also employ cascade cross-layer anchor refinement to bring the anchors closer to the ground truth. in the absence of cascade anchor refinement
\begin{figure*}[ht]
\centering
\includegraphics[width=0.99\linewidth]{thesis_figure/ovarall_architecture.png}
\caption{An illustration of the Polar R-CNN architecture. It follows a pipeline similar to that of Faster R-CNN for object detection, and consists of a backbone and a \textit{Feature Pyramid Network} with three levels of feature maps, denoted by $P_1, P_2, P_3$, respectively, followed by a \textit{Local Polar Module} and a \textit{Global Polar Module} for lane detection. Based on the designed lane and lane anchor representations in the polar coordinate system, the local polar module proposes sparse line anchors and the global polar module produces the final accurate lane predictions. The global polar module includes a triplet head, which comprises the \textit{one-to-one} (O2O) classification subhead, the \textit{one-to-many} (O2M) classification subhead, and the \textit{one-to-many} (O2M) regression subhead.}
\label{overall_architecture}
\end{figure*}
\section{Related Works}
%
Lanes are characterized by their thin, elongated, and curved shapes. A well-defined lane prior aids the model in feature extraction and location prediction.
\par
\textbf{Lane and Anchor Representation as Ray.} Given an input image with dimensions of width $W$ and height $H$, a lane is represented by a set of 2D points $X=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$ with equally spaced y-coordinates, \textit{i.e.}, $y_i=i\times\frac{H}{N}$, where $N$ is the number of data points. Since the y-coordinate is fixed, a lane can be uniquely defined by its x-coordinates. Previous studies \cite{linecnn}\cite{laneatt} have introduced \textit{lane priors}, also known as \textit{lane anchors}, which are represented as straight lines in the image plane and serve as references. From a geometric perspective, a lane anchor can be viewed as a ray defined by a start point $(x^{s},y^{s})$ located at the edge of an image (left/bottom/right boundaries), along with a direction $\theta^s$. The primary task of a lane detection model is to estimate the x-coordinate offset from the lane anchor to the ground truth of the lane instance.
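As a concrete illustration of the ray representation, the sketch below samples the $N$ points of a straight lane anchor from its start point and direction. This is a hypothetical helper; the function name and the convention that the angle is measured from the y-axis are assumptions, not the paper's code.

```python
import numpy as np

def sample_anchor_points(x_s, y_s, theta, H=320, N=72):
    """Sample the 2D points of a straight lane anchor (a ray from start
    point (x_s, y_s) with direction theta) at equally spaced
    y-coordinates y_i = i * H / N.

    Assumption: theta is measured from the y-axis; the paper's exact
    angle convention may differ.
    """
    ys = np.arange(1, N + 1) * H / N            # y_i = i * H / N
    xs = x_s + (ys - y_s) * np.tan(theta)       # ray: x = x_s + (y - y_s) tan(theta)
    return np.stack([xs, ys], axis=1)           # shape (N, 2)
```

The model would then regress per-point x-offsets from these sampled anchor points to the lane ground truth.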
\par
However, the representation of lane anchors as rays presents certain limitations. Notably, a lane anchor can have an infinite number of potential start points, which makes the definition of its start point ambiguous and subjective. As illustrated in Fig. \ref{coord}(a), the studies in \cite{dalnet}\cite{laneatt}\cite{linecnn} define the start points as being located at the boundaries of an image, such as the green point in Fig. \ref{coord}(a). In contrast, the research presented in \cite{adnet} defines the start points, exemplified by the purple point in Fig. \ref{coord}(a), based on their actual visual locations within the image. Moreover, occlusion and damage to the lane significantly affect the detection of these start points, highlighting the need for the model to have a large receptive field \cite{adnet}. Essentially, a straight lane has two degrees of freedom: the slope and the intercept, under a Cartesian coordinate system, implying that the lane anchor could be described using just two parameters instead of the three redundant parameters (\textit{i.e.}, two for the start point and one for the direction) employed in ray representation.
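To make the two-degrees-of-freedom argument concrete, a straight anchor line can be rewritten in a two-parameter normal (polar-style) form $x\cos\phi + y\sin\phi = r$. The sketch below converts a ray-style anchor into such a pair, assuming the pole sits at the image origin and the ray direction is measured from the y-axis; the paper places its local and global poles elsewhere, so this is only a geometric illustration.

```python
import math

def ray_to_polar(x_s, y_s, theta):
    """Convert a ray-style anchor (start point + direction) into the
    two-parameter normal form of its underlying line:
        x * cos(phi) + y * sin(phi) = r.

    Assumptions: pole at the image origin; theta measured from the
    y-axis. Every point on the ray yields the same (phi, r), so two
    parameters suffice for a straight anchor.
    """
    r = x_s * math.cos(theta) - y_s * math.sin(theta)
    phi = -theta
    return phi, r
```

Because $(\phi, r)$ does not depend on where along the ray the start point is chosen, the ambiguity of the start point disappears in this representation.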
%
\par
Then, by considering the suppressive effect among lane anchors induced by the overall adjacency matrix $\boldsymbol{A}$, the lane anchor features $\boldsymbol{F}_j^{roi}$ can be further refined from the semantic distance tensor $\mathcal{D}^{edge}=\{\boldsymbol{D}_{ij}^{edge}\}\in\mathbb{R}^{K\times K\times d_n}$ as follows:
\begin{align}
\boldsymbol{D}_j^{roi}\in \mathbb{R}^{d_n}\gets\mathrm{MPool}_{col}\left(\mathcal{D}^{edge}(:,j,:)|\boldsymbol{A}(:,j)=1\right),
\label{maxpooling}
\end{align}
where $j=1,2,\cdots,K$ and $\mathrm{MPool}_{col}(\cdot|\boldsymbol{A}(:,j)=1)$ is an element-wise max pooling operator along the $j$-th column of the adjacency matrix $\boldsymbol{A}$ over the elements with $A_{:j}=1$. This is inspired by existing works \cite{o3d}\cite{pointnet}, which aim to extract the most distinctive features from the lane anchors that may potentially suppress the refined lane anchors. With the refined anchor features $\boldsymbol{D}_j^{roi}$, the final confidence scores of the O2O classification subhead are generated by a three-layer MLP:
\begin{align}
\tilde{s}_{j}^{g}\gets \mathrm{MLP}_{roi}\left( \boldsymbol{D}_{j}^{roi} \right), j=1,\cdots,K. \label{node_layer}
\end{align}
As stated above, the O2O classification subhead is formed from Eqs. (\ref{edge_layer_1})-(\ref{node_layer}), which can be seen as a directed graph driven by neural networks. This structure in the O2O classification subhead is referred to as a \textit{graph neural network} (GNN) block.
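A minimal sketch of the masked column-wise max pooling in Eq. (\ref{maxpooling}): for each anchor $j$, features are pooled only over the anchors $i$ that suppress it ($A_{ij}=1$). Function and variable names are illustrative, and pooling a column with no suppressing anchor to a zero vector is an assumption.

```python
import numpy as np

def masked_column_maxpool(D_edge, A):
    """Element-wise max pooling of the semantic distance tensor
    D_edge (K x K x d_n) along each column j, restricted to rows i
    with A[i, j] = 1, as in the column-wise MPool of the O2O subhead.

    Assumption: a column with no incoming edge (no suppressing
    anchor) pools to a zero vector.
    """
    K, _, d = D_edge.shape
    D_roi = np.zeros((K, d))
    for j in range(K):
        rows = A[:, j] == 1          # anchors that may suppress anchor j
        if rows.any():
            D_roi[j] = D_edge[rows, j, :].max(axis=0)
    return D_roi
```

The pooled feature $\boldsymbol{D}_j^{roi}$ would then be fed to the three-layer MLP of Eq. (\ref{node_layer}).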
\par
\textbf{Dual Confidence Selection for the NMS-Free Paradigm.} With the help of the adjacency matrix $\boldsymbol{A}$, the variability among the semantic features $\{\boldsymbol{D}_j^{roi}\}$ is enlarged, resulting in a significant gap in the confidence scores $\{\tilde{s}_{j}^{g}\}$ generated by the O2O classification subhead, which makes them easier to distinguish. Therefore, unlike conventional methods that feed the confidence scores $\{s_{j}^{g}\}$ obtained by the O2M classification subhead into an NMS post-processing stage to remove redundant candidates, we implement the following dual confidence selection criterion for selecting positive anchors:
\begin{align}
\Omega^{pos}=\left\{i|\tilde{s}_{i}^{g}>\tau_{o2o} \right\} \cap \left\{ i|s_{i}^{g}>\tau_{o2m} \right\},
\end{align}
where $\tau_{o2o}$ and $\tau_{o2m}$ are two confidence thresholds. The set $\Omega^{pos}$ allows for non-redundant positive predictions without NMS post-processing, as the O2O classification subhead enhances the confidence score variability among similar anchors, making selection less sensitive to the two thresholds.
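The dual confidence selection criterion amounts to intersecting two thresholded index sets, as the following sketch shows. The default threshold values are illustrative, not the paper's settings.

```python
import numpy as np

def dual_confidence_selection(s_o2o, s_o2m, tau_o2o=0.5, tau_o2m=0.4):
    """Indices of positive anchors: O2O confidence above tau_o2o AND
    O2M confidence above tau_o2m (the set Omega^pos). Threshold
    defaults are illustrative assumptions."""
    return np.flatnonzero((s_o2o > tau_o2o) & (s_o2m > tau_o2m))
```

No sorting or pairwise IoU suppression is needed, which is what makes the selection NMS-free.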
\par
\textbf{Loss function for GPM.} After obtaining the positive candidate set $\Omega^{pos}$ for the O2O classification subhead, the Hungarian algorithm \cite{detr} is applied to perform label assignment, \textit{i.e.}, a one-to-one assignment between the positive anchors and the ground truth instances. As for the O2M classification and O2M regression subheads, we use the same approach as SimOTA \cite{yolox} for label assignment. More details about label assignment and the cost function can be found in Appendices \ref{giou_appendix} and \ref{assign_appendix}. During training, the focal loss \cite{focal} is applied to both the O2O and the O2M classification subheads, with losses denoted as $\mathcal{L}^{o2o}_{cls}$ and $\mathcal{L}^{o2m}_{cls}$, respectively. Furthermore, we adopt the rank loss $\mathcal{L}_{rank}$ \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification subhead. Note that, similar to \cite{pss}, we stop the gradient flow from the O2O classification subhead during training to preserve the quality of RoI feature learning.
To train the O2M regression subhead, we redefine the GIoU concept (refer to Appendix \ref{giou_appendix} for more details) and adopt the GIoU loss $\mathcal{L}_{GIoU}^{o2m}$ to regress the x-coordinate offsets $\{\Delta\boldsymbol{x}_j\}$ for each positive lane anchor. The end points of lanes are trained with a $Smooth_{L1}$ loss $\mathcal{L}_{end}^{o2m}$. In addition, we propose an auxiliary loss $\mathcal{L}_{aux}$ to facilitate the learning of global features. As illustrated in Fig. \ref{auxloss}, the anchors and ground truth are divided into several segments, with each anchor segment being regressed to the primary components of the corresponding segment of the ground truth. The auxiliary loss $\mathcal{L}_{aux}$ helps the detection head gain a deeper understanding of the global geometric structure; the auxiliary regression branch is dropped during the evaluation stage. Finally, the classification loss $\mathcal{L} _{cls}^{g}$ and the regression loss $\mathcal{L} _{reg}^{g}$ for GPM are given as follows:
\begin{align}
\mathcal{L} _{cls}^{g}&=w^{o2m}_{cls}\mathcal{L}^{o2m}_{cls}+w^{o2o}_{cls}\mathcal{L}^{o2o}_{cls}+w_{rank}\mathcal{L}_{rank},
\\
Rec\,\,&=\,\,\frac{TP}{TP+FN},
\\
F1&=\frac{2\times Pre\times Rec}{Pre\,\,+\,\,Rec},
\end{align}
where $TP$, $FP$ and $FN$ represent the true positives, false positives, and false negatives of the entire dataset, respectively. In our experiment, we use different IoU thresholds to calculate the F1-score for different datasets: $F1@50$ and $F1@75$ for CULane \cite{clrnet}, $F1@50$ for LLAMAS \cite{clrnet} and Curvelanes \cite{CondLaneNet}, and $F1@50$, $F1@75$, and $mF1$ for DL-Rail \cite{dalnet}. The $mF1$ is defined as:
\begin{align}
mF1=\left( F1@50+F1@55+\ldots+F1@95 \right) /10,
\end{align}
where $F1@50, F1@55, \ldots, F1@95$ are F1 metrics when IoU thresholds are $0.5, 0.55, \ldots, 0.95$, respectively.
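The $mF1$ computation is a plain average of F1 over ten IoU thresholds; a minimal sketch (the helper name is ours):

```python
def mean_f1(f1_by_iou):
    """Average F1 over the ten IoU thresholds 0.50, 0.55, ..., 0.95,
    i.e. the mF1 metric. `f1_by_iou` maps an IoU threshold to its
    F1 score."""
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
    return sum(f1_by_iou[t] for t in thresholds) / 10.0
```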
For TuSimple, the evaluation metric is formulated as follows:
\begin{align}
Accuracy=\frac{\sum{C_{clip}}}{\sum{S_{clip}}},
\end{align}
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the ground truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the \textit{False Positive Rate} ($\mathrm{FPR}=1-\mathrm{Precision}$) and \textit{False Negative Rate} ($\mathrm{FNR}=1-\mathrm{Recall}$) metrics.
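The point-matching accuracy can be sketched per lane as follows. This is a simplified helper under our own naming; the benchmark itself sums $C_{clip}$ and $S_{clip}$ over all clips before dividing.

```python
def tusimple_accuracy(pred_xs, gt_xs, tol=20):
    """Per-lane sketch of the TuSimple accuracy: the fraction of
    ground-truth points whose predicted x-coordinate lies within
    `tol` pixels (20 in the benchmark)."""
    correct = sum(1 for p, g in zip(pred_xs, gt_xs) if abs(p - g) <= tol)
    return correct / len(gt_xs)
```

A lane prediction would then count as correct when this accuracy exceeds 85\%.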
\subsection{Implementation Details}
All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For optimization, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficient of the cost function, $\beta$, is set to 6. The whole model (including LPM and GPM) is trained end-to-end, as in \cite{adnet}\cite{srlane}. All experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}. Other details can be found in Appendix \ref{vis_appendix}.
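The warm-up plus cosine decay schedule can be sketched as follows. Only the initial learning rate of 0.006 comes from the text; the warm-up length and per-step granularity are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr=0.006):
    """Learning-rate warm-up followed by cosine decay.

    Only base_lr = 0.006 is taken from the text; warmup_steps and
    the step granularity are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear warm-up from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```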
\label{culane result}
\end{table*}
\begin{table}[h]
\centering
\caption{Comparison with other methods on the TuSimple test set.}
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/anchor_num_method.png}
\caption{Anchor numbers vs. F1@50 of different methods on CULane lane detection benchmark.}
\label{anchor_num_method}
\end{figure}
\begin{table}[t]
\centering
\caption{Ablation study of anchor proposal strategies.}
\begin{adjustbox}{width=\linewidth}
\label{aba_lph}
\end{table}
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/speed_method.png}
\caption{Latency vs. F1@50 of different methods on CULane lane detection benchmark.}
\label{speed_method}
\end{figure}
\begin{figure}[t]
\centering
\includegraphics[width=\imgwidth, height=\imgheight]{thesis_figure/heatmap/anchor2.jpg}
\caption{}
\end{subfigure}
\caption{(a) and (c) Heat maps of the local polar map; (b) and (d) Anchor proposals during the evaluation stage.}
\label{cam}
\end{figure}
\textbf{Ablation study on NMS-free block in sparse scenarios.} We conduct several experiments on the CULane dataset to evaluate the performance of the NMS-free paradigm in sparse scenarios. As shown in Table \ref{aba_NMSfree_block}, without using the GNN to establish relationships between anchors, Polar R-CNN fails to achieve an NMS-free paradigm, even with one-to-one assignment. Furthermore, the confidence-prior adjacency matrix $\boldsymbol{A}^{C}$ proves crucial, indicating that the conditional probability is effective. Other components, such as the geometric-prior adjacency matrix $\boldsymbol{A}^{G}$ and the rank loss, also contribute to the performance of the NMS-free block.
To compare the NMS-free paradigm with the traditional NMS paradigm, we perform experiments with the NMS-free block under both proposal and fixed anchor strategies (employing a fixed set of anchors as illustrated in Fig. \ref{anchor setting}(b)). Table \ref{NMS vs NMS-free} presents the results of these experiments. In the table, ``O2M'' and ``O2O'' refer to the NMS (the gray dashed route in Fig. \ref{gpm}) and NMS-free (the green route in Fig. \ref{gpm}) paradigms, respectively. The suffix ``-B'' signifies that the head consists solely of MLPs, whereas ``-G'' indicates that the head is equipped with the GNN architecture. In the fixed anchor paradigm, although the O2O classification subhead without GNN effectively eliminates redundant predictions, the performance is still improved by incorporating the GNN structure. In the proposal anchor paradigm, the O2O classification subhead without GNN fails to eliminate redundant predictions due to high anchor overlaps. Thus, the GNN structure is essential for Polar R-CNN in the NMS-free paradigm. In both the fixed and proposal anchor paradigms, the O2O classification subhead with the GNN structure successfully eliminates redundant predictions, indicating that our GNN-based O2O classification subhead can supplant NMS post-processing in sparse scenarios without a decline in performance.
We also explore the stop-gradient strategy for the O2O classification subhead. As shown in Table \ref{stop}, the gradient of the O2O classification subhead negatively impacts both the O2M classification subhead (with NMS post-processing) and the O2O classification subhead. This observation indicates that the one-to-one assignment induces significant bias into feature learning, thereby underscoring the necessity of the stop-gradient strategy to preserve optimal performance.
\begin{table}[t]
\centering
\caption{Ablation study on the O2O classification subhead.}
\begin{adjustbox}{width=\linewidth}
\label{aba_NMSfree_block}
\end{table}
\subsection{Ablation Study}
To validate and analyze the effectiveness and influence of the different components of Polar R-CNN, we conduct several ablation studies on the CULane and CurveLanes datasets.
\begin{table}[h]
\textbf{Ablation study on polar coordinate system and anchor number.} To assess the importance of the local polar coordinates of anchors, we examine the contribution of each component (\textit{i.e.}, angle and radius) to model performance. As shown in Table \ref{aba_lph}, both the angle and radius parameters contribute to performance to varying degrees. Additionally, we conduct experiments with the auxiliary loss under two anchor strategies: ``fixed'' and ``proposal''. The term ``fixed'' denotes the fixed anchor configuration (192 anchors) trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b), while ``proposal'' signifies anchors proposed by the LPM (20 anchors). Model performance improves by 0.48\% and 0.3\% under the fixed anchor paradigm and the proposal anchor paradigm, respectively. Moreover, the flexible anchors proposed by the LPM surpass the fixed anchor configuration with auxiliary loss, using fewer anchors to achieve superior performance.
Fig. \ref{cam} displays the heat map and the distribution of the top-$K$ selected anchors in sparse scenarios. Brighter colors indicate a higher likelihood of an anchor being foreground. It is evident that most of the proposed anchors cluster around the lane ground truth. We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves as the local polar map grows and stabilizes once the size is sufficiently large. Specifically, precision improves while recall decreases: a larger polar map includes more background anchors in the second stage (since we set $k_{dynamic}=4$ for SimOTA, each ground truth has at most 4 positive samples), so the model learns from more negative samples, enhancing precision but reducing recall. Regarding the number of anchors selected during the evaluation stage, recall and the F1 measure improve significantly as the anchor number first increases but stabilize thereafter. This suggests that discarding some anchors does not significantly affect performance.
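The top-$K$ anchor selection itself is straightforward; a minimal sketch (the function name and the array layout are assumptions for illustration) is:

```python
import numpy as np

def select_topk_anchors(scores, anchors, k):
    """Keep the k anchor proposals with the highest foreground scores.

    scores : (N,) foreground confidence from the local polar map.
    anchors: (N, 2) polar parameters (theta, r), one row per anchor.
    """
    idx = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return anchors[idx], scores[idx]

top_anchors, top_scores = select_topk_anchors(
    np.array([0.1, 0.9, 0.5]),
    np.array([[0.0, 1.0], [0.1, 2.0], [0.2, 3.0]]),
    k=2)
```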
\begin{figure*}[t]
\centering
\def\subwidth{0.325\textwidth}
\def\imgwidth{\linewidth}
\begin{subfigure}{\subwidth}
\includegraphics[width=\imgwidth]{thesis_figure/anchor_num/anchor_num_testing_p.png}
\end{subfigure}
\begin{subfigure}{\subwidth}
\includegraphics[width=\imgwidth]{thesis_figure/anchor_num/anchor_num_testing_r.png}
\end{subfigure}
\begin{subfigure}{\subwidth}
\includegraphics[width=\imgwidth]{thesis_figure/anchor_num/anchor_num_testing.png}
\end{subfigure}
\caption{F1@50 performance of different polar map sizes and different top-$K$ anchor selections on the CULane test set.}
\label{anchor_num_testing}
\end{figure*}
\begin{table}[t]
\centering
\caption{The ablation study for NMS and NMS-free on the CULane test set.}
\begin{adjustbox}{width=\linewidth}
@ -682,11 +670,13 @@ We also explore the stop-gradient strategy for the O2O classification subhead. A
\label{NMS vs NMS-free}
\end{table}
\textbf{Ablation study on NMS-free block in sparse scenarios.} We conduct several experiments on the CULane dataset to evaluate the performance of the NMS-free paradigm in sparse scenarios. As shown in Table \ref{aba_NMSfree_block}, without using the GNN to establish relationships between anchors, Polar R-CNN fails to achieve an NMS-free paradigm, even with one-to-one assignment. Furthermore, the confidence-prior adjacency matrix $\boldsymbol{A}^{C}$ proves crucial, indicating that the O2M confidence score is still essential in the NMS-free paradigm. Other components, such as the geometric-prior adjacency matrix $\boldsymbol{A}^{G}$ and the rank loss, also contribute to the performance of the NMS-free block.
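As a rough illustration of the two priors, the sketch below builds plausible forms of $\boldsymbol{A}^{C}$ and $\boldsymbol{A}^{G}$; the exact definitions are those given earlier in the paper, and the function name, matrix conventions, and threshold here are assumptions.

```python
import numpy as np

def build_adjacency_priors(conf, pairwise_dist, dist_thresh):
    """Sketch of the two prior adjacency matrices (assumed forms).

    A_C[i, j] = 1 if anchor j has a higher O2M confidence than anchor i,
    so each prediction is only related to more confident competitors.
    A_G[i, j] = 1 if anchors i and j are geometric neighbours (pairwise
    distance below dist_thresh), excluding self-edges.
    """
    A_C = (conf[None, :] > conf[:, None]).astype(np.float32)
    A_G = (pairwise_dist < dist_thresh).astype(np.float32)
    np.fill_diagonal(A_G, 0.0)  # no self-suppression edges
    return A_C, A_G

conf = np.array([0.2, 0.8])
dist = np.array([[0.0, 3.0], [3.0, 0.0]])
A_C, A_G = build_adjacency_priors(conf, dist, dist_thresh=5.0)
```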
To compare the NMS-free paradigm with the traditional NMS paradigm, we perform experiments with the NMS-free block under both ``proposal'' and ``fixed'' anchor strategies. Table \ref{NMS vs NMS-free} presents the results of these experiments. ``O2M'' and ``O2O'' refer to the NMS (the gray dashed route in Fig. \ref{gpm}) and NMS-free paradigms (the green route in Fig. \ref{gpm}), respectively. The suffix ``-B'' signifies that the head consists solely of MLPs, whereas ``-G'' indicates that the head is equipped with the GNN block. In the fixed anchor paradigm, although the O2O classification subhead without GNN effectively eliminates redundant predictions, performance is still improved by incorporating the GNN block. In the proposal anchor paradigm, the O2O classification subhead without the GNN block fails to eliminate redundant predictions due to high anchor overlaps. In both the fixed and proposal anchor paradigms, the O2O classification subhead with the GNN block successfully eliminates redundant predictions, indicating that both the label assignment and the architectural design of the head are pivotal in achieving end-to-end detection with non-redundant predictions.
We also explore the stop-gradient strategy for the O2O classification subhead. As shown in Table \ref{stop}, the gradient of the O2O classification subhead negatively impacts both the O2M classification subhead (with NMS post-processing) and the O2O classification subhead. This observation indicates that the one-to-one assignment induces significant bias into feature learning, thereby underscoring the necessity of the stop-gradient strategy to preserve optimal performance.
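The stop-gradient strategy amounts to detaching the shared features before the O2O classification subhead, so only the O2M branches shape the representation. A minimal sketch follows; the class and attribute names are illustrative, not the paper's exact module layout.

```python
import torch
import torch.nn as nn

class TripletHeadSketch(nn.Module):
    """Minimal sketch of the stop-gradient strategy (assumed names).

    The O2O classification subhead consumes detached features, so its
    one-to-one assignment cannot bias the shared representation that is
    trained through the one-to-many branch.
    """

    def __init__(self, dim):
        super().__init__()
        self.o2m_cls = nn.Linear(dim, 1)  # one-to-many branch: trains the features
        self.o2o_cls = nn.Linear(dim, 1)  # one-to-one branch: NMS-free scores

    def forward(self, feat):
        o2m = self.o2m_cls(feat)
        o2o = self.o2o_cls(feat.detach())  # gradient stopped here
        return o2m, o2o

head = TripletHeadSketch(8)
feat = torch.randn(4, 8, requires_grad=True)
o2m, o2o = head(feat)
o2o.sum().backward()  # only the O2O loss flows back
```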
\begin{table}[t]
\centering
\caption{The ablation study for the stop gradient strategy on the CULane test set.}
\begin{adjustbox}{width=\linewidth}
@ -708,13 +698,7 @@ We also explore the stop-gradient strategy for the O2O classification subhead. A
\end{table}
\begin{table}[t]
\centering
\caption{NMS vs NMS-free on CurveLanes validation set.}
\begin{adjustbox}{width=\linewidth}
@ -736,20 +720,18 @@ In the traditional NMS post-processing \cite{clrernet}, the default IoU threshol
\end{tabular}
\end{adjustbox}
\label{aba_NMS_dense}
\end{table}
\textbf{Ablation study on NMS-free block in dense scenarios.} Despite demonstrating the feasibility of replacing NMS with the O2O classification subhead in sparse scenarios, the shortcomings of NMS in dense scenarios remain. To investigate the performance of the NMS-free block in dense scenarios, we conduct experiments on the CurveLanes dataset, as detailed in Table \ref{aba_NMS_dense}.
In traditional NMS post-processing \cite{clrernet}, the default IoU threshold is set to 50 pixels. However, this default is not always optimal, especially in dense scenarios where some lane predictions may be erroneously eliminated. Lowering the threshold increases recall but decreases precision. To find the most effective threshold, we experiment with various values and find that 15 pixels achieves the best trade-off, yielding an F1-score of 86.81\%. In contrast, the NMS-free paradigm with the O2O classification subhead achieves an overall F1-score of 87.29\%, which is 0.48\% higher than the optimal threshold setting in the NMS paradigm, with both precision and recall improved. This indicates that the O2O classification subhead with the proposed GNN block is capable of learning both explicit geometric distances and implicit semantic distances between anchors, providing a more effective solution for dense scenarios than traditional NMS post-processing.
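To make the threshold trade-off concrete, the sketch below implements greedy lane NMS; the mean per-row pixel distance stands in for the Line-IoU distance used in practice, and the function name and array layout are assumptions.

```python
import numpy as np

def lane_nms(lanes, scores, thresh):
    """Greedy NMS over lanes, sketched with mean per-row |x| distance
    (in pixels) as a stand-in for the Line-IoU distance.

    lanes : (N, R) x-coordinates of each lane at R sampled rows.
    scores: (N,) confidence of each lane prediction.
    """
    order = np.argsort(scores)[::-1]  # process the most confident first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        dists = np.abs(lanes[rest] - lanes[i]).mean(axis=1)
        order = rest[dists > thresh]  # suppress lanes closer than the threshold
    return keep

# Two nearby lanes (10 px apart) and one distant lane:
lanes = np.array([[100.0, 100.0], [110.0, 110.0], [300.0, 300.0]])
scores = np.array([0.9, 0.8, 0.7])
loose = lane_nms(lanes, scores, thresh=50.0)  # suppresses the 10 px neighbour
tight = lane_nms(lanes, scores, thresh=5.0)   # keeps it
```

A large threshold wrongly suppresses the genuinely dense second lane, while a small threshold keeps it; neither fixed threshold can resolve both cases, which is exactly the gap the data-driven O2O subhead closes.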
\section{Conclusion and Future Work}
%
%
In this paper, we propose Polar R-CNN to address two key issues in anchor-based lane detection methods. By incorporating local and global polar coordinate systems, Polar R-CNN achieves improved performance with fewer anchors. Additionally, the O2O classification subhead with the GNN block allows us to replace traditional NMS post-processing, and the NMS-free paradigm demonstrates superior performance in dense scenarios. Our model is highly flexible, and the number of anchors can be adjusted for the specific scenario. Polar R-CNN is also deployment-friendly due to its simple structure, making it a potential new baseline for lane detection. Future work could explore new label assignment and anchor sampling strategies, as well as more sophisticated model structures such as large kernels and attention mechanisms. We also plan to extend Polar R-CNN to video instance and 3D lane detection tasks, utilizing advanced geometric modeling techniques.
%
\bibliographystyle{IEEEtran}
\bibliography{reference}
%\newpage
%
\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/wsq.jpg}}]{Shengqi Wang}
received the Master's degree from Xi'an Jiaotong University, Xi'an, China, in 2022. He is currently pursuing the Ph.D. degree in statistics at Xi'an Jiaotong University. His research interests include low-level computer vision and deep learning.
\end{IEEEbiography}
@ -770,11 +752,8 @@ In this paper, we propose Polar R-CNN to address two key issues in anchor-based
\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/sunkai.jpg}}]{Kai Sun}
received his Ph.D. degree in statistics from Xi'an Jiaotong University, Xi'an, China, in 2020. He joined Xi'an Jiaotong University in 2020, where he is currently an associate professor in the School of Mathematics and Statistics. His research interests include deep learning and image processing. He has authored and coauthored one monograph and more than 20 academic papers, primarily in journals such as IEEE TIP and IEEE TNNLS. Additionally, he has published one ESI highly cited paper and one ESI hot paper as the first author.
\end{IEEEbiography}
\vfill
\newpage
% When the appendix contains multiple sections
\appendices
@ -1255,7 +1234,7 @@ Some important implement details for each dataset are shown in Table \ref{datase
Fig. \ref{vis_sparse} illustrates the visualization outcomes in sparse scenarios across four datasets. The top row depicts the ground truth, the middle row shows the proposed lane anchors, and the bottom row exhibits the predictions generated by Polar R-CNN with the NMS-free paradigm. In the top and bottom rows, different colors distinguish different lane instances and do not correspond across images. The middle row shows that the LPH of Polar R-CNN effectively proposes anchors clustered around the ground truth, providing a robust prior for the GPH to produce the final lane predictions. Moreover, the number of anchors is significantly reduced compared to previous works, theoretically making our method faster than other anchor-based methods.
Fig. \ref{vis_dense} shows the visualization outcomes in dense scenarios. The first column displays the ground truth, while the second and third columns show the detection results of the NMS paradigm with a large threshold (\textit{i.e.}, the default NMS@50 with 50 pixels) and a small threshold (\textit{i.e.}, the optimal NMS@15 with 15 pixels), respectively. The final column shows the detection results of the NMS-free paradigm. We observe that NMS@50 mistakenly removes some predictions, leading to false negatives, while NMS@15 fails to eliminate some redundant predictions, leading to false positives. This underscores the trade-off between large and small NMS thresholds and demonstrates that geometric distance alone becomes less effective in dense scenarios. Only the proposed O2O classification subhead, driven by data, can address this issue by capturing semantic distance beyond geometric distance. As shown in the last column of Fig. \ref{vis_dense}, the O2O classification subhead successfully eliminates redundant predictions while preserving dense predictions, despite their minimal geometric distances.
\label{vis_appendix}
\end{document}