ShqWW 2024-09-13 12:36:47 +08:00
parent 8ccf5a9a4d
commit 21b3190aba
2 changed files with 62 additions and 55 deletions

main.tex (62 additions, 55 deletions)

@@ -31,14 +31,15 @@
\title{Polar R-CNN:\@ End-to-End Lane Detection with Fewer Anchors}
-\author{Shengqi Wang and Junmin Liu\\
+\author{Shengqi Wang, Junmin Liu, Xiangyong Cao, Zengjie Song, and Kai Sun\\
\thanks{This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62276208, 12326607) and in part by the Natural Science Basic Research Program of Shaanxi Province (Grant No. 2024JC-JCQN-02).}%
-\thanks{S. Wang is with the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China, and is also with the School of Mathematics and Statistics, The University of Melbourne, VIC 3010 Australia.}
-\thanks{J. Liu is with the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China.}
+\thanks{S. Wang, J. Liu, Z. Song, and K. Sun are with the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China.}
+\thanks{X. Cao is with the School of Computer Science and Technology and the Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, China.}
}
%\thanks{Manuscript received April 19, 2021; revised August 16, 2021.}}
% The paper headers
@@ -128,17 +129,17 @@ In recent years, fueled by advancements in deep learning and the availability of
Drawing inspiration from object detection methods such as the YOLO series \cite{yolov10} and Faster R-CNN \cite{fasterrcnn}, several anchor-based approaches have been introduced for lane detection, with representative works including LaneATT \cite{laneatt} and CLRNet \cite{clrnet}. These methods have demonstrated superior performance by leveraging anchor priors and enabling larger receptive fields for feature extraction. However, anchor-based lane detection methods suffer from drawbacks similar to those of anchor-based general object detection methods, as follows:
-(1) A large amount of lane anchors are set among the image even in sparse scenarios.
+(1) A large number of lane anchors are placed throughout the image, even in sparse scenarios. Sparse scenarios refer to situations where lanes are distributed sparsely and are located far from each other, as illustrated in Fig. \ref{anchor setting}(d).
-(2) Non-maximum suppression (NMS) post-processing is necessary for the remove of redundant prediction but may fail in dense scenarios.
+(2) Non-maximum suppression (NMS) post-processing is required to remove redundant predictions but may struggle in dense scenarios. Dense scenarios involve situations where lanes are close to each other, such as forked lanes and double lanes, as depicted in Fig. \ref{NMS setting}(a).
Regarding the first issue, \cite{clrnet} introduced learned anchors, whose parameters are optimized during training to adapt to the lane distributions in real datasets (see Fig. \ref{anchor setting}(b)). Additionally, they employ cascade cross-layer anchor refinement to bring the anchors closer to the ground truth. However, numerous anchors are still required to cover the potential distributions of lanes. Going further, \cite{adnet} proposes flexible anchors for each image by generating start points, rather than using a fixed set of anchors for all images. Nevertheless, the start points of lanes are subjective and lack clear visual evidence due to the global nature of lanes, which degrades performance. \cite{srlane} uses a local angle map to propose sketch anchors according to the direction of the ground truth. This approach considers only the direction and neglects the accurate positioning of anchors, resulting in suboptimal performance without cascade anchor refinement. Overall, numerous anchors are unnecessary in sparse scenarios (where lane ground truths are sparse). The trend in newly proposed methods is to reduce the number of anchors and offer more flexible anchor configurations.
Regarding the second issue, nearly all anchor-based methods (including those mentioned above) require direct or indirect NMS post-processing to eliminate redundant predictions. Although eliminating redundant predictions is necessary, NMS remains a suboptimal solution. On the one hand, NMS is not deployment-friendly because it involves defining and calculating distances (e.g., Intersection over Union) between lane pairs. This is more challenging than for bounding boxes in general object detection due to the complexity of lane geometry. On the other hand, NMS fails in some dense scenarios where the lane ground truths are closer together than in sparse scenarios. A large distance threshold may result in false negatives, as some true positive predictions might be mistakenly eliminated (as shown in Fig. \ref{NMS setting}(a)\&(b)). Conversely, a small distance threshold may not eliminate redundant predictions effectively and can leave false positives (as shown in Fig. \ref{NMS setting}(c)\&(d)). Achieving an optimal trade-off in all scenarios by manually setting the distance threshold is challenging. The root cause of this problem is that the distance definition in NMS considers only geometric parameters while ignoring the semantic context in the image. Thus, when two predictions are ``close'' to each other, it is nearly impossible to determine whether one of them is redundant.
To address the two issues outlined above, we propose Polar R-CNN, a novel anchor-based method for lane detection. For the first issue, we introduce local and global heads based on the polar coordinate system to create anchors with more accurate locations and to reduce the number of proposed anchors in sparse scenarios, as illustrated in Fig. \ref{anchor setting}(c). Compared to state-of-the-art previous works \cite{clrnet}\cite{clrernet}, which use 192 anchors, Polar R-CNN employs only 20 anchors to cover potential lane ground truths. For the second issue, we revise Fast NMS into Graph-based Fast NMS and introduce a new heuristic graph neural network block (Polar GNN block) integrated into the NMS head. The Polar GNN block offers an interpretable structure, achieving nearly equivalent performance in sparse scenarios and superior performance in dense scenarios. We conducted experiments on five major benchmarks: TuSimple \cite{tusimple}, CULane \cite{scnn}, LLAMAS \cite{llamas}, CurveLanes \cite{curvelanes}, and DL-Rail \cite{dalnet}. Our proposed method demonstrates competitive performance compared to state-of-the-art methods.
Our main contributions are summarized as follows:
@@ -172,39 +173,40 @@ In this work, we aim to address the two issues in anchor-based lane detection men
\section{Proposed method}
The overall architecture of Polar R-CNN is illustrated in Fig. \ref{overall_architecture}. Our model adheres to the Faster R-CNN \cite{fasterrcnn} framework, consisting of a backbone, Feature Pyramid Network (FPN), Region Proposal Network (RPN), and Region of Interest (RoI) pooling. To investigate the fundamental factors affecting model performance, such as anchor settings and NMS post-processing, and to make the model easier to deploy, Polar R-CNN employs a simple and straightforward network structure. It relies on basic components including convolutional layers, Multi-Layer Perceptrons (MLPs), and pooling operations, deliberately excluding advanced elements like attention mechanisms, dynamic kernels, and cross-layer refinement used in previous works \cite{clrnet}\cite{clrernet}.
-\begin{table}[h]
-\centering
-\caption{Notations of some important variable}
-\begin{adjustbox}{width=\linewidth}
-\begin{tabular}{lll}
-\toprule
-\textbf{Variable} & \textbf{Type} & \hspace{10em}\textbf{Defination} \\
-\midrule
-$\mathbf{P}_{i}$ & tensor& The $i_{th}$ output feature map from FPN\\
-$H^{l}$& scalar& The height of the local polar map\\
-$W^{l}$& scalar& The weight of the local polar map\\
-$K_{a}$ & scalar& The number of anchors selected during evaluation\\
-$\mathbf{c}^{g}$& tensor& The origin point of global polar coordinate\\
-$\mathbf{c}^{l}$& tensor& The origin point of local polar coordinate\\
-$r^{g}_{i}$& scalar& The $i_{th}$ anchor radius under global polar coordinate\\
-$r^{l}_{i}$& scalar& The $i_{th}$ anchor radius under global polar coordinate\\
-$\theta_{i}$& scalar& The $i_{th}$ anchor angle under global/local polar coordinate\\
-\midrule
-$\mathbf{X}^{pool}_{i}$& tensor& The pooling feature of the $i_{th}$ anchor\\
-$N^{nbr}_{i}$& set& The adjacent node set of the $i_{th}$ of anchor node\\
-$C_{o2m}$ & scalar& The positive threshold of one-to-many confidence\\
-$C_{o2o}$ & scalar& The positive threshold of one-to-one confidence\\
-$d_{dim}$ & scalar& Dimension of the distance tensor.\\
-% \midrule
-% & & \\
-% & & \\
-% & & \\
-% & & \\
-% & & \\
-\bottomrule
-\end{tabular}
-\end{adjustbox}
-\end{table}
+% \begin{table}[h]
+% \centering
+% \caption{Notation for important variables}
+% \begin{adjustbox}{width=\linewidth}
+% \begin{tabular}{lll}
+% \toprule
+% \textbf{Variable} & \textbf{Type} & \hspace{10em}\textbf{Definition} \\
+% \midrule
+% $\mathbf{P}_{i}$ & tensor& The $i_{th}$ output feature map from FPN\\
+% $H^{l}$& scalar& The height of the local polar map\\
+% $W^{l}$& scalar& The width of the local polar map\\
+% $K_{a}$ & scalar& The number of anchors selected during evaluation\\
+% $\mathbf{c}^{g}$& tensor& The origin point of the global polar coordinate\\
+% $\mathbf{c}^{l}$& tensor& The origin point of the local polar coordinate\\
+% $r^{g}_{i}$& scalar& The $i_{th}$ anchor radius under the global polar coordinate\\
+% $r^{l}_{i}$& scalar& The $i_{th}$ anchor radius under the local polar coordinate\\
+% $\theta_{i}$& scalar& The $i_{th}$ anchor angle under the global/local polar coordinate\\
+% \midrule
+% $\mathbf{X}^{pool}_{i}$& tensor& The pooling feature of the $i_{th}$ anchor\\
+% $N^{nbr}_{i}$& set& The adjacent node set of the $i_{th}$ anchor node\\
+% $C_{o2m}$ & scalar& The positive threshold of one-to-many confidence\\
+% $C_{o2o}$ & scalar& The positive threshold of one-to-one confidence\\
+% $d_{dim}$ & scalar& Dimension of the distance tensor\\
+% $w_{b}$ & scalar& Base width of the lane instance\\
+% % \midrule
+% % & & \\
+% % & & \\
+% % & & \\
+% % & & \\
+% % & & \\
+% \bottomrule
+% \end{tabular}
+% \end{adjustbox}
+% \end{table}
@@ -212,7 +214,7 @@ The overall architecture of Polar R-CNN is illustrated in Fig. \ref{overall_arch
Lanes are characterized by their thin, elongated, curved shapes. A suitable lane prior helps the model extract features, predict locations, and model the shapes of lane curves more accurately. Consistent with previous studies \cite{linecnn}\cite{laneatt}, our lane priors (also referred to as lane anchors) consist of straight lines. We sample a sequence of 2D points along each lane anchor, denoted as $ P\doteq \left\{ \left( x_1, y_1 \right) , \left( x_2, y_2 \right) , \ldots, \left( x_N, y_N \right) \right\} $, where $N$ is the number of sampled points. The y-coordinates of these points are uniformly sampled along the vertical axis of the image, specifically $y_i=\frac{H}{N-1}\cdot i$, where $H$ is the image height. The same y-coordinates are used to sample the ground truth lane, and the model is tasked with regressing the x-coordinate offsets from the line anchor to the lane instance ground truth. The primary distinction between Polar R-CNN and previous approaches lies in the description of the lane anchors, which will be detailed in the following sections.
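For concreteness, a minimal Python/NumPy sketch of this sampling scheme (function and variable names are illustrative, not from the released code):

import numpy as np

def sample_anchor_points(H, N, x_of_y):
    # y-coordinates are uniform along the vertical axis: y_i = H/(N-1) * i
    ys = np.arange(N) * H / (N - 1)
    xs = x_of_y(ys)  # x-coordinate of the straight-line anchor at each sampled row
    return np.stack([xs, ys], axis=1)  # (N, 2); the model regresses x-offsets from these points

# e.g. a slanted anchor x = 0.2*y + 100 on a 320-pixel-high image, with N = 36 points
points = sample_anchor_points(H=320, N=36, x_of_y=lambda y: 0.2 * y + 100.0)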
\textbf{Polar Coordinate System.} Since lane anchors are typically represented as straight lines, they can be described using straight-line parameters. Previous approaches have used rays to describe 2D lane anchors, with the parameters comprising the coordinates of the starting point and the orientation/angle, denoted as $\left\{\theta, P_{xy}\right\}$, as shown in Fig. \ref{coord}(a). \cite{linecnn}\cite{laneatt} define the start points as lying on the three image boundaries. However, \cite{adnet} argues that this approach is problematic because the actual starting point of a lane could be located anywhere within the image. In our analysis, using a ray can lead to ambiguity in line representation because a line has an infinite number of possible starting points, and the choice of the starting point for a lane is subjective. As illustrated in Fig. \ref{coord}(a), the yellow (the visual start point) and green (the point located on the image boundary) starting points with the same orientation $\theta$ describe the same line, and either could be used in different datasets \cite{scnn}\cite{vil100}. This ambiguity arises because a straight line has two degrees of freedom, whereas a ray has three (two for the start point and one for the orientation). To resolve this issue, we propose using polar coordinates to describe a lane anchor with only two parameters, radius and angle, denoted as $\left\{\theta, r\right\}$, where $\theta \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right)$ and $r \in \left(-\infty, +\infty\right)$. This representation is illustrated in Fig. \ref{coord}(b).
\begin{figure}[t]
\centering
@@ -232,7 +234,7 @@ Lanes are characterized by their thin and elongated curved shapes. A suitable la
\label{coord}
\end{figure}
We define two types of polar coordinate systems: the global coordinate system and the local coordinate system, with the origin points denoted as the global origin $\boldsymbol{c}^{g}$ and the local origin $\boldsymbol{c}^{l}$, respectively. For convenience, the global origin is positioned near the static vanishing point of the entire lane image dataset, while the local origins are set at lattice points within the image. As illustrated in Fig. \ref{coord}(b), only the radius parameters are affected by the choice of the origin point, while the angle/orientation parameters remain consistent.
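For concreteness, the following sketch computes the polar parameters of an anchor with respect to different origins, under a Hesse-normal-form reading of $\left\{\theta, r\right\}$ (a point $q$ lies on the line iff $(q-c)\cdot(\cos\theta, \sin\theta) = r$); the numbers and names are illustrative:

import numpy as np

def polar_params(theta, p, origin):
    # theta: normal angle of the line (origin-independent);
    # r: signed distance from `origin` to the line through point `p`
    n = np.array([np.cos(theta), np.sin(theta)])  # unit normal of the line
    r = float(np.dot(np.asarray(p) - np.asarray(origin), n))
    return theta, r

p = np.array([400.0, 160.0])           # any point known to lie on the anchor
c_global = np.array([400.0, 100.0])    # global origin, e.g. near the vanishing point
c_local = np.array([100.0, 300.0])     # one lattice-point local origin

print(polar_params(0.3, p, c_global))  # (0.3, one radius)
print(polar_params(0.3, p, c_local))   # (0.3, a different radius; theta is unchanged)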
\subsection{Local Polar Head}
@@ -300,7 +302,7 @@ Next, feature points are sampled on the lane anchor. The y-coordinates of these
\label{gph}
\end{figure}
Suppose $P_{0}$, $P_{1}$, and $P_{2}$ denote the last three levels of the FPN, and let $\boldsymbol{F}_{L}^{s}\in \mathbb{R} ^{N_p\times d_f}$ denote the sampled point features from level $P_{L}$. The grid features from the three levels are extracted and fused together without the cross-layer cascade refinement used in CLRNet. To reduce the number of parameters, we employ a weighted-sum strategy to combine features from the different layers (indexed by $L$), similar to \cite{detr}, but in a more compact form:
\begin{equation}
\begin{aligned}
\boldsymbol{F}^s=\sum_{L=0}^2{\boldsymbol{F}_{L}^{s}\times \frac{e^{\boldsymbol{w}_{L}^{s}}}{\sum_{L=0}^2{e^{\boldsymbol{w}_{L}^{s}}}}},
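A minimal PyTorch sketch of this weighted fusion (sizes and names are illustrative; in the model, $\boldsymbol{w}_{L}^{s}$ is a learnable parameter):

import torch

N_p, d_f = 36, 64                                                   # illustrative sizes
F_levels = torch.stack([torch.randn(N_p, d_f) for _ in range(3)])   # F_L^s for L = 0, 1, 2
w = torch.nn.Parameter(torch.zeros(3, N_p))                         # logits w_L^s, one per level and point

alpha = torch.softmax(w, dim=0)                    # softmax over the three levels, per sampled point
F_s = (alpha.unsqueeze(-1) * F_levels).sum(dim=0)  # (N_p, d_f), matching the equation above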
@@ -323,7 +325,7 @@ where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents the learnable agg
The corresponding positive anchors, $[\theta_i, r_{i}^{global}]$;\\
The x-coordinates of the points sampled from the positive anchors, $\boldsymbol{x}_{i}^{b}$;\\
The positive confidences obtained from the O2M cls head, $s_i$;\\
The positive regressions obtained from the O2M reg head, i.e., the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and the end point locations $\boldsymbol{e}_{i}$.\\
\ENSURE ~~\\ % Output of the algorithm
\STATE Calculate the confidence adjacency matrix $\boldsymbol{C} \in \mathbb{R} ^{N_{pos} \times N_{pos}}$, where the element $C_{ij}$ of $\boldsymbol{C}$ is calculated as follows:
\begin{equation}
@@ -335,7 +337,7 @@ where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents the learnable agg
\end{aligned}
\label{al_1-1}
\end{equation}
where $\land$ denotes the (element-wise) logical ``AND'' operation between two Boolean values/tensors.
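The exact formula for $C_{ij}$ (Eq. \ref{al_1-1}) lies outside this excerpt; purely as a hypothetical sketch, a Fast-NMS-style construction, in which anchor $j$ may suppress anchor $i$ when its confidence is higher and passes the one-to-many threshold, could look like:

import torch

def confidence_adjacency(s, C_o2m):
    # ASSUMED definition, not necessarily the paper's Eq. (al_1-1):
    # C[i, j] = (s_j > s_i) AND (s_j >= C_o2m)
    s = s.view(-1)
    return (s.unsqueeze(0) > s.unsqueeze(1)) & (s.unsqueeze(0) >= C_o2m)  # Boolean (N_pos, N_pos)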
\STATE Calculate the geometric prior adjacency matrix $\boldsymbol{M} \in \mathbb{R} ^{N_{pos} \times N_{pos}}$, where the element $M_{ij}$ of $\boldsymbol{M}$ is calculated as follows:
\begin{equation}
\begin{aligned}
@@ -377,7 +379,7 @@ where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents the learnable agg
\label{gnn}
\end{figure}
\textbf{NMS vs.\ NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denote the RoI features extracted from the $i_{th}$ anchor; the three subheads take $\boldsymbol{F}^{roi}_{i}$ as input. For now, let us focus on the O2M classification (O2M cls) head and the O2M regression (O2M reg) head, which follow the old paradigm used in previous work and can serve as a baseline for the new one-to-one paradigm. To maintain simplicity and rigor, both the O2M cls head and the O2M reg head consist of two layers with activation functions, featuring a plain structure without any complex mechanisms such as attention or deformable convolution. As previously mentioned, merely replacing the one-to-many label assignment with a one-to-one label assignment is insufficient for eliminating NMS post-processing. This is because anchors often exhibit significant overlap or are positioned very close to each other, as shown in Fig. \ref{anchor setting}(b)\&(c). Let $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ represent the features from two overlapping (or very close) anchors, implying that $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ will be almost identical. Let $f_{plain}^{cls}$ denote the neural structure used in the O2M cls head and suppose it is trained with one-to-one label assignment. If $\boldsymbol{F}^{roi}_{i}$ is a positive sample and $\boldsymbol{F}^{roi}_{j}$ is a negative sample, the ideal output should be as follows:
\begin{equation}
\begin{aligned}
&\boldsymbol{F}_{i}^{roi}\approx \boldsymbol{F}_{j}^{roi},
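The contradiction is easy to see numerically: a continuous plain head maps near-identical inputs to near-identical scores, so it cannot assign one anchor a score near 1 and its overlapping neighbor a score near 0. A small illustration (head layout and dimensions are illustrative):

import torch
import torch.nn as nn

torch.manual_seed(0)
d_f = 64  # illustrative feature dimension
f_plain_cls = nn.Sequential(nn.Linear(d_f, d_f), nn.ReLU(), nn.Linear(d_f, 1), nn.Sigmoid())

F_i = torch.randn(1, d_f)
F_j = F_i + 1e-4 * torch.randn(1, d_f)  # two almost-identical RoI features

print(f_plain_cls(F_i).item(), f_plain_cls(F_j).item())  # nearly equal scores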
@@ -466,11 +468,10 @@ It should be noted that the O2O cls head depends on the predictions of O2M cls he
\\
&d_{i}^{\mathcal{U}}=\max \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\min \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right),
\\
-&d_{i}^{\mathcal{O}}=\max \left( \hat{d}_{i}^{\mathcal{O}},0 \right) \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, d_{i}^{\xi}=\max \left( \hat{d}_{i}^{\xi},0 \right).
+&d_{i}^{\mathcal{O}}=\max \left( \hat{d}_{i}^{\mathcal{O}},0 \right), \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, d_{i}^{\xi}=\max \left( \hat{d}_{i}^{\xi},0 \right),
\end{aligned}
\end{equation}
-The definations of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\mathcal{\xi}}$ is similar but slightly different from those in \cite{clrnet} and \cite{adnet}, with adjustments made to ensure the values are non-negative. This format is intended to maintain consistency with the IoU definitions used for bounding boxes. Therefore, the overall GLaneIoU is given as follows:
+where $w_{b}$ is the base semi-width of the lane instance. The definitions of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\xi}$ are similar to, but slightly different from, those in \cite{clrnet} and \cite{adnet}, with adjustments made to ensure the values are non-negative. This format is intended to maintain consistency with the IoU definitions used for bounding boxes. Therefore, the overall GLaneIoU is given as follows:
\begin{equation}
\begin{aligned}
GLaneIoU\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}},
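A NumPy sketch of GLaneIoU under stated assumptions: only $d_{i}^{\mathcal{U}}$ appears verbatim above, so $\hat{d}_{i}^{\mathcal{O}}$ (signed overlap) and $\hat{d}_{i}^{\xi}$ (signed gap) are taken here in the usual line-IoU sense:

import numpy as np

def glane_iou(xp, xq, wp, wq, g=1.0):
    # xp, xq: per-row x-coordinates of lanes p and q on their common valid rows (i = j..k);
    # wp, wq: per-row semi-widths
    left = np.maximum(xp - wp, xq - wq)
    right = np.minimum(xp + wp, xq + wq)
    d_u = np.maximum(xp + wp, xq + wq) - np.minimum(xp - wp, xq - wq)  # d_i^U as defined above
    d_o = np.maximum(right - left, 0.0)   # d_i^O  = max(d_hat^O, 0), assumed overlap term
    d_xi = np.maximum(left - right, 0.0)  # d_i^xi = max(d_hat^xi, 0), assumed gap term
    return d_o.sum() / d_u.sum() - g * d_xi.sum() / d_u.sum()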
@@ -607,7 +608,7 @@ For TuSimple, the evaluation is formulated as follows:
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the False Positive Rate (FP $= 1 -$ Precision) and False Negative Rate (FN $= 1 -$ Recall).
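A simplified sketch of this metric for a single matched lane (the official benchmark additionally handles invalid points and lane-to-lane matching):

def tusimple_accuracy(pred_xs, gt_xs, pixel_thresh=20.0):
    # accuracy = C_clip / S_clip; the lane counts as correct if accuracy > 0.85
    correct = sum(abs(p - g) <= pixel_thresh for p, g in zip(pred_xs, gt_xs))
    accuracy = correct / len(gt_xs)
    return accuracy, accuracy > 0.85

acc, is_correct = tusimple_accuracy([100.0, 120.5, 141.0], [102.0, 119.0, 150.0])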
\subsection{Implementation Details}
-All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The number of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. Other parameters, such as batch size and loss weights for each dataset, are detailed in Table \ref{dataset_info}. Since some test/validation sets for the five datasets are not accessible, the test/validation sets used are also listed in Table \ref{dataset_info}. All the expoeriments are conducted on a single NVIDIA A100-40G GPU. To make our model simple, we only use CNN based backbone, namely ResNet\cite{resnet} and DLA34\cite{dla}.
+All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For optimization, we use the AdamW \cite{adam} optimizer with a learning-rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. We set different base semi-widths, denoted as $w_{b}^{assign}$, $w_{b}^{cost}$, and $w_{b}^{loss}$, for the label assignment, cost function, and loss function, respectively, as demonstrated in previous work \cite{clrernet}. Other parameters, such as the batch size and loss weights for each dataset, are detailed in Table \ref{dataset_info}. Since some test/validation sets for the five datasets are not accessible, the test/validation sets used are also listed in Table \ref{dataset_info}. All experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we use only CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}.
\begin{table*}[htbp]
@@ -799,7 +800,7 @@ We also compare the number of anchors and processing speed with other methods. F
\subsection{Ablation Study and Visualization}
To validate and analyze the effectiveness and influence of the different components of Polar R-CNN, we conduct several ablation experiments on the CULane and CurveLanes datasets.
\textbf{Ablation study on the polar coordinate system and anchor number.} To assess the importance of the local polar coordinates of anchors, we examine the contribution of each component (i.e., angle and radius) to model performance. As shown in Table \ref{aba_lph}, both angle and radius contribute to performance to varying degrees. Additionally, we conduct experiments with an auxiliary loss using fixed anchors and Polar R-CNN. Fixed anchors refer to the anchor settings trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b). Model performance improves by 0.48\% and 0.3\% under the fixed-anchor paradigm and the proposal-anchor paradigm, respectively.
We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves with increasing local polar map size and tends to stabilize once the size is sufficiently large. Specifically, precision improves while recall decreases. A larger polar map size includes more background anchors in the second stage (since we choose $k=4$ for SimOTA, with no more than four positive samples per ground truth). Consequently, the model learns more negative samples, enhancing precision but reducing recall. Regarding the number of anchors chosen during the evaluation stage, recall and the F1 measure increase significantly in the early stages of anchor-number expansion but stabilize in later stages. This suggests that eliminating some anchors does not significantly affect performance. Fig. \ref{cam} displays the heat map and the distribution of the top-$K_{a}$ selected anchors in sparse scenarios. Brighter colors indicate a higher likelihood of an anchor being a foreground anchor. It is evident that most of the proposed anchors cluster around the lane ground truth.
@@ -1247,10 +1248,16 @@ In this paper, we propose Polar R-CNN to address two key issues in anchor-based
%\newpage
%
\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/wsq.jpg}}]{Shengqi Wang}
-received the Master degree from Xi'an Jiaotong University, Xi'an, China, in 2020. He is now pursuing for the Ph.D. degree in statistics at Xi'an Jiaotong University. His research interests include low-level computer vision, deep learning, and so on.
+received the Master's degree from Xi'an Jiaotong University, Xi'an, China, in 2022. He is now pursuing the Ph.D. degree in statistics at Xi'an Jiaotong University. His research interests include low-level computer vision and deep learning.
\end{IEEEbiography}
\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/ljm.pdf}}]{Junmin Liu}
was born in 1982. He received the Ph.D. degree in Mathematics from Xi'an Jiaotong University, Xi'an, China, in 2013. From 2011 to 2012, he served as a Research Assistant with the Department of Geography and Resource Management at the Chinese University of Hong Kong, Hong Kong, China. From 2014 to 2017, he worked as a Visiting Scholar at the University of Maryland, College Park, USA. He is currently a full Professor at the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China. His research interests are mainly focused on the theory and application of machine learning and image processing. He has published over 60 research papers in international conferences and journals.
\end{IEEEbiography}
+\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/xiangyongcao.jpg}}]{Xiangyong Cao (Member, IEEE)}
+received the B.Sc. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2012 and 2018, respectively. From 2016 to 2017, he was a Visiting Scholar with Columbia University, New York, NY, USA. He is an Associate Professor with the School of Computer Science and Technology, Xi'an Jiaotong University. His research interests include statistical modeling and image processing.
+\end{IEEEbiography}
\vfill
\end{document}

Binary file not shown (image, 603 KiB).