\maketitle
\begin{abstract}
Lane detection is a critical and challenging task in autonomous driving, particularly in real-world scenarios where traffic lanes can be slender, lengthy, and often obscured by other vehicles, complicating detection efforts. Existing anchor-based methods typically rely on prior lane anchors to extract features and subsequently refine the location and shape of lanes. While these methods achieve high performance, manually setting prior anchors is cumbersome, and ensuring sufficient coverage across diverse datasets often requires a large number of dense anchors. Furthermore, the use of \textit{Non-Maximum Suppression} (NMS) to eliminate redundant predictions complicates real-world deployment and may underperform in complex scenarios. In this paper, we propose \textit{Polar R-CNN}, an NMS-free anchor-based method for lane detection. By incorporating both local and global polar coordinate systems, Polar R-CNN facilitates flexible anchor proposals and significantly reduces the number of anchors required without compromising performance. Additionally, we introduce a triplet head with a heuristic structure that supports an NMS-free paradigm, enhancing deployment efficiency and performance in scenarios with dense lanes. Our method achieves competitive results on five popular lane detection benchmarks—\textit{Tusimple}, \textit{CULane}, \textit{LLAMAS}, \textit{CurveLanes}, and \textit{DL-Rail}—while maintaining a lightweight design and straightforward structure. Our source code is available at \href{https://github.com/ShqWW/PolarRCNN}{\textit{https://github.com/ShqWW/PolarRCNN}}.
\end{abstract}
\begin{IEEEkeywords}
Lane Detection, NMS-Free, Graph Neural Network, Polar Coordinate System.
\par
Drawing inspiration from object detection methods such as \textit{YOLO} \cite{yolov10} and \textit{Faster R-CNN} \cite{fasterrcnn}, several anchor-based approaches have been introduced for lane detection, with representative works including \textit{LaneATT} \cite{laneatt} and \textit{CLRNet} \cite{clrnet}. These methods have shown superior performance by leveraging anchor \textit{priors} (as shown in Fig. \ref{anchor setting}) and enabling larger receptive fields for feature extraction. However, anchor-based methods encounter drawbacks similar to those in general object detection, including the following:
\begin{itemize}
\item As shown in Fig. \ref{anchor setting}(a), a large number of lane anchors are predefined in the image, even in \textbf{\textit{sparse scenarios}}---situations where lanes are distributed widely and located far apart from each other, as illustrated in Fig. \ref{anchor setting}(d).
\item A \textit{Non-Maximum Suppression} (NMS) \cite{nms} post-processing step is required to eliminate redundant predictions, but it may struggle in \textbf{\textit{dense scenarios}} where lanes are close to each other, such as forked lanes and double lanes, as illustrated in Fig. \ref{NMS setting}(a).
\end{itemize}
\par
\begin{figure*}[ht]
\centering
\includegraphics[width=0.99\linewidth]{thesis_figure/ovarall_architecture.png}
\caption{An illustration of the Polar R-CNN architecture. It has a pipeline similar to that of Faster R-CNN for the task of object detection, and consists of a backbone, a \textit{Feature Pyramid Network} with three levels of feature maps, respectively denoted by $\boldsymbol{P}_1, \boldsymbol{P}_2, \boldsymbol{P}_3$, followed by a \textit{Local Polar Module} and a \textit{Global Polar Module} for lane detection. Based on the designed lane representation and lane anchor representation in the polar coordinate system, the local polar module proposes sparse line anchors and the global polar module produces the final accurate lane predictions. The global polar module includes a triplet head, which comprises a \textit{one-to-one} (O2O) classification head, a \textit{one-to-many} (O2M) classification head, and a \textit{one-to-many} (O2M) regression head.}
\label{overall_architecture}
\end{figure*}
\section{Related Works}
\begin{figure}[t]
\centering
\includegraphics[width=0.87\linewidth]{thesis_figure/coord/localpolar.png}
\caption{The local polar coordinate system. The ground truth radius $\hat{r}_{i}^{l}$ of the $i$-th local pole is defined as the minimum distance from the pole to the lane curve instance. A positive pole has a radius $\hat{r}_{i}^{l}$ below a threshold $\tau^{l}$, and vice versa. Additionally, the ground truth angle $\hat{\theta}_i$ is determined by the angle formed between the radius vector (connecting the pole to the closest point on the lane) and the polar axis.}
\label{lpmlabel}
\end{figure}
\par
\textbf{Representation in Polar Coordinate.} As stated above, lane anchors represented by rays have some drawbacks. To address these issues, we introduce a polar coordinate representation of lane anchors. In mathematics, the polar coordinate system is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a reference point (\textit{i.e.}, the pole) and an angle $\theta$ from a reference direction (\textit{i.e.}, the polar axis). As shown in Fig. \ref{coord}(b), given a pole corresponding to the yellow point, a lane anchor for a straight line can be uniquely defined by two parameters: the radial distance from the pole (\textit{i.e.}, the radius), $r$, and the counterclockwise angle from the polar axis to the perpendicular line of the lane anchor, $\theta$, with $r \in \mathbb{R}$ and $\theta\in\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$.
\par
To better leverage the local inductive bias of CNNs, we define two types of polar coordinate systems: the local and the global coordinate system. The local polar coordinate system is used to generate lane anchors, while the global coordinate system expresses these anchors in a unified form over the entire image and regresses them to the ground truth lane instances. Given the distinct roles of the local and global systems, we adopt a two-stage framework for our Polar R-CNN, similar to Faster R-CNN \cite{fasterrcnn}.
\par
The local polar system is designed to predict lane anchors adaptable to both sparse and dense scenarios. In this system, there are many poles, each located at a lattice point of the feature map, referred to as local poles. As illustrated on the left side of Fig. \ref{lpmlabel}, there are two types of local poles: positive and negative. Positive local poles (\textit{e.g.}, the blue points) have a radius $\hat{r}_{i}^{l}$ below a threshold $\tau^l$; otherwise, they are classified as negative local poles (\textit{e.g.}, the red points). Each local pole is responsible for predicting a single lane anchor, and a single lane ground truth may generate multiple lane anchors: as shown in Fig. \ref{lpmlabel}, there are three positive poles around the lane instance (the green lane), which are expected to generate three lane anchors.
%This one-to-many approach is essential for ensuring comprehensive anchor proposals, especially since some local features around certain poles may be lost due to damage or occlusion of the lane curve.
\par
In the local polar coordinate system, the parameters of each lane anchor are determined based on the location of its corresponding local pole. In practice, however, once a lane anchor is generated, its position becomes fixed and independent of its original local pole. To simplify the representation of lane anchors in the second stage of Polar R-CNN, we design a global polar system featuring a single, unified pole that serves as a reference point for the entire image. The location of this global pole is set manually; in this case, it is positioned near the static vanishing point observed across the entire lane image dataset. This ensures a consistent and unified polar coordinate for expressing lane anchors within the global context of the image, facilitating accurate regression to the ground truth lane instances.
\begin{figure}[t]
\centering
\label{lpm}
\end{figure}
\subsection{Local Polar Module}
As shown in Fig. \ref{overall_architecture}, three levels of feature maps, denoted as $\boldsymbol{P}_1, \boldsymbol{P}_2, \boldsymbol{P}_3$, are extracted using a \textit{Feature Pyramid Network} (FPN). To generate high-quality anchors around the lane ground truths within an image, we introduce the \textit{Local Polar Module} (LPM), which takes the highest-level feature map $\boldsymbol{P}_3\in\mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ as input and outputs a set of lane anchors along with their confidence scores. As demonstrated in Fig. \ref{lpm}, $\boldsymbol{P}_3$ undergoes a \textit{downsampling} operation $DS(\cdot)$ to produce a lower-dimensional feature map of size $H^l\times W^l$:
\begin{equation}
\boldsymbol{F}_d\gets DS\left( \boldsymbol{P}_{3} \right)\ \text{and}\ \boldsymbol{F}_d\in \mathbb{R} ^{C_f\times H^{l}\times W^{l}}.
\end{equation}
The downsampled feature map $\boldsymbol{F}_d$ is then fed into two branches: a \textit{regression} branch $\phi _{reg}^{l}\left(\cdot \right)$ and a \textit{classification} branch $\phi _{cls}^{l}\left(\cdot \right)$, \textit{i.e.},
\begin{align}
\boldsymbol{F}_{reg}\gets \phi _{reg}^{l}\left( \boldsymbol{F}_d \right)\ &\text{and}\ \boldsymbol{F}_{reg}\in \mathbb{R} ^{2\times H^{l}\times W^{l}},\\
\boldsymbol{F}_{cls}\gets \phi _{cls}^{l}\left( \boldsymbol{F}_d \right)\ &\text{and}\ \boldsymbol{F}_{cls}\in \mathbb{R} ^{H^{l}\times W^{l}}. \label{lpm equ}
\end{align}
The regression branch consists of a single $1\times1$ convolutional layer, with the goal of generating lane anchors by outputting their angles $\theta_j$ and radii $r^{l}_{j}$, \textit{i.e.}, $\boldsymbol{F}_{reg} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, in the local polar coordinate system defined previously. Similarly, the classification branch $\phi _{cls}^{l}\left(\cdot \right)$ consists of only two $1\times1$ convolutional layers for simplicity. This branch predicts the confidence heat map $\boldsymbol{F}_{cls}\equiv \left\{ s_j^l \right\} _{j=1}^{H^l\times W^l}$ for local poles, each associated with a feature point. By discarding local poles with lower confidence, the module increases the likelihood of selecting potential positive foreground lane anchors while effectively removing background lane anchors.
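
For concreteness, a minimal PyTorch sketch of the LPM is given below. The channel width, the $H^{l}\times W^{l}$ grid size, and the use of adaptive average pooling for $DS(\cdot)$ are illustrative assumptions rather than the exact implementation.
\begin{verbatim}
# Minimal sketch of the Local Polar Module (illustrative sizes, not the
# authors' exact implementation).
import torch
import torch.nn as nn

class LocalPolarModule(nn.Module):
    def __init__(self, c_f=64, pool_size=(4, 10)):
        super().__init__()
        # DS(.): downsample P3 to the H^l x W^l local-polar grid.
        self.downsample = nn.AdaptiveAvgPool2d(pool_size)
        # Regression branch: one 1x1 conv predicting (theta_j, r_j^l) per pole.
        self.reg_branch = nn.Conv2d(c_f, 2, kernel_size=1)
        # Classification branch: two 1x1 convs predicting the confidence map.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_f, c_f, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c_f, 1, kernel_size=1))

    def forward(self, p3):                    # p3: (B, C_f, H_f, W_f)
        f_d = self.downsample(p3)             # (B, C_f, H^l, W^l)
        f_reg = self.reg_branch(f_d)          # (B, 2, H^l, W^l): angles, radii
        f_cls = self.cls_branch(f_d).sigmoid().squeeze(1)  # (B, H^l, W^l)
        return f_reg, f_cls

lpm = LocalPolarModule()
f_reg, f_cls = lpm(torch.randn(1, 64, 20, 50))
\end{verbatim}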
\par
\textbf{Loss Function for LPM.} To train the LPM, we define the ground truth labels for each local pole as follows: the ground truth radius, $\hat{r}^l_j$, is set to be the minimum distance from a local pole to the corresponding lane curve, while the ground truth angle, $\hat{\theta}_j$, is set to be the orientation of the vector extending from the local pole to the nearest point on the curve. Consequently, we have a label set of local poles $\hat{\boldsymbol{F}}_{cls}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ if it is negative. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, the LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for the LPM are given as follows:
\begin{align}
\mathcal{L} ^{l}_{cls}&=BCE\left( \boldsymbol{F}_{cls},\hat{\boldsymbol{F}}_{cls} \right),\\
\mathcal{L} _{reg}^{l}&=\frac{1}{N_{pos}^{l}}\sum_{j\in \left\{ j|\hat{r}_{j}^{l}<\tau^l \right\}}{\left( S_{L1}\left( \theta _{j}^{l}-\hat{\theta}_{j}^{l} \right) +S_{L1}\left( r_{j}^{l}-\hat{r}_{j}^{l} \right) \right)},
\label{loss_lph}
\end{align}
where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\tau^{l}\}\right|$ is the number of positive local poles in the LPM.
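
The label generation and the LPM loss above can be sketched as follows, assuming a lane ground truth is given as a dense polyline of 2D points; the value of $\tau^l$ and the exact angle convention relative to the polar axis are illustrative assumptions.
\begin{verbatim}
# Sketch of LPM label generation and losses (illustrative conventions).
import numpy as np

def local_pole_labels(pole, lane_points):
    """Return (r_hat, theta_hat): minimum distance from a local pole to the
    lane curve and the orientation of the vector to the closest lane point."""
    d = np.asarray(lane_points) - np.asarray(pole)   # (M, 2) offsets
    dist = np.linalg.norm(d, axis=1)
    k = int(np.argmin(dist))
    dx, dy = d[k]
    return dist[k], np.arctan2(dy, dx)

def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def lpm_losses(pred_theta, pred_r, gt_theta, gt_r, pred_s, gt_s, tau_l=8.0):
    eps = 1e-6
    # Binary cross-entropy over all local poles (positive label: gt_r < tau_l).
    l_cls = -np.mean(gt_s * np.log(pred_s + eps)
                     + (1.0 - gt_s) * np.log(1.0 - pred_s + eps))
    pos = gt_r < tau_l
    # Smooth-L1 regression loss, averaged over the positive local poles only.
    l_reg = 0.0 if not pos.any() else np.mean(
        smooth_l1(pred_theta[pos] - gt_theta[pos])
        + smooth_l1(pred_r[pos] - gt_r[pos]))
    return l_cls, l_reg
\end{verbatim}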
\par
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a local pole in the feature map, are considered as candidates during the training stage. However, some of these anchors serve as background anchors. We select the $K$ anchors with the highest confidence scores as the foreground candidates to feed into the second stage (\textit{i.e.}, the global polar module). During training, all anchors are chosen as candidates, \textit{i.e.}, $K=H^{l}\times W^{l}$, which assists the \textit{Global Polar Module} (the second stage) in learning from a diverse range of features, including various negative background anchor samples. Conversely, during the evaluation stage, some anchors with lower confidence can be excluded, where $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
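
A sketch of the evaluation-time top-$K$ selection is shown below; the value of $K$ and the array shapes are illustrative.
\begin{verbatim}
# Keep the K anchors with the highest local-pole confidence (illustrative K).
import numpy as np

def select_topk_anchors(scores, thetas, radii, k=20):
    order = np.argsort(-scores)                 # descending by confidence s_j^l
    keep = order[:k]
    return thetas[keep], radii[keep], scores[keep]

scores = np.random.rand(4 * 10)                 # H^l x W^l local poles, flattened
thetas = np.random.uniform(-np.pi / 2, np.pi / 2, scores.shape)
radii = np.random.uniform(0, 50, scores.shape)
sel_thetas, sel_radii, sel_scores = select_topk_anchors(scores, thetas, radii)
\end{verbatim}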
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/detection_head.png}
\caption{The primary pipeline of the GPM integrates the RoI pooling layer with the triplet head. The triplet head comprises three components: the O2O classification head, the O2M classification head, and the O2M regression head. The O2O classification head serves as a replacement for NMS; the dashed path with ``$\times$'' indicates that NMS is no longer necessary. Both sets of scores, $\left\{s_i^g\right\}$ and $\left\{\tilde{s}_i^g\right\}$, participate in the process of selecting the final non-redundant outcomes, a procedure referred to as dual confidence selection. During the backward training phase, the gradients from the O2O classification head (the blue dashed route with ``$\times$'') are stopped.}
\label{gpm}
\end{figure}

\subsection{Global Polar Module}
We introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve the final lane predictions. As illustrated in Fig. \ref{overall_architecture}, the GPM takes feature samples from the anchors proposed by the LPM and provides the precise locations and confidence scores of the final lane detection results. The overall architecture of the GPM is illustrated in Fig. \ref{gpm}.
\par
\textbf{RoI Pooling Layer.} It is designed to extract sampled features from lane anchors. For ease of the sampling operation, we first transform the radius of each positive lane anchor in the local polar coordinate system, $r_j^l$, into its equivalent in the global polar coordinate system, $r_j^g$, by the following equation:
\begin{align}
r_{j}^{g}&=r_{j}^{l}+\left[ \cos \theta _j;-\sin \theta _j \right] ^T\left( \boldsymbol{c}_{j}^{l}-\boldsymbol{c}^g \right), \label{l2g}\\
j &= 1, 2, \cdots, K, \notag
\end{align}
where $\boldsymbol{c}^{g} \in \mathbb{R}^{2}$ and $\boldsymbol{c}^{l}_{j} \in \mathbb{R}^{2}$ represent the Cartesian coordinates of the global pole and the $j$-th local pole, respectively. It is noteworthy that the angle $\theta_j$ remains unaltered, as the local and global polar coordinate systems share the same polar axis. Next, the feature points are sampled on each lane anchor as follows:
\begin{align}
x_{i,j}^{s}&=y_{i,j}^{s}\tan \theta _j+\frac{r_{j}^{g}+\left[ \cos \theta _j;-\sin \theta _j \right] ^T\boldsymbol{c}^g}{\cos \theta _j},\label{positions}\\
i&=1,2,\cdots,N,\notag
\end{align}
where the y-coordinates $\boldsymbol{y}_{j}^{s}\equiv \{y_{1,j}^s,y_{2,j}^s,\cdots ,y_{N,j}^s\}$ of the $j$-th lane anchor are uniformly sampled vertically from the image, as previously mentioned. The x-coordinates $\boldsymbol{x}_{j}^{s}\equiv \{x_{1,j}^s,x_{2,j}^s,\cdots ,x_{N,j}^s\}$ are then calculated by Eq. (\ref{positions}). Eqs. (\ref{l2g})--(\ref{positions}) follow directly from elementary Euclidean geometry.
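
The two equations above amount to only a few lines of code. The following NumPy sketch converts a local-polar anchor to the global polar system via Eq. (\ref{l2g}) and samples its points at $N$ fixed image rows via Eq. (\ref{positions}); the pole locations, image height, and $N$ are illustrative assumptions.
\begin{verbatim}
# Sketch of Eqs. (l2g) and (positions) with illustrative poles and image size.
import numpy as np

def local_to_global_radius(r_l, theta, c_local, c_global):
    d = np.array([np.cos(theta), -np.sin(theta)])
    return r_l + d @ (np.asarray(c_local, float) - np.asarray(c_global, float))

def sample_anchor_points(r_g, theta, c_global, img_h=320, n_points=36):
    y_s = np.linspace(0, img_h - 1, n_points)       # uniform vertical samples
    d = np.array([np.cos(theta), -np.sin(theta)])
    x_s = y_s * np.tan(theta) + (r_g + d @ np.asarray(c_global, float)) / np.cos(theta)
    return x_s, y_s

r_g = local_to_global_radius(r_l=12.0, theta=0.3,
                             c_local=(40, 100), c_global=(160, 80))
x_s, y_s = sample_anchor_points(r_g, theta=0.3, c_global=(160, 80))
\end{verbatim}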
\par
Given the feature maps $\boldsymbol{P}_1, \boldsymbol{P}_2, \boldsymbol{P}_3$ from the FPN, we can extract the feature vectors corresponding to the positions of the feature points $\{(x_{1,j}^s,y_{1,j}^s),(x_{2,j}^s,y_{2,j}^s),\cdots,(x_{N,j}^s,y_{N,j}^s)\}_{j=1}^{K}$, respectively denoted as $\boldsymbol{F}_{1,j}, \boldsymbol{F}_{2,j}, \boldsymbol{F}_{3,j}\in \mathbb{R} ^{N\times C_f}$. To enhance the representation, similar to \cite{srlane}, we employ a weighted sum strategy to combine the features from different levels as:
\begin{equation}
\boldsymbol{F}^s_j=\sum_{k=1}^3{\frac{e^{\boldsymbol{w}_{k}}}{\sum_{k'=1}^3{e^{\boldsymbol{w}_{k'}}}}\circ \boldsymbol{F}_{k,j} },
\end{equation}
where $\boldsymbol{w}_{k}\in \mathbb{R}^{N}$ represents the trainable aggregation weights ascribed to the $N$ sampled points, and ``$\circ$'' denotes element-wise multiplication (\textit{i.e.}, the Hadamard product). Instead of directly concatenating the three sampled features into $\boldsymbol{F}^s_j\in \mathbb{R} ^{N\times 3C_f}$, the adaptive summation significantly reduces the feature dimension to $\boldsymbol{F}^s_j\in \mathbb{R} ^{N\times C_f}$, which is one-third of the initial dimension. The weighted sum of the tensors is flattened into a vector $\bar{\boldsymbol{F}}^s_j\in \mathbb{R} ^{NC_f}$ and then subjected to a linear transformation:
\begin{align}
\boldsymbol{F}_{j}^{roi}&\gets \boldsymbol{W}_{pool}\bar{\boldsymbol{F}}_{j}^{s},\\
j&=1,2,\cdots,K.\notag
\end{align}
Here, $\boldsymbol{W}_{pool}\in \mathbb{R} ^{d_r\times NC_f}$ is employed to reduce the dimension of $\bar{\boldsymbol{F}}_{j}^{s}$, thereby yielding the final RoI feature $\boldsymbol{F}_{j}^{roi}\in \mathbb{R} ^{d_r}$, where $d_r\ll NC_f$.
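
A PyTorch sketch of this RoI pooling layer is given below, combining the softmax-weighted fusion of the three feature levels, the flattening step, and the projection $\boldsymbol{W}_{pool}$; the dimensions $N$, $C_f$, and $d_r$ are illustrative assumptions.
\begin{verbatim}
# Sketch of the RoI pooling layer (illustrative dimensions).
import torch
import torch.nn as nn

class RoIPooling(nn.Module):
    def __init__(self, n_points=36, c_f=64, d_r=192):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(3, n_points))  # w_k, one per point
        self.proj = nn.Linear(n_points * c_f, d_r)       # W_pool

    def forward(self, feats):               # feats: (3, K, N, C_f), one per level
        a = torch.softmax(self.w, dim=0)    # normalize across the three levels
        fused = (a[:, None, :, None] * feats).sum(dim=0)  # (K, N, C_f)
        return self.proj(fused.flatten(1))  # (K, d_r) RoI features

roi_pool = RoIPooling()
f_roi = roi_pool(torch.randn(3, 20, 36, 64))
\end{verbatim}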

\textbf{Triplet Head.} Taking $\left\{ \boldsymbol{F}_{i}^{roi} \right\} _{i=1}^{K}$ as input, the triplet head encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{gpm}. To attain optimal non-redundant detection outcomes within an NMS-free paradigm (\textit{i.e.}, end-to-end detection), both the one-to-one and one-to-many label assignments become essential during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss}, but with subtle variations, we architect the triplet head to achieve an NMS-free paradigm.
%In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eradicates duplicate results. Nevertheless, NMS relies on the definition of the geometric distance between detection results, rendering this calculation intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis.
%As illustrated in Fig. \ref{gpm}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i^g \right\}$ output by the O2O classification head relies upon the confidence $\left\{ s_i^g \right\} $ output by the O2M classification head.
\begin{figure}[t]
\centering
\includegraphics[width=0.9\linewidth]{thesis_figure/gnn.png}
\caption{An example of the graph construction in the O2O classification head. In the illustration, the elements $A_{12}$, $A_{32}$ and $A_{54}$ are equal to $1$ in the adjacency matrix $\boldsymbol{A}$, thereby indicating the presence of directed edges between the respective node pairs (\textit{i.e.}, $1\rightarrow2$, $3\rightarrow2$ and $5\rightarrow4$). This implies that detection result $2$ may be potentially suppressed by $1$ and $3$, whereas detection result $4$ may be potentially suppressed by $5$.}
\label{graph}
\end{figure}

To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are architected with a straightforward design with \textcolor{red}{two-layer Multi-Layer Perceptrons (MLPs)}. To facilitate the model's transition to an NMS-free paradigm, we have developed an extended O2O classification head. In this section, we focus on elaborating the structure of the O2O classification head; the comprehensive details of the structure design can be found in Appendix \ref{NMS_appendix}.

Disregarding the intricate details, the fundamental prerequisites of Fast NMS are as follows. The detection result A is suppressed by another detection result B if:
\begin{itemize}
\item (1) The confidence score of B exceeds that of A;
\item (2) The distance between A and B is less than a predefined threshold.
\end{itemize}
According to the two conditions above, we can construct a relation graph between anchors, as illustrated in Fig. \ref{graph}.
In a graph, the essential components are nodes and edges. We construct a directed graph as follows. Each anchor is conceptualized as a node, with the RoI features $\boldsymbol{F}_{i}^{roi}$ serving as the input features (\textit{i.e.}, the initial signals) of these nodes. Directed edges between nodes are represented by the adjacency matrix $\boldsymbol{A}\in\mathbb{R}^{K\times K}$. Specifically, if an element $A_{ij}$ of $\boldsymbol{A}$ equals $1$, a directed edge exists from the $i$-th node to the $j$-th node, which implies that the $j$-th prediction may be suppressed by the $i$-th prediction. The existence of an edge is determined by two matrices corresponding to the two Fast NMS conditions above.

The first matrix is the confidence comparison matrix $\boldsymbol{A}^{C}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align}
A_{ij}^{C}=\begin{cases}
1,\,\,s_i^g>s_j^g\,\,or\,\,\left( s_i^g=s_j^g\,\,and\,\,i>j \right)\\
0,\,\,others.
\end{cases}
\label{confidential matrix}
\end{align}
This matrix facilitates the comparison of scores for each pair of anchors. The edge from the $i$-th node to the $j$-th node exists \textit{only if} the two anchors satisfy this condition, which is derived from condition (1) and additionally accounts for the situation in which the two confidence scores are equal.

The second matrix is the geometric prior matrix, denoted as $\boldsymbol{A}^{G}\in\mathbb{R}^{K\times K}$:
\begin{align}
A_{ij}^{G}=\begin{cases}
1,\,\,\left| \theta _i-\theta _j \right|<\tau^{\theta}\,\,and\,\,\left| r_{i}^{g}-r_{j}^{g} \right|<\tau^r\\
0,\,\,others.
\end{cases}
\label{geometric prior matrix}
\end{align}
This matrix indicates that an edge is considered to exist between two nodes \textit{only if} the corresponding anchors are sufficiently close to each other. The distance between anchors is characterized by their global polar parameters. This criterion, which is defined on the distance between anchors, is a slight variation of condition (2), which is defined on the distance between detection results.
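
A NumPy sketch of the two matrices, together with their element-wise combination into $\boldsymbol{A}$ described in the next paragraph, is given below; the thresholds $\tau^{\theta}$ and $\tau^{r}$ are illustrative values.
\begin{verbatim}
# Sketch of the confidence comparison matrix A^C, the geometric prior matrix
# A^G, and their combination A (illustrative thresholds).
import numpy as np

def build_adjacency(s_g, theta, r_g, tau_theta=0.1, tau_r=20.0):
    s_i, s_j = s_g[:, None], s_g[None, :]
    idx = np.arange(len(s_g))
    # A^C: node i may suppress node j if it has a higher score
    # (ties are broken by index to keep the relation asymmetric).
    a_conf = (s_i > s_j) | ((s_i == s_j) & (idx[:, None] > idx[None, :]))
    # A^G: an edge exists only between geometrically close anchors.
    a_geo = (np.abs(theta[:, None] - theta[None, :]) < tau_theta) & \
            (np.abs(r_g[:, None] - r_g[None, :]) < tau_r)
    return (a_conf & a_geo).astype(np.int64)   # A = A^C (element-wise) A^G

A = build_adjacency(np.array([0.9, 0.8, 0.7]),
                    np.array([0.30, 0.31, 0.90]),
                    np.array([100.0, 105.0, 250.0]))
\end{verbatim}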

With the aforementioned two matrices, the overall adjacency matrix is formulated as $\boldsymbol{A} = \boldsymbol{A}^{C} \odot \boldsymbol{A}^{G}$, where ``$\odot$'' signifies element-wise multiplication. Although we have now constructed the suppression relation graph over all pairs of anchors, the distance between predictions remains undefined. In Fast NMS, this distance is delineated by the geometric properties of the detection results, which constrains the model's performance in dense scenarios, as analyzed before. Some forked lanes or dashed lanes have a small geometric distance, which makes the distance threshold difficult to tune. We therefore replace the geometric distance with a high-dimensional semantic distance, which is modeled by a graph neural network and is thus data-driven. Consequently, the semantic distance between the $i$-th anchor and the $j$-th anchor can be modeled as follows:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( \boldsymbol{W}_{roi}\boldsymbol{F}_{i}^{roi}+\boldsymbol{b}_{roi} \right),\label{edge_layer_1}\\
\boldsymbol{F}_{ij}^{edge}&\gets \boldsymbol{W}_{in}\tilde{\boldsymbol{F}}_{j}^{roi}-\boldsymbol{W}_{out}\tilde{\boldsymbol{F}}_{i}^{roi},\label{edge_layer_2}\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\boldsymbol{W}_s\left( \boldsymbol{x}_{j}^{s}-\boldsymbol{x}_{i}^{s} \right) +\boldsymbol{b}_s,\label{edge_layer_3}\\
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right).\label{edge_layer_4}
\end{align}
Eqs. (\ref{edge_layer_1})-(\ref{edge_layer_4}) calculate the semantic distance $\boldsymbol{D}_{ij}^{edge}\in \mathbb{R}^{d_n}$ from the $i$-th node to the $j$-th node, corresponding to the directed edge $E_{i\rightarrow j}$. With the directed semantic distances provided for linked node pairs, we employ an element-wise max pooling layer that aggregates all the \textit{incoming edges} of a node to refine its node features into $\boldsymbol{D}_{i}^{node}\in \mathbb{R}^{d_n}$:
\begin{align}
D_{i,m}^{node}&\gets \underset{k\in \left\{ k|A_{ki}=1 \right\}}{\max}\,D_{ki,m}^{edge}, \\
m&=1,2,\cdots,d_n,\notag
\end{align}
where $D_{i,m}^{node}$ and $D_{ki,m}^{edge}$ are the $m$-th elements of $\boldsymbol{D}_{i}^{node}$ and $\boldsymbol{D}_{ki}^{edge}$, respectively. In this context, drawing inspiration from \cite{o3d}\cite{pointnet}, the max pooling aims to extract the most distinctive features along each column of the adjacency matrix (\textit{i.e.}, over the set of incoming nodes that may potentially suppress the refined node). With the refined node features, the ultimate confidence scores $\tilde{s}_{i}^{g}$ are generated by the subsequent layers:
\begin{align}
\boldsymbol{F}_{i}^{node}&\gets \mathrm{MLP}_{node}\left( \boldsymbol{D}_{i}^{node} \right),
\\
\tilde{s}_{i}^{g}&\gets \sigma \left( \boldsymbol{W}_{node}\boldsymbol{F}_{i}^{node} + \boldsymbol{b}_{node} \right).
\label{node_layer}
\end{align}
Eqs. (\ref{edge_layer_1})-(\ref{node_layer}) are referred to as the newly proposed \textit{graph neural network} (GNN) in our study, which serves as the structural foundation of the O2O classification head, replacing the traditional NMS post-processing.
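
A PyTorch sketch of this O2O classification head is given below; the hidden dimension $d_n$, the depth of $\mathrm{MLP}_{edge}$ and $\mathrm{MLP}_{node}$, and the handling of nodes without incoming edges are illustrative assumptions.
\begin{verbatim}
# Sketch of the GNN-based O2O classification head (illustrative sizes).
import torch
import torch.nn as nn

class O2OClsHead(nn.Module):
    def __init__(self, d_r=192, n_points=36, d_n=32):
        super().__init__()
        self.w_roi = nn.Linear(d_r, d_n)                    # Eq. (edge_layer_1)
        self.w_in = nn.Linear(d_n, d_n, bias=False)         # Eq. (edge_layer_2)
        self.w_out = nn.Linear(d_n, d_n, bias=False)
        self.w_s = nn.Linear(n_points, d_n)                 # Eq. (edge_layer_3)
        self.mlp_edge = nn.Sequential(nn.Linear(d_n, d_n), nn.ReLU(),
                                      nn.Linear(d_n, d_n))  # Eq. (edge_layer_4)
        self.mlp_node = nn.Sequential(nn.Linear(d_n, d_n), nn.ReLU())
        self.w_node = nn.Linear(d_n, 1)                     # Eq. (node_layer)

    def forward(self, f_roi, x_s, adj):
        # f_roi: (K, d_r) RoI features; x_s: (K, N) sampled x-coordinates;
        # adj: (K, K) 0/1 adjacency matrix A.
        f = torch.relu(self.w_roi(f_roi))                              # (K, d_n)
        edge = self.w_in(f)[None, :, :] - self.w_out(f)[:, None, :]    # i -> j
        edge = edge + self.w_s(x_s[None, :, :] - x_s[:, None, :])      # offsets
        d_edge = self.mlp_edge(edge)                                   # (K, K, d_n)
        # Element-wise max over incoming edges {k | A_ki = 1} of each node i.
        masked = d_edge.masked_fill(adj[:, :, None] == 0, float('-inf'))
        d_node = masked.max(dim=0).values
        d_node = torch.where(torch.isfinite(d_node), d_node,
                             torch.zeros_like(d_node))  # nodes with no incoming edge
        scores = torch.sigmoid(self.w_node(self.mlp_node(d_node))).squeeze(-1)
        return scores                                                  # (K,)

head = O2OClsHead()
s_tilde = head(torch.randn(5, 192), torch.randn(5, 36), torch.randint(0, 2, (5, 5)))
\end{verbatim}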

\textbf{Dual Confidence Selection.} Within the conventional NMS framework, the predictions emanating from the O2M classification head with confidences $\left\{ s_{i}^{g} \right\} $ surpassing $\lambda_{o2m}^s$ are designated as positive candidates; they are subsequently fed into the NMS post-processing stage to remove redundant predictions. In the NMS-free paradigm of our work, the final non-redundant predictions are selected through the following criterion:
\begin{align}
\varOmega _{o2o}^{pos}=\left\{ i\mid s_i^g>\lambda _{o2m}^{s}\,\,and\,\,\tilde{s}_i^g>\lambda _{o2o}^{s} \right\}.
\end{align}
We employ dual confidence thresholds, denoted as $\lambda_{o2m}^s$ and $\lambda_{o2o}^s$, to select the final non-redundant positive predictions. $\varOmega _{o2o}^{pos}$ signifies the ultimate collection of non-redundant predictions, wherein both confidences satisfy the aforementioned conditions in conjunction with the dual confidence thresholds. This methodology of selecting non-redundant predictions is termed \textit{dual confidence selection}.
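
Dual confidence selection reduces to a simple element-wise test, as sketched below with illustrative threshold values.
\begin{verbatim}
# Keep a prediction only if both the O2M score and the refined O2O score
# exceed their thresholds (illustrative threshold values).
import numpy as np

def dual_confidence_selection(s_o2m, s_o2o, lam_o2m=0.48, lam_o2o=0.5):
    keep = (s_o2m > lam_o2m) & (s_o2o > lam_o2o)
    return np.nonzero(keep)[0]      # indices of the final non-redundant lanes

print(dual_confidence_selection(np.array([0.9, 0.6, 0.2]),
                                np.array([0.8, 0.3, 0.7])))
\end{verbatim}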

\textbf{Label Assignment and Cost Function for GPM.} Following previous works \cite{o3d}\cite{pss}, we use a dual assignment strategy for the label assignment of the triplet head. The cost function between the $i$-th prediction and the $j$-th ground truth is given as follows:
\begin{align}
\mathcal{C} _{ij}^{o2m}&=s_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},\\
\mathcal{C} _{ij}^{o2o}&=\tilde{s}_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},
\end{align}
where $\mathcal{C}_{ij}^{o2m}$ is the cost function for the O2M classification and regression heads, while $\mathcal{C}_{ij}^{o2o}$ is that for the O2O classification head, with $\beta$ serving as the trade-off hyperparameter between location and confidence. This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet}, taking both location and confidence into account. We have redefined the IoU function between lane instances, $GIoU_{lane}$, which differs slightly from previous work \cite{clrernet}. More details about $GIoU_{lane}$ can be found in Appendix \ref{giou_appendix}.
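
A sketch of the cost computation is given below, assuming a user-supplied $GIoU_{lane}$ matrix between the $K$ predictions and the ground-truth lanes; $\beta$ is an illustrative value, and the one-to-one assignment uses the Hungarian algorithm, as described in the next paragraph.
\begin{verbatim}
# Sketch of the assignment costs (GIoU_lane matrix supplied by the caller).
import numpy as np
from scipy.optimize import linear_sum_assignment

def assignment_costs(s_o2m, s_o2o, giou_lane, beta=3.0):
    """giou_lane: (K, G) matrix between K predictions and G ground truths."""
    c_o2m = s_o2m[:, None] * giou_lane ** beta   # cost for the O2M heads
    c_o2o = s_o2o[:, None] * giou_lane ** beta   # cost for the O2O head
    return c_o2m, c_o2o

s_o2m = np.array([0.9, 0.7, 0.4])
s_o2o = np.array([0.8, 0.2, 0.5])
giou = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]])
c_o2m, c_o2o = assignment_costs(s_o2m, s_o2o, giou)
# One-to-one assignment: maximize the O2O cost, i.e., minimize its negation.
rows, cols = linear_sum_assignment(-c_o2o)
\end{verbatim}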
Given the cost matrix, we use SimOTA \cite{yolox} (one-to-many assignment) for the O2M classification head and the O2M regression head, and the Hungarian algorithm \cite{detr} (one-to-one assignment) for the O2O classification head. More details about the label assignment can be found in Appendix \ref{giou_appendix}.

\textbf{Loss function for GPM.}
Focal loss \cite{focal} is utilized for both the O2O classification head and the O2M classification head, denoted as $\mathcal{L}^{o2o}_{cls}$ and $\mathcal{L}^{o2m}_{cls}$, respectively. The set of candidate samples involved in the computation of $\mathcal{L}^{o2o}_{cls}$, denoted as $\varOmega_{o2o}$, is confined to the positive sample set of the O2M classification head:
\begin{align}
\varOmega _{o2o}=\left\{ i\mid s_i^g>\lambda_{o2m}^s \right\}.
\end{align}
\subsection{The Overall Loss Function.} The entire training process is orchestrated in an end-to-end manner, wherein both LPM and GPM are trained concurrently. The overall loss function is delineated as follows:
\begin{align}
\mathcal{L} =\mathcal{L} _{cls}^{l}+\mathcal{L} _{reg}^{l}+\mathcal{L} _{cls}^{g}+\mathcal{L} _{reg}^{g}.
\end{align}

\section{Experiment}
\subsection{Dataset and Evaluation Metric}
We conducted experiments on four widely used lane detection benchmarks and one rail detection dataset: CULane \cite{scnn}, TuSimple \cite{tusimple}, LLAMAS \cite{llamas}, CurveLanes \cite{curvelanes}, and DL-Rail \cite{dalnet}. Among these datasets, CULane and CurveLanes are particularly challenging. The CULane dataset consists of various scenarios but has sparse lane distributions, whereas CurveLanes includes a large number of curved and dense lane types, such as forked and double lanes. The DL-Rail dataset, focused on rail detection across different scenarios, is chosen to evaluate our model's performance beyond traditional lane detection.
The F1-score is adopted as the evaluation metric, which is computed as follows:
\begin{align}
Pre\,\,&=\,\,\frac{TP}{TP+FP},
\\
Rec\,\,&=\,\,\frac{TP}{TP+FN},
\\
F1&=\frac{2\times Pre\times Rec}{Pre\,\,+\,\,Rec},
\end{align}
where $TP$, $FP$ and $FN$ represent the true positives, false positives, and false negatives over the entire dataset, respectively. In our experiments, we use different IoU thresholds to calculate the F1-score for different datasets: F1@50 and F1@75 for CULane \cite{clrnet}, F1@50 for LLAMAS \cite{clrnet} and CurveLanes \cite{CondLaneNet}, and F1@50, F1@75, and mF1 for DL-Rail \cite{dalnet}. The mF1 is defined as:
\begin{align}
mF1=\left( F1@50+F1@55+\cdots+F1@95 \right) /10.
\end{align}
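
Given per-threshold counts of true positives, false positives, and false negatives, these metrics can be sketched as follows.
\begin{verbatim}
# Sketch of the F1 and mF1 metrics from per-threshold TP/FP/FN counts.
import numpy as np

def f1_score(tp, fp, fn):
    pre = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * pre * rec / (pre + rec) if pre + rec > 0 else 0.0

def mf1(tp_per_thr, fp_per_thr, fn_per_thr):
    # One (tp, fp, fn) triple per IoU threshold in {0.50, 0.55, ..., 0.95}.
    return np.mean([f1_score(t, p, n)
                    for t, p, n in zip(tp_per_thr, fp_per_thr, fn_per_thr)])
\end{verbatim}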
\section{The Design Principles of the One-to-One Classification Head}
Two fundamental prerequisites of the NMS-free framework lie in the label assignment strategies and the head structures.

As for the label assignment strategy, previous works use one-to-many label assignments such as SimOTA \cite{yolox}. One-to-many label assignment makes the detection head produce redundant predictions for one ground truth, resulting in the need for NMS post-processing. Thus, some works \cite{detr}\cite{learnNMS} proposed one-to-one label assignment, such as the Hungarian algorithm, which forces the model to predict a single positive sample for each ground truth.
|
||||
However, directly using one-to-one label assignment damage the learning of the model, and the plain structure such as MLPs and CNNs struggle to assimilate the ``one-to-one'' characteristics, resulting in the decreasing of performance compared to one-to-many label assignments with NMS postprocessing\cite{yolov10}\cite{o2o}. Consider a trival example: Let $\boldsymbol{F}^{roi}_{i}$ denotes the ROI features extracted from the $i$-th anchor, and the model is trained with one-to-one label assignment. Assuming that the $i$-th anchor and the $j$-th anchor are both close to the ground truth and overlap with each other, we can express as follows:
|
||||
However, directly using one-to-one label assignment damage the learning of the model, and the plain structure such as MLPs and CNNs struggle to assimilate the ``one-to-one'' characteristics, resulting in the decreasing of performance compared to one-to-many label assignments with NMS post-processing\cite{yolov10}\cite{o2o}. Consider a trival example: Let $\boldsymbol{F}^{roi}_{i}$ denotes the ROI features extracted from the $i$-th anchor, and the model is trained with one-to-one label assignment. Assuming that the $i$-th anchor and the $j$-th anchor are both close to the ground truth and overlap with each other, we can express as follows:
\begin{align}
\boldsymbol{F}_{i}^{roi}\approx \boldsymbol{F}_{j}^{roi}.
\end{align}
This indicates that the RoI pooling features of the two anchors are similar. Suppose that $\boldsymbol{F}^{roi}_{i}$ is designated as a positive sample while $\boldsymbol{F}^{roi}_{j}$ is designated as a negative sample; the ideal outcome would then be:
\begin{align}
\boldsymbol{F}_{cls}^{plain}\left( \boldsymbol{F}_{i}^{roi} \right) &\rightarrow 1,
\\
\boldsymbol{F}_{cls}^{plain}\left( \boldsymbol{F}_{j}^{roi} \right) &\rightarrow 0,
\label{sharp fun}
\end{align}
where $\boldsymbol{F}_{cls}^{plain}$ represents a classification head with a plain architecture. Eq. (\ref{sharp fun}) implies that $\boldsymbol{F}_{cls}^{plain}$ needs to be ``sharp'' enough to differentiate between two very similar features; in other words, its output must change rapidly over a short distance in feature space. Such a ``sharp'' mapping is difficult for plain MLPs or CNNs \cite{o3d} to learn on their own. Consequently, additional heuristic structures such as those in \cite{o3d}\cite{relationnet} need to be developed.
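
As a toy numerical illustration of this point (a sketch with arbitrary dimensions and noise scale, not part of our method), a randomly initialized plain MLP maps two nearly identical RoI feature vectors to nearly identical scores, so it cannot simultaneously push one toward 1 and the other toward 0:
\begin{verbatim}
import torch
import torch.nn as nn

torch.manual_seed(0)
# A plain classification head: two linear layers and a sigmoid.
plain_head = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid())

F_i = torch.randn(64)                # RoI features of anchor i
F_j = F_i + 1e-3 * torch.randn(64)   # anchor j overlaps i, so F_j is close to F_i

with torch.no_grad():
    s_i = plain_head(F_i).item()
    s_j = plain_head(F_j).item()
print(abs(s_i - s_j))  # tiny gap: a smooth head cannot separate the two samples
\end{verbatim}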

We draw inspiration from Fast NMS \cite{yolact} for the design of the O2O classification head. Fast NMS is an iteration-free post-processing algorithm derived from traditional NMS. Furthermore, we incorporate a sort-free strategy along with geometric priors into Fast NMS, with the specifics delineated in Algorithm \ref{Graph Fast NMS}.
\begin{algorithm}[t]
\caption{Fast NMS with Geometric Prior.}
\STATE Calculate the geometric prior matrix $\boldsymbol{A}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align}
A_{ij}^{G}=\begin{cases}
1, \left| \theta _i-\theta _j \right|<\tau^{\theta}\,\,\mathrm{and}\,\,\left| r_{i}^{g}-r_{j}^{g} \right|<\tau^r\\
0, \mathrm{otherwise}.\\
\end{cases}
\label{geometric prior matrix}
\end{align}
The new algorithm has a distinct format from the original one \cite{yolact}. The geometric prior $\boldsymbol{A}^{G}$ indicates that predictions associated with sufficiently proximate anchors are likely to suppress one another. It is straightforward to show that, when all elements of $\boldsymbol{A}^{G}$ are set to 1 (i.e., the geometric prior is disregarded), Algorithm \ref{Graph Fast NMS} is equivalent to Fast NMS. Building upon this sort-free Fast NMS with geometric prior, we can design the structure of the one-to-one classification head.
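
To make the procedure concrete, the following is a minimal PyTorch-style sketch of sort-free Fast NMS with the geometric prior; the function signature, the pairwise distance input, and the threshold names are illustrative assumptions, and the sketch follows Algorithm \ref{Graph Fast NMS} only in spirit.
\begin{verbatim}
import torch

def fast_nms_geo(theta, r_g, conf, dist, tau_t, tau_r, lam_g):
    # theta, r_g, conf: (K,) anchor angle, global radius, confidence.
    # dist: (K, K) pairwise geometric distances between predictions.
    # Geometric prior: only anchors close in angle and radius may
    # suppress each other (the geometric prior matrix A^G).
    A_g = ((torch.abs(theta[:, None] - theta[None, :]) < tau_t)
           & (torch.abs(r_g[:, None] - r_g[None, :]) < tau_r))
    # Entry (i, j) is True if prediction j outscores prediction i.
    higher = conf[None, :] > conf[:, None]
    # j suppresses i if it is adjacent, outscores i, and is too close.
    redundant = A_g & higher & (dist < lam_g)
    # Sort-free: keep i only if no other prediction suppresses it.
    return ~redundant.any(dim=1)
\end{verbatim}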

The principal limitations of NMS lie in the definition of distance derived from geometry (i.e., Eq. (\ref{al_1-3})) and the threshold $\lambda^{g}$ employed to eliminate redundant predictions (i.e., Eq. (\ref{al_1-4})). For instance, in the scenario of double lines, although the geometric distance between the two lanes is minimal, their semantic divergence is strikingly distinct. Consequently, we replace these two steps with trainable neural networks, allowing the model to learn a semantic distance in a data-driven fashion. The neural network blocks replacing Eq. (\ref{al_1-3}) are expressed as:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( \boldsymbol{W}_{roi}\boldsymbol{F}_{i}^{roi}+\boldsymbol{b}_{roi} \right) ,\label{edge_layer_1_appendix}\\
\boldsymbol{F}_{ij}^{edge}&\gets \boldsymbol{W}_{in}\tilde{\boldsymbol{F}}_{j}^{roi}-\boldsymbol{W}_{out}\tilde{\boldsymbol{F}}_{i}^{roi},\label{edge_layer_2_appendix}\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\boldsymbol{W}_s\left( \boldsymbol{x}_{j}^{s}-\boldsymbol{x}_{i}^{s} \right) +\boldsymbol{b}_s,\label{edge_layer_3_appendix}\\
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .\label{edge_layer_4_appendix}
\end{align}
where the inverse distance $\boldsymbol{D}_{ij}^{edge}\in\mathbb{R}^{d}$ is no longer a scalar but a $d$-dimensional vector. The replacement of Eq. (\ref{al_1-4}) is constructed as follows:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{k\in \left\{ k|A_{ki}=1 \right\}}{\max}\boldsymbol{D}_{ki}^{edge}.
\end{align}
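
The sketch below shows one possible PyTorch realization of these trainable replacements, i.e., Eqs. (\ref{edge_layer_1_appendix})--(\ref{edge_layer_4_appendix}) together with the node-level max; the layer sizes, the two-layer form of $\mathrm{MLP}_{edge}$, and the tensor shapes are assumptions made for illustration rather than the exact configuration used in Polar R-CNN.
\begin{verbatim}
import torch
import torch.nn as nn

class EdgeNodeLayer(nn.Module):
    # Learned replacement for the geometric distance and the
    # suppression threshold in Fast NMS (illustrative sketch).
    def __init__(self, roi_dim=64, xs_dim=2, d=32):
        super().__init__()
        self.W_roi = nn.Linear(roi_dim, roi_dim)        # W_roi, b_roi
        self.W_in  = nn.Linear(roi_dim, d, bias=False)  # W_in
        self.W_out = nn.Linear(roi_dim, d, bias=False)  # W_out
        self.W_s   = nn.Linear(xs_dim, d)               # W_s, b_s
        self.mlp_edge = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, F_roi, x_s, A):
        # F_roi: (K, roi_dim) RoI features; x_s: (K, xs_dim) locations;
        # A: (K, K) adjacency, e.g. the geometric prior matrix.
        F_t = torch.relu(self.W_roi(F_roi))                  # edge Eq. 1
        F_edge = (self.W_in(F_t)[None, :, :]
                  - self.W_out(F_t)[:, None, :])             # edge Eq. 2
        F_edge = F_edge + self.W_s(x_s[None, :, :]
                                   - x_s[:, None, :])        # edge Eq. 3
        D_edge = self.mlp_edge(F_edge)                       # edge Eq. 4
        # Node aggregation: max over neighbors k with A[k, i] = 1
        # (the diagonal of A is assumed to be 1, so the max exists).
        masked = D_edge.masked_fill(~A.bool()[:, :, None],
                                    float("-inf"))
        D_node = masked.max(dim=0).values                    # (K, d)
        return D_node
\end{verbatim}
In this sketch, the max aggregation mirrors the suppression step of Fast NMS: each prediction's node-level representation is dominated by its strongest competing neighbor under the adjacency defined by the geometric prior.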