王老板 2024-10-16 10:28:26 +08:00
parent a2e56810de
commit 84e176dcdf
8 changed files with 110 additions and 90 deletions

main.tex

@@ -125,11 +125,11 @@ Regrading the first issue, \cite{clrnet} introduced learned anchors that optimiz
\par
Regarding the second issue, nearly all anchor-based methods \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane} rely on direct or indirect NMS post-processing to eliminate redundant predictions. Although it is necessary to eliminate redundant predictions, NMS remains a suboptimal solution. On one hand, NMS is not deployment-friendly because it requires defining and calculating distances between lane pairs using metrics such as \textit{Intersection over Union} (IoU). This task is more challenging than in general object detection due to the intricate geometry of lanes. On the other hand, NMS can struggle in dense scenarios. Typically, a large distance threshold may lead to false negatives, as some true positive predictions could be mistakenly eliminated, as illustrated in Fig. \ref{NMS setting}(a)(c). Conversely, a small distance threshold may fail to eliminate redundant predictions effectively, resulting in false positives, as shown in Fig. \ref{NMS setting}(b)(d). Therefore, achieving an optimal trade-off across all scenarios by manually setting the distance threshold is challenging. %The root of this problem lies in the fact that the distance definition in NMS considers only geometric parameters while ignoring the semantic context in the image. As a result, when two predictions are ``close'' to each other, it is nearly impossible to determine whether one of them is redundant.% where lane ground truths are closer together than in sparse scenarios;including those mentioned above,
\par
To address the above two issues, we propose Polar R-CNN, a novel anchor-based method for lane detection. For the first issue, we introduce local and global heads based on the polar coordinate system to create anchors with more accurate locations, thereby reducing the number of proposed anchors in sparse scenarios, as illustrated in Fig. \ref{anchor setting}(c). In contrast to \textit{State-Of-The-Art} (SOTA) methods \cite{clrnet}\cite{clrernet}, which utilize 192 anchors, Polar R-CNN employs only 20 anchors to effectively cover potential lane ground truths. For the second issue, we incorporate a new heuristic \textit{Graph Neural Network} (GNN) \cite{gnn} block into the detection head. The GNN block offers an interpretable structure, achieving nearly equivalent performance in sparse scenarios and superior performance in dense scenarios. We conducted experiments on five major benchmarks: \textit{TuSimple} \cite{tusimple}, \textit{CULane} \cite{scnn}, \textit{LLAMAS} \cite{llamas}, \textit{CurveLanes} \cite{curvelanes}, and \textit{DL-Rail} \cite{dalnet}. Our proposed method demonstrates competitive performance compared to SOTA approaches. Our main contributions are summarized as follows:
\begin{itemize}
\item We design a strategy to simplify the anchor parameters by using local and global polar coordinate systems and apply it to a two-stage lane detection framework. Compared to other anchor-based methods, this strategy significantly reduces the number of proposed anchors while achieving better performance.
\item We propose a novel triplet detection head with a GNN block to implement an NMS-free paradigm. The block is inspired by Fast NMS, providing enhanced interpretability. Our model supports end-to-end training and testing while still allowing traditional NMS post-processing as an option for an NMS-based version of our model.
\item By integrating the polar coordinate systems and the NMS-free paradigm, we present Polar R-CNN, a model for fast and efficient lane detection. We conduct extensive experiments on five benchmark datasets to demonstrate that our model achieves high performance with fewer anchors under an NMS-free paradigm. %Additionally, our model features a straightforward structure—lacking cascade refinement or attention strategies—making it simpler to deploy.
\end{itemize}
%
\begin{figure*}[ht]
@@ -188,7 +188,7 @@ However, the representation of lane anchors as rays presents certain limitations
\begin{figure}[t]
\centering
\includegraphics[width=0.87\linewidth]{thesis_figure/coord/localpolar.png}
\caption{The local polar coordinate system. The ground truth of the radius $\hat{r}_{i}^{l}$ of the $i$-th local pole is defined as the minimum distance from the pole to the lane curve instance. A positive pole has a radius $\hat{r}_{i}^{l}$ that is below a threshold $\lambda^{l}$, and vice versa. Additionally, the ground truth angle $\hat{\theta}^l_i$ is determined by the angle formed between the radius vector (connecting the pole to the closest point on the lanes) and the local polar axis.}
\label{lpmlabel}
\end{figure}
\par
@@ -196,7 +196,7 @@ However, the representation of lane anchors as rays presents certain limitations
\par
To better leverage the local inductive bias properties of CNNs, we define two types of polar coordinate systems: the local and global coordinate systems. The local polar coordinate system is used to generate lane anchors, while the global coordinate system expresses these anchors in a unified form within the entire image and regresses them to the ground truth lane instances. Given the distinct roles of the local and global systems, we adopt a two-stage framework for our Polar R-CNN, similar to Faster R-CNN \cite{fasterrcnn}.
\par
The local polar system is designed to predict lane anchors adaptable to both sparse and dense scenarios. In this system, there are many poles, each located at a lattice point of the feature map, referred to as local poles. As illustrated on the left side of Fig. \ref{lpmlabel}, there are two types of local poles: positive and negative. Positive local poles (\textit{e.g.}, the blue points) have a radius $r_{i}^{l}$ below a threshold $\lambda^l$; otherwise, they are classified as negative local poles (\textit{e.g.}, the red points). Each local pole is responsible for predicting a single lane anchor, while a lane ground truth may generate multiple lane anchors. As shown in Fig. \ref{lpmlabel}, there are three positive poles around the lane instance (green lane), which are expected to generate three lane anchors. This one-to-many approach is essential for ensuring comprehensive anchor proposals, especially since some local features around certain poles may be lost due to damage or occlusion of the lane curve.
\par
In the local polar coordinate system, the parameters of each lane anchor are determined based on the location of its corresponding local pole. However, in practical terms, once a lane anchor is generated, its position becomes fixed and independent from its original local pole. To simplify the representation of lane anchors in the second stage of Polar R-CNN, a global polar system has been designed, featuring a single pole that serves as a reference point for the entire image. The location of this global pole is manually set, and in this case, it is positioned near the static vanishing point observed across the entire lane image dataset. This approach ensures a consistent and unified framework for expressing lane anchors within the global context of the image, facilitating accurate regression to the ground truth lane instances.
@@ -218,24 +218,24 @@ F_{cls}\gets \phi _{cls}^{l}\left( F_d \right)\ &\text{and}\ F_{cls}\in \mathbb{
\end{align}
The regression branch consists of a single $1\times1$ convolutional layer, with the goal of generating lane anchors by outputting their angles $\theta_j$ and radii $r^{l}_{j}$, \textit{i.e.}, $F_{reg\,\,} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, in the local polar coordinate system defined previously. Similarly, the classification branch $\phi _{cls}^{l}\left(\cdot \right)$ consists of only two $1\times1$ convolutional layers for simplicity. This branch predicts the confidence heat map $F_{cls\,\,}\equiv \left\{ s_j^l \right\} _{j=1}^{H^l\times W^l}$ for local poles, each associated with a feature point. By discarding local poles with lower confidence, the module increases the likelihood of selecting potential positive foreground lane anchors while effectively removing background lane anchors.
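As an illustrative sketch only (the channel counts, layer names, and module class below are our assumptions rather than the released implementation), the two LPM branches could be written in PyTorch as:
\begin{verbatim}
import torch
import torch.nn as nn

class LocalPolarModule(nn.Module):
    """Sketch of the two LPM branches: a 1x1-conv regression branch for
    (theta, r) and a two-layer 1x1-conv classification branch for the
    local-pole confidence heat map."""
    def __init__(self, in_channels=64, hidden=64):
        super().__init__()
        self.reg = nn.Conv2d(in_channels, 2, kernel_size=1)   # angle and radius
        self.cls = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feat):                 # feat: (B, C, H_l, W_l)
        theta_r = self.reg(feat)             # (B, 2, H_l, W_l)
        score = self.cls(feat).sigmoid()     # (B, 1, H_l, W_l) heat map
        return theta_r, score

if __name__ == "__main__":
    lpm = LocalPolarModule(in_channels=64)
    theta_r, score = lpm(torch.randn(1, 64, 4, 10))
    print(theta_r.shape, score.shape)
\end{verbatim}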
\par
\textbf{Loss Function for LPM.} To train the local polar module, we define the ground truth labels for each local pole as follows: the ground truth radius, $\hat{r}^l_i$, is set to be the minimum distance from a local pole to the corresponding lane curve, while the ground truth angle, $\hat{\theta}_i$, is set to be the orientation of the vector extending from the local pole to the nearest point on the curve. A positive pole is labeled as one; otherwise, it is labeled as zero. Consequently, we have a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ if it is negative. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for LPM are given as follows:
\begin{align}
\mathcal{L} ^{l}_{cls}&=BCE\left( F_{cls},F_{gt} \right)
\label{loss_lph}
\end{align}
where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\lambda^{l}\}\right|$ is the number of positive local poles in LPM.
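For concreteness, the following NumPy sketch (our own illustration; the lattice spacing, the threshold value, and the convention of measuring the angle from the image x-axis are assumptions) shows how the ground-truth radius, angle, and positive/negative labels of the local poles could be computed from a densely sampled lane polyline:
\begin{verbatim}
import numpy as np

def lpm_labels(poles, lane_points, radius_thresh=15.0):
    """poles: (M, 2) lattice-point coordinates (x, y).
    lane_points: (N, 2) densely sampled points of one lane curve.
    Returns ground-truth radius, angle and positive mask per pole."""
    diff = lane_points[None, :, :] - poles[:, None, :]     # (M, N, 2)
    dist = np.linalg.norm(diff, axis=-1)                   # (M, N)
    nearest = dist.argmin(axis=1)                          # closest lane point
    r_gt = dist[np.arange(len(poles)), nearest]            # minimum distance
    vec = lane_points[nearest] - poles                     # radius vector
    theta_gt = np.arctan2(vec[:, 1], vec[:, 0])            # angle from x-axis (assumed)
    positive = r_gt < radius_thresh                        # 1 if positive pole
    return r_gt, theta_gt, positive

poles = np.stack(np.meshgrid(np.arange(10), np.arange(4)), -1).reshape(-1, 2) * 40.0
lane = np.stack([np.linspace(0, 400, 200), np.linspace(160, 0, 200)], axis=1)
print(lpm_labels(poles, lane)[2].sum(), "positive poles")
\end{verbatim}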
\par
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are considered as potential candidates during the training stage. However, some of these anchors serve as background anchors. We select the top-$K$ anchors with the highest confidence scores as the candidate anchors to feed into the second stage (\textit{i.e.}, the global polar head). During training, all anchors are chosen as candidates, where $K=H^{l}\times W^{l}$, because it aids the \textit{Global Polar Module} (the next stage) in learning from a diverse range of features, including various negative anchor samples. Conversely, during the evaluation stage, some of the anchors with lower confidence can be excluded, where $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
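A minimal sketch of this top-$K$ selection step (illustrative; the value of $K$ and the flattened tensor layout are assumptions):
\begin{verbatim}
import torch

def select_topk_anchors(scores, theta, radius, k=20):
    """scores, theta, radius: flattened per-pole tensors of shape (H_l * W_l,).
    During training k = H_l * W_l (keep everything); at test time k can be smaller."""
    k = min(k, scores.numel())
    topk_scores, idx = scores.topk(k)        # highest-confidence local poles first
    return topk_scores, theta[idx], radius[idx]

scores = torch.rand(40)                      # e.g. a 4 x 10 local polar map, flattened
top_s, top_theta, top_r = select_topk_anchors(scores, torch.rand(40), torch.rand(40))
print(top_s.shape)                           # torch.Size([20])
\end{verbatim}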
\begin{figure}[t]
\centering
\includegraphics[width=0.89\linewidth]{thesis_figure/detection_head.png}
\caption{The main pipeline of GPM. It comprises the RoI Pooling Layer alongside the triplet heads—namely, the O2O classification head, the O2M classification head, and the O2M regression head. The predictions generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing. The O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores $\left\{\tilde{s}_i^g\right\}$ based on the redundant predictions with confidence scores $\left\{s_i^g\right\}$ from the O2M classification head. Both $\left\{s_i^g\right\}$ and $\left\{\tilde{s}_i^g\right\}$ participate in the selection of final non-redundant results, which is called dual confidence selection.}
\label{g}
\end{figure}
\subsection{Global Polar Module}
Similar to the pipeline of Faster R-CNN, LPM serves as the first stage for generating lane anchor proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve the final lane predictions. GPM takes feature samples from anchors and outputs the precise locations and confidence scores of the final lane detection results. The overall architecture of GPM is illustrated in Fig. \ref{g}.
\par
\textbf{RoI Pooling Layer.} It is designed to extract sampled features from lane anchors. For ease of the sampling operation, we first convert the radius of the positive lane anchors in a local polar coordinate, $r_j^l$, to the one in a global polar coordinate system, $r_j^g$, by the following equation
\begin{align}
@@ -247,87 +247,99 @@ where $\boldsymbol{c}^{l}_{j} \in \mathbb{R}^{2}$ and $\boldsymbol{c}^{g} \in \m
x_{i,j}&=-y_{i,j}\tan \theta_j +\frac{r^{g}_j}{\cos \theta_j},\label{positions}\\
i&=1,2,\cdots,N_p,\notag
\end{align}
where the y-coordinates $\boldsymbol{y}_{j}^{s}\equiv \{y_{1,j},y_{2,j},\cdots ,y_{N_p,j}\}$ of the $j$-th lane anchor are uniformly sampled vertically from the image, as previously mentioned. Then the x-coordinates $\boldsymbol{x}_{j}^{s}\equiv \{x_{1,j},x_{2,j},\cdots ,x_{N_p,j}\}$ are calculated by Eq. \ref{positions}.
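The point sampling of Eq. \ref{positions} can be sketched as follows (illustrative only; the image height, the number of sampled points, and the coordinate-origin convention are assumptions):
\begin{verbatim}
import numpy as np

def sample_anchor_points(theta, r_global, img_h=320, n_points=36):
    """Return (x, y) samples of a lane anchor from its global polar parameters.
    y is sampled uniformly over the image height; x follows Eq. (positions)."""
    y = np.linspace(0, img_h - 1, n_points)
    x = -y * np.tan(theta) + r_global / np.cos(theta)
    return x, y

x, y = sample_anchor_points(theta=0.3, r_global=120.0)
print(x[:3], y[:3])
\end{verbatim}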
\par
Given the feature maps $P_1, P_2, P_3$ from FPN, we can extract feature vectors corresponding to the positions of feature points $\{(x_{1,j},y_{1,j}),(x_{2,j},y_{2,j}),\cdots,(x_{N_p,j},y_{N_p,j})\}_{j=1}^{K}$, respectively denoted as $\boldsymbol{F}_{1}, \boldsymbol{F}_{2}, \boldsymbol{F}_{3}\in \mathbb{R} ^{K\times C_f}$. To enhance the representation, similar to \cite{detr}, we employ a weighted sum strategy to combine features from different levels as
\begin{equation}
\boldsymbol{F}^s=\sum_{k=1}^3{\boldsymbol{F}_{k}\otimes \frac{e^{\boldsymbol{w}_{k}}}{\sum_{k'=1}^3{e^{\boldsymbol{w}_{k'}}}}},
\end{equation}
where $\boldsymbol{w}_{k}\in \mathbb{R} ^{N^{l}_{pos}}$ denotes the learnable aggregation weights. Instead of directly concatenating the three sampled features into $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f\times 3}$, the adaptive summation significantly reduces the feature dimension to $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f}$, one-third of the original. The weighted-sum tensor is then passed through a linear transformation to yield the pooled RoI features of the corresponding anchor:
\begin{align}
\boldsymbol{F}^{roi}\gets \boldsymbol{W}_{pool}\boldsymbol{F}^s, \,\boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
\end{align}
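The RoI pooling layer described above can be sketched as follows (illustrative; for brevity a single scalar weight per pyramid level replaces the learnable weight vector $\boldsymbol{w}_k$, and all dimensions are assumptions):
\begin{verbatim}
import torch
import torch.nn as nn

class RoIPoolingLayer(nn.Module):
    """Sketch of the weighted multi-level fusion followed by a linear pooling."""
    def __init__(self, n_points=36, channels=64, d_r=192, n_levels=3):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_levels))      # per-level weights
        self.pool = nn.Linear(n_points * channels, d_r)   # W_pool

    def forward(self, feats):            # feats: (n_levels, K, n_points, channels)
        alpha = torch.softmax(self.w, dim=0)                 # exp-normalized weights
        fused = (feats * alpha[:, None, None, None]).sum(0)  # (K, n_points, channels)
        return self.pool(fused.flatten(1))                   # (K, d_r)

roi = RoIPoolingLayer()
print(roi(torch.randn(3, 20, 36, 64)).shape)   # -> torch.Size([20, 192])
\end{verbatim}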
\textbf{Triplet Head.} Taking $\boldsymbol{F}^{roi}$ as input, the triplet head encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{g}. In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eradicates duplicate results. Nevertheless, NMS relies on the definition of the geometric distance between detection results, rendering this calculation intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis. To attain optimal non-redundant detection outcomes within an NMS-free paradigm (i.e., end-to-end detection), both the one-to-one and one-to-many paradigms become pivotal during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss}, but with subtle variations, we design the triplet head to achieve an NMS-free paradigm.
To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are constructed using a straightforward architecture featuring two-layer Multi-Layer Perceptrons (MLPs). To facilitate the model's transition to an end-to-end paradigm, we have developed an extended O2O classification head. As illustrated in Fig. \ref{g}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i^g \right\}$ output by the O2O classification head relies upon the confidence $\left\{ s_i^g \right\} $ output by the O2M classification head.
\begin{figure}[t]
\centering
\includegraphics[width=0.9\linewidth]{thesis_figure/gnn.png} % replace with your image file name
\caption{The main architecture of the O2O classification head. Each anchor is conceived as a node within the GNN, with the corresponding RoI feature $\boldsymbol{F}_i^{roi}$ as the node feature. The interconnecting directed edges are constructed according to the scores output by the O2M classification head and the anchor geometric priors.}
\label{o2o_cls_head}
\end{figure}
As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture for the O2O classification head, which incorporates a \textit{graph neural network} (GNN) \cite{gnn} with a polar geometric prior. The GNN is designed to model the relationships between features $\boldsymbol{F}_{i}^{roi}$ sampled from different anchors. Based on our previous analysis, the distance between lanes should not only be modeled by explicit geometric properties but also consider implicit contextual semantics such as “double” and “forked” lanes. These types of lanes, despite their tiny geometric differences, should not be removed as redundant predictions. The insight of the GNN design is derived from Fast NMS \cite{yolact}, which operates without iterative processes. The detailed design can be found in the Appendix \ref{NMS_appendix}; here, we focus on elaborating the architecture of the O2O classification head.
In a GNN, the essential components are nodes and edges. We construct a directed GNN as follows. Each anchor is conceptualized as a node, with the RoI features $\boldsymbol{F}_{i}^{roi}$ serving as the input features (\textit{i.e.}, initial signals) of these nodes. Directed edges between nodes are expressed by the adjacency matrix $\boldsymbol{A}\in\mathbb{R}^{K\times K}$. Specifically, if an element $A_{ij}$ of $\boldsymbol{A}$ equals $1$, a directed edge exists from the $i$-th node to the $j$-th node. The existence of an edge from one node to another is contingent upon two conditions. For simplicity, we encapsulate the two conditions within two matrices.
% The first matrix is the positive selection matrix, denoted as $\boldsymbol{A}^{P}\in\mathbb{R}^{K\times K}$:
% \begin{align}
% A_{ij}^{P}=\begin{cases}
% 1, s_i\geqslant \lambda^s\,\,and\,\,s_j\geqslant \lambda^s\\
% 0, others,\\
% \end{cases}
% \end{align}
% where $\lambda^s$ signifies the threshold for positive scores in the NMS paradigm. We employ this matrix to selectively retain positive candidate anchors. As shown in Fig. \ref{o2o_cls_head}, the gray anchors/nodes are those with confidences $\left\{ s_i \right\}$ lower than $\lambda^s$, thus they are isolated nodes without any connection with any other nodes.
The first matrix is the confidence comparison matrix $\boldsymbol{A}^{C}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align}
A_{ij}^{C}=\begin{cases}
1, s_i>s_j\,\,or\,\,\left( s_i=s_j\,\,and\,\,i>j \right)\\
0, others.\\
\end{cases}
\label{confidential matrix}
\end{align}
This matrix facilitates the comparison of scores for each pair of anchors. An edge from the $i$-th node to the $j$-th node exists \textit{only if} the two nodes satisfy the above comparison result.
The second matrix is the geometric prior matrix, denoted by $\boldsymbol{A}^{G}\in\mathbb{R}^{K\times K}$:
\begin{align}
A_{ij}^{G}=\begin{cases}
1, \left| \theta _i-\theta _j \right|<\lambda ^{\theta}\,\,and\,\,\left| r_{i}^{g}-r_{j}^{g} \right|<\lambda ^r\\
0, others.\\
\end{cases}
\label{geometric prior matrix}
\end{align}
This matrix indicates that an edge (\textit{i.e.}, a relationship between two nodes) is considered to exist between two nodes \textit{only if} the two corresponding anchors are sufficiently close to each other. The distance between anchors is described by their global polar parameters.
With the aforementioned two matrices, we can construct the overall adjacency matrix as $\boldsymbol{A} = \boldsymbol{A}^{C} \odot \boldsymbol{A}^{G}$, where ``$\odot$'' denotes element-wise multiplication. This means that an edge exists only if both of the above conditions are satisfied. The relationships between the $i$-th anchor and the $j$-th anchor can then be modeled as follows:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( \boldsymbol{W}_{roi}\boldsymbol{F}_{i}^{roi}+\boldsymbol{b}_{roi} \right) ,\label{edge_layer_1}\\
\boldsymbol{F}_{ij}^{edge}&\gets \boldsymbol{W}_{in}\tilde{\boldsymbol{F}}_{j}^{roi}-\boldsymbol{W}_{out}\tilde{\boldsymbol{F}}_{i}^{roi},\label{edge_layer_2}\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\boldsymbol{W}_s\left( \boldsymbol{x}_{j}^{s}-\boldsymbol{x}_{i}^{s} \right) +\boldsymbol{b}_s,\label{edge_layer_3}\\
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .\label{edge_layer_4}
\end{align}
Eqs. (\ref{edge_layer_1})-(\ref{edge_layer_4}) establish the directed relationship from the $i$-th node to the $j$-th node. Here, the tensor $\boldsymbol{D}_{ij}^{edge}$ denotes the semantic features of the directed edge $E_{ij}$. Given the directed edge features for each pair of nodes, we employ an element-wise max pooling layer to aggregate all the \textit{incoming edge} features of a node and update its node features:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{k\in \left\{ k|A_{ki}=1 \right\}}{\max}\boldsymbol{D}_{ki}^{edge}.
\end{align}
Here, inspired by \cite{o3d}\cite{pointnet}, the max pooling aims to extract the most distinctive features along the columns of the adjacency matrix (\textit{i.e.}, over the incoming edges). With the updated node features $\boldsymbol{D}_{i}^{node}$, the final confidence scores $\tilde{s}_{i}^{g}$ are obtained by the following layers:
\begin{align}
\boldsymbol{F}_{i}^{node}&\gets \mathrm{MLP}_{node}\left( \boldsymbol{D}_{i}^{node} \right) ,
\\
\tilde{s}_{i}^{g}&\gets \sigma \left( \boldsymbol{W}_{node}\boldsymbol{F}_{i}^{node} + \boldsymbol{b}_{node} \right) .
\label{node_layer}
\end{align}
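Putting the pieces together, a PyTorch-style sketch of the O2O classification head is given below (illustrative only: all dimensions and thresholds are assumptions, and the anchor polar parameters $(\theta_i, r_i^g)$ stand in for the sampled coordinates $\boldsymbol{x}_i^s$ of Eq. \ref{edge_layer_3}):
\begin{verbatim}
import torch
import torch.nn as nn

class O2OClassificationHead(nn.Module):
    """Sketch of the GNN-style O2O head (Eqs. edge_layer_1 .. node_layer)."""
    def __init__(self, d_r=192, d_edge=64, lam_theta=0.1, lam_r=50.0):
        super().__init__()
        self.lam_theta, self.lam_r = lam_theta, lam_r
        self.fc_roi = nn.Linear(d_r, d_edge)                 # W_roi, b_roi
        self.fc_in = nn.Linear(d_edge, d_edge, bias=False)   # W_in
        self.fc_out = nn.Linear(d_edge, d_edge, bias=False)  # W_out
        self.fc_geo = nn.Linear(2, d_edge)                   # W_s, b_s
        self.mlp_edge = nn.Sequential(nn.Linear(d_edge, d_edge), nn.ReLU(),
                                      nn.Linear(d_edge, d_edge))
        self.mlp_node = nn.Sequential(nn.Linear(d_edge, d_edge), nn.ReLU())
        self.fc_node = nn.Linear(d_edge, 1)                  # W_node, b_node

    def forward(self, roi_feat, theta, r_g, s_o2m):
        K = roi_feat.size(0)
        idx = torch.arange(K)
        # A^C: edge i -> j if s_i > s_j, or s_i == s_j with an index tie-break.
        a_c = (s_o2m[:, None] > s_o2m[None, :]) | \
              ((s_o2m[:, None] == s_o2m[None, :]) & (idx[:, None] > idx[None, :]))
        # A^G: anchors sufficiently close in their global polar parameters.
        a_g = ((theta[:, None] - theta[None, :]).abs() < self.lam_theta) & \
              ((r_g[:, None] - r_g[None, :]).abs() < self.lam_r)
        adj = a_c & a_g                                      # A = A^C element-wise A^G
        # Edge features D_ij; anchor parameters stand in for the sampled x^s here.
        f = torch.relu(self.fc_roi(roi_feat))                # (K, d_edge)
        xs = torch.stack([theta, r_g], dim=1)                # (K, 2)
        e = self.fc_in(f)[None, :, :] - self.fc_out(f)[:, None, :]   # e[i, j]
        e = e + self.fc_geo(xs[None, :, :] - xs[:, None, :])
        d_edge = self.mlp_edge(e)                            # (K, K, d_edge)
        # Node update: element-wise max over incoming edges k -> i (A[k, i] = 1).
        masked = d_edge.masked_fill(~adj[:, :, None], float("-inf"))
        d_node = masked.max(dim=0).values
        d_node = torch.where(torch.isfinite(d_node), d_node, torch.zeros_like(d_node))
        node = self.mlp_node(d_node)
        return torch.sigmoid(self.fc_node(node)).squeeze(-1)  # non-redundant scores

head = O2OClassificationHead()
K = 8
s_tilde = head(torch.randn(K, 192), torch.rand(K), 100 * torch.rand(K), torch.rand(K))
print(s_tilde.shape)
\end{verbatim}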
\textbf{Dual Confidence Selection.} We use dual confidence thresholds $\lambda_{o2m}^g$ and $\lambda_{o2o}^g$ to select the positive (\textit{i.e.}, foreground) predictions. In the traditional NMS paradigm, the predictions output by the O2M classification head with confidences $\left\{ s_{i}^{g} \right\} $ higher than $\lambda_{o2m}^g$ are selected as positive predictions and subsequently fed into NMS post-processing to eliminate the redundant ones. In the NMS-free paradigm, the final non-redundant predictions are selected as follows:
\begin{align}
\varOmega _{o2o}^{pos}\equiv \left\{ i|\tilde{s}_{i}^{g}>\lambda _{o2o}^{g} \right\} \cap \left\{ i|s_{i}^{g}>\lambda _{o2m}^{g} \right\} ,
\end{align}
where $\varOmega _{o2o}^{pos}$ denotes the final set of non-redundant predictions, whose two types of confidences both satisfy the above conditions under the dual confidence thresholds. This selection principle for non-redundant predictions is called dual confidence selection.
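Dual confidence selection itself is a simple filtering step; a sketch is given below (the threshold values are placeholders, not our tuned settings):
\begin{verbatim}
import torch

def dual_confidence_selection(s_o2m, s_o2o, lam_o2m=0.4, lam_o2o=0.5):
    """Keep predictions whose O2M and O2O confidences both exceed their thresholds."""
    keep = (s_o2m > lam_o2m) & (s_o2o > lam_o2o)
    return keep.nonzero(as_tuple=True)[0]    # indices of the final non-redundant lanes

print(dual_confidence_selection(torch.tensor([0.9, 0.3, 0.7]),
                                torch.tensor([0.8, 0.9, 0.2])))   # -> tensor([0])
\end{verbatim}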
\textbf{Label Assignment and Cost Function for GPM.} As in previous works, we use the dual assignment strategy for label assignment of the triplet head. The cost function for the $i$-th prediction and the $j$-th ground truth is given as follows:
\begin{align}
\mathcal{C} _{ij}^{o2m}&=s_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},\\
\mathcal{C} _{ij}^{o2o}&=\tilde{s}_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},
\end{align}
where $\mathcal{C} _{ij}^{o2m}$ is the cost function for the O2M classification and regression heads, while $\mathcal{C} _{ij}^{o2o}$ is for the O2O classification head, with $\beta$ serving as the trade-off hyperparameter between location and confidence. This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet}, taking both location and confidence into account. We have redefined the IoU function between lane instances, $GIoU_{lane}$, which differs slightly from previous work \cite{clrernet}. More details about $GIoU_{lane}$ can be found in the Appendix \ref{giou_appendix}.
Given the cost matrix, we use SimOTA \cite{yolox} (one-to-many assignment) for the O2M classification head and the O2M regression head, while the Hungarian algorithm \cite{detr} (one-to-one assignment) is used for the O2O classification head.
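A sketch of the cost computation feeding the two assigners (illustrative; the $GIoU_{lane}$ matrix is assumed to be precomputed, and the clamping is only a numerical guard of our own):
\begin{verbatim}
import torch

def gpm_cost_matrices(s_o2m, s_o2o, giou, beta=6.0):
    """giou: (num_pred, num_gt) matrix of GIoU_lane values, assumed precomputed.
    Clamping to a small positive value is only a numerical guard for the power."""
    quality = giou.clamp(min=1e-6) ** beta
    cost_o2m = s_o2m[:, None] * quality      # used by SimOTA (one-to-many)
    cost_o2o = s_o2o[:, None] * quality      # used by Hungarian matching (one-to-one)
    return cost_o2m, cost_o2o

c_o2m, c_o2o = gpm_cost_matrices(torch.rand(5), torch.rand(5), torch.rand(5, 2))
print(c_o2m.shape, c_o2o.shape)              # torch.Size([5, 2]) twice
\end{verbatim}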
\textbf{Loss Function for GPM.}
We utilize focal loss \cite{focal} for both the O2O classification head and the O2M classification head, denoted as $\mathcal{L}^{o2o}_{cls}$ and $\mathcal{L}^{o2m}_{cls}$, respectively. The set of candidate samples involved in the computation of $\mathcal{L}^{o2o}_{cls}$, denoted as $\varOmega_{o2o}$, is confined to the positive sample set of the O2M classification head:
\begin{align}
\varOmega _{o2o}=\left\{ i\mid s_i^g>\lambda_{o2m}^{g} \right\}.
\end{align}
In essence, certain samples with lower O2M scores are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$. Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head.
\begin{figure}[t]
@@ -336,7 +348,7 @@ In essence, certain samples with lower O2M scores are excluded from the computat
\caption{Auxiliary loss for segment parameter regression. The ground truth of a lane curve is partitioned into several segments, with the parameters of each segment denoted as $\left( \hat{\theta}_{i,\cdot}^{seg},\hat{r}_{i,\cdot}^{seg} \right)$. The model outputs the parameter offsets $\left( \varDelta \theta _{j,\cdot},\varDelta r_{j,\cdot}^{g} \right)$ to regress from the original anchor to each target line segment.}
\label{auxloss}
\end{figure}
We directly apply the redefined GIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offsets of the x-axis coordinates of the sampled points, and the $Smooth_{L1}$ loss, denoted as $\mathcal{L}_{end}$, for the regression of the end points of lanes. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{\mathrm{aux}}$ depicted in Fig. \ref{auxloss}. The anchors and ground truths are segmented into several divisions. Each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the anchors in acquiring a deeper comprehension of the global geometric form.
The final loss functions for GPM are given as follows:
\begin{align}
@@ -348,7 +360,7 @@ The final loss functions for GPM are given as follows:
% \mathcal{L}_{aux} &= \frac{1}{\left| \varOmega^{pos}_{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\ % \mathcal{L}_{aux} &= \frac{1}{\left| \varOmega^{pos}_{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\
% &\quad + l \left( r_{i}^{g} - \hat{r}_{i}^{seg,m} \right) \Bigg]. % &\quad + l \left( r_{i}^{g} - \hat{r}_{i}^{seg,m} \right) \Bigg].
% \end{align} % \end{align}
\subsection{The Overall Loss Function} The entire training process is orchestrated in an end-to-end manner, wherein both LPM and GPM are trained concurrently. The overall loss function is delineated as follows:
\begin{align}
\mathcal{L} =\mathcal{L} _{cls}^{l}+\mathcal{L} _{reg}^{l}+\mathcal{L} _{cls}^{g}+\mathcal{L} _{reg}^{g}.
\end{align}
@@ -385,7 +397,7 @@ For Tusimple, the evaluation is formulated as follows:
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the False Positive Rate (FPR=1-Precision) and the False Negative Rate (FNR=1-Recall).
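A simplified per-lane sketch of this metric (illustrative; the official evaluation script additionally handles lane matching, which is omitted here):
\begin{verbatim}
import numpy as np

def tusimple_accuracy(pred_x, gt_x, pixel_thresh=20.0, acc_thresh=0.85):
    """pred_x, gt_x: x-coordinates of one lane at the fixed evaluation rows.
    A point is correct if within pixel_thresh pixels of the ground truth; the
    lane is counted as correct if the point accuracy exceeds acc_thresh."""
    correct = np.abs(np.asarray(pred_x, float) - np.asarray(gt_x, float)) < pixel_thresh
    acc = correct.mean()
    return acc, acc > acc_thresh

print(tusimple_accuracy([10, 52, 95], [12, 50, 130]))   # (0.666..., False)
\end{verbatim}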
\subsection{Implementation Details}
All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficient of the cost function, $\beta$, is set to 6. The training process (including LPM and GPM) is end-to-end in a single step, just like \cite{adnet}\cite{srlane}. All the experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}. Other details about the datasets and the training process can be found in Appendix \ref{vis_appendix}.
\begin{table*}[htbp]
@@ -571,7 +583,7 @@ To validate and analyze the effectiveness and influence of different component o
\textbf{Ablation study on polar coordinate system and anchor number.} To assess the importance of local polar coordinates of anchors, we examine the contribution of each component (i.e., angle and radius) to model performance. As shown in Table \ref{aba_lph}, both angle and radius contribute to performance to varying degrees. Additionally, we conduct experiments with auxiliary loss using fixed anchors and Polar R-CNN. Fixed anchors refer to using anchor settings trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b). Model performance improves by 0.48\% and 0.3\% under the fixed anchor paradigm and proposal anchor paradigm, respectively.
We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves with increasing the local polar map size and tends to stabilize when the size is sufficiently large. Specifically, precision improves, while recall decreases. A larger polar map size includes more background anchors in the second stage (since we choose dynamic $k=4$ for SimOTA, with no more than 4 positive samples for each ground truth). Consequently, the model learns more negative samples, enhancing precision but reducing recall. Regarding the number of anchors chosen during the evaluation stage, recall and F1 measure show a significant increase in the early stages of anchor number expansion but stabilize in later stages. This suggests that eliminating some anchors does not significantly affect performance. Fig. \ref{cam} displays the heat map and top-$K$ selected anchors distribution in sparse scenarios. Brighter colors indicate a higher likelihood of anchors being foreground anchors. It is evident that most of the proposed anchors are clustered around the lane ground truth.
\begin{figure}[t]
\textbf{Ablation study on NMS-free block in sparse scenarios.} We conduct several experiments on the CULane dataset to evaluate the performance of the NMS-free head in sparse scenarios. As shown in Table \ref{aba_NMSfree_block}, without using the GNN to establish relationships between anchors, Polar R-CNN fails to achieve an NMS-free paradigm, even with one-to-one assignment. Furthermore, the classification matrix (cls matrix) proves crucial, indicating that conditional probability is effective. Other components, such as the neighbor matrix (provided as a geometric prior) and the rank loss, also contribute to the performance of the NMS-free block.
To compare the NMS-free paradigm with the traditional NMS paradigm, we perform experiments with the NMS-free block under both the proposal and fixed anchor strategies. Table \ref{NMS vs NMS-free} presents the results. Here, O2M-B refers to the O2M classification head, O2O-B refers to the O2O classification head with a plain structure, and O2O-G refers to the O2O classification head with the proposed GNN structure. To assess the ability to eliminate redundant predictions, NMS post-processing is applied to each head. The results show that NMS is necessary for the traditional O2M classification head. In the fixed anchor paradigm, although the O2O classification head with a plain structure effectively eliminates redundant predictions, it is less effective than the proposed GNN structure. In the proposal anchor paradigm, the O2O classification head with a plain structure fails to eliminate redundant predictions due to high anchor overlap and similar RoI features. Thus, the GNN structure is essential for Polar R-CNN in the NMS-free paradigm. In both the fixed and proposal anchor paradigms, the O2O classification head with the GNN structure successfully eliminates redundant predictions, indicating that our GNN-based O2O classification head can replace NMS post-processing in sparse scenarios without a decrease in performance. This confirms our earlier claim that both the structure and the label assignment are crucial for an NMS-free paradigm.
We also explore the stop-gradient strategy for the O2O classification head. As shown in Table \ref{stop}, the gradient of the O2O classification head negatively impacts both the O2M classification head (with NMS post-processing) and the O2O classification head. This suggests that one-to-one assignment introduces critical bias into feature learning.
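One common way to realize the stop-gradient strategy, shown here as a minimal PyTorch-style sketch with hypothetical module and tensor names, is to feed the O2O classification head a detached copy of the shared RoI features so that its one-to-one loss cannot propagate gradients back into the shared feature extractor:
\begin{verbatim}
import torch

def forward_heads(roi_feats, o2m_cls_head, o2m_reg_head, o2o_cls_head):
    # roi_feats: (N, C) shared RoI features; the heads are nn.Module instances.
    o2m_scores = o2m_cls_head(roi_feats)           # one-to-many branch (needs NMS)
    o2m_regs   = o2m_reg_head(roi_feats)           # lane regression branch
    # Stop-gradient: the O2O head sees the same features, but its gradient
    # is blocked from flowing back into the shared trunk.
    o2o_scores = o2o_cls_head(roi_feats.detach())  # one-to-one branch (NMS-free)
    return o2m_scores, o2m_regs, o2o_scores
\end{verbatim}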
\begin{table}[h]
\centering
\caption{Ablation study on the GNN block.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{cccc|ccc}
\toprule
\textbf{GNN}&$\boldsymbol{A}^{C}$&$\boldsymbol{A}^{G}$&\textbf{Rank Loss}&\textbf{F1@50 (\%)}&\textbf{Precision (\%)} & \textbf{Recall (\%)} \\
\midrule
& & & &16.19&69.05&9.17\\
\checkmark&\checkmark& & &79.42&88.46&72.06\\
\textbf{Ablation study on NMS-free block in dense scenarios.} Despite demonstrating the feasibility of replacing NMS with the O2O classification head in sparse scenarios, the shortcomings of NMS in dense scenarios remain. To investigate the performance of the NMS-free block in dense scenarios, we conduct experiments on the CurveLanes dataset, as detailed in Table \ref{aba_NMS_dense}.
In the traditional NMS post-processing \cite{clrernet}, the default IoU threshold is set to 50 pixels. However, this default setting may not always be optimal, especially in dense scenarios where some lane predictions might be erroneously eliminated. Lowering the IoU threshold increases recall but decreases precision. To find the most effective IoU threshold, we experiment with various values and find that a threshold of 15 pixels achieves the best trade-off, resulting in an F1-score of 86.81\%. In contrast, the NMS-free paradigm with the GNN-based O2O classification head achieves an overall F1-score of 87.29\%, which is 0.48\% higher than the optimal threshold setting in the NMS paradigm. Additionally, both precision and recall improve under the NMS-free approach. This indicates that the O2O classification head with the proposed GNN structure is capable of learning both explicit geometric distances and implicit semantic distances between anchors, thus providing a more effective solution for dense scenarios than traditional NMS post-processing.
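For context, the following sketch shows the greedy suppression step whose single distance threshold drives the precision--recall trade-off discussed above; the distance function and the threshold here are generic stand-ins for the lane IoU used in \cite{clrernet} rather than its exact definition.
\begin{verbatim}
import numpy as np

def lane_nms(scores, lanes, lane_dist, thr):
    # scores: (N,) confidences; lanes: list of N lane representations.
    # lane_dist: callable returning a distance between two lane predictions.
    # thr: suppression threshold -- a more aggressive setting removes more
    # nearby predictions (higher precision, lower recall in dense scenes).
    order = np.argsort(-scores)
    keep = []
    for i in order:
        if all(lane_dist(lanes[i], lanes[j]) >= thr for j in keep):
            keep.append(i)
    return keep
\end{verbatim}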
\begin{table}[h]
\centering
\section{Conclusion and Future Work}
In this paper, we propose Polar R-CNN to address two key issues in anchor-based lane detection methods. By incorporating a local and global polar coordinate system, Polar R-CNN achieves improved performance with fewer anchors. Additionally, the introduction of the O2O classification head with the GNN block allows us to replace traditional NMS post-processing, and the NMS-free paradigm demonstrates superior performance in dense scenarios. Our model is highly flexible, and the number of anchors can be adjusted based on the specific scenario. Users can choose either the O2M classification head with NMS post-processing or the O2O classification head for an NMS-free approach. Polar R-CNN is also deployment-friendly due to its simple structure, making it a potential new baseline for lane detection. Future work could explore incorporating new structures, such as large kernels or attention mechanisms, and experimenting with new label assignment, training, and anchor sampling strategies. We also plan to extend Polar R-CNN to video instance lane detection and 3D lane detection, utilizing advanced geometric modeling for these new tasks.
%
%
%
\section{The Design Principles of the One-to-One Classification Head}
Two necessary conditions for an NMS-free paradigm are the label assignment strategy and the model structure.
Regarding the label assignment strategy, previous works use one-to-many label assignment, such as SimOTA \cite{yolox}. One-to-many label assignment encourages the detection head to produce redundant predictions for each ground truth, resulting in the need for NMS post-processing. Therefore, some works \cite{detr}\cite{learnNMS} propose one-to-one label assignment, such as the Hungarian algorithm, which forces the model to predict a single positive sample for each ground truth.
However, directly applying one-to-one label assignment harms the learning of the model, and plain structures such as MLPs and CNNs struggle to learn ``one-to-one'' features, leading to a performance drop compared with one-to-many label assignment followed by NMS post-processing \cite{yolov10}\cite{o2o}. Let us consider a trivial example. Let $\boldsymbol{F}^{roi}_{i}$ denote the RoI features extracted from the $i$-th anchor, and suppose the model is trained with one-to-one label assignment. If the $i$-th anchor and the $j$-th anchor both lie around the ground truth and nearly overlap with each other:
\begin{align}
\REQUIRE ~~\\ % Input of the algorithm
The indices of the positive predictions, $1, 2, \dots, i, \dots, K$;\\
The corresponding positive anchors, $\left\{ \theta _i,r_{i}^{g} \right\} |_{i=1}^{K}$;\\
The x-coordinates of the sampling points on the positive anchors, $\boldsymbol{x}_{i}^{s}$;\\
The confidence of each positive prediction from the O2M classification head, $s_i^g$;\\
The regressions of each positive prediction from the O2M regression head, i.e., the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and the end point locations $\boldsymbol{e}_{i}$.\\
\ENSURE ~~\\ % Output of the algorithm
\STATE Select the positive candidates via $\boldsymbol{A}^{P}\in\mathbb{R}^{K\times K}$:
\begin{align}
A_{ij}^{P}=\begin{cases}
1, & \text{if } s_i^g\geqslant \lambda ^s\land s_j^g\geqslant \lambda ^s,\\
0, & \text{otherwise},\\
\end{cases}
\end{align}
\STATE Calculate the confidence comparison matrix $\boldsymbol{A}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
\begin{align}
A_{ij}^{C}=\begin{cases}
1, & \text{if } s_i^g<s_j^g \lor \left( s_i^g=s_j^g \land i<j \right),\\
0, & \text{otherwise}.\\
\end{cases}
\label{confidential matrix}
\end{align}
where $\land$ denotes the (element-wise) logical ``AND'' operation between two Boolean values/tensors, and $\lor$ denotes the logical ``OR'' operation.
\STATE Calculate the geometric prior matrix $\boldsymbol{A}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align}
A_{ij}^{G}=\begin{cases}
1, & \text{if } \left| \theta _i-\theta _j \right|<\lambda^{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\lambda^{r},\\
0, & \text{otherwise}.\\
\end{cases}
\label{geometric prior matrix}
\end{align}
\STATE Calculate the distance matrix $\boldsymbol{D} \in \mathbb{R} ^{K \times K}$, where the element $D_{ij}$ of $\boldsymbol{D}$ is defined as follows:
\begin{align}
D_{ij} = 1-d\left( \mathrm{lane}_i, \mathrm{lane}_j\right),
\label{al_1-3}
\end{align}
where $\mathrm{lane}_i$ denotes the lane prediction decoded from the $i$-th anchor with its regression outputs ($\boldsymbol{x}_{i}^{s}+\varDelta \boldsymbol{x}_{i}^{roi}$ and $\boldsymbol{e}_{i}$), and $d\left(\cdot, \cdot \right)$ is a predefined function that quantifies the distance between two lane predictions.
\STATE Define the adjacency matrix $\boldsymbol{A} = \boldsymbol{A}^{P} \land \boldsymbol{A}^{C} \land \boldsymbol{A}^{G}$; the final confidence $\tilde{s}_i^g$ is calculated as follows:
\begin{align}
\tilde{s}_i^g = \begin{cases}
1, & \text{if } \underset{j \in \{ j \mid A_{ij} = 1 \}}{\max} D_{ij} < \lambda^g, \\
0, & \text{otherwise}.
\end{cases}
\label{al_1-4}
\end{align}
\RETURN The final confidence $\tilde{s}_i^g$. % the output of the algorithm
\end{algorithmic}
\label{Graph Fast NMS}
\end{algorithm}
The new algorithm has a different form from the original one \cite{yolact}, but it is easy to show that, when $\boldsymbol{A}^{P}$ and $\boldsymbol{A}^{G}$ are set to all ones (i.e., the score filtering and the geometric priors are ignored), Algorithm \ref{Graph Fast NMS} is equivalent to Fast NMS. Building upon the proposed Graph-based Fast NMS, we design the structure of the one-to-one classification head so that it mirrors the principles of Graph-based Fast NMS.
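To make the procedure concrete, the following is a minimal NumPy sketch of Algorithm \ref{Graph Fast NMS}; the thresholds $\lambda^s$, $\lambda^{\theta}$, $\lambda^{r}$, $\lambda^g$ and the pairwise lane distance matrix are assumed to be given, and all variable names are illustrative. Setting the matrices corresponding to $\boldsymbol{A}^{P}$ and $\boldsymbol{A}^{G}$ to all ones recovers Fast NMS.
\begin{verbatim}
import numpy as np

def graph_fast_nms(s, theta, r, lane_dist, lam_s, lam_theta, lam_r, lam_g):
    # s:         (K,) confidences of the positive predictions (from O2M head)
    # theta, r:  (K,) polar parameters of the corresponding anchors
    # lane_dist: (K, K) pairwise lane distances d(lane_i, lane_j) in [0, 1]
    # Returns the binary final confidences (one per prediction).
    K = len(s)
    idx = np.arange(K)
    # A^P: both members of a pair are confident candidates.
    A_P = (s[:, None] >= lam_s) & (s[None, :] >= lam_s)
    # A^C: prediction j takes precedence over prediction i.
    A_C = (s[:, None] < s[None, :]) | \
          ((s[:, None] == s[None, :]) & (idx[:, None] < idx[None, :]))
    # A^G: geometric prior -- only anchors with similar angle and radius interact.
    A_G = (np.abs(theta[:, None] - theta[None, :]) < lam_theta) & \
          (np.abs(r[:, None] - r[None, :]) < lam_r)
    A = A_P & A_C & A_G
    D = 1.0 - lane_dist                   # inverse distance: large means "close"
    D_masked = np.where(A, D, -np.inf)    # only dominating neighbours count
    # Keep prediction i only if no dominating neighbour is too close;
    # rows without any neighbour evaluate to -inf and are therefore kept.
    return (D_masked.max(axis=1) < lam_g).astype(float)
\end{verbatim}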
The fundamental shortcomings of NMS are the purely geometric definition of distance (\textit{i.e.}, Eq. \ref{al_1-3}) and the fixed threshold $\lambda^g$ used to remove redundant predictions (\textit{i.e.}, Eq. \ref{al_1-4}). Thus, we replace these two steps with trainable neural networks.
To help the model learn a distance that encodes both explicit geometric information and implicit semantic information, the block replacing Eq. \ref{al_1-3} is expressed as:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( \mathrm{Linear}_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right) ,\\
\boldsymbol{F}_{ij}^{edge}&\gets \mathrm{Linear}_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -\mathrm{Linear}_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) ,\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\mathrm{Linear}_b\left( \boldsymbol{x}_{i}^{s}-\boldsymbol{x}_{j}^{s} \right) ,\\
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .
\label{edge_layer_appendix}
\end{align}
where the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor with dimension $d_{dis}$. The replacement of Eq. \ref{al_1-4} is constructed as follows:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{j\in \left\{ j|A_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
\\
\boldsymbol{F}_{i}^{node}&\gets \mathrm{MLP}_{node}\left( \boldsymbol{D}_{i}^{node} \right),
\\
\tilde{s}_i^g&\gets \sigma \left( \mathrm{Linear}_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
\label{node_layer_appendix}
\end{align}
In this expression, we use element-wise max pooling over tensors instead of the scalar max operation. By eliminating the need for a predefined distance threshold $\lambda^g$, the implicit decision surface is learned from data by the neural network.
\label{NMS_appendix}
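For completeness, a compact PyTorch sketch of Eq. \ref{edge_layer_appendix} and Eq. \ref{node_layer_appendix} is given below; the layer widths, the distance dimension $d_{dis}$, and the handling of isolated nodes are illustrative choices rather than the exact released configuration.
\begin{verbatim}
import torch
import torch.nn as nn

class GNNO2OHead(nn.Module):
    # Sketch of the GNN-based O2O classification head (edge and node layers).
    def __init__(self, d_roi=64, d_pts=36, d_dis=32):
        super().__init__()
        self.fc_roi = nn.Linear(d_roi, d_roi)   # Linear_{o2o}^{roi}
        self.fc_in  = nn.Linear(d_roi, d_dis)   # Linear_{in}
        self.fc_out = nn.Linear(d_roi, d_dis)   # Linear_{out}
        self.fc_b   = nn.Linear(d_pts, d_dis)   # Linear_{b}
        self.mlp_edge = nn.Sequential(nn.Linear(d_dis, d_dis), nn.ReLU(),
                                      nn.Linear(d_dis, d_dis))
        self.mlp_node = nn.Sequential(nn.Linear(d_dis, d_dis), nn.ReLU(),
                                      nn.Linear(d_dis, d_dis))
        self.fc_o2o_out = nn.Linear(d_dis, 1)   # Linear_{o2o}^{out}

    def forward(self, F_roi, x_s, A):
        # F_roi: (K, d_roi) RoI features; x_s: (K, d_pts) anchor sample x-coords;
        # A: (K, K) boolean adjacency built from A^P, A^C and A^G.
        F = torch.relu(self.fc_roi(F_roi))
        F_edge = self.fc_in(F)[:, None, :] - self.fc_out(F)[None, :, :]
        F_edge = F_edge + self.fc_b(x_s[:, None, :] - x_s[None, :, :])
        D_edge = self.mlp_edge(F_edge)                         # (K, K, d_dis)
        D_edge = D_edge.masked_fill(~A[:, :, None], float('-inf'))
        D_node = D_edge.max(dim=1).values                      # element-wise max pool
        D_node = torch.where(torch.isinf(D_node),
                             torch.zeros_like(D_node), D_node) # isolated nodes -> 0
        return torch.sigmoid(self.fc_o2o_out(self.mlp_node(D_node))).squeeze(-1)
\end{verbatim}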
