diff --git a/main.tex b/main.tex index 59d0b45..cd75c00 100644 --- a/main.tex +++ b/main.tex
@@ -225,7 +225,7 @@ The regression branch consists of a single $1\times1$ convolutional layer and wi
\end{align} where $N^{lpm}_{pos}=\left|\{j|\hat{r}_j^l<\tau^{l}\}\right|$ is the number of positive local poles in the LPM. \par
-\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are considered as candidate anchor during the training stage. It is helpful to our Polar R-CNN (the second stage) to learn from a sufficient variety of features, including negative anchor samples. However, only the top-$K$ anchors with the highest confidence scores $\{c_j^l\}$ are selected and fed into the next stage during the evaluation stage. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing this, it maintains the adaptability and flexibility of anchor distribution while decreasing the total number of anchors. The following experiments will demonstrate the effectiveness of our top-$K$ anchor selection strategy.
+\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are considered as potential candidates during the training stage. However, some of these anchors serve as background anchors. We select the top-$K$ anchors with the highest confidence scores as the candidate anchors fed into the second stage (\textit{i.e.}, the global polar head). During training, all anchors are chosen as candidates, with $K=H^{l}\times W^{l}$, because this aids Polar R-CNN (the second stage) in learning from a diverse range of features, including various negative anchor samples. Conversely, during the evaluation stage, anchors with lower confidence can be excluded, so that $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of the anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
\begin{figure}[t] \centering
@@ -239,7 +239,7 @@ Similar to the pipeline of Faster R-CNN, the LPM serves as the first stage for g
\textbf{RoI Pooling Layer.} It is designed to extract sampled features from lane anchors. For ease of the sampling operation, we first convert the radius of the positive lane anchors in the local polar coordinate system, $r_j^l$, to its counterpart in the global polar coordinate system, $r_j^g$, by the following equation: \begin{align} r^{g}_{j}&=r^{l}_{j}+\left( \boldsymbol{c}^{l}_{j}-\boldsymbol{c}^{g} \right) ^{T}\left[\cos\theta_{j}; \sin\theta_{j} \right], \\
- j&=1,2,\cdots,N^{lpm}_{pos},\notag
+ j&=1,2,\cdots,K,\notag
\end{align} where $\boldsymbol{c}^{l}_{j} \in \mathbb{R}^{2}$ and $\boldsymbol{c}^{g} \in \mathbb{R}^{2}$ represent the Cartesian coordinates of the $j$-th local pole and the global pole, respectively. Note that we keep the angle $\theta_j$ unchanged, since the local and global polar coordinate systems share the same polar axis, as shown in Fig. \ref{lpmlabel}.
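To make this concrete, the following is a minimal PyTorch-style sketch (our illustration, not the released code; tensor names and shapes are assumptions) of the radius conversion above and of the top-$K$ selection:
\begin{verbatim}
import torch

def local_to_global_radius(r_local, theta, local_poles, global_pole):
    # r_local:     (K,) radii w.r.t. the local poles c^l_j
    # theta:       (K,) anchor angles (unchanged: both coordinate systems
    #              share the same polar axis)
    # local_poles: (K, 2) Cartesian coordinates of the local poles c^l_j
    # global_pole: (2,)   Cartesian coordinates of the global pole c^g
    direction = torch.stack((torch.cos(theta), torch.sin(theta)), dim=1)
    # r^g_j = r^l_j + (c^l_j - c^g)^T [cos(theta_j); sin(theta_j)]
    return r_local + ((local_poles - global_pole) * direction).sum(dim=1)

def select_top_k(scores, k):
    # Training: k = H^l * W^l (all anchors kept); evaluation: k <= H^l * W^l.
    conf, idx = torch.topk(scores.flatten(), k)
    return conf, idx
\end{verbatim}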
Next, the feature points are sampled on each lane anchor by \begin{align}
@@ -248,7 +248,7 @@ i&=1,2,\cdots,N_p,\notag
\end{align} where the y-coordinates $\{y_{1,j}, y_{2,j},\cdots,y_{N_p,j}\}$ of the $j$-th lane anchor are uniformly sampled vertically from the image, as previously mentioned. \par
-Given the feature maps $P_1, P_2, P_3$ from FPN, we can extract feature vectors corresponding to the positions of feature points $\{(x_{1,j},y_{1,j}),(x_{2,j},y_{2,j}),\cdots,(x_{N,j},y_{N,j})\}_{j=1}^{N^{lpm}_{pos}}$, respectively denoted as $\boldsymbol{F}_{1}, \boldsymbol{F}_{2}, \boldsymbol{F}_{3}\in \mathbb{R} ^{N^{lpm}_{pos}\times C_f}$. To enhance representation, similar to \cite{detr}, we employ a weighted sum strategy to combine features from different levels as
+Given the feature maps $P_1, P_2, P_3$ from the FPN, we can extract the feature vectors at the positions of the feature points $\{(x_{1,j},y_{1,j}),(x_{2,j},y_{2,j}),\cdots,(x_{N_p,j},y_{N_p,j})\}_{j=1}^{K}$, denoted as $\boldsymbol{F}_{1}, \boldsymbol{F}_{2}, \boldsymbol{F}_{3}\in \mathbb{R} ^{K\times C_f}$, respectively. To enhance the representation, similar to \cite{detr}, we employ a weighted sum strategy to combine features from different levels as
\begin{equation} \boldsymbol{F}^s=\sum_{k=1}^3{\boldsymbol{F}_{k}\otimes \frac{e^{\boldsymbol{w}_{k}}}{\sum_{k'=1}^3{e^{\boldsymbol{w}_{k'}}}}}, \end{equation}
@@ -259,7 +259,23 @@ where $\boldsymbol{w}_{k}\in \mathbb{R} ^{N^{lpm}_{pos}}$ represents the learnable \end{aligned} \end{equation}
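For clarity, here is a minimal PyTorch-style sketch of this weighted fusion (our illustration, not the paper's code; the shape conventions are assumptions consistent with the notation above):
\begin{verbatim}
import torch

def fuse_fpn_features(F1, F2, F3, w):
    # F1, F2, F3: (K, C_f) features sampled from the FPN levels P1, P2, P3.
    # w: (3, K) learnable logits; a softmax over the level axis yields the
    # per-anchor aggregation weights of the equation above.
    F = torch.stack((F1, F2, F3), dim=0)           # (3, K, C_f)
    alpha = torch.softmax(w, dim=0).unsqueeze(-1)  # (3, K, 1), sums to 1 per anchor
    return (F * alpha).sum(dim=0)                  # (K, C_f), i.e. F^s
\end{verbatim}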
-\textbf{Triplet Head.} The triplet head comprises three distinct heads: the one-to-one classification (O2O cls) head, the one-to-many classification (O2M cls) head, and the one-to-many regression (O2M reg) head. In various studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly follows the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation stage, redundant detection results are often predicted for each instance. These redundancies are typically addressed using NMS, which eliminates duplicate results and retains the highest confidence detection for each groung truth. However, NMS relies on the definition of distance between detection results, and this calculation can be complex for curved lanes and other irregular geometric shapes. To achieve non-redundant detection results with a NMS-free paradigm, the one-to-one paradigm becomes crucial during training, as highlighted in \cite{o2o}. Nevertheless, merely adopting the one-to-one paradigm is insufficient; the structure of the detection head also plays a pivotal role in achieving NMS-free detection. This aspect will be further analyzed in the following sections.
+\textbf{Triplet Head.} The triplet head comprises three distinct heads: the one-to-one classification (O2O cls) head, the one-to-many classification (O2M cls) head, and the one-to-many regression (O2M reg) head, as illustrated in Fig. 7. In various studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly follows the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation stage, redundant detection results are often predicted for each instance. These redundancies are typically addressed using NMS, which eliminates duplicate results and retains the highest-confidence detection for each ground truth. However, NMS relies on a distance defined between detection results, and computing it can be complex for curved lanes and other irregular geometric shapes. Moreover, according to our previous analysis, NMS post-processing complicates the trade-off between recall and precision. To achieve ideal non-redundant detection results with an NMS-free paradigm (\textit{i.e.}, end-to-end detection), both the one-to-one and the one-to-many paradigms become crucial during the training stage, as highlighted in \cite{o2o}\cite{}. Inspired by \cite{}\cite{}, but with slight differences, we design a triplet head with dual label assignment.
+
+To ensure the simplicity and speed of our model, the O2M classification head and the O2M regression head are designed with a plain structure, each a two-layer MLP. Without the O2O classification head, the second stage of Polar R-CNN is similar to previous anchor-based works, but without complicated structures such as attention \cite{} and cascade refinement \cite{}; its predictions are redundant and still require NMS post-processing. To make the model end-to-end, we design an extended O2O classification head. As shown in Fig. \ref{}, it should be noted that the detection process of the O2O classification head is not independent but builds on the O2M classification head.
+In the O2M paradigm adopted in previous work, only predictions whose confidence output by the O2M classification head exceeds a threshold $\tau_g$ are chosen as positive detection results. From a probability perspective, this confidence can be expressed as follows:
\begin{align}
s^{o2m}_{i}=P\left( \text{the prediction from the $i$-th anchor is a true lane} \right),
\end{align}
+
while the confidence output by the O2O classification head can be expressed in a conditional probability format:
\begin{align}
s_{i}=P\left( \text{non-redundant} \;\middle|\; \text{true lane} \right)\cdot s^{o2m}_{i},
\end{align}
where $s_i$ denotes the confidence of the final non-redundant predictions. If $s_i>\tau_g$, the lane instance predicted from the $i$-th anchor is regarded as a positive instance. From another point of view, the O2O classification head can be viewed as a replacement for NMS post-processing.
+
+
+As shown in Fig. \ref{}, we introduce a novel architecture that combines a \textit{graph neural network} (GNN) \cite{gnn} with a polar geometric prior; we call this block the Polar GNN. The Polar GNN is designed to model the relationships between the features $F$ sampled from different anchors. According to our previous analysis, the distance between lanes should not only be modeled by explicit geometric properties but should also account for implicit contextual semantics, as with ``double'' and ``forked'' lanes. Such lanes have tiny geometric distances between them, yet they should not be removed as redundant predictions.
+
+The design insight of the Polar GNN comes from Fast NMS \cite{}, which is iteration-free. The design details are given in the appendix; here we only elaborate on the architecture of the Polar GNN. In the Polar GNN, the graph is built over the pooled features $F$ output by the RoI pooling layer: once the score $s$ from the O2M classification head and the regressed offset $r$ are obtained, we construct an adjacency matrix that encodes the relationships between anchor predictions.
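As a rough illustration of this Fast NMS-style construction, here is a minimal PyTorch sketch (ours, not the paper's; the hard geometric rule below merely stands in for the learned, feature-conditioned edges of the Polar GNN, and the thresholds are invented):
\begin{verbatim}
import torch

def fast_nms_adjacency(s, r, theta, r_thr=10.0, theta_thr=0.1):
    # s: (K,) O2M confidences; r, theta: (K,) polar parameters of the
    # predictions. Entry [i, j] is True when prediction j is geometrically
    # close to prediction i and has a higher score, i.e. j may suppress i.
    close = ((r.unsqueeze(0) - r.unsqueeze(1)).abs() < r_thr) & \
            ((theta.unsqueeze(0) - theta.unsqueeze(1)).abs() < theta_thr)
    higher = s.unsqueeze(0) > s.unsqueeze(1)   # [i, j]: s_j > s_i
    return close & higher                      # (K, K) boolean adjacency

# Iteration-free suppression: keep prediction i iff it has no incoming edge.
# keep = ~fast_nms_adjacency(s, r, theta).any(dim=1)
\end{verbatim}
In the Polar GNN, these boolean edges are replaced by learned edge features conditioned on the pooled features, so that geometrically close but semantically distinct lanes (\textit{e.g.}, double lanes) are not suppressed.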
\textbf{NMS vs NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denote the RoI features extracted from the $i$-th anchor; the three subheads all take $\boldsymbol{F}^{roi}_{i}$ as input. For now, let us focus on the O2M classification (O2M cls) head and the O2M regression (O2M reg) head, which follow the old paradigm used in previous work and can serve as a baseline for the new one-to-one paradigm. To maintain simplicity and rigor, both the O2M classification head and the O2M regression head consist of two layers with activation functions, featuring a plain structure without any complex mechanisms such as attention or deformable convolution. As previously mentioned, merely replacing the one-to-many label assignment with a one-to-one label assignment is insufficient for eliminating NMS post-processing. This is because anchors often exhibit significant overlap or are positioned very close to each other, as shown in Fig. \ref{anchor setting}(b)\&(c). Let $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ represent the features from two overlapping (or very close) anchors, implying that $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ will be almost identical. Let $f_{plain}^{cls}$ denote the neural structure used in the O2M classification head, and suppose it is trained with one-to-one label assignment. If $\boldsymbol{F}^{roi}_{i}$ is a positive sample and $\boldsymbol{F}^{roi}_{j}$ is a negative sample, the ideal output should be as follows: \begin{equation}