diff --git a/main.tex b/main.tex index 163e4f0..d8aae35 100644 --- a/main.tex +++ b/main.tex @@ -125,7 +125,7 @@ Regrading the first issue, \cite{clrnet} introduced learned anchors that optimiz \par Regarding the second issue, nearly all anchor-based methods \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane} rely on direct or indirect NMS post-processing to eliminate redundant predictions. Although it is necessary to eliminate redundant predictions, NMS remains a suboptimal solution. On one hand, NMS is not deployment-friendly because it requires defining and calculating distances between lane pairs using metrics such as \textit{Intersection over Union} (IoU). This task is more challenging than in general object detection due to the intricate geometry of lanes. On the other hand, NMS can struggle in dense scenarios. Typically, a large distance threshold may lead to false negatives, as some true positive predictions could be mistakenly eliminated, as illustrated in Fig. \ref{NMS setting}(a)(c). Conversely, a small distance threshold may fail to eliminate redundant predictions effectively, resulting in false positives, as shown in Fig. \ref{NMS setting}(b)(d). Therefore, achieving an optimal trade-off across all scenarios by manually setting the distance threshold is challenging. %The root of this problem lies in the fact that the distance definition in NMS considers only geometric parameters while ignoring the semantic context in the image. As a result, when two predictions are ``close'' to each other, it is nearly impossible to determine whether one of them is redundant.% where lane ground truths are closer together than in sparse scenarios;including those mentioned above, \par -To address the above two issues, we propose Polar R-CNN, a novel anchor-based method for lane detection. For the first issue, we introduce local and global heads based on the polar coordinate system to create anchors with more accurate locations, thereby reducing the number of proposed anchors in sparse scenarios, as illustrated in Fig. \ref{anchor setting}(c). In contrast to \textit{State-Of-The-Art} (SOTA) methods \cite{clrnet}\cite{clrernet}, which utilize 192 anchors, Polar R-CNN employs only 20 anchors to effectively cover potential lane ground truths. For the second issue, we have incorporated a new heuristic \textit{Graph Neural Network} (GNN) \cite{gnn} block into the detection head. The GNN block offers an interpretable structure, achieving nearly equivalent performance in sparse scenarios and superior performance in dense scenarios. We conducted experiments on five major benchmarks: \textit{TuSimple} \cite{tusimple}, \textit{CULane} \cite{scnn}, \textit{LLAMAS} \cite{llamas}, \textit{CurveLanes} \cite{curvelanes}, and \textit{DL-Rail} \cite{dalnet}. Our proposed method demonstrates competitive performance compared to SOTA approaches. Our main contributions are summarized as follows: +To address the above two issues, we propose Polar R-CNN, a novel anchor-based method for lane detection. For the first issue, we introduce local and global heads based on the polar coordinate system to create anchors with more accurate locations, thereby reducing the number of proposed anchors in sparse scenarios, as illustrated in Fig. \ref{anchor setting}(c). In contrast to \textit{State-Of-The-Art} (SOTA) methods \cite{clrnet}\cite{clrernet}, which utilize 192 anchors, Polar R-CNN employs only 20 anchors to effectively cover potential lane ground truths. 
For the second issue, we have incorporated a triplet head with a new heuristic \textit{Graph Neural Network} (GNN) \cite{gnn} block. The GNN block offers an interpretable structure, achieving nearly equivalent performance in sparse scenarios and superior performance in dense scenarios. We conducted experiments on five major benchmarks: \textit{TuSimple} \cite{tusimple}, \textit{CULane} \cite{scnn}, \textit{LLAMAS} \cite{llamas}, \textit{CurveLanes} \cite{curvelanes}, and \textit{DL-Rail} \cite{dalnet}. Our proposed method demonstrates competitive performance compared to SOTA approaches. Our main contributions are summarized as follows: \begin{itemize} \item We design a strategy to simplify the anchor parameters by using local and global polar coordinate systems and apply it to a two-stage lane detection framework. Compared to other anchor-based methods, this strategy significantly reduces the number of proposed anchors while achieving better performance. \item We propose a novel triplet detection head with a GNN block to implement an NMS-free paradigm. The block is inspired by Fast NMS, providing enhanced interpretability. Our model supports end-to-end training and testing while still allowing traditional NMS post-processing as an option for an NMS-based version of our model. @@ -188,11 +188,11 @@ However, the representation of lane anchors as rays presents certain limitations \begin{figure}[t] \centering \includegraphics[width=0.87\linewidth]{thesis_figure/coord/localpolar.png} - \caption{The local polar coordinate system. The ground truth of the radius $\hat{r}_{i}^{l}$ of the $i$-th local pole is defines as the minimum distance from the pole to the lane curve instance. A positive pole has a radius $\hat{r}_{i}^{l}$ that is below a threshold $\lambda^{l}$, and vice versa. Additionally, the ground truth angle $\hat{\theta}^l_i$ is determined by the angle formed between the radius vector (connecting the pole to the closest point on the lanes) and the local polar axis.} + \caption{The local polar coordinate system. The ground truth of the radius $\hat{r}_{i}^{l}$ of the $i$-th local pole is defined as the minimum distance from the pole to the lane curve instance. A pole is labeled positive if its radius $\hat{r}_{i}^{l}$ is below a threshold $\lambda^{l}$, and negative otherwise. Additionally, the ground truth angle $\hat{\theta}_i$ is determined by the angle formed between the radius vector (connecting the pole to the closest point on the lane) and the local polar axis.} \label{lpmlabel} \end{figure} \par -\textbf{Representation in Polar Coordinate.} As stated above, lane anchors represented by rays have some drawbacks. To address these issues, we introduce a polar coordinate representation of lane anchors. In mathematics, the polar coordinate is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a reference point (also called the pole) and an angle $\theta$ from a reference direction (called polar axis). As shown in Fig. \ref{coord}(b), given a polar corresponding to the yellow point, a lane anchor for a straight line can be uniquely defined by two parameters: the radial distance from the pole (called radius), $r$, and the counterclockwise angle from the polar axis to the perpendicular line of the lane anchor, $\theta$, with $r \in \mathbb{R}$ and $\theta\in\left(-\frac{\pi}{2}, \frac{\pi}{2}\right]$. +\textbf{Representation in Polar Coordinates.} As stated above, lane anchors represented by rays have some drawbacks.
To address these issues, we introduce a polar coordinate representation of lane anchors. In mathematics, a polar coordinate system is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a reference point (\textit{i.e.}, the pole) and an angle $\theta$ from a reference direction (\textit{i.e.}, the polar axis). As shown in Fig. \ref{coord}(b), given a pole (the yellow point), a lane anchor for a straight line can be uniquely defined by two parameters: the radial distance from the pole (\textit{i.e.}, the radius), $r$, and the counterclockwise angle from the polar axis to the perpendicular line of the lane anchor, $\theta$, with $r \in \mathbb{R}$ and $\theta\in\left(-\frac{\pi}{2}, \frac{\pi}{2}\right]$. \par To better leverage the local inductive bias properties of CNNs, we define two types of polar coordinate systems: the local and the global coordinate system. The local polar coordinate system is used to generate lane anchors, while the global coordinate system expresses these anchors in a unified form across the entire image and regresses them to the ground truth lane instances. Given the distinct roles of the local and global systems, we adopt a two-stage framework for our Polar R-CNN, similar to Faster R-CNN \cite{fasterrcnn}. \par @@ -225,12 +225,12 @@ The regression branch consists of a single $1\times1$ convolutional layer and wi \end{align} where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\lambda^{l}\}\right|$ is the number of positive local poles in LPM. \par -\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are considered as potential candidates during the training stage. However, some of these anchors serve as background anchors. We select the top-$K$ anchors with the highest confidence scores as the candicate anchor to feed into the second stage (\textit{i.e.} global polar head). During training, all anchors are chosen as candidates, where $K=H^{l}\times W^{l}$ because it aids \textit{Global Polar Module} (the next stage) in learning from a diverse range of features, including various negative anchor samples. Conversely, during the evaluation stage, some of the anchors with lower confidence can be excluded, where $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of anchor distribution while decreasing the total number of anchors especially in the sprase scenrois. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies. +\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a local pole in the feature map, are considered as candidates during the training stage. However, some of these anchors serve as background anchors. We select the top-$K$ anchors with the highest confidence scores as the foreground candidates to feed into the second stage (\textit{i.e.}, the global polar head). During training, all anchors are chosen as candidates, where $K=H^{l}\times W^{l}$, because this aids the \textit{Global Polar Module} (the next stage) in learning from a diverse range of features, including various negative background anchor samples. Conversely, during the evaluation stage, some of the anchors with lower confidence can be excluded, where $K\leqslant H^{l}\times W^{l}$.
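A minimal PyTorch sketch of this selection step (illustrative only; `conf`, `theta`, and `r` are assumed names for the flattened LPM outputs, not the authors' released code):

```python
import torch

def select_topk_anchors(conf, theta, r, k, training=False):
    """Keep the K most confident anchors proposed by the LPM.

    conf:  (H_l * W_l,) confidence score of each local pole
    theta: (H_l * W_l,) predicted angle of the anchor at each pole
    r:     (H_l * W_l,) predicted radius of the anchor at each pole
    """
    if training:
        # Training: keep every anchor (K = H_l * W_l) so that the next
        # stage also learns from negative/background anchor samples.
        k = conf.numel()
    # Evaluation: keep only the K highest-scoring anchors, discarding
    # likely background proposals before the second stage.
    scores, idx = torch.topk(conf, k)
    return scores, theta[idx], r[idx]
```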
This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of the anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies. \begin{figure}[t] \centering - \includegraphics[width=0.89\linewidth]{thesis_figure/detection_head.png} - \caption{The main pipeline of GPM. It comprises the RoI Pooling Layer alongside the triplet heads—namely, the O2O classification head, the O2M classification head, and the O2M regression head. The predictions generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing. The O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores $\left\{\tilde{s}_i^g\right\}$ based on the redundant predictions with confidence scores $\left\{s_i^g\right\}$ from the O2M classification head. Both $\left\{s_i^g\right\}$ and $\left\{\tilde{s}_i^g\right\}$ participate in the selection of final non-redundant results, which is called dual confidence selection.} + \includegraphics[width=\linewidth]{thesis_figure/detection_head.png} + \caption{The main pipeline of GPM. It comprises the RoI Pooling Layer alongside the triplet head. The triplet head consists of three parts, namely, the O2O classification head, the O2M classification head, and the O2M regression head. The predictions generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing (the gray dashed route). The O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores $\left\{\tilde{s}_i^g\right\}$ based on $\left\{s_i^g\right\}$ (the green solid route). Both $\left\{s_i^g\right\}$ and $\left\{\tilde{s}_i^g\right\}$ participate in the selection of the final non-redundant results, which is called dual confidence selection. During the backward pass of training, the gradient from the O2O classification head to the RoI pooling module is stopped (the blue dashed route).} \label{g} \end{figure} @@ -341,13 +341,14 @@ We utilize focal loss \cite{focal} for both O2O classification head and the O2M \begin{align} \varOmega _{o2o}=\left\{ i\mid s_i^g>\lambda_{o2m}^s \right\}. \end{align} -In essence, certain samples with lower O2M scores are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$. Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head. +In essence, certain samples with lower O2M scores are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$. Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head. Since the label assignment of the O2O classification head differs from that of the O2M classification head, the gradient from the O2O classification head to the RoI pooling layer is stopped during training to preserve the quality of the learned RoI features. This trick is also proposed in \cite{pss}.
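The stop-gradient trick above can be sketched as follows (a minimal PyTorch illustration with assumed module names; only the `detach()` call reflects the described technique):

```python
import torch.nn as nn

class TripletHead(nn.Module):
    """Triplet head with a stop-gradient on the O2O branch."""

    def __init__(self, o2o_cls_head, o2m_cls_head, o2m_reg_head):
        super().__init__()
        self.o2o_cls_head = o2o_cls_head  # one-to-one classification
        self.o2m_cls_head = o2m_cls_head  # one-to-many classification
        self.o2m_reg_head = o2m_reg_head  # one-to-many regression

    def forward(self, roi_feat):
        # The O2M branches backpropagate normally and drive the
        # learning of the shared RoI features.
        s_o2m = self.o2m_cls_head(roi_feat)
        reg_o2m = self.o2m_reg_head(roi_feat)
        # detach() cuts the gradient path from the O2O classification
        # head back to RoI pooling: its one-to-one label assignment
        # differs from the O2M assignment and would otherwise degrade
        # the shared features.
        s_o2o = self.o2o_cls_head(roi_feat.detach())
        return s_o2o, s_o2m, reg_o2m
```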
\begin{figure}[t] \centering \includegraphics[width=\linewidth]{thesis_figure/auxloss.png} % \caption{Auxiliary loss for segment parameter regression. The ground truth of a lane curve is partitioned into several segments, with the parameters of each segment denoted as $\left( \hat{\theta}_{i,\cdot}^{seg},\hat{r}_{i,\cdot}^{seg} \right)$. The model outputs the parameter offsets $\left( \varDelta \theta _{j,\cdot},\varDelta r_{j,\cdot}^{g} \right)$ to regress from the original anchor to each target line segment.} \label{auxloss} \end{figure} + We directly apply the redefined GIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offset of x-axis coordinates of sampled points and $Smooth_{L1}$ loss for the regression of end points of lanes, denoted as $\mathcal{L}_{end}$. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{\mathrm{aux}}$ depicted in Fig. \ref{auxloss}. The anchors and ground truth are segmented into several divisions. Each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the anchors in acquiring a deeper comprehension of the global geometric form. The final loss functions for GPM are given as follows: @@ -734,7 +735,7 @@ We also explore the stop-gradient strategy for the O2O classification head. As s \begin{table}[h] \centering - \caption{The ablation study for the stop grad strategy on CULane test set.} + \caption{The ablation study for the stop gradient strategy on CULane test set.} \begin{adjustbox}{width=\linewidth} \begin{tabular}{c|c|lll} \toprule @@ -758,7 +759,7 @@ We also explore the stop-gradient strategy for the O2O classification head. As s \textbf{Ablation study on NMS-free block in dense scenarios.} Despite demonstrating the feasibility of replacing NMS with the O2O classification head in sparse scenarios, the shortcomings of NMS in dense scenarios remain. To investigate the performance of the NMS-free block in dense scenarios, we conduct experiments on the CurveLanes dataset, as detailed in Table \ref{aba_NMS_dense}. -In the traditional NMS post-processing \cite{clrernet}, the default IoU threshold is set to 50 pixels. However, this default setting may not always be optimal, especially in dense scenarios where some lane predictions might be erroneously eliminated. Lowering the IoU threshold increases recall but decreases precision. To find the most effective IoU threshold, we experimented with various values and found that a threshold of 15 pixels achieves the best trade-off, resulting in an F1-score of 86.81\%. In contrast, the NMS-free paradigm with the GNN-based O2O classification head achieves an overall F1-score of 87.29\%, which is 0.48\% higher than the optimal threshold setting in the NMS paradigm. Additionally, both precision and recall are improved under the NMS-free approach. This indicates the O2O classification head with proposed GNN structure is capable of learning both explicit geometric distance and implicit semantic distances between anchors in addition to geometric distances, thus providing a more effective solution for dense scenarios compared to the traditional NMS post-processing. +In the traditional NMS post-processing \cite{clrernet}, the default IoU threshold is set to 50 pixels. However, this default setting may not always be optimal, especially in dense scenarios where some lane predictions might be erroneously eliminated.
Lowering the IoU threshold increases recall but decreases precision. To find the most effective IoU threshold, we experimented with various values and found that a threshold of 15 pixels achieves the best trade-off, resulting in an F1-score of 86.81\%. In contrast, the NMS-free paradigm with the GNN-based O2O classification head achieves an overall F1-score of 87.29\%, which is 0.48\% higher than the optimal threshold setting in the NMS paradigm. Additionally, both precision and recall are improved under the NMS-free approach. This indicates that the O2O classification head with the proposed GNN structure is capable of learning both explicit geometric distances and implicit semantic distances between anchors, thus providing a more effective solution for dense scenarios than traditional NMS post-processing. \begin{table}[h] \centering @@ -852,40 +853,33 @@ We decided to choose the Fast NMS \cite{yolact} as the inspiration of the design The positive confidences obtained from the O2M classification head, $s_i^g$;\\ The positive regressions obtained from the O2M regression head, the horizontal offset $\varDelta \boldsymbol{x}_{i}^{roi}$ and end point location $\boldsymbol{e}_{i}$.\\ \ENSURE ~~\\ %Output of the algorithm - \STATE Selecte the positive candidates by $\boldsymbol{M}^{P}\in\mathbb{R}^{K\times K}$: - \begin{align} - A_{ij}^{P}=\begin{cases} - 1, if\,\left( s_i^g\geqslant \lambda ^s\land s_j\geqslant \lambda ^s \right)\\ - 0,others,\\ - \end{cases} - \end{align} - \STATE Caculate the confidence comparison matrix $\boldsymbol{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows: + \STATE Calculate the confidence comparison matrix $\boldsymbol{A}^{C}\in\mathbb{R}^{K\times K}$, defined as follows: \begin{align} A_{ij}^{C}=\begin{cases} - 1, if\,s_i^g<s_j\,\,and\,\,\left( s_i=s_j\,\,or\,\,i>j \right)\\ + 1, s_i^g<s_j^g\,\,or\,\,\left( s_i^g=s_j^g\,\,and\,\,i>j \right)\\ + 0, others.\\ + \end{cases} \label{confidential matrix} \end{align} where the $\land$ denotes (element-wise) logical ``AND'' operation between two Boolean values/tensors. - \STATE Calculate the geometric prior matrix $\boldsymbol{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows: + \STATE Calculate the geometric prior matrix $\boldsymbol{A}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows: \begin{align} A_{ij}^{G}=\begin{cases} - 1,if\,\left| \theta _i-\theta _j \right|<\lambda^{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\lambda^{r}\\ + 1, \left| \theta _i-\theta _j \right|<\lambda ^{\theta}\,\,and\,\,\left| r_{i}^{g}-r_{j}^{g} \right|<\lambda ^r\\ 0, others.\\ - \end{cases} + \end{cases} \label{geometric prior matrix} \end{align} - \STATE Calculate the distance matrix $\boldsymbol{D} \in \mathbb{R} ^{K \times K}$, where the element $D_{ij}$ in $\boldsymbol{D}$ is defined as follows: + \STATE Calculate the distance matrix $\boldsymbol{D} \in \mathbb{R}^{K \times K}$. The element $D_{ij}$ in $\boldsymbol{D}$ is defined as follows: \begin{align} D_{ij} = 1-d\left( \mathrm{lane}_i, \mathrm{lane}_j\right), \label{al_1-3} \end{align} - where $d\left(\cdot, \cdot \right)$ is some predefined function to quantify the distance between two lane predictions. + where $d\left(\cdot, \cdot \right)$ is a predefined function that quantifies the distance between two lane predictions, such as IoU.
- \STATE Define the adjacent matrix $\boldsymbol{M} = \boldsymbol{M}^{P} \land \boldsymbol{M}^{C} \land \boldsymbol{M}^{G}$ and the final confidence $\tilde{s}_i^g$ is calculate as following: + \STATE Define the adjacency matrix $\boldsymbol{A} = \boldsymbol{A}^{C} \land \boldsymbol{A}^{G}$, and calculate the final confidence $\tilde{s}_i^g$ as follows: \begin{align} \tilde{s}_i^g = \begin{cases} - 1, & if\,\text{if } \underset{j \in \{ j \mid A_{ij} = 1 \}}{\max} D_{ij} < \lambda^g \\ - 0, & \text{otherwise} + 1, \underset{j \in \{ j \mid A_{ij} = 1 \}}{\max} D_{ij} < \lambda^g \\ + 0, \text{otherwise.} \end{cases} \label{al_1-4} diff --git a/thesis_figure/coord/localpolar.png b/thesis_figure/coord/localpolar.png index 6a3a750..3bbf151 100644 Binary files a/thesis_figure/coord/localpolar.png and b/thesis_figure/coord/localpolar.png differ diff --git a/thesis_figure/thisis_pic.pptx b/thesis_figure/thisis_pic.pptx index e8abb73..e2f42f4 100644 Binary files a/thesis_figure/thisis_pic.pptx and b/thesis_figure/thisis_pic.pptx differ
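To make the selection procedure above concrete, here is a minimal NumPy sketch of the Fast NMS-style dual confidence selection (illustrative only: `lane_dist` stands in for the predefined distance $d(\cdot,\cdot)$ of the distance matrix step, and all names are assumptions rather than the authors' implementation):

```python
import numpy as np

def dual_confidence_selection(s, theta, r, lane_dist,
                              lam_theta, lam_r, lam_g):
    """s: (K,) O2M scores; theta, r: (K,) global polar parameters of
    the anchors; lane_dist: (K, K) pairwise distances d(lane_i, lane_j)."""
    K = s.shape[0]
    i, j = np.meshgrid(np.arange(K), np.arange(K), indexing="ij")
    # Confidence comparison matrix A^C: anchor j may suppress anchor i
    # only if it scores higher (ties broken by index).
    A_C = (s[i] < s[j]) | ((s[i] == s[j]) & (i > j))
    # Geometric prior matrix A^G: only anchors that are close in the
    # polar parameter space are allowed to suppress each other.
    A_G = ((np.abs(theta[i] - theta[j]) < lam_theta)
           & (np.abs(r[i] - r[j]) < lam_r))
    A = A_C & A_G                   # adjacency matrix
    D = 1.0 - lane_dist             # D_ij = 1 - d(lane_i, lane_j)
    # Keep anchor i iff no admissible neighbor j reaches D_ij >= lam_g;
    # rows with no admissible neighbor reduce to -inf and are kept.
    D_masked = np.where(A, D, -np.inf)
    keep = D_masked.max(axis=1) < lam_g
    return keep.astype(np.float32)  # binary final confidences
```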