diff --git a/main.tex b/main.tex
index 5598a76..675edb1 100644
--- a/main.tex
+++ b/main.tex
@@ -218,24 +218,24 @@ F_{cls}\gets \phi _{cls}^{l}\left( F_d \right)\ &\text{and}\ F_{cls}\in \mathbb{
 \end{align}
 The regression branch consists of a single $1\times1$ convolutional layer with the goal of generating lane anchors by outputting their angles $\theta_j$ and radii $r^{l}_{j}$, \textit{i.e.}, $F_{reg\,\,} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, in the local polar coordinate system introduced previously. Similarly, the classification branch $\phi _{cls}^{l}\left(\cdot \right)$ only consists of two $1\times1$ convolutional layers for simplicity. This branch predicts the confidence heat map $F_{cls\,\,}\equiv \left\{ s_j^l \right\} _{j=1}^{H^l\times W^l}$ for local poles, each associated with a feature point. By discarding local poles with lower confidence, the module increases the likelihood of selecting potential positive foreground lane anchors while effectively removing background lane anchors.
 \par
-\textbf{Loss Function for Training the LPM.} To train the local polar module, we define the ground truth labels for each local pole as follows: the ground truth radius, $\hat{r}^l_i$, is set to be the minimum distance from a local pole to the corresponding lane curve, while the ground truth angle, $\hat{\theta}_i$, is set to be the orientation of the vector extending from the local pole to the nearest point on the curve. A positive pole is labeled as one; otherwise, it is labeled as zero. Consequently, we have a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ if it is negative. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, the LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for classification branch. The loss functions for the LPM are given as follows:
+\textbf{Loss Function for LPM.} To train the local polar module, we define the ground truth labels for each local pole as follows: the ground truth radius, $\hat{r}^l_j$, is set to the minimum distance from a local pole to the corresponding lane curve, while the ground truth angle, $\hat{\theta}_j$, is set to the orientation of the vector extending from the local pole to the nearest point on the curve. A positive pole is labeled as one; otherwise, it is labeled as zero. Consequently, we have a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ if it is negative. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for LPM are given as follows:
 \begin{align}
 \mathcal{L} ^{l}_{cls}&=BCE\left( F_{cls},F_{gt} \right) \label{loss_lph}
 \end{align}
-where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\tau^{l}\}\right|$ is the number of positive local poles in the LPM.
+where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\tau^{l}\}\right|$ is the number of positive local poles in LPM.
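To make the label assignment above concrete, the following is a minimal NumPy sketch, assuming each ground-truth lane is given as a dense sequence of 2-D image points; the function name assign_local_polar_labels and the threshold argument tau_l are illustrative names, not the paper's implementation.

import numpy as np

def assign_local_polar_labels(pole_centers, lane_points, tau_l):
    """Toy label assignment for local poles (illustrative sketch, not the paper's code).

    pole_centers: (N, 2) array of (x, y) local pole positions.
    lane_points:  (M, 2) array of densely sampled points on one ground-truth lane.
    tau_l:        radius threshold separating positive from negative poles.
    Returns ground-truth radius r_hat, angle theta_hat, and binary labels s_hat.
    """
    # Distance from every pole to every sampled lane point.
    diff = lane_points[None, :, :] - pole_centers[:, None, :]   # (N, M, 2)
    dist = np.linalg.norm(diff, axis=-1)                        # (N, M)
    nearest = dist.argmin(axis=1)                               # index of closest lane point
    r_hat = dist[np.arange(len(pole_centers)), nearest]         # minimum distance = GT radius
    # GT angle: orientation of the vector from the pole to its nearest lane point
    # (the sign convention depends on the image coordinate frame assumed here).
    v = lane_points[nearest] - pole_centers
    theta_hat = np.arctan2(v[:, 1], v[:, 0])
    # A pole is positive when its GT radius falls below the threshold.
    s_hat = (r_hat < tau_l).astype(np.float32)
    return r_hat, theta_hat, s_hat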
 \par
 \textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are considered potential candidates during the training stage. However, some of these anchors serve as background anchors. We select the top-$K$ anchors with the highest confidence scores as the candidate anchors to feed into the second stage (\textit{i.e.}, the global polar head). During training, all anchors are chosen as candidates, where $K=H^{l}\times W^{l}$, because this aids the \textit{Global Polar Module} (the next stage) in learning from a diverse range of features, including various negative anchor samples. Conversely, during the evaluation stage, some of the anchors with lower confidence can be excluded, where $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of the anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
 \begin{figure}[t]
 \centering
 \includegraphics[width=0.89\linewidth]{thesis_figure/detection_head.png}
- \caption{The main pipeline of GPM. It comprises the RoI Pooling Layer alongside the triplet heads—namely, the O2O classification head, O2M classification head, and O2M regression head. The predictions generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing. Conversely, the O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores (also denoted as $\left\{\tilde{s}_i^g\right\}$) based on the redundant scores (also denoted as $\left\{s_i^g\right\}$) from the O2M classification head.}
+ \caption{The main pipeline of GPM. It comprises the RoI Pooling Layer alongside the triplet heads—namely, the O2O classification head, the O2M classification head, and the O2M regression head. The predictions generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing. Conversely, the O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores (denoted as $\left\{\tilde{s}_i^g\right\}$) based on the redundant scores (denoted as $\left\{s_i^g\right\}$) from the O2M classification head.}
 \label{g}
 \end{figure}
 \subsection{Global Polar Module}
-Similar to the pipeline of Faster R-CNN, the LPM serves as the first stage for generating lane anchor proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve final lane prediction. The GPM takes features samples from anchors and outputs the precise location and confidence scores of final lane detection results. The overall architecture of GPM is illustrated in the Fig. \ref{g}.
+Similar to the pipeline of Faster R-CNN, LPM serves as the first stage for generating lane anchor proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve the final lane prediction. GPM takes features sampled from anchors and outputs the precise locations and confidence scores of the final lane detection results.
+The overall architecture of GPM is illustrated in Fig. \ref{g}.
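To illustrate the top-$K$ anchor selection described above, a minimal PyTorch sketch follows. It assumes a flattened confidence map and per-point $(\theta, r)$ predictions from LPM; the name select_topk_anchors and its signature are illustrative, not the released implementation.

import torch

def select_topk_anchors(scores, anchors, k, training):
    """Keep the K highest-confidence local-pole anchors (illustrative sketch).

    scores:  (H*W,) confidence heat map from the LPM classification branch.
    anchors: (H*W, 2) predicted (theta, r) pairs from the regression branch.
    During training every anchor is kept (k = H*W); at evaluation time k <= H*W
    filters likely background anchors before the global polar module.
    """
    if training:
        k = scores.numel()
    topk_scores, idx = scores.topk(k)
    return topk_scores, anchors[idx]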
 \par
 \textbf{RoI Pooling Layer.} It is designed to extract sampled features from lane anchors. For ease of the sampling operation, we first convert the radius of the positive lane anchors in a local polar coordinate system, $r_j^l$, to the radius in the global polar coordinate system, $r_j^g$, by the following equation
 \begin{align}
@@ -255,12 +255,12 @@ Given the feature maps $P_1, P_2, P_3$ from FPN, we can extract feature vectors
 \end{equation}
 where $\boldsymbol{w}_{k}\in \mathbb{R} ^{N^{l}_{pos}}$ represents the learnable aggregate weight. Instead of concatenating the three sampled features into $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f\times 3}$ directly, the adaptive summation significantly reduces the feature dimension to $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f}$, one-third of the original. The weighted-sum tensors are then fed into fully connected layers to obtain the pooled RoI features of an anchor:
 \begin{equation}
- \boldsymbol{F}^{roi}\gets FC_{pool}\left( \boldsymbol{F}^s \right), \boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
+ \boldsymbol{F}^{roi}\gets \mathrm{Linear}_{pool}\left( \boldsymbol{F}^s \right), \boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
 \end{equation}
-\textbf{Triplet Head.} The triplet head encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{g}. In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eradicates duplicate results. Nevertheless, NMS relies on the definition of the geometric distance between detection results, rendering this calculation intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis. To attain optimal non-redundant detection outcomes within a NMS-free paradigm (i.e., end-to-end detection), both the one-to-one and one-to-many paradigms become pivotal during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss} but with subtle variations, we architect the triplet head to achieve a NMS-free paradigm.
+\textbf{Triplet Head.} With $\boldsymbol{F}^{roi}$ as input, the triplet head encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{g}. In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eradicates duplicate results.
+Nevertheless, NMS relies on the definition of the geometric distance between detection results, rendering this calculation intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis. To attain optimal non-redundant detection outcomes within an NMS-free paradigm (i.e., end-to-end detection), both the one-to-one and one-to-many paradigms become pivotal during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss} but with subtle variations, we architect the triplet head to achieve an NMS-free paradigm.
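The adaptive weighted summation and pooling step can be pictured with a small PyTorch module. The sketch below assumes one learnable weight per FPN level and sampled point and processes a single anchor per call; the class name AdaptiveRoIPooling and the exact weight shape are assumptions made for illustration, not the paper's definition.

import torch
import torch.nn as nn

class AdaptiveRoIPooling(nn.Module):
    """Sketch of the weighted-sum RoI pooling described above (illustrative only).

    Features sampled from the three FPN levels are fused by learnable weights
    instead of concatenation, then projected by a linear pooling layer.
    """
    def __init__(self, num_points, feat_dim, roi_dim):
        super().__init__()
        # One learnable weight per FPN level (3 assumed) and sampled point.
        self.w = nn.Parameter(torch.ones(3, num_points))
        self.linear_pool = nn.Linear(num_points * feat_dim, roi_dim)

    def forward(self, feats):
        # feats: (3, num_points, feat_dim) features sampled along one anchor.
        fused = (self.w.unsqueeze(-1) * feats).sum(dim=0)   # (num_points, feat_dim)
        return self.linear_pool(fused.flatten())            # (roi_dim,) pooled RoI feature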
-To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are constructed using a straightforward architecture featuring two-layer Multi-Layer Perceptrons (MLPs). To facilitate the model’s transition to an end-to-end paradigm, we have developed an extended O2O classification head. As illustrated in Fig. \ref{g}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i \right\}$ output by the O2O classificatoin head relies upon the confidence $\left\{ s_i \right\} $ output by the O2M classification head.
+To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are constructed using a straightforward architecture featuring two-layer Multi-Layer Perceptrons (MLPs). To facilitate the model’s transition to an end-to-end paradigm, we have developed an extended O2O classification head. As illustrated in Fig. \ref{g}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i^g \right\}$ output by the O2O classification head relies upon the confidence $\left\{ s_i^g \right\}$ output by the O2M classification head.
 \begin{figure}[t]
 \centering
@@ -274,16 +274,16 @@ As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture that inco
 In the Polar GNN, each anchor is conceptualized as a node, with the RoI features $\boldsymbol{F}_{i}^{roi}$ serving as the attributes of these nodes. A pivotal component of the GNN is the edge, represented by the adjacency matrix. This matrix is derived from three submatrices. The first component is the positive selection matrix, denoted as $\boldsymbol{M}^{P}\in\mathbb{R}^{K\times K}$:
 \begin{align}
 M_{ij}^{P}=\begin{cases}
- 1, \left( s_i\geqslant \tau ^s\land s_j\geqslant \tau ^s \right)\\
+ 1, & \text{if } s_i^g\geqslant \tau ^s\land s_j^g\geqslant \tau ^s,\\
 0, & \text{otherwise},\\
 \end{cases}
 \end{align}
-where $\tau ^s$ signifies the threshold for positive scores in the NMS paradigm. We employ this threshold to selectively retain positive redundant predictions.
+where $\tau ^s$ signifies the threshold for positive scores, used to selectively retain the initial positive redundant predictions.
 The second component is the confidence comparison matrix $\boldsymbol{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
 \begin{align}
 M_{ij}^{C}=\begin{cases}
- 1, s_i<s_j\\
+ 1, & \text{if } s_i^g<s_j^g,\\
 0, & \text{otherwise},\\
 \end{cases}
 \end{align}
 The positive and negative sample sets considered by the O2O classification loss are restricted to
 \begin{align}
-	\varOmega _{o2o}^{pos}\cup \varOmega _{o2o}^{neg}=\left\{ i|s_i>\tau ^s. \right\}
+	\varOmega _{o2o}^{pos}\cup \varOmega _{o2o}^{neg}=\left\{ i\mid s_i^g>\tau ^s \right\}.
 \end{align}
 In essence, certain samples with lower O2M scores are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$.
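For concreteness, here is a short PyTorch sketch of how the two submatrices above could be built from the O2M scores. The name build_adjacency, the strict $<$ comparison in $\boldsymbol{M}^{C}$, and the omission of the third (geometric) submatrix are simplifying assumptions, not the paper's exact construction.

import torch

def build_adjacency(scores, tau_s):
    """Sketch of two adjacency submatrices used by the O2O classification head.

    scores: (K,) O2M confidence scores s^g.
    M^P keeps pairs where both anchors pass the positive threshold tau_s;
    M^C marks pairs whose first score is lower than the second.
    """
    pos = scores >= tau_s
    m_pos = (pos[:, None] & pos[None, :]).float()         # positive selection matrix M^P
    m_cmp = (scores[:, None] < scores[None, :]).float()   # confidence comparison matrix M^C
    return m_pos, m_cmp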
 Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head.
 \begin{figure}[t]
@@ -336,7 +338,7 @@ In essence, certain samples with lower O2M scores are excluded from the computat
 \caption{Auxiliary loss for segment parameter regression. The ground truth of a lane curve is partitioned into several segments, with the parameters of each segment denoted as $\left( \hat{\theta}_{i,\cdot}^{seg},\hat{r}_{i,\cdot}^{seg} \right)$. The model outputs the parameter offsets $\left( \varDelta \theta _{j,\cdot},\varDelta r_{j,\cdot}^{g} \right)$ to regress from the original anchor to each target line segment.}
 \label{auxloss}
 \end{figure}
-We directly apply the redefined GLaneIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offset of x-axis coordinates of sampled points and $Smooth_{L1}$ loss for the regression of end points of lanes, denoted as $\mathcal{L}_{end}$. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{\mathrm{aux}}$ depicted in Fig. \ref{auxloss}. The anchors and ground truth are segmented into several divisions. Each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the anchors in acquiring a deeper comprehension of the global geometric form.
+We directly apply the redefined GIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offsets of the x-axis coordinates of the sampled points, and the $Smooth_{L1}$ loss to regress the end points of lanes, denoted as $\mathcal{L}_{end}$. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{\mathrm{aux}}$ depicted in Fig. \ref{auxloss}. The anchors and ground truth are segmented into several divisions. Each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the anchors in acquiring a deeper comprehension of the global geometric form.
 The final loss functions for GPM are given as follows:
 \begin{align}
@@ -348,7 +350,7 @@ The final loss functions for GPM are given as follows:
 % \mathcal{L}_{aux} &= \frac{1}{\left| \varOmega^{pos}_{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\
 % &\quad + l \left( r_{i}^{g} - \hat{r}_{i}^{seg,m} \right) \Bigg].
 % \end{align}
-\subsection{The Overalll Loss Function.} The entire training process is orchestrated in an end-to-end manner, wherein both the LPM and the GPM are trained concurrently. The overall loss function is delineated as follows:
+\subsection{The Overall Loss Function} The entire training process is orchestrated in an end-to-end manner, wherein both LPM and GPM are trained concurrently. The overall loss function is delineated as follows:
 \begin{align}
 \mathcal{L} =\mathcal{L} _{cls}^{l}+\mathcal{L} _{reg}^{l}+\mathcal{L} _{cls}^{g}+\mathcal{L} _{reg}^{g}.
 \end{align}
@@ -385,7 +387,7 @@ For Tusimple, the evaluation is formulated as follows:
 where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the False Positive Rate (FPR=1-Precision) and the False Negative Rate (FNR=1-Recall).
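As a worked illustration of the TuSimple accuracy term above (20-pixel tolerance, 85\% threshold), here is a toy NumPy version for a single predicted lane. It ignores the per-image averaging and FP/FN bookkeeping of the official evaluation script, and tusimple_accuracy is an illustrative name.

import numpy as np

def tusimple_accuracy(pred_x, gt_x, pixel_thresh=20.0):
    """Toy version of the TuSimple point-accuracy term described above.

    pred_x, gt_x: x-coordinates of a predicted and a ground-truth lane sampled
    at the same fixed rows (NaN where the lane is absent).  A point counts as
    correct when it lies within `pixel_thresh` pixels of the ground truth.
    """
    valid = ~np.isnan(gt_x)
    correct = np.abs(pred_x[valid] - gt_x[valid]) < pixel_thresh
    acc = correct.sum() / max(valid.sum(), 1)
    # The whole lane prediction is judged correct when accuracy exceeds 85%.
    return acc, acc > 0.85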
 \subsection{Implementation Details}
-All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The number of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficient of cost function $\beta$ is set to 6 respectively. The training processing (including the LPM and GPM) is end-to-end just like \cite{adnet}\cite{srlane} in one step. All the experiments are conducted on a single NVIDIA A100-40G GPU. To make our model simple, we only use CNN-based backbone, namely ResNet\cite{resnet} and DLA34\cite{dla}. Other details for datasets and training process can be seen in Appendix \ref{vis_appendix}.
+All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficient of the cost function, $\beta$, is set to 6. The training process (including LPM and GPM) is end-to-end in a single step, just like \cite{adnet}\cite{srlane}. All the experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}. Other details of the datasets and training process can be found in Appendix \ref{vis_appendix}.
 \begin{table*}[htbp]
@@ -837,20 +839,20 @@ We decided to choose the Fast NMS \cite{yolact} as the inspiration of the design
 The index of positive predictions, $1, 2, ..., i, ..., N_{pos}$;\\
 The positive corresponding anchors, $\left\{ \theta _i,r_{i}^{g} \right\} |_{i=1}^{K}$;\\
 The x-axis coordinates of the sampling points from positive anchors, $\boldsymbol{x}_{i}^{b}$;\\
- The positive confidence get from O2M classification head, $s_i$;\\
- The positive regressions get from O2M regression head, the horizontal offset $\varDelta \boldsymbol{x}_{i}^{roi}$ and end point location $\boldsymbol{e}_{i}$.\\
+ The positive confidences obtained from the O2M classification head, $s_i^g$;\\
+ The positive regressions obtained from the O2M regression head, the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and end point locations $\boldsymbol{e}_{i}$.\\
 \ENSURE ~~\\ % Output of the algorithm
 \STATE Select the positive candidates by $\boldsymbol{M}^{P}\in\mathbb{R}^{K\times K}$:
 \begin{align}
 M_{ij}^{P}=\begin{cases}
- 1, \left( s_i\geqslant \tau ^s\land s_j\geqslant \tau ^s \right)\\
+ 1, & \text{if } \left( s_i^g\geqslant \tau ^s\land s_j^g\geqslant \tau ^s \right)\\
 0, & \text{otherwise},\\
 \end{cases}
 \end{align}
 \STATE Calculate the confidence comparison matrix $\boldsymbol{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
 \begin{align}
 M_{ij}^{C}=\begin{cases}
- 1, s_i<s_j\\
+ 1, & \text{if } s_i^g<s_j^g,\\
 0, & \text{otherwise},\\
 \end{cases}
 \end{align}
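In the spirit of the Fast NMS-style selection sketched in the algorithm above, the following PyTorch snippet shows one way such a selection could look, assuming a precomputed pairwise geometric distance matrix dist between positive lane predictions. The function name fast_nms_keep and the thresholds tau_s and tau_d are illustrative; the paper's actual distance definition is given in its appendix.

import torch

def fast_nms_keep(scores, dist, tau_s, tau_d):
    """Fast-NMS-style selection sketch (illustrative, names are assumptions).

    scores: (K,) O2M confidences; dist: (K, K) pairwise geometric distances
    between lane predictions.  A lane is suppressed when a higher-scoring
    positive lane lies closer than tau_d.
    """
    pos = scores >= tau_s
    higher = scores[None, :] > scores[:, None]           # entry (i, j): s_j > s_i
    conflict = (dist < tau_d) & higher & pos[None, :]     # close, stronger, positive rival
    keep = pos & ~conflict.any(dim=1)
    return keep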