This commit is contained in:
王老板 2024-10-11 10:26:41 +08:00
parent 23c6c540f9
commit 2ff681165d
2 changed files with 66 additions and 65 deletions

123
main.tex

@ -204,83 +204,83 @@ In the local polar coordinate system, the parameters of each lane anchor are det
\centering
\includegraphics[width=0.45\textwidth]{thesis_figure/local_polar_head.png}
\caption{The main architecture of the local polar module.}
\label{l}
\end{figure}
\subsection{Local Polar Module}
As shown in Fig. \ref{overall_architecture}, three levels of feature maps, denoted as $P_1, P_2, P_3$, are extracted using a \textit{Feature Pyramid Network} (FPN). To generate high-quality anchors around the lane ground truths within an image, we introduce the \textit{Local Polar Module} (LPM), which takes the highest-level feature map $P_3\in\mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ as input and outputs a set of lane anchors along with their confidence scores. As demonstrated in Fig. \ref{l}, $P_3$ undergoes a \textit{downsampling} operation $DS(\cdot)$ to produce a lower-dimensional feature map of size $H^l\times W^l$:
\begin{equation}
F_d\gets DS\left( P_{3} \right)\ \text{and}\ F_d\in \mathbb{R} ^{C_f\times H^{l}\times W^{l}}.
\end{equation}
The downsampled feature map $F_d$ is then fed into two branches: a \textit{regression} branch $\phi _{reg}^{l}\left(\cdot \right)$ and a \textit{classification} branch $\phi _{cls}^{l}\left(\cdot \right)$, \textit{i.e.},
\begin{align}
F_{reg}\gets \phi _{reg}^{l}\left( F_d \right)\ &\text{and}\ F_{reg}\in \mathbb{R} ^{2\times H^{l}\times W^{l}},\\
F_{cls}\gets \phi _{cls}^{l}\left( F_d \right)\ &\text{and}\ F_{cls}\in \mathbb{R} ^{H^{l}\times W^{l}}. \label{lpm equ}
\end{align}
The regression branch consists of a single $1\times1$ convolutional layer with the goal of generating lane anchors by outputting their angles $\theta_{j}$ and radii $r^{l}_{j}$, \textit{i.e.}, $F_{reg} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, in the local polar coordinate system defined previously. Similarly, the classification branch $\phi _{cls}^{l}\left(\cdot \right)$ consists of only two $1\times1$ convolutional layers for simplicity. This branch predicts the confidence heat map $F_{cls}\equiv \left\{ s_j^l \right\} _{j=1}^{H^l\times W^l}$ of the local poles, each associated with a feature point. By discarding local poles with lower confidence, the module increases the likelihood of selecting potential positive foreground lane anchors while effectively removing background lane anchors.
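For concreteness, a minimal PyTorch-style sketch of the LPM forward pass is given below; the downsampling operator (assumed here to be adaptive average pooling), the channel width, and the layer names are illustrative assumptions rather than the exact implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class LocalPolarModule(nn.Module):
    """Sketch: downsample P3, then predict (theta, r^l) and a
    confidence heat map for every local pole."""
    def __init__(self, c_f=64, polar_map_size=(4, 10)):
        super().__init__()
        # DS(.): assumed adaptive average pooling to an H^l x W^l grid.
        self.downsample = nn.AdaptiveAvgPool2d(polar_map_size)
        # Regression branch: a single 1x1 conv producing (theta_j, r_j^l).
        self.reg_branch = nn.Conv2d(c_f, 2, kernel_size=1)
        # Classification branch: two 1x1 convs predicting s_j^l.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_f, c_f, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_f, 1, kernel_size=1))

    def forward(self, p3):                       # p3: (B, C_f, H_f, W_f)
        f_d = self.downsample(p3)                # (B, C_f, H^l, W^l)
        f_reg = self.reg_branch(f_d)             # (B, 2, H^l, W^l)
        f_cls = self.cls_branch(f_d).squeeze(1)  # (B, H^l, W^l)
        return f_reg, f_cls
\end{verbatim}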
\par
\textbf{Loss Function for Training the LPM.} To train the local polar module, we define the ground truth labels for each local pole as follows: the ground truth radius, $\hat{r}^l_j$, is set to the minimum distance from the local pole to the corresponding lane curve, while the ground truth angle, $\hat{\theta}_j$, is set to the orientation of the vector extending from the local pole to the nearest point on the curve. A pole is labeled as positive (one) if its ground truth radius is smaller than a threshold $\tau^{l}$; otherwise, it is labeled as negative (zero). Consequently, we have a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ if it is negative. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, the LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for the LPM are given as follows:
\begin{align}
\mathcal{L} ^{l}_{cls}&=BCE\left( F_{cls},F_{gt} \right)
\label{loss_lph}
\end{align}
where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\tau^{l}\}\right|$ is the number of positive local poles in the LPM.
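Below is a sketch of how the LPM labels and losses can be assembled, assuming each ground-truth lane is available as a dense set of 2-D points; the threshold value and the exact normalization are illustrative assumptions.
\begin{verbatim}
import torch
import torch.nn.functional as F

def lpm_targets_and_loss(f_reg, f_cls, poles_xy, lane_points, tau_l=30.0):
    """f_reg: (2, P) predicted (theta, r^l); f_cls: (P,) pole logits;
    poles_xy: (P, 2) Cartesian coordinates of the local poles;
    lane_points: (M, 2) points densely sampled from the GT lane curves."""
    diff = lane_points[None, :, :] - poles_xy[:, None, :]   # (P, M, 2)
    dist = diff.norm(dim=-1)                                # (P, M)
    r_gt, nearest = dist.min(dim=-1)                        # min distance per pole
    v = diff[torch.arange(poles_xy.shape[0]), nearest]      # pole -> nearest point
    theta_gt = torch.atan2(v[:, 1], v[:, 0])                # GT angle
    pos = r_gt < tau_l                                      # positive local poles
    n_pos = pos.sum().clamp(min=1)
    loss_cls = F.binary_cross_entropy_with_logits(f_cls, pos.float())
    pred = f_reg[:, pos]
    target = torch.stack([theta_gt[pos], r_gt[pos]])
    loss_reg = F.smooth_l1_loss(pred, target, reduction='sum') / n_pos
    return loss_cls, loss_reg
\end{verbatim}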
\par
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are considered as potential candidates during the training stage. However, some of these anchors serve as background anchors. We select the top-$K$ anchors with the highest confidence scores as the candidate anchors fed into the second stage (\textit{i.e.}, the \textit{Global Polar Module}). During training, all anchors are chosen as candidates, \textit{i.e.}, $K=H^{l}\times W^{l}$, because this aids the \textit{Global Polar Module} (the next stage) in learning from a diverse range of features, including various negative anchor samples. Conversely, during the evaluation stage, anchors with lower confidence can be excluded, so that $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of the anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
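The selection itself is a plain confidence ranking; a sketch under the conventions above (all anchors kept during training, only the top-$K$ kept at evaluation) is:
\begin{verbatim}
import torch

def select_topk_anchors(f_cls, f_reg, k, training=True):
    """f_cls: (H^l, W^l) local-pole confidences;
    f_reg: (2, H^l, W^l) anchor parameters (theta, r^l)."""
    scores = f_cls.flatten()
    if training:
        k = scores.numel()            # K = H^l * W^l during training
    idx = scores.topk(min(k, scores.numel())).indices
    return f_reg.flatten(1)[:, idx], scores[idx]
\end{verbatim}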
\begin{figure}[t]
\centering
\includegraphics[width=0.89\linewidth]{thesis_figure/detection_head.png}
\caption{The main pipeline of the GPM. It comprises the RoI pooling layer alongside the triplet heads, namely the O2O classification head, the O2M classification head, and the O2M regression head. The predictions $\left\{s_i^g\right\}$ generated by the O2M classification head exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing. Conversely, the O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores $\left\{\tilde{s}_i^g\right\}$ based on the redundant scores $\left\{s_i^g\right\}$ from the O2M classification head.}
\label{gpm}
\end{figure}
\subsection{Global Polar Module}
Similar to the pipeline of Faster R-CNN, the LPM serves as the first stage, generating lane anchor proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to produce the final lane predictions. The GPM takes feature samples from the anchors and outputs the precise locations and confidence scores of the final lane detection results. The overall architecture of the GPM is illustrated in Fig. \ref{gpm}.
\par
\textbf{RoI Pooling Layer.} It is designed to extract sampled features from lane anchors. For ease of sampling, we first convert the radius of each positive lane anchor in the local polar coordinate system, $r_j^l$, to its counterpart in the global polar coordinate system, $r_j^g$, by the following equation
\begin{align}
r^{g}_{j}&=r^{l}_{j}+\left( \boldsymbol{c}^{l}_{j}-\boldsymbol{c}^{g} \right) ^{T}\left[\cos\theta_{j}; \sin\theta_{j} \right], \\
j&=1,2,\cdots,K,\notag
\end{align}
where $\boldsymbol{c}^{l}_{j} \in \mathbb{R}^{2}$ and $\boldsymbol{c}^{g} \in \mathbb{R}^{2}$ represent the Cartesian coordinates of the $j$-th local pole and the global pole, respectively. Note that we keep the angle $\theta_j$ unchanged, since the local and global polar coordinate systems share the same polar axis. Next, feature points are sampled on each lane anchor by
\begin{align}
x_{i,j}&=-y_{i,j}\tan \theta_j +\frac{r^{g}_j}{\cos \theta_j},\label{positions}\\
i&=1,2,\cdots,N_p,\notag
\end{align}
where the y-coordinates $\boldsymbol{y}_{j}^{b}\equiv \{y_{1,j},y_{2,j},\cdots ,y_{N_p,j}\}$ of the $j$-th lane anchor are uniformly sampled vertically from the image, as previously mentioned. The x-coordinates $\boldsymbol{x}_{j}^{b}\equiv \{x_{1,j},x_{2,j},\cdots ,x_{N_p,j}\}$ are then calculated by Eq. \ref{positions}.
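The conversion and the sampling step can be sketched as follows; the tensor shapes follow the notation above, while the function and variable names are ours:
\begin{verbatim}
import torch

def anchor_sample_points(theta, r_local, c_local, c_global, ys):
    """theta, r_local: (K,) anchor parameters; c_local: (K, 2) local poles;
    c_global: (2,) global pole; ys: (N_p,) fixed, uniformly spaced y's."""
    # r^g_j = r^l_j + (c^l_j - c^g)^T [cos(theta_j); sin(theta_j)]
    direction = torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)
    r_global = r_local + ((c_local - c_global) * direction).sum(dim=-1)
    # x_{i,j} = -y_i * tan(theta_j) + r^g_j / cos(theta_j)
    xs = -ys[None, :] * torch.tan(theta)[:, None] \
         + (r_global / torch.cos(theta))[:, None]
    return xs, r_global                           # xs: (K, N_p)
\end{verbatim}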
\par
Given the feature maps $P_1, P_2, P_3$ from the FPN, we can extract the feature vectors corresponding to the positions of the feature points $\{(x_{1,j},y_{1,j}),(x_{2,j},y_{2,j}),\cdots,(x_{N_p,j},y_{N_p,j})\}_{j=1}^{K}$, respectively denoted as $\boldsymbol{F}_{1}, \boldsymbol{F}_{2}, \boldsymbol{F}_{3}\in \mathbb{R} ^{K\times C_f}$. To enhance the representation, similar to \cite{detr}, we employ a weighted sum strategy to combine features from different levels as
\begin{equation}
\boldsymbol{F}^s=\sum_{k=1}^3{\boldsymbol{F}_{k}\otimes \frac{e^{\boldsymbol{w}_{k}}}{\sum_{k'=1}^3{e^{\boldsymbol{w}_{k'}}}}},
\end{equation}
where $\boldsymbol{w}_{k}\in \mathbb{R} ^{N^{l}_{pos}}$ is the learnable aggregation weight. Instead of directly concatenating the three sampled features into $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f\times 3}$, the adaptive summation reduces the feature dimension to $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f}$, one-third of the original. The weighted-sum features are then fed into a fully connected layer to obtain the pooled RoI features of an anchor:
\begin{equation}
\boldsymbol{F}^{roi}\gets FC_{pool}\left( \boldsymbol{F}^s \right), \boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
\end{equation}
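A sketch of this weighted-sum RoI pooling is shown below. For simplicity it uses one learnable scalar weight per FPN level, so it should be read as an illustration of the softmax-weighted fusion rather than the exact parameterization used in the model:
\begin{verbatim}
import torch
import torch.nn as nn

class WeightedRoIPooling(nn.Module):
    """Fuse per-anchor features sampled from P1-P3, then pool with an FC."""
    def __init__(self, n_points=36, c_f=64, d_r=192):
        super().__init__()
        self.level_logits = nn.Parameter(torch.zeros(3))   # one per level
        self.fc_pool = nn.Linear(n_points * c_f, d_r)

    def forward(self, feats):                    # feats: (3, K, N_p, C_f)
        w = torch.softmax(self.level_logits, dim=0)
        f_s = (feats * w[:, None, None, None]).sum(dim=0)   # (K, N_p, C_f)
        return self.fc_pool(f_s.flatten(1))                 # (K, d_r)
\end{verbatim}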
\textbf{Triplet Head.} The triplet head encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{gpm}. In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eradicates duplicate results. Nevertheless, NMS relies on the definition of the geometric distance between detection results, rendering this calculation intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis. To attain optimal non-redundant detection outcomes within an NMS-free paradigm (\textit{i.e.}, end-to-end detection), both the one-to-one and one-to-many paradigms become pivotal during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss} but with subtle variations, we architect the triplet head to achieve an NMS-free paradigm.
To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are constructed using a straightforward architecture featuring two-layer Multi-Layer Perceptrons (MLPs). To facilitate the model's transition to an end-to-end paradigm, we have developed an extended O2O classification head. As illustrated in Fig. \ref{gpm}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i \right\}$ output by the O2O classification head relies upon the confidence $\left\{ s_i \right\}$ output by the O2M classification head.
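A sketch of the two O2M branches as two-layer MLPs is given below; the hidden width and the exact layout of the regression output (x-offsets plus end points) are assumptions, and the O2O classification head is described separately:
\begin{verbatim}
import torch
import torch.nn as nn

class O2MHeads(nn.Module):
    def __init__(self, d_r=192, n_offsets=72, hidden=256):
        super().__init__()
        self.o2m_cls = nn.Sequential(nn.Linear(d_r, hidden), nn.ReLU(True),
                                     nn.Linear(hidden, 1))
        self.o2m_reg = nn.Sequential(nn.Linear(d_r, hidden), nn.ReLU(True),
                                     nn.Linear(hidden, n_offsets + 2))

    def forward(self, f_roi):                 # f_roi: (K, d_r) pooled features
        s_g = self.o2m_cls(f_roi).squeeze(-1) # redundant scores {s_i^g}
        reg = self.o2m_reg(f_roi)             # x-offsets and end points
        return s_g, reg
\end{verbatim}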
\begin{figure}[t]
\centering
\includegraphics[width=0.9\linewidth]{thesis_figure/gnn.png} % replace with your image file name
\caption{The main architecture of the O2O classification head. Each anchor is conceived as a node within the Polar GNN. The interconnecting edges (\textit{i.e.}, the adjacency matrix) are formed through the amalgamation of three distinct matrices: the positive selection matrix $\left\{M_{ij}^{P}\right\}$, the confidence comparison matrix $\left\{M_{ij}^{C}\right\}$ and the geometric prior matrix $\left\{M_{ij}^{G}\right\}$. $\left\{M_{ij}^{P}\right\}$ and $\left\{M_{ij}^{C}\right\}$ are derived from the O2M classification head (the orange box), whereas $\left\{M_{ij}^{G}\right\}$ is constructed in accordance with the positional parameters of the anchors (the black dashed box).}
\label{o2o_cls_head}
\end{figure}
As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture that incorporates a \textit{graph neural network} (GNN) \cite{gnn} with a polar geometric prior, which we refer to as the Polar GNN. The Polar GNN is designed to model the relationship between the features $\boldsymbol{F}_{i}^{roi}$ sampled from different anchors. Based on our previous analysis, the distance between lanes should not only be modeled by explicit geometric properties but also consider implicit contextual semantics such as “double” and “forked” lanes. These types of lanes, despite their tiny geometric differences, should not be removed as redundant predictions. The structural insight of the Polar GNN is derived from Fast NMS \cite{yolact}, which operates without iterative processes. The detailed design can be found in Appendix \ref{NMS_appendix}; here, we focus on elaborating the architecture of the Polar GNN.
In the Polar GNN, each anchor is conceptualized as a node, with the RoI features $\boldsymbol{F}_{i}^{roi}$ serving as the attributes of these nodes. A pivotal component of the GNN is the edge set, represented by the adjacency matrix. This matrix is derived from three submatrices. The first component is the positive selection matrix, denoted as $\boldsymbol{M}^{P}\in\mathbb{R}^{K\times K}$:
\begin{align}
M_{ij}^{P}=\begin{cases}
1, \left( s_i\geqslant \tau ^s\land s_j\geqslant \tau ^s \right) ,\\
0, \text{otherwise},\\
\end{cases}
\end{align}
where $\tau ^s$ signifies the threshold for positive scores in the NMS paradigm. We employ this threshold to selectively retain positive redundant predictions.
The second component is the confidence comparison matrix $\boldsymbol{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
\begin{align}
M_{ij}^{C}=\begin{cases}
1, s_i<s_j \lor \left( s_i=s_j \land i<j \right) ,\\
0, \text{otherwise},\\
\end{cases}
\end{align}
This matrix facilitates the comparison of scores for each pair of anchors.
The third component is the geometric prior matrix, denoted by $\boldsymbol{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as:
\begin{align}
M_{ij}^{G}=\begin{cases}
1,\left| \theta _i-\theta _j \right|<\tau^{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\tau^{r},\\
0, \text{otherwise}.\\
\end{cases}
\label{geometric prior matrix}
\end{align}
This matrix indicates that an edge (\textit{i.e.}, a relationship between two nodes) is considered to exist between two corresponding nodes if the anchors are sufficiently close.
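The three matrices can be assembled directly from the O2M scores and the anchor parameters; a sketch follows, with the threshold values chosen purely for illustration:
\begin{verbatim}
import torch

def polar_gnn_adjacency(s, theta, r_g, tau_s=0.4, tau_theta=0.1, tau_r=50.0):
    """s: (K,) O2M confidences; theta, r_g: (K,) anchor angle and global radius."""
    idx = torch.arange(s.numel(), device=s.device)
    # M^P: both anchors must exceed the positive-score threshold tau^s.
    m_p = (s[:, None] >= tau_s) & (s[None, :] >= tau_s)
    # M^C: anchor j dominates anchor i (higher score, ties broken by index).
    m_c = (s[:, None] < s[None, :]) | \
          ((s[:, None] == s[None, :]) & (idx[:, None] < idx[None, :]))
    # M^G: the two anchors are geometrically close in angle and radius.
    m_g = ((theta[:, None] - theta[None, :]).abs() < tau_theta) & \
          ((r_g[:, None] - r_g[None, :]).abs() < tau_r)
    return m_p, m_c, m_g   # combined with an element-wise AND, as described next
\end{verbatim}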
With the aforementioned three matrices, we can define the overall adjacency matrix as $\boldsymbol{M} = \boldsymbol{M}^{P} \land \boldsymbol{M}^{C} \land \boldsymbol{M}^{G}$, where ``$\land$'' denotes the element-wise ``AND''. Then the relationship between the $i$-th anchor and the $j$-th anchor can be modeled as follows:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( FC_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right) ,\label{edge_layer_1}\\
\boldsymbol{F}_{ij}^{edge}&\gets FC_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -FC_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) ,\label{edge_layer_2}\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+FC_b\left( \boldsymbol{x}_{i}^{b}-\boldsymbol{x}_{j}^{b} \right) ,\label{edge_layer_3}\\
\boldsymbol{D}_{ij}^{edge}&\gets MLP_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .\label{edge_layer_4}
\end{align}
The \textit{Implicit Distance Module} in Fig. \ref{o2o_cls_head} comprises Eqs. (\ref{edge_layer_2})--(\ref{edge_layer_4}) and establishes the relationship between the $i$-th anchor and the $j$-th anchor. Here, $\boldsymbol{D}_{ij}^{edge}\in\mathbb{R}^d$ denotes the implicit semantic distance features from the $i$-th anchor to the $j$-th anchor. Given the semantic distance features for each pair of anchors, we employ a max-pooling layer to aggregate the adjacent node features and update the node attributes, ultimately yielding the final non-redundant scores $\left\{ \tilde{s}_i\right\}$:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{j\in \left\{ j|M_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
\\
@ -320,23 +320,23 @@ Here, $\varDelta \boldsymbol{x}_{ij}^{b}$ denotes the difference between the x-a
\begin{align}
\mathcal{C} _{ij}=s_i\times \left( GIoU_{lane} \right) ^{\beta}.
\end{align}
This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet}, taking both location and confidence into account, with $\beta$ serving as the trade-off hyperparameter between the two. We have redefined the IoU function between lane instances, $GIoU_{lane}$, which differs slightly from previous work \cite{clrernet}. More details about $GIoU_{lane}$ can be found in Appendix \ref{giou_appendix}.
We use SimOTA \cite{yolox} with dynamic-$k=4$ (one-to-many assignment) for the O2M classification head and the O2M regression head, while the Hungarian algorithm \cite{detr} (one-to-one assignment) is employed for the O2O classification head.
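The sketch below illustrates the cost computation and a greatly simplified stand-in for the SimOTA one-to-many assignment, reduced here to taking the top-$k$ candidates per ground truth; $GIoU_{lane}$ is assumed to be computed elsewhere, as defined in the appendix:
\begin{verbatim}
import torch

def assignment_cost(scores, giou_lane, beta=6.0):
    """scores: (K,) O2M confidences; giou_lane: (K, G) GLaneIoU between
    every prediction and every ground-truth lane."""
    return scores[:, None] * giou_lane.clamp(min=0).pow(beta)  # higher = better

def one_to_many_assign(cost, k=4):
    """Simplified stand-in for SimOTA: each GT keeps its k best candidates."""
    assign = torch.zeros_like(cost, dtype=torch.bool)
    topk = cost.topk(min(k, cost.shape[0]), dim=0).indices     # (k, G)
    assign[topk, torch.arange(cost.shape[1])] = True
    return assign
\end{verbatim}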
\textbf{Loss Function.}
We utilize focal loss \cite{focal} for both the O2M and O2O classification heads, denoted as $\mathcal{L}^{o2m}_{cls}$ and $\mathcal{L}^{o2o}_{cls}$, respectively. The set of candidate samples involved in the computation of $\mathcal{L}^{o2o}_{cls}$, denoted as $\varOmega ^{pos}_{o2o}$ and $\varOmega ^{neg}_{o2o}$ for the positive and negative target sets, is confined to the positive sample set of the O2M classification head:
\begin{align}
\varOmega _{o2o}^{pos}\cup \varOmega _{o2o}^{neg}=\left\{ i\mid s_i>\tau ^s \right\} .
\end{align}
In essence, certain samples with lower O2M scores are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$. Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head.
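A sketch of this candidate restriction is given below (the rank loss is omitted); the focal-loss call and the normalization by the number of positives are assumptions about details not spelled out above:
\begin{verbatim}
import torch
from torchvision.ops import sigmoid_focal_loss

def o2o_cls_loss(s_o2m, o2o_logits, o2o_pos_mask, tau_s=0.4):
    """s_o2m: (K,) O2M scores; o2o_logits: (K,) O2O head logits;
    o2o_pos_mask: (K,) bool targets from the one-to-one assignment."""
    cand = s_o2m > tau_s                    # Omega^pos_o2o U Omega^neg_o2o
    loss = sigmoid_focal_loss(o2o_logits[cand],
                              o2o_pos_mask[cand].float(),
                              reduction='sum')
    return loss / o2o_pos_mask.sum().clamp(min=1)
\end{verbatim}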
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/auxloss.png}
\caption{Auxiliary loss for segment parameter regression. The ground truth of a lane curve is partitioned into several segments, with the parameters of each segment denoted as $\left( \hat{\theta}_{i,\cdot}^{seg},\hat{r}_{i,\cdot}^{seg} \right)$. The model outputs the parameter offsets $\left( \varDelta \theta _{j,\cdot},\varDelta r_{j,\cdot}^{g} \right)$ to regress from the original anchor to each target line segment.}
\label{auxloss}
\end{figure}
We directly apply the redefined GLaneIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offsets of the x-axis coordinates of the sampled points, and the $Smooth_{L1}$ loss, denoted as $\mathcal{L}_{end}$, for the regression of the end points of lanes. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{\mathrm{aux}}$ depicted in Fig. \ref{auxloss}. The anchors and ground truths are segmented into several divisions, and each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the anchors in acquiring a deeper comprehension of the global geometric form.
The final loss functions for the GPM are given as follows:
\begin{align}
@ -346,11 +346,11 @@ The final loss functions for GPM are given as follows:
\end{align}
% \begin{align}
% \mathcal{L}_{aux} &= \frac{1}{\left| \varOmega^{pos}_{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\
% &\quad + l \left( r_{i}^{g} - \hat{r}_{i}^{seg,m} \right) \Bigg].
% \end{align}
\subsection{The Overall Loss Function} The entire training process is orchestrated in an end-to-end manner, wherein both the LPM and the GPM are trained concurrently. The overall loss function is delineated as follows:
\begin{align}
\mathcal{L} =\mathcal{L} _{cls}^{l}+\mathcal{L} _{reg}^{l}+\mathcal{L} _{cls}^{gpm}+\mathcal{L} _{reg}^{gpm}.
\end{align}
@ -385,7 +385,7 @@ For Tusimple, the evaluation is formulated as follows:
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the False Positive Rate (FPR $=1-$ precision) and the False Negative Rate (FNR $=1-$ recall).
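For reference, the metric computation can be summarized by the following sketch, which simply mirrors the formulas above:
\begin{verbatim}
def tusimple_accuracy(correct_pts, gt_pts):
    """correct_pts / gt_pts: per-clip counts of predicted points within
    20 pixels of the ground truth, and of ground-truth points."""
    acc = sum(correct_pts) / max(sum(gt_pts), 1)
    return acc      # a prediction is judged correct when acc > 0.85

def tusimple_fpr_fnr(tp, fp, fn):
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 1.0 - precision, 1.0 - recall    # FPR, FNR
\end{verbatim}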
\subsection{Implementation Details}
All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficient of the cost function, $\beta$, is set to 6. The training process (including the LPM and the GPM) is end-to-end in one step, as in \cite{adnet}\cite{srlane}. All experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}. Other details of the datasets and the training process can be found in Appendix \ref{vis_appendix}.
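For convenience, the training hyperparameters stated above can be collected into a single configuration sketch; the field names are ours and do not follow any actual configuration schema:
\begin{verbatim}
train_cfg = dict(
    input_size=(800, 320),        # cropped and resized input (W x H)
    augmentations=["random_affine", "random_horizontal_flip"],
    optimizer="AdamW",            # with warm-up and cosine decay
    base_lr=0.006,
    num_sample_points=36,         # feature points sampled per anchor
    num_regression_points=72,     # regressed points per lane
    cost_beta=6,                  # power coefficient of the cost function
    backbones=["ResNet", "DLA34"],
)
\end{verbatim}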
\begin{table*}[htbp]
@ -553,9 +553,9 @@ All input images are cropped and resized to $800\times320$. Similar to \cite{clr
\end{table}
\subsection{Comparison with State-of-the-Art Methods}
The comparison results of our proposed model with other methods are shown in Tables \ref{culane result}, \ref{tusimple result}, \ref{llamas result}, \ref{dlrail result}, and \ref{curvelanes result}. We present results for two versions of our model: the NMS-based version, denoted as Polar R-CNN-NMS, and the NMS-free version, denoted as Polar R-CNN. The NMS-based version utilizes the predictions $\left\{s_i^g\right\}$ obtained from the O2M head followed by NMS post-processing, while the NMS-free version derives the predictions $\left\{\tilde{s}_i^g\right\}$ directly from the O2O classification head without NMS.
To ensure a fair comparison, we also include results for CLRerNet \cite{clrernet} on the CULane and CurveLanes datasets, as we use a similar training strategy and dataset splits. As illustrated in the comparison results, our model demonstrates competitive performance across five datasets. Specifically, on the CULane, TuSimple, LLAMAS, and DL-Rail datasets, which feature sparse scenarios, our model outperforms other anchor-based methods. Additionally, the performance of the NMS-free version is nearly identical to that of the NMS-based version, highlighting the effectiveness of the O2O head in eliminating redundant predictions. On the CurveLanes dataset, the NMS-free version achieves superior F1-measure and Recall compared to both NMS-based and segment\&grid-based methods.
We also compare the number of anchors and processing speed with other methods. Fig. \ref{anchor_num_method} illustrates the number of anchors used by several anchor-based methods on CULane. Our proposed model utilizes the fewest proposal anchors (20 anchors) while achieving the highest F1-score on CULane. It remains competitive with state-of-the-art methods like CLRerNet, which uses 192 anchors and a cross-layer refinement strategy. Conversely, the sparse Laneformer, which also uses 20 anchors, does not achieve optimal performance. It is important to note that our model is designed with a simpler structure without additional refinement, indicating that the design of flexible anchors is crucial for performance in sparse scenarios. Furthermore, due to its simple structure and fewer anchors, our model exhibits lower latency compared to most methods, as shown in Fig. \ref{speed_method}. The combination of fast processing speed and a straightforward architecture makes our model highly deployable.
@ -567,11 +567,11 @@ We also compare the number of anchors and processing speed with other methods. F
\end{figure}
\subsection{Ablation Study}
To validate and analyze the effectiveness and influence of the different components of Polar R-CNN, we conduct several ablation studies on the CULane and CurveLanes datasets.
\textbf{Ablation study on polar coordinate system and anchor number.} To assess the importance of local polar coordinates of anchors, we examine the contribution of each component (i.e., angle and radius) to model performance. As shown in Table \ref{aba_lph}, both angle and radius contribute to performance to varying degrees. Additionally, we conduct experiments with auxiliary loss using fixed anchors and Polar R-CNN. Fixed anchors refer to using anchor settings trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b). Model performance improves by 0.48\% and 0.3\% under the fixed anchor paradigm and proposal anchor paradigm, respectively.
We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves with increasing local polar map size and tends to stabilize when the size is sufficiently large. Specifically, precision improves, while recall decreases. A larger polar map size includes more background anchors in the second stage (since we choose dynamic-$k=4$ for SimOTA, with no more than four positive samples for each ground truth). Consequently, the model learns more negative samples, enhancing precision but reducing recall. Regarding the number of anchors chosen during the evaluation stage, recall and F1 measure show a significant increase in the early stages of anchor number expansion but stabilize in later stages. This suggests that eliminating some anchors does not significantly affect performance. Fig. \ref{cam} displays the heat map and the distribution of the top-$K$ selected anchors in sparse scenarios. Brighter colors indicate a higher likelihood of anchors being foreground anchors. It is evident that most of the proposed anchors are clustered around the lane ground truth.
\begin{figure}[t]
@ -620,7 +620,7 @@ We also explore the effect of different local polar map sizes on our model, as i
\begin{subfigure}{\subwidth}
\includegraphics[width=\imgwidth]{thesis_figure/anchor_num/anchor_num_testing.png}
\end{subfigure}
\caption{F1@50 performance for different polar map sizes and different top-$K$ anchor selections on the CULane test set.}
\label{anchor_num_testing}
\end{figure*}
@ -649,7 +649,7 @@ We also explore the effect of different local polar map sizes on our model, as i
\includegraphics[width=\imgwidth, height=\imgheight]{thesis_figure/heatmap/anchor2.jpg}
\caption{}
\end{subfigure}
\caption{(a) and (c): The heat map of the local polar map; (b) and (d): The final anchor selection during the evaluation stage.}
\label{cam}
\end{figure}
@ -746,7 +746,7 @@ We also explore the stop-gradient strategy for the O2O classification head. As s
\textbf{Ablation study on NMS-free block in dense scenarios.} Despite demonstrating the feasibility of replacing NMS with the O2O classification head in sparse scenarios, the shortcomings of NMS in dense scenarios remain. To investigate the performance of the NMS-free block in dense scenarios, we conduct experiments on the CurveLanes dataset, as detailed in Table \ref{aba_NMS_dense}.
In the traditional NMS post-processing \cite{clrernet}, the default IoU threshold is set to 50 pixels. However, this default setting may not always be optimal, especially in dense scenarios where some lane predictions might be erroneously eliminated. Lowering the IoU threshold increases recall but decreases precision. To find the most effective IoU threshold, we experimented with various values and found that a threshold of 15 pixels achieves the best trade-off, resulting in an F1-score of 86.81\%. In contrast, the NMS-free paradigm with the GNN-based O2O classification head achieves an overall F1-score of 87.29\%, which is 0.48\% higher than the optimal threshold setting in the NMS paradigm. Additionally, both precision and recall are improved under the NMS-free approach. This indicates that the O2O classification head with the Polar GNN is capable of learning both explicit geometric distances and implicit semantic distances between anchors, thus providing a more effective solution for dense scenarios than traditional NMS post-processing.
\begin{table}[h]
\centering
@ -840,14 +840,14 @@ We decided to choose the Fast NMS \cite{yolact} as the inspiration of the design
The positive confidences obtained from the O2M classification head, $s_i$;\\
The positive regressions obtained from the O2M regression head, \textit{i.e.}, the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and the end point locations $\boldsymbol{e}_{i}$.\\
\ENSURE ~~\\ % Output of the algorithm
\STATE Select the positive candidates by $\boldsymbol{M}^{P}\in\mathbb{R}^{K\times K}$:
\begin{align}
M_{ij}^{P}=\begin{cases}
1, \left( s_i\geqslant \tau ^s\land s_j\geqslant \tau ^s \right) ,\\
0, \text{otherwise},\\
\end{cases}
\end{align}
\STATE Calculate the confidence comparison matrix $\boldsymbol{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
\begin{align}
M_{ij}^{C}=\begin{cases}
1, s_i<s_j \lor \left( s_i=s_j \land i<j \right) ,\\
0, \text{otherwise},\\
\end{cases}
\label{confidential matrix}
\end{align}
where $\land$ denotes the element-wise logical ``AND'' operation between two Boolean values/tensors.
\STATE Calculate the geometric prior matrix $\boldsymbol{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align} \begin{align}
M_{ij}^{G}=\begin{cases} M_{ij}^{G}=\begin{cases}
1,\left| \theta _i-\theta _j \right|<\tau_{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\tau_{r}\\ 1,\left| \theta _i-\theta _j \right|<\tau^{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\tau^{r}\\
0, others.\\ 0, others.\\
\end{cases} \end{cases}
\label{geometric prior matrix} \label{geometric prior matrix}
@ -870,10 +870,10 @@ We decided to choose the Fast NMS \cite{yolact} as the inspiration of the design
\label{al_1-3} \label{al_1-3}
\end{align} \end{align}
where $d\left(\cdot, \cdot, \cdot, \cdot \right)$ is a predefined function that quantifies the distance between two lane predictions.
\STATE Define the adjacency matrix $\boldsymbol{M} = \boldsymbol{M}^{P} \land \boldsymbol{M}^{C} \land \boldsymbol{M}^{G}$; the final confidence $\tilde{s}_i$ is calculated as follows:
\begin{align}
\tilde{s}_i = \begin{cases}
	1, & \text{if } \underset{j \in \{ j \mid M_{ij} = 1 \}}{\max} D_{ij} < \tau^g, \\
	0, & \text{otherwise}.
\end{cases}
\label{al_1-4}
\end{align}
\end{algorithmic}
\end{algorithm}
The new algorithm is formatted differently from the original one \cite{yolact}, but it is easy to show that, when all elements in $\boldsymbol{M}$ are set to ``true'' (i.e., regardless of geometric priors), Algorithm \ref{Graph Fast NMS} is equivalent to Fast NMS. Building upon the proposed Graph-based Fast NMS, we design the structure of the one-to-one classification head so that it mirrors its principles.
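For illustration, a minimal NumPy-style sketch of the suppression logic in Algorithm \ref{Graph Fast NMS} is given below; the pairwise matrix \texttt{dist} is assumed to be precomputed with the distance function $d(\cdot ,\cdot ,\cdot ,\cdot )$, and \texttt{tau\_s}, \texttt{tau\_theta}, \texttt{tau\_r} and \texttt{tau\_g} correspond to $\tau^{s}$, $\tau^{\theta}$, $\tau^{r}$ and $\tau^{g}$. It is a conceptual sketch rather than our exact implementation.
\begin{verbatim}
import numpy as np

def graph_fast_nms(s, theta, r_g, dist, tau_s, tau_theta, tau_r, tau_g):
    # s: (K,) positive confidences from the O2M classification head
    # theta, r_g: (K,) polar parameters of the anchors
    # dist: (K, K) precomputed pairwise values D_ij (Eq. al_1-3)
    K = s.shape[0]
    idx = np.arange(K)
    # M^P: both candidates exceed the confidence threshold
    pos = s >= tau_s
    m_p = pos[:, None] & pos[None, :]
    # M^C: candidate j dominates candidate i (higher score, ties broken by index)
    m_c = (s[:, None] < s[None, :]) | \
          ((s[:, None] == s[None, :]) & (idx[:, None] < idx[None, :]))
    # M^G: geometric prior, anchors are close in the polar parameter space
    m_g = (np.abs(theta[:, None] - theta[None, :]) < tau_theta) & \
          (np.abs(r_g[:, None] - r_g[None, :]) < tau_r)
    # adjacency matrix M = M^P AND M^C AND M^G
    m = m_p & m_c & m_g
    # keep i only if max_{j : M_ij = 1} D_ij < tau_g (Eq. al_1-4)
    masked = np.where(m, dist, -np.inf)
    return masked.max(axis=1) < tau_g
\end{verbatim}
Setting all entries of \texttt{m} to true recovers the behaviour of Fast NMS, consistent with the equivalence noted above.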
The fundamental shortcomings of NMS are the purely geometric definition of distance (\textit{i.e.}, Eq. \ref{al_1-3}) and the fixed threshold $\tau^g$ used to remove redundant predictions (\textit{i.e.}, Eq. \ref{al_1-4}). We therefore replace these two steps with trainable neural networks.
To help the model learn a distance that encodes both explicit geometric information and implicit semantic information, the block replacing Eq. \ref{al_1-3} is expressed as:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( FC_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right) ,\\
\boldsymbol{F}_{ij}^{edge}&\gets FC_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -FC_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) ,\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+FC_b\left( \boldsymbol{x}_{i}^{b}-\boldsymbol{x}_{j}^{b} \right) ,\\
\boldsymbol{D}_{ij}^{edge}&\gets MLP_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .
\label{edge_layer_appendix}
\end{align}
where the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor with dimension $d_{dis}$. The replacement of Eq. \ref{al_1-4} is constructed as follows:
@ -906,7 +907,7 @@ where the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar bu
\label{node_layer_appendix}
\end{align}
In this expression, we use element-wise max pooling of tensors instead of scalar-based max operations. By eliminating the need for a predefined distance threshold $\tau^g$, the implicit decision surface is learned directly from the data by the neural network.
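For concreteness, the following PyTorch-style sketch shows one possible realization of the edge block in Eq. \ref{edge_layer_appendix} together with the adjacency-masked element-wise max pooling described above. The hidden widths, the two-dimensional base-point offsets, and the final node-level layer mapping the pooled tensor to $\tilde{s}_i$ are illustrative assumptions rather than the exact configuration of our implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class O2OEdgeNodeSketch(nn.Module):
    def __init__(self, c_roi=64, c_hid=64, d_dis=16):
        super().__init__()
        self.fc_roi = nn.Linear(c_roi, c_hid)   # FC_{o2o}^{roi}
        self.fc_in = nn.Linear(c_hid, c_hid)    # FC_{in}
        self.fc_out = nn.Linear(c_hid, c_hid)   # FC_{out}
        self.fc_b = nn.Linear(2, c_hid)         # FC_b on base point offsets (2-D assumed)
        self.mlp_edge = nn.Sequential(          # MLP_{edge}
            nn.Linear(c_hid, c_hid), nn.ReLU(), nn.Linear(c_hid, d_dis))
        self.node_head = nn.Sequential(         # assumed head producing the final score
            nn.Linear(d_dis, d_dis), nn.ReLU(), nn.Linear(d_dis, 1))

    def forward(self, f_roi, x_b, adj):
        # f_roi: (K, c_roi) ROI features; x_b: (K, 2) base points;
        # adj: (K, K) boolean adjacency matrix M from Graph-based Fast NMS
        f = torch.relu(self.fc_roi(f_roi))                        # tilde F_i^roi
        f_edge = self.fc_in(f)[:, None, :] - self.fc_out(f)[None, :, :]
        f_edge = f_edge + self.fc_b(x_b[:, None, :] - x_b[None, :, :])
        d_edge = self.mlp_edge(f_edge)                            # D_ij^edge
        # element-wise max pooling over adjacent (suppressing) neighbours
        pooled = d_edge.masked_fill(~adj[:, :, None], float('-inf')).amax(dim=1)
        pooled = torch.where(torch.isfinite(pooled), pooled, torch.zeros_like(pooled))
        return torch.sigmoid(self.node_head(pooled)).squeeze(-1)  # confidence per anchor
\end{verbatim}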
\label{NMS_appendix}
@ -943,7 +944,7 @@ In this expresion, we use elementwise max pooling of tensors instead of scalar-b
\midrule
\multirow{4}*{Evaluation Hyperparameter}
& $H^{l}\times W^{l}$ &$4\times10$&$4\times10$&$4\times10$&$4\times10$&$6\times13$\\
& $K$ &20&20&20&12&50\\
& $C_{o2m}$ &0.48&0.40&0.40&0.40&0.45\\
& $C_{o2o}$ &0.46&0.46&0.46&0.46&0.44\\
\bottomrule
@ -952,7 +953,7 @@ In this expresion, we use elementwise max pooling of tensors instead of scalar-b
\label{dataset_info}
\end{table*}
\section{The IoU Definitions for Lane Instances}
\textbf{Label assignment and Cost function.} To make the function more compact and consistent with general object detection works \cite{iouloss}\cite{giouloss}, we redefine the lane IoU. As illustrated in Fig. \ref{glaneiou}, the newly defined lane IoU, which we refer to as GLaneIoU and denote by $GIoU_{lane}$, is given as follows:
\begin{figure}[t]
\centering
@ -973,9 +974,9 @@ In this expresion, we use elementwise max pooling of tensors instead of scalar-b
\end{align}
where $w_{b}$ is the base semi-width of the lane instance. The definitions of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\xi}$ are similar to, but slightly different from, those in \cite{clrnet} and \cite{adnet}, with adjustments made to ensure the values are non-negative. This format is intended to maintain consistency with the IoU definitions used for bounding boxes. Therefore, the overall $GIoU_{lane}$ is given as follows:
\begin{align}
GIoU_{lane}\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}},
\end{align}
where $j$ and $k$ are the indices of the valid points (the start point and the end point). It is straightforward to observe that when $g=0$, $GIoU_{lane}$ corresponds to the IoU for bounding boxes, with a value range of $\left[0, 1 \right]$, and when $g=1$, $GIoU_{lane}$ corresponds to the GIoU \cite{giouloss} for bounding boxes, with a value range of $\left(-1, 1 \right]$. In general, when $g>0$, the value range of $GIoU_{lane}$ is $\left(-g, 1 \right]$. We set $g=0$ for the cost function and the IoU matrix in SimOTA, and $g=1$ for the loss function.
\label{giou_appendix}
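As a minimal numerical sketch, assuming the per-point terms $d_{i}^{\mathcal{O}}$, $d_{i}^{\xi}$ and $d_{i}^{\mathcal{U}}$ over the valid indices have already been computed, $GIoU_{lane}$ can be evaluated as follows:
\begin{verbatim}
import numpy as np

def g_iou_lane(d_o, d_xi, d_u, g=0.0):
    # d_o, d_xi, d_u: per-point d^O, d^xi and d^U terms over the valid points
    d_o, d_xi, d_u = np.asarray(d_o), np.asarray(d_xi), np.asarray(d_u)
    union = d_u.sum()
    return d_o.sum() / union - g * d_xi.sum() / union

# g = 0 gives an IoU-style value in [0, 1] (SimOTA cost and IoU matrix);
# g = 1 gives a GIoU-style value in (-1, 1] (loss function).
\end{verbatim}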
\section{Supplementary Implementation Details and Visualization Results}