\textbf{Loss Function for Training the LPM.} To train the local polar module, we define the ground-truth labels for each local pole as follows: the ground-truth radius, $\hat{r}^l_i$, is the minimum distance from a local pole to the corresponding lane curve, and the ground-truth angle, $\hat{\theta}_i$, is the orientation of the vector extending from the local pole to the nearest point on the curve. A positive pole is labeled one; otherwise, it is labeled zero. Consequently, we obtain a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ otherwise. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, the LPM can be trained using the \textit{smooth-L}1 loss $S_{L1}\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for the LPM are given as follows:
\begin{align}
\mathcal{L} ^{lpm}_{cls}&=BCE\left( F_{cls},F_{gt} \right), \\
\mathcal{L} ^{lpm}_{reg}&=\frac{1}{N^{lpm}_{pos}}\sum_{j\in \left\{j|\hat{r}_j^l<\tau^{l} \right\}}{\left( S_{L1}\left( \theta_j-\hat{\theta}_j \right) +S_{L1}\left( r_j^l-\hat{r}_j^l \right) \right)}, \\
\mathcal{L} ^{lpm} &= \mathcal{L} ^{lpm}_{cls} + w^{lpm}_{reg}\mathcal{L} ^{lpm}_{reg},
\label{loss_lph}
\end{align}
where $N^{lpm}_{pos}=\left|\{j|\hat{r}_j^l<\tau^{l}\}\right|$ is the number of positive local poles in the LPM.
\par
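As a rough illustration (tensor and function names are ours, not from a released implementation, and predictions are assumed to be flattened over the $H^l\times W^l$ grid), the LPM loss can be sketched in PyTorch as follows:
\begin{verbatim}
import torch
import torch.nn.functional as F

def lpm_loss(theta, r, cls_logit, theta_gt, r_gt, cls_gt, tau_l, w_reg=1.0):
    # theta, r, cls_logit: [H_l * W_l] flattened LPM predictions.
    # theta_gt, r_gt: ground-truth angle/radius; cls_gt: float 0/1 labels.
    loss_cls = F.binary_cross_entropy_with_logits(cls_logit, cls_gt)
    pos = r_gt < tau_l                                   # positive local poles
    n_pos = pos.sum().clamp(min=1)
    loss_reg = (F.smooth_l1_loss(theta[pos], theta_gt[pos], reduction='sum')
                + F.smooth_l1_loss(r[pos], r_gt[pos], reduction='sum')) / n_pos
    return loss_cls + w_reg * loss_reg
\end{verbatim}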
\begin{figure}[t]
\centering
\includegraphics[width=0.89\linewidth]{thesis_figure/detection_head.png}
\caption{The main pipeline of GPM. It comprises the RoI pooling layer and the triplet heads (\textit{i.e.}, the O2O classification head, the O2M classification head, and the O2M regression head). The predictions from the O2M classification head $\left\{s_i^g\right\}$ are redundant and require NMS post-processing. The O2O classification head serves as a replacement for NMS, directly outputting the non-redundant predictions (also denoted as $\left\{\tilde{s}_i^g\right\}$) based on the output scores of the O2M classification head.}
\label{gpm}
\end{figure}

\subsection{Global Polar Module}
Similar to the pipeline of Faster R-CNN, the LPM serves as the first stage, generating lane anchor proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve accurate lane prediction. The GPM takes features sampled along the anchors and outputs the precise locations and confidence scores of the final lane detections. The overall architecture of the GPM is illustrated in Fig. \ref{gpm}.
\par
where $\boldsymbol{w}_{k}\in \mathbb{R} ^{N^{lpm}_{pos}}$ is a learnable aggregation weight. Instead of directly concatenating the three sampled features into $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f\times 3}$, the adaptive summation reduces the feature dimension to $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f}$, one-third of the original. The weighted-sum tensor is then fed into a fully connected layer to obtain the pooled RoI features of an anchor:
\begin{equation}
\boldsymbol{F}^{roi}\gets FC_{pool}\left( \boldsymbol{F}^s \right), \quad \boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
\end{equation}
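A minimal sketch of this pooling step is given below, assuming three per-level sampled feature tensors of shape $[N_p, d_f]$ per anchor and per-point aggregation weights; module and dimension names are illustrative only:
\begin{verbatim}
import torch
import torch.nn as nn

class AnchorRoIPooling(nn.Module):
    # Adaptive weighted sum over the three FPN levels followed by an FC layer.
    def __init__(self, n_points=36, d_f=64, d_r=192):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3, n_points) / 3.0)  # learnable weights
        self.fc_pool = nn.Linear(n_points * d_f, d_r)

    def forward(self, feats):                # feats: [3, N_anchor, N_p, d_f]
        w = self.w[:, None, :, None]         # broadcast over anchors and channels
        f_s = (feats * w).sum(dim=0)         # adaptive summation -> [N_anchor, N_p, d_f]
        return self.fc_pool(f_s.flatten(1))  # pooled RoI features -> [N_anchor, d_r]
\end{verbatim}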

\textbf{Triplet Head.} The triplet head comprises three components: the one-to-one classification (O2O cls) head, the one-to-many classification (O2M cls) head, and the one-to-many regression (O2M reg) head, as depicted in Fig. \ref{gpm}. In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head follows the one-to-many paradigm: during training, multiple positive samples are assigned to a single ground truth, so redundant detections are predicted for each instance at evaluation time. These redundancies are conventionally removed with Non-Maximum Suppression (NMS), which eliminates duplicate results. However, NMS relies on a definition of the geometric distance between detections, which is difficult to formulate for curved lanes and other irregular geometric shapes. Moreover, NMS post-processing complicates the trade-off between recall and precision, as highlighted in our previous analysis. To obtain non-redundant detections within an NMS-free (\textit{i.e.}, end-to-end) paradigm, both the one-to-one and one-to-many assignments are pivotal during training, as underscored in \cite{o2o}\cite{}. Drawing inspiration from \cite{} but with subtle variations, we design the triplet head to achieve an NMS-free paradigm.

To ensure both simplicity and efficiency, the O2M regression head and the O2M classification head are built as plain two-layer MLPs. To let the model operate in an end-to-end paradigm, we further design an extended O2O classification head. As illustrated in Fig. \ref{gpm}, the O2O classification head is not independent: the confidences $\left\{ \tilde{s}_i \right\}$ it outputs rely on the confidences $\left\{ s_i \right\}$ output by the O2M classification head.

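A minimal sketch of the two O2M heads is given below (layer widths are assumptions, and the regression output is taken to be the x-offsets plus two end-point terms); the O2O classification head is sketched after its defining equations later in this section:
\begin{verbatim}
import torch
import torch.nn as nn

class O2MHeads(nn.Module):
    # Plain two-layer MLP heads for O2M classification and regression.
    def __init__(self, d_r=192, hidden=256, n_offsets=72):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(d_r, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.reg = nn.Sequential(nn.Linear(d_r, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_offsets + 2))

    def forward(self, f_roi):               # f_roi: [N_anchor, d_r] RoI features
        s = torch.sigmoid(self.cls(f_roi)).squeeze(-1)  # O2M confidences {s_i}
        reg = self.reg(f_roi)               # x-offsets and end-point regressions
        return s, reg
\end{verbatim}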
\begin{figure}[t]
\centering
\label{o2o_cls_head}
\end{figure}

As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture that incorporates a \textit{graph neural network} (GNN) \cite{gnn} with a polar geometric prior, which we refer to as the Polar GNN. The Polar GNN is designed to model the relationships between the features $\boldsymbol{F}_{i}^{roi}$ sampled from different anchors. Based on our previous analysis, the distance between lanes should not only be modeled by explicit geometric properties but should also account for implicit contextual semantics, such as ``double'' and ``forked'' lanes; despite their tiny geometric differences, such lanes should not be removed as redundant predictions. The structural insight of the Polar GNN is derived from Fast NMS \cite{yolact}, which operates without iterative processes. The detailed design can be found in the appendix; here, we focus on the architecture of the Polar GNN.

In the Polar GNN, each anchor is treated as a node, and the RoI features $\boldsymbol{F}_{i}^{roi}$ serve as the node attributes. The other crucial element of the GNN is the edge set, encoded as an adjacency matrix, which we derive from three submatrices. The first is the positive selection matrix $\mathbf{M}^{P}$:
\begin{align}
M_{ij}^{P}=\begin{cases}
	1, & s_i\geqslant \tau _s\land s_j\geqslant \tau _s\\
	0, & \text{otherwise}\\
\end{cases}
\end{align}
where $\tau _s$ denotes the positive-score threshold in the NMS paradigm; we use this threshold directly to select the positive (redundant) predictions.

The second part is the confidence comparison matrix $\mathbf{M}^{C}$, defined as follows:
\begin{align}
M_{ij}^{C}=\begin{cases}
	1, & s_i<s_j\,\lor\, \left( s_i=s_j \land i<j \right)\\
	0, & \text{otherwise}\\
\end{cases}
\label{confidential matrix}
\end{align}
where the scores of each anchor pair are compared. The third part is the geometric prior matrix $\mathbf{M}^{G}$, defined as follows:
\begin{align}
M_{ij}^{G}=\begin{cases}
	1, & \left| \theta _i-\theta _j \right|<\theta _{\tau}\land \left| r_{i}^{global}-r_{j}^{global} \right|<r_{\tau}\\
	0, & \text{otherwise}\\
\end{cases}
\label{geometric prior matrix}
\end{align}
This matrix indicates that an edge is considered to exist between two nodes if the corresponding anchors are close enough in the polar parameter space.
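The three submatrices and the overall adjacency can be formed with simple tensor comparisons, as in the following sketch (variable names are illustrative):
\begin{verbatim}
import torch

def build_adjacency(s, theta, r_global, tau_s, theta_tau, r_tau):
    # s: [N] O2M confidences; theta, r_global: [N] global polar anchor parameters.
    idx = torch.arange(s.numel())
    m_p = (s[:, None] >= tau_s) & (s[None, :] >= tau_s)        # positive selection
    m_c = (s[:, None] < s[None, :]) | \
          ((s[:, None] == s[None, :]) & (idx[:, None] < idx[None, :]))  # confidence
    m_g = ((theta[:, None] - theta[None, :]).abs() < theta_tau) & \
          ((r_global[:, None] - r_global[None, :]).abs() < r_tau)  # geometric prior
    return m_p & m_c & m_g                                     # overall adjacency M
\end{verbatim}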
Rather than relying on a predefined geometric distance, we replace the explicit inverse distance function with an implicit graph neural network that operates on the anchor features $\boldsymbol{F}_{i}^{roi}$. According to information bottleneck theory \cite{alemi2016deep}, $\boldsymbol{F}_{i}^{roi}$, which contains the location and classification information, is sufficient for modeling the explicit geometric distance; beyond that, it carries implicit contextual information about an anchor, providing additional clues for establishing implicit contextual distances between two anchors.

Given the three matrices above, we define the overall adjacency matrix as $\mathbf{M} = \mathbf{M}^{P} \land \mathbf{M}^{C} \land \mathbf{M}^{G}$, where ``$\land$'' denotes the element-wise ``AND''. The relationship between the $i$-th anchor and the $j$-th anchor is then modeled as follows:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{Re}LU\left( FC_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right),
\\
\boldsymbol{F}_{ij}^{edge}&\gets FC_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -FC_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) +FC_{b}\left( \varDelta \boldsymbol{x}_{ij}^{b} \right),
\\
\boldsymbol{D}_{ij}^{edge}&\gets MLP_{edge}\left( \boldsymbol{F}_{ij}^{edge} \right),
\label{edge_layer}
\end{align}
where $\boldsymbol{D}_{ij}^{edge}\in\mathbb{R}^{d}$ denotes the implicit semantic distance features from the $i$-th anchor to the $j$-th anchor. Eq. (\ref{edge_layer}) is the implicit counterpart of the inverse distance in Eq. (\ref{al_1-3}): $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor carrying richer information than a purely geometric distance. Given the semantic distance features of each anchor pair, we use a max-pooling layer to aggregate the adjacent nodes, update the node features, and obtain the final non-redundant scores $\left\{ \tilde{s}_i\right\}$:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{j\in \left\{ j|M_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
\\
\boldsymbol{F}_{i}^{node}&\gets MLP_{node}\left( \boldsymbol{D}_{i}^{node} \right),
\\
\tilde{s}_i&\gets \sigma \left( FC_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
\label{node_layer}
\end{align}

Eq. (\ref{node_layer}) serves as the implicit replacement for Eq. (\ref{al_1-4}) in the appendix. In this approach, we use element-wise max pooling of tensors instead of scalar-based max operations. The pooled tensor is then fed into a neural network with a sigmoid activation function to directly obtain the confidence. By eliminating the need for a predefined distance threshold, all confidence-calculation patterns are learned from the training data.
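A compact PyTorch-style sketch of Eqs. (\ref{edge_layer}) and (\ref{node_layer}) is given below; the layer sizes and the use of 36 sampled x-coordinates for $\varDelta \boldsymbol{x}_{ij}^{b}$ are assumptions rather than the exact released configuration:
\begin{verbatim}
import torch
import torch.nn as nn

class PolarGNNHead(nn.Module):
    # O2O classification head following Eqs. (edge_layer)/(node_layer).
    def __init__(self, d_r=192, d_edge=64, n_points=36):
        super().__init__()
        self.fc_roi = nn.Linear(d_r, d_edge)
        self.fc_in = nn.Linear(d_edge, d_edge)
        self.fc_out = nn.Linear(d_edge, d_edge)
        self.fc_b = nn.Linear(n_points, d_edge)  # embeds base x-coordinate differences
        self.mlp_edge = nn.Sequential(nn.Linear(d_edge, d_edge), nn.ReLU(),
                                      nn.Linear(d_edge, d_edge))
        self.mlp_node = nn.Sequential(nn.Linear(d_edge, d_edge), nn.ReLU(),
                                      nn.Linear(d_edge, d_edge))
        self.fc_cls = nn.Linear(d_edge, 1)

    def forward(self, f_roi, dx_base, M):
        # f_roi: [N, d_r]; dx_base: [N, N, n_points]; M: [N, N] boolean adjacency.
        f = torch.relu(self.fc_roi(f_roi))
        f_edge = self.fc_in(f)[:, None, :] - self.fc_out(f)[None, :, :] \
                 + self.fc_b(dx_base)
        d_edge = self.mlp_edge(f_edge)                        # [N, N, d_edge]
        d_edge = d_edge.masked_fill(~M[..., None], -float('inf'))
        d_node = d_edge.max(dim=1).values                     # element-wise max pooling
        d_node = torch.where(torch.isfinite(d_node), d_node,
                             torch.zeros_like(d_node))        # isolated nodes -> zeros
        return torch.sigmoid(self.fc_cls(self.mlp_node(d_node))).squeeze(-1)
\end{verbatim}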
It should be noted that the O2O classification head depends on the predictions of the O2M classification head, as outlined in Eq. (\ref{al_1-1}). From a probability perspective, the confidence output by the O2M classification head, $s_{j}$, represents the probability that the $j$-th detection is a positive sample, while the confidence output by the O2O classification head, $\tilde{s}_i$, denotes the conditional probability that the $i$-th sample should not be suppressed given that it is identified as a positive sample:
\begin{equation}
\begin{aligned}
&s_j|_{j=1}^{N_a}\equiv P\left( a_j\ \mathrm{is\ positive} \right),
\\
&\tilde{s}_i|_{i=1}^{N_{pos}}\equiv P\left( a_i\ \mathrm{is\ retained}\,|\,a_i\ \mathrm{is\ positive} \right),
\end{aligned}
\label{probablity}
\end{equation}
where $N_a$ equals $H^{l}\times W^{l}$ during the training stage and $K_{a}$ during the testing stage. The overall architecture of the O2O classification head is illustrated in Fig. \ref{o2o_cls_head}.
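One plausible reading of the resulting inference rule, using the thresholds $C_{o2m}$ and $C_{o2o}$ listed in Table \ref{dataset_info}, is sketched below; the exact selection logic may differ in the implementation:
\begin{verbatim}
def select_final_lanes(reg, s, s_tilde, c_o2m=0.48, c_o2o=0.46):
    # Keep a lane if it passes the O2M threshold and the conditional O2O threshold.
    keep = (s > c_o2m) & (s_tilde > c_o2o)
    return reg[keep], s[keep]
\end{verbatim}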
\textbf{Label Assignment and Cost Function.} We use a label assignment scheme (SimOTA) similar to previous works \cite{clrnet}\cite{clrernet}. However, to make the formulation more compact and consistent with general object detection works \cite{iouloss}\cite{giouloss}, we redefine the lane IoU. As illustrated in Fig. \ref{glaneiou}, the newly defined lane IoU, which we refer to as GLaneIoU, is given as follows:
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/GLaneIoU.png}
\caption{Illustration of the GLaneIoU redefined in our work.}
\label{glaneiou}
\end{figure}
\begin{equation}
\begin{aligned}
&w_{i}^{k}=\frac{\sqrt{\left( \Delta x_{i}^{k} \right) ^2+\left( \Delta y_{i}^{k} \right) ^2}}{\Delta y_{i}^{k}}w_{b},
\\
&\hat{d}_{i}^{\mathcal{O}}=\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right),
\\
&\hat{d}_{i}^{\xi}=\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right) -\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right),
\\
&d_{i}^{\mathcal{U}}=\max \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\min \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right),
\\
&d_{i}^{\mathcal{O}}=\max \left( \hat{d}_{i}^{\mathcal{O}},0 \right), \qquad d_{i}^{\xi}=\max \left( \hat{d}_{i}^{\xi},0 \right),
\end{aligned}
\end{equation}
where $w_{b}$ is the base semi-width of the lane instance. The definitions of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\xi}$ are similar to, but slightly different from, those in \cite{clrnet} and \cite{adnet}, with adjustments made to ensure the values are non-negative. This format is intended to maintain consistency with the IoU definitions used for bounding boxes. The overall GLaneIoU is then given as follows:
\begin{equation}
\begin{aligned}
GLaneIoU\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}},
\end{aligned}
\end{equation}
where $j$ and $k$ are the indices of the valid points (the start point and the end point). It is straightforward to observe that when $g=0$, GLaneIoU corresponds to the IoU for bounding boxes, with a value range of $\left[0, 1 \right]$, and when $g=1$, it corresponds to the GIoU \cite{giouloss} for bounding boxes, with a value range of $\left(-1, 1 \right]$. In general, when $g>0$, the value range of GLaneIoU is $\left(-g, 1 \right]$.
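For illustration, GLaneIoU between two lanes sampled on common rows can be computed as follows; this sketch uses a constant semi-width $w_b$ for clarity, whereas the definition above scales the semi-width by the local slope:
\begin{verbatim}
import torch

def glane_iou(xs_p, xs_q, w_b=2.5, g=0.0):
    # xs_p, xs_q: [N] x-coordinates of the valid sample points of the two lanes.
    lo = torch.minimum(xs_p + w_b, xs_q + w_b) - torch.maximum(xs_p - w_b, xs_q - w_b)
    xi = torch.maximum(xs_p - w_b, xs_q - w_b) - torch.minimum(xs_p + w_b, xs_q + w_b)
    un = torch.maximum(xs_p + w_b, xs_q + w_b) - torch.minimum(xs_p - w_b, xs_q - w_b)
    d_o, d_xi = lo.clamp(min=0.0), xi.clamp(min=0.0)   # overlap and gap terms
    return d_o.sum() / un.sum() - g * d_xi.sum() / un.sum()
\end{verbatim}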
We then define the cost function between the $i$-th prediction and the $j$-th ground truth, following \cite{detr}, as:
\begin{equation}
\begin{aligned}
\mathcal{C} _{ij}=\left(s_i\right)^{\beta_c}\times \left( GLaneIoU_{ij, g=0} \right) ^{\beta_r}.
\end{aligned}
\end{equation}

This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet} and takes both location and confidence into account, with $\beta_c$ and $\beta_r$ as trade-off hyperparameters for confidence and location. For label assignment, SimOTA (with $k=4$) \cite{yolox} is used for the two O2M heads (one-to-many assignment), while the Hungarian algorithm \cite{detr} is employed for the O2O classification head (one-to-one assignment).
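A sketch of the resulting one-to-one assignment is given below, treating $\mathcal{C}_{ij}$ as a matching quality to be maximized and using SciPy's Hungarian solver; function names are illustrative:
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def o2o_assign(scores, glane_iou_matrix, beta_c=1.0, beta_r=6.0):
    # scores: [N_pred] O2M confidences; glane_iou_matrix: [N_pred, N_gt] with g=0.
    cost = (scores[:, None] ** beta_c) * np.clip(glane_iou_matrix, 0.0, 1.0) ** beta_r
    pred_idx, gt_idx = linear_sum_assignment(-cost)   # maximize the total cost
    return pred_idx, gt_idx
\end{verbatim}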
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/auxloss.png}
\label{auxloss}
\end{figure}

\textbf{Loss Function.} We use focal loss \cite{focal} for the O2O classification head and the O2M classification head:
\begin{equation}
\begin{aligned}
\mathcal{L} _{o2m}^{cls}&=-\sum_{i\in \varOmega _{pos}^{o2m}}{\alpha _{o2m}\left( 1-s_i \right) ^{\gamma}\log \left( s_i \right)}\\&\quad-\sum_{i\in \varOmega _{neg}^{o2m}}{\left( 1-\alpha _{o2m} \right) \left( s_i \right) ^{\gamma}\log \left( 1-s_i \right)},
\\
\mathcal{L} _{o2o}^{cls}&=-\sum_{i\in \varOmega _{pos}^{o2o}}{\alpha _{o2o}\left( 1-\tilde{s}_i \right) ^{\gamma}\log \left( \tilde{s}_i \right)}\\&\quad-\sum_{i\in \varOmega _{neg}^{o2o}}{\left( 1-\alpha _{o2o} \right) \left( \tilde{s}_i \right) ^{\gamma}\log \left( 1-\tilde{s}_i \right)},
\end{aligned}
\end{equation}
where the one-to-one sample sets, $\varOmega _{pos}^{o2o}$ and $\varOmega _{neg}^{o2o}$, are restricted to the positive candidates of the O2M classification head:
\begin{equation}
\begin{aligned}
\varOmega _{pos}^{o2o}\cup \varOmega _{neg}^{o2o}=\left\{ i|s_i>C_{o2m} \right\}.
\end{aligned}
\end{equation}

Only samples with confidence larger than $C_{o2m}$ are chosen as candidate samples for the O2O classification head. According to \cite{pss}, to maintain feature quality during training, the gradients of the O2O classification head are stopped from propagating back to the rest of the network (detached at the RoI features $\boldsymbol{F}_{i}^{roi}$ of the anchors). Additionally, we use a rank loss to increase the gap between the positive and negative confidences of the O2O classification head:
\begin{equation}
\begin{aligned}
&\mathcal{L} _{rank}=\frac{1}{N_{rank}}\sum_{i\in \varOmega _{pos}^{o2o}}{\sum_{j\in \varOmega _{neg}^{o2o}}{\max \left( 0, \tau _{rank}-\tilde{s}_i+\tilde{s}_j \right)}},\\
&N_{rank}=\left| \varOmega _{pos}^{o2o} \right|\left| \varOmega _{neg}^{o2o} \right|.
\end{aligned}
\end{equation}
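The restriction of the O2O candidate set and the rank loss can be sketched as follows (the stop-gradient on $\boldsymbol{F}_{i}^{roi}$ mentioned above would correspond to a detach before the O2O head; names are illustrative):
\begin{verbatim}
import torch

def rank_loss(s_tilde, s, pos_mask, c_o2m=0.48, tau_rank=0.5):
    # O2O candidates are restricted to O2M positives; pos_mask marks O2O positives.
    cand = s > c_o2m
    pos = s_tilde[cand & pos_mask]
    neg = s_tilde[cand & ~pos_mask]
    if pos.numel() == 0 or neg.numel() == 0:
        return s_tilde.new_zeros(())
    margin = tau_rank - pos[:, None] + neg[None, :]    # pairwise margins
    return margin.clamp(min=0).mean()                  # mean over |pos| * |neg| pairs
\end{verbatim}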
We directly use the GLaneIoU loss, $\mathcal{L}_{GLaneIoU}$ (with $g=1$), to regress the x-coordinate offsets, and the smooth-L1 loss, denoted as $\mathcal{L} _{end}$, for the regression of the end points (namely the y-coordinates of the start and end points). To encourage the model to learn global geometric features, we propose the auxiliary loss illustrated in Fig. \ref{auxloss}:
\begin{align}
\begin{aligned}
\mathcal{L}_{aux} &= \frac{1}{\left| \varOmega_{pos}^{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\
&\quad + l \left( r_{i}^{global} - \hat{r}_{i}^{seg,m} \right) \Bigg],
\end{aligned}
\end{align}
where the anchors and ground truth are divided into several segments, and each anchor segment is regressed to the main components of the corresponding segment of the assigned ground truth. This trick helps the anchors learn more about the global geometric shape.

\subsection{Loss Function}

The overall loss function of Polar R-CNN is given as follows:
\begin{equation}
\begin{aligned}
\mathcal{L}_{overall} &=\mathcal{L} _{lpm}^{cls}+w_{lpm}^{reg}\mathcal{L} _{lpm}^{reg}\\&+w_{o2m}^{cls}\mathcal{L} _{o2m}^{cls}+w_{o2o}^{cls}\mathcal{L} _{o2o}^{cls}+w_{rank}\mathcal{L} _{rank}\\&+w_{IoU}\mathcal{L} _{IoU}+w_{end}\mathcal{L} _{end}+w_{aux}\mathcal{L} _{aux}.
\end{aligned}
\end{equation}
The first line of the loss function represents the loss for the LPM, which includes both classification and regression components. The second line contains the losses of the two classification heads (O2M and O2O), while the third line represents the losses of the regression head within the triplet head. Each term is weighted by a factor that balances its contribution to the gradient. The entire training process is end-to-end in a single step.

\begin{table*}[htbp]
\centering
\caption{Dataset information and hyperparameters for the five datasets. For CULane, $*$ denotes the actual number of training samples used to train our model. Note that labels for some validation/test sets are missing; we therefore select different splits (test or validation set) for different datasets.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{l|l|ccccc}
\toprule
\multicolumn{2}{c|}{\textbf{Dataset}} & CULane & TuSimple & LLAMAS & DL-Rail & CurveLanes \\
\midrule
\multirow{7}*{Dataset Description}
& Train &88,880/$55,698^{*}$&3,268 &58,269&5,435&100,000\\
& Validation &9,675 &358 &20,844&- &20,000 \\
& Test &34,680&2,782 &20,929&1,569&- \\
& Resolution &$1640\times590$&$1280\times720$&$1276\times717$&$1920\times1080$&$2560\times1440$, etc.\\
& Lane &$\leqslant4$&$\leqslant5$&$\leqslant4$&$=2$&$\leqslant10$\\
& Environment &urban and highway & highway&highway&railway&urban and highway\\
& Distribution &sparse&sparse&sparse&sparse&sparse and dense\\
\midrule
\multirow{2}*{Dataset Split}
& Evaluation &Test&Test&Test&Test&Val\\
& Visualization &Test&Test&Val&Test&Val\\
\midrule
\multirow{1}*{Data Preprocessing}
& Crop Height &270&160&300&560&640, etc.\\
\midrule
\multirow{5}*{Training Hyperparameters}
& Epoch Number &32&70&20&90&32\\
& Batch Size &40&24&32&40&40\\
& Warm-up Iterations &800&200&800&400&800\\
& $w_{aux}$ &0.2&0 &0.2&0.2&0.2\\
& $w_{rank}$ &0.7&0.7&0.1&0.7&0 \\
\midrule
\multirow{4}*{Evaluation Hyperparameters}
& $H^{l}\times W^{l}$ &$4\times10$&$4\times10$&$4\times10$&$4\times10$&$6\times13$\\
& $K_{a}$ &20&20&20&12&50\\
& $C_{o2m}$ &0.48&0.40&0.40&0.40&0.45\\
& $C_{o2o}$ &0.46&0.46&0.46&0.46&0.44\\
\bottomrule
\end{tabular}
\end{adjustbox}
\label{dataset_info}
\end{table*}

\section{Experiment}
\subsection{Dataset and Evaluation Metric}
We conducted experiments on four widely used lane detection benchmarks and one rail detection dataset: CULane \cite{scnn}, TuSimple \cite{tusimple}, LLAMAS \cite{llamas}, CurveLanes \cite{curvelanes}, and DL-Rail \cite{dalnet}. Among these datasets, CULane and CurveLanes are particularly challenging. The CULane dataset contains various scenarios but has sparse lane distributions, whereas CurveLanes includes a large number of curved and dense lane types, such as forked and double lanes. The DL-Rail dataset, focused on rail detection across different scenarios, is chosen to evaluate our model's performance beyond traditional lane detection. Details of the five datasets are given in Table \ref{dataset_info}.

We use the F1-score to evaluate our model on the CULane, LLAMAS, DL-Rail, and CurveLanes datasets, maintaining consistency with previous works. The F1-score is defined as follows:
\begin{align}
F1&=\frac{2\times Precision\times Recall}{Precision+Recall},
\\
Precision&=\frac{TP}{TP+FP},
\\
Recall&=\frac{TP}{TP+FN}.
\end{align}
In our experiments, we use different IoU thresholds to calculate the F1-score for different datasets: F1@50 and F1@75 for CULane \cite{clrnet}, F1@50 for LLAMAS \cite{clrnet} and CurveLanes \cite{CondLaneNet}, and F1@50, F1@75, and mF1 for DL-Rail \cite{dalnet}. The mF1 is defined as:
\begin{align}
mF1=\left( F1@50+F1@55+...+F1@95 \right) /10,
\end{align}
where F1@$k$ denotes the F1-score computed at an IoU threshold of $k\%$.

For TuSimple, the evaluation is formulated as follows:
\begin{align}
Accuracy=\frac{\sum{C_{clip}}}{\sum{S_{clip}}},
\end{align}
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground-truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the false positive rate (FPR $=1-$ Precision) and the false negative rate (FNR $=1-$ Recall).

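For concreteness, the F1-based metrics can be computed from matched counts as in the following sketch (lane matching itself follows the official evaluation tools of each benchmark):
\begin{verbatim}
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)

def mf1(f1_values):
    # f1_values: [F1@50, F1@55, ..., F1@95], i.e. ten IoU thresholds.
    assert len(f1_values) == 10
    return sum(f1_values) / 10.0
\end{verbatim}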
\subsection{Implementation Details}
All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For optimization, we use the AdamW \cite{adam} optimizer with a learning-rate warm-up and a cosine decay strategy; the initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficients of the cost function, $\beta_{c}$ and $\beta_{r}$, are set to 1 and 6, respectively. We set different base semi-widths, denoted as $w_{b}^{assign}$, $w_{b}^{cost}$, and $w_{b}^{loss}$, for label assignment, the cost function, and the loss function, respectively, as demonstrated in previous work \cite{clrernet}. The training process is end-to-end in a single step. Other parameters, such as the batch size and loss weights for each dataset, are detailed in Table \ref{dataset_info}. Since some test/validation sets of the five datasets are not accessible, the splits used are also listed in Table \ref{dataset_info}. All experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}.

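A sketch of the optimization setup described above, assuming the schedulers are stepped per iteration (the exact composition of warm-up and decay is an assumption):
\begin{verbatim}
import torch

def build_optimizer(model, total_iters, warmup_iters=800, lr=6e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=1e-3,
                                               total_iters=warmup_iters)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=total_iters - warmup_iters)
    sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine],
                                                  milestones=[warmup_iters])
    return opt, sched
\end{verbatim}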
We also compare the number of anchors and the processing speed with other methods. Fig. \ref{anchor_num_method} illustrates the number of anchors used by several anchor-based methods on CULane. Our proposed model utilizes the fewest proposal anchors (20 anchors) while achieving the highest F1-score on CULane, and it remains competitive with state-of-the-art methods such as CLRerNet, which uses 192 anchors and a cross-layer refinement strategy. Conversely, Sparse Laneformer, which also uses 20 anchors, does not achieve optimal performance. It is important to note that our model is designed with a simpler structure and without additional refinement, indicating that the design of flexible anchors is crucial for performance in sparse scenarios. Furthermore, due to its simple structure and fewer anchors, our model exhibits lower latency than most methods, as shown in Fig. \ref{speed_method}. The combination of fast processing speed and a straightforward architecture makes our model highly deployable.
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/anchor_num_method.png}
\label{anchor_num_method}
\end{figure}

\subsection{Ablation Study and Visualization}
To validate and analyze the effectiveness and influence of the different components of Polar R-CNN, we conduct several ablation experiments on the CULane and CurveLanes datasets.

\textbf{Ablation study on the polar coordinate system and anchor number.} To assess the importance of the local polar coordinates of anchors, we examine the contribution of each component (i.e., angle and radius) to model performance. As shown in Table \ref{aba_lph}, both angle and radius contribute to performance to varying degrees. Additionally, we conduct experiments with the auxiliary loss using fixed anchors and Polar R-CNN, where fixed anchors refer to the anchor settings trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b). Model performance improves by 0.48\% and 0.3\% under the fixed-anchor paradigm and the proposal-anchor paradigm, respectively.

We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves as the local polar map size increases and tends to stabilize when the size is sufficiently large. Specifically, precision improves while recall decreases: a larger polar map includes more background anchors in the second stage (since we choose $k=4$ for SimOTA, with no more than four positive samples for each ground truth), so the model learns more negative samples, enhancing precision but reducing recall. Regarding the number of anchors chosen during the evaluation stage, recall and the F1 measure increase significantly in the early stages of anchor-number expansion but stabilize later, suggesting that eliminating some anchors does not significantly affect performance. Fig. \ref{cam} displays the heat map and the distribution of the top-$K_{a}$ selected anchors in sparse scenarios; brighter colors indicate a higher likelihood of an anchor being a foreground anchor. It is evident that most of the proposed anchors cluster around the lane ground truth.

\section{Conclusion and Future Work}
In this paper, we propose Polar R-CNN to address two key issues in anchor-based lane detection methods. By incorporating a local and a global polar coordinate system, Polar R-CNN achieves improved performance with fewer anchors. Additionally, the introduction of the O2O classification head with the Polar GNN block allows us to replace traditional NMS post-processing, and the NMS-free paradigm demonstrates superior performance in dense scenarios. Our model is highly flexible, and the number of anchors can be adjusted based on the specific scenario; users may choose either the O2M classification head with NMS post-processing or the O2O classification head for an NMS-free approach. Polar R-CNN is also deployment-friendly due to its simple structure, making it a potential new baseline for lane detection. Future work could explore incorporating new structures, such as large kernels or attention mechanisms, and experimenting with new label assignment, training, and anchor sampling strategies. We also plan to extend Polar R-CNN to video instance lane detection and 3D lane detection, utilizing advanced geometric modeling for these new tasks.
\bibliographystyle{IEEEtran}
\bibliography{reference}

\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/wsq.jpg}}]{Shengqi Wang}
received the master's degree from Xi'an Jiaotong University, Xi'an, China, in 2022. He is currently pursuing the Ph.D. degree in statistics at Xi'an Jiaotong University. His research interests include low-level computer vision and deep learning.
\end{IEEEbiography}

\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/ljm.pdf}}]{Junmin Liu}
was born in 1982. He received the Ph.D. degree in mathematics from Xi'an Jiaotong University, Xi'an, China, in 2013. From 2011 to 2012, he served as a Research Assistant with the Department of Geography and Resource Management at the Chinese University of Hong Kong, Hong Kong, China. From 2014 to 2017, he worked as a Visiting Scholar at the University of Maryland, College Park, USA. He is currently a full Professor at the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China. His research interests are mainly focused on the theory and application of machine learning and image processing. He has published over 60 research papers in international conferences and journals.
\end{IEEEbiography}

\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/xiangyongcao.jpg}}]{Xiangyong Cao (Member, IEEE)}
received the B.Sc. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2012 and 2018, respectively. From 2016 to 2017, he was a Visiting Scholar with Columbia University, New York, NY, USA. He is an Associate Professor with the School of Computer Science and Technology, Xi'an Jiaotong University. His research interests include statistical modeling and image processing.
\end{IEEEbiography}
\vfill

\newpage
% Use \appendices when the appendix has multiple sections.
\appendices
\section{NMS vs. NMS-Free}
\textbf{NMS vs NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denotes the ROI features extracted from $i$-th anchors and the three subheads using $\boldsymbol{F}^{roi}_{i}$ as input. For now, let us focus on the O2M classification (O2M cls) head and the O2M regression (O2M reg) head, which follow the old paradigm used in previous work and can serve as a baseline for the new one-to-one paradigm. To maintain simplicity and rigor, both the O2M classification head and the O2M regression head consist of two layers with activation functions, featuring a plain structure without any complex mechanisms such as attention or deformable convolution. as previously mentioned, merely replacing the one-to-many label assignment with one-to-one label assignment is insufficient for eliminating NMS post-processing. This is because anchors often exhibit significant overlap or are positioned very close to each other, as shown in Fig. \ref{anchor setting}(b)\&(c). Let the $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ represent the features from two overlapping (or very close) anchors, implying that $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ will be almost identical. Let $f_{plain}^{cls}$ denotes the neural structure used in O2M classification head and suppose it's trained with one-to-one label assignment. If $\boldsymbol{F}^{roi}_{i}$ is a positive sample and the $\boldsymbol{F}^{roi}_{j}$ is a negative sample, the ideal output should be as follows:
|
||||
\begin{align}
|
||||
&\boldsymbol{F}_{i}^{roi}\approx \boldsymbol{F}_{j}^{roi},
|
||||
\\
|
||||
&f_{cls}^{plain}\left( \boldsymbol{F}_{i}^{roi} \right) \rightarrow 1,
|
||||
\\
|
||||
&f_{cls}^{plain}\left( \boldsymbol{F}_{j}^{roi} \right) \rightarrow 0.
|
||||
\label{sharp fun}
|
||||
\end{align}
|
||||
|
||||
|
||||
|
||||
|
||||
The Eq. (\ref{sharp fun}) suggests that the property of $f_{cls}^{plain}$ need to be ``sharp'' enough to differentiate between two similar features. That is to say, the output of $f_{cls}^{plain}$ changes rapidly over short periods or distances, it implies that $f_{cls}^{plain}$ need to captures information with higher frequency. This issue is also discussed in \cite{o3d}. Capturing the high frequency with a plain structure is difficult because a naive MLP tends to capture information with lower frequency \cite{xu2022overview}. In the most extreme case, where $\boldsymbol{F}_{i}^{roi} = \boldsymbol{F}_{j}^{roi}$, it becomes impossible to distinguish the two anchors to positive and negative samples completely; in practice, both confidences converge to around 0.5. This problem arises from the limitations of the input format and the structure of the naive MLP, which restrict its expressive capability for information with higher frequency. Therefore, it is crucial to establish relationships between anchors and design a new model structure to effectively represent ``sharp'' information.

It is easy to see that an ``ideal'' one-to-one branch is equivalent to the O2M classification branch combined with O2M regression and NMS post-processing. If NMS could be replaced by an equivalent but learnable function (\textit{e.g.}, a neural network with a specific structure), the O2O head could be trained to handle the one-to-one assignment. However, NMS involves sequential iteration and confidence sorting, which are challenging to reproduce with a neural network. Although previous works, such as RNN-based approaches \cite{stewart2016end}, adopt an iterative format, they are time-consuming and introduce additional complexity into training due to their iterative nature. To eliminate the iteration, we propose an equivalent formulation of Fast NMS \cite{yolact}.

\begin{algorithm}[t]
\caption{Graph-based Fast NMS}
\begin{algorithmic}[1] % the option [1] numbers every line
\REQUIRE ~~\\ % Input
The indices of the positive predictions, $1, 2, ..., i, ..., N_{pos}$;\\
The corresponding positive anchors, $[\theta_i, r_{i}^{global}]$;\\
The x-coordinates of the sampling points of the positive anchors, $\boldsymbol{x}_{i}^{b}$;\\
The positive confidences from the O2M classification head, $s_i$;\\
The positive regressions from the O2M regression head, \textit{i.e.}, the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and the end point locations $\boldsymbol{e}_{i}$.\\
\ENSURE ~~\\ % Output
\STATE Calculate the confidence adjacency matrix $\boldsymbol{C} \in \mathbb{R} ^{N_{pos} \times N_{pos}} $, where the element $C_{ij}$ of $\boldsymbol{C}$ is calculated as follows:
\begin{align}
C_{ij}=\begin{cases}
1, & s_i<s_j \,\vee\, \left( s_i=s_j \land i<j \right)\\
0, & \text{otherwise}\\
\end{cases}
\label{al_1-1}
\end{align}
where $\land$ and $\vee$ denote the (element-wise) logical ``AND'' and ``OR'' operations between two Boolean values/tensors.
\STATE Calculate the geometric prior adjacency matrix $\boldsymbol{M} \in \mathbb{R} ^{N_{pos} \times N_{pos}} $, where the element $M_{ij}$ of $\boldsymbol{M}$ is calculated as follows:
\begin{align}
M_{ij}=\begin{cases}
1, & \left| \theta _i-\theta _j \right|<\theta _{\tau}\land \left| r_{i}^{global}-r_{j}^{global} \right|<r_{\tau}\\
0, & \text{otherwise}\\
\end{cases}
\label{al_1-2}
\end{align}
\STATE Calculate the inverse distance matrix $\boldsymbol{D} \in \mathbb{R} ^{N_{pos} \times N_{pos}}$, where the element $D_{ij}$ of $\boldsymbol{D}$ is defined as follows:
\begin{align}
D_{ij} = 1-d\left( \boldsymbol{x}_{i}^{b} + \varDelta \boldsymbol{x}_{i}^{roi}, \boldsymbol{x}_{j}^{b} + \varDelta \boldsymbol{x}_{j}^{roi}, \boldsymbol{e}_{i}, \boldsymbol{e}_{j}\right),
\label{al_1-3}
\end{align}
where $d\left(\cdot, \cdot, \cdot, \cdot \right)$ is a predefined function that quantifies the distance between two lane predictions.
\STATE Define the adjacency matrix $\boldsymbol{T}=\boldsymbol{C}\land\boldsymbol{M}$ and calculate the final confidence $\tilde{s}_i$ as follows:
\begin{align}
\tilde{s}_i = \begin{cases}
1, & \text{if } \underset{j \in \{ j \mid T_{ij} = 1 \}}{\max} D_{ij} < \delta_{\tau} \\
0, & \text{otherwise}
\end{cases}
\label{al_1-4}
\end{align}
\RETURN The final confidence $\tilde{s}_i$.
\end{algorithmic}
\label{Graph Fast NMS}
\end{algorithm}

The key rule of NMS post-processing is as follows: given a series of positive detections with redundancy, a detection result A is suppressed by another detection result B if and only if:

(1) The confidence of A is lower than that of B.

(2) The predefined distance (\textit{e.g.}, IoU distance or L1 distance) between A and B is smaller than a threshold.

(3) B is not suppressed by any other detection result.

For simplicity, Fast NMS only enforces conditions (1) and (2), which may increase false negatives but offers faster processing without sequential iteration. Leveraging this ``iteration-free'' property, we propose a further refinement that is also ``sort-free''. This new approach, named Graph-based Fast NMS, is detailed in Algorithm \ref{Graph Fast NMS}.

It is straightforward to show that, when all elements of $\boldsymbol{M}$ are set to 1 (i.e., the geometric priors are ignored), Graph-based Fast NMS is equivalent to Fast NMS. Building upon the proposed Graph-based Fast NMS, we can design the structure of the one-to-one classification head so that it mirrors the principles of Graph-based Fast NMS.

According to the analysis of the shortcomings of traditional NMS post-processing shown in Fig. \ref{NMS setting}, the fundamental issue arises from the definition of the distance between predictions. Traditional NMS relies on geometric properties to define distances between predictions, which neglects contextual semantics. For example, in some scenarios, two predicted lanes with a small geometric distance should not be suppressed, such as in the case of double lines or fork lines. Although tuning the threshold $\delta_{\tau}$ can mitigate this problem, it is challenging to strike a balance between precision and recall.

To address this, we replace the explicit inverse distance function with an implicit graph neural network. Additionally, the anchor coordinates are replaced with the anchor features $\boldsymbol{F}_{i}^{roi}$. According to the information bottleneck theory \cite{alemi2016deep}, $\boldsymbol{F}_{i}^{roi}$, which contains the location and classification information, is sufficient for a neural network to model the explicit geometric distance. Beyond the geometric information, $\boldsymbol{F}_{i}^{roi}$ also contains implicit contextual information about an anchor, which provides additional cues for establishing implicit contextual distances between two anchors. The implicit contextual distance is calculated as follows:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}\gets& \mathrm{ReLU}\left( FC_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right), \\
\boldsymbol{F}_{ij}^{edge}\gets& FC_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -FC_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) \nonumber\\
&+FC_{base}\left( \boldsymbol{x}_{i}^{b}-\boldsymbol{x}_{j}^{b} \right), \\
\boldsymbol{D}_{ij}^{edge}\gets& MLP_{edge}\left( \boldsymbol{F}_{ij}^{edge} \right).
\label{edge_layer}
\end{align}

Eq. (\ref{edge_layer}) is the implicit counterpart of Eq. (\ref{al_1-3}): the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor of dimension $d_{dis}$, and it contains richer information than the traditional geometric distance. The confidence calculation is expressed as follows:
\begin{align}
&\boldsymbol{D}_{i}^{node}\gets \underset{j\in \left\{ j|T_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge}, \\
&\boldsymbol{F}_{i}^{node}\gets MLP_{node}\left( \boldsymbol{D}_{i}^{node} \right), \\
&\tilde{s}_i\gets \sigma \left( FC_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
\label{node_layer}
\end{align}

Eq. (\ref{node_layer}) serves as the implicit replacement for Eq. (\ref{al_1-4}). In this approach, we use element-wise max pooling of tensors instead of scalar-based max operations. The pooled tensor is then fed into a neural network with a sigmoid activation to directly obtain the confidence. By eliminating the need for a predefined distance threshold, all confidence calculation patterns are learned from the training data.

It should be noted that the O2O classification head depends on the predictions of the O2M classification head, as outlined in Eq. (\ref{al_1-1}). From a probabilistic perspective, the confidence output by the O2M classification head, $s_{j}$, represents the probability that the $j$-th detection is a positive sample, while the confidence output by the O2O classification head, $\tilde{s}_i$, denotes the conditional probability that the $i$-th sample is not suppressed given that it has been identified as a positive sample:
\begin{align}
&s_j|_{j=1}^{N_a}\equiv P\left( a_j\ \mathrm{is\ pos} \right), \\
&\tilde{s}_i|_{i=1}^{N_{pos}}\equiv P\left( a_i\ \mathrm{is\ retained}\mid a_i\ \mathrm{is\ pos} \right),
\label{probablity}
\end{align}
where $N_a$ equals $H^{l}\times W^{l}$ during the training stage and $K_{a}$ during the testing stage. The overall architecture of the O2O classification head is illustrated in Fig. \ref{o2o_cls_head}.

\textbf{Label Assignment and Cost Function.} We use a label assignment (SimOTA) similar to previous works \cite{clrnet}\cite{clrernet}. However, to make the cost function more compact and consistent with general object detection works \cite{iouloss}\cite{giouloss}, we redefine the lane IoU. As illustrated in Fig. \ref{glaneiou}, the newly defined lane IoU, which we refer to as GLaneIoU, is given as follows:
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/GLaneIoU.png}
\caption{Illustration of the GLaneIoU redefined in our work.}
\label{glaneiou}
\end{figure}
\begin{align}
&w_{i}^{k}=\frac{\sqrt{\left( \Delta x_{i}^{k} \right) ^2+\left( \Delta y_{i}^{k} \right) ^2}}{\Delta y_{i}^{k}}w_{b}, \\
&\hat{d}_{i}^{\mathcal{O}}=\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right), \\
&\hat{d}_{i}^{\xi}=\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right) -\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right), \\
&d_{i}^{\mathcal{U}}=\max \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\min \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right), \\
&d_{i}^{\mathcal{O}}=\max \left( \hat{d}_{i}^{\mathcal{O}},0 \right), \qquad d_{i}^{\xi}=\max \left( \hat{d}_{i}^{\xi},0 \right),
\end{align}
where $w_{b}$ is the base semi-width of the lane instance. The definitions of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\xi}$ are similar to, but slightly different from, those in \cite{clrnet} and \cite{adnet}, with adjustments made to ensure the values are non-negative. This format is intended to maintain consistency with the IoU definitions used for bounding boxes. The overall GLaneIoU is then given as follows:
\begin{align}
GLaneIoU\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}},
\end{align}
where $j$ and $k$ are the indices of the valid points (the start point and the end point). It is straightforward to observe that when $g=0$, GLaneIoU corresponds to the IoU \cite{iouloss} for bounding boxes, with a value range of $\left[0, 1 \right]$, whereas when $g=1$, GLaneIoU corresponds to the GIoU \cite{giouloss} for bounding boxes, with a value range of $\left(-1, 1 \right]$. In general, when $g>0$, the value range of GLaneIoU is $\left(-g, 1 \right]$.

We then define the cost function between the $i$-th prediction and the $j$-th ground truth, following \cite{detr}:
\begin{align}
\mathcal{C} _{ij}=\left(s_i\right)^{\beta_c}\times \left( GLaneIoU_{ij, g=0} \right) ^{\beta_r}.
\end{align}

This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet} and takes both location and confidence into account. For label assignment, SimOTA (with $k=4$) \cite{yolox} is used for the two O2M heads with one-to-many assignment, while the Hungarian algorithm \cite{detr} is employed for the O2O classification head with one-to-one assignment.

\textbf{Loss Function.} We use the focal loss \cite{focal} for both the O2O and the O2M classification heads:
\begin{align}
\mathcal{L} _{o2m}^{cls}&=-\sum_{i\in \varOmega _{pos}^{o2m}}{\alpha _{o2m}\left( 1-s_i \right) ^{\gamma}\log \left( s_i \right)} \nonumber\\
&\quad-\sum_{i\in \varOmega _{neg}^{o2m}}{\left( 1-\alpha _{o2m} \right) \left( s_i \right) ^{\gamma}\log \left( 1-s_i \right)}, \\
\mathcal{L} _{o2o}^{cls}&=-\sum_{i\in \varOmega _{pos}^{o2o}}{\alpha _{o2o}\left( 1-\tilde{s}_i \right) ^{\gamma}\log \left( \tilde{s}_i \right)} \nonumber\\
&\quad-\sum_{i\in \varOmega _{neg}^{o2o}}{\left( 1-\alpha _{o2o} \right) \left( \tilde{s}_i \right) ^{\gamma}\log \left( 1-\tilde{s}_i \right)},
\end{align}
where the one-to-one sample sets, $\varOmega _{pos}^{o2o}$ and $\varOmega _{neg}^{o2o}$, are restricted to the samples selected by the O2M classification head:
\begin{align}
\varOmega _{pos}^{o2o}\cup \varOmega _{neg}^{o2o}=\left\{ i|s_i>C_{o2m} \right\}.
\end{align}

Only samples with confidence larger than $C_{o2m}$ are chosen as candidate samples for the O2O classification head. Following \cite{pss}, to maintain feature quality during the training stage, the gradients of the O2O classification head are stopped from propagating back to the rest of the network (\textit{i.e.}, they are detached at the RoI features $\boldsymbol{F}_{i}^{roi}$). Additionally, we use a rank loss to increase the gap between the positive and negative confidences of the O2O classification head:
\begin{align}
&\mathcal{L} _{rank}=\frac{1}{N_{rank}}\sum_{i\in \varOmega _{pos}^{o2o}}{\sum_{j\in \varOmega _{neg}^{o2o}}{\max \left( 0, \tau _{rank}-\tilde{s}_i+\tilde{s}_j \right)}},\\
&N_{rank}=\left| \varOmega _{pos}^{o2o} \right|\left| \varOmega _{neg}^{o2o} \right|.
\end{align}

We directly use the GLaneIoU loss, $\mathcal{L}_{GLaneIoU}$ (with $g=1$), to regress the offsets of the x-coordinates, and the Smooth-L1 loss, denoted as $\mathcal{L} _{end}$, for the regression of the end points (namely, the y-coordinates of the start and end points). To encourage the model to learn global features, we propose the auxiliary loss illustrated in Fig. \ref{auxloss}:
\begin{align}
\mathcal{L}_{aux} &= \frac{1}{\left| \varOmega_{pos}^{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \nonumber\\
&\quad + l \left( r_{i}^{global} - \hat{r}_{i}^{seg,m} \right) \Bigg].
\end{align}

The anchors and ground truths are divided into several segments, and each anchor segment is regressed towards the main components of the corresponding segment of its assigned ground truth. This trick helps the anchors learn more about the global geometric shape.

\subsection{Overall Loss Function}
The overall loss function of Polar R-CNN is given as follows:
\begin{align}
\mathcal{L}_{overall} &=\mathcal{L} ^{lpm}_{cls}+w^{lpm}_{reg}\mathcal{L} ^{lpm}_{reg} \nonumber\\
&+w_{o2m}^{cls}\mathcal{L} _{o2m}^{cls}+w_{o2o}^{cls}\mathcal{L} _{o2o}^{cls}+w_{rank}\mathcal{L} _{rank} \nonumber\\
&+w_{IoU}\mathcal{L} _{GLaneIoU}+w_{end}\mathcal{L} _{end}+w_{aux}\mathcal{L} _{aux}.
\end{align}
The first line of the loss function represents the loss for the LPM, which includes both classification and regression components. The second line pertains to the losses of the two classification heads (O2M and O2O), while the third line represents the losses of the regression head within the triplet head. Each term is weighted by a factor to balance the contributions of the components to the gradient. The entire training process is end-to-end.

\section{Title of the 2nd appendix}
This is the first paragraph of Appx. B ...

\begin{table*}[htbp]
\centering
\caption{Dataset information and hyperparameters for the five datasets. For CULane, $*$ denotes the actual number of training samples used to train our model. Note that labels for some validation/test sets are unavailable; therefore, we select different splits (test or validation set) for different datasets.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{l|l|ccccc}
\toprule
\multicolumn{2}{c|}{\textbf{Dataset}} & CULane & TUSimple & LLAMAS & DL-Rail & CurveLanes \\
\midrule
\multirow{7}*{Dataset Description}
& Train &88,880/$55,698^{*}$&3,268 &58,269&5,435&100,000\\
& Validation &9,675 &358 &20,844&- &20,000 \\
& Test &34,680&2,782 &20,929&1,569&- \\
& Resolution &$1640\times590$&$1280\times720$&$1276\times717$&$1920\times1080$&$2560\times1440$, etc.\\
& Lane &$\leqslant4$&$\leqslant5$&$\leqslant4$&$=2$&$\leqslant10$\\
& Environment &urban and highway & highway&highway&railway&urban and highway\\
& Distribution &sparse&sparse&sparse&sparse&sparse and dense\\
\midrule
\multirow{2}*{Dataset Split}
& Evaluation &Test&Test&Test&Test&Val\\
& Visualization &Test&Test&Val&Test&Val\\
\midrule
\multirow{1}*{Data Preprocess}
& Crop Height &270&160&300&560&640, etc.\\
\midrule
\multirow{5}*{Training Hyperparameter}
& Epoch Number &32&70&20&90&32\\
& Batch Size &40&24&32&40&40\\
& Warm-up Iterations &800&200&800&400&800\\
& $w_{aux}$ &0.2&0 &0.2&0.2&0.2\\
& $w_{rank}$ &0.7&0.7&0.1&0.7&0 \\
\midrule
\multirow{4}*{Evaluation Hyperparameter}
& $H^{l}\times W^{l}$ &$4\times10$&$4\times10$&$4\times10$&$4\times10$&$6\times13$\\
& $K_{a}$ &20&20&20&12&50\\
& $C_{o2m}$ &0.48&0.40&0.40&0.40&0.45\\
& $C_{o2o}$ &0.46&0.46&0.46&0.46&0.44\\
\bottomrule
\end{tabular}
\end{adjustbox}
\label{dataset_info}
\end{table*}

\textbf{Visualization.} We present the Polar R-CNN predictions for both sparse and dense scenarios. Fig. \ref{vis_sparse} displays the predictions for sparse scenarios across four datasets. The LPM effectively proposes anchors clustered around the ground truth, providing a robust prior for the RoI stage to produce the final lane predictions. Moreover, the number of anchors is significantly smaller than in previous works, which in theory makes our method faster than other anchor-based methods. Fig. \ref{vis_dense} shows the predictions for dense scenarios. We observe that NMS@50 mistakenly removes some predictions, leading to false negatives, while NMS@15 fails to eliminate redundant predictions, resulting in false positives. This highlights the trade-off between using a large and a small IoU threshold. The visualization clearly demonstrates that geometric distance becomes less effective in dense scenarios. Only the data-driven O2O classification head can address this issue by capturing semantic distance beyond geometric distance. As shown in Fig. \ref{vis_dense}, the O2O classification head successfully eliminates redundant predictions while retaining dense predictions with small geometric distances.
\section{Conclusion and Future Work}
In this paper, we propose Polar R-CNN to address two key issues in anchor-based lane detection methods. By incorporating local and global polar coordinate systems, Polar R-CNN achieves improved performance with fewer anchors. Additionally, the introduction of the O2O classification head with the Polar GNN block allows us to replace traditional NMS post-processing, and the NMS-free paradigm demonstrates superior performance in dense scenarios. Our model is highly flexible, and the number of anchors can be adjusted to the specific scenario. Users may choose either the O2M classification head with NMS post-processing or the O2O classification head for an NMS-free approach. Polar R-CNN is also deployment-friendly due to its simple structure, making it a potential new baseline for lane detection. Future work could explore new structures, such as large kernels or attention mechanisms, and experiment with new label assignment, training, and anchor sampling strategies. We also plan to extend Polar R-CNN to video instance lane detection and 3D lane detection, utilizing advanced geometric modeling for these new tasks.

\bibliographystyle{IEEEtran}
\bibliography{reference}

\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/wsq.jpg}}]{Shengqi Wang}
received the master's degree from Xi'an Jiaotong University, Xi'an, China, in 2022. He is currently pursuing the Ph.D. degree in statistics at Xi'an Jiaotong University. His research interests include low-level computer vision and deep learning.
\end{IEEEbiography}

\begin{IEEEbiography}[{\includegraphics[width=1in,height=1.25in,clip,keepaspectratio]{thesis_figure/ljm.pdf}}]{Junmin Liu}
was born in 1982. He received the Ph.D. degree in mathematics from Xi'an Jiaotong University, Xi'an, China, in 2013. From 2011 to 2012, he was a Research Assistant with the Department of Geography and Resource Management, The Chinese University of Hong Kong, Hong Kong, China. From 2014 to 2017, he was a Visiting Scholar at the University of Maryland, College Park, MD, USA. He is currently a Full Professor with the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China. His research interests focus on the theory and application of machine learning and image processing. He has published over 60 research papers in international conferences and journals.
\end{IEEEbiography}
\end{document}