update · commit 23c6c540f9 (parent 37a9a10309) · main.tex
@ -269,7 +269,7 @@ To ensure both simplicity and efficiency in our model, the O2M regression head a
\label{o2o_cls_head}
\end{figure}

As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture that incorporates a \textit{graph neural network} \cite{gnn} (GNN) with a polar geometric prior, which we refer to as the Polar GNN. The Polar GNN is designed to model the relationship between the features $\boldsymbol{F}_{i}^{roi}$ sampled from different anchors. Based on our previous analysis, the distance between lanes should be modeled not only by explicit geometric properties but also by implicit contextual semantics such as ``double'' and ``forked'' lanes. These types of lanes, despite their tiny geometric differences, should not be removed as redundant predictions. The structural insight of the Polar GNN is derived from Fast NMS \cite{yolact}, which operates without iterative processes. The detailed design can be found in Appendix \ref{NMS_appendix}; here, we focus on elaborating the architecture of the Polar GNN.

In the Polar GNN, each anchor is conceptualized as a node, with the ROI features $\boldsymbol{F}_{i}^{roi}$ serving as the attributes of these nodes. A pivotal component of the GNN is the edge, represented by the adjacency matrix. This matrix is derived from three submatrices. The first component is the positive selection matrix, denoted as $\mathbf{M}^{P}\in\mathbb{R}^{K\times K}$:
\begin{align}
@ -293,7 +293,7 @@ This matrix facilitates the comparison of scores for each pair of anchors.
The third component is the geometric prior matrix, denoted by $\mathbf{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as:
\begin{align}
M_{ij}^{G}=\begin{cases}
1,\left| \theta _i-\theta _j \right|<\tau_{\theta}\land \left| r_{i}^{global}-r_{j}^{global} \right|<\tau_{r}\\
0, others.\\
\end{cases}
\label{geometric prior matrix}
@ -320,7 +320,7 @@ Here, $\varDelta \boldsymbol{x}_{ij}^{b}$ denotes the difference between the x-a
\begin{align}
\mathcal{C} _{ij}=s_i\times \left( GIoU_{lane} \right) ^{\beta}.
\end{align}
This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet}, taking both location and confidence into account, with $\beta$ serving as the trade-off hyperparameter between location and confidence. We have redefined the lane IoU function, $GIoU_{lane}$, which differs slightly from previous work \cite{clrernet}. More details about $GIoU_{lane}$ can be found in Appendix \ref{giou_appendix}.

We use SimOTA \cite{yolox} with dynamic $k=4$ (one-to-many assignment) for the O2M classification head and the O2M regression head, while the Hungarian algorithm \cite{detr} (one-to-one assignment) is employed for the O2O classification head.
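For illustration, the sketch below (a simplified stand-in, not our released implementation) shows how such an affinity matrix and a one-to-one assignment can be computed with the Hungarian algorithm; the \texttt{lane\_iou} helper, the semi-width, and all numeric values are placeholder assumptions.
\begin{verbatim}
# Hedged sketch: affinity C_ij = s_i * IoU^beta followed by Hungarian
# one-to-one assignment. `lane_iou` is a toy stand-in for GLaneIoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def lane_iou(pred_xs, gt_xs, semi_width=2.5):
    # Column-wise 1-D IoU between two lanes sampled at the same y positions.
    inter = np.minimum(pred_xs + semi_width, gt_xs + semi_width) \
          - np.maximum(pred_xs - semi_width, gt_xs - semi_width)
    union = np.maximum(pred_xs + semi_width, gt_xs + semi_width) \
          - np.minimum(pred_xs - semi_width, gt_xs - semi_width)
    return np.clip(inter, 0.0, None).sum() / union.sum()

def hungarian_o2o_assign(scores, pred_lanes, gt_lanes, beta=6.0):
    # Each ground truth receives exactly one positive prediction.
    affinity = np.zeros((len(pred_lanes), len(gt_lanes)))
    for i, p in enumerate(pred_lanes):
        for j, g in enumerate(gt_lanes):
            affinity[i, j] = scores[i] * lane_iou(p, g) ** beta
    rows, cols = linear_sum_assignment(-affinity)   # maximize total affinity
    return list(zip(rows.tolist(), cols.tolist()))

rng = np.random.default_rng(0)
gts = [np.linspace(200, 400, 72), np.linspace(500, 700, 72)]
preds = [g + rng.normal(0, 3, 72) for g in gts] + [rng.uniform(100, 700, 72)]
print(hungarian_o2o_assign(np.array([0.9, 0.8, 0.4]), preds, gts))
\end{verbatim}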
@ -336,7 +336,7 @@ In essence, certain samples with lower O2M scores are excluded from the computat
\caption{Auxiliary loss for segment parameter regression. The ground truth of a lane curve is partitioned into several segments, with the parameters of each segment denoted as $\left( \hat{\theta}_{i,\cdot}^{seg},\hat{r}_{i,\cdot}^{seg} \right)$. The model outputs the parameter offsets $\left( \varDelta \theta _{j,\cdot},\varDelta r_{j,\cdot}^{g} \right)$ to regress from the original anchor to each target line segment.}
\label{auxloss}
\end{figure}

We directly apply the redefined GLaneIoU loss (see Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offsets of the x-axis coordinates of the sampled points, and the Smooth-L1 loss for the regression of the lane end points, denoted as $\mathcal{L} _{end}$. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L} _{\mathrm{aux}}$ depicted in Fig. \ref{auxloss}. The anchors and the ground truth are segmented into several divisions, and each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach helps the anchors acquire a deeper comprehension of the global geometric form.

The final loss functions for GPM are given as follows:
\begin{align}
@ -385,12 +385,12 @@ For Tusimple, the evaluation is formulated as follows:
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground-truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the False Positive Rate (FPR $=1-$ Precision) and the False Negative Rate (FNR $=1-$ Recall).
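As a rough illustration (not the official evaluation script), the accuracy above can be computed as follows, assuming predictions and ground truth are x-coordinates sampled at the same fixed rows, with \texttt{NaN} marking missing ground-truth points.
\begin{verbatim}
import numpy as np

def tusimple_accuracy(pred_xs, gt_xs, pixel_thresh=20.0, acc_thresh=0.85):
    # accuracy = C_clip / S_clip for one clip; correct if accuracy > 85%.
    valid = ~np.isnan(gt_xs)                                         # S_clip
    correct = np.abs(pred_xs[valid] - gt_xs[valid]) <= pixel_thresh  # C_clip
    acc = correct.sum() / max(valid.sum(), 1)
    return acc, acc >= acc_thresh

gt = np.array([300.0, 310.0, np.nan, 330.0])
pred = np.array([305.0, 340.0, 280.0, 333.0])
print(tusimple_accuracy(pred, gt))   # -> (0.666..., False)
\end{verbatim}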
\subsection{Implementation Details}
All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficients of the cost function, $\beta_{c}$ and $\beta_{r}$, are set to 1 and 6, respectively. We set different base semi-widths, denoted as $w_{b}^{assign}$, $w_{b}^{cost}$ and $w_{b}^{loss}$, for the label assignment, the cost function and the loss function, respectively, as demonstrated in previous work \cite{clrernet}. The training process is end-to-end in a single step, just like \cite{}\cite{}. All experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}. Other details about the datasets and the training process can be found in Appendix \ref{vis_appendix}.
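A minimal sketch of the described optimization setup (AdamW with linear warm-up and cosine decay) is given below; the warm-up length, the total number of iterations, and the stand-in model are assumptions for illustration only.
\begin{verbatim}
import math
import torch

model = torch.nn.Conv2d(3, 64, 3)                 # stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-3)
warmup_iters, total_iters = 100, 1000             # assumed, not from the paper

def lr_lambda(it):
    if it < warmup_iters:                          # linear warm-up
        return it / warmup_iters
    progress = (it - warmup_iters) / (total_iters - warmup_iters)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for it in range(total_iters):
    # loss.backward() on a training batch would go here
    optimizer.step(); optimizer.zero_grad(); scheduler.step()
\end{verbatim}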
\begin{table*}[htbp]
\centering
\caption{Comparison results on CULane test set with other methods.}
\normalsize
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{lrlllllllllll}
@ -451,7 +451,7 @@ All input images are cropped and resized to $800\times320$. Similar to \cite{clr
\begin{table}[h]
\centering
\caption{Comparison results on TuSimple test set with other methods.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{lrcccc}
\toprule
@ -476,7 +476,7 @@ All input images are cropped and resized to $800\times320$. Similar to \cite{clr
\begin{table}[h]
\centering
\caption{Comparison results on LLAMAS test set with other methods.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{lrcccc}
\toprule
@ -503,7 +503,7 @@ All input images are cropped and resized to $800\times320$. Similar to \cite{clr
\begin{table}[h]
\centering
\caption{Comparison results on DL-Rail test set with other methods.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{lrccc}
\toprule
@ -527,7 +527,7 @@ All input images are cropped and resized to $800\times320$. Similar to \cite{clr
\begin{table}[h]
\centering
\caption{Comparison results on CurveLanes validation set with other methods.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{lrcccc}
\toprule
@ -802,64 +802,78 @@ In this paper, we propose Polar R-CNN to address two key issues in anchor-based
\newpage
% Use \appendices when the appendix has multiple sections
\appendices
\setcounter{table}{0} % start numbering from 0 so that appendix tables appear as A1, A2, ...
\setcounter{figure}{0}
\setcounter{section}{0}
\setcounter{equation}{0}
\renewcommand{\thetable}{A\arabic{table}}
\renewcommand{\thefigure}{A\arabic{figure}}
\renewcommand{\thesection}{A\arabic{section}}
\renewcommand{\theequation}{A\arabic{equation}}
\section{The Design Principles of the One-to-One Classification Head}
\label{NMS_appendix}
Two necessary conditions for the NMS-free paradigm are the label assignment strategy and the model structure.

As for the label assignment strategy, previous works use one-to-many label assignments such as SimOTA \cite{}. One-to-many label assignment makes the detection head produce redundant predictions for each ground truth, resulting in the need for NMS post-processing. Thus, some works \cite{detr}\cite{learnNMS} propose one-to-one label assignments such as the Hungarian algorithm, which forces the model to predict a single positive sample for each ground truth.

However, directly adopting one-to-one label assignment damages the learning of the model, and plain structures such as MLPs and CNNs struggle to learn ``one-to-one'' features, causing a performance drop compared to one-to-many label assignment with NMS post-processing \cite{yolov10}\cite{o2o}. Let us take a trivial example. Let $\boldsymbol{F}^{roi}_{i}$ denote the ROI features extracted from the $i$-th anchor, and suppose the model is trained with one-to-one label assignment. If the $i$-th anchor and the $j$-th anchor are both around the ground truth and they nearly overlap with each other:
\begin{align}
\boldsymbol{F}_{i}^{roi}\approx \boldsymbol{F}_{j}^{roi},
\end{align}
which means that the RoI pooling features of the two anchors are similar. Suppose that $\boldsymbol{F}^{roi}_{i}$ is assigned as a positive sample while $\boldsymbol{F}^{roi}_{j}$ is assigned as a negative sample; the ideal output should then be as follows:
\begin{align}
f_{cls}^{plain}\left( \boldsymbol{F}_{i}^{roi} \right) &\rightarrow 1,
\\
f_{cls}^{plain}\left( \boldsymbol{F}_{j}^{roi} \right) &\rightarrow 0,
\label{sharp fun}
\end{align}
where $f_{cls}^{plain}$ denotes a classification head with a plain structure. Eq. (\ref{sharp fun}) suggests that $f_{cls}^{plain}$ needs to be ``sharp'' enough to differentiate between two similar features; that is, its output changes rapidly over short distances, which implies that $f_{cls}^{plain}$ needs to capture information with higher frequency. This issue is also discussed in \cite{o3d}. Capturing such high-frequency patterns with a plain structure is difficult because a naive MLP tends to capture information with lower frequency \cite{xu2022overview}. In the most extreme case, where $\boldsymbol{F}_{i}^{roi} = \boldsymbol{F}_{j}^{roi}$, it becomes impossible to separate the two anchors into positive and negative samples at all; in practice, both confidences converge to around 0.5. This problem arises from the limitations of the input format and the structure of the naive MLP, which restrict its expressive capability for high-frequency information. Therefore, it is crucial to establish relationships between anchors and to design a new model structure, like \cite{o3d}\cite{relationnet}, to effectively represent ``sharp'' information.

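The following toy snippet illustrates the issue: two nearly identical ROI features passed through the same plain MLP necessarily produce nearly identical confidences, so no one-to-one assignment can push one towards 1 and the other towards 0. The network size and noise scale are arbitrary.
\begin{verbatim}
import torch

mlp = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1), torch.nn.Sigmoid())
f_i = torch.randn(64)
f_j = f_i + 1e-3 * torch.randn(64)        # almost overlapping anchors
print(float(mlp(f_i)), float(mlp(f_j)))   # practically equal confidences
\end{verbatim}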
It is easy to see that the ``ideal'' one-to-one branch is equivalent to the O2M classification branch combined with O2M regression and NMS post-processing. If the NMS could be replaced by an equivalent but learnable function (\textit{e.g.}, a neural network with a specific structure), the O2O head could be trained to handle the one-to-one assignment. However, NMS involves sequential iteration and confidence sorting, which are challenging to reproduce with a neural network. Although previous works, such as RNN-based approaches \cite{stewart2016end}, utilize an iterative format, they are time-consuming and introduce additional complexity into the model training process due to their iterative nature. To eliminate the iteration process, we propose an equivalent formulation of Fast NMS \cite{yolact}.

We choose Fast NMS \cite{yolact} as the inspiration for the design of the O2O classification head. Fast NMS is an iteration-free post-processing algorithm that removes redundant predictions. Additionally, we make it sort-free and add geometric priors; the details are shown in Algorithm \ref{Graph Fast NMS}.

\begin{algorithm}[t]
\caption{Fast NMS with Sort-free Paradigm and Geometric Prior.}
\begin{algorithmic}[1] % the option [1] numbers every line
\REQUIRE ~~\\ % Input of the algorithm
The indices of the positive predictions, $1, 2, ..., i, ..., N_{pos}$;\\
The corresponding positive anchors, $\left\{ \theta _i,r_{i}^{g} \right\} |_{i=1}^{K}$;\\
The x-axis coordinates of the sampling points from the positive anchors, $\boldsymbol{x}_{i}^{b}$;\\
The positive confidences from the O2M classification head, $s_i$;\\
The positive regressions from the O2M regression head, i.e., the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and the end point locations $\boldsymbol{e}_{i}$.\\
\ENSURE ~~\\ % Output of the algorithm
\STATE Select the positive candidates by $\mathbf{M}^{P}\in\mathbb{R}^{K\times K}$:
\begin{align}
M_{ij}^{P}=\begin{cases}
1, \left( s_i\geqslant \tau _s\land s_j\geqslant \tau _s \right)\\
0, others,\\
\end{cases}
\label{al_1-1}
\end{align}
\STATE Calculate the confidence comparison matrix $\mathbf{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
\begin{align}
M_{ij}^{C}=\begin{cases}
1, s_i<s_j\,\,\lor \left( s_i=s_j \land i<j \right)\\
0, others.\\
\end{cases}
\label{confidential matrix}
\end{align}
where $\land$ and $\lor$ denote the (element-wise) logical ``AND'' and ``OR'' operations between two Boolean values/tensors.
\STATE Calculate the geometric prior matrix $\mathbf{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align}
M_{ij}^{G}=\begin{cases}
1,\left| \theta _i-\theta _j \right|<\tau_{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\tau_{r}\\
0, others.\\
\end{cases}
\label{al_1-2}
\end{align}
\STATE Calculate the distance matrix $\boldsymbol{D} \in \mathbb{R} ^{K \times K}$, where the element $D_{ij}$ in $\boldsymbol{D}$ is defined as follows:
\begin{align}
D_{ij} = 1-d\left( \boldsymbol{x}_{i}^{b} + \varDelta \boldsymbol{x}_{i}^{roi}, \boldsymbol{x}_{j}^{b} + \varDelta \boldsymbol{x}_{j}^{roi}, \boldsymbol{e}_{i}, \boldsymbol{e}_{j}\right),
\label{al_1-3}
\end{align}
where $d\left(\cdot, \cdot, \cdot, \cdot \right)$ is some predefined function that quantifies the distance between two lane predictions.
\STATE Define the adjacency matrix $\mathbf{M} = \mathbf{M}^{P} \land \mathbf{M}^{C} \land \mathbf{M}^{G}$; the final confidence $\tilde{s}_i$ is calculated as follows:
\begin{align}
\tilde{s}_i = \begin{cases}
1, & \text{if } \underset{j \in \{ j \mid M_{ij} = 1 \}}{\max} D_{ij} < \tau_g \\
0, & \text{otherwise}
\end{cases}
\label{al_1-4}
@ -871,122 +885,30 @@ It is easy to see that the ``ideal'' one-to-one branch is equivalence to O2M cls
\label{Graph Fast NMS}
\end{algorithm}

The key rule of NMS post-processing is as follows: given a series of positive detections with redundancy, a detection result A is suppressed by another detection result B if and only if:

(1) the confidence of A is lower than that of B;

(2) the predefined distance (\textit{e.g.}, IoU distance or L1 distance) between A and B is smaller than a threshold;

(3) B is not suppressed by any other detection result.

For simplicity, Fast NMS only satisfies conditions (1) and (2), which may lead to an increase in false negative predictions but offers faster processing without sequential iteration. Leveraging the ``iteration-free'' property, we propose a further refinement called ``sort-free'' Fast NMS. This new approach, named Graph-based Fast NMS, is detailed in Algorithm \ref{Graph Fast NMS}.

Although the new algorithm has a different format from the original one \cite{yolact}, it is easy to demonstrate that, when all elements of $\mathbf{M}^{G}$ are set to ``true'' (i.e., the geometric prior is ignored), Algorithm \ref{Graph Fast NMS} is equivalent to Fast NMS. Building upon the newly proposed Graph-based Fast NMS, we can design the structure of the one-to-one classification head in a manner that mirrors its principles.
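For reference, a NumPy sketch of Algorithm \ref{Graph Fast NMS} is given below. The distance function $d(\cdot)$ is left abstract in the algorithm; the mean x-offset similarity used here, as well as the thresholds $\tau_s$, $\tau_{\theta}$, $\tau_r$ and $\tau_g$, are placeholder assumptions rather than our actual settings.
\begin{verbatim}
import numpy as np

def graph_fast_nms(scores, theta, r, xs, tau_s=0.4, tau_theta=0.1,
                   tau_r=20.0, tau_g=0.9):
    k = len(scores)
    idx = np.arange(k)
    pos = scores >= tau_s
    m_p = pos[:, None] & pos[None, :]                     # M^P
    m_c = (scores[:, None] < scores[None, :]) | (         # M^C: j may suppress i
          (scores[:, None] == scores[None, :]) & (idx[:, None] < idx[None, :]))
    m_g = (np.abs(theta[:, None] - theta[None, :]) < tau_theta) & \
          (np.abs(r[:, None] - r[None, :]) < tau_r)       # M^G
    m = m_p & m_c & m_g                                   # adjacency M
    # D_ij = 1 - d(.): here d is a normalized mean |dx| (placeholder choice).
    d = np.abs(xs[:, None, :] - xs[None, :, :]).mean(-1) / xs.max()
    sim = 1.0 - d
    suppress = ((sim >= tau_g) & m).any(axis=1)           # no sorting, no iteration
    return (~suppress) & pos                              # binary s_tilde

xs = np.stack([np.linspace(100, 500, 36),                 # lane 0
               np.linspace(102, 503, 36),                 # near-duplicate of lane 0
               np.linspace(300, 700, 36)])                # distinct lane
print(graph_fast_nms(np.array([0.9, 0.8, 0.7]),
                     np.array([0.50, 0.51, 1.20]),
                     np.array([80.0, 82.0, 200.0]), xs))  # -> [ True False  True]
\end{verbatim}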
According to the analysis of the shortcomings of traditional NMS post-processing shown in Fig. \ref{NMS setting}, the fundamental issue arises from the definition of the distance between predictions. Traditional NMS relies on geometric properties to define distances between predictions, which often neglects contextual semantics. For example, in some scenarios, two predicted lanes with a small geometric distance should not be suppressed, such as in the case of double lines or fork lines. Although setting a threshold $d_{\tau}$ can mitigate this problem, it is challenging to strike a balance between precision and recall.

To address this, we replace the explicit definition of the inverse distance function with an implicit graph neural network. Additionally, the coordinates of the anchors are replaced with the anchor features $\boldsymbol{F}_{i}^{roi}$. According to information bottleneck theory \cite{alemi2016deep}, $\boldsymbol{F}_{i}^{roi}$, which contains the location and classification information, is sufficient for modelling the explicit geometric distance with a neural network. Beyond the geometric information, the features $\boldsymbol{F}_{i}^{roi}$ contain the implicit contextual information of an anchor, which provides additional clues for establishing implicit contextual distances between two anchors. The implicit contextual distance is calculated as follows:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}\gets& \mathrm{ReLU}\left( FC_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right),
\\
\boldsymbol{F}_{ij}^{edge}\gets& FC_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -FC_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right)
\\
&+FC_{base}\left( \boldsymbol{x}_{i}^{b}-\boldsymbol{x}_{j}^{b} \right),
\\
\boldsymbol{D}_{ij}^{edge}\gets& MLP_{edge}\left( \boldsymbol{F}_{ij}^{edge} \right).
\label{edge_layer}
\end{align}

Eq. (\ref{edge_layer}) represents the implicit expression of Eq. (\ref{al_1-3}), where the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor with dimension $d_{dis}$. $\boldsymbol{D}_{ij}^{edge}$ contains more complex information than the traditional geometric distance. The confidence calculation is expressed as follows:
\begin{align}
&\boldsymbol{D}_{i}^{node}\gets \underset{j\in \left\{ j|M_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
\\
&\boldsymbol{F}_{i}^{node}\gets MLP_{node}\left( \boldsymbol{D}_{i}^{node} \right),
\\
&\tilde{s}_i\gets \sigma \left( FC_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
\label{node_layer}
\end{align}

Eq. (\ref{node_layer}) serves as the implicit replacement for Eq. (\ref{al_1-4}). In this approach, we use element-wise max pooling of tensors instead of scalar-based max operations. The pooled tensor is then fed into a neural network with a sigmoid activation function to directly obtain the confidence. By eliminating the need for a predefined distance threshold, all confidence calculation patterns are derived from the training data.

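A PyTorch-style sketch of Eq. (\ref{edge_layer}) and Eq. (\ref{node_layer}) is shown below. The feature dimensions, the depth of the MLPs, and the zero-filling of nodes without admissible neighbours are assumptions made for illustration; they are not necessarily the settings of our implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class O2OClsHead(nn.Module):
    # Sketch of the edge layer (edge_layer) and node layer (node_layer).
    def __init__(self, roi_dim=64, d_dis=32, n_points=36):
        super().__init__()
        self.fc_roi = nn.Linear(roi_dim, roi_dim)        # FC_{o2o}^{roi}
        self.fc_in = nn.Linear(roi_dim, d_dis)           # FC_{in}
        self.fc_out = nn.Linear(roi_dim, d_dis)          # FC_{out}
        self.fc_base = nn.Linear(n_points, d_dis)        # FC_{base}
        self.mlp_edge = nn.Sequential(nn.Linear(d_dis, d_dis), nn.ReLU(),
                                      nn.Linear(d_dis, d_dis))
        self.mlp_node = nn.Sequential(nn.Linear(d_dis, d_dis), nn.ReLU(),
                                      nn.Linear(d_dis, d_dis))
        self.fc_cls = nn.Linear(d_dis, 1)                # FC_{o2o}^{out}

    def forward(self, f_roi, x_base, adj):
        # f_roi: (K, roi_dim); x_base: (K, n_points); adj: (K, K) boolean M.
        f = torch.relu(self.fc_roi(f_roi))
        edge = self.fc_in(f)[:, None] - self.fc_out(f)[None, :] \
             + self.fc_base(x_base[:, None] - x_base[None, :])
        d_edge = self.mlp_edge(edge)                      # (K, K, d_dis)
        d_edge = d_edge.masked_fill(~adj[..., None], float('-inf'))
        d_node = d_edge.max(dim=1).values                 # element-wise max over j
        d_node = torch.where(torch.isinf(d_node),
                             torch.zeros_like(d_node), d_node)
        return torch.sigmoid(self.fc_cls(self.mlp_node(d_node))).squeeze(-1)

head = O2OClsHead()
s_tilde = head(torch.randn(5, 64), torch.randn(5, 36), torch.rand(5, 5) > 0.5)
print(s_tilde.shape)   # torch.Size([5])
\end{verbatim}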
It should be noted that the O2O classification head depends on the predictions of the O2M classification head, as outlined in Eq. (\ref{al_1-1}). From a probability perspective, the confidence output by the O2M classification head, $s_{j}$, represents the probability that the $j$-th detection is a positive sample. The confidence output by the O2O classification head, $\tilde{s}_i$, denotes the conditional probability that the $i$-th sample should not be suppressed given that the $i$-th sample is identified as a positive sample:
\begin{align}
&s_j|_{j=1}^{N_a}\equiv P\left( a_j \text{ is pos} \right),
\\
&\tilde{s}_i|_{i=1}^{N_{pos}}\equiv P\left( a_i \text{ is retained} \mid a_i \text{ is pos} \right),
\label{probablity}
\end{align}
where $N_a$ equals $H^{l}\times W^{l}$ during the training stage and $K_{a}$ during the testing stage. The overall architecture of the O2O classification head is illustrated in Fig. \ref{o2o_cls_head}.

\textbf{Loss function.} We use the focal loss \cite{focal} for the O2O classification head and the O2M classification head:
\begin{align}
\mathcal{L} _{o2m}^{cls}&=\sum_{i\in \varOmega _{pos}^{o2m}}{\alpha _{o2m}\left( 1-s_i \right) ^{\gamma}\log \left( s_i \right)}\\&+\sum_{i\in \varOmega _{neg}^{o2m}}{\left( 1-\alpha _{o2m} \right) \left( s_i \right) ^{\gamma}\log \left( 1-s_i \right)},
\\
\mathcal{L} _{o2o}^{cls}&=\sum_{i\in \varOmega _{pos}^{o2o}}{\alpha _{o2o}\left( 1-\tilde{s}_i \right) ^{\gamma}\log \left( \tilde{s}_i \right)}\\&+\sum_{i\in \varOmega _{neg}^{o2o}}{\left( 1-\alpha _{o2o} \right) \left( \tilde{s}_i \right) ^{\gamma}\log \left( 1-\tilde{s}_i \right)}.
\end{align}
where the sets of one-to-one samples, $\varOmega _{pos}^{o2o}$ and $\varOmega _{neg}^{o2o}$, are restricted to the positive sample set of the O2M classification head:
\begin{align}
\varOmega _{pos}^{o2o}\cup \varOmega _{neg}^{o2o}=\left\{ i|s_i>C_{o2m} \right\}.
\end{align}

Only the samples with confidence larger than $C_{o2m}$ are chosen as candidate samples for the O2O classification head. According to \cite{pss}, to maintain feature quality during the training stage, the gradients of the O2O classification head are stopped from propagating back to the rest of the network (they are detached at the ROI features of the anchors, $\boldsymbol{F}_{i}^{roi}$). Additionally, we use a rank loss to increase the gap between the positive and negative confidences of the O2O classification head:
\begin{align}
&\mathcal{L} _{rank}=\frac{1}{N_{rank}}\sum_{i\in \varOmega ^{pos}_{o2o}}{\sum_{j\in \varOmega ^{neg}_{o2o}}{\max \left( 0, \tau _{rank}-\tilde{s}_i+\tilde{s}_j \right)}},\\
&N_{rank}=\left| \varOmega ^{pos}_{o2o} \right|\left| \varOmega ^{neg}_{o2o} \right|.
\end{align}

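The snippet below sketches the O2O classification terms and the rank loss, restricted to the candidate set $\{i \mid s_i > C_{o2m}\}$. It is written as a minimizable quantity (i.e., with an explicit minus sign in front of the focal terms), and $C_{o2m}$, $\alpha$, $\gamma$ and $\tau_{rank}$ are placeholder values.
\begin{verbatim}
import torch

def o2o_cls_and_rank_loss(s_o2m, s_tilde, is_o2o_pos, c_o2m=0.3,
                          alpha=0.25, gamma=2.0, tau_rank=0.5):
    cand = s_o2m > c_o2m                     # candidate set of the O2O head
    pos, neg = cand & is_o2o_pos, cand & ~is_o2o_pos
    eps = 1e-6
    l_pos = (alpha * (1 - s_tilde[pos]) ** gamma
             * torch.log(s_tilde[pos] + eps)).sum()
    l_neg = ((1 - alpha) * s_tilde[neg] ** gamma
             * torch.log(1 - s_tilde[neg] + eps)).sum()
    l_cls = -(l_pos + l_neg)
    if pos.any() and neg.any():              # pairwise hinge on the confidences
        gap = tau_rank - s_tilde[pos][:, None] + s_tilde[neg][None, :]
        l_rank = torch.clamp(gap, min=0).mean()  # mean = sum / (|pos| * |neg|)
    else:
        l_rank = s_tilde.sum() * 0.0
    return l_cls, l_rank

print(o2o_cls_and_rank_loss(torch.tensor([0.9, 0.8, 0.2, 0.7]),
                            torch.tensor([0.95, 0.30, 0.10, 0.20]),
                            torch.tensor([True, False, False, False])))
\end{verbatim}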
We directly use the GLaneIoU loss, $\mathcal{L}_{GLaneIoU}$ (with $g=1$), to regress the offsets of the x-coordinates of the sampled points, and the Smooth-L1 loss for the regression of the end points (namely the y-coordinates of the start point and the end point), denoted as $\mathcal{L} _{end}$. To help the model learn global features, we propose the auxiliary loss illustrated in Fig. \ref{auxloss}:
\begin{align}
\mathcal{L}_{aux} &= \frac{1}{\left| \varOmega_{pos}^{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\
&\quad + l \left( r_{i}^{global} - \hat{r}_{i}^{seg,m} \right) \Bigg].
\end{align}
The anchors and the ground truth are divided into several segments. Each anchor segment is regressed to the main components of the corresponding segment of the assigned ground truth. This trick assists the anchors in learning more about the global geometric shape.

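A short sketch of the auxiliary segment loss is given below; it assumes that $l(\cdot)$ is the Smooth-L1 loss and that the segment parameters of the assigned ground truth have already been gathered per positive anchor, both of which are illustrative assumptions.
\begin{verbatim}
import torch
import torch.nn.functional as F

def aux_segment_loss(theta_pred, r_pred, theta_gt_seg, r_gt_seg):
    # theta_pred, r_pred: (P,) polar parameters of each positive anchor.
    # theta_gt_seg, r_gt_seg: (P, N_seg) parameters of ground-truth segments.
    n_pos, n_seg = theta_gt_seg.shape
    l_theta = F.smooth_l1_loss(theta_pred[:, None].expand(-1, n_seg),
                               theta_gt_seg, reduction='sum')
    l_r = F.smooth_l1_loss(r_pred[:, None].expand(-1, n_seg),
                           r_gt_seg, reduction='sum')
    return (l_theta + l_r) / (n_pos * n_seg)

theta_gt = torch.tensor([[0.50, 0.52, 0.55], [1.10, 1.05, 1.00]])
r_gt = torch.tensor([[80.0, 90.0, 100.0], [150.0, 160.0, 170.0]])
print(aux_segment_loss(torch.tensor([0.51, 1.08]),
                       torch.tensor([85.0, 155.0]), theta_gt, r_gt))
\end{verbatim}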
\subsection{Loss function}
The overall loss function of Polar R-CNN is given as follows:
\begin{align}
\mathcal{L}_{overall} &=\mathcal{L} _{lpm}^{cls}+w_{lpm}^{reg}\mathcal{L} _{lpm}^{reg}\\&+w_{o2m}^{cls}\mathcal{L} _{o2m}^{cls}+w_{o2o}^{cls}\mathcal{L} _{o2o}^{cls}+w_{rank}\mathcal{L} _{rank}\\&+w_{IoU}\mathcal{L} _{IoU}+w_{end}\mathcal{L} _{end}+w_{aux}\mathcal{L} _{aux}.
\end{align}
The first line in the loss function represents the loss for LPH, which includes both classification and regression components. The second line pertains to the losses associated with the two classification heads (O2M and O2O), while the third line represents the losses for the regression head within the triplet head. Each term in the equation is weighted by a factor to balance the contribution of each component to the gradient. The entire training process is end-to-end.
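For completeness, the weighted aggregation can be sketched as below; all weight values are placeholders, not our tuned settings.
\begin{verbatim}
def overall_loss(losses, w):
    total = losses['lpm_cls']
    for k in ('lpm_reg', 'o2m_cls', 'o2o_cls', 'rank', 'iou', 'end', 'aux'):
        total = total + w[k] * losses[k]
    return total

print(overall_loss({'lpm_cls': 1.0, 'lpm_reg': 0.5, 'o2m_cls': 0.8,
                    'o2o_cls': 0.6, 'rank': 0.2, 'iou': 1.2,
                    'end': 0.3, 'aux': 0.4},
                   {'lpm_reg': 1.0, 'o2m_cls': 1.0, 'o2o_cls': 1.0,
                    'rank': 0.5, 'iou': 2.0, 'end': 1.0, 'aux': 0.5}))
\end{verbatim}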
\begin{table*}[htbp]
\centering
@ -1030,14 +952,36 @@ This is the first paragraph of Appx. B ..
\label{dataset_info}
\end{table*}

\section{The Definition of GLaneIoU}
\label{giou_appendix}
\textbf{Label assignment and Cost function.} To make the cost function more compact and consistent with general object detection works \cite{iouloss}\cite{giouloss}, we have redefined the lane IoU. As illustrated in Fig. \ref{glaneiou}, the newly defined lane IoU, which we refer to as GLaneIoU, is given as follows:
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/GLaneIoU.png} % replace with your image file
\caption{Illustrations of GLaneIoU redefined in our work.}
\label{glaneiou}
\end{figure}
\begin{align}
&w_{i}^{k}=\frac{\sqrt{\left( \Delta x_{i}^{k} \right) ^2+\left( \Delta y_{i}^{k} \right) ^2}}{\Delta y_{i}^{k}}w_{b},
\\
&\hat{d}_{i}^{\mathcal{O}}=\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right),
\\
&\hat{d}_{i}^{\xi}=\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right) -\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right),
\\
&d_{i}^{\mathcal{U}}=\max \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\min \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right),
\\
&d_{i}^{\mathcal{O}}=\max \left( \hat{d}_{i}^{\mathcal{O}},0 \right), \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, d_{i}^{\xi}=\max \left( \hat{d}_{i}^{\xi},0 \right),
\end{align}
where $w_{b}$ is the base semi-width of the lane instance. The definitions of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\xi}$ are similar to, but slightly different from, those in \cite{clrnet} and \cite{adnet}, with adjustments made to ensure the values are non-negative. This format is intended to maintain consistency with the IoU definitions used for bounding boxes. Therefore, the overall GLaneIoU is given as follows:
\begin{align}
GLaneIoU\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}},
\end{align}
where $j$ and $k$ are the indices of the valid points (the start point and the end point). It is straightforward to observe that when $g=0$, GLaneIoU corresponds to the IoU for bounding boxes, with a value range of $\left[0, 1 \right]$; when $g=1$, it corresponds to the GIoU \cite{giouloss} for bounding boxes, with a value range of $\left(-1, 1 \right]$. In general, when $g>0$, the value range of GLaneIoU is $\left(-g, 1 \right]$. We set $g=0$ for the cost function and the IoU matrix in SimOTA, and $g=1$ for the loss function.

We then define the cost function between the $i$-th prediction and the $j$-th ground truth, similar to \cite{detr}, as:
\begin{align}
\mathcal{C} _{ij}=\left(s_i\right)^{\beta_c}\times \left( GLaneIoU_{ij, g=0} \right) ^{\beta_r}.
\end{align}
This cost function takes both location and confidence into account. For label assignment, SimOTA (with $k=4$) \cite{yolox} is used for the two O2M heads (one-to-many assignment), while the Hungarian algorithm \cite{detr} is employed for the O2O classification head (one-to-one assignment).

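A NumPy sketch of the GLaneIoU computation is provided below; the sampling grid, the base semi-width, and the use of all points as valid points are illustrative assumptions.
\begin{verbatim}
import numpy as np

def lane_widths(xs, dy, w_b=2.5):
    # w^k = sqrt(dx^2 + dy^2) / dy * w_b, using the local slope of the lane.
    dx = np.gradient(xs)
    return np.sqrt(dx ** 2 + dy ** 2) / dy * w_b

def glane_iou(xs_p, xs_q, dy=10.0, w_b=2.5, g=1.0):
    w_p, w_q = lane_widths(xs_p, dy, w_b), lane_widths(xs_q, dy, w_b)
    hi = np.minimum(xs_p + w_p, xs_q + w_q)
    lo = np.maximum(xs_p - w_p, xs_q - w_q)
    d_o = np.maximum(hi - lo, 0.0)                 # overlap term
    d_xi = np.maximum(lo - hi, 0.0)                # gap term
    d_u = np.maximum(xs_p + w_p, xs_q + w_q) \
        - np.minimum(xs_p - w_p, xs_q - w_q)
    return d_o.sum() / d_u.sum() - g * d_xi.sum() / d_u.sum()

y = np.arange(0.0, 360.0, 10.0)
lane_a = 300.0 + 0.5 * y
print(glane_iou(lane_a, lane_a + 1.0, g=0.0))   # close lanes: IoU-like, in [0, 1]
print(glane_iou(lane_a, lane_a + 30.0, g=1.0))  # disjoint lanes: negative, GIoU-like
\end{verbatim}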
\section{Supplementary Implementation Details and Visualization Results}
\label{vis_appendix}
\textbf{Visualization.} Some important implementation details for each dataset are shown in Table \ref{dataset_info}. We present the Polar R-CNN predictions for both sparse and dense scenarios. Fig. \ref{vis_sparse} displays the predictions for sparse scenarios across four datasets. LPH effectively proposes anchors that are clustered around the ground truth, providing a robust prior for the RoI stage to achieve the final lane predictions. Moreover, the number of anchors has significantly decreased compared to previous works, making our method faster than other anchor-based methods in theory. Fig. \ref{vis_dense} shows the predictions for dense scenarios. We observe that NMS@50 mistakenly removes some predictions, leading to false negatives, while NMS@15 fails to eliminate redundant predictions, resulting in false positives. This highlights the trade-off between using a large IoU threshold and a small IoU threshold. The visualization clearly demonstrates that geometric distance becomes less effective in dense scenarios. Only the O2O classification head, driven by data, can address this issue by capturing semantic distance beyond geometric distance. As shown in Fig. \ref{vis_dense}, the O2O classification head successfully eliminates redundant predictions while retaining dense predictions with small geometric distances.

\begin{figure*}[t]
\centering
\def\pagewidth{0.49\textwidth}
\def\subwidth{0.47\linewidth}
@ -1175,7 +1119,7 @@ This is the first paragraph of Appx. B ..
\begin{figure*}[t]
\centering
\def\subwidth{0.24\textwidth}
\def\imgwidth{\linewidth}
@ -1273,6 +1217,6 @@ This is the first paragraph of Appx. B ..
\caption{The visualization of the detection results of sparse and dense scenarios on CurveLanes dataset.}
\label{vis_dense}
\end{figure*}

\end{document}