\label{gnn}
\end{figure}
\textbf{NMS vs NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denote the ROI features extracted from the $i_{th}$ anchor; all three subheads take $\boldsymbol{F}^{roi}_{i}$ as input. For now, let us focus on the O2M classification (O2M cls) head and the O2M regression (O2M Reg) head, which follow the paradigm used in previous work and serve as a baseline for the new one-to-one paradigm. To maintain simplicity and rigor, both the O2M cls head and the O2M Reg head consist of two layers with activation functions, featuring a plain structure without any complex mechanisms such as attention or deformable convolution. As previously mentioned, merely replacing the one-to-many label assignment with one-to-one label assignment is insufficient for eliminating NMS postprocessing. This is because anchors often exhibit significant overlap or are positioned very close to each other, as shown in Fig. \ref{anchor setting} (b)(c). Let $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ represent the features from two overlapping (or very close) anchors, which implies that $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ will be almost identical. Let $f_{cls}^{plain}$ denote the neural structure used in the O2M cls head but trained with one-to-one label assignment. If $\boldsymbol{F}^{roi}_{i}$ is a positive sample and $\boldsymbol{F}^{roi}_{j}$ is a negative sample, the ideal output should be as follows:
\begin{equation}
\begin{aligned}
f_{cls}^{plain}\left( \boldsymbol{F}^{roi}_{i} \right) &= 1,\\
f_{cls}^{plain}\left( \boldsymbol{F}^{roi}_{j} \right) &= 0,\quad \boldsymbol{F}^{roi}_{i}\approx \boldsymbol{F}^{roi}_{j},
\end{aligned}
\label{sharp fun}
\end{equation}
Equation \ref{sharp fun} suggests that $f_{cls}^{plain}$ needs to be ``sharp'' enough to differentiate between two nearly identical features. That is to say, the output of $f_{cls}^{plain}$ must change rapidly over short periods or distances, which implies that $f_{cls}^{plain}$ needs to capture higher-frequency information. A similar issue is discussed in \cite{}. Capturing such high-frequency behavior with a plain structure is challenging, because a naive MLP tends to capture lower-frequency information \cite{}. In the most extreme case, where $\boldsymbol{F}_{i}^{roi} = \boldsymbol{F}_{j}^{roi}$, it becomes impossible to separate the two anchors into a positive and a negative sample; in practice, both confidences converge to around 0.5. This problem arises from the limitations of the input format and the structure of the naive MLP, which restrict its ability to express high-frequency information. Therefore, it is crucial to establish relationships between anchors and to design a new model structure that can represent this ``sharp'' information.
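For concreteness, a minimal PyTorch-style sketch of such a plain two-layer classification head is shown below; the layer widths, names, and the toy example are illustrative assumptions rather than the exact configuration used in this work.
\begin{verbatim}
# A minimal sketch (assumed layer widths) of a plain two-layer cls head:
# two linear layers with an activation, no attention or deformable conv.
import torch
import torch.nn as nn

class PlainClsHead(nn.Module):
    def __init__(self, in_dim=192, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, roi_feat):              # roi_feat: (N_anchors, in_dim)
        return self.mlp(roi_feat).sigmoid().squeeze(-1)   # confidence per anchor

# Two nearly identical ROI features produce nearly identical confidences,
# which is why one-to-one assignment alone cannot separate them.
head = PlainClsHead()
f_i = torch.randn(1, 192)
f_j = f_i + 1e-3 * torch.randn(1, 192)        # almost the same feature
print(head(f_i), head(f_j))                   # outputs are almost equal
\end{verbatim}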
It is easy to see that the ``ideal'' one-to-one branch is equivalent to the O2M cls branch combined with the O2M regression branch and NMS postprocessing. If the NMS could be replaced by an equivalent but learnable function (e.g., a neural network with a specific structure), the O2O head could be trained to handle the one-to-one assignment. However, NMS involves sequential iteration and confidence sorting, which are difficult to reproduce with a neural network. Previous work has proposed RNN-based networks \cite{} to replace NMS, but these methods are time-consuming, and their iterative nature introduces additional difficulty into model training. To eliminate the iteration process, we propose an equivalent formulation of FastNMS \cite{}.
The key rule of NMS postprocessing is as follows:
Given a series of positive detections with redundancy, a detection lane A is suppressed by another detection lane B if and only if:
(1) The confidence of A is lower than that of B.
(2) The distance between A and B is smaller than a predefined threshold $d_{\tau}$.
(3) Detection lane B is not suppressed by any other detection.
As a simplified variant of NMS, FastNMS requires only conditions (1) and (2). This may introduce more false negative predictions, but it is faster because it avoids sequential iteration. Building on this ``iteration-free'' property, we further design a ``sort-free'' variant, which we call Graph-based FastNMS; the algorithm is elaborated in Algorithm \ref{Graph FastNMS}.
It is straightforward to show that when all elements in $\boldsymbol{M}$ are set to 1 (i.e., the geometric priors are ignored), Graph-based FastNMS is equivalent to FastNMS. Building on Graph-based FastNMS, we can design the structure of the O2O cls head to mirror its principles.
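Since Algorithm \ref{Graph FastNMS} itself is not reproduced here, the following numpy sketch only illustrates the sort-free, iteration-free idea described above; the variable names, the tie-breaking rule, and the exact form of the pairwise distance matrix are assumptions.
\begin{verbatim}
import numpy as np

def graph_fast_nms(scores, dist, M, d_tau):
    """scores: (N,) confidences; dist: (N, N) pairwise lane distances;
    M: (N, N) 0/1 geometric-prior adjacency; d_tau: distance threshold.
    Returns a boolean keep mask (True = kept)."""
    N = len(scores)
    idx = np.arange(N)
    # condition (1): anchor j has a higher confidence than anchor i
    # (the anchor index breaks exact ties)
    higher = (scores[None, :] > scores[:, None]) | (
        (scores[None, :] == scores[:, None]) & (idx[None, :] < idx[:, None]))
    # condition (2): i and j are close enough, restricted by the geometric prior M
    close = (dist < d_tau) & M.astype(bool)
    # anchor i is suppressed if any such j exists; no sorting or sequential loop needed
    suppressed = np.any(higher & close, axis=1)
    return ~suppressed

# Setting all entries of M to 1 removes the geometric prior and recovers plain FastNMS.
\end{verbatim}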
According to the analysis of the shortcomings of traditional NMS postprocessing shown in Fig. \ref{nms setting}, the fundamental issue lies in how the distance between two predictions is defined. Traditional NMS relies on geometric properties to define this distance, which often neglects the contextual semantics. For example, two predicted lanes with a small geometric distance should not necessarily suppress each other, such as in the case of double lines or crossing lines. Although tuning the threshold $d_{\tau}$ can mitigate this problem, it is difficult to strike a balance between precision and recall.
To address this, we replace the explicit definition of the distance function with an implicit graph neural network. Additionally, the anchor coordinates are replaced with the anchor features $\boldsymbol{F}_{i}^{roi}$. According to information bottleneck theory \cite{}, $\boldsymbol{F}_{i}^{roi}$, which contains both location and classification information, is sufficient for modelling the explicit geometric distance with a neural network. Beyond the geometric information, $\boldsymbol{F}_{i}^{roi}$ also contains the contextual information of an anchor, which provides additional clues for establishing implicit distances between two anchors. The implicit distance is expressed as follows:
\begin{equation}
\begin{aligned}
\label{edge_layer}
\end{equation}
Equation \ref{edge_layer} is the implicit counterpart of equation \ref{al_1-3}, where the distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor of dimension $d_{dis}$. $\boldsymbol{D}_{ij}^{edge}$ therefore carries more complex information than the traditional geometric distance. The confidence calculation is expressed as follows:
\begin{equation}
\begin{aligned}
\label{node_layer}
\end{equation}
Equation \ref{node_layer} serves as the implicit replacement for equation \ref{al_1-4}. In this approach, we use elementwise max pooling of tensors instead of scalar-based max operations. The pooled tensor is then fed into a neural network with a sigmoid activation function to obtain the confidence directly. By eliminating the need for a predefined distance threshold, all confidence calculation patterns are learned from the training data.
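As a rough illustration of how the edge layer (equation \ref{edge_layer}) and the node layer (equation \ref{node_layer}) could be realized, a PyTorch-style sketch is given below; the feature concatenation, the masking rule, and the dimensions are assumptions made for exposition and do not reproduce the exact design.
\begin{verbatim}
# A hedged sketch of an O2O cls head built from an edge layer and a node layer.
import torch
import torch.nn as nn

class O2OClsHead(nn.Module):
    def __init__(self, in_dim=192, d_dis=32):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * in_dim, d_dis), nn.ReLU(inplace=True))
        self.node_mlp = nn.Linear(d_dis, 1)

    def forward(self, roi_feat, o2m_scores, adj):
        # roi_feat: (N, C) anchor ROI features, o2m_scores: (N,), adj: (N, N) 0/1 mask
        N = roi_feat.size(0)
        f_i = roi_feat.unsqueeze(1).expand(N, N, -1)
        f_j = roi_feat.unsqueeze(0).expand(N, N, -1)
        # edge layer: implicit "distance" tensor D_ij of dimension d_dis
        D = self.edge_mlp(torch.cat([f_i, f_j], dim=-1))             # (N, N, d_dis)
        # only edges from higher-confidence neighbours (and the prior) contribute
        mask = ((o2m_scores[None, :] > o2m_scores[:, None]) & adj.bool()).unsqueeze(-1)
        D = D.masked_fill(~mask, float('-inf'))
        # node layer: elementwise max pooling over neighbours, then sigmoid confidence
        pooled = D.max(dim=1).values                                  # (N, d_dis)
        pooled = torch.where(torch.isfinite(pooled), pooled, torch.zeros_like(pooled))
        return torch.sigmoid(self.node_mlp(pooled)).squeeze(-1)      # (N,)
\end{verbatim}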
It should be noted that the O2O cls head depends on the predictions of the O2M cls head, as outlined in equation \ref{al_1-1}. From a probabilistic perspective, the confidence output by the O2M cls head, $s_{j}$, represents the probability that the $j_{th}$ detection is a positive sample. The confidence output by the O2O cls head, $\tilde{s}_i$, denotes the conditional probability that the $i_{th}$ sample should not be suppressed, given that the $i_{th}$ sample has already been identified as a positive sample:
\begin{equation}
\begin{aligned}
&s_j|_{j=1}^{N_A}\equiv P\left( a_j\,\,is\,\,pos \right) \,\,
\end{equation}
\textbf{Label assignment and Cost function.} We use a label assignment strategy (SimOTA) similar to previous work \cite{}\cite{}. However, to make the cost function more compact and consistent with works on general object detection \cite{ref3}, we redefine the lane IoU. As illustrated in Fig. \ref{glaneiou}, the newly defined lane IoU, which we refer to as GLaneIoU, is given as follows:
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thsis_figure/GLaneIoU.png} % replace with your image file name
\caption{Illustration of GLaneIoU as redefined in our work.}
\label{glaneiou}
\end{figure}
\begin{equation}
\begin{aligned}
&w_{i}^{k}=\frac{\sqrt{\left( \Delta x_{i}^{k} \right) ^2+\left( \Delta y_{i}^{k} \right) ^2}}{\Delta y_{i}^{k}}w
\end{aligned}
\end{equation}
The definitions of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\xi}$ are similar to, but slightly different from, those in \cite{} and \cite{}, with adjustments made to keep the values non-negative. This format is intended to remain consistent with the IoU definitions used for bounding boxes. The overall GLaneIoU is then given as follows:
\begin{equation}
\begin{aligned}
GLaneIoU\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}
\end{aligned}
\end{equation}
where $j$ and $k$ are the indices of the valid points (the start point and the end point). It is straightforward to observe that when $g=0$, GLaneIoU corresponds to the IoU for bounding boxes, with a value range of $\left[0, 1 \right]$. When $g=1$, GLaneIoU corresponds to the GIoU for bounding boxes, with a value range of $\left(-1, 1 \right]$. In general, when $g>0$, the value range of GLaneIoU is $\left(-g, 1 \right]$.
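To make the formula concrete, a small numpy sketch of the GLaneIoU computation is given below; the per-point overlap, gap, and union widths $d^{\mathcal{O}}$, $d^{\xi}$, and $d^{\mathcal{U}}$ are assumed to be precomputed from the widened lane points, and their construction is not reproduced here.
\begin{verbatim}
import numpy as np

def glane_iou(d_O, d_xi, d_U, j, k, g=1.0):
    """d_O, d_xi, d_U: 1-D arrays of per-point overlap, gap and union widths,
    sampled at the same y positions; j, k: indices of the valid start/end points."""
    sl = slice(j, k + 1)
    union = np.sum(d_U[sl])
    return np.sum(d_O[sl]) / union - g * np.sum(d_xi[sl]) / union

# g = 0 yields an IoU-style value in [0, 1]; g = 1 behaves like GIoU in (-1, 1].
\end{verbatim}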
We then define the cost function between the $i_{th}$ prediction and the $j_{th}$ ground truth as follows \cite{}:
\begin{equation}
\begin{aligned}
\mathcal{C} _{ij}=\left(s_i\right)^{\beta_c}\times \left( GLaneIoU_{ij, g=0} \right) ^{\beta_r}
\end{aligned}
\end{equation}
This cost function is more compact than those in previous work and takes both location and confidence into account. For label assignment, SimOTA (with $k=4$) \cite{ref1} is used for the two O2M heads with one-to-many assignment, while the Hungarian algorithm \cite{} is employed for the O2O cls head with one-to-one assignment.
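A hedged sketch of how this cost could be assembled and how the one-to-one assignment could be solved is given below; the exponents $\beta_c$ and $\beta_r$ are placeholder values, and the SimOTA dynamic-$k$ selection used for the O2M heads is omitted.
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def assignment_cost(scores, glane_iou_mat, beta_c=1.0, beta_r=3.0):
    # scores: (N_pred,) O2M confidences; glane_iou_mat: (N_pred, N_gt) with g = 0
    iou = np.clip(glane_iou_mat, 0.0, 1.0)
    return (scores[:, None] ** beta_c) * (iou ** beta_r)   # larger = better match

def hungarian_o2o(cost):
    # the Hungarian algorithm minimizes cost, so negate the "goodness" matrix
    pred_idx, gt_idx = linear_sum_assignment(-cost)
    return pred_idx, gt_idx
\end{verbatim}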
\textbf{Loss function.} We use focal loss \cite{} for the O2O cls head and the O2M cls head:
\begin{equation}
\begin{aligned}
\mathcal{L} _{\,\,o2m,cls}&=-\sum_{i\in \varOmega _{pos}^{o2m}}{\alpha _{o2m}\left( 1-s_i \right) ^{\gamma}\log \left( s_i \right)}\\&-\sum_{i\in \varOmega _{neg}^{o2m}}{\left( 1-\alpha _{o2m} \right) \left( s_i \right) ^{\gamma}\log \left( 1-s_i \right)}
\\
\end{aligned}
\end{equation}
where the one-to-one sample sets $\varOmega _{pos}^{o2o}$ and $\varOmega _{neg}^{o2o}$ are restricted to the candidate samples selected according to the confidence of the O2M cls head:
\begin{equation}
\begin{aligned}
\varOmega _{pos}^{o2o}\cup \varOmega _{neg}^{o2o}=\left\{ i|s_i>C_{o2m} \right\}
\end{aligned}
\end{equation}
Only samples with a confidence larger than $C_{o2m}$ are chosen as candidate samples for the O2O cls head. To maintain feature quality during the training stage, the gradient of the O2O cls head is stopped from propagating back to the rest of the detection head (i.e., to the anchor ROI features $\boldsymbol{F}_{i}^{roi}$). Additionally, we use a rank loss to increase the gap between the positive and negative confidences of the O2O cls head:
\begin{equation}
\begin{aligned}
&\mathcal{L} _{\,\,rank}=\frac{1}{N_{rank}}\sum_{i\in \varOmega _{pos}^{o2o}}{\sum_{j\in \varOmega _{neg}^{o2o}}{\max \left( 0, \tau _{rank}-\tilde{s}_i+\tilde{s}_j \right)}}\\
\label{auxloss}
\end{figure}
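A PyTorch-style sketch of this rank loss is shown below; $N_{rank}$ is assumed to be the number of positive-negative pairs, and $\tau_{rank}$ is a margin hyperparameter.
\begin{verbatim}
import torch

def rank_loss(s_tilde, pos_idx, neg_idx, tau_rank=0.5):
    pos = s_tilde[pos_idx]                      # O2O confidences of positive samples
    neg = s_tilde[neg_idx]                      # O2O confidences of negative samples
    margin = tau_rank - pos[:, None] + neg[None, :]
    n_rank = max(pos.numel() * neg.numel(), 1)  # assumed: number of (i, j) pairs
    return torch.clamp(margin, min=0).sum() / n_rank
\end{verbatim}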
We directly use the GLaneIoU loss, $\mathcal{L} _{GLaneIoU}$ (with $g=1$), to regress the x-coordinate offsets, and a SmoothL1 loss, denoted as $\mathcal{L} _{end}$, for the regression of the end points (namely the y coordinates of the start point and the end point). To encourage the model to learn global features, we propose an auxiliary loss, illustrated in Fig. \ref{auxloss}:
\begin{equation}
\begin{aligned}
\mathcal{L} _{\,\,aux}=\frac{1}{\left| \varOmega _{pos}^{o2m} \right|N_{seg}}\sum_{i\in \varOmega _{pos}^{o2o}}{\sum_{m=j}^k{\left[ l\left( \theta _i-\hat{\theta}_{i}^{seg,m} \right) +l\left( r_{i}^{global}-\hat{r}_{i}^{seg,m} \right) \right]}}
\end{aligned}
\end{equation}
The anchors and ground truth are divided into several segments. Each anchor segment is regressed to the main components of the corresponding segment of the assigned ground truth. This approach assists the anchors in learning more about the global geometric shape.
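A hedged PyTorch sketch of this auxiliary loss is given below; $l$ is assumed to be a Smooth-L1 penalty, the per-segment targets $\hat{\theta}^{seg,m}$ and $\hat{r}^{seg,m}$ are assumed to be precomputed from the assigned ground-truth segments, and all segments are assumed valid for every anchor.
\begin{verbatim}
import torch
import torch.nn.functional as F

def aux_loss(theta, r_global, theta_hat_seg, r_hat_seg, n_o2m_pos):
    # theta, r_global: (N_pos,) predicted global angle / radius of positive anchors
    # theta_hat_seg, r_hat_seg: (N_pos, N_seg) per-segment targets
    n_seg = theta_hat_seg.size(1)
    loss = F.smooth_l1_loss(theta[:, None].expand_as(theta_hat_seg),
                            theta_hat_seg, reduction='sum') \
         + F.smooth_l1_loss(r_global[:, None].expand_as(r_hat_seg),
                            r_hat_seg, reduction='sum')
    # normalized by the number of O2M positive samples and segments, as in the equation
    return loss / (n_o2m_pos * n_seg)
\end{verbatim}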
The overall loss function of PolarRCNN is given as follows:
\begin{equation}
\begin{aligned}
\mathcal{L} _{overall} &=\mathcal{L} _{lph}^{cls}+w_{lph}^{reg}\mathcal{L} _{lph}^{reg}\\&+w_{o2m}^{cls}\mathcal{L} _{o2m}^{cls}+w_{o2o}^{cls}\mathcal{L} _{o2o}^{cls}+w_{rank}\mathcal{L} _{rank}\\&+w_{IoU}\mathcal{L} _{IoU}+w_{end}\mathcal{L} _{end}+w_{aux}\mathcal{L} _{aux}
\end{aligned}
\end{equation}
The first line of the loss function corresponds to the local polar head, including both classification and regression components. The second line covers the losses of the two classification heads (O2M and O2O), while the third line covers the regression losses within the triplet head. Each term is weighted by a factor to balance its contribution to the gradient. The entire training process is end-to-end.
\section{Experiment}
\subsection{Dataset and Evaluation Metric}
\subsection{Implementation Details}
\subsection{Comparison with State-of-the-Art Results}
\subsection{Ablation Study and Visualization}
\begin{figure}[t]