commit b9838c625d
parent 888326d770
update

main.tex
@@ -33,7 +33,7 @@
\author{IEEE Publication Technology,~\IEEEmembership{Staff,~IEEE,}
% <-this % stops a space
-\thanks{This paper was produced by the IEEE Publication Technology Group. They are in Piscataway, NJ.}% <-this % stops a space
+\thanks{This work was supported in part by the National Natural Science Foundation of China under Grant 62276208 and 12326607, and in part by the Natural Science Basic Research Program of Shaanxi Province 2024JC-JCQN-02.}% <-this % stops a space
\thanks{Manuscript received April 19, 2021; revised August 16, 2021.}}
% The paper headers
@@ -311,7 +311,7 @@ where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents the learnable agg
\end{aligned}
\end{equation}
-\textbf{Triplet Head.} The triplet head contrains three heads, namely the one-to-one classification(o2o cls) head, one-to-many classification(o2m cls) head and one-to-many regression(o2m Reg) head. In \cite{}\cite{}\cite{}\cite{}, the detection head are all in the one-to-many paradigm. During training stage, more than one positive samples are assigned to one ground truth. So more than one detections results for each instance are predicted during the evaluation stage, which need Non-Maximum Suppression (NMS) to remove the redundant results and keep one final result with highest confidence. However, NMS depends on the defination of distance between two detection results and the calculation for distance is complicated for curve lane and other irregular geometric shapes (such as instance segment). So in order to provided a detection result without redundancy (NMS-free), one-to-one paradigm is necessary during training stage, according to \cite{}. However, one-to-one paradigm is not enough and the structure of detection head is also essential for NMS-free detection. This issue will be analyzed in detail below.
+\textbf{Triplet Head.} The triplet head comprises three distinct heads: the one-to-one classification (O2O cls) head, the one-to-many classification (O2M cls) head, and the one-to-many regression (O2M Reg) head. In previous studies \cite{}\cite{}\cite{}\cite{}, the detection heads predominantly follow the one-to-many paradigm: during the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation stage, more than one detection result is predicted for each instance, and these redundancies are typically removed with Non-Maximum Suppression (NMS), which eliminates duplicate results and retains only the highest-confidence detection. However, NMS relies on a definition of the distance between two detection results, and this calculation is complicated for curved lanes and other irregular geometric shapes (such as instance segments). To achieve non-redundant (NMS-free) detection, the one-to-one paradigm becomes essential during training, as highlighted in \cite{}. Nevertheless, merely adopting the one-to-one paradigm is insufficient; the structure of the detection head also plays a pivotal role in achieving NMS-free detection. This issue is analyzed in detail below.
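The two assignment paradigms contrasted above can be pictured with a minimal sketch (illustrative only; the matching costs and threshold are invented, not the paper's actual cost function):

```python
def o2m_assign(anchor_costs, thresh):
    """One-to-many: every anchor whose matching cost to the ground truth is
    below `thresh` becomes a positive sample, so several positives per GT
    produce duplicate detections that NMS must later remove."""
    return [i for i, c in enumerate(anchor_costs) if c < thresh]

def o2o_assign(anchor_costs):
    """One-to-one: only the single cheapest anchor is positive, so the head
    is trained to emit exactly one detection per instance."""
    return [min(range(len(anchor_costs)), key=anchor_costs.__getitem__)]

costs = [0.2, 0.25, 0.9, 1.4]   # toy costs of four anchors against one GT
assert o2m_assign(costs, 0.5) == [0, 1]   # two near-duplicate positives
assert o2o_assign(costs) == [0]           # a unique positive
```

The toy threshold plays the role of whatever matching rule the cited one-to-many assigners use; the point is only the one-positive vs. many-positives contrast.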
\begin{algorithm}[t]
\caption{The Algorithm of the Graph-based FastNMS}
@@ -375,7 +375,7 @@ where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents the learnable agg
\label{gnn}
\end{figure}
-\textbf{NMS vs NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denotes the roi features extracted from $i_{th}$ anchors and the three sub heads take $\boldsymbol{F}^{roi}_{i}$ as input. Now, let us only consider the o2m cls head and o2m Reg head, which meets the old paradigm for previous work and can be taken as the baseline for the following new one-to-one paradigm. Keeping it simple and rigorous, both o2m cls head and o2m Reg head consists of two layers with activation function (plain structure without any complex mechanisms such as attention and desformable convolution). To remove the NMS postprocessing, directly replace the one-to-many with one-to-one label assignment is not enough as we mentioned before, because the anchors are highly overlapping or with small distance with each other, as the Fig. \ref{anchor setting} (b)(c) shows. Let the $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ denote the features form two overlapping (or with very small distance), so the values of $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ is almost the same. Let $f_{plain}^{cls}$ denotes the neural structure the sample as the o2m cls head but trained with one-to-one label assignment. Supposed that $\boldsymbol{F}^{roi}_{i}$ is the positive sample and the $\boldsymbol{F}^{roi}_{j}$ is the negative, the ideal correspondingly output is different as following:
+\textbf{NMS vs NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denote the ROI features extracted from the $i$-th anchor; the three subheads take $\boldsymbol{F}^{roi}_{i}$ as input. For now, let us focus on the O2M classification (O2M cls) head and the O2M regression (O2M Reg) head, which follow the paradigm used in previous work and serve as a baseline for the new one-to-one paradigm. To maintain simplicity and rigor, both the O2M cls head and the O2M Reg head consist of two layers with activation functions, i.e., a plain structure without complex mechanisms such as attention or deformable convolution. As previously mentioned, merely replacing the one-to-many label assignment with one-to-one label assignment is insufficient for eliminating NMS postprocessing, because anchors often overlap significantly or lie very close to each other, as shown in Fig. \ref{anchor setting} (b)(c). Let $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ denote the features from two such overlapping (or very close) anchors, so that $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ are almost identical. Let $f_{cls}^{plain}$ denote the same network structure as the O2M cls head but trained with one-to-one label assignment. If $\boldsymbol{F}^{roi}_{i}$ is a positive sample and $\boldsymbol{F}^{roi}_{j}$ is a negative sample, the ideal outputs should differ as follows:
\begin{equation}
@@ -389,7 +389,8 @@ where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents the learnable agg
\label{sharp fun}
\end{equation}
-The equatin \ref{sharp fun} implicit the property of $f_{cls}^{plain}$ is sharp, the same issue is also mentioned in \cite{}. Learning the sharp property with the plain structure is hard. In the most extreme case with $\boldsymbol{F}_{i}^{roi} = \boldsymbol{F}_{j}^{roi}$, it's nearly impossible to distinguish the two anchors to positive and negative samples completely, the reality is both the confidence is convergent to around 0.5. The issue is caused by the limitations of the input format and the structure, which limit the expression ability. So it's essential to establish the relations between anchors and design new model structure to express the relation.
+
+Equation \ref{sharp fun} implies that $f_{cls}^{plain}$ must be sharp; a similar issue is also discussed in \cite{}. Learning such a sharp property with a plain structure is challenging because a naive MLP tends to capture lower-frequency information \cite{}. In the most extreme case, where $\boldsymbol{F}_{i}^{roi} = \boldsymbol{F}_{j}^{roi}$, it is impossible to completely separate the two anchors into a positive and a negative sample; in practice, both confidences converge to around 0.5. This issue stems from the limitations of the input format and of the naive MLP structure, which restrict the expressive ability of the head. It is therefore essential to model the relations between anchors and to design a new head structure that expresses these relations.
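The degenerate case above can be made concrete with a self-contained sketch (the weights and features are invented toy values, not trained parameters): a pointwise two-layer head is a deterministic function of its input, so two anchors with identical ROI features necessarily receive identical confidences, and no choice of weights can split them into a 1/0 pair.

```python
import math

def pointwise_head(feat, w1, b1, w2, b2):
    """A plain two-layer scorer: Linear -> ReLU -> Linear -> sigmoid.
    It scores each anchor independently, with no cross-anchor input."""
    hidden = [max(0.0, sum(wi * x for wi, x in zip(row, feat)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(wi * h for wi, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Fixed toy weights, standing in for any trained parameters.
w1 = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.2]]
b1 = [0.1, -0.1]
w2 = [0.7, -0.4]
b2 = 0.05

f_i = [1.0, 2.0, 3.0]   # ROI features of the intended positive anchor
f_j = [1.0, 2.0, 3.0]   # a fully overlapping anchor: identical features

s_i = pointwise_head(f_i, w1, b1, w2, b2)
s_j = pointwise_head(f_j, w1, b1, w2, b2)

# Identical inputs force identical scores, whatever the weights are.
assert s_i == s_j
```

A head that also sees relations between anchors (as the graph-based design below the equation argues for) is not bound by this limit, because its output for anchor $i$ can depend on anchor $j$.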
It is easy to notice that the "ideal" one-to-one branch is equivalent to the o2m cls branch + o2m regression + NMS postprocessing. If the NMS could be replaced by an equivalent but learnable function (e.g., a neural network), the o2o head could be trained to learn the one-to-one assignment. However, NMS requires sequential iteration and a confidence-sorting process, which are hard to rewrite as a neural network. Although previous work has proposed RNN-based networks \cite{} to replace NMS, they are time-consuming, and the iterative process introduces additional difficulty for model training.
@@ -404,7 +405,7 @@ Given a series of positive detections with redundancy, a detection lane A is sup
However, as a simplification of NMS, FastNMS only requires conditions (1) and (2); it introduces more false-negative predictions but runs faster because it avoids sequential iteration. Building on this "iteration-free" property, we further design a "sort-free" FastNMS. The new algorithm is called Graph-based FastNMS and is elaborated in Algorithm \ref{Graph FastNMS}.
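The contrast can be sketched compactly under the assumptions in the text, reading conditions (1) and (2) as "a higher-confidence detection exists" and "its distance is below the threshold"; the distance matrix, threshold, and mask values here are toy stand-ins for the paper's lane distance and geometric prior $\boldsymbol{M}$:

```python
def fast_nms(scores, dist, tau):
    """FastNMS sketch: suppress detection i if ANY higher-scoring detection
    j lies within distance tau (conditions (1) and (2) only; j is not
    required to survive itself, hence extra false negatives vs. full NMS)."""
    n = len(scores)
    return [i for i in range(n)
            if not any(scores[j] > scores[i] and dist[i][j] < tau
                       for j in range(n) if j != i)]

def graph_fast_nms(scores, dist, tau, M):
    """Sort-free variant: build a suppression graph with an edge j -> i when
    M[i][j] == 1, scores[j] > scores[i], and dist[i][j] < tau, then keep the
    nodes with no incoming edge. With M all ones it reduces to fast_nms."""
    n = len(scores)
    return [i for i in range(n)
            if not any(M[i][j] and scores[j] > scores[i] and dist[i][j] < tau
                       for j in range(n) if j != i)]

scores = [0.9, 0.8, 0.6]                       # toy confidences
dist = [[0, 1, 10], [1, 0, 9], [10, 9, 0]]     # toy pairwise lane distances
ones = [[1] * 3 for _ in range(3)]

assert fast_nms(scores, dist, tau=2) == [0, 2]                      # det 1 suppressed by det 0
assert graph_fast_nms(scores, dist, 2, ones) == fast_nms(scores, dist, 2)
```

Note that neither function sorts the detections or iterates sequentially: each keep decision is an independent row-wise reduction, which is what makes the rule expressible as a graph operation.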
It is easy to prove that when all the elements in $\boldsymbol{M}$ are set to 1 (i.e., the geometric priors are ignored), Graph-based FastNMS is equivalent to FastNMS. Based on our newly proposed Graph-based FastNMS, we can construct the structure of the o2o cls head with reference to Graph-based FastNMS.
According to the analysis of the shortcomings of traditional NMS postprocessing, the essential issue lies in the definition of the distance between two predictions and the setting of the threshold $d_{\tau}$. We therefore replace the explicit definition of the distance function with an implicit graph neural network. Moreover, the explicit geometric inputs (e.g., the x-axis coordinates) are also replaced with the anchor features $\boldsymbol{F}_{i}^{roi}$. As \cite{} mentioned, $\boldsymbol{F}_{i}^{roi}$ contains both location and classification information, which is sufficient to model the distance with a neural network.
So the implicit distance is defined as follows: