update
This commit is contained in:
parent 31a50689a2
commit 1378b3e87b
main.tex
@@ -227,7 +227,7 @@ The regression branch consists of a single $1\times1$ convolutional layer and wi
\end{align}
where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\lambda^{l}\}\right|$ is the number of positive local poles in LPM.
\par
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a local pole in the feature map, are considered as candidates during the training stage. However, some of these anchors serve as background anchors. We select the $K$ anchors with the highest confidence scores as the foreground candidates fed into the second stage (\textit{i.e.}, the global polar module). During training, all anchors are chosen as candidates, \textit{i.e.} $K=H^{l}\times W^{l}$, which assists the \textit{Global Polar Module} (the second stage) in learning from a diverse range of features, including various negative background anchor samples. Conversely, during the evaluation stage, anchors with lower confidence can be excluded, where $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of the anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
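
A minimal sketch of this top-$K$ selection, assuming a PyTorch-style API (the tensor names and shapes are illustrative, not taken from the released code):
\begin{verbatim}
import torch

def select_topk_anchors(scores, anchors, k):
    # scores: (H*W,) confidence per local pole; anchors: (H*W, 2) (theta, r).
    k = min(k, scores.numel())          # K <= H^l x W^l at evaluation time
    topk_scores, idx = torch.topk(scores, k)
    return anchors[idx], topk_scores, idx

scores = torch.rand(4 * 10)             # e.g., H^l x W^l = 4 x 10
anchors = torch.rand(4 * 10, 2)
fg_anchors, fg_scores, fg_idx = select_topk_anchors(scores, anchors, k=20)
\end{verbatim}
During training, $K$ is simply set to $H^{l}\times W^{l}$, so the call reduces to keeping every anchor.
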
\begin{figure}[t]
\centering
@@ -266,7 +266,7 @@ where $\boldsymbol{w}_{k}\in \mathbb{R}^{N}$ represents the learnable aggregate
\begin{figure}[t]
\centering
\includegraphics[width=0.9\linewidth]{thesis_figure/gnn.png} % replace with your image file name
\caption{The graph construction in the O2O classification head. Each anchor is conceived as a node within the graph, with the associated ROI feature $\left\{\boldsymbol{F}_i^{roi}\right\}$ as the node feature. The interconnecting directed edges are established based on the scores emanating from the O2M classification head and the anchor geometric prior. In the illustration, the elements $A_{12}$, $A_{32}$ and $A_{54}$ are equal to $1$ in the adjacency matrix $\boldsymbol{A}$, which indicates the existence of directed edges between the corresponding node pairs (\textit{i.e.}, $1\rightarrow2$, $3\rightarrow2$ and $5\rightarrow4$).}
\label{o2o_cls_head}
\end{figure}

@@ -399,10 +399,10 @@ For Tusimple, the evaluation is formulated as follows:
\begin{align}
Accuracy=\frac{\sum{C_{clip}}}{\sum{S_{clip}}}.
\end{align}
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the False Positive Rate ($FPR=1-Precision$) and the False Negative Rate ($FNR=1-Recall$).

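A rough sketch of this per-lane accuracy check (illustrative Python only; the official TuSimple evaluation script handles lane matching and invalid points in more detail):
\begin{verbatim}
def lane_accuracy(pred_xs, gt_xs, thr=20):
    # pred_xs, gt_xs: x-coordinates of a matched prediction/ground-truth
    # pair sampled at the same y positions; None marks a missing point.
    correct = sum(1 for p, g in zip(pred_xs, gt_xs)
                  if p is not None and g is not None and abs(p - g) < thr)
    total = sum(1 for g in gt_xs if g is not None)
    return correct / max(total, 1)

# A lane is counted as correct when lane_accuracy(...) > 0.85; FPR and FNR
# are then 1 - precision and 1 - recall over all lanes.
\end{verbatim}
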
\subsection{Implementation Details}
All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficient of the cost function, $\beta$, is set to 6. The whole model (including LPM and GPM) is trained end-to-end in a single step, just like \cite{adnet}\cite{srlane}. All the experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet\cite{resnet} and DLA34\cite{dla}. Other details of the datasets and the training process can be found in Appendix \ref{vis_appendix}.

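A minimal sketch of this optimization setup (PyTorch-style; the warm-up length and total iteration count are assumptions, not values reported above):
\begin{verbatim}
import math
import torch

model = torch.nn.Linear(10, 2)                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=0.006)
warmup_iters, total_iters = 800, 40000          # assumed schedule lengths

def lr_lambda(it):
    if it < warmup_iters:                       # linear warm-up
        return it / max(warmup_iters, 1)
    p = (it - warmup_iters) / max(total_iters - warmup_iters, 1)
    return 0.5 * (1.0 + math.cos(math.pi * p))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
\end{verbatim}
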
\begin{table*}[htbp]
@@ -472,7 +472,7 @@ All input images are cropped and resized to $800\times320$. Similar to \cite{clr
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{lrcccc}
\toprule
\textbf{Method}& \textbf{Backbone}& \textbf{Acc(\%)}&\textbf{F1(\%)}&\textbf{FPR(\%)}&\textbf{FNR(\%)} \\
\midrule
SCNN\cite{scnn} &VGG16 &96.53&95.97&6.17&\textbf{1.80}\\
PolyLanenet\cite{polylanenet}&EfficientNetB0&93.36&90.62&9.42&9.33\\
@@ -737,7 +737,7 @@ We also explore the stop-gradient strategy for the O2O classification head. As s

\begin{table}[h]
\centering
\caption{The ablation study for the stop-gradient strategy on CULane test set.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{c|c|lll}
\toprule
@@ -825,35 +825,34 @@ In this paper, we propose Polar R-CNN to address two key issues in anchor-based
\renewcommand{\thefigure}{A\arabic{figure}}
\renewcommand{\thesection}{A\arabic{section}}
\renewcommand{\theequation}{A\arabic{equation}}
\section{The Design Principles of the One-to-One Classification Head}
Two fundamental prerequisites of the NMS-free framework are the label assignment strategy and the structure of the classification head.

As for the label assignment strategy, previous works use one-to-many label assignments such as SimOTA\cite{yolox}. One-to-many label assignment makes the detection head produce redundant predictions for each ground truth, resulting in the need for NMS postprocessing. Thus, some works \cite{detr}\cite{learnNMS} proposed one-to-one label assignment, such as the Hungarian algorithm, which forces the model to predict a single positive sample for each ground truth.

However, directly using one-to-one label assignment damages the learning of the model, and plain structures such as MLPs and CNNs struggle to learn the ``one-to-one'' characteristics, resulting in decreased performance compared to one-to-many label assignment with NMS postprocessing\cite{yolov10}\cite{o2o}. Consider a trivial example: let $\boldsymbol{F}^{roi}_{i}$ denote the ROI features extracted from the $i$-th anchor, and suppose the model is trained with one-to-one label assignment. Assuming that the $i$-th anchor and the $j$-th anchor are both close to the ground truth and overlap with each other, we have:
\begin{align}
\boldsymbol{F}_{i}^{roi}\approx \boldsymbol{F}_{j}^{roi}.
\end{align}
This indicates that the RoI pooling features of the two anchors are similar. Suppose that $\boldsymbol{F}^{roi}_{i}$ is designated as a positive sample while $\boldsymbol{F}^{roi}_{j}$ is designated as a negative sample; the ideal outcome should then be:
\begin{align}
f_{cls}^{plain}\left( \boldsymbol{F}_{i}^{roi} \right) &\rightarrow 1,
\\
f_{cls}^{plain}\left( \boldsymbol{F}_{j}^{roi} \right) &\rightarrow 0,
\label{sharp fun}
\end{align}
where $f_{cls}^{plain}$ represents a classification head with a plain architecture. Eq. \ref{sharp fun} implies that $f_{cls}^{plain}$ needs to be ``sharp'' enough to differentiate between two nearly identical features; in other words, its output must change rapidly over a very short distance in feature space. This ``sharp'' pattern is hard to learn for plain MLPs or CNNs, and a similar issue is also mentioned in \cite{o3d}. Consequently, new heuristic structures like \cite{o3d}\cite{relationnet} need to be designed.

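The following toy snippet (illustrative only, not from the paper's code) makes this concrete: a randomly initialized plain MLP maps two nearly identical ROI feature vectors to nearly identical scores, so forcing targets of $1$ and $0$ on such a pair demands an extremely sharp decision function:
\begin{verbatim}
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1), nn.Sigmoid())
f_i = torch.randn(64)
f_j = f_i + 1e-3 * torch.randn(64)        # F_i^roi ~ F_j^roi
print(mlp(f_i).item(), mlp(f_j).item())   # outputs differ only marginally
\end{verbatim}
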
We draw inspiration from Fast NMS \cite{yolact} for the design of the O2O classification head. Fast NMS is an iteration-free postprocessing algorithm derived from traditional NMS. Furthermore, we incorporate a sort-free strategy along with geometric priors into Fast NMS, with the specifics delineated in Algorithm \ref{Graph Fast NMS}.

\begin{algorithm}[t]
\caption{Fast NMS with Geometric Prior.}
\begin{algorithmic}[1] % the [1] makes every line numbered
\REQUIRE ~~\\ % Input:
The indices of positive predictions, $1, 2, ..., i, ..., N_{pos}$;\\
The indices of all anchors, $1, 2, ..., i, ..., K$;\\
The corresponding positive anchors, $\left\{ \theta _i,r_{i}^{g} \right\} |_{i=1}^{K}$;\\
The x-axis coordinates of the sampling points from positive anchors, $\boldsymbol{x}_{i}^{s}$;\\
The confidence emanating from the O2M classification head, $s_i^g$;\\
The regressions emanating from the O2M regression head, denoted as $\left\{ Lane_i \right\} |_{i=1}^{K}$.\\
\ENSURE ~~\\ % Output:
\STATE Calculate the confidence comparison matrix $\boldsymbol{A}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
\begin{align}
@@ -863,7 +862,6 @@ We decided to choose the Fast NMS \cite{yolact} as the inspiration of the design
\end{cases}
\label{confidential matrix}
\end{align}
where $\land$ denotes the (element-wise) logical ``AND'' operation between two Boolean values/tensors.
\STATE Calculate the geometric prior matrix $\boldsymbol{A}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align}
A_{ij}^{G}=\begin{cases}
@@ -874,7 +872,7 @@ We decided to choose the Fast NMS \cite{yolact} as the inspiration of the design
\end{align}
\STATE Calculate the inverse distance matrix $\boldsymbol{D} \in \mathbb{R} ^{K \times K}$. The element $D_{ij}$ in $\boldsymbol{D}$ is defined as follows:
\begin{align}
D_{ij}=d^{-1}\left( Lane_i,Lane_j \right) ,
\label{al_1-3}
\end{align}
where $d\left(\cdot, \cdot \right)$ is a predefined function that quantifies the distance between two lane predictions, such as IoU.
@@ -890,42 +888,41 @@ We decided to choose the Fast NMS \cite{yolact} as the inspiration of the design
\begin{align}
\varOmega_{nms}^{pos}=\left\{ i|s_{i}^{g}>\lambda _{o2m}^{s}\,\,and\,\,\tilde{s}_{i}^{g}=1 \right\}
\end{align}
where a retained prediction should satisfy both of the above criteria.

\RETURN The final result $\varOmega_{nms}^{pos}$.
\end{algorithmic}
\label{Graph Fast NMS}
\end{algorithm}

The new algorithm has a distinct format from the original one\cite{yolact}. The geometric prior $\boldsymbol{A}^{G}$ indicates that predictions associated with sufficiently proximate anchors are likely to suppress one another. It is straightforward to demonstrate that, when all elements within $\boldsymbol{A}^{G}$ are set to 1 (disregarding geometric priors), Algorithm \ref{Graph Fast NMS} is equivalent to Fast NMS. Building upon our newly proposed sort-free Fast NMS with geometric prior, we can design the structure of the one-to-one classification head.

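A rough, non-authoritative sketch of this suppression rule in PyTorch follows. Since parts of the definitions of $\boldsymbol{A}^{C}$, $\boldsymbol{A}^{G}$ and the final thresholding lie outside the diff context shown above, the concrete comparisons (score ordering, anchor-proximity test, distance threshold) are assumptions for illustration:
\begin{verbatim}
import torch

def fast_nms_geo(scores, thetas, radii, dist, s_thr, geo_thr, d_thr):
    # scores: (K,); thetas, radii: (K,) anchor parameters;
    # dist: (K, K) pairwise lane distance (e.g., 1 - IoU).
    a_conf = scores.unsqueeze(1) > scores.unsqueeze(0)   # A^C: [j, i] = s_j > s_i
    close = (torch.abs(thetas.unsqueeze(1) - thetas.unsqueeze(0)) < geo_thr[0]) \
          & (torch.abs(radii.unsqueeze(1) - radii.unsqueeze(0)) < geo_thr[1])
    adj = a_conf & close                                  # A = A^C AND A^G
    suppressed = ((dist < d_thr) & adj).any(dim=0)        # an admissible j is too close
    keep = (scores > s_thr) & ~suppressed                 # dual condition
    return torch.nonzero(keep).squeeze(1)
\end{verbatim}
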
The principal limitations of NMS lie in the definition of distance derived from geometry (\textit{i.e.}, Eq. \ref{al_1-3}) and the threshold $\lambda^{g}$ employed to eliminate redundant predictions (\textit{i.e.}, Eq. \ref{al_1-4}). For instance, in the scenario of double lines, despite the minimal geometric distance between the two lanes, their semantics are strikingly distinct. Consequently, we replace the above two steps with trainable neural networks, allowing them to learn the semantic distance in a data-driven fashion. The neural network blocks replacing Eq. \ref{al_1-3} are expressed as:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( \boldsymbol{W}_{roi}\boldsymbol{F}_{i}^{roi}+\boldsymbol{b}_{roi} \right) ,\label{edge_layer_1_appendix}\\
\boldsymbol{F}_{ij}^{edge}&\gets \boldsymbol{W}_{in}\tilde{\boldsymbol{F}}_{j}^{roi}-\boldsymbol{W}_{out}\tilde{\boldsymbol{F}}_{i}^{roi},\label{edge_layer_2_appendix}\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\boldsymbol{W}_s\left( \boldsymbol{x}_{j}^{s}-\boldsymbol{x}_{i}^{s} \right) +\boldsymbol{b}_s,\label{edge_layer_3_appendix}\\
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .\label{edge_layer_4_appendix}
\end{align}
where the inverse distance $\boldsymbol{D}_{ij}^{edge}\in\mathbb{R}^{d}$ is no longer a scalar but a tensor. The replacement of Eq. \ref{al_1-4} is constructed as follows:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{k\in \left\{ k|A_{ki}=1 \right\}}{\max}\boldsymbol{D}_{ki}^{edge},
\\
\boldsymbol{F}_{i}^{node}&\gets \mathrm{MLP}_{node}\left( \boldsymbol{D}_{i}^{node} \right) ,
\\
\tilde{s}_{i}^{g}&\gets \sigma \left( \boldsymbol{W}_{node}\boldsymbol{F}_{i}^{node} + \boldsymbol{b}_{node} \right).
\label{node_layer_appendix}
\end{align}

In this expression, we use element-wise max pooling of tensors instead of scalar-based max operations. By eliminating the need for a predetermined distance threshold $\lambda^s_d$, the implicit decision surface is learned from data by the neural network. Furthermore, since the score $\tilde{s}_{i}^{g}$ transitions from a binary value to a continuous soft score ranging from 0 to 1, we introduce a threshold $\lambda^s_{o2o}$ within the selection criterion:
\begin{align}
\varOmega_{nms}^{pos}=\left\{ i|s_{i}^{g}>\lambda _{o2m}^{s}\,\,and\,\,\tilde{s}_{i}^{g}>\lambda^s_{o2o}\right\},
\end{align}
which is referred to as the \textit{dual confidence selection} in the main text.
\label{NMS_appendix}

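A compact sketch of these trainable blocks (Eqs. \ref{edge_layer_1_appendix}--\ref{node_layer_appendix}) in PyTorch; the feature dimensions, the adjacency construction, and the module name are illustrative assumptions:
\begin{verbatim}
import torch
import torch.nn as nn

class O2ODistanceBlock(nn.Module):
    def __init__(self, roi_dim=64, edge_dim=16, num_pts=36):
        super().__init__()
        self.w_roi = nn.Linear(roi_dim, roi_dim)
        self.w_in = nn.Linear(roi_dim, edge_dim, bias=False)
        self.w_out = nn.Linear(roi_dim, edge_dim, bias=False)
        self.w_s = nn.Linear(num_pts, edge_dim)
        self.mlp_edge = nn.Sequential(nn.Linear(edge_dim, edge_dim), nn.ReLU(),
                                      nn.Linear(edge_dim, edge_dim))
        self.mlp_node = nn.Sequential(nn.Linear(edge_dim, edge_dim), nn.ReLU())
        self.w_node = nn.Linear(edge_dim, 1)

    def forward(self, f_roi, xs, adj):
        # f_roi: (K, roi_dim); xs: (K, num_pts) sampled x-coords;
        # adj: (K, K) bool, adj[k, i] = True if anchor k may suppress anchor i.
        f = torch.relu(self.w_roi(f_roi))                              # edge_layer_1
        edge = self.w_in(f).unsqueeze(0) - self.w_out(f).unsqueeze(1)  # edge_layer_2
        edge = edge + self.w_s(xs.unsqueeze(0) - xs.unsqueeze(1))      # edge_layer_3
        d_edge = self.mlp_edge(edge)                                   # edge_layer_4
        # Element-wise max over admissible in-edges replaces the scalar max.
        d_edge = d_edge.masked_fill(~adj.unsqueeze(-1), float('-inf'))
        d_node = torch.nan_to_num(d_edge.max(dim=0).values, neginf=0.0)
        return torch.sigmoid(self.w_node(self.mlp_node(d_node))).squeeze(-1)
\end{verbatim}
The returned soft score $\tilde{s}_{i}^{g}$ would then be combined with $s_{i}^{g}$ via the dual confidence selection above.
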
\begin{table*}[htbp]
\centering
\caption{Information and hyperparameters for the five datasets. For CULane, $*$ denotes the actual number of training samples used to train our model. Labels for some validation/test sets are missing; therefore, different splits (test or validation set) are selected for different datasets.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{l|l|ccccc}
\toprule
@@ -957,6 +954,7 @@ In this expression, we use element-wise max pooling of tensors instead of scalar
\multirow{5}*{Evaluation Hyperparameter}
& $H^{l}\times W^{l}$ &$4\times10$&$4\times10$&$4\times10$&$4\times10$&$6\times13$\\
& $K$ &20&20&20&12&50\\
& $d$ &5&8&10&5&5\\
& $C_{o2m}$ &0.48&0.40&0.40&0.40&0.45\\
& $C_{o2o}$ &0.46&0.46&0.46&0.46&0.44\\
\bottomrule
@@ -966,7 +964,7 @@ In this expression, we use element-wise max pooling of tensors instead of scalar
\end{table*}
\section{The IoU Definitions for Lane Instances}
To make the function more compact and consistent with general object detection works \cite{iouloss}\cite{giouloss}, we have redefined the lane IoU. As illustrated in Fig. \ref{glaneiou}, the newly-defined lane IoU, which we refer to as GLaneIoU, is formulated as follows:
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/GLaneIoU.png} % replace with your image file name
@@ -988,13 +986,10 @@ where $w_{b}$ is the base semi-width of the lane instance. The definations of $d
\begin{align}
GIoU_{lane}\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}},
\end{align}
where $j$ and $k$ are the indices of the start point and the end point, respectively. It is straightforward to observe that when $g=0$, $GIoU_{lane}$ corresponds to the IoU for bounding boxes, with a value range of $\left[0, 1 \right]$. When $g=1$, $GIoU_{lane}$ corresponds to the GIoU\cite{giouloss} for bounding boxes, with a value range of $\left(-1, 1 \right]$. In general, when $g>0$, the value range of $GIoU_{lane}$ is $\left(-g, 1 \right]$. We set $g=0$ for the cost function and the IoU matrix in SimOTA, and $g=1$ for the loss function.
\label{giou_appendix}

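A rough numerical sketch of GLaneIoU under assumed per-point definitions (each sampled lane point $x_i$ is widened to the interval $[x_i-w_b,\,x_i+w_b]$, and $d^{\mathcal{O}}_i$, $d^{\xi}_i$, $d^{\mathcal{U}}_i$ are taken as the per-row overlap, gap, and union lengths; the exact definitions lie outside the diff shown above):
\begin{verbatim}
import torch

def glane_iou(xs_a, xs_b, w_b=2.5, g=1.0):
    # xs_a, xs_b: (N,) x-coords of two lanes sampled at the same rows
    # (restricted to the valid rows j..k).
    lo_a, hi_a = xs_a - w_b, xs_a + w_b
    lo_b, hi_b = xs_b - w_b, xs_b + w_b
    overlap = (torch.min(hi_a, hi_b) - torch.max(lo_a, lo_b)).clamp(min=0)
    gap = (torch.max(lo_a, lo_b) - torch.min(hi_a, hi_b)).clamp(min=0)
    union = torch.max(hi_a, hi_b) - torch.min(lo_a, lo_b)
    return overlap.sum() / union.sum() - g * gap.sum() / union.sum()

# g = 0 gives an IoU-like value in [0, 1]; g = 1 gives a GIoU-like value in (-1, 1].
\end{verbatim}
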
\begin{figure*}[htbp]
\centering
\def\pagewidth{0.49\textwidth}
\def\subwidth{0.47\linewidth}
@@ -1132,7 +1127,7 @@ where j and k are the indices of the valid points (the start point and the end p
\begin{figure*}[htbp]
\centering
\def\subwidth{0.24\textwidth}
\def\imgwidth{\linewidth}
@@ -1230,6 +1225,12 @@ where j and k are the indices of the valid points (the start point and the end p
\caption{The visualization of the detection results of sparse and dense scenarios on CurveLanes dataset.}
\label{vis_dense}
\end{figure*}
\section{The Supplement of Implementation Details and Visualization Results}
Some important implementation details for each dataset are shown in Table \ref{dataset_info}.

Fig. \ref{vis_sparse} shows the visualization results for sparse scenarios across four datasets. LPH effectively proposes anchors that are clustered around the ground truth, providing a robust prior for the RoI stage to produce the final lane predictions. Moreover, the number of anchors is significantly reduced compared to previous works while maintaining accurate localization around the ground truth, making our method theoretically faster than other anchor-based methods.

Fig. \ref{vis_dense} shows the visualization results for dense scenarios. We observe that NMS@50 mistakenly removes some predictions, leading to false negatives, while NMS@15 fails to eliminate redundant predictions, resulting in false positives. This highlights the inherent trade-off between a large IoU threshold and a small one. The visualization clearly demonstrates that geometric distance becomes less effective in dense scenarios. Only the O2O classification head, driven by data, can address this issue by capturing semantic distance beyond geometric distance. As shown in the last column of Fig. \ref{vis_dense}, the O2O classification head successfully eliminates redundant predictions while retaining dense predictions with small geometric distances.
\label{vis_appendix}

\end{document}