From 23c6c540f958511afbbe1abf4d65f655a0075870 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E7=8E=8B=E8=80=81=E6=9D=BF?=
Date: Thu, 3 Oct 2024 19:48:47 +0800
Subject: [PATCH] update

---
 main.tex | 272 ++++++++++++++++++++++---------------------------------
 1 file changed, 108 insertions(+), 164 deletions(-)

diff --git a/main.tex b/main.tex
index ac222b4..4fa1e49 100644
--- a/main.tex
+++ b/main.tex
@@ -269,7 +269,7 @@ To ensure both simplicity and efficiency in our model, the O2M regression head a
 \label{o2o_cls_head}
 \end{figure}

-As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture that incorporates a \textit{graph neural network} \cite{gnn} (GNN) with a polar geometric prior, which we refer to as the Polar GNN. The Polar GNN is designed to model the relationship between features $\boldsymbol{F}_{i}^{roi}$ sampled from different anchors. Based on our previous analysis, the distance between lanes should not only be modeled by explicit geometric properties but also consider implicit contextual semantics such as “double” and “forked” lanes. These types of lanes, despite their tiny geometric differences, should not be removed as redundant predictions. The structural insight of the Polar GNN is derived from Fast NMS \cite{yolact}, which operates without iterative processes. The detailed design can be found in the appendix; here, we focus on elaborating the architecture of the Polar GNN.
+As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture that incorporates a \textit{graph neural network} \cite{gnn} (GNN) with a polar geometric prior, which we refer to as the Polar GNN. The Polar GNN is designed to model the relationship between features $\boldsymbol{F}_{i}^{roi}$ sampled from different anchors. Based on our previous analysis, the distance between lanes should not only be modeled by explicit geometric properties but also consider implicit contextual semantics such as “double” and “forked” lanes. These types of lanes, despite their tiny geometric differences, should not be removed as redundant predictions. The structural insight of the Polar GNN is derived from Fast NMS \cite{yolact}, which operates without iterative processes. The detailed design can be found in Appendix \ref{NMS_appendix}; here, we focus on elaborating the architecture of the Polar GNN.

 In the Polar GNN, each anchor is conceptualized as a node, with the ROI features $\boldsymbol{F}_{i}^{roi}$ serving as the attributes of these nodes. A pivotal component of the GNN is the edge, represented by the adjacency matrix. This matrix is derived from three submatrices. The first component is the positive selection matrix, denoted as $\mathbf{M}^{P}\in\mathbb{R}^{K\times K}$:
 \begin{align}
@@ -293,7 +293,7 @@ This matrix facilitates the comparison of scores for each pair of anchors.
 The third component is the geometric prior matrix, denoted by $\mathbf{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as:
 \begin{align}
 	M_{ij}^{G}=\begin{cases}
 		1, & \left| \theta _i-\theta _j \right|<\theta _{\tau}\land \left| r_{i}^{global}-r_{j}^{global} \right|<r_{\tau},\\
 		0, & \text{otherwise},
 	\end{cases}
 \end{align}
-In general, when $g>0$, the value range of GLaneIoU is $\left(-g, 1 \right]$.
-We then define the cost function between the $i$-th prediction and the $j$-th ground truth, following \cite{detr}:
- \begin{align}
- \mathcal{C} _{ij}=\left(s_i\right)^{\beta_c}\times \left( GLaneIoU_{ij, g=0} \right) ^{\beta_r}.
-\end{align}
-
-This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet} and takes both location and confidence into account.
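Before moving on to label assignment, the following is a minimal NumPy sketch of how the three submatrices of the Polar GNN adjacency described above could be assembled. Only the geometric prior $\mathbf{M}^{G}$ is fully visible in this hunk; the particular forms of the positive-selection and score-comparison submatrices, the elementwise-product combination, and all threshold values below are illustrative assumptions in the spirit of Fast NMS \cite{yolact}, not the exact design.

\begin{verbatim}
# Illustrative sketch; the forms of the first two submatrices and all
# thresholds are assumptions, only M^G follows the equation above.
import numpy as np

def polar_gnn_adjacency(scores, theta, r_global,
                        score_thr=0.5, theta_thr=0.1, r_thr=20.0):
    """Assemble the K x K binary adjacency matrix from three submatrices."""
    K = scores.shape[0]
    # Positive selection: only anchors passing the O2M score take part.
    m_pos = np.tile(scores > score_thr, (K, 1))     # entry (i, j) looks at s_j
    # Score comparison: anchor j must outrank anchor i (Fast-NMS style).
    m_cmp = scores[None, :] > scores[:, None]
    # Geometric prior M^G: anchors must be close in polar parameter space.
    m_geo = (np.abs(theta[:, None] - theta[None, :]) < theta_thr) \
            & (np.abs(r_global[:, None] - r_global[None, :]) < r_thr)
    return (m_pos & m_cmp & m_geo).astype(np.float32)

# Example: 3 anchors, the first two nearly coincident in (theta, r).
A = polar_gnn_adjacency(np.array([0.9, 0.8, 0.3]),
                        np.array([0.50, 0.52, 1.20]),
                        np.array([100.0, 105.0, 240.0]))
\end{verbatim}

In this toy example, the only nonzero entry is the edge from the lower-scoring of the two coincident anchors towards the higher-scoring one, which is the kind of suppression relation the O2O head is meant to learn rather than hard-code.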
 For label assignment, SimOTA (with $k=4$) \cite{yolox} is used for the two O2M heads with one-to-many assignment, while the Hungarian algorithm \cite{detr} is employed for the O2O classification head with one-to-one assignment.
-
-
-\textbf{Loss function.} We use the focal loss \cite{focal} for the O2O and O2M classification heads:
- \begin{align}
- \mathcal{L} _{o2m}^{cls}&=\sum_{i\in \varOmega _{pos}^{o2m}}{\alpha _{o2m}\left( 1-s_i \right) ^{\gamma}\log \left( s_i \right)}\\&+\sum_{i\in \varOmega _{neg}^{o2m}}{\left( 1-\alpha _{o2m} \right) \left( s_i \right) ^{\gamma}\log \left( 1-s_i \right)},
- \\
- \mathcal{L} _{o2o}^{cls}&=\sum_{i\in \varOmega _{pos}^{o2o}}{\alpha _{o2o}\left( 1-\tilde{s}_i \right) ^{\gamma}\log \left( \tilde{s}_i \right)}\\&+\sum_{i\in \varOmega _{neg}^{o2o}}{\left( 1-\alpha _{o2o} \right) \left( \tilde{s}_i \right) ^{\gamma}\log \left( 1-\tilde{s}_i \right)}.
- \end{align}
-where the one-to-one sample sets, $\varOmega _{pos}^{o2o}$ and $\varOmega _{neg}^{o2o}$, are restricted to the positive sample set of the O2M classification head:
- \begin{align}
- \varOmega _{pos}^{o2o}\cup \varOmega _{neg}^{o2o}=\left\{ i|s_i>C_{o2m} \right\}.
- \end{align}
-
-Only samples with confidence larger than $C_{o2m}$ are chosen as candidate samples for the O2O classification head. According to \cite{pss}, to maintain feature quality during the training stage, the gradients of the O2O classification head are stopped from propagating back to the rest of the network (stopped at the ROI features of the anchors, $\boldsymbol{F}_{i}^{roi}$). Additionally, we use a rank loss to increase the gap between the positive and negative confidences of the O2O classification head:
+The fundamental shortcomings of NMS are the definition of distance based solely on geometry (\textit{i.e.}, Eq. \ref{al_1-3}) and the threshold $\tau_g$ used to remove redundant predictions (\textit{i.e.}, Eq. \ref{al_1-4}). Thus, we replace these two steps with trainable neural networks.
+To help the model learn a distance that contains both explicit geometric information and implicit semantic information, the block replacing Eq. \ref{al_1-3} is expressed as:
 \begin{align}
-	&\mathcal{L} _{rank}=\frac{1}{N_{rank}}\sum_{i\in \varOmega ^{pos}_{o2o}}{\sum_{j\in \varOmega ^{neg}_{o2o}}{\max \left( 0, \tau _{rank}-\tilde{s}_i+\tilde{s}_j \right)}},\\
-	&N_{rank}=\left| \varOmega ^{pos}_{o2o} \right|\left| \varOmega ^{neg}_{o2o} \right|.
+	\tilde{\boldsymbol{F}}_{i}^{roi} & \gets \mathrm{ReLU}\left( FC_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right), \\
+	\boldsymbol{F}_{ij}^{edge} & \gets FC_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) - FC_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) + FC_{b}\left( \varDelta \boldsymbol{x}_{ij}^{b} \right), \\
+	\boldsymbol{D}_{ij}^{edge} & \gets MLP_{edge}\left( \boldsymbol{F}_{ij}^{edge} \right).
+\label{edge_layer_appendix}
+\end{align}
+where the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor with dimension $d_{dis}$. The replacement of Eq. \ref{al_1-4} is constructed as follows:
+\begin{align}
+	\boldsymbol{D}_{i}^{node}&\gets \underset{j\in \left\{ j|M_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
+	\\
+	\boldsymbol{F}_{i}^{node}&\gets MLP_{node}\left( \boldsymbol{D}_{i}^{node} \right),
+	\\
+	\tilde{s}_i&\gets \sigma \left( FC_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
+\label{node_layer_appendix}
 \end{align}
+In this expression, we use elementwise max pooling of tensors instead of scalar-based max operations. By eliminating the need for a predefined distance threshold $\tau_g$, the true implicit decision surface is learned from the data by the neural network.
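As a concrete reference, the following is a minimal PyTorch-style sketch of the two blocks above (the edge layer replacing Eq. \ref{al_1-3} and the node layer replacing Eq. \ref{al_1-4}). It is an illustrative sketch rather than the exact implementation: the hidden sizes, the depth of $MLP_{edge}$ and $MLP_{node}$, and the interpretation of $\varDelta \boldsymbol{x}_{ij}^{b}$ as the pairwise offsets of the anchors' polar parameters are assumptions.

\begin{verbatim}
# Illustrative sketch only; layer sizes and module names are assumptions.
import torch
import torch.nn as nn

class PolarGNNO2OHead(nn.Module):
    def __init__(self, roi_dim=64, d_dis=16):
        super().__init__()
        self.fc_roi   = nn.Linear(roi_dim, roi_dim)      # FC_{o2o}^{roi}
        self.fc_in    = nn.Linear(roi_dim, d_dis)        # FC_{in}
        self.fc_out   = nn.Linear(roi_dim, d_dis)        # FC_{out}
        self.fc_b     = nn.Linear(2, d_dis)              # FC_{b}, polar-offset term
        self.mlp_edge = nn.Sequential(nn.Linear(d_dis, d_dis), nn.ReLU(),
                                      nn.Linear(d_dis, d_dis))
        self.mlp_node = nn.Sequential(nn.Linear(d_dis, d_dis), nn.ReLU(),
                                      nn.Linear(d_dis, d_dis))
        self.fc_score = nn.Linear(d_dis, 1)              # FC_{o2o}^{out}

    def forward(self, f_roi, delta_xb, adj):
        # f_roi: (K, roi_dim) ROI features; delta_xb: (K, K, 2) pairwise
        # anchor offsets; adj: (K, K) binary adjacency matrix M.
        f_tilde = torch.relu(self.fc_roi(f_roi))
        f_edge = (self.fc_in(f_tilde).unsqueeze(1)       # varies with i, broadcast over j
                  - self.fc_out(f_tilde).unsqueeze(0)    # varies with j, broadcast over i
                  + self.fc_b(delta_xb))                 # (K, K, d_dis)
        d_edge = self.mlp_edge(f_edge)                   # semantic "inverse distance"
        # Elementwise max pooling over admissible neighbours {j | M_ij = 1},
        # replacing the scalar comparison against a fixed threshold tau_g.
        # (Rows of M with no admissible neighbour need special handling.)
        d_masked = d_edge.masked_fill(adj.unsqueeze(-1) == 0, float("-inf"))
        d_node = d_masked.amax(dim=1)                    # (K, d_dis)
        return torch.sigmoid(self.fc_score(self.mlp_node(d_node)))  # (K, 1)
\end{verbatim}

Here, adj plays the role of the binary adjacency matrix $M_{ij}$ obtained from the three submatrices described in the main text.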
-We directly use the GLaneIoU loss, $\mathcal{L}_{GLaneIoU}$ (with $g=1$), to regress the offsets of the x-coordinates, and the Smooth-L1 loss, denoted as $\mathcal{L} _{end}$, to regress the end points (namely, the y-coordinates of the start point and the end point). To make the model learn global features, we propose the auxiliary loss illustrated in Fig. \ref{auxloss}:
- \begin{align}
- \mathcal{L}_{aux} &= \frac{1}{\left| \varOmega_{pos}^{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\
- &\quad + l \left( r_{i}^{global} - \hat{r}_{i}^{seg,m} \right) \Bigg].
- \end{align}
-The anchors and the ground truth are divided into several segments. Each anchor segment is regressed to the main components of the corresponding segment of the assigned ground truth. This trick assists the anchors in learning more about the global geometric shape.
-
-\subsection{Loss function}
-
-The overall loss function of Polar R-CNN is given as follows:
- \begin{align}
- \mathcal{L}_{overall} &=\mathcal{L} _{lpm}^{cls}+w_{lpm}^{reg}\mathcal{L} _{lpm}^{reg}\\&+w_{o2m}^{cls}\mathcal{L} _{o2m}^{cls}+w_{o2o}^{cls}\mathcal{L} _{o2o}^{cls}+w_{rank}\mathcal{L} _{rank}\\&+w_{IoU}\mathcal{L} _{IoU}+w_{end}\mathcal{L} _{end}+w_{aux}\mathcal{L} _{aux}.
- \end{align}
-The first line in the loss function represents the loss for the LPH, which includes both classification and regression components. The second line pertains to the losses associated with the two classification heads (O2M and O2O), while the third line represents the loss for the regression head within the triplet head. Each term in the equation is weighted by a factor to balance the contributions of the components to the gradient. The entire training process is end-to-end.
-\section{Title of the 2nd appendix}
-This is the first paragraph of Appx. B ..
+\label{NMS_appendix}

 \begin{table*}[htbp]
 \centering
@@ -1030,14 +952,36 @@ This is the first paragraph of Appx. B ..
 \label{dataset_info}
 \end{table*}

+\section{The Definition of GLaneIoU}
+\textbf{Label assignment and cost function.} To make the formulation more compact and consistent with general object detection works \cite{iouloss}\cite{giouloss}, we redefine the lane IoU. As illustrated in Fig.
\ref{glaneiou}, the newly defined lane IoU, which we refer to as GLaneIoU, is defined as follows:
+\begin{figure}[t]
+	\centering
+	\includegraphics[width=\linewidth]{thesis_figure/GLaneIoU.png} % replace with your image file name
+	\caption{Illustration of the GLaneIoU redefined in our work.}
+	\label{glaneiou}
+\end{figure}
+ \begin{align}
+	&w_{i}^{k}=\frac{\sqrt{\left( \Delta x_{i}^{k} \right) ^2+\left( \Delta y_{i}^{k} \right) ^2}}{\Delta y_{i}^{k}}w_{b},
+	\\
+	&\hat{d}_{i}^{\mathcal{O}}=\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right),
+	\\
+	&\hat{d}_{i}^{\xi}=\max \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right) -\min \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right),
+	\\
+	&d_{i}^{\mathcal{U}}=\max \left( x_{i}^{p}+w_{i}^{p}, x_{i}^{q}+w_{i}^{q} \right) -\min \left( x_{i}^{p}-w_{i}^{p}, x_{i}^{q}-w_{i}^{q} \right),
+	\\
+	&d_{i}^{\mathcal{O}}=\max \left( \hat{d}_{i}^{\mathcal{O}},0 \right), \qquad d_{i}^{\xi}=\max \left( \hat{d}_{i}^{\xi},0 \right),
+ \end{align}
+where $w_{b}$ is the base semi-width of the lane instance. The definitions of $d_{i}^{\mathcal{O}}$ and $d_{i}^{\xi}$ are similar to, but slightly different from, those in \cite{clrnet} and \cite{adnet}, with adjustments made to ensure the values are non-negative. This format is intended to maintain consistency with the IoU definitions used for bounding boxes. Therefore, the overall GLaneIoU is given as follows:
+ \begin{align}
+GLaneIoU\,\,=\,\,\frac{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{O}}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}}-g\frac{\sum\nolimits_{i=j}^k{d_{i}^{\xi}}}{\sum\nolimits_{i=j}^k{d_{i}^{\mathcal{U}}}},
+\end{align}
+where $j$ and $k$ are the indices of the valid points (the start point and the end point). It is straightforward to observe that when $g=0$, GLaneIoU corresponds to the IoU for bounding boxes, with a value range of $\left[0, 1 \right]$; when $g=1$, it corresponds to the GIoU \cite{giouloss} for bounding boxes, with a value range of $\left(-1, 1 \right]$. In general, when $g>0$, the value range of GLaneIoU is $\left(-g, 1 \right]$. We set $g=0$ for the cost function and the IoU matrix in SimOTA, and $g=1$ for the loss function.
+\label{giou_appendix}
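As a quick, self-contained illustration of the definition above, the following is a minimal NumPy sketch of GLaneIoU for two lanes sampled at the same y-positions. The finite-difference computation of the per-point semi-widths and the concrete numbers are illustrative assumptions, not the training configuration.

\begin{verbatim}
# Illustrative sketch; sampling, widths and numbers are assumptions.
import numpy as np

def point_semi_width(xs, ys, w_base):
    """Per-point semi-width w_i = sqrt(dx^2 + dy^2) / dy * w_b."""
    dx, dy = np.gradient(xs), np.gradient(ys)
    return np.sqrt(dx ** 2 + dy ** 2) / dy * w_base

def glane_iou(x_p, w_p, x_q, w_q, g=0.0):
    """GLaneIoU over the valid points (indices j..k in the paper)."""
    d_over_hat = np.minimum(x_p + w_p, x_q + w_q) - np.maximum(x_p - w_p, x_q - w_q)
    d_gap_hat  = np.maximum(x_p - w_p, x_q - w_q) - np.minimum(x_p + w_p, x_q + w_q)
    d_union    = np.maximum(x_p + w_p, x_q + w_q) - np.minimum(x_p - w_p, x_q - w_q)
    d_over, d_gap = np.maximum(d_over_hat, 0.0), np.maximum(d_gap_hat, 0.0)
    return d_over.sum() / d_union.sum() - g * d_gap.sum() / d_union.sum()

# Two nearly parallel lanes: g = 0 behaves like IoU, g = 1 like GIoU.
ys = np.linspace(160.0, 320.0, 36)
lane_a, lane_b = 0.3 * ys + 5.0, 0.3 * ys + 9.0
wid_a = point_semi_width(lane_a, ys, w_base=7.5)
wid_b = point_semi_width(lane_b, ys, w_base=7.5)
print(glane_iou(lane_a, wid_a, lane_b, wid_b, g=0.0),
      glane_iou(lane_a, wid_a, lane_b, wid_b, g=1.0))
\end{verbatim}

With $g=0$ the value stays within $[0, 1]$, while $g=1$ additionally penalizes the clipped gap term, matching the GIoU-style range $(-1, 1]$ described above.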
+\section{Supplementary Implementation Details and Visualization Results}
+\textbf{Visualization.} Some important implementation details for each dataset are shown in Table \ref{dataset_info}. Fig. \ref{vis_sparse} displays the predictions for sparse scenarios across four datasets. The LPH effectively proposes anchors that are clustered around the ground truth, providing a robust prior for the RoI stage to achieve the final lane predictions. Moreover, the number of anchors is significantly decreased compared to previous works, which, in theory, makes our method faster than other anchor-based methods. Fig. \ref{vis_dense} shows the predictions for dense scenarios. We observe that NMS@50 mistakenly removes some predictions, leading to false negatives, while NMS@15 fails to eliminate redundant predictions, resulting in false positives. This highlights the trade-off between using a large IoU threshold and a small IoU threshold. The visualization clearly demonstrates that geometric distance becomes less effective in dense scenarios. Only the O2O classification head, driven by data, can address this issue by capturing semantic distance beyond geometric distance. As shown in Fig. \ref{vis_dense}, the O2O classification head successfully eliminates redundant predictions while retaining true dense predictions with small geometric distances.
-\textbf{Visualization.} We present the Polar R-CNN predictions for both sparse and dense scenarios. Fig. \ref{vis_sparse} displays the predictions for sparse scenarios across four datasets. LPH effectively proposes anchors that are clustered around the ground truth, providing a robust prior for the RoI stage to achieve the final lane predictions. Moreover, the number of anchors has significantly decreased compared to previous works, making our method faster than other anchor-based methods in theory. Fig. \ref{vis_dense} shows the predictions for dense scenarios. We observe that NMS@50 mistakenly removes some predictions, leading to false negatives, while NMS@15 fails to eliminate redundant predictions, resulting in false positives. This highlights the trade-off between using a large IoU threshold and a small IoU threshold. The visualization clearly demonstrates that geometric distance becomes less effective in dense scenarios. Only the O2O classification head, driven by data, can address this issue by capturing semantic distance beyond geometric distance. As shown in Fig. \ref{vis_dense}, the O2O classification head successfully eliminates redundant true predictions while retaining dense predictions with small geometric distances.
-
-
-
-\begin{figure*}[htbp]
+\begin{figure*}[t]
 \centering
 \def\pagewidth{0.49\textwidth}
 \def\subwidth{0.47\linewidth}
@@ -1175,7 +1119,7 @@ This is the first paragraph of Appx. B ..
-\begin{figure*}[htbp!]
+\begin{figure*}[t]
 \centering
 \def\subwidth{0.24\textwidth}
 \def\imgwidth{\linewidth}
@@ -1273,6 +1217,6 @@ This is the first paragraph of Appx. B ..
 \caption{The visualization of the detection results of sparse and dense scenarios on the CurveLanes dataset.}
 \label{vis_dense}
 \end{figure*}
-
+\label{vis_appendix}
 \end{document}
\ No newline at end of file