This commit is contained in:
ShqWW 2024-10-11 17:54:40 +08:00
parent a2e56810de
commit abdd0c2843
3 changed files with 50 additions and 48 deletions


\begin{align}
F_{cls}\gets \phi _{cls}^{l}\left( F_d \right)\ &\text{and}\ F_{cls}\in \mathbb{R} ^{H^{l}\times W^{l}}.
\end{align}
The regression branch consists of a single $1\times1$ convolutional layer, with the goal of generating lane anchors by outputting their angles $\theta_j$ and radii $r^{l}_{j}$, \textit{i.e.}, $F_{reg\,\,} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, in the previously introduced local polar coordinate system. Similarly, the classification branch $\phi _{cls}^{l}\left(\cdot \right)$ consists of only two $1\times1$ convolutional layers for simplicity. This branch predicts the confidence heat map $F_{cls\,\,}\equiv \left\{ s_j^l \right\} _{j=1}^{H^l\times W^l}$ for local poles, each associated with a feature point. By discarding local poles with lower confidence, the module increases the likelihood of selecting potential positive foreground lane anchors while effectively removing background lane anchors.
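For concreteness, a minimal PyTorch-style sketch of the two LPM branches is given below; it is illustrative only, and the channel widths, layer names, and module structure are our assumptions rather than the exact released implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class LocalPolarHead(nn.Module):
    # Illustrative sketch: channel sizes and layer names are assumptions.
    def __init__(self, in_ch=64):
        super().__init__()
        # regression branch: one 1x1 conv -> (theta_j, r_j^l) per feature point
        self.reg = nn.Conv2d(in_ch, 2, kernel_size=1)
        # classification branch: two 1x1 convs -> confidence s_j^l per feature point
        self.cls = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 1, kernel_size=1))

    def forward(self, F_d):            # F_d: (B, C, H^l, W^l)
        F_reg = self.reg(F_d)          # (B, 2, H^l, W^l): angle and radius
        F_cls = self.cls(F_d)          # (B, 1, H^l, W^l): confidence logits
        return F_reg, F_cls
\end{verbatim}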
\par
\textbf{Loss Function for LPM.} To train the local polar module, we define the ground truth labels for each local pole as follows: the ground truth radius $\hat{r}^l_j$ is the minimum distance from the local pole to the corresponding lane curve, while the ground truth angle $\hat{\theta}_j$ is the orientation of the vector from the local pole to the nearest point on that curve. A local pole is labeled positive (one) if its ground truth radius is smaller than the threshold $\tau^{l}$; otherwise, it is labeled negative (zero). Consequently, we obtain a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ otherwise. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for LPM are given as follows:
\begin{align}
\mathcal{L} ^{l}_{reg}&=\frac{1}{N^{l}_{pos}}\sum_{j\in \left\{ j\mid \hat{r}_j^l<\tau^{l} \right\}}{\left[ S_{L1}\left( \theta _j-\hat{\theta}_j \right) +S_{L1}\left( r_{j}^{l}-\hat{r}_{j}^{l} \right) \right]},\\
\mathcal{L} ^{l}_{cls}&=BCE\left( F_{cls},F_{gt} \right),
\label{loss_lph}
\end{align}
where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\tau^{l}\}\right|$ is the number of positive local poles in LPM.
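A corresponding sketch of the label construction and the two LPM losses is shown below; it is illustrative only, and the use of the logit-based BCE and the reduction scheme are assumptions.
\begin{verbatim}
import torch
import torch.nn.functional as F

def lpm_loss(theta, r, s_logit, theta_gt, r_gt, tau_l):
    # theta, r, s_logit: predictions for all H^l x W^l local poles (flattened)
    # theta_gt, r_gt: ground-truth angle/radius of the nearest lane point
    pos = r_gt < tau_l                          # positive poles: radius below threshold
    s_gt = pos.float()                          # classification labels (1 positive, 0 negative)
    n_pos = pos.sum().clamp(min=1)
    loss_reg = (F.smooth_l1_loss(theta[pos], theta_gt[pos], reduction='sum') +
                F.smooth_l1_loss(r[pos], r_gt[pos], reduction='sum')) / n_pos
    loss_cls = F.binary_cross_entropy_with_logits(s_logit, s_gt)
    return loss_cls, loss_reg
\end{verbatim}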
\par
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are considered as potential candidates during the training stage. However, some of these anchors serve as background anchors. We select the top-$K$ anchors with the highest confidence scores as the candidate anchors fed into the second stage (\textit{i.e.}, the global polar module). During training, all anchors are kept as candidates, \textit{i.e.}, $K=H^{l}\times W^{l}$, because this aids the \textit{Global Polar Module} (the next stage) in learning from a diverse range of features, including various negative anchor samples. Conversely, during the evaluation stage, anchors with lower confidence can be excluded, with $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. It maintains the adaptability and flexibility of the anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
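The selection step itself reduces to a single top-$k$ operation; an illustrative sketch is given below (function and variable names are ours).
\begin{verbatim}
import torch

def select_topk_anchors(scores, thetas, radii, k):
    # scores: (H^l*W^l,) confidences from the LPM classification branch
    # During training k = H^l*W^l (keep all); at test time k can be much smaller.
    conf, idx = torch.topk(scores, k)
    return conf, thetas[idx], radii[idx]       # the K candidate lane anchors
\end{verbatim}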
\begin{figure}[t]
\centering
\includegraphics[width=0.89\linewidth]{thesis_figure/detection_head.png}
\caption{The main pipeline of GPM. It comprises the RoI Pooling Layer alongside the triplet head, namely the O2O classification head, the O2M classification head, and the O2M regression head. The predictions generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing. Conversely, the O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores $\left\{\tilde{s}_i^g\right\}$ based on the redundant scores $\left\{s_i^g\right\}$ from the O2M classification head.}
\label{g}
\end{figure}
\subsection{Global Polar Module}
Similar to the pipeline of Faster R-CNN, LPM serves as the first stage, generating lane anchor proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to obtain the final lane predictions. GPM takes features sampled along the anchors and outputs the precise locations and confidence scores of the final lane detection results. The overall architecture of GPM is illustrated in Fig. \ref{g}.
\par
\textbf{RoI Pooling Layer.} It is designed to extract sampled features from lane anchors. For ease of the sampling operation, we first convert the radius of each positive lane anchor in the local polar coordinate system, $r_j^l$, to its counterpart in the global polar coordinate system, $r_j^g$. Given the feature maps $P_1, P_2, P_3$ from FPN, we extract a sampling feature for each lane anchor from each of the three levels and aggregate them through an adaptive summation with learnable weights $\boldsymbol{w}_{k}\in \mathbb{R} ^{N^{l}_{pos}}$. Instead of concatenating the three sampling features into $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f\times 3}$ directly, the adaptive summation reduces the feature dimension to $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f}$, one-third of the original. The weighted-sum features are then fed into a fully connected layer to obtain the pooled RoI features of an anchor:
\begin{equation}
\boldsymbol{F}^{roi}\gets \mathrm{Linear}_{pool}\left( \boldsymbol{F}^s \right), \boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
\end{equation}
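An illustrative sketch of this RoI pooling step is given below; the shape of the learnable weights and the flattening before the fully connected layer are assumptions, not the exact implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class PolarRoIPooling(nn.Module):
    # Illustrative sketch: N_p sampled points with d_f channels from 3 FPN levels.
    def __init__(self, n_points, d_f, d_r):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3, n_points))   # learnable aggregation weights
                                                         # (exact shape in the paper may differ)
        self.fc_pool = nn.Linear(n_points * d_f, d_r)    # Linear_pool in the text

    def forward(self, feats):            # feats: (K, 3, N_p, d_f), one slice per FPN level
        w = self.w.view(1, 3, -1, 1)
        f_s = (feats * w).sum(dim=1)     # adaptive summation -> (K, N_p, d_f)
        return self.fc_pool(f_s.flatten(1))   # pooled RoI feature F^roi: (K, d_r)
\end{verbatim}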
\textbf{Triplet Head.} Taking $\boldsymbol{F}^{roi}$ as input, the triplet head encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{g}. In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eliminates duplicate results. Nevertheless, NMS relies on the definition of a geometric distance between detection results, and this calculation is intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis. To attain optimal non-redundant detection outcomes within an NMS-free paradigm (i.e., end-to-end detection), both the one-to-one and one-to-many paradigms become pivotal during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss}, but with subtle variations, we architect the triplet head to achieve an NMS-free paradigm.
To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are constructed using a straightforward architecture featuring two-layer Multi-Layer Perceptrons (MLPs). To facilitate the model's transition to an end-to-end paradigm, we have developed an extended O2O classification head. As illustrated in Fig. \ref{g}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i^g \right\}$ output by the O2O classification head relies upon the confidence $\left\{ s_i^g \right\}$ output by the O2M classification head.
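A sketch of the two O2M heads as plain two-layer MLPs is shown below; the hidden width and the output layout of the regression head are assumptions.
\begin{verbatim}
import torch.nn as nn

class O2MHeads(nn.Module):
    # Illustrative sketch of the two one-to-many heads of the triplet head.
    def __init__(self, d_r=192, hidden=192, n_reg=72 + 2):
        super().__init__()
        # O2M classification head: redundant confidences s_i^g
        self.cls = nn.Sequential(nn.Linear(d_r, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, 1))
        # O2M regression head: x-offsets of the sampled points plus end points (assumed layout)
        self.reg = nn.Sequential(nn.Linear(d_r, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, n_reg))

    def forward(self, f_roi):            # f_roi: (K, d_r)
        return self.cls(f_roi).squeeze(-1), self.reg(f_roi)
\end{verbatim}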
As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture that incorporates a Polar GNN.
In the Polar GNN, each anchor is conceptualized as a node, with the RoI features $\boldsymbol{F}_{i}^{roi}$ serving as the attributes of these nodes. A pivotal component of the GNN is the edge, represented by the adjacency matrix. This matrix is derived from three submatrices. The first component is the positive selection matrix, denoted as $\boldsymbol{M}^{P}\in\mathbb{R}^{K\times K}$:
\begin{align}
M_{ij}^{P}=\begin{cases}
	1, & \text{if } s_i^g\geqslant \tau ^s\land s_j^g\geqslant \tau ^s,\\
	0, & \text{otherwise},\\
\end{cases}
\end{align}
where $\tau ^s$ signifies the threshold for positive scores, used to selectively retain the initial positive (redundant) predictions.
The second component is the confidence comparison matrix $\boldsymbol{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
\begin{align}
M_{ij}^{C}=\begin{cases}
	1, & \text{if } s_i^g<s_j^g \lor \left( s_i^g=s_j^g\land i<j \right),\\
	0, & \text{otherwise}.\\
\end{cases}
\label{confidential matrix}
\end{align}
This matrix facilitates the comparison of scores for each pair of anchors.
The third component is the geometric prior matrix, denoted by $\boldsymbol{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as:
\begin{align}
M_{ij}^{G}=\begin{cases}
	1, & \text{if } \left| \theta _i-\theta _j \right|<\tau^{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\tau^{r},\\
	0, & \text{otherwise},\\
\end{cases}
\label{geometric prior matrix}
\end{align}
where $\tau^{\theta}$ and $\tau^{r}$ denote the thresholds for distances between lane anchor parameters.
This matrix indicates that an edge (\textit{i.e.}, the relationship between two nodes) is considered to exist between the corresponding nodes only if the two anchors are sufficiently close.
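In code, the three submatrices are simple boolean masks over the $K$ candidate anchors; the sketch below builds them and combines them with the element-wise ``AND'' introduced next (thresholds and notation follow the text, everything else is assumed).
\begin{verbatim}
import torch

def build_adjacency(s_g, theta, r_g, tau_s, tau_theta, tau_r):
    # s_g, theta, r_g: (K,) O2M scores and polar parameters of the K anchors.
    idx = torch.arange(len(s_g), device=s_g.device)
    pos = s_g >= tau_s
    M_P = pos[:, None] & pos[None, :]                      # positive selection matrix
    M_C = (s_g[:, None] < s_g[None, :]) | (                # confidence comparison matrix
          (s_g[:, None] == s_g[None, :]) & (idx[:, None] < idx[None, :]))
    M_G = ((theta[:, None] - theta[None, :]).abs() < tau_theta) & \
          ((r_g[:, None] - r_g[None, :]).abs() < tau_r)    # geometric prior matrix
    return M_P & M_C & M_G                                 # element-wise AND -> M
\end{verbatim}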
With the aforementioned three matrices, we can define the overall adjacency matrix as $\boldsymbol{M} = \boldsymbol{M}^{P} \land \boldsymbol{M}^{C} \land \boldsymbol{M}^{G}$, where ``$\land$'' denotes the element-wise ``AND'' for matrices. Then the relationship between the $i$-th anchor and the $j$-th anchor can be modeled as follows:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( \mathrm{Linear}_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right) ,\label{edge_layer_1}\\
\boldsymbol{F}_{ij}^{edge}&\gets \mathrm{Linear}_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -\mathrm{Linear}_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) ,\label{edge_layer_2}\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\mathrm{Linear}_b\left( \boldsymbol{x}_{i}^{b}-\boldsymbol{x}_{j}^{b} \right) ,\label{edge_layer_3}\\
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .\label{edge_layer_4}
\end{align}
The \textit{implicit Distance Module} in Fig. \ref{o2o_cls_head}, comprising Eqs. (\ref{edge_layer_2})--(\ref{edge_layer_4}), establishes the relationships between the $i$-th anchor and the $j$-th anchor. Here, $\boldsymbol{D}_{ij}^{edge}\in\mathbb{R}^d$ denotes the implicit semantic distance features from the $i$-th anchor to the $j$-th anchor. Given the semantic distance features for each pair of anchors, we employ a max pooling layer to aggregate the adjacent node features and update the node attributes, ultimately yielding the final non-redundant scores $\left\{ \tilde{s}_i^g\right\}$:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{j\in \left\{ j|M_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
\\
\boldsymbol{F}_{i}^{node}&\gets \mathrm{MLP}_{node}\left( \boldsymbol{D}_{i}^{node} \right),
\\
\tilde{s}_i^g&\gets \sigma \left( \mathrm{Linear}_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
\label{node_layer}
\end{align}
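Putting Eqs. (\ref{edge_layer_1})--(\ref{node_layer}) together, the O2O classification head can be sketched as the following module; the feature dimensions, the handling of isolated nodes, and the number of sampled points are assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

class O2OClsHead(nn.Module):
    # Illustrative sketch of the Polar GNN based O2O classification head.
    def __init__(self, d_r=192, d=64):
        super().__init__()
        self.roi = nn.Linear(d_r, d)                       # Linear_o2o^roi
        self.f_in, self.f_out = nn.Linear(d, d), nn.Linear(d, d)
        self.f_b = nn.Linear(36, d)                        # Linear_b on x-coordinate differences
        self.mlp_edge = nn.Sequential(nn.Linear(d, d), nn.ReLU(True), nn.Linear(d, d))
        self.mlp_node = nn.Sequential(nn.Linear(d, d), nn.ReLU(True), nn.Linear(d, d))
        self.out = nn.Linear(d, 1)                         # Linear_o2o^out

    def forward(self, f_roi, x_b, M):                      # (K,d_r), (K,36), (K,K) bool
        f = torch.relu(self.roi(f_roi))
        edge = self.f_in(f)[:, None, :] - self.f_out(f)[None, :, :]
        edge = edge + self.f_b(x_b[:, None, :] - x_b[None, :, :])
        D = self.mlp_edge(edge)                            # implicit distance features D_ij^edge
        D = D.masked_fill(~M[..., None], float('-inf'))    # keep only adjacent nodes (M_ij = 1)
        node = D.max(dim=1).values                         # max pooling over neighbours
        node = torch.nan_to_num(node, neginf=0.0)          # nodes without any neighbour
        return torch.sigmoid(self.out(self.mlp_node(node))).squeeze(-1)  # non-redundant scores
\end{verbatim}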
\textbf{Label Assignment and Cost Function for GPM.} Following previous work, we use the dual assignment strategy for label assignment of the triplet head. The cost function for the $i$-th prediction and the $j$-th ground truth is given as follows:
\begin{align}
\mathcal{C} _{ij}^{o2m}&=s_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},\\
\mathcal{C} _{ij}^{o2o}&=\tilde{s}_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},
\end{align}
where $\mathcal{C} _{ij}^{o2m}$ is the cost function for the O2M classification and regression heads, while $\mathcal{C} _{ij}^{o2o}$ is used for the O2O classification head, with $\beta$ serving as the trade-off hyperparameter between location and confidence. This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet}, taking both location and confidence into account. We have redefined the IoU function between lane instances, $GIoU_{lane}$, which differs slightly from previous work \cite{clrernet}. More details about $GIoU_{lane}$ can be found in Appendix \ref{giou_appendix}.
We use SimOTA \cite{yolox} with dynamic-$k=4$ (one-to-many assignment) for the O2M classification head and the O2M regression head, while the Hungarian algorithm \cite{detr} (one-to-one assignment) is used for the O2O classification head.
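An illustrative sketch of the assignment step is given below; it uses SciPy's Hungarian solver for the one-to-one branch and replaces SimOTA with a simplified fixed top-$k$ per ground truth, so it is a stand-in rather than the actual procedure.
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_labels(scores, giou, beta=6, dynamic_k=4):
    # scores: (N_pred,) confidences; giou: (N_pred, N_gt) GIoU_lane between preds and GTs.
    cost = scores[:, None] * np.power(np.clip(giou, 0.0, 1.0), beta)   # C_ij
    # (clipping keeps the base non-negative; a simplification of the real cost)
    # one-to-one assignment (Hungarian) for the O2O classification head
    pred_idx, gt_idx = linear_sum_assignment(-cost)        # maximise the total cost
    # simplified one-to-many assignment: top dynamic_k predictions per ground truth
    o2m = {j: np.argsort(-cost[:, j])[:dynamic_k] for j in range(cost.shape[1])}
    return list(zip(pred_idx, gt_idx)), o2m
\end{verbatim}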
\textbf{Loss Function for GPM.}
We utilize focal loss \cite{focal} for both the O2O classification head and the O2M classification head, denoted as $\mathcal{L}^{o2o}_{cls}$ and $\mathcal{L}^{o2m}_{cls}$, respectively. The set of candidate samples involved in the computation of $\mathcal{L}^{o2o}_{cls}$, denoted as $\varOmega ^{pos}_{o2o}$ and $\varOmega ^{neg}_{o2o}$ for the positive and negative target sets, is confined to the positive sample set of the O2M classification head:
\begin{align}
\varOmega _{o2o}^{pos}\cup \varOmega _{o2o}^{neg}=\left\{ i\mid s_i^g>\tau ^s \right\}.
\end{align}
In essence, certain samples with lower O2M scores are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$. Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head.
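The restriction of the O2O classification loss to O2M-positive candidates amounts to a simple mask before the focal loss; the sketch below uses torchvision's sigmoid focal loss as a stand-in for the actual loss configuration.
\begin{verbatim}
import torch
from torchvision.ops import sigmoid_focal_loss

def o2o_cls_loss(o2o_logits, o2o_targets, o2m_scores, tau_s):
    # Only candidates whose O2M score exceeds tau_s contribute to L_cls^{o2o}.
    keep = o2m_scores > tau_s
    if keep.sum() == 0:
        return o2o_logits.sum() * 0.0          # no candidate survives the threshold
    return sigmoid_focal_loss(o2o_logits[keep], o2o_targets[keep], reduction='mean')
\end{verbatim}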
\begin{figure}[t]
\caption{Auxiliary loss for segment parameter regression. The ground truth of a lane curve is partitioned into several segments, with the parameters of each segment denoted as $\left( \hat{\theta}_{i,\cdot}^{seg},\hat{r}_{i,\cdot}^{seg} \right)$. The model outputs the parameter offsets $\left( \varDelta \theta _{j,\cdot},\varDelta r_{j,\cdot}^{g} \right)$ to regress from the original anchor to each target line segment.}
\label{auxloss}
\end{figure}
We directly apply the redefined GIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offsets of the x-axis coordinates of the sampled points, and the $Smooth_{L1}$ loss, denoted as $\mathcal{L}_{end}$, for the regression of the end points of lanes. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{\mathrm{aux}}$ depicted in Fig. \ref{auxloss}. The anchors and the ground truth are partitioned into several segments. Each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the anchors in acquiring a deeper comprehension of the global geometric form.
The final loss functions for GPM are given as follows:
\begin{align}
\mathcal{L} _{cls}^{g}&=\mathcal{L} _{cls}^{o2m}+\mathcal{L} _{cls}^{o2o}+\mathcal{L} _{rank},\\
\mathcal{L} _{reg}^{g}&=\mathcal{L} _{GIoU}+\mathcal{L} _{end}+\mathcal{L} _{\mathrm{aux}}.
\end{align}
\subsection{The Overall Loss Function} The entire training process is orchestrated in an end-to-end manner, wherein LPM and GPM are trained concurrently. The overall loss function is delineated as follows:
\begin{align}
\mathcal{L} =\mathcal{L} _{cls}^{l}+\mathcal{L} _{reg}^{l}+\mathcal{L} _{cls}^{g}+\mathcal{L} _{reg}^{g}.
\end{align}
For TuSimple, the evaluation is formulated as follows:
\begin{align}
accuracy=\frac{\sum_{clip}{C_{clip}}}{\sum_{clip}{S_{clip}}},
\end{align}
where $C_{clip}$ and $S_{clip}$ represent the number of correct points (predicted points within 20 pixels of the ground truth) and the number of ground truth points, respectively. If the accuracy exceeds 85\%, the prediction is considered correct. TuSimple also reports the False Positive Rate ($\mathrm{FPR}=1-\mathrm{Precision}$) and the False Negative Rate ($\mathrm{FNR}=1-\mathrm{Recall}$).
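A sketch of the per-lane accuracy computation described above is given below; the 20-pixel tolerance and the 85\% threshold follow the text, while the data layout is assumed.
\begin{verbatim}
import numpy as np

def tusimple_accuracy(pred_x, gt_x, tol=20.0, acc_thr=0.85):
    # pred_x, gt_x: x-coordinates of one predicted/ground-truth lane at fixed rows.
    valid = gt_x >= 0                                    # rows where the GT lane exists
    correct = np.abs(pred_x[valid] - gt_x[valid]) < tol  # points within 20 pixels
    acc = correct.sum() / max(valid.sum(), 1)            # C_clip / S_clip
    return acc, acc > acc_thr                            # lane counted as correct if acc > 85%
\end{verbatim}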
\subsection{Implementation Details}
All input images are cropped and resized to $800\times320$. Similar to \cite{clrnet}, we apply random affine transformations and random horizontal flips. For the optimization process, we use the AdamW \cite{adam} optimizer with a learning rate warm-up and a cosine decay strategy. The initial learning rate is set to 0.006. The numbers of sampled points and regression points for each lane anchor are set to 36 and 72, respectively. The power coefficient $\beta$ of the cost function is set to 6. The training process (including LPM and GPM) is carried out end-to-end in a single step, as in \cite{adnet}\cite{srlane}. All experiments are conducted on a single NVIDIA A100-40G GPU. To keep our model simple, we only use CNN-based backbones, namely ResNet \cite{resnet} and DLA34 \cite{dla}. Other details of the datasets and the training process can be found in Appendix \ref{vis_appendix}.
We decided to choose Fast NMS \cite{yolact} as the inspiration for the design of the O2O classification head. The resulting Graph-based Fast NMS is summarized in Algorithm \ref{Graph Fast NMS}.
\begin{algorithm}[t]
\caption{Graph-based Fast NMS}
\begin{algorithmic}[1]
\REQUIRE ~~\\
The indices of the positive predictions, $1, 2, \dots, i, \dots, N_{pos}$;\\
The corresponding positive anchors, $\left\{ \theta _i,r_{i}^{g} \right\} |_{i=1}^{K}$;\\
The x-axis coordinates of the sampling points from the positive anchors, $\boldsymbol{x}_{i}^{b}$;\\
The positive confidences obtained from the O2M classification head, $s_i^g$;\\
The positive regressions obtained from the O2M regression head, \textit{i.e.}, the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and the end point locations $\boldsymbol{e}_{i}$.\\
\ENSURE ~~\\ % Output of the algorithm
\STATE Select the positive candidates by $\boldsymbol{M}^{P}\in\mathbb{R}^{K\times K}$:
\begin{align}
M_{ij}^{P}=\begin{cases}
	1, & \text{if } s_i^g\geqslant \tau ^s\land s_j^g\geqslant \tau ^s,\\
	0, & \text{otherwise},\\
\end{cases}
\end{align}
\STATE Calculate the confidence comparison matrix $\boldsymbol{M}^{C}\in\mathbb{R}^{K\times K}$, defined as follows:
\begin{align}
M_{ij}^{C}=\begin{cases}
	1, & \text{if } s_i^g<s_j^g \lor \left( s_i^g=s_j^g\land i<j \right),\\
	0, & \text{otherwise}.\\
\end{cases}
\end{align}
\STATE Calculate the geometric prior matrix $\boldsymbol{M}^{G}\in\mathbb{R}^{K\times K}$, which is defined as follows:
\begin{align}
M_{ij}^{G}=\begin{cases}
	1, & \text{if } \left| \theta _i-\theta _j \right|<\tau^{\theta}\land \left| r_{i}^{g}-r_{j}^{g} \right|<\tau^{r},\\
	0, & \text{otherwise}.\\
\end{cases}
\end{align}
\STATE Calculate the distance matrix $\boldsymbol{D} \in \mathbb{R} ^{K \times K}$, where the element $D_{ij}$ in $\boldsymbol{D}$ is defined as follows:
\begin{align}
D_{ij} = 1-d\left( \mathrm{lane}_i, \mathrm{lane}_j\right),
\label{al_1-3}
\end{align}
where $d\left(\cdot, \cdot \right)$ is some predefined function quantifying the distance between two lane predictions, and $\mathrm{lane}_i$ denotes the $i$-th lane prediction assembled from $\boldsymbol{x}_{i}^{b}+\varDelta \boldsymbol{x}_{i}^{roi}$ and $\boldsymbol{e}_{i}$.
\STATE Define the adjacency matrix $\boldsymbol{M} = \boldsymbol{M}^{P} \land \boldsymbol{M}^{C} \land \boldsymbol{M}^{G}$; the final confidence $\tilde{s}_i^g$ is calculated as follows:
\begin{align}
\tilde{s}_i^g = \begin{cases}
	1, & \text{if } \underset{j \in \{ j \mid M_{ij} = 1 \}}{\max} D_{ij} < \tau^g, \\
	0, & \text{otherwise},
\end{cases}
\label{al_1-4}
\end{align}
\RETURN The final confidence $\tilde{s}_i^g$. % the return result of the algorithm
\end{algorithmic}
\label{Graph Fast NMS}
\end{algorithm}
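For reference, Algorithm \ref{Graph Fast NMS} can be written in a few lines of tensor code; the sketch below leaves the lane distance $d(\cdot,\cdot)$ abstract and is illustrative rather than the exact implementation.
\begin{verbatim}
import torch

def graph_fast_nms(s_g, theta, r_g, lane_dist, tau_s, tau_theta, tau_r, tau_g):
    # s_g, theta, r_g: (K,) scores and polar anchor parameters;
    # lane_dist: (K, K) pairwise distance d(lane_i, lane_j) between decoded lanes.
    idx = torch.arange(s_g.numel(), device=s_g.device)
    pos = s_g >= tau_s
    M_P = pos[:, None] & pos[None, :]
    M_C = (s_g[:, None] < s_g[None, :]) | (
          (s_g[:, None] == s_g[None, :]) & (idx[:, None] < idx[None, :]))
    M_G = ((theta[:, None] - theta[None, :]).abs() < tau_theta) & \
          ((r_g[:, None] - r_g[None, :]).abs() < tau_r)
    M = M_P & M_C & M_G
    D = (1.0 - lane_dist).masked_fill(~M, float('-inf'))   # D_ij = 1 - d(lane_i, lane_j)
    keep = D.max(dim=1).values < tau_g    # suppressed if a close, higher-scored rival exists
    return keep.float()                   # final confidences (0 or 1)
\end{verbatim}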
The new algorithm differs in form from the original one \cite{yolact}. However, it is easy to demonstrate that, when all elements in $\boldsymbol{M}$ are set to 1 (\textit{i.e.}, regardless of geometric priors), Algorithm \ref{Graph Fast NMS} is equivalent to Fast NMS. Building upon the newly proposed Graph-based Fast NMS, we design the structure of the one-to-one classification head so that it mirrors the principles of Graph-based Fast NMS.
The fundamental shortcomings of NMS are the definition of a purely geometric distance (\textit{i.e.}, Eq. \ref{al_1-3}) and the threshold $\tau^g$ used to remove redundant predictions (\textit{i.e.}, Eq. \ref{al_1-4}). Thus, we replace these two steps with trainable neural networks.
In order to help the model learn a distance that contains both explicit geometric information and implicit semantic information, the block replacing Eq. \ref{al_1-3} is expressed as:
\begin{align}
\tilde{\boldsymbol{F}}_{i}^{roi}&\gets \mathrm{ReLU}\left( \mathrm{Linear}_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right) ,\\
\boldsymbol{F}_{ij}^{edge}&\gets \mathrm{Linear}_{in}\left( \tilde{\boldsymbol{F}}_{i}^{roi} \right) -\mathrm{Linear}_{out}\left( \tilde{\boldsymbol{F}}_{j}^{roi} \right) ,\\
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\mathrm{Linear}_b\left( \boldsymbol{x}_{i}^{b}-\boldsymbol{x}_{j}^{b} \right) ,\\
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .
\label{edge_layer_appendix}
\end{align}
where the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic feature with dimension $d_{dis}$. The replacement of Eq. \ref{al_1-4} is constructed as follows:
\begin{align}
\boldsymbol{D}_{i}^{node}&\gets \underset{j\in \left\{ j|M_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
\\
\boldsymbol{F}_{i}^{node}&\gets \mathrm{MLP}_{node}\left( \boldsymbol{D}_{i}^{node} \right),
\\
\tilde{s}_i^g&\gets \sigma \left( \mathrm{Linear}_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
\label{node_layer_appendix}
\end{align}
