This commit is contained in:
王老板 2024-09-25 09:40:04 +08:00
parent 90ecbd703d
commit 4e855537b6

main.tex

@@ -202,69 +202,50 @@ In contrast, the global polar system has a single uniform pole, as shown on the
\begin{figure}[t]
\centering
\includegraphics[width=0.45\textwidth]{thesis_figure/local_polar_head.png}
\caption{The main architecture of the local polar module.}
\label{lph}
\end{figure}
In the second stage (RoI pooling and final lane detection), we standardize the lane anchors by transforming them from multiple local polar coordinate systems into a single uniform global coordinate system. This system contains only one reference point, termed the global pole, denoted as $\mathbf{c}^{g}$.
\subsection{Local Polar Module}
As shown in Fig. \ref{overall_architecture}, three levels of feature maps, denoted as $P_1, P_2, P_3$, are extracted using a \textit{Feature Pyramid Network} (FPN). To generate high-quality anchors around the lane ground truths within an image, we introduce the \textit{Local Polar Module} (LPM), which takes the feature maps $P_1, P_2, P_3$ as input and outputs a set of lane anchors along with their confidence scores. Taking the highest-level feature map $P_3\in\mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ as an example, as demonstrated in Fig. \ref{lph}, it first undergoes a \textit{downsampling} operation $DS(\cdot)$ to produce a lower-resolution feature map of size $H^l\times W^l$:
\begin{equation}
F_d\gets DS\left( P_{3} \right)\ \text{and}\ F_d\in \mathbb{R} ^{C_f\times H^{l}\times W^{l}}.
\end{equation}
The downsampled feature map $F_d$ is then fed into two branches: a \textit{regression} branch $\phi _{reg}^{lpm}\left(\cdot \right)$ and a \textit{classification} branch $\phi _{cls}^{lpm}\left(\cdot \right)$, \textit{i.e.},
\begin{align}
F_{reg\,\,}\gets \phi _{reg}^{lpm}\left( F_d \right)\ &\text{and}\ F_{reg\,\,}\in \mathbb{R} ^{2\times H^{l}\times W^{l}},\\
F_{cls}\gets \phi _{cls}^{lpm}\left( F_d \right)\ &\text{and}\ F_{cls}\in \mathbb{R} ^{H^{l}\times W^{l}}. \label{lph equ}
\end{align}
The regression branch consists of a single $1\times1$ convolutional layer and generates lane anchors by outputting their angles $\theta_{j}$ and radii $r^{l}_{j}$, \textit{i.e.}, $F_{reg\,\,} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, in the local polar coordinate system introduced previously. Similarly, the classification branch $\phi _{cls}^{lpm}\left(\cdot \right)$ consists of only two $1\times1$ convolutional layers for simplicity. This branch predicts the confidence heat map $F_{cls\,\,}\equiv \left\{ c_j \right\} _{j=1}^{H^l\times W^l}$ for the local poles, each associated with a feature point. By discarding local poles with low confidence, the module increases the likelihood of selecting potential positive foreground lane anchors while effectively removing background lane anchors.
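For illustration, the following is a minimal PyTorch-style sketch of the LPM head with the structure described above (a downsampling step, a single $1\times1$ convolution for regression, and two $1\times1$ convolutions for classification). The module and variable names, the use of a strided convolution for $DS(\cdot)$, and the sigmoid on the confidence map are our assumptions rather than the exact implementation.
\begin{verbatim}
import torch.nn as nn

class LocalPolarHead(nn.Module):
    # Sketch of the LPM head: downsample P3, then predict (theta, r)
    # with one 1x1 conv and a confidence heat map with two 1x1 convs.
    def __init__(self, c_f=64, stride=2):
        super().__init__()
        # DS(.): assumed to be a strided 3x3 convolution
        self.ds = nn.Conv2d(c_f, c_f, 3, stride=stride, padding=1)
        self.reg = nn.Conv2d(c_f, 2, 1)           # angles and local radii
        self.cls = nn.Sequential(                 # confidence heat map
            nn.Conv2d(c_f, c_f, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_f, 1, 1))

    def forward(self, p3):            # p3: (B, C_f, H_f, W_f)
        f_d = self.ds(p3)             # (B, C_f, H_l, W_l)
        f_reg = self.reg(f_d)         # (B, 2, H_l, W_l)
        f_cls = self.cls(f_d).sigmoid().squeeze(1)  # (B, H_l, W_l)
        return f_reg, f_cls
\end{verbatim}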
\par
\textbf{Loss Function for Training the LPM.} To train the local polar module, we define the ground truth labels for each local pole as follows: the ground truth radius, $r^*$, is the shortest distance from the local pole to the corresponding lane curve, and the ground truth angle, $\theta^*$, is the orientation of the vector extending from the local pole to the nearest point on the curve. A local pole is labeled as positive (one) if its ground truth radius is below a threshold $\tau_{l}$, and as negative (zero) otherwise. Consequently, we obtain a label set of local poles $F_{gt}=\{c_j^*\}_{j=1}^{H^l\times W^l}$, where $c_j^*=1$ if the $j$-th local pole is positive and $c_j^*=0$ otherwise. Once the regression and classification labels are established, as shown in Fig. \ref{lphlabel}, the LPM can be trained using the \textit{smooth-L}1 loss $s\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for the LPM are given as follows:
\begin{align}
\mathcal{L} _{lpm}^{cls}&=BCE\left( F_{cls},F_{gt} \right), \\
\mathcal{L} _{lpm}^{reg}&=\frac{1}{N_{lpm}^{pos}}\sum_{j\in \left\{j\,|\,r_j^*<\tau_{l} \right\}}{\left( s\left( \theta _j-\theta_j^* \right) +s\left( r_j-r_j^* \right) \right)}, \label{loss_lph}
\end{align}
where $N_{lpm}^{pos}=\left|\{j\,|\,r_j^*<\tau_{l}\}\right|$ is the number of positive local poles in the LPM.
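As a concrete illustration of this label assignment, the following NumPy sketch computes $(r^*_j, \theta^*_j, c^*_j)$ for every local pole of a single ground-truth lane, with the curve approximated by densely sampled points; the function name, the dense-sampling approximation, and the \texttt{arctan2} angle convention are our assumptions.
\begin{verbatim}
import numpy as np

def lpm_labels(poles, lane_pts, tau_l):
    # poles:    (M, 2) Cartesian coordinates of the local poles
    # lane_pts: (P, 2) points densely sampled along the ground-truth lane
    diff = lane_pts[None, :, :] - poles[:, None, :]   # (M, P, 2)
    dist = np.linalg.norm(diff, axis=-1)              # (M, P) pole-to-point distances
    nearest = dist.argmin(axis=1)                     # index of the closest curve point
    r_star = dist[np.arange(len(poles)), nearest]     # shortest distance = radius label
    vec = lane_pts[nearest] - poles                   # vector from pole to nearest point
    theta_star = np.arctan2(vec[:, 1], vec[:, 0])     # its orientation = angle label
    c_star = (r_star < tau_l).astype(np.float32)      # positive pole if r* < tau_l
    return r_star, theta_star, c_star
\end{verbatim}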
\par
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a point in the feature map, are treated as candidate anchors during training of the LPM. This helps our Polar R-CNN learn from a sufficient variety of features, including negative anchor samples. During evaluation, however, only the top-$K$ anchors with the highest confidence scores $\{c_j\}$ are selected and fed into the next stage. This strategy effectively filters out potential negative anchors and reduces the computational complexity of our Polar R-CNN. By doing so, it maintains the adaptability and flexibility of the anchor distribution while decreasing the total number of anchors. The following experiments will demonstrate the effectiveness of our top-$K$ anchor selection strategy.
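A minimal sketch of this selection step at evaluation time is given below (the tensor layout and names are our assumptions):
\begin{verbatim}
import torch

def select_topk_anchors(f_cls, f_reg, k):
    # f_cls: (H_l, W_l) confidence map; f_reg: (2, H_l, W_l) angles and radii
    scores, idx = f_cls.flatten().topk(k)   # the K most confident local poles
    theta = f_reg[0].flatten()[idx]         # angles of the selected anchors
    r_local = f_reg[1].flatten()[idx]       # local-polar radii of the anchors
    return scores, theta, r_local, idx
\end{verbatim}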
%
\subsection{Global Polar Module}
Similar to the pipeline of Faster R-CNN, the LPM serves as the first stage for generating lane proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve accurate lane prediction. The GPM takes features extracted by a \textit{Region of Interest} (RoI) pooling layer as input and outputs precise lane locations and confidence scores through a triplet head.
\par
\textbf{RoI Pooling Layer.} This layer is designed to extract the relevant areas of the feature maps. For ease of operation, we first convert the radius of each positive lane anchor from its local polar coordinate system, $r_j^l$, to the global polar coordinate system, $r_j^g$, using the following equation:
\begin{align}
r^{g}_{j}&=r^{l}_{j}+\left( \boldsymbol{c}^{l}_{j}-\boldsymbol{c}^{g} \right) ^{T}\left[\cos\theta_{j}; \sin\theta_{j} \right], \\
j&=1,2,\cdots,N_{lpm}^{pos},\notag
\end{align}
where $\boldsymbol{c}^{l}_{j} \in \mathbb{R}^{2}$ and $\boldsymbol{c}^{g} \in \mathbb{R}^{2}$ denote the Cartesian coordinates of the $j$-th local pole and the global pole, respectively. Note that the angle $\theta_j$ remains unchanged, since the local and global polar coordinate systems share the same polar axis, as shown in Fig. \ref{lphlabel}. Next, feature points are sampled on each lane anchor by
\begin{align}
x_{i,j}&=-y_{i,j}\tan \theta_j +\frac{r^{g}_j}{\cos \theta_j},\label{positions}\\
i&=1,2,\cdots,N,\notag
\end{align}
where the y-coordinates $\{y_{1,j}, y_{2,j},\cdots,y_{N,j}\}$ of the $j$-th lane anchor are uniformly sampled along the vertical axis of the image, as previously mentioned.
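The two equations above can be combined into a short sketch (NumPy, one anchor at a time; the function and variable names are ours):
\begin{verbatim}
import numpy as np

def anchor_x_coords(r_local, theta, c_local, c_global, ys):
    # c_local, c_global: (2,) Cartesian poles; ys: (N,) fixed sample heights
    direction = np.array([np.cos(theta), np.sin(theta)])
    r_global = r_local + (c_local - c_global) @ direction  # local -> global radius
    xs = -ys * np.tan(theta) + r_global / np.cos(theta)    # x at each sampled y
    return r_global, xs
\end{verbatim}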
\par
Given the feature maps $P_1, P_2, P_3$ from FPN, we can extract feature vectors corresponding to the positions of feature points $\{(x_{1,j},y_{1,j}),(x_{2,j},y_{2,j}),\cdots,(x_{N,j},y_{N,j})\}_{j=1}^{N_{lpm}^{pos}}$, respectively denoted as $\boldsymbol{F}_{1}, \boldsymbol{F}_{2}, \boldsymbol{F}_{3}\in \mathbb{R} ^{N_{lpm}^{pos}\times C_f}$. To enhance representation, similar to \cite{detr}, we employ a weighted sum strategy to combine features from different levels as
\begin{equation}
\boldsymbol{F}^s=\sum_{k=1}^3{\boldsymbol{F}_{k}\otimes \frac{e^{\boldsymbol{w}_{k}}}{\sum_{k'=1}^3{e^{\boldsymbol{w}_{k'}}}}},
\end{equation}
where $\boldsymbol{w}_{k}\in \mathbb{R} ^{N_{lpm}^{pos}}$ denotes the learnable aggregation weights. Instead of directly concatenating the three sampled feature sets into $\boldsymbol{F}^s\in \mathbb{R} ^{N_{lpm}^{pos}\times C_f\times 3}$, this adaptive summation reduces the feature dimension to $\boldsymbol{F}^s\in \mathbb{R} ^{N_{lpm}^{pos}\times C_f}$, one third of the concatenated size. The aggregated features are then fed into fully connected layers to obtain the pooled RoI features of an anchor:
\begin{equation}
\begin{aligned}
\boldsymbol{F}^{roi}\gets FC_{pooling}\left( \boldsymbol{F}^s \right), \boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r},
@@ -272,7 +253,17 @@ where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents the learnable agg
\end{equation}
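A minimal sketch of this weighted aggregation (PyTorch-style; the weight layout, the explicit softmax, and the shapes shown are our assumptions based on the notation above):
\begin{verbatim}
import torch

def aggregate_levels(f1, f2, f3, w):
    # f1, f2, f3: (N_pos, C_f) features sampled from P1, P2, P3 for the positive anchors
    # w:          (3, N_pos) learnable aggregation weights (one scalar per anchor and level)
    feats = torch.stack([f1, f2, f3], dim=0)        # (3, N_pos, C_f)
    alpha = torch.softmax(w, dim=0).unsqueeze(-1)   # normalise across the three levels
    f_s = (feats * alpha).sum(dim=0)                # (N_pos, C_f), same size as one level
    return f_s                                      # then fed to the FC pooling layers
\end{verbatim}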
\textbf{Triplet Head.} The triplet head comprises three distinct heads: the one-to-one classification (O2O cls) head, the one-to-many classification (O2M cls) head, and the one-to-many regression (O2M reg) head. In various studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly follows the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation stage, redundant detection results are often predicted for each instance. These redundancies are typically addressed using NMS, which eliminates duplicate results and retains the highest-confidence detection for each ground truth. However, NMS relies on the definition of distance between detection results, and this calculation can be complex for curved lanes and other irregular geometric shapes. To achieve non-redundant detection results with a NMS-free paradigm, the one-to-one paradigm becomes crucial during training, as highlighted in \cite{o2o}. Nevertheless, merely adopting the one-to-one paradigm is insufficient; the structure of the detection head also plays a pivotal role in achieving NMS-free detection. This aspect will be further analyzed in the following sections.
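For orientation only, a minimal sketch of such a triplet head is given below; the two-layer plain MLP follows the O2M description in the next paragraph, while the hidden width, the output dimensionalities, and reusing the same plain structure for the O2O head are placeholder assumptions (the actual O2O head structure is analyzed later).
\begin{verbatim}
import torch.nn as nn

class TripletHead(nn.Module):
    # Three sub-heads sharing the pooled RoI feature of each anchor.
    def __init__(self, d_r=64, n_reg=76):  # n_reg: number of regressed values (assumed)
        super().__init__()
        def head(out_dim):
            return nn.Sequential(nn.Linear(d_r, d_r), nn.ReLU(inplace=True),
                                 nn.Linear(d_r, out_dim))
        self.o2o_cls = head(1)      # one-to-one classification (placeholder structure)
        self.o2m_cls = head(1)      # one-to-many classification
        self.o2m_reg = head(n_reg)  # one-to-many regression

    def forward(self, f_roi):       # f_roi: (num_anchors, d_r)
        return self.o2o_cls(f_roi), self.o2m_cls(f_roi), self.o2m_reg(f_roi)
\end{verbatim}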
%
%
\begin{figure}[t]
\centering
\includegraphics[width=0.89\linewidth]{thesis_figure/detection_head.png} %
\caption{The main architecture of the global polar module.}
\label{gph}
\end{figure}
%
%
\par
\textbf{NMS vs NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denote the RoI features extracted from the $i$-th anchor; the three subheads take $\boldsymbol{F}^{roi}_{i}$ as input. For now, let us focus on the O2M classification (O2M cls) head and the O2M regression (O2M reg) head, which follow the paradigm used in previous work and serve as a baseline for the new one-to-one paradigm. To maintain simplicity and rigor, both the O2M classification head and the O2M regression head consist of two layers with activation functions, featuring a plain structure without any complex mechanisms such as attention or deformable convolution. As previously mentioned, merely replacing the one-to-many label assignment with one-to-one label assignment is insufficient for eliminating NMS post-processing. This is because anchors often exhibit significant overlap or are positioned very close to each other, as shown in Fig. \ref{anchor setting}(b)\&(c). Let $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ represent the features from two overlapping (or very close) anchors, implying that $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ will be almost identical. Let $f_{plain}^{cls}$ denote the neural structure used in the O2M classification head and suppose it is trained with one-to-one label assignment. If $\boldsymbol{F}^{roi}_{i}$ is a positive sample and $\boldsymbol{F}^{roi}_{j}$ is a negative sample, the ideal output should be as follows:
\begin{equation}
\begin{aligned}