update

2024-09-20 09:57:22 +08:00 · 2024-09-20 09:57:22 +08:00 · 90ecbd703d
commit 90ecbd703d
parent eba76aeca4
1 changed files with 31 additions and 39 deletions
--- a/main.tex
+++ b/main.tex
@@ -82,7 +82,7 @@ In recent years, advancements in deep learning and the availability of large dat
                \includegraphics[width=\imgwidth, height=\imgheight]{thesis_figure/anchor_demo/gt.jpg}
                \caption{}
        \end{subfigure}
-        \caption{Anchor settings of different methods. (a) The initial anchor settings of CLRNet. (b) The learned anchor settings of CLRNet trained on CULane. (c) The learned anchors of our method. (d) The ground truth.}
+        \caption{Anchor settings (\textit{i.e.}, the yellow lines) of different methods and the ground-truth lanes. (a) The initial anchor settings of CLRNet. (b) The learned anchor settings of CLRNet trained on CULane. (c) The learned anchors of our method. (d) The ground truth.}
        \label{anchor setting}
 \end{figure}

@@ -132,7 +132,12 @@ To address the above two issues, we propose Polar R-CNN, a novel anchor-based me
 \item By integrating the polar coordinate systems and the Polar GNN block, we present Polar R-CNN, a model for fast and efficient lane detection. We also conduct extensive experiments on five benchmark datasets, demonstrating that our model achieves high performance with fewer anchors under an NMS-free paradigm. %Additionally, our model features a straightforward structure—lacking cascade refinement or attention strategies—making it simpler to deploy. 
 \end{itemize}
 %
-
+\begin{figure*}[ht]
+	\centering
+	\includegraphics[width=0.85\linewidth]{thesis_figure/ovarall_architecture.png} 
+	\caption{An illustration of the Polar R-CNN architecture. Its pipeline is similar to that of Faster R-CNN for object detection: it consists of a backbone, an FPN with three levels of feature maps, denoted by $P_0, P_1, P_2$, followed by a \textit{Local Polar Module}, and an RoI pooling module that extracts features and feeds them to a \textit{Global Polar Module} for lane detection. Based on the proposed lane and lane anchor representations in the polar coordinate system, the local polar module proposes sparse line anchors and the global polar module produces robust and accurate lane predictions. The global polar module includes a triplet head, which comprises a \textit{one-to-one} (O2O) classification head, a \textit{one-to-many} (O2M) classification head, and a \textit{one-to-many} (O2M) regression head.}
+	\label{overall_architecture}
+\end{figure*}
 \section{Related Works}
 %As mentioned above, our model is based on deep learning.
 Generally, deep learning-based lane detection methods can be categorized into three groups: segmentation-based, parameter-based, and anchor-based methods. In addition, NMS-free detection is an important technique for anchor-based methods, so it is also reviewed in this section.
@@ -148,17 +153,8 @@ categorizes lane instances by angles and locations, allowing it to detect only a
 \par
 In this work, we aim to address the above two issues within the framework of anchor-based detection to achieve NMS-free and non-redundant lane predictions.
 %
-
 %
 \section{Polar R-CNN}
-
-\begin{figure*}[ht]
-	\centering
-	\includegraphics[width=0.85\linewidth]{thesis_figure/ovarall_architecture.png} 
-	\caption{An illustration of the Polar R-CNN architecture. It has a similar pipelines with the Faster R-CNN for the task of object detection, and consists of a backbone, a FPN with three levels of feature maps, respectively denote by $P_0, P_1, P_2$, followed by a \textit{local polar head}, and a RoI pooling module to extract features fed to a \textit{global polar head} for lane detection. Based on the designed lane representation and lane anchor representation in polar coordinate system, the local polar head can propose sparse line anchors and the global polar head can produce the robust and accurate lane predictions. The global polar head includes a triplet head, which comprises a \textit{one-to-one (O2O) classification head}, a \textit{one-to-many (O2M) classification head} , and a \textit{one-to-many (O2M) regression head}.}
-	\label{overall_architecture}
-\end{figure*}
-
 \begin{figure}[t]
 	\centering
 	\def\subwidth{0.24\textwidth}
@@ -173,50 +169,46 @@ In this work, we aim to address the above two issues in the framework of anchor-
 		\includegraphics[width=\imgwidth]{thesis_figure/coord/polar.png}
 		\caption{}
 	\end{subfigure}
-	\caption{Different descriptions for anchor parameters: (a) Ray: defined by its starting point and direction $\theta$. (b) Polar: defined by its radius and angle.} %rectangular coordinates
+	\caption{Different descriptions for anchor parameters: (a) Ray: defined by its starting point and direction $\theta$. (b) Polar: defined by its radius $r$ and angle $\theta$.} %rectangular coordinates
 	\label{coord}
 \end{figure}
-
-\begin{figure}[t]
-        \centering
-        \includegraphics[width=\linewidth]{thesis_figure/coord/localpolar.png}
-        \caption{Local (left) and global (right) polar coordinate system.}
-        \label{lphlabel}
-\end{figure}
 %
-The overall architecture of our Polar R-CNN is illustrated in Fig. \ref{overall_architecture}. As shown in this figure, our Polar R-CNN for lane detection has a similar pipeline with Faster R-CNN \cite{fasterrcnn}, which consists of a backbone, a \textit{Feature Pyramid Network} (FPN), a \textit{Region Proposal Network} (RPN) followed by a local polar head, and \textit{Region of Interest} (RoI) pooling module followed by a global polar head. To investigate the fundamental factors affecting model performance, such as anchor settings and NMS post-processing, and also to enhance ease of deployment, our Polar R-CNN utilizes a simple and straightforward network structure. just relying on basic components, including convolutional or pooling operations, \textit{Multi-Layer Perceptrons} (MLPs), while deliberately excluding advanced elements like \textit{attention mechanisms}, \textit{dynamic kernels}, and \textit{cross-layer refinement} used in previous works \cite{clrnet}\cite{clrernet}. 
-\par
-In the following, based on a polar coordinate representation of lane and lane anchors, we will further introduce the designed \textit{Local Polar Head} (LPH) and \textit{Global Polar Head} (GPH) in our Polar R-CNN.
+The overall architecture of our Polar R-CNN is illustrated in Fig. \ref{overall_architecture}. As shown in this figure, our Polar R-CNN for lane detection follows a pipeline similar to that of Faster R-CNN \cite{fasterrcnn}: it consists of a backbone, a \textit{Feature Pyramid Network} (FPN), a \textit{Region Proposal Network} (RPN) followed by a \textit{Local Polar Module} (LPM), and a \textit{Region of Interest} (RoI) pooling module followed by a \textit{Global Polar Module} (GPM). In the following, we first introduce the polar coordinate representation of lanes and lane anchors, and then present the LPM and GPM in detail. %To investigate the fundamental factors affecting model performance, such as anchor settings and NMS post-processing, and also to enhance ease of deployment, our Polar R-CNN utilizes a simple and straightforward network structure. just relying on basic components, including convolutional or pooling operations, \textit{Multi-Layer Perceptrons} (MLPs), while deliberately excluding advanced elements like \textit{attention mechanisms}, \textit{dynamic kernels}, and \textit{cross-layer refinement} used in previous works \cite{clrnet}\cite{clrernet}. 
+%\par
+
 %
 \subsection{Representation of Lane and Lane Anchor}
 %
 Lanes are characterized by their thin, elongated, and curved shapes. A well-defined lane prior aids the model in feature extraction and location prediction. 
 \par
-\textbf{Lane and Anchor Representation as Ray.} Given an input image with dimensions of length $W$ and height $H$, a lane is represented by a set of 2D points with equally spaced y-coordinates $Y=\{y_1, y_2,\cdots, y_n\}$, where $y_i=i\times\frac{H}{n}$ and $n$ is the number of data points. Since the set $Y$ is fixed, a lane can be uniquely defined by its x-coordinates $X=\{x_1,x_2,\cdots,x_n\}$, with each $x_i$ corresponding to the respective $y_i\in Y$. Previous studies \cite{linecnn}\cite{laneatt} have introduced lane priors, also known as lane anchors, which are represented as straight lines in the image plane that serve as references. From a geometric perspective, a lane anchor can be viewed as a ray defined by a starting point $(x_{orig},y_{orig})$ located at the edge of an image (left/bottom/right boundaries), along with a direction $\theta$, as shown in Fig. \ref{coord}(a). The primary task of a lane detection model is to estimate the x-coordinate offset from the lane anchor to the ground truth of the lane instance. 
+\textbf{Lane and Anchor Representation as Ray.} Given an input image with width $W$ and height $H$, a lane is represented by a set of 2D points $X=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$ with equally spaced y-coordinates, \textit{i.e.}, $y_i=i\times\frac{H}{N}$, where $N$ is the number of data points. Since the y-coordinates are fixed, a lane is uniquely defined by its x-coordinates $\{x_1,x_2,\cdots,x_N\}$. Previous studies \cite{linecnn}\cite{laneatt} have introduced \textit{lane priors}, also known as \textit{lane anchors}, which are straight lines in the image plane that serve as references. From a geometric perspective, a lane anchor can be viewed as a ray defined by a starting point $(x_{0},y_{0})$ located on an image boundary (left, bottom, or right), along with a direction $\theta$. The primary task of a lane detection model is then to estimate the x-coordinate offsets from the lane anchor to the ground-truth lane instance. 
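The ray parameterization can be made concrete with a minimal Python sketch. It assumes, for illustration only, that $\theta$ is the anchor direction measured from the image x-axis and that the regression targets are the per-row differences between ground-truth and anchor x-coordinates; the helpers \texttt{sample\_ys}, \texttt{ray\_anchor\_xs}, and \texttt{offset\_targets} are hypothetical and not taken from any released code.

```python
import math


def sample_ys(height, n):
    """Equally spaced rows y_i = i * H / N for i = 1, ..., N."""
    return [i * height / n for i in range(1, n + 1)]


def ray_anchor_xs(x0, y0, theta, ys):
    """x-coordinates where a straight ray anchor crosses each row in ys.

    Assumption: theta is the anchor direction measured from the image x-axis,
    so anchor points are (x0 + s*cos(theta), y0 + s*sin(theta)).
    """
    sin_t = math.sin(theta)
    if abs(sin_t) < 1e-8:
        # A horizontal anchor does not cross the rows at a single x each.
        raise ValueError("horizontal anchor cannot be sampled by rows")
    # Solve y = y0 + s*sin(theta) for s, then substitute into x.
    return [x0 + (y - y0) * math.cos(theta) / sin_t for y in ys]


def offset_targets(gt_xs, anchor_xs):
    """Per-row x offsets to regress: ground truth minus anchor."""
    return [g - a for g, a in zip(gt_xs, anchor_xs)]
```

For example, with $H=320$ and $N=4$ the rows are $y=80,160,240,320$, and a ray starting at $(100,320)$ with $\theta=\pi/4$ crosses them at approximately $x=-140,-60,20,100$.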
 \par
-However, lane anchors, which are essentially straight lines represented as rays, have certain drawbacks, as illustrated in Fig. \ref{coord}(a). A lane anchor possesses an infinite number of potential starting points, making the definition of the anchor’s starting point ambiguous and subjective. Some methods, such as \cite{linecnn}\cite{dalnet}\cite{laneatt}, define the starting points as being located at the boundaries of an image (e.g. the green point in Fig. \ref{coord}(a)), while \cite{adnet} sets the starting points to correspond to the actual visual location within the image (e.g. the purple point in Fig. \ref{coord}(a)). Moreover, occlusion and damage to the lane significantly impact the detection of these starting points, necessitating that the model possess a large receptive field \cite{adnet}. Fundamentally, a straight lane has two degrees of freedom (for instance, the slope and the intercept under a Cartesian coordinate system), implying that the lane anchor could be described using just two parameters instead of the three redundant parameters (two for the start point and one for orientation) currently employed.
-
+However, representing lane anchors as rays has certain limitations. Notably, a lane anchor has infinitely many potential starting points, which makes the definition of its starting point ambiguous and subjective. As illustrated in Fig. \ref{coord}(a), the studies in \cite{dalnet}\cite{laneatt}\cite{linecnn} define the starting points to lie on the image boundaries, such as the green point. In contrast, \cite{adnet} defines the starting points based on their actual visual locations within the image, such as the purple point. Moreover, occlusion and damage to the lane significantly affect the detection of these starting points, so the model needs a large receptive field \cite{adnet}. Essentially, a straight line has only two degrees of freedom (\textit{e.g.}, the slope and the intercept in a Cartesian coordinate system), which implies that a lane anchor can be described by just two parameters instead of the three redundant parameters (\textit{i.e.}, two for the starting point and one for the orientation) used in the ray representation.
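The redundancy can be checked with a short worked equation. Assuming that $\theta$ is the direction measured from the image x-axis and treating the anchor as the full line carrying the ray, moving the starting point by any $t$ along the direction leaves the anchor unchanged, and eliminating the ray parameter $s$ leaves a single constant besides $\theta$:

```latex
\begin{equation*}
\begin{aligned}
&(x_0', y_0') = (x_0 + t\cos\theta,\; y_0 + t\sin\theta),\quad t\in\mathbb{R},\\
&\{(x_0' + s\cos\theta,\; y_0' + s\sin\theta) : s\in\mathbb{R}\} = \{(x_0 + s\cos\theta,\; y_0 + s\sin\theta) : s\in\mathbb{R}\},\\
&x\sin\theta - y\cos\theta = x_0\sin\theta - y_0\cos\theta = x_0'\sin\theta - y_0'\cos\theta .
\end{aligned}
\end{equation*}
```

Only $\theta$ and the constant $x_0\sin\theta - y_0\cos\theta$ are needed, which matches the two degrees of freedom of a straight line.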
+%
+\begin{figure}[t]
+	\centering
+	\includegraphics[width=0.87\linewidth]{thesis_figure/coord/localpolar.png}
+	\caption{Local (left) and global (right) polar coordinate systems.}
+	\label{lphlabel}
+\end{figure}
 \par
-\textbf{Representation in Polar Coordinate.} 
-As stated above, lane anchors represented by rays have some drawbacks. To address these issues, we introduce the polar coordinate representation of lane anchors. In mathematics, the polar coordinate is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a reference point (also called the pole) and an angle $\theta$ from a reference direction (called polar axis). As shown in Fig. \ref{coord}(b), a lane anchor for a straight line can be uniquely defined by two parameters: the radial distance from the pole (called radius), $r$, and the counterclockwise angle from the polar axis, $\theta$, with $r \in \mathbb{R}$ and $\theta\in\left(-\frac{\pi}{2}, \frac{\pi}{2}\right]$. To better integrate the local inductive bias properties of CNNs, we define two types of polar coordinate systems: the local polar coordinate system and the global coordinate system.
-
-In the polar coordinate system, we introduce a set of reference points known as local poles. These local poles are positioned at the lattice points (or pixels) of the downsampled feature map, as illustrated in Fig. \ref{lphlabel} (a). Each local pole, which we denoted as $\mathbf{c}_{i}^{l}$, is responsible for predicting a single lane anchor, similar to the green points shown in Fig. \ref{lphlabel} (a). During training, as depicted in Fig. \ref{lphlabel} (a), the ground truth labels for each local pole are defined as follows: the radius ground truth is the shortest distance from a local pole to the ground truth lane curve, and the angle ground truth represents the orientation of the vector from the local pole to the nearest point on the curve. A local pole is labeled as a positive sample (the green points) if its radius label is below a threshold $\tau_{l}$; otherwise, it is considered a negative sample (the red points). Note that one lane curve instance is regressed by multiple local poles. Some local features around certain poles may be missed due to damage or occlusion of the lane curve, so the one-to-many approach is crucial for ensuring comprehensive anchor proposals.
-
-In the second stage (RoI Pooling and final lane detection), we standardize the lane anchors by transforming them from multiple local polar coordinate systems into a single uniform global coordinate system. This system contains only one reference point, termed the global pole, denoted as $\mathbf{c}^{g}$. The location of the global pole is manually set, and in this work, it is positioned around the static vanishing point of the entire lane image dataset.
-
-% \newpage
-% Since lane anchors are typically represented as straight lines, they can be described using straight line parameters. Previous approaches have used rays to describe 2D lane anchors, with the parameters including the coordinates of the starting point and the orientation/angle, denoted as $\left\{\theta, P_{xy}\right\}$, as shown in Fig. \ref{coord}(a). \cite{linecnn}\cite{laneatt} define the start points as lying  on the three image boundaries. However, \cite{adnet} argue that this approach is problematic because the actual starting point of a lane could be located anywhere within the image. In our analysis, using a ray can lead to ambiguity in line representation because a line can have an infinite number of starting points, and the choice of the starting point for a lane is subjective. As illustrated in Fig. \ref{coord}(a), the yellow (the visual start point) and green (the point located on the image boundary) starting points with the same orientation $\theta$ describe the same line, and either could be used in different datasets \cite{scnn}\cite{vil100}. This ambiguity arises because a straight line has two degrees of freedom, whereas a ray has three (two for the start point and one for orientation). To resolve this issue , we propose using polar coordinates to describe a lane anchor with only two parameters: radius and angle, deoted as $\left\{\theta, r\right\}$, where This representation is illustrated in Fig. \ref{coord}(b).
-% \newpage
-
+\textbf{Representation in Polar Coordinates.} As stated above, lane anchors represented by rays have some drawbacks. To address these issues, we introduce a polar coordinate representation of lane anchors. In mathematics, the polar coordinate system is a two-dimensional coordinate system in which each point on a plane is determined by its distance from a reference point (called the pole) and its angle from a reference direction (called the polar axis). As shown in Fig. \ref{coord}(b), given a pole located at the yellow point, a straight lane anchor can be uniquely defined by two parameters: the radial distance $r$ from the pole to the anchor (called the radius), and the counterclockwise angle $\theta$ from the polar axis to the perpendicular from the pole to the lane anchor, with $r \in \mathbb{R}$ and $\theta\in\left(-\frac{\pi}{2}, \frac{\pi}{2}\right]$; $r$ is signed because $\theta$ covers only half a turn. 
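To illustrate how a polar anchor maps back to the row-wise x-coordinates used for lanes, the following minimal Python sketch writes the anchor in normal form, $(\mathbf{p}-\mathbf{c})^{\top}(\cos\theta,\sin\theta)^{\top}=r$ for a pole $\mathbf{c}$. This normal-form reading of the figure, the choice of the image x-axis as the polar axis, and the helper name \texttt{polar\_anchor\_xs} are assumptions for illustration, not the authors' implementation.

```python
import math


def polar_anchor_xs(r, theta, pole, ys, eps=1e-8):
    """Sample a straight polar anchor at the given y-coordinates.

    Assumed convention: the anchor is the set of points p with
    (p - pole) . (cos(theta), sin(theta)) = r, i.e. theta is the angle of the
    anchor's normal measured from the x-axis (the polar axis).
    """
    cx, cy = pole
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    if abs(cos_t) < eps:
        # theta = pi/2 describes a horizontal line, which is not a function of y.
        raise ValueError("horizontal anchor cannot be sampled by rows")
    # Solve (x - cx) * cos_t + (y - cy) * sin_t = r for x at each row.
    return [cx + (r - (y - cy) * sin_t) / cos_t for y in ys]
```

Since lanes are typically far from horizontal in the image, their normals are far from vertical, so $\cos\theta$ usually stays well away from zero and the division is well conditioned.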
+\par
+To better leverage the local inductive bias of CNNs, we define two types of polar coordinate systems: the local and global polar coordinate systems. The local polar coordinate system is used to generate lane anchors, while the global polar coordinate system is used to regress these anchors to the ground-truth lane instances. Given the distinct roles of the two systems, we adopt a two-stage training scheme for our Polar R-CNN, similar to Faster R-CNN \cite{fasterrcnn}. This scheme alternates between training the local polar system and the global polar system, with the lane anchors kept fixed. 
+\par
+The local polar system is designed to predict lane anchors that adapt to both sparse and dense scenarios. In this system, each lattice point of the feature map serves as a pole, referred to as a local pole. As illustrated on the left side of Fig. \ref{lphlabel}, local poles are of two types: positive and negative. A local pole is positive (\textit{e.g.}, the green points) if its radius $r_l$, \textit{i.e.}, its shortest distance to a ground-truth lane, is below a threshold $\tau_l$; otherwise, it is negative (\textit{e.g.}, the red points). Each local pole is responsible for predicting a single lane anchor.
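The labeling of local poles can be sketched in Python as follows. The sketch assumes, consistently with the draft wording kept as a comment in this section, that the radius label is the shortest distance from the pole to a ground-truth lane and that the angle label is the orientation of the vector from the pole to the nearest lane point. It further assumes that lanes are given as polylines, that angles outside $(-\frac{\pi}{2}, \frac{\pi}{2}]$ are wrapped by negating the radius, and that positivity is decided on the unsigned distance. All function names are hypothetical.

```python
import math


def nearest_point_on_polyline(p, polyline):
    """Return the closest point to p on a polyline [(x, y), ...] and its distance."""
    if len(polyline) < 2:
        raise ValueError("a lane polyline needs at least two points")
    px, py = p
    best, best_d2 = None, math.inf
    for (ax, ay), (bx, by) in zip(polyline, polyline[1:]):
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        # Project p onto the segment and clamp the parameter to [0, 1].
        t = 0.0 if seg_len2 == 0 else ((px - ax) * dx + (py - ay) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
        qx, qy = ax + t * dx, ay + t * dy
        d2 = (qx - px) ** 2 + (qy - py) ** 2
        if d2 < best_d2:
            best, best_d2 = (qx, qy), d2
    return best, math.sqrt(best_d2)


def wrap_to_half_turn(dist, phi):
    """Map (distance, angle in (-pi, pi]) to (r, theta) with theta in (-pi/2, pi/2].

    Flipping the normal by pi describes the same line, so r changes sign.
    """
    if phi > math.pi / 2:
        return -dist, phi - math.pi
    if phi <= -math.pi / 2:
        return -dist, phi + math.pi
    return dist, phi


def label_local_pole(pole, lanes, tau_l):
    """Return (is_positive, r, theta) for one local pole and a list of lane polylines."""
    if not lanes:
        raise ValueError("at least one ground-truth lane is required")
    # Nearest point over all ground-truth lanes.
    q, d = min((nearest_point_on_polyline(pole, lane) for lane in lanes),
               key=lambda qd: qd[1])
    phi = math.atan2(q[1] - pole[1], q[0] - pole[0])  # direction pole -> nearest point
    r, theta = wrap_to_half_turn(d, phi)
    # Positivity uses the unsigned distance (assumption).
    return d < tau_l, r, theta
```

For a smooth lane, the nearest point is reached along the lane's normal, so the anchor predicted at a positive pole coincides with the lane's tangent at that point; anchors from poles close to a lane therefore follow its local direction.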
+\par
+In contrast, the global polar system has a single pole, as shown on the right side of Fig. \ref{lphlabel}. The location of this global pole is set manually; in this work, it is placed near the static vanishing point of the entire lane image dataset. Notably, each ground-truth lane instance is regressed by multiple lane anchors generated by the positive local poles. This one-to-many approach is essential for comprehensive anchor proposals, since the local features around some poles may be lost due to damage or occlusion of the lane. 
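Moving an anchor from a local pole to the global pole only shifts its radius. Writing $\mathbf{c}^{l}_{i}$ for the $i$-th local pole and $\mathbf{c}^{g}$ for the global pole (symbols as in the draft comments of this section), and assuming the normal-form reading of the polar parameters with $\mathbf{n}(\theta)=(\cos\theta,\sin\theta)^{\top}$, the anchor $\mathcal{L}_i$ predicted at $\mathbf{c}^{l}_{i}$ is re-expressed about $\mathbf{c}^{g}$ as:

```latex
\begin{equation*}
\begin{aligned}
\mathcal{L}_i &= \left\{\mathbf{p}\in\mathbb{R}^2 : (\mathbf{p}-\mathbf{c}^{l}_{i})^{\top}\mathbf{n}(\theta_i) = r^{l}_{i}\right\},\\
(\mathbf{p}-\mathbf{c}^{g})^{\top}\mathbf{n}(\theta_i) &= (\mathbf{p}-\mathbf{c}^{l}_{i})^{\top}\mathbf{n}(\theta_i) + (\mathbf{c}^{l}_{i}-\mathbf{c}^{g})^{\top}\mathbf{n}(\theta_i),\\
\Rightarrow\quad r^{g}_{i} &= r^{l}_{i} + (\mathbf{c}^{l}_{i}-\mathbf{c}^{g})^{\top}\mathbf{n}(\theta_i),\qquad \theta^{g}_{i} = \theta_i .
\end{aligned}
\end{equation*}
```

The angle is unchanged and the radius is shifted by the projection of the pole offset onto the anchor normal, consistent with the remark, kept as a comment in this section, that only the radius depends on the choice of pole.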
 \begin{figure}[t]
        \centering
        \includegraphics[width=0.45\textwidth]{thesis_figure/local_polar_head.png}
-        \caption{The main architecture of LPH.}
+        \caption{The main architecture of LPM.}
        \label{lph}
 \end{figure}
+%During training, as depicted in Fig. \ref{lphlabel} (a), the ground truth labels for each local pole are defined as follows: the radius ground truth is the shortest distance from a local pole to the ground truth lane curve, and the angle ground truth represents the orientation of the vector from the local pole to the nearest point on the curve. A local pole is labeled as a positive sample (the green points) if its radius label is below a threshold $\tau_{l}$; otherwise, it is considered a negative sample (the red points).  In the second stage (RoI Pooling and final lane detection), we standardize the lane anchors by transforming them from multiple local polar coordinate systems into a single uniform global coordinate system. This system contains only one reference point, termed the global pole, denoted as $\mathbf{c}^{g}$. 
 % We define two types of polar coordinate systems: the global coordinate system and the local coordinate system, with the origin points denoted as the global origin $\boldsymbol{c}^{g}$ and the local origin $\boldsymbol{c}^{l}$, respectively. For convenience, the global origin is positioned near the static vanishing point of the entire lane image dataset, while the local origins are set at lattice points within the image. As illustrated in Fig. \ref{coord}(b), only the radius parameters are affected by the choice of the origin point, while the angle/orientation parameters remain consistent.
-\subsection{Local Polar Head}
-\textbf{Anchor formulation in local polar head.}. Inspired by the region proposal network in Faster R-CNN \cite{fasterrcnn}, the local polar head (LPH) aims to propose flexible, high-quality anchors aorund the lane ground truths within an image. As Fig. \ref{lph} and Fig. \ref{overall_architecture} demonstrate, the highest level $P_{3} \in \mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ of FPN feature maps is selected as the input for LPH. Following a downsampling operation, the feature map is then fed into two branches: the regression branch $\phi _{reg}^{lph}\left(\cdot \right)$ and the classification branch $\phi _{cls}^{lph}\left(\cdot \right)$:
+\subsection{Local Polar Module}
+\textbf{Anchor formulation in the LPM.} Inspired by the region proposal network in Faster R-CNN \cite{fasterrcnn}, the LPM aims to propose flexible, high-quality anchors around the ground-truth lanes within an image. As Fig. \ref{lph} and Fig. \ref{overall_architecture} demonstrate, the highest level $P_{3} \in \mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ of the FPN feature maps is selected as the input to the LPM. After a downsampling operation, the feature map is fed into two branches: the regression branch $\phi _{reg}^{lph}\left(\cdot \right)$ and the classification branch $\phi _{cls}^{lph}\left(\cdot \right)$:
 \begin{equation}
        \begin{aligned}
                &F_d\gets DS\left( P_{3} \right), \,F_d\in \mathbb{R} ^{C_f\times H^{l}\times W^{l}},\\