update
116
main.tex
@ -46,7 +46,7 @@
|
||||
\maketitle
|
||||
|
||||
\begin{abstract}
|
||||
Lane detection is a critical and challenging task in autonomous driving, particularly in real-world scenarios where traffic lanes can be slender, lengthy, and often obscured by other vehicles, complicating detection efforts. Existing anchor-based methods typically rely on prior lane anchors to extract features and refine location and shape of lanes. While these methods achieve high performance, manually setting prior anchors is cumbersome, and ensuring sufficient coverage across diverse datasets often requires a large number of dense anchors. Furthermore,
|
||||
Lane detection is a critical and challenging task in autonomous driving, particularly in real-world scenarios where traffic lanes can be slender, lengthy, and often obscured by other vehicles, complicating detection efforts. Existing anchor-based methods typically rely on prior lane anchors to extract features and subsequently refine the location and shape of lanes. While these methods achieve high performance, manually setting prior anchors is cumbersome, and ensuring sufficient coverage across diverse datasets often requires a large number of dense anchors. Furthermore,
|
||||
the use of \textit{Non-Maximum Suppression} (NMS) to eliminate redundant predictions complicates real-world deployment and may underperform in complex scenarios. In this paper, we propose \textit{Polar R-CNN}, an NMS-free anchor-based method for lane detection. By incorporating both local and global polar coordinate systems, Polar R-CNN facilitates flexible anchor proposals and significantly reduces the number of anchors required without compromising performance. Additionally, we introduce a heuristic \textit{Graph Neural Network} (GNN)-based NMS-free head that supports an end-to-end paradigm, enhancing deployment efficiency and performance in scenarios with dense lanes. Our method achieves competitive results on five popular lane detection benchmarks—\textit{TuSimple}, \textit{CULane}, \textit{LLAMAS}, \textit{CurveLanes}, and \textit{DL-Rail}—while maintaining a lightweight design and straightforward structure. Our source code is available at \href{https://github.com/ShqWW/PolarRCNN}{\textit{https://github.com/ShqWW/PolarRCNN}}.
|
||||
\end{abstract}
|
||||
\begin{IEEEkeywords}
|
||||
@ -82,7 +82,7 @@ In recent years, advancements in deep learning and the availability of large dat
|
||||
\includegraphics[width=\imgwidth, height=\imgheight]{thesis_figure/anchor_demo/gt.jpg}
|
||||
\caption{}
|
||||
\end{subfigure}
|
||||
\caption{Anchor (\textit{i.e.}, the yellow lines) settings of different methods and the ground truth lanes. (a) The initial anchor settings of CLRNet. (b) The learned anchor settings of CLRNet trained on CULane. (c) The learned anchors of our method. (d) The ground truth.}
|
||||
\caption{Anchor (\textit{i.e.}, the yellow lines) settings of different methods and the ground truth lanes. (a) The initial anchor settings of CLRNet. (b) The learned anchor settings of CLRNet trained on CULane. (c) The flexible proposal anchors of our method. (d) The ground truth.}
|
||||
\label{anchor setting}
|
||||
\end{figure}
|
||||
|
||||
@ -110,7 +110,7 @@ In recent years, advancements in deep learning and the availability of large dat
|
||||
\caption{}
|
||||
\end{subfigure}
|
||||
|
||||
\caption{Comparison of anchor thresholds in \textit{sparse} and \textit{dense} scenarios. (a) and (b) Ground truths in a dense and sparse scenarios, respectively. (c) Predictions with large NMS thresholds in a dense scenario, resulting in a lane prediction being mistakenly suppressed. (d) Predictions with a small NMS threshold in a sparse scenario, where redundant prediction results are not effectively removed.}
|
||||
\caption{Comparison of NMS thresholds in \textit{sparse} and \textit{dense} scenarios. (a) and (b) Ground truths in dense and sparse scenarios, respectively. (c) Predictions with a large NMS threshold in a dense scenario, resulting in a lane prediction being mistakenly suppressed. (d) Predictions with a small NMS threshold in a sparse scenario, where redundant predictions are not effectively removed.}
|
||||
\label{NMS setting}
|
||||
\end{figure}
|
||||
%, where some lane instances are close with each others; , where the lane instance are far apart
|
||||
@ -121,13 +121,13 @@ Drawing inspiration from object detection methods such as \textit{YOLO} \cite{yo
|
||||
\item A \textit{Non-Maximum Suppression} (NMS) \cite{nms} post-processing step is required to eliminate redundant predictions but may struggle in \textbf{\textit{dense scenarios}} where lanes are close to each other, such as forked lanes and double lanes, as illustrated in Fig. \ref{NMS setting}(a).
|
||||
\end{itemize}
|
||||
\par
|
||||
Regrading the first issue, \cite{clrnet} introduced learned anchors that optimize the anchor parameters during training to better adapt to lane distributions, as shown in Fig. \ref{anchor setting}(b). However, the number of anchors remains excessive to adequately cover the diverse potential distributions of lanes. Furthermore, \cite{adnet} proposes flexible anchors for each image by generating start points, rather than using a fixed set of anchors. Nevertheless, these start points of lanes are subjective and lack clear visual evidence due to the global nature of lanes. In contrast, \cite{srlane} uses a local angle map to propose sketch anchors according to the direction of ground truth. While this approach considers directional alignment, it neglects precise anchor positioning, resulting in suboptimal performance. Overall, the abundance of anchors is unnecessary in sparse scenarios.% where lane ground truths are sparse. The trend in new methodologies is to reduce the number of anchors while offering more flexible anchor configurations.%, which negatively impacts its performance. They also employ cascade cross-layer anchor refinement to bring the anchors closer to the ground truth. in the absence of cascade anchor refinement
|
||||
Regarding the first issue, \cite{clrnet} introduced learned anchors that optimize the anchor parameters during training to better adapt to lane distributions, as shown in Fig. \ref{anchor setting}(b). However, the number of anchors remains excessive, since adequately covering the diverse potential distributions of lanes still requires many of them. Furthermore, \cite{adnet} proposes flexible anchors for each image by generating start points with directions, rather than using a fixed set of anchors. Nevertheless, these start points of lanes are subjective and lack clear visual evidence due to the global nature of lanes. In contrast, \cite{srlane} uses a local angle map to propose sketch anchors according to the direction of the ground truth. While this approach considers directional alignment, it neglects precise anchor positioning, resulting in suboptimal performance. Overall, the abundance of anchors is unnecessary in sparse scenarios.% where lane ground truths are sparse. The trend in new methodologies is to reduce the number of anchors while offering more flexible anchor configurations.%, which negatively impacts its performance. They also employ cascade cross-layer anchor refinement to bring the anchors closer to the ground truth. in the absence of cascade anchor refinement
|
||||
\par
|
||||
Regarding the second issue, nearly all anchor-based methods \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane} rely on direct or indirect NMS post-processing to eliminate redundant predictions. Although it is necessary to eliminate redundant predictions, NMS remains a suboptimal solution. On one hand, NMS is not deployment-friendly because it requires defining and calculating distances between lane pairs using metrics such as \textit{Intersection over Union} (IoU). This task is more challenging than in general object detection due to the intricate geometry of lanes. On the other hand, NMS can struggle in dense scenarios. Typically, a large distance threshold may lead to false negatives, as some true positive predictions could be mistakenly eliminated, as illustrated in Fig. \ref{NMS setting}(a)(c). Conversely, a small distance threshold may fail to eliminate redundant predictions effectively, resulting in false positives, as shown in Fig. \ref{NMS setting}(b)(d). Therefore, achieving an optimal trade-off across all scenarios by manually setting the distance threshold is challenging. %The root of this problem lies in the fact that the distance definition in NMS considers only geometric parameters while ignoring the semantic context in the image. As a result, when two predictions are ``close'' to each other, it is nearly impossible to determine whether one of them is redundant.% where lane ground truths are closer together than in sparse scenarios;including those mentioned above,
|
||||
\par
|
||||
To address the above two issues, we propose Polar R-CNN, a novel anchor-based method for lane detection. For the first issue, we introduce local and global heads based on the polar coordinate system to create anchors with more accurate locations, thereby reducing the number of proposed anchors in sparse scenarios, as illustrated in Fig. \ref{anchor setting}(c). In contrast to \textit{State-Of-The-Art} (SOTA) methods \cite{clrnet}\cite{clrernet}, which utilize 192 anchors, Polar R-CNN employs only 20 anchors to effectively cover potential lane ground truths. For the second issue, we have incorporated a triplet head with a new heuristic \textit{Graph Neural Network} (GNN) \cite{gnn} bolck. The GNN block offers an interpretable structure, achieving nearly equivalent performance in sparse scenarios and superior performance in dense scenarios. We conducted experiments on five major benchmarks: \textit{TuSimple} \cite{tusimple}, \textit{CULane} \cite{scnn}, \textit{LLAMAS} \cite{llamas}, \textit{CurveLanes} \cite{curvelanes}, and \textit{DL-Rail} \cite{dalnet}. Our proposed method demonstrates competitive performance compared to SOTA approaches. Our main contributions are summarized as follows:
|
||||
To address the above two issues, we propose Polar R-CNN, a novel anchor-based method for lane detection. For the first issue, we introduce a \textit{Local Polar Module} based on the polar coordinate system to create anchors with more accurate locations, thereby reducing the number of proposed anchors in sparse scenarios, as illustrated in Fig. \ref{anchor setting}(c). In contrast to \textit{State-Of-The-Art} (SOTA) methods \cite{clrnet}\cite{clrernet}, which utilize 192 anchors, Polar R-CNN employs only 20 anchors to effectively cover potential lane ground truths. For the second issue, we have incorporated a triplet head with a new heuristic \textit{Graph Neural Network} (GNN) \cite{gnn} block. The GNN block offers an interpretable structure, achieving nearly equivalent performance in sparse scenarios and superior performance in dense scenarios. We conducted experiments on five major benchmarks: \textit{TuSimple} \cite{tusimple}, \textit{CULane} \cite{scnn}, \textit{LLAMAS} \cite{llamas}, \textit{CurveLanes} \cite{curvelanes}, and \textit{DL-Rail} \cite{dalnet}. Our proposed method demonstrates competitive performance compared to SOTA approaches. Our main contributions are summarized as follows:
|
||||
\begin{itemize}
|
||||
\item We design a strategy to simplify the anchor parameters by using local and global polar coordinate systems and applied these to two-stage lane detection frameworks. Compared to other anchor-based methods, this strategy significantly reduces the number of proposed anchors while achieving better performance.
|
||||
\item We design a strategy to simplify the anchor parameters by using local and global polar coordinate systems, and apply it to the two-stage lane detection framework. Compared to other anchor-based methods, this strategy significantly reduces the number of proposed anchors while achieving better performance.
|
||||
\item We propose a novel triplet detection head with a GNN block to implement an NMS-free paradigm. The block is inspired by Fast NMS, providing enhanced interpretability. Our model supports end-to-end training and testing while still allowing traditional NMS post-processing as an option for an NMS-based version of our model.
|
||||
\item By integrating the polar coordinate systems and the NMS-free paradigm, we present Polar R-CNN, a fast and efficient lane detection model. We conduct extensive experiments on five benchmark datasets to demonstrate that our model achieves high performance with fewer anchors under an NMS-free paradigm. %Additionally, our model features a straightforward structure—lacking cascade refinement or attention strategies—making it simpler to deploy.
|
||||
\end{itemize}
|
||||
@ -135,7 +135,7 @@ To address the above two issues, we propose Polar R-CNN, a novel anchor-based me
|
||||
\begin{figure*}[ht]
|
||||
\centering
|
||||
\includegraphics[width=0.99\linewidth]{thesis_figure/ovarall_architecture.png}
|
||||
\caption{An illustration of the Polar R-CNN architecture. It has a similar pipelines with the Faster R-CNN for the task of object detection, and consists of a backbone, a \textit{Feature Pyramid Network} with three levels of feature maps, respectively denote by $P_0, P_1, P_2$, followed by a \textit{Local Polar Module}, and a RoI pooling module to extract features fed to a \textit{Global Polar Module} for lane detection. Based on the designed lane representation and lane anchor representation in polar coordinate system, the local polar module can propose sparse line anchors and the global polar module can produce the robust and accurate lane predictions. The global polar module includes a triplet head, which comprises a \textit{one-to-one (O2O)} classification head, a \textit{one-to-many} (O2M) classification head , and a \textit{one-to-many} (O2M) regression head.}
|
||||
\caption{An illustration of the Polar R-CNN architecture. It has a pipeline similar to that of Faster R-CNN for object detection, and consists of a backbone, a \textit{Feature Pyramid Network} with three levels of feature maps, respectively denoted by $P_1, P_2, P_3$, followed by a \textit{Local Polar Module}, and a \textit{Global Polar Module} for lane detection. Based on the designed lane representation and lane anchor representation in the polar coordinate system, the local polar module proposes sparse line anchors and the global polar module produces the final accurate lane predictions. The global polar module includes a triplet head, which comprises a \textit{one-to-one} (O2O) classification head, a \textit{one-to-many} (O2M) classification head, and a \textit{one-to-many} (O2M) regression head.}
|
||||
\label{overall_architecture}
|
||||
\end{figure*}
|
||||
\section{Related Works}
|
||||
@ -151,7 +151,7 @@ categorizes lane instances by angles and locations, allowing it to detect only a
|
||||
\par
|
||||
\textbf{NMS-free Methods.} Due to the threshold sensitivity and computational overhead of NMS, many studies attempt to develop NMS-free methods, \textit{i.e.}, models that do not use NMS during the detection process. For example, \textit{DETR} \cite{detr} employs one-to-one label assignment to avoid redundant predictions without using NMS. Other NMS-free methods \cite{yolov10}\cite{learnNMS}\cite{date} have also been proposed to address this issue from two aspects: \textit{model architecture} and \textit{label assignment}. For example, studies in \cite{yolov10}\cite{date} suggest that one-to-one assignments are crucial for NMS-free predictions, but maintaining one-to-many assignments is still necessary to ensure effective feature learning of the model. Some works \cite{o3d}\cite{relationnet} instead consider the model’s expressive capacity to provide non-redundant predictions. However, compared to the extensive studies conducted in general object detection, there has been limited research analyzing the NMS-free paradigm in lane detection.
|
||||
\par
|
||||
In this work, we aim to address the above two issues in the framework of anchor-based detection to achieve NMF-free and non-redundant lane predictions.
|
||||
In this work, we aim to address the above two issues in the framework of anchor-based lane detection to achieve NMS-free and non-redundant lane predictions.
|
||||
%
|
||||
%
|
||||
\section{Polar R-CNN}
|
||||
@ -169,26 +169,25 @@ In this work, we aim to address the above two issues in the framework of anchor-
|
||||
\includegraphics[width=\imgwidth]{thesis_figure/coord/polar.png}
|
||||
\caption{}
|
||||
\end{subfigure}
|
||||
\caption{Different descriptions for anchor parameters: (a) Ray: defined by its start point (\textit{e.g.} the green point $\left( x_{1}^{s},y_{1}^{s} \right)$ or the yellow point $\left( x_{2}^{s},y_{2}^{s} \right) $) and direction $\theta$. (b) Polar: defined by its radius $r$ and angle $\theta$.} %rectangular coordinates
|
||||
\caption{Different descriptions for anchor parameters: (a) Ray: defined by its start point (\textit{e.g.} the green point $\left( x_{1}^{s},y_{1}^{s} \right)$ or the yellow point $\left( x_{2}^{s},y_{2}^{s} \right) $) and direction $\theta^{s}$. (b) Polar: defined by its radius $r$ and angle $\theta$.} %rectangular coordinates
|
||||
\label{coord}
|
||||
\end{figure}
|
||||
%
|
||||
The overall architecture of our Polar R-CNN is illustrated in Fig. \ref{overall_architecture}. As shown in this figure, our Polar R-CNN for lane detection has a similar pipeline with Faster R-CNN \cite{fasterrcnn}, which consists of a backbone\cite{resnet}, a \textit{Feature Pyramid Network} (FPN) \cite{fpn}, a \textit{Region Proposal Network} (RPN) \cite{fasterrcnn} followed by a \textit{Local Polar Module} (LPM), and \textit{Region of Interest} (RoI) \cite{fasterrcnn} pooling module followed by a \textit{Global Polar Module} (GPM). In the following, we first introduce the polar coordinate representation of lane and lane anchors, and then present the designed LPM and GPM in our Polar R-CNN. %To investigate the fundamental factors affecting model performance, such as anchor settings and NMS post-processing, and also to enhance ease of deployment, our Polar R-CNN utilizes a simple and straightforward network structure. just relying on basic components, including convolutional or pooling operations, \textit{Multi-Layer Perceptrons} (MLPs), while deliberately excluding advanced elements like \textit{attention mechanisms}, \textit{dynamic kernels}, and \textit{cross-layer refinement} used in previous works \cite{clrnet}\cite{clrernet}.
|
||||
%\par
|
||||
The overall architecture of our Polar R-CNN is illustrated in Fig. \ref{overall_architecture}. As shown in this figure, our Polar R-CNN has a pipeline analogous to that of Faster R-CNN \cite{fasterrcnn}, consisting of a backbone \cite{resnet}, a \textit{Feature Pyramid Network} (FPN) \cite{fpn}, a \textit{Local Polar Module} (LPM) serving as the \textit{Region Proposal Network} (RPN) \cite{fasterrcnn}, and a \textit{Global Polar Module} (GPM) serving as the \textit{Region of Interest} (RoI) \cite{fasterrcnn} pooling module. In the following, we first introduce the polar coordinate representation of lane anchors, and then present the designed LPM and GPM in our Polar R-CNN.
|
||||
|
||||
%
|
||||
\subsection{Representation of Lane and Lane Anchor}
|
||||
%
|
||||
Lanes are characterized by their thin, elongated, and curved shapes. A well-defined lane prior aids the model in feature extraction and location prediction.
|
||||
\par
|
||||
\textbf{Lane and Anchor Representation as Ray.} Given an input image with dimensions of length $W$ and height $H$, a lane is represented by a set of 2D points $X=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$ with equally spaced y-coordinates, i.e., $y_i=i\times\frac{H}{N}$, where $N$ is the number of data points. Since the y-coordinate is fixed, a lane can be uniquely defined by its x-coordinates. Previous studies \cite{linecnn}\cite{laneatt} have introduced \textit{lane priors}, also known as \textit{lane anchors}, which are represented as straight lines in the image plane and served as references. From a geometric perspective, a lane anchor can be viewed as a ray defined by a start point $(x_{0},y_{0})$ located at the edge of an image (left/bottom/right boundaries), along with a direction $\theta$. The primary task of a lane detection model is to estimate the x-coordinate offset from the lane anchor to the ground truth of the lane instance.
|
||||
\textbf{Lane and Anchor Representation as Ray.} Given an input image with dimensions of width $W$ and height $H$, a lane is represented by a set of 2D points $X=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$ with equally spaced y-coordinates, i.e., $y_i=i\times\frac{H}{N}$, where $N$ is the number of data points. Since the y-coordinates are fixed, a lane can be uniquely defined by its x-coordinates. Previous studies \cite{linecnn}\cite{laneatt} have introduced \textit{lane priors}, also known as \textit{lane anchors}, which are represented as straight lines in the image plane and serve as references. From a geometric perspective, a lane anchor can be viewed as a ray defined by a start point $(x^{s},y^{s})$ located at the edge of an image (left/bottom/right boundaries), along with a direction $\theta^s$. The primary task of a lane detection model is to estimate the x-coordinate offset from the lane anchor to the ground truth of the lane instance.
|
||||
\par
|
||||
However, the representation of lane anchors as rays presents certain limitations. Notably, a lane anchor can have an infinite number of potential start points, which makes the definition of its start point ambiguous and subjective. As illustrated in Fig. \ref{coord}(a), the studies in \cite{dalnet}\cite{laneatt}\cite{linecnn} define the start points as being located at the boundaries of an image, such as the green point in Fig. \ref{coord}(a). In contrast, the research presented in \cite{adnet} defines the start points, exemplified by the purple point in Fig. \ref{coord}(a), based on their actual visual locations within the image. Moreover, occlusion and damage to the lane significantly affect the detection of these start points, highlighting the need for the model to have a large receptive field \cite{adnet}. Essentially, a straight lane has two degrees of freedom: the slope and the intercept, under a Cartesian coordinate system, implying that the lane anchor could be described using just two parameters instead of the three redundant parameters (\textit{i.e.}, two for the start point and one for orientation) employed in ray representation.
|
||||
However, the representation of lane anchors as rays presents certain limitations. Notably, a lane anchor can have an infinite number of potential start points, which makes the definition of its start point ambiguous and subjective. As illustrated in Fig. \ref{coord}(a), the studies in \cite{dalnet}\cite{laneatt}\cite{linecnn} define the start points as being located at the boundaries of an image, such as the green point in Fig. \ref{coord}(a). In contrast, the research presented in \cite{adnet} defines the start points, exemplified by the purple point in Fig. \ref{coord}(a), based on their actual visual locations within the image. Moreover, occlusion and damage to the lane significantly affect the detection of these start points, highlighting the need for the model to have a large receptive field \cite{adnet}. Essentially, a straight lane has two degrees of freedom: the slope and the intercept, under a Cartesian coordinate system, implying that the lane anchor could be described using just two parameters instead of the three redundant parameters (\textit{i.e.}, two for the start point and one for the direction) employed in ray representation.
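|
||||
Concretely, taking the pole as the origin and the polar axis along the positive $x$-direction (our notational assumption; the implementation's axis conventions may differ), a straight lane anchor with polar parameters $\left( \theta, r \right)$ can be written in the standard Hesse normal form
|
||||
\begin{equation}
x\cos \theta +y\sin \theta =r,
\end{equation}
|
||||
so that exactly two parameters suffice, and the radius vector from the pole to its foot on the line forms the angle $\theta$ with the polar axis, consistent with the polar description in Fig. \ref{coord}(b).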
|
||||
%
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
\includegraphics[width=0.87\linewidth]{thesis_figure/coord/localpolar.png}
|
||||
\caption{The local polar coordinate system. The ground truth of the radius $\hat{r}_{i}^{l}$ of the $i$-th local pole is defines as the minimum distance from the pole to the lane curve instance. A positive pole has a radius $\hat{r}_{i}^{l}$ that is below a threshold $\lambda^{l}$, and vice versa. Additionally, the ground truth angle $\hat{\theta}_i$ is determined by the angle formed between the radius vector (connecting the pole to the closest point on the lanes) and the local polar axis.}
|
||||
\caption{The local polar coordinate system. The ground truth radius $\hat{r}_{i}^{l}$ of the $i$-th local pole is defined as the minimum distance from the pole to the lane curve instance. A pole is positive if its radius $\hat{r}_{i}^{l}$ is below a threshold $\lambda^{l}$, and negative otherwise. Additionally, the ground truth angle $\hat{\theta}_i$ is determined by the angle formed between the radius vector (connecting the pole to the closest point on the lane) and the polar axis.}
|
||||
\label{lpmlabel}
|
||||
\end{figure}
|
||||
\par
|
||||
@ -196,18 +195,20 @@ However, the representation of lane anchors as rays presents certain limitations
|
||||
\par
|
||||
To better leverage the local inductive bias of CNNs, we define two types of polar coordinate systems: the local and the global coordinate system. The local polar coordinate system is used to generate lane anchors, while the global coordinate system expresses these anchors in a unified form over the entire image and regresses them to the ground truth lane instances. Given the distinct roles of the local and global systems, we adopt a two-stage framework for our Polar R-CNN, similar to Faster R-CNN \cite{fasterrcnn}.
|
||||
\par
|
||||
The local polar system is designed to predict lane anchors adaptable to both sparse and dense scenarios. In this system, there are many poles with each as the lattice point of the feature map, referred to as local poles. As illustrated on the left side of Fig. \ref{lpmlabel}, there are two types of local poles: positive and negative. Positive local poles (\textit{e.g.}, the blue points) have a radius $r_{i}^{l}$ below a threshold $\lambda^l$, otherwise, they are classified as negative local poles (\textit{e.g.}, the red points). Each local pole is responsible for predicting a single lane anchor. While a lane ground truth may generate multiple lane anchors, as shown in Fig. \ref{lpmlabel}, there are three positive poles around the lane instance (green lane), which are expected to generate three lane anchors. This one-to-many approach is essential for ensuring comprehensive anchor proposals, especially since some local features around certain poles may be lost due to damage or occlusion of the lane curve.
|
||||
The local polar system is designed to predict lane anchors adaptable to both sparse and dense scenarios. In this system, there are many poles, each located at a lattice point of the feature map, referred to as local poles. As illustrated on the left side of Fig. \ref{lpmlabel}, there are two types of local poles: positive and negative. Positive local poles (\textit{e.g.}, the blue points) have a radius $r_{i}^{l}$ below a threshold $\lambda^l$; otherwise, they are classified as negative local poles (\textit{e.g.}, the red points). Each local pole is responsible for predicting a single lane anchor, and a lane ground truth may generate multiple lane anchors; as shown in Fig. \ref{lpmlabel}, there are three positive poles around the lane instance (the green lane), which are expected to generate three lane anchors.
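|
||||
Label generation for local poles (formalized together with the LPM loss below) can be sketched as follows; the dense sampling of the lane curve into points is an illustrative assumption:
|
||||
\begin{verbatim}
import torch

def local_pole_labels(poles, lane_pts):
    """Sketch: hat{r}^l_i is the min distance from pole i to the
    lane points, and hat{theta}_i the angle of the radius vector
    to the closest point. poles: (M, 2); lane_pts: (L, 2)."""
    d = torch.cdist(poles, lane_pts)      # (M, L) distances
    r_hat, idx = d.min(dim=1)             # closest lane point
    vec = lane_pts[idx] - poles           # radius vectors
    theta_hat = torch.atan2(vec[:, 1], vec[:, 0])
    return r_hat, theta_hat
\end{verbatim}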
|
||||
|
||||
%This one-to-many approach is essential for ensuring comprehensive anchor proposals, especially since some local features around certain poles may be lost due to damage or occlusion of the lane curve.
|
||||
\par
|
||||
In the local polar coordinate system, the parameters of each lane anchor are determined based on the location of its corresponding local pole. However, in practical terms, once a lane anchor is generated, its position becomes fixed and independent from its original local pole. To simplify the representation of lane anchors in the second stage of Polar-RCNN, a global polar system has been designed, featuring a single pole that serves as a reference point for the entire image. The location of this global pole is manually set, and in this case, it is positioned near the static vanishing point observed across the entire lane image dataset. This approach ensures a consistent and unified framework for expressing lane anchors within the global context of the image, facilitating accurate regression to the ground truth lane instances.
|
||||
In the local polar coordinate system, the parameters of each lane anchor are determined based on the location of its corresponding local pole. In practice, however, once a lane anchor is generated, its position becomes fixed and independent of its original local pole. To simplify the representation of lane anchors in the second stage of Polar R-CNN, a global polar system has been designed, featuring a single unified pole that serves as a reference point for the entire image. The location of this global pole is manually set; in this case, it is positioned near the static vanishing point observed across the entire lane image dataset. This ensures a consistent and unified polar coordinate system for expressing lane anchors within the global context of the image, facilitating accurate regression to the ground truth lane instances.
|
||||
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
\includegraphics[width=0.45\textwidth]{thesis_figure/local_polar_head.png}
|
||||
\caption{The main architecture of local polar module.}
|
||||
\label{l}
|
||||
\caption{An illustration of the structure of LPM.}
|
||||
\label{lpm}
|
||||
\end{figure}
|
||||
\subsection{Local Polar Module}
|
||||
As shown in Fig. \ref{overall_architecture}, three levels of feature maps, denoted as $P_1, P_2, P_3$, are extracted using a \textit{Feature Pyramid Network} (FPN). To generate high-quality anchors around the lane ground truths within an image, we introduce the \textit{Local Polar Module} (LPM), which takes the highest feature map $P_3\in\mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ as input and outputs a set of lane anchors along with their confidence scores. As demonstrated in Fig. \ref{l}, it undergoes a \textit{downsampling} operation $DS(\cdot)$ to produce a lower-dimensional feature map of a size $H^l\times W^l$:
|
||||
As shown in Fig. \ref{overall_architecture}, three levels of feature maps, denoted as $P_1, P_2, P_3$, are extracted using a \textit{Feature Pyramid Network} (FPN). To generate high-quality anchors around the lane ground truths within an image, we introduce the \textit{Local Polar Module} (LPM), which takes the highest-level feature map $P_3\in\mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ as input and outputs a set of lane anchors along with their confidence scores. As demonstrated in Fig. \ref{lpm}, $P_3$ undergoes a \textit{downsampling} operation $DS(\cdot)$ to produce a lower-dimensional feature map of size $H^l\times W^l$:
|
||||
\begin{equation}
|
||||
F_d\gets DS\left( P_{3} \right)\ \text{and}\ F_d\in \mathbb{R} ^{C_f\times H^{l}\times W^{l}}.
|
||||
\end{equation}
|
||||
@ -218,24 +219,25 @@ F_{cls}\gets \phi _{cls}^{l}\left( F_d \right)\ &\text{and}\ F_{cls}\in \mathbb{
|
||||
\end{align}
|
||||
The regression branch consists of a single $1\times1$ convolutional layer, with the goal of generating lane anchors by outputting their angles $\theta_j$ and radii $r^{l}_{j}$, \textit{i.e.}, $F_{reg} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, in the local polar coordinate system defined previously. Similarly, the classification branch $\phi _{cls}^{l}\left(\cdot \right)$ consists of only two $1\times1$ convolutional layers for simplicity. This branch predicts the confidence heat map $F_{cls}\equiv \left\{ s_j^l \right\} _{j=1}^{H^l\times W^l}$ of local poles, each associated with a feature point. By discarding local poles with lower confidence, the module increases the likelihood of selecting potential positive foreground lane anchors while effectively removing background lane anchors.
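|
||||
To make the anchor-proposal step concrete, a minimal PyTorch-style sketch of the LPM is given below; the choice of adaptive average pooling for $DS(\cdot)$, the layer widths, and all identifiers are illustrative assumptions rather than the exact implementation:
|
||||
\begin{verbatim}
import torch.nn as nn

class LocalPolarModule(nn.Module):
    """Sketch of the LPM: one polar anchor (theta, r) and one
    confidence score per local pole on the H^l x W^l grid."""
    def __init__(self, c_f=64, h_l=4, w_l=10):
        super().__init__()
        # DS(.): downsampling to H^l x W^l; the exact operator is
        # unspecified in the text, so pooling is an assumption.
        self.ds = nn.AdaptiveAvgPool2d((h_l, w_l))
        # Regression branch: a single 1x1 conv -> (theta_j, r_j).
        self.reg = nn.Conv2d(c_f, 2, kernel_size=1)
        # Classification branch: two 1x1 convs -> confidence s_j.
        self.cls = nn.Sequential(
            nn.Conv2d(c_f, c_f, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c_f, 1, kernel_size=1))

    def forward(self, p3):         # p3: (B, C_f, H_f, W_f)
        f_d = self.ds(p3)          # (B, C_f, H^l, W^l)
        return self.reg(f_d), self.cls(f_d)
\end{verbatim}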
|
||||
\par
|
||||
\textbf{Loss Function for LPM.} To train the local polar module, we define the ground truth labels for each local pole as follows: the ground truth radius, $\hat{r}^l_i$, is set to be the minimum distance from a local pole to the corresponding lane curve, while the ground truth angle, $\hat{\theta}_i$, is set to be the orientation of the vector extending from the local pole to the nearest point on the curve. A positive pole is labeled as one; otherwise, it is labeled as zero. Consequently, we have a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ if it is negative. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for classification branch. The loss functions for LPM are given as follows:
|
||||
\textbf{Loss Function for LPM.} To train the LPM, we define the ground truth labels for each local pole as follows: the ground truth radius, $\hat{r}^l_i$, is set to be the minimum distance from a local pole to the corresponding lane curve, while the ground truth angle, $\hat{\theta}_i$, is set to be the orientation of the vector extending from the local pole to the nearest point on the curve. Consequently, we have a label set of local poles $F_{gt}=\{\hat{s}_j^l\}_{j=1}^{H^l\times W^l}$, where $\hat{s}_j^l=1$ if the $j$-th local pole is positive and $\hat{s}_j^l=0$ if it is negative. Once the regression and classification labels are established, as shown in Fig. \ref{lpmlabel}, LPM can be trained using the $Smooth_{L1}$ loss $S_{L1}\left(\cdot \right)$ for the regression branch and the \textit{binary cross-entropy} loss $BCE\left( \cdot , \cdot \right)$ for the classification branch. The loss functions for LPM are given as follows:
|
||||
\begin{align}
|
||||
\mathcal{L} ^{l}_{cls}&=BCE\left( F_{cls},F_{gt} \right)
|
||||
\mathcal{L} ^{l}_{cls}&=BCE\left( F_{cls},F_{gt} \right)\\
|
||||
\mathcal{L} _{reg}^{l}&=\frac{1}{N_{pos}^{l}}\sum_{j\in \left\{ j|\hat{r}_{j}^{l}<\lambda^l \right\}}{\left( S_{L1}\left( \theta _{j}^{l}-\hat{\theta}_{j}^{l} \right) +S_{L1}\left( r_{j}^{l}-\hat{r}_{j}^{l} \right) \right)}
|
||||
\label{loss_lph}
|
||||
\end{align}
|
||||
where $N^{l}_{pos}=\left|\{j|\hat{r}_j^l<\lambda^{l}\}\right|$ is the number of positive local poles in LPM.
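|
||||
A minimal sketch of these two losses over flattened pole tensors is given below; treating $F_{cls}$ as raw logits, as well as the names and shapes, are illustrative assumptions:
|
||||
\begin{verbatim}
import torch
import torch.nn.functional as F

def lpm_loss(f_cls, f_reg, gt_theta, gt_r, lambda_l):
    """Sketch of the LPM losses. f_cls: (M,) logits; f_reg:
    (M, 2) predicted (theta, r); gt_r: (M,) ground-truth radii;
    M = H^l * W^l local poles."""
    pos = gt_r < lambda_l              # positive local poles
    l_cls = F.binary_cross_entropy_with_logits(f_cls,
                                               pos.float())
    n_pos = pos.sum().clamp(min=1)
    l_reg = (F.smooth_l1_loss(f_reg[pos, 0], gt_theta[pos],
                              reduction='sum')
             + F.smooth_l1_loss(f_reg[pos, 1], gt_r[pos],
                                reduction='sum')) / n_pos
    return l_cls, l_reg
\end{verbatim}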
|
||||
\par
|
||||
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a local pole in the feature map, are all considered as candidates during the training stage. However, some of these anchors serve as background anchors. We select top-$K$ anchors with the highest confidence scores as the foreground candidates to feed into the second stage (\textit{i.e.} global polar head). During training, all anchors are chosen as candidates, where $K=H^{l}\times W^{l}$ because it aids \textit{Global Polar Module} (the next stage) in learning from a diverse range of features, including various negative background anchor samples. Conversely, during the evaluation stage, some of the anchors with lower confidence can be excluded, where $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of anchor distribution while decreasing the total number of anchors especially in the sprase scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
|
||||
\textbf{Top-$K$ Anchor Selection.} As discussed above, all $H^{l}\times W^{l}$ anchors, each associated with a local pole in the feature map, are considered as candidates during the training stage. However, some of these anchors serve as background anchors. We select the $K$ anchors with the highest confidence scores as the foreground candidates to feed into the second stage (\textit{i.e.}, the global polar module). During training, all anchors are chosen as candidates, \textit{i.e.}, $K=H^{l}\times W^{l}$, because this assists the \textit{Global Polar Module} (the second stage) in learning from a diverse range of features, including various negative background anchor samples. Conversely, during the evaluation stage, some of the anchors with lower confidence can be excluded, with $K\leqslant H^{l}\times W^{l}$. This strategy effectively filters out potential negative anchors and reduces the computational complexity of the second stage. By doing so, it maintains the adaptability and flexibility of anchor distribution while decreasing the total number of anchors, especially in sparse scenarios. The following experiments will demonstrate the effectiveness of different top-$K$ anchor selection strategies.
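|
||||
The selection itself reduces to a top-$K$ operation over the pole confidences, as in the following sketch (identifiers are illustrative):
|
||||
\begin{verbatim}
import torch

def select_topk_anchors(scores, anchors, k):
    """Sketch of top-K anchor selection. scores: (M,) pole
    confidences; anchors: (M, 2) polar parameters (theta, r).
    Training uses k = M; evaluation uses k <= M."""
    idx = torch.topk(scores, k=min(k, scores.numel())).indices
    return anchors[idx], scores[idx]
\end{verbatim}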
|
||||
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{thesis_figure/detection_head.png}
|
||||
\caption{The main pipeline of GPM. It comprises the RoI Pooling Layer alongside the triplet head. The triplet head consists of three parts, namely, the O2O classification head, the O2M classification head, and the O2M regression head. The predictions generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate Non-Maximum Suppression (NMS) post-processing (the gray dashed route). The O2O classification head functions as a substitute for NMS, directly delivering the non-redundant prediction scores $\left\{\tilde{s}_i^g\right\}$ based on $\left\{s_i^g\right\}$ (the green solid route). Both $\left\{s_i^g\right\}$ and $\left\{\tilde{s}_i^g\right\}$ participate in the selection of final non-redundant results, which is called dual confidence selection. During backword training, the gradient from the O2O classification head are stopped (the blue dashed route) to the RoI pooling module.}
|
||||
\label{g}
|
||||
\caption{The main pipeline of GPM. It comprises the RoI pooling layer alongside the triplet head. The triplet head consists of three parts, namely, the O2O classification head, the O2M classification head, and the O2M regression head. The scores generated by the O2M classification head $\left\{s_i^g\right\}$ exhibit redundancy and necessitate \textit{Non-Maximum Suppression} (NMS) post-processing (the gray dashed route). The O2O classification head functions as a substitute for NMS, directly delivering the non-redundant scores $\left\{\tilde{s}_i^g\right\}$ based on $\left\{s_i^g\right\}$ (the green solid route). Both $\left\{s_i^g\right\}$ and $\left\{\tilde{s}_i^g\right\}$ are used to select the final non-redundant results, a procedure referred to as dual confidence selection. During the backward pass of training, the gradients from the O2O classification head (the blue dashed route) are stopped.}
|
||||
\label{gpm}
|
||||
\end{figure}
|
||||
|
||||
\subsection{Global Polar Module}
|
||||
Similar to the pipeline of Faster R-CNN, LPM serves as the first stage for generating lane anchor proposals. As illustrated in Fig. \ref{overall_architecture}, we introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve final lane prediction. GPM takes features samples from anchors and outputs the precise location and confidence scores of final lane detection results. The overall architecture of GPM is illustrated in the Fig. \ref{g}.
|
||||
We introduce a novel \textit{Global Polar Module} (GPM) as the second stage to achieve the final lane predictions. As illustrated in Fig. \ref{overall_architecture}, GPM takes feature samples from the anchors proposed by LPM and provides the precise locations and confidence scores of the final lane detection results. The overall architecture of GPM is illustrated in Fig. \ref{gpm}.
|
||||
\par
|
||||
\textbf{RoI Pooling Layer.} It is designed to extract sampled features from lane anchors. For ease of the sampling operation, we first convert the radius of each positive lane anchor in the local polar coordinate system, $r_j^l$, to that in the global polar coordinate system, $r_j^g$, by the following equation
|
||||
\begin{align}
|
||||
@ -249,29 +251,29 @@ i&=1,2,\cdots,N_p,\notag
|
||||
\end{align}
|
||||
where the y-coordinates $\boldsymbol{y}_{j}^{s}\equiv \{y_{1,j},y_{2,j},\cdots ,y_{N_p,j}\}$ of the $j$-th lane anchor are uniformly sampled vertically from the image, as previously mentioned. Then the x-coordinates $\boldsymbol{x}_{j}^{s}\equiv \{x_{1,j},x_{2,j},\cdots ,x_{N_p,j}\}$ are calculated by Eq. \ref{positions}.
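|
||||
Assuming the Hesse normal form stated earlier, now taken about the global pole, the sampling of Eq. \ref{positions} can be sketched as follows; this reconstruction is our assumption, and the paper's exact equation may differ in its conventions:
|
||||
\begin{verbatim}
import math

def anchor_sample_points(theta, r_g, pole_xy, ys):
    """Sketch: x-coordinates of the N_p sample points of a
    polar anchor, assuming the line satisfies
    (x - x_c)cos(theta) + (y - y_c)sin(theta) = r_g about the
    global pole (x_c, y_c). theta, r_g: floats; ys: (N_p,)
    array of fixed row coordinates."""
    x_c, y_c = pole_xy
    return x_c + (r_g - (ys - y_c) * math.sin(theta)) \
               / math.cos(theta)
\end{verbatim}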
|
||||
\par
|
||||
Given the feature maps $P_1, P_2, P_3$ from FPN, we can extract feature vectors corresponding to the positions of feature points $\{(x_{1,j},y_{1,j}),(x_{2,j},y_{2,j}),\cdots,(x_{N,j},y_{N,j})\}_{j=1}^{K}$, respectively denoted as $\boldsymbol{F}_{1}, \boldsymbol{F}_{2}, \boldsymbol{F}_{3}\in \mathbb{R} ^{K\times C_f}$. To enhance representation, similar to \cite{detr}, we employ a weighted sum strategy to combine features from different levels as
|
||||
Given the feature maps $P_1, P_2, P_3$ from FPN, we can extract feature vectors corresponding to the positions of the feature points $\{(x_{1,j},y_{1,j}),(x_{2,j},y_{2,j}),\cdots,(x_{N_p,j},y_{N_p,j})\}_{j=1}^{K}$, respectively denoted as $\boldsymbol{F}_{1,j}, \boldsymbol{F}_{2,j}, \boldsymbol{F}_{3,j}\in \mathbb{R} ^{N_p\times C_f}$. To enhance the representation, similar to \cite{detr}, we employ a weighted sum strategy to combine features from different levels as
|
||||
\begin{equation}
|
||||
\boldsymbol{F}^s=\sum_{k=1}^3{\boldsymbol{F}_{k}\otimes \frac{e^{\boldsymbol{w}_{k}}}{\sum_{k=0}^3{e^{\boldsymbol{w}_{k}}}}},
|
||||
\boldsymbol{F}^s_j=\sum_{k=1}^3{\boldsymbol{F}_{k,j}\otimes \frac{e^{\boldsymbol{w}_{k}}}{\sum_{k'=1}^3{e^{\boldsymbol{w}_{k'}}}}},
|
||||
\end{equation}
|
||||
where $\boldsymbol{w}_{k}\in \mathbb{R} ^{N^{l}_{pos}}$ represents the learnable aggregate weight, serving as a learned model weight. Instead of concatenating the three sampling features into $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f\times 3}$ directly, the adaptive summation significantly reduces the feature dimensions to $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f}$, which is one-third of the original dimension. The weighted sum tensors are subsequently subjected to a linear transformation, thereby yielding the pooled RoI features associated with the corresponding anchor:
|
||||
where $\boldsymbol{w}_{k}\in \mathbb{R}^{N_p}$ is a learnable aggregation weight. Instead of directly concatenating the three sampled features into $\boldsymbol{F}^s_j\in \mathbb{R} ^{N_p\times C_f\times 3}$, the adaptive summation reduces the feature dimension to $\boldsymbol{F}^s_j\in \mathbb{R} ^{N_p\times C_f}$, one-third of the original. The weighted sum tensors are subsequently subjected to a linear transformation, yielding the pooled RoI features associated with the corresponding anchor:
|
||||
\begin{align}
|
||||
\boldsymbol{F}^{roi}\gets \boldsymbol{W}_{pool}\boldsymbol{F}^s, \,\boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
|
||||
\boldsymbol{F}^{roi}_j\gets \boldsymbol{W}_{pool}\boldsymbol{F}^s_j, \,\boldsymbol{F}^{roi}_j\in \mathbb{R} ^{d_r}.
|
||||
\end{align}
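|
||||
A compact sketch of this adaptive aggregation and pooling is given below; treating $\boldsymbol{W}_{pool}$ as a linear layer over the flattened feature is our assumption:
|
||||
\begin{verbatim}
import torch

def pool_roi_feature(f_levels, w, w_pool):
    """Sketch of the adaptive weighted sum over FPN levels.
    f_levels: (3, N_p, C_f) features sampled along one anchor;
    w: (3, N_p) learnable weights; w_pool: nn.Linear mapping
    the flattened (N_p*C_f,) feature to R^{d_r}."""
    weights = torch.softmax(w, dim=0)   # normalize over levels
    f_s = (f_levels * weights.unsqueeze(-1)).sum(dim=0)
    return w_pool(f_s.flatten())        # F_j^roi in R^{d_r}
\end{verbatim}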
|
||||
|
||||
\textbf{Triplet Head.} With the $\boldsymbol{F}^{roi}$ as input of the Triplet Head, it encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{g}. In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eradicates duplicate results. Nevertheless, NMS relies on the definition of the geometric distance between detection results, rendering this calculation intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis. To attain optimal non-redundant detection outcomes within a NMS-free paradigm (i.e., end-to-end detection), both the one-to-one and one-to-many paradigms become pivotal during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss} but with subtle variations, we architect the triplet head to achieve a NMS-free paradigm.
|
||||
|
||||
To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are constructed using a straightforward architecture featuring two-layer Multi-Layer Perceptrons (MLPs). To facilitate the model’s transition to an end-to-end paradigm, we have developed an extended O2O classification head. As illustrated in Fig. \ref{g}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i^g \right\}$ output by the O2O classificatoin head relies upon the confidence $\left\{ s_i^g \right\} $ output by the O2M classification head.
|
||||
|
||||
\textbf{Triplet Head.} Taking $\boldsymbol{F}^{roi}_j$ as input, the triplet head encompasses three distinct components: the one-to-one (O2O) classification head, the one-to-many (O2M) classification head, and the one-to-many (O2M) regression head, as depicted in Fig. \ref{gpm}. To attain optimal non-redundant detection results within an NMS-free paradigm (\textit{i.e.}, end-to-end detection), both the one-to-one and one-to-many paradigms are pivotal during the training stage, as underscored in \cite{o2o}. Drawing inspiration from \cite{o3d}\cite{pss} but with subtle variations, we architect the triplet head to achieve an NMS-free paradigm.
|
||||
%In numerous studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly adheres to the one-to-many paradigm. During the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation phase, redundant detection outcomes are frequently predicted for each instance. These redundancies are conventionally mitigated using Non-Maximum Suppression (NMS), which eradicates duplicate results. Nevertheless, NMS relies on the definition of the geometric distance between detection results, rendering this calculation intricate for curvilinear lanes. Moreover, NMS post-processing introduces challenges in balancing recall and precision, a concern highlighted in our previous analysis.
|
||||
%As illustrated in Fig. \ref{gpm}, it is important to note that the detection process of the O2O classification head is not independent; rather, the confidence $\left\{ \tilde{s}_i^g \right\}$ output by the O2O classificatoin head relies upon the confidence $\left\{ s_i^g \right\} $ output by the O2M classification head.
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
\includegraphics[width=0.9\linewidth]{thesis_figure/gnn.png}
|
||||
\caption{The main architecture of O2O classification head. Each anchor is conceived as a node within the GNN, with the associated ROI feature $\left\{ \boldsymbol{F}_i^{roi}\right\}$ as the node feature. The interconnecting directed edges are established based on the scores emanating from the O2M classification head and the anchor geometric prior. In the illustration, the elements $A_{12}$, $A_{32}$ and $A_{65}$ are euqual to $1$ in the adjacent matrix $\boldsymbol{A}$, which implicit the existence of directed edges between corresponding node pairs (\textit{i.e.} $1\rightarrow2$, $3\rightarrow2$ and $6\rightarrow5$).}
|
||||
\caption{The graph construction in the O2O classification head. Each anchor is conceived as a node within the graph, with the associated RoI feature $\left\{\boldsymbol{F}_i^{roi}\right\}$ as the node feature. The interconnecting directed edges are established based on the scores emanating from the O2M classification head and the anchor geometric prior. In the illustration, the elements $A_{12}$, $A_{32}$ and $A_{54}$ are equal to $1$ in the adjacency matrix $\boldsymbol{A}$, which indicates the existence of directed edges between the corresponding node pairs (\textit{i.e.}, $1\rightarrow2$, $3\rightarrow2$ and $5\rightarrow4$).}
|
||||
\label{o2o_cls_head}
|
||||
\end{figure}
|
||||
|
||||
As shown in Fig. \ref{o2o_cls_head}, we introduce a novel architecture to O2O classification head, incorporates a \textit{graph neural network} \cite{gnn} (GNN) with a polar geometric prior. The GNN is designed to model the relationship between features $\boldsymbol{F}_{i}^{roi}$ sampled from different anchors. Based on our previous analysis, the distance between lanes should not only be modeled by explicit geometric properties but also consider implicit contextual semantics such as “double” and “forked” lanes. These types of lanes, despite their tiny geometric differences, should not be removed as redundant predictions. The insight of the GNN design is derived from Fast NMS \cite{yolact}, which operates without iterative processes. The detailed design can be found in the Appendix \ref{NMS_appendix}; here, we focus on elaborating the architecture of the O2O classification head.
|
||||
To ensure both simplicity and efficiency in our model, the O2M regression head and the O2M classification head are built as straightforward two-layer \textit{Multi-Layer Perceptrons} (MLPs). To facilitate the model's transition to an NMS-free paradigm, we have developed an extended O2O classification head. As shown in Fig. \ref{o2o_cls_head}, we construct a graph and incorporate a \textit{Graph Neural Network} (GNN) \cite{gnn} into the O2O classification head. The GNN is designed to model the relationships among the RoI features $\boldsymbol{F}_{i}^{roi}$ of the anchors.
|
||||
Drawing upon our previous analysis, the distance between two lanes should not only be modeled by explicit geometric properties but should also encompass implicit contextual semantics, such as “double” and “forked” lanes, which should not be eliminated as redundant predictions despite their small geometric differences. The insight behind the GNN design derives from Fast NMS \cite{yolact}, which operates without iterative processes. A comprehensive description of the design can be found in Appendix \ref{NMS_appendix}; in this section, we focus on elaborating the architecture of the O2O classification head.
|
||||
|
||||
In GNN, the essential components are nodes and edges. We have constructed a directed GNN as follows. Each anchor is conceptualized as a node, with the ROI features $\boldsymbol{F}_{i}^{roi}$ serving as the input features (\textit{i.e.}, initial signals) of these nodes. Directed edges between nodes are expressed by adjacent matrix $\boldsymbol{A}\in\mathrm{R}^{K\times K}$. Specifically, if one element $A_{ij}$ in $\boldsymbol{A}$ equals $1$, a directed edge exist from the $i$-th node and $j$-th node. The existence of an edge from one node to another is contingent upon two conditions. For simplification, we encapsulate the two conditions within two matrices.
|
||||
In a GNN, the essential components are nodes and edges. We construct a directed graph as follows. Each anchor is conceptualized as a node, with the RoI features $\boldsymbol{F}_{i}^{roi}$ serving as the input features (\textit{i.e.}, initial signals) of these nodes. Directed edges between nodes are expressed by an adjacency matrix $\boldsymbol{A}\in\mathbb{R}^{K\times K}$. Specifically, if an element $A_{ij}$ of $\boldsymbol{A}$ equals $1$, a directed edge exists from the $i$-th node to the $j$-th node. The existence of an edge from one node to another is contingent upon two criteria. For simplicity, we encapsulate the two criteria within two matrices.
|
||||
|
||||
% The first matrix is the positive selection matrix, denoted as $\boldsymbol{A}^{P}\in\mathbb{R}^{K\times K}$:
|
||||
% \begin{align}
|
||||
@ -300,7 +302,7 @@ The second component is the geometric prior matrix, denoted by $\boldsymbol{A}^{
|
||||
\end{cases}
|
||||
\label{geometric prior matrix}
|
||||
\end{align}
|
||||
This matrix indicates that an edge (\textit{e.g.} the relationship between two nodes) is considered to exist between two nodes \textit{only if} the two corresponding anchors are sufficiently close with each other. The distance between anchors is described by their global polar parameters.
|
||||
This matrix indicates that an edge is considered to exist between two nodes \textit{only if} the two corresponding anchors are sufficiently close to each other. The distance between anchors is characterized by their global polar parameters.
|
||||
|
||||
With the aforementioned two matrices, the overall adjacency matrix is formulated as $\boldsymbol{A} = \boldsymbol{A}^{C} \odot \boldsymbol{A}^{G}$, where ``$\odot$'' signifies element-wise multiplication. This indicates that the existence of an edge must satisfy both of the corresponding conditions. Subsequently, the relationships between the $i$-th anchor and the $j$-th anchor can be modeled as follows:
|
||||
\begin{align}
|
||||
@ -309,11 +311,11 @@ With the aforementioned two matrices, the overall adjacency matrix is formulated
|
||||
\tilde{\boldsymbol{F}}_{ij}^{edge}&\gets \boldsymbol{F}_{ij}^{edge}+\boldsymbol{W}_s\left( \boldsymbol{x}_{j}^{s}-\boldsymbol{x}_{i}^{s} \right) +\boldsymbol{b}_s,\label{edge_layer_3}\\
|
||||
\boldsymbol{D}_{ij}^{edge}&\gets \mathrm{MLP}_{edge}\left( \tilde{\boldsymbol{F}}_{ij}^{edge} \right) .\label{edge_layer_4}
|
||||
\end{align}
|
||||
Eq. (\ref{edge_layer_1})-(\ref{edge_layer_4}) establish the directed relationships from the $i$-th node and the $j$-th node. Here, tensor $\boldsymbol{D}_{ij}^{edge}$ signifies the semantic features of directed edge $E_{ij}$. With the directed edge characteristics provided for linked node pairs, we employ an element-wise max pooling layer to aggregate all the \textit{incoming edges} features of one node to refine its node features:
|
||||
Eq. (\ref{edge_layer_1})-(\ref{edge_layer_4}) establish the directed relationships from the $i$-th node to the $j$-th node. Here, the tensor $\boldsymbol{D}_{ij}^{edge}$ signifies the semantic features of the directed edge $E_{ij}$. With the directed edge features provided for linked node pairs, we employ an element-wise max pooling layer to aggregate all the \textit{incoming edge} features of a node to refine its node features:
|
||||
\begin{align}
|
||||
\boldsymbol{D}_{i}^{node}&\gets \underset{k\in \left\{ k|A_{ki}=1 \right\}}{\max}\boldsymbol{D}_{ki}^{edge}.
|
||||
\end{align}
|
||||
Here, inspired by \cite{o3d}\cite{pointnet}, the max pooling aims to get the most distinctive features alone the column of the adjacent matrix (\textit{i.e.}, the incoming edges). With the refined node features $\boldsymbol{D}_{i}^{node}$, the ultimate confidence scores $\tilde{s}_{i}^{g}$ are generated by the subsequent layers:
|
||||
Here, inspired by \cite{o3d}\cite{pointnet}, the max pooling aims to extract the most distinctive features along the columns of the adjacency matrix (\textit{i.e.}, the incoming edges). With the refined node features $\boldsymbol{D}_{i}^{node}\in \mathbb{R}^{d}$, the ultimate confidence scores $\tilde{s}_{i}^{g}$ are generated by the subsequent layers:
|
||||
\begin{align}
|
||||
\boldsymbol{F}_{i}^{node}&\gets \mathrm{MLP}_{node}\left( \boldsymbol{D}_{i}^{node} \right) ,
|
||||
\\
|
||||
@ -321,18 +323,18 @@ Here, inspired by \cite{o3d}\cite{pointnet}, the max pooling aims to get the mos
|
||||
\label{node_layer}
|
||||
\end{align}
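|
||||
The graph construction of the O2O classification head can be summarized by the following sketch; since the precise forms of $\boldsymbol{A}^{C}$ and $\boldsymbol{A}^{G}$ are deferred to the Appendix, the score-comparison rule and the proximity thresholds below are assumptions in the spirit of Fast NMS:
|
||||
\begin{verbatim}
import torch

def build_adjacency(s_o2m, theta, r_g, tau_theta, tau_r):
    """Sketch of A = A^C * A^G (element-wise). Assumed rule: a
    directed edge i->j exists when anchor i outscores anchor j
    under the O2M head (A^C) and both anchors are close in
    global polar parameters (A^G). tau_* are hypothetical."""
    a_c = s_o2m.unsqueeze(1) > s_o2m.unsqueeze(0)   # (K, K)
    d_theta = (theta.unsqueeze(1) - theta.unsqueeze(0)).abs()
    d_r = (r_g.unsqueeze(1) - r_g.unsqueeze(0)).abs()
    a_g = (d_theta < tau_theta) & (d_r < tau_r)
    eye = torch.eye(s_o2m.numel(), dtype=torch.bool)
    return a_c & a_g & ~eye                # no self-loops
\end{verbatim}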
|
||||
|
||||
\textbf{Dual Confidence Selection.} We employ dual confidence thresholds, denoted as $\lambda_{o2m}^s$ and $\lambda_{o2o}^s$, to select the positive (\textit{i.e.}, foreground) predictions. Within the conventional NMS framework, the predictions emanating from the O2M classification heads with confidences $\left\{ s_{i}^{g} \right\} $ surpassing $\lambda_{o2m}^s$ are designated as positive predictions. hese are subsequently channeled into the NMS post-processing stage to remove redundant predictions. In the NMS-free paradigm of our work, the final non-redundant predictions are selected through the following certerion:
|
||||
\textbf{Dual Confidence Selection.} Within the conventional NMS framework, the predictions emanating from the O2M classification head with confidences $\left\{ s_{i}^{g} \right\} $ surpassing $\lambda_{o2m}^s$ are designated as positive candidates. They are subsequently fed into the NMS post-processing stage to remove redundant predictions. In the NMS-free paradigm of our work, the final non-redundant predictions are selected through the following criterion:
|
||||
\begin{align}
\varOmega _{o2o}^{pos}\equiv \left\{ i|\tilde{s}_{i}^{g}>\lambda _{o2o}^{s} \right\} \cap \left\{ i|s_{i}^{g}>\lambda _{o2m}^{s} \right\}.
\end{align}
The $\varOmega _{o2o}^{pos}$ signifies the ultimate collection of non-redundant predictions, wherein both confidences satisfy the aforementioned conditions in conjunction with the dual confidence thresholds. This methodology of selecting non-redundant predictions is termed \textit{dual confidence selection}.
We employ dual confidence thresholds, denoted as $\lambda_{o2m}^s$ and $\lambda_{o2o}^s$, to select the final non-redundant positive predictions. $\varOmega _{o2o}^{pos}$ signifies the ultimate collection of non-redundant predictions, wherein both confidences satisfy the aforementioned conditions in conjunction with the dual confidence thresholds. This methodology of selecting non-redundant predictions is termed \textit{dual confidence selection}.
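In code, dual confidence selection reduces to a single boolean intersection; a minimal PyTorch sketch, with placeholder threshold values rather than tuned settings:
\begin{verbatim}
import torch

def dual_confidence_select(s_o2m, s_o2o, lam_o2m=0.5, lam_o2o=0.5):
    # s_o2m: O2M confidences {s_i^g}; s_o2o: refined O2O confidences
    keep = (s_o2o > lam_o2o) & (s_o2m > lam_o2m)  # Omega_o2o^pos as a mask
    return keep.nonzero(as_tuple=False).squeeze(-1)
\end{verbatim}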
\textbf{Label Assignment and Cost Function for GPM.} Following the previous works \cite{o3d}\cite{pss}, we use the dual assignment strategy for the label assignment of the triplet head. The cost function for the $i$-th prediction and the $j$-th ground truth is given as follows:
\begin{align}
\mathcal{C} _{ij}^{o2m}&=s_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},\\
\mathcal{C} _{ij}^{o2o}&=\tilde{s}_i^g\times \left( GIoU_{lane, \,ij} \right) ^{\beta},
\end{align}
where $\mathcal{C} _{ij}^{o2m}$ is the cost function for the O2M classification and regression heads, while $\mathcal{C} _{ij}^{o2o}$ is that for the O2O classification head, with $\beta$ serving as the trade-off hyperparameter between location and confidence. This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet}, taking both location and confidence into account. We have redefined the IoU function between lane instances, $GIoU_{lane}$, which differs slightly from previous work \cite{clrernet}. More details about $GIoU_{lane}$ can be found in Appendix \ref{giou_appendix}.
where $\mathcal{C} _{ij}^{o2m}$ is the cost function for the O2M classification and regression heads, while $\mathcal{C} _{ij}^{o2o}$ is that for the O2O classification head, with $\beta$ serving as the trade-off hyperparameter between location and confidence. This cost function is more compact than that in previous works \cite{clrnet}\cite{adnet}, taking both location and confidence into account. We have redefined the IoU function between lane instances, $GIoU_{lane}$, which differs slightly from previous work \cite{clrernet}. More details about $GIoU_{lane}$ can be found in Appendix \ref{giou_appendix}.
Given the cost matrix, we use SimOTA \cite{yolox} (one-to-many assignment) for the O2M classification head and the O2M regression head, and the Hungarian algorithm \cite{detr} (one-to-one assignment) for the O2O classification head.
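The one-to-one branch can be sketched with SciPy's Hungarian solver, as below; since a higher $\mathcal{C}_{ij}^{o2o}$ indicates a better match, it is treated here as a matching quality to be maximized. The pairwise $GIoU_{lane}$ matrix is assumed to be precomputed, the $\beta$ value is a placeholder, and SimOTA's dynamic-$k$ selection for the O2M branch is omitted for brevity.
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def o2o_assign(s_o2o, giou_lane, beta=2.0):
    # s_o2o: (N,) confidences; giou_lane: (N, M) pairwise lane GIoU;
    # clipping guards the power against non-positive GIoU values
    quality = s_o2o[:, None] * np.clip(giou_lane, 1e-6, None) ** beta
    pred_idx, gt_idx = linear_sum_assignment(quality, maximize=True)
    return pred_idx, gt_idx  # one prediction per ground truth
\end{verbatim}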
@ -341,7 +343,7 @@ Focal loss \cite{focal} is utilized for both O2O classification head and the O2M
\begin{align}
\varOmega _{o2o}=\left\{ i\mid s_i^g>\lambda_{o2m}^s \right\}.
\end{align}
In essence, certain samples with lower $\left\{ s_{i}^{g} \right\} $ are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$. Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head. Given the disparity between the label assignments of the O2O and O2M classification heads, to preserve the quality of RoI feature learning, the gradient is stopped from the O2O classification head to the RoI pooling head during the training process. This technique is also posited in \cite{pss}.
In essence, certain samples with lower $\left\{ s_{i}^{g} \right\} $ are excluded from the computation of $\mathcal{L}^{o2o}_{cls}$. Furthermore, we harness the rank loss $\mathcal{L} _{rank}$ as referenced in \cite{pss} to amplify the disparity between the positive and negative confidences of the O2O classification head. Given the disparity between the label assignments of the O2O and O2M classification heads, to preserve the quality of RoI feature learning, the gradient is stopped from the O2O classification head during the training process. This technique is also utilized in \cite{pss}.
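Both details amount to a mask and a detach; a minimal sketch, assuming PyTorch tensors and illustrative names:
\begin{verbatim}
import torch

def o2o_branch_inputs(roi_feat, s_o2m, lam_o2m=0.5):
    # Stop-gradient: the O2O head must not distort RoI feature learning
    feat = roi_feat.detach()
    # Omega_o2o = {i | s_i^g > lambda_o2m^s}: only these samples
    # enter the O2O focal loss
    omega = s_o2m > lam_o2m
    return feat, omega
\end{verbatim}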
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/auxloss.png} %
@ -349,11 +351,13 @@ In essence, certain samples with lower $\left\{ s_{i}^{g} \right\} $ are exclude
\label{auxloss}
\end{figure}
We directly apply the redefined GIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offsets of the x-axis coordinates of sampled points, and the $Smooth_{L1}$ loss for the regression of the end points of lanes, denoted as $\mathcal{L}_{end}$. To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{aux}$ depicted in Fig. \ref{auxloss}. The anchors and ground truth are segmented into several divisions. Each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the anchors in acquiring a deeper comprehension of the global geometric form.
We directly apply the redefined GIoU loss (refer to Appendix \ref{giou_appendix}), $\mathcal{L}_{GIoU}$, to regress the offsets of the x-axis coordinates of sampled points, and the $Smooth_{L1}$ loss for the regression of the end points of lanes, denoted as $\mathcal{L}_{end}$.
To facilitate the learning of global features, we propose the auxiliary loss $\mathcal{L}_{aux}$ depicted in Fig. \ref{auxloss}. The anchors and ground truth are segmented into several divisions. Each anchor segment is regressed to the primary components of the corresponding segment of the designated ground truth. This approach aids the detection head in acquiring a deeper comprehension of the global geometric form.
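One plausible reading of $\mathcal{L}_{aux}$ is sketched below: each lane's sampled points are split into segments, and per-segment summary parameters of the prediction are regressed onto those of the matched ground truth with the $Smooth_{L1}$ loss. Taking the ``primary components'' of a segment to be its mean point and direction angle is our assumption, as is the number of segments.
\begin{verbatim}
import torch
import torch.nn.functional as F

def aux_segment_loss(pred_pts, gt_pts, num_seg=4):
    # pred_pts, gt_pts: (P, 2) matched lane point sequences
    loss = pred_pts.new_zeros(())
    for p, g in zip(pred_pts.chunk(num_seg), gt_pts.chunk(num_seg)):
        dp, dg = p[-1] - p[0], g[-1] - g[0]   # chord of each segment
        # mean point and direction angle (no wrap-around handling here)
        loss = loss + F.smooth_l1_loss(p.mean(0), g.mean(0)) \
                    + F.smooth_l1_loss(torch.atan2(dp[1], dp[0]),
                                       torch.atan2(dg[1], dg[0]))
    return loss / num_seg
\end{verbatim}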
The final loss functions for GPM are given as follows:
\begin{align}
\mathcal{L} _{cls}^{g}&=w_{o2m}^{cls}\mathcal{L}^{o2m}_{cls}+w_{o2o}^{cls}\mathcal{L}_{o2o}^{\mathrm{cls}}+w_{rank}\mathcal{L}_{\mathrm{rank}},
\mathcal{L} _{cls}^{g}&=w^{o2m}_{cls}\mathcal{L}^{o2m}_{cls}+w^{o2o}_{cls}\mathcal{L}^{o2o}_{cls}+w_{rank}\mathcal{L}_{rank},
\\
\mathcal{L} _{reg}^{g}&=w_{GIoU}\mathcal{L}_{GIoU}+w_{end}\mathcal{L}_{end}+w_{aux}\mathcal{L} _{aux}.
\end{align}
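Assembling the terms is a plain weighted sum; the dictionary keys below mirror the equations, and the weights are hyperparameters whose values are placeholders:
\begin{verbatim}
def gpm_loss(L, w):
    # L and w: dicts of loss tensors and scalar weights keyed by term
    l_cls = w["o2m_cls"] * L["o2m_cls"] + w["o2o_cls"] * L["o2o_cls"] \
          + w["rank"] * L["rank"]
    l_reg = w["giou"] * L["giou"] + w["end"] * L["end"] \
          + w["aux"] * L["aux"]
    return l_cls + l_reg
\end{verbatim}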
@ -566,12 +570,11 @@ All input images are cropped and resized to $800\times320$. Similar to \cite{clr
\end{table}
\subsection{Comparison with State-of-the-Art Methods}
The comparison results of our proposed model with other methods are shown in Tables \ref{culane result}, \ref{tusimple result}, \ref{llamas result}, \ref{dlrail result}, and \ref{curvelanes result}. We present results for two versions of our model: the NMS-based version, denoted as Polar R-CNN-NMS, and the NMS-free version, denoted as Polar R-CNN. The NMS-based version utilizes predictions $\left\{s_i^g\right\}$ obtained from the O2M head followed by NMS post-processing, while the NMS-free version derives predictions $\left\{\tilde{s}_i^g\right\}$ directly from the O2O classification head without NMS.
The comparison results of our proposed model with other methods are shown in Tables \ref{culane result}, \ref{tusimple result}, \ref{llamas result}, \ref{dlrail result}, and \ref{curvelanes result}. We present results for two versions of our model: the NMS-based version, denoted as Polar R-CNN-NMS, and the NMS-free version, denoted as Polar R-CNN. The NMS-based version utilizes predictions $\left\{s_i^g\right\}$ obtained from the O2M head followed by NMS post-processing, while the NMS-free version derives predictions via dual confidence selection.
To ensure a fair comparison, we also include results for CLRerNet \cite{clrernet} on the CULane and CurveLanes datasets, as we use a similar training strategy and dataset splits. As illustrated in the comparison results, our model demonstrates competitive performance across five datasets. Specifically, on the CULane, TuSimple, LLAMAS, and DL-Rail datasets with sparse scenarios, our model outperforms other anchor-based methods. Additionally, the performance of the NMS-free version is nearly identical to that of the NMS-based version, highlighting the effectiveness of the O2O head in eliminating redundant predictions. On the CurveLanes dataset, the NMS-free version achieves superior F1-measure and Recall compared to both NMS-based and segment\&grid-based methods.
We also compare the number of anchors and processing speed with other methods. Fig. \ref{anchor_num_method} illustrates the number of anchors used by several anchor-based methods on CULane. Our proposed model utilizes the fewest proposal anchors (20 anchors) while achieving the highest F1-score on CULane. It remains competitive with state-of-the-art methods like CLRerNet, which uses 192 anchors and a cross-layer refinement strategy. Conversely, the sparse Laneformer, which also uses 20 anchors, does not achieve optimal performance. It is important to note that our model is designed with a simpler structure without additional refinement, indicating that the design of flexible anchors is crucial for performance in sparse scenarios. Furthermore, due to its simple structure and fewer anchors, our model exhibits lower latency compared to most methods, as shown in Fig. \ref{speed_method}. The combination of fast processing speed and a straightforward architecture makes our model highly deployable.
To ensure a fair comparison, we also include results for CLRerNet \cite{clrernet} on the CULane and CurveLanes datasets, as we use a similar training strategy and dataset splits. As illustrated in the comparison results, our model demonstrates competitive performance across five datasets. Specifically, on the CULane, TuSimple, LLAMAS, and DL-Rail datasets with sparse scenarios, our model outperforms other anchor-based methods. Additionally, the performance of the NMS-free version is nearly identical to that of the NMS-based version, highlighting the effectiveness of the O2O classification head in eliminating redundant predictions in sparse scenarios. On the CurveLanes dataset, the NMS-free version achieves superior F1-measure and Recall compared to other methods.
We also compare the number of anchors and processing speed with other methods. Fig. \ref{anchor_num_method} illustrates the number of anchors used by several anchor-based methods on the CULane dataset. Our proposed model utilizes the fewest proposal anchors (20 anchors) while achieving the highest F1-score on CULane. It remains competitive with state-of-the-art methods like CLRerNet, which uses 192 anchors and a cross-layer refinement strategy. Conversely, the sparse Laneformer, which also uses 20 anchors, does not achieve optimal performance. It is important to note that our model is designed with a simpler structure without complicated components such as cross-layer refinement, indicating the pivotal role of flexible anchors under polar coordinates in enhancing performance in sparse scenarios. Furthermore, due to its simple structure and fewer anchors, our model exhibits lower latency compared to most methods, as shown in Fig. \ref{speed_method}.
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/anchor_num_method.png}
@ -580,12 +583,11 @@ We also compare the number of anchors and processing speed with other methods. F
\end{figure}
\subsection{Ablation Study}
To validate and analyze the effectiveness and influence of different components of Polar R-CNN, we conduct several ablation studies on the CULane and CurveLanes datasets to show the performance.
To validate and analyze the effectiveness and influence of different components of Polar R-CNN, we conduct several ablation studies on the CULane and CurveLanes datasets.
\textbf{Ablation study on polar coordinate system and anchor number.} To assess the importance of the local polar coordinates of anchors, we examine the contribution of each component (\textit{i.e.}, angle and radius) to model performance. As shown in Table \ref{aba_lph}, both angle and radius contribute to performance to varying degrees. Additionally, we conduct experiments with the auxiliary loss using fixed anchors and Polar R-CNN. Fixed anchors refer to using the anchor settings trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b). Model performance improves by 0.48\% and 0.3\% under the fixed anchor paradigm and the proposal anchor paradigm, respectively.
We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves as the local polar map size increases and tends to stabilize when the size is sufficiently large. Specifically, precision improves, while recall decreases. A larger polar map size includes more background anchors in the second stage (since we choose dynamic $k=4$ for SimOTA, with no more than 4 positive samples for each ground truth). Consequently, the model learns more negative samples, enhancing precision but reducing recall. Regarding the number of anchors chosen during the evaluation stage, recall and F1 measure show a significant increase in the early stages of anchor number expansion but stabilize in later stages. This suggests that eliminating some anchors does not significantly affect performance. Fig. \ref{cam} displays the heat map and the top-$K$ selected anchors’ distribution in sparse scenarios. Brighter colors indicate a higher likelihood of anchors being foreground anchors. It is evident that most of the proposed anchors are clustered around the lane ground truth.
\textbf{Ablation study on polar coordinate system and anchor number.} To assess the importance of the local polar coordinates of anchors, we examine the contribution of each component (\textit{i.e.}, angle and radius) to model performance. As shown in Table \ref{aba_lph}, both angle and radius parameters contribute to performance to varying degrees. Additionally, we conduct experiments with the auxiliary loss using fixed anchors and Polar R-CNN. Fixed anchors refer to using the anchor settings trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b). Model performance improves by 0.48\% and 0.3\% under the fixed anchor paradigm and the proposal anchor paradigm, respectively.
We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves as the local polar map size increases and tends to stabilize when the size is sufficiently large. Specifically, precision improves, while recall decreases. A larger polar map size includes more background anchors in the second stage (since we choose dynamic $k=4$ for SimOTA, with no more than 4 positive samples for each ground truth). Consequently, the model learns more negative samples, enhancing precision but reducing recall. Regarding the number of anchors chosen during the evaluation stage, recall and F1 measure show a significant increase in the early stages of anchor number expansion but stabilize in later stages. This suggests that eliminating some anchors does not significantly affect performance. Fig. \ref{cam} displays the heat map and the top-$K$ selected anchors’ distribution in sparse scenarios. Brighter colors indicate a higher likelihood of anchors being foreground. It is evident that most of the proposed anchors are clustered around the lane ground truth.
\begin{figure}[t]
\centering
@ -666,11 +668,11 @@ We also explore the effect of different local polar map sizes on our model, as i
\label{cam}
\end{figure}
\textbf{Ablation study on NMS-free block in sparse scenarios.} We conduct several experiments on the CULane dataset to evaluate the performance of the NMS-free head in sparse scenarios. As shown in Table \ref{aba_NMSfree_block}, without using the GNN to establish relationships between anchors, Polar R-CNN fails to achieve a NMS-free paradigm, even with one-to-one assignment. Furthermore, the classification matrix (cls matrix) proves crucial, indicating that conditional probability is effective. Other components, such as the neighbor matrix (provided as a geometric prior) and rank loss, also contribute to the performance of the NMS-free block.
\textbf{Ablation study on NMS-free block in sparse scenarios.} We conduct several experiments on the CULane dataset to evaluate the performance of the NMS-free paradigm in sparse scenarios. As shown in Table \ref{aba_NMSfree_block}, without using the GNN to establish relationships between anchors, Polar R-CNN fails to achieve a NMS-free paradigm, even with one-to-one assignment. Furthermore, the confidence comparison matrix $\boldsymbol{A}^{C}$ proves crucial, indicating that conditional probability is effective. Other components, such as the geometric prior matrix $\boldsymbol{A}^{G}$ and rank loss, also contribute to the performance of the NMS-free block.
To compare the NMS-free paradigm with the traditional NMS paradigm, we perform experiments with the NMS-free block under both proposal and fixed anchor strategies. Table \ref{NMS vs NMS-free} presents the results of these experiments. Here, O2M-B refers to the O2M classification head, O2O-B refers to the O2O classification head with a plain structure, and O2O-G refers to the O2O classification head with the proposed GNN structure. To assess the ability to eliminate redundant predictions, NMS post-processing is applied to each head. The results show that NMS is necessary for the traditional O2M classification head. In the fixed anchor paradigm, although the O2O classification head with a plain structure effectively eliminates redundant predictions, it is less effective than the proposed GNN structure. In the proposal anchor paradigm, the O2O classification head with a plain structure fails to eliminate redundant predictions due to high anchor overlap and similar RoI features. Thus, the GNN structure is essential for Polar R-CNN in the NMS-free paradigm. In both the fixed and proposal anchor paradigms, the O2O classification head with the GNN structure successfully eliminates redundant predictions, indicating that our GNN-based O2O classification head can replace NMS post-processing in sparse scenarios without a decrease in performance. This confirms our earlier theory that both structure and label assignment are crucial for a NMS-free paradigm.
To compare the NMS-free paradigm with the traditional NMS paradigm, we perform experiments with the NMS-free block under both proposal and fixed anchor strategies (employing a fixed set of anchors as illustrated in Fig. \ref{anchor setting}(b)). Table \ref{NMS vs NMS-free} presents the results of these experiments. In the table, ``O2M'' and ``O2O'' refer to the NMS (the gray dashed route in Fig. \ref{o2o_cls_head}) and NMS-free (the green route in Fig. \ref{o2o_cls_head}) paradigms, respectively. The suffix ``-B'' signifies that the head consists solely of MLPs, whereas ``-G'' indicates that the head is equipped with the GNN architecture. In the fixed anchor paradigm, although the O2O classification head without the GNN effectively eliminates redundant predictions, the performance is still improved by incorporating the GNN structure. In the proposal anchor paradigm, the O2O classification head without the GNN fails to eliminate redundant predictions due to high anchor overlaps. Thus, the GNN structure is essential for Polar R-CNN in the NMS-free paradigm. In both the fixed and proposal anchor paradigms, the O2O classification head with the GNN structure successfully eliminates redundant predictions, indicating that our GNN-based O2O classification head can supplant NMS post-processing in sparse scenarios without a decline in performance.
We also explore the stop-gradient strategy for the O2O classification head. As shown in Table \ref{stop}, the gradient of the O2O classification head negatively impacts both the O2M classification head (with NMS post-processing) and the O2O classification head. This suggests that one-to-one assignment introduces critical bias into feature learning.
We also explore the stop-gradient strategy for the O2O classification head. As shown in Table \ref{stop}, the gradient of the O2O classification head negatively impacts both the O2M classification head (with NMS post-processing) and the O2O classification head. This observation indicates that the one-to-one assignment induces significant bias into feature learning, thereby underscoring the necessity of the stop-gradient strategy to preserve optimal performance.
\begin{table}[h]
\centering
@ -735,7 +737,7 @@ We also explore the stop-gradient strategy for the O2O classification head. As s
\begin{table}[h]
\centering
\caption{The ablation study for the stop gradient strategy on the CULane test set.}
\caption{The ablation study for the stop-gradient strategy on the CULane test set.}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{c|c|lll}
\toprule
[Five binary image assets updated: three 1.6 MiB images and one 463 KiB image unchanged in size; one image grew from 49 KiB to 86 KiB.]