update
commit bdb82c99aa (parent 212f77d446): main.tex
@@ -211,8 +211,6 @@ The overall architecture of Polar R-CNN is illustrated in Fig. \ref{overall_architecture}

\subsection{Lane and Lane Anchor Representation}

Lanes are characterized by their thin and elongated curved shapes. A suitable lane prior aids the model in extracting features, predicting locations, and modeling the shapes of lane curves with greater accuracy. Consistent with previous studies \cite{linecnn}\cite{laneatt}, our lane priors (also referred to as lane anchors) consist of straight lines. We sample a sequence of 2D points along each lane anchor, denoted as $P\doteq \left\{ \left( x_1, y_1 \right), \left( x_2, y_2 \right), \dots, \left( x_N, y_N \right) \right\}$, where $N$ is the number of sampled points. The y-coordinates of these points are uniformly sampled from the vertical axis of the image, specifically $y_i=\frac{H}{N-1}\cdot i$, where $H$ is the image height. The same y-coordinates are sampled from the ground truth lane, and the model is tasked with regressing the x-coordinate offsets from the lane anchor to the lane instance ground truth. The primary distinction between Polar R-CNN and previous approaches lies in the description of the lane anchors, which will be detailed in the following sections.

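For concreteness, this sampling step can be written as a short NumPy sketch; the helper name and the use of linear interpolation for the ground-truth x-coordinates are our own illustration, not the paper's released code:
\begin{verbatim}
import numpy as np

def sample_lane(gt_xs, gt_ys, H, N=72):
    # Uniform y-coordinates: y_i = H/(N-1) * i, so y_0 = 0 and y_{N-1} = H.
    ys = np.arange(N) * H / (N - 1)
    # Interpolate the ground-truth x at the sampled ys (gt_ys must be
    # sorted ascending); the regression target is the x-offset per point.
    xs = np.interp(ys, gt_ys, gt_xs)
    return xs, ys
\end{verbatim}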
\begin{figure}[t]
\centering
\def\subwidth{0.24\textwidth}
@@ -230,9 +228,18 @@ Lanes are characterized by their thin and elongated curved shapes.
\caption{Different descriptions for anchor parameters: (a) Ray: defined by its start point and orientation. (b) Polar: defined by its radius and angle.}
\label{coord}
\end{figure}

\textbf{Polar coordinate system.} Since lane anchors are typically represented as straight lines, they can be described using straight-line parameters. Previous approaches have used rays to describe 2D lane anchors, with parameters comprising the coordinates of a starting point and an orientation/angle, denoted as $\left\{\theta, P_{xy}\right\}$, as shown in Fig. \ref{coord}(a). \cite{linecnn}\cite{laneatt} define the start points as lying on the three image boundaries. However, \cite{adnet} argue that this approach is problematic because the actual starting point of a lane could be located anywhere within the image. In our analysis, using a ray leads to ambiguity in the line representation: a line contains infinitely many candidate starting points, and the choice of a starting point for a lane is subjective. As illustrated in Fig. \ref{coord}(a), the yellow starting point (the visual start point) and the green starting point (the point located on the image boundary) with the same orientation $\theta$ describe the same line, and either could be used in different datasets \cite{scnn}\cite{vil100}. This ambiguity arises because a straight line has two degrees of freedom, whereas a ray has three (two for the start point and one for the orientation). To resolve this issue, we propose using polar coordinates to describe a lane anchor with only two parameters, radius and angle, denoted as $\left\{\theta, r\right\}$, where $\theta \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right)$ and $r \in \left(-\infty, +\infty\right)$. This representation is illustrated in Fig. \ref{coord}(b).

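As an illustration, assuming the standard Hesse normal form of a line with respect to an origin $\boldsymbol{c}=(c_x, c_y)$, i.e. $\left(x-c_x\right)\cos\theta + \left(y-c_y\right)\sin\theta = r$, the sampled x-coordinates of an anchor follow directly from $\left\{\theta, r\right\}$ (a sketch under this assumption; the paper's exact sign conventions may differ):
\begin{verbatim}
import numpy as np

def anchor_points(theta, r, cx, cy, H, N=72):
    # Line in Hesse normal form relative to the origin (cx, cy):
    #   (x - cx)*cos(theta) + (y - cy)*sin(theta) = r
    ys = np.arange(N) * H / (N - 1)
    # Solve for x at each sampled y; theta = -pi/2 (cos = 0) would need
    # special handling for a perfectly horizontal line.
    xs = cx + (r - (ys - cy) * np.sin(theta)) / np.cos(theta)
    return np.stack([xs, ys], axis=1)
\end{verbatim}
Note that moving the origin $\boldsymbol{c}$ changes only $r$, while $\theta$ is origin-independent, which is why the global and local systems introduced below can share the same angle.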
\begin{figure}[t]
\centering
\includegraphics[width=0.45\textwidth]{thesis_figure/local_polar_head.png}
\caption{The main architecture of LPH.}
\label{lph}
\end{figure}

We define two types of polar coordinate systems: the global coordinate system and the local coordinate system, with the origin points denoted as the global origin $\boldsymbol{c}^{g}$ and the local origin $\boldsymbol{c}^{l}$, respectively. For convenience, the global origin is positioned near the static vanishing point of the entire lane image dataset, while the local origins are set at lattice points within the image. As illustrated in Fig. \ref{coord}(b), only the radius parameters are affected by the choice of the origin point, while the angle/orientation parameters remain consistent.

\subsection{Local Polar Head}
\textbf{Anchor formulation in local polar head.} Inspired by the region proposal network in Faster R-CNN \cite{fasterrcnn}, the local polar head (LPH) aims to propose flexible, high-quality anchors around the lane ground truths within an image. As Fig. \ref{lph} and Fig. \ref{overall_architecture} demonstrate, the highest level $P_{3} \in \mathbb{R}^{C_{f} \times H_{f} \times W_{f}}$ of the FPN feature maps is selected as the input to the LPH. Following a downsampling operation, the feature map is fed into two branches: the regression branch $\phi _{reg}^{lph}\left(\cdot \right)$ and the classification branch $\phi _{cls}^{lph}\left(\cdot \right)$:
\begin{equation}
\begin{aligned}
&F_d\gets DS\left( P_{3} \right), \,F_d\in \mathbb{R} ^{C_f\times H^{l}\times W^{l}},\\
@@ -242,14 +249,7 @@ We define two types of polar coordinate systems
\label{lph equ}
\end{equation}

The regression branch proposes lane anchors by predicting two parameters, $F_{reg} \equiv \left\{\theta_{j}, r^{l}_{j}\right\}_{j=1}^{H^{l}\times W^{l}}$, within the local polar coordinate system; these parameters represent the angle and the radius of each anchor. The classification branch predicts the heat map $F_{cls}\equiv \left\{ c_j \right\} _{j=1}^{H^l\times W^l}$ over the grid of local polar origins. By discarding local origin points with low confidence, the module increases the likelihood of selecting potential positive (foreground) lane anchors while removing as many background lane anchors as possible. To keep the design simple, the regression branch $\phi _{reg}^{lph}\left(\cdot \right)$ consists of a single $1\times1$ convolutional layer, while the classification branch $\phi _{cls}^{lph}\left(\cdot \right)$ consists of two $1\times1$ convolutional layers.
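A minimal PyTorch sketch of this two-branch design follows; the hidden width and the choice of average pooling for the downsampling operator $DS(\cdot)$ are assumptions, only the $1\times1$ convolution counts follow the text:
\begin{verbatim}
import torch.nn as nn

class LocalPolarHead(nn.Module):
    def __init__(self, c_f, hidden=64):
        super().__init__()
        self.ds = nn.AvgPool2d(2)            # DS(.): downsampling (method assumed)
        self.reg = nn.Conv2d(c_f, 2, 1)      # one 1x1 conv -> (theta_j, r_j) per cell
        self.cls = nn.Sequential(            # two 1x1 convs -> heat map c_j
            nn.Conv2d(c_f, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1))

    def forward(self, p3):                   # p3: (B, C_f, H_f, W_f)
        f_d = self.ds(p3)                    # F_d: (B, C_f, H_l, W_l)
        return self.reg(f_d), self.cls(f_d)  # F_reg, F_cls
\end{verbatim}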
\textbf{Loss Function.} During the training phase, as illustrated in Fig. \ref{lphlabel}, the ground truth labels for the LPH are constructed as follows. The radius ground truth is defined as the shortest distance from a grid point (local origin point) to the ground truth lane curve. The angle ground truth is defined as the orientation of the vector from the grid point to the nearest point on the curve. A grid point is designated as a positive sample if its radius label is less than a threshold $\tau_{L}$; otherwise, it is considered a negative sample.

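This label construction reduces to a nearest-point query against a densely sampled ground-truth curve; a NumPy sketch under that reading (the helper name is ours):
\begin{verbatim}
import numpy as np

def lph_labels(grid, lane, tau_l):
    # grid: (G, 2) local origins; lane: (M, 2) dense points on one GT lane.
    diff = lane[None, :, :] - grid[:, None, :]   # (G, M, 2)
    dist = np.linalg.norm(diff, axis=-1)         # (G, M)
    j = dist.argmin(axis=1)                      # nearest curve point per grid point
    r_gt = dist[np.arange(len(grid)), j]         # radius label: shortest distance
    v = diff[np.arange(len(grid)), j]            # vector to the nearest point
    theta_gt = np.arctan2(v[:, 1], v[:, 0])      # angle label: its orientation
    return theta_gt, r_gt, r_gt < tau_l          # positives: r_gt below tau_L
\end{verbatim}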
@@ -305,13 +305,13 @@ Suppose the $P_{0}$, $P_{1}$ and $P_{2}$ denote the last three levels from FPN
where $\boldsymbol{w}_{L}^{s}\in \mathbb{R} ^{N_p}$ represents a learnable aggregation weight. Instead of directly concatenating the three sampled features into $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f\times 3}$, the adaptive summation reduces the feature dimension to $\boldsymbol{F}^s\in \mathbb{R} ^{N_p\times d_f}$, one third of the original. The weighted-sum tensors are then fed into fully connected layers to obtain the pooled RoI features of an anchor:
\begin{equation}
\begin{aligned}
\boldsymbol{F}^{roi}\gets FC_{pooling}\left( \boldsymbol{F}^s \right), \boldsymbol{F}^{roi}\in \mathbb{R} ^{d_r}.
\end{aligned}
\end{equation}

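A sketch of this adaptive summation and pooling in PyTorch; the shapes follow the text, but the exact weighting scheme is partly elided in the excerpt, so one weight vector $\boldsymbol{w}_{L}^{s}$ per FPN level, applied pointwise, is assumed:
\begin{verbatim}
import torch
import torch.nn as nn

class ROIPooling(nn.Module):
    def __init__(self, n_p, d_f, d_r):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3, n_p))    # w_L^s: one vector per level
        self.fc = nn.Linear(n_p * d_f, d_r)          # FC_pooling

    def forward(self, feats):                        # feats: (3, N_p, d_f)
        f_s = (self.w.unsqueeze(-1) * feats).sum(0)  # adaptive sum -> (N_p, d_f)
        return self.fc(f_s.flatten())                # F_roi in R^{d_r}
\end{verbatim}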
\textbf{Triplet Head.} The triplet head comprises three distinct heads: the one-to-one classification (O2O cls) head, the one-to-many classification (O2M cls) head, and the one-to-many regression (O2M reg) head. In various studies \cite{laneatt}\cite{clrnet}\cite{adnet}\cite{srlane}, the detection head predominantly follows the one-to-many paradigm: during the training phase, multiple positive samples are assigned to a single ground truth. Consequently, during the evaluation stage, redundant detection results are often predicted for each instance. These redundancies are typically addressed with NMS, which eliminates duplicate results and retains the highest-confidence detection for each ground truth. However, NMS relies on a definition of the distance between detection results, and this calculation can be complex for curved lanes and other irregular geometric shapes. To achieve non-redundant detection results with an NMS-free paradigm, the one-to-one paradigm becomes crucial during training, as highlighted in \cite{o2o}. Nevertheless, merely adopting the one-to-one paradigm is insufficient; the structure of the detection head also plays a pivotal role in achieving NMS-free detection. This aspect will be analyzed further in the following sections.

\textbf{NMS vs NMS-free.} Let $\boldsymbol{F}^{roi}_{i}$ denote the RoI features extracted from the $i$-th anchor; the three subheads take $\boldsymbol{F}^{roi}_{i}$ as input. For now, let us focus on the O2M classification (O2M cls) head and the O2M regression (O2M reg) head, which follow the paradigm used in previous work and serve as a baseline for the new one-to-one paradigm. To maintain simplicity and rigor, both the O2M classification head and the O2M regression head consist of two layers with activation functions, featuring a plain structure without complex mechanisms such as attention or deformable convolution. As previously mentioned, merely replacing the one-to-many label assignment with a one-to-one label assignment is insufficient for eliminating NMS post-processing. This is because anchors often exhibit significant overlap or are positioned very close to each other, as shown in Fig. \ref{anchor setting}(b)\&(c). Let $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ represent the features from two overlapping (or very close) anchors, implying that $\boldsymbol{F}^{roi}_{i}$ and $\boldsymbol{F}^{roi}_{j}$ will be almost identical. Let $f_{plain}^{cls}$ denote the neural structure used in the O2M classification head and suppose it is trained with one-to-one label assignment. If $\boldsymbol{F}^{roi}_{i}$ is a positive sample and $\boldsymbol{F}^{roi}_{j}$ is a negative sample, the ideal output should be as follows:
\begin{equation}
\begin{aligned}
&\boldsymbol{F}_{i}^{roi}\approx \boldsymbol{F}_{j}^{roi},
@@ -335,8 +335,8 @@ It is easy to see that the ``ideal'' one-to-one branch is equivalent to the O2M cls head
The index of positive predictions, $1, 2, \dots, i, \dots, N_{pos}$;\\
The corresponding positive anchors, $[\theta_i, r_{i}^{global}]$;\\
The x-coordinates of sampling points from positive anchors, $\boldsymbol{x}_{i}^{b}$;\\
The positive confidences obtained from the O2M classification head, $s_i$;\\
The positive regressions obtained from the O2M regression head: the horizontal offsets $\varDelta \boldsymbol{x}_{i}^{roi}$ and end point locations $\boldsymbol{e}_{i}$.\\
\ENSURE ~~\\ % Output of the algorithm
\STATE Calculate the confidence adjacency matrix $\boldsymbol{C} \in \mathbb{R} ^{N_{pos} \times N_{pos}}$, where the element $C_{ij}$ of $\boldsymbol{C}$ is calculated as follows:
\begin{equation}
@@ -360,7 +360,7 @@ It is easy to see that the ``ideal'' one-to-one branch is equivalent to the O2M cls head
\label{al_1-2}
\end{equation}

\STATE Calculate the inverse distance matrix $\boldsymbol{D} \in \mathbb{R} ^{N_{pos} \times N_{pos}}$, where the element $D_{ij}$ of $\boldsymbol{D}$ is defined as follows:
\begin{equation}
\begin{aligned}
D_{ij} = 1-d\left( \boldsymbol{x}_{i}^{b} + \varDelta \boldsymbol{x}_{i}^{roi}, \boldsymbol{x}_{j}^{b} + \varDelta \boldsymbol{x}_{j}^{roi}, \boldsymbol{e}_{i}, \boldsymbol{e}_{j}\right),
@@ -387,17 +387,17 @@ It is easy to see that the ``ideal'' one-to-one branch is equivalent to the O2M cls head
\centering
\includegraphics[width=\linewidth]{thesis_figure/gnn.png}
\caption{The main architecture of the O2O classification head.}
\label{o2o_cls_head}
\end{figure}

The key rule of NMS post-processing is as follows: given a series of positive detections with redundancy, a detection result A is suppressed by another detection result B if and only if:

(1) The confidence of A is lower than that of B.

(2) The predefined distance (e.g., IoU distance or L1 distance) between A and B is smaller than a threshold.

(3) B is not suppressed by any other detection result.

For simplicity, Fast NMS enforces only conditions (1) and (2), which may lead to an increase in false negative predictions but offers faster processing without sequential iteration. Leveraging this ``iteration-free'' property, we propose a further refinement called ``sort-free'' Fast NMS. This new approach, named Graph-based Fast NMS, is detailed in Algorithm \ref{Graph Fast NMS}.

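Under one explicit reading of Algorithm \ref{Graph Fast NMS} (the exact form of $C_{ij}$ in Eq. (\ref{al_1-1}) is elided in this excerpt, so a plain pairwise confidence comparison is assumed), the whole procedure reduces to a few dense matrix operations, with neither sorting nor sequential iteration:
\begin{verbatim}
import numpy as np

def graph_fast_nms(scores, dist, d_tau):
    # scores: (N,) O2M confidences; dist: (N, N) pairwise lane distances.
    C = (scores[None, :] > scores[:, None]).astype(float)  # j beats i in confidence
    D = 1.0 - dist                                         # inverse distance matrix
    # i is suppressed if some more-confident j is closer than the threshold.
    suppressed = ((C * D) > (1.0 - d_tau)).any(axis=1)
    return ~suppressed                                     # keep mask
\end{verbatim}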
@@ -405,7 +405,7 @@ It is straightforward to demonstrate that, when all elements in $\boldsymbol{M}$

According to the analysis of the shortcomings of traditional NMS post-processing shown in Fig. \ref{NMS setting}, the fundamental issue arises from the definition of the distance between predictions. Traditional NMS relies on geometric properties to define distances between predictions, which often neglects contextual semantics. For example, in some scenarios, two predicted lanes with a small geometric distance should not be suppressed, such as in the case of double lines or fork lines. Although setting a threshold $d_{\tau}$ can mitigate this problem, it is challenging to strike a balance between precision and recall.

To address this, we replace the explicit definition of the inverse distance function with an implicit graph neural network. Additionally, the coordinates of the anchors are replaced with the anchor features ${F}_{i}^{roi}$. According to information bottleneck theory \cite{alemi2016deep}, ${F}_{i}^{roi}$, which contains the location and classification information, is sufficient for modeling the explicit geometric distance with a neural network. Beyond the geometric information, the features ${F}_{i}^{roi}$ contain implicit contextual information about an anchor, which provides additional clues for establishing implicit contextual distances between two anchors. The implicit contextual distance is calculated as follows:
\begin{equation}
\begin{aligned}
\tilde{\boldsymbol{F}}_{i}^{roi}\gets& \mathrm{ReLU}\left( FC_{o2o}^{roi}\left( \boldsymbol{F}_{i}^{roi} \right) \right),
@@ -420,31 +420,30 @@ To address this, we replace the explicit definition of the distance function with an implicit graph neural network
\label{edge_layer}
\end{equation}

Eq. (\ref{edge_layer}) is the implicit counterpart of Eq. (\ref{al_1-3}): the inverse distance $\boldsymbol{D}_{ij}^{edge}$ is no longer a scalar but a semantic tensor of dimension $d_{dis}$, and thus contains richer information than a traditional geometric distance. The confidence calculation is expressed as follows:
\begin{equation}
\begin{aligned}
&\boldsymbol{D}_{i}^{node}\gets \underset{j\in \left\{ j|T_{ij}=1 \right\}}{\max}\boldsymbol{D}_{ij}^{edge},
\\
&\boldsymbol{F}_{i}^{node}\gets MLP_{node}\left( \boldsymbol{D}_{i}^{node} \right),
\\
&\tilde{s}_i\gets \sigma \left( FC_{o2o}^{out}\left( \boldsymbol{F}_{i}^{node} \right) \right).
\end{aligned}
\label{node_layer}
\end{equation}

Eq. (\ref{node_layer}) serves as the implicit replacement for Eq. (\ref{al_1-4}). In this approach, we use elementwise max pooling of tensors instead of scalar-based max operations. The pooled tensor is then fed into a neural network with a sigmoid activation function to directly obtain the confidence. By eliminating the need for a predefined distance threshold, all confidence calculation patterns are derived from the training data.

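Putting Eqs. (\ref{edge_layer}) and (\ref{node_layer}) together, the head can be sketched as below. The construction of $\boldsymbol{D}_{ij}^{edge}$ from pairwise features is elided in the excerpt, so a simple pairwise feature difference is assumed here, and rows with no neighbors ($T_{ij}=0$ for all $j$) would need extra masking in practice:
\begin{verbatim}
import torch
import torch.nn as nn

class O2OClsHead(nn.Module):
    def __init__(self, d_r, d_dis):
        super().__init__()
        self.fc_roi = nn.Linear(d_r, d_r)    # FC_o2o^roi
        self.edge = nn.Linear(d_r, d_dis)    # -> D_ij^edge (pairwise form assumed)
        self.node = nn.Sequential(nn.Linear(d_dis, d_dis), nn.ReLU())  # MLP_node
        self.out = nn.Linear(d_dis, 1)       # FC_o2o^out

    def forward(self, f_roi, T):
        # f_roi: (N, d_r) anchor RoI features; T: (N, N) 0/1 adjacency.
        f = torch.relu(self.fc_roi(f_roi))                  # F~_i^roi
        d_edge = self.edge(f[None, :, :] - f[:, None, :])   # (N, N, d_dis)
        d_edge = d_edge.masked_fill(T.unsqueeze(-1) == 0, -1e9)
        d_node = d_edge.max(dim=1).values                   # elementwise max pooling
        f_node = self.node(d_node)                          # F_i^node
        return torch.sigmoid(self.out(f_node)).squeeze(-1)  # s~_i
\end{verbatim}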
It should be noted that the O2O classification head depends on the predictions of the O2M classification head, as outlined in Eq. (\ref{al_1-1}). From a probability perspective, the confidence output by the O2M classification head, $s_{j}$, represents the probability that the $j$-th detection is a positive sample, while the confidence output by the O2O classification head, $\tilde{s}_i$, denotes the conditional probability that the $i$-th sample should not be suppressed given that it is identified as a positive sample:
\begin{equation}
\begin{aligned}
&s_j|_{j=1}^{N_a}\equiv P\left( a_j\,\,is\,\,pos \right), \,\,
\\
&\tilde{s}_i|_{i=1}^{N_{pos}}\equiv P\left( a_i\,\,is\,\,retained|a_i\,is\,\,pos \right),
\end{aligned}
\label{probablity}
\end{equation}
where $N_a$ equals $H^{l}\times W^{l}$ during the training stage and $K_{a}$ during the testing stage. The overall architecture of the O2O classification head is illustrated in Fig. \ref{o2o_cls_head}.

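By the product rule, these two probabilities combine into the probability that a detection is both positive and retained, which suggests a simple single-threshold test-time fusion (our reading; the excerpt does not spell out the final scoring rule):
\begin{verbatim}
def final_scores(s, s_tilde):
    # P(a_i is pos and retained) = P(retained | pos) * P(pos);
    # thresholding this product replaces NMS at test time.
    return s * s_tilde
\end{verbatim}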
\textbf{Label assignment and Cost function.} We use a label assignment (SimOTA) similar to that of previous works \cite{clrnet}\cite{clrernet}. However, to make the formulation more compact and consistent with general object detection works \cite{iouloss}\cite{giouloss}, we redefine the lane IoU. As illustrated in Fig. \ref{glaneiou}, the newly defined lane IoU, which we refer to as GLaneIoU, is given as follows:
\begin{figure}[t]
@@ -479,9 +478,16 @@ We then define the cost function between $i_{th}$ prediction and $j_{th}$ ground truth
\mathcal{C} _{ij}=\left(s_i\right)^{\beta_c}\times \left( GLaneIoU_{ij, g=0} \right) ^{\beta_r}.
\end{aligned}
\end{equation}
This cost function is more compact than those in previous works \cite{clrnet}\cite{adnet} and takes both location and confidence into account. For label assignment, SimOTA (with $k=4$) \cite{yolox} is used for the two O2M heads with one-to-many assignment, while the Hungarian algorithm \cite{detr} is employed for the O2O classification head with one-to-one assignment.
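For the one-to-one branch, the assignment step can be sketched with SciPy's Hungarian solver (the cost above is maximized, hence the negation; the helper name is ours):
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def o2o_assign(s, glane_iou, beta_c, beta_r):
    # s: (N,) O2M confidences; glane_iou: (N, M) GLaneIoU (g=0) to each GT.
    cost = (s[:, None] ** beta_c) * (glane_iou ** beta_r)  # C_ij above
    pred_idx, gt_idx = linear_sum_assignment(-cost)        # Hungarian, maximize C_ij
    return pred_idx, gt_idx
\end{verbatim}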
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/auxloss.png}
\caption{Auxiliary loss for segment parameter regression.}
\label{auxloss}
\end{figure}

\textbf{Loss function.} We use focal loss \cite{focal} for the O2O classification head and the O2M classification head:
\begin{equation}
\begin{aligned}
\mathcal{L} _{o2m}^{cls}&=-\sum_{i\in \varOmega _{pos}^{o2m}}{\alpha _{o2m}\left( 1-s_i \right) ^{\gamma}\log \left( s_i \right)}\\&-\sum_{i\in \varOmega _{neg}^{o2m}}{\left( 1-\alpha _{o2m} \right) \left( s_i \right) ^{\gamma}\log \left( 1-s_i \right)},
@@ -490,29 +496,31 @@ This cost function is more compact than those in previous works
\\
\end{aligned}
\end{equation}
where the sets of one-to-one samples, $\varOmega _{pos}^{o2o}$ and $\varOmega _{neg}^{o2o}$, are restricted to the positive sample set of the O2M classification head:
\begin{equation}
\begin{aligned}
\varOmega _{pos}^{o2o}\cup \varOmega _{neg}^{o2o}=\left\{ i|s_i>C_{o2m} \right\}.
\end{aligned}
\end{equation}
Only samples with confidence larger than $C_{o2m}$ are chosen as candidate samples for the O2O classification head. According to \cite{pss}, to maintain feature quality during the training stage, the gradients of the O2O classification head are stopped from propagating back to the rest of the network (stopped at the RoI features of the anchors, $\boldsymbol{F}_{i}^{roi}$). Additionally, we use a rank loss to increase the gap between the positive and negative confidences of the O2O classification head:
\begin{equation}
\begin{aligned}
&\mathcal{L} _{rank}=\frac{1}{N_{rank}}\sum_{i\in \varOmega _{pos}^{o2o}}{\sum_{j\in \varOmega _{neg}^{o2o}}{\max \left( 0, \tau _{rank}-\tilde{s}_i+\tilde{s}_j \right)}},\\
&N_{rank}=\left| \varOmega _{pos}^{o2o} \right|\left| \varOmega _{neg}^{o2o} \right|.
\end{aligned}
\end{equation}

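The rank loss is a pairwise hinge over all positive/negative pairs of the O2O head; a direct PyTorch sketch:
\begin{verbatim}
import torch

def rank_loss(s_pos, s_neg, tau_rank):
    # s_pos: (P,) positive O2O scores; s_neg: (Q,) negative O2O scores.
    margins = tau_rank - s_pos[:, None] + s_neg[None, :]  # (P, Q) pairwise
    return margins.clamp(min=0).mean()  # mean over N_rank = P*Q pairs
\end{verbatim}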
We directly use the GLaneIoU loss, $\mathcal{L}_{GLaneIoU}$, to regress the x-coordinate offsets (with $g=1$), and a Smooth-L1 loss, denoted as $\mathcal{L} _{end}$, for the regression of the end points (namely the y-coordinates of the start point and the end point). To help the model learn global features, we propose the auxiliary loss illustrated in Fig. \ref{auxloss}:
\begin{align}
\begin{aligned}
\mathcal{L}_{aux} &= \frac{1}{\left| \varOmega_{pos}^{o2m} \right| N_{seg}} \sum_{i \in \varOmega_{pos}^{o2o}} \sum_{m=j}^k \Bigg[ l \left( \theta_i - \hat{\theta}_{i}^{seg,m} \right) \\
&\quad + l \left( r_{i}^{global} - \hat{r}_{i}^{seg,m} \right) \Bigg].
\end{aligned}
\end{align}

The anchors and ground truths are divided into several segments. Each anchor segment is regressed to the main components of the corresponding segment of the assigned ground truth. This trick helps the anchors learn more about the global geometric shape.

\subsection{Loss function}

The overall loss function of Polar R-CNN is given as follows:
@@ -565,17 +573,9 @@ The first line in the loss function represents the loss for LPH
\label{dataset_info}
\end{table*}
\section{Experiment}
\subsection{Dataset and Evaluation Metric}
We conducted experiments on four widely used lane detection benchmarks and one rail detection dataset: CULane \cite{scnn}, TuSimple \cite{tusimple}, LLAMAS \cite{llamas}, CurveLanes \cite{curvelanes}, and DL-Rail \cite{dalnet}. Among these datasets, CULane and CurveLanes are particularly challenging: CULane consists of various scenarios but has sparse lane distributions, whereas CurveLanes includes a large number of curved and dense lane types, such as forked and double lanes. The DL-Rail dataset, focused on rail detection across different scenarios, is chosen to evaluate our model's performance beyond traditional lane detection. The details of the five datasets are shown in Table \ref{dataset_info}.

We use the F1-score to evaluate our model on the CULane, LLAMAS, DL-Rail, and CurveLanes datasets, maintaining consistency with previous works. The F1-score is defined as follows:
\begin{equation}
@@ -703,8 +703,8 @@ All input images are cropped and resized to $800\times320$.
SCNN\cite{scnn} &ResNet34&94.25&94.11&94.39\\
BézierLaneNet\cite{bezierlanenet} &ResNet34&95.17&95.89&94.46\\
LaneATT\cite{laneatt} &ResNet34&93.74&96.79&90.88\\
LaneAF\cite{laneaf} &DLA34 &96.07&\textbf{96.91}&95.26\\
DALNet\cite{dalnet} &ResNet18&96.12&96.83&95.42\\
CLRNet\cite{clrnet} &DLA34 &96.12&- &- \\
\midrule

@@ -749,7 +749,7 @@
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{lrcccc}
\toprule
\textbf{Method}& \textbf{Backbone}&\textbf{F1@50 (\%)}&\textbf{Precision (\%)}&\textbf{Recall (\%)} \\
\midrule
SCNN\cite{scnn} &VGG16 &65.02&76.13&56.74\\
Enet-SAD\cite{enetsad} &- &50.31&63.60&41.60\\
@@ -777,6 +777,13 @@ To ensure a fair comparison, we also include results for CLRerNet \cite{clrernet}

We also compare the number of anchors and processing speed with other methods. Fig. \ref{anchor_num_method} illustrates the number of anchors used by several anchor-based methods on CULane. Our proposed model utilizes the fewest proposal anchors (20 anchors) while achieving the highest F1-score on CULane, remaining competitive with state-of-the-art methods like CLRerNet, which uses 192 anchors and a cross-layer refinement strategy. Conversely, Sparse Laneformer, which also uses 20 anchors, does not achieve optimal performance. It is important to note that our model is designed with a simpler structure without additional refinement, indicating that the design of flexible anchors is crucial for performance in sparse scenarios. Furthermore, due to its simple structure and fewer anchors, our model exhibits lower latency than most methods, as shown in Fig. \ref{speed_method}. The combination of fast processing speed and a straightforward architecture makes our model highly deployable.

\subsection{Ablation Study and Visualization}
To validate and analyze the effectiveness and influence of the different components of Polar R-CNN, we conduct several ablation experiments on the CULane and CurveLanes datasets.

\textbf{Ablation study on polar coordinate system and anchor number.} To assess the importance of the local polar coordinates of anchors, we examine the contribution of each component (i.e., angle and radius) to model performance. As shown in Table \ref{aba_lph}, both angle and radius contribute to performance to varying degrees. Additionally, we conduct experiments with the auxiliary loss using fixed anchors and Polar R-CNN; fixed anchors refer to the anchor settings trained by CLRNet, as illustrated in Fig. \ref{anchor setting}(b). Model performance improves by 0.48\% and 0.3\% under the fixed anchor paradigm and the proposal anchor paradigm, respectively.

We also explore the effect of different local polar map sizes on our model, as illustrated in Fig. \ref{anchor_num_testing}. The overall F1 measure improves as the local polar map size increases and tends to stabilize once the size is sufficiently large. Specifically, precision improves, while recall decreases. A larger polar map size includes more background anchors in the second stage (since we choose $k=4$ for SimOTA, with no more than four positive samples for each ground truth). Consequently, the model learns more negative samples, enhancing precision but reducing recall. Regarding the number of anchors chosen during the evaluation stage, recall and the F1 measure increase significantly in the early stages of anchor number expansion but stabilize in later stages. This suggests that eliminating some anchors does not significantly affect performance. Fig. \ref{cam} displays the heat map and the distribution of the top-$K_{a}$ selected anchors in sparse scenarios; brighter colors indicate a higher likelihood of an anchor being a foreground anchor. It is evident that most of the proposed anchors cluster around the lane ground truth.

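The anchor selection at evaluation time is a plain top-$K_{a}$ over the heat map; a sketch (the value $K_a=20$ follows the anchor count reported above, the helper name is ours):
\begin{verbatim}
import torch

def select_anchors(heat, theta, r, k_a=20):
    # heat, theta, r: (H_l*W_l,) flattened LPH outputs; keep the K_a most
    # confident local origins and their proposed anchor parameters.
    idx = torch.topk(heat, k_a).indices
    return theta[idx], r[idx]
\end{verbatim}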
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{thesis_figure/anchor_num_method.png}
@@ -792,20 +799,13 @@ We also compare the number of anchors and processing speed with other methods.
\label{speed_method}
\end{figure}

\begin{table}[h]
\centering
\caption{Ablation study of anchor proposal strategies}
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{c|ccc|cc}
\toprule
\textbf{Anchor strategy}&\textbf{Local R}& \textbf{Local Angle}&\textbf{Auxloss}&\textbf{F1@50 (\%)}&\textbf{F1@75 (\%)}\\
\midrule
\multirow{2}*{Fixed}
&- &- & &79.90 &60.98\\
@@ -867,7 +867,7 @@ We also explore the effect of different local polar map sizes on our model
\includegraphics[width=\imgwidth, height=\imgheight]{thesis_figure/heatmap/anchor2.jpg}
\caption{}
\end{subfigure}
\caption{(a)\&(c): The heat map of the local polar map; (b)\&(d): the final anchor selection during the evaluation stage.}
\label{cam}
\end{figure}

@@ -883,7 +883,7 @@ We also explore the stop-gradient strategy for the O2O classification head.
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{cccc|ccc}
\toprule
\textbf{GNN}&\textbf{cls Mat}& \textbf{Nbr Mat}&\textbf{Rank Loss}&\textbf{F1@50 (\%)}&\textbf{Precision (\%)} & \textbf{Recall (\%)} \\
\midrule
& & & &16.19&69.05&9.17\\
\checkmark&\checkmark& & &79.42&88.46&72.06\\
@@ -907,7 +907,7 @@ We also explore the stop-gradient strategy for the O2O classification head.
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{c|l|lll}
\toprule
\multicolumn{2}{c|}{\textbf{Anchor strategy~/~assign}} & \textbf{F1@50 (\%)} & \textbf{Precision (\%)} & \textbf{Recall (\%)} \\
\midrule
\multirow{6}*{Fixed}
&O2M-B w/~ NMS &80.38&87.44&74.38\\
@@ -944,7 +944,7 @@ We also explore the stop-gradient strategy for the O2O classification head.
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{c|c|lll}
\toprule
\multicolumn{2}{c|}{\textbf{Paradigm}} & \textbf{F1 (\%)} & \textbf{Precision (\%)} & \textbf{Recall (\%)} \\
\midrule
\multirow{2}*{Baseline}
&O2M-B w/~ NMS &78.83&88.99&70.75\\
@@ -972,7 +972,7 @@ In the traditional NMS post-processing \cite{clrernet}, the default IoU threshold
\begin{adjustbox}{width=\linewidth}
\begin{tabular}{l|l|ccc}
\toprule
\textbf{Paradigm} & \textbf{NMS thres. (pixel)} & \textbf{F1@50 (\%)} & \textbf{Precision (\%)} & \textbf{Recall (\%)} \\
\midrule
\multirow{7}*{Polar R-CNN-NMS}
& 50 (default) &85.38&\textbf{91.01}&80.40\\