[Paper Review] Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer (InterFuser)

woonho · February 19, 2023

Computer Vision, E2E driving, autonomous driving, Paper Review

Introduction

Research Background

Problem with existing autonomous driving ⇒ safety (performance degrades in high-traffic-density situations)
- Lack of comprehensive scene understanding
  - Vulnerable to situations such as a vehicle running a red light or a pedestrian suddenly appearing
- Lack of interpretability
InterFuser
- Addresses comprehensive scene understanding by fusing information from multi-modal, multi-view sensors.
- Addresses interpretability by generating intermediate interpretable features.

Comprehensive scene understanding

Single-modal
- Single Image ⇒ hard to capture the complex surroundings
- Single LiDAR ⇒ hard to capture semantic information such as traffic lights
Fusing multiple sensors
- Match geometric features between Image & LiDAR
  - By locality assumption
- Simply concatenate multiple-sensor features
  
  ⇒ However, these approaches struggle to model the interactions between multi-modal features
Transformer
- Because interactions between features are hard to capture, an attention mechanism is used to take global context into account
  - Ex) TransFuser: designs an architecture that fuses image and LiDAR inputs with transformers ⇒ However, it scales poorly to additional sensors and is limited to fusing LiDAR with a single image.
- InterFuser fuses the inputs of multi-view sensors in a one-stage architecture.
  - LiDAR Input & Multi-view Images(Left, Front, Right, Focus)

Interpretability

Existing method
- On failure, they tried to check the parts where the neural network is applied rather than understand the model directly. ⇒ Little feedback is gained from the failure
New method
- Inspired by how humans acquire information, the model outputs a safety mind map in addition to generating actions.
  - Intermediate Interpretable features ⇒ Information on surrounding objects and traffic signs

Contribution

A transformer enables global contextual perception across different modalities and views.
Improved safety and interpretability
1. intermediate features of the model
2. constraining actions within safe sets
Achieves SOTA on the CARLA benchmark

Method

Architecture

Transformer encoder
- integrates the signals from multiple RGB cameras and LiDAR
Transformer decoder
- low-level action
- intermediate features ⇒ ego vehicle’s future trajectory, object density map, traffic rule signal
Safety controller
- utilizes the interpretable features to constrain the low-level control within the safe set

Input and Output Representations

Input representations

Four image inputs
- { $I_{left}, I_{front}, I_{right}$ }
  - The left, front, and right image inputs come from three RGB cameras.
- $I_{focus}$
  - To capture the traffic light state, the center of the front RGB image is cropped.
LiDAR point clouds
- As in the TransFuser paper, the point cloud is converted into a two-channel bird's-eye-view projection image. ⇒ $I_{lidar}$
  (Figure from the TransFuser paper)
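
The projection above can be sketched as a simple two-bin height histogram. This is only a rough illustration: the grid extent, the cell size, the ground-plane split at `ground_z`, and the function name `lidar_to_bev` are all assumptions for the example, not values taken from this review.

```python
def lidar_to_bev(points, x_range=(0.0, 32.0), y_range=(-16.0, 16.0),
                 cell=0.125, ground_z=0.0):
    """Bin (x, y, z) points into a 2-channel bird's-eye-view count grid.

    Channel 0 counts points at or below ground_z, channel 1 counts points
    above it. All default values are illustrative assumptions.
    """
    rows = int(round((x_range[1] - x_range[0]) / cell))
    cols = int(round((y_range[1] - y_range[0]) / cell))
    bev = [[[0 for _ in range(cols)] for _ in range(rows)] for _ in range(2)]
    for x, y, z in points:
        inside_x = x_range[0] <= x < x_range[1]
        inside_y = y_range[0] <= y < y_range[1]
        if not (inside_x and inside_y):
            continue  # drop points that fall outside the grid
        r = min(int((x - x_range[0]) / cell), rows - 1)
        c = min(int((y - y_range[0]) / cell), cols - 1)
        channel = 1 if z > ground_z else 0
        bev[channel][r][c] += 1
    return bev


# Tiny usage example on a 4 x 4 grid with 1 m cells.
points = [(1.2, 0.3, 0.5), (1.3, 0.4, -0.2), (40.0, 0.0, 1.0)]
bev = lidar_to_bev(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0)
```

The third point lies outside the grid and is ignored; the other two land in the same cell but in different height channels.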

Output representations

safety-insensitive output
- waypoints that the ego vehicle intends to follow ($L=10$)
safety-sensitive output
- object density map
  - $M \in \R^{R \times R \times 7}$ ⇒ 7 features for potential objects in each grid cell
  - Covers $R$ meters ahead of the ego vehicle and $R \over 2$ meters to each side.
  - 7 channels
    - probability that an object exists in each grid cell
    - 2-dimensional offset from the center of the grid
    - 2-dimensional bounding box
    - object heading
    - object velocity
- traffic rule information
  - traffic light status
  - stop sign
  - intersection
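
To make the 7-channel layout concrete, here is a small sketch that unpacks one grid cell into named fields. The channel order is assumed to follow the list above, and the names `CellObject`, `decode_cell`, `box_a`, and `box_b` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CellObject:
    prob: float      # probability that an object exists in the cell
    dx: float        # offset from the cell center, first axis
    dy: float        # offset from the cell center, second axis
    box_a: float     # first bounding-box dimension
    box_b: float     # second bounding-box dimension
    heading: float   # object heading
    velocity: float  # object velocity


def decode_cell(cell):
    """Turn one 7-value cell of the density map into a CellObject."""
    if len(cell) != 7:
        raise ValueError("expected 7 channels per grid cell")
    return CellObject(*cell)


obj = decode_cell([0.9, 0.1, -0.2, 2.0, 4.5, 1.57, 3.0])
```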

Model architecture

Backbone

Input : $I\in \R^{3 \times H_0 \times W_0}$
ResNet Backbone ⇒ ( $I\in \R^{3 \times H_0 \times W_0}$ → $f \in \R^{C \times H \times W}$ )
- C = 2048
- (H, W) = ( ${H_0 \over 32}, {W_0 \over 32}$ )

Transformer encoder

A 1x1 convolution is applied to the feature map $f$ to obtain a lower-channel feature map $z \in \R^{d\times H\times W}$.
The spatial dimensions of $z$ are flattened into one dimension, giving a $d \times HW$ sequence.
A fixed 2D sinusoidal positional encoding $e \in \R^{d \times HW}$ is added so that each token carries positional information.
A learnable sensor embedding $s \in \R^{d \times N}$ is added so that tokens from the $N$ sensors can be distinguished.

After the steps above are applied to each sensor, the per-sensor outputs are concatenated and passed through $K$ transformer layers.
- Each layer $\mathcal{K}$ consists of MSA (multi-head self-attention), an MLP, and LN (layer normalization).
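
The token construction described above (flatten, add the positional encoding, add the sensor embedding, concatenate) can be sketched in plain Python. The particular sinusoidal layout used here (half of the channels for the row, half for the column) is one common variant and is an assumption, as are all function names; the learnable sensor embedding is represented by fixed example vectors.

```python
import math


def sincos_2d(d, h, w):
    """Fixed 2D sinusoidal encoding: h*w vectors of length d (d divisible by 4)."""
    assert d % 4 == 0
    half = d // 2

    def axis_code(pos):
        code = []
        for i in range(0, half, 2):
            freq = 1.0 / (10000 ** (i / half))
            code += [math.sin(pos * freq), math.cos(pos * freq)]
        return code

    return [axis_code(r) + axis_code(c) for r in range(h) for c in range(w)]


def tokens_for_sensor(z, sensor_vec):
    """z is a d x H x W nested list; returns HW tokens of length d."""
    d, h, w = len(z), len(z[0]), len(z[0][0])
    pos = sincos_2d(d, h, w)
    tokens = []
    for r in range(h):
        for c in range(w):
            t = r * w + c  # row-major flattening of the spatial grid
            tokens.append([z[k][r][c] + pos[t][k] + sensor_vec[k] for k in range(d)])
    return tokens


def encoder_input(feature_maps, sensor_embeddings):
    """Concatenate the tokens of all sensors into one sequence."""
    seq = []
    for z, s in zip(feature_maps, sensor_embeddings):
        seq.extend(tokens_for_sensor(z, s))
    return seq


def zeros(d, h, w):
    return [[[0.0] * w for _ in range(h)] for _ in range(d)]


# Two hypothetical sensors with d = 4: a 2x3 map and a 1x2 map.
seq = encoder_input([zeros(4, 2, 3), zeros(4, 1, 2)],
                    [[0.0] * 4, [0.5] * 4])
```

The resulting sequence has 6 + 2 = 8 tokens, which is what the $K$ transformer layers would attend over jointly.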

Transformer decoder

standard transformer architecture
Three types of queries
- $L$ waypoints queries
- $R^2$ density map queries
- one traffic rule query
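Because the transformer decoder is permutation-invariant, queries with identical embeddings look the same to the decoder and cannot produce different outputs.
- permutation-invariant ⇒ a model that produces the same output regardless of the order of its input elements
Learnable positional embeddings are therefore added to the query embeddings.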
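
As a quick sanity check on the number of queries, the sketch below lays out the three query groups as index ranges over the decoder output. The ordering of the groups, the helper name `query_layout`, and the value $R = 20$ used in the example are assumptions; only $L = 10$ is stated above.

```python
def query_layout(num_waypoints, grid_size):
    """Return (name, start, stop) index ranges for the three query groups."""
    n_map = grid_size * grid_size
    w_end = num_waypoints
    m_end = w_end + n_map
    return [
        ("waypoints", 0, w_end),
        ("density_map", w_end, m_end),
        ("traffic_rule", m_end, m_end + 1),
    ]


layout = query_layout(num_waypoints=10, grid_size=20)
total_queries = layout[-1][2]
```

With these example values the decoder would process 10 + 400 + 1 queries.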

Prediction headers

Three parallel prediction modules
- Waypoints prediction
  - single layer GRU ⇒ auto-regressively predict waypoints $[w_l]^L_{l=1}$
  - The goal location, given as GPS coordinates, is embedded into a 64-dimensional vector that initializes the GRU's first hidden state.
- Object density map prediction
  - A 3-layer MLP maps the features ( $R^2 \times d$ ⇒ $R^2 \times 7$ ), which are then reshaped to $M \in \R^{R \times R \times 7}$.
- Traffic rule prediction
  - A single linear layer predicts the traffic light, stop sign, and intersection.
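
The final reshape of the density map head can be sketched as follows. Row-major ordering of the $R^2$ queries is an assumption, and `reshape_density` is a hypothetical helper name.

```python
def reshape_density(flat, r):
    """Reshape a list of r*r per-query outputs (7 values each) into an r x r grid."""
    if len(flat) != r * r:
        raise ValueError("expected r*r query outputs")
    return [flat[i * r:(i + 1) * r] for i in range(r)]


# Four query outputs for a 2 x 2 grid; the first value marks the query index.
flat = [[float(q)] + [0.0] * 6 for q in range(4)]
grid = reshape_density(flat, 2)
```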

Loss Function

$\mathcal{L} = \lambda_{pt}\mathcal{L}_{pt} + \lambda_{map}\mathcal{L}_{map} + \lambda_{tf}\mathcal{L}_{tf}$
- $\mathcal{L}_{pt}$ ⇒ Waypoint prediction loss function
  - $\mathcal{L}_{pt} = \sum^L_{l=1}||w_l - w_l^p||_1$ ⇒ by L1 loss
- $\mathcal{L}_{map}$ ⇒ Object density map prediction loss function
  - $\mathcal{L}_{map} = \mathcal{L}_{prob} + \mathcal{L}_{meta}$
    - $\mathcal{L}_{prob} = {1 \over 2}(\mathcal{L}_{prob}^0 + \mathcal{L}_{prob}^1)$
    - $\mathcal{L}_{meta} = {1 \over C_1}\sum^R_{i}\sum^R_{j}\sum^6_{k=1}(1_{[\check{M}_{ij0}=1]}|{\check{M}_{ijk}} - {M_{ijk}}|_1)$
- $\mathcal{L}_{tf}$ ⇒ Traffic rule prediction loss function
  - $\mathcal{L}_{tf} = \lambda_l\mathcal{L}_l + \lambda_s\mathcal{L}_s + \lambda_j\mathcal{L}_j$ ⇒ by cross-entropy loss
    - $l$ : traffic light status
    - $s$ : stop sign
    - $j$ : junction of roads(intersection)
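
The waypoint and meta terms can be sketched in a few lines. Two assumptions are made here and are not confirmed by the text above: $\check{M}$ is treated as the ground-truth map, and $C_1$ is taken to be the number of cells whose ground-truth channel 0 equals 1. The default weights of 1.0 are placeholders, not the paper's values.

```python
def waypoint_loss(pred, target):
    """L1 distance summed over all waypoints; each waypoint is an (x, y) pair."""
    return sum(abs(p - t)
               for wp, wt in zip(pred, target)
               for p, t in zip(wp, wt))


def meta_loss(pred_map, gt_map):
    """L1 error on channels 1..6, only in cells where ground-truth channel 0 is 1."""
    total, positives = 0.0, 0
    for pred_row, gt_row in zip(pred_map, gt_map):
        for pred_cell, gt_cell in zip(pred_row, gt_row):
            if gt_cell[0] == 1:
                positives += 1
                total += sum(abs(g - p)
                             for g, p in zip(gt_cell[1:], pred_cell[1:]))
    # Normalize by the assumed C_1 (number of positive cells).
    return total / positives if positives else 0.0


def total_loss(l_pt, l_map, l_tf, lam_pt=1.0, lam_map=1.0, lam_tf=1.0):
    """Weighted sum of the three loss terms."""
    return lam_pt * l_pt + lam_map * l_map + lam_tf * l_tf


l_pt = waypoint_loss([(1.0, 2.0), (3.0, 4.0)], [(1.5, 2.0), (2.0, 4.0)])
gt = [[[1, 0.0, 0.0, 2.0, 4.0, 0.0, 1.0], [0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]]
pred = [[[0.8, 0.5, 0.0, 2.0, 4.0, 0.0, 0.0], [0.1, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0]]]
l_meta = meta_loss(pred, gt)
```

The second cell has no object in the ground truth, so its attribute errors do not contribute to `l_meta`.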

Safety Controller

Low-level action

The waypoints produced by the transformer decoder and the waypoint predictor are converted into two low-level actions by a PID controller.
- lateral steering action
  - $\psi_d$ ⇒ ego vehicle’s desired heading
- longitudinal acceleration action
  - $v_d$ ⇒ desired speed
The low-level actions are then adjusted to stay within the safe set using the interpretable features (object density map, traffic rule).
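
A rough sketch of how waypoints might be turned into $\psi_d$ and $v_d$ and then fed to PID controllers. The aim-point rule (midpoint of the first two waypoints), the ego-frame convention (x forward), the gains, and all names are assumptions for illustration rather than the paper's exact procedure.

```python
import math


class PID:
    """Textbook PID controller on a scalar error."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def desired_heading_and_speed(waypoints, dt):
    """Waypoints are (x, y) in the ego frame, spaced dt seconds apart."""
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]
    aim_x, aim_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    psi_d = math.atan2(aim_y, aim_x)            # desired heading
    v_d = math.hypot(x1 - x0, y1 - y0) / dt     # desired speed
    return psi_d, v_d


psi_d, v_d = desired_heading_and_speed([(1.0, 0.0), (2.0, 0.0)], dt=0.5)
steer = PID(1.0, 0.0, 0.0).step(psi_d, dt=0.5)            # heading error vs. 0
throttle = PID(0.5, 0.0, 0.0).step(v_d - 1.0, dt=0.5)     # current speed 1 m/s
```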

Object density map

Object existence probability
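- An object is considered present in a cell when its existence probability is a local maximum among the surrounding cells and exceeds a threshold.
Tracker
- Records each object's historical dynamics and predicts its future trajectory with a moving average.

⇒ By predicting how the surroundings and the objects will move, the safe distance ( $s_t$ ) that the ego vehicle can travel is obtained.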
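
The detection and tracking steps can be sketched as below. The 3x3 neighborhood, the threshold of 0.5, and the way the moving average is turned into an extrapolation are assumptions; the function names are hypothetical.

```python
def detect_objects(prob, threshold=0.5):
    """Return (row, col) cells that exceed threshold and are a 3x3 local maximum."""
    n_rows, n_cols = len(prob), len(prob[0])
    hits = []
    for i in range(n_rows):
        for j in range(n_cols):
            p = prob[i][j]
            if p <= threshold:
                continue
            neighbors = [prob[a][b]
                         for a in range(max(0, i - 1), min(n_rows, i + 2))
                         for b in range(max(0, j - 1), min(n_cols, j + 2))
                         if (a, b) != (i, j)]
            if all(p >= q for q in neighbors):
                hits.append((i, j))
    return hits


def predict_position(history, steps=1):
    """Extrapolate with the average displacement between consecutive positions."""
    if len(history) < 2:
        return history[-1]
    n = len(history) - 1
    vx = (history[-1][0] - history[0][0]) / n
    vy = (history[-1][1] - history[0][1]) / n
    return (history[-1][0] + vx * steps, history[-1][1] + vy * steps)


prob = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.3],
        [0.1, 0.3, 0.6]]
hits = detect_objects(prob)
future = predict_position([(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)], steps=2)
```

The cell with probability 0.6 is above the threshold but is suppressed because its neighbor (0.9) is higher.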

Traffic rule

The predicted traffic rule is also used for safe driving.
- traffic light not green, stop sign ⇒ emergency stop
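
Putting the two safety signals together, a simplified filter on the desired speed could look like this. The emergency-stop rule follows the text above; the braking-distance cap $v \le \sqrt{2 a s_t}$ and the deceleration value are a stand-in chosen for the example, not the paper's formulation.

```python
import math


def safe_speed(v_d, safe_distance, light_not_green, stop_sign, max_decel=4.0):
    """Limit the desired speed v_d using the traffic rule and the safe distance."""
    if light_not_green or stop_sign:
        return 0.0  # emergency stop
    # Assumed cap: highest speed from which the car can stop within safe_distance.
    v_cap = math.sqrt(2.0 * max_decel * max(safe_distance, 0.0))
    return min(v_d, v_cap)


capped = safe_speed(6.0, 2.0, light_not_green=False, stop_sign=False)
free = safe_speed(6.0, 50.0, light_not_green=False, stop_sign=False)
stopped = safe_speed(6.0, 50.0, light_not_green=True, stop_sign=False)
```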

Experiments

Experiment Setup

CARLA simulator (8 towns and 21 kinds of weather)

Data collection

8 kinds of towns and 21 kinds of weather
Randomly generated different routes, dynamic objects, adversarial scenarios ⇒ for the diversity of the dataset

Metrics

RC ⇒ Route completion ratio
IS ⇒ Infraction score
DS ⇒ Driving score
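
For reference, here is how the three metrics relate on a single route, assuming the CARLA Leaderboard convention where the driving score is the route completion multiplied by the infraction score. The penalty coefficients below are quoted from memory of the Leaderboard documentation and should be treated as assumptions; the function names are hypothetical.

```python
# Assumed per-infraction penalty multipliers (CARLA Leaderboard convention).
PENALTY = {
    "pedestrian": 0.50,
    "vehicle": 0.60,
    "static": 0.65,
    "red_light": 0.70,
    "stop_sign": 0.80,
}


def infraction_score(counts):
    """Start from 1.0 and multiply by the penalty once per infraction."""
    score = 1.0
    for kind, n in counts.items():
        score *= PENALTY[kind] ** n
    return score


def driving_score(route_completion, counts):
    """DS = RC x IS for a single route (RC given in percent)."""
    return route_completion * infraction_score(counts)


ds_clean = driving_score(100.0, {})
ds_red = driving_score(80.0, {"red_light": 1})
```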

Comparison to the state of the art

TCP: integrates trajectory planning and direct control
- Uses one camera
LAV: trains on a dataset collected from all the vehicles that it observes
