πŸ“¦ Can LLMs Revolutionize Autonomous Driving?

Bard Β· February 23, 2025


Overview

We will explore the integration of Large Language Models (LLMs) with autonomous driving systems by examining the following two papers:

  • DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model
    IEEE RA-L 2024

    Can we implement an explainable end-to-end autonomous driving system using LLMs?

  • LMDrive: Closed-Loop End-to-End Driving with Large Language Models
    CVPR 2024

    Can we build cognitive autonomous driving systems on top of LLMs that can interact with human passengers or navigation software using natural language?

DriveGPT4


Motivation

πŸš— The Challenge of End-to-End Autonomous Driving

  • End-to-end learning-based autonomous driving models act as black boxes, making it difficult to interpret their decisions.
  • Lack of explainability raises ethical and legal concerns, limiting real-world deployment.

🧠 Why use LLMs?

  • LLMs can provide natural language explanations for driving actions, improving interpretability.
  • By leveraging multimodal learning, LLMs can process both visual data and textual queries, enabling a more transparent decision-making process.

Model Architecture


Video Tokenizer

Video frames $V = [I_1, I_2, \dots, I_N]$
For each frame $I_i$, the pretrained CLIP visual encoder is used to extract its feature $F_i \in \mathbb{R}^{257 \times d}$:

  • The global feature $F_i^G$
  • The local patch features $F_i^P$

The visual encoder (ViT) splits each image into $16 \times 16$ patches and extracts the local feature of each patch.

The temporal visual feature of the entire video can then be expressed as:

$$T = F^G_0 \oplus F^G_1 \oplus \dots \oplus F^G_N \tag{1}$$

The spatial visual feature of the whole video is given by:

$$S = \textrm{Pooling}(F^P_0, F^P_1, \dots, F^P_N) \tag{2}$$

Ultimately, both the temporal feature $T$ and the spatial feature $S$ are projected into the text domain using a projector.
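To make the tokenizer concrete, here is a minimal PyTorch sketch of the flow described above (not the official DriveGPT4 code): per-frame CLIP-style features are split into a global token and patch tokens, the global tokens form the temporal feature as in Eq. (1), the patch tokens are pooled into the spatial feature as in Eq. (2), and a projector maps both into the LLM embedding space. The widths (d = 1024, LLM width 4096) and the mean pooling are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the video tokenizer above (not the official DriveGPT4 code).
# frame_feats stands in for CLIP ViT outputs: 257 tokens per frame
# (1 global CLS token + 16x16 = 256 patch tokens), each of width d.
d, d_llm = 1024, 4096          # assumed CLIP width and LLM embedding width

class VideoTokenizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.projector = nn.Linear(d, d_llm)   # visual-to-text projector

    def forward(self, frame_feats):            # frame_feats: (N, 257, d)
        global_feats = frame_feats[:, 0]       # F_i^G for each frame, (N, d)
        patch_feats = frame_feats[:, 1:]       # F_i^P for each frame, (N, 256, d)
        # Eq. (1): temporal feature = concatenation of per-frame global tokens
        T = global_feats
        # Eq. (2): spatial feature = pooling of patch features over the video
        S = patch_feats.mean(dim=(0, 1), keepdim=True).squeeze(0)   # (1, d)
        # project both into the LLM token space
        return self.projector(T), self.projector(S)

video_feats = torch.randn(8, 257, d)            # 8-frame clip, as in the experiments
T_tok, S_tok = VideoTokenizer()(video_feats)
print(T_tok.shape, S_tok.shape)                 # (8, 4096) and (1, 4096)
```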


Text and Control Signals

Inspired by RT-2, control signals are processed in the same way as text (a small serialization sketch follows the list below).

  • The speed of the ego vehicle and the time length of the video clip are included in the text input.
  • The turning angle represents the relative angle between the current and the previous frame.
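A hedged sketch of what that serialization might look like; the exact prompt template used by DriveGPT4 is not reproduced here, so the wording below is illustrative.

```python
# Hypothetical serialization of control-related inputs into the text prompt,
# in the spirit of RT-2; the exact template is an assumption.
def control_to_text(speed_mps: float, turn_angle_deg: float, clip_seconds: float) -> str:
    return (f"Ego speed: {speed_mps:.1f} m/s. "
            f"Turning angle vs. previous frame: {turn_angle_deg:.1f} deg. "
            f"Video length: {clip_seconds:.1f} s.")

print(control_to_text(6.4, -3.2, 2.0))
```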

Training

DriveGPT4’s training consists of two stages:

  1. the pretraining stage, focusing on video-text alignment
  2. the mix-finetuning stage, aimed at training the LLM to answer questions related to interpretable end-to-end autonomous driving.

The pretraining stage

  • 595K image-text pairs + 702K video-text pairs.
  • The pretraining images and videos encompass various topics and are not specifically designed for autonomous driving applications.
  • During this phase, the CLIP encoder and LLM weights remain fixed; only the projector is trained (see the sketch below).
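In code, stage 1 amounts to freezing the pretrained weights and optimizing only the projector. The modules below are tiny placeholders rather than the real CLIP/LLM weights, and the learning rate is an assumption.

```python
import torch
import torch.nn as nn

# Placeholders standing in for the pretrained CLIP visual encoder, the LLM,
# and the visual-to-text projector.
clip_encoder = nn.Linear(1024, 1024)
llm = nn.Linear(4096, 4096)
projector = nn.Linear(1024, 4096)

# Stage 1: keep CLIP and LLM weights fixed, train only the projector.
for module in (clip_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=2e-3)   # assumed lr
print(sum(p.numel() for p in projector.parameters() if p.requires_grad), "trainable params")
```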

The mix-finetuning stage

  • To enable DriveGPT4 to understand and process domain knowledge, it is fine-tuned with 56K video-text instruction-following samples.

  • 56K samples alone are not sufficient for LLM fine-tuning.
  • Therefore, 223K general instruction-following samples generated by LLaVA and Valley are mixed in.

Experiment

Interpretable Autonomous Driving

  • ADAPT serves as the state-of-the-art baseline work.
  • Recent multimodal video understanding LLMs are also considered for comparison.
  • All methods use 8-frame videos as input.

Testing Set Split

| Split  | Scenes                                        | Amount |
| ------ | --------------------------------------------- | ------ |
| Easy   | Stopped; Driving forward; Parked; etc.        | 1202   |
| Medium | Lane switch; Acceleration; Intersection; etc. | 295    |
| Hard   | Vehicle turning; Traffic light change; etc.   | 312    |

Evaluation Metrics

CIDEr, BLEU-4, and ROUGE-L: metric scores widely used in the NLP community.

  • CIDEr: How closely does the explanation match the consensus of human reference sentences? (Consensus)
  • BLEU-4: How precisely do the generated n-grams (up to 4-grams) match the reference? (Precision)
  • ROUGE-L: How much of the reference wording is recovered, based on the longest common subsequence? (Recall)

Also, ChatGPT is prompted to assign a numerical score between 0 and 1 to evaluate the predicted answers to questions generated by ChatGPT.

Additional Question Answering


End-to-end Control

  • RMSE: root mean squared error
  • $A_\tau$: threshold accuracies
    • the proportion of test samples with prediction errors lower than $\tau$
    • $\tau \in \{0.1, 0.5, 1.0, 5.0\}$
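A minimal sketch of how these two control metrics can be computed for a single signal (e.g., predicted speed); this is just the metric definitions above, not the paper's evaluation code, and the sample values are made up.

```python
import numpy as np

def rmse(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def threshold_accuracy(pred, gt, tau):
    # proportion of samples whose absolute error is below tau
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt) < tau))

pred_speed, gt_speed = [5.1, 3.9, 0.2], [5.0, 4.4, 0.0]
print(rmse(pred_speed, gt_speed))
print({t: threshold_accuracy(pred_speed, gt_speed, t) for t in (0.1, 0.5, 1.0, 5.0)})
```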


Ablation Studies

LMDrive

Motivation

πŸš— Challenges in Autonomous Driving

  • Traditional models rely on fixed-format sensor inputs, limiting human interaction.

"Can we build cognitive autonomous driving systems on top of LLMs, that can interact with human passengers or navigation software simply by natural language?”

β‡’ LMDrive, a language-guided, end-to-end, closed-loop autonomous driving framework

🧠 Why LLM?

  • In long-tail, unforeseen events and challenging urban situations, language-aware AVs can cope by simply following instructions from passengers or navigation software.
  • AVs can adapt to a passenger's sudden notice simply via natural language, which was previously non-trivial and required a large amount of handcrafted rules.

πŸ” Closed-Loop

The absence of closed-loop evaluation leads to insufficient consideration of critical issues:

  • cumulative errors, human-robot interaction, and temporal consistency of actions.


Contributions

  1. End-to-End Closed-Loop Autonomous Driving
    • First language-based framework that integrates multimodal sensor data with real-time control execution.
  2. Language-Guided Driving Dataset
    • Collected 64K data clips in CARLA, containing navigation instructions, notice instructions, sensor data, and control signals.
  3. LangAuto Benchmark
    • Designed a new evaluation set including misleading instructions, complex scenarios, and adversarial conditions.
  4. Extensive Closed-Loop Experiments
    • Conducted detailed component analysis to validate LMDrive’s effectiveness.

Dataset generation

Three sources of input

  • sensor data (multi-view camera and LiDAR)
  • navigation instructions
  • human notice instruction

The data collection consists of two stages:

  1. collecting sensor data and control signals with an expert agent.
  2. parsing and labeling collected data with instructions.


Sensor and control data collection

  • A rule-based expert agent creates a dataset of about 3M driving frames.
    • camera data, LiDAR data, and control actions
  • The agent runs on 2.5k routes, 8 towns, and 21 kinds of environmental conditions.

  • four RGB cameras (left, front, right, rear) and one LiDAR
  • The front image is center-cropped as an additional focus-view image to capture the status of distant traffic lights (see the crop sketch below).
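A tiny sketch of the focus-view idea: the center region of the front camera image is cropped and kept as an extra input so distant traffic lights stay visible at a usable resolution. The crop size here is an assumption, not the paper's exact value.

```python
import numpy as np

def focus_view(front_image: np.ndarray, crop: int = 224) -> np.ndarray:
    # center-crop the front camera image (H, W, C) to a crop x crop patch
    h, w = front_image.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    return front_image[top:top + crop, left:left + crop]

print(focus_view(np.zeros((600, 800, 3))).shape)   # (224, 224, 3)
```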

Parsing and language annotation.

If the agent started to turn left at frame $T_0$ and ended at frame $T_n$:

  • label $(T_0, T_n)$ as a new clip with the instruction β€œHang a left at the next crossroads”

If an adversarial event occurs at time $T_a$:

  • add one notice instruction to this clip (see the sketch below).
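A hedged sketch of this parsing step: a detected maneuver interval in the expert log becomes a clip labeled with a navigation instruction, and any adversarial event inside the interval adds a notice instruction. The dictionary fields and the notice text are illustrative, not the dataset's actual schema.

```python
# Illustrative clip-labeling step; field names and the notice text are assumptions.
def label_clip(t0: int, tn: int, adversarial_frames=()) -> dict:
    clip = {
        "frames": (t0, tn),
        "navigation_instruction": "Hang a left at the next crossroads",
        "notice_instructions": [],
    }
    for ta in adversarial_frames:
        if t0 <= ta <= tn:   # adversarial event falls inside this clip
            clip["notice_instructions"].append(
                {"frame": ta, "text": "Watch out for the pedestrian ahead"})
    return clip

print(label_clip(1200, 1310, adversarial_frames=[1255]))
```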

Instruction design

Three types of navigation instructions (follow, turn, and others) plus notice instructions are considered, for 56 different instructions in total.

To enable the agent to drive in realistic instructional settings,

  • Diversifying the instructions with ChatGPT API

  • Incorporating misleading instructions.
    • "Change to left lane" on a single-lane road.
  • Connecting multiple instructions.
    • "Turn right at this intersection, then go straight to the next intersection and turn right again."

Model Architecture


Sensor Encoding

Multi-view camera

  • A 2D backbone (ResNet) extracts the image feature map.
  • The feature map is then fed into a standard transformer encoder, each layer containing multi-headed self-attention, MLP blocks, and layer normalization.

LiDAR input

  • 3D backbone PointPillars processes the raw point cloud data into ego-centered LiDAR features.
  • PointNet aggregates features and downsamples the feature map.

BEV Decoder

  • The BEV decoder is designed as a standard transformer with $K_{dec}$ layers.
  • Generates three types of visual tokens (BEV, waypoint, and traffic light).
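A rough PyTorch sketch of that decoder idea: learnable queries for BEV cells, waypoints, and the traffic light cross-attend to the fused sensor features through a standard transformer decoder. All sizes here (d_model, query counts, K_dec) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

d_model, K_dec = 256, 3
n_bev, n_wp = 20 * 20, 5          # assumed H x W BEV queries and N waypoint queries

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=K_dec,
)
# learnable query tokens: BEV + waypoint + one traffic-light query
queries = nn.Parameter(torch.randn(1, n_bev + n_wp + 1, d_model))

sensor_feats = torch.randn(1, 600, d_model)        # fused camera + LiDAR tokens
tokens = decoder(queries, sensor_feats)            # queries cross-attend to sensor features
bev_tok, wp_tok, tl_tok = tokens.split([n_bev, n_wp, 1], dim=1)
print(bev_tok.shape, wp_tok.shape, tl_tok.shape)
```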

Pre-training with prediction headers

  • Object detection, future waypoint prediction, and traffic light status classification.
  • Discarded after pre-training.

LLM for instruction-following auto driving

Instruction and visual tokenization

  • The vision encoder generates visual tokens ($H \times W$ BEV tokens, $N$ waypoint tokens, and one traffic light token) for each frame.
  • This is too many tokens for the LLM, because usually hundreds of frames are needed to complete one instruction.
  • To overcome this, we follow BLIP-2 to use the Q-Former to reduce the number of visual tokens.

  • Subsequently, a 2-layer MLP adapter converts the tokens extracted by the Q-Former to the same dimension as the language tokens (see the sketch below).
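A hedged sketch of this token-reduction step: a small set of learned queries cross-attends to the per-frame visual tokens (Q-Former style, following BLIP-2), and a 2-layer MLP adapter then maps the result to the LLM embedding width. The widths and query count are illustrative.

```python
import torch
import torch.nn as nn

d_vis, d_llm, n_query = 256, 4096, 32   # assumed widths and number of learned queries

queries = nn.Parameter(torch.randn(1, n_query, d_vis))      # learned queries
cross_attn = nn.MultiheadAttention(d_vis, num_heads=8, batch_first=True)
adapter = nn.Sequential(                                     # 2-layer MLP adapter
    nn.Linear(d_vis, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

# per-frame visual tokens: H*W BEV + N waypoint + 1 traffic light
visual_tokens = torch.randn(1, 400 + 5 + 1, d_vis)
reduced, _ = cross_attn(queries, visual_tokens, visual_tokens)   # (1, 32, d_vis)
llm_tokens = adapter(reduced)                                    # (1, 32, d_llm)
print(llm_tokens.shape)
```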

Action prediction

After the LLM predicts the action tokens, another 2-layer MLP adapter predicts:

  • future waypoints
  • a flag to indicate whether the given instruction has been completed.

To get the final control signal, which includes braking, throttling, and steering:

  • two PID controllers are used for lateral and longitudinal control, tracking the heading and velocity of the predicted waypoints respectively (a simplified sketch follows).
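A simplified sketch of that last step: one PID controller tracks the heading toward the near predicted waypoints (steering) and another tracks a desired speed (throttle/brake). The gains and the waypoint-to-target conversion are illustrative assumptions, not the paper's exact controller.

```python
import math

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err, dt=0.05):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

lateral_pid = PID(1.0, 0.0, 0.2)        # heading error -> steering, assumed gains
longitudinal_pid = PID(0.5, 0.05, 0.0)  # speed error -> throttle/brake, assumed gains

def waypoints_to_control(waypoints, current_speed):
    # waypoints: list of (x, y) in the ego frame, x forward, y left
    x, y = waypoints[1]                              # aim at the second waypoint
    heading_err = math.atan2(y, x)                   # desired heading change (rad)
    desired_speed = 2.0 * math.hypot(*waypoints[0])  # rough speed target from spacing
    steer = max(-1.0, min(1.0, lateral_pid.step(heading_err)))
    accel = longitudinal_pid.step(desired_speed - current_speed)
    if accel > 0:
        throttle, brake = min(0.75, accel), 0.0
    else:
        throttle, brake = 0.0, min(1.0, -accel)
    return steer, throttle, brake

print(waypoints_to_control([(1.0, 0.1), (2.0, 0.4)], current_speed=3.0))
```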

Training Objectives

  • $\ell_1$ waypoint loss
  • the classification loss (instruction-completed flag, cross-entropy)
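A minimal sketch of the two objectives: an $\ell_1$ loss on the predicted future waypoints plus a cross-entropy loss on the instruction-completed flag. The relative weighting between the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

pred_wp = torch.randn(4, 5, 2)          # (batch, future waypoints, xy)
gt_wp = torch.randn(4, 5, 2)
pred_flag_logits = torch.randn(4, 2)    # instruction completed: yes / no
gt_flag = torch.randint(0, 2, (4,))

# l1 waypoint loss + (assumed) 0.1-weighted completion-flag cross-entropy
loss = F.l1_loss(pred_wp, gt_wp) + 0.1 * F.cross_entropy(pred_flag_logits, gt_flag)
print(loss.item())
```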

LangAuto Benchmark

Metrics

  1. Route completion (RC)
    • the percentage of the total route length that has been completed
  2. Infraction score (IS)
    • infractions triggered by the agent.
    • collisions or traffic rule violations.
  3. Driving score (DS)
    • RC $\times$ IS
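In other words, the driving score simply discounts route completion by the infraction score, as in this tiny sketch (the example numbers are made up):

```python
def driving_score(route_completion: float, infraction_score: float) -> float:
    # DS = RC x IS: route completion discounted by infractions
    return route_completion * infraction_score

print(driving_score(0.85, 0.9))   # 0.765
```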

Quantitative Results

LLM Backbone

Module design

LangAuto-Notice Benchmark

LangAuto-Sequential Benchmark

Prediction Example
