[RSS '21] RMA: Rapid Motor Adaptation for Legged Robots

minha · November 29, 2022


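Problem addressed
- When learning locomotion for a quadruped robot, deploying the learned policy runs into the sim2real problem.
- A policy trained in simulation must operate robustly when it encounters diverse real-world terrain (rapid adaptation).
- Existing adaptation methods have been studied as few-shot adaptation: the robot goes into the real environment, fine-tunes through some amount of trial and error, and only then adapts. However much meta learning or similar techniques accelerate adaptation, requiring data collection in the new environment raises the risk of damaging the robot hardware (this typically involves a 4-8 minute trial-and-error phase). What is needed instead is zero-shot adaptation, in which the trained policy adapts to a new real environment immediately, without any additional fine-tuning.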
Main contributions
- System identification in latent space $z_t$ (extrinsics)
- Zero-shot adaptation (deployed directly in the real world; RMA adapts in less than 1 s)
Method
- The policy is trained in simulation and deployed directly to the real environment via zero-shot adaptation.
- In exchange, simulation training must cover the full variety of environments the robot may encounter in the real world $\leftarrow$ a fractal terrain generator is used for this (some kind of randomized dynamics generative simulation)
- A) Training in Simulation
  The locomotion policy is trained to take the state, the previous action, and the latent environment vector and output the next action. Alongside it, the adaptation module, which serves as the system identification module, is trained by regression so that its output matches the latent environment vector from simulation $\rightarrow$ what is ultimately used at deployment: base locomotion policy & trained adaptation module
- B) Deployment in Real environment
  From the recent trajectory (50 state-action pairs; within roughly 1 second), the adaptation module produces the latent environment vector, which the already-trained base policy uses right away for locomotion.
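
The two stages above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the vector dimensions, the random tanh layers standing in for trained networks, and every name (`make_layer`, `simulation_step`, `deploy_step`, `run_deployment`) are assumptions made for this example.

```python
import math
import random
from collections import deque

# Illustrative dimensions (assumed, not taken from this post).
STATE_DIM, ACTION_DIM, ENV_DIM, LATENT_DIM, HISTORY = 30, 12, 17, 8, 50
rng = random.Random(0)

def make_layer(in_dim, out_dim):
    """Random tanh layer standing in for a trained network (hypothetical)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    def forward(v):
        assert len(v) == in_dim
        return [math.tanh(sum(w * x for w, x in zip(row, v))) for row in weights]
    return forward

env_encoder = make_layer(ENV_DIM, LATENT_DIM)  # simulation only: e_t -> z_t
base_policy = make_layer(STATE_DIM + ACTION_DIM + LATENT_DIM, ACTION_DIM)
adaptation_module = make_layer(HISTORY * (STATE_DIM + ACTION_DIM), LATENT_DIM)

def flatten(history):
    """Concatenate the stored state-action pairs into one input vector."""
    return [value for pair in history for value in pair]

def simulation_step(x_t, a_prev, e_t, history):
    """A) Simulation: the policy consumes z_t; the adaptation module regresses onto it."""
    z_t = env_encoder(e_t)
    a_t = base_policy(x_t + a_prev + z_t)
    z_hat = adaptation_module(flatten(history))
    loss = sum((p - q) ** 2 for p, q in zip(z_hat, z_t)) / LATENT_DIM
    return a_t, loss

def deploy_step(x_t, a_prev, history):
    """B) Deployment: e_t is unavailable, so z is estimated from recent history."""
    z_hat = adaptation_module(flatten(history))
    return base_policy(x_t + a_prev + z_hat)

def run_deployment(steps=3):
    """Roll the deployed policy forward, keeping the last HISTORY state-action pairs."""
    history = deque([[0.0] * (STATE_DIM + ACTION_DIM)] * HISTORY, maxlen=HISTORY)
    a_prev = [0.0] * ACTION_DIM
    for _ in range(steps):
        x_t = [rng.gauss(0.0, 1.0) for _ in range(STATE_DIM)]  # stand-in for sensor readings
        a_t = deploy_step(x_t, a_prev, history)
        history.append(x_t + a_t)  # oldest pair is dropped automatically
        a_prev = a_t
    return a_prev, history
```

The point of the sketch is that `deploy_step` never touches `env_encoder` or `e_t`: only the base policy and the adaptation module are needed on the robot, and no gradient update happens at deployment time.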
Note 1: Limitations of existing sim2real approaches
1. Domain randomization
  - A single policy is trained across a variety of randomized simulation environments
  - Limitation: robustness $\uparrow$ , but optimality $\downarrow$ $\rightarrow$ overconservative policy
2. Making the simulation itself more realistic
  - Add models that reproduce reality more faithfully, such as a DNN-based actuator model
  - Limitation: zero-shot adaptation is not possible in the first place
3. Meta learning
  - The idea is to learn a meta policy that adapts fastest across many environments, rather than the policy that is optimal for a single environment
  - Limitation: as with 2, zero-shot adaptation is not possible in the first place
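
To make the domain randomization trade-off in item 1 concrete, here is a small sketch. The parameter names and ranges are hypothetical values chosen only for illustration, and `sample_environment`, `domain_randomization_input`, and `env_conditioned_input` are made-up helper names. Under plain domain randomization the environment changes every episode but the policy never sees it, so it has to pick one behavior that is safe everywhere.

```python
import random

# Hypothetical parameter ranges, for illustration only.
PARAM_RANGES = {
    "friction": (0.05, 4.5),
    "payload_kg": (0.0, 6.0),
    "motor_strength": (0.9, 1.1),
}

def sample_environment(rng):
    """Draw one randomized set of environment parameters for a training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

def domain_randomization_input(state, env):
    """Plain domain randomization: env is hidden, the policy sees the state only."""
    return list(state)

def env_conditioned_input(state, env):
    """Environment-conditioned alternative: the policy also sees env information.
    (RMA feeds a learned latent z_t rather than the raw parameters used here.)"""
    return list(state) + [env[name] for name in PARAM_RANGES]
```

With the first input, two episodes with very different friction look identical to the policy, which is one way to read the "overconservative policy" limitation above.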
Note 2: Limitations of existing system identification approaches
- Existing approaches try to predict every physics parameter exactly, which is hard to get working in practice and results in poor performance
- This paper addresses the problem with a low-dimensional latent environment embedding $z_t$ (Note that instead of predicting $e_t$ , which is the case in typical system identification, we directly estimate the extrinsics $z_t$ that only encodes how the behavior should change to correct for the given environment vector $e_t$ )
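
Written out, the relationship between $e_t$ and $z_t$ described above looks roughly like this. The symbols $\mu$ (environment encoder), $\pi$ (base policy), $\phi$ (adaptation module), $x_t$ (state), $a_t$ (action), and $k$ (history length, 50 pairs in this post) are notation assumed here for illustration; only $e_t$ and $z_t$ appear in the text above.

$$
\begin{aligned}
z_t &= \mu(e_t) && \text{(simulation only, where } e_t \text{ is known)} \\
a_t &= \pi(x_t, a_{t-1}, z_t) \\
\hat{z}_t &= \phi(x_{t-k:t-1},\, a_{t-k:t-1}) && \text{(deployment, where } e_t \text{ is unknown)} \\
\mathcal{L}(\phi) &= \lVert \hat{z}_t - z_t \rVert^2
\end{aligned}
$$

The regression target is $z_t$, not $e_t$: the adaptation module only has to recover the part of the environment that changes the policy's behavior, not every physics parameter.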
Code: none
Reference: https://seunghyun-lee.tistory.com/65
