Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

DeepDIV! · July 26, 2024

SKT AI Fellowship (6th cohort) paper review

Paper: https://arxiv.org/pdf/2404.05719
Code: none available

1. Contributions

Ferret-UI

An mLLM (multimodal LLM) with a strong understanding of UI screens

Ferret Training Method

1. Data Generation
Training data was generated separately for each level of task granularity

Elementary Tasks: training samples for basic UI tasks were generated with a template-based approach
- Goal: improve the model's understanding of the meaning and position of UI elements
Advanced Tasks: data for advanced tasks was generated with GPT-4
- Includes detailed descriptions, interactions, etc.
- Goal: help the model perceive subtle differences among visual components and understand the purpose of a given UI screen

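Template-based approach
A method that uses predefined templates to solve a problem or carry out a task. Here, the templates are used to generate training samples for the elementary tasks.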
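
A minimal sketch of what template-based sample generation could look like. The template wording, the `UIElement` fields, and the `make_find_sample` helper are all hypothetical and are not taken from the paper; the only idea borrowed from it is that a fixed template is filled with an element's annotations to produce a question/answer pair.

```python
import random
from dataclasses import dataclass


@dataclass
class UIElement:
    ui_type: str   # e.g. "Button", "Text", "Icon"
    text: str      # visible label of the element
    bbox: tuple    # (x1, y1, x2, y2) in pixels


# Hypothetical question templates for a "find this widget" task.
FIND_TEMPLATES = [
    'Where is the {ui_type} "{text}"?',
    'Locate the {ui_type} labeled "{text}" on the screen.',
]


def make_find_sample(element, rng):
    """Fill one randomly chosen template with the element's annotations."""
    template = rng.choice(FIND_TEMPLATES)
    question = template.format(ui_type=element.ui_type.lower(), text=element.text)
    answer = "[{}, {}, {}, {}]".format(*element.bbox)
    return {"question": question, "answer": answer}


sample = make_find_sample(
    UIElement("Button", "Sign in", (40, 900, 320, 960)),
    random.Random(0),
)
```

Using several phrasings per task is one simple way to keep the generated questions from being identical across thousands of elements.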

2. Develop a test benchmark

A benchmark of 14 diverse mobile UI tasks
- Used to check whether the domain-specific training was effective
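
These notes do not say how benchmark answers are scored. For grounding-style tasks, where the model outputs a bounding box, a common choice is intersection over union (IoU) with a threshold. The sketch below assumes boxes in `(x1, y1, x2, y2)` format and a threshold of 0.5; both are assumptions for illustration, and `grounding_accuracy` is a hypothetical helper name.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds, golds, threshold=0.5):
    """Fraction of predictions whose IoU with the gold box exceeds the threshold."""
    if not golds:
        return 0.0
    hits = sum(iou(p, g) > threshold for p, g in zip(preds, golds))
    return hits / len(golds)
```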

Ferret-UI Contributions

Proposes the Ferret-UI model
- Works with UI screens of any aspect ratio
- Accepts any resolution
- The first UI-centric mLLM capable of referring, grounding, and reasoning tasks
Builds training samples for elementary and advanced UI tasks and trains the model on them
Builds a test benchmark
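
These notes do not explain how arbitrary aspect ratios are handled. One plausible scheme, and to my understanding close to what the paper describes, is to split each screen into two sub-images along its longer side so that each piece can be encoded at higher effective resolution. The sketch below only computes the crop boxes; the function name `split_screen` and the exact halving rule are assumptions.

```python
def split_screen(width, height):
    """Return two crop boxes (left, top, right, bottom) covering the screen.

    Portrait screens are cut into top and bottom halves,
    landscape screens into left and right halves.
    """
    if height >= width:
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


portrait_boxes = split_screen(1080, 1920)
landscape_boxes = split_screen(1920, 1080)
```

The resulting boxes could be passed to any image library's crop function before the sub-images are fed to the image encoder together with the full screen.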

Many prior studies aimed to improve understanding of the UI screen itself; this paper instead develops an mLLM that learns to understand UI screens so that it can carry out UI tasks

3. Method

Shows how Ferret-UI, built on the Ferret model, improves its understanding of UI screens and performs referring and grounding tasks

Takes raw screen pixels as input (many other approaches take the output of external detection modules or screen view files as input)
- Can handle advanced single-screen interactions
- Adapts easily to new applications

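External detection modules
Independent software or hardware components that run outside the system to identify and analyze UI elements

They can perform UI element detection, text recognition, event detection, etc.

Screen view files
Files that store the state of a UI (they record the UI components and their properties at a specific point in time)

They contain the UI structure, event logs, etc.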

4. Dataset and Task Formulation

Introduces how the data for model training and evaluation was created

Our research uses Android, so the iPhone data is not covered in this summary

UI Data Collection

Android UI screen data was collected from the RICO dataset

RICO is the basis for the screen2words, widgetcaptions, and taperception tasks
A pixel-based UI detection model was used to annotate the UI screens in detail
- Every detected UI element is labeled with a UI type
  (e.g., Button, Text, Icon, Picture)
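
To make the annotation step concrete, here is a small sketch of how per-screen detector output might be stored and queried. The record layout (`type`, `bbox`, `text`) and the example values are invented for illustration; the actual detector output format is not described in these notes.

```python
from collections import Counter

# Hypothetical detector output for one screen; field names are illustrative.
detections = [
    {"type": "Text", "bbox": (40, 120, 680, 180), "text": "Welcome back"},
    {"type": "Button", "bbox": (40, 900, 320, 960), "text": "Sign in"},
    {"type": "Button", "bbox": (360, 900, 680, 960), "text": "Register"},
    {"type": "Icon", "bbox": (620, 40, 680, 100), "text": ""},
]


def count_by_type(dets):
    """Histogram of UI types on a screen."""
    return Counter(d["type"] for d in dets)


def elements_of_type(dets, ui_type):
    """All detected elements with the given UI type, in detection order."""
    return [d for d in dets if d["type"] == ui_type]


type_counts = count_by_type(detections)
buttons = elements_of_type(detections, "Button")
```

Records like these, with a type and a bounding box per element, are the kind of input a template-based generator needs to build elementary-task samples.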

Taperception
A dataset used to predict and explain the tappability of elements in mobile user interfaces (UIs)

Provided by Google Research

Task Formulation

To train Ferret, the UI screen data was reformatted in three ways

Reformatting Spotlight

5. Experiments

6. Conclusion

Parts applicable to our research
