Using Speech to Visualise Shared Gaze Cues in MR Remote Collaboration

Sei Kang · 14 November 2023

AR/VR Paper Review (4/5)


ABSTRACT

  • 360° panoramic MR system - visualises shared gaze cues using contextual speech input to improve task coordination
  • Two studies were conducted to evaluate the design of the MR gaze-speech interface, exploring combinations of visualisation style and context control level
  1. An explicit visual form that directly connects the collaborators' shared gaze to the contextual conversation is preferred
  2. The gaze-speech modality shortens the coordination time needed to attend to the shared interest, making communication more natural and the collaboration more effective

INTRODUCTION

  • This research explores how speech-triggered visualisation of gaze cues can improve Mixed Reality (MR) remote collaboration.

  • A live 360° panoramic MR system can enable remote pairs to be immersed in the same task space and share virtual non-verbal cues

    • Shared gaze visualisations (SGVs) encourage the active use of deictic references to facilitate joint attention to the objects of interest
    • In previous MR studies, gaze has not been fully integrated with verbal cues.
  • This study extends previous research on SGVs and 360° MR remote collaboration by adding context-based speech input as an additional interaction modality, providing explicit visual guidance to encourage proactive task coordination and easier communication.

  • Joint Gaze Indicator (JGI): connects the collaborators' gaze focus to guide them to attend to an object of interest with fewer verbal descriptions and better coordination

  • Two within-subject studies were conducted using the three JGIs:

  1. investigated the design of three MR JGI interfaces based on the level of visual obtrusiveness to understand how contextual speech input should be visualised
    -> The explicit gaze-speech JGI design is preferred because it creates a direct connection between the context and the collaborators' gaze focus

  2. compared three JGI modalities (gaze-speech, gaze-only, and constant JGI) against no JGI to evaluate their effect on task coordination and mutual communication.
    -> The gaze-speech modality shortens the coordination time spent exchanging information to attend to the shared object of interest

Contribution

  • The first study to integrate contextual speech-triggered gaze visualisations into 360° MR remote collaboration
  • An evaluation of gaze-speech MR interface designs
  • A user study to examine the use of multimodal gaze input and its visualisations to enhance task coordination

SYSTEM DESIGN

  • The system uses eye-tracking to capture and share bi-directional real-time gaze cues and combines them with speech input.

360 Mixed Reality View

  • A 360° panoramic camera is placed at the centre of the task space to share the live scene between users, so the shared view is omni-directional

Shared Gaze Representation

  • A combination of a virtual ray and a cursor attached at its endpoint represents the partner's gaze direction and location, while only the cursor illustrates the user's own gaze

  • Three gaze behaviour states are visualised based on prior work [24]:
    1) Browse state (default) - where there are active eye movements, shown as a blue single cursor
    2) Focus state - where the gaze dwells for 500ms or more, shown as a yellow double-ring cursor
    3) Joint state - where the collaborators' gaze overlaps for over 200ms, shown as a green single cursor twice as large, spawned from the midpoint of the collaborators' gaze points, to indicate the shared gaze location.
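
As a rough illustration of how these states could be derived from streamed gaze samples, here is a minimal sketch; only the 500 ms dwell and 200 ms overlap thresholds come from the paper, while the sample format, the proximity radius, and all names are illustrative assumptions.

```python
# Hypothetical sketch of the three gaze behaviour states described above.
# Only the 500 ms dwell and 200 ms overlap thresholds come from the paper;
# the sample format and the proximity radius are illustrative assumptions.
from dataclasses import dataclass

DWELL_MS = 500      # Focus state: gaze dwells for 500 ms or more
OVERLAP_MS = 200    # Joint state: collaborators' gaze overlaps for over 200 ms
NEAR_PX = 50        # assumed radius (in panorama pixels) for "overlapping" gaze

@dataclass
class GazeSample:
    t_ms: float     # timestamp in milliseconds
    x: float        # gaze position projected onto the shared 360° panorama
    y: float

def dist(a: GazeSample, b: GazeSample) -> float:
    return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5

def classify_state(own: list[GazeSample], partner: list[GazeSample]) -> str:
    """Return 'joint', 'focus', or 'browse' for the latest own-gaze sample."""
    now = own[-1]

    # Joint state: the partner's gaze stayed close to ours for over 200 ms
    # (rendered as a large green cursor at the midpoint of the two gaze points).
    overlap = [s for s in partner if now.t_ms - s.t_ms <= OVERLAP_MS]
    if overlap and all(dist(now, s) < NEAR_PX for s in overlap):
        return "joint"

    # Focus state: our own gaze has stayed within a small radius over the last
    # 500 ms (rendered as a yellow double-ring cursor).
    recent = [s for s in own if now.t_ms - s.t_ms <= DWELL_MS]
    if now.t_ms - own[0].t_ms >= DWELL_MS and all(dist(now, s) < NEAR_PX for s in recent):
        return "focus"

    # Browse state (default): active eye movements, shown as a blue cursor.
    return "browse"
```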

Gaze-Speech Interaction and Design

  • Speech recognition is used to detect contextual verbal conversation during task collaboration, aiming to build direct connections with the shared gaze cues

  • Six frequently used deictic references were used as target keywords: "this", "that", "here", "there", "look", and "see"

  • When a target keyword is spoken by a user (the sender), the system shows a visual guide, the Joint Gaze Indicator (JGI), to the collaborator who receives the instruction (the receiver); see the sketch after the interface list below

  • To minimise distraction, the JGIs are only shown in the receiver’s view (uni-directional)

  • Three MR interfaces were designed to showcase JGI
    (1) Arrow Pointer- a subtle JGI using a small arrow rotating around the receiver’s gaze cursor to point at the sender’s gaze
    (2) Screen-edge Pointer- a moderately visible JGI using a big arrow along the edge of the receiver’s field of view (FOV) to point towards the sender’s gaze only when the sender’s gaze is out of FOV
    (3) Ray Pointer - a very obvious JGI using a 3D bent ray to visually link the sender’s and receiver’s gaze cursor
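
The keyword trigger itself can be summarised with a small sketch. The keyword list and the uni-directional behaviour follow the description above; the speech recogniser and the `show_jgi` callback are assumed placeholders, not the paper's actual implementation.

```python
# Illustrative sketch of the gaze-speech trigger: when the sender speaks one of
# the deictic keywords, a JGI is shown only in the receiver's view.
# `show_jgi` is an assumed placeholder callback, not the paper's API.
KEYWORDS = {"this", "that", "here", "there", "look", "see"}

def contains_trigger(transcript: str) -> bool:
    """True if a recognised phrase contains any of the target keywords."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return bool(words & KEYWORDS)

def on_transcript(sender_id: str, receiver_id: str, transcript: str, show_jgi) -> None:
    """Called whenever the speech recogniser emits a recognised phrase."""
    if contains_trigger(transcript):
        # Uni-directional: only the receiver sees the indicator, which guides
        # the receiver's gaze towards the sender's current gaze location.
        show_jgi(viewer=receiver_id, target_gaze_of=sender_id)
```

The Screen-edge Pointer can likewise be sketched as simple screen-space geometry. Only the behaviour (the arrow appears when the sender's gaze is outside the receiver's FOV) comes from the paper; the 2D projection model and the margin value are assumptions.

```python
# Sketch of the Screen-edge Pointer: an edge arrow is shown only when the
# sender's gaze falls outside the receiver's FOV, pointing towards it.
import math

def screen_edge_arrow(gaze_xy: tuple[float, float], screen_w: int, screen_h: int,
                      margin: int = 40) -> tuple[float, float, float] | None:
    """Return (arrow_x, arrow_y, angle_rad) on the screen edge, or None when the
    sender's gaze is already inside the receiver's field of view."""
    gx, gy = gaze_xy
    if 0 <= gx <= screen_w and 0 <= gy <= screen_h:
        return None  # in view: the gaze ray and cursor are enough

    cx, cy = screen_w / 2, screen_h / 2
    angle = math.atan2(gy - cy, gx - cx)              # direction towards the gaze
    ax = min(max(gx, margin), screen_w - margin)      # clamp to the screen border
    ay = min(max(gy, margin), screen_h - margin)
    return ax, ay, angle
```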


USER STUDY 1 - GAZE INTERFACE

  • evaluate the design of the three MR JGI interfaces

Experimental Design and Measures

  • A within-subject design was employed with three conditions: "Arrow Pointer", "Screen-edge Pointer", and "Ray Pointer"

Task and Process

  • Task: collaboratively locate 6 duplicated symbols among a pool of 44 abstract symbols randomly scattered on three walls (Fig. 2)

  • The roles were swapped after completing all conditions

Results

  • 3 conditions x 2 roles x 12 participants

User Preference Ranking

  • Participants were asked for their preference between two sets of contextual keywords used to trigger the JGI: G1) deictic references, including "this", "that", "here", and "there", and G2) commonly used gaze-related signal words "see" and "look"
  • "Ray Pointer" was the most preferred UI

Interviews

  • Arrow Pointer: less distracting but not obvious enough to prompt attention
  • Screen-edge Pointer: relatively hard to notice
  • Ray Pointer: allows precisely locating the partner's gaze and understanding their intention more quickly
  • Additionally, all participants except one preferred to remove the avatar gaze ray when the Ray Pointer is enabled
  • Participants prefer an obvious JGI, triggered by in-situ contextual conversation, that creates a direct connection between the collaborators' gaze points.
  • A JGI that is available all the time may add extra cognitive load during task collaboration -> therefore, the JGI display style and the modality of context control need further investigation = User Study 2

USER STUDY 2 - MULTIMODAL GAZE SHARING

  • understand how to combine the JGI display styles with the different levels of context control to positively affect task coordination.
  • The Ray Pointer UI was extended and the avatar gaze ray was removed

Four conditions were designed

  • In the condition design space, the x-axis represents how closely the context is related to the control modality and the y-axis shows how frequently the indicator is displayed.
  1. Indicator ”always on” (AO) - requires no control effort (passive) and JGI is displayed constantly
  2. Indicator controlled by voice (VO) - uses keywords from spoken phrases to trigger the JGI; VO is controlled by the conversation, so the display is less frequent and carries contextual meaning
  3. Indicator controlled by "eye dwell" (ED) - uses a 500ms dwell threshold to enable (> 500ms) or disable (<= 500ms) the JGI; the dwell behaviour in ED sits between AO and VO, as it is only indirectly related to the context depending on intentional or unintentional gaze focus
  4. No Indicator (NI) - works as a baseline to compare against
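
Put together, the four conditions differ only in when the JGI is allowed to appear. A minimal sketch of that gating logic, assuming `keyword_heard` and `dwell_ms` come from the speech and eye-tracking pipelines respectively (these names are not from the paper):

```python
# Illustrative gating of JGI visibility per condition; the inputs are assumed
# to come from the speech trigger and eye tracker, not the paper's actual API.
DWELL_THRESHOLD_MS = 500   # ED enables the JGI only after a 500 ms dwell

def jgi_visible(condition: str, keyword_heard: bool, dwell_ms: float) -> bool:
    if condition == "AO":   # always on: passive, no control effort
        return True
    if condition == "VO":   # voice: shown only when a deictic keyword is spoken
        return keyword_heard
    if condition == "ED":   # eye dwell: shown after an intentional gaze focus
        return dwell_ms > DWELL_THRESHOLD_MS
    return False            # NI: baseline, indicator never shown
```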

Task

Task: forty abstract symbols were randomly scattered on three walls in the local user's task space -> the pair must locate four consecutive symbols using the provided references

  • Participant pairs (local, remote) each receive a separate reference showing only half of the symbols
  • AR (local user): the reference is a piece of paper
  • VR (remote user): the reference is shown on the HMD, with controllers used to toggle it on and off

Measures

Time

  • Completion Time (seconds): the time taken to complete each task, reflecting performance per pair.
  • Joint Time (seconds): the time spent while the pair's gaze overlapped. Only overlaps longer than 200 milliseconds were included, to avoid counting accidental joint gaze.
  • Coordination Time (seconds): the time from when the JGI is triggered until the joint state is reached. For AO (indicator always enabled) and NI (indicator constantly disabled), the focus state was used as the start of coordination, because a focus-led joint state aligns with the definition of an implicit JGI (see the sketch below).
  • Browse Time (seconds): the rest of the time, where the collaborators' gaze idles or explores separately.
  • Local and remote coordination time (seconds): the average coordination time per task in the local and remote roles respectively

Gaze behaviour distribution (%)

  • describes the distribution of joint time, coordination time, and browse time over completion time across all conditions.
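
A minimal sketch of how the time measures above and the gaze behaviour distribution could be computed from a per-task log of labelled intervals; the log format is an assumption, while the 200 ms joint filter follows the definition above.

```python
# Sketch of the time measures, assuming each task yields a list of labelled
# intervals (state, start_s, end_s); the log format is an assumption.
def summarise_times(intervals: list[tuple[str, float, float]]) -> dict[str, float]:
    totals = {"joint": 0.0, "coordination": 0.0, "browse": 0.0}
    for state, start, end in intervals:
        duration = end - start
        if state == "joint" and duration < 0.2:
            continue                      # drop accidental joint gaze (< 200 ms)
        totals[state] += duration
    # Completion time: from the first logged event to the last one.
    totals["completion"] = intervals[-1][2] - intervals[0][1]
    return totals

def gaze_behaviour_distribution(totals: dict[str, float]) -> dict[str, float]:
    """Joint, coordination, and browse time as a percentage of completion time."""
    return {state: 100.0 * totals[state] / totals["completion"]
            for state in ("joint", "coordination", "browse")}
```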

Coordination between roles (seconds)

  • how successfully JGIs support local and remote pair communication during task coordination

Results

Gaze Behaviour Analysis

  • Completion Time: no significant difference
  • Gaze Behaviour Distribution
    • AO, VO, ED > NI (Joint Time)
    • no significant difference in Coordination Time and Browse Time
  • Coordination between roles: VO was significantly faster than NI in remote roles
    -> VO enables collaborators to rapidly achieve joint attention

Post-Condition Survey

  • Social Presence
    • Co-presence (CP): AO, VO, ED > NI (in local & remote)
    • Attentional Allocation (AA): AO, ED > NI (in local) / ED > NI (in remote)
    • Perceived Message Understanding (PMU): AO, VO, ED > NI (in local) / VO > NI (in remote)

=> The three JGIs help improve social presence in both roles compared to NI. However, AO and ED are better at bringing the collaborators' attention towards each other, while VO ensures better understanding of the messages exchanged between collaborators

  • Collaboration Experience
    • the three JGIs were perceived as more effective than NI
    • AO and ED required less mental effort than NI
    • Distraction-wise: no significant difference
    • Cognitive Load: AO and ED < NI (local), VO < NI (remote)

=> JGI in AO and ED requires a lower mental load in local roles while a proactive JGI controlled by contextual voice command is less mentally demanding in remote roles.

Post-Study Questionnaire


  • Preference: AO was most preferred to represent shared experience while VO was the best to improve collaborative communication in local roles.

DISCUSSION

Local vs Remote Coordination

  • The experiences of local and remote task coordination were similar
  • The AR HMD has a smaller field of view (FOV), making it hard to see the endpoint of the JGI, and the perception of virtual cues sometimes gets skewed because of that. In contrast, VR has a wider FOV, so it felt much more continuous and less disorienting with virtual cues enabled within the FOV

JGI Design Space

  • Local user (AR task): a constantly displayed JGI may be suitable -> it requires no control effort, imposes less mental load to coordinate, and stays passive within a smaller FOV
  • Remote user (VR task): context-based speech input seems to be a better choice -> the remote user is physically disconnected from the local task space, and the wider FOV can afford a certain degree of virtual visualisation, so proactive speech-controlled shared gaze visualisations may have a bigger effect on improving coordination and language comprehension.
