[Paper Review] Towards Ordinal Suicide Ideation Detection on Social Media

fla1512 · July 24, 2022

MentalHealth

NLP Study


[Quick paper summary!]

Task: Assess suicide risk with SISMO
Data: Reddit posts of 500 users across 9 mental health and suicide related subreddits
Model: SISMO (Suicide Ideation detection on Social Media)
Contribution: Unlike prior suicide risk assessment work, the task is reformulated as an ordinal regression problem so that users at imminent risk can be prioritized for intervention.

`Abstract`

(How the paper views social media)

Social media is ubiquitous → it gives individuals a platform for expressing suicide ideation outside traditional clinical settings.
Neural methods for suicide risk assessment already exist, but they ignore the inherent ordinal nature across fine-grained levels of suicide risk.

(So, in this paper)
1. Suicide risk assessment is reformulated as an ordinal regression problem, based on the Columbia Suicide Severity Scale.
2. SISMO (Suicide Ideation detection on Social Media) is proposed.
- It is a hierarchical attention model.
  - Hierarchical attention model: motivated by the observation that documents are inherently hierarchical; it uses an attention mechanism to give more weight to important words and sentences, with the aim of improving document classification performance.
- It is trained with a soft probability distribution (since not all wrong risk-levels are equally wrong).
- It can learn the natural inter-class relationships between suicide risk levels.
3. Real Reddit data annotated by clinical experts is used.

`Introduction`

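Current state of suicide: severe across time and age groups → the predictive performance of clinical methods is no longer improving.

(So, as an encouraging aspect)

Let's assess risk by predicting from users' online behavior → language is subjective (a limitation) → still, the field has advanced a lot.

(But there is a challenging aspect: it is not easy to apply)

Problems
1. Binary classification → an oversimplification that can lead to artificial notions of risk (there is no prioritization).
2. A lack of finer-grained evaluation of the degree of risk a user poses.
3. All risk levels are treated equally (the inherent order between risk levels is ignored) → accurate prediction is not possible (see Figure 1).

(So, what this paper aims to contribute)

Reformulate suicide risk assessment as an ordinal regression problem (not all wrong classes are equally wrong)
Propose SISMO (Suicide Ideation detection on Social Media)
Discuss practical and ethical issues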

`Related Work`

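[Neural Methods]
1. NLP → language-specific methods (a limitation)
2. Problem: under a binary classification scheme, clinical intervention cannot prioritize support for the people at highest risk.
[Fine-Grained Assessment]
1. (Recent research focus) has shifted to varying degrees of severity.
2. Three limitations of ordinal regression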

`Methodology`

1. Problem Formulation

Goal: assess a user's suicide risk by analyzing their posts
User assessment: based on the Columbia-Suicide Severity Rating Scale (C-SSRS)
- Five classes: Support (SU) < Indicator (IN) < Ideation (ID) < Behavior (BR) < Attempt (AT)
- Because the suicide risk levels have a relative order, the task is framed as an ordinal regression problem (a multi-class problem with ordered classes)
- Five levels of increasing suicide risk: 𝑦 ∈ {SU, IN, ID, BR, AT}.
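
To make the ordering concrete, here is a minimal sketch of the five levels as an ordered enum. The class name `Risk` and the helper `ordinal_distance` are made up for illustration and do not come from the paper; the integer codes follow the SU = 0 ... AT = 4 encoding used later in this post.

```python
from enum import IntEnum


class Risk(IntEnum):
    """C-SSRS-based risk levels, ordered from lowest to highest risk."""
    SU = 0  # Support
    IN = 1  # Indicator
    ID = 2  # Ideation
    BR = 3  # Behavior
    AT = 4  # Attempt


def ordinal_distance(actual: Risk, predicted: Risk) -> int:
    """Number of levels separating the actual and predicted risk."""
    return abs(int(actual) - int(predicted))


# Predicting SU for an AT user is a larger mistake than predicting BR.
far = ordinal_distance(Risk.AT, Risk.SU)
near = ordinal_distance(Risk.AT, Risk.BR)
```

A plain multi-class setup would treat both mistakes the same, which is exactly what the ordinal formulation tries to avoid.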

2. Post Embedding

Posts are written by users and are potential indicators of suicide risk.
User posts layer (first layer): acts as the user post embedding layer and is where word-level attention comes into play.
Longformer layer
- A pre-trained transformer language model
- Produces post-level embeddings
- Its self-attention mechanism emphasizes the tokens in a post that are risk-markers indicative of suicidality
  - Tokenize each historical post
  - Add a [CLS] token at the start of each post
    - Use the final hidden state corresponding to the [CLS] token as the aggregate representation of the post
  - Encode each post

3. User Context Modeling

The build-up to suicide ideation can unfold over weeks, months, or years.
Each post can be an indicator of suicidal risk on its own, but modeling the user's sequence of posts gives a better, more holistic view.

(So)

Bi-LSTM is used → to capture temporal features across posts

Sequential Context Modeling
- LSTM: can model posts in chronological order (long-term dependencies)
  - The left-to-right and right-to-left hidden state vectors are concatenated, so context comes from both past and future posts
  - The BiLSTM turns the historical post encodings into contextual representations
Temporal Attention
- Motivation: the relevant signals are often present in only a few posts
- Introduced because the degree of suicidality and the presence of suicide association differ across each user's posts
- Learns adaptive weights for each post's contextual representation
  - highlighting posts with indicative markers for suicide risk and aggregating them
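
The weighting-and-aggregating step can be sketched in plain Python. This is only an illustration under assumptions: scoring each post by a dot product with a learned `query` vector is one common choice, not necessarily the scoring function used in the paper, and the toy vectors stand in for BiLSTM outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(post_vectors, query):
    """Weight each post vector by its relevance and sum them into one user vector.

    Assumption: relevance is the dot product between the post vector and a
    query vector (which would be learned during training).
    """
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in post_vectors]
    weights = softmax(scores)
    dim = len(post_vectors[0])
    user_vector = [
        sum(w * vec[d] for w, vec in zip(weights, post_vectors))
        for d in range(dim)
    ]
    return weights, user_vector


# Toy contextual representations for three posts (hypothetical values).
posts = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, user_vec = temporal_attention(posts, query)
```

Here the second post gets the largest weight, so it dominates the aggregated user representation, which mirrors the idea that only a few posts carry the relevant signal.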

4. Ordinal Regression

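(Background)

Final layer
- = FC (fully connected) layer
- Input: the user's attention-aggregated post representation
- Output: suicide risk level (classification confidence for all suicide risk-levels)
Training SISMO
- Aims to preserve the inherent order of the suicide risk levels
- Trained to minimize an ordinal regression loss
  - Similar to the cross-entropy loss used by suicide risk classification models that treat all the risk-levels equally
- Uses a soft encoded vector (probability distribution) as the target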
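
A small sketch of why a soft target matters. The loss below is ordinary cross-entropy computed against a probability-distribution target; the function name and all the numbers are illustrative assumptions, not values from the paper.

```python
import math


def soft_cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


# True level is ID (index 2). Both targets below are illustrative.
hard_target = [0.0, 0.0, 1.0, 0.0, 0.0]
soft_target = [0.05, 0.20, 0.50, 0.20, 0.05]

# Two wrong predictions that put the same probability (0.15) on ID:
near_miss = [0.05, 0.10, 0.15, 0.60, 0.10]  # most mass on BR, one level away
far_miss = [0.60, 0.10, 0.15, 0.10, 0.05]   # most mass on SU, two levels away

hard_near = soft_cross_entropy(near_miss, hard_target)
hard_far = soft_cross_entropy(far_miss, hard_target)
soft_near = soft_cross_entropy(near_miss, soft_target)
soft_far = soft_cross_entropy(far_miss, soft_target)
```

With the one-hot target the two mistakes cost the same, because only the probability on ID counts. With the soft target the far miss is penalized more than the near miss, which is the "not all wrong risk-levels are equally wrong" idea.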

(How it is used)

Y = {SU = 0, IN = 1, ID = 2, BR = 3, AT = 4} = $\{r_i\}_{i=0}^{4}$: five ordinal risk-levels
Soft labels are computed as probability distributions $\mathbf{y} = [y_0, y_1, \dots, y_4]$ over the ground truth labels.
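
This post does not write out the soft-label formula, so the sketch below assumes a common form: a softmax over the negative, α-scaled absolute distance between each level and the true level. The function name `soft_labels` is made up for illustration.

```python
import math


def soft_labels(true_level, alpha, num_levels=5):
    """Soft label distribution centered on the true level.

    Assumed form: y_i is proportional to exp(-alpha * |true_level - i|).
    """
    weights = [math.exp(-alpha * abs(true_level - i)) for i in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


# Soft label for a user whose ground-truth level is ID (index 2).
y = soft_labels(true_level=2, alpha=1.8)
```

The resulting vector peaks at the true level and decays symmetrically, so neighboring levels (IN, BR) keep more probability than distant ones (SU, AT).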

`Dataset`

Experimental setup

1. Data

1.1 (Overview)

Reddit posts (270,000 users → cohort of 2181 Redditors, across 9 mental health related subreddits)
Annotation: done by four practicing psychiatrists following C-SSRS
- acceptable average pairwise agreement of 0.79
- group-wise agreement of 0.73
Posts per user: 18.25 ± 27.45 (max 292)
Tokens per post: 73.4 ± 97.7

1.2 Exploratory SAGE Analysis

Purpose: to assess the language variation across the five levels

Uses the Sparse Additive Generative Model (SAGE)
- A combination of topic models and generalized additive models
- Uses a log-odds ratio approach
  - to contrast word distributions
  - shows which words contribute most to each risk level [Table 1]
    - e.g. Supportive level → includes positive words (perseverance, aspiration)
    - Moving toward higher suicide risk, the language shows clear signs of distress and becomes more explicit
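
To get a feel for the log-odds idea, here is a rough sketch. It is a plain smoothed log-odds ratio between one class and a background corpus, not SAGE itself (SAGE adds sparsity-inducing regularization on top), and the token lists are invented toy data.

```python
import math
from collections import Counter


def log_odds(class_tokens, background_tokens, smoothing=1.0):
    """Smoothed log-odds ratio of each word: class vs. background."""
    class_counts = Counter(class_tokens)
    bg_counts = Counter(background_tokens)
    vocab = set(class_counts) | set(bg_counts)
    class_total = sum(class_counts.values()) + smoothing * len(vocab)
    bg_total = sum(bg_counts.values()) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        p_class = (class_counts[word] + smoothing) / class_total
        p_bg = (bg_counts[word] + smoothing) / bg_total
        scores[word] = (
            math.log(p_class / (1 - p_class)) - math.log(p_bg / (1 - p_bg))
        )
    return scores


# Invented toy corpora: one "supportive" class and everything else.
supportive = "keep going you can do this keep hope".split()
background = "i feel tired i can not do this anymore".split()
scores = log_odds(supportive, background)
top_word = max(scores, key=scores.get)
```

Words with a high positive score are over-represented in the class, which is the kind of evidence a per-level word table summarizes.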

1.3 [Preprocessing and Data Split] (lists the preprocessing and data split steps)

2. Evaluation Metrics

The definitions of FP (false positive) and FN (false negative) are changed
- FN: ratio of number of times the predicted severity of suicide risk level (𝑘𝑝) is less than the actual risk level (𝑘𝑎), i.e. actual > predicted
  - e.g. a user at risk level 3 predicted as 1 → dangerous (counted as a wrong prediction, an FN)
  - Conversely, a user at risk level 3 predicted as 4 → the risk is not under-estimated, so this is not an FN
- FP: ratio of number of times the predicted risk 𝑘𝑝 is greater than the actual risk 𝑘𝑎, i.e. predicted > actual
  - e.g. a user at risk level 3 predicted as 4 → less harmful than under-estimating, but still a wrong prediction, counted as an FP
  - 𝑁𝑇: size of the test set
  - 𝑘𝑎: the actual risk level
  - 𝑘𝑝: predicted severity of suicide risk level
  - For reference, the standard definitions: False Negative (FN) is an actually True case predicted as False (wrong); False Positive (FP) is an actually False case predicted as True (wrong)
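
The modified FP and FN can be written directly from the definitions above. The function name and the toy label lists are made up for illustration; levels use the 0 to 4 encoding.

```python
def graded_error_rates(actual, predicted):
    """Return (FP, FN) as ratios over the test set.

    FN: predicted level is lower than the actual level (risk under-estimated).
    FP: predicted level is higher than the actual level (risk over-estimated).
    """
    n_test = len(actual)
    pairs = list(zip(actual, predicted))
    fn = sum(1 for ka, kp in pairs if kp < ka) / n_test
    fp = sum(1 for ka, kp in pairs if kp > ka) / n_test
    return fp, fn


# Toy example with five users.
actual = [3, 3, 0, 2, 4]
predicted = [1, 4, 0, 2, 3]
fp, fn = graded_error_rates(actual, predicted)
```

In this toy example two users are under-estimated and one is over-estimated, so FN is 0.4 and FP is 0.2.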
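Precision and recall
- Modified to incorporate the ordering between risk levels (because the definitions of FP and FN change, the metrics get the "graded" prefix)
- precision → Graded Precision (GP)
- recall → Graded Recall (GR)
- For reference, standard precision and recall
  - precision: of the cases the model classifies as True, the fraction that are actually True
    - = TP/(TP+FP)
  - recall: of the actually True cases, the fraction the model predicts as True
    - = TP/(TP+FN)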
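
Plugging the modified FP and FN into the usual formulas gives a sketch of GP and GR. One assumption to flag: this post does not say how TP is defined in the graded setting, so the code takes TP to be the ratio of exact matches. The function name and toy data are illustrative.

```python
def graded_precision_recall(actual, predicted):
    """Graded Precision and Graded Recall from ordinal FP/FN.

    Assumption: TP is the ratio of exactly correct predictions.
    """
    n_test = len(actual)
    pairs = list(zip(actual, predicted))
    tp = sum(1 for ka, kp in pairs if kp == ka) / n_test
    fp = sum(1 for ka, kp in pairs if kp > ka) / n_test  # over-estimated risk
    fn = sum(1 for ka, kp in pairs if kp < ka) / n_test  # under-estimated risk
    gp = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    gr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return gp, gr


actual = [3, 3, 0, 2, 4]
predicted = [1, 4, 0, 2, 3]
gp, gr = graded_precision_recall(actual, predicted)
```

Because under-estimates only hurt GR, a model that misses high-risk users shows up as low graded recall even when its graded precision looks fine.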

3. Baselines

Handcrafted features
- SVM+RBF
- SVM-L
- Random Forest
- MLP
Deep learning approaches
- Contextual CNN
- Suicide Detection Model
- ContextBERT

4. Experimental Settings

(lists how hyperparameters such as dropout and learning rate were set)

Base version fine-tuned
- Longformer
- Uses Huggingface's Transformers library

`Experiments`

Results

1. Performance Comparison

1.1 How does SISMO perform compared to the baselines?

SISMO > ContextBERT (especially at the higher risk levels)
- Why? The ordinal formulation (predictions are penalized according to how far the predicted risk is from the actual risk on the C-SSRS scale)
deep learning approaches > handcrafted feature-based models
- Why? They use embedding approaches that aggregate user posts more effectively to model the mental state of a user.
- SDM, ContextBERT > Contextual CNN
  - Why? Sequential models learn better representations from the temporal context in a user history

1.2 3+1 Label Classification: How well does SISMO perform at a coarse-grained segregation of high risk users compared to low (or no) risk users?

The SU and IN classes are merged into a common no-risk category
Accuracy of the predictive models improves further
- Why? There is one fewer category, and since the two merged classes (SU, IN) make up the majority of the data, models predict the low-risk classes better than BR and AT
SISMO predicts the higher risk levels better (visible in the graded recall values, Table 3)
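
The 3+1 relabeling is just a class merge. A minimal sketch, where the merged label name `NO-RISK` and the helper `to_coarse` are hypothetical names chosen for illustration:

```python
# Map the five fine-grained levels onto the 3+1 scheme:
# SU and IN collapse into one no-risk class, the rest stay as they are.
MERGE = {"SU": "NO-RISK", "IN": "NO-RISK", "ID": "ID", "BR": "BR", "AT": "AT"}


def to_coarse(labels):
    """Relabel a list of five-level labels into the 3+1 scheme."""
    return [MERGE[label] for label in labels]


coarse = to_coarse(["SU", "IN", "ID", "AT"])
num_classes = len(set(MERGE.values()))
```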

2. Ablation study

How does the model performance improve upon adding each component of SISMO to a naive BERT + Average Pooling model?

Adding the temporal attention layer was found to improve model performance.

3. Qualitative Analysis

How can we infer what the model is learning through actual examples?

token-level (red), post-level(blue) attention
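Models: Contextual CNN (CCNN), SISM, SISMO
- user1: all three models get close to the risk level (phrases like 'want to choke' appear persistently)
- user2: only SISM and SISMO are correct (the posts show sadness building up over time)
- user3: SISMO is correct; the user belongs to the indicator class
- user4: SISMO, with a result similar to user3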

4. Error Analysis

When do modern deep learning models fail?

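For user5 and user6, no model gets the risk level right.
(Still, among the three models) SISMO's prediction is the closest → Why? Because of its ordinal regression component, which penalizes predictions based on the inherent increasing order of suicide risk. This resembles how humans judge risk and matters in real-world settings → it makes prioritization possible in practice.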

5. Parametric Analysis

How does varying user history and 𝛼 affect SISMO’s performance?

Results with Varying User History
- Varying the number of posts given to the model → performance rises up to about 20 posts (the average)
- (presumably because relatively few users (9%) have more than 50 posts)
Results with Varying 𝛼
- 𝛼: a key SISMO parameter that controls the probability distribution of the soft labels.
- Sweeping from 0 to 3.0 → poor at 0, poor at 2 and above, best at around 1.8.
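
One way to build intuition for the 𝛼 sweep: look at how much probability the soft label leaves on the true level as 𝛼 grows. The sketch assumes the soft label is a softmax over the negative 𝛼-scaled distance to the true level (this post does not give the formula), and `peak_mass` is a made-up helper name.

```python
import math


def peak_mass(alpha, true_level=2, num_levels=5):
    """Probability the soft label assigns to the true level.

    Assumed soft-label form: y_i proportional to exp(-alpha * |true_level - i|).
    """
    weights = [math.exp(-alpha * abs(true_level - i)) for i in range(num_levels)]
    return weights[true_level] / sum(weights)


for alpha in (0.0, 1.0, 1.8, 3.0):
    print(f"alpha={alpha}: mass on true level = {peak_mass(alpha):.3f}")
```

Under this assumption, 𝛼 = 0 gives a uniform label that carries no information about the true level, while a large 𝛼 approaches a one-hot label and loses the ordinal smoothing. That would be consistent with both extremes performing worse than a value in between.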

`Conclusion`

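1. Discussion

[Ethical Considerations]

Goal: develop a neural model that monitors users (it does not make any diagnostic claims related to suicide.)
No intervention in the user experience.

[Limitations]

Suicide risk research is subjective → recent studies show a link between suicidal expression online and psychologically assessed suicide risk.
The study data is susceptible to demographic, annotator, and medium-specific biases.
SISMO may fail to generalize to the contexts of other social networking sites. → The authors aim to overcome this limitation through the model.

[Practical Applicability]

(Research goal) Propose a neural network that pre-screens users' risk on social media, to help prioritize clinical resources and users.
SISMO should form part of a human-centered mental health ecosystem
Beyond clinical intervention, other stakeholders (family, the individuals themselves, etc.) can also step in
Interdisciplinary collaboration and dialogue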

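2. Conclusion

(Implications)

Suicide risk assessment is reformulated as an ordinal regression problem (to prioritize at-risk users) → this addresses the shortcoming of plain classification (not all wrong risk-levels are equally wrong.)
SISMO is proposed → intended to handle large volumes of data

(Research goal) Predict users' suicide risk on social media ahead of time

→ For more practical impact, the approach needs to be worked out together with diverse stakeholders

Summary:

💡 The goal of this study is to assess users' suicide risk by analyzing their posts.
Existing suicide risk assessment treats all risk levels equally and so ignores the inherent order between them.
This study therefore reformulates suicide risk assessment as an ordinal regression problem so that users at imminent risk can be prioritized for intervention.
In addition, users are classified into five classes, SU, IN, ID, BR, and AT, based on C-SSRS (Columbia-Suicide Severity Rating Scale).
On this basis the paper proposes SISMO (Suicide Ideation detection on Social Media) and demonstrates its superiority.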
