Kaggle Notebook Transcription - CareerVillage.org

Sooin Yoon · March 10, 2025

2025.03.10: starting my first Kaggle notebook transcription

Data Science for Good: CareerVillage.org

Description

Welcome
In this competition you'll notice there isn't a leaderboard, and you are not required to develop a predictive model. This isn't a traditional supervised Kaggle machine learning competition.
Note: there is no leaderboard (ranking or scoreboard) and no predictive model is required, so this is not a traditional supervised-learning Kaggle competition.

CareerVillage.org is a nonprofit that crowdsources career advice for underserved youth. Founded in 2011 in four classrooms in New York City, the platform has now served career advice from 25,000 volunteer professionals to over 3.5M online learners. The platform uses a Q&A style similar to StackOverflow or Quora to provide students with answers to any question about any career.
Note: CareerVillage.org is a nonprofit that crowdsources career advice for underserved youth (crowdsourcing: drawing on many people's knowledge and experience to solve a problem or gather information). It was founded in 2011 in four New York City classrooms, and 25,000 volunteer professionals have now given career advice to more than 3.5M online learners.

In this Data Science for Good challenge, CareerVillage.org, in partnership with Google.org, is inviting you to help recommend questions to appropriate volunteers. To support this challenge, CareerVillage.org has supplied five years of data.
Note: CareerVillage.org, in partnership with Google.org, invites participants to help recommend questions to the right volunteers, and provides five years of data to support the challenge.

Problem Statement

The U.S. has almost 500 students for every guidance counselor. Underserved youth lack the network to find their career role models, making CareerVillage.org the only option for millions of young people in America and around the globe with nowhere else to turn.
Note: in the U.S. there is roughly one guidance counselor per 500 students. Underserved youth lack the network to find career role models, so CareerVillage.org is the only option for millions of young people in America and around the globe who have nowhere else to turn.

To date, 25,000 volunteers have created profiles and opted in to receive emails when a career question is a good fit for them. This is where your skills come in. To help students get the advice they need, the team at CareerVillage.org needs to be able to send the right questions to the right volunteers. The notifications sent to volunteers seem to have the greatest impact on how many questions are answered.
Note: to date, 25,000 volunteers have created profiles and opted in to receive an email when a career question fits them well. This is where our skills are needed: for students to get the advice they need, the CareerVillage.org team has to send the right questions to the right volunteers. The notifications sent to volunteers appear to have the greatest impact on how many questions get answered.

Your objective: develop a method to recommend relevant questions to the professionals who are most likely to answer them.
Objective: develop a method that recommends relevant questions to the professionals who are most likely to answer them.

Dataset Description

CareerVillage.org has provided several years of anonymized data and each file comes from a table in their database.
Note: the dataset covers several years of anonymized data, and each file comes from a table in their database.

EDA

answers.csv (answers)
: Answers are what this is all about! Answers get posted in response to questions. Answers can only be posted by users who are registered as Professionals. However, if someone has changed their registration type after joining, they may show up as the author of an Answer even if they are no longer a Professional.
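Note: answers are the core of the platform. They are posted in response to questions, and only users registered as Professionals can post them. However, someone who changed their registration type after joining may still appear as the author of an answer even though they are no longer a Professional.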
comments.csv
: Comments can be made on Answers or Questions. We refer to whichever the comment is posted to as the "parent" of that comment. Comments can be posted by any type of user. Our favorite comments tend to have "Thank you" in them :)
Note: comments can be made on answers or questions, and whichever one a comment is attached to is called its "parent". Any type of user can post comments. The team's favorite comments tend to contain "Thank you".
emails.csv
: Each email corresponds to one specific email to one specific recipient. The frequency_level refers to the type of email template which includes immediate emails sent right after a question is asked, daily digests, and weekly digests.
Note: each row is one individual email sent to one specific recipient. frequency_level indicates the email template type: immediate emails sent right after a question is asked, daily digests, and weekly digests.
group_memberships.csv
: Any type of user can join any group. There are only a handful of groups so far.
Note: any type of user can join any group. So far there are only a few groups.
groups.csv
: Each group has a "type". For privacy reasons we have to leave the group names off.
Note: each group has a "type". For privacy reasons the group names are not disclosed.
matches.csv
: Each row tells you which questions were included in emails. If an email contains only one question, that email's ID will show up here only once. If an email contains 10 questions, that email's ID would show up here 10 times.
Note: each row says which question was included in which email. An email with a single question has its ID appear here only once; an email with 10 questions has its ID appear here 10 times.
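To check my reading of matches.csv, here is a minimal sketch that counts how many rows each email ID appears in, which should equal the number of questions in that email. The column names `matches_email_id` and `matches_question_id` and the toy rows are assumptions for illustration, not values taken from the real file.

```python
import csv
import io
from collections import Counter

# Toy stand-in for matches.csv; column names and rows are assumptions.
MATCHES_CSV = """matches_email_id,matches_question_id
e1,q1
e2,q1
e2,q2
e2,q3
"""

def questions_per_email(csv_text):
    """Count how many rows (questions) each email ID appears in."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["matches_email_id"] for row in reader)

counts = questions_per_email(MATCHES_CSV)
print(counts["e1"], counts["e2"])  # 1 3
```

Email e1 carried one question, so it appears once; e2 carried three questions, so it appears three times.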
professionals.csv
: We call our volunteers "Professionals", but we might as well call them Superheroes. They're the grown ups who volunteer their time to answer questions on the site.
Note: the volunteers are called "Professionals", though they could just as well be called superheroes. They are the grown-ups who volunteer their time to answer questions on the site.
questions.csv
: Questions get posted by students. Sometimes they're very advanced. Sometimes they're just getting started. It's all fair game, as long as it's relevant to the student's future professional success.
Note: questions are posted by students. Some are very advanced and some are beginner-level. Anything is fair game as long as it is relevant to the student's future professional success.
school_memberships.csv
: Just like group_memberships, but for schools instead.
Note: the same as group_memberships, but it records the relationship between users and schools.
students.csv
: Students are the most important people on CareerVillage.org. They tend to range in age from about 14 to 24. They're all over the world, and they're the reason we exist!
Note: students are the most important people on CareerVillage.org. They tend to be between about 14 and 24 years old, they are all over the world, and they are the reason the platform exists.
tag_questions.csv
: Every question can be hashtagged. We track the hashtag-to-question pairings, and put them into this file.
Note: every question can be hashtagged. The hashtag-to-question pairings are tracked and stored in this file.
tag_users.csv
: Users of any type can follow a hashtag. This shows you which hashtags each user follows.
Note: users of any type can follow a hashtag. This file shows which hashtags each user follows.
tags.csv
: Each tag gets a name.
Note: each tag is given a name.
question_scores.csv
: "Hearts" scores for each question.
Note: the "hearts" score for each question, similar to a like count.
answer_scores.csv
: "Hearts" scores for each answer.
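Note: the "hearts" score for each answer.
I checked for missing values; only a few looked problematic, so they did not seem to be an issue.
There are so many merge steps that I do not fully understand this part yet.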
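To make the merge steps less opaque, here is a minimal sketch of what a single left join between questions and answers does, written with plain dictionaries. The column names (`questions_id`, `answers_question_id`, and so on) and the rows are assumptions modeled on the file descriptions above; the notebook itself works on pandas DataFrames, where `DataFrame.merge` with `how="left"`, `left_on`, and `right_on` performs the same kind of join in one call.

```python
# Toy rows; column names are assumptions modeled on the file descriptions.
questions = [
    {"questions_id": "q1", "questions_title": "How do I become a nurse?"},
    {"questions_id": "q2", "questions_title": "Is a CS degree required?"},
]
answers = [
    {"answers_id": "a1", "answers_question_id": "q1", "answers_author_id": "p1"},
    {"answers_id": "a2", "answers_question_id": "q1", "answers_author_id": "p2"},
]

def left_join(left, right, left_key, right_key):
    """Left join: keep every left row and attach each matching right row."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        # A left row with no match is kept once, without right-hand columns.
        for match in index.get(row[left_key], [None]):
            joined.append({**row, **(match or {})})
    return joined

merged = left_join(questions, answers, "questions_id", "answers_question_id")
unanswered = [r["questions_id"] for r in merged if "answers_id" not in r]
print(len(merged), unanswered)  # 3 ['q2']
```

Question q1 has two answers, so it produces two merged rows; q2 has none, so it survives as a single row with no answer columns. Each additional merge in the notebook repeats this pattern with a different pair of key columns.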

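Understanding the Problem

Recommendation system project
The goal is to build a hybrid recommendation system using the LightFM library
Hybrid = Collaborative Filtering (CF) + Content-based Filtering (CB)
Collaborative Filtering (CF)
- Recommends based on user-item interaction data (purchases, clicks, likes, etc.)
- Suffers from the cold-start problem: recommending is hard when a new user or item is added
- Cold start: a data-scarcity problem in recommender systems where a newly added user or item (or an entirely new system) has no interaction data yet, so recommendations are hard to make
Content-based Filtering (CB)
- Recommends based on the metadata (e.g. genre, category, tags) of the items a user has consumed
- Does not use collaborative information from other users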
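The content-based side can be illustrated with a minimal sketch that ranks questions for one professional by tag overlap (Jaccard similarity). The tags and IDs are hypothetical, and this is only an illustration of the idea, not the method used in the notebook or inside LightFM.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical tags: what a professional follows vs. what each question carries.
professional_tags = {"nursing", "healthcare", "college"}
question_tags = {
    "q1": {"nursing", "college"},
    "q2": {"software", "college"},
    "q3": {"finance"},
}

# Rank questions by how much their tags overlap with the professional's tags.
ranked = sorted(
    question_tags,
    key=lambda q: jaccard(professional_tags, question_tags[q]),
    reverse=True,
)
print(ranked)  # ['q1', 'q2', 'q3']
```

Because the score uses only metadata, a brand-new question with tags can be ranked right away, which is exactly where pure collaborative filtering runs into the cold-start problem. The trade-off is that nothing is learned from what other professionals actually answered.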

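What is LightFM?

: A hybrid model that combines the two approaches above, collaborative filtering and content-based filtering
Supports both implicit and explicit feedback
Works on sparse matrices
Trained with SGD (Stochastic Gradient Descent), with adaptive learning-rate schedules such as Adagrad
Makes it possible to build a fast, efficient recommender system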
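As I understand the LightFM approach, each user and item is represented as the sum of the embedding vectors of its features, and the score is the dot product of the two sums (plus bias terms, which this sketch leaves out). The following toy example uses hand-set, hypothetical embeddings rather than trained ones, and every name in it is made up for illustration.

```python
def embed(features, table, dim=2):
    """Sum the embedding vectors of all active features."""
    vec = [0.0] * dim
    for feature in features:
        for k, value in enumerate(table[feature]):
            vec[k] += value
    return vec

def score(user_features, item_features, user_table, item_table):
    """Dot product of the summed user and item representations (no biases)."""
    u = embed(user_features, user_table)
    v = embed(item_features, item_table)
    return sum(a * b for a, b in zip(u, v))

# Hypothetical hand-set embeddings; a trained model would learn these values.
user_table = {"id:p1": [0.5, 0.0], "tag:nursing": [1.0, 0.0]}
item_table = {"tag:nursing": [1.0, 0.0], "tag:finance": [0.0, 1.0]}

# Professional p1 follows #nursing. Both questions are brand new, so they
# are described by a tag only and have no ID embedding of their own.
professional = ["id:p1", "tag:nursing"]
s_match = score(professional, ["tag:nursing"], user_table, item_table)
s_other = score(professional, ["tag:finance"], user_table, item_table)
print(s_match, s_other)  # 1.5 0.0
```

When the only feature of each user and item is its own ID, this reduces to plain matrix factorization (collaborative filtering). Adding tag features is what lets the model score new questions, which is the hybrid part.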

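Evaluation Metric

LightFM models are often evaluated with the ROC AUC (Receiver Operating Characteristic - Area Under the Curve) score
General recommender systems use metrics such as Precision@K, Recall@K, MAP@K, and NDCG@K
ROC AUC measures how well the model ranks: the probability that a randomly chosen positive item is scored higher than a randomly chosen negative one
- Close to 1: near-perfect ranking
- Close to 0.5: equivalent to random recommendations
- Below 0.5: worse than random (the ranking is systematically inverted)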
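The three cases above can be reproduced with a minimal from-scratch sketch of AUC for a single professional: the fraction of (positive, negative) question pairs that the scores order correctly. The scores and question IDs are made up, and this is a plain pairwise implementation for understanding, not LightFM's own code.

```python
def auc(scores, positives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for item, s in scores.items() if item in positives]
    neg = [s for item, s in scores.items() if item not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Hypothetical model scores for four questions shown to one professional.
scores = {"q1": 0.9, "q2": 0.2, "q3": 0.6, "q4": 0.1}

perfect = auc(scores, {"q1", "q3"})   # answered questions scored highest
mixed = auc(scores, {"q1", "q2"})     # one answered question scored low
inverted = auc(scores, {"q2", "q4"})  # answered questions scored lowest
print(perfect, mixed, inverted)  # 1.0 0.75 0.0
```

A score of 1.0 means every answered question outranks every unanswered one, and 0.0 means the ranking is completely reversed.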

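from lightfm.evaluation import auc_score
# calculate_auc_score is not part of LightFM; it appears to be a helper defined in the notebook that wraps auc_score
calculate_auc_score(model, interaction, questions_features, professional_features)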

Notebook Transcription

Google Colab:
EDA: Data Science for Good 👨‍🎓 👩‍🎓
Modeling: LightFM Hybrid Recommendation system

Lesson Learned

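I had seen standalone recommender systems before, but this was the first time I saw one that works as a hybrid
I understand the difference between collaborative filtering and content-based filtering, but the code is still hard for me
I also learned the concepts of sparse matrices and matrix factorization
Learned how to use the LightFM library
Learned about ROC AUC, one of the evaluation metrics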
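With LightFM the AUC score (a ranking metric, not accuracy) exceeded 0.9, which points to a strong recommender; if that score was computed on the same interactions used for training, it likely overstates performance on unseen data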
