[TIL#36 250411]

๊ฐ•๋ฏผ์ง€ยท2025๋…„ 4์›” 11์ผ

๋ฐ์ดํ„ฐ๋ถ„์„_TIL

๋ชฉ๋ก ๋ณด๊ธฐ
39/81

Daily plan

๐ŸŒž์˜ค์ „

- SQL ์ฝ”๋“œ์นดํƒ€ 98, 99 ,100๋ฒˆ
- 11์‹œ ์Šคํƒ ๋‹ค๋“œ ๊ฐ•์˜

๐Ÿ”ฅ ์˜คํ›„

- 14์‹œ QCC
- 15์‹œ 30๋ถ„ QCC ํ•ด์„ค

๐ŸŒ ์ €๋…

- TIL ์ œ์ถœํ•ด !!!!

To Do

  • โœ… SQL ์ฝ”๋“œ์นดํƒ€ 98~100
  • โœ… QCC + ํ•ด์„ค
  • โœ… ์Šคํƒ ๋‹ค๋“œ ๊ฐ•์˜

์ด๋ฒˆ์ฃผ ๋ชฉํ‘œ

  • ํ†ต๊ณ„ํ•™ ๊ฐ•์˜ ์™„๊ฐ• โœ…
  • ์‹œ๊ฐํ™”&์ „์ฒ˜๋ฆฌ ๊ฐ•์˜ ์™„๊ฐ•
  • ๋ฐ์ดํ„ฐ๋ถ„์„ ํŒŒ์ด์ฌ ์ข…ํ•ฉ๋ฐ˜ 5์ฃผ์ฐจ ์ •๋ฆฌ
  • ํ†ต๊ณ„ ๋ผ์ด๋ธŒ์„ธ์…˜ ์ •๋ฆฌ&๋ณต์Šต ๋ฐ€๋ฆฌ์ง€ ์•Š๊ธฐ! โœ…
  • ํŒŒ์ด์ฌ ์Šคํƒ ๋‹ค๋“œ๋ฐ˜ ์ •๋ฆฌ&๋ณต์Šต ๋ฐ€๋ฆฌ์ง€ ์•Š๊ธฐ! โœ…
  • ํŒŒ์ด์ฌ ๋ฒ ์ด์ง๋ฐ˜ ๋‚ด์šฉ ๊ณต๋ถ€ (ํ—ท๊ฐˆ๋ฆฌ๊ฑฐ๋‚˜ ์ฒ˜์Œ๋ณด๋Š” ๋‚ด์šฉ๋“ค๋งŒ ์ •๋ฆฌ)
  • ์ง€๋‚œ ํ”„๋กœ์ ํŠธ ๋นจ๋ฆฌ ์ •๋ฆฌํ•ด์„œ ์—…๋กœ๋“œํ•˜์ž...
  • SQL ์ฝ”๋“œ์นดํƒ€ 100๋ฒˆ ์ฐ๊ธฐ!โœ…

SQL ์ฝ”๋“œ์นดํƒ€

Q98 - Game Play Analysis IV

select *, min(event_date)
from activity
group by player_id

Q99 - User Activity for the Past 30 Days I

select activity_date as day,
    count(distinct user_id) active_users
from activity
where activity_date between '2019-06-27' and '2019-07-27'
group by activity_date
  • ๊ธฐ๊ฐ„์— ๋Œ€ํ•œ ์กฐ๊ฑด๋ฌธ ์ž‘์„ฑ์‹œ์—๋Š” ๋ถ€๋“ฑํ˜ธ๋ณด๋‹ค between and๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒŒ ๋” ์ •ํ™•
  • date('2019-06-27')๋กœ date format์ž„์„ ํ‘œ์‹œํ•˜๋ฉด ๋” ๋ช…ํ™•ํ•˜๊ฒŒ ์ž‘์„ฑ ๊ฐ€๋Šฅํ•จ

Q100 - Number of Unique Subjects Taught by Each Teacher

select teacher_id,
     count(distinct subject_id) cnt
from teacher
group by teacher_id

Python Standard

sklearn์œผ๋กœ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ๋‹ค๋ฃจ๊ธฐ

  • XGBoost
  • CatBoost

๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ์„ ํƒ ์‹œ ๊ณ ๋ ค ์‚ฌํ•ญ

  • ์˜ˆ์ธก ์ •ํ™•๋„
  • ํ•™์Šต ์†๋„/ ์˜ˆ์ธก ์†๋„
  • ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ ๋Šฅ๋ ฅ
  • ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ ์—ฌ๋ถ€
  • ์นดํ…Œ๊ณ ๋ฆฌํ˜• ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ๋Šฅ๋ ฅ
    • ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ๋‚˜ ์นดํ…Œ๊ณ ๋ฆฌํ˜• ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋Š” ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์—์„œ๋„ ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ๋ชจ๋ธ ๋‚ด์—์„œ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅํ•จ

์˜ˆ์ธก ์ •ํ™•๋„๊ฐ€ ๋†’์„์ˆ˜๋ก ์ข‹์ง€๋งŒ, ์ •ํ™•๋„๊ฐ€ ๋†’์•„์งˆ์ˆ˜๋ก ๊ณผ์ ํ•ฉ ์šฐ๋ ค๋„ ์ปค์ง€๋ฏ€๋กœ ์ ์ ˆํžˆ ๊ท ํ˜•์„ ๋งž์ถ”์–ด์•ผ ํ•จ

ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ

  • ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด (Decision Tree): ์งˆ๋ฌธ์„ ๋˜์ง€๋ฉฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„๋Š” ๊ตฌ์กฐ
    • ์•™์ƒ๋ธ” ๊ธฐ๋ฒ•์„ ๋งŽ์ด ์‚ฌ์šฉํ•จ: ์—ฌ๋Ÿฌ ๋‚˜๋ฌด๋ฅผ ํ•ฉ์ณ์„œ ๋” ๊ฐ•ํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•
    • XGBoost, CatBoost์™€ ๊ฐ™์€ ๋ชจ๋ธ๋“ค์ด ์ด์— ํ•ด๋‹น

XGBoost (eXtreme Gradient Boosting)

  • ๋ถ€์ŠคํŒ…: ์•ฝํ•œ ๋ชจ๋ธ๋“ค์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ•™์Šต์‹œ์ผœ ๊ฐ•ํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๋ฐฉ์‹
  • ๋น ๋ฅด๊ณ  ์ •ํ™•ํ•˜๋ฉฐ ๋Œ€ํšŒ์—์„œ ์ž์ฃผ ์‚ฌ์šฉ
  • ํŠน์ง•
    • ์ •ํ˜• ๋ฐ์ดํ„ฐ์— ๊ฐ•๋ ฅ
    • ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ ๊ธฐ๋Šฅ์„ ๊ฐ€์ง (regularization: ๊ทœ์ œ)
    • ๊ฒฐ์ธก์น˜ ์ž๋™ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ 
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ๋ชจ๋ธ ํ•™์Šต
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train, y_train)

# ์˜ˆ์ธก ๋ฐ ํ‰๊ฐ€
y_pred = xgb.predict(X_test)
accuracy_score(y_test, y_pred)

CatBoost (Category+Boosting)

  • Yandex์—์„œ ๊ฐœ๋ฐœํ•œ gradient boosting ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
  • ํŠน์ง•
    • ์นดํ…Œ๊ณ ๋ฆฌ ์ฒ˜๋ฆฌ์— ๋งค์šฐ ๊ฐ•๋ ฅํ•˜๋ฉฐ, Label encoding์„ ์•ˆํ•ด๋„ ๋จ
    • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ์ ๊ฒŒ ํ•˜๊ฑฐ๋‚˜ ์—†์ด๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ๊ฐ€์ง
    • ๊ฒฐ์ธก์น˜ ์ž๋™ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ๋ชจ๋ธ ํ•™์Šต
cat = CatBoostClassifier(verbose=0)
cat.fit(X_train, y_train)

# ์˜ˆ์ธก ๋ฐ ํ‰๊ฐ€
y_pred = cat.predict(X_test)
accuracy_score(y_test, y_pred)

XGBoost vs CatBoost

ํ•ญ๋ชฉXGBoostCatBoost
์นดํ…Œ๊ณ ๋ฆฌ ์ฒ˜๋ฆฌ์ˆ˜๋™ ์ธ์ฝ”๋”ฉ ํ•„์š”์ž๋™ ์ฒ˜๋ฆฌ
๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ์ž๋™ ์ฒ˜๋ฆฌ์ž๋™ ์ฒ˜๋ฆฌ
ํ•™์Šต ์†๋„๋น ๋ฆ„๋‹ค์†Œ ๋А๋ฆด ์ˆ˜ ์žˆ์Œ
์ดˆ๋ณด์ž ์นœํ™”์„ฑ์ค‘๊ฐ„๋†’์Œ(ํŠœ๋‹ ์˜ํ–ฅ์ด ์ ์Œ)
๋ชจ๋ธ ํ•ด์„๋ ฅ์ค‘๊ฐ„์ค‘๊ฐ„
  • ๋‘ ๋ชจ๋ธ ๋ชจ๋‘ feature importance๋ฅผ ์ œ๊ณตํ•˜๋ฏ€๋กœ ์ด๋ฅผ ํ†ตํ•ด ํ•ด์„ ๊ฐ€๋Šฅ
  • ๋‘ ๋ชจ๋ธ ๋ชจ๋‘ sklearn์˜ fit, predict์™€ ์™„๋ฒฝํ•˜๊ฒŒ ํ˜ธํ™˜๋จ
  • GridSearchCV ๋˜๋Š” RandomizedSearchCV๋กœ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹๋„ ๊ฐ€๋Šฅํ•จ

์ฃผ์š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

  • ๊ณตํ†ต ํŒŒ๋ผ๋ฏธํ„ฐ
    • n_estimators: ํŠธ๋ฆฌ ๊ฐœ์ˆ˜
    • learning_rate: ํ•™์Šต๋ฅ  (ํด์ˆ˜๋ก ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ด)
    • max_depth: ํŠธ๋ฆฌ์˜ ์ตœ๋Œ€ ๊นŠ์ด
  • XGBoost ์ „์šฉ ํŒŒ๋ผ๋ฏธํ„ฐ
    • subsample: ๊ฐ ํŠธ๋ฆฌ๋งˆ๋‹ค ์‚ฌ์šฉํ•  ์ƒ˜ํ”Œ ๋น„์œจ
    • colsample_bytree: ๊ฐ ํŠธ๋ฆฌ๋งˆ๋‹ค ์‚ฌ์šฉํ•  ํŠน์„ฑ ๋น„์œจ
    • gamma: ์ •๋ณด ํš๋“์ด ์–ผ๋งˆ ์ด์ƒ์ผ ๋•Œ split
    • reg_alpha: L1 ๊ทœ์ œ
    • reg_labmda: L2 ๊ทœ์ œ
  • CatBoost ์ „์šฉ ํŒŒ๋ผ๋ฏธํ„ฐ
    • depth: ํŠธ๋ฆฌ ๊นŠ์ด
    • l2_leaf_reg: L2 ๊ทœ์ œ
    • iterations: ํŠธ๋ฆฌ ์ˆ˜ (n_estimators์™€ ๋™์ผํ•จ)
    • bagging_temperature: ๋žœ๋ค์„ฑ ์กฐ์ ˆ
    • random_strength: ํŠน์„ฑ ์„ ํƒ ๋žœ๋ค์„ฑ

GridSearchCV๋กœ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

  • XGBoost Grid Search ์˜ˆ์‹œ
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
	'n_estimators': [100,300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
    'subsample':[0.8, 1.0]
}

xgb = XGBClassifier(use_label_encoder=Flase, eval_metric='logloss')

grid_xgb = GridSearchCV(estimator=xgb,
						param_grid=param_grid,
                        cv=3,
                        scoring='accuracy,
                        n_jobs=-1)
                        
grid_xgb.fit(X_train, y_train)

grid_xgb.best_params_ # ์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ
grid_xgb.best_score_  # ์ตœ๊ณ  ์ •ํ™•๋„
  • use_label_encoder

    • ๊ธฐ๋ณธ๊ฐ’: True (๊ตฌ๋ฒ„์ „ ํ˜ธํ™˜์„ฑ ์œ ์ง€)
    • ์ตœ๊ทผ์—๋Š” deprecation waring์ด ๋ฐœ์ƒํ•˜๋ฏ€๋กœ ํ•ญ์ƒ False๋กœ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์„ ์ถ”์ฒœ
    • XGBClassifier(use_label_encoder=False)
  • CatBoost Grid Search ์˜ˆ์‹œ

from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

cat = CatBoostClassifier(verbose=0)

param_grid = {
	'iterations':[100,300],
    'learning_rate':[0.05,0.1],
    'depth':[4,6],
    'l2_leaf_reg':[1,3]
}

grid_cat = GridSearchCV(estimator=cat,
					param_grid=param_grid,
                    cv=3,
                    scoring='accuracy',
                    n_jobs=-1)
 
grid_cat.fit(X_train,y_train)

grid_cat.best_params_ #์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ
grid_cat.best_score_  #์ตœ๊ณ  ์ •ํ™•๋„

### ์ฃผ์˜!!!
# CatBoost๋Š” fit() ํ˜ธ์ถœ ์‹œ cat_features๋ฅผ ์ง€์ •ํ•ด์•ผ ํ•  ์ˆ˜ ์žˆ์Œ (๋ฒ”์ฃผํ˜•์ด ์žˆ์„ ๋•Œ๋งŒ)    

CatBoost์˜ GPU ์‚ฌ์šฉ ๋ฐฉ๋ฒ•

  • GPU ์„ค์ • ์˜ˆ์‹œ
from catboost import CatBoostClassifier

model = CatBoostClassifier(task_type="GPU", devices="0", verbose=0)
  • task_type="GPU": GPU ์‚ฌ์šฉ ๋ช…์‹œ
  • devices="0": GPU ๋ฒˆํ˜ธ ์ง€์ • (๋ฉ€ํ‹ฐ GPU์ผ ๊ฒฝ์šฐ "0:1" ๋“ฑ๋„ ๊ฐ€๋Šฅํ•จ)
  • GPU๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ํ•™์Šต ์†๋„๊ฐ€ ๋ช‡ ๋ฐฐ ๋นจ๋ผ์ง€๋ฉฐ, ์ˆ˜์ฒœ ๊ฐœ ์ด์ƒ์˜ ์ƒ˜ํ”Œ์—์„œ ํŠนํžˆ ์„ฑ๋Šฅ ์ฐจ์ด๊ฐ€ ํฌ๊ฒŒ ๋‚˜ํƒ€๋‚จ

์ •๋ฆฌ

  • XGB, CatBoost ๋ชจ๋‘ ๊ธฐ์ดˆ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋น„์Šท
  • CatBoost๋Š” ํŠœ๋‹ ์—†์ด๋„ ์„ฑ๋Šฅ์ด ๊ดœ์ฐฎ์€ ํŽธ์ž„
  • GridSearchCV๋Š” ํ•™์Šต ์‹œ๊ฐ„์ด ๊ธธ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๋ฒ”์œ„๋ฅผ ์ขํ˜€์„œ ์‹œ์ž‘ํ•˜๋Š” ๊ฒŒ ์ข‹์Œ
  • ๋” ๋น ๋ฅธ ํŠœ๋‹์„ ์œ„ํ•ด์„œ๋Š” RandomizedSearchCV๋„ ์ข‹์Œ

QCC

1๋ฒˆ

select region_name, max(sales) highest_sales
from stores
group by region_name
having count(store_id)>1
order by region_name

2๋ฒˆ

select a.name as name_x, b.name as name_y, count(distinct a.cart_id) orders
from cart_products a
	cross join cart_products b
on a.cart_id = b.cart_id
and a.name!=b.name
group by 1,2
order by 1,2

cross join๊นŒ์ง€๋Š” ์ƒ๊ฐํ–ˆ๋Š”๋ฐ ๊ทธ ์ดํ›„์— ์–ด๋–ป๊ฒŒ ํ•ด์•ผํ• ์ง€ ๋„ˆ๋ฌด ๋ง‰๋ง‰ํ•ด์„œ ํ•œ์ฐธ์„ ํ—ค๋งธ๋‹คใ…œใ…œ
join์—์„œ a.name!=b.name ์ด ์กฐ๊ฑด ํ•˜๋‚˜ ๋” ๊ฑธ์–ด์ฃผ๋Š” ๊ฑธ ์™œ ์ƒ๊ฐ ๋ชปํ–ˆ์„๊นŒ... ๋ฐ”๋ณด๋‹ค ๋ฐ”๋ณด..

3๋ฒˆ

select count(distinct o.user_id)
from orders o 
  left join payments p
  on p.user_id = o.user_id
where p.user_id is null 
  or pay_date>order_date

์•„๋‹ˆ ๋ฌธ์ œ ์ œ๋Œ€๋กœ ์•ˆ ์ฝ๋‹ˆ????
'์ฒซ๋ฒˆ์งธ' ๊ฒฐ์ œ์ผ๋ณด๋‹ค ์ด์ „์— ์ƒํ’ˆ์„ ์ฃผ๋ฌธํ•œ ์‚ฌ์šฉ์ž ์ถœ๋ ฅํ•˜๋ผ๊ณ .....'์ฒซ๋ฒˆ์งธ'!!!!

with first_payment as(
	select user_id, min(pay_date) first_payment_date
    from payments
    group by 1
)
select count(distinct o.user_id)
from orders o 
  left join first_payments fp
  on fp.user_id = o.user_id
where fp.user_id is null 
  or first_payment_date>order_date

์ผ๊ธฐ

์•„ ์šฐ๋ฆฌ ์•„๊นŒ ์ง๊ฟ๋†€์ด ํ•œ ๊ฑฐ ์‚ฌ์ง„ ๋‹ค๋‚ ๋ผ๊ฐ....;;;
velog ์ž๋™์ €์žฅ ํ•ด์ฃผ๋Š” ์ค„ ์•Œ์•˜๋Š”๋ฐ ์ด๋ ‡๊ฒŒ ๋‚  ๋ฐฐ์‹ ํ•˜๋„ค...
์ฝ”๋“œ์นดํƒ€ ํ•œ ๊ฒƒ๋„ ๋‚ ๋ผ๊ฐ€๊ณ  ์Šคํƒ ๋‹ค๋“œ ๊ฐ•์˜ ์ •๋ฆฌํ•œ ๊ฒƒ๋„ ๋‹ค ๋‚ ๋ผ๊ฐ;;;;;


์‚ฌ์ง„ ์–ป์–ด์™”๋‹ค ใ…œ.ใ…œ

์šฐ๋ฆฌ์กฐ ์˜ค๋Š˜์˜ ์ด์Šˆ ๋ธŒ๋ฆฌํ•‘

  • ์ง๊ฟ ๋†€์ด
  • ์‚ฌ์นญ์„ ๋ฉˆ์ถฐ์ฃผ์„ธ์š”
  • ์˜ค๋Š˜์€ ์•ผ์™ธ์ˆ˜์—…....^^

ํœด,, ์˜ค๋Š˜์€ ๊ณต๋ถ€๋„ ์†์— ์•ˆ์žกํžˆ๊ณ  ์—ด ๋งŽ์ด ๋ฐ›๋Š” ํ•˜๋ฃจ์˜€๋‹ค^^
์ฃผ๋ง ๋™์•ˆ ๊ผญ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ฐ•์˜ ๋‹ค ๋“ค์–ด์•ผ์ง€!!

0๊ฐœ์˜ ๋Œ“๊ธ€