2026.04.09(Thu)

오유찬 · April 10, 2026


Working with dates and times in Python

date() → year, month, day
date(2000, 10, 16), date(2000, 8, 24)

from datetime import date, timedelta
# timedelta: the time elapsed between two events
d1 = date(2000, 10, 16)
td = timedelta(days=29)
print(d1 + td)  # 2000-11-14

sort_values(): a method used on a pandas DataFrame or Series

To sort a list, use the built-in function sorted() or the list method sort().

sorted(): returns a new, sorted list
sort(): sorts the list itself in place and returns None
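The difference between the two is easiest to see side by side. The following is a minimal sketch; the list birthdays and its third date are made-up sample data, not something from the original exercise.

```python
from datetime import date

# Hypothetical sample data: the two dates from the notes above plus one more.
birthdays = [date(2000, 10, 16), date(2000, 8, 24), date(1999, 1, 5)]

# sorted() leaves the original list untouched and returns a new list.
ordered = sorted(birthdays)
print(birthdays[0])  # 2000-10-16 (original order is unchanged)

# list.sort() reorders the list in place; its return value is None.
result = birthdays.sort()
print(result)                # None
print(birthdays == ordered)  # True: both are now in ascending order
```

Assigning the result of sort() to a variable is a common slip, because the variable then holds None instead of the list.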

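Difference between dt.replace(tzinfo=timezone.utc) and dt.astimezone(timezone.utc)

dt.replace(tzinfo=timezone.utc): leaves the clock reading as it is and only swaps the label, as if to say "from now on, this time is in this time zone."

The date and time fields do not change; only the tzinfo attribute is overwritten (in the new object that replace() returns).
Use it on naive data with no time zone information to declare "this value is a UTC time."
If you use it on an object that already has time zone information, the actual instant changes. → Applying replace(tzinfo=UTC) to 3 PM in Seoul gives 3 PM UTC, which is a different instant. It is the right call only when the original time zone was set incorrectly.

dt.astimezone(timezone.utc): keeps the actual instant and computes the clock reading in another time zone.

It shifts the clock reading by the difference in UTC offsets, so the result points to the same instant as seen from another region.
Example: converting 3 PM in Seoul to UTC (6 AM).
If you use it on a naive object, Python assumes the object is in the system's local time zone and converts from there.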
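The Seoul example can be checked directly. This is a small sketch that assumes Seoul can be modeled as a fixed UTC+9 offset; the names kst, seoul_3pm, relabeled, and converted are only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Seoul is modeled as a fixed UTC+9 offset.
kst = timezone(timedelta(hours=9))
seoul_3pm = datetime(2026, 4, 9, 15, 0, tzinfo=kst)

# replace(): same clock reading, different label -> a different instant.
relabeled = seoul_3pm.replace(tzinfo=timezone.utc)
# astimezone(): same instant, different clock reading.
converted = seoul_3pm.astimezone(timezone.utc)

print(relabeled)               # 2026-04-09 15:00:00+00:00
print(converted)               # 2026-04-09 06:00:00+00:00
print(converted == seoul_3pm)  # True: still the same moment
print(relabeled - seoul_3pm)   # 9:00:00: the instant moved by nine hours
```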

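astimezone: a method on datetime objects

tz_convert: the pandas counterpart, used on a Series (through the .dt accessor) or a DatetimeIndex.

A timezone object is created with, for example, pst = timezone(timedelta(hours=-8)),
and then passed in when creating a datetime object: dt = datetime(2017, 10, 1, 15, 26, 26, tzinfo=pst)

tzinfo ← the datetime attribute that holds the time zone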
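Put together, the two steps look roughly like this. It is a sketch using only the standard library; note that pst here is a fixed UTC-8 offset, so it does not follow daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# A fixed-offset time zone (UTC-08:00); it does not track daylight saving time.
pst = timezone(timedelta(hours=-8))
dt = datetime(2017, 10, 1, 15, 26, 26, tzinfo=pst)

print(dt.isoformat())  # 2017-10-01T15:26:26-08:00
print(dt.tzinfo)       # UTC-08:00

# The same instant expressed in UTC.
in_utc = dt.astimezone(timezone.utc)
print(in_utc.isoformat())  # 2017-10-01T23:26:26+00:00
```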

client.csv

column	data type	description	cleaning requirements
`client_id`	`integer`	Client ID	N/A
`age`	`integer`	Client's age in years	N/A
`job`	`object`	Client's type of job	Change `"."` to `"_"`
`marital`	`object`	Client's marital status	N/A
`education`	`object`	Client's level of education	Change `"."` to `"_"` and `"unknown"` to `np.NaN`
`credit_default`	`bool`	Whether the client's credit is in default	Convert to `boolean` data type: `1` if `"yes"`, otherwise `0`
`mortgage`	`bool`	Whether the client has an existing mortgage (housing loan)	Convert to boolean data type: `1` if `"yes"`, otherwise `0`

campaign.csv

column	data type	description	cleaning requirements
`client_id`	`integer`	Client ID	N/A
`number_contacts`	`integer`	Number of contact attempts to the client in the current campaign	N/A
`contact_duration`	`integer`	Last contact duration in seconds	N/A
`previous_campaign_contacts`	`integer`	Number of contact attempts to the client in the previous campaign	N/A
`previous_outcome`	`bool`	Outcome of the previous campaign	Convert to boolean data type: `1` if `"success"`, otherwise `0`.
`campaign_outcome`	`bool`	Outcome of the current campaign	Convert to boolean data type: `1` if `"yes"`, otherwise `0`.
`last_contact_date`	`datetime`	Last date the client was contacted	Create from a combination of `day`, `month`, and a newly created `year` column (which should have a value of `2022`); Format = `"YYYY-MM-DD"`

Load the month and day columns, then set the year to 2022.

economics.csv

column	data type	description	cleaning requirements
`client_id`	`integer`	Client ID	N/A
`cons_price_idx`	`float`	Consumer price index (monthly indicator)	N/A
`euribor_three_months`	`float`	Euro Interbank Offered Rate (euribor) three-month rate (daily indicator)	N/A

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv('bank_marketing.csv')
client = df[['client_id', 'age', 'job', 'marital','education', 'credit_default', 'mortgage']]

client['job'] = client['job'].str.replace('.','_')
client['education'] = client['education'].replace({'.':'_', 'unknown':np.NaN})
client['credit_default'] = (client['credit_default'] == 'yes').astype(int)
client['mortgage'] = np.where(client['mortgage'] == 'yes', 1, 0)

client.to_csv('client.csv')

campaign = df[['client_id', 'number_contacts', 'contact_duration', 'previous_campaign_contacts','previous_outcome', 'campaign_outcome']]

campaign['previous_outcome'] = (campaign['previous_outcome'] == 'success').astype(int)
campaign['campaign_outcome'] = (campaign['campaign_outcome'] == 'yes').astype(int)
campaign['last_contact_date'] = pd.to_datetime(campaign['last_contact_date'], format='YYYY-MM-DD')

campaign.to_csv('campaign.csv')


economics = df[['client_id', 'cons_price_idx', 'euribor_three_months']]
economics.to_csv('economics.csv')

error1 : Expected the credit_default column in the client.csv file to be bool data type.

client['credit_default'] = (client['credit_default'] == 'yes').astype(int)

# Fix: use astype(bool) instead of astype(int)

wrong :

client['education'] = client['education'].str.lower().replace({'.':'_', 'unknown':np.NaN})

![[스크린샷 2026-04-09 오후 9.35.15.png]]
When I actually checked, the values had not changed.

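str.lower().replace → this is interpreted as Series.replace(), not Series.str.replace(). Series.replace() is meant for cases where the entire value matches. So it works for unknown → NaN, but not for changing a "." that is only part of a string into "_".

To replace part of a string, use the str.replace() method.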

client['education'] = client['education'].str.lower().replace('unknown', np.nan)
client['education'] = client['education'].str.replace('.', '_', regex=False)
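The difference between the two methods shows up clearly on a tiny Series. This is a sketch with made-up sample values in the style of the education column; regex=False is passed explicitly so that "." is treated as a literal dot regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values in the style of the education column.
education = pd.Series(["university.degree", "unknown", "high.school"])

# Series.replace() compares whole values: "unknown" matches, "." never does.
whole_value = education.replace({".": "_", "unknown": np.nan})
print(whole_value.tolist())  # ['university.degree', nan, 'high.school']

# Series.str.replace() works inside each string.
substring = education.replace("unknown", np.nan).str.replace(".", "_", regex=False)
print(substring.tolist())    # ['university_degree', nan, 'high_school']
```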

answer

import pandas as pd
import numpy as np

df = pd.read_csv('bank_marketing.csv')
# .copy() gives an independent DataFrame, so the assignments below do not trigger SettingWithCopyWarning
client = df[['client_id', 'age', 'job', 'marital','education', 'credit_default', 'mortgage']].copy()

# regex=False treats "." as a literal dot rather than a regex wildcard
client['job'] = client['job'].str.replace('.', '_', regex=False)
# np.nan: the np.NaN alias was removed in NumPy 2.0
client['education'] = client['education'].str.lower().replace('unknown', np.nan)
client['education'] = client['education'].str.replace('.', '_', regex=False)
client['credit_default'] = (client['credit_default'] == 'yes').astype(bool)
client['mortgage'] = np.where(client['mortgage'] == 'yes', 1, 0).astype(bool)

client.to_csv('client.csv', index = False)

campaign = df[['client_id', 'number_contacts', 'contact_duration', 'previous_campaign_contacts','previous_outcome', 'campaign_outcome']].copy()

campaign['previous_outcome'] = (campaign['previous_outcome'] == 'success').astype(bool)
campaign['campaign_outcome'] = (campaign['campaign_outcome'] == 'yes').astype(bool)
# Build the date from a fixed year (2022) plus the month and day columns of df
campaign['last_contact_date'] = pd.to_datetime(
    "2022-" + df['month'].str.lower() + "-" + df['day'].astype(str)
).dt.strftime('%Y-%m-%d')
campaign.to_csv('campaign.csv', index = False)


economics = df[['client_id', 'cons_price_idx', 'euribor_three_months']]
economics.to_csv('economics.csv', index = False)

Writing efficient Python code

Goal: reduce latency and overhead

Unpacking operator (*)

nums = range(1,11,2)
n_list = list(nums)


# unpacking operator
num_list = [*range(1,11,2)]

enumerate(iterable, start=0)

iterable (required): an iterable object (list, tuple, string, dictionary, …)
start (optional): the number the index starts from

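enumerate(names, 1) → calling the function returns a special enumerate object, not a list.
→ Printing it shows only something like a memory address.
→ It is an iterator: it is ready to produce its results once something iterates over it.
→ To see the results, you have to pull the contents out.
The unpacking operator pulls the values out and lays them side by side, as in [*enumerate(names, 1)].

This is a form of lazy evaluation: the computation is put off until the result is needed!
The same idea comes back later as a core principle in tools such as Spark and Snowflake!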
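A short sketch of what "lazy" means here. The list names is made-up sample data; the point is that the enumerate object holds no results until something consumes it, and that it can be consumed only once.

```python
names = ["Jerry", "Kramer", "Elaine"]  # hypothetical sample data

lazy = enumerate(names, 1)
print(lazy)  # an enumerate object; no pairs have been produced yet

# Unpacking walks the iterator and collects every (index, value) pair.
pairs = [*lazy]
print(pairs)  # [(1, 'Jerry'), (2, 'Kramer'), (3, 'Elaine')]

# The iterator is now exhausted, so unpacking it again yields nothing.
leftover = [*lazy]
print(leftover)  # []

# The same operator works on any iterable, such as a range.
odd_numbers = [*range(1, 11, 2)]
print(odd_numbers)  # [1, 3, 5, 7, 9]
```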

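Writing str.upper() with parentheses tries to call the str.upper method immediately. But str.upper needs a string instance to work on (for example, 'abc'.upper()), so the bare call fails. map() must be given the function itself, that is, in its uncalled state. Passing str.upper without parentheses hands over the function object, so it can be called later on each element.

Function object vs. function call
With parentheses, the function runs right away; without parentheses, the function itself is passed along.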
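Both cases can be tried in a few lines. This is a minimal sketch; names is made-up sample data and error_name exists only to show which exception is raised.

```python
names = ["jerry", "kramer", "elaine"]  # hypothetical sample data

# Passing the function object: map() calls str.upper on each element later.
upper_names = [*map(str.upper, names)]
print(upper_names)  # ['JERRY', 'KRAMER', 'ELAINE']

# Calling it with parentheses runs str.upper() right away, with no string to act on.
error_name = None
try:
    map(str.upper(), names)
except TypeError as exc:
    error_name = type(exc).__name__
print(error_name)  # TypeError
```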

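NumPy

NumPy arrays are homogeneous: every element must have the same type.
If the types do not match, NumPy converts them to a common type on its own.
Python's built-in list does not support broadcasting.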
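All three points can be seen in a short session. This is a sketch with made-up values, assuming NumPy is installed.

```python
import numpy as np

# Mixed int and float input: NumPy converts everything to one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

nums = [1, 2, 3]

# On a list, * means repetition, not element-wise multiplication.
repeated = nums * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# On an array, the scalar is broadcast across every element.
doubled = np.array(nums) * 2
print(doubled)  # [2 4 6]
```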

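Runtime

%timeit: put this magic command in front of the line you want to time.
It reports the timing statistics as a mean plus a standard deviation.

-r (runs): sets the number of runs
-n (loops): sets the number of loops per run
Multiple lines → %%timeit
-o: lets you save the output of %timeit to a variable.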
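%timeit only works inside IPython or Jupyter, so the sketch below swaps in the standard-library timeit module to show the same idea in plain Python. Here repeat plays the role of -r and number plays the role of -n; the statement being timed is just an example.

```python
import statistics
import timeit

# Rough stand-in for: %timeit -r 5 -n 1000 [*range(1, 11, 2)]
# repeat = number of runs (-r), number = loops per run (-n)
timings = timeit.repeat("[*range(1, 11, 2)]", repeat=5, number=1000)

# Each entry is the total time for 1000 loops, so divide to get time per loop.
per_loop = [t / 1000 for t in timings]
mean = statistics.mean(per_loop)
std_dev = statistics.stdev(per_loop)

print(f"{mean:.2e} s ± {std_dev:.2e} s per loop (5 runs, 1000 loops each)")
```

The actual numbers depend on the machine, so no expected output is shown.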

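Code profiling

Detailed statistics on how often functions are called and how long they take
Line-by-line analysis
Requires installing line_profiler

Profiling memory usage
memory_profiler

Detailed statistics on memory usage
The function to be profiled must be imported (defined in a file rather than in the interactive session).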

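Efficiently combining, counting, and iterating

zip: interlocks several objects into one. → It returns a zip object, so to see the contents you have to unpack it into a list. Each item is a tuple that collects the elements at the same position in the original lists.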

differing_lengths = [*zip(names[:5], primary_types[:3])]
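The line above can be run with some sample data. The contents of names and primary_types below are made up for illustration; the thing to notice is that zip stops as soon as the shorter input runs out.

```python
# Hypothetical sample data standing in for the lists used in the notes.
names = ["Bulbasaur", "Charmander", "Squirtle", "Pikachu", "Eevee"]
primary_types = ["Grass", "Fire", "Water", "Electric", "Normal"]

# Five names but only three types: zip stops at the shorter input.
differing_lengths = [*zip(names[:5], primary_types[:3])]
print(differing_lengths)
# [('Bulbasaur', 'Grass'), ('Charmander', 'Fire'), ('Squirtle', 'Water')]
```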

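collections module

namedtuple: a tuple subclass with named fields
deque: a list-like container with fast appends and pops at both ends
Counter: a dict for counting hashable objects
OrderedDict: a dict that preserves insertion order
defaultdict: a dict that calls a factory function for missing values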
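A quick tour of four of these containers. This is a sketch with made-up data; the names types, Pokemon, by_type, and queue are only for illustration.

```python
from collections import Counter, defaultdict, deque, namedtuple

types = ["Grass", "Fire", "Water", "Fire", "Grass", "Fire"]  # hypothetical data

# Counter: counts hashable items in one pass.
counts = Counter(types)
print(counts.most_common(1))  # [('Fire', 3)]

# namedtuple: a tuple whose fields can be read by name.
Pokemon = namedtuple("Pokemon", ["name", "primary_type"])
squirtle = Pokemon("Squirtle", "Water")
print(squirtle.name, squirtle.primary_type)  # Squirtle Water

# defaultdict: a missing key is filled in by calling the factory (here, list).
by_type = defaultdict(list)
by_type[squirtle.primary_type].append(squirtle.name)
print(dict(by_type))  # {'Water': ['Squirtle']}

# deque: cheap appends and pops at both ends.
queue = deque([1, 2, 3])
queue.appendleft(0)
queue.pop()
print(queue)  # deque([0, 1, 2])
```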

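Set theory

.symmetric_difference(): the symmetric difference (elements that are in exactly one of the two sets)
.union(): combines the elements of two sets without duplicates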
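Both methods on two small sets. The set contents are made up; sorted() is used only so that the printed order is predictable.

```python
# Hypothetical sample sets.
a = {"Abomasnow", "Pikachu", "Squirtle"}
b = {"Pikachu", "Squirtle", "Zubat"}

# Elements that appear in exactly one of the two sets.
only_one_side = a.symmetric_difference(b)
print(sorted(only_one_side))  # ['Abomasnow', 'Zubat']

# Every element from both sets, each kept once.
everything = a.union(b)
print(sorted(everything))  # ['Abomasnow', 'Pikachu', 'Squirtle', 'Zubat']
```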

iterrows(), itertuples()

.values: gives you the data as a NumPy array.
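The three approaches can be compared on a tiny DataFrame. This is a sketch with made-up columns wins and games; it shows the row-by-row loops next to the array-based version that needs no loop at all.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"wins": [3, 5], "games": [10, 10]}, index=["A", "B"])

# iterrows(): yields (index, row) pairs, where each row is a Series.
from_iterrows = [row["wins"] / row["games"] for _, row in df.iterrows()]

# itertuples(): yields namedtuples, so columns are read as attributes.
from_itertuples = [row.wins / row.games for row in df.itertuples()]

# .values: plain NumPy arrays, so the division is done for all rows at once.
win_pct = df["wins"].values / df["games"].values

print(from_iterrows)    # [0.3, 0.5]
print(from_itertuples)  # [0.3, 0.5]
print(win_pct)          # [0.3 0.5]
```

All three give the same numbers; the array version simply avoids a Python-level loop.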
