Working with string data in Python pandas DataFrames

potato · August 12, 2021


Finding null values in the data

Check whether the DataFrame contains any null entries

print(data.isnull().values.any())
True
If this prints True, at least one sample in the data has a null value. Let's check which column it is in.

Find which column has the null

print(data.isnull().sum())
id          0
document    1
label       0
dtype: int64

The document column, which holds the review text, has exactly one sample with a null value.

Inspect that row

data.loc[data.document.isnull()]

This shows the row whose document is null. Let's remove the sample with the null value.

train_data = data.dropna(how='any')  # drop rows that contain a null value
print(train_data.isnull().values.any())  # check whether any null values remain
False
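
The null-handling steps above can be tried end to end on a toy frame. This is a minimal sketch, assuming a small made-up DataFrame with the same id, document, and label columns; the values are hypothetical and not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the review data: one document is missing
data = pd.DataFrame({
    "id": [1, 2, 3],
    "document": ["good movie", None, "boring"],
    "label": [1, 0, 0],
})

has_null = bool(data.isnull().values.any())       # is there a null anywhere in the frame?
null_per_column = data.isnull().sum()             # number of nulls in each column
null_rows = data.loc[data["document"].isnull()]   # rows whose document is null

cleaned = data.dropna(how="any")                  # drop rows that contain any null
```

dropna returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept.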

Converting string case

# convert to lowercase
data['artist'] = data['artist'].str.lower()

# convert to uppercase
data['artist'] = data['artist'].str.upper()

# uppercase -> lowercase, lowercase -> uppercase
data['artist'] = data['artist'].str.swapcase()

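• lower() converts every character in the string to lowercase. For example, 'Ups AND Downs'.lower() evaluates to 'ups and downs'.

• upper() converts every character in the string to uppercase. For example, 'Ups AND Downs'.upper() evaluates to 'UPS AND DOWNS'.

• swapcase() converts uppercase characters to lowercase and lowercase characters to uppercase. For example, 'Ups AND Downs'.swapcase() evaluates to 'uPS and dOWNS'.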
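
On a column, the same three methods are reached through the .str accessor, which applies them element by element and leaves missing values missing. A small sketch with a hypothetical Series:

```python
import pandas as pd

# Hypothetical artist column containing one missing value
artist = pd.Series(["Ups AND Downs", None, "Radiohead"])

lowered = artist.str.lower()      # every string lowercased
uppered = artist.str.upper()      # every string uppercased
swapped = artist.str.swapcase()   # the case of each character flipped

# The missing entry is passed through as missing instead of raising an error
missing_stays_missing = pd.isna(lowered[1])
```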

Inspecting and extracting data

print(data.loc[1000, 'user_id'])  # user_id of the row whose index label is 1000
print(data.loc[0, 'user_id'])  # user_id of the row whose index label is 0
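
Note that loc looks rows up by index label, while iloc looks them up by position. The two agree only when the index is the default 0, 1, 2, ... The sketch below uses a hypothetical frame with a non-default index to show the difference.

```python
import pandas as pd

# Hypothetical frame whose index labels are 10, 20, 30 rather than 0, 1, 2
data = pd.DataFrame({"user_id": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = data.loc[20, "user_id"]        # the row labelled 20
by_position = data["user_id"].iloc[0]     # the first row, whatever its label
```

With this index, data.loc[0, 'user_id'] would raise a KeyError because no row carries the label 0.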

Querying specific data

# 1. Select every row whose user_id equals the user_id of the first row
condition = (data['user_id']== data.loc[0, 'user_id'])
data.loc[condition]

# 2. Number of unique values
data['user_id'].nunique()


# 3. isin
isin is mainly used to select all rows where a column's value is one of the values in a list.

For example, given the code below,

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df.isin([1, 3, 12, 'a'])

it returns boolean values like this:

       A      B
0   True   True
1  False  False
2   True  False
isin is useful for keeping only the rows where a DataFrame column's value appears in a given list.

Given a DataFrame like this:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})

    A	B
0	1	a
1	2	b
2	3	f

Select only the rows whose value in column A is one of [1, 3, 12].
df[df['A'].isin([1, 3, 12])]
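
A short sketch of what that filter keeps, together with its negation. The names mask, selected, and excluded are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})

mask = df['A'].isin([1, 3, 12])   # boolean Series, one entry per row
selected = df[mask]               # rows where A is 1 or 3 (12 never occurs)
excluded = df[~mask]              # ~ inverts the mask: rows where A is not in the list
```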

group by

# 1. Top 30 most popular artists
artist_count = data.groupby('artist')['user_id'].count()
artist_count.sort_values(ascending=False).head(30)

artist
radiohead 77254
the beatles 76245
coldplay 66658
red hot chili peppers 48924
muse 46954
metallica 45233
pink floyd 44443
the killers 41229
linkin park 39773
nirvana 39479
system of a down 37267

The same query expressed in SQL:

select *
from
(
select artist, count(user_id)
from data group by artist
order by count(user_id) desc
) a
limit 30
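
To see the count, sort, and head steps on something small, here is a sketch with hypothetical listening records. It also shows value_counts, which gives the same ranking in one call as long as user_id has no nulls, since it counts rows instead of non-null user_id values.

```python
import pandas as pd

# Hypothetical listening records: one row per (user, artist) pair
data = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "artist": ["radiohead", "muse", "radiohead", "coldplay", "radiohead", "coldplay"],
})

# Count the non-null user_id values for each artist
artist_count = data.groupby("artist")["user_id"].count()

# Sort by count, largest first, and keep the top 2
top2 = artist_count.sort_values(ascending=False).head(2)

# Alternative: count rows per artist, already sorted in descending order
via_value_counts = data["artist"].value_counts().head(2)
```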

# 2. Statistics on how many artists each user listens to
user_count = data.groupby('user_id')['artist'].count()
user_count.describe()

count 358868.000000
mean 48.863234
std 8.524272
min 1.000000
25% 46.000000
50% 49.000000
75% 51.000000
max 166.000000
Name: artist, dtype: float64

# 3. Grouping by label (0, 1) can also be done like this
print(train_data.groupby('label').size().reset_index(name = 'count'))
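
One difference worth knowing: size() counts every row in a group, while count() on a column counts only the non-null values. A sketch on a hypothetical frame with one missing document:

```python
import pandas as pd

# Hypothetical labelled reviews; one document under label 0 is missing
train_data = pd.DataFrame({
    "label": [0, 0, 1],
    "document": ["bad", None, "good"],
})

# size(): number of rows per label, nulls included
by_size = train_data.groupby("label").size().reset_index(name="count")

# count(): number of non-null documents per label
by_count = train_data.groupby("label")["document"].count()
```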

Adding data

# data has the columns user_id, artist, play

# data to add
my_favorite = ['black eyed peas' , 'maroon5' ,'jason mraz' ,'coldplay' ,'beyoncé']

# create the DataFrame
my_playlist = pd.DataFrame({'user_id': ['zimin']*5, 'artist': my_favorite, 'play':[30]*5})

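# if user_id has no 'zimin' entry yet, add my_playlist
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
if not data.isin({'user_id': ['zimin']})['user_id'].any():
    data = pd.concat([data, my_playlist])

data.tail(10)  # check that the rows were added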
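
A runnable sketch of the same add-if-absent pattern on a toy frame, written with pd.concat because DataFrame.append was removed in pandas 2.0. The existing rows are hypothetical, and ignore_index=True is an added choice here so that the new rows get fresh index labels.

```python
import pandas as pd

# Hypothetical existing data with the columns user_id, artist, play
data = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "artist": ["muse", "nirvana"],
    "play": [10, 5],
})

my_favorite = ["coldplay", "jason mraz"]
my_playlist = pd.DataFrame({
    "user_id": ["zimin"] * len(my_favorite),
    "artist": my_favorite,
    "play": [30] * len(my_favorite),
})

# Add the playlist only if 'zimin' is not already present
if not (data["user_id"] == "zimin").any():
    data = pd.concat([data, my_playlist], ignore_index=True)

# A second attempt is a no-op because 'zimin' now exists
if not (data["user_id"] == "zimin").any():
    data = pd.concat([data, my_playlist], ignore_index=True)
```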
