[Analysis] DataFrame Code

Jaewon Lim·2024년 11월 6일

Sorting

# Sort by multiple cols
df = df.sort_values(by=['Col_A', 'Col_B']).reset_index(drop=True)
# Sort descending
df = df.sort_values(by='Col_A', ascending = False)

nunique() vs unique() vs value_counts()

df['Col_A'].nunique()
  • It calculates the number of unique Col_A

df['Col_A'].unique()
  • It shows all unique Col_A rows

df['Col_A'].value_counts()
  • It returns a series of containing counts of unique values.

Drop column

df = df.drop(columns=['Col_A,', 'Col_B'])

Min & Max of the date

df['Col_date'] = pd.to_datetime(df['Col_date'])

min_date = df['Col_date'].min()
max_date = df['Col_date'].max()

duration = max_date - min_date
DetailDate
Min_Date2022-01-01
Start_Date2023-12-31
Duration729 days

.groupby()

df['time'] = pd.to_datetime(recent['time'])
df['Year'] = df['time'].dt.year
df['Quarter'] = df['time'].dt.to_period('Q') # this makes 2022Q1

quarterly_sub = df.groupby(['Year', 'Quarter'])['user_id'].nunique().reset_index()
quarterly_sub['Quarter'] = quarterly_sub['Quarter'].astype(str)
quarterly_sub.columns = ['Year', 'Quarter', '# of sub_user']

print(quarterly_sub)
YearQuarter# of sub_user
20222022Q117628
20222022Q215280
20222022Q310722
20222022Q414757
20232023Q124983
20232023Q217241
20232023Q321180
20232023Q423342
  • It groups the df by the Year and Quarter cols, and then it counts the # of unique user_id values in each group. This calculation determines how many distinct users perfromed the action logged in 'time' within each quarter.
  • It converts it into a string for easier readability and use in displays or reports.

.astype()

df.astype(dtype, copy=True, errors='raise')

drop_duplicates() vs .duplicated()

df.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
# remove duplicates on specific col, use subset
df.drop_duplicates(subset=['Col_A'])
# remove duplicates and keep last
  • It returns dataframe with duplicate rows removed.
df.duplicated(subset=None, keep='first')
# By default, for each set of duplicated values, the first occurrence is set on False and all others on True. 
df.dulicated()

0개의 댓글