Series vs. DataFrame, iloc vs. loc

been_29 · July 29, 2024

💡 The difference between Series and DataFrame


Series

  • Definition : A one-dimensional array-like object that holds a sequence of data and associated labels, called indices.
  • Features
    • 1D Data : Can hold data of any type (integer, float, string, etc.).
    • Index : Each element in a ‘Series’ has an index label, which can be a default integer index or a custom index.
    • Homogeneous Data : All elements in a ‘Series’ are typically of the same data type.
  • Example code
    ```python
    import pandas as pd
    
    data = [1, 3, 5, 7, 9]
    series = pd.Series(data)
    print(series)
    ```
    
    ```python
    #Output
    0    1
    1    3
    2    5
    3    7
    4    9
    dtype: int64
    ```
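The features above (custom index, one shared data type) can be seen in a small sketch. The variable names and sample values here are made up for illustration only.

```python
import pandas as pd

# Custom string labels instead of the default 0..n-1 integer index
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])

# Access by label, or by position with iloc
banana_price = prices["banana"]      # 2.0
first_price = prices.iloc[0]         # 1.5

# Mixing ints and floats: pandas upcasts to one common dtype
mixed_numbers = pd.Series([1, 2.5, 3])
print(mixed_numbers.dtype)           # float64

# Mixing numbers and strings falls back to the generic object dtype
mixed_values = pd.Series([1, "two", 3.0])
print(mixed_values.dtype)            # object
```

This is why a ‘Series’ is described as "typically" homogeneous: it always has a single dtype, but that dtype can be the catch-all `object` when the values are mixed.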

DataFrame

  • Definition: A two-dimensional, tabular data structure with labeled axes (rows and columns). It can be thought of as a collection of ‘Series’ objects that share the same row index.
  • Features
    • 2D Data : It can hold data of different types (numeric, string, boolean, etc.) in columns.
    • Index and Columns : It has both row indices and column labels, making it very flexible for data manipulation.
    • Heterogeneous Data : Each column in a ‘DataFrame’ can contain data of a different type.
  • Example code
    import pandas as pd
    
    data = {
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]
    }
    df = pd.DataFrame(data)
    print(df)
    #Output
       A  B  C
    0  1  4  7
    1  2  5  8
    2  3  6  9
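
To make the "heterogeneous columns" and "collection of Series" points concrete, here is a minimal sketch. The `people` frame and its column names are hypothetical sample data, not part of the example above.

```python
import pandas as pd

# Hypothetical sample data: each column has its own type
people = pd.DataFrame({
    "name": ["Kim", "Lee", "Park"],
    "age": [29, 35, 41],
    "height_m": [1.70, 1.82, 1.65],
    "member": [True, False, True],
})

# One dtype per column
print(people.dtypes)

# Selecting a single column returns a Series that shares the row index
ages = people["age"]
print(type(ages).__name__)               # Series
print(ages.index.equals(people.index))   # True

# Selecting with a list of column names returns a DataFrame
subset = people[["name", "age"]]
print(type(subset).__name__)             # DataFrame
```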
    

Differences between Series and DataFrame

| | Dimensionality | Structure | Data Type | Usage |
|---|---|---|---|---|
| Series | One-dimensional | A single sequence of values, similar to a list or an array | Typically homogeneous (all elements are of the same type) | Useful for storing and manipulating a single column or variable |
| DataFrame | Two-dimensional | A table with multiple columns, each of which can be considered a ‘Series’ | Heterogeneous (different columns can have different types) | Ideal for datasets with multiple columns and for operations involving multiple variables |






💡 The difference between iloc and loc


iloc

  • Definition: ‘iloc’ stands for ‘integer location’ and is used for indexing by position. It allows you to select data by its integer position.

  • Characteristics

    • Integer-Based Indexing : Uses integer positions to select rows and columns. Useful when you want to access data by its position in the DataFrame.
    • Python-Like Slicing : Follows Python’s slicing rules, where the start index is inclusive and the end index is exclusive.
    • Positional Access : Ideal for accessing data when you know the exact position (row/column number) of the data.
    • Out-of-Bounds Handling : Raises an ‘IndexError’ if you request a single position (or a list of positions) that doesn’t exist in the DataFrame. Out-of-range slices are clipped instead of raising.
    • Supports Integer Arrays : Can use lists or arrays of integers to select specific rows and columns.
  • Usage : Access rows and columns using integer positions.

  • Example code

    import pandas as pd
    
    data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)
    
    # Select the first row
    print(df.iloc[0])
    
    # Select the first row and first column
    print(df.iloc[0, 0])
    
    # Select the first two rows
    print(df.iloc[:2])
    
    # Select the first two rows and first two columns
    print(df.iloc[:2, :2])
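
The slicing and out-of-bounds characteristics listed above can be checked with a short sketch like the following. It reuses the same small frame; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Slice end is exclusive: positions 0 and 1 only
first_two = df.iloc[0:2]
print(len(first_two))            # 2

# A list of integer positions picks specific rows and columns
corners = df.iloc[[0, 2], [0, 2]]
print(corners.values.tolist())   # [[1, 7], [3, 9]]

# Negative positions count from the end, as in Python lists
last_row = df.iloc[-1]
print(last_row.tolist())         # [3, 6, 9]

# A single out-of-range position raises IndexError
try:
    df.iloc[5]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__
print(error_name)                # IndexError

# An out-of-range slice is clipped instead of raising
print(len(df.iloc[1:10]))        # 2
```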

loc

  • Definition: ‘loc’ stands for “label location” and is used for indexing by labels or boolean arrays. It allows you to select data by the labels of rows and columns.
  • Characteristics
    • Label-Based Indexing : ‘loc’ uses labels (indices and column names) to select rows and columns. Useful when your DataFrame has meaningful index labels.
    • Inclusive Slicing : Includes both the start and end labels in the slice.
    • Label Access : Ideal for accessing data when you know the labels of the data.
    • Flexible Indexing : Can handle more complex data retrieval scenarios, such as using boolean arrays, lists of labels, or slices with labels.
    • Error Handling : Raises a ‘KeyError’ if the specified label does not exist in the DataFrame.
    • Supports Boolean Indexing : Can use boolean arrays to filter rows or columns based on conditions.
  • Usage: Access rows and columns using labels.
  • Example code
    import pandas as pd
    
    data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data, index=['one', 'two', 'three'])
    
    # Select the row with label 'one'
    print(df.loc['one'])
    
    # Select the row with label 'one' and column 'A'
    print(df.loc['one', 'A'])
    
    # Select rows with labels 'one' and 'two'
    print(df.loc[['one', 'two']])
    
    # Select rows 'one' and 'two' and columns 'A' and 'B'
    print(df.loc[['one', 'two'], ['A', 'B']])
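
Inclusive slicing, boolean filtering and the ‘KeyError’ behavior described above can be tried out with a sketch along these lines. It uses the same labeled frame; the variable names are only illustrative.

```python
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=['one', 'two', 'three'])

# Label slices include both endpoints: rows 'one' AND 'two'
rows = df.loc['one':'two']
print(rows.index.tolist())            # ['one', 'two']

# Column slices are inclusive too
cols = df.loc[:, 'A':'B']
print(cols.columns.tolist())          # ['A', 'B']

# Boolean mask: keep rows where column 'A' is greater than 1
filtered = df.loc[df['A'] > 1, ['A', 'C']]
print(filtered.values.tolist())       # [[2, 8], [3, 9]]

# A missing label raises KeyError
try:
    df.loc['four']
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__
print(error_name)                     # KeyError
```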

Differences between ‘iloc’ and ‘loc’

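| Feature | ‘iloc’ | ‘loc’ |
|---|---|---|
| Indexing Method | Integer-based | Label-based |
| Slicing Behavior | Start inclusive, end exclusive | Both start and end inclusive |
| Error Handling | ‘IndexError’ for out-of-bounds positions | ‘KeyError’ for missing labels |
| Access Method | Positional | Label |
| Usage | When position is known | When label is known |
| Supports Boolean Arrays | Yes, but only plain boolean arrays or lists (a boolean Series raises an error) | Yes, including boolean Series |
| Supports Integer Arrays | Yes, as positions | Only as labels, when the index labels are integers |
| Flexibility | Less flexible (positions only) | More flexible (labels, label slices, boolean masks) |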
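
The difference is easiest to see when the index labels are integers that do not match the row positions. The frame below is a hypothetical example built for that purpose. Note that ‘iloc’ also accepts a plain boolean array, as long as the Series mask is converted first.

```python
import pandas as pd

# Hypothetical frame whose integer labels do not match the row positions
df = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

by_position = df.iloc[0, 0]      # first row by position -> 1
by_label = df.loc[10, 'A']       # row labelled 10 -> 1

# loc[10:20] slices by label (both ends included): labels 10 and 20
label_slice = df.loc[10:20]
print(label_slice.index.tolist())       # [10, 20]

# iloc[0:1] slices by position (end excluded): first row only
position_slice = df.iloc[0:1]
print(position_slice.index.tolist())    # [10]

# loc takes a list of labels, even when the labels are integers
print(df.loc[[10, 30], 'A'].tolist())   # [1, 3]

# iloc takes a plain boolean array (convert the Series mask first)
mask = (df['A'] > 1).to_numpy()
print(df.iloc[mask, 0].tolist())        # [2, 3]
```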