Payment / Logistics / Review EDA (Python)

brusel Luam·2026년 3월 9일

Further EDA was conducted based on the previous analysis of the Olist dataset.

1. Payment Analysis

Payment types distribution

Credit cards are the dominant payment method, accounting for about 74% of all transactions.

Payment Intallments / Count

This chart shows the distribution of payment installments.
We can see that most transactions are made with a single payment, indicating that many customers prefer paying in full rather than using installments. A Smaller portion of customers use 2 to 4 installments, while long installment plans are relatively rare.

Most orders (about 97–98%) are paid with a single payment.
Orders with multiple payments are very rare, indicating customers usually pay the full amount at once.

2. Customer Review Analysis

Wordcloud

I created an average review score data mart by categories and its cnt on MySQl and then I've loaded it on Python.

[MySQL]
use olistdata;

# Categories Review Datamart
SELECT
  COALESCE(t.product_category_name_english, p.product_category_name) AS category,
  COUNT(*) AS review_cnt,
  ROUND(AVG(r.review_score), 3) AS avg_review_score
FROM order_reviews r
JOIN orders_items oi
  ON r.order_id = oi.order_id
JOIN products p
  ON oi.product_id = p.product_id
LEFT JOIN product_category_name_translation t
  ON p.product_category_name = t.product_category_name
WHERE r.review_score IS NOT NULL
GROUP BY category
ORDER BY avg_review_score DESC;

Categories like CDs/DVDs, children’s clothes, and books have the highest review scores (around 4.4–4.6), indicating strong customer satisfaction. In contrast, categories such as office furniture and electronics-related items have lower score (around 3.2~3.5), indicating more customer complaints or delivery/product issues.

In-depth Analysis of Three Problematic Sellers

Among the bottom five categories by average review score, office_furniture has the largest sample size (1,687). Therefore, I drilled down into this category(office_furniture) to identify specific sellers with consistently low review scores.

As a result, I identified three problematic sellers in the office_furniture category with the highest number of review scores of 2 or below.

These sellers show consistently high proportions of low ratings, suggesting potential issues with product quality, shipping, or customer service.

Next, to analyze where the orders from these sellers were distributed, I filtered the dataset using the problematic seller_ids. The results show that about 60% of the orders from these sellers occurred in SP and RJ, where the sellers themselves are located.

Is delivery delay actually associated with review scores of 2 or below?

[Python Code]

problem_orders['is_delayed'] = (
    problem_orders['order_delivered_customer_date'] >
    problem_orders['order_estimated_delivery_date']
)

# 지연 vs 비지연 2점 이하 비율 비교 
problem_orders['is_low'] = problem_orders['review_score'] <= 2

delay_compare = (
    problem_orders
    .groupby('is_delayed')['is_low']
    .mean()
    .reset_index()
)

The direct cause of the low average review scores of these sellers is delivery delays.
However, the earlier hypothesis—that delivery delays were mainly caused by logistics infrastructure limitations—is not fully supported. Low ratings are not concentrated only in remote regions but also appear in SP and RJ, where the sellers themselves are located.
This suggests that the issue may stem not only from logistics infrastructure but also from seller-related problems, indicating that both seller factors and infrastructure limitations contribute to the delay.

[Next Step]
To more clearly determine whether the issue stems from seller-related problems or logistics infrastructure limitations, we can examine the difference between shipping_limit_date and order_delivered_carrier_date, which allows us to check whether the delay occurred at the seller stage.
If shipping_limit_date is later, the delay is more likely related to delivery or logistics issues. If order_delivered_carrier_date is later, the delay likely occurred at the seller’s fulfillment stage.

[Python Code]

problem_orders['order_delivered_carrier_date'] = pd.to_datetime(problem_orders['order_delivered_carrier_date'])
problem_orders['shipping_limit_date'] = pd.to_datetime(problem_orders['shipping_limit_date'])

problem_orders['seller_late'] = (
    problem_orders['order_delivered_carrier_date'] >
    problem_orders['shipping_limit_date']
)

problem_orders.groupby('seller_id')['seller_late'].mean().sort_values(ascending=False)

The seller shipment delay rate—defined as cases where carrier handover occurs after the shipping_limit_date—ranges from 16.7% to 30.3%, which is 1.8–3.3 times higher than the Olist overall average (9.2%). This suggests that the primary driver of low review scores is more likely seller-side fulfillment and processing delays rather than external logistics infrastructure issues.

However, the characteristics of the office_furniture category, where products are typically large and bulky, as well as the remaining limitations in logistics infrastructure, should not be overlooked.

Therefore, it would be useful to examine the correlation between delivery time or delay status and product characteristics, specifically product volume and weight, where volume is calculated using product length, height, and width.

Correlation between delivery time and product characteristics

The base dataset was created using the full dataset instead of focusing on a specific category(office_furniture).


Orders with delivery delays took about 30 days on average, compared to around 10 days for non-delayed orders.


Delayed orders are approximately 300g heavier on average than non-delayed orders.


Delayed orders have about 10% larger product volume on average than non-delayed orders.

However, the correlation analysis shows a coefficient of 0.016, indicating virtually no relationship between product volume and delivery delays.

The correlation between product weight and freight value is 0.61, indicating a relatively strong positive relationship. Similarly, product volume and freight value show a correlation of 0.59, suggesting a comparable positive association.

Delayed orders took an average of about 30 days for delivery, while non-delayed orders took around 10 days on average.

[Conclusion]
Low review scores in the office_furniture category are primarily driven by seller-side shipment delays rather than logistics infrastructure constraints.

3. Logistics Infrastructure Analysis

Same State vs Differnet State

When the seller and customer are in the same state, the average delivery time is about 7 days, while it increases to about 14 days when they are in different states, indicating that cross-state deliveries take roughly twice as long.

States with the highest average delivery days were identified for orders where the seller and customer are located in the same state.

Delivery Delay Ratio by State

To gain insights into the causes of delivery delays, cases where shipping_limit_date is later than order_delivered_carrier_date were defined as seller_delay, while the opposite cases were defined as logistics_delay.

[Python Code]

# 배송소요일 계산 
base2['delivery_days'] = (
    base2['order_delivered_customer_date'] - 
    base2['order_purchase_timestamp']
).dt.days

# 최종 배송 지연 여부 8ㅑ![](https://velog.velcdn.com/images/hyuunk_/post/d3eff9d6-709c-4ad0-866c-bd8328aa0563/image.png)

base2['is_delayed'] = (
    base2['order_delivered_customer_date'] > 
    base2['order_estimated_delivery_date']
)

# seller 출고 지연 여부 
base2['seller_late'] = (
    base2['order_delivered_carrier_date'] >
    base2['shipping_limit_date']
)

Load only states with more than 200 total orders (order_id).

The analysis suggests key point that logistics infrastructure may play a larger role in delivery delays in RJ.
In this region, 77% of delayed orders occur during the logistics stage, which is nearly 20 percentage points higher than the approximately 60% observed in other regions.

Average lead time by state analysis

Seller processing lead time and logistics delivery lead time(carrier_delivery_days) were defined as follows.

[Python Code]

# Seller Processing Lead Time 
base2['seller_processing_days'] = (
    base2['order_delivered_carrier_date'] - 
    base2['order_purchase_timestamp']
).dt.days

# Carrier Delivery Days  
base2['carrier_delivery_days'] = (
    base2['order_delivered_customer_date'] - 
    base2['order_delivered_carrier_date']
).dt.days

Based on the average across all customer states, the seller processing lead time for delayed orders is 5.36 days, compared to 2.52 days for non-delayed orders, indicating it roughly doubles when delays occur.
However, the carrier delivery lead time increases much more significantly—from 7.44 days to 25.19 days, more than tripling.
This suggests that delivery delays are driven more by logistics delivery time than by seller processing time.

This boxplot analysis shows that carrier delivery time is significantly longer for delayed orders, with the median increasing from about 7 days to around 20 days. The distribution is also much wider, with several extreme delays observed, suggesting that logistics delivery time is a major driver of delivery delays.

The median logistics delivery time for delayed orders is about three times higher than that of non-delayed orders, and in some cases extreme delivery delays exceeding 100 days were observed.

In-depth Analysis of RJ Delivery Time

The distribution shows that non-delayed orders are concentrated within 3–10 days, while delayed orders are mainly distributed between 15–30 days with a long tail extending beyond 40 days. Considering the result above, this indicates that delivery delays are primarily driven by substantially longer carrier delivery times.


A deeper analysis of logistics lead time in the RJ region reveals several key patterns.
Most deliveries are concentrated within 5–15 days, indicating relatively fast delivery lead times on average.
In fact, the majority of orders arrive well before the SLA threshold.
However, deliveries begin to exceed the SLA around the 95th percentile (P95).
The distribution also shows a long right tail, suggesting that a small number of extreme delays significantly contribute to the overall delay rate.
The long right tail in the distribution contributes significantly to the high delivery delay rate.

Comparing carrier delivery times between Brazil overall and the RJ region shows little difference in the median, but RJ exhibits longer delivery times in the upper percentiles, with delays of +6 days at P90, +9 days at P95, and +12 days at P99.

RJ experiences 2–3 times more extremely long delivery delays (30+ days) compared to the overall average across Brazilian states, indicating a structural tendency toward long-tail delivery delays.

RJ shows similar average delivery performance to the national level, but the severity of SLA exceedance and the frequency of extremely long deliveries are 2–3 times higher than the national average.
This suggests the presence of structural bottlenecks that occur under certain conditions.

Therefore, it is necessary to drill down into the top 5% of RJ orders (those exceeding P95) to examine whether the delays are concentrated in specific product categories, time periods, or carriers.

Top 5% Tail Analysis of Carrier Delivery Days in RJ

Analysis of the top 5% delayed orders in the RJ region shows that the seller processing time remained within a normal range, averaging around 3 days, while the carrier delivery time increased abnormally to an average of 45 days.
This indicates that the primary cause of the delay lies in the logistics transportation stage rather than seller processing.

76.7% of the top 5% delayed deliveries in RJ originate from sellers located in SP, which is about 10 percentage points higher than the overall share of SP-origin orders in RJ (66%).
This suggests the possibility of relatively more severe logistics bottlenecks along the SP→RJ delivery route.

Average Review Score by Freight Cost Across Regions

Across all regions, a clear negative relationship is observed between delivery days and review scores, indicating that longer delivery times consistently lead to lower customer ratings.

Average Late Rate by Category (Top 15)

Several categories show late delivery rates above the overall average (~10%), with audio, fashion underwear, and Christmas items recording the highest delay rates.

Monthly Late Rate & Order Volume

Late delivery rates fluctuate over time, with noticeable spikes during periods of high order volume, suggesting demand surges may contribute to delivery delays.

0개의 댓글