TIE-openxl,pandas Issue

kyoungyeon·2026년 1월 22일

TIE

목록 보기

52/52

format_fix_ver.1

{ Status: What went wrong
Symptoms: Logs/Errors/Manifestations
Cause: Why it happened (including estimation → verification process)
Resolution: Actions I took
Preventing recurrence: What to do next time }

Status

xlsx 파일을 filter 을 위해 openxl lib 다가 pandas로 NA를 잡는 도중 에러 발생

Symptoms

err1 : ModuleNotFoundError

ModuleNotFoundError: No module named 'config' > python scripts/rule_filter.py Traceback (most recent call last): File "/Users/ky/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pandas/compat/_optional.py", line 135, in import_optional_dependency module = importlib.import_module(name) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ky/.pyenv/versions/3.11.6/lib/python3.11/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "<frozen importlib._bootstrap>", line 1204, in _gcd_import File "<frozen importlib._bootstrap>", line 1176, in _find_and_load File "<frozen importlib._bootstrap>", line 1140, in _find_and_load_unlocked ModuleNotFoundError: No module named 'openpyxl' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/ky/Applications/AssiPER/scripts/rule_filter.py", line 73, in <module> main("inputs/nationwide.xlsx", "outputs/flagged_rows.xlsx") File "/Users/ky/Applications/AssiPER/scripts/rule_filter.py", line 46, in main df = pd.read_excel(input_xlsx) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ky/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pandas/io/excel/_base.py", line 495, in read_excel io = ExcelFile( ^^^^^^^^^^ File "/Users/ky/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pandas/io/excel/_base.py", line 1567, in __init__ self._reader = self._engines[engine]( ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ky/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pandas/io/excel/_openpyxl.py", line 552, in __init__ import_optional_dependency("openpyxl") File "/Users/ky/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pandas/compat/_optional.py", line 138, in import_optional_dependency raise ImportError(msg)

err2

> python -m scripts.rule_fliter /Users/ky/Applications/AssiPER/.venv/bin/python: No module named scripts.rule_fliter > python -m scripts.rule_fliter /Users/ky/Applications/AssiPER/.venv/bin/python: No module named scripts.rule_fliter

err3

flags:12116

err4 : how to zip in MAC?
(+ exclude sensatvie file?)
err5
불완정 검수에 대한 기준
- D나 E 둘 중 하나라도 뭐가 들어있으면 = 누군가 “한 번은 봤다”
- D도 비었고 E도 비었으면 = 아무도 안 봤다(우선적으로 봐야 함)
err5
검수 조건 더 집요하게 하기 (진검승부.. 전수조사 2탄?)

python -m scripts.rule_filter [OK] flagged rows saved -> outputs/flagged_rows.xlsx 
- total rows: 14337 
- flagged rows: 6679

err6
update_script debug

> python -m scripts.apply_to_original [OK] saved -> outputs/nationwide_updated.xlsx 
- rows updated: 0

조건을 바꿨는데 0일리 없다
err7 UnboundLocalError

UnboundLocalError: cannot access local variable 'd_marked' where it is not associated with a valuedef main(): wb = load_workbook(ORIGINAL_XLSX) ws = wb.active max_row = ws.max_row filled = 0 skipped = 0 for r in range(2, max_row + 1): # 1행은 헤더라고 가정 d = s(ws[f"{COL_D}{r}"].value) e = s(ws[f"{COL_E}{r}"].value) g = s(ws[f"{COL_G}{r}"].value) h = s(ws[f"{COL_H}{r}"].value) # ✅ E가 비었거나 너무 짧고, H가 충분히 길면 -> H를 E에 복사 if len(e) < MIN_E_LEN and len(h) >= MIN_H_LEN: # 법령명(G)도 같이 넣고 싶으면 prefix로 붙이기 (선택) e_new = f"[{g}]\n{h}" if g else h #e_new = h ws[f"{COL_E}{r}"].value = e_new filled += 1 if d == "": ws[f"{COL_D}{r}"].value = "AI조항복붙" d_marked += 1 else: skipped += 1 wb.save(OUTPUT_XLSX) print(f"[OK] saved -> {OUTPUT_XLSX}") print(f"- E filled from H: {filled}") print(f"- skipped: {skipped}") if __name__ == "__main__":

err8 script 실행 순서 꼬임

python -m scripts.apply_flagged_to_autofilled [OK] saved -> outputs/nationwide_final.xlsx - rows updated: 0 > python -m scripts.rule_filter [OK] flagged rows saved -> outputs/flagged_rows_after_autofill.xlsx - total rows: 14337 - flagged rows: 24

err 9 : script 가 못잡아낸 공백
- script 다 만들었는데 E열 공백 있음
  - 특히 D=이상없음인데 E가 공백”
- E 열에도 공백이 있었고 D에도 공백있음 3개정도 행이 잡힘.
err 10 : "AI조항없음" 을 "이상없음"으로 바꿀예정
err 11: 행밀림 현상

Cause:

e1 : config 위치 (최상단 부모 프로젝트 하단) + module화위해 script init.py 추가
e2 : typo error
e3 : flag 전수조사각이라 너무 많아서 규칙추가
e4 : zip -r peri.zip . -x "*.git*" -x "*.xlsx*" -x "*.pyc" -x "*.venv*"
e6 : column 변수명 불일치 (엑셀 간 한글 변수명)
e7 : scope 변수명 이슈 (최상단 d_marked=0 초기화 선언.)
e8: 로직 이해 미스.
e9
- 조건 rule 확인
  1. 등급판단 데이터 = 기준미설정 / 조례없음
  2. 법령명(G) = 해당 없음
  3. 조항내용(H) = 해당 없음 (혹은 없음/공백) → 즉 애초에 참조할 조항이 존재하지 않는 케이스
2) D 이상없음 E 공백
D : “이상없음” = E: “(규제 조항) 해당 없음 / 조례 없음 / 기준 미설정이라서 검수 결과 문제 없음”
- E index 추천 : "해당없음","조례없음","기준미비"

3) D도 공백이고 E도 공백인 행이 3개
셀에 “눈에 안 보이는 공백”이 들어간 경우
- "(space)", "\n", NBSP
4) D열 값이 “이상 없음”처럼 띄어쓰기 버전인 경우
t = t.replace("이상 없음", "이상없음")
e9, 10: rule 추가("AI조항" -삭제 후 추가)

 
F_NO_BASE = {"조례없음", "기준미설정", "규정없음"}

 ...(중략)
 def main():
    wb = load_workbook(INPUT_XLSX)
    ws = wb.active
    
 # 삭제
 removed = 0
    for r in range(2, ws.max_row + 1):
        cell = ws[f"{COL_D}{r}"]
        if s(cell.value) == "AI조항복붙":
            cell.value = ""
            removed += 1

    wb.save(OUTPUT_XLSX)
 # 추가 
 filled = 0
    for r in range(2, ws.max_row + 1):
        d = s(ws[f"{COL_D}{r}"].value)
        e = s(ws[f"{COL_E}{r}"].value)
        f = s(ws[f"{COL_F}{r}"].value)

      # D가 이상없음(또는 이상 없음)이고 E가 빈칸인데,
        # F가 조례없음/기준미설정/규정없음이면 → E 채우기
        if d.replace(" ", "") == "이상없음" and e == "" and f in F_NO_BASE:
            ws[f"{COL_E}{r}"].value = "해당 없음"
            filled += 1

e 11: index/ city 없이 merge 한 경우 생김
1) autofill_e_from_h
2) apply_flagged_to_autofilled
3) pandas merge, concat, sort_values

Resolution: Actions I took

python -m scripts.rule_filter [OK] flagged rows saved -> outputs/flagged_rows.xlsx 
- total rows: 14337 
- flagged rows: 1150

> python -m scripts.autofill_e_from_h [OK] saved -> outputs/nationwide_autofilled.xlsx 
- E filled from H: 9647 
- skipped: 4690

python -m scripts.autofill_e_from_h [OK] saved -> outputs/nationwide_autofilled.xlsx 
- E filled from H: 9647 
- skipped: 4690

> python -m scripts.rule_filter [OK] flagged rows saved -> outputs/flagged_rows_after_autofill.xlsx 
- total rows: 14337 
- flagged rows: 24 
#...ㄷㄷ 엄청나다

-> 사람손으로 채워야하는수 24개

-> fin

> python -m scripts.apply_flagged_to_autofilled [OK] saved -> outputs/nationwide_final.xlsx - rows updated: 0
> python -m scripts.apply_flagged_to_autofilled [OK] saved -> outputs/nationwide_final.xlsx - rows updated: 24 
> python -m scripts.rule_filter [OK] flagged rows saved -> outputs/flagged_rows_final.xlsx - total rows: 14337

e11
missing_in_a.csv : final에는 있는데 원본에 없는 키
missing_in_b.csv : 원본에는 있는데 final에 없는 (도시명, 항목명)


	a = pd.read_excel(a_path, engine="openpyxl")
    b = pd.read_excel(b_path, engine="openpyxl")
    a = _norm_df(a)
    b = _norm_df(b)

    a_keys = _key_set(a)
    b_keys = _key_set(b)

	cnt_a = a.groupby("항목명").size().reset_index(name="rows_a")
    cnt_b = b.groupby("항목명").size().reset_index(name="rows_b")
    per_item = cnt_a.merge(cnt_b, on="항목명", how="outer").fillna(0)
    per_item["diff(a-b)"] = per_item["rows_a"].astype(int) - per_item["rows_b"].astype(int)
    per_item = per_item.sort_values(by="diff(a-b)", key=lambda s: s.abs(), ascending=False)

    missing_in_b = pd.DataFrame(list(a_keys - b_keys), columns=KEY_COLS).sort_values(KEY_COLS)
    missing_in_a = pd.DataFrame(list(b_keys - a_keys), columns=KEY_COLS).sort_values(KEY_COLS)
    
    # Ordering mismatches (도시명 sequence) per 항목명
    mismatches = []
    common_items = sorted(set(a["항목명"]) & set(b["항목명"]))
    for item in common_items:
        la = a.loc[a["항목명"]==item, "도시명"].tolist()
        lb = b.loc[b["항목명"]==item, "도시명"].tolist()
        if len(la) != len(lb):
            mismatches.append([item, "len_diff", len(la), len(lb), None, None])
            continue
        for i,(ca,cb) in enumerate(zip(la,lb)):
            if ca != cb:
                try:
                    j = lb.index(ca)
                except ValueError:
                    j = None
                mismatches.append([item, "order_mismatch", len(la), len(lb), i, j])
                break
    order_df = pd.DataFrame(mismatches, columns=["항목명","type","rows_a","rows_b","first_mismatch_idx","idx_in_b_of_a_city"])

-> 결과값

> python -m scripts.debug_compare \
  --a "전수조사취합_20260115_noErr.xlsx" \
  --b "nationwide_final_copy.xlsx" \
  --out "debug_out"

[OK] wrote reports -> debug_out
  - missing_in_b: 0
  - missing_in_a: 1
  - items with order mismatch: 0
  - per_item_count rows: 60

Preventing recurrence:

로직을 잘 짜자
- 예외 case에 대한 생각할 겨를부족
  - 이번기회로 QA 검증 기준 세분화-> 특히 해당 데이터 도메인 이해 후 빌드 업 필요하다는 걸 느낌.

kyoungyeon

🏠TECH & GOSSIP

이전 포스트

TIE-openxl,pandas Issue

TIE

format_fix_ver.1

Status

Symptoms

Cause:

Resolution: Actions I took

Preventing recurrence:

TIE- 코드 이슈 정리 및 보안/설정 상세 설명

0개의 댓글