MLOps - Define Data and Establish Baseline

yozzum · January 26, 2025

[More label ambiguity examples]

  • Is it a bot or spam account?
  • Is it a fraudulent transaction?
  • Is this user looking for a job?
  • Are the two users the same person? (no explicit key to match on)

[Data definition questions]

  • What is the input x?
    • (Images) Lighting? Contrast? Resolution?
    • What features need to be included?
  • What is the target label y?
    • How can we ensure the labelers give consistent labels?

[Major types of data problems]

  • Unstructured data
    • May or may not have a huge collection of unlabeled examples x.
    • Humans can label more data.
    • Data augmentation is more likely to be helpful.
  • Structured data
    • It is harder to obtain more data.
    • Human labeling may not be possible.
  • Small data
    • Clean labels are critical.
    • The labelers can talk to each other to agree on how to label.
  • Big data
    • Emphasis is on the data process.
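
As a rough illustration of the data augmentation point for unstructured data, the sketch below creates a few modified copies of an image. It assumes images are NumPy float arrays scaled to [0, 1] and that the label does not change under a horizontal flip or a small brightness shift. The function name `augment` and the `brightness_delta` parameter are hypothetical, chosen only for this example.

```python
import numpy as np

def augment(image, brightness_delta=0.1):
    """Return simple augmented copies of a float image with values in [0, 1].

    Assumes shape (H, W) or (H, W, C) and that the label is unchanged by
    these transforms (an assumption that must be checked per task).
    """
    flipped = image[:, ::-1]                                  # horizontal flip
    brighter = np.clip(image + brightness_delta, 0.0, 1.0)    # brighten, stay in range
    darker = np.clip(image - brightness_delta, 0.0, 1.0)      # darken, stay in range
    return [flipped, brighter, darker]

# Tiny 2x2 grayscale "image" used only for demonstration.
image = np.array([[0.0, 0.5],
                  [0.95, 1.0]])
augmented = augment(image)
```

Whether such transforms are safe depends on the task: flipping is harmless for many object photos but would change the meaning of, for example, text in an image.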

[Small data and label consistency]

  • Labeling instructions must be clear so that the labels stay consistent.
  • Big data problems can have small data challenges too.
    • A large dataset with a long tail of rare events in the input still has only a few examples of each rare event. Examples:
      • Web search
      • Self-driving cars
      • Product recommendation systems
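
One way to see the long tail in a large dataset is to count how often each input category appears and list the ones with very few examples. The sketch below is a minimal illustration with made-up search queries; the helper `rare_categories` and the `min_count` threshold are assumptions for this example, not part of any specific tool.

```python
from collections import Counter

def rare_categories(values, min_count=3):
    """Return {category: count} for categories seen fewer than min_count times."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count < min_count}

# Hypothetical query log: two head queries and two rare (long-tail) queries.
queries = (
    ["weather"] * 1000
    + ["news"] * 500
    + ["rare query a"] * 2
    + ["rare query b"]
)
rare = rare_categories(queries, min_count=3)
```

Even though the log has over 1,500 rows, the rare queries behave like a small data problem: each has only one or two examples, so label consistency on those rows matters a lot.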

[Improving label consistency]

  • Have multiple labelers label the same example.
  • When there is disagreement, have the MLE, subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
  • If labelers believe that x doesn't contain enough information, consider changing x.
  • Iterate until it is hard to significantly increase agreement.
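
The loop above needs some way to measure agreement and to find the examples worth discussing. The sketch below is one simple option, assuming each example has labels from several labelers stored in a dict. It uses raw pairwise agreement rather than a chance-corrected statistic, and the names `pairwise_agreement` and `disputed_examples` are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(labels_per_example):
    """Fraction of labeler pairs that agree, pooled over all examples."""
    agree = 0
    total = 0
    for labels in labels_per_example.values():
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else float("nan")

def disputed_examples(labels_per_example):
    """Return ids of examples where the labelers did not all give the same label."""
    return [ex for ex, labels in labels_per_example.items() if len(set(labels)) > 1]

# Hypothetical labels from three labelers on three examples.
labels = {
    "example_1": ["spam", "spam", "spam"],
    "example_2": ["spam", "not_spam", "spam"],
    "example_3": ["not_spam", "not_spam", "not_spam"],
}
agreement = pairwise_agreement(labels)   # 7 of 9 pairs agree
to_discuss = disputed_examples(labels)   # ["example_2"]
```

After the labelers discuss the disputed examples and update the instructions, the same numbers can be recomputed on a fresh batch to check whether agreement is still improving.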

[Human-level performance (HLP)]

  • Why measure HLP?
    • Estimate Bayes error / irreducible error to help with error analysis and prioritization.
  • It is important to be conservative when comparing model performance with HLP.
  • The comparison should be convincing enough to persuade business stakeholders of the model's performance.
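
A minimal sketch of how HLP and model performance might be compared on the same examples is shown below. It assumes a set of reference labels is available and uses plain accuracy. All data and names here are hypothetical. One reason to stay conservative: when the reference labels were themselves produced by a human, "HLP" partly measures how well two humans agree rather than a true ceiling on performance.

```python
def accuracy(predictions, reference):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, reference))
    return correct / len(reference)

# Hypothetical reference labels, human labels, and model predictions.
reference    = ["defect", "ok", "ok",     "defect", "ok"]
human_labels = ["defect", "ok", "defect", "defect", "ok"]
model_preds  = ["defect", "ok", "ok",     "ok",     "defect"]

hlp = accuracy(human_labels, reference)       # 4 of 5 correct
model_acc = accuracy(model_preds, reference)  # 3 of 5 correct
gap_to_hlp = hlp - model_acc                  # room left before reaching HLP
```

With so few examples the numbers are only illustrative; in practice the gap is more useful when broken down by category during error analysis.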
