If your model metrics jump up and down between runs, label quality is usually part of the story. That is normal. The fix is not panic relabeling. The fix is a repeatable QA system.
This checklist is designed for real teams with limited time.
1) Confirm guideline clarity first
Before measuring quality, confirm that rules are clear. No guideline clarity means no consistent labels.
Check:
- class definitions are concrete
- edge-case decisions are documented
- ambiguous examples are illustrated
If needed, start with an annotation guideline template.
2) Run a fixed QA sample every week
Use the same reference set each week. That gives you trend visibility.
Checklist:
- stable sample size
- balanced by important classes
- reviewed by designated reviewers
Changing the QA sample every week hides drift, because you cannot tell whether a shift came from the labels or from the sample.
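One way to keep the reference set stable is to draw it once with a fixed seed and store the resulting ids. The sketch below assumes items arrive as (item_id, class_name) pairs; the function name, the seed, and the per-class size are illustrative choices, not a standard.

```python
import random
from collections import defaultdict


def build_fixed_qa_sample(items, per_class, seed=13):
    """Pick up to `per_class` item ids for each class, deterministically.

    `items` is a list of (item_id, class_name) pairs (assumed format).
    """
    by_class = defaultdict(list)
    for item_id, class_name in items:
        by_class[class_name].append(item_id)

    rng = random.Random(seed)  # fixed seed, so reruns give the same sample
    sample = []
    for class_name in sorted(by_class):
        ids = sorted(by_class[class_name])  # sort so input order does not matter
        k = min(per_class, len(ids))
        sample.extend(rng.sample(ids, k))
    return sorted(sample)
```

Save the returned ids and reuse them each week. Rebuild the set only on purpose, and note the rebuild so trend lines are not compared across different samples.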
3) Measure disagreement, not just speed
High throughput with high disagreement is not success.
Track:
- disagreement rate per class
- disagreement trend week-over-week
- top recurring disagreement reasons
These signals point to the specific classes and rules that need attention.
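A minimal sketch of the first two metrics, assuming each QA item yields a (reference_label, annotator_label) pair. The function names are hypothetical, and the rate here is plain mismatch share per reference class, not a chance-corrected agreement score.

```python
from collections import Counter


def disagreement_rate_per_class(pairs):
    """Share of items per reference class where the annotator disagreed.

    `pairs` is an iterable of (reference_label, annotator_label) tuples.
    """
    totals = Counter()
    disagreements = Counter()
    for reference, annotated in pairs:
        totals[reference] += 1
        if annotated != reference:
            disagreements[reference] += 1
    return {cls: disagreements[cls] / totals[cls] for cls in totals}


def week_over_week_change(previous, current):
    """Difference in rate per class; classes missing last week count as 0.0."""
    return {cls: current[cls] - previous.get(cls, 0.0) for cls in current}
```

A positive change for a class means disagreement grew since last week, which is a cue to look at that class's recurring disagreement reasons first.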
4) Calibrate reviewers regularly
Even good reviewers drift over time. Use short calibration sessions:
- review 20-50 hard examples together
- align on decisions
- update the guideline immediately
This is low effort and high impact.
5) Define release gates clearly
Every dataset release should pass explicit checks:
- reviewer coverage threshold met
- no open critical conflicts
- export validation passed
- class mapping verified
No release gate means quality varies by whoever is rushing that day.
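The four checks above can be written down as one function that returns the list of failed gates. This is a sketch under assumptions: the field names are hypothetical, and the 10% coverage threshold is a placeholder to replace with your own number.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCandidate:
    reviewed_items: int
    total_items: int
    open_critical_conflicts: int
    export_validation_passed: bool
    class_mapping_verified: bool


def failed_gates(candidate, min_coverage=0.10):
    """Return the names of gates that failed; an empty list means release is OK.

    `min_coverage` is an assumed placeholder threshold, not a recommendation.
    """
    failures = []
    if candidate.total_items > 0:
        coverage = candidate.reviewed_items / candidate.total_items
    else:
        coverage = 0.0
    if coverage < min_coverage:
        failures.append("reviewer coverage below threshold")
    if candidate.open_critical_conflicts > 0:
        failures.append("open critical conflicts")
    if not candidate.export_validation_passed:
        failures.append("export validation failed")
    if not candidate.class_mapping_verified:
        failures.append("class mapping not verified")
    return failures
```

Returning the failed gate names, instead of a single pass/fail flag, makes it obvious what has to be fixed before the release can go out.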
6) Track correction loops
How many labels come back from review? How quickly are corrections closed?
These metrics show whether annotation instructions are understandable.
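Both questions reduce to a few numbers. The sketch below assumes each returned label is logged as a (returned_at, closed_at) pair of datetimes, with closed_at set to None while the correction is still open; that record shape is an assumption for illustration.

```python
from statistics import median


def correction_loop_metrics(labels_submitted, corrections):
    """Summarize how many labels came back and how fast they were fixed.

    `corrections` is a list of (returned_at, closed_at) datetime pairs;
    closed_at is None for corrections that are still open (assumed format).
    """
    if labels_submitted:
        return_rate = len(corrections) / labels_submitted
    else:
        return_rate = 0.0
    closed_hours = [
        (closed - returned).total_seconds() / 3600
        for returned, closed in corrections
        if closed is not None
    ]
    return {
        "return_rate": return_rate,
        "open_corrections": len(corrections) - len(closed_hours),
        "median_hours_to_close": median(closed_hours) if closed_hours else None,
    }
```

A rising return rate alongside a stable close time suggests the instructions are the problem, not the team's capacity to fix labels.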
7) Watch high-risk classes separately
Not all classes are equally important. For safety- or business-critical classes, use tighter QA thresholds.
Risk-based QA beats equal-effort QA.
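In practice this can be a per-class threshold table with a looser default. The sketch assumes you already have a disagreement or error rate per class; the class names and threshold values in the usage lines are made up for illustration.

```python
def classes_over_threshold(rates, thresholds, default_threshold=0.10):
    """Return classes whose rate exceeds their own limit.

    `thresholds` holds tighter limits for high-risk classes; every other
    class falls back to `default_threshold` (an assumed placeholder value).
    """
    return sorted(
        cls
        for cls, rate in rates.items()
        if rate > thresholds.get(cls, default_threshold)
    )


# Hypothetical example: "pedestrian" is safety-critical, so it gets a 2% limit.
rates = {"pedestrian": 0.04, "car": 0.08, "tree": 0.12}
flagged = classes_over_threshold(rates, {"pedestrian": 0.02})
```

Here "pedestrian" is flagged at 4% while "car" passes at 8%, which is the point of risk-based QA: the same rate means different things for different classes.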
8) Maintain a lightweight change log
Each quality update should record:
- what changed
- why it changed
- which classes are affected
- when it became active
This makes training results easier to interpret.
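A change log this small fits in a JSON Lines file with one record per update. The field names below mirror the four bullets above and are otherwise an assumption, as is the choice of JSON Lines.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class ChangeLogEntry:
    what_changed: str
    why: str
    affected_classes: list
    active_from: date


def to_json_line(entry):
    """Serialize one entry as a single JSON line with an ISO date."""
    record = asdict(entry)
    record["active_from"] = entry.active_from.isoformat()
    return json.dumps(record, sort_keys=True)
```

Appending one line per change keeps the log easy to diff and easy to join against training runs by date.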
9) Audit random samples monthly
Weekly QA catches short-term issues. Monthly random audits catch slow drift.
Keep it simple:
- random pull from recent production-like data
- independent reviewer check
- one summary note
10) Close the loop with model errors
QA should connect to model outcomes.
After each training cycle:
- inspect false positives/negatives
- map errors to labeling rules
- adjust the guideline and QA focus
This turns QA from compliance work into product improvement.
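A simple starting point for mapping errors to labeling rules is to count which class pairs the model confuses most often. The sketch assumes evaluation results are available as (true_label, predicted_label) pairs; the function name is hypothetical.

```python
from collections import Counter


def top_confusions(examples, n=3):
    """Return the n most frequent (true_label, predicted_label) mistakes.

    `examples` is an iterable of (true_label, predicted_label) tuples;
    correct predictions are ignored.
    """
    confusions = Counter(
        (true, predicted) for true, predicted in examples if true != predicted
    )
    return confusions.most_common(n)
```

Each frequent pair is a candidate for guideline review: check whether the rule separating those two classes is documented and whether the QA sample covers it.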
For broader operations, combine this with workflow automation and versioning.
A minimal weekly template
Use this schedule:
- Monday: QA sample review
- Wednesday: reviewer calibration
- Friday: release gate + change log update
Keep it consistent. Consistency beats complexity.
Final takeaway
Quality control is not about perfect labels. It is about predictable labels.
If your team can detect drift early and correct it quickly, model iteration becomes calmer and cheaper.
FAQ
How big should the weekly QA sample be?
Large enough to reveal drift in key classes. Many teams start with 100-300 items.
Do small teams need formal QA?
Yes. Even a simple checklist prevents costly rework later.
Should QA focus only on difficult images?
No. Use both fixed representative samples and a hard-case slice.