Deep-dive: Human Intelligence Providers
Human intelligence vendors deep-dive (data labeling, RLHF/evals, expert networks, crowd marketplaces)
This note summarizes reputable “human intelligence” vendors—companies that provide human labor + tooling for tasks like data labeling, RLHF, LLM evaluations, content moderation, and research participant recruitment.
Purpose of this deep-dive (why “human intelligence” matters)
This deep-dive is intended to:
- Introduce the value of human intelligence in the AI industry (in contrast to synthetic data):
- Ground-truth + preference data: Humans create/validate labels, rankings, and judgments that are hard to reliably synthesize without leaking model biases back into training.
- Adversarial coverage: Humans can probe edge cases, ambiguity, and “unknown unknowns” that synthetic generation tends to under-sample.
- Evaluation integrity: Independent human evals can act as a check against benchmark gaming and overfitting to synthetic test sets.
- Domain expertise: Some tasks require credentialed expertise (medical, legal, finance, security), where synthetic data is especially risky.
- Clarify typical customer personas (who buys this and why):
- Frontier/AI labs: RLHF, evaluations, and specialized “frontier data” for next-model training.
- Enterprise AI teams: labeling + governance + security/compliance; steady throughput at predictable quality.
- Product orgs: trust & safety, moderation, multilingual QA, search relevance, and UX judgments.
- Research orgs: recruitment of reliable human participants (surveys/experiments).
- Introduce different business models (how vendors package labor + tooling):
- Enterprise managed services: sales-led delivery with SOW/MSA, SLAs, and dedicated PM/QC.
- Marketplace / self-serve: requesters launch tasks; quality control is largely on the buyer.
- Hybrid platform + workforce: SaaS workflow + optional vendor-provided workforce and QA.
- Expert contracting / “talent cloud”: hire vetted specialists/contractors rather than buy per-label work.
Method / caveats
- Valuation & revenue: public numbers (when available) are summarized in a short note below the tables. If no reliable public number was found, the field is marked “Not publicly disclosed.”
- Onboarding flows: summarized from public product docs / “get started” guides and common enterprise procurement patterns; when vendor-specific details aren’t public, the flow is labeled “Typical.”
Economics snapshots (pay/earnings + market sizing)
TODO: add sourced, public numbers for (1) typical worker earnings (expert vs microtask), and (2) market size trend (advanced LLM labeling vs simple labeling).
Comparison table (publicly verifiable fields; “Not publicly disclosed” means no reliable public number was found)
Leading category
| Company | Founded | HQ region | Serving regions | Strength | Clients (examples) | Client onboarding flow | Worker onboarding workflow | Worker size (public) |
|---|---|---|---|---|---|---|---|---|
| Surge AI (Surge Labs, Inc.) | 2020 | US (San Francisco, CA) | Global | Frontier-style RLHF + evals; high-skill judgment | AI labs (public case study: Anthropic) | Partner-managed projects or API/projects flow (funds required before launching to their workforce) | Managed workforce; worker pipeline details not fully public | Not publicly disclosed |
| Mercor | 2023 | US (San Francisco, CA) | Global | Expert contracting + AI-assisted matching (closer to hiring) | Not consistently public | Typical: sales/intake → define role(s) → review candidates → onboard/pay in platform | Worker: apply → interview/assessment → onboarding → paid contracting work | 30,000+ contractors (reported; see sources) |
| Scale AI | 2016 | US (SF Bay Area) | Global | Enterprise-scale labeling + compliance posture | Enterprise + public sector (public case studies) | Typical enterprise: sales/demo → security review → pilot → scale | Managed workforce + vendor QA; worker pipeline details not fully public | Not publicly disclosed |
| Amazon Mechanical Turk (MTurk) | 2005 | US (Seattle, WA) | Global | Lowest friction + low unit cost crowd microtasks | Many requesters (marketplace) | Create requester → create HIT/project → design → quals → fund → sandbox → launch | Worker signup → verify/payment setup → qualification → accept HITs | Not consistently published (marketplace; active supply fluctuates) |
Other reputable vendors
| Company | Founded | HQ region | Serving regions | Strength | Clients (examples) | Client onboarding flow | Worker onboarding workflow |
|---|---|---|---|---|---|---|---|
| Appen | 1996 | Australia (Sydney) | Global | Large-scale data programs + multilingual | Big Tech historically (varies by year) | Enterprise: scope → MSA/security → pilot → ramp | Worker: apply → locale/language screening → task qualification → production + QC |
| TELUS International (AI Data Solutions) | 2005 (TELUS International) | Canada (Vancouver) / global | Global | BPO + AI data services at scale | Enterprise (public) | Enterprise: scope → contract/security → ramp | Worker: crowdsourcing + managed teams; screening/qualification varies by program |
| Sama (formerly Samasource) | 2008 | US (SF Bay Area) | Global | High-quality data labeling + impact sourcing | Public case studies | Enterprise: scope → security → pilot → scale | Worker: recruit/train (impact sourcing) → qualification → production + QC |
| iMerit | 2012 | India (Kolkata) / US presence | Global | Managed annotation + domain teams | Public case studies | Enterprise: scope → pilot → scale | Worker: hiring + training → qualification → supervised delivery |
| CloudFactory | 2010 | UK/US with Nepal roots | Global | Managed distributed workforce | Public case studies | Enterprise: scope → pilot → scale | Worker: recruit → training → QA-managed work |
| Toloka | (see sources) | Europe (reported HQ varies) | Global | Marketplace + AI data ops (strong in multilingual) | Marketplace | Self-serve + managed: project setup → pool/filters → QA → launch | Worker: signup → verification → exams/quals → tasks + ratings |
| Clickworker | 2005 | Germany (Essen) | Global | Crowd microtasks, surveys, data collection | Marketplace | Self-serve: create job → target/quals → publish → review | Worker: signup → profile/verification → quals → tasks |
| Prolific | 2014 | UK (Oxford/London, reported) | Global | High-quality research participants (academia + industry) | Researchers/companies | Self-serve: create study → screen → launch → pay | Participant: signup → ID/eligibility → take studies |
| TransPerfect DataForce | (see sources) | US (NYC for TransPerfect) | Global | Multilingual data collection + annotation | Enterprise | Enterprise: scope → contract/security → delivery | Worker: recruit panelists → consent/verification → collection + QC |
| TaskUs (AI Services) | 2008 | US (New Braunfels, TX) | Global | BPO + trust & safety + AI data operations | Enterprise | Enterprise: scope → contract/security → ramp | Worker: hiring/training → production + QA |
| Labelbox | 2018 | US (San Francisco, CA) | Global | Labeling platform + managed services | Enterprise | SaaS: signup/demo → integrate data → label/eval workflows | Workforce via partners/managed services; worker pipeline varies |
| Hive (Hive AI / Hive Data) | 2013 | US (SF Bay Area) | Global | Content moderation + labeling + model APIs | Public customers (varies) | Typical: sales/demo → pilot → scale | Mix of in-house + crowd; worker details not fully public |
| LXT | 2010 | Canada/US (reported) | Global | Speech/text data collection + annotation (multilingual) | Enterprise | Enterprise: scope → contract → collection/annotation | Worker: recruit contributors → consent/verification → tasks + QA |
Valuation & revenue notes (publicly reported where available): for public companies, “valuation” is best read as market capitalization, which varies daily and is not tracked here. The only specific revenue figures currently captured in this doc come from Wikipedia infoboxes (verify before relying on them): Surge AI ($1.2B, 2024), TELUS Digital / TELUS International (US$2.658B, 2024), and TaskUs (US$227.5M, 2024). For Appen, TELUS Digital, and TaskUs, refer to their latest annual reports/filings for definitive revenue.
Conclusions
1) Onboarding patterns
- Enterprise-managed (Scale-like): discovery call → NDA/MSA → security/compliance review → pilot → production ramp → ongoing governance/QBRs.
- Platform-first (hybrid): product demo → project setup → small launch → iterate task design/QC → scale.
- Marketplace self-serve (MTurk-like): account + funding → task/HIT design → qualification strategy → sandbox/soft launch → scale (buyer-owned QC); see the API sketch after this list.
- Expert contracting (Mercor-like): role definition → candidate shortlist → interview/selection → onboarding + ongoing payments.
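To make the marketplace self-serve pattern concrete, here is a minimal sketch that creates one HIT against the MTurk requester sandbox via boto3. The task layout, reward, redundancy, and qualification threshold are illustrative assumptions, not recommendations.

```python
# Minimal sketch of the marketplace self-serve pattern (MTurk-style) using
# boto3's MTurk requester API against the sandbox endpoint. The task design,
# reward, redundancy, and qualification threshold are illustrative only.
import boto3

SANDBOX_ENDPOINT = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"

mturk = boto3.client("mturk", region_name="us-east-1", endpoint_url=SANDBOX_ENDPOINT)

# A trivial HTMLQuestion payload; real tasks would embed a full task UI.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html>
      <head><script src="https://assets.crowd.aws/crowd-html-elements.js"></script></head>
      <body>
        <crowd-form>
          <p>Is this sentence toxic? "Example sentence goes here."</p>
          <crowd-radio-group>
            <crowd-radio-button name="yes">Yes</crowd-radio-button>
            <crowd-radio-button name="no">No</crowd-radio-button>
          </crowd-radio-group>
        </crowd-form>
      </body>
    </html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

response = mturk.create_hit(
    Title="Label a sentence for toxicity (sandbox test)",
    Description="Read one sentence and answer a yes/no question.",
    Keywords="labeling, text, classification",
    Reward="0.05",                      # USD per assignment
    MaxAssignments=3,                   # redundancy: 3 workers per item
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
    QualificationRequirements=[
        {   # System qualification: assignment approval rate >= 95%.
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
            "ActionsGuarded": "Accept",
        },
    ],
)
print("HIT created:", response["HIT"]["HITId"])
```

The same fund → design → qualify → sandbox → launch loop carries over to other self-serve marketplaces, even where the API differs.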
2) Pricing / business model
- Per-task / per-annotation: common for labeling + microtasks; predictable unit economics, but QC overhead shifts to the buyer unless managed (see the worked cost comparison after this list).
- Per-hour / per-contractor / retained teams: common for expert work, RLHF, and programs needing continuity; often sold with minimum commitments.
- Platform fees + services: SaaS/workflow subscription plus optional managed workforce/PM/QC.
- Pre-funded wallets / pay-as-you-go: common in self-serve; can be a friction point for procurement-heavy enterprises.
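For a sense of how these models compare, here is a back-of-the-envelope calculation for the same hypothetical 10,000-item labeling job. Every number is an assumption chosen only to show where redundancy and QC overhead land in the effective cost per accepted label.

```python
# Back-of-the-envelope comparison of per-task vs. per-hour pricing for the
# same 10,000-item labeling job. All numbers are hypothetical assumptions.

items = 10_000

# --- Per-task / marketplace model (buyer-owned QC) ---
price_per_label = 0.08       # assumed marketplace price per judgment (USD)
redundancy = 3               # 3 judgments per item for majority vote
gold_overhead = 0.10         # 10% extra volume injected as gold/attention checks
review_cost_per_item = 0.01  # assumed internal adjudication/review cost

per_task_total = (
    items * redundancy * price_per_label * (1 + gold_overhead)
    + items * review_cost_per_item
)

# --- Per-hour / managed model (vendor-owned QC) ---
hourly_rate = 18.0           # assumed blended hourly rate incl. vendor margin
items_per_hour = 60          # assumed throughput per annotator-hour
qa_overhead = 0.15           # vendor QA layer as a fraction of labeling hours

labeling_hours = items / items_per_hour
per_hour_total = labeling_hours * (1 + qa_overhead) * hourly_rate

print(f"Per-task model: ${per_task_total:,.0f} "
      f"(${per_task_total / items:.3f} per accepted item)")
print(f"Per-hour model: ${per_hour_total:,.0f} "
      f"(${per_hour_total / items:.3f} per accepted item)")
```

The point is not which model is cheaper (that flips with the assumed throughput and redundancy), but that the per-task sticker price understates total cost until QC overhead is added in.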
3) Worker style
- Open crowd marketplace (MTurk): broad coverage and speed, but higher fraud/variance risk unless you build quals + gold + redundancy (a minimal sketch follows this list).
- Managed workforce (Scale/Surge pattern): tighter QC, training, and consistency; less transparency into the labor pool and often less “instant self-serve.”
- Vetted experts / contractors (Mercor): higher-skill judgments and domain work; lower throughput for pure microtasks, higher unit costs.
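A minimal sketch of the “quals + gold + redundancy” pattern referenced above, assuming judgments arrive as (item, worker, label) triples; the gold answers and thresholds are hypothetical.

```python
# Screen workers on hidden gold questions, then aggregate redundant judgments
# with a majority vote. Data structures and thresholds are hypothetical.
from collections import Counter, defaultdict

GOLD_ANSWERS = {"item_7": "no", "item_42": "yes"}   # hidden gold items
GOLD_PASS_THRESHOLD = 0.8                           # min gold accuracy to keep a worker

def gold_accuracy(judgments, worker_id):
    """Fraction of this worker's gold items answered correctly."""
    scored = [
        label == GOLD_ANSWERS[item]
        for item, worker, label in judgments
        if worker == worker_id and item in GOLD_ANSWERS
    ]
    return sum(scored) / len(scored) if scored else 0.0

def aggregate(judgments):
    """Majority vote over judgments from workers who pass the gold check."""
    trusted = {
        worker for _, worker, _ in judgments
        if gold_accuracy(judgments, worker) >= GOLD_PASS_THRESHOLD
    }
    votes = defaultdict(Counter)
    for item, worker, label in judgments:
        if worker in trusted and item not in GOLD_ANSWERS:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

# (item_id, worker_id, label) triples, e.g. from a marketplace export.
judgments = [
    ("item_1", "w1", "yes"), ("item_1", "w2", "yes"), ("item_1", "w3", "no"),
    ("item_7", "w1", "no"),  ("item_7", "w2", "no"),  ("item_7", "w3", "yes"),
    ("item_42", "w1", "yes"), ("item_42", "w2", "yes"), ("item_42", "w3", "no"),
]
print(aggregate(judgments))   # w3 fails gold and is excluded from the vote
```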
4) Key pain points many vendors still don’t solve well
- Provenance & auditability: buyers often want cryptographic-grade lineage (who/when/how), richer adjudication logs, and reproducible labeling decisions.
- Fraud / collusion resistance at scale: worker identity, account resale, coordinated cheating, and model-assisted gaming remain persistent risks.
- Fast iteration loops: turning ambiguous specs into stable rubrics + gold sets + tooling quickly is still a heavy lift.
- Transparent quality metrics: many programs lack standardized, comparable metrics across vendors (inter-annotator agreement, drift, adjudication rates, error taxonomies); a small agreement-metric sketch follows this list.
- Domain expert supply constraints: credentialed specialists are scarce, and vendors may struggle to sustain large-scale expert throughput.
- Data governance constraints: some customers need strict residency/onshore-only work, short retention, and fine-grained access controls.
- Worker experience & sustainability: training burden, compensation fairness, and mental-health protections (esp. moderation) are uneven and hard to verify.
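As one example of the standardized metrics buyers tend to ask for, the sketch below computes raw percent agreement and Cohen's kappa for two annotators over the same items; the labels are made up.

```python
# Two common quality metrics: raw percent agreement and Cohen's kappa for two
# annotators labeling the same items. Labels here are hypothetical; in practice
# they would come from deliberately overlapping assignments.
from collections import Counter

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

annotator_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
annotator_2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes"]

print(f"percent agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"cohen's kappa:     {cohens_kappa(annotator_1, annotator_2):.2f}")
```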