Deep-dive: Human Intelligence Providers
Human intelligence vendors deep-dive (data labeling, RLHF/evals, expert networks, crowd marketplaces)
This note summarizes reputable “human intelligence” vendors—companies that provide human labor + tooling for tasks like data labeling, RLHF, LLM evaluations, content moderation, and research participant recruitment.
Purpose of this deep-dive (why “human intelligence” matters)
This deep-dive is intended to:
- Introduce the value of human intelligence in the AI industry (in contrast to synthetic data):
- Ground-truth + preference data: Humans create/validate labels, rankings, and judgments that are hard to reliably synthesize without leaking model biases back into training.
- Adversarial coverage: Humans can probe edge cases, ambiguity, and “unknown unknowns” that synthetic generation tends to under-sample.
- Evaluation integrity: Independent human evals can act as a check against benchmark gaming and overfitting to synthetic test sets.
- Domain expertise: Some tasks require credentialed expertise (medical, legal, finance, security), where synthetic data is especially risky.
- Clarify typical customer personas (who buys this and why):
- Frontier/AI labs: RLHF, evaluations, and specialized “frontier data” for next-model training.
- Enterprise AI teams: labeling + governance + security/compliance; steady throughput at predictable quality.
- Product orgs: trust & safety, moderation, multilingual QA, search relevance, and UX judgments.
- Research orgs: recruitment of reliable human participants (surveys/experiments).
- Introduce different business models (how vendors package labor + tooling):
- Enterprise managed services: sales-led delivery with SOW/MSA, SLAs, and dedicated PM/QC.
- Marketplace / self-serve: requesters launch tasks; quality control is largely on the buyer.
- Hybrid platform + workforce: SaaS workflow + optional vendor-provided workforce and QA.
- Expert contracting / “talent cloud”: hire vetted specialists/contractors rather than buy per-label work.
Method / caveats
- Valuation & revenue: public numbers (when available) are summarized in a short note below the tables. If no reliable public number was found, the field is marked “Not publicly disclosed.”
- Onboarding flows: summarized from public product docs / “get started” guides and common enterprise procurement patterns; when vendor-specific details aren’t public, the flow is labeled “Typical.”
Economics snapshots (pay/earnings + market sizing)
TODO: add sourced, public numbers for (1) typical worker earnings (expert vs microtask), and (2) market size trend (advanced LLM labeling vs simple labeling).
Comparison table (publicly verifiable fields; “Not publicly disclosed” means no reliable public number was found)
Leading category
| Company | Founded | HQ region | Serving regions | Strength | Clients (examples) | Client onboarding flow | Worker onboarding workflow | Worker size (public) |
|---|---|---|---|---|---|---|---|---|
| Surge AI (Surge Labs, Inc.) | 2020 | US (San Francisco, CA) | Global | Frontier-style RLHF + evals; high-skill judgment | AI labs (public case study: Anthropic) | Partner-managed projects or API/projects flow (funds required before launching to their workforce) | Managed workforce; worker pipeline details not fully public | Not publicly disclosed |
| Mercor | 2023 | US (San Francisco, CA) | Global | Expert contracting + AI-assisted matching (closer to hiring) | Not consistently public | Typical: sales/intake → define role(s) → review candidates → onboard/pay in platform | Worker: apply → interview/assessment → onboarding → paid contracting work | 30,000+ contractors (reported; see sources) |
| Scale AI | 2016 | US (SF Bay Area) | Global | Enterprise-scale labeling + compliance posture | Enterprise + public sector (public case studies) | Typical enterprise: sales/demo → security review → pilot → scale | Managed workforce + vendor QA; worker pipeline details not fully public | Not publicly disclosed |
| Amazon Mechanical Turk (MTurk) | 2005 | US (Seattle, WA) | Global | Lowest friction + low unit cost crowd microtasks | Many requesters (marketplace) | Create requester → create HIT/project → design → quals → fund → sandbox → launch | Worker signup → verify/payment setup → qualification → accept HITs | Not consistently published (marketplace; active supply fluctuates) |
Other reputable vendors
| Company | Founded | HQ region | Serving regions | Strength | Clients (examples) | Client onboarding flow | Worker onboarding workflow |
|---|---|---|---|---|---|---|---|
| Appen | 1996 | Australia (Sydney) | Global | Large-scale data programs + multilingual | Big Tech historically (varies by year) | Enterprise: scope → MSA/security → pilot → ramp | Worker: apply → locale/language screening → task qualification → production + QC |
| TELUS International (AI Data Solutions) | 2005 (TELUS International) | Canada (Vancouver) / global | Global | BPO + AI data services at scale | Enterprise (public) | Enterprise: scope → contract/security → ramp | Worker: crowdsourcing + managed teams; screening/qualification varies by program |
| Sama (formerly Samasource) | 2008 | US (SF Bay Area) | Global | High-quality data labeling + impact sourcing | Public case studies | Enterprise: scope → security → pilot → scale | Worker: recruit/train (impact sourcing) → qualification → production + QC |
| iMerit | 2012 | India (Kolkata) / US presence | Global | Managed annotation + domain teams | Public case studies | Enterprise: scope → pilot → scale | Worker: hiring + training → qualification → supervised delivery |
| CloudFactory | 2010 | UK/US with Nepal roots | Global | Managed distributed workforce | Public case studies | Enterprise: scope → pilot → scale | Worker: recruit → training → QA-managed work |
| Toloka | (see sources) | Europe (reported HQ varies) | Global | Marketplace + AI data ops (strong in multilingual) | Marketplace | Self-serve + managed: project setup → pool/filters → QA → launch | Worker: signup → verification → exams/quals → tasks + ratings |
| Clickworker | 2005 | Germany (Essen) | Global | Crowd microtasks, surveys, data collection | Marketplace | Self-serve: create job → target/quals → publish → review | Worker: signup → profile/verification → quals → tasks |
| Prolific | 2014 | UK (Oxford/London, reported) | Global | High-quality research participants (academia + industry) | Researchers/companies | Self-serve: create study → screen → launch → pay | Participant: signup → ID/eligibility → take studies |
| TransPerfect DataForce | (see sources) | US (NYC for TransPerfect) | Global | Multilingual data collection + annotation | Enterprise | Enterprise: scope → contract/security → delivery | Worker: recruit panelists → consent/verification → collection + QC |
| TaskUs (AI Services) | 2008 | US (New Braunfels, TX) | Global | BPO + trust & safety + AI data operations | Enterprise | Enterprise: scope → contract/security → ramp | Worker: hiring/training → production + QA |
| Labelbox | 2018 | US (San Francisco, CA) | Global | Labeling platform + managed services | Enterprise | SaaS: signup/demo → integrate data → label/eval workflows | Workforce via partners/managed services; worker pipeline varies |
| Hive (Hive AI / Hive Data) | 2013 | US (SF Bay Area) | Global | Content moderation + labeling + model APIs | Public customers (varies) | Typical: sales/demo → pilot → scale | Mix of in-house + crowd; worker details not fully public |
| LXT | 2010 | Canada/US (reported) | Global | Speech/text data collection + annotation (multilingual) | Enterprise | Enterprise: scope → contract → collection/annotation | Worker: recruit contributors → consent/verification → tasks + QA |
Valuation & revenue notes (publicly reported where available): for public companies, “valuation” is best read as market capitalization, which varies daily and is not tracked here. The only specific revenue figures currently captured in this doc come from Wikipedia infoboxes (verify before relying on them): Surge AI ($1.2B, 2024), TELUS Digital / TELUS International (US$2.658B, 2024), and TaskUs (US$227.5M, 2024). For Appen, TELUS Digital, and TaskUs, refer to their latest annual reports/filings for definitive revenue.
Conclusions
1) Onboarding patterns
- Enterprise-managed (Scale-like): discovery call → NDA/MSA → security/compliance review → pilot → production ramp → ongoing governance/QBRs.
- Platform-first (hybrid): product demo → project setup → small launch → iterate task design/QC → scale.
- Marketplace self-serve (MTurk-like): account + funding → task/HIT design → qualification strategy → sandbox/soft launch → scale (buyer-owned QC); see the API sketch after this list.
- Expert contracting (Mercor-like): role definition → candidate shortlist → interview/selection → onboarding + ongoing payments.
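To make the marketplace self-serve pattern concrete, here is a minimal sketch that creates one HIT against the MTurk requester sandbox via boto3. The task layout, reward, redundancy, and qualification threshold are illustrative assumptions, not recommendations.

```python
# Minimal sketch of the marketplace self-serve pattern (MTurk-style) using
# boto3's MTurk requester API against the sandbox endpoint. The task design,
# reward, redundancy, and qualification threshold are illustrative only.
import boto3

SANDBOX_ENDPOINT = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"

mturk = boto3.client("mturk", region_name="us-east-1", endpoint_url=SANDBOX_ENDPOINT)

# A trivial HTMLQuestion payload; real tasks would embed a full task UI.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html>
      <head><script src="https://assets.crowd.aws/crowd-html-elements.js"></script></head>
      <body>
        <crowd-form>
          <p>Is this sentence toxic? "Example sentence goes here."</p>
          <crowd-radio-group>
            <crowd-radio-button name="yes">Yes</crowd-radio-button>
            <crowd-radio-button name="no">No</crowd-radio-button>
          </crowd-radio-group>
        </crowd-form>
      </body>
    </html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

response = mturk.create_hit(
    Title="Label a sentence for toxicity (sandbox test)",
    Description="Read one sentence and answer a yes/no question.",
    Keywords="labeling, text, classification",
    Reward="0.05",                      # USD per assignment
    MaxAssignments=3,                   # redundancy: 3 workers per item
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
    QualificationRequirements=[
        {   # System qualification: assignment approval rate >= 95%.
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
            "ActionsGuarded": "Accept",
        },
    ],
)
print("HIT created:", response["HIT"]["HITId"])
```

The same fund → design → qualify → sandbox → launch loop carries over to other self-serve marketplaces, even where the API differs.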
2) Pricing / business model
- Per-task / per-annotation: common for labeling + microtasks; predictable unit economics, but QC overhead shifts to the buyer unless managed (see the worked cost comparison after this list).
- Per-hour / per-contractor / retained teams: common for expert work, RLHF, and programs needing continuity; often sold with minimum commitments.
- Platform fees + services: SaaS/workflow subscription plus optional managed workforce/PM/QC.
- Pre-funded wallets / pay-as-you-go: common in self-serve; can be a friction point for procurement-heavy enterprises.
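For a sense of how these models compare, here is a back-of-the-envelope calculation for the same hypothetical 10,000-item labeling job. Every number is an assumption chosen only to show where redundancy and QC overhead land in the effective cost per accepted label.

```python
# Back-of-the-envelope comparison of per-task vs. per-hour pricing for the
# same 10,000-item labeling job. All numbers are hypothetical assumptions.

items = 10_000

# --- Per-task / marketplace model (buyer-owned QC) ---
price_per_label = 0.08       # assumed marketplace price per judgment (USD)
redundancy = 3               # 3 judgments per item for majority vote
gold_overhead = 0.10         # 10% extra volume injected as gold/attention checks
review_cost_per_item = 0.01  # assumed internal adjudication/review cost

per_task_total = (
    items * redundancy * price_per_label * (1 + gold_overhead)
    + items * review_cost_per_item
)

# --- Per-hour / managed model (vendor-owned QC) ---
hourly_rate = 18.0           # assumed blended hourly rate incl. vendor margin
items_per_hour = 60          # assumed throughput per annotator-hour
qa_overhead = 0.15           # vendor QA layer as a fraction of labeling hours

labeling_hours = items / items_per_hour
per_hour_total = labeling_hours * (1 + qa_overhead) * hourly_rate

print(f"Per-task model: ${per_task_total:,.0f} "
      f"(${per_task_total / items:.3f} per accepted item)")
print(f"Per-hour model: ${per_hour_total:,.0f} "
      f"(${per_hour_total / items:.3f} per accepted item)")
```

The point is not which model is cheaper (that flips with the assumed throughput and redundancy), but that the per-task sticker price understates total cost until QC overhead is added in.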
3) Worker style
- Open crowd marketplace (MTurk): broad coverage and speed, but higher fraud/variance risk unless you build quals + gold + redundancy (a minimal sketch follows this list).
- Managed workforce (Scale/Surge pattern): tighter QC, training, and consistency; less transparency into the labor pool and often less “instant self-serve.”
- Vetted experts / contractors (Mercor): higher-skill judgments and domain work; lower throughput for pure microtasks, higher unit costs.
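A minimal sketch of the “quals + gold + redundancy” pattern referenced above, assuming judgments arrive as (item, worker, label) triples; the gold answers and thresholds are hypothetical.

```python
# Screen workers on hidden gold questions, then aggregate redundant judgments
# with a majority vote. Data structures and thresholds are hypothetical.
from collections import Counter, defaultdict

GOLD_ANSWERS = {"item_7": "no", "item_42": "yes"}   # hidden gold items
GOLD_PASS_THRESHOLD = 0.8                           # min gold accuracy to keep a worker

def gold_accuracy(judgments, worker_id):
    """Fraction of this worker's gold items answered correctly."""
    scored = [
        label == GOLD_ANSWERS[item]
        for item, worker, label in judgments
        if worker == worker_id and item in GOLD_ANSWERS
    ]
    return sum(scored) / len(scored) if scored else 0.0

def aggregate(judgments):
    """Majority vote over judgments from workers who pass the gold check."""
    trusted = {
        worker for _, worker, _ in judgments
        if gold_accuracy(judgments, worker) >= GOLD_PASS_THRESHOLD
    }
    votes = defaultdict(Counter)
    for item, worker, label in judgments:
        if worker in trusted and item not in GOLD_ANSWERS:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

# (item_id, worker_id, label) triples, e.g. from a marketplace export.
judgments = [
    ("item_1", "w1", "yes"), ("item_1", "w2", "yes"), ("item_1", "w3", "no"),
    ("item_7", "w1", "no"),  ("item_7", "w2", "no"),  ("item_7", "w3", "yes"),
    ("item_42", "w1", "yes"), ("item_42", "w2", "yes"), ("item_42", "w3", "no"),
]
print(aggregate(judgments))   # w3 fails gold and is excluded from the vote
```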
4) Key pain points many vendors still don’t solve well
- Provenance & auditability: buyers often want cryptographic-grade lineage (who/when/how), richer adjudication logs, and reproducible labeling decisions.
- Fraud / collusion resistance at scale: worker identity, account resale, coordinated cheating, and model-assisted gaming remain persistent risks.
- Fast iteration loops: turning ambiguous specs into stable rubrics + gold sets + tooling quickly is still a heavy lift.
- Transparent quality metrics: many programs lack standardized, comparable metrics across vendors (inter-annotator agreement, drift, adjudication rates, error taxonomies); a small agreement-metric sketch follows this list.
- Domain expert supply constraints: credentialed specialists are scarce, and vendors may struggle to sustain large-scale expert throughput.
- Data governance constraints: some customers need strict residency/onshore-only work, short retention, and fine-grained access controls.
- Worker experience & sustainability: training burden, compensation fairness, and mental-health protections (esp. moderation) are uneven and hard to verify.
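As one example of the standardized metrics buyers tend to ask for, the sketch below computes raw percent agreement and Cohen's kappa for two annotators over the same items; the labels are made up.

```python
# Two common quality metrics: raw percent agreement and Cohen's kappa for two
# annotators labeling the same items. Labels here are hypothetical; in practice
# they would come from deliberately overlapping assignments.
from collections import Counter

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

annotator_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
annotator_2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes"]

print(f"percent agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"cohen's kappa:     {cohens_kappa(annotator_1, annotator_2):.2f}")
```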