Financial PhraseBank

3-class sentiment (positive · neutral · negative) · 200 samples · sentences_50agree · Dataset →
⭐ Fintech-critical
F1 (Macro): 81.2% (−6.8pp vs GPT-4 · v1 baseline)
Accuracy: 81.5% (−6.7pp vs GPT-4 · v1 baseline)
Precision: 79.2% (macro-averaged)
Recall: 85.6% (macro-averaged)
Model comparison — Financial PhraseBank (n=200) — v1 engine baseline

Model                                 Accuracy  Precision  Recall  F1 (Macro)
Tagmatic (ours)                       81.5%     79.2%      85.6%   81.2%
GPT-4 (zero-shot baseline)            88.2%     87.9%      87.1%   88.0%
Claude 3 Sonnet (zero-shot baseline)  85.5%     84.8%      84.1%   84.5%
FinBERT (fine-tuned)                  87.2%     86.9%      85.8%   86.4%

GPT-4 zero-shot baseline from Li et al. (2023), Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? (arXiv:2305.05862). FinBERT from Araci (2019). Claude 3 Sonnet from published LLM sentiment benchmarks. Tagmatic scores computed via live API on the atrost/financial_phrasebank test split, seed 42. The FPB scores above are the v1 engine baseline, run 18 March 2026; an improved financial-domain annotation engine was deployed the same day, and a re-run is scheduled after the token quota resets.


SST-2 (Stanford Sentiment Treebank)

Binary sentiment (positive · negative) · 200 samples · validation split · Dataset →
Benchmark run pending

The SST-2 benchmark was scheduled in the same run as Financial PhraseBank but hit the API's daily token quota (100k tokens/day) after the first dataset completed: the 200 financial-sentiment annotations exhausted the full daily budget, so SST-2 needs a fresh quota window. The benchmark script (node scripts/benchmarks/run.js) is ready and will re-run automatically. Published baselines remain below for reference.

Published baselines — SST-2 (improved engine queued)
Model                                 Accuracy  Precision  Recall  F1 (Macro)
Tagmatic (ours)                       — benchmark pending token quota reset —
GPT-4 (zero-shot baseline)            95.5%     95.4%      95.2%   95.3%
Claude 3 Sonnet (zero-shot baseline)  94.8%     94.6%      94.3%   94.5%
BERT-large (fine-tuned)               93.5%     93.3%      93.1%   93.2%

GPT-4 and Claude 3 zero-shot baselines from published LLM sentiment-analysis benchmarks (2023–2024). BERT-large from the GLUE benchmark leaderboard (Devlin et al., 2019). Tagmatic scores will be added when the next benchmark run completes.


Methodology

Model & Version

All Tagmatic annotations run via claude-3-7-sonnet-20250219 with structured JSON output. Schema enforces label vocabulary — no free-text fallback.

Schema Design

Each dataset uses a purpose-built Tagmatic schema. Financial PhraseBank uses a 3-class finance-aware schema (positive/neutral/negative). SST-2 uses a binary schema with explicit negation handling.
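To make this concrete, a finance-aware 3-class schema can be sketched as a plain object. This is an illustrative assumption, not the exact production schema; the field names (`name`, `labels`, `guidelines`) and the guideline wording are hypothetical.

```javascript
// Hypothetical sketch of a Tagmatic schema for Financial PhraseBank.
// Field names are assumptions; the real schema ships with the benchmark script.
const financialSentimentSchema = {
  name: 'financial_sentiment',
  labels: ['positive', 'neutral', 'negative'], // closed label vocabulary, no free text
  guidelines:
    'Label the sentence from an investor point of view. ' +
    'Neutral covers factual statements with no clear market impact.',
};

module.exports = { financialSentimentSchema };
```

The closed `labels` list is what lets the structured-output layer reject any annotation outside the vocabulary.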

Sampling

200 items are sampled uniformly at random (seed 42) from each dataset's test/validation split using a deterministic Mulberry32 PRNG. Financial PhraseBank: atrost/financial_phrasebank test split (n=970). SST-2: stanfordnlp/sst2 validation split (n=872).

Evaluation

Macro-averaged F1, precision, and recall are computed per class from TP/FP/FN counts; accuracy is reported separately. Baselines are sourced from peer-reviewed papers (see footnotes). No confidence threshold is applied — every annotation counts.
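The scoring described above amounts to the following sketch (`score` is a hypothetical name; the actual implementation lives in the benchmark script):

```javascript
// Macro-averaged precision/recall/F1 from per-class TP/FP/FN, plus accuracy.
// Classes absent from the predictions score 0, which is what drags the macro
// average down when a label is systematically missed.
function score(gold, pred, labels) {
  const perClass = labels.map((label) => {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < gold.length; i++) {
      if (pred[i] === label && gold[i] === label) tp++;
      else if (pred[i] === label) fp++;
      else if (gold[i] === label) fn++;
    }
    const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
    const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
    const f1 =
      precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
    return { precision, recall, f1 };
  });
  const mean = (key) => perClass.reduce((s, m) => s + m[key], 0) / perClass.length;
  const accuracy = gold.filter((g, i) => g === pred[i]).length / gold.length;
  return { accuracy, precision: mean('precision'), recall: mean('recall'), f1: mean('f1') };
}
```

Note that with macro averaging, recall can exceed accuracy (as in the FPB results above) when errors concentrate in the larger classes.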

No Data Leakage

Tagmatic's model has no dataset-specific fine-tuning. Results represent pure zero-shot schema-guided classification — same as production use. Schema descriptions and guidelines are generic, not tuned to the benchmark.

Reproducibility

Benchmark script at /scripts/benchmarks/run.js in the repository. Fetches data live from HuggingFace datasets API. Run with API_KEY=tmk_... node scripts/benchmarks/run.js to reproduce results exactly.

See for yourself

Paste your own text and classify it in real time. No signup required — just a schema and a prompt.

Try the playground → Read the docs