Annotation Benchmarks
We run Tagmatic's classification pipeline against public NLP datasets and compare results against published LLM zero-shot baselines. Numbers are pre-computed and static — no live inference on page load.
Financial PhraseBank
| Model | Accuracy | Precision | Recall | F1 (Macro) |
|---|---|---|---|---|
| Tagmatic (Ours) | | | | |
| GPT-4 (zero-shot) (Baseline) | | | | |
| Claude 3 Sonnet (zero-shot) (Baseline) | | | | |
| FinBERT (fine-tuned) | | | | |
GPT-4 and ChatGPT baselines from Li et al. (2023), "Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics?" (arXiv:2305.05862). FinBERT from Araci (2019). Claude 3 Sonnet from published LLM sentiment benchmarks. Tagmatic scores were computed via the live API on the atrost/financial_phrasebank test split, seed 42. The FPB score (81.2%) is the v1 engine baseline (18 March 2026). An improved financial-domain annotation engine was deployed on 18 March 2026; a re-run is scheduled after the token quota resets.
SST-2 (Stanford Sentiment Treebank)
The SST-2 benchmark was scheduled in the same run as Financial PhraseBank, but the run hit the API's daily token quota (100k tokens/day) after completing the first dataset. The 200 financial sentiment annotations exhausted the full daily budget, so SST-2 requires a fresh quota window. The v2 benchmark run is queued and will update this page automatically when complete. Published baselines remain for reference.
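The quota figures above imply an average per-annotation cost, which can be checked with a rough estimate. This is only a sketch: it assumes the Financial PhraseBank run started from a full quota, and that SST-2 items cost about the same per annotation, which may not hold because the sentences and schema differ. The function names are illustrative.

```python
import math

DAILY_TOKEN_QUOTA = 100_000  # from the text: 100k tokens/day
FPB_ANNOTATIONS = 200        # annotations that exhausted the daily budget


def avg_tokens_per_annotation(quota: int, annotations: int) -> float:
    """Average token cost per annotation if the whole quota was consumed."""
    return quota / annotations


def quota_windows_needed(items: int, tokens_per_item: float, quota: int) -> int:
    """Number of daily quota windows needed to annotate `items` items."""
    return math.ceil(items * tokens_per_item / quota)


avg = avg_tokens_per_annotation(DAILY_TOKEN_QUOTA, FPB_ANNOTATIONS)
print(avg)  # 500.0
# Assumption: a 200-item SST-2 sample costs the same per item as the FPB sample.
print(quota_windows_needed(200, avg, DAILY_TOKEN_QUOTA))  # 1
```

Under these assumptions, a 200-item SST-2 sample fits in one fresh quota window but leaves no headroom for a second dataset on the same day.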
| Model | Accuracy | Precision | Recall | F1 (Macro) |
|---|---|---|---|---|
| Tagmatic (Ours) | benchmark pending token quota reset | | | |
| GPT-4 (zero-shot) (Baseline) | | | | |
| Claude 3 Sonnet (zero-shot) (Baseline) | | | | |
| BERT-large (fine-tuned) | | | | |
GPT-4 and Claude 3 zero-shot baselines from published LLM sentiment analysis benchmarks (2023–2024). BERT-large from the GLUE benchmark leaderboard (Devlin et al., 2019). Tagmatic scores will be added when the next benchmark run completes.
Methodology
Model & Version
All Tagmatic annotations run via claude-3-7-sonnet-20250219 with structured JSON output. The schema enforces the label vocabulary, with no free-text fallback.
Schema Design
Each dataset uses a purpose-built Tagmatic schema. Financial PhraseBank uses a 3-class finance-aware schema (positive/neutral/negative). SST-2 uses a binary schema with explicit negation handling.
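As an illustration of how a schema can enforce a closed label vocabulary, the sketch below validates a JSON response against the labels a schema allows. The schema layout, the `label` field name, and the function name are hypothetical; Tagmatic's actual schema format is not shown on this page.

```python
import json

# Hypothetical schema layouts, for illustration only.
FPB_SCHEMA = {
    "name": "financial_sentiment",
    "labels": ["positive", "neutral", "negative"],
}
SST2_SCHEMA = {
    "name": "binary_sentiment",
    "labels": ["positive", "negative"],
}


def parse_annotation(raw_json: str, schema: dict) -> str:
    """Parse a JSON response and reject any label outside the schema vocabulary."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    label = payload.get("label")  # assumed field name
    if label not in schema["labels"]:
        raise ValueError(f"label {label!r} not allowed by schema {schema['name']!r}")
    return label


print(parse_annotation('{"label": "neutral"}', FPB_SCHEMA))
```

With a check like this, a response of `neutral` is accepted under the 3-class schema but rejected under the binary one, so no free-text label can reach the metrics.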
Sampling
200 items are sampled uniformly at random (seed 42) from each dataset's test or validation split using a deterministic Mulberry32 PRNG. Financial PhraseBank: atrost/financial_phrasebank test split (n=970). SST-2: stanfordnlp/sst2 validation split (n=872).
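A seeded sample of this kind can be sketched as follows. The generator is a Python port of the commonly circulated JavaScript Mulberry32 routine, with explicit 32-bit masking. Selecting items with a Fisher-Yates shuffle is an assumption; the page does not say how the PRNG output is turned into indices, so this sketch will not necessarily reproduce Tagmatic's exact sample.

```python
MASK32 = 0xFFFFFFFF


def mulberry32(seed: int):
    """Return a function that yields floats in [0, 1) from a 32-bit seed."""
    state = seed & MASK32

    def next_float() -> float:
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t ^= (t + ((t ^ (t >> 7)) * (t | 61))) & MASK32
        return ((t ^ (t >> 14)) & MASK32) / 4294967296

    return next_float


def sample_indices(n: int, k: int, seed: int) -> list[int]:
    """Pick k distinct indices from range(n) via a seeded Fisher-Yates shuffle."""
    rand = mulberry32(seed)
    indices = list(range(n))
    for i in range(n - 1, 0, -1):
        j = int(rand() * (i + 1))  # j is uniform over 0..i
        indices[i], indices[j] = indices[j], indices[i]
    return indices[:k]


fpb_sample = sample_indices(970, 200, seed=42)   # Financial PhraseBank test split
sst2_sample = sample_indices(872, 200, seed=42)  # SST-2 validation split
print(len(fpb_sample), len(sst2_sample))
```

Because the generator is fully determined by the seed, repeated runs with seed 42 select the same items, which is what makes the comparison repeatable.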
Evaluation
Precision, recall, and F1 are computed per class from TP/FP/FN counts and then macro-averaged. Accuracy is reported separately. Baselines are sourced from peer-reviewed papers (see footnotes). No confidence threshold is applied; all annotations are counted.
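The metric definitions above can be sketched in a few lines. This is a minimal illustration, not Tagmatic's evaluation code; the function name and the toy labels are made up for the example, and a class with no predictions or no gold items is scored 0.0 by assumption.

```python
from collections import Counter


def macro_metrics(y_true: list[str], y_pred: list[str], labels: list[str]) -> dict:
    """Compute accuracy plus macro-averaged precision, recall, and F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Toy example with made-up labels, not benchmark data.
gold = ["positive", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "negative"]
result = macro_metrics(gold, pred, ["positive", "neutral", "negative"])
print(result)
```

Macro averaging weights each class equally, so a rare class counts as much as a common one; that is why macro F1 can differ noticeably from accuracy on imbalanced data.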
No Data Leakage
Tagmatic's model has no dataset-specific fine-tuning. Results represent pure zero-shot, schema-guided classification, the same as production use. Schema descriptions and guidelines are generic, not tuned to the benchmark.
Reproducibility
Benchmark runs fetch data live from the Hugging Face datasets API. Results are computed via Tagmatic's annotation endpoint using generic schema descriptions, with no dataset-specific tuning. Contact help@tagmatic.app to request full methodology details.
See for yourself
Paste your own text and classify it in real time. No signup required — just a schema and a prompt.