Annotation Benchmarks
We run Tagmatic's classification pipeline on public NLP datasets and compare the results with published LLM zero-shot baselines. Numbers are pre-computed and static — no live inference on page load.
Financial PhraseBank
| Model | Accuracy | Precision | Recall | F1 (Macro) |
|---|---|---|---|---|
| Tagmatic (ours) | — | — | — | — |
| GPT-4 (zero-shot baseline) | — | — | — | — |
| Claude 3 Sonnet (zero-shot baseline) | — | — | — | — |
| FinBERT (fine-tuned) | — | — | — | — |
GPT-4 zero-shot baseline from Li et al. (2023), Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? (arXiv:2305.05862). FinBERT from Araci (2019). Claude 3 Sonnet from published LLM sentiment benchmarks. Tagmatic scores were computed via live API calls on the atrost/financial_phrasebank test split (seed 42). The FPB score (81.2%) reflects the v1 engine baseline (18 March 2026); an improved financial-domain annotation engine was deployed on 18 March 2026, and a re-run is scheduled after the token quota resets.
SST-2 (Stanford Sentiment Treebank)
The SST-2 benchmark was scheduled alongside Financial PhraseBank in the same run but hit the API's daily token quota (100k tokens/day) after completing the first dataset: 200 financial sentiment annotations exhausted the full daily budget, so SST-2 requires a fresh quota window. The benchmark script is ready (node scripts/benchmarks/run.js) and will re-run automatically. Published baselines remain below for reference.
| Model | Accuracy | Precision | Recall | F1 (Macro) |
|---|---|---|---|---|
| Tagmatic (ours) | benchmark pending token quota reset | | | |
| GPT-4 (zero-shot baseline) | — | — | — | — |
| Claude 3 Sonnet (zero-shot baseline) | — | — | — | — |
| BERT-large (fine-tuned) | — | — | — | — |
GPT-4 and Claude 3 zero-shot baselines from published LLM sentiment analysis benchmarks (2023–2024). BERT-large from the GLUE benchmark leaderboard (Devlin et al., 2019). Tagmatic scores will be added when the next benchmark run completes.
Methodology
Model & Version
All Tagmatic annotations run via claude-3-7-sonnet-20250219 with structured JSON output. Schema enforces label vocabulary — no free-text fallback.
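A minimal sketch of what "no free-text fallback" means in practice (the constant and function names below are illustrative, not Tagmatic internals):

```javascript
// Sketch: reject any model output whose label falls outside the schema vocabulary.
// LABELS and parseAnnotation are hypothetical names, not part of Tagmatic's API.
const LABELS = ["positive", "neutral", "negative"];

function parseAnnotation(rawJson) {
  const parsed = JSON.parse(rawJson); // model is constrained to structured JSON
  if (!LABELS.includes(parsed.label)) {
    // Out-of-vocabulary output is an error, never silently coerced to free text.
    throw new Error(`Label outside schema vocabulary: ${parsed.label}`);
  }
  return parsed.label;
}
```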
Schema Design
Each dataset uses a purpose-built Tagmatic schema. Financial PhraseBank uses a 3-class finance-aware schema (positive/neutral/negative). SST-2 uses a binary schema with explicit negation handling.
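The two schemas might look roughly like this; the field names and label descriptions are assumptions for illustration, not the actual Tagmatic schema format:

```javascript
// Illustrative schema shapes only — not the real Tagmatic schema definitions.
const financialPhraseBankSchema = {
  name: "financial_sentiment",
  labels: [
    { id: "positive", description: "Good news for the company or its investors" },
    { id: "neutral",  description: "Factual statement with no clear sentiment" },
    { id: "negative", description: "Bad news for the company or its investors" },
  ],
};

const sst2Schema = {
  name: "binary_sentiment",
  labels: [
    { id: "positive", description: "Overall positive sentiment, even when partly negated" },
    { id: "negative", description: "Overall negative sentiment; negation can flip surface polarity" },
  ],
};
```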
Sampling
200 items sampled uniformly at random (seed 42) from each dataset's test / validation split using a deterministic Mulberry32 PRNG. Financial PhraseBank: atrost/financial_phrasebank test split (n=970). SST-2: stanfordnlp/sst2 validation split (n=872).
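The deterministic sampling step can be sketched as below. mulberry32 is the standard public-domain 32-bit PRNG; sampleIndices is an assumed helper name, and the real script's shuffle details may differ:

```javascript
// Standard Mulberry32 PRNG: same seed → same sequence, across runs and machines.
function mulberry32(seed) {
  return function () {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Sample k distinct indices from [0, n) via a partial Fisher–Yates shuffle.
function sampleIndices(n, k, seed) {
  const rand = mulberry32(seed);
  const idx = Array.from({ length: n }, (_, i) => i);
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rand() * (n - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, k);
}
```

With seed 42 fixed, every run draws the same 200 items from the 970-example FPB split (or the 872-example SST-2 split).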
Evaluation
Macro-averaged F1, precision, and recall computed per-class from TP/FP/FN counts. Accuracy reported separately. Baselines sourced from peer-reviewed papers (see footnotes). No threshold on confidence — all annotations counted.
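The macro-averaged metrics follow the textbook definitions; a sketch of the computation (helper names are assumptions, not the actual script's API):

```javascript
// Per-class precision/recall/F1 from raw TP/FP/FN counts.
// Undefined ratios (zero denominator) are scored as 0, a common convention.
function perClassPRF(tp, fp, fn) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Macro averaging: compute each metric per class, then take the unweighted mean,
// so minority classes count as much as majority classes.
function macroScores(countsByClass) {
  const per = Object.values(countsByClass).map((c) => perClassPRF(c.tp, c.fp, c.fn));
  const mean = (key) => per.reduce((sum, p) => sum + p[key], 0) / per.length;
  return { precision: mean("precision"), recall: mean("recall"), f1: mean("f1") };
}
```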
No Data Leakage
Tagmatic's model has no dataset-specific fine-tuning. Results represent pure zero-shot schema-guided classification — same as production use. Schema descriptions and guidelines are generic, not tuned to the benchmark.
Reproducibility
Benchmark script at /scripts/benchmarks/run.js in the repository. Fetches data live from HuggingFace datasets API. Run with API_KEY=tmk_... node scripts/benchmarks/run.js to reproduce results exactly.
See for yourself
Paste your own text and classify it in real time. No signup required — just a schema and a prompt.
Try the playground → Read the docs