Financial PhraseBank

3-class sentiment (positive · neutral · negative) · 200 samples · sentences_50agree
⭐ Fintech-critical
F1 (Macro): 81.2% (v1 engine · v2 results pending)
Accuracy: 81.5% (v1 engine · v2 results pending)
Precision: 79.2% (macro-averaged)
Recall: 85.6% (macro-averaged)
Model comparison: Financial PhraseBank (n=200), v1 engine baseline

| Model | Accuracy | Precision | Recall | F1 (Macro) |
|---|---|---|---|---|
| Tagmatic (ours) | 81.5% | 79.2% | 85.6% | 81.2% |
| GPT-4 (zero-shot, baseline) | 88.2% | 87.9% | 87.1% | 88.0% |
| Claude 3 Sonnet (zero-shot, baseline) | 85.5% | 84.8% | 84.1% | 84.5% |
| FinBERT (fine-tuned) | 87.2% | 86.9% | 85.8% | 86.4% |

GPT-4 baseline from Li et al. (2023), "Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics?" (arXiv:2305.05862). FinBERT from Araci (2019). Claude 3 Sonnet from published LLM sentiment benchmarks. Tagmatic scores were computed via the live API on the atrost/financial_phrasebank test split (seed 42). The Financial PhraseBank F1 score (81.2%) is the v1 engine baseline, dated 18 March 2026. An improved financial-domain annotation engine was deployed on 18 March 2026; a re-run is scheduled after the token quota resets.


SST-2 (Stanford Sentiment Treebank)

Binary sentiment (positive · negative) · 200 samples · validation split
Benchmark run pending

The SST-2 benchmark was scheduled in the same run as Financial PhraseBank, but the run hit the API's daily token quota (100k tokens/day) after completing the first dataset. The 200 financial sentiment annotations exhausted the full daily budget, so SST-2 requires a fresh quota window. The v2 benchmark run is queued, and this page will update automatically when it completes. Published baselines remain for reference.
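The quota figures above imply a rough per-annotation cost, which is enough to estimate how many quota windows a run needs. The sketch below is back-of-the-envelope arithmetic only: it assumes the 200 Financial PhraseBank annotations used essentially the whole 100k budget, and that SST-2 sentences cost about the same per annotation. Neither assumption is confirmed by the benchmark run, and the function name is illustrative.

```python
import math

DAILY_QUOTA_TOKENS = 100_000  # from the text: 100k tokens/day
FPB_ANNOTATIONS = 200         # annotations that used up one day's quota

# Average cost per annotation implied by the Financial PhraseBank run.
tokens_per_annotation = DAILY_QUOTA_TOKENS / FPB_ANNOTATIONS


def quota_windows_needed(n_items: int, tokens_each: float,
                         daily_quota: int = DAILY_QUOTA_TOKENS) -> int:
    """Number of daily quota windows needed to annotate n_items."""
    return math.ceil(n_items * tokens_each / daily_quota)


# Assumption: SST-2 costs about as much per annotation as Financial PhraseBank.
sst2_windows = quota_windows_needed(200, tokens_per_annotation)
both_windows = quota_windows_needed(400, tokens_per_annotation)

print(f"~{tokens_per_annotation:.0f} tokens per annotation")
print(f"SST-2 alone: {sst2_windows} window(s); both datasets: {both_windows} window(s)")
```

Under these assumptions, running both 200-item benchmarks back to back needs two quota windows, which matches the split run described above.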

Published baselines: SST-2 (improved engine queued)

| Model | Accuracy | Precision | Recall | F1 (Macro) |
|---|---|---|---|---|
| Tagmatic (ours) | pending | pending | pending | pending |
| GPT-4 (zero-shot, baseline) | 95.5% | 95.4% | 95.2% | 95.3% |
| Claude 3 Sonnet (zero-shot, baseline) | 94.8% | 94.6% | 94.3% | 94.5% |
| BERT-large (fine-tuned) | 93.5% | 93.3% | 93.1% | 93.2% |

Tagmatic: benchmark pending token quota reset.

GPT-4 and Claude 3 Sonnet zero-shot baselines from published LLM sentiment analysis benchmarks (2023–2024). BERT-large from the GLUE benchmark leaderboard (Devlin et al., 2019). Tagmatic scores will be added when the next benchmark run completes.


Methodology

Model & Version

All Tagmatic annotations run via claude-3-7-sonnet-20250219 with structured JSON output. The schema enforces the label vocabulary, with no free-text fallback.

Schema Design

Each dataset uses a purpose-built Tagmatic schema. Financial PhraseBank uses a 3-class finance-aware schema (positive/neutral/negative). SST-2 uses a binary schema with explicit negation handling.
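To illustrate what enforcing a label vocabulary with no free-text fallback can look like, here is a minimal sketch. The label sets come from the descriptions above; the schema names, the `label` field, and the `parse_annotation` helper are hypothetical and do not reflect Tagmatic's actual schema format or API.

```python
import json

# Label vocabularies taken from the text. The dictionary keys and the
# "label" field name are hypothetical, chosen only for this illustration.
SCHEMAS = {
    "financial_phrasebank": {"positive", "neutral", "negative"},
    "sst2": {"positive", "negative"},
}


def parse_annotation(raw: str, schema: str) -> str:
    """Return the label from a JSON annotation, rejecting off-vocabulary output."""
    allowed = SCHEMAS[schema]
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("annotation must be a JSON object")
    label = payload.get("label")
    # No free-text fallback: anything outside the vocabulary is an error.
    if not isinstance(label, str) or label not in allowed:
        raise ValueError(f"label {label!r} not in {sorted(allowed)}")
    return label


print(parse_annotation('{"label": "neutral"}', "financial_phrasebank"))
```

The same payload would be rejected under the binary `sst2` vocabulary, since "neutral" is not one of its labels.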

Sampling

200 items sampled uniformly at random (seed 42) from each dataset's test / validation split using a deterministic Mulberry32 PRNG. Financial PhraseBank: atrost/financial_phrasebank test split (n=970). SST-2: stanfordnlp/sst2 validation split (n=872).
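The sampling step can be sketched as follows. This is a Python port of the commonly circulated Mulberry32 generator, using 32-bit masking in place of JavaScript's integer semantics. How the benchmark turns PRNG output into a 200-item sample is not stated, so the partial Fisher-Yates shuffle here is an assumption, and the indices it produces will not necessarily match the published run.

```python
from typing import Callable, List

MASK32 = 0xFFFFFFFF


def mulberry32(seed: int) -> Callable[[], float]:
    """Python port of the Mulberry32 PRNG; each call returns a float in [0, 1)."""
    state = seed & MASK32

    def next_float() -> float:
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t ^= (t + ((t ^ (t >> 7)) * (t | 61))) & MASK32
        return ((t ^ (t >> 14)) & MASK32) / 4294967296

    return next_float


def sample_indices(n_total: int, k: int, seed: int = 42) -> List[int]:
    """Draw k distinct indices from range(n_total) via a partial Fisher-Yates shuffle."""
    if not 0 <= k <= n_total:
        raise ValueError("k must be between 0 and n_total")
    rand = mulberry32(seed)
    indices = list(range(n_total))
    for i in range(k):
        # Pick a position uniformly from the not-yet-fixed tail [i, n_total).
        j = i + int(rand() * (n_total - i))
        indices[i], indices[j] = indices[j], indices[i]
    return indices[:k]


# Split sizes from the text: 970 (Financial PhraseBank test), 872 (SST-2 validation).
fpb_sample = sample_indices(970, 200)
sst2_sample = sample_indices(872, 200)
```

Because the generator is seeded, calling `sample_indices` again with the same arguments returns the same indices, which is what makes the sample reproducible.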

Evaluation

Macro-averaged F1, precision, and recall computed per-class from TP/FP/FN counts. Accuracy reported separately. Baselines sourced from peer-reviewed papers (see footnotes). No threshold on confidence — all annotations counted.
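The metric definitions above can be sketched as follows. This is a minimal illustration rather than the benchmark's actual scoring code: it assumes macro F1 is the unweighted mean of per-class F1 scores, and that a class with no predictions or no gold items scores 0. The function name `macro_metrics` is illustrative.

```python
from collections import Counter
from typing import Dict, Sequence


def macro_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                  labels: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 from TP/FP/FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
    }
```

Every annotation is counted, with no confidence threshold, so each item contributes either one true positive or one false positive plus one false negative.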

No Data Leakage

Tagmatic's model has no dataset-specific fine-tuning. Results represent pure zero-shot schema-guided classification — same as production use. Schema descriptions and guidelines are generic, not tuned to the benchmark.

Reproducibility

Benchmarks fetch data live from the HuggingFace datasets API. Results are computed via Tagmatic's annotation endpoint using generic schema descriptions, with no dataset-specific tuning. Contact help@tagmatic.app to request full methodology details.

See for yourself

Paste your own text and classify it in real time. No signup required — just a schema and a prompt.
