Financial PhraseBank

3-class sentiment (positive · neutral · negative) · 200 samples · sentences_50agree · Dataset →
⭐ Fintech-critical
F1 (Macro): 81.2% (−6.8pp vs GPT-4 · v1 baseline)
Accuracy: 81.5% (−6.7pp vs GPT-4 · v1 baseline)
Precision: 79.2% (macro-averaged)
Recall: 85.6% (macro-averaged)
Model comparison — Financial PhraseBank (n=200) — v1 engine baseline

Model                                 Accuracy  Precision  Recall  F1 (Macro)
Tagmatic (ours)                       81.5%     79.2%      85.6%   81.2%
GPT-4 (zero-shot baseline)            88.2%     87.9%      87.1%   88.0%
Claude 3 Sonnet (zero-shot baseline)  85.5%     84.8%      84.1%   84.5%
FinBERT (fine-tuned)                  87.2%     86.9%      85.8%   86.4%

GPT-4 zero-shot baseline from Li et al. (2023), Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? (arXiv:2305.05862). FinBERT from Araci (2019). Claude 3 Sonnet from published LLM sentiment benchmarks. Tagmatic scores computed via live API on the atrost/financial_phrasebank test split, seed 42. The FPB scores above are the v1 engine baseline, run 18 March 2026; an improved financial-domain annotation engine was deployed the same day, and a re-run is scheduled after the token quota resets.


SST-2 (Stanford Sentiment Treebank)

Binary sentiment (positive · negative) · 200 samples · validation split · Dataset →
Benchmark run pending

The SST-2 benchmark was scheduled in the same run as Financial PhraseBank but hit the API's daily token quota (100k tokens/day) after the first dataset completed: the 200 financial-sentiment annotations exhausted the full daily budget, so SST-2 needs a fresh quota window. The benchmark script (node scripts/benchmarks/run.js) is ready and will re-run automatically. Published baselines remain below for reference.

Published baselines — SST-2 (improved engine queued)
Model                                 Accuracy  Precision  Recall  F1 (Macro)
Tagmatic (ours)                       — benchmark pending token quota reset —
GPT-4 (zero-shot baseline)            95.5%     95.4%      95.2%   95.3%
Claude 3 Sonnet (zero-shot baseline)  94.8%     94.6%      94.3%   94.5%
BERT-large (fine-tuned)               93.5%     93.3%      93.1%   93.2%

GPT-4 and Claude 3 zero-shot baselines from published LLM sentiment-analysis benchmarks (2023–2024). BERT-large from the GLUE benchmark leaderboard (Devlin et al., 2019). Tagmatic scores will be added when the next benchmark run completes.


Methodology

Model & Version

All Tagmatic annotations run via claude-3-7-sonnet-20250219 with structured JSON output. Schema enforces label vocabulary — no free-text fallback.

Schema Design

Each dataset uses a purpose-built Tagmatic schema. Financial PhraseBank uses a 3-class finance-aware schema (positive/neutral/negative). SST-2 uses a binary schema with explicit negation handling.
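To make this concrete, a finance-aware 3-class schema can be sketched as a plain object. This is an illustrative assumption, not the exact production schema; the field names (`name`, `labels`, `guidelines`) and the guideline wording are hypothetical.

```javascript
// Hypothetical sketch of a Tagmatic schema for Financial PhraseBank.
// Field names are assumptions; the real schema ships with the benchmark script.
const financialSentimentSchema = {
  name: 'financial_sentiment',
  labels: ['positive', 'neutral', 'negative'], // closed label vocabulary, no free text
  guidelines:
    'Label the sentence from an investor point of view. ' +
    'Neutral covers factual statements with no clear market impact.',
};

module.exports = { financialSentimentSchema };
```

The closed `labels` list is what lets the structured-output layer reject any annotation outside the vocabulary.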

Sampling

200 items are sampled uniformly at random (seed 42) from each dataset's test/validation split using a deterministic Mulberry32 PRNG. Financial PhraseBank: atrost/financial_phrasebank test split (n=970). SST-2: stanfordnlp/sst2 validation split (n=872).

Evaluation

Macro-averaged F1, precision, and recall are computed per class from TP/FP/FN counts; accuracy is reported separately. Baselines are sourced from peer-reviewed papers (see footnotes). No confidence threshold is applied — every annotation counts.
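The scoring described above amounts to the following sketch (`score` is a hypothetical name; the actual implementation lives in the benchmark script):

```javascript
// Macro-averaged precision/recall/F1 from per-class TP/FP/FN, plus accuracy.
// Classes absent from the predictions score 0, which is what drags the macro
// average down when a label is systematically missed.
function score(gold, pred, labels) {
  const perClass = labels.map((label) => {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < gold.length; i++) {
      if (pred[i] === label && gold[i] === label) tp++;
      else if (pred[i] === label) fp++;
      else if (gold[i] === label) fn++;
    }
    const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
    const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
    const f1 =
      precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
    return { precision, recall, f1 };
  });
  const mean = (key) => perClass.reduce((s, m) => s + m[key], 0) / perClass.length;
  const accuracy = gold.filter((g, i) => g === pred[i]).length / gold.length;
  return { accuracy, precision: mean('precision'), recall: mean('recall'), f1: mean('f1') };
}
```

Note that with macro averaging, recall can exceed accuracy (as in the FPB results above) when errors concentrate in the larger classes.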

No Data Leakage

Tagmatic's model has no dataset-specific fine-tuning. Results represent pure zero-shot schema-guided classification — same as production use. Schema descriptions and guidelines are generic, not tuned to the benchmark.

Reproducibility

Benchmark script at /scripts/benchmarks/run.js in the repository. Fetches data live from HuggingFace datasets API. Run with API_KEY=tmk_... node scripts/benchmarks/run.js to reproduce results exactly.

See for yourself

Paste your own text and classify it in real time. No signup required — just a schema and a prompt.

Try the playground → Read the docs