LLM Benchmarking Framework for Financial Sentiment Analysis

// final year project / data analysis

Built a financial sentiment analysis benchmarking framework comparing OpenAI GPT-4o and Google Gemini 2.0 Flash on real-world finance datasets. The project automates dataset processing, prompt standardisation, prediction validation, evaluation metrics, and result visualisation to analyse how different LLMs perform on sentiment classification tasks.

Implemented a full evaluation pipeline in Python and Jupyter across the Financial PhraseBank and FiQA datasets, covering standardised 3-class sentiment prediction (positive, negative, neutral), automated response validation, and label normalisation. Evaluation covered accuracy, macro F1, per-class F1, latency, and hallucination analysis, with statistical significance testing via McNemar's test and comparative confusion matrix visualisations.
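As an illustration of the response validation and label normalisation step, here is a minimal sketch. It assumes each model returns free text that should contain exactly one of the three labels; the function name `normalise_label` and the rule "anything ambiguous or out-of-set becomes `None`" are assumptions made for this example, not the project's actual implementation.

```python
import re

# The three classes used throughout the benchmark.
VALID_LABELS = ("positive", "negative", "neutral")


def normalise_label(raw):
    """Map a raw model response to one of VALID_LABELS.

    Returns None when the response is not a string, mentions no valid
    label, or mentions more than one (hypothetical validation rule).
    """
    if not isinstance(raw, str):
        return None
    text = raw.strip().lower()
    # Collect every valid label that appears as a whole word.
    found = {label for label in VALID_LABELS if re.search(rf"\b{label}\b", text)}
    # Accept the response only if it names exactly one label.
    return found.pop() if len(found) == 1 else None


if __name__ == "__main__":
    for response in [" Positive.\n", "Sentiment: NEGATIVE", "positive or neutral", "bullish"]:
        print(repr(response), "->", normalise_label(response))
```

Returning `None` rather than guessing keeps invalid responses visible, so they can be counted separately from ordinary misclassifications.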

The results showed a clear trade-off between the two models: GPT-4o achieved significantly higher classification performance, while Gemini 2.0 Flash delivered lower latency. The work prioritised reproducible benchmarking, clean evaluation methodology, and structured LLM comparison workflows.
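A performance gap between two models scored on the same sentences can be checked with McNemar's test, which looks only at the items where exactly one model is correct. The sketch below uses the exact (binomial) form of the test and the standard library only; the function name `mcnemar_exact` and the toy labels are illustrative assumptions, and the project may have used a different variant or a library routine.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only model A got right
    and c counts items only model B got right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        # The models never disagree on correctness: no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data, not project results.
    truth = ["positive"] * 10
    model_a = ["positive"] * 10
    model_b = ["negative"] * 8 + ["positive"] * 2
    print(mcnemar_exact(truth, model_a, model_b))
```

The exact form is a reasonable default when the number of discordant pairs is small; with many discordant pairs the chi-squared approximation gives similar p-values.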

→Financial PhraseBank and FiQA dataset handling
→Standardised 3-class sentiment prediction across both models
→Accuracy, macro F1, per-class F1, latency, and hallucination analysis
→Automated response validation and label normalisation
→Statistical significance testing using McNemar's test
→Confusion matrices and comparative visualisations
→Optimised API usage to reduce failed calls and improve cost efficiency
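The classification metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration of accuracy, per-class F1, and macro F1 for the three labels; the helper names are assumptions made for this example, and in practice scikit-learn's `accuracy_score` and `f1_score(..., average="macro")` cover the same ground.

```python
LABELS = ("positive", "negative", "neutral")


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, labels=LABELS):
    """F1 for each label, treating that label as the positive class."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Toy data, not project results.
    gold = ["positive", "positive", "negative", "neutral"]
    pred = ["positive", "neutral", "negative", "neutral"]
    print(accuracy(gold, pred), per_class_f1(gold, pred), macro_f1(gold, pred))
```

Macro F1 weights the three classes equally, so a model cannot score well by favouring the most frequent label.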

evidence

Model Evaluation Evidence

Comparative benchmark outputs, classification metrics, and latency analysis artifacts from the financial sentiment LLM evaluation framework.

//stack

Python, OpenAI API, Gemini API, Pandas, NumPy, Scikit-learn, Matplotlib, Jupyter Notebook