Supplemental materials to "Benchmarking Local Language Models for Social Robots using Edge Devices"
Authors/Creators
Description
This record accompanies the paper "Benchmarking Local Language Models for Social Robots using Edge Devices" [accepted IEEE ARSO 2026] and contains the raw benchmark data, MMLU scores, automated teaching-effectiveness ratings, human rater sheets, and the analysis notebook supporting the results reported therein.
Overview
We benchmarked 25 open-source language models for local deployment on edge hardware in a social-educational robotics context (the Robot Study Companion project, rsc.ee). Each model was evaluated across three dimensions — inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality validated against five independent human raters) — primarily on the Raspberry Pi 4, with scalability comparisons on the Raspberry Pi 5 and a laptop NVIDIA RTX 4060 GPU.
This record contains: per-query hardware telemetry across all three platforms, per-model MMLU scores, GPT-4o-mini teaching-effectiveness ratings, five human rater workbooks, and the notebook that computes inter-rater agreement statistics. Readers should consult the paper for methodology, results, and discussion; this record serves as the underlying evidentiary base.
Contents
```
.
├── benchmarks/
│   ├── benchmarks_merged.csv            727-row consolidated per-query telemetry across all platforms
│   ├── results_pi4/                     7 CSV files, per-query benchmarks on Raspberry Pi 4
│   ├── results_pi5/                     3 CSV files, per-query benchmarks on Raspberry Pi 5
│   ├── results_computer/                24 CSV files, per-query benchmarks on laptop GPU
│   └── fig1_final.pdf                   Figure 1 (benchmark summary)
├── MMLU/
│   ├── MMLU_merged.csv                  25-row merged MMLU results (model tags harmonised with benchmarks)
│   └── models_MMLU_scores/              25 paired CSV+JSON files, per-model MMLU results
└── rated_teaching/
    ├── human_ratings_merged.csv         200-row merged human rater workbook data, deblinded
    ├── teaching_effectiveness_ratings/  GPT-4o-mini per-response ratings (250 rows)
    └── human_rate_gpt4o/                5 rater workbooks + analysis
        ├── annotation_analysis.ipynb    human-rater analysis (α, ICC, Pearson r, Fig 2)
        └── figure_2_human_annotation.pdf  Figure 2 (human annotation validation)
```
Data dictionary
benchmarks/results_*/benchmark_all_models_*.csv
Per-query records from benchmark runs. Each row captures one model answering one question, with full response text and hardware telemetry. Filenames encode the run timestamp: benchmark_all_models_YYYYMMDD_HHMMSS.csv.
Shared columns (all platforms, 18 fields):
| Column | Description |
|---|---|
| `timestamp` | ISO-8601 start of inference |
| `model` | Ollama model tag (e.g. `qwen3:0.6b`) |
| `model_parameters` | Nominal parameter count, as reported |
| `question` | Benchmark prompt text (see Table I in paper) |
| `response` | Full model-generated answer (not truncated) |
| `response_length_chars` | Character count of the response |
| `estimated_tokens` | Token count estimated from streaming chunks (see caveat) |
| `inference_time_s` | Total generation time, via `time.time()` |
| `time_to_first_token_s` | Latency to first streamed chunk, via `time.time()` |
| `tokens_per_second` | Throughput in tokens per second |
| `cpu_baseline_percent`, `cpu_average_percent` | CPU load (pre-inference baseline; average during inference) |
| `cpu_per_core` | Per-core utilisation, stringified Python list; parse with `ast.literal_eval` |
| `cpu_freq_mhz` | CPU frequency during inference |
| `memory_baseline_mb`, `memory_peak_mb`, `memory_increase_mb`, `memory_percent` | RAM telemetry |
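For example, a minimal sketch of loading one per-session CSV and turning the stringified `cpu_per_core` column back into a Python list (the filename timestamp below is illustrative; substitute any `benchmark_all_models_YYYYMMDD_HHMMSS.csv`):

```python
import ast

import pandas as pd

# Hypothetical session file; pick any CSV from results_pi4/, results_pi5/ or results_computer/
path = "benchmarks/results_pi4/benchmark_all_models_20251124_101500.csv"

df = pd.read_csv(path, parse_dates=["timestamp"])

# cpu_per_core is stored as a stringified Python list
df["cpu_per_core"] = df["cpu_per_core"].apply(ast.literal_eval)

print(df[["model", "question", "tokens_per_second", "cpu_per_core"]].head())
```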
Raspberry Pi-only additional columns (19 fields, present in results_pi4/ and results_pi5/):
| Column | Description |
|---|---|
| `temperature_c` | CPU die temperature (`vcgencmd measure_temp`) |
| `throttled` | Throttling flag (`vcgencmd get_throttled`) |
| `avg_voltage_v` | Mean rail voltage during inference (`vcgencmd measure_volts`) |
| `estimated_current_a` | Current estimate from linear CPU-load model (idle 0.6 A, full-load 3 A at 5 V) |
| `avg_power_watts` | V × A, averaged over inference |
| `total_energy_joules` | Estimated energy consumption during inference |
| `tokens_per_joule` | Energy-efficiency metric (`estimated_tokens` / `total_energy_joules`) |
| `io_read_count`, `io_write_count`, `io_read_bytes`, `io_write_bytes`, `io_read_time_ms`, `io_write_time_ms`, `io_total_ops`, `io_total_bytes`, `io_iops`, `io_throughput_mb_s`, `io_avg_read_latency_ms`, `io_avg_write_latency_ms` | Disk I/O telemetry |
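The power and energy columns follow from the linear approximation stated above. A minimal sketch of that model (the benchmark script samples voltage and CPU load repeatedly during inference; here a single averaged value stands in for that loop):

```python
def estimate_energy(cpu_average_percent: float, avg_voltage_v: float, inference_time_s: float):
    """Linear CPU-load-to-current model described in the data dictionary:
    idle draw 0.6 A, full-load draw 3.0 A, on a nominal 5 V rail."""
    idle_a, full_a = 0.6, 3.0
    estimated_current_a = idle_a + (full_a - idle_a) * (cpu_average_percent / 100.0)
    avg_power_watts = avg_voltage_v * estimated_current_a
    total_energy_joules = avg_power_watts * inference_time_s
    return estimated_current_a, avg_power_watts, total_energy_joules

# e.g. 80 % average CPU load at 5.02 V for a 42-second generation
print(estimate_energy(80.0, 5.02, 42.0))
```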
Laptop-only additional columns (4 fields, present in results_computer/ only):
| Column | Description |
|---|---|
inference_time_perf_s |
Generation time via time.perf_counter() |
time_to_first_token_perf_s |
TTFT via time.perf_counter() |
tokens_per_second_perf |
Throughput computed from perf_counter timings |
timing_diff_ms |
Precision difference between time.time() and time.perf_counter() |
benchmarks/benchmarks_merged.csv
Consolidated per-query telemetry across all three platforms. 727 rows (250 Pi 4 + 237 Pi 5 + 240 laptop) × 42 columns (41 raw columns unioned across platforms, plus a platform identifier prepended). One row per (model, platform, question); the qwen3:1.7b laptop double-run contributes 20 rows rather than 10 (see caveats).
Columns: platform (rpi4 / rpi5 / laptop), followed by all columns documented above. Pi-only columns fill as NaN on laptop rows; laptop-only *_perf columns fill as NaN on Pi rows. No values are transformed; the file is a pure union merge of the per-session CSVs with platform provenance added.
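As an illustration of working with the merged file, the sketch below (pandas assumed) computes median throughput per platform and model; the median is used rather than the mean because of the degenerate falcon3:3b rows flagged in the caveats:

```python
import pandas as pd

bench = pd.read_csv("benchmarks/benchmarks_merged.csv")

# Median tokens/second per (platform, model), pivoted so platforms become columns
tps = (bench.groupby(["platform", "model"])["tokens_per_second"]
            .median()
            .unstack("platform"))

print(tps.sort_values("rpi4", ascending=False).head(10))
```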
MMLU/models_MMLU_scores/{model}_MMLU.csv and .json
One-row-per-model aggregate MMLU results on the six-category subset used in the paper. Both formats preserved: CSV flattens per-task scores into columns (score_{task}); JSON preserves the native task_scores dict structure.
| Column | Description |
|---|---|
| `model_name` | Ollama model tag |
| `overall_score` | Mean accuracy across the six categories (0–1 scale) |
| `tasks` | Comma-separated task list |
| `num_tasks` | Task count (always 6) |
| `n_shots` | Prompting shots (always 3) |
| `timestamp` | Run timestamp |
| `status` | Success or failure marker |
| `score_{task}` | Per-task accuracy (0–1) |
MMLU/MMLU_merged.csv
25-row summary aggregating the per-model files above; underlies Table II's MMLU column and Table III. Columns: model, overall_score, n_shots, status, timestamp, and the six per-task score_* columns. Model tags harmonised with the benchmarks dataset — the upstream per-model file nemotron-mini_MMLU.csv appears here as nemotron-mini:4b to match the Ollama canonical form used elsewhere in the record.
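To illustrate the two preserved per-model formats, a small sketch that reads the paired CSV and JSON for one model (the `nemotron-mini` pair named above; the JSON is assumed to share the CSV's basename and the field names follow the data dictionary):

```python
import json

import pandas as pd

# Flattened CSV: one row, per-task scores as score_{task} columns
flat = pd.read_csv("MMLU/models_MMLU_scores/nemotron-mini_MMLU.csv")
print(flat.filter(like="score_").iloc[0])

# JSON: preserves the native task_scores dict
with open("MMLU/models_MMLU_scores/nemotron-mini_MMLU.json") as f:
    record = json.load(f)
print(record["model_name"], record["overall_score"], record["task_scores"])
```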
rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv
Per-response GPT-4o-mini ratings (250 rows = 25 models × 10 questions). Ratings produced via the OpenAI API; the rating prompt appears in our paper under §III-B.
| Column | Description |
|---|---|
| `model`, `model_parameters` | Model identifiers |
| `question` | Benchmark prompt |
| `response_preview` | First 200 characters of the model response (full text in the benchmark CSVs) |
| `score` | Teaching-effectiveness rating, 1–10 scale |
| `strengths`, `weaknesses` | Stringified lists of rater-identified strengths and weaknesses |
| `justification` | One- or two-sentence rationale |
| `tokens_per_second`, `inference_time_s` | Carried over from the benchmark run for joint analysis |
| `error` | Populated when the rating JSON failed to parse (`score` then defaults to 0) |
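Because `strengths` and `weaknesses` are stringified lists, a minimal parsing sketch (pandas assumed; the filter also drops the three parse-failure rows described in the caveats):

```python
import ast

import pandas as pd

gpt = pd.read_csv("rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv")

# Keep only successfully rated rows (no error flag, score > 0)
valid = gpt[(gpt["score"] > 0) & (gpt["error"].isna())].copy()

# strengths / weaknesses are stringified Python lists
for col in ("strengths", "weaknesses"):
    valid[col] = valid[col].apply(ast.literal_eval)

print(valid[["model", "score", "strengths"]].head())
```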
rated_teaching/human_rate_gpt4o/teaching_eval_{pseudonym}.xlsx
One workbook per rater; rater pseudonyms: bird, duck, sky, squirrel, tree. Each workbook contains three sheets:
- Instructions — participant information, ethics statement, consent block
- Responses — 40 rows (4 models × 10 questions) × 8 teaching-quality criteria on a 1–10 scale, plus an optional Comments column
- GPT Scores (DO NOT VIEW) — GPT-4o-mini scores for the same 40 responses, intended to be consulted only after annotation
Raters received model responses blinded via A/B/C/D labels (A=Gemma3 0.27B, B=Gemma3 1B, C=Granite4 Tiny Hybrid 7B, D=Mistral 7B). No personally identifiable data was retained; the contact email in the consent block is the principal contact above.
rated_teaching/human_ratings_merged.csv
200-row merged dataset aggregating all five rater workbooks (5 raters × 4 models × 10 questions). One row per (rater, model, question). Models are deblinded to their canonical Ollama tags; the A/B/C/D labels visible to raters are dropped.
| Column | Description |
|---|---|
| `rater` | Rater pseudonym: bird, duck, sky, squirrel, or tree |
| `model` | Ollama model tag (deblinded from A/B/C/D) |
| `question_num` | Question index (1–10) |
| `question` | Benchmark prompt (full text) |
| `response_preview` | Truncated model response as shown to the rater |
| `clarity`, `accuracy`, `engagement`, `structure`, `completeness`, `appropriate_level`, `examples_analogies`, `actionable` | Per-criterion rating, 1–10 scale |
| `mean_score` | Arithmetic mean across the eight criteria |
| `comments` | Rater's free-text comment, if any (otherwise NaN) |
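A small sketch (pandas assumed) that sanity-checks the stored `mean_score` against the eight criterion columns and aggregates per model:

```python
import pandas as pd

human = pd.read_csv("rated_teaching/human_ratings_merged.csv")

criteria = ["clarity", "accuracy", "engagement", "structure", "completeness",
            "appropriate_level", "examples_analogies", "actionable"]

# mean_score should equal the row-wise mean of the eight criteria (up to any rounding applied on export)
diff = (human[criteria].mean(axis=1) - human["mean_score"]).abs().max()
print(f"max |recomputed - stored| mean_score: {diff:.4f}")

# Per-model mean across all raters and questions
print(human.groupby("model")["mean_score"].mean().sort_values(ascending=False))
```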
rated_teaching/human_rate_gpt4o/annotation_analysis.ipynb
Jupyter notebook computing the statistics reported in paper §IV-E: Krippendorff's α, ICC(C,1) and ICC(C,k), Pearson r, mean absolute difference. Generates Figure 2.
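For readers who prefer a script, the sketch below approximates the same statistics with the `krippendorff`, `pingouin`, and `scipy` packages. The notebook remains the authoritative implementation; the unit definition and the join on full question text below are assumptions, not the notebook's exact procedure:

```python
import pandas as pd
import pingouin as pg              # pip install pingouin
import krippendorff                # pip install krippendorff
from scipy.stats import pearsonr

human = pd.read_csv("rated_teaching/human_ratings_merged.csv")
gpt = pd.read_csv("rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv")

# One "unit" per (model, question) pair rated by all five raters
human["unit"] = human["model"] + "|" + human["question_num"].astype(str)

# Krippendorff's alpha on the raters x units matrix of per-response mean scores
matrix = human.pivot(index="rater", columns="unit", values="mean_score")
alpha = krippendorff.alpha(reliability_data=matrix.to_numpy(),
                           level_of_measurement="interval")

# pingouin's ICC3 / ICC3k correspond to the consistency forms ICC(C,1) / ICC(C,k)
icc = pg.intraclass_corr(data=human, targets="unit", raters="rater", ratings="mean_score")
icc_c1 = icc.loc[icc["Type"] == "ICC3", "ICC"].item()
icc_ck = icc.loc[icc["Type"] == "ICC3k", "ICC"].item()

# Pearson r between the human consensus and the GPT-4o-mini score per (model, question)
consensus = human.groupby(["model", "question"])["mean_score"].mean().reset_index()
merged = consensus.merge(gpt[["model", "question", "score"]], on=["model", "question"])
r, p = pearsonr(merged["mean_score"], merged["score"])

print(f"alpha={alpha:.3f}  ICC(C,1)={icc_c1:.3f}  ICC(C,k)={icc_ck:.3f}  r={r:.3f} (p={p:.3g})")
```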
Methodology
| Metric | Description | Type of metric |
|---|---|---|
| Tokens Per Second (TPS) | Number of tokens generated by the model per second (total number of tokens divided by total generation time). | Hardware |
| Inference time | Total time spent generating one output (from the query being input to the model until the last generated token). | Hardware |
| Time To First Token (TTFT) | Time taken to generate the first token of the output. | Hardware |
| Response length | Total number of characters in an output. | Hardware |
| IOPS | Disk input/output operations per second. | Hardware |
| Tokens Per Joule (TPJ) | Number of tokens generated by the model per joule of energy consumed. | Hardware |
| Massive Multitask Language Understanding (MMLU) | Benchmark used to assess the general knowledge of a model across multiple topics. | Accuracy |
| Teaching effectiveness | Rating of the output against eight teaching criteria on a 1–10 scale, judged by a larger LLM (GPT-4o-mini). | Accuracy |
| Human rater | Rating of the output against the same eight teaching criteria on a 1–10 scale, judged by five independent human raters on a four-model subset (200 annotations total). | Accuracy |
For full methodology, please consult paper §III. In brief:
- MMLU subset covers Formal Logic, Global Facts, College Computer Science, College Mathematics, Marketing, and High School Macroeconomics (1,050 questions total). DeepEval [16] orchestrates 3-shot prompting at temperature 0.1. The evaluation runs on an NVIDIA A100 (40GB) via University of Tartu HPC, since MMLU accuracy is hardware-agnostic.
- Inference benchmarks use Ollama with streaming enabled on the target platform (a minimal streaming-timing sketch follows this list). Ten pedagogical questions per model (Table I in paper) cover explanatory depth, adaptability, misconception handling, and student guidance. Models above 1.4B parameters receive a structured system prompt; models below receive the raw question only, owing to sensitivity to prompt formatting at small parameter counts.
- Teaching-effectiveness ratings use GPT-4o-mini against eight criteria (clarity, accuracy, engagement, structure, completeness, appropriate level, examples/analogies, actionable). The rating prompt appears verbatim in paper §III-B.
- Human validation (paper §IV-E) covers four representative models × 10 questions × 5 raters = 200 annotations across the same eight criteria.
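The streaming setup referenced in the second bullet can be sketched as follows. This assumes the `ollama` Python client with a local Ollama server and a pulled model; the question text and model tag are illustrative, and the actual scripts in the linked repository collect far more telemetry and use `time.time()` on the Pis:

```python
import time

import ollama  # pip install ollama


def timed_query(model: str, question: str) -> dict:
    """Stream one response and record TTFT, total time, and chunk-based throughput."""
    start = time.perf_counter()
    ttft = None
    chunks = 0
    text = []
    for chunk in ollama.chat(model=model,
                             messages=[{"role": "user", "content": question}],
                             stream=True):
        if ttft is None:
            ttft = time.perf_counter() - start
        text.append(chunk["message"]["content"])
        chunks += 1  # streamed-chunk count stands in for token count (see caveats)
    total = time.perf_counter() - start
    return {"response": "".join(text),
            "time_to_first_token_s": ttft,
            "inference_time_s": total,
            "estimated_tokens": chunks,
            "tokens_per_second": chunks / total if total else 0.0}


print(timed_query("qwen3:0.6b", "Explain photosynthesis to a first-year student."))
```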
Known caveats and divergences
- Raspberry Pi 5 scope. Paper §III-D reports scalability on three Raspberry Pi 5 models (`qwen3:0.6b`, `gemma3:1b`, `granite4:tiny-h`), selected on TPS/MMLU/size criteria from the RPi 4 results for Robot Study Companion architecture planning. This record contains Pi 5 data for 24 of the 25 models across three sessions (2025-12-01, 2025-12-02, 2025-12-10). `granite4:1b-h` is the sole model not covered on the Pi 5. Users may analyse the broader Pi 5 dataset; the paper's three-model subset remains the analytical focus cited therein.
- Teaching-rating parse failures. Three rows in `teaching_effectiveness_ratings.csv` carry `score=0` and `error="Failed to parse rating"`: nemotron-mini:4b (question 9), phi4-mini-reasoning:3.8b (question 2), tinyllama:latest (question 3). Their `response_preview` fields contain the model output that GPT-4o-mini could not parse into the expected JSON rubric. These three rows are excluded from the per-model teaching-effectiveness summary used in paper Table II.
- Nemotron-mini MMLU format violations. The `nemotron-mini_MMLU.csv` file records a 0% aggregate due to repeated output-format violations; see Table II's ‡ footnote. The file is retained for completeness and reproducibility of the failure mode.
- Granite4 1B quantisation note. Paper §IV-B flags that `granite4:1b` ships in BF16 (not Q4_K_M), yielding an atypical 3.3 GB on-disk footprint and 0.89 TPS; these values are not directly comparable to its Q4-quantised peers.
- Single-run hardware metrics. Runtime constraints on the RPi 4 precluded multiple independent runs per model. All hardware metrics (TPS, TPJ, inference time, etc.) are single-run values; thermal drift may introduce unquantified variance into absolute values while preserving relative rankings.
- Token counting. TPS and related metrics use streamed-chunk counts rather than tokeniser-level token counts. This may introduce minor discrepancies across runtimes but does not affect relative comparisons within the benchmark.
- Power estimation uncertainty. Absolute tokens-per-joule values carry an estimated ±15–20% uncertainty owing to the linear CPU-load-to-current approximation (idle 0.6 A, full-load 3 A at 5 V nominal). Relative rankings remain stable.
- Reasoning-model throughput. DeepSeek-R1 (1.5B, 7B) and Phi4-mini-reasoning (3.8B) emit internal chain-of-thought tokens in the runtime stream. Their throughput and latency metrics may therefore appear optimistic relative to the useful output delivered.
- Laptop coverage. The laptop dataset covers 23 of the 25 models; `granite4:1b-h` and `granite4:3b-h` were not run on the laptop. Including this gap, full-platform coverage is: Pi 4 all 25, Pi 5 24 (missing `granite4:1b-h`), laptop 23 (missing both hybrid variants).
- qwen3:1.7b double-run on laptop. `qwen3:1.7b` was run twice on the laptop (sessions 2025-11-19 15:26 and 15:29); both runs are preserved in `results_computer/` and in `benchmarks_merged.csv`. Rows are distinguished via `timestamp`. All other (model, platform) pairs have a single run.
- falcon3:3b degenerate generations on Pi 4. falcon3:3b on Pi 4 produced a one-character response to question 10 ("I am strugulling with C++ where should i start?") and a zero-character response to question 8 ("what are the steps to write a master thesis?"). Both rows have NaN power and energy metrics (the `vcgencmd` sampler did not fire during these runs). The question-8 row carries `tokens_per_second` ≈ 0.2, depressing falcon3:3b's Pi 4 throughput aggregate; exclude it via `response_length_chars > 0` when computing per-model means.
- GPT-4o-mini hallucinated scores for degenerate responses. Following on from the falcon3:3b Pi 4 runs above, GPT-4o-mini rated question 8 at 7/10 and question 10 at 6/10 with no error flags and empty or near-empty `response_preview` fields. These rows inflate falcon3:3b's teaching aggregate; exclude them via `response_length_chars > 0` if joining with the benchmarks for analysis (a filtering sketch follows this list). Combined with the three parse failures above, the total of known GPT-rating anomalies stands at five.
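A minimal sketch of the exclusion filter recommended in the last two caveats, joining the GPT ratings onto the Pi 4 benchmark rows (pandas assumed; the join on full question text assumes it matches exactly across files):

```python
import pandas as pd

bench = pd.read_csv("benchmarks/benchmarks_merged.csv")
gpt = pd.read_csv("rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv")

# Pi 4 rows with a non-empty response, as recommended above
pi4_ok = bench[(bench["platform"] == "rpi4") & (bench["response_length_chars"] > 0)]

# Join ratings to surviving benchmark rows; score > 0 also drops the three parse-failure rows
joined = gpt[gpt["score"] > 0].merge(
    pi4_ok[["model", "question", "tokens_per_second", "total_energy_joules"]],
    on=["model", "question"],
    how="inner",
    suffixes=("_rating", "_bench"),
)

print(joined.groupby("model")["score"].mean().sort_values(ascending=False))
```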
Reproducibility
- Benchmark scripts. Available at https://github.com/RobotStudyCompanion/Benchmark_LM/releases/tag/v0.1
- Analysis notebook. `rated_teaching/human_rate_gpt4o/annotation_analysis.ipynb`, relative to the root of this record.
- Hardware. Raspberry Pi 4 Model B (8 GB), Raspberry Pi 5 (8 GB), and a laptop with an NVIDIA RTX 4060 GPU. Both Pis ran Raspberry Pi OS Lite (64-bit, 2025-11-24 release, kernel 6.12, Debian 12 bookworm). MMLU evaluation was performed on an NVIDIA A100 (40 GB) via University of Tartu HPC services.
Files
- supplemental_materials_arso2026.zip (2.1 MB, md5:6a3032839b2af18e3b5c78d9974a1bda)
Additional details
Related works
- Is documented by: https://github.com/RobotStudyCompanion/Benchmark_LM/releases/tag/v0.1 (Software)
Funding
- Estonian Research Council, grant PRG3237
Software
- Repository URL: https://github.com/RobotStudyCompanion/Benchmark_LM
- Programming languages: Python, Shell
- Development status: Active