Published April 18, 2026 | Version v1
Dataset · Open

Supplemental materials to "Benchmarking Local Language Models for Social Robots using Edge Devices"

Description

This record accompanies the paper "Benchmarking Local Language Models for Social Robots using Edge Devices" [accepted IEEE ARSO 2026] and contains the raw benchmark data, MMLU scores, automated teaching-effectiveness ratings, human rater sheets, and the analysis notebook supporting the results reported therein.

Overview

We benchmarked 25 open-source language models for local deployment on edge hardware in a social-educational robotics context (the Robot Study Companion project, rsc.ee). Each model was evaluated across three dimensions — inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality validated against five independent human raters) — primarily on the Raspberry Pi 4, with scalability comparisons on the Raspberry Pi 5 and a laptop NVIDIA RTX 4060 GPU.

This record contains: per-query hardware telemetry across all three platforms, per-model MMLU scores, GPT-4o-mini teaching-effectiveness ratings, five human rater workbooks, and the notebook that computes inter-rater agreement statistics. Readers should consult the paper for methodology, results, and discussion; this record serves as the underlying evidentiary base.

Contents

.
├── benchmarks/
│   ├── benchmarks_merged.csv                 727-row consolidated per-query telemetry across all platforms
│   ├── results_pi4/                          7 CSV files, per-query benchmarks on Raspberry Pi 4
│   ├── results_pi5/                          3 CSV files, per-query benchmarks on Raspberry Pi 5
│   ├── results_computer/                     24 CSV files, per-query benchmarks on laptop GPU
│   └── fig1_final.pdf                        Figure 1 (benchmark summary)
├── MMLU/
│   ├── MMLU_merged.csv                       25-row merged MMLU results (model tags harmonised with benchmarks)
│   └── models_MMLU_scores/                   25 paired CSV+JSON files, per-model MMLU results
└── rated_teaching/
    ├── human_ratings_merged.csv              200-row merged human rater workbook data, deblinded
    ├── teaching_effectiveness_ratings/       GPT-4o-mini per-response ratings (250 rows)
    └── human_rate_gpt4o/                     5 rater workbooks + analysis
        ├── annotation_analysis.ipynb         human-rater analysis (α, ICC, Pearson r, Fig 2)
        └── figure_2_human_annotation.pdf     Figure 2 (human annotation validation)

Data dictionary

benchmarks/results_*/benchmark_all_models_*.csv

Per-query records from benchmark runs. Each row captures one model answering one question, with full response text and hardware telemetry. Filenames encode the run timestamp: benchmark_all_models_YYYYMMDD_HHMMSS.csv.
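For instance, a small helper to recover the run timestamp from a session filename (a sketch only; the example name reuses the 2025-11-19 15:26 laptop session mentioned under caveats, with the seconds field invented):

    import re
    from datetime import datetime

    def run_timestamp(filename: str) -> datetime:
        """Parse the YYYYMMDD_HHMMSS suffix of a per-session CSV name."""
        m = re.search(r"benchmark_all_models_(\d{8}_\d{6})\.csv$", filename)
        if m is None:
            raise ValueError(f"unexpected filename: {filename}")
        return datetime.strptime(m.group(1), "%Y%m%d_%H%M%S")

    print(run_timestamp("benchmark_all_models_20251119_152600.csv"))
    # 2025-11-19 15:26:00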

Shared columns (all platforms, 18 fields):

Column Description
timestamp ISO-8601 start of inference
model Ollama model tag (e.g. qwen3:0.6b)
model_parameters Nominal parameter count, as reported
question Benchmark prompt text (see Table I in paper)
response Full model-generated answer (not truncated)
response_length_chars Character count of the response
estimated_tokens Token count estimated from streaming chunks (see caveat)
inference_time_s Total generation time, via time.time()
time_to_first_token_s Latency to first streamed chunk, via time.time()
tokens_per_second Throughput
cpu_baseline_percent, cpu_average_percent CPU load (pre-inference baseline; average during inference)
cpu_per_core Per-core utilisation, stringified Python list; parse with ast.literal_eval (see the sketch after this table)
cpu_freq_mhz CPU frequency during inference
memory_baseline_mb, memory_peak_mb, memory_increase_mb, memory_percent RAM telemetry
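A minimal loading sketch for one per-session CSV, including the cpu_per_core de-stringification noted above (the filename here is illustrative, not an actual session name):

    import ast
    import pandas as pd

    # Load one per-session benchmark CSV (hypothetical filename).
    df = pd.read_csv("benchmarks/results_pi4/benchmark_all_models_20251201_093000.csv")

    # cpu_per_core ships as a stringified Python list; recover the list of floats.
    df["cpu_per_core"] = df["cpu_per_core"].apply(ast.literal_eval)
    print(df.loc[0, "cpu_per_core"])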

Raspberry Pi-only additional columns (19 fields, present in results_pi4/ and results_pi5/):

Column Description
temperature_c CPU die temperature (vcgencmd measure_temp)
throttled Throttling flag (vcgencmd get_throttled)
avg_voltage_v Mean rail voltage during inference (vcgencmd measure_volts)
estimated_current_a Current estimate from linear CPU-load model (idle 0.6A, full-load 3A at 5V)
avg_power_watts V × A, averaged over inference
total_energy_joules Estimated energy consumption during inference
tokens_per_joule Energy efficiency metric (estimated_tokens / total_energy_joules)
io_read_count, io_write_count, io_read_bytes, io_write_bytes, io_read_time_ms, io_write_time_ms, io_total_ops, io_total_bytes, io_iops, io_throughput_mb_s, io_avg_read_latency_ms, io_avg_write_latency_ms Disk I/O telemetry
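The energy chain above can be traced end to end from these columns. Below is a minimal sketch of the documented linear CPU-load-to-current model (idle 0.6 A, full load 3.0 A), using the nominal 5 V rail for simplicity; the shipped data uses the measured avg_voltage_v, and the caveats below note ±15–20% uncertainty on absolute values:

    # Linear current model as documented: idle 0.6 A, full load 3.0 A at 5 V nominal.
    IDLE_A, FULL_A, VOLTS = 0.6, 3.0, 5.0

    def estimated_current_a(cpu_percent: float) -> float:
        """Interpolate supply current from average CPU load (0-100%)."""
        return IDLE_A + (FULL_A - IDLE_A) * (cpu_percent / 100.0)

    cpu_avg, t_inference_s, n_tokens = 85.0, 120.0, 240   # illustrative values only
    power_w = VOLTS * estimated_current_a(cpu_avg)        # ~ avg_power_watts
    energy_j = power_w * t_inference_s                    # ~ total_energy_joules
    print(n_tokens / energy_j)                            # ~ tokens_per_joule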

Laptop-only additional columns (4 fields, present in results_computer/ only):

Column Description
inference_time_perf_s Generation time via time.perf_counter()
time_to_first_token_perf_s TTFT via time.perf_counter()
tokens_per_second_perf Throughput computed from perf_counter timings
timing_diff_ms Precision difference between time.time() and time.perf_counter()

benchmarks/benchmarks_merged.csv

Consolidated per-query telemetry across all three platforms. 727 rows (250 Pi 4 + 237 Pi 5 + 240 laptop) × 42 columns (41 raw columns unioned across platforms, plus a platform identifier prepended). One row per (model, platform, question); the qwen3:1.7b laptop double-run contributes 20 rows rather than 10 (see caveats).

Columns: platform (rpi4 / rpi5 / laptop), followed by all columns documented above. Pi-only columns fill as NaN on laptop rows; laptop-only *_perf columns fill as NaN on Pi rows. No values are transformed; the file is a pure union merge of the per-session CSVs with platform provenance added.
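The merge can be rebuilt from the per-session files. A pandas sketch, assuming the directory layout shown under Contents and that each CSV carries exactly the documented columns (pd.concat unions the columns and fills gaps with NaN, matching the behaviour described above):

    from pathlib import Path
    import pandas as pd

    parts = []
    for platform, folder in [("rpi4", "results_pi4"),
                             ("rpi5", "results_pi5"),
                             ("laptop", "results_computer")]:
        for csv_path in sorted(Path("benchmarks", folder).glob("*.csv")):
            part = pd.read_csv(csv_path)
            part.insert(0, "platform", platform)   # prepend provenance column
            parts.append(part)

    merged = pd.concat(parts, ignore_index=True)   # union merge, NaN-filled
    assert merged.shape == (727, 42)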

MMLU/models_MMLU_scores/{model}_MMLU.csv and .json

One-row-per-model aggregate MMLU results on the six-category subset used in the paper. Both formats are preserved: the CSV flattens per-task scores into columns (score_{task}); the JSON preserves the native task_scores dict structure.

Column Description
model_name Ollama model tag
overall_score Mean accuracy across the six categories (0–1 scale)
tasks Comma-separated task list
num_tasks Task count (always 6)
n_shots Prompting shots (always 3)
timestamp Run timestamp
status success or failure marker

score_formal_logic, score_global_facts, score_college_computer_science, score_college_mathematics, score_marketing, score_high_school_macroeconomics Per-task accuracy (0–1)
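To make the CSV/JSON pairing concrete, a sketch that flattens one per-model JSON record into the CSV column layout (field names follow the dictionary above and are an assumption; the nemotron-mini file serves as the example):

    import json
    import pandas as pd

    # Load the native JSON form, which keeps the task_scores dict intact.
    with open("MMLU/models_MMLU_scores/nemotron-mini_MMLU.json") as f:
        rec = json.load(f)

    # Flatten task_scores into score_{task} columns, as in the CSV twin.
    row = {k: v for k, v in rec.items() if k != "task_scores"}
    row.update({f"score_{task}": acc for task, acc in rec["task_scores"].items()})
    print(pd.DataFrame([row]).T)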

MMLU/MMLU_merged.csv

25-row summary aggregating the per-model files above; underlies Table II's MMLU column and Table III. Columns: model, overall_score, n_shots, status, timestamp, and the six per-task score_* columns. Model tags harmonised with the benchmarks dataset — the upstream per-model file nemotron-mini_MMLU.csv appears here as nemotron-mini:4b to match the Ollama canonical form used elsewhere in the record.
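Because the tags are harmonised, MMLU results join directly onto the benchmark telemetry; a sketch, assuming the column names listed above:

    import pandas as pd

    bench = pd.read_csv("benchmarks/benchmarks_merged.csv")
    mmlu = pd.read_csv("MMLU/MMLU_merged.csv")

    # Left-join overall MMLU accuracy onto every per-query telemetry row.
    joined = bench.merge(mmlu[["model", "overall_score"]], on="model", how="left")
    print(joined.groupby("model")[["tokens_per_second", "overall_score"]].mean())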

rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv

Per-response GPT-4o-mini ratings (250 rows = 25 models × 10 questions). Ratings were produced via the OpenAI API; the rating prompt appears verbatim in paper §III-B.

Column Description
model, model_parameters Model identifiers
question Benchmark prompt
response_preview First 200 characters of the model response (full text in the benchmark CSVs)
score Teaching-effectiveness rating, 1–10 scale
strengths, weaknesses Stringified lists of rater-identified strengths and weaknesses
justification One- or two-sentence rationale
tokens_per_second, inference_time_s Carried over from benchmark run for joint analysis
error Populated when rating JSON failed to parse (score then defaults to 0)
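A loading sketch that parses the stringified list columns and drops the three parse-failure rows noted under caveats (assumes empty error cells read as NaN, pandas' default):

    import ast
    import pandas as pd

    path = "rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv"
    r = pd.read_csv(path)

    # strengths/weaknesses ship as stringified Python lists.
    for col in ("strengths", "weaknesses"):
        r[col] = r[col].apply(lambda s: ast.literal_eval(s) if isinstance(s, str) else [])

    valid = r[r["error"].isna()]   # excludes the three score=0 parse failures
    print(valid.groupby("model")["score"].mean().sort_values(ascending=False).head())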

rated_teaching/human_rate_gpt4o/teaching_eval_{pseudonym}.xlsx

One workbook per rater; rater pseudonyms: bird, duck, sky, squirrel, tree. Each workbook contains three sheets:

  • Instructions — participant information, ethics statement, consent block
  • Responses — 40 rows (4 models × 10 questions) × 8 teaching-quality criteria on a 1–10 scale, plus an optional Comments column
  • GPT Scores (DO NOT VIEW) — GPT-4o-mini scores for the same 40 responses, intended to be consulted only after annotation

Raters received model responses blinded via A/B/C/D labels (A=Gemma3 0.27B, B=Gemma3 1B, C=Granite4 Tiny Hybrid 7B, D=Mistral 7B). No personally identifiable data was retained; the contact email in the consent block is the principal contact above.

rated_teaching/human_ratings_merged.csv

200-row merged dataset aggregating all five rater workbooks (5 raters × 4 models × 10 questions). One row per (rater, model, question). Models are deblinded to their canonical Ollama tags; the A/B/C/D labels visible to raters are dropped.

Column Description
rater Rater pseudonym: bird, duck, sky, squirrel, or tree
model Ollama model tag (deblinded from A/B/C/D)
question_num Question index (1–10)
question Benchmark prompt (full text)
response_preview Truncated model response as shown to the rater
clarity, accuracy, engagement, structure, completeness, appropriate_level, examples_analogies, actionable Per-criterion rating, 1–10 scale
mean_score Arithmetic mean across the eight criteria
comments Rater's free-text comment, if any (otherwise NaN)

annotation_analysis.ipynb

Jupyter notebook computing the statistics reported in paper §IV-E: Krippendorff's α, ICC(C,1) and ICC(C,k), Pearson r, mean absolute difference. Generates Figure 2.
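For orientation, a condensed sketch of the same computation using the krippendorff and pingouin packages (the package choice is an assumption here; the shipped notebook is authoritative):

    import pandas as pd
    import krippendorff
    import pingouin as pg

    h = pd.read_csv("rated_teaching/human_ratings_merged.csv")
    h["item"] = h["model"] + "#q" + h["question_num"].astype(str)   # 40 rated items

    # Krippendorff's alpha at interval level over a raters x items matrix.
    wide = h.pivot(index="rater", columns="item", values="mean_score")
    print(krippendorff.alpha(reliability_data=wide.to_numpy(),
                             level_of_measurement="interval"))

    # ICC(C,1)/ICC(C,k) correspond to pingouin's ICC3/ICC3k rows.
    icc = pg.intraclass_corr(data=h, targets="item", raters="rater",
                             ratings="mean_score")
    print(icc[["Type", "ICC"]])

    # Pearson r vs GPT-4o-mini requires joining the per-item GPT scores
    # (workbook "GPT Scores" sheet or teaching_effectiveness_ratings.csv).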

Methodology

Metric Description Metric type
Tokens Per Second (TPS) Number of tokens generated per second (total token count divided by total generation time). Hardware
Inference time Total time spent generating one output (from query submission to the last generated token). Hardware
Time To First Token (TTFT) Time to generate the first token of the output. Hardware
Response length Total number of characters in an output. Hardware
IOPS Disk input/output operations per second. Hardware
Tokens Per Joule (TPJ) Number of tokens generated per joule of energy consumed. Hardware
Massive Multitask Language Understanding (MMLU) Benchmark assessing a model's general knowledge across multiple topics. Accuracy
Teaching effectiveness Rating of the output against eight teaching criteria on a 1–10 scale, judged by a larger LLM (GPT-4o-mini). Accuracy
Human rater Rating of the output against the same eight teaching criteria on a 1–10 scale, judged by five independent human raters on a four-model subset (200 annotations total). Accuracy

For full methodology, please consult paper §III. In brief:

  • MMLU subset covers Formal Logic, Global Facts, College Computer Science, College Mathematics, Marketing, and High School Macroeconomics (1,050 questions total). DeepEval [16] orchestrates 3-shot prompting at temperature 0.1. The evaluation runs on an NVIDIA A100 (40GB) via University of Tartu HPC, since MMLU accuracy is hardware-agnostic.
  • Inference benchmarks use Ollama with streaming enabled on the target platform. Ten pedagogical questions per model (Table I in paper) cover explanatory depth, adaptability, misconception handling, and student guidance. Models above 1.4B parameters receive a structured system prompt; smaller models receive the raw question only, owing to their sensitivity to prompt formatting at small parameter counts. A minimal timing loop is sketched after this list.
  • Teaching-effectiveness ratings use GPT-4o-mini against eight criteria (clarity, accuracy, engagement, structure, completeness, appropriate level, examples/analogies, actionable). The rating prompt appears verbatim in paper §III-B.
  • Human validation (paper §IV-E) covers four representative models × 10 questions × 5 raters = 200 annotations across the same eight criteria.
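A minimal timing loop in the spirit of the benchmark, using the ollama Python client with streaming (illustrative only: the released scripts under Reproducibility are authoritative, and the sample question is invented):

    import time
    import ollama

    def benchmark_query(model: str, question: str) -> dict:
        t0 = time.time()
        ttft, chunks, text = None, 0, []
        for chunk in ollama.chat(model=model,
                                 messages=[{"role": "user", "content": question}],
                                 stream=True):
            if ttft is None:
                ttft = time.time() - t0            # time_to_first_token_s
            chunks += 1                            # streamed-chunk estimate (see caveats)
            text.append(chunk["message"]["content"])
        total = time.time() - t0                   # inference_time_s
        return {"estimated_tokens": chunks,
                "time_to_first_token_s": ttft,
                "inference_time_s": total,
                "tokens_per_second": chunks / total,
                "response": "".join(text)}

    print(benchmark_query("qwen3:0.6b", "Explain recursion to a first-year student."))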

Known caveats and divergences

  • Raspberry Pi 5 scope. Paper §III-D reports scalability on three Raspberry Pi 5 models (qwen3:0.6b, gemma3:1b, granite4:tiny-h), selected on TPS/MMLU/size criteria from the RPi 4 results for Robot Study Companion architecture planning. This record contains Pi 5 data for 24 of the 25 models across three sessions (2025-12-01, 2025-12-02, 2025-12-10). granite4:1b-h is the sole model not covered on the Pi 5. Users may analyse the broader Pi 5 dataset; the paper's three-model subset remains the analytical focus cited therein.
  • Teaching-rating parse failures. Three rows in teaching_effectiveness_ratings.csv carry score=0 and error="Failed to parse rating": nemotron-mini:4b (question 9), phi4-mini-reasoning:3.8b (question 2), tinyllama:latest (question 3). Their response_preview fields contain the model output that GPT-4o-mini could not parse into the expected JSON rubric. These three are excluded from the per-model teaching-effectiveness summary used in paper Table II.
  • Nemotron-mini MMLU format violations. The nemotron-mini_MMLU.csv file records a 0% aggregate due to repeated output-format violations; see Table II's ‡ footnote. The file is retained for completeness and reproducibility of the failure mode.
  • Granite4 1B quantisation note. Paper §IV-B flags that granite4:1b ships in BF16 (not Q4_K_M), yielding an atypical 3.3 GB on-disk footprint and 0.89 TPS — values not directly comparable to its Q4-quantised peers.
  • Single-run hardware metrics. Runtime constraints on the RPi 4 precluded multiple independent runs per model. All hardware metrics (TPS, TPJ, inference time, etc.) are single-run values; thermal drift may introduce unquantified variance into absolute values while preserving relative rankings.
  • Token counting. TPS and related metrics use streamed-chunk counts rather than tokeniser-level token counts. This may introduce minor discrepancies across runtimes but does not affect relative comparisons within the benchmark.
  • Power estimation uncertainty. Absolute tokens-per-joule values carry an estimated ±15–20% uncertainty owing to the linear CPU-load-to-current approximation (idle 0.6A, full-load 3A at 5V nominal). Relative rankings remain stable.
  • Reasoning-model throughput. DeepSeek-R1 (1.5B, 7B) and Phi4-mini-reasoning (3.8B) emit internal chain-of-thought tokens in the runtime stream. Their throughput and latency metrics may appear optimistic relative to the useful output delivered.
  • Laptop coverage. The laptop dataset covers 23 of the 25 models; granite4:1b-h and granite4:3b-h were not run on the laptop. Full platform coverage is therefore: Pi 4, all 25 models; Pi 5, 24 (missing granite4:1b-h); laptop, 23 (missing both hybrid variants).
  • qwen3:1.7b double-run on laptop. qwen3:1.7b was run twice on the laptop (sessions 2025-11-19 15:26 and 15:29); both runs are preserved in results_computer/ and in benchmarks_merged.csv and are distinguishable by timestamp. All other (model, platform) pairs have a single run.
  • falcon3:3b degenerate generations on Pi 4. falcon3:3b on Pi 4 produced a one-character response to question 10 ("I am strugulling with C++ where should i start?") and a zero-character response to question 8 ("what are the steps to write a master thesis?"). Both rows have NaN power and energy metrics (vcgencmd sampler did not fire during these runs). The question-8 row carries tokens_per_second ≈ 0.2, depressing falcon3:3b's Pi 4 throughput aggregate; exclude via response_length_chars > 0 when computing per-model means.
  • GPT-4o-mini hallucinated scores for degenerate responses. Following on from the falcon3:3b Pi 4 runs above, GPT-4o-mini rated question 8 at 7/10 and question 10 at 6/10 with no error flags and empty or near-empty response_preview fields. These rows inflate falcon3:3b's teaching aggregate; exclude via response_length_chars > 0 if joining with benchmarks for analysis. Combined with the three parse-failures above, the total known GPT-rating anomalies stand at five.
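A combined exclusion sketch implementing the filters recommended above (assumes the question strings match across files):

    import pandas as pd

    bench = pd.read_csv("benchmarks/benchmarks_merged.csv")
    ratings = pd.read_csv(
        "rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv")

    # Drop degenerate generations before per-model hardware means (Pi 4 only).
    pi4_ok = bench[(bench["platform"] == "rpi4") & (bench["response_length_chars"] > 0)]

    # Drop the three parse failures, then inner-join against the cleaned Pi 4 rows
    # to shed ratings of degenerate responses as well.
    ratings_ok = ratings[ratings["error"].isna()]
    clean = ratings_ok.merge(pi4_ok[["model", "question"]], on=["model", "question"])
    # Note: falcon3:3b's one-character question-10 answer passes a > 0 threshold;
    # tighten the cut-off if that rating should also be excluded.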

Reproducibility

  • Benchmark scripts. Available at https://github.com/RobotStudyCompanion/Benchmark_LM/releases/tag/v0.1
  • Analysis notebook. rated_teaching/human_rate_gpt4o/annotation_analysis.ipynb at the root of this record.
  • Hardware. Raspberry Pi 4 Model B (8GB), Raspberry Pi 5 (8GB), and a laptop with NVIDIA RTX 4060 GPU. Pi Lite OS (64-bit, 2025-11-24, kernel v6.12, Debian 12 bookworm). MMLU evaluation performed on NVIDIA Tesla A100 (40GB) via University of Tartu HPC services.

Files

supplemental_materials_arso2026.zip (2.1 MB)
md5:6a3032839b2af18e3b5c78d9974a1bda

Additional details

Related works

Is documented by
Software: https://github.com/RobotStudyCompanion/Benchmark_LM/releases/tag/v0.1 (Other)

Funding

Estonian Research Council
PRG3237

Software

Repository URL
https://github.com/RobotStudyCompanion/Benchmark_LM
Programming language
Python, Shell
Development Status
Active