Supplemental materials to "Benchmarking Local Language Models for Social Robots using Edge Devices"
Authors/Creators
Description
This record accompanies the paper "Benchmarking Local Language Models for Social Robots using Edge Devices" [accepted IEEE ARSO 2026] and contains the raw benchmark data, MMLU scores, automated teaching-effectiveness ratings, human rater sheets, and the analysis notebook supporting the results reported therein.
Overview
We benchmarked 25 open-source language models for local deployment on edge hardware in a social-educational robotics context (the Robot Study Companion project, rsc.ee). Each model was evaluated across three dimensions — inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality validated against five independent human raters) — primarily on the Raspberry Pi 4, with scalability comparisons on the Raspberry Pi 5 and a laptop NVIDIA RTX 4060 GPU.
This record contains: per-query hardware telemetry across all three platforms, per-model MMLU scores, GPT-4o-mini teaching-effectiveness ratings, five human rater workbooks, and the notebook that computes inter-rater agreement statistics. Readers should consult the paper for methodology, results, and discussion; this record serves as the underlying evidentiary base.
Contents
```
.
├── benchmarks/
│   ├── benchmarks_merged.csv            727-row consolidated per-query telemetry across all platforms
│   ├── results_pi4/                     7 CSV files, per-query benchmarks on Raspberry Pi 4
│   ├── results_pi5/                     3 CSV files, per-query benchmarks on Raspberry Pi 5
│   ├── results_computer/                24 CSV files, per-query benchmarks on laptop GPU
│   └── fig1_final.pdf                   Figure 1 (benchmark summary)
├── MMLU/
│   ├── MMLU_merged.csv                  25-row merged MMLU results (model tags harmonised with benchmarks)
│   └── models_MMLU_scores/              25 paired CSV+JSON files, per-model MMLU results
└── rated_teaching/
    ├── human_ratings_merged.csv         200-row merged human rater workbook data, deblinded
    ├── teaching_effectiveness_ratings/  GPT-4o-mini per-response ratings (250 rows)
    └── human_rate_gpt4o/                5 rater workbooks + analysis
        ├── annotation_analysis.ipynb    human-rater analysis (α, ICC, Pearson r, Fig 2)
        └── figure_2_human_annotation.pdf  Figure 2 (human annotation validation)
```
Data dictionary
benchmarks/results_*/benchmark_all_models_*.csv
Per-query records from benchmark runs. Each row captures one model answering one question, with full response text and hardware telemetry. Filenames encode the run timestamp: benchmark_all_models_YYYYMMDD_HHMMSS.csv.
Shared columns (all platforms, 18 fields):
| Column | Description |
|---|---|
| `timestamp` | ISO-8601 start of inference |
| `model` | Ollama model tag (e.g. `qwen3:0.6b`) |
| `model_parameters` | Nominal parameter count, as reported |
| `question` | Benchmark prompt text (see Table I in paper) |
| `response` | Full model-generated answer (not truncated) |
| `response_length_chars` | Character count of the response |
| `estimated_tokens` | Token count estimated from streaming chunks (see caveat) |
| `inference_time_s` | Total generation time, via `time.time()` |
| `time_to_first_token_s` | Latency to first streamed chunk, via `time.time()` |
| `tokens_per_second` | Throughput in tokens per second |
| `cpu_baseline_percent`, `cpu_average_percent` | CPU load (pre-inference baseline; average during inference) |
| `cpu_per_core` | Per-core utilisation, stringified Python list; parse with `ast.literal_eval` |
| `cpu_freq_mhz` | CPU frequency during inference |
| `memory_baseline_mb`, `memory_peak_mb`, `memory_increase_mb`, `memory_percent` | RAM telemetry |
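For example, a minimal sketch of loading one per-session CSV and turning the stringified `cpu_per_core` column back into a Python list (the filename timestamp below is illustrative; substitute any `benchmark_all_models_YYYYMMDD_HHMMSS.csv`):

```python
import ast

import pandas as pd

# Hypothetical session file; pick any CSV from results_pi4/, results_pi5/ or results_computer/
path = "benchmarks/results_pi4/benchmark_all_models_20251124_101500.csv"

df = pd.read_csv(path, parse_dates=["timestamp"])

# cpu_per_core is stored as a stringified Python list
df["cpu_per_core"] = df["cpu_per_core"].apply(ast.literal_eval)

print(df[["model", "question", "tokens_per_second", "cpu_per_core"]].head())
```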
Raspberry Pi-only additional columns (19 fields, present in results_pi4/ and results_pi5/):
| Column | Description |
|---|---|
| `temperature_c` | CPU die temperature (`vcgencmd measure_temp`) |
| `throttled` | Throttling flag (`vcgencmd get_throttled`) |
| `avg_voltage_v` | Mean rail voltage during inference (`vcgencmd measure_volts`) |
| `estimated_current_a` | Current estimate from linear CPU-load model (idle 0.6 A, full-load 3 A at 5 V) |
| `avg_power_watts` | V × A, averaged over inference |
| `total_energy_joules` | Estimated energy consumption during inference |
| `tokens_per_joule` | Energy-efficiency metric (`estimated_tokens` / `total_energy_joules`) |
| `io_read_count`, `io_write_count`, `io_read_bytes`, `io_write_bytes`, `io_read_time_ms`, `io_write_time_ms`, `io_total_ops`, `io_total_bytes`, `io_iops`, `io_throughput_mb_s`, `io_avg_read_latency_ms`, `io_avg_write_latency_ms` | Disk I/O telemetry |
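The power and energy columns follow from the linear approximation stated above. A minimal sketch of that model (the benchmark script samples voltage and CPU load repeatedly during inference; here a single averaged value stands in for that loop):

```python
def estimate_energy(cpu_average_percent: float, avg_voltage_v: float, inference_time_s: float):
    """Linear CPU-load-to-current model described in the data dictionary:
    idle draw 0.6 A, full-load draw 3.0 A, on a nominal 5 V rail."""
    idle_a, full_a = 0.6, 3.0
    estimated_current_a = idle_a + (full_a - idle_a) * (cpu_average_percent / 100.0)
    avg_power_watts = avg_voltage_v * estimated_current_a
    total_energy_joules = avg_power_watts * inference_time_s
    return estimated_current_a, avg_power_watts, total_energy_joules

# e.g. 80 % average CPU load at 5.02 V for a 42-second generation
print(estimate_energy(80.0, 5.02, 42.0))
```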
Laptop-only additional columns (4 fields, present in results_computer/ only):
| Column | Description |
|---|---|
inference_time_perf_s |
Generation time via time.perf_counter() |
time_to_first_token_perf_s |
TTFT via time.perf_counter() |
tokens_per_second_perf |
Throughput computed from perf_counter timings |
timing_diff_ms |
Precision difference between time.time() and time.perf_counter() |
benchmarks/benchmarks_merged.csv
Consolidated per-query telemetry across all three platforms. 727 rows (250 Pi 4 + 237 Pi 5 + 240 laptop) × 42 columns (41 raw columns unioned across platforms, plus a platform identifier prepended). One row per (model, platform, question); the qwen3:1.7b laptop double-run contributes 20 rows rather than 10 (see caveats).
Columns: platform (rpi4 / rpi5 / laptop), followed by all columns documented above. Pi-only columns fill as NaN on laptop rows; laptop-only *_perf columns fill as NaN on Pi rows. No values are transformed; the file is a pure union merge of the per-session CSVs with platform provenance added.
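As an illustration of working with the merged file, the sketch below (pandas assumed) computes median throughput per platform and model; the median is used rather than the mean because of the degenerate falcon3:3b rows flagged in the caveats:

```python
import pandas as pd

bench = pd.read_csv("benchmarks/benchmarks_merged.csv")

# Median tokens/second per (platform, model), pivoted so platforms become columns
tps = (bench.groupby(["platform", "model"])["tokens_per_second"]
            .median()
            .unstack("platform"))

print(tps.sort_values("rpi4", ascending=False).head(10))
```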
MMLU/models_MMLU_scores/{model}_MMLU.csv and .json
One-row-per-model aggregate MMLU results on the six-category subset used in the paper. Both formats preserved: CSV flattens per-task scores into columns (score_{task}); JSON preserves the native task_scores dict structure.
| Column | Description |
|---|---|
| `model_name` | Ollama model tag |
| `overall_score` | Mean accuracy across the six categories (0–1 scale) |
| `tasks` | Comma-separated task list |
| `num_tasks` | Task count (always 6) |
| `n_shots` | Prompting shots (always 3) |
| `timestamp` | Run timestamp |
| `status` | Success or failure marker |
| `score_{task}` | Per-task accuracy (0–1) |
MMLU/MMLU_merged.csv
25-row summary aggregating the per-model files above; underlies Table II's MMLU column and Table III. Columns: model, overall_score, n_shots, status, timestamp, and the six per-task score_* columns. Model tags harmonised with the benchmarks dataset — the upstream per-model file nemotron-mini_MMLU.csv appears here as nemotron-mini:4b to match the Ollama canonical form used elsewhere in the record.
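To illustrate the two preserved per-model formats, a small sketch that reads the paired CSV and JSON for one model (the `nemotron-mini` pair named above; the JSON is assumed to share the CSV's basename and the field names follow the data dictionary):

```python
import json

import pandas as pd

# Flattened CSV: one row, per-task scores as score_{task} columns
flat = pd.read_csv("MMLU/models_MMLU_scores/nemotron-mini_MMLU.csv")
print(flat.filter(like="score_").iloc[0])

# JSON: preserves the native task_scores dict
with open("MMLU/models_MMLU_scores/nemotron-mini_MMLU.json") as f:
    record = json.load(f)
print(record["model_name"], record["overall_score"], record["task_scores"])
```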
rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv
Per-response GPT-4o-mini ratings (250 rows = 25 models × 10 questions). Ratings produced via the OpenAI API; the rating prompt appears in our paper under §III-B.
| Column | Description |
|---|---|
| `model`, `model_parameters` | Model identifiers |
| `question` | Benchmark prompt |
| `response_preview` | First 200 characters of the model response (full text in the benchmark CSVs) |
| `score` | Teaching-effectiveness rating, 1–10 scale |
| `strengths`, `weaknesses` | Stringified lists of rater-identified strengths and weaknesses |
| `justification` | One- or two-sentence rationale |
| `tokens_per_second`, `inference_time_s` | Carried over from the benchmark run for joint analysis |
| `error` | Populated when the rating JSON failed to parse (`score` then defaults to 0) |
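Because `strengths` and `weaknesses` are stringified lists, a minimal parsing sketch (pandas assumed; the filter also drops the three parse-failure rows described in the caveats):

```python
import ast

import pandas as pd

gpt = pd.read_csv("rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv")

# Keep only successfully rated rows (no error flag, score > 0)
valid = gpt[(gpt["score"] > 0) & (gpt["error"].isna())].copy()

# strengths / weaknesses are stringified Python lists
for col in ("strengths", "weaknesses"):
    valid[col] = valid[col].apply(ast.literal_eval)

print(valid[["model", "score", "strengths"]].head())
```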
rated_teaching/human_rate_gpt4o/teaching_eval_{pseudonym}.xlsx
One workbook per rater; rater pseudonyms: bird, duck, sky, squirrel, tree. Each workbook contains three sheets:
- Instructions — participant information, ethics statement, consent block
- Responses — 40 rows (4 models × 10 questions) × 8 teaching-quality criteria on a 1–10 scale, plus an optional Comments column
- GPT Scores (DO NOT VIEW) — GPT-4o-mini scores for the same 40 responses, intended to be consulted only after annotation
Raters received model responses blinded via A/B/C/D labels (A=Gemma3 0.27B, B=Gemma3 1B, C=Granite4 Tiny Hybrid 7B, D=Mistral 7B). No personally identifiable data was retained; the contact email in the consent block is the principal contact above.
rated_teaching/human_ratings_merged.csv
200-row merged dataset aggregating all five rater workbooks (5 raters × 4 models × 10 questions). One row per (rater, model, question). Models are deblinded to their canonical Ollama tags; the A/B/C/D labels visible to raters are dropped.
| Column | Description |
|---|---|
| `rater` | Rater pseudonym: bird, duck, sky, squirrel, or tree |
| `model` | Ollama model tag (deblinded from A/B/C/D) |
| `question_num` | Question index (1–10) |
| `question` | Benchmark prompt (full text) |
| `response_preview` | Truncated model response as shown to the rater |
| `clarity`, `accuracy`, `engagement`, `structure`, `completeness`, `appropriate_level`, `examples_analogies`, `actionable` | Per-criterion rating, 1–10 scale |
| `mean_score` | Arithmetic mean across the eight criteria |
| `comments` | Rater's free-text comment, if any (otherwise NaN) |
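A small sketch (pandas assumed) that sanity-checks the stored `mean_score` against the eight criterion columns and aggregates per model:

```python
import pandas as pd

human = pd.read_csv("rated_teaching/human_ratings_merged.csv")

criteria = ["clarity", "accuracy", "engagement", "structure", "completeness",
            "appropriate_level", "examples_analogies", "actionable"]

# mean_score should equal the row-wise mean of the eight criteria (up to any rounding applied on export)
diff = (human[criteria].mean(axis=1) - human["mean_score"]).abs().max()
print(f"max |recomputed - stored| mean_score: {diff:.4f}")

# Per-model mean across all raters and questions
print(human.groupby("model")["mean_score"].mean().sort_values(ascending=False))
```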
rated_teaching/human_rate_gpt4o/annotation_analysis.ipynb
Jupyter notebook computing the statistics reported in paper §IV-E: Krippendorff's α, ICC(C,1) and ICC(C,k), Pearson r, mean absolute difference. Generates Figure 2.
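For readers who prefer a script, the sketch below approximates the same statistics with the `krippendorff`, `pingouin`, and `scipy` packages. The notebook remains the authoritative implementation; the unit definition and the join on full question text below are assumptions, not the notebook's exact procedure:

```python
import pandas as pd
import pingouin as pg              # pip install pingouin
import krippendorff                # pip install krippendorff
from scipy.stats import pearsonr

human = pd.read_csv("rated_teaching/human_ratings_merged.csv")
gpt = pd.read_csv("rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv")

# One "unit" per (model, question) pair rated by all five raters
human["unit"] = human["model"] + "|" + human["question_num"].astype(str)

# Krippendorff's alpha on the raters x units matrix of per-response mean scores
matrix = human.pivot(index="rater", columns="unit", values="mean_score")
alpha = krippendorff.alpha(reliability_data=matrix.to_numpy(),
                           level_of_measurement="interval")

# pingouin's ICC3 / ICC3k correspond to the consistency forms ICC(C,1) / ICC(C,k)
icc = pg.intraclass_corr(data=human, targets="unit", raters="rater", ratings="mean_score")
icc_c1 = icc.loc[icc["Type"] == "ICC3", "ICC"].item()
icc_ck = icc.loc[icc["Type"] == "ICC3k", "ICC"].item()

# Pearson r between the human consensus and the GPT-4o-mini score per (model, question)
consensus = human.groupby(["model", "question"])["mean_score"].mean().reset_index()
merged = consensus.merge(gpt[["model", "question", "score"]], on=["model", "question"])
r, p = pearsonr(merged["mean_score"], merged["score"])

print(f"alpha={alpha:.3f}  ICC(C,1)={icc_c1:.3f}  ICC(C,k)={icc_ck:.3f}  r={r:.3f} (p={p:.3g})")
```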
Methodology
| Metric | Description | Type of metric |
|---|---|---|
| Tokens Per Second (TPS) | Number of tokens generated by the model per second (total number of tokens divided by total generation time). | Hardware |
| Inference time | Total time spent generating one output (from the query being input to the model until the last generated token). | Hardware |
| Time To First Token (TTFT) | Time taken to generate the first token of the output. | Hardware |
| Response length | Total number of characters in an output. | Hardware |
| IOPS | Disk input/output operations per second. | Hardware |
| Tokens Per Joule (TPJ) | Number of tokens generated by the model per joule of energy consumed. | Hardware |
| Massive Multitask Language Understanding (MMLU) | Benchmark used to assess the general knowledge of a model across multiple topics. | Accuracy |
| Teaching effectiveness | Rating of the output against eight teaching criteria on a 1–10 scale, judged by a larger LLM (GPT-4o-mini). | Accuracy |
| Human rater | Rating of the output against the same eight teaching criteria on a 1–10 scale, judged by five independent human raters on a four-model subset (200 annotations total). | Accuracy |
For full methodology, please consult paper §III. In brief:
- MMLU subset covers Formal Logic, Global Facts, College Computer Science, College Mathematics, Marketing, and High School Macroeconomics (1,050 questions total). DeepEval [16] orchestrates 3-shot prompting at temperature 0.1. The evaluation runs on an NVIDIA A100 (40GB) via University of Tartu HPC, since MMLU accuracy is hardware-agnostic.
- Inference benchmarks use Ollama with streaming enabled on the target platform (a minimal streaming-timing sketch follows this list). Ten pedagogical questions per model (Table I in paper) cover explanatory depth, adaptability, misconception handling, and student guidance. Models above 1.4B parameters receive a structured system prompt; models below receive the raw question only, owing to sensitivity to prompt formatting at small parameter counts.
- Teaching-effectiveness ratings use GPT-4o-mini against eight criteria (clarity, accuracy, engagement, structure, completeness, appropriate level, examples/analogies, actionable). The rating prompt appears verbatim in paper §III-B.
- Human validation (paper §IV-E) covers four representative models × 10 questions × 5 raters = 200 annotations across the same eight criteria.
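The streaming setup referenced in the second bullet can be sketched as follows. This assumes the `ollama` Python client with a local Ollama server and a pulled model; the question text and model tag are illustrative, and the actual scripts in the linked repository collect far more telemetry and use `time.time()` on the Pis:

```python
import time

import ollama  # pip install ollama


def timed_query(model: str, question: str) -> dict:
    """Stream one response and record TTFT, total time, and chunk-based throughput."""
    start = time.perf_counter()
    ttft = None
    chunks = 0
    text = []
    for chunk in ollama.chat(model=model,
                             messages=[{"role": "user", "content": question}],
                             stream=True):
        if ttft is None:
            ttft = time.perf_counter() - start
        text.append(chunk["message"]["content"])
        chunks += 1  # streamed-chunk count stands in for token count (see caveats)
    total = time.perf_counter() - start
    return {"response": "".join(text),
            "time_to_first_token_s": ttft,
            "inference_time_s": total,
            "estimated_tokens": chunks,
            "tokens_per_second": chunks / total if total else 0.0}


print(timed_query("qwen3:0.6b", "Explain photosynthesis to a first-year student."))
```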
Known caveats and divergences
- Raspberry Pi 5 scope. Paper §III-D reports scalability on three Raspberry Pi 5 models (`qwen3:0.6b`, `gemma3:1b`, `granite4:tiny-h`), selected on TPS/MMLU/size criteria from the RPi 4 results for Robot Study Companion architecture planning. This record contains Pi 5 data for 24 of the 25 models across three sessions (2025-12-01, 2025-12-02, 2025-12-10). `granite4:1b-h` is the sole model not covered on the Pi 5. Users may analyse the broader Pi 5 dataset; the paper's three-model subset remains the analytical focus cited therein.
- Teaching-rating parse failures. Three rows in `teaching_effectiveness_ratings.csv` carry `score=0` and `error="Failed to parse rating"`: nemotron-mini:4b (question 9), phi4-mini-reasoning:3.8b (question 2), tinyllama:latest (question 3). Their `response_preview` fields contain the model output that GPT-4o-mini could not parse into the expected JSON rubric. These three rows are excluded from the per-model teaching-effectiveness summary used in paper Table II.
- Nemotron-mini MMLU format violations. The `nemotron-mini_MMLU.csv` file records a 0% aggregate due to repeated output-format violations; see Table II's ‡ footnote. The file is retained for completeness and reproducibility of the failure mode.
- Granite4 1B quantisation note. Paper §IV-B flags that `granite4:1b` ships in BF16 (not Q4_K_M), yielding an atypical 3.3 GB on-disk footprint and 0.89 TPS; these values are not directly comparable to its Q4-quantised peers.
- Single-run hardware metrics. Runtime constraints on the RPi 4 precluded multiple independent runs per model. All hardware metrics (TPS, TPJ, inference time, etc.) are single-run values; thermal drift may introduce unquantified variance into absolute values while preserving relative rankings.
- Token counting. TPS and related metrics use streamed-chunk counts rather than tokeniser-level token counts. This may introduce minor discrepancies across runtimes but does not affect relative comparisons within the benchmark.
- Power estimation uncertainty. Absolute tokens-per-joule values carry an estimated ±15–20% uncertainty owing to the linear CPU-load-to-current approximation (idle 0.6 A, full-load 3 A at 5 V nominal). Relative rankings remain stable.
- Reasoning-model throughput. DeepSeek-R1 (1.5B, 7B) and Phi4-mini-reasoning (3.8B) emit internal chain-of-thought tokens in the runtime stream. Their throughput and latency metrics may therefore appear optimistic relative to the useful output delivered.
- Laptop coverage. The laptop dataset covers 23 of the 25 models; `granite4:1b-h` and `granite4:3b-h` were not run on the laptop. Including this gap, full-platform coverage is: Pi 4 all 25, Pi 5 24 (missing `granite4:1b-h`), laptop 23 (missing both hybrid variants).
- qwen3:1.7b double-run on laptop. `qwen3:1.7b` was run twice on the laptop (sessions 2025-11-19 15:26 and 15:29); both runs are preserved in `results_computer/` and in `benchmarks_merged.csv`. Rows are distinguished via `timestamp`. All other (model, platform) pairs have a single run.
- falcon3:3b degenerate generations on Pi 4. falcon3:3b on Pi 4 produced a one-character response to question 10 ("I am strugulling with C++ where should i start?") and a zero-character response to question 8 ("what are the steps to write a master thesis?"). Both rows have NaN power and energy metrics (the `vcgencmd` sampler did not fire during these runs). The question-8 row carries `tokens_per_second` ≈ 0.2, depressing falcon3:3b's Pi 4 throughput aggregate; exclude it via `response_length_chars > 0` when computing per-model means.
- GPT-4o-mini hallucinated scores for degenerate responses. Following on from the falcon3:3b Pi 4 runs above, GPT-4o-mini rated question 8 at 7/10 and question 10 at 6/10 with no error flags and empty or near-empty `response_preview` fields. These rows inflate falcon3:3b's teaching aggregate; exclude them via `response_length_chars > 0` if joining with the benchmarks for analysis (a filtering sketch follows this list). Combined with the three parse failures above, the total of known GPT-rating anomalies stands at five.
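A minimal sketch of the exclusion filter recommended in the last two caveats, joining the GPT ratings onto the Pi 4 benchmark rows (pandas assumed; the join on full question text assumes it matches exactly across files):

```python
import pandas as pd

bench = pd.read_csv("benchmarks/benchmarks_merged.csv")
gpt = pd.read_csv("rated_teaching/teaching_effectiveness_ratings/teaching_effectiveness_ratings.csv")

# Pi 4 rows with a non-empty response, as recommended above
pi4_ok = bench[(bench["platform"] == "rpi4") & (bench["response_length_chars"] > 0)]

# Join ratings to surviving benchmark rows; score > 0 also drops the three parse-failure rows
joined = gpt[gpt["score"] > 0].merge(
    pi4_ok[["model", "question", "tokens_per_second", "total_energy_joules"]],
    on=["model", "question"],
    how="inner",
    suffixes=("_rating", "_bench"),
)

print(joined.groupby("model")["score"].mean().sort_values(ascending=False))
```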
Reproducibility
- Benchmark scripts. Available at https://github.com/RobotStudyCompanion/Benchmark_LM/releases/tag/v0.1
- Analysis notebook. `rated_teaching/human_rate_gpt4o/annotation_analysis.ipynb`, relative to the root of this record.
- Hardware. Raspberry Pi 4 Model B (8 GB), Raspberry Pi 5 (8 GB), and a laptop with an NVIDIA RTX 4060 GPU. Both Pis ran Raspberry Pi OS Lite (64-bit, 2025-11-24 release, kernel 6.12, Debian 12 bookworm). MMLU evaluation was performed on an NVIDIA A100 (40 GB) via University of Tartu HPC services.
Files
- supplemental_materials_arso2026.zip (2.1 MB, md5:6a3032839b2af18e3b5c78d9974a1bda)
Additional details
Related works
- Is documented by: https://github.com/RobotStudyCompanion/Benchmark_LM/releases/tag/v0.1 (Software)
Funding
- Estonian Research Council, grant PRG3237
Software
- Repository URL: https://github.com/RobotStudyCompanion/Benchmark_LM
- Programming languages: Python, Shell
- Development status: Active