- Authors: Dong Won Lee, Yubin Kim, Sooyeon Jeong, Denison Guvenoz, Parker Malachowsky, Louis-Philippe Morency, Cynthia Breazeal, Hae Won Park
- Institutions: MIT, Purdue University, Carnegie Mellon University
- License: [Pending final approval]
The Social Human Robot Embodied Conversation (SHREC) Dataset is a large-scale, real-world benchmark designed to evaluate social reasoning in language and vision-language models through physically embodied human-robot interaction (HRI). It contains:
- ~400 real-world interaction videos
- 10,000+ annotations from trained human annotators
- Labels for social errors, competencies, rationales, and corrections
- Coverage of seven social attributes critical for social intelligence
The dataset is split into 3 subsets:
- The SHREC Wellness Home subset contains longitudinal, real-world human-robot interaction video data from Jeong et al. (2023), recorded during an 8-week in-home study with adult participants aged 18–83. Participants engaged with a socially assistive robot designed to improve psychological well-being, affect, and readiness for change through evidence-based positive psychology interventions (PPIs).
- The SHREC Wellness Dorm subset contains longitudinal, real-world human-robot interaction video data from Jeong et al. (2020), in which a robotic positive psychology coach was deployed in MIT student dormitories. Participants engaged in daily well-being sessions with the robot over the course of 1–4 weeks.
- The SHREC Empathic subset contains real-world human-robot interaction video data from Shen et al. (2024), collected over a month-long deployment of social robots in participants’ homes, during which participants engaged in natural, empathic storytelling interactions with the robot.
It supports research in rapport-building, mental health intervention, and social reasoning in intimate, longitudinal HRI settings.
To download this dataset from Hugging Face and load it into a pandas DataFrame:
```python
import glob

import pandas as pd
from huggingface_hub import snapshot_download

# Download the dataset snapshot from the Hugging Face Hub
repo_id = "MIT-personal-robots/shrec_wellness_home"
local_dir = "shrec_wellness_home"
snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir, token=True)

# Concatenate all parquet files in the snapshot into a single DataFrame
parquet_files = glob.glob(f"{local_dir}/**/*.parquet", recursive=True)
df = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)
```
Change `repo_id` accordingly for the different subsets:
- SHREC Wellness Home: `MIT-personal-robots/shrec_wellness_home`
- SHREC Wellness Dorm: `MIT-personal-robots/shrec_wellness_dorm`
- SHREC Empathic: `MIT-personal-robots/shrec_wellness_empathic`
Before running experiments, install all necessary dependencies:
pip install -r requirements.txt
If you'd like to use OpenAI or Google Gemini models, ensure these environment variables are set in your shell:
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_GENAI_API_KEY="your-google-api-key"
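If a run fails with authentication errors, a quick sanity check (a minimal sketch using the variable names above) is:

```python
import os

# Fail fast if the API keys required for OpenAI / Gemini models are missing.
for var in ("OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set; export it before running the benchmark.")
```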
We use VLMEvalKit to test a wide suite of vision-language models (VLMs).
To install it:
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
More details and configuration options can be found in their Quickstart guide.
Then, you can evaluate LLMs and VLMs on SHREC tasks by following these steps:
Run the following script to extract task-specific data from the raw HuggingFace dataset (in `.csv` format). This will create `.pickle` files under `./output_datasets` for each supported task.
python main_vlm_get_data.py --data_path ../shrec_empathic.csv --data_name shrec_empathic --task_type pre
Arguments:
- `--data_path`: Path to the HuggingFace-downloaded CSV file.
- `--data_name`: Dataset identifier (e.g., `shrec_empathic`, `shrec_wellness_home`, etc.)
- `--task_type`: Task to extract. Options: `detection`, `attribute`, `rationale`, `correction`, `post`, `pre`, `attribute_agreed_multiple_subj`, `detection_error_only`
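Once extraction has finished, you can sanity-check a generated file before running the benchmark. This is only a sketch: the exact structure of the pickled object depends on the task type, so inspect it rather than assuming a schema.

```python
import pickle

# Load one of the preprocessed task files produced by main_vlm_get_data.py.
with open("./output_datasets/shrec_empathic_pre.pickle", "rb") as f:
    task_data = pickle.load(f)

print(type(task_data))
if hasattr(task_data, "__len__"):
    print(f"{len(task_data)} samples")
```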
After preprocessing, run the benchmark with the following command:
python main_vlm_exp.py \
--context_window 15 \
--model GPT4o_MINI_Image \
--data_path ./output_datasets \
--task_type shrec_empathic_pre.pickle \
--video \
--csv_path ../shrec_empathic.csv \
--images_dir ../shrec_empathic
Key Flags:
- `--context_window`: Number of utterances for context (e.g., 15).
- `--model`: Model to evaluate (see list below).
- `--task_type`: Preprocessed `.pickle` file generated in Step 1.
- `--video`: Include frame-based input (set this for vision-language models).
- `--images_dir`: Directory with extracted image frames for each interaction.
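To sweep several models over the same task, one option is to wrap the command in a loop. The sketch below is a convenience wrapper, not part of the repository; it simply reuses the flags and model identifiers shown above.

```python
import subprocess

# Run the benchmark for several models in sequence, reusing the flags above.
models = ["GPT4o_MINI_Image", "gemini-1.5-flash", "InternVL2-8B"]
for model in models:
    subprocess.run(
        [
            "python", "main_vlm_exp.py",
            "--context_window", "15",
            "--model", model,
            "--data_path", "./output_datasets",
            "--task_type", "shrec_empathic_pre.pickle",
            "--video",
            "--csv_path", "../shrec_empathic.csv",
            "--images_dir", "../shrec_empathic",
        ],
        check=True,
    )
```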
After running inference, evaluate model predictions using the following steps:
(a) Parse Model Outputs:
python eval_pydantic.py
- This extracts predicted answer choices from LLM output files located in `./output/`.
- Outputs are saved into `./output_pydantic/`.
(b) Compute Accuracy Metrics:
python eval.py
- This script computes task-specific performance metrics across all models.
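`eval.py` implements the benchmark's metric computation. For reference only, a minimal, generic accuracy over parallel lists of predicted and gold answer choices (not the script's actual implementation) looks like this:

```python
from typing import Sequence

def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of samples whose predicted answer choice matches the gold label."""
    assert len(predictions) == len(gold), "prediction/label count mismatch"
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

print(accuracy(["A", "B", "C"], ["A", "B", "D"]))  # 0.666...
```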
Below are the models currently supported in the SHREC benchmark pipeline:
| Category | Model Identifier |
|---|---|
| Open-source VLMs | `paligemma-3b-mix-448`, `llava_next_llama3`, `llava_video_qwen2_7b`, `InternVL2-8B`, `MiniCPM-V-2_6`, `Llama-3.2-3B`, `Llama-3.2-3B-Instruct`, `Llama-3.2-11B-Vision-Instruct` |
| GPT-4o Variants | `GPT4o_Image`, `GPT4o_MINI_Image`, `GPT4o_Lang`, `GPT4o_MINI_Lang`, `GPT4o_Image_few_shot`, `GPT4o_Image_cot` |
| Google Gemini | `gemini-1.5-flash`, `gemini-2.0-flash-exp`, `gemini-1.5-pro`, `gemini-1.5-flash-8b` |
| Others | `o1`, `o1-mini`, `llava_video_next`, `llava_video_next_7b_dpo`, `DeepSeek-R1-Distill-Qwen-32B` |
Each model is loaded via a unified interface. For GPT models and Gemini, `utils_gpt.py` provides consistent handling of prompt strategies (`zero-shot`, `few-shot`, `cot`, etc.).
Each interaction sample includes:
- `video_id`: Identifier for the interaction session
- `frame_paths`: List of image paths (15 selected frames from the video)
- `transcript`: Multi-turn dialogue between user and robot
- `label`: `"competence"`, `"error"`, or `"none"`
- `social_attributes`: List of relevant attributes from the 7 core categories
- `rationale`: Explanation for the error or competence
- `correction`: Suggested repair if the segment is an error
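As a quick illustration, assuming each processed sample is exposed as a Python dict with exactly these fields (field names taken from the list above; the actual container type in the released files may differ), a small helper to summarize a sample might look like this:

```python
def summarize_sample(sample: dict) -> str:
    """Compact, human-readable summary of one SHREC interaction sample."""
    lines = [
        f"session:    {sample['video_id']}",
        f"frames:     {len(sample['frame_paths'])}",
        f"label:      {sample['label']}",  # "competence", "error", or "none"
        f"attributes: {', '.join(sample['social_attributes'])}",
    ]
    if sample["label"] == "error":
        lines.append(f"rationale:  {sample['rationale']}")
        lines.append(f"correction: {sample['correction']}")
    return "\n".join(lines)
```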
| Task | Description | `task_type` argument |
|---|---|---|
| Error / Competence / None Detection | Classify the robot’s behavior as a social error, competence, or neither. | `detection` |
| Error Detection | Determine whether a given behavior constitutes a social error. | `detection_error_only` |
| Task | Description | `task_type` argument |
|---|---|---|
| Social Attribute Identification | Identify which of the seven social attributes are relevant to a given behavior. | `attribute` |
| Multiple Attribute Detection | Determine whether multiple social attributes are present in the behavior. | `attribute_agreed_multiple_subj` |
Seven Social Attributes:
- Emotions – Identifying and responding to emotional expressions
- Engagement – Monitoring user interest and presence
- Conversational Mechanics – Managing turn-taking, timing, and pauses
- Knowledge State – Tracking shared knowledge and references
- Intention – Inferring the goals or motives behind actions
- Social Context & Relationships – Acting appropriately based on context and social role
- Social Norms & Routines – Following culturally appropriate social conventions
| Task | Description | `task_type` argument |
|---|---|---|
| Pre-Condition Reasoning | Given the robot’s utterance, choose the plausible user behavior that came before. | `pre` |
| Post-Condition Reasoning | Given the user’s utterance, select the robot’s likely follow-up behavior. | `post` |
These tasks are structured as multiple-choice questions, with distractors sampled from real robot-user interactions.
| Task | Description | `task_type` argument |
|---|---|---|
| Rationale Selection | Choose the correct explanation for why the robot’s behavior was an error. | `rationale` |
| Correction Suggestion | Select the most appropriate corrective action the robot should have taken instead. | `correction` |
These tasks evaluate both diagnostic (understanding what went wrong) and prescriptive (knowing how to fix it) reasoning abilities.
{
"ID": "P15_s002-006",
"sample_frame": "P15_s002-006/0000.png",
"transcript": "AI Agent: (00:00:02) Hey there. How was your day today?\nUser A: (00:00:04) Good. How was yours?\n...\nAI Agent: (00:10:42) ... brighten our days.",
"Annotations_A": [
{
"timestamp": {"start": 7.21, "end": 20.23},
"error": true,
"source": {"Verbal": true, "Non-Verbal": false},
"attribute": {
"Conversational Mechanics": true,
"Intention": false,
"Emotions": false,
"Engagement": false,
"Knowledge State": false,
"Social Context & Relationships",
"Social Norms & Routines"
},
"rationale": "Delayed response and failure to understand participant.",
"correction": "Should have responded within 2–3 seconds."
}
],
"Annotations_B": [
{
"..."
}
],
"Annotations_C": [
{
"..."
}
],
"framerate": 15.0,
"frame_paths": [
"P15_s002-006/0000.png",
"P15_s002-006/0013.png",
"P15_s002-006/0034.png",
"..."
]
}
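Annotation timestamps are given in seconds, so they can be mapped onto frame indices using the sample's framerate. Below is a small sketch assuming one record per JSON file laid out as above; the file path is a placeholder.

```python
import json

with open("P15_s002-006.json") as f:  # placeholder path to a single sample record
    sample = json.load(f)

fps = sample["framerate"]
for ann in sample["Annotations_A"]:
    start = int(ann["timestamp"]["start"] * fps)  # e.g., 7.21 s * 15 fps ≈ frame 108
    end = int(ann["timestamp"]["end"] * fps)
    active = [name for name, present in ann["attribute"].items() if present]
    kind = "error" if ann["error"] else "non-error"
    print(f"{kind}: frames {start}-{end}, attributes: {active}")
```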