- Authors: Dong Won Lee, Yubin Kim, Sooyeon Jeong, Denison Guvenoz, Parker Malachowsky, Louis-Philippe Morency, Cynthia Breazeal, Hae Won Park
- Institutions: MIT, Purdue University, Carnegie Mellon University
- License: [Pending final approval]
The Social Human Robot Embodied Conversation (SHREC) Dataset is a large-scale, real-world benchmark designed to evaluate social reasoning in language and vision-language models through physically embodied human-robot interaction (HRI). It contains:
- ~400 real-world interaction videos
- 10,000+ annotations from trained human annotators
- Labels for social errors, competencies, rationales, and corrections
- Coverage of seven social attributes critical for social intelligence
The dataset is split into 3 subsets:
- The SHREC Wellness Home subset contains longitudinal, real-world human-robot interaction video data from Jeong et al. (2023), recorded during an 8-week in-home study with adult participants aged 18–83. Participants engaged with a socially assistive robot designed to improve psychological well-being, affect, and readiness for change through evidence-based positive psychology interventions (PPIs).
- The SHREC Wellness Dorm subset contains longitudinal, real-world human-robot interaction video data from Jeong et al. (2020), in which a robotic positive psychology coach was deployed in MIT student dormitories. Participants engaged in daily well-being sessions with the robot over the course of 1–4 weeks.
- The SHREC Empathic subset contains real-world human-robot interaction video data from Shen et al. (2024), collected over a month-long deployment of social robots in participants’ homes, during which participants engaged in natural, empathic storytelling interactions with the robot.
It supports research in rapport-building, mental health intervention, and social reasoning in intimate, longitudinal HRI settings.
To download this dataset from Hugging Face and load it into a pandas DataFrame:
```python
import glob

import pandas as pd
from huggingface_hub import snapshot_download

# Download the dataset snapshot from the Hugging Face Hub
repo_id = "MIT-personal-robots/shrec_wellness_home"
local_dir = "shrec_wellness_home"
snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir, token=True)

# Concatenate all parquet files in the snapshot into a single DataFrame
parquet_files = glob.glob(f"{local_dir}/**/*.parquet", recursive=True)
df = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)
```
Change `repo_id` accordingly for the different subsets:
- SHREC Wellness Home: `MIT-personal-robots/shrec_wellness_home`
- SHREC Wellness Dorm: `MIT-personal-robots/shrec_wellness_dorm`
- SHREC Empathic: `MIT-personal-robots/shrec_wellness_empathic`
Before running experiments, install all necessary dependencies:
pip install -r requirements.txt
If you'd like to use OpenAI or Google Gemini models, ensure these environment variables are set in your shell:
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_GENAI_API_KEY="your-google-api-key"
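If a run fails with authentication errors, a quick sanity check (a minimal sketch using the variable names above) is:

```python
import os

# Fail fast if the API keys required for OpenAI / Gemini models are missing.
for var in ("OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set; export it before running the benchmark.")
```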
We use VLMEvalKit to test a wide suite of vision-language models (VLMs).
To install it:
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
More details and configuration options can be found in their Quickstart guide.
Then, you can evaluate LLMs and VLMs on SHREC tasks by following these steps:
Run the following script to extract task-specific data from the raw HuggingFace dataset (in `.csv` format). This will create `.pickle` files under `./output_datasets` for each supported task.
python main_vlm_get_data.py --data_path ../shrec_empathic.csv --data_name shrec_empathic --task_type pre
Arguments:
- `--data_path`: Path to the HuggingFace-downloaded CSV file.
- `--data_name`: Dataset identifier (e.g., `shrec_empathic`, `shrec_wellness_home`, etc.)
- `--task_type`: Task to extract. Options: `detection`, `attribute`, `rationale`, `correction`, `post`, `pre`, `attribute_agreed_multiple_subj`, `detection_error_only`
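Once extraction has finished, you can sanity-check a generated file before running the benchmark. This is only a sketch: the exact structure of the pickled object depends on the task type, so inspect it rather than assuming a schema.

```python
import pickle

# Load one of the preprocessed task files produced by main_vlm_get_data.py.
with open("./output_datasets/shrec_empathic_pre.pickle", "rb") as f:
    task_data = pickle.load(f)

print(type(task_data))
if hasattr(task_data, "__len__"):
    print(f"{len(task_data)} samples")
```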
After preprocessing, run the benchmark with the following command:
python main_vlm_exp.py \
--context_window 15 \
--model GPT4o_MINI_Image \
--data_path ./output_datasets \
--task_type shrec_empathic_pre.pickle \
--video \
--csv_path ../shrec_empathic.csv \
--images_dir ../shrec_empathic
Key Flags:
- `--context_window`: Number of utterances for context (e.g., 15).
- `--model`: Model to evaluate (see list below).
- `--task_type`: Preprocessed `.pickle` file generated in Step 1.
- `--video`: Include frame-based input (set this for vision-language models).
- `--images_dir`: Directory with extracted image frames for each interaction.
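To sweep several models over the same task, one option is to wrap the command in a loop. The sketch below is a convenience wrapper, not part of the repository; it simply reuses the flags and model identifiers shown above.

```python
import subprocess

# Run the benchmark for several models in sequence, reusing the flags above.
models = ["GPT4o_MINI_Image", "gemini-1.5-flash", "InternVL2-8B"]
for model in models:
    subprocess.run(
        [
            "python", "main_vlm_exp.py",
            "--context_window", "15",
            "--model", model,
            "--data_path", "./output_datasets",
            "--task_type", "shrec_empathic_pre.pickle",
            "--video",
            "--csv_path", "../shrec_empathic.csv",
            "--images_dir", "../shrec_empathic",
        ],
        check=True,
    )
```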
After running inference, evaluate model predictions using the following steps:
(a) Parse Model Outputs:
python eval_pydantic.py
- This extracts predicted answer choices from LLM output files located in `./output/`.
- Outputs are saved into `./output_pydantic/`.
(b) Compute Accuracy Metrics:
python eval.py
- This script computes task-specific performance metrics across all models.
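`eval.py` implements the benchmark's metric computation. For reference only, a minimal, generic accuracy over parallel lists of predicted and gold answer choices (not the script's actual implementation) looks like this:

```python
from typing import Sequence

def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of samples whose predicted answer choice matches the gold label."""
    assert len(predictions) == len(gold), "prediction/label count mismatch"
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

print(accuracy(["A", "B", "C"], ["A", "B", "D"]))  # 0.666...
```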
Below are the models currently supported in the SHREC benchmark pipeline:
| Category | Model Identifier |
|---|---|
| Open-source VLMs | `paligemma-3b-mix-448`, `llava_next_llama3`, `llava_video_qwen2_7b`, `InternVL2-8B`, `MiniCPM-V-2_6`, `Llama-3.2-3B`, `Llama-3.2-3B-Instruct`, `Llama-3.2-11B-Vision-Instruct` |
| GPT-4o Variants | `GPT4o_Image`, `GPT4o_MINI_Image`, `GPT4o_Lang`, `GPT4o_MINI_Lang`, `GPT4o_Image_few_shot`, `GPT4o_Image_cot` |
| Google Gemini | `gemini-1.5-flash`, `gemini-2.0-flash-exp`, `gemini-1.5-pro`, `gemini-1.5-flash-8b` |
| Others | `o1`, `o1-mini`, `llava_video_next`, `llava_video_next_7b_dpo`, `DeepSeek-R1-Distill-Qwen-32B` |
Each model is loaded via a unified interface. For GPT models and Gemini, `utils_gpt.py` provides consistent handling of prompt strategies (`zero-shot`, `few-shot`, `cot`, etc.).
Each interaction sample includes:
- `video_id`: Identifier for the interaction session
- `frame_paths`: List of image paths (15 selected frames from the video)
- `transcript`: Multi-turn dialogue between user and robot
- `label`: `"competence"`, `"error"`, or `"none"`
- `social_attributes`: List of relevant attributes from the 7 core categories
- `rationale`: Explanation for the error or competence
- `correction`: Suggested repair if the segment is an error
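As a quick illustration, assuming each processed sample is exposed as a Python dict with exactly these fields (field names taken from the list above; the actual container type in the released files may differ), a small helper to summarize a sample might look like this:

```python
def summarize_sample(sample: dict) -> str:
    """Compact, human-readable summary of one SHREC interaction sample."""
    lines = [
        f"session:    {sample['video_id']}",
        f"frames:     {len(sample['frame_paths'])}",
        f"label:      {sample['label']}",  # "competence", "error", or "none"
        f"attributes: {', '.join(sample['social_attributes'])}",
    ]
    if sample["label"] == "error":
        lines.append(f"rationale:  {sample['rationale']}")
        lines.append(f"correction: {sample['correction']}")
    return "\n".join(lines)
```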
| Task | Description | `task_type` argument |
|---|---|---|
| Error / Competence / None Detection | Classify the robot’s behavior as a social error, competence, or neither. | `detection` |
| Error Detection | Determine whether a given behavior constitutes a social error. | `detection_error_only` |
| Task | Description | `task_type` argument |
|---|---|---|
| Social Attribute Identification | Identify which of the seven social attributes are relevant to a given behavior. | `attribute` |
| Multiple Attribute Detection | Determine whether multiple social attributes are present in the behavior. | `attribute_agreed_multiple_subj` |
Seven Social Attributes:
- Emotions – Identifying and responding to emotional expressions
- Engagement – Monitoring user interest and presence
- Conversational Mechanics – Managing turn-taking, timing, and pauses
- Knowledge State – Tracking shared knowledge and references
- Intention – Inferring the goals or motives behind actions
- Social Context & Relationships – Acting appropriately based on context and social role
- Social Norms & Routines – Following culturally appropriate social conventions
| Task | Description | `task_type` argument |
|---|---|---|
| Pre-Condition Reasoning | Given the robot’s utterance, choose the plausible user behavior that came before. | `pre` |
| Post-Condition Reasoning | Given the user’s utterance, select the robot’s likely follow-up behavior. | `post` |
These tasks are structured as multiple-choice questions, with distractors sampled from real robot-user interactions.
| Task | Description | `task_type` argument |
|---|---|---|
| Rationale Selection | Choose the correct explanation for why the robot’s behavior was an error. | `rationale` |
| Correction Suggestion | Select the most appropriate corrective action the robot should have taken instead. | `correction` |
These tasks evaluate both diagnostic (understanding what went wrong) and prescriptive (knowing how to fix it) reasoning abilities.
{
"ID": "P15_s002-006",
"sample_frame": "P15_s002-006/0000.png",
"transcript": "AI Agent: (00:00:02) Hey there. How was your day today?\nUser A: (00:00:04) Good. How was yours?\n...\nAI Agent: (00:10:42) ... brighten our days.",
"Annotations_A": [
{
"timestamp": {"start": 7.21, "end": 20.23},
"error": true,
"source": {"Verbal": true, "Non-Verbal": false},
"attribute": {
"Conversational Mechanics": true,
"Intention": false,
"Emotions": false,
"Engagement": false,
"Knowledge State": false,
"Social Context & Relationships",
"Social Norms & Routines"
},
"rationale": "Delayed response and failure to understand participant.",
"correction": "Should have responded within 2–3 seconds."
}
],
"Annotations_B": [
{
"..."
}
],
"Annotations_C": [
{
"..."
}
],
"framerate": 15.0,
"frame_paths": [
"P15_s002-006/0000.png",
"P15_s002-006/0013.png",
"P15_s002-006/0034.png",
"..."
]
}
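Annotation timestamps are given in seconds, so they can be mapped onto frame indices using the sample's framerate. Below is a small sketch assuming one record per JSON file laid out as above; the file path is a placeholder.

```python
import json

with open("P15_s002-006.json") as f:  # placeholder path to a single sample record
    sample = json.load(f)

fps = sample["framerate"]
for ann in sample["Annotations_A"]:
    start = int(ann["timestamp"]["start"] * fps)  # e.g., 7.21 s * 15 fps ≈ frame 108
    end = int(ann["timestamp"]["end"] * fps)
    active = [name for name, present in ann["attribute"].items() if present]
    kind = "error" if ann["error"] else "non-error"
    print(f"{kind}: frames {start}-{end}, attributes: {active}")
```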