
Social Human Robot Embodied Conversation (SHREC) Dataset

  • Authors: Dong Won Lee, Yubin Kim, Sooyeon Jeong, Denison Guvenoz, Parker Malachowsky, Louis-Philippe Morency, Cynthia Breazeal, Hae Won Park
  • Institutions: MIT, Purdue University, Carnegie Mellon University
  • License: [Pending final approval]

🧠 SHREC Dataset Summary

The Social Human Robot Embodied Conversation (SHREC) Dataset is a large-scale, real-world benchmark designed to evaluate social reasoning in language and vision-language models through physically embodied human-robot interaction (HRI). It contains:

  • ~400 real-world interaction videos
  • 10,000+ annotations from trained human annotators
  • Labels for social errors, competencies, rationales, and corrections
  • Coverage of seven social attributes critical for social intelligence

The dataset is split into 3 subsets:

  • The SHREC Wellness Home subset contains longitudinal, real-world human-robot interaction video data from Jeong et al. (2023), recorded during an 8-week in-home study with adult participants aged 18–83. Participants engaged with a socially assistive robot designed to improve psychological well-being, affect, and readiness for change through evidence-based positive psychology interventions (PPIs).
  • The SHREC Wellness Dorm subset contains longitudinal, real-world human-robot interaction video data from Jeong et al. (2020), where a robotic positive psychology coach was deployed in MIT student dormitories. Participants engaged in daily wellbeing sessions with the robot over the course of 1–4 weeks.
  • The SHREC Empathic subset contains real-world human-robot interaction video data from Shen et al. (2024), collected over a month-long deployment of social robots in participants’ homes, during which participants engaged in natural, empathic storytelling interactions with the robot.

It supports research in rapport-building, mental health intervention, and social reasoning in intimate, longitudinal HRI settings.

💾 Download Dataset from HuggingFace

To download this dataset from HuggingFace and load it into a pandas DataFrame:

For Wellness Home

import glob
import os

import pandas as pd
from huggingface_hub import snapshot_download

repo_id = "MIT-personal-robots/shrec_wellness_home"
local_dir = "shrec_wellness_home"

# Download the dataset snapshot from HuggingFace (requires a logged-in HF token).
snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir, token=True)

# Gather all parquet shards in the downloaded directory and concatenate them into one DataFrame.
parquet_files = glob.glob(os.path.join(local_dir, "**", "*.parquet"), recursive=True)
df = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)

Change repo_id accordingly for the other subsets:

  • Wellness Home: "MIT-personal-robots/shrec_wellness_home"
  • Wellness Dorm: "MIT-personal-robots/shrec_wellness_dorm"
  • Empathic: "MIT-personal-robots/shrec_wellness_empathic"

🧪 Running SHREC Benchmark Experiments

🔑 Environment Setup

Before running experiments, install all necessary dependencies:

pip install -r requirements.txt

If you'd like to use OpenAI or Google Gemini models, ensure these environment variables are set in your shell:

export OPENAI_API_KEY="your-openai-key"
export GOOGLE_GENAI_API_KEY="your-google-api-key"
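As a quick sanity check, you can confirm the keys are visible to Python before launching a run; this small sketch only reads the two variable names above.

import os

for var in ("OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY"):
    print(f"{var}: {'set' if os.environ.get(var) else 'MISSING'}")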

We use VLMEvalKit to test a wide suite of vision-language models (VLMs).

To install it:

git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

More details and configuration options can be found in their Quickstart guide.


Then, you can evaluate LLMs and VLMs on SHREC tasks by following these steps:

🔧 Step 1: Preprocess the Dataset

Run the following script to extract task-specific data from the raw HuggingFace dataset (in .csv format). This will create .pickle files under ./output_datasets for each supported task.

python main_vlm_get_data.py --data_path ../shrec_empathic.csv --data_name shrec_empathic --task_type pre

Arguments:

  • --data_path: Path to the HuggingFace-downloaded CSV file.
  • --data_name: Dataset identifier (e.g., shrec_empathic, shrec_wellness_home, etc.)
  • --task_type: Task to extract. Options:
    • detection, attribute, rationale, correction
    • post, pre
    • attribute_agreed_multiple_subj, detection_error_only
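
If you want to extract several tasks at once, the sketch below simply calls the script in a loop; the CSV path and dataset name mirror the example command above and should be adjusted to wherever you saved each subset.

import subprocess

# Preprocess several task types for one subset (paths are assumptions; adjust to your setup).
for task in ["pre", "post", "detection", "attribute", "rationale", "correction"]:
    subprocess.run(
        ["python", "main_vlm_get_data.py",
         "--data_path", "../shrec_empathic.csv",
         "--data_name", "shrec_empathic",
         "--task_type", task],
        check=True,
    )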

🚀 Step 2: Run the Model Evaluation

After preprocessing, run the benchmark with the following command:

python main_vlm_exp.py \
  --context_window 15 \
  --model GPT4o_MINI_Image \
  --data_path ./output_datasets \
  --task_type shrec_empathic_pre.pickle \
  --video \
  --csv_path ../shrec_empathic.csv \
  --images_dir ../shrec_empathic

Key Flags:

  • --context_window: Number of utterances for context (e.g., 15).
  • --model: Model to evaluate (see list below).
  • --task_type: Preprocessed .pickle file generated in Step 1.
  • --video: Include frame-based input (set this for vision-language models).
  • --images_dir: Directory with extracted image frames for each interaction.
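
To benchmark several models on the same task, one option is to reuse the command above in a loop, as in this sketch; the model identifiers come from the Supported Models section below, and the paths are the same assumptions as in the example command.

import subprocess

# Run the same preprocessed task against a few models (paths assumed; adjust as needed).
for model in ["GPT4o_MINI_Image", "GPT4o_Image", "gemini-1.5-flash"]:
    subprocess.run(
        ["python", "main_vlm_exp.py",
         "--context_window", "15",
         "--model", model,
         "--data_path", "./output_datasets",
         "--task_type", "shrec_empathic_pre.pickle",
         "--video",
         "--csv_path", "../shrec_empathic.csv",
         "--images_dir", "../shrec_empathic"],
        check=True,
    )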

📊 Step 3: Evaluate Model Performance

After running inference, evaluate model predictions using the following steps:

(a) Parse Model Outputs:

python eval_pydantic.py
  • This extracts predicted answer choices from LLM output files located in ./output/.
  • Outputs are saved into ./output_pydantic/.

(b) Compute Accuracy Metrics:

python eval.py
  • This script computes task-specific performance metrics across all models.
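
For a quick manual check of a single parsed output file, something like the sketch below works, assuming the parsed files are CSVs with hypothetical prediction and ground_truth columns; eval.py remains the authoritative metric computation.

import glob

import pandas as pd

# Rough accuracy per parsed output file (column names are assumptions, not the real schema).
for path in glob.glob("./output_pydantic/*.csv"):
    df = pd.read_csv(path)
    accuracy = (df["prediction"] == df["ground_truth"]).mean()
    print(f"{path}: accuracy = {accuracy:.3f}")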

🧠 Supported Models

Below are the models currently supported in the SHREC benchmark pipeline:

  • Open-source VLMs: paligemma-3b-mix-448, llava_next_llama3, llava_video_qwen2_7b, InternVL2-8B, MiniCPM-V-2_6, Llama-3.2-3B, Llama-3.2-3B-Instruct, Llama-3.2-11B-Vision-Instruct
  • GPT-4o variants: GPT4o_Image, GPT4o_MINI_Image, GPT4o_Lang, GPT4o_MINI_Lang, GPT4o_Image_few_shot, GPT4o_Image_cot
  • Google Gemini: gemini-1.5-flash, gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash-8b
  • Others: o1, o1-mini, llava_video_next, llava_video_next_7b_dpo, DeepSeek-R1-Distill-Qwen-32B

Each model is loaded via a unified interface. For GPT and Gemini models, utils_gpt.py provides consistent handling of prompting strategies (zero-shot, few-shot, chain-of-thought, etc.).
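
As a purely illustrative sketch (not the actual utils_gpt.py code), a prompt-strategy switch along these lines shows what the zero-shot, few-shot, and chain-of-thought modes amount to:

def build_prompt(question, strategy="zero-shot", examples=None):
    """Assemble a prompt for the chosen strategy (hypothetical helper, for illustration only)."""
    if strategy == "few-shot" and examples:
        return "\n\n".join(examples) + "\n\n" + question
    if strategy == "cot":
        return question + "\nLet's think step by step."
    return question  # zero-shot: the question as-is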

📦 Dataset Structure

Each interaction sample includes:

  • video_id: Identifier for the interaction session
  • frame_paths: List of image paths (15 selected frames from the video)
  • transcript: Multi-turn dialogue between user and robot
  • label: "competence", "error", or "none"
  • social_attributes: List of relevant attributes from 7 core categories
  • rationale: Explanation for the error or competence
  • correction: Suggested repair if the segment is an error
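
Assuming a subset has been downloaded as parquet shards and its columns follow the field names above (the actual schema may differ), filtering for error segments looks roughly like this:

import glob
import os

import pandas as pd

# Load one subset (assumes it was downloaded as shown in the download section).
files = glob.glob(os.path.join("shrec_wellness_home", "**", "*.parquet"), recursive=True)
df = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)

# Filter error segments (column names follow the field list above; the real schema may differ).
errors = df[df["label"] == "error"]
for _, sample in errors.iterrows():
    print(sample["video_id"], sample["social_attributes"], sample["correction"])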

🧪 SHREC Task Overview

1. Detecting Social Behavior

  • Error / Competence / None Detection (task_type: detection) – Classify the robot’s behavior as a social error, competence, or neither.
  • Error Detection (task_type: detection_error_only) – Determine whether a given behavior constitutes a social error.

2. Identifying Social Attributes

  • Social Attribute Identification (task_type: attribute) – Identify which of the seven social attributes are relevant to a given behavior.
  • Multiple Attribute Detection (task_type: attribute_agreed_multiple_subj) – Determine whether multiple social attributes are present in the behavior.

Seven Social Attributes:

  • Emotions – Identifying and responding to emotional expressions
  • Engagement – Monitoring user interest and presence
  • Conversational Mechanics – Managing turn-taking, timing, and pauses
  • Knowledge State – Tracking shared knowledge and references
  • Intention – Inferring the goals or motives behind actions
  • Social Context & Relationships – Acting appropriately based on context and social role
  • Social Norms & Routines – Following culturally appropriate social conventions

3. Understanding Interaction Flow

  • Pre-Condition Reasoning (task_type: pre) – Given the robot’s utterance, choose the plausible user behavior that came before.
  • Post-Condition Reasoning (task_type: post) – Given the user’s utterance, select the robot’s likely follow-up behavior.

These tasks are structured as multiple-choice questions, with distractors sampled from real robot-user interactions.
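
For illustration only (this is not the benchmark's actual sampling code), a pre-condition item of this kind could be assembled like so, with distractor utterances drawn from other interactions:

import random

def build_pre_condition_item(robot_utterance, true_prior_user_utterance, distractor_pool, n_distractors=3):
    """Build a multiple-choice item: one true preceding utterance plus sampled distractors."""
    options = random.sample(distractor_pool, n_distractors) + [true_prior_user_utterance]
    random.shuffle(options)
    return {
        "question": f"Which user behavior most plausibly preceded the robot saying: '{robot_utterance}'?",
        "options": options,
        "answer_index": options.index(true_prior_user_utterance),
    }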

4. Rationalizing & Correcting Social Errors

  • Rationale Selection (task_type: rationale) – Choose the correct explanation for why the robot’s behavior was an error.
  • Correction Suggestion (task_type: correction) – Select the most appropriate corrective action the robot should have taken instead.

These tasks evaluate both diagnostic (understanding what went wrong) and prescriptive (knowing how to fix it) reasoning abilities.

🔍 Example Sample

{
  "ID": "P15_s002-006",
  "sample_frame": "P15_s002-006/0000.png",
  "transcript": "AI Agent: (00:00:02) Hey there. How was your day today?\nUser A: (00:00:04) Good. How was yours?\n...\nAI Agent: (00:10:42) ... brighten our days.",
  "Annotations_A": [
    {
      "timestamp": {"start": 7.21, "end": 20.23},
      "error": true,
      "source": {"Verbal": true, "Non-Verbal": false},
      "attribute": {
        "Conversational Mechanics": true,
        "Intention": false,
        "Emotions": false,
        "Engagement": false,
        "Knowledge State": false,
        "Social Context &  Relationships",
        "Social Norms & Routines"
      },
      "rationale": "Delayed response and failure to understand participant.",
      "correction": "Should have responded within 2–3 seconds."
    }
  ],
  "Annotations_B": [
    {
      "..."
    }
  ],
  "Annotations_C": [
    {
      "..."
    }
  ],
  "framerate": 15.0,
  "frame_paths": [
    "P15_s002-006/0000.png",
    "P15_s002-006/0013.png",
    "P15_s002-006/0034.png",
    "..."
  ]
}
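
Given records in the layout above (e.g. a list of such dicts), a small sketch like this tallies which attributes annotator A flagged; it assumes the attribute values are booleans as shown:

from collections import Counter

def count_flagged_attributes(samples):
    """Count attributes marked true in Annotations_A across a list of sample dicts."""
    counts = Counter()
    for sample in samples:
        for annotation in sample.get("Annotations_A", []):
            for attribute, flagged in annotation.get("attribute", {}).items():
                if flagged:
                    counts[attribute] += 1
    return counts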
