As part of our work on unfamiliar gesture recognition, we encountered a need for a large, high-quality dataset of situated gesture and speech — gestures and speech used at the same time by the same person to describe the same things. Such a dataset is central to our approach to zero-shot learning for gesture understanding (Thomason and Knepper, 2016). Unable to find an existing dataset of this type, we collected our own during 2016 and 2017.

Dataset Collection

The data in this set were collected by recording participants in an experiment designed to elicit a high volume of coincident gesture and speech. Participants were given a set of instructions for folding a moderately complex piece of origami and told that the instructions had been generated by a machine learning model. They were then asked to use the instructions to teach the study conductor how to fold the piece of origami without ever showing the instructions to the study conductor.

To avoid inducing a bias toward unnatural gesture use, participants were never told to use gesture. Instead, the study conductor told participants that any mode of communication except for showing the instructions or looking at what had been folded thus far was acceptable.

However, due to the construction of the origami instructions, participants found it very difficult to complete the exercise (and convey the instructions to the study conductor) using speech alone. Thus, participants resorted to simultaneous speech and gesture to describe the geometry of the origami and the folding actions which it required.

Dataset Properties

This dataset currently comprises 26 trials, each roughly 20 minutes of recorded data. An additional 12 trials are still being processed for addition to the dataset. NiTE skeleton data and audio are provided for every trial. Some trials also include raw depth data, but this is unfortunately not available for all trials.

Data are currently provided in raw format (described below). In the coming months, we will release a pre-processed version of the data with recordings segmented into individual gestures.


If you use this dataset, please cite:

@inproceedings{thomason2016recognizing,
  title={Recognizing Unfamiliar Gestures for Human-Robot Interaction through Zero-Shot Learning},
  author={Thomason, Wil and Knepper, Ross A},
  booktitle={International Symposium on Experimental Robotics},
  year={2016}
}


Example 1: “Fold the two sides together like this”

Example 2: “Grab the corner and pull apart”

Data Download

You can download various forms of the dataset below.

Raw Data

These are the raw data from the collection experiment, unprocessed except for noise removal. They are packaged as a single tar.gz archive containing a separate directory for each trial. Audio is stored as .flac files, and skeleton data are saved as pickled Python objects in the format returned by the NiTE framework.
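As a minimal sketch of how one might walk the extracted archive, the helper below iterates over per-trial directories and unpickles the skeleton files. The directory layout, file extensions, and function name here are assumptions for illustration; the exact names in the released archive, and the internal structure of the NiTE skeleton objects, may differ.

```python
import pickle
from pathlib import Path

def iter_trials(dataset_root):
    """Yield (trial_name, audio_paths, skeleton_objects) for each trial.

    Assumes one subdirectory per trial, with .flac audio files and
    pickled skeleton data in .pkl files; adjust the glob patterns to
    match the actual archive contents.
    """
    root = Path(dataset_root)
    for trial_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        audio_paths = sorted(trial_dir.glob("*.flac"))
        skeletons = []
        for skel_path in sorted(trial_dir.glob("*.pkl")):
            with open(skel_path, "rb") as f:
                # Each pickle holds skeleton data in the format
                # returned by the NiTE framework.
                skeletons.append(pickle.load(f))
        yield trial_dir.name, audio_paths, skeletons
```

For example, `for name, audio, skels in iter_trials("extracted_archive"): ...` would visit each trial in turn, leaving audio decoding (e.g. with a FLAC library) to the caller.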

You can download the raw data here.

Data Tools

Visualization: nite-skeleton-visualizer