As part of our work on unfamiliar gesture recognition, we needed a large, high-quality dataset of situated gesture and speech: gestures and speech used at the same time by the same person to describe the same things. Such a dataset is central to our approach to zero-shot learning for gesture understanding (Thomason and Knepper, 2016). Unable to find a large existing dataset of this type, we collected our own during 2016 and 2017.

Dataset Collection

The data in this set were collected by recording participants in an experiment designed to elicit a high volume of coincident gesture and speech. Participants were given a set of instructions for folding a moderately complex piece of origami and told that the instructions had been generated by a machine learning model. They were then asked to use the instructions to teach the study conductor how to fold the piece of origami without ever showing the instructions to the study conductor.

To avoid inducing a bias toward unnatural gesture use, participants were never told to use gesture. Instead, the study conductor told participants that any mode of communication was acceptable, with two exceptions: showing the instructions and looking at what had been folded thus far.

However, because of the way the origami instructions were constructed, participants found it very difficult to complete the exercise (and convey the instructions to the study conductor) using speech alone. They therefore resorted to simultaneous speech and gesture to describe the geometry of the origami and the folding actions it required.

Dataset Properties

This dataset comprises 30 trials, each with roughly 20 minutes of recorded data (about 10 hours in total). NiTE skeleton data and audio are provided for every trial. Raw depth data is included for some trials, but unfortunately not for all of them.


[Figure: Frames 1 to 3 accompanying the utterance “Fold the two sides together like this”]
[Figure: Frames 1 to 3 accompanying the utterance “Grab the corner and pull apart”]

Data Download

You can download various forms of the dataset below.

Raw Data

These are the raw data from the collection experiment, unprocessed except for noise removal. They are packaged as a single tar.gz archive with a separate directory for each trial. Audio is stored as .mp4 files, and skeleton data are saved as pickled Python objects in the format returned by the NiTE framework.

You can download the raw data here.
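
As a starting point for working with the archive layout described above, here is a minimal Python sketch that unpacks the tar.gz file, groups files by trial directory, and unpickles a skeleton file. Several details are assumptions rather than facts about the dataset: the function names are hypothetical, the ".pkl" suffix for skeleton files is a guess (pass a different skeleton_suffix if the files are named otherwise), and the structure of the unpickled objects is not documented here. If the pickles were written by Python 2, pickle.load may need encoding="latin1", and if they reference classes from a NiTE Python binding, that package must be importable. Only unpickle files from a source you trust.

```python
import pickle
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack the tar.gz archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return dest


def index_trials(root, audio_suffix=".mp4", skeleton_suffix=".pkl"):
    """Group audio and skeleton files by the directory that contains them.

    The suffixes are parameters because the skeleton file extension is an
    assumption, not something stated in the dataset description.
    """
    trials = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == audio_suffix:
            kind = "audio"
        elif path.suffix == skeleton_suffix:
            kind = "skeleton"
        else:
            continue
        entry = trials.setdefault(path.parent.name, {"audio": [], "skeleton": []})
        entry[kind].append(path)
    return trials


def load_skeleton(path, encoding="ASCII"):
    """Unpickle one skeleton file.

    encoding="ASCII" is pickle's default; try "latin1" for Python 2 pickles.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)
```

A typical session would call extract_archive on the downloaded file, pass the returned directory to index_trials, and then call load_skeleton on one of the paths listed under a trial's "skeleton" key to inspect what the objects contain.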

Data Tools

Visualization: nite-skeleton-visualizer

Gesture Segmentation: Forthcoming!

Audio Alignment: Forthcoming!