Project 2 Tutorial

This tutorial gives an overview of how to use the data provided by the HCI display on Prospect St (out of the CEID) in New Haven.

The data, captured by a depth camera, includes information about the pose of pedestrians. By the end of this tutorial, you will:

  • be able to connect to the HCI display
  • access and view the raw pose data
  • interpret and use the raw data through a simple web application
  • create a simple visualization of the data

Display Functionality

  • Dante: group detection
  • AlphaPose: pose estimation
  • Input: depth camera image, video, image list
  • Output: RGB image with keypoints drawn (saved as PNG, JPG, or AVI) and keypoint data saved as JSON

Tutorial Contents

  1. Current Server IPs
  2. Tools Before Starting
  3. Creating a Simple Website
  4. Websocket Connections
  5. Understanding the Data
  6. Seeing the Camera View
  7. Creating a Visualization
  8. Recording Data
  9. Demos

Current Server IPs

In this guide, the placeholder [Env_IP] must be set to either the [Production_IP] or the [Development_IP] address specified in this section, depending on which data you want to see. These addresses may change from time to time, so we suggest storing them in a variable in your code so they can be easily updated (see the snippet after this list). If you’re having trouble connecting, please check here for an updated IP address first.

  • [Production_IP] = 172.28.142.145

    • The production machine serves live data from the display in the CEID. Use this IP to test your code with live data.

    • View the live data currently being shown on the display in your browser here: http://172.28.142.145:8888/

    • Change the project currently being shown on the display with the “remote control” here: http://172.28.142.145:8888/remote. You’ll need to be at the display to get the 4-digit code. The code changes every time the displayed project changes.

    • Preview what your project will look like by replacing the 1 in group1 in this URL with your group id: http://172.28.142.145:8888?project=group1. This will only work once your app has been deployed to the laptop. You can check whether your app has been deployed by visiting http://172.28.142.145:8888/remote and seeing if your group is on the list.

  • [Development_IP] = 172.29.41.16

    • The development machine serves pre-recorded data on a loop. Use this IP to test your code on pre-recorded data with lots of people in view.

    • View the demo data being played back in your browser here: http://172.29.41.16:8888/
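
For example, here is a minimal sketch of storing these values in your JavaScript code (the variable names are just for illustration):

// IP addresses from the Current Server IPs section; update here if they change
var PRODUCTION_IP = "172.28.142.145"; // live data from the display in the CEID
var DEVELOPMENT_IP = "172.29.41.16";  // pre-recorded data played on a loop

// [Env_IP] used throughout the rest of this tutorial
var ENV_IP = DEVELOPMENT_IP;
var host = ENV_IP + ":8888";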

Tools Before Starting

This tutorial aims to be independent of whichever operating system and web browser you prefer to use. However, there may be some minor differences in how objects are displayed and how connections are made.

  • An Integrated Development Environment (IDE) for source code editing and syntax highlighting. Some recommendations include Visual Studio Code, Sublime Text, or Notepad++.

  • Connection to the Yale Virtual Private Network (VPN). Many of Yale’s digital resources, including the data from the HCI display, are only accessible while connected to the VPN. Follow these setup instructions if you don’t already have the VPN installed or want to know more about how to use it.

Creating a Simple Website

Using HTML (a standard markup language for creating websites) and CSS (a language that describes the style of an HTML document), we can create a simple website. In a file called index.html:

<!DOCTYPE html>
<html>
  <head>
    <title>HCI Tutorial</title>
  </head>
  <body>
      <!-- page content goes here -->
  </body>
</html>

Using JavaScript (a language to program the behavior of web pages), we’ll import jQuery (a JavaScript library that simplifies event handling for our project). Download the compressed jQuery file from code.jquery.com and save it in the local directory containing your previously created HTML file, or load it from a CDN as shown in the snippet below.

...
    <body>
        <!-- page content goes here -->
        <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
        <script src="main.js" type="text/javascript"></script>
    </body>
...

You’ll also notice our import of main.js in the code snippet above. Create an empty JavaScript file called main.js in your local directory. The code that goes in there will allow us to access the raw data from the HCI display and communicate it to our HTML page.

Websocket Connections

The WebSocket protocol allows us to continuously exchange data between the browser and server through a persistent connection (that is, without breaking the connection and without additional HTTP-requests).

In main.js, create a new WebSocket using the special ws protocol in the URL. Any data we’ll need from the HCI display can be accessed through the IP address and port [Env_IP]:8888.

var socket = new WebSocket("ws://[Env_IP]:8888/frames");

For example, substituting the [Production_IP], the whole connection string is:

var socket = new WebSocket("ws://172.28.142.145:8888/frames");

Opening the index.html file in your browser will not show anything happening yet. However, if you right-click in the browser and ‘Inspect’ the page, you can view the status of this websocket connection under the ‘Network’ tab.

The following expansion on this code snippet creates a connection to [Env_IP]:8888/frames and prints the details of each raw frame collected by the Kinect camera on the HCI display.

// [Env_IP] substituted with the IP address from the Current Server IPs section
var host = "[Env_IP]:8888";

// start the connection once the page has loaded
$(document).ready(function () {
    frames.start();
});

var frames = {
    socket: null,

    // create a connection to the /frames endpoint
    start: function () {
        var url = "ws://" + host + "/frames";
        frames.socket = new WebSocket(url);

        // whenever a new frame is received, parse and show it
        frames.socket.onmessage = function (event) {
            frames.show(JSON.parse(event.data));
        }
    },

    // print the raw frame to the browser console
    show: function (frame) {
        console.log(frame);
    }
};

In addition to /frames, you can subscribe to /depth for the raw depth image and to /twod for an image with pose information overlaid on the depth image.

You can confirm that you’re successfully connected by inspecting the page and checking that the ‘Status’ column for the site you’re connecting to displays a status code of 101.

Selecting the connection name in the left-hand menu will provide more information about the connection headers and the overall connection status.

Finally, in the ‘Console’ tab, you can view the stream of raw keypoints arriving with each frame.
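
You can also log the connection status from your own code by attaching handlers to the socket’s standard open, error, and close events. A minimal sketch, added after the socket is created in frames.start:

frames.socket.onopen = function () {
    console.log("Connected to the HCI display");
};
frames.socket.onerror = function (event) {
    console.log("WebSocket error:", event);
};
frames.socket.onclose = function (event) {
    console.log("Connection closed (code " + event.code + ")");
};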

Understanding the Data

Each message on /frames contains information about the people the camera can currently see. If there are two or more people and their orientations can be detected, group information is also provided.

Keypoints

Pose information for each person is provided as keypoints (positions of a person’s joints) following the COCO keypoint format, where each index corresponds to a joint name:

0: 'Nose'
1: 'LEye'
2: 'REye'
3: 'LEar'
4: 'REar'
5: 'LShoulder'
6: 'RShoulder'
7: 'LElbow'
8: 'RElbow'
9: 'LWrist'
10: 'RWrist'
11: 'LHip'
12: 'RHip'
13: 'LKnee'
14: 'RKnee'
15: 'LAnkle'
16: 'RAnkle'

Note that only the joints that can be detected by the camera are sent in a given /frame.
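
Because of this, your code should not assume that every joint is present for every person. Here is a minimal sketch of iterating over whatever keypoints were detected in a frame (assuming frame is a parsed /frames message, as in the example data point later in this section):

function logDetectedJoints(frame) {
    // frame.people maps person ids to pose objects
    for (var id in frame.people) {
        var keypoints = frame.people[id].keypoints;
        for (var joint in keypoints) {
            var kp = keypoints[joint]; // [x, y, z, confidence]
            console.log("Person " + id + " " + joint + ":", kp[0], kp[1], kp[2]);
        }
        // a joint that was not detected is simply absent from keypoints
        if (!("LAnkle" in keypoints)) {
            console.log("Person " + id + " has no LAnkle in this frame");
        }
    }
}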

Units

All distances are given in millimeters.

3D Frame of Reference

Since we wish to localize people not only within the 2D image, but within the 3D world as well, we use the depth camera on the display to approximate a 3D coordinate point for each joint. Every coordinate point must be relative to some “coordinate frame.” For the data here, the coordinate frame is that of the camera, and all 3D coordinates in the data are relative to it.

This tutorial from the camera manufacturer provides more information about the camera’s coordinate frame.

TV Frame

For those building interactive systems, the location of the TV relative to the camera may be important.

The center of the TV is approximately 450 mm from the camera in the positive Y direction (i.e., below it).
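
If you want coordinates relative to the TV rather than the camera, you can subtract this offset. A minimal sketch, assuming only the vertical (Y) offset matters and treating the horizontal and depth offsets as zero since they are not specified here:

var TV_OFFSET_Y_MM = 450; // TV center is ~450 mm below the camera (positive Y)

// point is [x, y, z] (or [x, y, z, confidence]) in millimeters in the camera frame
function cameraToTv(point) {
    return [point[0], point[1] - TV_OFFSET_Y_MM, point[2]];
}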

An Example Data Point

To understand what each field in a frame means, read through this example. See the # comments below for a detailed description of each field:

{
    # Timestamp in milliseconds since the Unix epoch.
    "ts": 1616975727738,

    # The format is always "coco", referring to the COCO keypoint dataset: https://cocodataset.org/#keypoints-2020
    "format": "coco",

    # There is one element in the "people" object for each person in view.
    # Object keys are an identifier for the person and values are an object with pose information for the person.
    "people": {

        # Start of the object for person with ID "29"
        # Note: Only the first person is annotated, the rest follow the same format.
        "29": {
            # idx is this person's identifier and unique for as long as a person is in view.
            # When a person is occluded or leaves the view of the camera, the id is re-assigned to
            # another detected person. If a previously-seen person comes back into view, a new id is assigned.
            # This value is the same as the key of this object.
            "idx": 29,

            # The average position of all detected joints
            # Note that the 4th value of each 3D position is a probability indicating the confidence of the prediction.
            "avg_position": [
                88.72024130821228, # Average X position in mm from the camera coordinate frame.
                344.2315616607666, # Average Y position in mm from the camera coordinate frame.
                2669.3439331054688, # Average Z position in mm from the camera coordinate frame.
                0.7196953520178795 # Average confidence of the prediction.
            ],

            # An object containing all the detected keypoints for this person.
            # The key is the name of the joint and the value has 3D position and confidence information.
            # Note: Only the first keypoint is annotated, the rest follow the same format.
            "keypoints": {
                "LEye": [ # Joint name (see Figure above for further description)
                    161.81634521484375, # X position in mm from the camera coordinate frame.
                    68.1683349609375, # Y position in mm from the camera coordinate frame.
                    2964.96435546875, # Z position in mm from the camera coordinate frame.
                    0.8872065544128418 # Confidence of the prediction.
                ],
                "REye": [
                    137.91452026367188,
                    64.6593246459961,
                    2812.340576171875,
                    0.9290732145309448
                ],
                "LEar": [
                    51.196128845214844,
                    78.58700561523438,
                    2754.69580078125,
                    0.6128555536270142
                ],
                "REar": [
                    73.93463134765625,
                    78.43740844726562,
                    2749.4521484375,
                    0.8814399242401123
                ],
                "LShoulder": [
                    49.954246520996094,
                    218.06932067871094,
                    2687.874267578125,
                    0.6979869604110718
                ],
                "RShoulder": [
                    49.83028793334961,
                    232.37428283691406,
                    2681.20458984375,
                    0.8318760991096497
                ],
                "LElbow": [
                    62.25393295288086,
                    488.0362243652344,
                    2580.78076171875,
                    0.7045618891716003
                ],
                "RElbow": [
                    62.11945343017578,
                    501.24114990234375,
                    2575.205810546875,
                    0.5965441465377808
                ],
                "LWrist": [
                    204.5781707763672,
                    573.5643920898438,
                    2486.921142578125,
                    0.6451338529586792
                ],
                "RWrist": [
                    243.98159790039062,
                    599.56689453125,
                    2538.71533203125,
                    0.7768394351005554
                ],
                "LHip": [
                    -23.678733825683594,
                    614.4974365234375,
                    2601.93505859375,
                    0.49556764960289
                ],
                "RHip": [
                    -9.257685661315918,
                    613.5769653320312,
                    2598.037353515625,
                    0.5772589445114136
                ]
            },

            # Angle, in radians, of the head and body relative to the camera coordinate frame.
            # These values are only present when they can be estimated.
            "orientation": {
                "head": 1.4250283142169624,
                "body": 1.5070544814520053
            },

            # Overall confidence (probability) of the keypoints.
            "kp_score": [
                0.4779687225818634
            ],

            # Overall confidence (probability) of the bounding box proposal.
            "proposal_score": [
                0.4779687225818634
            ],

            # A bounding box proposal where values correspond to pixel coordinates in the depth image
            # for a 2D bounding box outlining the person.
            "box": [
                # xmin
                279.9930599102781,
                # ymin
                167.5819030761719,
                # x width
                78.35422441772505,
                # y height
                167.28322753906252
            ],

            # The 'headpose' field is only present if it can be estimated.
            # Estimated via MTCNN and deep-head-pose: https://github.com/natanielruiz/deep-head-pose
            # The values are rotation angles (pitch, roll, yaw) in degrees.
            # The frame is right-handed, z-forward (the way the person is facing), y-down.
            "headpose": {
                "pitch": -7.35711669921875,
                "roll": -6.687309265136719,
                "yaw": 3.6068115234375
            },

            # Orientation of the person in radians in the camera coordinate frame.
            # Taken from orientation.head or orientation.body and only present when the camera can
            # estimate one of these values.
            "theta": 1.4250283142169624
        },
        # End of person 29

        "30": {
            "idx": 30,
            "avg_position": [
                93.74205417633057,
                346.6014114379883,
                2645.3414794921873,
                0.6191366255283356
            ],
            "keypoints": {
                "REye": [
                    143.68145751953125,
                    56.085060119628906,
                    2773.3662109375,
                    0.8639733791351318
                ],
                "LEar": [
                    81.44731903076172,
                    78.3418960571289,
                    2746.10400390625,
                    0.5425136685371399
                ],
                "REar": [
                    73.93463134765625,
                    78.43740844726562,
                    2749.4521484375,
                    0.8331313133239746
                ],
                "LShoulder": [
                    49.83028793334961,
                    232.37428283691406,
                    2681.20458984375,
                    0.45515918731689453
                ],
                "RShoulder": [
                    34.840354919433594,
                    246.2038116455078,
                    2670.1796875,
                    0.6443066000938416
                ],
                "LElbow": [
                    47.97573471069336,
                    509.596923828125,
                    2581.417236328125,
                    0.5547038316726685
                ],
                "RElbow": [
                    54.87013626098633,
                    514.3778076171875,
                    2569.598388671875,
                    0.6305257081985474
                ],
                "LWrist": [
                    221.01174926757812,
                    566.6220703125,
                    2517.2548828125,
                    0.47555679082870483
                ],
                "RWrist": [
                    246.3041534423828,
                    591.08349609375,
                    2562.88232421875,
                    0.668350875377655
                ],
                "RHip": [
                    -16.475282669067383,
                    592.891357421875,
                    2601.955322265625,
                    0.5231449007987976
                ]
            },
            "orientation": {
                "head": 1.5580835336248964
            },
            "kp_score": [
                0.3251804709434509
            ],
            "proposal_score": [
                0.3251804709434509
            ],
            "box": [
                293.544189453125,
                172.2435302734375,
                65.81124877929688,
                102.1400146484375
            ],
            "theta": 1.5580835336248964
        }
    },

    # Group information is a list with one entry per detected group. Each entry is a list of the
    # member idx values from "people" above, so its length varies with the size of the group.
    # Each idx above will be a member of one and only one group.
    "groups": [
        # A group, consisting of people with idxs 29 and 30:
        [
            29,
            30
        ]
    ]
}
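
For example, here is a minimal sketch of walking the groups list and looking up each member’s pose in people (again assuming frame is a parsed /frames message; groups may be absent when fewer than two people are in view):

function logGroups(frame) {
    (frame.groups || []).forEach(function (group, i) {
        console.log("Group " + i + " has " + group.length + " member(s)");
        group.forEach(function (idx) {
            // group members are idx values that key into frame.people
            var person = frame.people[idx];
            if (person) {
                console.log("  person " + idx + " average position:", person.avg_position);
            }
        });
    });
}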

Seeing the Camera View

It may be useful to view the raw image feed from the camera. In order to do so, include an img object in the body of your index.html.

<img class="twod"/>

Then, in your script file, update the src attribute of the image object whenever a new image frame is received from the camera connection.

var twod = {
    socket: null,

    // create a connection to the camera feed
    start: function () {
        var url = "ws://" + host + "/twod";
        twod.socket = new WebSocket(url);

        // whenever a new frame is received...
        twod.socket.onmessage = function (event) {

            // parse and show the raw data
            twod.show(JSON.parse(event.data));
        }
    },

    // show the image by adjusting the source attribute of the HTML img object previously created
    show: function (twod) {
        $('img.twod').attr("src", 'data:image/jpeg;base64,' + twod.src);
    },
};

The result is an image on your HTML page that updates according to the output published by the camera feed. (Remember to start this connection too, e.g., by calling twod.start() inside the $(document).ready handler alongside frames.start().)

Just a note: the webpage you’re working with is local, so the address bar will show a file path (e.g., file:///C:/Users/rramn/Desktop/hci-display-tutorial/index.html) instead of a typical website URL or IP address. You can also view the data you’re getting from the HCI display directly by visiting http://[Env_IP]:8888/debug.

If there are error messages in the console or the ‘Network’ tab does not show the data and connection you expect, then the HCI display’s camera may not be turned on at that moment.

Creating A Visualization

The canvas element in HTML can be used to draw graphics on the web page.

In a script file, you can dynamically create a canvas object using the createCanvas method from the p5.js library (introduced below). The following code snippet demonstrates how you can adjust attributes of the HTML canvas object in JavaScript:

function setup() {
    // get the dimensions of the parent HTML element
    height = document.getElementById('sketch-holder').clientHeight;
    width = document.getElementById('sketch-holder').clientWidth;

    // create canvas
    var canvas = createCanvas(width, height);

    // stretch canvas to fit dimensions of parent
    canvas.parent('sketch-holder');
    canvas.width = width;
    canvas.height = height;
}
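
Note that the snippet above assumes your index.html contains a parent element with the id sketch-holder, for example:

<div id="sketch-holder">
    <!-- the canvas will be inserted here -->
</div>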

Some useful resources for understanding how to use the canvas object in HTML and JavaScript are w3schools.com and developer.mozilla.org. A library that makes graphics and creative coding more accessible and has great documentation is p5.js.

Access and use the variables from main.js by referencing them where necessary (e.g., a frame’s people object to get the bounding box of each person in a given frame). See JSON property accessors for more information on how to access the properties of a JSON object.
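
For example, here is a minimal p5.js sketch that draws each person’s bounding box, assuming main.js stores the most recent frame in a global variable (latestFrame is a hypothetical name; you could set it inside frames.show) and that the canvas matches the depth image’s pixel dimensions (otherwise, scale the coordinates accordingly):

var latestFrame = null; // update this in frames.show, e.g. latestFrame = frame;

function draw() {
    background(0);
    if (!latestFrame) {
        return;
    }
    noFill();
    stroke(255);
    for (var id in latestFrame.people) {
        // box is [xmin, ymin, width, height] in depth-image pixel coordinates
        var box = latestFrame.people[id].box;
        rect(box[0], box[1], box[2], box[3]);
    }
}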

A web development framework such as VueJS, React, or AngularJS can also be used.

Creating A 3D Visualization

Unity can be used to create a 3D visualization of the data.

The demonstration code, which can be used as a starting point for your project, is available here: https://github.com/yale-img/hci-unity.

Recording Data

Data can be recorded using the data recording tool.

Usage instructions are described in the README.

The data recording tool is also a good example of how to subscribe to the spatial data from a Python project. If you plan on building a Python-based project, you can use this tool as a starting point for your application.

Demos

Demo projects can be used as a starting point for your own project: