2. Research

2.1. Related works

The field of interactive digital systems that react to the body is not new, ranging from early works of the 1970s like Myron Krueger’s Videoplace 1 to 2020’s Super You 2 by the collective Universal Everything. Previous works which focus on music can be located in several fields of application.

In the following I will introduce four quite different projects and products.

2.1.1. Installation: Very Nervous System

Figure 1. This montage illustrates the principal feedback loop of VNS: Action -> Observation -> Software interpretation -> Musical feedback -> Observation... Image: Rokeby

The installation Very Nervous System 3 by David Rokeby was presented at the 1986 Venice Biennale. It is a system of cameras that tracks the users, uses software to interpret their movements, and plays a synthetic orchestra as a response. There is no visual feedback. Rokeby himself said that his main interest is the interaction between user and system, rather than the individual input and output phenomena:

“While the ‘sound’ of the system and the ‘dance’ of the person within the space are of interest to me, the central aspect of the work is neither the ‘sound’ nor the ‘dance’. It is the relationship that develops between the sounding installation and the dancing person that is the core of the work. [...] The installation watches and sings; the person listens and dances. But the relationship that develops is not simply that of a dialogue between person and system. Dialogue in its back-and forthing implies a separation of the functions of perceiving and responding. But for the installation, perception and expression are virtually simultaneous. As a result the installation and participant form a feedback loop which is very tight, yet very complex.” (Rokeby, 1990 4)

Rokeby relies on the visitor giving up direct control of the computer in favor of a synergy of human movement and digital perception and reaction. There is interpretation on both sides: the machine has to process the low-resolution images of the (initially self-crafted) cameras through preconceived rule systems. The visitor, for their part, tries to relate the immediate musical feedback to their actions and may be inspired to further actions.

Rokeby began work on Very Nervous System in 1982, at a time when computers were still perceived primarily as “calculating machines” and were largely reserved for industry, science and technology enthusiasts. He continued to develop the system over 13 years, modernizing components such as video input and sound processing. The latter was later based on the Max/MSP graphical development environment, which is popular especially in the audio field. Until 2002 Rokeby also sold the system softVNS, a collection of Max objects that could be used for tracking and processing motion 5.

I had brief e-mail contact with Rokeby after discovering his work during my research. I asked him directly whether he considered my project obsolete. In his opinion this was not the case, for the following reasons: everyone he knew who had worked on this topic so far had found their own approach, and bringing such a system into the private sphere would be quite interesting and worthwhile in itself.

2.1.2. Video Game: Just Dance

Figure 2. Screen from “Just Dance 2021” for Sony Playstation. Image: Ubisoft

Just Dance is a representative of dance video games, which are themselves a niche in the genre of music video games 6. Just Dance was released on Sony Playstation, Microsoft XBox, Nintendo Wii, iPhone and Android smartphones. The gameplay is based on players dancing choreographies to well-known pop hits. The chains of postures are strictly defined, and timing is of the essence. Depending on the version and platform, the movement and pose of the players are detected by various technologies, from camera input and peripherals such as hand-held motion-sensitive gaming controllers, to the smartphone, which registers the movements via acceleration sensors. The choreographies are displayed on-screen during the game, with successive steps scrolling through the image from right to left. These visual timing aids appear in many music games like Singstar, Guitar Hero or Rock Band, and before that had been used in karaoke videos, where a ball bounces from one syllable of the lyrics to the next. In addition to these timing hints, the rest of the interface of Just Dance is visually rich: for each track there is an exclusive music video with dynamic visual effects. In fact, the feedback on the player’s performance is purely visual, as there is no change in the audio layer at all. The goal of the normal mode is to re-dance the choreographies as accurately as possible to achieve the highest possible score. The series thrives on its well-known songs and star performers and has been released in annual iterations with new songs since 2009.

2.1.3. Interactive Entertainment: Nagual Dance

Figure 3. Interface of a Nagual Dance session with an overlay of the video stream on the left. Image: Nagual Dance

Nagual Dance was developed by the Berlin startup “Nagual Sounds” 7 from 2011 to 2015. I discovered it quite late in the process, even though it received a relatively broad media response at the time, won several prizes and was financed by investors. The project used a first-generation Microsoft Kinect camera for body tracking and applied a unified interaction model that took into account both the position of the limbs and the position of the players in space. Players move freely on a 2 × 2 grid, and, depending on which field they are in, different layers of the song are played. By moving their hands and feet they activate rhythm, harmony and melody, which vary according to the joint positions. Nagual Dance can also be played by two players simultaneously; one person then controls the tonal elements and the other the percussion. The music selection consists mainly of pop and dance tracks without vocals. Abstract avatars of the players standing in the playfield are displayed on the screen.
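The spatial part of this interaction model can be sketched as a simple position-to-layer lookup. The cell-to-layer assignment, the coordinate convention and all names below are my own assumptions for illustration, not Nagual Dance’s actual implementation:

```cpp
#include <cassert>

// Hypothetical sketch of a 2 x 2 grid mapping: the player's floor
// position (in meters, origin at the grid center) selects which layer
// of the song is played. Layout and names are assumptions.
enum class SongLayer { Drums, Bass, Harmony, Melody };

SongLayer layerFor(double x, double z) {
    if (x < 0 && z < 0) return SongLayer::Drums;
    if (x < 0)          return SongLayer::Bass;
    if (z < 0)          return SongLayer::Harmony;
    return SongLayer::Melody;
}
```

Each of the four quadrants simply triggers one layer; a real system would additionally crossfade between layers as the player crosses a cell boundary.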

I was able to try different iterations of Nagual Dance during a meeting with co-founder Matthias Strobel. Strangely enough, the whole thing felt somewhat stiff. Technically, the software worked well most of the time, but I missed an emotional spark. I can’t put my finger on the exact cause. Maybe it was a different taste in music, but I remember constantly looking at the screen in front of me to rationally establish a connection between my action and the musical events, merely executing a function in the system. This may simply be my individual perception.

Nagual Dance was discontinued in 2015 after investors backed out. The company filed for bankruptcy in 2019.

2.1.4. Therapy and performance: Motion Composer

Another area in which motion-sensitive music can be used is therapeutic applications. The sonification of body movements for this purpose is an active area of research 8. One product in this category is Motion Composer. The project was founded in 2002 by choreographer Robert Wechsler and consists of a team of twelve, including six composers. Motion Composer explicitly addresses people with limited movement possibilities and therefore aims to be especially accessible. The company is technically supported by the Chemnitz-based company “Fusion Systems”, which offers image analysis solutions. In the artistic field Motion Composer works together with the “Palindrome Dance Company”. It has also been used for purely artistic on-stage dance performances, but the project now seems strongly focused on therapeutic use cases.

According to the offer pamphlet for “Motion Composer 3.0”, an all-in-one solution is sold for €12,450. This includes a device with stereo cameras, “more than 50 music worlds”, speakers, an introductory workshop and a book with activity suggestions 9.

Figure 4. Image of the Motion Composer device, taken from the marketing PDF. Image: Motion Composer

2.2. Platform and technology

The first and, in my opinion, obvious idea was to publish web-based software. This would mean the widest possible distribution and accessibility, and it also promised absolute freedom for me as an author, because I would not be bound to the rules of proprietary app distribution platforms, whose vendors have the last word on what can be published. I already had experience with the web-based body tracking software Posenet 10, which — like all modern algorithms in this field — is based on machine learning. It was developed at Google by Dan Oved and provides 2D coordinates for several people simultaneously.

Back then I was disappointed with the precision of the tracking, but was sure that visible progress would be only a matter of time. Although Posenet 2 has been released in the meantime, the results are still not satisfactory for my application, which made me consider the following alternatives.

PC application with Microsoft Kinect 2

The Microsoft Kinect 2 can calculate the 3D positions of up to six people by combining a 2D camera image with a projected grid of infrared light points. The tracking has very low latency, and the PC as a platform is widely used. With these properties, the device is popular for media installations and similar applications. The clear disadvantage is not only that additional hardware is required, but that this hardware is no longer officially distributed by Microsoft. There is a successor in the form of the “Azure Kinect” 11, but it is explicitly aimed not at consumers but at industry.

PC application which uses advanced camera image-based tracking

There are now several pose estimation algorithms that calculate 3D position data in real-time from a 2D image alone. One of the most popular is OpenPose by Gines Hidalgo et al. 12, which calculates 2D data for several people at once, or 3D data for one person. A more recent and impressive model is XNect by Mehta et al., which can calculate 3D data for several people 13. All current algorithms in this area are based on neural networks and require a powerful (i.e. expensive) graphics card to provide fast tracking performance.

Figure 5. Demonstration of 3D body estimation from a 2D image with OpenPose. Image: OpenPose

iOS-App using the Motion Capture feature

Motion Capture for iOS was introduced in version 3 of Apple’s proprietary augmented reality framework ARKit in 2019. It delivers three-dimensional position data of virtual joints of a person at a rate of about 45 frames per second. The great advantage of such a device is its mobility: it can be used in a wide variety of situations and locations. A disadvantage is the relatively high price of the supported devices.

After some consideration, I finally decided to use iOS as a platform to reach the easiest possible setup for users and the largest possible audience.

2.3. Tracking

2.3.1. API and input image

Figure 6. Illustration of the joint structure provided by “Motion Capture”. This image is taken from an article called “Rigging a Model for Motion Capture”, a title which emphasizes that the feature was originally made for “virtual puppeteering”. Image: Apple

The Application Programming Interface (API) for Motion Capture provides data for 91 joints, but only 26 of them are actively tracked 14.

The positions of the joints that are not actively tracked are added algorithmically to obtain a standardized and detailed skeletal structure to which a humanoid 3D model can be attached. The system’s unit of measurement is meters.

The API requires devices with at least an “A12” chip, the first to include a dedicated machine learning processor; this means that all iPhone models released from 2018 onward can use the feature. Although the devices have cameras on the front and back, the feature only works with the rear camera. The input image has a resolution of 1920 × 1440 pixels. This 4:3 aspect ratio makes it possible to use the functionality in both portrait and landscape mode. In tests, a distance of 3 meters was sufficient to capture the entire body of a person in either orientation.
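How the joint data fits together can be sketched with plain matrix math: a joint’s transform is expressed relative to the body anchor (the hip), and multiplying it with the anchor’s world transform yields the joint’s world position in meters. The matrix layout and all values below are illustrative, not the ARKit API:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Row-major 4 x 4 transform; translation sits in the last column.
using Mat4 = std::array<std::array<double, 4>, 4>;

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};  // zero-initialized
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Translation-only transform, in meters.
Mat4 translation(double x, double y, double z) {
    return {{{1, 0, 0, x}, {0, 1, 0, y}, {0, 0, 1, z}, {0, 0, 0, 1}}};
}

// Fictional scene: body anchor 1.5 m in front of the camera with the hip
// 0.9 m above the ground; right hand 0.3 m to the side and 0.6 m above
// the hip. World position of the hand = anchor transform * local transform.
const Mat4 kWorldHand =
    multiply(translation(0.0, 0.9, -1.5), translation(0.3, 0.6, 0.0));
// The translation column of kWorldHand is (0.3, 1.5, -1.5) meters.
```

The algorithmically filled-in joints mentioned above would be additional local transforms inserted between tracked ones in the same hierarchy.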

Figure 7. Left - Joint path visualisation from an older project of mine using Posenet. Right - Joint path visualisation using Motion Capture on iOS. Image: Weibezahn

2.3.2. Tracking quality


Despite the name “Motion Capture”, one should not expect the precision of professional motion capture systems. The accuracy of the data varies between joints and postures and is also influenced by lighting conditions. In one test I measured the noise of the most important joints in a static scene over a period of 10 seconds: the test person stands absolutely still in a neutral pose, and the camera’s view is unobstructed. The average size of the bounding box of the measured noise is 2.091 × 0.982 × 3.123 cm. For 12 of 14 joints, noise dominates in the z-dimension. The rotation of the pose is also not estimated correctly: compared to the actual scene, the virtual skeleton is rotated about 30 degrees counterclockwise.
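The measurement itself is straightforward: collect a joint’s positions over the test period and take the axis-aligned bounding box of the samples. A sketch with made-up sample values (function and type names are my own):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// Axis-aligned bounding box (as edge lengths, in meters) of a non-empty
// list of position samples.
Vec3 noiseBoundingBox(const std::vector<Vec3>& s) {
    Vec3 lo = s.front(), hi = s.front();
    for (const Vec3& p : s) {
        lo.x = std::min(lo.x, p.x); hi.x = std::max(hi.x, p.x);
        lo.y = std::min(lo.y, p.y); hi.y = std::max(hi.y, p.y);
        lo.z = std::min(lo.z, p.z); hi.z = std::max(hi.z, p.z);
    }
    return {hi.x - lo.x, hi.y - lo.y, hi.z - lo.z};
}

// Three fictional samples of a hand standing still: the box spans about
// 1.5 cm in x, 0.6 cm in y and 3.5 cm in z, i.e. noise dominates in z,
// as it did for 12 of the 14 joints measured.
const std::vector<Vec3> kSamples = {
    {0.300, 1.500, -1.500}, {0.310, 1.504, -1.520}, {0.295, 1.498, -1.485}};
const Vec3 kBox = noiseBoundingBox(kSamples);
```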

Mis-detections leading to “Artificial Motion”

In another test, a person initially stands still in a static scene, with her hands resting at the height of her hips. She then raises her right hand in a circular motion over her head and lowers it again until it hangs by her hip. The rest of her body does not move significantly. The tracking data, however, suggests that not only the right arm is moving, but all other joints as well. The left arm and hand in particular appear to move back and forth and up and down significantly, as if the arm were swinging. This happens during both the upward and the downward movement of the right arm. Over the course of this test, the movement bounding box of the left hand is 11.351 × 3.049 × 10.301 cm, when it should be much closer in size to the one recorded when standing perfectly still.

My speculation is that these errors result from the model trying to fit the data into its learned patterns. I faced this problem with Posenet before, to a much higher extent. Unfortunately, unlike other software in this field, the API does not provide a “confidence” factor along with the estimation, so errors cannot easily be detected by the developer.

Figure 8. The sequence shows the misinterpretations of the model, here in the case of a unilateral arm movement. Image: Weibezahn

2.4. Interpretation of motion data

Laban Movement Analysis

This system was conceived by Rudolf von Laban and further developed by Irmgard Bartenieff as a vocabulary for human movement in several dimensions 15. Laban Movement Analysis (LMA) describes different spatiotemporal components of such motion, not only in dance. To do so, movement is analysed and expressed in four main categories: “Body” describes the organisation and relationships of body parts within the so-called Kinesphere; the general space in which the body moves as a whole is the concern of the “Space” category. “Shape” describes the changing forms the body adopts. “Effort” describes the quality of the movement, for example the directness of a gesture (pointing somewhere vs. waving randomly) or how much force one puts behind it (punching vs. nudging). Later, the system was extended by Bartenieff to include the categories “Phrasing” (the temporal structure of the first four categories) and “Relationship” (describing the relationship to other subjects). LMA is referenced in much research in the field of human movement analysis, from robotics 16 to art 17. I do not use the terms from LMA in my work, but it is useful as general guidance when thinking about the body.

Quantisation of body movement

In “The Dancer in the Eye: Towards a Multi-Layered Computational Framework of Qualities in Movement”, Camurri et al. describe a conceptual framework for the analysis of expressive qualities of movement 18. The research deals not only with camera input but also with other possible measurement methods, such as body-mounted accelerometers, pressure-sensitive floor plates, breath sounds and more. The authors divide the processing of these data into four layers with increasing degrees of abstraction. I use some aspects from the first two layers:

Layer 1, Physical Signals, Virtual Sensors – position data, boundary forms, silhouette, breathing sounds, non-verbal sounds, body function sensors (EEG etc.), weight.

Layer 2, Low Level Features – speed, acceleration, jerk, gravity, kinetic energy of point-volume, symmetry, contraction, motion uniformity, “dynamic balance”, “tension” of posture.

The other two layers have a high degree of abstraction and take into account more complex structures and longer time periods:

Layer 3, Mid Level Features – e.g. equilibrium, repetitions, simultaneity, pauses, rotation patterns.

Layer 4, Qualities Communication – e.g. predictability, resistance, “groove”, conveyed emotion.
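Layer-2 features such as speed and acceleration follow from the Layer-1 position data by finite differences. A minimal 1D sketch with made-up values, assuming the fixed Motion Capture tracking rate of about 45 Hz:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Time between two tracking frames at 45 Hz.
constexpr double kDt = 1.0 / 45.0;

// Finite-difference derivative of a uniformly sampled signal.
std::vector<double> derivative(const std::vector<double>& v, double dt) {
    std::vector<double> d;
    for (std::size_t i = 1; i < v.size(); ++i)
        d.push_back((v[i] - v[i - 1]) / dt);
    return d;
}

// A hand moving 1 cm per frame along one axis, then stopping.
const std::vector<double> kPositions = {0.00, 0.01, 0.02, 0.02};
const std::vector<double> kSpeed = derivative(kPositions, kDt);  // m/s
const std::vector<double> kAccel = derivative(kSpeed, kDt);      // m/s^2
// kSpeed is roughly {0.45, 0.45, 0.0}; the stop shows up as a negative
// acceleration of about -20.25 m/s^2 in the last interval.
```

Higher-layer features like pauses or repetitions would then be detected on top of such low-level series over longer time windows.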

Visual analysis

To illustrate the tracking, I developed testing and visualization software for the PC, as well as a rudimentary tracking app for the iPhone based on an open source project 19. The iPhone app sends the tracking data to the PC via WiFi using Open Sound Control (OSC). Among other things, the PC software contains various visualizations, such as joint positions and motion paths. Furthermore, the bounding boxes of these paths can be displayed, as well as their average direction of movement and the bounding box of the virtual skeleton. The software also includes a module called “MotionProcessor”, which contains the algorithms for processing the tracking data: smoothing by averaging values, calculating directions of movement and path volumes, and detecting rotations. In the course of the project I tested many ideas and approaches here. The software can also store capture sessions, which is especially useful for repeatable tests.
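The smoothing step of such a processor can be sketched as a moving average over the last n values of a joint coordinate; the window size trades responsiveness against noise suppression. All details here are illustrative, not the actual “MotionProcessor” module:

```cpp
#include <cassert>
#include <cmath>
#include <deque>
#include <numeric>

// Moving average over the last `size` input values; returns the current
// average after each new sample.
class MovingAverage {
public:
    explicit MovingAverage(std::size_t size) : size_(size) {}
    double add(double value) {
        window_.push_back(value);
        if (window_.size() > size_) window_.pop_front();
        return std::accumulate(window_.begin(), window_.end(), 0.0) /
               static_cast<double>(window_.size());
    }
private:
    std::size_t size_;
    std::deque<double> window_;
};

// Feeding in a noisy 1D joint coordinate, the fourth output averages the
// first four inputs: (1.0 + 1.2 + 0.8 + 1.1) / 4 = 1.025.
```

A larger window suppresses the tracking noise measured above more strongly, at the cost of the smoothed joint lagging behind fast movements.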

Figure 10. Montage of visualisations for different kinds of data: 1 - Joint positions, pose bounding box. 2 - Virtual skeleton, average joint directions. 3 - Joint positions, connections to center (average position data). 4 - Joint motion paths. 5 - Joint motion paths, paths bounding boxes. 6 - Motion paths, path bounding boxes, overall paths bounding box. Image: Weibezahn

2.5. Electroacoustics and audio software development

I developed my first audio experiments on the PC, using data received from the phone. For this I used the popular C++ framework openFrameworks 20, with which I had also built the visualization. Among other things, I was able to create and combine oscillators from scratch and program step sequencers for sample loops. But at some point these possibilities were exhausted, and I would have had to go deep into audio programming myself. Since I only had basic knowledge of electroacoustics, this would have been beyond the time frame of this thesis. In addition, I had problems with the audio performance of openFrameworks on the iPhone.
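An oscillator of the kind built from scratch can be sketched as a phase accumulator advanced once per sample; the parameter values below are arbitrary:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Generate `count` samples of a sine wave by accumulating a fixed phase
// increment per sample.
std::vector<double> sineWave(double frequency, double sampleRate, int count) {
    std::vector<double> samples;
    const double increment = 2.0 * std::acos(-1.0) * frequency / sampleRate;
    double phase = 0.0;
    for (int i = 0; i < count; ++i) {
        samples.push_back(std::sin(phase));
        phase += increment;
    }
    return samples;
}

// At 441 Hz and a 44.1 kHz sample rate, one period is exactly 100 samples
// long: zero crossings fall on samples 0 and 50, the peak near sample 25.
```

A step sequencer is then little more than a counter that, at fixed sample intervals, switches which such voice (or sample loop slice) is currently audible.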

In my search for a professional iOS framework for audio, I found Audiokit 21. It offers numerous features: live synthesis, samplers, filters, effects, dynamics and much more. Audiokit can simulate analog devices and offers well-known techniques of electroacoustics, such as FM synthesis. There was still a steep learning curve: compared to graphics programming (where most of my experience lies), the functionality and performance profiles of many components were not very accessible to me. The phenomenon of unwanted sound behaviours like clicks and droning, especially during live synthesis with many oscillators and effects, also caused problems that were unpredictable for me. Since I had previously mainly developed systems with visual feedback, debugging by ear was challenging, because I could not easily make out the connection between the sonic phenomena and their root causes. So a big part of my research in this area was learning to use the software.
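The FM synthesis mentioned above can be illustrated by its textbook formula, y(t) = sin(2π·fc·t + I·sin(2π·fm·t)): a carrier whose phase is modulated by a second oscillator. This is the general technique, not Audiokit’s implementation, and the parameter values are arbitrary:

```cpp
#include <cassert>
#include <cmath>

// One sample of a frequency-modulated sine at time t (seconds):
// carrier frequency fc, modulator frequency fm, modulation index I.
double fmSample(double t, double fc, double fm, double index) {
    const double kTwoPi = 2.0 * std::acos(-1.0);
    return std::sin(kTwoPi * fc * t + index * std::sin(kTwoPi * fm * t));
}

// With modulation index 0 this reduces to a plain sine carrier; raising
// the index adds sidebands spaced at the modulator frequency, which is
// what gives FM its metallic, bell-like timbres.
```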
