Abstract
Over recent decades, significant advancements in the computer vision field have
been driven by factors such as deep learning techniques, hardware acceleration,
and the availability of large-scale datasets. Furthermore, the computer vision field
is continually extending its reach across various application domains, rapidly fuelling
the emergence of novel applications. Classical ballet, a captivating art form
characterised by its principles of grace, precision, and narrative conveyed through
movement, presents an especially intriguing application domain. Yet computer vision
technology remains underused in ballet, particularly in the notation of
choreography. Recording ballet choreography in a dance notation format
effectively protects choreographers’ original works and accurately preserves the
dance heritage of past and future generations. This thesis proposes an approach for
the automated notation of ballet choreography using computer vision techniques.
A novel video dataset, AnnChor, is presented first to address the need for a high-quality
annotated dataset for ballet. A baseline study is conducted to evaluate the
dataset for the task of temporal action localisation using Coarse-Fine Networks and
TriDet models. A choreographic ontology and digital bit vector approach are then
created as a basis for an appropriate intermediate representation of dance notation
for computer vision. Furthermore, a rule-based approach built on pose-estimated
data and the developed bit vector representation is used to generate ground-truth
digital dance notation data. The digital bit vector representation is inspected for
distinct groupings of different actions using t-distributed stochastic neighbour embedding.
Finally, all the components of the study are assembled to construct computer
vision models for automated choreographic notation. Accordingly, encoder-decoder-based
sequence models for predicting dance notation from pose-estimated
data are implemented and evaluated. The results of the benchmark performed on
the final developed sequence models reveal that the study accomplishes the overall
aim of automating the notation of ballet choreography using computer vision. An
ablation study and the key results show that the models perform promisingly. The
top-performing model achieves low error on metrics including the mean squared
error (0.01), mean absolute error (0.02), root mean squared error (0.12) and mean
absolute percentage error (2.19 %). Additionally, the top-performing model's predictions
show high correlation with the ground-truth data, with a coefficient of
determination (R²) of 0.87, a Spearman correlation of 0.6, a Pearson correlation
of 0.93 and a Matthews correlation coefficient of 0.93. Further key
findings indicate that ballet movements are intricate and that certain positions of the
body, which involve subtle differences in foot orientations or high variance in arm
positions, may contribute to higher error. Future work includes exploring alternative
model architectures to improve baseline results in light of the error variance revealed
in the study. The significance of this research lies in its venture into unexplored
territory: it marks a first step in demonstrating the feasibility of
fine-grained temporal action localisation and automated notation translation in ballet.
Therefore, this research provides choreographers, choreologists, and dancers
with a valuable tool that enables the preservation of their dance heritage and legacy.