Action Quality Assessment (AQA) aims to automatically assess the quality of a human
action from a video. AQA is essential for evaluating the quality of a cricket
stroke. In cricket, the General Certificate of Secondary Education (GCSE) highlights
three critical phases in coaching batting: buildup, execution, and follow-through, focusing
on the batter’s stance and the movements of the hands, feet, head, and body before, during,
and after striking the ball. Current research in AQA focuses on Olympic events and
the representation of actions, and most studies treat spatio-temporal features and
pose-estimated keypoints separately. Our study focuses on scoring cricket
batters and introduces a two-stream approach that merges these two representations
for a more comprehensive action representation. In this study, we propose a
new multi-variate scoring system to assess cricket strokes based on individual body
movements, diverging from traditional methods that evaluate actions as a whole.
Additionally, we introduce a new dataset named CricketVision: The Ultimate AQA
dataset for assessing Cricket Strokes in video, which focuses on the execution phase
of a stroke. The dataset categorises the strokes as poor (1478), average (2415), good
(2689), or excellent (1958) based on the batter's skill level, totalling 8540 samples.
It covers both left-handed (2969) and right-handed (5571) batters and includes front-foot
and back-foot strokes such as the off-drive, on-drive, cut, glance, and block. We
conducted three experiments within the AQA domain, each employing
a distinct action representation scheme over three trials. The first experiment
assessed spatio-temporal action representations utilising Convolutional 3D (C3D),
Inflated 3D (I3D), and SlowFast models, with C3D and I3D emerging as the top performers
in terms of efficiency and Spearman Rank Correlation (SRC), scoring 0.81 and
0.82, respectively. The second experiment used pose-estimated keypoints
from OpenPose, ViTPose, and MediaPipe, with ViTPose achieving
the highest SRC of 0.78, which is attributed to its transformer-based architecture.
The third experiment combined ViTPose with each of the spatio-temporal methods
(C3D, I3D, SlowFast), resulting in improved SRC scores: C3D+ViTPose:
0.83, I3D+ViTPose: 0.84, and SlowFast+ViTPose: 0.80. These results indicate that combining
spatio-temporal features and pose-estimated keypoints provides a more nuanced action
representation and improves performance in AQA tasks. Finally, the I3D-AELSTM
model is benchmarked on the MTL-AQA dataset, where it achieves an average
SRC of 0.98, a state-of-the-art result. Future work will aim to develop
a more comprehensive dataset and to further our contributions within the
AQA domain.
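
The SRC values reported above measure how well the ordering of predicted quality scores matches the ordering of the reference scores. As a minimal sketch of the metric, not the evaluation code used in this work, the following computes Spearman's rank correlation as the Pearson correlation of tie-averaged ranks. The function names and the sample scores are hypothetical and chosen only for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rank_correlation(predicted, reference):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rp = average_ranks(predicted)
    rr = average_ranks(reference)
    n = len(rp)
    mean_p = sum(rp) / n
    mean_r = sum(rr) / n
    cov = sum((a - mean_p) * (b - mean_r) for a, b in zip(rp, rr))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_r = sum((b - mean_r) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_p * var_r) ** 0.5


# Hypothetical example: model outputs for four strokes versus reference
# grades encoded as poor=1, average=2, good=3, excellent=4.
predicted_scores = [0.2, 0.9, 0.7, 0.4]
reference_grades = [1, 4, 2, 3]
src = spearman_rank_correlation(predicted_scores, reference_grades)
```

Because only ranks enter the calculation, the metric rewards a model for ordering strokes correctly even when its raw scores are on a different scale from the reference grades.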
- Leveraging deep learning for action quality assessment in cricket batting using video footage
- Tevin Moodley
- Dustin Terence Van Der Haar Prof.
- University of Johannesburg; Doctor of Philosophy (PHD)
- 9956302107691
- University of Johannesburg
- University of Johannesburg; Department of Applied Chemistry; Faculty of Science
- English
- Dissertation