Leveraging deep learning for action quality assessment in cricket batting using video footage

Tevin Moodley

Action Quality Assessment (AQA) aims to automatically assess the quality of a human action from a video. AQA is essential for evaluating the quality of a cricket stroke. In cricket, the General Certificate of Secondary Education (GCSE) highlights three critical phases in coaching batting: buildup, execution, and follow-through, focusing on the batter’s stance and the movements of the hands, feet, head, and body before, during, and after striking the ball. Current research in AQA focuses on Olympic events and on the representation of actions, and most studies treat spatio-temporal features and pose-estimated keypoints separately. Our study focuses on scoring cricket batters and introduces a two-stream approach that merges these elements for a more comprehensive action representation. We propose a new multivariate scoring system that assesses cricket strokes based on individual body movements, diverging from traditional methods that evaluate actions as a whole. Additionally, we introduce a new dataset named CricketVision: The Ultimate AQA dataset for assessing Cricket Strokes in video, which focuses on the execution phase of a stroke. The dataset categorises the strokes into poor (1478), average (2415), good (2689), and excellent (1958) categories based on the batters' skill level, totalling 8540 samples. It covers both left-handed (2969) and right-handed (5571) batters, analysing front foot and back foot strokes, such as the off-drive, on-drive, cut, glance, and block. We conducted three experiments within the AQA domain, each employing a distinct action representation scheme over three trials. The first experiment assessed spatio-temporal action representations using Convolutional 3D (C3D), Inflated 3D (I3D), and SlowFast models, with C3D and I3D emerging as top performers in terms of efficiency and Spearman Rank Correlation (SRC) scores (0.81 and 0.82, respectively).
The second experiment focused on pose-estimated keypoints obtained through OpenPose, ViTPose, and MediaPipe, with ViTPose achieving the highest SRC score of 0.78, attributed to its transformer-based architecture. The third experiment combined ViTPose with each of the spatio-temporal methods (C3D, I3D, SlowFast), resulting in improved SRC scores as follows: C3D+ViTPose: 0.83, I3D+ViTPose: 0.84, and SlowFast+ViTPose: 0.80. These results confirm that combining spatio-temporal features and pose-estimated keypoints provides a more nuanced action representation, enhancing performance in AQA tasks. Finally, the I3D-AELSTM model is benchmarked on the MTL-AQA dataset, where it achieves an average SRC of 0.98, a state-of-the-art result. Future work will aim to develop a more comprehensive dataset and to further our contributions within the AQA domain.
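
For readers unfamiliar with the SRC metric reported above, the following is a minimal sketch of how a Spearman Rank Correlation between predicted and reference quality scores can be computed. It is an illustration only and not the evaluation code used in this study; the function names average_ranks and spearman_rank_correlation and the example scores are assumptions introduced here.

```python
import numpy as np


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")  # stable sort
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j share the mean rank (1-based).
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rank_correlation(predicted, reference):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    predicted_ranks = average_ranks(predicted)
    reference_ranks = average_ranks(reference)
    return float(np.corrcoef(predicted_ranks, reference_ranks)[0, 1])


# Hypothetical scores for five strokes (illustrative values only).
predicted_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
reference_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
src = spearman_rank_correlation(predicted_scores, reference_scores)
```

Because the metric depends only on rank order, a model can reach a high SRC without matching the absolute score values, which is why it is commonly reported for AQA tasks. The correlation is undefined when either input is constant.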
