Abstract
Action recognition has garnered ever-increasing attention in recent times. Most of the research
conducted is heavily geared towards general human action recognition, and most of the datasets
available support this field of study. However, in more specialised contexts such as
basketball, few comprehensive, publicly available datasets exist. Accordingly, this
study sets out to achieve fine-grained action recognition in the sport of basketball and validates
its methods using the SpaceJam dataset. It evaluates three popular methods in the field of action
recognition, namely Temporal Segment Networks (TSN), a Two-Stream CNN using Inflated 3D Convolutional
Neural Networks (I3D), and PoseC3D. All three experiments used models pre-trained
on ImageNet and fine-tuned on SpaceJam. TSN is the oldest of the approaches but
obtained the best results of the three experiments. The best-performing experiment (TSN) had a top-1
accuracy of 59% and a top-5 accuracy of 96%, followed by I3D with a top-1 accuracy of 41% and a
top-5 accuracy of 85%; the least performant approach was PoseC3D, with a top-1 accuracy of
15% and a top-5 accuracy of 50%. The study found that actions such as block and shoot were the most
challenging, and that actions such as dribble, pass and ball-in-hand suffered several mispredictions.
The results show that current models cannot find a significant distinction
between some actions, such as ball-in-hand, pass and dribble, which can in part explain the high
top-5 accuracy. Furthermore, some actions, such as block, are difficult to characterise in
16 frames. The study shows the need for more research and larger datasets to improve results, as most
research is geared towards general action recognition. Future research would ideally be
more context-aware, for example by targeting a specific action such as shooting a basketball.
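
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how top-1 and top-k accuracy are commonly computed from per-class scores. It is not the evaluation code used in the study: the function name `top_k_accuracy` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; not taken from the study.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k highest-scoring classes for each sample.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # A sample counts as a hit if its true label appears among those k classes.
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data (made up): 4 clips, 3 classes.
scores = [
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
    [0.5, 0.4, 0.1],  # true class 1 is ranked second
    [0.2, 0.5, 0.3],  # true class 0 is ranked third
    [0.1, 0.2, 0.7],  # true class 2 is ranked first
]
labels = [0, 1, 0, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

In this toy case the second clip is a miss at k=1 but a hit at k=2, which mirrors how confusion between similar actions can leave top-1 accuracy low while top-5 accuracy remains high.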