Conception d'architectures compactes pour la détection spatiotemporelle d'actions en temps réel

Yu Liu

Abstract

This thesis tackles the spatiotemporal action detection problem from an online, efficient, and real-time processing point of view. Over the last decade, the explosive growth of video content has driven demand across a broad range of applications for automated understanding of human actions. Beyond accurate detection, many real-world sensing scenarios also require incremental, instantaneous processing of scenes under restricted computational budgets. However, current research and related detection frameworks cannot fulfill these criteria simultaneously. The main obstacle lies in the heavy architectures and detection pipelines they use to extract pertinent spatial and temporal context, such as 3D Convolutional Neural Networks (CNNs) or explicit motion cues (e.g., optical flow). We hypothesize that reasoning about the spatiotemporal patterns of actions can be realized much more efficiently (down to feasible deployment on resource-constrained devices) without significantly compromising detection quality.

To this end, we propose three action detection architectures that couple various spatiotemporal modeling schemes with compact 2D CNNs. We start by accelerating frame-level action detection: bottom-up feature extraction is allocated to only a sparse set of video frames, while the features of the remaining frames are approximated. This is realized by spatially warping CNN features under the guidance of the relative motion between successive frames, a mechanism we later extend to align and accumulate observations over time in order to model the temporal variations of actions. Following the frame-level approach, we explore a multi-frame detection paradigm that processes video sequences concurrently and predicts the underlying action-specific bounding boxes (i.e., tubelets). Modeling of an action sequence is decoupled into multi-frame feature aggregation and trajectory tracking, for enhanced classification and localization, respectively. Finally, we devise a flow-like motion representation that can be computed on the fly from raw video frames, and extend the above tubelet detection approach into two CNN pathways that jointly extract the static visual cues and dynamic cues of actions. We demonstrate that our online action detectors improve progressively and achieve a better trade-off between accuracy, efficiency, and speed than state-of-the-art methods on public benchmarks.

Over the last decade, the explosive growth of video content has given rise to a wide range of applications that require the analysis and understanding of human actions. Current related research focuses mainly on improving action detection and recognition performance. However, some real-world scenarios demand immediate responses computed on embedded systems with limited resources. Existing methods are difficult to deploy in this context, since they rely on heavy architectures such as 3D convolutional neural networks to extract the spatiotemporal features of a video, or explicitly compute the optical flow of motion. In this thesis, we explore the feasibility of spatiotemporal human action detection that simultaneously satisfies several constraints of consumer applications: robustness, real-time operation, low cost, ergonomics, good portability, and long battery life.

To this end, we propose three action detection architectures that couple different spatiotemporal modeling schemes with compact 2D CNNs. The first performs detection at the level of a single frame, approximating the features of most frames in a video sequence to speed up processing. We then explore a multi-frame detection paradigm that simultaneously handles temporal detection and the prediction of action-specific bounding boxes to form tubelets. Finally, we design a flow-like motion representation computed on the fly from raw video frames, and extend the tubelet detection approach to two CNNs that jointly extract the spatial and temporal features of actions. Experimental results on public datasets show the progressive improvements of our approaches in terms of accuracy, efficiency, and processing speed.

Lightweight architectures for spatiotemporal action detection in real-time
