Abstract: This paper describes a novel methodology for the automated recognition of high-level activities. A key aspect of our framework is the concept of co-occurring visual words for describing interactions between several persons. Motivated by the numerous successes of bag-of-words methods in human activity recognition, we extend this paradigm to interactions. A 3-D XYT spatio-temporal volume is generated for each interacting person, and a set of visual words is extracted from it to represent that person's activity. The interaction is then represented by the frequency of co-occurring visual words between persons. For our experiments, we used the UT-Interaction dataset, which contains several complex human-human interactions.
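
The co-occurrence representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define when two visual words co-occur, so the sketch assumes that words from two persons co-occur when they fall in the same temporal window. The function name `cooccurrence_histogram`, the per-window input layout, and the vocabulary size are hypothetical choices made for illustration only.

```python
from itertools import product


def cooccurrence_histogram(windows_a, windows_b, vocab_size):
    """Normalised co-occurrence frequencies of visual words between two persons.

    windows_a, windows_b: one list of visual-word ids per temporal window,
    for person A and person B (hypothetical input layout).
    Assumption: a word of A and a word of B co-occur when they appear in
    the same temporal window.
    """
    if len(windows_a) != len(windows_b):
        raise ValueError("both persons need the same number of temporal windows")

    # counts[i][j] = number of times word i (person A) and word j (person B)
    # were observed in the same window.
    counts = [[0] * vocab_size for _ in range(vocab_size)]
    total = 0
    for words_a, words_b in zip(windows_a, windows_b):
        for wa, wb in product(words_a, words_b):
            counts[wa][wb] += 1
            total += 1

    if total == 0:
        # No co-occurrences observed: return an all-zero histogram.
        return [[0.0] * vocab_size for _ in range(vocab_size)]

    # Convert raw counts to frequencies that sum to 1.
    return [[c / total for c in row] for row in counts]


# Toy example with a 3-word vocabulary and 3 temporal windows.
person_a = [[0, 1], [1], []]
person_b = [[2], [2, 0], [1]]
hist = cooccurrence_histogram(person_a, person_b, vocab_size=3)
```

In this toy example, four word pairs co-occur in total, and the pair (word 1 for person A, word 2 for person B) accounts for two of them. The flattened histogram could then serve as the feature vector of the interaction for any standard classifier.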