Sparse Representation-Based Semi-Supervised Regression for People Counting

Label imbalance and the insufficiency of labeled training samples are major obstacles in most methods for counting people in images or videos. In this...

Caching Online Video

Online video presents new challenges to traditional caching, with a more than thousand-fold increase in the number of assets, rapidly changing asset popularity, and much higher throughput requirements. We propose a new hierarchical filtering algorithm for caching online video, HiFi. Our algorithm is designed to optimize hit rate, replacement rate and cache...

Multimodal Retrieval with Diversification and Relevance Feedback for Tourist Attraction Images

In this article, we present a novel framework that can produce a visual description of a tourist...


In this work, we explore the increasing demand for novel user interfaces to navigate large media collections. We implement a geometric data structure to store and retrieve item-to-item similarity information and propose a novel navigation framework that uses vector operations and real-time user...

Securing Speech Noise Reduction in Outsourced Environment

Cloud data centers (CDCs) are becoming a cost-effective method for processing and storage of multimedia data including images, video, and audio. Since...

Interactive Film Recombination

In this article, we discuss an innovative media entertainment application called Interactive Movietelling. As an offspring of Interactive Storytelling applied to movies, we propose to integrate narrative generation through artificial intelligence (AI) planning with video processing and modeling to construct filmic variants starting from the baseline content. The integration is possible thanks to content description using semantic attributes pertaining to intermediate-level concepts shared between video processing and planning levels. The output is a recombination of segments taken from the input movie performed so as to convey an alternative plot. User tests on the prototype proved how...

Complexity Correlation-Based CTU-Level Rate Control with Direction Selection for HEVC

Rate control is a crucial consideration in high-efficiency video coding (HEVC). The estimation of model parameters is very important for coding tree...

When Smart Devices Interact With Pervasive Screens

The meeting of pervasive screens and smart devices has witnessed the birth of screen-smart device interaction (SSI), a key enabler to many novel...


Location-based games have been around since 2000, but only recently, when PokemonGo came to market, did it become clear that they can reach wide popularity. In this article, we perform a literature-based analytical study of the issues that location-based game design faces and how they can be solved. We study how to use and verify the...


[June 2017]

The Impact Factor for the year 2016 is now available. ACM TOMM increased its IF from 0.982 to 2.250 and is now the second-ranked journal in the area of Multimedia. Thank you to all the EB members, authors, reviewers and readers for this excellent result.

Special Issue on "Multi-modal Understanding of Social, Affective and Subjective Attributes of Data". Cfp. Submission deadline Oct. 1, 2017

Special Issue on "Deep Learning for Intelligent Multimedia Analytics". Cfp. Submission deadline Oct. 15, 2017

[April 2017]

Special Issue on "QoE Management for Multimedia Services". Cfp. Submission deadline May 15, 2017. Extended to June 15, 2017

[April 2017]

Call for Nominations for TOMM Nicolas D. Georganas Best Paper Award 2017

The Editor-in-Chief of ACM TOMM invites nominations for the ACM TOMM Nicolas D. Georganas Best Paper Award. The deadline for nominations of papers published in ACM TOMM from January 2016 to December 2016 is June 15th, 2017. See the call for nominations (cfn).

[February 2017]

Upcoming special issues:

- "Delay-Sensitive Video Computing in the Cloud". Cfp. Submission deadline Aug. 20, 2017

- "QoE Management for Multimedia Services". Cfp. Submission deadline May 15, 2017

- "Representation, Analysis and Recognition of 3D Humans". Call for papers

[January 2017]

ACM TOMM AE guidelines have been added

[December 2016]

ACM TOMM Special Issue on "Delay-Sensitive Video Computing in the Cloud". Cfp. Submission deadline Nov. 30, 2016. Extended to Dec. 30, 2016

[November 2016]

- ACM TOMM Special Issue on "Deep Learning for Mobile Multimedia". Cfp. Submission deadline Oct. 15, 2016. Extended to Nov. 25, 2016

- Special Section on "Multimedia Computing and Applications of Socio-Affective Behaviors in the Wild". Cfp. Submission deadline Oct. 31, 2016. Extended to Nov. 25, 2016

- Special Section on "Multimedia Understanding via Multimodal Analytics". Cfp. Submission deadline Oct. 31, 2016. Extended to Nov. 25, 2016


[September 2016]


The 2016 ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) Nicolas D. Georganas Best Paper Award goes to the paper “Cross-Platform Emerging Topic Detection and Elaboration from Multimedia Streams” (TOMM vol. 11, Issue 4) by Bing-Kun Bao, Changsheng Xu, Weiqing Min and Mohammod Shamim Hossain.

Dr. Cheng-Hsin Hsu has been named the ACM TOMM Associate Editor of the Year for 2016! Congratulations to Cheng-Hsin!

[August 2016]

Call for Nominations for TOMM Nicolas D. Georganas Best Paper Award

The Editor-in-Chief of ACM TOMM invites nominations for the ACM TOMM Nicolas D. Georganas Best Paper Award. The deadline for nominations of papers published in ACM TOMM from January 2015 to December 2015 is September 10th, 2016. See the cfn.

[June 2016]

Forthcoming Special Issues in 2017

We received 11 competitive proposals this year and had limited slots available, so it was a very tough decision. In the end, the following four SI proposals have been selected and scheduled as follows:

- "Deep Learning for Mobile Multimedia". Cfp. Submission deadline Oct. 15, 2016. Extended to Oct. 31, 2016

- "Representation, Analysis and Recognition of 3D Human". Cfp. Submission deadline Jan. 15, 2017. Extended to Feb. 15, 2017

Two Special Sections have also been accepted and scheduled for publication in 2017:

- "Multimedia Computing and Applications of Socio-Affective Behaviors in the Wild". Cfp Submission deadline Oct. 31, 2016

- "Multimedia Understanding via Multimodal Analytics". Cfp Submission deadline Oct. 31, 2016

[June 2016]

Forthcoming Special Issues in 2016

"Trust Management for Multimedia Big Data" - Publication date August 2016

"Multimedia Big Data: Networking" - Publication date November 2016

[February 2016]

Advisory Board

We have created the ACM TOMM Advisory Board to support the Editor-in-Chief in the definition and implementation of strategies, with no editorial duties. The following colleagues have been appointed as members of the ACM TOMM Advisory Board: Prof. Wen Gao, Peking University; Prof. Arnold Smeulders, University of Amsterdam; Prof. Nicu Sebe, University of Trento.

[January 2016]

New Assistant Information Director

Starting on January 1st, 2016, Marco Bertini will serve as Assistant Information Director of ACM TOMM.

[January 2016]

New Information Director

Starting on January 1st, 2016, Stefano Berretti will serve as Information Director of ACM TOMM.

[January 2016]

New Editor-in-Chief

After the end of the second term of Ralf Steinmetz, Alberto Del Bimbo from the University of Florence will be the next TOMM Editor-in-Chief starting on January 1st 2016. 

ACM TOMM Nicolas D. Georganas Best Paper Award 2015

The award goes to the article “A Quality of Experience Model for Haptic Virtual Environments” (TOMM vol. 10, Issue 3) by Abdelwahab Hamam, Abdulmotaleb El Saddik and Jihad Alja'am. Congratulations!

ACM TOMM Associate Editor of the Year 2015 Award

The award goes to Pradeep Atrey from State University of New York, USA, for his excellent work for the journal. Congratulations!

CfP: Special Issue "Multimedia Big Data: Networking"

Please consider submitting to the second special issue in next year's special issue series. Call for Papers


CfP: Special Issue "Trust Management for Multimedia Big Data"

Next year, TOMM will feature a special issue series on "Multimedia Big Data". The first topic will be "Trust Management". Extended deadline: October 15th! Call for Papers


Call for Nominations TOMM Editor-in-Chief

After two terms of the current EiC Ralf Steinmetz, the search committee started the search for a new Editor-in-Chief. Call for Nominations


New ACM submission templates

The new ACM submission templates are online. Please use the most recent link on the authors' guide to find the files.


About TOMM


A peer-reviewed, quarterly archival journal in print and digital form, TOMM consists primarily of research papers of lasting importance and value in the field of multimedia computing, communications and applications. 

News archive
Forthcoming Articles
Modeling and Analysis of Power Consumption in Live Video Streaming Systems

This paper develops an aggregate power consumption model for many-to-one live video streaming systems, such as video surveillance, where multiple video sources stream videos to a central monitoring station. In such systems, power consumption is a major concern, especially for battery-powered video sources. We model the video capturing, encoding, and transmission aspects and then provide an overall model of the power consumed by the video cameras and/or sensors. The developed model captures the following main parameters: resolution, frame rate, quantization, motion estimation range, and number of reference frames. We also analyze the power consumed by the monitoring station, which is due to video reception, potential video upscaling, and video decoding of all received video streams. In addition to modeling the power consumption, we model the achieved bitrate of video encoding. We validate and analyze the power consumption models of each phase as well as the aggregate power consumption model through extensive experiments. The analysis includes examining individual parameters separately and examining the impacts of changing more than one parameter at a time.

An Efficient Computation Framework for Connection Discovery using Shared Images

With the advent and popularity of social networks, social graphs have become essential for improving services and information relevance for users in many social media applications, for example to predict follower/followee relationships and community membership. However, social graphs may be hidden by users due to privacy concerns, or kept private by social media sites. Recently, connections discovered from user-shared images using non-user-generated labels have been shown to be a more accessible alternative to social graphs. However, real-time discovery is difficult due to its high complexity, which makes many applications impossible. This paper proposes an efficient computation framework for connection discovery using user-shared images. The framework divides the architecture into online and offline computation to speed up processing. Furthermore, this paper devises a general and scalable online computation framework into which many algorithms can fit to help discover connections on the fly. The performance of the framework is evaluated on follower/followee recommendation with 300K+ user-shared images from two social networks. The proposed framework is shown to reduce processing time by 90% on average compared with existing frameworks, while remaining 90% as accurate as those frameworks in the connections it discovers.

A Distributed Streaming Framework for Connection Discovery Using Shared Videos

With the advances in mobile devices and the popularity of social networks, users can share multimedia content anytime, anywhere. One of the most important types of emerging content is video, which is commonly shared on platforms such as Instagram and Facebook. User connections, which indicate whether two users are follower/followee or have the same interests, are essential for improving services and information relevant to users in many social media applications, but are normally hidden due to users' privacy concerns or are kept confidential by social media sites. Using user-shared content is an alternative way to discover user connections. This paper proposes to use user-shared videos for connection discovery with the Bag of Feature Tagging (BoFT) method and proposes a distributed streaming computation framework to facilitate fast response. Exploiting the uniqueness of shared videos, the proposed framework is divided into streaming processing, online computation and offline computation. Experiments using a dataset from Twitter show that using user-shared videos for connection discovery is feasible and that the proposed computation framework reduces the processing time to only 35% for follower/followee recommendation. They also show that comparable performance can be achieved with only partial video data.

Early Recognition of 3D Human Actions

Action recognition is an important research problem in Human Motion Analysis (HMA). In recent years, 3D observation-based action recognition has received increasing interest in the multimedia and computer vision communities, owing to the recent advent of cost-effective sensors such as the Kinect depth camera. This work goes one step further, focusing on early recognition of ongoing 3D human actions, which is beneficial for a large variety of time-critical applications, e.g., gesture-based human-machine interaction and somatosensory games. Our goal is to infer the class label of 3D human actions from partial observations of temporally incomplete action executions. By considering 3D action data as multivariate time series (m.t.s.) synchronized to a shared common clock (frames), we propose a stochastic process called Dynamic Marked Point Process (DMP) to model the 3D action as temporal dynamic patterns, where both timing and strength information are captured. To achieve even better earliness and accuracy of recognition, we also explore the temporal dependency patterns between feature dimensions. A probabilistic suffix tree is constructed to represent sequential patterns among features in terms of a Variable-order Markov Model (VMM). Our approach and several baselines are evaluated on four 3D human action datasets. Extensive results show that our approach achieves superior performance for early recognition of 3D human actions.

Emotion Recognition Using Multiple Kernel Learning Towards E-learning Applications

Personalized e-learning models tailor learning resources to the learning needs of learners. The Adaptive Hypermedia Architecture (AHA) is a successful implementation of the personalized e-learning model that uses learning outcomes as the personalization parameter to adapt to the learning experience of learners. However, besides learning outcomes, the emotions of the learner, which can strongly influence memory and problem solving, are completely neglected in the AHA model. This paper presents an Adaptive Educational Hypermedia (AEH) model, known as the Expert Elearning System (EES), which is built on top of the AHA to incorporate a facial emotion recognition framework. The emotion recognition framework herein, denoted MKLDT-WFA, is realized by training simple Multiple Kernel Learning (MKL) with Weighted Kernel Alignment (WFA) in a Decision Tree (DT) classifier. The MKLDT-WFA framework has two merits over classical SimpleMKL. First, the WFA component preserves only relevant kernel weights to improve discrimination between emotion classes. Second, training in the DT eliminates misclassification issues associated with off-the-shelf SimpleMKL classifiers. The suggested framework has been evaluated on different emotion databases. The evaluation results reveal good emotion recognition performance and the framework's potential to improve personalization in AEH models.

PLACID: A Platform for FPGA-Based Accelerator Creation for DCNNs

Deep Convolutional Neural Networks (DCNNs) exhibit remarkable performance in a number of pattern recognition and classification tasks. Modern DCNNs involve many millions of parameters and billions of operations. Inference using such DCNNs, if implemented as software running on an embedded processor, results in considerable execution time and energy consumption, which is prohibitive in many mobile applications. FPGA-based acceleration of DCNN inference is a promising approach to improve both energy consumption and classification throughput. However, the engineering effort required for development and verification of an optimized FPGA-based architecture is significant. In this paper, we present PLACID, an automated PLatform for Accelerator CreatIon for DCNNs. PLACID uses an analytical approach to characterization and exploration of the implementation space. PLACID enables generation of an accelerator with the highest throughput for a given DCNN on a specific target FPGA platform. Subsequently, it generates an RTL level architecture in Verilog, which can be passed onto commercial tools for FPGA implementation. PLACID is fully automated, and reduces the accelerator design time from a few months down to a few hours. Experimental results show that architectures synthesized by PLACID yield 2X higher throughput density than the best competing approach.

Game Categorization for Deriving QoE-Driven Video Encoding Configuration Strategies for Cloud Gaming

Cloud gaming has been recognized as a promising shift in the online game industry, with the aim of implementing the on-demand service concept that has achieved market success in other areas of digital entertainment such as movies and TV shows. The concepts of cloud computing are leveraged to render the game scene as a video stream which is then delivered to players in real time. The main advantage of this approach is the capability of delivering high-quality graphics games to any type of end user device, however at the cost of high bandwidth consumption and strict latency requirements. A key challenge faced by cloud game providers lies in configuring the video encoding parameters so as to maximize player Quality of Experience (QoE) while meeting bandwidth availability constraints. In this paper we tackle one aspect of this problem by addressing the following research question: Is it possible to improve service adaptation based on information about the characteristics of the game being streamed? To answer this question two main challenges need to be addressed: the need for different QoE-driven video encoding (re-)configuration strategies for different categories of games, and how to determine a relevant game categorization to be used for assigning appropriate configuration strategies. We investigate these problems by conducting two subjective laboratory studies with a total of 80 players and three different games. Results indicate that different strategies should likely be applied for different types of games, and show that existing game classifications are not necessarily suitable for differentiating game types in this context. We thus further analyze objective video metrics of collected game play video traces as well as player actions per minute and use this as input data for clustering of games into two clusters. Subjective results verify that different video encoding configuration strategies may be applied to games belonging to different clusters.

Performance Analysis of Game Engines on Mobile and Fixed Devices

Mobile gaming is an emerging concept in which gamers use mobile devices, such as smartphones and tablets, to play best-selling games. Compared to dedicated gaming boxes or PCs, these devices still fall short of running recent, complex 3D video games with rich immersion. Three novel solutions relying on cloud computing infrastructure, namely Computation Offloading, Cloud Gaming, and Traditional Client-Server architecture, will represent the next generation of game engine architecture, aiming to improve the gaming experience and immersion. The basis of these solutions is the distribution of the game code over different devices (including set-top boxes, PCs, and servers). In order to know how the game code should be distributed, advanced knowledge of game engines is required. Consequently, dissecting and analyzing game engine performance will help to better understand how to move in these new directions (i.e., distribute game code), an analysis that is so far missing in the literature. Aiming to fill this gap, we propose in this paper to analyze and evaluate one of the best-known engines on the market, Unity 3D. We begin by detailing the architecture and the game logic of game engines. Then, we use a test-bed to evaluate the CPU and GPU consumption per frame and per module for five representative games on three platforms, namely a stand-alone computer, embedded systems and web players. Based on the obtained results and observations, we build a valued graph of each module composing the Unity 3D architecture, which reflects the internal flow and CPU consumption.

Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional Long Short-Term Memory (Bi-LSTM) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale and vertical mirror are proposed to prevent overfitting in training deep models. To understand how our models "translate" image to sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO and Pascal1K. We demonstrate that Bi-LSTM models achieve state-of-the-art results on both caption generation and image-sentence retrieval even without integrating additional mechanisms (e.g., object detection, attention model, etc.). Our experiments also show that multi-task learning is beneficial for increasing model generality and improving performance. We also demonstrate that the transfer learning performance of our Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.

Semantic Reasoning in Zero Example Video Event Retrieval

Retrieval of high-level or complex events, such as a parade or a car accident, within video data without example images or videos is still a challenge. Current research in deep neural networks is highly beneficial for retrieval of high-level events based upon examples, but without any examples it is still hard to determine 1) which concepts are useful to pre-train (Vocabulary challenge) and 2) which pre-trained concept detectors are relevant for a certain unseen high-level event (Concept Selection challenge). In our paper, we present our Semantic Event Retrieval System that 1) shows the importance of high-level concepts in a vocabulary for the retrieval of high-level events and 2) uses a novel concept selection method based on semantic embeddings. Our experiments on the international TRECVID Multimedia Event Detection benchmark show that a diverse vocabulary including high-level concepts improves performance on the retrieval of high-level events in videos and that our novel method outperforms a knowledge-based concept selection method.
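
As a rough illustration of what embedding-based concept selection can look like, the sketch below ranks concept detectors by the cosine similarity between their embeddings and the embedding of a textual event query, then keeps the top k. This is not the authors' method: the function names, the three-dimensional toy vectors and the concept names are all hypothetical, and a real system would take its vectors from a trained word-embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_concepts(
    event_vec: Sequence[float],
    concept_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the names of the k concepts most similar to the event embedding."""
    ranked = sorted(
        concept_vecs,
        key=lambda name: cosine(event_vec, concept_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings; real ones would come from a trained embedding model.
concepts = {
    "marching_band": [0.9, 0.1, 0.0],
    "crowd": [0.7, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}
parade_query = [1.0, 0.2, 0.0]
top_concepts = select_concepts(parade_query, concepts, k=2)
```

With these toy vectors, the two concepts closest to the query direction are kept and the unrelated one is dropped; the scores of the selected detectors could then be combined to rank videos.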

A Unified Framework for Multi-Modal Isolated Gesture Recognition

In this paper, we focus on isolated gesture recognition and explore different modalities by involving RGB stream, depth stream and saliency stream for inspection. Our goal is to push the boundary of this realm even further by proposing a unified framework which exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus-voting has been proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance when confronted with comprehensive inter-class variations. In addition, a 3D depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are done to analyze the performance of each component, and our proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitor by a margin of over 10% and 15% respectively. We will release our code to facilitate future research.

An Efficient Motion Detection and Tracking Scheme for Encrypted Surveillance Videos

Surveillance video is the main input source of an intelligent video surveillance system. Detection performed on surveillance video contributes significantly to safety and security goals. However, performing detection on unprotected surveillance video may compromise the privacy of innocent people in the video. How to strike a balance between personal privacy and the feasibility of detection is an important issue. A promising solution to this problem is to encrypt the surveillance videos and to perform detection on these encrypted videos. Most existing encrypted signal processing methods focus on images or small data. Since videos have a much larger data size, it is a great challenge to study how to process encrypted videos. In this paper, we propose an efficient motion detection and tracking scheme for encrypted H.264/AVC video bitstreams, which does not require prior decryption of the encrypted video. The main idea is first to estimate motion information from the bitstream structure and codeword length, and then to apply a region update (RU) algorithm to deal with the loss and error drifting of motion caused by the video encryption. We extract information from the codeword of the encrypted motion vector differences. With the prior knowledge that the object motion in the video is continuous in space and time, we design the RU algorithm, which can fix the error information and renew the detected region. Compared with the existing scheme based on video encryption at the pixel level [Chu et al. 2013], the proposed scheme has the advantages of small storage for the encrypted video and low computational cost in encryption and detection. Experimental results show that our scheme performs better in detection accuracy and execution speed, and is easy to install.
Moreover, the proposed scheme can use not only the video encryption in this paper, but also other format-compliant video encryption, provided that the positions of macroblocks can be extracted from the encrypted video bitstream. Due to the coupling of video stream encryption and detection algorithms, our scheme can be directly connected to the video stream output, e.g., surveillance cameras, without any modification to these cameras.

Texture and Geometry Scattering Representation based Facial Expression Recognition in 2D+3D Videos

Facial Expression Recognition (FER) is one of the most important topics in the domain of computer vision and pattern recognition, and it has attracted increasing attention for its scientific challenges and application potential. In this paper, we propose a novel and effective approach to FER using multi-modal 2D and 3D videos, which encodes both static and dynamic cues by a scattering convolution network. Firstly, a shape-based detection method is introduced to locate the start and the end of an expression in videos, segment its onset, apex, and offset states, and sample the important frames for emotion analysis. Secondly, the frames in Apex of 2D videos are represented by scattering, conveying static texture details. Those of 3D videos are processed in a similar way, but to highlight static shape details, several geometric maps in terms of multiple order differential quantities, i.e., Normal Maps (NOM) and Shape Index Maps (SIM), are generated as the input of scattering, instead of the original smooth facial surfaces. Thirdly, the average of neighboring samples centred at each key texture frame or shape map evenly distributed in Onset is computed, and the scattering features extracted from all the average samples of 2D and 3D videos are then concatenated to capture dynamic texture and shape cues respectively. Finally, a Support Vector Machine (SVM) is adopted to measure the similarity of individual features in either the 2D or 3D modality, and all the scores are combined for multi-modal decision making to predict the expression label. Thanks to the scattering descriptor, the proposed approach not only encodes distinct local texture and shape variations of different expressions, as several milestone operators such as SIFT and HOG do, but also captures subtle information hidden in high frequencies in both channels, which is crucial for better distinguishing expressions that are easily confused.
The validation is conducted on the BU-4DFE database, and state-of-the-art accuracy is reached, indicating the approach's competency for this task.

Multimodal Multiplatform Social Media Event Summarization

Social media platforms are turning into important news sources for users since they provide real-time information with a wide range of perspectives. However, the high volume, dynamism, noise and redundancy exhibited by social media data make it difficult for users to comprehend the entire content. Recent works emphasize summarizing the content of either a single social media platform or a single modality (either textual or visual). However, each platform has its own unique characteristics and user base, which brings to light different aspects of real-world events. This makes it critical as well as challenging to combine textual and visual data from different platforms. In this article, we propose summarization of real-world events with data stemming from different platforms and multiple modalities. We present the use of a Markov Random Fields based similarity measure to link content across multiple platforms. This measure also enables the linking of content across time, which is useful for tracking the evolution of long-running events. For the final content selection, summarization is modeled as a subset selection problem. To handle the complexity of the optimal subset selection, we propose the use of submodular objectives. Facets such as coverage, novelty and significance are modeled as submodular objectives in a multimodal social media setting. We conduct a series of quantitative and qualitative experiments to illustrate the effectiveness of our approach compared to alternative methods.
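
To make the idea of summarization as submodular subset selection concrete, here is a minimal sketch of the standard greedy algorithm applied to a plain set-coverage objective, which is monotone and submodular. It is only an illustration under stated assumptions: the abstract does not spell out the article's actual objectives, and the post identifiers, facet tags and the `greedy_summary` function are hypothetical.

```python
from typing import Dict, List, Set


def greedy_summary(items: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` items, each time taking the largest marginal coverage gain."""
    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(budget):
        best_id, best_gain = None, 0
        for item_id in sorted(items):  # sorted order gives deterministic tie-breaking
            if item_id in chosen:
                continue
            gain = len(items[item_id] - covered)  # facets this item would newly cover
            if gain > best_gain:
                best_id, best_gain = item_id, gain
        if best_id is None:  # no remaining item adds anything new
            break
        chosen.append(best_id)
        covered |= items[best_id]
    return chosen


# Hypothetical posts from different platforms, each tagged with the facets it covers.
posts = {
    "tweet_1": {"flood", "rescue"},
    "photo_7": {"flood", "damage", "bridge"},
    "video_3": {"rescue"},
    "post_9": {"donation"},
}
summary = greedy_summary(posts, budget=2)
```

For monotone submodular objectives under a cardinality budget, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value, which is why it is a common choice when exact subset selection is intractable.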

Structure-aware Multi-modal Feature Fusion for RGB-D Scene Classification and Beyond

While convolutional neural networks (CNN) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this paper, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV) encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities of RGB, HHA and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity --- that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal non-sparsity --- that features from all modalities are encouraged to co-exist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we further apply our feature fusion framework to an action recognition task to demonstrate that our framework can be generalized to other multi-modal well-structured features. In particular, for action recognition, we enforce inter-part sparsity to choose more discriminative body parts, and inter-modal non-sparsity to make informative features from both appearance and motion modalities co-exist. Experimental results on the JHMDB and MPII Cooking datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state-of-the-art.
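
For readers unfamiliar with the two regularizers named above, their commonly used textbook forms are sketched below. The notation is an assumption made for illustration and is not taken from the paper: $w$ is a weight vector, $\mathcal{G}$ is a partition of its indices into groups, and $w_g$ is the sub-vector for group $g$. How the paper actually defines its groups (GMM components, modalities) and weights the terms is not stated in the abstract.

```latex
% Group lasso: an l2 norm inside each group, summed across groups.
% It tends to zero out entire groups (sparsity across groups).
\Omega_{\mathrm{GL}}(w) = \sum_{g \in \mathcal{G}} \lVert w_g \rVert_2

% Exclusive group lasso: a squared l1 norm inside each group, summed across groups.
% It tends to induce sparsity within each group while keeping every group active.
\Omega_{\mathrm{EGL}}(w) = \sum_{g \in \mathcal{G}} \lVert w_g \rVert_1^{2}
```

Read this way, the first penalty matches the "component sparsity" postulate, where few groups survive, and the second matches "modal non-sparsity", where every group keeps some non-zero weights.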


