ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Latest Articles

Joint Head Attribute Classifier and Domain-Specific Refinement Networks for Face Alignment

In this article, a two-stage refinement network is proposed for facial landmark detection under unconstrained conditions. Our model can be divided into... (more)

Unsupervised Similarity Learning through Rank Correlation and kNN Sets

The increasing amount of multimedia data collections available today evinces the pressing need for methods capable of indexing and retrieving this... (more)

Thinking Like a Director: Film Editing Patterns for Virtual Cinematographic Storytelling

This article introduces Film Editing Patterns (FEP), a language to formalize film editing practices and stylistic choices found in movies. FEP constructs are constraints, expressed over one or more shots from a movie sequence, that characterize changes in cinematographic visual properties, such as shot sizes, camera angles, or layout of actors on... (more)

SKEPRID: Pose and Illumination Change-Resistant Skeleton-Based Person Re-Identification

Currently, surveillance-camera-based person re-identification is still challenging because of diverse factors such as people's changing poses and varying illumination. The varied poses make it hard to conduct feature matching across images, and the illumination changes make color-based features unreliable. In this article, we present... (more)

Unsupervised Person Re-identification: Clustering and Fine-tuning

The superiority of deeply learned pedestrian representations has been reported in very recent literature of person re-identification (re-ID). In this article, we consider the more pragmatic issue of learning a deep feature with no or only a few labels. We propose a progressive unsupervised learning (PUL) method to transfer pretrained deep... (more)

Robust Electric Network Frequency Estimation with Rank Reduction and Linear Prediction

This article deals with the problem of Electric Network Frequency (ENF) estimation where Signal to Noise Ratio (SNR) is an essential challenge. By... (more)

Probability Model-Based Early Merge Mode Decision for Dependent Views Coding in 3D-HEVC

As a 3D extension to the High Efficiency Video Coding (HEVC) standard, 3D-HEVC was developed to improve the coding efficiency of multiview videos. It... (more)

A Hybrid Approach for Spatio-Temporal Validation of Declarative Multimedia Documents

Declarative multimedia documents represent the description of multimedia applications in terms of media items and relationships among them.... (more)

Image Captioning via Semantic Guidance Attention and Consensus Selection Strategy

Recently, a series of attempts have incorporated spatial attention mechanisms into the task of image captioning, achieving remarkable... (more)

OmniArt: A Large-scale Artistic Benchmark

Baselines are the starting point of any quantitative multimedia research, and benchmarks are essential for pushing those baselines further. In this article, we present baselines for the artistic domain with a new benchmark dataset, dubbed OmniArt, featuring over 2 million images with rich structured metadata. OmniArt contains annotations for dozens... (more)

Collaborations on YouTube: From Unsupervised Detection to the Impact on Video and Channel Popularity

YouTube is the most popular platform for streaming of user-generated videos. Nowadays, professional YouTubers are organized in so-called multichannel networks (MCNs). These networks offer services such as brand deals, equipment, and strategic advice in exchange for a share of the YouTubers’ revenues. A dominant strategy to gain more... (more)


[July 2018]

Special issue call: "Face Analysis for Applications". Call for papers   Submission deadline Oct. 14th, 2018


[June 2018]

Call for Nominations for TOMM Nicolas D. Georganas Best Paper Award 2018

The Editor-in-Chief of ACM TOMM invites nominations for the ACM TOMM Nicolas D. Georganas Best Paper Award. The deadline for nominations of papers published in ACM TOMM from January 2017 to December 2017 is July 10th, 2018. See the call for nominations cfn

[May 2018]

Special issue call: "Affective Computing for Large-Scale Heterogeneous Multimedia Data". Cfp Submission deadline Dec. 15th, 2018


[April 2018]

Special Issue call: "Big Data, Machine Learning and AI Technologies for Art and Design". CfpSubmission deadline June 15th, 2018 Extended to August 31st 2018


[February 2018]

Special Issue on "Cross-Media Analysis for Visual Questions" Cfp Submission deadline June 30th 2018 Extended to July 31th, 2018

[October 2017]


We invite highly qualified scientists to submit proposals for 2018-19 ACM TOMM Special Issues. Each Special Issue is the responsibility of its Guest Editors. Proposals are accepted until December 31st, 2017. They should be prepared according to the instructions outlined below and sent by e-mail to the Information Director Stefano Berretti ([email protected]) and the Editor-in-Chief of ACM TOMM Alberto del Bimbo ([email protected]). More information about proposal submission can be found in the CfP.


[June 2017]

The Impact Factor for the year 2016 is now available. ACM TOMM increased its IF from 0.982 to 2.250 and is now the second-ranked journal in the area of Multimedia. Thank you to all the EB members, authors, reviewers and readers for this excellent result.

Special Issue on " Multi-modal Understanding of Social, Affective and Subjective Attributes of Data". Cfp . Submission deadline Oct. 1st 2017

Special Issue on "Deep Learning for Intelligent Multimedia Analytics". Cfp. Submission deadline Oct. 15 2017

[April 2017]

Special Issue on  "QoE Management for Multimedia Services". Cfp Submission deadline May 15, 2017 Extended to June 15, 2017

[April 2017]

Call for Nominations for TOMM Nicolas D. Georganas Best Paper Award 2017

The Editor-in-Chief of ACM TOMM invites nominations for the ACM TOMM Nicolas D. Georganas Best Paper Award. The deadline for nominations of papers published in ACM TOMM from January 2016 to December 2016 is June 15th, 2017. See the call for nominations cfn

[February 2017]

Upcoming special issues:

- "Delay-Sensitive Video Computing in the Cloud". Cfp   Submission deadline  Aug. 20, 2017

- "QoE Management for Multimedia Services". Cfp Submission deadline May 15, 2017

- "Representation, Analysis and Recognition of 3D Humans" Call for papers 

[January 2017]

ACM TOMM Associate Editor (AE) guidelines have been added

[December 2016]

ACM TOMM Special Issue on "Delay-Sensitive Video Computing in the Cloud". Cfp Submission deadline Nov. 30, 2016 Extended to Dec. 30, 2016

[November 2016]

- ACM TOMM Special Issue on "Deep Learning for Mobile Multimedia". Cfp. Submission deadline Oct. 15, 2016, extended to Nov. 25, 2016

- Special Section on "Multimedia Computing and Applications of Socio-Affective Behaviors in the Wild"Cfp Submission deadline Oct. 31, 2016 Extended to Nov. 25, 2016

- Special Section on "Multimedia Understanding via Multimodal Analytics". Cfp Submission deadline Oct. 31, 2016 Extended to Nov. 25, 2016


[September 2016]


The 2016 ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) Nicolas D. Georganas Best Paper Award is awarded to the paper "Cross-Platform Emerging Topic Detection and Elaboration from Multimedia Streams" (TOMM vol. 11, Issue 4) by Bing-Kun Bao, Changsheng Xu, Weiqing Min and Mohammod Shamim Hossain.

Dr. Cheng-Hsin Hsu has been named the ACM TOMM Associate Editor of the Year for 2016! Congratulations to Cheng-Hsin!

[August 2016]

Call for Nominations for TOMM Nicolas D. Georganas Best Paper Award

The Editor-in-Chief of ACM TOMM invites nominations for the ACM TOMM Nicolas D. Georganas Best Paper Award. The deadline for nominations of papers published in ACM TOMM from January 2015 to December 2015 is September 10th, 2016. See the cfn

[June 2016]

Forthcoming Special Issues in 2017

We received 11 competitive proposals this year and had limited slots available, so it was a very tough decision. In the end, the following four SI proposals were selected and scheduled as follows:

- "Deep Learning for Mobile Multimedia". Cfp  Submission deadline Oct. 15, 2016 Extended to Oct. 31, 2016

- "Representation, Analysis and Recognition of 3D Human". Cfp   Submission deadline Jan. 15, 2017 Extended to Feb. 15, 2017 

Two Special Sections have also been accepted and scheduled for publication in 2017:

- "Multimedia Computing and Applications of Socio-Affective Behaviors in the Wild". Cfp Submission deadline Oct. 31, 2016

- "Multimedia Understanding via Multimodal Analytics". Cfp Submission deadline Oct. 31, 2016

[June 2016]

Forthcoming Special Issues in 2016

"Trust Management for Multimedia Big Data" - Publication date August 2016

"Multimedia Big Data: Networking" - Publication date November 2016

[February 2016]

Advisory Board

We have created the ACM TOMM Advisory Board to support the Editor-in-Chief in the definition and implementation of strategies; the board has no editorial duties. The following colleagues have been appointed as members of the ACM TOMM Advisory Board: Prof. Wen Gao, Peking University; Prof. Arnold Smeulders, University of Amsterdam; Prof. Nicu Sebe, University of Trento.

[January 2016]

New Assistant Information Director

Starting on January 1st, 2016, Marco Bertini will serve as Assistant Information Director of ACM TOMM.

[January 2016]

New Information Director

Starting on January 1st, 2016, Stefano Berretti will serve as Information Director of ACM TOMM.

[January 2016]

New Editor-in-Chief

After the end of the second term of Ralf Steinmetz, Alberto Del Bimbo from the University of Florence will be the next TOMM Editor-in-Chief starting on January 1st 2016. 

ACM TOMM Nicolas D. Georganas Best Paper Award 2015

The award goes to the article "A Quality of Experience Model for Haptic Virtual Environments" (TOMM vol. 10, Issue 3) by Abdelwahab Hamam, Abdulmotaleb El Saddik and Jihad Alja'am. Congratulations!

ACM TOMM Associate Editor of the Year 2015 Award

The award goes to Pradeep Atrey from the State University of New York, USA, for his excellent work for the journal. Congratulations!

CfP: Special Issue "Multimedia Big Data: Networking"

Please consider submitting to the second special issue in next year's special issue series. Call for Papers


CfP: Special Issue "Trust Management for Multimedia Big Data"

Next year, TOMM will feature a special issue series on "Multimedia Big Data". The first topic will be "Trust Management". Extended deadline: October 15th! Call for Papers


Call for Nominations TOMM Editor-in-Chief

After two terms of the current EiC Ralf Steinmetz, the search committee started the search for a new Editor-in-Chief. Call for Nominations


New ACM submission templates

The new ACM submission templates are online. Please use the most recent link on the authors' guide to find the files.


About TOMM


A peer-reviewed, quarterly archival journal in print and digital form, TOMM consists primarily of research papers of lasting importance and value in the field of multimedia computing, communications and applications. 





News archive
Forthcoming Articles

Special Section on Multimodal Understanding of Social, Affective and Subjective Attributes

Editorial to Special Issue on Deep Learning for Intelligent Multimedia Analytics

Structure-Aware Deep Learning for Product Image Classification

Automatic product image classification is a task of crucial importance for better understanding and management of online retailers. Motivated by recent advances of deep convolutional neural networks (CNNs) on image classification, in this work we revisit the problem in the context of product images with a predefined categorical hierarchy and attributes, aiming to leverage the hierarchy and attributes to further improve classification accuracy. With these structure-aware clues, we argue that more advanced CNN models can be developed beyond the one-versus-all classification performed by conventional CNNs. To this end, the novel efforts of this work include: developing a saliency-sensitive CNN that focuses more on the product foreground by inserting a spatial attention layer at a proper location; proposing a multi-class regression based refinement method that is expected to generate more accurate predictions by utilizing prediction scores from multiple preceding CNNs, each corresponding to a distinctive classifier on a categorical layer in the hierarchy; and devising a multi-task deep learning architecture that effectively explores correlations among the categories and attributes for better categorical label prediction. Experimental results on nearly one million real-world product images validate the effectiveness of the proposed efforts jointly and individually, with performance gains observed for each.

Dense 3D-Convolutional Neural Network for Person Re-Identification in Videos

Person re-identification aims at identifying a certain pedestrian across non-overlapping multi-camera networks at different times and places. Existing person re-identification approaches mainly focus on matching pedestrians in still images; however, little attention has been paid to person re-identification in videos. Compared to images, video clips contain motion of pedestrians, which is crucial to re-identification. Moreover, consecutive video frames present pedestrian appearance with different poses and from different viewpoints, providing valuable information towards addressing the challenges of pose variation, occlusion, viewpoint change, etc. In this paper, we propose a Dense 3D-Convolutional Network (D3DNet) to jointly learn spatio-temporal and appearance features for person re-identification in videos. The D3DNet consists of multiple 3D dense blocks and transition layers. The 3D dense blocks enlarge the receptive fields of visual neurons in the spatial and temporal dimensions, yielding discriminative appearance representations as well as short-term and long-term motion information of pedestrians without requiring an additional motion estimation module. Moreover, we propose an improved loss function consisting of an identification loss and a center loss to minimize intra-class variance and maximize inter-class variance simultaneously, addressing the challenge of large intra-class variance and small inter-class variance, a common phenomenon in the person re-identification task. Extensive experiments on two widely-used surveillance video datasets, i.e., MARS and iLIDS-VID, have shown the effectiveness of the proposed approach.
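The improved loss in the abstract above combines a softmax identification loss with a center loss. The following is only a minimal pure-Python sketch of that combination, not the authors' implementation; the feature vectors, centers, and the weight `lam` are hypothetical.

```python
import math

def identification_center_loss(logits, features, labels, centers, lam=0.5):
    """Combined loss: softmax identification loss plus center loss.

    logits[i]    -- unnormalized identity scores for sample i
    features[i]  -- embedding of sample i
    labels[i]    -- ground-truth identity of sample i
    centers[c]   -- center of identity c in embedding space
    lam          -- weight of the center-loss term (hypothetical value)
    """
    n = len(labels)
    # Identification loss: softmax cross-entropy over identities,
    # pushing inter-class scores apart.
    id_loss = 0.0
    for scores, y in zip(logits, labels):
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        id_loss += log_z - scores[y]
    # Center loss: pull each embedding toward its identity's center,
    # shrinking intra-class variance.
    c_loss = 0.0
    for f, y in zip(features, labels):
        c_loss += sum((fi - ci) ** 2 for fi, ci in zip(f, centers[y]))
    return (id_loss + lam * 0.5 * c_loss) / n
```

In a real network the centers are updated alongside the weights during training; here they are passed in as fixed values for illustration.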

Discovering Latent Topics by Gaussian Latent Dirichlet Allocation and Spectral Clustering

Nowadays, diversifying the retrieval results of a certain query improves search efficiency for customers. Showing information on multiple aspects gives users an overview of the object, helping them quickly target their demands. To discover aspects, research has focused on generating image clusters from the initially retrieved results. As an effective approach, Latent Dirichlet Allocation (LDA) has proved to perform well at discovering high-level topics. However, traditional LDA is designed to process textual words and needs its input in the form of discrete data. When applying this algorithm to continuous visual images, a common solution is to quantize the continuous features into a discrete representation with the Bag-of-Visual-Words (BoVW) algorithm. During this process, quantization error inevitably leads to information loss. To construct a topic model with complete visual information, this work applies Gaussian Latent Dirichlet Allocation (GLDA) to the diversity issue of image retrieval. In this model, the traditional multinomial distribution is substituted by a Gaussian distribution to model continuous visual features. Besides, we propose a two-phase spectral clustering strategy, called dual spectral clustering, to generate clusters from the segment level to the image level. Experiments on challenging landmarks of the DIV400 database show that our proposal improves relevance and diversity by about 10% compared with traditional topic models.
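The key move in GLDA — replacing the multinomial word distribution with Gaussians so continuous descriptors need not be hard-quantized — can be illustrated with a soft topic-assignment sketch. All parameter values below are hypothetical, and this is only the assignment step, not the paper's full GLDA inference.

```python
import math

def topic_posteriors(x, topics, weights):
    """Soft topic assignment for a continuous visual descriptor x.

    Instead of hard-quantizing x to its nearest visual word (as BoVW
    does, losing information), each topic is an isotropic Gaussian
    (mean, variance) and x receives a posterior over topics.

    topics  -- list of (mean_vector, variance) pairs, one per topic
    weights -- prior topic proportions (hypothetical values)
    """
    logps = []
    d = len(x)
    for (mean, var), w in zip(topics, weights):
        sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
        # log of: prior * isotropic Gaussian density at x
        logps.append(math.log(w) - 0.5 * (d * math.log(2 * math.pi * var) + sq / var))
    # Normalize in log space for numerical stability.
    m = max(logps)
    z = sum(math.exp(lp - m) for lp in logps)
    return [math.exp(lp - m) / z for lp in logps]
```

A descriptor lying near one topic mean receives nearly all of the posterior mass for that topic, while ambiguous descriptors split their mass instead of being forced into one visual word.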

Deep Patch Representations with Shared Codebook for Scene Classification

Scene classification is a challenging problem. Compared with object images, scene images are more abstract, being composed of objects. Object and scene images have different characteristics, with different scales and composition structures. How to effectively integrate local mid-level semantic representations covering both object and scene concepts needs to be investigated, as an important aspect of scene classification. In this paper, the idea of a shared codebook is introduced by organically integrating deep learning, concept features, and local feature encoding techniques. More specifically, the shared local feature codebook is generated from the combined ImageNet1000 and Places365 concepts (Mixed1365) using convolutional neural networks. As the Mixed1365 features cover all the semantic information, including both object and scene concepts, we can extract a shared codebook from the Mixed1365 features that contains only a subset of the whole 1365 concepts with the same codebook size. The shared codebook not only provides complementary representations without additional codebook training, but can also be adaptively extracted for different scene classification tasks. A method combining both the original codebook and the shared codebook is proposed for scene classification. In this way, more comprehensive and representative image features can be generated for classification. Extensive experiments conducted on two public datasets validate the effectiveness of the proposed method. Besides, some useful observations are revealed to show the advantage of the shared codebook.

Efficient QoE-Aware Scheme for Video Quality Switching Operations in Dynamic Adaptive Streaming

Dynamic Adaptive Streaming over HTTP (DASH) is a popular over-the-top video content distribution technique that adapts the streaming session according to the user's network condition, typically in terms of downlink bandwidth. This video quality adaptation can be achieved by scaling the frame quality, spatial resolution, or frame rate. Despite the flexibility of these video quality scaling methods, each scaling dimension has varying effects on the Quality of Experience (QoE) of end users. Furthermore, in video streaming, the changes in motion over time, along with the scaling method employed, influence QoE; hence the need to carefully tailor scaling methods to suit streaming applications and content types. In this work, we investigate an intelligent DASH approach for the latest video coding standard H.265 and propose a heuristic QoE-aware, cost-efficient adaptation scheme that does not switch unnecessarily to the highest quality level but rather stays temporarily at an intermediate quality level in certain streaming scenarios. Such an approach achieves a comparable and consistent level of quality under impaired network conditions, as commonly found in Internet and mobile networks, whilst reducing bandwidth requirements and quality switching overhead. The rationale is based on our empirical experiments, which show that an increase in bitrate does not necessarily mean a noticeable improvement in QoE. Furthermore, our work demonstrates that the Signal-to-Noise Ratio (SNR) and spatial resolution scalability types are the best fit for our proposed algorithm. Finally, we demonstrate an innovative interaction between quality scaling methods and the polarity of switching operations. The proposed QoE-aware scheme is implemented, and empirical results show that it is able to reduce bandwidth requirements by up to 41% whilst achieving equivalent QoE compared with a representative DASH reference implementation.
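The general flavor of a QoE-aware scheme that avoids unnecessary switches to the highest level can be sketched as a rate-selection heuristic. This is a simplified illustration under assumed bitrate ladders and a hypothetical headroom factor, not the paper's exact algorithm.

```python
def choose_representation(bandwidth_kbps, bitrates_kbps, current_idx,
                          headroom=1.2):
    """Heuristic DASH rate selection for the next segment.

    Rather than always jumping to the highest level the downlink can
    sustain, the client climbs at most one level per segment, and only
    levels whose bitrate fits the measured bandwidth with some headroom
    are considered sustainable. This reduces quality-switch overhead
    and bandwidth use. The headroom factor 1.2 is a made-up value.

    bitrates_kbps -- representation ladder, sorted ascending
    current_idx   -- index of the currently selected representation
    """
    sustainable = 0
    for i, b in enumerate(bitrates_kbps):
        if b * headroom <= bandwidth_kbps:
            sustainable = i            # highest level that safely fits
    if sustainable > current_idx:
        return current_idx + 1         # climb gradually, one level at a time
    return sustainable                 # drop immediately to avoid stalls
```

Note the asymmetry: upward switches are damped (the "stay at an intermediate level" behavior), while downward switches are immediate, since underestimating the drop risks rebuffering.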

Virtual Portraitist: An Intelligent Tool for Taking Well-Posed Selfies

Smart photography carries the promise of quality improvement and functionality extension in making aesthetically appealing pictures. In this paper, we focus on self-portrait photographs and introduce new methods that guide a user on how to best pose while taking a selfie. While most current solutions use a post-processing procedure to beautify a picture, the developed tool enables a novel function: recommending a good look before the photo is captured. Given an input face image, the tool automatically estimates the pose-based aesthetic score, finds the most attractive angle of the face, and suggests how the pose should be adjusted. The recommendation results are adapted to the appearance and initial pose of the input face. We apply a data mining approach to find distinctive, frequent itemsets and association rules from online profile pictures, upon which the aesthetic estimation and pose recommendation methods are developed. A simulated and a real image set are used for experimental evaluation. The results show the proposed aesthetic estimation method can effectively select user-favored photos. Moreover, the recommendation performance for the vertical adjustment is moderately related to the degree of agreement among the professional photographers' recommendations. This study echoes the trend of instant photo sharing, in which a user takes a picture and then immediately shares it on a social network without engaging in tedious editing.

Show, Reward and Tell: Adversarial Visual Story Generation

Despite the promising progress made in visual captioning and paragraphing, visual storytelling is still largely unexplored. This task is more challenging due to the difficulty of modeling an ordered photo sequence and of generating a relevant paragraph with an expressive language style for storytelling. To deal with these challenges, we propose an Attribute-based Hierarchical Generative model with Reinforcement Learning and adversarial training (AHGRL). First, to model the ordered photo sequence and the complex story structure, we propose an attribute-based hierarchical generator. The generator incorporates semantic attributes to create more accurate and relevant descriptions. The hierarchical framework enables the generator to learn from the complex paragraph structure. Second, to generate story-style paragraphs, we design a language-style discriminator, which provides word-level rewards to optimize the generator by policy gradient. Third, we further treat the story generator and the reward critic as adversaries: the generator aims to create paragraphs indistinguishable from human-level stories, whereas the critic aims to distinguish them, further improving the generator. Extensive experiments on the widely-used dataset demonstrate the advantages of the proposed method over state-of-the-art methods.

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they provide rich clues on object contexts within images. In this paper, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, aiming to explicitly model spatial object contexts within images. In particular, for every object proposal obtained from the input image, we extract high-level semantic object features with convolutional neural networks. Then, we apply gate units to these extracted features for important-object selection and noise removal. The selected object features are organized into the proposed SFM, which is a compact and discriminative representation that preserves the spatial information among objects. Finally, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) as classifiers on top of the SFM for content recognition, which are expected to exploit the spatial relationships among objects. We also introduce a novel multi-task learning framework to help learn the model parameters in the training phase. It consists of a basic image classification loss in cross-entropy form, an object localization loss to guide important-object selection, and a grid labeling loss to predict object labels at SFM grids. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach, and very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. In addition, the SFMs learned on the image domain are transferred to video classification on the CCV and FCVID benchmarks, and the results demonstrate its robustness and generalization capability.
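The gate units mentioned in the abstract above scale each object feature by a learned score so that noisy proposals are suppressed before entering the SFM. As a rough illustration, the sketch below stands in a fixed projection (the feature mean) for the learned gate parameters, so it shows only the mechanism, not the trained model.

```python
import math

def gate_features(object_features):
    """Gate unit for important-object selection (mechanism sketch only).

    Each object proposal's feature vector is scaled by a sigmoid gate
    in (0, 1) computed from the feature itself; proposals scoring low
    are attenuated rather than hard-discarded. In the real model the
    gate is a learned projection; here the mean of the feature is used
    as a hypothetical stand-in.
    """
    gated = []
    for f in object_features:
        score = sum(f) / len(f)             # stand-in for a learned projection
        g = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in (0, 1)
        gated.append([g * fi for fi in f])
    return gated
```

Because the gate is differentiable, the whole selection step can be trained end-to-end with the classification, localization, and grid-labeling losses described above.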

Reconstructing 3D Face Models by Incremental Aggregation and Refinement of Depth Frames

Face recognition from 2D still images and videos is quite successful even in "in the wild" conditions. By contrast, less consolidated results are available for cases where face data come from non-conventional cameras, such as infrared or depth. In this paper, we investigate the latter scenario, assuming a low-resolution depth camera is used to perform face recognition in an uncooperative context. To this end, we first propose to automatically select a set of frames from the depth sequence of the camera based on whether they provide a good view of the face in terms of pose and distance. Then, we design a progressive refinement approach to reconstruct a higher-resolution model from the selected low-resolution frames. This process accounts for the anisotropic error of the existing points in the current 3D model and of the points in a newly acquired frame, so that the refinement step can progressively adjust the point positions in the model using a Kalman-like estimation. The quality of the reconstructed model is evaluated by considering the error between the reconstructed models and their corresponding high-resolution scans used as ground truth. In addition, we performed face recognition using the reconstructed models as probes against either a gallery of reconstructed models or a gallery of high-resolution scans. The obtained results confirm the possibility of effectively using the reconstructed models for the face recognition task.
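The Kalman-like estimation mentioned above can be pictured, in its simplest scalar form, as follows. This is a minimal sketch of the progressive-refinement idea with made-up variances; the paper's model tracks anisotropic 3D point error, not a single scalar.

```python
def kalman_refine(estimate, est_var, measurement, meas_var):
    """One Kalman-style update of a model point from a new depth frame.

    estimate, est_var     -- current point position and its error variance
    measurement, meas_var -- position of the matching point in the newly
                             acquired low-resolution frame and its error
                             variance (scalar here for simplicity)

    Returns the refined position and its reduced variance: each new
    frame pulls the model point toward the measurement in proportion to
    how uncertain the current estimate is.
    """
    gain = est_var / (est_var + meas_var)
    new_estimate = estimate + gain * (measurement - estimate)
    new_var = (1.0 - gain) * est_var
    return new_estimate, new_var
```

Applied repeatedly over the selected frames, the variance of each point shrinks, which is what lets a higher-resolution model emerge from a sequence of noisy low-resolution views.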

Symmetrical Residual Connections for Single Image Super-Resolution

Single-image super-resolution (SISR) methods based on convolutional neural networks (CNNs) have shown great success in the literature. However, most deep CNN models don't give earlier layers direct access to subsequent layers, which seriously hinders information flow. What's more, they also don't make full use of the hierarchical features from the original low-resolution (LR) images, thereby achieving relatively low performance. In this paper, we present a special SISR CNN with symmetrical nested residual connections for super-resolution reconstruction to further improve the quality of the reconstructed image. Compared with previous SISR CNNs, our learning architecture shows significant improvements in accuracy and execution time. It has a larger image region for contextual spreading. Its symmetrical combinations provide multiple short paths for forward propagation, improving reconstruction accuracy, and for the backward propagation of gradient flow, accelerating convergence. Extensive experiments on open challenge datasets confirm the effectiveness of symmetrical residual connections. Our method can reconstruct high-quality high-resolution (HR) images at a relatively fast speed and outperforms other methods by a large margin.

HTTP/2-Based Frame Discarding for Low-Latency Adaptive Video Streaming

In this paper, we propose video delivery schemes ensuring around one second of delivery latency. To this end, we use Dynamic Adaptive Streaming over HTTP (DASH), a standardized counterpart of HTTP Live Streaming (HLS), to benefit from video representation switching between successive video segments. We also propose HTTP/2-based algorithms that apply video frame discarding policies inside a video segment. When a selected DASH representation does not match the available network resources, current solutions suffer from rebuffering events. Rebuffering does not only impact the Quality of Experience (QoE); it also increases the delivery delay between the displayed and the original video streams. We observe that rebuffering-based solutions may increase the delivery delay by 1.5 s to 2 s inside a six-second video segment. In this work, we develop optimal and practical algorithms in order to respect the one-second target latency. In all algorithms, we selectively drop the least meaningful video frames thanks to the HTTP/2 stream resetting feature. A large number of missing video frames results in a break of temporal fluidity known as video jitter: the displayed video appears as a series of snapshots. Our simulations show that we respect the one-second target latency while ensuring acceptable video quality, with a Peak Signal-to-Noise Ratio (PSNR) of at least 30 dB. We also quantify and qualify the resulting jitters for each algorithm. We show that both the optimal and the practical algorithms we propose decrease the impact of jitters on the displayed videos. For example, 97% of the optimal algorithm's outputs and 87% of the practical algorithms' outputs are considered acceptable, compared to only 57% of the outputs of the basic First In First Out (FIFO) algorithm.
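The idea of dropping the least meaningful frames inside a segment can be sketched as a priority-based selection. The frame types, decode costs, and budget below are hypothetical, and this is only an illustration of the discarding policy, not the paper's optimal or practical algorithms.

```python
def frames_to_drop(frames, budget_ms):
    """Pick frames to discard so the remaining segment fits a delay budget.

    frames -- list of (frame_id, frame_type, decode_ms) tuples; B-frames
              are discarded first, then P-frames, latest-first within a
              type. I-frames are always kept, since dropping a reference
              frame would corrupt every frame that depends on it.
    Returns the ids of discarded frames, in discard order.
    """
    total = sum(ms for _, _, ms in frames)
    # Least meaningful first: B-frames, then P-frames, later frames first.
    priority = [f for f in frames if f[1] == 'B'][::-1] + \
               [f for f in frames if f[1] == 'P'][::-1]
    dropped = []
    for f in priority:
        if total <= budget_ms:
            break
        dropped.append(f[0])
        total -= f[2]
    return dropped
```

In an HTTP/2 delivery, each discarded frame would correspond to resetting the stream carrying it, so the client never spends bandwidth or time on data it will not display.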

CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

It is known that the inconsistent distribution and representation of different modalities, such as image and text, cause a heterogeneity gap, which makes it very challenging to correlate such heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown a strong ability to model data distributions and learn discriminative representations. Inspired by this, we aim to effectively correlate existing large-scale heterogeneous data of different modalities by utilizing the power of GANs to model the cross-modal joint distribution, so that the idea of adversarial learning can be fully exploited to learn discriminative common representations for bridging the heterogeneity gap. Thus, in this paper we propose Cross-modal Generative Adversarial Networks (CM-GANs) with the following contributions: (1) A cross-modal GAN architecture is proposed to model the joint distribution over the data of different modalities. The inter-modality and intra-modality correlations can be explored simultaneously in the generative and discriminative models, which compete with each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit the cross-modal correlation for learning the common representation, but also preserve the reconstruction information for capturing the semantic consistency within each modality. (3) A cross-modal adversarial mechanism is proposed, which utilizes two kinds of discriminative models to simultaneously conduct intra-modality and inter-modality discrimination. They mutually boost each other to make the generated common representation more discriminative through the adversarial training process. In summary, our proposed CM-GANs approach can utilize GANs to perform cross-modal common representation learning, by which heterogeneous data can be effectively correlated. Extensive experiments are conducted to verify the performance of CM-GANs on cross-modal retrieval, compared with 11 state-of-the-art methods on 3 cross-modal datasets.

Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research has focused on cross-modal correlation learning where the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Stemming from the inherently temporal structure of music, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from the different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is utilized as an objective function to calculate the similarity of the temporal structures. This is the first study on understanding the correlation between language and music audio through deep architectures that learn the paired temporal correlation of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent the lyrics. Two significant contributions are made in the audio branch: i) a pre-trained CNN followed by fully-connected layers is investigated for representing music audio; ii) we further propose an end-to-end architecture that simultaneously trains the convolutional layers and the fully-connected layers to better learn the temporal structures of music audio. In particular, our end-to-end deep architecture has two properties: it simultaneously implements feature learning and cross-modal correlation learning, and it learns joint representations by considering temporal structures. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.

Personalized Emotion Recognition by Personality-aware High-order Learning of Physiological Signals

Emotion recognition from physiological signals is increasingly becoming personalized, owing to the subjective responses of different subjects to the same stimuli. Existing works have mainly focused on modelling each subject's physiological signals, without considering psychological factors such as interest and personality. The latent correlation among different subjects has also rarely been examined. In this paper, we investigate the influence of personality on emotional behavior in a hypergraph learning framework. Treating each vertex as a compound tuple (subject, stimulus), multi-modal hypergraphs can be constructed based on the personality correlation among different subjects and on the physiological correlation among the corresponding stimuli. To model the different importance of vertices, hyperedges, and modalities, we assign a weight to each of them. The learning procedure can then be conducted on vertex-weighted multi-modal multi-task hypergraphs, simultaneously modelling the emotions of multiple subjects. The estimated emotion relevance is employed for emotion recognition. Extensive experiments on the ASCERTAIN dataset demonstrate the superiority of the proposed method over state-of-the-art approaches.

Modeling Dyadic and Group Impressions with Inter-Modal and Inter-Person Features

This paper proposes a novel feature-extraction framework for inferring impressions of personality traits, emergent leadership skills, communicative competence, and hiring decisions. The framework extracts multimodal features describing each participant's nonverbal activities, capturing inter-modal and inter-person relationships in an interaction, i.e., how the target interactor produces nonverbal behavior when the other interactors are also producing nonverbal behavior. The inter-modal and inter-personal patterns are identified as frequently co-occurring events through graph clustering over multimodal sequences. The framework can be applied to any type of interaction task. We apply it to the SONVB corpus, an audio-visual dataset collected from dyadic job interviews, and the ELEA corpus, an audio-visual dataset collected from group meetings. We evaluate the framework on a binary classification task over 15 impression variables in the two corpora. The experimental results show that the model trained with co-occurrence features is more accurate than previous models for 14 out of 15 traits.

Deep Learning based Multimedia Analytics: A Review

The multimedia community has witnessed the rise of deep learning based techniques for analyzing multimedia content more effectively. In the past decade, the convergence of deep learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection, and regression, and has also fundamentally changed the landscape of several relatively new areas, such as semantic segmentation, captioning, and content generation. This paper reviews the development path of the major tasks in multimedia analytics and then looks ahead to future directions. We start by summarizing the fundamental deep learning techniques related to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. Moreover, a performance review on popular benchmarks traces the path of technological advancement and helps identify both the milestone works and future directions.

Cross-Modality Feature Learning via Convolutional AutoEncoder

Learning robust and representative features across multiple modalities is a fundamental problem in the machine learning and multimedia fields. In this paper, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from the visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of the different modalities jointly by exploiting the correlation between their hidden representations, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of the different modalities. Compared to conventional solutions relying on hand-crafted features, the proposed MUCAE approach encodes features directly from image pixels and text characters and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world large-scale multimedia databases. Experimental results show that MUCAE performs better than state-of-the-art methods.
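The joint objective described above, per-modality reconstruction error plus a cross-modal divergence term, might look like the following minimal sketch (the names are assumptions, and the correlation-divergence term is a simple cosine-based stand-in, not the paper's exact formulation):

```python
# Minimal sketch of a MUCAE-style joint objective (hypothetical): each
# modality pays its own reconstruction error, and a divergence term pulls
# the two hidden representations toward each other.
import numpy as np

def joint_loss(img, img_rec, txt, txt_rec, h_img, h_txt, lam=0.1):
    """img/txt: inputs; img_rec/txt_rec: reconstructions; h_*: hidden codes."""
    recon = np.mean((img - img_rec) ** 2) + np.mean((txt - txt_rec) ** 2)
    # Stand-in for the correlation divergence: 1 - cosine similarity of codes.
    cos = h_img @ h_txt / (np.linalg.norm(h_img) * np.linalg.norm(h_txt) + 1e-12)
    return recon + lam * (1.0 - cos)
```

Minimizing this jointly over both autoencoders trades off faithful per-modality reconstruction against agreement between the two hidden codes, controlled by the weight lam.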

Applying Deep Learning to Epilepsy Seizure Detection and Brain Mapping

Deep convolutional neural networks (CNNs) have achieved remarkable results in end-to-end learning for computer vision tasks. Here we evaluate the power of a deep CNN to learn robust features from raw EEG data for seizure detection. Seizures are hard to detect, as they vary both inter- and intra-patient. In this paper, we use a deep CNN model for the seizure detection task on an open-access EEG epilepsy dataset collected at the Children's Hospital Boston. Our deep learning model extracts spectral and temporal features from EEG epilepsy data and uses them to learn the general structure of a seizure, which is less sensitive to variations. Our method produced an overall sensitivity of 90.00%, specificity of 91.65%, and accuracy of 98.05% on the whole dataset of 23 patients; hence, it can be used as an excellent cross-patient classifier. The results show that our model performs better than previous state-of-the-art models on the cross-patient seizure detection task. The proposed model can also visualize the spatial orientation of band-power features: we use correlation maps to relate spectral amplitude features to the output in the form of images. Using the results from our deep learning model, this visualization method can serve as an effective multimedia tool for producing quick and relevant brain-mapping images for further investigation by medical experts.

Deep Semantic Mapping for Heterogeneous Multimedia Transfer Learning Using Co-Occurrence Data

Transfer learning, which seeks a favorable representation for instances of different domains based on auxiliary data, can mitigate the divergence between domains through knowledge transfer. Recently, increasing efforts in transfer learning have employed deep neural networks (DNNs) to learn more robust, higher-level feature representations that better handle cross-media disparity. However, only a few papers consider the correlation and semantic matching between multi-layer heterogeneous domain networks. In this paper, we propose a deep semantic mapping model for heterogeneous multimedia transfer learning (DHTL) using co-occurrence data. More specifically, we integrate a DNN with canonical correlation analysis (CCA) to derive a deep correlation subspace as the joint semantic representation for associating data across different domains. The proposed DHTL constructs a multi-layer correlation-matching network across domains, in which CCA bridges each pair of domain-specific hidden layers. To train the network, we define a joint objective function and present the optimization processes. Once the deep semantic representation is obtained, the shared features of the source domain are transferred for task learning in the target domain. Extensive experiments on three multimedia recognition applications demonstrate that DHTL effectively finds deep semantic representations for heterogeneous domains and is superior to several existing state-of-the-art methods for deep transfer learning.

BTDP: Toward Sparse Fusion with Block Term Decomposition Pooling for Visual Question Answering

Bilinear models are very powerful in multimodal fusion tasks such as Visual Question Answering (VQA). The predominant bilinear methods can all be seen as tensor-based decomposition operations built around a key kernel called the core tensor. Current approaches usually focus on reducing the computational complexity by imposing a low-rank constraint on the core tensor. In this paper, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods but also conducts sparse bilinear interactions between modalities. Our method is based on the Block Term Decomposition theory of tensors, which yields a sparse, learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many "tiny" bilinear operations in different feature spaces. Introducing sparsity into the bilinear operation in this way significantly improves feature fusion and VQA models. What's more, BTDP is very flexible in design: we develop several variants of BTDP and discuss the effects of the diagonal blocks of the core tensor. Extensive experiments on the challenging VQA-v1 and VQA-v2 datasets show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.
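The equivalence between a block-diagonal core tensor and many "tiny" bilinear operations can be illustrated with a toy sketch (hypothetical shapes and names, not the authors' implementation):

```python
# Toy sketch of block-term-style fusion: a block-diagonal core tensor acts as
# independent "tiny" bilinear products applied to matching feature chunks.
import numpy as np

def block_bilinear(q, v, cores):
    """q, v: 1-D modality features split into len(cores) equal chunks;
    cores[i]: (chunk, chunk, out_i) core for the i-th tiny bilinear op."""
    k = len(cores)
    qs, vs = np.split(q, k), np.split(v, k)
    # Each block mixes only its own chunk pair, so cross-block interactions
    # are zero -- this is the sparsity of the block-diagonal core tensor.
    outs = [np.einsum('i,ijo,j->o', qi, c, vi)
            for qi, c, vi in zip(qs, cores, vs)]
    return np.concatenate(outs)
```

Each chunk pair interacts only through its own small core, so the fused output is a concatenation of k independent bilinear products instead of one dense, full-size bilinear interaction.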

Interactive Search or Sequential Browsing? A Detailed Analysis of the Video Browser Showdown 2018

This work summarizes the findings of the seventh iteration of the Video Browser Showdown (VBS) competition, organized as a workshop at the 24th International Conference on Multimedia Modeling in Bangkok. The competition focuses on video retrieval scenarios in which the searched scenes were either previously observed or described by another person (i.e., an example shot is not available). During the event, nine teams competed with their video retrieval tools in providing access to a shared video collection with 600 hours of video content. Evaluation objectives, rules, scoring, tasks, and all the participating tools are described in the paper. In addition, we provide some insights into how the different teams interacted with their video browsers, which was made possible by a novel interaction logging mechanism introduced for this iteration of VBS. The results collected at the Video Browser Showdown evaluation server confirm that searching for one particular scene in the collection within a limited time is still a challenging task for many of the approaches showcased during the event. Given only a short textual description, finding the correct scene is even harder. In ad-hoc search with multiple relevant scenes, the tools were mostly able to find at least one scene, while recall was the issue for many teams. The logs also reveal that, even though recent exciting advances in machine learning narrow the classical semantic gap problem, user-centric interfaces are still required to mediate access to specific content. Finally, open challenges and lessons learned are presented for future VBS events.

Convolutional Attention Networks for Scene Text Recognition

In this paper, we present convolutional attention networks (CAN) for unconstrained scene text recognition. Recent dominant approaches to scene text recognition are mainly based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), where the CNN encodes images and the RNN generates character sequences. Unlike these methods, our CAN is built entirely on CNNs combined with an attention mechanism. The distinctive characteristics of our method are: (1) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN. (2) The attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling. (3) Position embeddings are used in both the spatial encoder and the sequence decoder to give our networks a sense of location. We conduct experiments on standard scene text recognition datasets, including Street View Text, IIIT5K, and the ICDAR datasets. The experimental results validate the effectiveness of the different components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior works, even without the use of an RNN.
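As a rough sketch of spatial attention in a convolutional decoder (assumed shapes and a generic dot-product formulation; the paper's average-pooling variant may differ), a query can be formed by average-pooling the decoder's features and used to softmax-weight the encoder's 2-D feature map:

```python
# Rough sketch (assumed shapes, not the paper's exact method): average-pool
# the decoder features into a query, score every spatial position of the
# encoder map against it, and return the softmax-weighted context vector.
import numpy as np

def spatial_attention(enc_map, dec_feats):
    """enc_map: (H*W, d) encoder features; dec_feats: (T, d) decoder features."""
    query = dec_feats.mean(axis=0)          # average pooling over decoder steps
    scores = enc_map @ query                # similarity per spatial position
    scores -= scores.max()                  # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ enc_map                # attention-weighted context, shape (d,)
```

The softmax weights form a distribution over the H*W spatial positions, so the returned context vector is a convex combination of encoder features focused on the positions most similar to the pooled query.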

Orchestrating Caching, Transcoding and Request Routing for Adaptive Video Streaming Over ICN

Information-centric networking (ICN) has been touted as a revolutionary solution for the future Internet, which will be dominated by video traffic. This work investigates the challenge of distributing video content at adaptive bit rates (ABR) over ICN. In particular, we utilize the in-network caching capability of ICN routers to serve users; in addition, with the help of named functions, we enable ICN routers to transcode videos to lower-bitrate versions to improve the cache hit ratio. Mathematically, we formulate this design challenge as a constrained optimization problem that aims to maximize the cache hit ratio for service providers and minimize the service delay for end users. We design a two-step iterative algorithm to find the optimum: first, given a content management scheme, we minimize the service delay by optimally configuring the routing scheme; second, we maximize the cache hits for a given routing policy. Finally, we rigorously prove its convergence. Through extensive simulations, we verify the convergence and the performance gains over other algorithms. We also find that more resources should be allocated to ICN routers with higher request rates, and that the routing scheme favors the shortest path when scheduling more traffic.

Image Captioning with Visual-Semantic Double Attention

In this paper, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. VSDA consists of two parts: a modified visual attention model extracts sub-region image features, and a new SEmantic Attention model (SEA) distills semantic features. Traditional attribute-based models neglect the distinctive importance of each attribute word and feed all of them into recurrent neural networks, resulting in abundant irrelevant semantic features. In contrast, at each time step our model selects the word most relevant to the current context. That is, the real power of VSDA lies in not only leveraging semantic features but also eliminating the influence of irrelevant attribute words, making the semantic guidance more precise. Furthermore, our approach addresses the problem that visual attention models cannot help generate non-visual words. Since visual and semantic features are complementary, our model leverages both to strengthen the generation of visual and non-visual words. Extensive experiments on the MS COCO dataset show that VSDA outperforms other methods and achieves promising performance.

Expression Robust 3D Facial Landmarking via Progressive Coarse-to-Fine Tuning

Facial landmarking is a fundamental task in automatic machine-based face analysis. The majority of existing techniques for this problem are based on 2D images; however, they suffer from illumination and pose variations that may largely degrade landmarking performance. The emergence of 3D data theoretically provides an alternative that overcomes these weaknesses of the 2D domain. This paper proposes a novel approach to 3D facial landmarking that combines the advantages of feature-based and model-based methods in a progressive coarse-to-fine manner (initial, intermediate, and fine stages). In the initial stage, a few fiducial landmarks (i.e., the nose tip and two inner eye corners) are robustly detected through curvature analysis, and these points are further exploited to initialize the subsequent stage. In the intermediate stage, a statistical model, namely the Active Normal Model (ANM), is learned in the feature space of the three normal components of the facial point cloud rather than the smooth original coordinates. In the fine stage, cascade regression is employed to locally refine the landmarks according to their geometric attributes. The proposed approach can accurately localize dozens of fiducial points on each 3D face scan, greatly surpassing feature-based methods, and it also improves on the state of the art of model-based methods in two respects: sensitivity to initialization and deficiency in discrimination. The proposed method is evaluated on the BU-3DFE and Bosphorus databases, and competitive results are achieved in comparison with those in the literature, clearly demonstrating its effectiveness.
