ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Latest Articles

Efficient QoE-Aware Scheme for Video Quality Switching Operations in Dynamic Adaptive Streaming

Dynamic Adaptive Streaming over HTTP (DASH) is a popular over-the-top video content distribution...

HTTP/2-based Frame Discarding for Low-Latency Adaptive Video Streaming

In this article, we propose video delivery schemes ensuring around 1s delivery latency with Dynamic Adaptive Streaming over HTTP (DASH), which is a...

Symmetrical Residual Connections for Single Image Super-Resolution

Single-image super-resolution (SISR) methods based on convolutional neural networks (CNN) have shown great potential in the literature. However, most...

Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint...

Expression Robust 3D Facial Landmarking via Progressive Coarse-to-Fine Tuning

Facial landmarking is a fundamental task in automatic machine-based face analysis. The majority of existing techniques for such a problem are based on...

CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown their strong ability to...

Reconstructing 3D Face Models by Incremental Aggregation and Refinement of Depth Frames

Face recognition from two-dimensional (2D) still images and videos is quite successful even with “in the wild” conditions. Instead,...

Orchestrating Caching, Transcoding and Request Routing for Adaptive Video Streaming Over ICN

Information-centric networking (ICN) has been touted as a revolutionary solution for the future of the Internet, which will be dominated by video...

Discovering Latent Topics by Gaussian Latent Dirichlet Allocation and Spectral Clustering

Today, diversifying the retrieval results of a certain query will improve customers’ search efficiency. Showing the multiple aspects of...

Image Captioning With Visual-Semantic Double Attention

In this article, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. In our approach, VSDA consists of two parts: a...

Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Performing direct matching among different modalities (like image and text) can benefit many tasks in computer vision, multimedia, information...


[December 2018]


Special issue call: "Multimodal Machine Learning for Human Behavior Analysis". Call for papers. Submission deadline April 15th, 2019

Special issue call: "Computational Intelligence for Biomedical Data and Imaging". Call for papers. Submission deadline May 30th, 2019

Special issue call: "Smart Communications and Networking for Future Video Surveillance". Call for papers. Submission deadline June 30th, 2019

Special issue call: "Trusted Multimedia for Smart Digital Environments". Call for papers. Submission deadline September 20th, 2019


Forthcoming Articles

Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Visual Question Answering (VQA) aims to provide a natural language answer given an image or video and a natural language question. Despite recent progress on VQA, existing works primarily focus on image question answering and are suboptimal for video question answering. This article presents a novel Spatiotemporal-Textual Co-Attention Network (STCA-Net) for video question answering. The STCA-Net jointly learns spatial and temporal visual attention on videos as well as textual attention on questions. It concentrates on the essential cues in both visual and textual spaces for answering questions, leading to an effective question-video representation. In particular, a question-guided attention network is designed to learn a question-aware video representation with a spatial-temporal attention module. It concentrates the network on regions of interest within the frames of interest across the entire video. A video-guided attention network is proposed to learn a video-aware question representation with a textual attention module, leading to a fine-grained understanding of the question. The learned video and question representations are used by an answer predictor to generate accurate answers. Extensive experiments on two challenging video question answering datasets, i.e., MSVD-QA and MSRVTT-QA, have shown the effectiveness of the proposed approach.

Advanced Stereo Seam Carving by Considering Occlusions on Both Sides

Stereo image retargeting plays a significant role in the field of image processing. It aims to keep major objects as prominent as possible when the resolution of an image is changed, while maintaining disparity and depth information. Many researchers in relevant fields have proposed seam carving methods that generally preserve the geometric consistency of the images. However, they did not take into account the regions of occlusion on both sides. We propose a solution to this problem using a new seam-finding strategy that considers occluded and occluding regions in both input images and leaves geometric consistency in both images intact. We also introduce line segment detection and superpixel segmentation to further improve the quality of the images. Imaging effects are optimized in the process, and visual comfort, which is also influenced by other factors, can be boosted as well.

Spatial Structure Preserving Feature Pyramid Network for Semantic Image Segmentation

Recently, progress on semantic image segmentation has been substantial, benefitting from the rapid development of Convolutional Neural Networks (CNNs). Recently proposed semantic image segmentation approaches have mostly been based on Fully Convolutional Networks (FCNs). However, these FCN-based methods use large receptive fields and too many pooling layers to depict the discriminative semantic information of the images. These operations often cause low spatial resolution inside deep layers, which leads to spatially fragmented prediction. To address this problem, we exploit the inherent multi-scale and pyramidal hierarchy of deep convolutional networks to extract feature maps with different resolutions, and take full advantage of these feature maps through gradual stacked fusion. Specifically, for two adjacent convolutional layers, we upsample the features from the deeper layer with a stride of 2, and then stack them on the features from the shallower layer. A convolutional layer with 1 × 1 kernels then fuses these stacked features. The fused feature retains the spatial structure information of the image while offering strong discriminative capability for pixel classification. Additionally, to further preserve the spatial structure information and regional connectivity of the predicted category label map, we propose a novel loss term for the network. In detail, two graph-model-based spatial affinity matrices are proposed, which depict the pixel-level relationships in the input image and the predicted category label map, respectively; their cosine distance is then backpropagated through the network. The proposed architecture, called spatial structure preserving feature pyramid network (SSPFPN), significantly improves the spatial resolution of the predicted category label map for semantic image segmentation.

Low-complexity Scalable Extension of the High-Efficiency Video Coding (SHVC) Encoding Systems

The Scalable extension of High Efficiency Video Coding (SHVC) adopts a hierarchical quadtree-based coding unit (CU) to suit the various texture and motion properties of videos. Currently, the test model of SHVC selects the best CU size by performing an exhaustive quadtree depth level search, which achieves high compression efficiency at a heavy cost in computational complexity. When the motion/texture properties of coding regions can be identified early, a fast algorithm can be designed to adapt CU depth level decision procedures to video content and avoid unnecessary computation in traversing the CU depth levels. In this paper, we propose a fast CU quadtree depth level decision algorithm for inter frames on enhancement layers based on the analysis of inter-layer, spatial and temporal correlations. The proposed algorithm first determines the motion activity level at the treeblock size of the hierarchical quadtree, utilizing motion vectors from its corresponding blocks at the base layer. Based on the motion activity level, some neighboring encoded CUs with larger correlations are preferentially chosen to predict the optimal depth level of the current treeblock. Finally, two parameters, the motion activity level and the predicted CU depth level, are used to determine a subset of candidate CU depth levels and adaptively optimize CU depth level decision processes. Experimental results show that the proposed approach can reduce the entire coding time of enhancement layers by 70% on average, with almost no loss of compression efficiency. It is efficient for all types of scalable video sequences under different coding conditions and outperforms state-of-the-art SHVC and HEVC fast algorithms.

Learning Discriminative Sentiment Representation from Strongly- and Weakly-Supervised CNNs

Visual sentiment analysis is getting increasing attention with the rapidly growing number of images uploaded to social websites. Learning rich visual representations often requires training deep Convolutional Neural Networks (CNNs) on massive manually labeled data, which is expensive or scarce, especially for a subjective task like visual sentiment analysis. Meanwhile, a large quantity of social images, readily available yet noisy, can be easily collected by querying social networks using the sentiment categories as keywords, yielding various types of images related to each specific sentiment. In this paper, we propose a multiple kernel network (MKN) for visual sentiment recognition, which learns representations from strongly- and weakly-supervised CNNs. Specifically, the weakly-supervised deep model is trained using large-scale data from social images, while the strongly-supervised deep model is fine-tuned on affective datasets that are manually labeled. We employ the multiple kernel scheme on multiple layers of these CNNs, which can automatically select the discriminative representation by learning a linear combination of a set of predefined kernels. In addition, we introduce a large-scale dataset collected from popular comics of various countries, e.g., America, Japan, China and France, which consists of 11,821 images with various artistic styles. Experimental results show that MKN achieves consistent improvements over the state-of-the-art methods on the public affective datasets as well as the newly established comics dataset.

Show, Reward and Tell: Adversarial Visual Story Generation

Despite the promising progress made in visual captioning and paragraphing, visual storytelling is still largely unexplored. This task is more challenging due to the difficulty in modeling an ordered photo sequence and in generating a relevant paragraph with expressive language style for storytelling. To deal with these challenges, we propose an Attribute-based Hierarchical Generative model with Reinforcement Learning and adversarial training (AHGRL). First, to model the ordered photo sequence and the complex story structure, we propose an attribute-based hierarchical generator. The generator incorporates semantic attributes to create more accurate and relevant descriptions. The hierarchical framework enables the generator to learn from the complex paragraph structure. Second, to generate story-style paragraphs, we design a language-style discriminator, which provides word-level rewards to optimize the generator by policy gradient. Third, we further consider the story generator and the reward critic as adversaries. The generator aims to create paragraphs indistinguishable from human-level stories, whereas the critic aims at distinguishing them and further improving the generator. Extensive experiments on a widely used dataset demonstrate the advantages of the proposed method over state-of-the-art methods.

Using Eye Tracking and Heart Rate Activity to Examine Crossmodal Correspondences QoE in Mulsemedia

Different senses provide us with information of various levels of precision and enable us to construct a more precise representation of the world. Rich multisensory simulations are thus beneficial for comprehension, memory reinforcement or retention of information. Crossmodal mappings refer to the systematic associations often made between different sensory modalities (e.g., high pitch is matched with angular shapes) and govern multisensory processing. A great deal of research effort has been put into exploring crossmodal correspondences in the field of cognitive science. However, the possibilities they open in the digital world have been relatively unexplored. Mulsemedia - multiple sensorial media - provides a highly immersive experience to the users and enhances their Quality of Experience (QoE) in the digital world. Thus, we consider that studying the plasticity and the effects of crossmodal correspondences in a mulsemedia setup can bring interesting insights about improving the human-computer dialogue and experience. In our experiments, we exposed users to videos with certain visual dimensions (brightness, color and shape) and we investigated whether the pairing with a crossmodal matching sound (high and low pitch) and the corresponding auto-generated haptic effects leads to an enhanced QoE. For this, we captured the eye gaze and the heart rate of users while they experienced mulsemedia, and we asked them to fill in a set of questions targeting their enjoyment and perception at the end of the experiment. Results showed differences in eye gaze patterns and heart rate between the experimental and the control group, indicating changes in participants' engagement when videos were accompanied by matching crossmodal sounds (this effect was the strongest for the video displaying angular shapes and high pitch audio) and transitively generated crossmodal haptic effects.

Semantic Concept Network and Deep Walk Based Visual Question Answering

Visual Question Answering (VQA) is a research hot-spot in computer vision and natural language processing, and also a fusion of the two research areas in high-level applications. For VQA, this work describes a novel model based on semantic concept network construction and deep walk. Extracting a semantic representation of the visual image is a significant and effective method for bridging the semantic gap. Moreover, current research shows that co-occurrence patterns of concepts can enhance semantic representation. This work is motivated by the challenge that semantic concepts have complex interrelations and that the relationships resemble a network. Therefore, we construct a semantic concept network using a complex network modeling approach called Word Activation Forces (WAFs), and mine the co-occurrence patterns of semantic concepts using deep walk. Then the model performs polynomial logistic regression on the basis of the extracted deep walk vector, jointly embedded with the visual image feature and the question feature. The proposed model effectively integrates visual and semantic features of the image and the natural language question. The experimental results show that our algorithm outperforms challenging baselines on three benchmark image QA datasets. Furthermore, through experiments in image annotation refinement and semantic analysis on the pre-labeled LabelMe dataset, we test and verify the effectiveness of our constructed concept network for mining concept co-occurrence patterns, sensible concept clusters and hierarchies.

A2CMHNE: Attention-Aware Collaborative Multimodal Heterogeneous Network Embedding

Network embedding for distributed node representation learning plays an important role in network analysis, due to its effectiveness in a variety of applications. However, most existing network embedding models focus on homogeneous networks and neglect diverse properties such as different types of network structures and associated multimedia content information. In this paper, we learn node representations for multimodal heterogeneous networks which contain multiple types of nodes and/or links as well as multimodal content such as texts and images. We propose a novel attention-aware collaborative multimodal heterogeneous network embedding method (A2CMHNE), where an attention-based collaborative representation learning approach is proposed to promote the collaboration of structure-based embedding and content-based embedding, and generate a robust node representation by introducing an attention mechanism that enables informative embedding integration. In experiments, we compare our model with existing network embedding models on two real-world datasets. For node classification, our method improves performance by 5% and 9% compared with five state-of-the-art embedding methods on one benchmark (M10 Dataset) and on a multi-modal heterogeneous network dataset (WeChat dataset), respectively. Experimental results show the effectiveness of our proposed method on both node classification and link prediction tasks.

Image Captioning by Asking Questions

Image captioning and visual question answering are typical tasks which connect computer vision and natural language processing. Both of them need to effectively represent the visual content using computer vision methods and smoothly process the text sentence using natural language processing skills. The key problem of these two tasks is to infer the target result based on the interactive understanding of the word sequence and the image. Though they practically use similar algorithms, they have been studied independently in the past few years. In this paper, we attempt to exploit the mutual correlation between these two tasks. We propose the first VQA-improved image captioning method which transfers the knowledge learned from the VQA corpora to the image captioning task. VQA models are first pretrained on image-question-answer instances. To effectively extract semantic features with the VQA model, we build a large set of free-form open-ended questions. Then the pretrained VQA model is used to extract VQA-grounded semantic representations which interpret images from a different perspective. We incorporate the VQA model into the image captioning model by fusing the VQA-grounded feature and the attended visual feature. We show that such simple VQA-improved image captioning (VQA-IIC) models significantly outperform the conventional image captioning methods on large-scale public datasets.

Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Paillier Cryptosystem based Mean Value Computation for Encrypted Domain Image Processing Operations

Due to its large storage facility and high-end computing capability, cloud computing has received great attention, as a huge amount of personal multimedia data and computationally expensive tasks can be outsourced to the cloud. However, the cloud, being a semi-trusted third party, is prone to privacy risks. Signal processing in the encrypted domain (SPED) has emerged as a new research paradigm for privacy-preserving processing of outsourced data by a semi-trusted cloud. In this paper, we propose a solution for non-integer mean value computation in the homomorphic encrypted domain without any interactive protocol between the client and the service provider. Using the proposed solution, various image processing operations such as local smoothing filtering, unsharp masking and histogram equalization can be performed in the encrypted domain at the cloud server without any privacy concerns. Our experimental results on standard test images reveal that these image processing operations can be performed without pre-processing, without a client-server interactive protocol and without any error between the encrypted domain and the plain domain.
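
As a rough illustration of the additive homomorphism that Paillier-based schemes build on, the sketch below sums encrypted pixel values on the "server" side and leaves the final division to the key holder. It is not the article's non-integer mean scheme: the toy primes, the function names (keygen, encrypt, decrypt, encrypted_sum) and the client-side division are all assumptions made for illustration.

```python
import math
import random

def keygen(p=1009, q=1013):
    # Toy primes for illustration only; real keys use primes of 1024+ bits.
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid shortcut because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    # c = g^m * r^n mod n^2, with g = n + 1 and random r coprime to n.
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    # m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n.
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def encrypted_sum(n, ciphertexts):
    # Multiplying ciphertexts adds the underlying plaintexts (mod n).
    acc = 1
    for c in ciphertexts:
        acc = (acc * c) % (n * n)
    return acc

# Hypothetical 2x2 pixel window; the server only ever sees ciphertexts.
window = [10, 20, 30, 41]
n, priv = keygen()
enc = [encrypt(n, v) for v in window]
total = decrypt(n, priv, encrypted_sum(n, enc))
mean = total / len(window)  # division done in the clear by the key holder
```

In this simplified version the server can only return an encrypted sum; computing the non-integer mean entirely in the encrypted domain is the harder problem the article addresses.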

How Deep Features Have Improved Event Recognition in Multimedia: a Survey

Event recognition is one of the areas in multimedia that is attracting great attention from researchers. Being applicable in a wide range of applications, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. On the other hand, following its immense success in classification, object recognition and detection, deep learning has been shown to perform well in event recognition tasks too. Thus, a large portion of the literature on event analysis nowadays relies on deep learning architectures. In this paper, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia content can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. In this paper we extensively review different deep learning-based frameworks for event recognition in these four domains. Furthermore, we also review some benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we also provide a detailed discussion on basic insights gathered from the literature review, and identify future trends and challenges.

Harvesting Visual Objects from Internet Images via Deep Learning Based Objectness Assessment

The collection of internet images has been growing at an astonishing speed. Undoubtedly, these images contain rich visual information that can be useful in many applications, such as visual media creation and data-driven image synthesis. In this paper, we focus on the methodologies for building a visual object database from a collection of internet images. Such a database is built to contain a large number of high-quality visual objects that can help with various data-driven image applications. Our method is based on dense proposal generation and objectness-oriented re-ranking. A novel deep convolutional neural network is designed for the inference of proposal objectness, the probability of a proposal containing an optimally located foreground object. In our work, the objectness is quantitatively measured with regard to completeness and fullness, reflecting two complementary features of an optimal proposal: a complete foreground and a relatively small background. Our experiments indicate that object proposals re-ranked according to the output of our network generally achieve higher performance than those produced by other state-of-the-art methods. As a concrete example, a database of over 1.2 million visual objects has been built using the proposed method, and has been successfully used in various data-driven image applications.

Artificial Intelligence, Artists, and Art: Attitudes Toward Artwork Produced by Humans vs. Artificial Intelligence

This study examines how people perceive artwork created by artificial intelligence (AI) and how knowledge of the artist's identity (Human vs. AI) affects individuals' evaluation of art. Drawing on Schema theory and the theory of Computers Are Social Actors (CASA), this study used a survey-experiment that controlled for the identity of the artist (AI vs. Human) and presented participants with two types of artworks (AI-created vs. Human-created). After seeing images of six artworks created by either AI or human artists, participants (n=288) were asked to evaluate the artistic value using a validated scale commonly employed among art professionals. The study found that human-created artworks and AI-created artworks were not judged to be equivalent in their artistic value. Additionally, knowing that a piece of art was created by AI did not in general influence participants' evaluation of art pieces' artistic value. However, having a schema that AI cannot make art significantly influenced evaluation. Implications of the findings for application and theory are discussed.

Visual Arts Search on Mobile Devices

Visual arts, especially paintings, appear everywhere in our daily lives. They are liked not only by art lovers but also by ordinary people, who are usually curious about the stories behind these art pieces and interested in exploring related art pieces. Among various methods, mobile visual search has its merit in providing an alternative solution where text and voice searches are not always applicable. Mobile visual search for visual arts is far more challenging than general image visual search. Conventionally, visual search, such as searching for products and plants, focuses on locating images with similar objects. Hence, approaches are designed to locate the objects and extract scale-invariant features from distorted images captured by a mobile camera. However, the objects are only part of a visual artwork; the background and the painting style are also important factors that are not considered in the conventional approaches. An empirical study is conducted to examine issues with photos taken by a mobile camera, such as orientation variance and motion blur, and how they influence the results of visual arts search. A photo rectification pipeline is designed to rectify the photos into ideal ones for feature extraction. A new method is proposed to learn highly discriminative features for visual arts, which considers both the content information and the style information in visual arts. Apart from conducting solid experiments, a real-world system is built to prove the effectiveness of the proposed methods. To the best of our knowledge, this is the first paper to solve problems for visual arts search on mobile devices.

QoE for Mobile Clients with Segment Aware Rate Adaptation Algorithm (SARA) for DASH Video Streaming

Dynamic adaptive streaming over HTTP (DASH) is widely used for video streaming on mobile devices. Ensuring a good quality of experience (QoE) for mobile video streaming is essential as it severely impacts both the network and content providers' revenue. Thus, a good rate adaptation algorithm at the client end that provides high QoE is critically important. Recently, the segment size-aware rate adaptation (SARA) algorithm was proposed for DASH clients. However, its performance on mobile clients has not been investigated so far. The scope of this paper is twofold: 1) we discuss SARA's implementation for mobile clients to improve the QoE in mobile video streaming, one that accurately predicts the download time for the next segment and makes an informed bitrate selection; 2) we develop a new parametric QoE model to compute a cumulative score that helps in the fair comparison of different adaptation algorithms. Based on our subjective and objective evaluation, we observed that SARA for mobile clients outperforms others by 17% on average in terms of the Mean Opinion Score, while achieving, on average, a 76% improvement in terms of the interruption ratio. The score obtained from our new parametric QoE model also demonstrates that the SARA algorithm for mobile clients gives the best QoE among all the algorithms.
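
The idea of predicting the next segment's download time from its size and then choosing a bitrate can be sketched as follows. This is a minimal illustration, not the SARA algorithm itself: the harmonic-mean throughput estimate, the safety margin, the function names and the sample numbers are all assumptions.

```python
def harmonic_mean_throughput(samples_bps):
    # Harmonic mean dampens the effect of short throughput spikes.
    return len(samples_bps) / sum(1.0 / s for s in samples_bps)

def pick_bitrate(next_segment_bits, samples_bps, buffer_s, safety_s=2.0):
    """Pick the highest bitrate whose next segment should arrive in time.

    next_segment_bits maps bitrate (kbps) -> size in bits of the NEXT segment
    at that bitrate, so the decision uses actual segment sizes rather than
    nominal bitrates. Falls back to the lowest bitrate if nothing fits.
    """
    rate = harmonic_mean_throughput(samples_bps)
    best = min(next_segment_bits)
    for bitrate in sorted(next_segment_bits):
        predicted_download_s = next_segment_bits[bitrate] / rate
        # Keep a safety margin of buffered video to avoid playback stalls.
        if predicted_download_s <= buffer_s - safety_s:
            best = bitrate
    return best

# Hypothetical sizes of the next segment at three representation levels.
sizes = {500: 1_000_000, 1000: 2_200_000, 2000: 4_500_000}
history = [1_000_000, 2_000_000]  # measured throughput samples, bits/s
choice = pick_bitrate(sizes, history, buffer_s=5.0)
```

With 5 s of buffered video and a 2 s margin, only segments predicted to download within 3 s qualify, so the 1000 kbps representation is selected here.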

A Pseudo-likelihood Approach For Geo-localization of Events From Crowd-sourced Sensor-Metadata

Events such as a live concert, a protest march, or an exhibition are often video recorded by many people at the same time, typically using smartphone devices. In this work, we address the problem of geo-localizing such events from crowd-generated data. Traditional approaches for solving such a problem using multiple video sequences of the event would require highly complex co

Alone vs in-a-group: A Multi-modal Framework for Automatic Affect Recognition

Recognition and analysis of human affect has been researched extensively in neuroscience, psychology, cognitive sciences and computer sciences in the last two decades. However, most of the past research in automatic analysis of human affect has focused on the recognition of affect displayed by people in individual settings, and little attention has been paid to the analysis of the affect expressed in group settings. In this paper, we first analyse the affect expressed by individuals along arousal and valence dimensions in both individual and group videos and then propose methods to recognize the contextual information, i.e., (1) group membership of each individual and (2) whether a person is alone or in-a-group, by using their face and body behavioural cues. For affect analysis, we first propose affect recognition models separately for individual and group videos and then introduce a cross-condition affect recognition model that is trained by combining the two different types of data. We conduct a set of experiments on two newly collected datasets including both individual and group videos. Our experiments show that (1) the best recognition results are obtained using the proposed vQLZM-FV, which outperforms other unimodal features; (2) decision-level fusion helps improve the affect recognition, indicating that body behaviours carry emotional information that is complementary rather than redundant to the emotion content in facial behaviours; (3) it is possible to predict the context - whether a person is alone or in-a-group - using non-verbal behaviours, indicating that people behave distinctly in individual and group settings; and (4) group membership can be recognized using the non-verbal face and body features, indicating that group members influence one another's behaviours within a group setting.

Resilient Color Image Watermarking Using Accurate Quaternion Radial Substituted Chebyshev Moments

In this work, a new quaternion-based method for color image watermarking is proposed. In this method, a novel set of quaternion radial substituted Chebyshev moments (QRSCMs) is presented for robust geometrically invariant image watermarking. An efficient computational method is proposed for highly accurate, fast and numerically stable QRSCMs in polar coordinates. The proposed watermarking method consists of three stages. In the first stage, the Arnold transform is used to improve the security of the watermarking scheme by scrambling the binary watermark. In the second stage, the proposed accurate and stable QRSCMs of the host color image are computed. In the third stage, the encrypted binary watermark is embedded into the host image by applying a quantization technique to selected QRSCM magnitudes, and the watermarked color image is obtained by adding the compensation image to the original host color image. Then, the binary watermark can be extracted directly from the magnitudes of the QRSCMs without using the original image. Numerical experiments are performed in which the performance of the proposed method is compared with existing quaternion moment-based watermarking methods. The comparison clearly shows that the proposed method is very efficient in terms of visual imperceptibility and robustness under different attacks compared to the existing quaternion moment-based watermarking algorithms.
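
The first stage, scrambling a binary watermark with the Arnold transform, can be sketched as below. This uses the common form of the Arnold cat map, (x, y) -> (x + y, x + 2y) mod N, on a square bit matrix; the exact variant, the iteration count used as a key and the function names are assumptions, not details taken from the article.

```python
def arnold(bits, iterations=1):
    # Scramble a square bit matrix by moving each pixel (x, y)
    # to ((x + y) mod N, (x + 2y) mod N), repeated `iterations` times.
    n = len(bits)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(x + y) % n][(x + 2 * y) % n] = bits[x][y]
        bits = out
    return bits

def arnold_inverse(bits, iterations=1):
    # Undo the scrambling by reading each pixel back from its mapped position.
    n = len(bits)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[x][y] = bits[(x + y) % n][(x + 2 * y) % n]
        bits = out
    return bits

# Hypothetical 4x4 binary watermark; the iteration count acts as a simple key.
watermark = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
scrambled = arnold(watermark, iterations=3)
recovered = arnold_inverse(scrambled, iterations=3)
```

Because the map's matrix has determinant 1, it is a permutation of pixel positions, so the scrambled watermark keeps the same number of ones and can be recovered exactly by anyone who knows the iteration count.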

Subtitle Region Selection of S3D Images in Consideration of Visual Discomfort and Viewing Habit

Statistical Early Termination and Early Skip Models for Fast Mode Decision in HEVC INTRA Coding

In this paper, statistical Early Termination (ET) and Early Skip (ES) models are proposed for fast Coding Unit (CU) and prediction mode decision in HEVC INTRA coding, in which three categories of ET and ES sub-algorithms are included. Firstly, the CU ranges of the current CU are recursively predicted based on the texture and CU depth of the spatially neighboring CUs. Secondly, statistical model based ET and ES schemes are proposed and applied to joint CU and intra prediction mode decision, in which the coding complexities over different decision layers are jointly minimized subject to acceptable rate-distortion degradation. Thirdly, the mode correlations among the intra prediction modes are exploited to terminate early the full rate-distortion optimization in each CU decision layer. Extensive experiments are performed to evaluate the coding performance of each sub-algorithm and the overall algorithm. Experimental results reveal that the overall proposed algorithm achieves a complexity reduction of 45.47% to 74.77%, and 58.09% on average, while the overall Bjøntegaard delta bit rate increase and Bjøntegaard delta peak signal-to-noise ratio degradation are 2.29% and -0.11 dB, respectively, which are negligible.

A Multi-Sensor Framework for Personal Presentation Analytics

Presentation has been an effective method for delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is evaluated through painstaking manual analysis by experts. Although expert feedback is effective in helping users improve their presentation skills, manual evaluation suffers from high cost and is often not available to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, which is designed based on a newly proposed assessment rubric. We present our analytics model with conventional ambient sensors (i.e., static cameras and a Kinect sensor) and emerging wearable egocentric sensors (i.e., Google Glass). In addition, we perform a cross-correlation analysis of speakers' vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely the NUS Multi-Sensor Presentation (NUSMSP) dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we have conducted a series of user studies with the speakers and an interview with an English communication expert, which reveal positive and promising feedback.

Appearance-consistent video object segmentation based on a multinomial event model

In this study, we propose an effective and efficient algorithm for unconstrained video object segmentation, which is achieved in a Markov random field (MRF). In the MRF graph, each node is modeled as a superpixel and labeled as either foreground or background during the segmentation process. The unary potential is computed for each node by learning a transductive SVM classifier under supervision by a few labeled frames. The pairwise potential is used for the spatial-temporal smoothness. In addition, a high-order potential based on the multinomial event model is employed to enhance the appearance consistency throughout the frames. To minimize this intractable feature, we also introduce a more efficient technique that simply extends the original MRF structure. The proposed approach was evaluated in experiments with different measures and the results based on a benchmark demonstrated its effectiveness compared with other state-of-the-art algorithms.

Video Question Answering via Knowledge-Based Progressive Spatial-Temporal Attention Network

Visual Question Answering (VQA) is a challenging task which has gained increasing attention from both the computer vision and the natural language processing communities in recent years. Given a question in natural language, a VQA system is designed to automatically generate the answer according to the referenced visual content. Despite this recent interest, existing work on visual question answering mainly focuses on a single static image, which is only a small part of the dynamic and sequential visual data in the real world. Video question answering (VideoQA), a natural extension, is less explored. Because of the inherent temporal structure of video, ImageQA approaches may not be effective when applied to video question answering. In this paper, we not only take the spatial and temporal dimensions of video content into account, but also employ an external knowledge base to improve the answering ability of the network. More specifically, we propose a knowledge-based progressive spatial-temporal attention network (K-PSTANet) to tackle this problem. We obtain both object and region features of the video frames from a region proposal network. The knowledge representation is generated by a word-level attention mechanism using the comment information of each object, which is extracted from DBpedia. Then, we develop the question-knowledge guided progressive spatial-temporal attention network to learn the joint video representation for the video question answering task. We construct a large-scale video question answering dataset. Extensive experiments validate the effectiveness of our method.

From Selective Deep Convolutional Features to Compact Binary Representations for Image Retrieval

In the large-scale image retrieval task, the two most important requirements are the discriminability of image representations and the efficiency in computation and storage of representations. Regarding the former requirement, Convolutional Neural Network (CNN) is proven to be a very powerful tool to extract highly discriminative local descriptors for effective image search. Additionally, in order to further improve the discriminative power of the descriptors, recent works adopt fine-tuned strategies. In this paper, taking a different approach, we propose a novel, computationally efficient, and competitive framework. Specifically, we firstly propose various strategies to compute masks, namely SIFT-mask, SUM-mask, and MAX-mask, to select a representative subset of local convolutional features and eliminate redundant features. Our in-depth analyses demonstrate that proposed masking schemes are effective to address the burstiness drawback and improve retrieval accuracy. Secondly, we propose to employ recent embedding and aggregating methods which can significantly boost the feature discriminability. Regarding the computation and storage efficiency, we include a hashing module to produce very compact binary image representations. Extensive experiments on six image retrieval benchmarks demonstrate that our proposed framework achieves the state-of-the-art retrieval performances.

Interpretable Partitioned Embedding for Intelligent Multi-item Fashion Outfit Composition

Intelligent fashion outfit composition has become increasingly popular in recent years. Some deep learning based approaches have recently shown competitive composition results. However, their uninterpretable nature means that such approaches cannot meet the designers', businesses', and consumers' urge to comprehend the importance of different attributes in an outfit composition. To realize interpretable and customized multi-item fashion outfit compositions, we propose a partitioned embedding network to learn interpretable embeddings from clothing items. The network contains two vital components: an attribute partition module and a partition adversarial module. In the attribute partition module, multiple attribute labels are adopted to ensure that different parts of the overall embedding correspond to different attributes. In the partition adversarial module, adversarial operations are adopted to achieve the independence of different parts. With the interpretable and partitioned embedding, we then construct an outfit composition graph and an attribute matching map. Extensive experiments demonstrate that 1) the partitioned embedding has unmingled parts which correspond to different attributes and 2) outfits recommended by our model are more desirable in comparison with the existing methods.

Design, large-scale usage testing, and important metrics for augmented reality gaming applications

Augmented Reality (AR) offers the possibility of enriching the real world with digitally mediated content, thereby increasing the quality of many everyday experiences. Whilst in some research areas such as cultural heritage, tourism or medicine there is strong technological investment, AR for gaming purposes struggles to become a widespread commercial application. In this paper, a novel framework for AR kids gaming has been developed, together with general guidelines and long-life usage tests and metrics. The proposed application is designed for augmenting the puzzle experience. Once the user has assembled the real puzzle, AR functionality within the mobile application can be unlocked, bringing puzzle characters to life and creating a seamless game that merges AR interactions with the puzzle reality. The main goals and benefits of this framework can be seen in the development of the novel set of AR tests and metrics in the pre-release phase (in order to help the commercial launch and developers), and in the release phase by introducing the measures for long-life app optimization, usage tests and hints on final users, together with measures to design policy, providing a method for automatic testing of quality and popularity improvements. Moreover, smart configuration tools enabling multi-app and eventually also multi-user development have been proposed, facilitating the serialization of the applications. Results were obtained from a large-scale user test with about 4 million users on a family of 8 gaming applications, providing the scientific community with a workflow for implicit quantitative analysis in AR gaming. They also indicate that the proposed approach is affordable and reliable for long-life testing and optimization.

Watch Me from Distance (WMD): A Privacy-Preserving Long-Distance Video Surveillance System

Preserving the privacy of people in video surveillance systems is quite challenging, and a significant amount of research has been done to solve this problem in recent times. The majority of existing techniques are based on detecting bodily cues such as the face and/or silhouette and obscuring them so that people in the videos cannot be identified. We observe that merely hiding bodily cues is not enough to protect the identities of the individuals in the videos. An adversary who has prior contextual knowledge about the surveilled area can identify people in the video by exploiting implicit inference channels such as behavior, place and time. This paper presents an anonymous surveillance system, called "Watch Me from Distance" (WMD), which advocates outsourcing surveillance video monitoring (similar to call centers) to long-distance sites where professional security operators watch the video and alert the local site when any suspicious or abnormal event takes place. We find that long-distance monitoring helps decouple the security operators from the contextual knowledge. Since security operators at the remote site could turn into adversaries, a trust computation model to determine the credibility of the operators is presented as an integral part of the proposed system. The feasibility study and experiments suggest that the proposed system provides more robust measures of privacy while maintaining surveillance effectiveness.

Detecting Online Counterfeit-goods Seller using Connection Discovery

With the advancement of social media and mobile technology, any smartphone user can easily become a seller on social media and e-commerce platforms, such as Instagram and Carousell in Hong Kong, or Taobao in China. A seller shows images of their products and annotates the images with suitable tags that can be searched easily by others. Those images could be taken by the seller, or the seller could use images shared by other sellers. Some sellers sell counterfeit goods, and these sellers may use disguising tags and language, which makes detecting them a difficult task. This paper proposes a framework to detect counterfeit sellers by using deep learning to discover connections among sellers from their shared images. Experiments on 473K shared images from Taobao, Instagram and Carousell show that the proposed framework can detect counterfeit sellers. The framework is 30% better than approaches using object recognition in detecting counterfeit sellers. To the best of our knowledge, this is the first work to detect online counterfeit sellers from their shared images.

BTDP: Toward Sparse Fusion with Block Term Decomposition Pooling for Visual Question Answering

Bilinear models are very powerful in multimodal fusion tasks such as Visual Question Answering. The predominant bilinear methods can all be seen as a kind of tensor-based decomposition operation which contains a key kernel called the core tensor. Current approaches usually focus on reducing the computational complexity by imposing a low-rank constraint on the core tensor. In this paper, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods, but also conducts sparse bilinear interactions between modalities. Our method is based on the Block Term Decomposition theory of tensors, which results in a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many "tiny" bilinear operations in different feature spaces. Thus, introducing sparsity into the bilinear operation can significantly increase the performance of feature fusion and improve VQA models. Moreover, our BTDP is very flexible in design. We develop several variants of BTDP and discuss the effects of the diagonal blocks of the core tensor. Extensive experiments on the two challenging VQA-v1 and VQA-v2 datasets show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.
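The stated equivalence between a block-diagonal core tensor and many small bilinear operations can be checked numerically with a simplified sketch. It omits the input and output projections that a full model would include, and the helper names (`bilinear`, `block_diagonal_core`, `blockwise_bilinear`) and the assumption that all blocks share the same shape are illustrative choices, not the authors' implementation.

```python
def bilinear(core, q, v):
    """Full bilinear map: z[k] = sum over i, j of core[i][j][k] * q[i] * v[j]."""
    out_dim = len(core[0][0])
    return [
        sum(core[i][j][k] * q[i] * v[j]
            for i in range(len(q)) for j in range(len(v)))
        for k in range(out_dim)
    ]


def block_diagonal_core(blocks):
    """Place equally sized a x b x c blocks on the diagonal of a zero tensor."""
    r = len(blocks)
    a, b, c = len(blocks[0]), len(blocks[0][0]), len(blocks[0][0][0])
    core = [[[0] * (r * c) for _ in range(r * b)] for _ in range(r * a)]
    for n, blk in enumerate(blocks):
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    core[n * a + i][n * b + j][n * c + k] = blk[i][j][k]
    return core


def blockwise_bilinear(blocks, q, v):
    """Apply one small bilinear map per block to matching slices of q and v,
    then concatenate the per-block outputs."""
    a, b = len(blocks[0]), len(blocks[0][0])
    z = []
    for n, blk in enumerate(blocks):
        z.extend(bilinear(blk, q[n * a:(n + 1) * a], v[n * b:(n + 1) * b]))
    return z
```

With R blocks of size a x b x c, the block-wise form touches R·a·b·c core entries, whereas the dense core of the same outer dimensions has R³·a·b·c entries, which is where the sparsity comes from in this sketch.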

A Deep Learning System for Recognizing Facial Expression in Real-Time

This paper presents an image-based real-time facial expression recognition system, which is capable of recognizing one of the basic facial expressions of several subjects simultaneously from a webcam. Our proposed methodology combines a supervised transfer learning strategy and a joint supervision method with a new supervision signal which is crucial for facial tasks. MobileNet, a recently proposed Convolutional Neural Network (CNN) model that offers both accuracy and speed, is deployed in both the offline and the real-time framework, which enables fast and accurate real-time output. Evaluations are carried out on two publicly available datasets, JAFFE and CK+. The system reaches an accuracy of 95.24% on the JAFFE dataset, and an accuracy of 96.92% on the 6-class CK+ dataset, which contains only the last frames of image sequences. Finally, the average run-time cost of recognition in the real-time implementation is around 3.57 ms/frame on an NVIDIA Quadro K4200 GPU.

Moving Foreground-Aware Visual Attention and Key Volume Mining for Human Action Recognition

Recently, many deep learning approaches have shown remarkable progress on human action recognition. However, few efforts have been made to improve the performance of action recognition by applying the visual attention mechanism in deep learning model. In this paper, we propose a novel deep framework called Moving Foreground Attention (MFA) which enhances the performance of action recognition by guiding the model to focus on the discriminative foreground targets. In our work, MFA detects the moving foreground through a proposed variance-based algorithm. Meanwhile, an unsupervised proposal is utilized to mine the action-related key volumes and generate corresponding correlation scores. Based on these scores, a new stochastic-out scheme is adopted to effectively train the MFA. In addition, we integrate an independent Temporal Net into the proposed framework for temporal dynamic modeling. Experiments on two standard benchmarks UCF101 and HMDB51 show that the proposed MFA is effective and reaches state-of-the-art performance.

Increasing Image Memorability with Neural Style Transfer

Recent works in computer vision and multimedia have shown that image memorability can be automatically inferred by exploiting powerful deep learning models. This paper advances the state of the art in this area by addressing a novel and more challenging issue: given an arbitrary input image, can we make it more memorable? To tackle this problem we introduce an approach based on an editing-by-applying-filters paradigm: given an input image, we propose to automatically retrieve a set of style seeds, i.e., a set of style images which, applied to the input image through a neural style transfer algorithm, provide the highest increase in memorability. We show the effectiveness of the proposed approach with experiments on the publicly available LaMem dataset, performing both a quantitative evaluation and a user study. To demonstrate the flexibility of the proposed framework, we also analyze the impact of different implementation choices, such as using different state-of-the-art neural style transfer methods. Finally, we show several qualitative results to provide additional insights on the link between image style and memorability.

Art by Computing Machinery: Is Machine-Art Acceptable in the Artworld?

When does a machine-created work become art? What is art, and can machine artworks fit into the historical and present discourse? Are machine artworks merely a type of new media with which artists extend their creativity? Will solely machine-created artworks be accepted by our artworlds? This article probes these questions by first identifying the frameworks for defining and explaining art and evaluating their suitability for explaining machine artworks. It then explores how artworks have a necessary relationship with their human artists and the wider context of history, institutions, styles and approaches, and audiences and artworlds. The article then questions whether machines have such a relational context and whether machines will ever live up to our standard of what constitutes an artwork as defined by us, or whether machines are good only for assisting creativity. Questions of IP, rights and ownership are also discussed for human-machine artworks and purely machine-produced works of art. The article views the viability of machines as artists as the central question in the historical discourse, extended through the art and the artworld. The article evaluates machine-produced work from such a basis.

Multi-source Multi-level Attention Networks for Visual Question Answering

In recent years, Visual Question Answering (VQA) has attracted increasing attention due to its requirement for cross-modal understanding and reasoning over vision and language. VQA aims to automatically answer natural language questions with reference to a given image. VQA is challenging because the reasoning process in the visual domain needs a full understanding of spatial relationships, semantic concepts and common sense for real images. However, most existing approaches jointly embed the abstract low-level visual features and high-level question features to infer answers. These works have limited reasoning ability because they do not model the rich spatial context of regions, the high-level semantics of images, or knowledge across multiple sources. To address these challenges, we propose a multi-source multi-level attention network for visual question answering that benefits both spatial inference, through visual attention on context-aware region representations, and reasoning, through semantic attention on concepts as well as external knowledge. Specifically, we learn to reason on image representations by question-guided attention at different levels across multiple sources, including the region and concept levels from the image source as well as the sentence level from an external knowledge base. First, we encode region-based middle-level outputs from convolutional neural networks (CNN) into a spatially embedded representation by a multi-directional 2D recurrent neural network, and further locate the answer-related regions by a multi-layer perceptron (MLP) as visual attention. Second, we generate semantic concepts from high-level semantics in the CNN and select the question-related concepts as concept attention. Third, we query semantic knowledge from a general knowledge base by concepts and select the question-related knowledge as knowledge attention. Finally, we jointly optimize visual attention, concept attention, knowledge attention and question embedding by a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach achieves significant improvement on two very challenging VQA datasets.

Cross-Modality Retrieval by Joint Correlation Learning

As an indispensable process of cross-media analyzing, comprehending heterogeneous data faces challenges in the fields of visual question answering (VQA), visual captioning, and cross-modality retrieval. Bridging the semantic gap between the two modalities is still difficult. In this paper, in order to address the problem in cross-modality retrieval, we propose a cross-modal learning model with joint correlative calculation learning. Firstly, an auto-encoder is used to embed the visual features by minimizing the error of feature reconstruction and a multi-layer perceptron (MLP) is utilized to model the textual features embedding. Then we design a joint loss function to optimize both the intra- and the inter-correlations among the image-sentence pairs, i.e., the reconstruction loss of visual features, the relevant similarity loss of paired samples, and the triplet relation loss between positive and negative examples. In the proposed method, we optimize the joint loss based on a batch score matrix and utilize all mutual mismatched paired samples to enhance its performance. Our experiments in the retrieval tasks demonstrate the effectiveness of the proposed method. It achieves comparable performance to the state-of-the-art on three benchmarks, i.e., Flickr8k, Flickr30k, and MS-COCO.
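The triplet term computed over a batch score matrix can be sketched as follows. The abstract does not give the exact formula, so this uses the widely used bidirectional sum-of-hinges ranking loss as a stand-in; the function name, the `margin` default, and the convention that `scores[i][j]` is the similarity between image i and sentence j (with matched pairs on the diagonal) are all assumptions.

```python
def batch_triplet_loss(scores, margin=0.2):
    """Bidirectional hinge triplet loss over a square batch score matrix.

    scores[i][j] is assumed to be the similarity of image i and sentence j,
    with matched pairs on the diagonal. Every off-diagonal (mismatched) pair
    in the batch serves as a negative example.
    """
    n = len(scores)
    loss = 0.0
    for i in range(n):
        positive = scores[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own sentence than with sentence j.
            loss += max(0.0, margin - positive + scores[i][j])
            # Sentence i should score higher with its own image than with image j.
            loss += max(0.0, margin - positive + scores[j][i])
    return loss
```

In the joint loss described above, a term like this would be added to the reconstruction loss and the paired-similarity loss; how those terms are weighted is not stated in the abstract.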

Look at Me! Correcting Eye Gaze in Live Video Communication

Although live video communication, such as live broadcasting and video conferencing, has become widely used, it is still less engaging than face-to-face communication because it lacks social, emotional, and haptic feedback. Missing eye contact is one of the problems, which is caused by the physical deviation between the screen and the camera on a device. Manipulating video frames to correct the eye gaze is a solution. However, to the best of our knowledge, there are no existing methods that can dynamically correct eye gaze in real time while achieving high visual quality. In this paper, we introduce a system to estimate the rotation of the eyes according to the relative positions of the camera and the local and remote participants' eyes. Then, the system adopts a warping-based convolutional neural network to relocate pixels in the eye regions. To improve visual quality, we minimize not only the L2 distance between the ground truths and the warped eyes but also newly designed loss functions when training the network. These new loss functions are designed to preserve the shape of eye structures and reduce the artifacts caused by occlusions. To evaluate the presented network and the loss functions, we objectively and subjectively compare the results generated by our system and the state-of-the-art method, DeepWarp, on two datasets. The experimental results demonstrate the effectiveness of our system. In addition, we show that our system can run in real time on a consumer-level laptop. This quality and efficiency make gaze correction by post-processing a feasible solution to the missing eye contact in video communication.

Multi-level Similarity Perception Network for Person Re-identification

In this paper, we propose a novel deep Siamese architecture based on convolutional neural network (CNN) and multi-level similarity perception for person re-identification (re-ID) problem. According to the distinct characteristics of diverse feature maps, we effectively apply different similarity constraints to both low-level and high-level feature maps, during training stage. Due to the introduction of appropriate similarity comparison mechanisms at different levels, the proposed approach can adaptively learn discriminative local and global feature representations respectively, while the former is more sensitive in localizing part-level prominent patterns relevant to re-identifying people across cameras. Meanwhile, a novel strong activation pooling strategy is utilized on the last convolutional layer for abstract local feature aggregation to pursue more representative feature representations. Based on this, we propose final feature embedding by simultaneously encoding original global features and discriminative local features. In addition, our framework has two other benefits. Firstly, classification constraints can be easily incorporated into the framework, forming a unified multi-task network with similarity constraints. Secondly, as similarity comparable information has been encoded in the network's learning parameters via back-propagation, pairwise input is not necessary at test time. That means we can extract features of each gallery image and build index in an off-line manner, which is essential for large-scale real-world applications. Experimental results on multiple challenging benchmarks demonstrate that our method achieves splendid performance compared with the current state-of-the-art approaches.

A Simplistic Global Median Filtering Forensics Based on Frequency Domain Analysis of Image Residuals

Sophisticated image forgeries have made digital image forensics an active area of research. In this area, many researchers have addressed the problem of median filtering forensics. Existing median filtering detectors are adequate to classify median filtered images in uncompressed mode and in compressed mode at high quality factors. Despite that, the field lacks a robust method to detect median filtering in low resolution images compressed with low quality factors. In this article, a novel feature set (four feature dimensions), based on first order statistics of the frequency contents of median filtered residuals (MFRs) of original and median filtered images, is proposed. The proposed feature set outperforms state-of-the-art detectors based on handcrafted features in terms of feature set dimensions, robustness for low resolution images at all quality factors, and robustness against an existing anti-forensic method. Results also reveal the efficacy of the proposed method over a convolutional neural network (CNN) based median filtering detector. Comprehensive results show the efficacy of the proposed detector in distinguishing median filtering from other similar manipulations. Additionally, a generalization ability test on cross-database images supports the cross-validation results on four different databases. Thus, our proposed detector meets the current challenges in the field to a great extent.
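The median filtered residual (MFR) that the feature set builds on can be sketched as the difference between a median filtered image and the image itself. The abstract does not specify the window size, border handling, or the four frequency-domain statistics, so the 3 x 3 window, the edge replication, and the function names below are assumptions; the frequency analysis step is not shown.

```python
def median_filter_3x3(image):
    """3 x 3 median filter on a 2D list of numbers, replicating edge pixels."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)  # clamp row index to the image
                    cc = min(max(c + dc, 0), w - 1)  # clamp column index to the image
                    window.append(image[rr][cc])
            out[r][c] = sorted(window)[4]  # middle of the 9 sorted values
    return out


def median_filtered_residual(image):
    """MFR: median filtered image minus the original image, pixel by pixel."""
    filtered = median_filter_3x3(image)
    return [
        [filtered[r][c] - image[r][c] for c in range(len(image[0]))]
        for r in range(len(image))
    ]
```

A detector along the lines described would then compute statistics of the frequency content of this residual; those statistics are left out here because the abstract does not define them.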
