The scalable extension of High Efficiency Video Coding (SHVC) adopts a hierarchical quadtree-based coding unit (CU) structure to suit the various texture and motion properties of videos. Currently, the SHVC test model selects the best CU size by performing an exhaustive quadtree depth level search, which achieves high compression efficiency at the heavy cost of computational complexity. When the motion/texture properties of coding regions can be identified early, a fast algorithm can be designed that adapts the CU depth level decision procedure to video content and avoids unnecessary computation in the CU depth level traversal. In this paper, we propose a fast CU quadtree depth level decision algorithm for inter frames on enhancement layers based on an analysis of inter-layer, spatial, and temporal correlations. The proposed algorithm first determines the motion activity level at the treeblock size of the hierarchical quadtree, utilizing motion vectors from the corresponding blocks at the base layer. Based on the motion activity level, neighboring encoded CUs with stronger correlations are preferentially chosen to predict the optimal depth level of the current treeblock. Finally, two parameters, the motion activity level and the predicted CU depth level, are used to determine a subset of candidate CU depth levels and adaptively optimize the CU depth level decision process. Experimental results show that the proposed approach reduces the overall coding time of enhancement layers by 70% on average, with almost no loss of compression efficiency. It is effective for all types of scalable video sequences under different coding conditions and outperforms state-of-the-art SHVC and HEVC fast algorithms.
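The two-step decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the activity thresholds, the neighbor-based depth prediction, and the depth-range rules are all hypothetical stand-ins for the correlations the abstract describes.

```python
# Illustrative sketch of a motion-activity-driven CU depth range selection.
# All thresholds and the depth-prediction rule are assumptions for
# demonstration, not values taken from the paper.

def motion_activity_level(base_layer_mvs):
    """Classify a treeblock's motion from its co-located base-layer
    motion vectors, given as a list of (mvx, mvy) tuples."""
    if not base_layer_mvs:
        return "low"
    avg = sum(abs(x) + abs(y) for x, y in base_layer_mvs) / len(base_layer_mvs)
    if avg < 4:
        return "low"
    if avg < 16:
        return "medium"
    return "high"

def candidate_depths(activity, neighbor_depths):
    """Restrict the quadtree search to a subset of depth levels {0..3},
    combining the motion activity level with a depth predicted from
    previously encoded neighboring CUs."""
    pred = round(sum(neighbor_depths) / len(neighbor_depths)) if neighbor_depths else 1
    if activity == "low":          # homogeneous motion -> favor large CUs
        lo, hi = 0, min(pred + 1, 3)
    elif activity == "medium":
        lo, hi = max(pred - 1, 0), min(pred + 1, 3)
    else:                          # complex motion -> allow small CUs
        lo, hi = max(pred - 1, 0), 3
    return list(range(lo, hi + 1))
```

The encoder would then evaluate rate-distortion cost only over `candidate_depths(...)` instead of all four levels, which is where the reported time saving comes from.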
Despite the promising progress made in visual captioning and paragraphing, visual storytelling is still largely unexplored. This task is more challenging due to the difficulty of modeling an ordered photo sequence and of generating a relevant paragraph with an expressive language style for storytelling. To deal with these challenges, we propose an Attribute-based Hierarchical Generative model with Reinforcement Learning and adversarial training (AHGRL). First, to model the ordered photo sequence and the complex story structure, we propose an attribute-based hierarchical generator. The generator incorporates semantic attributes to create more accurate and relevant descriptions, and the hierarchical framework enables it to learn from the complex paragraph structure. Second, to generate story-style paragraphs, we design a language-style discriminator, which provides word-level rewards to optimize the generator by policy gradient. Third, we further treat the story generator and the reward critic as adversaries: the generator aims to create paragraphs indistinguishable from human-written stories, whereas the critic aims to distinguish them, thereby further improving the generator. Extensive experiments on the widely used dataset demonstrate the advantages of the proposed method over state-of-the-art methods.
Different senses provide us with information at various levels of precision and enable us to construct a more precise representation of the world. Rich multisensory simulations are thus beneficial for comprehension, memory reinforcement, and retention of information. Crossmodal mappings refer to the systematic associations often made between different sensory modalities (e.g., high pitch is matched with angular shapes) and govern multisensory processing. A great deal of research effort has been put into exploring crossmodal correspondences in the field of cognitive science. However, the possibilities they open in the digital world have been relatively unexplored. Mulsemedia - multiple sensorial media - provides a highly immersive experience to users and enhances their Quality of Experience (QoE) in the digital world. Thus, we consider that studying the plasticity and the effects of crossmodal correspondences in a mulsemedia setup can bring interesting insights about improving the human-computer dialogue and experience. In our experiments, we exposed users to videos with certain visual dimensions (brightness, color, and shape) and investigated whether pairing them with a crossmodally matching sound (high or low pitch) and correspondingly auto-generated haptic effects leads to an enhanced QoE. For this, we captured users' eye gaze and heart rate while they experienced mulsemedia, and at the end of the experiment we asked them to fill in a set of questions targeting their enjoyment and perception. Results showed differences in eye gaze patterns and heart rate between the experimental and control groups, indicating changes in participants' engagement when videos were accompanied by matching crossmodal sounds (the effect was strongest for the video displaying angular shapes with high-pitch audio) and transitively generated crossmodal haptic effects.
Visual Question Answering (VQA) is a research hotspot in computer vision and natural language processing, and also a fusion of the two research areas in high-level applications. For VQA, this work describes a novel model based on semantic concept network construction and deep walk. Extracting a semantic representation of the visual image is a significant and effective method for bridging the semantic gap. Moreover, current research shows that co-occurrence patterns of concepts can enhance semantic representation. This work is motivated by the observation that semantic concepts have complex interrelations and that these relationships resemble a network. Therefore, we construct a semantic concept network using a complex-network modeling approach called Word Activation Forces (WAFs), and mine the co-occurrence patterns of semantic concepts using deep walk. The model then performs multinomial logistic regression on the basis of the extracted deep walk vector, jointly embedding the visual image feature and the question feature. The proposed model effectively integrates the visual and semantic features of the image and the natural language question. The experimental results show that our algorithm outperforms challenging baselines on three benchmark image QA datasets. Furthermore, through experiments on image annotation refinement and semantic analysis on the pre-labeled LabelMe dataset, we verify the effectiveness of our constructed concept network for mining concept co-occurrence patterns, sensible concept clusters, and hierarchies.
Image captioning and visual question answering are typical tasks that connect computer vision and natural language processing. Both need to effectively represent the visual content using computer vision methods and smoothly process the text sentence using natural language processing techniques. The key problem in both tasks is to infer the target result based on an interactive understanding of the word sequence and the image. Although they use similar algorithms in practice, they have been studied independently over the past few years. In this paper, we attempt to exploit the mutual correlation between these two tasks. We propose the first VQA-improved image captioning method, which transfers the knowledge learned from VQA corpora to the image captioning task. VQA models are first pretrained on image-question-answer instances. To enable the VQA model to extract effective semantic features, we build a large set of free-form, open-ended questions. The pretrained VQA model is then used to extract VQA-grounded semantic representations that interpret images from a different perspective. We incorporate the VQA model into the image captioning model by fusing the VQA-grounded feature and the attended visual feature. We show that such simple VQA-improved image captioning (VQA-IIC) models significantly outperform conventional image captioning methods on large-scale public datasets.
Rich Visual and Language Representation with Complementary Semantics for Video Captioning
Event recognition is one of the areas in multimedia that is attracting great attention from researchers. Since it is applicable to a wide range of scenarios, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. Meanwhile, following its immense success in classification, object recognition, and detection, deep learning has been shown to perform well in event recognition tasks as well. Thus, a large portion of the literature on event analysis nowadays relies on deep learning architectures. In this paper, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia content can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. In this paper we extensively review deep learning-based frameworks for event recognition in these four domains. Furthermore, we also review benchmark datasets made available to the scientific community for validating novel event recognition pipelines. In the final part of the manuscript, we provide a detailed discussion of the basic insights gathered from the literature review, and identify future trends and challenges.
Dynamic adaptive streaming over HTTP (DASH) is widely used for video streaming on mobile devices. Ensuring a good quality of experience (QoE) for mobile video streaming is essential, as it severely impacts both the network and content providers' revenue. Thus, a good rate adaptation algorithm at the client end that provides high QoE is critically important. Recently, the segment size-aware rate adaptation (SARA) algorithm was proposed for DASH clients. However, its performance on mobile clients has not been investigated so far. The contribution of this paper is two-fold: 1) we present an implementation of SARA for mobile clients that improves the QoE in mobile video streaming, one that accurately predicts the download time for the next segment and makes an informed bitrate selection; 2) we develop a new parametric QoE model that computes a cumulative score enabling a fair comparison of different adaptation algorithms. Based on our subjective and objective evaluation, we observed that SARA for mobile clients outperforms the other algorithms by 17% on average in terms of the Mean Opinion Score, while achieving, on average, a 76% improvement in terms of the interruption ratio. The score obtained from our new parametric QoE model also demonstrates that the SARA algorithm for mobile clients gives the best QoE among all the algorithms.
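The core idea that distinguishes SARA from purely throughput-based heuristics is sketched below: the client uses the advertised byte size of each candidate next segment to predict its download time before committing to a bitrate. The function names, the buffer safety margin, and the selection rule are illustrative assumptions, not the published algorithm.

```python
# Minimal sketch of segment-size-aware bitrate selection in the spirit of
# SARA. Unlike throughput-only heuristics, it predicts the download time
# of each candidate segment from its actual byte size.

def predict_download_time(segment_bytes, throughput_bps):
    """Predicted seconds to fetch a segment at the estimated throughput."""
    return segment_bytes * 8 / throughput_bps

def select_bitrate(next_segment_sizes, throughput_bps, buffer_s, safety_s=2.0):
    """next_segment_sizes: {bitrate_bps: size_in_bytes} for the next segment.
    Pick the highest bitrate whose predicted download time still leaves
    `safety_s` seconds of playback buffer; otherwise fall back to the
    lowest available bitrate."""
    for bitrate in sorted(next_segment_sizes, reverse=True):
        t = predict_download_time(next_segment_sizes[bitrate], throughput_bps)
        if buffer_s - t >= safety_s:
            return bitrate
    return min(next_segment_sizes)
```

Because segment sizes at a fixed bitrate can vary widely with content complexity, this per-segment prediction avoids the stalls that a throughput-only rule would incur on unusually large segments.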
Single-image super-resolution (SISR) methods based on convolutional neural networks (CNNs) have shown great success in the literature. However, most deep CNN models don't give earlier layers direct access to subsequent layers, which seriously hinders the information flow. What's more, they also don't make full use of the hierarchical features from the original low-resolution (LR) images, and thereby achieve relatively low performance. In this paper, we present a special SISR CNN with symmetrical nested residual connections for super-resolution reconstruction, which further improves the quality of the reconstructed image. Compared with previous SISR CNNs, our learning architecture shows significant improvements in accuracy and execution time. It has a larger image region for contextual spreading. Its symmetrical combinations provide multiple short paths for forward propagation, improving reconstruction accuracy, and for the backward propagation of gradient flow, accelerating convergence. Extensive experiments on the open challenge datasets confirm the effectiveness of symmetrical residual connections. Our method reconstructs high-quality high-resolution (HR) images at a relatively fast speed and outperforms other methods by a large margin.
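The "multiple short paths" idea can be illustrated with a toy forward pass: each layer in the first half of the network feeds a skip connection to its mirror layer in the second half, so both activations and gradients have short routes through the network. This is a pure-NumPy illustration with scalar "convolution" stand-ins, not the paper's architecture.

```python
# Toy illustration of symmetrical nested residual connections: layer i's
# output is added to the output of its mirror layer (n - 1 - i). Scalar
# weights and an identity-shaped "conv" stub stand in for real conv layers.

import numpy as np

def conv_stub(x, w):
    """Stand-in for a conv + ReLU layer: ReLU of a scaled input."""
    return np.maximum(0.0, w * x)

def nested_residual_forward(x, weights):
    """weights: one scalar per layer; even count. Symmetric skips connect
    layer i in the first half to layer len(weights) - 1 - i in the second."""
    n = len(weights)
    skips = []
    h = x
    for i in range(n // 2):          # first half: record skip sources
        h = conv_stub(h, weights[i])
        skips.append(h)
    for i in range(n // 2, n):       # second half: add the mirrored skip
        h = conv_stub(h, weights[i]) + skips[n - 1 - i]
    return h
```

With identity weights, the output accumulates one skip per mirrored pair, making the short paths explicit; in a real network the same wiring lets gradients bypass long chains of layers during training.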
Presentation has been an effective method for delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is evaluated through painstaking manual analysis by experts. Although expert feedback is effective in helping users improve their presentation skills, manual evaluation suffers from high cost and is often not available to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, designed on the basis of a newly proposed assessment rubric. We present our analytics model with conventional ambient sensors (i.e., static cameras and a Kinect sensor) and emerging wearable egocentric sensors (i.e., Google Glass). In addition, we performed a cross-correlation analysis of speakers' vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely the NUS Multi-Sensor Presentation (NUSMSP) dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we conducted a series of user studies with the speakers and an interview with an English communication expert, which revealed positive and promising feedback.
In the large-scale image retrieval task, the two most important requirements are the discriminability of image representations and efficiency in the computation and storage of those representations. Regarding the former requirement, the Convolutional Neural Network (CNN) has proven to be a very powerful tool for extracting highly discriminative local descriptors for effective image search. Additionally, in order to further improve the discriminative power of the descriptors, recent works adopt fine-tuning strategies. In this paper, taking a different approach, we propose a novel, computationally efficient, and competitive framework. Specifically, we first propose various strategies to compute masks, namely SIFT-mask, SUM-mask, and MAX-mask, to select a representative subset of local convolutional features and eliminate redundant features. Our in-depth analyses demonstrate that the proposed masking schemes are effective in addressing the burstiness drawback and improving retrieval accuracy. Second, we propose to employ recent embedding and aggregating methods, which can significantly boost feature discriminability. Regarding computation and storage efficiency, we include a hashing module to produce very compact binary image representations. Extensive experiments on six image retrieval benchmarks demonstrate that our proposed framework achieves state-of-the-art retrieval performance.
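The SUM-mask and MAX-mask ideas can be sketched concretely: score each spatial location of a convolutional feature map and keep only the informative ones before embedding and aggregation. The exact selection rules below (above-mean thresholding for SUM-mask) are illustrative assumptions; only the general keep-the-salient-locations principle is taken from the text.

```python
# Hedged sketch of activation-based masking over local conv features.
# feat has shape (C, H, W): C channels over an H x W spatial grid.

import numpy as np

def sum_mask(feat):
    """Score each location by the sum of activations over channels;
    keep locations scoring above the mean (illustrative threshold)."""
    score = feat.sum(axis=0)                  # (H, W) saliency map
    return score > score.mean()

def max_mask(feat):
    """Keep a location if it is the spatial argmax of at least one channel."""
    mask = np.zeros(feat.shape[1:], dtype=bool)
    flat = feat.reshape(feat.shape[0], -1)
    idx = flat.argmax(axis=1)                 # per-channel strongest location
    mask.reshape(-1)[idx] = True
    return mask

def select_local_features(feat, mask):
    """Gather the (N, C) local descriptors at the kept locations."""
    return feat[:, mask].T
```

Downstream, only the selected descriptors are embedded and aggregated, which suppresses bursty, repeated patterns that would otherwise dominate the global representation.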
Augmented Reality (AR) offers the possibility to enrich the real world with digitally mediated content, thereby increasing the quality of many everyday experiences. While some research areas such as cultural heritage, tourism, or medicine see strong technological investment, AR for gaming purposes struggles to become a widespread commercial application. In this paper, a novel framework for AR kids' gaming is developed, together with general guidelines and long-term usage tests and metrics. The proposed application is designed to augment the puzzle experience: once the user has assembled the real puzzle, AR functionality within the mobile application can be unlocked, bringing the puzzle characters to life and creating a seamless game that merges AR interactions with the physical puzzle. The main goals and benefits of this framework lie in a novel set of AR tests and metrics for the pre-release phase (to support the commercial launch and the developers) and, in the release phase, in measures for long-term app optimization, usage testing, and guidance for end users, together with measures to inform design policy, providing a method for automatically testing quality and popularity improvements. Moreover, smart configuration tools enabling multi-app and eventually also multi-user development are proposed, facilitating the serialization of applications. Results were obtained from a large-scale user test with about 4 million users across a family of 8 gaming applications, providing the scientific community with a workflow for implicit quantitative analysis in AR gaming. They also show that the proposed approach is affordable and reliable for long-term testing and optimization.
Preserving the privacy of people in video surveillance systems is quite challenging, and a significant amount of research has been done to solve this problem in recent times. The majority of existing techniques are based on detecting bodily cues such as the face and/or silhouette and obscuring them so that people in the videos cannot be identified. We observe that merely hiding bodily cues is not enough to protect the identities of individuals in the videos. An adversary who has prior contextual knowledge about the surveilled area can identify people in the video by exploiting implicit inference channels such as behavior, place, and time. This paper presents an anonymous surveillance system, called "Watch Me from Distance" (WMD), which advocates outsourcing surveillance video monitoring (similar to call centers) to long-distance sites where professional security operators watch the video and alert the local site when any suspicious or abnormal event takes place. We find that long-distance monitoring helps decouple the contextual knowledge of security operators. Since security operators at the remote site could themselves turn into adversaries, a trust computation model that determines the credibility of the operators is presented as an integral part of the proposed system. The feasibility study and experiments suggest that the proposed system provides more robust privacy measures while maintaining surveillance effectiveness.
With the advancement of social media and mobile technology, any smartphone user can easily become a seller on social media and e-commerce platforms, such as Instagram and Carousell in Hong Kong, or Taobao in China. A seller shows images of their products and annotates the images with suitable tags that can easily be searched by others. These images may be taken by the seller, or the seller may reuse images shared by other sellers. Among sellers, some sell counterfeit goods, and these sellers may use disguising tags and language, which makes detecting them a difficult task. This paper proposes a framework that detects counterfeit sellers by using deep learning to discover connections among sellers from their shared images. Based on 473K shared images from Taobao, Instagram, and Carousell, we show that the proposed framework can detect counterfeit sellers, performing 30% better than approaches using object recognition. To the best of our knowledge, this is the first work to detect online counterfeit sellers from their shared images.
Bilinear models are very powerful in multimodal fusion tasks such as Visual Question Answering. The predominant bilinear methods can all be seen as a kind of tensor-based decomposition operation built around a key kernel called the core tensor. Current approaches usually focus on reducing the computational complexity by imposing a low-rank constraint on the core tensor. In this paper, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods but also conducts sparse bilinear interactions between modalities. Our method is based on the block term decomposition theory of tensors, which yields a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many "tiny" bilinear operations in different feature spaces. Thus, introducing sparsity into the bilinear operation can significantly increase the performance of feature fusion and improve VQA models. What's more, our BTDP is very flexible in design: we develop several variants of BTDP and discuss the effects of the diagonal blocks of the core tensor. Extensive experiments on the challenging VQA-v1 and VQA-v2 datasets show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.
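The equivalence between a block-diagonal core tensor and many "tiny" bilinear operations can be made concrete with a short sketch: split each modality's feature into B chunks and take a small outer-product interaction per chunk. This is an illustrative simplification (no learned per-block projections), not the BTDP model itself.

```python
# Illustrative sketch of block-diagonal bilinear fusion: a bilinear map
# with a block-diagonal core tensor decomposes into B independent small
# ("tiny") bilinear interactions, one per aligned chunk of each modality.

import numpy as np

def block_bilinear_fusion(v, q, blocks):
    """v, q: 1-D features of equal length divisible by `blocks`.
    Returns the concatenated per-block outer-product interactions."""
    vs = np.split(v, blocks)
    qs = np.split(q, blocks)
    outs = [np.outer(vb, qb).ravel() for vb, qb in zip(vs, qs)]
    return np.concatenate(outs)
```

For d-dimensional inputs, a full bilinear interaction produces d^2 terms, while the block-diagonal version produces only B * (d/B)^2 = d^2 / B, which is the sparsity saving the abstract refers to.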
This paper presents an image-based real-time facial expression recognition system, which is capable of recognizing one of the basic facial expressions of several subjects simultaneously from a webcam. Our proposed methodology combines a supervised transfer learning strategy with a joint supervision method using a new supervision signal that is crucial for facial tasks. A recently proposed Convolutional Neural Network (CNN) model, MobileNet, which balances accuracy and speed, is deployed in both an offline and a real-time framework, enabling fast and accurate real-time output. Evaluations on two publicly available datasets, JAFFE and CK+, are carried out. The system reaches an accuracy of 95.24% on the JAFFE dataset and 96.92% on the 6-class CK+ dataset, which contains only the last frames of the image sequences. Finally, the average run-time cost of the real-time implementation is around 3.57 ms/frame on an NVIDIA Quadro K4200 GPU.
Recent works in computer vision and multimedia have shown that image memorability can be automatically inferred by exploiting powerful deep learning models. This paper advances the state of the art in this area by addressing a novel and more challenging issue: given an arbitrary input image, can we make it more memorable? To tackle this problem we introduce an approach based on an editing-by-applying-filters paradigm: given an input image, we propose to automatically retrieve a set of style seeds, i.e., a set of style images which, applied to the input image through a neural style transfer algorithm, provide the highest increase in memorability. We show the effectiveness of the proposed approach with experiments on the publicly available LaMem dataset, performing both a quantitative evaluation and a user study. To demonstrate the flexibility of the proposed framework, we also analyze the impact of different implementation choices, such as using different state-of-the-art neural style transfer methods. Finally, we show several qualitative results to provide additional insights on the link between image style and memorability.
Although live video communication, such as live broadcasting and video conferencing, is now widely used, it remains less engaging than face-to-face communication because it lacks social, emotional, and haptic feedback. Missing eye contact is one such problem, caused by the physical offset between the screen and the camera on a device. Manipulating video frames to correct the eye gaze is one solution. However, to the best of our knowledge, no existing method can dynamically correct eye gaze in real time while achieving high visual quality. In this paper, we introduce a system that estimates the rotation of the eyes from the positions of the camera and the local and remote participants' eyes. The system then adopts a warping-based convolutional neural network to relocate pixels in the eye regions. To improve visual quality, we minimize not only the L2 distance between the ground truth and the warped eyes but also newly designed loss functions when training the network. These loss functions are designed to preserve the shape of eye structures and to reduce the artifacts caused by occlusions. To evaluate the presented network and loss functions, we objectively and subjectively compare the results generated by our system and the state-of-the-art method, DeepWarp, on two datasets. The experimental results demonstrate the effectiveness of our system. In addition, we show that our system runs in real time on a consumer-level laptop. This quality and efficiency make gaze correction by post-processing a feasible solution to missing eye contact in video communication.
In this paper, we propose a novel deep Siamese architecture based on a convolutional neural network (CNN) and multi-level similarity perception for the person re-identification (re-ID) problem. According to the distinct characteristics of the different feature maps, we apply different similarity constraints to the low-level and high-level feature maps during the training stage. Owing to these appropriate similarity comparison mechanisms at different levels, the proposed approach can adaptively learn discriminative local and global feature representations, where the former is more sensitive in localizing part-level prominent patterns relevant to re-identifying people across cameras. Meanwhile, a novel strong activation pooling strategy is applied to the last convolutional layer to aggregate abstract local features into more representative descriptors. On this basis, we form the final feature embedding by simultaneously encoding the original global features and the discriminative local features. In addition, our framework has two further benefits. First, classification constraints can easily be incorporated into the framework, forming a unified multi-task network with similarity constraints. Second, since similarity-comparison information has been encoded into the network's parameters via back-propagation, pairwise input is unnecessary at test time. This means we can extract features for each gallery image and build the index offline, which is essential for large-scale real-world applications. Experimental results on multiple challenging benchmarks demonstrate that our method achieves excellent performance compared with current state-of-the-art approaches.
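A strong activation pooling step of the kind described above can be sketched as follows: instead of pooling over all spatial positions of the last convolutional layer, aggregate only the most strongly activated ones. The top-k selection rule and channel-sum scoring are illustrative assumptions standing in for the paper's strategy.

```python
# Hedged sketch of "strong activation pooling" over the last conv layer.
# feat has shape (C, H, W); the result is a single C-dim local descriptor.

import numpy as np

def strong_activation_pooling(feat, k=4):
    """Average the descriptors at the k spatial locations with the
    highest channel-summed activation (illustrative selection rule)."""
    c, h, w = feat.shape
    score = feat.sum(axis=0).ravel()            # (H*W,) activation strength
    top = np.argsort(score)[-k:]                # k strongest positions
    return feat.reshape(c, -1)[:, top].mean(axis=1)   # (C,) descriptor
```

The resulting descriptor emphasizes part-level prominent patterns, and can then be concatenated with a global feature to form the final embedding.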