Special Section on Multimodal Understanding of Social, Affective and Subjective Attributes
Editorial to Special Issue on Deep Learning for Intelligent Multimedia Analytics
Nowadays, diversifying the retrieval results of a query improves search efficiency for users. Presenting information on multiple aspects gives users an overview of the object, which helps them quickly target their demands. To discover aspects, research has focused on generating image clusters from the initially retrieved results. Latent Dirichlet Allocation (LDA) has proven effective at discovering high-level topics. However, traditional LDA is designed to process textual words and requires discrete input. When this algorithm is applied to continuous visual images, a common solution is to quantize the continuous features into a discrete representation with the Bag-of-Visual-Words (BoVW) algorithm. During this process, quantization error inevitably leads to information loss. To construct a topic model with complete visual information, this work applies Gaussian Latent Dirichlet Allocation (GLDA) to the diversity problem of image retrieval. In this model, the traditional multinomial distribution is replaced by a Gaussian distribution to model continuous visual features. In addition, we propose a two-phase spectral clustering strategy, called dual spectral clustering, to generate clusters from the segment level to the image level. Experiments on the challenging landmarks of the DIV400 database show that our proposal improves relevance and diversity by about 10% compared with traditional topic models.
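The quantization step that GLDA avoids can be illustrated with a minimal sketch: assigning each continuous descriptor to its nearest visual word and measuring the information lost in the process. All names and sizes here are illustrative, not from the paper.

```python
import numpy as np

def bovw_quantize(features, codebook):
    """Assign each continuous descriptor to its nearest visual word and
    report the mean quantization error (the information BoVW discards)."""
    # Squared Euclidean distance from every feature to every codeword.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)
    # Mean distance between each feature and the codeword replacing it.
    quant_error = np.sqrt(d2[np.arange(len(features)), assignments]).mean()
    return assignments, quant_error

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))   # continuous visual descriptors
words = rng.normal(size=(16, 8))    # a small visual codebook
ids, err = bovw_quantize(feats, words)
```

The nonzero `quant_error` is exactly the loss that motivates modeling the continuous features directly with a Gaussian distribution.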
Dynamic Adaptive Streaming over HTTP (DASH) is a popular over-the-top video content distribution technique that adapts the streaming session to the user's network condition, typically in terms of downlink bandwidth. This video quality adaptation can be achieved by scaling the frame quality, the spatial resolution or the frame rate. Despite this flexibility, each of these quality scaling dimensions has a different effect on the Quality of Experience (QoE) of end users. Furthermore, in video streaming, changes in motion over time, together with the scaling method employed, influence QoE, hence the need to carefully tailor scaling methods to streaming applications and content types. In this work, we investigate an intelligent DASH approach for the latest video coding standard, H.265, and propose a heuristic QoE-aware, cost-efficient adaptation scheme that does not switch unnecessarily to the highest quality level but rather stays temporarily at an intermediate quality level in certain streaming scenarios. Such an approach achieves a comparable and consistent level of quality under impaired network conditions, as commonly found in Internet and mobile networks, whilst reducing bandwidth requirements and quality switching overhead. The rationale is based on our empirical experiments, which show that an increase in bitrate does not necessarily yield a noticeable improvement in QoE. Furthermore, our work demonstrates that the Signal-to-Noise Ratio (SNR) and spatial resolution scalability types are the best fit for our proposed algorithm. Finally, we demonstrate an innovative interaction between quality scaling methods and the polarity of switching operations. The proposed QoE-aware scheme is implemented, and empirical results show that it is able to reduce bandwidth requirements by up to 41% whilst achieving a QoE equivalent to that of a representative DASH reference implementation.
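The core idea of staying temporarily at an intermediate quality instead of jumping to the highest level can be sketched as a simple selection rule. This is a hypothetical illustration, not the paper's algorithm; the function name, the safety headroom and the one-level-up damping are assumptions.

```python
def select_quality(levels_kbps, bandwidth_kbps, current_idx, headroom=0.8):
    """Hypothetical QoE-aware selection: pick the highest level that fits
    within a safety fraction of the measured bandwidth, but never jump
    more than one level up at a time -- staying temporarily at an
    intermediate quality avoids oscillation and saves bandwidth."""
    budget = bandwidth_kbps * headroom
    feasible = [i for i, rate in enumerate(levels_kbps) if rate <= budget]
    target = max(feasible) if feasible else 0   # fall back to lowest level
    if target > current_idx + 1:                # dampen upward switches
        target = current_idx + 1
    return target
```

Downward switches remain immediate, so impaired network conditions are still handled promptly; only the climb back up is damped.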
Smart photography carries the promise of quality improvement and functionality extension in making aesthetically appealing pictures. In this paper, we focus on self-portrait photographs and introduce new methods that guide a user on how best to pose while taking a selfie. While most current solutions use a post-processing procedure to beautify a picture, the developed tool enables a novel function of recommending a good look before the photo is captured. Given an input face image, the tool automatically estimates the pose-based aesthetic score, finds the most attractive angle of the face and suggests how the pose should be adjusted. The recommendation results are adapted to the appearance and initial pose of the input face. We apply a data mining approach to find distinctive, frequent itemsets and association rules from online profile pictures, upon which the aesthetic estimation and pose recommendation methods are developed. A simulated and a real image set are used for experimental evaluation. The results show that the proposed aesthetic estimation method can effectively select user-favored photos. Moreover, the recommendation performance for the vertical adjustment is moderately related to the degree of conformity among the professional photographers' recommendations. This study echoes the trend of instant photo sharing, in which a user takes a picture and then immediately shares it on a social network without engaging in tedious editing.
Despite the promising progress made in visual captioning and paragraphing, visual storytelling remains largely unexplored. This task is more challenging due to the difficulty of modeling an ordered photo sequence and of generating a relevant paragraph with an expressive language style for storytelling. To deal with these challenges, we propose an Attribute-based Hierarchical Generative model with Reinforcement Learning and adversarial training (AHGRL). First, to model the ordered photo sequence and the complex story structure, we propose an attribute-based hierarchical generator. The generator incorporates semantic attributes to create more accurate and relevant descriptions, while the hierarchical framework enables it to learn from the complex paragraph structure. Second, to generate story-style paragraphs, we design a language-style discriminator, which provides word-level rewards to optimize the generator by policy gradient. Third, we further treat the story generator and the reward critic as adversaries: the generator aims to create paragraphs indistinguishable from human-written stories, whereas the critic aims to distinguish them and thereby further improve the generator. Extensive experiments on the widely used dataset demonstrate the advantages of the proposed method over state-of-the-art methods.
Different senses provide us with information at various levels of precision and enable us to construct a more precise representation of the world. Rich multisensory simulations are thus beneficial for comprehension, memory reinforcement and retention of information. Crossmodal mappings refer to the systematic associations often made between different sensory modalities (e.g., high pitch is matched with angular shapes) and govern multisensory processing. A great deal of research effort has been put into exploring crossmodal correspondences in the field of cognitive science. However, the possibilities they open in the digital world have remained relatively unexplored. Mulsemedia - multiple sensorial media - provides a highly immersive experience to users and enhances their Quality of Experience (QoE) in the digital world. Thus, we consider that studying the plasticity and the effects of crossmodal correspondences in a mulsemedia setup can bring interesting insights into improving human-computer dialogue and experience. In our experiments, we exposed users to videos with certain visual dimensions (brightness, color and shape) and investigated whether pairing them with a crossmodally matching sound (high or low pitch) and the corresponding auto-generated haptic effects leads to an enhanced QoE. For this, we captured the eye gaze and heart rate of users while they experienced mulsemedia, and at the end of the experiment we asked them to fill in a set of questions targeting their enjoyment and perception. Results showed differences in eye gaze patterns and heart rate between the experimental and the control group, indicating changes in participants' engagement when videos were accompanied by matching crossmodal sounds (the effect was strongest for the video displaying angular shapes with high-pitch audio) and transitively generated crossmodal haptic effects.
Visual Question Answering (VQA) is a research hot spot in computer vision and natural language processing, and also a fusion of the two research areas in high-level applications. For VQA, this work describes a novel model based on semantic concept network construction and deep walk. Extracting a semantic representation of the visual image is a significant and effective way to span the semantic gap. Moreover, recent research shows that co-occurrence patterns of concepts can enhance semantic representation. This work is motivated by the observation that semantic concepts have complex interrelations and that these relationships resemble a network. Therefore, we construct a semantic concept network using a complex-network modeling method called Word Activation Forces (WAFs), and mine the co-occurrence patterns of semantic concepts using deep walk. The model then performs polynomial logistic regression on the basis of the extracted deep walk vector, jointly embedding the visual image feature and the question feature. The proposed model effectively integrates the visual and semantic features of the image and the natural language question. Experimental results show that our algorithm outperforms challenging baselines on three benchmark image QA datasets. Furthermore, through experiments on image annotation refinement and semantic analysis on the pre-labeled LabelMe dataset, we verify the effectiveness of the constructed concept network in mining concept co-occurrence patterns, sensible concept clusters and hierarchies.
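The deep walk step mines co-occurrence patterns by sampling truncated random walks over the concept network; the walks then serve as "sentences" for a skip-gram-style embedder, as in the original DeepWalk method. A minimal sketch of the walk-sampling stage follows; the toy graph and all parameters are illustrative assumptions.

```python
import random

def deep_walk(adj, num_walks=2, walk_len=5, seed=0):
    """Sample truncated random walks over a concept co-occurrence graph
    (adjacency lists); the walks are later fed to a skip-gram embedder."""
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)            # start walks in random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:          # dead end: truncate the walk
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy semantic concept network (hypothetical concepts).
graph = {"sky": ["cloud", "sun"], "cloud": ["sky"],
         "sun": ["sky"], "sea": ["sky"]}
```

Concepts that frequently co-occur end up adjacent in many walks, which is what lets the downstream embedding capture co-occurrence structure.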
Rich Visual and Language Representation with Complementary Semantics for Video Captioning
Face recognition from 2D still images and videos is quite successful, even under ``in the wild'' conditions. By contrast, less consolidated results are available for cases where face data come from non-conventional cameras, such as infrared or depth sensors. In this paper, we investigate the latter scenario, assuming that a low-resolution depth camera is used to perform face recognition in an uncooperative context. To this end, we first propose to automatically select a set of frames from the depth sequence of the camera according to whether they provide a good view of the face in terms of pose and distance. Then, we design a progressive refinement approach that reconstructs a higher-resolution model from the selected low-resolution frames. This process accounts for the anisotropic error of the existing points in the current 3D model and of the points in a newly acquired frame, so that the refinement step can progressively adjust the point positions in the model using a Kalman-like estimation. The quality of the reconstructed model is evaluated by considering the error between the reconstructed models and their corresponding high-resolution scans used as ground truth. In addition, we performed face recognition using the reconstructed models as probes against both a gallery of reconstructed models and a gallery of high-resolution scans. The obtained results confirm that the reconstructed models can be used effectively for the face recognition task.
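A Kalman-like point refinement of the kind the abstract describes can be sketched in a few lines: each axis of a model point is fused with the new measurement, weighted by the inverse of its variance, so the per-axis (anisotropic) uncertainty shrinks with every frame. This is a generic sketch of the estimation principle, not the paper's exact formulation.

```python
import numpy as np

def kalman_point_update(model_pt, model_var, meas_pt, meas_var):
    """One Kalman-style refinement step: fuse a 3D point of the current
    model with a newly acquired measurement, weighting each axis by the
    inverse of its (anisotropic) variance."""
    model_pt = np.asarray(model_pt, float)
    meas_pt = np.asarray(meas_pt, float)
    model_var = np.asarray(model_var, float)
    meas_var = np.asarray(meas_var, float)
    gain = model_var / (model_var + meas_var)   # per-axis Kalman gain
    fused_pt = model_pt + gain * (meas_pt - model_pt)
    fused_var = (1.0 - gain) * model_var        # uncertainty shrinks
    return fused_pt, fused_var
```

With equal variances the update reduces to averaging; a noisier measurement (larger `meas_var`) pulls the point less.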
Event recognition is one of the areas in multimedia that is attracting considerable attention from researchers. Being applicable to a wide range of scenarios, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. On the other hand, following its immense success in classification, object recognition and detection, deep learning has also been shown to perform well in event recognition tasks. Thus, a large portion of the literature on event analysis now relies on deep learning architectures. In this paper, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia content can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. In this paper we extensively review deep learning-based frameworks for event recognition in these four domains. Furthermore, we also review benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we provide a detailed discussion of the basic insights gathered from the literature review, and identify future trends and challenges.
Single-image super-resolution (SISR) methods based on convolutional neural networks (CNNs) have shown great success in the literature. However, most deep CNN models don't have direct access to the subsequent layers, which seriously hinders the information flow. What's more, they also don't make full use of the hierarchical features from the original low-resolution (LR) images, and thereby achieve relatively low performance. In this paper, we present a special SISR CNN with symmetrical nested residual connections for super-resolution reconstruction, which further improves the quality of the reconstructed image. Compared with previous SISR CNNs, our learning architecture shows significant improvements in accuracy and execution time. It exploits a larger image region for contextual information. Its symmetrical combinations provide multiple short paths for forward propagation, improving reconstruction accuracy, and for the backward propagation of the gradient flow, accelerating convergence. Extensive experiments on open challenge datasets confirm the effectiveness of symmetrical residual connections. Our method can reconstruct high-quality high-resolution (HR) images at a relatively fast speed and outperforms other methods by a large margin.
In this paper, we deal with the problem of understanding human-to-human interactions as a fundamental component of social event analysis. Inspired by the recent success of multi-modal visual data in many recognition tasks, we propose a novel approach to model dyadic interaction by means of features extracted from synchronized 3D skeleton coordinates, depth and RGB sequences. From skeleton data, we extract new view-invariant proxemic features, named UProD, able to incorporate intrinsic and extrinsic distances between two interacting subjects. A novel key frame selection method is introduced to identify salient instants of the interaction sequence based on the joints' energy. From RGBD videos, more holistic CNN features are extracted by applying adaptive pre-trained CNNs on optical flow frames. To better understand the dynamics of interactions, we expand the boundaries of dyadic interaction analysis by proposing a fundamentally new modelling of a previously untreated problem: discerning the active interactor from the passive one. Extensive experiments have been carried out on four multi-modal and multi-view interaction datasets. The experimental results demonstrate the superiority of our proposed techniques over state-of-the-art approaches.
In this paper, we propose video delivery schemes ensuring a delivery latency of around one second. To this purpose, we use Dynamic Adaptive Streaming over HTTP (DASH), a standardized counterpart of HTTP Live Streaming (HLS), so as to benefit from video representation switching between successive video segments. We also propose HTTP/2-based algorithms to apply video frame discarding policies inside a video segment. When a selected DASH representation does not match the available network resources, current solutions suffer from rebuffering events. Rebuffering not only impacts the Quality of Experience (QoE), it also increases the delivery delay between the displayed and the original video streams. We observe that rebuffering-based solutions may increase the delivery delay by 1.5 s to 2 s inside a six-second video segment. In this work, we develop optimal and practical algorithms in order to respect the one-second target latency. In all algorithms, we selectively drop the least meaningful video frames thanks to the HTTP/2 stream-resetting feature. A large number of missing video frames results in a break of temporal fluidity known as video jitter: the displayed video appears as a series of snapshots. Our simulations show that we respect the one-second target latency while ensuring an acceptable video quality, with a Peak Signal-to-Noise Ratio (PSNR) of at least 30 dB. We also quantify and qualify the resulting jitter for each algorithm. We show that both the optimal and the practical algorithms we propose decrease the impact of jitter on the displayed videos. For example, 97% of the optimal algorithm's outputs and 87% of the practical algorithms' outputs are considered acceptable, compared to only 57% of the outputs of the basic First In First Out (FIFO) algorithm.
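The 30 dB quality threshold refers to the standard PSNR measure, which compares a degraded frame against its reference via the mean squared error. A minimal implementation of the metric:

```python
import numpy as np

def psnr(reference, degraded, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two frames.
    PSNR = 10 * log10(peak^2 / MSE); higher is better."""
    ref = np.asarray(reference, dtype=float)
    deg = np.asarray(degraded, dtype=float)
    mse = np.mean((ref - deg) ** 2)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

In the paper's terms, a delivery scheme is acceptable when every displayed frame scores at least 30 dB against the original stream.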
It is known that the inconsistent distribution and representation of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate such heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown a strong ability to model data distributions and learn discriminative representations. Inspired by this, we aim to effectively correlate existing large-scale heterogeneous data of different modalities by utilizing the power of GANs to model the cross-modal joint distribution, so that the idea of adversarial learning can be fully exploited to learn a discriminative common representation for bridging the heterogeneity gap. Thus, in this paper we propose Cross-modal Generative Adversarial Networks (CM-GANs) with the following contributions: (1) A cross-modal GAN architecture is proposed to model the joint distribution over the data of different modalities. The inter-modality and intra-modality correlation can be explored simultaneously in the generative and discriminative models, which compete with each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit the cross-modal correlation for learning the common representation, but also preserve the reconstruction information for capturing the semantic consistency within each modality. (3) A cross-modal adversarial mechanism is proposed, which utilizes two kinds of discriminative models to simultaneously conduct intra-modality and inter-modality discrimination. They mutually boost each other to make the generated common representation more discriminative through the adversarial training process. In summary, our proposed CM-GANs approach can utilize GANs to perform cross-modal common representation learning, by which heterogeneous data can be effectively correlated.
Extensive experiments are conducted to verify the performance of CM-GANs on cross-modal retrieval, compared with 11 state-of-the-art methods on three cross-modal datasets.
Presentation has been an effective method of delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is evaluated through painstaking manual analysis by experts. Although expert feedback can effectively assist users in improving their presentation skills, manual evaluation suffers from high cost and is often not available to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, designed on the basis of a newly proposed assessment rubric. We present our analytics model with conventional ambient sensors (i.e., static cameras and a Kinect sensor) and emerging wearable egocentric sensors (i.e., Google Glass). In addition, we performed a cross-correlation analysis of the speaker's vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely the NUS Multi-Sensor Presentation (NUSMSP) dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we conducted a series of user studies with the speakers and an interview with an English communication expert, which revealed positive and promising feedback.
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research has focused on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Stemming from the inherently temporal structure of music, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving a two-branch deep neural network for the audio modality and the text modality (lyrics). Data from the different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study on understanding the correlation between language and music audio through deep architectures, by learning the paired temporal correlation of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent the lyrics. Two significant contributions are made in the audio branch: i) a pre-trained CNN followed by fully-connected layers is investigated for representing music audio; ii) we further suggest an end-to-end architecture that simultaneously trains the convolutional layers and fully-connected layers to better learn the temporal structures of music audio. In particular, our end-to-end deep architecture has two properties: it simultaneously performs feature learning and cross-modal correlation learning, and it learns a joint representation that takes temporal structures into account. Experimental results, using audio to retrieve lyrics and lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures for cross-modal music retrieval.
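The canonical correlation analysis objective at the heart of such architectures can be sketched in its linear form: whiten each view, then take the singular values of the cross-covariance, which are the canonical correlations. This is a generic, regularized CCA sketch (not the paper's deep variant); the regularization constant is an assumption for numerical stability.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-6):
    """Linear CCA between two paired views (rows are paired samples).
    Returns the canonical correlations in descending order."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0] - 1
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])  # view-1 covariance
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])  # view-2 covariance
    Sxy = Xc.T @ Yc / n                             # cross-covariance
    # Whiten each view via Cholesky, then SVD of the cross-covariance.
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    corr = np.linalg.svd(Wx @ Sxy @ Wy.T, compute_uv=False)
    return np.clip(corr, 0.0, 1.0)
```

In the deep setting, the two branch networks produce the views `X` and `Y`, and the sum of the canonical correlations serves as the training objective.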
Emotion recognition methodologies based on physiological signals are increasingly becoming personalized, due to the subjective responses of different subjects to physical stimuli. Existing works have mainly focused on modelling the physiological corpus of each subject, without considering psychological factors such as interest and personality. The latent correlation among different subjects has also rarely been examined. In this paper, we propose to investigate the influence of personality on emotional behavior in a hypergraph learning framework. Assuming that each vertex is a compound tuple (subject, stimuli), multi-modal hypergraphs can be constructed based on the personality correlation among different subjects and on the physiological correlation among the corresponding stimuli. To model the different importance of vertices, hyperedges and modalities, we assign a weight to each of them. This allows the learning procedure to be conducted on vertex-weighted multi-modal multi-task hypergraphs, thus simultaneously modelling the emotions of multiple subjects. The estimated emotion relevance is employed for emotion recognition. We carry out extensive experiments on the ASCERTAIN dataset, and the results demonstrate the superiority of the proposed method compared to state-of-the-art approaches.
This paper proposes a novel feature-extraction framework for inferring impressed personality traits, emergent leadership skills, communicative competence and hiring decisions. The proposed framework extracts multimodal features describing each participant's nonverbal activities. It captures inter-modal and inter-person relationships in an interaction, including how the target interactor generates nonverbal behavior when the other interactors are also doing so. The inter-modal and inter-personal patterns are identified as frequently co-occurring events based on graph clustering over multimodal sequences. The framework can be applied to any type of interaction task. We apply it to the SONVB corpus, an audio-visual dataset collected from dyadic job interviews, and the ELEA audio-visual corpus, a dataset collected from group meetings. We evaluate the framework on a binary classification task over 15 impression variables in the two corpora. The experimental results show that the model trained with co-occurrence features is more accurate than previous models for 14 out of 15 traits.
The multimedia community has witnessed the rise of deep learning-based techniques for analyzing multimedia content more effectively. In the past decade, the convergence of deep learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection and regression, and has also fundamentally changed the landscape of several relatively new areas, such as semantic segmentation, captioning and content generation. This paper aims to review the development path of major tasks in multimedia analytics and to take a peek at future directions. We start by summarizing the fundamental deep learning techniques relevant to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. Moreover, a performance review on popular benchmarks traces the path of technological advancement and helps identify both milestone works and future directions.
This paper focuses on cross-modal retrieval, which matches images and texts. A basic solution to this task is to learn a shared representation for images and texts so that commonly used distance metrics can be employed for similarity measurement. This solution assumes that images and texts are embedded into a shared image-text space, i.e., the image and text embedding spaces are considered identical. However, this assumption may not hold, because the embeddings of each modality are generated by a modality-specific generator, and there remains a distribution difference between the embedding spaces of the two modalities. To address this problem, this paper proposes to learn a modality-invariant image-text embedding via adversarial learning. Specifically, we aim at pulling the image and text embedding spaces close together. On top of a triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or the text modality. By simultaneously minimizing the triplet loss and maximizing the proposed adversarial loss, the proposed network not only imposes the image-text similarity constraints given by ground-truth labels, but also enforces the image and text embedding spaces to be similar, thus producing modality-invariant image-text embeddings. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvements over the baseline model and that our results compare favorably to the state-of-the-art methods.
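The triplet ranking loss that such baselines build on has a compact form: a matching image-text pair must be closer than a non-matching pair by at least a margin. A minimal sketch on plain embedding vectors (the margin value and Euclidean distance are illustrative choices; the adversarial modality loss is trained on top of this and is not shown):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet ranking loss: the matching pair (anchor, positive) must be
    closer than the non-matching pair (anchor, negative) by `margin`."""
    d_pos = np.linalg.norm(np.asarray(anchor, float) - np.asarray(positive, float))
    d_neg = np.linalg.norm(np.asarray(anchor, float) - np.asarray(negative, float))
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the margin is satisfied, so training pressure concentrates on triplets that still violate the ranking constraint.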
Deep convolutional neural networks (CNNs) have achieved remarkable results in computer vision tasks with end-to-end learning. Here we evaluate the power of a deep CNN to learn robust features from raw EEG data in order to detect seizures. Seizures are hard to detect, as they vary both across and within patients. In this paper, we use a deep CNN model for the seizure detection task on an open-access EEG epilepsy dataset collected at the Children's Hospital Boston. Our deep learning model is able to extract spectral and temporal features from the EEG epilepsy data and use them to learn the general structure of a seizure, which is less sensitive to variations. Our method produced an overall sensitivity of 90.00%, a specificity of 91.65% and an accuracy of 98.05% on the whole dataset of 23 patients; hence, it can serve as an excellent cross-patient classifier. The results show that our model performs better than previous state-of-the-art models on the cross-patient seizure detection task. The proposed model can also visualize the spatial orientation of band-power features. We use correlation maps to relate spectral amplitude features to the output in the form of images. Using the results of our deep learning model, this visualization method can serve as an effective multimedia tool for producing quick and relevant brain mapping images for further investigation by medical experts.
Transfer learning, which focuses on finding a favorable representation for instances of different domains based on auxiliary data, can mitigate the divergence between domains through knowledge transfer. Recently, increasing efforts in transfer learning have employed deep neural networks (DNNs) to learn more robust and higher-level feature representations that better tackle cross-media disparity. However, only a few papers consider the correlation and semantic matching between the multi-layer networks of heterogeneous domains. In this paper, we propose a deep semantic mapping model for heterogeneous multimedia transfer learning (DHTL) using co-occurrence data. More specifically, we integrate a DNN with canonical correlation analysis (CCA) to derive a deep correlation subspace as the joint semantic representation for associating data across different domains. In the proposed DHTL, a multi-layer correlation matching network across domains is constructed, in which CCA is combined to bridge each pair of domain-specific hidden layers. To train the network, a joint objective function is defined and the optimization processes are presented. Once the deep semantic representation is achieved, the shared features of the source domain are transferred for task learning in the target domain. Extensive experiments on three multimedia recognition applications demonstrate that the proposed DHTL can effectively find deep semantic representations for heterogeneous domains, and that it is superior to several existing state-of-the-art methods for deep transfer learning.
Bilinear models are very powerful in multimodal fusion tasks such as Visual Question Answering (VQA). The predominant bilinear methods can all be seen as tensor-based decomposition operations built around a key kernel called the core tensor. Current approaches usually focus on reducing the computational complexity by imposing a low-rank constraint on the core tensor. In this paper, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods, but also conducts sparse bilinear interactions between modalities. Our method is based on the block term decomposition theory of tensors, which yields a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many "tiny" bilinear operations in different feature spaces. Introducing sparsity into the bilinear operation thus significantly increases the performance of feature fusion and improves VQA models. What's more, our BTDP is very flexible in design: we develop several variants of BTDP and discuss the effects of the diagonal blocks of the core tensor. Extensive experiments on the challenging VQA-v1 and VQA-v2 datasets show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.
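The equivalence between a block-diagonal core tensor and many "tiny" bilinear operations can be made concrete with a small sketch: split the two modality vectors into aligned chunks and run one small bilinear map per chunk. All sizes and block layouts below are illustrative, not the paper's configuration.

```python
import numpy as np

def block_bilinear_pool(x, y, blocks):
    """Block-diagonal bilinear fusion: each block is one 'tiny' bilinear
    interaction x_b^T W_b y_b over a pair of aligned feature chunks."""
    outs = []
    for Wb, (xs, xe), (ys, ye) in blocks:
        outs.append(x[xs:xe] @ Wb @ y[ys:ye])   # one tiny bilinear op
    return np.array(outs)

rng = np.random.default_rng(1)
x, y = rng.normal(size=6), rng.normal(size=6)
# Two 3x3 blocks along the diagonal of a (6, 6, 2) core tensor.
blocks = [(rng.normal(size=(3, 3)), (0, 3), (0, 3)),
          (rng.normal(size=(3, 3)), (3, 6), (3, 6))]
z = block_bilinear_pool(x, y, blocks)
```

Embedding the same blocks into a full core tensor that is zero off the diagonal reproduces `z` exactly, which is the equivalence the abstract states; the sparsity simply removes all cross-block interactions.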
This work summarizes the findings of the seventh iteration of the Video Browser Showdown (VBS) competition organized as a workshop at the 24th International Conference on Multimedia Modeling in Bangkok. The competition focuses on video retrieval scenarios in which the searched scenes were either previously observed or described by another person (i.e., an example shot is not available). During the event, nine teams competed with their video retrieval tools in providing access to a shared video collection with 600 hours of video content. Evaluation objectives, rules, scoring, tasks and all the participating tools are described in the paper. In addition, we provide some insights into how the different teams interacted with their video browsers, which was made possible by a novel interaction logging mechanism introduced for this iteration of VBS. The results collected at the Video Browser Showdown evaluation server confirm that searching for one particular scene in the collection within a limited time is still a challenging task for many of the approaches that were showcased during the event. Given only a short textual description, finding the correct scene is even harder. In ad-hoc search with multiple relevant scenes, the tools were mostly able to find at least one scene, while recall was the issue for many teams. The logs also reveal that, even though recent exciting advances in machine learning narrow the classical semantic gap, user-centric interfaces are still required to mediate access to specific content. Finally, open challenges and lessons learned are presented for future VBS events.
Information-centric networking (ICN) has been touted as a revolutionary architecture for the future Internet, which will be dominated by video traffic. This work investigates the challenge of distributing adaptive bitrate (ABR) video content over ICN. In particular, we utilize the in-network caching capability of ICN routers to serve users; in addition, with the help of named functions, we enable ICN routers to transcode videos into lower-bitrate versions to improve the cache hit ratio. Mathematically, we formulate this design challenge as a constrained optimization problem that aims to maximize the cache hit ratio for service providers and minimize the service delay for end users. We design a two-step iterative algorithm to find the optimum. First, given a content management scheme, we minimize the service delay by optimally configuring the routing scheme. Second, we maximize the cache hits for a given routing policy. Finally, we rigorously prove its convergence. Through extensive simulations, we verify the convergence and the performance gains over other algorithms. We also find that more resources should be allocated to ICN routers with heavier request rates, and that the routing scheme favors the shortest path to schedule more traffic.
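The benefit of in-network transcoding for the cache hit ratio can be sketched with a toy model. The cache contents, request trace, and integer bitrate tiers below are invented for illustration only; a cached higher-bitrate version is assumed to be able to serve any lower-bitrate request for the same video via a named transcoding function:

```python
def hit_ratio(requests, cache, transcode=True):
    """Fraction of requests served from an ICN router's cache. With
    transcoding enabled, a cached (video, bitrate) entry also serves
    requests for the same video at any lower bitrate tier."""
    hits = 0
    for vid, br in requests:
        if (vid, br) in cache:
            hits += 1
        elif transcode and any(v == vid and b > br for v, b in cache):
            hits += 1
    return hits / len(requests)

cache = {("news", 4), ("match", 2)}          # cached (video, bitrate tier)
reqs = [("news", 4), ("news", 1), ("match", 4), ("match", 2), ("clip", 1)]
print(hit_ratio(reqs, cache, transcode=False))  # 0.4: only exact versions hit
print(hit_ratio(reqs, cache, transcode=True))   # 0.6: news@1 served by transcoding news@4
```

Note that transcoding never helps upward: the request for match at tier 4 still misses because only tier 2 is cached, which is why the paper's optimization must still decide which versions to place where.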
Tremendous progress in deep learning has shown exciting potential for a variety of face completion tasks. However, most learning-based methods are limited to handling general or structure-specified face images (e.g., well-aligned faces). In this paper, we propose a novel face completion algorithm, called the Learning and Preserving Face Completion Network (LP-FCN), which simultaneously parses face images and extracts face identity-preserving (FIP) features. By tackling these two tasks in a mutually boosting way, the LP-FCN can guide an identity-preserving inference and ensure pixel faithfulness of completed faces. In addition, we adopt a global discriminator and a local discriminator to distinguish real images from synthesized ones. By training with a combination of a semantic parsing loss, an identity-preserving loss and two adversarial losses, the LP-FCN encourages the completion results to be semantically valid and visually consistent for more complicated image completion tasks. Experiments show that our approach obtains similar visual quality but achieves better performance on unaligned face completion and fine-detail synthesis compared with state-of-the-art methods.
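The four-term training objective can be sketched as a weighted sum; the loss forms (mean squared error for parsing and FIP features, a non-saturating generator-side adversarial term) and the weights are placeholder assumptions, since the abstract does not specify them:

```python
import numpy as np

def completion_loss(parse_pred, parse_gt, fip_pred, fip_gt,
                    d_global, d_local, w=(1.0, 0.5, 0.2, 0.2)):
    """Combine the four LP-FCN-style objectives: semantic parsing loss,
    identity-preserving (FIP) feature loss, and global/local adversarial
    losses. Weights w are placeholders, not the paper's values."""
    l_parse = np.mean((parse_pred - parse_gt) ** 2)
    l_id = np.mean((fip_pred - fip_gt) ** 2)
    # Generator-side adversarial terms: push both discriminators' scores
    # on completed faces toward 1 ("real").
    l_adv_g = -np.mean(np.log(d_global + 1e-8))
    l_adv_l = -np.mean(np.log(d_local + 1e-8))
    return (w[0] * l_parse + w[1] * l_id
            + w[2] * l_adv_g + w[3] * l_adv_l)

rng = np.random.default_rng(3)
loss = completion_loss(rng.random((8, 16)), rng.random((8, 16)),
                       rng.random((8, 64)), rng.random((8, 64)),
                       rng.random(8), rng.random(8))
print(float(loss))
```

The global term scores the whole completed face while the local term scores only the filled-in region, which is what lets the two discriminators enforce consistency at different scales.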
In this paper, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. In our approach, VSDA consists of two parts: a modified visual attention model extracts sub-region image features, and a new SEmantic Attention model (SEA) is proposed to distill semantic features. Traditional attribute-based models often neglect the distinctive importance of each attribute word and fuse all of them into recurrent neural networks, resulting in abundant irrelevant semantic features. In contrast, at each time step, our model selects the most relevant word, i.e., the one that aligns with the current context. That is, the real power of VSDA lies in its ability not only to leverage semantic features but also to eliminate the influence of irrelevant attribute words, making the semantic guidance more precise. Furthermore, our approach addresses the problem that visual attention models cannot help in generating non-visual words. Since visual and semantic features are complementary, our model can leverage both to strengthen the generation of visual and non-visual words. Extensive experiments conducted on the MS COCO dataset show that VSDA outperforms other methods and achieves promising performance.
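The per-step selection of relevant attribute words can be sketched as a bilinear attention over attribute embeddings, scored against the current decoder state. The dimensions, the bilinear scoring form, and the random parameters are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def semantic_attention(h, attr_embs, W):
    """Weight attribute-word embeddings by their alignment with the
    current decoder state h; irrelevant attributes get near-zero weight."""
    scores = attr_embs @ W @ h                 # (n_attrs,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over attribute words
    return weights, weights @ attr_embs        # weights and context vector

rng = np.random.default_rng(2)
h = rng.normal(size=16)                  # current decoder hidden state
attr_embs = rng.normal(size=(5, 32))     # embeddings of 5 attribute words
W = rng.normal(size=(32, 16))            # bilinear alignment parameters
weights, ctx = semantic_attention(h, attr_embs, W)
print(weights.round(3), ctx.shape)
```

Feeding the resulting context vector into the decoder alongside the visual attention output is what lets semantic cues drive non-visual words (e.g., function words) that image regions alone cannot ground.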
Facial landmarking is a fundamental task in automatic machine-based face analysis. The majority of existing techniques for this problem are based on 2D images; however, they suffer from illumination and pose variations that can largely degrade landmarking performance. The emergence of 3D data theoretically provides an alternative that overcomes these weaknesses of the 2D domain. This paper proposes a novel approach to 3D facial landmarking that combines the advantages of feature-based methods and model-based ones in a progressive coarse-to-fine manner (initial, intermediate and fine stages). In the initial stage, a few fiducial landmarks (i.e., the nose tip and the two inner eye corners) are robustly detected through curvature analysis, and these points are exploited to initialize the subsequent stage. In the intermediate stage, a statistical model, namely the Active Normal Model (ANM), is learned in the feature space of the three normal components of the facial point cloud rather than the original smooth coordinates. In the fine stage, cascade regression is employed to locally refine the landmarks according to their geometry attributes. The proposed approach can accurately localize dozens of fiducial points on each 3D face scan, greatly surpassing feature-based methods, and it also improves the state of the art of model-based ones in two aspects, i.e., sensitivity to initialization and deficiency in discrimination. The proposed method is evaluated on the BU-3DFE and Bosphorus databases, and competitive results are achieved in comparison with those in the literature, clearly demonstrating its effectiveness.
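The fine-stage cascade regression follows a standard pattern: each stage predicts a landmark update from features computed at the current estimate. A toy sketch of that structure follows; the oracle residual "features" and the damped identity regressors are stand-ins for the learned shape-indexed features and stage regressors, which the abstract does not detail:

```python
import numpy as np

def cascade_refine(x0, regressors, features):
    """Cascade regression: each stage adds an update predicted from
    features extracted at the current landmark estimate."""
    x = x0.copy()
    for R in regressors:
        x = x + R @ features(x)
    return x

# Toy setup: 3 landmarks in 2D, flattened to a 6-vector. The feature
# function is simply the residual to the true shape (an oracle), and each
# stage regressor damps the update by one half.
target = np.array([0.0, 0.0, 1.0, 0.2, 0.5, 1.0])
init = np.zeros(6)
stages = [0.5 * np.eye(6) for _ in range(5)]
refined = cascade_refine(init, stages, lambda x: target - x)
print(np.abs(refined - target).max())  # residual shrinks to (1/2)^5 of the initial error
```

In the actual method the features would be geometry attributes of the 3D scan around each current landmark, and each stage regressor would be trained on annotated scans, but the stage-by-stage contraction of the residual is the same mechanism.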