Automatic product image classification is crucial to the understanding and management of online retailers. Motivated by recent advances in deep convolutional neural networks (CNNs) for image classification, in this work we revisit the problem in the context of product images that come with a predefined categorical hierarchy and attributes, aiming to leverage the hierarchy and attributes to further improve classification accuracy. With these structure-aware clues, we argue that more advanced CNN models can be developed beyond the one-versus-all classification performed by conventional CNNs. To this end, the contributions of this work include: developing a salient-sensitive CNN that focuses more on the product foreground by inserting a spatial attention layer at a proper location; proposing a multi-class regression based refinement method that is expected to generate more accurate predictions by utilizing prediction scores from multiple preceding CNNs, each corresponding to a distinct classifier on one categorical layer of the hierarchy; and devising a multi-task deep learning architecture that effectively explores correlations among the categories and attributes for better categorical label prediction. Experimental results on nearly one million real-world product images validate the effectiveness of the proposed efforts, both jointly and individually, with performance gains observed.
Person re-identification aims at identifying a certain pedestrian across non-overlapping multi-camera networks at different times and places. Existing person re-identification approaches mainly focus on matching pedestrians in still images, whereas little attention has been paid to person re-identification in videos. Compared to images, video clips contain the motion of pedestrians, which is crucial to re-identification. Moreover, consecutive video frames present pedestrian appearance in different poses and from different viewpoints, providing valuable information for addressing the challenges of pose variation, occlusion, and viewpoint change. In this paper, we propose a Dense 3D-Convolutional Network (D3DNet) to jointly learn spatio-temporal and appearance features for person re-identification in videos. The D3DNet consists of multiple 3D dense blocks and transition layers. The 3D dense blocks enlarge the receptive fields of visual neurons in the spatial and temporal dimensions, leading to a discriminative appearance representation as well as short-term and long-term motion information of pedestrians, without requiring an additional motion estimation module. Moreover, we propose an improved loss function consisting of an identification loss and a center loss to minimize intra-class variance and maximize inter-class variance simultaneously, addressing the challenge of large intra-class variance and small inter-class variance, which is a common phenomenon in the person re-identification task. Extensive experiments on two widely used surveillance video datasets, i.e., MARS and iLIDS-VID, show the effectiveness of the proposed approach.
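As a rough illustration of how an identification loss and a center loss can be combined, the sketch below uses the common textbook forms: softmax cross-entropy for identification and half the squared distance to a per-identity center for the center term. The abstract does not give the exact formulation, so the function names, the weighting factor `lam`, and the fixed (non-learned) centers are all assumptions made for illustration.

```python
import math

def identification_loss(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(batch, centers, lam=0.01):
    """Batch mean of identification loss plus lam times center loss.

    batch:   list of (feature, logits, label) tuples
    centers: dict mapping each identity label to its center vector
    lam:     hypothetical weight balancing the two terms (not from the source)
    """
    total = 0.0
    for feature, logits, label in batch:
        total += identification_loss(logits, label)
        total += lam * center_loss(feature, centers[label])
    return total / len(batch)

# Usage: one sample of identity 0 whose feature lies 2 units from its center.
example_batch = [([1.0, 2.0], [0.0, 0.0], 0)]
example_centers = {0: [1.0, 0.0]}
print(joint_loss(example_batch, example_centers, lam=0.5))
```

The identification term pushes different identities apart, while the center term pulls features of the same identity together; in a trained network the centers would normally be updated alongside the weights rather than fixed as they are here.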
Scene classification is a challenging problem. Compared with object images, scene images are more abstract, as they are composed of objects. Object and scene images have different characteristics, with different scales and composition structures. How to effectively integrate local mid-level semantic representations that include both object and scene concepts is an important aspect of scene classification that needs to be investigated. In this paper, the idea of a shared codebook is introduced by integrating deep learning, concept features, and local feature encoding techniques. More specifically, the shared local feature codebook is generated from the combined ImageNet1000 and Places365 concepts (Mixed1365) using convolutional neural networks. As the Mixed1365 features cover the semantic information of both object and scene concepts, we can extract from them a shared codebook that contains only a subset of the 1365 concepts while keeping the same codebook size. The shared codebook not only provides complementary representations without additional codebook training, but can also be adaptively extracted for different scene classification tasks. A method combining the original codebook and the shared codebook is proposed for scene classification. In this way, more comprehensive and representative image features can be generated for classification. Extensive experiments conducted on two public datasets validate the effectiveness of the proposed method. In addition, some useful observations are reported that show the advantage of the shared codebook.
Dynamic Adaptive Streaming over HTTP (DASH) is a popular over-the-top video content distribution technique that adapts the streaming session to the user's network conditions, typically in terms of downlink bandwidth. This video quality adaptation can be achieved by scaling the frame quality, spatial resolution, or frame rate. Despite the flexibility of these video quality scaling methods, each scaling dimension has a different effect on the Quality of Experience (QoE) for end users. Furthermore, in video streaming, changes in motion over time, together with the scaling method employed, influence QoE; hence the need to carefully tailor scaling methods to the streaming application and content type. In this work, we investigate an intelligent DASH approach for the latest video coding standard, H.265, and propose a heuristic QoE-aware, cost-efficient adaptation scheme that does not switch unnecessarily to the highest quality level but rather stays temporarily at an intermediate quality level in certain streaming scenarios. Such an approach achieves a comparable and consistent level of quality under impaired network conditions, as commonly found in Internet and mobile networks, whilst reducing bandwidth requirements and quality switching overhead. The rationale is based on our empirical experiments, which show that an increase in bitrate does not necessarily mean a noticeable improvement in QoE. Furthermore, our work demonstrates that the Signal-to-Noise Ratio (SNR) and spatial resolution scalability types are the best fit for our proposed algorithm. Finally, we demonstrate an innovative interaction between quality scaling methods and the polarity of switching operations. The proposed QoE-aware scheme is implemented, and empirical results show that it is able to reduce bandwidth requirements by up to 41% whilst achieving equivalent QoE compared with a representative DASH reference implementation.
This paper deals with the problem of electric network frequency (ENF) estimation, where a low signal-to-noise ratio (SNR) is an essential challenge. By exploiting the low-rank structure of the ENF signal in the audio spectrogram, we propose an approach based on robust principal component analysis to remove the interference from speech content and some of the background noise, which in our case can be regarded as sparse in nature. Weighted linear prediction is enforced on the low-rank signal subspace to obtain an accurate ENF estimate. The performance of the proposed scheme is analyzed and evaluated as a function of SNR, and the Cramér-Rao lower bound (CRLB) is approached at SNR levels above -10 dB. Experiments on real datasets demonstrate the advantages of the proposed method over state-of-the-art works in terms of estimation accuracy. Specifically, the proposed scheme can effectively capture the ENF fluctuations along the time axis using a small number of signal observations while preserving sufficient frequency precision.
Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they provide rich clues about object contexts within images. In this paper, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, with the aim of explicitly modeling spatial object contexts within images. In particular, for every object proposal obtained from the input image, we extract high-level semantic object features with convolutional neural networks. Then, we explicitly apply gate units to these extracted features for important object selection and noise removal. The selected object features are organized into the proposed SFM, which is a compact and discriminative representation that preserves the spatial information among objects. Finally, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) as classifiers on top of the SFM for content recognition, which are expected to exploit the spatial relationships among objects. We also introduce a novel multi-task learning framework to help learn the model parameters in the training phase. It consists of a basic image classification loss in cross-entropy form, an object localization loss to guide important object selection, and a grid labeling loss to predict object labels at SFM grids. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach, and very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. In addition, the SFMs learned on the image domain are transferred to video classification on the CCV and FCVID benchmarks, and the results demonstrate their robustness and generalization capability.
Baselines are the starting point of any quantitative multimedia research, and benchmarks are essential for pushing those baselines further. In this paper, we present baselines for the artistic domain with a new benchmark dataset, dubbed OmniArt, featuring over 2 million images with rich structured metadata. It contains annotations for dozens of attribute types and features semantic context information through concepts, IconClass labels, color information, and (limited) object-level bounding boxes. We establish and present baseline scores on multiple tasks such as artist attribution, creation period estimation, and type, style, and school prediction. As an example of additional types of analyses, we explore the color spaces of art through different types and evaluate a transfer learning object recognition pipeline.
Single-image super-resolution (SISR) methods based on convolutional neural networks (CNNs) have shown great success in the literature. However, most deep CNN models do not have direct access to the subsequent layers, which seriously hinders the information flow. Moreover, they do not make full use of the hierarchical features from the original low-resolution (LR) images, and therefore achieve relatively low performance. In this paper, we present a special SISR CNN with symmetrical nested residual connections for super-resolution reconstruction to further improve the quality of the reconstructed image. Compared with previous SISR CNNs, our learning architecture shows significant improvements in accuracy and execution time. It has a larger image region for contextual spreading. Its symmetrical combinations provide multiple short paths for forward propagation, which improves the reconstruction accuracy, and for the backward propagation of gradient flow, which accelerates convergence. Extensive experiments on open challenge datasets confirm the effectiveness of symmetrical residual connections. Our method can reconstruct high-quality high-resolution (HR) images at a relatively fast speed and outperforms other methods by a large margin.
In this paper, we propose video delivery schemes ensuring a delivery latency of around one second. To this end, we use Dynamic Adaptive Streaming over HTTP (DASH), a standardized counterpart of HTTP Live Streaming (HLS), so as to benefit from video representation switching between successive video segments. We also propose HTTP/2-based algorithms to apply video frame discarding policies inside a video segment. When a selected DASH representation does not match the available network resources, current solutions suffer from rebuffering events. Rebuffering not only impacts the Quality of Experience (QoE) but also increases the delivery delay between the displayed and the original video streams. We observe that rebuffering-based solutions may increase the delivery delay by 1.5 s to 2 s inside a six-second video segment. In this work, we develop optimal and practical algorithms that respect the targeted one-second latency. In all algorithms, we selectively drop the least meaningful video frames using the HTTP/2 stream resetting feature. A large number of missing video frames results in a break in temporal fluidity known as video jitter: the displayed video appears as a series of snapshots. Our simulations show that we respect the targeted one-second latency while ensuring an acceptable video quality, with a Peak Signal-to-Noise Ratio (PSNR) of at least 30 dB. We also quantify and qualify the resulting jitter for each algorithm. We show that both the optimal and the practical algorithms we propose decrease the impact of jitter on the displayed videos. For example, 97% of the optimal algorithm's outputs and 87% of the practical algorithm's outputs are considered acceptable, compared to only 57% of the outputs of the basic First In First Out (FIFO) algorithm.
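For readers unfamiliar with the quality criterion mentioned above, the following minimal sketch computes PSNR between a reference frame and a degraded frame and checks it against the 30 dB threshold quoted in the abstract. It assumes 8-bit samples (peak value 255) and frames given as flat lists of pixel values; the function names are illustrative and are not taken from the paper.

```python
import math

def psnr(reference, degraded, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(degraded):
        raise ValueError("frames must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)

def is_acceptable(reference, degraded, threshold_db=30.0):
    """True when the degraded frame meets the PSNR threshold (30 dB by default)."""
    return psnr(reference, degraded) >= threshold_db

# Usage: a frame whose pixels are each off by one grey level.
reference_frame = [100, 100, 100, 100]
degraded_frame = [101, 99, 101, 99]
print(round(psnr(reference_frame, degraded_frame), 2), is_acceptable(reference_frame, degraded_frame))
```

For video, PSNR is typically computed per frame and then averaged; how the paper aggregates it over a segment with dropped frames is not stated in the abstract.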
It is known that the inconsistent distribution and representation of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate such heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown a strong ability to model data distributions and learn discriminative representations. Inspired by this, we aim to effectively correlate existing large-scale heterogeneous data of different modalities by utilizing the power of GANs to model the cross-modal joint distribution, so that the idea of adversarial learning can be fully exploited to learn a discriminative common representation for bridging the heterogeneity gap. Thus, in this paper we propose Cross-modal Generative Adversarial Networks (CM-GANs) with the following contributions: (1) A cross-modal GAN architecture is proposed to model the joint distribution over the data of different modalities. The inter-modality and intra-modality correlations can be explored simultaneously in the generative and discriminative models, which compete with each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit the cross-modal correlation for learning the common representation, but also preserve the reconstruction information for capturing the semantic consistency within each modality. (3) A cross-modal adversarial mechanism is proposed, which utilizes two kinds of discriminative models to simultaneously conduct intra-modality and inter-modality discrimination. They mutually boost each other to make the generated common representation more discriminative through the adversarial training process. In summary, our proposed CM-GANs approach utilizes GANs to perform cross-modal common representation learning, by which heterogeneous data can be effectively correlated.
Extensive experiments are conducted to verify the performance of CM-GANs on cross-modal retrieval, compared with 11 state-of-the-art methods on 3 cross-modal datasets.
Deep cross-modal learning has demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning in which the temporal structures of different data modalities, such as audio and lyrics, are taken into account. Given the inherently temporal structure of music, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data from the different modalities are converted to the same canonical space, where inter-modal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study on understanding the correlation between language and music audio through deep architectures for learning the paired temporal correlation of audio and lyrics. A pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network) is used to represent lyrics. Two significant contributions are made in the audio branch: i) a pre-trained CNN followed by fully-connected layers is investigated for representing music audio; ii) we further suggest an end-to-end architecture that simultaneously trains convolutional layers and fully-connected layers to better learn the temporal structures of music audio. In particular, our end-to-end deep architecture has two properties: it simultaneously implements feature learning and cross-modal correlation learning, and it learns a joint representation by considering temporal structures. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
Emotion recognition methodologies based on physiological signals are increasingly becoming personalized, due to the subjective responses of different subjects to physical stimuli. Existing works have mainly focused on modelling the physiological corpus of each subject, without considering psychological factors such as interest and personality. The latent correlation among different subjects has also rarely been examined. In this paper, we propose to investigate the influence of personality on emotional behavior in a hypergraph learning framework. Assuming that each vertex is a compound tuple (subject, stimuli), multi-modal hypergraphs can be constructed based on the personality correlation among different subjects and on the physiological correlation among corresponding stimuli. To model the different importance of vertices, hyperedges, and modalities, we assign each of them a weight. Doing so allows the learning procedure to be conducted on the vertex-weighted multi-modal multi-task hypergraphs, thus simultaneously modelling the emotions of multiple subjects. The estimated emotion relevance is employed for emotion recognition. We carry out extensive experiments on the ASCERTAIN dataset, and the results demonstrate the superiority of the proposed method compared to state-of-the-art approaches.
This paper proposes a novel feature-extraction framework for inferring impressed personality traits, emergent leadership skills, communicative competence and hiring decisions. The proposed framework extracts multimodal features, describing each participant's nonverbal activities. It captures inter-modal and inter-person relationships in interaction and captures how the target interactor generates nonverbal behavior when the other interactors also generate the nonverbal behavior. The inter-modal and inter-personal patterns are identified as frequent co-occurring events based on graph clustering from multimodal sequences. The framework can be applied to any type of interaction task. The proposed framework is applied to the SONVB corpus, which is an audio-visual dataset collected from dyadic job interviews, and the ELEA audio-visual data corpus, which is a dataset collected from group meetings. We evaluate the framework on a binary classification task of 15 impression variables in two data corpora. The experimental results show that the model trained with co-occurrence features is more accurate than previous models for 14 out of 15 traits.
The multimedia community has witnessed the rise of deep learning based techniques for analyzing multimedia content more effectively. In the past decade, the convergence of deep learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection, and regression, and has also fundamentally changed the landscape of several relatively new areas, such as semantic segmentation, captioning, and content generation. This paper aims to review the development path of major tasks in multimedia analytics and then to look ahead to future directions. We start by summarizing the fundamental deep learning techniques related to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. Moreover, the performance review on popular benchmarks traces the path of technological advancement and helps identify both milestone works and future directions.
Learning robust and representative features across multiple modalities has been a fundamental problem in the machine learning and multimedia fields. In this paper, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of the different modalities jointly by exploiting the correlation between their hidden representations, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of different modalities. Compared to conventional solutions relying on hand-crafted features, the proposed MUCAE approach encodes features directly from image pixels and text characters and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world large-scale multimedia databases. Experimental results show that MUCAE performs better than state-of-the-art methods.
Deep convolutional neural networks (CNNs) have achieved remarkable results in computer vision tasks for end-to-end learning. We evaluate here the power of a deep CNN to learn robust features from raw EEG data to detect seizures. Seizures are hard to detect, as they vary both inter- and intra-patient. In this paper, we use a deep CNN model for the seizure detection task on an open-access EEG epilepsy dataset collected at the Children's Hospital Boston. Our deep learning model is able to extract spectral and temporal features from EEG epilepsy data and use them to learn a general structure of a seizure that is less sensitive to variations. Our method produced an overall sensitivity of 90.00%, specificity of 91.65%, and accuracy of 98.05% for the whole dataset of 23 patients. Hence, it can be used as an excellent cross-patient classifier. The results show that our model performs better than previous state-of-the-art models for the cross-patient seizure detection task. The proposed model can also visualize the spatial orientation of band power features. We use correlation maps to relate spectral amplitude features to the output in the form of images. By using the results from our deep learning model, this visualization method can serve as an effective multimedia tool for producing quick and relevant brain mapping images that medical experts can use for further investigation.
Transfer learning, which focuses on finding a favorable representation for instances of different domains based on auxiliary data, can mitigate the divergence between domains through knowledge transfer. Recently, increasing efforts on transfer learning have employed deep neural networks (DNNs) to learn more robust and higher-level feature representations to better tackle cross-media disparity. However, only a few papers consider the correction and semantic matching between multi-layer heterogeneous domain networks. In this paper, we propose a deep semantic mapping model for heterogeneous multimedia transfer learning (DHTL) using co-occurrence data. More specifically, we integrate the DNN with canonical correlation analysis (CCA) to derive a deep correlation subspace as the joint semantic representation for associating data across different domains. In the proposed DHTL, a multi-layer correlation matching network across domains is constructed, in which CCA is used to bridge each pair of domain-specific hidden layers. To train the network, a joint objective function is defined and the optimization processes are presented. Once the deep semantic representation is obtained, the shared features of the source domain are transferred for task learning in the target domain. Extensive experiments on three multimedia recognition applications demonstrate that the proposed DHTL can effectively find deep semantic representations for heterogeneous domains and is superior to several existing state-of-the-art methods for deep transfer learning.
The increasing number of multimedia data collections available today evinces the pressing need for methods capable of indexing and retrieving this content. Despite the continuous advances in multimedia features and representation models, establishing an effective measure for comparing different multimedia objects remains a challenging task. While supervised and semi-supervised techniques have made relevant advances on similarity learning tasks, scenarios where labeled data is non-existent require different strategies. In such situations, unsupervised learning has been established as a promising solution, capable of considering the contextual information and the dataset structure for computing new similarity/dissimilarity measures. This paper extends a recent unsupervised learning algorithm that uses an iterative re-ranking strategy to take advantage of different kNN sets and rank correlation measures. Two novel approaches are proposed for computing the kNN sets and their corresponding top-k lists. The proposed approaches were validated in conjunction with various rank correlation measures, yielding superior effectiveness in comparison with previous works. In addition, we evaluate the ability of the method to handle different multimedia objects, conducting an extensive experimental evaluation on various image and video datasets.
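To make the idea of re-ranking with kNN sets and a rank correlation measure more concrete, here is a minimal sketch of a single re-ranking pass. It is a generic illustration, not the algorithm extended in the paper: the Jaccard index over top-k sets stands in for the rank correlation measure, and the names `jaccard_at_k` and `rerank` as well as the toy rankings are assumptions.

```python
def jaccard_at_k(ranking_a, ranking_b, k):
    """Jaccard index between the top-k sets of two ranked lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)

def rerank(query, rankings, k):
    """Reorder the query's ranked list by how similar each item's top-k set is
    to the query's own top-k set (most similar first; ties keep original order)."""
    query_ranking = rankings[query]
    return sorted(
        query_ranking,
        key=lambda item: -jaccard_at_k(query_ranking, rankings[item], k),
    )

# Usage: each item maps to its ranked list of neighbours, most similar first.
toy_rankings = {
    "q": ["q", "a", "b", "c", "d"],
    "a": ["a", "c", "d", "q", "b"],
    "b": ["b", "q", "a", "c", "d"],
    "c": ["c", "d", "a", "b", "q"],
    "d": ["d", "c", "b", "a", "q"],
}
print(rerank("q", toy_rankings, k=3))
```

In the toy data, item "b" shares its whole top-3 neighbourhood with the query while "a" shares only one element, so "b" moves ahead of "a" after the pass. An iterative scheme would recompute all ranked lists from the new similarities and repeat.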
Bilinear models are very powerful in multimodal fusion tasks such as Visual Question Answering (VQA). The predominant bilinear methods can all be seen as a kind of tensor-based decomposition operation that contains a key kernel called the core tensor. Current approaches usually focus on reducing the computational complexity by imposing a low-rank constraint on the core tensor. In this paper, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods but also conducts sparse bilinear interactions between modalities. Our method is based on the Block Term Decomposition theory of tensors, which results in a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many "tiny" bilinear operations in different feature spaces. Thus, introducing sparsity into the bilinear operation can significantly increase the performance of feature fusion and improve VQA models. What's more, our BTDP is very flexible in design. We develop several variants of BTDP and discuss the effects of the diagonal blocks of the core tensor. Extensive experiments on the two challenging datasets VQA-v1 and VQA-v2 show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.
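The equivalence stated above, between a block-diagonal core tensor and many small bilinear operations, can be checked numerically on a toy case. The sketch below is only an illustration under simplifying assumptions: it omits the input and output projection matrices that a full decomposition would include, uses nested Python lists as tensors, and all function names are hypothetical.

```python
def bilinear(core, x, y):
    """Bilinear map with a 3-way core: z[k] = sum_i sum_j core[i][j][k] * x[i] * y[j]."""
    n_out = len(core[0][0])
    return [
        sum(core[i][j][k] * x[i] * y[j]
            for i in range(len(x)) for j in range(len(y)))
        for k in range(n_out)
    ]

def block_diagonal(blocks):
    """Place the 3-way blocks along the diagonal of an otherwise zero tensor."""
    size_i = sum(len(b) for b in blocks)
    size_j = sum(len(b[0]) for b in blocks)
    size_k = sum(len(b[0][0]) for b in blocks)
    full = [[[0] * size_k for _ in range(size_j)] for _ in range(size_i)]
    i0 = j0 = k0 = 0
    for b in blocks:
        for i, plane in enumerate(b):
            for j, fiber in enumerate(plane):
                for k, value in enumerate(fiber):
                    full[i0 + i][j0 + j][k0 + k] = value
        i0, j0, k0 = i0 + len(b), j0 + len(b[0]), k0 + len(b[0][0])
    return full

def blockwise_bilinear(blocks, x, y):
    """Apply each small block to its own slice of x and y, then concatenate."""
    out = []
    i0 = j0 = 0
    for b in blocks:
        di, dj = len(b), len(b[0])
        out.extend(bilinear(b, x[i0:i0 + di], y[j0:j0 + dj]))
        i0, j0 = i0 + di, j0 + dj
    return out

# Usage: two 2x2x2 blocks acting on 4-dimensional inputs.
toy_blocks = [
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    [[[1, 0], [0, 1]], [[2, 1], [1, 2]]],
]
x_vec, y_vec = [1, 2, 3, 4], [5, 6, 7, 8]
print(bilinear(block_diagonal(toy_blocks), x_vec, y_vec) == blockwise_bilinear(toy_blocks, x_vec, y_vec))
```

The block-wise version touches only the non-zero entries, which is where the sparsity saves computation: with R equal blocks, the number of multiplications drops by roughly a factor of R squared relative to the dense core of the same overall size.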
This paper introduces Film Editing Patterns (FEP), a language to formalize film editing practices and stylistic choices found in movies. FEP constructs are constraints expressed over one or more shots from a movie sequence that characterize changes in cinematographic visual properties such as the shot size, region, and angle of on-screen actors. We first present the elements of the FEP language, then introduce its usage in annotated film data, and finally describe how it can support users in the creative design of film sequences in 3D. More specifically: (i) we present an application to craft edited filmic sequences from 3D animated scenes that uses FEPs to support the user in selecting camera framings and editing choices that follow certain best practices used in cinema; (ii) we conduct an evaluation of the application with professional and non-professional filmmakers. The evaluation suggests that users generally appreciate the idea of FEP, and that it can effectively help novice and moderately experienced users craft film sequences with little training and satisfying results.
In this paper, we present convolutional attention networks (CAN) for unconstrained scene text recognition. Recent dominant approaches to scene text recognition are mainly based on convolutional neural networks (CNN) and recurrent neural networks (RNN), where the CNN encodes images and the RNN generates character sequences. Unlike these methods, our CAN is built entirely on CNNs combined with an attention mechanism. The distinctive characteristics of our method include: (1) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN. (2) The attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling. (3) Position embeddings are used in both the spatial encoder and the sequence decoder to give our networks a sense of location. We conduct experiments on standard datasets for scene text recognition, including the Street View Text, IIIT5K, and ICDAR datasets. The experimental results validate the effectiveness of the different components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior works, even without the use of an RNN.
Facial landmarking is a fundamental task in automatic machine-based face analysis. The majority of existing techniques for such a problem are based on 2D images; however, they suffer from illumination and pose variations that may largely degrade landmarking performance. The emergence of 3D data theoretically provides an alternative to overcome these weaknesses in the 2D domain. This paper proposes a novel approach to 3D facial landmarking, which combines both the advantages of feature based methods as well as model based ones in a progressive coarse-to-fine manner (initial, intermediate and fine stages). For the initial stage, a few fiducial landmarks (i.e. the nose tip and two inner eye corners) are robustly detected through curvature analysis, and these points are further exploited to initialize the subsequent stage. For the intermediate stage, a statistical model is learned in the feature space of three normal components of the facial point-cloud rather than the smooth original coordinates, namely Active Normal Model (ANM). For the fine stage, cascade regression is employed to locally refine the landmarks according to their geometry attributes. The proposed approach can accurately localize dozens of fiducial points on each 3D face scan, greatly surpassing feature based ones, and it also improves the state of the art of the model based ones in two aspects, i.e., sensitivity to initialization and deficiency in discrimination. The proposed method is evaluated on the BU-3DFE and Bosphorus databases, and competitive results are achieved in comparison with the ones in literature, clearly demonstrating its effectiveness.
Declarative multimedia documents describe multimedia applications in terms of media items and the relationships among them. Relationships specify how media items are dynamically arranged in time and space at runtime. Although a declarative approach usually facilitates the authoring task, authors can still make mistakes due to incorrect use of language constructs or inconsistent or missing relationships in a document. In order to properly support multimedia application authoring, it is important to provide tools with validation capabilities. Document validation can indicate possible inconsistencies in a given document to an author, so that it can be revised before deployment. Although very useful, validation capabilities are not often provided by multimedia authoring tools. This work proposes a multimedia validation approach that relies on a formal model called the Simple Hypermedia Model (SHM). SHM is used to represent a document for the purpose of validation. An SHM document is validated using a hybrid approach based on two complementary techniques. The first captures the document's spatio-temporal layout in terms of its state throughout its execution by means of a rewrite theory, and validation is performed through model checking. The second captures the document layout in terms of intervals and event occurrences by means of SMT (Satisfiability Modulo Theories) formulas, and validation is performed through SMT solving. Because the two approaches have different characteristics, each validation technique complements the other in terms of the expressiveness of SHM and the tests that can be checked. We briefly present validation tools that use our approach. They were evaluated with real NCL and web documents and through usability tests.