Automatic product image classification is a task of crucial importance for better understanding and management of online retail. Motivated by recent advances in deep convolutional neural networks (CNNs) for image classification, in this work we revisit the problem in the context of product images that come with a predefined categorical hierarchy and attributes, aiming to leverage the hierarchy and attributes to further improve classification accuracy. With these structure-aware clues, we argue that more advanced CNN models can be developed beyond the one-versus-all classification performed by conventional CNNs. To this end, the novel efforts of this work include: developing a salient-sensitive CNN that focuses more on the product foreground by inserting a spatial attention layer at a proper location; proposing a multi-class regression based refinement method that is expected to generate more accurate predictions by utilizing prediction scores from multiple preceding CNNs, each corresponding to a distinct classifier on one categorical layer of the hierarchy; and devising a multi-task deep learning architecture that effectively explores correlations among the categories and attributes for better categorical label prediction. Experimental results on nearly one million real-world product images validate the effectiveness of the proposed efforts, both jointly and individually, with performance gains observed in each case.
Person re-identification aims at identifying a given pedestrian across non-overlapping multi-camera networks at different times and places. Existing person re-identification approaches mainly focus on matching pedestrians in still images; little attention has been paid to person re-identification in videos. Compared to images, video clips contain the motion of pedestrians, which is crucial to re-identification. Moreover, consecutive video frames present pedestrian appearance in different poses and from different viewpoints, providing valuable information for addressing the challenges of pose variation, occlusion, and viewpoint change. In this paper, we propose a Dense 3D-Convolutional Network (D3DNet) to jointly learn spatio-temporal and appearance features for person re-identification in videos. The D3DNet consists of multiple 3D dense blocks and transition layers. The 3D dense blocks enlarge the receptive fields of visual neurons in the spatial and temporal dimensions, leading to a discriminative appearance representation as well as short-term and long-term motion information of pedestrians, without requiring an additional motion estimation module. Moreover, we propose an improved loss function consisting of an identification loss and a center loss to minimize intra-class variance and maximize inter-class variance simultaneously, addressing the large intra-class variance and small inter-class variance that are common in the person re-identification task. Extensive experiments on two widely used surveillance video datasets, MARS and iLIDS-VID, show the effectiveness of the proposed approach.
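As a rough illustration of the combined objective described above, the sketch below adds a softmax cross-entropy identification loss to a center loss that pulls each feature toward the center of its identity. The weighting factor `lam`, the function names, and the use of plain Python lists are assumptions made for this example; the abstract does not specify how the two terms are balanced or how the centers are updated.

```python
import math

def softmax_cross_entropy(logits, label):
    """Identification loss: cross-entropy of softmax(logits) against the true identity."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(logits, feature, label, centers, lam=0.01):
    """Identification loss plus a weighted center loss (lam is an assumed hyperparameter)."""
    return softmax_cross_entropy(logits, label) + lam * center_loss(feature, centers[label])
```

The identification term separates identities from one another, while the center term shrinks the spread of features that share an identity; a larger `lam` puts more emphasis on compact classes.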
Scene classification is a challenging problem. Compared with object images, scene images are more abstract, since they are composed of objects. Object and scene images have different characteristics, with different scales and composition structures. How to effectively integrate local mid-level semantic representations that include both object and scene concepts is an important aspect of scene classification that needs to be investigated. In this paper, the idea of a shared codebook is introduced by integrating deep learning, concept features, and local feature encoding techniques. More specifically, the shared local feature codebook is generated from the combined ImageNet1000 and Places365 concepts (Mixed1365), using convolutional neural networks. As the Mixed1365 features cover the semantic information of both object and scene concepts, we can extract from them a shared codebook that contains only a subset of the 1365 concepts while keeping the same codebook size. The shared codebook not only provides complementary representations without additional codebook training, but can also be adaptively extracted for different scene classification tasks. A method that combines the original codebook and the shared codebook is proposed for scene classification. In this way, more comprehensive and representative image features can be generated for classification. Extensive experiments conducted on two public datasets validate the effectiveness of the proposed method. In addition, some useful observations are reported that show the advantage of the shared codebook.
Dynamic Adaptive Streaming over HTTP (DASH) is a popular over-the-top video content distribution technique that adapts the streaming session to the user's network condition, typically in terms of downlink bandwidth. This video quality adaptation can be achieved by scaling the frame quality, spatial resolution, or frame rate. Despite the flexibility in video quality scaling methods, each of these quality scaling dimensions has a different effect on the Quality of Experience (QoE) for end users. Furthermore, in video streaming, changes in motion over time, along with the scaling method employed, influence QoE; hence the need to carefully tailor scaling methods to the streaming application and content type. In this work, we investigate an intelligent DASH approach for the latest video coding standard, H.265, and propose a heuristic QoE-aware, cost-efficient adaptation scheme that does not switch unnecessarily to the highest quality level but rather stays temporarily at an intermediate quality level in certain streaming scenarios. Such an approach achieves a comparable and consistent level of quality under impaired network conditions, as commonly found in Internet and mobile networks, whilst reducing bandwidth requirements and quality switching overhead. The rationale is based on our empirical experiments, which show that an increase in bitrate does not necessarily mean a noticeable improvement in QoE. Furthermore, our work demonstrates that the Signal-to-Noise Ratio (SNR) and spatial resolution scalability types are the best fit for our proposed algorithm. Finally, we demonstrate an innovative interaction between quality scaling methods and the polarity of switching operations. The proposed QoE-aware scheme is implemented, and empirical results show that it is able to reduce bandwidth requirements by up to 41% whilst achieving equivalent QoE compared with a representative DASH reference implementation.
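The idea of staying at an intermediate quality level when a higher bitrate brings no noticeable QoE gain can be sketched as follows. This is a simplified illustration and not the scheme from the paper: the ladder format, the QoE scores, and the `min_gain` threshold are all hypothetical.

```python
def pick_representation(ladder, bandwidth_kbps, min_gain=0.2):
    """Pick the cheapest affordable representation whose QoE is close to the best affordable QoE.

    ladder: list of (bitrate_kbps, estimated_qoe) tuples sorted by ascending bitrate.
    min_gain: assumed threshold below which a QoE difference is treated as not noticeable.
    """
    affordable = [rep for rep in ladder if rep[0] <= bandwidth_kbps]
    if not affordable:
        return ladder[0]  # nothing fits, fall back to the lowest bitrate
    best_qoe = max(qoe for _, qoe in affordable)
    for rep in affordable:  # ascending bitrate, so the first match is the cheapest
        if best_qoe - rep[1] < min_gain:
            return rep
    return affordable[-1]

# Hypothetical bitrate ladder with made-up QoE scores on a 1-5 scale.
ladder = [(500, 2.5), (1500, 3.8), (3000, 4.3), (6000, 4.4)]
choice = pick_representation(ladder, bandwidth_kbps=7000)
```

With this made-up ladder, the 6000 kbps level is affordable but only 0.1 QoE points better than the 3000 kbps level, so the sketch stays at 3000 kbps and saves half the bandwidth.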
This paper deals with the problem of electric network frequency (ENF) estimation, where a low signal-to-noise ratio (SNR) is a central challenge. By exploiting the low-rank structure of the ENF signal in the audio spectrogram, we propose an approach based on robust principal component analysis to remove the interference from speech content and some of the background noise, which in our case can be regarded as sparse in nature. Weighted linear prediction is then applied to the low-rank signal subspace to obtain an accurate ENF estimate. The performance of the proposed scheme is analyzed and evaluated as a function of SNR, and the Cramér-Rao lower bound (CRLB) is approached at SNR levels above -10 dB. Experiments on real datasets demonstrate the advantages of the proposed method over state-of-the-art works in terms of estimation accuracy. Specifically, the proposed scheme can effectively capture the ENF fluctuations along the time axis using a small number of signal observations while preserving sufficient frequency precision.
Surveillance camera-based person re-identification remains challenging because of diverse factors such as people's changing poses and varying illumination. The varying poses make it hard to conduct feature matching across images, and the illumination changes make color-based features unreliable. In this paper, we present SKEPRID, a skeleton-based person re-identification method that handles strong pose and illumination changes jointly. To reduce the impact of pose changes on re-identification, we estimate the joint positions of a person using a deep learning technique, which makes it possible to extract features on specific body parts with high accuracy. Based on the skeleton information, we design a set of local color comparison-based cloth type features, which are robust to various lighting conditions. Moreover, to better evaluate SKEPRID, we build the PO&LI dataset, which has large pose and illumination diversity. Our experimental results show that SKEPRID outperforms state-of-the-art approaches in the case of strong pose and illumination variation.
Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they provide rich clues about object contexts within images. In this paper, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, with the aim of explicitly modeling spatial object contexts within images. In particular, for every object proposal obtained from the input image, we extract high-level semantic object features with convolutional neural networks. Then, we explicitly apply gate units to these extracted features for important object selection and noise removal. The selected object features are organized into the proposed SFM, which is a compact and discriminative representation that preserves the spatial information among objects. Finally, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) as classifiers on top of the SFM for content recognition, which are expected to exploit the spatial relationships among objects. We also introduce a novel multi-task learning framework to help learn the model parameters in the training phase. It consists of a basic image classification loss in cross-entropy form, an object localization loss to guide important object selection, and a grid labeling loss to predict object labels at SFM grids. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach, and very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. In addition, the SFMs learned in the image domain are transferred to video classification on the CCV and FCVID benchmarks, and the results demonstrate the robustness and generalization capability of the approach.
Baselines are the starting point of any quantitative multimedia research, and benchmarks are essential for pushing those baselines further. In this paper, we present baselines for the artistic domain with a new benchmark dataset featuring over 2 million images with rich structured metadata dubbed OmniArt. It contains annotations for dozens of attribute types and features semantic context information through concepts, IconClass labels, color information, and (limited) object level bounding boxes. We establish and present baseline scores on multiple tasks like artist attribution, creation period estimation, type, style, and school prediction. As an example of additional types of analyses, we explore the color spaces of art through different types and evaluate a transfer learning object recognition pipeline.
YouTube is one of the most popular platforms for streaming user-generated video. Nowadays, professional YouTubers are organized in so-called multi-channel networks (MCNs). These networks offer services like brand deals, equipment, and strategic advice in exchange for a share of the YouTubers' revenue. A major strategy to gain more subscribers and, hence, revenue is collaborating with other YouTubers. Yet, collaborations on YouTube have not been studied in a detailed quantitative manner. This paper aims to close this gap. To this end, we make three contributions. First, we collect a YouTube dataset covering video statistics over three months for 7,942 channels. Second, we design a framework, named CATANA, that uses a Deep Neural Network (DNN) based approach to detect a previously unknown number of persons in videos in order to analyze collaborations in YouTube videos. Third, we analyze about seven years of video content and use CATANA to answer research questions that provide guidance for YouTubers and MCNs on efficient collaboration strategies. In doing so, we focus on (i) collaboration frequency and partner selectivity, (ii) the influence of MCNs on channel collaborations, (iii) collaborating channel types, and (iv) the impact of collaborations on video and channel popularity. Our results show that collaborations are in many cases significantly positive for the collaborating channels, often showing more than 100% popularity growth compared with non-collaboration videos.
In this paper, we propose video delivery schemes ensuring a delivery latency of around one second. For this purpose, we use Dynamic Adaptive Streaming over HTTP (DASH), which is a standard version of HTTP Live Streaming (HLS), so as to benefit from video representation switching between successive video segments. We also propose HTTP/2-based algorithms to apply video frame discarding policies inside a video segment. When a selected DASH representation does not match the available network resources, current solutions suffer from rebuffering events. Rebuffering not only impacts the Quality of Experience (QoE) but also increases the delivery delay between the displayed and the original video streams. We observe that rebuffering-based solutions may increase the delivery delay by 1.5 s to 2 s inside a six-second video segment. In this work, we develop optimal and practical algorithms that respect the targeted one-second latency. In all algorithms, we selectively drop the least meaningful video frames using the HTTP/2 stream resetting feature. A large number of missing video frames results in a break in temporal fluidity known as video jitter: the displayed video appears as a series of snapshots. Our simulations show that we respect the targeted one-second latency while ensuring an acceptable video quality, with a Peak Signal to Noise Ratio (PSNR) of at least 30 dB. We also quantify and qualify the resulting jitter for each algorithm. We show that both the optimal and the practical algorithms we propose decrease the impact of jitter on the displayed videos. For example, 97% of the optimal algorithm's outputs and 87% of the practical algorithm's outputs are considered acceptable, compared to only 57% of the outputs of the basic First In First Out (FIFO) algorithm.
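To make the frame-discarding idea concrete, the sketch below greedily drops the least important frames of a segment until the remaining bytes fit a delivery budget. It is a hypothetical greedy policy written for illustration only, not either of the paper's algorithms: the importance scores (for instance 3 for I-frames, 2 for P-frames, 1 for B-frames) and the byte budget are assumptions, and decoding dependencies between frames and the HTTP/2 stream reset itself are not modeled.

```python
def select_frames(frames, budget_bytes):
    """Greedily drop the least important frames until the segment fits the budget.

    frames: list of (index, size_bytes, importance) tuples; a higher importance
    means the frame matters more (assumed scoring).
    Returns the sorted indexes of the frames that are kept.
    """
    kept = list(frames)
    total = sum(size for _, size, _ in kept)
    # Visit frames from least to most important; the sort is stable, so earlier
    # frames are dropped first among frames of equal importance.
    for frame in sorted(frames, key=lambda f: f[2]):
        if total <= budget_bytes:
            break
        kept.remove(frame)
        total -= frame[1]
    return sorted(index for index, _, _ in kept)

# Hypothetical segment: one large key frame and three smaller frames.
segment = [(0, 5000, 3), (1, 1000, 1), (2, 2000, 2), (3, 1000, 1)]
kept = select_frames(segment, budget_bytes=7500)
```

In a real sender, the budget would follow from the measured throughput and the one-second latency target, and each dropped frame would correspond to cancelling the stream that carries it.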
The superiority of deeply learned pedestrian representations has been reported in the very recent literature on person re-identification (re-ID). In this paper, we consider the more pragmatic issue of learning a deep feature with no labels or only a few. We propose a progressive unsupervised learning (PUL) method to transfer pretrained deep representations to unseen domains. Our method is easy to implement and can be viewed as an effective baseline for unsupervised re-ID feature learning. Specifically, PUL iterates between 1) pedestrian clustering and 2) fine-tuning of the convolutional neural network (CNN) to improve the original model trained on an irrelevant labeled dataset. Since the clustering results can be very noisy, we add a selection operation between the clustering and the fine-tuning. At the beginning, when the model is weak, the CNN is fine-tuned on a small number of reliable examples that lie near the cluster centroids in the feature space. As the model becomes stronger in subsequent iterations, more images are adaptively selected as CNN training samples. Progressively, the pedestrian clustering and the CNN model are improved simultaneously until the algorithm converges. This process is naturally formulated as self-paced learning. We then point out promising directions that may lead to further improvement. Extensive experiments on three large-scale re-ID datasets demonstrate that PUL outputs discriminative features that improve re-ID accuracy. Our code has been released at https://github.com/hehefan/Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning.
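The selection step between clustering and fine-tuning can be sketched as below: after clustering, only samples whose features lie close to their assigned centroid are kept as pseudo-labeled training data. The use of cosine similarity and the `threshold` value are assumptions made for this illustration; the released code linked above is the authoritative reference for the actual criterion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_reliable(features, assignments, centroids, threshold=0.85):
    """Return (sample_index, cluster_id) pairs for samples close to their centroid.

    threshold is an assumed reliability cut-off on cosine similarity.
    """
    selected = []
    for i, (feat, k) in enumerate(zip(features, assignments)):
        if cosine(feat, centroids[k]) > threshold:
            selected.append((i, k))
    return selected

# Toy 2-D features: sample 2 sits far from its centroid and is filtered out.
features = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
assignments = [0, 0, 0, 1]
centroids = [[1.0, 0.0], [0.0, 1.0]]
reliable = select_reliable(features, assignments, centroids)
```

As the fine-tuned model improves, features of the same identity move closer to their centroid, so the same threshold admits more samples in later iterations, which matches the self-paced behavior described in the abstract.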
Methodologies for emotion recognition from physiological signals are increasingly becoming personalized, owing to the subjective responses of different subjects to physical stimuli. Existing works have mainly focused on modelling the physiological corpus of each subject, without considering psychological factors such as interest and personality. The latent correlation among different subjects has also rarely been examined. In this paper, we propose to investigate the influence of personality on emotional behavior in a hypergraph learning framework. Assuming that each vertex is a compound tuple (subject, stimuli), multi-modal hypergraphs can be constructed based on the personality correlation among different subjects and on the physiological correlation among corresponding stimuli. To model the different importance of vertices, hyperedges, and modalities, we assign a weight to each of them. Doing so allows the learning procedure to be conducted on vertex-weighted multi-modal multi-task hypergraphs, thus simultaneously modelling the emotions of multiple subjects. The estimated emotion relevance is employed for emotion recognition. We carry out extensive experiments on the ASCERTAIN dataset, and the results demonstrate the superiority of the proposed method compared to state-of-the-art approaches.
This paper proposes a novel feature-extraction framework for inferring impressed personality traits, emergent leadership skills, communicative competence and hiring decisions. The proposed framework extracts multimodal features, describing each participant's nonverbal activities. It captures inter-modal and inter-person relationships in interaction and captures how the target interactor generates nonverbal behavior when the other interactors also generate the nonverbal behavior. The inter-modal and inter-personal patterns are identified as frequent co-occurring events based on graph clustering from multimodal sequences. The framework can be applied to any type of interaction task. The proposed framework is applied to the SONVB corpus, which is an audio-visual dataset collected from dyadic job interviews, and the ELEA audio-visual data corpus, which is a dataset collected from group meetings. We evaluate the framework on a binary classification task of 15 impression variables in two data corpora. The experimental results show that the model trained with co-occurrence features is more accurate than previous models for 14 out of 15 traits.
The multimedia community has witnessed the rise of deep learning based techniques for analyzing multimedia content more effectively. In the past decade, the convergence of deep learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection, and regression, and has also fundamentally changed the landscape of several relatively new areas, such as semantic segmentation, captioning, and content generation. This paper aims to review the development path of major tasks in multimedia analytics and then to look ahead to future directions. We start by summarizing the fundamental deep learning techniques related to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. Moreover, a performance review on popular benchmarks traces the path of technological advancement and helps identify both the milestone works and future directions.
Learning robust and representative features across multiple modalities has been a fundamental problem in the machine learning and multimedia fields. In this paper, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of the different modalities jointly by exploiting the correlation between their hidden representations, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of the different modalities. Compared to conventional solutions relying on hand-crafted features, the proposed MUCAE approach encodes features from image pixels and text characters directly and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world large-scale multimedia databases. Experimental results show that MUCAE performs better than state-of-the-art methods.
Deep convolutional neural networks (CNNs) have achieved remarkable results in computer vision tasks through end-to-end learning. Here we evaluate the power of a deep CNN to learn robust features from raw EEG data in order to detect seizures. Seizures are hard to detect, as they vary both between patients and within the same patient. In this paper, we use a deep CNN model for the seizure detection task on an open-access EEG epilepsy dataset collected at the Children's Hospital Boston. Our deep learning model is able to extract spectral and temporal features from EEG epilepsy data and use them to learn a general structure of a seizure that is less sensitive to variations. Our method produced an overall sensitivity of 90.00%, specificity of 91.65%, and accuracy of 98.05% for the whole dataset of 23 patients. Hence, it can be used as an excellent cross-patient classifier. The results show that our model performs better than previous state-of-the-art models for the cross-patient seizure detection task. The proposed model can also visualize special orientation of band power features. We use correlation maps to relate spectral amplitude features to the output in the form of images. Using the results from our deep learning model, this visualization method can serve as an effective multimedia tool for producing quick and relevant brain mapping images that medical experts can use for further investigation.
Transfer learning, which focuses on finding a favorable representation for instances of different domains based on auxiliary data, can mitigate the divergence between domains through knowledge transfer. Recently, increasing efforts on transfer learning have employed deep neural network (DNN) to learn more robust and higher level feature representations to better tackle cross-media disparity. However, only a few papers consider the correction and semantic matching between multi-layer heterogeneous domain networks. In this paper, we propose a deep semantic mapping model for heterogeneous multimedia transfer learning (DHTL) using co-occurrence data. More specifically, we integrate the DNN with canonical correlation analysis (CCA) to derive a deep correlation subspace as the joint semantic representation for associating data across different domains. In the proposed DHTL, a multi-layer correlation matching network across domains is constructed, in which the CCA is combined to bridge each pair of domain-specific hidden layers. To train the network, a joint objective function is defined and the optimization processes are presented. When the deep semantic representation is achieved, the shared features of the source domain are transferred for task learning in the target domain. Extensive experiments for three multimedia recognition applications demonstrate that the proposed DHTL can effectively find deep semantic representations for heterogeneous domains, and is superior to the several existing state-of-the-art methods for deep transfer learning.
The increasing number of multimedia data collections available today evinces the pressing need for methods capable of indexing and retrieving this content. Despite continuous advances in multimedia features and representation models, establishing an effective measure for comparing different multimedia objects remains a challenging task. While supervised and semi-supervised techniques have made relevant advances on similarity learning tasks, scenarios where labeled data is non-existent require different strategies. In such situations, unsupervised learning has been established as a promising solution, capable of considering contextual information and the dataset structure for computing new similarity/dissimilarity measures. This paper extends a recent unsupervised learning algorithm that uses an iterative re-ranking strategy to take advantage of different kNN sets and rank correlation measures. Two novel approaches are proposed for computing the kNN sets and their corresponding top-k lists. The proposed approaches were validated in conjunction with various rank correlation measures, yielding superior effectiveness in comparison with previous works. In addition, we evaluate the ability of the method to handle different multimedia objects, conducting an extensive experimental evaluation on various image and video datasets.
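One iteration of re-ranking based on kNN sets and a rank correlation measure can be sketched as follows. The example uses the Jaccard index between top-k lists as the correlation measure and a plain dictionary of ranked lists as input; both are assumptions chosen for brevity, and the two kNN-set constructions proposed in the paper are not reproduced here.

```python
def jaccard(a, b):
    """Jaccard index between two top-k lists, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def rerank(ranked_lists, k):
    """One re-ranking iteration.

    ranked_lists: dict mapping each item to its ranked list of items
    (most similar first). Each list is re-sorted by how strongly the
    top-k neighborhood of the query overlaps the top-k neighborhood
    of each candidate.
    """
    new_lists = {}
    for query, ranking in ranked_lists.items():
        query_knn = ranking[:k]
        scores = {x: jaccard(query_knn, ranked_lists[x][:k]) for x in ranking}
        # sorted() is stable, so ties keep their original relative order.
        new_lists[query] = sorted(ranking, key=lambda x: -scores[x])
    return new_lists

# Toy collection of four items with hypothetical initial rankings.
lists = {
    "a": ["a", "c", "b", "d"],
    "b": ["b", "a", "c", "d"],
    "c": ["c", "d", "a", "b"],
    "d": ["d", "c", "b", "a"],
}
reranked = rerank(lists, k=3)
```

In this toy case item "b" shares its whole top-3 neighborhood with "a", so it is promoted ahead of "c" in the ranked list of "a". An iterative scheme would feed the new lists back into `rerank` for further rounds.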
This paper introduces Film Editing Patterns (FEP), a language to formalize film editing practices and stylistic choices found in movies. FEP constructs are constraints expressed over one or more shots from a movie sequence that characterize changes in cinematographic visual properties such as shot size, region, angle of on-screen actors. We first present the elements of the FEP language, then introduce its usage in annotated film data, and finally describe how it can support users in the creative design of film sequences in 3D, more specifically: (i) we present an application to craft edited filmic sequences from 3D animated scenes that uses FEPs to support the user in selecting camera framings and editing choices that follow certain best practices used in cinema; (ii) we conduct an evaluation of the application with professional and non-professional filmmakers. The evaluation suggests that users generally appreciate the idea of FEP, and that it can effectively help novice and medium experienced users in crafting film sequences with little training and satisfying results.
In this paper, we present convolutional attention networks (CAN) for unconstrained scene text recognition. Recent dominant approaches for scene text recognition are mainly based on convolutional neural networks (CNN) and recurrent neural networks (RNN), where the CNN encodes images and the RNN generates character sequences. Our CAN differs from these methods in that it is built entirely on CNNs combined with an attention mechanism. The distinctive characteristics of our method include: (1) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN. (2) The attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling. (3) Position embeddings are used in both the spatial encoder and the sequence decoder to give our networks a sense of location. We conduct experiments on standard datasets for scene text recognition, including the Street View Text, IIIT5K, and ICDAR datasets. The experimental results validate the effectiveness of the different components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior works, even without the use of an RNN.
Recently, a series of attempts have incorporated the spatial attention mechanism into the task of image captioning, achieving a remarkable improvement in the quality of generated captions. However, the traditional spatial attention mechanism uses latent and delayed semantic representations to decide which area deserves more attention, resulting in inaccurate semantic guidance and the introduction of redundant information. To improve the spatial attention mechanism, we propose the semantic guidance attention (SGA) mechanism in this paper. Specifically, SGA utilizes the current hidden state and semantic word representations to provide intuitive semantic guidance for focusing accurately on semantically related regions. Moreover, we lower perplexity and generate fluent sentences by updating the attention information in time. In addition, the beam search algorithm is widely used to predict words in sequence generation. This algorithm generates a sentence according to the probabilities of words, so it tends to output a generic sentence and discard more distinctive captions. To overcome this limitation, we design the consensus selection (CS) strategy to choose the most descriptive and informative caption, which is selected by the semantic similarity of captions instead of the probabilities of words. The consensus caption is the one with the highest cumulative semantic similarity with respect to the reference captions. Our proposed model (SGA-CS) is validated on Flickr30k and MSCOCO, where it outperforms state-of-the-art approaches. To the best of our knowledge, SGA-CS is the first attempt to jointly produce semantic attention guidance and select descriptive captions for the image captioning task, achieving one of the best performances among all methods trained with cross-entropy.
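The consensus selection rule, picking the candidate with the highest cumulative similarity to the reference captions, can be illustrated with the short sketch below. The word-overlap similarity is a deliberately simple stand-in for the semantic similarity used in the paper, and the example captions are invented for this illustration.

```python
def word_overlap(a, b):
    """Stand-in similarity: Jaccard index over lowercase word sets (an assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def consensus_caption(candidates, references, sim=word_overlap):
    """Return the candidate with the highest cumulative similarity to the references."""
    return max(candidates, key=lambda c: sum(sim(c, r) for r in references))

# Invented beam-search candidates and reference captions.
candidates = [
    "a man riding a wave",
    "a man on a surfboard riding a wave",
    "a person in the water",
]
references = [
    "a surfer riding a wave on a surfboard",
    "man on surfboard rides a large wave",
]
best = consensus_caption(candidates, references)
```

Because the score depends on agreement with the references rather than on word probabilities, the more specific second candidate wins here even if beam search had ranked the shorter, more generic caption first.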
In this paper, a two-stage refinement network is proposed for facial landmark detection under unconstrained conditions. Our model is divided into two modules, namely the Head Attitude Classification (HAC) module and the Domain-Specific Refinement (DSR) module. Given an input facial image, HAC adopts a multi-task learning mechanism to detect the head pose and obtain an initial shape. Based on the obtained head pose, DSR uses three different CNN-based refinement networks, each trained on a specific domain, and automatically selects the most suitable network for landmark refinement. In our framework, HAC combines head pose classification with facial landmark detection to improve the accuracy of head pose estimation as well as to obtain a robust initial shape. Moreover, an adaptive sub-network training strategy applied in the DSR module effectively addresses the issue in traditional multi-view methods that an improperly selected sub-network may result in alignment failure. Extensive experimental results on two public datasets, AFLW and 300W, confirm the validity of our model.
As a 3D extension of the High Efficiency Video Coding (HEVC) standard, 3D-HEVC was developed to improve the coding efficiency of multi-view videos. It inherits the prediction modes from HEVC, yet both motion estimation (ME) and disparity estimation (DE) are required for the coding of dependent views. This improves coding efficiency at the cost of a heavy computational burden. In this paper, an early Merge mode decision approach based on a joint prior and posterior probability model is proposed for the coding of dependent texture views and dependent depth maps in 3D-HEVC. First, the prior probability model is established by exploiting the hierarchical and inter-view correlations of previously encoded blocks. Second, the posterior probability model is built using the coded block flag (CBF) of the current coding block. Finally, the joint prior and posterior probability model is adopted to terminate the Merge mode decision early for both dependent texture views and dependent depth maps. Experimental results show that the proposed approach saves 45.2% and 30.6% of the encoding time on average for dependent texture views and dependent depth maps, respectively, with negligible loss of coding efficiency.
Declarative multimedia documents describe multimedia applications in terms of media items and the relationships among them. Relationships specify how media items are dynamically arranged in time and space at runtime. Although a declarative approach usually facilitates the authoring task, authors can still make mistakes due to incorrect use of language constructs or inconsistent or missing relationships in a document. In order to properly support multimedia application authoring, it is important to provide tools with validation capabilities. Document validation can indicate possible inconsistencies in a given document to an author, so that it can be revised before deployment. Although very useful, validation capabilities are not often provided by multimedia authoring tools. This work proposes a multimedia validation approach that relies on a formal model called the Simple Hypermedia Model (SHM). SHM is used for representing a document for the purpose of validation. An SHM document is validated using a hybrid approach based on two complementary techniques. The first captures the document's spatio-temporal layout in terms of its state throughout its execution by means of a rewrite theory, and validation is performed through model checking. The second captures the document layout in terms of intervals and event occurrences by means of SMT (Satisfiability Modulo Theories) formulas, and validation is performed through SMT solving. Because the two approaches have different characteristics, each validation technique complements the other in terms of the expressiveness of SHM and the tests that can be checked. We briefly present validation tools that use our approach. They were evaluated with real NCL and web documents and through usability tests.