Automatic product image classification is of crucial importance for online retailers to better understand and manage their merchandise. Motivated by recent advances of deep convolutional neural networks (CNNs) in image classification, in this work we revisit the problem in the context of product images that come with a predefined category hierarchy and attributes, aiming to leverage the hierarchy and attributes to further improve classification accuracy. With these structure-aware cues, we argue that more advanced CNN models can be developed beyond the one-versus-all classification performed by conventional CNNs. To this end, the novel efforts of this work include: developing a salient-sensitive CNN that focuses more on the product foreground by inserting a spatial attention layer at a proper location; proposing a multi-class regression based refinement method that generates more accurate predictions by utilizing prediction scores from multiple preceding CNNs, each corresponding to a distinct classifier on one layer of the category hierarchy; and devising a multi-task deep learning architecture that effectively explores correlations among the categories and attributes for better category label prediction. Experimental results on nearly one million real-world product images validate the effectiveness of the proposed efforts, both jointly and individually, with consistent performance gains observed.
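As a rough illustration of the kind of spatial attention layer the abstract describes, the following PyTorch sketch reweights a convolutional feature map with a learned per-location saliency mask; the class name, the 1x1-convolution scoring, and the insertion point are our assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Minimal spatial-attention layer (hypothetical): predicts a per-location
    saliency map and reweights the feature map so the foreground dominates."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 conv -> saliency logits

    def forward(self, x):                      # x: (B, C, H, W)
        attn = torch.sigmoid(self.score(x))    # (B, 1, H, W), values in [0, 1]
        return x * attn                        # broadcast mask over channels

# Inserted between two convolutional stages of a backbone, e.g.:
# feat = SpatialAttention(256)(feat)
```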
Person re-identification aims at identifying a certain pedestrian across non-overlapping multi-camera networks at different times and places. Existing person re-identification approaches mainly focus on matching pedestrians in still images; however, little attention has been paid to person re-identification in videos. Compared to images, video clips contain pedestrian motion that is crucial for re-identification. Moreover, consecutive video frames present pedestrian appearance in different poses and from different viewpoints, providing valuable information for addressing the challenges of pose variation, occlusion, and viewpoint change. In this paper, we propose a Dense 3D-Convolutional Network (D3DNet) to jointly learn spatio-temporal and appearance features for person re-identification in videos. The D3DNet consists of multiple 3D dense blocks and transition layers. The 3D dense blocks enlarge the receptive fields of visual neurons in the spatial and temporal dimensions, yielding a discriminative appearance representation as well as short-term and long-term motion information of pedestrians without requiring an additional motion estimation module. Moreover, we propose an improved loss function consisting of an identification loss and a center loss to simultaneously minimize intra-class variance and maximize inter-class variance, addressing the challenge of large intra-class variance and small inter-class variance, a common phenomenon in person re-identification. Extensive experiments on two widely used surveillance video datasets, i.e., MARS and iLIDS-VID, have shown the effectiveness of the proposed approach.
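The combined objective (identification loss plus center loss) is a well-known construction; a minimal PyTorch sketch, with the balancing weight `lam` as an assumed hyperparameter, might look as follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Pulls each feature toward its class center to shrink intra-class variance."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):          # feats: (B, D), labels: (B,)
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

# Combined objective; lam balances the two terms (assumed hyperparameter):
# loss = F.cross_entropy(logits, labels) + lam * center_loss(feats, labels)
```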
Scene classification is a challenging problem. Compared with object images, scene images are more abstract, as they are composed of multiple objects. Object and scene images have different characteristics, with different scales and composition structures. How to effectively integrate local mid-level semantic representations covering both object and scene concepts is therefore an important question for scene classification. In this paper, the idea of a shared codebook is introduced by organically integrating deep learning, concept features, and local feature encoding techniques. More specifically, the shared local feature codebook is generated from the combined ImageNet1000 and Places365 concepts (Mixed1365) using convolutional neural networks. As the Mixed1365 features cover semantic information from both object and scene concepts, we can extract a shared codebook from the Mixed1365 features that uses only a subset of the 1365 concepts while keeping the same codebook size. The shared codebook not only provides complementary representations without additional codebook training, but can also be adaptively extracted for different scene classification tasks. A method combining the original codebook and the shared codebook is proposed for scene classification, so that more comprehensive and representative image features can be generated. Extensive experiments conducted on two public datasets validate the effectiveness of the proposed method. Besides, some useful observations are revealed that further illustrate the advantage of the shared codebook.
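To make the shared-codebook idea concrete, here is a hedged sketch (the function name and the use of k-means are our assumptions) of building a codebook from Mixed1365 local concept features restricted to a task-specific concept subset.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(local_feats, subset_idx, codebook_size=256, seed=0):
    """local_feats: (N, 1365) concept responses per local patch
    (Mixed1365 = ImageNet1000 + Places365); subset_idx: indices of the
    concepts kept for a given task (e.g., object concepts only)."""
    sub = local_feats[:, subset_idx]            # restrict to the chosen concept subset
    km = KMeans(n_clusters=codebook_size, random_state=seed, n_init=10).fit(sub)
    return km.cluster_centers_                  # shared codebook over the subset

# The original codebook uses all 1365 dimensions; a "shared" codebook reuses the
# same local features restricted to a concept subset, so no new feature
# extraction or extra codebook training pass over new data is needed.
```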
We present a novel fine-grained image recognition framework using user click data, which can bridge the semantic gap in distinguishing categories that are visually similar. As the query set is usually large-scale and redundant, we first propose a click-feature-based query merging approach to merge semantically similar queries and construct a compact click feature. Afterwards, we utilize this compact click feature and a Convolutional Neural Network (CNN) based deep visual feature to jointly represent an image. Finally, with the combined feature, we employ a metric learning based template matching scheme for efficient recognition. Considering the heavy noise in the training data, we introduce a reliability variable to characterize image reliability and propose a Weakly supervised Metric and Template Learning with Deep feature and Click data (WMTLDC) method to jointly learn the distance metric, object templates, and image reliability. Extensive experiments are conducted on the public Clickture-Dog dataset. They show that click-data-based query merging helps generate a highly compact click feature for images (the dimension is reduced to 0.9% of the original), which greatly improves computational efficiency. Moreover, introducing this click feature boosts recognition accuracy by more than 20% compared to using the CNN feature alone. The proposed framework performs much better than previous state-of-the-art methods in fine-grained recognition tasks.
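As an illustration of the metric-learning-based template matching step, the following sketch (the concatenation scheme and the weighting `alpha` are assumptions) classifies a combined click-plus-visual feature by its nearest class template under a learned metric `M`.

```python
import numpy as np

def combined_feature(click_feat, cnn_feat, alpha=0.5):
    """Concatenate the compact click feature and the CNN visual feature;
    alpha is an assumed balancing weight."""
    return np.concatenate([alpha * click_feat, (1 - alpha) * cnn_feat])

def classify(x, templates, M):
    """Template matching under a learned positive semi-definite metric M:
    pick the class whose template is closest in Mahalanobis-style distance."""
    d = [(x - t) @ M @ (x - t) for t in templates]
    return int(np.argmin(d))
```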
Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they provide rich clues on object contexts within images. In this paper, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, aiming to explicitly model spatial object contexts within images. In particular, for every object proposal obtained from the input image, we extract high-level semantic object features with convolutional neural networks. Then, we apply gate units to these extracted features for important object selection and noise removal. The selected object features are organized into the proposed SFM, a compact and discriminative representation that preserves the spatial information among objects. Finally, we employ either Fully Convolutional Networks (FCNs) or Long Short-Term Memory (LSTM) as classifiers on top of the SFM for content recognition, which are expected to exploit the spatial relationships among objects. We also introduce a novel multi-task learning framework to help learn the model parameters in the training phase. It consists of a basic image classification loss in cross-entropy form, an object localization loss to guide important object selection, and a grid labeling loss to predict object labels at SFM grids. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach, and very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. In addition, the SFMs learned in the image domain are transferred to video classification on the CCV and FCVID benchmarks, and the results demonstrate their robustness and generalization capability.
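A minimal sketch of how gated proposal features might be scattered into an SFM grid follows; the grid size, the use of box centers, and max-pooling of collisions are our assumptions rather than the paper's exact construction.

```python
import torch

def build_sfm(obj_feats, boxes, grid=7):
    """Scatter gated object features into a (grid x grid x D) semantic feature map.
    obj_feats: (N, D) proposal features already passed through a gate unit;
    boxes: (N, 4) normalized (x1, y1, x2, y2) proposal boxes."""
    D = obj_feats.size(1)
    sfm = torch.zeros(grid, grid, D)
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2 * grid).long().clamp(0, grid - 1)
    cy = ((boxes[:, 1] + boxes[:, 3]) / 2 * grid).long().clamp(0, grid - 1)
    for i in range(obj_feats.size(0)):
        # if two objects land in the same cell, keep the element-wise max
        sfm[cy[i], cx[i]] = torch.max(sfm[cy[i], cx[i]], obj_feats[i])
    return sfm
```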
In designing an HTTP adaptive streaming (HAS) system, the bitrate adaptation scheme in the player is a key component to ensure a good quality of experience (QoE) for viewers. We propose a new online reinforcement learning optimization framework, called ORL-SDN, targeting HAS players running in a software defined networking (SDN) environment. We leverage SDN to facilitate the orchestration of the adaptation schemes for a set of HAS players. To reach a good level of QoE fairness in a large population of players, we cluster them based on a perceptual quality index. We formulate the adaptation process as a Partially Observable Markov Decision Process and solve the per-cluster optimization problem using an online Q-learning technique that leverages model predictive control and parallelism via aggregation to avoid a per-cluster sub-optimal selection and to accelerate the convergence to an optimum. This framework achieves maximum long-term revenue by selecting the optimal representation for each cluster under time-varying network conditions. The results show that ORL-SDN delivers substantial improvements in viewer QoE, presentation quality stability, fairness and bandwidth utilization over well-known adaptation schemes.
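The per-cluster learning component can be pictured with a plain tabular Q-learning update; note that this sketch omits the POMDP formulation, the model predictive control, and the aggregation that ORL-SDN adds on top, and all names are hypothetical.

```python
import random

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step for a cluster's representation selection:
    state ~ observed (bandwidth, buffer) bin, action ~ chosen representation."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

def choose(Q, state, actions, eps=0.1):
    """Epsilon-greedy selection over the available representations."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```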
Mobile personal livecast (MPL) services are emerging and have received great attention recently. In MPL, numerous geo-distributed ordinary users broadcast their video content to worldwide viewers. Unlike conventional social networking services such as Twitter and Facebook, which tolerate relatively high interaction delays, the interactions (e.g., chat messages) in a personal livecast must happen in real time with low feedback latency. These unique characteristics motivate us to: 1) investigate how the relationships (e.g., social links and geo-locations) between viewers and broadcasters influence user behaviors, which has yet to be explored in depth; and 2) derive insights that can improve system performance. In this paper, we carry out extensive measurements of Inke, one of the most popular MPL providers, with a large-scale dataset containing 11M users. Our key findings are as follows: 1) user interests shift particularly frequently and the average viewing duration is considerably short in MPL; 2) the existence of social relationships significantly strengthens viewer stickiness, with followers dedicating longer viewing time (e.g., contributing 81% of the total viewing time); 3) long content uploading distances on the broadcaster side result in low system QoS (e.g., higher broadcast latency and higher rebuffering ratio) in the current core-cloud based MPL paradigm; and 4) most broadcasts in MPL are geographically local-popular (the majority of views come from the same region as the broadcaster). Thus the emergence of edge computing, which provides cloud-computing capabilities at the edge of the mobile network, naturally sheds new light on MPL systems, e.g., localized ingestion and delivery of live content. Based on these observations, we propose an edge-assisted MPL system that collaboratively utilizes core-cloud and edge computing resources to improve efficiency and scalability for Inke-like services. In our framework, we consider dynamic broadcaster assignment to minimize broadcast latency while keeping the resource lease cost low, and we formulate broadcaster scheduling as a stable matching with migration problem so that it can be solved efficiently. Compared with the current core-cloud based system, our edge-assisted delivery approach reduces broadcast latency by about 35%.
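To illustrate the stable-matching formulation, here is a deferred-acceptance sketch for assigning broadcasters to edge sites; it assumes every broadcaster ranks every edge and total capacity covers all broadcasters, and it omits the migration step of the actual scheme.

```python
def stable_match(broadcasters, edges, pref_b, pref_e, capacity):
    """Deferred-acceptance matching of broadcasters to edge sites.
    pref_b[b]: edges ordered by b's preference (e.g., ascending latency);
    pref_e[e]: dict ranking broadcasters (lower rank = preferred);
    capacity[e]: number of ingest slots at edge e."""
    assigned = {e: [] for e in edges}
    free = list(broadcasters)
    nxt = {b: 0 for b in broadcasters}          # next edge index b will propose to
    while free:
        b = free.pop()
        e = pref_b[b][nxt[b]]
        nxt[b] += 1
        assigned[e].append(b)
        if len(assigned[e]) > capacity[e]:      # evict the least-preferred broadcaster
            worst = max(assigned[e], key=lambda x: pref_e[e][x])
            assigned[e].remove(worst)
            free.append(worst)
    return assigned
```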
In many distributed wireless surveillance applications, compressed videos are used for performing automatic video analysis tasks. The accuracy of object detection, which is essential for various video analysis tasks, can be reduced due to video quality degradation caused by lossy compression. This paper introduces a video encoding framework with the objective of boosting the accuracy of object detection for wireless surveillance applications. The proposed video encoding framework is based on systematic investigation of the effects of lossy compression on object detection. It has been found that current standardized video encoding schemes cause temporal domain fluctuation for encoded blocks in stable background areas and spatial texture degradation for encoded blocks in dynamic foreground areas of a raw video, both of which degrade the accuracy of object detection. Two measures, the sum-of-absolute frame difference (SFD) and the degradation of texture (TXD), are introduced to depict the temporal domain fluctuation and the spatial texture degradation in an encoded video, respectively. The proposed encoding framework is designed to suppress unnecessary temporal fluctuation in stable background areas and preserve spatial texture in dynamic foreground areas based on the two measures, and it introduces new mode decision strategies for both intra and inter frames to improve the accuracy of object detection while maintaining an acceptable rate-distortion performance. Experimental results show that, compared with traditional encoding schemes, the proposed scheme improves the performance of object detection and results in lower bit rate with comparable quality in terms of PSNR and SSIM.
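The SFD measure has a straightforward per-block form; a sketch under assumed 16x16 blocks follows (the paper's exact block size and normalization may differ).

```python
import numpy as np

def block_sfd(prev_frame, cur_frame, x, y, size=16):
    """Sum of absolute frame differences (SFD) for one encoded block:
    a high SFD in a stable background area signals compression-induced
    temporal fluctuation that can trigger false object detections."""
    p = prev_frame[y:y + size, x:x + size].astype(np.int32)
    c = cur_frame[y:y + size, x:x + size].astype(np.int32)
    return int(np.abs(c - p).sum())
```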
Learning Multiple Kernel Metrics for Iterative Person Re-identification
Large-scale image datasets and deep convolutional neural networks (DCNNs) are the two primary driving forces behind the rapid progress in generic object recognition in recent years. While many network architectures have been designed to pursue lower error rates, few efforts have been devoted to enlarging existing datasets, due to high labeling cost and unfair comparison issues. In this paper, we aim to achieve lower error rates by augmenting existing datasets in an automatic manner. Our method leverages both the Web and DCNNs: the Web provides massive images with rich contextual information, and a DCNN replaces human annotators to automatically label images under the guidance of that contextual information. Experiments show that our method can automatically and substantially scale up existing datasets from billions of web pages with high accuracy, and significantly improve performance on object recognition tasks with the automatically augmented datasets, demonstrating that more supervisory information has been automatically gathered from the Web. Both the dataset and the models trained on it have been made publicly available.
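One plausible instantiation of the Web-guided auto-labeling step, purely as an assumption on our part, is to accept a web image only when the DCNN's confident prediction agrees with labels mined from the surrounding page:

```python
import torch

def auto_label(dcnn, image_tensor, context_labels, threshold=0.9):
    """Hypothetical filter: keep a web image only if the DCNN's top prediction
    is confident AND agrees with labels mined from the page context."""
    with torch.no_grad():
        probs = torch.softmax(dcnn(image_tensor.unsqueeze(0)), dim=1)[0]
    conf, pred = probs.max(dim=0)
    if conf.item() >= threshold and pred.item() in context_labels:
        return pred.item()                      # keep the (image, label) pair
    return None                                 # discard ambiguous samples
```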
Emotion recognition from physiological signals is increasingly becoming personalized, due to the subjective responses of different subjects to physical stimuli. Existing works mainly focus on modelling the physiological signals of each subject, without considering psychological factors such as interest and personality. The latent correlation among different subjects has also rarely been examined. In this paper, we propose to investigate the influence of personality on emotional behavior in a hypergraph learning framework. Assuming that each vertex is a compound tuple (subject, stimulus), multi-modal hypergraphs can be constructed based on the personality correlation among different subjects and on the physiological correlation among corresponding stimuli. To model the different importance of vertices, hyperedges, and modalities, we assign a weight to each of them. Doing so allows the learning procedure to be conducted on vertex-weighted multi-modal multi-task hypergraphs, thus simultaneously modelling the emotions of multiple subjects. The estimated emotion relevance is employed for emotion recognition. We carry out extensive experiments on the ASCERTAIN dataset, and the results demonstrate the superiority of the proposed method compared to state-of-the-art approaches.
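Hypergraph learning of this kind typically builds on the normalized hypergraph Laplacian; a compact numpy sketch follows (the vertex weighting and the multi-task coupling used in the paper are omitted for brevity).

```python
import numpy as np

def hypergraph_laplacian(H, w_e):
    """Normalized hypergraph Laplacian (Zhou et al. style) from incidence
    matrix H (n_vertices x n_edges) and hyperedge weights w_e."""
    W = np.diag(w_e)
    Dv = np.diag(H @ w_e)                       # vertex degrees
    De = np.diag(H.sum(axis=0))                 # hyperedge degrees
    Dv_i = np.linalg.inv(np.sqrt(Dv))
    Theta = Dv_i @ H @ W @ np.linalg.inv(De) @ H.T @ Dv_i
    return np.eye(H.shape[0]) - Theta

# Emotion relevance f can then be estimated by minimizing
# f' L f + mu * ||f - y||^2, which has the closed form f = mu (L + mu I)^{-1} y.
```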
Learning robust and representative features across multiple modalities has been a fundamental problem in the machine learning and multimedia fields. In this paper, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from the visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of the different modalities jointly by exploiting the correlation between their hidden representations, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of the different modalities. Compared to conventional solutions relying on hand-crafted features, the proposed MUCAE approach encodes features from image pixels and text characters directly and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world large-scale multimedia databases. Experimental results show that MUCAE performs better than state-of-the-art methods.
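The joint objective can be sketched directly; here the correlation divergence is stood in for by a simple L2 distance between hidden features, which is an assumption rather than the paper's exact divergence.

```python
import torch

def mucae_loss(img, img_rec, txt, txt_rec, h_img, h_txt, lam=1.0):
    """Joint objective: per-modality reconstruction errors plus a divergence
    term that pulls the two hidden representations together."""
    rec = ((img - img_rec) ** 2).mean() + ((txt - txt_rec) ** 2).mean()
    div = ((h_img - h_txt) ** 2).mean()         # L2 stand-in for the divergence
    return rec + lam * div
```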
This paper presents research on using an audio augmented reality (AAR) system to recreate the soundscape of a medieval archaeological site. The aim of our work is to explore whether it is possible to enhance tourists' experience of archaeological sites, which are often preserved only in scarce remains. We developed a smartphone-based AAR system that uses location and orientation sensors to synthesize the soundscape of the site and plays it to the user over headphones. We recreated the ancient soundscape of a medieval archaeological site in Croatia and tested it in situ on two groups of participants using the soundwalk method. One test group performed the soundwalk while listening to the recreated soundscape through the AAR system, while the second, control group did not use the AAR equipment. We measured the participants' experiences using two methods: the standard soundwalk questionnaire and affective computing equipment for detecting the emotional state of participants. The results of both methods showed that participants who listened to the ancient soundscape using our AAR system had higher arousal than those visiting the site without it.
This survey introduces the current state of the art in image and video retargeting and describes important ideas and technologies that have influenced recent work. Retargeting is the process of adapting an image or video from one screen resolution to another in order to fit different displays, for example, when watching a wide-screen movie on a normal television screen or a mobile device. As a large body of work already exists in this field, this survey provides an overview of the techniques. It is meant to be a starting point for new research in the field and includes explanations of basic terms and operators, as well as the basic workflow of most methods.
The focus of this paper is on the adoption of next generation internet and Virtual Reality (VR) technologies for the development of fully immersive and haptic simulators for training medical residents in a surgical process termed Less Invasive Stabilization System (LISS) plating surgery. LISS surgery is an orthopaedic procedure developed for the healing of fractured femur bones. Development of such simulators is a complex task that involves multiple systems, technologies, and human experts. Emerging next generation internet technologies were used to develop the haptics-based collaborative simulator, and a standalone fully immersive surgical simulator was also developed using the HTC Vive. Expert surgeons played an important role in developing the simulator system; use cases of the target surgical processes were built using a modeling language called the engineering Enterprise Modeling Language. The impact of using the simulators has been explored through interactions with residents during multiple phases, which underscores the potential of such simulators in medical training.
In this paper, we present convolutional attention networks (CAN) for unconstrained scene text recognition. Recent dominant approaches for scene text recognition are mainly based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), where the CNN encodes images and the RNN generates character sequences. Our CAN differs from these methods in that it is built entirely on CNNs and incorporates an attention mechanism. The distinctive characteristics of our method include: (1) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN. (2) The attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling. (3) Position embeddings are used in both the spatial encoder and the sequence decoder to give our networks a sense of location. We conduct experiments on standard datasets for scene text recognition, including the Street View Text, IIIT5K, and ICDAR datasets. The experimental results validate the effectiveness of the different components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior works, even without the use of RNNs.
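A hedged sketch of spatial attention with average pooling, as one interpretation of point (2): scores are computed on a pooled encoder map and upsampled back before weighting (the pooling factor and the dot-product scoring are our assumptions).

```python
import torch
import torch.nn.functional as F

def avgpool_attention(enc_feats, query, pool=2):
    """enc_feats: (B, C, H, W) encoder output with H, W divisible by pool;
    query: (B, C) decoder state. Returns the attended context (B, C)."""
    pooled = F.avg_pool2d(enc_feats, pool)                        # (B, C, H/p, W/p)
    scores = torch.einsum('bchw,bc->bhw', pooled, query)          # dot-product scores
    attn = torch.softmax(scores.flatten(1), dim=1).view_as(scores)
    attn = F.interpolate(attn.unsqueeze(1), scale_factor=pool)    # back to (B, 1, H, W)
    return (enc_feats * attn).sum(dim=(2, 3))                     # weighted sum over space
```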
For the entropy coding of independent and identically distributed (i.i.d.) binary sources, variable-to-variable length (V2V) codes are an interesting alternative to arithmetic coding. Such a V2V code translates variable length words of the source into variable length code words by employing two prefix-free codes. In this paper, several properties of V2V codes are studied and new concepts are developed. In particular, it is shown that the redundancy of a V2V code cannot be zero for a binary i.i.d. source with 0 < p(1) < 0.5. Furthermore, the concept of prime and composite V2V codes is proposed and it is shown why composite V2V codes can be disregarded in the search for particular classes of minimum redundancy codes. Moreover, a canonical representation for V2V codes is proposed, which identifies V2V codes that have the same average code length function. It is shown how these concepts can be employed to greatly reduce the complexity of a search for minimum redundancy (size-limited) V2V codes.
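A small worked example in Python shows why the redundancy stays positive: even with Huffman-coded source words, the rate of a V2V code exceeds the source entropy for p(1) = 0.2 (the particular word set is ours, chosen for illustration).

```python
from math import log2

def v2v_rate(p, source_words, code_lengths):
    """Average code length per source bit for a V2V code on a binary i.i.d.
    source with P(1) = p; source_words must form a complete prefix-free set."""
    probs = [(1 - p) ** w.count('0') * p ** w.count('1') for w in source_words]
    avg_code = sum(pr * l for pr, l in zip(probs, code_lengths))
    avg_src = sum(pr * len(w) for pr, w in zip(probs, source_words))
    return avg_code / avg_src

p = 0.2
H = -p * log2(p) - (1 - p) * log2(1 - p)              # entropy ~ 0.7219 bits
rate = v2v_rate(p, ['00', '01', '1'], [1, 2, 2])      # Huffman lengths for the words
print(rate - H)                                       # redundancy ~ 0.034 > 0
```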
Image captioning is an increasingly important problem at the intersection of artificial intelligence, computer vision, and natural language processing. Recent works have revealed that it is possible for a machine to generate meaningful and accurate sentences for images. However, most existing methods ignore the emotional information latent in an image. In this paper, we propose a novel image captioning model with affective guiding and a selective attention mechanism, named AG-SAM. In our method, we aim to bridge the affective gap between image captioning and the emotional response elicited by the image. First, we introduce affective components that capture higher-level concepts encoded in images into AG-SAM, so that our language model can be adapted to generate sentences that are more passionate and emotive. Besides, a selective gate acting on the attention mechanism controls how much visual information AG-SAM uses at each step. Experimental results show that our model outperforms most existing methods, clearly reflecting the association between images and emotional components, which is usually ignored in existing works.
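The selective gate can be pictured as a scalar sigmoid gate on the attended visual context; conditioning the gate only on the decoder hidden state, as in this sketch, is our assumption.

```python
import torch
import torch.nn as nn

class SelectiveGate(nn.Module):
    """Scalar gate in [0, 1] deciding how much attended visual context the
    language model consumes at each decoding step."""
    def __init__(self, hid_dim):
        super().__init__()
        self.fc = nn.Linear(hid_dim, 1)

    def forward(self, h, ctx):                  # h: (B, hid), ctx: (B, ctx_dim)
        beta = torch.sigmoid(self.fc(h))        # (B, 1): 0 = ignore visual input
        return beta * ctx                       # gated visual context
```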