In this paper, we address two practical problems in online clothing shopping: 1) What do I look like when wearing this clothing? 2) What accessories should I buy to pair with this item? Our system contains two main parts: clothing try-on and accessory recommendation. Unlike existing shopping websites, which present clothing almost uniformly on skinny models against a clean background, our clothing try-on module shows what ordinary people look like when wearing the clothing on the street. Moreover, we recommend representative and diverse accessories to pair with the item according to daily costume matching on social media, and provide the exact or similar items in online shops. These two sub-modules are implemented jointly through a bi-directional shop-to-street and street-to-shop clothing retrieval framework based on deep feature embedding. The cross-domain clothing retrieval task poses three main challenges. First, the discrepancy (e.g., background, pose, illumination) between street-domain and shop-domain clothing must be learned. Second, both intra-domain and cross-domain similarity need to be considered during feature embedding. Third, there is a large imbalance between the number of matched and non-matched street and shop pairs. To address these challenges, we propose a deep bi-directional cross-triplet embedding algorithm that extends state-of-the-art triplet embedding to the cross-domain retrieval scenario. Extensive experimental evaluations demonstrate the effectiveness of the proposed cross-domain clothing retrieval framework and show how it facilitates the clothing try-on and accessory recommendation applications.
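The bi-directional triplet idea above can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the margin value, and the simple sum of a street-to-shop term and a shop-to-street term are assumptions made for illustration, and the features are plain lists rather than learned deep embeddings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge triplet loss: zero once the positive is closer
    to the anchor than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def bidirectional_cross_triplet_loss(street, shop_pos, shop_neg, street_neg, margin=0.2):
    """Hypothetical bi-directional loss (an assumption, not the paper's formula).

    street_to_shop: a street photo should be closer to its matching shop
    item than to a non-matching shop item.
    shop_to_street: the shop item should be closer to its matching street
    photo than to a non-matching street photo.
    """
    street_to_shop = triplet_loss(street, shop_pos, shop_neg, margin)
    shop_to_street = triplet_loss(shop_pos, street, street_neg, margin)
    return street_to_shop + shop_to_street


if __name__ == "__main__":
    loss = bidirectional_cross_triplet_loss(
        street=[0.0, 0.0], shop_pos=[0.0, 1.0],
        shop_neg=[3.0, 4.0], street_neg=[0.0, 1.5],
    )
    print(round(loss, 6))
```

In this toy call the street-to-shop term is already satisfied, so only the shop-to-street direction contributes to the loss; this is the kind of asymmetry a one-directional triplet loss would not penalize.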
Learning-based hashing has received a great deal of research attention in the past few years for its great potential in fast and accurate similarity search among huge volumes of multimedia data. In this paper, we present a novel multimedia hashing framework, termed Label Preserving Multimedia Hashing (LPMH), for multimedia similarity search. In LPMH, a general optimization method is used to learn the joint binary codes of multiple media types by explicitly preserving the semantic label information. Compared with existing hashing methods, which are typically developed under, and thus restricted to, some specific objective functions, the proposed optimization strategy is not tied to any specific loss function and can easily incorporate bit balance constraints to produce well-balanced binary codes. Specifically, our formulation leads to a set of Binary Integer Programming (BIP) problems that have exact solutions both with and without the bit balance constraints. These problems can be solved extremely fast, and the solution scales easily to large datasets. In the hash function learning stage, the boosted decision trees algorithm is used to learn multiple media-specific hash functions that map heterogeneous data sources into a homogeneous Hamming space for cross-media retrieval. We have comprehensively evaluated the proposed method on a range of large-scale datasets in both single-media and cross-media retrieval tasks. The experimental results demonstrate that LPMH is competitive with state-of-the-art methods in both speed and accuracy.
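To make the notions of Hamming-space retrieval and bit balance concrete, here is a small sketch. It does not implement LPMH or its BIP solver; it only assumes that binary codes are stored as Python integers, and all function names are hypothetical.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(code_a ^ code_b).count("1")


def bit_balance(codes, num_bits):
    """Fraction of codes with each bit set, from the least significant bit up.
    A value of 0.5 means that bit splits the dataset evenly."""
    n = len(codes)
    return [sum((c >> b) & 1 for c in codes) / n for b in range(num_bits)]


def rank_by_hamming(query: int, database):
    """Indices of database codes sorted by Hamming distance to the query."""
    return sorted(range(len(database)), key=lambda i: hamming_distance(query, database[i]))


if __name__ == "__main__":
    database = [0b0000, 0b1110, 0b1100]
    print(rank_by_hamming(0b1111, database))
    print(bit_balance([0b00, 0b01, 0b10, 0b11], num_bits=2))
```

Because codes from different media types live in the same Hamming space, the same ranking function can serve both single-media and cross-media queries.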
Many applications generate and/or consume multi-variate temporal data and experts often lack the means to adequately and systematically search for and interpret multi-variate observations. In this paper, we first observe that multi-variate time series often carry localized multi-variate temporal features that are robust against noise. We then argue that these multi-variate temporal features can be extracted by simultaneously considering, at multiple scales, temporal characteristics of the time-series along with external knowledge, including variate relationships, known a priori. Relying on these observations, we develop data models and algorithms to detect robust multi-variate temporal (RMT) features that can be indexed for efficient and accurate retrieval and can be used for supporting data exploration and analysis tasks. Experiments confirm that the proposed RMT algorithm is highly effective and efficient in identifying robust multi-scale temporal features of multi-variate time series.
Action recognition is an important research problem in Human Motion Analysis (HMA). In recent years, 3D observation-based action recognition has received increasing interest in the multimedia and computer vision communities, owing to the advent of cost-effective sensors such as the Kinect depth camera. This work takes one step further, focusing on early recognition of ongoing 3D human actions, which is beneficial for a large variety of time-critical applications, e.g., gesture-based human-machine interaction and somatosensory games. Our goal is to infer the class label of a 3D human action from a partial observation of a temporally incomplete action execution. By considering 3D action data as multivariate time series (m.t.s.) synchronized to a shared common clock (frames), we propose a stochastic process called Dynamic Marked Point Process (DMP) to model the 3D action as temporal dynamic patterns, where both timing and strength information are captured. To achieve even better earliness and accuracy of recognition, we also explore the temporal dependency patterns between feature dimensions. A probabilistic suffix tree is constructed to represent sequential patterns among features in terms of a Variable order Markov Model (VMM). Our approach and several baselines are evaluated on four 3D human action datasets. Extensive results show that our approach achieves superior performance for early recognition of 3D human actions.
Personalized e-learning models tailor learning resources to the learning needs of learners. Adaptive Hypermedia Architecture (AHA) is a successful implementation of the personalized e-learning model that uses learning outcomes as the personalization parameter to adapt to the learning experience of learners. However, besides learning outcomes, the emotions of the learner, which can strongly influence memory and problem solving, are completely neglected in the AHA model. This paper presents an Adaptive Educational Hypermedia (AEH) model, known as Expert Elearning System (EES), which is built on top of the AHA to incorporate a facial emotion recognition framework. The emotion recognition framework, denoted MKLDT-WFA, is realized by training simple Multiple Kernel Learning (MKL) with Weighted Kernel Alignment (WFA) in a Decision Tree (DT) classifier. The MKLDT-WFA framework has two merits over classical SimpleMKL. First, the WFA component preserves only relevant kernel weights to improve discrimination between emotion classes. Second, training in the DT eliminates misclassification issues associated with off-the-shelf SimpleMKL classifiers. The suggested framework has been evaluated on different emotion databases. The evaluation results reveal good performance for emotion recognition and indicate the framework's potential to improve personalization in AEH models.
Cloud gaming has been recognized as a promising shift in the online game industry, with the aim of implementing the on demand service concept that has achieved market success in other areas of digital entertainment such as movies and TV shows. The concepts of cloud computing are leveraged to render the game scene as a video stream which is then delivered to players in real-time. The main advantage of this approach is the capability of delivering high-quality graphics games to any type of end user device, however at the cost of high bandwidth consumption and strict latency requirements. A key challenge faced by cloud game providers lies in configuring the video encoding parameters so as to maximize player Quality of Experience (QoE) while meeting bandwidth availability constraints. In this paper we tackle one aspect of this problem by addressing the following research question: Is it possible to improve service adaptation based on information about the characteristics of the game being streamed? To answer this question two main challenges need to be addressed: the need for different QoE-driven video encoding (re-)configuration strategies for different categories of games, and how to determine a relevant game categorization to be used for assigning appropriate configuration strategies. We investigate these problems by conducting two subjective laboratory studies with a total of 80 players and three different games. Results indicate that different strategies should likely be applied for different types of games, and show that existing game classifications are not necessarily suitable for differentiating game types in this context. We thus further analyze objective video metrics of collected game play video traces as well as player actions per minute and use this as input data for clustering of games into two clusters. Subjective results verify that different video encoding configuration strategies may be applied to games belonging to different clusters.
Generating a novel and descriptive caption for an image is drawing increasing interest in the computer vision, natural language processing and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional Long Short-Term Memory (Bi-LSTM) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale and vertical mirror are proposed to prevent overfitting when training deep models. To understand how our models "translate" an image into a sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO and Pascal1K. We demonstrate that Bi-LSTM models achieve state-of-the-art results on both caption generation and image-sentence retrieval, even without integrating additional mechanisms (e.g., object detection, attention models). Our experiments also show that multi-task learning helps to increase model generality and improve performance. We also demonstrate that the transfer learning performance of the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
With the success of emerging RGB-D cameras such as the Kinect sensor, combining shape (depth) and texture information to improve the quality of recognition has become a trend among computer vision researchers. In this work, we address the problem of face classification in the context of RGB images and depth data. Inspired by psychological results on human face perception, this paper focuses on (i) finding out which facial parts are most effective at discriminating some social aspects of face perception (gender, ethnicity and emotional state), (ii) determining the optimal decision by combining the decisions rendered by the individual parts, and (iii) extracting promising features from RGB-D faces in order to exploit all the potential that these data provide. Experimental results on the EurecomKinect Face and CurtinFaces databases show that the proposed approach improves the recognition quality in many use cases.
The stage background is one of the most important features for a dance performance as it helps to create the scene and atmosphere. In conventional dance performances, the background images are usually selected or designed by professional stage designers according to the theme and the style of the dance. In new media dance performances, the stage effects are usually generated by media editing software. Selecting or producing a dance background is quite challenging, and is generally carried out by skilled technicians. The goal of the research reported in this paper is to ease this process. Instead of searching for background images from the sea of available resources, dancers are recommended images they are more likely to use. This paper proposes the idea of a novel system to recommend images based on content-based social computing. The core part of the system is a probabilistic prediction model to predict a dancer's interests in candidate images through social platforms. Different from traditional collaborative filtering models or content-based models, the model proposed in this paper effectively combines a dancer's social behaviors (rating action, click action, etc.) with the visual content of the images shared by the dancer using deep matrix factorization (DMF). With the help of such a system, dancers can select from the recommended images and set them as the backgrounds of their dance performances through a media editor. According to the experiment results, the proposed DMF model outperforms the previous methods, and when the dataset is very sparse, the proposed DMF model shows more significant results.
Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This paper contributes to the state of the art with two statistical approaches for extracting robust gait features directly from raw data: (1) a modification of Linear Discriminant Analysis with Maximum Margin Criterion and (2) a combination of Principal Component Analysis and Linear Discriminant Analysis. Experiments on the CMU MoCap database show that these methods outperform thirteen other relevant methods in terms of the distribution of biometric templates in the respective feature spaces, expressed in a number of class separability coefficients and classification metrics. Results also indicate a high portability of learned features; that is, we can learn in which aspects of walking people generally differ and extract those as general gait features. Recognizing people without needing group-specific features is convenient, as particular people might not always provide annotated learning data. As a contribution to reproducible research, our evaluation framework and database have been made publicly available. This research makes motion capture technology directly applicable to human recognition.
In this paper, we focus on isolated gesture recognition and explore different modalities by inspecting the RGB stream, depth stream and saliency stream. Our goal is to push the boundary of this field further by proposing a unified framework that exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus voting is proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance when confronted with comprehensive inter-class variations. In addition, a 3D depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are conducted to analyze the performance of each component, and our proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitor by margins of over 10% and 15%, respectively. We will release our code to facilitate future research.
Content-based image retrieval (CBIR) is one of the most important applications of computer vision. Recent years have witnessed many important advances in the development of CBIR systems, especially through Convolutional Neural Networks (CNNs) and other deep learning techniques. However, current CNN-based CBIR systems suffer from the high computational complexity of CNNs. This problem becomes more severe as mobile applications become more and more popular. The current mainstream approach is to deploy the entire CBIR system on the server side, while the client side only serves as an image provider. This architecture may increase the computational burden on the server side, which needs to process thousands of requests per second. Moreover, sending images carries the risk of leaking personal information. As the need for mobile search expands, concerns about privacy are growing. In this paper, we propose a fast image search framework, named DeepSearch, which makes complex CNN-based image search feasible on mobile phones. To handle the heavy computation of CNN models, we present a tensor Block Term Decomposition (BTD) method to accelerate the CNNs involved in object detection and feature extraction. Extensive experiments on the ImageNet dataset and the Alibaba Large-scale Image Search Challenge (ALISC) dataset show that the proposed BTD acceleration method can significantly speed up the CNN models, which in turn makes CNN-based image search practical on common smartphones.
Egocentric videos, which mainly record the activities carried out by the users of wearable cameras, have drawn much research attention in recent years. Due to their lengthy content, a large number of ego-related applications have been developed to abstract the captured videos. As users are accustomed to interacting with target objects using their own hands, and their hands usually appear within their visual fields during the interaction, an egocentric hand detection step is involved in tasks such as gesture recognition, action recognition and social interaction understanding. In this work, we propose a dynamic region growing approach for hand region detection in egocentric videos that jointly considers hand-related motion and egocentric cues. We first determine seed regions that most likely belong to the hand by analyzing the motion patterns across successive frames. The hand regions can then be located by extending from the seed regions according to the scores computed for the adjacent superpixels. These scores are derived from four egocentric cues: contrast, location, position consistency and appearance continuity. We discuss how to apply the proposed method in real-life scenarios, where multiple hands irregularly appear in and disappear from the videos. Experimental results on public datasets show that the proposed method achieves superior performance compared with state-of-the-art methods, especially in complicated scenarios.
Today, interpersonal digital communication systems do not have an intuitive and natural way of communicating emotion, which in turn affects the degree to which we can emotionally connect and interact with one another while separated by distance. In answer to this problem, a natural, intuitive and implicit emotion communication system is proposed that recognizes emotions using an electroencephalogram (EEG) signal at the transmitter end and displays tactile sensations at the receiver side. The proposed system comprises two components: an emotion recognition subsystem that utilizes hemisphere asymmetry-based EEG signal analysis for emotion classification, and a haptic jacket that displays the apparent tactile sensations (named tactile gestures) at the receiver side. Emotions are modeled in terms of valence (positive/negative emotions) and arousal (intensity of the emotion). Furthermore, an authoring tool is utilized to create custom tactile gestures that can elicit specific emotional reactions. Performance analysis shows that the proposed EEG subject-dependent emotion recognition model with Free Asymmetry features allows for more flexible feature generation schemes than existing algorithms and attains an average accuracy of 92.5% for valence and 96.5% for arousal, vastly outperforming previous generation schemes in high feature space. As for the tactile feedback, a tactile gesture authoring tool and a haptic jacket are developed to design custom tactile gestures that can intensify emotional reactions in terms of valence and arousal. A usability study demonstrated that subject-independent emotion transmission through tactile gestures effectively communicated the arousal dimension of an emotion but was not as effective for valence. Consistency in subject-dependent responses for both valence and arousal suggests that personalized tactile gestures would be more effective.
Facial Expression Recognition (FER) is one of the most important topics in the domain of computer vision and pattern recognition, and it has attracted increasing attention for its scientific challenges and application potential. In this paper, we propose a novel and effective approach to FER using multi-modal 2D and 3D videos, which encodes both static and dynamic cues with a scattering convolution network. Firstly, a shape-based detection method is introduced to locate the start and the end of an expression in videos, segment its onset, apex, and offset states, and sample the important frames for emotion analysis. Secondly, the frames in Apex of 2D videos are represented by scattering, conveying static texture details. Those of 3D videos are processed in a similar way, but to highlight static shape details, several geometric maps in terms of multiple order differential quantities, i.e., Normal Maps (NOM) and Shape Index Maps (SIM), are generated as the input of scattering instead of the original smooth facial surfaces. Thirdly, the average of neighboring samples centred at each key texture frame or shape map evenly distributed in Onset is computed, and the scattering features extracted from all the average samples of 2D and 3D videos are then concatenated to capture dynamic texture and shape cues, respectively. Finally, a Support Vector Machine (SVM) is adopted to measure the similarity of individual features in either the 2D or the 3D modality, and all the scores are combined for multi-modal decision making to predict the expression label. Thanks to the scattering descriptor, the proposed approach not only encodes distinct local texture and shape variations of different expressions, as several milestone operators such as SIFT and HOG do, but also captures subtle information hidden in high frequencies in both channels, which is crucial for distinguishing expressions that are easily confused. The validation is conducted on the BU-4DFE database, and state-of-the-art accuracy is reached, indicating the competency of the approach for this task.
In sign language recognition with multi-modal data, a sign word can be represented by multi-modal features, which have intrinsic properties and mutually complementary relationships. To fully explore those relationships for sign language recognition, we propose an online early-late fusion method based on an adaptive HMM (Hidden Markov Model). In terms of the intrinsic property, we discover that the inherent latent change states of each sign are related not only to the number of key gestures and body poses, but also to their translation relationships. We propose an adaptive HMM method that obtains the hidden state number of each sign with affinity propagation clustering. For the complementary relationship, we propose an online early-late fusion scheme. The early fusion (feature fusion) aims at preserving useful information to achieve a better complementary score, while the late fusion (score fusion) uncovers the significance of those features and aggregates them in a weighted manner. For different queries, the fusion weight is inversely proportional to the area under the curve of the normalized query score list for each feature. Unlike classical fusion methods, our fusion method is query-adaptive. The whole fusion process is effective and efficient. Experiments verify the effectiveness on signer-independent SLR (Sign Language Recognition) with a large vocabulary. Whether compared across different dataset sizes or against different SLR models, our method demonstrates consistent and promising performance.
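The query-adaptive weighting rule stated above (weight inversely proportional to the area under the normalized query score curve) can be sketched as follows. Several details are assumptions for illustration only: min-max normalization, sorting scores in descending order, a trapezoidal area over a unit-length x-axis, and normalizing the weights to sum to one. The function names are hypothetical.

```python
def normalized_score_auc(scores):
    """Area under the descending, min-max normalized score curve.
    A sharply peaked score list (one confident match) gives a small area."""
    ordered = sorted(scores, reverse=True)
    lo, hi = ordered[-1], ordered[0]
    if len(ordered) < 2 or hi == lo:
        return 1.0  # assumption: a flat or single-item list counts as maximal area
    norm = [(s - lo) / (hi - lo) for s in ordered]
    step = 1.0 / (len(norm) - 1)
    return sum((norm[i] + norm[i + 1]) / 2.0 * step for i in range(len(norm) - 1))


def fusion_weights(score_lists):
    """One weight per feature, inversely proportional to its AUC, summing to 1."""
    inverse = [1.0 / normalized_score_auc(s) for s in score_lists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuse(score_lists):
    """Weighted sum of per-feature scores for each candidate sign."""
    weights = fusion_weights(score_lists)
    return [sum(w * s[j] for w, s in zip(weights, score_lists))
            for j in range(len(score_lists[0]))]


if __name__ == "__main__":
    peaked = [1.0, 0.1, 0.0]   # discriminative feature for this query
    flat = [1.0, 0.9, 0.0]     # less discriminative feature
    print(fusion_weights([peaked, flat]))
    print(fuse([peaked, flat]))
```

The weights are recomputed for every query, so a feature that is decisive for one sign can be down-weighted for another, which is what makes the scheme query-adaptive.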
Cloud-assisted video streaming has emerged as a new paradigm to optimize multimedia content distribution over the Internet. This paper investigates the problem of streaming cloud-assisted real-time video to multiple destinations (e.g., cloud video conferencing, multi-player cloud gaming) over lossy communication networks. User diversity and network dynamics result in delay differences among the destinations. This research proposes the Differentiated cloud-Assisted VIdeo Streaming (DAVIS) framework, which proactively leverages such delay differences in video coding and transmission optimization. First, we analytically formulate the optimization problem of joint coding and transmission to maximize received video quality. Second, we develop a quality optimization framework that integrates video representation selection and FEC (Forward Error Correction) packet interleaving. The proposed DAVIS is able to effectively perform differentiated quality optimization for multiple destinations by taking advantage of the delay differences in the cloud-assisted video streaming system. We evaluate performance through extensive experiments with Amazon EC2 instances and the Exata emulation platform. Evaluation results show that DAVIS outperforms the reference cloud-assisted streaming solutions in video quality and delay performance.
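As background for the FEC packet interleaving mentioned above, the following sketch shows a generic block interleaver. It is not the DAVIS scheme: the equal-length blocks, the round-robin send order and the function names are assumptions chosen to show why interleaving spreads a burst loss across several FEC blocks.

```python
def interleave(blocks):
    """Send one packet from each FEC block in turn (column-by-column order).
    Assumes all blocks have the same length."""
    length = len(blocks[0])
    return [block[i] for i in range(length) for block in blocks]


def deinterleave(stream, num_blocks):
    """Rebuild the original FEC blocks from an interleaved stream."""
    return [stream[r::num_blocks] for r in range(num_blocks)]


def losses_per_block(lost_positions, num_blocks):
    """Count how many lost stream positions fall into each FEC block."""
    counts = [0] * num_blocks
    for pos in lost_positions:
        counts[pos % num_blocks] += 1
    return counts


if __name__ == "__main__":
    blocks = [["A0", "A1", "A2"], ["B0", "B1", "B2"]]
    stream = interleave(blocks)
    print(stream)
    # A burst that drops two consecutive packets costs each block one packet.
    print(losses_per_block([2, 3], num_blocks=2))
```

The trade-off is extra delay: a receiver must wait longer to collect a full block, which is why destinations with different delay budgets might tolerate different interleaving depths.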
This paper tackles the problem of joint estimation of human age and facial expression. This problem is important yet challenging because expressions can alter the face appearance in a similar manner to human aging. Unlike previous approaches dealing with the two tasks independently, we propose a jointly trained convolutional neural network (CNN) model that unifies the ordinal regression and multi-class classification in a single framework to tackle this problem. We demonstrate experimentally that our method performs more favorably against state-of-the-art approaches.
Social media platforms are turning into important news sources for users since they provide real-time information with a wide range of perspectives. However, the high volume, dynamism, noise and redundancy exhibited by social media data make it difficult for users to comprehend the entire content. Recent works emphasize summarizing the content of either a single social media platform or a single modality (either textual or visual). However, each platform has its own unique characteristics and user base, which brings to light different aspects of real-world events. This makes it critical as well as challenging to combine textual and visual data from different platforms. In this article, we propose summarization of real-world events with data stemming from different platforms and multiple modalities. We present the use of a Markov Random Fields based similarity measure to link content across multiple platforms. This measure also enables the linking of content across time, which is useful for tracking the evolution of long-running events. For the final content selection, summarization is modeled as a subset selection problem. To handle the complexity of optimal subset selection, we propose the use of submodular objectives. Facets such as coverage, novelty and significance are modeled as submodular objectives in a multimodal social media setting. We conduct a series of quantitative and qualitative experiments to illustrate the effectiveness of our approach compared to alternative methods.
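The subset selection step can be illustrated with the classic greedy algorithm for a monotone submodular objective. The sketch below uses a plain set-coverage objective as a stand-in; the article's actual coverage, novelty and significance objectives are not reproduced, and the topic sets, budget and function names are hypothetical.

```python
def coverage(selected, item_topics):
    """Submodular objective: number of distinct topics covered by the selection."""
    covered = set()
    for i in selected:
        covered |= item_topics[i]
    return len(covered)


def greedy_select(item_topics, budget):
    """Repeatedly add the item with the largest marginal coverage gain.
    Ties go to the lowest index; stops early once no item adds anything."""
    selected = []
    remaining = set(range(len(item_topics)))
    while remaining and len(selected) < budget:
        base = coverage(selected, item_topics)
        best = max(sorted(remaining),
                   key=lambda i: coverage(selected + [i], item_topics) - base)
        if coverage(selected + [best], item_topics) - base <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Each post or image is described by the (hypothetical) topics it mentions.
    posts = [{"a", "b"}, {"b", "c", "d"}, {"a"}, {"e"}]
    summary = greedy_select(posts, budget=2)
    print(summary, coverage(summary, posts))
```

For monotone submodular objectives under a cardinality budget, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value, which is the usual reason for casting summarization this way instead of searching all subsets.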
While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this paper, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV) encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities of RGB, HHA and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity: only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal non-sparsity: features from all modalities are encouraged to co-exist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we apply our feature fusion framework to the action recognition task to demonstrate that it generalizes to other multi-modal, well-structured features. In particular, for action recognition, we enforce inter-part sparsity to choose more discriminative body parts, and inter-modal non-sparsity so that informative features from both appearance and motion modalities co-exist. Experimental results on the JHMDB and MPII Cooking datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.
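The two regularizers named above can be written down in a few lines. This sketch uses the standard textbook definitions (group lasso as a sum of per-group L2 norms, exclusive group lasso as a sum of squared per-group L1 norms); how the paper forms its groups from GMM components and modalities, and how it weights the terms, is not specified here, so the grouping in the example is an assumption.

```python
import math


def group_lasso(groups):
    """Sum of L2 norms, one per group. Tends to switch entire groups off,
    matching the idea that only a few components matter."""
    return sum(math.sqrt(sum(w * w for w in g)) for g in groups)


def exclusive_group_lasso(groups):
    """Sum of squared L1 norms, one per group. Encourages sparsity inside
    each group while discouraging any whole group from vanishing."""
    return sum(sum(abs(w) for w in g) ** 2 for g in groups)


if __name__ == "__main__":
    # Hypothetical regressor weights, grouped in two different ways.
    by_component = [[3.0, 4.0], [0.0, 0.0]]   # second component unused
    by_grouping = [[1.0, -2.0], [3.0]]
    print(group_lasso(by_component))
    print(exclusive_group_lasso(by_grouping))
```

Both penalties would be added to the regression loss with their own trade-off coefficients, which are hyperparameters not given in the abstract.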