This paper develops an aggregate power consumption model for many-to-one live video streaming systems, such as video surveillance, where multiple video sources stream videos to a central monitoring station. In such systems, power consumption is a major concern, especially for battery-powered video sources. We model the video capturing, encoding, and transmission aspects and then provide an overall model of the power consumed by the video cameras and/or sensors. The developed model captures the following main parameters: resolution, frame rate, quantization, motion estimation range, and number of reference frames. We also analyze the power consumed by the monitoring station, which is due to video reception, potential video upscaling, and video decoding of all received video streams. In addition to modeling the power consumption, we model the achieved bitrate of video encoding. We validate and analyze the power consumption models of each phase, as well as the aggregate power consumption model, through extensive experiments. The analysis includes examining individual parameters separately and examining the impacts of changing more than one parameter at a time.
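The kind of parametric model the abstract describes can be pictured with a small sketch: power grows with the pixel rate and with the encoder's complexity knobs (motion estimation range, reference frames), and shrinks with coarser quantization. The functional form and every coefficient below are illustrative assumptions, not the paper's fitted model.

```python
# Illustrative encoder-power sketch: capture + encoding terms driven by
# resolution, frame rate, quantization (qp), ME range, and reference frames.
# All coefficients are hypothetical placeholders, not fitted values.

def encoding_power(width, height, fps, qp, me_range, ref_frames,
                   c_cap=1e-8, c_enc=2e-8, c_me=0.05, c_ref=0.1, c_qp=0.01):
    """Return an illustrative power estimate (arbitrary units)."""
    pixel_rate = width * height * fps            # pixels processed per second
    capture = c_cap * pixel_rate                 # sensor/capture term
    # Encoding cost rises with ME range and reference-frame count,
    # and falls as quantization gets coarser (higher qp).
    complexity = (1 + c_me * me_range) * (1 + c_ref * ref_frames)
    encode = c_enc * pixel_rate * complexity * max(0.1, 1 - c_qp * qp)
    return capture + encode

p_720 = encoding_power(1280, 720, 30, qp=28, me_range=16, ref_frames=1)
p_1080 = encoding_power(1920, 1080, 30, qp=28, me_range=16, ref_frames=1)
```

Such a form at least reproduces the qualitative trends the paper's experiments examine (higher resolution or wider ME range costs more power).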
With the advent and popularity of social networks, social graphs have become essential for improving service quality and information relevance to users in many social media applications, e.g., predicting follower/followee relationships, community membership, etc. However, social graphs may be hidden by users due to privacy concerns or kept confidential by social media sites. Recently, connections discovered from user-shared images using non-user-generated labels have proven to be a more accessible alternative to social graphs. But real-time discovery is difficult due to its high complexity, which rules out many applications. This paper proposes an efficient computation framework for connection discovery using user-shared images. The framework adopts an architecture divided into online and offline computation to facilitate faster processing for connection discovery using user-shared images. Furthermore, this paper devises a general and scalable online computation framework into which many algorithms can fit to help discover connections on the fly. The performance of the framework is evaluated on the application of follower/followee recommendation with 300K+ user-shared images from two social networks. The proposed computation framework is shown to reduce processing time by 90% on average compared with existing frameworks, while retaining 90% of their accuracy in discovering connections.
With the advances in mobile devices and the popularity of social networks, users can share multimedia content anytime, anywhere. One of the most important types of emerging content is video, which is commonly shared on platforms such as Instagram and Facebook. User connections, which indicate whether two users are follower/followee or have the same interests, are essential for improving service quality and information relevance to users in many social media applications, but they are normally hidden due to users' privacy concerns or kept confidential by social media sites. Using user-shared content is an alternative way to discover user connections. This paper proposes to use user-shared videos for connection discovery with the Bag-of-Feature Tagging (BoFT) method and proposes a distributed streaming computation framework to facilitate fast response. Exploiting the uniqueness of shared videos, the proposed framework is divided into streaming processing, online computation, and offline computation. Through experiments using a dataset from Twitter, it is shown that using user-shared videos for connection discovery is feasible and that the proposed computation framework reduces the processing time to only 35% of the original for follower/followee recommendation. It is also shown that comparable performance can be achieved with only partial video data.
Action recognition is an important research problem in Human Motion Analysis (HMA). In recent years, 3D-observation-based action recognition has received increasing interest in the multimedia and computer vision communities, due to the recent advent of cost-effective sensors such as the Kinect depth camera. This work takes one step further, focusing on early recognition of ongoing 3D human actions, which is beneficial for a large variety of time-critical applications, e.g., gesture-based human-machine interaction, somatosensory games, etc. Our goal is to infer the class label of 3D human actions from partial observations of temporally incomplete action executions. By considering 3D action data as multivariate time series (m.t.s.) synchronized to a shared common clock (frames), we propose a stochastic process called the Dynamic Marked Point Process (DMP) to model a 3D action as temporal dynamic patterns, where both timing and strength information are captured. To achieve even better earliness and accuracy of recognition, we also explore the temporal dependency patterns between feature dimensions. A probabilistic suffix tree is constructed to represent sequential patterns among features in terms of a Variable-order Markov Model (VMM). Our approach and several baselines are evaluated on four 3D human action datasets. Extensive results show that our approach achieves superior performance for early recognition of 3D human actions.
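The variable-order Markov idea behind the probabilistic suffix tree can be sketched with plain context counts: store the next-symbol counts for every context up to a maximum order, then predict by backing off from the longest matching context. This is a generic VMM illustration; the paper's DMP model and its feature discretization are not reproduced here.

```python
# Minimal variable-order Markov sketch with context counts, illustrating
# the probabilistic-suffix-tree prediction the paper builds on.
from collections import defaultdict

def train_vmm(seq, max_order=3):
    """Count next-symbol occurrences for all contexts up to max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq)):
        for k in range(max_order + 1):       # contexts of length 0..max_order
            if i - k < 0:
                break
            ctx = tuple(seq[i - k:i])
            counts[ctx][seq[i]] += 1
    return counts

def predict(counts, history, symbol, max_order=3):
    """P(symbol | history), backing off from the longest seen context."""
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = tuple(history[len(history) - k:])
        if ctx in counts and counts[ctx][symbol] > 0:
            return counts[ctx][symbol] / sum(counts[ctx].values())
    return 0.0
```

On a toy alternating sequence like `ababab`, the model learns that `b` deterministically follows `a`, which is the kind of sequential feature pattern the suffix tree captures.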
Personalized e-learning models tailor learning resources to the needs of individual learners. The Adaptive Hypermedia Architecture (AHA) is a successful implementation of the personalized e-learning model, which uses learning outcomes as the personalization parameter to adapt the learning experience. However, besides learning outcomes, the learner's emotions, which strongly influence memory and problem solving, are completely neglected in the AHA model. This paper presents an Adaptive Educational Hypermedia (AEH) model, known as the Expert Elearning System (EES), which is built on top of the AHA to incorporate a facial emotion recognition framework. The emotion recognition framework herein, denoted MKLDT-WFA, is realized by training simple Multiple Kernel Learning (MKL) with Weighted Kernel Alignment (WFA) in a Decision Tree (DT) classifier. The MKLDT-WFA framework has two merits over classical SimpleMKL. First, the WFA component preserves only relevant kernel weights to improve discrimination between emotion classes. Second, training in the DT eliminates misclassification issues associated with off-the-shelf SimpleMKL classifiers. The suggested framework has been evaluated on different emotion databases. The evaluation results reveal good emotion recognition performance and the potential to improve personalization in AEH models.
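Kernel alignment, the quantity that a weighted-alignment step scores kernels by, is a standard normalized Frobenius inner product between kernel matrices. The sketch below computes that classical definition; how MKLDT-WFA weights and prunes kernels beyond this is not reproduced.

```python
# Classical kernel alignment A(K, Ky) = <K, Ky>_F / (||K||_F * ||Ky||_F),
# the similarity measure underlying alignment-based kernel weighting.
import math

def frobenius_inner(K1, K2):
    """Frobenius inner product of two kernel matrices (lists of rows)."""
    return sum(a * b for r1, r2 in zip(K1, K2) for a, b in zip(r1, r2))

def kernel_alignment(K, Ky):
    num = frobenius_inner(K, Ky)
    den = math.sqrt(frobenius_inner(K, K)) * math.sqrt(frobenius_inner(Ky, Ky))
    return num / den
```

A kernel perfectly aligned with the target kernel scores 1.0; kernels with low alignment would receive small or zero weights in an alignment-weighted MKL combination.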
Deep Convolutional Neural Networks (DCNNs) exhibit remarkable performance in a number of pattern recognition and classification tasks. Modern DCNNs involve many millions of parameters and billions of operations. Inference using such DCNNs, if implemented as software running on an embedded processor, results in considerable execution time and energy consumption, which is prohibitive in many mobile applications. FPGA-based acceleration of DCNN inference is a promising approach to improve both energy consumption and classification throughput. However, the engineering effort required for development and verification of an optimized FPGA-based architecture is significant. In this paper, we present PLACID, an automated PLatform for Accelerator CreatIon for DCNNs. PLACID uses an analytical approach to characterization and exploration of the implementation space. PLACID enables generation of an accelerator with the highest throughput for a given DCNN on a specific target FPGA platform. Subsequently, it generates an RTL-level architecture in Verilog, which can be passed on to commercial tools for FPGA implementation. PLACID is fully automated, and reduces the accelerator design time from a few months down to a few hours. Experimental results show that architectures synthesized by PLACID yield 2X higher throughput density than the best competing approach.
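An analytical design-space exploration of the kind described can be pictured with a toy cost model: for each candidate parallelism factor that fits the FPGA's DSP budget, estimate cycles per inference from the per-layer MAC counts and keep the fastest configuration. The cost model, candidate set, and numbers below are illustrative assumptions, not PLACID's actual model.

```python
# Toy analytical DSE sketch: pick the processing-element count (pe) that
# maximizes estimated throughput subject to a DSP resource budget.
# The cost model (cycles = sum of ceil(MACs/pe) per layer) is illustrative.

def explore(layer_macs, dsp_budget, freq_hz=200e6, macs_per_dsp=1):
    best = None
    for pe in (16, 32, 64, 128, 256, 512):
        if pe * macs_per_dsp > dsp_budget:           # resource constraint
            continue
        cycles = sum(-(-m // pe) for m in layer_macs)  # ceil division per layer
        throughput = freq_hz / cycles                # inferences per second
        if best is None or throughput > best[1]:
            best = (pe, throughput)
    return best
```

A real tool would also model on-chip memory, off-chip bandwidth, and loop tiling; the point here is only the analytical enumerate-and-estimate structure.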
Cloud gaming has been recognized as a promising shift in the online game industry, with the aim of implementing the on-demand service concept that has achieved market success in other areas of digital entertainment such as movies and TV shows. The concepts of cloud computing are leveraged to render the game scene as a video stream which is then delivered to players in real time. The main advantage of this approach is the capability of delivering high-quality graphics games to any type of end user device, however at the cost of high bandwidth consumption and strict latency requirements. A key challenge faced by cloud game providers lies in configuring the video encoding parameters so as to maximize player Quality of Experience (QoE) while meeting bandwidth availability constraints. In this paper we tackle one aspect of this problem by addressing the following research question: Is it possible to improve service adaptation based on information about the characteristics of the game being streamed? To answer this question two main challenges need to be addressed: the need for different QoE-driven video encoding (re-)configuration strategies for different categories of games, and how to determine a relevant game categorization to be used for assigning appropriate configuration strategies. We investigate these problems by conducting two subjective laboratory studies with a total of 80 players and three different games. Results indicate that different strategies should likely be applied for different types of games, and show that existing game classifications are not necessarily suitable for differentiating game types in this context. We thus further analyze objective video metrics of collected game play video traces as well as player actions per minute and use this as input data for clustering of games into two clusters. Subjective results verify that different video encoding configuration strategies may be applied to games belonging to different clusters.
Mobile gaming is an emerging concept wherein gamers use mobile devices, such as smartphones and tablets, to play best-selling games. Compared to dedicated gaming boxes or PCs, these devices still fall short of executing complex new 3D video games with rich immersion. Three solutions relying on cloud computing infrastructure, namely computation offloading, cloud gaming, and the traditional client-server architecture, will represent the next generation of game engine architectures aimed at improving gaming experience and immersion. The basis of these solutions is the distribution of the game code over different devices (including set-top boxes, PCs, and servers). In order to know how the game code should be distributed, advanced knowledge of game engines is required. Consequently, dissecting and analyzing game engine performance will surely help us better understand how to move in these new directions (i.e., distribute game code), which is so far missing in the literature. Aiming to fill this gap, we propose in this paper to analyze and evaluate one of the most popular engines on the market, Unity 3D. We begin by detailing the architecture and game logic of game engines. Then, we use a test bed to evaluate the CPU and GPU consumption per frame and per module for five representative games on three platforms: a stand-alone computer, embedded systems, and web players. Based on the obtained results and observations, we build a valued graph of the modules composing the Unity 3D architecture, which reflects the internal flow and CPU consumption.
Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM, Long Short-Term Memory) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transitions in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting when training deep models. To understand how our models "translate" images to sentences, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve state-of-the-art results on both caption generation and image-sentence retrieval, even without integrating additional mechanisms (e.g., object detection or attention models). Our experiments also prove that multi-task learning is beneficial for increasing model generality and gaining performance. We also demonstrate that, under transfer learning, the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
Retrieval of high-level or complex events, such as a parade or a car accident, within video data without example images or videos is still a challenge. Current research in deep neural networks is highly beneficial for retrieval of high-level events based upon examples, but without any examples it is still hard to 1) determine which concepts are useful to pre-train (the Vocabulary challenge) and 2) determine which pre-trained concept detectors are relevant for a certain unseen high-level event (the Concept Selection challenge). In this paper, we present our Semantic Event Retrieval System that 1) shows the importance of high-level concepts in a vocabulary for the retrieval of high-level events and 2) uses a novel concept selection method based on semantic embeddings. Our experiments on the international TRECVID Multimedia Event Detection benchmark show that a diverse vocabulary including high-level concepts improves performance on the retrieval of high-level events in videos and that our novel method outperforms a knowledge-based concept selection method.
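Embedding-based concept selection can be sketched as ranking pre-trained detectors by the cosine similarity between their name embeddings and the event-query embedding. The toy two-dimensional vectors below are placeholders; the paper's actual semantic-embedding model is not reproduced.

```python
# Sketch of concept selection via semantic embeddings: rank concept
# detectors by cosine similarity to the event query and keep the top-k.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_concepts(query_vec, concept_vecs, top_k=2):
    """concept_vecs: dict mapping concept name -> embedding vector."""
    ranked = sorted(concept_vecs,
                    key=lambda c: cosine(query_vec, concept_vecs[c]),
                    reverse=True)
    return ranked[:top_k]
```

For a "parade"-like query, concepts whose embeddings point in a similar direction (e.g., "crowd") would rank above unrelated ones, which is the intuition behind replacing hand-built knowledge bases with embedding similarity.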
In this paper, we focus on isolated gesture recognition and explore different modalities, involving the RGB stream, depth stream, and saliency stream, for inspection. Our goal is to push the boundary of this realm even further by proposing a unified framework that exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus voting is proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance when confronted with comprehensive inter-class variations. In addition, a 3D depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics. Extensive experiments are done to analyze the performance of each component, and our proposed approach achieves the best results on two public benchmarks, ChaLearn IsoGD and RGBD-HuDaAct, outperforming the closest competitor by margins of over 10% and 15%, respectively. We will release our code to facilitate future research.
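The consensus-voting step can be pictured as averaging per-segment class scores into one video-level prediction, which is what reduces the variance of any single segment's estimate. The scores below are illustrative; the network that produces them is not reproduced.

```python
# Minimal consensus-voting sketch: average per-segment class scores,
# then take the argmax as the video-level label.

def consensus_vote(segment_scores):
    """segment_scores: list of per-segment score lists, one value per class."""
    n = len(segment_scores)
    num_classes = len(segment_scores[0])
    avg = [sum(s[c] for s in segment_scores) / n for c in range(num_classes)]
    label = max(range(num_classes), key=lambda c: avg[c])
    return label, avg
```

A single noisy segment (the first one below) is outvoted by the majority, illustrating the variance-reduction effect.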
Surveillance video is the main input source of intelligent video surveillance systems. Detection performed on surveillance video contributes significantly to safety and security goals. However, performing detection on unprotected surveillance video may reveal the privacy of innocent people in the video. How to strike a balance between personal privacy and the feasibility of detection is an important issue. A promising solution to this problem is to encrypt the surveillance videos and to perform detection on these encrypted videos. Most existing work on encrypted signal processing focuses on images or small data. Since videos have a much larger data size, processing encrypted videos is a great challenge. In this paper, we propose an efficient motion detection and tracking scheme for encrypted H.264/AVC video bitstreams that does not require prior decryption of the encrypted video. The main idea is to first estimate motion information from the bitstream structure and codeword lengths, and then to apply a region update (RU) algorithm to deal with the loss and error drift of motion information caused by the video encryption. We extract information from the codewords of the encrypted motion vector differences. Using the prior knowledge that object motion in video is continuous in space and time, we design the RU algorithm to fix erroneous information and renew the detected region. Compared to the existing scheme based on pixel-level video encryption [Chu et al. 2013], the proposed scheme has the advantages of smaller storage for the encrypted video and lower computational cost in encryption and detection. Experimental results show that our scheme performs better in detection accuracy, execution speed, and ease of installation.
Moreover, the proposed scheme can use not only the video encryption presented in this paper but also other format-compliant video encryption schemes, provided that the positions of macroblocks can be extracted from the encrypted video bitstream. Due to the coupling of the video stream encryption and detection algorithms, our scheme can be directly connected to the video stream output of, e.g., surveillance cameras, without any modification to those cameras.
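The codeword-length cue rests on a property of H.264's entropy coding: motion vector differences (MVDs) are coded with signed Exp-Golomb codes, so even when codeword contents are encrypted, a length-preserving encryption leaves the codeword length observable, and that length bounds the MVD magnitude. The sketch below follows the standard Exp-Golomb structure; its use as a detection cue is a simplified illustration of the paper's idea, not its full algorithm.

```python
# Exp-Golomb structure: code_num k occupies 2*floor(log2(k+1))+1 bits, and
# the H.264 signed mapping is code_num 0 -> 0, 1 -> +1, 2 -> -1, 3 -> +2, ...
# So a codeword's length alone brackets the |MVD| it can encode.

def ue_length(code_num):
    """Bits used by unsigned Exp-Golomb for code_num."""
    return 2 * (code_num + 1).bit_length() - 1

def mvd_bounds_from_length(length):
    """Given an (odd) codeword length, return (min, max) |mvd| it can encode."""
    prefix = (length - 1) // 2
    lo_code = (1 << prefix) - 1          # smallest code_num with this length
    hi_code = (1 << (prefix + 1)) - 2    # largest code_num with this length
    to_magnitude = lambda k: (k + 1) // 2
    return to_magnitude(lo_code), to_magnitude(hi_code)
```

For example, a 5-bit codeword can only carry code_nums 3..6, i.e., |MVD| between 2 and 3, which is the kind of coarse motion estimate the RU algorithm then refines using spatial and temporal continuity.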
Facial Expression Recognition (FER) is one of the most important topics in computer vision and pattern recognition, and it has attracted increasing attention for its scientific challenges and application potential. In this paper, we propose a novel and effective approach to FER using multi-modal 2D and 3D videos, which encodes both static and dynamic cues via a scattering convolutional network. Firstly, a shape-based detection method is introduced to locate the start and end of an expression in videos, segment its onset, apex, and offset states, and sample the important frames for emotion analysis. Secondly, the apex frames of 2D videos are represented by scattering, conveying static texture details. Those of 3D videos are processed in a similar way, but to highlight static shape details, several geometric maps in terms of multiple-order differential quantities, i.e., Normal Maps (NOM) and Shape Index Maps (SIM), are generated as the input of scattering, instead of the original smooth facial surfaces. Thirdly, the average of neighboring samples centred at each key texture frame or shape map, evenly distributed in the onset, is computed, and the scattering features extracted from all the average samples of 2D and 3D videos are then concatenated to capture dynamic texture and shape cues, respectively. Finally, a Support Vector Machine (SVM) is adopted to measure the similarity of individual features in either the 2D or 3D modality, and all the scores are combined for multi-modal decision making to predict the expression label. Thanks to the scattering descriptor, the proposed approach not only encodes distinct local texture and shape variations of different expressions, as several milestone operators such as SIFT and HOG do, but also captures subtle information hidden in high frequencies in both channels, which is crucial to better distinguish expressions that are easily confused.
Validation is conducted on the BU-4DFE database, and state-of-the-art accuracy is reached, indicating the approach's competency for this task.
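The final decision step described above is a score-level fusion: per-modality SVM similarity scores are combined before picking the expression label. The weighted sum below is one common way to do this; the weights and scores are illustrative, and the paper's exact combination rule is not reproduced.

```python
# Score-level fusion sketch: combine per-class scores from the 2D and 3D
# modalities with a weighted sum, then take the argmax as the label.

def fuse_scores(scores_2d, scores_3d, w2d=0.5, w3d=0.5):
    """scores_2d/scores_3d: per-class score lists from each modality."""
    fused = [w2d * a + w3d * b for a, b in zip(scores_2d, scores_3d)]
    return max(range(len(fused)), key=lambda i: fused[i])
```

Equal weights are the simplest choice; weights could also be tuned per modality on a validation set.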
Social media platforms are turning into important news sources for users since they provide real-time information with a wide range of perspectives. However, the high volume, dynamism, noise, and redundancy exhibited by social media data create difficulties for users in comprehending the entire content. Recent works emphasize summarizing the content of either a single social media platform or a single modality (either textual or visual). However, each platform has its own unique characteristics and user base, which brings to light different aspects of real-world events. This makes it critical as well as challenging to combine textual and visual data from different platforms. In this article, we propose summarization of real-world events with data stemming from different platforms and multiple modalities. We present the use of a Markov Random Field-based similarity measure to link content across multiple platforms. This measure also enables the linking of content across time, which is useful for tracking the evolution of long-running events. For the final content selection, summarization is modeled as a subset selection problem. To handle the complexity of optimal subset selection, we propose the use of submodular objectives. Facets such as coverage, novelty, and significance are modeled as submodular objectives in a multimodal social media setting. We conduct a series of quantitative and qualitative experiments to illustrate the effectiveness of our approach compared to alternative methods.
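Subset selection under a monotone submodular objective is typically solved with the greedy algorithm, which carries the classic (1 - 1/e) approximation guarantee. The toy "topic coverage" objective below stands in for the multimodal coverage/novelty/significance facets; it is a generic illustration, not the paper's objective.

```python
# Greedy maximization of a monotone submodular coverage objective:
# repeatedly add the item with the largest marginal gain until the budget
# is spent or no item adds new coverage.

def greedy_max_coverage(items, budget):
    """items: dict mapping item name -> set of covered topics."""
    chosen, covered = [], set()
    while len(chosen) < budget:
        gains = {name: len(topics - covered)
                 for name, topics in items.items() if name not in chosen}
        best = max(gains, key=gains.get)
        if gains[best] == 0:      # nothing left to gain: stop early
            break
        chosen.append(best)
        covered |= items[best]
    return chosen
```

Note how marginal gain, not absolute coverage, drives the choice: an item that overlaps heavily with already-selected content contributes little, which is exactly how the novelty facet avoids redundant summaries.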
While convolutional neural networks (CNN) have been excellent for object recognition, the greater spatial variability in scene images typically means that the standard full-image CNN features are suboptimal for scene classification. In this paper, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV) encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities of RGB, HHA and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity --- that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal non-sparsity --- that features from all modalities are encouraged to co-exist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we are able to achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we further apply our feature fusion framework to the action recognition task to demonstrate that our framework can be generalized to other multi-modal well-structured features. In particular, for action recognition, we enforce inter-part sparsity to choose more discriminative body parts, and inter-modal non-sparsity to make informative features from both appearance and motion modalities co-exist. Experimental results on the JHMDB and MPII Cooking datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state-of-the-art.
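The component-sparsity effect of the group lasso can be pictured through its proximal operator, block soft-thresholding, which shrinks each group's norm and zeroes out whole groups (here, whole GMM components) whose norm falls below the threshold. This is the generic operator, not the paper's training code.

```python
# Block soft-thresholding: the proximal operator of the group-lasso
# penalty. Groups with small norm are set entirely to zero, which is how
# group lasso discards whole feature groups at once.
import math

def prox_group_lasso(w, lam):
    """w: list of groups (each a list of weights); lam: threshold."""
    out = []
    for g in w:
        norm = math.sqrt(sum(x * x for x in g))
        scale = max(0.0, 1 - lam / norm) if norm > 0 else 0.0
        out.append([scale * x for x in g])
    return out
```

A strong group survives with its weights shrunk proportionally, while a weak group is eliminated outright; exclusive group lasso, used across modalities, has the opposite effect of spreading weight across groups rather than concentrating it.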