This paper develops an aggregate power consumption model for many-to-one live video streaming systems, such as video surveillance, where multiple video sources stream videos to a central monitoring station. In such systems, power consumption is a major concern, especially for battery-powered video sources. We model the video capturing, encoding, and transmission aspects and then provide an overall model of the power consumed by the video cameras and/or sensors. The developed model captures the following main parameters: resolution, frame rate, quantization, motion estimation range, and number of reference frames. We also analyze the power consumed by the monitoring station, which is due to video reception, potential video upscaling, and video decoding of all received video streams. In addition to modeling the power consumption, we model the achieved bitrate of video encoding. We validate and analyze the power consumption models of each phase, as well as the aggregate power consumption model, through extensive experiments. The analysis includes examining individual parameters separately and examining the impacts of changing more than one parameter at a time.
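The phase-wise decomposition described in this abstract can be written compactly as follows; the functional form and symbols are illustrative assumptions consistent with the listed phases and parameters, not the paper's fitted model:

```latex
% Illustrative sketch; symbols and functional form are assumptions, not the paper's fitted model.
P_{\mathrm{cam}} = P_{\mathrm{cap}}(s, f) + P_{\mathrm{enc}}(s, f, q, m, r) + P_{\mathrm{tx}}\!\left(R(s, f, q, m, r)\right),
\qquad
P_{\mathrm{mon}} = \sum_{k=1}^{K} \left[ P_{\mathrm{rx}}(R_k) + P_{\mathrm{up},k} + P_{\mathrm{dec},k} \right]
```

where $s$ is the resolution, $f$ the frame rate, $q$ the quantization, $m$ the motion-estimation range, $r$ the number of reference frames, $R$ the achieved encoding bitrate, and $K$ the number of streams received by the monitoring station.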
Deep learning has become a crucial technology for multimedia computing. It offers a powerful instrument to automatically produce high-level abstractions of complex multimedia data, which can be exploited in a number of applications including object detection and recognition, speech-to-text, media retrieval, and multimodal data analysis. The availability of affordable large-scale parallel processing architectures, and the sharing of effective open-source code implementing the basic learning algorithms, caused a rapid diffusion of deep learning methodologies, bringing a number of new technologies and applications that in most cases outperform traditional machine learning approaches. In recent years, the possibility of implementing deep learning technologies on mobile devices has attracted significant attention. Thanks to this technology, portable devices may become smart objects able to learn and act. The path towards these exciting future scenarios, however, entails a number of important research challenges. Deep learning architectures and algorithms can hardly be adapted to the storage and computation resources of a mobile device. Therefore, there is a need for new generations of mobile processors and chipsets, small-footprint learning and inference algorithms, new models of collaborative and distributed processing, and a number of other fundamental building blocks. This survey reports the state of the art in this exciting research area, looking back at the evolution of neural networks and arriving at the most recent results in terms of methodologies, technologies, and applications for mobile environments.
In this work we explore the increasing demand for novel user interfaces to navigate large media collections. We implement a geometric data structure to store and retrieve item-to-item similarity information and propose a novel navigation framework that uses vector operations and real-time user feedback to direct the outcome. The framework is scalable to large media collections and is suitable for computationally constrained devices. In particular, we implement this framework in the domain of music. To evaluate the effectiveness of the navigation process, we propose an automatic evaluation framework, based on synthetic user profiles, which allows us to quickly simulate and compare navigation paths using different algorithms and datasets. Moreover, we perform a real user study. To this end, we developed and launched Mixtape, a simple web application that allows users to create playlists by providing real-time feedback through liking and skipping patterns.
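Feedback-driven navigation by vector operations can be sketched as follows; the update rule, step sizes, and synthetic like-profile below are illustrative assumptions, not the paper's exact algorithm:

```python
# Sketch of feedback-driven navigation in an item embedding space.
# The update rule and step sizes are illustrative assumptions.

def nearest(query, items, visited):
    """Return the index of the unvisited item closest to the query point."""
    best, best_d = None, float("inf")
    for i, v in enumerate(items):
        if i in visited:
            continue
        d = sum((a - b) ** 2 for a, b in zip(query, v))
        if d < best_d:
            best, best_d = i, d
    return best

def step(query, item, liked, alpha=0.5, beta=0.3):
    """Pull the query toward a liked item, push it away from a skipped one."""
    sign = alpha if liked else -beta
    return [q + sign * (x - q) for q, x in zip(query, item)]

def navigate(items, likes, start, steps):
    """Walk the collection; 'likes' is the set of item indices a synthetic
    user profile would like (cf. the automatic evaluation framework)."""
    query, visited, path = list(items[start]), {start}, [start]
    for _ in range(steps):
        i = nearest(query, items, visited)
        if i is None:
            break
        visited.add(i)
        path.append(i)
        query = step(query, items[i], i in likes)
    return path
```

Because each feedback event only shifts one point and queries a nearest neighbor, the per-step cost stays low, which matches the constrained-device requirement above.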
Location-based games have been around since 2000, but only recently, when Pokemon Go came to market, did it become clear that they can reach wide popularity. In this paper, we present a literature-based analytical study of the issues location-based game design faces and how they can be solved. We study how to use and verify location, the role of such games as exergames, their use in education, and technical and safety issues. As a case study, we present the O Mopsi game, which combines physical activity with problem solving. It includes three challenges: (1) navigating to the next target, (2) deciding the order of targets, and (3) physical movement. All of them are unavoidable and relevant. For guiding players, we use three types of multimedia: images (targets and maps), sound (user guidance), and GPS (for positioning). We discuss motivational aspects, analysis of the playing, and content creation. Quality of experience is reported based on playing at SciFest science festivals during 2011-2016.
Online video presents new challenges to traditional caching, with a more-than-thousandfold increase in the number of assets, rapidly changing asset popularity, and much higher throughput requirements. We propose a new hierarchical filtering algorithm for caching online video, called HiFi. Our algorithm is designed to optimize hit rate, replacement rate, and cache throughput, with an implementation complexity comparable to that of LRU. Our results show that under typical operator conditions, HiFi can increase edge cache byte hit rate by 5-24% over an LRU policy, but more importantly can increase RAM or memory byte hit rate by 80% to 200% and reduce replacement rate by more than 100 times. These two factors combined can dramatically increase throughput for most caches. If SSDs are used for storage, the much lower replacement rate may also allow substitution of lower-cost MLC-based SSDs for SLC-based SSDs. We extend previous multi-tier analytical models for LRU caches to caches with filtering. We analytically show how HiFi approaches the performance of an optimal caching policy and how to tune HiFi to reach as close to optimal performance as the traffic conditions allow. We develop a realistic simulation environment for online video using statistics from operator traces. We show that HiFi performs within a few percentage points of the optimal solution, simulated by Belady's MIN algorithm, under typical operator conditions.
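The general idea of filtering admissions in front of an LRU cache can be sketched as follows; the count-based admission filter and threshold k are illustrative assumptions, not the actual HiFi policy:

```python
from collections import Counter, OrderedDict

class FilteredLRU:
    """LRU cache behind a request-count admission filter: an object is
    admitted only after it has been requested k times, so one-hit wonders
    never trigger a replacement.  Illustrative sketch, not the HiFi policy."""

    def __init__(self, capacity, k=2):
        self.capacity, self.k = capacity, k
        self.cache = OrderedDict()          # asset id -> payload (LRU order)
        self.counts = Counter()             # request counts seen so far
        self.hits = self.requests = self.replacements = 0

    def get(self, asset, fetch):
        self.requests += 1
        self.counts[asset] += 1
        if asset in self.cache:
            self.cache.move_to_end(asset)   # refresh LRU position
            self.hits += 1
            return self.cache[asset]
        payload = fetch(asset)              # miss: fetch from origin
        if self.counts[asset] >= self.k:    # admit only repeatedly requested assets
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
                self.replacements += 1
            self.cache[asset] = payload
        return payload
```

On a skewed request trace, the filter keeps unpopular assets out of the cache, which lowers the replacement rate relative to plain LRU (k=1) at the cost of one extra miss per admitted asset.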
With the advent and popularity of social networks, social graphs have become essential for many social media applications to improve service and information relevance to users, e.g., to predict follower/followee relationships, community membership, etc. However, social graphs may be hidden by users due to privacy concerns, or kept private by the social media platforms. Recently, connections discovered from user-shared images using non-user-generated labels have proved to be a more accessible alternative to social graphs. However, real-time discovery is difficult due to its high complexity, which rules out many applications. This paper proposes an efficient computation framework for connection discovery using user-shared images. The framework splits the work into online and offline computation to enable faster connection discovery. Furthermore, this paper devises a general and scalable online computation framework into which many algorithms can fit to help discover connections on the fly. The performance of the framework is evaluated on follower/followee recommendation with 300K+ user-shared images from two social networks. Results show that the proposed framework on average reduces computation time by 90% compared with existing frameworks, while retaining 90% of their accuracy on discovered connections.
Mobile gaming is an emerging concept wherein gamers use mobile devices, such as smartphones and tablets, to play best-selling games. Compared to dedicated gaming boxes or PCs, these devices still fall short of executing complex new 3D video games with rich immersion. Three solutions relying on cloud computing infrastructure, namely Computation Offloading, Cloud Gaming, and the traditional Client-Server architecture, will represent the next generation of game engine architectures aiming at improving gaming experience and immersion. The basis of these solutions is the distribution of the game code over different devices (including set-top boxes, PCs, and servers). To know how the game code should be distributed, advanced knowledge of game engines is required. Consequently, dissecting and analyzing game engine performance will help us better understand how to move in these new directions (i.e., distribute game code), an analysis so far missing from the literature. Aiming to fill this gap, we propose in this paper to analyze and evaluate one of the most popular engines on the market, Unity 3D. We begin by detailing the architecture and game logic of game engines. Then, we use a test-bed to evaluate the CPU and GPU consumption per frame and per module for five representative games on three platforms, namely a stand-alone computer, embedded systems, and web players. Based on the obtained results and observations, we build a valued graph of each module composing the Unity 3D architecture, which reflects the internal flow and CPU consumption.
Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM, where LSTM stands for Long Short-Term Memory) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transitions in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirroring are proposed to prevent overfitting when training deep models. To understand how our models "translate" images to sentences, we visualize and qualitatively analyze the evolution of the Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve state-of-the-art results on both caption generation and image-sentence retrieval, even without integrating additional mechanisms (e.g., object detection, attention models, etc.). Our experiments also prove that multi-task learning is beneficial for increasing model generality and gaining performance. We further demonstrate that the transfer learning performance of our Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
The meeting of pervasive screens and ubiquitous smart devices in the smart city has brought an interaction revolution in the form of Screen-Smart device Interaction (SSI). SSI leverages the rich features of smart devices to interact with connected screens, which have dispersed into every corner of the city. Despite the promising features of SSI, most current surveys focus on direct human-screen interaction; to the best of our knowledge, none have investigated the state of the art of SSI. This survey examines SSI from various perspectives, including both hardware implementation (i.e., screen and smart device) and software development (i.e., interaction modality and multimedia content). The survey is structured into three main elements according to the notion of SSI, i.e., screen, smart device, and interaction. Two evaluation metrics that can be used to benchmark SSI performance in terms of interaction latency and accuracy are discussed. Further, the interaction scalability needed to support simultaneous interaction with multiple screens and smart devices is investigated. The advantages of SSI in stimulating social interaction and reflecting a smart lifestyle in the smart city are identified. Lastly, future research challenges and opportunities are highlighted for the next generation of SSI.
Introduction to Special Issue on Deep Learning for Mobile Multimedia
In this paper we discuss an innovative media entertainment application called Interactive Movietelling. As an offspring of interactive storytelling applied to movies, we propose to integrate narrative generation through AI planning with video processing and modeling to construct filmic variants starting from the baseline content. The integration is made possible by content descriptions using semantic attributes pertaining to intermediate-level concepts shared between the video processing and planning levels. The output is a recombination of segments taken from the input movie, performed so as to convey an alternative plot. User tests on the prototype showed how promising Interactive Movietelling might be, even though it was designed at a proof-of-concept level. The possible improvements suggested here lead to many challenging research issues.
Cloud Data Centers (CDCs) are becoming a cost-effective means for processing and storing multimedia data, including images, video, and audio. Since CDCs are physically located in different jurisdictions and are managed by external parties, data security is a growing concern. Data encryption at CDCs is commonly practiced to improve data security. However, to process the data at CDCs, the data must often be decrypted, which raises security issues. Thus, there is a growing demand for data processing techniques in the encrypted domain. In this paper we analyze encrypted-domain speech content processing techniques for noise reduction. Noise contaminates speech during transmission or during acquisition by recording, degrading the quality of the speech content. We apply Shamir's Secret Sharing (SSS) as the cryptosystem to encrypt speech data before uploading it to a CDC. We then propose finite impulse response (FIR) digital filters to reduce white and wind noise in speech in the encrypted domain. We prove that our proposed schemes meet the security requirements of efficiency, accuracy, and checkability for both semi-honest and malicious adversarial models. Experimental results show that our proposed filtering techniques for speech noise reduction in the Encrypted Domain (ED) produce results similar to Plaintext Domain (PD) processing.
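The key property exploited here is that FIR filtering is linear and therefore commutes with Shamir's linear sharing, so each party can filter its share stream independently. A minimal sketch follows; the prime, share parameters, and integer filter taps are illustrative assumptions (real speech samples would need a fixed-point encoding into the field):

```python
import random

P = 2_147_483_647  # prime field modulus; the choice is an illustrative assumption

def share(secret, n=3, t=2):
    """Split a nonnegative integer into n Shamir shares with threshold t."""
    coeffs = [secret % P] + [random.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

def fir_on_shares(share_streams, taps):
    """Apply an integer-tap FIR filter directly to each party's share stream.
    Linearity of both the filter and the sharing makes the two commute."""
    out = []
    for stream in share_streams:            # one (x, y) stream per party
        x = stream[0][0]
        ys = [y for _, y in stream]
        out.append([(x, sum(t * ys[k + j] for j, t in enumerate(taps)) % P)
                    for k in range(len(ys) - len(taps) + 1)])
    return out
```

Reconstructing the filtered shares yields exactly the plaintext FIR output, which is the efficiency/accuracy property the abstract refers to.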
Rate control is a crucial consideration in High-Efficiency Video Coding (HEVC), but CTU-level rate control models often fail due to inadequate consideration of the correlation among neighboring CTUs. In this study, we establish a novel complexity-correlation-based CTU-level rate control for HEVC. First, we formulate the model parameter estimation scheme as a multivariable estimation problem. Second, based on the complexity correlation of neighboring CTUs, an optimal direction is selected among five candidate directions for CTU selection during model parameter estimation, further improving the prediction accuracy of the complexity of the CTU being encoded; the most relevant CTUs in the optimal direction are used to estimate the current CTU's model parameters. Third, to improve their precision, the estimated model parameters are refined using the estimated complexity relationship between the current CTU and the selected CTUs. Experimental results show that the proposed algorithm can significantly improve the accuracy of CTU-level rate control and thus the coding performance; the proposed scheme consistently outperformed HM16.0 and other state-of-the-art algorithms in a variety of testing configurations. More specifically, up to 8.9% and on average 6.4% BD-Rate reduction was achieved compared to HM16.0, and up to 4.2% and on average 2.5% BD-Rate reduction was achieved compared to other algorithms, with only a slight complexity overhead.
Saliency detection has recently received extensive research interest beyond 2D features. Despite the many available capturing devices and algorithms, there still exists a wide spectrum of challenges that need to be addressed to achieve accurate saliency detection. Inspired by the success of light-field technology, in this paper we propose a new computational scheme to detect salient regions by investigating multiple visual cues from light-field images. First, saliency prior maps are generated from several light-field features based on superpixel-level intra-cue distinctiveness, such as color, depth, and flow inherited from different focal planes and multiple viewpoints. Then, these maps are merged into a single map using a random-search-based weighting approach, and a Gaussian-based location cue is proposed to enhance the integrated saliency values. Finally, structure-preserving graph-based methods are employed to effectively refine the object details and enhance the accuracy of the saliency map. Since the widely used light-field saliency dataset LFSD [Li et al. 2014] is not sufficiently challenging, we present a new light-field saliency analysis benchmark dataset, named HFUT-Lytro, which consists of 255 light fields, with between 53 and 64 images generated from each light-field sample, spanning multiple saliency detection challenges such as occlusions, cluttered backgrounds, and appearance changes. The experimental results demonstrate the effectiveness of the proposed approach and show that it achieves 0.6-6.7% relative improvement over state-of-the-art methods in terms of the F-measure and precision evaluation metrics.
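The random-search-based weighting step can be sketched as follows; the scoring function (variance of the fused map as a separability proxy) and the search budget are illustrative assumptions, not the paper's criterion:

```python
import random

def fuse(maps, weights):
    """Pixel-wise weighted sum of prior maps (equal-length flattened lists)."""
    return [sum(w * m[i] for w, m in zip(weights, maps))
            for i in range(len(maps[0]))]

def contrast(fused):
    """Illustrative score: variance of the fused map (higher suggests a more
    separable foreground/background).  The paper's criterion is more involved."""
    mu = sum(fused) / len(fused)
    return sum((v - mu) ** 2 for v in fused) / len(fused)

def random_search_weights(maps, score, iters=200, seed=0):
    """Random search over the weight simplex, keeping the best-scoring draw."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(iters):
        raw = [rng.random() for _ in maps]
        total = sum(raw)
        w = [r / total for r in raw]        # normalize onto the simplex
        s = score(fuse(maps, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```

Given one informative prior map and one flat one, the search concentrates weight on the informative map, which is the behavior the merging step relies on.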
Label imbalance and the insufficiency of labeled training samples are major obstacles for most methods of counting people in images or videos. In this work, a sparse representation-based semi-supervised regression method is proposed to count people in images with limited labeled data. The basic idea is to predict labels for the unlabeled training data, select reliable samples to expand the labeled training set, and retrain the regression model. In the algorithm, the initial regression model, learned from the labeled training data, is used to predict the number of people in the unlabeled training images. The unlabeled training samples are then regarded as an over-complete dictionary, so that each feature of the labeled training data can be expressed as a sparse linear approximation of the unlabeled data. In turn, the labels of the labeled training data can be estimated via sparse reconstruction in feature space, and the confidence in the label assigned to an unlabeled sample is estimated from the reconstruction error. The training set is updated by selecting the unlabeled samples with minimal reconstruction errors, and the regression model is retrained on the new training set. A co-training-style method is applied during the training process. The experimental results demonstrate that the proposed method achieves low mean square error and mean absolute error on people-counting benchmarks compared with state-of-the-art methods.
There has been significant research effort into Peer-to-Peer (P2P) Massively Multi-user Virtual Environments (MMVEs). A number of architectures have been proposed to implement the P2P approach; however, the development of fully distributed MMVEs has met with a number of challenges. In this work we address one of the key remaining challenges: state consistency and persistency in P2P MMVEs. Having reviewed state management and persistency architectures currently receiving research attention, we identified deficiencies such as lack of fairness, responsiveness, and scalability. To address these deficiencies, we present Pithos -- a reliable, responsive, secure, fair, and scalable distributed storage system suited to P2P MMVEs. Pithos is designed specifically for P2P MMVEs, and we show that it improves the reliability and responsiveness of storage compared to existing P2P state persistency architectures. Pithos is implemented as an OverSim simulation running on the OMNeT++ network simulation framework. It is evaluated using up to 10,400 peers with realistic latency profiles and up to 15.8 million storage and retrieval requests, generated to store a total of 2.4 million objects. Each peer in Pithos uses at most 1950 Bps of bandwidth to achieve 99.98% storage reliability, while the most reliable overlay storage configuration tested achieved only 93.65% reliability using 2182 Bps of bandwidth. Pithos is also more responsive than overlay storage, with an average responsiveness of 0.192s, compared with an average overlay responsiveness of 1.4s when retrieving objects from storage.
In this paper, we present a novel framework that can produce a visual description of a tourist attraction by choosing, from community-contributed datasets, the most diverse pictures that describe different details of the queried location. The main strength of the proposed approach is its flexibility, which permits filtering out non-relevant images and obtaining a reliable set of diverse and relevant images by first clustering similar images according to their textual descriptions and visual content, and then extracting images from different clusters according to a measure of user credibility. Clustering is based on a two-step process in which textual descriptions are used first and the clusters are then refined according to visual features. The degree of diversification can be further increased by exploiting users' judgments on the results produced by the proposed algorithm through a novel approach in which users provide not only relevance feedback but also diversity feedback. Experimental results on the MediaEval 2015 "Retrieving Diverse Social Images" dataset show that the proposed framework achieves very good performance both in the case of automatic retrieval of diverse images and in the case of exploiting users' feedback. The effectiveness of the proposed approach has also been confirmed by a small case study involving a number of real users.
Social media platforms are turning into important news sources for users, since they provide real-time information from a wide range of perspectives. However, the high volume, dynamism, noise, and redundancy exhibited by social media data make it difficult for users to comprehend the entire content. Recent works emphasize summarizing the content of either a single social media platform or a single modality (either textual or visual). However, each platform has its own unique characteristics and user base, which bring to light different aspects of real-world events. This makes it critical, as well as challenging, to combine textual and visual data from different platforms. In this article, we propose the summarization of real-world events with data stemming from different platforms and multiple modalities. We present a Markov Random Field-based similarity measure to link content across multiple platforms. This measure also enables linking content across time, which is useful for tracking the evolution of long-running events. For the final content selection, summarization is modeled as a subset selection problem. To handle the complexity of optimal subset selection, we propose the use of submodular objectives. Facets such as coverage, novelty, and significance are modeled as submodular objectives in a multimodal social media setting. We conduct a series of quantitative and qualitative experiments to illustrate the effectiveness of our approach compared to alternative methods.
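The submodular subset selection step can be sketched with the classic greedy algorithm; using plain concept coverage below is an illustrative stand-in for the paper's combined coverage/novelty/significance objective:

```python
def coverage(selected, concepts):
    """Monotone submodular coverage objective: number of distinct concepts
    covered by the selected items (concepts: list of sets, one per item)."""
    covered = set()
    for i in selected:
        covered |= concepts[i]
    return len(covered)

def greedy_summary(concepts, budget):
    """Greedy maximization of a monotone submodular objective under a
    cardinality budget, which carries the classic (1 - 1/e) approximation
    guarantee.  Illustrative sketch, not the paper's full objective."""
    selected, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for i, c in enumerate(concepts):
            if i in selected:
                continue
            gain = len(c - covered)          # marginal coverage gain
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:                     # no remaining item adds anything
            break
        selected.append(best)
        covered |= concepts[best]
    return selected
```

Diminishing marginal gains are what make the greedy choice near-optimal here: once a concept is covered, every later item's gain for it drops to zero.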
While convolutional neural networks (CNNs) have been excellent for object recognition, the greater spatial variability in scene images typically means that standard full-image CNN features are suboptimal for scene classification. In this paper, we investigate a framework allowing greater spatial flexibility, in which the Fisher vector (FV) encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity --- that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability; and (2) modal non-sparsity --- that features from all modalities are encouraged to co-exist. In our proposed feature fusion framework, these are implemented through regularization terms that apply a group lasso to GMM components and an exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. Moreover, we apply our feature fusion framework to the action recognition task to demonstrate that it can be generalized to other well-structured multi-modal features. In particular, for action recognition, we enforce inter-part sparsity to choose more discriminative body parts, and inter-modal non-sparsity to encourage informative features from both appearance and motion modalities to co-exist. Experimental results on the JHMDB and MPII Cooking datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.
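The two regularizers described above can be sketched as follows; the grouping notation and the weighting by $\lambda_1$, $\lambda_2$ are assumptions based on the abstract, not the paper's exact formulation:

```latex
% Sketch of the fusion regularizer; grouping and lambda weights are assumptions.
\Omega(W) \;=\;
\lambda_1 \underbrace{\sum_{g \in \mathcal{G}} \lVert W_g \rVert_2}_{\substack{\text{group lasso over GMM components} \\ \text{(component sparsity)}}}
\;+\;
\lambda_2 \underbrace{\sum_{i} \Big( \sum_{m \in \mathcal{M}} \lVert w_{i,m} \rVert_1 \Big)^{2}}_{\substack{\text{exclusive group lasso across modalities} \\ \text{(modal non-sparsity)}}}
```

The group lasso's sum of unsquared group norms drives whole GMM components to zero, while the exclusive lasso's squared inner sum penalizes concentrating weight in a single modality, spreading it across modalities instead.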