Light field images, for example those taken with plenoptic cameras, offer interesting post-processing opportunities, including depth-of-field management, depth estimation, viewpoint selection, and 3D image synthesis. Like most capture devices, however, plenoptic cameras have a limited dynamic range, so over- and under-exposed areas in plenoptic images are commonplace. We therefore present a straightforward and robust plenoptic reconstruction technique based on the observation that vignetting causes peripheral views to receive less light than central views. Thus, corresponding pixels in different views can be used to reconstruct illumination, especially in areas where information missing in one view is present in another. Our algorithm accurately reconstructs under- and over-exposed regions (known as declipping), additionally affording an increase in peak luminance of up to 2 f-stops and a comparable lowering of the noise floor. The key advantages of this approach are that no hardware modifications are necessary to improve the dynamic range, that no multiple-exposure techniques are required, and therefore that no ghosting or other artefacts are introduced.
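As a minimal numpy sketch of the core idea (assuming just two registered views and a known vignetting exposure ratio, which are simplifications of the actual multi-view algorithm), a saturated pixel in the bright central view can be reconstructed from the corresponding, still-unclipped pixel in a darker peripheral view:

```python
import numpy as np

def declip_pair(central, peripheral, gain, clip_hi=0.98):
    """Reconstruct clipped pixels of a bright central view from a darker,
    vignetted peripheral view (two-view illustration; names hypothetical).

    central, peripheral: registered float images in [0, 1]
    gain: exposure ratio central/peripheral caused by vignetting
    """
    out = central.astype(np.float64).copy()
    clipped = central >= clip_hi         # saturated in the bright view
    usable = peripheral < clip_hi        # still valid in the dark view
    mask = clipped & usable
    out[mask] = peripheral[mask] * gain  # rescale to the central exposure
    return out                           # values above 1 extend peak luminance
```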
Hosting interactive video-based services, such as computer games, in the cloud poses particular challenges given the sensitivity to delay. A better understanding of the impact of delay on player-game interactions can help design cloud systems and games that accommodate the delay inherent in cloud systems. Previous top-down studies of delay using full-featured games have helped understand the impact of delay, but often do not generalize nor lend themselves to analytic modeling. Bottom-up studies isolating user input and delay can generalize better and be used in models, but have yet to be applied to cloud-hosted computer games. In order to better understand delay impact in cloud-hosted computer games, we conduct a large bottom-up user study centered on a fundamental game interaction: selecting a moving target with user input subject to delay. We build a custom game that controls both the target speed and the input delay and has players select the target using an analog thumbstick controller. Analysis of data from over 50 users shows that target selection time increases exponentially with delay and target speed and is well fit by an exponential model that includes a delay and target speed interaction term. A comparison with two previous studies, both using a mouse instead of a thumbstick, suggests the model's relationship between delay and target speed holds more broadly, providing the foundation for a potential law explaining moving target selection with the delay encountered in cloud-hosted games.
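A hedged sketch of fitting such a model: below, scipy's curve_fit estimates the parameters of T = a·exp(b·delay + c·speed + k·delay·speed) on synthetic stand-in data (the functional form follows the abstract; the coefficients and data are illustrative, not the study's):

```python
import numpy as np
from scipy.optimize import curve_fit

def selection_time(X, a, b, c, k):
    delay, speed = X
    # exponential model with a delay-speed interaction term
    return a * np.exp(b * delay + c * speed + k * delay * speed)

# synthetic stand-in data (delay in ms, speed in px/s, time in s)
rng = np.random.default_rng(0)
delay = rng.uniform(0, 400, 200)
speed = rng.uniform(50, 500, 200)
t = 0.8 * np.exp(0.002 * delay + 0.001 * speed
                 + 1e-5 * delay * speed) * rng.lognormal(0.0, 0.1, 200)

params, _ = curve_fit(selection_time, (delay, speed), t,
                      p0=(1.0, 1e-3, 1e-3, 1e-5))
print(dict(zip("abck", params)))
```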
We present a novel fine-grained image recognition framework using user click data, which can bridge the semantic gap in distinguishing categories that are visually similar. As the query set is usually large-scale and redundant, we first propose a click-feature-based query merging approach to merge semantically similar queries and construct a compact click feature. Afterwards, we utilize this compact click feature and a Convolutional Neural Network (CNN) based deep visual feature to jointly represent an image. Finally, with the combined feature, we employ a metric learning based template matching scheme for efficient recognition. Considering the heavy noise in the training data, we introduce a reliability variable to characterize image reliability, and propose a Weakly supervised Metric and Template Learning with Deep feature and Click data (WMTLDC) method to jointly learn the distance metric, object templates, and image reliability. Extensive experiments are conducted on the public Clickture-Dog dataset. They show that the click-data-based query merging helps generate a highly compact click feature for images (the dimension is reduced to 0.9% of the original), which greatly improves computational efficiency. Also, introducing this click feature boosts recognition accuracy by more than 20% compared to using the CNN feature alone. The proposed framework performs much better than previous state-of-the-art methods in fine-grained recognition tasks.
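As a rough illustration of the recognition step only, the sketch below scores an image against per-class templates under a Mahalanobis-style metric over the concatenated click and CNN features; the fusion weighting and all names are assumptions, not the WMTLDC formulation:

```python
import numpy as np

def match_template(x_click, x_cnn, templates, M, alpha=0.5):
    """Nearest-template classification with a learned metric M.

    x_click, x_cnn: 1-D feature vectors; templates: (n_classes, d) rows
    M: (d, d) positive semi-definite metric matrix (learned elsewhere)
    """
    x = np.concatenate([alpha * x_click, (1 - alpha) * x_cnn])
    diffs = templates - x
    # squared Mahalanobis distance to each class template
    d2 = np.einsum('nd,de,ne->n', diffs, M, diffs)
    return int(np.argmin(d2))
```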
The world has experienced phenomenal growth in data production and storage in recent years, much of which has taken the form of media files. At the same time, computing power has become abundant with multi-core machines, grids and clouds. Yet it remains a challenge to harness the available power and move towards gracefully handling web-scale media collections. Several researchers have experimented with using automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly using small collections on small computing clusters. In this paper, we describe a prototype of a (near) web-scale multimedia service using the Spark framework running on the AWS cloud service. We present experimental results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. We also present a publicly available demonstration system, running on our own servers, where the implementation of the Spark pipelines can be observed in practice using standard image benchmarks, and downloaded for research purposes. Finally, we describe a method to evaluate retrieval quality of the ever-growing high-dimensional index of the prototype, without actually indexing a web-scale media collection.
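A minimal PySpark sketch of this kind of pipeline, assuming image paths listed one per line in a text file (the S3 URL is a placeholder) and opencv-contrib available on the workers; the paper's actual pipelines are considerably more elaborate:

```python
import cv2
import numpy as np
from pyspark.sql import SparkSession

def extract_sift(path):
    """Extract SIFT descriptors for one image; empty list on failure."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return []
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return [] if desc is None else [d.tolist() for d in desc]

spark = SparkSession.builder.appName("sift-extract").getOrCreate()
paths = spark.sparkContext.textFile("s3://example-bucket/image_paths.txt")
features = paths.flatMap(extract_sift)   # one record per descriptor
print(features.count())
```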
Inconsistency in contrast enhancement can be used to expose image forgeries. In this work, we describe a new method to estimate contrast enhancement from a single image. Our method takes advantage of the nature of contrast enhancement as a mapping between pixel values, and of the distinct characteristics it introduces to the image pixel histogram. Our method recovers the original pixel histogram and the contrast enhancement simultaneously from a single image with an iterative algorithm. Unlike previous methods, our method is robust in the presence of additive noise perturbations that are used to hide the traces of contrast enhancement. Furthermore, we also develop an effective method to detect image regions that have undergone contrast enhancement transformations different from those of the rest of the image, and use this method to detect composite images. We perform extensive experimental evaluations to demonstrate the efficacy and efficiency of our method.
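One classical way to expose the histogram artifacts mentioned above (peaks and gaps introduced by the pixel-value mapping) is to measure high-frequency energy in the histogram's spectrum; the sketch below is a simplified stand-in for illustration, not the paper's iterative joint recovery of histogram and mapping:

```python
import numpy as np

def ce_evidence(gray_u8, cutoff=32):
    """Score likely contrast enhancement from peak/gap artifacts via the
    high-frequency energy of the pixel histogram's DFT (illustrative)."""
    hist, _ = np.histogram(gray_u8, bins=256, range=(0, 256))
    spec = np.abs(np.fft.fft(hist / max(hist.sum(), 1)))
    hi = spec[cutoff:256 - cutoff]          # high-frequency band
    return hi.sum() / max(spec[1:].sum(), 1e-12)
```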
Quality of Experience (QoE) has received much attention over the past years and has become a prominent issue for delivering services and applications. A significant amount of research has been devoted to understanding, measuring, and modeling QoE for a variety of media services. The next logical step is to actively exploit that accumulated knowledge to improve and manage the quality of multimedia services, while at the same time ensuring efficient and cost-effective network operations. Moreover, with many different players involved in the end-to-end service delivery chain, identifying the root causes of QoE impairments and finding effective solutions for meeting the end users' requirements and expectations in terms of service quality is a challenging and complex problem. In this paper, we survey state-of-the-art findings and present emerging concepts and challenges related to managing QoE for networked multimedia services. Going beyond a number of previously published survey papers addressing the topic of QoE management, we address QoE management in the context of ongoing developments, such as the move to 5G and virtualized networks, the exploitation of big data analytics and machine learning, and the steady rise of new and immersive services (e.g., augmented and virtual reality). We address the implications of such paradigm shifts in terms of new approaches in QoE modeling, and the need for novel QoE monitoring and management infrastructures.
Best Papers of the ACM Multimedia Systems (MMSys) Conference 2017 and the ACM Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV) 2017
In this paper, we explore the possibility of enabling cloud-based virtual space applications, for better computational scalability and easy access from any end device, including future lightweight wireless head-mounted displays (HMDs). In particular, we investigate virtual classroom and virtual gallery applications, in which the scenes and activities are rendered in the cloud, with multiple views captured and streamed to each end device. A key challenge is the high bandwidth requirement to stream all the user views, leading to high operational cost and potentially large delay in a bandwidth-restricted wireless network. We propose a novel hybrid-cast approach to save bandwidth in a multi-user streaming scenario. We identify and broadcast the common pixels shared by multiple users, while unicasting the residual pixels for each user. We formulate the problem of minimizing the total bitrate needed to transmit the user views using hybrid-casting and describe our approach. A common-view extraction method and a smart grouping algorithm are developed to realize the hybrid-cast approach. Simulation results show that the hybrid-cast approach can significantly reduce the total bitrate, by up to 55% compared to the traditional cloud-based approach of transmitting all the views as individual unicast streams, hence addressing the bandwidth challenges of the cloud, with additional benefits in cost and delay.
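The split itself can be sketched in the pixel domain as below, under the simplifying assumption of aligned luma planes: pixels that all users share (within a tolerance) form one broadcast layer, and each user keeps only a residual (the actual system operates on rendered views with view extraction and grouping):

```python
import numpy as np

def hybrid_cast_split(views, tol=2):
    """Split user views into a broadcast layer plus per-user residuals.

    views: (n_users, H, W) aligned luma planes, uint8
    """
    views = views.astype(np.int16)
    common = np.median(views, axis=0).astype(np.int16)  # broadcast layer
    shared = (np.abs(views - common) <= tol).all(axis=0)
    residuals = [np.where(shared, 0, v - common) for v in views]  # unicast
    return common, shared, residuals
```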
Detection of aesthetic highlights is a challenge for understanding the affective processes taking place during movie watching. In this paper we focus our study on spectators' responses to movie aesthetic stimuli in a social context. Moreover, we aim to uncover the emotional component of aesthetic highlights in movies. Our assumption is that synchronized physiological and behavioral reactions among spectators occur during these highlights because: (i) aesthetic choices of filmmakers are made to elicit specific emotional reactions (e.g., special effects, empathy and compassion toward a character, etc.) and (ii) watching a movie together causes spectators' affective reactions to be synchronized through emotional contagion. We compare different approaches to estimating synchronization among groups of spectators' signals, such as pairwise, group, and overall synchronization measures, to detect aesthetic highlights in movies. The results show that an unsupervised architecture relying on synchronization measures is able to capture different properties of spectators' synchronization and detect aesthetic highlights based on both spectators' electrodermal and acceleration signals. Pairwise synchronization measures perform most accurately, independently of highlight type and movie genre. Moreover, we observe that electrodermal signals have more discriminative power than acceleration signals for highlight detection.
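As one simple instance of the pairwise measures compared above, the mean windowed Pearson correlation across all spectator pairs can be computed as sketched below; the window and hop sizes are illustrative choices, not the paper's:

```python
import numpy as np
from itertools import combinations

def pairwise_sync(signals, win=128, hop=64):
    """Mean pairwise Pearson correlation over sliding windows.

    signals: (n_spectators, T) array, e.g. electrodermal activity.
    Returns one synchronization score per window; high values would
    mark candidate aesthetic highlights.
    """
    n, T = signals.shape
    scores = []
    for start in range(0, T - win + 1, hop):
        w = signals[:, start:start + win]
        r = [np.corrcoef(w[i], w[j])[0, 1]
             for i, j in combinations(range(n), 2)]
        scores.append(np.nanmean(r))
    return np.array(scores)
```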
In designing an HTTP adaptive streaming (HAS) system, the bitrate adaptation scheme in the player is a key component to ensure a good quality of experience (QoE) for viewers. We propose a new online reinforcement learning optimization framework, called ORL-SDN, targeting HAS players running in a software defined networking (SDN) environment. We leverage SDN to facilitate the orchestration of the adaptation schemes for a set of HAS players. To reach a good level of QoE fairness in a large population of players, we cluster them based on a perceptual quality index. We formulate the adaptation process as a Partially Observable Markov Decision Process and solve the per-cluster optimization problem using an online Q-learning technique that leverages model predictive control and parallelism via aggregation to avoid a per-cluster sub-optimal selection and to accelerate the convergence to an optimum. This framework achieves maximum long-term revenue by selecting the optimal representation for each cluster under time-varying network conditions. The results show that ORL-SDN delivers substantial improvements in viewer QoE, presentation quality stability, fairness and bandwidth utilization over well-known adaptation schemes.
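At its core, each cluster's learner performs Q-learning updates of the form sketched below (tabular and heavily simplified: ORL-SDN layers model predictive control and parallel aggregation on top of this, and operates on a POMDP rather than fully observed states):

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step for a cluster's representation choice.

    Q: (n_states, n_actions) value table; s, s_next: state indices
    a: chosen representation index; reward: QoE-derived reward
    """
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```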
Mobile personal livecast (MPL) services are emerging and have received great attention recently. In MPL, numerous geo-distributed ordinary users broadcast their video content to worldwide viewers. Different from conventional social networking services like Twitter and Facebook, which can tolerate relatively large interaction delays, the interactions (e.g., chat messages) in a personal livecast must be in real time with low feedback latency. These unique characteristics motivate us to: 1) investigate how the relationships (e.g., social links and geo-locations) between viewers and broadcasters influence user behaviors, which has yet to be explored in depth; and 2) explore insights to benefit the improvement of system performance. In this paper, we carry out extensive measurements of Inke, one of the most popular MPL providers, with a large-scale dataset containing 11M users. Our key findings are as follows: 1) user interests shift particularly frequently and the average viewing duration is considerably short in MPL; 2) the existence of social relationships significantly strengthens viewer stickiness: followers dedicate longer viewing time, e.g., contributing 81% of the total viewing time; 3) long content uploading distances on the broadcaster side result in low system QoS (e.g., higher broadcast latency and higher rebuffering ratio) in the current core-cloud based MPL paradigm; and 4) most of the broadcasts in MPL are geographically local-popular (the majority of the views come from the same region as the broadcaster). Thus the emergence of edge computing, which provides cloud-computing capabilities at the edge of the mobile network, naturally sheds new light on the MPL system, e.g., localized ingestion and delivery of live content. Based on these observations, we propose an edge-assisted MPL system that collaboratively utilizes core-cloud and edge computing resources to improve efficiency and scalability for Inke-like services. In our framework, we consider dynamic broadcaster assignment to minimize broadcast latency while keeping the resource lease cost low. We formulate broadcaster scheduling as a stable matching with migration problem and solve it efficiently; compared with the current core-cloud based system, our edge-assisted delivery approach reduces broadcast latency by about 35%.
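The assignment step can be sketched as capacitated deferred acceptance (Gale-Shapley), with broadcasters preferring low-latency edge sites and each site ranking broadcasters by a cost score; this is a simplification of the paper's stable matching with migration, and all names are hypothetical:

```python
def assign_broadcasters(prefs, capacity, rank):
    """Capacitated deferred-acceptance sketch for broadcaster-to-edge
    assignment.

    prefs[b]: edge sites ordered by increasing latency for broadcaster b
    capacity[e]: slots at edge e; rank[e][b]: e's score for b (lower = better)
    """
    held = {e: [] for e in capacity}
    nxt = {b: 0 for b in prefs}
    free = list(prefs)
    while free:
        b = free.pop()
        if nxt[b] >= len(prefs[b]):
            continue                      # no edge left: b stays on core cloud
        e = prefs[b][nxt[b]]
        nxt[b] += 1
        held[e].append(b)
        held[e].sort(key=lambda x: rank[e][x])
        if len(held[e]) > capacity[e]:
            free.append(held[e].pop())    # evict the least-preferred holder
    return held
```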
Introduction to the Special Issue on Delay-Sensitive Video Computing in the Cloud
In many distributed wireless surveillance applications, compressed videos are used for performing automatic video analysis tasks. The accuracy of object detection, which is essential for various video analysis tasks, can be reduced due to video quality degradation caused by lossy compression. This paper introduces a video encoding framework with the objective of boosting the accuracy of object detection for wireless surveillance applications. The proposed video encoding framework is based on systematic investigation of the effects of lossy compression on object detection. It has been found that current standardized video encoding schemes cause temporal domain fluctuation for encoded blocks in stable background areas and spatial texture degradation for encoded blocks in dynamic foreground areas of a raw video, both of which degrade the accuracy of object detection. Two measures, the sum-of-absolute frame difference (SFD) and the degradation of texture (TXD), are introduced to depict the temporal domain fluctuation and the spatial texture degradation in an encoded video, respectively. The proposed encoding framework is designed to suppress unnecessary temporal fluctuation in stable background areas and preserve spatial texture in dynamic foreground areas based on the two measures, and it introduces new mode decision strategies for both intra and inter frames to improve the accuracy of object detection while maintaining an acceptable rate-distortion performance. Experimental results show that, compared with traditional encoding schemes, the proposed scheme improves the performance of object detection and results in lower bit rate with comparable quality in terms of PSNR and SSIM.
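Plausible realizations of the two measures are sketched below: SFD as the absolute difference between co-located encoded blocks in consecutive frames, and TXD as the gradient-energy loss between a raw block and its encoded version (the paper's exact definitions may differ in detail):

```python
import numpy as np

def sfd(enc_prev, enc_curr):
    """Sum of absolute frame difference for co-located encoded blocks:
    nonzero values in a stable background indicate temporal fluctuation."""
    return int(np.abs(enc_curr.astype(np.int16)
                      - enc_prev.astype(np.int16)).sum())

def txd(raw_blk, enc_blk):
    """Texture degradation as gradient-energy loss after encoding."""
    def grad_energy(b):
        gy, gx = np.gradient(b.astype(np.float32))
        return float(np.sqrt(gx**2 + gy**2).sum())
    return grad_energy(raw_blk) - grad_energy(enc_blk)
```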
With the increasing accessibility of mobile head-mounted displays (HMDs), mobile virtual reality (VR) systems are finding applications in various areas. However, mobile HMDs are highly constrained, with limited graphics processing units (GPUs), low processing power, and little onboard memory. Hence, VR developers must be cognizant of the number of polygons contained within their virtual environments to avoid rendering at low frame rates and inducing simulator sickness. The most robust and rapid approach to keeping the overall number of polygons low is to use mesh simplification algorithms to create low-poly versions of pre-existing, high-poly models. Unfortunately, most existing mesh simplification algorithms cannot adequately handle meshes with many boundaries or non-manifold meshes, which are common attributes of many 3D models. In this paper, we present QEM4VR, a high-fidelity mesh simplification algorithm specifically designed for VR. This algorithm addresses the deficiencies of prior quadric error metric (QEM) approaches by leveraging the insight that the most relevant boundary edges lie along curvatures while linear boundary edges can be collapsed. Additionally, our algorithm preserves key surface properties, such as normals, texture coordinates, colors, and materials, as it pre-processes 3D models and generates their low-poly approximations offline. We evaluated the effectiveness of our QEM4VR algorithm by comparing its simplified-mesh results to those of prior QEM variations in terms of geometric approximation error, texture error, progressive approximation errors, frame rate impact, and perceptual quality measures. We found that QEM4VR consistently yielded simplified meshes with less geometric approximation error and texture error than the prior QEM variations. It afforded better frame rates than QEM variations with boundary preservation constraints, which create unnecessary lower bounds on overall polygon count reduction. Our evaluation revealed that QEM4VR did not fare well in terms of existing perceptual distance measurements, but human-based inspections demonstrate that these algorithmic measurements are not suitable substitutes for actual human perception. In turn, we present a user-based methodology for evaluating the perceptual qualities of mesh simplification algorithms.
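For reference, the base quadric error metric that QEM4VR extends can be sketched as follows; the boundary-curvature handling that distinguishes QEM4VR from prior QEM variants is deliberately not shown:

```python
import numpy as np

def plane_quadric(p0, p1, p2):
    """Fundamental error quadric K = p p^T for a triangle's plane,
    where p = (a, b, c, d) satisfies ax + by + cz + d = 0, ||(a,b,c)|| = 1."""
    n = np.cross(p1 - p0, p2 - p0)
    n = n / np.linalg.norm(n)
    p = np.append(n, -np.dot(n, p0))
    return np.outer(p, p)

def vertex_error(Q, v):
    """Quadric error v^T Q v for homogeneous v = (x, y, z, 1); QEM-style
    simplification ranks candidate edge collapses by this cost."""
    vh = np.append(v, 1.0)
    return float(vh @ Q @ vh)
```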
We show how to build the components of a privacy-aware, live video analytics ecosystem from the bottom up, starting with OpenFace, our new open-source face recognition system that approaches state-of-the-art accuracy. Integrating OpenFace with inter-frame tracking, we build RTFace, a mechanism for denaturing video streams that selectively blurs faces according to specified policies at full frame rates. This enables privacy management for live video analytics while providing a secure approach for handling retrospective policy exceptions. Finally, we present a scalable, privacy-aware architecture for large camera networks using RTFace, and show how it can be an enabler for a vibrant ecosystem and marketplace of privacy-aware video streams and analytics services.
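Per-frame denaturing can be sketched as below, assuming an OpenCV-style face detector and a recognizer callable mapping a face crop to an identity (both placeholders); RTFace's actual policies and inter-frame tracking are richer than this allow-list:

```python
import cv2

def denature_frame(frame, allow_ids, detector, recognizer):
    """Blur every detected face whose identity is not on the allow list.

    detector: e.g. a cv2.CascadeClassifier; recognizer: crop -> identity
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        roi = frame[y:y + h, x:x + w]
        if recognizer(roi) not in allow_ids:
            frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return frame
```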
Cloud gaming has gained significant popularity recently due to many important benefits, such as the removal of device constraints, instant-on, and cross-platform support. Its intensive resource demands and dynamic workloads make cloud gaming well suited to an elastic cloud platform. Facing a large user population, a fundamental problem is how to provide satisfactory cloud gaming service at modest cost. We observe that software maintenance cost can be substantial compared to server running cost in cloud gaming using elastic cloud resources. In this paper, we address the server provisioning problem for cloud gaming to optimize both server running cost and software maintenance cost. We find that the distribution of game software among servers and the selection of server types both trigger trade-offs between software maintenance cost and server running cost. We formulate the problem with a stochastic model and employ queueing theory to conduct solid theoretical analysis of the system behaviors under different request dispatching policies. We then propose several classes of algorithms to approximate the optimal solution. The proposed algorithms are evaluated by extensive experiments using real-world parameters. The results show that the proposed Ordered and Genetic algorithms are computationally efficient, nearly cost-optimal, and highly robust to dynamic changes.
Understanding streaming user behavior is crucial to the design of large-scale video-on-demand (VoD) systems. However, existing studies usually treat all the users as a single entity to analyze collective user behaviors, especially video popularity. In this paper, we begin with the measurement of individual viewing behavior from two perspectives, temporal characteristics and user interest, and present our results by dividing users into active and less active groups. We observe that the active users spend more hours on each active day, and their daily request time distribution is more scattered than that of the less active users, while the inter-view time distribution differs negligibly between the two groups. Similar viewing behaviors are discovered among the active and less active users, e.g., common interests in popular videos and the latest uploaded videos. To identify users with similar viewing behaviors, we cluster them into a number of classes using their daily request timestamps or the categories of watched videos, where the efficacy of the clustering is validated by the cluster centroids. We then analyze the predictability of video popularity using early views. Classical approaches that select the same set of parameters for all videos are shown to deviate far from the real popularity. In light of this observation, an auto-regressive and moving average (ARMA) model is employed to forecast the popularity of each video separately, achieving much higher accuracy.
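A per-video forecast of this kind can be sketched with statsmodels, fitting an ARMA model (ARIMA with zero differencing) to a video's early daily view counts; the order below is a fixed illustration, whereas per-video order selection (e.g., by AIC) would be used in practice:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def forecast_views(early_views, horizon=7, order=(2, 0, 1)):
    """Fit ARMA(p, q) to early daily view counts of one video and
    forecast the next `horizon` days."""
    model = ARIMA(np.asarray(early_views, dtype=float), order=order)
    return model.fit().forecast(steps=horizon)
```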
360 degree video is a new generation of video streaming technology that promises greater immersiveness than standard video streams. This level of immersiveness is similar to that produced by virtual reality devices: users can control the field of view using head movements rather than needing to manipulate external devices. Although 360 degree video could revolutionize streaming technology, large-scale adoption is hindered by a number of factors. 360 degree video streams have larger bandwidth requirements, require faster responsiveness to user inputs, and users may be more sensitive to lower-quality streams. In this paper, we review standard approaches toward 360 degree video encoding and compare them to families of approaches that distort the spherical surface to allow oriented concentrations of the 360 degree view. We refer to these distorted projections as offset projections. At best, we estimate via measurement studies that these offset projections can produce better or similar visual quality with less than 50% of the pixels under reasonable assumptions about user behavior. Offset projections complicate adaptive 360 degree video streaming because they require a combination of bitrate and view orientation adaptations. We estimate that this combination of streaming adaptation in two dimensions can cause over 57% extra segments to be downloaded compared to an ideal downloading strategy, wasting 20% of the total downloading bandwidth.
Large-scale image datasets and deep convolutional neural networks (DCNNs) are the two primary driving forces for the rapid progress in generic object recognition tasks in recent years. While many network architectures have been continuously designed to pursue lower error rates, few efforts are devoted to enlarging existing datasets due to high labeling cost and unfair comparison issues. In this paper, we aim to achieve lower error rates by augmenting existing datasets in an automatic manner. Our method leverages both the Web and DCNNs: the Web provides massive images with rich contextual information, and a DCNN replaces humans to automatically label images under the guidance of Web contextual information. Experiments show that our method can automatically scale up existing datasets significantly from billions of web pages with high accuracy, and significantly improve the performance on object recognition tasks with the automatically augmented datasets, which demonstrates that more supervisory information has been automatically gathered from the Web. Both the dataset and the models trained on the dataset have been made publicly available.
Mainstream approaches for 3D human action recognition usually combine depth and skeleton features to improve recognition accuracy. However, this strategy may result in high feature dimension and low discrimination due to the redundancy of the feature vector. To address this drawback, a multi-feature selection approach for 3D human action recognition is proposed in this paper. First, three novel single-modal features are proposed to describe depth appearance, depth motion, and skeleton motion, respectively. Second, the classification entropy of a random forest is used to evaluate the discrimination of the depth-appearance-based feature. Then, one of the three features is selected to recognize the sample according to this discrimination evaluation. Experimental results show that the proposed multi-feature selection significantly outperforms single-modal features and other feature-fusion-based approaches.
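The selection step can be sketched as thresholding the Shannon entropy of the random forest's class posterior and falling back to a motion-based classifier when the depth-appearance feature is ambiguous; this two-feature version with assumed fitted sklearn-style classifiers simplifies the paper's three-feature scheme:

```python
import numpy as np

def classification_entropy(proba):
    """Shannon entropy of averaged class probabilities; high entropy
    means the depth-appearance feature poorly discriminates this sample."""
    p = np.clip(proba, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_and_predict(sample, rf_app, rf_motion, threshold=1.0):
    """Use depth appearance when the forest is confident, else motion."""
    proba = rf_app.predict_proba([sample["appearance"]])[0]
    if classification_entropy(proba) < threshold:
        return rf_app.classes_[int(np.argmax(proba))]
    return rf_motion.predict([sample["motion"]])[0]
```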
Image captioning is an increasingly important problem associated with artificial intelligence, computer vision, and natural language processing. Recent works revealed that it is possible for a machine to generate meaningful and accurate sentences for images. However, most existing methods ignore the emotional information latent in an image. In this paper, we propose a novel image captioning model with affective guiding and a selective attention mechanism, named AG-SAM. In our method, we aim to bridge the affective gap between image captioning and the emotional response elicited by the image. First, we introduce affective components, which capture higher-level concepts encoded in images, into AG-SAM. Hence, our language model can be adapted to generate sentences that are more passionate and emotive. Besides, a selective gate acting on the attention mechanism controls how much visual information AG-SAM needs. Experimental results show that our model outperforms most existing methods, clearly reflecting the association between images and emotional components, which is usually ignored in existing works.
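The selective gate can be sketched as a scalar sigmoid gate computed from the decoder state that scales the attended visual context; the shapes and parameter names below are assumptions for illustration, not AG-SAM's exact formulation:

```python
import numpy as np

def gated_context(hidden, features, Wa, wg):
    """Attention with a scalar selective gate.

    hidden: (d,) decoder state; features: (k, d) image region features
    Wa: (d, d) attention projection; wg: (d,) gate weights
    """
    scores = features @ (Wa @ hidden)            # attention logits, (k,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax
    context = weights @ features                 # attended context, (d,)
    gate = 1.0 / (1.0 + np.exp(-(wg @ hidden)))  # sigmoid gate in (0, 1)
    return gate * context                        # gate limits visual input
```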
While cloud servers provide a tremendous amount of resources for networked video applications, most success stories of cloud-assisted video applications are presentational video services, such as YouTube and Netflix. This article surveys the recent advances in delay-sensitive video computations in the cloud, which are crucial to cloud-assisted conversational video services, such as cloud gaming, Virtual Reality (VR), Augmented Reality (AR), and telepresence. Supporting conversational video services with cloud resources is challenging because most cloud servers are far away from the end users, while these services impose stringent requirements: high bandwidth, low delay, and high heterogeneity. In this article, we cover the literature with a top-down approach: from applications and experience, to architecture and management, and to optimization in and outside of the cloud. We also point out major open challenges, hoping to stimulate more research activities in this emerging and exciting direction.
To provide an efficient, high-quality, and low-cost solution for web-based online interactive rendering applications, such as video gaming, virtual reality, and simulation, we propose a novel real-time cloud rendering system on the web. Different from existing cloud rendering systems that render full-frame image sequences, we render lightweight models in the web front end with WebGL and put heavy global illumination (GI) rendering calculations in the cloud back end. Our system consists of three key stages: cloud rendering, image data transmission, and final frame optimization. Compared to traditional cloud rendering systems, our system avoids their transmission delay defects and shows satisfactory real-time rendering performance in web browsers.
The dramatic growth of video traffic represents a practical challenge for cellular network operators in providing a consistent streaming Quality of Experience (QoE) to their users. Satisfying this objective has so far proved elusive, due to inherent wireless network conditions that degrade streaming performance, such as variability in both video bitrate and channel conditions. In this paper, we propose stall-aware pacing as a novel MPEG-DASH video traffic management solution that reduces playback stalls and seeks to maintain a consistent QoE for cellular users, even those with diverse channel conditions. These goals are achieved by leveraging both network and client state information to optimize the pacing of individual video flows. We extensively evaluate the performance of two versions of the technique, stall-aware pacing (SAP) and adaptive stall-aware pacing (ASAP), using real video content and clients operating over a simulated LTE network. We implement state-of-the-art client adaptation and traffic management strategies for direct comparison. Our results, using a heavily loaded base station, show that SAP reduces the number of stalls and the average stall duration per session by up to 95%. Additionally, SAP ensures that clients with good channel conditions do not dominate available wireless resources, evidenced by a reduction of up to 40% in the standard deviation of the QoE metric. We also show that ASAP achieves further performance gains by adaptively pacing video streams based on the application buffer state.
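The intuition behind such pacing can be sketched as a buffer-driven rule: flows whose client buffer is below target are paced above real time, while healthy flows are capped near real time, freeing capacity for clients at risk of stalling; this is a deliberate simplification of SAP's optimization, with all names hypothetical:

```python
def pacing_rate(bitrate_bps, buffer_s, target_buffer_s, min_rate_bps):
    """Schematic stall-aware pacing rule for one video flow."""
    if buffer_s < target_buffer_s:
        # buffer at risk: send faster than real time to refill it
        deficit = (target_buffer_s - buffer_s) / max(target_buffer_s, 1e-9)
        return bitrate_bps * (1.0 + deficit)
    # buffer healthy: deliver at roughly real time (with a floor)
    return max(bitrate_bps, min_rate_bps)
```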
Single-source HTTP Adaptive Streaming (HAS) protocols, such as MPEG-DASH, have become the de facto solutions for delivering live video over the Internet. By avoiding buffer stalling events, which are mainly caused by a lack of throughput at the client or server side, HAS protocols increase end-users' Quality of Experience (QoE). We propose to extend HAS capabilities with a pragmatic HAS-compliant multi-source protocol that simultaneously utilizes several servers: MS-Stream. MS-Stream aims at offering high-QoE live content delivery by exploiting expanded bandwidth and link diversity in distributed heterogeneous infrastructures. By leveraging end-users' connectivity capacities, we further extend the QoE and scalability of our proposal and present a hybrid P2P/multi-source live-streaming solution: PMS. PMS is a distributed streaming solution trading off system scalability against end-user QoE. Its bitrate adaptation algorithm relies on global and local indicators characterizing the capacity and efficiency of the entire system. This paper presents our contribution to building a lightweight, pragmatic, and evolving solution utilizing both P2P and DASH to achieve low-cost live content delivery at high QoE.