INTRODUCTION TO THE SPECIAL ISSUE ON REPRESENTATION, ANALYSIS AND RECOGNITION OF 3D HUMANS
Computer Vision and Multimedia solutions are now offering an increasing number of applications ready for use by end users in everyday life. Many of these applications are centered on the detection, representation, and analysis of face and body. Methods based on 2D images and videos are the most widespread, but there is a recent trend that successfully extends the study to 3D human data as acquired by a new generation of 3D acquisition devices. Based on these premises, in this survey, we provide an overview on the newly designed techniques that exploit 3D human data also prospecting the most promising current and future research directions. In particular, we first propose a taxonomy of the methods for representation, analysis and recognition of 3D humans from 3D static and dynamic data. Then, we focus on the applications for body and face. In the Appendix, the main characteristics of the datasets used as benchmarks for evaluating and comparing the existing techniques are also summarized.
Hosting interactive video-based services, such as computer games, in the cloud poses particular challenges given the sensitivity to delay. A better understanding of the impact of delay on player-game interactions can help design cloud systems and games that accommodate delay inherent in cloud systems. Previous top-down studies of delay using full-featured games have helped understand the impact of delay, but often do not generalize nor lend themselves to analytic modeling. Bottom-up studies isolating user input and delay can better generalize and be used in models, but have yet to be applied to cloud-hosted computer games. In order to better understand delay impact in cloud-hosted computer games, we conduct a large bottom-up user study centered on a fundamental game interaction - selecting a moving target with user input subject to delay. Our work builds a custom game that controls both the target speed and input delay and has players select the target using an analog thumbstick controller. Analysis of data from over 50 users shows target selection time exponentially increases with delay and target speed and is well-fit by an exponential model that includes a delay & target speed interaction term. A comparison with two previous studies, both using a mouse instead of a thumbstick, suggests the model's relationship between delay and target speed holds more broadly, providing a foundation for a potential law explaining moving target selection with delay encountered in cloud-hosted games.
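As a minimal sketch of the model form described above (an exponential in delay and target speed with an interaction term), the following Python snippet may help. The coefficient values, the units (milliseconds, pixels per second), and the function names are assumptions made purely for illustration; none are taken from the study.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted values).
K0 = 0.0            # intercept
K_DELAY = 0.002     # effect of input delay (per ms)
K_SPEED = 0.001     # effect of target speed (per px/s)
K_INTERACT = 1e-5   # delay x speed interaction


def selection_time(delay_ms, speed_px_s):
    """Selection time under an exponential model with an interaction term."""
    exponent = (K0
                + K_DELAY * delay_ms
                + K_SPEED * speed_px_s
                + K_INTERACT * delay_ms * speed_px_s)
    return math.exp(exponent)


def delay_penalty(delay_ms, speed_px_s):
    """Multiplicative slowdown caused by delay at a given target speed."""
    return selection_time(delay_ms, speed_px_s) / selection_time(0, speed_px_s)
```

With a positive interaction coefficient, the same delay costs proportionally more at higher target speeds, which is the kind of coupling an interaction term captures.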
Inconsistency in contrast enhancement can be used to expose image forgeries. In this work, we describe a new method to estimate contrast enhancement from a single image. Our method takes advantage of the nature of contrast enhancement as a mapping between pixel values, and the distinct characteristics it introduces to the image pixel histogram. Our method recovers the original pixel histogram and the contrast enhancement simultaneously from a single image with an iterative algorithm. Unlike previous methods, our method is robust in the presence of additive noise perturbations that are used to hide the traces of contrast enhancement. Furthermore, we also develop an effective method to detect image regions that have undergone contrast enhancement transformations different from those applied to the rest of the image, and use this method to detect composite images. We perform extensive experimental evaluations to demonstrate the efficacy and efficiency of our method.
Quality of Experience (QoE) has received much attention over the past years and has become a prominent issue for delivering services and applications. A significant amount of research has been devoted to understanding, measuring, and modelling QoE for a variety of media services. The next logical step is to actively exploit that accumulated knowledge to improve and manage the quality of multimedia services, while at the same time ensuring efficient and cost-effective network operations. Moreover, with many different players involved in the end-to-end service delivery chain, identifying the root causes of QoE impairments and finding effective solutions for meeting the end users' requirements and expectations in terms of service quality is a challenging and complex problem. In this paper we survey state-of-the-art findings and present emerging concepts and challenges related to managing QoE for networked multimedia services. Going beyond a number of previously published survey papers addressing the topic of QoE management, we address QoE management in the context of ongoing developments, such as the move to 5G and virtualized networks, the exploitation of big data analytics and machine learning, and the steady rise of new and immersive services (e.g., augmented and virtual reality). We address the implications of such paradigm shifts in terms of new approaches in QoE modeling, and the need for novel QoE monitoring and management infrastructures.
The next generation of multimedia services will have to be optimized in a personalized way; therefore, user factors will play a crucial role in individual experience evaluation. So far, the influence of user factors has mainly been investigated in controlled laboratory environments, which often include a limited number of users and fail to reflect real-life conditions. Social media, especially Facebook, provides an interesting alternative for internet-based subjective experimentation. In this paper, we develop an open-source Facebook application, named YouQ, as an experimental platform for studying individual experience evaluations. Our results show that subjective experimentation based on YouQ can produce reliable results as compared to a controlled laboratory experiment. Additionally, YouQ is able to collect user information automatically from Facebook, and such user information has shown its potential for modelling individual experience.
The large share of Internet traffic generated by video streaming services puts high loads on access networks and produces high costs for the content delivery infrastructure. To reduce the bandwidth consumed, while maintaining a high playback quality, video players use policies that control and limit the buffer level. This allows shaping the bandwidth consumed by video streams and limiting the traffic wasted in case of playback abortion. Especially in mobile scenarios, where the bandwidth can be highly variable, the buffer policy can have a high impact on the probability of interruptions during video playback. To find the optimal setting for the buffer policy in each network condition, the relationship between the parameters of the buffer policy, the network dynamics and the corresponding video playback behavior needs to be understood. To this end, we model the video buffer as a GI/GI/1 queue with pq-policy using discrete-time analysis. This allows evaluating the impact of varying network conditions and video bitrate on the efficiency of the buffer policy. By studying the stochastic properties of the buffer level distribution, we are able to accurately evaluate the impact of network and video bitrate dynamics on the video playback quality based on the buffer policy. Further, we can optimize the trade-off between the traffic wasted in case of video abortion and video streaming quality experienced by the user.
Video streaming applications currently dominate Internet traffic. Particularly, HTTP Adaptive Streaming (HAS) has emerged as the de facto standard for streaming videos over the best-effort Internet, thanks to its capability of matching the video quality to the available network resources. In HAS, the video client is equipped with a heuristic that dynamically decides the most suitable quality to stream the content, based on information such as the perceived network bandwidth or the video player buffer status. The goal of this heuristic is to optimize the quality as perceived by the user, the so-called Quality of Experience (QoE). Despite the many advantages brought by the adaptive streaming principle, optimizing users' QoE is far from trivial. Current heuristics are still suboptimal when sudden bandwidth drops occur, especially in wireless environments, thus leading to freezes in the video playout, the main factor influencing users' QoE. This issue is aggravated in case of live events, where the player buffer has to be kept as small as possible in order to reduce the playout delay between the user and the live signal. In light of the above, in recent years, several works have been proposed with the aim of extending the classical purely client-based structure of adaptive video streaming, in order to fully optimize users' QoE. In this paper, a survey is presented of research works on this topic together with a classification based on where the optimization takes place. This classification goes beyond client-based heuristics to investigate the usage of server- and network-assisted architectures and of new application and transport layer protocols. In addition, we outline the major challenges currently arising in the field of multimedia delivery, which are going to be of extreme relevance in future years.
Detection of aesthetic highlights is a challenge for understanding the affective processes taking place during movie watching. In this paper we focus our study on spectators' responses to movie aesthetic stimuli in a social context. Moreover, we seek to uncover the emotional component of aesthetic highlights in movies. Our assumption is that synchronized physiological and behavioral reactions of spectators occur during these highlights because: (i) aesthetic choices of filmmakers are made to elicit specific emotional reactions (e.g. special effects, empathy and compassion toward a character, etc.) and (ii) watching a movie together causes spectators' affective reactions to be synchronized through emotional contagion. We compare different approaches to estimating synchronization among groups of spectators' signals, such as pairwise, group and overall synchronization measures, to detect aesthetic highlights in movies. The results show that the unsupervised architecture relying on synchronization measures is able to capture different properties of spectators' synchronization and detect aesthetic highlights based on both spectators' electrodermal and acceleration signals. Pairwise synchronization measures perform the most accurately independently of the type of the highlights and movie genres. Moreover, we observe that electrodermal signals have more discriminative power than acceleration signals for highlight detection.
We propose a deformation-based representation for analyzing expressions from 3D faces. A point cloud of a 3D face is decomposed into an ordered deformable set of curves that start from a fixed point. Subsequently, a mapping function is defined to identify the set of curves with an element of a high-dimensional matrix Lie group, specifically the direct product of SE(3). Representing 3D faces as an element of a high-dimensional Lie group has two main advantages. First, using the group structure, facial expressions can be decoupled from a neutral face. Second, an underlying non-linear facial expression manifold can be captured with the Lie group and mapped to a linear space, the Lie algebra of the group. This opens up the possibility of classifying facial expressions with linear models without compromising the underlying manifold. Alternatively, linear combinations of linearised facial expressions can be mapped back from the Lie algebra to the Lie group. Tested on the BU-3DFE dataset for expression recognition, the proposed approach outperforms state-of-the-art methods without using any features or landmark points.
Effective and efficient video retrieval has become a pressing need in the "big video" era. The objective of this work is to provide a principled model for computing the ranking scores of a video in response to one or more concepts, where the concepts could be directly supplied by users or inferred by the system from the user queries. Indeed, how to deal with multi-concept queries has become a central component in modern video retrieval systems that accept text queries. However, it has long been overlooked and simply implemented as a weighted average of the corresponding concept detectors' scores. Our approach, which can be considered a latent ranking SVM, integrates the advantages of various recent works in text and image retrieval, such as choosing ranking over structured prediction and modeling inter-dependencies between querying concepts and the others. Videos consist of shots and we use latent variables to account for the mutually complementary cues within and across shots. Concept labels of shots are scarce and noisy. We introduce a simple and effective technique to make our model robust to outliers. Our approach gives superior performance when it is tested not only on the queries seen at training but also on novel queries, some of which consist of more concepts than the queries used for training.
Social interactions take place in environments that influence behaviours and perceptions of people. Nowadays, the users of Online Social Networks (OSNs) generate a massive amount of content based on social interactions. However, the wide popularity and ease of access of OSNs have created a perfect scenario for malicious activities, compromising their reliability. To detect automatic information broadcast in OSNs, we developed a wavelet-based model which classifies users as being human, legitimate robot or malicious robot, as a result of spectral patterns obtained from users' textual content. We create the feature vectors from the Discrete Wavelet Transform (DWT) along with a weighting scheme called Lexicon Based Coefficient Attenuation (LBCA). In particular, we induce a classification model using the Random Forest algorithm over two real Twitter datasets. The corresponding results show the developed model achieved an average accuracy of 94.47% considering two different scenarios: single theme and miscellaneous theme.
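To illustrate how a Discrete Wavelet Transform can turn a numeric signal into a feature vector, here is a minimal sketch using the Haar wavelet. It is not the authors' pipeline: the choice of wavelet, the use of per-level energies as features, and the function names are assumptions. The mapping from text to a numeric signal and the LBCA weighting are not described here and are left out.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence."""
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_energy_features(signal, levels):
    """Detail-coefficient energy per level, then the final approximation energy.

    The signal length must be divisible by 2 ** levels.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2 ** levels")
    features = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        features.append(sum(c * c for c in detail))
    features.append(sum(c * c for c in approx))
    return features
```

Because the Haar transform is orthonormal, the energies sum to the energy of the input signal, so the feature vector describes how that energy is distributed across scales.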
With the success of emerging RGB-D cameras such as the Kinect sensor, combining shape (depth) and texture information to improve the quality of recognition has become a trend among computer vision researchers. In this work, we address the problem of face classification in the context of RGB images and depth data. Inspired by psychological results on human face perception, this paper focuses on (i) finding out which facial parts are most effective at making the difference for some social aspects of face perception (gender, ethnicity and emotional state), (ii) determining the optimal decision by combining the decisions rendered by the individual parts, and (iii) extracting promising features from RGB-D faces in order to exploit all the potential that these data provide. Experimental results on EurecomKinect Face and CurtinFaces databases show that the proposed approach improves the recognition quality in many use cases.
Feature learning has enjoyed much attention and achieved good performance in recent studies of image processing. Unlike the settings often assumed there, far less labeled data is typically available for training emotion classification systems. Moreover, in most current works the learning-based features are learned from a full face image and treated as independent of each other, and they therefore lack the power to describe visually coherent facial images. Our method is therefore designed with the goal of simplifying the problem domain by removing expression-irrelevant factors from the input images with a key region-based mechanism, in an effort to reduce the amount of data required to effectively train the feature learning methods. Meanwhile, we can construct geometric constraints between the key regions and their detected positions. To this end, we introduce a Spatially Coherent feature learning method for Pose-invariant FER (SC-PFER). In our model, we first perform face frontalization through a 3D pose normalization technique, which normalizes pose while preserving identity information by synthesizing frontal faces for facial images with arbitrary views. Subsequently, we select a sequence of key regions around 51 key points in the synthetic frontal face images for efficient unsupervised feature learning. Finally, we introduce a linkage structure over the learning-based features and the corresponding geometry information of each key region to encode the dependencies of the regions. Our method, on the whole, does not require training multiple models for each specific pose and avoids separate training and parameter tuning for each pose. Extensive experiments on three benchmark datasets show that our approach leads to stable and robust recognition performance, and outperforms several well-established FER methods.
We propose data musicalization, i.e., automated composition of music based on given data, as an approach to perceptualizing information. The aim of data musicalization is to evoke subjective experiences in the user rather than just convey unemotional information. We illustrate data musicalization by introducing several novel applications: one that perceptualizes physical sleep data as music, several that artistically compose music inspired by the sleep data, one that musicalizes on-line chat conversations to provide perceptualization of the liveliness of a discussion, and one that uses musicalization in a game-like mobile application to allow its users to produce music. We also present a preliminary empirical evaluation of chat musicalization suggesting that some features of online conversations are naturally represented as music. We provide a number of electronic samples of music produced by the different musicalization applications so readers may judge the aesthetic pleasure and artistic quality themselves.
Introduction to the Special Section on Multimedia Computing and Applications of Socio-Affective Behaviors in the Wild
Features extracted by deep networks have been popular in many visual search tasks. This paper studies deep network structures and training schemes for mobile visual search. The goal is to learn an effective yet portable feature representation that is suitable for bridging the domain gap between mobile user photos and (mostly) professionally taken product images, while keeping the computational cost acceptable for mobile-based applications. The technical contributions are two-fold. First, we propose an alternative to the contrastive loss popularly used for training deep Siamese networks, namely robust contrastive loss, where we relax the penalty on some positive pairs to alleviate overfitting. Second, a simple multi-task fine-tuning scheme is leveraged to train the network, which not only utilizes knowledge from the provided training photo pairs, but also harnesses additional information from the large ImageNet dataset to regularize the fine-tuning process. Extensive experiments on challenging real-world datasets demonstrate that both the robust contrastive loss and the multi-task fine-tuning scheme are effective, leading to very promising results with a time cost suitable for mobile product search scenarios.
Mainstream approaches for 3D human action recognition usually combine depth and skeleton features to improve the recognition accuracy. However, this strategy may result in high feature dimension and low discrimination due to redundancy in the feature vector. To address this drawback, a multi-feature selection approach for 3D human action recognition is proposed in this paper. First, three novel single-modal features are proposed to respectively describe depth appearance, depth motion, and skeleton motion. Second, a classification entropy of random forest is used to evaluate the discriminative power of the depth-appearance-based feature. Furthermore, one of the three features is selected to recognize the sample according to the discrimination evaluation. Experimental results show that the proposed multi-feature selection significantly outperforms single-modal features and other feature-fusion-based approaches.
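One way to read "classification entropy of random forest" is as the Shannon entropy of the class votes cast by the individual trees: unanimous votes give zero entropy, evenly split votes give high entropy. The sketch below follows that reading. The function names and the 0.5-bit threshold are hypothetical, and the rule by which the paper chooses among its three features is not specified here, so only a generic confidence check is shown.

```python
import math
from collections import Counter


def vote_entropy(tree_votes):
    """Shannon entropy (in bits) of the class labels voted by the trees."""
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_discriminative(tree_votes, threshold_bits=0.5):
    """Hypothetical rule: trust the feature if the forest's votes are concentrated."""
    return vote_entropy(tree_votes) <= threshold_bits
```

For example, ten trees all voting for the same action yield an entropy of 0 bits, while a 5/5 split between two actions yields 1 bit, which under the assumed threshold would suggest falling back to another feature.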
Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Even though saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, research is still struggling to incorporate these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We show, through extensive quantitative and qualitative experiments on large scale datasets, that our model achieves superior performances with respect to captioning baselines with and without saliency, and to different state of the art approaches combining saliency and captioning.
In this paper, we present a sequential method to reconstruct articulated poses in three dimensions from monocular sequences. This is challenging because of the inherent depth ambiguities, high dimensionality and complexity of human motion. The approach models human movement on the kinematic manifold with the tangent bundle, which allows for a natural geometrical representation of the articulated motion. Combining this with a second-order stochastic dynamic model based on the Markov hypothesis, we generalize the Extended Rauch-Tung-Striebel smoother to Riemannian manifolds to estimate a limited number of subsequent frames. However, the movement might invalidate the Markov hypothesis when the body is impacted by external forces. Hence, based on evidence from the current 2D observation, the current estimate is refined in a feasible solution region formed by the previous and estimated poses. This region is called a simplex, in which each element can be represented as a convex combination of its vertices. We show that resolving this problem is equivalent to solving a convex optimization problem with the simplicial constraint. This formulation accords with the kinematic principle and the temporal-spatial continuity of the articulated motion, so the reconstruction ambiguity can be substantially alleviated. Empirical evaluation on multiple synthetic sequences from the CMU and HDM05 motion capture datasets shows that the proposed approach achieves greater accuracy than a state-of-the-art baseline. Further, the proposed approach outperforms two baselines on the real sequences of the Human3.6m motion capture dataset.
Broadcasting live video directly from mobile devices is rapidly gaining popularity with applications like Periscope and Facebook Live. The Quality of Experience (QoE) provided by these services comprises many factors, such as quality of transmitted video, video playback stalling, end-to-end latency, and impact on battery life, which are not yet well understood. In this paper, we examine mainly the Periscope service through a comprehensive measurement study and compare it in some aspects to Facebook Live. We shed light on the usage of Periscope through analysis of crawled data and then investigate the aforementioned QoE factors through statistical analyses as well as controlled small-scale measurements using a couple of different smartphones and both versions, Android and iOS, of the two applications. We report a number of findings including the discrepancy in latency between the two most commonly used protocols, RTMP and HLS, surprising surges in bandwidth demand caused by the Periscope app's chat feature, substantial variations in video quality, poor adaptation of video bitrate to available upstream bandwidth at the video broadcaster side, and significant power consumption caused by the applications.
Single-source HTTP Adaptive Streaming (HAS) protocols, such as MPEG-DASH, have become the de facto solutions to deliver live video over the Internet. By avoiding buffer stalling events that are mainly caused by the lack of throughput at the client or server side, HAS protocols increase end users' Quality of Experience (QoE). We propose to extend HAS capabilities to a pragmatic HAS-compliant multi-source protocol that simultaneously utilizes several servers: MS-Stream. MS-Stream aims at offering high-QoE live content delivery by exploiting expanded bandwidth and link diversity in distributed heterogeneous infrastructures. By leveraging end users' connectivity capacities, we further extend the QoE and scalability capabilities of our proposal and present a hybrid P2P/Multi-Source live-streaming solution: PMS. PMS is a distributed streaming solution that trades off the system's scalability against end users' QoE. The bitrate adaptation algorithm relies on global and local indicators characterizing the capacity and efficiency of the entire system. This paper presents our contribution in building a lightweight, pragmatic and evolving solution utilizing both P2P and DASH to achieve low-cost live content delivery at high QoE.
In this work, we discuss enhanced full 360° 3D reconstruction of dynamic scenes containing non-rigidly deforming objects using data acquired from commodity RGB-D or 3D cameras. Several approaches for enhanced and full 3D reconstruction of non-rigid objects have been proposed in the literature, but they suffer from several limitations due to the requirement of a template, inability to tackle large local deformations and topology changes, inability to tackle highly noisy and low-resolution data, and inability to produce on-line results. We target on-line and template-free enhancement of the quality of noisy and low-resolution full 3D reconstructions of dynamic non-rigid objects acquired with a fully calibrated multi-view system. For this purpose, we propose a recursive dynamic multi-frame 3D super-resolution scheme for noise removal and resolution enhancement of 3D measurements of non-rigidly deforming objects. The proposed scheme tracks the position and motion of each 3D point at every time step by making use of the current acquisition and the result of the previous iteration. The effects of system blur due to per-point tracking are subsequently tackled by introducing a novel and efficient multi-level 3D bilateral total variation regularization. These characteristics enable the proposed scheme to handle large deformations and topology changes accurately. A thorough evaluation of the proposed scheme on both real and simulated data is carried out. The results show that the proposed scheme improves upon the performance of state-of-the-art methods and is able to accurately enhance the quality of low-resolution and highly noisy 3D reconstructions while being robust to large local deformations.
State-of-the-art Software Defined Wide Area Networks (SD-WANs) provide the foundation for flexible and highly resilient networking. In this work we design, implement and evaluate a novel architecture (denoted SABR) that leverages the benefits of SDN to provide network-assisted Adaptive Bitrate Streaming. With clients retaining full control of their streaming algorithms, we clearly show that with this network assistance, both the clients and the content providers benefit significantly in terms of QoE and content origin offloading. SABR utilizes information on available bandwidths per link and network cache contents to guide video streaming clients with the goal of improving the viewers' QoE. In addition, SABR uses SDN capabilities to dynamically program flows to optimize the utilization of CDN caches. Backed by our study of SDN-assisted streaming, we discuss the change in the requirements for network-to-player APIs that enable flexible video streaming. We illustrate the difficulty of the problem and the impact of SDN-assisted streaming on QoE metrics using various well-established player algorithms. We evaluate SABR together with state-of-the-art DASH quality adaptation algorithms through a series of experiments performed on a real-world, SDN-enabled testbed network with minimal modifications to an existing DASH client. In addition, we compare the performance of different caching strategies in combination with SABR. Our trace-based measurements show a substantial improvement in cache hit rates and QoE metrics in conjunction with SABR, indicating a rich design space for jointly optimized SDN-assisted caching architectures for adaptive bitrate video streaming applications.
Quality of Experience (QoE) has become one of the key factors that determine whether a new multimedia service will be successfully accepted by end users. Accordingly, several QoE models have been developed with the aim of capturing the perception of the user by considering as many influencing factors as possible. However, when it comes to adopting these models in the management of services and networks, it frequently happens that no single provider has access to all the tools needed either to measure the influencing factor parameters or to control the delivered quality. In particular, this often happens to Over The Top (OTT) providers and Internet Service Providers (ISPs), which act in complementary roles in service delivery over the Internet. On the basis of this consideration, in this paper we first highlight the importance of a possible OTT-ISP collaboration for joint service management in terms of technical and economic aspects. Then, we propose a general reference architecture for a possible collaboration and information exchange among them. Next, we define three different approaches, namely: joint-venture, customer lifetime value-based, and QoE-fairness-based. The first aims to maximize the revenue by providing better QoE to customers paying more. The second aims to maximize the profit by providing better QoE to Most Profitable Customers (MPCs). The third aims to maximize QoE fairness among all customers. Finally, we conduct simulations to compare the three approaches in terms of QoE provided to the users, profit generated for the providers and QoE fairness.
Median filtering forensics in images has gained wide attention from researchers in recent years because of its inherent nature of preserving visual traces. Although many image forensic methods have been developed for median filtering detection, the probability of detection drops under JPEG compression at low quality factors and for low-resolution images. Feature set reduction is also a challenging issue among existing detectors. In this paper, a 19-dimensional feature set is analytically derived from skewness and kurtosis histograms. The new feature set is presented for the purpose of global median filtering forensics, supported with exhaustive experimental results to thoroughly assess the benefits and limitations of our proposed method. The efficacy of the method is tested on five popular image databases (UCID, BOWS2, BOSSBase, NRCS and DID), and the new feature set is found to uncover filtering traces under moderate and low JPEG post-compression quality and low-resolution operation. Our proposed method yields the lowest probability of error and the largest area under the ROC curve for most of the test cases in comparison with previous approaches. The obtained results indicate that the proposed method would provide an important tool to the field of passive image forensics.
Digital multimedia steganalysis has attracted wide attention during the past decade. Up to now, there are many algorithms for detecting image steganography. However, just a few works have been reported for audio steganalysis. Since the statistical properties between image and audio are quite different, those effective features in image steganalysis may not be suitable for audio directly. In this paper, we design an improved audio steganalytic feature set derived from both the time and Mel-frequency domains for detecting some typical steganography in the time domain, including LSB matching, Hide4PGP, and Steghide. The experimental results evaluated on different audio sources, including various music and speech clips as well as their decompressed versions with different bit rates, have shown that the proposed features significantly outperform the existing ones, especially for never compressed audio clips. What is more, we use the proposed features to detect and further identify some typical audio operations that would be probably used in audio tampering. The extensive experimental results have shown that the proposed features also outperform the related forensic methods, especially when the length of the audio clip is small, such as audio clips with 800 samples, which is very important in real forensic situation.
Community Question Answering (CQA) websites have become valuable knowledge repositories. Millions of internet users resort to CQA websites to seek answers to the questions they encounter. CQA websites provide information far beyond a search on a site such as Google due to (1) the plethora of high-quality answers, and (2) the capability to post new questions to communities of domain experts. While most research efforts have been made to identify experts or to preliminarily detect potential experts of CQA websites, there has been a remarkable shift towards investigating how to keep experts engaged. Experts are usually the major contributors of high-quality answers and questions of CQA websites. Consequently, keeping the expert communities active is vital to improving the lifespan of these websites. In this paper, we present an algorithm termed PALP to predict the activity level of users of CQA websites. To the best of our knowledge, PALP is the first to address a personalized activity level prediction model for CQA websites. Furthermore, it takes into consideration user behavior change over time and focuses specifically on expert users. Extensive experiments on the Stack Overflow website demonstrate the competitiveness of PALP over existing methods.