Guest Editorial: Special Section on Multimedia Understanding via Multimodal Analytics
Light field images, such as those captured with plenoptic cameras, offer interesting post-processing opportunities, including depth-of-field management, depth estimation, viewpoint selection, and 3D image synthesis. Like most capture devices, however, plenoptic cameras have a limited dynamic range, so that over- and under-exposed areas in plenoptic images are commonplace. We therefore present a straightforward and robust plenoptic reconstruction technique based on the observation that vignetting causes peripheral views to receive less light than central views. Thus, corresponding pixels in different views can be used to reconstruct illumination, especially in areas where information missing in one view is present in another. Our algorithm accurately reconstructs under- and over-exposed regions (known as declipping), additionally affording an increase in peak luminance of up to 2 f-stops and a comparable lowering of the noise floor. The key advantages of this approach are that no hardware modifications are necessary to improve the dynamic range, that no multiple-exposure techniques are required, and therefore that no ghosting or other artefacts are introduced.
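The per-pixel reconstruction idea can be sketched as follows. This is an illustrative reading of the approach, not the authors' implementation: the clipping thresholds, the per-view gains, and the simple averaging are all assumptions made for the sketch.

```python
# Hedged sketch of vignetting-based declipping: corresponding pixels across
# light-field views receive different effective exposures, so a pixel clipped
# in one view may be recoverable from another. The gains and thresholds here
# are illustrative, not the paper's calibration.

CLIP_LO, CLIP_HI = 0.02, 0.98  # under-/over-exposure thresholds (assumed)

def reconstruct_pixel(view_values, view_gains):
    """Estimate scene radiance for one pixel from corresponding samples.

    view_values: observed pixel values in [0, 1], one per view.
    view_gains:  relative exposure gain of each view (peripheral views < 1
                 due to vignetting).
    Returns the average radiance over all non-clipped samples, or None if
    every view is clipped.
    """
    radiances = [v / g for v, g in zip(view_values, view_gains)
                 if CLIP_LO < v < CLIP_HI]
    if not radiances:
        return None  # truly unrecoverable pixel
    return sum(radiances) / len(radiances)

# A bright region: the central view (gain 1.0) saturates at 1.0, but the
# vignetted peripheral views (gains 0.5 and 0.25) still hold usable samples.
estimate = reconstruct_pixel([1.0, 0.75, 0.40], [1.0, 0.5, 0.25])
```

Because the peripheral views are effectively exposed at lower gain, they play the role of the shorter exposures in a multi-exposure stack, without any second capture and hence without ghosting.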
Hosting interactive video-based services, such as computer games, in the cloud poses particular challenges given their sensitivity to delay. A better understanding of the impact of delay on player-game interactions can help design cloud systems and games that accommodate the delay inherent in cloud systems. Previous top-down studies of delay using full-featured games have helped understand the impact of delay, but often do not generalize nor lend themselves to analytic modeling. Bottom-up studies isolating user input and delay can better generalize and be used in models, but have yet to be applied to cloud-hosted computer games. In order to better understand delay impact in cloud-hosted computer games, we conduct a large bottom-up user study centered on a fundamental game interaction - selecting a moving target with user input subject to delay. We built a custom game that controls both the target speed and input delay and has players select the target using an analog thumbstick controller. Analysis of data from over 50 users shows that target selection time increases exponentially with delay and target speed and is well fit by an exponential model that includes a delay and target speed interaction term. A comparison with two previous studies, both using a mouse instead of a thumbstick, suggests the model's relationship between delay and target speed holds more broadly, providing a foundation for a potential law explaining moving target selection under the delay encountered in cloud-hosted games.
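The form of an exponential selection-time model with an interaction term can be sketched as below. The coefficients are hypothetical placeholders for illustration, not the values fitted from the study's data.

```python
import math

# Illustrative form of an exponential selection-time model with a
# delay x target-speed interaction term. The coefficients below are
# made up for demonstration; the study fits them from user data.
A, B_DELAY, B_SPEED, B_INTER = 0.8, 0.004, 0.002, 0.00001

def selection_time(delay_ms, speed_px_s):
    """Predicted time (seconds) to select a moving target."""
    return A * math.exp(B_DELAY * delay_ms
                        + B_SPEED * speed_px_s
                        + B_INTER * delay_ms * speed_px_s)

base = selection_time(0, 0)       # no delay, static target -> baseline A
hard = selection_time(300, 600)   # high delay and speed compound via B_INTER
```

The interaction coefficient is what lets the model capture that delay hurts more when the target is also fast, rather than the two effects being independent.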
We present a novel fine-grained image recognition framework using user click data, which can bridge the semantic gap in distinguishing visually similar categories. As the query set is usually large-scale and redundant, we first propose a click-feature-based query merging approach to merge semantically similar queries and construct a compact click feature. Afterwards, we utilize this compact click feature and a Convolutional Neural Network (CNN) based deep visual feature to jointly represent an image. Finally, with the combined feature, we employ a metric learning based template matching scheme for efficient recognition. Considering the heavy noise in the training data, we introduce a reliability variable to characterize image reliability, and propose a Weakly supervised Metric and Template Learning with Deep feature and Click data (WMTLDC) method to jointly learn the distance metric, object templates, and image reliability. Extensive experiments are conducted on the public Clickture-Dog dataset. They show that the click-data-based query merging helps generate a highly compact click feature for images (the dimension is reduced to 0.9% of the original), which greatly improves computational efficiency. Moreover, introducing this click feature boosts the recognition accuracy by more than 20% compared to using the CNN feature alone. The proposed framework performs much better than previous state-of-the-art methods on fine-grained recognition tasks.
The world has experienced phenomenal growth in data production and storage in recent years, much of which has taken the form of media files. At the same time, computing power has become abundant with multi-core machines, grids and clouds. Yet it remains a challenge to harness the available power and move towards gracefully handling web-scale media collections. Several researchers have experimented with using automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly using small collections on small computing clusters. In this paper, we describe a prototype of a (near) web-scale multimedia service using the Spark framework running on the AWS cloud service. We present experimental results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. We also present a publicly available demonstration system, running on our own servers, where the implementation of the Spark pipelines can be observed in practice using standard image benchmarks, and downloaded for research purposes. Finally, we describe a method to evaluate retrieval quality of the ever-growing high-dimensional index of the prototype, without actually indexing a web-scale media collection.
Inconsistency in contrast enhancement can be used to expose image forgeries. In this work, we describe a new method to estimate contrast enhancement from a single image. Our method takes advantage of the nature of contrast enhancement as a mapping between pixel values, and of the distinct characteristics it introduces to the image pixel histogram. Our method recovers the original pixel histogram and the contrast enhancement simultaneously from a single image with an iterative algorithm. Unlike previous methods, our method is robust in the presence of additive noise perturbations that are used to hide the traces of contrast enhancement. Furthermore, we also develop an effective method to detect image regions that have undergone contrast enhancement transformations different from the rest of the image, and use this method to detect composite images. We perform extensive experimental evaluations to demonstrate the efficacy and efficiency of our method.
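A toy illustration of the histogram fingerprint that such methods exploit: applying a monotone mapping to 8-bit pixel values leaves empty bins (gaps) and over-full bins (peaks) in the histogram. The gamma curve below is just one illustrative enhancement, not the paper's estimator.

```python
from collections import Counter

# Sketch of the histogram fingerprint that contrast enhancement leaves
# behind: a monotone pixel-value mapping on 8-bit data produces empty
# bins (gaps) and over-full bins (peaks). The gamma mapping below is an
# illustrative enhancement, not the paper's iterative recovery algorithm.

def gamma_map(v, gamma=0.6):
    """A simple 8-bit contrast enhancement (gamma correction)."""
    return round(255 * (v / 255) ** gamma)

def empty_bins(values):
    """Count unused 8-bit levels inside the occupied value range."""
    hist = Counter(values)
    lo, hi = min(hist), max(hist)
    return sum(1 for level in range(lo, hi + 1) if level not in hist)

flat = list(range(256))                   # every level present once
enhanced = [gamma_map(v) for v in flat]   # stretched regions now have gaps
```

An unenhanced image covers its value range without gaps, while the enhanced one shows the tell-tale empty bins; detecting regions whose fingerprint differs from the rest of the image is what exposes composites.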
Quality of Experience (QoE) has received much attention over the past years and has become a prominent issue for delivering services and applications. A significant amount of research has been devoted to understanding, measuring, and modelling QoE for a variety of media services. The next logical step is to actively exploit that accumulated knowledge to improve and manage the quality of multimedia services, while at the same time ensuring efficient and cost-effective network operations. Moreover, with many different players involved in the end-to-end service delivery chain, identifying the root causes of QoE impairments and finding effective solutions for meeting the end users' requirements and expectations in terms of service quality is a challenging and complex problem. In this paper we survey state-of-the-art findings and present emerging concepts and challenges related to managing QoE for networked multimedia services. Going beyond a number of previously published survey papers addressing the topic of QoE management, we address QoE management in the context of ongoing developments, such as the move to 5G and virtualized networks, the exploitation of big data analytics and machine learning, and the steady rise of new and immersive services (e.g., augmented and virtual reality). We address the implications of such paradigm shifts in terms of new approaches in QoE modeling, and the need for novel QoE monitoring and management infrastructures.
In this paper, we explore the possibility of enabling cloud-based virtual space applications, for better computational scalability and easy access from any end device, including future lightweight wireless head-mounted displays (HMDs). In particular, we investigate virtual classroom and virtual gallery applications, in which the scenes and activities are rendered in the cloud, with multiple views captured and streamed to each end device. A key challenge is the high bandwidth required to stream all the user views, leading to high operational cost and potentially large delay in a bandwidth-restricted wireless network. We propose a novel hybrid-cast approach to save bandwidth in a multi-user streaming scenario: we identify and broadcast the common pixels shared by multiple users, while unicasting the residual pixels for each user. We formulate the problem of minimizing the total bitrate needed to transmit the user views using hybrid-casting and describe our approach. A common-view extraction approach and a smart grouping algorithm are proposed and developed to realize hybrid-casting. Simulation results show that the hybrid-cast approach can significantly reduce the total bitrate, by up to 55%, compared to the traditional cloud-based approach of transmitting all the views as individual unicast streams, hence addressing the bandwidth challenges of the cloud, with additional benefits in cost and delay.
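The bandwidth arithmetic behind hybrid-casting can be sketched with a back-of-envelope model: the common part is sent once as a broadcast, and only per-user residuals are unicast. The rates below are arbitrary illustrative units; the 55% figure in the abstract comes from full simulations, not this toy calculation.

```python
# Back-of-envelope sketch of the hybrid-cast saving: pixels shared by all
# users in a group are broadcast once, and only per-user residuals are
# unicast. All rates are in arbitrary units and purely illustrative.

def unicast_total(view_rates):
    """Traditional approach: each user's view is its own unicast stream."""
    return sum(view_rates)

def hybrid_total(common_rate, residual_rates):
    """Hybrid-cast: one broadcast of the common part + per-user residuals."""
    return common_rate + sum(residual_rates)

views = [10.0, 10.0, 10.0, 10.0]     # four users with similar viewpoints
common = 7.0                         # rate of the pixels shared by all four
residuals = [3.0, 3.0, 3.0, 3.0]     # what remains to unicast per user

saving = 1 - hybrid_total(common, residuals) / unicast_total(views)
```

With these illustrative numbers the hybrid stream needs 19 units instead of 40; the saving grows with the number of users sharing the common part, which is why the grouping algorithm matters.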
The next generation of multimedia services will have to be optimized in a personalized way, and therefore user factors will play a crucial role in individual experience evaluation. So far, the influence of user factors has mainly been investigated in controlled laboratory environments, which often include a limited number of users and fail to reflect real-life conditions. Social media, especially Facebook, provides an interesting alternative for internet-based subjective experimentation. In this paper, we develop an open-source Facebook application, named YouQ, as an experimental platform for studying individual experience evaluations. Our results show that subjective experimentation based on YouQ can produce results as reliable as those of a controlled laboratory experiment. Additionally, YouQ is able to collect user information automatically from Facebook, and such user information has shown its potential for modelling individual experience.
The large share of Internet traffic generated by video streaming services puts high loads on access networks and produces high costs for the content delivery infrastructure. To reduce the bandwidth consumed while maintaining high playback quality, video players use policies that control and limit the buffer level. This allows shaping the bandwidth consumed by video streams and limiting the traffic wasted in case of playback abortion. Especially in mobile scenarios, where the bandwidth can be highly variable, the buffer policy can have a high impact on the probability of interruptions during video playback. To find the optimal setting for the buffer policy under each network condition, the relationship between the parameters of the buffer policy, the network dynamics, and the corresponding video playback behavior needs to be understood. To this end, we model the video buffer as a GI/GI/1 queue with pq-policy using discrete-time analysis. This allows evaluating the impact of varying network conditions and video bitrate on the efficiency of the buffer policy. By studying the stochastic properties of the buffer level distribution, we are able to accurately evaluate the impact of network and video bitrate dynamics on the video playback quality under the buffer policy. Further, we can optimize the trade-off between the traffic wasted in case of video abortion and the video streaming quality experienced by the user.
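The threshold behavior such a buffer policy induces can be sketched with a discrete-time simulation: downloading pauses once q seconds of video are buffered and resumes when the buffer drains to p seconds. This is one plausible reading of the pq-policy for illustration; the paper analyzes the GI/GI/1 model analytically rather than by simulation.

```python
# Discrete-time sketch of a client video buffer under a two-threshold
# (p, q) policy: downloading pauses once q seconds of video are buffered
# and resumes when the buffer drains to p seconds. Illustrative only; the
# paper treats the GI/GI/1 model with discrete-time analysis.

def simulate(p, q, trace, bitrate=1.0):
    """Return the number of stalled playback slots for a throughput trace.

    trace: downloadable video (in seconds of playback) per time slot.
    """
    buffer = 0.0      # seconds of video currently buffered
    filling = True
    stalls = 0
    for bw in trace:
        if filling:
            buffer += bw / bitrate
            if buffer >= q:
                filling = False   # upper threshold reached: stop filling
        if buffer >= 1.0:
            buffer -= 1.0         # play one second of video per slot
        else:
            stalls += 1           # not enough buffered video: stall
        if buffer <= p:
            filling = True        # lower threshold reached: resume filling
    return stalls

# Steady throughput with a 10-slot outage in the middle: a small upper
# threshold q saves wasted traffic on abortion but cannot ride out the outage.
trace = [1.5] * 20 + [0.0] * 10 + [1.5] * 20
```

Running `simulate` with a large q absorbs the outage without stalling, while a small q stalls repeatedly, which is exactly the wasted-traffic vs. playback-quality trade-off the analysis optimizes.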
Video streaming applications currently dominate Internet traffic. Particularly, HTTP Adaptive Streaming (HAS) has emerged as the de facto standard for streaming videos over the best-effort Internet, thanks to its capability of matching the video quality to the available network resources. In HAS, the video client is equipped with a heuristic that dynamically decides the most suitable quality to stream the content, based on information such as the perceived network bandwidth or the video player buffer status. The goal of this heuristic is to optimize the quality as perceived by the user, the so-called Quality of Experience (QoE). Despite the many advantages brought by the adaptive streaming principle, optimizing users' QoE is far from trivial. Current heuristics are still suboptimal when sudden bandwidth drops occur, especially in wireless environments, thus leading to freezes in the video playout, the main factor influencing users' QoE. This issue is aggravated in case of live events, where the player buffer has to be kept as small as possible in order to reduce the playout delay between the user and the live signal. In light of the above, in recent years, several works have been proposed with the aim of extending the classical purely client-based structure of adaptive video streaming, in order to fully optimize users' QoE. In this paper, a survey is presented of research works on this topic together with a classification based on where the optimization takes place. This classification goes beyond client-based heuristics to investigate the usage of server- and network-assisted architectures and of new application and transport layer protocols. In addition, we outline the major challenges currently arising in the field of multimedia delivery, which are going to be of extreme relevance in future years.
Detection of aesthetic highlights is a challenge for understanding the affective processes taking place during movie watching. In this paper we focus our study on spectators' responses to movie aesthetic stimuli in a social context. Moreover, we seek to uncover the emotional component of aesthetic highlights in movies. Our assumption is that synchronized physiological and behavioral reactions among spectators occur during these highlights because: (i) the aesthetic choices of filmmakers are made to elicit specific emotional reactions (e.g. special effects, empathy and compassion toward a character, etc.) and (ii) watching a movie together causes spectators' affective reactions to be synchronized through emotional contagion. We compare different approaches to estimating synchronization among groups of spectators' signals, such as pairwise, group, and overall synchronization measures, to detect aesthetic highlights in movies. The results show that an unsupervised architecture relying on synchronization measures is able to capture different properties of spectators' synchronization and detect aesthetic highlights based on both spectators' electrodermal and acceleration signals. Pairwise synchronization measures perform the most accurately, independently of the type of highlight and movie genre. Moreover, we observe that electrodermal signals have more discriminative power than acceleration signals for highlight detection.
Effective and efficient video retrieval has become a pressing need in the "big video'' era. The objective of this work is to provide a principled model for computing the ranking scores of a video in response to one or more concepts, where the concepts could be directly supplied by users or inferred by the system from the user queries. Indeed, how to deal with multi-concept queries has become a central component in modern video retrieval systems that accept text queries. However, it has long been overlooked and simply implemented as a weighted average of the corresponding concept detectors' scores. Our approach, which can be considered a latent ranking SVM, integrates the advantages of various recent works in text and image retrieval, such as choosing ranking over structured prediction and modeling inter-dependencies between the querying concepts and the others. Videos consist of shots, and we use latent variables to account for the mutually complementary cues within and across shots. Concept labels of shots are scarce and noisy, so we introduce a simple and effective technique to make our model robust to outliers. Our approach gives superior performance when tested not only on queries seen at training time but also on novel queries, some of which consist of more concepts than the queries used for training.
In this paper we address the problem of recognizing an event from a single related picture. Given the large number of event classes and the limited information contained in a single shot, the problem is known to be particularly hard. In order to achieve a reliable detection, we propose a combination of multiple classifiers, and we compare three alternative strategies to fuse the results of each classifier, namely: (i) Induced Ordered Weighted Averaging operators, (ii) Genetic Algorithms, and (iii) Particle Swarm Optimization. Each method is aimed at determining the optimal weights to be assigned to the decision scores yielded by different deep models, according to the relevant optimization strategy. Experimental tests have been performed on three event recognition datasets, evaluating the performance of various deep models, both alone and selectively combined. Experimental results demonstrate that the proposed approach outperforms traditional multiple classifier solutions based on uniform weighting, as well as recent state-of-the-art approaches.
We propose data musicalization, i.e., automated composition of music based on given data, as an approach to perceptualizing information. The aim of data musicalization is to evoke subjective experiences in the user rather than just convey unemotional information. We illustrate data musicalization by introducing several novel applications: one that perceptualizes physical sleep data as music, several that artistically compose music inspired by sleep data, one that musicalizes online chat conversations to perceptualize the liveliness of a discussion, and one that uses musicalization in a game-like mobile application to allow its users to produce music. We also present a preliminary empirical evaluation of chat musicalization, suggesting that some features of online conversations are naturally represented as music. We provide a number of electronic samples of music produced by the different musicalization applications so readers may judge the aesthetic pleasure and artistic quality themselves.
Guest Editorial: Special Issue on QoE Management for Multimedia Services
With the increasing accessibility of the mobile head-mounted displays (HMDs), mobile virtual reality (VR) systems are finding applications in various areas. However, mobile HMDs are highly constrained with limited graphics processing units (GPUs), low processing power and onboard memory. Hence, VR developers must be cognizant of the number of polygons contained within their virtual environments to avoid rendering at low frame rates and inducing simulator sickness. The most robust and rapid approach to keeping the overall number of polygons low is to use mesh simplification algorithms to create low-poly versions of pre-existing, high-poly models. Unfortunately, most existing mesh simplification algorithms cannot adequately handle meshes with lots of boundaries or non-manifold meshes, which are common attributes of many 3D models. In this paper, we present QEM4VR, a high-fidelity mesh simplification algorithm specifically designed for VR. This algorithm addresses the deficiencies of prior quadric error metric (QEM) approaches by leveraging the insight that the most relevant boundary edges lie along curvatures while linear boundary edges can be collapsed. Additionally, our algorithm preserves key surface properties, such as normals, texture coordinates, colors, and materials, as it pre-processes 3D models and generates their low-poly approximations offline. We evaluated the effectiveness of our QEM4VR algorithm by comparing its simplified-mesh results to those of prior QEM variations in terms of geometric approximation error, texture error, progressive approximation errors, frame rate impact, and perceptual quality measures. We found that QEM4VR consistently yielded simplified meshes with less geometric approximation error and texture error than the prior QEM variations. It afforded better frame rates than QEM variations with boundary preservation constraints that create unnecessary lower bounds on overall polygon count reduction. 
Our evaluation revealed that QEM4VR did not fare well in terms of existing perceptual distance measurements, but human-based inspections demonstrate that these algorithmic measurements are not suitable substitutes for actual human perception. In turn, we present a user-based methodology for evaluating the perceptual qualities of mesh simplification algorithms.
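The classic quadric error metric that QEM4VR builds upon can be sketched in a few lines: each vertex accumulates the quadrics of its incident triangle planes, and the cost of an edge collapse is the quadric error of the merged vertex. The curvature-aware boundary handling and attribute preservation that distinguish QEM4VR are omitted from this minimal sketch.

```python
# Minimal sketch of the quadric error metric (QEM) underlying QEM4VR.
# QEM4VR's boundary-edge and attribute handling are not shown here.

def plane_quadric(a, b, c, d):
    """Fundamental error quadric K_p = p p^T for plane ax + by + cz + d = 0
    (the plane normal (a, b, c) is assumed unit length)."""
    p = (a, b, c, d)
    return [[p[i] * p[j] for j in range(4)] for i in range(4)]

def add_quadrics(q1, q2):
    """Quadrics accumulate by matrix addition over a vertex's planes."""
    return [[q1[i][j] + q2[i][j] for j in range(4)] for i in range(4)]

def quadric_error(q, v):
    """Collapse cost v^T Q v for vertex v = (x, y, z), homogenized with 1."""
    h = (v[0], v[1], v[2], 1.0)
    return sum(h[i] * q[i][j] * h[j] for i in range(4) for j in range(4))

# A vertex adjacent to the planes z = 0 and x = 0 accumulates both quadrics:
q = add_quadrics(plane_quadric(0, 0, 1, 0), plane_quadric(1, 0, 0, 0))
on_both = quadric_error(q, (0.0, 2.0, 0.0))   # lies on both planes
off = quadric_error(q, (1.0, 2.0, 3.0))       # squared distances 1 + 9
```

The simplifier repeatedly collapses the edge whose merged vertex has the lowest such error, which is why badly handled boundary quadrics (the deficiency QEM4VR targets) distort the collapse ordering.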
We show how to build the components of a privacy-aware, live video analytics ecosystem from the bottom up, starting with OpenFace, our new open-source face recognition system that approaches state-of-the-art accuracy. Integrating OpenFace with inter-frame tracking, we build RTFace, a mechanism for denaturing video streams that selectively blurs faces according to specified policies at full frame rates. This enables privacy management for live video analytics while providing a secure approach for handling retrospective policy exceptions. Finally, we present a scalable, privacy-aware architecture for large camera networks using RTFace, and show how it can be an enabler for a vibrant ecosystem and marketplace of privacy-aware video streams and analytics services.
Cloud gaming has gained significant popularity recently due to important benefits such as the removal of device constraints, instant-on access, and cross-platform support. Its intensive resource demands and dynamic workloads make cloud gaming well suited to an elastic cloud platform. Facing a large user population, a fundamental problem is how to provide satisfactory cloud gaming service at modest cost. We observe that software maintenance cost can be substantial compared to server running cost in cloud gaming using elastic cloud resources. In this paper, we address the server provisioning problem for cloud gaming to optimize both server running cost and software maintenance cost. We find that the distribution of game software among servers and the selection of server types both trigger trade-offs between software maintenance cost and server running cost. We formulate the problem with a stochastic model and employ queueing theory to conduct a solid theoretical analysis of the system behavior under different request dispatching policies. We then propose several classes of algorithms to approximate the optimal solution. The proposed algorithms are evaluated by extensive experiments using real-world parameters. The results show that the proposed Ordered and Genetic algorithms are computationally efficient, nearly cost-optimal, and highly robust to dynamic changes.
Features extracted by deep networks have been popular in many visual search tasks. This paper studies deep network structures and training schemes for mobile visual search. The goal is to learn an effective yet portable feature representation that is suitable for bridging the domain gap between mobile user photos and (mostly) professionally taken product images, while keeping the computational cost acceptable for mobile-based applications. The technical contributions are two-fold. First, we propose an alternative to the contrastive loss popularly used for training deep Siamese networks, namely a robust contrastive loss, where we relax the penalty on some positive pairs to alleviate overfitting. Second, a simple multi-task fine-tuning scheme is leveraged to train the network, which not only utilizes knowledge from the provided training photo pairs, but also harnesses additional information from the large ImageNet dataset to regularize the fine-tuning process. Extensive experiments on challenging real-world datasets demonstrate that both the robust contrastive loss and the multi-task fine-tuning scheme are effective, leading to very promising results with a time cost suitable for mobile product search scenarios.
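The relaxation idea can be sketched as a per-pair loss in which positive pairs are only penalized beyond a small margin, rather than being forced to distance zero. The exact formulation and the margin values in the paper may differ; the version below is illustrative.

```python
# Sketch of a "robust" contrastive loss in the spirit described: relax the
# penalty on positive pairs so that near-matches are not pulled all the way
# to distance zero, which the authors argue alleviates overfitting.
# Margins and the exact form are illustrative assumptions.

def robust_contrastive_loss(dist, is_positive, pos_margin=0.3, neg_margin=1.0):
    """Loss for one pair given its embedding distance.

    Positive pairs are only penalized beyond pos_margin (the relaxation);
    negative pairs are pushed apart up to neg_margin, as in the standard
    contrastive loss.
    """
    if is_positive:
        return max(0.0, dist - pos_margin) ** 2
    return max(0.0, neg_margin - dist) ** 2

close_pos = robust_contrastive_loss(0.2, True)    # inside margin: no penalty
far_pos = robust_contrastive_loss(0.8, True)      # penalized beyond margin
close_neg = robust_contrastive_loss(0.2, False)   # negatives pushed apart
```

Setting `pos_margin=0` recovers the standard contrastive loss, so the relaxation is a single extra hyperparameter on the positive branch.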
360 degree video is a new generation of video streaming technology that promises greater immersiveness than standard video streams. This level of immersiveness is similar to that produced by virtual reality devices -- users can control the field of view using head movements rather than needing to manipulate external devices. Although 360 degree video could revolutionize streaming technology, large scale adoption is hindered by a number of factors. 360 degree video streams have larger bandwidth requirements, require faster responsiveness to user inputs, and users may be more sensitive to lower quality streams. In this paper, we review standard approaches toward 360 degree video encoding and compare them to families of approaches that distort the spherical surface to allow oriented concentrations of the 360 degree view. We refer to these distorted projections as offset projections. Via measurement studies, we estimate that, at best, these offset projections can produce similar or better visual quality with fewer than 50% of the pixels, under reasonable assumptions about user behavior. Offset projections complicate adaptive 360 degree video streaming because they require a combination of bitrate and view orientation adaptations. We estimate that this combination of streaming adaptation in two dimensions can cause over 57% more segments to be downloaded compared to an ideal downloading strategy, wasting 20% of the total downloading bandwidth.
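The source of the extra downloads can be illustrated with a toy accounting model: segments are prefetched a few slots ahead using the orientation current at prefetch time, and must be fetched again if the viewer has turned by playback time. The numbers and the model itself are illustrative, not the paper's measurement methodology.

```python
# Toy accounting of the extra downloads caused by two-dimensional (bitrate
# + orientation) adaptation with offset projections: a segment prefetched
# for one view orientation is wasted if the user has turned away by the
# time it plays. Illustrative model only.

def extra_download_ratio(orientations, buffer_depth):
    """Fraction of segments that must be re-downloaded, relative to an
    ideal downloader that always knows the playback-time orientation.

    orientations: the user's discrete view orientation at each segment's
                  playback time (e.g. one of several offset directions).
    buffer_depth: how many segments ahead the player prefetches.
    """
    redownloads = 0
    for i, current in enumerate(orientations):
        # The segment was prefetched buffer_depth slots earlier, using the
        # orientation that was current at that time.
        prefetch_view = orientations[max(0, i - buffer_depth)]
        if prefetch_view != current:
            redownloads += 1
    return redownloads / len(orientations)
```

A perfectly still viewer wastes nothing, while every head turn invalidates up to `buffer_depth` prefetched segments, which is why deeper buffers trade stall resilience against wasted bandwidth here.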
Large-scale image datasets and deep convolutional neural networks (DCNNs) are the two primary driving forces behind the rapid progress in generic object recognition in recent years. While many network architectures have been continuously designed to pursue lower error rates, few efforts have been devoted to enlarging existing datasets, due to high labeling cost and unfair comparison issues. In this paper, we aim to achieve lower error rates by augmenting existing datasets in an automatic manner. Our method leverages both the Web and DCNNs: the Web provides massive images with rich contextual information, and a DCNN replaces humans in automatically labeling images under the guidance of Web contextual information. Experiments show that our method can automatically and accurately scale up existing datasets significantly from billions of web pages, and significantly improve the performance on object recognition tasks with the automatically augmented datasets, which demonstrates that more supervisory information has been automatically gathered from the Web. Both the dataset and the models trained on it have been made publicly available.
Mainstream approaches for 3D human action recognition usually combine depth and skeleton features to improve recognition accuracy. However, this strategy may result in high feature dimension and low discrimination due to the redundancy of the feature vector. To address this drawback, a multi-feature selection approach for 3D human action recognition is proposed in this paper. First, three novel single-modal features are proposed to describe depth appearance, depth motion, and skeleton motion, respectively. Second, the classification entropy of a random forest is used to evaluate the discrimination of the depth appearance based feature. Then, one of the three features is selected to recognize the sample according to this discrimination evaluation. Experimental results show that the proposed multi-feature selection significantly outperforms single-modal features and other feature fusion based approaches.
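The selection step can be sketched as follows: score the depth-appearance classifier's discrimination by the entropy of the random forest's class-vote distribution, and fall back to another single-modal feature when the votes are too uncertain. The threshold and the fallback label are illustrative assumptions.

```python
import math

# Sketch of entropy-based feature selection: use the depth-appearance
# feature only when the random forest's votes are confident (low entropy),
# otherwise fall back to a motion-based feature. Threshold is illustrative.

def vote_entropy(votes):
    """Shannon entropy (bits) of a class-vote distribution."""
    total = sum(votes)
    probs = [v / total for v in votes if v > 0]
    return -sum(p * math.log2(p) for p in probs)

def choose_feature(depth_votes, threshold=1.0):
    """Select the feature used to classify this sample."""
    if vote_entropy(depth_votes) <= threshold:
        return "depth-appearance"
    return "motion-features"   # depth-motion / skeleton-motion branch

confident = choose_feature([95, 3, 2])    # peaked votes -> low entropy
uncertain = choose_feature([34, 33, 33])  # near-uniform -> high entropy
```

Because only one feature is used per sample, the combined dimensionality problem of naive feature concatenation never arises.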
Image captioning has recently been gaining a lot of attention thanks to the impressive achievements of deep captioning architectures, which combine Convolutional Neural Networks to extract image representations and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Even though saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, research is still struggling to combine these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We show, through extensive quantitative and qualitative experiments on large-scale datasets, that our model achieves superior performance with respect to captioning baselines with and without saliency, and to different state-of-the-art approaches combining saliency and captioning.
To provide an efficient, high-quality, and low-cost solution for interactive web rendering applications, such as video gaming, virtual reality, and simulation, we propose a novel real-time cloud rendering system for the web. Unlike existing cloud rendering systems that render full-frame image sequences, we render lightweight models in the web front-end with WebGL and place the heavy global illumination (GI) calculations in the cloud back-end. Our system consists of three key stages: cloud rendering, image data transmission, and final frame optimization. Compared to traditional cloud rendering systems, our system overcomes their transmission delay defect and shows satisfactory real-time rendering performance in web browsers.
Broadcasting live video directly from mobile devices is rapidly gaining popularity with applications like Periscope and Facebook Live. The Quality of Experience (QoE) provided by these services comprises many factors, such as the quality of the transmitted video, video playback stalling, end-to-end latency, and impact on battery life, and these factors are not yet well understood. In this paper, we examine mainly the Periscope service through a comprehensive measurement study and compare it in some respects to Facebook Live. We shed light on the usage of Periscope through analysis of crawled data and then investigate the aforementioned QoE factors through statistical analyses as well as controlled small-scale measurements using a couple of different smartphones and both the Android and iOS versions of the two applications. We report a number of findings, including the discrepancy in latency between the two most commonly used protocols, RTMP and HLS; surprising surges in bandwidth demand caused by the Periscope app's chat feature; substantial variations in video quality; poor adaptation of the video bitrate to the available upstream bandwidth at the broadcaster side; and significant power consumption caused by the applications.
Single-source HTTP Adaptive Streaming (HAS) protocols, such as MPEG-DASH, have become the de-facto solutions to deliver live video over the Internet. By avoiding buffer stalling events, which are mainly caused by a lack of throughput at the client or server side, HAS protocols increase end-users' Quality of Experience (QoE). We propose to extend HAS capabilities to a pragmatic HAS-compliant multi-source protocol that simultaneously utilizes several servers: MS-Stream. MS-Stream aims at offering high-QoE live content delivery by exploiting expanded bandwidth and link diversity in distributed heterogeneous infrastructures. By leveraging end-users' connectivity capacities, we further extend the QoE and scalability capabilities of our proposal and expose a hybrid P2P/multi-source live-streaming solution: PMS. PMS is a distributed streaming solution trading off the system's scalability against the end-users' QoE. Its bitrate adaptation algorithm relies on global and local indicators characterizing the capacity and efficiency of the entire system. This paper presents our contribution to building a lightweight, pragmatic, and evolving solution utilizing both P2P and DASH to achieve low-cost live-content delivery at high QoE.
State-of-the-art Software-Defined Wide Area Networks (SD-WANs) provide the foundation for flexible and highly resilient networking. In this work we design, implement, and evaluate a novel architecture (denoted SABR) that leverages the benefits of SDN to provide network-assisted adaptive bitrate streaming. With clients retaining full control of their streaming algorithms, we clearly show that through this network assistance both the clients and the content providers benefit significantly in terms of QoE and content-origin offloading. SABR utilizes information on the available bandwidth per link and on network cache contents to guide video streaming clients with the goal of improving the viewer's QoE. In addition, SABR uses SDN capabilities to dynamically program flows to optimize the utilization of CDN caches. Backed by our study of SDN-assisted streaming, we discuss the change in the requirements for network-to-player APIs that enable flexible video streaming. We illustrate the difficulty of the problem and the impact of SDN-assisted streaming on QoE metrics using various well-established player algorithms. We evaluate SABR together with state-of-the-art DASH quality adaptation algorithms through a series of experiments performed on a real-world, SDN-enabled testbed network with minimal modifications to an existing DASH client. In addition, we compare the performance of different caching strategies in combination with SABR. Our trace-based measurements show substantial improvements in cache hit rates and QoE metrics in conjunction with SABR, indicating a rich design space for jointly optimized SDN-assisted caching architectures for adaptive bitrate video streaming applications.
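The kind of guidance such a controller can give — combining per-link bandwidth knowledge with cache contents — might be sketched as below. This is purely illustrative; SABR's actual interfaces, flow programming, and optimization are not reflected here, and all names are hypothetical.

```python
def guide_client(cached_bitrates_kbps, bottleneck_bandwidth_kbps,
                 bitrate_ladder_kbps):
    """Hypothetical network-side recommendation: among representations
    that fit the bottleneck link, prefer ones already in a nearby cache
    (serving from cache also offloads the content origin)."""
    feasible = [b for b in sorted(bitrate_ladder_kbps)
                if b <= bottleneck_bandwidth_kbps]
    if not feasible:
        return min(bitrate_ladder_kbps)  # nothing fits: lowest quality
    cached = [b for b in feasible if b in cached_bitrates_kbps]
    # Highest cached feasible bitrate, else highest feasible bitrate.
    return max(cached) if cached else max(feasible)
```

The design point this sketch captures is that the client keeps its own adaptation logic; the network merely supplies information it alone can observe.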
It is a matter of fact that Quality of Experience (QoE) has become one of the key factors determining whether a new multimedia service will be successfully accepted by end users. Accordingly, several QoE models have been developed with the aim of capturing the user's perception by considering as many influencing factors as possible. However, when it comes to adopting these models in the management of services and networks, it frequently happens that no single provider has access to all the tools to either measure the influencing factors or control the delivered quality. This is often the case for Over-The-Top (OTT) providers and Internet Service Providers (ISPs), which act in complementary roles in service delivery over the Internet. On the basis of this consideration, in this paper we first highlight the importance of a possible OTT-ISP collaboration for joint service management in terms of both technical and economic aspects. Then, we propose a general reference architecture for such collaboration and information exchange. We further define three different approaches, namely: joint-venture, customer-lifetime-value-based, and QoE-fairness-based. The first aims to maximize revenue by providing better QoE to customers who pay more. The second aims to maximize profit by providing better QoE to the Most Profitable Customers (MPCs). The third aims to maximize QoE fairness among all customers. Finally, we conduct simulations to compare the three approaches in terms of the QoE provided to the users, the profit generated for the providers, and QoE fairness.
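One common way to quantify QoE fairness, which the third approach would optimize, maps the standard deviation of per-user QoE onto a [0, 1] index over the rating scale. The sketch below uses this standard-deviation-based index; whether the paper uses exactly this metric is an assumption on our part.

```python
import statistics

def qoe_fairness(mos_scores, scale_low=1.0, scale_high=5.0):
    """Fairness index F = 1 - 2*sigma / (H - L) on a rating scale
    [L, H]: F = 1 means all users receive identical QoE, F = 0 means
    the most unfair possible distribution."""
    sigma = statistics.pstdev(mos_scores)  # population std. deviation
    return 1.0 - 2.0 * sigma / (scale_high - scale_low)
```

On a 1-to-5 MOS scale, identical scores yield F = 1, while splitting users between MOS 1 and MOS 5 yields F = 0, so a fairness-maximizing controller would trade some average QoE for a narrower spread.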
Median filtering forensics in images has gained wide attention from researchers in recent years because of the filter's inherent tendency to preserve visual content while erasing traces of manipulation. Although many image forensic methods have been developed for median filtering detection, the probability of detection drops under JPEG compression at low quality factors and for low-resolution images. Feature set reduction is also a challenging issue for existing detectors. In this paper, a 19-dimensional feature set is analytically derived from skewness and kurtosis histograms. The new feature set is presented for the purpose of global median filtering forensics, supported by exhaustive experimental results that thoroughly assess the benefits and limitations of our proposed method. The efficacy of the method is tested on five popular image databases (UCID, BOWS2, BOSSBase, NRCS, and DID), and we find that the new feature set uncovers filtering traces under moderate and low-quality JPEG post-compression and for low-resolution images. Our proposed method yields the lowest probability of error and the largest area under the ROC curve for most of the test cases in comparison with previous approaches. The obtained results show that the proposed method provides an important tool for the field of passive image forensics.
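Skewness and kurtosis are the standardized third and fourth moments of a pixel distribution; a per-block computation of the kind such histograms are built from can be sketched as follows. The block size and flat-block convention are our own assumptions, and the paper's analytic derivation of the final 19 features is not reproduced here.

```python
import numpy as np

def block_moments(image, block=8):
    """Skewness and excess kurtosis of each non-overlapping block of a
    grayscale image; illustrative raw material for skewness/kurtosis
    histograms (not the paper's exact feature construction)."""
    h, w = image.shape
    feats = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            x = image[i:i + block, j:j + block].astype(float).ravel()
            mu, sd = x.mean(), x.std()
            if sd == 0:                      # flat block: moments set to 0
                feats.append((0.0, 0.0))
                continue
            z = (x - mu) / sd                # standardize
            feats.append(((z ** 3).mean(),   # skewness
                          (z ** 4).mean() - 3.0))  # excess kurtosis
    return feats
```

The intuition behind such features is that median filtering reshapes local pixel distributions, which shifts these moment histograms in a detectable way.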
Digital multimedia steganalysis has attracted wide attention during the past decade. To date, many algorithms have been proposed for detecting image steganography, but only a few works have been reported on audio steganalysis. Since the statistical properties of images and audio are quite different, features that are effective in image steganalysis may not be directly suitable for audio. In this paper, we design an improved audio steganalytic feature set derived from both the time and Mel-frequency domains for detecting several typical time-domain steganographic methods, including LSB matching, Hide4PGP, and Steghide. The experimental results, evaluated on different audio sources, including various music and speech clips as well as their decompressed versions at different bit rates, show that the proposed features significantly outperform existing ones, especially for never-compressed audio clips. Moreover, we use the proposed features to detect and further identify some typical audio operations that would probably be used in audio tampering. Extensive experimental results show that the proposed features also outperform related forensic methods, especially when the audio clip is short, such as clips with 800 samples, which is very important in real forensic situations.
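Time-domain steganalytic features typically summarize the statistics of sample differences, since LSB-style embedding perturbs the residual signal more than the waveform itself. The sketch below is a simplified stand-in for that idea; the function name, the choice of statistics, and the lag range are illustrative, not the paper's feature set.

```python
import numpy as np

def time_domain_features(samples, max_lag=2):
    """Hypothetical time-domain feature sketch: mean, variance, and
    mean absolute value of the lag-n difference signal, for several
    lags. Embedding noise inflates these residual statistics."""
    x = np.asarray(samples, dtype=float)
    feats = []
    for lag in range(1, max_lag + 1):
        d = np.diff(x, n=lag)                # n-th order difference
        feats.extend([d.mean(), d.var(), np.abs(d).mean()])
    return feats
```

The Mel-frequency half of the feature set would analogously summarize statistics of a perceptually warped spectrum; short clips (such as the 800-sample case above) are hard precisely because these statistics are estimated from few samples.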
Community Question Answering (CQA) websites have become valuable knowledge repositories. Millions of Internet users resort to CQA websites to seek answers to the questions they encounter. CQA websites provide information far beyond what a search on a site such as Google yields, owing to (1) the plethora of high-quality answers and (2) the ability to post new questions to communities of domain experts. While most research efforts have aimed to identify existing experts or to detect potential ones on CQA websites, there has been a remarkable shift towards investigating how to keep experts engaged. Experts are usually the major contributors of high-quality answers and questions on CQA websites; consequently, keeping the expert communities active is vital to extending the lifespan of these websites. In this paper, we present an algorithm termed PALP to predict the activity level of CQA website users. To the best of our knowledge, PALP is the first personalized activity-level prediction model for CQA websites. Furthermore, it takes user behavior change over time into consideration and focuses specifically on expert users. Extensive experiments on the Stack Overflow website demonstrate the competitiveness of PALP over existing methods.