Facial images embed age, gender, and other rich information that is implicitly related to occupation. In this work, we argue that occupation prediction from a single facial image is a tractable computer vision problem. We extract multi-level hand-crafted features encoded with locality-constrained linear coding, together with convolutional neural network features, as image occupation descriptors. To avoid the curse of dimensionality and overfitting, a boosting strategy called multi-channel SVM is used to integrate features from the face and body. Intra-class and inter-class visual variations are jointly considered in the boosting framework to further improve performance. In the evaluation, we verify the effectiveness of predicting occupation from the face and demonstrate promising performance obtained by combining face and body information. More importantly, our work is the first attempt to integrate deep features into the multi-channel SVM framework, and it shows significantly better performance than the state of the art.
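As a rough illustration of the fusion idea (not the paper's exact boosting procedure), the sketch below trains one SVM per channel and combines their decision scores with weights; the feature inputs and the accuracy-based weighting are assumptions.

```python
# Minimal sketch of multi-channel SVM score fusion. Feature extraction and
# the accuracy-based channel weights are illustrative assumptions, not the
# paper's exact boosting strategy.
import numpy as np
from sklearn.svm import SVC

def train_multichannel_svm(face_feats, body_feats, labels):
    """Train one SVM per channel; return models plus fusion weights."""
    svm_face = SVC(kernel="linear", probability=True).fit(face_feats, labels)
    svm_body = SVC(kernel="linear", probability=True).fit(body_feats, labels)
    # Weight each channel by its training accuracy (a simple stand-in
    # for the boosting weights described in the abstract).
    w_face = svm_face.score(face_feats, labels)
    w_body = svm_body.score(body_feats, labels)
    return svm_face, svm_body, w_face, w_body

def predict(svm_face, svm_body, w_face, w_body, face_x, body_x):
    """Fuse per-channel class probabilities and pick the best class."""
    p = w_face * svm_face.predict_proba(face_x) + \
        w_body * svm_body.predict_proba(body_x)
    return np.argmax(p, axis=1)
```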
In recent years, deep networks have been successfully applied to model image concepts and have achieved competitive performance on many tasks. In spite of this impressive performance, conventional deep networks suffer degraded performance when training data are insufficient. This problem becomes especially severe for deep networks with powerful representation structures, which are prone to overfitting by capturing nonessential or noisy information in a small dataset. To this end, we propose a novel generalized deep transfer network framework capable of transferring labeling information across heterogeneous domains, such as from the text domain to the image domain. The proposed framework mitigates the problem of insufficient image training data by bringing in rich labels from the text domain. Specifically, to share labels between the two domains, we build weakly shared layers of features, i.e., parameter- and representation-shared layers. These layers can represent both shared inter-domain features and domain-specific features, making the structure more flexible and powerful for jointly capturing complex data from different domains than strongly shared layers. Experimental results on real-world datasets show the competitive performance of the proposed method compared with state-of-the-art methods.
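To make the "weakly shared layers" idea concrete, here is a minimal PyTorch sketch in which each domain keeps its own encoder and classification head while one middle layer's parameters are tied across domains; the layer sizes and naming are assumptions, not the paper's specification.

```python
# Illustrative sketch of weakly shared layers: domain-specific encoders and
# heads around a parameter-shared middle layer. Dimensions are assumptions.
import torch
import torch.nn as nn

class WeaklySharedNet(nn.Module):
    def __init__(self, text_dim, img_dim, hidden=256, n_classes=10):
        super().__init__()
        self.text_enc = nn.Linear(text_dim, hidden)    # domain-specific
        self.img_enc = nn.Linear(img_dim, hidden)      # domain-specific
        self.shared = nn.Linear(hidden, hidden)        # parameter-shared
        self.text_head = nn.Linear(hidden, n_classes)  # domain-specific
        self.img_head = nn.Linear(hidden, n_classes)   # domain-specific

    def forward(self, x, domain):
        enc = self.text_enc if domain == "text" else self.img_enc
        head = self.text_head if domain == "text" else self.img_head
        # The shared layer sees features from both domains during training,
        # letting text labels shape the representation used for images.
        h = torch.relu(self.shared(torch.relu(enc(x))))
        return head(h)
```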
Recently, deep learning techniques have enjoyed success in various multimedia applications, such as image classification and multi-modal data analysis. Large deep learning models have been developed to learn rich representations of complex data. Two challenges must be overcome before deep learning can be widely adopted in multimedia and other applications. One is usability: non-experts must be able to implement different models and training algorithms without much effort, especially when the model is large and complex. The other is scalability: the deep learning system must be able to provision the huge amount of computing resources required for training large models on massive datasets. To address these two challenges, in this paper we design a distributed deep learning platform called SINGA, which has an intuitive programming model based on the layer abstraction common to deep learning models. Good scalability is achieved through a flexible distributed training architecture and specific optimization techniques. SINGA runs on GPUs as well as on CPUs, and we show that it outperforms many other state-of-the-art deep learning systems. Our experience developing and training deep learning models for real-life multimedia applications in SINGA shows that the platform is both usable and scalable.
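The following generic sketch illustrates what a layer-abstraction programming model looks like; it is not SINGA's actual API, just a minimal NumPy rendering of the idea that every layer exposes the same forward/backward interface and a model is an ordered list of layers.

```python
# Generic illustration of a layer abstraction (NOT SINGA's real API).
import numpy as np

class Layer:
    """Every layer implements the same forward()/backward() contract."""
    def forward(self, x): raise NotImplementedError
    def backward(self, grad): raise NotImplementedError

class Dense(Layer):
    def __init__(self, n_in, n_out, lr=0.01):
        self.W = np.random.randn(n_in, n_out) * 0.01
        self.lr = lr
    def forward(self, x):
        self.x = x                             # cache input for backward
        return x @ self.W
    def backward(self, grad):
        grad_in = grad @ self.W.T              # gradient w.r.t. the input
        self.W -= self.lr * (self.x.T @ grad)  # in-place SGD update
        return grad_in

def train_step(layers, x, grad_loss):
    """One pass over a model; grad_loss is dLoss/dOutput from the caller."""
    for layer in layers:
        x = layer.forward(x)
    for layer in reversed(layers):
        grad_loss = layer.backward(grad_loss)
```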
The rising demand for mobile video poses an increasing challenge for cellular networks. Even though they can be ubiquitously accessed by mobile devices, the achieved throughputs are rather limited and highly variable. A solution currently deployed to cope with varying network conditions is adaptive video streaming. Adaptive streaming approaches choose the appropriate video quality representation according to the available network resources and can dynamically adjust the quality during the streaming session. Even when the necessary throughput is available, mobile users are interested in limiting the generated data traffic, as most cellular contracts have data caps. Usually, once the cap is reached, the bandwidth is throttled to a speed that does not allow video streaming, or additional payments are required. Existing systems react to varying network conditions but often neglect content-specific adaptation needs. Content inspection can save bandwidth in cases where a higher-quality representation would not increase the perceived quality. In this work, we present a video adaptation service (VAS) that supports content-aware video adaptation on mobile devices. Based on the video content, the adaptation process is tuned to both the available network resources and the user's perception. By leveraging the content properties of a video stream, the system is able to maintain a stable video quality while reducing the generated data traffic. The system is evaluated with different adaptation schemes, showing that content-specific adaptation can both increase the perceived quality and reduce the data traffic. Additionally, we integrate VAS into the current Dynamic Adaptive Streaming over HTTP (DASH) standard and show how such a system can be deployed.
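As a hedged sketch of the core decision, the function below picks a DASH representation from the affordable set but caps it when the segment's content complexity suggests a higher bitrate would not be perceived; the complexity metric, safety margin, and threshold are all illustrative assumptions.

```python
# Content-aware representation selection (illustrative, not the VAS
# implementation). `complexity` in [0, 1] stands in for a content metric
# such as motion intensity; thresholds are assumptions.
def pick_representation(bitrates, throughput, complexity, cap=0.7):
    affordable = [b for b in bitrates if b <= throughput * 0.8]  # margin
    if not affordable:
        return min(bitrates)             # fall back to the lowest quality
    choice = max(affordable)
    if complexity < cap:
        # Low-complexity content: a lower bitrate yields the same
        # perceived quality, so cap the choice and save data traffic.
        lower = [b for b in affordable if b <= complexity * max(bitrates)]
        if lower:
            choice = max(lower)
    return choice
```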
It has been observed in the recent literature that drift error due to watermarking degrades the visual quality of the embedded video. Existing drift-error handling strategies for standards such as H.264 may not be directly applicable to newer HD video standards (such as HEVC) because of their different compression architectures. In this paper, a compressed-domain watermarking scheme is proposed for H.265/HEVC bit streams that can handle drift-error propagation in both the intra- and inter-prediction processes. Additionally, the proposed scheme shows adequate robustness against re-compression attacks as well as common image processing attacks while maintaining decent visual quality. A comprehensive set of experiments has been carried out to demonstrate the efficacy of the proposed scheme over the existing literature.
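To convey the drift-compensation principle only (not this paper's scheme), the toy sketch below embeds a bit in a quantized coefficient's parity and records the perturbation so it can be subtracted from dependent predictions; indexing and the parity rule are assumptions.

```python
# Toy illustration of drift compensation in compressed-domain embedding.
# The parity-based embedding rule and block indexing are assumptions.
import numpy as np

def embed_bit(coeffs, bit, idx=(0, 1)):
    """Force the parity of one quantized coefficient to carry one bit."""
    old = int(coeffs[idx])
    new = old if (abs(old) % 2) == bit else old + (1 if old >= 0 else -1)
    coeffs[idx] = new
    return coeffs, new - old    # the perturbation that would cause drift

def compensate_prediction(pred_block, drift):
    """Subtract the recorded embedding error from the predicted block so
    the perturbation does not propagate via intra/inter prediction."""
    return pred_block - drift
```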
When running multi-player online games on IP networks with losses and delays, the order of actions may differ from the order that would occur on an ideal network with no delays or losses. To maintain a proper ordering of events, traditional approaches either use rollbacks to undo certain actions or use local lags to introduce additional delays. Both may be perceived by players when their changes exceed the just-noticeable-difference (JND) threshold. In this paper we propose a novel method for ensuring a strongly consistent completion order of actions, where strong consistency refers to both the same completion order and the same interval between any completion time and the corresponding ideal reference completion time under no network delay. We find that small adjustments within the JND to the duration of an action are not perceivable, as long as the duration is comparable to the network round-trip time (RTT). We utilize this property to control the vector of action durations and formulate the search for this vector as a multi-dimensional optimization problem. Using the property that players are generally more sensitive to the most prominent delay effect (the one with the highest probability of noticeability, Pnotice, i.e., the probability of correctly noticing a change when compared to the reference), we prove that the optimal solution occurs when the Pnotice values of the individual adjustments are equal. As this search can be done efficiently in polynomial time (<5 ms) with a small amount of space (<160 KB), it can be performed at run time to determine the optimal control. Lastly, we evaluate our approach on the popular open-source online shooting game BZFlag.
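A sketch of how such an equal-Pnotice solution can be searched efficiently: if each action's Pnotice grows monotonically with its duration adjustment, one can bisect on a common Pnotice level until the implied adjustments sum to the required total. The linear Pnotice model below is a placeholder assumption, not the paper's perceptual model.

```python
# Bisection for equal-Pnotice duration adjustments. The Pnotice model
# (adjustment = p * jnd * duration) is an illustrative assumption.
def equal_pnotice_adjustments(actions, total_adjust, tol=1e-6):
    """actions: list of (duration, jnd) pairs.
    Returns per-action adjustments with equal Pnotice that sum to
    total_adjust (up to tol)."""
    def adjust_for(p, dur, jnd):
        return p * jnd * dur          # monotone in p by construction

    lo, hi = 0.0, 1.0                 # Pnotice is a probability
    while hi - lo > tol:
        p = (lo + hi) / 2
        total = sum(adjust_for(p, d, j) for d, j in actions)
        if total < total_adjust:
            lo = p                    # need larger adjustments
        else:
            hi = p
    return [adjust_for(lo, d, j) for d, j in actions]
```

Because the total adjustment is monotone in the common Pnotice level, the bisection converges in a logarithmic number of passes over the action vector, consistent with a sub-millisecond run-time search.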
Introduction to Special Issue MMSys/NOSSDAV 2015
Both voice conversion and hidden Markov model (HMM)-based speech synthesis can be used to produce artificial voices of a target speaker, and both pose a serious threat to speaker verification (SV) systems. To enhance the security of SV systems, techniques to detect converted or synthesized speech must be taken into consideration. During voice conversion and HMM-based synthesis, speech reconstruction is applied to transform a set of acoustic parameters into reconstructed speech. Hence, identifying reconstructed speech can serve to distinguish converted/synthesized speech from human speech. Several related works on such identification have been reported, achieving equal error rates (EERs) below 5% for detecting reconstructed speech. However, through cross-database evaluations on different speech databases, we find that the EERs of several test cases exceed 10%; the robustness of detection algorithms across speech databases needs to be improved. In this paper, we propose an algorithm to identify reconstructed speech. Three different speech databases and two different reconstruction methods are considered in our work, a setting that has not been addressed in previous studies. A high-dimensional data visualization approach is used to analyze the effect of speech reconstruction on the Mel-frequency cepstral coefficients (MFCCs) of speech signals. Gaussian mixture model (GMM) supervectors of MFCCs are used as acoustic features. Furthermore, a set of commonly used classification algorithms is applied to identify reconstructed speech. Based on a comparison among different classification methods, linear discriminant analysis (LDA)-ensemble classifiers are chosen for our algorithm. Extensive experimental results show that the proposed algorithm achieves EERs below 1% in most cases, outperforming the reported state-of-the-art identification techniques.
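A minimal sketch of this kind of pipeline, under simplifying assumptions: per-utterance GMMs are fitted directly (real supervector systems usually MAP-adapt a universal background model), and bagged LDA stands in for the paper's LDA-ensemble classifier.

```python
# MFCC frames -> GMM supervector -> bagged-LDA detector (illustrative).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

def gmm_supervector(mfcc_frames, n_mix=8):
    """Stack GMM component means into one fixed-length vector.
    (Simplification: fits per utterance instead of adapting a UBM.)"""
    gmm = GaussianMixture(n_components=n_mix).fit(mfcc_frames)
    return gmm.means_.ravel()

def train_detector(utterances, labels):
    """utterances: list of (frames x dims) MFCC arrays; labels: 0=human,
    1=reconstructed. Returns a fitted ensemble classifier."""
    X = np.vstack([gmm_supervector(u) for u in utterances])
    clf = BaggingClassifier(LinearDiscriminantAnalysis(), n_estimators=25)
    return clf.fit(X, labels)
```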
Backlight scaling is a technique that reduces display panel power consumption by strategically dimming the backlight. However, for mobile video applications, a computationally intensive luminance compensation step must be performed alongside backlight scaling to maintain the perceived appearance of video frames. This step, if done by the CPU, could easily offset the power savings from backlight dimming. Furthermore, computing the backlight scaling values requires per-frame luminance information, which is typically too energy intensive to compute on mobile devices. In this paper, we propose Content-Adaptive Display (CAD) for two typical Internet mobile video applications: video streaming and real-time video communication. CAD uses the mobile device's GPU rather than the CPU to perform luminance compensation at reduced power consumption. For video streaming, where video frames are available in advance, we compute the backlight scaling schedule using a dynamic programming algorithm that is more efficient than existing work. For real-time video communication, where video frames are generated on the fly, we propose a greedy algorithm to determine the backlight scaling at runtime. We implement CAD in one video streaming app and one real-time video call app on the Android platform and use a Monsoon power meter to measure the actual power consumption. Experimental results show that CAD effectively produces power savings.
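To illustrate what a greedy runtime policy can look like (not CAD's actual algorithm), the sketch below dims the backlight toward each frame's peak luminance while bounding the inter-frame change to avoid visible flicker; the floor and step limit are assumptions.

```python
# Greedy backlight scaling for real-time video (illustrative). Levels and
# peak luminance are normalized to [0, 1]; max_step and the 0.1 floor are
# assumed parameters, not CAD's.
def greedy_backlight(frame_peaks, max_step=0.05):
    levels, prev = [], 1.0
    for peak in frame_peaks:
        target = max(peak, 0.1)   # dim no further than compensation allows
        # Clamp the change per frame so brightness shifts stay unnoticed.
        level = min(max(target, prev - max_step), prev + max_step)
        levels.append(level)
        prev = level
    return levels
```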
High Efficiency Video Coding (HEVC/H.265) is the latest and most efficient video compression standard, a successor to H.264/AVC (Advanced Video Coding) that delivers perceptual quality equivalent to H.264/AVC with up to 50% bitrate savings. Video watermarking in the compressed domain has gained much attention in recent years as a promising solution for copyright protection, since fully decoding and re-encoding the video stream is not required for either embedding or extracting watermark bits. We propose a robust watermarking framework with a blind extraction process for HEVC-encoded video. A readable watermark sequence is embedded invisibly in P-frames for better perceptual quality. Our watermarking framework achieves security and robustness by selecting appropriate blocks using a pseudo-random key and the spatio-temporal characteristics of the compressed video. We analyze the strengths of different compressed-domain features for implementing our watermarking framework and demonstrate its utility with experimental results. The results show that the proposed work effectively limits the increase in video bit rate and the degradation in perceptual quality, and the framework is robust against various image and video processing attacks.
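The key-based block selection can be sketched as follows: a pseudo-random generator seeded with the secret key picks which candidate blocks carry watermark bits, so blind extraction only needs the same key. The texture criterion and field names are illustrative assumptions.

```python
# Key-driven block selection for watermark embedding (illustrative).
# `block_features` entries and the texture threshold are assumptions.
import random

def select_blocks(block_features, key, n_bits, texture_min=0.2):
    # Keep textured blocks, where embedding is less visible and more
    # robust to re-compression (assumed selection criterion).
    candidates = [i for i, f in enumerate(block_features)
                  if f["texture"] >= texture_min]
    rng = random.Random(key)      # same key => same selection at extraction
    return rng.sample(candidates, n_bits)  # assumes enough candidates
```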
Solfège is a common technique in the music learning process that involves the vocal performance of melodies, respecting the timing and duration of musical sounds as specified in the music score, in coordination with meter-mimicking hand movements. This paper presents an audiovisual approach for the automatic assessment of this important musical study practice. The proposed system combines the meter-mimicking gesture (video information) with the melodic transcription (audio information), where the hand movement works as a metronome controlling the time flow (tempo) of the musical piece. Thus, the meter-mimicking is used to align the music score (ground truth) with the sung melody, allowing assessment even in scenarios with dynamic tempo. Audio analysis is applied to obtain the melodic transcription of the sung notes, and the solfège performances are evaluated by a set of Bayesian classifiers generated from real evaluations by expert listeners.
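A minimal sketch of the gesture-driven alignment idea: beat times detected from the hand movement map score positions (in beats) to wall-clock time, so each sung note can be compared with its expected note even under tempo changes. Linear interpolation between beats and MIDI-number pitch comparison are assumptions.

```python
# Gesture-driven score alignment and pitch comparison (illustrative).
import numpy as np

def score_to_time(note_beats, gesture_beat_times):
    """Map note onsets (in beats) to seconds using the beat timestamps
    detected from the meter-mimicking hand movement."""
    beats = np.arange(len(gesture_beat_times))
    return np.interp(note_beats, beats, gesture_beat_times)

def pitch_errors(expected_midi, sung_midi):
    """Per-note absolute pitch deviation in semitones; such deviations
    could feed the Bayesian assessment stage."""
    return [abs(e - s) for e, s in zip(expected_midi, sung_midi)]
```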