ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Latest Articles

Efficient QoE-Aware Scheme for Video Quality Switching Operations in Dynamic Adaptive Streaming

Dynamic Adaptive Streaming over HTTP (DASH) is a popular over-the-top video content distribution... (more)

HTTP/2-based Frame Discarding for Low-Latency Adaptive Video Streaming

In this article, we propose video delivery schemes ensuring around 1s delivery latency with Dynamic Adaptive Streaming over HTTP (DASH), which is a... (more)

Symmetrical Residual Connections for Single Image Super-Resolution

Single-image super-resolution (SISR) methods based on convolutional neural networks (CNN) have shown great potential in the literature. However, most... (more)

Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint... (more)

Expression Robust 3D Facial Landmarking via Progressive Coarse-to-Fine Tuning

Facial landmarking is a fundamental task in automatic machine-based face analysis. The majority of existing techniques for such a problem are based on... (more)

CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown their strong ability to... (more)

Reconstructing 3D Face Models by Incremental Aggregation and Refinement of Depth Frames

Face recognition from two-dimensional (2D) still images and videos is quite successful even with “in the wild” conditions. Instead,... (more)

Orchestrating Caching, Transcoding and Request Routing for Adaptive Video Streaming Over ICN

Information-centric networking (ICN) has been touted as a revolutionary solution for the future of the Internet, which will be dominated by video... (more)

Discovering Latent Topics by Gaussian Latent Dirichlet Allocation and Spectral Clustering

Today, diversifying the retrieval results of a certain query will improve customers’ search efficiency. Showing the multiple aspects of... (more)

Image Captioning With Visual-Semantic Double Attention

In this article, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. In our approach, VSDA consists of two parts: a... (more)

Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Performing direct matching among different modalities (like image and text) can benefit many tasks in computer vision, multimedia, information... (more)


[December 2018]


Special issue call: "Multimodal Machine Learning for Human Behavior Analysis". Call for papers. Submission deadline April 15th, 2019

Special issue call: "Computational Intelligence for Biomedical Data and Imaging". Call for papers. Submission deadline May 30th, 2019

Special issue call: "Smart Communications and Networking for Future Video Surveillance". Call for papers. Submission deadline June 30th, 2019

Special issue call: "Trusted Multimedia for Smart Digital Environments". Call for papers. Submission deadline September 20th, 2019


News archive
Forthcoming Articles

Introduction to the Special Section on the Cross-Media Analysis for Visual Question Answering

Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Visual Question Answering (VQA) aims to provide a natural language answer for a pair of an image or video and a natural language question. Despite recent progress on VQA, existing works primarily focus on image question answering and are suboptimal for video question answering. This article presents a novel Spatiotemporal-Textual Co-Attention Network (STCA-Net) for video question answering. The STCA-Net jointly learns spatial and temporal visual attention on videos as well as textual attention on questions. It concentrates on the essential cues in both visual and textual spaces for answering questions, leading to effective question-video representation. In particular, a question-guided attention network is designed to learn question-aware video representation with a spatial-temporal attention module. It concentrates the network on regions of interest within the frames of interest across the entire video. A video-guided attention network is proposed to learn video-aware question representation with a textual attention module, leading to fine-grained understanding of the question. The learned video and question representations are used by an answer predictor to generate accurate answers. Extensive experiments on two challenging datasets of video question answering, i.e., MSVD-QA and MSRVTT-QA, have shown the effectiveness of the proposed approach.

Advanced Stereo Seam Carving by Considering Occlusions on Both Sides

Stereo image retargeting plays a significant role in the field of image processing. It aims at keeping major objects as prominent as possible when the resolution of an image is changed, while maintaining disparity and depth information. Many researchers in relevant fields have proposed seam carving methods that generally preserve the geometric consistency of the images. However, they did not take into account the regions of occlusion on both sides. We propose a solution to this problem using a new seam-finding strategy that considers occluded and occluding regions in both input images and leaves the geometric consistency of both images intact. We also introduce line segment detection and superpixel segmentation to further improve the quality of the images. Imaging effects are optimized in the process, and visual comfort, which is also influenced by other factors, can be improved as well.

Eigenvector-Based Distance Metric Learning for Image Classification and Retrieval

Distance metric learning has been widely studied in multifarious research fields. The mainstream approaches learn a Mahalanobis metric or learn a linear transformation. Recent related works propose learning a linear combination of base vectors to approximate the metric. In this way, fewer variables need to be determined, which is efficient when facing high-dimensional data. Nevertheless, such works obtain base vectors using additional data from related domains or randomly generate base vectors. However, obtaining base vectors from related domains requires extra time and additional data, and random vectors introduce randomness into the learning process, which requires sufficient random vectors to ensure the stability of the algorithm. Moreover, the random vectors cannot capture the rich information of the training data, leading to a degradation in performance. Considering these drawbacks, we propose a novel distance metric learning approach by introducing base vectors explicitly learned from training data. Given a specific task, we can make a sparse approximation of its objective function using the top eigenvalues and corresponding eigenvectors of a predefined integral operator on the reproducing kernel Hilbert space. Because the process of generating eigenvectors simply refers to the training data of the considered task, our proposed method does not require additional data and can reflect the intrinsic information of the input features. Furthermore, the explicitly learned eigenvectors do not result in randomness, and we can extend our method to any kernel space without changing the objective function. We only need to learn the coefficients of these eigenvectors, and the only hyperparameter that we need to determine is the number of eigenvectors that we utilize. Additionally, an optimization algorithm is proposed to efficiently solve this problem. Extensive experiments conducted on several datasets demonstrate the effectiveness of our proposed method.
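
As a rough illustration of the base-vector idea mentioned in this abstract: a metric built as a weighted sum of rank-one terms, M = sum_i w_i b_i b_i^T, gives the squared distance sum_i w_i (b_i . (x - y))^2. The sketch below only evaluates such a distance for given base vectors and weights. The function name and the example vectors are hypothetical, and the eigenvector construction and coefficient learning described in the article are not reproduced here.

```python
from typing import Sequence


def base_vector_distance_sq(
    x: Sequence[float],
    y: Sequence[float],
    bases: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Squared distance under M = sum_i w_i * b_i b_i^T (hypothetical helper).

    Only the weights would be learned; the base vectors are taken as given.
    """
    diff = [a - b for a, b in zip(x, y)]
    total = 0.0
    for base, w in zip(bases, weights):
        # Project the difference vector onto this base vector.
        proj = sum(bk * dk for bk, dk in zip(base, diff))
        total += w * proj * proj
    return total


if __name__ == "__main__":
    # Illustrative values only: the standard basis with unit weights
    # reduces to the squared Euclidean distance.
    bases = [[1.0, 0.0], [0.0, 1.0]]
    print(base_vector_distance_sq([3.0, 4.0], [0.0, 0.0], bases, [1.0, 1.0]))
```

With k base vectors, only k weights need to be determined rather than a full d × d matrix, which is the saving the abstract points to for high-dimensional data.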

Spatial Structure Preserving Feature Pyramid Network for Semantic Image Segmentation

Recently, progress on semantic image segmentation has been substantial, benefiting from the rapid development of Convolutional Neural Networks (CNNs). Semantic image segmentation approaches proposed lately have mostly been based on Fully Convolutional Networks (FCNs). However, these FCN-based methods use large receptive fields and too many pooling layers to depict the discriminative semantic information of the images. These operations often cause low spatial resolution inside deep layers, which leads to spatially fragmented prediction. To address this problem, we exploit the inherent multi-scale and pyramidal hierarchy of deep convolutional networks to extract feature maps with different resolutions, and take full advantage of these feature maps by fusing them in a gradually stacked manner. Specifically, for two adjacent convolutional layers, we upsample the features from the deeper layer with a stride of 2 and then stack them on the features from the shallower layer. A convolutional layer with 1 × 1 kernels then fuses these stacked features. The fused feature retains the spatial structure information of the image while having strong discriminative capability for pixel classification. Additionally, to further preserve the spatial structure information and regional connectivity of the predicted category label map, we propose a novel loss term for the network. In detail, two graph-model-based spatial affinity matrices are proposed, which depict the pixel-level relationships in the input image and the predicted category label map, respectively; their cosine distance is then backpropagated to the network. The proposed architecture, called spatial structure preserving feature pyramid network (SSPFPN), significantly improves the spatial resolution of the predicted category label map for semantic image segmentation.
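
The fusion step described in this abstract (upsample the deeper features by 2, stack them on the shallower features, fuse with a 1 × 1 convolution) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: nearest-neighbour upsampling is an assumed choice, the weights are made-up constants rather than learned parameters, and all function names are hypothetical.

```python
from typing import List

Channel = List[List[float]]  # H x W
FeatureMap = List[Channel]   # C x H x W


def upsample2x(fmap: FeatureMap) -> FeatureMap:
    """Nearest-neighbour 2x upsampling (an assumed choice of upsampler)."""
    out: FeatureMap = []
    for channel in fmap:
        rows: Channel = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))  # repeat each row
        out.append(rows)
    return out


def conv1x1(fmap: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """1x1 convolution: each output channel is a per-pixel weighted sum
    of the input channels. weights has shape C_out x C_in."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(wc * fmap[c][i][j] for c, wc in enumerate(w_out)) for j in range(width)]
            for i in range(height)
        ]
        for w_out in weights
    ]


def fuse(deep: FeatureMap, shallow: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """Upsample the deeper map, stack it channel-wise on the shallower map,
    then fuse the stack with a 1x1 convolution."""
    stacked = upsample2x(deep) + shallow
    return conv1x1(stacked, weights)


if __name__ == "__main__":
    deep = [[[1.0]]]                        # 1 channel, 1 x 1
    shallow = [[[1.0, 2.0], [3.0, 4.0]]]    # 1 channel, 2 x 2
    print(fuse(deep, shallow, [[0.5, 0.5]]))
```

The output keeps the 2 × 2 resolution of the shallower layer, which is how the fused feature retains spatial structure while mixing in the deeper, more semantic channel.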

Deep Scalable Supervised Quantization by Self-Organizing Map

Approximate Nearest Neighbor (ANN) search is an important research topic in the multimedia and computer vision fields. In this paper, we propose a new deep supervised quantization method based on the Self-Organizing Map to address this problem. Our method integrates Convolutional Neural Networks (CNNs) and the Self-Organizing Map (SOM) into a unified deep architecture. The overall training objective optimizes a supervised quantization loss as well as a classification loss. With the supervised quantization objective, we minimize the differences on the maps between similar image pairs and maximize the differences on the maps between dissimilar image pairs. Through optimization, the deep architecture can simultaneously extract deep features and quantize the features into suitable nodes in the self-organizing map. To make the proposed deep supervised quantization method scalable to large datasets, instead of constructing a larger self-organizing map, we propose to divide the input space into several subspaces and construct a self-organizing map in each subspace. The self-organizing maps in all the subspaces implicitly construct a large self-organizing map, which costs less memory and training time than directly constructing a self-organizing map of equal size. Experiments on several public standard datasets demonstrate the superiority of our approach over existing ANN search methods. In addition, as a byproduct, our deep architecture can be directly applied to classification and visualization tasks with little modification, and promising performance is demonstrated on these tasks in the experiments.
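
To make the subspace argument in this abstract concrete, the sketch below quantizes a feature vector by splitting it into equal-length chunks and assigning each chunk to its nearest node in that subspace's map. It assumes the node prototypes are already trained and treats each map as a flat list of prototypes; the CNN, the SOM grid topology, and the supervised losses are omitted. All names and values are hypothetical.

```python
from typing import List, Sequence, Tuple

Codebook = List[List[float]]  # node prototypes of one subspace map


def nearest_node(chunk: Sequence[float], nodes: Codebook) -> int:
    """Index of the prototype closest to the chunk (squared Euclidean distance)."""
    return min(
        range(len(nodes)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(chunk, nodes[k])),
    )


def quantize(vec: Sequence[float], codebooks: List[Codebook]) -> Tuple[int, ...]:
    """Split vec into len(codebooks) equal chunks and quantize each chunk
    against the map of its own subspace."""
    parts = len(codebooks)
    if len(vec) % parts != 0:
        raise ValueError("vector length must be divisible by the number of subspaces")
    size = len(vec) // parts
    return tuple(
        nearest_node(vec[i * size:(i + 1) * size], codebooks[i]) for i in range(parts)
    )


def implicit_map_size(codebooks: List[Codebook]) -> int:
    """Number of distinct codes the per-subspace maps can express jointly."""
    total = 1
    for nodes in codebooks:
        total *= len(nodes)
    return total


if __name__ == "__main__":
    # Illustrative prototypes: 2 nodes in the first subspace, 3 in the second.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    ]
    print(quantize([0.9, 0.8, 1.9, 2.2], codebooks))
    print(implicit_map_size(codebooks))
```

Here 2 + 3 = 5 stored prototypes jointly express 2 × 3 = 6 codes. The gap between stored nodes and expressible codes widens quickly as subspaces are added, which is the memory saving the abstract refers to.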

Learning Discriminative Sentiment Representation from Strongly- and Weakly-Supervised CNNs

Visual sentiment analysis is receiving increasing attention with the rapidly growing number of images uploaded to social websites. Learning rich visual representations often requires training deep Convolutional Neural Networks (CNNs) on massive manually labeled data, which is expensive or scarce, especially for a subjective task like visual sentiment analysis. Meanwhile, a large quantity of social images, though noisy, is readily available by querying social networks using the sentiment categories as keywords, so that various types of images related to a specific sentiment can be easily collected. In this paper, we propose a multiple kernel network (MKN) for visual sentiment recognition, which learns representations from strongly- and weakly-supervised CNNs. Specifically, the weakly-supervised deep model is trained using the large-scale data from social images, while the strongly-supervised deep model is fine-tuned on the affective datasets that are manually labeled. We employ the multiple kernel scheme on multiple layers of these CNNs, which can automatically select the discriminative representation by learning a linear combination of a set of predefined kernels. In addition, we introduce a large-scale dataset collected from popular comics of various countries, e.g., America, Japan, China and France, which consists of 11,821 images with various artistic styles. Experimental results show that MKCNN achieves consistent improvements over the state-of-the-art methods on the public affective datasets as well as the newly established comics dataset.

Image Captioning by Asking Questions

Image captioning and visual question answering are typical tasks that connect computer vision and natural language processing. Both need to effectively represent the visual content using computer vision methods and smoothly process the text using natural language processing techniques. The key problem in both tasks is to infer the target result based on the interactive understanding of the word sequence and the image. Though in practice they use similar algorithms, they have been studied independently over the past few years. In this paper, we attempt to exploit the mutual correlation between these two tasks. We propose the first VQA-improved image captioning method, which transfers the knowledge learned from the VQA corpora to the image captioning task. VQA models are first pretrained on image-question-answer instances. To effectively extract semantic features with the VQA model, we build a large set of free-form open-ended questions. Then the pretrained VQA model is used to extract VQA-grounded semantic representations, which interpret images from a different perspective. We incorporate the VQA model into the image captioning model by fusing the VQA-grounded feature and the attended visual feature. We show that such simple VQA-improved image captioning (VQA-IIC) models significantly outperform the conventional image captioning methods on large-scale public datasets.

Introduction to the Best Papers of the ACM Multimedia Systems (MMSys) Conference 2018 and the ACM Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV) 2018 and the International Workshop on Mixed and Virtual Environment Systems (MMVE) 2018

Rethinking the Combined and Individual Orders of Derivative of States for Differential Recurrent Neural Networks

Due to the special gating schemes of Long Short-Term Memory (LSTM), LSTMs have shown greater potential to process complex sequential information than the traditional Recurrent Neural Network (RNN). The conventional LSTM, however, fails to take into consideration the impact of salient spatio-temporal dynamics present in the sequential input data. This problem was first addressed by the differential Recurrent Neural Network (dRNN), which uses a differential gating scheme known as Derivative of States (DoS). DoS uses higher orders of internal state derivatives to analyze the change in information gain caused by the salient motions between the successive frames. The weighted combination of several orders of DoS is then used to modulate the gates in dRNN. While each individual order of DoS is good at modeling a certain level of salient spatio-temporal sequences, the sum of all the orders of DoS could distort the detected motion patterns. To address this problem, we propose to control the LSTM gates via individual orders of DoS. To fully utilize the different orders of DoS, we further propose to stack multiple levels of LSTM cells in an increasing order of state derivatives. The proposed model progressively builds up the ability of the LSTM gates to detect salient dynamical patterns in deeper stacked layers modeling higher orders of DoS, and thus the proposed LSTM model is termed deep differential Recurrent Neural Network ($d^2$RNN). The effectiveness of the proposed model is demonstrated on two publicly available human activity datasets: NUS-HGA and Violent-Flows. The proposed model outperforms both LSTM and non-LSTM based state-of-the-art algorithms.

Chunk duration-aware SDN-assisted DASH

Although Dynamic Adaptive Streaming over HTTP (DASH) is the pillar of multimedia content delivery mechanisms, its purely client-based adaptive video bitrate mechanisms have quality of experience (QoE) fairness and stability problems in the presence of multiple DASH clients and highly fluctuating background traffic on the same shared bottleneck link. Varying chunk duration among different titles of multiple video providers exacerbates this problem. With the help of the global network view provided by the Software-Defined Networking (SDN) paradigm, we propose a centralized joint optimization module-assisted adaptive video bitrate mechanism that takes the diversity of chunk sizes among different content into account. Our system collects possible video bitrate levels and chunk durations from DASH clients and calculates the optimal video bitrates per client based on the available capacity and the chunk duration of each client's selected content, without invading users' privacy. By continuously following the background traffic flows, it asynchronously updates the target video bitrate levels to avoid both buffer stall events and network under-utilization, rather than relying on bandwidth slicing, which brings about scalability problems in practice. It also guarantees fair start-up delays for video sessions with various chunk durations. Our experiments show that the proposed approach, which considers the diversity of chunk durations and background traffic fluctuations, provides a significantly better and fairer QoE in terms of SSIM-based video quality and start-up delay compared to both purely client-based and state-of-the-art SDN-based adaptive bitrate mechanisms.

Internet of Things Based Trusted Hypertension Management App Using Mobile Technology

An app for hypertension management is developed. The web-roadmap technology is used to develop the app; this technology has five steps, namely planning, analysis, design, implementation, and evaluation. The hypertension management app is tested with patients with hypertension (N=56). Their medication possession ratio is calculated before and after using the hypertension management app for a period of five weeks. Of the 56 participants, 45 showed medication adherence. The medication possession ratio is calculated using the Morisky scale, and there is an improvement in the patients' health after using the hypertension management app (p=.001). The calculated usefulness score is 3.9 out of 5. User satisfaction is measured after using the hypertension app for the different processes: recording of blood pressure scores 4.5, recording of medication ratio 4.0, sending the data 3.4, the alerting process 4.3, and the process of alerting about medication 5. This paper shows that a mobile app for hypertension using clinical practice guidelines is effective in improving the patients' health.

Paillier Cryptosystem based Mean Value Computation for Encrypted Domain Image Processing Operations

Due to its large storage capacity and high-end computing capability, cloud computing has received great attention, as a huge amount of personal multimedia data and computationally expensive tasks can be outsourced to the cloud. However, the cloud, being a semi-trusted third party, is prone to privacy risks. Signal processing in the encrypted domain (SPED) has arisen as a new research paradigm for privacy-preserving processing of outsourced data by a semi-trusted cloud. In this paper, we propose a solution for non-integer mean value computation in the homomorphic encrypted domain without any interactive protocol between the client and the service provider. Using the proposed solution, various image processing operations such as local smoothing filter, un-sharp masking and histogram equalization can be performed in the encrypted domain at the cloud server without any privacy concerns. Our experimental results from standard test images reveal that these image processing operations can be performed without pre-processing, without client-server interactive protocol and without any error between the encrypted domain and the plain domain.
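
For readers unfamiliar with the additive homomorphism this abstract builds on, the toy sketch below shows textbook Paillier encryption with the generator fixed to g = n + 1: multiplying ciphertexts adds the underlying plaintexts, so a server can sum encrypted pixel values without seeing them. This is not the article's method. In particular, the division by the pixel count is done here by the client after decryption, whereas the article proposes computing the non-integer mean in the encrypted domain. The tiny primes are insecure and chosen only for illustration, and all function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

PrivateKey = Tuple[int, int]  # (lambda, mu)


def keygen(p: int, q: int) -> Tuple[int, PrivateKey]:
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; needs Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """c = (n + 1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, key: PrivateKey, c: int) -> int:
    """m = L(c^lambda mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    lam, mu = key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def encrypted_sum(n: int, ciphertexts: List[int]) -> int:
    """Server side: multiply ciphertexts to add the hidden plaintexts."""
    total = encrypt(n, 0)
    for c in ciphertexts:
        total = (total * c) % (n * n)
    return total


if __name__ == "__main__":
    n, key = keygen(61, 53)                # n = 3233; the sum must stay below n
    pixels = [10, 20, 30, 41]              # made-up pixel values
    ciphertexts = [encrypt(n, v) for v in pixels]
    total = decrypt(n, key, encrypted_sum(n, ciphertexts))
    print(total, total / len(pixels))      # client-side division (assumption)
```

Because plain Paillier works over integers modulo n, the fractional part of a mean has no direct ciphertext representation, which is the gap the article's non-integer mean computation addresses.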

Harvesting Visual Objects from Internet Images via Deep Learning Based Objectness Assessment

The collection of internet images has been growing at an astonishing speed. Undoubtedly, these images contain rich visual information that can be useful in many applications, such as visual media creation and data-driven image synthesis. In this paper, we focus on the methodologies for building a visual object database from a collection of internet images. Such a database is built to contain a large number of high-quality visual objects that can help with various data-driven image applications. Our method is based on dense proposal generation and objectness-oriented re-ranking. A novel deep convolutional neural network is designed for the inference of proposal objectness, the probability of a proposal containing an optimally located foreground object. In our work, the objectness is quantitatively measured in terms of completeness and fullness, reflecting two complementary features of an optimal proposal: a complete foreground and a relatively small background. Our experiments indicate that object proposals re-ranked according to the output of our network generally achieve higher performance than those produced by other state-of-the-art methods. As a concrete example, a database of over 1.2 million visual objects has been built using the proposed method, and has been successfully used in various data-driven image applications.

Artificial Intelligence, Artists, and Art: Attitudes Toward Artwork Produced by Humans vs. Artificial Intelligence

This study examines how people perceive artwork created by artificial intelligence (AI) and how knowledge of the artist's identity (Human vs AI) affects individuals' evaluation of art. Drawing on Schema theory and theory of Computers Are Social Actors (CASA), this study used a survey-experiment that controlled for the identity of the artist (AI vs. Human) and presented participants with two types of artworks (AI-created vs. Human-created). After seeing images of six artworks created by either AI or human artists, participants (n=288) were asked to evaluate the artistic value using a validated scale commonly employed among art professionals. The study found that human-created artworks and AI-created artworks were not judged to be equivalent in their artistic value. Additionally, knowing that a piece of art was created by AI did not in general influence participants' evaluation of art pieces' artistic value. However, having a schema that AI cannot make art significantly influenced evaluation. Implications of the findings for application and theory are discussed.

Visual Arts Search on Mobile Devices

Visual arts, especially paintings, appear everywhere in our daily lives. They are liked not only by art lovers but also by ordinary people, who are usually curious about the stories behind these art pieces and interested in exploring related art pieces. Among various methods, mobile visual search has its merit in providing an alternative solution where text and voice searches are not always applicable. Mobile visual search for visual arts is far more challenging than general image visual search. Conventionally, visual search, such as searching for products and plants, focuses on locating images with similar objects. Hence, approaches are designed to locate the objects and extract scale-invariant features from distorted images captured by a mobile camera. However, the objects are only part of a visual artwork; the background and the painting style are also important factors that are not considered in the conventional approaches. An empirical study is conducted on issues of photos taken by a mobile camera, such as orientation variance and motion blur, and how they influence the results of visual arts search. A photo rectification pipeline is designed to rectify the photos into ideal ones for feature extraction. A new method is proposed to learn highly discriminative features for visual arts, which considers both the content information and the style information in visual arts. Apart from conducting solid experiments, a real-world system is built to demonstrate the effectiveness of the proposed methods. To the best of our knowledge, this is the first paper to solve problems for visual arts search on mobile devices.

A Framework for Adaptive Residual Streaming for Single Player Cloud Gaming

Applying cloud technology to 3D interactive multimedia applications is a promising way to provide flexible and cost-efficient online high-bandwidth immersive services to a large population of end users. One main reason cloud systems are popular among users is that they relax the hardware requirements for high-end interactive visual applications. As most of the computational tasks are done on cloud servers, users no longer need to upgrade their hardware as frequently to keep up with the ever-increasing high-end computing requirements of the latest applications. Moreover, cloud systems make it easier for a user to enjoy applications on different platforms, including mobile devices that are usually not powerful enough to run high-end, memory-intensive services. In short, applying cloud technology to high-end immersive applications has advantages in cost efficiency and flexibility both for the end users and the service providers. However, there are two main drawbacks to applying cloud technology to 3D interactive multimedia services: 1) high bandwidth utilization and 2) latency. In this paper, we propose a flexible framework that addresses the two problems by using a combination of collaborative rendering, progressive meshes, and 3D image warping techniques. The experimental results show that the proposed system can reduce the bandwidth usage and improve the visual quality by utilizing local computing power on the client. The results also show that it is possible to reduce the interactive latency by sacrificing the visual quality in our system.

A Pseudo-likelihood Approach For Geo-localization of Events From Crowd-sourced Sensor-Metadata

Events such as live concerts, protest marches, and exhibitions are often video recorded by many people at the same time, typically using smartphone devices. In this work, we address the problem of geo-localizing such events from crowd generated data. Traditional approaches for solving such a problem using multiple video sequences of the event would require highly complex co

6K and 8K Effective Resolution with 4K HEVC Decoding Capability for OMAF-compliant 360° Video Streaming

The recent Omnidirectional MediA Format (OMAF) standard, which specifies delivery of 360° video content, supports only equirectangular (ERP) and cubemap projections and their region-wise packing with a limitation on video decoding capability to the maximum resolution of 4K (e.g., 4096×2048). Streaming of 4K ERP content allows only a limited viewport resolution, which is lower than the resolution of many current head-mounted displays (HMDs). Therefore, in order to take full advantage of high-resolution HMDs, delivery of 360° video content beyond 4K resolution needs to be enabled. In this regard, this work proposes two specific mixed-resolution packing schemes of 6K (e.g., 6144×3072) and 8K (e.g., 8192×4096) ERP content and their realization in tile-based streaming, while complying with the 4K decoding constraint and the High Efficiency Video Coding (HEVC) standard. The proposed packing schemes offer 6K and 8K effective resolution at the viewport (i.e., part of the video viewed by the user). Overall, experimental results indicate that the proposed layouts significantly decrease the streaming bitrate when compared to mixed-quality viewport-adaptive streaming of 4K ERP as an alternative solution. The results further indicate that the 8K-effective packing outperforms the 6K-effective packing, especially in high-quality ranges.

Game of Streaming Players: Is Consensus Viable or an Illusion?

The dramatic growth of HTTP adaptive streaming (HAS) traffic represents a practical challenge for service providers in satisfying the demand from their customers while maintaining a good profit. Achieving this in a network where multiple players share the network capacity has so far proved hard because of the bandwidth competition among the HAS players. This competition is caused by the bandwidth overestimation that results from the isolated and selfish behavior of the HAS players. Each player strives individually to select the maximum bitrate without considering the co-existing players or network resource dynamics. As a result, the HAS players suffer from video quality instability, quality unfairness, and network underutilization or oversubscription, and the players experience a poor quality of experience (QoE). To address this, we propose a fully distributed game theory and consensus-based collaborative adaptive bitrate solution for shared network environments, termed GTAC. Our solution consists of two-stage games that run in parallel during a streaming session. We extensively evaluate GTAC on a broad set of trace-driven and real-world experiments. Results show that GTAC enhances the viewer QoE by 22%, presentation quality stability by up to 24%, fairness by at least 31%, and network utilization by 28% compared to the well-known schemes.

Soul Dancer: Emotion-based Human Action Generation

Body language is one of the most common ways of expressing human emotion. In this paper, we make the first attempt to generate action video with a specific emotion from a single person image. The task of emotion-based action generation (EBAG) can be defined as follows: given a type of emotion and a full-body human image, generate an action video in which the person in the source image expresses the given type of emotion. We divide the task into two parts and propose a two-stage framework to generate action video that expresses emotion. In the first stage, we propose an RNN-based LS-GAN for translating the emotion to a pose sequence. In the second stage, we generate the target video frames from three inputs, namely the source pose and the target pose as the motion information and the source image as the appearance reference, by using a conditional GAN model with an online training strategy. Our framework produces the pose sequence and transforms the action independently, which underlines the fundamental role that the high-level pose feature plays in generating action video with a specific emotion. The proposed method has been evaluated on the "Soul Dancer" dataset, which is built for action emotion analysis and generation. The experimental results demonstrate that our framework can effectively solve the emotion-based action generation task. However, a gap in the details of the appearance between the generated action video and the real-world video still exists, which indicates that the emotion-based action generation task has great research potential.

Color Theme-based Aesthetic Enhancement Algorithm to Emulate the Human Perception of Beauty in Photos

From Theory to Practice: Improving Bitrate Adaptation in the DASH Reference Player

Modern video streaming uses adaptive bitrate (ABR) algorithms that run inside video players and continually adjust the quality (i.e., bitrate) of the video segments that are downloaded and rendered to the user. To maximize the quality of experience (QoE) of the user, ABR algorithms must stream at a high bitrate with low rebuffering and low bitrate oscillations. Further, a good ABR algorithm is responsive to user and network events and can be used in demanding scenarios such as low-latency live streaming. Recent research papers provide an abundance of ABR algorithms, but fall short on many of the above real-world requirements. We develop Sabre, an open-source, publicly available simulation tool that enables fast and accurate simulation of adaptive streaming environments. We empirically validated Sabre to show that it accurately simulates real-world environments. We used Sabre to design and evaluate BOLA-E and DYNAMIC, two novel ABR algorithms. We also developed a FAST SWITCHING algorithm that can replace segments that have already been downloaded with higher-bitrate (thus higher-quality) segments. The new algorithms provide higher QoE to the user in terms of higher bitrate, fewer rebuffers, and fewer bitrate oscillations. In addition, these algorithms react faster to user events such as startup and seek, and respond more quickly to network events such as improvements in throughput. Further, they perform very well for live streams that require low latency, a challenging scenario for ABR algorithms. Overall, our algorithms offer superior video QoE and responsiveness for real-life adaptive video streaming, in comparison to the state of the art. Importantly, all three algorithms presented in this paper are now part of the official DASH reference player dash.js and are being used by video providers in production environments. While our evaluation and implementation are focused on the DASH environment, our algorithms are equally applicable to other adaptive streaming formats such as Apple HLS.

Introduction to the Special Section on Big Data, Machine Learning and AI Technologies for Art and Design

Beauty is in the Eye of the Beholder: Demographically Oriented Analysis of Aesthetics in Photographs

Subtitle Region Selection of S3D Images in Consideration of Visual Discomfort and Viewing Habit

Emotion Recognition with Multi-hypergraph Neural Networks Combining Multimodal Physiological Signals

Emotion recognition from physiological signals is an effective way to discern the inner state of human beings and has therefore been widely adopted in user-centered work, such as driver status monitoring, telemedicine, and other tasks. The majority of present studies on emotion recognition explore the relationship between emotion and physiological signals with the subjects seen as a whole. However, given some features of the natural process of emotional expression, it is an urgent task to characterize latent correlations among multimodal physiological signals and to pay attention to the influence of individual differences in order to exploit associations among individual subjects. To tackle the problem, this paper proposes multi-hypergraph neural networks (MHGNN) to recognize emotions from physiological signals. The method constructs a multi-hypergraph structure, in which each hypergraph is built from one type of physiological signal to formulate correlations among different subjects. Each vertex in a hypergraph stands for one subject with a description of its related stimuli, and the hyperedges represent the connections among the vertices. With the multi-hypergraph structure of the subjects, emotion recognition is transformed into the classification of vertices in the multi-hypergraph structure. Experimental results on the DEAP and ASCERTAIN datasets demonstrate that the proposed method outperforms current state-of-the-art methods. The contrast experiments show that MHGNN is capable of describing the real process of biological response with much higher precision.

Synthesizing facial photometries and corresponding geometries using generative adversarial networks

Artificial data synthesis is currently a well-studied topic with useful applications in data science, computer vision, graphics, and many other fields. Generating realistic data is especially challenging since human perception is highly sensitive to non-realistic appearance. Recent advances in GAN architecture and training procedures have driven the capabilities of synthetic data generation to new heights of realism. These successful models, however, are tuned mostly for use with regularly sampled data such as images, audio, and video. Despite the wide success on these types of media, applying the same tools to geometric data poses a far greater challenge, which is still a hot topic of debate within the academic community. The lack of intrinsic parametrization inherent to geometric objects prohibits the direct use of convolutional filters, a main building block of today's machine learning systems. In this paper, we propose a new method for generating realistic human facial geometries coupled with overlaid textures. We circumvent the parametrization issue by imposing a global mapping from our data to the unit rectangle. This mapping enables the representation of our geometric data as regularly sampled 2D images. We further discuss how to design such a mapping in order to control the mapping distortion and conserve area within the mapped image. By representing geometric textures and geometries as images, we are able to use advanced GAN methodologies to generate new geometries. We address the often neglected topic of the relation between texture and geometry and propose to use this correlation to match generated textures with their corresponding geometries. In addition, we widen the scope of our discussion and offer a new method for training GAN models on partially corrupted data. Finally, we provide empirical evidence to support our claim that our generative model is able to produce examples of new people who do not exist within the training data while maintaining high realism and texture detail, two traits that are often at odds.

Efficient Face Alignment with Fast Normalization and Contour Fitting Loss

Face alignment is a key component of numerous face analysis tasks. In recent years, most existing methods have focused on designing high-performance face alignment systems and paid less attention to efficiency. However, more and more face alignment systems are deployed on low-cost devices, such as mobile phones. In this paper, we design an efficient, lightweight CNN-based regression framework with a novel contour fitting loss, achieving performance competitive with other state-of-the-art methods. We discover that the maximum error occurs on the face contour, where landmarks do not have distinct semantic positions and are thus randomly labeled along the face contours in the training data. To address this problem, we reshape the common L2 loss to dynamically adjust the regression targets during network training, so that the network can learn more accurate semantic meanings of the contour landmarks and achieve better localization performance. Meanwhile, we systematically analyze the effects of pose variations in the face alignment task and design an efficient framework with a Fast Normalization Module (FNM) and a Lightweight Alignment Module (LAM), which quickly normalizes the in-plane rotation and efficiently localizes the landmarks. Our method achieves performance competitive with the state of the art on the 300W benchmark, and its speed is significantly faster than that of other CNN-based approaches.

A Unified Tensor-based Active Appearance Model

Appearance variations result in many difficulties in face image analysis. To deal with this challenge, we present a Unified Tensor-based Active Appearance Model (UT-AAM) for jointly modelling the geometry and texture information of 2D faces. For each type of face information, namely shape and texture, we construct a unified tensor model capturing all relevant appearance variations. This contrasts with the variation-specific models of the classical tensor AAM. To achieve the unification across pose variations, a strategy for dealing with self-occluded faces is proposed to obtain consistent shape and texture representations of pose-varied faces. In addition, our UT-AAM is capable of constructing the model from an incomplete training dataset, using tensor completion methods. Last, we use an effective cascaded-regression-based method for UT-AAM fitting. With these advancements, the utility of UT-AAM in practice is considerably enhanced. As an example, we demonstrate the improvements in training facial landmark detectors through the use of UT-AAM to synthesise a large number of virtual samples. Experimental results obtained using the Multi-PIE and 300-W face datasets demonstrate the merits of the proposed approach.

Statistical Early Termination and Early Skip Models for Fast Mode Decision in HEVC INTRA Coding

In this paper, statistical Early Termination (ET) and Early Skip (ES) models are proposed for fast Coding Unit (CU) and prediction mode decision in HEVC INTRA coding, in which three categories of ET and ES sub-algorithms are included. Firstly, the CU ranges of the current CU are recursively predicted based on the texture and CU depth of the spatially neighboring CUs. Secondly, statistical-model-based ET and ES schemes are proposed and applied to joint CU and intra prediction mode decision, in which the coding complexities over different decision layers are jointly minimized subject to acceptable rate-distortion degradation. Thirdly, the mode correlations among the intra prediction modes are exploited to terminate the full rate-distortion optimization early in each CU decision layer. Extensive experiments are performed to evaluate the coding performance of each sub-algorithm and the overall algorithm. Experimental results reveal that the overall proposed algorithm can achieve a complexity reduction of 45.47% to 74.77%, and 58.09% on average, while the overall Bjøntegaard delta bit rate increase and Bjøntegaard delta peak signal-to-noise ratio degradation are 2.29% and -0.11 dB, respectively, which are negligible.

Smart Diagnosis: A Multiple-Source Transfer TSK Fuzzy System for EEG Seizure Identification

In order to effectively identify Electroencephalogram (EEG) signals in multiple source domains, a transductive multiple-source transfer learning method called MS-TL-TSK is proposed, which combines multiple-source transfer learning and manifold regularization (MR) learning mechanisms in a Takagi-Sugeno-Kang (TSK) fuzzy system. Specifically, the advantages of MS-TL-TSK include: (1) by evaluating the significance of each source domain, a flexible domain weighting index is presented; (2) using the theory of sample transfer learning, a re-weighting strategy is presented to weigh the prediction of unknown samples in the target domain and the output of the source prediction functions; (3) by taking into account the MR term, the manifold structure of the target domain is effectively maintained in the proposed system; (4) by inheriting the interpretability of the TSK fuzzy system (TSK-FS), MS-TL-TSK has good interpretability that is understandable by human beings (domain experts) for identifying EEG signals. The effectiveness of the proposed fuzzy system is demonstrated on several EEG multiple-source transfer learning problems.

Learning Click-based Deep Structure-Preserving Embeddings with Visual Attention

One fundamental problem in image search is to learn the ranking function, i.e., the similarity between query and image. Recent progress on this topic has evolved through two paradigms: text-based models and image ranker learning. The former relies on the text surrounding an image, making the similarity sensitive to the quality of textual descriptions. The latter may suffer from a robustness problem when human-labeled query-image pairs cannot represent user search intent precisely. We demonstrate in this paper that these two limitations can be well mitigated by learning a cross-view embedding that leverages click data. Specifically, a novel click-based Deep Structure-Preserving Embeddings with visual Attention (DSPEA) model is presented, which consists of two components: a deep convolutional neural network (CNN) followed by image embedding layers for learning the visual embedding, and a deep neural network for generating the query semantic embedding. Meanwhile, visual attention is incorporated at the top of the CNN to reflect the regions of the image that are relevant to the query. Furthermore, considering the high dimension of the query space, a new click-based representation on the query set is proposed for alleviating the resulting sparsity problem. The whole network is trained end to end by optimizing a large-margin objective that combines cross-view ranking constraints with an in-view neighborhood structure preservation constraint. On a large-scale click-based image dataset with 11.7 million queries and 1 million images, our model is shown to be powerful for keyword-based image search, with superior performance over several state-of-the-art methods, and achieves the best NDCG@25 reported to date, 52.21%.
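For readers unfamiliar with the NDCG@25 metric quoted above, the following is a minimal sketch of how NDCG@k is commonly computed. It assumes graded relevance labels, the exponential gain 2^rel - 1, and a log2 rank discount; the abstract does not say which NDCG variant the authors used, and the function names are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results.

    Assumes the exponential gain (2**rel - 1) and a log2 rank discount,
    which is one common convention, not necessarily the paper's.
    """
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in ranked order for one hypothetical query.
best_first = ndcg_at_k([2, 1, 0], k=25)
worst_first = ndcg_at_k([0, 1, 2], k=25)
```

A ranking that already places the most relevant images first scores 1.0, and any other ordering of the same labels scores lower, so the metric rewards putting relevant results near the top of the list.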

Characterizing Subtle Facial Movements via Riemannian Manifold

Facial movements play a crucial role in how human beings communicate and express emotions, since they not only transmit communication contents but also contribute to ongoing processes of emotion-relevant information. Characterizing subtle facial movements from videos is one of the most intensively studied topics in computer vision research. It is, however, challenging because 1) the intensity of subtle facial muscle movement is usually low; 2) the duration may be transient; and 3) datasets containing spontaneous subtle movements with reliable annotations are difficult to obtain and often small. This paper addresses these problems in characterizing subtle facial movements from the aspects of both motion elucidation and description. Firstly, we propose an efficient method for elucidating hidden and repressed movements to make them easier to notice. We explore the feasibility of linearizing motion magnification and temporal interpolation, which has been obscured by the implementation of existing methods. We then propose a consolidated framework, termed MOTEL, to expand temporal duration and amplify subtle facial movements simultaneously. Secondly, we make our contribution to dynamic description. One major challenge is how to capture the intrinsic temporal variations caused by movements and omit extrinsic ones caused by different individuals and various environments. To diminish the influence of such diversity, we propose to characterize the dynamics of short-term movements via the differences between points on the tangent spaces to the manifolds, rather than the points themselves. We then significantly relax the trajectory-smoothness assumption of the conventional manifold-based trajectory modeling method and model longer-term dynamics using a statistical observation model within sequential inference approaches. Finally, we combine the tangent delta descriptor with the sequential inference approaches and present a hierarchical representation architecture to cover the period over which the facial movements occur. The proposed motion elucidation and description approach is validated by a series of experiments on publicly available datasets in the example tasks of micro-expression recognition and visual speech recognition.

Video Question Answering via Knowledge-Based Progressive Spatial-Temporal Attention Network

Visual Question Answering (VQA) is a challenging task that has gained increasing attention from both the computer vision and the natural language processing communities in recent years. Given a question in natural language, a VQA system is designed to automatically generate the answer according to the referenced visual content. Although the topic has recently become popular, existing work on visual question answering mainly focuses on a single static image, which is only a small part of the dynamic and sequential visual data in the real world. As a natural extension, video question answering (VideoQA) is less explored. Because of the inherent temporal structure of video, ImageQA approaches may not be effective when applied to video question answering. In this paper, we not only take the spatial and temporal dimensions of video content into account, but also employ an external knowledge base to improve the answering ability of the network. More specifically, we propose a knowledge-based progressive spatial-temporal attention network (K-PSTANet) to tackle this problem. We obtain both object and region features of the video frames from a region proposal network. The knowledge representation is generated by a word-level attention mechanism using the comment information of each object, which is extracted from DBpedia. Then, we develop the question-knowledge guided progressive spatial-temporal attention network to learn the joint video representation for the video question answering task. We construct a large-scale video question answering dataset. Extensive experiments validate the effectiveness of our method.

DenseNet-201 based deep neural network with composite learning factor and precomputation for multiple sclerosis classification

(Aim) Multiple sclerosis is a neurological condition that may cause neurologic disability among. To identify multiple sclerosis more accurately, this paper proposes a new transfer-learning-based approach. (Method) DenseNet-121, DenseNet-169, and DenseNet-201 neural networks were compared. In addition, we propose a composite learning factor (CLF) that assigns different learning factors to three types of layers: early frozen layers, middle layers, and late newly replaced layers. How to allocate layers to those three types remains a problem. Hence, four transfer learning settings (viz., Settings A, B, C, and D) were tested and compared. A precomputation method was used to reduce the storage burden and accelerate the program. (Results) We observed that DenseNet-201-D achieves the best performance. The sensitivity, specificity, and accuracy of DenseNet-201-D were 98.27 ± 0.58, 98.35 ± 0.69, and 98.31 ± 0.53, respectively. (Conclusion) Our method gives better performance than state-of-the-art approaches. Furthermore, the composite learning factor gives results superior to the traditional simple learning factor (SLF) strategy.
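The composite learning factor described above can be illustrated with a minimal sketch that maps each layer index to a learning rate according to which of the three groups it falls into. The function name, the split points, the base rate, and the multiplier for new layers are all hypothetical, and treating frozen layers as having a rate of zero is an assumption; the abstract does not give the actual values or the four settings.

```python
def composite_learning_rates(num_layers, first_middle, first_new,
                             base_lr=1e-4, new_layer_factor=10.0):
    """Return one learning rate per layer index.

    Layers [0, first_middle) are frozen (rate 0.0), layers
    [first_middle, first_new) are fine-tuned at base_lr, and layers
    [first_new, num_layers) are newly replaced layers trained at
    base_lr * new_layer_factor. All numeric values are illustrative.
    """
    if not 0 <= first_middle <= first_new <= num_layers:
        raise ValueError("expected 0 <= first_middle <= first_new <= num_layers")
    rates = []
    for layer in range(num_layers):
        if layer < first_middle:
            rates.append(0.0)                          # early frozen layer
        elif layer < first_new:
            rates.append(base_lr)                      # middle layer
        else:
            rates.append(base_lr * new_layer_factor)   # newly replaced layer
    return rates


# A hypothetical 5-layer network: 2 frozen, 2 middle, 1 new layer.
rates = composite_learning_rates(5, first_middle=2, first_new=4, base_lr=0.001)
```

Moving the two split points changes how many layers are frozen, fine-tuned, or trained from scratch, which is the kind of allocation choice the four settings in the abstract compare.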

Interpretable Partitioned Embedding for Intelligent Multi-item Fashion Outfit Composition

Intelligent fashion outfit composition has become increasingly popular in recent years. Some deep-learning-based approaches have recently shown competitive composition results. However, because such approaches are uninterpretable, they cannot meet the urge of designers, businesses, and consumers to comprehend the importance of different attributes in an outfit composition. To realize interpretable and customized multi-item fashion outfit compositions, we propose a partitioned embedding network to learn interpretable embeddings from clothing items. The network contains two vital components: an attribute partition module and a partition adversarial module. In the attribute partition module, multiple attribute labels are adopted to ensure that different parts of the overall embedding correspond to different attributes. In the partition adversarial module, adversarial operations are adopted to achieve the independence of different parts. With the interpretable and partitioned embedding, we then construct an outfit composition graph and an attribute matching map. Extensive experiments demonstrate that 1) the partitioned embedding has unmingled parts that correspond to different attributes and 2) outfits recommended by our model are more desirable than those of existing methods.

Affective Content-aware Adaptation Scheme on QoE Optimization of Adaptive Streaming over HTTP

The paper presents a novel affective content-aware adaptation (ACAA) scheme to optimize QoE for adaptive video streaming over HTTP. Most existing HTTP-based adaptive streaming schemes conduct video bit-rate adaptation based on an estimate of available network resources, which ignores user preference for the affective content (AC) embedded in the video data streamed over the network. Since personal demand for AC differs widely among viewers, satisfying individual affective demand is critical to improving QoE in commercial video services. However, the results of video affective analysis cannot be applied directly to a current adaptive streaming scheme. Considering the AC distributions in the user's viewing history and in all streaming segments, the AC relevancy can be inferred as an affective metric for the AC-related segments. Further, we propose an ACAA scheme to optimize QoE for user-desired affective content while taking into account both network status and affective relevancy. We have implemented the ACAA scheme in a realistic trace-based evaluation and compared its network performance and Quality of Experience (QoE) with those of Probe and Adaptation (PANDA), buffer-based adaptation (BBA), and Model Predictive Control (MPC). Experimental results show that ACAA can preserve available buffer time for upcoming affective content that matches the viewer's individual preference, so as to achieve better QoE for affective content than for normal content while keeping the overall QoE satisfactory.

Image/Video Restoration via Multiplanar Autoregressive Model and Low-Rank Optimization

In this paper, we introduce an image/video restoration approach that utilizes the high-dimensional similarity in images/videos. After grouping similar patches from neighboring frames, we propose to build a multiplanar autoregressive (AR) model to exploit the correlation in cross-dimensional planes of the patch group, which has long been neglected by previous AR models. To further utilize the nonlocal self-similarity in images/videos, a joint multiplanar AR and low-rank based approach (MARLow) is proposed to reconstruct patch groups more effectively. Moreover, for video restoration, the temporal smoothness of the restored video is constrained by a Markov random field (MRF), where the MRF encodes a priori knowledge about the consistency of patches from neighboring frames. Specifically, we treat different restoration results (from different patch groups) of a certain patch as labels of an MRF, and temporal consistency among these restored patches is imposed. Besides image and video restoration, the proposed method is also suitable for other restoration applications such as interpolation and text removal. Extensive experimental results demonstrate that the proposed approach obtains encouraging performance compared with state-of-the-art methods.

Visual Attention Analysis and Prediction on Human Faces for Children with Autism Spectrum Disorder

The focus of this article is to analyze and predict the visual attention of children with Autism Spectrum Disorder (ASD) when looking at human faces. Social difficulties are the hallmark features of ASD and lead, to varying degrees, to atypical visual attention toward various stimuli, especially human faces. Learning the visual attention of children with ASD could contribute to related research in the fields of medical science, psychology, and education. We first construct a Visual Attention on Faces for Autism Spectrum Disorder (VAFA) database, which consists of 300 natural scene images with human faces and corresponding eye movement data collected from 13 children with ASD. By comparison with matched typically developing (TD) controls, we quantify atypical visual attention on human faces in ASD. Statistics show that some high-level factors such as face size, facial features, face poses, and face emotions have different impacts on the visual attention of children with ASD. Combining the feature maps extracted from state-of-the-art saliency models, we obtain a visual attention model on human faces for children with ASD. The proposed model shows the best performance among all competitors. With the help of our proposed model, researchers in related fields could design specialized educational content containing human faces for children with ASD or produce a specific model for rapidly screening for ASD using eye movement data.

Pseudo-3D Attention Transfer Network with Content-Aware Strategy for Image Captioning

In this paper, we propose a novel pseudo-3D attention memory network with content-aware strategy (P3DAM-CAS) for the image captioning task. Our model consists of three parts: a pseudo-3D attention network (P3DA), a P3DA-based memory network (P3DAM), and a content-aware strategy (CAS). Firstly, we propose P3DA to make full use of the 3D information in convolutional feature maps to capture more details. Most existing attention-based models only extract the 2D spatial representation from convolutional feature maps to decide which area should receive more attention. However, convolutional feature maps are 3D, and different channel features can detect diverse semantic attributes associated with images. P3DA is proposed to combine 2D spatial maps with 1D semantic-channel attributes to generate more informative captions. Secondly, we design P3DAM to maintain the attention contexts, predicting each word from previous and current attention information. Traditional attention-based approaches only utilize the current attention information to predict words directly, whereas P3DAM is able to learn long-term attention dependencies and explore global modeling patterns. Finally, we present CAS to provide a more relevant and distinct caption for each image. A captioning model trained by maximum likelihood estimation may generate captions that correlate weakly with the image contents, resulting in a cross-modal gap between vision and linguistics. CAS, however, is effective at conveying meaningful information about the visual contents accurately. P3DAM-CAS is evaluated on Flickr30k and MSCOCO, and the results demonstrate that the proposed model outperforms state-of-the-art approaches. To the best of our knowledge, P3DAM-CAS achieves one of the best performances among all models that use the cross-entropy training method.

Moving Foreground-Aware Visual Attention and Key Volume Mining for Human Action Recognition

Recently, many deep learning approaches have shown remarkable progress on human action recognition. However, few efforts have been made to improve the performance of action recognition by applying the visual attention mechanism in deep learning models. In this paper, we propose a novel deep framework called Moving Foreground Attention (MFA), which enhances the performance of action recognition by guiding the model to focus on the discriminative foreground targets. In our work, MFA detects the moving foreground through a proposed variance-based algorithm. Meanwhile, an unsupervised proposal is utilized to mine the action-related key volumes and generate corresponding correlation scores. Based on these scores, a new stochastic-out scheme is adopted to effectively train the MFA. In addition, we integrate an independent Temporal Net into the proposed framework for temporal dynamic modeling. Experiments on two standard benchmarks, UCF101 and HMDB51, show that the proposed MFA is effective and reaches state-of-the-art performance.

Art by Computing Machinery: Is Machine-Art Acceptable in the Artworld?

When does a machine-created work become art? What is art, and can machine artworks fit into the historical and present discourse? Are machine artworks merely a type of new media with which artists extend their creativity? Will solely machine-created artworks be accepted by our artworlds? This article probes these questions by first identifying the frameworks for defining and explaining art and evaluating their suitability for explaining machine artworks. It then explores how artworks have a necessary relationship with their human artists and the wider context of history, institutions, styles and approaches, and audiences and artworlds. The article then questions whether machines have such a relational context and whether machines will ever live up to our standard of what constitutes an artwork as defined by us, or whether machines are good only for assisting creativity. Questions of IP, rights, and ownership are also discussed for human-machine artworks and purely machine-produced works of art. The article views the viability of machines as artists as the central question in the historical discourse, extended through the art and the artworld. The article evaluates machine-produced work on such a basis.

Multi-source Multi-level Attention Networks for Visual Question Answering

In recent years, Visual Question Answering (VQA) has attracted increasing attention due to its requirement for cross-modal understanding and reasoning over vision and language. VQA aims to automatically answer natural language questions with reference to a given image. VQA is challenging because reasoning in the visual domain requires a full understanding of spatial relationships and semantic concepts, as well as common sense about real images. However, most existing approaches jointly embed the abstract low-level visual features and high-level question features to infer answers. These works have limited reasoning ability because they do not model the rich spatial context of regions, the high-level semantics of images, or knowledge across multiple sources. To solve these challenges, we propose a multi-source multi-level attention network for visual question answering that benefits both spatial inference, through visual attention on context-aware region representations, and reasoning, through semantic attention on concepts as well as external knowledge. Specifically, we learn to reason on the image representation by question-guided attention at different levels across multiple sources, including the region and concept levels from the image source as well as the sentence level from an external knowledge base. First, we encode region-based middle-level outputs from convolutional neural networks (CNN) into a spatially embedded representation with a multi-directional 2D recurrent neural network, and further locate the answer-related regions with a multilayer perceptron (MLP) as visual attention. Second, we generate semantic concepts from high-level semantics in the CNN and select the question-related concepts as concept attention. Third, we query semantic knowledge from a general knowledge base by concepts and select the question-related knowledge as knowledge attention. Finally, we jointly optimize visual attention, concept attention, knowledge attention, and question embedding with a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach achieves significant improvement on two very challenging VQA datasets.

Cross-Modality Retrieval by Joint Correlation Learning

As an indispensable process of cross-media analysis, comprehending heterogeneous data faces challenges in the fields of visual question answering (VQA), visual captioning, and cross-modality retrieval. Bridging the semantic gap between the two modalities is still difficult. In this paper, in order to address this problem in cross-modality retrieval, we propose a cross-modal learning model with joint correlative calculation learning. Firstly, an auto-encoder is used to embed the visual features by minimizing the error of feature reconstruction, and a multi-layer perceptron (MLP) is utilized to model the embedding of the textual features. Then we design a joint loss function to optimize both the intra- and inter-correlations among the image-sentence pairs, i.e., the reconstruction loss of visual features, the relevant similarity loss of paired samples, and the triplet relation loss between positive and negative examples. In the proposed method, we optimize the joint loss based on a batch score matrix and utilize all mutually mismatched paired samples to enhance its performance. Our experiments on retrieval tasks demonstrate the effectiveness of the proposed method. It achieves performance comparable to the state of the art on three benchmarks, i.e., Flickr8k, Flickr30k, and MS-COCO.
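To make the idea of a triplet loss over a batch score matrix concrete, here is a minimal sketch. It assumes a square matrix in which entry (i, j) is the similarity between image i and sentence j, so the diagonal holds the matched pairs and every off-diagonal entry is a mismatched pair. The hinge form, the margin value, and the function name are assumptions for illustration, and the reconstruction and paired-similarity terms of the joint loss are omitted; this is not the authors' exact formulation.

```python
def batch_triplet_loss(scores, margin=0.2):
    """Sum of hinge triplet terms over all mismatched pairs in a batch.

    scores[i][j] is the similarity of image i and sentence j; the
    diagonal holds the matched (positive) pairs. The margin is an
    illustrative value.
    """
    n = len(scores)
    loss = 0.0
    for i in range(n):
        positive = scores[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own sentence than with sentence j.
            loss += max(0.0, margin - positive + scores[i][j])
            # Sentence i should score higher with its own image than with image j.
            loss += max(0.0, margin - positive + scores[j][i])
    return loss


well_separated = batch_triplet_loss([[1.0, 0.0], [0.0, 1.0]])
indistinct = batch_triplet_loss([[0.5, 0.5], [0.5, 0.5]])
```

When matched pairs beat every mismatched pair by at least the margin, the loss is zero; when all scores are equal, each of the four mismatched comparisons in the 2x2 example contributes exactly the margin.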

Stochastic Optimization for Green Multimedia Services in Dense 5G Networks

The manifold capacity increase promised by dense 5G networks will enable the provisioning of broadband multimedia services, including virtual reality, augmented reality, and mobile immersive video, to name a few. These new applications will coexist with classic ones and contribute to the exponential growth of multimedia services in mobile networks. At the same time, the different requirements of new and old services pose new challenges to the effective usage of 5G resources. In response to these challenges, we propose a novel Stochastic Optimization framework for Green Multimedia Services (SOGMS) that targets the maximization of system throughput and the minimization of energy consumption in data delivery. In particular, Lyapunov optimization is used to address this objective, which is formulated and decomposed into three tractable subproblems. A distinct algorithm is devised for each subproblem: Quality of Experience (QoE) based admission control, cooperative resource allocation, and multimedia service scheduling. Finally, extensive simulations are carried out to evaluate the proposed method against state-of-the-art solutions in dense 5G networks.
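
The general flavor of Lyapunov optimization can be sketched with a textbook drift-plus-penalty step: in each time slot, pick the action that minimizes a weighted energy penalty minus the backlog-weighted service, then update the queue. This is a generic illustration, not the SOGMS algorithm; the single-queue model, the `(service, energy)` action tuples, and the trade-off parameter `v` are all assumptions.

```python
def drift_plus_penalty_step(queue, arrivals, actions, v):
    """One slot of a generic drift-plus-penalty controller.

    queue:    current backlog (data waiting to be delivered).
    arrivals: data arriving during this slot.
    actions:  list of (service, energy) tuples, hypothetical transmit options.
    v:        weight on energy; larger v saves energy at the cost of backlog.
    """
    # Minimize v * energy - queue * service over the available actions.
    best = min(actions, key=lambda a: v * a[1] - queue * a[0])
    service, _energy = best
    next_queue = max(queue - service, 0.0) + arrivals
    return best, next_queue


options = [(0.0, 0.0), (2.0, 1.0)]  # stay idle, or transmit 2 units for 1 energy unit
choice, backlog = drift_plus_penalty_step(10.0, 1.0, options, v=1.0)
```

With an empty queue the controller stays idle to save energy; once the backlog grows, transmitting becomes the cheaper choice. Throughput and energy are balanced through `v` in this sketch.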

A Simplistic Global Median Filtering Forensics Based on Frequency Domain Analysis of Image Residuals

Sophisticated image forgeries have made digital image forensics an active area of research. In this area, many researchers have addressed the problem of median filtering forensics. Existing median filtering detectors are adequate for classifying median filtered images in uncompressed mode and in compressed mode at high quality factors. However, the field lacks a robust method to detect median filtering in low-resolution images compressed with low quality factors. In this article, we propose a novel four-dimensional feature set based on first-order statistics of the frequency content of median filtered residuals (MFRs) of original and median filtered images. The proposed feature set outperforms state-of-the-art detectors based on handcrafted features in terms of feature set dimension, robustness for low-resolution images at all quality factors, and robustness against an existing anti-forensic method. The results also show the efficacy of the proposed method over a convolutional neural network (CNN) based median filtering detector. Comprehensive results demonstrate that the proposed detector can distinguish median filtering from other similar manipulations. Additionally, a generalization test on cross-database images supports the cross-validation results on four different databases. Thus, our proposed detector meets the current challenges in the field to a great extent.
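
For readers unfamiliar with median filtered residuals, the sketch below computes an MFR (the median filtered image minus the input image) and two first-order statistics of its DFT magnitudes. The 3x3 window, the edge replication, and the choice of mean and variance are assumptions for illustration; the abstract does not specify the four features, and this does not reproduce them.

```python
import cmath
import statistics


def median_filter_3x3(image):
    """3x3 median filter with edge replication (window size is an assumption)."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = statistics.median(window)
    return out


def median_filter_residual(image):
    """MFR: median filtered image minus the input image, pixel by pixel."""
    filtered = median_filter_3x3(image)
    return [[f - p for f, p in zip(frow, prow)]
            for frow, prow in zip(filtered, image)]


def dft_magnitudes(residual):
    """Magnitudes of a naive 2D DFT; only practical for very small inputs."""
    h, w = len(residual), len(residual[0])
    mags = []
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2 * cmath.pi * (u * y / h + v * x / w)
                    acc += residual[y][x] * cmath.exp(1j * angle)
            mags.append(abs(acc))
    return mags


def residual_spectrum_stats(image):
    """Mean and population variance of the MFR spectrum magnitudes."""
    mags = dft_magnitudes(median_filter_residual(image))
    return statistics.fmean(mags), statistics.pvariance(mags)


tiny = [[0] * 5 for _ in range(5)]
tiny[2][2] = 9  # a single impulse that the median filter removes
mean_mag, var_mag = residual_spectrum_stats(tiny)
```

A real detector would use an FFT routine instead of the naive transform and feed such statistics to a classifier.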
