ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

Latest Articles

Advanced Stereo Seam Carving by Considering Occlusions on Both Sides

Stereo image retargeting plays a significant role in the field of image processing; it aims at making major objects as prominent as possible when... (more)

Statistical Early Termination and Early Skip Models for Fast Mode Decision in HEVC INTRA Coding

In this article, statistical Early Termination (ET) and Early Skip (ES) models are proposed for fast... (more)

A Simplistic Global Median Filtering Forensics Based on Frequency Domain Analysis of Image Residuals

Sophisticated image forgeries have made digital image forensics an active area of research. In... (more)

Harvesting Visual Objects from Internet Images via Deep-Learning-Based Objectness Assessment

The collection of internet images has been growing at an astonishing speed. It is undeniable that these images contain rich visual information that can... (more)

Spatial Structure Preserving Feature Pyramid Network for Semantic Image Segmentation

Recently, progress on semantic image segmentation has been substantial, benefiting from the rapid development of Convolutional Neural Networks. Semantic... (more)

Moving Foreground-Aware Visual Attention and Key Volume Mining for Human Action Recognition

Recently, many deep learning approaches have shown remarkable progress on human action recognition. However, it remains unclear how to extract the... (more)

A Pseudo-likelihood Approach for Geo-localization of Events from Crowd-sourced Sensor-Metadata

Events such as live concerts, protest marches, and exhibitions are often video recorded by many... (more)

Paillier Cryptosystem based Mean Value Computation for Encrypted Domain Image Processing Operations

Due to its large storage facility and high-end computing capability, cloud computing has received... (more)

Subtitle Region Selection of S3D Images in Consideration of Visual Discomfort and Viewing Habit

Subtitles, serving as a linguistic approximation of the visual content, are an essential element in... (more)

Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention

One fundamental problem in image search is to learn the ranking functions (i.e., the similarity between query and image). Recent progress on this... (more)

Stochastic Optimization for Green Multimedia Services in Dense 5G Networks

The manyfold capacity magnification promised by dense 5G networks will make possible the provisioning of broadband multimedia services, including... (more)


[December 2018]


Special issue call: "Multimodal Machine Learning for Human Behavior Analysis". Call for papers. Submission deadline April 15th, 2019.

Special issue call: "Computational Intelligence for Biomedical Data and Imaging". Call for papers. Submission deadline May 30th, 2019.

Special issue call: "Smart Communications and Networking for Future Video Surveillance". Call for papers. Submission deadline June 30th, 2019.

Special issue call: "Trusted Multimedia for Smart Digital Environments". Call for papers. Submission deadline September 20th, 2019.


News archive
Forthcoming Articles

Cross-domain brain CT image smart segmentation via shared hidden space transfer FCM clustering

Active Balancing Mechanism for Imbalanced Medical Data in Deep Learning based Classification Models

CovLets: a Second Order Descriptor for Modeling Multiple Features

Introduction to the Special Issue on Affective Computing for Large-Scale Heterogeneous Multimedia Data

Unsupervised Learning of Human Action Categories in Still Images with Deep Representations

In this paper, we propose a novel method for unsupervised learning of human action categories in still images. Different from previous methods, the proposed method tries to explore distinctive information of actions directly from unlabeled image databases, and learns discriminative deep representations in an unsupervised manner to categorize actions. In the proposed method, large action image collections can be utilized without manual annotations. Specifically, (i) to deal with the problem that unsupervised discriminative deep representations are difficult to learn, the proposed method builds a training dataset with surrogate labels from the unlabeled dataset, then learns discriminative representations by alternately updating CNN parameters and the surrogate training dataset in an iterative manner; (ii) to explore the discriminatory information among different action categories, training batches for updating the CNN parameters are built with triplet groups, and the triplet loss function is introduced to update the CNN parameters; (iii) to learn more discriminative deep representations, a Random Forest classifier is adopted to update the surrogate training dataset, so that more beneficial triplet groups can be built with the updated surrogate training dataset. Extensive experiments on two benchmark action image datasets demonstrate the effectiveness of the proposed method.
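The triplet loss mentioned in step (ii) is a standard metric-learning objective rather than something specific to this paper. A minimal NumPy sketch of it, on plain vectors rather than the paper's CNN embeddings, is:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the anchor toward the positive sample
    and push it away from the negative sample by at least `margin`.
    Squared Euclidean distance is used here for simplicity."""
    d_pos = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)
```

In the paper's setting, anchor/positive pairs would share a surrogate label and the negative would carry a different one; this sketch only illustrates the loss itself.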

HGAN: Holistic Generative Adversarial Networks for 2D Image-based 3D Object Retrieval

In this paper, we propose a novel method to handle the 2D image-based 3D object retrieval problem. First, we extract a set of virtual views to represent each 3D object, and a soft-attention model is utilized to weight each view and select one characteristic view per 3D object. Second, we propose novel Holistic Generative Adversarial Networks (HGAN) to solve the cross-domain feature representation problem, bringing the feature space of the virtual characteristic views closer to the feature space of real images. This effectively mitigates the distribution discrepancy between the 2D image domain and the 3D object domain. Finally, we utilize the generative model of HGAN to obtain the "virtual real image" of each 3D object, placing the characteristic view of each 3D object and the real image in the same feature space for retrieval. To demonstrate the performance of our approach, we set up a new dataset that includes pairs of 2D images and 3D objects, where the 3D objects are drawn from the ModelNet40 dataset. The experimental results demonstrate the superiority of our proposed method over state-of-the-art methods.

Learning Discriminative Sentiment Representation from Strongly- and Weakly-Supervised CNNs

Visual sentiment analysis is receiving increasing attention with the rapidly growing number of images uploaded to social websites. Learning rich visual representations often requires training deep Convolutional Neural Networks (CNNs) on massive manually labeled data, which is expensive or scarce, especially for a subjective task like visual sentiment analysis. Meanwhile, a large quantity of social images is readily available, albeit noisy, by querying social networks using the sentiment categories as keywords, whereby various types of images related to a specific sentiment can be easily collected. In this paper, we propose a multiple kernel network (MKN) for visual sentiment recognition, which learns representations from strongly- and weakly-supervised CNNs. Specifically, the weakly-supervised deep model is trained using large-scale social image data, while the strongly-supervised deep model is fine-tuned on manually labeled affective datasets. We employ a multiple kernel scheme on multiple layers of these CNNs, which can automatically select discriminative representations by learning a linear combination of a set of predefined kernels. In addition, we introduce a large-scale dataset collected from popular comics of various countries, e.g., America, Japan, China and France, which consists of 11,821 images with various artistic styles. Experimental results show that MKN achieves consistent improvements over the state-of-the-art methods on the public affective datasets as well as the newly established comics dataset.
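The "linear combination of a set of predefined kernels" is the classic multiple kernel learning building block. A sketch with RBF base kernels and fixed weights, purely for illustration (learning the weights, e.g. by alternating optimization with an SVM, is omitted):

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    # Pairwise squared distances between rows of X and Y, then a Gaussian kernel.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def combined_kernel(X, Y, gammas, weights):
    """Convex combination of predefined RBF kernels: the MKL building block.
    `gammas` defines the base kernels; `weights` are fixed here, not learned."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize to a convex combination
    return sum(w * rbf_kernel(X, Y, g) for w, g in zip(weights, gammas))
```

Because each base kernel is positive semidefinite and the weights are non-negative, the combined kernel is itself a valid kernel.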

AB-LSTM: Attention-Based Bidirectional LSTM Model for Scene Text Detection

Detection of scene text in arbitrary shapes is a challenging task in the field of computer vision. Most existing scene text detection methods exploit rectangular/quadrangular bounding boxes to denote detected text, which fail to accurately fit text with arbitrary shapes, such as curved text. In addition, recent progress on scene text detection has benefited from Fully Convolutional Networks. Text cues contained in multi-level convolutional features are complementary for detecting scene text objects, and how to exploit these multi-level features is still an open problem. To tackle the above issues, we propose an Attention-based Bidirectional Long Short-Term Memory (AB-LSTM) model for scene text detection. First, word stroke regions (WSRs) and text center blocks (TCBs) are extracted by two AB-LSTM models, respectively. Then, the union of WSRs and TCBs is used to represent text objects. To validate the effectiveness of the proposed method, we perform experiments on four public benchmarks: CTW1500, Total-Text, ICDAR2013, and MSRA-TD500, and compare it with existing state-of-the-art methods. Experimental results demonstrate that the proposed method achieves competitive results and handles scene text with arbitrary shapes (horizontal, oriented, and curved) well.

Introduction to the Special Issue on Face Analysis Applications

Multi-scale Supervised Attentive Encoder-Decoder Network for Crowd Counting

Crowd counting is a popular topic with widespread applications. Currently, the biggest challenge to crowd counting is large-scale variation in objects. In this paper, we focus on overcoming this challenge by proposing a novel Attentive Encoder-Decoder Network (AEDN), which is supervised on multiple feature scales to conduct crowd counting via density estimation. This work has three main contributions. First, we augment the traditional encoder-decoder architecture with our proposed residual attention blocks, which, beyond skip-connected encoded features, further extend the decoded features with attentive features. AEDN is better at establishing long-range dependencies between the encoder and decoder, therefore promoting more effective fusion of multi-scale features for handling scale variations. Second, we design a new KL-divergence based distribution loss to supervise the scale-aware structural differences between two density maps, which complements the pixel-isolated MSE loss and better optimizes AEDN to generate high-quality density maps. Third, we adopt a multi-scale supervision scheme, such that multiple KL divergences and MSE losses are deployed at all decoding stages, providing more thorough supervisions for different feature scales. Extensive experimental results on four public datasets, including ShanghaiTech Part A, ShanghaiTech Part B, UCF-CC-50 and UCF-QNRF, reveal the superiority and efficacy of the proposed method, which outperforms most state-of-the-art competitors.
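As a point of reference, a pixel-wise MSE combined with a KL divergence between globally normalized density maps can be sketched as below; the paper's actual scale-aware formulation and multi-scale weighting are not reproduced here:

```python
import numpy as np

def density_losses(pred, gt, eps=1e-8):
    """Two complementary supervisions for a predicted density map:
    - MSE treats every pixel independently;
    - KL divergence compares the maps as normalized spatial distributions,
      so it is sensitive to structural (distributional) differences."""
    mse = np.mean((pred - gt) ** 2)
    p = gt.ravel() / (gt.sum() + eps)    # target distribution over pixels
    q = pred.ravel() / (pred.sum() + eps)  # predicted distribution
    kl = np.sum(p * np.log((p + eps) / (q + eps)))
    return mse, kl
```

A training objective would typically combine the two, e.g. `mse + lam * kl` with a tuned weight `lam` (a hypothetical choice, not taken from the paper).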

U-Net Conditional GANs for Photo-Realistic and Identity-Preserving Facial Expression Synthesis

Facial expression synthesis is a challenging task since the expression changes are highly non-linear and depend on the facial appearance. Person identity should also be well preserved in the synthesized face. In this paper, we present a novel U-Net Conditional Generative Adversarial Network (UC-GAN) for facial expression generation. U-Net helps retain the properties of the input face, including the identity information and facial details. A category condition is added to the U-Net model so that one-to-many expression synthesis can be achieved simultaneously. We also design constraints for identity preservation during facial expression synthesis to further guarantee that the identity of the input face is well preserved in the generated facial image. Specifically, we pair the generated output with condition images of other identities for the discriminator, so as to encourage it to learn the distinctions both between synthesized and natural images and between the input and other identities, which helps improve its discriminating ability. Additionally, we utilize the triplet loss to keep the generated facial images closer to the same-identity person by imposing a margin between the positive pairs and negative pairs in feature space, in which face feature vectors are extracted from the discriminator. Both qualitative and quantitative evaluations are conducted on the Oulu-CASIA, RaFD and KDEF datasets, and the results show that our method can generate faces with natural and realistic expressions while preserving the identity information.

Spatial Preserved Graph Convolution Networks for Person Re-identification

Person re-identification is a very challenging task due to inter-class ambiguity caused by similar appearances and large intra-class diversity caused by viewpoints, illuminations and poses. To address these challenges, in this paper, a graph convolution network based model for person re-identification is proposed to learn more discriminative feature embeddings, in which graph-structured relationships between person images and person parts are jointly integrated. Graph convolution networks extract common characteristics of the same person, while pyramid feature embedding exploits part relations and learns stable representations within each person's images. We achieve very competitive performance on three widely used datasets, indicating that the proposed approach significantly outperforms the baseline methods and achieves state-of-the-art performance.

Dissecting the Performance of VR Video Streaming Through the VR-EXP Experimentation Platform

To cope with the massive bandwidth demands of Virtual Reality (VR) video streaming, both the scientific community and the industry have been proposing optimization techniques such as viewport-aware streaming and tile-based adaptive bitrate heuristics. As most of the VR video traffic is expected to be delivered through mobile networks, a major problem arises: both the network performance and VR video optimization techniques have the potential to influence the video playout performance and the Quality of Experience (QoE). However, the interplay between them is neither trivial nor has it been properly investigated. To bridge this gap, in this paper we introduce VR-EXP, an open-source platform for carrying out VR video streaming performance evaluation. Furthermore, we consolidate a set of relevant VR video streaming techniques and evaluate them under variable network conditions, contributing to an in-depth understanding of what to expect when different combinations are employed. To the best of our knowledge, this is the first work to propose a systematic approach, accompanied by a software toolkit, which allows one to compare different optimization techniques under the same circumstances. Extensive evaluations carried out using realistic datasets demonstrate that VR-EXP is instrumental in providing valuable insights regarding the interplay between network performance and VR video streaming optimization techniques.

Internet of Things Based Trusted Hypertension Management App Using Mobile Technology

An app for hypertension management was developed using web-roadmap technology, which comprises five steps: planning, analysis, design, implementation and evaluation. The hypertension management app was tested with hypertension patients (N=56). Their medication possession ratio was calculated before and after using the app over a period of five weeks. Of the 56 participants, 45 achieved medication adherence. The medication possession ratio was calculated using the Morisky scale, and there was an improvement in the patients' health after use of the hypertension management app (p=.001). The calculated usefulness score was 3.9 out of 5. User satisfaction after using the app was rated per process: 4.5 for recording blood pressure, 4.0 for recording the medication ratio, 3.4 for sending data, 4.3 for alerts, and 5 for medication alerts. This paper shows that a mobile app for hypertension based on clinical practice guidelines is effective in improving patients' health.
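The abstract does not state the exact adherence formula used; the conventional medication possession ratio (MPR) is the total days of medication supplied divided by the number of days in the observation period, usually capped at 1.0. A minimal sketch under that conventional definition:

```python
def medication_possession_ratio(days_supplied, period_days):
    """Conventional MPR: fraction of the observation period covered by
    dispensed medication, capped at 1.0 (oversupply does not exceed full
    adherence). This is the textbook definition, not the study's own code."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return min(days_supplied / period_days, 1.0)
```

For example, 28 days of supply over a 35-day period gives an MPR of 0.8.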

A Decision Support System with Intelligent Recommendation for Multi-Disciplinary Medical Treatment

Recent years have witnessed an emerging trend for improving disease treatment by forming multi-disciplinary medical teams. The collaboration among specialists from multiple medical domains has been shown to be significantly helpful for designing comprehensive and reliable regimens, especially for incurable diseases. Although this kind of multi-disciplinary treatment has been increasingly adopted by healthcare providers, a new challenge has been introduced to the decision-making process: how to efficiently and effectively develop final regimens by searching for candidate treatments and considering inputs from every expert. In this paper, we present a sophisticated decision support system called MdtDSS (a decision support system (DSS) for multi-disciplinary treatment (Mdt)), which is particularly developed to guide the collaborative decision-making in multi-disciplinary treatment scenarios. The system integrates a recommender system which aims to search for personalized candidates from a large-scale high-quality regimen pool, and a voting system which helps collect feedback from multiple specialists without potential bias. Our decision support system optimally combines machine intelligence and human experience and helps medical practitioners make informed and accountable regimen decisions. We deployed the proposed system in a large hospital in Shanghai, China, and collected real-world data on large-scale patient cases. The evaluation shows that the proposed system achieves outstanding results in terms of high-quality multi-disciplinary treatment.

Textual Entailment based Figure Summarization for Biomedical Articles

The current paper proposes a novel approach (FigSum++) for automatic figure summarization in biomedical scientific articles using a multi-objective evolutionary algorithm. The problem is treated as a binary optimization problem where relevant sentences in the summary for a given figure are selected based on various sentence scoring features: the textual entailment score between sentences in the summary and the figure's caption, the number of sentences referring to the figure, semantic similarity between sentences and the figure's caption, the number of overlapping words between sentences and the figure's caption, etc. These features are optimized simultaneously using multi-objective binary differential evolution (MBDE). MBDE consists of a set of solutions, and each solution represents a subset of sentences to be selected in the summary. MBDE generally uses a single DE variant, but here an ensemble of two different DE variants, measuring diversity among solutions and convergence towards the global optimal solution, respectively, is employed for efficient search. Usually, in any summarization system, diversity amongst sentences (called anti-redundancy) in the summary is a very critical feature, and it is calculated in terms of similarity (like cosine similarity) among sentences. In this paper, a new way of measuring diversity in terms of textual entailment is proposed. To represent the sentences of the article in the form of numeric vectors, the recently proposed BioBERT, a pre-trained language model for biomedical text mining, is utilized. An ablation study has also been presented to determine the importance of different objective functions. For evaluation of the proposed technique, two benchmark biomedical datasets containing 91 and 84 figures, respectively, are considered. Our proposed system obtains 5% and 11% improvements in terms of the F-measure metric over the two datasets, respectively, in comparison to the state-of-the-art.
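For reference, the conventional cosine-similarity redundancy measure that the paper replaces with textual entailment can be sketched as the mean pairwise cosine similarity of the selected sentences' vectors (e.g. BioBERT embeddings); lower values indicate a more diverse summary:

```python
import numpy as np

def cosine_redundancy(summary_vectors):
    """Mean pairwise cosine similarity among sentence vectors in a summary.
    A conventional anti-redundancy measure; assumes non-zero vectors."""
    V = np.asarray(summary_vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    sims = V @ V.T                                     # pairwise cosines
    n = len(V)
    off_diag = sims[~np.eye(n, dtype=bool)]            # drop self-similarity
    return off_diag.mean()
```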

Random Playlists Smoothly Commuting Between Styles

Consider someone who enjoys listening to playlists while commuting. He wants a different playlist of n songs each day, always starting from Locked Out of Heaven, a Bruno Mars song. The list should progress in smooth transitions between successive, randomly selected songs until it ends at Stairway to Heaven, a Led Zeppelin song. The challenge of automatically generating random and heterogeneous playlists is to find the appropriate balance among several conflicting goals. We propose two methods for solving this problem, ROPE and STRAW. When compared with state-of-the-art algorithms, ours are the only ones that satisfy the following quality constraints: heterogeneity, smooth transitions, novelty, scalability, and usability. We demonstrate the usefulness of the proposed algorithms by applying them to a large collection of songs, and we make a prototype available.

Adaptive Chunklets and AQM for Higher Performance Content Streaming

Commercial streaming services such as Netflix and YouTube use proprietary HTTP-based adaptive streaming (HAS) techniques to deliver content to consumers worldwide. MPEG recently developed Dynamic Adaptive Streaming over HTTP (DASH) as a unifying standard for HAS-based streaming. In DASH systems, streaming clients employ adaptive bitrate (ABR) algorithms to maximise user Quality of Experience (QoE) under variable network conditions. In a typical Internet-enabled home, video streams have to compete with diverse application flows for the last-mile Internet Service Provider (ISP) bottleneck capacity. Under such circumstances, ABRs will only act upon the fraction of the network capacity that is available, leading to possible QoE degradation. We have previously proposed chunklets as an approach orthogonal to ABR which uses parallel connections for intra-video chunk retrieval. Chunklets effectively make more bandwidth available for ABRs in the presence of cross-traffic, especially in environments where Active Queue Management (AQM) schemes such as PIE and FQ-CoDel are deployed. However, chunklets consume valuable server/middlebox resources which typically handle hundreds of thousands of requests/connections per second. In this paper, we propose 'adaptive chunklets', a novel chunklet enhancement that dynamically tunes the number of concurrent connections. We demonstrate that the combination of adaptive chunklets and FQ-CoDel is the most effective strategy. Our experiments show that adaptive chunklets can reduce the number of connections by almost 35% and consume almost 11% less bandwidth than fixed chunklets while providing the same QoE.

Tile-Based Adaptive Streaming for Virtual Reality Video

The increasing popularity of head-mounted devices and 360° video cameras allows content providers to provide virtual reality (VR) video streaming over the Internet, using a two-dimensional representation of the immersive content combined with traditional HTTP adaptive streaming (HAS) techniques. However, since only a limited part of the video (i.e., the viewport) is watched by the user, the available bandwidth is not optimally used. Recent studies have shown the benefits of adaptive tile-based video streaming; rather than sending the whole 360° video at once, the video is cut into temporal segments and spatial tiles, each of which can be requested at a different quality level. This allows prioritization of viewable video content, and thus results in an increased bandwidth utilization. Given the early stages of research, there are still a number of open challenges to unlock the full potential of adaptive tile-based VR streaming. The aim of this work is to provide an answer to several of these open research questions. Among others, we propose two tile-based rate adaptation heuristics for equirectangular VR video, which use the great-circle distance between the viewport center and the center of each of the tiles to decide upon the most appropriate quality representation. We also introduce a feedback loop in the quality decision process, which allows the client to revise prior decisions based on more recent information on the viewport location. Furthermore, we investigate the benefits of parallel TCP connections and the use of HTTP/2 as an application layer optimization. Through an extensive evaluation, we show that the proposed optimizations result in a significant improvement in terms of video quality (more than twice the time spent on the highest quality layer), compared to non-tiled HAS solutions.
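The great-circle distance between the viewport center and each tile center is the standard angular distance on the unit sphere (haversine form). The tile-ranking helper below is a hypothetical illustration of how such distances could prioritize tiles, not the paper's actual heuristic:

```python
import math

def great_circle_distance(lon1, lat1, lon2, lat2):
    """Angular great-circle distance (radians) between two points on the
    unit sphere, using the numerically stable haversine formula."""
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * math.asin(math.sqrt(h))

def rank_tiles(viewport, tile_centers):
    # Tiles closest to the viewport center come first; these would be
    # requested at the highest quality representation.
    return sorted(range(len(tile_centers)),
                  key=lambda i: great_circle_distance(*viewport, *tile_centers[i]))
```

Coordinates are (longitude, latitude) in radians; mapping equirectangular tile indices to sphere coordinates is left out.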

Soul Dancer: Emotion-based Human Action Generation

Body language is one of the most common ways of expressing human emotion. In this paper, we make the first attempt to generate an action video with a specific emotion from a single person image. The task of emotion-based action generation (EBAG) can be defined as follows: given an emotion type and a full-body human image, generate an action video in which the person from the source image expresses the given emotion. We divide the task into two parts and propose a two-stage framework to generate action video with emotion expression. In the first stage, we propose an RNN-based LS-GAN for translating the emotion into a pose sequence. In the second stage, we generate the target video frames from three inputs, the source pose and the target pose as the motion information and the source image as the appearance reference, using a conditional GAN model with an online training strategy. Our framework produces the pose sequence and transforms the action independently, which underlines the fundamental role that the high-level pose feature plays in generating action video with a specific emotion. The proposed method has been evaluated on the "Soul Dancer" dataset, which is built for action emotion analysis and generation. The experimental results demonstrate that our framework can effectively solve the emotion-based action generation task. However, a gap in the details of appearance between the generated action video and real-world video still exists, which indicates that the emotion-based action generation task has great research potential.

Cell Nuclei Classification In Histopathological Images using Hybrid OLConvNet

Computer-aided histopathological image analysis for cancer detection is a major research challenge in the medical domain. Automatic detection and classification of nuclei for cancer diagnosis pose many challenges for developing state-of-the-art algorithms, due to the heterogeneity of cell nuclei and dataset variability. Recently, a multitude of classification algorithms has used complex deep learning models for their datasets. However, most of these methods are rigid, and their architectural arrangement suffers from inflexibility and non-interpretability. In this research article, we propose a hybrid and flexible deep learning architecture, OLConvNet, that integrates the interpretability of traditional object-level features and the generalization of deep learning features by using a shallower Convolutional Neural Network (CNN) named CNN3L. CNN3L reduces the training time by training fewer parameters, hence eliminating space constraints imposed by deeper algorithms. We used the F1-score and multiclass Area Under the Curve (AUC) performance parameters to compare the results. To further strengthen the viability of our architectural approach, we tested our proposed methodology with the state-of-the-art deep learning architectures AlexNet, VGG16 and VGG19 as backbone networks. After a comprehensive analysis of classification results from all four architectures, we observed that our proposed model works well and performs better than contemporary complex algorithms.

Emotion Recognition with Multi-hypergraph Neural Networks Combining Multimodal Physiological Signals

Emotion recognition from physiological signals is an effective way to discern the inner state of human beings and has therefore been widely adopted in user-centered work such as driver status monitoring and telemedicine. The majority of present studies on emotion recognition are devoted to exploring the relationship between emotions and physiological signals, with the subjects treated as a whole. However, given certain features of the natural process of emotional expression, it is an urgent task to characterize latent correlations among multimodal physiological signals and to account for individual differences by exploiting associations among individual subjects. To tackle this problem, this paper proposes multi-hypergraph neural networks (MHGNN) to recognize emotions from physiological signals. The method constructs a multi-hypergraph structure, in which each hypergraph is built from one type of physiological signal to formulate correlations among different subjects. Each vertex in a hypergraph stands for one subject, with a description of its related stimuli, and the hyperedges represent the connections among the vertices. With this multi-hypergraph structure of the subjects, emotion recognition is transformed into classification of vertices in the multi-hypergraph structure. Experimental results on the DEAP and ASCERTAIN datasets demonstrate that the proposed method outperforms current state-of-the-art methods. Contrast experiments show that MHGNN is capable of describing the real process of biological response with much higher precision.

Synthesizing facial photometries and corresponding geometries using generative adversarial networks

Artificial data synthesis is currently a well studied topic with useful applications in data science, computer vision, graphics and many other fields. Generating realistic data is especially challenging since human perception is highly sensitive to non-realistic appearance. Recent advances in GAN architecture and training procedures have driven the capabilities of synthetic data generation to new heights of realism. These successful models, however, are tuned mostly for use with regularly sampled data such as images, audio and video. Despite the wide success on these types of media, applying the same tools to geometric data poses a far greater challenge, which is still a hot topic of debate within the academic community. The lack of intrinsic parametrization inherent to geometric objects prohibits the direct use of convolutional filters, a main building block of today's machine learning systems. In this paper we propose a new method for generating realistic human facial geometries coupled with overlayed textures. We circumvent the parametrization issue by imposing a global mapping from our data to the unit rectangle. This mapping enables the representation of our geometric data as regularly sampled 2D images. We further discuss how to design such a mapping in order to control the mapping distortion and conserve area within the mapped image. By representing geometric textures and geometries as images, we are able to use advanced GAN methodologies in order to generate new geometries. We address the often neglected topic of the relation between texture and geometry and propose to use this correlation in order to match generated textures with their corresponding geometries. In addition, we widen the scope of our discussion and offer a new method for training GAN models on partially corrupted data.
Finally, we provide empirical evidence to support our claim that our generative model is able to produce examples of new people which do not exist within the training data while maintaining high realism and texture detail, two traits that are often at odds.

Efficient Face Alignment with Fast Normalization and Contour Fitting Loss

Face alignment is a key component of numerous face analysis tasks. In recent years, most existing methods have focused on designing high-performance face alignment systems and have paid less attention to efficiency. However, more and more face alignment systems are deployed on low-cost devices, such as mobile phones. In this paper, we design an efficient, light-weight CNN-based regression framework with a novel contour fitting loss, achieving competitive performance with other state-of-the-art methods. We discover that the maximum error occurs along the face contour, where landmarks do not have distinct semantic positions and thus are randomly labeled along the face contours in training data. To address this problem, we reshape the common L2 loss to dynamically adjust the regression targets during training, so that the network can learn more accurate semantic meanings of the contour landmarks and achieve better localization performance. Meanwhile, we systematically analyze the effects of pose variations in the face alignment task and design an efficient framework with a Fast Normalization Module (FNM) and a Lightweight Alignment Module (LAM), which quickly normalizes the in-plane rotation and efficiently localizes the landmarks. Our method achieves competitive performance with the state of the art on the 300W benchmark, and its speed is significantly faster than other CNN-based approaches.

A Unified Tensor-based Active Appearance Model

Appearance variations result in many difficulties in face image analysis. To deal with this challenge, we present a Unified Tensor-based Active Appearance Model (UT-AAM) for jointly modelling the geometry and texture information of 2D faces. For each type of face information, namely shape and texture, we construct a unified tensor model capturing all relevant appearance variations. This contrasts with the variation-specific models of the classical tensor AAM. To achieve the unification across pose variations, a strategy for dealing with self-occluded faces is proposed to obtain consistent shape and texture representations of pose-varied faces. In addition, our UT-AAM is capable of constructing the model from an incomplete training dataset, using tensor completion methods. Last, we use an effective cascaded-regression-based method for UT-AAM fitting. With these advancements, the utility of UT-AAM in practice is considerably enhanced. As an example, we demonstrate the improvements in training facial landmark detectors through the use of UT-AAM to synthesise a large number of virtual samples. Experimental results obtained using the Multi-PIE and 300-W face datasets demonstrate the merits of the proposed approach.

Robust Visual Tracking using Kernel Sparse Coding on Multiple Covariance Descriptors

In this paper, we aim to improve the performance of visual tracking by combining different features of multiple modalities. The core idea is to use covariance matrices as feature descriptors and then use sparse coding to encode the different features. The notion of sparsity has been successfully used in visual tracking; in this context, sparsity is used alongside appearance models often obtained from intensity/color information. In this work, we step outside this trend and propose to model the target appearance by local Covariance Descriptors (CovDs) in a pyramid structure. The proposed pyramid structure not only enables us to encode local and spatial information of the target appearance but also inherits useful properties of CovDs, such as invariance to affine transforms. Since CovDs lie on a Riemannian manifold, we further propose to perform tracking through sparse coding by embedding the Riemannian manifold into an infinite-dimensional Hilbert space. Embedding the manifold into a Hilbert space allows us to perform sparse coding efficiently using the kernel trick. Our empirical study shows that the proposed tracking framework outperforms existing state-of-the-art methods in challenging scenarios.
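A minimal sketch of the two ingredients: a region covariance descriptor built from simple per-pixel features, and the log-Euclidean map, one common way to move an SPD matrix onto a flat tangent space before kernel computations. The particular feature vector (x, y, intensity, gradient magnitudes) is a standard choice for illustration, not necessarily the one used in the paper.

```python
import numpy as np

def region_covariance(patch):
    """Covariance descriptor of a grayscale patch.

    Per-pixel feature vector: (x, y, intensity, |dI/dx|, |dI/dy|).
    Returns a 5x5 symmetric positive semi-definite matrix."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gy, gx = np.gradient(patch.astype(float))   # gradients along rows, cols
    feats = np.stack([xs.ravel(), ys.ravel(), patch.ravel().astype(float),
                      np.abs(gx).ravel(), np.abs(gy).ravel()], axis=0)
    return np.cov(feats)

def log_euclidean(cov, eps=1e-6):
    """Map an SPD matrix to its log-Euclidean (tangent-space) representation
    via eigendecomposition; Euclidean distances between these logs
    approximate the Riemannian geometry of the SPD manifold."""
    vals, vecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    return vecs @ np.diag(np.log(vals)) @ vecs.T
```

With descriptors mapped through `log_euclidean`, an RBF kernel on their differences gives a valid kernel for the sparse coding step.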

Random Forest with Self-paced Bootstrap Learning in Lung Cancer Prognosis

Training supervised learning approaches on gene expression data has the potential to decrease cancer death rates by informing prediction strategies for lung cancer treatment, but the sampled gene features still involve considerable noise. In this study, we present a random forest with self-paced bootstrap learning to improve lung cancer prognosis and classification based on gene expression data. Specifically, we propose an ensemble learning with random forest approach that improves classification performance by selecting multiple classifiers. We also investigate a sampling strategy that gradually embeds samples from high to low quality via self-paced learning. Results on five public lung cancer datasets show that our proposed method can select significant genes and improve classification performance compared to existing approaches. We believe that our proposed method has the potential to assist doctors in gene selection and lung cancer prognosis.
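The high-to-low-quality sampling idea can be sketched as a self-paced curriculum around a random forest: fit on currently "easy" (low-loss) samples, then relax the pace parameter so harder samples are gradually admitted. This is a generic self-paced loop assuming scikit-learn is available, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_paced_rf(X, y, lam=0.5, mu=1.5, rounds=4, seed=0):
    """Self-paced bootstrap: start from easy samples, admit harder ones
    each round as the pace threshold lam grows by factor mu."""
    rng = np.random.default_rng(seed)
    # warm start: a few samples from each class
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=5, replace=False)
                          for c in np.unique(y)])
    model = None
    for _ in range(rounds):
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X[idx], y[idx])
        proba = model.predict_proba(X)
        # per-sample log-loss as the "easiness" measure
        loss = -np.log(np.clip(proba[np.arange(len(y)), y], 1e-12, None))
        idx = np.where(loss < lam)[0]          # keep currently easy samples
        for c in np.unique(y):                 # keep every class represented
            if not np.any(y[idx] == c):
                cls = np.where(y == c)[0]
                idx = np.concatenate([idx, cls[np.argsort(loss[cls])[:5]]])
        lam *= mu                              # relax the pace each round
    return model
```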

Sequential Cross-Modal Hashing Learning via Multi-scale Correlation Mining

Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space through hash functions, achieving fast and flexible cross-modal retrieval. Most existing cross-modal hashing methods learn the hash function by mining the correlation among multimedia data, but ignore an important property of multimedia data: each modality has features at different scales, such as texture, object, and scene features in an image, which can provide complementary information for boosting the retrieval task. The correlations among multi-scale features are more abundant than the correlations between single features of multimedia data; they reveal finer underlying structure of the data and can be used for effective hash function learning. Therefore, we propose the Multi-scale Correlation Sequential Cross-modal Hashing (MCSCH) approach, whose main contributions can be summarized as follows: 1) A multi-scale feature guided sequential hashing learning method is proposed to share the information from features of different scales through an RNN-based network and generate the hash codes sequentially. The features of different scales are used to guide the hash code generation, which can enhance the diversity of the hash codes and weaken the influence of errors in specific features, such as false object features caused by occlusion. 2) A multi-scale correlation mining strategy is proposed to align the features of different scales in different modalities and mine the correlations among the aligned features. These correlations reveal finer underlying structure of multimedia data and can help to boost the hash function learning. 3) A correlation evaluation network evaluates the importance of the correlations to select the worthwhile ones, and increases the impact of these correlations on hash function learning. Experiments on two widely used 2-media datasets and a 5-media dataset demonstrate the effectiveness of our proposed MCSCH approach.

RCE-HIL: Recognizing Cross-media Entailment with Heterogeneous Interactive Learning

Entailment recognition is an important paradigm of reasoning, which judges whether a hypothesis can be inferred from given premises. However, previous efforts mainly concentrate on text-based reasoning as recognizing textual entailment (RTE), where the hypotheses and premises are both textual. In fact, the human reasoning process has the characteristic of cross-media reasoning: it is naturally based on joint inference with different sensory organs, which provide complementary reasoning cues from unique perspectives such as language, vision, and audition. How to realize cross-media reasoning has been a significant challenge in broadening and deepening entailment recognition. Therefore, this paper extends RTE to a novel reasoning paradigm, recognizing cross-media entailment (RCE), and proposes a heterogeneous interactive learning (HIL) approach. Specifically, HIL recognizes entailment relationships via cross-media joint inference, from image-text premises to text hypotheses. It is an end-to-end architecture with two parts: 1) Cross-media hybrid embedding performs cross embedding of premises and hypotheses to generate their fine-grained representations. It aims to achieve the alignment of cross-media inference cues via image-text and text-text interactive attention. 2) Heterogeneous joint inference constructs a heterogeneous interaction tensor space and extracts semantic features for entailment recognition. It aims to simultaneously capture the interaction between cross-media premises and hypotheses, and distinguish their entailment relationships. Experimental results on the widely used SNLI dataset, with image premises from the Flickr30K dataset, verify the effectiveness of HIL and the intrinsic inter-media complementarity in reasoning.

Smart Diagnosis: A Multiple-Source Transfer TSK Fuzzy System for EEG Seizure Identification

To effectively identify electroencephalogram (EEG) signals across multiple source domains, a transductive multiple-source transfer learning method called MS-TL-TSK is proposed, which combines multiple-source transfer learning and manifold regularization (MR) learning mechanisms in a Takagi-Sugeno-Kang (TSK) fuzzy system. Specifically, the advantages of MS-TL-TSK include: (1) by evaluating the significance of each source domain, a flexible domain weighting index is presented; (2) using the theory of sample transfer learning, a re-weighting strategy is presented to weigh the predictions of unknown samples in the target domain against the outputs of the source prediction functions; (3) by taking into account the MR term, the manifold structure of the target domain is effectively maintained in the proposed system; and (4) by inheriting the interpretability of the TSK fuzzy system (TSK-FS), MS-TL-TSK has good interpretability that is understandable by human beings (domain experts) identifying EEG signals. The effectiveness of the proposed fuzzy system is demonstrated on several EEG multiple-source transfer learning problems.

Action Recognition using form and motion modalities

Action recognition has attracted increasing interest in computer vision due to its potential applications in many vision systems. One of the main challenges in action recognition is to extract powerful features from videos. Most existing approaches exploit either hand-crafted techniques or learning-based methods to extract features from videos. However, these methods mainly focus on extracting dynamic motion features and ignore static form features, so they cannot fully capture the underlying information in videos accurately. In this paper, we propose a novel feature representation method for action recognition, which exploits hierarchical sparse coding to learn the underlying features from videos. The learned features characterise form and motion simultaneously and therefore provide a more accurate and complete feature representation. The learned form and motion features are treated as two modalities, representing the static and motion information respectively. These modalities are further encoded into a global representation via pair-wise dictionary learning and then fed to an SVM classifier for action classification. Experimental results on several challenging datasets validate that the proposed method is superior to several state-of-the-art methods.
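The sparse coding step at the core of such pipelines can be sketched with ISTA (iterative shrinkage-thresholding), one standard solver for the lasso-type objective; the hierarchical and pair-wise dictionary details of the paper are omitted here.

```python
import numpy as np

def ista_sparse_code(D, x, lam=0.1, iters=200):
    """Sparse code of signal x over dictionary D (columns = atoms) via ISTA:
    minimize 0.5 * ||x - D a||^2 + lam * ||a||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - x)           # gradient of the data term
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a
```

For an orthonormal dictionary the solution reduces to soft-thresholding the correlations, which makes the solver easy to sanity-check.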

Characterizing Subtle Facial Movements via Riemannian Manifold

Facial movements play a crucial role in human communication and emotional expression, since they not only transmit communication content but also contribute to ongoing processing of emotion-relevant information. Characterizing subtle facial movements from videos is one of the most active topics in computer vision research. It is, however, challenging because 1) the intensity of subtle facial muscle movement is usually low; 2) the duration may be transient; and 3) datasets containing spontaneous subtle movements with reliable annotations are difficult to obtain and often small. This paper addresses these problems in characterizing subtle facial movements from the aspects of both motion elucidation and description. Firstly, we propose an efficient method for elucidating hidden and repressed movements to make them easier to notice. We explore the feasibility of linearizing motion magnification and temporal interpolation, which has been obscured by the implementation of existing methods. We then propose a consolidated framework, termed MOTEL, to expand temporal duration and amplify subtle facial movements simultaneously. Secondly, we make our contribution to dynamic description. One major challenge is how to capture the intrinsic temporal variations caused by movements and omit extrinsic ones caused by different individuals and various environments. To diminish the influence of such diversity, we propose to characterize the dynamics of short-term movements via the differences between points on the tangent spaces to the manifolds, rather than the points themselves. We then significantly relax the trajectory-smoothness assumption of the conventional manifold-based trajectory modeling method and model longer-term dynamics using a statistical observation model within sequential inference approaches.
Finally, we incorporate the tangent delta descriptor with the sequential inference approaches and present a hierarchical representation architecture to cover the full period of facial movement occurrence. The proposed motion elucidation and description approach is validated by a series of experiments on publicly available datasets for the example tasks of micro-expression recognition and visual speech recognition.
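The tangent-space difference idea can be illustrated on one concrete manifold. Assuming, purely for illustration, that each short-term observation is a symmetric positive-definite (SPD) matrix and that the log-Euclidean map is used as the tangent map (the paper does not fix this choice), a "tangent delta" descriptor looks like:

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

def tangent_delta(S1, S2):
    """Short-term dynamics descriptor: the difference between two manifold
    points taken in the tangent (log) space rather than between the raw
    points, which discounts the static identity/environment component."""
    return spd_log(S2) - spd_log(S1)
```

The delta of two identical observations vanishes, so only the movement between frames contributes.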

Steganographer Detection via Multi-Scale Embedding Probability Estimation

Steganographer detection aims to identify the guilty user who utilizes steganographic methods to hide secret information in multimedia data, especially image data, spread among a large number of innocent users on social networks. The true embedding probability map illustrates the probability distribution of embedding secret information in the corresponding images under specific steganographic methods and settings, and has been successfully used as guidance for content-adaptive steganographic and steganalytic methods. Unfortunately, in real-world situations, the detailed steganographic settings adopted by the guilty user cannot be known in advance, so an automatic embedding probability estimation method becomes necessary. In this paper, we propose a novel content-adaptive steganographer detection method via embedding probability estimation. The embedding probability estimation is first formulated as a learning-based saliency detection problem, and the multi-scale estimated map is then integrated into a CNN to extract steganalytic features. Finally, the guilty user is detected via an efficient Gaussian vote method using the extracted steganalytic features. The experimental results show that the proposed method is superior to state-of-the-art methods in both the spatial and frequency domains.
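The voting stage can be sketched as a distance-based Gaussian vote over per-user feature sets: each image votes for the user it is least similar to, and the user collecting the most votes is flagged. This is an illustrative toy (operating on generic feature vectors rather than the paper's CNN steganalytic features), and `detect_guilty_user` is our name.

```python
import numpy as np

def detect_guilty_user(user_feats, sigma=1.0):
    """Gaussian-vote sketch for steganographer detection.

    user_feats: dict mapping user id -> (n_images, dim) feature array.
    Each image votes for the user whose feature centroid it is least
    similar to under a Gaussian kernel; an anomalous (guilty) user's
    centroid is far from everyone else's images, so it accumulates votes."""
    users = list(user_feats)
    centroids = {u: np.mean(f, axis=0) for u, f in user_feats.items()}
    votes = {u: 0 for u in users}
    for feats in user_feats.values():
        for x in feats:
            sims = {v: np.exp(-np.sum((x - centroids[v]) ** 2)
                              / (2 * sigma ** 2)) for v in users}
            votes[min(sims, key=sims.get)] += 1   # vote for least-similar user
    return max(votes, key=votes.get)
```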

Proposal Complementary Action Detection

Temporal action detection requires not only correct classification but also accurate detection of the start and end times of each action. However, traditional approaches typically employ sliding windows or actionness scores to predict actions, and such models are difficult to train end-to-end. In this paper, we pursue a different, end-to-end approach that computes action probabilities directly through a single network as part of the results. We present a novel proposal complementary action detector (PCAD) to deal with video streams under continuous, untrimmed conditions. Our approach first uses a simple fully 3D convolutional (Conv3D) network to encode the video streams and then generates candidate temporal proposals for activities by using anchor segments. To generate more precise proposals, we also design a boundary proposal network (BPN) to offer complementary information for the candidate proposals. Finally, we learn an efficient classifier to classify the generated proposals into different activities and refine their temporal boundaries at the same time. Our model can be trained end-to-end by jointly optimizing classification loss and regression loss. When evaluated on the THUMOS'14 detection benchmark, PCAD achieves state-of-the-art performance among high-speed models.

Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey

The rapid development of digital photography and social networks has generated a rapidly growing volume of multimedia data (i.e., images, music, and video), resulting in a great demand for managing, retrieving, and understanding these data. Affective computing (AC) of these data can help to understand human behaviors and enable wide applications. In this article, we comprehensively review the state-of-the-art AC technologies for large-scale heterogeneous multimedia data. We begin with an introduction to the key emotion representation models that have been widely employed in AC. Secondly, we briefly describe the available datasets for performing AC evaluation. We then summarize and compare representative approaches to AC for different multimedia types, i.e., images, music, videos, and multimodal data, focusing on both handcrafted-feature-based methods and deep learning methods. Finally, we discuss some challenges and potential directions for future research.

DenseNet-201 based deep neural network with composite learning factor and precomputation for multiple sclerosis classification

(Aim) Multiple sclerosis is a neurological condition that may cause neurological disability. To identify multiple sclerosis more accurately, this paper proposes a new transfer-learning-based approach. (Method) The DenseNet-121, DenseNet-169, and DenseNet-201 neural networks were compared. In addition, we proposed a composite learning factor (CLF) that assigns different learning factors to three types of layers: early frozen layers, middle layers, and late newly replaced layers. How to allocate layers to these three groups remains an open question, so four transfer learning settings (viz., Settings A, B, C, and D) were tested and compared. A precomputation method was utilized to reduce the storage burden and accelerate the program. (Results) We observed that DenseNet-201-D achieved the best performance. The sensitivity, specificity, and accuracy of DenseNet-201-D were 98.27 ± 0.58, 98.35 ± 0.69, and 98.31 ± 0.53, respectively. (Conclusion) Our method gives better performance than state-of-the-art approaches. Furthermore, the composite learning factor gives superior results to the traditional simple learning factor (SLF) strategy.

Efficient Image Hashing with Invariant Vector Distance for Copy Detection

Image hashing is an efficient multimedia security technique for image content protection. It maps an image into a compact content-based code that denotes the image itself. While most existing algorithms focus on improving the trade-off between robustness and discrimination, little attention has been paid to invariance under common geometric operations, which leaves them quite fragile to geometric distortions when applied to image copy detection. In this paper, a novel and effective image hashing method is proposed based on invariant vector distances in both the spatial and frequency domains. First, the image is preprocessed by several joint operations to extract robust features. Then, the preprocessed image is randomly divided into several overlapping blocks under a secret key, and two different feature matrices are obtained separately in the spatial domain and the frequency domain through invariant moments and low-frequency discrete cosine transform coefficients. Furthermore, the invariant distances between vectors in the feature matrices are calculated and quantized to form a compact hash code. We conduct various experiments to demonstrate that the proposed hashing not only achieves a good trade-off between robustness and discrimination, but also resists most geometric distortions in image copy detection. In addition, both receiver operating characteristic curve comparisons and mean average precision in copy detection clearly illustrate that the proposed hashing method outperforms state-of-the-art algorithms.
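The frequency-domain half of the pipeline can be sketched as follows: low-frequency DCT coefficients per block form feature vectors, and the distances between consecutive vectors are quantized into bits. This toy version omits the secret-key block randomization and the spatial-domain invariant moments; `block_dct_hash` is our name.

```python
import numpy as np

def dct2(block):
    """2D DCT-II of a square block via the separable orthonormal matrix form."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0] /= np.sqrt(2.0)
    return D @ block @ D.T

def block_dct_hash(img, block=8, n_low=4):
    """Toy perceptual hash: low-frequency DCT magnitudes per block, then the
    distances between consecutive block-feature vectors quantized to bits."""
    h, w = (img.shape[0] // block) * block, (img.shape[1] // block) * block
    feats = []
    for r in range(0, h, block):
        for c in range(0, w, block):
            coef = dct2(img[r:r + block, c:c + block].astype(float))
            feats.append(np.abs(coef[:n_low, :n_low]).ravel()[1:])  # skip DC
    feats = np.array(feats)
    d = np.linalg.norm(feats[1:] - feats[:-1], axis=1)   # vector distances
    return (d > np.median(d)).astype(np.uint8)           # one bit per distance
```

Hashing distances between feature vectors, rather than the vectors themselves, is what the abstract credits with the added geometric robustness.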

AMIL: Adversarial Multi-Instance Learning for Human Pose Estimation

Human pose estimation has an important impact on a wide range of applications, from human-computer interfaces to surveillance and content-based video retrieval. In human pose estimation, joint occlusions and overlapping human bodies often result in deviated pose predictions. To address these problems, we present an innovative structure-aware network that incorporates priors on the structure of human bodies during the training of the network. Typically, learning such constraints is a challenging task. Instead, we propose generative adversarial networks as our learning model, in which we design two residual multiple-instance learning (MIL) models with identical architecture, one used as the generator and the other as the discriminator. The discriminator's task is to distinguish the actual poses from the fake ones. If the pose generator generates results that the discriminator is not able to distinguish from real ones, the model has successfully learned the priors. In the proposed model, the discriminator differentiates the ground-truth heatmaps from the generated ones, and the adversarial loss back-propagates to the generator. This procedure assists the generator to learn reasonable body configurations and proves advantageous for improving prediction accuracy. Meanwhile, we propose a novel function for MIL: an adjustable structure for both instance selection and modeling that appropriately passes information between instances in a single bag. In the proposed residual MIL neural network, the pooling operation frequently updates each instance's contribution to its bag. The proposed pooling-based adversarial residual multi-instance neural network has been validated on two datasets for the human pose estimation task and successfully outperforms other state-of-the-art models.

LFGAN: 4D Light Field Synthesis from a Single RGB Image

We present a deep neural network called Light Field GAN (LFGAN) that synthesizes a 4D light field from a single 2D RGB image. We generate light fields using a single image super-resolution (SISR) technique based on two important observations. First, the small baselines give rise to high similarity between the full light field image and each sub-aperture view. Second, the occlusion edge at any spatial coordinate of a sub-aperture view has the same orientation as the occlusion edge at the corresponding angular patch, implying that occlusion information in the angular domain can be inferred from the sub-aperture local information. We employ the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) to learn the color and depth information from light field datasets. The network can generate a plausible 4D light field comprising 8 × 8 angular views from a single sub-aperture 2D image. We propose new loss terms, namely epipolar plane image (EPI) and brightness regularization losses, as well as a novel multi-stage training framework that introduces the loss terms at different stages to generate superior light fields. The EPI loss reinforces the network to learn the geometric features of the light fields, and the brightness regularization loss preserves brightness consistency across different sub-aperture views. Two datasets have been used to evaluate our method: in addition to an existing light field dataset capturing scenes of flowers and plants, we have built a large dataset of toy animals consisting of 2,100 light fields captured with a plenoptic camera. We have performed comprehensive ablation studies to evaluate the effects of individual loss terms and the multi-stage training strategy, and compared LFGAN with other state-of-the-art techniques. Qualitative and quantitative evaluations demonstrate that LFGAN can effectively estimate complex occlusions and geometry in challenging scenes and outperforms existing techniques.
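The two loss terms can be sketched on a light field stored as a (U, V, H, W) array of angular views. These are plausible numpy formulations consistent with the abstract's description, not the paper's exact definitions: the brightness term penalizes per-view mean-brightness drift, and the EPI term compares gradients over epipolar-plane slices.

```python
import numpy as np

def brightness_regularization(light_field):
    """Variance of per-view mean brightness across the (U, V) angular grid;
    zero when every sub-aperture view has the same average brightness."""
    means = light_field.mean(axis=(2, 3))          # (U, V) per-view means
    return float(np.var(means))

def epi_loss(pred, gt):
    """L1 difference of gradients over horizontal epipolar-plane images
    (slices over the angular u and spatial x axes for each fixed v, y)."""
    epi_p = np.transpose(pred, (1, 2, 0, 3))       # (V, H, U, W)
    epi_g = np.transpose(gt, (1, 2, 0, 3))
    gp = np.gradient(epi_p, axis=(2, 3))           # slopes encode scene depth
    gg = np.gradient(epi_g, axis=(2, 3))
    return float(sum(np.abs(a - b).mean() for a, b in zip(gp, gg)))
```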

Features-Enhanced Multi-attribute Estimation with Convolutional Tensor Correlation Fusion Network

To achieve robust facial attribute estimation, a hierarchical prediction system referred to as the Tensor Correlation Fusion Network (TCFN) is proposed. The system includes feature extraction, correlation excavation among facial attribute features, score fusion, and multi-attribute prediction. Subnetworks (Age-Net, Gender-Net, Race-Net, and Smile-Net) extract the corresponding features, while Main-Net extracts features not only from the input image but also from the corresponding pooling layers of the subnetworks. Dynamic tensor canonical correlation analysis (DTCCA) is proposed to explore the correlation of different targets' features in the F7 layers. Then, for the binary classifications of gender, race, and smile, robust decisions are achieved by fusing the results of the subnetworks with those of TCFN, while for age prediction, the facial image is first classified into one of several age groups and an ELM regressor then performs the final age estimation. Experimental results on benchmarks with multiple face attributes (MORPH-II, Adience, LAP-2016, and CelebA) show that the proposed approach has superior performance compared to the state of the art.

Affective Content-aware Adaptation Scheme on QoE Optimization of Adaptive Streaming over HTTP

The paper presents a novel affective content-aware adaptation scheme (ACAA) to optimize QoE for adaptive video streaming over HTTP. Most existing HTTP-based adaptive streaming schemes conduct video bit-rate adaptation based on an estimation of available network resources, ignoring user preference for the affective content (AC) embedded in the video data streamed over the network. Since personal demands for AC vary greatly among viewers, satisfying individual affective demand is critical to improving QoE in commercial video services. However, the results of video affective analysis cannot be applied to a current adaptive streaming scheme directly. Considering the AC distributions in a user's viewing history and across all streaming segments, AC relevancy can be inferred as an affective metric for the AC-related segments. We therefore propose an ACAA scheme that optimizes QoE for user-desired affective content while taking into account both network status and affective relevancy. We have implemented the ACAA scheme in a realistic trace-based evaluation and compared its performance, in terms of network performance and Quality of Experience (QoE), with that of Probe and Adapt (PANDA), buffer-based adaptation (BBA), and Model Predictive Control (MPC). Experimental results show that ACAA can preserve available buffer time for upcoming affective content matching the viewer's individual preference, achieving better QoE on affective content than on normal content while keeping the overall QoE satisfactory.
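A toy version of affect-aware bitrate selection might look like the following. This is our illustrative sketch, not the ACAA algorithm: the weighting parameter `beta` and the stall-protection rule are invented for the example, and bitrates are assumed sorted in ascending order.

```python
def select_bitrate(bitrates, bandwidth, buffer_s, ac_relevancy,
                   beta=0.5, min_buffer=5.0):
    """Toy affect-aware adaptation: pick the highest sustainable bitrate,
    granting extra bandwidth headroom to segments whose affective
    relevancy (in [0, 1]) matches the viewer's preference.

    bitrates: ascending list of available bitrates (same units as bandwidth).
    """
    budget = bandwidth * (1.0 + beta * ac_relevancy)   # favor preferred AC
    feasible = [b for b in bitrates if b <= budget]
    if buffer_s < min_buffer:                          # protect against stalls
        feasible = feasible[:1] if feasible else []
    return feasible[-1] if feasible else min(bitrates)
```

A segment the viewer cares about (`ac_relevancy` near 1) is allowed to spend buffer headroom on a higher rung of the bitrate ladder than a neutral segment under the same measured bandwidth.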

Machine learning techniques for the diagnosis of Alzheimer's disease: A review

Alzheimer's disease is an incurable neurodegenerative disease primarily affecting the elderly population. Efficient automated techniques are needed for early diagnosis of Alzheimer's. Many novel approaches have been proposed by researchers for the classification of Alzheimer's disease. However, to develop more efficient learning techniques, a better understanding of the work done on Alzheimer's is needed. Here, we provide a review of 165 papers from 2001-2019 using various feature extraction and machine learning techniques. The machine learning techniques are surveyed under three main categories: support vector machines (SVM), artificial neural networks (ANN), and deep learning (DL) together with ensemble methods. We present a detailed review of these three approaches for Alzheimer's, with possible future directions.

Video Retrieval with Similarity-Preserving Deep Temporal Hashing

This paper aims to develop an efficient Content-Based Video Retrieval (CBVR) system by hashing videos into short binary codes. This is an appealing research topic with increasing demand in an Internet era when massive numbers of videos are uploaded to websites every day. The main challenge of this task is how to discriminatively map video sequences to compact hash codes while preserving their original similarity. Existing video hashing methods are usually built on two isolated steps: frame-pooling-based video feature extraction and hash code generation, which fail to fully explore the spatial-temporal properties of videos and inevitably cause severe information loss. To address these issues, in this paper we present an end-to-end video retrieval framework called the Similarity-Preserving Deep Temporal Hashing (SPDTH) network. Specifically, we design the hashing module as an encoder Recurrent Neural Network (RNN) equipped with stacked Gated Recurrent Units (GRUs). The benefit of our network is that it explicitly extracts the spatial-temporal properties of videos and yields compact hash codes in an end-to-end manner. We also introduce a structured ranking loss for deep network training that preserves intra-class similarity and inter-class separability, and we minimize the quantization loss between the real-valued outputs and the binary codes. Extensive experiments on several challenging datasets demonstrate that SPDTH consistently outperforms state-of-the-art video hashing methods.
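The two training objectives can be sketched in numpy on real-valued network outputs: a hinge-based ranking term that pushes intra-class code distances below inter-class distances by a margin, plus a quantization term that pushes codes toward ±1. This is a generic formulation consistent with the abstract, not the paper's exact loss.

```python
import numpy as np

def hashing_losses(codes, labels, margin=1.0, lam=0.1):
    """Ranking + quantization loss for similarity-preserving hashing.

    codes: (N, B) real-valued network outputs in [-1, 1].
    labels: (N,) class ids.  Ranking: every intra-class distance should be
    smaller than every inter-class distance by `margin`.  Quantization:
    codes should sit near the binary vertices {-1, +1}^B."""
    dists = np.sum((codes[:, None, :] - codes[None, :, :]) ** 2, axis=2)
    same = labels[:, None] == labels[None, :]
    rank = 0.0
    for i in range(len(codes)):
        pos = dists[i][same[i] & (np.arange(len(codes)) != i)]
        neg = dists[i][~same[i]]
        if len(pos) and len(neg):
            rank += np.maximum(0.0, pos[:, None] - neg[None, :] + margin).mean()
    quant = np.mean((codes - np.sign(codes)) ** 2)
    return rank / len(codes) + lam * quant
```

At retrieval time the binary code is simply `sign(codes)`, so a small quantization term means binarization loses little of what the ranking term learned.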

Autonomous Semantic Community Detection via Adaptively Weighted Low-rank Approximation

Identification of semantic community structures is important for understanding the interactions and sentiments of different groups of people. A robust community detection method needs to autonomously determine the number of communities as well as the community structure for a given network. Nonnegative matrix factorization (NMF), a component decomposition approach, has been extensively used for community detection. However, the existing NMF-based methods require the number of communities to be determined a priori, limiting their applicability in practice. Here, we develop a novel NMF-based method to autonomously determine the number of communities and the community structure simultaneously. In our method, we use an initial number of communities, larger than the actual number, in the NMF formulation, and then suppress some of the communities by introducing an adaptively weighted group-sparse low-rank regularization, deriving the target number of communities and, at the same time, the corresponding community structure. Our method is efficient and does not increase the complexity of the original NMF method. We thoroughly examine the new method, showing its superior performance over several competing methods on synthetic and large real-world networks.
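The suppression mechanism can be sketched with symmetric NMF and a column-wise adaptive penalty: starting from more columns than communities, columns with small norm receive ever larger weights and shrink toward zero. This is a simplified illustration (an adaptively weighted l2 column penalty inside multiplicative updates), not the paper's exact group-sparse low-rank regularizer; `auto_nmf_communities` is our name.

```python
import numpy as np

def auto_nmf_communities(A, k_max=10, iters=150, gamma=0.05, seed=0, tol=1e-3):
    """Over-complete symmetric NMF (A ~ W W^T) with adaptive column suppression.

    Columns of W with small norm get larger penalty weights each iteration
    and shrink toward zero; surviving columns are the detected communities."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W = rng.random((n, k_max)) + 1e-3
    for _ in range(iters):
        w = 1.0 / (np.linalg.norm(W, axis=0) + 1e-6)   # adaptive weights
        num = A @ W
        den = W @ (W.T @ W) + gamma * W * w[None, :] + 1e-9
        W *= num / den                                 # multiplicative update
    active = np.linalg.norm(W, axis=0) > tol
    return W[:, active]           # one column per surviving community
```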

A Hierarchical CNN-RNN Approach for Visual Emotion Classification

Visual emotion classification aims to predict people's emotional reactions to given visual content. Visual stimuli from low level to high level, including contrast, color, texture, scene, object, association, etc., can all affect human emotion. However, most existing methods treat different levels of stimuli as independent components without effectively fusing them. This paper proposes a hierarchical CNN-RNN approach to predict emotion from the fused stimuli by exploiting the dependencies among different-level features. First, we introduce a dual CNN network to extract different levels of features, where two related loss functions are designed to learn the feature representations within a multi-task learning framework. Further, to model the dependencies between the low-level and high-level features, a stacked bidirectional RNN is proposed to fuse the learned features from the dual CNN network. Extensive experiments on one large-scale and three small-scale datasets show that our approach outperforms state-of-the-art methods, and ablation experiments demonstrate the effectiveness of the different modules of our model.

Image/Video Restoration via Multiplanar Autoregressive Model and Low-Rank Optimization

In this paper, we introduce an image/video restoration approach that utilizes high-dimensional similarity in images and videos. After grouping similar patches from neighboring frames, we propose to build a multiplanar autoregressive (AR) model to exploit the correlation in cross-dimensional planes of the patch group, which has long been neglected by previous AR models. To further utilize the nonlocal self-similarity in images/videos, a joint multiplanar AR and low-rank based approach (MARLow) is proposed to reconstruct patch groups more effectively. Moreover, for video restoration, the temporal smoothness of the restored video is constrained by a Markov random field (MRF), which encodes a priori knowledge about the consistency of patches from neighboring frames. Specifically, we treat different restoration results (from different patch groups) of a certain patch as labels of an MRF, and impose temporal consistency among these restored patches. Besides image and video restoration, the proposed method is also suitable for other restoration applications such as interpolation and text removal. Extensive experimental results demonstrate that the proposed approach obtains encouraging performance compared with state-of-the-art methods.
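The low-rank half of such patch-group methods rests on a simple fact: similar patches stacked as rows form a nearly low-rank matrix, so truncating its SVD suppresses noise while keeping the shared structure. A minimal sketch (plain rank truncation, without MARLow's joint AR term or singular-value weighting):

```python
import numpy as np

def low_rank_denoise(patch_group, rank):
    """Best rank-r approximation of a stacked patch group (rows = patches)
    via truncated SVD; noise spread across all singular directions is
    discarded, the shared low-rank structure is kept."""
    U, s, Vt = np.linalg.svd(patch_group, full_matrices=False)
    s[rank:] = 0.0                       # keep only the leading singular values
    return (U * s) @ Vt
```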

A Benchmark Dataset and Comparison Study for Multi-Modal Human Action Analytics

Large-scale benchmarks provide a solid foundation for the development of action analytics. Most previous activity benchmarks focus on analyzing actions in RGB videos; there is a lack of large-scale, high-quality benchmarks for multi-modal action analytics. In this paper, we introduce the PKU Multi-Modal Dataset (PKU-MMD), a new large-scale benchmark for multi-modal human action analytics. It consists of about 28,000 action instances and 5.4 million frames in total, and provides high-quality multi-modal data sources, including RGB, depth, infrared radiation (IR), and skeletons. To make PKU-MMD more practical, our dataset comprises two subsets under different settings for action understanding, namely Part I and Part II. Part I contains 1,076 untrimmed video sequences with 51 action classes performed by 66 subjects, while Part II contains 1,009 untrimmed video sequences with 41 action classes performed by 13 subjects. Compared to Part I, Part II is more challenging due to short action intervals, concurrent actions, and heavy occlusion. PKU-MMD can be leveraged in two scenarios: action recognition with trimmed video clips and action detection with untrimmed video sequences. For each scenario, we provide benchmark performance on both subsets by evaluating different methods with different modalities under two evaluation protocols. Experimental results show that PKU-MMD poses a significant challenge to many state-of-the-art methods. We further illustrate that features learned on PKU-MMD transfer well to other datasets. We believe this large-scale dataset will boost research in the field of action analytics for the community.

Visual Attention Analysis and Prediction on Human Faces for Children with Autism Spectrum Disorder

The focus of this article is to analyze and predict the visual attention of children with Autism Spectrum Disorder (ASD) when looking at human faces. Social difficulties are the hallmark features of ASD and lead to atypical visual attention toward various stimuli, especially human faces. Learning the visual attention of children with ASD could contribute to related research in medical science, psychology, and education. We first construct a Visual Attention on Faces for Autism Spectrum Disorder (VAFA) database, which consists of 300 natural scene images with human faces and the corresponding eye movement data collected from 13 children with ASD. Compared with matched typically developing (TD) controls, we quantify atypical visual attention to human faces in ASD. Statistics show that high-level factors such as face size, facial features, face pose, and facial emotion have different impacts on the visual attention of children with ASD. Combining feature maps extracted from state-of-the-art saliency models, we obtain a visual attention model on human faces for children with ASD. The proposed model shows the best performance among all competitors. With the help of our model, researchers in related fields could design specialized educational content containing human faces for children with ASD, or build models for rapid ASD screening from eye movement data.
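The abstract describes combining feature maps from existing saliency models into an ASD-specific attention model. A minimal sketch of such a combination step, assuming a simple weighted linear fusion of per-pixel maps (the paper's actual fitting of weights to gaze data is not specified here):

```python
def combine_saliency_maps(maps, weights):
    """Weighted combination of saliency feature maps (each a flat list of
    per-pixel values) into one attention map. In the paper's setting, the
    weights would be fitted to the recorded eye movement data of children
    with ASD; here they are given directly for illustration."""
    combined = [sum(w * m[i] for w, m in zip(weights, maps))
                for i in range(len(maps[0]))]
    # Normalize to [0, 1] so maps are comparable across images.
    lo, hi = min(combined), max(combined)
    rng = hi - lo
    return [(v - lo) / rng if rng else 0.0 for v in combined]

# Two hypothetical 3-pixel feature maps, fused with weights 0.7 / 0.3.
asd_map = combine_saliency_maps([[0.0, 1.0, 0.5], [1.0, 0.0, 0.5]],
                                [0.7, 0.3])
```

The normalization makes the fused map directly comparable to a fixation density map when evaluating against recorded gaze.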

ACMNet: Adaptive Confidence Matching Network for Human Behavior Analysis via Cross-Modal Retrieval

Cross-modality human behavior analysis has attracted much attention from both academia and industry. In this paper, we focus on the cross-modality image-text retrieval problem for human behavior analysis, which learns a common latent space for cross-modality data and thus benefits the understanding of human behavior with data from different modalities. Existing state-of-the-art cross-modality image-text retrieval models tend to be fine-grained region-word matching approaches: they begin by measuring similarities for each image region or text word and then aggregate them to estimate the global image-text similarity. However, such fine-grained approaches often encounter a similarity bias problem, because they only consider matched text words for an image region (or matched image regions for a text word) in the similarity calculation, while ignoring unmatched words/regions, which might still be salient enough to affect the global image-text similarity. In this paper, we propose an Adaptive Confidence Matching Network (ACMNet), itself a fine-grained matching approach, to effectively deal with this similarity bias. Apart from calculating the local similarity for each region (or word) with its matched words (or regions), ACMNet introduces a confidence score for the local similarity by leveraging the global text (or image) information, which is expected to help measure the semantic relatedness of the region (or word) to the whole text (or image). ACMNet then incorporates the confidence scores together with the local similarities in estimating the global image-text similarity. To verify the effectiveness of ACMNet, we conduct extensive experiments and comparisons with state-of-the-art methods on two benchmark datasets, i.e., Flickr30k and MS COCO. Experimental results show that the proposed ACMNet outperforms state-of-the-art methods by a clear margin, which demonstrates its effectiveness in human behavior analysis and the reasonableness of tackling the similarity bias issue.
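The aggregation the abstract describes, local similarities weighted by confidence scores, can be sketched as follows. This is an illustrative toy, not ACMNet itself: in the real model both the similarities and the confidences are learned from region/word features, whereas here they are supplied directly:

```python
def global_similarity(local_sims, confidences):
    """Confidence-weighted aggregation of local region-word similarities
    into one global image-text similarity. A region/word that is unmatched
    but still salient (high confidence, low similarity) pulls the global
    score down, which plain averaging over matched pairs would miss."""
    total_conf = sum(confidences)
    if total_conf == 0:
        return 0.0
    return sum(s * c for s, c in zip(local_sims, confidences)) / total_conf

# Two well-matched regions plus one salient-but-unmatched region.
plain = (0.9 + 0.8) / 2                                   # ignores the third region
weighted = global_similarity([0.9, 0.8, 0.1], [1.0, 1.0, 1.0])
```

With equal confidences this reduces to a mean over all regions; the interesting behavior comes from the confidence scores discounting regions that are semantically unrelated to the whole text.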

Intelligent Classification and Analysis of Essential Genes Species Using Quantitative Methods

The significance of the word "essential" needs no further clarification. Essential genes are considered from the perspective of the evolution of different organisms; this is complicated, however, because we must distinguish between essential cellular processes, essential protein functions, and essential genes. There is also a need to identify whether one set of growth conditions may be replaced by another. It is also contended that most genes are essential in the natural selection process. In this article, we apply intelligent methods to classify the essential genes of four species: Human, Arabidopsis thaliana, Drosophila melanogaster, and Danio rerio. The primary aim is to understand the distributions of purines and pyrimidines over the essential genes of these four species. Based on quantitative parameters (Shannon entropy, fractal dimension, Hurst exponent, and the distribution of purines and pyrimidines), ten different clusters have been generated for the four species, and some proximity results have been observed among the clusters of all four species.
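Of the quantitative parameters listed, the Shannon entropy of the purine/pyrimidine composition is the most self-contained to illustrate. A minimal sketch (the mapping A/G to purine and C/T to pyrimidine is standard; the paper's other parameters, fractal dimension and Hurst exponent, are not shown):

```python
from collections import Counter
from math import log2

def purine_pyrimidine_entropy(seq):
    """Shannon entropy (in bits) of the purine/pyrimidine composition of a
    DNA sequence: A and G map to purine (R), C and T to pyrimidine (Y)."""
    mapped = ['R' if b in 'AG' else 'Y' for b in seq.upper() if b in 'ACGT']
    counts = Counter(mapped)
    n = len(mapped)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A perfectly balanced sequence carries one full bit per symbol;
# an all-purine sequence carries zero.
h_balanced = purine_pyrimidine_entropy("AGCT")
h_uniform = purine_pyrimidine_entropy("AAGG")
```

Such per-gene entropy values, alongside the other parameters, are what the clustering in the paper would operate on.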

Multichannel Attention Refinement for Video Question Answering

Video Question Answering (VideoQA) extends image question answering (ImageQA) to the video domain. In this task, methods are required to give the correct answer after analyzing the provided video and question. Compared to ImageQA, the most distinctive aspect is the media type. Both tasks require understanding visual media, but VideoQA is much more challenging, mainly because of the complexity and diversity of videos. In particular, working with video requires modeling its inherent temporal structure and analyzing the diverse information it contains. In this paper, we propose to tackle the task from a multichannel perspective. Appearance, motion, and audio features are extracted from the video, and question-guided attentions are refined to generate the expressive clues that support the correct answer. We also incorporate relevant text information acquired from Wikipedia as an attempt to extend the capability of the method. Experiments on the TGIF-QA and ActivityNet-QA datasets show the advantages of our method over existing methods. We also demonstrate the effectiveness and interpretability of our method by analyzing the refined attention weights during the question answering procedure.
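The "question-guided attention" the abstract mentions can be sketched as a softmax over dot-product relevance scores between a question vector and per-channel (appearance/motion/audio) feature vectors. A schematic toy only; the paper's attention refinement is learned, and the vectors here are made up:

```python
from math import exp

def question_guided_attention(question_vec, channel_feats):
    """Soft attention over channel features, weighted by dot-product
    relevance to the question. Returns the attended feature (weighted
    sum of channels) and the attention weights themselves."""
    scores = [sum(q * f for q, f in zip(question_vec, feat))
              for feat in channel_feats]
    exps = [exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(channel_feats[0])
    clue = [sum(w * feat[i] for w, feat in zip(weights, channel_feats))
            for i in range(dim)]
    return clue, weights

# Hypothetical 2-D features for two channels; the question aligns with channel 0.
feats = [[1.0, 0.0], [0.0, 1.0]]
question = [1.0, 0.0]
clue, weights = question_guided_attention(question, feats)
```

Inspecting `weights` is exactly the kind of interpretability analysis the abstract refers to: the weights show which channel the model relied on for a given question.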

Embedding Distortion Analysis in Wavelet-Domain Watermarking

Imperceptibility and robustness are two complementary fundamental requirements of any watermarking algorithm. Low-strength watermarking yields high imperceptibility but exhibits poor robustness. High-strength watermarking schemes achieve good robustness but often introduce distortions resulting in poor visual quality in the host image. In this paper, we analyse the embedding distortion of wavelet-based watermarking schemes. We derive the relationship between the distortion, measured as mean square error (MSE), and the watermark embedding modification, and propose a linear proportionality between the MSE and the sum of the energy of the wavelet coefficients selected for watermark embedding modification. The initial proposition assumes the orthonormality of the discrete wavelet transform; it is then extended to non-orthonormal wavelet kernels using a weighting parameter that follows the energy conservation theorems of wavelet frames. The proposed analysis is verified experimentally for non-blind as well as blind watermarking schemes. Such a model is useful for finding optimal input parameters, including the wavelet kernel, coefficient selection, and subband choice, for wavelet-domain image watermarking.

Hybrid Wolf-Bat Algorithm for Optimisation of Connection Weights in Multi-Layer Perceptron

In any neural network, the weights act as parameters for determining the output(s) from a set of inputs: they are used to compute the activation values of the nodes of one layer from the values of the previous layer. Finding the ideal set of weights for training a multilayer perceptron so that it minimizes the classification error is a well-known optimization problem. This paper proposes the Hybrid Wolf-Bat algorithm, a novel optimization algorithm, as a solution to this problem. The proposed algorithm is a hybrid of two existing nature-inspired algorithms, the Grey Wolf Optimization algorithm and the Bat algorithm. This novel approach is tested on ten different medical datasets obtained from the UCI machine learning repository. The results of the proposed algorithm are compared with those of four recently developed nature-inspired algorithms, namely Grey Wolf Optimization (GWO), Cuckoo Search (CS), the Bat Algorithm (BA), and the Whale Optimization Algorithm (WOA), along with the standard back-propagation training method. As the results show, the proposed method is better in terms of both speed of convergence and accuracy, and outperforms the other bio-inspired algorithms.
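The general recipe behind all of these metaheuristics is the same: treat the network's weight vector as a candidate solution and the classification error as the fitness to minimize. A deliberately minimal sketch of that loop, keeping only GWO-style leader-following (the actual Wolf-Bat hybrid also incorporates BA's frequency, loudness, and pulse-rate updates, which are omitted here), on a single-neuron toy classifier:

```python
import random

def fitness(weights, data):
    """Classification error of a single-neuron threshold classifier
    on (features, label) pairs -- the quantity the swarm minimizes."""
    errs = 0
    for x, label in data:
        out = sum(w * xi for w, xi in zip(weights, x))
        errs += (1 if out > 0 else 0) != label
    return errs / len(data)

def swarm_train(data, dim, pop=20, iters=100, seed=0):
    """Population-based weight search: each candidate takes a random step
    relative to the current best, and the best is updated greedily."""
    rng = random.Random(seed)
    popn = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop)]
    best = min(popn, key=lambda w: fitness(w, data))
    for _ in range(iters):
        for w in popn:
            cand = [b + rng.gauss(0, 0.3) * (b - wi)
                    for wi, b in zip(w, best)]
            if fitness(cand, data) <= fitness(best, data):
                best = cand
    return best

# Linearly separable toy data: label 1 iff feature 0 exceeds feature 1.
data = [((1.0, 0.0), 1), ((0.0, 1.0), 0), ((2.0, 1.0), 1), ((1.0, 2.0), 0)]
w = swarm_train(data, dim=2)
```

For a real MLP, `fitness` would run a full forward pass over the flattened weight vector; the search loop itself is unchanged, which is why these algorithms slot in as drop-in replacements for back-propagation.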

Modeling Long-Term Dependencies from Videos Using Deep Multiplicative Neural Networks

Understanding the temporal dependencies of videos is fundamental for vision problems, but neural-network-based models remain insufficient in this field. In this paper, we propose novel Deep Multiplicative Neural Networks (DMNNs) for learning hierarchical long-term representations from video. DMNNs are built upon a multiplicative block that remembers the pairwise transformations between frames by using multiplicative interactions instead of regular weighted-sum ones. The block is slid over the time steps to update the memory of the network on each frame pair. A deep architecture can be implemented by stacking multiple layers of the sliding blocks. The multiplicative interactions lead to exact rather than approximate modeling of temporal dependencies, and the memory mechanism can remember temporal dependencies for an arbitrary length of time. The multiple layers output multi-level representations that reflect the multi-timescale structure of video. To address the difficulty of training DMNNs, we also derive a theoretically sound convergent method, which leads to fast and stable convergence. We demonstrate new state-of-the-art classification performance with the proposed networks on the UCF101 dataset, and their effectiveness at capturing complicated temporal dependencies on a variety of synthetic datasets.
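The key contrast the abstract draws, multiplicative interactions between a frame pair versus a weighted sum, can be illustrated with a toy block slid over a feature sequence. This is a caricature of the idea only: the actual DMNN block uses learned multiplicative weight tensors, not raw elementwise products, and the decay constant here is made up:

```python
from math import tanh

def multiplicative_block(prev_frame, cur_frame, memory, decay=0.5):
    """One sliding step: model the frame-pair transformation via an
    elementwise product (the multiplicative interaction) and fold the
    result into a running memory of past interactions."""
    interaction = [tanh(a * b) for a, b in zip(prev_frame, cur_frame)]
    return [decay * m + (1 - decay) * i for m, i in zip(memory, interaction)]

# Slide the block over a short "video" of 2-D feature frames.
frames = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
memory = [0.0, 0.0]
for prev, cur in zip(frames, frames[1:]):
    memory = multiplicative_block(prev, cur, memory)
```

Stacking several such sliding layers, each consuming the memory sequence of the layer below, gives the multi-timescale hierarchy the abstract describes.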

Pulmonary Nodule Detection Based on ISODATA-Improved Faster RCNN and 3D-CNN with Focal Loss

The early diagnosis of pulmonary cancer can significantly improve the survival rate of patients, and pulmonary nodule detection in computed tomography (CT) images plays an important role in it. In this paper, we propose a novel pulmonary nodule detection system based on convolutional neural networks (CNN). Our system consists of two stages: pulmonary nodule candidate detection and false positive reduction. For candidate detection, we introduce Focal Loss and the Iterative Self-Organizing Data Analysis Techniques Algorithm (ISODATA) into the Faster Region-based Convolutional Neural Network (Faster R-CNN) model. For false positive reduction, a three-dimensional convolutional neural network (3D-CNN) is employed to fully utilize the three-dimensional nature of CT images. Experiments were conducted on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset, and the results indicate that the proposed system achieves favorable performance on pulmonary nodule detection.
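Focal loss is a published formula (Lin et al.), so it can be written out exactly: FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. A minimal binary-classification sketch showing why it suits nodule detection, where easy negatives vastly outnumber nodules:

```python
from math import log

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: down-weights well-classified examples by the
    factor (1 - p_t)^gamma so training focuses on hard candidates.
    p is the predicted probability of class 1; y is the true label."""
    pt = p if y == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * log(pt)

# An easy example (p_t = 0.9) is damped by (0.1)^2 relative to plain
# cross-entropy; a hard example (p_t = 0.1) keeps nearly its full loss.
easy = focal_loss(0.9, 1)     # 0.01 * -log(0.9) ~ 0.00105
hard = focal_loss(0.1, 1)     # 0.81 * -log(0.1) ~ 1.865
ce_easy = -log(0.9)           # plain cross-entropy on the easy example
```

With gamma = 0 the formula reduces to ordinary cross-entropy, which is a useful sanity check when wiring it into a detector.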

Delving Deeper in Drone-Based Person Re-Id by Employing Deep Decision Forest and Attributes Fusion

Deep learning has revolutionized the field of computer vision and image processing. Its ability to extract compact image representations has taken the person re-identification problem to a new level. However, in most cases researchers focus on developing new approaches to extract more fruitful image representations and use them in the re-id task. Extra information about the images is rarely taken into account, because traditional person re-identification datasets usually do not have it. Nevertheless, research in multimodal machine learning has demonstrated that utilizing information from different sources leads to better performance. In this work, we demonstrate how the person re-identification problem can benefit from multimodal data. We used a UAV drone to collect and label a new person re-identification dataset composed of pedestrian images and their attributes. We manually annotated this dataset with attributes and, in contrast to recent research, we do not use a deep network to classify them. Instead, we employ the CBOW model to extract word embeddings from the text descriptions and fuse them with the features extracted from images. A deep neural decision forest is then used for pedestrian classification. Extensive experiments on the collected dataset demonstrate the effectiveness of the proposed model.
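The fusion step described above, image features joined with CBOW embeddings of the attribute words, can be sketched as mean-pooled word vectors concatenated with the visual feature vector. The dimensions and vectors below are hypothetical, and the downstream decision forest is omitted:

```python
def average_embedding(word_vectors):
    """Mean-pool the CBOW vectors of the attribute words into one
    fixed-size text vector."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

def fuse_modalities(image_feat, attribute_embedding):
    """Late fusion by concatenation: the joined vector is what the
    deep neural decision forest would classify."""
    return list(image_feat) + list(attribute_embedding)

# Hypothetical 3-D image features and 2-D word vectors for two attributes.
img = [0.2, 0.7, 0.1]
words = [[1.0, 0.0], [0.0, 1.0]]     # e.g. embeddings for "red", "backpack"
fused = fuse_modalities(img, average_embedding(words))
```

Concatenation is the simplest fusion choice; the point the paper makes is that even this extra attribute channel improves re-identification over images alone.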
