ACM Transactions on

Multimedia Computing, Communications, and Applications (TOMM)

Latest Articles

Introduction to the Special Issue on Face Analysis Applications

A Unified Tensor-based Active Appearance Model

Appearance variations result in many difficulties in face image analysis. To deal with this challenge, we present a Unified Tensor-based Active Appearance Model (UT-AAM) for jointly modelling the geometry and texture information of 2D faces. For each type of face information, namely shape and texture, we construct a unified tensor model capturing...

Synthesizing Facial Photometries and Corresponding Geometries Using Generative Adversarial Networks

Artificial data synthesis is currently a well-studied topic with useful applications in data...

U-Net Conditional GANs for Photo-Realistic and Identity-Preserving Facial Expression Synthesis

Facial expression synthesis (FES) is a challenging task since the expression changes are highly...

Efficient Face Alignment with Fast Normalization and Contour Fitting Loss

Face alignment is a key component of numerous face analysis tasks. In recent years, most existing methods have focused on designing high-performance...

Visual Attention Analysis and Prediction on Human Faces for Children with Autism Spectrum Disorder

The focus of this article is to analyze and predict the visual attention of children with Autism...

Features-Enhanced Multi-Attribute Estimation with Convolutional Tensor Correlation Fusion Network

To achieve robust facial attribute estimation, a hierarchical prediction system referred to as...




The ACM TOMM Nicholas D. Georganas Best Paper Award 2019 has been given to the paper: "Deep Bi-directional Cross-triplet Embedding for Online Clothing Shopping" (ACM TOMM vol.14 Issue 1, January 2018) by Shuhui Jiang, Yue Wu, Yun Fu. Details can be found in the press release.

Special issue call: "Recent Trends in Medical Data Security for e-health Applications". Call for papers. Submission deadline February 15th, 2020

Special issue call: "Advanced Approaches for Multiple Instance Learning on Multimedia Applications". Call for papers. Submission deadline March 30th, 2020

Special issue call: "Privacy and Security in Evolving Internet of Multimedia Things". Call for papers. Submission deadline March 31st, 2020

Special section call: "Fine-grained Visual Computing". Call for papers. Submission deadline March 31st, 2020

Special issue call: "Big Multi-modal Multimedia Data with Deep Analytics". Call for papers. Submission deadline April 15th, 2020

News archive
Forthcoming Articles

Exploring Disorder-aware Attention for Clinical Event Extraction

Cross-domain brain CT image smart segmentation via shared hidden space transfer FCM clustering

Active Balancing Mechanism for Imbalanced Medical Data in Deep Learning based Classification Models

CovLets: a Second Order Descriptor for Modeling Multiple Features

Unsupervised Learning of Human Action Categories in Still Images with Deep Representations

In this paper, we propose a novel method for unsupervised learning of human action categories in still images. Unlike previous methods, the proposed method explores distinctive information about actions directly from unlabeled image databases and learns discriminative deep representations in an unsupervised manner to categorize actions. With the proposed method, large action image collections can be utilized without manual annotations. Specifically, (i) to deal with the problem that unsupervised discriminative deep representations are difficult to learn, the proposed method builds a training dataset with surrogate labels from the unlabeled dataset, then learns discriminative representations by alternately updating the CNN parameters and the surrogate training dataset in an iterative manner; (ii) to explore the discriminatory information among different action categories, training batches for updating the CNN parameters are built with triplet groups, and the triplet loss function is introduced to update the CNN parameters; (iii) to learn more discriminative deep representations, a Random Forest classifier is adopted to update the surrogate training dataset, so that more beneficial triplet groups can be built with the updated surrogate training dataset. Extensive experiments on two benchmark action image datasets demonstrate the effectiveness of the proposed method.
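The triplet loss named in step (ii) can be illustrated with a minimal sketch. It assumes the common hinge form max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance; the abstract does not state the distance, the margin, or the embedding size, so the function name `triplet_loss` and the default margin are hypothetical.

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on plain Python lists (illustrative only).

    anchor and positive are embeddings that share a surrogate label;
    negative carries a different surrogate label. The margin value is
    an assumption, not taken from the paper.
    """
    def sq_dist(u, v):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that still violate the margin contribute to the parameter update.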

HGAN: Holistic Generative Adversarial Networks for 2D Image-based 3D Object Retrieval

In this paper, we propose a novel method to handle the 2D image-based 3D object retrieval problem. First, we extract a set of virtual views to represent each 3D object, and a soft-attention model is utilized to find the weight of each view so as to select one characteristic view for each 3D object. Second, we propose novel Holistic Generative Adversarial Networks (HGAN) to solve the cross-domain feature representation problem and bring the feature space of the virtual characteristic view closer to the feature space of real pictures. This effectively mitigates the distribution discrepancy between the 2D image domain and the 3D object domain. Finally, we utilize the generative model of HGAN to obtain the "virtual real image" of each 3D object and map the characteristic view of the 3D object and the real picture into the same feature space for retrieval. To demonstrate the performance of our approach, we set up a new dataset that includes pairs of 2D images and 3D objects, where the 3D objects are based on the ModelNet40 dataset. The experimental results demonstrate the superiority of our proposed method over the state-of-the-art methods.

AB-LSTM: Attention-Based Bidirectional LSTM Model for Scene Text Detection

Detection of scene text in arbitrary shapes is a challenging task in the field of computer vision. Most existing scene text detection methods use a rectangular or quadrangular bounding box to denote the detected text, which fails to accurately fit text with arbitrary shapes, such as curved text. In addition, recent progress on scene text detection has benefited from Fully Convolutional Networks. Text cues contained in multi-level convolutional features are complementary for detecting scene text objects, but how to exploit these multi-level features is still an open problem. To tackle the above issues, we propose an Attention-based Bidirectional Long Short-Term Memory (AB-LSTM) model for scene text detection. First, word stroke regions (WSRs) and text center blocks (TCBs) are extracted by two AB-LSTM models, respectively. Then, the union of WSRs and TCBs is used to represent text objects. To validate the effectiveness of the proposed method, we perform experiments on four public benchmarks: CTW1500, Total-text, ICDAR2013, and MSRA-TD500, and compare it with existing state-of-the-art methods. Experimental results demonstrate that the proposed method achieves competitive results and handles scene text with arbitrary shapes (horizontal, oriented, and curved) well.

Multi-scale Supervised Attentive Encoder-Decoder Network for Crowd Counting

Crowd counting is a popular topic with widespread applications. Currently, the biggest challenge to crowd counting is large-scale variation in objects. In this paper, we focus on overcoming this challenge by proposing a novel Attentive Encoder-Decoder Network (AEDN), which is supervised on multiple feature scales to conduct crowd counting via density estimation. This work has three main contributions. First, we augment the traditional encoder-decoder architecture with our proposed residual attention blocks, which, beyond skip-connected encoded features, further extend the decoded features with attentive features. AEDN is better at establishing long-range dependencies between the encoder and decoder, therefore promoting more effective fusion of multi-scale features for handling scale variations. Second, we design a new KL-divergence based distribution loss to supervise the scale-aware structural differences between two density maps, which complements the pixel-isolated MSE loss and better optimizes AEDN to generate high-quality density maps. Third, we adopt a multi-scale supervision scheme, such that multiple KL divergences and MSE losses are deployed at all decoding stages, providing more thorough supervision for different feature scales. Extensive experimental results on four public datasets, including ShanghaiTech Part A, ShanghaiTech Part B, UCF-CC-50 and UCF-QNRF, reveal the superiority and efficacy of the proposed method, which outperforms most state-of-the-art competitors.
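The combination of a pixel-wise MSE loss with a KL-divergence distribution loss, summed over several decoding scales, can be sketched as follows. This is only an illustration under stated assumptions: density maps are flattened to lists, each map is normalized to a probability distribution before the KL term is computed, and the weight `lam`, the smoothing constant `eps`, and all function names are hypothetical. The abstract does not give the exact formulation used by AEDN.

```python
import math

def mse_loss(pred, gt):
    """Mean squared error between two flattened density maps."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)

def kl_distribution_loss(pred, gt, eps=1e-8):
    """KL(gt || pred) after normalizing both maps to sum to one.

    eps avoids log(0) and division by zero for empty regions.
    """
    pred_sum = sum(pred) + eps * len(pred)
    gt_sum = sum(gt) + eps * len(gt)
    kl = 0.0
    for p, g in zip(pred, gt):
        q = (p + eps) / pred_sum   # predicted distribution
        t = (g + eps) / gt_sum     # ground-truth distribution
        kl += t * math.log(t / q)
    return kl

def multi_scale_loss(preds, gts, lam=1.0):
    """Sum of MSE + lam * KL over all supervised scales (one map per scale)."""
    return sum(mse_loss(p, g) + lam * kl_distribution_loss(p, g)
               for p, g in zip(preds, gts))
```

The MSE term treats each pixel in isolation, while the KL term compares how the density mass is distributed across the whole map, which is why the two are described as complementary.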

Spatial Preserved Graph Convolution Networks for Person Re-identification

Person re-identification is a very challenging task due to inter-class ambiguity caused by similar appearances, and large intra-class diversity caused by viewpoints, illuminations and poses. To address these challenges, in this paper, a graph convolution network based model for person re-identification is proposed to learn more discriminative feature embeddings, in which graph-structured relationships between person images and person parts are integrated together. Graph convolution networks extract common characteristics of the same person, while pyramid feature embedding exploits part relations and learns a stable representation for each person image. We achieve very competitive performance on three widely used datasets, indicating that the proposed approach significantly outperforms the baseline methods and achieves state-of-the-art performance.

Dissecting the Performance of VR Video Streaming Through the VR-EXP Experimentation Platform

To cope with the massive bandwidth demands of Virtual Reality (VR) video streaming, both the scientific community and the industry have been proposing optimization techniques such as viewport-aware streaming and tile-based adaptive bitrate heuristics. As most of the VR video traffic is expected to be delivered through mobile networks, a major problem arises: both the network performance and VR video optimization techniques have the potential to influence the video playout performance and the Quality of Experience (QoE). However, the interplay between them is neither trivial nor has it been properly investigated. To bridge this gap, in this paper we introduce VR-EXP, an open-source platform for carrying out VR video streaming performance evaluation. Furthermore, we consolidate a set of relevant VR video streaming techniques and evaluate them under variable network conditions, contributing to an in-depth understanding of what to expect when different combinations are employed. To the best of our knowledge, this is the first work to propose a systematic approach, accompanied by a software toolkit, which allows one to compare different optimization techniques under the same circumstances. Extensive evaluations carried out using realistic datasets demonstrate that VR-EXP is instrumental in providing valuable insights regarding the interplay between network performance and VR video streaming optimization techniques.

A Decision Support System with Intelligent Recommendation for Multi-Disciplinary Medical Treatment

Recent years have witnessed an emerging trend for improving disease treatment by forming multi-disciplinary medical teams. The collaboration among specialists from multiple medical domains has been shown to be significantly helpful for designing comprehensive and reliable regimens, especially for incurable diseases. Although this kind of multi-disciplinary treatment has been increasingly adopted by healthcare providers, a new challenge has been introduced to the decision-making process: how to efficiently and effectively develop final regimens by searching for candidate treatments and considering inputs from every expert. In this paper, we present a sophisticated decision support system called MdtDSS (a decision support system (DSS) for multi-disciplinary treatment (Mdt)), which is particularly developed to guide the collaborative decision-making in multi-disciplinary treatment scenarios. The system integrates a recommender system which aims to search for personalized candidates from a large-scale high-quality regimen pool, and a voting system which helps collect feedback from multiple specialists without potential bias. Our decision support system optimally combines machine intelligence and human experience and helps medical practitioners make informed and accountable regimen decisions. We deployed the proposed system in a large hospital in Shanghai, China, and collected real-world data on large-scale patient cases. The evaluation shows that the proposed system achieves outstanding results in terms of high-quality multi-disciplinary treatment.

Textual Entailment based Figure Summarization for Biomedical Articles

The current paper proposes a novel approach (FigSum++) for automatic figure summarization in biomedical scientific articles using a multi-objective evolutionary algorithm. The problem is treated as a binary optimization problem where relevant sentences in the summary for a given figure are selected based on various sentence scoring features: the textual entailment score between sentences in the summary and the figure's caption, the number of sentences referring to the figure, the semantic similarity between sentences and the figure's caption, the number of overlapping words between sentences and the figure's caption, etc. These features are optimized simultaneously using multi-objective binary differential evolution (MBDE). MBDE consists of a set of solutions, and each solution represents a subset of sentences to be selected in the summary. MBDE generally uses a single DE variant, but here an ensemble of two different DE variants, measuring diversity among solutions and convergence towards the global optimal solution, respectively, is employed for efficient search. Usually, in any summarization system, diversity amongst sentences (called anti-redundancy) in the summary is a very critical feature, and it is calculated in terms of similarity (like cosine similarity) among sentences. In this paper, a new way of measuring diversity in terms of textual entailment is proposed. To represent the sentences of the article in the form of numeric vectors, the recently proposed BioBERT, a pre-trained language model for biomedical text mining, is utilized. An ablation study is also presented to determine the importance of the different objective functions. For evaluation of the proposed technique, two benchmark biomedical datasets containing 91 and 84 figures, respectively, are considered. Our proposed system obtains 5% and 11% improvements in terms of F-measure metric over two datasets, respectively, in comparison to the state-of-the-art.
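The conventional similarity-based notion of anti-redundancy that the abstract contrasts with its entailment-based measure can be sketched as below. This shows only the cosine-similarity baseline on sentence vectors, not the textual entailment measure proposed in the paper; the function names and the use of a mean over sentence pairs are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all sentence pairs in a summary.

    A lower value indicates a more diverse (less redundant) summary.
    """
    pairs = [(i, j) for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    if not pairs:
        return 0.0
    return sum(cosine_similarity(vectors[i], vectors[j])
               for i, j in pairs) / len(pairs)
```

In an optimization setting such a score would be minimized (or its complement maximized) alongside the relevance features, so that selected sentences do not repeat each other.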

Random Playlists Smoothly Commuting Between Styles

Someone enjoys listening to playlists while commuting. He wants a different playlist of n songs each day, but always starting from Locked Out of Heaven, a Bruno Mars song. The list should progress in smooth transitions between successive and randomly selected songs until it ends up at Stairway to Heaven, a Led Zeppelin song. The challenge of automatically generating random and heterogeneous playlists is to find the appropriate balance among several conflicting goals. We propose two methods for solving this problem, ROPE and STRAW. When compared with state-of-the-art algorithms, our algorithms are the only ones that satisfy the following quality constraints: heterogeneity, smooth transitions, novelty, scalability, and usability. We demonstrate the usefulness of our proposed algorithms by applying them to a large collection of songs and make a prototype available.

Adaptive Chunklets and AQM for Higher Performance Content Streaming

Commercial streaming services such as Netflix and YouTube use proprietary HTTP-based adaptive streaming (HAS) techniques to deliver content to consumers worldwide. MPEG recently developed Dynamic Adaptive Streaming over HTTP (DASH) as a unifying standard for HAS-based streaming. In DASH systems, streaming clients employ adaptive bitrate (ABR) algorithms to maximise user Quality of Experience (QoE) under variable network conditions. In a typical Internet-enabled home, video streams have to compete with diverse application flows for the last-mile Internet Service Provider (ISP) bottleneck capacity. Under such circumstances, ABRs will only act upon the fraction of the network capacity that is available, leading to possible QoE degradation. We have previously proposed chunklets as an approach orthogonal to ABR which uses parallel connections for intra-video chunk retrieval. Chunklets effectively make more bandwidth available for ABRs in the presence of cross-traffic, especially in environments where Active Queue Management (AQM) schemes such as PIE and FQ-CoDel are deployed. However, chunklets consume valuable server/middlebox resources, which typically handle hundreds of thousands of requests/connections per second. In this paper, we propose "adaptive chunklets", a novel chunklet enhancement that dynamically tunes the number of concurrent connections. We demonstrate that the combination of adaptive chunkleting and FQ-CoDel is the most effective strategy. Our experiments show that adaptive chunklets can reduce the number of connections by almost 35% and consume almost 11% less bandwidth than fixed chunklets while providing the same QoE.

Tile-Based Adaptive Streaming for Virtual Reality Video

The increasing popularity of head-mounted devices and 360° video cameras allows content providers to provide virtual reality (VR) video streaming over the Internet, using a two-dimensional representation of the immersive content combined with traditional HTTP adaptive streaming (HAS) techniques. However, since only a limited part of the video (i.e., the viewport) is watched by the user, the available bandwidth is not optimally used. Recent studies have shown the benefits of adaptive tile-based video streaming; rather than sending the whole 360° video at once, the video is cut into temporal segments and spatial tiles, each of which can be requested at a different quality level. This allows prioritization of viewable video content, and thus results in an increased bandwidth utilization. Given the early stages of research, there are still a number of open challenges to unlock the full potential of adaptive tile-based VR streaming. The aim of this work is to provide an answer to several of these open research questions. Among others, we propose two tile-based rate adaptation heuristics for equirectangular VR video, which use the great-circle distance between the viewport center and the center of each of the tiles to decide upon the most appropriate quality representation. We also introduce a feedback loop in the quality decision process, which allows the client to revise prior decisions based on more recent information on the viewport location. Furthermore, we investigate the benefits of parallel TCP connections and the use of HTTP/2 as an application layer optimization. Through an extensive evaluation, we show that the proposed optimizations result in a significant improvement in terms of video quality (more than twice the time spent on the highest quality layer), compared to non-tiled HAS solutions.
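The great-circle distance between the viewport center and each tile center can be computed as in the sketch below. It uses the haversine formula on a unit sphere, with centers given as (longitude, latitude) pairs in degrees. The ranking helper `rank_tiles` is a hypothetical illustration of how such distances might order tiles for quality assignment; the two heuristics proposed in the paper are not specified in the abstract and are not reproduced here.

```python
import math

def great_circle_distance(lon1_deg, lat1_deg, lon2_deg, lat2_deg):
    """Angular distance in radians between two points on the unit sphere."""
    lon1, lat1, lon2, lat2 = map(
        math.radians, (lon1_deg, lat1_deg, lon2_deg, lat2_deg))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine formula; min() guards against rounding slightly above 1.
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(min(1.0, a)))

def rank_tiles(viewport_center, tile_centers):
    """Return tile indices ordered from nearest to farthest from the viewport.

    Hypothetical helper: a client could request higher quality for tiles
    that appear earlier in this ordering.
    """
    lon_v, lat_v = viewport_center
    return sorted(
        range(len(tile_centers)),
        key=lambda i: great_circle_distance(lon_v, lat_v, *tile_centers[i]))
```

Because the distance is measured on the sphere rather than on the flat equirectangular frame, a tile at longitude 350° is correctly treated as close to a viewport centered at 0°.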

Adaptive Exploration for Unsupervised Person Re-Identification

Due to domain bias, directly deploying a deep person re-identification (re-ID) model trained on one dataset often achieves considerably poor accuracy on another dataset. In this paper, we propose an Adaptive Exploration (AE) method to address the domain-shift problem for re-ID in an unsupervised manner. Specifically, in the target domain, the re-ID model is induced to 1) maximize distances between all person images and 2) minimize distances between similar person images. In the first case, by treating each person image as an individual class, a non-parametric classifier with a feature memory is exploited to encourage person images to move away from each other. In the second case, according to a similarity threshold, our method adaptively selects neighborhoods in the feature space for each person image. By treating these similar person images as the same class, the non-parametric classifier forces them to stay closer. However, a problem with this strategy is that, when an image has too many neighborhoods, it is more likely to attract other images as its neighborhoods. As a result, a minority of images may select a large number of neighborhoods while a majority of images have only a few neighborhoods. To address this issue, we additionally integrate a balance strategy into the adaptive selection. We evaluate our methods with two protocols. The first one is called "target-only re-ID", in which only the unlabeled target data is used for training. The second one is called "domain adaptive re-ID", in which both the source data and the target data are used. Experimental results on large-scale re-ID datasets demonstrate the effectiveness of our method. Our code has been released at

An Evaluation of Tile Selection Methods for Viewport Adaptive Streaming of 360-degree Video

360-degree video has become increasingly popular. For effective transmission of bandwidth-intensive 360-degree video over networks, viewport-adaptive streaming has been introduced. In this paper, we evaluate, for the first time, ten existing methods to understand the effectiveness of tile-based viewport adaptive streaming of 360-degree video. Experimental results show that tile-based methods can improve the average V-PSNR by up to 4.3 dB compared to a non-tiled method under low delay settings. Here, the V-PSNR is computed as the peak signal-to-noise ratio of the adapted viewport compared to the corresponding original viewport. Also, different methods show different tradeoffs between average viewport quality and viewport quality variations. In particular, the performance of most tile-based methods decreases quickly as the segment duration and/or buffer size increase for content with no main focus. Moreover, under long delay settings such as HTTP Adaptive Streaming, the simple non-tiled method is found to outperform most tile-based methods. For content with a strong viewing focus, the tile-based methods are found to be less influenced by the segment duration and the buffer size. In addition, a comparison of the performance of the tile selection methods using two popular viewport estimation methods is conducted. Interestingly, only a small difference is found in the performance of the tile selection methods. The findings of this study are useful for service providers to make decisions on deployment of streaming solutions.
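The V-PSNR definition above is the standard PSNR formula applied to viewport pixels, which can be sketched as follows. The sketch assumes 8-bit samples (peak value 255) and viewports already rendered and flattened to equal-length lists; the viewport extraction step and the function name `psnr` are assumptions, as the abstract does not describe them.

```python
import math

def psnr(reference, adapted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    reference: pixels of the original viewport
    adapted:   pixels of the viewport rendered from the adapted stream
    """
    if not reference or len(reference) != len(adapted):
        raise ValueError("viewports must be non-empty and equal in size")
    mse = sum((r - a) ** 2 for r, a in zip(reference, adapted)) / len(reference)
    if mse == 0:
        return math.inf  # identical viewports
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A gain of 4.3 dB, as reported for the low delay settings, corresponds to reducing the mean squared error of the viewport by a factor of roughly 2.7, since 10^(4.3/10) ≈ 2.69.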

DQ-DASH: A Queuing Theory Approach to Distributed Adaptive Video Streaming

The significant popularity of HTTP adaptive video streaming (HAS), such as MPEG-DASH, over the Internet has led to a stark increase in user expectations in terms of video quality and delivery robustness. This situation creates new challenges for content providers who must satisfy the Quality-of-Experience (QoE) requirements and demands of their customers over a best-effort network infrastructure. Unlike traditional single-server Dynamic Adaptive Streaming over HTTP (DASH), our new design (denoted DQ-DASH) leverages the availability of multiple servers by downloading segments in parallel. DQ-DASH streaming facilitates the aggregation of bandwidth from different servers and increases fault-tolerance and robustness through path diversity. The resulting resilience prevents clients from suffering QoE degradations when some of the servers become congested. The proposed solution uses an extended $M^{x}/D/1/K$ queuing theory based rate adaptation algorithm in conjunction with the request scheduler to download subsequent segments of the same quality through parallel requests to reduce quality fluctuations. DQ-DASH also helps to fully utilize the aggregate bandwidth from the servers and download the imminently required segment from the server with the highest throughput. DQ-DASH is robust in case of server bottleneck by maintaining a high user satisfaction. We have also analyzed the effect of buffer capacity and segment duration for multi-source video streaming.

Soul Dancer: Emotion-based Human Action Generation

Body language is one of the most common ways of expressing human emotion. In this paper, we make the first attempt to generate action video with a specific emotion from a single person image. The task of emotion-based action generation (EBAG) can be defined as follows: given a type of emotion and a full-body human image, generate an action video in which the person in the source image expresses the given type of emotion. We divide the task into two parts and propose a two-stage framework to generate action video that expresses emotion. In the first stage, we propose an RNN-based LS-GAN for translating the emotion to a pose sequence. In the second stage, we generate the target video frames from three inputs, namely the source pose and the target pose as the motion information and the source image as the appearance reference, using a conditional GAN model with an online training strategy. Our framework produces the pose sequence and transforms the action independently, which underlines the fundamental role that the high-level pose feature plays in generating action video with a specific emotion. The proposed method has been evaluated on the "Soul Dancer" dataset, which is built for action emotion analysis and generation. The experimental results demonstrate that our framework can effectively solve the emotion-based action generation task. However, a gap in the details of the appearance between the generated action video and real-world video still exists, which indicates that the emotion-based action generation task has great research potential.

Cell Nuclei Classification In Histopathological Images using Hybrid OLConvNet

Computer-aided histopathological image analysis for cancer detection is a major research challenge in the medical domain. Automatic detection and classification of nuclei for cancer diagnosis pose many challenges in developing state-of-the-art algorithms due to the heterogeneity of cell nuclei and dataset variability. Recently, a multitude of classification algorithms have used complex deep learning models for their datasets. However, most of these methods are rigid, and their architectural arrangement suffers from inflexibility and non-interpretability. In this research article, we propose a hybrid and flexible deep learning architecture, OLConvNet, that integrates the interpretability of traditional object-level features and the generalization of deep learning features by using a shallower Convolutional Neural Network (CNN) named CNN3L. CNN3L reduces the training time by training fewer parameters, hence eliminating space constraints imposed by deeper algorithms. We used F1-score and multiclass Area Under the Curve (AUC) performance parameters to compare the results. To further strengthen the viability of our architectural approach, we tested our proposed methodology with the state-of-the-art deep learning architectures AlexNet, VGG16 and VGG19 as backbone networks. After a comprehensive analysis of classification results from all four architectures, we observed that our proposed model works well and performs better than contemporary complex algorithms.

Cross Refinement Techniques for Markerless Human Motion Capture

This paper presents a global 3D human pose estimation method for markerless motion capture. Given two calibrated images of a person, it first obtains the 2D joint locations in the images using a pre-trained 2D Pose CNN, then constructs the 3D pose based on stereo triangulation. To improve the accuracy and the stability of the system, we propose two efficient optimization techniques for the joints. The first one, called cross-view refinement, optimizes the joints based on epipolar geometry. The second one, called cross-joint refinement, optimizes the joints using bone length constraints. Our method automatically detects and corrects unreliable joints, and consequently is robust against heavy occlusion, symmetry ambiguity, motion blur, and highly distorted poses. We evaluate our method on a number of benchmark datasets covering indoor and outdoor scenes, and the results show that our method is better than or on par with the state-of-the-art methods. As an application, we create a 3D human pose dataset using the proposed motion capture system, which contains about 480,000 images of both indoor and outdoor scenes, and demonstrate the usefulness of the dataset for human pose estimation.

An image cues coding approach for 3D pose estimation

Although Deep Convolutional Neural Networks (DCNNs) have facilitated the evolution of 3D human pose estimation, the ambiguity in 3D pose recovery remains the most challenging problem in such tasks. Inspired by the Human Perception Mechanism (HPM), we propose an image-to-pose coding method to fill the gap between image cues and 3D poses, thereby alleviating the one-to-many problem in 2D pose-to-3D pose estimation. Our method utilizes the mapping relation between image cues and 3D poses to explicitly encode all 3D pose subspaces so that the ambiguity in the entire 3D pose space can be mitigated. In 3D pose space, we explicitly divide the whole space into multiple sub-regions named pose codes and turn the disambiguation problem into a classification problem. With the vast amount of 3D pose data, the proposed coding mechanism covers a majority of view changes, thus providing a complete description of the 3D pose space. The articulated structure of the human body lies on a sophisticated product manifold, and the error accumulation in the chain structure will undoubtedly affect the coding performance. Therefore, in image space, we extract the image cues from independent local image patches rather than the whole image. Then, the mapping relationship between image cues and 3D pose codes is constructed by a set of DCNNs. Combining the proposed image-to-pose coding method with the matching mechanism, we propose a 3D pose estimation method which can effectively alleviate the ambiguity in 3D pose recovery. The image-to-pose coding method transforms the implicit image cues into explicit constraints in the matching stage. We conduct extensive experiments on widely used public benchmarks. The experimental results show that our method can effectively alleviate the ambiguity in 3D pose recovery and is robust to view changes.

Cloud Gaming With Foveated Video Encoding

Learning Shared Semantic Space with Correlation Alignment for Cross-modal Event Retrieval

In this paper, we propose to learn shared semantic space with correlation alignment (${S}^{3}CA$) for multimodal data representations, which aligns nonlinear correlations of multimodal data distributions in deep neural networks designed for heterogeneous data. In the context of cross-modal (event) retrieval, we design a neural network with convolutional layers and fully-connected layers to extract features for images, including images on Flickr-like social media. Simultaneously, we exploit a fully-connected neural network to extract semantic features for texts, including news articles from news media. In particular, nonlinear correlations of layer activations in the two neural networks are aligned with correlation alignment during the joint training of the networks. Furthermore, we project the multimodal data into a shared semantic space for cross-modal (event) retrieval, where the distances between heterogeneous data samples can be measured directly. In addition, we contribute a Wiki-Flickr Event dataset, where the multimodal data samples are not describing each other in pairs like the existing paired datasets, but all of them are describing semantic events. Extensive experiments conducted on both paired and unpaired datasets manifest the effectiveness of ${S}^{3}CA$, outperforming the state-of-the-art methods.

Robust Visual Tracking using Kernel Sparse Coding on Multiple Covariance Descriptors

In this paper, we aim to improve the performance of visual tracking by combining different features of multiple modalities. The core idea is to use covariance matrices as feature descriptors and then use sparse coding to encode different features. The notion of sparsity has been successfully used in visual tracking. In this context, sparsity is used along with appearance models often obtained from intensity/color information. In this work, we step outside this trend and propose to model the target appearance by local Covariance Descriptors (CovDs) in a pyramid structure. The proposed pyramid structure not only enables us to encode local and spatial information of the target appearance but also inherits useful properties of CovDs such as invariance to affine transforms. Since CovDs lie on a Riemannian manifold, we further propose to perform tracking through sparse coding by embedding the Riemannian manifold into an infinite-dimensional Hilbert space. Embedding the manifold into a Hilbert space allows us to perform sparse coding efficiently using the kernel trick. Our empirical study shows that the proposed tracking framework outperforms the existing state-of-the-art methods in challenging scenarios.

Random Forest with Self-paced Bootstrap Learning in Lung Cancer Prognosis

Applying supervised learning approaches to gene expression data has the potential to decrease cancer death rates by developing prediction strategies for lung cancer treatment, but the gene feature samples still contain considerable noise. In this study, we present a random forest with self-paced learning bootstrap to improve lung cancer prognosis and classification based on gene expression data. Specifically, we propose an ensemble learning approach with random forest to improve the model classification performance by selecting multiple classifiers. We also investigate a sampling strategy that gradually embeds samples from high to low quality through self-paced learning. The results on five public lung cancer datasets show that our proposed method can select significant genes and improve classification performance compared to existing approaches. We believe that our proposed method has the potential to assist doctors with gene selection and lung cancer prognosis.

How avatar gender may influence users intention to use Second Life environment: An empirical study

The use of virtual environments to transform individuals' use of modern technology has gained special attention lately. The role of gender in the virtual space has also been viewed as a contributor to individuals' use of modern technology. This study investigated the impact of avatar gender on users' intention to use the Second Life (SL) environment in a university context. Two avatars with male and female characteristics were designed and used in the SL environment. A total of 74 SL users were involved in two learning sessions (with male and female avatars). A questionnaire was used to capture users' perceptions of ease of use, usefulness, attitude, and behavioral intention to use the SL space. SL users had positive intentions to use the SL environment for various learning purposes when they were provided with the preferred gender appearance. Offering opposite gender characteristics can help stimulate users' interaction with the avatar, thus facilitating the learning process and building the sense of technology effectiveness.

Sequential Cross-Modal Hashing Learning via Multi-scale Correlation Mining

Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space through hash functions, and achieves fast and flexible cross-modal retrieval. Most existing cross-modal hashing methods learn hash functions by mining the correlation among multimedia data, but ignore an important property of multimedia data: each modality has features of different scales, such as texture, object, and scene features in an image, which can provide complementary information for boosting the retrieval task. The correlations among the multi-scale features are more abundant than the correlations between single features of multimedia data; they reveal a finer underlying structure of the multimedia data and can be used for effective hash function learning. Therefore, we propose the Multi-scale Correlation Sequential Cross-modal Hashing (MCSCH) approach, and its main contributions can be summarized as follows: 1) A multi-scale feature guided sequential hashing learning method is proposed to share the information from features of different scales through an RNN-based network and generate the hash codes sequentially. The features of different scales are used to guide the hash code generation, which can enhance the diversity of the hash codes and weaken the influence of errors in specific features, such as false object features caused by occlusion. 2) A multi-scale correlation mining strategy is proposed to align the features of different scales in different modalities and mine the correlations among aligned features. These correlations reveal a finer underlying structure of multimedia data and can help to boost the hash function learning. 3) A correlation evaluation network evaluates the importance of the correlations to select the worthwhile correlations, and increases the impact of these correlations on hash function learning. Experiments on two widely used 2-media datasets and a 5-media dataset demonstrate the effectiveness of our proposed MCSCH approach.

RCE-HIL: Recognizing Cross-media Entailment with Heterogeneous Interactive Learning

Entailment recognition is an important paradigm of reasoning that judges whether a hypothesis can be inferred from given premises. However, previous efforts mainly concentrate on text-based reasoning in the form of recognizing textual entailment (RTE), where the hypotheses and premises are both textual. In fact, the human reasoning process is characteristically cross-media. It is naturally based on joint inference with different sensory organs, which provide complementary reasoning cues from unique perspectives such as language, vision, and audition. How to realize cross-media reasoning has been a significant challenge for extending the breadth and depth of entailment recognition. Therefore, this paper extends RTE to a novel reasoning paradigm, recognizing cross-media entailment (RCE), and proposes a heterogeneous interactive learning (HIL) approach. Specifically, HIL recognizes entailment relationships via cross-media joint inference, from image-text premises to text hypotheses. It is an end-to-end architecture with two parts: 1) Cross-media hybrid embedding is proposed to perform cross embedding of premises and hypotheses, generating their fine-grained representations. It aims to align cross-media inference cues via image-text and text-text interactive attention. 2) Heterogeneous joint inference is proposed to construct a heterogeneous interaction tensor space and extract semantic features for entailment recognition. It aims to simultaneously capture the interaction between cross-media premises and hypotheses and distinguish their entailment relationships. Experimental results on the widely used SNLI dataset, with image premises from the Flickr30K dataset, verify the effectiveness of HIL and the intrinsic inter-media complementarity in reasoning.

A Novel Security Scheme for WBAN Systems in Smart Healthcare Environments

Wireless Body Area Networks (WBANs) are expected to play an important role in smart healthcare environments and have attracted wide interest in academia and industry in recent years. Because biometric data are private, healthcare systems require a high level of security assurance. However, the security protocol is only briefly introduced in the existing IEEE 802.15.6 WBAN system standard. In this paper, a novel security scheme for WBAN systems in smart healthcare environments is proposed, which consists of authentication and encryption phases. The authentication scheme is based on a challenge-response architecture to guarantee the validity of the access node, while the elliptic curve cryptography (ECC) based encryption scheme is used to protect the transmitted data. In addition, the corresponding hardware architecture is proposed and implemented in 40 nm CMOS technology, demonstrating low hardware cost and power consumption with acceptable latency.
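To make the challenge-response idea concrete, here is a minimal sketch of such an exchange between a hub and a sensor node. It assumes a pre-shared symmetric key and HMAC-SHA256 as the response function; both are illustrative assumptions, since the abstract does not describe the actual primitives or message format, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_challenge(n_bytes=16):
    """Hub side: draw a fresh random nonce for each authentication attempt."""
    return secrets.token_bytes(n_bytes)


def respond(shared_key, challenge):
    """Node side: prove knowledge of the key by computing a keyed MAC over the nonce."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def verify(shared_key, challenge, response):
    """Hub side: recompute the expected MAC and compare in constant time."""
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)


key = secrets.token_bytes(32)   # hypothetical pre-shared key
nonce = make_challenge()

print(verify(key, nonce, respond(key, nonce)))        # node holding the right key
print(verify(key, nonce, respond(b"x" * 32, nonce)))  # node holding a wrong key
```

A fresh nonce per attempt is what prevents a recorded response from being replayed later.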

Action Recognition using form and motion modalities

Action recognition has attracted increasing interest in computer vision due to its potential applications in many vision systems. One of the main challenges in action recognition is to extract powerful features from videos. Most existing approaches exploit either hand-crafted techniques or learning-based methods to extract features from videos. However, these methods mainly focus on extracting dynamic motion features and ignore static form features. Therefore, these methods cannot accurately capture all of the underlying information in videos. In this paper, we propose a novel feature representation method for action recognition, which exploits hierarchical sparse coding to learn the underlying features from videos. The learned features characterise form and motion simultaneously and therefore provide a more accurate and complete feature representation. The learned form and motion features are considered as two modalities, which are used to represent both the static and motion features. These modalities are further encoded into a global representation via pair-wise dictionary learning and then fed to an SVM classifier for action classification. Experimental results on several challenging datasets validate that the proposed method is superior to several state-of-the-art methods.

Steganographer Detection via Multi-Scale Embedding Probability Estimation

Steganographer detection aims to identify the guilty user, who utilizes steganographic methods to hide secret information in shared multimedia data, especially image data, from among a large number of innocent users on social networks. The true embedding probability map illustrates the probability distribution of embedding secret information in the corresponding images under specific steganographic methods and settings, and it has been successfully used as guidance for content-adaptive steganographic and steganalytic methods. Unfortunately, in real-world situations, the detailed steganographic settings adopted by the guilty user cannot be known in advance. It thus becomes necessary to propose an automatic embedding probability estimation method. In this paper, we propose a novel content-adaptive steganographer detection method via embedding probability estimation. The embedding probability estimation is first formulated as a learning-based saliency detection problem, and the multi-scale estimated map is then integrated into the CNN to extract steganalytic features. Finally, the guilty user is detected via an efficient Gaussian vote method with the extracted steganalytic features. The experimental results show that the proposed method is superior to the state-of-the-art methods in both spatial and frequency domains.

Proposal Complementary Action Detection

Temporal action detection not only requires correct classification, but also needs to detect the start and end times of each action accurately. However, traditional approaches always employ sliding windows or actionness to predict the actions, and it is difficult to train a model with sliding windows or actionness end-to-end. In this paper, we attempt a different idea to detect the actions end-to-end, which can calculate the probabilities of actions directly through one network as one part of the results. We present a novel proposal complementary action detector (PCAD) to deal with video streams under continuous, untrimmed conditions. Our approach first uses a simple fully 3D convolutional (Conv3D) network to encode the video streams and then generates candidate temporal proposals for activities by using anchor segments. To generate more precise proposals, we also design a boundary proposal network (BPN) to offer complementary information for the candidate proposals. Finally, we learn an efficient classifier to classify the generated proposals into different activities and refine their temporal boundaries at the same time. Our model can achieve end-to-end training by jointly optimizing classification loss and regression loss. When evaluated on the THUMOS'14 detection benchmark, PCAD achieves state-of-the-art performance among high-speed models.

DenseNet-201 based deep neural network with composite learning factor and precomputation for multiple sclerosis classification

(Aim) Multiple sclerosis is a neurological condition that may cause neurologic disability among. To identify multiple sclerosis more accurately, this paper proposes a new transfer-learning-based approach. (Method) DenseNet-121, DenseNet-169, and DenseNet-201 neural networks were compared. In addition, we proposed a composite learning factor (CLF) that assigns different learning factors to three types of layers: early frozen layers, middle layers, and late newly replaced layers. How to allocate layers to those three types remains a problem. Hence, four transfer learning settings (viz., Settings A, B, C, and D) were tested and compared. A precomputation method was utilized to reduce the storage burden and accelerate the program. (Results) We observed that DenseNet-201-D achieved the best performance. The sensitivity, specificity, and accuracy of DenseNet-201-D were 98.27 ± 0.58, 98.35 ± 0.69, and 98.31 ± 0.53, respectively. (Conclusion) Our method gives better performance than state-of-the-art approaches. Furthermore, the composite learning factor gives superior results to the traditional simple learning factor (SLF) strategy.

Efficient Image Hashing with Invariant Vector Distance for Copy Detection

Image hashing is an efficient multimedia security technique for image content protection. It maps an image into a compact content-based code that denotes the image itself. While most existing algorithms focus on improving the classification between robustness and discrimination, little attention has been paid to geometric invariance under normal digital operations, and the resulting hashes are therefore quite fragile to geometric distortion when applied to image copy detection. In this paper, a novel and effective image hashing method is proposed based on invariant vector distance in both the spatial domain and the frequency domain. First, the image is preprocessed by some joint operations to extract robust features. Then, the preprocessed image is randomly divided into several overlapping blocks under a secret key, and two different feature matrices are separately obtained in the spatial domain and frequency domain through invariant moments and low-frequency discrete cosine transform coefficients. Furthermore, the invariant distances between vectors in the feature matrices are calculated and quantized to form a compact hash code. We conduct various experiments to demonstrate that the proposed hashing not only reaches good classification between robustness and discrimination, but also resists most geometric distortions in image copy detection. In addition, both receiver operating characteristic curve comparisons and mean average precision in copy detection clearly illustrate that the proposed hashing method outperforms state-of-the-art algorithms.

AMIL: Adversarial Multi-Instance Learning for Human Pose Estimation

Human pose estimation has an important impact on a wide range of applications, from human-computer interfaces to surveillance and content-based video retrieval. For human pose estimation, joint occlusions and overlapping human bodies result in deviated pose estimates. To address these problems, by integrating priors of the structure of human bodies, we present an innovative structure-aware network to discreetly consider such priors during the training of the network. Typically, learning such constraints is a challenging task. Instead, we propose generative adversarial networks as our learning model, in which we design two residual multiple instance learning (MIL) models with identical architecture, one used as the generator and the other as the discriminator. The discriminator's task is to distinguish the actual poses from the fake ones. If the pose generator generates results that the discriminator is not able to distinguish from the real ones, the model has successfully learned the priors. In the proposed model, the discriminator differentiates the ground-truth heatmaps from the generated ones, and the adversarial loss is then back-propagated to the generator. This procedure helps the generator learn reasonable body configurations and is shown to improve the prediction accuracy. Meanwhile, we propose a novel function for MIL. It is an adjustable structure for both instance selection and modeling that appropriately passes information between instances in a single bag. In the proposed residual MIL neural network, the pooling action frequently updates the instance contribution to its bag. The proposed pooling-based adversarial residual multi-instance neural network has been validated on two datasets for the human pose estimation task and outperforms other state-of-the-art models.

LFGAN: 4D Light Field Synthesis from a Single RGB Image

We present a deep neural network called Light Field GAN (LFGAN) that synthesizes a 4D light field from a single 2D RGB image. We generate light fields using a single image super-resolution (SISR) technique based on two important observations. First, the small baselines give rise to high similarity between the full light field image and each sub-aperture view. Second, the occlusion edge at any spatial coordinate of a sub-aperture view has the same orientation as the occlusion edge at the corresponding angular patch, implying that the occlusion information in the angular domain can be inferred from the sub-aperture local information. We employ the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) to learn the color and depth information from the light field datasets. The network can generate a plausible 4D light field comprising 8 × 8 angular views from a single sub-aperture 2D image. We propose new loss terms, namely epipolar plane image (EPI) and brightness regularization losses, as well as a novel multi-stage training framework that feeds the loss terms at different times to generate superior light fields. The EPI loss encourages the network to learn the geometric features of the light fields, and the brightness regularization loss preserves the brightness consistency across different sub-aperture views. Two datasets have been used to evaluate our method: in addition to an existing light field dataset capturing scenes of flowers and plants, we have built a large dataset of toy animals consisting of 2,100 light fields captured with a plenoptic camera. We have performed comprehensive ablation studies to evaluate the effects of individual loss terms and the multi-stage training strategy, and compared LFGAN with other state-of-the-art techniques. Qualitative and quantitative evaluations demonstrate that LFGAN can effectively estimate complex occlusions and geometry in challenging scenes, and outperforms other existing techniques.

Machine learning techniques for the diagnosis of Alzheimer's disease: A review

Alzheimer's disease is an incurable neurodegenerative disease primarily affecting the elderly population. Efficient automated techniques are needed for early diagnosis of Alzheimer's disease. Many novel approaches have been proposed by researchers for the classification of Alzheimer's disease. However, to develop more efficient learning techniques, a better understanding of the work done on Alzheimer's disease is needed. Here, we provide a review of 165 papers from 2001 to 2019 that use various feature extraction and machine learning techniques. The machine learning techniques are surveyed under three main categories: support vector machine (SVM), artificial neural network (ANN), and deep learning (DL) and ensemble methods. We present a detailed review of these three approaches for Alzheimer's disease, with possible future directions.

Video Retrieval with Similarity-Preserving Deep Temporal Hashing

This paper aims to develop an efficient Content-based Video Retrieval (CBVR) system by hashing videos into short binary codes. It is an appealing research topic with increasing demand in the Internet era, when massive numbers of videos are uploaded to websites every day. The main challenge of this task is how to discriminatively map video sequences to compact hash codes while preserving the original similarity. Existing video hashing methods are usually built on two isolated steps, frame pooling-based video feature extraction and hash code generation, which do not fully explore the spatial-temporal properties of videos and inevitably result in severe information loss. To address these issues, in this paper we present an end-to-end video retrieval framework called the Similarity-Preserving Deep Temporal Hashing (SPDTH) network. Specifically, we design the hashing module as an encoder Recurrent Neural Network (RNN) equipped with stacked Gated Recurrent Units (GRUs). The benefit of our network is that it explicitly extracts the spatial-temporal properties of videos and yields compact hash codes in an end-to-end manner. In addition, we introduce a structured ranking loss for deep network training that preserves intra-class similarity and inter-class separability, and the quantization loss between the real-valued output and the binary codes is minimized. Extensive experiments on several challenging datasets demonstrate that SPDTH consistently outperforms state-of-the-art video hashing methods.
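The retrieval side of binary hashing can be sketched in a few lines. The example below assumes sign-based binarization and a mean-squared quantization loss, which are common choices in deep hashing but are not confirmed by the abstract; the network outputs, video names, and function names are hypothetical.

```python
def binarize(outputs):
    """Map real-valued network outputs to a {-1, +1} hash code by sign."""
    return [1 if u >= 0 else -1 for u in outputs]


def quantization_loss(outputs):
    """Mean squared gap between the real-valued outputs and their binary code."""
    code = binarize(outputs)
    return sum((u - b) ** 2 for u, b in zip(outputs, code)) / len(outputs)


def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Hypothetical real-valued outputs of a hashing network for a query and two videos.
query_outputs = [0.9, -0.8, 0.1, -0.4]
query = binarize(query_outputs)
database = {
    "video_a": binarize([0.7, -0.9, 0.3, -0.2]),
    "video_b": binarize([-0.6, 0.5, 0.2, 0.8]),
}

# Rank database entries by Hamming distance to the query code.
ranked = sorted(database, key=lambda name: hamming_distance(query, database[name]))
print(ranked)
print(quantization_loss(query_outputs))
```

Outputs close to ±1 give a small quantization loss, which is why minimizing it during training reduces the information lost when the codes are binarized.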

An End-to-end Attention-based Neural Model for The Complementary Clothing Matching

In modern society, people tend to prefer fashionable and decent outfits to ones that only meet basic physiological needs. In fact, a proper outfit usually relies on good matching among the complementary fashion items (e.g., the top, bottom, and shoes) that compose it, which propels us to investigate an automatic complementary clothing matching scheme. However, this is non-trivial due to the following challenges. (1) The main challenge lies in how to accurately model the compatibility between complementary fashion items (e.g., the top and bottom) that come from heterogeneous spaces with multiple modalities (e.g., the visual modality and textual modality). (2) Since different features (e.g., the color, style, and pattern) of fashion items may contribute differently to the compatibility modeling, how to encode the confidence of different pair-wise features poses a tough challenge. (3) How to jointly learn the latent representation of multi-modal data and the compatibility between complementary fashion items constitutes the last challenge. To this end, in this work, we present an end-to-end attention-based neural framework for compatibility modeling, where we introduce a feature-level attention model to adaptively learn the confidence for different pair-wise features. Extensive experiments on a publicly available real-world dataset show the superiority of our model over the state-of-the-art methods.

Image/Video Restoration via Multiplanar Autoregressive Model and Low-Rank Optimization

In this paper, we introduce an image/video restoration approach that utilizes the high-dimensional similarity in images/videos. After grouping similar patches from neighboring frames, we propose to build a multiplanar autoregressive (AR) model to exploit the correlation in cross-dimensional planes of the patch group, which has long been neglected by previous AR models. To further utilize the nonlocal self-similarity in images/videos, a joint multiplanar AR and low-rank based approach (MARLow) is proposed to reconstruct patch groups more effectively. Moreover, for video restoration, the temporal smoothness of the restored video is constrained by a Markov random field (MRF), where the MRF encodes a priori knowledge about the consistency of patches from neighboring frames. Specifically, we treat different restoration results (from different patch groups) of a certain patch as labels of an MRF, and temporal consistency among these restored patches is imposed. Besides image and video restoration, the proposed method is also suitable for other restoration applications such as interpolation and text removal. Extensive experimental results demonstrate that the proposed approach obtains encouraging performance compared with state-of-the-art methods.

A Benchmark Dataset and Comparison Study for Multi-Modal Human Action Analytics

Large-scale benchmarks provide a solid foundation for the development of action analytics. Most of the previous activity benchmarks focus on analyzing actions in RGB videos. There is a lack of large-scale and high-quality benchmarks for multi-modal action analytics. In this paper, we introduce PKU Multi-Modal Dataset (PKU-MMD), a new large-scale benchmark for multi-modal human action analytics. It consists of about 28,000 action instances and 5.4 million frames in total, and provides high-quality multi-modal data sources, including RGB, depth, infrared radiation (IR) and skeletons. To make PKU-MMD more practical, our dataset comprises two subsets under different settings for action understanding, namely Part I and Part II. Part I contains 1,076 untrimmed video sequences with 51 action classes performed by 66 subjects, while Part II contains 1,009 untrimmed video sequences with 41 action classes performed by 13 subjects. Compared to Part I, Part II is more challenging due to short action intervals, concurrent actions and heavy occlusion. PKU-MMD can be leveraged in two scenarios: action recognition with trimmed video clips and action detection with untrimmed video sequences. For each scenario, we provide benchmark performance on both subsets, by conducting different methods with different modalities under two evaluation protocols, respectively. Experimental results show that PKU-MMD is a significant challenge to many state-of-the-art methods. We further illustrate that the features learned on PKU-MMD can be well transferred to other datasets. We believe this large-scale dataset will boost the research in the field of action analytics for the community.

ACMNet: Adaptive Confidence Matching Network for Human Behavior Analysis via Cross-Modal Retrieval

Cross-modality human behavior analysis has attracted much attention from both academia and industry. In this paper, we focus on the cross-modality image-text retrieval problem for human behavior analysis, which can learn a common latent space for cross-modality data and thus benefit the understanding of human behavior with data from different modalities. Existing state-of-the-art cross-modality image-text retrieval models tend to be fine-grained region-word matching approaches, where they begin with measuring similarities for each image region or text word followed by aggregating them to estimate the global image-text similarity. However, it is observed that such fine-grained approaches often encounter the similarity bias problem, because they only consider matched text words for an image region or matched image regions for a text word for similarity calculation, but totally ignore unmatched words/regions, which might still be salient enough to affect the global image-text similarity. In this paper, we propose an Adaptive Confidence Matching Network (ACMNet), which is also a fine-grained matching approach, to effectively deal with such a similarity bias. Apart from calculating the local similarity for each region(/word) with its matched words(/regions), ACMNet also introduces a confidence score for the local similarity by leveraging the global text(/image) information, which is expected to help measure the semantic relatedness of the region(/word) to the whole text(/image). Moreover, ACMNet also incorporates the confidence scores together with the local similarities in estimating the global image-text similarity. To verify the effectiveness of ACMNet, we conduct extensive experiments and make comparisons with state-of-the-art methods on two benchmark datasets, i.e., Flickr30k and MS COCO. Experimental results show that the proposed ACMNet can outperform the state-of-the-art methods by a clear margin, which well demonstrates the effectiveness of the proposed ACMNet in human behavior analysis and the reasonableness of tackling the mentioned similarity bias issue.

Intelligent Classification and Analysis of Essential Genes Species Using Quantitative Methods

The significance of the word "essential" needs no further clarification. Essential genes are considered from the perspective of the evolution of different organisms; however, the matter is quite complicated because we have to recognize the difference between essential cellular processes, essential protein functions, and essential genes. There is also a need to identify whether one set of growth conditions may be replaced by another set. It is also contended that most genes are essential in the natural selection process. In this article, we apply an intelligent method for the classification of essential genes of four different species, namely Human, Arabidopsis thaliana, Drosophila melanogaster, and Danio rerio. The primary aim of the current article is to understand the distributions of purines and pyrimidines over the essential genes of these four species. Based on quantitative parameters (Shannon entropy, fractal dimension, Hurst exponent, and distribution of purines and pyrimidines), ten different clusters have been generated for the four species. Some proximity results have been observed among the clusters of all four species.
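One of the quantitative parameters named above, Shannon entropy over the purine/pyrimidine distribution, can be illustrated with a short sketch. It assumes the usual recoding of A and G as purines (R) and C and T as pyrimidines (Y) and measures entropy in bits; the example sequences and function names are hypothetical, and the abstract does not state the exact procedure used.

```python
import math
from collections import Counter

PURINES = set("AG")
PYRIMIDINES = set("CT")


def purine_pyrimidine_string(sequence):
    """Recode a DNA sequence as R (purine: A, G) or Y (pyrimidine: C, T)."""
    recoded = []
    for base in sequence.upper():
        if base in PURINES:
            recoded.append("R")
        elif base in PYRIMIDINES:
            recoded.append("Y")
        # Any other symbol (for example N) is skipped.
    return "".join(recoded)


def shannon_entropy(symbols):
    """Shannon entropy in bits of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


balanced = purine_pyrimidine_string("AGCT")   # two purines, two pyrimidines
skewed = purine_pyrimidine_string("AGGC")     # three purines, one pyrimidine
print(balanced, shannon_entropy(balanced))
print(skewed, shannon_entropy(skewed))
```

For a two-symbol alphabet the entropy ranges from 0 bits (only purines or only pyrimidines) to 1 bit (an even split), so it summarizes how balanced a gene's composition is in a single number.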

Joint Stacked Hourglass Network and Salient Region Attention Refinement for Robust Face Alignment

Facial landmark detection aims to locate keypoints in facial images, which typically suffer from variations caused by arbitrary pose, diverse facial expressions, and partial occlusion. In this paper, we propose a coarse-to-fine framework that joins a stacked hourglass network and salient region attention refinement for robust face alignment. To achieve this goal, we first present a Multi-Scale Region Learning (MSR) module to analyze the structure information at different facial regions and extract strongly discriminative deep features. Then we employ a Stacked Hourglass Network (SHN) for heatmap regression and initial facial landmark prediction. Specifically, SHN introduces an improved Inception-ResNet unit as its basic building block, which can effectively enlarge the receptive field and learn contextual feature representations. Meanwhile, a novel loss function takes into account global weights and local weights to make the heatmap regression more accurate and faster. Different from existing heatmap regression models, we present a Salient Region Attention Refinement (SRA) module to extract precise features based on the heatmap regression, and utilize the filtered features for landmark refinement to achieve accurate prediction. Extensive experimental results on several challenging datasets (including 300W, COFW, and AFLW) confirm that our approach can achieve more competitive performance than the most advanced algorithms.

Multichannel Attention Refinement for Video Question Answering

Video Question Answering (VideoQA) is the extension of image question answering (ImageQA) to the video domain. In this task, methods are required to give the correct answer after analyzing the provided video and question. Compared to ImageQA, the most distinctive part is the media type. Both tasks require the understanding of visual media, but VideoQA is much more challenging, mainly because of the complexity and diversity of videos. In particular, working with video requires modeling its inherent temporal structure and analyzing the diverse information it contains. In this paper, we propose to tackle the task from a multichannel perspective. Appearance, motion, and audio features are extracted from the video, and question-guided attentions are refined to generate the expressive clues that support the correct answer. We also incorporate relevant text information acquired from Wikipedia as an attempt to extend the capability of the method. Experiments on the TGIF-QA and ActivityNet-QA datasets show the advantages of our method compared to existing methods. We also demonstrate the effectiveness and interpretability of our method by analyzing the refined attention weights during the question answering procedure.

Embedding distortion analysis in wavelet-domain watermarking

Imperceptibility and robustness are two complementary fundamental requirements of any watermarking algorithm. Low-strength watermarking yields high imperceptibility but exhibits poor robustness. High-strength watermarking schemes achieve good robustness but often introduce distortions, resulting in poor visual quality in the host image. In this paper we analyse the embedding distortion for wavelet-based watermarking schemes. We derive the relationship between the distortion, measured in mean square error (MSE), and the watermark embedding modification, and propose a linear proportionality between MSE and the sum of energy of the wavelet coefficients selected for watermark embedding modification. The initial proposition assumes the orthonormality of the discrete wavelet transform. It is further extended to non-orthonormal wavelet kernels using a weighting parameter that follows the energy conservation theorems in wavelet frames. The proposed analysis is verified by experimental results for non-blind as well as blind watermarking schemes. Such a model is useful for finding the optimum input parameters, including the wavelet kernel, coefficient selection and subband choices, for wavelet-domain image watermarking.
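
The orthonormal case can be checked numerically with a small sketch. It uses a one-level orthonormal Haar transform on a 1D signal and a multiplicative embedding rule that scales the detail coefficients by `(1 + alpha)`; both the 1D setting and this embedding rule are assumptions chosen for illustration, not the schemes evaluated in the paper. Because the transform is orthonormal, the signal-domain MSE equals the energy of the coefficient modification divided by the signal length, which for this rule is `alpha**2` times the energy of the selected coefficients.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)

def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length list."""
    approx = [(x[i] + x[i + 1]) * SCALE for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * SCALE for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) * SCALE, (a - d) * SCALE])
    return x

def embed_and_measure(signal, alpha):
    """Return (measured MSE, MSE predicted from coefficient energy)."""
    approx, detail = haar_forward(signal)
    # Hypothetical embedding rule: scale every detail coefficient by (1 + alpha).
    marked_detail = [d * (1.0 + alpha) for d in detail]
    marked = haar_inverse(approx, marked_detail)
    n = len(signal)
    mse = sum((m - s) ** 2 for m, s in zip(marked, signal)) / n
    predicted = alpha ** 2 * sum(d * d for d in detail) / n
    return mse, predicted

signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(embed_and_measure(signal, alpha=0.1))
```

For a fixed `alpha`, the measured MSE grows linearly with the energy of the selected coefficients, so choosing different subbands or coefficients changes the distortion in a predictable way.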

Hybrid Wolf-Bat algorithm for optimisation of connection weights in multi-layer perceptron

In any neural network, the weights act as parameters for determining the output(s) from a set of inputs. They are used to compute the activation values of the nodes of a layer from the values of the previous layer. Finding the ideal set of weights for training a Multilayer Perceptron neural network such that it minimizes the classification error is a widely known optimization problem. This paper proposes the Hybrid Wolf-Bat algorithm, a novel optimization algorithm, as a solution to this problem. The proposed algorithm is a hybrid of two existing nature-inspired algorithms: the Grey Wolf Optimization algorithm and the Bat algorithm. This novel approach is tested on ten different medical datasets obtained from the UCI machine learning repository. The results of the proposed algorithm are compared with those of four recently developed nature-inspired algorithms, namely the Grey Wolf Optimization algorithm (GWO), Cuckoo Search (CS), the Bat Algorithm (BA) and the Whale Optimization Algorithm (WOA), along with the standard back-propagation training method. As observed from the results, the proposed method is better in terms of both speed of convergence and accuracy, and it outperforms the other bio-inspired algorithms.
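
To make the optimization problem concrete, the sketch below shows how a candidate solution can be encoded as a flat weight vector and scored by classification error, which is the kind of fitness function a population-based optimizer would minimize. It does not implement the Hybrid Wolf-Bat algorithm; the network size, the weight layout, and the XOR toy data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, x, n_in=2, n_hidden=2):
    """Forward pass of an n_in -> n_hidden -> 1 perceptron from a flat weight vector.

    Assumed layout: for each hidden node its n_in weights then its bias,
    followed by the n_hidden output weights and the output bias.
    """
    pos = 0
    hidden = []
    for _ in range(n_hidden):
        w = weights[pos:pos + n_in]
        b = weights[pos + n_in]
        pos += n_in + 1
        hidden.append(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b))
    w_out = weights[pos:pos + n_hidden]
    b_out = weights[pos + n_hidden]
    return sigmoid(sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out)

def classification_error(weights, samples):
    """Fraction of samples whose thresholded output disagrees with the label."""
    wrong = sum(1 for x, label in samples
                if (forward(weights, x) > 0.5) != bool(label))
    return wrong / len(samples)

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
# Hand-set weights: hidden nodes act as OR and NAND, the output node as AND.
HAND_SET = [20, 20, -10, -20, -20, 30, 20, 20, -30]

print(classification_error(HAND_SET, XOR))
print(classification_error([0.0] * 9, XOR))
```

An optimizer such as GWO or BA would treat each 9-element vector as one candidate position and call `classification_error` as its objective.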

Modeling Long-term dependencies from Videos using Deep Multiplicative Neural Networks

Understanding the temporal dependencies of videos is fundamental for vision problems, but neural-network-based models are still insufficient in this field. In this paper, we propose novel Deep Multiplicative Neural Networks (DMNNs) for learning hierarchical long-term representations from video. DMNNs are built upon a multiplicative block that remembers the pairwise transformations occurring between frames by using multiplicative interactions instead of regular weighted-sum ones. The block is slid over the time steps to update the memory of the network on the frame pairs. A deep architecture can be implemented by stacking multiple layers of the sliding blocks. The multiplicative interactions lead to exact rather than approximate modeling of temporal dependencies. The memory mechanism can remember the temporal dependencies for an arbitrary length of time. The multiple layers output multiple-level representations that reflect the multi-timescale structure of video. To address the difficulty of training DMNNs, we also derive a theoretically sound convergent method, which leads to fast and stable convergence. We demonstrate new state-of-the-art classification performance with the proposed networks on the UCF101 dataset, and their effectiveness in capturing complicated temporal dependencies on a variety of synthetic datasets.
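
The difference between a weighted-sum interaction and a multiplicative one can be sketched on two small frame vectors. The bilinear form below is a generic example of a multiplicative interaction and is an assumption for illustration; the abstract does not give the exact form of the DMNN block, its memory update, or its training method.

```python
def weighted_sum(w_x, w_y, x, y):
    """Additive interaction: each output sums separate contributions from x and y."""
    return [
        sum(a * xi for a, xi in zip(row_x, x)) + sum(b * yj for b, yj in zip(row_y, y))
        for row_x, row_y in zip(w_x, w_y)
    ]

def multiplicative(w, x, y):
    """Bilinear interaction: z[k] = sum_i sum_j w[k][i][j] * x[i] * y[j]."""
    return [
        sum(w_k[i][j] * x[i] * y[j] for i in range(len(x)) for j in range(len(y)))
        for w_k in w
    ]

# Two hypothetical "frames" and two output units.
frame_a = [1, 2]
frame_b = [3, 4]
W = [
    [[1, 0], [0, 1]],  # unit 0: responds to matching components (dot product)
    [[0, 1], [1, 0]],  # unit 1: responds to swapped components
]
print(multiplicative(W, frame_a, frame_b))
print(weighted_sum([[1, 0]], [[0, 1]], frame_a, frame_b))
```

Each multiplicative unit depends on products of components from both frames, so it responds to how the pair relates rather than to either frame alone, whereas the additive unit's response to one frame is independent of the other.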

A new transfer function for volume visualization of aortic stent and virtual endoscopy application

Aortic stents have been widely used to treat vascular stenosis and to assist patients with cardiovascular disease. Effective visualization of the aortic stent is considered critical to ensuring its effectiveness and function in clinical practice. Volume rendering with ray casting has been used as an effective approach to visualizing the aortic stent. Volume rendering relies on a transfer function that converts the medical images into optical attributes, including color and transparency. This paper proposes a new transfer function, namely the multi-dimensional transfer function, which provides an additional transparency value for each voxel. Using this additional transparency value, the proposed approach helps to distinguish tissues that have the same CT value. The transparency values are determined simultaneously by a gray threshold and a gray change threshold, which makes it possible to render unnecessary structures such as bones transparent. A series of experimental results demonstrates that the condition of a patient's aortic stent can be observed directly and that the angle of view can be switched arbitrarily. The proposed method provides a new way for virtual endoscopy to reach parts of the blood vessels that traditional endoscopy fails to reach.
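
One hypothetical way to combine a gray threshold with a gray change threshold is sketched below for samples along a single ray. The abstract does not give the actual rule, so the direction of the comparisons, the binary opacity, the central-difference estimate of gray change, and all numeric values are assumptions for illustration.

```python
def gray_change_1d(samples, i):
    """Central-difference magnitude of the gray value along one ray.

    Falls back to a one-sided difference at the ends; needs at least two samples.
    """
    lo = max(i - 1, 0)
    hi = min(i + 1, len(samples) - 1)
    return abs(samples[hi] - samples[lo]) / (hi - lo)

def opacity(gray, gray_change, gray_threshold, change_threshold):
    """Assumed rule: opaque only when both the gray value and its change pass."""
    if gray >= gray_threshold and gray_change >= change_threshold:
        return 1.0
    return 0.0

# Two voxels with the same gray value but different local change.
print(opacity(1200, 500, gray_threshold=1000, change_threshold=300))
print(opacity(1200, 20, gray_threshold=1000, change_threshold=300))
```

The second dimension is what separates the two voxels here: a transfer function based on the gray value alone would assign them the same transparency.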

Pulmonary Nodule Detection based on ISODATA-Improved Faster RCNN and 3D-CNN with Focal Loss

The early diagnosis of pulmonary cancer can significantly improve the survival rate of patients, and pulmonary nodule detection in computed tomography (CT) images plays an important role in it. In this paper, we propose a novel pulmonary nodule detection system based on convolutional neural networks (CNN). Our system consists of two stages: pulmonary nodule candidate detection and false positive reduction. For candidate detection, we introduce Focal Loss and the Iterative Self-Organizing Data Analysis Techniques Algorithm (ISODATA) into the Faster Region-based Convolutional Neural Network (Faster R-CNN) model. For false positive reduction, a three-dimensional convolutional neural network (3D-CNN) is employed to fully exploit the three-dimensional nature of CT images. Experiments were conducted on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset, and the results indicate that the proposed system achieves favorable performance on pulmonary nodule detection.
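
Focal loss down-weights easy examples so that training concentrates on hard ones, which matters when true nodules are rare among candidates. The sketch below is the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The abstract does not state which hyperparameters were used, so the defaults `gamma=2.0` and `alpha=0.25` are assumptions taken from common practice.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction p = P(y = 1) and a binary label y."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct positive contributes far less than a badly missed one.
print(binary_focal_loss(0.9, 1))
print(binary_focal_loss(0.1, 1))
```

With `gamma=0.0` the modulating factor disappears and the expression reduces to class-weighted cross-entropy.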

Delving Deeper in Drone-Based Person Re-Id by Employing Deep Decision Forest and Attributes Fusion

Deep learning has revolutionized the field of computer vision and image processing. Its ability to extract compact image representations has taken the person re-identification problem to a new level. However, in most cases, researchers focus on developing new approaches to extract richer image representations for use in the re-id task. Extra information about the images is rarely taken into account because traditional person re-identification datasets usually do not include it. Nevertheless, research in multimodal machine learning has demonstrated that using information from different sources leads to better performance. In this work, we demonstrate how the person re-identification problem can benefit from the use of multimodal data. We used a UAV drone to collect and label a new person re-identification dataset, which is composed of pedestrian images and their attributes. We manually annotated this dataset with attributes and, in contrast to recent research, we do not use a deep network to classify them. Instead, we employ the CBOW model to extract word embeddings from text descriptions and fuse them with features extracted from the images. A deep neural decision forest is then used for pedestrian classification. Extensive experiments on the collected dataset demonstrate the effectiveness of the proposed model.
