Special Section on Multimodal Understanding of Social, Affective and Subjective Attributes
Editorial to Special Issue on Deep Learning for Intelligent Multimedia Analytics
Automatic product image classification is a task of crucial importance for the better understanding and management of online retailers. Motivated by recent advances in deep convolutional neural networks (CNNs) for image classification, in this work we revisit the problem in the context of product images with a predefined categorical hierarchy and attributes, aiming to leverage the hierarchy and attributes to further improve classification accuracy. With these structure-aware clues, we argue that more advanced CNN models can be developed beyond the one-versus-all classification performed by conventional CNNs. To this end, the novel efforts of this work include: developing a salient-sensitive CNN that focuses more on the product foreground by inserting a spatial attention layer at a proper location; proposing a multi-class regression based refinement method that is expected to generate more accurate predictions by utilizing prediction scores from multiple preceding CNNs, each corresponding to a distinct classifier on a categorical layer in the hierarchy; and devising a multi-task deep learning architecture that effectively explores correlations among the categories and attributes for better categorical label prediction. Experimental results on nearly one million real-world product images validate the effectiveness of the proposed efforts, jointly and individually, with performance gains observed.
Person re-identification aims at identifying a certain pedestrian across non-overlapping multi-camera networks at different times and places. Existing person re-identification approaches mainly focus on matching pedestrians in still images; however, little attention has been paid to person re-identification in videos. Compared to images, video clips contain the motion of pedestrians, which is crucial to re-identification. Moreover, consecutive video frames present pedestrian appearance with different poses and from different viewpoints, providing valuable information for addressing the challenges of pose variation, occlusion, and viewpoint change. In this paper, we propose a Dense 3D-Convolutional Network (D3DNet) to jointly learn spatio-temporal and appearance features for person re-identification in videos. The D3DNet consists of multiple 3D dense blocks and transition layers. The 3D dense blocks enlarge the receptive fields of visual neurons in the spatial and temporal dimensions, leading to discriminative appearance representations as well as short-term and long-term motion information of pedestrians without requiring an additional motion estimation module. Moreover, we propose an improved loss function consisting of an identification loss and a center loss to minimize intra-class variance and maximize inter-class variance simultaneously, addressing the challenge of large intra-class variance and small inter-class variance, which is a common phenomenon in the person re-identification task. Extensive experiments on two widely used surveillance video datasets, i.e., MARS and iLIDS-VID, have shown the effectiveness of the proposed approach.
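The abstract does not spell out the form of the combined loss. As a minimal sketch, assuming the commonly used formulation in which a softmax cross-entropy identification loss is added to a center loss scaled by a weight, the per-sample objective could be computed as below. The function names, the weight `lam`, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
import math

def identification_loss(logits, label):
    """Softmax cross-entropy for one sample (numerically stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(logits, feature, label, centers, lam=0.01):
    """Identification loss plus lam times center loss (lam is an assumed hyperparameter)."""
    return identification_loss(logits, label) + lam * center_loss(feature, centers[label])

# Toy usage with made-up values: two identities, 2-D features.
centers = {0: [1.0, 1.0], 1: [-1.0, -1.0]}
loss = joint_loss(logits=[2.0, 0.5], feature=[1.2, 0.7], label=0, centers=centers)
```

In this formulation the center term pulls features of the same identity toward a shared center, which reduces intra-class variance, while the identification term keeps different identities separable.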
Diversifying the retrieval results of a query improves search efficiency for customers. Showing information on multiple aspects gives users an overview of the object, which helps them quickly target their demands. To discover aspects, research has focused on generating image clusters from initially retrieved results. As an effective approach, Latent Dirichlet Allocation (LDA) has been shown to perform well at discovering high-level topics. However, traditional LDA is designed to process textual words and requires input in the form of discrete data. When this algorithm is applied to continuous visual features, a common solution is to quantize the continuous features into a discrete representation with the Bag-of-Visual-Words (BoVW) algorithm. During this process, quantization error inevitably leads to information loss. To construct a topic model with complete visual information, this work applies Gaussian Latent Dirichlet Allocation (GLDA) to the diversity issue in image retrieval. In this model, the traditional multinomial distribution is replaced by a Gaussian distribution to model continuous visual features. In addition, we propose a two-phase spectral clustering strategy, called dual spectral clustering, to generate clusters from the segment level to the image level. Experiments on the challenging landmarks of the DIV400 database show that our proposal improves relevance and diversity by about 10% compared with traditional topic models.
Scene classification is a challenging problem. Compared with object images, scene images are more abstract, as they are composed of objects. Object and scene images have different characteristics, with different scales and composition structures. How to effectively integrate local mid-level semantic representations that include both object and scene concepts is an important aspect of scene classification that needs to be investigated. In this paper, the idea of a shared codebook is introduced by organically integrating deep learning, concept features, and local feature encoding techniques. More specifically, the shared local feature codebook is generated from the combined ImageNet1000 and Places365 concepts (Mixed1365), using convolutional neural networks. As the Mixed1365 features cover all the semantic information, including both object and scene concepts, we can extract from them a shared codebook that contains only a subset of the whole 1365 concepts while keeping the same codebook size. The shared codebook not only provides complementary representations without additional codebook training, but can also be adaptively extracted for different scene classification tasks. A method combining both the original codebook and the shared codebook is proposed for scene classification. In this way, more comprehensive and representative image features can be generated for classification. Extensive experiments conducted on two public datasets validate the effectiveness of the proposed method. In addition, some useful observations are reported that show the advantage of the shared codebook.
Dynamic Adaptive Streaming over HTTP (DASH) is a popular over-the-top video content distribution technique that adapts the streaming session according to the user's network conditions, typically in terms of downlink bandwidth. This video quality adaptation can be achieved by scaling the frame quality, spatial resolution, or frame rate. Despite the flexibility of the video quality scaling methods, each of these quality scaling dimensions has varying effects on the Quality of Experience (QoE) for end users. Furthermore, in video streaming, the changes in motion over time along with the scaling method employed have an influence on QoE, hence the need to carefully tailor scaling methods to suit streaming applications and content type. In this work, we investigate an intelligent DASH approach for the latest video coding standard H.265 and propose a heuristic QoE-aware cost-efficient adaptation scheme that does not switch unnecessarily to the highest quality level but rather stays temporarily at an intermediate quality level in certain streaming scenarios. Such an approach achieves a comparable and consistent level of quality under impaired network conditions, as commonly found in Internet and mobile networks, whilst reducing bandwidth requirements and quality switching overhead. The rationale is based on our empirical experiments, which show that an increase in bitrate does not necessarily mean a noticeable improvement in QoE. Furthermore, our work demonstrates that the Signal-to-Noise Ratio (SNR) and the spatial resolution scalability types are the best fit for our proposed algorithm. Finally, we demonstrate an innovative interaction between quality scaling methods and the polarity of switching operations. The proposed QoE-aware scheme is implemented, and empirical results show that it is able to reduce bandwidth requirements by up to 41% whilst achieving equivalent QoE compared with a representative DASH reference implementation.
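The idea of staying at an intermediate quality level when a higher bitrate brings no noticeable QoE gain can be illustrated with a small selection rule. This is only a sketch under stated assumptions: the bitrate ladder, the per-level QoE scores, the `qoe_tolerance` threshold, and the function name are all hypothetical and are not taken from the paper's heuristic.

```python
def pick_representation(ladder, bandwidth_kbps, qoe_tolerance=0.2):
    """Pick the cheapest representation whose QoE is close to the best affordable one.

    ladder: list of dicts with hypothetical keys "kbps" (bitrate) and "qoe" (score).
    """
    affordable = [r for r in ladder if r["kbps"] <= bandwidth_kbps]
    if not affordable:
        # Nothing fits the estimated bandwidth: fall back to the lowest bitrate.
        return min(ladder, key=lambda r: r["kbps"])
    best_qoe = max(r["qoe"] for r in affordable)
    # Keep every level whose QoE is within the tolerance of the best one ...
    good_enough = [r for r in affordable if best_qoe - r["qoe"] <= qoe_tolerance]
    # ... and choose the one that costs the least bandwidth.
    return min(good_enough, key=lambda r: r["kbps"])

# Made-up ladder: the top level costs twice the bitrate for a marginal QoE gain.
ladder = [
    {"kbps": 500, "qoe": 2.5},
    {"kbps": 1500, "qoe": 3.9},
    {"kbps": 3000, "qoe": 4.0},
]
choice = pick_representation(ladder, bandwidth_kbps=4000)
```

With this toy ladder the rule settles on the 1500 kbps level even though 3000 kbps is affordable, which mirrors the bandwidth-saving behaviour the abstract describes.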
Smart photography carries the promise of quality improvement and functionality extension in making aesthetically appealing pictures. In this paper, we focus on self-portrait photographs and introduce new methods that guide a user in how to best pose while taking a selfie. While most of the current solutions use a post processing procedure to beautify a picture, the developed tool enables a novel function of recommending a good look before the photo is captured. Given an input face image, the tool automatically estimates the pose-based aesthetic score, finds the most attractive angle of the face and suggests how the pose should be adjusted. The recommendation results are determined adaptively to the appearance and initial pose of the input face. We apply a data mining approach to find distinctive, frequent itemsets and association rules from online profile pictures, upon which the aesthetic estimation and pose recommendation methods are developed. A simulated and a real image set are used for experimental evaluation. The results show the proposed aesthetic estimation method can effectively select user-favorable photos. Moreover, the recommendation performance for the vertical adjustment is moderately related to the degree of conformity among the professional photographers' recommendations. This study echoes the trend of instant photo sharing, in which a user takes a picture and then immediately shares it on a social network without engaging in tedious editing.
Despite the promising progress made in visual captioning and paragraphing, visual storytelling is still largely unexplored. This task is more challenging due to the difficulty of modeling an ordered photo sequence and of generating a relevant paragraph with an expressive language style for storytelling. To deal with these challenges, we propose an Attribute-based Hierarchical Generative model with Reinforcement Learning and adversarial training (AHGRL). First, to model the ordered photo sequence and the complex story structure, we propose an attribute-based hierarchical generator. The generator incorporates semantic attributes to create more accurate and relevant descriptions. The hierarchical framework enables the generator to learn from the complex paragraph structure. Second, to generate story-style paragraphs, we design a language-style discriminator, which provides word-level rewards to optimize the generator by policy gradient. Third, we further consider the story generator and the reward critic as adversaries. The generator aims to create paragraphs that are indistinguishable from human-level stories, whereas the critic aims at distinguishing them and further improving the generator. Extensive experiments on a widely used dataset demonstrate the advantages of the proposed method over state-of-the-art methods.
Recent studies have shown that spatial relationships among objects are very important for visual recognition since they provide rich clues on object contexts within images. In this paper, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, with the aim of explicitly modeling spatial object contexts within the images. In particular, for every object proposal obtained from the input image, we extract high-level semantic object features with convolutional neural networks. Then, we explicitly apply gate units to these extracted features for important object selection and noise removal. These selected object features are organized into the proposed SFM, which is a compact and discriminative representation with the spatial information among objects preserved. Finally, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) as classifiers on top of the SFM for content recognition, which are expected to exploit the spatial relationships among objects. We also introduce a novel multi-task learning framework to help learn the model parameters in the training phase. It consists of a basic image classification loss in cross-entropy form, an object localization loss to guide important object selection, and a grid labeling loss to predict object labels at SFM grids. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach, and very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. In addition, the SFMs learned on the image domain are transferred to video classification on the CCV and FCVID benchmarks, and the results demonstrate their robustness and generalization capability.
Face recognition from 2D still images and videos is quite successful even in "in the wild" conditions. In contrast, less consolidated results are available for cases where face data come from non-conventional cameras, such as infrared or depth cameras. In this paper, we investigate this latter scenario, assuming a low-resolution depth camera is used to perform face recognition in an uncooperative context. To this end, we propose, first, to automatically select a set of frames from the depth sequence of the camera based on whether they provide a good view of the face in terms of pose and distance. Then, we design a progressive refinement approach to reconstruct a higher-resolution model from the selected low-resolution frames. This process accounts for the anisotropic error of the existing points in the current 3D model and the points in a newly acquired frame, so that the refinement step can progressively adjust the point positions in the model using a Kalman-like estimation. The quality of the reconstructed model is evaluated by considering the error between the reconstructed models and their corresponding high-resolution scans used as ground truth. In addition, we performed face recognition using the reconstructed models as probes against either a gallery of reconstructed models or a gallery of high-resolution scans. The obtained results confirm that the reconstructed models can be used effectively for the face recognition task.
Single-image super-resolution (SISR) methods based on convolutional neural networks (CNNs) have shown great success in the literature. However, most deep CNN models do not have direct access to the subsequent layers, which seriously hinders the information flow. Moreover, they do not make full use of the hierarchical features from the original low-resolution (LR) images, thereby achieving relatively low performance. In this paper, we present a special SISR CNN with symmetrical nested residual connections for super-resolution reconstruction to further improve the quality of the reconstructed image. Compared with previous SISR CNNs, our learning architecture shows significant improvements in accuracy and execution time. It has a larger image region for contextual spreading. Its symmetrical combinations provide multiple short paths for forward propagation, which improves the reconstruction accuracy, and for the backward propagation of gradient flow, which accelerates convergence. Extensive experiments on the open challenge datasets confirm the effectiveness of symmetrical residual connections. Our method can reconstruct high-quality high-resolution (HR) images at a relatively fast speed and outperforms other methods by a large margin.
In this paper, we propose video delivery schemes ensuring around one-second delivery latency. For this purpose, we use Dynamic Adaptive Streaming over HTTP (DASH), a standardized counterpart of HTTP Live Streaming (HLS), so as to benefit from video representation switching between successive video segments. We also propose HTTP/2-based algorithms to apply video frame discarding policies inside a video segment. When a selected DASH representation does not match the available network resources, current solutions suffer from rebuffering events. Rebuffering not only impacts the Quality of Experience (QoE) but also increases the delivery delay between the displayed and the original video streams. We observe that rebuffering-based solutions may increase the delivery delay by 1.5 s to 2 s inside a six-second video segment. In this work, we develop optimal and practical algorithms in order to respect the one-second targeted latency. In all algorithms, we selectively drop the least meaningful video frames thanks to the HTTP/2 stream resetting feature. A large number of missing video frames results in a break in temporal fluidity known as video jitter. The displayed video then appears as a series of snapshots. Our simulations show that we respect the one-second targeted latency while ensuring an acceptable video quality, with a Peak Signal-to-Noise Ratio (PSNR) of at least 30 dB. We also quantify and qualify the resulting jitter for each algorithm. We show that both the optimal and the practical algorithms we propose decrease the impact of jitter on the displayed videos. For example, 97% of the optimal algorithm's outputs and 87% of the practical algorithms' outputs are considered acceptable, compared to only 57% of the outputs of the basic First In First Out (FIFO) algorithm.
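The frame discarding policy can be pictured as choosing which frames of a segment to keep so that the segment fits a byte budget derived from the latency target. The sketch below is an assumption-laden illustration, not the paper's optimal or practical algorithm: it ranks frames by a hypothetical importance order (B below P below I), drops later frames first within a type, and never drops I frames. It only simulates the selection; actually cancelling a frame's transfer would rely on resetting its HTTP/2 stream, which is not shown here.

```python
def select_frames(frames, byte_budget):
    """Return the indices of frames to keep so the segment fits byte_budget.

    frames: list of (index, kind, size_bytes) with kind in {"I", "P", "B"}.
    The drop order below is an illustrative assumption.
    """
    priority = {"B": 0, "P": 1, "I": 2}
    total = sum(size for _, _, size in frames)
    kept = {index for index, _, _ in frames}
    # Least important frames first; within a type, later frames first.
    drop_order = sorted(frames, key=lambda f: (priority[f[1]], -f[0]))
    for index, kind, size in drop_order:
        if total <= byte_budget or kind == "I":
            break  # budget met, or only I frames remain (never dropped here)
        kept.discard(index)
        total -= size
    return sorted(kept)

# Made-up segment: one I frame followed by B and P frames.
segment = [(0, "I", 100), (1, "B", 20), (2, "P", 50), (3, "B", 20), (4, "P", 50)]
kept = select_frames(segment, byte_budget=200)
```

Dropping many frames in a row is what produces the jitter the abstract mentions, so a fuller policy would also bound the length of consecutive gaps; that constraint is left out of this sketch.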
It is known that the inconsistent distribution and representation of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate such heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown a strong ability to model data distributions and learn discriminative representations. Inspired by this, we aim to effectively correlate existing large-scale heterogeneous data of different modalities by utilizing the power of GANs to model the cross-modal joint distribution, and by fully exploiting the idea of adversarial learning to learn a discriminative common representation for bridging the heterogeneity gap. Thus, in this paper we propose Cross-modal Generative Adversarial Networks (CM-GANs) with the following contributions: (1) A cross-modal GANs architecture is proposed to model the joint distribution over the data of different modalities. The inter-modality and intra-modality correlation can be explored simultaneously in the generative and discriminative models, which compete against each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit the cross-modal correlation for learning the common representation, but also preserve the reconstruction information for capturing the semantic consistency within each modality. (3) A cross-modal adversarial mechanism is proposed, which utilizes two kinds of discriminative models to simultaneously conduct intra-modality and inter-modality discrimination. They mutually boost each other to make the generated common representation more discriminative through the adversarial training process. In summary, our proposed CM-GANs approach can utilize GANs to perform cross-modal common representation learning, by which the heterogeneous data can be effectively correlated. Extensive experiments are conducted to verify the performance of CM-GANs on cross-modal retrieval, compared with 11 state-of-the-art methods on 3 cross-modal datasets.
Deep cross-modal learning has successfully demonstrated excellent performances in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities such as audio and lyrics are taken into account. Stemming from the characteristic of temporal structures of music in nature, we are motivated to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for audio modality and text modality (lyrics). Different modality data are converted to the same canonical space where inter modal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study on understanding the correlation between language and music audio through deep architectures for learning the paired temporal correlation of audio and lyrics. Pre-trained Doc2vec model followed by fully-connected layers (fully-connected deep neural network) is used to represent lyrics. Two significant contributions are made in the audio branch, as follows: i) pre-trained CNN followed by fully-connected layers is investigated for representing music audio. ii) We further suggest an end-to-end architecture that simultaneously trains convolutional layers and fully-connected layers to better learn temporal structures of music audio. Particularly, our end-to-end deep architecture contains two properties: simultaneously implementing feature learning and cross-modal correlation learning, and learning joint representation by considering temporal structures. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
Emotion recognition methodologies from physiological signals are increasingly becoming personalized, due to the subjective responses of different subjects to physical stimuli. Existing works mainly focused on modelling the involved physiological corpus of each subject, without considering the psychological factors, such as interest and personality. The latent correlation among different subjects has also been rarely examined. In this paper, we propose to investigate the influence of personality on emotional behavior in a hypergraph learning framework. Assuming that each vertex is a compound tuple (subject, stimuli), multi-modal hypergraphs can be constructed based on the personality correlation among different subjects and on the physiological correlation among corresponding stimuli. To model the different importance within vertices, hyperedges and modalities, we assign each of them with weight. Doing so allows the learning procedure to be conducted on the vertex-weighted multi-modal multi-task hypergraphs, thus simultaneously modelling the emotions of multiple subjects. The estimated emotion relevance is employed for emotion recognition. We carry out extensive experiments on the ASCERTAIN dataset and the results demonstrate the superiority of the proposed method, as compared to the state-of-the-art approaches.
This paper proposes a novel feature-extraction framework for inferring impressed personality traits, emergent leadership skills, communicative competence and hiring decisions. The proposed framework extracts multimodal features, describing each participant's nonverbal activities. It captures inter-modal and inter-person relationships in interaction and captures how the target interactor generates nonverbal behavior when the other interactors also generate the nonverbal behavior. The inter-modal and inter-personal patterns are identified as frequent co-occurring events based on graph clustering from multimodal sequences. The framework can be applied to any type of interaction task. The proposed framework is applied to the SONVB corpus, which is an audio-visual dataset collected from dyadic job interviews, and the ELEA audio-visual data corpus, which is a dataset collected from group meetings. We evaluate the framework on a binary classification task of 15 impression variables in two data corpora. The experimental results show that the model trained with co-occurrence features is more accurate than previous models for 14 out of 15 traits.
The multimedia community has witnessed the rise of deep learning based techniques for analyzing multimedia content more effectively. In the past decade, the convergence of deep learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection, and regression, and has also fundamentally changed the landscape of several relatively new areas, such as semantic segmentation, captioning, and content generation. This paper aims to review the development path of major tasks in multimedia analytics and then take a look at future directions. We start by summarizing the fundamental deep learning techniques related to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. Moreover, the performance review on popular benchmarks traces a pathway of technology advancement and helps identify both the milestone works and future directions.
Learning robust and representative features across multiple modalities has been a fundamental problem in the machine learning and multimedia fields. In this paper, we propose a novel MUltimodal Convolutional AutoEncoder (MUCAE) approach to learn representative features from visual and textual modalities. For each modality, we integrate the convolutional operation into an autoencoder framework to learn a joint representation from the original image and text content. We optimize the convolutional autoencoders of different modalities jointly by exploiting the correlation between the hidden representations from the convolutional autoencoders, in particular by minimizing both the reconstruction error of each modality and the correlation divergence between the hidden features of different modalities. Compared to conventional solutions relying on hand-crafted features, the proposed MUCAE approach encodes features from image pixels and text characters directly and produces more representative and robust features. We evaluate MUCAE on cross-media retrieval as well as unimodal classification tasks over real-world large-scale multimedia databases. Experimental results have shown that MUCAE performs better than state-of-the-art methods.
Deep convolutional neural networks (CNNs) have achieved remarkable results in computer vision tasks for end-to-end learning. We evaluate here the power of a deep CNN to learn robust features from raw EEG data to detect seizures. Seizures are hard to detect as they vary both between and within patients. In this paper, we use a deep CNN model for the seizure detection task on an open-access EEG epilepsy dataset collected at the Children's Hospital Boston. Our deep learning model is able to extract spectral and temporal features from EEG epilepsy data and use them to learn a general structure of a seizure that is less sensitive to variations. Our method produced an overall sensitivity of 90.00%, specificity of 91.65%, and accuracy of 98.05% for the whole dataset of 23 patients. Hence, it can be used as an excellent cross-patient classifier. The results show that our model performs better than previous state-of-the-art models for the cross-patient seizure detection task. The proposed model can also visualize special orientation of band power features. We use correlation maps to relate spectral amplitude features to the output in the form of images. By using the results from our deep learning model, this visualization method can be used as an effective multimedia tool for producing quick and relevant brain mapping images that can be used by medical experts for further investigation.
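For readers less familiar with the three reported measures, the standard definitions from a binary confusion matrix can be sketched as follows. The function name and the counts in the usage lines are made up for illustration and have no relation to the dataset or results described in the abstract.

```python
def detection_metrics(tp, fn, tn, fp):
    """Standard binary classification measures from confusion-matrix counts.

    tp / fn: seizure windows detected / missed.
    tn / fp: non-seizure windows correctly rejected / falsely flagged.
    """
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall fraction correct
    return sensitivity, specificity, accuracy

# Hypothetical counts, chosen only to exercise the function.
sens, spec, acc = detection_metrics(tp=45, fn=5, tn=900, fp=50)
```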
Transfer learning, which focuses on finding a favorable representation for instances of different domains based on auxiliary data, can mitigate the divergence between domains through knowledge transfer. Recently, increasing efforts on transfer learning have employed deep neural network (DNN) to learn more robust and higher level feature representations to better tackle cross-media disparity. However, only a few papers consider the correction and semantic matching between multi-layer heterogeneous domain networks. In this paper, we propose a deep semantic mapping model for heterogeneous multimedia transfer learning (DHTL) using co-occurrence data. More specifically, we integrate the DNN with canonical correlation analysis (CCA) to derive a deep correlation subspace as the joint semantic representation for associating data across different domains. In the proposed DHTL, a multi-layer correlation matching network across domains is constructed, in which the CCA is combined to bridge each pair of domain-specific hidden layers. To train the network, a joint objective function is defined and the optimization processes are presented. When the deep semantic representation is achieved, the shared features of the source domain are transferred for task learning in the target domain. Extensive experiments for three multimedia recognition applications demonstrate that the proposed DHTL can effectively find deep semantic representations for heterogeneous domains, and is superior to the several existing state-of-the-art methods for deep transfer learning.
Bilinear models are very powerful in multimodal fusion tasks such as Visual Question Answering (VQA). The predominant bilinear methods can all be seen as a kind of tensor-based decomposition operation that contains a key kernel called the core tensor. Current approaches usually focus on reducing the computational complexity by imposing a low-rank constraint on the core tensor. In this paper, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only maintains the advantages of previous bilinear methods but also conducts sparse bilinear interactions between modalities. Our method is based on the Block Term Decomposition theory of tensors, which results in a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many "tiny" bilinear operations in different feature spaces. Thus, introducing sparsity into the bilinear operation can significantly increase the performance of feature fusion and improve VQA models. Moreover, our BTDP is very flexible in design. We develop several variants of BTDP and discuss the effects of the diagonal blocks of the core tensor. Extensive experiments on the two challenging VQA-v1 and VQA-v2 datasets show that our BTDP method outperforms current bilinear models, achieving state-of-the-art performance.
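The equivalence between a block-diagonal core and many small bilinear operations can be checked numerically in a simplified setting. The sketch below makes a loud simplification: it uses a second-order (matrix) bilinear form with a scalar output rather than the third-order core tensor of BTDP, and all function names and numbers are illustrative assumptions.

```python
def bilinear(x, W, y):
    """Scalar bilinear form x^T W y for plain Python lists."""
    return sum(x[i] * W[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of (possibly non-square) blocks."""
    n_rows = sum(len(b) for b in blocks)
    n_cols = sum(len(b[0]) for b in blocks)
    W = [[0] * n_cols for _ in range(n_rows)]
    r = c = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                W[r + i][c + j] = value
        r += len(b)
        c += len(b[0])
    return W

def blockwise_bilinear(x, blocks, y):
    """Sum of small bilinear forms, one per block, on matching slices of x and y."""
    total = 0
    r = c = 0
    for b in blocks:
        rows, cols = len(b), len(b[0])
        total += bilinear(x[r:r + rows], b, y[c:c + cols])
        r += rows
        c += cols
    return total

blocks = [[[1, 2], [3, 4]], [[5]]]
x, y = [1, 2, 3], [4, 5, 6]
full = bilinear(x, block_diag(blocks), y)
tiny = blockwise_bilinear(x, blocks, y)
```

Both computations give the same value because the zero entries outside the diagonal blocks contribute nothing, so each block only ever interacts with its own slice of the two inputs.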
In this paper, we present convolutional attention networks (CAN) for unconstrained scene text recognition. Recent dominant approaches for scene text recognition are mainly based on convolutional neural networks (CNN) and recurrent neural networks (RNN), where the CNN encodes images and the RNN generates character sequences. Our CAN differs from these methods in that it is built entirely on CNNs combined with an attention mechanism. The distinctive characteristics of our method include: (1) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN. (2) The attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling. (3) Position embeddings are used in both the spatial encoder and the sequence decoder to give our networks a sense of location. We conduct experiments on standard datasets for scene text recognition, including the Street View Text, IIIT5K, and ICDAR datasets. The experimental results validate the effectiveness of the different components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior works, even without the use of an RNN.
Information-centric networking (ICN) has been touted as a revolutionary solution for the future Internet, which will be dominated by video traffic. This work investigates the challenge of distributing adaptive bit rate (ABR) video content over ICN. In particular, we utilize the in-network caching capability of ICN routers to serve users; in addition, with the help of named functions, we enable ICN routers to transcode videos to lower-bitrate versions to improve the cache hit ratio. Mathematically, we formulate this design challenge as a constrained optimization problem, which aims to maximize the cache hit ratio for service providers and minimize the service delay for end users. We design a two-step iterative algorithm to find the optimum. First, given a content management scheme, we minimize the service delay by optimally configuring the routing scheme. Second, we maximize the cache hits for a given routing policy. Finally, we rigorously prove the convergence of the algorithm. Through extensive simulations, we verify the convergence and the performance gains over other algorithms. We also find that more resources should be allocated to ICN routers with heavier request rates, and that the routing scheme favors the shortest path to schedule more traffic.
In this paper, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. In our approach, VSDA consists of two parts: a modified visual attention model is used to extract sub-region image features, and a new SEmantic Attention model (SEA) is proposed to distill semantic features. Traditional attribute-based models always neglect the distinctive importance of each attribute word and fuse all of them into recurrent neural networks, resulting in abundant irrelevant semantic features. In contrast, at each time step, our model selects the most relevant word, which aligns with the current context. That is, the real power of VSDA lies in its ability not only to leverage semantic features but also to eliminate the influence of irrelevant attribute words, making the semantic guidance more precise. Furthermore, our approach addresses the problem that visual attention models cannot help generate non-visual words. Since visual and semantic features are complementary to each other, our model can leverage both of them to strengthen the generation of visual and non-visual words. Extensive experiments are conducted on the MS COCO dataset, and the results show that VSDA outperforms other methods and achieves promising performance.
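The idea of weighting attribute words by their relevance to the current decoding context, as described in the abstract above, can be sketched with a generic softmax attention. The abstract does not say how SEA scores relevance, so the dot-product scoring, the function name `semantic_attention`, and the toy embeddings below are assumptions for illustration only.

```python
import numpy as np


def semantic_attention(context, attribute_embeddings):
    """Weight attribute words by relevance to the current context (assumed dot-product score)."""
    scores = attribute_embeddings @ context   # one relevance score per attribute word
    scores = scores - scores.max()            # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over attribute words
    attended = weights @ attribute_embeddings  # weighted sum of attribute embeddings
    return weights, attended


# Toy example: three attribute words with hypothetical 2-d embeddings.
attributes = ["dog", "grass", "running"]
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
context = np.array([2.0, 0.1])  # hypothetical decoder state at one time step

weights, attended = semantic_attention(context, embeddings)
most_relevant = attributes[int(np.argmax(weights))]
```

Here the context vector points mostly along the first dimension, so the word whose embedding aligns with it receives the largest weight, while the other words still contribute in proportion to their scores.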
Facial landmarking is a fundamental task in automatic machine-based face analysis. The majority of existing techniques for this problem are based on 2D images; however, they suffer from illumination and pose variations that may largely degrade landmarking performance. The emergence of 3D data theoretically provides an alternative to overcome these weaknesses in the 2D domain. This paper proposes a novel approach to 3D facial landmarking, which combines the advantages of feature-based and model-based methods in a progressive coarse-to-fine manner (initial, intermediate and fine stages). For the initial stage, a few fiducial landmarks (i.e., the nose tip and two inner eye corners) are robustly detected through curvature analysis, and these points are further exploited to initialize the subsequent stage. For the intermediate stage, a statistical model, namely the Active Normal Model (ANM), is learned in the feature space of three normal components of the facial point cloud rather than the smooth original coordinates. For the fine stage, cascade regression is employed to locally refine the landmarks according to their geometric attributes. The proposed approach can accurately localize dozens of fiducial points on each 3D face scan, greatly surpassing feature-based methods, and it also improves on the state of the art of model-based methods in two aspects, i.e., sensitivity to initialization and deficiency in discrimination. The proposed method is evaluated on the BU-3DFE and Bosphorus databases, and competitive results are achieved in comparison with those in the literature, clearly demonstrating its effectiveness.