Program at a glance


Conference Brochure


Conference Mobile App

We created an event for the conference on EventBase with all details about the program. You can browse it by downloading the EventBase app from the app store and searching for “ECML PKDD 2017”.

Download Eventbase for :


Journal Special Issues

Pierre-Philippe Mathieu

ESA/ESRIN, EO Science, Applications and New Technologies

Supported by : Sponsors

Enabling a smarter planet with Earth Observation

Frank Hutter

University of Freiburg

Supported by : Sponsors

Towards end-to-end learning & optimization

John Quackenbush

Dana-Farber Cancer Institute
and Harvard TH Chan School of Public Health

Using Networks to Link Genotype to Phenotype

Alex Graves

Google DeepMind

Frontiers in Recurrent Neural Network Research

Cordelia Schmid

INRIA

Automatic Understanding of the Visual World

Inderjit Dhillon

University of Texas at Austin

Multi-Target Prediction via Low-Rank Embeddings

Journal Track

Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.

View article

A novel online ensemble strategy, ensemble BPegasos(EBPegasos), is proposed to solve the problems simultaneously caused by concept drifting and the curse of dimensionality in classifying high-dimensional evolving data streams, which has not been addressed in the literature. First, EBPegasos uses BPegasos, an online kernelized SVM-based algorithm, as the component classifier to address the scalability and sparsity of high-dimensional data. Second, EBPegasos takes full advantage of the characteristics of BPegasos to cope with various types of concept drifts. Specifically, EBPegasos constructs diverse component classifiers by controlling the budget size of BPegasos; it also equips each component with a drift detector to monitor and evaluate its performance, and modifies the ensemble structure only when large performance degradation occurs. Such conditional structural modification strategy makes EBPegasos strike a good balance between exploiting and forgetting old knowledge. Lastly, we first prove experimentally that EBPegasos is more effective and resource-efficient than the tree ensembles on high-dimensional data. Then comprehensive experiments on synthetic and real-life datasets also show that EBPegasos can cope with various types of concept drifts significantly better than the state-of-the-art ensemble frameworks when all ensembles use BPegasos as the base learner.

View article

Open set recognition is a classification-like task.It is accomplished not only by the identification of observations which belong to targeted classes (i.e., the classes among those represented in the training sample which should be later recognized) but also by the rejection of inputs from other classes in the problem domain. The need for proper handling of elements of classes beyond those of interest is frequently ignored, even in works found in the literature. This leads to the improper development of learning systems, which may obtain misleading results when evaluated in their test beds, consequently failing to keep the performance level while facing some real challenge. The adaptation of a classifier for open set recognition is not always possible: the probabilistic premises most of them are built upon are not valid in a open-set setting. Still, this paper details how this was realized for WiSARD a weightless artificial neural network model. Such achievement was based on an elaborate distance-like computation this model provides and the definition of rejection thresholds during training. The proposed methodology was tested through a collection of experiments, with distinct backgrounds and goals. The results obtained confirm the usefulness of this tool for open set recognition.

View article

The problem of community detection in a multilayer network can effectively be addressed by aggregating the community structures separately generated for each network layer, in order to infer a consensus solution for the input network. To this purpose, clustering ensemble methods developed in the data clustering field are naturally of great support. Bringing these methods into a community detection framework would in principle represent a powerful and versatile approach to reach more stable and reliable community structures. Surprisingly, research on consensus community detection is still in its infancy. In this paper, we propose a novel modularity-driven ensemble-based approach to multilayer community detection. A key aspect is that it finds consensus community structures that not only capture prototypical community memberships of nodes, but also preserve the multilayer topology information and optimize the edge connectivity in the consensus via modularity analysis. Empirical evidence obtained on seven real-world multilayer networks sheds light on the effectiveness and efficiency of our proposed modularity-driven ensemble-based approach, which has shown to outperform state-of-the-art multilayer methods in terms of modularity, silhouette of community memberships, and redundancy assessment criteria, and also in terms of execution times.

View article

Recent advances have demonstrated substantial benefits from learning with both generative and discriminative parameters. On the one hand, generative approaches address the estimation of the parameters of the joint distribution—P(y, x), which for most network types is very computationally efficient (a notable exception to this are Markov networks) and on the other hand, discriminative approaches address the estimation of the parameters of the posterior distribution—and, are more effective for classification, since they fit P(y|x) directly. However, discriminative approaches are less computationally efficient as the normalization factor in the conditional log-likelihood precludes the derivation of closed-form estimation of parameters. This paper introduces a new discriminative parameter learning method for Bayesian network classifiers that combines in an elegant fashion parameters learned using both generative and discriminative methods. The proposed method is discriminative in nature, but uses estimates of generative probabilities to speed-up the optimization process. A second contribution is to propose a simple framework to characterize the parameter learning task for Bayesian network classifiers. We conduct an extensive set of experiments on 72 standard datasets and demonstrate that our proposed discriminative parameterization provides an efficient alternative to other state-of-the-art parameterizations.

View article

Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to (1) flexibility in terms of quality measures and constraints that can be used, and/or (2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: (1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and (2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration.

View article

The problem of local community detection in graphs refers to the identification of a community that is specific to a query node and relies on limited information about the network structure. Existing approaches for this problem are defined to work in dynamic network scenarios, however they are not designed to deal with complex real-world networks, in which multiple types of connectivity might be considered. In this work, we fill this gap in the literature by introducing the first framework for local community detection in multilayer networks (ML-LCD). We formalize the ML-LCD optimization problem and provide three definitions of the associated objective function, which correspond to different ways to incorporate within-layer and across-layer topological features. We also exploit our framework to generate multilayer global community structures. We conduct an extensive experimentation using seven real-world multilayer networks, which also includes comparison with state-of-the-art methods for single-layer local community detection and for multilayer global community detection. Results show the significance of our proposed methods in discovering local communities over multiple layers, and also highlight their ability in producing global community structures that are better in modularity than those produced by native global community detection approaches.

View article

Sequential traces of user data are frequently observed online and offline, e.g., as sequences of visited websites or as sequences of locations captured by GPS. However, understanding factors explaining the production of sequence data is a challenging task, especially since the data generation is often not homogeneous. For example, navigation behavior might change in different phases of browsing a website or movement behavior may vary between groups of users. In this work, we tackle this task and propose MixedTrails, a Bayesian approach for comparing the plausibility of hypotheses regarding the generative processes of heterogeneous sequence data. Each hypothesis is derived from existing literature, theory, or intuition and represents a belief about transition probabilities between a set of states that can vary between groups of observed transitions. For example, when trying to understand human movement in a city and given some data, a hypothesis assuming tourists to be more likely to move towards points of interests than locals can be shown to be more plausible than a hypothesis assuming the opposite. Our approach incorporates such hypotheses as Bayesian priors in a generative mixed transition Markov chain model, and compares their plausibility utilizing Bayes factors. We discuss analytical and approximate inference methods for calculating the marginal likelihoods for Bayes factors, give guidance on interpreting the results, and illustrate our approach with several experiments on synthetic and empirical data from Wikipedia and Flickr. Thus, this work enables a novel kind of analysis for studying sequential data in many application areas.

View article

For many real-world applications, structured regression is commonly used for predicting output variables that have some internal structure. Gaussian conditional random fields (GCRF) are a widely used type of structured regression model that incorporates the outputs of unstructured predictors and the correlation between objects in order to achieve higher accuracy. However, applications of this model are limited to objects that are symmetrically correlated, while interaction between objects is asymmetric in many cases. In this work we propose a new model, called Directed Gaussian conditional random fields (DirGCRF), which extends GCRF to allow modeling asymmetric relationships (e.g. friendship, influence, love, solidarity, etc.). The DirGCRF models the response variable as a function of both the outputs of unstructured predictors and the asymmetric structure. The effectiveness of the proposed model is characterized on six types of synthetic datasets and four real-world applications where DirGCRF was consistently more accurate than the standard GCRF model and baseline unstructured models.

View article

To learn control policies in unknown environments, learning agents need to explore by trying actions deemed suboptimal. In prior work, such exploration is performed by either perturbing the actions at each time-step independently, or by perturbing policy parameters over an entire episode. Since both of these strategies have certain advantages, a more balanced trade-off could be beneficial. We introduce a unifying view on step-based and episode-based exploration that allows for such balanced trade-offs. This trade-off strategy can be used with various reinforcement learning algorithms. In this paper, we study this generalized exploration strategy in a policy gradient method and in relative entropy policy search. We evaluate the exploration strategy on four dynamical systems and compare the results to the established step-based and episode-based exploration strategies. Our results show that a more balanced trade-off can yield faster learning and better final policies, and illustrate some of the effects that cause these performance differences.

View article

Instance-based learning, and the k-nearest neighbors algorithm (k-NN) in particular, provide simple yet effective classification algorithms for data mining. Classifiers are often executed on sensitive information such as medical or personal data. Differential privacy has recently emerged as the accepted standard for privacy protection in sensitive data. However, straightforward applications of differential privacy to k-NN classification yield rather inaccurate results. Motivated by this, we develop algorithms to increase the accuracy of private instance-based classification. We first describe the radius neighbors classifier (r-N) and show that its accuracy under differential privacy can be greatly improved by a non-trivial sensitivity analysis. Then, for k-NN classification, we build algorithms that convert k-NN classifiers to r-N classifiers. We experimentally evaluate the accuracy of both classifiers using various datasets. Experiments show that our proposed classifiers significantly outperform baseline private classifiers (i.e., straightforward applications of differential privacy) and executing the classifiers on a dataset published using differential privacy. In addition, the accuracy of our proposed k-NN classifiers are at least comparable to, and in many cases better than, the other differentially private machine learning techniques.

View article

In this work, we build upon the observation that offline reinforcement learning (RL) is synergistic with task hierarchies that decompose large Markov decision processes (MDPs). Task hierarchies can allow more efficient sample collection from large MDPs, while offline algorithms can learn better policies than the so-called “recursively optimal” or even hierarchically optimal policies learned by standard hierarchical RL algorithms. To enable this synergy, we study sample collection strategies for offline RL that are consistent with a provided task hierarchy while still providing good exploration of the state-action space. We show that naïve extensions of uniform random sampling do not work well in this case and design a strategy that has provably good convergence properties. We also augment the initial set of samples using additional information from the task hierarchy, such as state abstraction. We use the augmented set of samples to learn a policy offline. Given a capable offline RL algorithm, this policy is then guaranteed to have a value greater than or equal to the value of the hierarchically optimal policy. We evaluate our approach on several domains and show that samples generated using a task hierarchy with a suitable strategy allow significantly more sample-efficient convergence than standard offline RL. Further, our approach also shows more sample-efficient convergence to policies with value greater than or equal to hierarchically optimal policies found through an online hierarchical RL approach.

View article

The remarkable development of deep learning in medicine and healthcare domain presents obvious privacy issues, when deep neural networks are built on users’ personal and highly sensitive data, e.g., clinical records, user profiles, biomedical images, etc. However, only a few scientific studies on preserving privacy in deep learning have been conducted. In this paper, we focus on developing a private convolutional deep belief network (pCDBN), which essentially is a convolutional deep belief network (CDBN) under differential privacy. Our main idea of enforcing ε-differential privacy is to leverage the functional mechanism to perturb the energy-based objective functions of traditional CDBNs, rather than their results. One key contribution of this work is that we propose the use of Chebyshev expansion to derive the approximate polynomial representation of objective functions. Our theoretical analysis shows that we can further derive the sensitivity and error bounds of the approximate polynomial representation. As a result, preserving differential privacy in CDBNs is feasible. We applied our model in a health social network, i.e., YesiWell data, and in a handwriting digit dataset, i.e., MNIST data, for human behavior prediction, human behavior classification, and handwriting digit recognition tasks. Theoretical analysis and rigorous experimental evaluations show that the pCDBN is highly effective. It significantly outperforms existing solutions.

View article

For the regression task in a non-parametric setting, designing the objective function to be minimized by the learner is a critical task. In this paper we propose a principled method for constructing and minimizing robust losses, which are resilient to errant observations even under small samples. Existing proposals typically utilize very strong estimates of the true risk, but in doing so require a priori information that is not available in practice. As we abandon direct approximation of the risk, this lets us enjoy substantial gains in stability at a tolerable price in terms of bias, all while circumventing the computational issues of existing procedures. We analyze existence and convergence conditions, provide practical computational routines, and also show empirically that the proposed method realizes superior robustness over wide data classes with no prior knowledge assumptions.

View article

User privacy on the internet is an important and unsolved problem. So far,no sufficient and comprehensive solution has been proposed that helps a user to protect his or her privacy while using the internet. Data are collected and assembled by numerous service providers. Solutions so far focused on the side of the service providers to store encrypted or transformed data that can be still used for analysis. This has a major flaw, as it relies on the service providers to do this. The user has no chance of actively protecting his or her privacy. In this work, we suggest a new approach, empowering the user to take advantage of the same tool the other side has, namely data mining to produce data which obfuscates the user’s profile. We apply this approach to search engine queries and use feedback of the search engines in terms of personalized advertisements in an algorithm similar to reinforcement learning to generate new queries potentially confusing the search engine. We evaluated the approach using a real-world data set. While evaluation is hard, we achieve results that indicate that it is possible to influence the user’s profile that the search engine generates. This shows that it is feasible to defend a user’s privacy from a new and more practical perspective.

View article

Clustering requires the user to define a distance metric, select a clustering algorithm, and set the hyperparameters of that algorithm. Getting these right, so that a clustering is obtained that meets the users subjective criteria, can be difficult and tedious. Semi-supervised clustering methods make this easier by letting the user provide must-link or cannot-link constraints. These are then used to automatically tune the similarity measure and/or the optimization criterion. In this paper, we investigate a complementary way of using the constraints: they are used to select an unsupervised clustering method and tune its hyperparameters. It turns out that this very simple approach outperforms all existing semi-supervised methods. This implies that choosing the right algorithm and hyperparameter values is more important than modifying an individual algorithm to take constraints into account. In addition, the proposed approach allows for active constraint selection in a more effective manner than other methods.

View article

Label embedding (LE) is an important family of multi-label classification algorithms that digest the label information jointly for better performance. Different real-world applications evaluate performance by different cost functions of interest. Current LE algorithms often aim to optimize one specific cost function, but they can suffer from bad performance with respect to other cost functions. In this paper, we resolve the performance issue by proposing a novel cost-sensitive LE algorithm that takes the cost function of interest into account. The proposed algorithm, cost-sensitive label embedding with multidimensional scaling (CLEMS), approximates the cost information with the distances of the embedded vectors by using the classic multidimensional scaling approach for manifold learning. CLEMS is able to deal with both symmetric and asymmetric cost functions, and effectively makes cost-sensitive decisions by nearest-neighbor decoding within the embedded vectors. We derive theoretical results that justify how CLEMS achieves the desired cost-sensitivity. Furthermore, extensive experimental results demonstrate that CLEMS is significantly better than a wide spectrum of existing LE algorithms and state-of-the-art cost-sensitive algorithms across different cost functions.

View article

Sharing information among multiple learning agents can accelerate learning. It could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from previous experience of another learner to adapt to their new environment. Such group-adaptive learning has numerous applications, from predicting financial time-series, through content recommendation systems, to visual understanding for adaptive autonomous agents. Here we address the problem in the context of online adaptive learning. We formally define the learning settings of Group Online Adaptive Learning and derive an algorithm named Shared Online Adaptive Learning (SOAL) to address it. SOAL avoids explicitly modeling changes or their dynamics, and instead shares information continuously. The key idea is that learners share a common small pool of experts, which they can use in a weighted adaptive way. We define group adaptive regret and prove that SOAL maintains known bounds on the adaptive regret obtained for single adaptive learners. Furthermore, it quickly adapts when learning tasks are related to each other. We demonstrate the benefits of the approach for two domains: vision and text. First, in the visual domain, we study a visual navigation task where a robot learns to navigate based on outdoor video scenes. We show how navigation can improve when knowledge from other robots in related scenes is available. Second, in the text domain, we create a new dataset for the task of assigning submitted papers to relevant editors. This is, inherently, an adaptive learning task due to the dynamic nature of research fields evolving in time. We show how learning to assign editors improves when knowledge from other editors is available. Together, these results demonstrate the benefits for sharing information across learners in concurrently changing environments.

View article

Spreadsheets, comma separated value files and other tabular data representations are in wide use today. However, writing, maintaining and identifying good formulas for tabular data and spreadsheets can be time-consuming and error-prone. We investigate the automatic learning of constraints (formulas and relations) in raw tabular data in an unsupervised way. We represent common spreadsheet formulas and relations through predicates and expressions whose arguments must satisfy the inherent properties of the constraint. The challenge is to automatically infer the set of constraints present in the data, without labeled examples or user feedback. We propose a two-stage generate and test method where the first stage uses constraint solving techniques to efficiently reduce the number of candidates, based on the predicate signatures. Our approach takes inspiration from inductive logic programming, constraint learning and constraint satisfaction. We show that we are able to accurately discover constraints in spreadsheets from various sources.

View article

Cluster analysis over Moving Object Databases (MODs) is a challenging research topic that has attracted the attention of the mobility data mining community. In this paper, we study the temporal-constrained sub-trajectory cluster analysis problem, where the aim is to discover clusters of sub-trajectories given an ad-hoc, user-specified temporal constraint within the dataset’s lifetime. The problem is challenging because: (a) the time window is not known in advance, instead it is specified at query time, and (b) the MOD is continuously updated with new trajectories. Existing solutions first filter the trajectory database according to the temporal constraint, and then apply a clustering algorithm from scratch on the filtered data. However, this approach is extremely inefficient, when considering explorative data analysis where multiple clustering tasks need to be performed over different temporal subsets of the database, while the database is updated with new trajectories. To address this problem, we propose an incremental and scalable solution to the problem, which is built upon a novel indexing structure, called Representative Trajectory Tree (ReTraTree). ReTraTree acts as an effective spatio-temporal partitioning technique; partitions in ReTraTree correspond to groupings of sub-trajectories, which are incrementally maintained and assigned to representative (sub-)trajectories. Due to the proposed organization of sub-trajectories, the problem under study can be efficiently solved as simply as executing a query operator on ReTraTree, while insertion of new trajectories is supported. Our extensive experimental study performed on real and synthetic datasets shows that our approach outperforms a state-of-the-art in-DBMS solution supported by PostgreSQL by orders of magnitude.

View article

Recent literature has shown the merits of having deep representations in the context of neural networks. An emerging challenge in kernel learning is the definition of similar deep representations. In this paper, we propose a general methodology to define a hierarchy of base kernels with increasing expressiveness and combine them via multiple kernel learning (MKL) with the aim to generate overall deeper kernels. As a leading example, this methodology is applied to learning the kernel in the space of Dot-Product Polynomials (DPPs), that is a positive combination of homogeneous polynomial kernels (HPKs). We show theoretical properties about the expressiveness of HPKs that make their combination empirically very effective. This can also be seen as learning the coefficients of the Maclaurin expansion of any definite positive dot product kernel thus making our proposed method generally applicable. We empirically show the merits of our approach comparing the effectiveness of the kernel generated by our method against baseline kernels (including homogeneous and non homogeneous polynomials, RBF, etc...) and against another hierarchical approach on several benchmark datasets.

View article

Identifying context-specific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging applications. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse undirected graphical models (UGM) from aggregated samples across several contexts. Previous joint-UGM studies have mostly focused on sparse Gaussian graphical models (sGGMs) and can’t identify context-specific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly) to learn multi-UGM via a constrained l1 minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the l1 constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate Nonparanormal data that greatly relaxes the normality assumption that many real-world applications do not follow. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate O(log(Kp)/ntot). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over state-of-the-art multi-sGGM and single-UGM baselines (SIMULE implementation and the used datasets @https://github.com/QData/SIMULE).

View article

Linear mixed models (LMMs) are important tools in statistical genetics. When used for feature selection, they allow to find a sparse set of genetic traits that best predict a continuous phenotype of interest, while simultaneously correcting for various confounding factors such as age, ethnicity and population structure. Formulated as models for linear regression, LMMs have been restricted to continuous phenotypes. We introduce the sparse probit linear mixed model (Probit-LMM), where we generalize the LMM modeling paradigm to binary phenotypes. As a technical challenge, the model no longer possesses a closed-form likelihood function. In this paper, we present a scalable approximate inference algorithm that lets us fit the model to high-dimensional data sets. We show on three real-world examples from different domains that in the setup of binary labels, our algorithm leads to better prediction accuracies and also selects features which show less correlation with the confounding factors.

View article

Existing algorithms for subgroup discovery with numerical targets do not optimize the error or target variable dispersion of the groups they find. This often leads to unreliable or inconsistent statements about the data, rendering practical applications, especially in scientific domains, futile. Therefore, we here extend the optimistic estimator framework for optimal subgroup discovery to a new class of objective functions: we show how tight estimators can be computed efficiently for all functions that are determined by subgroup size (non-decreasing dependence), the subgroup median value, and a dispersion measure around the median (non-increasing dependence). In the important special case when dispersion is measured using the mean absolute deviation from the median, this novel approach yields a linear time algorithm. Empirical evaluation on a wide range of datasets shows that, when used within branch-and-bound search, this approach is highly efficient and indeed discovers subgroups with much smaller errors.

View article

We propose graph-based predictable feature analysis (GPFA), a new method for unsupervised learning of predictable features from high-dimensional time series, where high predictability is understood very generically as low variance in the distribution of the next data point given the previous ones. We show how this measure of predictability can be understood in terms of graph embedding as well as how it relates to the information-theoretic measure of predictive information in special cases. We confirm the effectiveness of GPFA on different datasets, comparing it to three existing algorithms with similar objectives—namely slow feature analysis, forecastable component analysis, and predictable feature analysis—to which GPFA shows very competitive results.

View article

Clustering is an underspecified task: there are no universal criteria for what makes a good clustering. This is especially true for relational data, where similarity can be based on the features of individuals, the relationships between them, or a mix of both. Existing methods for relational clustering have strong and often implicit biases in this respect. In this paper, we introduce a novel dissimilarity measure for relational data. It is the first approach to incorporate a wide variety of types of similarity, including similarity of attributes, similarity of relational context, and proximity in a hypergraph. We experimentally evaluate the proposed dissimilarity measure on both clustering and classification tasks using data sets of very different types. Considering the quality of the obtained clustering, the experiments demonstrate that (a) using this dissimilarity in standard clustering methods consistently gives good results, whereas other measures work well only on data sets that match their bias; and (b) on most data sets, the novel dissimilarity outperforms even the best among the existing ones. On the classification tasks, the proposed method outperforms the competitors on the majority of data sets, often by a large margin. Moreover, we show that learning the appropriate bias in an unsupervised way is a very challenging task, and that the existing methods offer a marginal gain compared to the proposed similarity method, and can even hurt performance. Finally, we show that the asymptotic complexity of the proposed dissimilarity measure is similar to the existing state-of-the-art approaches. The results confirm that the proposed dissimilarity measure is indeed versatile enough to capture relevant information, regardless of whether that comes from the attributes of vertices, their proximity, or connectedness of vertices, even without parameter tuning.

View article

We propose a principled approach for the problem of aligning multiple partially overlapping networks. The objective is to map multiple graphs into a single graph while preserving vertex and edge similarities. The problem is inspired by the task of integrating partial views of a family tree (genealogical network) into one unified network, but it also has applications, for example, in social and biological networks. Our approach, called Flan, introduces the idea of generalizing the facility location problem by adding a non-linear term to capture edge similarities and to infer the underlying entity network. The problem is solved using an alternating optimization procedure with a Lagrangian relaxation. Flan has the advantage of being able to leverage prior information on the number of entities, so that when this information is available, Flan is shown to work robustly without the need to use any ground truth data for fine-tuning method parameters. Additionally, we present three multiple-network extensions to an existing state-of-the-art pairwise alignment method called Natalie. Extensive experiments on synthetic, as well as real-world datasets on social networks and genealogical networks, attest to the effectiveness of the proposed approaches which clearly outperform a popular multiple network alignment method called IsoRankN.

View article

The polarization of society over controversial social issues has been the subject of study in social sciences for decades (Isenberg in J Personal Soc Psychol 50(6):1141–1151, 1986, Sunstein in J Polit Philos 10(2):175–195, 2002). The widespread usage of online social networks and social media, and the tendency of people to connect and interact with like-minded individuals has only intensified the phenomenon of polarization (Bakshy et al. in Science 348(6239):1130–1132, 2015). In this paper, we consider the problem of measuring and reducing polarization of opinions in a social network. Using a standard opinion formation model (Friedkin and Johnsen in J Math Soc 15(3–4):193–206, 1990), we define the polarization index, which, given a network and the opinions of the individuals in the network, it quantifies the polarization observed in the network. Our measure captures the tendency of opinions to concentrate in network communities, creating echo-chambers. Given this numeric measure of polarization, we then consider the problem of reducing polarization in the network by convincing individuals (e.g., through education, exposure to diverse viewpoints, or incentives) to adopt a more neutral stand towards controversial issues. We formally define the ModerateInternal and ModerateExpressed problems, and we prove that both our problems are NP-hard. By exploiting the linear- algebraic characteristics of the opinion formation model we design polynomial-time algorithms for both problems. Our experiments with real-world datasets demonstrate the validity of our metric, and the efficiency and the effectiveness of our algorithms in practice.

View article

Location-based social networks (LBSNs), exemplified by Foursquare, are fast gaining popularity. One important feature of LBSNs is micro-review. Upon check-in at a particular venue, a user may leave a short review (up to 200 characters long), also known as a tip. These tips are an important source of information for others to know more about various aspects of an entity (e.g., restaurant), such as food, waiting time, or service. However, a user is often interested not in one particular entity, but rather in several entities collectively, for instance within a neighborhood or a category. In this paper, we address the problem of summarizing the tips of multiple entities in a collection, by way of synthesizing new micro-reviews that pertain to the collection, rather than to the individual entities per se. We formulate this problem in terms of first finding a representation of the collection, by identifying a number of “aspects” that link common threads across two or more entities within the collection. We express these aspects as dense subgraphs in a graph of sentences derived from the multi-entity corpora. This leads to a formulation of maximal multi-entity quasi-cliques, as well as a heuristic algorithm to find K such quasi-cliques maximizing the coverage over the multi-entity corpora. To synthesize a summary tip for each aspect, we select a small number of sentences from the corresponding quasi-clique, balancing conciseness and representativeness in terms of a facility location problem. Our approach performs well on collections of Foursquare entities based on localities and categories, producing more representative and diverse summaries than the baselines.

View article

Collaborative filtering (CF) is a widely used technique to guide the users of web applications towards items that might interest them. CF approaches are severely challenged by the characteristics of user-item preference matrices, which are often high dimensional and extremely sparse. Recently, several works have shown that incorporating information from social networks—such as friendship and trust relationships—into traditional CF alleviates the sparsity related issues and yields a better recommendation quality, in most cases. More interestingly, even with comparable performances, social-based CF is more beneficial than traditional CF; the former makes it possible to provide recommendations for cold start users. In this paper, we propose a novel model that leverages information from social networks to improve recommendations. While existing social CF models are based on popular modelling assumptions such as Gaussian or Multinomial, our model builds on the von Mises–Fisher assumption which turns out to be more adequate, than the aforementioned assumptions, for high dimensional sparse data. Setting the estimate of the model parameters under the maximum likelihood approach, we derive a scalable learning algorithm for analyzing data with our model. Empirical results on several real-world datasets provide strong support for the advantages of the proposed model.

View article

We study prediction problems in which the conditional distribution of the output given the input varies as a function of task variables which, in our applications, represent space and time. In varying-coefficient models, the coefficients of this conditional are allowed to change smoothly in space and time; the strength of the correlations between neighboring points is determined by the data. This is achieved by placing a Gaussian process (GP) prior on the coefficients. Bayesian inference in varying-coefficient models is generally intractable. We show that with an isotropic GP prior, inference in varying-coefficient models resolves to standard inference for a GP that can be solved efficiently. MAP inference in this model resolves to multitask learning using task and instance kernels. We clarify the relationship between varying-coefficient models and the hierarchical Bayesian multitask model and show that inference for hierarchical Bayesian multitask models can be carried out efficiently using graph-Laplacian kernels. We explore the model empirically for the problems of predicting rent and real-estate prices, and predicting the ground motion during seismic events. We find that varying-coefficient models with GP priors excel at predicting rents and real-estate prices. The ground-motion model predicts seismic hazards in the State of California more accurately than the previous state of the art.

View article

Consider a group of people who are visiting a major touristic city, such as NY, Paris, or Rome. It is reasonable to assume that each member of the group has his or her own interests or preferences about places to visit, which in general may differ from those of other members. Still, people almost always want to hang out together and so the following question naturally arises: What is the best tour that the group could perform together in the city? This problem underpins several challenges, ranging from understanding people’s expected attitudes towards potential points of interest, to modeling and providing good and viable solutions. Formulating this problem is challenging because of multiple competing objectives. For example, making the entire group as happy as possible in general conflicts with the objective that no member becomes disappointed. In this paper, we address the algorithmic implications of the above problem, by providing various formulations that take into account the overall group as well as the individual satisfaction and the length of the tour. We then study the computational complexity of these formulations, we provide effective and efficient practical algorithms, and, finally, we evaluate them on datasets constructed from real city data.

View article

Prediction in a small-sized sample with a large number of covariates, the “small n, large p” problem, is challenging. This setting is encountered in multiple applications, such as in precision medicine, where obtaining additional data can be extremely costly or even impossible, and extensive research effort has recently been dedicated to finding principled solutions for accurate prediction. However, a valuable source of additional information, domain experts, has not yet been efficiently exploited. We formulate knowledge elicitation generally as a probabilistic inference process, where expert knowledge is sequentially queried to improve predictions. In the specific case of sparse linear regression, where we assume the expert has knowledge about the relevance of the covariates, or of values of the regression coefficients, we propose an algorithm and computational approximation for fast and efficient interaction, which sequentially identifies the most informative features on which to query expert knowledge. Evaluations of the proposed method in experiments with simulated and real users show improved prediction accuracy already with a small effort from the expert.

View article

Copulas enable flexible parameterization of multivariate distributions in terms of constituent marginals and dependence families. Vine copulas, hierarchical collections of bivariate copulas, can model a wide variety of dependencies in multivariate data including asymmetric and tail dependencies which the more widely used Gaussian copulas, used in Meta-Gaussian distributions, cannot. However, current inference algorithms for vines cannot fit data with mixed—a combination of continuous, binary and ordinal—features that are common in many domains. We design a new inference algorithm to fit vines on mixed data thereby extending their use to several applications. We illustrate our algorithm by developing a dependency-seeking multi-view clustering model based on Dirichlet Process mixture of vines that generalizes previous models to arbitrary dependencies as well as to mixed marginals. Empirical results on synthetic and real datasets demonstrate the performance on clustering single-view and multi-view data with asymmetric and tail dependencies and with mixed marginals.

View article

Conference Track

This paper proposes an ensemble method for time series forecasting tasks. Combining different forecasting models is a common approach to tackle these tasks. State-of-the-art methods track the loss of the available models and adapt their weights accordingly. Metalearning strategies such as stacking are also used in these tasks. We propose a metalearning approach for adaptively combining forecasting models that specializes them across the time series. Our assumption is that different forecasting models have different areas of expertise and a varying relative performance. Moreover, many time series show recurring structures due to factors such as seasonality. Therefore, the ability of a method to deal with changes in relative performance of models as well as recurrent changes in the data distribution can be very useful in dynamic environments. Our approach is based on an ensemble of heterogeneous forecasters, arbitrated by a metalearning model. This strategy is designed to cope with the different dynamics of time series and quickly adapt the ensemble to regime changes. We validate our proposal using time series from several real world domains. Empirical results show the competitiveness of the method in comparison to state-of-the-art approaches for combining forecasters.

Download article

The Fourier domain is used in computer vision and machine learning as image analysis tasks in the Fourier domain are analogous to spatial domain methods but are achieved using different operations. Convolutional Neural Networks (CNNs) use machine learning to achieve state-of-the-art results with respect to many computer vision tasks. One of the main limiting aspects of CNNs is the computational cost of updating a large number of convolution parameters. Further, in the spatial domain, larger images take exponentially longer than smaller image to train on CNNs due to the operations involved in convolution methods. Consequently, CNNs are often not a viable solution for large image computer vision tasks. In this paper a Fourier Convolution Neural Network (FCNN) is proposed whereby training is conducted entirely within the Fourier domain. The advantage offered is that there is a significant speed up in training time without loss of effectiveness. Using the proposed approach larger images can therefore be processed within viable computation time. The FCNN is fully described and evaluated. The evaluation was conducted using the benchmark Cifar10 and MNIST datasets, and a bespoke fundus retina image dataset. The results demonstrate that convolution in the Fourier domain gives a significant speed up without adversely affecting accuracy. For simplicity the proposed FCNN concept is presented in the context of a basic CNN architecture, however, the FCNN concept has the potential to improve the speed of any neural network system involving convolution.

Download article

Mining streaming and drifting data is among the most popular contemporary applications of machine learning methods. Due to the potentially unbounded number of instances arriving rapidly, evolving concepts and limitations imposed on utilized computational resources, there is a need to develop efficient and adaptive algorithms that can handle such problems. These learning difficulties can be further augmented by appearance of skewed distributions during the stream progress. Class imbalance in non-stationary scenarios is highly challenging, as not only imbalance ratio may change over time, but also the class relationships. In this paper we propose an efficient and fast cost-sensitive decision tree learning scheme for handling online class imbalance. In each leaf of the tree we train a perceptron with output adaptation to compensate for skewed class distributions, while McDiarmid's bound is used for controlling the splitting attribute selection. The cost matrix automatically adapts itself to the current imbalance ratio in the stream, allowing for a smooth compensation of evolving class relationships. Furthermore, we analyze the characteristics of minority class instances and incorporate this information during the training process. It allows our classifier to focus on most difficult instances, while a sliding window keeps track of the changes in class structures. Experimental analysis carried out on a number of binary and multi-class imbalanced data streams indicate the usefulness of the proposed approach.

Download article

Classification of social media data is an important approach in understanding user behavior on the Web. Although information on social media can be of different modalities such as texts, images, audio or videos, traditional approaches in classification usually leverage only one prominent modality. Techniques that are able to leverage multiple modalities are often complex and susceptible to the absence of some modalities. In this paper, we present simple models that combine information from different modalities to classify social media content and are able to handle the above problems with existing techniques. Our models combine information from different modalities using a pooling layer and an auxiliary learning task is used to learn a common feature space. We demonstrate the performance of our models and their robustness to the missing of some modalities in the emotion classification domain. Our approaches, although being simple, can not only achieve significantly higher accuracies than traditional fusion approaches but also have comparable results when only one modality is available.

Download article

Learning from data streams has received increasing attention in recent years, not only in the machine learning community but also in other research fields, such as computational intelligence and fuzzy systems. In particular, several rule-based methods for the incremental induction of regression models have been proposed. In this paper, we develop a method that combines the strengths of two existing approaches rooted in different learning paradigms. Our method induces a set of fuzzy rules, which, compared to conventional rules with Boolean antecedents, has the advantage of producing smooth regression functions. To do so, it makes use of an induction technique inspired by AMRules, a very efficient and effective learning algorithm that can be seen as the state of the art in machine learning. We conduct a comprehensive experimental study showing that a combination of the expressiveness of fuzzy rules with the algorithmic concepts of AMRules yields a learning system with superb performance.

Download article

Source-target attention mechanism (briefly, source attention) has become one of the key components in a wide range of sequence generation tasks, such as neural machine translation, image caption, and open-domain dialogue generation. In these tasks, the attention mechanism, typically in control of information flow from the encoder to the decoder, enables to generate every component in the target sequence relying on different source components. While source attention mechanism has attracted many research interests, few of them turn eyes to if the generation of target sequence can additionally benefit from attending back to itself, which however is intuitively motivated by the nature of attention. To investigate the question, in this paper, we propose a new target-target attention mechanism (briefly, target attention). Along the progress of generating target sequence, target attention mechanism takes into account the relationship between the component to generate and its preceding context within the target sequence, such that it can better keep the coherent consistency and improve the readability of the generated sequence. Furthermore, it complements the information from source attention so as to further enhance semantic adequacy. After designing an effective approach to incorporate target attention in encoder-decoder framework, we conduct extensive experiments on both neural machine translation and image caption. Experimental results clearly demonstrate the effectiveness of our design of integrating both source and target attention for sequence generation tasks.

Download article

We consider the problems of online and one-pass maximization of the area under the ROC curve (AUC). AUC maximization is hard even in the offline setting and thus solutions often make some compromises. Existing results for the online problem typically optimize for some proxy defined via surrogate losses instead of maximizing the real AUC. This approach is confirmed by results showing that the optimum of these proxies, over the set of all (measurable) functions, maximize the AUC. The problem is that---in order to meet the strong requirements for per round run time complexity---online methods typically work with restricted hypothesis classes and this, as we show, corrupts the above compatibility and causes the methods to converge to suboptimal solutions even in some simple stochastic cases. To remedy this, we propose a different approach and show that it leads to asymptotic optimality. Our theoretical claims and considerations are tested by experiments on real datasets, which provide empirical justification to them.

Download article

Dynamic ensemble selection (DES) is the problem of finding, given an input x, a subset of models among the ensemble that achieves the best possible prediction accuracy. Recent studies have reformulated the DES problem as a multi-label classification problem and promising performance gains have been reported. However, their approaches may converge to an incorrect, and hence suboptimal, solution as they don't optimize the true - but non standard - loss function directly. In this paper, we show that the label dependencies have to be captured explicitly and propose a DES method based on Probabilistic Classifier Chains. Experimental results on 20 benchmark data sets show the effectiveness of the proposed method against competitive alternatives, including the aforementioned multi-label approaches. Keywords: Dynamic ensemble selection, Multi-label learning, Probabilistic Classifier Chains

Download article

Parallelization framework has become a necessity to speed up the training of deep neural networks (DNN) recently. Such framework typically employs the Model Average approach, denoted as MA-DNN, in which parallel workers conduct respective training based on their own local data while the parameters of local models are periodically communicated and averaged to obtain a global model which serves as the new start of local models. However, since DNN is a highly non-convex model, averaging parameters cannot ensure that such global model can perform better than those local models. To tackle this problem, we introduce a new parallel training framework called Ensemble-Compression, denoted as EC-DNN. In this framework, we propose to aggregate the local models by ensemble, i.e., averaging the outputs of local models instead of the parameters. As most of prevalent loss functions are convex to the output of DNN, the performance of ensemble-based global model is guaranteed to be at least as good as the average performance of local models. However, a big challenge lies in the explosion of model size since each round of ensemble can give rise to multiple times size increment. Thus, we carry out model compression after each ensemble, specialized by a distillation based method in this paper, to reduce the size of the global model to be the same as the local ones. Our experimental results demonstrate the prominent advantage of EC-DNN over MA-DNN in terms of both accuracy and speedup.

Download article

Dynamic Time Warping is a well-known measure of dissimilarity between time series. Due to its flexibility to deal with non-linear distortions along the time axis, this measure has been widely utilized in machine learning models for this particular kind of data. Nowadays, the proliferation of streaming data sources has ignited the interest and attention of the scientific community around on-line learning models. In this work, we naturally adapt Dynamic Time Warping to the on-line learning setting. Specifically, we propose a novel on-line measure of dissimilarity for streaming time series which combines a warp constraint and a weighted memory mechanism to simplify the time series alignment and adapt to non-stationary data intervals along time. Computer simulations are analyzed and discussed so as to shed light on the performance and complexity of the proposed measure.

Download article

What will be the power consumption of our institution at 8am for the upcoming days? What will happen to the power consumption of a small factory, if it wants to double (or half) its production? Technologies associated with the smart electrical grid are needed. Central to this process are algorithms that accurately model electrical load behavior, and forecast future electric power demand. However, existing power load models fail to accurately represent electrical load behavior in the grid. In this paper, we propose PowerCast, a novel domain-aware approach for forecasting the electrical power demand, by carefully incorporating domain knowledge. Our contributions are as follows: 1. Infusion of domain expert knowledge: We represent the time sequences using an equivalent circuit model, the ``BIG" model, which allows for an intuitive interpretation of the power load, as the BIG model is derived from physics-based first principles. 2. Forecasting of the power load: Our PowerCast uses the BIG model, and provides (a) accurate prediction in multi-step-ahead forecasting, and (b) extrapolations, under what-if scenarios, such as variation in the demand (say, due to increase in the count of people on campus, or a decision to half the production in our factory etc.) 3. Anomaly detection: PowerCast can spot and, even explain, anomalies in the given time sequences. The experimental results based on two real world datasets of up to three weeks duration, demonstrate that PowerCast is able to forecast several steps ahead, with 59% error reduction, compared to the competitors. Moreover, it is fast, and scales linearly with the duration of the sequences.

Download article

We study an extension of the multi-instance learning problem where examples are organized as nested bags of instances (e.g., a document could be represented as a bag of sentences, which in turn are bags of words). This framework can be useful in various scenarios, such as graph classification, image classification and translation-invariant pooling in convolutional neural network. In order to learn multi-multi instance data, we introduce a special neural network layer, called bag-layer, whose units aggregate sets of inputs of arbitrary size. We prove that the associated class of functions contains all Boolean functions over sets of sets of instances. We present empirical results on semi-synthetic data showing that such class of functions can be actually learned from data. We also present experiments on citation graphs datasets where our model obtains competitive results.

Download article

Cutset Networks (CNets) are recently introduced density estimators leveraging context-specific independences to provide exact inference in polynomial time. Learning a CNet is done first by building a weighted probabilistic OR tree and then estimating tractable distributions as its leaves. In practice, selecting an optimal OR split node requires cubic time in the number of the data features, and even approximate heuristics still scale in quadratic time. We introduce Extremely Randomized Cutset Networks (XCNets), CNets whose OR tree is learned by performing random conditioning. This simple yet surprisingly effective approach reduces the complexity of OR node selection to constant time. While the likelihood of an XCNet is slightly worse than an optimally learned CNet, ensembles of XCNets outperform state-of-the-art density estimators on a series of standard benchmark datasets, yet employing only a fraction of the time needed to learn the competitors.

Download article

Graph vertices are often associated with attributes. For example, in addition to their connection relations, people in friendship networks have personal attributes, such as interests, age, and residence. Such graphs (networks) are called attributed graphs. The detection of clusters in attributed graphs is of great practical relevance, e.g., targeting ads. Attributes and edges often provide complementary information. The effective use of both types of information promises meaningful results. In this work, we propose a method called UNCut (for Unimodal Normalized Cut) to detect cohesive clusters in attributed graphs. A cohesive cluster is a subgraph that has densely connected edges and has as many homogeneous (unimodal) attributes as possible. We adopt the normalized cut to assess the density of edges in a graph cluster. To evaluate the unimodality of attributes, we propose a measure called unimodality compactness which exploits Hartigans' dip test. Our method UNCut integrates the normalized cut and unimodality compactness in one framework such that the detected clusters have low normalized cut and unimodality compactness values. Extensive experiments on various synthetic and real-world data verify the effectiveness and efficiency of our method UNCut compared with state-of-the-art approaches.

Download article

Given time-series data such as electrocardiogram (ECG) readings, or motion capture data, how can we succintly summarize the data in a way that robustly identifies patterns that appear repeatedly? How can we then use such a summary to identify anomalies such as abnormal heartbeats, and also forecast future values of the time series? Our main idea is a vocabulary-based approach, which automatically learns a set of common patterns, or `beat patterns,' which are used as building blocks to describe the time series in an intuitive and interpretable way. Our summarization algorithm, BeatLex (Beat Lexicons for Summarization) is: 1) fast and online, requiring linear time in the data size and bounded memory; 2) effective, outperforming competing algorithms in labelling accuracy by 5.3 times, and forecasting accuracy by 1.8 times; 3) principled and parameter-free, as it is based on the Minimum Description Length principle of summarizing the data by compressing it using as few bits as possible, and automatically tunes all its parameters; 4) general: it applies to any domain of time series data, and can make use of multidimensional (i.e. coevolving) time series.

Download article

Unstructured data from diverse sources, such as social media and aerial imagery, can provide valuable up-to-date information for intelligent situation assessment. Mining these different information sources could bring major benefits to applications such as situation awareness in disaster zones and mapping the spread of diseases. Such applications depend on classifying the situation across a region of interest, which can be depicted as a spatial “heatmap". Annotating unstructured data using crowdsourcing or automated classifiers produces individual classifications at sparse locations that typically contain many errors. In this paper, we propose a novel Bayesian approach that models the relevance, error rates and bias of each information source, enabling us to learn a spatial Gaussian Process classifier by aggregating data from multiple sources with varying reliability and relevance. Our method does not require gold-labelled data and can make predictions at any location in an area of interest given only sparse observations. We show empirically that our approach can handle noisy and biased data sources, and that simultaneously inferring reliability and transferring information between neighbouring reports leads to more accurate predictions. We demonstrate our method on two real-world problems from disaster response, showing how our approach reduces the amount of crowdsourced data required and can be used to generate valuable heatmap visualisations from SMS messages and satellite images.

Download article

This paper proposes a fully Bayesian approach for Least-Squares Temporal Differences (LSTD), resulting in fully probabilistic inference of value functions that avoids the overfitting commonly experienced with classical LSTD when the number of features is larger than the number of samples. Sparse Bayesian learning provides an elegant solution through the introduction of a prior over value function parameters. This gives us the advantages of probabilistic predictions, a sparse model, and good generalisation capabilities, as irrelevant parameters are marginalised out. The algorithm efficiently approximates the posterior distribution through variational inference. We demonstrate the ability of the algorithm in avoiding overfitting experimentally.

Download article

In this paper we present the interesting Behavioral Constraint Miner (iBCM), a new approach towards classifying sequences. The prevalence of sequential data, i.e., a collection of ordered items such as text, website navigation patterns, traffic management, and so on, has incited a surge in research interest towards sequence classification. Existing approaches mainly focus on retrieving sequences of itemsets and checking their presence in labeled data streams to obtain a classifier. The proposed iBCM approach, rather than focusing on plain sequences, is template-based and draws its inspiration from behavioral patterns used for software verification. These patterns have a broad range of characteristics and go beyond the typical sequence mining representation, allowing for a more precise and concise way of capturing sequential information in a database. Furthermore, it is possible to also mine for negative information, i.e., sequences that do not occur. The technique is benchmarked against other state-of-the-art approaches and exhibits a strong potential towards sequence classification.

Download article

Finding dense subgraphs in a graph is a fundamental graph mining task, with applications in several fields. Algorithms for identifying dense subgraphs are used in biology, in finance, in spam detection, etc. Standard formulations of this problem such as the problem of finding the maximum clique of a graph are hard to solve. However, some tractable formulations of the problem have also been proposed, focusing mainly on optimizing some density function, such as the degree density and the triangle density. However, maximization of degree density usually leads to large subgraphs with small density, while maximization of triangle density does not necessarily lead to subgraphs that are close to being cliques. In this paper, we introduce the k-clique-graph densest subgraph problem, k <= 3, a novel formulation for the discovery of dense subgraphs. Given an input graph, its k-clique-graph is a new graph created from the input graph where each vertex of the new graph corresponds to a k-clique of the input graph and two vertices are connected with an edge if they share a common (k - 1)-clique. We define a simple density function, the k-clique-graph density, which gives compact and at the same time dense subgraphs, and we project its resulting subgraphs back to the input graph. In this paper we focus on the triangle-graph densest subgraph problem obtained for k = 3. To optimize the proposed function, we provide an exact algorithm. Furthermore, we present an efficient greedy approximation algorithm that scales well to larger graphs. We evaluate the proposed algorithms on real datasets and compare them with other algorithms in terms of the size and the density of the extracted subgraphs. The results verify the ability of the proposed algorithms in finding high-quality subgraphs in terms of size and density. Finally, we apply the proposed method to the important problem of keyword extraction from textual documents.

Download article

We present a new approach for learning a sequence regression function, i.e., a mapping from sequential observations to a numeric score. Our learning algorithm employs coordinate gradient descent with Gauss-Southwell optimization in the feature space of all subsequences. We give a tight upper bound for the coordinate wise gradients of squared error loss which enables efficient Gauss-Southwell selection. The proposed bound is built by separating the positive and the negative gradients of the loss function and exploits the structure of the feature space. Extensive experiments on simulated as well as real-world sequence regression benchmarks show that the bound is effective and our proposed learning algorithm is efficient and accurate. The resulting linear regression model provides the user with a list of the most predictive features selected during the learning stage, adding to the interpretability of the method.

Download article

Discovering causal structure from observational data in the presence of latent variables remains an active research area. Constraint-based causal discovery algorithms are relatively efficient at discovering such causal models from data using independence tests. Typically, however, they derive and output only one such model. In contrast, Bayesian methods can generate and probabilistically score multiple models, outputting the most probable one; however, they are often computationally infeasible to apply when modeling latent variables. We introduce a hybrid method that derives a Bayesian probability that the set of independence tests associated with a given causal model are jointly correct. Using this constraint-based scoring method, we are able to score multiple causal models, which possibly contain latent variables, and output the most probable one. The structure-discovery performance of the proposed method is compared to an existing constraint-based method (RFCI) using data generated from several previously published Bayesian networks. The structural Hamming distances of the output models improved when using the proposed method compared to RFCI, especially for small sample sizes.

Download article

We propose a novel approach called the Local Lanczos Spectral Approximation (LLSA) for identifying all latent members of a local community from very few seed members. To reduce the computation complexity, we first apply a fast heat kernel diffusing to sample a comparatively small subgraph covering almost all possible community members around the seeds. Then starting from a normalized indicator vector of the seeds and by a few steps of Lanczos iteration on the sampled subgraph, a local eigenvector is gained for approximating the eigenvector of the transition matrix with the largest eigenvalue. Elements of this local eigenvector is a relaxed indicator for the affiliation probability of the corresponding nodes to the target community. We conduct extensive experiments on real-world datasets in various domains as well as synthetic datasets. Results show that the proposed method outperforms state-of-the-art local community detection algorithms. To the best of our knowledge, this is the first work to adapt the Lanczos method for local community detection, which is natural and potentially effective. Also, we did the first attempt of using heat kernel as a sampling method instead of detecting communities directly, which is proved empirically to be very efficient and effective.

Download article

Consider a large network, and a user-provided set of query nodes between which the user wishes to explore relations. For example, a researcher may want to connect research papers in a citation network, an analyst may wish to connect organized crime suspects in a communication network, or an internet user may want to organize their bookmarks given their location in the world wide web. A natural way to show how query nodes are related is in the form of a tree in the network that connects them. However, in sufficiently dense networks, most such trees will be large or somehow trivial (e.g. involving high degree nodes) and thus not insightful. In this paper, we define and investigate the new problem of mining subjectively interesting trees connecting a set of query nodes in a network, i.e., trees that are highly surprising to the specific user at hand. Using information theoretic principles, we formalize the notion of interestingness of such trees mathematically, taking in account any prior beliefs the user has specified about the network. We then propose heuristic algorithms to find the best trees efficiently, given a specified prior belief model. Modeling the user’s prior belief state is however not necessarily computationally tractable. Yet, we show how a highly generic class of prior beliefs, namely about individual node degrees in combination with the density of particular sub-networks, can be dealt with in a tractable manner. Such types of beliefs can be used to model knowledge of a partial or total order of the network nodes, e.g. where the nodes represent events in time (such as papers in a citation network). An empirical validation of our methods on a large real network evaluates the different heuristics and validates the interestingness of the given trees.

Download article

A broad range of high impact applications involve learning a predictive model in a temporal network environment. In weather forecasting, predicting effectiveness of treatments, outcomes in healthcare and in many other domains, networks are often large, while intervals between consecutive time moments are brief. Therefore, models are required to forecast in a more scalable and efficient way, without compromising accuracy. The Gaussian Conditional Random Field (GCRF) is a widely used graphical model for performing structured regression on networks. However, GCRF is not applicable to large networks and it cannot capture different network substructures (communities) since it considers the entire network while learning. In this study, we present a novel model, Adaptive Skip-Train Structured Ensemble (AST-SE), which is a sampling-based structured regression ensemble for prediction on top of temporal networks. AST-SE takes advantage of the scheme of ensemble methods to allow multiple GCRFs to learn from several subnetworks. The proposed model is able to automatically skip the entire training or some phases of the training process. The prediction accuracy and efficiency of AST-SE were assessed and compared against alternatives on synthetic temporal networks and the H3N2 Virus Influenza network. The obtained results provide evidence that (1) AST-SE is ~140 times faster than GCRF as it skips retraining quite frequently; (2) It still captures the original network structure more accurately than GCRF while operating solely on partial views of the network; (3) It outperforms both unweighted and weighted GCRF ensembles which also operate on subnetworks but require retraining at each timestep.

Download article

In this paper, we consider a generalized variant of inverse reinforcement learning (IRL) that estimates both a cost (negative reward) function and a transition probability from observed optimal behavior. In theoretical studies of standard IRL, which estimates only the cost function, it is well known that IRL involves a non-identifiable problem, i.e., the cost function cannot be determined uniquely. This problem has been solved by using a new class of Markov decision process (MDP) called a linearly solvable MDP (LMDP). In this paper, we investigate whether a non-identifiable problem occurs in the generalized variant of IRL (gIRL) using the framework of LMDP and construct a new gIRL method. The contributions of this study are summarized as follows: (i) We point out that gIRL with LMDP suffers from a non-identifiable problem. (ii) We propose a Bayesian method to escape the non-identifiable problem. (iii) We validate the proposed method by performing an experiment on synthetic data and real car probe data.

Download article

We study the problem of detecting malware on client computers based on the analysis of HTTPS traffic. Here, malware has to be detected based on the host address, timestamps, and data volume information of the computer's network traffic. We develop a scalable protocol that allows us to collect network flows of known malicious and benign applications as training data and derive a malware-detection method based on a neural embedding of domain names and a long short-term memory network that processes network flows. We study the method's ability to detect new malware in a large-scale empirical study.

Download article

Due to its pharmaceutical applications, one of the most prominent machine learning challenges in bioinformatics is the prediction of drug--target interactions. State-of-the-art approaches are based on various techniques, such as matrix factorization, restricted Boltzmann machines, network-based inference and bipartite local models (BLM). In this paper, we extend BLM by the incorporation of a hubness-aware regression technique coupled with an enhanced representation of drugs and targets in a multi-modal similarity space. Additionally, we propose to build a projection-based ensemble. Our Advanced Local Drug-Target Interaction Prediction technique (ALADIN) is evaluated on publicly available real-world drug-target interaction datasets. The results show that our approach statistically significantly outperforms BLM-NII, a recent version of BLM, as well as NetLapRLS and WNN-GIP.

Download article

Learning over multi-view data is a challenging problem with strong practical applications. Most related studies focus on the classification point of view and assume that all the views are available at any time. We consider an extension of this framework in two directions. First, based on the BiGAN model, the Multi-view BiGAN (MV-BiGAN) is able to perform density estimation from multi-view inputs. Second, it can deal with missing views and is able to update its prediction when additional views are provided. We illustrate these properties on a set of experiments over different datasets.

Download article

We consider a semi-supervised learning scenario for regression, where only few labelled examples, many unlabelled instances and different data representations (multiple views) are available. For this setting, we extend support vector regression with a co-regularisation term and obtain co-regularised support vector regression (CoSVR). In addition to labelled data, co-regularisation includes information from unlabelled examples by ensuring that models trained on different views make similar predictions. Ligand affinity prediction is an important real-world problem that fits into this scenario. The characterisation of the strength of protein-ligand bonds is a crucial step in the process of drug discovery and design. We introduce variants of the base CoSVR algorithm and discuss their theoretical and computational properties. For the CoSVR function class we provide a theoretical bound on the Rademacher complexity. Finally, we demonstrate the usefulness of CoSVR for the affinity prediction task and evaluate its performance empirically on different protein-ligand datasets. We show that CoSVR outperforms co-regularised least squares regression as well as existing state-of-the-art approaches for affinity prediction.

Download article

This paper is devoted to the study of the max K-armed bandit problem, which consists in sequentially allocating resources in order to detect extreme values. Our contribution is twofold. We first significantly refine the analysis of the ExtremeHunter algorithm carried out in Carpentier and Valko (2014), and next propose an alternative approach, showing that, remarkably, Extreme Bandits can be reduced to a classical version of the bandit problem to a certain extent. Beyond the formal analysis, these two approaches are compared through numerical experiments.

Download article

We study a two-level multiview learning with more than two views under the PAC-Bayesian framework. This approach, sometimes referred as late fusion, consists in learning sequentially multiple view-specific classifiers at the first level, and then combining these view-specific classifiers at the second level. Our main theoretical result is a generalization bound on the risk of the majority vote which exhibits a term of diversity in the predictions of the view-specific classifiers. From this result it comes out that controlling the trade-off between diversity and accuracy is a key element for multiview learning, which complements other results in multiview learning. Finally, we experiment our principle on multiview datasets extracted from the Reuters RCV1/RCV2 collection.

Download article

Privacy has become a serious concern in data mining. Achieving adequate privacy is especially challenging when the scale of the problem is large. Fundamentally, designing a practical privacy-preserving data mining system involves tradeoffs among several factors such as the privacy guarantee, the accuracy or utility of the mining result, the computation efficiency and the generality of the approach. In this paper, we present PEM, a practical system that tries to strike the right balance among these factors. We use a combination of noise-based and noise-free techniques to achieve provable differential privacy at a low computational overhead while obtaining more accurate result than previous approaches. PEM provides an efficient private gradient descent that can be the basis for many practical data mining and machine learning algorithms, like logistic regression, k-means, and Apriori. Evaluate them on three real- world open datasets in a cloud computing environment. The results show that PEM achieves good accuracy, high scalability, low computation cost while maintaining differential privacy.

Download article

Multi-view text data consist of texts from different sources. For in- stance, multilingual Wikipedia corpora contain articles in different languages which are created by different group of users. Because multi-view text data are often created in distributed fashion, information from different sources may not be consistent. Such inconsistency introduce noise to analysis of such kind of data. In this paper, we propose a probabilistic topic model for multi-view data, which is robust against noise. The proposed model can also be used for detecting anoma- lies. In our experiments on Wikipedia data sets, the proposed model is more ro- bust than existing multi-view topic models in terms of held-out perplexity.

Download article

Many online regression (and adaptive filtering) algorithms are linear, use additive update and designed for the noise-free setting. We consider the practical setting where the algorithm's feedback is noisy, rather than a clean label. We propose a new family of algorithms which modifies the learning rate based on the noise-variance of the feedback (labels), by shrinking both inputs and feedbacks, based on the amount of noise per input instance. We consider both settings, where the noise is either given or estimated. Empirical study with both synthetic and real-world speech data shows that our algorithms improve the overall performance of the regressor, even when there is no additional explicit information (i.e. amount of noise). We also consider a more general setting where an algorithm can sample more than single (noisy) label, yet there is a total (or average) budget for the feedback. We propose a few strategies how to effectively spend the given budget, which are based on noise-variance estimation and our shrinkage rule. We show empirically that our approach outperforms other naive approaches.

Download article

In this paper, we introduce a novel non-stationnary bandit setting, called relational recurrent bandit, where reward expectations at successive time steps are interdependent. The aim is to discover temporal and structural dependencies between arms in order to maximize the cumulative collected reward. Two algorithms are proposed: while a first one directly models temporal dependencies between arms, a second one assumes the existence of hidden states of the system to explain rewards of arms. For both approaches, we develop a Variational Thompson Sampling method, which approximates distributions on every hidden variable via variational inference, and uses the estimated distributions to sample reward expectations at each iteration of the process. Experiments conducted on both synthetic and real data demonstrate the effectiveness of our approaches.

Download article

In emotion recognition, it is difficult to recognize human's emotional states using just a single modality. Besides, the annotation of physiological emotional data is particularly expensive. These two aspects make the building of effective emotion recognition model challenging. In this paper, we first build a multi-view deep generative model to simulate the generative process of multi-modality emotional data. By imposing a mixture of Gaussians assumption on the posterior approximation of the latent variables, our model can learn the shared deep representation from multiple modalities. To solve the labeled-data-scarcity problem, we further extend our multi-view model to semi-supervised learning scenario by casting the semi-supervised classification problem as a specialized missing data imputation task. Our semi-supervised multi-view deep generative framework can leverage both labeled and unlabeled data from multiple modalities, where the weight factor for each modality can be learned automatically. Compared with previous emotion recognition methods, our method is more robust and flexible. The experiments conducted on two real multi-modal emotion datasets have demonstrated the superiority of our framework over a number of competitors.

Download article

Hashing methods have been widely used for applications of large-scale image retrieval and classification. Non-deep hashing methods using handcrafted features have been significantly outperformed by deep hashing methods due to their better feature representation and end-to-end learning framework. However, the most striking successes in deep hashing have mostly involved discriminative models, which require labels. In this paper, we propose a novel unsupervised deep hashing method, named Deep Discrete Hashing (DDH), for large-scale image retrieval and classification. In the proposed framework, we address two main problems: 1) how to directly learn discrete binary codes? 2) how to equip the binary representation with the ability of accurate image retrieval and classification in an unsupervised way? We resolve these problems by introducing an intermediate variable and a loss function steering the learning process, which is based on the neighborhood structure in the original space. Experiments on real datasets show that our method can significantly outperform other unsupervised methods to achieve the state-of-the-art performance for image retrieval and object recognition.

Download article

We propose k^2-means, a new clustering method which efficiently copes with large numbers of clusters and achieves low energy solutions. k^2-means builds upon the standard k-means (Lloyd's algorithm) and combines a new strategy to accelerate the convergence with a new low time complexity divisive initialization. The accelerated convergence is achieved through only looking at k_n nearest clusters and using triangle inequality bounds in the assignment step while the divisive initialization employs an optimal 2-clustering along a direction. The worst-case time complexity per iteration of our k^2-means is O(nk_nd+k^2d), where d is the dimension of the n data points and k is the number of clusters and usually n >> k >> k_n. Compared to k-means' O(nkd) complexity, our k^2-means complexity is significantly lower, at the expense of slightly increasing the memory complexity by O(nk_n+k^2). In our extensive experiments k^2-means is an order of magnitude faster than standard methods in computing accurate clusterings on several standard datasets and settings with hundreds of clusters and high dimensional data. Moreover, the proposed divisive initialization generally leads to clustering energies comparable to those achieved with the standard k-means++ initialization, while being significantly faster.

Download article

Feature ranking is beneficial to gain knowledge and to identify the relevant features from a high-dimensional dataset. However, in several datasets, few features by itself might have small correlation with the target classes, but by combining these features with some other features, they can be strongly correlated with the target. This means that multiple features exhibit interactions among themselves. It is necessary to rank the features based on these interactions for better analysis and classifier performance. However, evaluating these interactions on large datasets is computationally challenging. Furthermore, datasets often have features with redundant information. Using such redundant features hinders both efficiency and generalization capability of the classifier. The major challenge is to efficiently rank the features based on relevance and redundance on mixed datasets. In this work, we propose a filter-based framework based on Relevance and Redundancy (RaR), RaR computes a single score that quantifies the feature relevance by considering interactions between features and redundancy. The top ranked features of RaR are characterized by maximum relevance and non-redundance. The evaluation on synthetic and real world datasets demonstrates that our approach outperforms several state-of-the-art feature selection techniques.

Download article

Representations are fundamental to artificial intelligence. The performance of a learning system depends on the type of representation used for representing the data. Typically, these representations are hand-engineered using domain knowledge. More recently, the trend is to learn these representations through stochastic gradient descent in multi-layer neural networks, usually called backprop. Learning the representations directly from the incoming data stream reduces the human labour involved in designing a learning system. More importantly, this allows in scaling of a learning system for difficult tasks. In this paper, we introduce a new incremental learning algorithm called crossprop, which learns incoming weights of hidden units based on the meta-gradient descent approach, that was previously introduced by Sutton (1992) and Schraudolph (1999) for learning step-sizes. The final update equation introduces an additional memory parameter for each of these weights and generalizes the backprop update equation. From our experiments, we show that crossprop learns and reuses its feature representation while tackling new and unseen tasks whereas backprop relearns a new feature representation.

Download article

Spectral dimensionality reduction algorithms are widely used in numerous domains, including for recognition, segmentation, tracking and visualization. However, despite their popularity, these algorithms suffer from a major limitation known as the ``repeated Eigen-directions'' phenomenon. That is, many of the embedding coordinates they produce typically capture the same direction along the data manifold. This leads to redundant and inefficient representations that do not reveal the true intrinsic dimensionality of the data. In this paper, we propose a general method for avoiding redundancy in spectral algorithms. Our approach relies on replacing the orthogonality constraints underlying those methods by unpredictability constraints. Specifically, we require that each embedding coordinate be unpredictable (in the statistical sense) from all previous ones. We prove that these constraints necessarily prevent redundancy, and provide a simple technique to incorporate them into existing methods. As we illustrate on challenging high-dimensional scenarios, our approach produces significantly more informative and compact representations, which improve visualization and classification tasks.

Download article

Recent work has shown that the usage of extrapolation of learning curves to determine when to terminate a training build has been shown to be effective in reducing the number of epochs of training required for finding a good performing hyper-parameter configuration. However, the current technique uses the information only from the current build to make the prediction. We propose the usage of a simple regression based extrapolation model that uses the trajectories from previous builds to make predictions of new builds. This can be used to terminate poorly performing builds and thus, speed up hyper-parameter search with performance comparable to non-augmented hyper-parameter optimization techniques. We compare the predictions made by our model against that of the existing extrapolation technique in different tasks. We incorporate our approach into a pre-existing termination criterion. We incorporate this termination criterion into an existing hyper-parameter optimization toolkit. We analyze the performance of our approach and contrast it against a baseline in terms of quality of prediction in three different tasks. We show that our approach yields builds with performance comparable to the non-augmented version with fewer epochs, and outperforms an existing parametric extrapolation technique for two out of three tasks in terms of number of required epochs.

Download article

High-dimensional data are prevalent in various machine learning applications. Feature selection is a useful technique for alleviating the curse of dimensionality. Unsupervised feature selection problem tends to be more challenging than its supervised counterpart due to the lack of class labels. State-of-the-art approaches usually use the concept of pseudo labels to select discriminative features by their regression coefficients and the pseudo-labels derived from clustering is usually inaccurate. In this paper, we propose a new perspective for unsupervised feature selection by Discriminatively Exploiting Similarity (DES). Through forming similar and dissimilar data pairs, implicit discriminative information can be exploited. The similar/dissimilar relationship of data pairs can be used as guidance for feature selection. Based on this idea, we propose hypothesis testing based and classification based methods as instantiations of the DES framework. We evaluate the proposed approaches extensively using six real-world datasets. Experimental results demonstrate that our approaches achieve significantly outperforms the state-of-the-art unsupervised methods. More surprisingly, our unsupervised method even achieves performance comparable to a supervised feature selection method.

Download article

Stochastic local search (SLS), like many other stochastic optimization algorithms, has several parameters that need to be optimized in order for the algorithm to find high quality solutions within a short amount of time. In this paper, we formulate a stochastic local search bandit (SLSB), which is a novel learning variant of SLS based on multi-armed bandits. SLSB optimizes SLS over a sequence of stochastic optimization problems and achieves high average cumulative reward. In SLSB, we study how SLS can be optimized via low degree polynomials in its noise and restart parameters. To determine the coefficients of the polynomials, we present polynomial Thompson Sampling (PolyTS). We derive a regret bound for PolyTS and validate its performance on synthetic problems of varying difficulty as well as on feature selection problems. Compared to bandits with no assumptions of the reward function and other parameter optimization approaches, our PolyTS assuming polynomial structure can provide substantially better parameter optimization for SLS. In addition, due to its simple model update, PolyTS has low computational cost compared to other SLS parameter optimization methods.

Download article

Corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds) is a critical task in knowledge discovery. It may facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search. To discover new entities in an expanded set, previous approaches either make one-time entity ranking based on distributional similarity, or resort to iterative pattern-based bootstrapping. The core challenge for these methods is how to deal with noisy context features derived from free-text corpora, which may lead to entity intrusion and semantic drifting. In this study, we propose a novel framework, SetExpan, which tackles this problem, with two techniques: (1) a context feature selection method that selects clean context features for calculating entity-entity distributional similarity, and (2) a ranking-based unsupervised ensemble method for expanding entity set based on denoised context features. Experiments on three datasets show that SetExpan is robust and outperforms previous state-of-the-art methods in terms of mean average precision.

Download article

k-nearest-neighbor (kNN) search is a fundamental data mining task critical to many data analytics methods. Yet no effective techniques to date scale kNN search to large datasets. In this work we present PkNN, an exact distributed method that by leveraging modern distributed architectures for the first time scales kNN search to billion point datasets. The key to the PkNN strategy is a multi-round kNN search that exploits pivot-based data partitioning at each stage. This includes an outlier-driven partition adjustment mechanism that effectively minimizes data duplication and achieves a balanced workload across the compute cluster. Aggressive data-driven bounds along with a tiered support assignment strategy ensure correctness while limiting computation costs. Our experimental study on multi-dimensional real-world data demonstrates that PkNN achieves significant speedup over the state-of-the-art and scales effectively in data cardinality and dimension.

Download article

New social and economic activities massively exploit big data and machine learning algorithms to do inference on people's lives. Applications include automatic curricula evaluation, wage determination, and risk assessment for credits and loans. Recently, many governments and institutions have raised concerns about the lack of fairness, equity and ethics in machine learning to treat these problems. It has been shown that not including sensitive features that bias fairness, such as gender or race, is not enough to mitigate the discrimination when other related features are included. Instead, including fairness in the objective function has been shown to be more efficient. We present novel fair regression and dimensionality reduction methods built on a previously proposed fair classification framework. Both methods rely on using the Hilbert Schmidt independence criterion as the fairness term. Unlike previous approaches, this allows us to simplify the problem and to use multiple sensitive variables simultaneously. Replacing the linear formulation by kernel functions allows the methods to deal with nonlinear problems. For both linear and nonlinear formulations the solution reduces to solving simple matrix inversions or generalized eigenvalue problems. This simplifies the evaluation of the solutions for different trade-off values between the predictive error and fairness terms. We illustrate the usefulness of the proposed methods in toy examples, and evaluate their performance on real world datasets to predict income using gender and/or race discrimination as sensitive variables, and contraceptive method prediction under demographic and socio-economic sensitive descriptors.

Download article

We present a novel approach to learning distributed representation of sentences from unlabeled data by modeling content and context of a sentence. The content model learns sentence representation by predicting its words. The context model comprises a neighbor prediction component and a regularizer to model distributional and proximity hypotheses, respectively. We propose an online algorithm to train the model components jointly. For the first time, we evaluate the models in a setup, where contextual information is available to infer the sentence vectors. The experimental results on tasks involving classifying, clustering, and ranking sentences show that our model outperforms best existing models by a wide margin across multiple datasets.

Download article

Class imbalance is a challenging issue in practical classification problems for deep learning models as well as traditional models. Traditionally successful countermeasures such as synthetic over-sampling have had limited success with complex, structured data handled by deep learning models. In this paper, we propose Deep Over-sampling (DOS), a framework for extending the synthetic over-sampling method to the deep feature space acquired by a convolutional neural network (CNN). Its key feature is an explicit, supervised representation learning, for which the training data presents each raw input sample with a synthetic embedding target in the deep feature space, which is sampled from the linear subspace of in-class neighbors. We implement an iterative process of training the CNN and updating the targets, which induces smaller in-class variance among the embeddings, to increase the discriminative power of the deep representation. We present an empirical study using public benchmarks, which shows that the DOS framework not only counteracts class imbalance better than the existing method, but also improves the performance of the CNN in the standard, balanced settings.

Download article

String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size or allow more number of mismatches. This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings resulting in a time cost proportional to dictionary size and mismatches. We propose a fast algorithm for calculating Gapped k-mer Kernel using counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger dictionary and mismatches, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the dictionary term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets).

Download article

The quality of user modeling is crucial for personalized recommender systems. Traditional within-site recommender systems aim at modeling user preferences using only actions within target site, thus suffer from cold-start problem. To alleviate such problem, researchers propose cross-domain models to leverage user actions from other domains within same site. Joint user modeling is later proposed to further integrate user actions from aligned sites for data enrichment. However, there are still limitations in existing works regarding the modeling of heterogeneous actions, the requirement of full alignment and the design of preferences coupling. To tackle these, we propose JUN: a joint user modeling framework using neural network. We take advantage of neural network's capability of capturing different data types and its ability for mining high-level non-linear correlations. Specifically, in additional to site-specific preferences models, we further introduce an auxiliary neural network to transfer knowledge between sites by fine-tuning the user embeddings using alignment information. We adopt JUN for item-based and text-based site to demonstrate its performance. Experimental results indicate that JUN outperforms both within-site and cross-site models. Specifically, JUN achieves relative improvement of 2.96% and 2.37% for item-based and text-based sites (5.77% and 13.54% for cold-start scenarios). Besides performance gain, JUN also achieves great generality and significantly extends the use scenarios.

Download article

Memory networks model information and knowledge as memories that can be manipulated for prediction and reasoning about questions of interest. In many cases, there exists complicated relational structure in the data, by which the memories can be linked together into graphs to propagate information. Typical examples include tree structure of a sentence and knowledge graph in a dialogue system. In this paper, we present a novel graph enhanced memory network GEMN to integrate relational information between memories for prediction and reasoning. Our approach introduces graph attentions to model the relations, and couples them with content-based attentions via an additional neural network layer. It thus can better identify and manipulate the memories related to a given question, and provides more accurate prediction about the final response. We demonstrate the effectiveness of the proposed approach with aspect based sentiment classification. The empirical analysis on real data shows the advantages of incorporating relational dependencies into the memory networks.

Download article

We propose kernel sequential Monte Carlo (KSMC), a framework for sampling from static target densities. KSMC is a family of sequential Monte Carlo (SMC) algorithms that are based on building emulator models of the current particle system in a reproducing kernel Hilbert space. We here focus on modelling nonlinear covariance structure and gradients of the target. The emulator's geometry is adaptively updated and subsequently used to inform local proposals. Unlike in adaptive Markov chain Monte Carlo (MCMC), continuous adaptation does not compromise convergence of the sampler. KSMC combines the strengths of SMC and kernel methods: superior performance for multimodal targets and the ability to estimate model evidence as compared to MCMC, and the emulator's ability to represent targets that exhibit high degrees of nonlinearity. As KSMC does not require access to target gradients, it is particularly applicable on targets whose gradients are unknown or prohibitively expensive. We describe necessary tuning details and demonstrate the benefits of the the proposed methodology on a series of challenging synthetic and real-world examples.

Download article

In this paper we provide a framework to embed logical constraints into the classical learning scheme of kernel machines, that gives rise to a learning algorithm based on a quadratic programming problem. In particular, we show that, once the constraints are expressed using a specific fragment from the Lukasiewicz logic, the learning objective turns out to be convex. We formulate the primal and dual forms of a general multi-task learning problem, where the functions to be determined are predicates (of any arity) defined on the feature space. The learning set contains both supervised examples for each predicate and unsupervised examples exploited to enforce the constraints. We give some properties of the solutions constructed by the framework along with promising experimental results.

Download article

Wikipedia is the largest online encyclopedia that allows anyone to edit articles. In this paper, we propose the use of deep learning to detect vandals based on their edit history. In particular, we develop a multi-source long-short term memory network (M-LSTM) to model user behaviors by using a variety of user edit aspects as inputs, including the history of edit reversion information, edit page titles and categories. With M-LSTM, we can encode each user into a low dimensional real vector, called user embedding. Meanwhile, as a sequential model, M-LSTM updates the user embedding each time after the user commits a new edit. Thus, we can predict whether a user is benign or vandal dynamically based on the up-to-date user embedding. Furthermore, those user embeddings are crucial to discover collaborative vandals.

Download article

Learning embeddings of entities and relations using neural architectures is an effective method of performing statistical learning on large-scale relational data, such as knowledge graphs. In this paper, we consider the problem of regularizing the training of neural knowledge graph embeddings by leveraging external background knowledge. We propose a principled and scalable method for leveraging equivalence and inversion axioms during the learning process, by imposing a set of model-dependent soft constraints on the predicate embeddings. The method has several advantages: i) the number of introduced constraints does not depend on the number of entities in the knowledge base; ii) regularities in the embedding space effectively reflect available background knowledge; iii) it yields more accurate results in link prediction tasks over non-regularized methods; and iv) it can be adapted to a variety of models, without affecting their scalability properties. We demonstrate the effectiveness of the proposed method on several large knowledge graphs.Our evaluation shows that it consistently improves the predictive accuracy of several neural knowledge graph embedding models (for instance,the MRR of TransE on WordNet increases by 11%) without compromising their scalability properties.

Download article

We propose a novel approach to finding explanations of deviating subsets, often called subgroups. Existing approaches for subgroup discovery rely on various quality measures that nonetheless often fail to find subgroup sets that are diverse, of high quality, and most importantly, provide good explanations of the deviations that occur in the data. To tackle this issue we introduce explanation networks, which provide a holistic view on all candidate subgroups and how they relate to each other, offering elegant ways to select high-quality yet diverse subgroup sets. Explanation networks are constructed by representing subgroups by nodes and having weighted edges represent the extent to which one subgroup explains another. Explanatory strength is defined by extending ideas from database causality, in which interventions are used to quantify the effect of one query on another. Given an explanatory network, existing network analysis techniques can be used for subgroup discovery. In particular, we study the use of PageRank for pattern ranking and seed selection (from influence maximization) for pattern set selection. Experiments on synthetic and real data show that the proposed approach finds subgroup sets that are more likely to capture the generative processes of the data than other methods.

Download article

We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.

Download article

The scalable calculation of matrix determinants has been a bottleneck to the widespread application of many machine learning methods such as determinantal point processes, Gaussian processes, generalised Markov random fields, graph models and many others. In this work, we estimate log determinants under the framework of maximum entropy, given information in the form of moment constraints from stochastic trace estimation. The estimates demonstrate a significant improvement on state-of-the-art alternative methods, as shown on a wide variety of matrices from the SparseSuite Matrix Collection. By taking the example of a general Markov random field, we also demonstrate how this approach can significantly accelerate inference in large-scale learning methods involving the log determinant.

Download article

We address the problem of discovering contexts that lead well-distinguished collections of individuals to change their pairwise agreement w.r.t. their usual one. For instance, in the European parliament, while in overall, a strong disagreement is witnessed between deputies of the far-right French party Front National and deputies of the left party Front de Gauche, a strong agreement is observed between these deputies in votes related to the thematic: External relations with the union. We devise the method DSC (Discovering Similarities Changes) which relies on exceptional model mining to uncover three-set patterns that identify contexts and two collections of individuals where an unex- pected strengthening or weakening of pairwise agreement is observed. To efficiently explore the search space, we define some closure operators and pruning techniques using upper bounds on the quality measure. In addition of handling usual attributes (e.g. numerical, nominal), we propose a novel pattern domain which involves hierarchical multi-tag attributes that are present in many datasets. A thorough empirical study on two real-world datasets (i.e., European parliament votes and collaborative movie reviews) demonstrates the efficiency and the effectiveness of our approach as well as the interest and the actionability of the patterns.

Download article

In this paper we study a problem of determining when entities are active based on their interactions with each other. More formally, we consider a set of entities V and a sequence of time-stamped edges E among the entities. Each edge (u,v,t) in E denotes an interaction between entities u and v that takes place at time t. We view this input as a temporal network. We then assume a simple activity model in which each entity is active during a short time interval. An interaction (u,v,t) can be explained if at least one of u or v are active at time t. Our goal is to reconstruct the activity intervals, for all entities in the network, so as to explain the observed interactions. This problem, which we refer to as the network-untangling problem, can be applied to discover timelines of events from complex interactions among entities. We provide two formulations for the network-untangling problem: (i) minimizing the total interval length over all entities, and (ii) minimizing the maximum interval length. We show that the sum problem is NP-hard, while, surprisingly, the max problem can be solved optimally in linear time, using a mapping to 2-SAT. For the sum problem we provide efficient and effective algorithms based on realistic assumptions. Furthermore, we complement our study with an extensive evaluation on synthetic and real-world datasets, which demonstrates the validity of our concepts and the good performance of our algorithms.

Download article

Many machine learning algorithms minimize a regularized risk, and stochastic optimization is widely used for this task. When working with massive data, it is desirable to perform stochastic optimization in parallel. Unfortunately, many existing stochastic optimization algorithms cannot be parallelized efficiently. In this paper we show that one can rewrite the regularized risk minimization problem as an equivalent saddle-point problem, and propose an efficient distributed stochastic optimization (DSO) algorithm. We prove the algorithm's rate of convergence; remarkably, our analysis shows that the algorithm scales almost linearly with the number of processors. We also verify with empirical evaluations that the proposed algorithm is competitive with other parallel, general purpose stochastic and batch optimization algorithms for regularized risk minimization.

Download article

Knowledge graph completion with representation learning predicts new entity-relation triples from the existing knowledge graphs by embedding entities and relations into a vector space. Most existing methods focus on the structured information of triples and maximize the likelihood of them. However, they neglect semantic information contained in most knowledge graphs and the prior knowledge indicated by the semantic information. To overcome this drawback, we propose an approach that integrates the structured information and entity types which describe the categories of entities. Our approach constructs relation types from entity types and utilizes type-based semantic similarity of the related entities and relations to capture prior distributions of entities and relations. With the type-based prior distributions, our approach generates multiple embedding representations of each entity in different contexts and estimates the posterior probability of entity and relation prediction. Extensive experiments show that our approach outperforms previous semantics-based methods.

Download article

Despite prolific success, kernel methods become difficult to use in many large-scale unsupervised problems because of the evaluation and storage of the full Gram matrix. Here we overcome this difficulty by proposing a novel approach: compute the optimal small, out-of-sample Nystrom sketch which allows for fast approximation of the Gram matrix via the Nystrom method. We demonstrate and compare several methods for computing the optimal Nystrom sketch and show how this approach outperforms previous state-of-the-art Nystrom subset-based methods of similar size.

Download article

Reliable evaluation of network mining tools implies significance and scalability testing. This is usually achieved by picking several graphs of various size from different domains. However, graph properties and thus evaluation results could be dramatically different from one domain to another. Hence the necessity of aggregating results over a multitude of graphs within each domain. The paper introduces an approach to automatically learn features of a directed graph from any domain and generate similar graphs while scaling input graph size with a real-valued factor. Generating multiple graphs with similar size allows significance testing, while scaling graph size makes scalability evaluation possible. The proposed method relies on embedding an input graph into low-dimensional space, thus encoding graph features in a set of node vectors. Edge weights and node communities could be imitated as well in optional steps. We demonstrate that embedding-based approach ensures variability of synthetic graphs while keeping degree and subgraphs distributions close to the original graphs. Therefore, the method could make significance and scalability testing of network algorithms more reliable without the need to collect additional data. We also show that embedding-based approach preserves various features in generated graphs which can't be achieved by other generators imitating a given graph.

Download article

We present a novel notion of outlier, called Concentration Free Outlier Factor (CFOF), having the peculiarity to resist concentration phenomena that affect other scores when the dimensionality of the feature space increases. Indeed we formally prove that CFOF does not concentrate in intrinsically high-dimensional spaces. Moreover, CFOF is adaptive to different local density levels and it does not require the computation of exact neighbors in order to be reliably computed. We present a very efficient technique, named fast-CFOF, for detecting outliers in very large high-dimensional datasets. The technique is efficiently parallelizable, and we provide a MIMD-SIMD implementation. Experimental results witness for scalability and effectiveness of the technique and highlight that CFOF exhibits state of the art detection performances.

Download article

We present a simple generative framework for learning to predict previously unseen classes, based on estimating class-attribute-gated class-conditional distributions. We model each class-conditional distribution as an exponential family distribution and the parameters of the distribution of each seen/unseen class are defined as functions of the respective observed class attributes. These functions can be learned using only the seen class data and can be used to predict the parameters of the class-conditional distribution of each unseen class. Unlike most existing methods for zero-shot learning that represent classes as fixed embeddings in some vector space, our generative model naturally represents each class as a probability distribution. It is simple to implement and also allows leveraging additional unlabeled data from unseen classes to improve the estimates of their class-conditional distributions using transductive/semi-supervised learning. Moreover, it extends seamlessly to few-shot learning by easily updating these distributions when provided with a small number of additional labelled examples from unseen classes. Through a comprehensive set of experiments on several benchmark data sets, we demonstrate the efficacy of our framework.

Download article

In this paper we address the anomaly detection problem in a supervised setting where positive examples might be very sparse. We tackle this task with a learning to rank strategy by optimizing a differentiable smoothed surrogate of the so-called Average Precision (AP). Despite its non-convexity, we show how to use it efficiently in a stochastic gradient boosting framework. We show that using AP is much better to optimize the top rank alerts than the state of the art measures. We demonstrate on anomaly detection tasks that the interest of our method is even reinforced in highly unbalanced scenarios.

Download article

PCA is a classical statistical technique whose simplicity and maturity has seen it find widespread use as an anomaly detection technique. However, it is limited in this regard by being sensitive to gross perturbations of the input, and by seeking a linear subspace that captures normal behaviour. The first issue has been dealt with by robust PCA, a variant of PCA that explicitly allows for some data points to be arbitrarily corrupted; however, this does not resolve the second issue, and indeed introduces the new issue that one can no longer inductively find anomalies on a test set. This paper addresses both issues in a single model, the robust autoencoder. This method learns a nonlinear subspace that captures the majority of data points, while allowing for some data to have arbitrary corruption. The model is simple to train and leverages recent advances in the optimisation of deep neural networks. Experiments on a range of real-world datasets highlight the model's effectiveness.

Download article

In this paper, we propose a general framework DeepCluster to integrate traditional clustering methods into deep learning (DL) models and adopt Alternating Direction of Multiplier Method (ADMM) to optimize it. While most existing DL based clustering techniques have separate feature learning (via DL) and clustering (with traditional clustering methods), DeepCluster simultaneously learns feature representation and does cluster assignment under the same framework. Furthermore, it is a general and flexible framework that can employ different networks and clustering methods. We demonstrate the effectiveness of DeepCluster by integrating two popular clustering methods: K-means and Gaussian Mixture Model (GMM) into deep networks. The experimental results shown that our method can achieve state-of-the-art performance on learning representation for clustering analysis.

Download article

Cyberbullying is a phenomenon which negatively affects the individuals, the victims suffer from various mental issues, ranging from depression, loneliness, anxiety to low self-esteem. In parallel with the pervasive use of social media, cyberbullying is becoming more and more prevalent. Traditional mechanisms to fight against cyberbullying include the use of standards and guidelines, human moderators, and blacklists based on the profane words. However, these mechanisms fall short in social media and cannot scale well. Therefore, it is necessary to develop a principled learning framework to automatically detect cyberbullying behaviors. However, it is a challenging task due to short, noisy and unstructured content information and intentional obfuscation of the abusive words or phrases by social media users. Motivated by sociological and psychological findings on bullying behaviors and the correlation with emotions, we propose to leverage sentiment information to detect cyberbullying behaviors in social media by proposing a sentiment informed cyberbullying detection framework. Experimental results on two real-world, publicly available social media datasets show the superiority of the proposed framework. Further studies validate the effectiveness of leveraging sentiment information for cyberbullying detection.

Download article

Clustering of customer transaction data is very important in retail and e-commerce companies. In this paper, we propose a local PurTree Subspace Spectral (LPSS) clustering algorithm for customer transaction data. In the new method, a recently proposed ``Purchase Tree'' is used to represent the customer transaction data. We propose a PurTree subspace distance to compute the distance between two trees, in which a set of sparse node weights are introduced to distinguish the importance of different nodes in a purchase tree. The new method learns a data similarity matrix from the local distances and a set of sparse node weights in the PurTree subspace distance simultaneously. An iterative optimization algorithm is proposed to optimize the proposed model. We also present an effective method to compute a regularization parameter in LPSS. We compare LPSS with five clustering algorithms on 10 benchmark data sets and the experimental results show the superiority of the new method.

Download article

In a growing number of application domains, multiple feature representations or views are available to describe objects. Multi-view clustering tries to find similar groups of objects across these views. This task is complicated when the corresponding clusterings in each view show poor agreement (conflicting views). In such cases, traditional multi-view clustering methods will not benefit from using multi-view data. Here, we propose to overcome this problem by combining the ideas of multi-view spectral clustering with alternative clustering through kernel-based dimensionality reduction. Our method automatically determines feature transformations in each view that lead to an optimal clustering w.r.t to a new proposed objective function for conflicting views. In our experiments, our approach outperforms state-of-the-art multi-view clustering methods by more accurately detecting the ground truth clustering supported by all views.

Download article

Most user-based websites such as social networks(Twitter,Facebook) and e-commerce websites (Amazon) have been targets of group fraud (multiple users working together for malicious purposes). How can we better rank malicious entities in such cases of group-fraud? Most of the existing work in group anomaly detection detects lock-step behavior by detecting dense blocks in matrices, and recently, in tensors. However, there is no principled way of scoring the users based on their participation in these dense blocks. In addition, existing methods do not take into account temporal features while detecting dense blocks, which are crucial to uncover bot-like behaviors. In this paper (a) we propose a systematic way of handling temporal information; (b) we give a list of axioms that any individual suspiciousness metric satisfies; (c) we propose ZOORANK, an algorithm that finds and ranks suspicious entities (users, targeted products, days, etc.) effectively in real-world datasets. Experimental results on multiple real-world datasets show that ZOORANK detected and ranked the suspicious entities with a nearly accurate precision, while outperforming the baseline approach.

Download article

In this paper we propose a survival factorization framework that models information cascades by tying together social influence patterns, topical structure and temporal dynamics. This is achieved through the introduction of a latent space which encodes: (a) the relevance of a information cascade on a topic; (b) the topical authoritativeness and the susceptibility of each individual involved in the information cascade, and (c) temporal topical patterns. By exploiting the cumulative properties of the survival function and of the likelihood of the model on a given adoption log, which records the observed activation times of users and side-information for each cascade, we show that the inference phase is linear in the number of users and in the number of adoptions. The evaluation on both synthetic and real-world data shows the effectiveness of the model in detecting the interplay between topics and social influence patterns, which ultimately provides high accuracy in predicting users activation times.

Download article

In this study we investigate the recommendation problem with trust and distrust relationships to overcome the sparsity of users' preferences, accounting for the fact that users trust the recommendations of their friends, and they do not accept the recommendations of their foes. In addition, not only users' preferences are sparse, but also users' social relationships. So, we first propose an inference step with multiple random walks to predict the implicit-missing trust relationships that users might have in recommender systems, while considering users' explicit trust and distrust relationships during the inference. We introduce a regularization method and design an objective function with a social regularization term to weigh the influence of friends' trust and foes' distrust degrees on users' preferences. We formulate the objective function of our regularization method as a minimization problem with respect to the users' and items' latent features and then we solve our recommendation problem via gradient descent. Our experiments confirm that our approach preserves relatively high recommendation accuracy in the presence of sparsity in both the users' preferences and social relationships, significantly outperforming several state-of-the-art methods.

Download article

Common time series similarity measures that operate on the full series (like Euclidean distance or Dynamic Time Warping DTW) do not correspond well to the visual similarity as perceived by a human. Based on the interval tree of scale, we propose a multiscale Bezier representation of time series, that supports the definition of elastic similarity measures that overcome this problem. With this representation the matching can be performed efficiently as similarity is measured segment-wise rather than element-wise (as with DTW). We effectively restrict the set of warping paths considered by DTW and the results do not only correspond better to the analysts intuition but improve the accuracy in the standard 1NN time series classification.

Download article

Recent developments in lifelong machine learning have demonstrated that it is possible to learn multiple tasks consecutively, transferring knowledge between those tasks to accelerate learning and improve performance. However, these methods are limited to using linear parametric base learners, substantially restricting the predictive power of the resulting models. We present a lifelong learning algorithm that can support non-parametric models, focusing on Gaussian processes. To enable efficient online transfer between Gaussian process models, our approach assumes a factorized formulation of the covariance functions, and incrementally learns a shared sparse basis for the models' parameterizations. We show that this lifelong learning approach is highly computationally efficient, and outperforms existing methods on a variety of data sets.

Download article

Image tag recommendation in social media systems provides the users with personalized tag suggestions which facilitate the users' tagging task and enable automatic organization and many image retrieval tasks. Factorization models are a widely used approach for personalized tag recommendation and achieve good results. These methods rely on the user's tagging preferences only and ignore the contents of the image. However, it is obvious that especially the contents of the image, such as the objects appearing in the image, colors, shapes or other visual aspects, strongly influence the user's tagging decisions. We present a personalized content-aware image tag recommendation approach that combines both historical tagging information and image-based features in a factorization model. Employing transfer learning, we apply state of the art deep learning image classification and object detection techniques to extract powerful features from the images. Both, image information and tagging history, are fed to an adaptive factorization model to recommend tags. Empirically, we can demonstrate that the visual and object-based features can improve the performance up to 1.5 percent over the state of the art.

Download article

We present a unified contextual bandit framework for recommendation problems that is able to capture long- and short-term interests of users. The model is devised in dual space and the derivation is consequentially carried out using Fenchel-Legrende conjugates and thus lever- ages to a wide range of tasks and settings. We detail two instantiations for regression and classification scenarios and obtain well-known algorithms for these special cases. The resulting general and unified framework allows for quickly adapting contextual bandits to different applications at-hand. The empirical study demonstrates that the proposed long- and short-term framework outperforms both, short-term and long-term models on data. Moreover, a tweak of the combined model proves beneficial in cold start problems.

Download article

This paper investigates the problem of highly imbalanced time-series classification using shapelets. The current state-of-the-art approach learns generalized shapelets along with weights of the classification hyperplane via a classical cost-insensitive loss function. Cost-insensitive loss functions tend to treat different misclassification errors equally and thus, models are usually biased towards examples of majority class. The rare class (which will be referred to as positive class) is usually the important class and a false negative is always costlier than a false positive. Traditional cost-insensitive loss functions fail to differentiate between these two types of misclassification errors. In this paper, the generalized shapelets learning framework is extended and a cost-sensitive learning model is proposed. Instead of incorporating the misclassification cost as a prior knowledge, as was done by other published methods, we formulate a constrained optimization problem to learn the unknown misclassification costs along with the shapelets and their weights. First, we demonstrated the effectiveness of the proposed method on two case studies, with the objective to detect true alarms from life threatening cardiac arrhythmia dataset from Physionets MIMIC II repository. The results show improved true alarm detection rates over the current state-of-the-art method. Next, we compared to the state-of-the-art learning shapelet method on 16 balanced dataset from UCR time-series repository. The results show evidence that the proposed method outperforms the state-of-the-art method. Finally, we performed extensive experiments across additional 18 imbalanced time-series datasets. The results provide evidence that the proposed method achieves comparable results with state-of-the-art sampling/non-sampling based approaches for highly imbalanced time-series datasets. However, our method is highly interpretable which is advantage over many other methods.

Download article

In the time-series classification context, the majority of the most accurate core methods are based on the Bag-of-Words framework, in which sets of local features are first extracted from time series. A dictionary of words is then learned and each time series is finally represented by a histogram of word occurrences. This representation induces a loss of information due to the quantization of features into words as all the time series are represented using the same fixed dictionary. In order to overcome this issue, we introduce in this paper a kernel operating directly on sets of features. Then, we extend it to a time-compliant kernel that allows one to take into account the temporal information. We apply this kernel in the time series classification context. Proposed kernel has a quadratic complexity with the size of input feature sets, which is problematic when dealing with long time series. However, we show that kernel approximation techniques can be used to define a good trade-off between accuracy and complexity. We experimentally demonstrate that the proposed kernel can significantly improve the performance of time series classification algorithms based on Bag-of-Words.

Download article

Domain adaptation (DA) is an important and emerging field of machine learning that tackles the problem occurring when the distributions of training (source domain) and test (target domain) data are similar but different. Current theoretical results show that the efficiency of DA algorithms depends on their capacity of minimizing the divergence between source and target probability distributions. In this paper, we provide a theoretical study on the advantages that concepts borrowed from optimal transportation theory can bring to DA. In particular, we show that the Wasserstein metric can be used as a divergence measure between distributions to obtain generalization guarantees for three different learning settings: (i) classic DA with unsupervised target data (ii) DA combining source and target labeled data, (iii) multiple source DA. Based on the obtained results, we provide some insights showing when this analysis can be tighter than other existing frameworks. We think that these results open the door to novel ideas and directions for DA.

Download article

To predict customers next choice in the context of what he/she has bought in a session is interesting and critical in the transaction domain especially for online shopping. Precise prediction leads to high quality recommendations and thus high benefit. Such kind of recommendation is usually formalized as transaction-based recommender systems (TBRS). Existing TBRS either tend to recommend popular items while ignore infrequent and newly-released ones (e.g., pattern-based RS) or assume a rigid order between items within a transaction(e.g., Markov Chain-based RS) which does not satisfy real-world cases in most time. In this paper, we propose a neural network based comprehensive transaction embedding model (NTEM) which can effectively perceive the next choice in a transaction context. Specifically, we learn these comprehensive embeddings of both items and their features from relaxed ordered transactions. The relevance between items revealed by the transactions is encoded into such embeddings. With rich information embedded, such embeddings are powerful to predict the next choices given those already bought items. NTEM is a shallow wide-in-wide-out network, which is more efficient than deep networks considering large numbers of items and transactions. Experimental results on real-world datasets show that NTEM outperforms three typical TBRS models FPMC, PRME and GRU4Rec in terms of recommendation accuracy and novelty. Our implementation is available at https://github.com/shoujin88/NTEM-model

Download article

Traditional linear methods for forecasting multivariate time series are not able to satisfactorily model the non-linear dependencies that may exist in non-Gaussian series. We build on the theory of learning vector-valued functions in the reproducing kernel Hilbert space and develop a method for learning prediction functions that accommodate such non-linearities. The method not only learns the predictive function but also the matrix-valued kernel underlying the function search space directly from the data. Our approach is based on learning multiple matrix-valued kernels, each of those composed of a set of input kernels and a set of output kernels learned in the cone of positive semi-definite matrices. In addition to superior predictive performance in the presence of strong non-linearities, our method also recovers the hidden dynamic relationships between the series and thus is a new alternative to existing graphical Granger techniques.

Download article

Unsupervised Domain Adaptation (UDA) considers the problem of adapting a classifier trained using labelled training instances from a source domain to a different target domain, without having access to any labelled training instances from the target domain. Projection-based methods, where the source and target domain instances are first projected onto a common feature space on which a classifier can be trained and applied have produced state-of-the-art results for UDA. However, a critical pre-processing step required by these methods is the selection of a set of common features (aka. pivots), this is typically done using heuristic approaches,applied prior to performing domain adaptation. In contrast to the one of heuristics, we propose a method for learning Task-Specific Pivots (TSPs) in a systematic manner by considering both the labelled and unlabelled data available from both domains. We evaluate TSPs against pivots selected using alternatives in two cross-domain sentiment classification applications. Our experimental results show that the proposed TSPs significantly outperform previously proposed selection strategies in both tasks. Moreover, when applied in a cross-domain sentiment classification task, TSP captures many sentiment-bearing pivots.

Download article

Urban anomalies such as noise complaints and potholes in streets negatively affect our everyday life and need to be addressed in a timely manner. While significant effort has been made in detecting whether there exist anomalies in current urban data, the prediction of future urban anomalies is much less well explored and understood. In this work, we formalize the anomaly prediction problem in urban environments, such that those can be addressed in a more timely and efficient manner. We develop the Urban Anomaly PreDiction (UAPD) framework, which addresses a number of challenges, including the dynamics of different categories of anomalies, as well as varied spatial-temporal distributions. Given the urban anomaly data to date, UAPD first detects its change point in the temporal dimension, then a tensor decomposition based model decouples the interrelations among the spatial, temporal, and categorical dimensions of the urban data to detect the potential anomalies. Subsequently, based on the latent tensor of each dimension, we apply the autoregression method to predict which category of anomalies will happen at each region in the future. We perform extensive experiments to evaluate the proposed UAPD framework in two urban environments, namely New York City and Pittsburgh. Experimental results demonstrate that UAPD outperforms alternative baselines across various settings, including different region and time-frame scales, as well as diverse categories of anomalies.

Download article

Cross-DomainCollaborativeFiltering(CDCF)hasattracted various research works in recent years. However, an important problem setting, i.e., “users and items in source and target domains are totally different�, has not received much attention yet. We coin this problem as Non-Overlapping Cross-Domain Collaborative Filtering (NOCDCF). In order to solve this challenging CDCF task, we propose a novel 3-step rating pattern transfer model, i.e. low-rank knowledge transfer via fac- torization machines (LKT-FM). Our solution is able to mine high quality knowledge from large and sparse source matrices, and to integrate the knowledge without losing much information contained in the target ma- trix via exploiting Factorization Machine (FM). Extensive experiments on real world datasets show that the proposed LKT-FM model outper- forms the state-of-the-art CDCF solutions.

Download article

Given labeled data represented by a binary matrix, we consider the task to derive a Boolean matrix factorization which identifies commonalities and specifications among the classes. While existing works focus on rank-one factorizations which are either specific or common to the classes, we derive class-specific alterations from common factorizations as well. Therewith, we broaden the applicability of our new method to datasets whose class-dependencies have a more complex structure. On the basis of synthetic and real-world datasets, we show on the one hand that our method is able to filter structure which corresponds to our model assumption, and on the other hand that our model assumption is justified in real-world application. Our method is parameter-free.

Download article

A proper semantic representation for encoding side information is key to the success of zero-shot learning. In this paper, we explore two alternative semantic representations especially for zero-shot human action recognition: textual descriptions of human actions and deep features extracted from still images relevant to human actions. Such side information are accessible on Web with little cost, which paves a new way in gaining side information for large-scale zero-shot human action recognition. We investigate different encoding methods to generate semantic representations for human actions from such side information. Based on our zero-shot visual recognition method, we conducted experiments on UCF101 and HMDB51 to evaluate two proposed semantic representations . The results suggest that our proposed text- and image-based semantic representations outperform traditional attributes and word vectors considerably for zero-shot human action recognition. In particular, the image-based semantic representations yield the favourable performance even though the representation is extracted from a small number of images per class.

Download article

Learning interactions between dynamical processes is a widespread but difficult problem in ecological or human sciences. Unlike in other domains (bioinformatics, for example), data is often scarce, but expert knowledge is available. We consider the case where knowledge is about a limited number of interactions that drive the processes dynamics, and on a community structure in the interaction network. We propose an original framework, based on Dynamic Bayesian Networks with labeled-edge structure and parsimonious parameterization, and a Stochastic Block Model prior, to integrate this knowledge. Then we propose a restoration-estimation algorithm, based on 0-1 Linear Programing, that improves network learning when these two types of expert knowledge are available. The approach is illustrated on a problem of ecological interaction network learning.

Download article

In this paper, we study the trade-offs of different inference approaches for Bayesian matrix factorisation methods, which are commonly used for predicting missing values, and for finding patterns in the data. In particular, we consider Bayesian nonnegative variants of matrix factorisation and tri-factorisation, and compare non-probabilistic inference, Gibbs sampling, variational Bayesian inference, and a maximum-a-posteriori approach. The variational approach is new for the Bayesian nonnegative models. We compare their convergence, and robustness to noise and sparsity of the data, on both synthetic and real-world datasets. Furthermore, we extend the models with the Bayesian automatic relevance determination prior, allowing the models to perform automatic model selection, and demonstrate its efficiency.

Download article

Research on person re-identification (re-id) has attached much attention in the machine learning field in recent years. With sufficient labeled training data, supervised re-id algorithm can obtain promising performance. However, producing labeled data for training supervised re-id models is an extremely challenging and time-consuming task because it requires every pair of images across no-overlapping camera views to be labeled. Moreover, in the early stage of experiments, when labor resources are limited, only a small number of data can be labeled. Thus, it is essential to design an effective algorithm to select the most representative samples. This is referred as early active learning or early stage experimental design problem. The pairwise relationship plays a vital role in the re-id problem, but most of the existing early active learning algorithms fail to consider this relationship. To overcome this limitation, we propose a novel and efficient early active learning algorithm with a pairwise constraint for person re-identification in this paper. By introducing the pairwise constraint, the closeness of similar representations of instances is enforced in active learning. This benefits the performance of active learning for re-id. Extensive experimental results on four benchmark datasets confirm the superiority of the proposed algorithm.

Download article

A sensor in a sensor network is expected to be able to make prediction or decision utilizing the models learned from the data observed on this sensor. However, in the early stage of using a sensor, there may be not a lot of data available to train the model for this sensor. A solution is to leverage the observation data from other sensors which have similar conditions and models with the given sensor. We thus propose a novel distributed multi-task learning approach which incorporates neighborhood relations among sensors to learn multiple models simultaneously in which each sensor corresponds to one task. It may be not cheap for each sensor to transfer the observation data from other sensors; broadcasting the observation data of a sensor in the entire network is not satisfied for the reason of privacy protection; each sensor is expected to make real-time prediction independently from neighbor sensors. Therefore, this approach shares the model parameters as regularization terms in the objective function by assuming that neighbor sensors have similar model parameters. We conduct the experiments on two real datasets by predicting the temperature with the regression. They verify that our approach is effective, especially when the bias of an independent model which does not utilize the data from other sensors is high such as when there is not plenty of training data available.

Download article

Topic models for text analysis are most commonly trained using either Gibbs sampling or variational Bayes. Recently, hybrid variational-Gibbs algorithms have been found to combine the best of both worlds. Variational algorithms are fast to converge and more efficient for inference on new documents. Gibbs sampling enables sparse updates since each token is only associated with one topic instead of a distribution over all topics. Additionally, Gibbs sampling is unbiased. Although Gibbs sampling takes longer to converge, it is guaranteed to arrive at the true posterior after infinitely many iterations. By combining the two methods it is possible to reduce the bias of variational methods while simultaneously speeding up variational updates. This idea has previously been applied to standard latent Dirichlet allocation (LDA). We propose a new sampling method that enables the application of the idea to the nonparametric version of LDA, hierarchical Dirichlet process topic models. Our fast sampling method leads to a significant speedup of variational updates as compared to other sampling methods. Experiments show that training of our topic model converges to a better loglikelihood than previously existing variational methods and converges faster than Gibbs sampling in the batch setting.

Download article

We introduce Poisson Matrix Factorization with Content and Social trust information (PoissonMF-CS), a latent variable probabilistic model for recommender systems with the objective of jointly modeling social trust, item content and user's preference using Poisson matrix factorization framework. This probabilistic model is equivalent to collectively factorizing a non-negative user--item interaction matrix and a non-negative item--content matrix. The user--item matrix consists of sparse implicit (or explicit) interactions counts between user and item, and the item--content matrix consists of words or tags counts per item. The model imposes additional constraints given by the social ties between users, and the homophily effect on social networks -- the tendency of people with similar preferences to be socially connected. Using this model we can account for and fine-tune the weight of content-based and social-based factors in the user preference. We develop approximate variational inference algorithm and perform experiments comparing PoissonMF-CS with competing models. The experimental evaluation indicates that PoissonMF-CS achieves superior predictive performance on held-out data for the top-$M$ recommendations task. Also, we observe that PoissonMF-CS generates compact latent representations when compared with alternative models while maintaining superior predictive performance.

Download article

Sparse mapping has been a key methodology in many high-dimensional scientific problems. When multiple tasks share the set of relevant features, learning them jointly in a group drastically improves the quality of relevant feature selection. However, in practice this technique is used limitedly since such grouping information is usually hidden. In this paper, our goal is to recover the group structure on the sparsity patterns and leverage that information in the sparse learning. Toward this, we formulate a joint optimization problem in the task parameter and the group membership, by constructing an appropriate regularizer to encourage sparse learning as well as correct recovery of task groups. We further demonstrate that our proposed method recovers groups and the sparsity patterns in the task parameters accurately by extensive experiments.

Download article

Extracting features from incomplete tensors is a challenging task which is not well explored. Due to the data with missing entries, existing feature extraction methods are not applicable. Although tensor completion techniques can estimate the missing entries well, they focus on data recovery and do not consider the relationships among tensor samples for effective feature extraction. To solve this problem of feature extraction for incomplete data, we propose an unsupervised method, TDVM, which incorporates low-rank Tucker Decomposition with feature Variance Maximization in a unified framework. Based on Tucker decomposition, we impose nuclear norm regularization on the core tensors while minimizing reconstruction errors, and meanwhile maximize the variance of core tensors (i.e., extracted features). Here, the relationships among tensor samples are explored via variance maximization while estimating the missing entries. We thus can simultaneously obtain lower-dimensional core tensors and informative features directly from observed entries. The alternating direction method of multipliers approach is utilized to solve the optimization objective. We evaluate the features extracted from two real data with different missing entries for face recognition tasks. Experimental results illustrate the superior performance of our method with a significant improvement over the state-of-the-art methods.

Download article

In this paper we propose a new semi-supervised GAN architecture (ss-InfoGAN) for image synthesis that leverages information from few labels (as little as 0.22%, max. 10% of the dataset) to learn semantically meaningful and controllable data representations where latent variables correspond to label categories. The architecture builds on Information Maximizing Generative Adversarial Networks (InfoGAN) and is shown to learn both continuous and categorical codes and achieves higher quality of synthetic samples compared to fully unsupervised settings. Furthermore, we show that using small amounts of labeled data speeds-up training convergence. The architecture maintains the ability to disentangle latent variables for which no labels are available. Finally, we contribute an information-theoretic reasoning on how introducing semi-supervision increases mutual information between synthetic and real data.

Download article

In computing, remote devices may be identified by means of device fingerprinting, which works by collecting a myriad of client-side attributes such as the device's browser and operating system version, installed plugins, screen resolution, hardware artifacts, Wi-Fi settings, and anything else available to the server, and then merging these attributes into uniquely identifying fingerprints. This technique is used in practice to present personalized content to repeat website visitors, detect fraudulent users, and stop masquerading attacks on local networks. However, device fingerprints are seldom uniquely identifying. They are better viewed as partial device fingerprints, which do have some discriminatory power but not enough to uniquely identify users. How can we infer from partial fingerprints whether different observations belong to the same device? We present a mathematical formulation of this problem that enables probabilistic inference of the correspondence of observations. We set out to estimate a correspondence probability for every pair of observations that reflects the plausibility that they are made by the same user. By extending probabilistic data association techniques previously used in object tracking, traffic surveillance and citation matching, we develop a general-purpose probabilistic method for estimating correspondence probabilities with partial fingerprints. Our approach exploits the natural variation in fingerprints and allows for use of situation-specific knowledge through the specification of a generative probability model. Experiments with a real-world dataset show that our approach gives calibrated correspondence probabilities. Moreover, we demonstrate that improved results can be obtained by combining device fingerprints with behavioral models.

Download article

Scoring functions are an important tool for quantifying properties of interest in many domains; for example, in healthcare, a disease severity scores are used to diagnose the patient's condition and to decide its further treatment. Scoring functions might be obtained based on the domain knowledge or learned from data by using classification, regression or ranking techniques - depending on the type of supervised information. Although learning scoring functions from collected data is beneficial, it can be challenging when limited data are available. Therefore, learning multiple distinct, but related, scoring functions together can increase their quality as shared regularities may be easier to identify. We propose a multitask formulation for ranking-based learning of scoring functions, where the model is trained from pairwise comparisons. The approach uses mixed-norm regularization to impose structural regularities among the tasks. The proposed regularized objective function is convex; therefore, we developed an optimization approach based on alternating minimization and proximal gradient algorithms to solve the problem. The increased predictive accuracy of the presented approach, in comparison to several baselines, is demonstrated on synthetic data and two different real-world applications; predicting exam scores and predicting tolerance to infections score.

Download article

Understanding spatio-temporal activities in a city is a typical problem of spatio-temporal data analysis. For this analysis, tensor factorization methods have been widely applied for extracting a few essential patterns into latent factors. Non-negative Tensor Factorization (NTF) is popular because of its capability of learning interpretable factors from non-negative data, simple computation procedures, and dealing with missing observation. However, since existing NTF methods are not fully aware of spatial and temporal dependencies, they often fall short of learning latent factors where a large portion of missing observation exist in data. In this paper, we present a novel NTF method for extracting smooth and flat latent factors by leveraging various kinds of spatial and temporal structures. Our method incorporates a unified structured regularizer into NTF that can represent various kinds of auxiliary information, such as an order of timestamps, a daily and weekly periodicity, distances between sensor locations, and areas of locations. For the estimation of the factors for our model, we present a simple and efficient optimization procedure based on the alternating direction method of multipliers. In missing value interpolation experiments of traffic flow data and bike-sharing system data, we demonstrate that our proposed method improved interpolation performances from existing NTF, especially when a large portion of missing values exists.

Download article

Charts are an excellent way to convey patterns and trends in data, but they do not facilitate further modeling of the data or close inspection of individual data points. We present a fully automated system for extracting the numerical values of data points from images of scatter plots. We use deep learning techniques to identify the key components of the chart, and optical character recognition together with robust regression to map from pixels to the coordinate system of the chart. We focus on scatter plots with linear scales, which already have several interesting challenges. Previous work has done fully automatic extraction for other types of charts, but to our knowledge this is the first approach that is fully automatic for scatter plots. Our method performs well, achieving successful data extraction on 89% of the plots in our test set.

Download article

Colorization of grayscale images is a hot topic in computer vision. Previous research mainly focuses on producing a color image to recover the original one in a supervised learning fashion. However, since many colors share the same gray value, an input grayscale image could be diversely colorized while maintaining its reality. In this paper, we design a novel solution for unsupervised diverse colorization. Specifically, we leverage conditional generative adversarial networks to model the distribution of real-world item colors, in which we develop a fully convolutional generator with multi-layer noise to enhance diversity, with multi-layer condition concatenation to maintain reality, and with stride 1 to keep spatial information. With such a novel network architecture, the model yields highly competitive performance on the open LSUN bedroom dataset. The Turing test on 80 humans further indicates our generated color schemes are highly convincible.

Download article

Applied Data Science track

Twitter provides an open and rich source of data for studying human behaviour at scale and is widely used in social and network sciences. However, a major criticism of Twitter data is that demographic information is largely absent. Enhancing Twitter data with user ages would advance our ability to study social network structures, information flows and the spread of contagions. Approaches toward age detection of Twitter users typically focus on specific properties of tweets, e.g., linguistic features, which are language dependent. In this paper, we devise a language-independent methodology for determining the age of Twitter users from data that is native to the Twitter ecosystem. The key idea is to use a Bayesian framework to generalise ground-truth age information from a few Twitter users to the entire network based on what/whom they follow. Our approach scales to inferring the age of 700 million Twitter accounts with high accuracy.

Download article

Signature extraction is a key part of forensic log analysis. It involves recognizing patterns in log lines such that log lines that originated from the same line of code are grouped together. A log signature consists of immutable parts and mutable parts. The immutable parts define the signature, and the mutable parts are typically variable parameter values. In practice, the number of log lines and signatures can be quite large, and the task of detecting and aligning immutable parts of the logs to extract the signatures becomes a significant challenge. We propose a novel method based on a neural language model that outperforms the current state-of-the-art on signature extraction. We use an RNN auto-encoder to create an embedding of the log lines. Log lines embedded in such a way can be clustered to extract the signatures in an unsupervised manner.

Download article

Identifying events in real-time data streams such as Twitter is crucial for many occupations to make timely, actionable decisions. It is however extremely challenging because of the subtle difference between ``events'' and trending topics, the definitive rarity of these events, and the complexity of modern Internet's text data. Existing approaches often utilize topic modeling technique and keywords frequency to detect events on Twitter, which have three main limitations: 1) supervised and semi-supervised methods run the risk of missing important, breaking news events; 2) existing topic/event detection models are base on words, while the correlations among phrases are ignored; 3) many previous methods identify trending topics as events. To address these limitations, we propose the model, PhraseNet, an algorithm to detect and summarize events from tweets. To begin, all topics are defined as a clustering of high-frequency phrases extracted from text. All trending topics are then identified based on temporal spikes of the phrase cluster frequencies. PhraseNet thus filters out high-confidence events from other trending topics using number of peaks and variance of peak intensity. We evaluate PhraseNet on a three month duration of Twitter data and show the both the efficiency and the effectiveness of our approach.

Download article

In this research, we focus on automatic supervised stance classification of tweets. Given test datasets of tweets from five various topics, we try to classify the stance of the tweet authors as either in FAVOR of the target, AGAINST it, or NONE. We apply eight variants of seven supervised machine learning methods and three filtering methods using the WEKA platform. The macro-average results obtained by our algorithm are significantly better than the state-of-art results re-ported by the best macro-average results achieved in the SemEval 2016 Task 6-A for all the five released datasets. In contrast to the competitors of the SemEval 2016 Task 6-A, who did not use any char skip ngrams but rather used thousands of ngrams and hundreds of word embedding features, our algorithm uses a few tens of features mainly character-based features where most of them are skip char ngram features.

Download article

The process of liquidity provision in financial markets can result in prolonged exposure to illiquid instruments for market makers. In this case, where a proprietary position is not desired, pro-actively targeting the right client who is likely to be interested can be an effective means to offset this position, rather than relying on commensurate interest arising through natural demand. In this paper, we consider the inference of a client profile for the purpose of corporate bond recommendation, based on typical recorded information available to the market maker. Given a historical record of corporate bond transactions and bond meta-data, we use a topic-modelling analogy to develop a probabilistic technique for compiling a curated list of client recommendations for a particular bond that needs to be traded, ranked by probability of interest. We show that a model based on Latent Dirichlet Allocation offers promising performance to deliver relevant recommendations for sales traders.

Download article

Visual animal biometrics is rapidly gaining popularity as it enables a non-invasive and cost-effective approach for wildlife monitoring applications. Widespread usage of camera traps has led to large volumes of collected images, making manual processing of visual content hard to manage. In this work, we develop a framework for automatic detection and recognition of individuals in different patterned species like tigers, zebras and jaguars. Most existing systems primarily rely on manual input for localizing the animal, which does not scale well to large datasets. In order to automate the detection process while retaining robustness to blur, partial occlusion, illumination and pose variations, we use the recently proposed Faster-RCNN object detection framework to efficiently detect animals in images. We further extract features from AlexNet of the animal's flank and train a logistic regression (or Linear SVM) classifier to recognize the individuals. We primarily test and evaluate our framework on a camera trap tiger image dataset that contains images that vary in overall image quality, animal pose, scale and lighting. We also evaluate our recognition system on zebra and jaguar images to show generalization to other patterned species. Our framework gives perfect detection results in camera trapped tiger images and a similar or better individual recognition performance when compared with state-of-the-art recognition techniques.

Download article

With the rapid growth in smartphone usage, more organizations begin to focus on providing better services for mobile users. User identification can help these organizations to identify their customers and then cater services that have been customized for them. Currently, the use of cookies is the most common form to identify users. However, cookies are not easily transportable (e.g., when a user uses a different login account, cookies do not follow the user). This limitation motivates the need to use behavior biometric for user identification. In this paper, we propose DEEPSERVICE, a new technique that can identify mobile users based on user’s keystroke information captured by a special keyboard or web browser. Our evaluation results indicate that DEEPSERVICE is highly accurate in identifying mobile users (over 93% accuracy). The technique is also efficient and only takes less than 1 ms to perform identification.

Download article

Mobile phone metadata is increasingly used for humanitarian purposes in developing countries as traditional data is scarce. Basic demographic information is however often absent from mobile phone datasets, limiting the operational impact of the datasets. For these reasons, there has been a growing interest in predicting demographic information from mobile phone metadata. Previous work focused on creating increasingly advanced features to be modeled with standard machine learning algorithms. We here instead model the raw mobile phone metadata directly using deep learning, exploiting the temporal nature of the patterns in the data. From high-level assumptions we design a data representation and convolutional network architecture for modeling patterns within a week. We then examine three strategies for aggregating patterns across weeks and show that our method reaches state-of-the-art accuracy on both age and gender prediction using only the temporal modality in mobile metadata. We finally validate our method on low activity users and evaluate the modeling assumptions.

Download article

This paper studies supervised learning algorithms for the problem of uncooperative direction finding of a radio emitter using the received signal strength indicator (RSSI) from a rotating and uncharacterized antenna. Radio Direction Finding (RDF) is the task of finding the direction of a radio frequency emitter from which the received signal was transmitted, using a single receiver. We study the accuracy of radio direction finding for the 2.4 GHz WiFi band, and restrict ourselves to applying supervised learning algorithms for RSSI information analysis. We designed and built a hardware prototype for data acquisition using off-the-shelf hardware. During the course of our experiments, we collected more than three million RSSI values. We show that we can reliably predict the bearing of the transmitter with an error bounded by 11 degrees, in both indoor and outdoor environments. We do not explicitly model the multi-path, that inevitably arises in such situations and hence one of the major challenges that we faced in this work is that of automatically compensating for the multi-path and hence the associated noise in the acquired data.

Download article

The future planning, management and prediction of water demand and usage should be preceded by long-term variation analysis for related parameters in order to enhance the process of developing new scenarios whether for surface-water or ground-water resources. This paper aims to provide an appropriate methodology for long-term prediction for the water flow and water level parameters of the Shannon river in Ireland over a 30-year period from 1983 − 2013 through a framework that is composed of three phases: city wide scale analytics, data fusion, and domain knowledge data analytics phase which is the main focus of the paper that employs a machine learning model based on deep convolutional neural networks (DeepCNNs). We test our proposed deep learning model on three different water stations across the Shannon river and show it out-performs four well-known time-series forecasting models. We finally show how the proposed model simulate the predicted water flow and water level from 2013 − 2080. Our proposed solution can be very useful for the water authorities for better planning the future allocation of water resources among competing users such as agriculture, demotic and power stations. In addition, it can be used for capturing abnormalities by setting and comparing thresholds to the predicted water flow and water level.

Download article

Understanding the complex dynamics in the real-world such as in multi-agent behaviors is a challenge in numerous engineering and scientific fields. Spectral analysis using Koopman operators has been attracting attention as a way of obtaining a global modal description of a nonlinear dynamical system, without requiring explicit prior knowledge. However, when applying this to the comparison or classification of complex dynamics, it is necessary to incorporate the Koopman spectra of the dynamics into an appropriate metric. One way of implementing this is to design a kernel that reflects the dynamics via the spectra. In this paper, we introduced Koopman spectral kernels to compare the complex dynamics by generalizing the Binet–Cauchy kernel to nonlinear dynamical systems without specifying an underlying model. We applied this to strategic multiagent sport plays wherein the dynamics can be classified, e.g., by the success or failure of the shot. We mapped the latent dynamic characteristics of multiple attacker–defender distances to the feature space using our kernels and then evaluated the scorability of the play by using the features in different classification models.

Download article

Accurate electricity load forecasting is of crucial importance for power system operation and smart grid energy management. Different factors, such as weather conditions, lagged values, and day types may affect electricity load consumption. We propose to use multiple kernel learning (MKL) for electricity load forecasting, as it provides more flexibilities than traditional kernel methods. Computation time is an important issue for short-term load forecasting, especially for energy scheduling demand. However, conventional MKL methods usually lead to complicated optimization problems. Another practical aspect of this application is that there may be very few data available to train a reliable forecasting model for a new building, while at the same time we may have prior knowledge learned from other buildings. In this paper, we propose a boosting based framework for MKL regression to deal with the aforementioned issues for short-term load forecasting. In particular, we first adopt boosting to learn an ensemble of multiple kernel regressors, and then extend this framework to the context of transfer learning. Experimental results on residential data sets show the effectiveness of the proposed algorithms.

Download article

When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest, publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss, or even worse, reliability degradation of a datacenter. We further propose a two-stage framework—DC-Prophet—based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure, and a F3-score of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.

Download article

The World Bank provides billions of dollars in development finance to countries across the world every year. As many projects are related to the environment, we want to understand the World Bank projects impact to forest cover. However, the global extent of these projects results in substantial heterogeneity in impacts due to geographic, cultural, and other factors. Recent research by Athey and Imbens has illustrated the potential for hybrid machine learning and causal inferential techniques which may be able to capture such heterogeneity. We apply their approach using a geolocated dataset of World Bank projects, and augment this data with satellite-retrieved characteristics of their geo- graphic context (including temperature, precipitation, slope, distance to urban areas, and many others). We use this information in conjunction with causal tree (CT) and causal forest (CF) approaches to contrast ‘control’ and ‘treatment’ geographic locations to estimate the impact of World Bank projects on vegetative cover.

Download article

Worldwide, conservation agencies employ rangers to protect conservation areas from poachers. However, agencies lack the manpower to have rangers effectively patrol these vast areas frequently. While past work has modeled poachers’ behavior so as to aid rangers in planning future patrols, those models’ predictions were not validated by extensive field tests. In this paper, we present a hybrid spatio-temporal model that predicts poaching threat levels and results from a five-month field test of our model in Uganda’s Queen Elizabeth Protected Area (QEPA). To our knowledge, this is the first time that a predictive model has been evaluated through such an extensive field test in this domain. We present two major contributions. First, our hybrid model consists of two components: (i) an ensemble model which can work with the limited data common to this domain and (ii) a spatio-temporal model to boost the ensemble’s predictions when sufficient data are available. When evaluated on real-world historical data from QEPA, our hybrid model achieves significantly better performance than previous approaches with either temporally-aware dynamic Bayesian networks or an ensemble of spatially-aware models. Second, in collaboration with the Wildlife Conservation Society and Uganda Wildlife Authority, we present results from a five-month controlled experiment, where rangers patrolled over 450 sq km across QEPA. We demonstrate that our model successfully predicted (1) where snaring activity would occur and (2) where it would not occur; in areas where we predicted a high rate of snaring activity, rangers found more snares than in areas of lower predicted activity. These findings demonstrate that (1) our model’s predictions are selective, (2) our model’s superior laboratory performance extends to the real world, and (3) these predictive models can aid rangers in focusing their efforts to prevent wildlife poaching and save animals.

Download article

Attribution studies in climate science aim for scientifically ascertaining the influence of climatic variations on natural or anthropogenic factors. Many of those studies adopt the concept of Granger causality to infer statistical cause-effect relationships, while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested.

Download article

Clostridium difficile infection (CDI) is a common hospital acquired infection with a $1B annual price tag that resulted in ~30,000 deaths in 2011. Studies have shown that early detection of CDI significantly improves the prognosis for the individual patient and reduces the overall mortality rates and associated medical costs. In this paper, we present CREST: CDI Risk Estimation, a data-driven framework for early and continuous detection of CDI in hospitalized patients. CREST uses a three-pronged approach for high accuracy risk prediction. First, CREST builds a rich set of highly predictive features from Electronic Health Records. These features include clinical and non-clinical phenotypes, key biomarkers from the patient's laboratory tests, synopsis features processed from time series vital signs, and medical history mined from clinical notes. Given the inherent multimodality of clinical data, CREST bins these features into three sets: time-invariant, time-variant, and temporal synopsis features. CREST then learns classifiers for each set of features, evaluating their relative effectiveness. Lastly, CREST employs a second-order meta learning process to ensemble these classifiers for optimized estimation of the risk scores. We evaluate the CREST framework using publicly available critical care data collected for over 12 years from Beth Israel Deaconess Medical Center, Boston. Our results demonstrate that CREST predicts the probability of a patient acquiring CDI with an AUC of 0.76 five days prior to diagnosis. This value increases to 0.80 and even 0.82 for prediction two days and one day prior to diagnosis, respectively.

Download article

With the rapid growth of e-commerce, a large number of on- line transactions are processed every day. In this paper, we take the initiative to conduct a systematic study of the challenging prediction problems of sales bursts. Here, we propose a novel model to detect bursts, find the bursty features, namely the start time of the burst, the peak value of the burst and the o -burst value, and predict the entire burst shape. Our model analyzes the features of similar sales bursts in the same category, and applies them to generate the prediction. We argue that the framework is capable of capturing the seasonal and categorical features of sales burst. Based on the real data from JD.com, we conduct extensive experiments and discover that the proposed model makes a relative MSE improvement of 71% and 30% over LSTM and ARMA.

Download article

E-commerce websites such as Amazon, Alibaba, Flipkart, and Walmart sell billions of products. Machine learning (ML) algorithms involving products are often used to improve the customer experience and increase revenue, e.g., product similarity, recommendation, and price estimation. The products are required to be represented as features before training an ML algorithm. In this paper, we propose an approach called MRNet-Product2Vec for creating generic embeddings of products within an e-commerce ecosystem. We learn a dense and low-dimensional embedding where a diverse set of signals related to a product are explicitly injected into its representation. We train a Discriminative Multi-task Bidirectional Recurrent Neural Network (RNN), where the input is a product title fed through a Bidirectional RNN and at the output, product labels corresponding to fifteen different tasks are predicted. The task set includes several intrinsic characteristics about a product such as price, weight, size, color, popularity, and material. We evaluate the proposed embedding quantitatively and qualitatively. We demonstrate that they are almost as good as sparse and extremely high-dimensional TF-IDF representation in spite of having less than 3% of the TF-IDF dimension. We also use a multimodal autoencoder for comparing products from different language-regions and show preliminary yet promising qualitative results.

Download article

Successful inventory management in retail entails accurate demand forecasts for many weeks/months ahead. Forecasting models use seasonality: recurring pattern of sales every year, to make this forecast. In e-commerce setting, where the catalog of items is much larger than brick and mortar stores and hence includes a lot of items with short history, it is infeasible to compute seasonality for items individually. It is customary in these cases to use ideas from factor analysis and express seasonality by a few factors/basis vectors computed together for an entire assortment of related items. In this paper, we demonstrate the effectiveness of choosing vectors with disjoint support as basis for seasonality when dealing with a large number of short time-series. We give theoretical results on computation of disjoint support factors that extend the state of the art, and also discuss temporal regularization necessary to make it work on walmart e-commerce dataset. Our experiments demonstrate a marked improvement in forecast accuracy for items with short history.

Download article

Random forests are among the most popular classification and regression methods used in industrial applications. To be effective, the parameters of random forests must be carefully tuned. This is usually done by choosing values that minimize the prediction error on a held out dataset. We argue that error reduction is only one of several metrics that must be considered when optimizing random forest parameters for commercial applications. We propose a novel metric that captures the stability of random forest predictions, which we argue is key for scenarios that require successive predictions. We motivate the need for multi-criteria optimization by showing that in practical applications, simply choosing the parameters that lead to the lowest error can introduce unnecessary costs and produce predictions that are not stable across independent runs. To optimize this multi-criteria trade-off, we present a new framework that efficiently finds a principled balance between these three considerations using Bayesian optimisation. The pitfalls of optimising forest parameters purely for error reduction are demonstrated using two publicly available real world datasets. We show that our framework leads to parameter settings that are markedly different from the values discovered by error reduction metrics alone.

Download article

Transaction frauds impose serious threats onto e-commerce. We present CLUE, a novel deep-learning-based transaction fraud detection system we design and deploy at JD.com, one of the largest e-commerce platforms in China with over 220 million active users. CLUE captures detailed information on users' click actions using neural-network based embedding, and models sequences of such clicks using the recurrent neural network. Furthermore, CLUE provides application-specific design optimizations including imbalanced learning, real-time detection, and incremental model update. Using real production data for over eight months, we show that CLUE achieves over 3x improvement over the existing fraud detection approaches.

Download article

Timely identification of dissatisfied customers would provide corporations and other customer serving enterprises the opportunity to take meaningful interventions. This work describes a fully operational system we have developed at a large US insurance company for predicting customer satisfaction following all incoming phone calls at our call center. To capture call relevant information, we integrate signals from multiple heterogeneous data sources including: speech-to-text transcriptions of calls, call metadata (duration, waiting time, etc.), customer profiles and insurance policy information. Because of its ordinal, subjective, and often highly-skewed nature, self-reported survey scores presents several modeling challenges. To deal with these issues we introduce a novel modeling workflow: First, a ranking model is trained on the customer call data fusion. Then, a convolutional fitting function is optimized to map the ranking scores to actual survey satisfaction scores. This approach produces more accurate predictions than standard regression and classification approaches that directly fit the survey scores with call data, and can be easily generalized to other customer satisfaction prediction problems. Source code and data are available at https://github.com/cyberyu/ecml2017.

Download article

In traditional A/B testing, we have two variants of the same product, a pool of test subjects, and a measure of success. In a randomized experiment, each test subject is presented with one of the two variants, and the measure of success is aggregated per variant. The variant of the product associated with the most success is retained, while the other variant is discarded. This, however, presumes that the company producing the products only has enough capacity to maintain one of the two product variants. If more capacity is available, then advanced data science techniques can extract more profit for the company from the A/B testing results. Exceptional Model Mining is one such advanced data science technique, which specializes in identifying subgroups that behave differently from the overall population. Using the association model class for EMM, we can find subpopulations that prefer variant A where the general population prefers variant B, and vice versa. This data science technique is applied on data from StudyPortals, a global study choice platform that ran an A/B test on the design of aspects of their website.

Download article

The growing availability of data from cities (e.g., traffic flow, human mobility and geographical data) open new opportunities for predicting and thus optimizing human activities. For example, the automatic analysis of land use enables the possibility of better administrating a city in terms of resources and provided services. However, such analysis requires specific information, which is often not available for privacy concerns. In this paper, we propose a novel machine learning representation based on the available public information to classify the most predominant land use of an urban area, which is a very common task in urban computing. In particular, in addition to standard feature vectors, we encode geo-social data from Location-Based Social Networks (LBSNs) into a conceptual tree structure that we call Geo-Tree. Then, we use such representation in kernel machines, which can thus perform accurate classification exploiting hierarchical substructure of concepts as features. Our extensive comparative study on the areas of New York and its boroughs shows that Tree Kernels applied to Geo-Trees are very effective improving the state of the art up to 18% in Macro-F1.

Download article

The rapid growth of Web usage for advertising job positions provides a great opportunity for real-time labour market monitoring. This is the aim of Labour Market Intelligence (LMI), a field that is becoming increasingly relevant to EU Labour Market policies design and evaluation. The analysis of Web job vacancies, indeed, represents a competitive advantage to labour market stakeholders with respect to classical survey-based analyses, as it allows for reducing the time-to-market of the analysis by moving towards a fact-based decision making model. In this paper, we present our approach for automatically classifying million Web job vacancies on a standard taxonomy of occupations. We show how this problem has been expressed in terms of text classification via machine learning. Then, we provide details about the classification pipelines we evaluated and implemented, along with the outcomes of the validation activities. Finally, we discuss how machine learning contributed to the LMI needs of the European Organisation that supported the project.

Download article

Suspect investigation as a critical function of policing determines the truth about how a crime occurred, as far as it can be found. Understanding of the environmental elements in the causes of a crime incidence inevitably improves the suspect investigation process. Crime pattern theory concludes that offenders, rather than venture into unknown territories, frequently commit opportunistic and serial violent crimes by taking advantage of opportunities they encounter in places they are most familiar with as part of their activity space. In this paper, we present a suspect investigation method, called SINAS, which learns the activity space of offenders using an extended version of the random walk method based on crime pattern theory, and then recommends the top-K potential suspects for a committed crime. Our experiments on a large real-world crime dataset show that SINAS outperforms the baseline suspect investigation methods we used for the experimental evaluation.

Download article

Demo Track

Each machine learning task comes equipped with its own set of performance measures. For example, there is a plethora of classification measures that assess predictive performance, a myriad of clustering indices, and equally many rule interestingness measures. Choosing the right measure requires careful thought, as it can influence model selection and thus the performance of the final machine learning system. However, analyzing and understanding measure properties is a difficult task. Here, we present Tetrahedron, a web-based visualization tool that aids the analysis of complete ranges of performance measures based on a two-by-two contingency matrix. The tool operates in a barycentric coordinate system using a 3D tetrahedron, which can be rotated, zoomed, cut, parameterized, and animated. The application is capable of visualizing predefined measures (86 currently), as well as helping prototype new measures by visualizing user-defined formulas.

Download article

Academic search engines (e.g., Google scholar or Microsoft academic) provide a medium for retrieving various information on scholarly documents. However, most of these popular scholarly search engines overlook the area of data set retrieval, which should provide information on relevant data sets used for academic research. Due to the increasing volume of publications, it has become a challenging task to locate suitable data sets on a particular research area for benchmarking or evaluations. We propose Delve, a web-based system for data set retrieval and document analysis. This system is different from other scholarly search engines as it provides a medium for both data set retrieval and real time visual exploration and analysis of data sets and documents.

Download article

We present WHODID: a turnkey intuitive web-based interface for fault detection, identification and diagnosis in production units. Fault detection and identification is an extremely useful feature and is becoming a necessity in modern production units. Moreover, the large deployment of sensors within the stations of a production line has enabled the close monitoring of products being manufactured. In this context, there is a high demand for computer intelligence able to detect and isolate faults inside production lines, and to additionally provide a diagnosis for maintenance on the identified faulty production device, with the purpose of preventing subsequent faults caused by the diagnosed faulty device behavior. We thus introduce a system which has fault detection, isolation, and identification features, for retrospective and on-the-fly monitoring and maintenance of complex dynamical production processes. It provides real-time answers to the questions: "is there a fault?", "where did it happen?", "for what reason?". The method is based on a posteriori analysis of decision sequences in XGBoost tree models, using recurrent neural networks sequential models of tree paths. The particularity of the presented system is that it is robust to missing or faulty sensor measurements, it does not require any modeling of the underlying, possibly exogenous manufacturing process, and provides fault diagnosis along with confidence level in plain English formulations. The latter can be used as maintenance directions by a human operator in charge of production monitoring and control.

Download article

This paper presents a system for tracking and analyzing the evolution and transformation of topics in an information network. The system consists of four main modules for pre-processing, adaptive topic modeling, network creation and temporal network analysis. The core module is built upon an adaptive topic modeling algorithm adopting a sliding time window technique that enables the discovery of groundbreaking ideas as those topics that evolve rapidly in the network.

Download article

Feature selection is an essential step to identify relevant and non-redundant features for target class prediction. In this context, the number of feature combinations grows exponentially with the dimension of the feature space. This hinders the user's understanding of the feature-target relevance and feature-feature redundancy. We propose an interactive Framework for Exploring and Understanding Multivariate Correlations (FEXUM), which embeds these correlations using a force-directed graph. In contrast to existing work, our framework allows the user to explore the correlated feature space and guides in understanding multivariate correlations through interactive visualizations.

Download article

The paper presents a smartphone application for monitoring physical activity and mental stress. The application utilizes sensor data from a wristband and/or a smartphone, which can be worn in various pockets or in a bag in any orientation. The presence and location of the devices are used as contexts for the selection of appropriate machine-learning models for activity recognition and the estimation of human energy expenditure. The stress-monitoring method uses two machine-learning models, the first one relying solely on physiological sensor data and the second one incorporating the output of the activity monitoring and other context information. The evaluation showed that we recognize a wide range of atomic activities with an accuracy of 87%, and that we outperform the state-of-the art consumer devices in the estimation of energy expenditure. In stress monitoring we achieved an accuracy of 92% in a real-life setting.

Download article

We present an explainable recommendation system for novels and authors, called Lit@EVE, which is based on Wikipedia concept vectors. In this system, each novel or author is treated as a concept whose definition is extracted as a concept vector through the application of an explainable word embedding technique called EVE. Each dimension of the concept vector is labelled as either a Wikipedia article or a Wikipedia category name, making the vector representation readily interpretable. In order to recommend items, the Lit@EVE system uses these vectors to compute similarity scores between a target novel or author and all other candidate items. Finally, the system generates an ordered list of suggested items by showing the most informative features as human-readable labels, thereby making the recommendation explainable.

Download article

TF Boosted Trees (TFBT) is a new open-sourced framework for the distributed training of gradient boosted trees. It is based on TensorFlow, and its distinguishing features include a novel architecture, automatic loss differentiation, layer-by-layer boosting that results in smaller ensembles and faster prediction, principled multi-class handling, and a number of regularization techniques to prevent overfitting.

Download article

Often the manual review of large data sets, either for purposes of labeling unlabeled instances or for classifying meaningful results from uninteresting (but statistically significant) ones is extremely resource intensive, especially in terms of subject matter expert (SME) time. Use of active learning has been shown to reduce this review time significantly. However, since active learning is an iterative process of learning a classifier based on a small number of SME-provided labels at each iteration, the lack of an enabling tool can hinder the process of adoption of these technologies in real-life, in spite of their labor-saving potential. In this demo we present ASK-the-Expert, an interactive tool that allows SMEs to review instances from a data set and provide labels within a single framework. ASK-the-Expert is powered by an active learning algorithm for training a classifier in the backend. We demonstrate this system in the context of an aviation safety application, but the tool can be adopted to work as a simple review and labeling tool as well, without the use of active learning.

Download article

Visualizing frequently occurring patterns and potentially unusual behaviors in trajectory can provide valuable insights into activities behind the data. In this paper, we introduce TrajViz, a motif (frequently repeated subsequences) based visualization software that detects patterns and anomalies by inducing ``grammars" from discretized spatial trajectories. We consider patterns as a set of sub-trajectories with unknown lengths that are spatially similar to each other. We demonstrate that TrajViz has the capacity to help users visualize anomalies and patterns effectively.

Download article

Nectar Track

Sequential data can be found in many settings, e.g., as sequences of visited websites or as location sequences of travellers. To improve the understanding of the underlying mechanisms that generate such sequences, the HypTrails approach provides for a novel data analysis method. Based on first-order Markov chain models and Bayesian hypothesis testing, it allows for comparing a set of hypotheses, i.e., beliefs about transitions between states, with respect to their plausibility considering observed data. HypTrails has been successfully employed to study phenomena in the online and the offline world. In this talk, we want to give an introduction to HypTrails and showcase selected real-world applications on urban mobility and reading behavior on Wikipedia.

Download article

Music generation has recently become popular as an application of machine learning. To generate polyphonic music, one must consider both simultaneity (the vertical consistency) and sequentiality (the horizontal consistency). Bayesian networks are suitable to model both simultaneity and sequentiality simultaneously. Here, we present music generation models based on Bayesian networks applied to chord voicing, four-part harmonization, and real-time chord prediction.

Download article

We describe ProTraits, a machine learning pipeline that systematically annotates microbes with phenotypes using a large amount of textual data from scientific literature and other online resources, as well as genome sequencing data. Moreover, by relying on a multi-view non-negative matrix factorization approach, ProTraits pipeline is also able to discover novel phenotypic concepts from unstructured text. We present the main components of the developed pipeline and outline challenges for the application to other fields.

Download article

Finding a parking space is a key problem in urban scenarios, often due to the lack of actual parking availability information for drivers. Modern vehicles, able to identify free parking spaces using standard on-board sensors, have been proven to be effective probes to measure parking availability. Nevertheless, spatio-temporal datasets resulting from probe vehicles pose significant challenges to the machine learning and data mining communities, due to volume, noise, and heterogeneous spatio-temporal coverage. In this paper we summarize some of the approaches we proposed to extract new knowledge from this data, with the final goal to reduce the parking search time. First, we present a spatio-temporal analysis of the suitability of taxi movements for parking crowd-sensing. Second, we describe machine learning approaches to automatically generate maps of parking spots and to predict parking availability. Finally, we discuss some open issues for the ML/KDD community.

Download article

Process-based modeling is an approach to constructing explanatory models of dynamical systems from knowledge and data. The knowledge encodes information about potential processes that explain the relationships between the observed system entities. The resulting process-based models provide both an explanatory overview of the system components and closed-form equations that allow for simulating the system behavior. In this paper, we present three recent improvements of the process-based approach: (i) improving predictive performance of process-based models using ensembles, (ii) extending the scope of process-based models towards handling uncertainty and (iii) addressing the task of automated process-based design.

Download article

We review our recent progress in the development of graph kernels. We discuss the hash graph kernel framework, which makes the computation of kernels for graphs with vertices and edges annotated with real-valued information feasible for large data sets. Moreover, we summarize our general investigation of the benefits of explicit graph feature maps in comparison to using the kernel trick. Our experimental studies on real-world data sets suggest that explicit feature maps often provide sufficient classification accuracy while being computed more efficiently. Finally, we describe how to construct valid kernels from optimal assignments to obtain new expressive graph kernels. These make use of the kernel trick to establish one-to-one correspondences. We conclude by a discussion of our results and their implication for the future development of graph kernels.

Download article

Interaction networks consist of a static graph with a timestamped list of edges over which interaction took place. Examples of interaction networks are social networks whose users interact with each other through messages or location-based social networks where people interact by checking in to locations. Previous work on finding influential nodes in such networks mainly concentrate on the static structure imposed by the interactions or are based on fixed models for which parameters are learned using the interactions. In two recent works, however, we proposed an alternative activity data-driven approach based on the identification of influence propagation patterns. In the first work, we identify so-called information-channels to model potential pathways for information spread, while the second work exploits how users in a location-based social network check-in to locations in order to identify influential locations. To make our algorithms scalable, approximate versions based on sketching techniques from the data streams domain have been developed. Experiments show that in this way it is possible to efficiently find good seed sets for influence propagation in social networks.

Download article

Machine-learnt models based on additive ensembles of binary regression trees are currently deemed the best solution to address complex classification, regression, and ranking tasks. Evaluating these models is a computationally demanding task as it needs to traverse thousands of trees with hundreds of nodes each. The cost of traversing such large forests of trees significantly impacts their application to big and stream input data, when the time budget available for each prediction is limited to guarantee a given processing throughput. Document ranking in Web search is a typical example of this challenging scenario, where the exploitation of tree-based models to score query-document pairs, and finally rank lists of documents for each incoming query, is the state-of-art method for ranking (a.k.a. Learning-to-Rank). This paper presents QuickScorer, a novel algorithm for the traversal of huge decision trees ensembles that, thanks to a cache- and CPU-aware design, provides a ~9x speedup over best competitors.

Download article

Data Cleaning represents a crucial and error prone activity in KDD that might have unpredictable effects on data analytics, affecting the believability of the whole KDD process. In this paper we describe how a bridge between AI Planning and Data Quality communities has been made, by expressing both the data quality and cleaning tasks in terms of AI planning. We also report a real-life application of our approach

Download article

In this work, we summarize our work on using the predictive clustering framework for image analysis. More specifically, we used predictive clustering trees to generate image representations, that can then be used to perform image retrieval and/or image annotation. We evaluated the proposed method for performing image retrieval on general purpose images, and annotation of general purpose images, medical images and diatom images.

Download article

Monday, September 18th, 2017

Abstract:
Sports Analytics has been a steadily growing and rapidly evolving area over the last decade, both in US professional sports leagues and in European football leagues. The majority of techniques used in the field so far are statistical. However, there has been growing interest in the Machine Learning and Data Mining community about this topic as this setting is interesting, challenging and offers new sources of data. The workshop concerns all aspects of applying machine learning and data mining techniques for sports problems such as match strategy, tactics, and analysis; player acquisition, player valuation, and team spending; injury prediction and prevention; match outcome and league table prediction; and tournament design and scheduling among others.

Organizers:
Jesse Davis, KU Leuven, Belgium
Mehdi Kaytoue, INSA Lyon, France
Albrecht Zimmermann, University of Caen, France

Workshop web page

Abstract:
In the era of Big Data, every single user of our hyper-connected world leaves behind a myriad of digital breadcrumbs while performing her daily activities. In this context personal data analytics and individual privacy protection are the key elements to leverage nowadays services to a new type of systems. The availability of personal analytics tools able to extract hidden knowledge from individual data while protecting the privacy right can help the society to move from organization-centric systems to user-centric systems, where the user is the owner of her personal data and is able to manage, understand, exploit, control and share her own data and the knowledge deliverable from them in a completely safe way.

Organizers:
Serge Abiteboul, Inria, ENS Paris, France
Riccardo Guidotti, KDDLab, ISTI-CNR Pisa, Italy
Anna Monreale, University of Pisa, Italy
Dino Pedreschi, University of Pisa, Italy

Workshop web page

Abstract:
This workshops aims to attract papers presenting applications of Data Science to Social Good, or else that take into account social aspects of Data Science methods and techniques. Application domains should be as varied as possible. The novelty of the application and its social impact will be major selection criteria.

Organizers:
Ricard Gavaldà, UPC BarcelonaTech, Spain
Irena Koprinska, University of Sidney, Australia
Stefan Kramer, JGU Mainz, Germany

Workshop web page

Abstract:
Reinforcement Learning (RL) has achieved many successes over the years in training autonomous agents to perform simple tasks. However, one of the major remaining challenges in RL is scaling it to high-dimensional, real-world applications.

Although many works have already focused on strategies to scale-up RL techniques and to find solutions for more complex problems with reasonable successes, many issues still exist. This workshop encourages to discuss diverse approaches to accelerate and generalize RL, such as the use of approximations, abstractions, hierarchical approaches, and Transfer Learning.

Scaling-up RL methods has major implications on the research and practice of complex learning problems and will eventually lead to successful implementations in real-world applications.

This workshop intends to bridge the gap between conventional and scalable RL approaches. We aim to bring together resarchers working on different approaches to scale-up RL with the goal to solve more complex or larger scale problems. We intend to make this an exciting event for researchers worldwide, not only for the presentation of top quality papers, but also to spark the discussion of opportunities and challenges for future research directions.

Organizers:
Felipe Leno da Silva, University of São Paulo, Brazil
Ruben Glatt, University of São Paulo, Brazil

Workshop web page

Abstract:
Like the famous King Midas, popularly remembered in Greek mythology for his ability to turn everything he touched with his hand into gold, we believe that the wealth of data generated by modern technologies, with widespread presence of computers, users and media connected by Internet, is a goldmine for tackling a variety of problems in the financial domain.

The MIDAS workshop is aimed at discussing challenges, potentialities, and applications of leveraging data-mining tasks to tackle problems in the financial domain. The workshop provides a premier forum for sharing findings, knowledge, insights, experience and lessons learned from mining data generated in various application domains.

Organizers:
Ilaria Bordino, UniCredit, R& D Dept., Italy
Guido Caldarelli, IMT Institute for Advanced Studies Lucca, Italy
Fabio Fumarola, UniCredit, R& D Dept., Italy Francesco Gullo, UniCredit, R& D Dept., Italy
Tiziano Squartini, IMT Institute for Advanced Studies Lucca, Italy

Workshop web page

Abstract:
Network science, network analysis, and network mining are new scientific topics that emerged in recent years and are growing quickly. Instead of studying the properties of entities, network science focus on the interaction between these entities. The tremendous quantity of relational data that become available (Online Social Networks, cell phones, the Internet and the Web, trip datasets, etc.) encourage new research on the topic.

In the last years, we witnessed a shift from static network analysis to dynamic ones, i.e., the study of networks whose structure changes over time. As time goes by, all the perturbations which occur in the network topology due to the rise and fall of nodes and edges have repercussions on the network phenomena we are used to observing. As an example, evolution over time of social interactions in a network can play an important role in the diffusion of an infectious disease.

Nowadays, one of the most fascinating challenges is to analyze the structural dynamics of real world networks and how they impact on the processes which occur on them, i.e. the spreading of social influence and diffusion of innovations. Results in this field will enable a better understanding of important aspects of human behaviors as well as to a more detailed characterization of the complex interconnected society we inhabit. Since the last decades, diffusive and spreading phenomena were facilitated by the enormous popularity of the Internet and the evolution of social media that enable an unprecedented exchange of information. For this reason, understanding how social relationships unravel in these rapidly evolving contexts represents one of the most interesting fields of research. The purpose of the third edition of this workshop is to encourage research that will lead to the advancement of the social science in time-evolving networks.

Organizers:
Giulio Rossetti, KDD Laboratory, ISTI-CNR Pisa, Italy
Rémy Cazabet, LIP6, CNRS, Sorbonne Universités, France
Letizia Milli, Computer Science Department - University of Pisa, Italy

Workshop web page

Abstract:
The aim of this workshop called Large-Scale Time Dependent Graphs (TD-LSG) is to bring together active scholars and practitioners of dynamic graphs. Graph models and algorithms are ubiquitous of a large number of application domains, ranging from transportation to social networks, semantic web, or data mining. However, many applications require graph models that are time dependent. For example, applications related to urban mobility analysis employ a graph structure of the underlying road network. Indeed, the nature of such networks are spatiotemporal. Therefore, the time a moving object takes to cross a path segment typically depends on the starting instant of time. So, we call time-dependent graphs, the graphs that have this spatiotemporal feature.

In this workshop, we aim to discuss the problem of mining large-scale time-dependent graphs, since there are many real world applications deal with a large volumes of spatio-temporal data (e.g. moving objects’ trajectories). Managing and analysing large-scale time-dependent graphs is very challenging since this requires sophisticated methods and techniques for creating, storing, accessing and processing such graphs in a distributed environment, because centralized approaches do not scale in a Big Data scenario. Contributions will clearly point out answers to one of these challenges focusing on large-scale graphs.

Organizers:
Sabeur Aridhi, University of Lorraine, France
José Fernandes de Macedo, Universidade Federale do Ceara, Fortaleza, Brazil
Engelbert Mephu Nguifo, LIMOS, Blaise Pascal University, France
Karine Zeitouni, DAVID, Université de Versailles Saint-Quentin, France

Workshop web page

Combined Workshops with Tutorials

Abstract:
The volume of data is rapidly increasing due to the development of the technology of information and communication. This data comes mostly in the form of streams. Learning from this ever-growing amount of data requires flexible learning models that self-adapt over time. In addition, these models must take into account many constraints: (pseudo) real-time processing, high-velocity, and dynamic multi-form change such as concept drift and novelty. This workshop welcomes novel research about learning from data streams in evolving environments. It will provide the researchers and participants with a forum for exchanging ideas, presenting recent advances and discussing challenges related to data streams processing. It solicits original work, already completed or in progress. Position papers are also considered. This workshop is combined with a tutorial treating the same topic and will be presented in the same day.

Organizers:
Moamar Sayed-Mouchaweh, Computer Science and Automatic Control Labs, High Engineering School of Mines, Douai
Albert Bifet, Telecom-ParisTech; Paris, France
Hamid Bouchachia, Department of Computing & Informatics, University of Bournemouth, Bournemouth, UK
João Gama, Laboratory of Artificial Intelligence and Decision Support, University of Porto, Porto, Portugal
Rita Ribeiro, Laboratory of Artificial Intelligence and Decision Support, University of Porto, Porto, Portugal

Workshop and tutorial web page

Abstract:
This workshop on interactive adaptive learning aims at discussing techniques and approaches for optimising the whole learning process, including the interaction with human supervisors, processing systems, and includes adaptive, active, semi-supervised, and transfer learning techniques, and combinations thereof in interactive and adaptive machine learning systems.

Our objective is to bridge the communities researching and developing these techniques and systems in machine learning and data mining. Therefore we welcome contributions that present a novel problem setting, propose a novel approach, or report experience with the practical deployment of such a system and raise unsolved questions to the research community.

Organizers:
Georg Krempl, University Magdeburg, Germany
Vincent Lemaire, Orange Labs, France
Robi Polikar, Rowan University, USA
Bernhard Sick, University of Kassel, Germany
Daniel Kottke, University of Kassel, Germany
Adrian Calma, University of Kassel, Germany

Workshop and tutorial web page

Friday, September 22nd, 2017

Abstract:
Modern automatic systems are able to collect huge volumes of data, often with a complex structure (e.g. multi-table data, XML data, web data, time series and sequences, graphs and trees). The massive and complex data pose new challenges for current research in Knowledge Discovery and Data Mining. They require new methods for storing, managing and analysing them by taking into account various complexity aspects: Complex structures (e.g. multi-relational, time series and sequences, networks, and trees) as input/output of the data mining process; Massive amounts of high dimensional data collections flooding as high-speed streams and requiring (near) real time processing and model adaptation to concept drifts; New application scenarios involving security issues, interaction with other entities and real-time response to events triggered by sensors.

The purpose of the workshop is to bring together researchers and practitioners of data mining and machine learning interested in analysis of complex data, in order to promote and publish research in the field of complex pattern mining.

Organizers:
Annalisa Appice, University of Bari Aldo Moro, Bari, Italy
Corrado Loglisci, University of Bari Aldo Moro, Bari, Italy
Giuseppe Manco, ICAR-CNR, Rende, Italy
Elio Masciari, ICAR-CNR, Rende, Italy
Zbigniew W. Ras, Department of Computer Science, University of North Carolina, Charlotte, USA

Workshop web page

Abstract:
Many real-world data-mining applications involve obtaining and evaluating predictive models using data sets with strongly imbalanced distributions of the target variable. Frequently, the least-common values are associated with events that are highly relevant for end users. This problem has been thoroughly studied in the last decade with a specific focus on classification tasks. However, the research community has started to address this problem within other contexts. It is now recognized that imbalanced domains are a broader and important problem posing relevant challenges for both supervised and unsupervised learning tasks, in an increasing number of real world applications. This workshop invites inter-disciplinary contributions to tackle the problems that many real-world domains face today. With the growing attention that this problem has collected, it is crucial to promote its development and to tackle its theoretical and application challenges.

Organizers:
Luís Torgo, University of Porto; LIAAD - INESC Tec, Portugal
Bartosz Krawczyk, Virginia Commonwealth University; Department of Computer Science,USA
Paula Branco, University of Porto; LIAAD - INESC Tec, Portugal
Nuno Moniz, University of Porto; LIAAD - INESC Tec, Portugal

Workshop web page

Abstract:
Deep Learning is beginning to exert a disruptive impact for functional genomics, with applications of high industrial and ethical relevance in pharmacogenomics and toxicogenomics. Moreover, in less than one year, Deep Learning has emerged with solutions in diagnostic imaging and pathology that have reached best human expertise, as for the success in games. Examples in miRNA prediction already demonstrated the potential for deriving implicit features with high predictive accuracy, and novel methods for genomewide association studies and prediction of molecular traits following suite are appearing both as scientific initiatives as well as key technologies of AI startups. Following the success of DLPM2016, also colocated with ECML/PKDD in 2016, we thus wish to discuss about the best options for the adoption of deep learning models, both for improved accuracy as well as for better biomedical understanding. Questions such as end-to-end modeling from structure to functionality and biological impact as well as architectures for integration of genotype, expression and epigenetics would be of immediate interest for the workshop. Further topics of interest are described in the call.

This event is intended to be a one-day workshop within ECML-PKDD 2017 including lectures and discussions on this very timely, pressing and active subject. We aim to create a connection between machine learning experts and leaders in the Precision Medicine initiatives in Europe and the USA. In particular, the workshop aims to link experts from the FDA SEQC2 initiative on Precision Medicine, which will pave the way for defining optimal procedures for the development of actionable drugs that can target phenotype-selected patient groups. We wish to discuss also technical challenges such as working with very large cohorts (e.g. from 60K to 300K to 1M subjects in molecular psychiatry studies) that are now amenable for modeling with deep learning. Further, family cohorts will challenge machine learning and bioinformatics experts for new efficient solutions. In summary, both methodological aspects from deep learning, machine learning, information technology, statistics as well as actual applications, pitfalls and (medical) needs are to be featured.

Organizers:
Bertram Müller-Myhsok, University of Liverpool, Liverpool, UK and Max Planck Institute of Psychiatry, Munich, Germany
Cesare Furlanello, Fondazione Bruno Kessler - FBK Trento, Italy

Contact: dlpm2017@fbk.eu

Workshop web page

Abstract:
DMNLP’17 will be the fourth edition of the Data Mining and Natural Language Processing (DMNLP) workshop. The workshop will favor the use of symbolic methods. Indeed, statistical and machine learning methods (CRF, SVM, Naive Bayes) holds a predominant position in NLP researches and ”may have been too successful (...) as there is no longer much room for anything else”. They have proved their effectiveness for some tasks but one major drawback is that they do not provide human readable models. By contrast, symbolic machine learning methods are known to provide more human-readable model that could be an end in itself (e.g., for stylistics) or improve, by combination, further methods including numerical ones. Research in Data Mining has progressed significantly in the last decades, through the development of advanced algorithms and techniques to extract knowledge from data in different forms. In particular, for two decades Pattern Mining has been one of the most active field in Knowledge Discovery.

Recently, a new field has emerged taking benefit of both domains: Data Mining and NLP. The objective of DMNLP is thus to provide a forum to discuss how Data Mining can be interesting for NLP tasks, providing symbolic knowledge, but also how NLP can enhance data mining approaches by providing richer and/or more complex information to mine and by integrating linguistic knowledge directly in the mining process. The workshop aims at bringing together researchers from both communities in order to stimulate discussions about the cross-fertilization of those two research fields. The idea of this workshop is to discuss future directions and new challenges emerging from this cross-fertilization of Data Mining and NLP and in the same time to initiate collaborations between researchers of both communities.

Organizers:
Peggy Cellierm, INSA Rennes, IRISA (UMR 6074), Rennes, France
Thierry Charnois, Université de Paris 13, LIPN (UMR 7030), France
Andreas Hotho, University of Kassel, Germany
Marie-Francine Moens: Katholieke Universiteit, Leuven, Belgium
Stan Matwin, Dalhousie University, Canada
Yannick Toussaint, INRIA, LORIA (UMR 7503), 54506 Vandoeuvre-les-Nancy, France

Workshop web page

Abstract:
Security and privacy aspects of data analytics become of central importance in many application areas. New legislation also pushes companies and research communities to address challenges of privacy-preserving data analytics. In our data mining community, questions about data privacy and security have been predominantly approached from the perspective of k-anonymity and differential privacy.

The aim of this workshop is to draw attention to secure multiparty computation (MPC), a subfield of cryptology, as the key foundation for building privacy-preserving data mining (DM) and machine learning (ML). In this approach, sensitive data is typically secret-shared over multiple players, such that those players can jointly perform DM / ML computations, but individual players (or collusions) cannot learn anything about the data, beyond the result of the computations.

Organizers:
Mykola Pechenizkiy, Eindhoven University of Technology, Netherlands
Stefan Kramer, University of Mainz, Germany
Niek J. Bouman, Eindhoven University of Technology, Netherlands

Workshop web page

Abstract:
The recent technological advances on telecommunications create a new reality on mobility sensing. Nowadays, we live in an era where ubiquitous digital devices are able to broadcast rich information about human mobility in real-time and at high rate. Such fact exponentially increased the availability of large-scale mobility data which has been popularized in the media as the new currency, fueling the future vision of our smart cities that will transform our lives. The reality is that we just began to recognize significant research challenges across a spectrum of topics. Consequently, there is an increasing interest among different research communities (ranging from civil engineering to computer science) and industrial stakeholders on build knowledge discovery pipelines over such data sources. However, such availability also raise privacy issues that must be considered by both industrial and academic stakeholders on using these resources.

This workshop intends to be a top-quality venue to bring together transdisciplinary researchers and practitioners working in topics from multiple areas such as Data Mining, Machine Learning, Numerical Optimization, Public Transport, Traffic Engineering, Multi-Agent Systems, Human-Computer Interaction and Telecommunications, among others. The ultimate goal of this venue is to evaluate not only the theoretical contribution of the data driven methodology proposed in each research work, but also its potential deployment/impact as well as its advances with respect to the State-of-the-Art/State-of-the-Practice in the domains of the related applications.

Organizers:
Luis Moreira-Matias – NEC Laboratories Europe, Germany
Roberto Trasarti, KDD Lab ISTI-CNR, Pisa, Italy
Rahul Nair, IBM Research Ireland

Workshop web page

Abstract:
Climate change, the depletion of natural resources and rising energy costs have led to an increasing focus on renewable sources of energy. A lot of research has been devoted to the technologies used to extract energy from these sources; however, equally important is the storage and distribution of this energy in a way that is efficient and cost effective. Achieving this would generally require integration with existing energy infrastructure.

The challenge of renewable energy integration is inherently multidisciplinary and is particularly dependant on the use of techniques from the domains of data analytics, pattern recognition and machine learning. Examples of relevant research topics include the forecasting of electricity supply and demand, the detection of faults, demand response applications and many others. This workshop will provides a forum where interested researchers from the various related domains will be able to present and discuss their findings.

Organizers:
Wei Lee Woon, Masdar Institute, United Arab Emirates
Zeyar Aung, Masdar Institute, United Arab Emirates
Oliver Kramer, University of Oldenburg, Germany
Stuart Madnick, Massachusetts Institute of Technology, USA

Workshop web page

Combined Workshops with Tutorials

Workshop abstract:
This workshop will provide a platform for discussing the recent developments in the area of algorithm selection and configuration, which arises in many diverse domains, such as machine learning, data mining, optimization and automated reasoning. Algorithm selection and configuration are increasingly relevant today. Researchers and practitioners from all areas of science and technology face a large choice of parameterized machine learning algorithms, with little guidance as to which techniques to use in a given application context. Moreover, data mining challenges frequently remind us that algorithm selection and configuration are crucial in order to achieve cutting-edge performance, and drive industrial applications.

Meta-learning leverages knowledge of past algorithm applications to select the best techniques for future applications, and offers effective techniques that are superior to humans both in terms of the end result and especially in the time required to achieve it. In this workshop, we will discuss different ways of exploiting meta-learning techniques to identify the potentially best algorithm(s) for a new task, based on meta-level information, prior experiments on both past datasets and the current one. Many contemporary problems require the use of workflows that consist of several processes or operations. Constructing such complex workflows requires extensive expertise, and could be greatly facilitated by leveraging planning, meta-learning and intelligent system design. This task is inherently interdisciplinary, as it builds on expertise in various areas of AI.

Workshop web page

Tutorial abstract:
This tutorial will introduce and discuss state of the art methods in meta-learning, algorithm selection, and algorithm configuration, which are increasingly relevant today. Researchers and practitioners from all areas of science and technology face a large choice of parameterized machine learning algorithms, with little guidance as to when and how to use which technique. Data mining challenges frequently remind us that algorithm selection and configuration play a crucial role in achieving cutting-edge performance, and are indispensible in industrial applications.

Meta-learning leverages knowledge of past applications of algorithms applications to learn how to select the best techniques for future applications, and offers effective techniques that are superior to humans both in terms of the quality of the end result and even more so in the time required to achieve it. Recent approaches include also (preferably very fast) partial probing runs on a given problem with the aim of determining the best strategy to be used from there onwards. This may include further probing or recommending an algorithm to be used to solve the given problem. A recent trend is to incorporate such techniques into software platforms. This synergy leads to new advances that recommend combinations of algorithms and hyperparameter settings simultaneously, and that speed up algorithm configuration by learning which parameter settings are likely most useful for dealing with the data at hand.

After motivating and introducing the concepts of algorithm selection and configuration, we elucidate how they arise in machine learning and data mining, but also in other domains, such as optimization. We demonstrate how meta-learning techniques can be effectively used in this context, exploiting information gleaned from past experiments as well as by probing the data at hand. Moreover, many current applications require the use of machine learning or data mining workflows that consist of several different processes or operations. Constructing such complex systems or workflows requires extensive expertise, as well as existing meta-data and software, and can be greatly facilitated by leveraging the methodologies presented at this tutorial.

Tutorial web page

Organizers:
Pavel Brazdil, LIAAD Inesc Tec., Portugal
Joaquin Vanschoren, Eindhoven University of Technology, Netherlands
Holger H. Hoos, Universiteit Leiden, Netherlands
Frank Hutter, University of Freiburg, Germany

Monday, September 18th, 2017

Abstract:
Graph mining is an important research area with a plethora of practical applications. Core decomposition of networks is a fundamental operation strongly related to more complex mining tasks such as community detection, dense subgraph discovery, identification of influential nodes, network visualization, text mining, just to name a few. In this tutorial, we will present in detail the concept and properties of core decomposition in graphs, the associated algorithms for its efficient computation and important cross-disciplinary applications that benefit from it.

Organizers:
Fragkiskos D. Malliaros, UC San Diego La Jolla, USA
Apostolos N. Papadopoulos, Aristotle University of Thessaloniki, Thessaloniki, Greece
Michalis Vazirgiannis, Ecole Polytechnique Palaiseau, France

Tutorial web page

Combined Workshops with Tutorials

Abstract:
The volume of data is rapidly increasing due to the development of the technology of information and communication. This data comes mostly in the form of streams. Learning from this ever-growing amount of data requires flexible learning models that self-adapt over time. In addition, these models must take into account many constraints: (pseudo) real-time processing, high-velocity, and dynamic multi-form change such as concept drift and novelty. This workshop welcomes novel research about learning from data streams in evolving environments. It will provide the researchers and participants with a forum for exchanging ideas, presenting recent advances and discussing challenges related to data streams processing. It solicits original work, already completed or in progress. Position papers are also considered. This workshop is combined with a tutorial treating the same topic and will be presented in the same day.

Organizers:
Moamar Sayed-Mouchaweh, Computer Science and Automatic Control Labs, High Engineering School of Mines, Douai
Albert Bifet, Telecom-ParisTech; Paris, France
Hamid Bouchachia, Department of Computing & Informatics, University of Bournemouth, Bournemouth, UK
João Gama, Laboratory of Artificial Intelligence and Decision Support, University of Porto, Porto, Portugal
Rita Ribeiro, Laboratory of Artificial Intelligence and Decision Support, University of Porto, Porto, Portugal

Workshop and tutorial web page

Abstract:
This workshop on interactive adaptive learning aims at discussing techniques and approaches for optimising the whole learning process, including the interaction with human supervisors, processing systems, and includes adaptive, active, semi-supervised, and transfer learning techniques, and combinations thereof in interactive and adaptive machine learning systems.

Our objective is to bridge the communities researching and developing these techniques and systems in machine learning and data mining. Therefore we welcome contributions that present a novel problem setting, propose a novel approach, or report experience with the practical deployment of such a system and raise unsolved questions to the research community.

Organizers:
Georg Krempl, University Magdeburg, Germany
Vincent Lemaire, Orange Labs, France
Robi Polikar, Rowan University, USA
Bernhard Sick, University of Kassel, Germany
Daniel Kottke, University of Kassel, Germany
Adrian Calma, University of Kassel, Germany

Workshop and tutorial web page

Friday, September 22nd, 2017

Abstract:
Global fossil databases have been growing rapidly in the last decade. They aggregate and accumulate findings and knowledge that palaeobiologists acquired over many years. These datasets are big data in their essence - compiled from different sources, to an extent subjective, include specific biases and uncertainties, data sparseness and quality varies over time and space. In addition, to understand relations between organisms and climate high volume and large velocity satellite observations some into play that require scalability in computing. Databases of this kind offer an excellent ground for interdisciplinary machine learning research. This tutorial will outline research questions that could be addressed using computational methods, discuss characteristics of fossil data and computational tasks for machine learning and data mining, overview existing computational approaches, and discuss what more could be done from the machine learning and data mining perspective.

Organization:
Indrė Žliobaitė, University of Helsinki, Finnland

Tutorial web page

Abstract:
Deep Learning methods have become ubiquitous for computer vision tasks. This tutorial will focus on recent advances in deep learning for vision applications in robotics and autonomous vehicles. The tutorial will start with basic Deep Learning techniques and will highlight state-of-the-art methods in the three major topics in computer vision: classification, detection and segmentation. Then the tutorial will continue with more concrete methods and their applications, e.g. in scene understanding, 3D analysis, perception for robotics and autonomous driving. The goal of the tutorial is to focus on relevant techniques, which are of significant impact to real-world applications, and which will benefit the broader Machine Learning community.

Organization:
Anelia Angelova, Google Research / Google Brain
Sanja Fidler, University of Toronto

Tutorial web page

Combined Workshops with Tutorials

Workshop abstract:
This workshop will provide a platform for discussing the recent developments in the area of algorithm selection and configuration, which arises in many diverse domains, such as machine learning, data mining, optimization and automated reasoning. Algorithm selection and configuration are increasingly relevant today. Researchers and practitioners from all areas of science and technology face a large choice of parameterized machine learning algorithms, with little guidance as to which techniques to use in a given application context. Moreover, data mining challenges frequently remind us that algorithm selection and configuration are crucial in order to achieve cutting-edge performance, and drive industrial applications.

Meta-learning leverages knowledge of past algorithm applications to select the best techniques for future applications, and offers effective techniques that are superior to humans both in terms of the end result and especially in the time required to achieve it. In this workshop, we will discuss different ways of exploiting meta-learning techniques to identify the potentially best algorithm(s) for a new task, based on meta-level information, prior experiments on both past datasets and the current one. Many contemporary problems require the use of workflows that consist of several processes or operations. Constructing such complex workflows requires extensive expertise, and could be greatly facilitated by leveraging planning, meta-learning and intelligent system design. This task is inherently interdisciplinary, as it builds on expertise in various areas of AI.

Workshop web page

Tutorial abstract:
This tutorial will introduce and discuss state of the art methods in meta-learning, algorithm selection, and algorithm configuration, which are increasingly relevant today. Researchers and practitioners from all areas of science and technology face a large choice of parameterized machine learning algorithms, with little guidance as to when and how to use which technique. Data mining challenges frequently remind us that algorithm selection and configuration play a crucial role in achieving cutting-edge performance, and are indispensible in industrial applications.

Meta-learning leverages knowledge of past applications of algorithms applications to learn how to select the best techniques for future applications, and offers effective techniques that are superior to humans both in terms of the quality of the end result and even more so in the time required to achieve it. Recent approaches include also (preferably very fast) partial probing runs on a given problem with the aim of determining the best strategy to be used from there onwards. This may include further probing or recommending an algorithm to be used to solve the given problem. A recent trend is to incorporate such techniques into software platforms. This synergy leads to new advances that recommend combinations of algorithms and hyperparameter settings simultaneously, and that speed up algorithm configuration by learning which parameter settings are likely most useful for dealing with the data at hand.

After motivating and introducing the concepts of algorithm selection and configuration, we elucidate how they arise in machine learning and data mining, but also in other domains, such as optimization. We demonstrate how meta-learning techniques can be effectively used in this context, exploiting information gleaned from past experiments as well as by probing the data at hand. Moreover, many current applications require the use of machine learning or data mining workflows that consist of several different processes or operations. Constructing such complex systems or workflows requires extensive expertise, as well as existing meta-data and software, and can be greatly facilitated by leveraging the methodologies presented at this tutorial.

Tutorial web page

Organizers:
Pavel Brazdil, LIAAD Inesc Tec., Portugal
Joaquin Vanschoren, Eindhoven University of Technology, Netherlands
Holger H. Hoos, Universiteit Leiden, Netherlands
Frank Hutter, University of Freiburg, Germany

Johannes Fürnkranz

TU Darmstadt, Germany

50 Ways to Tweak your Paper

Very often, we see papers that have good research ideas be rejected because the quality of the write-up does not quite live up to the quality of the idea. Whether you like it or not, the presentation of your work can often make the difference that tilts a borderline paper one way or the other. In particular for conference-style reviewing, where reviewers have to make recommendations for multiple papers in a very narrow time frame, they are often influenced by the writing and the appearance of the paper (whether they like it or not). In this talk, targeted towards junior Ph. D. students, we will make a few suggestions how to improve the presentation of your work. Most of them are obvious, but we nevertheless often see them violated in practice.


Johannes Fürnkranz is a full Professor for Knowledge Engineering at TU Darmstadt, Germany. His main research interests are machine learning and data mining, in particular inductive rule learning, learning of intrpretable models, multi-label classification and preference learning, and their applications in game playing, web mining, and scientific data mining. Since 2015, he serves as the editor-in-chief of Data Mining and Knowledge Discovery, the most traditional and renown journal in this area. He is also a long-time action editor for Machine Learning, and current or past editorial board member of several other well-known journals, and a regular PC or senior PC member of premier conferences in the areas of machine learning, data mining, information retrieval, and artificial intelligence. He was nominated “best reviewer” at two Machine Learning conferences, “outstanding PC member” at the AAAI conference, and “outstanding editor” of the Machine Learning journal. He served as the program co-chair of the 6th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Berlin 2006), the 27th International Conference on Machine Learning (Haifa 2010), the 16th International Conference on Discovery Science (Singapore 2013), and the 40th German Conference on Artificial Intelligence (Dortmund 2017, September 25-29th, 2017)

Tias Guns

Vrije Universiteit Brussel (VUB), Belgium

Tips for a successful PhD, and how to win an award with it

The talk will dive into some of the every day challenges a PhD student faces: working with your promoter, writing a paper, conference visits, how to get your work more widely known and up to winning an award with your thesis. The talk will have anecdotes based on my or colleagues' experiences and indispensable references to PhD comics, XKCD and other highly valuable sources.


Tias Guns is Assistant Professor at the Vrije Universiteit Brussel (VUB), Belgium, in the Business, Technology and Operations lab of the faculty of Economic and Social Sciences & Solvay Business School. His research lies on the border between data mining and constraint programming, and his main interest is in integrating domain expertise and user constraints into data analytics tasks. As part of his PhD, he has developed the CP4IM framework which showed for the first time the potential of using constraint programming for pattern mining. He is an active member of the community and has organized a number of workshops and a special issue on the topic of combining constraint programming with machine learning and data mining. His PhD was awarded with both the constraint programming dissertation award and the ECCAI artificial intelligence dissertation award.

Discovery Challenges

TiSeLaC : Time Series Land Cover Classification Challenge

TBA

Multi-Plant Photovoltaic Energy Forecasting Challenge

TBA

Mars Express Power Challenge

TBA

Salvatore Spinello

Research Programme Officer at REA

FET Open: main features and evaluation process

FET Open is one of the most attractive research programme under Horizon 2020. FET-Open supports the early-stage, high-risk research around new ideas towards radically new future technologies. It explores an open range of new and disruptive technological possibilities in all areas of Science & Technology, inspired by cutting edge science, unconventional collaborations and pioneering new ways to create the optimum conditions for serendipity to occur.

In these first 3 years of Horizon 2020, a total of 2.648 proposals were submitted to the FET-Open programme and covered a wide range of disciplines: from Physics to Life Sciences, from Information Sciences and Engineering to Chemistry. Most proposals show indeed high degree of interdisciplinarity.

During my presentation I will focus on Research and Innovation-Actions (RIA) and the so-called "gatekeepers" that every excellent proposal should address. I will then present the evaluation process that allows the selection of the best proposals, resulting in a continuously growing portfolio of high quality interdisciplinary projects. I will conclude providing some statistics in terms of country and organization participations, scientific fields covered and interdisciplinarity.


Salvatore Spinello started his Ph.D. in Computer Science in 1997 at the University of Catania (Italy). He moved to Germany in 1999 where I finished his Ph.D. in collaboration with the University of Erlangen-Nuremberg. He then moved to London for his first Post-Doc (UCL) and after one year to Bordeaux (France) for his second Post-Doc (Inria). In 2004 he joined a small company, the French leader in the distribution of Virtual Reality's products. He held the position of Director of the R&D Department managing two European Projects.

In 2006 he joined Inria, the French Institute for Research in Computer Science and Control. He was in charge of identifying knowledge and technologies from research teams which were transferable to the external world (both industry and academic partners), enabling the transfer typically through licensing or joint R&D projects, protecting whenever appropriate the underlying intellectual property. He was also in charge of activities which aimed to facilitate the participation of researchers to National and European collaborative projects.

In 2013 he took the position of Project Officer at the Aquitaine Regional Council (France). He negotiated grant agreements; he monitored his portfolio from administrative, financial and technical aspects; he assessed technological progress and the fulfilment of contractual obligations; he managed the correct use of resources allocated to the projects, ensuring that the work was been carried out as planned; he monitored the overall performance (technical, dissemination, exploitation) and the strategic impact of projects.

In 2015 he joined the Research Executive Agency as Research Programme Officer where he is participating to the evaluation and selection of proposals submitted to the FET Open Programme and monitoring several funded projects.

Richard Wheeler

Edinburgh Scientific

Hints on how to write a successful project proposal

EU Funding has never been more important, nor harder to get. In this very practical talk, Richard Wheeler will provide an insider's view to securing European Union funding, including tips and tricks on good proposal writing, what really happens in EU review meetings, why most proposals fail, good project management methods, and more. Attendees will have the opportunity to ask questions and discuss their ideas in an informal workshop environment.

Topics will include:

  • What makes a good consortium
  • Why proposals fail
  • The EU review process
  • What makes good proposal writing
  • Getting started
  • Writing: Science and Technology
  • Writing: Knowledge Transfer
  • Writing: Exchange Programmes
  • Writing: Management and Implementation
  • Writing: Impact
  • Writing: Exploitation
  • Writing: Dissemination

Richard Wheeler is a specialist in artificial intelligence and computer science who has worked for the World Health Organisation in Geneva and The University of Edinburgh, and been a research manager at laboratories in Brussel s in Vienna. He currently runs a private scientific consultancy (Edinburgh Scientific) serving academic and industrial clients across Europe. He is active in the field of renewable energy and scientific management, acts as a chair in a number of EU funding schemes, and recently completed a book “Success with EU Proposals”.