Crucial Concepts in Data Mining

Bagging (Voting, Averaging)

The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining: it combines the predicted classifications (predictions) from multiple models, or from the same type of model fit to different learning data.
It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the data set from which to train the model (the learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the data set and fit, for example, a tree classifier to each sub-sample.
In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small data sets.
One method of deriving a single prediction for new observations is to use all trees found in the different samples, and to apply some simple voting: The final classification is the one most often predicted by the different trees.
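The bootstrap-and-vote procedure described above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the base learner here is a hypothetical one-dimensional decision stump standing in for a tree classifier, and all names are invented for the example.

```python
import random
from collections import Counter

def train_stump(sample):
    # Hypothetical base learner: a 1-D decision stump that picks the
    # threshold and orientation minimizing training errors on the sample.
    best = None
    for t, _ in sample:
        for s in (1, -1):
            errors = sum((1 if s * (x - t) > 0 else 0) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, t, s)
    _, t, s = best
    return lambda x: 1 if s * (x - t) > 0 else 0

def bagged_predict(models, x):
    # Simple (unweighted) voting: the most frequent prediction wins.
    return Counter(m(x) for m in models).most_common(1)[0][0]

random.seed(0)
data = [(i / 10, 1 if i >= 5 else 0) for i in range(10)]
# Repeatedly sub-sample with replacement and fit one classifier per sample.
models = [train_stump([random.choice(data) for _ in data]) for _ in range(25)]
print(bagged_predict(models, 0.9), bagged_predict(models, 0.1))
```

Each bootstrap sample typically yields a slightly different stump (the instability mentioned above); the vote across all 25 of them is far more stable than any single one.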
Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated machine learning algorithm for generating weights for weighted prediction or voting is the Boosting procedure.

Boosting

The concept of boosting applies to the area of predictive data mining: it generates multiple models or classifiers (for prediction or classification) and derives weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging).
A simple algorithm for boosting works like this: start by applying some classification method (e.g., a tree classifier) to the learning data, where each observation carries an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low).
Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).
Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via weighted voting).
Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration in the sequence of iterations of the boosting procedure.
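The text describes boosting generically; a concrete and widely known instance of this reweight-and-combine scheme is AdaBoost, whose weight-update formula is used in the sketch below. The decision-stump base learner and all names are assumptions for illustration, not the document's own code.

```python
import math

def train_weighted_stump(data, w):
    # Base learner that supports observation weights: a 1-D decision stump.
    best = None
    for t, _ in data:
        for s in (1, -1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if (s if x > t else -s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    err, t, s = best
    return err, (lambda x, t=t, s=s: s if x > t else -s)

def boost(data, rounds=5):
    n = len(data)
    w = [1.0 / n] * n          # start with equal observation weights
    ensemble = []
    for _ in range(rounds):
        err, h = train_weighted_stump(data, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # classifier weight
        ensemble.append((alpha, h))
        # Re-weight: increase the weight of misclassified observations,
        # decrease the weight of correctly classified ones.
        w = [wi * math.exp(-alpha * y * h(x)) for (x, y), wi in zip(data, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    # Deployment: combine the sequence of classifiers by weighted voting.
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

data = [(0.1, -1), (0.2, -1), (0.6, 1), (0.8, 1)]
f = boost(data)
print([f(x) for x, _ in data])
```

Each later stump is trained on a reweighted sample in which the previous round's mistakes dominate, which is exactly the "expert on the hard cases" behavior described above.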
Data Preparation in Data Mining

Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage in, garbage out" is particularly applicable to typical data mining projects, where large data sets collected via automatic methods serve as the input to the analyses.

Problem Setting and Notation

Let n denote the number of subjects in the study and d the number of SNPs under investigation.
This paper focuses on binary phenotypes, i.e., traits taking one of two values (e.g., case vs. control). Table 1: Tabular representation of single SNP data. Notice that the row sums n1. and n2. (the numbers of cases and controls) are fixed by design. Hence, the random vectors (n11, n12, n13)T and (n21, n22, n23)T follow multinomial distributions with three categories and sample sizes n1. and n2., respectively. The parameter of the statistical model for the whole study thus consists of all such pairs of multinomial probability vectors, one for each of the d SNPs under investigation.
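The multinomial model just described is easy to simulate. The sketch below (all names are illustrative, not from the paper) draws a 2x3 genotype count table in which cases and controls share the same genotype probabilities, i.e., a draw under the null hypothesis of no association:

```python
import random

def simulate_snp_table(n_cases, n_controls, probs, seed=1):
    # Draw the 2x3 genotype count table for one SNP: each row is a
    # multinomial draw with the SAME genotype distribution `probs`,
    # so the table is generated under the null hypothesis.
    rng = random.Random(seed)
    def draw(n):
        counts = [0, 0, 0]
        for g in rng.choices([0, 1, 2], weights=probs, k=n):
            counts[g] += 1
        return counts
    return [draw(n_cases), draw(n_controls)]

# Row sums (numbers of cases and controls) are fixed by design:
table = simulate_snp_table(100, 100, [0.49, 0.42, 0.09])
print(table)
```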
For every SNP j, we are interested in testing the null hypothesis Hj, where the superscript j indicates the SNP. This hypothesis is equivalent to the null hypothesis that the genotype at locus j is independent of the binary trait of interest.
Two standard asymptotic tests for Hj versus its two-sided alternative Kj (genotype j is associated with the trait) are the chi-square test for association and the Cochran-Armitage trend test. Both tests employ test statistics which are asymptotically chi-square distributed as min(n1., n2.) tends to infinity.
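Both statistics can be computed directly from the 2x3 table of Table 1. The following is a self-contained sketch using textbook formulas (Pearson's chi-square and the standard Cochran-Armitage statistic with additive genotype scores 0, 1, 2); the example table and function names are my own, not the paper's:

```python
def chi2_association(table):
    # Pearson chi-square statistic for a 2x3 genotype table
    # (asymptotically chi-square with 2 degrees of freedom under H0).
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(3))

def trend_statistic(table, w=(0, 1, 2)):
    # Cochran-Armitage trend statistic with additive scores
    # (asymptotically chi-square with 1 degree of freedom under H0).
    r1, r2 = sum(table[0]), sum(table[1])
    n = r1 + r2
    cols = [sum(c) for c in zip(*table)]
    t = sum(w[j] * (table[0][j] * r2 - table[1][j] * r1) for j in range(3))
    var = (r1 * r2 / n) * (
        sum(w[j] ** 2 * cols[j] * (n - cols[j]) for j in range(3))
        - 2 * sum(w[j] * w[k] * cols[j] * cols[k]
                  for j in range(3) for k in range(j + 1, 3)))
    return t * t / var

# Cases and controls cross-classified by genotype (toy counts):
t = [[20, 30, 10], [10, 25, 25]]
print(chi2_association(t), trend_statistic(t))
```

The chi-square test is sensitive to any deviation from independence (2 df), while the trend test concentrates its power on monotone dose-response effects of the minor allele (1 df).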
Observe that the test statistics obtained for different SNPs will be highly correlated if these SNPs are in strong LD with each other; consequently, the corresponding p-values will also exhibit strong dependencies (20). If there were a single test to perform (i.e., a single SNP), its p-value could simply be compared with the desired significance level; with d tests, a multiple-testing correction is needed.
The simplest such method is the so-called Bonferroni correction, which compares each p-value with the significance level divided by the number of tests. Refinements of such corrections have been proposed, e.g., by Meinshausen et al. By contrast, machine learning approaches aimed at prediction try to take the information of the whole genotype into account at once, and thus implicitly consider all possible correlations, striving for an optimal prediction of the phenotype.
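The Bonferroni rule is a one-liner; the sketch below (illustrative names) makes the alpha/d comparison explicit:

```python
def bonferroni_reject(pvals, alpha=0.05):
    # Reject H_j iff p_j <= alpha / d; this controls the probability of
    # at least one false rejection (family-wise error rate) at level alpha.
    d = len(pvals)
    return [p <= alpha / d for p in pvals]

print(bonferroni_reject([0.001, 0.02, 0.6]))  # threshold is 0.05/3
```

Note that 0.02 would be significant at level 0.05 on its own but does not survive the division by d = 3; with d in the hundreds of thousands of SNPs, this conservatism is the motivation for the screening approach below.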
Based on this observation, we propose Algorithm 1, which combines the advantages of the two techniques and consists of the following two steps: (i) the machine learning step, where an appropriate subset of candidate SNPs is selected based on their relevance for prediction of the phenotype; (ii) the statistical testing step, where a hypothesis test is performed for each SNP together with a Westfall-Young type threshold calibration.
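A Westfall-Young type calibration estimates the rejection threshold from the permutation distribution of the minimal p-value. The paper's exact procedure is in its Methods section; the sketch below shows only the generic idea, with `pvalues_fn` a hypothetical callback mapping a (permuted) phenotype vector to the per-SNP p-values:

```python
import random

def westfall_young_threshold(pvalues_fn, labels, alpha=0.05,
                             n_perm=200, seed=0):
    # Permuting the phenotype breaks any genotype-phenotype association
    # while preserving the LD-induced dependence between the p-values.
    # The alpha-quantile of the permutation minima is a per-SNP
    # rejection threshold t* with P(min_j p_j <= t*) <= alpha under H0.
    rng = random.Random(seed)
    minima = []
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        minima.append(min(pvalues_fn(perm)))
    minima.sort()
    return minima[int(alpha * n_perm)]
```

Because the genotypes are held fixed and only the labels are permuted, the dependence structure among the d tests is retained, which is what makes this calibration less conservative than Bonferroni under strong LD.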
Additionally, a filter first processes the weight vector w output in the machine learning step before using it for the selection of candidate SNPs. The above steps are discussed in more detail in the following sections. The machine learning and SNP selection step The goal in machine learning is to determine, based on the sample, a function f(x) that predicts the unknown phenotype y based on the observation of genotype x. It is crucial to require such a function to not only capture the sample at hand, but to also generalize, as well as possible, to new and unseen measurements, i.e., to perform well on data not used for training.
Once a classification function f has been determined by solving the above optimization problem, it can be used to predict the phenotype of any genotype x by setting f(x) = sign(w^T x + b). This equation shows that the components of the vector w (called the SVM parameter or weight vector) that are largest in absolute value also have the most influence on the predicted phenotype. Note that the weight vector contains three values for each position due to the feature embedding, which encodes each SNP with three binary variables.
To convert the vector back to the original length, we simply take the average over the three weights. We also include an offset by adding a constant feature that is identically one.
Considering that the use of SVM weights as importance measures is a standard approach (25), for each j the score |wj| can be interpreted as a measure of the importance of the j-th SNP for the phenotype prediction task. The main idea is to select only a small number k of candidate SNPs before statistical testing, namely those SNPs having the largest scores. Calculation of these p-values is performed exactly as described above for RPVT, with the only modification that p-values for SNPs not ranked among the top k in terms of their filtered SVM weights are set to 1, without calculating a test statistic.
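The selection step described above (collapse the three encoding weights per SNP, rank by absolute weight, test only the top k, set all other p-values to 1) can be sketched as follows; the weights, p-values, and names are toy assumptions, not the paper's data:

```python
def collapse_weights(w3, d):
    # Average the three feature-embedding weights belonging to each SNP,
    # recovering one weight per SNP as described in the text.
    assert len(w3) == 3 * d
    return [sum(w3[3 * j:3 * j + 3]) / 3 for j in range(d)]

def screened_pvalues(weights, pvalue_fn, k):
    # Keep the k SNPs with the largest |w_j|; p-values of all other
    # SNPs are set to 1 without computing a test statistic.
    d = len(weights)
    scores = [abs(w) for w in weights]
    top_k = set(sorted(range(d), key=lambda j: scores[j], reverse=True)[:k])
    return [pvalue_fn(j) if j in top_k else 1.0 for j in range(d)]

# Toy example: 5 SNPs with hypothetical collapsed weights and p-values.
w = [0.05, -0.80, 0.10, 0.60, -0.02]
p = [0.30, 1e-6, 0.20, 3e-4, 0.90]
print(screened_pvalues(w, lambda j: p[j], k=2))
```

Only the two SNPs with the largest absolute weights retain their test-based p-values; the remaining entries are 1 by construction, so they can never be declared significant.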
To this end, we investigated prior approaches (26, 27) based on sample splitting, meaning that the selection of the k SNPs is done on one randomly chosen sub-sample of individuals, while the p-value calculation and thresholding for the selected SNPs is performed on another. Appropriate conclusions from statistical tests should involve explicit consideration of the magnitude of effect that would be important to detect, if it were real, and of whether the probabilities of Type I and Type II error reflect the relative seriousness of the consequences of a Type I vs. Type II error. Although the traditional approach of ignoring Type II error probabilities may be easy, it can result in poor decisions. Our goal is to improve upon an obviously flawed hypothesis-testing system while working under the constraints of that system, because we acknowledge that researchers are not very likely to abandon an approach that is so easy to use and so widely understood. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, details of website visits, or IP addresses).
Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi) or having no timestamps (DNA sequencing).
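The core of Apriori is a level-wise search: frequent itemsets of size k are joined into candidates of size k+1, and only candidates meeting the minimum support survive. The sketch below is a simplified illustration (it omits Apriori's subset-based candidate pruning); the basket data are invented:

```python
def apriori(transactions, min_support):
    # Find all itemsets contained in at least `min_support` transactions.
    def support(itemset):
        return sum(itemset <= t for t in transactions)
    # Level 1: frequent single items.
    current = {s for s in {frozenset([i]) for t in transactions for i in t}
               if support(s) >= min_support}
    frequent = set(current)
    while current:
        size = len(next(iter(current))) + 1
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current
    return frequent

baskets = [frozenset(b) for b in (
    {"bread", "milk"}, {"bread", "butter", "milk"},
    {"bread", "butter"}, {"milk", "butter"})]
freq = apriori(baskets, min_support=2)
```

On these four baskets, every single item and every pair is frequent at support 2, but the triple {bread, milk, butter} appears only once and is discarded.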
The last column of Table 2 indicates whether the reported associations were validated, i.e., confirmed by later independent studies.
There are several possible (not mutually exclusive) explanations for that phenomenon (10, 11, 12).
So, for example, the predicted classifications from the tree classifiers, the linear model, and the neural network classifier(s) can be used as input variables for a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy. The second step uses multiple statistical hypothesis testing for a quantitative assessment of the individual relevance of the filtered SNPs.
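This meta-classifier idea (often called stacking) can be illustrated compactly. In the sketch below a tiny perceptron stands in for the neural-network meta-classifier, and the base-model predictions and labels are invented toy data:

```python
def train_perceptron(features, labels, epochs=50, lr=0.1):
    # Minimal perceptron acting as the meta-classifier: it learns how to
    # weight the base models' predictions to match the observed labels.
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:                       # mistake-driven update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Each row holds the predictions of three base models
# (e.g., tree, linear model, neural network) for one observation:
base_preds = [(1, 1, -1), (1, -1, -1), (-1, -1, -1), (-1, 1, 1), (1, 1, 1)]
labels = [1, -1, -1, 1, 1]
meta = train_perceptron(base_preds, labels)
print([meta(x) for x in base_preds])
```

Here the meta-classifier discovers that the second base model tracks the labels best and effectively up-weights it, which is exactly the "learn how to combine" behavior described above.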
See Supplementary Section 1. The complete method is available in all these programming languages. Regarding the biological plausibility of these two SNPs, we examined a number of functional indicators to assess their potential role in disease.
This choice is admittedly a wide, arbitrary upper bound for the number of SNPs that can present a detectable association with a given phenotype.
All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that can easily be converted by stakeholders into resources for strategic decision making. In particular, we explored the genomic regions in which they map and their potential roles as regulatory SNPs, status as eQTLs, and role in Mendelian disease. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables is not the main goal of Data Mining. This procedure is explained in detail in the Methods Section and Supplementary Section 1.
In a sense, we thus examined how well any particular method, when applied to the WTCCC dataset, is able to make discoveries in that dataset that were actually confirmed by later research using RPVT in independent publications. Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning data, transforming data, selecting subsets of records and, in the case of data sets with large numbers of variables ("fields"), performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods that are being considered).