Foreword
- Output options: the ‘tango’ syntax and the ‘readable’ theme.
What is data mining?
Data mining focuses on discovering novel patterns in data. For the most part, it uses unsupervised learning techniques.
- Clustering (partitioning).
- Hierarchical clustering.
- Hybrid clustering.
- Neural clustering.
- Aggregation clustering.
- Association analysis.
- Dimension reduction techniques:
  - Factor analysis.
  - Principal component analysis.
  - Contingent analysis.
  - Multidimensional scaling.
- Supervised classification and prediction with:
  - Tree-based methods.
  - Discriminant analysis.
  - Regressions.
  - k-nearest neighbours.
  - Bayesian methods.
  - Bootstrapping and Monte Carlo methods.
  - Support vector machines.
  - Neural networks.
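To make the first item in the list concrete, here is a minimal sketch of partitioning clustering using k-means (Lloyd's algorithm), written with the Python standard library only. The function name `kmeans`, its parameters, and the toy data are assumptions made for illustration. They do not come from any particular package, and a real project would more likely rely on an established library.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Partition points into k clusters with Lloyd's algorithm (illustrative sketch)."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical toy data: two visibly separated groups of 2-D points.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

No class labels are supplied: the algorithm recovers the two groups from the distances alone, which is what makes it an unsupervised technique. Which group receives label 0 and which receives label 1 depends on the random starting centroids.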
Even though these techniques are also part of machine learning or statistical learning, data mining is different.
In data mining, we don’t seek to automate processes or mix artificial intelligence with statistical techniques. Data mining belongs to the field of research, such as marketing research.
- Prepare the data: preprocessing and cleaning the data is 80% of the job.
- Explore the data: analysis is part of all steps (visual and statistical exploration).
- Develop a model: the complex art and science of finding the most appropriate techniques.
- Interpret and present the results: reporting.
Therefore, the data miner must search through massive data haystacks and find the interesting ‘needles’.
Data mining checklist
- Establish the data mining goals.
- Select the data.
- Preprocess the data.
- Transform the data.
- Store the data.
- Mine the data.
- Evaluate the results.
Data preparation, preprocessing, and cleaning account for 80% of the job.
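The checklist above can be sketched end to end on a toy example. The records, field names, and the goal (finding product pairs that are bought together, a simple form of association analysis) are all hypothetical and chosen only to show how each step feeds the next. A JSON string stands in for a real data store.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical raw records, invented for this illustration.
raw = [
    {"customer": "a", "age": 25, "basket": "bread,milk"},
    {"customer": "b", "age": None, "basket": "bread,butter"},
    {"customer": "c", "age": 41, "basket": "bread,milk,butter"},
    {"customer": "c", "age": 41, "basket": "bread,milk,butter"},  # duplicate row
    {"customer": "d", "age": 33, "basket": "milk"},
]

# 1. Goal (assumed): find which product pairs are bought together most often.

# 2. Select: keep only the fields relevant to the goal.
selected = [{"customer": r["customer"], "basket": r["basket"]} for r in raw]

# 3. Preprocess: drop exact duplicates while preserving order.
seen, clean = set(), []
for r in selected:
    key = (r["customer"], r["basket"])
    if key not in seen:
        seen.add(key)
        clean.append(r)

# 4. Transform: turn each comma-separated string into a sorted list of items.
transactions = [sorted(r["basket"].split(",")) for r in clean]

# 5. Store: serialise the prepared data (a JSON string stands in for a database).
stored = json.dumps(transactions)

# 6. Mine: count how often each pair of items appears in the same basket.
pair_counts = Counter(
    pair for basket in json.loads(stored) for pair in combinations(basket, 2)
)

# 7. Evaluate: support = share of transactions that contain the pair.
support = {pair: n / len(transactions) for pair, n in pair_counts.items()}
for pair, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

Even in this small sketch, steps 2 to 5 take more lines than the mining step itself, which is consistent with the remark that preparation dominates the work.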