Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Note


Dimension Reduction

Data can be voluminous (long vectors and data frames), contain many variables (large longitudinal datasets and extensive panel data), and/or incorporate several dimensions and subdivisions.

Analyzing masses of data is a challenge. We face entropy.

Fortunately, with the help of data mining techniques, we can sift, filter, and slice and dice the data.

Data mining

Data mining is closely related to machine learning (ML). However, with ML, we seek to build a model and then use the model to make predictions. In data mining, we mainly seek to understand the data and report the findings. Data mining is a vast field. The following list is a high-level overview:

  • Frequent Pattern Mining: the arules and arulesNBMiner packages for example.
  • Sequence Mining: the arulesSequences package (which implements the cSPADE algorithm) for example.
  • Clustering: the kmeans and hclust functions for example.
  • Classification and Segmentation: with Support Vector Machines, k-NN, tree-based methods, and Naïve Bayes methods, among many others.

We can also proceed with visualization techniques.
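As an illustration of the clustering entry in the list above, here is a minimal Python sketch of the basic k-means idea (Lloyd's algorithm). It is not necessarily the variant that R's kmeans function runs by default, and the function names, the toy points, and the choice of starting centers are assumptions made for the example.

```python
import math


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]


def k_means(points, centers, iterations=20):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # Move the center to the mean of the points assigned to it.
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(centers[k])
        centers = new_centers
    # Final labels are computed against the final centers.
    return assign(points, centers), centers


# Toy data (assumed for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, [points[0], points[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

The result depends on the starting centers, which is why practical implementations usually try several random starts.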

Dimension reduction methods

Dimension reduction methods are simplification methods. We want to boil down and distill the data to arrive at understandable patterns, trends, and clusters. We look for similarities, factors, eigenvalues, components, dimensions, latent variables, synthetic variables, etc.

In other words, these methods pick up ‘signals’ and ‘traces’ and render them in an understandable format such as a scatter plot or a ‘map’. These visualizations look like radar or sonar images.

Dimension reduction methods comprise:

  • Multidimensional Scaling (MDS).
  • Principal Component Analysis (PCA).
  • Correspondence Analysis.
    • Simple (CA).
    • Multiple (MCA).
  • Factor (Factorial) Analysis (FA).
  • Linear Discriminant Analysis (LDA).
  • Feature Selection.

These methods differ in the path they take, but the destination is the same.
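To make the idea of a ‘synthetic variable’ concrete, here is a minimal Python sketch of one method from the list, PCA, reduced to its first component. It is only an illustration under stated assumptions: the function name pca_first_component, the toy data, and the use of power iteration are choices made for this example, and in practice a dedicated routine (such as R's prcomp) would be the usual tool.

```python
import math


def pca_first_component(rows, iterations=200):
    """Return (direction, scores, share) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Largest eigenvalue (Rayleigh quotient) and its share of total variance.
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    share = eigenvalue / sum(cov[i][i] for i in range(d))
    # Scores: each observation projected onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores, share


# Toy data (assumed): three variables that move almost together.
rows = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 6.0, 8.9],
    [4.0, 8.0, 12.2],
    [5.0, 10.0, 15.0],
]
direction, scores, share = pca_first_component(rows)
print(direction)
print(scores)
print(share)
```

Here three correlated variables collapse into a single score per observation, and share reports how much of the total variance that one dimension retains.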

Fields of application

Dimension reduction methods can be applied in many fields. Here are some examples:

  • Psychology: study the underlying structure of perceptions of stimuli (personality traits, gender roles, human faces, everyday sounds, fragrances, colors). Create perceptual maps. Understand the psychological dimensions hidden in the data.
  • Marketing: derive maps of consumer choice and product preference so that relationships between products can be discerned. Use these maps to position your products appropriately. Emphasize brand differentiation. Design future experiments. Discriminate between similar products.
  • Ecology: provide environmental impact maps of pollution (oil spills, sewage pollution, mud dispersal). Develop biological taxonomy. Classify populations using genetic data.
  • Molecular biology: reconstruct the spatial structures of molecules. Interpret their interrelations, similarities, and differences. Construct a 3D map.
  • Computational chemistry: use a measure of molecular similarity (interatomic distance) to characterize the behavior and function of molecules derived from large collections of compounds.
  • Social networks: develop telephone-call graphs where the vertices are telephone numbers and the edges correspond to calls between them. Recognize instances of credit card fraud and network intrusion. Identify clusters in large scientific collaboration networks.
  • Graph layout: design a diagram to describe a network and the system it represents using a graph-theoretic distance (minimum path length) between pairs of vertices. Examples include communications networks, electrical circuit diagrams, wiring diagrams, and protein-protein interaction networks. Create graphic visualizations of digital image libraries, with images as vertices and proximities (perceptual differences) between pairs of images as edge weights.
  • Music: use a measure of musical sound quality (set the spectral components with high resolution at low frequencies to mimic the human auditory system) as input to a nonlinear distance measure to assess the similarities and differences between a variety of songs.
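The graph-layout example above relies on a graph-theoretic distance, the minimum path length between pairs of vertices. A minimal Python sketch of how such a distance table could be computed for an unweighted graph follows; the function name and the five-vertex toy graph are assumptions made for illustration, and the resulting table is the kind of input a method such as MDS would then turn into a layout.

```python
from collections import deque


def shortest_path_lengths(adjacency):
    """All-pairs minimum path length (in edges) for an unweighted graph."""
    distances = {}
    for source in adjacency:
        # Breadth-first search reaches each vertex by a shortest path.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        distances[source] = dist
    return distances


# Toy undirected graph (assumed), given as adjacency lists.
graph = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
distances = shortest_path_lengths(graph)
print(distances["A"]["E"])  # 3 (A-B-D-E)
print(distances["C"]["E"])  # 2 (C-D-E)
```

Vertices that cannot be reached from a source are simply absent from that source's table, so a disconnected graph would need explicit handling before any layout step.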