Book Review – Interactive and Dynamic Graphics for Data Analysis: With R and GGobi by Dianne Cook and Deborah F. Swayne (Springer 2007)

October 8th, 2009


This book covers interactive graphics and their role in data analysis, using the R statistical environment together with GGobi, an open source software package for data visualisation that the two authors also work on.

Overall this is a nice introduction to data analysis and to the graphical methods that support it, and the book doesn’t try to cover too much. This is possibly a reflection of the scope of GGobi, which has been designed for some specific tasks and written so that it can be linked to other software such as R, rather than as a bloated system with far more functionality than is required for a specific application.

The book starts with an introductory chapter that gently works through an example using tipping data from a restaurant. Graphs are produced and discussed in the text to show how they relate to the thought process of the authors when undertaking their analysis. The chapter ends with a short piece describing moving from static to interactive analysis and how this can benefit an analysis.

The second chapter, titled The Toolbox, covers the types of graph that are available for different kinds of data. It focusses primarily on GGobi but mentions other types of plot that are available in R to complement the tools in GGobi. One main feature of GGobi that does not appear in many packages is the tour, which runs through a sequence of different projections of the variables so that we can study which projections provide separation between groups in the data. This is a useful activity for pattern recognition where we have a supervised problem with known classes and want to investigate how to use the variables in our data set to discriminate between the classes. Brushing is also discussed as a technique for linking multiple plots: we might highlight one subset of the data in one plot and see what values this group takes for the other variables.
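To give a feel for what a single step of a tour involves, here is a minimal sketch in plain Python (not R or GGobi, and not taken from the book). It draws one random two-dimensional projection plane and projects some made-up four-variable data onto it. The function names and the data are my own assumptions for illustration; a real tour moves smoothly between many such planes rather than drawing a single one.

```python
import math
import random


def random_2d_frame(p, rng):
    """Return two orthonormal p-vectors spanning a random projection plane."""
    a = [rng.gauss(0.0, 1.0) for _ in range(p)]
    b = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm_a = math.sqrt(sum(x * x for x in a))
    a = [x / norm_a for x in a]
    # Gram-Schmidt: remove the component of b along a, then normalise b.
    dot = sum(x * y for x, y in zip(a, b))
    b = [y - dot * x for x, y in zip(a, b)]
    norm_b = math.sqrt(sum(y * y for y in b))
    b = [y / norm_b for y in b]
    return a, b


def project(rows, frame):
    """Project each p-dimensional row onto the plane, giving (x, y) pairs."""
    a, b = frame
    return [
        (sum(v * w for v, w in zip(row, a)), sum(v * w for v, w in zip(row, b)))
        for row in rows
    ]


rng = random.Random(0)
# Made-up data: 5 observations on 4 variables.
data = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
frame = random_2d_frame(4, rng)
points = project(data, frame)  # the 2D coordinates one would plot
```

Plotting `points` for a succession of different frames is what lets the eye pick out projections in which the classes separate.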

The third chapter is a short look at missing data, which is a problem encountered by all data analysts on a regular basis, particularly when working with large multivariate data sets. The chapter touches briefly on data imputation to fill in the gaps so that we can use the whole data set for any multivariate analysis. The chapter is a short introduction to the area rather than comprehensive coverage.
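As a small illustration of the simplest kind of imputation, the sketch below fills each gap with the mean of the observed values. This is my own plain Python example rather than anything from the book, and `impute_mean` is a hypothetical helper name. It also returns a flag for each value, on the assumption that we would want to mark the imputed points when plotting.

```python
def impute_mean(values):
    """Fill None entries with the mean of the observed values.

    Returns the filled list and a parallel list of booleans marking
    which entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


# Made-up variable with two missing entries.
filled, was_missing = impute_mean([2.0, None, 4.0, 6.0, None])
```

Mean imputation shrinks the spread of the variable, which is one reason to keep the flags and look at the imputed values in a plot before trusting them.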

In chapter four the authors discuss supervised classification methods, starting with parametric methods such as Fisher’s linear discriminant analysis and then moving on to more recent algorithmic methods from fields such as data mining, which are more like black boxes. There is a good introduction to using graphical methods alone to get an initial idea of the class structure (groups) within the data and of how well one or more of the recorded variables separate the data into these classes. The chapter then moves on to combine graphical displays with numerical output from the classification algorithms to get a broader picture of how well a method describes a particular set of data. Tree methods are introduced along with their benefits, including interpretation based on a number of decision points, in contrast to black box methods such as neural networks or support vector machines, which are also touched on in the chapter.
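For readers who have not met Fisher’s linear discriminant before, here is a minimal sketch for two classes measured on two variables, written in plain Python with made-up data. It is my own illustration under those assumptions, not the book’s code: the discriminant direction is the inverse of the pooled within-class scatter matrix applied to the difference between the class means.

```python
def mean_vec(rows):
    """Mean of a list of 2D points."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def scatter(rows, m):
    """2x2 scatter matrix: sum of outer products of the centred rows."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class0, class1):
    """Direction w = Sw^-1 (m1 - m0) for two classes of 2D points."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Made-up data: two well separated groups.
class0 = [(1, 2), (2, 1), (2, 3), (3, 2)]
class1 = [(6, 6), (7, 5), (7, 7), (8, 6)]
w = fisher_direction(class0, class1)
# One-dimensional discriminant scores for each observation.
scores0 = [w[0] * x + w[1] * y for x, y in class0]
scores1 = [w[0] * x + w[1] * y for x, y in class1]
```

A histogram of `scores0` and `scores1` on a common axis is the kind of picture that sits naturally alongside the numerical output of the classifier.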

Chapter five provides coverage of cluster analysis, which is one of the techniques used for unsupervised classification. Near the start of the chapter there is a nice display of graphs showing the process of identifying potential clusters in a data set, which provides an insight into how the authors would undertake this type of analysis. Numerical techniques for cluster analysis are then covered as a formal approach to identifying potential clusters based on numeric metrics. Output from the techniques is shown with colour graphs to highlight the clusters identified in the analysis.
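As one example of a numerical clustering technique, the sketch below runs k-means on a handful of made-up two-dimensional points. I have picked k-means purely for illustration, and the plain Python code, the starting centres and the data are all my own assumptions rather than material from the book.

```python
def kmeans(points, centres, iterations=10):
    """Plain k-means on 2D points, starting from the given centres."""
    centres = [tuple(c) for c in centres]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(
                range(len(centres)),
                key=lambda k: (p[0] - centres[k][0]) ** 2
                + (p[1] - centres[k][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(
                    (sum(p[0] for p in members) / len(members),
                     sum(p[1] for p in members) / len(members))
                )
            else:
                new_centres.append(centres[k])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return labels, centres


# Made-up data: two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centres = kmeans(points, centres=[(0, 0), (10, 10)])
```

The `labels` are what would be mapped to colours in a plot, so that the numerical result can be checked against what the eye sees.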

Chapter six discusses a variety of unconnected topics, such as longitudinal data and multidimensional scaling, which might be of interest to the reader.

The book ends with a chapter discussing the data sets that are available with GGobi.

Overall comment: this is a reasonably short but very readable introduction to exploratory data analysis using graphical methods. Various methods are illustrated well with colour graphics to provide an insight into how to analyse data at the start of an investigation prior to any model building activities.
