The following table lists all data sets required for the exercises. Most are real data obtained from various sources (see the reference section at the end of this page). A few are simulated data sets that were generated with a background story in mind; their file names are marked by a symbol in the right column. All simulated data sets were generated using DataLab.
Filename |
Description |
Ref. |
ALCOHOL |
Subset of the data set WINE containing only the alcohol content of two brands. |
 data courtesy of M. Forina |
BANANAS |
Properties of 38 bananas. The bananas have been weighed, measured and eaten by the author. |
 |
BENZ500 |
Spectroscopic data (NMR) on various brands of gasoline, and the relative octane number. |
 data courtesy of R. Meusinger |
BODYFAT |
Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) are recorded for 252 men. Body fat, a measure of health, is estimated through an underwater weighing technique. Fitting body fat to the other measurements using multiple regression provides a convenient way of estimating body fat for men using only a scale and a measuring tape. |
 data courtesy of Garth Fisher |
BOILPTS |
Boiling points and topological descriptors of 185 chemical substances. |
 |
CANCER |
Number of intestinal cancer cases in West Germany between 1955 and 1995. |
 |
CIGART |
Artificial data set for classification, created by INSPECT. The data points are arranged so that only non-linear methods can classify them correctly. |
 |
COINS |
Weight of 114 coins (Austrian 1 Schilling pieces) of different ages. |
 Coins were collected and weighed by H. Lohninger and A. Satzinger |
ETHANOL |
NOx concentration in the exhaust gases of an experimental ethanol motor. |
 |
EXMPL-A |
Artificial data set which shows a few simple relationships among variables. |
 |
FISH1SPECIES |
Subset of the data set FISHCATCH showing the relationship between length and weight of fish. |
 |
FISHCATCH |
Body measurements of different species of perch. |
 |
FLURIEDW |
This data set comprises geometric measures of 100 authentic and 100 counterfeit bank notes. |
 data courtesy of H. Riedwyl, Bern, Switzerland |
FREEFALL |
Simulated data to show variability in data. A steel ball is released from different heights; for each height the experiment is repeated 100 times. |
 |
HENRYSEM |
Henry's constant of chemical substances together with molecular descriptors. The physical data were obtained from Hine et al.; the molecular descriptors were calculated using TOPIX. |
 |
HUMIDIT2 |
Average relative humidity (%) at 264 locations in the USA. The data set contains the values for June and September, each for morning and afternoon. In addition, the annual averages are given in the last two columns. |
 |
IRIS |
Three types of iris plants. The plants are described by four variables. |
 |
METHANE |
This data set contains the concentration of atmospheric methane measured monthly during the period from September 1980 to September 1988. |
 |
MINWATER |
Chemical analysis of different brands of mineral water. |
 |
MINWATER2 |
Subset of MINWATER |
 |
MOTE9603 |
Climate data obtained from Mote weather station, Florida, USA. Data set contains measurements of 9 meteorological variables over a period of ten days in March 1996. |
 data courtesy of Don Hayward |
MOTETIDES |
Water level at the Mote weather station, Florida, USA, during July 1998. Data was obtained every 15 minutes. |
 data courtesy of Don Hayward |
MULTIEST |
Artificial data used in an interactive example on multidimensional models. |
 |
POLYFIT |
Artificial data showing a polynomial relationship of the third order. |
 |
PRECIPITATION |
Normal monthly precipitation (inches) for the period 1961-1990. |
 |
REACTTEST |
The reaction times to visual stimuli were recorded for 9 persons. The experiment was repeated on two different days; one series was obtained before a two-hour lecture, the other series after a two-hour lecture. |
 |
STRONTIUM |
Simulated data to demonstrate the two-sample t-test. |
 |
SUNSPOTS |
Average monthly sunspot areas between 1874 and 1998. |
 data courtesy of David H. Hathaway |
TRAIN |
Simulated data to show a skewed distribution. |
 |
TWOCLASS |
Artificial data set containing two classes of observations. |
 |
WATERRESID |
Subset of MINWATER |
 |
WINE |
Chemical analysis of three kinds of Italian red wines (Barolo, Grignolino, Barbera). |
 data courtesy of M. Forina |
WINEGER |
Chemical analysis of various kinds of German wines. |
 data courtesy of Klaus Danzer, Friedrich-Schiller-Universität Jena, Germany |
WORLDPOP |
Demographic, sociological and economic data on the world's nations (1988). |
 data source: various publications of the UN, the World Bank and the CIA |
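
The BODYFAT entry above mentions fitting body fat to the other measurements by multiple regression. The following is a minimal sketch of that idea in Python, not the procedure used in the exercises: it solves the least-squares normal equations with plain Gaussian elimination. The function names (`solve_linear_system`, `fit_multiple_regression`, `predict`) and all numbers are hypothetical. The data are synthetic and are not taken from the BODYFAT file, whose format is not described here.

```python
# Minimal multiple linear regression via the normal equations (pure Python).
# All numbers below are synthetic placeholders, NOT values from BODYFAT.


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(predictors, response):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared errors."""
    rows = [[1.0] + list(p) for p in predictors]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, predictor_row):
    """Evaluate the fitted linear model for one observation."""
    return coefficients[0] + sum(
        b * x for b, x in zip(coefficients[1:], predictor_row)
    )


# Synthetic example: two hypothetical predictors (say, weight in kg and
# abdomen circumference in cm) and a response built from a known formula.
synthetic_predictors = [(70.0, 85.0), (82.0, 95.0), (65.0, 80.0),
                        (90.0, 104.0), (77.0, 88.0), (100.0, 110.0)]
synthetic_response = [-30.0 - 0.1 * w + 0.6 * a for w, a in synthetic_predictors]

coefficients = fit_multiple_regression(synthetic_predictors, synthetic_response)
```

Because the synthetic response is noise-free, the fitted coefficients should reproduce the generating formula up to rounding, and `predict(coefficients, (80.0, 90.0))` evaluates the model for a new observation. With real measurements the fit would only approximate the data, and a numerically more robust solver (for example a QR-based least-squares routine) would usually be preferable to the normal equations.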