Robust/src/Benchmarks/SingleTM/KMeans/README

   1 Introduction
   2 ------------
   3
   4 This benchmark was taken from NU-MineBench 2.0. Their technical report [3]
   5 provides a description of the algorithm:
   6
   7     K-means is a partition-based method [4] and is arguably the most
   8     commonly used clustering technique. K-means represents a cluster by the
   9     mean value of all objects contained in it. Given the user-provided
  10     parameter k, the initial k cluster centers are randomly selected from
  11     the database. Then, K-means assigns each object to its nearest cluster
  12     center based on the similarity function. For example, for spatial
  13     clustering, usually the Euclid distance is used to measure the closeness
  14     of two objects. Once the assignments are completed, new centers are
  15     found by finding the mean of all the objects in each cluster. This
  16     process is repeated until two consecutive iterations generate the same
  17     cluster assignment. The clusters produced by the K-means algorithm are
  18     sometimes called "hard" clusters, since any data object either is or is
  19     not a member of a particular cluster.
  20
  21
  22 Compiling and Running
  23 ---------------------
  24
  25 To build the application, simply run:
  26
  27     make
  28
  29 By default, this produces an executable named "KMeans.bin", which can then be
  30 run in the following manner:
  31
  32     ./KMeans.bin -m <max_clusters> \
  33              -n <min_clusters> \
  34              -t <threshold> \
  35              -i <input_file_name> \
  36              -nthreads <number of threads>
  37
  38 To produce the data in [1], the following values were used:
  39
  40     low contention:  -m 40 -n 40 -t 0.05 -i inputs/random-n2048-d16-c16.txt
  41     high contention: -m 15 -n 15 -t 0.05 -i inputs/random-n2048-d16-c16.txt
  42
  43 For runs without a simulator, a larger input file, (more info below) can be used
  44 instead:
  45
  46     low contention:  -m 40 -n 40 -t 0.00001 -i inputs/random-n65536-d32-c16.txt
  47     high contention: -m 15 -n 15 -t 0.00001 -i inputs/random-n65536-d32-c16.txt
  48
  49
  50 Input Files
  51 -----------
  52
  53 A section of the real image database distributed by Corel Corporation is used
  54 for K-means. This database consists of 17695 scenery pictures. Each picture is
  55 represented by two features: color and edge. The color feature is a vector of 9
  56 floating points while the edge feature is a vector of size 18. Both K-means
  57 implementations use Euclid distance as the similarity function and execute it
  58 for the two features separately. Since the clustering quality of K-means
  59 methods highly depends on the input parameter k, both K-means were executed
  60 with ten different k values ranging from 4 to 13.
  61
  62 Smaller versions of "color", "edge", and "texture" were made by taking the
  63 first 100 lines of each file. The smaller inputs are called "color100",
  64 "edge100", and "texture100".
  65
  66 The randomX_Y files were created by generating random data sets with X points
  67 of Y dimensions each. Values were selected from a uniform distribution.
  68
  69 More input sets can be generated by using "inputs/generate.py". The script picks
  70 a given number of uniformly distributed random centers and then selects random
  71 points around those centers using a gaussian distribution. There are two
  72 included input files from this script: small, random-r2048-d16-c16.txt; and
  73 large, random-r65536-d32-c16.txt. In the filename, "n" refers to the number of
  74 points, "d" the number of dimensions, and "c" the number of centers.
  75
  76
  77 References
  78 ----------
  79
  80 [1] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford
  81     Transactional Applications for Multi-processing. In IISWC '08: Proceedings
  82     of The IEEE International Symposium on Workload Characterization,
  83     September 2008.
  84
  85 [2] J. Pisharath. NU-MineBench 2.0. Technical Report CUCIS-2005-08-01, 2005.
  86
  87 [3] J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms.
  88     Kluwer Academic Publishers, 1981
  89