This benchmark was taken from NU-MineBench 2.0. Their technical report [2]
provides a description of the algorithm:
K-means is a partition-based method [4] and is arguably the most
commonly used clustering technique. K-means represents a cluster by the
mean value of all objects contained in it. Given the user-provided
parameter k, the initial k cluster centers are randomly selected from
the database. Then, K-means assigns each object to its nearest cluster
center based on the similarity function. For example, for spatial
clustering, the Euclidean distance is usually used to measure the
closeness of two objects. Once the assignments are completed, new
centers are found by computing the mean of all the objects in each
cluster. This process is repeated until two consecutive iterations
generate the same cluster assignment. The clusters produced by the
K-means algorithm are sometimes called "hard" clusters, since any data
object either is or is not a member of a particular cluster.
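The loop described above can be sketched in a few lines of Python. This is a
hypothetical illustration, not the benchmark's actual implementation; for
determinism the sketch seeds the centers with the first k points instead of
random ones:

```python
import math

def kmeans(points, k):
    """Assign each point to its nearest center (Euclidean distance),
    recompute each center as the mean of its members, and stop when two
    consecutive iterations produce the same cluster assignment."""
    centers = [list(p) for p in points[:k]]   # first k points, for determinism
    assign = None
    while True:
        new_assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:              # same assignment twice -> done
            return centers, assign
        assign = new_assign
        for c in range(k):                    # update step: mean of each cluster
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]

centers, labels = kmeans([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)], 2)
print(labels)   # each point lands in exactly one "hard" cluster
```

Because membership is exclusive, the returned labels illustrate the "hard"
clusters mentioned above: every point belongs to exactly one cluster.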
To build the application, simply run:

    make -f Makefile

By default, this produces an executable named "KMeans.bin", which can then be
run in the following manner:
./KMeans.bin -m <max_clusters> \
             -n <min_clusters> \
             -t <threshold> \
             -i <input_file_name> \
             -nthreads <number of threads>
To produce the data in [1], the following values were used:

low contention:  -m 40 -n 40 -t 0.05 -i inputs/random-n2048-d16-c16.txt
high contention: -m 15 -n 15 -t 0.05 -i inputs/random-n2048-d16-c16.txt
For runs without a simulator, a larger input file (more info below) can be used:

low contention:  -m 40 -n 40 -t 0.00001 -i inputs/random-n65536-d32-c16.txt
high contention: -m 15 -n 15 -t 0.00001 -i inputs/random-n65536-d32-c16.txt
A section of the real image database distributed by Corel Corporation is used
for K-means. This database consists of 17695 scenery pictures. Each picture is
represented by two features: color and edge. The color feature is a vector of 9
floating-point values, while the edge feature is a vector of size 18. Both
K-means implementations use the Euclidean distance as the similarity function
and apply it to the two features separately. Since the clustering quality of
K-means highly depends on the input parameter k, both implementations were
executed with ten different k values ranging from 4 to 13.
Smaller versions of "color", "edge", and "texture" were made by taking the
first 100 lines of each file. The smaller inputs are called "color100",
"edge100", and "texture100".

The randomX_Y files were created by generating random data sets with X points
of Y dimensions each. Values were selected from a uniform distribution.
More input sets can be generated by using "inputs/generate.py". The script picks
a given number of uniformly distributed random centers and then selects random
points around those centers using a Gaussian distribution. There are two
included input files from this script: small, random-n2048-d16-c16.txt; and
large, random-n65536-d32-c16.txt. In the filename, "n" refers to the number of
points, "d" the number of dimensions, and "c" the number of centers.
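The generation scheme can be sketched as follows. This is a hypothetical
illustration of the same idea, not generate.py itself; the function name,
parameter names, and the 0.05 standard deviation are assumptions:

```python
import random

def generate(n, d, c, sigma=0.05, seed=0):
    """Pick c uniformly distributed random centers in [0, 1)^d, then draw
    each of the n points from a Gaussian around a randomly chosen center.
    Each output row is a point index followed by its d coordinates."""
    rng = random.Random(seed)
    centers = [[rng.random() for _ in range(d)] for _ in range(c)]
    rows = []
    for i in range(n):
        center = rng.choice(centers)                      # random cluster
        rows.append([i] + [rng.gauss(x, sigma) for x in center])
    return rows

# Emit rows in the same "index value value ..." layout as the input files.
for row in generate(n=4, d=3, c=2):
    print(" ".join(str(v) for v in row))
```

With n=2048, d=16, c=16 this mirrors the shape of the small included input
file; clustering such data with k equal to c should roughly recover the
planted centers.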
[1] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford
    Transactional Applications for Multi-processing. In IISWC '08: Proceedings
    of The IEEE International Symposium on Workload Characterization,
    September 2008.

[2] J. Pisharath. NU-MineBench 2.0. Technical Report CUCIS-2005-08-01, 2005.

[3] J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms.
    Kluwer Academic Publishers, 1981.