Sequential k-Means
The Sequential k-Means algorithm is an implementation of the unsupervised clustering algorithm of the same name. It is a sequential variant of the widely used k-Means clustering algorithm, designed for streaming data. The aim of the algorithm is to find clusters based on the structure of the data, so that each cluster contains similar data points and dissimilar data points fall into different clusters.
The number of input channels (referred to below as n) for this algorithm can be freely selected by the user. These inputs span the n-dimensional feature space in which the clusters are found. In each analysis cycle, the data stream provides the algorithm with a new feature vector that can be interpreted as a data point in this feature space. Data points that are close to each other in this feature space are assigned to the same cluster. The number of clusters present must be set by the user before the analysis begins and remains fixed.
In contrast to the k-Means algorithm for conventional batch analysis, the data for Sequential k-Means are not fully available at the time of analysis. Instead, the data points arrive one by one as streaming data. They are therefore processed sequentially, and each is assigned to the cluster closest to it. This approach results in a number of differences, two of which are particularly relevant to the use of the algorithm and to its parameter settings.
On the one hand, all data points and thus the value ranges of the individual features are already available at the beginning of a batch analysis. This is not the case with sequential analysis, so the value ranges are not necessarily fixed in advance. However, it is helpful to know the value ranges of the input channels in advance, even if the actual values only arrive during the course of the analysis. This is particularly important for the initialization of the cluster centers. Three different approaches are available for initialization. The center points can be specified as specific values via a parameter array. Alternatively, the center points can be set randomly or equidistantly within a defined range of values. For the initialization modes Random and Equidistant, the value ranges are required and have to be set for the individual input channels via the parameters Lower Bounds and Upper Bounds.
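The three initialization modes can be illustrated with a minimal Python sketch. The function name `init_centers` and its arguments are hypothetical and not part of the product. In particular, the Equidistant branch assumes that the centers are spaced evenly on the straight line between the lower and upper bounds; the actual placement used by the algorithm may differ.

```python
import random

def init_centers(mode, num_clusters, lower=None, upper=None, values=None, seed=None):
    """Return num_clusters initial centers, one list of channel values per cluster."""
    if mode == "Values":
        # One row per cluster, one column per input channel.
        if len(values) != num_clusters:
            raise ValueError("values needs one row per cluster")
        return [[float(v) for v in row] for row in values]

    # Random and Equidistant both need a value range per input channel.
    if len(lower) != len(upper) or any(lo > hi for lo, hi in zip(lower, upper)):
        raise ValueError("bounds must have equal length and lower <= upper")

    if mode == "Random":
        rng = random.Random(seed)
        return [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
                for _ in range(num_clusters)]

    if mode == "Equidistant":
        # Assumption: even spacing along the line from Lower Bounds to Upper Bounds.
        steps = max(num_clusters - 1, 1)
        return [[lo + (hi - lo) * k / steps for lo, hi in zip(lower, upper)]
                for k in range(num_clusters)]

    raise ValueError(f"unknown mode: {mode}")
```

For example, `init_centers("Equidistant", 3, lower=[0, 10], upper=[4, 20])` places the first center at the lower bounds, the last at the upper bounds, and the second halfway between them.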
On the other hand, in a batch analysis all data points are typically traversed multiple times to update the cluster centers until they change only minimally. This is not possible within the framework of sequential analysis. To still be able to adjust the cluster centers and traverse data points multiple times, the Sequential k-Means algorithm has a buffering mechanism referred to as the Aggregation Buffer, which temporarily stores a limited number of values.
While the buffer is being filled, each incoming data point is assigned to the closest cluster. The distance between a data point and the cluster centers is determined by the Euclidean norm. The cluster centers are only updated once the buffer is full, based on the data points newly allocated in the buffer. The new cluster center corresponds to the mean value of all data points contained in the cluster. This mean can be calculated incrementally, so the old data points are not needed for the calculation.
The size of the buffer is set by the parameter Aggregation Buffer Size; the default value is 10. The parameter Max Iterations specifies the number of iterations through the buffer; the default value is 1. If the value is set to 2, for example, the data points in the buffer are reassigned to the clusters after the first adjustment of the cluster centers, and the cluster centers are then adjusted again. Because the cluster centers shift, individual data points may be assigned to different clusters from one iteration to the next.
Because the computing capacity available for data processing between two cycles is limited, excessively high values should be avoided for the parameters Aggregation Buffer Size and Max Iterations; otherwise, the update of the cluster centers cannot be guaranteed. If the cluster centers are updated for small values of these parameters but not for large ones, this indicates that the computing capacity is insufficient for the configured values, and smaller values should be selected.
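The buffering mechanism described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: the class `SequentialKMeans` and all of its members are hypothetical names. The sketch also assumes that each iteration over the buffer recomputes the centers from the state before the buffer was filled, so that buffered points are never counted twice.

```python
import math

class SequentialKMeans:
    def __init__(self, centers, buffer_size=10, max_iterations=1):
        self.centers = [[float(v) for v in c] for c in centers]
        self.counts = [0] * len(self.centers)  # points absorbed per cluster so far
        self.buffer_size = buffer_size
        self.max_iterations = max_iterations
        self.buffer = []

    def nearest(self, point):
        """Return (cluster index, Euclidean distance) of the closest center."""
        dists = [math.dist(point, c) for c in self.centers]
        index = min(range(len(dists)), key=dists.__getitem__)
        return index, dists[index]

    def step(self, point):
        """Process one cycle: assign the point, update centers when the buffer is full."""
        index, distance = self.nearest(point)
        self.buffer.append(list(point))
        if len(self.buffer) >= self.buffer_size:
            self._update_from_buffer()
            self.buffer.clear()
        return index, distance

    def _update_from_buffer(self):
        # Remember the state before this buffer so every iteration starts from it.
        base_centers = [c[:] for c in self.centers]
        base_counts = self.counts[:]
        for _ in range(self.max_iterations):
            sums = [[0.0] * len(c) for c in base_centers]
            hits = [0] * len(base_centers)
            # Reassign all buffered points using the current centers.
            for p in self.buffer:
                k, _dist = self.nearest(p)
                hits[k] += 1
                sums[k] = [s + v for s, v in zip(sums[k], p)]
            for k in range(len(base_centers)):
                total = base_counts[k] + hits[k]
                if total == 0:
                    # Cluster has never received a point: keep its initial center.
                    self.centers[k] = base_centers[k][:]
                    self.counts[k] = 0
                    continue
                # Incremental mean: old mean weighted by old count, plus the new points.
                self.centers[k] = [(base_counts[k] * c + s) / total
                                   for c, s in zip(base_centers[k], sums[k])]
                self.counts[k] = total
```

Only the running count and the current center are kept per cluster, which is why older data points do not need to be stored. The cost of one update grows with the product of buffer size, number of iterations, and number of clusters, which is consistent with the advice to keep both parameters small.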
Optionally, a Boolean signal can be selected for the Enable Execution input so that the algorithm is only active if the value of the selected signal is TRUE.
Input values
- Update Cluster Centers: If TRUE, the centers of each cluster are updated by the incoming data. If FALSE, the cluster centers remain unchanged and are only used to determine the cluster index of the incoming data points.
- Input 01, ..., Input 0n: These inputs form the n-dimensional feature space for which clustering is performed.
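The effect of the Update Cluster Centers input can be illustrated with a small Python sketch. The function `run_cycle` is hypothetical. For brevity it updates the center immediately after every point, which corresponds to a buffer of size 1, and it assumes that `counts` holds the number of points each center already represents.

```python
import math

def run_cycle(point, centers, counts, update_cluster_centers):
    """Assign one data point; move the matching center only if updating is enabled."""
    dists = [math.dist(point, c) for c in centers]
    k = min(range(len(centers)), key=dists.__getitem__)
    if update_cluster_centers:
        # Incremental mean update of the assigned center (simplified, no buffer).
        counts[k] += 1
        centers[k] = [c + (v - c) / counts[k] for c, v in zip(centers[k], point)]
    return k
```

With the flag set to FALSE the function only classifies the point, which matches the use case of applying previously learned centers to new data.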
Configuration options
- Num Channels: Determines the number of input channels.
- Number of Clusters: Defines the number of clusters.
- Aggregation Buffer Size: Specifies the size of the aggregation buffer and thus the number of cycles after which the cluster centers are updated. The input values of these cycles are stored internally (in the aggregation buffer). The default value for this parameter is 10.
- Max Iterations: Specifies how often to iterate over the values in the aggregation buffer. The default value for this parameter is 1.
- Initialization Mode: Specifies the way in which the cluster centers are initialized:
  - Random: The cluster centers are set randomly within the limits set by the Lower Bounds and Upper Bounds.
  - Equidistant: The cluster centers are distributed equidistantly in the range of values defined by the Lower Bounds and Upper Bounds.
  - Values: The cluster centers are initialized with the values set by the array Initial Cluster Centers.
- Initial Cluster Centers: For the initialization mode Values the values for the initial cluster centers are set here. The values for the individual clusters are set line by line. That is, the number of matrix rows corresponds to the Number of Clusters and the number of matrix columns corresponds to the Number of Channels. The first row contains the values for the first cluster for each input channel, and so on.
- Lower Bounds: For the modes Random and Equidistant the lower limits for the individual input channels are set.
- Upper Bounds: For the modes Random and Equidistant the upper limits for the individual input channels are set.
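To make the array layouts concrete, the following Python sketch writes out one possible parameter set for two input channels and three clusters and checks that the shapes are consistent. All values are illustrative, and the dictionary and the helper `check_config` are hypothetical; the product is configured through its own engineering interface, not through Python.

```python
# Illustrative values only; the keys mirror the parameter names listed above.
config = {
    "Num Channels": 2,
    "Number of Clusters": 3,
    "Aggregation Buffer Size": 10,
    "Max Iterations": 1,
    "Initialization Mode": "Values",
    # One row per cluster, one column per input channel.
    "Initial Cluster Centers": [
        [0.0, 10.0],
        [2.0, 15.0],
        [4.0, 20.0],
    ],
    # One entry per input channel.
    "Lower Bounds": [0.0, 10.0],
    "Upper Bounds": [4.0, 20.0],
}

def check_config(cfg):
    """Return True if the arrays required by the chosen mode have matching shapes."""
    n, k = cfg["Num Channels"], cfg["Number of Clusters"]
    mode = cfg["Initialization Mode"]
    if mode == "Values":
        rows = cfg["Initial Cluster Centers"]
        return len(rows) == k and all(len(r) == n for r in rows)
    if mode in ("Random", "Equidistant"):
        lo, hi = cfg["Lower Bounds"], cfg["Upper Bounds"]
        return len(lo) == n and len(hi) == n and all(a <= b for a, b in zip(lo, hi))
    return False
```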
Output values
- Cluster Index: Specifies the index of the cluster to which the data point of the last cycle was assigned.
- Distance: Specifies the Euclidean distance between the data point and the assigned cluster center.
- Cluster Centers: Outputs the cluster centers of all clusters line by line. This corresponds to a matrix of dimension Number of Clusters x Number of Channels.
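The three outputs can be written compactly as follows. The symbols are notation introduced here for illustration: x is the current data point with n channels, c_k is the center of cluster k, K is the Number of Clusters, and k* is the Cluster Index. Whether the index starts at 0 or 1 is not specified in this description.

```latex
k^{*} = \arg\min_{k \in \{1,\dots,K\}} \lVert x - c_k \rVert_2,
\qquad
d = \lVert x - c_{k^{*}} \rVert_2 = \sqrt{\sum_{i=1}^{n} \left( x_i - c_{k^{*},i} \right)^2},
\qquad
C = \begin{pmatrix} c_1^{\top} \\ \vdots \\ c_K^{\top} \end{pmatrix} \in \mathbb{R}^{K \times n}
```

Here d corresponds to the Distance output and C to the Cluster Centers output, with one row per cluster and one column per input channel.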
Standard HMI Controls
For the Sequential k-Means algorithm, the following HMI controls are available for generating an Analytics Dashboard:
1. The Table Control or Multivalue Control visualizes the output values: Lower Bounds, Upper Bounds and Initial Centers.
Alternatively, customer-specific HMI controls can be mapped in the Sequential k-Means algorithm using the Mapping Wizard.