Before reading this article on K-means clustering, please make sure you have gone through Part I.
Let us continue the incremental clustering topic with its fundamental example. I am happy to tell you that the online version of K-means is a simple example of incremental clustering. Aren't you glad? I am sure you are, since K-means is a handy and popular algorithm. K-means is famous for the ease and speed with which it groups large amounts of data.
However, the result of the K-means algorithm depends heavily on the choice of initial cluster centers, because those centers are selected arbitrarily. The conventional K-means algorithm is also computationally costly, because each iteration calculates the distance between every data point and every centroid. To overcome these issues, several incremental K-means clustering methods have been proposed in the literature.
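To make that cost concrete, here is a minimal sketch of one iteration of conventional (batch) K-means in plain Python. The helper names `sq_dist` and `lloyd_iteration` and the toy data are hypothetical, chosen only for illustration; the point is that every iteration evaluates one distance per (point, centroid) pair.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_iteration(points, centroids):
    """Run one batch K-means iteration (hypothetical helper for illustration).

    Returns the updated centroids and the number of distance evaluations.
    """
    k = len(centroids)
    members = [[] for _ in range(k)]
    n_distances = 0
    # Assignment step: compare every point with every centroid.
    for p in points:
        dists = [sq_dist(p, c) for c in centroids]
        n_distances += k
        members[dists.index(min(dists))].append(p)
    # Update step: each centroid becomes the mean of its members.
    new_centroids = []
    for j in range(k):
        if members[j]:
            dim = len(members[j][0])
            new_centroids.append(
                [sum(p[d] for p in members[j]) / len(members[j]) for d in range(dim)]
            )
        else:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[j]))
    return new_centroids, n_distances


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
new_centroids, n_distances = lloyd_iteration(points, [[0.0, 0.0], [10.0, 10.0]])
print(new_centroids, n_distances)
```

With n points and k centroids, the assignment step alone costs n × k distance computations per iteration, and the whole iteration is repeated until the centroids stop moving.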
Online K-means begins with k random centroids and keeps a count of how many points belong to each cluster, initially a single point per cluster. Then, for each new point, it finds the nearest centroid, assigns the point to that centroid, and moves the centroid toward the new point by an amount that shrinks with the number of points already in the cluster. Hence, as a cluster accumulates more points, the weight a new point has on its centroid decreases.
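The update described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `nearest` and `online_kmeans_update`, the starting centroids, and the toy stream are all hypothetical, and the step size 1/n (n being the cluster's point count after adding the new point) is one common choice that keeps each centroid equal to the running mean of its points.

```python
def nearest(centroids, point):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((ci - pi) ** 2 for ci, pi in zip(c, point))
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j]))


def online_kmeans_update(centroids, counts, point):
    """Assign one new point and move its centroid in place (hypothetical helper)."""
    j = nearest(centroids, point)
    counts[j] += 1
    step = 1.0 / counts[j]  # the more points in the cluster, the smaller the move
    centroids[j] = [c + step * (p - c) for c, p in zip(centroids[j], point)]
    return j


# Two starting centroids, each counted as a single point in its cluster.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]

stream = [[1.0, 1.0], [9.0, 11.0], [2.0, 0.0]]
assigned = [online_kmeans_update(centroids, counts, p) for p in stream]
print(assigned, counts, centroids)
```

Each point is seen once and then discarded, so the memory needed is just the k centroids and their counts, no matter how long the stream runs.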
Most of today's databases are dynamic in nature because of the rapid evolution of the Internet. New data is continually added to an older dataset, and traditional algorithms cannot easily re-run clustering every time the data changes.
Because the dataset is dynamic, it is not possible to gather all the data objects before beginning the clustering process. When new data instances arrive, non-incremental clustering methods have to re-cluster the entire dataset, which greatly reduces efficiency.
An open-source implementation of online K-means clustering, based on Lloyd's method, is available in the popular Python library scikit-learn.
For more details and code snippets, see the official site of the scikit-learn library:
A very basic example (Python code):
>>> # import the necessary classes
>>> from sklearn.cluster import MiniBatchKMeans
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> trainLst = ["My name is Pallavi", "My name is Neha", "I write articles for Killiscorner"]
>>> tstLst = ["My name is Pallavi", "I write articles for Killiscorner"]
>>> # the input data is a list of strings; convert it into TF-IDF vectors
>>> vectorizer = TfidfVectorizer(min_df=1, max_df=0.5, stop_words="english", ngram_range=(1, 3))
>>> vec = vectorizer.fit(trainLst)
>>> vectorized = vec.transform(trainLst)
>>> listToVec = vec.transform(tstLst)
>>> # create an object of the MiniBatchKMeans class
>>> minibatchKmeans = MiniBatchKMeans(n_clusters=2, random_state=0, batch_size=6)
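>>> # fit/train on the vectorized training data
>>> minibatchKmeans = minibatchKmeans.fit(vectorized)
>>> # predict the cluster of each test document
>>> predicted = minibatchKmeans.predict(listToVec)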
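The example above trains on the whole training list at once. For data that keeps arriving, `MiniBatchKMeans` also provides a `partial_fit` method that updates the existing centroids from one batch at a time. The following is a minimal sketch, assuming a made-up numeric stream of two batches; the variable names and data are hypothetical, and `n_init=3` is set explicitly only to keep behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical toy stream: each batch holds points near (0, 0) and near (10, 10).
batches = [
    np.array([[0.0, 0.1], [0.2, 0.0], [10.0, 10.1], [9.9, 10.0]]),
    np.array([[0.1, 0.3], [10.2, 9.8], [0.0, 0.2], [10.1, 10.3]]),
]

model = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
for batch in batches:
    # Update the existing centroids with the new batch only;
    # earlier batches are not revisited.
    model.partial_fit(batch)

# Points from the two regions should land in different clusters.
labels = model.predict(np.array([[0.0, 0.0], [10.0, 10.0]]))
print(labels)
print(model.cluster_centers_)
```

Note that the first batch passed to `partial_fit` must contain at least `n_clusters` samples, because the initial centroids are chosen from it.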