Implementing the K-means Clustering Algorithm in Python

老牧童 • 2024-03-24 18:26 • python

1. Definition of K-means clustering
2. Problem description
3. Code implementation

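1. The K-means Clustering Algorithm

K-means clustering algorithm: The k-means clustering algorithm is an iterative clustering method. It begins by randomly selecting K objects as the initial cluster centers, then computes the distance between every object and each seed center and assigns each object to the nearest center. A cluster center together with the objects assigned to it represents one cluster. Each time a sample is assigned, the center of its cluster is recomputed from the objects currently in the cluster. This process repeats until a termination condition is met. (Source: Baidu Baike)
My understanding: k-means takes K randomly chosen objects as the initial cluster centers and assigns every object to the nearest center. Each iteration regroups the data around the updated cluster centers, and the iterations stop once a stopping condition is satisfied.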
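
The assignment step described above can be sketched in a few lines. This is a minimal standalone illustration, not the article's class: `euclidean` and `nearest_center` are hypothetical helper names, and the sample points are made up for the example.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_center(point, centers):
    """Index of the center closest to point (ties go to the lowest index)."""
    distances = [euclidean(point, c) for c in centers]
    return distances.index(min(distances))

# Made-up sample data: two centers and three points to assign
centers = [[0, 0], [10, 10]]
points = [[1, 2], [9, 8], [4, 4]]
labels = [nearest_center(p, centers) for p in points]
print(labels)  # [0, 1, 0]
```

Each label is the index of the cluster the point joins; repeating this for every point yields one grouping per iteration.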

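2. Problem Description

Problem: Randomly generate 100 points and divide them into N clusters. Randomly pick N of the 100 points as the initial cluster centers, compute the distance from every other point to each of these N centers, and assign each point to its nearest center.
Convergence condition: Compute the new center (centroid) of each cluster as the mean of the x and y coordinates of the points in that cluster. Iteration ends when the average distance between the current cluster centers and the previous ones falls below threshold (the tolerance).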
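
The centroid update and the stopping test can be illustrated with a small worked example. This is a sketch under assumptions: `centroid` and `mean_shift` are hypothetical names, the clusters are made-up data, and the tolerance 0.0001 is simply the value used in the article's example call.

```python
def centroid(cluster):
    """Mean of each coordinate over the points in a non-empty cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def mean_shift(old_centers, new_centers):
    """Average Euclidean distance between matching old and new centers."""
    total = 0.0
    for old, new in zip(old_centers, new_centers):
        total += sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
    return total / len(old_centers)

# Made-up clusters and previous centers
clusters = [[[1, 1], [3, 5]], [[10, 10], [12, 14]]]
old_centers = [[2, 2], [11, 12]]

new_centers = [centroid(c) for c in clusters]
print(new_centers)         # [[2.0, 3.0], [11.0, 12.0]]

shift = mean_shift(old_centers, new_centers)
print(shift)               # 0.5

threshold = 0.0001         # assumed tolerance
print(shift <= threshold)  # False, so another iteration would run
```

Here the first center moved by 1.0 and the second did not move, so the average shift is 0.5, well above the tolerance.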

3. Code Implementation

import random
import matplotlib.pyplot as plt


class Kmeans():
    def __init__(self, k):
        ''' Initialize. :param k: number of cluster centers '''
        self.__k = k
        self.__data = []  # original data points passed to fit()
        self.__pointCenter = []   # cluster center points
        self.__result = []      # final clustering result
        for i in range(k):      # one empty sub-list per cluster, e.g. [[],[],[],[],[]] for k=5
            self.__result.append([])   # each sub-list holds the points of one cluster

    def calDistance(self,points1,points2):
        ''' Euclidean distance: sqrt((x1-x2)^2+(y1-y2)^2) :param points1: 1-D list :param points2: 1-D list :return: straight-line distance between the two points '''
        distance=(sum([(x1-x2)**2 for x1,x2 in zip(points1,points2)]))**0.5  # square root is the same as raising to the power 1/2
        return distance

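    def randomCenter(self):
        ''' Fill self.__pointCenter with the initial cluster centers, drawn at random from the data. :return: '''
        while len(self.__pointCenter)<self.__k:
            # randrange gives an index in 0..len(self.__data)-1;
            # randint(0, len(self.__data)) includes the upper bound and could raise IndexError
            index=random.randrange(len(self.__data))
            if self.__data[index] not in self.__pointCenter:   # skip points already chosen as a center
                self.__pointCenter.append(self.__data[index])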

    def calPointToCenterDistance(self,data,center):
        ''' Compute the distance from every point to every cluster center. :param data: the original data points :param center: the cluster centers :return: one list of distances per point '''
        distance=[]
        for i in data:
            distance.append([self.calDistance(i,centerpoint) for centerpoint in center])
        return distance

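    # Unused draft; fit() performs this step inline.
    # def sortPoint(self,distance):
    #     '''
    #     Classify the original data: assign each point to its nearest cluster center
    #     :param distance: the computed distances
    #     :return: the final classification result
    #     '''
    #     for i in distance:
    #         index=i.index(min(i))  # index of the smallest of the k distances
    #         self.__result[index].append(self.__data[i])  # classify by index
    #     return self.__result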

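    def calNewCenterPoint(self,result):
        ''' Compute the new center points as the mean of the points in each cluster. :param result: clustering result :return: the new cluster centers '''
        newCenterPoint1=[]
        for temp in result:
            if not temp:
                # An empty cluster has no mean (temp[0] would raise IndexError).
                # As a simple fallback, reuse a random data point as its center.
                newCenterPoint1.append(list(random.choice(self.__data)))
                continue
            # Transpose N*M into M*N so all x values and all y values sit in their own lists,
            # e.g. [[x1,y1],[x2,y2]] becomes [[x1,x2],[y1,y2]], which makes averaging easy
            temps=[[temp[x][i] for x in range(len(temp))] for i in range(len(temp[0]))]
            point=[]
            for i in temps:
                point.append(sum(i)/len(i))  # sum divided by count gives the mean
            newCenterPoint1.append(point)
        return newCenterPoint1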

    def calCenterToCenterDistance(self,old,new):
        ''' Stopping criterion: the average distance between the old and the new center points. :param old: previous centers :param new: new centers :return: mean shift of the centers '''
        total=0
        for point1,point2 in zip(old,new):
            total += self.calDistance(point1,point2)
        return total/len(old)


    def fit(self,data,threshold,times=50000):
        ''' Run k-means. :param data: list of points :param threshold: stop once the mean shift of the centers is at or below this value :param times: maximum number of iterations :return: final centers and clusters '''
        self.__data = data
        self.randomCenter()
        print(self.__pointCenter)
        centerDistance = self.calPointToCenterDistance(self.__data, self.__pointCenter)

        # Classify the original data: assign each point to its nearest center
        i = 0
        for temp in centerDistance:
            index = temp.index(min(temp))
            self.__result[index].append(self.__data[i])
            i += 1
        # Print the clustering result
        # print(self.__result)
        oldCenterPoint = self.__pointCenter
        newCenterPoint = self.calNewCenterPoint(self.__result)

        # Stop when the centers settle or the iteration budget (times) is used up
        while times > 0 and self.calCenterToCenterDistance(oldCenterPoint, newCenterPoint) > threshold:
            times -= 1
            result = []
            for i in range(self.__k):
                result.append([])
            # Remember the previous centers
            oldCenterPoint = newCenterPoint
            centerDistance = self.calPointToCenterDistance(self.__data,newCenterPoint)

            # Classify the original data: assign each point to its nearest center
            i = 0
            for temp in centerDistance:
                index = temp.index(min(temp))
                result[index].append(self.__data[i])  # result = [[[10,20]]]
                i += 1

            newCenterPoint = self.calNewCenterPoint(result)
            print(self.calCenterToCenterDistance(oldCenterPoint, newCenterPoint))
            self.__result = result
        self.__pointCenter = newCenterPoint
        return newCenterPoint, self.__result


if __name__ == "__main__":
    # 100 random 2-D points with integer coordinates in [1, 100]
    data = [[random.randint(1, 100), random.randint(1, 100)] for i in range(100)]
    kmeans = Kmeans(k=5)
    centerPoint, result = kmeans.fit(data, 0.0001)
    print(centerPoint)
    plt.title("KMeans Classification")
    tempx = []
    tempy = []
    color = []
    for label, temp in enumerate(result):
        if not temp:
            continue  # nothing to plot for an empty cluster
        # Transpose [[x1,y1],[x2,y2],...] into [[x1,x2,...],[y1,y2,...]]
        temps = [[temp[x][i] for x in range(len(temp))] for i in range(len(temp[0]))]
        color += [label] * len(temps[0])  # one numeric color value per cluster
        tempx += temps[0]
        tempy += temps[1]
    plt.scatter(tempx, tempy, c=color, s=30)
    plt.show()

The figure shows the resulting cluster groups.
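
The nested list comprehension that transposes a cluster, used both when averaging coordinates and when preparing the plot, can also be written with `zip(*...)`. The sketch below shows that equivalent idiom; `split_xy` is a hypothetical helper name, and it assumes 2-D points as in the example above.

```python
def split_xy(cluster):
    """Transpose [[x1, y1], [x2, y2], ...] into ([x1, x2, ...], [y1, y2, ...])."""
    if not cluster:
        return [], []  # zip(*[]) yields nothing, so handle empty clusters explicitly
    xs, ys = zip(*cluster)
    return list(xs), list(ys)

cluster = [[1, 4], [2, 5], [3, 6]]
xs, ys = split_xy(cluster)
print(xs, ys)                                # [1, 2, 3] [4, 5, 6]
print(sum(xs) / len(xs), sum(ys) / len(ys))  # 2.0 5.0
```

The explicit check for an empty cluster matters because unpacking the result of `zip(*[])` into two names would raise a ValueError.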

