
Data Dimensionality Reduction and Clustering Quality Evaluation of K-Means Clustering

Posted by admin on 2024-12-14 03:04

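Document title: Data Dimensionality Reduction and Clustering Quality Evaluation of K-Means Clustering
Summary: Cluster analysis is widely used in the era of big data, but effective methods for intuitively evaluating clustering quality are lacking. To address this, a processing scheme is proposed that combines data dimensionality reduction with a search for the number of clusters inherent in the data. An augmented matrix is constructed from the data scatter matrices, and linear discriminant analysis (LDA) is used to transform high-dimensional data into the most discriminative low-dimensional feature subspace, thereby reducing its dimensionality. To solve the random-initialization problem of partitional clustering algorithms, a min-max rule is proposed that avoids empty clusters and ensures the separability of the data. For the clustering results, the silhouette coefficient of each cluster is computed, and the clustering quality for different numbers of clusters is evaluated by comparing silhouette sizes. Simulation results for the K-means algorithm show that this scheme can not only determine visually the number of clusters inherent in unknown data, but also provide an effective analysis method for high-dimensional data.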
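
The summary does not spell out the min-max rule, so the following is only a minimal sketch of one common reading: farthest-first (maximin) seeding, where each new centroid is the point whose distance to its nearest already-chosen centroid is largest. The function name `maximin_init`, the helper `_dist`, and the choice of the first seed (the point farthest from the data mean) are assumptions made for illustration, not the authors' exact procedure.

```python
import math


def _dist(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def maximin_init(points, k):
    """Pick k initial centroids by farthest-first (maximin) traversal.

    Hypothetical illustration of a min-max style seeding rule.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Assumption: the first seed is the point farthest from the data mean.
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    first = max(range(n), key=lambda i: _dist(points[i], mean))
    chosen = [first]

    # min_dist[i] = distance from point i to its nearest chosen centroid.
    min_dist = [_dist(p, points[first]) for p in points]

    while len(chosen) < k:
        # Maximise the minimum distance to the centroids chosen so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], _dist(p, points[nxt]))

    return [points[i] for i in chosen]
```

Because every seed is itself a data point and the seeds are spread far apart, each seed starts with at least one nearby sample when the data contain at least k distinct points, which is the intuition behind avoiding empty clusters.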

Abstract: In the age of big data, data analysis is becoming more and more important, and one of the most important tasks in data analysis is data classification. In pattern recognition and machine learning, classification can be divided into supervised and unsupervised classification. In supervised classification, the data include both features and class labels. In practical applications, however, data are usually obtained through sensor devices, and no class labels are available. As a result, unsupervised classification, especially clustering, plays a crucial role in data analysis. Clustering, as an exploratory data analysis method, can discover the inherent structure of raw data by grouping data samples into different clusters. In the era of the mobile internet, the dimensionality and structure of data are becoming more complex, so cluster analysis of high-dimensional data is inevitable. Given the huge amount of data to be processed, compression has also become a very important topic, because it makes it easier to organize, summarize and extract the useful information contained in the data. Data compression (dimensionality reduction) transforms the data into a new feature subspace of lower dimensionality. Dimensionality reduction mainly comprises feature selection and feature extraction. Feature selection chooses a subset of the features. In feature extraction, the relevant information is derived from the feature set in order to construct a new feature subspace. Dimensionality reduction is therefore not only a basic preprocessing step, but also conducive to data visualization.

Based on the properties of the generated clusters, clustering can be divided into partitional clustering and hierarchical clustering. In academic and industrial fields, partitional clustering is the most widely used. Among the various partitional clustering methods, the K-means algorithm has become the most classic and popular, owing to its low computational complexity. The K-means algorithm achieves very good clustering results in many practical applications, especially when the clusters are compact and hyperspherical in shape. Although K-means is by far the most widely used clustering method, it has several major disadvantages. The first problem is that the iterative optimization process cannot guarantee convergence to the global optimum, i.e., K-means clustering can converge to a local optimum. Because K-means uses randomly assigned centroid positions, the centroids may be too close together and may be merged in subsequent iterations, so that the algorithm generates one or more empty clusters. That is, the K-means algorithm needs a reasonable initialization, and in theory there is no effective and general scheme for determining one. The second flaw is that the K-means algorithm assumes that the user knows the number of clusters in the data in advance. Choosing the right number of clusters can be very difficult, however, because the inherent structure of real data is unknown. As with random initialization, there is no efficient scheme for correctly selecting the number of clusters. The third drawback is that the K-means algorithm is sensitive to outliers and noise in the data. Even if a data point is far from the center of a cluster, the K-means algorithm still forces the point into the cluster and uses it to calculate the centroid, which distorts the shape of the cluster.

To address these three problems of the K-means algorithm, this paper proposes corresponding improvements. First, the locations of the initial centroids are determined by the farthest-initial-center selection principle and the min-max distance rule, which avoids empty clusters in the clustering results and removes the uncertainty in the results caused by the random initialization of the classical K-means algorithm. Second, to determine the optimal number of clusters, a method is proposed for estimating the range of the number of clusters from a statistical empirical rule; within this range, the curve of the within-cluster sum of squared errors (SSE) against the number of clusters is observed, and the elbow method is used to determine intuitively the inherent cluster structure of the data. Third, to optimize the performance of the K-means algorithm, an important feature scaling method, standardization, is used as data preprocessing. The standardized data have zero mean and unit variance, which addresses the sensitivity of the K-means algorithm to outliers and noise. Fourth, in comparison with other feature extraction techniques such as principal component analysis (PCA) and kernel PCA, supervised linear discriminant analysis (LDA) is proposed for compressing high-dimensional data into a low-dimensional feature subspace for data visualization; more importantly, LDA is a dimensionality reduction method that maximizes cluster separability. Finally, to evaluate the clustering quality, silhouette analysis is used to verify the validity of the clustering. The silhouette coefficient of each cluster is calculated, and the final number of clusters is determined by comparing the silhouette sizes.
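
As a minimal sketch of the silhouette step, the block below computes the standard silhouette coefficient s = (b - a) / max(a, b) for every sample and averages it per cluster, where a is the mean distance to the other members of the sample's own cluster and b is the smallest mean distance to any other cluster. The function name `silhouette_by_cluster` and the helper `_dist` are hypothetical, singleton clusters are scored 0 by the usual convention, and the paper's own implementation may differ.

```python
import math
from collections import defaultdict


def _dist(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_by_cluster(points, labels):
    """Return {label: mean silhouette coefficient of that cluster}."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette analysis needs at least two clusters")

    scores = defaultdict(list)
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores[lab].append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself contributes 0 to the sum).
        a = sum(_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(_dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores[lab].append((b - a) / denom if denom > 0 else 0.0)

    return {lab: sum(s) / len(s) for lab, s in scores.items()}
```

To compare candidate values of K, one would run the clustering for each K, call this function on the resulting labels, and prefer the K whose clusters all have high and similarly sized silhouette values.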

Authors: HE Fan, HE Xuansen, LIU Runzong, FAN Yueping, XIONG Maohua
Affiliations: School of Management and Economics, Beijing Institute of Technology, Beijing 100081; School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363; College of Information Science and Engineering, Hunan University, Changsha 410082
Journal: 重庆理工大学学报 (Journal of Chongqing Institute of Technology)
Year, volume (issue): 2024, 38(1)
Classification code: TP391
Keywords: clustering quality; scatter matrix; linear discriminant analysis; min-max rule; silhouette analysis
Online publication date: March 6, 2024
Funding: Guangdong Province Key Field Special Project for Regular Higher Education Institutions; Guangdong Provincial Department of Education Characteristic Innovation Project

