Title: 数据降维与K-均值聚类的质量评估 (Data Dimensionality Reduction and Clustering Quality Evaluation of K-means Clustering)
Abstract (Chinese): Cluster analysis is widely used in the big-data era, but effective methods for intuitively evaluating clustering quality are lacking. To this end, a processing scheme is proposed that combines data dimensionality reduction with a search for the number of clusters inherent in the data. An augmented matrix is constructed on the basis of the data scatter matrices, and linear discriminant analysis (LDA) is used to transform high-dimensional data into the most discriminative low-dimensional feature subspace, achieving dimensionality reduction. To address the random-initialization problem of partitional clustering algorithms, a min-max rule is proposed, which avoids empty clusters and ensures the separability of the data. For the clustering results, the silhouette coefficient of each cluster is computed, and the clustering quality under different numbers of clusters is evaluated by comparing silhouette sizes. Simulation results with the K-means algorithm show that this scheme not only visually determines the number of clusters inherent in unknown data, but also provides an effective analysis method for high-dimensional data.
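The LDA projection described above can be illustrated from first principles. The following is a minimal NumPy sketch of classical Fisher LDA built from the within-class and between-class scatter matrices; the toy data and the `lda_project` helper are illustrative assumptions, not the paper's augmented-matrix construction.

```python
import numpy as np

def lda_project(X, y, k=2):
    """Project X onto the k most discriminative axes via Fisher's LDA.

    Builds the within-class scatter S_w and between-class scatter S_b,
    then takes the top eigenvectors of S_w^{-1} S_b.
    """
    classes = np.unique(y)
    n_features = X.shape[1]
    mean_total = X.mean(axis=0)
    S_w = np.zeros((n_features, n_features))
    S_b = np.zeros((n_features, n_features))
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)           # within-class scatter
        diff = (mean_c - mean_total).reshape(-1, 1)
        S_b += len(Xc) * diff @ diff.T                    # between-class scatter
    # eigenvectors of S_w^{-1} S_b, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real
    return X @ W

# toy data: two Gaussian blobs in 3-D, projected onto 1 discriminative axis
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
Z = lda_project(X, y, k=1)
print(Z.shape)  # (100, 1)
```

The projection axis is the direction along which the class means are farthest apart relative to the within-class spread, which is why LDA maximizes separability after dimensionality reduction.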
Abstract: In the age of big data, data analysis is becoming more and more important, and one of the most important tasks in data analysis is data classification. In pattern recognition and machine learning, classification can be divided into supervised and unsupervised classification. In supervised classification, the data include both features and class labels. However, in practical applications, data sources are usually obtained through sensor devices, and there are no available class labels for the data. As a result, unsupervised classification, especially clustering, plays a crucial role in data analysis. Clustering, as an exploratory data analysis method, can discover the inherent structure of raw data by grouping data samples into different clusters. In the era of the mobile internet, the dimensionality and structure of data are becoming more complex, so cluster analysis of high-dimensional data is inevitable. For the huge amount of data that needs to be processed, and to more easily organize, summarize and extract the useful information contained in the data, compression has also become a very important topic. Data compression (dimensionality reduction) transforms the data into a new feature subspace with a lower dimensionality. Dimensionality reduction mainly includes feature selection and feature extraction. Feature selection selects a subset of the features; in feature extraction, the relevant information is derived from the feature set in order to construct a new feature subspace. Obviously, dimensionality reduction is not only a basic preprocessing step but is also conducive to data visualization.

Based on the properties of the generated clusters, clustering can be divided into partitional clustering and hierarchical clustering. In academic and industrial fields, however, partitional clustering is the most widely used. Among the various partitional clustering methods, the K-means clustering algorithm has become the most classic and popular, largely because of its low computational complexity. The K-means algorithm has achieved very good clustering results in many practical applications, especially when the clusters are compact and hyper-spherical in shape. Although K-means is by far the most widely used clustering method, it also has several major disadvantages. The first problem is that the iterative optimization process cannot guarantee that the algorithm converges to the global optimum; that is, K-means clustering can converge to a local optimum. Due to the use of randomly assigned centroid positions in K-means, the centroids may be too close together and may be merged in subsequent iterations, resulting in the algorithm generating one or more empty clusters. In other words, the K-means algorithm has an initialization problem, and in theory there is no effective and general scheme to determine a reasonable initialization. The second flaw is that the K-means algorithm assumes that the user knows in advance the number of clusters in the data. However, choosing the right number of clusters can be very difficult because the inherent structure of real data is unknown. As in the case of random initialization, there is also no efficient scheme to correctly select the number of clusters. The third drawback is that the K-means algorithm is sensitive to outliers and noise in the data. Even if a data point is far from the center of a cluster, the K-means algorithm still forces the point to be included in the cluster and used to calculate the centroid, which distorts the shape of the cluster.

Aiming at these three problems of the K-means algorithm, this paper proposes corresponding improvement schemes. First, the locations of the initial centroids are determined by the farthest-initial-center selection principle and the min-max distance rule, which avoids empty clusters in the clustering results and solves the uncertainty of clustering results caused by the random initialization of the classical K-means algorithm. Second, to determine the optimal number of clusters, a method of estimating the range of the number of clusters based on a statistical empirical rule is proposed; within this range, by observing the curve of the within-cluster sum of squared errors (SSE) against the number of clusters, the elbow method is used to intuitively determine the inherent cluster structure of the data. Third, to optimize the performance of the K-means algorithm, an important feature scaling method, standardization, is used as data preprocessing. The standardized data obey a normal distribution with zero mean and unit variance, which addresses the problem that the K-means algorithm is sensitive to outliers and noise. Fourth, compared with other feature extraction techniques, such as principal component analysis (PCA) and kernel PCA, supervised linear discriminant analysis (LDA) is proposed to compress high-dimensional data into low-dimensional feature subspaces for data visualization; more importantly, LDA is a dimensionality-reduction method that maximizes cluster separability. Finally, to evaluate the clustering quality, silhouette analysis is used to verify the validity of the clustering: the silhouette coefficient of each cluster is calculated, and by comparing the silhouette sizes, the final number of clusters is determined.
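The farthest-initial-center idea described in the abstract can be sketched in a few lines. The following NumPy illustration uses a common reading of the min-max distance rule (each new centroid maximizes its minimum distance to the centroids already chosen); the toy blobs and the `minmax_init` and `kmeans` helpers are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def minmax_init(X, k, rng):
    """Farthest-first seeding: the first centroid is a random sample; each
    subsequent centroid is the point whose minimum distance to the centroids
    chosen so far is largest (min-max distance rule)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])  # point farthest from all chosen centers
    return np.array(centers)

def kmeans(X, k, iters=100, seed=0):
    """Standard Lloyd iterations on top of min-max seeding."""
    rng = np.random.default_rng(seed)
    C = minmax_init(X, k, rng)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        new_C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels, C

# three well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in [(0, 0), (4, 0), (2, 3)]])
labels, C = kmeans(X, k=3)
print(len(np.unique(labels)))  # 3 non-empty clusters
```

Because the seeds are forced apart, each initial centroid lands in a different dense region, which is how this style of initialization avoids the empty clusters that random seeding can produce.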
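The silhouette evaluation mentioned at the end of the abstract can likewise be sketched directly. For each point, a is its mean distance to members of its own cluster and b is its mean distance to the nearest other cluster; the coefficient s = (b - a) / max(a, b) lies in [-1, 1], with values near 1 indicating well-separated clusters. The `silhouette_score` helper and toy data below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette coefficient over all points."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distance matrix
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        # mean intra-cluster distance, excluding the point itself
        a = D[i, same].sum() / max(same.sum() - 1, 1)
        # mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# two tight, well-separated blobs: correct labels vs. a shuffled labeling
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal((3, 3), 0.2, (30, 2))])
good = np.array([0] * 30 + [1] * 30)
bad = good.copy()
rng.shuffle(bad)
s_good, s_bad = silhouette_score(X, good), silhouette_score(X, bad)
print(s_good > s_bad)  # True
```

Computing this score for each candidate number of clusters and comparing the silhouette sizes is the kind of comparison the abstract uses to pick the final cluster count.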
Authors: 何帆 (HE Fan), 何选森 (HE Xuansen), 刘润宗 (LIU Runzong), 樊跃平 (FAN Yueping), 熊茂华 (XIONG Maohua)
Affiliations: School of Management and Economics, Beijing Institute of Technology, Beijing 100081; School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363; College of Information Science and Engineering, Hunan University, Changsha 410082
Journal: 重庆理工大学学报 (Journal of Chongqing Institute of Technology)
Year, Volume (Issue): 2024, 38(1)
CLC number: TP391
Keywords: clustering quality; scattering matrix; linear discriminant analysis; min-max rule; silhouette analysis
Online publication date: March 6, 2024
Funding: Special Project in Key Fields of Regular Universities in Guangdong Province; Characteristic Innovation Project of the Guangdong Provincial Department of Education