
How to use Mahout canopy


This article explains how to use Mahout canopy clustering. It first walks through the algorithm and its parameters, then runs a small example on Hadoop; hopefully it resolves any questions you have about how canopy is used.

The canopy principle: one implementation of a clustering algorithm
Canopy is a simple and fast, but not very accurate, clustering method
Canopy is a small but effective clustering method; its procedure is as follows
1. Let the sample set be S, and choose two thresholds t1 and t2 with t1 > t2
2. Take any point p in S as a new canopy, denoted C, and remove p from S
3. Compute the distance dist from every point in S to p
4. If dist < t1, add that point to C
5. If dist < t2, remove that point from S
Repeat steps 2-5 until S is empty
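The steps above can be sketched in plain Python. This is a toy illustration only, not Mahout's MapReduce implementation; the point set, thresholds, and `euclidean` helper are made up for the demo:

```python
import random

def canopy(points, t1, t2, dist):
    """Canopy clustering sketch. t1 is the loose threshold, t2 the tight one (t1 > t2)."""
    assert t1 > t2
    s = list(points)
    canopies = []
    while s:
        p = s.pop(random.randrange(len(s)))  # step 2: pick an arbitrary center
        members = [p]
        remaining = []
        for q in s:
            d = dist(p, q)                   # step 3: distance to the center
            if d < t1:
                members.append(q)            # step 4: loosely bound, joins this canopy
            if d >= t2:
                remaining.append(q)          # step 5: only tightly bound points leave S
        s = remaining
        canopies.append((p, members))
    return canopies

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# two well-separated pairs of points -> expect 2 canopies
pts = [(0.0, 0.0), (0.5, 0.1), (10.0, 10.0), (10.2, 9.9)]
result = canopy(pts, t1=3.0, t2=1.0, dist=euclidean)
print(len(result))  # 2
```

Note that a point with t2 <= dist < t1 both joins the current canopy and stays in S, so canopies may overlap; that overlap is what T1 and T2 control below.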
The T1 and T2 parameters
If T1 is too large, many points will belong to multiple canopies, which can leave the canopy centers close together and the clusters poorly separated
If T2 is too large, more points are strongly marked (removed from S) and the number of clusters shrinks; if T2 is too small, the number of clusters grows and the computation time increases with it
Mahout's implementation of canopy clustering is quite clever: the entire clustering is done with two map operations and one reduce operation
Canopy construction can be summarized as: traverse the given point set S with two thresholds t1 and t2, where t1 > t2; for each point, use a low-cost measure to compute its distance to the existing canopy centers; if the distance is below t1, add the point to that canopy; if it is below t2, the point can no longer become the center of a new canopy. Repeat the whole process until S is empty
Distance implementations
All measures implement the org.apache.mahout.common.distance.DistanceMeasure interface
CosineDistanceMeasure computes the cosine distance
SquaredEuclideanDistanceMeasure computes the squared Euclidean distance
EuclideanDistanceMeasure computes the Euclidean distance
ManhattanDistanceMeasure computes the Manhattan (L1) distance
TanimotoDistanceMeasure computes the Tanimoto (weighted Jaccard) distance; weighted variants of the Euclidean and Manhattan measures are also available
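For intuition, here are plain-Python versions of several of these measures. This is a sketch with made-up test vectors; Mahout's classes operate on its own Vector type, not Python lists:

```python
import math

def squared_euclidean(a, b):
    # SquaredEuclideanDistanceMeasure: sum of squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # EuclideanDistanceMeasure
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # ManhattanDistanceMeasure: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # CosineDistanceMeasure: 1 - cos(angle between a and b)
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def tanimoto_distance(a, b):
    # TanimotoDistanceMeasure: 1 - a.b / (|a|^2 + |b|^2 - a.b)
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

a, b = [1.0, 2.0], [2.0, 4.0]
print(squared_euclidean(a, b))        # 5.0
print(manhattan(a, b))                # 3.0
print(round(cosine_distance(a, b), 6))  # 0.0 (b is parallel to a)
```

Squared Euclidean is the cheap default in the run below: it preserves the distance ordering while skipping the square root.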
Notes on using canopy
1. First, the choice of the lightweight distance measure: whether to use an attribute of the model itself or some external attribute matters a lot for how the canopies are distributed
2. The values of T1 and T2 affect the overlap F and the granularity of the canopies
3. Canopy can eliminate outliers, which k-means cannot do by itself: after the canopies are built, delete those containing few points, since they tend to hold the outliers
4. Use the number of points per canopy to decide the number of cluster centers k; this usually gives good results
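Points 3 and 4 can be illustrated with a small sketch. The `canopies` list of (center, members) pairs below is hypothetical output from a canopy pass, not Mahout data:

```python
# Hypothetical canopy output: (center, member_points) pairs
canopies = [
    ((0.0, 0.0),   [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]),
    ((5.0, 5.0),   [(5.0, 5.0), (5.1, 4.9)]),
    ((40.0, -3.0), [(40.0, -3.0)]),  # nothing nearby: a likely outlier
]

min_size = 2  # canopies smaller than this are treated as outlier holders
kept = [c for c in canopies if len(c[1]) >= min_size]  # point 3: drop outlier canopies

k = len(kept)  # point 4: surviving canopy count becomes k for a k-means pass
print(k)  # 2
```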
[root@localhost bin]# hadoop fs -mkdir /20140824
[root@localhost data]# vi test-data.csv
1 -0.213  -0.956  -0.003  0.056  0.091  0.017  -0.024  1
1 3.147  2.129  -0.006  -0.056  -0.063  -0.002  0.109  0
1 -2.165  -2.718  -0.008  0.043  -0.103  -0.156  -0.024  1
1 -4.337  -2.686  -0.012  0.122  0.082  -0.021  -0.042  1
[root@localhost data]# hadoop fs -put test-data.csv /20140824
[root@localhost mahout-distribution-0.7]# hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job -i /20140824/test-data.csv -o /20140824 -t1 10 -t2 1
16/12/05 05:37:09 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/05 05:37:13 INFO input.FileInputFormat: Total input paths to process : 1
16/12/05 05:37:14 INFO mapreduce.JobSubmitter: number of splits:1
16/12/05 05:37:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480730026445_0005
16/12/05 05:37:17 INFO impl.YarnClientImpl: Submitted application application_1480730026445_0005
16/12/05 05:37:17 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1480730026445_0005/
16/12/05 05:37:17 INFO mapreduce.Job: Running job: job_1480730026445_0005
16/12/05 05:38:26 INFO mapreduce.Job: Job job_1480730026445_0005 running in uber mode : false
16/12/05 05:38:27 INFO mapreduce.Job:  map 0% reduce 0%
16/12/05 05:39:25 INFO mapreduce.Job:  map 100% reduce 0%
16/12/05 05:39:28 INFO mapreduce.Job: Job job_1480730026445_0005 completed successfully
16/12/05 05:39:30 INFO mapreduce.Job: Counters: 30
   File System Counters
     FILE: Number of bytes read=0
     FILE: Number of bytes written=105369
     FILE: Number of read operations=0
     FILE: Number of large read operations=0
     FILE: Number of write operations=0
     HDFS: Number of bytes read=339
     HDFS: Number of bytes written=457
     HDFS: Number of read operations=5
     HDFS: Number of large read operations=0
     HDFS: Number of write operations=2
   Job Counters
     Launched map tasks=1
     Data-local map tasks=1
     Total time spent by all maps in occupied slots (ms)=51412
     Total time spent by all reduces in occupied slots (ms)=0
     Total time spent by all map tasks (ms)=51412
     Total vcore-seconds taken by all map tasks=51412
     Total megabyte-seconds taken by all map tasks=52645888
   Map-Reduce Framework
     Map input records=4
     Map output records=4
     Input split bytes=108
     Spilled Records=0
     Failed Shuffles=0
     Merged Map outputs=0
     GC time elapsed (ms)=140
     CPU time spent (ms)=1620
     Physical memory (bytes) snapshot=87416832
     Virtual memory (bytes) snapshot=841273344
     Total committed heap usage (bytes)=15597568
   File Input Format Counters
     Bytes Read=231
   File Output Format Counters
     Bytes Written=457
16/12/05 05:39:31 INFO canopy.CanopyDriver: Build Clusters Input: /20140824/data Out: /20140824 Measure: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@79b0cd8f t1: 10.0 t2: 1.0
16/12/05 05:39:32 INFO client.RMProxy: Connecting to ResourceManager at hadoop02/127.0.0.1:8032
16/12/05 05:39:33 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/05 05:39:37 INFO input.FileInputFormat: Total input paths to process : 1
16/12/05 05:39:38 INFO mapreduce.JobSubmitter: number of splits:1
16/12/05 05:39:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480730026445_0006
16/12/05 05:39:38 INFO impl.YarnClientImpl: Submitted application application_1480730026445_0006
16/12/05 05:39:39 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1480730026445_0006/
16/12/05 05:39:39 INFO mapreduce.Job: Running job: job_1480730026445_0006
   File System Counters
     FILE: Number of bytes read=0
     FILE: Number of bytes written=105814
     FILE: Number of read operations=0
     FILE: Number of large read operations=0
     FILE: Number of write operations=0
     HDFS: Number of bytes read=1970
     HDFS: Number of bytes written=527
     HDFS: Number of read operations=13
     HDFS: Number of large read operations=0
     HDFS: Number of write operations=2
   Job Counters
     Launched map tasks=1
     Data-local map tasks=1
     Total time spent by all maps in occupied slots (ms)=26957
     Total time spent by all reduces in occupied slots (ms)=0
     Total time spent by all map tasks (ms)=26957
     Total vcore-seconds taken by all map tasks=26957
     Total megabyte-seconds taken by all map tasks=27603968
   Map-Reduce Framework
     Map input records=4
     Map output records=4
     Input split bytes=112
     Spilled Records=0
     Failed Shuffles=0
     Merged Map outputs=0
     GC time elapsed (ms)=134
     CPU time spent (ms)=1880
     Physical memory (bytes) snapshot=96550912
     Virtual memory (bytes) snapshot=841433088
     Total committed heap usage (bytes)=15597568
   File Input Format Counters
     Bytes Read=457
   File Output Format Counters
     Bytes Written=527
C-0{n=2 c=[1.000, -3.794, -2.694, -0.011, 0.102, 0.036, -0.055, -0.038, 1.000] r=[1:0.543, 2:0.008, 3:0.001, 4:0.020, 5:0.046, 6:0.034, 7:0.004]}
   Weight : [props - optional]:  Point:
   1.0: [1.000, -4.337, -2.686, -0.012, 0.122, 0.082, -0.021, -0.042, 1.000]
C-1{n=2 c=[1.000, -2.220, -2.270, -0.008, 0.066, -0.008, -0.079, -0.029, 1.000] r=[1:1.031, 2:0.433, 3:0.002, 4:0.016, 5:0.002, 6:0.010, 7:0.005]}
   Weight : [props - optional]:  Point:
   1.0: [1.000, -2.165, -2.718, -0.008, 0.043, -0.103, -0.156, -0.024, 1.000]
C-2{n=1 c=[0:1.000, 1:3.147, 2:2.129, 3:-0.006, 4:-0.056, 5:-0.063, 6:-0.002, 7:0.109] r=[]}
   Weight : [props - optional]:  Point:
   1.0: [0:1.000, 1:3.147, 2:2.129, 3:-0.006, 4:-0.056, 5:-0.063, 6:-0.002, 7:0.109]
C-3{n=1 c=[1.000, -1.189, -1.837, -0.006, 0.050, -0.006, -0.070, -0.024, 1.000] r=[]}
   Weight : [props - optional]:  Point:
   1.0: [1.000, -0.213, -0.956, -0.003, 0.056, 0.091, 0.017, -0.024, 1.000]
16/12/05 05:43:59 INFO clustering.ClusterDumper: Wrote 4 clusters
16/12/05 05:55:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 4 items
drwxr-xr-x  - root supergroup  0 2016-12-05 05:43 /20140824/clusteredPoints
drwxr-xr-x  - root supergroup  0 2016-12-05 05:42 /20140824/clusters-0-final
drwxr-xr-x  - root supergroup  0 2016-12-05 05:39 /20140824/data
-rw-r--r--  1 root supergroup  231 2016-12-05 05:21 /20140824/test-data.csv

That concludes this look at how to use Mahout canopy; hopefully it has cleared up your questions. Pairing theory with practice is the best way to learn, so go try it yourself!

Copyright: original article by 丸趣, published 2023-08-16.