How to Run Spark Machine Learning Code in Eclipse

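This article walks through how to run Spark machine learning code in Eclipse. It includes the full source, the pom.xml, and the sample data files, so it should be a handy reference if you want to try it yourself.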

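The code runs directly in Eclipse. You do not need Hadoop and you do not need to set up a Spark installation; the only requirement is that the dependencies in pom.xml are complete.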

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object MLlib {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Book example: Scala").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("file:/Users/xxx/Documents/hadoopTools/scala/eclipse/Eclipse.app/Contents/MacOS/workspace/spark_ml/src/main/resources/files/spam.txt")
    val ham = sc.textFile("file:/Users/xxx/Documents/hadoopTools/scala/eclipse/Eclipse.app/Contents/MacOS/workspace/spark_ml/src/main/resources/files/ham.txt")
    // val abc = sc.parallelize(seq, 2)
    // Create a HashingTF instance to map email text to vectors of 100 features.
    val tf = new HashingTF(numFeatures = 100)
    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
    // Create a Logistic Regression learner which uses stochastic gradient descent (SGD).
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
    val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${model.predict(negTestExample)}")
    sc.stop()
  }
}

The argument to sc.textFile is the absolute path of the file on the local machine, written as a file: URI.
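If you would rather not hard-code a machine-specific absolute path, one option is to build the file: URI from a project-relative path at run time. The sketch below is only an illustration: the object name LocalPaths and the helper toFileUri are made up for this example, and it assumes the program is launched with the project root as the working directory (which is usually the case for an Eclipse run configuration, but check yours).

```scala
import java.nio.file.Paths

object LocalPaths {
  // Resolve a path against the current working directory and
  // return it as a file: URI string.
  def toFileUri(path: String): String =
    Paths.get(path).toAbsolutePath.normalize.toUri.toString

  def main(args: Array[String]): Unit = {
    // Hypothetical relative location; adjust it to where spam.txt lives in your project.
    println(toFileUri("src/main/resources/files/spam.txt"))
  }
}
```

The resulting string can then be passed to sc.textFile in place of the literal path.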

setMaster("local[2]") tells Spark to run locally with two worker threads, so it uses at most two cores.

HashingTF builds a term-frequency feature vector from a document. Here the vector dimension is set to 100.
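To make the idea concrete, here is a minimal plain-Scala sketch of the hashing trick behind a term-frequency vector. It is not Spark's implementation: the names HashingSketch, indexOf, and termFrequencies are invented for this example, and it simply uses String.hashCode, so the bucket indices are not guaranteed to match what HashingTF produces.

```scala
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures).
  // hashCode can be negative, so shift negative remainders up.
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many terms of the document fall into each bucket.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Array[Double] = {
    val counts = new Array[Double](numFeatures)
    terms.foreach(t => counts(indexOf(t, numFeatures)) += 1.0)
    counts
  }

  def main(args: Array[String]): Unit = {
    val vector = termFrequencies("get cheap stuff cheap".split(" ").toSeq, 100)
    // Print only the non-zero buckets as (index, count) pairs.
    println(vector.zipWithIndex.filter(_._1 > 0).map(_.swap).mkString(", "))
  }
}
```

Because different words can hash to the same bucket, a small dimension such as 100 will produce collisions on a real vocabulary; a larger dimension reduces them at the cost of longer vectors.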

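TF-IDF (term frequency-inverse document frequency) is a feature vectorization method widely used in text mining. It reflects how important a term is to a document in the corpus. Denote a term by t, a document by d, and the corpus by D. The document frequency DF(t, D) is the number of documents that contain term t. If we measured importance by term frequency alone, it would be easy to over-emphasize terms that are repeated often but carry little information, such as "a", "the", and "of". A term that appears frequently across the whole corpus carries no special information about any particular document. Inverse document frequency (IDF) is a numerical measure of how much information a term carries.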
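The definitions above can be sketched in plain Scala. This is an illustration, not the code Spark runs: the object TfIdfSketch and its method names are invented here, TF is taken as the raw count of a term in a document, and IDF uses the smoothed form given in the Spark MLlib documentation, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)).

```scala
object TfIdfSketch {
  // DF(t, D): number of documents in the corpus that contain term t.
  def documentFrequency(corpus: Seq[Seq[String]]): Map[String, Int] =
    corpus.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)).
  def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val n = corpus.size
    documentFrequency(corpus).map { case (t, df) => t -> math.log((n + 1.0) / (df + 1.0)) }
  }

  // TF-IDF(t, d, D) = TF(t, d) * IDF(t, D), with TF as the raw count in d.
  // Terms unseen in the corpus get weight 0.0 in this sketch.
  def tfIdf(doc: Seq[String], idfs: Map[String, Double]): Map[String, Double] =
    doc.groupBy(identity).map { case (t, occ) => t -> occ.size * idfs.getOrElse(t, 0.0) }

  def main(args: Array[String]): Unit = {
    // Tiny made-up corpus, one token list per document.
    val corpus = Seq(
      Seq("the", "spark", "summit"),
      Seq("the", "cheap", "money"),
      Seq("the", "spark"))
    val idfs = idf(corpus)
    tfIdf(corpus(1), idfs).foreach { case (t, w) => println(s"$t -> $w") }
  }
}
```

In this toy corpus "the" appears in every document, so its IDF is log(4/4) = 0 and it contributes nothing to any document's vector, while a rarer word such as "cheap" gets a positive weight.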

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.yanan.spark_maven</groupId>
  <artifactId>spark1.3.1</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>spark_maven</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <jackson.version>1.9.13</jackson.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <!--
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-bagel_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-graphx_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <!-- specify the version for json_truple
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    -->
  </dependencies>
  <pluginRepositories>
    <pluginRepository>
      <!-- the <id> element of this plugin repository is missing from the original listing -->
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <repositories>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/repo/</url>
    </repository>
  </repositories>
</project>

ham.txt

Dear Spark Learner, Thanks so much for attending the Spark Summit 2014! Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe

spam.txt

Dear sir, I am a Prince in a far kingdom you have not heard of. I want to send you money via wire transfer so please ...
Get Vi_agra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really cheap and never have to ...