spark怎么安裝

169次閱讀

共計 4460 個字符，預計需要花費 12 分鐘才能閱讀完成。

這篇文章主要介紹“spark 怎么安裝”，在日常操作中，相信很多人在 spark 怎么安裝問題上存在疑惑，丸趣 TV 小編查閱了各式資料，整理出簡單好用的操作方法，希望對大家解答”spark 怎么安裝”的疑惑有所幫助！接下來，請跟著丸趣 TV 小編一起來學習吧！

什么是 RDD

問題：從一個總計 100 行的文件中找出所有包含“包租婆”的行數算法如下：

1.  讀一行，判斷這一行有“包租婆”嗎？如果有，全局變量 count 加 1。2.  文件到末尾了嗎？如果沒有，跳轉到第 1 步繼續執行。3.  打印 count。

RDD 的概念：全稱為 Resilient Distributed Datasets，是一個容錯的、并行的數據結構，可以讓用戶顯式地將數據存儲到磁盤和內存中，并能控制數據的分區。

上述例子中，總計 100 行的文件就是一個 RDD，其中每一行表示一個 RDD 的元素

RDD 兩大特性

1.  對集合的每個記錄執行相同的操作
 -  每一行都做“字符串”檢查
 -  檢查本行是不是到了最后一行
2.  這個操作的具體行為是用戶指定的
 -  包含“包租婆”就為計數器做 + 1 操作
 -  最后一行：結束；不是最后一行：進入下一行檢查

RDD 有哪些操作參考資料

1.  創建 RDD
 -  從文件中創建
 val b = sc.textFile(README.md)
 README.md 每一行都是 RDD 的一個元素  
 -  從普通數組創建 RDD
 scala  val a = sc.parallelize(1 to 9, 3)
  里面包含了 1 到 9 這 9 個數字，它們分別在 3 個分區
2. map
map 是對 RDD 中的每個元素都執行一個指定的函數來產生一個新的 RDD。任何原 RDD 中的元素在新 RDD 中都有且只有一個元素與之對應。 - RDD a  中每個元素都比原來大一倍
 scala  val b = a.map(x =  x*2)
 scala  b.collect
 res11: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
3. mapPartitions
mapPartitions 是 map 的一個變種。map 的輸入函數是應用于 RDD 中每個元素，而 mapPartitions 的輸入函數是應用于每個分區，也就是把每個分區中的內容作為整體來處理的
 -  函數 myfunc 是把分區中一個元素和它的下一個元素組成一個 Tuple
scala  def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = { var res = List[(T, T)]() 
 var pre = iter.next while (iter.hasNext) {
 val cur = iter.next; 
 res .::= (pre, cur) pre = cur;
 } 
 res.iterator
scala  a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
4. mapValues
mapValues 顧名思義就是輸入函數應用于 RDD 中 Kev-Value 的 Value，原 RDD 中的 Key 保持不變，與新的 Value 一起組成新的 RDD 中的元素。因此，該函數只適用于元素為 KV 對的 RDD。_def mapPartitions[U: ClassTag](f: Iterator[T] =  Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
f 即為輸入函數，它處理每個分區里面的內容。每個分區中的內容將以 Iterator[T] 傳遞給輸入函數 f，f 的輸出結果是 Iterator[U]。最終的 RDD 由所有分區經過輸入函數處理后的結果合并起來的。_
 - RDD b  的 key 是字符串長度，value 是當前元素值；對 b 進行 mapValues 操作，使得 value 首尾字符設為 x
 scala  val a = sc.parallelize(List( dog ,  tiger ,  lion ,  cat ,  panther ,   eagle), 2)
 scala  val b = a.map(x =  (x.length, x))
 scala  b.mapValues(x  + _ +  x).collect
 res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx),(3,xcatx), (7,xpantherx), (5,xeaglex))
5. mapWith
mapWith 是 map 的另外一個變種，map 只需要一個輸入函數，而 mapWith 有兩個輸入函數。

spark 安裝

-  資料
 [安裝過程](https://spark.apache.org/downloads.html)
 
-  安裝

wget http://apache.spinellicreations.com/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
tar zxf spark-1.6.1-bin-hadoop2.6.tgz
mv spark-1.6.1-bin-hadoop2.6 spark
mv -f spark ~/app/
vi ~/.bash_profile 
PATH=$PATH:$HOME/bin:/home/solr/app/spark/bin
source ~/.bash_profile

-  啟動 spark

spark-shell
進入 scala 命令行

- hello world

scala  println(hello world)
hello world

spark IDE

下載并安裝 JDK

下載并安裝 IDEA

下載并安裝 SCALA

準備好 spark 的 lib 包

添加 IDEA 的 SCALA 插件 File- Settings- Plugins- 搜索 Scala，并安裝 Scala 插件

新建項目 File- New Project- 選擇 Scala- next- project name location – Finish

添加 spark 的 lib 包“File”–“project structure”–“Libraries”，選擇“+”，將 spark-hadoop 對應的包導入

新建 SparkPi 類（源碼見 $SPARKHOME$/examples/src/main/scala/org/apache/spark/examples）新建包：org.apache.spark.examples 新建 Scala 類：SparkPi

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the  License  you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an  AS IS  BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
// scalastyle:off println
package org.apache.spark.examples
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi { def main(args: Array[String]) { val conf = new SparkConf().setAppName(Spark Pi) // 本地運行加.setMaster(local) 
 val spark = new SparkContext(conf)
 val slices = if (args.length   0) args(0).toInt else 2
 val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
 val count = spark.parallelize(1 until n, slices).map { i = 
 val x = random * 2 - 1
 val y = random * 2 - 1
 if (x*x + y*y   1) 1 else 0
 }.reduce(_ + _)
 println(Pi is roughly   + 4.0 * count / n)
 spark.stop()
 }
// scalastyle:on println

 上傳至 linux 服務器，執行命令
$SPARK_HOME$/bin/spark-submit --class  org.apache.spark.examples.SparkPi  --master spark://updev4:7077 /home/solr/sparkPi.jar
輸出結果：Pi is roughly 3.13662

到此，關于“spark 怎么安裝”的學習就結束了，希望能夠解決大家的疑惑。理論與實踐的搭配能更好的幫助大家學習，快去試試吧！若想繼續學習更多相關知識，請繼續關注丸趣 TV 網站，丸趣 TV 小編會繼續努力為大家帶來更多實用的文章！

正文完