Spark Introduction
- Oct 10, 2015
This post is intended for developers who are getting started with Apache Spark and gives a high-level overview of it.
Spark provides a high-level distributed processing framework that lets programmers focus on the logic of large-scale data processing. The Spark APIs can work with data stored in HDFS, and Spark can also run in standalone mode without a Hadoop cluster.
The application's processes are distributed across a cluster of worker nodes, which are managed by a single Master. The capacity of the cluster grows roughly in proportion to the number of nodes added, and if a node fails, the Master reassigns its tasks to a different node.
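To make the reassignment idea concrete, here is a small toy model in plain Scala. It is only a sketch and not how Spark's scheduler is actually implemented: the names ToyScheduler, assign and reassign are made up for illustration, and round-robin placement is an assumption chosen to keep the example short.

```scala
// Toy model only: illustrates "tasks move to another node on failure".
// This is NOT Spark's scheduler; all names here are hypothetical.
object ToyScheduler {
  // Spread tasks across workers in round-robin order.
  def assign(tasks: Seq[String], workers: Seq[String]): Map[String, String] =
    tasks.zipWithIndex.map { case (task, i) => task -> workers(i % workers.size) }.toMap

  // Move every task that was on the failed worker to one of the surviving workers.
  def reassign(assignment: Map[String, String],
               failed: String,
               workers: Seq[String]): Map[String, String] = {
    val survivors = workers.filterNot(_ == failed)
    require(survivors.nonEmpty, "no workers left to take over the tasks")
    val orphaned = assignment.collect { case (task, w) if w == failed => task }.toSeq.sorted
    val moved = orphaned.zipWithIndex.map { case (task, i) => task -> survivors(i % survivors.size) }
    assignment ++ moved // entries for the orphaned tasks are overwritten
  }
}
```

For example, with workers w1, w2 and w3 and six tasks, calling reassign after w2 fails leaves all six tasks assigned, none of them on w2.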
Use Cases
Common high-level use cases for Apache Spark include:
ELT (Extract/Load/Transform)
Text Mining
Sentiment Analysis
Data Indexing
Pattern Recognition
Clustering
Collaborative Filtering
Prediction Models
Word Count Example
The high-level Spark APIs enable faster and easier development. By keeping data in memory, Spark can run some workloads up to 100x faster than MapReduce.
The sample code is written in Scala, a functional programming language that runs on the Java Virtual Machine.
Scala Code:
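// sc is the SparkContext (created automatically in spark-shell)
val myFilename = "file:/home/test/input.txt"
val myFile = sc.textFile(myFilename)
myFile.flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))                 // pair each word with a count of 1
  .reduceByKey((v1, v2) => v1 + v2)       // sum the counts for each word
  .saveAsTextFile("file:/home/test/output.txt")   // written as a directory of part files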
Important notes:
sc is the SparkContext object, which every Spark application requires
textFile loads the input file(s); the file: prefix here reads from the local file system
flatMap splits each input line into individual words
map turns each word into a (word, 1) pair, so the word is the key and 1 is the value for each occurrence
reduceByKey adds up the values for each key, giving the total count per word
saveAsTextFile stores the word counts at the output path
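If you want to see what each step does without a cluster, the same pipeline can be sketched with plain Scala collections. This is only an approximation under stated assumptions: LocalWordCount and wordCount are hypothetical names, an in-memory Seq stands in for the RDD, and groupBy followed by a sum stands in for Spark's reduceByKey.

```scala
// Local sketch of the word count steps using ordinary Scala collections.
// No Spark involved; LocalWordCount is a hypothetical helper for illustration.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one element per word
      .map(word => (word, 1))             // (word, 1) pairs
      .groupBy { case (word, _) => word } // collect the pairs for each word
      .map { case (word, pairs) =>
        word -> pairs.map(_._2).reduce((v1, v2) => v1 + v2) // sum the 1s
      }
}
```

Calling LocalWordCount.wordCount(Seq("to be or", "not to be")) gives a map in which "to" and "be" each have a count of 2, and "or" and "not" each have a count of 1.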
Benefits
Open source, with an active contributor community
Comprehensive libraries for analysis, machine learning, data mining, etc.
Very fast compared with traditional MapReduce
Rapid development, since developers can focus on the processing logic
Greater flexibility in choosing underlying distributed storage
Near linear scalability based on the available infrastructure
Thank you for reading. I hope the information in this post is helpful to you. Please share your feedback, and share the post with others.
