Spark Introduction

Oct 10, 2015

This post is intended for developers who are getting started with Apache Spark and want a high-level overview.

Spark provides a high-level distributed processing framework that lets programmers focus on the logic of large-scale data processing. The Spark APIs can work with data stored in HDFS, and Spark can also run in standalone mode.

An application's processes are distributed across a cluster of worker nodes managed by a single master. The capacity of the cluster grows roughly in proportion to the number of nodes added, and if a node fails, the master assigns its tasks to a different node.
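
To make the master/worker idea concrete, here is a toy model in plain Scala. It is not Spark's scheduler and uses no Spark APIs; the names `ToyMaster`, `assign`, and `reassign` are hypothetical and exist only for this illustration. It hands tasks to workers round-robin and, when a worker fails, spreads that worker's tasks over the survivors.

```scala
// Toy model only: an illustration of the idea, not how Spark schedules tasks.
object ToyMaster {
  // Assign task ids to workers round-robin.
  def assign(tasks: Seq[Int], workers: Seq[String]): Map[String, Seq[Int]] = {
    require(workers.nonEmpty, "need at least one worker")
    val grouped = tasks.zipWithIndex.groupBy { case (_, i) => workers(i % workers.size) }
    workers.map(w => w -> grouped.getOrElse(w, Seq.empty).map(_._1)).toMap
  }

  // When a worker fails, spread its tasks over the surviving workers.
  def reassign(plan: Map[String, Seq[Int]], failed: String): Map[String, Seq[Int]] = {
    val survivors = plan.keys.toSeq.filter(_ != failed).sorted
    val orphaned = plan.getOrElse(failed, Seq.empty)
    val extra = assign(orphaned, survivors)
    survivors.map(w => w -> (plan(w) ++ extra(w))).toMap
  }

  def main(args: Array[String]): Unit = {
    val plan = assign(1 to 6, Seq("worker-a", "worker-b", "worker-c"))
    // worker-b fails: its tasks (2 and 5) move to worker-a and worker-c.
    val recovered = reassign(plan, "worker-b")
    println(recovered)
  }
}
```

In this sketch, adding a worker to the list shrinks each worker's share of the tasks, which is the intuition behind capacity growing with the number of nodes.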

Use Cases

The following are some high-level use cases for Apache Spark:

ELT (Extract/Load/Transform)
Text Mining
Sentiment Analysis
Data Indexing
Pattern Recognition
Clustering
Collaborative Filtering
Prediction Models

Word Count Example

The high-level Spark APIs enable faster and easier development. Spark's in-memory data storage can make processing up to 100x faster than MapReduce.

The sample code is written in Scala, a language that supports functional programming and runs on the Java Virtual Machine.

Scala Code:

val myFilename = "file:/home/test/input.txt"
val myFile = sc.textFile(myFilename)

myFile.flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))                 // pair each word with a count of 1
  .reduceByKey((v1, v2) => v1 + v2)       // add up the counts for each word
  .saveAsTextFile("file:/home/test/output.txt")

Important notes:

sc is the SparkContext object, which every Spark application requires
The textFile function loads one or more files as lines of text; the file: prefix here reads from the local file system
The flatMap function splits each input line into individual words
The map function turns each word into a key with a value of 1 for each occurrence, and reduceByKey adds up the values for each key
Finally, saveAsTextFile writes the word counts to the output path, which Spark creates as a directory of part files
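
The steps in the notes above can be tried without a cluster. The following is a minimal sketch that runs the same pipeline on an ordinary in-memory Scala collection instead of a Spark RDD. The `LocalWordCount` object and `wordCount` function are hypothetical names for this illustration, and `groupBy` followed by a sum stands in for `reduceByKey`, which plain Scala collections do not have.

```scala
object LocalWordCount {
  // Same steps as the Spark job, applied to a local Seq instead of an RDD.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))       // one element per word
      .map(word => (word, 1))                 // pair each word with a count of 1
      .groupBy { case (word, _) => word }     // collect the pairs for each word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // stand-in for reduceByKey

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or", "not to be"))
    // counts: to -> 2, be -> 2, or -> 1, not -> 1 (map order may vary)
    println(counts)
  }
}
```

The difference in Spark is that each step runs in parallel across the worker nodes, and `reduceByKey` combines the counts for a word even when its occurrences are spread over different nodes.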

Benefits

Open source with top contributors
Comprehensive libraries for analysis, machine learning, data mining, etc.
Very fast compared with traditional MapReduce
Rapid development, with developers focused on processing logic
Greater flexibility in choosing the underlying distributed storage
Near-linear scalability based on the available infrastructure

Thank you for reading. I hope the information in this post is helpful. Please share your feedback, and share the post with others.
