
Spark Introduction

  • Oct 10, 2015
  • 2 min read

This post is intended for developers who are starting out with Apache Spark and want a high-level overview.

Spark provides a high-level distributed processing framework that lets programmers focus on the logic of large-scale data processing. Spark can run on top of HDFS storage on a Hadoop cluster, as well as in standalone mode using its own cluster manager.

Application processes are distributed across a cluster of worker nodes that are managed by a single master. Cluster capacity grows roughly in proportion to the number of nodes added, and if a node fails, the master reassigns its tasks to a different node.
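
The reassignment idea can be illustrated with a toy model in plain Scala. This is only a sketch under stated assumptions: the names ReassignmentSketch, assign and reassign are hypothetical, tasks are plain integers, and the round-robin policy is chosen for simplicity. It is not how Spark's scheduler is actually implemented.

```scala
// Toy model of a master handing tasks to workers (hypothetical, not Spark's scheduler).
object ReassignmentSketch {

  // Spread tasks over the workers in round-robin order.
  def assign(tasks: Seq[Int], workers: Seq[String]): Map[Int, String] = {
    require(workers.nonEmpty, "at least one worker is needed")
    tasks.zipWithIndex.map { case (task, i) => task -> workers(i % workers.size) }.toMap
  }

  // Move the tasks of a failed worker onto the surviving workers.
  // Tasks that were on healthy workers keep their original placement.
  def reassign(assignment: Map[Int, String],
               failed: String,
               workers: Seq[String]): Map[Int, String] = {
    val alive = workers.filterNot(_ == failed)
    require(alive.nonEmpty, "no surviving worker to take over")
    val orphaned = assignment.collect { case (task, worker) if worker == failed => task }.toSeq.sorted
    val moved = orphaned.zipWithIndex.map { case (task, i) => task -> alive(i % alive.size) }
    assignment ++ moved
  }
}
```

Calling ReassignmentSketch.assign(0 until 6, workers) and then ReassignmentSketch.reassign(before, "worker-2", workers) leaves every task placed on a surviving worker, which is the behavior the paragraph above describes at a high level.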

Use Cases

The following are high-level use cases for Apache Spark:

  • ELT (Extract/Load/Transform)

  • Text Mining

  • Sentiment Analysis

  • Data Indexing

  • Pattern Recognition

  • Clustering

  • Collaborative Filtering

  • Prediction Models

Word Count Example

Spark's high-level APIs enable faster and easier development. In-memory data storage can make Spark up to 100x faster than MapReduce, depending on the workload.

The sample code is written in Scala, a language that supports functional programming and runs on the Java Virtual Machine.

Scala Code:

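// sc is the SparkContext that the Spark shell provides
val myFilename = "file:/home/test/input.txt"
val myFile = sc.textFile(myFilename)

myFile.flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey((v1, v2) => v1 + v2)      // add up the counts for each word
  .saveAsTextFile("file:/home/test/output.txt")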

Important notes:

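  • sc is the SparkContext object, which every Spark application requires; the Spark shell creates it automatically

  • The textFile function loads one or more files as a collection of lines; the file: prefix used here points to the local file system

  • The flatMap function splits each input line into individual words

  • The map function turns each word into a key with a value of 1 for each occurrence

  • The reduceByKey function adds up the values for each key, giving the count per word

  • Finally, saveAsTextFile stores the word counts at the output path, which Spark creates as a directory of part files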
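
To see what each step produces without a cluster, the same chain can be mirrored on an ordinary in-memory collection. The sketch below is an illustration only: WordCountSketch and wordCount are hypothetical names, a Seq[String] stands in for the RDD of lines, and groupBy followed by reduce stands in for reduceByKey, which plain Scala collections do not have.

```scala
// Word count on plain Scala collections (no Spark dependency).
object WordCountSketch {

  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))      // one element per word
      .map(word => (word, 1))                // (word, 1) pairs
      .groupBy { case (word, _) => word }    // gather the pairs that share a key
      .map { case (word, pairs) =>
        // add up the 1s for this word, as reduceByKey would
        word -> pairs.map(_._2).reduce((v1, v2) => v1 + v2)
      }
}
```

For example, WordCountSketch.wordCount(Seq("to be or not to be", "to do")) returns a map in which "to" has a count of 3 and "be" a count of 2. In Spark the same logic runs in parallel across the partitions of the input.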

Benefits

  • Open source, with an active community of contributors

  • Comprehensive libraries for analysis, machine learning, data mining, etc.

  • Very fast compared with traditional MapReduce

  • Rapid development, with developers focused on processing logic

  • Greater flexibility in choosing the underlying distributed storage

  • Near-linear scalability based on the available infrastructure

Thank you for reading. I hope you found the information in this post helpful. Please share your feedback, and share the post with others.


 
 
 


© 2016 by Bhupendra Patni.
