Spark: Processing Web Logs
- Bhupendra Patni
- Dec 13, 2015
- 2 min read
This blog post is aimed at developers who are starting to work with Spark and Scala. In it, we will write Scala code to process sample web logs for a couple of use cases.
We will cover the following two use cases:
- Top 25 users with the maximum hits
- First 25 users with their IP addresses
Assumptions
- The web log data files are available to process
- The files have been copied to the Hadoop cluster
- A SparkContext object has been created, pointing to the Spark master
- The user ID is the 5th field and the IP address is the 3rd field in each space-separated log line
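For completeness, here is a minimal sketch of how such a SparkContext might be created; the application name and master URL below are placeholders, not values from an actual cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder configuration: substitute your own app name and master URL
val conf = new SparkConf()
  .setAppName("WebLogProcessing")
  .setMaster("spark://master-host:7077")
val sc = new SparkContext(conf)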
Use Case 1: Top 25 users with the maximum hits
The first step in the code is to create a SparkContext object pointing to the Spark master and then load the web log files using the sc.textFile() method:
// Create an RDD based on all the weblogs
val weblogs = sc.textFile("hdfs://data/weblogs/*")
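As a quick sanity check (an optional addition, not part of the original steps), you can materialize one line and count the records before going further:

// Optional: confirm the RDD actually sees data
println(weblogs.first())
println("total lines: " + weblogs.count())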
The second step in the code is to:
- Read each line with the user request in the log file
- Split each line by the column separator (a space)
- Map each line to a key-value pair of (user ID, 1)
- Sum the counts to get the number of requests per user
Note that we have combined multiple lines of code into a single Scala statement.
// Map each request line to a pair (userid, 1), then sum the hits
val userRequests = weblogs.
  map(line => line.split(' ')).
  map(words => (words(4), 1)).
  reduceByKey((v1, v2) => v1 + v2)
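As an aside, the same aggregation can be written more compactly with Scala's placeholder syntax; this is an equivalent sketch, not code from the original post:

// Equivalent, more compact form of the same aggregation
val userRequestsCompact = weblogs.
  map(line => (line.split(' ')(4), 1)).
  reduceByKey(_ + _)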
The third step in the code is to extract the records for the 25 users with the highest counts:
//Extract the records for the 25 users with the highest counts
userRequests.map(pair => (pair._2, pair._1)).sortByKey(false).take(25)
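The key swap above exists only so sortByKey can order by count. The same result, with the pairs kept in their original (user ID, count) order, can be obtained with RDD.top and a custom Ordering; this variant is my own sketch, not code from the original post:

// Equivalent: take the 25 pairs with the largest counts directly
val top25 = userRequests.top(25)(Ordering.by(pair => pair._2))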
Use Case 2: First 25 users with their IP addresses
For the second use case, we will reuse the same SparkContext and the weblogs RDD created in the previous use case.
The first step in the code is to:
- Read each line with the user request in the log file
- Split each line by the column separator (a space)
- Map each line to a key-value pair of (user ID, IP address), then group the IPs by user ID
// Group IPs by user
val userIPs = weblogs.
  map(line => line.split(' ')).
  map(words => (words(4), words(2))).
  groupByKey()
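One caveat worth noting: groupByKey shuffles every (user, IP) pair across the cluster. If only the distinct IPs per user matter, deduplicating first shrinks the shuffle; a sketch of that variant (my addition, not part of the original post):

// Variant: drop duplicate (user, IP) pairs before grouping
val userDistinctIPs = weblogs.
  map(line => line.split(' ')).
  map(words => (words(4), words(2))).
  distinct().
  groupByKey()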
The last step in the code is to print each user ID followed by its IP addresses, one per line, indented with a tab.
// Print first 25 user ids and their corresponding IPs
for (pair <- userIPs.take(25)) {
  println(pair._1 + ":")
  for (ip <- pair._2) println("\t" + ip)
}
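Alternatively, each user's IPs can be collapsed onto a single line with mkString; this variant (my addition) emits one tab-delimited line per user:

// Variant: one line per user, with the IPs joined after a tab
for ((user, ips) <- userIPs.take(25)) {
  println(user + "\t" + ips.mkString(","))
}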
I hope this post is helpful for developers applying Spark and Scala to their own use cases. Please share it with others and leave your feedback.