Run Scala script in Production


[hadoop@ip-xx-y-zz-12 ~]$ vim /mnt1/analytics/arvind/deploy/
export HTTP_HOST=xx.y.zz.12
exec java -classpath "/mnt1/analytics/arvind/deploy/scala-redis.jar" "com.csscorp.restapi.RestServer" "$0" "$@"
[hadoop@ip-xx-y-zz-12 deploy]$ chmod -R 777
[hadoop@ip-xx-y-zz-12 deploy]$ nohup sh > ./scalaredis.log 2>&1 &
[hadoop@ip-xx-y-zz-12 deploy]$ jps
22938 RestServer
22961 Jps


Scala java.lang.NoSuchMethodError: scala.Predef$.augmentString error

Using Typesafe in Scala

Using the Typesafe Config library for Scala, we can override configuration settings at runtime.  This post describes the technique:


libraryDependencies ++= Seq(
  "com.typesafe" % "config" % "1.2.1"
).map(_.excludeAll(ExclusionRule(organization = "org.mortbay.jetty")))


redis {
  host = "localhost"
  host = ${?REDIS_HOST}
  port = 6379
  port = ${?REDIS_PORT}
}


package com.csscorp.util

import com.typesafe.config.ConfigFactory

/** Created by Nag Arvind Gudiseva on 15-Dec-16. */
object Config {

  /** Loads all key / value pairs from the application configuration file. */
  private val conf = ConfigFactory.load()

  // Redis Configurations
  object RedisConfig {
    private lazy val redisConfig = conf.getConfig("redis")

    lazy val host = redisConfig.getString("host")
    lazy val port = redisConfig.getInt("port")
  }
}



package tutorials.typesafe

import com.csscorp.util.Config.RedisConfig

/** Created by Nag Arvind Gudiseva on 26-Dec-16. */
object TypeSafeTest {

  def main(args: Array[String]): Unit = {

    val redisHost = RedisConfig.host
    println(s"Hostname: $redisHost")
    val redisPort = RedisConfig.port
    println(s"Port: $redisPort")
  }
}



C:\ScalaRedis> sbt clean compile package
C:\ScalaRedis> scala -classpath target/scala-2.10/scalaredis_2.10-1.0.jar tutorials.typesafe.TypeSafeTest
    Hostname: localhost
    Port: 6379

C:\ScalaRedis> SET REDIS_HOST=
C:\ScalaRedis> echo %REDIS_HOST%
C:\ScalaRedis> SET REDIS_PORT=1234
C:\ScalaRedis> echo %REDIS_PORT%

C:\ScalaRedis> scala -classpath target/scala-2.10/scalaredis_2.10-1.0.jar tutorials.typesafe.TypeSafeTest
    Port: 1234

As observed, the host name and port set in the system environment variables override the defaults from the configuration file.

Production (AWS EMR)

[hadoop@ip-xx-y-zz-12 arvind]$ echo $HTTP_PORT
[hadoop@ip-xx-y-zz-12 arvind]$ scala -classpath deploy/scala-redis.jar com.csscorp.restapi.RestServer
2016/12/27 05:56:04-286 [INFO] com.csscorp.restapi.RestServer$ - Hostname: localhost
2016/12/27 05:56:04-287 [INFO] com.csscorp.restapi.RestServer$ - Port: 8080

[hadoop@ip-xx-y-zz-12 arvind]$ export HTTP_PORT=8081
[hadoop@ip-xx-y-zz-12 arvind]$ echo $HTTP_PORT
[hadoop@ip-xx-y-zz-12 arvind]$ scala -classpath deploy/scala-redis.jar com.csscorp.restapi.RestServer
2016/12/27 05:59:39-304 [INFO] com.csscorp.restapi.RestServer$ - Hostname: localhost
2016/12/27 05:59:39-305 [INFO] com.csscorp.restapi.RestServer$ - Port: 8081
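The RestServer evidently resolves its bind address and port the same way; a sketch of the corresponding application.conf entries, with the key names assumed from the HTTP_HOST / HTTP_PORT variables and the default 8080 seen in the log above:

```hocon
http {
  host = "localhost"
  host = ${?HTTP_HOST}   # overrides only when HTTP_HOST is set
  port = 8080
  port = ${?HTTP_PORT}   # overrides only when HTTP_PORT is set
}
```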



Using Typesafe’s Config for Scala (and Java) for Application Configuration

Scala – Head, Tail, Init & Last

Sample Program

object SequencesTest {

  // Conceptual representation
  println("""nag arvind gudiseva scala""")
  println("""---> HEAD""")
  println("""nag arvind gudiseva scala""")
  println("""... ---------------------> TAIL""")
  println("""nag arvind gudiseva scala""")
  println("""-------------------> INIT""")
  println("""nag arvind gudiseva scala""")
  println("""................... -----> LAST""")

  def main(arg: Array[String]): Unit = {

    val str1: String = "nag arvind gudiseva scala"
    val str1Arr: Array[String] = str1.split(" ")

    println("HEAD: " + str1Arr.head)
    println("TAIL: " + str1Arr.tail.deep.mkString)
    println("INIT: " + str1Arr.init.deep.mkString)
    println("LAST: " + str1Arr.last)
  }
}






HEAD: nag
TAIL: arvindgudisevascala
INIT: nagarvindgudiseva
LAST: scala
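The four operations are related by two simple identities: head plus tail rebuilds the sequence, and so does init plus last. A minimal sketch (hypothetical SeqOpsTest object, using mkString(" ") so the joined output keeps the spaces the sample above drops):

```scala
object SeqOpsTest extends App {
  val words: List[String] = "nag arvind gudiseva scala".split(" ").toList

  // head is the first element; tail is everything after it.
  assert(words.head :: words.tail == words)
  // init is everything before the last element; last is the final one.
  assert(words.init :+ words.last == words)

  println("TAIL: " + words.tail.mkString(" "))   // TAIL: arvind gudiseva scala
  println("INIT: " + words.init.mkString(" "))   // INIT: nag arvind gudiseva
}
```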



Scala sequences – head, tail, init, last


Git Commands on Windows

Install Git on Windows

1. Download the latest Git for Windows stand-alone installer

2. Accept the default options, clicking Next and then Finish.

3. Open Command Prompt

4. Configure the Git username and email (these will be associated with any commits)

    Git global setup:

    C:\> git config --global user.name "Nag Arvind Gudiseva"
    C:\> git config --global user.email ""

Git Command line instructions

    A. Create a new repository

    C:\> git clone
    C:\> cd analytics-sample-project
    C:\> touch
    C:\> git add
    C:\> git commit -m "add README"
    C:\> git push -u origin master

B. Existing folder or Git repository

    C:\> cd existing_folder
    C:\> git init
    C:\> git remote add origin
    C:\> git add .
    C:\> git commit
    C:\> git push -u origin master

C. In Linux

    $ git config --global http.proxy ""
    $ cd ~/gudiseva
    $ git init
    $ git status
    $ git add arvind.jpeg
    $ git commit -m "Add arvind.jpeg."
    $ git remote add origin
    $ git remote -v
    $ git remote show origin
    $ git diff
    $ git push


Jersey Multipart File Upload Maven Project

Create Maven Project with the below archetype (Jersey Version: 2.9; Jetty Version: 9.2.18.v20160721):

mvn archetype:generate -DarchetypeGroupId=org.glassfish.jersey.archetypes -DarchetypeArtifactId=jersey-quickstart-webapp -DarchetypeVersion=2.9
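The quickstart archetype does not pull in multipart support by itself; for file upload, the jersey-media-multipart module is also needed (version aligned with Jersey 2.9), and MultiPartFeature must be registered with the application. A sketch of the extra pom.xml dependency:

```xml
<!-- Multipart support for Jersey 2.9; also register
     org.glassfish.jersey.media.multipart.MultiPartFeature in the app. -->
<dependency>
  <groupId>org.glassfish.jersey.media</groupId>
  <artifactId>jersey-media-multipart</artifactId>
  <version>2.9</version>
</dependency>
```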

Apache FLUME Installation and Configuration in Windows 10

1. Download & install Java

2. Create a Junction Link.  (Needed as the Java Path contains spaces)

    C:\Windows\system32>mklink /J "C:\Program_Files" "C:\Program Files"
    Junction created for C:\Program_Files <<===>> C:\Program Files

3. Set Path and Classpath for Java


4. Download Flume

    Download apache-flume-1.7.0-bin.tar.gz

5. Extract using 7-Zip

 Move to C:\flume\apache-flume-1.7.0-bin directory

6. Set Path and Classpath for Flume


7. Download Windows binaries for Hadoop versions

8. Copy

To C:\hadoop\hadoop-2.6.0\bin

9. Set Path and Classpath for Hadoop


10. Edit file


11. Copy flume-env.ps1.template as flume-env.ps1.

    Add below configuration:
    $JAVA_OPTS="-Xms500m -Xmx1000m"

12. Copy as

13. — Flume Working Commands —

C:\> cd %FLUME_HOME%/bin
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/ --name agent

14. Install HDInsight Emulator (Hadoop) on Windows 10

a. Install Microsoft Web Platform Installer 5.0
b. Search for HDInsight
c. Select Add -> Install -> I Accept -> Finish
d. Format Namenode

        C:\hdp> hdfs namenode -format

e. Start Hadoop and Other Services

        C:\hdp> start_local_hdp_services

f. Verify

        c:\hdp> hdfs dfsadmin -report

g. Hadoop Sample Commands

C:\hdp\hadoop-> hdfs dfs -ls hdfs://lap-04-2312:8020/
C:\hdp\hadoop-> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users
C:\hdp\hadoop-> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users/hadoop
C:\hdp\hadoop-> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users/hadoop/flume

h. Stop Hadoop and Other Services

        C:\hdp> stop_local_hdp_services

15. — Other Flume Commands —

C:\> cd %FLUME_HOME%/bin
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/ --name SeqLogAgent
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/ --name SeqGenAgent
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/ --name TwitterAgent


Note: Sample Configurations and Properties are attached.

flume-conf-properties   seq_gen-properties   seq_log-properties

Apache Flume Installation & Configurations on Ubuntu 16.04

1. Download Flume

2. Extract Flume tar

$ tar -xzvf apache-flume-1.7.0-bin.tar.gz

3. Move to a folder

$ sudo mv apache-flume-1.7.0-bin /opt/
$ sudo mv /opt/apache-flume-1.7.0-bin /opt/apache-flume-1.7.0

4. Update the Path

$ gedit ~/.bashrc

export FLUME_HOME=/opt/apache-flume-1.7.0
export PATH=$PATH:$FLUME_HOME/bin

5. Update the Flume Environment

$ cd conf/
$ cp
$ gedit

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JAVA_OPTS="-Xms500m -Xmx1000m"

$ cd ..

6. Update


7. Reload BashRc

$ source ~/.bashrc (OR) . ~/.bashrc

8. Test Flume

$ flume-ng --help

9. Start Hadoop

$ hadoop fs -ls hdfs://localhost:9000/nag
$ hdfs dfs -mkdir hdfs://localhost:9000/user/gudiseva/twitter_data

10. Run the following commands:

A. Twitter


$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
$ ./flume-ng agent -n TwitterAgent -c conf -f ../conf/twitter.conf
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/twitter.conf -Dflume.root.logger=DEBUG,console --name TwitterAgent

B. Sequence


$ bin/flume-ng agent --conf ./conf/ -f conf/seq_gen.conf -Dflume.root.logger=DEBUG,console -n SeqGenAgent
$ ./flume-ng agent -n SeqGenAgent -c conf -f ../conf/seq_gen.conf
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_gen.conf --name SeqGenAgent

C. NetCat


$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/netcat.conf --name NetcatAgent -Dflume.root.logger=INFO,console

D. Sequence Logger


$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_log.conf --name SeqLogAgent -Dflume.root.logger=INFO,console

E. Cat / Tail File Channel


$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/ --name a1 -Dflume.root.logger=INFO,console

F. Spool Directory


$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/ --name agent -Dflume.root.logger=INFO,console

G. Default Template


$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/ --name agent -Dflume.root.logger=INFO,console

H. Multiple Sinks


$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/ --name flumeAgent -Dflume.root.logger=INFO,LOGFILE

11. Netcat

A. Check if Netcat is already installed
   $ which netcat
   $ nc -h

B. Else, install Netcat
   $ sudo apt-get install netcat

C. Netcat in listening (Server) mode
   $ nc -l -p 12345

D. Netcat in Client mode
   $ nc localhost 12345
   $ curl telnet://localhost:12345

E. Netcat as a Client to perform Port Scanning
   $ nc -v hostname port
   $ nc -v 80
     GET / HTTP/1.1


Note: Sample Configurations and Properties are attached.

cat_tail-properties; flume-conf-properties; multiple_sinks-properties; netcat-conf; seq_gen-conf; seq_log-conf; twitter-conf


Lambda Architecture

What is the Lambda Architecture?

Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.

The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.

Here’s how it looks, from a high-level perspective:

LA overview

  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views.
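The five points above can be sketched as a toy word-count, with hypothetical batchView / speedView / query functions standing in for the three layers; this illustrates only the merge step, not a production design:

```scala
// Toy Lambda Architecture sketch: count events per key across layers.
object LambdaSketch extends App {
  type View = Map[String, Int]

  // Batch layer: recompute the view from the immutable master dataset.
  def batchView(master: Seq[String]): View =
    master.groupBy(identity).map { case (k, v) => k -> v.size }

  // Speed layer: the same computation, but over recent events only.
  def speedView(recent: Seq[String]): View =
    recent.groupBy(identity).map { case (k, v) => k -> v.size }

  // Serving layer: answer a query by merging batch and real-time views.
  def query(batch: View, speed: View, key: String): Int =
    batch.getOrElse(key, 0) + speed.getOrElse(key, 0)

  val master = Seq("click", "view", "click") // already absorbed into batch views
  val recent = Seq("click")                  // arrived after the last batch run
  println(query(batchView(master), speedView(recent), "click")) // prints 3
}
```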