Run Scala script in Production

Steps

[hadoop@ip-xx-y-zz-12 ~]$ vim /mnt1/analytics/arvind/deploy/scala-redis.sh
#!/bin/sh
export HTTP_HOST=xx.y.zz.12
echo $HTTP_HOST
exec java -classpath "/mnt1/analytics/arvind/deploy/scala-redis.jar" "com.csscorp.restapi.RestServer" "$0" "$@"
[hadoop@ip-xx-y-zz-12 deploy]$ chmod -R 777 scala-redis.sh
[hadoop@ip-xx-y-zz-12 deploy]$ nohup sh scala-redis.sh > ./scalaredis.log 2>&1 &
[hadoop@ip-xx-y-zz-12 deploy]$ jps
22938 RestServer
22961 Jps
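The launcher script above exports HTTP_HOST before starting the JVM. The `RestServer` internals are not shown in this post, so the names below are illustrative, but a server entry point could pick up the exported variable through `sys.env` roughly like this:

```scala
// Illustrative sketch: reading an exported HTTP_HOST with a fallback,
// the way a server entry point might pick it up at startup.
object EnvHostDemo {
  def main(args: Array[String]): Unit = {
    // sys.env is an immutable Map of the process environment
    val host = sys.env.getOrElse("HTTP_HOST", "localhost")
    println(s"Binding to host: $host")
  }
}
```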

Reference

Scala java.lang.NoSuchMethodError: scala.Predef$.augmentString error

Using Typesafe in Scala

Using the Typesafe Config library for Scala, we can override configuration settings with environment variables.  This post describes the technique:

build.sbt

libraryDependencies ++= Seq(
  // --- Dependencies ---
  "com.typesafe" % "config" % "1.2.1"
  // --- Dependencies ---
).map(
  _.excludeAll(ExclusionRule(organization = "org.mortbay.jetty"))
)

application.conf

redis {
  host = "localhost"
  host = ${?REDIS_HOST}
  port = 6379
  port = ${?REDIS_PORT}
}

Config

package com.csscorp.util

import com.typesafe.config.ConfigFactory

/**
  * Created by Nag Arvind Gudiseva on 15-Dec-16.
  */
object Config {

  /** Loads all key / value pairs from the application configuration file. */
  private val conf = ConfigFactory.load()

  // Redis Configurations
  object RedisConfig {
    private lazy val redisConfig = conf.getConfig("redis")

    lazy val host = redisConfig.getString("host")
    lazy val port = redisConfig.getInt("port")
  }

}

TypeSafeTest

package tutorials.typesafe

import com.csscorp.util.Config.RedisConfig

/**
  * Created by Nag Arvind Gudiseva on 26-Dec-16.
  */
object TypeSafeTest {

  def main(args: Array[String]): Unit = {

    val redisHost = RedisConfig.host
    println(s"Hostname: $redisHost")
    val redisPort = RedisConfig.port
    println(f"Port: $redisPort")

  }
}

Terminal

C:\ScalaRedis> sbt clean compile package
C:\ScalaRedis> scala -classpath target/scala-2.10/scalaredis_2.10-1.0.jar tutorials.typesafe.TypeSafeTest
Output:
    Hostname: localhost
    Port: 6379

C:\ScalaRedis> SET REDIS_HOST=1.2.3.4
C:\ScalaRedis> echo %REDIS_HOST%
Output:
    1.2.3.4
C:\ScalaRedis> SET REDIS_PORT=1234
C:\ScalaRedis> echo %REDIS_PORT%
Output:
    1234

C:\ScalaRedis> scala -classpath target/scala-2.10/scalaredis_2.10-1.0.jar tutorials.typesafe.TypeSafeTest
Output:
    Hostname: 1.2.3.4
    Port: 1234

As observed, the host name and port set in the system environment variables are substituted.
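The `${?VAR}` substitution in application.conf keeps the default unless the named variable is set. The same precedence can be sketched in plain Scala (no Typesafe dependency), purely to illustrate the lookup order:

```scala
// Sketch of the ${?VAR} precedence: an environment value, if present,
// overrides the default from application.conf; otherwise the default wins.
object OverrideDemo {
  def resolve(default: String, envVar: String): String =
    sys.env.getOrElse(envVar, default)

  def main(args: Array[String]): Unit = {
    println("Hostname: " + resolve("localhost", "REDIS_HOST"))
    println("Port: " + resolve("6379", "REDIS_PORT"))
  }
}
```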

Production (AWS EMR)

[hadoop@ip-xx-y-zz-12 arvind]$ echo $HTTP_PORT
Output:
<Blank>
[hadoop@ip-xx-y-zz-12 arvind]$ scala -classpath deploy/scala-redis.jar com.csscorp.restapi.RestServer
Output:
2016/12/27 05:56:04-286 [INFO] com.csscorp.restapi.RestServer$ - Hostname: localhost
2016/12/27 05:56:04-287 [INFO] com.csscorp.restapi.RestServer$ - Port: 8080

[hadoop@ip-xx-y-zz-12 arvind]$ export HTTP_PORT=8081
[hadoop@ip-xx-y-zz-12 arvind]$ echo $HTTP_PORT
Output:
8081
[hadoop@ip-xx-y-zz-12 arvind]$ scala -classpath deploy/scala-redis.jar com.csscorp.restapi.RestServer
Output:
2016/12/27 05:59:39-304 [INFO] com.csscorp.restapi.RestServer$ - Hostname: localhost
2016/12/27 05:59:39-305 [INFO] com.csscorp.restapi.RestServer$ - Port: 8081

 

Reference

Using Typesafe’s Config for Scala (and Java) for Application Configuration

Scala – Head, Tail, Init & Last

Sample Program

object SequencesTest {

  // Conceptual representation
  println("""nag arvind gudiseva scala""")
  println("""---> HEAD""")
  println("""nag arvind gudiseva scala""")
  println("""... ---------------------> TAIL""")
  println("""nag arvind gudiseva scala""")
  println("""-------------------> INIT""")
  println("""nag arvind gudiseva scala""")
  println("""................... -----> LAST""")
  println("-----------------------------------")

  def main(arg: Array[String]): Unit = {

    val str1: String = "nag arvind gudiseva scala"
    val str1Arr: Array[String] = str1.split(" ")

    println("HEAD: " + str1Arr.head)
    println("TAIL: " + str1Arr.tail.deep.mkString)
    println("INIT: " + str1Arr.init.deep.mkString)
    println("LAST: " + str1Arr.last)

    println("-----------------------------------")

  }

}

 

Output

-----------------------------------
HEAD: nag
TAIL: arvindgudisevascala
INIT: nagarvindgudiseva
LAST: scala
-----------------------------------
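The `.deep.mkString` calls above join the remaining elements with no separator, which is why the words run together in the output. Passing a separator to `mkString` keeps them readable; a small variation:

```scala
// head/tail/init/last on a split string, joined with spaces for readability.
object SequencesSeparatorDemo {
  def main(args: Array[String]): Unit = {
    val words = "nag arvind gudiseva scala".split(" ")
    println("HEAD: " + words.head)                // first element
    println("TAIL: " + words.tail.mkString(" ")) // all but the first
    println("INIT: " + words.init.mkString(" ")) // all but the last
    println("LAST: " + words.last)                // last element
  }
}
```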

 

Reference

Scala sequences – head, tail, init, last

 

Git Commands on Windows

Install Git on Windows

1. Download the latest Git for Windows stand-alone installer

2. Accept the default options through Next and Finish.

3. Open Command Prompt

4. Configure Git username and email (these will be associated with any commits)

    Git global setup:

    C:\> git config --global user.name "Nag Arvind Gudiseva"
    C:\> git config --global user.email "nag.gudiseva@csscorp.com"

Git Command line instructions

A. Create a new repository

    C:\> git clone https://gitlab.com/csscorpglobal/analytics-sample-project.git
    C:\> cd analytics-sample-project
    C:\> touch README.md
    C:\> git add README.md
    C:\> git commit -m "add README"
    C:\> git push -u origin master

B. Existing folder or Git repository

    C:\> cd existing_folder
    C:\> git init
    C:\> git remote add origin https://gitlab.com/csscorpglobal/analytics-sample-project.git
    C:\> git add .
    C:\> git commit
    C:\> git push -u origin master

C. In Linux

    $ git config --global http.proxy ""
    $ cd ~/gudiseva
    $ git init
    $ git status
    $ git add arvind.jpeg
    $ git commit -m "Add arvind.jpeg."
    $ git remote add origin https://github.com/gudiseva/arvind.git
    $ git remote -v
    $ git remote show origin
    $ git push -u origin master

 

Jersey Multipart File Upload Maven Project

Create Maven Project with the below archetype (Jersey Version: 2.9; Jetty Version: 9.2.18.v20160721):

mvn archetype:generate -DarchetypeGroupId=org.glassfish.jersey.archetypes -DarchetypeArtifactId=jersey-quickstart-webapp -DarchetypeVersion=2.9

Apache FLUME Installation and Configuration in Windows 10

1. Download & install Java

2. Create a Junction Link.  (Needed because the Java path contains spaces)

    C:\Windows\system32>mklink /J "C:\Program_Files" "C:\Program Files"
    Junction created for C:\Program_Files <<===>> C:\Program Files

3. Set Path and Classpath for Java

    JAVA_HOME=C:\Program_Files\Java\jdk1.8.0_102
    PATH=%JAVA_HOME%\bin;C:\Program_Files\Java\jdk1.8.0_102\bin
    CLASSPATH=%JAVA_HOME%\jre\lib

4. Download Flume

    Download apache-flume-1.7.0-bin.tar.gz

5. Extract using 7-Zip

    Move the extracted folder to the C:\flume\apache-flume-1.7.0-bin directory

6. Set Path and Classpath for Flume

    FLUME_HOME=C:\flume\apache-flume-1.7.0-bin
    FLUME_CONF=%FLUME_HOME%\conf
    CLASSPATH=%FLUME_HOME%\lib\*
    PATH=C:\flume\apache-flume-1.7.0-bin\bin

7. Download Windows binaries for Hadoop versions

    https://github.com/steveloughran/winutils

8. Copy the Windows binaries

    To C:\hadoop\hadoop-2.6.0\bin

9. Set Path and Classpath for Hadoop

    PATH=C:\hadoop\hadoop-2.6.0\bin
    HADOOP_HOME=C:\hadoop\hadoop-2.6.0

10. Edit log4j.properties file

    flume.root.logger=DEBUG,console
    #flume.root.logger=INFO,LOGFILE

11. Copy flume-env.ps1.template as flume-env.ps1.

    Add below configuration:
    $JAVA_OPTS="-Xms500m -Xmx1000m -Dcom.sun.management.jmxremote"

12. Copy flume-conf.properties.template as flume-conf.properties

13. — Flume Working Commands —

C:\> cd %FLUME_HOME%/bin
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/flume-conf.properties.template --name agent

14. Install HDInsight Emulator (Hadoop) on Windows 10

a. Install Microsoft Web Platform Installer 5.0
b. Search for HDInsight
c. Select Add -> Install -> I Accept -> Finish
d. Format Namenode

        C:\hdp> hdfs namenode -format

e. Start Hadoop and Other Services

        C:\hdp> start_local_hdp_services

f. Verify

        c:\hdp> hdfs dfsadmin -report

g. Hadoop Sample Commands

C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -ls hdfs://lap-04-2312:8020/
C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users
C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users/hadoop
C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users/hadoop/flume

h. Stop Hadoop and Other Services

        C:\hdp> stop_local_hdp_services

15. — Other Flume Commands —

C:\> cd %FLUME_HOME%/bin
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/seq_log.properties --name SeqLogAgent
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/seq_gen.properties --name SeqGenAgent
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/flume-conf.properties --name TwitterAgent

 

Note: Sample Configurations and Properties are attached.

flume-conf-properties   seq_gen-properties   seq_log-properties

Apache Flume Installation & Configurations on Ubuntu 16.04

1. Download Flume

2. Extract Flume tar

$ tar -xzvf apache-flume-1.7.0-bin.tar.gz

3. Move to a folder

$ sudo mv apache-flume-1.7.0-bin /opt/
$ sudo mv /opt/apache-flume-1.7.0-bin /opt/apache-flume-1.7.0

4. Update the Path

$ gedit ~/.bashrc

export FLUME_HOME=/opt/apache-flume-1.7.0
export FLUME_CONF_DIR=$FLUME_HOME/conf
export FLUME_CLASSPATH=$FLUME_CONF_DIR
export PATH=$PATH:$FLUME_HOME/bin

5. Update the Flume Environment

$ cd conf/
$ cp flume-env.sh.template flume-env.sh
$ gedit flume-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JAVA_OPTS="-Xms500m -Xmx1000m -Dcom.sun.management.jmxremote"
export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*

$ cd ..

6. Update log4j.properties

flume.root.logger=DEBUG,console
#flume.root.logger=INFO,LOGFILE

7. Reload BashRc

$ source ~/.bashrc (OR) . ~/.bashrc

8. Test Flume

$ flume-ng --help

9. Start Hadoop

$ start-all.sh
$ hadoop fs -ls hdfs://localhost:9000/nag
$ hdfs dfs -mkdir hdfs://localhost:9000/user/gudiseva/twitter_data

10. Run the following commands:

A. Twitter

$ cd $FLUME_HOME

$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
(OR)
$ ./flume-ng agent -n TwitterAgent -c conf -f ../conf/twitter.conf
(OR)
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/twitter.conf -Dflume.root.logger=DEBUG,console --name TwitterAgent

B. Sequence

$ cd $FLUME_HOME

$ bin/flume-ng agent --conf ./conf/ -f conf/seq_gen.conf -Dflume.root.logger=DEBUG,console -n SeqGenAgent
(OR)
$ ./flume-ng agent -n SeqGenAgent -c conf -f ../conf/seq_gen.conf
(OR)
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_gen.conf --name SeqGenAgent

C. NetCat

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/netcat.conf --name NetcatAgent -Dflume.root.logger=INFO,console

D. Sequence Logger

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_log.conf --name SeqLogAgent -Dflume.root.logger=INFO,console

E. Cat / Tail File Channel

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/cat_tail.properties --name a1 -Dflume.root.logger=INFO,console

F. Spool Directory

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/flume-conf.properties --name agent -Dflume.root.logger=INFO,console

G. Default Template

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/flume-conf.properties.template --name agent -Dflume.root.logger=INFO,console

H. Multiple Sinks

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/multiple_sinks.properties --name flumeAgent -Dflume.root.logger=INFO,LOGFILE

11. Netcat

A. Check if Netcat is installed

   $ which netcat
   $ nc -h

B. Else, install Netcat

   $ sudo apt-get install netcat

C. Netcat in listening (Server) mode

   $ nc -l -p 12345

D. Netcat in Client mode

   $ nc localhost 12345
   $ curl telnet://localhost:12345

E. Netcat as a Client to perform Port Scanning

   $ nc -v hostname port
   $ nc -v www.google.com 80
     GET / HTTP/1.1
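The NetcatAgent run in step 10-C assumes a netcat.conf along these lines, a standard single-agent layout with one netcat source, a memory channel, and a logger sink (the component names here are illustrative; the agent name must match the --name argument):

```properties
# netcat.conf -- minimal Flume agent: netcat source -> memory channel -> logger sink
NetcatAgent.sources = netcatSrc
NetcatAgent.channels = memCh
NetcatAgent.sinks = logSink

# Listen on the same port used in the nc examples above
NetcatAgent.sources.netcatSrc.type = netcat
NetcatAgent.sources.netcatSrc.bind = localhost
NetcatAgent.sources.netcatSrc.port = 12345
NetcatAgent.sources.netcatSrc.channels = memCh

NetcatAgent.channels.memCh.type = memory
NetcatAgent.channels.memCh.capacity = 1000

NetcatAgent.sinks.logSink.type = logger
NetcatAgent.sinks.logSink.channel = memCh
```

With this in place, lines typed into `nc localhost 12345` appear as events in the agent's console log.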

 

Note: Sample Configurations and Properties are attached.

cat_tail-properties; flume-conf-properties; multiple_sinks-properties; netcat-conf; seq_gen-conf; seq_log-conf; twitter-conf

 

Lambda Architecture

What is the Lambda Architecture?

Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.

The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.

Here’s how it looks, from a high-level perspective:

(Figure: LA overview)

  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views.
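The merge in point 5 can be sketched in Scala: a hypothetical batch view (stale but complete) and real-time view (fresh but partial) as maps, with the speed-layer values taking precedence for any key present in both:

```scala
// Sketch of serving a query by merging a batch view with a real-time view.
// The right-hand operand of ++ wins on shared keys, so fresher
// speed-layer counts override the stale batch counts.
object LambdaMergeDemo {
  def merge(batchView: Map[String, Long], speedView: Map[String, Long]): Map[String, Long] =
    batchView ++ speedView

  def main(args: Array[String]): Unit = {
    val batch = Map("pageA" -> 100L, "pageB" -> 40L) // from the batch layer
    val speed = Map("pageB" -> 42L, "pageC" -> 3L)   // recent data only
    println(merge(batch, speed)) // pageB reflects the fresher speed-layer count
  }
}
```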

Reference: http://lambda-architecture.net/