
How to upload Notebook to Anaconda Cloud (CLI Option)

Step #1

Check whether the Anaconda user is signed in to Anaconda Cloud:

C:\> anaconda whoami

This should display:

  Username: gudiseva

 

Step #2

If the Anaconda user is not signed in, then …

C:\> anaconda login

… and provide your Anaconda username and password

 

Step #3

Navigate to the folder where the notebook file is saved and upload it to Anaconda Cloud:

C:\> anaconda upload notebook.ipynb

 


Run Scala script in Production

Steps

[hadoop@ip-xx-y-zz-12 ~]$ vim /mnt1/analytics/arvind/deploy/scala-redis.sh
#!/bin/sh
# Export the bind host for the REST server, then replace this shell with the JVM.
export HTTP_HOST=xx.y.zz.12
echo $HTTP_HOST
exec java -classpath "/mnt1/analytics/arvind/deploy/scala-redis.jar" "com.csscorp.restapi.RestServer" "$0" "$@"
[hadoop@ip-xx-y-zz-12 deploy]$ chmod -R 777 scala-redis.sh
[hadoop@ip-xx-y-zz-12 deploy]$ nohup sh scala-redis.sh > ./scalaredis.log 2>&1 &
[hadoop@ip-xx-y-zz-12 deploy]$ jps
22938 RestServer
22961 Jps
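The wrapper script exports HTTP_HOST before handing control to the JVM, so the server can pick the bind address up from its environment. A minimal sketch of that lookup on the Scala side (my illustration; the actual RestServer resolves its settings through configuration, as the Typesafe post below describes):

object HostLookup {

  def main(args: Array[String]): Unit = {
    // Fall back to localhost when HTTP_HOST is not exported (assumption).
    val host = sys.env.getOrElse("HTTP_HOST", "localhost")
    println(s"Binding REST server to $host")
  }
}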

Reference

Scala java.lang.NoSuchMethodError: scala.Predef$.augmentString error

Using Typesafe in Scala

Using the Typesafe Config library, we can override configuration settings at runtime. In the application.conf below, an assignment like host = ${?REDIS_HOST} takes effect only when that environment variable is set; otherwise the default on the preceding line is kept. This post describes the technique:

build.sbt

libraryDependencies ++= Seq(
  // --- Dependencies ---
  "com.typesafe" % "config" % "1.2.1"
  // --- Dependencies ---
).map(
  _.excludeAll(ExclusionRule(organization = "org.mortbay.jetty"))
)

application.conf

redis {
  host = "localhost"
  host = ${?REDIS_HOST}
  port = 6379
  port = ${?REDIS_PORT}
}

Config

package com.csscorp.util

import com.typesafe.config.ConfigFactory

/**
  * Created by Nag Arvind Gudiseva on 15-Dec-16.
  */
object Config {

  /** Loads all key / value pairs from the application configuration file. */
  private val conf = ConfigFactory.load()

  // Redis Configurations
  object RedisConfig {
    private lazy val redisConfig = conf.getConfig("redis")

    lazy val host = redisConfig.getString("host")
    lazy val port = redisConfig.getInt("port")
  }

}

TypeSafeTest

package tutorials.typesafe

import com.csscorp.util.Config.RedisConfig

/**
  * Created by Nag Arvind Gudiseva on 26-Dec-16.
  */
object TypeSafeTest {

  def main(args: Array[String]): Unit = {

    val redisHost = RedisConfig.host
    println(s"Hostname: $redisHost")
    val redisPort = RedisConfig.port
    println(f"Port: $redisPort")

  }
}

Terminal

C:\ScalaRedis> sbt clean compile package
C:\ScalaRedis> scala -classpath target/scala-2.10/scalaredis_2.10-1.0.jar tutorials.typesafe.TypeSafeTest
Output:
    Hostname: localhost
    Port: 6379

C:\ScalaRedis> SET REDIS_HOST=1.2.3.4
C:\ScalaRedis> echo %REDIS_HOST%
Output:
    1.2.3.4
C:\ScalaRedis> SET REDIS_PORT=1234
C:\ScalaRedis> echo %REDIS_PORT%
Output:
    1234

C:\ScalaRedis> scala -classpath target/scala-2.10/scalaredis_2.10-1.0.jar tutorials.typesafe.TypeSafeTest
Output:
    Hostname: 1.2.3.4
    Port: 1234

As observed, the hostname and port set in the system environment variables are substituted at load time.
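JVM system properties provide the same kind of override and take precedence over application.conf. A minimal sketch against the same redis block (normally the value would be passed on the command line as -Dredis.host=10.0.0.5; the programmatic set below is only for demonstration):

package tutorials.typesafe

import com.typesafe.config.ConfigFactory

object SysPropOverrideTest {

  def main(args: Array[String]): Unit = {
    // System properties override application.conf keys with the same path.
    System.setProperty("redis.host", "10.0.0.5")
    ConfigFactory.invalidateCaches() // force load() to re-read the overrides
    val conf = ConfigFactory.load()
    println(s"Hostname: ${conf.getString("redis.host")}") // Hostname: 10.0.0.5
  }
}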

Production (AWS EMR)

[hadoop@ip-xx-y-zz-12 arvind]$ echo $HTTP_PORT
Output:
<Blank>
[hadoop@ip-xx-y-zz-12 arvind]$ scala -classpath deploy/scala-redis.jar com.csscorp.restapi.RestServer
Output:
2016/12/27 05:56:04-286 [INFO] com.csscorp.restapi.RestServer$ - Hostname: localhost
2016/12/27 05:56:04-287 [INFO] com.csscorp.restapi.RestServer$ - Port: 8080

[hadoop@ip-xx-y-zz-12 arvind]$ export HTTP_PORT=8081
[hadoop@ip-xx-y-zz-12 arvind]$ echo $HTTP_PORT
Output:
8081
[hadoop@ip-xx-y-zz-12 arvind]$ scala -classpath deploy/scala-redis.jar com.csscorp.restapi.RestServer
Output:
2016/12/27 05:59:39-304 [INFO] com.csscorp.restapi.RestServer$ - Hostname: localhost
2016/12/27 05:59:39-305 [INFO] com.csscorp.restapi.RestServer$ - Port: 8081

 

Reference

Using Typesafe’s Config for Scala (and Java) for Application Configuration

Scala – Head, Tail, Init & Last

Sample Program

object SequencesTest {

  // Conceptual representation:
  //   nag arvind gudiseva scala
  //   ---> HEAD
  //   nag arvind gudiseva scala
  //   ... ---------------------> TAIL
  //   nag arvind gudiseva scala
  //   -------------------> INIT
  //   nag arvind gudiseva scala
  //   ................... -----> LAST
  println("-----------------------------------")

  def main(arg: Array[String]): Unit = {

    val str1: String = "nag arvind gudiseva scala"
    val str1Arr: Array[String] = str1.split(" ")

    println("HEAD: " + str1Arr.head)
    println("TAIL: " + str1Arr.tail.deep.mkString)
    println("INIT: " + str1Arr.init.deep.mkString)
    println("LAST: " + str1Arr.last)

    println("-----------------------------------")

  }

}

 

Output

-----------------------------------
HEAD: nag
TAIL: arvindgudisevascala
INIT: nagarvindgudiseva
LAST: scala
-----------------------------------
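One caveat worth adding: all four operations throw on an empty collection. A short sketch of the safe variants (my addition, not from the referenced post):

object SafeSequencesTest {

  def main(args: Array[String]): Unit = {
    val empty: Array[String] = Array.empty

    // head and last throw NoSuchElementException on an empty collection;
    // headOption and lastOption return None instead.
    println("HEAD: " + empty.headOption) // HEAD: None
    println("LAST: " + empty.lastOption) // LAST: None

    // tail and init throw UnsupportedOperationException on an empty
    // collection, so guard them explicitly.
    if (empty.nonEmpty) println("TAIL: " + empty.tail.mkString(" "))
  }
}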

 

Reference

Scala sequences – head, tail, init, last

 

Git Commands on Windows

Install Git on Windows

1. Download the latest Git for Windows stand-alone installer

2. Accept the default options through Next and Finish

3. Open Command Prompt

4. Configure the Git username and email (these will be associated with your commits)

    Git global setup:

    C:\> git config --global user.name "Nag Arvind Gudiseva"
    C:\> git config --global user.email "nag.gudiseva@csscorp.com"

Git Command line instructions

A. Create a new repository (run from Git Bash, since touch is not a cmd.exe command)

    C:\> git clone https://gitlab.com/csscorpglobal/analytics-sample-project.git
    C:\> cd analytics-sample-project
    C:\> touch README.md
    C:\> git add README.md
    C:\> git commit -m "add README"
    C:\> git push -u origin master

B. Existing folder or Git repository

    C:\> cd existing_folder
    C:\> git init
    C:\> git remote add origin https://gitlab.com/csscorpglobal/analytics-sample-project.git
    C:\> git add .
    C:\> git commit
    C:\> git push -u origin master

C. In Linux

    $ git config --global http.proxy ""
    $ cd ~/gudiseva
    $ git init
    $ git status
    $ git add arvind.jpeg
    $ git commit -m "Add arvind.jpeg."
    $ git remote add origin https://github.com/gudiseva/arvind.git
    $ git remote -v
    $ git remote show origin
    $ git diff origin/master
    $ git push -u origin master

 

Jersey Multipart File Upload Maven Project

Create a Maven project with the archetype below (Jersey version: 2.9; Jetty version: 9.2.18.v20160721):

mvn archetype:generate -DarchetypeGroupId=org.glassfish.jersey.archetypes -DarchetypeArtifactId=jersey-quickstart-webapp -DarchetypeVersion=2.9
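The archetype only scaffolds a plain Jersey webapp; multipart uploads additionally need the jersey-media-multipart dependency and MultiPartFeature registered with the application. A minimal resource sketch (written in Scala for consistency with the rest of this blog; the path and handling logic are placeholders, not the project's actual code):

import java.io.InputStream
import javax.ws.rs.{Consumes, POST, Path}
import javax.ws.rs.core.{MediaType, Response}
import org.glassfish.jersey.media.multipart.{FormDataContentDisposition, FormDataParam}

@Path("upload")
class UploadResource {

  @POST
  @Consumes(Array(MediaType.MULTIPART_FORM_DATA))
  def upload(@FormDataParam("file") stream: InputStream,
             @FormDataParam("file") detail: FormDataContentDisposition): Response = {
    // A real service would persist the stream; here we only acknowledge
    // the file name carried in the Content-Disposition header.
    Response.ok(s"Received: ${detail.getFileName}").build()
  }
}

The endpoint can then be exercised with a multipart POST, for example curl -F "file=@sample.txt" against the deployed context path (the exact URL depends on the generated project's servlet mapping).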

Apache FLUME Installation and Configuration in Windows 10

1. Download & install Java

2. Create a junction link (needed because the Java path contains spaces)

    C:\Windows\system32>mklink /J "C:\Program_Files" "C:\Program Files"
    Junction created for C:\Program_Files <<===>> C:\Program Files

3. Set Path and Classpath for Java

    JAVA_HOME=C:\Program_Files\Java\jdk1.8.0_102
    PATH=%JAVA_HOME%\bin;C:\Program_Files\Java\jdk1.8.0_102\bin
    CLASSPATH=%JAVA_HOME%\jre\lib

4. Download Flume

    Download apache-flume-1.7.0-bin.tar.gz

5. Extract using 7-Zip

Move the extracted files to the C:\flume\apache-flume-1.7.0-bin directory

6. Set Path and Classpath for Flume

    FLUME_HOME=C:\flume\apache-flume-1.7.0-bin
    FLUME_CONF=%FLUME_HOME%\conf
    CLASSPATH=%FLUME_HOME%\lib\*
    PATH=C:\flume\apache-flume-1.7.0-bin\bin

7. Download Windows binaries for Hadoop versions

    https://github.com/steveloughran/winutils

8. Copy the binaries to C:\hadoop\hadoop-2.6.0\bin

9. Set Path and Classpath for Hadoop

    PATH=C:\hadoop\hadoop-2.6.0\bin
    HADOOP_HOME=C:\hadoop\hadoop-2.6.0

10. Edit log4j.properties file

    flume.root.logger=DEBUG,console
    #flume.root.logger=INFO,LOGFILE

11. Copy flume-env.ps1.template as flume-env.ps1.

    Add below configuration:
    $JAVA_OPTS="-Xms500m -Xmx1000m -Dcom.sun.management.jmxremote"

12. Copy flume-conf.properties.template as flume-conf.properties
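For reference, the stock template defines a minimal single-agent pipeline (a sequence-generator source, a memory channel, and a logger sink), roughly:

    agent.sources = seqGenSrc
    agent.channels = memoryChannel
    agent.sinks = loggerSink

    agent.sources.seqGenSrc.type = seq
    agent.sources.seqGenSrc.channels = memoryChannel

    agent.sinks.loggerSink.type = logger
    agent.sinks.loggerSink.channel = memoryChannel

    agent.channels.memoryChannel.type = memory
    agent.channels.memoryChannel.capacity = 100

The agent name (agent) must match the --name argument passed to flume-ng in step 13.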

13. — Flume Working Commands —

C:\> cd %FLUME_HOME%/bin
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/flume-conf.properties.template --name agent

14. Install HDInsight Emulator (Hadoop) on Windows 10

a. Install Microsoft Web Platform Installer 5.0
b. Search for HDInsight
c. Select Add -> Install -> I Accept -> Finish
d. Format Namenode

        C:\hdp> hdfs namenode -format

e. Start Hadoop and Other Services

        C:\hdp> start_local_hdp_services

f. Verify

        c:\hdp> hdfs dfsadmin -report

g. Hadoop Sample Commands

C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -ls hdfs://lap-04-2312:8020/
C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users
C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users/hadoop
C:\hdp\hadoop-2.4.0.2.1.3.0-1981> hdfs dfs -mkdir hdfs://lap-04-2312:8020/users/hadoop/flume

h. Stop Hadoop and Other Services

        C:\hdp> stop_local_hdp_services

15. — Other Flume Commands —

C:\> cd %FLUME_HOME%/bin
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/seq_log.properties --name SeqLogAgent
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/seq_gen.properties --name SeqGenAgent
C:\flume\apache-flume-1.7.0-bin\bin> flume-ng agent --conf %FLUME_CONF% --conf-file %FLUME_CONF%/flume-conf.properties --name TwitterAgent

 

Note: Sample Configurations and Properties are attached.

flume-conf-properties   seq_gen-properties   seq_log-properties

Apache Flume Installation & Configuration on Ubuntu 16.04

1. Download Flume

2. Extract Flume tar

$ tar -xzvf apache-flume-1.7.0-bin.tar.gz

3. Move to a folder

$ sudo mv apache-flume-1.7.0-bin /opt/
$ cd /opt/
$ sudo mv apache-flume-1.7.0-bin apache-flume-1.7.0

4. Update the Path

$ gedit ~/.bashrc

export FLUME_HOME=/opt/apache-flume-1.7.0
export FLUME_CONF_DIR=$FLUME_HOME/conf
export FLUME_CLASSPATH=$FLUME_CONF_DIR
export PATH=$PATH:$FLUME_HOME/bin

5. Update the Flume Environment

$ cd $FLUME_HOME/conf/
$ cp flume-env.sh.template flume-env.sh
$ gedit flume-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JAVA_OPTS="-Xms500m -Xmx1000m -Dcom.sun.management.jmxremote"
export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*

$ cd ..

6. Update log4j.properties

flume.root.logger=DEBUG,console
#flume.root.logger=INFO,LOGFILE

7. Reload BashRc

$ source ~/.bashrc (OR) . ~/.bashrc

8. Test Flume

$ flume-ng --help

9. Start Hadoop

$ start-all.sh
$ hadoop fs -ls hdfs://localhost:9000/nag
$ hdfs dfs -mkdir -p hdfs://localhost:9000/user/gudiseva/twitter_data

10. Run the following commands:

A. Twitter

$ cd $FLUME_HOME

$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
(OR)
$ bin/flume-ng agent -n TwitterAgent -c conf -f conf/twitter.conf
(OR)
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/twitter.conf --name TwitterAgent -Dflume.root.logger=DEBUG,console

B. Sequence

$ cd $FLUME_HOME

$ bin/flume-ng agent --conf ./conf/ -f conf/seq_gen.conf -Dflume.root.logger=DEBUG,console -n SeqGenAgent
(OR)
$ bin/flume-ng agent -n SeqGenAgent -c conf -f conf/seq_gen.conf
(OR)
$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_gen.conf --name SeqGenAgent

C. NetCat

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/netcat.conf --name NetcatAgent -Dflume.root.logger=INFO,console
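
A minimal netcat.conf for this agent, following the netcat source described in the Flume user guide (component names and port are assumptions chosen to match the commands in this post):

    NetcatAgent.sources = netcatSrc
    NetcatAgent.channels = memCh
    NetcatAgent.sinks = logSink

    NetcatAgent.sources.netcatSrc.type = netcat
    NetcatAgent.sources.netcatSrc.bind = localhost
    NetcatAgent.sources.netcatSrc.port = 12345
    NetcatAgent.sources.netcatSrc.channels = memCh

    NetcatAgent.sinks.logSink.type = logger
    NetcatAgent.sinks.logSink.channel = memCh

    NetcatAgent.channels.memCh.type = memory

Events can then be fed to the source with nc localhost 12345, as shown in section 11 below.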

D. Sequence Logger

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_log.conf --name SeqLogAgent -Dflume.root.logger=INFO,console

E. Cat / Tail File Channel

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/cat_tail.properties --name a1 -Dflume.root.logger=INFO,console

F. Spool Directory

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/flume-conf.properties --name agent -Dflume.root.logger=INFO,console

G. Default Template

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/flume-conf.properties.template --name agent -Dflume.root.logger=INFO,console

H. Multiple Sinks

$ cd $FLUME_HOME

$ ./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/multiple_sinks.properties --name flumeAgent -Dflume.root.logger=INFO,LOGFILE

11. Netcat

A. Check if Netcat is installed
   $ which netcat
   $ nc -h

B. Else, install Netcat
   $ sudo apt-get install netcat

C. Netcat in listening (server) mode
   $ nc -l -p 12345

D. Netcat in client mode
   $ nc localhost 12345
   $ curl telnet://localhost:12345

E. Netcat as a client to perform port scanning
   $ nc -v hostname port
   $ nc -v www.google.com 80
     GET / HTTP/1.1

 

Note: Sample Configurations and Properties are attached.

cat_tail-properties; flume-conf-properties; multiple_sinks-properties; netcat-conf; seq_gen-conf; seq_log-conf; twitter-conf

 

Lambda Architecture

What is the Lambda Architecture?

Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.

The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.

Here is how it looks from a high-level perspective:

[Figure: Lambda Architecture overview]

  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views, as the sketch below shows.
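
To make point 5 concrete, here is a tiny Scala sketch of the query-merge idea (the views and the metric are hypothetical placeholders):

object LambdaQuery {

  def main(args: Array[String]): Unit = {
    // Batch view: precomputed over the complete master dataset.
    val batchView = Map("clicks" -> 1000L)
    // Real-time view: covers only data arrived since the last batch run.
    val realtimeView = Map("clicks" -> 42L)

    // A query merges both views for a complete, low-latency answer.
    def query(metric: String): Long =
      batchView.getOrElse(metric, 0L) + realtimeView.getOrElse(metric, 0L)

    println(query("clicks")) // 1042
  }
}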

Reference: http://lambda-architecture.net/

 

SSH to Ubuntu 16.04 running on Oracle VM VirtualBox

I. In Ubuntu VM:

1. Localhost IP (ifconfig): 127.0.0.1

2. Install the ssh server and client
$ sudo apt-get install ssh

3. By default, SSH listens on port 22
$ ssh gudiseva@127.0.0.1

4. Reconfigure the port for the ssh server
$ sudo nano /etc/ssh/sshd_config
Change Port 22 to Port 2222

5. Reload the configuration
$ sudo service ssh force-reload

6. Test the connection
$ ssh gudiseva@127.0.0.1 -p 2222

 

II. In Oracle VM VirtualBox Manager:

[Screenshot: Oracle VirtualBox port-forwarding rule for SSH]
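
The same rule can also be created from the command line with VBoxManage (the VM name "Ubuntu" is an assumption; host port 22 is forwarded to the guest sshd on port 2222, matching sections I and III):

VBoxManage modifyvm "Ubuntu" --natpf1 "ssh,tcp,,22,,2222"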

 

III. In PuTTY / WinSCP:

Host / IP Address: 127.0.1.1
Port: 22

 

AWS Essentials for Hadoop Developers

MapReduce (with HDFS Path)

hadoop jar WordCount.jar WordCount /analytics/aws/input/result.csv /analytics/aws/output/1

 

MapReduce (with S3 Path)

hadoop jar WordCount.jar WordCount s3://emr-analytics-dev/input/result.csv s3://emr-analytics-dev/output/2

 

AWS S3 Cp

Usage: Copy files from EBS (mounted on EMR) to S3

aws s3 cp /mnt1/analytics/aruba/aruba_2016_clean/aruba_2016_full.csv s3://emr-analytics-dev/hdfs/analytics/aruba/
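
To copy a whole directory instead of a single file, the AWS CLI also supports recursive copies and syncs:

aws s3 cp /mnt1/analytics/aruba/aruba_2016_clean/ s3://emr-analytics-dev/hdfs/analytics/aruba/ --recursive
aws s3 sync /mnt1/analytics/aruba/aruba_2016_clean/ s3://emr-analytics-dev/hdfs/analytics/aruba/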

 

S3DistCp

Usage: Copy files from (a) HDFS to S3; (b) S3 to HDFS; (c) S3 to S3

s3-dist-cp --src=hdfs:///nag/sample.xml --dest=s3://emr-analytics-dev/conf/
s3-dist-cp --src=s3://emr-analytics-dev/jars/ --dest=hdfs:///nag/
s3-dist-cp --src=s3://emr-analytics-dev/jars/ --dest=/analytics/aws/input/
s3-dist-cp --src=hdfs:///analytics/aws/input/result.csv --dest=s3://emr-analytics-dev/conf/

 

WGet

Usage: Download files from S3 over HTTP to the local file system on EMR

wget http://emr-analytics-dev.s3.amazonaws.com/jars/WordCount.jar [Action Required: S3 Folder -> Actions -> Make Public]

 

S3Put

Usage: Upload files from the local file system on EMR to S3

s3put -a <Access Key Id> -s <Secret Access Key> -b emr-analytics-dev --region ap-southeast-1 /home/hadoop/WordCountTest.jar
s3put -b emr-analytics-dev --region ap-southeast-1 /home/hadoop/WordCountTest.jar
s3put -b emr-analytics-dev -p /home/hadoop -k jars --region ap-southeast-1 /home/hadoop/WordCountTest.jar

 

Hive External Table with S3

CREATE EXTERNAL TABLE aruba_open_word_cloud_v5_s3(
  product string,
  category string,
  sub_category string,
  calendar_year string,
  calendar_quarter string,
  csat string,
  sentiment string,
  sentiment_outlier string,
  word string,
  count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://emr-analytics-dev/hdfs/analytics/aruba/word_cloud_v5_output';
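
Once created, the table reads straight from S3 at query time; for example (count is back-quoted because it collides with the built-in aggregate):

SELECT word, `count`
FROM aruba_open_word_cloud_v5_s3
ORDER BY `count` DESC
LIMIT 10;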

 

Location of HDFS Site Configuration (hdfs-site.xml) in AWS

/usr/lib/hadoop/etc/hadoop