SSH to Ubuntu 16.04 running on Oracle VM VirtualBox

I. In Ubuntu VM:

1. Localhost IP (ifconfig): 127.0.0.1

2. Install the ssh server and client
$ sudo apt-get install ssh

3. By default, the SSH server listens on port 22; verify by connecting
$ ssh gudiseva@127.0.0.1

4. Reconfigure the port for the ssh server
$ sudo nano /etc/ssh/sshd_config
Change Port 22 to Port 2222

5. Reload the configuration
$ sudo service ssh force-reload

6. Test the connection
$ ssh gudiseva@127.0.0.1 -p 2222
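7. Optional check (not in the original steps): confirm sshd is actually listening on the new port
$ sudo ss -tlnp | grep 2222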

 

II. In Oracle VM VirtualBox Manager:

Oracle_VirtualBox_Port_Forward_SSH (screenshot of the port-forwarding rule for SSH)
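The screenshot above shows the port-forwarding rule configured in the VirtualBox Manager. The same rule can be created from the host command line with VBoxManage; a sketch, assuming the VM is named "Ubuntu16" and the host port mirrors the guest port (adjust both to match your rule):

$ VBoxManage modifyvm "Ubuntu16" --natpf1 "guestssh,tcp,,2222,,2222"

(If the VM is already running, use VBoxManage controlvm "Ubuntu16" natpf1 "guestssh,tcp,,2222,,2222" instead.)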

 

III. In PuTTY / WinSCP:

Host / IP Address: 127.0.1.1
Port: 22

 

AWS Essentials for Hadoop Developers

MapReduce (with HDFS Path)

hadoop jar WordCount.jar WordCount /analytics/aws/input/result.csv /analytics/aws/output/1

 

MapReduce (with S3 Path)

hadoop jar WordCount.jar WordCount s3://emr-analytics-dev/input/result.csv s3://emr-analytics-dev/output/2

 

AWS S3 Cp

Usage: Copy files from EBS (mounted on EMR) to S3

aws s3 cp /mnt1/analytics/aruba/aruba_2016_clean/aruba_2016_full.csv s3://emr-analytics-dev/hdfs/analytics/aruba/

 

S3DistCp

Usage: Copy files from (a) HDFS to S3; (b) S3 to HDFS; (c) S3 to S3

s3-dist-cp --src=hdfs:///nag/sample.xml --dest=s3://emr-analytics-dev/conf/
s3-dist-cp --src=s3://emr-analytics-dev/jars/ --dest=hdfs:///nag/
s3-dist-cp --src=s3://emr-analytics-dev/jars/ --dest=/analytics/aws/input/
s3-dist-cp --src=hdfs:///analytics/aws/input/result.csv --dest=s3://emr-analytics-dev/conf/

 

WGet

Usage: Copy files from S3 to the local file system on the EMR node

wget http://emr-analytics-dev.s3.amazonaws.com/jars/WordCount.jar [Action Required: S3 Folder -> Actions -> Make Public]
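Alternatively, if making the object public is not desirable, the AWS CLI can copy it using the node's own credentials (a sketch; assumes the AWS CLI is configured on the EMR node):

aws s3 cp s3://emr-analytics-dev/jars/WordCount.jar /home/hadoop/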

 

S3Put

Usage: Copy files from the local file system on the EMR node to S3

s3put -a <Access Key Id> -s <Secret Access Key> -b emr-analytics-dev --region ap-southeast-1 /home/hadoop/WordCountTest.jar
s3put -b emr-analytics-dev --region ap-southeast-1 /home/hadoop/WordCountTest.jar
s3put -b emr-analytics-dev -p /home/hadoop -k jars --region ap-southeast-1 /home/hadoop/WordCountTest.jar

 

Hive External Table with S3

CREATE EXTERNAL TABLE aruba_open_word_cloud_v5_s3(
product string,
category string,
sub_category string,
calendar_year string,
calendar_quarter string,
csat string,
sentiment string,
sentiment_outlier string,
word string,
count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://emr-analytics-dev/hdfs/analytics/aruba/word_cloud_v5_output';
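Once created, the table can be queried like any other Hive table; a quick sanity check from the shell (illustrative query only; count is back-quoted to avoid any clash with the built-in function name):

hive -e 'SELECT word, `count` FROM aruba_open_word_cloud_v5_s3 LIMIT 10;'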

 

Location of HDFS Site Configuration (hdfs-site.xml) in AWS

/usr/lib/hadoop/etc/hadoop
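Individual properties can also be checked without opening the file, using the standard hdfs getconf tool (optional check; dfs.replication is just an example key):

hdfs getconf -confKey dfs.replication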

 

Fetch the YouTube comments using YouTube Data API v3

Pass the following parameters:

Key = <Generated using Google API Console with personal Google Id>

Text Format = Plain Text / HTML

Part = Snippet and Replies -> to get both the comments and the replies (as nested JSON)

Top Level Comment = True -> to get quality comments

Max Results = 100 -> maximum limit allowed by Google

Video Id = <Needs to be provided; this can be automated for production>

 

YouTube REST API:

https://www.googleapis.com/youtube/v3/commentThreads?key=<Generated using Google API Console with personal Google Id>&textFormat=plainText&part=snippet,replies&topLevelComment&maxResults=100&videoId=OL57w5nj6Y4
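From a shell, the same request can be saved straight to a file; a sketch, with the key shortened to a placeholder (note that results are paginated via nextPageToken when a video has more than 100 comment threads):

curl -s "https://www.googleapis.com/youtube/v3/commentThreads?key=<API_KEY>&textFormat=plainText&part=snippet,replies&topLevelComment&maxResults=100&videoId=OL57w5nj6Y4" -o comments-REST-API.json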

 

Result in JSON:

comments-REST-API.json

Extract rows from a CSV file containing specific values using MapReduce, Pig, Hive, Apache Drill and Spark

example.csv

101,87,65,67
102,43,45,40
103,23,56,34
104,65,55,40
105,87,96,40

Q. Retrieve the value in the first column for rows that contain 40 in the last column.
Expected output:
40 102
40 104
40 105

# PIG:

example.pig
-----------
input_file = LOAD '/user/samples/input/csv/example/example.csv' USING PigStorage(',');                      -- HDFS path (example_hdfs.pig)
input_file = LOAD '/Users/ArvindGudiseva/workspace/hadoop/samples/input/example.csv' USING PigStorage(','); -- local path (example_local.pig)

filtered_records = FILTER input_file BY $3 == 40;
final_records = FOREACH filtered_records GENERATE $3 AS num1, $0 AS num2;
DUMP final_records;

Run:

1) HDFS Mode: pig -f /Users/ArvindGudiseva/workspace/hadoop/pig/example_hdfs.pig
In case of errors:
mr-jobhistory-daemon.sh start historyserver
pig -x mapreduce /Users/ArvindGudiseva/workspace/hadoop/pig/example_hdfs.pig

2) Local Mode: pig -x local /Users/ArvindGudiseva/workspace/hadoop/pig/example_local.pig

# HIVE:
-------
example.hql (OR) example.sql
----------------------------
CREATE EXTERNAL TABLE IF NOT EXISTS example(num0 INT, num1 INT, num2 INT, num3 INT)
COMMENT 'extract rows containing specific value'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- LOCATION '/user/samples/input/csv/example/'            -- error, as a forward slash is added at the end
-- LOCATION '/user/samples/input/csv/example/example.csv' -- not valid, as a directory is required
LOCATION '/user/samples/input/csv/example';

SELECT num3, num0 FROM example WHERE num3=40;

Run:

hive -f '/Users/ArvindGudiseva/workspace/hadoop/hive/example.sql'
(OR)
hive -f '/Users/ArvindGudiseva/workspace/hadoop/hive/example.hql'

# DRILL:
--------
Start Drill
-----------
Access: http://localhost:8047/query
$ drill-embedded
$ sqlline -u jdbc:drill:zk=local

Run:

1) Local File System:
SELECT columns[3], columns[0] FROM dfs.`/Users/ArvindGudiseva/workspace/hadoop/samples/input/example.csv`
WHERE columns[3]=40;

2) HDFS File System:
--#-- Storage Plugin for HDFS ---
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:9000/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "json": {
      "type": "json"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    }
  }
}

--#-- Query ---
SELECT columns[3], columns[0] FROM hdfs.`/user/samples/input/csv/example/example.csv`
WHERE columns[3]=40;

# SPARK:
--------
Start Spark Shell
-----------------
$ spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

Run:

val csv_file = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").option("inferschema", "true").load("/user/samples/input/csv/example/example.csv")
// Schema is inferred -> csv_file: org.apache.spark.sql.DataFrame = [C0: int, C1: int, C2: int, C3: int]
csv_file.registerTempTable("example")
val result = sqlContext.sql("SELECT C3, C0 FROM example WHERE C3=40")
result.collect.foreach(println)

# MAP REDUCE:
-------------
Note: It's a mapper-only job

--#-- Class Mapper ---
package example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExampleMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        String[] tokens = value.toString().split(",");

        String col0 = tokens[0];
        String col1 = tokens[1];
        String col2 = tokens[2];
        String col3 = tokens[3];

        int colInt0 = Integer.parseInt(col0);
        int colInt3 = Integer.parseInt(col3);

        // If col3 == 40, emit (col3, col0)
        if (colInt3 == 40) {
            context.write(new IntWritable(colInt3), new IntWritable(colInt0));
        }
    }
}

--#-- Class Reducer (Not Required) ---
package example;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class ExampleReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int col0 = 0;
        int col3 = 0;

        Iterator<IntWritable> itr = values.iterator();
        while (itr.hasNext()) {
            col3 = Integer.parseInt(key.toString());
            col0 = Integer.parseInt(itr.next().toString());

            context.write(new IntWritable(col3), new IntWritable(col0));
        }
    }
}

--#-- Class Driver ---
package example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ExampleDriver(), args);
        System.exit(exitCode);
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }

        Job job = new org.apache.hadoop.mapreduce.Job();
        job.setJarByClass(ExampleDriver.class);
        job.setJobName("ExampleDriver");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Delete the output folder if it already exists
        Configuration conf = job.getConfiguration();
        FileSystem.get(conf).delete(new Path(args[1]), true);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(ExampleMapper.class);
        //job.setReducerClass(ExampleReducer.class);

        // Run as a map-only job: set reduce tasks to 0
        job.setNumReduceTasks(0);

        int returnValue = job.waitForCompletion(true) ? 0 : 1;
        System.out.println("job.isSuccessful " + job.isSuccessful());
        return returnValue;
    }
}

Run:

$ mvn clean install
$ hadoop jar target/HadoopMapReduce-1.0.jar example.ExampleDriver /user/samples/input/csv/example/example.csv /user/samples/output
$ hdfs dfs -cat /user/samples/output/part-m-00000
$ hdfs dfs -cat /user/samples/output/part-r-00000 (if Reducer is used)

 

CSV files with Apache Spark

CSV File as Input:

File: family_names.csv

first_name,last_name,address_street,address_city,address_state,address_zip
Arvind,Gudiseva,Kharkhana,Secunderabad,AP,500009
Dhyuti,Gudiseva,Circlepet,Machilipatnam,AP,521001
Haritha,Murari,Whitefield,Bangalore,KA,560066

 

Copy file to HDFS:

hdfs dfs -copyFromLocal /Users/ArvindGudiseva/workspace/hadoop/samples/input/family_names.csv /user/samples/input/csv/
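Verify the copy (optional check):

hdfs dfs -ls /user/samples/input/csv/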

 

Start Spark Shell:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

 

Spark Scala Console:

scala> val familyCSV = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferschema", "true").load("/user/samples/input/csv/family_names.csv")

        familyCSV: org.apache.spark.sql.DataFrame = [first_name: string, last_name: string, address_street: string, address_city: string, address_state: string, address_zip: int]

scala> familyCSV.registerTempTable("families")
scala> val familyDetails = sqlContext.sql("SELECT first_name, address_city, address_state FROM families")
scala> familyDetails.printSchema
scala> familyDetails.collect.foreach(println)

Result:

[Arvind,Secunderabad,AP]
[Dhyuti,Machilipatnam,AP]
[Haritha,Bangalore,KA]

 

Scala – JUnit

Below is a basic implementation of a test case and a test suite using JUnit (Java-style unit tests) in Scala.

Note: Test cases and the test suite should be placed under src/test/scala.

 

build.sbt (Dependencies)

 

For Simple JUnit Tests from IDE

libraryDependencies ++= Seq(

  "junit" % "junit" % "4.8.1" % "test"
)

 

For Running SBT Tests ($ sbt test)

libraryDependencies ++= Seq(

  "com.novocode" % "junit-interface" % "0.8" % "test->default"
)

 

 

StartApp.scala

package com.cisco.cand.app

 

object startApp extends App {

  def fullName(firstName: String, lastName: String): String = {
    "Hello, " + firstName + " " + lastName + "!"
  }

}

 

StartAppTestCase1.scala

import com.cisco.cand.app.startApp

import org.junit.Assert._

import org.junit.Test
class StartAppTestCase1 {

  @Test
  def equalsTest(): Unit = {
    assertEquals("Compare method output", "Hello, Arvind Gudiseva!", startApp.fullName("Arvind", "Gudiseva"))
  }

  @Test
  def trueTest(): Unit = {
    val resultStr: String = startApp.fullName("Arvind", "Gudiseva")
    assertTrue("Matching method output", resultStr.equalsIgnoreCase("Hello, Arvind Gudiseva!"))
  }

  @Test
  def falseTest(): Unit = {
    val resultStr: String = startApp.fullName("Arvind", "Gudiseva")
    assertFalse("Mismatch of method output", resultStr.equalsIgnoreCase("Hi, Arvind Gudiseva!"))
  }

}

 

StartAppTestCase2.scala

import com.cisco.cand.app.startApp

import org.junit.Assert._

import org.junit.Test
class StartAppTestCase2 {

  @Test
  def isEmptyFalse(): Unit = {
    val resultStr: String = startApp.fullName("", "")
    assertFalse("Result String should not be empty", resultStr.isEmpty)
  }

}

 

StartAppTestSuite.scala

import org.junit.runner.RunWith

import org.junit.runners.Suite
@RunWith(classOf[Suite])
@Suite.SuiteClasses(Array(classOf[StartAppTestCase1], classOf[StartAppTestCase2]))
class StartAppTestSuite {
  // No code exists here
}
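With the junit-interface dependency above in place, the suite can be run from sbt either as part of the full test run or on its own (the testOnly form is a sketch; qualify the class name with its package if it has one):

$ sbt test
$ sbt "testOnly StartAppTestSuite"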

Scala – SLF4J Logging

Configuration details and example usage of SLF4J logging in Scala:

 

build.sbt

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(

  // ...

  "org.slf4j" % "slf4j-api" % "1.7.5",

  "org.slf4j" % "slf4j-simple" % "1.7.5",

  "org.clapper" %% "grizzled-slf4j" % "1.0.2"

)

 

MainJob.scala

package com.cisco.cand.app

 

import java.util.Date

import org.slf4j.LoggerFactory

/**
* Created by Nag Arvind Gudiseva on 07-Apr-2016.
*/
object MainJob extends App {

  val logger = LoggerFactory.getLogger(MainJob.getClass)

  val startTime = new Date()
  logger.info(" :: Start Time :: " + startTime + " :: ")
  logger.info(" :: Start Module :: Main Job :: ")

  // ...

  val endTime = new Date()
  logger.info(" :: End Time :: " + endTime + " :: ")

  val timeDiff = (endTime.getTime() - startTime.getTime()) / (1000 * 60)
  logger.info(" :: Time Difference :: " + timeDiff + " :: ")

  logger.info(" :: End Module :: Main Job :: ")

}

 

simplelogger.properties [Location: src/main/resources/]

org.slf4j.simpleLogger.logFile  = System.err
org.slf4j.simpleLogger.defaultLogLevel = info
org.slf4j.simpleLogger.showDateTime  = true
org.slf4j.simpleLogger.dateTimeFormat  = yyyy'/'MM'/'dd' 'HH':'mm':'ss'-'S
org.slf4j.simpleLogger.showThreadName  = true
org.slf4j.simpleLogger.showLogName  = true
org.slf4j.simpleLogger.showShortLogName= false
org.slf4j.simpleLogger.levelInBrackets = true
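The same settings can also be overridden per run with JVM system properties, which take precedence over the properties file; a sketch, assuming the app is launched through sbt without forking:

$ sbt -Dorg.slf4j.simpleLogger.defaultLogLevel=debug run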

– – –