CSV files with Apache Spark

CSV File as Input:

File: family_names.csv

first_name,last_name,address_street,address_city,address_state,address_zip
Arvind,Gudiseva,Kharkhana,Secunderabad,AP,500009
Dhyuti,Gudiseva,Circlepet,Machilipatnam,AP,521001
Haritha,Murari,Whitefield,Bangalore,KA,560066

 

Copy file to HDFS:

hdfs dfs -copyFromLocal /Users/ArvindGudiseva/workspace/hadoop/samples/input/family_names.csv /user/samples/input/csv/

 

Start Spark Shell:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

 

Spark Scala Console:

scala> val familyCSV = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/user/samples/input/csv/family_names.csv")

        familyCSV: org.apache.spark.sql.DataFrame = [first_name: string, last_name: string, address_street: string, address_city: string, address_state: string, address_zip: int]
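
Optionally, instead of relying on inferSchema, the schema can be declared up front. A minimal sketch, assuming the same six column names as in the file above (familySchema and familyCSVExplicit are illustrative names, not part of the walkthrough):

scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
scala> // same six columns; every field is a string except address_zip
scala> val familySchema = StructType(Seq(StructField("first_name", StringType), StructField("last_name", StringType), StructField("address_street", StringType), StructField("address_city", StringType), StructField("address_state", StringType), StructField("address_zip", IntegerType)))
scala> val familyCSVExplicit = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(familySchema).load("/user/samples/input/csv/family_names.csv")

The rest of the walkthrough continues with the inferred-schema familyCSV DataFrame.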

scala> familyCSV.registerTempTable("families")
scala> val familyDetails = sqlContext.sql("SELECT first_name, address_city, address_state FROM families")
scala> familyDetails.printSchema
scala> familyDetails.collect.foreach(println)

Result:

[Arvind,Secunderabad,AP]
[Dhyuti,Machilipatnam,AP]
[Haritha,Bangalore,KA]
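
Going the other way, the same package can write a DataFrame out as CSV. A rough sketch, assuming an output directory of /user/samples/output/csv/family_details (any HDFS path you can write to will do):

scala> // writes a directory of part files under the given path, not a single CSV file
scala> familyDetails.write.format("com.databricks.spark.csv").option("header", "true").save("/user/samples/output/csv/family_details")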

 
