CSV files with Apache Spark

CSV File as Input:

File: family_names.csv



Copy file to HDFS:

hdfs dfs -copyFromLocal /Users/ArvindGudiseva/workspace/hadoop/samples/input/family_names.csv /user/samples/input/csv/


Start Spark Shell:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
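
Note: the external spark-csv package is only needed on Spark 1.x. If you are on Spark 2.x or later, CSV support is built in and no `--packages` flag is required; a roughly equivalent read (same path as above) would look like:

```
scala> val familyCSV = spark.read.
     |   option("header", "true").
     |   option("inferSchema", "true").
     |   csv("/user/samples/input/csv/family_names.csv")
```

Here `spark` is the SparkSession that the Spark 2.x shell creates for you, replacing the older `sqlContext`.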


Spark Scala Console:

scala> val familyCSV = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/user/samples/input/csv/family_names.csv")

        familyCSV: org.apache.spark.sql.DataFrame = [first_name: string, last_name: string, address_street: string, address_city: string, address_state: string, address_zip: int]

scala> familyCSV.registerTempTable("families")
scala> val familyDetails = sqlContext.sql("SELECT first_name, address_city, address_state FROM families")
scala> familyDetails.printSchema
scala> familyDetails.collect.foreach(println)
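
To round-trip the result, the same spark-csv package can also write a DataFrame back out as CSV. A minimal sketch, assuming a hypothetical HDFS output directory of your choosing (the path below is illustrative, not from the original post):

```
scala> familyDetails.write.
     |   format("com.databricks.spark.csv").
     |   option("header", "true").
     |   save("/user/samples/output/csv/family_details")
```

Spark writes the output as a directory of part files rather than a single CSV file; the directory must not already exist, or the save will fail.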



