
Accessing AW Datasets from a Notebook

Blog Post created by Andrew Flint on Mar 31, 2017

One of the great features of the Zeppelin notebook in our Analytics Workbench is the ability to write code in the language best suited for any task. Python, R, SQL and Scala each have some stand-out strengths, and if we're going to do some real data science here, you'll need to know how to access AW datasets in each of these critical languages.

 

Here's a quick tutorial, which complements the "Data Access v1.2" sample notebook you can find in this Community.

 

We'll assume that somebody has already uploaded at least one dataset into the AW data inventory. In this case, we'll focus on the "HELOC_with_State" dataset:

 

 

Whether you are a Python pro or a novice, the first step in this brief journey will often be a simple PySpark cell, to get a handle on that data:

 

%pyspark

# read an existing Analytics Workbench dataset into a Spark DataFrame

data_from_AW = aw.data.read('HELOC_with_State')

 

The magic here is the aw.data.read() function, which is available only in AW's pyspark cells and takes the dataset's name (case-sensitive) as its argument.

 

If you want to plow forward in Python, great, you're all set! Interrogate that PySpark DataFrame object all you like.
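For example, a few lines like these will give you a quick feel for the data. (This is just a sketch on my part: nothing here is AW-specific, only the usual PySpark DataFrame methods.)

%pyspark

# peek at the schema, the row count, and the first few rows
data_from_AW.printSchema()
print(data_from_AW.count())
data_from_AW.show(5)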

 

But if you want to go into Scala or SQL or R, you'll take one more step before you're done with Python, and that is to register the dataset (temporarily) in Spark's SQL context:

 

# register a temporary view of data in SQL Context as "heloc"

data_from_AW.createOrReplaceTempView("heloc")

 

And now, accessing that data from the other three languages is at most a simple one-liner:

 

Language   Interpreter   Example Access
SQL        %sql          select * from heloc
R          %spark.r      heloc.df <- sql("select * from heloc")
Python     %pyspark      heloc_df = sqlContext.table("heloc")
Scala      %spark        val helocDF = sqlContext.table("heloc")

 

That was easy. Now, rest assured you can do fancier things than just ask for the whole SQL table: you can also apply filters, subset the columns, and so forth with typical SQL syntax, but we'll save the full tour for another day.
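As a tiny preview (just a sketch on my part: I'm assuming the HELOC_with_State dataset really has a column named State, so substitute your own column names), a filtered query from a pyspark cell could look something like this:

%pyspark

# hypothetical example: filter the registered view with ordinary SQL
# ("State" and the value 'CA' are assumed, for illustration only)
ca_heloc = sqlContext.sql("select * from heloc where State = 'CA'")
ca_heloc.show(5)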

 

At any time, if you'd like to check the status of Spark's SQL context and see which datasets are already available there, you can just ask for a quick table dump ("show tables") in a SQL cell:
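That cell is as short as it sounds, and the result should include the "heloc" view we registered above:

%sql

show tables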

 

"Houston, we have a heloc!"

 

And this is just where the fun begins. To go even further, check out the Sample Notebooks available in this Community, and come back often.
