Setting up Spark Notebook on Mac OS X

Getting Spark up and running is pretty easy. After downloading the distribution and unarchiving it, you can execute the binaries directly.

For more convenient handling, update your .bash_profile with the following settings:

# Setting SPARK_HOME and variables for the IPython notebook
export IPYTHON=1
export IPYTHON_OPTS="notebook"
export SPARK_HOME=/Users/thorsten/tools/spark
export LC_CTYPE=en_US.UTF8

Obviously you’d have to adjust the path according to your installation folder.
I’ve updated the settings to use the IPython notebook.
The last setting is a workaround for a ‘bug’ in Python on Mac OS.
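
With these settings sourced, running $SPARK_HOME/bin/pyspark should bring up the notebook instead of the plain shell. A quick sanity check in a notebook cell could look like this (a minimal sketch, assuming the SparkContext is available as sc, which pyspark provides out of the box):

# count the even numbers in a small distributed range
rdd = sc.parallelize(range(1000))
print(rdd.filter(lambda x: x % 2 == 0).count())   # expected: 500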

Failing Test TestZKSignerSecretProvider

When compiling Hadoop from source in a virtual machine with rather poor performance, I faced failing tests in TestZKSignerSecretProvider.

The error message is `java.lang.AssertionError: expected null, but was:<[B@142bad79>`.
Increase the timeout in line 37 to `private final int timeout = 16000;` and the test passes.

Further details on this topic can be found at

Getting Data from IBM Emptoris SLM

IBM offers a platform to manage your supplier lifecycle and contracts. These are former products from Xcitec (a Siemens carve-out) and Emptoris (who first acquired Xcitec and was then acquired by IBM, which looks a lot like a small fish eaten by a bigger fish eaten by an even bigger fish).

You can check out the SLM and Contract Management software on your own, because what I’m interested in is how to make this data available in my purchasing warehouse.

Although this software suite now offers reporting capabilities, they are based on IBM Cognos. If you prefer other tools for your data analysis (e.g. Tableau, Python notebooks, R, …), you’d better get this data out of these systems and into your own data lake.

The web services return the data as an XML document within a SOAP response or as an XML document under a RESTful resource.
You need some mapping from the document-centric representation to a more relational one. E.g. a contract is represented as a document with a list of units which can use this contract. Here’s a snippet:

"description" : "My-Contract",
"orgs" : [

For reporting, it’s not so interesting whether there’s a contract that can be used by some of the subsidiaries.
It’s more interesting which of the subsidiaries actually have a contract in place. So the document is pivoted on the inner list:

My-Contract Org-Unit-01
My-Contract Org-Unit-02
My-Contract Org-Unit-03
My-Contract Org-Unit-04

This projection can be put into a table and joined to your dimension and fact tables from the warehouse.
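
The real application is written in Clojure (more on that below), but the pivoting step is easy to sketch; here it is in Python, using just the document structure from the snippet above:

# pivot a contract document onto its inner list of org units:
# one row per (contract, org unit) combination
def pivot_contract(doc):
    return [(doc["description"], org) for org in doc["orgs"]]

contract = {
    "description": "My-Contract",
    "orgs": ["Org-Unit-01", "Org-Unit-02", "Org-Unit-03", "Org-Unit-04"],
}

for row in pivot_contract(contract):
    print(row)   # ('My-Contract', 'Org-Unit-01'), ('My-Contract', 'Org-Unit-02'), ...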

For a full set of examples of how this works, have a look at the README.

I wrote this application mainly in Clojure. It has excellent performance and vast library support.

So why write this application in a functional language?
Well, first, there aren’t any real objects involved; by that I mean no state needs to
be maintained. Secondly, an implementation based on pure functions means small functions
which do exactly one thing. So development is divide and conquer at its best.


– Working with SOAP web services is agile on a pure XML level.

– JSON really is the lightweight alternative to XML.

– A DSL is a good way to keep the customer happy.

PostgreSQL – Load the data

In my last post, I created the database schema for AdventureWorks with a proper naming convention.

But that was actually the easiest task. We still have no data stored in our newly created PostgreSQL database. With roughly 80 tables, the AW database would call for a lot of ETL jobs.

But who really wants to write 80 ETL jobs for a proof of concept? As the naming differs between source and target, there are a few options to do it smarter:

  1. Writing custom functions in your ETL tool to change the naming on the fly. That would still mean creating a job for each table, but you’d benefit from the name-matching mechanism (if your tool provides one).
  2. Writing some code to iterate over the tables and load the data via JDBC.

As I’m rather lazy, I went for the second option. The code is available on GitHub. It’s just a dirty hack, but it does the job of getting a meaningful database into PostgreSQL. This tool could nicely be extended to also create the target schema, but the focus for now is reporting on PostgreSQL.
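
Just to illustrate the idea in Python rather than via JDBC: the sketch below reads every table from SQL Server with pyodbc and writes it into PostgreSQL with psycopg2. The connection settings, the assumption that the target schema already exists with snake_case names, and the simplistic name conversion are placeholders, not the actual implementation.

import pyodbc      # source: SQL Server
import psycopg2    # target: PostgreSQL

def to_snake_case(name):
    # placeholder for the naming conversion from my last post
    return name.lower()

# hypothetical connection settings
src = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                     "SERVER=localhost;DATABASE=AdventureWorks;UID=sa;PWD=secret")
tgt = psycopg2.connect(host="localhost", dbname="adventureworks",
                       user="postgres", password="secret")
tgt_cur = tgt.cursor()

# iterate over all user tables in the source
tables = src.cursor()
tables.execute("SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
               "WHERE TABLE_TYPE = 'BASE TABLE'")
for schema, table in tables.fetchall():
    rows = src.cursor()
    rows.execute(f"SELECT * FROM [{schema}].[{table}]")
    columns = [to_snake_case(col[0]) for col in rows.description]
    insert = (f"INSERT INTO {to_snake_case(schema)}.{to_snake_case(table)} "
              f"({', '.join(columns)}) VALUES ({', '.join(['%s'] * len(columns))})")
    # fetchall is fine for a proof of concept; use fetchmany for big tables
    tgt_cur.executemany(insert, [tuple(r) for r in rows.fetchall()])
    tgt.commit()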

Learning PostgreSQL

When we start learning a new database, the first step after successfully installing it is getting a proper sample database. Coming from SQL Server, I have to admit that I actually liked the AdventureWorks database. It has a version for OLTP and one for analytical purposes, and it has a pretty advanced design.

So we want to get the SQL Server database to PostgreSQL.


On SQL Server the names of database objects are case-insensitive, so for example ‘ProductSubcategory’ is the same as ‘productsubcategory’, whereas PostgreSQL would consider these two different objects (at least when the name is quoted). As we don’t want to quote every object in our queries, we have to change the naming. But how do we change the object names of 100 tables with roughly 920 columns? Obviously we don’t want to do this by hand.

So let’s write some Python code to convert our camel case names into proper PostgreSQL naming.

Let’s use AquaDataStudio to generate the create scripts for the AdventureWorks databases. With quoting of object names enabled, we can identify them by their enclosing quotes. Once they are identified, we just run them through a little function to get a proper naming scheme.

def convert_camel_word(in_word):
    # convert a CamelCase object name to snake_case; spaces become underscores
    out_word = ""
    pos = 0
    for c in in_word:
        if c.isupper():
            # put an underscore before an upper-case letter that follows a lower-case one
            if (pos > 0
                    and in_word[pos - 1] != " "
                    and in_word[pos - 1].islower()):
                out_word += '_'
        pos += 1
        if c == " ":
            out_word += "_"
        else:
            out_word += c.lower()
    return out_word
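
A quick check with two illustrative names shows what the function produces (expected output in the comments):

print(convert_camel_word('ProductSubcategory'))   # product_subcategory
print(convert_camel_word('Sales Reason'))         # sales_reason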

import re

sql_file = 'adventureworks.sql'
postgresql_file = open('adventureworks_postgres.sql', mode='w', encoding='utf8')

for line in open(sql_file, encoding='utf8'):
    # capturing \w+ in the split keeps the identifiers as separate list items
    line_list = re.split(r'(\w+)', line)
    for pos, item in enumerate(line_list):
        if (0 < pos < len(line_list) - 1
                and '"' in line_list[pos - 1]
                and '"' in line_list[pos + 1]):
            # a quoted object name: write it out in snake_case
            print(convert_camel_word(item), end='', file=postgresql_file)
        else:
            # everything else is copied through, dropping the quote characters
            print(item.replace('"', ''), end='', file=postgresql_file)

postgresql_file.close()

That should do the basic trick to get proper names. Now some fine-tuning is required to replace the proprietary data types in the tables with ones that fit PostgreSQL. Additionally, you want to get rid of the computed columns (after all, this isn’t Excel we’re working with; if you want computed columns, use a view).
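
The type replacement can be handled with a few more substitutions in the same script. The mapping below is only a sketch covering common SQL Server types, not a complete list:

# rough SQL Server -> PostgreSQL type mapping (extend as needed;
# longer names come first, so 'smalldatetime' isn't caught by 'datetime')
TYPE_MAP = {
    'smalldatetime': 'timestamp',
    'datetime': 'timestamp',
    'nvarchar': 'varchar',
    'nchar': 'char',
    'bit': 'boolean',
    'money': 'numeric(19,4)',
    'tinyint': 'smallint',
    'uniqueidentifier': 'uuid',
    'varbinary': 'bytea',
}

def convert_types(line):
    # naive string replacement; good enough for a generated create script
    for mssql_type, pg_type in TYPE_MAP.items():
        line = line.replace(mssql_type, pg_type)
    return line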

Goal Driven Data Manager

I’ve just learned about schedoscope. You’ll have to check out the software yourself (I haven’t had time to go through the tutorials yet), but after a discussion with one of the creators I got an idea of what to do with it.

Basically schedoscope is a goal-driven data manager for Hadoop. That is, you tell the system what data you want to have, and it takes care of the creation and maintenance of that data (by using a DAG).

So what could you do with a goal-driven data manager, and why should you care? Well, think about an analytical data structure.

An OLAP cube (simply speaking) consists of a set of aggregated values along your hierarchies and thus enables quick navigation of your data. When you process an OLAP cube with e.g. SSAS, you can watch the different SQL statements to understand what data actually goes into your cube.
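
As a toy illustration of what those pre-aggregated values look like, here is a tiny fact table with a made-up product hierarchy, rolled up at each level with pandas:

import pandas as pd

# a tiny fact table with a product hierarchy (category > subcategory)
sales = pd.DataFrame({
    "category":    ["Bikes", "Bikes", "Bikes",     "Accessories"],
    "subcategory": ["Road",  "Road",  "Mountain",  "Helmets"],
    "amount":      [100.0,   250.0,   300.0,       40.0],
})

# the 'cube' is essentially these aggregates, one per hierarchy level
print(sales.groupby(["category", "subcategory"])["amount"].sum())
print(sales.groupby("category")["amount"].sum())
print(sales["amount"].sum())   # grand total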

Now with schedoscope at hand, you could actually go and build an OLAP cube on top of Hadoop, so you could drill up and down your big data.