Dundas REST API and Python: introducing PyDundas

Dundas has a very complete REST API. You can do just about everything with it. It is what the web app uses, so if you need an example, you can just look at the queries it sends using the developer tools of your favorite browser.

That said, it is a bit of a pain to use in a script.

To make my life easier, I built PyDundas: a Python package that wraps the REST API. It abstracts away all the nitty-gritty so you can concentrate on semantically pleasing code.

For instance, warehousing a cube is quite simple: PyDundas handles the logging, checking and waiting for you.

from pydundas import Api, Session, creds_from_yaml
import sys
creds = creds_from_yaml('credentials.yaml')

with Session(**creds) as d:
    api = Api(d)
    capi = api.cube()
    cube = capi.getByPath('Awesome Project', '/relevant/path')
    if cube is None:
        print("Gotcha, no cube named like that.")
        sys.exit(1)
    print(cube.json())
    if not cube.is_checked_out():
        cube.clear_cache()

    # Start warehousing, then block until it is done.
    cube.warehouse()
    print(cube.isWarehousing())
    cube.waitForWarehousingCompletion()
    print('Done')

There is a lot more that can be done with PyDundas. I developed it for my own needs, so I only added what I actually needed, and it is thus far from complete. That said, I regularly update it and add new APIs, and it is open source, so feel free to send me pull requests to expand it.

Bridge parameter with multi-level hierarchy in Dundas

Sometimes you need to send parameters to a data cube that you cannot just handle via slicers. In that case, bridge parameters are awesome.

You can define inside your data cube a parameter (single, range, hierarchy…) and bind it to one or more placeholders in your data cube query. Then from your dashboard you can add filters bound to this bridge parameter.

I ended up using them a lot for a 3-level hierarchy, and this post explains how to do it.

Your parameter will be a “Single Member” (as opposed to a range member (e.g. from and to for a calendar) or a single number/string). This member will have multiple levels, depending on the hierarchy it is bound to.

Indeed, when you create a bridge parameter, you bind it to a hierarchy. Each level in the hierarchy is then bound to a name, which you will find in the value of your parameter. In the example below, Top/Mid/Low Level are names coming from my hierarchy; A/B/C are generated by Dundas.

(Screenshot: hierarchy levels and the names generated by Dundas.)

Those names can be used inside the Dundas script binding the bridge parameter and the placeholders. You just need to attach the bridge parameter to the placeholders, $top$, $mid$, $low$.

(Screenshot: attaching the bridge parameter to the placeholders.)

You will need three slightly different scripts, one per level. In my example, if a level is not filled I return a different default value, so the code is not worth factorising, especially as it is only about 7 lines long.

This is the code for the top level. It is always filled, but there may or may not be deeper levels.

// always Resolve() first, to handle unresolved tokens.
SingleMemberValue h = (SingleMemberValue)$input$.Resolve();
// eg. top.mid.low.C or top.mid.B or top.A
string fullName = h.Value.UniqueName;
// Just return first element
return fullName.Split('.')[0];

The second level might be empty. In that case I return -1.

// always Resolve() first, to handle unresolved tokens.
SingleMemberValue h = (SingleMemberValue)$input$.Resolve();
// e.g. top.mid.low.C or top.mid.B or top.A
string fullName = h.Value.UniqueName;
string[] bits = fullName.Split('.');
if (bits.Length >= 3) {
  // Note that strings come in, but I need to convert to int.
  return Convert.ToInt32(bits[1]);
} else {
  return -1;
}

The same goes for the third level: it might be empty, and it is an int.

// always Resolve() first, to handle unresolved tokens.
SingleMemberValue h = (SingleMemberValue)$input$.Resolve();
// e.g. top.mid.low.C or top.mid.B or top.A
string fullName = h.Value.UniqueName;
string[] bits = fullName.Split('.');
if (bits.Length >= 4) {
  // Note that strings come in, but I need to convert to int.
  return Convert.ToInt32(bits[2]);
} else {
  return -1;
}

Dundas: states with constant

Within Dundas, you can set up states, which are great for changing how something is displayed depending on its value.

On a table, you might, for instance, want to show in green all values greater than 60% and in red all values less than 20%. This is easy enough by following the states documentation.

If you are not careful, though, you might end up with colors not matching what you expect. This is especially true when you have total rows at different levels of a hierarchy: the first level will have the right colors, but the others will not (for instance, states for days are OK, but not for months and years).

You need to know that states are a metric set feature and are calculated on the server, not by the visualisations. States thus have an aggregator, which by default is Sum. This means that if you compare against a constant of 0.6 (60%), a higher level in the hierarchy will check against a multiple of it (120% if two rows are summed, and so on).

To change this, you just need to change the aggregator. For a constant, you could use Min, Max or Average; they all give the same value, which is the one you expect.

Hive and ODBC confusions

Hive has had no official ODBC driver since version 3 at least. All Hadoop distributions (and Microsoft) ship the ODBC driver from Simba. It works OK if you can use native queries (the driver passes the query as-is to Hive) or if your query is simple. Otherwise, the driver tries to be smart and fails miserably.

As I am a customer of neither Simba nor Hortonworks, I cannot send a bug report. I asked on the Hortonworks community, but I feel quite isolated. I will share a few of my experiences here, and hopefully a good soul will pop by and tell me what I am doing wrong (or join me in whinging about this driver).

I should note that I cannot use native queries, because I need parametrised statements, which native queries do not support.
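
For context, a parametrised statement is just a query with ? placeholders whose values are bound by the ODBC client at execution time. A minimal sketch, with made-up table and column names:

-- Hypothetical table and columns; the two ? markers are bound client-side by the ODBC driver.
select s.customer, s.amount
from sales s
where s.day >= ? and s.day < ?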

Parse Error

Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: ParseException line 25:29 cannot recognize input near ‘?’ ‘and’ ‘s’ in expression specification

You will get that one a lot. Basically, on any error this is what you will get, with the place of the error being your first question mark. I thought for a long time that the driver was completely borked, but actually no (just majorly, not completely). If you enable logging (LogLevel=4 and e.g. LogPath=/tmp/hivelogs) in your odbcinst.ini, you will be able to see the inner error, which is a lot more informative.

unix_timestamp

Any query using unix_timestamp will give you

unix_timestamp is not a valid scalar function or procedure call

My guess is that the driver mixes it up with unix_timestamp() without parameters, which is deprecated. As a workaround, you can cast your date as bigint, which works the same. I was proud of myself for this workaround, but look below (Cast) for the issues it causes.
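
For illustration, a sketch of the workaround; events and event_ts are made-up names, and as said above the cast gives the same epoch seconds as the function would:

-- Instead of unix_timestamp(event_ts), which the driver rejects:
select cast(event_ts as bigint) as event_epoch
from events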

CTE

Together with the analytic functions, they are the best thing in SQL. The driver does not support them:

  syntax error near ‘with<<< ??? >>> init as (select ? as lic, ? as cpg) select * from init’.

The solution is, of course, to use subqueries instead.
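
Taking the query visible in the error message above as an example, the rewrite is mechanical (lic and cpg are simply the aliases appearing in that query):

-- CTE version, rejected by the driver:
--   with init as (select ? as lic, ? as cpg) select * from init
-- Equivalent subquery version, which the driver accepts:
select *
from (
  select ? as lic, ? as cpg
) init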

‘floor’ is a reserved keyword.

Yes, I agree that it is reserved, but only because it is an actual function. I should not get this error when using e.g. floor(42) in a query.

This one surprises me, because a simple query (select floor(42)) succeeds, whereas the same expression used in a more complex query fails. I can see from the logs that the driver reports the error but somehow recovers for simple queries, not for complex ones.

Cast does not only return strings

Casting dates to bigint and taking a diff:

select cast(cast('2019-01-21 01:32:32' as timestamp) as bigint) - cast(cast('2019-02-21 01:32:32' as timestamp) as bigint) as tto

fails as well:

Operand types SQL_WCHAR and SQL_WCHAR are incompatible for the binary minus operator

As with floor, in some cases the driver recovers, in others it does not.

Accessing Dundas with Python

Dundas has a great REST API. You can basically do everything with it. Furthermore, it is easy to find examples: you just have to look at what your Dundas web app does, and you have all the examples and use cases you could wish for.

I wanted to schedule cube refreshes, so I naturally turned to Python. It was not too complex, as you will see below. The one thing to take care of is logging out, in all possible cases; otherwise you will burn through your elastic hours very fast (been there, done that). I will show you here how I did it with a context manager, which means I basically cannot forget to log out: whatever happens, it is all managed for me.

In my production code this object actually does something (it refreshes cubes); this is only a skeleton to get you started. You can also find the code on my GitHub, where it might be easier to read.

The code actually does not do much: the call to __new__ combined with contextlib.closing() lets you use a DundasSession object within a context manager, with the keyword with, thus guaranteeing that you always log out, no matter what, even if an exception is raised or sys.exit is called.

#!/usr/bin/env python3

"""
Skeleton to use Dundas Rest API, with guaranteed log out.
"""
import contextlib
import logging
import requests
import sys

class DundasSession:
    """
    Using __new__ + contextlib.closing() is awesome.

    DundasSession can now be used within a context manager, meaning that whatever happens, its close() method
    will be called (thus logging out). No need to call logout() explicitly!
    Just using __del__ is not guaranteed to work, because when __del__ is called you do not know which objects
    are already destroyed, and the session object might well be dead.
    """
    def __new__(cls, *args, **kwargs):
        o=super().__new__(cls)
        o.__init__(*args, **kwargs)
        return contextlib.closing(o)

    def __init__(self, user, pwd, url):
        # For session reuse - TCP connection reuse, keeps cookies.
        self.s = requests.session()

        self.user = user
        self.pwd = pwd
        self.url = url
        self.api = self.url + '/api/'
        self.session_id = None  # Will be set in login()

    def login(self):
        """Login and returns the session_id"""
        login_data = {
            'accountName': self.user,
            'password': self.pwd,
            'deleteOtherSessions': False,
            'isWindowsLogOn': False
        }
        logging.info('Logging in.')
        r = self.s.post(self.api + 'logon/', json=login_data)
        # The following line exceptions out on not 200 return code.
        r.raise_for_status()

        resp = r.json()
        if resp['logOnFailureReason'].lower() == "none":
            # We're in!
            logging.info('Logged in')
            self.session_id = resp['sessionId']
        else:
            logging.error('Login failed with message: ' + r.text)
            sys.exit(1)

    def close(self):
        """Automagically called by the context manager."""
        self.logout()

    def logout(self):
        """If you do not logout, session will stay active, potentially burning through your elastic hours very fast."""

        # If session_id is not defined, we did not even log in (or we are already logged out).
        logging.info('Logging out.')
        if getattr(self, 'session_id', None):
            r = self.s.delete(self.api + 'session/current', params={'sessionId': self.session_id})
            r.raise_for_status()
            del self.session_id
            logging.info('Logged out.')
        else:
            logging.info('Was not yet logged in.')

logging.basicConfig(level='INFO')

with DundasSession(user='yourapiuser', pwd='pwd', url='https://reports.example.com') as dundas:
    dundas.login()

    # Do something smart with your Dundas object.

# No need to log out, this is handled for you via the context manager, even in case of exception or even sys.exit.

Easy test data with Hive

Testing a query on a small dataset, especially if you need to carefully check your joins, is usually done by creating a few temporary tables with hand-crafted data. This is a tried and tested method, but it has a few disadvantages:

  • Requires some work if you need to change your data,
  • If the table is not temporary, you must remember to drop it,
  • If your table is temporary it needs to be recreated after a reconnection,
  • If you don’t save the initialisation statements your test data is gone,
  • Especially with Hive, handling small tables has a lot of overhead.

It all works, but there is a nicer alternative: CTE + UDTF. Expanded, it means Common Table Expression with User Defined Table-generating Function.

Without further ado, here is an example with the usual employees and departments:

with employee as(
  select inline(array(
      struct('Alice', '2017-03-04', 1)
    , struct('Bob', '2017-04-12', 1)
    , struct('Carol', '2018-12-24', 2)
  ))  as (name, start_date, dpt_id)
)
, department as (
  select inline(array(
      struct('IT', 1)
    , struct('Finance', 2)
  ))  as (name, id)
)
select
    e.name
  , e.start_date
  , d.name
from
  employee e
join
  department d
on
  e.dpt_id=d.id
;

And the result:

+---------+---------------+----------+
| e.name  | e.start_date  |  d.name  |
+---------+---------------+----------+
| Alice   | 2017-03-04    | IT       |
| Bob     | 2017-04-12    | IT       |
| Carol   | 2018-12-24    | Finance  |
+---------+---------------+----------+

So, what do we have here?

I define two common table expressions (the with … as () statements), which are a sort of run-time table. They can be used in any following CTE or query. Each table is defined by just giving the data we want in it (wrapped in inline(array(…)) as). Changing, adding or removing data is thus trivial, and everything is nicely bundled in one place.

Another nice thing is that these CTEs actually shadow real tables with the same name. This means that once you’re done testing, you just comment out the CTE definitions and the query will run with real data. This has the added benefit that you can always keep your test data with your actual query. You just need to uncomment the CTEs to use them.
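
To make that concrete, here is the query from above with the test CTEs commented out; it now runs against the real employee and department tables:

-- with employee as(
--   select inline(array(
--       struct('Alice', '2017-03-04', 1)
--     , struct('Bob', '2017-04-12', 1)
--     , struct('Carol', '2018-12-24', 2)
--   ))  as (name, start_date, dpt_id)
-- )
-- , department as (
--   select inline(array(
--       struct('IT', 1)
--     , struct('Finance', 2)
--   ))  as (name, id)
-- )
select
    e.name
  , e.start_date
  , d.name
from
  employee e
join
  department d
on
  e.dpt_id=d.id
;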

Many other RDBMSs (MySQL, Postgres, Oracle…) have CTEs. The UDTF (the inline function) is less common, unfortunately.

ATS server does not start

The newer versions of Hadoop, including HDP 3, use HBase as the backend for the timeline service. You can either use an external HBase or have a system HBase running on YARN (the default).

When using the system HBase, you could end up with the timeline server up and running, but with an alert (in Ambari) saying:

ATSv2 HBase Application The HBase application reported a ‘STARTED’ state. Check took 2.125s

The direct impact will be that Oozie jobs (among others) will take forever to run, as each step will wait for a timeout from the ATS (Application Timeline Server) before carrying on.

The solution I found to fix this is as follows:

    1. Check your YARN logs (/var/log/hadoop-yarn/yarn/ on HDP) for anything easy to spot, for instance not enough YARN memory (and then fix it if relevant),
    2. Clean up HDFS ATS data as described in the HDP docs,
    3. Clean up ZooKeeper ATS data (the example here is for insecure clusters; you will probably have another znode for kerberised clusters): zookeeper-client rmr /atsv2-hbase-unsecure
    4. Restart *all* YARN services,
    5. Restart the Ambari server (we had a case where it looked like the alert was wrongly cached).

The steps cleaning HDFS and ZooKeeper will make you lose your ATS history (i.e. job names, timings, logs…), but your actual data is perfectly safe; nothing else will be lost.