The newer versions of Hadoop, including HDP3, use HBase as the backend for the timeline service. You can either use an external HBase or have a system HBase running on YARN (the default).
When using the system HBase, you can end up with the timeline server up and running, but with an alert in Ambari saying:
ATSv2 HBase Application The HBase application reported a ‘STARTED’ state. Check took 2.125s
The direct impact will be that Oozie jobs (among others) will take forever to run, as each step will wait for a timeout from the ATS (Application Timeline Server) before carrying on.
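To check the state of the system ats-hbase application yourself, note that HDP3 runs it as a YARN service, so something like this (run as the yarn-ats user; on a Kerberized cluster you would need the matching keytab first) should show its status:

```shell
# Query the status of the system ats-hbase YARN service.
# On a healthy cluster this prints a JSON description with "state": "STABLE".
sudo -u yarn-ats yarn app -status ats-hbase
```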
The solution I found is as follows:
- Check your YARN logs (/var/log/hadoop-yarn/yarn/ on HDP) for anything easy to spot, for instance not enough YARN memory (and fix it if relevant),
- Clean up the HDFS ATS data as described in the HDP docs,
- Clean up the ZooKeeper ATS data (the example here is for insecure clusters; you will probably have another znode on Kerberized clusters): `zookeeper-client rmr /atsv2-hbase-unsecure`,
- Restart *all* YARN services,
- Restart the Ambari server (we had a case where it looked like the alert was wrongly cached),
- Restart all services on the host where the ATS server lives.
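The check-and-clean steps above can be sketched as below. The HDFS path for the ATS HBase data is an assumption (it varies by HDP version, so take the exact path from the HDP docs), and everything here acts on a live cluster, so treat it as a template rather than a script to paste blindly:

```shell
# 1. Look for anything obvious in the YARN logs (path as on HDP),
#    e.g. out-of-memory errors from the embedded ats-hbase.
grep -iE 'error|outofmemory' /var/log/hadoop-yarn/yarn/*.log | tail -n 50

# 2. Remove the ATS HBase data in HDFS as the yarn-ats user.
#    NOTE: /atsv2/hbase/data is an assumed path -- check the HDP docs
#    for the exact location on your version before deleting anything.
sudo -u yarn-ats hdfs dfs -rm -R -skipTrash /atsv2/hbase/data

# 3. Remove the ATS znode (insecure cluster; the znode name differs
#    on Kerberized clusters).
zookeeper-client rmr /atsv2-hbase-unsecure
```

After that, restart all YARN services (and the Ambari server) from the Ambari UI as described above.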
The HDFS and ZooKeeper cleanup steps will make you lose your ATS history (i.e. job names, timings, logs…), but your actual data is perfectly safe; nothing else will be lost.
I found that it helped to run ats-hbase in embedded mode on a tiny one-host cluster, as documented (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/data-operating-system/content/installation_modes_hbase_timeline_service.html). This is just a simple checkbox in Ambari (and clearing it happened to be the recommended value too).
Also had to clean up the HDFS data (as you mentioned in ‘B’) and create `/home/yarn-ats` and `/var/run/hbase`, owned by `yarn-ats:hadoop` and `hbase:hbase` respectively.
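For the record, creating those two local directories with the ownership described above looks like this (run as root):

```shell
# Local home directory for the yarn-ats user
mkdir -p /home/yarn-ats
chown yarn-ats:hadoop /home/yarn-ats

# PID/run directory for HBase
mkdir -p /var/run/hbase
chown hbase:hbase /var/run/hbase
```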
Just replying here because your page showed up in my search engine results 😉
P.s. I’m pretty sure that in my case this problem was caused by using an Ambari blueprint to set everything up on a single host instead of a proper cluster. Who knows, maybe this post can help someone else.