The newer versions of Hadoop, including HDP3, use HBase as the backend for the timeline service (ATSv2). You can either use an external HBase or have a system HBase running on YARN (the default).
When using the system HBase, you could end up with the timeline server up and running, but with an alert (in Ambari) saying:
ATSv2 HBase Application The HBase application reported a ‘STARTED’ state. Check took 2.125s
The direct impact will be that Oozie jobs (among others) will take forever to run, as each step will wait for a timeout from the ATS (Application Timeline Server) before carrying on.
The solution I found to fix this is as follows:
- Check your YARN logs (/var/log/hadoop-yarn/yarn/ on HDP) for anything easy to spot, for instance not enough YARN memory (and fix it if relevant),
- Clean up the HDFS ATS data as described in the HDP docs,
- Clean up the ZooKeeper ATS data (the example here is for insecure clusters; kerberised clusters will probably use another znode): zookeeper-client rmr /atsv2-hbase-unsecure
- Restart *all* YARN services,
- Restart the Ambari server (we had a case where the alert appeared to be wrongly cached),
- Restart all services on the host where the ATS server lives.
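The steps above can be sketched as a short shell session. The HDFS path, the log pattern, and the secure znode name are assumptions based on HDP3 defaults, not taken from this procedure; check the HDP docs and your own cluster configuration before deleting anything, and run the commands as an appropriately privileged user.

```shell
# 1. Look for anything easy to spot in the YARN logs, e.g. memory errors.
grep -riE 'error|outofmemory' /var/log/hadoop-yarn/yarn/ | tail -n 20

# 2. Clean up the HDFS ATS data. /atsv2/hbase is the assumed HDP3 default
#    location -- confirm it against the HDP docs for your exact version.
hdfs dfs -rm -R -skipTrash /atsv2/hbase

# 3. Clean up the ZooKeeper ATS data (insecure cluster; a kerberised
#    cluster will probably use a different znode).
zookeeper-client rmr /atsv2-hbase-unsecure

# 4. Restart all YARN services (via Ambari), then the Ambari server itself,
#    then all services on the host running the ATS server.
ambari-server restart
```

The restarts are easiest to drive from the Ambari UI; only the Ambari server restart itself has to happen on the command line.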
The steps cleaning HDFS and ZooKeeper will make you lose your ATS history (i.e. job names, timings, logs…), but your actual data is perfectly safe; nothing else will be lost.