Thursday, November 21, 2013


Objective:

 Audit and record all activity taking place on a Hadoop cluster, including Hive SQL queries, as soon as possible after the access occurs

Problem:

On a Hadoop cluster, MapReduce jobs perform essentially all of the work that consumes cluster resources. A job can be written in Java, in which case the logic is hidden inside class or jar files. The same applies to a Hadoop streaming job.

In the case of Hive, however, the SQL query itself reveals detailed information about the data accessed, such as tables and columns. The query can therefore shed light on any PII (personally identifiable information) accessed by the user.

It is desirable to capture such information, preferably as soon as the access has taken place. Collecting it by culling logs at fixed intervals is not only cumbersome but also stale: the logs may already have been cleaned up, and a violation reported late is of little use. Timely capture and reporting of this information is therefore of paramount importance.

Consequently, such metric collection cannot be done effectively on a periodic schedule. Ideally, the metrics should be collected as soon as the job completes, and rather than being polled for, the job itself should push this data to a collector framework.

Approach:


  1. Configure Hadoop to send job completion notifications by adding the "mapreduce.job.end-notification.url" property to the mapred-site.xml configuration file. This property can be set to any URL hosted as a REST endpoint or in a servlet. Additionally, Hadoop can pass two parameters - the job ID and the job status - in the URL, which gives the servlet job-specific information for further processing. (A sample entry is shown after this list.)
  2. The notification URL above can be hosted as a servlet in a servlet container such as Tomcat. This servlet performs authentication against Hadoop using the UserGroupInformation class via Kerberos as needed. (A skeletal servlet is also sketched after this list.)
  3. The servlet extracts job-specific information, including the URL of the job XML file on the jobtracker (the exact form of this URL is given under Details below).
  4. The servlet also retrieves job completion metrics from Hadoop, such as the number of mappers and the time taken to complete.
  5. Additionally, the servlet retrieves the Hive SQL query, when applicable, from the job XML file.
  6. All of these metrics are persisted in a relational database for future reporting.
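For reference, a minimal mapred-site.xml entry for step 1 might look like the following. The collector host, port and path are placeholders, and the $jobId/$jobStatus tokens are the substitution variables Hadoop fills in when it fires the notification (note that & must be escaped as &amp; inside the XML value):

<property>
  <name>mapreduce.job.end-notification.url</name>
  <value>http://audit-collector.example.com:8080/audit/notify?jobid=$jobId&amp;status=$jobStatus</value>
</property>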

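A skeletal servlet for step 2 could be as simple as the sketch below. The class name, the URL parameter names and the audit hand-off are illustrative only; the parameter names just have to match the query string configured in the notification URL.

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical endpoint behind mapreduce.job.end-notification.url
public class JobNotificationServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Parameter names must match the query string in the configured notification URL
        String jobId = req.getParameter("jobid");
        String status = req.getParameter("status");

        log("Received job end notification: " + jobId + " finished with status " + status);

        // From here, hand off to the audit pipeline: authenticate to Hadoop, fetch the job XML,
        // pull out hive.query.string and the job metrics, then persist them (see Details below).
        // auditService.record(jobId, status);   // hypothetical collaborator

        resp.setStatus(HttpServletResponse.SC_OK);
    }
}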
Details:

Some of the main pieces of the puzzle can be coded as follows,
  • Authentication - Kerberized Hadoop makes the task of authenticating a user a bit more complex, but not by much: the UserGroupInformation class already packs in most of the logic, with wrappers around the JAAS mechanism. A user can authenticate against Hadoop using a keytab file:
UserGroupInformation.loginUserFromKeytab(user, keytab_file_path);
The keytab file can be generated from the Kerberos password as follows (the encryption type must match what the /etc/krb5.conf file supports, e.g. aes128-cts):
ktutil> addent -password -p <principal> -k 1 -e <encryption type>
ktutil> wkt <keytab file>
ktutil> quit
This generates the keytab file, which can then be used to perform the authentication. (A fuller login sequence is sketched after this list.)
  • SQL - The Hive SQL can be retrieved as follows (a helper along these lines is sketched after this list),
    1. First, read the jobtracker HTTP address from the job configuration object: conf.get("mapred.job.tracker.http.address")
    2. From this host, construct the URL "http://"+<jobTrackerURI>+"/jobconf.jsp?jobid="+<jobIdString>. This allows the job configuration to be fetched from the jobtracker seamlessly, whether the job is completed, retired or still running.
    3. Fetch the XML configuration file from the above URL and look for the property "hive.query.string". This should contain the full SQL query.
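Putting the authentication pieces together, a minimal login sequence could look like the sketch below. The principal name and keytab path are placeholders; this simply wires the keytab generated above into UserGroupInformation.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class HadoopAuditLogin {

    // Log the audit servlet in to a Kerberized cluster from a keytab
    public static void login() throws IOException {
        Configuration conf = new Configuration();
        // Tell the Hadoop client libraries that the cluster expects Kerberos
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Principal and keytab path are examples only
        UserGroupInformation.loginUserFromKeytab(
                "audituser@EXAMPLE.COM",
                "/etc/security/keytabs/audit.keytab");
    }
}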

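The three steps above can be folded into a small helper along the lines of the sketch below. It assumes, as step 3 describes, that the jobconf URL serves the job configuration as Hadoop-style XML; if the jobtracker returns an HTML page instead, the response would need to be parsed accordingly.

import java.net.URL;

import org.apache.hadoop.conf.Configuration;

public class HiveQueryExtractor {

    // Fetch the job configuration from the jobtracker and pull out the Hive query, if any
    public static String extractQuery(String jobTrackerHttpAddress, String jobId) throws Exception {
        String jobConfUrl = "http://" + jobTrackerHttpAddress + "/jobconf.jsp?jobid=" + jobId;

        // Load the remote job configuration, starting from an empty Configuration (no defaults)
        Configuration jobConf = new Configuration(false);
        jobConf.addResource(new URL(jobConfUrl));

        // Present only for Hive-submitted jobs; null for plain MapReduce or streaming jobs
        return jobConf.get("hive.query.string");
    }
}
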
Benefits:

Once the Hive SQL and the user ID behind the request are captured, appropriate accounting can be performed and alerts can be sent as needed in near real time.