Objective:
Audit and record all activities taking place on a Hadoop cluster, including Hive SQL queries, as soon as possible after they occur.
Problem:
On a Hadoop cluster, essentially all resource-consuming work is performed by MapReduce jobs. A job may be written in Java, in which case its logic is hidden inside class or jar files; the same applies to a streaming Hadoop job. In the case of Hive, however, the SQL query can reveal detailed information about the data components accessed, such as tables and columns. As such, the query can shed light on any PII (Personally Identifiable Information) accessed by the user.
It is desirable to capture such information, preferably as soon as the access has taken place. Collecting it by culling logs at a fixed time is neither reliable nor fresh: the logs may already have been cleaned up, and a violation reported late is of little benefit. Hence, timely capture and reporting of such information is of paramount importance.
Consequently, such metric collection cannot be done effectively in a periodic fashion. Ideally, it should be collected as soon as the job is completed, and rather than polling for this information, the job itself should relay the data to a collector framework.
Approach:
- Configure Hadoop to send job completion notifications by adding the "mapreduce.job.end-notification.url" property to the mapred-site.xml configuration file. This property can be set to any URL, hosted for example as a REST endpoint or a servlet. Additionally, Hadoop can pass two parameters in the URL - the job ID and the job status - giving the servlet job-specific information for further processing (a sample configuration follows this list).
- The above notification URL can be hosted as a servlet in a servlet container such as Tomcat. This servlet will authenticate against Hadoop using the UserGroupInformation class via Kerberos, as needed.
- This servlet will extract job-specific information, including the URL of the job's XML configuration. This URL points at the JobTracker, in the form shown in the Details section below.
- The servlet will also retrieve job completion metrics from Hadoop, such as the number of mappers and the time taken to complete.
- Additionally, when applicable, the servlet will retrieve the Hive SQL query from the job XML file.
- All these metrics will be persisted in a relational database for future reporting.
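As a sketch of the first step, the notification property might look like the following in mapred-site.xml. The collector host, port, and servlet path are placeholder assumptions; $jobId and $jobStatus are the substitution variables Hadoop fills in when it calls the URL.
<property>
  <name>mapreduce.job.end-notification.url</name>
  <!-- Hadoop replaces $jobId and $jobStatus before issuing the HTTP request;
       the collector host, port, and path below are illustrative placeholders -->
  <value>http://collector.example.com:8080/audit/notify?jobid=$jobId&amp;jobstatus=$jobStatus</value>
</property>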
Details:
Some of the main pieces of the puzzle can be coded as follows:
- Authentication - A Kerberized Hadoop cluster makes the task of authenticating a user a bit more complex, but not by much, since the UserGroupInformation class already packs in most of the logic, with wrappers around the JAAS mechanism. A user can authenticate against Hadoop as follows:
Using a keytab file -
UserGroupInformation.loginUserFromKeytab(user, keytab_file_path);
The keytab file can be generated from the Kerberos password as:
ktutil> addent -password -p <principal> -k 1 -e <encryption type>
The encryption type needs to match what the /etc/krb5.conf file supports - e.g. aes128-cts
ktutil> wkt <keytab file>
ktutil> quit
This generates the keytab file, which may be used to perform the authentication.
- SQL - The SQL can be derived as follows:
- First obtain the JobTracker HTTP address from the configuration object - conf.get("mapred.job.tracker.http.address")
- From this host, construct the URL - "http://" + <jobTrackerURI> + "/jobconf.jsp?jobid=" + <jobIdString>. This allows us to get the job configuration data from the JobTracker seamlessly, whether the job is completed, retired, or running.
- Get the XML configuration from the above URL and look for the property "hive.query.string"; it should contain the full SQL query (a servlet sketch covering these steps follows).
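A hypothetical sketch of the notification servlet tying these steps together is shown below. The servlet and parameter names (JobNotificationServlet, "jobid", "jobstatus") and the persistence step are assumptions for illustration; only the jobconf.jsp URL and the "hive.query.string" property come from the steps above, and depending on the Hadoop version the jobconf.jsp response may need different parsing than the plain configuration XML assumed here.
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;

public class JobNotificationServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Parameters substituted by Hadoop into the notification URL
        // (names match the sample mapred-site.xml entry above).
        String jobId = req.getParameter("jobid");
        String jobStatus = req.getParameter("jobstatus");

        // JobTracker HTTP address, e.g. "jobtracker.example.com:50030".
        Configuration clusterConf = new Configuration();
        String jobTracker = clusterConf.get("mapred.job.tracker.http.address");

        // Pull the job configuration from the JobTracker and read the Hive query.
        String jobConfUrl = "http://" + jobTracker + "/jobconf.jsp?jobid=" + jobId;
        Configuration jobConf = new Configuration(false);
        try (InputStream in = new URL(jobConfUrl).openStream()) {
            jobConf.addResource(in);
            String hiveQuery = jobConf.get("hive.query.string");
            // Persist jobId, jobStatus, hiveQuery, and related metrics to the
            // relational database here, and fire alerts as needed.
        }
        resp.setStatus(HttpServletResponse.SC_OK);
    }
}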
Benefits:
Once the Hive SQL and the user ID for the request are received, appropriate accounting can be performed and alerts sent as needed, in near real time.