I needed to debug a WordPress plugin problem on ConDFW’s site, and while doing so I discovered my Apache 2.2 instance on FreeBSD 8 was not logging properly. Well, says I, why don’t I log to a database again? And instead of using mod_log_sql like a sane human, and like I’ve done in the past, I decided to peek into the exciting world of NoSQL.
First, choices. You need a NoSQL database. See http://en.wikipedia.org/wiki/NoSQL for a big long list, but the main choices are: key/value, graph, document, object, multivalue, and table. I went with Cassandra, a key/value DB. In retrospect, HBase might’ve been a better choice. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis describes all of the common databases and their strengths/weaknesses.
Then you need a pipe. Two ways I found were Flume and Scribe. Scribe is a log aggregation system created by the Facebook team, while Flume was a tool created by Cloudera and then picked up by the Apache team. Scribe hasn’t seen commits to its github in forever (maybe it’s super stable?) and Flume allows some pretty robust manipulation of how your logs are picked up, modifications that are made in-flight, and where they end up. Honestly, I didn’t even read about Scribe until I was well into the Flume configuration.
OK, now you can get concerned with point A (Apache webserver, in my case) to point B (your NoSQL db). I decided to do things the “right way”. It’s entirely possible to have an exec source in Flume that runs “tail -F” on your logfile (note the -F, this follows your log when the old inode is rotated off) of your Apache logs, but that does not provide any means for Flume to signal that it is, for instance, unable to keep up. I chose to use Flume’s spooling directory source which makes all sorts of promises about consistency, reliability, etc. The source expects files to appear in the spooldir atomically, that is as the result of a mv operation, and it expect unique filenames. After it’s done injesting the logfile, it renames it with .COMPLETED at the end, for easier collection. I decided to have Apache use its “rotatelogs” utility to generate a logfile at a short interval (I don’t want to wait forever for logs to appear in the db) that I then move to the spooldir for flume.
Apache httpd.conf:
# logs for Flume injestion. Custom log format for easier regex matching
LogFormat "%{UNIQUE_ID}e %v %h %l %u %{%Y%m%d%H%M%S}t %r %>s %b %{Referer}i %{User-Agent}i" vfcombined
CustomLog "|/usr/local/sbin/rotatelogs /www/logs/al_flume.%Y%m%d%H%M%S 60" vfcombined
(in the LogFormat, those are tabs separating the fields. It’s important(sh) ). So you see, I added UNIQUE_ID, an environment variable generated by mod_unique_id which is a module that seems to be built by default with Apache 2.2 (at least it was in my install, and I don’t remember adding it); and I changed the timestamp format to something more machine parseable. I also could’ve use epoch time, but this timestamp style allowed easier human verification that everything was working properly. The CustomLog directive pipes the log out to rotatelogs (part of Apache), with a particular filename template. I rotate the file every 60 seconds.
Movement into spooldir, flume_rot.sh:
#!/bin/sh
# current time minus 61 seconds (latest file we want to move)
# used as a filename and timestamp for a file (used for a find(8) comparison)
STAMP=`date -j -v-61S “+%Y%m%d%H%M.%S”`
# where rotatelogs is writing
SDIR=/www/logs
# where flume is picking up
DDIR=/www/logs.flume
# debugging
# echo “Date: ” `date`
# echo “Using: $STAMP”
# Create our comparison file
touch -t $STAMP $SDIR/$STAMP
# move older logs to flume spooldir
find $SDIR -name ‘al_flume.*’ ‘!’ -newer $SDIR/$STAMP -exec mv “{}” $DDIR/ \;
# cleanup touchfile
rm $SDIR/$STAMP
This code is called from cron every minute.
After I verified that was dropping logs and moving them properly, it was on to configuring Flume. A quick note, if you’re building 1.4.0 prerelease from source on FreeBSD (like I chose to do for some weird reason), then this might still be an issue for you. It’s literally a single space in a single file that keeps Flume from building, so seriously fuck everyone who only tests their Maven code on Linux. (well no, I understand why you don’t, but it’s hurtful). Also make sure you use -DskipTests at the end of your maven build line with Flume to skip tests (which broke for me in 1.4.0-pre)
Here’s a decent guide on getting started with Flume. After that, you need a “sink” from Flume to your database. I used a Cassandra sink available on github, and if you use it definitely read the README like five or six times. Because it explains most of the issues I had to debug with the sink. You can build it with:
mvn clean package -P assemble-artifacts
then untar the .tar.gz file in target/ (in my case flume-ng-cassandra-sink-1.0.0-SNAPSHOT-dist.tar.gz) into your flume/lib/ directory (wherever you installed Flume to).
Next up, your database. Nothing fancy here, I just installed Cassandra from the FreeBSD ports (/usr/ports/databases/cassandra). You’ll need to make your cluster and keyspace and column family names match with what is being used by the Cassandra sink. You can most easily change your cluster name by editing /usr/local/share/cassandra/conf/cassandra.yaml . I recommend avoiding any periods or spaces, I ended up going with cepheidORG. Copy src/main/resources/speed4j.properties from the flume-ng-cassandra source to flume/conf (in your Flume install directory). Edit it, changing Logging to whatever you named your cluster. Run zip -d lib/flume-ng-cassandra-sink-1.0.0-SNAPSHOT.jar speed4j.properties. to remove that configuration file from the jarfile. Go ahead and start Cassandra (/usr/local/etc/rc.d/cassandra start, after you’ve enabled in /etc/rc.conf). Log in and create a keyspace and column family for your logs.
On to Flume configuration! Here’s the relevant bit from my flume.conf:
# Agent
webserver.sources = spoolDirSrc
webserver.channels = fileChannel
webserver.sinks = cassandraSink
# Source spoolDirSrc
webserver.sources.spoolDirSrc.type = spooldir
webserver.sources.spoolDirSrc.channels = fileChannel
webserver.sources.spoolDirSrc.spoolDir = /www/logs.flume/
webserver.sources.spoolDirSrc.interceptors = logDeserial host src ts
# interceptor for log format parsing
# LogFormat “%{UNIQUE_ID}e %v %h %l %u %{%Y%m%d%H%M%S}t %r %>s %b %{Referer}i %{User-Agent}i” vfcombined
webserver.sources.spoolDirSrc.interceptors.logDeserial.type = regex_extractor
webserver.sources.spoolDirSrc.interceptors.logDeserial.regex = ^([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers = s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s1.name = key
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s2.name = hostname
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s3.name = remotehost
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s4.name = remoteident
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s5.name = remoteuser
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s6.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s6.name = timestamp
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s6.pattern = yyyyMMddHHmmss
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s7.name = request
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s8.name = requeststatus
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s9.name = responsebytes
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s10.name = referrer
webserver.sources.spoolDirSrc.interceptors.logDeserial.serializers.s11.name = agent
# interceptor to insert host field (static to this source/spooldir)
webserver.sources.spoolDirSrc.interceptors.host.type = static
webserver.sources.spoolDirSrc.interceptors.host.preserveExisting = false
webserver.sources.spoolDirSrc.interceptors.host.key = host
webserver.sources.spoolDirSrc.interceptors.host.value = www.cepheid.org
# interceptor to insert src field (static to this application)
webserver.sources.spoolDirSrc.interceptors.src.type = static
webserver.sources.spoolDirSrc.interceptors.src.preserveExisting = false
webserver.sources.spoolDirSrc.interceptors.src.key = src
webserver.sources.spoolDirSrc.interceptors.src.value = apacheWWW
# last chance timestamp
webserver.sources.spoolDirSrc.interceptors.ts.type = timestamp
webserver.sources.spoolDirSrc.interceptors.ts.preserveExisting = true
# Sink definitions
# loggerSink (testing)
webserver.sinks.loggerSink.type = logger
webserver.sinks.loggerSink.channel = fileChannel
# cassandraSink
webserver.sinks.cassandraSink.type = com.btoddb.flume.sinks.cassandra.CassandraSink
webserver.sinks.cassandraSink.channel = fileChannel
webserver.sinks.cassandraSink.hosts = localhost
webserver.sinks.cassandraSink.cluster-name = CepheidORG
webserver.sinks.cassandraSink.keyspace-name = ApacheLogs
webserver.sinks.cassandraSink.records-colfam = Requests
# Channel definitions
# durable channel
webserver.channels.fileChannel.type = file
webserver.channels.fileChannel.checkpointDir = /var/spool/flume/fchans/fchan1/checkpoint
webserver.channels.fileChannel.dataDirs = /var/spool/flume/fchans/fchan1/data
# non-durable channel
webserver.channels.memoryChannel.type = memory
webserver.channels.memoryChannel.capacity = 100
Things of note: First, read through the Flume configuration guide to get a basic idea of the format. Second, while I used a file channel, it very much wasn’t necessary. The configuration has a memory channel defined that would work as well, but to use my file channel you need to “mkdir -p /var/spool/flume/fchans/fchan1/checkpoint; mkdir -p /var/spool/flume/fchans/fchan1/data”. Third, I used the regex_extractor interceptor to pull out the UNIQUE_ID and timestamp of the request, but the other fields I populate are not used by the Cassandra sink, in the end.
If Apache’s throwing logs in the spool directory, and Cassandra seems to be up and responding to the CLI, it’s time to hook them together. Start Flume with /usr/local/flume/bin/flume-ng agent -c /usr/local/flume/conf/ -f /usr/local/flume/conf/flume.conf -n webserver -Dflume.root.logger=DEBUG,console &> logs/flume.log , and you will be able to pull up logs/flume.log for extensive debugging information. If everything is going well, you should see your spool files change their names to .COMPLETED, and you should be able to run cqlsh and see log entries with:
use ApacheLogs;
select * from Requests;
(changing whatever your keyspace and column family are)
Good luck!