One day last week I came up with an idea that I had to test against the HTTP headers of millions of servers. To do this quickly, writing practically no code, I used Rapid7’s DAP. It is a nifty little utility that has been the ultimate time-saver for me, and it certainly deserves much more of a spotlight than it currently has. Plus, hdmoore digs it.
Since the docs are scarce at the moment, I thought I’d contribute a complete example.
Get all the webservers with a particular header
Get the dataset
First, get the data. Scans.io does the heavy lifting for us, providing a dataset of the replies to GET requests sent to all IPv4 hosts. Unfortunately, that means no vhosts, so we won’t see domains co-located on the same host. Get it:
wget https://scans.io/data/rapid7/sonar.http/20150203-http.gz -O the_Internet.gz # Get the latest one from the site.
Of course, we need to install stuff. Here are the instructions for Debian/Ubuntu:
sudo aptitude install libgeoip-dev parallel pigz ruby-bundler pv
git clone https://github.com/rapid7/dap.git
cd dap
gem install bundler
bundle install
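Before moving on, it can help to check that everything actually ended up on the PATH. Here is a minimal sketch. missing_tools is a hypothetical helper of mine, not part of DAP, and the tool names are my assumption of what the packages above provide:

```shell
# Print every tool from the argument list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed command names for the packages installed above.
missing=$(missing_tools pigz pv parallel git bundle)
if [ -n "$missing" ]; then
  echo "Still missing:" $missing
else
  echo "All tools found."
fi
```

If anything shows up as missing, fix that first; the pipeline fails in confusing ways when one of its stages does not exist.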
Now we are ready to go. It’s a long pipeline that works like this:
pigz -dc the_Internet.gz | \
pv -t -r -a -l | \
head -n 300000 | \
parallel --gnu --nice 10 --pipe \
"cd dap; bundle exec bin/dap json + transform data=base64decode + decode_http_reply data + include data.http_server='Apache' + include data.fancy_header='42' + select ip data data.http_code data.http_raw_headers data.http_server + json"
pigz will decompress our dataset quickly (using multiple cores)
pv will just let us see what’s going on (printing the progress on the screen)
head is nice for getting the pipeline right quickly (processing only the first k sites, for a faster turnaround during debugging)
parallel lets us use DAP (a single-core program) on multiple cores (or even multiple hosts!), by splitting the input across a set of workers.
dap has its own pipeline to interpret the data, described by its arguments. In particular:
- Input processing
json specifies that the input is in JSON, and that it needs to be decoded
transform data=base64decode specifies that the “data” field in the JSON is Base64-encoded, and needs decoding
decode_http_reply data specifies that the “data” field contains an HTTP reply, which DAP knows how to process to give easier access to its fields.
include data.http_server='Apache' will filter out all servers that do not self-report as “Apache” in their HTTP headers.
include data.fancy_header='42' will further filter out all the servers whose fancy_header does not match '42'
- Output generation
select ip data data.http_code data.http_raw_headers data.http_server specifies we only want these fields in our output, discarding all the rest
json tells DAP we want our output as a JSON file.
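To get a feel for what the input-processing stages do to a single record, here is a rough sketch that uses only standard tools instead of DAP. The record is made up (the IP, the headers and the body are all invented), and the sed expressions are only a stand-in for DAP's real JSON and HTTP parsing, so treat it as an illustration of the idea and not of how DAP is implemented:

```shell
# Build one made-up record shaped like the dataset: a JSON object whose
# "data" field holds a Base64-encoded HTTP reply. Every value is invented.
reply=$(printf 'HTTP/1.1 200 OK\r\nServer: Apache\r\nFancy-Header: 42\r\n\r\n<html></html>')
encoded=$(printf '%s' "$reply" | base64 | tr -d '\n')
record=$(printf '{"ip":"192.0.2.1","data":"%s"}' "$encoded")

# Roughly what "transform data=base64decode" does: take the field, decode it.
decoded=$(printf '%s' "$record" | sed -E 's/.*"data":"([^"]*)".*/\1/' | base64 -d)

# Roughly what decode_http_reply makes available as data.http_server.
server=$(printf '%s\n' "$decoded" | tr -d '\r' | sed -n 's/^Server: //p')
fancy=$(printf '%s\n' "$decoded" | tr -d '\r' | sed -n 's/^Fancy-Header: //p')

echo "Server: $server, Fancy-Header: $fancy"
```

The two include filters then simply keep or drop the record based on values like these.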
So, there you go. On my workstation, I can process about 10k sites per second. Pretty sweet, for a few lines of bash.