Wide Finder in … errr … PHP


Long time no blog – very busy searching Switzerland these days – but popping in, first to say congrats to the 2nd edition team, and second for the sheer evil delight of submitting a late PHP implementation for Tim Bray’s fascinating Wide Finder Project.

Tim set a simple but very much real-world challenge: write an app that determines the top 10 most popular blog entries from his Apache access log. It should be fast and readable, with a subtext of illustrating how “language X” copes with parallel processing and makes use of “wider” (many-processor) systems.

Now technically, PHP has almost zero ability to execute stuff in parallel, other than pcntl_fork (don’t try that on your live web server – CLI only!), unless you’re counting ticks, which I hope you aren’t.
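For completeness, here’s a minimal sketch of what that CLI-only forking looks like – purely illustrative, and not part of the solution below:

#!/usr/bin/php
<?php
// CLI only! pcntl_fork() is unavailable (and unwise) under mod_php
$pid = pcntl_fork();
if ( $pid == -1 ) {
    die("Fork failed\n");
} elseif ( $pid == 0 ) {
    // Child process: would count its own chunk of the log here
    print "child counting...\n";
    exit(0);
} else {
    // Parent process: wait for the child, then merge its results somehow
    pcntl_waitpid($pid, $status);
    print "child done\n";
}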

But we’re not going to let little details stop us! Because we’ve got curl_multi_exec(), which allows us to talk to multiple URLs and process their responses in parallel. And just think how many requests Apache + mod_php can serve at the same time. Perfect for some map and reduce.

So first, a mapper:


<?php
// Mapper: counts blog entry hits within a byte range of an Apache access log.
// Callable over HTTP (?f=...&s=...&e=...) or from the CLI with a filename.
if ( isset($argv[1]) ) $_GET['f'] = $argv[1];

if ( !isset($_GET['f']) || !is_readable($_GET['f']) ) {
    header("HTTP/1.0 404 Not Found");
    die("Can't read file ".htmlspecialchars($_GET['f']));
}

// Byte range to scan; defaults to the whole file
$s = isset($_GET['s']) && ctype_digit($_GET['s']) ? $_GET['s'] : 0;
$e = isset($_GET['e']) && ctype_digit($_GET['e']) ? $_GET['e'] : filesize($_GET['f']);

$f = fopen($_GET['f'], 'r');
// NB: $s may land mid-line; the truncated first line will rarely match the pattern
fseek($f, $s);

// Tim's pattern: hits on blog entries look like
// GET /ongoing/When/200x/2007/10/30/some-title
$pattern = '#GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) #';

$counts = array();
while ( !feof($f) && ftell($f) < $e ) {
    if ( preg_match($pattern, fgets($f), $m) ) {
        isset($counts[$m[1]]) ? $counts[$m[1]]++ : $counts[$m[1]] = 1;
    }
}
fclose($f);

// Respond with the counts as a serialized PHP array
$out = serialize($counts);
header("Content-Length: ".strlen($out));
echo $out;

You point it at a file it can reach via its filesystem and give it a range of bytes (a start and an end) to analyze. It returns a serialized hash, the keys being the blog entry titles and the values the number of occurrences. A URL to call it might look like http://localhost/~harry/wf/wf_map.php?f=/tmp/o10k.ap&s=257664&e=524288. For benchmarking, it can also be called from the command line to process a complete file in a single thread, like $ php ./wf_map.php /tmp/o10k.ap.
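For a quick sanity check, something like this (assuming the same local URL and sample file as above) fetches a single chunk and unserializes the result:

<?php
// Assumes the mapper is reachable at this URL and /tmp/o10k.ap exists
$url = "http://localhost/~harry/wf/wf_map.php?f=".urlencode("/tmp/o10k.ap")."&s=0&e=262144";
$counts = unserialize(file_get_contents($url));
print_r($counts); // keys like "2006/01/01/some-title", values are hit counts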

Then we have the reducer, which actually invokes the mapper(s) using curl:


#!/usr/bin/php
<?php
// Reducer: fires off parallel HTTP requests to the mapper, then merges
// the serialized counts from each response as it arrives
function wf_reduce($urls) {

    $counts = array();

    $mh = curl_multi_init();

    foreach ( $urls as $url ) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch);
    }

    do {
        curl_multi_exec($mh, $running);

        while ( FALSE !== ($msg = curl_multi_info_read($mh)) ) {

            if ( 200 != ( $status = curl_getinfo($msg['handle'], CURLINFO_HTTP_CODE) ) ) {
                die("HTTP Status: $status!");
            }

            // Here we can "reduce" as soon as we get a response...
            foreach ( unserialize( curl_multi_getcontent($msg['handle']) ) as $page => $hits ) {
                isset($counts[$page]) ? $counts[$page] += $hits : $counts[$page] = $hits;
            }

        }

        // Wait for socket activity rather than spinning on curl_multi_exec()
        if ( $running > 0 ) curl_multi_select($mh);

    } while ( $running > 0 );

    return $counts;

}

$mapurl = "http://localhost/~harry/wf/wf_map.php";
$file = "/tmp/o10k.ap";

if ( !file_exists($file) ) {
    print "First run: downloading sample data!\n";
    file_put_contents($file, file_get_contents("http://www.tbray.org/tmp/o10k.ap"));
    die("Done - run it again to analyze.\n");
}

// Number of parallel mapper requests to fire (default 1)
$partitions = isset($argv[1]) && ctype_digit($argv[1]) ? $argv[1] : 1;

$end = filesize($file);
// Round up so the chunks cover the whole file, even when its size
// isn't evenly divisible by the number of partitions
$partition = (int) ceil($end / $partitions);

$s = 0;
$urls = array();

while ( $s < $end ) {
    $e = min($s + $partition, $end);
    $urls[] = sprintf("%s?f=%s&s=%s&e=%s", $mapurl, urlencode($file), $s, $e);
    $s = $e;
}

$counts = wf_reduce($urls);
arsort($counts, SORT_NUMERIC);

// Keep the top 10, reversed so the most popular entry is printed last
$top10 = array_slice($counts, 0, 10, TRUE);
print_r( array_reverse($top10) );

It is meant purely for command-line execution, taking one argument: the number of parallel “threads” to run, e.g. $ ./wf_reducer.php 10, which fires off 10 HTTP requests in parallel to the mapper, each requesting a different chunk of the file. You’ll need to change $mapurl to match your own setup.

Doubt this will win any awards for beauty, and benchmarks with Tim’s sample file are not impressive – it’s faster to run single-threaded – perhaps you’d start gaining with a much bigger file? My desire to find out is diminishing.
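If yours isn’t, here’s a rough way to compare the two modes – a sketch assuming the file locations used above:

#!/usr/bin/php
<?php
// Crude benchmark: single-threaded mapper vs. 10 parallel mappers.
// Assumes wf_map.php and wf_reducer.php are in the current directory
// and the sample log is at /tmp/o10k.ap.
$t = microtime(true);
passthru('php ./wf_map.php /tmp/o10k.ap > /dev/null');
printf("single-threaded: %.3fs\n", microtime(true) - $t);

$t = microtime(true);
passthru('php ./wf_reducer.php 10 > /dev/null');
printf("10 parallel maps: %.3fs\n", microtime(true) - $t);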

But anyway – PHP was also there. Wonder if Tim will now compile it on that T5120? ;)

Update: Just discovered Russell Beattie got there first with a single-threaded solution.
