A while back, we looked at Diffbot, the machine learning AI for processing web pages, as a means to extract SitePoint author portfolios. That tutorial used only the Diffbot UI, so consuming the resulting API meant pinging its endpoint manually. Since then, the design of the pages we processed has changed, and the API no longer works reliably.
In this tutorial, apart from rebuilding the API so that it works again, we’ll use the official Diffbot client to build custom entities that correspond to the data we seek (author portfolios).
Bootstrapping
We’ll be using Homestead Improved as usual. The following few commands will bootstrap the Vagrant box, create the project folder, and install the Diffbot client.
git clone https://github.com/swader/homestead_improved hi_diffbot_authorfolio; cd hi_diffbot_authorfolio
./bin/folderfix.sh
vagrant up; vagrant ssh
mkdir -p Code/Project/public; cd Code/Project; touch public/index.php
composer require swader/diffbot-php-client
Additionally, we can install Symfony’s vardumper as a development requirement, just to get prettier debug outputs.
composer require symfony/var-dumper --dev
If we now give index.php the following content, and provided we added homestead.app to our host machine’s /etc/hosts file, we should see “Hello World” when we visit http://homestead.app in our browser:
<?php
// index.php
require '../vendor/autoload.php';
echo "Hello World";
Diffbot Initialization
Note that to follow along, you’ll need a free Diffbot token – get one here.
define('TOKEN', 'token');
use Swader\Diffbot\Diffbot;
$d = new Diffbot(TOKEN);
This is all we need to init Diffbot. Let’s test it on a sample article.
echo $d->createArticleAPI('https://www.sitepoint.com/crawling-searching-entire-domains-diffbot')->call()->getAuthor(); // Bruno Skvorc
Custom API
First, we need to rebuild our API from the last post, so that it can become operational again. We do this by logging into the dev panel and going to https://www.diffbot.com/dev/customize/.
Let’s create a new API:
After entering a sample URL like www.sitepoint.com/author/bskvorc/, we can add some custom fields, like author:
We can use this same approach to define fields like bio and nextPage, the latter in order to activate Diffbot’s automatic pagination:
We also need to define a collection that gathers all the article cards and processes them. Making a collection entails selecting an element whose selector matches multiple times on the page. In our case, that’s the li element of the .article-list class.
Within that collection, we define fields for each card (when in doubt, the browser’s dev tools can help us identify the classes and elements we need to specify as selectors to get the desired result):
Besides the title and primary category, we should also extract the date of publication, primary category URL, article URL, number of likes, etc. For the sake of brevity, we’ll skip defining those here.
If we now access our endpoint directly rather than in the API toolkit, we should get the fully merged 9 pages of posts back, processed just the way we want them.
http://api.diffbot.com/v3/diffpoint?token=token&url=https://www.sitepoint.com/author/bskvorc/
We can see that the API successfully found all the pages in the set and returned even the oldest of posts.
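When calling the endpoint by hand like this, the target URL should be percent-encoded so its own characters don't interfere with the query string. Below is a minimal sketch of building that request URL; buildCustomApiUrl is a hypothetical helper written for this example and is not part of the Diffbot client.

```php
<?php
// Hypothetical helper (not part of the Diffbot client): builds the request
// URL for a custom API such as the "diffpoint" one defined above.
function buildCustomApiUrl($apiName, $token, $pageUrl)
{
    // http_build_query() percent-encodes the target URL so that its own
    // colons and slashes cannot be confused with the API URL's structure.
    $query = http_build_query(['token' => $token, 'url' => $pageUrl]);

    return 'http://api.diffbot.com/v3/' . rawurlencode($apiName) . '?' . $query;
}

// 'token' is a placeholder, exactly as in the URL above.
echo buildCustomApiUrl('diffpoint', 'token', 'https://www.sitepoint.com/author/bskvorc/'), PHP_EOL;
```

The client takes care of this for us, which is one more reason to prefer it over manual requests.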
Extending the Client
Let’s see if the Custom API behaves as expected.
echo $d->createCustomAPI('https://www.sitepoint.com/author/bskvorc', 'diffpoint')->call()->getBio();
This should echo the correct bio.
Extending the client is, in a way, optional. We could consume the returned data as is, and just iterate through keys and arrays, but let’s pretend our data is much more complex than a simple portfolio page and do it right regardless.
We need two new classes: an Entity Factory and an Entity. Let’s create them at src/AuthorFolio.php and src/CustomFactory.php, relative to the root of our project (src is in the root folder).
AuthorFolio
Let’s start with the new entity. As per the docs, we have an abstract class we can extend.
<?php
// src/AuthorFolio.php
namespace My\Custom;
use Swader\Diffbot\Abstracts\Entity;
class AuthorFolio extends Entity
{
}
We extend the abstract entity and give our new entity its own namespace. This is optional, but useful. At this point, the entity would already be usable. It is essentially identical to the Wildcard entity, which uses magic methods to resolve requests for various properties of the returned data (which is why the getBio method in the example above worked without us having to define anything). But the goal is to make the AuthorFolio class more explicit, with support for custom, SitePoint-specific data and maybe some shortcut methods. Let’s do this now.
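For readers unfamiliar with magic methods, here is a rough, self-contained sketch of the general technique. MagicEntity is a made-up class for illustration only. It is not the client's Wildcard implementation, which may differ in its details.

```php
<?php
// Illustrative stand-in only: NOT the client's Wildcard class, just a sketch
// of how PHP magic methods can expose arbitrary response fields.
class MagicEntity
{
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    // Resolves property reads such as $entity->bio.
    public function __get($name)
    {
        return isset($this->data[$name]) ? $this->data[$name] : null;
    }

    // Resolves calls such as $entity->getBio() by mapping them to "bio".
    public function __call($name, array $arguments)
    {
        if (strpos($name, 'get') === 0 && strlen($name) > 3) {
            return $this->__get(lcfirst(substr($name, 3)));
        }

        throw new BadMethodCallException("No method named $name");
    }
}

$entity = new MagicEntity(['bio' => 'Sample bio text']);
echo $entity->getBio(), PHP_EOL; // Sample bio text
```

Because nothing is declared up front, IDEs cannot know which fields exist, which is part of the motivation for a dedicated entity class.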
The API will return the full list of an author’s articles, but not their count. To find out how many posts an author has, we’d have to count the articles property, so let’s wrap that process in a shortcut method. We can also tell PHPStorm that the class will have an articles property using the @property tag, so it stops complaining about accessing the field with magic methods:
<?php
// src/AuthorFolio.php
namespace My\Custom;
use Swader\Diffbot\Abstracts\Entity;
/**
 * Class AuthorFolio
 * @property array $articles
 * @package My\Custom
 */
class AuthorFolio extends Entity
{
    public function getType()
    {
        return 'authorfolio';
    }

    // Shortcut: the API returns the articles, but not their count
    public function getNumPosts()
    {
        return count($this->articles);
    }
}
Other methods we could define are totalLikes, activeSince, favoredCategory, etc.
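As a rough idea of what two of those shortcuts might compute, here is a standalone sketch that works on a plain array rather than on the entity. The likes and category keys are assumed names, since we skipped defining those fields, and the sample data is invented.

```php
<?php
// Invented sample data shaped like the article collection. The "likes" and
// "category" keys are assumptions; use whatever names the custom API defines.
$articles = [
    ['title' => 'First post',  'category' => 'PHP', 'likes' => '12'],
    ['title' => 'Second post', 'category' => 'PHP', 'likes' => '3'],
    ['title' => 'Third post',  'category' => 'Web', 'likes' => '7'],
];

// Sum of likes across all articles. Values are cast to integers because
// extracted fields may arrive as strings.
function totalLikes(array $articles)
{
    return array_sum(array_map('intval', array_column($articles, 'likes')));
}

// Most frequent category, or null for an empty list.
function favoredCategory(array $articles)
{
    $counts = array_count_values(array_column($articles, 'category'));
    if (empty($counts)) {
        return null;
    }
    arsort($counts); // highest count first
    reset($counts);

    return (string) key($counts);
}

echo totalLikes($articles), ' ', favoredCategory($articles), PHP_EOL; // 22 PHP
```

Inside AuthorFolio, the same logic would read from $this->articles instead of taking an argument.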
CustomFactory
With the entity ready, it’s time to define a custom factory to bind it to the type of return data we’re getting from our custom API. We’re writing an alternative to the default factory, but the original class already contains some methods we can use, as it’s designed to be reused by its children. As such, we merely need to extend the original and map the new type to our custom entity, and we’re done.
<?php
// src/CustomFactory.php
namespace My\Custom;
use Swader\Diffbot\Factory\Entity;
class CustomFactory extends Entity
{
    public function __construct()
    {
        $this->apiEntities = array_merge(
            $this->apiEntities,
            ['diffpoint' => '\My\Custom\AuthorFolio']
        );
    }
}
We merged the original API-to-entity list with our own custom binding, telling the factory to watch for both the standard types and APIs and our new one. This means we can keep using this factory for the default Diffbot APIs as well.
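To make the effect of that merge concrete, here is a simplified, standalone sketch of a type-to-class lookup. The class names are placeholders, and both resolveEntityClass and the '*' fallback key are assumptions made for this example. The real lookup lives inside the client's factory.

```php
<?php
// Simplified illustration only; placeholder class names, not the real ones.
$defaults = [
    'article' => 'ArticleEntity',
    '*'       => 'WildcardEntity', // assumed catch-all for unknown types
];
$custom = ['diffpoint' => 'AuthorFolio'];

// array_merge() keeps every default binding and adds (or overrides) ours.
$map = array_merge($defaults, $custom);

// Hypothetical lookup: use the bound class if there is one, else the fallback.
function resolveEntityClass(array $map, $type)
{
    return isset($map[$type]) ? $map[$type] : $map['*'];
}

echo resolveEntityClass($map, 'diffpoint'), PHP_EOL; // AuthorFolio
echo resolveEntityClass($map, 'article'), PHP_EOL;   // ArticleEntity
```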
Plugging the Factory In
To make our classes autoloadable, we should probably add them to composer.json:
"autoload": {
    "psr-4": {
        "My\\Custom\\": "src"
    }
}
We activate these new autoload mappings by running composer dump-autoload.
Next, we instantiate the new factory, plug it into our Diffbot instance, and test the API:
$d = new Diffbot(TOKEN);
$d->setEntityFactory(new My\Custom\CustomFactory());
$api = $d->createCustomAPI('https://www.sitepoint.com/author/bskvorc', 'diffpoint');
$api->setTimeout(120000);
$result = $api->call();
dump($result->getNumPosts());
Note that we increased the timeout because a heavily paginated set of posts can take a while to render on Diffbot’s end.
Conclusion
In this tutorial, by using the official Diffbot client, we constructed custom entities and built a custom API which returns them. We saw how easy it is to leverage machine learning and optical content processing for grabbing arbitrary data from websites of any type, and we saw how heavily customizable the Diffbot client is.
While this was a rather simple example, it isn’t difficult to imagine advanced use cases on more complex entities, or perhaps several of them spread over multiple APIs, all processed through a single EntityFactory, each custom API corresponding to a special Entity type. With a well trained visual neural network, the only processing limit is one’s imagination.
If you’d like to read more about the Diffbot client, check out the full docs and play around for yourself – just don’t forget to fetch a fresh free two-week demo token!
Bruno is a blockchain developer and technical educator at the Web3 Foundation, the foundation that's building the next generation of the free people's internet. He runs two newsletters you should subscribe to if you're interested in Web3.0: Dot Leap covers ecosystem and tech development of Web3, and NFT Review covers the evolution of the non-fungible token (digital collectibles) ecosystem inside this emerging new web. His current passion project is RMRK.app, the most advanced NFT system in the world, which allows NFTs to own other NFTs, NFTs to react to emotion, NFTs to be governed democratically, and NFTs to be multiple things at once.