The post below is taken from the blog of Bill Michels (VP of Developer Platform at Factual) over at factual.com. I thought Bill did a great job describing their efforts to build a massive de-duping (or entity resolution) engine that is both language- and vertical-agnostic. You might like it as well!
From the beginning, we’ve always struggled with drafting a nice clean blurb that really summarizes Factual’s capabilities. On our homepage we say, “Access great data for your web and mobile apps,” which tells developers what we offer, but it doesn’t really shed light on our core technology stack.
Our more targeted message to the developer community is that Factual provides “an open data platform for application developers that leverages large-scale aggregation and community exchange.” OK, that definitely tells you a little more, but it likewise doesn’t articulate what our engineers are working so hard on–and trust me, they are working hard on some serious data problems.
As with most technologies, it's probably better to show than to explain.
We recently shipped a new version of our US Local dataset (our previous version was in beta). At first glance, this looks like a pretty standard dataset of places. But things get more interesting when you click into a cell value on the dashboard. Double-click on the “Name” field for the record of a great place in NYC – such as “Ino Cafe and Wine Bar.” You’ll see a pop-up similar to the image below.
As you can see, this tells us that Factual has collected 142 references to this entity, of which 114 refer to it as “Ino Cafe and Wine Bar,” along with a few others that refer to it as “Ino Bar” or “Ino.” Based on all of these references, Factual has calculated that the most relevant value is “Ino Cafe and Wine Bar.”
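The “most relevant value” selection can be pictured as a frequency vote across all cited references. Here is a minimal sketch using the counts from the example above; the tallying logic is my own illustration, not Factual’s actual ranking algorithm:

```python
from collections import Counter

def most_relevant_value(references):
    """Pick the canonical value as the variant cited most often.
    (Illustrative only -- a real ranking would likely also weigh
    source trustworthiness and recency, not just raw counts.)"""
    counts = Counter(references)
    value, _ = counts.most_common(1)[0]
    return value

# Counts from the example: 114 of 142 references use the full name.
refs = (["Ino Cafe and Wine Bar"] * 114
        + ["Ino Bar"] * 20
        + ["Ino"] * 8)
print(most_relevant_value(refs))  # -> Ino Cafe and Wine Bar
```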
Where do all these references come from? There are a few major sources: partner data sharing agreements, crowd-sourced user submissions, and analyses of open caches of the web. We have data sharing relationships with companies like Allmenus, GrubHub, Homejunction, etc. If you have some data you would like to add, definitely let us know at email@example.com, and we’ll be pleased to provide you with data in return.
The crowd-sourced submissions typically come through apps that run on this dataset (via a PUT API call). Also, any user can supplement or correct this information by clicking on the “Correct This” link in the pop-up. On the adjacent “Fact History” tab, we display the values and the corresponding referenced URLs.
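A crowd-sourced submission of this kind is, at bottom, an HTTP PUT carrying the corrected field values. The sketch below only builds the request object without sending it; the base URL, path layout, field names, and API-key parameter are hypothetical stand-ins, not Factual’s documented API:

```python
import json
import urllib.parse
import urllib.request

def build_submission(base_url, table, row_id, corrections, api_key):
    """Assemble a PUT request proposing corrected values for one row.
    (Hypothetical URL scheme and parameters, for illustration only.)"""
    url = "%s/%s/%s?%s" % (
        base_url, table, row_id,
        urllib.parse.urlencode({"api_key": api_key}),
    )
    body = json.dumps({"values": corrections}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )
    return req  # pass to urllib.request.urlopen(req) to actually submit

req = build_submission(
    "https://api.example.com/v2", "us-local", "abc123",
    {"name": "Ino Cafe and Wine Bar"}, "YOUR_KEY",
)
print(req.get_method())  # -> PUT
```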
Finally, the identification of strings of factual data (notice the lower case) on open caches of the Web is accomplished through a proprietary machine learning process trained on publicly available sources.
How does this explain our stack? Well, the really impressive stuff happens when we throw all this data at our platform. All of these strings are clustered based on Factual’s de-duping algorithms – the specifics of which I’ll save for another post – and we then create canonical references of entities, something we refer to internally as “folding the web.” Another way to say it is that the data and its corresponding reference URL are resolved to a Factual entity (notice the upper case). To give you a sense of the scale we are talking about, this US dataset contains roughly 1.5 billion entity references drawn from about 5 million unique domains – and growing.
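To make the clustering step concrete, here is a deliberately naive sketch: normalize each raw reference, group references that share a normalized key, then fold each cluster into one canonical record. The normalization rule, noise-word list, and sample records are all invented for illustration; the post explicitly defers the specifics of Factual’s real de-duping algorithms, which are far more sophisticated than this:

```python
import re
from collections import defaultdict

def normalize(name):
    """Crude normalization: lowercase, strip punctuation, drop noise words.
    (Toy heuristic for this example only.)"""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    noise = {"cafe", "and", "bar", "wine", "the"}
    return " ".join(t for t in name.split() if t not in noise)

def cluster(references):
    """Group (name, source_url) pairs whose normalized names match,
    and emit one canonical entity per cluster."""
    groups = defaultdict(list)
    for name, url in references:
        groups[normalize(name)].append((name, url))
    entities = []
    for members in groups.values():
        # Canonical name: the longest (most descriptive) variant seen.
        canonical = max((n for n, _ in members), key=len)
        entities.append({"name": canonical,
                         "sources": sorted(u for _, u in members)})
    return entities

refs = [
    ("Ino Cafe and Wine Bar", "http://a.example"),
    ("ino cafe & wine bar", "http://b.example"),
    ("Ino Bar", "http://c.example"),
]
entities = cluster(refs)
print(entities[0]["name"])  # -> Ino Cafe and Wine Bar
```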
Consequently, you could describe our stack as a massive de-duping (or entity resolution) engine that is both language- and vertical-agnostic. Needless to say, we won’t be using that as Factual’s main tag line any time soon. We admit it’s not for everyone; nevertheless, we think it’s a pretty cool and extremely useful tool. We’re currently using this stack to build structured datasets, but have plans to apply it to other interesting use cases. More on that soon.
After all this work, we wind up with a nice, clean structured dataset of Places, which we give developers access to through APIs and downloads. You can also see and access the data programmatically on Factual endpoint pages. These web pages show the various attributes of each entity in the US Local dataset. There is appropriate semantic mark-up on these pages to help machines read the data, and by appending “.json” to the URL, you get a JSON representation of the Factual endpoint page. This way, developers who have implemented Factual data have yet another way to check for the latest values.
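The “.json” convention means a machine-readable version of each endpoint page lives at a predictable URL next to the human-readable one. A small sketch of how a client might use that convention; the example URL and the sample payload are hypothetical, and only the append-“.json” convention comes from the post:

```python
import json

def json_url(endpoint_page_url):
    """Derive the machine-readable URL from an endpoint page URL
    by appending '.json', per the convention described above."""
    return endpoint_page_url + ".json"

# A hypothetical payload, shaped like a single Places record.
sample = '{"name": "Ino Cafe and Wine Bar", "locality": "New York"}'
record = json.loads(sample)

print(json_url("http://www.factual.com/t/us-local/abc123"))
print(record["name"])  # -> Ino Cafe and Wine Bar
```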
I realize this was a slightly lengthy explanation, but I hope it gives you a window into our stack and into the challenging problems our engineers have been working on. And with all these ways to get at the data, we hope you can leverage it to go off and build the next Great App…or at least one that doesn’t require a blog post to explain what it actually does!
Bill Michels, VP Developer Platform