Big Data and the Semantic Web

I wrote this post on the train on the way back from Luxembourg, where I took part in the ICT Call 8 Information and Networking Day: Intelligent Information Management, which is another EU FP7 call for research projects on big data.

The information day was pretty interesting, as I hadn’t really read up on the big data issue yet. The summary was basically that big data is when the size of the data itself is a problem. Examples given included Google, which talks about 1 petabyte of data, Amazon S3 with 500 billion objects, and Walmart, which seems to process data sets up to 100 million each day.

There were quite a few RDF/semantic web related projects there, both existing ones (lod2.eu) and proposals for new ones from groups looking for partners. I was a bit confused about RDF and LOD because, although the total data size is impressive, each individual database like DBpedia is not that big (DBpedia is only a few hundred gigabytes). And funnily enough, I had an article on my reading list about exactly this problem at semanticweb.com: Two kinds of big data.

Rob Gonzalez makes some really good remarks in there, like the statement that there are two kinds of big data: Really big data sets which need to be processed on one box/instance (vertical big data) and the semantic web, which in itself is horizontal big data.

With Horizontal Big Data (maybe HBD will start catching on!), the problem isn’t how to crunch lots of data fast.  Instead, it’s how to rapidly define a working subset of information to help solve a specific need.

That’s a really good remark, and I am curious how we will be able to solve the problem of widely distributed data. So, semantic web community, listen up: there is some money available in this EU FP7 call, and the deadline for proposals is 17 January 2012 at 17:00 (Brussels local time)!


On the quest of explaining Semantic Web

A few weeks ago I gave what was, for me, my most successful presentation about the semantic web so far. And guess what: it worked so well because I almost did not mention it in the presentation.

An old friend of mine asked me to present the idea of the semantic web to the company she works for, at an event they call Puzzle Lunch. The idea is to present a technology to everyone in the company who is interested and have lunch afterwards. The time limit was one hour, which I considered practically impossible. I have given quite a lot of talks about it in the past, both to programmers and to non-technical people, and I always found it easier to explain it to someone with little or no technical background. That way I could skip the time-consuming details of RDF and related standards.

Inspired by a presentation from Bart and a long discussion with Christian the night before, I decided to drop the technical aspect completely and just try to explain how we and others use the technology. On the train to Bern I was no longer so sure that this was really the way to go, but in the end I decided to give it a try. I went there with nothing but an improvised spreadsheet, which I used to explain the issues with list- and table-like structures. I did not have a single slide prepared.

After a warm welcome the room filled with quite a lot of people, most of them programmers, as expected, and a few customers of Puzzle. I started my presentation and explained the issues with the implicit knowledge we find wherever we use list- or table-like structures, like Excel files or databases. I showed them how much information gets lost that way and talked about why we need unique identifiers not just for the information itself but also for the headings, the annotations and the relationships between entries in tables. Finally I showed them how this implicit knowledge can be visualized in a graph, which I explained with very simple examples on the whiteboard.
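
The step from a table to a graph can be sketched in a few lines of Python. This is only a minimal illustration, not the example used in the talk: the example.org identifiers, the column names and the sample rows are all made up. The point is that every row and every column heading gets its own identifier, so each cell becomes one explicit subject/predicate/object statement.

```python
# Hypothetical base URI and sample data, invented for illustration only.
BASE = "http://example.org/"

rows = [
    {"name": "Alice", "works_for": "Puzzle", "city": "Bern"},
    {"name": "Bob", "works_for": "Puzzle", "city": "Zurich"},
]


def table_to_triples(rows, key="name"):
    """Turn every table cell into a (subject, predicate, object) triple.

    The row is identified by the value in the `key` column, and each
    column heading gets its own identifier instead of staying an
    unexplained label in the first line of a spreadsheet.
    """
    triples = []
    for row in rows:
        subject = BASE + "person/" + row[key]
        for column, value in row.items():
            predicate = BASE + "property/" + column
            triples.append((subject, predicate, value))
    return triples


triples = table_to_triples(rows)
for triple in triples:
    print(triple)
```

Each printed line is one edge of the graph: the subject and the object are the nodes, and the predicate is the labelled arrow between them.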

I had already talked longer than planned for this part, so I switched to examples: stories from our customers and use cases I had found on semanticweb.com or heard at SemTech. I based every example on three key points Christian and I had worked out the night before:

  • dissolve existing data silos
  • make implicit (or tacit) knowledge explicitly available
  • make it possible to store any relationship without the need of designing an appropriate data model upfront
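
The third point in particular is easy to show in code. The following is a rough sketch in plain Python rather than a real RDF store, and the `ex:` names and the `match` helper are hypothetical, chosen only for this example. Statements are stored as simple triples, so a new kind of relationship can be added at any time without changing a schema.

```python
def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# A tiny in-memory "store": just a set of (subject, predicate, object).
store = set()
store.add(("ex:alice", "ex:worksFor", "ex:puzzle"))
store.add(("ex:puzzle", "ex:locatedIn", "ex:bern"))

# Later on, a relationship nobody planned for is simply added.
# No table has to be altered and no data model has to be redesigned.
store.add(("ex:alice", "ex:mentors", "ex:bob"))

# Everything known about ex:alice, including the new relationship.
about_alice = sorted(match(store, s="ex:alice"))
print(about_alice)
```

In a relational database the `ex:mentors` relationship would typically need a new column or a new join table first; here it is just one more statement next to the others.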

The key message, however, was that the semantic web does not offer entirely new things, but it allows us to solve common problems more quickly and, hence, at lower cost.

Of course I briefly mentioned RDF as the data model, and I talked about vocabularies and how you describe them in so-called ontologies, but I did not spend more time on it than absolutely necessary. Later we briefly talked about why we have to bring the benefits of the flexible data model in the backend to the user interface, which is exactly what we are working on at netlabs.org.

In the end I showed a few examples of our user interface technology and briefly talked about Linked Open Data and its potential. Of course I talked a bit longer than an hour, and when I decided to stop, I knew that I hadn’t talked about quite a lot of things which I find tremendously exciting about the semantic web.

The feedback was just great. We had great discussions at lunch, and later I got various text messages and emails from Puzzle employees and customers who told me they loved the idea of the semantic web and the way I presented it. They were bubbling with ideas about how they could use it in their own company or for their customers, and some of them want to read up on the technology and get their hands dirty.

But there was one final remark which proved to me that presenting the semantic web without lots of technical details is definitely the way to go: at lunch they realized that someone else had presented RDF a few years back, but back then no one understood what it could be used for. So no matter to whom you are explaining the semantic web: show how it can solve existing problems, and don’t waste time on the technology itself. And by the way: my guys at netlabs.org and I would be more than happy to explain it to you as well :)