Big Data and the Semantic Web

I wrote this post in the train on the way back from Luxembourg where I participated at the ICT Call 8 Information and Networking Day: Intelligent Information Management, which is another EU FP7 call for research projects on big data.

The information day was pretty interesting as I didn’t really read into the big data issue yet. The summary was basically that big data is when the size of the data itself is a problem. Examples given included Google which talks about 1 petabyte of data, Amazon S3 with 500 billion objects or Wall Mart which seems to process data sets up to 100 million each day.

They did have quite some RDF/semantic web related projects there, existing ones (lod2.eu) and proposals for new ones by groups which search partners. I was a bit confused about RDF and LOD because although the total data size is impressive, each one of the data bases like DBpedia itself is not that big (DBpedia is only few 100 gigabytes). And funnily enough, I had an article on my reading list about exactly this problem at semanticweb.com: Two kinds of big data.

Rob Gonzalez makes some really good remarks in there, like the statement that there are two kinds of big data: Really big data sets which need to be processed on one box/instance (vertical big data) and the semantic web, which in itself is horizontal big data.

With Horizontal Big Data (maybe HBD will start catching on!), the problem isn’t how to crunch lots of data fast. Instead, it’s how to rapidly define a working subset of information to help solve a specific need.

That’s a really good remark and I am curious about how we will be able to solve the problem of widely distributed data. So, semantic web community, listen up: There is some money available in this EU FP7 call, deadline for proposals is 17 January 2012 at 17:00 (Brussels local time) !

Recommended readings (mentioned at the FP7 information day):

Big Data Now (O’Reilly)
Big data: The next frontier for innovation, competition and productivity (McKinsey & Company)
The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research)

More posts

Fly Me to the Moon

Data quality counts

Does RDF fit into my architecture?

Inner nerd and Semantic Web: The glory details