Category Archives: Knowledge management

How my inner nerd got hooked by the Semantic Web

This was supposed to be the first post for this blog but I never published it so far. The second part is a pretty technical explanation on why I started to love the semantic web, which might also explain the subtitle of this blog ;) I got way better in explaining it meanwhile but I still think it makes sense to post it for historical and nerdy reasons, so here we go.

About four years ago a friend of mine and I were having dinner at my place and I tried to explain him what we aim at. Our vision was still very abstract back then but he told me that the stuff I talk about sounds a lot like something which goes under the name Semantic Web. Some time later he made a short presentation to our team, I remember sitting there and hear about things like triples, SPARQL, giant global graph and so on.

To be honest, I didn’t get it at all at first. But somehow it stuck in my head, I had the idea that this technology indeed might be a part of the puzzle we try to solve. A few month later it was summer and I was looking for a good excuse to sit at the lake in the sun instead of working on the computer.  So I printed a bunch of papers from W3C explaining the semantic web and its components and started reading.

I was amazed. I mean I was seriously amazed. I spend quite some time in the IT business and I did quite a bit of data modeling, programming and all that stuff but what I read was sexy, super sexy (inner nerd speaking here). I still didn’t understand the whole thing yet but it seemed like there is something out there which has the potential to solve all the nasty technical problems I was running into sooner or later in the past.

So what is so sexy about it? As the Internet (or rather its protocol suite called TCP/IP) connected computers in a universal language in the 70ies and the Web connected documents in the 90ies Semantic Web connect things the same way. What things? Anything! Seriously!

If this sounds fair enough you can now stop reading. If you don’t believe me yet, read on.

Big Data and the Semantic Web

I wrote this post in the train on the way back from Luxembourg where I participated at the ICT Call 8 Information and Networking Day: Intelligent Information Management, which is another EU FP7 call for research projects on big data.

The information day was pretty interesting as I didn’t really read into the big data issue yet. The summary was basically that big data is when the size of the data itself is a problem. Examples given included Google which talks about 1 petabyte of data, Amazon S3 with 500 billion objects or Wall Mart which seems to process data sets up to 100 million each day.

They did have quite some RDF/semantic web related projects there, existing ones (lod2.eu) and proposals for new ones by groups which search partners. I was a bit confused about RDF and LOD because although the total data size is impressive, each one of the data bases like DBpedia itself is not that big (DBpedia is only few 100 gigabytes). And funnily enough, I had an article on my reading list about exactly this problem at semanticweb.com: Two kinds of big data.

Rob Gonzalez makes some really good remarks in there, like the statement that there are two kinds of big data: Really big data sets which need to be processed on one box/instance (vertical big data) and the semantic web, which in itself is horizontal big data.

With Horizontal Big Data (maybe HBD will start catching on!), the problem isn’t how to crunch lots of data fast.  Instead, it’s how to rapidly define a working subset of information to help solve a specific need.

That’s a really good remark and I am curious about how we will be able to solve the problem of widely distributed data. So, semantic web community, listen up: There is some money available in this EU FP7 call, deadline for proposals is 17 January 2012 at 17:00 (Brussels local time) !

Recommended readings (mentioned at the FP7 information day):

On the quest of explaining Semantic Web

A few weeks ago I did the (for me) so far most successful presentation about semantic web. And guess what: it worked so well because I almost did not mention it in the presentation.

An old friend of mine asked me to present the idea of semantic web to the company she is working for in an event they call Puzzle Lunch. The idea is to present a technology to everyone interested in the company and have lunch after that. The time limit was one hour, which I considered as practically impossible. I did quite a lot of talks about it in the past, both to programmers and to non-technical people and I always found it easier to explain it to someone with little or no technical background. This way I could skip the time-consuming details of RDF and related standards.

Inspired by a presentation from Bart and a long discussion with Christian the night before I decided to drop the  technical aspect completely and just try to explain how we and others use the technology. In the train to Bern I was not so sure anymore if this was really the way to go, but in the end I decided to give it a try. I went there only with an improvised spreadsheet on which I explain the issues with list and table-like structures. I did not had a single slide prepared.

After a warm welcome the room was filled with quite a lot of people, most of them as expected programmers and a few customers from Puzzle. I started my presentation and explained the issues with implicit knowledge we find everywhere where we use list- or table-like structures, like Excel files or databases. I showed them how much information gets lost that way and talked about why we need unique identifiers not just for the information itself but also for the headings, the annotations, the relationships between entries in tables. Finally I showed them how this implicit knowledge can be visualized in a graph, which I explained by very simple examples on the whiteboard.

I already talked longer than planned for this part so I switched to examples, stories from our customers and use cases I found on semanticweb.com or heard at SemTech. Every example I explained based on three great points Christian and I figured out the night before:

  • dissolve existing data silos
  • make implicit (or tacit) knowledge explicit available
  • make it possible to store any relationship without the need of designing an appropriate data model upfront

The most important key message was however that semantic web does not offer entirely new things, but it allows us to solve common problems more quickly and, hence, at lower costs.

Surely I did briefly mention RDF as the data model, I did talk about vocabularies and how you describe them in so called ontologies but I did not waste more time on it than absolutely necessary. Later we briefly talked about why we have to bring the benefits of the flexible data model in the backend to the user interface, which is exactly what we are working on at netlabs.org.

In the end I showed a few examples of our user interface technology and briefly talked about Linked Open Data and its potential. Surely I did talk a bit longer than an hour and when I decided to stop, I knew that I didn’t talk about quite a lot of things which I find tremendously exciting about the semantic web.

The feedback was just great. We had great discussions at lunch, later I got various text messages and emails from Puzzle employees and customers which told me the loved the idea of the semantic web and the way I presented it. They were sparkling with ideas of how they could use it in their own company or at customers and some of them want to read into the technology and get their hands dirty.

But there was one ultimate remark which proved to me that presenting semantic web without lot’s of technical details is definitely the way to go: At lunch they realized that someone else presented RDF a few years back but back then no one understood what it could be used for. So no matter to whom you are explaining the semantic web: rather show how it can solve existing problems, but don’t waste time on the technology itself. And by the way: me and my guys at netlabs.org would be more than happy to explain it to you as well :)

Why tables and hierarchies don’t scale

Maintaining data is a challenge, and there are several things that can make maintenance a real pain. Many problems with that arise out of the patterns that we use to store our data, as they have a big impact on how easy we can retrieve stored data later, may it be for viewing or cleaning up obsolete data or other tasks.

Before computer-age, people stored data by simply writing it. Lists and tables were invented to make retrieval of stored data much easier, just think of birth and wedding registries etc. Beside that hierarchies were used e.g. in natural science to display relations between unequal things and similar things at a time. Both methods implied a limit, as lists either needed to be short enough to remain usable, or there had to be a suitable scheme to split it up in smaller parts. Hierarchies could as well not grow too big, otherwise it would have been impossible to display them on a single or at least on a small amount of sheets of paper.

Surprisingly, even since the invention of computers, the concepts of tables and hierarchies still dominate the way we store and retrieve data. Tables are used in the majority of database management systems as well as in spreadsheets, and hierarchies are used in file systems and applications to store data. And although we might expect that with a computer we should not have a problem with storing large amounts of data, somehow the limits of pre-computer ages still apply.  This happens at all places where people need to access data, or require to design tables or hierarchies to suit a specific use case. Many problems arise out of that, but mostly the storage patterns are either not seen as the underlying reason, or the problems are taken as irrevocable.

Data to be held in a table, may it be in a spreadsheet or a database, may still not grow too big or  complex, otherwise it cannot be stored in one table of reasonable size and/or complexity. In this case people cannot use spreadsheets or other kind of simple table views anymore, but require use case specific applications for accessing and visualizing data. A more important drawback that applies to tables of all sizes is that they are not very suitable for data exchange, when either the meaning or the formatting of data items can be misinterpreted.

Wherever users create hierarchies, e.g. in file systems or within applications, they face the problem that the hierarchy strongly depends on the logic created in it by one or a group of persons. With with increasing complexity it gets more and more difficult, if not impossible to extended it without breaking this logic.

Semantic web technology is a true game changer in that regard. Beside other advantages, it comes with the ability of linking things with any amount of other thing, instead interlinking documents like the world wide web does, or linking a set of things with another set of things, like tables do. Because of using the most fine-granular relation possible, relations do not have to fit into any other logical system, which would have to be designed for a given use case. And semantic data is not required to be stored in a hierarchy, so there is no risk of implementing today a boundary of tomorrow. As a result, the storage pattern is completely nonspecific to any use case, and can be scaled to any size and complexity.

Of course, retrieval of data stored like this is not bound to a use case as well, as no knowledge about tables or hierarchies is required. Instead data is retrieved by querying relations between things, which is far more intuitive. Interestingly this storage and retrieval pattern matches exactly how we memorize things, namely simply by association between things! Or do you open a table or a directory structure in your mind to remember what you had for lunch yesterday?

This does not mean however that tables and hierarchies are not longer required. Data needs to be displayed for to create, view and modify it in the front end, and therefore we still need the tables or hierarchies we are used to – you can take any form as a hierarchical way of displaying data. And whatever is required for that is already part of the data, because another, most important feature of semantic web technology is that the description of the data is part of the data itself.

However, for that applications need to apply the concept of tables and hierarchies only to a small part of available data, so there is no scaling problem. At the same time storage of data logically scales like never before, not hindered by any schemes that are otherwise only required for the visualization of data.

Files are not the problem, finding them is

Recently Alex Bowyer blogged at O’Reilly Radar about why files need to die. He provides some good ideas about how we will store and find information in the future but he also misses the point a bit. Files are not really the problem, finding them is.

The past months I spent a lot of time writing on the quest for money. Due to the decentralized nature of our team at netlabs.org, we already realized a long time ago that doing work in something like a Word/OpenOffice document is not the way to go. In the peak time of Wikis we set up our own MediaWiki and started working in there. That did and does work quite well for some things but it is something I am not gonna do forever for various reasons. So what are the benefits/drawbacks?

Benefits of working with a Wiki:

  • History. I can not imagine working on a text anymore without a powerful history. I need to be able to see any change between version x and y at any time. If you interact with several people on the same text this is the only right way to do it in my opinion. A good history in a Wiki is far more powerful than tracking changes in an office document.
  • Multiple users can work on it at the same time. Yes there are more advanced ways of doing that nowadays, some suck (Google Docs: “someone else is editing this file”), some are a bit overkill for my taste (Etherpad). If we screw up the same paragraph we get hints and we have to fix it by hand. Not great, but works most of the time.
  • I just work on the text. I might add some headings and things like bold/emphasis etc but that’s about it. I do not loose any time at formatting it to look good on something ridiculous like an A4 page.
  • The text itself always has the same link and I can access it from everywhere. This is what made the Web so useful.

Drawbacks of a Wiki:

  • Finding wiki pages. This problem started with (digital) files in the late 70-ies, continued with emails in the 90-ies and we still have it with links (technically URIs) nowadays. There were plenty of ideas on how to solve it and none of them worked for me yet.
  • Exporting that stuff. At some point I have to do a snapshot in time which I send out to someone by email/PDF/office document etc. This is the point where I start to cry because I have to launch an office-like app to do that and believe me, I really hate all of them (LaTeX included).

A few years ago, a former boss very proudly told me that he had rearranged the files on the file server. I had a look at it and was totally lost. Not that I liked the old structure but I really couldn’t figure out which files were where anymore. It was logical for him but my brain totally disagreed with that particular structure. He could not understand why this did not work for me.

This was one of the early lessons we learned in our team. Have a look at someones Desktop. Might scare the sh*t out of you but it works for that person. For a long time I had the tendency to order everything in nested structures myself, like emails or files. Result: A few months later I absolutely could not remember where I saved that information and I either wasted a lot of time finding it or I let it go. And I did let go a lot I noticed.

So how would I want to handle knowledge in the future? I do want to have all benefits of a wiki, combined with a way of finding content by context. This means I don’t want to crawl through a fixed structure for locating a text I wrote, as I won’t agree with the structure or won’t remember the right way through it anymore.

My brain remembers the things around a specific event, even if it is only a simple text about a certain topic.  I know it had something to do with funding, particularly for an EU FP7 project. I remember that I sent it to that guy of the University in Bern for a review. I know that my friend Barbara had a look at the English before we sent it out. I remember that I went to Bali, the day after I finished the stuff. I know that one guy sent me an email about it and told me he loves the project. Note that I don’t remember the exact time of it but the things around it.

If I want to find that text, that’s the way I want to search and find it. This can be a file on my disk or a link on the Internet. In that regard a URI is nothing else than a modern form of a file, including all the problems we know.  And this is part of what we are working on at netlabs.org these days.