Category Archives: RDF

Fly Me to the Moon

It’s interesting to see how people talk about Linked Data & RDF these days. Most of the time the discussion focuses on one specific feature of the technology stack which either rocks or sucks, depending on which side the author stands on.

Let’s start with what are for me the two best pages about RDF I’ve read since I started working with the technology five years ago. Irene Polikoff in my opinion summarizes perfectly what RDF is about:

The ability to combine multiple data sources into a whole that is greater than the sum of its parts can let the business glean new insights. So how can IT combine multiple data sources with different structures while retaining the flexibility to add new ones into the mix as you go along? How do you query the combined data? How do you look for patterns across subsets of data that came from different sources?

The article gives a very good idea of when you need which parts of the RDF stack to tackle these kinds of questions. The reason why I started reading into RDF & Linked Data is that I think RDF can solve these kinds of questions in a time- and money-efficient way, up to the scale of global companies and governments. And this is the scale I’m really interested in.

And this brings us to the other end of what we need to make a technology mainstream: the average (web) developer. It’s still painfully hard to use the highly flexible data model you get with RDF to create user interfaces. I know this because some colleagues and I have been working on this for some time now, and it’s also the domain where we see a lot of (often negative) postings about Linked Data and RDF. Some examples:

What they have in common is that they only look at the Semantic Web stack from their particular, limited perspective. The things they criticize are mostly correct in their own small world. What they fail to see is that the Semantic Web does not try to solve an easy problem but a pretty hard one: find a way to turn the web of documents into a web of data and make sure that machines can help us interpret it and make our lives easier. I wasn’t aware of the real complexity of this before I started working with the RDF stack.

Now there are several options to handle this:

  • Ignore everything other than what you are trying to solve: JSON-LD is great, and it probably does make things easier for a lot of developers. Manu states that he never needed a quadstore and SPARQL in 7+ years of working with the technology stack. Good for him, but then we obviously don’t solve the same kind of problems. This is not a problem at all, but it’s important to keep in mind when we compare technologies.
  • Reinvent the wheel: Jens Ohlig first rants about the Semantic Web and then explains for 30 minutes why Wikidata is so much work: unique identifiers, relationships between data, ontologies, provenance, multiple languages etc. I understand that Wikidata decided against using RDF and went for what they know best, which is probably PHP & MySQL. But it doesn’t help your point if you show me that in the end you solve exactly the same kind of problems RDF defined in W3C standards. You just build yet another data silo.
  • Not invented here: The Nepomuk project was funded by an EU FP7 research grant, and I guess that none of the people who originally worked on the RDF code are still there. The new people probably mainly know key/value stores and didn’t understand RDF or graphs. The normal reaction in this case is to throw things away and start from scratch, instead of learning something which looks unfamiliar at first.
  • Accept that the world is complicated and continue working on the missing parts of the stack.

Manu Sporny:

TL;DR: The desire for better Web APIs is what motivated the creation of JSON-LD, not the Semantic Web. If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.

I fully agree, Manu, but again, there are more problems out there than the ones JSON-LD tries to address. I think Brian Sletten summarized this best in a recent posting at semanticweb.com:

Fundamentally, however, I think the problem comes down to the fact that the Semantic Web technology stack gets a lot of criticism for not hiding the fact that Reality is Hard. The kind of Big Enterprise software sales that get attention promise to hide the details, protect you from complexity, to sweep everything under the rug.

[lots more good stuff]

What is the alternative? If we abandon these ideas, what do we turn to? The answer after even the briefest consideration is that there is nothing else on the table. No other technology purports to attempt to solve the wide variety of problems that RDF, RDFS, OWL, SPARQL, RDFa, JSON-LD, etc. do.

I couldn’t agree more. You can be big enough to do all this work on your own. If you are Google or Facebook, that might even make sense. For everyone else: go with the standards. Even Google recommends this.

I’m glad that Manu Sporny agreed to keep JSON-LD compatible with RDF, as they solved a lot of interesting problems around JSON-LD, like graph normalization and data signing. Maybe we need more people like him who “stop making the case for it and spend [their] time doing something useful”. But at the same time we need the people who want to bring us to the moon. I’m glad Tim Berners-Lee decided to do so more than 20 years ago when he wrote his ‘Vague, but exciting’ proposal.

Data quality counts

When linking with other data sources, knowing the origin and the quality of the external data is without a doubt important. After all, an important point of using semantic web technology is to use other data sources to enrich our own data and thereby enhance our own solutions.

People who are new to RDF and have rarely, if ever, had to interlink with foreign data sources may think that this is a problem that comes only with RDF. On the contrary! Any data retrieved from whatever other system may be right or wrong, and may be complete or incomplete. The RDF specifications even state explicitly that RDF-based data is by definition neither guaranteed correct nor complete – well, at least if you are not the one ensuring it. But this problem only gets bigger the more external data sources you access, and once you start using semantic web technology to do so, it will happen more and more often.

This raises questions like: how certain are you about which source a given set of data originates from, how much do you know about the provider of that data source, how accurate is the data model, and how much can you trust the quality of data maintenance? Obviously the answers to these questions will have a direct impact on how much you can benefit from interlinking with the given data. If the quality of external data is not ensured, the quality of your solution will suffer in turn.
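As a minimal sketch of how the origin question can be handled on the RDF side, named graphs let you record for every statement which source it came from. The example below uses Python with the rdflib library; the graph names, identifiers and numbers are made up for illustration only:

```python
from rdflib import Dataset, Namespace, URIRef, Literal

# Hypothetical example namespace; replace with your own identifiers.
EX = Namespace("http://example.org/")

ds = Dataset()

# Keep the triples of every external source in its own named graph,
# so each statement can always be traced back to its origin.
dbpedia_copy = ds.graph(URIRef("http://example.org/graphs/dbpedia"))
partner_feed = ds.graph(URIRef("http://example.org/graphs/partner-feed"))

dbpedia_copy.add((EX.Zurich, EX.population, Literal(400000)))
partner_feed.add((EX.Zurich, EX.population, Literal(421878)))

# When two sources disagree, the graph a statement lives in tells us
# whom to ask (and how much to trust the value).
for label, graph in (("dbpedia copy", dbpedia_copy), ("partner feed", partner_feed)):
    for value in graph.objects(EX.Zurich, EX.population):
        print(label, "says the population is", value)
```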

So if you link to external data yourself, you will have to pay close attention to the quality of the data sources you use, in order to avoid a negative impact on the quality of your own products and services.

Does RDF fit into my architecture?

When software engineers come into contact with the world of semantic data for the first time, it may be difficult for them to grasp all the benefits and consequences at once. When I heard of RDF for the first time, I thought this concept could possibly impose more risks than benefits. So how would RDF fit into the architecture of my solutions? Or rather: why did I feel at first that it possibly might not? Just in case you have or have had similar feelings, I would like to share my thoughts on that with you.

I think that back then I felt quite uncertain because I would have to let go of all the familiar ways of accessing structured data. At first I even considered RDF-based data to be stored with no real structure, but of course that is not true. Later, I felt that modeling data with literally any vocabulary would imply lots of problems and hinder applications from being able to deal with data from anybody else outside, but of course that is not true either.

After some time of dealing with RDF and working with it in practice, I understood more and more that these points are not a problem at all; instead, RDF shows its beauty and flexibility by letting me do things differently.

First of all, it is not that RDF data is poorly structured. Instead, it is structured by semantics – and by semantics only!

In fact, the vocabularies used within the RDF concept provide the most flexible and at the same time a very precise and reliable way to describe and structure data in a machine-readable way. Each vocabulary term can be uniquely identified, so that the meaning of the described data can always be determined exactly. And with RDF you put the knowledge about the structure of the data into the data itself, while with other concepts you mostly have to put it into (self-)developed software. This is why RDF is said to compare to other storage concepts as knowledge engineering compares to software engineering.
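To make this concrete, here is a minimal sketch in Python with the rdflib library (my choice for illustration, nothing this post depends on). The predicates come from the published FOAF vocabulary, so any consumer can look up what they mean; the people URIs are invented:

```python
from rdflib import Graph, Namespace, Literal

FOAF = Namespace("http://xmlns.com/foaf/0.1/")   # a widely used, published vocabulary
EX = Namespace("http://example.org/people/")      # hypothetical identifiers for this sketch

g = Graph()
g.bind("foaf", FOAF)

# The predicate URIs carry the meaning; no application-specific schema
# is needed to interpret these statements later.
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, FOAF.knows, EX.bob))
g.add((EX.bob, FOAF.name, Literal("Bob")))

print(g.serialize(format="turtle"))
```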

Using semantics instead of conventional ways of structuring data, e.g. tables and hierarchies, has another important advantage: scalability in complexity. RDF data can become more complex without breaking any existing logic or requiring more effort to keep the more complex data model performing well. The same is not true for tables and hierarchies; they don’t scale well in complexity at all.

Second, it is not that you would need a restricted set of vocabularies to keep data usable for yourself or anybody else. Instead, RDF encourages you to use any existing vocabulary in order to increase the reusability of data. In fact I can model a given set of data with any number of different vocabularies at the same time, and that without much overhead or redundancy (try to do that with SQL-based data sources…).
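A small illustration of that, again as a Python/rdflib sketch: one resource is described with FOAF, schema.org and an invented in-house term side by side, without any extra mapping layer. All identifiers are made up for the example:

```python
from rdflib import Graph, Namespace, Literal

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.org/")             # hypothetical in-house vocabulary

g = Graph()
g.bind("foaf", FOAF)
g.bind("schema", SCHEMA)
g.bind("ex", EX)

# Mixing vocabularies freely on one resource: FOAF for the social bits,
# schema.org for the job, and a home-grown term where nothing fits yet.
g.add((EX.alice, FOAF.name, Literal("Alice Example")))
g.add((EX.alice, SCHEMA.jobTitle, Literal("Data Engineer")))
g.add((EX.alice, EX.favouriteEditor, Literal("vim")))

print(g.serialize(format="turtle"))
```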

Restricting your solutions to a fixed set of vocabularies would not give you any extra benefit. Instead, you would restrict your services to those RDF-based data sources that use the very same vocabularies that you do. And that would be much more of a restriction in the future than you can imagine today.

However, even if you use existing vocabularies as much as possible, you may still need to interlink your data with data from a source that describes the same things but uses a different vocabulary. Then you simply need a way to translate between different sets of vocabularies. Because vocabularies are themselves described semantically, this can be done in a generic manner. In the easiest case this translation would take place in the back end, i.e. in your triple store engine, making the step completely transparent to the services using the store. At the time of this writing, however, no database product seems to be able to do that. As an alternative, such a translation could be encapsulated in a framework through which your services access RDF data sources, including your own. netlabs.org is currently working on a technique that allows interlinking between sets of vocabularies with a flexible way of defining such a translation, of course based on RDF data as well.
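Until such support is built in, one pragmatic way to do the translation yourself is a SPARQL CONSTRUCT query that maps one vocabulary onto another. Here is a sketch with Python and rdflib; the example data and the mapping from schema:name to foaf:name are purely illustrative:

```python
from rdflib import Graph

data = """
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/people/> .

ex:alice schema:name "Alice Example" .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Translate schema:name statements into foaf:name statements.
mapping = """
PREFIX schema: <http://schema.org/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>

CONSTRUCT { ?s foaf:name ?name }
WHERE     { ?s schema:name ?name }
"""

translated = Graph()
for triple in g.query(mapping):   # a CONSTRUCT query yields triples
    translated.add(triple)

print(translated.serialize(format="turtle"))
```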

Inner nerd and Semantic Web: The gory details

What we just heard in the introduction means that the semantic web once and for all (well, at least for a while) solves the data modeling problem we face today. There is no application- or use-case-proprietary data anymore. The data describes itself, regardless of a specific application or use case. Can you imagine what this means for data re-use?

But why is re-use so important? Let me explain that with a posting by Tim Berners-Lee from 2007:

The word Web we normally use as short for World Wide Web. The WWW increases the power we have as users again. The realization was “It isn’t the computers, but the documents which are interesting”. Now you could browse around a sea of documents without having to worry about which computer they were stored on. Simpler, more powerful. Obvious, really.

Looking back, this fits perfectly with one of the revelations I had myself one day. In the early ’90s, before I started using the web, my computer was the center of my digital universe. Everything was on my OS/2 box and I was happy. Now I have multiple devices with data on each one of them, and on various sites somewhere on the Internet. Is this better? I’m not so sure, as most of the devices and sites act as small islands somewhere in the wide ocean, and there is no way to get from one island to the other. So let us quote Tim again:

[…] The Net links computers, the Web links documents. Now, people are making another mental move. There is realization now, “It’s not the documents, it is the things they are about which are important”. Obvious, really.

There are some important remarks in here: while we (or rather our brains) can make the link between things, the computer cannot. If you don’t believe me, google for something like Jaguar and get me only the sites which are related to the animal. That seems to be pretty hard for Google.

Biologists are interested in proteins, drugs, genes. Businesspeople are interested in customers, products, sales. We are all interested in friends, family, colleagues, and acquaintances. There is a lot of blogging about the strain, and total frustration that, while you have a set of friends, the Web is providing you with separate documents about your friends. One in Facebook, one on LinkedIn, one in LiveJournal, one on advogato, and so on. The frustration that, when you join a photo site or a movie site or a travel site, you name it, you have to tell it who your friends are all over again. The separate Web sites, separate documents, are in fact about the same thing — but the system doesn’t know it.

The other remark is related to what Tim calls separate documents. You can consider sites like LinkedIn or Facebook as separate documents in that regard. Why? Simple: those sites pervert the original design idea of the web, as they create something like a giant document or black hole which sucks data in and only opens up a few things to the outside world over proprietary APIs. Sounds like what? Right, sooo ’90s! Been there, done that, just with Microsoft products back then. Doing the same in the web browser as Web 2.0 doesn’t really make the whole thing better.

So how is the Semantic Web going to make this better? Pretty simple: the data is the API! If you describe information the semantic web way, you will use RDF as the lingua franca of the web, and this by definition provides a universal, unambiguous way of accessing and querying it. No more lock-in, no more application- or use-case-proprietary data, but data re-use. And another thing I really love about it: as transport it uses the same foundation the web has been running on for 20 years: HTTP. Good times!
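To illustrate what “the data is the API” means, the sketch below dereferences a resource URI over plain HTTP and gets machine-readable statements back via content negotiation. It uses Python with rdflib and assumes the DBpedia server is reachable and still serves RDF for that resource:

```python
from rdflib import Graph, URIRef

resource = URIRef("http://dbpedia.org/resource/Tim_Berners-Lee")

g = Graph()
# Plain HTTP with content negotiation: the server hands us RDF instead of HTML.
g.parse(resource)

print(len(g), "triples about", resource)
# Show a handful of the statements we received.
for predicate, obj in list(g.predicate_objects(resource))[:10]:
    print(predicate, "->", obj)
```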

If you want to get your hands dirty now you might check out our gentle introduction to the technology behind the semantic web.

How my inner nerd got hooked by the Semantic Web

This was supposed to be the first post for this blog, but I never published it until now. The second part is a pretty technical explanation of why I started to love the semantic web, which might also explain the subtitle of this blog ;) I have gotten much better at explaining it in the meantime, but I still think it makes sense to post it for historical and nerdy reasons, so here we go.

About four years ago a friend of mine and I were having dinner at my place and I tried to explain to him what we are aiming at. Our vision was still very abstract back then, but he told me that the stuff I was talking about sounded a lot like something which goes under the name Semantic Web. Some time later he gave a short presentation to our team; I remember sitting there and hearing about things like triples, SPARQL, the giant global graph and so on.

To be honest, I didn’t get it at all at first. But somehow it stuck in my head; I had the idea that this technology indeed might be a part of the puzzle we are trying to solve. A few months later it was summer and I was looking for a good excuse to sit at the lake in the sun instead of working at the computer. So I printed a bunch of papers from the W3C explaining the semantic web and its components and started reading.

I was amazed. I mean, I was seriously amazed. I have spent quite some time in the IT business and done quite a bit of data modeling, programming and all that stuff, but what I read was sexy, super sexy (inner nerd speaking here). I still didn’t understand the whole thing yet, but it seemed like there was something out there with the potential to solve all the nasty technical problems I had sooner or later run into in the past.

So what is so sexy about it? Just as the Internet (or rather its protocol suite, TCP/IP) connected computers in a universal way in the ’70s and the Web connected documents in the ’90s, the Semantic Web connects things the same way. What things? Anything! Seriously!

If this sounds fair enough you can now stop reading. If you don’t believe me yet, read on.

Big Data and the Semantic Web

I wrote this post on the train on the way back from Luxembourg, where I participated in the ICT Call 8 Information and Networking Day: Intelligent Information Management, which is another EU FP7 call for research projects on big data.

The information day was pretty interesting, as I hadn’t really read into the big data topic yet. The summary was basically that big data is when the size of the data itself is a problem. Examples given included Google, which talks about 1 petabyte of data, Amazon S3 with 500 billion objects, or Walmart, which seems to process up to 100 million data sets each day.

They had quite a few RDF/semantic web related projects there, both existing ones (lod2.eu) and proposals for new ones by groups searching for partners. I was a bit confused about RDF and LOD because, although the total data size is impressive, each individual database like DBpedia is not that big (DBpedia is only a few hundred gigabytes). And funnily enough, I had an article about exactly this problem on my reading list at semanticweb.com: Two kinds of big data.

Rob Gonzalez makes some really good remarks in there, like the statement that there are two kinds of big data: really big data sets which need to be processed on one box/instance (vertical big data), and the semantic web, which in itself is horizontal big data.

With Horizontal Big Data (maybe HBD will start catching on!), the problem isn’t how to crunch lots of data fast.  Instead, it’s how to rapidly define a working subset of information to help solve a specific need.

That’s a really good remark and I am curious how we will be able to solve the problem of widely distributed data. So, semantic web community, listen up: there is some money available in this EU FP7 call; the deadline for proposals is 17 January 2012 at 17:00 (Brussels local time)!

Recommended readings (mentioned at the FP7 information day):

On the quest of explaining Semantic Web

A few weeks ago I gave my (so far) most successful presentation about the semantic web. And guess what: it worked so well because I almost did not mention it in the presentation.

An old friend of mine asked me to present the idea of the semantic web to the company she is working for, at an event they call Puzzle Lunch. The idea is to present a technology to everyone interested in the company and have lunch afterwards. The time limit was one hour, which I considered practically impossible. I have done quite a lot of talks about it in the past, both to programmers and to non-technical people, and I always found it easier to explain to someone with little or no technical background. That way I could skip the time-consuming details of RDF and related standards.

Inspired by a presentation from Bart and a long discussion with Christian the night before, I decided to drop the technical aspect completely and just try to explain how we and others use the technology. On the train to Bern I was not so sure anymore whether this was really the way to go, but in the end I decided to give it a try. I went there with only an improvised spreadsheet on which I explained the issues with list- and table-like structures. I did not have a single slide prepared.

After a warm welcome, the room filled with quite a lot of people, most of them, as expected, programmers, plus a few customers of Puzzle. I started my presentation and explained the issues with implicit knowledge that we find everywhere we use list- or table-like structures, like Excel files or databases. I showed them how much information gets lost that way and talked about why we need unique identifiers not just for the information itself but also for the headings, the annotations, and the relationships between entries in tables. Finally I showed them how this implicit knowledge can be visualized in a graph, which I explained with very simple examples on the whiteboard.
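For readers who were not in the room, the whiteboard example looked roughly like the following sketch (Python with rdflib; all identifiers are invented). A spreadsheet row such as “Alice | Bern | 1980” keeps its meaning only in the column headers and in our heads, whereas in the graph every relationship gets its own identifier:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")   # hypothetical vocabulary and identifiers

g = Graph()
g.bind("ex", EX)

# The row "Alice | Bern | 1980" becomes statements in which the relationships
# themselves ("lives in", "born in year") are named and uniquely identified.
g.add((EX.alice, EX.livesIn, EX.Bern))
g.add((EX.alice, EX.bornInYear, Literal("1980", datatype=XSD.gYear)))
g.add((EX.Bern, EX.locatedIn, EX.Switzerland))

print(g.serialize(format="turtle"))
```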

I had already talked longer than planned for this part, so I switched to examples: stories from our customers and use cases I found on semanticweb.com or heard at SemTech. I explained every example based on three great points Christian and I had figured out the night before:

  • dissolve existing data silos
  • make implicit (or tacit) knowledge explicitly available
  • make it possible to store any relationship without the need of designing an appropriate data model upfront

The most important key message, however, was that the semantic web does not offer entirely new things, but it allows us to solve common problems more quickly and, hence, at lower cost.

Of course I did briefly mention RDF as the data model, and I did talk about vocabularies and how you describe them in so-called ontologies, but I did not spend more time on it than absolutely necessary. Later we briefly talked about why we have to bring the benefits of the flexible data model in the back end to the user interface, which is exactly what we are working on at netlabs.org.

In the end I showed a few examples of our user interface technology and briefly talked about Linked Open Data and its potential. Of course I talked a bit longer than an hour, and when I decided to stop, I knew that I hadn’t talked about quite a lot of things which I find tremendously exciting about the semantic web.

The feedback was just great. We had great discussions at lunch; later I got various text messages and emails from Puzzle employees and customers who told me they loved the idea of the semantic web and the way I presented it. They were sparkling with ideas of how they could use it in their own company or at customers, and some of them want to read into the technology and get their hands dirty.

But there was one ultimate remark which proved to me that presenting the semantic web without lots of technical details is definitely the way to go: at lunch they realized that someone else had presented RDF a few years back, but back then no one understood what it could be used for. So no matter to whom you are explaining the semantic web: show how it can solve existing problems, and don’t waste time on the technology itself. And by the way: my colleagues at netlabs.org and I would be more than happy to explain it to you as well :)

Why tables and hierarchies don’t scale

Maintaining data is a challenge, and there are several things that can make maintenance a real pain. Many of these problems arise out of the patterns we use to store our data, as they have a big impact on how easily we can retrieve stored data later, be it for viewing, for cleaning up obsolete data, or for other tasks.

Before the computer age, people stored data by simply writing it down. Lists and tables were invented to make retrieval of stored data much easier – just think of birth and wedding registries. Besides that, hierarchies were used, e.g. in natural science, to display relations between unequal and similar things at the same time. Both methods implied a limit: lists either needed to be short enough to remain usable, or there had to be a suitable scheme to split them up into smaller parts. Hierarchies could not grow too big either, otherwise it would have been impossible to display them on a single sheet of paper, or at least on a small number of sheets.

Surprisingly, even since the invention of computers, the concepts of tables and hierarchies still dominate the way we store and retrieve data. Tables are used in the majority of database management systems as well as in spreadsheets, and hierarchies are used in file systems and applications to store data. And although we might expect that with a computer we should have no problem storing large amounts of data, somehow the limits of the pre-computer age still apply. This happens wherever people need to access data or have to design tables or hierarchies to suit a specific use case. Many problems arise out of that, but mostly the storage patterns are either not recognized as the underlying reason, or the problems are taken as irrevocable.

Data held in a table, be it in a spreadsheet or a database, still may not grow too big or complex, otherwise it cannot be stored in one table of reasonable size and/or complexity. In that case people can no longer use spreadsheets or other simple table views, but need use-case-specific applications for accessing and visualizing the data. A more important drawback that applies to tables of all sizes is that they are not well suited for data exchange, as either the meaning or the formatting of data items can be misinterpreted.

Wherever users create hierarchies, e.g. in file systems or within applications, they face the problem that the hierarchy strongly depends on the logic put into it by one person or a group of persons. With increasing complexity it gets more and more difficult, if not impossible, to extend it without breaking this logic.

Semantic web technology is a true game changer in that regard. Among other advantages, it comes with the ability to link things with any number of other things, instead of interlinking documents like the world wide web does, or linking a set of things with another set of things, like tables do. Because it uses the most fine-granular relation possible, relations do not have to fit into any larger logical system that would have to be designed for a given use case. And semantic data does not have to be stored in a hierarchy, so there is no risk of implementing today a boundary of tomorrow. As a result, the storage pattern is completely nonspecific to any use case and can scale to any size and complexity.

Of course, retrieval of data stored like this is not bound to a use case either, as no knowledge about tables or hierarchies is required. Instead data is retrieved by querying relations between things, which is far more intuitive. Interestingly, this storage and retrieval pattern matches exactly how we memorize things, namely simply by association between things! Or do you open a table or a directory structure in your mind to remember what you had for lunch yesterday?
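As a small sketch of what querying by relations looks like in practice, here is a SPARQL query run with Python and rdflib; the data and the vocabulary are invented for the example:

```python
from rdflib import Graph

data = """
@prefix ex: <http://example.org/> .

ex:alice ex:worksFor  ex:Acme ;
         ex:knows     ex:bob .
ex:bob   ex:worksFor  ex:Globex .
ex:Acme  ex:locatedIn ex:Bern .
"""

g = Graph()
g.parse(data=data, format="turtle")

# "Whom does Alice know, and where do those people work?" -- no table layout
# or directory structure needs to be known up front, only the relations.
query = """
PREFIX ex: <http://example.org/>
SELECT ?friend ?employer
WHERE {
  ex:alice ex:knows    ?friend .
  ?friend  ex:worksFor ?employer .
}
"""

for friend, employer in g.query(query):
    print(friend, "works for", employer)
```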

This does not mean, however, that tables and hierarchies are no longer required. Data needs to be displayed in order to create, view and modify it in the front end, and therefore we still need the tables and hierarchies we are used to – you can regard any form as a hierarchical way of displaying data. And whatever is required for that is already part of the data, because another, most important feature of semantic web technology is that the description of the data is part of the data itself.

However, for that purpose applications need to apply the concept of tables and hierarchies only to a small part of the available data, so there is no scaling problem. At the same time the logical storage of data scales like never before, unhindered by schemes that are otherwise only required for the visualization of data.

schema.org: Not Too Impressive

Last Friday Google, Yahoo and Bing announced the launch of schema.org, which promotes annotation of web pages to make them more useful for search engines. This is definitely a hot topic as the current web of documents reached its limits a few years ago. Every search engine user knows what I am talking about.

I have been active in the semantic web world for a while now, and many people in this community were not very pleased with the decisions taken by the big three. But what is my problem with schema.org? There are quite a few, and most of them are well addressed in other blog posts; let me recap:

  • From a complexity point of view, RDFa is the same thing as Microdata; Manu Sporny proves that in his blog post by example. The argument that RDFa is more complex than Microdata is pure nonsense.
  • RDFa is a serialization of RDF within XML/(X)HTML trees. In case you do not know RDF, Mike Bergmann calls it the universal data solvent, which captures it pretty well. RDF provides much more than Microdata and is so much more powerful. There is simply no excuse for not using RDFa in the first place.
  • There are lots of great examples out there of how you can use RDFa; one of the most famous is probably GoodRelations. In the schema.org FAQ they state that their work is “inspired by earlier work like Microformats, FOAF, GoodRelations, OpenCyc, etc.”.

The last point needs some more explanation. In RDF, a shared vocabulary gets described in a so-called ontology, which is most of the time expressed in RDF Schema or OWL. Such an ontology defines the terms used (called predicates), the data type of each predicate, and its relationships to other predicates. Both RDF Schema and OWL are themselves expressed in RDF, which makes it possible to bootstrap not just the data itself but also the shared vocabulary used to describe the data in the same format. This is big, really big!
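For illustration, here is a tiny vocabulary described in RDF Schema, built with Python and rdflib; the ex: terms are invented for this sketch. The point is that the vocabulary is just more triples, published in the same format as the data it describes:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/vocab/")   # hypothetical vocabulary

g = Graph()
g.bind("ex", EX)

# The vocabulary is described with the very same triple mechanism as the data.
g.add((EX.Person, RDF.type, RDFS.Class))
g.add((EX.Organisation, RDF.type, RDFS.Class))
g.add((EX.worksFor, RDF.type, RDF.Property))
g.add((EX.worksFor, RDFS.domain, EX.Person))
g.add((EX.worksFor, RDFS.range, EX.Organisation))
g.add((EX.worksFor, RDFS.label, Literal("works for", lang="en")))

print(g.serialize(format="turtle"))
```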

Another important aspect is that data modeled in RDF can be neither correct nor complete. If you ask 10 people to model the world, you will get 10 different results. If you express those 10 models in RDF, we will be able to map matching things between the different models, even if it is not always exactly the same thing. RDF can handle this uncertainty, which is one of my favorite things about RDF.

This is an important lesson learned after failures in the 1990s, when companies like Taligent tried to model the whole world in one single library. RDF instead propagates the concept of domain experts. If you are strong in a specific domain, you should create the vocabulary for it, not some experts at Google/Yahoo/Bing who try to figure out how to squeeze the whole universe into 300 or so tags. Maybe your domain vocabulary is not fully compatible with my domain vocabulary, but that is just how the world works, and RDF can handle that by design.

So besides the technical decision not to use RDFa, this is for me the biggest failure of schema.org. They thought of a vocabulary which fits for them. This vocabulary is not described in RDF, which makes it far less useful for machines, and it is very hard to extend it that way or interlink it with more powerful vocabularies like GoodRelations, FOAF etc., which have already been out there for a long time. Tim Berners-Lee suggested a 5-star deployment scheme for Linked Open Data; according to that I would probably give schema.org a 4 right now, but there is still lots of room for improvement.

Fortunately some people have already addressed the RDF part of it: with the help of some well-known people in the Semantic Web world, Michael Hausenblas created a “real” schema out of it, expressed in RDF. The results can be found at schema.rdfs.org. That is the way the well-paid engineers of the three big companies should have done it in the first place. Now we can link it to DBpedia and other resources, extend it for our specific domains, and use it in RDFa or whatever RDF serialization we choose.
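As a sketch of what that enables: once the vocabulary exists as RDF, its terms can be related to the older vocabularies with a few additional triples (Python/rdflib; the mappings shown are illustrative, not official ones):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS

SCHEMA = Namespace("http://schema.org/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("foaf", FOAF)
g.bind("owl", OWL)

# Because the schema.org terms now exist as RDF resources, they can be
# interlinked with existing vocabularies instead of living in a silo.
g.add((SCHEMA.Person, OWL.equivalentClass, FOAF.Person))
g.add((SCHEMA.name, RDFS.subPropertyOf, FOAF.name))

print(g.serialize(format="turtle"))
```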

I am not sure where schema.org is going. RDFa co-creator Manu Sporny is pessimistic about the current state, while others like Mike Bergmann are very optimistic and think it is one of the most important steps in the semantic web world so far. I think the RDF Schema version of the vocabulary is a first step in the right direction, but I am afraid that the decision for Microdata will seriously harm the adoption of RDFa as a standard. This should be changed as soon as possible! Let us see what Manu Sporny and others will present in the next few days or weeks. By the way, there is also a session at this year’s SemTech about it.

So what is schema.org currently? A step back in terms of the technology used, plus a vocabulary which does not follow the intentions the semantic web world has been pursuing for several years now. Not too impressive.