All posts by Adrian Gschwend

Founder and head of netlabs.org

Fly Me to the Moon

It’s interesting to see how people talk about Linked Data & RDF these days. Most of the time the discussion focuses on one specific feature of the technology stack which either rocks or sucks, depending on which side the author is on.

Let’s start with the two best pages about RDF I’ve read since I started working with the technology five years ago. Irene Polikoff, in my opinion, summarizes perfectly what RDF is about:

The ability to combine multiple data sources into a whole that is greater than the sum of its parts can let the business glean new insights. So how can IT combine multiple data sources with different structures while retaining the flexibility to add new ones into the mix as you go along? How do you query the combined data? How do you look for patterns across subsets of data that came from different sources?

The article gives a very good idea of when you need which parts of the RDF stack to tackle these kinds of questions. The reason I started reading up on RDF & Linked Data is that I think RDF can solve these kinds of questions in a time- and money-efficient way, up to the scale of global companies and governments. And that is the scale I’m really interested in.

And this brings us to the other end of what we need to make a technology mainstream: the average (web) developer. It’s still painfully hard to use the highly flexible data model you get with RDF to create user interfaces. I know this because some colleagues and I have been working on this for some time now, and it’s also the domain where we see a lot of (often negative) posts about Linked Data and RDF. Some examples:

What they have in common is that they only look at the Semantic Web stack from their particular, limited perspective. The things they criticize are mostly correct within their own small world. What they fail to see is that the Semantic Web does not try to solve an easy problem but a pretty hard one: find a way to split the web of documents into a web of data and make sure that machines can help us interpret it and make our lives easier. I wasn’t aware of the real complexity of this before I started working with the RDF stack.

Now there are several options to handle this:

  • Ignore everything except the problem you are trying to solve: JSON-LD is great, and it probably does make things easier for a lot of developers. Manu states that he never needed a quad store and SPARQL in 7+ years of working with the technology stack. Good for him, but then we obviously don’t solve the same kind of problems. That is not a problem at all, but it’s important to keep in mind when we compare technologies.
  • Reinvent the wheel: Jens Ohlig first rants about the Semantic Web and then explains for 30 minutes why Wikidata is so much work: unique identifiers, relationships between data, ontologies, provenance, multiple languages, etc. I understand that Wikidata decided against using RDF and went with what they know best, which is probably PHP & MySQL. But it doesn’t help your point if you show me that in the end you solve exactly the same kinds of problems the RDF stack already defines in W3C standards. You just build yet another data silo.
  • Not invented here: The Nepomuk project was funded by an EU FP7 research grant, and I guess that none of the people who originally worked on the RDF code are still there. The new people probably mainly know key/value stores and didn’t understand RDF or graphs. The usual reaction in this case is to throw things away and start from scratch instead of learning something that looks unfamiliar at first.
  • Accept that the world is complicated and continue working on the missing parts of the stack.

Manu Sporny:

TL;DR: The desire for better Web APIs is what motivated the creation of JSON-LD, not the Semantic Web. If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.

I fully agree, Manu, but again: there are more problems out there than the ones JSON-LD tries to address. I think Brian Sletten summarized this best in a recent post at semanticweb.com:

Fundamentally, however, I think the problem comes down to the fact that the Semantic Web technology stack gets a lot of criticism for not hiding the fact that Reality is Hard. The kind of Big Enterprise software sales that get attention promise to hide the details, protect you from complexity, to sweep everything under the rug.

[lots more good stuff]

What is the alternative? If we abandon these ideas, what do we turn to? The answer after even the briefest consideration is that there is nothing else on the table. No other technology purports to attempt to solve the wide variety of problems that RDF, RDFS, OWL, SPARQL, RDFa, JSON-LD, etc. do.

I couldn’t agree more. You can be big enough to do all this work on your own; if you are Google or Facebook that might even make sense. For everyone else: go with the standards. Even Google recommends this.

I’m glad that Manu Sporny agreed to keep JSON-LD RDF-compatible, as they solved a lot of interesting problems around JSON-LD, like graph normalization and data signing. Maybe we need more people like him who “stop making the case for it and spend [their] time doing something more useful”. But at the same time we need the people who want to bring us to the moon. I’m glad Tim Berners-Lee decided to do so more than 20 years ago when he wrote his ‘Vague, but exciting’ proposal.

Inner nerd and the Semantic Web: The gory details

What we just heard in the introduction means that the semantic web once and for all (well, at least for a while) solves the data modeling problem we face today. There is no application- or use-case-proprietary data anymore. The data describes itself, regardless of any specific application or use case. Can you imagine what this means for data re-use?

But why is re-use so important? Let me explain that with a post Tim Berners-Lee wrote in 2007:

The word Web we normally use as short for World Wide Web. The WWW increases the power we have as users again. The realization was “It isn’t the computers, but the documents which are interesting”. Now you could browse around a sea of documents without having to worry about which computer they were stored on. Simpler, more powerful. Obvious, really.

Looking back, this fits perfectly with one of the revelations I had myself one day. In the early ’90s, before I started using the web, my computer was the center of my digital universe. Everything was on my OS/2 box and I was happy. Now I have multiple devices, with data on each of them and on various sites somewhere on the Internet. Is this better? I’m not so sure, as most of these devices and sites act as small islands somewhere in the wide ocean, and there is no way to get from one island to the other. So let us quote Tim again:

[…] The Net links computers, the Web links documents. Now, people are making another mental move. There is realization now, “It’s not the documents, it is the things they are about which are important”. Obvious, really.

There are some important remarks in here: while we (or rather our brains) can make the link between things, the computer cannot. If you don’t believe me, google for something like ‘jaguar’ and try to get only the sites related to the animal. Seems to be pretty hard for Google.

Biologists are interested in proteins, drugs, genes. Businesspeople are interested in customers, products, sales. We are all interested in friends, family, colleagues, and acquaintances. There is a lot of blogging about the strain, and total frustration that, while you have a set of friends, the Web is providing you with separate documents about your friends. One in Facebook, one on LinkedIn, one in LiveJournal, one on advogato, and so on. The frustration that, when you join a photo site or a movie site or a travel site, you name it, you have to tell it who your friends are all over again. The separate Web sites, separate documents, are in fact about the same thing — but the system doesn’t know it.

The other remark is related to what Tim calls separate documents. You can consider sites like LinkedIn or Facebook separate documents in that regard. Why? Simple: those sites pervert the original design idea of the web, as they create something like a giant document or black hole which sucks data in and only opens up a few things to the outside world over proprietary APIs. Sounds like what? Right, sooo ’90s! Been there, done that, just with Microsoft products back then. Doing the same in the web browser as Web 2.0 doesn’t really make the whole thing better.

So how is the Semantic Web going to make this better? Pretty simple: the data is the API! If you describe information the semantic web way, you use RDF as the lingua franca of the web, and this, by definition, provides a universal, unambiguous way of accessing and querying it. No more lock-in, no more application- or use-case-proprietary data, but data re-use. And another thing I really love about it: as transport it uses the same foundation the web has been running on for 20 years: HTTP. Good times!
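To make “the data is the API” a bit more concrete, here is a minimal sketch using the Python rdflib library (the library choice and the DBpedia URI are just my assumptions for illustration): the same HTTP URI that identifies a thing also returns RDF about it, and the same generic query language works on whatever comes back.

    # Minimal sketch of "the data is the API": dereference an HTTP URI,
    # get RDF back, and query it locally with SPARQL.
    from rdflib import Graph

    g = Graph()
    # Content negotiation on the resource URI returns RDF instead of HTML.
    g.parse("http://dbpedia.org/resource/Tim_Berners-Lee")

    # A generic query, no application-specific API involved.
    query = """
        SELECT ?property ?value
        WHERE { <http://dbpedia.org/resource/Tim_Berners-Lee> ?property ?value }
        LIMIT 10
    """
    for prop, value in g.query(query):
        print(prop, value)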

If you want to get your hands dirty now you might check out our gentle introduction to the technology behind the semantic web.

How my inner nerd got hooked by the Semantic Web

This was supposed to be the first post for this blog, but I never published it until now. The second part is a pretty technical explanation of why I started to love the semantic web, which might also explain the subtitle of this blog ;) I’ve gotten much better at explaining it in the meantime, but I still think it makes sense to post it for historical and nerdy reasons, so here we go.

About four years ago a friend of mine and I were having dinner at my place and I tried to explain to him what we are aiming at. Our vision was still very abstract back then, but he told me that the stuff I was talking about sounded a lot like something that goes under the name Semantic Web. Some time later he gave a short presentation to our team; I remember sitting there and hearing about things like triples, SPARQL, the giant global graph and so on.

To be honest, I didn’t get it at all at first. But somehow it stuck in my head; I had the idea that this technology might indeed be a part of the puzzle we are trying to solve. A few months later it was summer and I was looking for a good excuse to sit at the lake in the sun instead of working on the computer. So I printed a bunch of papers from the W3C explaining the semantic web and its components and started reading.

I was amazed. I mean I was seriously amazed. I’ve spent quite some time in the IT business and I’ve done quite a bit of data modeling, programming and all that stuff, but what I read was sexy, super sexy (inner nerd speaking here). I still didn’t understand the whole thing yet, but it seemed like there was something out there with the potential to solve all the nasty technical problems I had sooner or later run into in the past.

So what is so sexy about it? Just as the Internet (or rather its protocol suite, TCP/IP) connected computers in a universal language in the ’70s, and the Web connected documents in the ’90s, the Semantic Web connects things in the same way. What things? Anything! Seriously!

If this sounds fair enough you can now stop reading. If you don’t believe me yet, read on.

Academics are rewarded for publishing papers

As mentioned in the last post, I participated in the information day for the FP7 call on big data a few weeks ago.

I’m not sure if netlabs.org will participate in this call; I see the need for big data, but our focus is on creating and interfacing graph-based information in smart ways, more about that in another post. However, I often use existing datasets to demo our technology, and I regularly run into issues which were mentioned by one of the FP7 coordinators as well:

Academics are rewarded for publishing papers, not for writing robust code.

Later, when they talked about making the research available as open source software, he mentioned another problem: sometimes the outcome of a project is code that compiles and works only on the computer of the PhD student who wrote it. And as we probably all agree, that is not really useful.

He also said that a good response time from the user’s perspective is a result within at most half a second, followed by the statement that if a component doesn’t provide an answer within 20 ms, it won’t become part of the technology chain – a statement from Google (IIRC) which shows that even if you are within 0.5 seconds, you are not necessarily the only part of the chain :)

Unfortunately, most or at least many of the LOD resources I use on a regular basis for demo cases are a good example of #FAIL for the “robust code” and “half a second” remarks. For example, I often use linkedgeodata.org; unfortunately the response time for a non-cached query on this site is still way too high, it usually takes several seconds.
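If you want to check such a service yourself, a rough sketch like the following is enough. I’m using Python with the requests library here; both the endpoint URL and the query are just examples, not a benchmark of any particular installation.

    # Rough sketch: time a single uncached request against a public
    # SPARQL endpoint. Endpoint and query are examples only.
    import time
    import requests

    ENDPOINT = "http://linkedgeodata.org/sparql"        # example endpoint
    QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"      # trivial example query

    start = time.time()
    response = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    elapsed = time.time() - start

    print("HTTP status:", response.status_code)
    print("Response time: %.3f seconds" % elapsed)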

Linkedgeodata is just one example; many projects do not realize that the service needs to be fast and reliable as well to make sure people will actually be able to use it in the real world. This leads to a sort of chicken-and-egg situation: as long as the experience is slow (aka sucks), people will not use a (semantic web powered) site. And as long as there are no real-world users, there is little motivation for the service provider to improve the response time.

Conclusion: if we want the semantic web to become a success, we definitely need to meet the 20 ms rule! Every service that returns RDF is just one part of the technology chain and thus just one part of the final user experience!

Big Data and the Semantic Web

I wrote this post on the train back from Luxembourg, where I participated in the ICT Call 8 Information and Networking Day: Intelligent Information Management, another EU FP7 call for research projects on big data.

The information day was pretty interesting, as I hadn’t really read up on the big data topic yet. The summary was basically that big data is when the size of the data itself is the problem. Examples included Google, which talks about 1 petabyte of data, Amazon S3 with 500 billion objects, and Walmart, which seems to process up to 100 million data sets each day.

They had quite a few RDF/semantic web related projects there, both existing ones (lod2.eu) and proposals for new ones by groups searching for partners. I was a bit confused about RDF and LOD, because although the total data size is impressive, each of the databases, like DBpedia, is not that big by itself (DBpedia is only a few hundred gigabytes). And funnily enough, I had an article on my reading list about exactly this problem at semanticweb.com: Two kinds of big data.

Rob Gonzalez makes some really good remarks in there, like the observation that there are two kinds of big data: really big data sets which need to be processed on one box/instance (vertical big data), and the semantic web, which in itself is horizontal big data.

With Horizontal Big Data (maybe HBD will start catching on!), the problem isn’t how to crunch lots of data fast.  Instead, it’s how to rapidly define a working subset of information to help solve a specific need.

That’s a really good remark, and I am curious how we will be able to solve the problem of widely distributed data. So, semantic web community, listen up: there is some money available in this EU FP7 call, and the deadline for proposals is 17 January 2012 at 17:00 (Brussels local time)!
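Coming back to Rob’s “working subset” remark, here is a hedged sketch of what that can look like in practice: instead of crunching the whole dataset, pull a small, task-specific slice out of a remote endpoint with a SPARQL CONSTRUCT query and work on it locally. I’m assuming Python with the SPARQLWrapper library and the public DBpedia endpoint; the query itself is just an example.

    # Sketch of horizontal big data in practice: extract only the slice
    # of a huge remote dataset that the task at hand needs.
    from SPARQLWrapper import SPARQLWrapper, RDFXML

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")   # example endpoint
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        CONSTRUCT { ?city a dbo:City ; dbo:populationTotal ?population }
        WHERE {
            ?city a dbo:City ;
                  dbo:country <http://dbpedia.org/resource/Switzerland> ;
                  dbo:populationTotal ?population .
        }
        LIMIT 100
    """)
    sparql.setReturnFormat(RDFXML)

    # The result is a small local graph: the working subset.
    subset = sparql.query().convert()
    print(len(subset), "triples in the working subset")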

Recommended readings (mentioned at the FP7 information day):

On the quest of explaining Semantic Web

A few weeks ago I gave my (so far) most successful presentation about the semantic web. And guess what: it worked so well because I almost did not mention it in the presentation.

An old friend of mine asked me to present the idea of the semantic web to the company she works for, at an event they call Puzzle Lunch. The idea is to present a technology to everyone interested in the company and have lunch afterwards. The time limit was one hour, which I considered practically impossible. I have given quite a lot of talks about it in the past, both to programmers and to non-technical people, and I always found it easier to explain it to someone with little or no technical background. That way I could skip the time-consuming details of RDF and related standards.

Inspired by a presentation from Bart and a long discussion with Christian the night before, I decided to drop the technical aspects completely and just try to explain how we and others use the technology. On the train to Bern I was not so sure anymore whether this was really the way to go, but in the end I decided to give it a try. I went there with only an improvised spreadsheet on which I explained the issues with list- and table-like structures. I did not have a single slide prepared.

After a warm welcome, the room filled with quite a lot of people, most of them, as expected, programmers, plus a few customers of Puzzle. I started my presentation and explained the issues with the implicit knowledge we find everywhere we use list- or table-like structures, such as Excel files or databases. I showed them how much information gets lost that way and talked about why we need unique identifiers not just for the information itself but also for the headings, the annotations, and the relationships between entries in tables. Finally I showed them how this implicit knowledge can be made visible in a graph, which I explained with very simple examples on the whiteboard.
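To give a flavour of that kind of whiteboard example, here is a small sketch in Python with rdflib (all URIs and the vocabulary are made up for illustration): a spreadsheet row keeps the meaning of its columns implicit, while the graph version gives every cell, heading and relationship its own identifier.

    # Sketch: the implicit row | Zurich | 400000 | Switzerland | becomes
    # explicit triples. All identifiers below are invented.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/vocab/")       # hypothetical vocabulary
    zurich = URIRef("http://example.org/city/Zurich")

    g = Graph()
    g.add((zurich, RDF.type, EX.City))
    g.add((zurich, EX.population, Literal(400000, datatype=XSD.integer)))
    g.add((zurich, EX.locatedIn, URIRef("http://example.org/country/Switzerland")))

    print(g.serialize(format="turtle"))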

I had already talked longer than planned for this part, so I switched to examples, stories from our customers, and use cases I had found on semanticweb.com or heard at SemTech. I explained every example based on three great points Christian and I had figured out the night before:

  • dissolve existing data silos
  • make implicit (or tacit) knowledge explicitly available
  • make it possible to store any relationship without having to design an appropriate data model upfront

The most important message, however, was that the semantic web does not offer entirely new things, but it allows us to solve common problems more quickly and, hence, at lower cost.

Of course I did briefly mention RDF as the data model, and I did talk about vocabularies and how you describe them in so-called ontologies, but I did not spend more time on it than absolutely necessary. Later we briefly talked about why we have to bring the benefits of the flexible data model in the backend to the user interface, which is exactly what we are working on at netlabs.org.

In the end I showed a few examples of our user interface technology and briefly talked about Linked Open Data and its potential. Of course I talked a bit longer than an hour, and when I decided to stop, I knew that I hadn’t talked about quite a lot of things I find tremendously exciting about the semantic web.

The feedback was just great. We had great discussions at lunch, and later I got various text messages and emails from Puzzle employees and customers who told me they loved the idea of the semantic web and the way I presented it. They were sparkling with ideas about how they could use it in their own company or at customers, and some of them want to read up on the technology and get their hands dirty.

But there was one remark that ultimately proved to me that presenting the semantic web without lots of technical details is definitely the way to go: at lunch they realized that someone else had presented RDF a few years back, but back then no one understood what it could be used for. So no matter whom you are explaining the semantic web to: show how it can solve existing problems, and don’t waste time on the technology itself. And by the way: my colleagues at netlabs.org and I would be more than happy to explain it to you as well :)

Files are not the problem, finding them is

Recently Alex Bowyer blogged at O’Reilly Radar about why files need to die. He provides some good ideas about how we will store and find information in the future, but he also misses the point a bit. Files are not really the problem; finding them is.

Over the past months I have spent a lot of time writing in the quest for money. Due to the decentralized nature of our team at netlabs.org, we realized a long time ago that working in something like a Word/OpenOffice document is not the way to go. At the peak of the wiki era we set up our own MediaWiki and started working in there. That did and does work quite well for some things, but it is not something I am going to do forever, for various reasons. So what are the benefits and drawbacks?

Benefits of working with a Wiki:

  • History. I cannot imagine working on a text anymore without a powerful history. I need to be able to see any change between version x and y at any time. If several people are working on the same text, this is the only right way to do it in my opinion. A good history in a wiki is far more powerful than tracking changes in an office document.
  • Multiple users can work on it at the same time. Yes, there are more advanced ways of doing that nowadays; some suck (Google Docs: “someone else is editing this file”), some are a bit overkill for my taste (Etherpad). If we mess up the same paragraph we get hints and have to fix it by hand. Not great, but it works most of the time.
  • I just work on the text. I might add some headings and things like bold/emphasis, but that’s about it. I do not lose any time formatting it to look good on something ridiculous like an A4 page.
  • The text itself always has the same link and I can access it from everywhere. This is what made the Web so useful.

Drawbacks of a Wiki:

  • Finding wiki pages. This problem started with (digital) files in the late ’70s, continued with email in the ’90s, and we still have it with links (technically URIs) today. There have been plenty of ideas on how to solve it, and none of them has worked for me yet.
  • Exporting that stuff. At some point I have to make a snapshot in time which I send out to someone as email/PDF/office document etc. This is the point where I start to cry, because I have to launch an office-like app to do that, and believe me, I really hate all of them (LaTeX included).

A few years ago, a former boss very proudly told me that he had rearranged the files on the file server. I had a look at it and was totally lost. Not that I liked the old structure but I really couldn’t figure out which files were where anymore. It was logical for him but my brain totally disagreed with that particular structure. He could not understand why this did not work for me.

This was one of the early lessons we learned in our team. Have a look at someone’s desktop. It might scare the sh*t out of you, but it works for that person. For a long time I had the tendency to order everything in nested structures myself, like emails or files. The result: a few months later I absolutely could not remember where I had saved that information, and I either wasted a lot of time finding it or let it go. And I noticed I let go of a lot.

So how would I want to handle knowledge in the future? I want all the benefits of a wiki, combined with a way of finding content by context. This means I don’t want to crawl through a fixed structure to locate a text I wrote, as I won’t agree with the structure or won’t remember the right way through it anymore.

My brain remembers the things around a specific event, even if it is only a simple text about a certain topic. I know it had something to do with funding, particularly for an EU FP7 project. I remember that I sent it to that guy at the University of Bern for a review. I know that my friend Barbara had a look at the English before we sent it out. I remember that I went to Bali the day after I finished the stuff. I know that one guy sent me an email about it and told me he loves the project. Note that I don’t remember the exact time of it, but the things around it.

If I want to find that text, that’s the way I want to search for it and find it. It can be a file on my disk or a link on the Internet. In that regard, a URI is nothing more than a modern form of a file, including all the problems we know. And this is part of what we are working on at netlabs.org these days.
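As a hedged sketch of what finding by context could look like if those surrounding facts were stored as triples (everything below – the identifiers, the properties, the little vocabulary – is invented for illustration, again in Python with rdflib):

    # Describe the things *around* a document as triples, then ask for
    # "the text about funding that Barbara reviewed".
    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/notes/")            # made-up vocabulary
    doc = URIRef("http://example.org/docs/fp7-proposal")

    g = Graph()
    g.add((doc, EX.topic, Literal("EU FP7 funding")))
    g.add((doc, EX.reviewedBy, URIRef("http://example.org/people/Barbara")))
    g.add((doc, EX.sentTo, URIRef("http://example.org/org/UniversityOfBern")))

    query = """
        PREFIX ex: <http://example.org/notes/>
        SELECT ?doc WHERE {
            ?doc ex:topic ?topic ;
                 ex:reviewedBy <http://example.org/people/Barbara> .
            FILTER(CONTAINS(LCASE(STR(?topic)), "funding"))
        }
    """
    for (found,) in g.query(query):
        print(found)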

schema.org: Not Too Impressive

Last Friday Google, Yahoo and Bing announced the launch of schema.org, which promotes the annotation of web pages to make them more useful for search engines. This is definitely a hot topic, as the current web of documents reached its limits a few years ago. Every search engine user knows what I am talking about.

I have been active in the semantic web world for a while now, and many people in this community were not very pleased with the decisions taken by the big three. But what is my problem with schema.org? There are quite a few, and most of them have been well addressed in other blog posts; let me recap:

  • From a complexity point of view, RDFa is the same thing as Microdata; Manu Sporny proves that by example in his blog post. The argument that RDFa is more complex than Microdata is pure nonsense.
  • RDFa is a serialization of RDF within XML/(X)HTML trees. In case you do not know RDF, Mike Bergmann calls it the universal data solvent, which captures it pretty well. RDF provides much more than Microdata and is so much more powerful. There is simply no excuse for not using RDFa in the first place.
  • There are lots of great examples out there of how to use RDFa; one of the most famous is probably GoodRelations. On the schema.org FAQ they state that their work is “inspired by earlier work like Microformats, FOAF, GoodRelations, OpenCyc, etc.”.

The last point needs some more explanation. In RDF, a shared vocabulary is described in a so-called ontology, which is most of the time expressed in RDF Schema or OWL. Such an ontology defines the wording used (the predicates), the data type of each predicate, and the relationships to other predicates. Both RDF Schema and OWL are themselves expressed in RDF, which makes it possible to bootstrap not just the data but also the shared vocabulary used for describing the data in the same format. This is big, really big!
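Here is a tiny hedged sketch of that bootstrapping (the “shop” vocabulary is invented, and I load it with Python and rdflib only to keep the examples consistent): the schema and the data that uses it live in the very same format.

    # Both the vocabulary (RDF Schema) and the instance data are RDF,
    # expressed here in Turtle. The shop vocabulary is made up.
    from rdflib import Graph

    turtle = """
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix shop: <http://example.org/shop#> .

    # The vocabulary: a class and a predicate, described in RDF itself.
    shop:Product a rdfs:Class .
    shop:price   a rdf:Property ;
                 rdfs:domain shop:Product ;
                 rdfs:range  xsd:decimal .

    # The data, using that vocabulary in the very same format.
    <http://example.org/items/42> a shop:Product ;
        shop:price 9.90 .
    """

    g = Graph()
    g.parse(data=turtle, format="turtle")
    print(len(g), "triples - schema and instance data in one graph")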

Another important aspect is that data modeled in RDF does not have to be correct or complete. If you ask 10 people to model the world, you will get 10 different results. If you express those 10 models in RDF, we will still be able to map matching things between the different models, even if they are not always exactly the same thing. RDF can handle this uncertainty, which is one of my favorite things about it.

This is an important lesson learned from failures in the 1990s, when companies like Taligent tried to model the whole world in one single library. RDF instead promotes the concept of domain experts: if you are strong in a specific domain, you should create the vocabulary for it, not some experts at Google/Yahoo/Bing who try to figure out how to squeeze the whole universe into 300 or so tags. Maybe your domain vocabulary is not fully compatible with my domain vocabulary, but that is just how the world works, and RDF can handle that by design.

So besides the technical decision not to use RDFa, this is for me the biggest fail of schema.org. They came up with a vocabulary that fits their needs. This vocabulary is not described in RDF, which makes it far less useful for machines, and it is very hard to extend it that way or interlink it with more powerful vocabularies like GoodRelations, FOAF, etc., which have already been out there for a long time. Tim Berners-Lee suggested a 5-star deployment scheme for Linked Open Data; according to that, I would probably give schema.org a 4 right now, but there is still lots of room for improvement.

Fortunately some people have already addressed the RDF part of it: with the help of some well-known people in the Semantic Web world, Michael Hausenblas created a “real” schema out of it, expressed in RDF. The results can be found at schema.rdfs.org. That is how the well-paid engineers of the three big companies should have done it in the first place. Now we can link it to DBpedia and other resources, extend it for our specific domains, and use it in RDFa or whatever RDF serialization we choose.
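A small sketch of what that extending and interlinking can look like once the vocabulary exists as RDF Schema (the exact class URIs and the equivalence statement are my assumptions for illustration, again with rdflib):

    # Extend the shared vocabulary for a specific domain and link it to
    # other vocabularies. Class URIs are assumptions for illustration.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import OWL, RDFS

    SCHEMA = Namespace("http://schema.org/")
    MY = Namespace("http://example.org/myvocab#")   # hypothetical domain vocabulary

    g = Graph()
    # Extend: a domain-specific class builds on the generic one.
    g.add((MY.JazzClub, RDFS.subClassOf, SCHEMA.MusicVenue))
    # Interlink: state that two classes in different vocabularies describe
    # (roughly) the same concept.
    g.add((SCHEMA.Museum, OWL.equivalentClass,
           URIRef("http://dbpedia.org/ontology/Museum")))

    print(g.serialize(format="turtle"))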

I am not sure where schema.org is going. RDFa co-creator Manu Sporny is pessimistic about the current state, while others like Mike Bergmann are very optimistic and think it is one of the most important steps in the semantic web world so far. I think the RDF Schema version of the vocabulary is a first step in the right direction, but I am afraid that the decision for Microdata will seriously harm the adoption of RDFa as a standard. This should be changed as soon as possible! Let us see what Manu Sporny and others will present in the next few days or weeks. By the way, there is also a session at this year’s SemTech about it.

So what is schema.org currently? A step back in terms of the technology used, plus a vocabulary that does not follow the intentions the semantic web world has been pursuing for several years now. Not too impressive.