All posts by Christian Langanke

Data quality counts

When linking with other data sources, knowing the origin and the quality level of external data is without a doubt important. After all, a key point of using semantic web technology is to use other data sources to enrich our own data and thereby enhance our own solutions.

People who are new to RDF and have never or rarely dealt with the task of interlinking with foreign data sources may now think that this is a problem that only comes with RDF. On the contrary! Data retrieved from any other system may be right or wrong, and may be complete or incomplete. The RDF concept even states literally that RDF-based data is by definition neither correct nor complete – well, at least as long as you are not the one ensuring that. But this problem only gets bigger the more external data sources you access, and once you start to use semantic web technology to do so, it will simply happen more and more often.

This raises questions such as: how reliably do you know from which source a given set of data originates, how much do you know about the provider of that data source, how accurate is the data model, and how much can you trust the quality of data maintenance? Obviously, the answers to these questions have a direct impact on how much you can benefit from interlinking with the given data. If the quality of external data is not ensured, the quality of your solution will suffer in turn.

So if you link to external data yourself, you will have to pay close attention to the quality of the data sources you use, in order to avoid any negative impact on the quality of your own products and services.

Does RDF fit into my architecture?

When software engineers come into contact with the world of semantic data for the first time, it may be difficult for them to grasp all the benefits and consequences at once. When I heard of RDF for the first time, I thought this concept might impose more risks than benefits. So how would RDF fit into the architecture of my solutions? Or better: why did I feel at first that it might not fit? In case you have or have had similar feelings, I would like to share my thoughts on that with you.

I think that back then I felt quite uncertain because I would have to let go of all the familiar ways of accessing structured data. At first I even considered RDF-based data to be stored with no real structure, but of course that is not true. Later, I felt that modeling data with literally any vocabulary would cause lots of problems and prevent applications from being able to deal with data from anybody else outside, but of course that is not true either.

After some time of dealing with RDF and working with it in practice, I understood more and more that these points are not a problem at all; instead, RDF shows its beauty and flexibility by letting me do things just differently.

First of all, it is not that RDF data is poorly structured. Instead, it is structured by semantics – and by semantics only!

In fact, the vocabulary used within the RDF concept provides the most flexible and at the same time a very precise and reliable way to describe and structure data in a machine-readable way. The vocabulary can be uniquely identified, so that the meaning of the described data can always be determined exactly. And with RDF you put the knowledge about the structure of data into the data itself, while with other concepts you have to put it mostly into (self-)developed software. This is why RDF is said to compare to other storage concepts like knowledge engineering compares to software engineering.
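To make this a bit more concrete, here is a minimal sketch using the Python library rdflib (the resource names and the example.org namespace are made up for illustration): a few statements about a person are stored as plain triples, and the FOAF vocabulary that gives them their meaning travels inside the data itself as globally unique URIs.

    # Minimal sketch with rdflib; example.org resources are hypothetical.
    from rdflib import Graph, Namespace, Literal

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")       # a real, published vocabulary
    EX = Namespace("http://example.org/people/")          # made-up namespace for this sketch

    g = Graph()
    g.add((EX.alice, FOAF.name, Literal("Alice")))
    g.add((EX.alice, FOAF.knows, EX.bob))
    g.add((EX.bob, FOAF.name, Literal("Bob")))

    # Any consumer can inspect which properties describe EX.alice directly from
    # the data, without a table layout or schema that lives outside the data.
    for predicate, obj in g.predicate_objects(subject=EX.alice):
        print(predicate, obj)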

Using semantics instead of conventional ways of structuring data, e.g. tables and hierarchies, has another important advantage: scalability in complexity. RDF data can grow more complex without breaking any existing logic and without requiring extra effort to keep more complex data models performing well. The same is not true for tables and hierarchies; they do not scale well in complexity at all.

Second, it is not that one needs a restricted set of vocabulary to keep data usable for oneself or anybody else. Instead, RDF encourages the use of any existing vocabulary in order to increase the reusability of data. In fact, I can model a given set of data with any number of different sets of vocabulary at the same time, and that without much overhead or redundancy (try to do that with SQL-based data sources…).
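As a small illustration of that point (again with rdflib, and again with made-up example.org resources), the following sketch describes one and the same resource with FOAF and schema.org terms side by side; adding the second vocabulary is just a matter of adding more triples, with no schema migration and no structural overhead.

    # One resource, two vocabularies at the same time; names are illustrative.
    from rdflib import Graph, Namespace, Literal

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    SCHEMA = Namespace("https://schema.org/")
    EX = Namespace("http://example.org/people/")   # hypothetical example namespace

    g = Graph()
    # The FOAF view of the data ...
    g.add((EX.alice, FOAF.name, Literal("Alice")))
    # ... and a schema.org view of the very same resource, added as further triples.
    g.add((EX.alice, SCHEMA.name, Literal("Alice")))
    g.add((EX.alice, SCHEMA.jobTitle, Literal("Engineer")))

    # Consumers that only understand one of the two vocabularies still find the data.
    print(list(g.objects(EX.alice, FOAF.name)))
    print(list(g.objects(EX.alice, SCHEMA.name)))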

If you restricted your solutions to a limited set of vocabulary, it would not give you any extra benefit. Instead, you would restrict your services to those RDF-based data sources that use the very same set of vocabulary as you do. And that would be much more of a restriction in the future than you can imagine today.

However, even if you use existing vocabulary as much as possible, you may still need to interlink your data with data from a source that describes the same things but uses different vocabulary. Then you simply need a way to translate between the different sets of vocabulary. Because vocabularies are themselves described semantically, this can be done in a generic manner. In the easiest case this translation would take place in the back end, that is, in your triple store engine, making this step completely transparent to the services using the store. At the time of this writing, however, no database product seems to be able to do that. As an alternative, such a translation could be encapsulated in a framework through which your services access RDF data sources, including your own. netlabs.org is currently working on a technique that allows interlinking between sets of vocabulary with a flexible way of defining such a translation, of course based on RDF data as well.
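Just to illustrate the general idea (this is not the netlabs.org technique, only one conceivable sketch): the mapping between two vocabularies can itself be expressed in RDF, for example with owl:equivalentProperty, and a generic SPARQL update can then materialize the translated statements, so that queries using either vocabulary find the data.

    # Sketch of a generic, data-driven vocabulary translation with rdflib.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import OWL

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    SCHEMA = Namespace("https://schema.org/")
    EX = Namespace("http://example.org/people/")   # hypothetical example namespace

    g = Graph()
    g.add((EX.alice, FOAF.name, Literal("Alice")))
    g.add((FOAF.name, OWL.equivalentProperty, SCHEMA.name))   # the mapping is data, too

    # Generic rule: copy every statement over to the equivalent property.
    # (For brevity only one direction of the equivalence is handled here.)
    g.update("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        INSERT { ?s ?q ?o }
        WHERE  { ?s ?p ?o . ?p owl:equivalentProperty ?q . }
    """)

    # The same fact is now reachable through the schema.org vocabulary as well.
    print((EX.alice, SCHEMA.name, Literal("Alice")) in g)   # True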

Why tables and hierarchies don’t scale

Maintaining data is a challenge, and there are several things that can make maintenance a real pain. Many of these problems arise from the patterns that we use to store our data, as they have a big impact on how easily we can retrieve stored data later, be it for viewing, for cleaning up obsolete data, or for other tasks.

Before the computer age, people stored data simply by writing it down. Lists and tables were invented to make retrieval of stored data much easier; just think of birth and wedding registries. Besides that, hierarchies were used, e.g. in natural science, to display relations between unequal and similar things at the same time. Both methods implied a limit: lists either needed to be short enough to remain usable, or there had to be a suitable scheme to split them up into smaller parts. Hierarchies could not grow too big either, otherwise it would have been impossible to display them on a single sheet of paper, or at least on a small number of sheets.

Surprisingly, even since the invention of computers, the concepts of tables and hierarchies still dominate the way we store and retrieve data. Tables are used in the majority of database management systems as well as in spreadsheets, and hierarchies are used in file systems and applications to store data. And although we might expect that with a computer we should have no problem storing large amounts of data, somehow the limits of the pre-computer age still apply. This happens wherever people need to access data, or have to design tables or hierarchies to suit a specific use case. Many problems arise from that, but mostly the storage patterns are either not recognized as the underlying reason, or the problems are accepted as irrevocable.

Data held in a table, be it in a spreadsheet or a database, still must not grow too big or too complex, otherwise it cannot be stored in one table of reasonable size and/or complexity. In that case people can no longer use spreadsheets or other kinds of simple table views, but require use-case-specific applications for accessing and visualizing the data. A more important drawback, which applies to tables of all sizes, is that they are not very suitable for data exchange, as either the meaning or the formatting of data items can be misinterpreted.

Wherever users create hierarchies, e.g. in file systems or within applications, they face the problem that the hierarchy strongly depends on the logic put into it by one person or a group of persons. With increasing complexity it becomes more and more difficult, if not impossible, to extend it without breaking this logic.

Semantic web technology is a true game changer in that regard. Among other advantages, it comes with the ability to link things with any number of other things, instead of interlinking documents as the world wide web does, or linking one set of things with another set of things, as tables do. Because it uses the most fine-granular relation possible, relations do not have to fit into any larger logical system that would have to be designed for a given use case. And semantic data is not required to be stored in a hierarchy, so there is no risk of implementing today a boundary of tomorrow. As a result, the storage pattern is completely nonspecific to any use case and can be scaled to any size and complexity.

Of course, retrieval of data stored like this is not bound to a use case either, as no knowledge about tables or hierarchies is required. Instead, data is retrieved by querying relations between things, which is far more intuitive. Interestingly, this storage and retrieval pattern matches exactly how we memorize things, namely simply by association between things! Or do you open a table or a directory structure in your mind to remember what you had for lunch yesterday?
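As a small sketch of what such retrieval looks like in practice (resource names are again illustrative), a SPARQL query simply follows the "knows" relation between things, without any knowledge of how or where the triples are physically stored.

    # Retrieval by relations instead of tables or hierarchies, using rdflib.
    from rdflib import Graph, Namespace, Literal

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    EX = Namespace("http://example.org/people/")   # hypothetical example namespace

    g = Graph()
    g.add((EX.alice, FOAF.name, Literal("Alice")))
    g.add((EX.alice, FOAF.knows, EX.bob))
    g.add((EX.bob, FOAF.name, Literal("Bob")))

    # "Whom does Alice know?" -- expressed as a relation, not as a table lookup.
    results = g.query("""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        PREFIX ex:   <http://example.org/people/>
        SELECT ?name WHERE { ex:alice foaf:knows ?friend . ?friend foaf:name ?name . }
    """)
    for row in results:
        print(row.name)   # -> "Bob"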

This does not mean, however, that tables and hierarchies are no longer required. Data needs to be displayed in order to create, view and modify it in the front end, and therefore we still need the tables or hierarchies we are used to – you can regard any form as a hierarchical way of displaying data. And whatever is required for that is already part of the data, because another, most important feature of semantic web technology is that the description of the data is part of the data itself.

However, for that purpose applications need to apply the concept of tables and hierarchies only to a small part of the available data, so there is no scaling problem. At the same time, the logical storage of data scales like never before, not hindered by any schemes that are otherwise only required for the visualization of data.