Tag Archives: linked_data

SLA Conference in Chicago

Last month I had a wonderful time at the SLA (Special Libraries Association) conference in Chicago. I had never previously been to an SLA conference, even though there is a lively SLA Europe division. SLA is very keen to be seen as “not just for librarians”, and the conference certainly spanned a vast range of information professions. The Taxonomy Division is thriving, and there seem to be far more American than British taxonomists, which, although not surprising, was a pleasure as I don’t often find myself one of a crowd! The conference had a plethora of receptions and social events, including the “legendary” IT division dance party.

There were well over 100 presentation sessions, as well as divisional meetings, panel discussions, and networking events that ranged from business breakfasts to tours of Chicago’s architectural sights. There was plenty of scope to avoid or embrace the wide range of issues and areas under discussion and I focused on taxonomies, Linked Data, image metadata, and then took a diversion into business research and propaganda.

I also thoroughly enjoyed the vendor demonstrations, especially Blekko, the editorially curated and spam-free search engine; the legal information vendors FastCase and Law360; and EOS library management systems.

My next posts will cover a few of the sessions I attended in more detail. Here’s the first:

Adding Value to Content through Linked Data

Joseph Busch of Taxonomy Strategies offered an overview of the world of Linked Data. The majority of Linked Data available in the “Linked Data Cloud” is US government data, with Life Sciences data in second place, which reflects the communities that are willing and able to make their data freely and publicly available. It is important to keep in mind the distinction between concept schemes – Dublin Core, FOAF, SKOS, which provide structures but no meanings – and semantic schemes – taxonomies, controlled vocabularies, ontologies, which provide meanings. Meanings are created through context and relationships, and many people assume that equivalence is simple and association is complex. However, establishing whether something is the “same” as something else is often far more difficult than simply asserting that two things are related to each other.
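To make that distinction concrete, here is a minimal sketch of my own (using Python’s rdflib, which the talk did not mention): SKOS supplies the structural properties, while the concepts and the choice of relationship carry the meaning, and the contrast between a cautious skos:related and a bold skos:exactMatch shows why equivalence is the harder claim.

```python
# Minimal sketch (not from the talk): SKOS supplies the structure
# (related/exactMatch), while the chosen terms and assertions carry the meaning.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import SKOS, RDF

EX = Namespace("http://example.org/vocab/")  # hypothetical namespace
g = Graph()
g.bind("skos", SKOS)

for concept in (EX.Taxonomy, EX.Ontology, EX.ControlledVocabulary):
    g.add((concept, RDF.type, SKOS.Concept))

# Association ("related") is a weak, easy claim to make...
g.add((EX.Taxonomy, SKOS.related, EX.Ontology))

# ...whereas equivalence ("exactMatch") asserts that two things are the same
# concept, which is a much stronger and riskier claim.
g.add((EX.ControlledVocabulary, SKOS.exactMatch,
       URIRef("http://example.org/other-vocab/controlled-vocabulary")))

print(g.serialize(format="turtle"))
```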

Many people also fail to use the full potential of their knowledge organization work. Vocabularies are tools that can be used to help solve problems by breaking down complex issues into key components, giving people ways of discussing ideas, and challenging perceptions.

The presentation by Joel Richard, web developer at the Smithsonian Libraries, focused on their botanic semantic project – digitizing and indexing Taxonomic Literature II. (I assume they have discussed taxonomies of taxonomy at some point!) This is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. The International Association for Plant Taxonomy (IAPT) granted permission to the Smithsonian to release the work on the web under an open licence.

The books were scanned using OCR, which produced 99.97% accuracy. That sounds impressive, but it actually means 5,000-12,000 errors – far too many for serious researchers. Errors in general text were less of a concern than errors in citations and other structured information, where, for example, mistaking an 8 for a 3 could be very misleading. After some cleanup work, the team next identified terms such as names and dates that could be parsed and tagged, and selected sets of pre-existing identifiers and vocabularies. They are continuing to look for ontologies that may be suitable for their data set. Other issues to think about are software and storage. They are using Drupal rather than a triplestore, but are concerned about scalability, so are trying to avoid creating billions of triples to manage.
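As a rough back-of-envelope of my own (the character counts are illustrative, not figures from the talk), a 99.97% character-level accuracy still scales up to thousands of errors across a work of this size:

```python
# Back-of-envelope arithmetic (illustrative figures, not from the talk):
# 99.97% character accuracy means a 0.03% error rate, which over tens of
# millions of characters still produces thousands of errors.
error_rate = 1 - 0.9997                       # 0.03% of characters misread

for total_chars in (17_000_000, 40_000_000):  # plausible sizes for a large multi-volume set
    expected_errors = total_chars * error_rate
    print(f"{total_chars:>11,} characters -> ~{expected_errors:,.0f} OCR errors")

# ~17 million characters -> ~5,100 errors
# ~40 million characters -> ~12,000 errors
```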

Joel also outlined some of the benefits of using Linked Data, gave some examples of successful projects, and provided links to further resources.

The role of classification and ontology on the Web – UDCC Seminar 2011

This post is the second in a series about the UDC consortium international seminar in The Hague, 19-20 September, 2011.

Knowledge Organization Systems (KOSs) as hubs in the Web of Data

In a minor change of schedule, Thomas Baker from the DCMI talked about some of the practical issues with using Linked Data. Provenance data can be recorded as additional information, but it is neither standardised nor an integral part of RDF; this is a growing concern that is now receiving attention from the W3C. URI persistence and alignment remain concerns for data management and governance.

Aligning web vocabularies

Guus Schreiber also dealt with the problem of making sure we are all talking about the same thing when we try to align our vocabularies. He called for ontologists to be modest about what they can achieve and not to try to hide the problems that occur when you try to transfer an ontology from one domain to another. Errors typically occur due to failures to notice subtle differences between domains.

Vocabulary alignment is a complex business that requires a lot of intellectual effort, and multiple techniques should be used to reinforce and support each other. It is much better to map small vocabularies to large ones that can then act as “pivots”.
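A small sketch of my own to illustrate the “pivot” idea: rather than mapping every small vocabulary directly to every other one, each is mapped once to a large vocabulary, and the small-to-small alignments are derived by composing those mappings.

```python
# Illustrative sketch of the "pivot" pattern: two small vocabularies are each
# mapped to one large vocabulary, and alignments between the small vocabularies
# are derived by composing those mappings. All terms here are hypothetical.

small_a_to_pivot = {"moggy": "CAT", "pooch": "DOG"}
small_b_to_pivot = {"felis catus": "CAT", "canis familiaris": "DOG"}

pivot_to_small_b = {v: k for k, v in small_b_to_pivot.items()}

# Derived alignment: small vocabulary A -> small vocabulary B, via the pivot.
a_to_b = {
    term_a: pivot_to_small_b[pivot]
    for term_a, pivot in small_a_to_pivot.items()
    if pivot in pivot_to_small_b
}

print(a_to_b)  # {'moggy': 'felis catus', 'pooch': 'canis familiaris'}
```

With N small vocabularies this needs only N mappings to the pivot rather than N(N-1)/2 pairwise alignments, which is where the saving in intellectual effort comes from.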

There is still no adequate methodology for evaluating alignments nor for mediating consensus between observers. Perhaps there should be a way of recording the strength of consensus and the presence of disagreements and alternative views.

Classification, Collaboration and the Web of Data

Dan Brickley described three types of graph – the hypertext graph of the Internet’s links between documents, the social graph of links between people, and the factual graph of links between data. Currently Linked Data is bringing together the hypertext and factual graphs, and another step would be to add in the social dimension.

He called for a focus on what the various tools can actually do, to be wary of over-evangelical ontologists, and to remember that subject classifications are strong and robust tools that are more appropriate for many types of work than ontologies.

He said that you could expect Linked Data to solve about a third of your information linking problems.

IASA Conference 2011: Turning archives into assets

Semantic enrichment

Guy Maréchal continued the Linked Data theme by talking in more detail about how flat data models can be semantically enriched. He pointed out that if you have good structured catalogue records, it takes very little effort to give concepts URIs and to export this data as sets of relationships. This turns your database into a graph, ready for semantic search and querying.
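As a sketch of my own (the field names and URIs are hypothetical), this is roughly what “giving concepts URIs and exporting the data as sets of relationships” looks like for a single flat catalogue record:

```python
# Illustrative sketch: a flat catalogue record becomes a set of
# subject-predicate-object relationships once each concept has a URI.
# Field names and URIs are hypothetical.

record = {
    "id": "rec-00042",
    "title": "Interview with a piano-roll restorer",
    "creator": "Jane Example",
    "subject": "Piano rolls",
}

BASE = "http://example.org/"

def uri(kind, value):
    """Mint a simple, stable URI for a concept (naive slugging, for illustration only)."""
    return f"{BASE}{kind}/{value.lower().replace(' ', '-')}"

triples = [
    (uri("record", record["id"]), f"{BASE}prop/title",   record["title"]),
    (uri("record", record["id"]), f"{BASE}prop/creator", uri("person", record["creator"])),
    (uri("record", record["id"]), f"{BASE}prop/subject", uri("concept", record["subject"])),
]

for s, p, o in triples:
    print(s, p, o)
```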

He argued that “going to semantics cannot be avoided” and that “born digital” works will increasingly be created with semantically modelled metadata.

From Mass Digitisation to Mass Content Enrichment

The next talk was a description of the Sonuma digitisation and metadata enhancement project. Sonuma and Memnon Archiving Services have been working on inventories and dictionaries to help them index audiovisual assets. They have been converting speech to text, holding the text as XML files, and then associating sections of the XML with the appropriate point in the AV content, so that it can be searched.

They identify breaks in programmes by reading the time stamps using OCR techniques and then looking for jumps in the numerical sequence; a jump in the numbers is assumed to mark a break between programmes. This enables them to break up long tapes into sections, which usually correspond to programmes.
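Here is a minimal sketch of my own reconstruction of that idea (not Sonuma’s actual code): once the OCR’d timecodes are reduced to numbers, a break is assumed wherever consecutive values jump by more than a threshold.

```python
# Illustrative sketch of the break-detection idea described above: OCR'd tape
# timecodes (here simplified to seconds) usually advance steadily, so a large
# jump between consecutive values is assumed to be a programme break.

def find_breaks(timecodes, max_gap=5):
    """Return indices where the gap between consecutive timecodes exceeds max_gap."""
    breaks = []
    for i in range(1, len(timecodes)):
        if timecodes[i] - timecodes[i - 1] > max_gap:
            breaks.append(i)
    return breaks

# Hypothetical sequence: steady ticks, then jumps where new programmes start.
ocr_timecodes = [0, 1, 2, 3, 4, 1800, 1801, 1802, 3600, 3601]
print(find_breaks(ocr_timecodes))  # [5, 8]
```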

Social networking and Knowledge Management

Tom Adami described Knowledge Management projects at the United Nations Mission in Sudan (Best Practice Lessons Learnt: How the Exit Interview and Oral History Project at UNMIS is building a knowledge database). The UN in Africa faces problems of high staff turnover, remote locations, and difficulties in maintaining infrastructure. However, they have been using social networking to encourage people to share their knowledge and experience in a user-friendly way and so add to the official knowledge base.

Archive as a social media lab: Creative dissemination of digital sound and audiovisual collections

Budhaditya Chattopadhyay talked about a project to bring together archival practice, artistic practice, and social media. He also referred to the problems of preserving social media, which is in essence ephemeral but may be an integral part of an artwork.

IASA Conference 2011: Keynote speech on Linked Open Data

Kevin Bradley, IASA president, gave the welcome address to the 42nd IASA annual conference. He characterised the digital revolution as one that will continue reverberating for years. He reminded us that it is not always easy to sort the sense from the nonsense and that we are often surprised by what turns out to be valid and how easy it is not to see the wood for the trees – or perhaps to “lose the word to the bits”, or the “picture to the pixels”.

Keynote address – on Linked Open Data

The keynote speech was given by Ute Schwens, deputy director of the Deutsche Nationalbibliothek / German National Library (DNB). She opened with a lovely visualisation by the Opte Project of various routes through a portion of the Internet. It looked a bit like visualisations of neurons in a brain or stars and galaxies.

Ute’s talk was in support of publishing Linked Open Data. She outlined some of the concerns – lack of money, open access versus intellectual property rights, poor quality of data and assets themselves, and inadequate legal frameworks. She said that we shouldn’t be trying to select for digitisation, because everything in an archive has already been selected or it wouldn’t have been kept in the first place. She also highlighted the benefits of making digital versions of unique or fragile artefacts, in order to allow access without risk to the original. She talked about how there are many ways to digitise and that these produce different versions, so an archival master that is as close to the original as possible should always be preserved.

She used as an illustration original piano rolls. These can only be played on very specialised electrical pianos and users cannot practically be given access to them directly, but they were played by specialists and the music recorded, so users can be given access to that. The recordings are not the same as the piano rolls, but are a new and interesting product. It seems obvious that you would not destroy the original piano rolls simply because the music from them had been recorded and now exists in a digital version, so why should you destroy other forms of media such as film, simply because you have a digital version? The digital version in such cases is for access, not preservation.

One fear is that free access to information will diminish usage of an archive or library, but by opening up you can gain new users, especially by providing free access to catalogues and metadata (I like to think of these as “advertising” – shops make their catalogues freely available because they see them primarily as marketing tools).

Another fear is loss of control, but new scientific ideas often arise when diverse strands of thought are brought together and unexpected uses are made of existing data. The unusual and the unforeseen is often the source of the greatest innovation.

She pointed out that we have drafted searching and indexing rules over the centuries to try to make objects as findable as possible, so Linked Open Data is merely the next logical step. We can combine automatically generated information with data we already have to provide multiple access points. We need description to put objects into context, but we don’t have to describe what they look like in the ways that we used to for catalogues. Good metadata is metadata that is useful for users, not metadata merely for maintaining catalogues.

She ended by calling for more open access to data as a way to promote our collections and their value, adding that in uncertain times, our only security is our ability to change.

In the discussion afterwards, she said that Google needs our data and the best way to engage with – and even influence – Google is by gaining recognition as a valued supplier and making sure Google understands how much it needs us to provide it with good quality data.

The conference was hosted by Deutsche Nationalbibliothek / German National Library (DNB), Hessischer Rundfunk / Hessian Broadcasting (hr), and the Deutsches Rundfunkarchiv / German Public Broadcasting Archives (DRA). The sponsors were EMC2 (gold); Memnon Archiving Services, NOA audio solutions, and Arvato digital services (Bertelsmann) (silver); and Cedar audio, Cube-tec International, Front Porch Digital, and Syylex Digital Storage (bronze).

UK Archives Discovery Forum

I very much enjoyed the UKAD UK Archives Discovery Forum event at the National Archives. There were three tracks as well as plenary sessions, so I couldn’t attend everything.

Linked Data and archives

After an introduction from Oliver Morley, John Sheridan opened by talking about the National Archives and Linked Data. Although his talk was not as detailed as the one he gave at the Online Information Conference last December, he still gave the rallying call for opening up data and spoke of a “new breed” of IT professionals who put the emphasis on the I rather than the T. He spoke about Henry Maudslay, who invented the screw-cutting lathe, which enabled the standardisation of nuts and bolts and, in effect, made the industrial revolution possible. Previously, all nuts and bolts were made individually as matching pairs, but because the process was manual, each pair was unique and not interchangeable. If you lost the bolt, you needed a new pair. This created huge amounts of management and cataloguing of individual pairs, especially if a machine had to be taken apart and re-assembled, and meant interoperability of machinery was almost impossible. Sheridan asserted that we are at that stage with data – all our data ought to fit together, but at the moment all the nuts and bolts have to be hand crafted. Linked Data is a way of standardising so that we can make our data interchangeable with other people’s. (I like the analogy because it makes clear the importance of interoperability, but obviously getting the nuts and bolts to fit is only a very small part of what makes a successful machine, let alone a whole factory or production line. Similarly, Linked Data isn’t going to solve broad publishing or creative and design problems, but it makes those big problems easier to work on collaboratively.)

Richard Wallis from Talis spoke about Linked Data. He likes to joke that you haven’t been to a Linked Data presentation unless you’ve seen the Linked Open Data cloud diagram. My version is that you haven’t been to a Linked Data event unless at least one of the presenters was from Talis! Always an engaging speaker, his descriptions of compartmentalisation of content and distinctions between Linked Data, Open Data, and Linked Open Data were very helpful. He likes to predict evangelically that the effects of linking data will be more profound for the way we do business than the changes brought about by the web itself. When I chatted to him over tea, he said his impression was that a year ago people were curious about Linked Data and just wanted to find out what it could do, but that this year they are feeling a bit more comfortable with the concepts and are starting to ask how they can put them into practice. There certainly seemed to be a lot of enthusiasm in the archive sector, which is generally cash-strapped but highly co-operative, with a lot of people passionate about their collections and their data and eager to reach as wide an audience as possible.

A Vision of Britain

Humphrey Southall introduced us to A Vision of Britain, which is a well-curated online gazetteer of Britain, with neat functions for providing alternative spellings of placenames, and ways of tackling the problems of boundaries, especially of administrative divisions, that move over time. I’m fascinated by maps, and they have built in some interesting historical map functionality too.

JISC and federated history archives

David Flanders from JISC talked about how JISC and its Resource Discovery Task Force can provide help and support to educational collections especially in federation and Linked Data projects. He called on archives managers to use hard times to skill up, so that when more money becomes available staff are full of knowledge, skills, and ideas and ready to act. He also pointed out how much can be done in the Linked Data arena with very little investment in technology.

I really enjoyed Mike Pidd’s talk about the JISC-funded Connected Histories project. They have adopted a very pragmatic approach to bringing together various archives and superimposing a federated search system based on metadata rationalisation. Although all they are attempting in terms of search and browse functionality is a simple set of concept extractions to pick out people, places, and dates, they are having numerous quality control issues even with those. However, getting all the data into a single format is a good start. I was impressed that one of their data sets took 27 days to process and that they still take delivery of data on drives through the post. They found this much easier to manage than FTP or other electronic transfer, simply because of the terabyte volumes involved (something that many people tend to forget when scaling up from little pilot projects to bulk processes). Mike cautioned against using RDF and MySQL as processing formats: they found that MySQL couldn’t handle the volumes, and RDF they found too “verbose”. They chose a pure Lucene solution instead, which enabled them to bolt in new indexes rather than reprocess whole data sets when they wanted to make changes. They can still publish out to RDF.
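To give a flavour of how simple such concept extraction can be – and why quality control still bites – here is an illustrative sketch of my own for pulling dates out of digitised text (nothing to do with the project’s actual code):

```python
# Illustrative sketch of simple date extraction from digitised text, of the
# kind of concept extraction mentioned above; real projects face far messier
# OCR output than this tidy example.
import re

YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")   # four-digit years, 1500-2099

def extract_years(text):
    """Return the distinct years mentioned in a passage, in order of appearance."""
    seen = []
    for match in YEAR.finditer(text):
        year = match.group(0)
        if year not in seen:
            seen.append(year)
    return seen

sample = "Tried at the Old Bailey in 1780; the record was reprinted in 1884 and 1780."
print(extract_years(sample))  # ['1780', '1884']
```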

Historypin

Nick Stanhope enchanted the audience with Historypin, an offering from wearewhatwedo.org. Historypin allows people to upload old photos – and soon also audio and video – and set them in Google Street View. Although Flickr has some similar functions, Historypin has volunteers who help to place each image in exactly the right spot, and Google have been offering support and are working on image recognition techniques to help place photos precisely. This allows rich historical street views to be built up. What impressed me most, however, was that Nick made a distinction between subjective and objective metadata: his definition is that objective metadata can be corrected and subjective metadata can’t. So, he sees objective metadata as the time and the place that a photo was taken – if it is wrong, someone might know better and be able to correct it – and subjective metadata as the stories, comments, and opinions that people have about the content, which others cannot correct: if you upload a story or a memory, no-one else can tell you that it is wrong. We could split hairs over this definition, but the point is apposite when it comes to provenance tracking. He also made the astute observation that people very often note the location that a photo is “of”, but it is far more unusual for them to note where it was taken “from”. However, where it was taken from is often more useful for augmented reality and other applications that try to create virtual models or images of the world. Speaking to him afterwards, I asked about parametadata, provenance tracking, and so on, and he said these are important issues they are striving to work through.

Women’s history

Theresa Doherty from the Women’s Library ended the day with a call to stay enthusiastic and committed despite the recession. She pointed out that it is an achievement that archives are still running despite the cuts, which shows how valued data and archives are in the national infrastructure and how important recording our history is, and that as long as archivists continue to value their collections, enjoy their visitors and users, and want their data to reach a wider audience, the sector will continue to progress. She described how federating the Genesis project within the Archives Hub had boosted use of their collections, but pointed out that funders of archives need to recognise that online usage of collections is just as valid as getting people to physically turn up. At the moment funding is typically allocated on visitor numbers through the doors, which puts too much emphasis on trying to drag people in off the street at the expense of trying to reach a potentially vast global audience online.

Online Information Conference – day two

Linked Data in Libraries

I stayed in the Linked Data track for Day 2 of the Online Information Conference, very much enjoying Karen Coyle’s presentation on metadata standards – FRBR, FRSAR, FRAD, RDA – and Sarah Bartlett’s enthusiasm for using Linked Data to throw open bibliographic data to the world so that fascinating connections can be made. She explained that while the physical sciences have been well mapped and a number of ontologies are available, far less work has been done in the humanities. She encouraged humanities researchers to extend RDF and develop it.

In the world of literature, the potential connections are infinite and very little numerical analysis has been done by academics. For example, “intertextuality” is a key topic in literary criticism, and Linked Data that exposes the references one author makes to another can be analysed to show the patterns of influence a particular author had on others. (Google ngrams is a step in this direction, part index, part concordance.)
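As a toy illustration of my own (using networkx, and invented references), once author-to-author citations are exposed as data, questions of influence become simple graph queries:

```python
# Toy sketch of the intertextuality idea above: if author-to-author references
# are exposed as data, influence becomes a graph question. The authors and
# edges here are purely illustrative.
import networkx as nx

g = nx.DiGraph()
references = [
    ("Woolf", "Shakespeare"),
    ("Joyce", "Shakespeare"),
    ("Joyce", "Homer"),
    ("Woolf", "Austen"),
]
g.add_edges_from(references)   # an edge means "references / alludes to"

# Authors most referenced by others, i.e. the most "influential" in this toy set.
influence = sorted(g.in_degree(), key=lambda pair: pair[1], reverse=True)
print(influence)  # [('Shakespeare', 2), ('Homer', 1), ...]
```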

She stressed that libraries and librarians have a duty of care to understand, curate, and manage ontologies as part of their professional role.

Karen and Sarah’s eagerness to make the world a better place by making sure that the thoughtfully curated and well-managed bibliographic data held by libraries is made available to all was especially poignant at a time when library services in the UK are being savaged.

The Swedish Union Catalogue is another library project that has benefited from a Linked Data approach. With a concern to give users more access to and pathways into the collections, Martin Malmsten asked if APIs are enough. He stressed the popularity of just chucking the data out there in a quick and dirty form and making it as simple as possible for people to interact with it. However, he pointed out that licences need to be changed and updated, as copyright law designed for a print world is not always applicable for online content.

Martin pointed out that in a commercialised world, giving anything away seems crazy, but that allowing others to link to your data does not destroy your data. If provenance (parametadata) is kept and curated, you can distinguish between the metadata you assert about content and anything that anybody else asserts.

During the panel discussion, provenance and traceability – which the W3C is now focusing on (parametadata) – was discussed and it was noted that allowing other people to link to your data does not destroy your data, and often makes it more valuable. The question of what the “killer app” for the semantic web might be was raised, as was the question of how we might create user interfaces that allow the kinds of multiple pathway browsing that can render multiple relationships and connections comprehensible to people. This could be something a bit like topic maps – but we probably need a 13-year-old who takes all this data for granted to have a clear vision of its potential!

Tackling Linked Data Challenges

The second session of day two was missing Georgi Kobilarov of Uberblic who was caught up in the bad weather. However, the remaining speakers filled the time admirably.

Paul Nelson of Search Technologies pointed out that Google is not “free” to companies, as they pay billions in search engine optimisation (SEO) to help Google. Google is essentially providing a marketing service, and companies are paying huge amounts trying to present their data in the way that suits Google. It is therefore worth bearing in mind that Google’s algorithms are not resulting in a neutral view of available information resources, but are providing a highly commercial view of the web.

John Sheridan described using Linked Data at the National Archives to open up documentation that previously had very little easily searchable metadata. Much of the documentation in the National Archives is structured – forms, lists, directories, etc. – which present particular problems for free text searches, but are prime sources for mashing up and querying.

Taxonomies, Metadata, and Semantics: Frameworks and Approaches

There were some sensible presentations on how to use taxonomies and ontologies to improve search results in the third session.

Tom Reamy of KAPS noted the end of the “religious fervour” about folksonomy that flourished a few years ago, now that people have realised that there is no way for folksonomies to get better and that they offer little help to infrequent users of a system. They are still useful as a way of getting insights into the kinds of search terms that people use, and can be easier to analyse than search logs. A hybrid approach, using a lightweight faceted taxonomy over the top of folksonomic tags, is proving more useful.
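A minimal sketch of my own of that hybrid approach: free folksonomic tags are rolled up into a small faceted taxonomy, so browsing and reporting can use the controlled layer while unmapped tags are kept for analysis. The facets and mappings here are invented.

```python
# Illustrative sketch of the hybrid approach described above: messy folksonomic
# tags are mapped onto a small faceted taxonomy; anything unmapped is kept
# separately for later analysis. Facets and mappings are hypothetical.

TAG_TO_FACET = {
    "howto": ("Content type", "Guide"),
    "tutorial": ("Content type", "Guide"),
    "press-release": ("Content type", "News"),
    "london": ("Place", "London"),
    "ldn": ("Place", "London"),
}

def facet_tags(free_tags):
    """Map user-supplied tags onto (facet, controlled term) pairs; keep unknowns separate."""
    controlled, unmapped = set(), set()
    for tag in free_tags:
        key = tag.strip().lower()
        if key in TAG_TO_FACET:
            controlled.add(TAG_TO_FACET[key])
        else:
            unmapped.add(tag)
    return controlled, unmapped

print(facet_tags(["Tutorial", "LDN", "my-holiday"]))
# ({('Content type', 'Guide'), ('Place', 'London')}, {'my-holiday'}) -- set order may vary
```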

Taxonomies remain key in providing the structure on which autocategorisation and text analytics are based, and so having a central taxonomy team that engages in regular and active dialogue with users is vital. Understanding the “basic concepts” (i.e. Lakoff and Rosch’s “basic categories”) that are the most familiar terms to the community of users is essential for constructing a helpful taxonomy, and labels should be as short and simple as possible. Labels should be chosen for their distinctiveness and expressiveness.

He also pointed out that adults and children have different learning strategies, which is worth remembering. I was also pleased to hear his clear and emphatic distinction between leisure and workplace search needs. It’s a personal bugbear of mine that people don’t realise that looking for a hairdresser in central London – where any one of a number will do – is not the same as trying to find a specific shot of a particular celebrity shortly after that controversial haircut a couple of years ago from the interview they gave about it on a chat show.

Tom highlighted four key functions for taxonomies:

  • knowledge organisation systems (for asset management)
  • labelling systems (for asset management)
  • navigation systems (for retrieval and discovery)
  • search systems (for retrieval)

He pointed out that text analytics needs taxonomy to underpin it, to base contextualisation rules on. He also stressed the importance of data quality, as data quality problems cause the majority of search project failures. People often focus on cool new features and fail to pay attention to the underlying data structures they need to put in place for effective searching.

He noted that the volumes of data and metadata that need to be processed are growing at a furious rate. He highlighted Comcast as a company that is highly advanced in the search and data management arena, managing multiple streams of data that are constantly being updated, for an audience that expects instant and accurate information.

He stated that structure will remain the key to findability for the foreseeable future. Autonomy is often hailed as doing something different to other search engines because it uses statistical methods, but at heart it still relies on structure in the data.

Richard Padley made it through the snow despite a four-hour train journey from Brighton, and spoke at length about the importance of knowledge organisation to support search. He explained the differences between controlled vocabularies, indexes, taxonomies, and ontologies and how each performs a different function.

Marianne Lykke then talked about information architecture and persuasive design. She also referred to “basic categories” as well as the need to guide people to where you want them to go via simple and clear steps.

Taxonomies, Metadata, and Semantics in Action

I spoke in the final session of the day, on metadata life cycles, asset lifecycles, parametadata, and managing data flows in complex information “ecosystems” with different “pace layers”.

Neil Blue from Biowisdom gave a fascinating and detailed overview of Biowisdom’s use of semantic technologies, in particular ontology-driven concept extraction. Biowisdom handle huge complex databases of information to do with the biological sciences and pharmaceuticals, so face very domain-specific issues, such as how to bridge the gap between “hard” scientific descriptions and “soft” descriptions of symptoms and side-effects typically given by patients.

In the final presentation of the day, Alessandro Pica outlined the use of semantic technologies by the Italian news agency AGI.

Online Information Conference 2010

Despite the recession, tube strikes, and snow, there was a fine collection of speakers, exhibitors, and delegates at a smaller than usual Online Information Conference and Exhibition this year.

Librarians seem to be getting heavily into Linked Data, while the corporate sector is still mainly concerned with business intelligence and search.

On day one I enjoyed the practical explanations of how Linked Data principles have been made to work at The Guardian, The Press Association, the Dutch Parliament, and the ALISS health project in Scotland.

Linked Data tags are a form of metadata that can be used to automatically generate content aggregations for web pages. This means that not only can you re-use your own content, increasing its lifespan, but you can also gather cheap content that is openly available online. This is very familiar territory to me, as we used to build products in the same way back in the 90s, the difference being that we didn’t have much of an external web to link to back then. In the meantime, using a linkable, interoperable format for your tags has very many benefits, and whether your focus is primarily on content within or beyond a firewall, the arguments for using standards that have the potential to link to the wider world seem very compelling. I can’t see any logical reasons not to standardise the format your metadata is held in (technical and practical issues are another matter), although standardising the semantic content of the metadata is a far more difficult problem.
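As a sketch of my own (with hypothetical tag URIs and items), the aggregation idea is essentially a grouping operation once internal and external content share the same linkable tag identifiers:

```python
# Illustrative sketch of tag-driven aggregation: if internal and external
# content both carry the same (linkable) tag identifier, building a topic page
# is a simple grouping operation. Items and URIs are hypothetical.
from collections import defaultdict

items = [
    {"title": "Our budget analysis", "source": "in-house",
     "tags": ["http://example.org/id/uk-budget"]},
    {"title": "Openly licensed budget explainer", "source": "external",
     "tags": ["http://example.org/id/uk-budget"]},
    {"title": "Unrelated piece", "source": "in-house",
     "tags": ["http://example.org/id/elections"]},
]

pages = defaultdict(list)
for item in items:
    for tag in item["tags"]:
        pages[tag].append((item["source"], item["title"]))

for tag, contents in pages.items():
    print(tag)
    for source, title in contents:
        print(f"  [{source}] {title}")
```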

It was reassuring to hear that everyone else is struggling with the problems of who mints IDs and URIs, who settles arguments about what exactly the IDs refer to – especially across domains – and who resolves and manages mappings. Such issues are difficult enough to resolve within a firewall; out on the open Web they become vast. The W3C is starting to work on provenance standards (the parametadata or meta-metadata), a pet worry of mine, because I am certain we need to get that layer of semantic information into our tags as soon as possible if we are going to be able to advance the semantic web beyond crunching databases together.

In the meantime, Linked Data is working very well especially for mashups and information visualisations. I particularly liked the Dutch Parliament’s “Attaquograms” – a diagram showing how often MPs were interrupted in debates and how much they interrupted others, although it doesn’t appear to have changed their behaviour yet. I also enjoyed The Guardian’s “league tables” of MPs’ performance. When MPs protested that such analyses ignored qualitative issues, The Guardian pointed out that if MPs advocate such data crunching as a way to judge schools and hospitals, then it must be good enough to apply to MPs themselves.

Andy Hyde from the ALISS project is working on ways to use Linked Data to help people manage their health, especially for patients with long term conditions such as diabetes. He stressed the importance of involving the users in any information project and addressing them on their terms, stating “The most collaborative tool we have is the cup of tea”.

My only concern about using Linked Data to pull in Wikipedia content is whether the audience will start recognising it. If every website that mentions a topic has the same Wikipedia content attached to it, won’t people get bored? Perhaps there are just so many people out there contributing, so many varieties of algorithmic aggregation, and so much content to read, that it will never happen!

There is a related Guardian technology blog post.

I will post summaries of days two and three shortly.