Identifiers for asset management

I met Rob Wilson of RCWS Consulting at the DAM London conference last summer, where he talked about digital identifiers. He has had a long career working on information architecture for a range of organisations and industry standards bodies, including the ultimately doomed e-GMS project, which was an attempt to unify government metadata standards for sharing and interoperability.

Identifiers are a hot topic at the moment, so I was pleased Rob was willing to talk to me and some of my colleagues about the problems involved and the attempts being made to solve them across the wider industry. Rob described the need for stable, persistent IDs in asset management, noting that this is not a new problem but one that various groups have been working on for many years. He distinguished between ID management and metadata management, pointing out that metadata may change but the ID of an asset can be kept stable. Just as it is important to manage metadata as a distinct element of content, so IDs need to be managed as distinct from other metadata.
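
As a minimal sketch of that separation (my own illustration, not something Rob presented), an asset record might carry an immutable identifier alongside freely editable metadata:

    from dataclasses import dataclass, field
    from uuid import uuid4

    @dataclass(frozen=True)
    class AssetID:
        """Stable identifier: assigned once, never edited."""
        value: str = field(default_factory=lambda: uuid4().hex)

    @dataclass
    class Asset:
        asset_id: AssetID                              # managed as its own element
        metadata: dict = field(default_factory=dict)   # titles, rights, tags – all free to change

    asset = Asset(AssetID(), {"title": "Draft cut"})
    asset.metadata["title"] = "Final cut"              # the metadata changes, the ID does not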

Rob is an independent consultant and has no commercial affiliation with the EIDR content asset identifier system, but suggested it is in many ways a robust model. The EIDR system embeds IDs within asset files, using a combination of steganography, strong encryption, and watermarking, with “triangulation” of these combined aspects to create a single ID. The idea is that the ID is so deeply ingrained and fragmented within the structure of the file that it cannot easily be removed and can be recovered from a small fraction of the file – e.g. if someone takes a clip from a video, the “parent” ID is discoverable from the clip. Rob thought the EIDR approach was better than digital rights management (DRM) methods, which rely on trying to prevent distribution and on “locking” content after a certain amount of time or use – an approach that gives everyone who holds such content an incentive to try to break the DRM system. If an individual can still access their illegally held content despite the ID, they have little incentive to try to remove the ID.

The system does not try to prevent theft from source, but helps to prove when copyright has been breached, because the ID – the ownership – remains within the file. It is intended as a tool to deter systematic copyright theft by large organisations, by making legal cases easy to win. Large organisations typically have the funds, or insurance, to cover large claims, unlike individuals, whom it is rarely cost-effective to pursue.

The EIDR system also has an ID resolver that manages rights and supply authentication (as well as ownership) so that when content is accessed, the system checks that the appropriate rights and licences have been obtained before delivery.
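A hypothetical resolver check along those lines might look like the sketch below (the registry, licence keys, and URL are all invented for illustration and bear no relation to the actual EIDR service):

    # Invented stand-in for a rights-aware ID resolver: the content location is only
    # released once the requested use has been checked against held licences.
    LICENCES = {
        ("asset-0001", "stream"): True,     # this ID may be streamed
        ("asset-0001", "download"): False,  # but not downloaded
    }

    def resolve(asset_id: str, requested_use: str) -> str:
        if not LICENCES.get((asset_id, requested_use), False):
            raise PermissionError(f"No licence held for '{requested_use}' on {asset_id}")
        return f"https://delivery.example.org/{asset_id}"

    print(resolve("asset-0001", "stream"))   # OK
    # resolve("asset-0001", "download")      # would raise PermissionError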

Rob also outlined the elements of information architecture that need to be considered for unified organisational information management – establishing standards, developing models, devising policies, selecting tools, maintaining governance, and so on. He emphasised that IDs are not a “silver bullet” to solve all issues, but that if a few problematic key use cases are known, they can be investigated to see whether a robust federated ID management architecture would help.

Online Information Conference – day three

Managing content in a mobile world

On Day 3, I decided to try the “Mobile content” track. The speakers were Alan Pelz-Sharpe from The Real Story Group, Russell Reeder from LibreDigital and Dave Kellogg from MarkLogic.

Augmented reality is one of my pet topics, so I was pleased to hear the speakers confirm it is all about data and meaning. It is just one aspect of how consumers want more and more data and information presented to them without delay on smaller and simpler devices. However, this means a greater need for metadata and more work for usability specialists.

The whole way people interact with content is very different when they are on the move, so simply trying to re-render websites is not enough. Entire patterns of information needs and user behaviour have to be considered. “A mobile person is not the same as a mobile device…the person needs to be mobile, not necessarily the content.” For example, mobile workers often prefer to contact a deskbound researcher and have answers sent to them, rather than doing the research themselves while on the move.

It is not enough just to worry about the technological aspects, or even just the information architecture aspects of versions of content for mobile users. A different editorial style needs to be used for small screens, so content needs to be edited to a very granular level for mobile – no long words!

Users don’t care about formats, so they may get a very bad impression of your service if you allow them to access the wrong content. One customer was cited as complaining that while they could watch YouTube videos easily on their phone (YouTube transcodes uploads so they are low-res and streamable), another video caused huge problems and took ages (it turned out to be a download of an entire HD feature film).

The session made me feel quite nostalgic, as back in 1999 we spent much time pondering how we would adapt our content to mobile devices. Of course, then we were trying to present everything on tiny text-only screens – 140 characters was seen as a luxury. There is just no comparison with today’s multifunctional multicoloured shiny touch screen devices.

New World of Search – Closing Keynote

I think every conference should include a speaker who rises above day-to-day business concerns and looks at the really big picture. Stephen Arnold outlined the latest trends in search technologies, both open source and proprietary, and how people are now getting better at picking elements from different systems, perhaps combining open source with proprietary options and building modular, integrated systems to prevent lock-in. However, he also talked engagingly about digital identity, privacy, and the balance of power between individuals and large organisations in the digital world.

He reiterated the point that Google (and other search engines) are not free – “free is not what it seems to the hopeful customer” – but noted that we haven’t yet worked out the value of data and content, let alone information and knowledge, in the light of the digital revolution. Traditional business models do not work and old economic theories no longer apply: “19th century accounting rules don’t work in the 21st century knowledge economy”.

He noted that Facebook has managed to entice users and developers to give up their content, work, time, and intellectual property to a locked-in proprietary walled garden. People have done this willingly and apparently without prompting, enabling Facebook to achieve something that software and content vendors such as IBM and AOL have been trying unsuccessfully to do for decades.

There is no clear way of evaluating the service that Facebook provides against the value of the content that is supplied to it free by users. However, it is clear that it is of huge value in terms of marketing. It is possibly the biggest and most sophisticated marketing database ever built.

As well as content, people are willing to surrender personal information, apparently with minimal concerns for privacy and security. It is not clear what the implications of this are: “What is the true cost of getting people to give up their digital identities?” It is clear that the combined data held by Facebook, Google, Foursquare, mobile phone companies, credit card companies, travel card companies, etc. creates a hugely detailed “digital profile” of human lives. Stephen urged the audience to take seriously the potential for this to be used to cause harm to individuals or society – from cyberstalking to covert government surveillance – because technology is moving so fast that any negative social and political effects may already be irreversible.

Online Information Conference – day two

Linked Data in Libraries

I stayed in the Linked Data track for Day 2 of the Online Information Conference, very much enjoying Karen Coyle’s presentation on metadata standards – FRBR, FRSAR, FRAD, RDA – and Sarah Bartlett’s enthusiasm for using Linked Data to throw open bibliographic data to the world so that fascinating connections can be made. She explained that while the physical sciences have been well mapped and a number of ontologies are available, far less work has been done in the humanities. She encouraged humanities researchers to extend RDF and develop it.

In the world of literature, the potential connections are infinite and very little numerical analysis has been done by academics. For example, “intertextuality” is a key topic in literary criticism, and Linked Data that exposes the references one author makes to another can be analysed to show the patterns of influence a particular author had on others. (Google ngrams is a step in this direction, part index, part concordance.)
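
As a rough sketch of the kind of query this enables – using the rdflib library in Python, with a made-up ex:references property standing in for whatever vocabulary real bibliographic data would use:

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/lit/")   # invented vocabulary for this sketch
    g = Graph()
    g.add((EX.Ulysses, EX.references, EX.TheOdyssey))
    g.add((EX.TheWasteLand, EX.references, EX.TheOdyssey))

    # Which works reference The Odyssey? Over real published Linked Data the same
    # query could draw on bibliographic datasets from many different libraries.
    results = g.query("""
        PREFIX ex: <http://example.org/lit/>
        SELECT ?work WHERE { ?work ex:references ex:TheOdyssey }
    """)
    for row in results:
        print(row.work)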

She stressed that libraries and librarians have a duty of care to understand, curate, and manage ontologies as part of their professional role.

Karen and Sarah’s eagerness to make the world a better place by making sure that the thoughtfully curated and well-managed bibliographic data held by libraries is made available to all was especially poignant at a time when library services in the UK are being savaged.

The Swedish Union Catalogue is another library project that has benefited from a Linked Data approach. Concerned to give users more access to, and pathways into, the collections, Martin Malmsten asked whether APIs are enough. He stressed the popularity of just chucking the data out there in a quick and dirty form and making it as simple as possible for people to interact with it. However, he pointed out that licences need to be changed and updated, as copyright law designed for a print world is not always applicable to online content.

Martin pointed out that in a commercialised world, giving anything away seems crazy, but that allowing others to link to your data does not destroy your data. If provenance (parametadata) is kept and curated, you can distinguish between the metadata you assert about content and anything that anybody else asserts.
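
One simple way to picture that (a sketch of the idea, not of any particular system): keep each statement together with who asserted it and when, so your own assertions never get muddled with anyone else’s:

    # Each assertion carries its provenance: (subject, property, value, asserted_by, asserted_on).
    assertions = [
        ("book:123", "title", "Paradise Lost", "our-library", "2010-11-02"),
        ("book:123", "title", "Paradise Lost (annotated)", "external-aggregator", "2010-12-01"),
    ]

    def asserted_by(source):
        """Return only the statements made by a given source."""
        return [a for a in assertions if a[3] == source]

    print(asserted_by("our-library"))   # our own metadata, untouched by anyone else's claims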

During the panel discussion, provenance and traceability – which the W3C is now focusing on (parametadata) – were discussed, and it was noted again that allowing other people to link to your data does not destroy it, and often makes it more valuable. The question of what the “killer app” for the semantic web might be was raised, as was the question of how we might create user interfaces that allow the kinds of multiple-pathway browsing that can render multiple relationships and connections comprehensible to people. This could be something a bit like topic maps – but we probably need a 13-year-old who takes all this data for granted to have a clear vision of its potential!

Tackling Linked Data Challenges

The second session of day two was missing Georgi Kobilarov of Uberblic who was caught up in the bad weather. However, the remaining speakers filled the time admirably.

Paul Nelson of Search Technologies pointed out that Google is not “free” to companies, as they pay billions in search engine optimisation (SEO) to help Google. Google is essentially providing a marketing service, and companies are paying huge amounts trying to present their data in the way that suits Google. It is therefore worth bearing in mind that Google’s algorithms are not resulting in a neutral view of available information resources, but are providing a highly commercial view of the web.

John Sheridan described using Linked Data at the National Archives to open up documentation that previously had very little easily searchable metadata. Much of the documentation in the National Archives is structured – forms, lists, directories, etc. – which presents particular problems for free-text search but makes such documents prime sources for mashing up and querying.

Taxonomies, Metadata, and Semantics: Frameworks and Approaches

There were some sensible presentations on how to use taxonomies and ontologies to improve search results in the third session.

Tom Reamy of KAPS noted the end of the “religious fervour” about folksonomy that flourished a few years ago, now that people have realised that there is no way for folksonomies to get better and that they offer little help to infrequent users of a system. They are still useful as a way of getting insights into the kind of search terms that people use, and can be easier to analyse than search logs. A hybrid approach, using a lightweight faceted taxonomy over the top of folksonomic tags, is proving more useful.
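
A minimal sketch of that hybrid idea (tags, facets, and mappings all invented here): the free-form tags are kept as entered, but any that are recognised are mapped onto a small faceted taxonomy that search and browse can rely on:

    # Lightweight faceted taxonomy layered over uncontrolled user tags.
    FACET_MAP = {
        "howto":    ("content-type", "Guide"),
        "tutorial": ("content-type", "Guide"),
        "python":   ("technology", "Python"),
        "py":       ("technology", "Python"),
    }

    def facets_for(tags):
        """Map whichever free-form tags we recognise onto controlled facet/value pairs."""
        return {FACET_MAP[t] for t in tags if t in FACET_MAP}

    print(facets_for(["howto", "py", "some-random-tag"]))
    # {('content-type', 'Guide'), ('technology', 'Python')} – unrecognised tags stay as plain tags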

Taxonomies remain key in providing the structure on which autocategorisation and text analytics are based, and so having a central taxonomy team that engages in regular and active dialogue with users is vital. Understanding the “basic concepts” (i.e. Lakoff and Rosch’s “basic categories”) that are the most familiar terms to the community of users is essential for constructing a helpful taxonomy, and labels should be as short and simple as possible. Labels should be chosen for their distinctiveness and expressiveness.

He also pointed out that adults and children have different learning strategies, which is worth remembering. I was also pleased to hear his clear and emphatic distinction between leisure and workplace search needs. It’s a personal bugbear of mine that people don’t realise that looking for a hairdresser in central London – where any one of a number will do – is not the same as trying to find a specific shot of a particular celebrity shortly after that controversial haircut a couple of years ago from the interview they gave about it on a chat show.

Tom highlighted four key functions for taxonomies:

  • knowledge organisation systems (for asset management)
  • labelling systems (for asset management)
  • navigation systems (for retrieval and discovery)
  • search systems (for retrieval)

He pointed out that text analytics needs a taxonomy to underpin it, as the base on which contextualisation rules are built. He also stressed the importance of data quality, as data quality problems cause the majority of search project failures. People often focus on cool new features and fail to pay attention to the underlying data structures they need to put in place for effective searching.
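
As a toy illustration of a taxonomy-backed contextualisation rule (the categories and trigger terms below are invented; real text-analytics platforms are far more sophisticated):

    # A category fires only when its taxonomy term appears together with supporting context terms.
    RULES = {
        "Finance/Banking":  {"term": "bank", "context": {"loan", "account", "interest"}},
        "Geography/Rivers": {"term": "bank", "context": {"river", "flood", "erosion"}},
    }

    def categorise(text):
        words = set(text.lower().split())
        return [cat for cat, rule in RULES.items()
                if rule["term"] in words and words & rule["context"]]

    print(categorise("The bank raised its interest rate on the loan"))   # ['Finance/Banking']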

He noted that the volumes of data and metadata that need to be processed are growing at a furious rate. He highlighted Comcast as a company that is very highly advanced in the search and data management arena, managing multiple streams of data that are constantly being updated, for an audience that expects instant and accurate information.

He stated that structure will remain the key to findability for the foreseeable future. Autonomy is often hailed as doing something different to other search engines because it uses statistical methods, but at heart it still relies on structure in the data.

Richard Padley made it through the snow despite a four-hour train journey from Brighton, and spoke at length about the importance of knowledge organisation to support search. He explained the differences between controlled vocabularies, indexes, taxonomies, and ontologies and how each performs a different function.

Marianne Lykke then talked about information architecture and persuasive design. She also referred to “basic categories” as well as the need to guide people to where you want them to go via simple and clear steps.

Taxonomies, Metadata, and Semantics in Action

I spoke in the final session of the day, on metadata lifecycles, asset lifecycles, parametadata, and managing data flows in complex information “ecosystems” with different “pace layers”.

Neil Blue from BioWisdom gave a fascinating and detailed overview of BioWisdom’s use of semantic technologies, in particular ontology-driven concept extraction. BioWisdom handles huge, complex databases of information to do with the biological sciences and pharmaceuticals, and so faces very domain-specific issues, such as how to bridge the gap between “hard” scientific descriptions and the “soft” descriptions of symptoms and side-effects typically given by patients.
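
As a crude illustration of the general idea of ontology-driven concept extraction (the concept IDs and synonyms below are invented, and BioWisdom’s actual pipeline will be far richer): synonyms from the ontology, including the lay terms a patient might use, are matched in free text and resolved to a single concept:

    # Tiny invented ontology fragment: concept ID -> synonyms, mixing clinical and lay language.
    ONTOLOGY = {
        "concept:headache": {"cephalalgia", "headache", "head ache", "my head hurts"},
        "concept:nausea":   {"nausea", "feeling sick", "queasy"},
    }

    def extract_concepts(text):
        """Return the ontology concepts whose synonyms appear anywhere in the text."""
        text = text.lower()
        return {cid for cid, synonyms in ONTOLOGY.items()
                if any(s in text for s in synonyms)}

    print(extract_concepts("Patient reports feeling sick and a mild headache"))
    # {'concept:nausea', 'concept:headache'}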

In the final presentation of the day, Alessandro Pica outlined the use of semantic technologies by the Italian news agency AGI.