Category Archives: information management

Data as a liquid asset and the AI future

Descent of man
    Start a conversation 
Estimated reading time 5–8 minutes

Getting back into the swing of meetups again, last night I went to the MTLData meetup – a group of data scientists and enthusiasts who are looking to raise the profile of data science in Montreal. The event featured a panel discussion on the topic of ‘Build vs Buy?’ when considering software for data solutions.

The panellists were Marc-Antoine Ross, Director of Data Engineering at Intel Security, Maxime Leroux, consulting data scientist at Keyrus Canada, and Jeremy Barnes, Chief Architect at Element AI. The chair was Vaughan DiMarco of Vonalytics.

Data as liquid

The issues were very familiar to me from considering EDRM and DAM systems, which made me think about the way data has changed as an asset, and how management and security of data now has to include the ‘liquid’ nature of data as an asset. This adds another layer of complexity. Data still needs to be archived as a ‘record’ for many reasons (regulatory compliance, business continuity, archival value…) but for a data-driven organisation, the days of rolling back to ‘yesterday’s version of the database’ seem like ancient history. Data assets are also complex in that they are subject to many levels of continuous processing, so the software that manages the processing also has to be robust.

The metaphor of data flowing around the organisation like water seems especially telling. If there is a system failure, you can’t necessarily just turn off the tap of data, and so your contingency plans need to include some kind of ‘emergency reservoir’ so that data that can’t be processed immediately does not get lost and the flow can be re-established easily.

Build vs Buy?

The issues highlighted by the panel included costs – available budget, restrictions from finance departments, balance between in-house and outsourced spending (again all familiar in EDRM and DAM procurement), privacy, security, ability to maintain a system, and availability of skills. Essentially balancing risks, which will be unique to each team and each business. In terms of deciding whether to build something in house, availability of in house resource is an obvious consideration, but Marc-Antoine stressed the importance of thinking through what added value a bespoke build could offer, as opposed to other ways the team could be spending their time. For example, if there are no off-the-shelf or open source products that match requirements, if there is value in owning the IP of a new product, if risks can be kept low, and resources are available, a build might be worthwhile.

There are risks associated with all three of the main options – a big vendor is less likely to go bust, but sometimes they can be acquired, sometimes they can stop supporting a product or particular features, and they can be very costly. Open source has the advantage of being free, but relies on ad hoc communities to maintain and update the code base, and how vibrant and responsive each specific community is, or will remain, can vary. Open source can be a good option for low risk projects – such as proof-of-concept, or for risk tolerant startups with plenty of in-house expertise to handle the open source code themselves.

AI future

The conversation diverged into a discussion of the future of AI, which everyone seemed to agree was going to become a standard tool for most businesses eventually. Jeremy noted that AI at the moment is being sought after for its scarcity value, to give early adopters an edge over the competition, while Maxime suggested that early advantage is likely to fade, just as it has with data science. Data analysis is now so ubiquitous, even small businesses are involved to a certain extent. Jeremy pointed out that it is hard to maintain a competitive edge based on the scarcity of data itself, as data can so easily be copied and distributed, but knowing how to make intelligent use of the data is a scarce commodity. Making connections and managing data in a very tailored specific way could even be a way for organisations to compete with Google, who have more data than anyone else, but are not necessarily able to answer all questions or have the most useful insights into specific problems.

The value of meaning

I was intrigued by this, as it validates the role of semantics – data without meaning is useless – and the importance of the imaginative and creative leaps that humans can make, as well as the moral and social reasoning that humans can bring. With reports of early AI systems reflecting existing biases and prejudices, and with disasters like the SimSimi chatbot causing social problems such as bullying amongst youngsters, the need for a real human heart to accompany artificial intelligence seems ever more important.

Scarcity of understanding?

Someone asked if the panel thought companies would soon need ‘Chief Intelligence Officers’ in the way that many now have ‘Chief Data Officers’. The panel did not seem particularly enthusiastic about the idea (“it sounds like something that you do with founders when you put them out to pasture”) but I think it would be a fascinating role. The BBC had someone to oversee ethics and advise on editorial ethics issues. Perhaps it is in the skills of a Chief Intelligence Officer – someone who can combine an understanding of how data, information, knowledge and wisdom interact, whether within IT systems or beyond, with an understanding of social issues and problems – that the scarcity value lies. Insight, imagination, and compassion could be the skills that will give the competitive edge. In the AI future, could a Chief Intelligence Officer make the difference between a company that succeeds by asking the right questions, not just of its data or its customers, but of itself, and one that fails?

The Information Master – Louis XIV’s Knowledge Manager

    2 comments 
Estimated reading time 4–6 minutes

I recently read The Information Master: Jean-Baptiste Colbert‘s Secret State Intelligence System by Jacob Soll. It is a very readable but scholarly book that tells the story of how Colbert used the accumulation of knowledge to build a highly efficient administrative system and to promote his own political career. He seems to have been the first person to seize upon the notion of “evidence-based” politics and that knowledge, information and data collection, and scholarship could be used to serve the interests of statecraft. In this way he is an ancestor of much of the thinking that is commonplace not only in today’s political administrations but also in all organizations that value the collection and management of information. The principle sits at the heart of what we mean by the “knowledge economy”.

The grim librarian

Jean-Baptiste Colbert (1619-83) is depicted as ruthless, determined, fierce, and serious. He was an ambitious man and saw his ability to control and organize information as a way of gaining and then keeping political influence. By first persuading the King that an informed leadership was a strong and efficient leadership, and then by being the person who best understood and knew how to use the libraries and resources he collected, Colbert rose to political prominence. However, his work eventually fell victim to the political machinations of his rivals and after his death his collection was scattered.

Using knowledge to serve the state

Before Colbert, the scholarly academic tradition in France had existed independently from the monarchy, but Colbert brought the world of scholarship into the service of the state, believing that all knowledge – even from the most unlikely of sources – had potential value. This is very much in line with modern thinking about Big Data and how that can be used in the service of corporations. Even the most unlikely of sources might contain useful insights into customer preferences or previously unseen supply chain inefficiencies, for example.

Colbert’s career was caught up with the political machinations of the time. He worked as a kind of accountant of Cardinal Mazarin, but when Mazarin’s library was ransacked by political rivals and his librarian fell out of favour, Colbert restored the library and built a unified information system based on the combination of scholarship and administrative documentation, ending the former division between academia and government bureaucracy.

Importance of metadata

Colbert also instinctively grasped the importance of good metadata, cataloguing, and an accurate network of links and cross references in order to be able to obtain relevant and comprehensive information quickly, issues that remain even more urgent than ever given the information explosions modern organizations – and indeed nations – face. This enabled him to become a better administrator than his rivals and by becoming not only the source of political expedient information but also the person who knew how to use the information resources most effectively, he was able to gain political influence and become a key minister under Louis XIV.

A personal vision

I was struck by how much his vast library, archive, and document management system was the result of his own personal vision, how it was built on the dismantling and rebuilding of work of predecessors, but also how, after his death, the system itself fell victim to political changes and was left to collapse. This pattern is repeated frequently in modern information projects. So often the work of the champion of the original system is wasted as infighting that is often not directly connected to the information project itself leads to budget cuts, staff changes, or other problems that lead to the system decaying.

Soll argues that the loss of Colbert’s system hampered political administration in France for generations. Ironically, it was Colbert’s own archives that enabled successive generations of political rivals to find the documents with which to undermine the power of the crown, showing the double-edged nature of information work. It is often the same collections that can both redeem and condemn.

Secrecy or transparency?

Another theme that ran throughout Colbert’s career, with both political and practical implications, was the tension between demands for transparent government and the desire for a secret state. Much of the distinction between public and private archives was simply a matter of who was in control of them and who had set them up, so the situation in France under the monarchy was different to the situation in England where Parliament and the Monarchy maintained entirely separate information systems. In France, an insistence on keeping government financial records secret eventually undermined trust in the economy. Throughout his career Colbert was involved in debates over which and how much government information should be made public, with different factions arguing over the issue – arguments that are especially resonant today.

Semantic Search – Call for Papers for Special Issue on Semantic Search for Aslib Journal

    1 comment 
Estimated reading time 4–6 minutes

This special issue aims to explore the possibilities and limitations of Semantic Search. We are particularly interested in papers that place carefully conducted studies into the wider framework of current Semantic Search research in the broader context of Linked Open Data.

Research into Semantic Search and its applications has been gaining momentum over the last few years, with an increasing number of studies on general principles, proof of concept and prototypical applications. The market for Semantic Search applications and its role within the general development of (internet) technologies and its impact on different areas of private and public life have attracted attention. Simultaneously, many publicly funded projects in the field of cultural heritage were initialised. Researchers in many disciplines have been making progress in the establishment of both theories and methods for Semantic Search. However, there still is a lack of comparison across individual studies as well as a need for standardisation regarding the dissociation of Semantic Search of other search solutions, agreed upon definitions as well as technologies and interfaces.

Semantic Search research is often based on large and rich data sets and a combination of techniques ranging from statistical bag of words approaches and natural-language-processing enriched via a subtle utilisation of metadata over classificatory approaches right up to ontological reasoning. Over the last 10 years a lot of initial technical and conceptual obstacles in the field of Semantic Search have been overcome. After the initial euphoria for Semantic Search that resulted in a technically driven supply of search solutions, appraisal of successful and less successful approaches is needed. Amongst other things the limitations of working with open world solutions on – only apparently comprehensive – linked open data sets compared to small domain specific solutions need to be determined.
One ongoing challenge for semantic search solutions is their usability and user acceptance, as only highly usable walk-up-and-use-approaches stand a chance in the field of general search.

For this special issue, we invite articles which address the opportunities and challenges of Semantic Search from theoretical and practical, conceptual and empirical perspectives.

Topics of interest include but are not restricted to:

  • The history of semantic search – how the latest techniques and technologies have come out of developments over the last 5, 10, 20, 100, 2000… years
  • Technical approaches to semantic search : linguistic/NLP, probabilistic, artificial intelligence, conceptual/ontological …
  • Current trends in Semantic Search
  • Best practice – how far along the road from ‘early adopters’ to ‘mainstream users’ has semantic search gone so far?
  • Semantic Search and cultural heritage
  • Usability and user experience of Semantic Search
  • Visualisation and Semantic Search
  • Quality criteria for Semantic Search
  • Impact of norms and standardisation for instance (like ISO 25964 “Thesauri for information retrieval“) and the potential of Semantic Search?
  • How are semantic technologies fostering a need for cross-industry collaboration and standardisation?
  • How are Semantic Search techniques and technologies being used in practice?
  • Practical problems in brokering consensus and agreement – defining concepts, terms and classes, etc.
  • Curation and management of ontologies
  • Differences between web-scale, enterprise scale, and collection-specific scale techniques
  • Evaluation of Semantic Search solutions
  • Comparison of data collection approaches
  • User behaviour and the evolution of norms and conventions
  • Information behaviour and information literacy
  • User surveys
  • Usage scenarios and case studies

Submissions

Papers should clearly connect their studies to the wider body of Semantic Search scholarship, and spell out the implications of their findings for future research. In general, only research-based submissions including case studies and best practice will be considered. Viewpoints, literature reviews or general reviews are generally not acceptable.

Papers should be 4,000 to 6,000 words in length (including references). Citations and references should be in our journal style.

Please see the author guidelines at http://www.emeraldinsight.com/products/journals/author_guidelines.htm?id=ap for more details and submission instructions.
Submissions to Aslib Proceedings are made using ScholarOne Manuscripts, the online submission and peer review system. Registration and access is available at http://mc.manuscriptcentral.com/ap.

Important Dates

Paper submission: 15.12.2013
Notice of review results: 15.02.2013
Revisions due: 31.03.2014
Publication: Aslib Proceedings, issue 5, 2014.

About the Journal

Aslib Proceedings (ISSN: 0001-253X) is a peer-reviewed high-quality journal covering international research and practice in library and information science, and information management. The journal is the major publication for ASLIB – the Association for Information Management in the United Kingdom – a membership association for people who manage information and knowledge in organisations and the information industry.
Information about the journal can be found at
http://www.emeraldinsight.com/products/journals/journals.htm?id=ap

Contact the guest editors

Prof. Dr. Ulrike Spree
- Hamburg University of Applied Sciences -
Faculty Design, Medien and Information
Department Information
Finkenau 35
20081 Hamburg
Phone: +49/40/42875/3607
Email: ulrike.spree@haw-hamburg.de

Fran Alexander
Information Architect, BCA Research (2013- )
Taxonomy Manager, BBC Information and Archives (2009-13)
Email: fran@vocabcontrol.com
Twitter: @frangle

This time it’s personal data – Indiverses and Personal APIs

    5 comments 
Estimated reading time 3–4 minutes

Sooner or later I was bound to find some other Semanticists in Canada and on Thursday I attended a Semantic Web meetup in Montreal. The audience was small, but that led to more of a group discussion atmosphere than a formal talk. The presenter, Dr Joan Yess Kahn, has coined the term Indiverse – Individual Information Universe – to facilitate her thinking about the set of personal information and data that we accumulate through our lives.

She pointed out that some of this information is created by us, some about us, some with our knowledge and consent, some without, and our entire digital lives can be stolen and abused. She made some interesting observations about how our personal and public information spaces were essentially one and the same before the industrial revolution, when most people’s work and home lives were intertwined (e.g. artisans living in their workshops), and that changes such as the industrial revolution and public education split those apart as people left home to work somewhere else. However, in the information age more people are returning to working from home while others are increasingly using their computers at work to carry out personal tasks, such as online shopping.

This blurring of the public and private has many social and commercial implications. We discussed the potential monetary value of personal attention and intention data to advertisers, and implications for surveillance of individuals by governments and other organizations.

We also talked about information overload and information anxiety. Joan has written about ways of categorizing, indexing, and managing our personal information – our address books, calendars, to do lists, etc. – and this led us to consider ideas of how to construct sharable, standardized Personal Data Lockers (for example The Locker Project) and to take back control of our online identity and information management, for example in shifting from Customer Relations Management (CRM) to Vendor Relations Management (VRM).

In previous posts I have talked about our need to become our own personal digital archivists as well and I was sent a link by Mark to a Personal API developed by Naveen. This takes personal information curation to the data level, as Naveen is seeking an easy way to manage the huge amounts of data that he generates simply by being a person in the world – his fitness routines, diet, etc.

There is a clear convergence here with the work done by such medical innovators as Patients Know Best electronic patient health records. Moral and social implications of who is responsible for curating and protecting such data are huge and wide-ranging. At the moment doting parents using apps to monitor their babies or fitness enthusiasts using apps (such as map my run etc.) are doing this for fun, but will we start seeing this as a social duty? Will we have right-wing campaigns to deny treatment to people who have failed to look after their health data or mass class actions to sue hospitals that get hacked? If you think biometric passports are information dense, just wait until every heartbeat from ultrasound to grave is encoded somewhere in your Indiverse.

Now you tell me! – Information at the right time

    Start a conversation 
Estimated reading time 7–12 minutes

My posts have been a bit thin on the ground this year because I have just relocated from the UK to Canada, to start a new job. This has been a lot of fun, but hard work.

The process has made me acutely aware of the problems of delivering information at the time when it is the most useful. This is often overlooked in information and knowledge management, but getting the timing wrong can render extremely valuable information almost worthless if it is forgotten, lost, or misunderstood.

Knowledge takes time

When I started my previous job, it was a new role, so I had to spend time researching, collecting, and collating relevant knowledge and information from existing employees and previous projects. There was no one person I could talk to in order to get an overview and there was no single story for me to join. I enjoyed the research process, but from an organizational learning pont of view, the less time and energy each new employee has to spend merely to find out what they need to do to be effective in their job, the better. I find corporate storytelling, as advocated by knowledge managers such as David Gurteen, very useful, especially when forging connections between “technical” and “business” teams and transferring tacit knowledge. However, telling helpful stories is easier said than done for many reasons, but in this post I will focus on the time factor.

When I came to hand over my responsibilities to my successor, I found I could construct “histories” of projects explaining how we had arrived where we are reasonably easily, because time in the past could be compressed into simple linear sequences – we spent a year debating this, then we recieved a directive to do that, then we reassessed where we were, etc. Such a narrative can answer many questions, explaining scope change, for example. However, when trying to explain what must be done next, it is important not simply to explain sequence, but also timing and conditionality – when judgement calls need to be made. The “history” has to become a “prophecy” and prophecies tend to be vague and enigmatic for a reason!

If you are handing over a project that has very clearly defined next steps, you can produce a set of instructions that say “when this happens, do this, which will result in something else, and then you will have to do the next thing”. You can even add expected timings. For example “when the technical team deliver the next software release (scheduled for April), you will have one week to check that the data has been transferred correctly, and if there are any errors, inform the technical team before signing off the release”. However, if the project is broader in scope, or less well defined, it is far harder. For example, “following the release of the new search system, you will need to manage user satisfaction with the results, which you could do by statistical analysis, customer satisfaction surveys, interviews, liaison with the UX teams, etc., but you will need to put forward a proposal for funding to do such work and what you should propose will be more successful if it is tailored in reponse to circumstances – available resource or budgets, departmental politics, etc. So, it may be better to conduct some small-scale interviews with a handful of users or it may be worth pushing for an expensive technical solution.” Even this simple example of a potential future judgement call has become extremely difficult to turn into a coherent story. Perhaps you could create a series of “time capsule” handover notes – if this happens, open the red envelope (or file or video), if the opposite, open the blue one. Of course, this is hugely time consuming to set up, so not very practical, and you are never going to be able to include all possible future universes.

Successful succession planning is about training someone to make good decisions by themselves so that they can take over from you without instructions covering every eventuality. Again, this is about timing. You have to train them over time, by involving them in decisions and teaching them how to make good decisions well ahead of when you leave. Try to do it on the afternoon before you head out of the office and both you and they will struggle!

Information overload is just information arriving at the wrong time

I have received huge amounts of information, advice, and instructions over the past few months about the move. Instructions arriving too soon can be reassuring as you can “practice” what you need to do, but this may be very inefficient in workplace settings, as it is unlikely people will remember clearly enough over time, so they will end up re-reading the same instructions – possibly even several times – and may even become over-confident that they know exactly what to do, fail to double-check at the appropriate moment, and then make unnecessary mistakes. So, it is worth thinking about the nature of the instructions and planning the delivery of the information as a project in its own right.

It is very tempting to offload everything onto somebody as soon as you think of it, but what you are actuallly doing is handing over the management of the delivery of that information to them. If you are doing this as a deliberate delegation to a junior member of staff, it may be a valid part of their job, but then it should be recognised that they are taking on those extra tasks as part of their workload. For example, if your employee is going to have to fill in a set of project completion forms, assemble paperwork, and write the report before a certain date, it is not necessarily helping them if you send them all the project completion forms before they have even started the project, as that burdens them with both the records management of those documents and the project managment aspect of remembering that the forms need to be submitted a week after the report itself is completed and submitted. If there is some way you can set up a business process that automatically triggers the sending of the project completion forms after submission of the project, the employee can think about other things. It may be helpful to tell them about the forms, and give them a list of the key documents that they will need (monthly budget statements, for example) to keep throughout the project, but you probably don’t need to make them read through every single question in detail at the start of the project, when they are absorbing huge amounts of other new – and significant – information.

Business process efficiency requires thinking carefully about the timing of information delivery. Considering potential “information overload” points from a human perspective is too often forgotten. It may seem simpler aned cheaper to lump all the instructions together and deliver them at once from a technical point of view, but it is worth weighing up whether the savings on the technical side are really worth more than the savings on the human cognitive efficiency side.

Interaction design is about timing

The same principles remain relevant on the “small scale” timings of interactions with a particular website or business system. Good interaction designers pay attention to the limits of human memory and concentration to make sure that users are delivered information in digestible chunks and in easy to follow steps and stages. Poor interaction design leaves the user clicking to and fro or opening multiple windows to try to remember something that was on a different screen.

If you have ever felt the need to make notes on a piece of paper while using a particular interface, tell the designer! Deciding which information the user needs to see early in an interaction and which can be left to the end can be controversial. For example, quoting a base price on a home page and only adding compulsory additions, such as taxes, supplements, and other charges at the last stage of booking could be described as a mis-use of the notion that users need to be delivered information in small manageable chunks. However, providing detailed directions about how to contact the delivery company in case of a query right at the end of a purchasing process is likely to be a good use of information timing, as the user has finished thinking about everything else by that stage.

An example of “perfect timing” I encountered recently was when I travelled into central London to go to a particular shop. I had already made the decision to go there and knew the address, but I did not know whether to turn left or right when leaving the tube station. To my relief, there on the wall in front of me at the top of the escalators was a large advertisement for the shop, telling me the exit to take, the direction to turn, and the number of metres to walk. I did not care about the metre-precise location when I decided to visit the shop, and I did not want to be burdened with that level of detail when I planned my route to the tube station, but getting the information at the exact time that I needed it was extremely helpful.

Same problem – different scales

Balancing the urge to deliver all available information at once against the information overload and cognitive burden this can cause applies whether dealing with long-term knowledge management (typically years-decades), medium-term business process management (months-years), and short-term interactions with an interface (minutes). There is no simple formula to getting this right, but these different information delivery situations have much in common. Thinking about your knowledge management as a form of interaction or your interaction design as a business process may help offer a new perspective on tricky problems of creating well-timed narratives and telling people not just what they need to know, but telling them what they need to know when they need to know it.

Change, technology, understanding, and the information professions

    Start a conversation 
Estimated reading time 3–4 minutes

Not being a morning person, I was unsure whether a networking breakfast would suit me, but the recruitment agent Sue Hill’s event offered good food and interesting conversation, so I thought I would give it a try. I wasn’t disappointed – the food was excellent and the big round tables promoted lively group discussion.

We were a mix of information professionals from public and private sector, at different stages of our careers, but three key themes prompted the most debate.

Change management

Managing technology change and bridging the cultural and political divisions within organisations in order to bring about change were key concerns. Information professionals can contribute by explaining how new technologies work, how technologies can be catalysts of changes in behaviour, and how they mitigate or increase informational and archival risks. Even simply letting people know new technology is out there can be hugely valuable. Knowledge and information workers can help manage change on political and cultural levels by understanding the corporate culture they are working in and helping their organisation to understand itself and so make good decisions about systems procurement. Information professionals can also often help to break down cultural barriers, to sharing information, for example.

Social media

Social media are now being used to differing degrees within organisations – some having embraced the technologies wholeheartedly, others seeing them as a problem or a threat. There was a general concern that technology is being adopted and used faster than we can understand its impacts and devise strategies for mitigating any risks.

Personal and cultural understanding of the divisions between the public and the private seemed to be a problematic area. Young people in particular were perceived as being vulnerable to “over exposure” as they seemed not to notice that postings about them – pictures especially – would remain available for decades to come and could compromise them in their future careers. Recruitment agents use social media to find out about potential job candidates, and notice inconsistencies between a very professional image presented in a CV or at interview with a Twitter feed that paints a picture of carelessness, foolishness, or irresponsibility.

Information literacy

Awareness of how to use and abuse social media, search engines and research tools, and data and statistics was seen as an arena in which information professionals can offer advice and mentoring, to young people, but also to organisations. Information professionals should also set good examples of how to use social media tools, adopt new working practices, and evaluate new technologies. They should also be able to explain how search engines work, what the pitfalls of poorly planned or too narrow research strategies are, and how to research in a more efficient and effective manner.

A new area that information professionals also need to understand is data analytics and how statistics and algorithmic data mining can be used or abused. Information professionals need not be advanced mathematicians to contribute in this area – an understanding of how to interpret data, the political and cultural issues that can bias interpretations, how to frame questions to get mathematically and statistically significant results, and how to understand the importance of outliers and statistical anomalies are skills that are becoming more important every day.

Overall, I thoroughly enjoyed being woken up by such thoughtful and interesting breakfast companions and went about the rest of the day with a head full of fresh ideas.

Data: The New Black Gold?

    Start a conversation 
Estimated reading time 5–8 minutes

Last week I attended a seminar organised by The British Screen Advisory Council and Intellect, the technology trade association, and hosted by the law firm SNR Denton. The panellists included Derek Wyatt, internet visionary and former politician, Dr Rob Reid, Science Policy Adviser, Which?, Nick Graham, of SNR Denton, Steve Taylor, creative mentor, Donna Whitehead, Government Affairs Manager, Microsoft, Theo Bertram, UK Policy Manager, Google, David Boyle, Head of Insight, Zeebox, and Louisa Wong, Aegis Media.

Data as oil

The event was chaired by Adam Singer, BSAC chairman, who explored the metaphor of “data as oil”. Like oil, raw data is a valuable commodity, but usually needs processing and refining before it can be used, especially by individual consumers. Like oil, data can leak and spill, and if mishandled can be toxic.

It struck me through the course of the evening, that just like oil, we are in danger of allowing control of data to fall into the hands of a very small number of companies, who could easily form cartels and lock out competition. It became increasingly obvious during the seminar that Google has immense power because of the size of the “data fields” it controls, with Facebook and others trying to stake their claims. All the power Big Data offers – through data mining, analytics, etc. – is dependent on scale. If you don’t have access to data on a huge scale, you cannot get statistically significant results, so you cannot fine tune your algorithms in the way that Google can. The implication is that individual companies will never be able to compete in the Big Data arena, because no matter how much data they gather on their customers, they will only ever have data on a comparatively small number of people.

How much is my data worth?

At a individual level, people seemed to think that “their” data had a value, but could not really see how they could get any benefit from it, other than by trading it for “free” services in an essentially hugely asymmetrical arrangement. The value of “my” data on its own – i.e. what I could sell it for as an individual – is little, but when aggregated, as on Facebook, the whole becomes worth far more than the sum of its parts.

At the same time, the issue of who actually owns data becomes commercially significant. Do I have any rights to data about my shopping habits, for example? There are many facts about ourselves that are simply public, whether we like it or not. If I walk down a public street, anybody can see how tall I am, guess my age, weight, probably work out my gender, social status, where I buy my clothes, even such “personal” details as whether I am confident or nervous. If they then observe that I go into a certain supermarket and purchase several bags of shopping, do I have any right to demand that they “forget” or do not use such observations?

New data, new laws?

It was repeatedly stated that the law as it stands is not keeping up with the implications of technological change. It was suggested that we need to re-think laws about privacy, intellectual property, and personal data.

It occurred to me that we may need laws that deal with malicious use of data, rather than ownership of data. I don’t mind people merely seeing me when I walk down the street, but I don’t want them shouting out observations about me, following me home, or trying to sell me things, as in the “Minority Report” scenario of street signs acting like market hawkers, calling out your name as you walk by.

What sort of a place is the Internet?

Technological change has always provoked psychological and political unease, and some speakers mentioned that younger people are simply adapting to the idea that the online space is a completely open public space. The idea that “on the Internet, no-one knows you are a dog” will be seen as a temporary quirk – a rather quaint notion amongst a few early idealists. Nowadays, not only does everyone know you are a dog, they know which other dogs you hang out with, what your favourite dog food is, and when you last went to the vet.

The focus of the evening seemed to be on how to make marketing more effective, with a few mentions of using Big Data to drive business process efficiencies. A few examples of how Big Data analytics can be used to promote social goods, such as monitoring outbreaks of disease, were also offered.

There were clear differences in attitudes. Some people wanted to keep their data private, and accept in return less personalised marketing. They also seemed to be more willing to pay for ad-free services. Others were far more concerned that data about them should be accurate and they wanted easy ways of correcting their own records. This was not just to ensure factual accuracies, but also because they wanted targeted, personalised advertising and so actively wanted to engage with companies to tell them their preferences and interests. They were quite happy with “Minority Report” style personalisation, provided that it was really good at offering them products they genuinely wanted. They were remarkably intolerant of “mistakes”. The complaint “I bought a book as a present for a friend on Amazon about something I have no interest in, now all it recommends to me are more books on that subject” was common. Off-target recommendations seemed to upset people far more than the thought of companies amassing vast data sets in the first place.

Lifting the lid of the Big Data black box

The issue that I like to raise in these discussions is one that Knowledge Organisation theorists have been concerned about for some time – that we build hidden biases so deeply into our data collection methods, our algorithms, and processes, that our analyses of Big Data only ever give us answers we already knew.

We already know you are more likely to sell luxury cars to people who live in affluent areas, and we already know where those areas are. If all our Big Data analysis does is refine the granularity of this information, it probably won’t gain us that many more sales or improve our lives. If we want Big Data to do more for us, we need to ask better questions – questions that will challenge rather than confirm our existing prejudices and assumptions and promote innovation and creativity, not easy questions that merely consolidate the status quo.

Big Data

    2 comments 
Estimated reading time 6–9 minutes

I recently attended a gem of an event at the British Computer Society. There were three top class speakers and great networking with interesting people afterwards. I am always a little concerned that I’m not “hard” IT enough for these events, but metadata and taxonomies are such a cross-domain topic I always end up in fascinating conversations.

Why Information Professionals should care about Big Data

There are huge skills shortages in Big Data, mainly on the understanding and interpretation side, rather than the IT angle. The feeling seemed to be that knowing how to build the machines to do the work is the easy bit, but understanding the meanings of data, avoiding the semantic traps and pitfalls, the interpretation of statistics, and above all the appreciation of risk in order to make sound judgement calls are far harder. I believe this is where information scientists and professionals, taxonomists and ontologists, librarians, archivists, and historians have a vital role to play. Our skills of understanding and interpretation are vital, so we need to step up and embrace Big Data collections as a new form of libraries and archives.

Big Data is scary because there is so much of it, it changes so fast, and it has the power to create feedback loops. An especial danger, and one that is familiar to classificationists (as Bowker and Star discuss in their book Sorting Things Out), is that interpretations, biases, and assumptions can be “hard coded” into raw data right from the point of collection. We know that the way you ask a survey question skews the answers you are likely to get, and which data sets you decide to collect and compare will lead you to certain interpretations above others. Make the wrong choices and the algorithms will send you down false alleyways while appearing to be coldly neutral and scientific (a theme that Jonah Bossewitch inspired me to think about). Miss a key factor, and you will make an apparently unarguable leap to a wrong conclusion.

For example, if you see some statistics that indicate a recent increase in clickthroughs to advertisements, you might conclude that your new mobile marketing strategy is a success. However, further investigation might reveal that most of those clickthroughs don’t result in a sale. If you look again you might see that most of the clickthroughs are followed by people immediately reverting back to the site they were on before, making it far more likely that they are just accidentally clicking on adverts. If you refine the results to identify those who were using a specific new device, it seems likely most of the hits were because people were getting used to an unfamiliar touchscreen. Your apparently successful marketing strategy might in fact just be annoying your users.

How Big Data is changing the world

The first speaker, John Morton, Chief Technology Officer, SAS UK, described how increases in processing power have meant that data capture, collection, and analysis are now taking place on a mind-boggling scale at incredible speeds. One example of how this has changed over the last decade is a processing job that used to take 167 hours can now be carried out in 84 seconds. At the same time, huge amounts of unstructured and unmanaged data that used to be uncollectable ephemera are now being stored and can be analysed. He spoke of an urgent need for “data curators” and suggested they would be the librarians of the future, as without management and curation the value of the data is lost. Organisations are typically only manging 5% of their data. Issues familiar to librarians and archivists, such as the importance of metadata, quality tracking, and provenance, are vital in a Big Data world, where the quality of your analysis depends on the quality of the data you mine. Much of this quality lies within the structure of the data and its metadata.

Big Data and Artificial Intelligence

Peter Waggett, Emerging Technology Programme Leader, IBM, talked about how Big Data analysis was used to power the “intelligence” of Watson, the computer that won the US quiz game Jeopardy, beating human contestants. The mistakes Watson made were in many ways more interesting than the questions it answered correctly, as was the speed of answering. Watson always needed some processing time, whereas humans were much faster when they just “knew” the answer.

Although Watson just seems like fun, there are many examples of how Big Data techniques can be used in practice. The retail industry is one of the leaders in Big Data analysis, using data on shopping behaviour gathered from sources like loyalty cards and online shopping orders. Some companies are now using RFID chips (e.g. clothing company Zara) to gather data about the physical movement of products.

(I wondered if retailers were leaping ahead because they can use Big Data to reap rewards with comparatively little risk. In retail, there are huge profits to be made by better tailoring stock to consumer demands, but mistakes are generally not disastrous – a product line that doesn’t sell well and has to be dropped is not a new problem and is one that business models and buyers’ expertise already make allowances for.)

Another example of Big Data analysis is the football team AC Milan, where analysing data about players’ physiology and movements has helped predict injury rates and manage players in order to minimise risks.

The Internet of Things is going to generate even more Big Data and understanding its applications in new arenas – sporting or otherwise – is going to be a huge challenge for managers of the future.

Big Data and Bias

Brijesh Malkan, Founding Partner, EGO Ventures, highlighted some issues to be resolved as we move into a Big Data age. The nature of privacy is changing as oceans of personal data sweep around the world (Facebook already knows you), and so organisations are going to need transparent ethical policies about handling such data. We have reached our “cognitive limits” with so much data to read and so information visualisation is going to be of increasing importance. Data quality also needs to be managed if data mining techniques are to be effective and if algorithmic processing is to produce sensible, useful results. Brijesh talked about “Bayesian filters” and “cognitive scrubbers” to help compensate for biases in data, whether these biases are embedded in the data capture process, in the choices of data used, in the algorithms processing the data, or ultimately in the decisions made by the humans who are interpreting the data.

He spoke of the need for more understanding of psychology, especially of groupthink, echo chambers, and risk perception. Financial markets in particular are prone to “stampede” behaviour, creating bubbles and panics in markets. Data mining of social networking can be prone to creating feedback loops and encouraging risky behaviour. He also spoke of a desperate shortage of people who understand statistics and probability, even within the scientific community.

(This reminded me of the question Patrick Lambe raised at the ISKO UK conference in the summer, asking how information professionals can do something useful for science and for society. Understanding how to interpret and capture data and account for biases, explaining how easy it is to manipulate people’s perceptions through the way information is presented, and teaching how knowledge requires judgement as well as number crunching would seem to be skills that we can offer to the Big Data world already.)