I recently attended a gem of an event at the British Computer Society. There were three top-class speakers and great networking with interesting people afterwards. I am always a little concerned that I’m not “hard” IT enough for these events, but metadata and taxonomies are such a cross-domain topic that I always end up in fascinating conversations.

Why Information Professionals should care about Big Data

There are huge skills shortages in Big Data, mainly on the understanding and interpretation side, rather than the IT angle. The feeling seemed to be that knowing how to build the machines to do the work is the easy bit, but understanding the meanings of data, avoiding the semantic traps and pitfalls, the interpretation of statistics, and above all the appreciation of risk in order to make sound judgement calls are far harder. I believe this is where information scientists and professionals, taxonomists and ontologists, librarians, archivists, and historians have a vital role to play. Our skills of understanding and interpretation are vital, so we need to step up and embrace Big Data collections as a new form of libraries and archives.

Big Data is scary because there is so much of it, it changes so fast, and it has the power to create feedback loops. An especial danger, and one that is familiar to classificationists (as Bowker and Star discuss in their book Sorting Things Out), is that interpretations, biases, and assumptions can be “hard coded” into raw data right from the point of collection. We know that the way you ask a survey question skews the answers you are likely to get, and which data sets you decide to collect and compare will lead you to certain interpretations above others. Make the wrong choices and the algorithms will send you down false alleyways while appearing to be coldly neutral and scientific (a theme that Jonah Bossewitch inspired me to think about). Miss a key factor, and you will make an apparently unarguable leap to a wrong conclusion.

For example, if you see some statistics that indicate a recent increase in clickthroughs to advertisements, you might conclude that your new mobile marketing strategy is a success. However, further investigation might reveal that most of those clickthroughs don’t result in a sale. If you look again you might see that most of the clickthroughs are followed by people immediately returning to the site they were on before, making it far more likely that they are just accidentally clicking on adverts. If you then refine the results by device and find that most of the hits came from people using a specific new device, it seems likely that they were simply getting used to an unfamiliar touchscreen. Your apparently successful marketing strategy might in fact just be annoying your users.
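To make that chain of reasoning concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the click log, the device names, and the helper functions `summarise` and `summarise_by_device` are hypothetical, not data or tools from the event. It simply shows how a healthy-looking headline count can hide a segment where almost every click bounces straight back and none leads to a sale.

```python
from collections import defaultdict

# Hypothetical click log, invented for illustration only.
# Each record is (device, bounced_back_immediately, led_to_sale).
CLICKS = [
    ("new_touchscreen", True, False),
    ("new_touchscreen", True, False),
    ("new_touchscreen", True, False),
    ("new_touchscreen", False, False),
    ("desktop", False, True),
    ("desktop", False, False),
    ("older_phone", True, False),
    ("older_phone", False, True),
]


def summarise(clicks):
    """Return (total_clicks, sale_rate, bounce_rate) for a list of records."""
    total = len(clicks)
    if total == 0:
        return 0, 0.0, 0.0
    sales = sum(1 for _, _, sold in clicks if sold)
    bounces = sum(1 for _, bounced, _ in clicks if bounced)
    return total, sales / total, bounces / total


def summarise_by_device(clicks):
    """Group the records by device and summarise each group separately."""
    groups = defaultdict(list)
    for record in clicks:
        groups[record[0]].append(record)
    return {device: summarise(records) for device, records in groups.items()}


if __name__ == "__main__":
    # The headline figure: total clicks look encouraging on their own.
    print("all devices:", summarise(CLICKS))
    # The breakdown: one device accounts for most of the bounces.
    for device, stats in sorted(summarise_by_device(CLICKS).items()):
        print(device, stats)
```

In this made-up log the new touchscreen device supplies half of all clicks, yet three quarters of them bounce and none converts. The code does not decide what that means; it only surfaces the pattern so that a person can make the judgement call.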

How Big Data is changing the world

The first speaker, John Morton, Chief Technology Officer, SAS UK, described how increases in processing power have meant that data capture, collection, and analysis are now taking place on a mind-boggling scale at incredible speeds. One example of how this has changed over the last decade is a processing job that used to take 167 hours and can now be carried out in 84 seconds, roughly 7,000 times faster. At the same time, huge amounts of unstructured and unmanaged data that used to be uncollectable ephemera are now being stored and can be analysed. He spoke of an urgent need for “data curators” and suggested they would be the librarians of the future, as without management and curation the value of the data is lost. Organisations are typically only managing 5% of their data. Issues familiar to librarians and archivists, such as the importance of metadata, quality tracking, and provenance, are vital in a Big Data world, where the quality of your analysis depends on the quality of the data you mine. Much of this quality lies within the structure of the data and its metadata.

Big Data and Artificial Intelligence

Peter Waggett, Emerging Technology Programme Leader, IBM, talked about how Big Data analysis was used to power the “intelligence” of Watson, the computer that won the US quiz game Jeopardy!, beating human contestants. The mistakes Watson made were in many ways more interesting than the questions it answered correctly, and so was the difference in answering speed. Watson always needed some processing time, whereas humans were much faster when they just “knew” the answer.

Although Watson just seems like fun, there are many examples of how Big Data techniques can be used in practice. The retail industry is one of the leaders in Big Data analysis, using data on shopping behaviour gathered from sources like loyalty cards and online shopping orders. Some companies (e.g. clothing company Zara) are now using RFID chips to gather data about the physical movement of products.

(I wondered if retailers were leaping ahead because they can use Big Data to reap rewards with comparatively little risk. In retail, there are huge profits to be made by better tailoring stock to consumer demands, but mistakes are generally not disastrous – a product line that doesn’t sell well and has to be dropped is not a new problem and is one that business models and buyers’ expertise already make allowances for.)

Another example of Big Data analysis is the football team AC Milan, where analysing data about players’ physiology and movements has helped predict injury rates and manage players in order to minimise risks.

The Internet of Things is going to generate even more Big Data, and understanding its applications in new arenas – sporting or otherwise – is going to be a huge challenge for managers of the future.

Big Data and Bias

Brijesh Malkan, Founding Partner, EGO Ventures, highlighted some issues to be resolved as we move into a Big Data age. The nature of privacy is changing as oceans of personal data sweep around the world (Facebook already knows you), and so organisations are going to need transparent ethical policies about handling such data. We have reached our “cognitive limits” with so much data to read and so information visualisation is going to be of increasing importance. Data quality also needs to be managed if data mining techniques are to be effective and if algorithmic processing is to produce sensible, useful results. Brijesh talked about “Bayesian filters” and “cognitive scrubbers” to help compensate for biases in data, whether these biases are embedded in the data capture process, in the choices of data used, in the algorithms processing the data, or ultimately in the decisions made by the humans who are interpreting the data.

He spoke of the need for more understanding of psychology, especially of groupthink, echo chambers, and risk perception. Financial markets in particular are prone to “stampede” behaviour, creating bubbles and panics. Data mining of social networks can be prone to creating feedback loops and encouraging risky behaviour. He also spoke of a desperate shortage of people who understand statistics and probability, even within the scientific community.

(This reminded me of the question Patrick Lambe raised at the ISKO UK conference in the summer, asking how information professionals can do something useful for science and for society. Understanding how to interpret and capture data and account for biases, explaining how easy it is to manipulate people’s perceptions through the way information is presented, and teaching how knowledge requires judgement as well as number crunching would seem to be skills that we can offer to the Big Data world already.)