
The Accidental Data Scientist


The Accidental Data Scientist* by Amy Affelt is a clarion call to librarians and other information professionals to immerse themselves in the world of Big Data. As such, it is a solid introduction, emphasizing how the traditional skills of librarians are crucial in ensuring that Big Data are reliable, properly prepared, indexed, and abstracted, and intelligently interpreted.

Affelt reassuringly shows that the ‘problems’ of Big Data are not new, but very familiar to librarians, and indicates ways that librarians can add value to Big Data projects by ensuring that such projects deliver what is expected. Data scientists and computer scientists are good at writing algorithms to process data mathematically, but may not be trained in asking the right questions or knowing where to look for biases and flaws in data sets. A Big Data project that fails in these respects could prove an expensive disaster for an organization.

The chapters outlining the tools and techniques currently available for processing and visualizing Big Data, and the applications and initiatives in various industry sectors, are informative for those new to the issues. They are also a helpful guide for experienced librarians who want to demonstrate how their skills are transferable.

Affelt gives examples of specific projects and describes how the input of librarians – especially when ‘embedded’ in data project teams – is extremely beneficial. She suggests ways of proving the value of librarians in modern corporate settings and gives tips and suggestions on career development.

For information professionals unsure about how to engage with the opportunities Big Data offers, this is a wide-ranging and clear overview, and a great starting point.

With increasing media reports of algorithmic bias and amidst a deluge of fake news, it is more important than ever that Big Data projects include professionals with the skills to identify problematic sources and skewed datasets. I hope that librarians and information professionals hear Affelt’s call to action and step up.

*Presumably named in the tradition of The Accidental Taxonomist by Heather Hedden.

Data as a liquid asset and the AI future


Getting back into the swing of meetups again, last night I went to the MTLData meetup – a group of data scientists and enthusiasts who are looking to raise the profile of data science in Montreal. The event featured a panel discussion on the topic of ‘Build vs Buy?’ when considering software for data solutions.

The panellists were Marc-Antoine Ross, Director of Data Engineering at Intel Security, Maxime Leroux, consulting data scientist at Keyrus Canada, and Jeremy Barnes, Chief Architect at Element AI. The chair was Vaughan DiMarco of Vonalytics.

Data as liquid

The issues were very familiar to me from considering EDRM and DAM systems, which made me think about how data has changed as an asset, and how managing and securing it now has to take account of its ‘liquid’ nature. This adds another layer of complexity. Data still needs to be archived as a ‘record’ for many reasons (regulatory compliance, business continuity, archival value…) but for a data-driven organisation, the days of rolling back to ‘yesterday’s version of the database’ seem like ancient history. Data assets are also complex in that they are subject to many levels of continuous processing, so the software that manages the processing also has to be robust.

The metaphor of data flowing around the organisation like water seems especially telling. If there is a system failure, you can’t necessarily just turn off the tap of data, and so your contingency plans need to include some kind of ‘emergency reservoir’ so that data that can’t be processed immediately does not get lost and the flow can be re-established easily.
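To make the ‘emergency reservoir’ idea a little more concrete, here is a minimal sketch in Python. It is my own illustration rather than anything shown at the meetup: the `Reservoir` class, its `capacity` limit, and the use of `ConnectionError` to signal that the downstream system is unavailable are all assumptions made for the example.

```python
from collections import deque


class Reservoir:
    """Holds incoming records while the downstream processor is unavailable.

    Hypothetical illustration: `process` is any callable that raises
    ConnectionError when the downstream system is down.
    """

    def __init__(self, process, capacity=1000):
        self.process = process      # downstream handler (assumed interface)
        self.capacity = capacity    # most records we can hold during an outage
        self.pending = deque()      # records waiting to be processed, in order
        self.overflowed = 0         # records we had no room to hold

    def drain(self):
        """Process buffered records in arrival order; stop at the first failure."""
        while self.pending:
            try:
                self.process(self.pending[0])
            except ConnectionError:
                return False        # still down, so keep everything buffered
            self.pending.popleft()  # remove only after successful processing
        return True

    def receive(self, record):
        """Accept a record, buffering it if it cannot be processed yet."""
        if len(self.pending) >= self.capacity and not self.drain():
            # Reservoir is full and the outage continues. A real system
            # would spill to disk or raise an alert rather than just count.
            self.overflowed += 1
            return False
        self.pending.append(record)
        self.drain()
        return True
```

The point of the sketch is that a record is only removed from the reservoir once it has been processed successfully, so when the ‘tap’ cannot be turned off, the flow backs up in order instead of being lost, and calling `drain()` after recovery re-establishes it.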

Build vs Buy?

The issues highlighted by the panel included costs (available budget, restrictions from finance departments, and the balance between in-house and outsourced spending, again all familiar in EDRM and DAM procurement), privacy, security, ability to maintain a system, and availability of skills. Essentially, the decision is about balancing risks, which will be unique to each team and each business. In terms of deciding whether to build something in-house, availability of in-house resource is an obvious consideration, but Marc-Antoine stressed the importance of thinking through what added value a bespoke build could offer, as opposed to other ways the team could be spending their time. For example, if there are no off-the-shelf or open source products that match requirements, if there is value in owning the IP of a new product, if risks can be kept low, and if resources are available, a build might be worthwhile.

There are risks associated with all three of the main options. A big vendor is less likely to go bust, but vendors can be acquired, can stop supporting a product or particular features, and can be very costly. Open source has the advantage of being free, but relies on ad hoc communities to maintain and update the code base, and how vibrant and responsive each community is, or will remain, can vary. Open source can be a good option for low-risk projects, such as proof-of-concept work, or for risk-tolerant startups with plenty of in-house expertise to handle the open source code themselves.

AI future

The conversation diverged into a discussion of the future of AI, which everyone seemed to agree was going to become a standard tool for most businesses eventually. Jeremy noted that AI at the moment is being sought after for its scarcity value, to give early adopters an edge over the competition, while Maxime suggested that early advantage is likely to fade, just as it has with data science. Data analysis is now so ubiquitous that even small businesses are involved to a certain extent. Jeremy pointed out that it is hard to maintain a competitive edge based on the scarcity of data itself, as data can so easily be copied and distributed, but knowing how to make intelligent use of the data is a scarce commodity. Making connections and managing data in a very tailored, specific way could even be a way for organisations to compete with Google, who have more data than anyone else, but are not necessarily able to answer all questions or have the most useful insights into specific problems.

The value of meaning

I was intrigued by this, as it validates the role of semantics – data without meaning is useless – and the importance of the imaginative and creative leaps that humans can make, as well as the moral and social reasoning that humans can bring. With reports of early AI systems reflecting existing biases and prejudices, and with disasters like the SimSimi chatbot causing social problems such as bullying amongst youngsters, the need for a real human heart to accompany artificial intelligence seems ever more important.

Scarcity of understanding?

Someone asked if the panel thought companies would soon need ‘Chief Intelligence Officers’ in the way that many now have ‘Chief Data Officers’. The panel did not seem particularly enthusiastic about the idea (“it sounds like something that you do with founders when you put them out to pasture”) but I think it would be a fascinating role. The BBC had someone to oversee and advise on editorial ethics issues. Perhaps it is in the skills of a Chief Intelligence Officer – someone who can combine an understanding of how data, information, knowledge and wisdom interact, whether within IT systems or beyond, with an understanding of social issues and problems – that the scarcity value lies. Insight, imagination, and compassion could be the skills that will give the competitive edge. In the AI future, could a Chief Intelligence Officer make the difference between a company that succeeds by asking the right questions, not just of its data or its customers, but of itself, and one that fails?