Can you use statistics to find meaning?


I enjoyed this article in New Scientist about using statistical analysis on the Voynich manuscript to try to work out whether it is a meaningful but secret code or just gibberish.

Ultimately, I remain puzzled as to what the statistics actually tell us. They identify patterns, but meaning is more than simply patterns. However, the fact that certain sets of symbols in the Voynich text appear to cluster in sections with common illustrations suggests it is a code. The counter-argument that you could deliberately fake such clustering by mechanical means is intriguing. Without far larger samples, and an understanding of how often clusters arise by chance, I have no idea whether this sort of faking would produce the same patterns as natural language. I am sure clusters must appear all over the place without bearing any meaning whatsoever.
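To make the "clusters appear by chance" worry a little more concrete, here is a minimal sketch of a shuffle test in Python. It is my own illustration, not the method used in the article: it asks how often a randomly shuffled text concentrates a word in one section at least as strongly as the real text does. The function names, the token "qo", and the three section labels are all made up for the example, and the test only compares against pure random shuffling, so it says nothing about whether a mechanical faking method could produce the same pattern.

```python
import random
from collections import Counter


def top_section_share(tokens, sections, word):
    """Fraction of `word` occurrences that fall in its most common section."""
    counts = Counter(sec for tok, sec in zip(tokens, sections) if tok == word)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


def shuffle_p_value(tokens, sections, word, trials=2000, seed=0):
    """Return the observed share and the fraction of random shuffles
    that cluster `word` at least as tightly as the observed text."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = top_section_share(tokens, sections, word)
    shuffled = list(tokens)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any link between words and sections
        if top_section_share(shuffled, sections, word) >= observed:
            hits += 1
    return observed, hits / trials


# Toy, invented data: three sections of 20 tokens each.
sections = ["herbal"] * 20 + ["astro"] * 20 + ["recipes"] * 20

# Case 1: all six "qo" tokens sit in the first section.
clustered = ["qo"] * 6 + ["x"] * 54

# Case 2: the six "qo" tokens are spread evenly, two per section.
scattered = (["qo"] * 2 + ["x"] * 18) * 3

if __name__ == "__main__":
    print(shuffle_p_value(clustered, sections, "qo"))
    print(shuffle_p_value(scattered, sections, "qo"))
```

A small second value for the clustered case would suggest that shuffling alone rarely produces such tight clustering, whereas the evenly spread case is matched by every shuffle. Neither result tells us whether the clustering carries meaning, which is rather the point.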

I also thought it was interesting that one of the arguments in favour of gibberish was that there were no mistakes. It strikes me there could be many reasons for the lack of proofing and correction, and I would want to know more about the rate of correction in similar works before I could assess that argument. I know that standardization of spelling came relatively late, so presumably before then far more “mistakes” would have been tolerated.

Nevertheless, it is a fascinating mystery, and one that perhaps can be resolved not by analysis but only by the chance discovery of the key (if it exists!). If it is gibberish, perhaps we will never know. Either way, I am sure it would have amused the author to know that their work would still be a controversial topic hundreds of years after it was written!

This time it’s personal data – Indiverses and Personal APIs


Sooner or later I was bound to find some other Semanticists in Canada and on Thursday I attended a Semantic Web meetup in Montreal. The audience was small, but that led to more of a group discussion atmosphere than a formal talk. The presenter, Dr Joan Yess Kahn, has coined the term Indiverse – Individual Information Universe – to facilitate her thinking about the set of personal information and data that we accumulate through our lives.

She pointed out that some of this information is created by us, some about us, some with our knowledge and consent, some without, and our entire digital lives can be stolen and abused. She made some interesting observations about how our personal and public information spaces were essentially one and the same before the industrial revolution, when most people’s work and home lives were intertwined (e.g. artisans living in their workshops), and that changes such as the industrial revolution and public education split those apart as people left home to work somewhere else. However, in the information age more people are returning to working from home while others are increasingly using their computers at work to carry out personal tasks, such as online shopping.

This blurring of the public and private has many social and commercial implications. We discussed the potential monetary value of personal attention and intention data to advertisers, and implications for surveillance of individuals by governments and other organizations.

We also talked about information overload and information anxiety. Joan has written about ways of categorizing, indexing, and managing our personal information – our address books, calendars, to-do lists, etc. – and this led us to consider how to construct sharable, standardized Personal Data Lockers (for example The Locker Project) and how to take back control of our online identity and information management, for example by shifting from Customer Relationship Management (CRM) to Vendor Relationship Management (VRM).

In previous posts I have talked about our need to become our own personal digital archivists as well, and Mark sent me a link to a Personal API developed by Naveen. This takes personal information curation to the data level, as Naveen is seeking an easy way to manage the huge amounts of data that he generates simply by being a person in the world – his fitness routines, diet, etc.

There is a clear convergence here with the work of medical innovators such as Patients Know Best on electronic patient health records. The moral and social implications of who is responsible for curating and protecting such data are huge and wide-ranging. At the moment doting parents using apps to monitor their babies, or fitness enthusiasts using apps such as MapMyRun, are doing this for fun, but will we start seeing this as a social duty? Will we have right-wing campaigns to deny treatment to people who have failed to look after their health data, or mass class actions to sue hospitals that get hacked? If you think biometric passports are information dense, just wait until every heartbeat from ultrasound to grave is encoded somewhere in your Indiverse.