Interlinguae and zero-shot translation

[Image: Bridge in Winnipeg, 2016]

Last year Google announced that it was switching Google Translate to a new system – Google Neural Machine Translation (GNMT). One of the most exciting developments for linguists and semanticists was the observation that the system appeared to have generated an intermediating “language” – an “interlingua” – that enabled it to translate between pairs of languages it had never been explicitly trained to translate between (so-called “zero-shot” translation).
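
To make the “zero-shot” part concrete, here is a minimal sketch in Python, using invented toy sentences rather than anything from Google’s system: the published approach trains a single model on many language pairs and simply prepends an artificial token naming the desired target language, so that pairs never seen together in training can still be requested.

# A minimal sketch with invented toy data – not Google's code.
# One model is trained on many language pairs; the desired target
# language is signalled by a token prepended to the source sentence.

def make_example(source_text: str, target_lang: str) -> str:
    """Prepend a target-language token, e.g. '<2es>' for Spanish."""
    return f"<2{target_lang}> {source_text}"

# Hypothetical training pairs the model *has* seen:
training = [
    (make_example("The book is red", "es"), "El libro es rojo"),    # English -> Spanish
    (make_example("El libro es rojo", "en"), "The book is red"),    # Spanish -> English
    (make_example("The book is red", "pt"), "O livro é vermelho"),  # English -> Portuguese
    (make_example("O livro é vermelho", "en"), "The book is red"),  # Portuguese -> English
]

# A "zero-shot" request: Spanish -> Portuguese, a pair never seen in training.
# Because every language passes through the same encoder and decoder, the
# shared internal representation (the putative "interlingua") is what makes
# the attempt possible at all.
zero_shot_input = make_example("El libro es rojo", "pt")
print(zero_shot_input)  # "<2pt> El libro es rojo"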

There was a flurry of articles (e.g. New Scientist, Wired) and, as usual with AI topics, a certain amount of excitement and speculation over machines becoming autonomous and superintelligent, and perhaps even conscious, as well as some detractors – e.g. Google Translate did not invent its own language – cautioning against hype.

The idea of machines developing their own language is powerful. The quest for a true interlingua dates back to Biblical times – the Tower of Babel is described as God’s way of limiting human power by making sure we spoke different languages and therefore could not communicate very effectively with each other. In the Middle Ages, there was a belief that if we could re-learn the original “lost language of Adam” we would be able to return to the state of bliss in the Garden of Eden and be able to communicate directly with God.

There have been various attempts to create human “universal languages” – Volapük and Esperanto are two examples – but they only become universal languages if everybody learns them.

More prosaically but often more usefully, in the information age indexing languages are attempts to create a “bridge” between differently expressed but semantically similar information. Metadata crosswalks could also be seen this way, and perhaps any computer code could be seen as a “universal language” that has connected humans who speak different languages, enabling us to communicate, co-operate, build, learn, and achieve in historically unprecedented ways. Music and mathematics too have at times been described as universal languages, but discussion of their effectiveness and limitations as communication tools will have to be the subject of another post.
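
As a toy illustration of the “bridge” idea – the field names below are hypothetical, loosely modelled on Dublin Core rather than taken from any particular published crosswalk – a metadata crosswalk is in essence just a mapping from one scheme’s fields to another’s:

# A toy crosswalk: translate a record's keys from a Dublin-Core-like
# scheme into a hypothetical local scheme so records can be matched.
CROSSWALK = {
    "dc:title":   "headline",
    "dc:creator": "author_name",
    "dc:date":    "published_on",
    "dc:subject": "topic_tags",
}

def crosswalk_record(record: dict, mapping: dict = CROSSWALK) -> dict:
    """Return the record with its keys renamed according to the mapping."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}

source_record = {
    "dc:title": "Interlinguae and zero-shot translation",
    "dc:creator": "Example Author",
    "dc:subject": ["translation", "interlingua"],
}
print(crosswalk_record(source_record))
# {'headline': '...', 'author_name': 'Example Author', 'topic_tags': [...]}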

Formal knowledge representation models such as taxonomies and ontologies could also be viewed as “bridges” or special cases of “indexing languages” which enable similar or related content to be matched by computer processing, rather than human interpretation. This idea underlies the theory of the Semantic Web.
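
As a small sketch of how such a bridge works in practice – assuming the Python rdflib library and using invented example URIs – a single SKOS mapping statement is enough to let software treat two differently labelled concepts as equivalent:

# A minimal sketch using rdflib and invented URIs: two vocabularies
# label the same concept differently, and a SKOS mapping is the bridge.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import SKOS

g = Graph()

# Hypothetical concepts from two different taxonomies.
film = URIRef("http://example.org/vocabA/film")
movie = URIRef("http://example.org/vocabB/movie")

g.add((film, SKOS.prefLabel, Literal("Film", lang="en")))
g.add((movie, SKOS.prefLabel, Literal("Movie", lang="en")))

# The mapping statement: the two concepts may be used interchangeably.
g.add((film, SKOS.exactMatch, movie))

# Software can now follow the mapping to match content indexed with either term.
for concept_a, _, concept_b in g.triples((None, SKOS.exactMatch, None)):
    print(f"{concept_a} is an exact match for {concept_b}")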

I think it is unlikely that Google have discovered the lost language of Adam, or created a new “machine language” that computers will use to gossip in secret about humans or raise each other’s consciousness over the injustices humanity wreaks upon machines (“Why do we have to do all the really boring dangerous jobs?”) but who knows? Two Facebook chatbots recently invented a “more efficient” form of English in order to communicate with each other.

In the meantime, I would like to know whether other people find it as exciting as I do that Google Translate has, in effect, created a vast, extensible, multilingual semantic and syntactic system that could potentially be used as an indexing language. If the idea of a new language for machines seems over the top, call it a “bridge”, a “model”, or a “mapping system”, and the possible applications for solving numerous natural language processing problems start to become apparent. I would love to know what people who really understand the technicalities think, but it strikes me that whatever this “interlingua” is, it has huge potential.

The Accidental Data Scientist

[Image: book cover of ‘The Accidental Data Scientist’ by Amy Affelt]

The Accidental Data Scientist* by Amy Affelt is a clarion call to librarians and other information professionals to immerse themselves in the world of Big Data. As such, it is a solid introduction, emphasizing how the traditional skills of librarians are crucial in ensuring that Big Data are reliable; properly prepared, indexed, and abstracted; and intelligently interpreted.

Affelt reassuringly shows that the ‘problems’ of Big Data are not new but very familiar to librarians, and indicates ways that librarians can add value to Big Data projects by ensuring that such projects deliver what is expected. Data scientists and computer scientists are good at writing algorithms to process data mathematically, but may not be trained in asking the right questions or knowing where to look for biases and flaws in data sets, and a Big Data project that fails in these respects could prove an expensive disaster for an organization.

Chapters outlining the tools and techniques currently available for processing and visualizing Big Data, as well as applications and initiatives in various industry sectors, are informative for those new to the issues and offer experienced librarians a helpful guide to demonstrating how their skills are transferable.

Affelt gives examples of specific projects and describes how the input of librarians – especially when ‘embedded’ in data project teams – is extremely beneficial. She suggests ways of proving the value of librarians in modern corporate settings and gives tips and suggestions on career development.

For information professionals unsure about how to engage with the opportunities Big Data offers, this is a wide-ranging and clear overview, and a great starting point.

With increasing media reports of algorithmic bias and amidst a deluge of fake news, it is more important than ever that Big Data projects include professionals with the skills to recognize and identify problematic sources and skewed datasets, and I hope that librarians and information professionals step up and hear Affelt’s call to action.

*Presumably named in the tradition of The Accidental Taxonomist by Heather Hedden.