
Interlinguae and zero-shot translation

Bridge in Winnipeg, 2016

Last year Google announced that it was switching Google Translate to a new system – Google Neural Machine Translation (GNMT). One of the most exciting developments for linguists and semanticists was the observation that the system appeared to have generated an intermediating “language” – an “interlingua” – that enabled it to translate two previously untranslated languages.
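The "interlingua" idea can be illustrated with a deliberately tiny sketch (this is not Google's actual system, and the vectors below are hand-crafted stand-ins for learned embeddings): if words from several languages live in one shared vector space, a language pair that was never directly paired in training can still be bridged by nearest-neighbour lookup.

```python
# Toy sketch of zero-shot translation via a shared "interlingua" space.
# The embeddings are invented for illustration.
import math

shared_space = {
    ("en", "dog"): (0.9, 0.1),
    ("en", "cat"): (0.1, 0.9),
    ("fr", "chien"): (0.88, 0.12),
    ("fr", "chat"): (0.12, 0.88),
    ("ja", "inu"): (0.91, 0.09),
    ("ja", "neko"): (0.09, 0.91),
}

def translate(word, src, tgt):
    """Map a word into the shared space, then return the nearest
    neighbour among the target language's words."""
    v = shared_space[(src, word)]
    candidates = [(w, vec) for (lang, w), vec in shared_space.items() if lang == tgt]
    return min(candidates, key=lambda c: math.dist(v, c[1]))[0]

# French and Japanese were never paired directly, yet the shared space bridges them.
print(translate("chien", "fr", "ja"))  # -> inu
```

In the real system the shared space emerges from training on many language pairs at once; here it is simply asserted, which is precisely what makes the emergent version so intriguing.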

There was a flurry of articles (e.g. New Scientist, Wired) and, as usual with AI topics, a certain amount of excitement and speculation over machines becoming autonomous and superintelligent, and perhaps even conscious, as well as some detractors – e.g. Google translate did not invent its own language – cautioning against hype.

The idea of machines developing their own language is powerful. The quest for a true interlingua dates back to Biblical times – the Tower of Babel is described as God’s way of limiting human power by making sure we spoke different languages and therefore could not communicate very effectively with each other. In the Middle Ages, there was a belief that if we could re-learn the original “lost language of Adam” we would be able to return to the state of bliss in the Garden of Eden and be able to communicate directly with God.

There have been various attempts to create human “universal languages” – Volapük and Esperanto are two examples – but they only become universal languages if everybody learns them.

More prosaically but often more usefully, in the information age indexing languages are attempts to create a “bridge” between differently expressed but semantically similar information. Metadata crosswalks could also be seen this way, and perhaps any computer code could be seen as a “universal language” that has connected humans who speak different languages, enabling us to communicate, co-operate, build, learn, and achieve in historically unprecedented ways. Music and mathematics too have at times been described as universal languages, but discussion of their effectiveness and limitations as communications tools will have to be the subject of another post.

Formal knowledge representation models such as taxonomies and ontologies could also be viewed as “bridges” or special cases of “indexing languages” which enable similar or related content to be matched by computer processing, rather than human interpretation. This idea underlies the theory of the Semantic Web.

I think it is unlikely that Google have discovered the lost language of Adam, or created a new “machine language” that computers will use to gossip in secret about humans or raise each other’s consciousness over the injustices humanity wreaks upon machines (“Why do we have to do all the really boring dangerous jobs?”) but who knows? Two Facebook chatbots recently invented a “more efficient” form of English in order to communicate with each other.

In the meantime, I would like to know whether other people also think Google Translate’s creation of what is presumably a vast multi-lingual extensible semantic and syntactic system that could potentially be used as an indexing language is extremely exciting. If the idea of a new language for machines seems over the top, call it a “bridge”, a “model”, or a “mapping system” and surely the possible applications of it for solving numerous natural language processing problems start to become apparent? I would love to know what people who really understand the technicalities think, but it strikes me that whatever this “interlingua” is, it has huge potential.

Inadvertent Cruelty – Algorithmic or Organizational?


In 2013 I asked whether social media were mature enough to handle bereavement in a sensitive manner. Last week Facebook released the options either to have your account deleted when you die or to nominate a trusted legacy manager to take it on for you as a memorial (Facebook rolls out feature for users when they die ).

This was in response to the distress of relatives who wished to retrieve a lost loved one’s account or did not want to undergo the eerie experience of receiving automated reminders of their birthday or seeing their name or image appear unexpectedly in advertising. The enforced “Year in Review” offerings at the end of last year brought some publicity to the issue, as they also inadvertently caused distress by failing to consider the feelings of people who had suffered bereavements during the year. The original blog post about this (Inadvertent Algorithmic Cruelty ) went viral last Christmas. The author quickly called for an end to a wave of casual responses that jumped to glib conclusions about young privileged staff just not thinking about anything bad ever happening (Well, That Escalated Quickly ).

A more cynical response is that the minority of people who would not want Year in Review posts were deliberately dismissed as “edge cases” – possibly even a coldly calculated costs v. benefits decision, as providing “opt out” options might have required additional work or been seen as dispensable mouseclicks.

I have no idea what happened at Facebook, or what discussions, processes, and procedures they went through; the public apologies from Facebook do not go into that level of detail. However, while “algorithmic cruelty” may be unintentional, it is not a new phenomenon, and there are plenty of opportunities during the design and implementation of any project to think through the potential adverse impacts or pitfalls.

David Crystal at an ISKOUK conference in 2009 talked about the problem of avoiding inappropriate automated search engine placement of advertisements, for example ads for a set of kitchen knives alongside a story about a fatal stabbing. There was a certain naivety with early automated systems, but it did not take long for the industry in general to realise that unfortunate juxtapositions are not unusual incidents. Most people who have worked in semantics have plenty of anecdotes of either cringeworthy or hilarious mismatches and errors arising from algorithmic insensitivity to linguistic ambiguity.

Facebook’s latest thoughtlessness arises more from a failure to respect their users than through lack of sophistication in their algorithm (there doesn’t seem to be anything particularly complex about selecting photos and bunging some automated captions on them). Simply offering users the choice to look or not look or giving users the tools to build their own would have spared much heartache.

The origins of UX, championed by people such as Don Norman, Peter Morville, and Louis Rosenfeld, placed user needs front and centre. Good design was about seeing your users as real people with physical and emotional needs as human beings, and designing to help their lives go more smoothly, rather than designing to exploit them as much as possible.

Adventures in Semantic Theatre

ship sailing into the full moon on the horizon

I have been investigating the idea of using semantic techniques and technologies to enhance plays, along with the Montreal Semantic Web meetup group. There have been far fewer Semantic Web projects for the humanities than the sciences and even fewer that have examined the literary aspects of the theatre. Linked Open Data sets associated with the theatre are mostly bibliographic, library catalogue metadata, which treat plays from the point of view of simple objective properties of the artefact of a play, not its content: a play has an author, a publisher, a publication date, etc. Sometimes a nod towards the content is made by including genre, and there has been work on markup of scripts from a structural perspective – acts, characters, etc. There are obvious and sound reasons for these kinds of approaches, meeting bibliographic and structural use cases (e.g. “give me all the plays written by French authors between 1850 and 1890”; “give me the act, scene, and line references for all the speeches over ten lines long by a particular character”; “give me all the scenes in which more than three characters appear on stage at once”).
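A bibliographic query of that kind can be sketched over a handful of subject–predicate–object triples in plain Python (a stand-in for a real triple store and SPARQL; the plays, authors, and dates below are invented for illustration):

```python
# Minimal sketch of "give me all the plays written by French authors
# between 1850 and 1890" over invented (subject, predicate, object) triples.
triples = [
    ("play:1", "dc:creator", "author:hugo"),
    ("play:1", "dc:date", 1862),
    ("play:2", "dc:creator", "author:wilde"),
    ("play:2", "dc:date", 1895),
    ("author:hugo", "nationality", "French"),
    ("author:wilde", "nationality", "Irish"),
]

def objects(s, p):
    """All objects of triples with the given subject and predicate."""
    return [o for (s2, p2, o) in triples if s2 == s and p2 == p]

def french_plays(start, end):
    results = []
    for (s, p, o) in triples:
        if p == "dc:creator" and "French" in objects(o, "nationality"):
            dates = objects(s, "dc:date")
            if dates and start <= dates[0] <= end:
                results.append(s)
    return results

print(french_plays(1850, 1890))  # -> ['play:1']
```

The point of the examples that follow is that the same machinery could, in principle, carry literary rather than purely bibliographic predicates.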

Modelling literary rather than physical connections

Once we started discussing at the meetups how we could model the content itself, especially in a qualitative manner, we quickly became embroiled in questions of whether or not we needed to create entire worldviews for each play and how we could relate things in the play to their real world counterparts.

One of the plays we are working on – Ocean Opera by Alex Gelfand (to be performed at the Montreal Fringe Festival this June) – included the Moon as a character. How and by what relationships could we link the Moon of the play to the Moon in the sky, and then how could we link it to other fictional and literary Moons?

Another play we analysed – Going Back Home by Rachel Jury – was a dramatization based on real people and historical events. It seemed obvious these should be linked to their real counterparts, but would a simple “is a fictional representation of” relationship suffice? How should we relate depictions of historical events in the play to eyewitness accounts from the time or to newspaper reports?

Should we define the world view of each play? Would it matter when defining relationships if there were events in the play that were counterfactual or scientifically impossible?

How could we capture intertextuality and references to other plays? Should there be a differentiation between quotations and overt references by the author to other texts and less explicit allusions and shared cultural influences?

Artistic Use Cases

One of the most appealing aspects of this project to me is that we have no strict commercial or business requirements to meet. A starting point was the idea of a “literary search engine” that ranked relevance not according to information retrieval best practice, but under its own terms as art, or perhaps even defined its own “relevance within the world of the play”. In other words, we would be trying to produce results that were beautiful rather than results that best answered a query.

However, there are also a number of very practical use cases for modelling the literary world of a play, rather than just modelling a play as an object.

Querying within a play

Navigating within the text by answering such queries as ‘in which scenes do these two characters appear together’ answers one set of use cases. The BBC’s Mythology Engine was designed to help users find their way around a large number of brands, series, and episodes; characters and events were modelled as central.

An equivalent set of queries for literary aspects would be “how many scenes feature metaphors for anger and ambition” or “which monologues include references to Milton”.
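Assuming scenes had been annotated with literary properties (the annotations below are invented placeholders, not real markup of any play), those literary queries become straightforward lookups:

```python
# Sketch of literary queries over hand-annotated scenes.
# Scene ids, metaphors, and references are invented for illustration.
scenes = [
    {"id": "1.1", "metaphors": ["anger"], "references": []},
    {"id": "1.2", "metaphors": ["ambition", "anger"], "references": ["Milton"]},
    {"id": "2.1", "metaphors": [], "references": ["Voltaire"]},
]

def scenes_with_metaphor(theme):
    """'How many scenes feature metaphors for <theme>?'"""
    return [s["id"] for s in scenes if theme in s["metaphors"]]

def scenes_referencing(author):
    """'Which scenes include references to <author>?'"""
    return [s["id"] for s in scenes if author in s["references"]]

print(len(scenes_with_metaphor("anger")))  # -> 2
print(scenes_referencing("Milton"))        # -> ['1.2']
```

The hard part, of course, is not the query but agreeing on the annotation vocabulary – which is exactly where the taxonomy questions above come in.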

Querying across many plays

If you extend such use cases across a body of plays, recommendation scenarios become possible. For example, “if you liked this play which frequently references Voltaire and includes nautical metaphors, then you might also like this play…” and there are clear commercial implications for the arts in terms of marketing and promotion, finding new audiences, and even in planning new work.

These kinds of “metaphorical use cases” could also serve as a rich seam for generating interesting user journeys through a literary archive and as a way of promoting serendipitous discovery for students and researchers.

Storyline use cases

A lot of work that has been done at the BBC has been based around the concept of an ‘event’, and the relationship of events to storylines. This is particularly relevant for many practical and creative aspects of writing, compiling, broadcasting, archiving, and re-using content. For example, being able to distinguish the name of the journalist from the names of people who are mentioned within the story, and to distinguish between more and less significant people within a story according to whether they are mentioned as part of the main event or only in association with consequent or secondary events.

Literary and metaphorical use cases might take a similar approach but decompose the events in a story in terms of the emotional development of the characters.

Fictional worlds use cases

One of the ideas that I find the most appealing, but is the hardest to pin down, is the idea of modelling the internal ontological world of a work of fiction. In a fictional ontology, you can have relationships that make no sense in the ‘real’ world, so modelling them cannot rely on the kind of sense-testing and meeting of requirements that we use so much in commercial contexts.

In discussions, some people reacted very strongly against the idea of even attempting to model fictional worlds, which I found fascinating, while others immediately saw the idea as just another aspect of literary creation – an artistic endeavour in its own right.

There is an epistemological tangent in ontological thinking that goes into a debate about realism versus anti-realism that I haven’t fully got to grips with yet.

Where next?

I am at the very early stages of thinking through all this, and not sure where it will go, but am enjoying starting to gather a community of interest. If you would like to know more, I have written in more detail about it all on the project blog:

Can you use statistics to find meaning?


I enjoyed this article in New Scientist about using statistical analysis on the Voynich manuscript to try to work out whether it is a meaningful but secret code or just gibberish.

Ultimately, I remain puzzled as to what the statistics actually tell us. They identify patterns, but meaning is more than simply patterns. However, the fact that certain sets of symbols in the Voynich text appear to cluster in sections with common illustrations suggests it is code. The counter-argument that you could deliberately fake such clustering by mechanical means is intriguing. Without far larger samples, and an understanding of random clusterings, I have no idea whether this sort of faking would produce the same patterns as natural language. I am sure clusters must appear all over the place, without bearing any meaning whatsoever.
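The clustering observation can be made concrete with a toy measure (the “corpus” below is invented; real analyses work on transcriptions of the actual manuscript): for each word type, ask what fraction of its occurrences fall in its most frequent section. A score near 1.0 means the word is confined to one section, which is the pattern reported for the Voynich text.

```python
# Rough sketch of a section-affinity statistic for word types.
# The sections and "words" are invented stand-ins for transcribed text.
from collections import Counter

sections = {
    "herbal":       "qokeedy qokeedy chedy otar otar qokeedy".split(),
    "astronomical": "okar okar cheol okar dain cheol".split(),
}

def section_affinity(word):
    """Fraction of a word's occurrences in its most frequent section;
    1.0 means the word appears in only one section."""
    counts = Counter()
    for name, words in sections.items():
        counts[name] = words.count(word)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0

print(section_affinity("qokeedy"))  # -> 1.0
```

The open question the article raises is what baseline to compare such scores against: random shuffles, natural-language texts, and mechanically generated gibberish could all be run through the same statistic.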

I also thought it was interesting that one of the arguments in favour of gibberish was that there were no mistakes. It strikes me there could be many reasons for the lack of proofing and correction, and I would want to know more about the rate of correction in similar works before I could assess that argument. I know that standardization of spelling came relatively late; presumably, before then, far more “mistakes” would have been tolerated.

Nevertheless, a fascinating mystery and one that perhaps cannot be resolved by analysis but by coincidental discovery of the key (if it exists!) – if it is gibberish, perhaps we will never know. Either way, I am sure it would have amused the author to know that their work would still be a controversial topic hundreds of years after it was written!

Conversations about conversation – Gurteen knowledge café


Last Wednesday evening I attended my first “Knowledge Café” hosted by David Gurteen. I have heard a lot about these cafés at various information events and so was pleased to finally be able to attend one in person. The idea appears to be twofold – firstly that knowledge and information professionals can find out what such cafés are for and how to run them, and secondly simply to participate in them for their own sake. The “meta-ness” of the theme – conversations about conversation – appealed to me. (I’ve always liked metacognition – essentially thinking about thinking – too.)

We had plenty of time to get a drink and network before the event started, which is always a good thing, then David gave us a short introduction to the topic. He talked about Theodore Zeldin‘s book about Conversation: How Talk can Change our Lives and reminisced about a conversation from his own childhood that had held personal significance. He then set us three questions to discuss, about whether conversations can help us to see the world differently and how we can use them to bring about change for the better.

We then had a quick round of “speed networking” and formed groups to talk about the first question, moving on to different groups subsequently, so that we were well mixed by the end of the evening. To conclude we gathered into one large circle to talk further. This way we spiralled out from a single speaker, to speaking in pairs, then small groups, then all of us together.

Some common strands that everyone seemed to touch on at some point included discussing whether conversation was medium agnostic. Some people felt quite strongly that only a face-to-face discussion was a real conversation and that chatting via email, by text, by IM, and even by telephone were not the same. Others felt that the medium was irrelevant; it was the nature and quality of the communication that mattered. They agreed that signals, such as body language, shared environment, and instant interactivity were lost when not face to face, but that other factors, such as power imbalances between participants, could be minimised by talking remotely and unseen. Most people agreed that it was far easier to chat in highly constrained media, such as texting, with people one already knew well and had talked to frequently face to face, as that acquaintance helped smooth over misunderstandings due to lack of tone of voice or hastily chosen and ambiguous words. Clarity of vocabulary was also seen as key, especially when dealing with diverse groups or communities of practice.

Trust, power, empathy, and the ability to listen were noted as important factors in productive conversations, as was persuasion, but also that people needed to be open and receptive if change – and perhaps even communication at all – were to be achieved.

I was surprised that fewer people mentioned the physical surroundings and settings of good conversations. I remembered Plato, with Socrates sometimes in the marketplace and sometimes going off to sit in a quiet place under a tree. I find the best conversations need a calm neutral space, without interruptions, where participants can be comfortable, can hear each other clearly, can see each other easily, and have space to move about, perhaps to draw, gesture, etc. if they want to emphasise or illustrate a point. Poor acoustics in restaurants can be disastrous for dinner conversations if all you can hear is clattering chairs and clinking cutlery. Chirruping mobile phones, staff requesting answers, and children needing attention break conversational rhythm and flow, not to mention trains of thought.

Interestingly, in the group discussion, and as so often happens in all conversations, people drifted off topic and became increasingly animated by discussion of something unintended and not particularly relevant. In this case it was a purely political debate about whether the competitive nature of humans was a good or bad thing. Despite mutterings that we are becoming less politically engaged, people seem to want to wear their politics very much on their sleeves.

On the way home, I wondered whether the conversations I had participated in that evening had changed me or the world. In a small way, every experience we have changes the world. I met some interesting new people. I had some new ideas and learned a few new pieces of information (apparently it is less tiring to listen to a telephone conversation using both ears – e.g. through a pair of headphones instead of a single earpiece). This blog post exists as a result of the evening. However, I took to heart the point that change has to come from within and I resolved to try to remember to stay adaptable and open to new viewpoints. I also resolved to listen more attentively and to try to facilitate better, more productive conversations while at work. I certainly hope this will change the world for the better, albeit in a very subtle way.

Language, thought, categorisation, and talking to yourself


The Voice of Reason (or What’s in a name? online) is a fascinating article by David Robson in New Scientist on one of my favourite topics – how language affects the way we think. The “linguistic relativity” theory of Edward Sapir and Benjamin Whorf – one of my favourite hypotheses – is blamed for the “fall from grace” of the idea that language shapes thought. The work of Eleanor Rosch – one of my favourite psychologists – on categorisation appeared to contradict the Sapir-Whorf hypothesis, by showing that categorisation rests more on physiological characteristics of humans – how we see, what size things are, whether or not something is edible – than on the names we have for things.

Noam Chomsky’s quest for a universal grammar made the notion that language and thought were essentially common to all humanity more popular than linguistic relativity. However, psychologists have started to note that having names for categories helps infants put things into those categories. Children’s spatial reasoning also seems to be improved when you remind them of spatial vocabulary (Dedre Gentner, Northwestern University, Evanston, Illinois: Cognitive Psychology, vol 50, p 315). People instinctively teach children by reminding them of what category words like “top”, “middle” and “bottom” mean. An experiment with “aliens” indicated that when people were given names for types of aliens they categorised them more quickly and accurately than when they weren’t given the names (Gary Lupyan, University of Wisconsin, Madison: Psychological Science vol 18, p 1077).

Although the strong version of the Sapir-Whorf hypothesis – that language dictates and constrains thought – appears unlikely to be true, on the grounds that you could never have a new idea or create a new category, the “weak” version – that having those words available will encourage people to think in those terms – seems very plausible. An experiment has now indicated that Russian speakers – who have two different words for shades of blue – are faster at sorting out those shades than English speakers (Lera Boroditsky, Stanford University, California: Proceedings of the National Academy of Sciences, vol 104, p 7780).

Labelling objects helps the memory take “shortcuts” so that minor details do not have to be remembered (Lupyan Journal of Experimental Psychology: General vol 137, p 348). Political activists in many areas have argued that language use encourages stereotyping – hence the attempts to break down stereotypes by changing names for groups. However, when applied to something like sets of documents, not bothering to see them all as individuals can be a useful shortcut. If you want to build a user-friendly taxonomy, using the categories people know and like will make your system quicker and easier to use. Of course they could learn other ways of categorising – they could break the stereotypes – if they spent a bit of time and effort thinking it all through – but in many contexts the job of the taxonomist is to give people what they want quickly and efficiently, not to enter into debates about whether or not they conceptualise things in the most politically appropriate way.

Language has also been shown to affect perception. If you use upwards-moving words (climb, rise, etc.) while showing people patterns of randomly moving dots, they are more likely to correctly detect the predominant direction of movement if the words match the direction (Psychological Science, vol 18 p 1007). Conversely, showing people upwardly-moving dots while saying “fall” confused them. The words seem to “prime” the visual system of the brain.

Another effect is that it is easier to see something if you say the name – so it really does help when you are looking for something to mutter the name of the object to help you find it. According to Andy Clark, a philosopher at the University of Edinburgh, language was the original form of “augmented reality” – “an overlay that changes how we think, reason and see”.

Human-Machine Symbiosis for Data Interpretation


I went to the ISKO event on Thursday. The speaker, Dave Snowden of Cognitive Edge was very entertaining. He has already blogged about the lecture himself.

He pointed out that humans are great at pattern recognition (“intuition is compressed experience”) and are great satisficers (computers are great at optimising), and that humans never read or remember the same word in quite the same way (has anyone told Autonomy this?). I suppose this is the accretion of personal context and experience affecting your own understanding of the word. I remember as a child forming very strong associations with names of people I liked or disliked – if I disliked the person, I thought the name itself was horrible. This is clearly a dangerous process (and one I hope I have grown out of!) but presumably is part of the way people end up with all sorts of irrational prejudices and also explains why “reclaiming” words like “queer” eventually works. If you keep imposing new contexts on a word, those contexts will come to dominate. This factors into taxonomy work, as it explains the intensity people feel about how things should be named, but they won’t all agree. It must also be connected to why language evolves (and how outdated taxonomies start to cause rather than solve problems – like Wittgenstein’s gods becoming devils).

Snowden also talked about the importance of recognising the weak signal, and has developed a research method based on analysing narratives, using a “light touch” categorisation (to preserve fuzzy boundaries) and allowing people to categorise their own stories. He then plots the points collected from the stories to show the “cultural landscape”. If this is done repeatedly, the “landscapes” can be compared to see if anything is changing. He stressed that his methodology required the selection of the right level of detail in the narratives collected, disintermediation (letting people speak in their own words and categorise in their own way within the constraints), and distributed cognition.

I particularly liked his point that when people self-index and self-title they tend to use words that don’t occur in the text, which is a serious problem for semantic analysis algorithms (although I would comment that third party human indexers/editors will use words not in the text too – “aboutness” is a big problem!). He was also very concerned that computer scientists are not taught to see computers as tools for supporting symbiosis with humans, but as black box systems that should operate autonomously. I completely agree – as is probably quite obvious from many of my previous blog posts – get the computers to do the heavy lifting to free up the humans to sort out the anomalies, make the intuitive leaps, and be creative.

UPDATE: Here’s an excellent post on this talk from Open Intelligence.

Digital Humanities 2009 – call for papers


Digital Humanities 2009 » Call for Papers. Digital Humanities 2009–the annual joint meeting of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Society for Digital Humanities / Société pour l’étude des médias interactifs–will be hosted by the Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland in College Park, USA.

Suitable subjects for proposals include, for example,

* text analysis, corpora, corpus linguistics, language processing, language learning
* libraries, archives and the creation, delivery, management and preservation of humanities digital resources
* computer-based research and computing applications in all areas of literary, linguistic, cultural, and historical studies, including electronic literature and interdisciplinary aspects of modern scholarship
* use of computation in such areas as the arts, architecture, music, film, theatre, new media, and other areas reflecting our cultural heritage
* research issues such as: information design and modelling; the cultural impact of the new media; software studies; Human-Computer interaction
* the role of digital humanities in academic curricula
* digital humanities and diversity

Truth and Interpretation


Inquiries into Truth and Interpretation by Donald Davidson contains a series of philosophical essays on linguistic topics. I am interested in the way that different people respond differently to the language used in taxonomies and so delved into this to try to get a handle on recent linguistic theory. Most of the essays are very technical but I found the essays on Conceptual Schemes and Communication and Convention quite useful. Davidson argues that it makes no sense to talk of completely mutually unintelligible conceptual schemes. We can only talk about schemes as being different because there are some areas of mutual intelligibility and it is this common ground that enables us to highlight local differences.

In Communication and Convention, he argues that repetition and rules-based language conventions are helpful and usual practice in communication, but not necessary. We do not need to agree in advance a theory of interpretation before we start speaking to someone new, because we can develop this through the process of communication itself. However, it saves an awful lot of time if we just assume they understand language in the same way we do and most of the time they do. If they don’t we can modify our theory and try to establish a means of communication as we go along.

Women, Fire, and Dangerous Things


Women, Fire, and Dangerous Things: what categories reveal about the mind by George Lakoff is a hefty tome and a core text in cognitive science. It is 587 pages long, so there are a lot of ideas in there and I am not going to do it justice in this little blog post! Basically, Lakoff starts by bringing together aspects of the work of philosophers such as Ludwig Wittgenstein and J.L. Austin, anthropologists, and psychologists – primarily Eleanor Rosch – to show how the notion of meaning being rooted in context rather than in some external objective ideal has risen to prominence since the middle of the last century.

Most important for taxonomists is the work of Rosch, whose experiments in the way people form and understand categories shows that categories do not always conform to the “classical” or “folk theory” of categorisation. Since Aristotle, people have assumed that categories are made by noticing “real” properties of things and grouping things by matching those properties. Rosch showed that people actually form categories in various ways, sometimes by grouping matching properties, but sometimes by taking a “central example” and matching similar things that may not actually share any particular properties (e.g. a desk chair is a more typical kind of chair than a bean bag chair, and the two things don’t really have much in common except that we can see they are both sorts of chair). Other ways to form categories include metaphorical association (e.g. communication as liquid in channels) or by metonymy, where a part of something is taken to represent the whole thing (e.g. hands meaning workers).
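Rosch-style prototype categorisation can be contrasted with the classical shared-property model in a small sketch (the feature sets below are invented for illustration): instead of a strict membership test, items get a graded typicality score measured against a central example.

```python
# Sketch of prototype-based categorisation: membership is graded by
# similarity to a central example. Features are invented placeholders.
prototypes = {
    "chair": {"has_legs", "has_seat", "has_back", "for_sitting"},
}

def typicality(item_features, category):
    """Jaccard similarity (0..1) between an item's features and the
    category prototype; higher means a more typical member."""
    proto = prototypes[category]
    return len(item_features & proto) / len(item_features | proto)

desk_chair = {"has_legs", "has_seat", "has_back", "for_sitting"}
bean_bag   = {"has_seat", "for_sitting"}

# The desk chair is the more typical chair, but both count as members.
assert typicality(desk_chair, "chair") > typicality(bean_bag, "chair")
print(round(typicality(bean_bag, "chair"), 2))  # -> 0.5
```

A classical category would instead demand that every member satisfy a fixed list of necessary and sufficient properties, which is exactly what Rosch’s experiments showed people often do not do.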

The categories we choose are also rooted in our nature as physical beings – our colour categories are dependent on the structure of the eye, for example. We also tend to operate most naturally at an “intermediate” level of specificity – the level of the ordinary everyday objects we interact with – books, chairs, dogs, cats, etc – rather than the more abstract level – furniture, animals, etc – or the more specific – paperback novels, deckchairs, Dalmatians, Felix the cat. Children seem to learn these mid-level terms first, and my instinct is that as taxonomists it is typically the middle levels of granularity that are the most troublesome.

Lakoff uses such experimental evidence to argue against objectivism and in favour of “experiential realism” (or “experientialism”) – that our conceptual systems, including the way we form categories – come from our physical bodies and the social and physical environment we find ourselves experiencing. Truth, categories, knowledge, are not “out there” for us to perceive, but are generated from within our subjective experience. (This means that there is no “right” taxonomy for anything – there are only taxonomies that work in particular contexts.)

There’s more detail in this summary and in Donna Mauer’s presentation on the book.

It also has its detractors – this is one critique that I am still working my way through.