Hurt: the pain of language and computational linguistics


Professor Mark Bishop,
FACT360, Chief Scientific Adviser

Central to the notion of Natural Language Processing (NLP) is a foundational question that few who engage with the discipline ever take time out to ask: “What is language?” It is a deceptively simple question, expressed through three simple words in English, but one that, upon reflection, proves fiendishly difficult to unpick.

If this question has ever occurred to you, you probably imagined something along the lines of language being a combination of letters, symbols, words and/or sounds that allows us to communicate. If you were attempting a more specific and academic definition, you might get as far as something like:

  1. Languages are autonomous and well-defined entities which provide stable systems of representation for members of speech communities, wherein words transfer speaker A’s thoughts to hearer B. Thus, linguistic communication is essentially the recognition, in the mind of hearer B, of the thoughts expressed and conveyed by speaker A (this is also known as ‘telementation’).
  2. Thus, in this conception, language consists of fixed units (‘codes’) which can be described and studied independently of any particular context, speaker or communicational practice.

Language as a code

Computational linguistics – as concerned with the computational modelling of natural language – is effectively predicated on this conception of ‘language as a code’; an approach known as ‘Saussurean’, after the work of the influential Swiss linguist Ferdinand de Saussure.

The core assumption is that the human language behaviour that we would like computers to model with NLP is a ‘self-contained system’, a mere ‘mechanical behaviour’ guided by the universally compatible machine-for-language in our head with which each of us is genetically endowed. Taking this assumption one stage further, it follows that if we can devise machinery of sufficient sophistication, then one day such a machine will pass the ‘Turing Test’ and be able to perform linguistically in a way indistinguishable from humans.

Integrationist language

This approach has had its critics. Perhaps the most notable is Roy Harris (Professor of General Linguistics at the University of Oxford), who termed this account of language ‘Segregational’. He famously contrasted this intuitive view of language with his own ‘Integrationist’ account, wherein language, as interpersonal communication, entails the integration of activities between individuals, realised as a creative exercise of both verbal and non-verbal coordination and future cooperation in everyday life. Herein, both verbal and non-verbal acts are seen as integrated constituents within an interactive continuum of communication.

In this radically different conception of language, both speakers and hearers endeavour to create a joint construal of what the speaker is to be taken to mean. Such a construal represents not what the speaker means per se – which can change in the very process of the communication – but what the participants mutually take the speaker to mean.

If Harris is correct, it will be hard to unambiguously identify the operative units of which the language code allegedly consists and to unambiguously identify particular features/units without reference to some determinate system which defines them and their mutual interrelationship. I will illustrate this paradox by reference to the song “Hurt”.

What NiN and Johnny Cash can teach us about Hurt

“Hurt”, a song written by Trent Reznor in 1995, was originally a hit for his band Nine Inch Nails (NiN)[1] and, in its original flavour, is generally taken to be a reflection on self-harm, heroin addiction, depression and [the possibility of] suicide. In contrast, in 2002, in the twilight of his career[2], Johnny Cash recorded his own version of the song, in which, as old age and death encroach, it is transformed into a paean to a life hard-lived: “If I could start again, a million miles away, I would keep myself, I would find a way”.

It is interesting to note that, judged merely by their relative YouTube views, Cash’s cover has proved more successful than the original[3], leading Reznor – originally very sceptical of how Cash would treat his song – to remark:

“Wow. I just lost my girlfriend, because that song isn’t mine anymore… It really made me think about how powerful music is as a medium and art form. I wrote some words and music in my bedroom as a way of staying sane, about a bleak and desperate place I was in, totally isolated and alone. That winds up reinterpreted by a music legend from a radically different era/genre and still retains sincerity and meaning – different, but every bit as pure.”

The linguistic puzzle at stake here is how two versions of the same song[4] can convey such radically different meanings[5]. After all, under a structuralist (Saussurean) approach to linguistic analysis, the identification of linguistic features is predicated upon the pairing of a determinate form with a determinate meaning, whereby the value of the sign (its so-called ‘linguistic meaning’) is held to be invariant across all of its manifestations in actual discourse.

I will return to these puzzling linguistic themes in a later blog; for now, it is enough simply to observe that, for the ‘Natural Language Processing’ project – as opposed to the ‘Natural Language Understanding’ project – the deeper philosophical nuances of language are not quite as important as base computational pragmatics.

[1] Nine Inch Nails, “Hurt”,
[2] Johnny Cash, “Hurt”,
[3] Relative view statistics from the ‘official versions’ of each song: Cash circa 70m to NiN circa 0.5m.
[4] In his version, Cash changes one line, replacing “crown of shit” with “crown of thorns”.
[5] And, by extension, how re-reading the same novel at different points in one’s life can often reveal very different interpretations and meanings to the reader.