In the moderation of non-text content, automation is vital. With the written word, an editor can quickly scan an article for tell-tale signs, and their expertise will flag when a closer look is required. Equally, searching large volumes of text for specific phrases or terms is a well-understood problem served by many techniques and tools. This becomes much trickier with, for example, audio. There’s a limit to how far a clip can be sped up before it becomes unintelligible to a listener, so assessing the risks posed by an hour’s worth of audio requires a significant time investment on the part of the editor. Equally, searching audio for whether a particular term has been spoken aloud is far more challenging than the text equivalent.
At Kinzen we combine an editorial network of human expertise in disinformation with a range of automated techniques for analysing large volumes of data, generating leads and highlighting high-risk content for our editors and clients.
While our editorial experts bring their own expertise and experience to every decision made, we on the technical side work hard to let them make these decisions as efficiently as possible, with as much context and supporting information as we can provide. Equally, our human-in-the-loop editors and analysts provide our engineers with vital labelled and audited data, and a knowledge graph that allows us to refine and target our tools to a degree that would be impossible without their input.
This dual approach allows us to blend editorial expertise into our technical solutions and tools in a way that neither companies focused solely on moderation, which rely on third-party tools for services such as transcription, nor the providers of those services can achieve.
Some of these tools can then be used to generate automatic classifications of content, attempting to distil the knowledge provided by our editors into an algorithm. Another key focus for Kinzen is building tools that present the huge range of content we monitor to editors in a manner that is accessible and manageable. The goal is to let them quickly evaluate content and, more broadly, find the needles in the haystack where their attention is needed.
When dealing with non-text content, an obvious step in tackling this problem is to produce transcripts, allowing us to apply text techniques to audio and video content. Perhaps we are lucky enough that a transcript already exists, but this is rarely the case, and if it is provided by a disinformation spreader, it may omit key sections to evade moderation by the platform that hosts it, so it’s vital to be able to produce these ourselves with high confidence.
There are many services offering automatic speech recognition (ASR), and the field has come an incredibly long way, especially in the past five years. We found that while many of these services produced high-quality, very readable transcripts, they often fell short in areas that matter for our purposes.
To get briefly technical, most ASR systems work in two stages. The first processes the audio and predicts which letters the sounds might correspond to. The second takes these predictions and feeds them to a language model, which evaluates them against the target language to see whether they look like real words and sentences.
To take an English example, the first stage might predict with high confidence that a speaker has said “AN” followed by a letter that it is fairly confident is “B” but could also be “D”. In English “anb” is not a word, while “and” is, so the language model will correct the prediction towards the true word. The first stage will often make mistakes due to the huge range of human accents and intonations, as well as the possibility of interference or background noise that makes guessing letters challenging, not to mention the varied oddities in the English language that make words such as “raise”, “rays” and “raze” difficult to distinguish.
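The interplay between the two stages can be sketched with a toy example. This is an illustration only: a tiny word list stands in for a real language model, and the confidences are invented; a real decoder works over probability distributions far richer than this.

```python
# Toy sketch of stage two of an ASR system: the acoustic stage has
# produced candidate letters with confidences for each position, and a
# "language model" (here just a word list) picks the highest-scoring
# sequence that forms a real word.
from itertools import product

VOCAB = {"an", "and", "ant"}  # stand-in for a full language model

def best_word(candidates):
    """candidates: one list of (letter, confidence) pairs per position."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        word = "".join(letter for letter, _ in combo)
        score = sum(conf for _, conf in combo)
        if word in VOCAB and score > best_score:
            best, best_score = word, score
    return best

# The acoustic stage is confident about "A" and "N", then torn
# between "B" (0.55) and "D" (0.45).
acoustic = [[("a", 0.9)], [("n", 0.8)], [("b", 0.55), ("d", 0.45)]]
print(best_word(acoustic))  # "anb" is not a word, so "and" wins
```

Note that the language model overrules the acoustically more confident “b” because “anb” is not a valid word, which is exactly the correction behaviour described above.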
The language model is usually produced using machine learning or statistical techniques on a large collection of written texts (known as training the model). This allows it to estimate the likelihood of a given letter or word appearing in combination with the surrounding letters or words. This approach has proven very successful in creating transcripts without obvious spelling mistakes that are highly readable and in most cases allow a reader to quickly understand the nature of the content being discussed. Where errors are introduced, they tend to come from correcting to the wrong word, rather than misspelling the target word.
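As a minimal illustration of the statistical idea (a toy bigram model over a nine-word corpus, nothing like the scale or sophistication of a production model), counting word pairs in training text yields an estimate of how likely one word is to follow another:

```python
# Toy bigram language model: estimate P(word | previous word) from
# pair counts in a training corpus.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
unigrams = Counter(corpus)                  # counts of single words

def p_next(prev, word):
    """Probability that `word` follows `prev`, per the training corpus."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("the", "cat"))  # 2 of the 3 occurrences of "the" precede "cat"
```

A real system smooths these counts and conditions on longer histories, but the principle is the same: words and sequences never seen in training receive very low probability.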
When one of our editors needs to analyse a given piece of audio, this approach means they will reliably have a transcript that is readable and clear. There will be very few spelling errors and where a word has been corrected to the wrong choice, it will usually be clear in context what the right word should have been. To use the needle in the haystack analogy again, it ensures that everything in the haystack is clean and tidy so when we have a needle in hand, it becomes very easy to analyse it.
Where this approach caused us problems was in highlighting the needles in the first place. The corrections a language model makes are based on the examples it has been given. In the fast-moving world of moderation, new terms are constantly appearing, and previously unusual or niche terms can become mainstream very quickly. How many people would have been familiar with terms like “herd immunity”, “mRNA” or even “Coronavirus” two years ago? Now, for many, these are daily conversation topics. A new hashtag or political rallying cry can go from being coined to globally trending in days, if not hours. If a language model has never seen a term, it will likely consider it a spelling mistake and correct it to a valid word. This is particularly true of phrases arising from acronyms, names or hashtags that don’t follow standard language rules, which unfortunately are often the terms an editor needs to focus on most.
The first stage of an ASR system might predict someone saying “mRNA vaccine” as “em ar an ay vac seen”, which might in turn be corrected to “Emma and a vaccine” by the language model if it hasn’t been trained on relevant data. An experienced editor may be able to infer the correct term from the context, but this error makes automatically searching for content containing these terms near impossible, and puts the editor back to finding the needles themselves by examining every piece of hay.
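This failure mode can be demonstrated with a toy correction step (hypothetical and heavily simplified; `difflib`’s fuzzy matching stands in for a language model’s correction stage). Because the corrector can only ever output words from its vocabulary, an unseen term like “mrna” can never survive it:

```python
# Hypothetical sketch of the out-of-vocabulary failure mode: a
# correction stage that can only emit known words will always replace
# an unseen term with something "valid".
import difflib

VOCAB = ["emma", "and", "a", "vaccine", "the", "marine"]

def correct(token):
    # get_close_matches only returns entries from VOCAB, so a term the
    # model has never seen, like "mrna", cannot survive this stage.
    match = difflib.get_close_matches(token.lower(), VOCAB, n=1, cutoff=0.0)
    return match[0] if match else token

print([correct(t) for t in "mRNA vaccine".split()])
```

“vaccine” passes through unharmed because it is in the vocabulary, while “mRNA” is silently replaced by whichever known word happens to be closest, making the transcript unsearchable for the term that matters.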
For many companies this is a problem they have to live with. Those that focus on moderating large volumes of audio, facing fast-moving and evolving targets and threats, have to rely on third-party services for transcripts, while companies providing ASR services don’t have access to the up-to-date daily information that each client might want them to focus on transcribing correctly. Kinzen’s approach of combining expert editorial work with technology solutions puts us in a powerful position to tackle this problem.
The solution Kinzen developed is two-fold. First, we built our own ASR system and developed a language model that we can update with new terms quickly and efficiently, allowing us to prioritise the key targets that are vital to get right in our transcriptions. Second, we utilised our powerful knowledge graph, populated with the terms and phrases that our editors consider vital to track, and built systems that allow them to keep it current. Building our language model around this knowledge graph ensures that our system is not biased against the terms most vital for us to track.
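A hypothetical sketch of the idea, simplified far beyond the real system: terms from an editor-maintained knowledge graph are merged into the vocabulary before the correction stage runs, so they are treated as valid words rather than spelling mistakes. As before, `difflib` stands in for the language model’s correction step.

```python
# Sketch: protect knowledge-graph terms from "correction" by adding
# them to the vocabulary the correction stage checks against.
import difflib

base_vocab = {"emma", "and", "a", "vaccine", "the"}
knowledge_graph_terms = {"mrna", "coronavirus"}  # kept current by editors

vocab = base_vocab | knowledge_graph_terms

def correct(token, vocabulary):
    if token in vocabulary:  # a tracked term survives untouched
        return token
    match = difflib.get_close_matches(token, sorted(vocabulary), n=1, cutoff=0.0)
    return match[0] if match else token

print([correct(t, vocab) for t in "mrna vaccine".split()])  # ['mrna', 'vaccine']
```

Because editors can add a new hashtag or phrase as soon as it appears, the transcription system stops treating the most important terms as noise to be corrected away.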
In building tools for moderators and analysts, it is vital to know the unknowns that the editorial experts are searching for and ensure that these tools are built with this goal in mind. Transcription seems like a general task but as we have seen, approaching it without specialisation can lead to the trap of producing beautiful haystacks that further hide the needles within.