In days gone by, understanding a leak was a matter of a brave journalist sitting down at a desk with a pile of paper, plenty of strong coffee and a willingness to keep meticulous notes. By the time digital files were the norm, this manual approach looked daunting. Suddenly, there were thousands or millions of files to look at in a range of formats. Cross-checking millions of files to understand the relationship between individuals, companies and events over time was simply a non-starter.
The technology field that rode to the rescue around 15 years ago was called eDiscovery, these days more often termed 'big data analytics'. Complex journalistic investigation offers an insight into what such systems are capable of, but they are already routinely used by organisations that need to make sense of large amounts of unstructured and structured data to high degrees of certainty - law firms, regulators, governments and police forces are notable users.
In the case of the Panama Papers, the newspaper fed the Nuix platform the 2.6 terabytes of Word and PowerPoint files, spreadsheets, emails and PDFs a few chunks at a time, a process that took around two weeks and turned the cache into something that could be queried to spot deeper connections, patterns and relationships between people, events and locations over time.
It sounds like a huge amount of data to plough through but according to Nuix's senior consultant Carl Barron, part of the team that helped the German newspaper get to grips with the data, 2.6 terabytes is actually pretty typical for eDiscovery analytics.
"It wouldn't be thinkable to do a manual investigation over this amount of information. You'd miss critical information," Barron told Techworld.
Nuix did not see the data itself but helped the firm configure the server on which the analysis was carried out, kept isolated from internet access.
The system churns through the files at high speed, extracting text along with the metadata that records who created each file, when, and any subsequent modifications; sometimes location data is available too. Language is no obstacle: Nuix can process characters and words in any language. Documents in a closed format such as PDFs are identified and fed into an optical character recognition system for text extraction, a step that accounts for most of the processing work, according to Barron.
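The kind of extraction pass described above can be sketched in a few lines of Python. This is an illustration, not Nuix's actual pipeline (which is proprietary): filesystem timestamps stand in for the richer embedded metadata a real eDiscovery tool would pull from Office files and emails, and plain-text reading stands in for format-aware parsing and OCR.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def file_record(path):
    """Build one index record per file: extracted text plus basic metadata.
    A real tool would also parse embedded author/creation fields and
    run OCR on closed formats such as PDFs."""
    stat = path.stat()
    return {
        "name": path.name,
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "text": path.read_text(errors="replace"),
    }

# Example: index a tiny corpus in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "memo.txt"
    p.write_text("Meeting with the shell company on 2008-03-14.")
    record = file_record(p)
```

Once every file is reduced to a uniform record like this, the rest of the analysis - search, de-duplication, link-finding - no longer cares what format the original document was in.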
Critically, Nuix de-duplicates the data - the same file stored in a different place - which in the case of the Panama Papers quickly removed about a third of the volume.
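De-duplication of this kind is typically done by hashing each file's contents, so two files with identical bytes are treated as copies regardless of name or location. A minimal sketch, using SHA-256 (the article does not say which method Nuix uses):

```python
import hashlib

def deduplicate(files):
    """Keep one representative per unique content hash.
    `files` maps a path to raw bytes; returns the surviving paths."""
    seen = {}
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        seen.setdefault(digest, path)  # first occurrence wins
    return sorted(seen.values())

# Example: the same contract saved in two places counts once.
corpus = {
    "inbox/contract.pdf": b"%PDF-1.4 contract body",
    "backup/contract_copy.pdf": b"%PDF-1.4 contract body",
    "inbox/invoice.pdf": b"%PDF-1.4 invoice body",
}
unique = deduplicate(corpus)
```

Content hashing is why renaming or moving a file does not hide it: the digest depends only on the bytes inside.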
"It indexes data and gives you a quick and very transparent view."
The effect of this sort of analytic power is that a journalist or investigator can search using any criterion they want, for instance quickly being served an index of documents that mention a person of interest within a certain date range. The system also allows larger teams of people to access the same data set, each following their own leads or interests.
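The query described above - documents mentioning a name within a date range - can be sketched over a toy index like so. The record layout here is invented for illustration; a real platform would use a full inverted index rather than a linear scan.

```python
from datetime import date

def search(records, person, start, end):
    """Return records that mention `person` and fall inside [start, end]."""
    return [
        r for r in records
        if person.lower() in r["text"].lower() and start <= r["date"] <= end
    ]

# Toy index: one record per extracted document.
docs = [
    {"date": date(2007, 5, 2), "text": "Transfer approved by J. Doe."},
    {"date": date(2011, 9, 9), "text": "J. Doe resigns as nominee director."},
    {"date": date(2011, 9, 9), "text": "Unrelated shipping manifest."},
]
hits = search(docs, "j. doe", date(2010, 1, 1), date(2012, 1, 1))
```

Because each reporter runs their own queries against the shared index, a large team can work the same cache in parallel without stepping on each other.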
The ICIJ previously worked with Nuix to analyse a cache of 2.5 million files, the results of which were published in 2013.
Longer term, the ability of the technology to unlock hidden patterns and relationships within documents spells trouble for a world in which sometimes politically and financially charged secrets need to be kept, although the spread of encryption will surely blunt this to some extent. However, encryption isn't a perfect solution to this issue, not least because using the technology introduces complex key management. More likely, some forms of unstructured data such as emails will be set to self-destruct after a pre-defined time period, although regulators could also outlaw such behaviour to preserve compliance.
What does Nuix's Barron think?
"There is no place to hide."