Is the PDF the cul-de-sac of data?
By Martin Pickrodt, and Chief Information Officer,
The information age has given a new meaning to Francis Bacon’s “Knowledge is Power”. Information and data are becoming ever cheaper to create, store and transmit; data is everywhere and forms the foundation for business decisions, loan approvals and marketing strategies. At Canopy, we need to analyse financial data and statements for aggregation purposes, and that is what we will focus on here.
While most companies have embraced the digital age, old habits die hard and we can still find a lot of paper trails: orders, invoices, bank statements, investment analyses and so on. For a user of the data this can be very frustrating: we need to process our data for analysis, validation, accounting or even regulatory requirements.
Therefore, our dream is open standards and the willingness of counterparties to provide us with ‘our data’ in a proper format. Ravi Menon, Managing Director at the Monetary Authority of Singapore (MAS), emphasized in a recent speech the importance of open data standards: common standards help against fragmentation, inefficiency and inconvenience; seamless data sharing will enable higher-quality data that is free of error and commonly understood. It will also allow data to be aggregated and made intelligible, meaning machine readable and machine usable.
Yet companies still work with paper or its modern-day derivative: the PDF file.
PDF documents are great: every computer can open them and the reader can see exactly what the sender of information intended to show. This is a major advantage over text document formats, which may or may not take the liberty of changing the formatting upon opening and wreak havoc on any nicely crafted design. PDFs are therefore stable display platforms that avoid any issues relating to PEBKAC (“problem exists between keyboard and chair”).
And yet, PDF documents are terrible: they do not fulfil the requirement of being machine readable. While you can open the document for visual inspection or printing, you cannot easily process that information any further. The format is a millstone around the neck of your data: you may have it, but using it in any substantial way is a chore. That is because PDF is a document format and not a data exchange standard. The recipient is therefore stuck with data that is not machine usable.
Why are PDF documents so popular then? Apart from the stability of the viewing experience, senders of information see the non-machine-readability as an advantage. As the data cannot easily be used any further, a perception of safety is created. Comparisons are harder to draw, insights are harder to glean and you will find it difficult to bring your dataset to a competitor. The PDF is the placeholder for analog technology in a digital world, a neat replacement for paper.
What can be done with the quantities of data that we would like to use? We need to bring it back into a real electronic format, that much is clear. Many companies choose the hard way: manually re-entering the relevant items. Users of large volumes of data, especially when faced with repeat processes like monthly statements, resort to outsourcing: a back office in a remote part of the world that will do the grunt work. This is a solution that fails the test of true scalability as well as accuracy, not to mention the security concerns it raises.
PDF extraction is only a temporary solution on the path to making data feeds ubiquitous. We hope that companies, and especially financial institutions, will listen to the gentle nudge from regulators and the call from clients in this matter. Until then, automated extraction is a powerful application for bank customers and beyond. Not only is it highly accurate and fast, it also saves costs.