28 May 2024

A map showing cool places to sit during hot weather near your apartment or place of work. Real-time data showing you where you can find an unoccupied station to charge an electric car. Budget reports detailing how a city or national government is spending tax revenue. Government open data portals contain an overwhelming variety of information. Some of it is practical, aimed at improving the day-to-day lives of residents. Yet the reason why no democracy should be without an open data portal is that they enhance government transparency.

A democratic government is one which shares the information it collects with the public. This allows journalists, researchers and the general population to monitor how government money is being spent, or whether climate targets are being met. And with the right expertise, they can even use it to assess whether policies are having the intended impact.  

In this article, we look at the fundamentals of open data portals, exploring why they are so important, and examining best practices. Given the vast amount of data being collected in the 21st century, sharing data is a huge task for administrations – but also a huge opportunity to make government more democratic.

Why open government data?

Since the 1970s, a large number of democratic governments have introduced freedom of information legislation. This gives all citizens the right to request access to government records. It’s a fundamentally democratic idea: the information a government collects should be available to citizens, unless there are good reasons to withhold it.

Open data takes this a step further, making a wide range of data open by default. This means that there is no longer a need to make individual requests. Not only can citizens access data with a minimum of effort; they can also get an overview of the kinds of data being collected by their government.

What makes data “open”?

The EU’s data portal cites the “Open Definition”, a definition of open data developed by the Open Knowledge Foundation. In its simplest form, the definition reads: “Open data and content can be freely used, modified, and shared by anyone for any purpose”.

To be more specific, open data should be accessible to anyone without having to pay, or even sign up for an account. Once downloaded it can be used for any purpose, including commercial, and can be edited and reproduced without the need to request permission.  

There are also some technical criteria that open data needs to fulfil. It needs to be easy to access – ideally via an online data portal. And it is vitally important that it should be machine readable. In other words, a computer should be able to interpret its contents. A PDF containing a scan of a graph contains a lot of information – but it can’t be read directly by a computer, or easily copied and pasted from one piece of software to another.

It should also be possible to open the data with at least one free piece of software. Best practice here is to use a file format that a wide range of different software can open. A common choice is CSV, short for comma-separated values. CSV files can be read by a text editor or spreadsheet software, and they are also easy to work with in programming languages used for data analysis.
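To illustrate why CSV is so widely supported, here is a minimal sketch in Python using only the standard library. The column names and figures are invented for illustration and do not come from any real portal.

```python
import csv
import io

# Hypothetical sample: a tiny CSV as it might appear in a text editor.
SAMPLE_CSV = """district,year,charging_points
North,2023,12
South,2023,7
East,2023,15
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)

# CSV values arrive as text, so numbers must be converted explicitly.
total_points = sum(int(row["charging_points"]) for row in rows)
print(total_points)  # 34
```

The same text could be opened unchanged in a spreadsheet, which is exactly what makes the format a safe default.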

Who is it for?

Open data is about enhancing democracy – but it also has a lot of practical uses. On the one hand, private companies can use it. A taxi firm can use it to monitor traffic levels in a city, for example. Or an insurance company might use it to assess the risks involved with insuring properties in certain neighbourhoods.

It is also used by researchers at universities, think tanks, and NGOs to better understand developments in our societies. In some cases, they use the data to develop policy proposals that might improve the quality of life for citizens. And it can be used by journalists and anyone who wishes to check on claims made by governments. Politicians often cite statistics to demonstrate the success or failure of a particular policy. Publishing data in its raw form makes it easier for others to check these claims. And it allows them to identify areas of concern, where governments may be underperforming.

To qualify as open, the data has to be accessible to everyone. Yet in practice, most people aren’t equipped to interpret a large part of it. This will depend largely on the kind of data in question. And as we will see, there are ways of making even the most complex data easier for a broad public to get to grips with.

Kinds of data

The word data might conjure up the image of giant spreadsheets containing millions of numbers. And in some cases, that’s exactly what it looks like. But data actually takes a lot of different forms. 

At the time of writing, the most popular data sets in Paris’ open data portal are about the availability of bikes and scooters, and the availability of charging points for cars. In New York, it’s a list of licensed taxis (while the city payroll statistics and data on road traffic accidents also rank in the top ten). 

The documents collected on open data portals can include anything from a PDF containing lists of names of local representatives and addresses of government buildings, to a government report on climate change, to interactive maps displaying cycle lanes or new building projects. 

They also cover things like government spending, or statistics relating to employment or healthcare. Typically, these take the form of spreadsheets, or CSV files – a format that is very easy for different pieces of software to read. 

Data visualisation

Data visualisation is the easiest way of making complex data accessible and understandable to a non-specialist audience. 

One of the most popular formats on city data portals is the interactive map. Such maps can be used to show things like the density of bike lanes, or to illustrate how different areas have benefited from public spending.

The most common way of visualising numerical data is using graphs. Many government data portals include pre-generated graphs of the most important data sets. Some also include built-in apps that allow users to create their own graphs based on data sets in just a few clicks. 

This histogram showing the change in population in the boroughs of New York over a 50-year period was created on the New York Open Data Portal with just a few clicks.

Not all open data portals have this option. Furthermore, preparing every single dataset to make it immediately readable for the graph app would be an endless task. Nonetheless, as long as data is available in the right format, a data analyst can work with datasets to quickly produce their own visualisations. And AI tools are making this process even more straightforward (see below).

APIs

We already saw that it’s best practice to share data in full, so that anyone who wishes to can check it. But in practice, huge data sets can be difficult to work with, requiring a lot of computer memory. And making sure you are working with the latest version is tiresome if you have to download a new file each time the data is updated.

This is where Application Programming Interfaces, or APIs, come in. Rather than having to download a raw data set in full, data analysts can send an API request specifying the subset of data they are interested in. What’s more, an API can be set to automatically refresh data.

A lot of apps use APIs to ensure that they stay up to date with the latest data. This is how things like weather apps work. Typically, APIs are only used by data specialists. But it’s considered best practice to have APIs on open data portals, as it makes working with the data far simpler for a whole range of researchers and other data professionals.
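As a rough sketch of what requesting a subset looks like, the Python below builds a query URL and reads a JSON response. The endpoint, the parameter names and the shape of the response are all assumptions made for illustration; every real portal documents its own. The response is simulated so that the example runs without a network connection.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: real portals each define their own URL and parameters.
BASE_URL = "https://data.example.org/api/records"

def build_request_url(dataset, filters, limit=100):
    """Build a query URL asking for a filtered subset of a data set."""
    params = {"dataset": dataset, "limit": limit, **filters}
    return f"{BASE_URL}?{urlencode(params)}"

def parse_response(body):
    """Extract the list of records from a JSON response body (assumed shape)."""
    return json.loads(body)["records"]

url = build_request_url(
    "charging-points", {"district": "North", "status": "free"}, limit=10
)
print(url)

# Simulated response body standing in for what the server would send back.
fake_body = '{"records": [{"station": "A1", "status": "free"}]}'
records = parse_response(fake_body)
print(records[0]["station"])  # A1
```

An app that repeats such a request at regular intervals always works with current data, without anyone downloading a full file.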

Structure is key

Estimates of global annual data production are measured in zettabytes, a unit equal to a trillion gigabytes. Public data is only a tiny proportion of this. But it still adds up to a dizzying amount. The EU’s data portal currently contains over 1.7 million data sets, while the UN has over 60 million.

With this much data available, there is a good chance that some of it will never be accessed. But that doesn’t mean it shouldn’t be there to find if someone needs it. That’s how transparency works. 

The downside is that it can make finding relevant data difficult. There’s a common legal trick called a document dump. When a prosecution team requests access to documents, the defence provides thousands of them, boxes upon boxes of sheets of paper with no clear structure. So much information that finding what you are looking for is like looking for a needle in a haystack. 

So how can governments avoid their portals turning into data dumps? The key is not to reduce the volume, but to ensure that it is all correctly labelled and maintained. 

This is one of the reasons why good metadata is so important. The metadata of a data set is simply a description of the contents of the document. The metadata should be written in such a way that it shows up in relevant searches. It should also make it clear where the data comes from, so that you know that it is reliable. And it should clearly distinguish it from other, similar data sets. 
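The sketch below shows, in Python, how descriptive metadata makes a data set findable. The field names and the two catalogue entries are illustrative assumptions, not the schema of any particular portal, and the search is deliberately simple.

```python
# Hypothetical metadata records; the field names are illustrative assumptions.
CATALOGUE = [
    {
        "title": "Public charging points for electric cars",
        "description": "Location and real-time availability of charging points.",
        "publisher": "City transport department",
        "keywords": ["transport", "electric cars", "charging"],
        "last_updated": "2024-05-01",
    },
    {
        "title": "Annual city budget",
        "description": "Planned and actual spending by department.",
        "publisher": "City finance department",
        "keywords": ["budget", "spending", "finance"],
        "last_updated": "2024-01-15",
    },
]

def search(catalogue, term):
    """Return titles of records whose title, description or keywords mention the term."""
    term = term.lower()
    hits = []
    for record in catalogue:
        searchable = " ".join(
            [record["title"], record["description"], *record["keywords"]]
        ).lower()
        if term in searchable:
            hits.append(record["title"])
    return hits

print(search(CATALOGUE, "spending"))  # ['Annual city budget']
```

A data set with an empty description and no keywords would never appear in such a search, however valuable its contents. The publisher and date fields serve the other two purposes named above: showing where the data comes from and telling similar data sets apart.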

Taking this a step further, linked open data practices allow for different data sets with overlapping content to be linked together. This allows users to find a variety of different data on the same topic without searching through multiple data sets one at a time.

Keep it clean

As well as being clearly labelled and easy to find, it’s important that data should be “clean”. This means it is formatted in such a way that it is easy for different software to open. Rather than having to spend time cleaning it themselves, analysts can immediately begin working with it. 

In practice, data tends to be messy. Sometimes columns or rows contain mixed data types (words and numbers), dates in the wrong format, or even spelling mistakes. Data analysis programs can be very powerful, but they don’t know how to deal with issues like this, and fixing them is time-intensive. Missing values are also a common problem, as they can make calculating things like averages difficult.
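To make these problems concrete, here is a small Python sketch that checks a handful of invented rows for the issues just described. The column names, the expected date format and the rows themselves are assumptions chosen for illustration. The function only reports problems; it does not change the data.

```python
from datetime import datetime

# Hypothetical messy rows: a date in a different format, a number
# written as a word, and a missing value.
RAW_ROWS = [
    {"date": "2024-01-31", "visitors": "120"},
    {"date": "31/01/2024", "visitors": "135"},
    {"date": "2024-02-02", "visitors": "twelve"},
    {"date": "2024-02-03", "visitors": ""},
]

def find_problems(rows):
    """Return (row_index, column, problem) tuples without altering the data."""
    problems = []
    for i, row in enumerate(rows):
        try:
            datetime.strptime(row["date"], "%Y-%m-%d")
        except ValueError:
            problems.append((i, "date", "not in YYYY-MM-DD format"))
        value = row["visitors"].strip()
        if value == "":
            problems.append((i, "visitors", "missing value"))
        elif not value.isdigit():
            problems.append((i, "visitors", "not a whole number"))
    return problems

problems = find_problems(RAW_ROWS)
for problem in problems:
    print(problem)
```

Only the first row passes every check. An analyst would otherwise have to find and fix the other three by hand before any calculation could run.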

At the same time, it is possible to overclean data. If I remove entire rows of data because a single value in each row is missing, I can distort the overall results obtained from the data. Decisions that could affect how the data is interpreted should be left to the users.
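A small worked example in Python shows how much difference this can make. The figures are invented: four departments, two of which have no recorded staff numbers.

```python
from statistics import mean

# Hypothetical figures: spending and staff numbers for four departments.
# None marks a missing value.
ROWS = [
    {"department": "A", "spending": 100, "staff": 10},
    {"department": "B", "spending": 200, "staff": None},
    {"department": "C", "spending": 300, "staff": 30},
    {"department": "D", "spending": 1000, "staff": None},
]

# Overcleaning: drop every row that has a missing value in any column.
complete_rows = [r for r in ROWS if all(v is not None for v in r.values())]
mean_after_dropping = mean(r["spending"] for r in complete_rows)

# Alternative: keep all rows and skip missing values only in the
# column being analysed.
mean_all_rows = mean(r["spending"] for r in ROWS if r["spending"] is not None)

print(mean_after_dropping)  # 200
print(mean_all_rows)        # 400
```

Dropping the incomplete rows halves the average spending, even though no spending figure was missing. Publishing the data with the gaps clearly marked lets each user decide how to handle them.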

Open data and AI

In their Open, Useful and Re-usable data (OURdata) Index, the OECD emphasises that high-quality open data is especially important in the age of AI. As LLMs depend on online sources for their answers, they’re only going to be reliable if they can draw on accurate, up-to-date data. 

At the same time, AI tools are already revolutionising how we work with data. Handling large data sets typically involves learning a specialist programming language such as R, Python, or SQL. Yet various AI chatbots have shown excellent results at writing code in these languages. In practice, this greatly simplifies the work of data scientists. But it doesn’t mean that someone with no programming knowledge can work with data this way. It is still vital that the user understands how the code works to avoid misinterpretation of the results.

There are also AI tools which can take data and automatically produce visualisations for you. Again, this requires some caution. You need to understand the nature of the data, and the statistics involved. Does the data represent spending in real (inflation-adjusted) or nominal terms? How can we distinguish genuine correlations from coincidences?
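The first of these questions can be illustrated with a short calculation in Python. The spending figures and the price index are invented, and the conversion shown (dividing by a price index relative to a base year) is one common way of adjusting for inflation, not the only one.

```python
# Hypothetical figures: nominal spending and a price index (base year 2020 = 100).
NOMINAL_SPENDING = {2020: 1000.0, 2023: 1150.0}
PRICE_INDEX = {2020: 100.0, 2023: 115.0}

def to_real(nominal, index_year, index_base=100.0):
    """Convert a nominal amount into base-year prices."""
    return nominal * index_base / index_year

real_2023 = to_real(NOMINAL_SPENDING[2023], PRICE_INDEX[2023])

# Growth measured without and with the inflation adjustment.
nominal_growth = NOMINAL_SPENDING[2023] / NOMINAL_SPENDING[2020] - 1
real_growth = real_2023 / NOMINAL_SPENDING[2020] - 1

print(f"{nominal_growth:.0%}")  # 15%
print(f"{real_growth:.0%}")     # 0%
```

A chart of the nominal figures would show spending rising by 15 per cent, while the same data adjusted for prices shows no increase at all. A tool that draws the chart automatically cannot know which of the two the reader needs.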

In other words, while AI can speed up the process of working with data, it also increases the risk of inaccuracies and misinterpretation. There should always be someone with a good grounding in data analysis to monitor accuracy and to interpret the results.

Can governments share all their data?

Making data open sounds great. In practice, no government in the world is yet sharing all of the data that they could be. There are various reasons for this, perhaps most importantly the cost of maintaining an open data portal. However, as data portals continue to evolve, we can expect to see governments sharing more and more data, at lower cost. 

Yet there are also a lot of cases where sharing data is not appropriate, and is even prohibited by law. This applies to data related to national security and other classified documents. But it also includes so-called microdata – records that contain information directly pertaining to individuals, households, or businesses. In Europe, the GDPR restricts the sharing of personal data, and many governments outside of Europe provide similar protections.

It’s clear that our personal data shouldn’t be made available online. Governments routinely collect this kind of data, but they cannot share it without violating the rights of individuals. Does this mean that a significant amount of government data is completely inaccessible? 

In fact, there are cases where this kind of data will be made available to researchers, though not on an open data portal. For example, Eurostat, run by the European Commission, makes sensitive microdata collected in its surveys available to researchers on a case-by-case basis. Researchers must work for an approved institution (typically a university) and have a specific research proposal. Only the relevant subsets of the data are provided. They also have to store the data securely, and commit to not sharing it further.

This can be seen as a compromise – a way of ensuring that independent researchers have access to relevant data, thereby ensuring transparency and accountability, without making our personal information widely available to people who might exploit it.  
