How many organizations are mentioned in the IATI data? Part 1/3: The challenge

On my first day as a Data Scientist at D4D I was asked if I could count the number of organizations that are mentioned in the IATI data. It sounds like a simple question, but it turned out to be quite a challenge.

Although it is clear that there are 1085 publishers (November 2019), nobody really knows how many organisations these publishers work with. I was told that we suspect it to be somewhere between 20k and 50k. In theory, all these organizations should publish their IATI data as well if we want to get a complete picture of what happens with development money. That sounds like something worth looking into!

To start, I downloaded a snapshot of all the IATI data. When I extract the content of all the fields like ‘organisation’, ‘participating-org’, ‘reporting-org’, ‘name’, ‘provider-org’, ‘receiver-org’, etc., I find over 160k (narrative) values. That seems like a lot, but of course many of these fields contain the same data.
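To give an idea of what such an extraction could look like, here is a minimal sketch in Python. It is not the actual script I used: the function name extract_org_mentions, the tag list ORG_TAGS and the sample XML are made up for illustration, and it only covers a few of the fields mentioned above. It assumes that a name sits either in a narrative child element or directly as text inside the organisation element.

```python
import xml.etree.ElementTree as ET

# Hypothetical selection of organisation-like elements; the real list is longer.
ORG_TAGS = {"reporting-org", "participating-org", "provider-org", "receiver-org"}

def extract_org_mentions(xml_text):
    """Return (tag, ref, name) tuples for every organisation-like element."""
    root = ET.fromstring(xml_text)
    mentions = []
    for elem in root.iter():
        if elem.tag not in ORG_TAGS:
            continue
        ref = elem.get("ref", "")
        # Prefer <narrative> children; fall back to text placed directly in the element.
        names = [n.text for n in elem.findall("narrative") if n.text]
        if not names and elem.text and elem.text.strip():
            names = [elem.text]
        for name in names:
            mentions.append((elem.tag, ref, name))
    return mentions

# Made-up sample document, only for illustration.
sample = (
    "<iati-activities><iati-activity>"
    '<reporting-org ref="GB-GOV-1"><narrative>DFID</narrative></reporting-org>'
    '<participating-org ref="gb-1" role="1">'
    "<narrative>UK Department for International Development (DFID)</narrative>"
    "</participating-org>"
    '<transaction><provider-org ref="GB-1">DFID</provider-org></transaction>'
    "</iati-activity></iati-activities>"
)

mentions = extract_org_mentions(sample)
```

Running this over every file in the snapshot and concatenating the results would give one long list of raw mentions to work with.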

Making all the organization names lowercase; removing signs like brackets, percentage signs and exclamation marks; and removing extra spaces at the beginning and at the end of the names, makes them more easily comparable. Then, when I simply count the number of unique values, I find about 92k “organisations” that are mentioned in the IATI data. 
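The cleaning steps above can be sketched roughly as follows. The function name normalize_name and the exact set of characters that gets removed are assumptions for illustration. Collapsing repeated spaces inside a name is also an extra step that I add here, so that a removed bracket does not leave a double space behind.

```python
import re

def normalize_name(name):
    """Lowercase a name, drop brackets, % and !, and tidy up the whitespace."""
    name = name.lower()
    # Replace the unwanted signs by a space so neighbouring words stay separated.
    name = re.sub(r"[()\[\]{}%!]", " ", name)
    # Assumption: also collapse runs of whitespace inside the name.
    name = re.sub(r"\s+", " ", name)
    return name.strip()

def count_unique(names):
    """Count the distinct names that remain after normalization."""
    return len({normalize_name(n) for n in names})

# Made-up raw values, only for illustration.
raw_names = [
    "DFID",
    " dfid ",
    "UK Department for International Development (DFID)",
    "uk department for international development [DFID]!",
]
unique_count = count_unique(raw_names)
```

In this toy example the four raw values collapse into two distinct names.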

The reason I put quotation marks around “organisations” is that many of these values are not really names of organisations. Some publishers put the entire history of an organisation in a field. This happens most often in the participating-org field of an iati-activity. In that field over 1.5k “organisation names” have a length of more than 200 characters. The longest was about four A4 pages of text. Other ‘names’ include single characters, quarterly figures, yearly figures, (secret) codes, unknown acronyms and mysterious number combinations. Combined, these account for at least 3000 “organisation names”.

Since these strange names are usually quite unique to a specific publisher, we could just look at organizations that are mentioned by more than one publisher. However, it turns out that only 6k names were mentioned by more than one publisher. That leaves us with two problems: 1) not all 86k remaining “organisation” names are bogus and 2) even the 6k that were mentioned by multiple publishers contain duplicates. 

This second problem becomes clear when we rank the organizations by the number of publishers that mention them. In the top eleven we find: 1. dfid (mentioned by 210 publishers), 2. uk department for international development dfid (by 172 publishers), 5. department for international development (by 54 publishers), 10. department for international development dfid (by 32 publishers) and 11. uk department for international development (by 30 publishers); see the Figure at the top. Also, 2. netherlands ministry of foreign affairs and 6. dutch ministry of foreign affairs are the same organization.
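Counting and ranking names by the number of publishers that mention them could look something like the sketch below. It assumes the data has already been reduced to (publisher, normalized name) pairs. The function names and the example publishers are hypothetical.

```python
from collections import defaultdict

def publishers_per_name(pairs):
    """Map each name to the number of distinct publishers that mention it."""
    seen = defaultdict(set)
    for publisher, name in pairs:
        seen[name].add(publisher)
    return {name: len(pubs) for name, pubs in seen.items()}

def shared_names(pairs, min_publishers=2):
    """Names mentioned by at least min_publishers publishers, most mentioned first."""
    counts = publishers_per_name(pairs)
    # Sort by descending count, then alphabetically to keep ties stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, n) for name, n in ranked if n >= min_publishers]

# Made-up (publisher, name) pairs, only for illustration.
pairs = [
    ("publisher-a", "dfid"),
    ("publisher-a", "dfid"),
    ("publisher-b", "dfid"),
    ("publisher-c", "dfid"),
    ("publisher-b", "department for international development"),
    ("publisher-c", "department for international development"),
    ("publisher-a", "q3 2018"),
]
ranking = shared_names(pairs)
```

Note that a publisher that mentions the same name many times still counts only once, which is why the sketch collects publishers in a set.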

This problem, of different ways of writing the same organization name, is not new. This is exactly the reason the IATI identifier was introduced. Unfortunately, just for 1. dfid, I found over 60 reference codes, including the right one (gb-gov-1), but also gb-1, dfid, gb-01, (uk), dfid gb-1, and gb-!.
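A simple way to see how many reference codes are used for one name is to group the codes per normalized name, as in the sketch below. The function name refs_by_name and the example pairs are made up, and lowercasing and trimming the codes is an assumption about how they could be compared.

```python
from collections import Counter, defaultdict

def refs_by_name(pairs):
    """Group reference codes per name and count how often each code occurs."""
    refs = defaultdict(Counter)
    for name, ref in pairs:
        ref = ref.strip().lower()
        if ref:  # skip mentions that come without a reference code
            refs[name][ref] += 1
    return refs

# Made-up (name, ref) pairs, only for illustration.
name_ref_pairs = [
    ("dfid", "GB-GOV-1"),
    ("dfid", "gb-gov-1 "),
    ("dfid", "GB-1"),
    ("dfid", "gb-01"),
    ("dfid", ""),
]
dfid_refs = refs_by_name(name_ref_pairs)["dfid"]
```

The number of keys in dfid_refs is then the number of different codes in use for that name.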

So to get back to the question, by now the only thing we know for sure is that the number of organizations that are mentioned in the IATI data will be less than 87.5k. 

To be continued…

Part 2 can be found here.