How many organisations are mentioned in the IATI data? Part 2/3: deduplicating organisation names

In Part 1 of this mini-series we tried to get an idea of the number of organizations mentioned in the IATI data by simply counting the unique names in the organization fields. From a preliminary analysis of the 92k unique “organizations” we identified several obstacles, of which the two main ones are: 1) the same organization name can be written in many different ways; 2) many of these values are not really names of organisations. Let’s start with the first problem, the deduplication of the organization names. We will deal with the second problem in Part 3.

Python's standard library includes a module called difflib, which we can use to find names that are similar. In the Table at the top of this post you can find some examples of correct and incorrect matches. For example: we still cannot match dfid with uk department for international development dfid, because the two strings are too different. However, we can now see that many other ways of writing department for international development dfid are indeed classified as the same organization. Unfortunately, we can also no longer distinguish between the asian development bank and the african development bank.
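To make these examples concrete, here is a minimal sketch of how a similarity score can be computed with difflib. It is not necessarily the exact call used for this analysis: the helper name similarity and the lowercasing step are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1 for how alike two names are.

    Hypothetical helper: names are lowercased and stripped first so that
    differences in case and surrounding whitespace do not count.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


# Two different banks whose names share almost every character: high score.
banks = similarity("asian development bank", "african development bank")

# The same organisation, once abbreviated and once in full: low score.
dfid = similarity("dfid", "uk department for international development dfid")

# Two spellings of the same full name: high score.
full = similarity(
    "department for international development dfid",
    "uk department for international development dfid",
)

print(f"banks: {banks:.2f}, abbreviation: {dfid:.2f}, full names: {full:.2f}")
```

The score only measures how many characters the two strings share, which is why the two banks look alike while an abbreviation and its full name do not.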

Of the 6k organizations that are mentioned by more than one publisher, difflib leaves us with 3.4k unique organizations. When we also include the other 86k ‘organizations’, we find a total of 48k unique organizations. This number includes some organizations that are categorized as different when they are the same, as well as some that are categorized as the same when they are different, so it is possible that it is somewhere in the vicinity of the real number. It remains hard to say.
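One way to turn pairwise similarity into a count of unique organizations is a greedy grouping: each name is either attached to the first sufficiently similar name seen so far, or becomes a new representative. The sketch below illustrates that idea only. The function name group_similar and the cutoff of 0.85 are assumptions, not the settings behind the numbers above.

```python
from difflib import get_close_matches


def group_similar(names, cutoff=0.85):
    """Map every name to a representative name that it closely resembles.

    Hypothetical helper. The cutoff is an assumed value: raising it splits
    more names apart, lowering it merges more names together.
    """
    representatives = []  # one entry per group found so far
    mapping = {}
    for name in names:
        key = name.lower().strip()
        # Best match among the existing representatives, if any is close enough.
        match = get_close_matches(key, representatives, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            representatives.append(key)
            mapping[name] = key
    return mapping


names = [
    "dfid",
    "department for international development dfid",
    "uk department for international development dfid",
    "asian development bank",
    "african development bank",
]
mapping = group_similar(names)
print(len(set(mapping.values())), "unique organizations")
```

With these five names the sketch shows both kinds of error: the abbreviation dfid stays in its own group, while the two banks end up in the same one. Note that every name is compared with every representative, so this approach becomes slow for tens of thousands of names.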

The information about current IATI publishers includes the official name as well as several other pieces of information, including the correct IATI identifier. One field that is particularly interesting in our case is the slug: a short version of the name. As we can see in the bottom right corner of the Table, the hardest thing for difflib to match correctly is a full name against its abbreviation. Luckily for us, we now have a table that contains both the full name and the abbreviation. We can use this to replace the slug with the official name in our data.
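The replacement itself can be a simple dictionary lookup. The sketch below assumes the publisher information has been loaded as a list of records with the keys "slug" and "name". Those key names, the sample record, and the helper names are illustrative assumptions, not the actual layout of the publisher table.

```python
def build_slug_lookup(publishers):
    """Build a slug -> official name dictionary from publisher records.

    Assumes each record is a dict with "slug" and "name" keys (hypothetical).
    """
    return {p["slug"].lower(): p["name"].lower() for p in publishers}


def expand_slugs(names, lookup):
    """Replace any name that is exactly a known slug with its official name."""
    expanded = []
    for name in names:
        key = name.lower().strip()
        # Names that are not a known slug are kept unchanged.
        expanded.append(lookup.get(key, name))
    return expanded


# Illustrative record only; the real table has one row per IATI publisher.
publishers = [
    {"slug": "dfid", "name": "uk department for international development dfid"},
]
lookup = build_slug_lookup(publishers)
print(expand_slugs(["dfid", "asian development bank"], lookup))
```

Because the lookup is an exact match on the whole string, it only catches names that consist of the slug and nothing else. Everything else still has to go through the fuzzy matching.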

When we do this, we see some changes in the top eleven organizations mentioned by other publishers. In first place, dfid is replaced by the official name uk department for international development dfid. Combining the abbreviated name with the full name also leads to several new organizations showing up in the top eleven, like unicef in 6th place.

Though we could count again right away, it is good to realize that we have only replaced about 400 slugs with their full name. Though this drastically changes the list of most mentioned organizations, the rarer “organization names” still need to be cleaned. For example, in the 48k unique organizations we found earlier using difflib, all 1.5k “organisation” names with a length of more than 200 characters are still present.

We have to look at the second problem we identified, which is that many of the “organisation” names are not really names of organisations. This problem needs to be addressed first, before we count again.

To be continued…