In Part 1 of this mini-series we tried to get an idea of the number of organizations by simply counting the unique names mentioned in the organization fields of the IATI data. A preliminary analysis of the 92k unique “organizations” revealed several obstacles, of which the two main ones appear to be: 1) the same organization name can be written in many different ways; 2) many of these values are not really names of organizations. In Part 2 we looked at the deduplication of the organization names. In this post we deal with cleaning the organization names.
To get a better feel for the data, we extracted the first word of every organization name, as well as the first + second word. We could then count how often each of them occurred and rank them. The most frequent first words were the ones you would expect, like the, la, l, el, de, les, al, but also ngo, ong, stichting, oo and ooo. If we leave these most common ones out, by far the most frequent first word was ministr*, so ministry, ministere, ministerio etc. It occurred 2845 times, both in the first-word ranking and in the first + second word ranking. We can debate whether all these ministries should count as separate organizations, but we leave them in for now. The same holds for the 308 names that started with organisatie/organisation/organismo/organizacion.
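The counting step described above can be sketched with `collections.Counter`. This is a minimal illustration, not the code used for the analysis: the function name `first_word_counts`, the `skip` parameter, and the lowercasing and whitespace splitting are all assumptions.

```python
from collections import Counter

# Very common first words mentioned in the text; the exact list used
# in the analysis is an assumption.
COMMON_FIRST_WORDS = {"the", "la", "l", "el", "de", "les", "al",
                      "ngo", "ong", "stichting", "oo", "ooo"}

def first_word_counts(names, skip=COMMON_FIRST_WORDS):
    """Count first words and first + second words of organization names."""
    first = Counter()
    first_two = Counter()
    for name in names:
        words = name.lower().split()
        # Ignore empty names and names starting with a very common word.
        if not words or words[0] in skip:
            continue
        first[words[0]] += 1
        if len(words) >= 2:
            first_two[" ".join(words[:2])] += 1
    return first, first_two

sample = ["Ministry of Health", "ministry of finance", "The World Bank"]
first, first_two = first_word_counts(sample)
print(first.most_common(5))
print(first_two.most_common(5))
```

Calling `most_common(19)` on either counter would give a ranking of the kind listed in this post.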
If you are curious, here is the top 19 of first words and of first + second words of all the organization names, after leaving out the most common ones discussed in the paragraph above:
| First word of organization name | Org name count | First + second word of organization name | Org name count |
Another thing we did was to split names longer than 150 characters. Many of these longer names looked something like: “Organization_name was founded” or “Organization_name established in 1958”. We therefore split at the first “ is | was | has | established | founded | est ”. This way, we only keep the Organization_name. All organization names that, after this initial cutting, were shorter than 3 characters (so 1 or 2 characters long) or longer than 200 characters were filtered out.
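As a rough sketch of this step, the split and the length filter could look as follows. The function names, the case-insensitive matching and the stripping of whitespace are assumptions made for illustration; only the keywords and the 150 / 3 / 200 limits come from the text.

```python
import re

# Split at the first of these words when surrounded by whitespace.
# Case-insensitive matching is an assumption.
SPLIT_PATTERN = re.compile(r"\s(?:is|was|has|established|founded|est)\s",
                           re.IGNORECASE)

def trim_long_name(name, split_above=150):
    """Keep only the part before the first keyword, for long names only."""
    if len(name) <= split_above:
        return name
    return SPLIT_PATTERN.split(name, maxsplit=1)[0].strip()

def has_valid_length(name, min_len=3, max_len=200):
    """Names shorter than 3 or longer than 200 characters are dropped."""
    return min_len <= len(name) <= max_len

def trim_and_filter(names):
    trimmed = (trim_long_name(name) for name in names)
    return [name for name in trimmed if has_valid_length(name)]
```

Names of 150 characters or fewer are left untouched, so a short name that happens to contain “ is ” is not cut.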
The list also contained 194 country names, and although these are in a way organizations, we leave them out. The same holds for all names that are just numbers: these too are filtered out.
We also removed all financial names, which clearly did not look like organization names. These were names that included words like funds, expenditure, transfer of/to/from, payment, private donor or partner+number. We also removed names including 1st, 2nd, any_other_number+th, any_number+e, q1, q2, q3 or q4.
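The filters from the last two paragraphs can be approximated with a few regular expressions. This is a sketch under assumptions: the exact patterns, the word-boundary matching, the definition of “just numbers” (digits plus spaces and common separators) and the `countries` parameter are hypothetical, and the country list itself has to be supplied by the caller.

```python
import re

# Financial terms named in the text; the exact patterns are assumptions.
FINANCIAL = re.compile(
    r"\b(?:funds|expenditure|transfer (?:of|to|from)|payment"
    r"|private donor|partner\s*\d+)\b",
    re.IGNORECASE,
)
# 1st, 2nd, <number>th, <number>e and q1..q4.
ORDINAL_OR_QUARTER = re.compile(r"\b(?:1st|2nd|\d+th|\d+e|q[1-4])\b",
                                re.IGNORECASE)
# Names made up of digits, whitespace and common separators only.
ONLY_NUMBERS = re.compile(r"[\d\s.,/-]+")

def looks_like_organization(name, countries=frozenset()):
    """Return False for country names, pure numbers and financial labels.

    `countries` is a caller-supplied set of lowercase country names.
    """
    if name.strip().lower() in countries:
        return False
    if ONLY_NUMBERS.fullmatch(name):
        return False
    if FINANCIAL.search(name) or ORDINAL_OR_QUARTER.search(name):
        return False
    return True

names = ["Oxfam Novib", "Payment to supplier", "12345", "Budget Q3", "Kenya"]
kept = [n for n in names if looks_like_organization(n, countries={"kenya"})]
print(kept)
```

Filters like these are blunt: a real organization with “funds” in its name would be dropped as well, so it is worth checking the removed names by hand.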
Just by cleaning we lowered the number of organization names from 92k to 89k, so that is 3k fewer!
In the previous part we used the Python library difflib to find names that are similar. Thanks to the cleaning up front we can increase the threshold for difflib. This way we get fewer matches that should not be a match (like asian development bank and african development bank), but possibly more non-matches that should be a match.
With the 400 slugs replaced by their full names (as we did in the previous part) and with the increased threshold for difflib (0.7 instead of 0.6), we find 46109 unique organization names. But as stated before, this still includes matches that should not be a match as well as non-matches that should be a match.
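To see what the threshold does, here is a minimal sketch using `difflib.get_close_matches`, whose `cutoff` argument plays the role of the threshold. Whether the analysis used this function or `SequenceMatcher` directly is not stated, so treat the helper `close_matches` and the toy list of names as assumptions.

```python
import difflib

def close_matches(name, candidates, cutoff):
    """Return all candidates whose similarity to `name` is at least `cutoff`."""
    return difflib.get_close_matches(
        name, candidates, n=max(1, len(candidates)), cutoff=cutoff
    )

# Toy data for illustration only.
candidates = [
    "asian development bank (adb)",
    "african development bank",
    "world health organization",
]

loose = close_matches("asian development bank", candidates, cutoff=0.6)
strict = close_matches("asian development bank", candidates, cutoff=0.7)
print("cutoff 0.6:", loose)
print("cutoff 0.7:", strict)
```

Raising the cutoff can only shrink the list of matches, which is the trade-off described above. Printing both lists for a handful of known pairs is a quick way to check what a given cutoff actually separates.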
We started this series with an estimate of somewhere between 20k and 50k organizations. By applying the six dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, and uniqueness) we tried to arrive at a count. Unfortunately, the current data quality does not allow us to find an exact number. So after this exercise we can conclude, with reservations, that there are around 40k-50k organizations mentioned in the IATI data.
If, in the future, we want to be able to do a more accurate count, it would be great if publishing organizations paid more attention to organization names/codes and IATI identifiers. With improved data quality the value of IATI data increases, for example by enabling us to answer interesting questions like how many organizations are mentioned in the IATI data.