Summer Data Lovin’

Summer is the perfect time to do some data cleanup.  Get yourself prepped for fall solicitation season by cleaning and de-duplicating your records now.  Let’s look at five DIY solutions that can be implemented at no cost to merge duplicates, identify other data anomalies, and keep your data clean.

Logical audits

Use database logic (in a database query or within Excel after exporting your records) to identify potentially incorrect or anomalous records.  This is known as performing an audit.  Develop your list of audits based on the most useful data for you, e.g. if you use direct mail to raise money, then salutation and address data will be useful.

Think in terms of AND, NOT and OR, e.g. to find constituents who meet your criteria.  For example, you may look for people who are NOT deceased AND have data in Address Line 1 AND do not have any data in the zip code field OR do not have any data in the city field OR do not have any data in the state field.  This will yield you a list of folks with partial addresses.

Review constituents with a state equal to your organization’s home state and a zip code that does not start with your state’s 2-digit zip prefix(es), e.g. state = Oregon and zip code does not start with “97”.

Review constituents with a phone number that is less than 10 digits long, or a zip code that is anything other than 5 or 9 digits long.

Review constituents who are missing critical pieces of data that can probably be filled in with a small amount of research or deduction, e.g. title (Mr./Ms.) or a piece of address data (e.g. city is filled in but state is not).

Review constituents who are over 100 years old.  They are probably deceased, though this is certainly not a guarantee, and every constituent should be hand-researched prior to being marked deceased.

Review constituents with a gender of F and a title of “Mr.,” or a gender of M and a title of “Ms.” Note that for the most part, this type of match is probably a database error, but caution should be exercised to ensure that this is not a legitimate constituent preference prior to updating.

To maintain excellent data hygiene, develop a list of standard audits, including some that are custom to the types of data your organization stores, and run them on a routine basis, ideally quarterly.

Merge duplicates

Duplicate records are a common bugaboo of database systems, and are particularly common for organizations that pull data into their system from multiple sources, e.g. online giving or purchase transactions feeding directly into the database.  Your database system may have a tool to use to identify and merge potential duplicate records.  That’s your first step.

If your database is lacking a duplicate identification tool, or the tool is not very robust, your next step is to export a report of all of your constituents to Excel (assuming you have a small enough number of records to easily manipulate in Excel, say less than 100,000 or so).  From there, use formulas to identify duplicate records by comparing specific pieces of data between records.  I like to use the VLOOKUP formula to do this.

I start by trying to match full first and last name, using the CONCATENATE formula to create a combined name, e.g. AmandaJarman.  However, this will only get you so far, as some of your duplicates may have alternate name spellings or typos.  And, sharing the same first and last name is not a guarantee that two people are the same person.

My favorite trick is to use CONCATENATE, LEFT, and RIGHT to create new “fairly-unique” identifiers, such as the first four letters of first name and the last four numbers of phone number (jarm1630), or the first four letters of the first name and the first four characters of address (aman4111).  Use UPPER or LOWER to standardize case.

Email address is another great identifier to use, as it is generally a unique identifier.  However, some partners may share an email address, so it’s not always reliable.

Once I have my identifiers for each record, then I use VLOOKUP to cross-reference between records.  So, if two records share the same fairly unique identifier, e.g. jarm1630, then VLOOKUP will return the ID number of the second record.

This requires a little bit of fancy footwork.  I make a copy of my spreadsheet so that I have two copies to compare to one another using VLOOKUP.  You’ll want to sort one list ascending by ID and the other descending by ID, which will help your VLOOKUP catch as many matches as it can.

If you want to get really fancy, you can count the number of data points you are able to match to calculate a “strength of match” rating, e.g. if two records share the same concatenated first/last name and email address, it’s pretty likely that they are duplicate records.

Once you’ve identified the duplicates, merge them in your system, using the merge tool if your system has one.  Make sure that you migrate all gift, event attendance, volunteer and other participation data into one combined record, as you don’t want to lose any of this precious info – particularly the giving information.

Contact information is another story – you can only have one current home address, so in this case, only one data point can survive the merge.  You’ll need to determine which record has the most current contact information.  Hopefully your database system timestamps addresses and other contact data with a “last updated date”, and even better, the source of the update.  Pay attention to this when choosing which data to retain, and which to archive or delete.

Review your custom fields

Most databases come with fields that you can customize to hold data that is specific to your organization or which your database designer did not accommodate in their design.  Sometimes these fields might be called “attributes”.

These fields tend to become a dumping ground for all kinds of data, and over time can become a nightmare of inconsistency and irrelevance.  Take a few hours to review the data stored in these fields, and to ask a few questions:

How are we using this data?

Is this data relevant or useful?

Is this data easy to report?

Is each field used for one purpose only?  (That is, are the codes stored in that field part of a logical “family”?)

Are there any codes in here that have a purpose nobody can remember?

Is there any data in here that has been stored in multiple locations in the database?

Is there any information in here that is incomplete or inconsistent, e.g. spotty records of board service?  And if so, is it worth trying to remedy – based on how important the information is for institutional history and future decision-making, the difficulty of obtaining the missing data, and the time effort involved in updating the database?

Are all of these fields and their coding structures well-documented?

Then answers to these questions may inspire a cleanup/documentation plan.  Archive data that is no longer useful.  If multiple types of data have been stored in a field meant for one type of data, find new homes for your wayward codes, and migrate them there.  If one type of data has been stored in multiple places, determine the best single home and migrate the data there.  If data is missing (and is valuable enough to update), research it and fill in the gaps.

That funky module

A lot of us have one – that funky module of your database that was never consistently or completely implemented, or you don’t quite trust the information in it.  Maybe it’s the events module – a common place for funky data to live?   Or your appeal tracking?  Maybe (let’s hope not, but it happens), it’s your giving data.  Do a quick review of your database modules, and rank them by vitality and degree of messed-up-ness.  If your giving history or address/phone information is faulty, start there, as your management of those modules translates most directly into donor stewardship.  Ask yourself some of the same questions you asked when looking at custom fields, and create a step-by-step cleanup plan.

Clean the database, not the report

How many times have you pulled a mailing list, only to make a million corrections, like de-duplicating, combining partners into one line, or correcting the capitalization on address data?  And then the next time you pull the list, you have to do it all again!  It is very tempting to “save time” by cleaning up bad data in a report rather than in the system.  The best practice is to make corrections within your database, and then re-pull your report.  Ultimately, this won’t take much longer than updating information by hand in your spreadsheet, and it will save a lot of time (and mailing dread) in the long run.  Make this a consistent practice, and be amazed at how much cleaner your data stays.