Lecture: Intro to Cleaning Data

Messy or dirty data is inaccurate, incomplete or erroneous information in a data set.

Common examples include:

  • data entry errors
  • spelling or punctuation errors
  • information in the wrong location
  • multiple pieces of information in one location
  • incomplete or outdated data
  • duplicate information

Messy data presents ethical issues for journalists. Here are some things to keep in mind and to consider:

  1. Follow the Journalists’ Code of Ethics. Data reporters must recognize that the work of “gathering and reporting information may cause harm or discomfort” to others and must do all they can to “minimize harm” and ensure their work is “accurate, fair and thorough.”
  2. Be skeptical of data as you would any source.
  3. Be careful not to change or omit facts in the process of cleaning.
  4. Whenever possible, double-check questionable information with source.
  5. Consider the potential implications of publishing data with errors (i.e., what if wrong name is associated with a crime?)
  6. Take dirty data into account in your analysis and conclusions.
  7. Explain dirty data issues in story or methodology.
  8. Invite audience to correct or missing data.
  9. Be ethical in your choice of visualizations. Avoid confusing or false interpretations.
  10. Be transparent.

Messy data can also be one of the most time consuming aspects of working with a data set.

Here are tips and tools to simplify the process:

Top ten ways to clean your data in Excel
Intro to cleaning data (UC Berkeley School of Journalism)
Open Refine
Open Refine tutorials

Here is a basic example of how to use three common functions – CONCATENATE, TRIM and PASTE SPECIAL (via Intro to cleaning data – UC Berkeley School of Journalism)

CONCATENATE

This Excel function joins up to 30 text items together and returns the result as text.

Concatenate has a lot of uses, including combining columns of data into a single field. For example, addresses separated into columns for street, city and state can be concatenated into a single column for easier geocoding with Google Maps.

To try…

Download and open the file data-cleaning (1)

You can see that the header categories are split up into separate rows.

Concatenate will put them together.

Put your cursor in cell D4
Type the formula =CONCATENATE(D1,D2,D3)
Hit Return.

excel7.jpg

You can see that it takes all of the set and puts into one cell and reads Violentcrimerate.

So you can add spaces to your formula by using & “ “ (with a space between quotes).

Put your cursor in cell D4
Type the formula =CONCATENATE(D1&” “, D2&” “,D3& ” “)
Hit Return.

excel8.jpg

You can now copy that formula to the other cells in a row.

The first is to click once in the cell with the formula and hover your mouse over the bottom left corner until the cursor changes to a black cross.

excel9.jpg

Next, click and drag to the left (or right).

excel10.jpg

TRIM

So doing this created another error – the empty cells add spaces. For example, Year has two spaces in front of it.

The Trim function removes extra spaces from text.

Put your cursor in cell A5
Type =TRIM(A4)
Hit Return.

excel12.jpg

You’ll see the space is gone.

Now copy that to the entire row.

Put your cursor in cell A5
Press control+shift+right arrow to select the entire row
Type control+r

PASTE SPECIAL

Now let’s get rid of all of the extra header rows. But if you delete them you will get an error message because it is using formulas from those rows.

So you want to create a row of text without the formula.

Click on Row 5 label and select the row.
Select Edit menu
>Copy
Then click on Row 6 label
Select Edit menu
>Paste Special

A window will open with options. Select Values. (In Google Sheets, it’s called Paste Values)

excell14.jpg

Now you can delete rows 1-5.
Select the first five rows
Select Edit menu
>Delete

 

 

This entry was posted in Lecture. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s