Open Refine (previously Google Refine) is a data cleaning software that uses your web browser as an interface. This means it will look like it runs on the internet but all your data remains on your machine and you do not need internet connection to work with it.
The main aim of Refine is to help you exploring and cleaning your data before you use it further. It is built for large datasets – so don’t worry as long as your spreadsheets can hold the information, refine can as well.
To work with your data in Refine you need to start a new project.
Walkthrough: Creating a Refine project
- Start Refine – this will open a browser window pointing to https://127.0.0.1:3333 if this doesn’t happen open the link with your browser directly
- Create a new project: On the left tab select the “Create Project” tab:
- Click on “Choose Files” to choose your downloaded file and click on “next” – you can also use the URL to the CSV directly if your data is hosted on the web.
- You will get a preview on how refine will interpret your data – if you have selected a well formatted CSV or other file: this should be pretty automatic.
- Review the preview carefully to make sure the data looks right. Double check character encoding. Much, but not all data uses UTF-8 these days, but make sure you don’t see any funny characters in preview.
- In our Case all the Column have no column headings – since the first row of the file is actually empty.
- If you check further down, you’ll notice that the column headings for the data are in rows 5 and 6.
- Let’s tell refine where to find column information: On the bottom you can find settings on how refine should look for column information.
- We’ll tell it to ignore the first 5 lines, take the next 2 lines as column labels and ignore one further row:
- Great – now the preview looks much nicer.
- Name your project in the box on the top right side and click on “Create Project”.
- The project will open in the project view, this is the basic interface you are going to work with: by default refine shows only 10 rows of data, you can change this on the bar above the data rows. Also you can use the navigation on the right to see the next or previous rows.
You now have successfully created your first refine project. Remember: although it runs in a web-browser, the Refine server is still on your machine – all the data is there (so no worries if you handle sensitive information)
Once we created our project, let’s go and explore the data and the Refine interface a bit. Using Refine might be intimidating at first, since it seems so different from spreadsheets, once you get used to it you will notice how easily you can do things with it.
One of the commonly used functions in spreadsheets is sorting and filtering data – to figure out minima, maxima or things about certain categories. Refine can do the same thing.
Walkthrough: Sorting rows
- Refines handles data similar to a spreadsheet: you have rows, columns and cells – a cell is a field defined by a row and a column.
- To sort your rows based on a specific column click on the small downward triangle next to the column.
- Select “Sort…” to open the sorting dialog.
- You can select what to sort the values as and then what order to sort in. (We’ll sort in text, since for now we only have text columns).
- Click “OK” and your rows will be sorted based on the column.
- To undo the sort, click on the column options again, select “sort” then “remove sort”
The other frequently used function in Spreadsheets is filtering – in Refine this is called facetting. Facetting in Refine is really powerful – you will see in most of the rest of the Recipe we’ll use facets.
Walkthrough: Facetting rows based on a column
- Select the column options for the column you want to facet with
- Select “Facet”
- You can facet differently for text, numbers or dates – let’s facet as text – click on “Text facet”
- This will open a facet in the left bar
- Now select one or more of the choices and you’ll see how your data rows are limited to just those selected.
- Of course you can add more than one facet and thus filter more than once.
- Let’s add another facet: Note that one of the columns contain dates: we can add a timeline facet to let us select a date range
- This will open a new facet looking like:
- You can select a time frame by moving the indicators left and right.
Sometimes humans make mistakes when they enter data – they mistype city names or put in characters they can not see but the computer can. (For example, you can add a simple space at the end of a name and the computer will think they are different). For this let’s create a text facet for the cities:
Walkthrough: Reconciling Columns
- Create a text facet for the City column
- Scroll down where it says La Paz: see how many different ways there are to write la paz?
- Let’s fix this! This is so common that refine has a built in mechanism for detecting these kind of differences.
- Click on the “Cluster” button on the top right in the facet.
- Now check all the things Refine correctly noticed as similar
- Click on the “Merge Selected & Re-Cluster” button. Play with the keying function and methods on top to identify further possible merges.
Did you notice how most of the Cities are all uppercase? It’s rare to read them like this. And maybe you want to have nicer looking names: No problem. Refine supports this.
Walkthrough: Changing Case in Refine
- Let’s change the case in our city column from all uppercase to titlecase
- To do this, open the column options, go to edit cells -> common transforms -> to titlecase
- Tada – magically your names have been converted.
Congratulations! You successfully cleaned up a dataset using Refine!