What are the real challenges visualisation developers face, and what don't they want you to know about their work? Nate Agrin and Nick Rabinowitz lead you through some of the dirty secrets of the infovis trade.
Visualisation is the new 'must-have' element in project proposals and personal portfolios, and startups like Platfora, Datameer, and our own employers ClearStory Data and Chartio are raising millions for analytics platforms with browser-based visualisation interfaces.
To some extent, the buzz is justified. Data visualisation is a wonderful way of exploring data, finding new insights, and telling a compelling story. But what are the real challenges visualisation developers face - and what don't they want you to know about their work?
We'll lead you through some of the dirty secrets of the information visualisation (infovis) profession, offering an inside look at the process of visualisation development, along with practical tools and approaches for dealing with its inevitable challenges and frustrations.
Secret #01: Real data is ugly
Most data visualisation tutorials start with a pleasant fantasy: a pristine data set. Whether you’re learning to build a basic bar chart or a force-directed network graph, you’re presented with clean, normalised, well-formatted base data. This perfect JSON or CSV file is the digital analog of the neatly prepped mise en place in a televised cooking show: the refined result of tedious, painstaking work presented as raw ingredients. In practice, when dealing with most real-world data sets, expect to spend up to 80 per cent of your time finding, acquiring, loading, cleaning and transforming your data.
Some of this process can be done with automated tools, but almost any data cleaning involving two or more data sets will require some level of manual work. A wide variety of tools can convert XLS to XML or timestamps to other date formats, but nothing can automagically map one company’s internal sales categories to those of its competitors, or deal reliably with data entry typos, incompatible character encodings, or (shudder) poor OCR.
Tools and strategies
- Budget significant time in any visualisation project for data cleanup. Increase your estimate (in some cases exponentially) for multiple data sources, manually entered or OCR data, divergent categorisation schemes, and non-standard formats
- Google Refine is a great data cleanup workhorse, though it has limitations, particularly for non-tabular data. Other cleanup-specific tools include Data Wrangler and Mr. Data Converter. However, many tasks still require basic proficiency in a scripting language like Python or manual work in Excel. Save your scripts - you’ll use them again
- Eat your own dog food if you can: visualisation is a great tool for identifying data problems. Use scatter plots and histograms to find and fix suspicious outliers
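As a rough illustration of the kind of script worth saving, the sketch below coerces messy numeric strings and flags suspicious outliers. The function names `to_number` and `flag_outliers` are hypothetical, and the rule used (Tukey's fences at 1.5 times the interquartile range) is just one common convention, not the only reasonable choice.

```python
from statistics import quantiles


def to_number(raw):
    """Coerce a messy string such as ' 1,200 ' to a float, or None if unparseable."""
    try:
        return float(str(raw).replace(",", "").strip())
    except ValueError:
        return None


def flag_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences (k times the IQR)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


raw_column = ["10", "12", " 11", "13", "12", "n/a", "11", "10", "12", "1,200"]
parsed = [to_number(cell) for cell in raw_column]
clean = [v for v in parsed if v is not None]
print("unparseable cells:", parsed.count(None))
print("suspicious values:", flag_outliers(clean))
```

A flagged value is only a candidate for review: it may be a typo, or it may be the most interesting point in the data set.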
Secret #02: A bar chart is usually better
One of the first questions to ask when considering a potential visualisation design is “Why is this better than a bar chart?” If you’re visualising a single quantitative measure over a single categorical dimension, there is rarely a better option. Likewise, time-based data is usually best displayed on a line chart, and scatterplots are often best for exploring correlations between two linear measures. At the risk of sounding regressive, there are good reasons these charts have been in continuous use since the 18th century. Bar charts are one of the best tools available for facilitating visual comparisons, leveraging our innate ability to precisely compare side-by-side lengths.
The corollary to bar chart superiority, and perhaps the dirtiest secret in this article, is that the coolest-looking visualisations are often the least useful. The novelty and aesthetic appeal of custom visualisations comes at a cost: the clarity of the data. Most bar chart alternatives ask the viewer to compare differences we have a harder time discerning: areas, angles, hues, or opacities. At best, such visualisations make comparison difficult; at worst, they distort the data entirely, leading viewers to false conclusions.
Tools and strategies
- Don’t dismiss traditional visualisation choices if they represent the best option for your data. Start with bar and line charts, and look further only when the data requires it
- Have a good rationale for choosing other options. Compared to bar charts, bubble charts support more data points with a wider range of values; pies and doughnuts clearly indicate part-whole relationships; treemaps support hierarchical categories
Secret #03: There’s no substitute for real data
Cleaning and formatting a single data set is hard enough, but what if you’re building a live visualisation that will run with many different datasets? Maybe you have to build a visualisation for use in multiple departments within one company, where every department has its own database, and you don’t have time to manually clean each dataset. Your first instinct may be to grab some demo data and use that to build your visualisation; your visualisation library may even come with standard sample data.
Unfortunately, there is no substitute for real data. Demo data tends to have a normal distribution and a manageable number of records; it’s designed to show visualisations in their best light. A bar chart built on it doesn’t just have the requisite bars, it looks like an ideal bar chart. Demo data doesn’t help you plan for data discrepancies, null values, outliers, or other real-world problems. If you rely on it too much, you may find when you plug in real data that your visualisation wasn’t well suited to that data to begin with.
Tools and strategies
- If you cannot access an entire dataset, use several random samples of real data instead
- Invalid and missing data are guaranteed. If your data won’t be cleaned before being graphed, do not clean your sample data
- Real data may be so large as to overwhelm your visualisation or the system generating it. Be sure that if you use a sample of data you correctly scale up the sample size (or reduce it appropriately) before creating a final visualisation
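One way to draw a random sample from a data source too large to load in full is reservoir sampling, sketched below. The function name `reservoir_sample` and the toy records are assumptions made for illustration. Note that the sample is deliberately left uncleaned, so missing values survive into it.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Pick k records uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Toy stand-in for a real feed: every seventh value is missing.
rows = ({"id": n, "value": None if n % 7 == 0 else n * 1.5} for n in range(10_000))
for row in reservoir_sample(rows, k=5):
    print(row)
```

Drawing several samples with different seeds gives a better feel for how variable the real data is than any single sample does.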
Secret #04: The devil is in the details
Designing the labels, legends and axes for your visualisation is often an afterthought to the initial visualisation. But these elements are crucially important to the visualisation, and can be difficult and time-consuming to get right, especially when you can’t predict the data ahead of time.
When laying out your visualisation, leave significant rendering space for any additional marks you may need, often including relatively wide margins around the graphical part of your visualisation. Axis labels should be spaced such that they do not occlude each other and are easily readable. Rotate or reposition labels if necessary for legibility. If a particular area is overcrowded with labels, but you need them for clarity, consider moving the labels farther from the elements they reference and connecting them with an indicating line. Another technique is to group crowded labels together in a single tooltip-like group. Consider the space you’ve allowed and the length of the longer labels. If the labels won’t fit, you might need to shorten them with ellipses, or simply truncate the text at a fixed length.
Similarly, legends require advance planning to render well. One easy option is to reserve some space for the legend to one side of the graphic. Unfortunately, this means that you’ll need to reduce the size of the graphical portion of your visualisation. In order to preserve some space you may be able to place the legend in an empty part of the graphic, or make the legend draggable so the viewer can access any graphics underneath.
Tools and strategies
- Plan space around your graphic for labels, axes and legends
- Designate a maximum character length for labels, truncating if needed to prevent crowding. Group nearby labels together, revealing them in response to user actions
- Consider scrolling or accordion-style expansion for long legends
- Whatever you do, don’t leave these elements out. Labels may seem like a secondary concern when you’re focused on the graphic elements, but they are incredibly important to your viewers
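A maximum label length can be enforced with a few lines of code. The sketch below is a minimal example; the function name `truncate_label` and the default limit of 12 characters are arbitrary assumptions, and in practice the limit depends on your font and the space you have reserved.

```python
def truncate_label(text, max_chars=12, ellipsis="…"):
    """Shorten text to at most max_chars characters, ending in an ellipsis if cut."""
    if max_chars < len(ellipsis):
        raise ValueError("max_chars must leave room for the ellipsis")
    if len(text) <= max_chars:
        return text
    # Trim trailing spaces so the ellipsis sits directly against the last word.
    return text[: max_chars - len(ellipsis)].rstrip() + ellipsis


for label in ["Europe", "North American Sales", "Asia Pacific (excl. Japan)"]:
    print(truncate_label(label))
```

Keeping the full text available, for example in a tooltip, lets viewers recover what the truncation hides.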
Secret #05: Animate only when appropriate
As a visualisation author, it’s often tempting to add animations into your final product. Animations are a powerful way of connecting data to changes in state and trends. However, animations can also lead to confusing or misleading interpretations of your data. You should carefully plan how an animation will affect your entire output, and not simply add it at the end of your work. Animations work best when they reveal data relationships: how data groups together between different states, how the data changes over time, or how data points are directly related.
In general, make your animations simple, predictable and re-playable. Allow users to view the animation multiple times so they can track where objects start and end. Avoid occluding objects in a transition with other objects, which makes tracking more difficult, and do not transition objects along unpredictable paths. With complex animations, research suggests that viewers’ comprehension improves when the animation is broken into simple 'staged' transitions. A stage pauses the animation with the objects in a transitioning state and provides the viewer a moment to reflect on the state of each object.
Tools and strategies
- Strive to make your animations as simple as possible
- Consider staged animations when an animation is either complex or has many transitioning objects
- Flashy animations are often entertaining at first, but quickly become frustrating to the viewer. Do not add animation just because you can
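To make the idea of staging concrete, here is a minimal sketch of a two-stage transition with a pause in between. Splitting the movement into a horizontal stage followed by a vertical stage is purely an illustrative assumption, as are the function name `staged_position` and the default pause length. The right staging depends on what your data is doing.

```python
def staged_position(start, end, t, pause=0.2):
    """Position at progress t in [0, 1]: move in x, hold, then move in y."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    if not 0.0 <= pause < 1.0:
        raise ValueError("pause must be in [0, 1)")
    (x0, y0), (x1, y1) = start, end
    stage_one_end = (1.0 - pause) / 2.0      # first stage finishes here
    stage_two_start = stage_one_end + pause  # second stage begins here
    if t < stage_one_end:
        k = t / stage_one_end
        return (x0 + (x1 - x0) * k, y0)
    if t <= stage_two_start:
        # Hold between stages so the viewer can take in the intermediate state.
        return (x1, y0)
    k = (t - stage_two_start) / (1.0 - stage_two_start)
    return (x1, y0 + (y1 - y0) * k)


for step in range(6):
    t = step / 5
    print(t, staged_position((0, 0), (10, 20), t))
```

Because each object follows a straight, predictable path in each stage, viewers can track it more easily than along a single diagonal or curved sweep shared with many other objects.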
Secret #06: Visualisation is not analysis
It's a central tenet of the field that data visualisation can yield meaningful insight. While there’s a great deal of truth to this, it’s important to remember that visualisation is a tool to aid analysis, not a substitute for analytical skill. It’s also not a substitute for statistics: your chart may highlight differences or correlations between data points, but to reliably draw conclusions from these insights often requires a more rigorous statistical approach. (The reverse can also be true - as Anscombe’s Quartet demonstrates, visualisations can reveal differences statistics hide.) Really understanding your data generally requires a combination of analytical skills, domain expertise, and effort. Don’t expect your visualisations to do this work for you, and make sure you manage the expectations of your clients and your CEO when creating or commissioning visualisations.
Tools and strategies
- Unless you’re a data analyst, be very careful about promising real insight. Consider working with a statistician or a domain expert if you need to offer reliable conclusions
- Small design decisions - the colour palette you use, or how you represent a particular variable - can skew the conclusions a visualisation suggests. If you’re using visualisations for analysis, try a variety of options, rather than relying on a single view
- Stephen Few’s Now You See It offers a good practical introduction to using visualisation for business analysis, including suggestions for developers on how to design analytically-valid visualisation tools
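The Anscombe’s Quartet point is easy to check for yourself. The sketch below computes a few summary statistics for the four data sets, using the values as they are commonly reproduced; the helper names `pearson` and `summarise` are our own. Rounded to two decimal places, the summaries match, even though the four scatterplots look nothing alike.

```python
from math import sqrt
from statistics import mean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarise(xs, ys):
    """Means and correlation, rounded to two decimal places."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "r": round(pearson(xs, ys), 2),
    }


for name, (xs, ys) in QUARTET.items():
    print(name, summarise(xs, ys))
```

Plotting the four sets is the quickest way to see what these numbers miss: one is roughly linear, one is curved, one is a straight line with a single outlier, and one is a vertical cluster with a single extreme point.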
Secret #07: Data visualisation takes more than code
The range of libraries and tutorials now available makes it easier than ever to produce production-quality web-based visualisations without specialised expertise. But creating visualisations that offer real insight or tell a compelling story still requires a particularly wide range of real skills in addition to coding, including graphic design, data analysis, and an understanding of interaction design and human perception. No library or technology can substitute for knowing what you’re doing.
But the flip side of this secret is that you don’t need to know that much - especially if you use well-established visualisations and interaction principles. Learn enough about the field to avoid newbie mistakes (always zero-base your bar charts and never set a circle radius with a linear scale), keep things simple (no 3D, limited animation, no drop shadows), base your work on solid examples and you can create great visualisations.
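The two newbie mistakes mentioned above come down to simple arithmetic. Viewers judge circles by area, and area grows with the square of the radius, so a radius that scales linearly with the data exaggerates large values. The sketch below uses hypothetical helper names (`bar_height`, `radius_scale`) to show a zero-based linear scale for bars and a square-root scale for circle radii.

```python
import math


def bar_height(value, max_value, max_height):
    """Zero-based linear scale: bar length is proportional to the value."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return max_height * value / max_value


def radius_scale(value, max_value, max_radius):
    """Square-root scale: circle area, not radius, is proportional to the value."""
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be positive")
    return max_radius * math.sqrt(value / max_value)


# A value a quarter the size of the maximum...
print(bar_height(25, 100, 200))   # ...gets a bar a quarter as long
r = radius_scale(25, 100, 20)
print(r, (r / 20) ** 2)           # ...and a circle with a quarter of the area
```

With a linear radius scale, the same value would get a radius of 5 and only one sixteenth of the area, understating it by a factor of four.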
Words: Nate Agrin and Nick Rabinowitz
Nick Rabinowitz is the Senior Data Visualisation Developer at ClearStory Data. He has over 15 years’ experience working on web and visualisation projects, primarily for non-profit, academic, and public sector clients.
Nate Agrin is the Director of Visualization at Looker. He studied information and visualisation theory at UC Berkeley and has worked at companies like Splunk and Twitter, contributing to their web-based interfaces.