Monday, 9 February 2015

How to do football analysis in Tableau | Part 2


"Here's where the fun begins"
Han Solo

So you've read Part 1, you've downloaded and installed Tableau Public and you've got a small Liverpool and Everton dataset to play with. Or alternatively, you took the short cut, downloaded the dataset and came straight here.

Either way, it's time to hit Tableau.

Before you dive into the software, you'll need to create an online Tableau Public account. This will give you somewhere to publish your visualisations and also to save them while you work on them. One of the big restrictions of Tableau Public is that it's cloud-based and you can only save your stuff to Tableau Public's website.

You've set up an account? Great, we'll make use of it later on. Now you can open up the Tableau Public software.

You should be looking at a screen like this...



Click "Open Data"and you get a long list of data sources that Tableau can access. The greyed out ones are for Professional users only - they aren't available in Tableau Public - but our dataset is in Excel, which is fine.

Click Excel, find your football data workbook and open it.

You should see Tableau's data loading screen.



You can do a lot of data manipulation here, including joining different datasets together, but we're just loading up a single Excel worksheet for now. If your Excel workbook has only got one worksheet in it, then Tableau will pick that up automatically. If you have more than one sheet in the workbook, then drag the one with the player data across from the left hand column to the empty top window.

Click "Go to Worksheet".

We're in! You should be looking at an empty Tableau worksheet.



If you can use Excel pivot tables then you're going to feel at home here quite quickly, but if you can't, don't worry - it's all very straightforward.

Tableau's looked at our dataset and guessed which columns in our data are "Dimensions" and which are "Measures".



Dimensions are things you can split the data by. Player names, team names and positions go in here. "Apps" (number of appearances) shouldn't really be in here but we can deal with it later if we need that data. Tableau's guessed wrong because the Apps column on our spreadsheet contains brackets for substitute appearances, rather than just simple numbers.

Everything that isn't a Dimension is a Measure. Measures are data columns that you want to add up, or average, or do whatever else mathematical with. Measures are your numbers. Tableau's put things like Age and Goals and Minutes Played in here, which is what we want.

So, what do you want to see first?

At last week's OptaPro Forum, Simon Gleave (@SimonGleave) showed some nice age distribution charts that plot the ages of players at a club. We could easily draw one of those.

Drag "Age" from the Measures area and drop it into the middle of the table, where it says "Drop field here". Take care to drop it in the middle, not onto the column or row headings areas.


If you're using the sample data, then it will now say 1,330 in the table. If you've put your own data together then you might get a different number because the WhoScored website is regularly updated with new player data.

1,330 is the total of all of the Everton and Liverpool players' ages in our dataset. Useful.

What about the average age of each team? That would be more useful.

Since you dragged in Age, there's now a green lozenge on the "Marks" area that says "SUM(Age)".



You can use this green lozenge to change from Sum to Average. Right click it, find "Measure (Sum)" in the popup and change it to Average.

In the sample dataset, the average age of all of Liverpool's and Everton's players is 26.6.

Let's split it by team. Drag "Team" from Dimensions onto the row shelf at the top of the screen and drop it.



Everton are older than Liverpool! We've just learned something.

Showing numbers to three decimal places is a bit much, so you can change the default formatting for Age if you want. Right click it in the Measures area, choose "Default Properties" and use "Number Format" - "Number (Custom)" to get rid of the decimals.

You could format numbers directly in the view but the nice thing about changing the default is that now whenever we use Age again, it will always appear without decimal places.
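If you ever want to sanity-check one of Tableau's aggregations outside the software, the same split-and-average is a one-liner in pandas. This is just a sketch with invented ages, not the real dataset:

```python
import pandas as pd

# Toy stand-in for the player dataset -- these ages are made up for illustration
df = pd.DataFrame({
    "Team": ["Liverpool", "Liverpool", "Everton", "Everton"],
    "Age": [24, 26, 28, 30],
})

# Tableau's AVG(Age), split by the Team dimension on the Rows shelf
avg_age = df.groupby("Team")["Age"].mean()
print(avg_age)
```

Every Tableau worksheet is doing essentially this under the hood: an aggregation (the Measure) grouped by whatever Dimensions you've dragged into the view.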

Stop reading for a minute and have a play with this table. It might seem odd to say it in a user guide, but the best way by miles to learn Tableau is to play with it. Drop more measures into the view and try splitting the rows and columns by different dimensions and see how Tableau reacts.


That's enough tables. Tableau's all about the graphics, no? Let's draw a chart.

Use the little tab button at the bottom of the screen - the one that looks like a little bar chart - to create a new empty worksheet. The other button - that looks like a little four pane window - is for creating dashboards. We'll get to that later.



Hopefully you're starting to see that in some ways, Tableau's a lot like Excel. It has worksheets and each worksheet is basically an Excel Pivot Table, with rows and columns and measures.

We're going to use this new sheet to draw an age distribution. That means we'll want to count how many players there are, split by age groups.

In our dataset, Age is a Measure to be summed or averaged, not a Dimension that you can split things by, but Tableau can sort that out for us. Drop Age into the middle of the view, like you did before, and then use "Show Me" to draw a histogram.



We've got a chart!

The Show Me button is the centre of Tableau and it's where you decide what kind of visualisation you want to draw. Think of it like the Excel Chart Gallery, but a lot more powerful.

Depending on what data you've dragged into the view, Tableau will offer you different types of charts in Show Me. This can sometimes get a bit confusing, e.g. you might decide to draw a scatter chart and Tableau says No and greys out the button. It looks at the data you've dragged into the main view and decides you can't draw a scatter with that.

When the type of chart you want is greyed out, look at the tip at the bottom of the "Show Me" box. Tableau will tell you exactly what it needs and when you drag those things in, the option you want will work. Once you start to get used to Tableau and the way that it works, you'll find this happens much less.


Charts in Tableau work exactly like tables and if you get confused, it can often help to think of them that way, or even switch back to a table, sort your data out and then switch back to a chart.

Because charts work like tables, they have rows and columns, and we can split our age chart by team if we want to.

Grab Team from the Dimensions area and drop it just to the left of CNT(Age) on the Rows shelf.





Two charts! Now we can really see the differences in age that are driving Everton's older average.

Try dragging Team back off the Rows shelf and putting it in different places - on the Columns shelf, or into Colour or Label in the Marks area. There is loads of flexibility to create the view that you want.

Tableau has five basic ways of showing you differences in your data.

1. You can split it, using the Rows and Columns shelves.
When you do this, you'll get new rows and columns in a table, or new charts, one for each split.

2. You can vary colour

3. You can vary size
(That one doesn't really make sense as a team split - try it. Not every technique is good everywhere)

4. You can change the shape of datapoints

5. You can label different items


Tried all of those? Don't just skim through, I really meant it about playing being the best way to learn. Drag player names in. Swap Age for a different measure. Put team in Color and Label. Go nuts. Break stuff. Junk your worksheet and start again if you need to and remember that Tableau's Undo feature is pretty much bulletproof; Control-Z will always put things back the way they were!


Now that you know the basics of worksheets, you're ready to make your first dashboard...

We'll create some more charts and use them to build an interactive dashboard in Part 3.

How to do football analysis in Tableau | Part 1

I'm back from a hugely enjoyable OptaPro Forum last week and buzzing with new ideas. It was a fantastic day with some great presentations and it was brilliant to finally put faces to a few analysts' names (and Twitter handles!)

One question came up often enough at the forum to make me think that there might be an audience for this post. How do you use Tableau for football analysis...?

Want to know a secret? Tableau's easy. And it's free. There's really no need to restrict yourself to the purgatory of Excel's chart gallery. This post will cover a few basics and then if there's interest, I might do a follow up that shows how to do some more advanced visuals with X,Y pitch coordinate data.

I'm going to keep this post (actually, since the length blew up as it was being written, these three posts) at a pretty high level: Where to get data, how to connect to it, how to draw some basic views and then make them interactive.

The guide will assume that you can handle yourself with the basics of copying and pasting and inserting columns in Excel and that "install this software; here's a link" isn't too taxing! In general, I'll try to take things step-by-step, with screenshots so you can see where we're going.


First things first, we're going to need a copy of Tableau. It comes in professional and free versions and unless you've got a grand to spare, you're going to want the free "Public" version. Tableau Public is almost fully featured, with just a few restrictions on the types of data you can load and how you save and export your dashboards (more on this later).

If you haven't already, download and install the latest version of Tableau Public from here.

Next, we'll need some data. Very often acquiring data and cleaning it up is the biggest barrier to football analysis and this post isn't going to cover the various sources you could use. Once you get into very large volumes of data, you start to need programming skills in a tool like Python or R and this is only supposed to be a basic intro! We need a small, simple dataset and WhoScored can help us out.

Let's build some visuals to compare Liverpool and Everton this season.


TL;DR: If you don't care about this bit, you can download the mini dataset that we'll use in Tableau and go straight to Part 2.


Go to the Liverpool page on WhoScored and you'll find a table that looks like this.


It's a nice little dataset with player statistics for this season, but it's not doing us any good stuck on the website - we need to extract that data.

There are a few automated tools that are worth trying for data extraction. Import.io is new and is shaping up to be a really great tool, but unfortunately it doesn't pick up the data that we're after when it scans WhoScored's web page. Excel's "load data from web" feature doesn't either.

We're going to have to do this the old fashioned way.

Click and drag to select everything from the little 'R' in the black title bar at the top of the table, to the bottom right hand corner of the data. Make sure you've got exactly that selected and then Control-C to copy.


(There are probably programmers yelling at their monitors right now. I'm not covering web scraping in this post!)

In Excel, hit paste (Control-V) in cell A1 and you get...


Now go and get the Everton data and do exactly the same thing, pasting it onto a different sheet.

Finally, copy the Everton data from Excel excluding the black titles row and paste it just below the Liverpool data. If you don't leave out the titles row, you'll get column titles mixed up with your player data in Tableau - we only need titles at the top of our table!

As a general rule, Tableau likes data to be in lists. You want your data to be listed downwards, not across the page as loads of columns. WhoScored's data is a list of players with facts about them and that's ideal - a separate set of columns for every player, or data scattered across different sheets won't work.
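If your source data does arrive in the "loads of columns" shape, it's easy to reshape it into a list before loading. Here's a sketch using pandas' melt, with a hypothetical wide table and invented numbers:

```python
import pandas as pd

# Hypothetical wide layout -- one column per player, which Tableau dislikes
wide = pd.DataFrame({
    "Stat": ["Goals", "Assists"],
    "Sterling": [7, 6],
    "Lukaku": [8, 2],
})

# melt() stacks it into the downward list that Tableau prefers:
# one row per player-and-stat combination
tidy = wide.melt(id_vars="Stat", var_name="Player", value_name="Value")
print(tidy)
```

The melted table has a Player column and a Value column, so in Tableau you can split by Player as a Dimension instead of fighting with one column per player.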

You can get rid of the Everton data sheet now, you don't need it any more. Give the worksheet with both teams on it a sensible name and save your workbook.

Now we're going to clean the data up a bit. We don't need all of the formatting, so highlight everything, find 'Clear' on the Home Menu (it's got a little pink eraser icon) and choose Clear Formats.

Simple black and white text. Proper analytics.

You could load your list of data into Tableau straight away, but it's still a little messy. Column B has got nothing in it, so highlight that whole column and delete it (Delete icon on the Home menu). The remaining data will move in from the right to fill the gap.

Those "-" signs indicating no data also aren't great for Tableau. Text saying "-" isn't no data, it's a cell with "-" in it. Find and replace "-" with nothing (Control-H) to get rid of them.

The last job in cleaning up is to sort out our player names. If you look in cell B2, you'll see that player names are currently mixed up with their ages and positions. The technical term for this is, "a pain in the arse".

Insert some columns to create a bit of space, starting from column C. Five columns should do it.

Now your data looks like this...


Highlight everything in column B and choose Text to Columns from the Data menu.

Choose "Delimited", tick "Comma" and click finish.



The positions data gets split into column C and because some players can play in a few positions, you'll get data in column D and maybe column E too. For this little project, we don't care about the extra positions, so delete columns D and E. Don't just clear the data out of them, delete them so that everything moves left and your data looks like this.



The playing positions are sorted, but our players' names are still mixed up with their ages. We can split these with a quick text formula.

Age is always a two-digit number at the end of the name. Assuming no players are over 99 or under 10 years old, putting this formula in cell D2 will grab the player's age: =VALUE(RIGHT(B2,2))



"Right" gets the two characters at the end of the text string and "Value" tells Excel to see the result as a number, not as text (e.g. 20, not "20").

The player's name is everything except the last two characters. Put this in cell E2 to grab that bit: =LEFT(B2,LEN(B2)-2)



Copy and paste those two new formulas downwards, to fill up the table with names and ages.
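For reference, the whole split can also be done in a few lines of Python. This is a sketch assuming the cell format described above ("Name Age, Position(s)") - the example player is just one row from the dataset:

```python
# One raw cell as pasted from WhoScored (format assumed from the steps above)
raw = "Raheem Sterling 20, FW,AM(LR)"

# Text to Columns by comma: first piece is name+age, the rest are positions
name_age, *positions = [part.strip() for part in raw.split(",")]
position = positions[0] if positions else ""  # keep only the first position

# RIGHT(...,2) wrapped in VALUE: the age is the last two characters, as a number
age = int(name_age[-2:])

# LEFT(..., LEN()-2): the name is everything except those two characters
name = name_age[:-2].strip()

print(name, age, position)  # -> Raheem Sterling 20 FW
```

Same logic, same assumptions (two-digit ages), just without the intermediate columns.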

You could have waited until Tableau to make these splits as it has exactly the same LEFT, RIGHT and LEN formulas as Excel, but I want you to be able to see the raw data table before we load it. Once you're in Tableau, the data table sits in the background and as a beginner it can sometimes be a little tricky to visualise what your formulas have actually done.


Nearly finished, I promise. If you look at your table of data, you'll notice that there's no way to know which team each player is on. We've got a record at the top for Raheem Sterling, but it doesn't say "Liverpool" anywhere. That will limit us in Tableau because we won't be able to easily compare teams if that information isn't in our dataset.

The data in column A, labelled "R", isn't useful for anything, so let's put team names there instead. Select the cells in column A down to the last Liverpool player, type "Liverpool" and hit Control-Enter to fill all of those cells. Then do the same for Everton and re-title column A as "Team".


The very last job is to add some column titles for the age, name and position columns you've created. Save your workbook and then if it looks like this, we're ready to hit Tableau in Part 2.


Thursday, 29 January 2015

Football analyst network vis: New and improved!

My original visualisation of the football analysis community on Twitter generated a fair bit of interest, so it's time for a new and improved version.

We had one recurring question the first time around: "Where am I on the graph?". It badly needed a search function and thanks to @balbezit, it now has that and a whole lot more. Twitter really is a fabulous learning tool! I'd tweeted an early static image of the updated network and got this reply...



I didn't know how to do that, or even that you could. But I do now.

I've also extended the starting group of analysts to the 110 accounts on this list. If you're followed by at least two of those accounts, then you should appear in the network graph somewhere. Following loads of them doesn't count - they have to find you interesting too!

Here's step one. You get a bigger node if more people in this community are following you. It's not sized by follower count in general, otherwise Barack Obama and Cristiano Ronaldo would have the biggest circles, by miles. More closely linked people will cluster closer together.





And next (again via a tip from @balbezit), we can shade the communities in different colours. This is an automatic algorithm in Gephi and I think it's quite effectively separated everybody by their interests.

Broadly, we get...

Red: Core of (largely) Premier League statistical analysts
Light blue: Professional analysis, including OptaPro and Prozone
Green: Journalism
Yellow: The transatlantic connection
Dark blue: Wider interests, including marketing and data vis (I'm in this one)
Purple: FiveThirtyEight




For the interactive version, go here.

It's a big web page. You will have to give it a while to load.

If you use Internet Explorer, the vis will look blurry rather than crisp and easy to read - and frankly, you deserve that. Download a proper browser.

Try clicking nodes and zooming and panning around the view!

Help! I can't find my own account!

You need to use the search box on the left. Search for your display name, not your Twitter username (i.e. I'd search for "Neil Charles", not "@data_monkey")

The chart will filter to show your personal following network, but your own node still might not be obvious, if it's small. Near the top of the bar that's popped up on the right hand side of the page, hover over your name. Voila! There you are.



Thanks go to...

@balbezit for the tips, Gephi for a brilliant bit of visualisation software and to Scott Hale for his fantastic Sigma.js exporter that was used to build the interactive vis.

Wednesday, 3 December 2014

There's a storm coming to marketing

Rory Sutherland tweeted a fascinating link a few weeks ago. He does that a lot, but this one in particular has stuck with me.


The link points to this article on Wikipedia, about Moravec's Paradox. Essentially, Moravec's Paradox explains that it's easier to program computers to do stuff that we think is complicated, than to do stuff that we think is easy.

Teach a robot to play world class chess? Done. Deep Blue beat Kasparov in 1997 and it's all downhill from there.

Teach a robot to walk as well as a human toddler? Nope. Now we're stuck.

As artificial intelligence improves, Moravec's Paradox suggests that you should be fearful for your job if you work with data analysis and structured processes. On the other hand, there's no imminent danger of somebody building a robot that's adaptable enough to fix the central heating in every different home. The plumbers will be fine. The jobs that we think of as 'easy' - manual labouring and skills that require some physical coordination - are way beyond the capability of today's computing, but the jobs that we think of as 'hard' may not be.

Keep Moravec's Paradox in mind as we look at a couple of new tools.


First, CausalImpact by Google. (Yes, that name needs a space. No, it hasn't got one.)

CausalImpact is a tool for estimating what advertising has done to web traffic. You feed it your traffic stats and your advertising stats and it estimates how hard the advertising is working to create more traffic.

In essence, this is how I've been earning a living for the past fifteen years. Google just automated it.

OK, that's over-dramatic; Google hasn't made me redundant yet. CausalImpact is a very small stepping stone, which only works for website traffic, in many cases won't work at all, and you need a fair bit of technical knowledge to be able to deploy it, because it comes as an R package.

R packages are hard, because R is hard. But then people like @jjmulz do helpful things like this.


And suddenly the ground I'm standing on starts to look shakier again. All programming tools are hard until somebody sticks an easy front end on them. If CausalImpact doesn't do it for you, try another Google funded project - the Automated Statistician. The machines are definitely coming.

This is all early days, but you can see where it's headed. Marketing analysis is a process and it has to be a fairly repeatable one, or you'd never be able to sell it to clients as a product. Without a process, every single project would be its own piece of R&D that might or might not work.

Marketing return on investment analysis is difficult, but so is chess and computers are better at chess than we are. You just have to teach them a framework for understanding the game.


What about the other end of the media planning process? The planning bit, before you get to measure what you've achieved? Charging into audience discovery, comes Profiler, from YouGov.

You probably saw YouGov Profiler via social media in the past few weeks. It's great.


Type in almost any subject area and it will tell you about the people who are interested in that topic. The scope of what you can look up is seriously amazing - you have to give it a try.

If you work in marketing, you'll quickly recognise the screens that pop out of Profiler as 'pen portraits'. These portraits are front and centre in every agency's pitch documents and annual plans. First we tell you about the audience we want to see your adverts, and then we tell you how we're going to achieve that.

Click on the 'Media' tab that you get on the output screen, bearing in mind that this is a demo and the full product will have loads more detail.

Damn, somebody's just automated another part of what marketing agencies do.

It is true that few businesses - other than marketing agencies - will buy access to the whole of YouGov's tool, because it would be too expensive for a piece of kit you'll use once or twice per year. Marketing agencies could still act as an intermediary, holding data and tools and running them for clients. We do this a lot now.

Except that if we've learned one thing about the web, it's that the web disintermediates. If you're sat in the middle of a transaction, making money by being a gatekeeper who controls access to a resource, then you should be scared of the internet. High street shops, travel agents, music labels, publishers... sooner or later, intermediary businesses get slapped by the web, because it puts buyers directly in touch with sellers.

If I was YouGov, I'd sell the Big Expensive Tool version of Profiler, but I'd also make it available on a 'pay as you go' model and let individual companies buy data, one query at a time. At the point YouGov or one of their competitors does that, the insight that agencies can create by profiling an audience becomes quite seriously devalued.

Just like Google's CausalImpact, the profiles that a company runs for itself probably won't be as sophisticated as they'd get from a professional analyst working in a marketing agency, but in many cases that won't matter. Amazon can't recommend books like an independent book store can, but it still forced most of the independents out of business.


Marketing agencies have spent years refining their processes. We're proud of our processes and they're what we use to differentiate ourselves from other agencies. We talk constantly about how we have a process to discover things differently, or connect them differently, or to measure the results better.

Computers are good at processes and this is going to become a serious problem for marketing companies and the people who work in them. Pieces of what we do are going to get automated. Pieces of what we do are already being automated.

The real revolution is quite some way off and you probably don't need to worry about it too much yet, because sexy bits of technology that you only just heard about are usually ten to twenty years away from actually working properly. In the meantime though, we're going to see many innovations that chip away at the agency model and marketing agencies are going to have to work out - again - what it is that they can actually charge clients for.

We only did planning, until our clients mostly evolved onto fairly similar, effective, best-practice media plans.

Then we did 'added value': Processes, discovery, insight and post-campaign analysis.

When the processes, insight and analysis start to be automated, what will we do then?

My strong suspicion is that a marketing agency's true value lies in human interactions and in explaining the world, person-to-person, to our clients. Rather than selling 'things' - media plans, PowerPoint decks, research studies and analyses - we're going to have to become much better at charging for these human interactions. If we don't, we'll slowly be automated into irrelevance.

Monday, 10 November 2014

Visualising football analysts on Twitter

Building on my new-found love of network diagrams, I thought it would be fun to visualise a social graph of football analysts on Twitter.

Who should you follow? These guys. They're fascinating.

Click the image for a (much) bigger and zoomable version.


 Large version


Small print:

Lots of users following each other moves those users' nodes closer together.

Following, replying to, or mentioning a user on Twitter gets you linked.

Nodes are sized by number of inbound links (i.e. shouting a lot and following lots of people doesn't get you a big circle, other people mentioning and following you does).

Twitter rate limits mean that once you hit a certain number of followers, you don't get any bigger. That's why all of the core people have nodes that are the same size.

This graph undoubtedly flatters my own profile because it's built from people I follow and talk to.

The starting point for the graph was Twitter users in this list. Who's missing? Let me know!

Thursday, 6 November 2014

Visualising 45,000 football transfers

Football's an international business and it's obvious to anybody watching a Premier League game that players have been transferred in from all over the world.

But which countries' clubs are the most interconnected? Is the old cliché true, that British players don't travel as much as their foreign counterparts? And can we show the relationships between clubs in an interesting way?

I drew the following images with Gephi, using data on just under 45,000 player transfers, taken from SoccerWiki. Gephi clusters teams by the closeness of their transfer activity; a lot of players moving between teams means that they will group together, while teams that are far apart rarely acquire each other's players.

Some of these images benefit from clicking through to the larger version link and zooming in...

45,000 player transfers





Big version. Zoom in and scroll to see detail.




A rough guide to national connections
The UK and Italy stand apart from an interconnected Europe.










Big version



The British peninsula
Note the Scottish spur and island of Ireland.







A few technical notes:

Node sizing is by number of transfers in and out. A larger node indicates more transfer activity.

SoccerWiki isn't a perfect repository of transfer data, but it's more than good enough to draw this sort of network diagram and overall is a really fantastic resource. Although the way that SoccerWiki stores information makes it impossible to put an exact time-stamp on transfers, data covers a range from 2007 to 2014.

I've dropped any team with fewer than 20 player movements - in or out - in order to clean up the diagram. With everything switched on, it renders very slowly and you get a cloud of small, barely attached teams floating around the edges. They're distracting without adding any information to the visualisation.

Views were rendered using Gephi's 'Force Atlas 2' algorithm.

Tuesday, 15 July 2014

The quiet BI revolution (part one)

Three years ago on Wallpapering Fog, I wrote a post about why our company (or more precisely, since the company's huge, my department) had adopted Tableau software.

At the time, I said:

"I feel like I'm giving away a trade secret here, but what the hell, you're going to hear about it from somewhere soon anyway."

Having just attended the London Tableau Conference, I can confirm that the secret is well and truly out. It was a brilliant event, brimming with enthusiastic people and great ideas, that deserves its own write-up away from this post.

For this post, I'd like to indulge in one of my occasional crystal ball gazes and look at the future of Business Intelligence (BI). Not BI on the cutting edge - although that is an exciting topic - but BI in regular businesses. Businesses that have small analytics teams, no time and aren't PR'ing a project to the trade press, with all of the doubts and the dirty laundry Tippexed out.

So where is BI - and in particular, regular reporting - for a normal analytics team going to head over the next five to ten years?


1. Data Visualisation and Reporting

Data vis, as it applies to most businesses, is now a solved problem (what to visualise isn't - that's part two of this post). You can have good-looking reports, automatically refreshed and delivered onto any device you like, and even on paper if you must. They're quick to build, easy to adapt and easy to maintain - more so than Excel-based reports ever were and much more flexible.



The only things you can't do easily are weird and wonderful innovative visuals that nobody's ever seen before, and you can't have all of this functionality for free.

On the first of these problems, I'd argue that this isn't a business issue. Businesses need straightforward charts, tables and standard reports, not animated 3D network diagrams, so software like Tableau will do a great job. I'd also argue that if you're looking for real flexibility, Lyra is something that I'm quite excited about...

On the second problem - cost - you just have to bite the bullet. $20,000 spent on the right BI software will transform your analytics department.

(That's if you give the $20k to your analytics department. DO NOT give it to a centralised IT team. They'll very likely ask for another $230k to make a nice round number, disappear for six months and then reappear asking for more money.)

The real change in data reporting, investigation and visualisation over the next five years or so, is going to be from a situation where many businesses don't yet realise that it's a solved problem, to one where they do.

Tableau's solved this problem and in my opinion is by some distance the best of the new breed of reporting and investigation tools, but if it hadn't been Tableau, it would have been QlikView. And if not them, Spotfire. And... you get the point.

What's going to happen over the next few years is that Tableau knowledge will become more valuable - because more businesses will want to hire those skills - and also less valuable, because loads more people are going to know how to use the software. The end result is basic supply and demand. It might swing back and forth for a bit, but we'll settle onto a situation where many (most?) analysts know Tableau as a regular part of their job. There'll be specialists, just like there are specialist Excel consultants, but most businesses will sort themselves out and nobody will be paid a fortune just for knowing how to use Tableau.


So far, no real surprises and if you read Wallpapering Fog regularly then you've probably heard those ideas before. The next two points are where I see a quiet revolution happening.


2. (not) Data Warehousing

You probably already know how this works. Analysts with Tableau do the visuals, but there's a big SQL database in the back end, looked after by a centralised IT team, which contains exactly 73% of what you want to visualise. A big enough gap that you can't just ignore data that isn't in the data warehouse, but not so big that the data warehouse as it stands is useless.

What often happens in response to an incomplete data warehouse is that analysts build a hack. The data that isn't centralised is pulled in from ad-hoc spreadsheets and mashed together in Excel or Tableau, which works OK until you need more than a couple of people to update those spreadsheets, or somebody's on holiday. This is the issue we often hit in media agencies; you can solve a problem once, but can't roll out the solution everywhere to all clients because some parts of your 'solution' are held together with gaffer tape and bits of string.

What's needed is some software that's built for analysts and allows them to merge different data sources and to schedule updates, without recourse to a database administrator.

If you were at the Tableau Conference last week, then you'll have seen Alteryx sat squarely in this area. It's drag-and-drop, hugely flexible and very friendly; I played with the demo a few months ago and I loved it.

But, it is quite pricey. Especially if, like us, you wouldn't plan on using all of Alteryx's capabilities and are only really interested in blending data sources together.

Did somebody mention Open Source? Here's my tip of the day. Go and download the Community Edition of Pentaho Kettle and persevere through the thirty-minute skirmish it will take you to get it installed and working properly. Your reward will be drag-and-drop data acquisition, blending and output, all for free. This is how I process a lot of my football data and it's brilliant.
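If you'd rather see what "blending" means in code than on Kettle's drag-and-drop canvas, here's a minimal sketch in Python. The client names and figures are invented for illustration, and a real job would read files or database tables rather than inline strings:

```python
import csv
import io

# Two ad-hoc data sources of the kind a tool like Kettle blends visually.
# Inline strings here; in practice these would be files or database tables.
sales_csv = "client,spend\nAcme,2486\nGlobex,1900\n"
targets_csv = "client,target\nAcme,3000\nGlobex,1500\n"

def blend(left_csv, right_csv, key):
    """Join two CSV sources on a shared key column (a simple lookup join)."""
    right = {row[key]: row for row in csv.DictReader(io.StringIO(right_csv))}
    merged = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        row.update(right.get(row[key], {}))
        merged.append(row)
    return merged

rows = blend(sales_csv, targets_csv, "client")
for r in rows:
    print(r["client"], r["spend"], r["target"])
```

Kettle builds exactly this kind of keyed join with boxes and arrows, adds connectors for dozens of source types, and lets you schedule the whole thing to run unattended.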



In terms of crystal ball gazing, the analytics department now starts to look quite different. It's running a lot of reports on schedules, freeing up time for investigation and innovation. Nobody does the whole "getting into work at 7am on Monday for a frantic three hours of board report running" any more - a ritual that retailers in particular are very fond of. And thank God for that.

In our new world, IT only handles data when it needs to flow in large volumes from a point-of-sale or distribution system. IT does the bit that it already does very well now, but everybody stops moaning that the data warehouse doesn't also contain lots of the smaller user-maintained pieces of information that make a business run properly.

If you're thinking that the new world sounds like the same old BI promises, then you're right, it does. We should have been able to do these things ages ago but it didn't work due to the disconnect between analysts and IT and the slow build time, inflexibility and high cost of software. Analysts received questions and understood what output was needed, but usually only IT had the (inflexible) technology to make that output happen automatically.

The big differences now are speed, cost, flexibility and the number of companies that will be working in this new way. It's no exaggeration to say that you can go from raw data to first-version business reports in two days. You can pin those down to a format everybody's happy with in a couple of months (faster if you make decisions quickly) and then you can fully automate them. Reports are able to evolve because they can be rebuilt and republished very quickly, in hours rather than weeks.

Then what do you do next? It's a serious question with which some reporting teams are going to struggle. When nobody needs you to move data from Google Analytics to Excel and chart the same charts every week, what will you do? The time to start thinking about that is now.


3. Data acquisition

This one's not solved; it's currently being solved and we've got a little way to go yet. Data acquisition is the last barrier between analysts, managers and an automated dashboard containing absolutely everything on which they wish to report.

Alteryx and Pentaho Kettle are fantastic data assembly (ETL) tools, provided your data isn't stored somewhere really stupid. Unfortunately, I work in marketing and our industry specialises in making data as difficult as possible to access.

- It's in untidy, bespoke web interfaces, behind login screens.

- It's in the colour key that somebody has chosen to fill cells in Excel.

- It's emailed across, with a friendly "Hello! Hope you had a good weekend. Today's spend number is £2,486."


Database that, smartarse.
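Actually, that last one can be databased, just not with anything off the shelf. A sketch in Python, using the example email above as the input:

```python
import re

# The friendly email from the example above - exactly the kind of
# "data source" marketing analysts get handed.
email_body = "Hello! Hope you had a good weekend. Today's spend number is £2,486."

def extract_spend(text):
    """Pull a £-prefixed figure out of free text and return it as a float."""
    match = re.search(r"£([\d,]+(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(extract_spend(email_body))  # 2486.0
```

The fragile part isn't the regular expression; it's reliably getting the email body out of an inbox in the first place, which is why this kind of source really belongs in a proper data feed.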


What I see happening over the next few years is some new tools and some new ways of working. Provided data is delivered in a consistent format, then the likes of Alteryx or Kettle can make the data acquisition and blending problem go away.

Where data is in web interfaces, we can already scrape it using Python or R, but then you need an analyst who knows how to scrape and that's not such a common skill set. (Top tip: look for a football analyst - by necessity, we're getting quite good at it.)
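To show what that skill set looks like, here's a minimal table scraper using only Python's standard library. The table and its numbers are invented for illustration, and a real job would first fetch the page with urllib.request (or the requests library) before parsing it:

```python
from html.parser import HTMLParser

# A cut-down page of the kind a stats site might publish. The figures are
# made up; a real scrape would fetch live HTML over HTTP.
page = """
<table>
  <tr><th>Player</th><th>Goals</th></tr>
  <tr><td>Sturridge</td><td>12</td></tr>
  <tr><td>Lukaku</td><td>10</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, one row per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)  # [['Player', 'Goals'], ['Sturridge', '12'], ['Lukaku', '10']]
```

Real pages are messier than this, of course - login screens, JavaScript-rendered tables and inconsistent markup are where the scraping hours actually go.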

We're going to evolve towards XML and other data feeds alongside the usual user-facing tables that most web data sources currently offer, which again brings the likes of Alteryx into play. The data providers who don't do this should gradually become extinct through a process of natural selection.

Eventually, these changes will form an almost universal API. Every provider's data is different, but you'll be able to get to the data in an automated way and that's 90% of the battle. When you've done that, you only need to solve the data transformation problem once.

We'll also see - as is happening already - advanced data providers like Datasift starting to deliver information into services such as Google's Cloud Platform. A few years ago this wouldn't have helped, because you're just swapping one API for another, but when a critical mass of services all use that same cloud, easy connectors start to appear.

So why do I say that data acquisition isn't a solved problem yet?

Well, for one, too many sources are still silos, but a second issue is that user input is still much too difficult. There's no Tableau for manual data entry, so we still have to call a developer to create web forms, database schemas and data validation, and to link it all together for us. Either that, or we have a central spreadsheet for people to fill in and we pray that they don't break it, or all try to edit it simultaneously.
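What that missing tool would do, at its core, is let an analyst declare a schema and get validation for free. A sketch of the idea in Python - the field names and rules are invented, and an actual product would generate the web form and storage around something like this:

```python
from datetime import date

# A hypothetical schema: what a "Tableau for data entry" might let an
# analyst declare without calling a developer.
SCHEMA = {
    "client": {"type": str, "required": True},
    "spend":  {"type": float, "required": True, "min": 0},
    "week":   {"type": date, "required": True},
}

def validate(record):
    """Return a list of problems with a submitted record; empty means valid."""
    errors = []
    for field, rules in SCHEMA.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                errors.append(f"{field}: missing")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
        elif "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below {rules['min']}")
    return errors

good = {"client": "Acme", "spend": 2486.0, "week": date(2015, 2, 9)}
bad = {"client": "Acme", "spend": -5.0}
print(validate(good))  # []
print(validate(bad))   # ['spend: below 0', 'week: missing']
```

Twenty-odd lines gets you the validation logic; the hard part a product would have to solve is the form generation, multi-user access and storage that currently mean either a developer or a fragile shared spreadsheet.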

I'm sure this software will come, but I haven't yet seen it. Microsoft Access forms and VBA really aren't it and neither are Google Forms. Microsoft, for all that they had a massive head start and will claim to have solutions to all of these problems, are nowhere in the BI race and are falling further behind.

If you've seen another solution to the problem of regularly taking validated user input without embarking on a software build or trying to lock down a spreadsheet, I'd love to hear about it in the comments.


The future's bright

In our future analytics department a lot has changed, but it's been a quiet revolution. A lot of things that were difficult are now easy and the business analyst's scope has extended well into traditional IT territory. Or, more accurately, that territory is more clearly delineated between the two departments and issues which neither IT nor analysts could previously solve (for a sensible budget in a sensible time-frame), have been dealt with.

Reports have moved to web browser interfaces - except for those staff who absolutely insist that they need printed ones - and automation takes care of putting them together. Analysts can quickly and visually interrogate their data and as an aside, Excel has moved to being a secondary tool for serious analysts, behind Tableau (or a competitor of your choice).

We were promised all of this a long, long time ago. Most businesses might actually get there in the next five years or so. It's interesting that the process of assembling Business Intelligence is being solved backwards... rather than working forwards from data collection, to merging, to visualisation, solving the visualisation element has driven a requirement to blend data better, which in turn drives changes in how we acquire it.

And you know what happens after that? Businesses will start to realise that a lot of the information they've spent years trying expensively to assemble, won't on its own work the miracles that they hoped it would. Not without some other major changes happening too.

My favourite quote from last week's conference came from Fawad Qureshi of Teradata.

"Old business process + expensive new technology = expensive old business process"

That will be part two of this post. When you've got to your ultimate suite of business reports and they're easy to maintain, what happens then? What changes? Does anything happen at all?