The rise of Data Science

13/07/2016 Eric Gazoni Company Unsplash/@SpaceX

The shape of things to come

In the past few months, I think we've witnessed the end of the "Big Data" hype and the rise of "Data Science".

To me, this is the natural transition from buzzword to useful technology.

I'm really thrilled to see enthusiastic people challenge themselves on Kaggle, organizations fund the Jupyter project, and platforms like plot.ly get wider attention.

Today, I see more and more analysts turn their Excel workbooks into Jupyter notebooks and share data and insights within their company, for everyone to see and act upon.

To me, we're finally achieving the years-old dream of sharing data across organizations, even between teams that are seemingly unrelated. We're on the verge of synergy.

But this dream is still fragile, and the biggest threat to all dreams is the same: disappointment.

A fragile dream

To be used, a data analysis service should of course provide insightful data, but also be dependable, and that's where the problems start.

Software design and deployment is as much a craft as data science, and few people excel at both.

Most people in the field, having to deal with complicated software stacks and spending their day at the command line, will be happy to set up their own solution to expose their latest machine learning service.

And that's perfectly fine and normal, because that's what makes most sense in today's world of tight R&D budgets, and because data analysts don't usually come from a CS curriculum.

Self-inflicted wounds

But soon you end up with as many custom-built servers as you have teams within the organization, each implementing their own security model, authentication backends, mail notification, paging and load-balancing.

Add to the mix the high turnover in the industry, GitHub-stars-based platform selection, unpinned dependencies, ... and good luck keeping the services up and running a few months from now.
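To make the "unpinned dependencies" point concrete, here is a minimal sketch of a check that flags requirements not pinned to an exact version. It assumes a pip-style requirements file; the function name `find_unpinned` and the pinning rule (only `name==version` counts as pinned) are my own illustrative choices, and lines with environment markers would also be reported by this simple version.

```python
import re

# A requirement counts as pinned only if it has the form "name==version"
# (optionally with extras, e.g. "requests[security]==2.10.0").
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[^=\s]+$")


def find_unpinned(requirements_text):
    """Return the requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Run against the text of a requirements file as part of a build, a check like this turns a silent risk into a visible failure before the service is deployed.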

Of course, in the corporate environment, IT services provide infrastructure and often help with the deployment of business-produced applications. But the production of said applications is either:

  • fully delegated to the analysts (who usually lack experience in developing reliable applications)
  • the responsibility of the IT department, which has to grasp the essence of the analysts' work (an equally daunting task).

To me, having so many business people writing their own services, automating their work, producing sophisticated machine learning algorithms, is an exciting opportunity for organizations.

But if they fail to deliver what can be expected from a professional service, it might quickly go to waste, to everyone's loss.

Bridging the gap between data science and IT

Basically, if you produce code for a living, then you should (learn to) code properly.

If you are assembling a team of data scientists, you might consider this setup:

Infrastructure team

  • pure IT, you want sysadmins and networking people here
  • provide the infrastructure, connectivity, run centralized services such as:
    • LDAP
    • monitoring
    • continuous integration/deployment
    • backups
    • firewall
    • version control
  • provide a standardized way to deploy web services

Data Science team

  • this should be your main team in terms of headcount
  • play with models and data
  • write services
  • invest some time in basic computer science training:
    • version control
    • algorithms and programming best practices
    • following a style guide (to balance bus factor and high turnover)
    • test driven development and writing good APIs
    • networking notions
    • performance tuning rules of thumb
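As an illustration of the test-driven habit mentioned in the list above, here is a small sketch: the tests state what the function must do, and the function is the least code that satisfies them. The function `rescale` and its behavior are hypothetical, chosen only because min-max rescaling is a common data preparation step.

```python
import unittest


def rescale(values):
    """Linearly rescale a non-empty list of numbers to the [0, 1] range."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        # A constant input has no spread; map everything to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


class RescaleTest(unittest.TestCase):
    def test_min_maps_to_zero_and_max_to_one(self):
        self.assertEqual(rescale([10, 20, 30]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zero(self):
        self.assertEqual(rescale([5, 5]), [0.0, 0.0])

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            rescale([])
```

Saved in a file, the tests can be run with `python -m unittest`. Edge cases such as empty or constant input are exactly the ones that tend to surface in production when nobody wrote them down first.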

Facilitators

  • if your team reaches 5-8 people, you might want to add facilitators to solve day-to-day technical issues
  • preferably developers with good experience in data science
  • review the code produced by the data science team daily to ensure compliance with the deployment requirements
  • translate and follow up on the data science team's requests for infrastructure needs:
    • adding a website to the firewall whitelist so it can be scraped
    • providing servers with GPUs
    • restoring backups
    • ...

You should be able to compose such a team with people you already work with, and quickly start shipping valuable services for your organization.

Note: Adimian provides training for non-developers writing code. Check our training page.

Eric Gazoni

Eric is a software engineer and the founder of Adimian, splitting his time between making his customers' lives better, filling out paperwork and doing puzzles with his twin boys.