I moved this summary to a new location at http://georg.io/s2ds where I also keep a growing collection of additional summaries.

S2DS Day 1

Science 2 Data Science (S2DS) is an industry-sponsored course that aims at leading graduates with a quantitative background into the industry of data analytics and data science.

The S2DS program has at least two components: During the first six days of the program we hear about technologies and methodologies that are deemed essential for practicing data science, and most of the remaining time we work in small teams on specific projects with the industry partners of the program.

My intention with this series of blog posts is to keep a legible record of the topics we hear about for future reference.

You can read more about the course at s2ds.org or follow the Twitter hashtag for the 2014 intake.

Kim Nilsson, who is a co-organizer of S2DS, and John Hall of KPMG welcomed the students and discussed a number of examples where data science creates value for companies.

Good Coding Practices

Ole Moeller-Nilsson talked to us about good coding practices.

Introduction

  • Ada Lovelace (1815-1852) is often considered the first programmer because of her notes on Babbage’s Analytical Engine.
  • Code can be thought of as a means for communication: There is machine code understood by the computer, source code understood by humans, and a translation process (compilation) between the two.

Programming Languages

A number of properties that distinguish programming languages are:

  • types: untyped, statically or dynamically typed (whether types are checked before or during execution), strongly or weakly typed (how strictly types are enforced)
  • compiled languages rely on an explicit compilation step
  • interpreted languages have no explicit compilation step
  • procedural: sequential procedures
  • object oriented
  • aspect oriented: e.g. event logging
  • functional: stateless, input / output pipelines constructed out of functions
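
To make the procedural and functional styles above concrete, here is a minimal Python sketch. The function names and the sample data are made up for illustration and were not part of the talk.

```python
from functools import reduce


def sum_of_even_squares_procedural(numbers):
    """Procedural style: a sequence of steps that update local state."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n * n
    return total


def sum_of_even_squares_functional(numbers):
    """Functional style: a stateless pipeline built out of functions."""
    evens = filter(lambda n: n % 2 == 0, numbers)
    squares = map(lambda n: n * n, evens)
    return reduce(lambda a, b: a + b, squares, 0)


sample = [1, 2, 3, 4]
print(sum_of_even_squares_procedural(sample))  # 20
print(sum_of_even_squares_functional(sample))  # 20
```

Both functions compute the same result. The first one tracks a running total, while the second one only passes values from one function to the next.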

Code Quality

  • Generally, good code is tested, well-written, etc.
  • We should care about code quality since we write code for someone else, where that someone else may be our employer, our colleagues, or in many cases our future selves.
  • A few examples of projects that have failed partly due to poor code quality:
    • Taurus, London Stock Exchange
    • LASCAD, London Ambulance Service
  • On the debate over scripting v. coding: Scripts are oftentimes prototypes for production code and we should therefore take the quality of scripts as seriously as that of code.

Function and Variable Naming

  • functions:
    • are actions in code
    • use verb and noun for naming (avoid ‘do’, ‘be’, ‘perform’ and other non-informative verbs)
    • be consistent about casing and your use of underscores
    • use one word per concept
    • do not abbreviate too much
  • variables:
    • are ‘things’ hence use nouns
    • use prefix for context
    • see function naming
  • classes:
    • are ‘types of things’
    • use nouns
    • have each class encapsulate one thing only (this is more a matter of code architecture than naming I guess)
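
The naming rules above can be sketched in a few lines of Python. All names here are hypothetical and only serve as an illustration of the rules.

```python
# Unclear: non-informative verb, heavy abbreviation, inconsistent casing.
def doCalc(d, n):
    return sum(d) / n


# Clearer: verb plus noun, full words, consistent snake_case.
def compute_mean(values):
    return sum(values) / len(values)


class TemperatureReading:
    """A class is a 'type of thing': it gets a noun and holds one concept."""

    def __init__(self, celsius):
        self.celsius = celsius


# Variables are 'things': nouns, with a prefix that gives context.
sensor_readings = [TemperatureReading(20.0), TemperatureReading(22.0)]
sensor_mean_celsius = compute_mean([r.celsius for r in sensor_readings])
print(sensor_mean_celsius)  # 21.0
```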

Comments

  • R.C. Martin: “Comments are always failures.”
  • comment only if you cannot make your code clearer otherwise
  • comments should explain why, not how
  • use TODOs and FIXMEs for future reference
  • highlight important places, comment unusual code
  • do not keep old or removed code around in comments, and do not use comments to state conditions that should really be checked in your code
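
A short Python sketch of these commenting rules follows. The functions and the policy mentioned in the comment are invented for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def is_session_expired(age_in_seconds):
    # Why, not how: sessions expire after one day because of a
    # (hypothetical) security policy, not for a technical reason.
    return age_in_seconds > SECONDS_PER_DAY


def parse_port(text):
    port = int(text)
    # A condition that must hold is checked in code
    # rather than only being described in a comment.
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    # TODO: accept service names such as "http" as well
    return port


print(is_session_expired(90000))  # True
print(parse_port("8080"))         # 8080
```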

Functions

  • one function should do exactly one thing (keep them small)
  • avoid side effects
  • use few parameters as every parameter introduces questions (Why is the parameter needed? What are its effects? …)
  • avoid return arguments (TODO: why is this?)
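
The sketch below contrasts a function that hands back its result through an argument with one that simply returns it. It assumes that "return arguments" refers to output arguments, i.e. arguments that a function modifies in order to pass back a result; that reading is an assumption and the function names are made up.

```python
def append_normalized(values, results):
    """Harder to follow: the result is handed back by modifying 'results'."""
    total = sum(values)
    for v in values:
        results.append(v / total)


def normalize(values):
    """Easier to follow: one job, no side effects, result returned directly."""
    total = sum(values)
    return [v / total for v in values]


weights = [1, 1, 2]
print(normalize(weights))  # [0.25, 0.25, 0.5]
print(weights)             # [1, 1, 2], the input is unchanged
```

With the first version, a reader has to inspect the function body to learn that results is changed by the call. The second version makes the data flow visible at the call site.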

Classes

  • are templates for objects
  • package data and functions, i.e. permitted behavior
  • have one concrete responsibility
  • encapsulation:
    • encapsulate as much of the internals of your classes as possible
    • this keeps code that uses your class and your own code independent
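
As a minimal sketch of encapsulation, consider the following hypothetical Python class. Callers only use add() and mean, so the internal bookkeeping could change later without breaking them.

```python
class RunningMean:
    """One responsibility: track the mean of the values seen so far."""

    def __init__(self):
        # Leading underscores mark internals that callers should not touch.
        self._count = 0
        self._total = 0.0

    def add(self, value):
        self._count += 1
        self._total += value

    @property
    def mean(self):
        if self._count == 0:
            raise ValueError("no values added yet")
        return self._total / self._count


tracker = RunningMean()
tracker.add(2.0)
tracker.add(4.0)
print(tracker.mean)  # 3.0
```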

Code Layout

Testing

  • Always test and verify your code!
  • catching and fixing bugs earlier is far cheaper than doing so later
  • test-driven development:
    • write a test that fails, implement code to make the test pass
    • rinse and repeat
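
One round of the test-driven cycle might look like the following sketch. The slugify function is a made-up example, and the test is a plain function with an assert so that no test framework is required.

```python
# Step 1: write the test first. While slugify() does not exist yet,
# calling this test fails with a NameError.
def test_slugify_replaces_spaces_with_hyphens():
    assert slugify("Good Coding Practices") == "good-coding-practices"


# Step 2: implement just enough code to make the test pass.
def slugify(title):
    return "-".join(title.lower().split())


# Step 3: run the test again; it now passes. Repeat with the next test.
test_slugify_replaces_spaces_with_hyphens()
```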

Test Types

  • unit tests:
    • the person coding writes these
    • need to be small and execute fast
    • run these tests every time your code changes
  • acceptance tests:
    • test your code against predetermined specifications (these may be your customer’s specifications or needs)
    • may constitute a demo in front of an audience or something programmatic
  • regression tests:
    • similar to unit tests
    • involve greater chunks of your code at a time
    • run these nightly or on some other schedule
  • user tests:
    • the intended user tests your code

How To Test

  • should be automatic and easy
  • test code needs to be of high quality and clear as well
  • every piece of test code should test one thing
  • test often
  • do happy path testing
  • do failure testing as your code needs to fail in an expected manner
  • do boundary testing
  • unit tests should be deterministic (no random behavior), whereas higher-level tests may involve randomness
  • minimize time spent debugging by writing lots of tests
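
Happy path, failure, and boundary testing can be sketched with Python's built-in unittest module. The function under test, to_percentage, is hypothetical, and each test method checks one kind of behavior.

```python
import unittest


def to_percentage(score, max_score):
    """Hypothetical function under test: score as a percentage of max_score."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return 100.0 * score / max_score


class ToPercentageTest(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(to_percentage(5, 20), 25.0)

    def test_fails_in_the_expected_manner(self):
        with self.assertRaises(ValueError):
            to_percentage(21, 20)

    def test_boundaries(self):
        self.assertEqual(to_percentage(0, 20), 0.0)
        self.assertEqual(to_percentage(20, 20), 100.0)


# Run the tests programmatically so that the script does not exit afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToPercentageTest)
result = unittest.TextTestRunner().run(suite)
```

The tests are deterministic and fast, so they can run automatically every time the code changes.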

Improving Code / Refactoring

  • automated tests are a prerequisite for refactoring (do not start refactoring without them)
  • tip: when you want to use or modify someone else’s code, write tests for it
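
A minimal sketch of this tip: before touching inherited code, write tests that pin down its current behavior, then check the refactored version against the same expectations. The pricing functions are invented for illustration, and prices are in integer cents to keep the comparisons exact.

```python
def legacy_price(quantity, unit_cents):
    """Stand-in for someone else's code that we want to refactor."""
    p = quantity * unit_cents
    if quantity >= 10:
        p = p - p // 10
    return p


def check_price_behaviour(price_function):
    """Tests that pin down the behavior we rely on."""
    assert price_function(1, 500) == 500     # no discount
    assert price_function(9, 500) == 4500    # just below the threshold
    assert price_function(10, 500) == 4500   # 10% bulk discount applies


def refactored_price(quantity, unit_cents):
    """Refactored version with named intermediate values."""
    total = quantity * unit_cents
    bulk_discount = total // 10 if quantity >= 10 else 0
    return total - bulk_discount


check_price_behaviour(legacy_price)      # passes before refactoring
check_price_behaviour(refactored_price)  # still passes afterwards
```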

Methodologies

  • waterfall:
    • step-by-step implementation of the whole
    • the whole only works once all steps have been implemented
    • finish each step in sequence
  • agile:
    • finish a simple version of the whole in a short period of time
    • lean software (few features) initially and add features across the entire system so as to keep the whole usable
    • agile methodologies: scrum, kanban, etc.

Human Factor

  • writing software is always a team effort
  • write your code with the intention of your code becoming common property (avoid the ‘my code’ mentality and do not shy away from modifying someone else’s code)
  • avoid scenarios where just one person understands part of the code

Other Important Topics Touched Upon

  • version control
  • code review
  • pair programming
