  • Locked thread
Red and Black
Sep 5, 2011

I’m a “data scientist” who writes scripts in Python and R. I am the second most junior member of a five-member data science team. I am not a clever man. However, my boss has asked me to design a workflow for our team to develop production systems. Please help me.

Some background: Without getting too deep into the history of our team, we don’t have much experience developing production systems. We’ve mostly done BI work and ad-hoc analysis requests. Our first ever production system, a script that tries to predict which of our customers will leave the company soon (so that the sales team can intervene), launched only three months ago. We will probably be launching half a dozen or so new production systems within the next year, some of which customers will interact with directly. That means it’s a big deal if these systems fail, so the “workflow” I’m designing needs to keep the risk of a crash to a bare minimum.

In short, this system should ensure:
A) that crashes in production (almost) never happen
B) that when crashes do happen they can be fixed quickly by any member of the team

With regards to A:
  • A Linux Docker container plus a virtual environment via pipenv (Python) or packrat (R), so the environment the code is developed in is always identical to the production environment. (This is not the case right now and it causes problems.)
  • Unit tests for all the functions, methods and classes of any production script. I’m thinking we use pytest for Python and testthat for R, with Python’s hypothesis package applied whenever possible. (This is not what we’re doing now; our sole production system has no unit tests.)
  • Separation of the development and production versions of the code via git. (We are doing this now, but weren’t before. Also, nobody on our team is used to working with git beyond the basic commands.)
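To make the testing bullet concrete, here’s a sketch of what a pytest-style test file could look like. Everything here is made up for illustration: tenure_months is a hypothetical helper, and the second test fakes hypothesis’s property-testing idea with stdlib random, since the real thing would be written with @given(st.dates(), st.dates()).

```python
# test_tenure.py - a sketch of the pytest idea; run with `pytest test_tenure.py`.
# tenure_months() is a hypothetical helper, not from any real system.
import random
from datetime import date, timedelta


def tenure_months(signup, today):
    """Whole months a customer has been with us, floored at zero."""
    months = (today.year - signup.year) * 12 + (today.month - signup.month)
    return max(months, 0)


def test_tenure_simple_case():
    # pytest discovers any function named test_* and runs its asserts
    assert tenure_months(date(2017, 1, 15), date(2018, 1, 15)) == 12


def test_tenure_is_never_negative():
    # poor man's property test; with hypothesis this would be
    # @given(st.dates(), st.dates()) and the library would hunt for
    # counterexamples (e.g. a signup date in the future) itself
    rng = random.Random(0)
    start = date(2000, 1, 1)
    for _ in range(200):
        signup = start + timedelta(days=rng.randrange(10000))
        today = start + timedelta(days=rng.randrange(10000))
        assert tenure_months(signup, today) >= 0
```

The point of the second test isn’t the random loop itself; it’s that a property (“tenure is never negative”) is checked across many inputs instead of one hand-picked case.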

With regards to B:
  • Clean code and lots of documentation. Our team follows no style guide. Maybe I should suggest we follow PEP 8 and PEP 257 for Python? Or possibly the Google style guides for R and Python would be a better basis. We’ll also need some sort of standard for writing Oracle SQL code.
  • It’s not going to be enough to say “write clean code”; we need a system in place where everyone reviews everyone’s production code. Maybe we can have code reviews where the entire team systematically goes through the program, rewriting bad variable names and non-idiomatic code? Surely someone has developed a system to deal with this kind of problem.
  • Half of our team programs in Python 2.x and the other half programs in 3.x. We should probably only program in Python 3, since none of the libraries we use are exclusive to Python 2.
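To make the 2-vs-3 point concrete: the same lines can mean different things under the two interpreters, which is exactly the kind of silent divergence that bites a mixed team. (The asserts below hold under Python 3.)

```python
# Two classic 2.x/3.x divergences. Both lines run under either
# interpreter, but only Python 3 gives the answers asserted here.
churn_rate = 7 / 2      # 3.5 in Python 3; truncated to integer 3 in Python 2
assert churn_rate == 3.5

word = "caf\u00e9"      # a 4-character str in Python 3; raw bytes in Python 2
assert len(word) == 4
```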

How you can help me:
  • Tell me how the team you work on ensures A and B above
  • Tell me what you think I’ve listed here that’s a good idea, and what’s a bad idea
  • Post resources or book recommendations


Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

Chomskyan posted:

I’m a “data scientist” who writes scripts in Python and R. [...] my boss has asked me to design a workflow for our team to develop production systems. Please help me.

a) At least for style, having people manually review code for style violations is not a great idea - it'll lead to a lot of wasted time and arguments. I would suggest introducing a linter into your build process. A linter offers the following advantages:
- 1. Enforces style automatically at build/CI time, before code gets merged.
- 2. Avoids rule arguments - the rules have to be specific enough for a machine to enforce, and once they're in the linter config, the linter is the arbiter rather than whoever argues loudest in review.

Whatever style you pick should be enforceable by your linter.
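As an illustration of "rules specific enough to be enforced by the linter": mechanical style rules are really just checks over the syntax tree. The toy checker below (stdlib only, not a real tool - in practice you'd reach for flake8 or pylint for Python and lintr for R) enforces one hypothetical rule, "no single-letter variable names":

```python
# A ten-line toy "linter": flags single-letter names at assignment.
# Illustrative only; real linters (flake8, pylint) work the same way
# at heart - parse to an AST, walk it, report rule violations.
import ast


def check_no_short_names(source):
    """Return (line_number, name) for every 1-letter variable assignment."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        # ast.Store context means the name is being assigned, not read
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if len(node.id) == 1:
                problems.append((node.lineno, node.id))
    return problems


print(check_no_short_names("x = 1\ncustomer_count = 2\n"))  # [(1, 'x')]
```

Because the rule is mechanical, there is nothing left to argue about in review: the check either passes or it doesn't.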

b) For unit tests, YMMV, but getting 100% code coverage is an extremely optimistic goal and a waste of time imo. A better rule of thumb: get your codebase into a condition where you could potentially write a unit test for any given problem, unit test the public interfaces of components only, and use the unit tests as a) examples of how to use the component and b) a way of automating regression testing (if every bug can be represented as a unit test, then writing a unit test for each bug stops regressions). It might be better to see if you can mandate having a unit test for every bug fix instead of an arbitrary code coverage requirement.
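In practice the "unit test per bug fix" idea looks something like the sketch below; churn_score and the ticket ID are both made up for illustration:

```python
# Regression test pinned to a (hypothetical) bug ticket. The scorer
# used to raise ZeroDivisionError for customers with no purchases;
# the test name carries the ticket ID so future readers know why it
# exists, and the bug can never silently come back.
def churn_score(total_spend, n_purchases):
    """Toy scorer: higher means more likely to churn (0.0 to 1.0)."""
    if n_purchases == 0:        # the original bug: division by zero below
        return 1.0              # no purchases -> assume max churn risk
    avg_spend = total_spend / n_purchases
    return min(1.0, 100.0 / avg_spend) if avg_spend > 0 else 1.0


def test_churn_score_zero_purchases_regression_DS_142():
    # would have failed with ZeroDivisionError before the fix
    assert churn_score(0.0, 0) == 1.0
```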

Khorne
May 1, 2002

Bruegels Fuckbooks posted:

[...] It might be better to see if you can mandate having unit tests for every bug fix instead of an arbitrary code coverage requirement.
I'd suggest also explicitly writing unit tests for things that are subject to edge cases that only "you" are going to know about. It's like a comment in the code that forces you to read and understand it.

Lots of bugs get introduced when someone who doesn't have insight into the original purpose of code changes that code to fix another bug and doesn't consider its effects in an edge case. Occasionally, that "someone" is you: the person who wrote the code a while ago.

Khorne fucked around with this message at 20:11 on Jul 11, 2018

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


If you want to be able to launch a system with high confidence that it'll never crash in production, Python is a bad choice. You really want a language like Java, C# or even C++ with better static analysis tools.

Other than that, I don't really see anything data science-specific here. This seems to be a bog standard software problem.

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

ultrafilter posted:

If you want to be able to launch a system with high confidence that it'll never crash in production, Python is a bad choice. You really want a language like Java, C# or even C++ with better static analysis tools.

not really wanting to start a war here, but C++ is generally the wrong choice for a new project unless
a) the ecosystem/libraries you want to use are written in C++ already and it would be more effort to wrap those than to just write some C++
b) you already have 5-10 years+ experience with C++ or the intent of the project is to learn C++
c) GC pauses are a dealbreaker in said app.

there are plenty of static analysis tools for C++, but the problem is that the tools are necessary because it's very easy to gently caress up in C++.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


C++11 and later are a lot better than earlier versions, but yeah, that wouldn't be my first choice for a new project. I still think it's a better choice than Python, though.

LuckySevens
Feb 16, 2004

fear not failure, fear only the limitations of our dreams

ultrafilter posted:

C++11 and later are a lot better than earlier versions, but yeah, that wouldn't be my first choice for a new project. I still think it's a better choice than Python, though.

Can you expand on this? I might be starting to work for a python data analysis shop. What's the drawback of python?

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

LuckySevens posted:

Can you expand on this? I might be starting to work for a python data analysis shop. What's the drawback of python?

it depends on project size.

python has two problems that make it unsuited to large projects:

a) python is a dynamically typed, interpreted language. many sorts of errors that would be detected by a C/C++/Java compiler become run-time errors in python. tooling is also generally less effective on dynamic languages than on statically typed ones (e.g. intellisense works better with static typing)

b) the performance of python is generally going to be lower than that of statically typed equivalents, since it is an interpreted language.

Problem a) is not insurmountable with good engineering discipline - keeping modules small, having good unit test coverage, and using type annotations (a language feature as of python 3.5; there are tools that can provide these for earlier versions) are all ways to mitigate this problem.
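For instance, annotations plus a checker like mypy move a whole class of errors from run time back to something like compile time; monthly_rate here is a made-up example:

```python
# With type annotations, running `mypy` on this file flags the bad
# call in the comment before anything executes; plain python would
# only fail at the moment that line actually ran.
def monthly_rate(annual_rate: float, months: int = 12) -> float:
    return annual_rate / months


rate = monthly_rate(0.12)
# monthly_rate("12%")   # mypy: Argument 1 has incompatible type "str"
```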

Problem b) is also not insurmountable - it's possible to use numba, which compiles python functions to machine code via LLVM, to improve performance, and there are other ways of speeding python up through compilation.
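A sketch of the numba route, under the assumption that the hot path is a simple numeric loop; the try/except just means the function still runs (slowly, as plain python) when numba isn't installed:

```python
# Sketch of the numba idea: @jit compiles the decorated function to
# machine code via LLVM the first time it is called. The try/except
# makes this degrade gracefully to plain python without numba.
try:
    from numba import jit
except ImportError:                 # numba unavailable: no-op decorator
    def jit(nopython=True):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def mean_of_squares(n):
    # the kind of tight numeric loop where compilation pays off
    total = 0.0
    for i in range(n):
        total += float(i) * float(i)
    return total / n
```

The nopython=True flag asks numba to fail loudly rather than silently fall back to the slow interpreter path.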

There's also nothing wrong with using python for quick engineer-type scripts or dealing with data quickly.

However, if you're starting from scratch and plan on having a big rear end project with code from dozens of people, the default choice should be C#/Java imo, but anyone can get anything to work with enough effort.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


That's basically everything I would say. The only thing I want to add is that while it's true that

Bruegels Fuckbooks posted:

anyone can get anything to work with enough effort

you have to keep in mind that not all of that effort is going to happen before launch.


LuckySevens
Feb 16, 2004

fear not failure, fear only the limitations of our dreams


Thanks!
