|
I’m a “data scientist” who writes scripts in Python and R. I am the second most junior member of a five-member data science team. I am not a clever man. However, my boss has asked me to design a workflow for our team to develop production systems. Please help me.

Some background: Without getting too deep into the history of our team, we don’t have much experience developing production systems. We’ve mostly done BI work and ad-hoc analysis requests. Our first ever production system, a script that tries to predict which of our customers will leave the company soon (so that the sales team can intervene), only launched 3 months ago. We will probably be launching half a dozen or so new production systems within the next year, some of which will be interacted with directly by customers. That means it’s a big deal if these systems fail. So the “workflow” I’m designing needs to keep the risk of a crash to the bare minimum.

In short, this workflow should ensure:
A) that crashes in production never happen
B) that when crashes do happen, they can be fixed quickly by any member of the team

With regards to A:
With regards to B:
How you can help me:
|
# ? Jul 10, 2018 19:55 |
|
|
# ? Apr 25, 2024 02:21 |
|
Chomskyan posted:
I’m a “data scientist” who writes scripts in Python and R. I am the second most junior member of a 5 member data science team. I am not a clever man. However, my boss has asked me to design a workflow for our team to develop production systems. Please help me.

a) At least for the style, having people manually review code for style violations is not a great idea; it'll lead to a lot of wasted time and arguments. I would suggest introducing a linter into your build process. A linter offers the following advantages:
1. It enforces style at build time, before the code ships.
2. It avoids arguments about the rules: they have to be specific enough for the linter to check, and the linter is what enforces them.
Whatever style you pick should be enforceable by your linter.

b) For unit tests, YMMV, but getting 100% code coverage is an extremely optimistic goal and a waste of time imo. I would say a good rule of thumb is to get your codebase into a condition where you could write a unit test that reproduces any given problem, unit test the public interfaces of components only, and use the unit tests as a) examples of how to use the component and b) a way of automating regression testing (if every bug can be expressed as a unit test, then writing a test for each bug you fix stops it from coming back). It might be better to see if you can mandate unit tests for every bug fix instead of an arbitrary code coverage requirement.
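
On the linter point in a), here is a minimal sketch of what "a rule specific enough to be enforced by a tool" looks like. It is a toy stand-in for a real linter such as flake8 or pylint, not a replacement for one. The rule (function names must be snake_case) and the names `find_style_violations` and `build_exit_code` are made up for illustration.

```python
import ast
import re
from typing import List

# Hypothetical team rule: function names are lowercase snake_case
# (leading underscores allowed, so __init__ and _helper pass).
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def find_style_violations(source: str) -> List[str]:
    """Return one message per function definition that breaks the naming rule."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                violations.append(
                    f"line {node.lineno}: function '{node.name}' is not snake_case"
                )
    return violations


def build_exit_code(sources: List[str]) -> int:
    """Return 1 if any source has a violation, else 0.

    A build script could pass this value to sys.exit() so that the
    build fails whenever the rule is broken.
    """
    for source in sources:
        if find_style_violations(source):
            return 1
    return 0
```

The point is that nobody has to argue about the rule in code review: either the check passes or the build stops.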
|
# ? Jul 11, 2018 15:20 |
|
Bruegels Fuckbooks posted:
b) For unit tests, YMMV, but getting 100% code coverage is an extremely optimistic goal and a waste of time imo. I would say a good rule of thumb is to get your codebase into a condition where you could write a unit test that reproduces any given problem, unit test the public interfaces of components only, and use the unit tests as a) examples of how to use the component and b) a way of automating regression testing (if every bug can be expressed as a unit test, then writing a test for each bug you fix stops it from coming back). It might be better to see if you can mandate unit tests for every bug fix instead of an arbitrary code coverage requirement.

Lots of bugs get introduced when someone who doesn't have insight into the original purpose of code changes that code to fix another bug and doesn't consider its effects in an edge case. Occasionally, that "someone" is you: the person who wrote the code a while ago.

Khorne fucked around with this message at 20:11 on Jul 11, 2018 |
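
A rough sketch of the "one unit test per bug fix" idea, using the standard library's `unittest`. The function `churn_rate` and the bug it describes (a crash on an empty customer segment) are invented for illustration; they are not from the original poster's system.

```python
import unittest


def churn_rate(churned: int, total: int) -> float:
    """Fraction of customers who left (hypothetical example function)."""
    if total == 0:
        # The fix: an empty segment used to raise ZeroDivisionError.
        return 0.0
    return churned / total


class ChurnRateRegressionTest(unittest.TestCase):
    def test_empty_segment_returns_zero(self):
        # Pins the edge case from the (hypothetical) bug report, so a later
        # change that reintroduces the crash fails this test.
        self.assertEqual(churn_rate(0, 0), 0.0)

    def test_typical_segment(self):
        # Also documents normal usage of the public interface.
        self.assertAlmostEqual(churn_rate(5, 20), 0.25)
```

Saved in a test module, this runs with `python -m unittest`. The edge-case test is the part that protects you from the person (possibly you) who edits `churn_rate` a year from now without knowing why the `total == 0` branch exists.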
# ? Jul 11, 2018 20:02 |
|
If you want to be able to launch a system with high confidence that it'll never crash in production, Python is a bad choice. You really want a language like Java, C# or even C++ with better static analysis tools. Other than that, I don't really see anything data science-specific here. This seems to be a bog standard software problem.
|
# ? Jul 22, 2018 17:14 |
|
ultrafilter posted:
If you want to be able to launch a system with high confidence that it'll never crash in production, Python is a bad choice. You really want a language like Java, C# or even C++ with better static analysis tools.

Not really wanting to start a war here, but C++ is generally the wrong choice for a new project unless:
a) the ecosystem/libraries you want to use are already written in C++ and it would be more effort to wrap those than to just write some C++,
b) you already have 5-10+ years of experience with C++, or the intent of the project is to learn C++, or
c) GC pauses are a dealbreaker in said app.

There are plenty of static analysis tools for C++, but the problem is that the tools are necessary because it's very easy to gently caress up in C++.
|
# ? Jul 22, 2018 21:31 |
|
C++11 and later are a lot better than earlier versions, but yeah, that wouldn't be my first choice for a new project. I still think it's a better choice than Python, though.
|
# ? Jul 22, 2018 23:30 |
|
ultrafilter posted:
C++11 and later are a lot better than earlier versions, but yeah, that wouldn't be my first choice for a new project. I still think it's a better choice than Python, though.

Can you expand on this? I might be starting to work for a Python data analysis shop. What's the drawback of Python?
|
# ? Sep 12, 2018 10:53 |
|
LuckySevens posted:
Can you expand on this? I might be starting to work for a Python data analysis shop. What's the drawback of Python?

It depends on project size. Python has two problems that make it unsuited to large projects:
a) Python is a dynamically typed, interpreted language. Many sorts of errors that would be detected by a C/C++/Java compiler are turned into run-time errors in Python. Tooling is also generally less effective on dynamic languages than on statically typed ones (e.g. IntelliSense works better with static typing).
b) The performance of Python is generally going to be lower than that of statically typed, compiled equivalents, since it is an interpreted language.

Problem a) is not insurmountable with good engineering discipline: keeping modules small, having good unit test coverage, and using type annotations (a language feature as of Python 3.5, and there are tools that can provide these for earlier versions) are all ways to mitigate this problem.

Problem b) is also not insurmountable: it's possible to use numba, which uses LLVM to compile the Python code and improve the performance, and there are other ways of speeding up Python through compilation.

There's also nothing wrong with using Python for quick engineer-type scripts or dealing with data quickly. However, if you're starting from scratch and plan on having a big rear end project with code from dozens of people, the default choice should be C#/Java imo, but anyone can get anything to work with enough effort.
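
To make problem a) and the type annotation mitigation concrete, here is a small sketch. The helper `mean_tenure_months` and the bad call are hypothetical. The assumption is that the team runs a static type checker such as mypy over the code; Python itself ignores the annotations at run time.

```python
from typing import List


def mean_tenure_months(tenures: List[float]) -> float:
    """Average customer tenure in months (hypothetical helper)."""
    return sum(tenures) / len(tenures)


def call_with_wrong_type() -> float:
    # A string is passed where a list of floats is expected. A static checker
    # reports this line without running anything. Without a checker, nothing
    # complains until this function is actually called, at which point sum()
    # raises TypeError.
    return mean_tenure_months("12, 30, 6")


# The correct call works as expected.
average = mean_tenure_months([12.0, 30.0, 6.0])
```

In a compiled, statically typed language the bad call would not build at all. In Python it sits there until that code path runs, which may well be in production, so the checker has to be part of the workflow rather than an optional extra.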
|
# ? Sep 12, 2018 22:04 |
|
That's basically everything I would say. The only thing I want to add is that while it's true that

Bruegels Fuckbooks posted:
anyone can get anything to work with enough effort

you have to keep in mind that not all of that effort is going to happen before launch.
|
# ? Sep 13, 2018 00:37 |
|
Bruegels Fuckbooks posted:
very good reply

Thanks!
|
# ? Sep 13, 2018 09:18 |