Cingulate
Oct 23, 2012

by Fluffdaddy
Or a pandas DataFrame.


Cingulate
Oct 23, 2012

by Fluffdaddy

Jose Cuervo posted:

I want to take the information present in a database table (.accdb) and read it into a Pandas dataframe. Is this possible? From Googling and looking on Stackoverflow I cannot find a way but I have never used a database before so maybe I am missing something.
Export to a CSV and use pandas.read_csv?

Cingulate
Oct 23, 2012

by Fluffdaddy
Since I'm a bad programmer, I very often do something that's basically

code:
for item_1, item_2 in zip(list_a, list_b):
    something(item_1) = some_function(item_2)
Now if I were, for example, constructing a dict, I could use a dict comprehension:

code:
my_dict = {item_1: some_function(item_2) for item_1, item_2 in zip(list_a, list_b)}
And that seems very nice and sensible, and also it's fast.

But imagine I'm not constructing a dict, but adding to a Pandas DataFrame or something similar. E.g.,

code:
df = pd.DataFrame()

for item_1, item_2 in zip(list_a, list_b):
    df.loc[item_1,"some_field"] = some_function(item_2)
What's the pythonic way? Map? Lambda? Goto?

E: fixed the dict comprehension

Cingulate fucked around with this message at 15:08 on Nov 27, 2014
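(The reply being thanked in the next post isn't quoted; as a rough illustration only, a minimal sketch of the vectorised Pandas alternative to the loop above, using the placeholder names some_function, list_a and list_b from the post and made-up data:)

Python code:
import pandas as pd

# hypothetical stand-ins for the names in the post
list_a = ["x", "y", "z"]
list_b = [1, 2, 3]
def some_function(v):
    return v ** 2

# build the whole column at once instead of growing df with .loc in a loop
df = pd.DataFrame(
    {"some_field": [some_function(item) for item in list_b]},
    index=list_a,
)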

Cingulate
Oct 23, 2012

by Fluffdaddy
Ah yes, that's a much smarter implementation for the specific Pandas case that'll actually improve about half of my scripts.

But is there a general answer? What if I want to assign bar(y) to some foo specified by x without a loop:
code:
for x, y in zip(list_1, list_2):
    foo(x) = bar(y)
Or is the question itself somehow badly asked?

Cingulate
Oct 23, 2012

by Fluffdaddy
Ah, knew there'd be some map-like thing in there. Thanks.

Cingulate
Oct 23, 2012

by Fluffdaddy
This is basically the only thing I'm saying in this thread, but TMH, have you considered using Pandas?

Cingulate
Oct 23, 2012

by Fluffdaddy

Thermopyle posted:

Also Guido wants to bring something like mypy into core python. In fact, you can use mypy with type annotations right now!
Python code:
from typing import Iterator

def fib(n: int) -> Iterator[int]:
    a, b = 0, 1
    while a < n:
        yield a
        a, b = b, a+b
http://www.mypy-lang.org/tutorial.html
Not that Python is too slow for me, but is it realistic that performance benefits will eventually be achieved by building on this?

I understand that's not the focus or the expected main benefit, though.

Cingulate
Oct 23, 2012

by Fluffdaddy
Is there a simple explanation (rule) for why (or when), in such contexts, dicts are preferable to objects (i.e., dict[key] vs. object.attribute)?

Cingulate
Oct 23, 2012

by Fluffdaddy
My main Python installation is Python 3 from Anaconda, so python calls the Anaconda Python 3. I want to install a package that is Python 2.x only and expects python to be a Python 2.x interpreter. What's the best way to go about this?

Also, it seems readthedocs.org is down right now?

Edit: I'm on a Mac, if this matters.

Cingulate fucked around with this message at 15:31 on Dec 22, 2014
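For what it's worth, a common way to handle this with conda is a separate Python 2 environment (a sketch; the environment name py27 is arbitrary):

code:
# create a separate Python 2.7 environment (the name "py27" is arbitrary)
conda create -n py27 python=2.7
# switch to it when the 2.x-only package is needed (macOS/Linux syntax)
source activate py27
python --version
source deactivate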

Cingulate
Oct 23, 2012

by Fluffdaddy
I'm using Anaconda for my Python distribution. To use the Intel MKL with Python, I either have to buy Anaconda's optimizer add-on package or compile NumPy manually, correct? And the latter option is probably a bad idea, since I'd have a good chance of messing up Anaconda's infrastructure?

Cingulate
Oct 23, 2012

by Fluffdaddy
Yes, I already have the MKL and actually just compiled R to make use of it (... instead of going the comfortable route and downloading Revolution Analytics' R distribution).
I've also set up a few conda envs (thanks to this thread) - though may I ask, in this context, how I can remove an entire environment at once from the CLI?

I'm just wondering if a mostly computer-illiterate person such as myself should even bother trying to get MKL and an Anaconda Python to play along nicely by hand (e.g. by manually compiling NumPy), or if I should just reach for my credit card.

Do I understand correctly that you're working for continuum.io?

Cingulate fucked around with this message at 19:51 on Jan 27, 2015
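As an aside on the environment-removal question above, conda can drop a whole environment in one command (the environment name here is a placeholder):

code:
conda remove --name myenv --all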

Cingulate
Oct 23, 2012

by Fluffdaddy
dear bigreddot please get the conda statsmodels package to the recent (0.6.1) version I need the ordinal GEE api

Also I'll ask my dadboss to shell out some money for MKL Optimizations.

Edit:

BigRedDot posted:

I work for Continuum, and I wrote the original version of conda
:colbert:

I've started using Anaconda for all my Python needs and have just recommended that our newest PhD students set up their systems with Anaconda.

Cingulate fucked around with this message at 10:23 on Jan 28, 2015

Cingulate
Oct 23, 2012

by Fluffdaddy

BigRedDot posted:

It looks like it already is? :)
I SWEAR it was still at 0.5 when I checked.

Cingulate
Oct 23, 2012

by Fluffdaddy
I don't get parallel for loop syntax. At all. Like, I keep staring at the documentation for multiprocessing or whatever, and it's all Greek to me.

I just want to do something like
code:
parfor x in range(0, 10): my_list[x] = foo(x)

FWIW, this is on Python 2.7. IPython, that is.

Cingulate
Oct 23, 2012

by Fluffdaddy

vikingstrike posted:

This article may be helpful: http://chriskiehl.com/article/parallelism-in-one-line/ In particular, using the map() function on a multiprocessing Pool() object at the very end.
Thank you, it actually did.


salisbury shake posted:

What are you trying to accomplish?
Trying to do this without having to use map().

:(
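For reference, the Pool.map pattern from the linked article, applied to the "parfor" example above (a sketch; foo is a placeholder for the real per-item work):

Python code:
from multiprocessing import Pool

def foo(x):
    return x * x  # placeholder for the real per-item work

if __name__ == '__main__':
    pool = Pool(4)                       # number of worker processes
    my_list = pool.map(foo, range(10))   # parallel counterpart of the parfor loop
    pool.close()
    pool.join()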

Cingulate
Oct 23, 2012

by Fluffdaddy
Okay, I think I basically got the multiprocessing thing now - my problem was that I was trying to avoid a functional style, but once I stopped trying to make everything be a for loop, it started making sense.

However - the thing I want to parallelise is already parallelised. What I mean is, I have a function that inherently utilises 10 or so of our (>100) cores. I want to run multiple instances of that function in parallel, to get closer to utilising 50 or so of our cores (and no, I can't really make the functions themselves able to parallelise more efficiently).
Basically, I want to apply a large decomposition to large datasets. I have 20 independent large datasets and want to process them in parallel. But the decomposition function is already mildly parallelised.

When I simply do what's explained in vikingstrike's link (multiprocessing.dummy.Pool), I actually make everything much slower because the individual sessions only utilise 1 core.
Can I somehow parallelise parallelised functions (execute multiple instances of a parallelised function in parallel)?

Am I making sense?
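One way to sketch this, assuming the inner function keeps its internal (~10-core) parallelism when run inside a worker process (which depends on the library): keep the outer pool small enough that outer workers times inner cores stays within the core budget.

Python code:
from multiprocessing import Pool

def decompose(dataset):
    # placeholder for the decomposition that itself uses ~10 cores internally
    return sum(dataset)

if __name__ == '__main__':
    datasets = [range(1000)] * 20   # stand-ins for the 20 independent datasets
    # 5 outer workers x ~10 cores per inner call ~= 50 cores in total
    pool = Pool(processes=5)
    results = pool.map(decompose, datasets)
    pool.close()
    pool.join()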

Cingulate
Oct 23, 2012

by Fluffdaddy
I did conda update --all on my iMac and it's been doing this

quote:

Fetching package metadata: ..
Solving package specifications: ..
Error: Unsatisfiable package specifications.
Generating hint:
WARNING: This could take a while. Type Ctrl-C to exit.
[47796/341376 ] |############ | 14%
for a few days now.

Could take a while indeed.

Cingulate
Oct 23, 2012

by Fluffdaddy
What would the original, list-comprehension one be called? What "style" is that?

(My Python is like 75% list comprehensions these days.)

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

The only thing "wrong" with a functional style is it's not mainstream Python, so your typical Python programmer is not likely to have encountered it and will be confused.

Cingulate, I don't know about identifying a style (it's not really clear cut, even my version isn't functional in the more academic sense), but it's worth pointing out that there's no list comprehension, but a generator expression. You shouldn't have 75% list comprehensions, but a lot of generator expressions/comprehensions of all kinds is usually considered the best way to write Python code.
Ah thanks.

BigRedDot posted:

I haven't used map or filter in years. List comprehensions and generator expressions
  • are declarative (almost always the best option when it is available)
  • are non-trivially more efficient (dispatch immediately to Python C API)
As for the style, it's sometimes referred to as "declarative". In the manner of Prolog, the idea is to tell the computer directly what you want, rather than a sequence of steps for how to do it. The implication is that you always have to know what you want anyway, whereas giving the steps requires converting what you want into a sequence of steps - and eliminating that conversion leaves less room for errors.
Multiprocessing only does .map though, there is no "parallel list comprehension" right?

Cingulate
Oct 23, 2012

by Fluffdaddy

Bob Morales posted:

We use 1 tech per state right now.

I thought about that, but then if we fire a tech and hire a new one I have to change the tech assigned to like, 15 states in some cases.
You could do that in a single line though.

Python code:
[state_to_tech.__setitem__(state, new_technician) for state, technician in zip(state_to_tech.keys(), state_to_tech.values()) if state_to_tech[state] == old_technician]
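For comparison, the conventional spelling of that update, with the same names, is just a short loop:

Python code:
# plain-loop version of the one-liner above
for state, technician in state_to_tech.items():
    if technician == old_technician:
        state_to_tech[state] = new_technician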

Cingulate
Oct 23, 2012

by Fluffdaddy
Dominoes, I don't think one could do something similar for, e.g., the linear solvers in NumPy though? (Compared to the Intel MKL build of NumPy.) Since, probably, they already run as compiled C or Fortran code anyway.

Cingulate
Oct 23, 2012

by Fluffdaddy

Dominoes posted:

Faster GLM/RLM would be nice though; coincidentally it's the limfac in one of my projects.
Have you looked at MATLAB's mldivide?

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

This is the first time that I've ever heard someone suggest MATLAB in response to "I need this to run faster"
I do that occasionally - of course, many of MATLAB's basic capabilities are state of the art, top of the line.

Though if anybody knows of a linear system solver faster than MATLAB's mldivide, please please do tell me.

Cingulate
Oct 23, 2012

by Fluffdaddy

Dominoes posted:

Prepend return to the function's last line.

Is there a reason why python doesn't implement numpy/matlab-style indexing with multiple keys or indices?

I.e., for a list of dicts:
Python code:
adict[3, 'akey']
As a cleaner alternative to
Python code:
adict[3]['akey']
The answer is Pandas.
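To unpack "the answer is Pandas" a bit: a DataFrame gives you exactly that multi-key bracket syntax. A small sketch with made-up data:

Python code:
import pandas as pd

records = [{'akey': 'a0'}, {'akey': 'a1'}, {'akey': 'a2'}, {'akey': 'a3'}]
df = pd.DataFrame(records)

df.loc[3, 'akey']   # one pair of brackets, two indices -> 'a3'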

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

This is a dirty lie (unless you're choosing a very narrow definition of "basic") and you shouldn't spread it
mldivide is a very basic capability of MATLAB - it's a single-character operator, more basic than element-wise matrix operations! - and for all I know, state of the art, top of the line. I don't know what you're talking about.

Cingulate
Oct 23, 2012

by Fluffdaddy
Yes - basically, mldivide is seemingly very good at picking which specific package to call (e.g., SuiteSparse/UMFPACK), and of course MathWorks pays the cash for licensing the MKL and so on.

So, it's state of the art.

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

I'm talking about all of the other "basic features" of MATLAB that are notoriously slow and archaic as gently caress. The features of Matlab that were inherited from older projects work great. Basically, any feature that can't be vectorized with a function written in Fortran before 1990 is complete garbage. Mathworks can't even provide basic list functionality without making it an O(n) operation, for gently caress's sake

By your logic, Fortran77 is a state of the art programming language
I think you're trying to disprove something quite different from what I stated, such as "MATLAB is a good general programming language" or something like that.

mldivide is a basic MATLAB functionality, and it is state of the art, top of the line. I assume the same goes for e.g. dot products or matrix inversions. Thus, many of MATLAB's basic capabilities are state of the art, top of the line.

I've never actually compared the numpy/scipy versions to MATLAB; I'd assume with default installations, they're somewhat to noticeably slower.

Cingulate
Oct 23, 2012

by Fluffdaddy

QuarkJets posted:

And I agreed with you, MATLAB's features that were built pre-1990 in Fortran by someone other than Mathworks run really well, so long as you can completely vectorize the operation
UMFPACK was released in 1994 and is written in C, though, and I assume the parts of mldivide that check whether UMFPACK is appropriate are not written in pre-1990 Fortran either.
And I guess this must be my final contribution to this slightly silly derail.

Cingulate
Oct 23, 2012

by Fluffdaddy
Hm. In that case, I'm going to make my suggestion to Dominoes clearer: if solving linear systems is a limiting factor in your stuff, maybe take a look at MATLAB's mldivide, which
1. is somewhat well documented (and the parameters it uses for any given call can be laid bare) - e.g. SO's link
2. makes calls to the best actual number crunchers, so you can learn what the best actual number crunchers are
3. is pretty good at deciding which number crunchers to call based on the properties of the input (sparse, square, etc.)

Cingulate
Oct 23, 2012

by Fluffdaddy

Dominoes posted:

Specifically, I'm running a few operations on millions of dataset combinations. I'm able to optimize most of them well with Numba, or they're quick enough not to be an issue. The holdup is using a linear regression (specifically statsmodels' GLM at the moment) to find residuals. It runs slow compared to the rest of my calculations.
I assume the actual linear regression is the problem here, as computing residuals is fairly trivial? In which case, what I did when I had that specific problem (and I'm not at all an expert or even any good at this) was:
- check whether there is a special property of the matrices you can exploit - are they e.g. sparse, or square?
- are you using the best algorithm for solving that kind of problem? I assume statsmodels calls scipy or numpy for its linear systems, and IIRC neither ships UMFPACK
- do you repeatedly solve y = B*x for the same x? In that case, you can store and recycle the factorisation for massive speed boosts

For me, the optimal solution turned out to be making everything sparse and calling MATLAB's "\" once for all the observation matrices sharing the same predictor matrix. That made solving the linear system the fastest part by far, much faster than building the predictor matrices in the first place - to some extent because I was using, and too lazy to avoid using, MATLAB for loops. I eventually switched over to building the predictors in Python.
So this is my story, hope you liked it.
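In case the "store and recycle the factorisation" point is opaque, a minimal SciPy sketch (assuming a square sparse system and made-up data; for a least-squares problem you would factor the normal equations or reuse a QR instead):

Python code:
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import factorized

# made-up square sparse system A x = y, solved for many right-hand sides
A = csc_matrix(np.eye(1000) + np.diag(np.ones(999), 1))
solve = factorized(A)            # the LU factorisation happens once, here

for _ in range(20):              # e.g. 20 independent datasets
    y = np.random.rand(1000)
    x = solve(y)                 # each call reuses the stored factorisation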

Cingulate
Oct 23, 2012

by Fluffdaddy

Hughmoris posted:

Python code:
movies = rt.lists('dvds', 'new_releases')
for each_movie in movies['movies']:
	print each_movie['title']
If I had written that, I'd have done

Python code:
print([movie['title'] for movie in rt.lists('dvds', 'new_releases')['movies']])
Assuming performance isn't critical and there are no side effects, is something like this considered disfavoured due to readability?

Cingulate
Oct 23, 2012

by Fluffdaddy

Dominoes posted:

Different, probably less-readable approach.

Python code:
movies = rt.lists('dvds', 'new_releases')
list(map(print, (movie['title'] for movie in movies['movies'])))
Or swap the second line with

Python code:
list(map(lambda movie: print(movie['title']), movies['movies']))
I've started using map a bit, but I still find list comprehensions both more readable and easier to write.
Once you introduce lambda, I'm usually out, though. Maybe that's just practice; I have used it before.

If I wanted exactly the same output, I'd probably also have done that as a list comp, like [print(movie['title']) for ... ] or something like that.

Cingulate
Oct 23, 2012

by Fluffdaddy

Hammerite posted:

While we're bikeshedding (yes I know I started it), this list(map(print, ...)) stuff is bananas IMO. Argument-unpacking is your friend!

Python code:
movie_titles = (m['title'] for m in rt.lists('dvds', 'new_releases')['movies'])
print(*movie_titles, sep = '\n')
* and ** are right up there with lambda and map though :colbert:

Also you could do a list comprehension instead and avoid the sep = thing, [print(movie) for movie in movie_titles].

Edison was a dick posted:

That's because map with a lambda is inferior when compared to list comprehensions or generator expressions.
The only time I can think of where a call to map may be theoretically better than a list comprehension, would be if the body of the lambda were sufficiently complicated that there's a form that is written in C which you can pass in place of the lambda, and the operation itself is sufficiently slow that it's worth the overhead in marshalling the data between C and python.
I use map for multiprocessing (as I learned in this thread).
I wish there was a parallel list comprehension, then I'd never ever optimise code ever again and instead spend half my time apologising for crashing the server by filling up all the memory.

Cingulate
Oct 23, 2012

by Fluffdaddy

Edison was a dick posted:

FFS! Let's just
Python code:
print '\n'.join(m['title'] for m in rt.lists('dvds', 'new_releases')['movies'])
and get on with our lives!

Python code:
[print(m['title']) for m in rt.lists('dvds', 'new_releases')['movies']]
is still shorter and holds the smallest number of distinct elements out of all the ways to reproduce the original example :colbert:

Dominoes posted:

I prefer map if I don't need to use lambda or a list comp with it, i.e. the function already exists. In this example, I might prefer it if the input list was already set up, i.e. if it didn't need the ['title'] lookup.
Why? I also use list comprehensions with such functions - what are the advantages at the current stage of Python?
One I found by googling just now: you can perhaps more easily switch between serial and parallel implementations by optionally rebinding map to a multiprocessing Pool's .map ...
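The swap-the-map trick might look something like this (a sketch; process_one and the data are placeholders):

Python code:
from multiprocessing import Pool

def process_one(x):
    return x * x   # placeholder for the real per-item work

def run(items, parallel=False):
    if parallel:
        pool = Pool(4)
        try:
            return pool.map(process_one, items)     # parallel path
        finally:
            pool.close()
            pool.join()
    return list(map(process_one, items))            # serial path

if __name__ == '__main__':
    print(run(range(10)))
    print(run(range(10), parallel=True))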

Cingulate
Oct 23, 2012

by Fluffdaddy

Hammerite posted:

Generating a list as a side effect is ugly and makes it harder to appreciate at a glance what's happening. Shortness is secondary to clarity.
Okay, I get that first point.
Although I think that in this case shortness is clarity, since the shortness comes from not introducing additional words (functions) like join or map.

Cingulate
Oct 23, 2012

by Fluffdaddy
Yeah I've done some reading and I'm getting the point. Great, now I'm gonna rewrite like half my code.

Cingulate
Oct 23, 2012

by Fluffdaddy
In case either of you feels this is all semantics: I'm definitely learning things.

Cingulate
Oct 23, 2012

by Fluffdaddy
I run IPython notebooks remotely that often take days to run. So far I've occasionally been logging into the remote server and checking top to see if it's still crunching, but that's a bit daft, isn't it? I'd like to be informed when the process finishes. I tried sending myself an email, but the default example requires storing passwords in plaintext, which isn't really what I want for notebooks I'll often share with other people. I found a recommendation for the getpass module, but that's not really what I'm looking for either. Any ideas? It doesn't have to be email; I'd just like to get a notification somehow.

Cingulate
Oct 23, 2012

by Fluffdaddy

Edison was a dick posted:

Being set in an environment variable is only barely an improvement. A rogue program that gains sufficient permissions can still read it out of your process, but at least it's not passed on the command line, where any process can see it.

tbh, I'd be happier if people just used getpass.getpass(), even if python doesn't have an interface to locked memory, to prevent the password being accidentally written to disk when the process is swapped out.
Thanks - while this doesn't really make me happy, it seems to be the preferred option.

My coworkers married to MATLAB actually use a script with hard-coded plaintext passwords, so ...
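For reference, the getpass route discussed above is only a couple of lines (a sketch; the addresses and SMTP host are placeholders):

Python code:
import getpass
import smtplib

user = 'me@example.com'                           # placeholder address
password = getpass.getpass('Mail password: ')     # prompted once, never stored in the notebook

server = smtplib.SMTP_SSL('smtp.example.com')     # placeholder host
server.login(user, password)
server.sendmail(user, user, 'Subject: done\n\nThe notebook finished.')
server.quit()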


Cingulate
Oct 23, 2012

by Fluffdaddy

Dominoes posted:

Resolved. It looks like the name 'quick' is protected on PyPi (Although not fully on its test site), despite being unused.

Renamed to 'brisk' and uploaded. Speaking of which, if anyone's interested in faster drop-in replacements for basic numerical functions, check it out here. Suggestions / addition requests encouraged.
Extending OLS to multiple regression! Though as noted, it's not clear there is much to gain.
