Cingulate
Oct 23, 2012

I'm not sure I correctly understand the question, but have you tried help(matplotlib.pyplot.axes)?

If the first arg to plt.axes is a tuple of 4 numbers, it sets the axes position and size in figure-fraction coordinates (as left, bottom, width, height).
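
For example (a minimal sketch):

code:
import matplotlib.pyplot as plt

# left, bottom, width, height, as fractions of the figure
ax = plt.axes((0.1, 0.1, 0.8, 0.8))
ax.plot([1, 2, 3])
plt.show()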


Cingulate
Oct 23, 2012

I'm using joblib. I have a list of 47 entities. When I call

code:
results = Parallel(n_jobs=32, verbose=5)(delayed(f)(item) for item in list_of_items)
the first 32 entities of the list are properly processed. The others trigger an exception, which in this case was caught by my try clause (so they return an empty list). When I call

code:
results2 = Parallel(n_jobs=18, verbose=5)(delayed(f)(item) for item in list_of_items)
only the first 18 are processed.
So results has 15 empty fields and results2 has 29. The output list has the correct length, and the filled fields have the correct content.

f is a fairly long, complicated and badly written function so I don't want to bother anybody with it.
When I run it outside of the parallel loop, it works.

list_of_items has not been changed. joblib prints:

code:
[Parallel(n_jobs=18)]: Done  19 out of  47 | elapsed: 35.9min remaining: 52.8min
[Parallel(n_jobs=18)]: Done  29 out of  47 | elapsed: 36.7min remaining: 22.8min
[Parallel(n_jobs=18)]: Done  39 out of  47 | elapsed: 37.7min remaining:  7.7min
[Parallel(n_jobs=18)]: Done   9 out of  47 | elapsed: 53.0min remaining: 223.6min
[Parallel(n_jobs=18)]: Done   1 out of  47 | elapsed: 54.1min remaining: 2488.1min
[Parallel(n_jobs=18)]: Done  47 out of  47 | elapsed: 57.3min finished
Any ideas?

Cingulate
Oct 23, 2012


QuarkJets posted:

Have you tried reading the documentation?

https://pythonhosted.org/joblib/parallel.html#common-usage

According to that, the function that you want to parallelize needs to be a generator. It sounds like you need to modify your code

In my code, I tried to copy the docs. From what I understand, we write the argument as a generator expression - i.e., my (delayed(f)(item) for item in list_of_items), where f is a "regular" function that returns something.

FWIW it works if I set n_jobs to -2 ...

Actually, you pointing out that (delayed(f)(item) for item in list_of_items) is a generator finally cleared up the joblib syntax for me.

Cingulate
Oct 23, 2012


Nippashish posted:

It looks to me like you're using joblib correctly. Can you trigger the same weirdness with a different f? Does f interact with the outside world at all?
I don't see how f would have any side effects.
I do call random.seed(an_int) in there. This is pointless, as I'm just realizing, since an_int is set to the same value for all iterations by my use of partial() to wrap f, but that's the only thing I can see loving with globals.
I do access a bunch of globals, however.

But the only stuff that's changed is created inside the function, and the function returns to a new list.

I'm possibly missing something very obvious, I'm a terrible programmer.

I haven't found another f it does this for, though I haven't tried much.

Cingulate
Oct 23, 2012

Is there any reason for not using scikit-learn instead?

Cingulate
Oct 23, 2012

This works for the line you've posted:

code:
import pandas as pd
from collections import Counter

# load data into data frame
df = pd.read_csv("sa.csv", sep="\t")

# drop slow trials
df = df[df["RT-P"] > 100]

# count Trial Type for trials where Accuracy equals 2
c = Counter(df[df["Accuracy-T"] == 2]["Trial Type"])
print(c)

# drop Acc. 2 and 3
df = df[df["Accuracy-T"] != 2]
df = df[df["Accuracy-T"] != 3]
print(df)
(Don't do this by hand, use Pandas.)

Cingulate
Oct 23, 2012

Surgical Ontologist and I are both psychologists. Try installing pandas. Not just for this thing, but, as SO said, because it'll be near-essential for almost everything else you're gonna do once the data is in.
Doing science, especially psychology, in Python without pandas is like tying your good hand to your foot and programming with your left exclusively; it can be done, but WHY?

You can install continuum.io's Anaconda (the best way to get a scientific Python distribution) without administrative rights, and you can trivially install pandas through it. (However, psychopy doesn't work with all Anaconda distributions, sadly, so if you have to use the same Python for psychopy and the analysis, you may have to install pandas manually.)
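
That is, once Anaconda is installed, it's a one-liner:

code:
conda install pandas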

Cingulate
Oct 23, 2012


SurgicalOntologist posted:

Well, even with the effort you put in, it would probably be quicker to start over in pandas than learn to do it correctly in raw Python.
This.
Doing it in pandas is the lazy way.

I've so far not even bothered to spend any time explaining what pandas is because even with you having to google it, install it manually, learn to use it etc., you'd save massive amounts of time compared to doing it manually.

e:f,b

Even pd.read_csv(fname) is so much easier than 'with open(fname) as f: ... '

Cingulate
Oct 23, 2012


Poizen Jam posted:

Seems that way, I'll use Pandas then. Both Anaconda and Psychopy are based on Python 2.7 however, so I have at least one more question: Is there a decent statistics package compatible with either? 'Statistics' was introduced in Python 3.4 so it doesn't work and I need Standard Deviation calculations for the Van Selst procedure. Hilariously, it's currently calculated manually. But it also iterates over a currently opened list so now I have to double check that one too.
Scipy and Numpy, which you'll get with Anaconda, have basic stats.
Statsmodels has more complex models, such as LMMs, GEEs, ...
Scikit-learn has ML and multivariate stuff.

Seaborn is for nice, basic visualisation; it will bootstrap confidence intervals for its plots, but the actual inference tools (ANOVAs etc.) live in statsmodels.

You can also use IPython, especially the IPython notebook, and call R from Python.

You probably want to learn list comprehensions by the way. 90% of my Python work as a psychologist is list comprehensions and Pandas.
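
For example, a typical line of mine looks something like this (the column names are made up):

code:
# mean RT per condition, for hypothetical columns "condition" and "RT"
mean_rts = [df[df["condition"] == c]["RT"].mean() for c in df["condition"].unique()]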

Cingulate
Oct 23, 2012

Yes, pandas is for psychology stuff, not for "real" data science; I don't use pandas for ML or most neuroscience I do.
But if you want to analyse response times, pandas will usually suffice.

Cingulate
Oct 23, 2012


QuarkJets posted:

1) Adding strings together in Python is slow; every time that you add two strings, a new string gets created.
That's a bit outdated.
http://stackoverflow.com/questions/1349311/python-string-join-is-faster-than-but-whats-wrong-here

format is still superior for longer strings for reasons of clarity, formatting options, and scalability, but performance-wise, it's pretty much a toss-up.
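
If you want to check for yourself, something like this (unscientific; numbers vary by interpreter and version):

code:
import timeit

print(timeit.timeit('"".join(str(n) for n in range(100))', number=10000))
print(timeit.timeit('s = ""\nfor n in range(100): s += str(n)', number=10000))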

Cingulate
Oct 23, 2012


Dominoes posted:

Unfortunately, I've no idea what that should be.
Where your script says Exec=ipython3 ..., put the absolute path, I assume.

Cingulate
Oct 23, 2012


outlier posted:

Any opinions on Python plotting libraries? Like most everyone else, I used Matplotlib for years, but I never got along with it: seemed too much like Matlab, non-Pythonic, not very orthogonal. And now we have a whole host of libraries - seaborn, plotly, veusz, chaco, etc. I'd like something that was powerful but logical, not one where every type of plot seemed to have its own unique interface.
Really depends on what you want to be doing, doesn't it? Seaborn for example is great, but it's also really restricted (focused). Also, it's simply Matplotlib shortcuts. If what you want to do is covered by Seaborn, you're in luck, but if not, you're back to Matplotlib again.

Cingulate
Oct 23, 2012

Also note that Pandas has some basic plotting capabilities (MPL-based ofc).
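
E.g., a minimal sketch:

code:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
df.plot()  # a thin convenience wrapper around matplotlib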

Cingulate
Oct 23, 2012


outlier posted:

Call that an implied requirement. My main beef with matplotlib (while admitting it's a powerful package lots of people love) is that it just doesn't stick to my brain and feels like I'm swapping to another language when I do plotting. And should you look at the source to see how things work, you're confronted with a whole lot of functions using kwargs.

Doing an R call may be the way to go.
I think that's being a bit too down on matplotlib - you can still do reasonably pythonic stuff that would be rather different in Matlab. Like, iterating over axes to set their properties etc.
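
E.g., a sketch of what I mean:

code:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, sharex=True)
for ax, title in zip(axes.flat, ["a", "b", "c", "d"]):
    ax.set_title(title)
    ax.grid(True)
plt.show()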

Cingulate
Oct 23, 2012


dantheman650 posted:

Some posters recently recommended I learn about testing, and coincidentally the next Coursera course in the series I'm taking is testing focused. So far so good but one question is bugging me. I know how to run a test suite to test for situations where I know the expected result, but how do I test with randomness? For example, while making a game I initialize a game board and can easily test that the initialization works as expected. Then I add a number to a random square. Using my testing suite, there isn't an expected value for the board to compare to the computed value. Obviously this case is trivial and can easily be tested by just printing the board, but what about for more complex functions? Again, this might be a tough question to tackle in a forums post.
Fix the random seed?

random.seed(0) etc
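
A minimal sketch of the idea (the game logic is made up):

code:
import random
import unittest

def add_random_tile(board):
    # hypothetical game logic: put a 2 on a random empty square
    empty = [i for i, v in enumerate(board) if v == 0]
    board[random.choice(empty)] = 2
    return board

class BoardTest(unittest.TestCase):
    def test_tile_placement_is_deterministic_with_seed(self):
        random.seed(0)  # fixed seed -> the "random" square is reproducible
        a = add_random_tile([0] * 16)
        random.seed(0)
        b = add_random_tile([0] * 16)
        self.assertEqual(a, b)

if __name__ == "__main__":
    unittest.main()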

Cingulate
Oct 23, 2012


salisbury shake posted:

Stick a @dump_args on a bunch of methods/functions and you'll get a map of how poo poo flows through the program
Okay now I get it.

Cingulate
Oct 23, 2012


hooah posted:

As far as I'm aware, yes, it's better to separate your functions from your main program/function in all programming languages (although I only have experience with C/C++ and Python).
Are there any exceptions to this rule?

For example, one thing I sometimes do is

code:
def some_function(input):
    # pick an implementation once, based on the input ...
    if has_some_characteristic(input):
        def do_something(x):
            return whatever(x)
    else:
        def do_something(x):
            return something_else(x)

    # ... then use whichever version was defined
    x = do_things_with(input)
    return do_something(x)

Cingulate
Oct 23, 2012

On our (Linux) server, a bunch of people have their own Anaconda setups. I'm using Anaconda with Python 3.4, and it works flawlessly. A friend of mine, on his own account, also uses Anaconda and Python 3.4 (all up to date); however, he can't read files - he gets a decoding error (UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position ...: ordinal not in range(128)). All of the involved scripts, including encodings/ascii.py, which raises the error, are the same between my account and his. However, when we call sys.getfilesystemencoding() on his Python, we get 'ascii', and on mine, we get 'utf-8'.

I have no idea what's going on here. We've removed and reinstalled his Anaconda, to no avail. I don't get where any difference may be coming from. Any suggestions?

Edit: the problem seems to be independent of Anaconda. If I call the system's base python installation from my account, system file encoding is set to utf-8, but on his, it's ANSI-something.

Cingulate fucked around with this message at 10:06 on Jun 24, 2015

Cingulate
Oct 23, 2012


outlier posted:

I dimly remember something like this from webdev days. Try "export PYTHONIOENCODING=utf8" in the shell and look to see if he's got some strange setting in a bash file.
No, that didn't change a thing :(
There is nothing in .bashrc or .bash_profile either.
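
For what it's worth, on Linux sys.getfilesystemencoding() is derived from the locale (LANG / LC_ALL / LC_CTYPE), so that's what we're comparing between the two accounts now:

code:
import locale, sys

print(sys.getfilesystemencoding())  # 'utf-8' on my account, 'ascii' on his
print(locale.getlocale(), locale.getpreferredencoding())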

Cingulate
Oct 23, 2012

Things I do:
- few, usually medium-length, loops
- list and dict comprehensions - a lot, often nested
- functions small and large, even sometimes recursive
- a common pattern is to create a function or a dict for the purpose of using it in a list or dict comprehension (see the sketch below)

Things I know I don't do:
- classes and objects
- decorators

Things I don't know I don't do:
- ...

Most of what I do is scientific python, so usually numpy and scipy based, with a lot of pandas and matplotlib. I also operate with strings (e.g., cleaning up text input) a lot. So nothing exciting.
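
For the record, that "function/dict + comprehension" pattern looks like this (a trivial sketch with made-up names):

code:
# a small mapping, created just to drive a comprehension
recode = {"hit": 1, "miss": 0}
responses = ["hit", "miss", "hit"]
scores = [recode[r] for r in responses]  # [1, 0, 1]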

What's the next concept I should focus on learning and understanding so as to become slightly less awful at Python?

Cingulate
Oct 23, 2012

I actually do that - not much, but sometimes.
Though one thing I do a lot is
var = (foo if condition else bar)

That counts, right?

I do this whenever var is required and there is no clear default. If there is a clear default, or if var is usually set by an external input, I'll usually rather use an if clause for the rarer case.
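
In sketch form:

code:
condition = True

# no clear default: conditional expression
var = "foo" if condition else "bar"

# clear default, rare override: plain if clause
var = "default"
if not condition:
    var = "special"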

Cingulate
Oct 23, 2012


Dominoes posted:

Learn the functools, itertools, and collections modules.


Nippashish posted:

Learn how to make effective use of classes and objects. There's a reason they're so ubiquitous in modern computing.
Thanks!

Cingulate
Oct 23, 2012


Badly Jester posted:

I'm just now getting into the whole programming game, but since you're also a linguist: did you learn Python using the NLTK book? If so, would you recommend it?
I'm more of a neuro guy than a linguist really (the idea that I might actually be a linguist to the degree that I'm a functionalist just shows how dumb whoever bought that thing is ...). I started out using Python to handle things like experimental log files, which is string parsing, but not especially linguist-y in nature. And for basically every task I've ever considered using NLTK for, there are vastly superior individual packages. So I don't know much about NLTK, and never read the book.

I know a bunch of other people in the linguistics thread are also pretty good at Python; I think at least foiled is much better than I am, actually.

Cingulate
Oct 23, 2012

You can version control ipynbs. And I use all these packages, minus networkx, exclusively under 3.
You probably want to add seaborn btw.

Cingulate
Oct 23, 2012

NumPy question. Can I somehow do something like this:

code:
x_loc = np.asarray([1,2,3])
y_loc = np.asarray([1,5,9])
values = np.asarray([99, 11, 00])
target = np.zeros((100,100))

target[x_loc, y_loc] = values
so that it's equal to
code:
x_loc = np.asarray([1,2,3])
y_loc = np.asarray([1,5,9])
values = np.asarray([99, 11, 00])
target = np.zeros((100,100))

for x, y, v in zip(x_loc, y_loc, values):
    target[x,y] = v
but without doing loops?

Cingulate
Oct 23, 2012

Oh thanks.

Cingulate
Oct 23, 2012


BigRedDot posted:

Welp. https://news.ycombinator.com/item?id=9936295

If there's things you want to see in Anaconda, let us know, we are moving ahead full bore :)
Does that mean Travis Oliphant is rich now?

Cingulate
Oct 23, 2012

I really tried to think of some way to improve, but I couldn't come up with anything except "just do what you're doing now, but harder". Which is lame.
I'm really happy with conda.

I have something extra lame though. Make installing packages from binstar easier. Right now, you have to manually copy lines and poo poo, multiple lines. Just give me another flag for conda install to automatically check binstar if there is nothing in the main repo/my channels, and if something is found, prompt me asking if I want to install it.

Cingulate
Oct 23, 2012

IPython question: I have a self-installed module available, but I have no idea how I am making it available. It's not in my PYTHONPATH, it's not in my anaconda directory, it's not in the directory I'm calling Python from. Can I somehow check which other paths Python searches?

Cingulate
Oct 23, 2012


KICK BAMA KICK posted:

This might be way too simple but have you looked at sys.path?
It is in sys.path, thanks! So how did it get in there ...?

Cingulate
Oct 23, 2012


Fluue posted:

It's been a bit since I've used Python for CSV parsing, so this might be a stupid design issue I'm overlooking:

I have a CSV file formatted like this:
pre:
time    value    par    frame
1       111
1.1     h
1.12    e
1.13    y
Meaning that each character is on its own row in the value column. I want to iterate over the CSV file and make a human-readable format of the CSV output. I really only want to deal with the value column, so what I think I need to do is:

1. Iterate over the rows and build a list of the entries
2. Somehow concatenate the entries in the list to make a human-readable string

Is there some idiomatic way I'm overlooking? I'm thinking it's a simple iteration problem, but it seems like there is a more efficient way I'm not thinking about.
I'm not sure I get the question, but: do you want to make a single string out of the whole value column?

code:
import pandas as pd
df = pd.read_csv(csv_filename)
print("".join(df["value"]))

Sure, you could do it without pandas, but if you ever deal with csv, you're gonna use pandas eventually.

Cingulate
Oct 23, 2012


The Fool posted:

What does pandas do that the built-in csv module doesn't?
Is that literally the question you want to ask? Because if yes: a billion things. Pandas is the primary Python way of dealing with (empirical) data below the actual big-data scale. If the only thing you want to do with a csv is literally what Fluue asked about, Pandas would be overkill, but if you actually want to work with data (i.e., clean it, reorganize it, operate on it, analyse it, plot it ...), Pandas is the standard module for that.

(Also, I don't understand how I managed to miss SO's post ...)

Cingulate
Oct 23, 2012

I very often find myself doing something like this:

code:
def expensive_computation(input):
    more_stuff = do_something_simple(input)
    a_thing = do_something_complicated_with_stuff(input, more_stuff)
    return more_stuff, a_thing

list_of_tuples = [expensive_computation(thing) for thing in things]
things = [thing for _, thing in list_of_tuples]
stuff = [stuff for stuff, _ in list_of_tuples]
I don't suppose I could somehow have something like
code:
things, stuff = [expensive_computation(thing) for thing in things]
?

Cingulate
Oct 23, 2012

Yep. Thanks.
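
(For the record, presumably the suggestion was the zip(*...) idiom:)

code:
pairs = [(1, "a"), (2, "b"), (3, "c")]  # stand-in for the list of 2-tuples
stuff, things = zip(*pairs)             # stuff == (1, 2, 3), things == ("a", "b", "c")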

Cingulate
Oct 23, 2012

... and if you're doing data science, you should proceed straight to the IPython Notebook.

Cingulate
Oct 23, 2012

Given a pandas series of length 1, I can do this
code:
list(the_series)[0]
to get just the value. However, surely there is a more pythonic way?

For what it's worth, the series originated from a DataFrame from which I have selected one specific cell.

Cingulate
Oct 23, 2012


Viking_Helmet posted:

How are you getting a series back from a single dataframe cell? Selecting a cell by index or iloc will return the value of that cell - are you doing something else?
In this case, I've first extracted a single row of the data frame to do various things with it. Then I select one column of that single-row frame. Though I guess you're pointing out the solution - using .loc or .iloc in the first place.

This has come up a few times now though.
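
For the record, I think the usual spellings for pulling out the single value are:

code:
import pandas as pd

the_series = pd.Series([42])
value = the_series.iloc[0]  # positional: first element
value = the_series.item()   # raises unless the series has exactly one element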

The actual code in this case has completely changed in the meantime, however; it now looks like this, and is still really bad and slow:

code:
keys = ["date", "type", "thing"]  # a list, so .loc[:, keys] selects columns

outs = []
for title in df.title.unique():
    d_ = df[df["title"] == title]
    if len(d_.type.unique()) > 1:
        for name in d_["name"]:
            d__ = d_[d_.name == name]
            d = d__.loc[:, keys]
            other_date = (d_[d_["type"] == "foo"]["date"].mean()
                            if "bar" in d__["type"].values else
                            d_[d_["type"] == "bar"]["date"].mean())
            d["date_diff"] = (d__["date"] - other_date).iloc[0]
            outs.append(d)
df2 = pd.concat(outs)

Cingulate
Oct 23, 2012


Zero Gravitas posted:

I've got an idea for a thing that involves a lot of things that I've never used before and thought I would ask for some help.

I'm working at a company that uses a CFD program that largely operates as a black box, not giving any information until the run is finished. Poking around, though, I found that while it is running, the program generates and stores a lot of data in a bunch of Notepad-readable files: delimited by column, with headers in text and values for various properties of the flow. I think this is ripe for a program/script that reads each file and plots each (or a selection of) parameters as they are generated, using Matplotlib. This way we can see if the simulation is worth continuing early on instead of having to wait until the end.

1) The files are not csv files, but .in and .out. Will a module like csv happily take them anyway if they're in a clearly delimited format?

2) What else could I do to read this file?

3) I'm used to working with arrays of numbers in numpy but these files will have a couple of header rows (column number, Parameter name, units) and then a list of floats in scientific notation. Is it ok to use text and floats in an array or am I better off using a different way of storing these?

4) The CFD program updates these files continuously - how would I make my script/program check the files for updates? I think it's pretty simple to check the files every ten seconds or so and generate new images by wrapping the entire thing in a timed loop, but computationally, instead of re-reading and re-plotting (number of files) x 15 variables x 40000 iterations every ten seconds, it's probably easier to simply append the new values to the arrays (or other structure) where they are stored.

Any pointers?
I'm really a one-trick pony, but: you probably want to use Pandas. It has a good read_csv function and comfortable plotting functions.

There's also an append method.
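
A minimal sketch of what I mean (filename, header layout and column name are all guesses):

code:
import time

import matplotlib.pyplot as plt
import pandas as pd

while True:
    # whitespace-delimited .out file, skipping hypothetical header rows
    df = pd.read_csv("run.out", delim_whitespace=True, skiprows=1)
    df["residual"].plot()    # hypothetical column name
    plt.savefig("residual.png")
    plt.close("all")         # don't accumulate figures
    time.sleep(10)           # crude polling, per your point 4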


Cingulate
Oct 23, 2012


BigRedDot posted:

Edit: FWIW I think this is probably a common confusion, given that the spelling of default arguments in function definitions is basically the same as the spelling of passing keyword arguments in function invocations. But they are two separate, different things.
What the... I never even knew this.
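
For example:

code:
# in a def, "y=0" declares a default value ...
def f(x, y=0):
    return x + y

# ... in a call, "y=2" passes a keyword argument: same spelling, different thing
f(1, y=2)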
