NinpoEspiritoSanto
Oct 22, 2013




Work through Think Python or something


first move tengen
Dec 2, 2011
You're right, this does seem dumb, learning with something outdated lol. Glad I hadn't gotten too far into it.

salisbury shake
Dec 27, 2011
I wrote a bunch of async scrapers that produce results via async generators, and I want to consume the results of the generators as soon as they're produced. Basically, I want something like itertools.chain() or asyncio.as_completed() but for async generators.

Python code:
from typing import AsyncIterable
from asyncio import sleep
from random import random


async def gen() -> AsyncIterable[float]:
  while True:
    time = random()
    await sleep(time)
    yield time


async def main():
  async_gens = [gen() for _ in range(10)]

  # async_chain() is the hypothetical function I'm after
  async for result in async_chain(async_gens):
    ...  # do stuff with result
Are there any libraries or parts of the standard library that implement something like async_chain() in the code example above?

NinpoEspiritoSanto
Oct 22, 2013




Look at trio and its task nurseries. I've a scraper running at work that returns results as they complete.
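Roughly the shape of it, as a minimal sketch (not my actual code; assumes your generators are trio-compatible, i.e. trio.sleep rather than asyncio.sleep inside them):
Python code:
import trio


async def pump(agen, send_chan):
    # forward one async generator's items into the shared channel
    async with send_chan:
        async for item in agen:
            await send_chan.send(item)


async def merge_consume(async_gens):
    send, receive = trio.open_memory_channel(0)
    async with trio.open_nursery() as nursery:
        async with send:
            for agen in async_gens:
                nursery.start_soon(pump, agen, send.clone())
        # each pump holds a clone of the channel, so the loop below
        # ends once the last generator is exhausted
        async for result in receive:
            print(result)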

mightygerm
Jun 29, 2002



What's the current standard test framework for open source projects? I'm starting to work on a project and I'm unsure whether to use tox, nose, pytest, etc.

NinpoEspiritoSanto
Oct 22, 2013




pytest/tox

Wallet
Jun 19, 2006

Bundy posted:

pytest/tox

I like pytest, personally, but not for any particularly good reason.

The March Hare
Oct 15, 2006

Je rêve d'un
Wayne's World 3
Buglord
echoing liking pytest; the assertions feel much more natural. I've had annoying experiences with tox, but that could just be me

bigperm
Jul 10, 2001
some obscure reference
I'm working on the Flask Tutorial where you make a little blog.

I've got to the part where you initialize the database and I am getting an error and I have no idea what I'm missing. At the bottom of this page it says to just run the command flask init-db and move on to the next page. When I run it I get a syntax error.

code:
# __init__.py

import os

from flask import Flask

def create_app(test_config=None):
    # create and configure the app
    app = Flask(__name__, instance_relative_config=True)
    app.config.from_mapping(
        SECRET_KEY='dev',
        DATABASE=os.path.join(app.instance_path, 'flaskr_sqlite'),
    )

    if test_config is None:
        # Load the instance config, if it exists, when not testing
        app.config.from_pyfile('config.py', silent=True)
    else:
        # Load the test config if passed in
        app.config.from_mapping(test_config)
    
    # ensure the instance folder exists
    try:
        os.makedirs(app.instance_path)
    except OSError:
        pass

    # a simple page that says hello
    @app.route('/hello')
    def hello():
        return 'Hello, World!'
    
    from . import db
    db.init_app(app)

    return app
code:
# db.py

import sqlite3

import click
from flask import current_app, g
from flask.cli import with_appcontext


def get_db():
    if 'db' not in g:
        g.db = sqlite3.connect(
            current_app.config['DATABASE'],
            detect_types=sqlite3.PARSE_DECLTYPES
        )
        g.db.row_factory = sqlite3.Row

    return g.db


def close_db(e=None):
    db = g.pop('db', None)

    if db is not None:
        db.close()

def init_db():
    db = get_db()

    with current_app.open_resource('schema.sql') as f:
        db.executescript(f.read().decode('utf8'))


@click.command('init-db')
@with_appcontext
def init_db_command():
    """Clear the existing data and create new tables."""
    init_db()
    click.echo('Initialized the database.')

def init_app(app):
    app.teardown_appcontext(close_db)
    app.cli.add_command(init_db_command)
code:
-- schema.sql

DROP TABLE IF EXISTS user;
DROP TABLE IF EXISTS post;

CREATE TABLE user (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT UNIQUE NOT NULL,
    password TEXT NOT NULL
);

CREATE TABLE post (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    author_id INTEGER NOT NULL,
    created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    title TEXT NOT NULL.
    body TEXT NOT NULL,
    FOREIGN KEY (author_id) REFERENCES user (id)
);
code:
(venv) adam@DESKTOP-89P9OD6:/mnt/c/Users/ringe/Documents/coding/flask_tutorial$ flask init-db
Traceback (most recent call last):
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/bin/flask", line 8, in <module>
    sys.exit(main())
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/flask/cli.py", line 967, in main
    cli.main(args=sys.argv[1:], prog_name="python -m flask" if as_module else None)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/flask/cli.py", line 586, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/flask/cli.py", line 426, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/flaskr/db.py", line 40, in init_db_command
    init_db()
  File "/mnt/c/Users/ringe/Documents/coding/flask_tutorial/flaskr/db.py", line 33, in init_db
    db.executescript(f.read().decode('utf8'))
sqlite3.OperationalError: near ".": syntax error
I know this might be a big ask but this is where I would traditionally give up in frustration and would appreciate it if anyone could point out the tiny error or misconfiguration that might let me push through here.

KICK BAMA KICK
Mar 2, 2009

bigperm posted:

I know this might be a big ask but this is where I would traditionally give up in frustration and would appreciate it if anyone could point out the tiny error or misconfiguration that might let me push through here.
From schema.sql:
code:
title TEXT NOT NULL.
period instead of a comma?

Hollow Talk
Feb 2, 2014
pytest fixtures are good and my friend.
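A minimal sketch of the shape, for anyone who hasn't used them (sqlite3 in-memory purely for illustration):
Python code:
import sqlite3

import pytest


@pytest.fixture
def db():
    # setup: runs before each test that asks for `db`
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE user (name TEXT)')
    yield conn
    # teardown: runs after the test, pass or fail
    conn.close()


def test_insert(db):
    db.execute("INSERT INTO user VALUES ('goon')")
    assert db.execute('SELECT count(*) FROM user').fetchone() == (1,)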

bigperm
Jul 10, 2001
some obscure reference

KICK BAMA KICK posted:

From schema.sql:
code:
title TEXT NOT NULL.
period instead of a comma?

code:
(venv) adam@DESKTOP-89P9OD6:/mnt/c/Users/ringe/Documents/coding/flask_tutorial$ flask init-db
Initialized the database.
That was it!

Thank you so much.

Wallet
Jun 19, 2006

Hollow Talk posted:

pytest fixtures are good and my friend.

This is a good reason. The most recent thing I've been working on is an API and pytest has made it straightforward to set up fixtures to handle all of the fiddly bits so I can e.g. test refreshing an auth token like this:
Python code:
def test_refresh_token(test_broker):
    user = test_broker.add_user()
    resp = user.verify()
    first_token = resp.json['session_token']

    # Refreshing returns a new token
    resp = user.refresh_token(log_level='info')
    assert isinstance(resp.json['session_token'], str)
    assert resp.json['session_token'] != first_token
    assert resp.status_code == 200

    # Old token doesn't work
    resp = user.refresh_token(session_token=first_token, log_level='info')
    assert resp.status_code == 403

    # New token does work
    resp = user.refresh_token(log_level='info')
    assert resp.status_code == 200
Instead of writing hundreds of requests manually.

salisbury shake
Dec 27, 2011

Bundy posted:

Look at trio and its task nurseries. I've a scraper running at work that returns results as they complete.

Thanks. Turns out that aiostream.stream.merge() does exactly what I wanted.
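For posterity, the shape of it (a sketch against aiostream's documented streamer context, reusing gen() from my earlier post):
Python code:
import asyncio

from aiostream import stream


async def main():
    async_gens = [gen() for _ in range(10)]
    merged = stream.merge(*async_gens)
    async with merged.stream() as streamer:
        async for result in streamer:
            print(result)  # arrives as soon as any generator yields


asyncio.run(main())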

duck monster
Dec 15, 2004

I kinda wish doctests were more common. They really do solve two problems at once (in-depth code documentation and unit-ish testing)
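e.g. a minimal one (runnable with python -m doctest -v module.py, or via the testmod() hook below):
Python code:
def clamp(value, low, high):
    """Clamp value to the inclusive range [low, high].

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(42, 0, 10)
    10
    """
    return max(low, min(high, value))


if __name__ == '__main__':
    import doctest
    doctest.testmod()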

duck monster fucked around with this message at 02:28 on Mar 19, 2021

Loezi
Dec 18, 2012

Never buy the cheap stuff
I have a list of triples, each containing a GroupID (str), a payload (str, can be ignored for the purposes of this discussion) and a priority (float, higher = more important). The size of the list is unknown and the order of the triples in the list is effectively random.

I need to produce a new list, at most MAX_LEN in size, containing the MAX_LEN most important triples. They need to be ordered s.t. the triples are grouped together by GroupID, with more important items first within groups, and the groups further ordered by the maximal within-group priority. So if we assume MAX_LEN = 4, for the input
code:
data: List[Tuple[str, Any, float]] = [
    ("a", ..., 1.5),
    ("b", ..., 0.9),
    ("c", ..., 0.6),
    ("c", ..., 1.2),
    ("b", ..., 10.2),
    ("a", ..., 0.01),
]
I'd expect the output to be
code:
[
    ("b", ..., 10.2)
    ("b", ..., 0.9),
    ("a", ...., 1.5),
    ("c", ...., 1.2),
]
I came up with the following solution
code:
import itertools
from typing import Any, List, Tuple


def order(data: List[Tuple[str, Any, float]]) -> List[Tuple[str, Any, float]]:
    # Limit data to top MAX_LEN data points
    data = sorted(data, key=lambda datapoint: datapoint[2], reverse=True)[:MAX_LEN]
    
    # Group data by groupID (groupby expects sorted input)
    data = sorted(data, key=lambda output: output[0])
    data = (group for _, group in itertools.groupby(data, key=lambda x: x[0]))
    
    # Sort each group by score
    data = (sorted(group, key=lambda datapoint: datapoint[2], reverse=True) for group in data)
    
    # Sort groups by max group score
    data = sorted(data, key=lambda group: max(datapoint[2] for datapoint in group), reverse=True)
    
    # Flatten groups to list
    data = list(itertools.chain.from_iterable(data))
    
    return data
but the code seems fairly long for the thing I'm trying to do. Am I missing some obvious stable sorting trick or piece of the standard library?

QuarkJets
Sep 8, 2008

I think that first line is out of place; if you immediately truncate then you've chosen a maximum number of entries in arbitrary order and then you sorted them, but the problem statement makes it sound like you want to choose a maximum number of entries from the fully sorted list. So truncating to MAX_LEN should be the last step, not the first

And are you sure that example output is right? As-written, I thought that you wanted to sort by GroupID first, and then sort entries with the same GroupID by priority; but your output takes the row with the highest priority and then appends all of the rows with the same GroupID before moving on to the next highest priority, and so on. Two very different interpretations! In other words this is what I would have thought the output should be, based on how you wrote the problem:

code:
[
    ("a", ..., 1.5),
    ("a", ..., 0.01),
    ("b", ..., 10.2),
    ("b", ..., 0.9),
]
Are you allowed to use pandas? You can do multi-column sorting on a dataframe if you just want to sort by (Group ID, priority).
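e.g. a quick sketch (column names hypothetical; data and MAX_LEN from your post):
Python code:
import pandas as pd

df = pd.DataFrame(data, columns=['group_id', 'payload', 'priority'])
# GroupID ascending, then priority descending within each group
df = df.sort_values(['group_id', 'priority'], ascending=[True, False])
top = df.head(MAX_LEN)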

Loezi
Dec 18, 2012

Never buy the cheap stuff

QuarkJets posted:

I think that first line is out of place; if you immediately truncate then you've chosen a maximum number of entries in arbitrary order and then you sorted them, but the problem statement makes it sound like you want to choose a maximum number of entries from the fully sorted list. So truncating to MAX_LEN should be the last step, not the first

And are you sure that example output is right? As-written, I thought that you wanted to sort by GroupID first, and then sort entries with the same GroupID by priority; but your output takes the row with the highest priority and then appends all of the rows with the same GroupID before moving on to the next highest priority, and so on. Two very different interpretations! In other words this is what I would have thought the output should be, based on how you wrote the problem:

code:
[
    ("a", ..., 1.5),
    ("a", ..., 0.01),
    ("b", ..., 10.2),
    ("b", ..., 0.9),
]
Are you allowed to use pandas? You can do multi-column sorting on a dataframe if you just want to sort by (Group ID, priority).


Truncating first is the correct behaviour: the description was written by me for that post, trying to verbalize the behaviour I was showing with the examples. As another example, this
code:
[
    ("a", ..., 1.5),
    ("b", ..., 0.9),
    ("c", ..., 4.6),
    ("c", ..., 1.0),
    ("b", ..., 10.2),
    ("a", ..., 0.01),
]
should become this
code:
[
    ("b", ..., 10.2), 
    ("c", ..., 4.6), 
    ("c", ..., 1.0), 
    ("a", ..., 1.5)
]
Pandas is fine, but the payloads can be fairly large text strings, in case that matters.
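For reference, a shorter pure-stdlib sketch with those truncate-first semantics (heapq.nlargest plus a single composite-key sort; MAX_LEN as before):
Python code:
import heapq
from typing import Any, List, Tuple


def order(data: List[Tuple[str, Any, float]]) -> List[Tuple[str, Any, float]]:
    # keep only the MAX_LEN highest-priority triples
    top = heapq.nlargest(MAX_LEN, data, key=lambda t: t[2])

    # max priority per group among the survivors
    group_max = {}
    for gid, _, prio in top:
        group_max[gid] = max(group_max.get(gid, prio), prio)

    # groups by their max priority (desc), then group id, then priority (desc)
    return sorted(top, key=lambda t: (-group_max[t[0]], t[0], -t[2]))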

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!
Has anybody here gotten on to the Python machine learning train? I started diving in and was surprised. For what you hear about it, I kind of figured I'd see some more magic, but it's been kind of disappointing.

I've been particularly looking at neural networks (multi-layer perceptrons) to do the classic XOR problem. With scikit-learn, the API is kind of what I would have expected at this point for consolidation. It was pretty simple. The problem was the lack of convergence. I had to find a very specific recipe on Stack Overflow to get a success rate around 96% (I'd have the whole experiment rerun 100 times). When I was hand-writing out this crap, I could count on nailing it every time. I think I should only need two neurons in the hidden layer, but the recipe that worked needed three. Both two and four+ neurons in the hidden layer made things worse.

I then looked at what this was like with pytorch and ran into 100+ LOC stuff that had to do a lot of handcrafting to even set it up. I was surprised by the amount of coding. Also, none of it actually would run with whatever version I got, and this code was all of 2 years old. So that API looks to be in a huge state of flux. Even here, I was reading posts about inconsistencies in solving for XOR.

Right now I'm trying to just get a pulse on the nature of all this. It looks like there's a lot of options and magic sauce and people just kind of poke it with a stick until it appears to work--if it ever even does. I'd expect some failure in more complicated scenarios but I'm even talking about the XOR problem as understood from the 1950s and solved in the 1990s. I'm used to scientific application culture due to working with tons of electrical engineers, and I get real strong vibes of that in all of this. It smells like a lot of this code doesn't do what the people who wrote it think it does. I guess you get away with that in AI sometimes when you can converge on a solution; your own bugs just cause it to converge slower, and you wouldn't necessarily know there's an actual problem.

Edit: Right now I'm backing up and sketching an old-school multi-layer perceptron to see if I can consistently solve XOR and verify if I'm not going crazy insisting on a 100% convergence rate. I had moved on to neuroevolution when I was last looking at this stuff, and that definitely had a 100% success rate since one network in the "herd" would eventually come up with the right thing and the wrong ones would die.

QuarkJets
Sep 8, 2008

Most machine learning experts are treating it like an art more than a science but there are also like a billion research papers published every year actually trying to explain why some recipes work better than others, like for instance a few years ago there were several articles explaining why dropout is a stupid technique for lazy idiots who don't know any better but it's not like people stopped using dropout

I serve as a reviewer for journal articles and as far as I can tell the most common way to develop a successful new network is to start from one that already works well

OnceIWasAnOstrich
Jul 22, 2006

I've recently done quite a bit with ML at various levels in Python. The Scikit-Learn version is very simple, easy to use, but there is a lot buried in the many, many arguments to some methods and the defaults are very frequently going to be unhelpful to you. You might not have the same choice in optimization methods you might use otherwise, and the API isn't really designed for internal-to-the-model modularity like that, so it is up to whoever wrote that model to give you all the options, or you need to customize the model yourself (lots of code).

PyTorch is very much for creating models and methods, not really for use as a simple API for using established/implemented methods like scikit. That said, the type/size of models you want PyTorch for tend to have a lot of extra complication in setting up massive parallel training that isn't really conducive to something super-simple like the Scikit API, although stuff like pytorch-lightning gets close. I've done a lot with sequence learning using models that use the HuggingFace-style API for Transformers models implemented in either PyTorch or Tensorflow. That API is a lot closer to what you might expect for direct use, although it is clearly evolving rapidly as more and more methods get churned out, gradually expanding the range of stuff the coordinated API might need to do.

salisbury shake
Dec 27, 2011
Anyone know if weakref.finalize objects will be called when the interpreter receives shutdown signals like SIGTERM and others?

Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
I don't have much to say but I wanted to pop in and say I had to strip duplicates out of a 3k-line Excel sheet of addresses, with some weird rear end parameters, and I ended up using pandas to do it. I was amazed at the power of that.

Dominoes
Sep 20, 2007

Rocko Bonaparte posted:

Has anybody here gotten on to the Python machine learning train?
It's a hype trap. Like data science a few years back. Might be a good resume pad if you're looking for a job or VC funds.

If you're trying to solve a practical problem, consider a decision tree.

QuarkJets
Sep 8, 2008

"Trap" maybe isn't the right word when machine learning experts are in massive demand and are actually solving unique problems that haven't been easily solvable by classical methods. There's definitely a lot of hype present, but it's also just a really useful tool, like learning how to use Docker or CUDA

e: And I want to clarify that there are absolutely a ton of grifters that are taking advantage of the hype and trying to apply ML to everything while pretending like it's magic, but that advice is more relevant to project managers than developers

QuarkJets fucked around with this message at 03:53 on Mar 10, 2021

salisbury shake
Dec 27, 2011

salisbury shake posted:

Anyone know if weakref.finalize objects will be called when the interpreter receives shutdown signals like SIGTERM and others?

Turns out they're only called upon SIGINT or during an otherwise clean shutdown.
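A quick sketch of the behavior (weakref.finalize is effectively an atexit hook if the object is still alive, so a clean exit or SIGINT's KeyboardInterrupt path runs it; a default-handled SIGTERM kills the process before it fires):
Python code:
import weakref


class Resource:
    pass


res = Resource()
fin = weakref.finalize(res, print, 'finalizer ran')

# fires immediately here, when the referent is collected...
del res
# ...otherwise it would fire at interpreter shutdown via atexit,
# which a clean exit or SIGINT reaches but a default SIGTERM does not.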

Zoracle Zed
Jul 10, 2001
A couple of my favorite libraries implement nice _repr_html_ methods so objects print rich formatted representations in Jupyter Notebook. Whenever I've looked into their source code, there's a ton of artisanal hand-crafted html & css formatting. Anyone ever seen a library for doing that kind of thing automatically that handles composition? Like if Fart and Poop both implement _repr_html_, a Butt object that has fart and poop variables should just shove their _repr_html_s into a table or tree or something.
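Absent a library, the hand-rolled version of that composition looks something like this sketch (falls back to an escaped repr() for children without _repr_html_):
Python code:
import html


def html_repr(obj):
    # use the object's rich repr if it has one, else escape its plain repr()
    method = getattr(obj, '_repr_html_', None)
    return method() if callable(method) else html.escape(repr(obj))


class Butt:
    def __init__(self, fart, poop):
        self.fart = fart
        self.poop = poop

    def _repr_html_(self):
        rows = ''.join(
            f'<tr><th>{name}</th><td>{html_repr(value)}</td></tr>'
            for name, value in vars(self).items()
        )
        return f'<table>{rows}</table>'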

Wallet
Jun 19, 2006

Zoracle Zed posted:

A couple of my favorite libraries implement nice _repr_html_ methods so objects print rich formatted representations in Jupyter Notebook. Whenever I've looked into their source code, there's a ton of artisanal hand-crafted html & css formatting. Anyone ever seen a library for doing that kind of thing automatically that handles composition? Like if Fart and Poop both implement _repr_html_, a Butt object that has fart and poop variables should just shove their _repr_html_s into a table or tree or something.

I've used Pygments for highlighting but I don't remember how much it can handle as far as layout goes.

Wallet fucked around with this message at 14:13 on Mar 11, 2021

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!

QuarkJets posted:

Most machine learning experts are treating it like an art more than a science but there are also like a billion research papers published every year actually trying to explain why some recipes work better than others, like for instance a few years ago there were several articles explaining why dropout is a stupid technique for lazy idiots who don't know any better but it's not like people stopped using dropout

I can appreciate this in any rapidly expanding discipline where the frontier is inevitably going to be beyond what you're currently looking at from general discussions or books. I'm still perplexed by the goofiness of doing basic stuff. I'm posting some screwing around I did for giggles at the end here.

QuarkJets posted:

I serve as a reviewer for journal articles and as far as I can tell the most common way to develop a successful new network is to start from one that already works well
Heh so it's kind of like Makefiles.

OnceIWasAnOstrich posted:

I've recently done quite a bit with ML at various levels in Python. The Scikit-Learn version is very simple, easy to use, but there is a lot buried in the many, many arguments to some methods and the defaults are very frequently going to be unhelpful to you.

...

PyTorch is very much for creating models and methods, not really for use as a simple API for using established/implemented methods like scikit. That said, the type/size of models you want PyTorch for tend to have a lot of extra complication in setting up massive parallel training that isn't really conducive to something super-simple like the Scikit API, although stuff like pytorch-lightning gets close.

I should have considered that when I was searching for the package to install and found a whole bunch of "pytorch-*" stuff. Is PyTorch basically a base framework at this point? Do I work with it through a different library?

Dominoes posted:

It's a hype trap. Like data science a few years back. Might be a good resume pad if you're looking for a job or VC funds.

If you're trying to solve a practical problem, consider a decision tree.

Somebody called you out on this but it doesn't mean you're wrong either. I'm doing some expeditionary stuff at work because it's politically vogue to try to hit some of our problems with some machine learning. When I heard about it, I first thought about what exactly they're trying to accomplish in even the most basic terms of inputs and outputs. Nobody really knows and that's a warning sign. But since I was the idiot that tried to use neuroevolution for stock market stuff a decade ago, I'm wading through it myself. I suspect there are machine learning solutions to these particular problems; my general take on whether it's possible is if I can model a situation and "see" a solution but it's particularly difficult to outright code the solution in a contemporary way. However, if I can code an assessment of success then I have a fitness function and "I'm halfway there" (fighting non-linearity in the model sounds like what will take the remaining 50% of effort until it expands to 99% of the effort...).

Going in another direction: are you implying a decision tree would not be machine learning? Or was that just a "X instead of Y" kind of answer? I agree with the idea but I'm testing the whole ecosystem using a domain I've learned in the past. So since I'm cozy with perceptrons, I figured I'd assess the different libraries based on my previous experience with them. This is showing me how goofy this stuff is and implies that if I naively go off and use something like the decision trees that I'm going to be wading into some muck that doesn't mesh up evenly with how it's taught as theory.



So I did more screwing around with scikit-learn's MLPClassifier to try to manually parameterize a perceptron to solve XOR. I did a little side project where I wrote my own network using the given weights and got a successful XOR prediction. When I extend that to the MLPClassifier, I get a different prediction. I'm not doing any kind of training here; I'm manually setting the weights and thresholds/biases/intercepts.

I tend to look at neuron activation as a threshold instead of an intercept or bias so I wonder if I'm interpreting the intercept_ attribute incorrectly. The coefs_ fields really do look like regular old weights across different layers; they adjust based on hidden_layer_sizes and the data I fit. I run the fit() method to get the initial topology and then overwrite it. No luck:

code:
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# relu activation assuming: output = 1.0 if output >= self.threshold else 0.0
classifier = MLPClassifier(hidden_layer_sizes=(2), activation='relu')

# Weights courtesy of
# http://toritris.weebly.com/perceptron-5-xor-how--why-neurons-work-together.html
# http://toritris.weebly.com/uploads/1/4/1/3/14134854/4959601_orig.jpg
# Works fine for a hand-coded network
model = classifier.fit(X, y)
model.coefs_[0][0][0] = 0.6
model.coefs_[0][0][1] = 0.6
model.coefs_[0][1][0] = 1.1
model.coefs_[0][1][1] = 1.1

model.coefs_[1][0] = -2.0
model.coefs_[1][1] = 1.1

# Assuming a threshold of 1.0 for relu. I tried 1.0, -1.0, 0.0, 0.5, and 0.9 with varying results but no actual success.
model.intercepts_[0][0] = 1.0
model.intercepts_[0][1] = 1.0
model.intercepts_[1][0] = 1.0

print(model.predict(X))
print(model.score(X, y))

# Output
# [0 0 0 0]
# 0.5
By the way, is there a particularly good Discord or something to grit through stuff like this? The scikit-learn project didn't really endorse anything other than dropping a line on StackOverflow with a specific tag. I might do that because this isn't urgent or anything.

Dominoes
Sep 20, 2007

QuarkJets posted:

"Trap" maybe isn't the right word when machine learning experts are in massive demand and are actually solving unique problems that haven't been easily solvable by classical methods. There's definitely a lot of hype present, but it's also just a really useful tool, like learning how to use Docker or CUDA

e: And I want to clarify that there are absolutely a ton of grifters that are taking advantage of the hype and trying to apply ML to everything while pretending like it's magic, but that advice is more relevant to project managers than developers

Rocko Bonaparte posted:

Somebody called you out on this but it doesn't mean you're wrong either. I'm doing some expeditionary stuff at work because it's politically vogue to try to hit some of our problems with some machine learning. When I heard about it, I first thought about what exactly they're trying to accomplish in even the most basic terms of inputs and outputs. Nobody really knows and that's a warning sign. But since I was the idiot that tried to use neuroevolution for stock market stuff a decade ago, I'm wading through it myself. I suspect there are machine learning solutions to these particular problems; my general take on whether it's possible is if I can model a situation and "see" a solution but it's particularly difficult to outright code the solution in a contemporary way. However, if I can code an assessment of success then I have a fitness function and "I'm halfway there" (fighting non-linearity in the model sounds like what will take the remaining 50% of effort until it expands to 99% of the effort...).

Going in another direction: are you implying a decision tree would not be machine learning? Or was that just a "X instead of Y" kind of answer? I agree with the idea but I'm testing the whole ecosystem using a domain I've learned in the past. So since I'm cozy with perceptrons, I figured I'd assess the different libraries based on my previous experience with them. This is showing me how goofy this stuff is and implies that if I naively go off and use something like the decision trees that I'm going to be wading into some muck that doesn't mesh up evenly with how it's taught as theory.

We're on the same page. I'm optimistic ML and AI techniques will transform our world, and we're gradually getting there. In their current form, I'm not convinced ANNs, SVMs etc are good fits for many problems. There are exceptions, like image recognition. There's enough noise today that I'd guess a random mention of ML or AI (especially a press release, job description, business plan, resume etc) is full of it; I've updated my priors.

You can classify a decision tree as ML if you want, or not. It's an easy-to-grasp but powerful tool for creating complex behavior.

If you'd like to get into ML, carefully consider why first. Would this have been appealing 5 years ago?

Dominoes fucked around with this message at 00:00 on Mar 12, 2021

OnceIWasAnOstrich
Jul 22, 2006

Rocko Bonaparte posted:

I tend to look at neuron activation as a threshold instead of an intercept or bias so I wonder if I'm interpreting the intercept_ attribute incorrectly. The coefs_ fields really do look like regular old weights across different layers; they adjust based on hidden_layer_sizes and the data I fit. I run the fit() method to get the initial topology and then overwrite it.

A multilayer perceptron works differently than the example you have. In your example, neurons output a binary 0/1 value directly. In the MLP scheme that Scikit-Learn uses, you have a non-linear activation function that maps to a specific range for each neuron. To implement the classifier, the MLPClassifier takes the output of the final layer, which I believe will be of shape [num_samples, num_classes], and uses the softmax function to normalize that output into a probability distribution over your classes for each sample. The classifier then outputs the class with the highest probability.
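Concretely, the softmax step is just this (a numpy sketch, not scikit-learn's literal internals):
Python code:
import numpy as np


def softmax(z):
    # subtract the row max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


logits = np.array([[0.2, 1.5],   # shape [num_samples, num_classes]
                   [3.0, -1.0]])
probs = softmax(logits)          # each row sums to 1
preds = probs.argmax(axis=1)     # class with the highest probability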

Dominoes posted:

You can classify a decision tree as ML if you want, or not. It's an easy-to-grasp but powerful tool for creating complex behavior.

I'm curious, would you maybe draw a line at a non-ML decision tree being human-interpretable? I definitely agree that there is obviously a ton of hype and my personal hand-wavy boundary is that ML models are models where there is no attempt to make the model structure reflect how the modeled phenomenon actually works, and the point isn't to understand the phenomenon through the model, just to make a good [insert goal here]. Clearly decision trees in certain incarnations are ML, especially ensemble tree methods. One of the more powerful and "successful" big machine learning models isn't an ANN but is instead a complicated method of creating ensembles of decision trees.

OnceIWasAnOstrich fucked around with this message at 00:09 on Mar 12, 2021

Dominoes
Sep 20, 2007

I don't have a reason to draw a line; categorization is a tool you can apply to a problem. Maybe you have a reason to draw a line for DTs as ML or not.

In the same sense, choose a tool suitable for the problem you're working with. Maybe it's something categorized as ML. I reject choosing ML when it's the wrong tool.

OnceIWasAnOstrich
Jul 22, 2006

Dominoes posted:

I don't have a reason to draw a line; categorization is a tool you can apply to a problem. Maybe you have a reason to draw a line for DTs as ML or not.

In the same sense, choose a tool suitable for the problem you're working with. Maybe it's something categorized as ML. I reject choosing ML when it's the wrong tool.

Sorry, I didn't mean to make you draw a line. My point was that, personally, I would never say a type of model is or isn't ML. From my perspective ML is more of the approach or philosophy to problem solving. To me, a linear regression could be (and is) used as machine learning, but can just as easily not be.

Also, Rocko, your issue with applying the MLP classifier to the XOR problem can be illustrated this way. All of the methods used for optimization of model weights are based on gradients (SGD/Adam more so than LBFGS). You can easily end up in a situation where your optimizer gets stuck in a particular region of parameter space and needs to propose much larger changes to the parameters than it is capable of making in order to improve the loss.

If you use SGD or Adam as your optimizer, you can visualize this:

Python code:
import seaborn
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# same XOR data as the earlier post
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

correct = 0
for x in range(100):
    classifier = MLPClassifier(hidden_layer_sizes=(2,),
                               learning_rate_init=0.1,
                               random_state=x)
    model = classifier.fit(X, y)
    if all(model.predict(X) == y):
        correct += 1
    seaborn.lineplot(x=list(range(len(model.loss_curve_))), y=model.loss_curve_)
plt.xlabel('iterations')
plt.ylabel('loss')
print(f'Correct: {correct}%')

[plot: loss curves for the 100 runs above; roughly a quarter reach near-zero loss]
You can see that about a quarter of these manage to converge on the correct solution, a loss value near zero. MLPClassifier uses log-loss/cross-entropy. If you leave the learning_rate_init default at 0.001, you should get a lot of warnings telling you it didn't converge with 200 iterations and a plot like this:

[plot: loss curves with the default learning rate, still well above zero after 200 iterations]

With the default learning rate, it can't make changes to the parameters fast enough to reach the correct solution in the default maximum number of iterations. I am not sure why L-BFGS does as poorly as it does here; I get 28% properly fit, and most of the models converge on similar incorrect solutions. There aren't quite enough knobs to tweak with that particular optimizer here.

This is one of those issues I was referring to when I say that the defaults for the Scikit models are not always sane and definitely not always suitable. Making these stochastic optimizers converge properly and quickly by tweaking both model and optimizer hyperparameters is where much of the "art" comes into it.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Hey, I have a "how to do this faster" Python/pandas question.

I have an 8Mx50 col data set as a DF and I want to do some multi column partial string matching/filtering.

For example I need to check whether any of 20 partial strings exist in col Address1 when col City contains New York.

vikingstrike
Sep 23, 2007

whats happening, captain
Something like this might work ok:

frame['has_match'] = frame.loc[frame.City.str.contains('New York'), 'Address'].str.contains(r'your regex string', regex=True)

Phayray
Feb 16, 2004
My take on ML is that it's just loose fitting. Classically, we use physics or basic math principles or a bunch of learned experience to create the model, but for some problems we just don't have enough information, or the interactions are too complicated to model using classical techniques. ML is great at addressing these because you don't need to fully understand all of your system fundamentals. It's a hand-wavy approach for a hand-wavy problem that steers you in the right direction.

Where things will go is that we'll use our classical understanding to solve the parts we understand really well, and to limit or inform the ML, and then you let it do its thing to clean up the edges for you, producing a really exceptional result. You might even gain a performance boost because you're not necessarily solving the whole classical problem anymore, which is great.

Right now one of the big challenges is how you get the "classical" information back out, with error bars. One of the problems with the ML black box is that the answers are often relative or categorized such that it's very difficult to understand the statistical confidence in that value. Yeah, it got the right answer, but what does that value mean statistically? Is that a 1 sigma or a 5 sigma result? How sure are you that the blob is a tumor? There's a big difference!

To contribute: I'm a long time C++ programmer who has recently jumped into python again after a ~10 year hiatus for work reasons and I have to say...f-strings are the poo poo. I've been bitten by type shenanigans a few times, and it's going to continue to haunt me forever, but f-strings, man...drat. I didn't realize the future was already here. I hope to never touch any python2 code ever again.
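A taste, for anyone else returning from the 2.x days:
Python code:
name, pct = 'goon', 0.9573
print(f'{name} passed {pct:.1%} of tests')  # goon passed 95.7% of tests
print(f'{2 ** 10 = }')                      # 2 ** 10 = 1024 (3.8+ debug form)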

Phayray fucked around with this message at 06:07 on Mar 13, 2021

mr_package
Jun 13, 2000
f-string support in current PyCharm is great: it prepends the f and adds the closing } for you if you type { into a string.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

vikingstrike posted:

Something like this might work ok:

frame['has_match'] = frame.loc[frame.City.str.contains('New York'), 'Address'].str.contains(r'your regex string', regex=True)

I was kinda drunk in the prev posting and should clarify that .str.contains is what I'm using now and is way too slow. isin would be sufficiently fast but I don't think it can find substrings. It'd work for city but not for the 20 address partial strings that I'd use in .str.contains

For example, if wanting to find things on John Rd or Jane Road I might use "John|Jane" in str.contains

This works obvs but is massively too slow as I wanna do this sort of operation potentially hundreds of times. Considering a SQL DB.

NinpoEspiritoSanto
Oct 22, 2013




Use sqlite or something, god drat, pandas has become the new excel
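The move is roughly this sketch (DataFrame.to_sql into sqlite, then let the DB do the matching; column names from the question, data hypothetical):
Python code:
import sqlite3

import pandas as pd

# stand-in for the real 8M-row frame
df = pd.DataFrame({
    'Address1': ['12 John Rd', '9 Jane Road', '1 Main St'],
    'City': ['New York', 'New York', 'Boston'],
})

conn = sqlite3.connect(':memory:')  # use a file for 8M rows
df.to_sql('addresses', conn, index=False)

partials = ['John', 'Jane']
clauses = ' OR '.join(['Address1 LIKE ?'] * len(partials))
query = f'SELECT * FROM addresses WHERE City LIKE ? AND ({clauses})'
params = ['%New York%'] + [f'%{p}%' for p in partials]
rows = conn.execute(query, params).fetchall()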


Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
Wouldn't it be best to just filter by New York first and then do the search on address?

I would ask on steak overflow tbh, a lot of pandas-obsessed weirdos on there who answer my pandas questions within ten minutes
