nullfunction
Jan 24, 2005

Nap Ghost

Csixtyfour posted:

Would this work or do I need to keep working on it?

Perfectly acceptable for a student IMO.

If you wanted to continue poking at it, you could move your prompt into the input() call, i.e. input('Enter an integer: '), so you don't have to call print() as a separate statement. I would also probably replace the "x" in your loop with an underscore, since you're not actually using it and some IDEs will complain about the unused variable.
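
A minimal sketch of both tweaks, assuming your loop just repeats something a fixed number of times (the loop body here is a stand-in for whatever yours actually does):

Python code:
count = int(input('Enter an integer: '))  # prompt lives inside input(), no separate print()
for _ in range(count):                    # underscore signals the value isn't used anywhere
    print('Hello!')                       # stand-in for your real loop body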


nullfunction
Jan 24, 2005

Nap Ghost
Or with pytest, use the capsys fixture
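
A minimal sketch of what that looks like, assuming the function under test just print()s its result (greet() here is a made-up example):

Python code:
def greet(name: str) -> None:
    print(f"Hello, {name}!")

def test_greet(capsys):
    greet("world")
    captured = capsys.readouterr()   # capsys captures whatever went to stdout/stderr
    assert captured.out == "Hello, world!\n"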

nullfunction
Jan 24, 2005

Nap Ghost

QuarkJets posted:

I thought sorted already did that

sorted() gives you the sorted keys, not key-value pairs.

You can do something like this but I'm not aware of a native method on dict that would do this for you.

Python code:
>>> d = {"a": 1, "c": 3, "b": 2}
>>> s = {k: d[k] for k in sorted(d)}
>>> s
{'a': 1, 'b': 2, 'c': 3}

nullfunction
Jan 24, 2005

Nap Ghost
Neat, that's definitely nicer to look at than a dict comprehension. I guess I've never really thought about lists of tuples being transformed back to a dict like that, but it makes a lot of sense.
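
For anyone following along, the suggestion being referenced is presumably something like this (works because dict preserves insertion order on 3.7+):

Python code:
>>> d = {"a": 1, "c": 3, "b": 2}
>>> dict(sorted(d.items()))
{'a': 1, 'b': 2, 'c': 3}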

nullfunction
Jan 24, 2005

Nap Ghost
I've never worked with Qt so I can't speak to how you'd offload work into QThreads or whatever, but hopefully this gets the caching idea across and gets you thinking about how you can modify your program's structure to make your life a little easier in the future:

Python code:
# pokemon.py
import requests

# The dict where we'll store the results locally from your API calls
local_pokedex = {}

class PokemonApi():
    @staticmethod
    def get_pokemon(name: str) -> dict:
        # Probably not a bad idea to add some error handling around this! APIs fail sometimes!
        headers = {'Accept': 'application/json', 'Cache-Control': 'max-age=360'}
        response = requests.get(f'https://pokeapi.co/api/v2/pokemon/{name}', headers=headers)
        return response.json()

class Pokemon():
    def __init__(self, data: dict):
        self.name = data['name']
        self.height = data['height']
        self.height_cm = self.height * 10
        self.height_ft = self.height_cm * 0.0328084
        # Other interesting attributes could be processed here,  or if you're not sure what 
        # you'll use, you might store the entire JSON
        # As the final step of initializing, update the cache
        local_pokedex[self.name] = self
    
    @classmethod
    def from_api(cls, name: str):
        return cls(data=PokemonApi.get_pokemon(name))
    
    @classmethod
    def from_cache(cls, name: str):
        if name not in local_pokedex:
            # We don't have this pokemon in our local cache, so we have to go to the API
            return cls.from_api(name)
        # If we're here, we know we have a cached version of the pokemon and don't need to call the API
        return local_pokedex[name]
Normally you'd use the from_cache classmethod when you need to get a new Pokemon due to user input, since it will handle cache hit/miss, but you could always call the from_api classmethod if you wanted to force a cache refresh, or you could call the bare constructor if you loaded the data dictionary from a file or something.

The first time I call for Pikachu using from_cache, it's 100-200ms to retrieve from the API. The next time I ask for a Pikachu, the data is already local so I'm just spending the time to access the dictionary of Pokemon objects (absurdly fast in comparison, at the cost of a little bit of RAM).

Python code:
# poketest.py
from datetime import datetime
from pokemon import Pokemon

first_call_start = datetime.now()
first_pika = Pokemon.from_cache('pikachu')
first_call_finish = datetime.now()

second_call_start = datetime.now()
second_pika = Pokemon.from_cache('pikachu')
second_call_finish = datetime.now()

print(f'{(first_call_finish - first_call_start).microseconds}usec to get my first Pikachu')
print(f'{(second_call_finish - second_call_start).microseconds}usec to get my next Pikachu')
code:
$ python3 poketest.py
112527usec to get my first Pikachu
3usec to get my next Pikachu

nullfunction
Jan 24, 2005

Nap Ghost

D34THROW posted:

Pokemon.from_cache returns a new instance of Pokemon (i.e. a constructor) if the Pokemon is not in the API, or the copy of the locally cached data. In other words, from_cache is a Pokemon factory?

I had a pretty decent grasp of factory patterns and interfaces in VBA but for some reason, in Python they break my brain.

A classmethod is effectively a factory: all it's really doing is receiving the class itself as an argument (along with any other arguments) and then calling the constructor, or another classmethod that eventually calls the constructor. It's worth noting that it has access to class attributes, but not instance attributes:

Python code:
class Butt():
    crack = True
    def __init__(self):
        self.fart_noise = "toot"
    
    @classmethod
    def alternate_butt(cls):
        print(cls.crack) # True
        print(cls.fart_noise) # AttributeError: type object 'Butt' has no attribute 'fart_noise'
Sort of a contrived example, but let's say you want to create a consistent object from one of two inconsistent sources. Maybe the XML input expresses a timespan as a start and end date, and the JSON equivalent stores a start date and a duration, but in your object, all you need is the duration. Your from_xml classmethod could handle subtracting the two dates and substituting a datetime.timedelta while the from_json classmethod might convert the duration over to a datetime.timedelta and ignore the start date completely. It's a great way to hook in some format- or source-specific logic without polluting the other methods in the class, making the whole of the class a lot easier to read, reason about, and perhaps most importantly, test.
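
A rough sketch of that idea, with made-up field names and formats (the XML carries a start and an end, the JSON carries a start plus a duration in seconds, and the object itself only ever keeps a timedelta):

Python code:
from dataclasses import dataclass
from datetime import datetime, timedelta
import json
import xml.etree.ElementTree as ET

@dataclass
class Task:
    name: str
    duration: timedelta

    @classmethod
    def from_xml(cls, xml_string: str) -> "Task":
        # XML source: duration is derived by subtracting start from end
        root = ET.fromstring(xml_string)
        start = datetime.fromisoformat(root.findtext("start"))
        end = datetime.fromisoformat(root.findtext("end"))
        return cls(name=root.findtext("name"), duration=end - start)

    @classmethod
    def from_json(cls, json_string: str) -> "Task":
        # JSON source: duration comes straight from a seconds field, start date is ignored
        data = json.loads(json_string)
        return cls(name=data["name"], duration=timedelta(seconds=data["duration_seconds"]))
Both classmethods hand the constructor the same shape of data, so everything downstream of __init__ only ever sees a clean Task.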


E: since I was mocking this up pretty late at night, I didn't go into error handling, but in the case of a Pokemon not existing in the API, the code I posted above blows up. Were this code I was actually using, I'd be checking the status code of the response at a minimum, and raising a better exception (along the lines of a PokemonNotFoundError) and handling that error with something in the UI. In fact, it looks like that API only supports lowercase names, so doing a .lower() and .strip() on the name in the classmethod is probably a good idea too!

nullfunction fucked around with this message at 19:13 on Mar 25, 2022

nullfunction
Jan 24, 2005

Nap Ghost
Sure, that's a very common use case for classmethods, though without seeing the rest of your code, I don't see a reason you need two separate classes for that example.

Python code:
class PGT:
    def __init__(self, **kwargs):
        # Your init stuff
        ...

    @classmethod
    def from_csv(cls, path):
        # CSV stuff here
        ...

    @classmethod
    def from_xml(cls, path):
        # XML stuff here
        ...
APIs with questionable choices are everywhere. It could be worse: returning a 200 OK and a stack trace in the body is definitely something I've run into, all so the person maintaining the API can say "Oh, that API never errors, must be your code :smuggo:"

nullfunction
Jan 24, 2005

Nap Ghost
I don't think the fact that it's community-run or beta really answers the "what happens if the API is no longer reachable?" question; if anything, it underscores it. It's a question you should be asking yourself each time you interface with something outside your immediate control, even as a hobbyist. Answering those questions will give you natural boundaries in the code you write, and hints on where to break things apart into more manageable chunks. Code that works and does what you intend it to do is an achievement at any level, and if you have no ambitions past hobbyist, that's fair game too.

Rather than shelving it, try adding a local file cache! Even if you're just saving the JSON you got from the webserver, it means you can still use it when your internet connection is down, and a refactor would do the code you posted some good, especially if you ever want to change anything about it in the future.
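
A minimal sketch of what that could look like, bolted onto the earlier pokeapi example (the cache directory name and layout are arbitrary choices here):

Python code:
import json
import pathlib
import requests

CACHE_DIR = pathlib.Path("pokecache")
CACHE_DIR.mkdir(exist_ok=True)

def get_pokemon(name: str) -> dict:
    cache_file = CACHE_DIR / f"{name}.json"
    if cache_file.exists():
        # Cache hit: no network involved, so this still works offline
        return json.loads(cache_file.read_text())
    # Cache miss: call the API and stash the raw JSON for next time
    response = requests.get(f"https://pokeapi.co/api/v2/pokemon/{name}")
    response.raise_for_status()
    cache_file.write_text(response.text)
    return response.json()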

nullfunction
Jan 24, 2005

Nap Ghost
Nobody likes startswith()?

Python code:
[name for name in names if name.lower().startswith("m")]

nullfunction
Jan 24, 2005

Nap Ghost

KICK BAMA KICK posted:

Too lazy and I forget how to use timeit up correctly Phoneposting or I'd test but: sure it's nbd so just hypothetically, is s.lower() O(n)? and worth avoiding if the strings could be long in favor of like s[0] in "Mm"?

lower() is O(n), so yes, for large numbers of very long strings you may see a large benefit with this approach.

I just learned that startswith can accept a tuple of strings, so you can skip the lower() and just

Python code:
[name for name in names if name.startswith(("M", "m"))]
and it's nearly as performant as grabbing the 0th index, while not throwing exceptions if it hits an empty string. Pretty readable too.

nullfunction
Jan 24, 2005

Nap Ghost

QuarkJets posted:

What's the actual performance difference between value = a_dict.get(key, default) and value = default if key not in a_dict else a_dict[key]? I'm guessing it's going to be negligible enough that I wouldn't let it effect my code at all. I'm an optimization blow-hard but even for me that's beyond what I'd consider doing. There is surely a more effective way to optimize the performance of the software than doing this kind of replacement.

I think this one comes down to being a style choice, you should be doing whatever is conventional for your group (or if you get to set that standard, then whatever you like)

I was curious about this too.

code:
$ python3.10 -m timeit -r 10 -s "d = {'foo': 'bar'}" "value = d.get('foo', 'bar')"
10000000 loops, best of 10: 31.5 nsec per loop

$ python3.10 -m timeit -r 10 -s "d = {}" "value = d.get('foo', 'bar')"
10000000 loops, best of 10: 30.9 nsec per loop

$ python3.10 -m timeit -r 10 -s "d = {'foo': 'bar'}" "value = 'bar' if not 'foo' in d else d['foo']"
10000000 loops, best of 10: 27.1 nsec per loop

$ python3.10 -m timeit -r 10 -s "d = {}" "value = 'bar' if not 'foo' in d else d['foo']"
20000000 loops, best of 10: 18.8 nsec per loop

nullfunction
Jan 24, 2005

Nap Ghost

QuarkJets posted:

And this gives performance that's a little worse, but still better than calling d.get directly???

Ultimately both are pulling the work out of the loop, so yeah, I'd expect them to be faster than a call to d.get(). I do see that it's about 3ns faster on my machine (regardless of whether the dict is populated) than the operator approach's worst case, which surprised me a bit; the operator's best case still wins handily, but

QuarkJets posted:

Just goes to show how important it is to profile before trying to optimize

is the real takeaway here.

nullfunction
Jan 24, 2005

Nap Ghost

Dawncloack posted:

I have a regex problem.

You're missing the re.MULTILINE flag:

Python code:
import_regexer = re.compile('^.*import .+$|^from .+ import .+$|^.*include .+$', flags=re.MULTILINE) 
That's what tells the regex module that you are feeding it multiple lines of text inside a single string. Otherwise, the caret would only match the start of the entire file, and the dollar sign the end of the file.

Obligatory "now you have two problems" and all that aside, I would rethink the way you've chosen to write your regex. Look into \s at the very least if you want to capture whitespace, using a .* or .+ when you actually want whitespace is guaranteed to wreck your day at some point in the future.
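
As a rough illustration of the \s point (not a drop-in replacement, just a sketch of tightening the pattern so whitespace is matched as whitespace rather than with .* or .+):

Python code:
import re

import_regexer = re.compile(
    r'^\s*import\s+\S+'                 # plain "import foo"
    r'|^\s*from\s+\S+\s+import\s+.+$'   # "from foo import bar"
    r'|^\s*include\s+.+$',              # "include 'foo.php'"
    flags=re.MULTILINE,
)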

nullfunction
Jan 24, 2005

Nap Ghost

nullfunction posted:

now you have two problems

As a general rule, if you want to interrogate Python scripts in any meaningful way, you should look into the ast module. It's part of the standard library, though I'd wager that someone's done the hard work and written parsers for PHP and bash as well, so with a quick pip install you could take a similar approach for the other files you may need to parse.

Python code:
import ast
import pathlib

contents = pathlib.Path("sample.py").read_text()
abstract_syntax_tree = ast.parse(contents)

def analyze_import(stmt: ast.Import):
    modules = [n.name for n in stmt.names]
    print(f"Found an import statement on line {stmt.lineno}: imported module{'s' if len(modules) > 1 else ''}: {', '.join(modules)}")

def analyze_import_from(stmt: ast.ImportFrom):
    aliases = [n.name for n in stmt.names]
    print(f"Found an import statement on line {stmt.lineno}: imported alias{'es' if len(stmt.names) > 1 else ''} {', '.join(aliases)} from module: {stmt.module}")

for item in abstract_syntax_tree.body:
    if isinstance(item, ast.Import):
        analyze_import(item)
    elif isinstance(item, ast.ImportFrom):
        analyze_import_from(item)
Python code:
# sample.py
import json
import os, sys
from datetime import timedelta, datetime
...
code:
Found an import statement on line 2: imported module: json
Found an import statement on line 3: imported modules: os, sys
Found an import statement on line 4: imported aliases timedelta, datetime from module: datetime

nullfunction
Jan 24, 2005

Nap Ghost

Seventh Arrow posted:

edit: oh wait, if I just convert the "=" to ":", will that get me a valid json? or is there a catch?

You should set this line of thinking aside and step back a little bit.

You've correctly identified that the data doesn't look like JSON. It's also not a format I recognize immediately other than to say it's structured and probably from the SOAP API as opposed to the newer REST API. Are you using the Python client for this API? If so you should have Python objects. How did you get to this format?

nullfunction
Jan 24, 2005

Nap Ghost

DoctorTristan posted:

I think those are python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by SalesForce and op got the output above by print() -ing the output while paused in a debugger.

This is the key.

I had figured you were print()ing the result, and there's a pretty important thing to call out here that will probably help you in many ways going forward, especially working with unfamiliar APIs.

When you print() an object, all you're doing is asking that object for a string representation of itself. There's a default representation that gives you something like <__main__.Foo object at 0x0000deadbeef> if you choose not to implement your own __repr__() but it can be literally any string representation that the developer wants, including JSON or some weird string like you have here.

Here's a small contrived example to hopefully illustrate:

Python code:
from dataclasses import dataclass

@dataclass
class Rectangle:
    height: int
    width: int
    
    @property
    def area(self) -> int:
        return self.height * self.width
    
    def __repr__(self) -> str:
        return f"A rectangle (area {self.area})"

r = Rectangle(3,4)
print(r) # A rectangle (area 12)
print(r.height) # 3
print(r.width) # 4
What you probably intend to use is dir(), which tells you what your options are:

Python code:
['__annotations__', '__class__', '...snip...', 'area', 'height', 'width']
Using dir() like that can help you identify what properties or methods you might be able to use to get all of the column data out of the object, which I think is probably more aligned with what you were originally trying to do. Maybe you luck out and there's a to_json() method hiding among the entries, but more likely you'll get all of the various properties that you need to plug into the next step in the process.

Unless there's a particularly mature and well-maintained third party API that does exactly what you need it to (simple-salesforce is great but doesn't cover everything!), I'd go with the official API. You'd need something really compelling to convince me that adding a third party dependency over the official dependency is the right move for this, especially if the APIs they expose are similar.

Ultimately you're going to have to do the work of mapping the columns you want (object data) to the right format (pandas dataframe) so that you can use the to_sql call regardless, so you may as well do it against the official API.

Seventh Arrow posted:

As an aside, I've developed a severe aversion to saying, "this next task should be pretty easy."

Congrats, you can call yourself a programmer now.

nullfunction
Jan 24, 2005

Nap Ghost
I had a look at the SDK link you posted and it's the same one linked from the Salesforce page, so I guess you're using the only choice.

I can speak to what you need to do conceptually, but I haven't used pandas; all I've done is have a quick look at what classmethods exist to create a dataframe. I know it's a super popular library, so I'm certain you can google "how do I load a pandas dataframe" and get to the same pages I would see. Just don't take this as an optimized solution, it's something to get you started. I gather that you're new to writing code in general, so I'll try to break it down as far as I can, but you should go through the exercise of actually implementing all of this if it's gonna be your job. You're gonna be doing stuff like this a lot, and frankly Python's a great language to do it in. Part of navigating a huge ecosystem like Python's is learning to use some of the built-in tools to self-service when exploring unfamiliar objects, since documentation isn't always available, or useful when it does exist.

I'm going to have to make some assumptions because I don't have access to Salesforce data, but let's take the repr from the BounceEvent you posted previously as a guide to what might exist on that object:

quote:

(BounceEvent){
Client =
(ClientID){
ID = client_id_here
}
PartnerKey = None
ObjectID = None
SendID = 4016
SubscriberKey = "subscriber_key_here"
EventDate = 2021-05-12 12:53:38.780000
EventType = "OtherBounce"
TriggeredSendDefinitionObjectID = None
BatchID = 4
}

The structure of the "Client" field suggests that the ClientID is a separate object or something more complex that you'll need to figure out and map accordingly, but overall this looks fairly straightforward in terms of mapping, provided you can locate the properties to get these values out of the object. I can see that this is a single bounced email, so I know that each BounceEvent maps to a "row" in pandas. I know that Pandas can accept a list of dicts (where each dict is a row, and each key in the dict is a column) into a dataframe, and you've indicated you want to use a dataframe to load the data into SQL so let's design with that goal in mind... we need a list of dicts to feed into pandas.

pre:
python pulls data          python somehow gets                  pandas pushes the data
from salesforce    ==>     that weirdly-formatted data  ==>     using to_sql to the
marketing cloud            into a dataframe                     client's Azure SQL DB
It seems like you've got the first part of this working from the examples you found, so let's just wrap it in a function, and annotate that it returns a list:

Python code:
def get_bounce_events() -> list:
    # snip -- example code you posted earlier goes here -- getResponse.results contains a list of BounceEvent objects
    getResponse = getBounceEvent.get()
    bounce_events = getResponse.results
    return bounce_events
Boom, first section done. Sort of. Maybe you need to pull out some of the authentication stuff (don't hard-code auth!), maybe it needs an argument to provide some further query info to narrow down the bounce_events you get, but that's the gist of it. We've put the logic relating to the retrieval of data in its own function away from everything else. This makes it easier to change in the future when someone inevitably asks you to make a tweak to how it works.

Now that you have all this code in a convenient function, it's a good time to fire up the REPL, paste the function in, and grab an object to play around with:

code:
>>> # paste your code, hopefully no errors
>>> bounce_events = get_bounce_events() # invoking the function you pasted
>>> bounce_event = bounce_events[0] # grab the first one from the list
>>> # now, you can easily do things like
>>> dir(bounce_event)
['__magic_methods__','UsefulProperty','SomeJunkTooProbably',...]
>>> type(bounce_event.UsefulProperty)
<class 'str'>
>>> print(bounce_event.UsefulProperty)
Some string data that interests you
Using dir() and type() you should be able to locate the properties that correspond to the values in the columns for this row. Once you understand how to get the data out of the object, you can write a function that accepts one of these objects as an argument, and outputs a dict:

Python code:
def transform_bounce_event(bounce_event: object) -> dict:
    return {
        "client_id": bounce_event.ClientID.ID, # or however this is obtained, maybe it's bounce_event['ClientID']['ID']? use the repl to figure out how to access the underlying data!
        "partner_key": bounce_event.PartnerKey,
        "send_id": bounce_event.SendID,
        # etc
    }
Now, that's probably fine until you get to EventDate. If you're lucky, the BounceEvent object will give you a native Python datetime, and I'd be shocked if pandas didn't recognize and convert native datetimes to whatever it uses internally. The worst case scenario is that you can only coax a string out of the object and have to use datetime.strptime to parse the date and time from it. Familiarize yourself with the strptime format codes in the datetime docs; you'll use them again, even if you lucked out this time. Either way, if it needs to end up in SQL, you need to figure out how to get it into the dict in a format that pandas likes, so that it can go into the dataframe.
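
For example, if the date comes back as a string shaped like the EventDate in the repr above (a guess on my part), the matching format string would be:

Python code:
from datetime import datetime

raw = "2021-05-12 12:53:38.780000"   # shaped like the EventDate in the repr
event_date = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
print(event_date.year)  # 2021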

Noodling around with dir(), calling methods, reading properties, using type() to map out the object, taking notes of what calls you need to make to get all your data out... these are things you'll want to get good at, you'll probably do them a lot. They're also super useful in a pinch when debugging.

At this point, you should have a function that you can feed a BounceEvent into, and get a dict out of that has everything you need for pandas. If you can get something loaded into a dataframe for one example BounceEvent, you've got a pretty good chance that at least one other BounceEvent will work too, so when exploring this way, I always try to start getting a single example of something working, then feeding it more data to try to validate that it works with more than just the one event you're testing.

Python code:
# pull data from salesforce marketing cloud
bounce_events = get_bounce_events() # maybe there is an argument here for a campaign or date range or whatever
# get that weirdly-formatted data into something appropriate for a dataframe
transformed_events = [transform_bounce_event(e) for e in bounce_events]
# make the dataframe from the transformed data
df = pandas.DataFrame.from_records(data=transformed_events, ...)
# sql magic here, no idea how pandas connects to azure but it happens last
do_some_sql_magic_with_dataframe(df)
Thinking about separating your code like you did is a great start, it gives you clues as to how you can structure the code. Use functions, they're great for abstracting away complexity, providing convenient places to swap code in and out, and just generally keeping things tidy. I didn't go too wild with type annotations above but I highly recommend looking into them further and using them everywhere you can; your IDE can interpret them and shout at you when you try to do things you shouldn't. In Python they're suggestions, not enforced, but more and more libraries are using them and the tooling has improved drastically over the last few years, there's never been a better time to get in the habit.

A strikingly large portion of the programs you'll write follow the same basic pattern as what you're trying to do above, so much so that entire industries have grown up around it. Have a look at the wiki page for ETL as it may give you some starting points for stuff to research further, vocab, etc.

nullfunction
Jan 24, 2005

Nap Ghost

StumblyWumbly posted:

I have a Python memory management and speed question: I have a Python program where I'm going to be reading in a stream of data and displaying the data received over the last X seconds. Data outside that window can be junked, I'd like to optimize for speed first, and size second. I know how I'd do this in C, but I'm not sure about Python memory management.
The data are going to be in a live updating graph, so accessing it contiguously would be good.

I have my own ideas about circular buffers or offset pre-allocated buffers, but I have the feeling Python has something off the shelf that will handle this well. Does anything like that exist?

If you want to hold the last N items in a structure similar to a ring buffer without implementing your own, a deque is probably what you're after: https://docs.python.org/3/library/collections.html#collections.deque

Deques can be given an optional maxlen to limit their size, but if you're expecting the stream volume to ebb and flow, you're probably better off implementing the cleanup of the tail end yourself and tying it to the time of the streamed event. In terms of performance, it's O(1) to access either end of the deque, O(n) in the middle, so if you're not doing a ton of random reads, or have a relatively small collection, a deque makes for a really nice choice.
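
A tiny sketch of the maxlen behavior:

Python code:
from collections import deque

# Keep only the 5 most recent readings; older ones fall off the other end automatically
last_readings = deque(maxlen=5)
for reading in range(10):
    last_readings.append(reading)

print(last_readings)  # deque([5, 6, 7, 8, 9], maxlen=5)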

nullfunction
Jan 24, 2005

Nap Ghost

Jabor posted:

This doesn't look like an arraydeque to me. It should have O(1) indexed access (as long as you're just peeking at elements and not adding/removing).

They are indeed different. I'm not aware of an arraydeque equivalent with fast access in the middle in Python's stdlib though I'd be surprised if there wasn't an implementation somewhere out there in the broader ecosystem.

Cachetools is an option, provided an LRU cache with TTL on top of it is acceptable. If you have the necessary skills and need the raw performance, binding something faster to Python is always an option as well.

I would start by throwing together a proof of concept in Python and profiling it before adding dependencies or binding another language, though.

nullfunction
Jan 24, 2005

Nap Ghost

QuarkJets posted:

:shrug:

I just want to throw something together using whatever is currently considered best practice. Maybe I could set up something static that just displays the last 10 pictures I dropped into a directory (e.g. I would generate html locally and then upload everything to s3, I guess?)

Best practice is sort of difficult to gauge here, because there are a dozen different ways to do something like this depending on the finer points of your requirements and what you're trying to optimize. Especially given the number of ways you can run a container on AWS alone, to say nothing of other techniques within AWS or other cloud providers.

As mentioned by a few others, if you just want a static site to display some images, S3 web hosting makes for a fantastic choice. Add a second bucket in a different region, replicate files there, throw CloudFront in with origin failover and it's suddenly globally-distributed and tolerant to a region failure -- pretty powerful for not a lot of work! There really isn't much Python involved here on the hosting side, other than maybe pushing files to S3 using boto3, generating your HTML file from a template, etc. It's nicely optimized for simplicity without sacrificing availability, and should be extremely cost-effective to run as well. It's also completely manual: if you want to add a new image to the 10 most recent, you have to regenerate your source files and reupload everything, since there's no compute associated with this method.

If you wanted a solution that leans on Python more heavily as the brains, I would use AWS Lambda for something like this. I don't think I would ship it as a container, I'd probably just live with using 3.9 until AWS finally gets their poo poo together with 3.10 and 3.11 runtimes since it's a very simple task and probably doesn't require a lot of maintenance. You could use a Lambda function to generate the HTML on the fly but I'd probably create the skeleton as a raw HTML file that lives in S3 and use htmx to call the function that asks S3 for files, determines the 10 most recent, and returns them as a chunk of HTML without having to write any JS on the frontend. If you had a suitable method of authenticating users, you could also use Lambda to return a signed URL that will let you upload files directly to the S3 bucket... no extra compute handling required to process a file upload, it's completely between the browser and S3 at that point!
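
As a sketch of that last idea, the signed-URL piece is only a few lines of boto3 (the bucket and key names here are made up, and in real life you'd generate this inside the Lambda handler after authenticating the user):

Python code:
import boto3

s3 = boto3.client("s3")

# The browser can PUT the file directly to this URL; no extra compute sits in the upload path
upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "my-image-bucket", "Key": "uploads/new-image.jpg"},
    ExpiresIn=300,  # seconds the URL stays valid
)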

To take this from a toy example into the real world, you'd probably want to think about:

- AuthN/AuthZ on your endpoint(s)
- Will end-users be rate-limited? Will results from the Lambda function need to be cached to keep from running into concurrency limitations if web traffic increases significantly? API Gateway can do rate limiting and caching for you but comes with a price and its own limitations around max request timeout. Adding HA means another API Gateway in another region with all the same backing Lambdas deployed there too.
- How often are we adding new images? Should we transition older images that don't need to be retrieved often to cheaper storage classes for archival, or should they be deleted after some time or falling out of the 10 most recent?
- Will image files stored in S3 be forced to conform to a certain naming spec, or will the original filename be used for storage? Is the S3 bucket allowed to grow unbounded? S3 doesn't have a way to query objects by their modified time, which means potentially lots of API calls to find the most recent objects if you don't prefix your objects with some specifier to help limit the number of queried objects. S3 bucket replication requires that object versioning be turned on, does your software correctly handle an object that has had several versions uploaded?
- If we need to maintain a record of historical images past the 10 most recent, should we store metadata about the objects in another place where we can do a faster lookup, then just reference the objects in S3? Is eventual consistency sufficient for this lookup? Lambda -> DynamoDB is a very common pattern, but is by no means the only option. If you want to use RDS, be sure you have a proxy between Lambda and your DB if you don't want to exhaust your connections when traffic spikes.
- As the number of moving parts involved grows, managing deployment complexity is going to be a bigger deal. Ideally you started building this out with your favorite infrastructure as code tool but if not, you'll definitely want to -- Terraform, CDK, CFN, doesn't really matter. I like CDK because I can stay in Python while defining my infrastructure, but it's not perfect (none of them are).

You could probably come up with a dozen more items to think about that aren't here, and to be clear, there are plenty of ways to accomplish the above in the hosting environment of your choice. You could opt to run the whole thing (object retrieval and storage too) in Django or Flask or FastAPI or whatever if you wanted the experience of building it yourself and don't mind a little undifferentiated heavy lifting.

nullfunction
Jan 24, 2005

Nap Ghost
Each one of your inputs should have a unique name (the name attribute, which often matches the id) that you specified on the front end. When you POST data to an endpoint by submitting a form, your browser grabs all the stuff you put in the form and sends it to the server (Starlette in this case) in a request. Each field in the form, visible or not, gets pushed up in that request as a key-value pair: the key is the name you gave your input, and the value is whatever was entered on the page.

When you handle that request in Starlette, it should give you all of the form data via the request parameter in your handler function. Each web framework is a little different, but that request object should contain everything that was submitted in the form somewhere within it. Your inputs will likely end up in a dict-like structure, with the keys matching whatever names you used on the front end.
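
A minimal sketch of the Starlette side, with a made-up route and field name ("username" stands in for whatever you called your input; multipart forms would also need the python-multipart package installed):

Python code:
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route

async def handle_form(request: Request) -> JSONResponse:
    form = await request.form()              # dict-like FormData of everything submitted
    username = form.get("username", "")      # key matches the input's name on the page
    return JSONResponse({"received": username})

app = Starlette(routes=[Route("/submit", handle_form, methods=["POST"])])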

nullfunction
Jan 24, 2005

Nap Ghost

Seventh Arrow posted:

I haven't been slamming the site with requests, so I don't think they're blocking me. Thoughts? Ideas?

The problem is how you've specified the classes on your find_all. What you've written asks for all <li> elements that have a class that matches that whole big long string of classes. The only place I see those classes is applied to the top-level HTML element, not to any of the <li> elements.

Digging around in the DOM, I was able to find the <li>s that correspond to the listings, they look like this:

code:
<li class="card-group__item item item-1 active">...</li>
<li class="card-group__item item item-2 active">...</li>
<li class="card-group__item item item-3 active">...</li>
...
Each of them has a few CSS classes in common: card-group__item, item, and active. "active" and "item" are a bit generic and might apply to other elements on the page, so what I'd do is pick "card-group__item", which is the most specific class shared among them.

Python code:
results = soup.find_all('li', 'card-group__item')

nullfunction
Jan 24, 2005

Nap Ghost
No mention of zip() as a way to grab items in order from a collection of iterables?

Python code:
>>> student_grades = {
...     'Andrew': [56, 79, 90, 22, 50],
...     'Nisreen': [88, 62, 68, 75, 78],
...     'Alan': [95, 88, 92, 85, 85],
...     'Chang': [76, 88, 85, 82, 90],
...     'Tricia': [99, 92, 95, 89, 100]
... }
>>> 
>>> list(zip(*[semester_grades for student, semester_grades in student_grades.items()]))
[(56, 88, 95, 76, 99), (79, 62, 88, 88, 92), (90, 68, 92, 85, 95), (22, 75, 85, 82, 89), (50, 78, 85, 90, 100)]
From there it's easy enough to sum() and divide by the len() of each of those.
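
Something like this, continuing from the dict above (using .values() for brevity):

Python code:
>>> [sum(grades) / len(grades) for grades in zip(*student_grades.values())]
[82.8, 81.8, 86.0, 70.6, 80.6]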

You can also use itertools.zip_longest() if you wanted to handle replacing missing grades with zeroes, for example:

Python code:
>>> import itertools
>>> student_grades = {
...     'Andrew': [56, 79, 90, 22], # length is only 4!
...     'Nisreen': [88, 62, 68, 75, 78],
...     'Alan': [95, 88, 92, 85, 85],
...     'Chang': [76, 88, 85, 82, 90],
...     'Tricia': [99, 92, 95, 89, 100]
... }
>>> 
>>> list(itertools.zip_longest(*[semester_grades for student, semester_grades in student_grades.items()], fillvalue=0))
[(56, 88, 95, 76, 99), (79, 62, 88, 88, 92), (90, 68, 92, 85, 95), (22, 75, 85, 82, 89), (0, 78, 85, 90, 100)]

nullfunction
Jan 24, 2005

Nap Ghost
Pydantic is more about data validation, not really applicable to the question as I understand it.

I'm unaware of any kind of generic declarative framework like that. The suggestion to use deepdiff is, in general, a good one -- I've used it in the past to root out differences in large dict structures and act upon those differences -- but the gap between "I want to declare how I want the world to look" and "now I have dictionaries that I can bounce against each other" is significant if you want to generalize things.

If you can't bend one of the existing tools like Ansible to your needs, you're probably going to have to build something. If you're going to build something, think really hard about whether or not you truly need it to be so flexible, especially if it

FISHMANPET posted:

only gets executed a few hundred times ever.

nullfunction
Jan 24, 2005

Nap Ghost

Falcon2001 posted:

I honestly wonder if just making a fake version of the client is the right answer, since this is starting to get a little insane.

Would something like responses help? Check into the registry / dynamic responses stuff.

I've found it to be extremely needs-suiting for mocking APIs in test clients.
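
A small sketch of the basic usage, with a made-up endpoint and payload (the registry / dynamic response features build on the same idea):

Python code:
import requests
import responses

@responses.activate
def test_get_pokemon():
    # No real network traffic happens; responses intercepts the call and serves this payload
    responses.add(
        responses.GET,
        "https://pokeapi.co/api/v2/pokemon/pikachu",
        json={"name": "pikachu", "height": 4},
        status=200,
    )
    resp = requests.get("https://pokeapi.co/api/v2/pokemon/pikachu")
    assert resp.json()["name"] == "pikachu"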

nullfunction
Jan 24, 2005

Nap Ghost
Dataclasses are cool and good and if you're accustomed to just throwing everything in a dict it's easy to get started and see immediate benefits.

Pydantic is really helpful if you are getting your data from a less-than-trusted source and need to ensure that everything comes in exactly as you expect by using its validators. Naturally there's a cost to this extra processing, and whether or not this is acceptable will depend on your requirements and goals, but I can say that for all of my use cases the ergonomics of using Pydantic far outweighed the performance cost. I haven't had the opportunity to use the new 2.x series but I understand the core has been rewritten in Rust and is significantly faster than the 1.x series.

If your data is of a predictable format or you need to stick to the stdlib, dataclasses will suffice. If your data is uncertain and you don't mind consuming an extra dependency, it's hard not to recommend Pydantic.
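
A tiny sketch of the kind of validation I mean (the model is made up, and as far as I know this works on both 1.x and 2.x):

Python code:
from pydantic import BaseModel, ValidationError

class Measurement(BaseModel):
    sensor_id: int
    value: float

# Strings from an untrusted source get coerced where that's possible...
m = Measurement(sensor_id="7", value="3.14")
print(m.sensor_id, m.value)  # 7 3.14

# ...and rejected with a descriptive error where it isn't
try:
    Measurement(sensor_id="not-a-number", value="3.14")
except ValidationError as err:
    print(err)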

nullfunction
Jan 24, 2005

Nap Ghost
Hopefully you've found the locale module in the stdlib to do your output formatting and just missed the fact that locale.atof() exists. The name isn't intuitive, but it parses strings according to your locale settings, which you can change using setlocale().

https://docs.python.org/3/library/locale.html
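
For example (assuming the de_DE locale is actually installed on your system):

Python code:
import locale

locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
print(locale.atof("1.234,56"))  # 1234.56
print(locale.atoi("1.234"))     # 1234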

nullfunction
Jan 24, 2005

Nap Ghost
Pretty sure the v2 equivalent you're looking for is FieldValidationInfo, which is passed into each validated field like the values dict was in 1.x; it just has a slightly different shape.

Python code:

from pydantic import FieldValidationInfo
...

class PetOwner(BaseModel):
    person_id: Union[UUID4, UUID5] = Field(default_factory=uuid_factory)
    name: str
    animals: Dict[str, Pet]

    @field_validator("animals", mode="before")
    @classmethod
    def enforce_dog_properties(cls, v, info: FieldValidationInfo):
        print(info)
        # FieldValidationInfo(config={'title': 'PetOwner'}, context=None, data={'person_id': UUID('c60d44c7-ee94-4995-9a84-934102e63a77'), 'name': 'henry'}, field_name='animals')
        for _, animal in v.items():
            if "owners_id" not in animal:
                animal["owners_id"] = info.data["person_id"]
        return v

nullfunction
Jan 24, 2005

Nap Ghost
Yeah, fair. I was mucking about with the rest of the code to play with v2 a bit more and didn't include half of the things that would make this better.

So far v2 seems good, though I need to spend some time with their automated upgrade tool.

nullfunction
Jan 24, 2005

Nap Ghost

Armitag3 posted:

from them import frustration as usual

nullfunction
Jan 24, 2005

Nap Ghost
Generally I like the obnoxious itertools solution but as mentioned, yeah, doesn't quite meet the requirements laid out in the prompt:

quote:

Generate a Normally distributed random walk with a starting value of 0 as a Python list

I'd combine the techniques to just accumulate from a generator.

Python code:
from typing import Generator
from itertools import takewhile, accumulate
from random import gauss

def gauss_gen(mu: float = 0.0, sigma: float = 1.0) -> Generator[float, None, None]:
    while True:
        yield gauss(mu, sigma)

walk = list(takewhile(lambda x: abs(x) < 3, accumulate(gauss_gen(), initial=0.0)))
stopping_time = len(walk)

print(f"Stopped in {stopping_time} steps:")
print(f"{walk=}")
e: In no way am I saying the original prompt is well-written. I'd be pissed if I was taking the exam.

nullfunction fucked around with this message at 21:39 on Nov 11, 2023

nullfunction
Jan 24, 2005

Nap Ghost
Yeah, most of the typing stuff you'd likely reach for has been a part of the language for years now. Sometimes you'll have a dependency that requires a particular Python version, which informs the ceiling on features available to you. If what you produce is a module, you'll sometimes find that consumers of that module have some sibling dependency that is stuck on an older Python and limits your ability to use the features you want.

If you're looking for a baseline here's what I tend to do for all greenfield code and all code being refactored on an existing system.

1. Every argument should have a type annotation, every function should have a return type annotation.
2. Lists and dicts should always be annotated with what they are expected to contain in arguments. If you declare one in a function body (assuming Python >= 3.9) it should be annotated if declared empty or the type can't be inferred from the assignment expression. list isn't enough, list[str] is always better.
3. Complex or deeply-nested structures get turned into something more manageable, either by aliasing or with dataclasses (there's a quick sketch of both after this list). Type aliasing has existed since 3.5, though you'll have to pull in typing.Dict and typing.List rather than use the builtin types in annotations if you're on something older than 3.9. Nobody wants to see a pile of methods expecting a list[dict[Union[int, str], Union[int, str, bool, float, dict[str, Union[str, int, float]]]]] or whatever.
4. Any is a smell, but a useful one.
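
A quick sketch of what I mean by item 3 (the names here are invented):

Python code:
from dataclasses import dataclass
from typing import Union

# Option one: give the blob a name so signatures stay readable
Payload = dict[str, Union[int, float, str]]

def summarize(payload: Payload) -> str:
    return ", ".join(f"{k}={v}" for k, v in payload.items())

# Option two: model the structure explicitly and get attribute access plus IDE help
@dataclass
class LineItem:
    sku: str
    quantity: int
    unit_price: float

def order_total(items: list[LineItem]) -> float:
    return sum(item.quantity * item.unit_price for item in items)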

There's no one answer when asking "how far should I go with types?" because the factors to consider aren't uniformly weighted in every organization and environment. If what you produce is a module consumed by other teams, effort spent adding type hints has a base payoff to your immediate team and then is further increased by your making your consumers' lives easier. If the codebase in question is a very simple service that will be thrown away after a certain time or get touched once every never, investing hours carefully defining the intricacies of a deeply-nested dictionary in a request payload may never pay off and slapping a dict[str, Any] in may be the wiser choice because there are more impactful tasks at hand. If you expose a set of primitives for your consumers to use in complex ways, the benefits from typing are much greater than if your interface is a single function call in a single module.

If you're adding types to something that is established and has consumers who would be mad at you for breaking changes, I would be very careful about the temptation to refactor as you add types, because the act of adding types to untyped code will almost certainly expose problems or failures of abstraction that you didn't realize were lurking all along. I find it better to use Any as an indication that we know something is not right and can't do any better without breaking compatibility. That differentiates it from Unknown (a signal that this portion of the codebase has not had a typing pass yet) and gives you a target for a future refactor.


Falcon2001 posted:

anyone who uses another datetime format is a heretic and does not deserve to draw breath.

:hai:

Adbot
ADBOT LOVES YOU

nullfunction
Jan 24, 2005

Nap Ghost

Seventh Arrow posted:

At first, I had something much simpler that just did the assigned task, but I figured that I would also add commenting, error checking, and unit testing because rumor has it that this is what professionals actually do. I've tested it and know it works but I'm wondering if it's a bit overboard? Feel free to roast.

I'll preface this with an acknowledgement that I'm not a pyspark toucher so I'm not going to really focus on that. From the point of view of someone who is reviewing your submission, I'm very happy to see comments, error checking, and unit tests! They give me additional insight into how you communicate information about your code and how you go about validating your designs. However, if the assignment was supposed to take you 4 hours and you turn in something that looks like it's had 40 put into it, that isn't necessarily a plus.

Since you're offering it up for a roast, here are some things to consider:
  • Your error handling choices only look good on the surface. Yes, you've made the error messages slightly fancier by adding some text to them, and yes, the functions will always raise errors of a consistent type. But they don't react in any meaningful way to the errors that might be raised, and they don't enrich the error messages with context that would be useful to an end user (or a logging system). You could argue that they make the software worse, because they swallow the stack trace that might contain something meaningful (they don't raise from the original exception; there's a small sketch of that after this list).
  • Docstrings for each function are a good practice. Some docstring formats have a description for each argument, raised exception, and a description of the return value, and I tend to like these because it's helpful to tie relevant info directly to an argument. The docstrings you wrote contain the function's name, just reworded, and are not useful. In fact, you could take the tiny bit of extra information from the docstring and put it back into the function name and have a function name that is even better than the one you started with. calculate_most_common_genre_by_user is better than calculate_most_common_genre (but I would probably go with top_genre_by_user personally).
  • Normalize your use of single or double quotes. Run it through a formatter like black or ruff. The best case is that the reviewer doesn't notice or thinks it's a bit sloppy, the worst case is that they assume you copied it from two different websites.
  • You don't have any type annotations in your arguments or on your return values. Help your IDE help you.
  • Your unit tests check that an answer was returned, but stop short of actually seeing if that answer is correct. Constructing a tiny fake dataset with some known answers for your unit tests is a great way to validate that, and it seems like that's what you did from the filenames, but it seems unfinished.
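
The raise-from point from the first bullet, sketched out with made-up names:

Python code:
import json

class RatingsLoadError(Exception):
    """Hypothetical domain-specific error."""

def load_ratings(path: str) -> list:
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError) as err:
        # "from err" chains the original traceback to the new error instead of
        # swallowing it, so the root cause stays visible in logs
        raise RatingsLoadError(f"could not load ratings from {path!r}") from err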

It's clear that you've seen good code before and have some idea of what it should look like when trying to write it yourself, but you're missing the fundamentals and experience that will let you actually write it. To be clear, this is a fine place to be for a junior. If this is for a junior role and is submitted as-is, it's a hire from me, but there's a lot of headroom for another junior to impress me over this submission.
