Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Oysters Autobio posted:

Echoing this and expanding it even further: don't touch anything related to XML as a beginner.

While I'm absolutely sure it has some great features, I'm finding it's very much a "big boy" format and not beginner friendly. Maybe it's not the format necessarily, but the ecosystem of tools in Python for it is really dense.

Hell, just look at the name "lxml" as a package. Gonna throw out a dumb hot take that I literally put no thought into: Acronyms should be banned from package naming.
Yeah, the big issue is that XML was built as a markup language, not a language for representing data structures or configuration, the two things developers between 1995 and 2005 really liked to pretend it was good at.

If you can make guarantees about the documents you're loading, like "text will never contain other elements", then it gets a lot easier to work with and enables much more straightforward APIs like Pydantic.
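
As a rough sketch of that point (the flat document and the Config model below are invented for illustration, not from the post): with "leaf elements only contain text" guaranteed, plain ElementTree plus Pydantic validation gets you a typed object in a few lines.

Python code:
import xml.etree.ElementTree as ET

from pydantic import BaseModel

# Hypothetical flat document: no element's text contains other elements
XML_DOC = "<config><host>localhost</host><port>8080</port></config>"

class Config(BaseModel):
    host: str
    port: int

root = ET.fromstring(XML_DOC)
# Each child holds only text, so a dict comprehension is all the "parsing" needed;
# Pydantic then validates and coerces the types (e.g. "8080" -> 8080)
config = Config(**{child.tag: child.text for child in root})
print(config)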

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum
Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land.

We had a product that was an API mostly running python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. By default BeautifulSoup uses lxml as its parser, so I think we just cut out the middleman. I always assumed it was just an attempt at resource savings at a large scale.

Two years of that and I'm a super good scraper and I can get a lot done with just a few convenience methods for lxml and some xpath. And I have no idea how to use BeautifulSoup.

Jose Cuervo
Aug 25, 2004

QuarkJets posted:

That's right, the responses will have the same order as the list of tasks provided to gather() even if the tasks happen to execute out of order. From the documentation, "If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables."
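
For illustration only (a toy example, not from the thread): even when later tasks finish first, gather() still returns results in the order the awaitables were passed in.

Python code:
import asyncio

async def work(i: int) -> int:
    # Higher i sleeps less, so later tasks finish earlier
    await asyncio.sleep(1 / (i + 1))
    return i

async def main():
    results = await asyncio.gather(*(work(i) for i in range(5)))
    print(results)  # [0, 1, 2, 3, 4] -- call order, not completion order

asyncio.run(main())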

Great. I saw that in the documentation and thought that is what it meant but I wanted to be sure.

Another related question - I have never built a scraper before but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 urls with violations). Is the aiohttp stuff 'clever' enough to not make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website if I call the fetch_urls function with a list of 12,000 urls?

Finally, sometimes the response which is returned is Null (when I save it as a json file). Does this just indicate that the fetch_url function ran out of retries?

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Jose Cuervo posted:

Great. I saw that in the documentation and thought that is what it meant but I wanted to be sure.

Another related question - I have never built a scraper before but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 urls with violations). Is the aiohttp stuff 'clever' enough to not make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website if I call the fetch_urls function with a list of 12,000 urls?

Finally, sometimes the response which is returned is Null (when I save it as a json file). Does this just indicate that the fetch_url function ran out of retries?

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:

Python code:
connector = aiohttp.TCPConnector(limit_per_host=10)
aiohttp.ClientSession(connector=connector)
Yes, the fetch_url method will result in None if it fails after 3 retries. I noticed that each url has an id number for the daycare in the params, so you could log which daycares you didn't get a response for and follow up later. Just add something outside the while loop; the code only gets there if all retries fail. You could also adjust the retry interval. I left it at 1 second but a longer delay might help.
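
As a rough sketch of how those pieces fit together (the thread's actual fetch_url isn't shown here, so the names and retry details below are illustrative):

Python code:
import asyncio
import aiohttp

async def fetch_url(session, url, retries=3):
    for _ in range(retries):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.json()
        except aiohttp.ClientError:
            await asyncio.sleep(1)  # retry interval; bump this if the site is fragile
    return None  # all retries failed -- caller can log the daycare/center id here

async def fetch_urls(urls):
    connector = aiohttp.TCPConnector(limit_per_host=10)  # cap simultaneous connections
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch_url(session, url) for url in urls))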

Jose Cuervo
Aug 25, 2004

Fender posted:

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:

Python code:
connector = aiohttp.TCPConnector(limit_per_host=10)
aiohttp.ClientSession(connector=connector)
Yes, the fetch_url method will result in None if it fails after 3 retries. I noticed that each url has an id number for the daycare in the params, so you could log which daycares you didn't get a response for and follow up later. Just add something outside the while loop; the code only gets there if all retries fail. You could also adjust the retry interval. I left it at 1 second but a longer delay might help.

Thank you! I am saving the center ID and inspection ID which fail to get a response and plan to try them again.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Fender posted:

Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land.

We had a product that was an API mostly running python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. By default BeautifulSoup uses lxml as its parser, so I think we just cut out the middleman. I always assumed it was just an attempt at resource savings at a large scale.

Two years of that and I'm a super good scraper and I can get a lot done with just a few convenience methods for lxml and some xpath. And I have no idea how to use BeautifulSoup.

I use lxml when needing to iterate over huge lists via XPaths from scraped data. It seems to be the fastest and it ain't that hard. Selenium is slow at finding elements via XPath when you start needing to find hundreds of individual elements.

Also, if you're using Selenium, lxml code can kinda look similar.

I spent multiple years writing and maintaining web scrapers and basically never used BS4.

CarForumPoster fucked around with this message at 10:13 on Apr 5, 2024

rich thick and creamy
May 23, 2005

To whip it, Whip it good
Pillbug
Has anyone played around with Rye yet? I just found it yesterday and am giving it a spin. So far it seems like a pretty nice Poetry alternative.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
Fun little learning project I want to do but need some direction. I want to extract all the video transcripts from a particular YouTube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and a (roughly) sentence-worth of text:

code:
    {
    'text': 'replace the whole thing anyways right so',
     'start': 1331.08,
     'duration': 4.28
    }

I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time, so I need to store everything in some kind of database -- and in a manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take an existing model and provide my text as additional training data)? Can I just point chatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.

Cyril Sneer fucked around with this message at 03:46 on Apr 17, 2024

PierreTheMime
Dec 9, 2004

Hero of hormagaunts everywhere!
Buglord

Cyril Sneer posted:

Fun little learning project I want to do but need some direction. I want to extract all the video transcripts from a particular YouTube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and a (roughly) sentence-worth of text:

code:
    {
    'text': 'replace the whole thing anyways right so',
     'start': 1331.08,
     'duration': 4.28
    }

I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time, so I need to store everything in some kind of database -- and in a manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take an existing model and provide my text as additional training data)? Can I just point chatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.

This sounds like a good use-case for a vector database and retrieval-augmented generation (RAG) and/or semantic search. You can use your dialog text as the target material and the rest as metadata you can retrieve on match. There are a number of free options for the database, including local ChromaDB instances (which use SQLite) or free-tier Pinecone.io, which has good library support and a decent web UI.
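
A minimal sketch of the ChromaDB side, assuming a recent chromadb release (the ids and metadata field names below are made up; the document text is the transcript chunk from the post above):

Python code:
import chromadb

client = chromadb.PersistentClient(path="transcripts_db")  # persisted locally (SQLite under the hood)
collection = client.get_or_create_collection("yt_transcripts")

# One entry per transcript chunk; the video id and start time ride along as metadata
collection.add(
    ids=["vid123-1331"],
    documents=["replace the whole thing anyways right so"],
    metadatas=[{"video_id": "vid123", "start": 1331.08}],
)

# Semantic search: the default embedding function embeds the query and returns
# the closest chunks along with their metadata (i.e. the timestamps you want)
results = collection.query(query_texts=["how do I replace the part?"], n_results=5)
print(results["metadatas"])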

Hed
Mar 31, 2004

Fun Shoe
I just wanted to say I appreciated this error message when I forgot to put in the '-r'.

Bash code:
(venv) $ pip install requirements.txt
ERROR: Could not find a version that satisfies the requirement requirements.txt (from versions: none)
HINT: You are attempting to install a package literally named "requirements.txt" (which cannot exist). Consider using the '-r' flag to install the packages listed in requirements.txt
It sounds like something I would write.

boofhead
Feb 18, 2021

I love the "consider using". It feels very much like "you don't have to go home, but you can't stay here"

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
I have a case where I create two instances of an object via a big configuration dictionary. The difference between the two objects is a single, different value for one key. So, this works:

code:
big_config_dict = { .... }

B = dict(big_config_dict)
B['color'] = 'blue' #this one key is the only difference

thingA = Thing(big_config_dict) #default values
thingB = Thing(B) #single modified value
...but it feels clunky. Am I missing some simpler way to do this?

Chin Strap
Nov 24, 2002

I failed my TFLC Toxx, but I no longer need a double chin strap :buddy:
Pillbug
You might want to do a deepcopy instead of just dict() to copy, but that's all I'd change.

EDIT: If you wanted to be real safe you'd make them frozen too.
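
To illustrate why (the nested 'limits' key here is invented for the example): dict() only copies the top level, so nested values stay shared with the original, while deepcopy copies all the way down.

Python code:
import copy

big_config_dict = {'color': 'red', 'limits': {'min': 0, 'max': 10}}

deep = copy.deepcopy(big_config_dict)
deep['limits']['max'] = 99                  # nested dict was copied too
print(big_config_dict['limits']['max'])     # 10 -- original untouched

shallow = dict(big_config_dict)
shallow['limits']['max'] = 99               # nested dict is shared with the original
print(big_config_dict['limits']['max'])     # 99 -- original silently changed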

boofhead
Feb 18, 2021

If I want to take a base dict and change a value in one line, I'll usually use a spread operator if the structure and changes are simple

Python code:
config_1 = {'val1': 100, 'val2': 200}
# {'val1': 100, 'val2': 200}

config_2 = {**config_1, 'val2': 0}
# {'val1': 100, 'val2': 0}
You can also use it to combine/update multiple dicts, but the more complex the dictionaries or conflicts or changes are, the less appropriate it becomes, e.g.

Python code:
default_person = {"name": "john doe", "age": 50, "mother": {"name": "jane doe", "age": 80}}
default_hobbies = {"hobbies": ["going to church", "reading the bible"]}

sample_person = {**default_person, **default_hobbies}
# {'name': 'john doe', 'age': 50, 'mother': {'name': 'jane doe', 'age': 80}, 'hobbies': ['going to church', 'reading the bible']}

sample_stoner = {**default_person, **default_hobbies, "mother": {"name": "janet doe"}, "hobbies": ["smoking weed"]}
# {'name': 'john doe', 'age': 50, 'mother': {'name': 'janet doe'}, 'hobbies': ['smoking weed']}
# see how it replaces, it doesn't update partials/nested
It's not perfect but I quite like spread operators if the usage is clear and concise. Works for lists too, just use one asterisk:
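
For example (toy values):

Python code:
base = [1, 2, 3]
extended = [*base, 4, 5]
# [1, 2, 3, 4, 5]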

boofhead fucked around with this message at 21:00 on Apr 24, 2024

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
boofhead posted:
If I want to take a base dict and change a value in one line, I'll usually use a spread operator if the structure and changes are simple

Python code:
config_1 = {'val1': 100, 'val2': 200}
# {'val1': 100, 'val2': 200}

config_2 = {**config_1, 'val2': 0}
# {'val1': 100, 'val2': 0}
oooh, okay yeah this is perfect.

QuarkJets
Sep 8, 2008

If your things are dataclasses you can neatly dictionary-ize one while creating the other

Python code:
from dataclasses import asdict, dataclass

@dataclass
class Thing:
    a: int
    b: int
    c: int

thing1 = Thing(a=1, b=2, c=3)
thing2 = Thing(**{**asdict(thing1), 'c': 4})
Or if they're really default values you can skip using dicts entirely:
Python code:
@dataclass
class Thing:
    a: int = 1
    b: int = 2
    c: int = 3

thing1 = Thing()
thing2 = Thing(c=4)
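As an aside (not from the post above): the dataclasses module also ships a replace() helper for exactly this copy-with-one-field-changed pattern, which skips the dict round-trip entirely.

Python code:
from dataclasses import dataclass, replace

@dataclass
class Thing:
    a: int = 1
    b: int = 2
    c: int = 3

thing1 = Thing()
thing2 = replace(thing1, c=4)  # copy of thing1 with only c changed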

Data Graham
Dec 28, 2009

📈📊🍪😋



Wait a minute, I was just using dataclasses for something similar and was going to mention something annoying about them, which is

Python code:
----> 1 thing1 = Thing()

TypeError: __init__() missing 3 required positional arguments: 'a', 'b', and 'c'

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Data classes rule. Use them everywhere.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Falcon2001 posted:

Data classes rule. Use them everywhere.

I’ve met like three functions that should be a class.

QuarkJets
Sep 8, 2008

Data Graham posted:

Wait a minute, I was just using dataclasses for something similar and was going to mention something annoying about them, which is

Python code:
----> 1 thing1 = Thing()

TypeError: __init__() missing 3 required positional arguments: 'a', 'b', and 'c'

You can assign defaults in the dataclass definition; see my example.

Data Graham
Dec 28, 2009

📈📊🍪😋



Oh nice, I should have figured. Thanks!

e: wow I have no idea why it never even occurred to me to try that. Some days my stupidity leaves me breathless

Data Graham fucked around with this message at 03:06 on Apr 27, 2024

Seventh Arrow
Jan 26, 2005

I'm hoping that pyspark is python-adjacent enough to be appropriate for this thread (since the Data Engineering thread doesn't seem to be responding)

I'm applying for a job and Spark is one of the required skills, but I'm fairly rusty. They sprang an assignment on me where they wanted me to take the MovieLens dataset and calculate:

- The most common tag for a movie title and
- The most common genre rated by a user

After lots of time on Stack Overflow and YouTube, this is the script that I came up with. At first, I had something much simpler that just did the assigned task, but I figured that I would also add commenting, error checking, and unit testing because rumor has it that this is what professionals actually do. I've tested it and know it works, but I'm wondering if it's a bit overboard? Feel free to roast.

code:
# Set up the Google Colab environment for pyspark, including integration with Google Drive

from google.colab import drive
drive.mount('/content/drive')

!pip install pyspark

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, row_number, split, explode
from pyspark.sql.window import Window
import unittest

# Initialize the Spark Session for Google Colab
def initialize_spark_session():
    """Initialize and return a Spark session configured for Google Colab"""
    return SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

"""
This code will ingest data from the MovieLens 20M dataset. The goal will be to use Spark to find the following:

- The most common tag for each movie title:
  Utilizing Spark's DataFrame operations, the script joins the movies dataset with the tags dataset on the movie ID.
  After joining, it aggregates tag data by movie title to count occurrences of each tag, then employs a window function
  to rank tags for each movie. The most frequent tag for each movie is identified and selected for display.

- The most common genre rated by each user:
  Similarly, the script joins the movies dataset, which includes genre information, with the ratings dataset on the movie ID.
  It then groups the data by user ID and genre to count how many times each user has rated movies of each genre.
  A window function is again used to rank the genres for each user based on these counts. The top genre for each user
  is then extracted and presented.


"""

def load_dataset(spark, file_path, has_header=True, infer_schema=True):
    """Load the need files, with some error checking for good measure"""
    try:
        df = spark.read.csv(file_path, header=has_header, inferSchema=infer_schema)
        if df.head(1):
            return df
        else:
            raise ValueError("DataFrame is empty, check the dataset or path.")
    except Exception as e:
        raise IOError(f"Failed to load data: {e}")

def calculate_most_common_tag(movies_df, tags_df):
    """Calculates the most common tag for each movie title."""
    try:
        movies_tags_df = movies_df.join(tags_df, "movieId")
        tag_counts = movies_tags_df.groupBy("title", "tag").agg(count("tag").alias("tag_count"))
        window_spec = Window.partitionBy("title").orderBy(col("tag_count").desc())
        most_common_tags = tag_counts.withColumn("rank", row_number().over(window_spec)) \
                             .filter(col("rank") == 1) \
                             .drop("rank") \
                             .orderBy(col("tag_count").desc())
        return most_common_tags
    except Exception as e:
        raise RuntimeError(f"Error calculating the most common tag: {e}")

def calculate_most_common_genre(ratings_df, movies_df):
    """Calculates the most common genre rated by each user."""
    try:
        movies_df = movies_df.withColumn("genre", explode(split(col("genres"), "[|]")))
        ratings_genres_df = ratings_df.join(movies_df, "movieId")
        genre_counts = ratings_genres_df.groupBy("userId", "genre").agg(count("genre").alias("genre_count"))
        window_spec = Window.partitionBy("userId").orderBy(col("genre_count").desc())
        most_common_genres = genre_counts.withColumn("rank", row_number().over(window_spec)) \
                                 .filter(col("rank") == 1) \
                                 .drop("rank") \
                                 .orderBy(col("genre_count").desc())
        return most_common_genres
    except Exception as e:
        raise RuntimeError(f"Error calculating the most common genre: {e}")

def main():
    # Calling Spark session info
    spark = initialize_spark_session()
    # Load CSV files from MovieLens
    movies_df = load_dataset(spark, "/content/drive/My Drive/spark_project/movies.csv")
    tags_df = load_dataset(spark, "/content/drive/My Drive/spark_project/tags.csv")
    ratings_df = load_dataset(spark, "/content/drive/My Drive/spark_project/ratings.csv")

    # Perform analysis
    most_common_tag = calculate_most_common_tag(movies_df, tags_df)
    most_common_genre = calculate_most_common_genre(ratings_df, movies_df)

    # Displaying results
    print("Most Common Tag for Each Movie Title:")
    most_common_tag.show()
    print("Most Common Genre Rated by User:")
    most_common_genre.show()

class TestMovieLensAnalysis(unittest.TestCase):
    '''This will perform some unittests on a truncated set of files in a dedicated testing directory'''
    def setUp(self):
        self.spark = initialize_spark_session()
        self.movies_path = "/content/drive/My Drive/spark_testing/movies.csv"
        self.tags_path = "/content/drive/My Drive/spark_testing/tags.csv"
        self.ratings_path = "/content/drive/My Drive/spark_testing/ratings.csv"

    def test_load_dataset(self):
        # Assuming fake paths to simulate failure
        with self.assertRaises(IOError):
            load_dataset(self.spark, "fakepath.csv")

    def test_calculate_most_common_tag(self):
        movies_df = load_dataset(self.spark, self.movies_path)
        tags_df = load_dataset(self.spark, self.tags_path)
        result_df = calculate_most_common_tag(movies_df, tags_df)
        self.assertIsNotNone(result_df.head(1))

    def test_calculate_most_common_genre(self):
        ratings_df = load_dataset(self.spark, self.ratings_path)
        movies_df = load_dataset(self.spark, self.movies_path)
        result_df = calculate_most_common_genre(ratings_df, movies_df)
        self.assertIsNotNone(result_df.head(1))

if __name__ == "__main__":
    main()
    unittest.main(argv=['first-arg-is-ignored'], exit=False)
The line "unittest.main(argv=['first-arg-is-ignored'], exit=False)" is apparently a necessary quirk when working in a notebook format

nullfunction
Jan 24, 2005

Nap Ghost

Seventh Arrow posted:

At first, I had something much simpler that just did the assigned task, but I figured that I would also add commenting, error checking, and unit testing because rumor has it that this is what professionals actually do. I've tested it and know it works but I'm wondering if it's a bit overboard? Feel free to roast.

I'll preface this with an acknowledgement that I'm not a pyspark toucher so I'm not going to really focus on that. From the point of view of someone who is reviewing your submission, I'm very happy to see comments, error checking, and unit tests! They give me additional insight into how you communicate information about your code and how you go about validating your designs. However, if the assignment was supposed to take you 4 hours and you turn in something that looks like it's had 40 put into it, that isn't necessarily a plus.

Since you're offering it up for a roast, here are some things to consider:
  • Your error handling choices only look good on the surface. Yes, you've made an error message slightly fancier by adding some text to it; yes, the functions will always raise errors of a consistent type. They also don't react in any meaningful way to handle the errors that might be raised or enrich any of the error messages with context that would be useful to an end user (or logging system). You could argue that they make the software worse because they swallow the stack trace that might contain something meaningful (because they don't raise from the base exception).
  • Docstrings for each function are a good practice. Some docstring formats have a description for each argument, raised exception, and a description of the return value, and I tend to like these because it's helpful to tie relevant info directly to an argument. The docstrings you wrote contain the function's name, just reworded, and are not useful. In fact, you could take the tiny bit of extra information from the docstring and put it back into the function name and have a function name that is even better than the one you started with. calculate_most_common_genre_by_user is better than calculate_most_common_genre (but I would probably go with top_genre_by_user personally).
  • Normalize your use of single or double quotes. Run it through a formatter like black or ruff. The best case is that the reviewer doesn't notice or thinks it's a bit sloppy, the worst case is that they assume you copied it from two different websites.
  • You don't have any type annotations in your arguments or on your return values. Help your IDE help you.
  • Your unit tests check that an answer was returned, but stop short of actually seeing if that answer is correct. Constructing a tiny fake dataset with some known answers for your unit tests is a great way to validate that, and it seems like that's what you did from the filenames, but it seems unfinished (see the sketch after this list).
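
To make the naming/typing/docstring/known-answer points concrete, here's a rough sketch rather than a drop-in rewrite (the tiny DataFrames are invented purely so the test has a known answer):

Python code:
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, count, row_number
from pyspark.sql.window import Window

def top_tag_by_title(movies_df: DataFrame, tags_df: DataFrame) -> DataFrame:
    """Return the most common tag for each movie title.

    Args:
        movies_df: must contain columns `movieId` and `title`.
        tags_df: must contain columns `movieId` and `tag`.

    Returns:
        One row per title with columns `title`, `tag`, `tag_count`.
    """
    counts = (movies_df.join(tags_df, "movieId")
              .groupBy("title", "tag")
              .agg(count("tag").alias("tag_count")))
    window = Window.partitionBy("title").orderBy(col("tag_count").desc())
    return (counts.withColumn("rank", row_number().over(window))
            .filter(col("rank") == 1)
            .drop("rank"))

def test_top_tag_by_title():
    spark = SparkSession.builder.master("local").getOrCreate()
    movies = spark.createDataFrame([(1, "Toy Story")], ["movieId", "title"])
    tags = spark.createDataFrame([(1, "pixar"), (1, "pixar"), (1, "funny")], ["movieId", "tag"])
    result = top_tag_by_title(movies, tags).collect()
    # Known answer: "pixar" appears twice for Toy Story, "funny" only once
    assert result[0]["title"] == "Toy Story"
    assert result[0]["tag"] == "pixar"
    assert result[0]["tag_count"] == 2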

It's clear that you've seen good code before and have some idea of what it should look like when trying to write it for yourself, but you're missing fundamentals and experience that will allow you to actually write it. To be clear this is a fine place to be for a junior. If this is for a junior role, submitted as-is it's a hire from me but there's a lot of headroom for another junior to impress me over this submission.

QuarkJets
Sep 8, 2008

I don't like how the function code is all wrapped in large try/except blocks that raise new errors. That's not really error checking, it's error obfuscation; yes, you print the caught exception object, but it's hiding the stack trace for no good reason. If you absolutely felt like you had to add more context to exceptions, like if you wanted to use logging or send a notification to someone, then you could use exception chaining to raise the original exception
Python code:
import logging

def test():
    try:
        raise RuntimeError('foo')
    except RuntimeError as e:
        logging.warning(f'gently caress, I saw exception {e}!!!')
        raise  # <-- This simple line raises the original exception as though it wasn't caught at all
But I think you should ditch the try/except blocks entirely, they're not helping you and bare Exception catching (where you catch the base Exception type instead of a more specific type) is a code smell.
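
For completeness, explicit chaining looks like this (the error type below is hypothetical, purely a sketch): "raise ... from e" keeps the original traceback attached as __cause__ instead of discarding it.

Python code:
class DataLoadError(Exception):
    """Hypothetical domain-specific error, purely for illustration."""

def load():
    try:
        raise RuntimeError('foo')
    except RuntimeError as e:
        # The original RuntimeError and its traceback stay attached as __cause__
        raise DataLoadError('failed while loading the dataset') from e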

Don't run pip install commands in your notebook code. A comment that describes what's needed is fine.

Seventh Arrow
Jan 26, 2005

Thanks guys, much appreciated! I will look into your suggestions.

StumblyWumbly
Sep 12, 2007

Batmanticore!
Any recommendations for places to start with interfacing DLLs and Python?
I'm interested in playing around with this and trying to make some file parsing stuff faster by reading binary data in C, then passing it to Python for the higher-level stuff.
I've used ctypes before, but I'm under the impression DLL stuff has changed in the last 4 versions or so, and I'm worried searching will suggest bad habits.
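
Not a full answer, but the basic ctypes pattern hasn't really changed; a minimal sketch (the DLL name and function signature below are invented for illustration):

Python code:
import ctypes

# Hypothetical parser library -- substitute the real DLL path and exports
lib = ctypes.CDLL("./fastparse.dll")

# Declare the C signature so ctypes converts arguments and the return value correctly
lib.parse_records.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
lib.parse_records.restype = ctypes.c_int

data = b"\x00\x01\x02\x03"
count = lib.parse_records(data, len(data))
print(f"parsed {count} records")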
