Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

ultrafilter posted:

Replication crisis in AI/ML when?

It's already a massive problem: not only is it hugely difficult to replicate/reproduce results in AI/ML itself (only about 30% of ML papers are reproducible), but pretty much all research that uses ML techniques becomes harder to reproduce too, so it's contributing to a larger replication crisis across all of science.

dexefiend
Apr 25, 2003

THE GOGGLES DO NOTHING!
In addition to PCA, try T-SNE. It is loving amazing.


You might need to do some matrix simplification before T-SNE, but...


Seriously, try T-SNE.

t-SNE: t-distributed Stochastic Neighbor Embedding
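
A minimal sketch of the PCA-then-T-SNE combo with scikit-learn, if you want something to paste in (the data and parameter values here are purely illustrative):

code:
    # PCA down to ~50 dims first, then T-SNE to 2D for plotting.
    # T-SNE is slow on raw high-dimensional input, hence the PCA pre-step.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X = np.random.rand(1000, 300)   # stand-in for your real data matrix

    X_50 = PCA(n_components=50).fit_transform(X)
    X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X_50)
    # X_2d is (1000, 2): scatter-plot it and look for clusters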

Oysters Autobio
Mar 13, 2017

some kinda jackal posted:

Ehh so I'm kind of backpedaling on my "no python" stance and I'm just hunting for a "just enough python to be useful in ML" type tutorial/resource that will let me competently follow/code along with random tutorials. It looks like Anaconda has a few good leads on practical tutorials or quick courses.

Ultimately I'm happy to eat a "told you so", because there's something to be said for speaking the native language of a space when you're trying to learn it.

But thanks to everyone and this thread for a bunch of high level links on concepts, etc. The more I poke at the concepts the deeper I want to go, and some of that is putting time into math, which I have the same amount of interest in as python, and then being able to put things into practice will be a nice way to validate my knowledge.

Anyway, a lot of words to say thanks.

Hey opposite-me. I absolutely love everything to do with data visualization, UI and UX but just have no interest in learning JavaScript.

The closest I came to being tempted by the dark side was Svelte, because drat I love how simple and straightforward it is. But I just have a weird visceral reaction to JavaScript once I spend much time on it. Having learned Python as my first and only language, it must be the curly braces.

Oysters Autobio
Mar 13, 2017
Sorry for the double post, but it's a different topic altogether.

What are people's experiences so far for using LLMs for things like named entity extraction and using it for wrangling unstructured data?

We have a poo poo ton of legacy Word docx "forms" (i.e. literally just tables in a doc, no XML fields or whatever), and I've tested converting a few of them to JSON using Mixtral, without any context priming or anything, and was pretty impressed with the results. As a data analyst I'm often asked "hey, can you analyze this folder of completely unstructured Word docs?". I usually manage to avoid it if it isn't something that's easily converted to CSV, but I could see this being useful for this legacy poo poo.

(Though I worry that it'll just perpetuate the business not giving a poo poo about APIs or even just spreadsheets for god's sake)
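
For the curious, a rough sketch of the kind of thing I tried (python-docx for the table text; the prompt and the Mixtral endpoint are stand-ins, not my actual setup):

code:
    # Pull the table text out of a legacy .docx "form" and build a prompt
    # asking an LLM to structure it as JSON.
    from docx import Document  # pip install python-docx

    def table_text(path):
        doc = Document(path)
        rows = []
        for table in doc.tables:
            for row in table.rows:
                rows.append(" | ".join(cell.text.strip() for cell in row.cells))
        return "\n".join(rows)

    prompt = ("Convert the following form fields into a flat JSON object. "
              "Reply with JSON only.\n\n" + table_text("legacy_form.docx"))
    # send `prompt` to whatever Mixtral endpoint you run (llama.cpp, vLLM,
    # an OpenAI-compatible API, ...) and json.loads() the reply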

shrike82
Jun 11, 2005

yeah, mixtral excels at a lot of traditional NLP tasks like NER, sentiment analysis, etc. on unstructured data.
in terms of open-source models, llama-70b wasn't good enough, but mixtral is the first LLM that might be "good enough" for a lot of zero-shot/few-shot tasks where you previously had to finetune a BERT-type transformer model.

the big challenge right now for businesses trying to leverage LLMs is building a pipeline for their data (.ppt, .xlsx, .doc, .pdf, sharepoint connectors etc.) into RAG/LLM solutions as well as out of the model(s)

yigh
Jan 3, 2021

Bruegels Fuckbooks posted:

It's already a massive problem: not only is it hugely difficult to replicate/reproduce results in AI/ML itself (only about 30% of ML papers are reproducible), but pretty much all research that uses ML techniques becomes harder to reproduce too, so it's contributing to a larger replication crisis across all of science.

In fact it's problematic both in academia and industry (hey boss I added 10% more params to the model and shipped it... what's overfitting?)

Charles 2 of Spain
Nov 7, 2017

Charles 2 of Spain posted:

Not sure where to post this so I'll put it here.

I'm streaming audio data from a microphone and am continuously running a Pytorch model on it (speech recognition and other stuff). On a Linux laptop with Ubuntu and a GeForce GPU the inference time is around 8ms, which is nice and fast. When I run the exact same code and model on a Windows desktop also with a GeForce GPU the inference time is around 20ms, more than twice as slow.

What could be the reason for this? GPUs are the same on both systems and they are both being used as far as I can tell. I would understand a slight difference in performance depending on the operating system but this is quite large. Is it something to do with how the model was trained?
OK, so on Windows the model slows down if the console window used to launch the program isn't in focus. For example, if I minimize the window it grinds to a halt, but as soon as I bring it back up it starts running smoothly.

:wtc:

MrMoo
Sep 14, 2000

Windows gives priority to the foreground window; there used to be options, like on the server OSes, to change that.

Charles 2 of Spain
Nov 7, 2017

Lol it appears that I fixed this by turning off Hardware-Accelerated GPU Scheduling.

pmchem
Jan 22, 2010


I'm pondering the following sort of ML problem: a game/simulation with many independent non-interacting agents each acting according to the same exact model, continuous input space, continuous output space, dynamic environment, continuous (real) reward function evaluated ONLY at the end of the game/simulation (not per step in the game/simulation). Reward function cannot be used to calculate gradient of model parameters (e.g., no backprop). Assume solution is, say, a pytorch implementation of whatever flavor NN you desire.

What training strategies might you consider other than neuroevolution?

mightygerm
Jun 29, 2002



pmchem posted:

I'm pondering the following sort of ML problem: a game/simulation with many independent non-interacting agents each acting according to the same exact model, continuous input space, continuous output space, dynamic environment, continuous (real) reward function evaluated ONLY at the end of the game/simulation (not per step in the game/simulation). Reward function cannot be used to calculate gradient of model parameters (e.g., no backprop). Assume solution is, say, a pytorch implementation of whatever flavor NN you desire.

What training strategies might you consider other than neuroevolution?

Sounds like a Q-learning or PPO problem to me. They should be able to learn a policy even when the reward is zero until the end of an episode.
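
As a toy illustration of why terminal-only reward is fine for policy gradients, here's a minimal REINFORCE sketch (the simplest relative of PPO; the 1-D task is a stand-in, not your setup):

code:
    # Steer a 1-D point toward the origin; reward exists ONLY at episode end.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    policy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
    log_std = nn.Parameter(torch.zeros(1))
    opt = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=1e-2)

    for episode in range(200):
        pos = torch.randn(1)              # random start state
        log_probs = []
        for _ in range(10):
            dist = torch.distributions.Normal(policy(pos), log_std.exp())
            action = dist.sample()        # sampled, so no grad through the env
            log_probs.append(dist.log_prob(action).sum())
            pos = pos + action            # environment transition
        reward = -pos.abs().item()        # terminal-only, non-differentiable
        # score-function estimator: every step shares the terminal reward
        loss = -reward * torch.stack(log_probs).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

The gradient flows through the action log-probs, never through the reward itself, which is exactly what you need when the reward can't be backpropped.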

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


You might look into Bayesian optimization as an alternative to any RL-based approach.
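
A minimal sketch of that idea with scikit-optimize, treating the end-of-game reward as a black-box function of a small policy-parameter vector (the objective below is a stand-in):

code:
    # Bayesian optimization: a GP surrogate picks where to evaluate next,
    # so no gradients of the reward are ever needed.
    from skopt import gp_minimize

    def negative_reward(params):
        # stand-in: really you'd run the full simulation with these policy
        # parameters and return -reward (gp_minimize minimizes)
        return sum((p - 0.3) ** 2 for p in params)

    result = gp_minimize(
        negative_reward,
        dimensions=[(-1.0, 1.0)] * 4,   # 4 continuous parameters, illustrative
        n_calls=50,
    )
    print(result.x, -result.fun)

The usual caveat: GP-based BO gets expensive beyond a few dozen parameters, so it suits small policies better than full-size networks.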

pmchem
Jan 22, 2010


mightygerm posted:

Sounds like a Q-learning or PPO problem to me. They should be able to learn a policy even when the reward is zero until the end of an episode.

yeah, I was considering NAF Q-learning for the continuous space, but I thought that required a loss/reward at each iteration, not just at the end of an episode (see algo 1 in Gu et al.). guess I'll poke at other variants.

ultrafilter posted:

You might look into Bayesian optimization as an alternative to any RL-based approach.

already solved things that way

anyone have a favorite actor/critic approach for delayed rewards?

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
Are there any LLM/NLP gurus in here? Asking before I make a big effort post.

mightygerm
Jun 29, 2002



Cyril Sneer posted:

Are there any LLM/NLP gurus in here? Asking before I make a big effort post.

I’m familiar with the use and deployment of LLMs, training one from scratch not so much.

Entropist
Dec 1, 2007
I'm very stupid.
I do NLP research including LLM evaluation but yeah, that doesn't mean I know everything about it and of course I don't have the resources to make my own.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
I'll just go ahead and post anyway. This is a cross post from the Python thread:

Cyril Sneer posted:

Fun little learning project I want to do but need some direction. I want to extract all the video transcripts from a particular YouTube channel and make them both keyword- and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and (roughly) a sentence's worth of text:

code:
    {
    'text': 'replace the whole thing anyways right so',
     'start': 1331.08,
     'duration': 4.28
    }

I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time, so I need to store everything in some kind of database -- and in a manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer-learning thing (i.e., take an existing model and provide my text as additional training data)? Can I just point chatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.

To elaborate, here is a slightly longer example of how the transcript data is returned (these being six dictionary entries from a single transcript's list of dicts):

code:
    [
    {'text': 'for all of you engineer types that are', 'start': 1503.84, 'duration': 2.4},
    {'text': 'watching here', 'start': 1505.36, 'duration': 2.799},
    {'text': "you know i'm going to tell you of course", 'start': 1506.24, 'duration': 4.319},
    {'text': 'the lead itself will act as an inductor', 'start': 1508.159, 'duration': 3.76},
    {'text': 'at a certain frequency', 'start': 1510.559, 'duration': 3.041},
    {'text': 'but for most of the frequencies that', 'start': 1511.919, 'duration': 4.0}
    ]
As you can see, YT breaks up the text every few words. I'm working with ~260 videos at about 40 mins in length each. Okay, with that in mind, I've been looking at two tutorials that set things up in different ways, namely:


(1) https://mlops.community/how-to-build-your-first-semantic-search-system-my-step-by-step-guide-with-code/

Here, the case study is performing a semantic search on a corpus of paper abstracts. Each entry in the database is a (long) string of the abstract, along with metadata (author, title, etc.). In my case, it's easy enough to build a full transcript string from the sub-strings, but then I lose the timestamp info. I suppose if the search strategy can identify a location in the full string (rather than just identifying the transcript with the best match), I could do a reverse look-up on those strings.



(2) https://medium.com/ai-science/build-semantic-search-applications-using-open-source-vector-database-chromadb-a15e9e7f14ce

In this case, the author shows how to add entries in the form of sentences (along with metadata), which is fairly close to what I have. From their example, I might do something like:

code:
    collection.add(
        documents=["for all of you engineer types that are", "watching here", "you know i'm going to tell you of course"],
        metadatas=[{"video": "aabbcc", "start": 1503.84}, {"video": "aabbcc", "start": 1505.36}, {"video": "aabbcc", "start": 1506.24}],
        ids=["aabbcc-0", "aabbcc-1", "aabbcc-2"]  # ids must be unique, so the shared video id goes in the metadata
    )
What's unclear to me here, though, is whether I'll lose a lot of contextual information, since the model/engine/tool doesn't 'see' the whole video transcript (or maybe it does via the IDs?). Or, put another way, does breaking up a transcript in this way sacrifice contextual understanding?


Next, if I'm understanding things correctly, neither of the above approaches actually relies on LLMs; they're just using similarity metrics within the embedding space?


Hopefully I've explained this well enough... I would love some direction on how to make this all work! I'm fine with packaged solutions as long as they're free/open source.

Cyril Sneer fucked around with this message at 18:02 on Apr 17, 2024

mightygerm
Jun 29, 2002



Yeah, you’re looking for a vector database. You can manually add the timestamp/video id as metadata to retrieve them alongside your sentence.

There’s a couple good open source implementations like weaviate and lancedb.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

mightygerm posted:

Yeah, you’re looking for a vector database. You can manually add the timestamp/video id as metadata to retrieve them alongside your sentence.

There’s a couple good open source implementations like weaviate and lancedb.

What do you think of chromadb, used in the example I linked to?

Also still unclear how these vector databases address this:

Cyril Sneer posted:

What's unclear to me here, though, is whether I'll lose a lot of contextual information, since the model/engine/tool doesn't 'see' the whole video transcript (or maybe it does via the IDs?). Or, put another way, does breaking up a transcript in this way sacrifice contextual understanding?

Finding a sentence is different than understanding a broader context within a paragraph, or the topical nature of a particular video at large.

Entropist
Dec 1, 2007
I'm very stupid.
I would say that using an LLM is overkill for this use case.

As for the context size, it really depends on what kind of units of information you're interested in. If it's mainly words, you can use pretrained word embeddings, and context doesn't matter much: a word's contextual meaning will come mainly from the pretraining data, not from your context. Otherwise you can make document embeddings, either from the few-word snippets or by merging the snippets into full sentences or chapters of videos. It's really up to you at what level you want a searchable semantic representation. If more contextual info is needed, you could also include, for example, the video title in each document embedding.

shrike82
Jun 11, 2005

Vector DBs (and probably LLMs too) are overkill for the first step of a hobbyist semantic search project.

Look into embedding your logical units of text starting with small language models (look up sentence-transformers), saving those vectors as numpy arrays then doing a similarity search with your query

You can move onto Vector DBs and big-L LMs for embedding if you need more accuracy and are really dealing with big data.
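
a minimal sketch of that recipe (the model name is one common pick, the chunks are made up):

code:
    # embed chunks with a small sentence-transformer, keep the vectors in a
    # numpy array, brute-force cosine similarity at query time
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["the lead itself will act as an inductor at a certain frequency",
              "replace the whole thing anyways right so"]
    vecs = model.encode(chunks, normalize_embeddings=True)  # unit-length rows
    np.save("transcript_vecs.npy", vecs)                    # reload with np.load

    query = model.encode(["parasitic inductance of component leads"],
                         normalize_embeddings=True)
    scores = vecs @ query.T            # cosine similarity via dot product
    print(chunks[int(scores.argmax())])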

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

shrike82 posted:

Look into embedding your logical units of text starting with small language models (look up sentence-transformers), saving those vectors as numpy arrays then doing a similarity search with your query

I've implemented this now with chromadb, following that tutorial in link #2. It works, but it's not quite giving me what I want. As I suspected, because my text samples are just these short word strings, it more or less just seems to be acting as a word search.

I found a nice demo here, also using YouTube transcripts. They address the above concern by chunking together 20 entries to make longer text samples, and they also use a rolling-window approach:

https://github.com/lancedb/lancedb/blob/main/docs/src/notebooks/youtube_transcript_search.ipynb

However I think they're still just relying on sentence-transformers and similarity searches.


shrike82 posted:

You can move onto Vector DBs and big-L LMs for embedding if you need more accuracy and are really dealing with big data.

I'd like to start exploring this so if you've got some suggestions for next steps that'd be great :)

shrike82
Jun 11, 2005

it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily embedding performance.
think about how you could further preprocess the raw transcripts so that a chunk is meaty enough for search to work

1. youtube captioning, and video captioning more generally, sticks words together so that they appear "in time" and neatly as one or two lines of text visually. you can apply some further heuristics around the timestamps to gather captions into larger logical units - e.g., logical chunks separated by time gaps of > k seconds

2. do some speaker id (off-the-shelf models google-able) to separate text chunks by speakers
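
e.g., a rough sketch of the gap heuristic in (1) (field names from the transcript dicts upthread; k is whatever threshold you tune):

code:
    # merge caption entries into larger chunks; start a new chunk whenever
    # the silence between consecutive captions exceeds k seconds
    def chunk_captions(entries, k=2.0):
        chunks, current = [], [entries[0]]
        for prev, cur in zip(entries, entries[1:]):
            gap = cur["start"] - (prev["start"] + prev["duration"])
            if gap > k:
                chunks.append(current)
                current = []
            current.append(cur)
        chunks.append(current)
        return [{"start": c[0]["start"],
                 "text": " ".join(e["text"] for e in c)}
                for c in chunks]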

anyway, if you really want to try alternative embedders:

a. bge - considered SOTA open source/weights a couple months back: https://huggingface.co/BAAI/bge-large-en-v1.5
b. use one of the cloud embedding APIs (e.g., openai's)

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

shrike82 posted:

it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily embedding performance.
think about how you could further preprocess the raw transcripts so that a chunk is meaty enough for search to work

1. youtube captioning, and video captioning more generally, sticks words together so that they appear "in time" and neatly as one or two lines of text visually. you can apply some further heuristics around the timestamps to gather captions into larger logical units - e.g., logical chunks separated by time gaps of > k seconds

2. do some speaker id (off-the-shelf models google-able) to separate text chunks by speakers


Right, that was my hypothesis above: the transcript 'units' are too small to be of much contextual value. For (2), luckily my target channel is just one person.

I'm going to try the idea of chunking together 20 sentences with a rolling window of, say, 5, as was done in that project I linked above.
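
Something like this, presumably (sizes straight from that demo; just a sketch):

code:
    # overlapping chunks: 20 caption entries per chunk, advancing 5 per step
    def rolling_chunks(entries, size=20, stride=5):
        for i in range(0, max(len(entries) - size, 0) + 1, stride):
            window = entries[i:i + size]
            yield {"start": window[0]["start"],
                   "text": " ".join(e["text"] for e in window)}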



shrike82 posted:

anyway, if you really want to try alternative embedders:

a. bge - considered SOTA open source/weights a couple months back: https://huggingface.co/BAAI/bge-large-en-v1.5
b. use one of the cloud embedding APIs (e.g., openai's)

That lancedb GitHub link I posted actually does use an OpenAI embedding, so I'll queue that up to try as well (they used it along with the chunking approach I described above).

Oysters Autobio
Mar 13, 2017
Any good examples of ML applications for data management / warehousing?

For example, helping with identifying and standardizing column headers across multiple datasets or spreadsheets?

I know all the rage right now is LLMs and unstructured data, but for BI analysts and data analysts who mainly work with messy-but-still-structured data sources, I'm trying to find applications in that space, if any exist.
