|
ultrafilter posted:Replication crisis in AI/ML when?

It's a massive problem already because not only is there a huge problem with replicating/reproducing results in AI/ML (only about 30% of ML papers are reproducible), but pretty much all research that uses ML techniques becomes more difficult to reproduce/replicate, so it's contributing to a larger replication crisis in all of science.
|
# ? Mar 19, 2024 22:22 |
|
In addition to PCA, try t-SNE. It is loving amazing. You might need to reduce the dimensionality first (e.g. PCA down to ~50 components) before t-SNE, but... Seriously, try t-SNE. t-SNE: t-distributed Stochastic Neighbor Embedding.
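To make that concrete, a minimal sketch with scikit-learn's implementation (the cluster data here is made up):

```python
# Toy example: embed two well-separated 10-D clusters into 2-D with t-SNE.
# Note perplexity must be smaller than the number of samples.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(25, 10)),   # cluster A
    rng.normal(5.0, 0.1, size=(25, 10)),   # cluster B
])

emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # one 2-D point per input row
```

The 2-D `emb` array is what you'd scatter-plot; the two clusters come out clearly separated.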
|
# ? Mar 20, 2024 14:45 |
|
some kinda jackal posted:Ehh so I'm kind of backpedaling on my "no python" stance and I'm just hunting for a "just enough python to be useful in ML" type tutorial/resource that will let me competently follow/code along with random tutorials. It looks like Anaconda has a few good leads on practical tutorials or quick courses.

Hey, opposite-me. I absolutely love everything to do with data visualization, UI and UX, but I just have no interest in learning JavaScript. The closest I came to being tempted to the dark side was Svelte, because drat I love how simple and straightforward it is. But I just have a weird visceral reaction to JavaScript once I spend much time on it. Having learned Python as my first and only language, it must be the curly braces.
|
# ? Mar 21, 2024 01:48 |
|
Sorry for the double post but it's a different topic altogether. What are people's experiences so far using LLMs for things like named entity extraction and wrangling unstructured data? We have a poo poo ton of legacy Word docx "forms" (i.e. literally just tables in a doc, no XML fields or whatever), which I've tested a few times, and I was pretty impressed with the results converting them to JSON using Mixtral without any context priming or anything. As a data analyst I'm often asked "hey, can you analyze this folder of completely unstructured Word docs?". I usually manage to avoid it if it isn't something that's easily converted to CSV, but I could see this being useful for this legacy poo poo. (Though I worry that it'll just perpetuate the business not giving a poo poo about APIs or even just spreadsheets, for god's sake.)
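For the docx side, here's a sketch of how you might pull those tables out before handing anything to an LLM. A .docx is just a zip whose body lives in word/document.xml, so the stdlib is enough; the one-key/value-pair-per-row assumption and all the names here are mine, not anything from the actual files:

```python
# Extract table rows from the main document XML of a .docx "form" and dump
# two-column rows as key/value JSON. To get document_xml from a real file:
#   zipfile.ZipFile("form.docx").read("word/document.xml")
import json
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def table_rows(document_xml):
    """Return each table row (<w:tr>) as a list of cell texts."""
    root = ET.fromstring(document_xml)
    rows = []
    for tr in root.iter(W + "tr"):
        # a cell's text is spread across <w:t> runs; join them back up
        rows.append(["".join(t.text or "" for t in tc.iter(W + "t"))
                     for tc in tr.iter(W + "tc")])
    return rows

def rows_to_json(rows):
    """Treat two-column rows as key/value pairs, as in these legacy forms."""
    return json.dumps({r[0]: r[1] for r in rows if len(r) == 2})
```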
|
# ? Mar 21, 2024 23:48 |
|
yeah mixtral excels in a lot of traditional NLP tasks like NER, sentiment analysis etc. on unstructured data. in terms of open source models, llama-70b wasn't good enough but mixtral is the first LLM that might be "good enough" for a lot of zero shot/few shot tasks where you previously had to finetune a BERT-type transformer model. the big challenge right now for businesses trying to leverage LLMs is building a pipeline for their data (.ppt, .xlsx, .doc, .pdf, sharepoint connectors etc.) into RAG/LLM solutions as well as out of the model(s)
|
# ? Mar 22, 2024 00:38 |
|
Bruegels Fuckbooks posted:It's a massive problem already because not only is there a huge problem with replicating/reproducing results in AI/ML (only about 30% of ML papers are reproducible), pretty much all research that uses ML techniques becomes more difficult to reproduce/replicate, so it's contributing to a larger replication crisis in all science.

In fact it's problematic both in academia and industry (hey boss, I added 10% more params to the model and shipped it... what's overfitting?)
|
# ? Mar 31, 2024 23:22 |
|
Charles 2 of Spain posted:Not sure where to post this so I'll put it here.
|
# ? Apr 2, 2024 05:23 |
|
Windows prioritizes the foreground window; there used to be options, like on the server OSes, to change that.
|
# ? Apr 2, 2024 15:37 |
|
Lol it appears that I fixed this by turning off Hardware-Accelerated GPU Scheduling.
|
# ? Apr 4, 2024 07:58 |
|
I'm pondering the following sort of ML problem: a game/simulation with many independent non-interacting agents each acting according to the same exact model, continuous input space, continuous output space, dynamic environment, continuous (real) reward function evaluated ONLY at the end of the game/simulation (not per step in the game/simulation). Reward function cannot be used to calculate gradient of model parameters (e.g., no backprop). Assume solution is, say, a pytorch implementation of whatever flavor NN you desire. What training strategies might you consider other than neuroevolution?
|
# ? Apr 10, 2024 20:30 |
|
pmchem posted:I'm pondering the following sort of ML problem: a game/simulation with many independent non-interacting agents each acting according to the same exact model, continuous input space, continuous output space, dynamic environment, continuous (real) reward function evaluated ONLY at the end of the game/simulation (not per step in the game/simulation). Reward function cannot be used to calculate gradient of model parameters (e.g., no backprop). Assume solution is, say, a pytorch implementation of whatever flavor NN you desire.

Sounds like a Q-learning or PPO problem to me. They should be able to learn a policy even when the reward function is null until the end of an episode.
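A toy illustration of why the terminal-only reward isn't a blocker: tabular Q-learning on a 5-state chain whose only nonzero reward arrives at the goal state. The bootstrapped target propagates that terminal reward backwards through the value estimates. (The original problem is continuous, so this is just the discrete intuition; NAF, DDPG or PPO play the same role there.)

```python
# Chain MDP: states 0..4, reward 1.0 only on reaching state 4 (terminal).
# Behavior policy is purely random; off-policy Q-learning still recovers
# the optimal "always go right" policy from the end-of-episode reward alone.
import random

N_STATES, GOAL = 5, 4
ALPHA, GAMMA = 0.5, 0.9
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # actions: 0 = left, 1 = right

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        a = random.randrange(2)            # explore at random
        s2, r, done = step(s, a)
        # bootstrapped backup: terminal reward flows back through max Q
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q[:GOAL]])  # state values rise toward the goal
```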
|
# ? Apr 10, 2024 20:58 |
|
You might look into Bayesian optimization as an alternative to any RL-based approach.
|
# ? Apr 10, 2024 22:01 |
|
mightygerm posted:Sounds like a Q-learning or PPO problem to me. They should be able to learn a policy even when the reward function is null until the end of an episode.

yeah, was considering NAF Q-learning for the continuous space but I thought that required loss/reward at each iteration, not just at the end of an episode (see Algorithm 1 in Gu et al.). guess I'll poke at other variants.

ultrafilter posted:You might look into Bayesian optimization as an alternative to any RL-based approach.

already solved things that way. anyone have a favorite actor/critic approach for delayed rewards?
|
# ? Apr 10, 2024 22:18 |
|
Are there any LLM/NLP gurus in here? Asking before I make a big effort post.
|
# ? Apr 17, 2024 16:18 |
|
Cyril Sneer posted:Are there any LLM/NLP gurus in here? Asking before I make a big effort post.

I'm familiar with the use and deployment of LLMs; training one from scratch, not so much.
|
# ? Apr 17, 2024 16:38 |
|
I do NLP research including LLM evaluation but yeah, that doesn't mean I know everything about it and of course I don't have the resources to make my own.
|
# ? Apr 17, 2024 17:46 |
|
I'll just go ahead and post anyway. This is a cross post from the Python thread:

Cyril Sneer posted:Fun little learning project I want to do but need some direction. I want to extract all the video transcripts from a particular youtube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

To elaborate, here is a slightly longer example of how the transcript data is returned (this being 6 dictionary entries in a single transcript's list of dicts):
code:
(1) https://mlops.community/how-to-build-your-first-semantic-search-system-my-step-by-step-guide-with-code/ Here, the case study is performing a semantic search on a corpus of paper abstracts. Each entry in the database is a (long) string of the abstract, along with metadata (author, title, etc.). In my case, it's easy enough to build a full transcript string from the sub-strings, but then I lose the timestamp info. I suppose if the search strategy can identify a location in the full string (rather than just identifying the transcript with the best match), I could do a reverse look-up on those strings.

(2) https://medium.com/ai-science/build-semantic-search-applications-using-open-source-vector-database-chromadb-a15e9e7f14ce In this case, the author shows how to add entries in the form of sentences (along with metadata), which is fairly close to what I have. From their example, I might do something like:
code:
Next, if I'm understanding things correctly, neither of the above approaches actually relies on LLMs, but rather just on similarity metrics within the embedding space? Hopefully I've explained this well enough... would love some direction on how to make this all work! I'm fine with packaged solutions so long as they're free/open source.

Cyril Sneer fucked around with this message at 18:02 on Apr 17, 2024 |
# ? Apr 17, 2024 18:00 |
|
Yeah, you’re looking for a vector database. You can manually add the timestamp/video ID as metadata and retrieve them alongside your sentence. There are a couple of good open source implementations like weaviate and lancedb.
|
# ? Apr 17, 2024 20:03 |
|
mightygerm posted:Yeah, you’re looking for a vector database. You can manually add the timestamp/video id as metadata to retrieve them alongside your sentence.

What do you think of chromadb, used in the example I linked to? Also still unclear how these vector databases address this:

Cyril Sneer posted:What's unclear to me here though is if I'll lose a lot of contextual information since the model/engine/tool doesn't 'see' the whole video transcript (or maybe it does via the IDs?).

Or, put another way, does breaking up a transcript in this way sacrifice contextual understanding? Finding a sentence is different from understanding a broader context within a paragraph, or the topical nature of a particular video at large.
|
# ? Apr 17, 2024 20:20 |
|
I would say that using an LLM is overkill for this use case. As for the context size, it really depends on what kind of units of information you are interested in. If it's mainly words, you can use pretrained word embeddings, and context doesn't matter much, as the word's contextual meaning will come mainly from the pretraining data and not much from your context. Otherwise you can make document embeddings, either from the few-word snippets, or by merging the snippets into full sentences or chapters of videos. It's really up to you at what level you want to have a searchable semantic representation. If more contextual info is needed, you could also include, for example, the video title in each document embedding, or whatever.
|
# ? Apr 17, 2024 20:26 |
|
Vector DBs (and probably LLMs too) are overkill for the first step of a hobbyist semantic search project. Look into embedding your logical units of text with small language models (look up sentence-transformers), saving those vectors as numpy arrays, then doing a similarity search with your query.

You can move on to vector DBs and big-L LMs for embedding if you need more accuracy and are really dealing with big data.
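Something like this is the whole first iteration. The toy vectors stand in for real sentence embeddings (in practice they'd come from something like sentence-transformers' `model.encode`), and the metadata scheme is just one way to keep the link back to a video:

```python
import numpy as np

def cosine_top_k(query_vec, corpus_vecs, k=3):
    """Indices of the k corpus rows most similar to the query vector."""
    corpus = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(corpus @ q)[::-1][:k]

# toy stand-ins for real embeddings, with (video_id, timestamp) kept in a
# parallel list so a hit maps straight back to a spot in a video
meta = [("vid1", 12.5), ("vid1", 60.0), ("vid2", 3.0)]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
best = cosine_top_k(np.array([0.1, 1.0]), vecs, k=1)[0]
print(meta[best])  # -> ('vid1', 60.0)
```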
|
# ? Apr 18, 2024 07:51 |
|
shrike82 posted:Look into embedding your logical units of text starting with small language models (look up sentence-transformers), saving those vectors as numpy arrays then doing a similarity search with your query

I've implemented this now with chromadb, following that tutorial in link #2. It works, but it's not quite giving me what I want. As I suspected, because my text samples are just these short word strings, it more or less just seems to be acting as a word search.

I found a nice demo here, also using youtube transcripts. They address the above concern by chunking together 20 entries to make longer text samples, and also use a rolling window approach: https://github.com/lancedb/lancedb/blob/main/docs/src/notebooks/youtube_transcript_search.ipynb However, I think they're still just relying on sentence-transformers and similarity searches.

shrike82 posted:You can move onto Vector DBs and big-L LMs for embedding if you need more accuracy and are really dealing with big data.

I'd like to start exploring this, so if you've got some suggestions for next steps that'd be great
|
# ? Apr 19, 2024 05:54 |
|
it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily embedding performance. think about how you could further preprocess the raw transcripts so that a chunk is meaty enough for search to work

1. youtube captioning, and video captioning more generally, sticks together words so that they appear "in time" and neatly as one or two lines of text visually. you can apply some further heuristics around timestamps to gather captions together into larger logical units - e.g., logical chunks are separated by time gaps of > k seconds
2. do some speaker id (off-the-shelf models google-able) to separate text chunks by speakers

anyway, if you really want to try alternative embedders:
a. bge - considered SOTA open source/weights a couple months back: https://huggingface.co/BAAI/bge-large-en-v1.5
b. use one of the cloud embedding APIs (e.g., openai's)
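the gap heuristic in (1) is only a few lines. a sketch, assuming the text/start/duration caption shape that youtube transcript APIs usually return (the 2-second threshold is made up and worth tuning):

```python
def merge_by_gaps(captions, max_gap=2.0):
    """Merge consecutive captions into one chunk unless the pause before
    the next caption exceeds max_gap seconds.
    captions: list of {'text': str, 'start': float, 'duration': float}."""
    groups = []
    for c in captions:
        if groups and c["start"] - groups[-1]["end"] <= max_gap:
            # small gap: keep extending the current logical chunk
            groups[-1]["text"] += " " + c["text"]
            groups[-1]["end"] = c["start"] + c["duration"]
        else:
            # long pause: start a new chunk
            groups.append({"text": c["text"], "start": c["start"],
                           "end": c["start"] + c["duration"]})
    return groups
```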
|
# ? Apr 19, 2024 06:24 |
|
shrike82 posted:it sounds like you're being limited by the way you're chunking up the audio transcripts, not necessarily embedding performance.

Right, that was my hypothesis above: that the transcript 'units' are too small to be of much contextual value. For (2), luckily my target channel is just one person. I'm going to try the idea of chunking together 20 sentences with a rolling window of, say, 5, as was done in that project I linked above.

shrike82 posted:anyway, if you really want to try alternative embedders:

That lancedb github link I posted actually does use an embedding from openAI, so I'll queue that up to try as well (they used it along with the chunking approach I described above).
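That rolling-window chunking is simple enough to sketch. The window size is the 20 entries from the linked notebook, the stride of 5 is my assumption, and keeping the first snippet's start time preserves the jump-to-timestamp ability:

```python
def window_chunks(snippets, size=20, stride=5):
    """Merge transcript snippets into overlapping windows.
    snippets: list of {'text': str, 'start': float} dicts."""
    chunks = []
    for i in range(0, max(1, len(snippets) - size + 1), stride):
        window = snippets[i:i + size]
        chunks.append({
            "text": " ".join(s["text"] for s in window),
            "start": window[0]["start"],  # timestamp to link back to the video
        })
    return chunks
```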
|
# ? Apr 19, 2024 16:53 |
|
Any good examples of ML applications for data management / warehousing? For example, helping with identifying and standardizing column headers across multiple datasets or spreadsheets? I know all the rage right now is LLMs and unstructured data, but for BI analysts and data analysts who mainly work with messy but still structured data sources, I'm trying to find applications in that space, if any exist.
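One cheap non-ML baseline worth having before reaching for anything fancier: fuzzy-matching incoming headers against a canonical schema with stdlib difflib (the canonical names and cutoff below are made up):

```python
import difflib

CANONICAL = ["customer_id", "order_date", "total_amount"]

def standardize(header, cutoff=0.6):
    """Map a messy column header to the closest canonical name,
    or leave it unchanged if nothing is close enough."""
    key = header.strip().lower().replace(" ", "_")
    match = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else header
```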
|
# ? Apr 19, 2024 19:36 |