Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
instead of continuing to fill up the idiot spare time projects thread with my literal nonsense, it was suggested i make a new thread about it so here it is

inspired by sarah palin's most recent word salad, i decided to make a markov text bot to generate virtual sarah palin quotes. turns out this is an idea a million other people have had, because anyone who knows both palin and markov text bots sees the rather obvious connection. however, this is only step 1 in our voyage.

oh before we get started here's what a markov text generator is

https://en.wikipedia.org/wiki/Markov_chain posted:

A Markov chain (discrete-time Markov chain or DTMC), named after Andrey Markov, is a random process that undergoes transitions from one state to another on a state space. It must possess a property that is usually characterized as "memorylessness": the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of "memorylessness" is called the Markov property. Markov chains have many applications as statistical models of real-world processes.

blah blah blah a bunch of poo poo about math and then something about baseball?

Markov processes can also be used to generate superficially real-looking text given a sample document: they are used in a variety of recreational "parody generator" software (see dissociated press, Jeff Harrison, Mark V Shaney).

These processes are also used by spammers to inject real-looking hidden paragraphs into unsolicited email and post comments in an attempt to get these messages past spam filters.

In the bioinformatics field, they can be used to simulate DNA sequences.

okay, so basically the procedure is this: you take a source text, you break it down into n-grams. an n-gram is a string of words of length n. so basically you just take every pairing of sequential words, stick them in a table, and count how often they happen. using this n-gram table, you just pick a random starting point. then you calculate the probability of the next n-gram based solely on the current n-gram, select the next n-gram based on an RNG and that table of probabilities, and just keep doing that until you get tired.
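if you want to see the moving parts without a package, here's a minimal sketch of that procedure in base R. it is not what the ngram package does internally: it's a toy where the "state" is a single word, and `build.table` and `babble.toy` are names made up for this example.

```r
# toy word-level markov babbler in base R (no packages needed).
# build.table and babble.toy are hypothetical names for this sketch.

build.table <- function(text){
  words <- strsplit(text, " +")[[1]]
  words <- words[words != ""]
  # pair every word with the word that follows it
  current <- head(words, -1)
  nxt <- tail(words, -1)
  # for each word, keep the vector of words seen right after it. repeats are
  # kept, so picking uniformly from the vector is picking by observed frequency
  split(nxt, current)
}

babble.toy <- function(tab, n.words = 20){
  word <- sample(names(tab), 1)          # random starting point
  out <- word
  for(i in seq_len(n.words - 1)){
    followers <- tab[[word]]
    if(is.null(followers)) break         # dead end: nothing ever followed this word
    word <- followers[sample.int(length(followers), 1)]
    out <- c(out, word)
  }
  paste(out, collapse = " ")
}

tab <- build.table("the cat sat on the mat and the cat ran")
set.seed(1)
print(babble.toy(tab, 8))
```

the next word depends only on the current word, which is the "memorylessness" bit from the wikipedia quote. storing the followers with repeats instead of a table of counts is just a shortcut so that sampling needs no probability arithmetic.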

all very simple, but i'm still lazy as a motherfuck so i'm just using the ngram package in R. that looks like this:

code:
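library(ngram)

# read the whole corpus and squash it into one long string
diarrhea.in <- paste(readLines("palin.txt"), collapse = " ")
palin.ngram <- ngram(diarrhea.in, n = 2)   # build the 2-gram table
palin.babble <- babble(palin.ngram)        # generate markov text from that table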
where "palin.txt" is just a bunch of palin interviews, speeches, and debate performances stuffed into a text file. that gives you this:

quote:

"“pay-to-play.” Between bailouts for Wall Street cronies and stimulus projects or, as someone put it, this was all about Denali, mom, dad, ungulate eyeballs, slaying salmon on the floor of the world really works in order to accomplish after he's done turning back the waters and healing the planet? The answer is to challenge the status quo has got to call the devastation that a bill wouldn't be signed into law before we probably even got that first revolution.” We are the ones, right? You’re the ones who pay the bills in our enemies — proving peace through strength. In that respect, I applaud the president and his American dream endures. He knew the best of America are open, unfortunately though, some would want you to succeed too. And that we love. We’re here to stop that they inherited. Real reform never sits well with entrenched interests and power brokers. "

so it's basically perfect

NEXT UP: FROM SHITPOSTER TO TWITPOSTER

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
okay it's fun to generate nonsense for your friends and relatives, but what you really want to do is spread your garbage far and wide. of course that means you need twitter, the web's primary outlet for meaningless garbage. luckily R has a p great package for accessing the twits, called twitteR. first you need to register an app on twitter and get your keys though. that means:

1. register an account for your garbage bot
2. while logged into your garbage bot account, go to apps.twitter.com
3. click "create new app"
4. fill in some details here:



5. click "i agree" to the user agreement, which probably says that you're not going to do the things we're about to do. i don't know for sure because i never read it, i just like agreeing to stuff
6. go to the tab that says "keys and access tokens". you'll need to generate a token. then you need to copy down the gibberish after Consumer Key (API Key), Consumer Secret (API Secret), and then Access Token and Access Token Secret, lower down on the page. now twitteR can talk to twitter

code:
library(twitteR)


options(httr_oauth_cache = TRUE)   # cache the oauth credentials locally instead of asking every run

apikey <- "apikey"
apisecret <- "apisecret"
token <- "token"
tokensecret <- "tokensecret"
setup_twitter_oauth(apikey, apisecret, token, tokensecret)
you now have the ability to do all sorts of stuff: post, search, get trending hashtags, etc. we're just going to set sarah up to post her garbage to the tweets, so all we need to do is take that babble, do some regex on it to get vaguely sentence-like objects of under 140 characters, and tweet them. thusly

code:
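palin.babble <- babble(palin.ngram)
sentences <- c()
# each match starts at the punctuation ending one sentence and runs through
# the capital letter that starts the next one
starts <- gregexpr("[?.!] +[A-Z]", palin.babble)[[1]]
ends <- starts + attr(starts, "match.length") - 1   # position of that capital letter
for(i in seq_len(max(length(starts) - 1, 0))){
   # from this capital letter through the next sentence-ending punctuation
   this.sentence <- substr(palin.babble, ends[i], starts[i+1])
   if(nchar(this.sentence) <= 140){
      sentences <- c(sentences, this.sentence)
   }
}
if(length(sentences) > 0) tweet(sentences[1])   # nothing usable this round? skip it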
then hook that into a loop that runs every 30 minutes or something
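a rough sketch of that loop, assuming you wrap the babble-and-regex code in a function that returns candidate sentences. `run.bot`, `make.sentences`, and `post` are hypothetical names for this example; in the real bot `post` would be twitteR's `tweet`.

```r
# run.bot, make.sentences and post are hypothetical names for this sketch.
# make.sentences: function returning a character vector of candidate sentences
# post:           function that takes one string and publishes it
run.bot <- function(make.sentences, post, interval.secs = 30 * 60, max.posts = Inf){
  posted <- 0
  while(posted < max.posts){
    sentences <- make.sentences()
    if(length(sentences) > 0){
      # try() so one failed post (rate limit, duplicate status, network blip)
      # doesn't take the whole loop down
      result <- try(post(sentences[1]), silent = TRUE)
      if(!inherits(result, "try-error")) posted <- posted + 1
    }
    Sys.sleep(interval.secs)
  }
  invisible(posted)
}

# dry run: no twitter involved, no waiting, stops after two "posts"
run.bot(function() c("Hello there.", "Second one."), print,
        interval.secs = 0, max.posts = 2)
```

passing the posting function in as an argument means you can test the loop with `print` before pointing it at a live account.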

result: mostly tedious gibberish but sometimes something entertaining comes out

https://twitter.com/markov_palin/status/690819879537131520

https://twitter.com/markov_palin/status/690818464269869056

https://twitter.com/markov_palin/status/690856828675104771

https://twitter.com/markov_palin/status/693176875078660096


and sometimes something chilling

https://twitter.com/markov_palin/status/690823441524604928




might as well make a markov trump while we're at it, it's basically just a matter of plugging in a new text file

result: markov trump is feeling romantic

https://twitter.com/markov_trump/status/693132349282787329



but not so romantic that he can't still be a brutal dictator

https://twitter.com/markov_trump/status/693011475791790081

https://twitter.com/markov_trump/status/692935871893471232

NEXT UP: ROOTING THROUGH THE TRASH

Trig Discipline fucked around with this message at 00:02 on Feb 14, 2016

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
okay how can we get that extra dose of twitter realness? answer: harvest twitter itself

turns out we can again do that super-easy with the twitteR package, just by using the userTimeline function

code:
trump.tweets <- twListToDF(userTimeline('realDonaldTrump', n=3200))$text
write(trump.tweets, file="trumptweets.txt")
now we have a text file with donald's tweets (up to the most recent 3200, which is as far back as the twitter API lets you page through a timeline), which we can load in and append to our speeches, debates, and interviews. this results in true social media engagement

https://twitter.com/markov_trump/status/691789588885544960



now let's get sarah in on the action

https://twitter.com/markov_palin/status/693288342524284928

https://twitter.com/markov_palin/status/693282630360432640

https://twitter.com/markov_palin/status/693244859478515712

https://twitter.com/markov_palin/status/693169316821233664

UP NEXT: KEEPING UP WITH CURRENT EVENTS

Trig Discipline fucked around with this message at 00:08 on Feb 14, 2016

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
okay, so we're generating meaningless chaos, but we still want to keep up with the hottest trends and news items. we can use the tm text mining package and its various derivatives to see what's going on in the news

we're going to do this in the context of generating prophecies. we're going to mix Revelations, The Necronomicon, The Egyptian Book of the Dead, and Nostradamus with today's hot news and hashtags

code:
library(tm)
library(tm.plugin.webmining)

googlenews <- WebCorpus(GoogleNewsSource("Microsoft"))

googlenews.in <- paste(unlist(lapply(googlenews$content, function(x) x$content)), collapse = " ")
googlenews.in <- gsub("\\n", " ", googlenews.in)
googlenews.in <- gsub("([\\])", " ", googlenews.in)

yahoonews <- WebCorpus(YahooNewsSource("Microsoft"))

yahoonews.in <- paste(unlist(lapply(yahoonews$content, function(x) x$content)), collapse = " ")
yahoonews.in <- gsub("\\n", " ", yahoonews.in)
yahoonews.in <- gsub("([\\])", " ", yahoonews.in)
that gets us the text of the top stories on google and yahoo news for whatever search term you hand it ("Microsoft" here). then we need our holy texts

code:
infile <- file("holy.txt")
holy.in <- readLines(infile)
holy.in <- paste(holy.in, collapse=" ")
and finally we need the hot hot tweets. since twitter gets really shirty when you get too much info at once, i'm just grabbing the top 100 tweets from the 20 most popular hashtags in the USA (that's what the number in the getTrends arguments is: 23424977 is the WOEID for the USA)

code:
tweets <- c()
trends <- getTrends(23424977)
for(i in 1:20){
   thistag <- trends[i, 1]
   print(paste("Harvesting tag", i, ":", thistag))
   these.tweets <- searchTwitter(thistag, 100)
   these.tweets <- paste(twListToDF(these.tweets)$text, collapse = " ")
   tweets <- paste(tweets, these.tweets, collapse = " ")
}
then you basically stuff all of that stuff into a single character vector, ngram it, and markov_thebeast is born
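the stuffing step is just a paste. a minimal sketch, using short stand-in strings in place of the real scraped text so it runs on its own (`beast.in` is a made-up name for the combined corpus):

```r
# stand-in strings so the sketch runs without the scrapers above
holy.in <- "and i saw a beast rise up   out of the sea"
googlenews.in <- "microsoft announces quarterly earnings"
yahoonews.in <- "microsoft stock rises"
tweets <- " #trending lol"

# one long string, with runs of whitespace squashed to single spaces
sources <- c(holy.in, googlenews.in, yahoonews.in, tweets)
beast.in <- paste(sources, collapse = " ")
beast.in <- trimws(gsub("[[:space:]]+", " ", beast.in))

# from here it's the same calls as the palin bot:
# beast.ngram <- ngram(beast.in, 2)
# beast.babble <- babble(beast.ngram)
```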

https://twitter.com/markov_thebeast/status/697993864607498241

https://twitter.com/markov_thebeast/status/698103558290280448

https://twitter.com/markov_thebeast/status/698043117576957953


NEXT UP: THE INTERNET BARFS UP ITS OWN rear end in a top hat

Trig Discipline fucked around with this message at 00:21 on Feb 14, 2016

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
for our (current) final iteration, we're going to turn twitter into a literal echo chamber. in response to the above, cheese-cube posted the following

cheese-cube posted:

has anyone made a twitter bot that makes markov chains using the tweets of users who follow it? idk could either be terrible or funny.

which is a frickin' genius idea. turns out to be pretty easy to do, too! this one is named markov_polov

polov runs as two separate scripts. one just grabs all of the followers of the twitter account, then scrapes their tweets and does some regex stuff to get rid of special characters. it also strips URLs so it doesn't end up reposting whatever weird porn you guys are passing around on twitter. it waits five minutes between searches so that twitter doesn't boot it off. then, after it's done one pass of all of the users, it writes their tweets to a text file. the second script just reads that text file every ten minutes, builds an ngram table, and spouts some bullshit

code:
while(1){
    mp <- getUser('markov_polov')
    followers <- mp$getFollowers()
    follower.tweets <- c()
    for(i in 1:length(followers)){
        print(paste("Grabbing tweets from", followers[[i]]$getScreenName()))
        follower.tweets <- paste(follower.tweets, paste(twListToDF(userTimeline(followers[[i]], n=3200))$text, collapse = " "), collapse = " ")
        print(paste("Got", nchar(follower.tweets), "characters so far..."))
        Sys.sleep(300)
    }


    tweets <- gsub("http[^[:space:]]*", "", follower.tweets)
    tweets <- gsub('\\\\n', "", tweets, perl=TRUE)
    tweets <- gsub('\\n', "", tweets, perl=TRUE)
    tweets <- gsub("([\\])", " ", tweets)
    tweets <- gsub("([\"])", " ", tweets)
    tweets <- gsub(" , ", " ", tweets)
    tweets <- iconv(tweets, "latin1", "ASCII", sub="")
    write(tweets, file="tweets.txt")
}
and the result is lovely

https://twitter.com/markov_polov/status/698366265946038272

https://twitter.com/markov_polov/status/698489628798488577

https://twitter.com/markov_polov/status/698604570893660160

https://twitter.com/markov_polov/status/698651563330383873


particularly when it catches someone who doesn't know wtf is going on

https://twitter.com/Pleasure__Kevin/status/698621578817343489

bonus: it has already passed the australian turing test by becoming self-aware enough to complain about telstra

https://twitter.com/Telstra/status/698479305781698560

Trig Discipline fucked around with this message at 00:45 on Feb 14, 2016

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
side note: markov trump's followers are mostly actual trump supporters now who seem to have no idea that it's a bot. loving amazing

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
oh yeah, a few notes:

* once it hits an ngram from a statistically unusual sentence, it has a tendency to repeat the rest of the sentence verbatim. the longer the corpus gets (i.e., the more users in markov polov's case), the less this happens

* the ngram package in R is buggy as gently caress, so every one of these bots just dies and hard-crashes R at random intervals. since all of the ngram processing is done via C calls and since i am both (1) lazy and (2) a poo poo C programmer, i am just restarting the bots when they die instead of fixing the issue. i suppose i could just write my own ngram package for R, but see point (1)

* because of the way the twitscraper script works for markov polov, the twitscraping gets five minutes slower for each new user. if you follow the bot, it may be a few hours or even days before your tweets get incorporated
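back-of-the-envelope version of that last point, counting only the five-minute sleeps and ignoring however long the API calls themselves take (`pass.hours` is a name made up for this example):

```r
# hours for one full scraping pass, counting only the sleep between followers
pass.hours <- function(n.followers, sleep.secs = 300){
  n.followers * sleep.secs / 3600
}

pass.hours(c(12, 100, 300))   # 12 followers = 1 hour, 300 followers = 25 hours
```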

Trig Discipline fucked around with this message at 00:59 on Feb 14, 2016

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
oh i've also been thinking that i might wait until palin and trump got a hundred followers or so and then gradually start feeding other texts into them. i'm thinking a handmaid's tale fed a chapter at a time into palinbot would be fun. not sure about trump, though. the wife suggested a combination of yosemite sam quotes and mein kampf, but i don't think there's that much text for the former

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
well i'll be damned

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

PCjr sidecar posted:

mein kampf translated through the Simple English vocabulary

ooooh

definitely want to wait until he gets more followers though. i'm getting 2-5 new people a day

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
markov polov seems to have decided to just rip on Pleasure Kevin today

https://twitter.com/markov_polov/status/698669209794949121

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

big scary monsters posted:

have you had anyone @ed by them try to respond to the palin/trump bots? anything good?

a lot of retweets, but no actual engagement. if and when that happens, i'm just going to run the babbler locally and keep pasting replies until they realize what's up

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

craisins posted:

does it go through historical posts and add them to the markov bot? or only new posts from its followers?

all posts, up to 3200 posts for each user. it rescrapes on a regular basis, at intervals determined by how many friends it has

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
O_O

well it seems like they mainly just wanted him to turn it off, and i would definitely do that if it came to that

as it is, it's just endorsing alternative medicine

https://twitter.com/markov_polov/status/698681810432053248

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

cheese-cube posted:

https://www.youtube.com/watch?v=t-7mQhSZRgM&t=17s

markov polo is killing it nice work, thanks for the props re the idea trig!

it's a killer idea. should i credit your twitter handle instead of your forums handle?


also that video is magical

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
drat yospos y'all some nasty tweeters

https://twitter.com/markov_polov/status/698701970811392001

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
i'm thinkin yeah

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
yeah seriously how many of those are there ffs

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

more :huh:

https://twitter.com/markov_polov/status/698719606412738561

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
i was thinking about filtering those out but then

https://twitter.com/markov_polov/status/698727164904996864

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
by definition, yes

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
okay the next time the scraper rolls around i'm going to implement the fishmech filter. stripping out any post containing @waze, #runescape, and #SoundHound. i'm keeping the goodreads quotes tho

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
for posterity, here is the fishmech filter

code:
# Takes a character vector of tweets, returns it minus any tweet containing a filter term
fishmech.filter <- function(x){
  fishmech.terms <- c("@waze", "#runescape", "#SoundHound", "#Kindle", "Levelled", "@eBay", 
                      "Daemonheim", "@YouTube", "Challenges for Microsoft")
  for(i in fishmech.terms){
    x <- x[!grepl(i, x, fixed = TRUE)]   # fixed = TRUE: match the term literally, not as a regex
  }
  return(x)
}
this eliminates over 75% of fishmech's total output, and the remainder is STILL mostly quotes posted by kindle. but that stuff is book quotes, so it's got enough variety for the corpus

fishmech congratulations on being a literal living edge case, at this point i'm starting to suspect that you yourself are an elaborate script running on a server farm somewhere

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
oh wow



poo poo just got real

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

seriously wtf

in lighter news

https://twitter.com/markov_palin/status/698915319788638208

https://twitter.com/markov_palin/status/698907767227027456

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
bump

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
so is markov polov

https://twitter.com/markov_polov/status/699030578201374721

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

Sniep posted:

Markov Polov sends his Valentine's day sentiments

https://twitter.com/markov_polov/status/699080537781108737

went a bit further than that

https://twitter.com/markov_polov/status/699055788690550784

https://twitter.com/markov_polov/status/699030578201374721

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

don't engage, i think he's an mra

https://twitter.com/markov_polov/status/699176095124295680

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
gonna be the first twitter bot to shoot up his high school

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
cat thread yeah but i don't post in cjs much anymore since i started actually working. i am very much a creature of megathreads tho

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

Necc0 posted:

oh you didn't write your own markov algo? shame on you, trig. all of my _ebooks bots are bespoke hand crafted garbage

oh hell no did you miss the several times where i mentioned how lazy i am

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

horse mans posted:

does this go back through follower's TLs for a more complete corpus? or just start including data from the time of following onward?

my markov bot doesn't seem very coherent most of the time and i think it's because it's only going off of ~3300 statuses

it scrapes everything they've tweeted, up to 3200 tweets per follower

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

horse mans posted:

i wish i had a better method for selecting more coherent sentence-like things. i was looking into maybe some kind of language processing on my tweet archive and then somehow try and integrate common sentence structures into what it finds, like, even going so far as to perform word classification on all of my tweets, but i don't know. seems like a lot of effort.

mine is literally "ends with a ?, ., or !, begins with a capital letter, is less than 140 characters". again, i am lazy
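for anyone who wants that rule as a single regex instead of a loop, here's a sketch. it's a variation on the same idea rather than the code the bots actually run, and `pick.sentences` is a name made up for this example.

```r
# pick.sentences is a hypothetical helper for this sketch.
# keeps chunks that start with a capital letter (at the start of the text or
# right after sentence-ending punctuation and a space), end with ? . or !,
# and fit in a tweet
pick.sentences <- function(text, max.chars = 140){
  text <- gsub(" +", " ", text)   # single spaces, so the lookbehind below is fixed-width
  pattern <- "(?:^|(?<=[?.!] ))[A-Z][^?.!]*[?.!]"
  candidates <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  candidates[nchar(candidates) <= max.chars]
}

txt <- "hello there. We are here! ok fine? Yes. I love Alaska so much"
print(pick.sentences(txt))
```

the leading fragment (no capital) and the trailing one (no ending punctuation) get dropped, which is the whole point: the babbler's output starts and stops mid-sentence.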

side note: if anyone else starts doing this in R, you should know that the version of the ngram package on CRAN is hella unstable. the version from github isn't exactly rock-solid but it crashes about 1/10 as often so far

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

Mandelbulb posted:

scraped SA, guess the subforum

idk which subforum but it's magical

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
https://twitter.com/markov_polov/status/699393314311639040

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
in other news

https://twitter.com/markov_trump/status/699399643155705856

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
markov polov spitting...something?

https://twitter.com/markov_polov/status/699700759537930240

https://twitter.com/markov_polov/status/699710848994938880



the beast is trying to decide whether life of pablo makes tidal worth it

https://twitter.com/markov_thebeast/status/699706640136671232

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer

infernal machines posted:

nice, your markov bot is cyber-bullying already

yeah it's basically a horrible rear end in a top hat at this point. thanks goons

e:

https://twitter.com/markov_palin/status/699766798783123456

Trig Discipline fucked around with this message at 02:31 on Feb 17, 2016

Trig Discipline
Jun 3, 2008

Please leave the room if you think this might offend you.
Grimey Drawer
cjs

https://twitter.com/markov_polov/status/699919091360870400
