  • Locked thread
Seventh Arrow
Jan 26, 2005

Although there's already a perfectly fine thread for Data Science, I thought it might be beneficial to have a thread for big data in general, especially for data analysts and engineers (and architects etc.) and people who are interested in these fields.

There's a lot of hype about big data at the moment, and I hope this thread can be a place where big data concepts can be clarified, including the ones in this post. I'm currently a data engineering student and am still learning, so any corrections are appreciated. With that in mind, here are the main disciplines as I understand them (keeping in mind that these aren't neat compartments hermetically sealed off from each other):

Data Science:

I get the impression sometimes that some data scientists resent being lumped in with the whole big data trend. Because it's so entwined with the classic disciplines of math and statistics, I believe that "data science", in a sense, predates big data; I think it even predates computers. That said, I see data scientists as the platonic ideal of pure number crunchers. Today's data scientist will often put a big emphasis on machine learning and deep learning algorithms. When not doing that, they will be using various tools to squeeze useful information out of the mass of data that they're presented with.

Data Analysis:

My understanding is that Data Analysts, like Data Scientists, also do a fair bit of number crunching, but they often also have to use visualization tools like Tableau to present the results to stony-faced corporate oligarchs who may well have failed high school math. I've also seen Data Analysts associated with Excel expertise a lot; I'm still trying to figure that one out.

Data Engineering:

Data Engineers are most commonly associated with the process of preparing the data for use by our science and analysis overlords. When working in data engineering you'll frequently run into terms like ETL (extract, transform, load), batch processing, and data warehouses and/or data lakes. All of these have to do with taking in data from many sources (including streaming ones), transmogrifying it, and loading it into a format and location that the number crunchers can use. Programming is often essential for this line of work, and knowledge of cloud stuff is becoming more and more of a necessity too.
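
To make the ETL idea a bit more concrete, here's a minimal sketch in Python using only the standard library. Everything in it is made up for illustration: the `RAW_CSV` data, the `signups` table, and the `extract` / `transform` / `load` function names. A real pipeline would read from files, APIs, or streams and load into an actual warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a file, an API, or a stream.
RAW_CSV = """user_id,signup_date,spend_usd
1,2018-01-05,19.99
2,2018-01-06,
3,2018-01-06,5.00
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing spend value and convert types."""
    cleaned = []
    for row in rows:
        if not row["spend_usd"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), row["signup_date"], float(row["spend_usd"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a table the analysts can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups "
        "(user_id INTEGER, signup_date TEXT, spend_usd REAL)"
    )
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, spend_usd FROM signups ORDER BY user_id"
).fetchall()
print(loaded)
```

The row for user 2 has no spend value, so it gets dropped in the transform step and never reaches the table.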

https://www.datacamp.com/community/blog/data-engineering-vs-data-science-infographic#gs.j=HV5Mk

https://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html

Again, the barriers between these disciplines aren't always so tightly controlled. Sometimes Data Scientists might need to do ETL stuff. Data Engineers might need to do analysis or cleaning.

Hadoop? Munging? Spark? What Is All This Garbage?

So here are some terms and concepts that you'll encounter if you want to embark on a career in big data:

Big Data: Big data is a career where you help Mark Zuckerberg find out what kind of beer America had for breakfast and who will pay the most $$$ for this information. Just kidding OR AM I. Anyways, businesses have always been keen on taking in and processing data, but nowadays the amount of information has hit such a critical mass that traditional databases struggle to work through it all. As such, technologies like Hadoop and Spark have come about to address these challenges. To be sure, big data has been getting a bad rap, especially since the Cambridge Analytica fiasco, due to the way corporations have been handling people's personal data. I guess we'll have to see how that pans out, but for now if you have any conspiracy-minded friends, you can watch their eyeballs bulge as you tell them how much private data you stream into HBase every day.

Hadoop: Hadoop isn't actually a single program or API, but a Java-based framework that allows MapReduce processing across clusters of computers. Hadoop includes a filesystem, HDFS, and a resource manager, YARN. Hadoop's processing can be used by applications like Hive, Pig, HBase, Mahout, etc. Hadoop is, for all intents and purposes, where the whole big data hubbub began.
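
If MapReduce sounds abstract, here's a toy sketch of the idea in plain Python (a word count, the usual "hello world" for this). It is not Hadoop code and runs on a single machine; the `map_phase`, `shuffle`, and `reduce_phase` names and the sample `documents` are made up for illustration. On a real cluster, the map and reduce steps would be spread across many machines, with the framework handling the shuffle in between.

```python
from collections import defaultdict

# Hypothetical input; on a cluster each document could live on a different node.
documents = ["big data is big", "data about data"]


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

The point of splitting the work this way is that each map call only needs one document and each reduce call only needs one key, so both steps can run in parallel.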

Spark: Spark is another cluster-computing engine, written in Scala, that is notable for using in-memory computations. It was created to address limitations in the MapReduce platform and I guess some people consider it a Hadoop-killer, but it depends on the context. Spark can be used from Scala, Java, Python, and R, and provides access to additional functions via Spark SQL, Spark Streaming, and MLlib (machine learning).

Hive: Apache Hive is a data warehousing solution that uses SQL-esque queries to analyze data in a MapReduce framework. I'm mainly bringing it up because if you want to do big data stuff, you will need to familiarize yourself with SQL queries and logic. Even though it uses SQL queries, Hive isn't technically a database, so if someone refers to it as such, you can smugly look down your nose and say "um, excuse me good sir / m'lady but Hive isn't actually a database but a data warehousing application :smug: " (don't do this in job interviews)
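
For a taste of the kind of SQL logic you'd want to be comfortable with, here's a small sketch. Note the assumption: it uses Python's built-in `sqlite3` module as a stand-in, not Hive, and the `page_views` table and its data are invented for the example. Hive's query language has its own syntax and features, but a basic aggregation like this one uses the same SELECT / GROUP BY / HAVING / ORDER BY building blocks.

```python
import sqlite3

# sqlite3 is only a stand-in here; this is not Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("CA", "/home", 120),
        ("CA", "/about", 30),
        ("US", "/home", 200),
        ("US", "/pricing", 50),
    ],
)

# Total views per country, keeping only countries above a threshold.
query = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    HAVING SUM(views) > 100
    ORDER BY total_views DESC
"""
totals = conn.execute(query).fetchall()
print(totals)
```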

Data Munging / Cleaning: Data Munging generally refers to the process of cleaning data, but I suppose it's not always an exact term. This means using Spark, Hive, code, or other tools to organize inconsistent and bad data. This can include NULL fields, inconsistent capitalization, bad data types, wrong column names, and even stuff like rows that should be columns. If you're looking to be a data scientist, you might be doing a lot of this stuff. Kaggle recently had a five-day tutorial on cleaning data in Python.
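
Here's a rough sketch of what that kind of cleaning can look like in plain Python. The records, the `COLUMN_RENAMES` mapping, and the `parse_age` / `clean_row` helpers are all hypothetical; in real life you'd more likely reach for a library like pandas or do this in Spark, but the problems being fixed are the same ones listed above: NULL fields, inconsistent capitalization, bad data types, and wrong column names.

```python
# Hypothetical messy records, as they might arrive from an upstream system.
raw_rows = [
    {"Name ": "alice", "AGE": "34", "city": "toronto"},
    {"Name ": "BOB", "AGE": "NULL", "city": "Toronto"},
    {"Name ": "Carol", "AGE": "forty", "city": "TORONTO"},
]

# Map the wrong or inconsistent column names onto the ones we actually want.
COLUMN_RENAMES = {"Name ": "name", "AGE": "age", "city": "city"}


def parse_age(value):
    """Return an int, or None for NULL markers and unparseable values."""
    if value is None or value.strip().upper() in {"", "NULL", "N/A"}:
        return None
    try:
        return int(value)
    except ValueError:
        return None  # bad data type, e.g. "forty"


def clean_row(row):
    """Fix column names, capitalization, and types for one record."""
    renamed = {COLUMN_RENAMES[key]: value for key, value in row.items()}
    return {
        "name": renamed["name"].strip().title(),
        "age": parse_age(renamed["age"]),
        "city": renamed["city"].strip().title(),
    }


clean_rows = [clean_row(row) for row in raw_rows]
for row in clean_rows:
    print(row)
```

Whether a bad value like "forty" should become a missing value, get fixed by hand, or cause the whole row to be rejected is a judgment call that depends on the dataset; this sketch just treats it as missing.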

Programming: Most people have an idea of what programming is, but if you want to do stuff with big data, you're going to have to learn programming to some degree. Data Scientists and Analysts will probably need to have less programming mastery than Data Engineers and Architects. Currently favoured languages for data analysis are Scala, Python, and R. Engineers / Architects may also need to know Java.

I'd like to add a list of useful resources, but I think I'd like to see some input from posters to get an idea of what people want (also this thread could very well tank on day one).
