A Big Data engineer is responsible for designing and building large-scale data systems that people usually refer to as “big data.” Does that sound interesting to you? If so, read on for a breakdown of key skills you'll need, including database and SQL knowledge.
Big Data is a term that followed the advent of companies such as Google when they started accumulating massive amounts of data, far more than previous generations of software could accommodate. The engineers at Google needed to pioneer new ways to manage such data. That was the beginning of a new profession known as Big Data engineering.
Becoming a Big Data engineer is a long process with great rewards. If you’re interested in this profession, you’ll need to learn basic database skills, including relational databases and SQL, as well as how to manage large amounts of data in distributed software systems such as Hadoop. Let’s look at these skills in particular.
Here are some technologies to start studying up on. Read the documentation on these; download software when you can; or use them in a cloud environment, such as in Amazon Web Services (AWS):
Hadoop: This is one of the original big data projects from the Apache Software Foundation. The technology inside it is fairly complex (at its core is a file system distributed across multiple machines, along with a framework for processing the data stored there), but Hadoop is pretty straightforward to use. As a Big Data engineer, you’ll likely be building systems on top of Hadoop.
Spark: This system works closely with Hadoop to perform analytics on data that’s spread across multiple servers. Like Hadoop, it’s managed by the Apache Software Foundation, meaning the software is free and open source.
Kafka: This software, again from Apache, works closely with both Hadoop and Spark and is used for building data pipelines. Data gets streamed in from multiple sources, and applications can subscribe to this streamed data. A Big Data engineer might set up Kafka so that software developers can write applications that use the streaming data, or might write such applications as well.
HBase and Cassandra: These are both “NoSQL” database systems that forgo traditional relational database architectures and allow for more flexible data storage. HBase runs on top of Hadoop’s file system, and Cassandra can integrate with Hadoop as well.
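The idea these tools share — divide the data into pieces, process the pieces independently, then merge the results — can be illustrated in miniature. Here’s a toy word count in plain Python, a local sketch of the map-and-reduce model behind Hadoop and Spark (the sample lines are invented; a real cluster would spread the partitions across many machines):

```python
from collections import Counter
from functools import reduce

# Toy illustration of the map/reduce idea behind Hadoop and Spark.
# Each "partition" here is just one line of text; on a real cluster,
# partitions live on different machines.

lines = [
    "big data systems scale out",
    "spark and hadoop scale out",
    "data data data",
]

# Map: each "worker" counts the words in its own partition.
partial_counts = [Counter(line.split()) for line in lines]

# Reduce: merge the per-partition counts into one result.
totals = reduce(lambda a, b: a + b, partial_counts, Counter())

print(totals["data"])   # 4
print(totals["scale"])  # 2
```

The value of the distributed version is that the map step runs in parallel on many machines at once; the logic itself stays this simple.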
Pro tip: The above software was created to manage large-scale systems, with data distributed among hundreds or even thousands of computers (as we’ll talk about shortly). Still, the software can be installed on your own computer. But should you? A better alternative is to use some of the “free tier” offerings of the different cloud platforms such as AWS. And even if you go beyond the free tier level, you can usually get by spending no more than $50 per month on a minimal setup sufficient for practicing. This is the route you’ll likely want to take; consider the cost part of your education.
General Technical Concepts Related to Big Data Engineering
In addition to the above, you’ll need to know some general technical concepts surrounding computing in general and data in particular. Following are a few of them.
Distributed computing and data warehousing: There was a time when a company’s databases lived on a single computer, and retrieving the data was easy. Today, data is distributed across many computers; smaller organizations might have two or three servers, large organizations might have dozens, and places like Google have tens of thousands. To complicate things further, such servers typically run in data centers scattered across the planet. These servers operate together, communicating with each other and sharing workloads. This concept of many servers running software that all works together is known as distributed computing. Additionally, organizations consolidate the data from these systems into central repositories built for reporting and analysis, a practice known as data warehousing. Big Data engineers must be experts in data warehousing, and familiar enough with distributed computing systems to make use of them. Here are some systems you’ll need to learn:
- Redshift is Amazon Web Services’ data warehousing service.
- BigQuery is Google’s data warehousing service and one of the first in the business.
- Snowflake: An independent platform for data warehousing.
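To get a feel for what these warehouses do, here’s the kind of analytic query they answer, sketched with Python’s built-in sqlite3 as a stand-in (the sales table and its columns are invented for illustration; a real warehouse like Redshift, BigQuery, or Snowflake would run a near-identical SQL statement over billions of rows):

```python
import sqlite3

# Warehouse-style analytics in miniature: aggregate a "fact" table
# (sales) by a dimension (region). sqlite3 stands in for a real
# warehouse; the table and data are invented for illustration.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 300.0)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 350.0), ('west', 300.0)]
```

What distinguishes the real warehouses isn’t the SQL — it’s that they execute queries like this across many machines in parallel.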
ETL (Extract, Transform, Load): This is an industry buzzword that today permeates all of data analytics, data science, and data engineering. The idea itself is deceptively simple: Extract data from multiple sources; transform that data into something useful; load the transformed data into another database system.
Together, this is known as a pipeline. But like most things that sound simple, the details are actually quite complex. The Big Data engineer will typically be responsible for the technical aspects of retrieving and storing the data so that the data analysts and data scientists can use the data. This means understanding the software that performs the ETL tasks. Here are some you’ll need to know; start with one (such as AWS Glue ETL or Google Cloud Dataflow) and learn as much as you can; then study the others to become familiar with them:
- Apache NiFi and Apache Beam: These are open-source ETL systems managed by the Apache Software Foundation. NiFi is a GUI tool for building pipelines; Beam is a programming model for defining pipelines that run on distributed systems.
- Talend is a proprietary software system for data management.
- Informatica PowerCenter is a proprietary system for combining vast data sources.
- AWS Glue ETL: This is a subset of AWS’s Glue service. Glue itself is a serverless data integration service for discovering, cataloging, and preparing data; Glue ETL handles the extract, transform, and load jobs.
- Google Cloud Dataflow: This is Google’s managed cloud service for running pipelines built with Apache Beam, mentioned earlier.
- Azure Data Factory is Microsoft’s cloud-based ETL offering.
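Whatever the tool, the shape of the work is the same. Here’s a minimal extract-transform-load pipeline sketched in plain Python, standing in for what services like Glue ETL or Dataflow do at scale (the CSV content, table name, and derived field are all invented for illustration):

```python
import csv
import io
import sqlite3

# A minimal ETL pipeline. Real pipelines pull from live sources and
# load into a warehouse; here the source is an in-memory CSV string
# and the destination is an in-memory SQLite database.

raw_csv = "order_id,amount\n1,19.99\n2,5.00\n3,42.50\n"

# Extract: read rows from the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: convert types and derive a "large order" flag.
transformed = [
    (int(r["order_id"]), float(r["amount"]), float(r["amount"]) >= 20.0)
    for r in rows
]

# Load: write the transformed rows into the destination database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, amount REAL, large INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)

count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE large"
).fetchone()[0]
print(count)  # 1
```

The complexity in production comes from the details this sketch skips: unreliable sources, schema changes, retries, and sheer volume.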
Data Visualization: Although this is usually associated with data analysis skills, as the data engineer you might work for a smaller organization that doesn’t have the budget to hire multiple data people. That means you’ll be doing data analytics and visualization in addition to your engineering work. Alternatively, you might work for a larger organization with data analysts, but as the data engineer, you might have to train the analysts. Either way, it’s essential you learn data visualization skills.
Data Analytics: The same holds with data analytics; you might need to take on these tasks or train the analysts. Here are technologies you’ll need to master:
- Data modeling: This is how data people take real-world concepts (such as a customer order system) and create database tables that model the real-world data. Data modeling is perhaps the single most fundamental skill in any data profession.
- Relational database concepts: Data is typically modeled in a “relational” way, meaning one set of data, such as several customer orders for a single customer, relates to another set of data, such as details for that customer like company name, contact, and phone number. Databases such as MySQL and SQL Server are relational database systems. MySQL is a free and open source database system that’s easy to install and use. You can start with that, and then learn other relational database systems.
- SQL: This is the language used for defining, querying, and manipulating data in a relational database. Virtually all of today’s relational databases use it.
- NoSQL database concepts: Around 2010, people started creating a new type of database that was decidedly non-relational. These databases store data in different formats, usually as “documents” (which are basically just collections of data items like name, address, and phone number). But they aren’t strict about such details; if one item needs to be slightly different, that’s fine. Because of the non-relational aspect, these database systems became known as NoSQL. Such systems have grown in popularity, and right now the most popular is MongoDB, which you can install for free to practice with. (MongoDB also offers a free cloud-hosted version, called Atlas, with optional premium plans.)
- Data cleansing: This refers to finding and fixing incorrect, missing, or duplicated data items.
- Statistics: Every data analyst needs to know statistics formulas and how to use them to run reports on the data. You’ll want to learn the basics and then go as far as you can after that.
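Several of these skills come together in a small sketch: modeling a customer order system as relational tables, then querying it with SQL. The example uses Python’s built-in sqlite3, and the table and column names are invented for illustration — MySQL or SQL Server would accept near-identical statements:

```python
import sqlite3

# Data modeling: a customer order system as two related tables.
# Each order row points back to its customer via customer_id.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        company     TEXT,
        phone       TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        total       REAL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp', '555-0100')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 99.0), (11, 1, 45.0)],
)

# The "relational" part: join each order back to its customer and
# summarize per company.
rows = conn.execute("""
    SELECT c.company, COUNT(o.order_id), SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.company
""").fetchall()
print(rows)  # [('Acme Corp', 2, 144.0)]
```

Notice the modeling decision: customer details live in one place rather than being copied onto every order — that separation is what “relational” buys you.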
Other Technical Skills
Aside from all-things-data, you’ll need to have other technical skills to complete your skillset, including:
- Python programming: Python is the primary language used by data professionals. You’ll want to take a course on it and master it. Within Python, you’ll want to learn how to use its data-oriented tools (called packages), including numpy and pandas.
- Containerization: This is a way of packaging software so it runs in small, isolated environments called containers on a host server (usually Linux, though the “big three” all work: Linux, Windows, Mac). Containers are far lighter than full virtual machines, which makes them ideal for running in the cloud, as they use very few resources.
- General cloud platforms: The “big three” cloud platforms are Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. You’ll want a good understanding of clouds in general. Pick one (it doesn’t matter which) and start learning as much as you can. Learn concepts such as provisioning servers, managing containers, running serverless code, storing data objects, managing users and permissions, networking and virtual private clouds, load balancing, and auto-scaling. Then start looking at the different hosted database options; for example, Amazon Web Services has a service called RDS for managing hosted MySQL, SQL Server, and others.
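As a first taste of pandas, here’s a sketch that fills a missing value (a bit of the data cleansing mentioned earlier) and summarizes a tiny table. The dataset and column names are invented, and this assumes pandas is installed (`pip install pandas`):

```python
import pandas as pd

# A tiny pandas workflow: build a table, clean a missing value,
# and summarize per product. Real work would read the data from
# a file or database instead of defining it inline.

df = pd.DataFrame({
    "product": ["widget", "widget", "gadget", "gadget"],
    "units":   [3, None, 5, 2],
})

# Cleansing: treat the missing units value as zero.
df["units"] = df["units"].fillna(0)

# Summarize: total units per product.
summary = df.groupby("product")["units"].sum()

print(int(summary["widget"]))  # 3
print(int(summary["gadget"]))  # 7
```

numpy sits underneath pandas and handles the fast numeric arrays; most data work touches both, even when you only import pandas directly.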
Soft Skills That Support Your Data Engineering Career Goals
Any technical career isn’t just about the technology. All technical jobs involve working with other people, which calls for soft skills. Here are some you’ll need as a Big Data engineer:
Problem solving skills: Big Data engineers don’t just manage data. They solve problems. Business management will often have questions that require deep data dives, such as data about customers, but might not really understand the problem itself. They might ask a vague question. Data analysts will likely be the ones who are asked to find the answers to these vague questions; but the data analysts will turn to you, the big data engineer, to help piece together where all the data can be found, and help identify patterns and trends in the data itself.
Because Big Data engineers are also involved in the engineering aspect of the data, they’re the ones who must track down problems. For example, some data systems might encounter bottlenecks in retrieving the data; this is a very common problem when working with large data systems. The Big Data engineer needs to track down what’s causing the bottleneck and fix it. Other problems might arise as well, such as system failures or disparate sets of data not integrating properly.
Communication: As database jargon becomes natural to you, you’ll find that coworkers who work in areas outside of tech might not think about data and its organization the way you do. That means you’ll have to take their requests and translate them into more technical aspects. For example, if two large corporations merge, their data will start out being stored in different systems, quite likely using different technologies, with very different data formats. Doing analysis on the combined data will likely be quite challenging. But the business-minded people might not understand why combining the data is so difficult. This requires explaining complex technical information in simple, non-technical ways.
Collaboration: Often you’ll be working with both technical and non-technical people, and you’ll need to be comfortable with collaboration and teamwork. Software developers will need your help retrieving the data to be used in their software. Business people and managers need information about the data. This requires being able to work well with different types of people.
Big Data engineering is an advanced career that encompasses many different technologies and concepts, as you’ve seen here. It can take a few years to learn and master.
As you progress in your training and education, you may find you enjoy one part in particular, such as cloud computing. If so, go for it! You don’t have to become a Big Data engineer; you can become a cloud computing engineer instead. Or, you may find that Big Data engineering is exactly the right fit for you.
In any case, keep studying, keep learning. And plan to keep learning throughout your career. New technologies are appearing all the time, and you’ll want to make sure you stay on top of them.