A data architect oversees and designs the data systems in an organization. This includes selecting and implementing the database server software, building the data systems, and working with the software developers as they build software that accesses the data.
Additionally, they help build rules (called data governance and security) regarding who is allowed to have access to what data. With that in mind, what do data architects need to know to craft a great career? And what career paths are available?
What skills does a data architect need?
Because a data architect has such a wide array of responsibilities, there’s a pretty substantial list of skills you need to learn prior to becoming one. Let’s start with concepts:
- Data Modeling: This is where you take business items (such as customers, invoices, inventory and so on) and create a virtual version of each. Each of these becomes a table in a database.
- Entity-relationship diagrams (ERDs): These are diagrams that help you visualize the entities in your data models and how they relate to each other.
- Relational database concepts: This is how database tables connect to each other. For example, an invoice table would need a customer ID; that customer ID would also exist in a table listing individual customers.
- SQL (Structured Query Language, usually pronounced “sequel”): This is a language used to create database tables, populate the data, and retrieve and change it.
- Security: As with software, security is vital. Database security doesn’t just involve which people have access to the data, but also which software and even which servers are allowed to access it.
- ETL (Extract, Transform, Load): This is a sequence of steps for moving data between systems. You extract data from its existing sources, transform it into the shape and quality you need, and load the result into a target database.
- Big Data: Although this term is starting to become a bit outdated (it was old ten years ago!), it still has some importance. In the early days of Google, the data architects there discovered the existing software systems couldn’t handle the sheer volume of data Google was collecting. As such, their researchers started building new ways of managing such data. These approaches subsequently became known as Big Data. Today, many large organizations have data at such volumes that they need big data methods to manipulate it.
- Governance: This is a broad area referring to many different aspects of data, including the establishment of rules and policies for accessing and managing data.
- Database consistency: Often, database systems are “replicated,” meaning the same data is copied across multiple computers. That way, if one system crashes or its data is corrupted, the data is not lost. However, this approach raises challenges around how quickly an update spreads across the different computers. Managing such consistency requires understanding the problem and knowing how to use the tools effectively to prevent it.
- Database performance: When you’re reading data (called querying the data), some ways of constructing your queries are faster and more efficient than others. Writing data, when done incorrectly, can cause a severe lag for anyone reading from the data system. As such, you need to learn about database performance and how to keep it optimal.
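Several of these concepts are easier to see in a small example. Here is a minimal sketch, not a production pipeline, that uses Python’s built-in sqlite3 module to run a tiny ETL job: it extracts rows from a raw export, transforms them, and loads them into two related tables (customers and invoices, linked by a customer ID), then queries them with SQL. The sample data, column names, and function names are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: one row per invoice, customer details repeated.
RAW_CSV = """customer_email,customer_name,invoice_total
ada@example.com, Ada Lovelace ,120.50
alan@example.com,Alan Turing,75.00
ada@example.com,Ada Lovelace,30.25
"""


def extract(text):
    """Extract: read the raw rows out of the CSV text."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace and store totals as integer cents."""
    return [
        {
            "email": row["customer_email"].strip().lower(),
            "name": row["customer_name"].strip(),
            "total_cents": round(float(row["invoice_total"]) * 100),
        }
        for row in rows
    ]


def load(conn, rows):
    """Load: create the two related tables and insert the cleaned rows."""
    conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the link
    conn.executescript(
        """
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE,
            name TEXT NOT NULL
        );
        CREATE TABLE invoices (
            invoice_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            total_cents INTEGER NOT NULL
        );
        """
    )
    for row in rows:
        # Each customer is stored once, even if they have many invoices.
        conn.execute(
            "INSERT OR IGNORE INTO customers (email, name) VALUES (?, ?)",
            (row["email"], row["name"]),
        )
        (customer_id,) = conn.execute(
            "SELECT customer_id FROM customers WHERE email = ?",
            (row["email"],),
        ).fetchone()
        conn.execute(
            "INSERT INTO invoices (customer_id, total_cents) VALUES (?, ?)",
            (customer_id, row["total_cents"]),
        )
    conn.commit()


def totals_by_customer(conn):
    """Query: join invoices to customers through the shared customer ID."""
    return conn.execute(
        """
        SELECT c.name, COUNT(*) AS invoice_count, SUM(i.total_cents) AS total_cents
        FROM invoices AS i
        JOIN customers AS c ON c.customer_id = i.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
report = totals_by_customer(conn)
print(report)
```

SQLite is used here only because it ships with Python. The same modeling ideas carry over to the larger database servers, though the exact SQL syntax varies a little between them.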
Next, you’ll need to learn database server software. Here are the big three:
- Microsoft SQL Server
- MySQL
- Oracle Database
Each of these is a well-established relational database system that uses SQL as its query language. However, each includes its own dialect of SQL with additional programming features:
- Microsoft SQL Server’s language is called T-SQL (which stands for Transact-SQL).
- MySQL’s programming language doesn’t have an official name, but it allows you to create stored procedures using traditional programming constructs.
- Oracle Database’s language is called PL/SQL (which stands for Procedural Language extensions to SQL).
In addition to these relational databases, you’ll also want to learn about non-relational databases. Here are the two biggest:
- MongoDB: A popular database system that stores data in a format referred to as documents.
- DynamoDB: This is an Amazon Web Services non-relational database that is adept at storing massive amounts of data for quick retrieval.
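To get a feel for the document format, here is a rough sketch using only plain Python and JSON, with no MongoDB or DynamoDB driver involved. It shows a single invoice stored as one self-contained, JSON-like document, with the customer and line items nested inside it rather than split across separate tables. The field names and values are hypothetical.

```python
import json

# Hypothetical invoice modeled as a single document. The customer and the
# line items are nested inside it instead of living in separate tables.
invoice_document = {
    "invoice_id": 1001,
    "customer": {"name": "Ada Lovelace", "email": "ada@example.com"},
    "line_items": [
        {"sku": "WIDGET-1", "quantity": 2, "unit_price_cents": 2500},
        {"sku": "GADGET-7", "quantity": 1, "unit_price_cents": 7050},
    ],
}


def invoice_total_cents(doc):
    """Add up the line items stored inside the document itself."""
    return sum(
        item["quantity"] * item["unit_price_cents"] for item in doc["line_items"]
    )


# Round-trip through JSON to show the document is just structured text.
serialized = json.dumps(invoice_document, indent=2)
restored = json.loads(serialized)
total = invoice_total_cents(restored)
print(serialized)
```

The trade-off is that everything about the invoice can be read in one lookup, but the customer’s details are repeated in every invoice document instead of being stored once.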
There are many different software tools you’ll need to learn, as well:
- Data platforms such as Cloudera, Databricks, and Snowflake: These are online tools that work with various cloud providers and offer different features such as data warehousing, Big Data, and data analytics. Learn as many of these tools as you can.
- Reporting and visualization tools: Reporting tools help create reports of data. Visualization tools help create graphs of data. There are too many tools to list here, but you should become somewhat familiar with as many as possible. Power BI and Tableau are two of the biggest names in the business. Each tool has different strengths and features.
- Warehousing tools: Data warehousing refers to storing and managing massive amounts of data, usually consolidated from many sources for analysis. It came about because many of the popular database systems (such as MySQL) were not designed to store and manage the huge amounts of data today’s organizations acquire.
- Big data frameworks such as Apache Hive, Spark, Presto, and Hadoop: These are all popular open source tools that help you manage huge amounts of data. They do have a steep learning curve, but they’re important parts of big data architecture.
Next, you’ll need to learn various cloud platforms. There are three big ones, and each one has several different services that provide data capabilities:
- Microsoft Azure has several tools for managing big data such as Azure Synapse Analytics and Azure Databricks.
- Google’s research on large-scale data processing helped launch the big data movement, and Google Cloud includes several data services, most notably its data warehousing service, BigQuery.
- AWS has a service called EMR (originally Elastic MapReduce), which is essentially a managed, cloud-based way to run the Apache tools mentioned earlier.
Pro tip: Pick one cloud first and learn as much as you can about it. Then you’ll find that the other cloud systems have similar capabilities, and you’ll learn those pretty quickly. Amazon EMR is a good place to start, as it will also help you learn Apache Hadoop and Hive.
And finally: AI (artificial intelligence) and ML (machine learning). Although these two topics aren’t new, over the past several years there have been massive leaps in both technologies, and they’ll be an important part of data architecture for decades to come. Learn about large language models (LLMs) and GPT (generative pre-trained transformer) models. And be ready for new technology that will certainly appear over the years.
What are the data architect career paths?
For starters, understand that there’s no entry-level position called “data architect.” Instead, you start in related areas, and after 5-10 years you’ll eventually work your way into a data architect position. It’s very much a mid-level or senior role.
So let’s look at the different ways to get to data architect, and where you can go after you get there.
In general, there are two ways you can start:
- Junior Software Developer (or Junior Programmer). In this case, you’ll want to make sure that the work you do has plenty of data involved. Make sure you learn as much as you can during this stage about data analytics and data modeling.
- Junior Data Analyst. If you start here, you’ll want to make sure you also pick up some coding experience. You may well get it on the job through some Python programming. If not, then you’ll want to work some side gigs or help out on some open source Python projects.
Pro tip: During this stage, it’s okay if you decide you don’t want to be a data architect! This is your chance to try out many different areas related to software development and architecture.
Next, you will move up or transition into a mid-level data-oriented job, such as mid-level data analyst or data scientist. Here you will start learning and perfecting your craft.
After that, you could go in many different directions, such as senior data scientist. However, you might decide now is the time to move to data architect.
The data architect field essentially has two levels:
- Data Architect (where you’ll finally be using all the skills mentioned above)
- Senior Data Architect
The basic difference is that, when you reach senior data architect, you’ll likely be managing a team of data architects. This will likely involve less hands-on data architecture work, and more working with stakeholders in the organization to help them decide how they want to move forward with the different data options.
Do you need to know programming?
Yes, but only to a point. You’ll want to reach a mid-level proficiency in the Python and R programming languages. You won’t need to reach a level where you’re building entire software systems, but you do need to know how to code in both languages.
As for other languages (such as C++, C#, Java, and JavaScript), you probably don’t need to spend time learning them. While it can’t hurt, spending such time would take away from mastering the other important skills needed to become a data architect.
Do you need a degree?
In theory, you could potentially work your way up to a data architect position without a degree. However, the reality is if you have your sights set on being a data architect, you will want to plan to get a bachelor’s degree at a minimum. Some options here are computer science, mathematics, or statistics. Other science and engineering degrees might work, too; but if you’re planning ahead, focus on a degree with plenty of computer programming, database concepts, data analytics, and statistics.
While a master’s degree isn’t essential, it can definitely give you a boost over other candidates applying for the same position. As recently as a decade ago, a master’s was often the bare minimum, and some jobs even required a PhD. But now you can typically get by with a bachelor’s, thanks to employers’ increasing focus on skills-based hiring. (A PhD would only be necessary if you eventually want to go into research or become a professor of data architecture.)
Conclusion
A data architect position takes a lot of work and years to achieve. You’ll start by completing a bachelor’s degree, and then landing a job as a programmer or data analyst. You’ll focus as much effort as you can on data modeling, data governance, and the necessary tools for such. Eventually, you’ll have enough experience to become a data architect. With the right skills, you can make a real difference in organizations’ use of data.