She Learns Data, She Leads
Data analysis is a comprehensive method of inspecting, cleansing, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It is a multifaceted process involving a range of techniques and methodologies to interpret data from diverse sources and in different formats, both structured and unstructured.
Data analysis is not just a mere process; it's a tool that empowers organizations to make informed decisions, predict trends, and improve operational efficiency. It's the backbone of strategic planning in businesses, governments, and other organizations.
Consider the example of a leading e-commerce company. Through data analysis, they can understand their customers' buying behavior, preferences, and patterns. They can then use this information to personalize customer experiences, forecast sales, and optimize marketing strategies, ultimately driving business growth and customer satisfaction.
A data analyst inspects, cleans, transforms, and models data to extract insights and support decision-making. As a data analyst, your role involves dissecting vast datasets, unearthing hidden patterns, and translating numbers into actionable information.
In a technical sense, data analytics can be described as the process of using data to answer questions, identify trends, and extract insights. There are multiple types of analytics that can generate information to drive innovation, improve efficiency, and mitigate risk.
There are four key types of data analytics, and each answers a different type of question:
Descriptive analytics asks, “What happened?”
Predictive analytics asks, “What might happen in the future?”
Prescriptive analytics asks, “What should be done next?”
Diagnostic analytics asks, “Why did this happen?”
Each of these types offers its own insights, advantages, and disadvantages. Used in combination, they provide a more complete understanding of a business's needs and opportunities.
1. Descriptive Analytics
Descriptive analytics primarily uses observed data to identify key characteristics of a data set. It relies solely on historical data to provide reports on past events. This type of analysis is also used to generate ad hoc (as needed) reports that summarize large amounts of data to answer simple questions like “how much?” or “how many?” It can also be used to ask deeper questions about a specific problem. Descriptive analytics is not used to draw inferences or predictions from its findings; it is just a starting point used to inform decisions or to prepare data for further analysis.
The descriptive analytics process is as follows:
-Ask a historical question that needs an answer, such as “How much of product X did we sell last year?”
-Identify required data to answer the question
-Collect and prepare data
-Analyze data
-Present results
Examples of descriptive analytics include:
-Summarizing historical events such as sales, inventory, or operations data
-Understanding engagement data such as likes and dislikes or volume of page views over time
-Reporting general trends like revenue growth or employee injuries
-Collating survey results
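To make this concrete, here is a minimal sketch in Python (using pandas) of the kind of summary a descriptive question calls for. The table, column names, and figures are invented purely for illustration.

```python
# A minimal descriptive-analytics sketch: summing historical sales to answer
# "How much of product X did we sell last year?" All data here is made up.
import pandas as pd

sales = pd.DataFrame({
    "product": ["X", "X", "Y", "X", "Y"],
    "year": [2023, 2023, 2023, 2024, 2024],
    "units_sold": [120, 80, 40, 150, 60],
})

# "How much of product X did we sell last year?"
units_last_year = sales[(sales["product"] == "X") & (sales["year"] == 2023)]["units_sold"].sum()
print(f"Units of product X sold in 2023: {units_last_year}")

# Summarize historical sales per product per year
print(sales.groupby(["product", "year"])["units_sold"].sum())
```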
2. Predictive Analytics
Predictive analytics utilizes real-time and/or past data to make predictions based on probabilities. It can also be used to infer missing data or establish a predicted future trend. Predictive analytics uses simulation models and forecasting to suggest what could happen going forward, which can guide realistic goal setting, effective planning, management of performance expectations, and avoiding risks. This information can empower executives and managers to take a proactive and fact-based approach to strategy and decision making.
The predictive analytics process is as follows:
-Ask a forward-thinking question, such as “Can we predict how much product X we will sell next year?”
-Collect and prepare data
-Develop predictive analytics models
-Apply models to the prepared data
-Review models and present results
Examples of predictive analytics include:
-Forecasting customer behavior, purchasing patterns, and identifying sales trends
-Predicting customer preferences and recommending products to customers based on past purchases and search history
-Predicting the likelihood that a given customer will purchase another product or leave the store
-Identifying possible security breaches that require further investigation
-Predicting staffing and resourcing needs
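As a hedged illustration, the sketch below fits a simple linear trend to past yearly sales and extrapolates one year ahead. Real predictive models account for seasonality, uncertainty, and many more variables; the figures here are invented.

```python
# A very simple predictive model: fit a straight line to historical yearly
# sales and extrapolate to next year. The numbers are illustrative only.
import numpy as np

years = np.array([2019, 2020, 2021, 2022, 2023])
units_sold = np.array([900, 980, 1050, 1150, 1230])

# Fit a degree-1 polynomial (linear trend) to the historical data
slope, intercept = np.polyfit(years, units_sold, deg=1)

forecast_2024 = slope * 2024 + intercept
print(f"Predicted units of product X for 2024: {forecast_2024:.0f}")
```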
3. Prescriptive Analytics
Prescriptive analytics builds on descriptive and predictive analysis by recommending courses of action that will reap the greatest benefit for the organization. In short, prescriptive analytics tells you what should be done in a given situation. It helps executives, managers, and employees make the best decisions based on available data.
A good example of prescriptive analytics is the field of GPS-based map and direction applications. These applications provide route options to a destination based on traffic volume, road conditions, and maximum speed. They can then prescribe the best route based on user-defined objectives such as shortest distance or quickest time.
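As a toy illustration of prescriptive logic, the sketch below picks the quickest route through a tiny, made-up network of travel times. Real navigation apps work with live traffic data and far more sophisticated routing; this only shows the idea of recommending an action based on a user-defined objective.

```python
# Recommend the quickest route through a small, invented network of travel
# times (minutes) using Dijkstra's shortest-path algorithm.
import heapq

graph = {
    "Home":   {"A": 4, "B": 2},
    "A":      {"Office": 5},
    "B":      {"A": 1, "Office": 8},
    "Office": {},
}

def quickest_route(graph, start, goal):
    # Priority queue of (total minutes so far, current node, path taken)
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        minutes, node, path = heapq.heappop(queue)
        if node == goal:
            return minutes, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, cost in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (minutes + cost, neighbour, path + [neighbour]))
    return None

print(quickest_route(graph, "Home", "Office"))  # (8, ['Home', 'B', 'A', 'Office'])
```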
4. Diagnostic Analytics
Diagnostic analytics enhances the descriptive analytics process by digging in deeper and attempting to discover the cause(s).
The diagnostic analytics process is as follows:
-Identify anomalies (inconsistencies) in data sets
-Collect data related to the anomalies
-Use statistical techniques to uncover relationships and trends that could explain the anomalies
-Present possible causes
Examples of diagnostic analytics include:
Using subscription cancellations, correlated with customer comments and ratings, to determine the most common reasons why users cancel their subscriptions. Another example would be determining whether there is a correlation between the demographics of consumers and their purchasing patterns at specific times of year.
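A minimal sketch of the first example, assuming a fabricated table of customer ratings and cancellations: grouping and correlating the two columns hints at why users cancel.

```python
# A diagnostic question: is there a relationship between customer ratings and
# subscription cancellations? The data is fabricated for illustration.
import pandas as pd

customers = pd.DataFrame({
    "rating":    [1, 2, 2, 3, 4, 4, 5, 5, 5, 3],
    "cancelled": [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],  # 1 = cancelled subscription
})

# Cancellation rate by rating: low ratings cancel far more often
print(customers.groupby("rating")["cancelled"].mean())

# Correlation between rating and cancellation (negative = higher ratings, fewer cancellations)
print(customers["rating"].corr(customers["cancelled"]))
```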
Data analysis plays a pivotal role in today's data-driven world. It helps organizations harness the power of data, enabling them to make decisions, optimize processes, and gain a competitive edge. By turning raw data into meaningful insights, data analysis empowers businesses to identify opportunities, mitigate risks, and enhance their overall performance.
1. Informed Decision-Making
Data analysis is the compass that guides decision-makers through a sea of information. It enables organizations to base their choices on concrete evidence rather than intuition or guesswork. In business, this means making decisions more likely to lead to success, whether choosing the right marketing strategy, optimizing supply chains, or launching new products. By analyzing data, decision-makers can assess various options' potential risks and rewards, leading to better choices.
2. Improved Understanding
Data analysis provides a deeper understanding of processes, behaviors, and trends. It allows organizations to gain insights into customer preferences, market dynamics, and operational efficiency.
3. Competitive Advantage
Organizations can identify opportunities and threats by analyzing market trends, consumer behavior, and competitor performance. They can pivot their strategies to respond effectively, staying one step ahead of the competition. This ability to adapt and innovate based on data insights can lead to a significant competitive advantage.
4. Risk Mitigation
Data analysis is a valuable tool for risk assessment and management. Organizations can assess potential issues and take preventive measures by analyzing historical data. For instance, data analysis detects fraudulent activities in the finance industry by identifying unusual transaction patterns. This not only helps minimize financial losses but also safeguards the reputation and trust of customers.
5. Efficient Resource Allocation
Data analysis helps organizations optimize resource allocation. Whether it's allocating budgets, human resources, or manufacturing capacities, data-driven insights can ensure that resources are utilized efficiently. For example, data analysis can help hospitals allocate staff and resources to the areas with the highest patient demand, ensuring that patient care remains efficient and effective.
6. Continuous Improvement
Data analysis is a catalyst for continuous improvement. It allows organizations to monitor performance metrics, track progress, and identify areas for enhancement. This iterative process of analyzing data, implementing changes, and analyzing again leads to ongoing refinement and excellence in processes and products.
As a data analyst, it’s your job to turn raw data into meaningful insights.
Any kind of data analysis usually starts with a specific problem you want to solve, or a question you need to answer—for example, “Why did we lose so many customers in the last quarter?” or “Why are patients dropping out of their therapy programs at the halfway mark?”
To find the insights and answers you need, you’ll generally go through the following steps:
1. Define your question or problem statement
2. Collect the necessary raw data
3. Clean the data so that it’s ready for analysis
4. Analyze the data
5. Create visualizations
6. Share your findings
1. Step one: Defining the question
The first step in any data analysis process is to define your objective. In data analytics jargon, this is sometimes called the ‘problem statement’.
Defining your objective means coming up with a hypothesis and figuring out how to test it. Start by asking: What business problem am I trying to solve? While this might sound straightforward, it can be trickier than it seems. For instance, your organization’s senior management might pose an issue, such as: Why are we losing customers? What is the customer's perception of our brand? What type of packaging is more engaging to our potential customers? It’s possible, though, that this doesn’t get to the core of the problem. A data analyst’s job is to understand the business and its goals in enough depth that they can frame the problem the right way.
The first step in the data analysis process is to define the objectives and formulate clear, specific questions that your analysis aims to answer. This step is crucial as it sets the direction for the entire process. It involves understanding the problem or situation at hand, identifying the data needed to address it, and defining the metrics or indicators to measure the outcomes.
2. Step two: Collecting the data
Once you’ve established your objective, you’ll need to create a strategy for collecting and aggregating the appropriate data. A key part of this is determining which data you need. This might be quantitative (numeric) data, e.g. sales figures, or qualitative (descriptive) data, such as customer reviews. All data fit into one of three categories: first-party, second-party, and third-party data. Let’s explore each one.
What is first-party data?
First-party data are data that you, or your company, have directly collected from customers. It might come in the form of transactional tracking data or information from your company’s customer relationship management (CRM) system. Whatever its source, first-party data is usually structured and organized in a clear, defined way. Other sources of first-party data might include customer satisfaction surveys, focus groups, interviews, or direct observation.
What is second-party data?
To enrich your analysis, you might want to secure a secondary data source. Second-party data is the first-party data of other organizations. This might be available directly from the company or through a private marketplace. The main benefit of second-party data is that it is usually structured, and although it may be less relevant than first-party data, it also tends to be quite reliable. Examples of second-party data include website, app, or social media activity, like online purchase histories or shipping data.
What is third-party data?
Third-party data is data that has been collected and aggregated from numerous sources by a third-party organization. Often (though not always) third-party data contains a vast amount of unstructured data points (big data). Many organizations collect big data to create industry reports or to conduct market research. The research and advisory firm Gartner is a good real-world example of an organization that collects big data and sells it on to other companies. Open data repositories and government portals are also sources of third-party data.
Once the objectives and questions are defined, the next step is to collect the relevant data. This can be done through various methods such as surveys, interviews, observations, or extracting from existing databases. The data collected can be quantitative (numerical) or qualitative (non-numerical), depending on the nature of the problem and the questions being asked.
Step 3: Data cleaning
Data cleaning, also known as data cleansing, is a critical step in the data analysis process. It involves checking the data for errors and inconsistencies, and correcting or removing them. This step ensures the quality and reliability of the data, which is crucial for obtaining accurate and meaningful results from the analysis.
Once you’ve collected your data, the next step is to get it ready for analysis. This means cleaning, or ‘scrubbing’ it, and is crucial in making sure that you’re working with high-quality data. Key data cleaning tasks include:
Removing major errors, duplicates, and outliers—all of which are inevitable problems when aggregating data from numerous sources.
Removing unwanted data points—extracting irrelevant observations that have no bearing on your intended analysis.
Bringing structure to your data—general ‘housekeeping’, i.e. fixing typos or layout issues, which will help you map and manipulate your data more easily.
Filling in major gaps—as you’re tidying up, you might notice that important data are missing. Once you’ve identified gaps, you can go about filling them.
A good data analyst will spend around 70-90% of their time cleaning their data. This might sound excessive, but focusing on the wrong data points (or analyzing erroneous data) will severely impact your results. In a field like marketing, bad insights can mean wasting money on poorly targeted campaigns. In a field like healthcare or the sciences, it can quite literally mean the difference between life and death.
Step 4: Data analysis
Finally, you’ve cleaned your data. Now comes the fun bit—analyzing it! The type of data analysis you carry out largely depends on what your goal is. But there are many techniques available. Univariate or bivariate analysis, time-series analysis, and regression analysis are just a few you might have heard of. More important than the different types, though, is how you apply them. This depends on what insights you’re hoping to gain.
Once the data is cleaned, it's time for the actual analysis. This involves applying statistical or mathematical techniques to the data to discover patterns, relationships, or trends. There are various tools and software available for this purpose, such as Python, R, Excel, and specialized software like SPSS and SAS.
Step 5: Data interpretation and visualization
You’ve finished carrying out your analyses. You have your insights. The next step of the data analytics process is to share these insights with the wider world (or at least with your organization’s stakeholders!) This is more complex than simply sharing the raw results of your work—it involves interpreting the outcomes, and presenting them in a manner that’s digestible for all types of audiences. Since you’ll often present information to decision-makers, it’s very important that the insights you present are 100% clear and unambiguous. For this reason, data analysts commonly use reports, dashboards, and interactive visualizations to support their findings.
After the data is analyzed, the next step is to interpret the results and visualize them in a way that is easy to understand. This could involve creating charts, graphs, or other visual representations of the data. Data visualization helps to make complex data more understandable and provides a clear picture of the findings.
Step 6: Data storytelling
The final step in the data analysis process is data storytelling. This involves presenting the findings of the analysis in a narrative form that is engaging and easy to understand. Data storytelling is crucial for communicating the results to non-technical audiences and for making data-driven decisions.
Business Overview
Businesses often have very different goals, but every type of business benefits from understanding their data to gain insights into their processes, procedures and customer behaviors. Let’s examine a couple of example projects and the questions that might be answered by an analysis.
Where should the marketing budget be concentrated to increase new business?
A bicycle manufacturer wants to identify areas where sales could be improved by targeted marketing efforts. Using sales records, customer demographic information and products, they can decide where advertising is most likely to increase new business.
How can a company make sure they are stocking the right inventory?
A medical supply company for hospitals knows that certain common conditions require high volumes of certain supplies, and that patient demographics are associated with certain diagnoses of these conditions. They want to use hospital data to gain insights into patient demographics and common diagnoses to identify patterns. Analysts can help the company find these patterns and ensure that they have the right supplies on hand at the right times to meet the needs of their customers.
What products or services should be recommended to a customer?
Recommendation systems are a popular way to introduce current consumers to new products and services. Think of the many recommendation systems that you encounter during your online experiences. A very common example of this is video subscription services, which often analyze customer purchasing habits, ratings and reviews to help determine which products and services to recommend to those customers.
But data analysis is not limited to just business problems.
Think about your personal interests and questions that you might want to answer in order to gain a better understanding of the field. As you work through the labs and activities in the course, you will be building valuable skills for each step of the Data Analysis Lifecycle. Potential employers expect to see evidence that demonstrates the ability to complete a project from start to finish. For data analyst positions, this is usually done through a project portfolio. Starting now and continuing through this course, you can be planning to create your own personal project portfolio to share with prospective employers.
As we’ve established, it’s your job as a data analyst to turn raw data into meaningful insights, starting from a specific problem you want to solve or a question you need to answer.
But the role isn’t just limited to a single process. As a data analyst, you occupy a very critical position; you’re the bridge between incomprehensible raw data and useful insights, empowering people in all areas of the organization to make smarter decisions and ultimately reach their goals. As such, you’ll work closely with managers, product owners, and department leads to identify goals, prioritize needs, and shape strategies.
And, in addition to actually analyzing data, you may also be responsible for building databases and dashboards, ensuring data quality and best practices, and maintaining relevant documentation.
Of course, the exact tasks and responsibilities will vary depending on where you work.
A Good Grasp Of Maths And Statistics
The amount of maths you will use as a data analyst will vary depending on the job. Some jobs may require working with maths more than others.
You don’t necessarily need to be a math wizard, but with that said, having at least a fundamental understanding of math basics can be of great help.
Data analysts need to have good knowledge of statistics and probability for gathering and analyzing data, figuring out patterns, and drawing conclusions from the data.
Knowledge of Excel
Excel is one of the most essential tools used in Data analysis.
It is used for storing, structuring, and formatting data, performing calculations, summarizing data and identifying trends, sorting data into categories, and creating reports.
You can also use Excel to create charts and graphs.
Knowledge of SQL and Relational Databases
Data analysts need to know how to interact with relational databases to extract data.
A database is an electronic storage location for data, from which data can be easily retrieved and searched.
A relational database stores data in a structured format, and every data item has pre-defined relationships with the other items.
SQL stands for Structured Query Language and is the language used for querying and interacting with relational databases.
By writing SQL queries you can perform CRUD (Create, Read, Update, and Delete) operations on data.
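As a small, self-contained sketch, the snippet below runs CRUD-style SQL statements through Python's built-in sqlite3 module (so no separate database server is needed). The table and values are invented.

```python
# CRUD operations with SQL via Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Ada", "London"))

# Read
cur.execute("SELECT name, city FROM customers WHERE city = ?", ("London",))
print(cur.fetchall())

# Update
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Paris", "Ada"))

# Delete
cur.execute("DELETE FROM customers WHERE name = ?", ("Ada",))
conn.commit()
conn.close()
```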
Knowledge of a Programming Language
To further organize and manipulate databases, data analysts benefit from knowing a programming language.
Two of the most popular ones used in the data analysis field are Python and R.
Python is a general-purpose programming language, and it is very beginner-friendly thanks to its syntax that resembles the English language. It is also one of the most used technical tools for data analysis.
Python offers a wealth of packages and libraries for data manipulation, such as Pandas and NumPy, as well as for data visualization, such as Matplotlib. Once you understand the fundamentals, you can move on to learning about Pandas, NumPy, and Matplotlib.
Knowledge of data visualization tools
Data visualization is the graphical interpretation and presentation of data.
This includes creating graphs, charts, interactive dashboards, or maps that can be easily shared with other team members and important stakeholders.
Data visualization tools are essentially used to tell a story with data and drive decision-making.
One of the most popular data visualization tools used is Tableau.
Data analysts work anywhere and everywhere. They are critical to almost any kind of organization and industry you can think of—from large corporations to fledgling startups, from financial institutions to government, healthcare, and non-profit organizations. Wherever data is being collected (and that’s pretty much everywhere these days!), there’s a need for data analysts.
If you’re considering a career in the field, you’ll find that your skills are needed everywhere. That’s one of the great things about the job: you’re not limited to a specific sector or type of company. Once you’re a qualified data analyst, the world really is your oyster.
Data analysis is a versatile and indispensable tool that finds applications across various industries and domains. Its ability to extract actionable insights from data has made it a fundamental component of decision-making and problem-solving. Let's explore some of the key applications of data analysis:
1. Business and Marketing
-Market Research: Data analysis helps businesses understand market trends, consumer preferences, and competitive landscapes. It aids in identifying opportunities for product development, pricing strategies, and market expansion.
-Sales Forecasting: Data analysis models can predict future sales based on historical data, seasonality, and external factors. This helps businesses optimize inventory management and resource allocation.
2. Healthcare and Life Sciences
Disease Diagnosis: Data analysis is vital in medical diagnostics, from interpreting medical images (e.g., MRI, X-rays) to analyzing patient records. Machine learning models can assist in early disease detection.
Drug Discovery: Pharmaceutical companies use data analysis to identify potential drug candidates, predict their efficacy, and optimize clinical trials.
Genomics and Personalized Medicine: Genomic data analysis enables personalized treatment plans by identifying genetic markers that influence disease susceptibility and response to therapies.
3. Finance
-Risk Management: Financial institutions use data analysis to assess credit risk, detect fraudulent activities, and model market risks.
-Algorithmic Trading: Data analysis is integral to developing trading algorithms that analyze market data and execute trades automatically based on predefined strategies.
-Fraud Detection: Credit card companies and banks employ data analysis to identify unusual transaction patterns and detect fraudulent activities in real time.
4. Manufacturing and Supply Chain
Quality Control: Data analysis monitors and controls product quality on manufacturing lines. It helps detect defects and ensure consistency in production processes.
Inventory Optimization: By analyzing demand patterns and supply chain data, businesses can optimize inventory levels, reduce carrying costs, and ensure timely deliveries.
5. Social Sciences and Academia
Social Research: Researchers in social sciences analyze survey data, interviews, and textual data to study human behavior, attitudes, and trends. It helps in policy development and understanding societal issues.
Academic Research: Data analysis is crucial to scientific research in physics, biology, environmental science, and many other fields. It assists in interpreting experimental results and drawing conclusions.
6. Internet and Technology
Search Engines: Google uses complex data analysis algorithms to retrieve and rank search results based on user behavior and relevance.
Recommendation Systems: Services like Netflix and Amazon leverage data analysis to recommend content and products to users based on their past preferences and behaviors.
7. Environmental Science
Climate Modeling: Data analysis is essential in climate science, where temperature, precipitation, and other environmental data are analyzed to understand climate patterns and predict future trends.
Environmental Monitoring: Remote sensing data analysis monitors ecological changes, including deforestation, water quality, and air pollution.
Data is now being collected and shared across many different organizations and in many different formats; a collection of data is referred to as a dataset.
Datasets may exist for the private use of an individual organization or shared across the internet to anyone who wants to reference them. An example of a private dataset is a physician’s patient dataset, which might include "patient demographics", "test results", "diagnosis", and "appointment schedules". Access to this dataset is limited to those with permission to use it. In contrast, an example of a publicly available dataset is the World Health Organization (WHO) open data repository, which contains health-related statistics for its 194 member countries and can be downloaded by anyone.
Datasets often contain multiple related files stored in different formats. Information about a dataset, including a description of what it contains and how it is formatted, is called metadata. Metadata files are valuable tools to provide analysts with an understanding of the data within the dataset.
One of the most common formats used to package and exchange data is the Comma Separated Values (CSV) format. Often, datasets that are publicly available may be made up of multiple CSV files that contain related data. These CSV files can be imported into tools such as Excel for further investigation and analysis. Later in this module, you will obtain and work with a CSV.
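As a hedged example of what that first look at a CSV might involve, the snippet below loads CSV data with pandas and inspects its shape, first rows, and column types. The content is built in memory so the example is self-contained; in practice you would pass a file path (for example, pd.read_csv("dataset.csv")) to a file you have downloaded.

```python
# Loading and inspecting CSV data with pandas. The CSV text here is invented.
from io import StringIO
import pandas as pd

csv_text = "product,units_sold,price\nEspresso,420,2.50\nLatte,610,3.20\n"
df = pd.read_csv(StringIO(csv_text))

print(df.shape)      # number of rows and columns
print(df.head())     # first few rows
print(df.dtypes)     # data type of each column
```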
Key data analyst tools
As you are learning, the most common programs and solutions used by data analysts include spreadsheets, query languages, and visualization tools. In this reading, you will learn more about each one. You will cover when to use them, and why they are so important in data analytics.
Spreadsheets
Data analysts rely on spreadsheets to collect and organize data. Two popular spreadsheet applications you will probably use a lot in your future role as a data analyst are Microsoft Excel and Google Sheets.
Spreadsheets structure data in a meaningful way by letting you
Collect, store, organize, and sort information
Identify patterns and piece the data together in a way that works for each specific data project
Create excellent data visualizations, like graphs and charts.
Databases and query languages
A database is a collection of structured data stored in a computer system. Some popular database systems that use Structured Query Language (SQL) include MySQL, Microsoft SQL Server, and BigQuery.
Query languages
-Allow analysts to isolate specific information from a database(s)
-Make it easier for you to learn and understand the requests made to databases
-Allow analysts to select, create, add, or download data from a database for analysis
Visualization tools
Data analysts use a number of visualization tools, like graphs, maps, tables, charts, and more. Popular visualization tools include Tableau, Power BI, and Looker.
These tools
-Turn complex numbers into a story that people can understand
-Help stakeholders come up with conclusions that lead to informed decisions and effective business strategies
-Have multiple features; Tableau's simple drag-and-drop interface, for example, lets users create interactive graphs in dashboards
A career as a data analyst also involves using programming languages, like R and Python, which are widely used for statistical analysis, visualization, and other data analysis tasks.
In this course, you will have the opportunity to use one essential data analytics tool: Excel.
What is Excel?
Excel is a powerful tool suitable for small datasets and quick data analysis. With Excel, you can manipulate data, summarize it with pivot tables, visualize it, and compute quick summary statistics.
Why it’s important to know:
Excel is powerful and very popular for performing small-scale data analysis, calculations, data summaries, and data visualizations.
Excel skills you will learn in this course:
-Perform data cleaning by removing blank spaces as well as incorrect and outdated information
-Format and adjust data using conditional formatting
-Perform data calculations using formulas
-Organize data using sorting and filtering
-Create visualizations using graphing and charting
-Calculate, summarize, and analyze data using pivot tables
-Aggregate data for analysis
Key takeaway
You have a lot of tools as a data analyst. This is a first glance at the possibilities, and you will explore many of these tools in-depth throughout this program.
What is Kaggle?
Kaggle is a platform for data science competitions, where data scientists and machine learning engineers can compete with each other to create the best models for solving specific problems or analyzing certain data sets. The platform also provides a community where users can collaborate on projects, share code and data sets, and learn from each other's work. Kaggle was founded in 2010 and acquired by Google in 2017; the platform is now part of Google Cloud.
Kaggle hosts a variety of competitions sponsored by organizations, ranging from predicting medical outcomes to classifying images or identifying fraudulent transactions. Participants can submit their models and see how they perform on a public leaderboard, as well as receive feedback from other competitors and the community.
In addition to competitions, Kaggle also offers public data sets, machine learning notebooks, and tutorials to help users learn and practice their skills in data science and machine learning. It has become a popular platform for both novice and experienced data scientists to improve their skills, build their portfolios, and connect with others in the industry.
In this course, we will focus on Excel.
Six problem types
Data analytics is so much more than just plugging information into a platform to find insights. It is about solving problems.
To get to the root of these problems and find practical solutions, there are lots of opportunities for creative thinking. No matter the problem, the first and most important step is understanding it. From there, it is good to take a problem-solver approach to your analysis to help you decide what information needs to be included, how you can transform the data, and how the data will be used.
Data analysts typically work with six problem types
1. Making predictions 2. Categorizing things 3. Spotting something unusual 4. Identifying themes 5. Discovering connections 6. Finding patterns
1. Making predictions
A company that wants to know the best advertising method to bring in new customers is an example of a problem requiring analysts to make predictions. Analysts with data on location, type of media, and number of new customers acquired as a result of past ads can't guarantee future results, but they can help predict the best placement of advertising to reach the target audience.
2. Categorizing things
An example of a problem requiring analysts to categorize things is a company's goal to improve customer satisfaction. Analysts might classify customer service calls based on certain keywords or scores. This could help identify top-performing customer service representatives or help correlate certain actions taken with higher customer satisfaction scores.
3. Spotting something unusual
A company that sells smart watches that help people monitor their health would be interested in designing their software to spot something unusual. Analysts who have analyzed aggregated health data can help product developers determine the right algorithms to spot and set off alarms when certain data doesn't trend normally.
4. Identifying themes
User experience (UX) designers might rely on analysts to analyze user interaction data. Similar to problems that require analysts to categorize things, usability improvement projects might require analysts to identify themes to help prioritize the right product features for improvement. Themes are most often used to help researchers explore certain aspects of data. In a user study, user beliefs, practices, and needs are examples of themes.
5. Discovering connections
A third-party logistics company working with another company to get shipments delivered to customers on time is a problem requiring analysts to discover connections. By analyzing the wait times at shipping hubs, analysts can determine the appropriate schedule changes to increase the number of on-time deliveries.
6. Finding patterns
Minimizing downtime caused by machine failure is an example of a problem requiring analysts to find patterns in data. For example, by analyzing maintenance data, they might discover that most failures happen if regular maintenance is delayed by more than a 15-day window.
Key takeaway
As you move through this program, you will develop a sharper eye for problems and you will practice thinking through the problem types when you begin your analysis. This method of problem solving will help you figure out solutions that meet the needs of all stakeholders.
Companies in lots of industries today are dealing with rapid change and rising uncertainty. Even well-established businesses are under pressure to keep up with what is new and figure out what is next. To do that, they need to ask questions. Asking the right questions can help spark the innovative ideas that so many businesses are hungry for these days.
The same goes for data analytics. No matter how much information you have or how advanced your tools are, your data won’t tell you much if you don’t start with the right questions. Think of it like a detective with tons of evidence who doesn’t ask a key suspect about it. Coming up, you will learn more about how to ask highly effective questions, along with certain practices you want to avoid.
Highly effective questions are SMART questions: Specific, Measurable, Action-oriented, Relevant, and Time-bound.
Examples of SMART questions
Here's an example that breaks down the thought process of turning a problem question into one or more SMART questions using the SMART method: What features do people look for when buying a new car?
Specific: Does the question focus on a particular car feature?
Measurable: Does the question include a feature rating system?
Action-oriented: Does the question influence creation of different or new feature packages?
Relevant: Does the question identify which features make or break a potential car purchase?
Time-bound: Does the question validate data on the most popular features from the last three years?
Questions should be open-ended. This is the best way to get responses that will help you accurately qualify or disqualify potential solutions to your specific problem. So, based on the thought process, possible SMART questions might be:
S- On a scale of 1-10 (with 10 being the most important), how important is it that your car has four-wheel drive?
M-What are the top five features you would like to see in a car package?
A-What features, if included with four-wheel drive, would make you more inclined to buy the car?
R-How much more would you pay for a car with four-wheel drive?
T-Has four-wheel drive become more or less popular in the last three years?
Things to avoid when asking questions
1. Leading questions: questions that only have a particular response
Example: This product is too expensive, isn’t it?
This is a leading question because it suggests an answer as part of the question. A better question might be, “What is your opinion of this product?” There are tons of answers to that question, and they could include information about usability, features, accessories, color, reliability, and popularity, on top of price. Now, if your problem is actually focused on pricing, you could ask a question like “What price (or price range) would make you consider purchasing this product?” This question would provide a lot of different measurable responses.
2. Closed-ended questions: questions that ask for a one-word or brief response only
Example: Were you satisfied with the customer trial?
This is a closed-ended question because it doesn’t encourage people to expand on their answer. It is really easy for them to give one-word responses that aren’t very informative. A better question might be, “What did you learn about customer experience from the trial?” This encourages people to provide more detail besides “It went well.”
3. Vague questions: questions that aren’t specific or don’t provide context
Example: Does the tool work for you?
This question is too vague because there is no context. Is it about comparing the new tool to the one it replaces? You just don’t know. A better inquiry might be, “When it comes to data entry, is the new tool faster, slower, or about the same as the old tool? If faster, how much time is saved? If slower, how much time is lost?” These questions give context (data entry) and help frame responses that are measurable (time).
This reading illustrates the importance of data integrity using an example of a global company’s data. Definitions of terms that are relevant to data integrity will be provided at the end.
Scenario: calendar dates for a global company
Calendar dates are represented in a lot of different short forms. Depending on where you live, a different format might be used.
In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
In other countries, the national standard is YYYY-MM-DD so October 12, 2020 becomes 2020-10-12.
In the United States, (MM/DD/YY) is the accepted format so October 12, 2020 is going to be 10/12/20.
Now, think about what would happen if you were working as a data analyst for a global company and didn’t check date formats. Well, your data integrity would probably be questionable. Any analysis of the data would be inaccurate. Imagine ordering extra inventory for December when it was actually needed in October!
A good analysis depends on the integrity of the data, and data integrity usually depends on using a common format. So it is important to double-check how dates are formatted to make sure what you think is December 10, 2020 isn’t really October 12, 2020, and vice versa.
Fortunately, with a standard date format and compliance by all people and systems that work with the data, data integrity can be maintained. But no matter where your data comes from, always be sure to check that it is valid, complete, and clean before you begin any analysis.
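A minimal sketch of enforcing a common date format with pandas: parsing each source with an explicit format so that every date ends up as an unambiguous value. The example strings are illustrative.

```python
# Parse dates with an explicit format rather than letting each system guess.
import pandas as pd

us_dates = pd.to_datetime(["10/12/20", "12/10/20"], format="%m/%d/%y")   # MM/DD/YY
uk_dates = pd.to_datetime(["12/10/20", "10/12/20"], format="%d/%m/%y")   # DD/MM/YY

# Both now refer to unambiguous calendar dates in ISO (YYYY-MM-DD) form
print(us_dates)  # 2020-10-12, 2020-12-10
print(uk_dates)  # 2020-10-12, 2020-12-10
```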
You can gain powerful insights and make accurate conclusions when data is well-aligned to business objectives. As a data analyst, alignment is something you will need to judge. Good alignment means that the data is relevant and can help you solve a business problem or determine a course of action to achieve a given business objective.
Key takeaways
- When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports.
- If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis.
- If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective.
As a data analyst, you’ll receive data from a variety of sources. This data will come in all different formats and, more often than not, it will comprise what’s known as “dirty” data. In other words, it won’t be ready for analysis straight off the bat—you’ll need to clean it first.
Data cleaning is an important early step in the data analytics process.
This crucial exercise, which involves preparing and validating data, usually takes place before your core analysis. Data cleaning is not just a case of removing erroneous data, although that’s often part of it. The majority of work goes into detecting rogue data and (wherever possible) correcting it.
What is rogue data?
‘Rogue data’ includes things like incomplete, inaccurate, irrelevant, corrupt or incorrectly formatted data. The process also involves deduplicating, or ‘deduping’. This effectively means merging or removing identical data points.
Garbage in, garbage out: The importance of data cleaning
Have you heard of the saying “Garbage in, garbage out”—otherwise known as GIGO? GIGO stems from the world of computer science, and simply means that if you put flawed data in, you’ll get flawed results out.
In data analytics, clean, quality data is essential to running meaningful and reliable analyses. Just as you wouldn’t build a house without first laying a good foundation, you can’t analyze your data without cleaning it first. Get the data cleaning stage right and you’ll create something strong, reliable, and long-lasting. Do it wrong (or skip it altogether) and your analysis will soon crumble! That’s why data experts spend a good 60% of their time on data cleaning.
Working with dirty data is not only bad practice; it can be extremely costly in the long run. As a data analyst, you need to be confident in the conclusions you draw and the advice you give—and that’s really only possible if you’ve cleaned your data properly.
These days, GIGO is just as relevant in the context of AI and large language models (LLMs) such as ChatGPT. The cleanliness and quality of the data on which LLMs are trained is a huge topic of discussion, particularly considering how powerful these models are and how influential they're becoming. When it first launched, ChatGPT had no knowledge of anything past the year 2021, as that was as far as its training data went.
What is dirty data?
Dirty data is essentially any data that needs to be manipulated or worked on in some way before it can be analyzed. Some types of dirty data include:
Incomplete data—for example, a spreadsheet with missing values that would be relevant for your analysis. If you’re looking at the relationship between customer age and number of monthly purchases, you’ll need data for both of these variables. If some customer ages are missing, you’re dealing with incomplete data.
Duplicate data—for example, records that appear twice (or multiple times) throughout the same dataset. This can occur if you’re combining data from multiple sources or databases.
Inconsistent or inaccurate data—data that is outdated or contains structural errors such as typos, inconsistent capitalization, and irregular naming conventions. Say you have a dataset containing student test scores, with some categorized as “Pass” or “Fail” and others categorized as “P” or “F.” Both labels mean the same thing, but the naming convention is inconsistent, leaving the data rather messy.
We’ve outlined just three types of dirty data here. For further examples, see the round-up “7 Common Types of Dirty Data and How to Clean Them.”
It’s important to know what kind of dirty data you’re dealing with, as this will inform how you go about cleaning it. So, whenever you receive or collect data, you’ll spend a good amount of time inspecting it in order to gauge where you need to focus your cleaning efforts.
We’ve established how important the data cleaning stage is. Now let’s introduce some data cleaning techniques!
To clean your data, you might do some or all of the following:
Step 1: Get rid of unwanted observations
The first stage in any data cleaning process is to remove the observations (or data points) you don’t want. This includes irrelevant observations, i.e. those that don’t fit the problem you’re looking to solve.
For instance, if we were running an analysis on vegetarian eating habits, we could remove any meat-related observations from our data set. This step of the process also involves removing duplicate data. Duplicate data commonly occurs when you combine multiple datasets, scrape data online, or receive it from third-party sources.
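A short sketch of this step with pandas, using an invented table of meals: filter out the irrelevant rows, then drop exact duplicates.

```python
# Step 1 sketch: remove irrelevant observations and exact duplicates.
import pandas as pd

meals = pd.DataFrame({
    "dish":     ["lentil curry", "beef burger", "lentil curry", "veggie wrap"],
    "category": ["vegetarian", "meat", "vegetarian", "vegetarian"],
})

# Keep only observations relevant to a vegetarian eating-habits analysis
veggie = meals[meals["category"] == "vegetarian"]

# Remove exact duplicate rows
veggie = veggie.drop_duplicates()
print(veggie)
```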
Step 2: Fix structural errors
Structural errors usually emerge as a result of poor data housekeeping. They include things like typos and inconsistent capitalization, which often occur during manual data entry.
Let’s say you have a dataset covering the properties of different metals. ‘Iron’ (uppercase) and ‘iron’ (lowercase) may appear as separate classes (or categories). Ensuring that capitalization is consistent makes that data much cleaner and easier to use. You should also check for mislabeled categories.
For instance, ‘Iron’ and ‘Fe’ (iron’s chemical symbol) might be labeled as separate classes, even though they’re the same. Other things to look out for are the use of underscores, dashes, and other rogue punctuation!
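A small sketch of fixing structural errors with pandas: trimming stray whitespace, standardizing capitalization, and merging the mislabeled 'Fe' category into 'iron'. The values and mapping are invented.

```python
# Step 2 sketch: trim whitespace, normalize case, and merge mislabeled categories.
import pandas as pd

metals = pd.Series([" Iron", "iron", "Fe", "Copper ", "copper"])

cleaned = (
    metals.str.strip()              # remove stray whitespace
          .str.lower()              # consistent capitalization
          .replace({"fe": "iron"})  # merge mislabeled category
)
print(cleaned.value_counts())       # iron: 3, copper: 2
```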
Step 3: Standardize your data
Standardizing your data is closely related to fixing structural errors, but it takes things a step further. Correcting typos is important, but you also need to ensure that every value of the same type follows the same rules.
For instance, you should decide whether values should be all lowercase or all uppercase, and keep this consistent throughout your dataset. Standardizing also means ensuring that things like numerical data use the same unit of measurement.
As an example, combining miles and kilometers in the same dataset will cause problems. Even dates have different conventions, with the US putting the month before the day, and Europe putting the day before the month. Keep your eyes peeled; you’ll be surprised what slips through.
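As a hedged example of standardization, the sketch below converts distances recorded in miles to kilometres so that every row uses the same unit. The columns and values are invented.

```python
# Step 3 sketch: standardize units so every value uses the same scale.
import pandas as pd

trips = pd.DataFrame({
    "distance": [5.0, 8.0, 12.0],
    "unit":     ["miles", "km", "miles"],
})

MILES_TO_KM = 1.60934
is_miles = trips["unit"] == "miles"
trips.loc[is_miles, "distance"] = trips.loc[is_miles, "distance"] * MILES_TO_KM
trips["unit"] = "km"
print(trips)
```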
Remove unwanted outliers
Outliers are values that differ significantly from other values in your data. For example, if you see that most student test scores fall between 50 and 80, but that one student has scored a 2, this might be considered an outlier.
Outliers may be the result of an error, but that’s not always the case, so approach with caution when deciding whether or not to remove them.
What causes outliers in a dataset?
It’s crucial to understand not only how to detect outliers in a dataset, but also how to determine the best way to handle them. Some common causes of outliers include:
Human error when entering the data (for example, a typo)
Intentional outliers, i.e. dummy values that test detection methods
Sampling errors as a result of extracting or combining data from multiple sources
Natural outliers—this is when outliers occur “naturally” in the data, as opposed to being the result of an error. Natural outliers are referred to as novelties.
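One common (but not the only) way to flag outliers is the interquartile range (IQR) rule, sketched below on invented test scores. Whether to remove a flagged value remains a judgment call.

```python
# Flag outliers with the 1.5 * IQR rule on invented test scores.
import pandas as pd

scores = pd.Series([55, 61, 64, 70, 72, 75, 78, 80, 2])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers)  # the score of 2 is flagged for investigation
```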
Step 4: Fix contradictory data errors
Contradictory (or cross-set) data errors are another common problem to look out for. Contradictory errors are where you have a full record containing inconsistent or incompatible data.
An example could be a log of athlete racing times. If the column showing the total time spent running isn’t equal to the sum of the individual race times, you’ve got a cross-set error.
Another example might be a pupil’s grade score being associated with a field that only allows options for ‘pass’ and ‘fail’, or an employee’s taxes being greater than their total salary.
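A minimal sketch of catching a cross-set error: flag the rows where the recorded total doesn't match the sum of its parts. The race data is invented.

```python
# Step 4 sketch: find rows whose total doesn't match the sum of its parts.
import pandas as pd

races = pd.DataFrame({
    "lap1": [61.2, 59.8],
    "lap2": [62.0, 60.1],
    "total_time": [123.2, 150.0],   # second row is inconsistent
})

mismatch = (races["lap1"] + races["lap2"] - races["total_time"]).abs() > 0.01
print(races[mismatch])  # rows that need investigating
```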
Step 5: Type conversion and syntax errors
Once you’ve tackled other inconsistencies, the content of your spreadsheet or dataset might look good to go.
However, you need to check that everything is in order behind the scenes, too. Type conversion refers to making sure each value is stored as the appropriate category of data. A simple example is that numbers should be stored as numerical data, whereas currency should use a currency value. You should ensure that numbers are appropriately stored as numerical data, text as text input, dates as date objects, and so on.
In case you missed any part of Step 2, you should also remove syntax errors/white space (erroneous gaps before, in the middle of, or between words).
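A hedged sketch of type conversion with pandas: numbers stored as text are cast to integers and floats, and date strings are parsed into proper date objects. Column names and values are invented.

```python
# Step 5 sketch: convert columns to appropriate data types.
import pandas as pd

orders = pd.DataFrame({
    "quantity":   ["3", "5", "2"],           # numbers stored as text
    "order_date": ["2024-01-05", "2024-02-11", "2024-03-02"],
    "price":      [" 19.99", "5.50 ", "7.25"],
})

orders["quantity"] = orders["quantity"].astype(int)
orders["order_date"] = pd.to_datetime(orders["order_date"], format="%Y-%m-%d")
orders["price"] = orders["price"].str.strip().astype(float)   # also fixes stray whitespace

print(orders.dtypes)
```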
Step 6: Deal with missing data
When data is missing, what do you do? There are three common approaches to this problem.
The first is to remove the entries associated with the missing data. The second is to impute (or guess) the missing data, based on other, similar data. In most cases, however, both of these options negatively impact your dataset in other ways. Removing data often means losing other important information. Guessing data might reinforce existing patterns, which could be wrong.
The third option (and often the best one) is to flag the data as missing. To do this, ensure that empty fields have the same value, e.g. ‘missing’ or ‘0’ (if it’s a numerical field). Then, when you carry out your analysis, you’ll at least be taking into account that data is missing, which in itself can be informative.
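The sketch below shows all three options side by side on an invented table; in practice you would pick one strategy per column rather than applying them all.

```python
# Step 6 sketch: drop, impute, or flag missing data.
import pandas as pd

df = pd.DataFrame({"age": [34, None, 29], "purchases": [3, 5, None]})

dropped = df.dropna()                              # option 1: remove rows with gaps
imputed = df.fillna(df.mean(numeric_only=True))    # option 2: impute with column means
flagged = df.copy()
flagged["age_missing"] = flagged["age"].isna()     # option 3: flag that data is missing
flagged["age"] = flagged["age"].fillna(0)

print(dropped, imputed, flagged, sep="\n\n")
```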
Step 7: Validate your dataset
Once you’ve cleaned your dataset, the final step is to validate it. Validating data means checking that the process of making corrections, deduping, standardizing (and so on) is complete.
This often involves using scripts that check whether or not the dataset agrees with validation rules (or ‘check routines’) that you have predefined. You can also carry out validation against existing, ‘gold standard’ datasets.
This all sounds a bit technical, but all you really need to know at this stage is that validation means checking the data is ready for analysis. If there are still errors (which there usually will be) you’ll need to go back and fix them…there’s a reason why data analysts spend so much of their time cleaning data!
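As a simple illustration, the sketch below checks a cleaned dataset against a few predefined rules. The rules and data are invented; real validation scripts would cover many more checks.

```python
# Step 7 sketch: validate a cleaned dataset against simple "check routines".
import pandas as pd

students = pd.DataFrame({
    "score":  [72, 88, 95],
    "result": ["pass", "pass", "pass"],
})

rules = {
    "no missing values":        students.notna().all().all(),
    "scores between 0 and 100": students["score"].between(0, 100).all(),
    "result labels are valid":  students["result"].isin(["pass", "fail"]).all(),
}

for rule, passed in rules.items():
    print(f"{rule}: {'OK' if passed else 'FAILED'}")
```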
Now we’ve covered the steps of the data cleaning process, it’s clear that this is not a task you’ll want to tackle entirely by hand. So, what tools might help? The answer depends on factors like the data you’re working with and the systems you’re using. But here are some baseline tools to get to grips with.
Microsoft Excel
MS Excel has been a staple of computing since its launch in 1985. Love it or loathe it, it remains a popular data-cleaning tool to this day. Excel comes with many inbuilt functions to automate the data cleaning process, from deduping to replacing numbers and text, shaping columns and rows, or combining data from multiple cells. It’s also relatively easy to learn, making it the first port of call for most new data analysts.
Programming languages
Often, data cleaning is carried out using scripts that automate the process. This is essentially what Excel can do, using pre-existing functions. However, carrying out specific batch processing (running tasks without end-user interaction) on large, complex datasets often means writing scripts yourself.
This is usually done with programming languages like Python, Ruby, SQL, or—if you’re a real coding whizz—R (which is more complex, but also more versatile). While more experienced data analysts may code these scripts from scratch, many ready-made libraries exist. Python, in particular, has a tonne of data cleaning libraries that can speed up the process for you, such as Pandas and NumPy.
Visualizations
Using data visualizations can be a great way of spotting errors in your dataset. For instance, a bar plot is excellent for visualizing unique values and might help you spot a category that has been labeled in multiple different ways (like our earlier example of ‘Iron’ and ‘Fe’). Likewise, scatter graphs can help spot outliers so that you can investigate them more closely (and remove them if needed).
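A small sketch of both ideas with matplotlib and pandas: a bar chart of category counts (which would reveal duplicate labels like 'Iron' vs 'Fe') and a scatter plot that makes an outlier obvious. The data is invented.

```python
# Quick plots for spotting data problems: duplicate category labels and outliers.
import matplotlib.pyplot as plt
import pandas as pd

metals = pd.Series(["Iron", "iron", "Fe", "Copper", "Copper"])
heights = pd.Series([120, 131, 128, 20, 135])   # 20 cm looks like an outlier

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
metals.value_counts().plot(kind="bar", ax=ax1, title="Category labels")
ax2.scatter(range(len(heights)), heights)
ax2.set_title("Heights (cm)")
plt.tight_layout()
plt.show()
```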
As we’ve covered, data analysis requires effectively cleaned data to produce accurate and trustworthy insights. But clean data has a range of other benefits, too:
Staying organized:
Today’s businesses collect lots of information from clients, customers, product users, and so on. These details include everything from addresses and phone numbers to bank details and more. Cleaning this data regularly means keeping it tidy. It can then be stored more effectively and securely.
Avoiding mistakes:
Dirty data doesn’t just cause problems for data analytics. It also affects daily operations. For instance, marketing teams usually have a customer database. If that database is in good order, they’ll have access to helpful, accurate information. If it’s a mess, mistakes are bound to happen, such as using the wrong name in personalized mail outs.
Improving productivity:
Regularly cleaning and updating data means rogue information is quickly purged. This saves teams from having to wade through old databases or documents to find what they’re looking for.
Avoiding unnecessary costs:
Making business decisions with bad data can lead to expensive mistakes. But bad data can incur costs in other ways too. Simple things, like processing errors, can quickly snowball into bigger problems. Regularly checking data allows you to detect blips sooner. This gives you a chance to correct them before they require a more time-consuming (and costly) fix.
Now let’s take what you’ve learned about data cleaning and apply it to your dataset. We’ll focus on:
Identifying and removing duplicates
Identifying and handling missing data points
Are you ready? Then let’s begin!
Before you start conducting in-depth analysis, it’s important to first get acquainted with your dataset—to get the “lay of the land,” if you will. This is where exploratory data analysis (EDA) comes in.
You can think of exploratory data analysis as an initial investigation of your dataset where you seek to understand and summarize its main characteristics. EDA is useful because it helps you to understand how your data is structured, to spot potential patterns and trends, and to catch any anomalies. EDA is also important for determining if the methods of analysis you are planning to use later on are actually appropriate for your dataset.
In general, EDA focuses on understanding the characteristics of a dataset before deciding what we want to do with that dataset.
In a nutshell, descriptive statistics help you to summarize or describe the characteristics of your dataset in a meaningful way.
Imagine you have a dataset containing hundreds or even thousands of values. It would be impossible to look at that raw data with the naked eye and make any sense of it.
For example, if you collected data on the test scores of three hundred students, you might want to gauge their overall performance. You wouldn’t be able to do this simply by looking at a spreadsheet with all the raw data, but you could calculate the average (or mean) score. That’s an example of descriptive statistics!
Descriptive statistics are also useful for spotting potential errors or strange occurrences within your dataset. For example, in calculating the minimum and maximum values for a certain variable, you might notice that the maximum value falls outside of what could be considered a reasonable range. Imagine you’ve collected height data for a group of school children and calculated a minimum value of 20cm. That doesn’t seem like a realistic height for a child. Based on that, you’d investigate further to see what’s going on (and make sure that your dataset is in a fit state for analysis).
Up next: What are the different types of descriptive statistics?
The three main types of descriptive statistics are:
1. Frequency distribution,
Which tells you how frequently (i.e. how many times) a certain value occurs within your dataset.
2. Measures of central tendency,
Which estimate the middle or average values within your dataset. Measures of central tendency are the mean, median, and mode.
3. Measures of variability,
Which help you gauge how much variability or “spread” there is within your dataset; in other words, how spread out the values are. Measures of variability are the range, standard deviation, and variance.
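As a quick preview, here’s how each of these could be computed in Python with pandas, using a made-up set of test scores:

```python
import pandas as pd

scores = pd.Series([70, 85, 85, 90, 60, 75, 85])

# 1. Frequency distribution: how often does each score occur?
print(scores.value_counts())

# 2. Measures of central tendency
print(scores.mean())    # mean
print(scores.median())  # median
print(scores.mode())    # mode (can return more than one value)

# 3. Measures of variability
print(scores.max() - scores.min())  # range
print(scores.std())                 # standard deviation
print(scores.var())                 # variance
```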
You’ll learn how to calculate these values yourself when we get to the next sections, the practical component of our tutorial. For now, there’s one last piece of theory we need to complete the jigsaw. Proceed to section three!
A pivot table is a summary tool that condenses information sourced from larger tables. These larger tables could be a database, an Excel spreadsheet, or any data that is or could be converted into a table-like form. The data summarized in a pivot table might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.
A pivot table summarizes large amounts of data in a more digestible, at-a-glance format. It does this by grouping the data in a meaningful way, for example by showing the sum or average values of certain variables.
Let’s imagine you have data for a chain of department stores. Your dataset includes data on the sales made each month in two different store locations. You want to be able to see, at a glance, how each store performed across the year, but that’s impossible to tell when faced with thousands of rows of data in a spreadsheet.
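If you were working in Python rather than a spreadsheet, a pandas pivot table would give you that at-a-glance summary. Here’s a minimal sketch with made-up monthly figures for two hypothetical store locations:

```python
import pandas as pd

# Made-up monthly sales for two store locations
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "store":   ["Downtown", "Uptown"] * 3,
    "revenue": [12000, 9500, 11000, 10200, 12500, 9900],
})

# Summarize total revenue per store, per month
pivot = pd.pivot_table(sales, values="revenue", index="month",
                       columns="store", aggfunc="sum")
print(pivot)
```

The resulting table has one row per month and one column per store, so comparing the two locations takes a single glance instead of a scroll through thousands of rows.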
And what a journey it’s been! We’ve cleaned our dataset, calculated descriptive statistics, and created pivot tables. Most importantly, we’ve started to uncover some pretty interesting insights about our data, allowing us to answer several of the questions we set out at the beginning of the course.
In this tutorial, we’ll proceed to the next step in the data analysis process:
Data visualization. This will help you to answer the remaining questions we set out at the beginning of the course, and to present your findings in a visual, easily digestible format.
By the end of the tutorial, you’ll be able to:
-Explain what data visualization is and why it’s important for the data analysis process
-Create your own data visualizations for your coffee sales dataset (bar charts, column charts, and scatter plots)
-Answer questions for the key stakeholders at the coffee shop
As always, we’ll start with some theory before getting to the hands-on part.
They say a picture is worth a thousand words, and this is especially true for data analytics. Data visualization (or data viz, as it’s often called) is all about presenting data in a visual format—such as a graph, chart, or map.
This is useful as it helps to highlight the most important or relevant insights from a dataset, making it easier to spot patterns, trends, and relationships, as well as outliers (data points that differ significantly from other observations in your dataset—you may remember we mentioned these briefly in tutorial two).
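For example, a simple bar chart can make a comparison jump out that would stay buried in a spreadsheet. Here’s a minimal sketch using matplotlib, with made-up yearly revenue figures and placeholder store names:

```python
import matplotlib.pyplot as plt

# Made-up yearly revenue per store location
stores = ["Downtown", "Uptown", "Airport"]
revenue = [134000, 118500, 97200]

plt.bar(stores, revenue)  # one bar per store
plt.title("Total revenue by store location")
plt.xlabel("Store location")
plt.ylabel("Revenue (USD)")
plt.show()
```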
Data visualization isn’t just about creating pretty graphics. It’s a crucial aspect of making data understandable, accessible, and meaningful. As a data analyst, it’s your job to find insights within the data and share them with others—others who can act on those insights without necessarily being data experts themselves. As such, data visualization is a storytelling tool; a way to communicate your findings to a wider audience.
As a data analyst, your goal is to make data meaningful. You take raw data, analyze it, and draw out insights that can have a real-world impact. When you uncover these insights, it’s your job to communicate them in a way that means something. You need others to not only understand your findings, but to care about and act upon them. You do this by building a narrative or a story around your data.
The power (and science) of storytelling
Since the dawn of time, storytelling has been one of the most fundamental—and powerful—methods of human communication. In fact, it’s in the way our brains are wired.
Studies have shown that if we’re presented purely with information or facts, only the language processing areas of the brain are activated.
If we’re being told a story, on the other hand, we engage lots of different areas of the brain—the language processing areas, but also any other part of the brain that would be activated if we were personally experiencing the events of the story.
Storytelling makes things more relatable, more vivid, and more memorable. So, as a data analyst looking to build a connection between your data and the people you’re sharing it with, storytelling is a critical skill.
There are three key components of data storytelling:
Data: This entails all the insights you’ve gathered from your analysis—for example, the comparison of store locations and the finding about which coffee products appear most frequently in transactions (to name just a few!).
Visualizations: We learned all about data viz in tutorial four, so you’re already familiar with what a powerful storytelling tool it can be. Visualizations are key to making the findings of your analysis understandable for a broad audience.
Narrative: This is what puts your findings into context and makes them meaningful for your audience. With data storytelling, your narrative should introduce the topic (for example, the challenge or questions you set out to answer), present your findings and their relevance, and conclude with a specific call to action.
Data storytelling isn’t just about creating visualizations and sharing them. It requires a structured approach, and consideration of various factors. While there is no set formula for telling the story of your data, here are some steps you can follow:
A. Identify your audience
To create an engaging story, you first need to understand exactly who you are trying to engage. Who are you presenting your insights to? Why should they care about the data and your findings? What problem or challenge will the data help them to address? What insights from your analysis will matter most to them?
B. Construct a compelling narrative
When sharing your insights, you don’t just want to explain them; you want to take your audience on a journey. To build a narrative:
Start by setting the scene: What’s the context behind your analysis? Why did you analyze this data in the first place? What was the problem or challenge you set out to solve, and why does it matter?
Present and discuss your findings: What did your analysis tell you? What are the main points you’ll share with your audience? What answers can you provide to the original questions or challenges you set out to investigate? Remember that not all the data used in your analysis will be relevant to the story you want to tell, so it’s important to pick out and highlight the key points. It’s tempting to share everything with your audience, but it’s much more powerful to concentrate on the “pearls.” That’s what you’re doing in this step: picking out the pearls.
Provide action points and solutions: Based on your analysis, what actions can be taken moving forward? What advice can you give to your audience? How can they utilize the data you’re showing them, and what will be the impact?
C. Create and organize your data visualizations
You may have already created data visualizations as part of your analysis (as we did in tutorial four). If these visualizations already illustrate the key points you want to convey, it’s a case of organizing them and deciding how you’ll present them—for example, figuring out the order in which they’ll be presented to your audience. Otherwise, you may need to create additional visualizations in order to convey your data “pearls.”
D. Share your data story
With a compelling narrative in place, there’s only one thing left to do: Share it! We recommend building out a presentation deck, which brings us nicely to the practical part of the tutorial…
Data Science portfolios are as unique as the people who create them. No two portfolios will be the same in every respect, because each one reflects the personality, interests, and experiences of the individual who built it. Most portfolios, however, do contain similar types of information.
A portfolio differs from a social media page in that it should reflect the professional image or personal brand you want to present to current and future employers, other experts in your field, and recruiters who are searching for talent. An internet search for “example data science portfolios” will turn up numerous sites, articles, and YouTube videos showing portfolios created by data professionals. To get an idea of the different ways people approach their Data Science portfolios, spend some time exploring examples that interest you.
Having a portfolio to showcase who you are and demonstrate your skills will help you stand out to potential employers. The case study that you will complete in this course can be one of the examples that you add to your portfolio.
Ins and outs of building your portfolio.
First and foremost, your portfolio should represent your own work. While getting ideas from other portfolios is inspiring, directly copying (or only slightly modifying) others’ work and sharing it in your own portfolio is never acceptable.
Additionally, if you work on a project as a data analyst, keep in mind that the work you do for an employer or client belongs to their business. In many cases, you can’t share that work publicly in your personal portfolio without direct and explicit permission from them beforehand.
Finally, be cautious even with open or public datasets. Unless you are using data that you personally collected, ask the owner of the data for permission before you post anything related to the data in your portfolio. You should always take full responsibility for what you publish by getting the right permissions as needed.
Now, let’s review four platforms you can use to host your portfolio.
1. Personal websites
Creating a personal website to host your portfolio is a great option because you can also use it to showcase aspects of your personality or background that contribute to your professional brand. For example, you might share a compelling experience that reflects your ability to collaborate, be resilient, or persevere. Whatever you choose to share, make sure it is something you wouldn’t mind other people knowing about you. Having a blog or a personal website is also an excellent way to centralize your projects, especially since it’s relatively straightforward to set up a website without spending a huge budget. If you decide to go this route, WordPress is a great place to start, though another site builder like Strikingly or Wix will do the job just fine. Hosting your own site allows for more control and customization, and if you work hard to optimize your SEO, you can appear quite high in Google searches.
2. Medium (and social networks)
It is important to communicate about your projects as much as possible. For content-based portfolio projects, there are blogging platforms you can use in addition to your own personal website. Medium is one of the best platforms for reaching a wider audience with your projects. Moreover, posting on social networks such as Quora, LinkedIn, Twitter, and Reddit can help solidify your legitimacy as a data scientist and give your projects more visibility.
3. GitHub
At a high level, GitHub is a website and cloud service that enables developers to store and manage their code repositories and to track and monitor changes to them. To understand what GitHub is, you need to know two related concepts: version control and Git, which help you record changes to your projects over time so you can recall specific versions later. You can check out a Git guide to learn more. The platform allows users to collaborate on or publish open-source projects, fork and share code, and track issues. Setting up a GitHub account and hosting your portfolio using GitHub Pages is easy and free.
4. Kaggle
If you have an account on Kaggle, you can also use it as a platform to host your portfolio and personal background. Kaggle is an online community platform for data scientists and machine learning enthusiasts. It allows you to collaborate with other data scientists, find and publish datasets, publish notebooks, and compete in data science challenges. There are many datasets available for those who want to practice implementing their algorithms. The advantage of this platform is that the data is relatively well structured and cleaned. It is, therefore, a great place to start getting a feel for working on data science projects.
Overall, having a solid data science portfolio can be a game-changer. It’s a chance to acquire new capabilities and to leverage and improve existing ones. Pursuing portfolio projects can help you build new skills, gain recruiters’ attention, and possibly generate new sources of income by helping you start your freelance journey. Showcasing projects you’ve worked on will differentiate you from other data scientists, so spend some time honing your portfolio; the return on investment is definitely worth the effort.
The main purpose of your portfolio is to showcase your work. But how will your portfolio be evaluated by an audience? There are a number of things that most recruiters, employers, and others in the field look for when they evaluate your portfolio.
1. Is the portfolio organized and easy to navigate?
Unless you are proficient in web design or markup, it is a good idea to use a template to design a portfolio built on a personal website. There are many web page design templates available. A quick internet search can identify many sites that provide templates and design ideas.
2. Is there a concise introduction or About Me section that gives a short summary of who you are and the purpose of the portfolio?
Make sure that you include links to your resume and, if applicable, to relevant accounts such as your LinkedIn or GitHub profile.
3. Are your data science projects accessible to viewers?
Many web page templates provide the ability to use an image to link to your project page.
4. Is a non-technical person (such as a recruiter) or someone without specialized software able to understand your work?
You can provide links to the actual projects that require software to run; however, recruiters and HR personnel may not be familiar with the tools you used in your project (such as SQL). Provide a simple description of each step of the project in non-technical language.
5. Do your project pages explain why and how you did your project(s)?
The person viewing your portfolio is interested in why you took on the various projects, what goals you had, and how you approached the tasks.
Is there a clear outcome or conclusion reached at the end of the project(s)?
Projects that you highlight in your portfolio should be ones that have a clear purpose, even if it is just a personal interest that you are researching. Include an explanation of what you learned at the end of the project and whether the result was in line with your expectations.