Udacity Segmentation and Clustering: instructor notes
Lidar Obstacle Detection

The main goal of the project is to filter, segment, and cluster real point cloud data to detect obstacles in a driving environment. This project implements a pipeline for converting raw LIDAR sensor measurements into trackable objects.

Udacity: Machine Learning Nanodegree, Project 3

Consider what each category represents in terms of products you could purchase. To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail.

In the code block below, add three indices of your choice to the indices list which will represent the customers to track. It is suggested to try different sets of samples until you obtain customers that vary significantly from one another. Let's draw a bar plot to visualise the amount of each product purchased for each sample, together with the dataset mean.
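The step described above can be sketched as follows. This is a hedged illustration, not the notebook's actual code: the variable names (`data`, `indices`, `samples`) follow the notebook's conventions, and the inline numbers are stand-in values rather than the real wholesale customers dataset.

```python
# Hypothetical sketch: pick three sample customers and compare them
# to the dataset mean with a bar plot. Stand-in data, assumed names.
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for headless runs
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    'Fresh':            [12669, 7057, 6353, 13265, 22615],
    'Milk':             [9656, 9810, 8808, 1196, 5410],
    'Grocery':          [7561, 9568, 7684, 4221, 7198],
    'Frozen':           [214, 1762, 2405, 6404, 3915],
    'Detergents_Paper': [2674, 3293, 3516, 507, 1777],
    'Delicatessen':     [1338, 1776, 7844, 1788, 5185],
})

# Three customers to track; try different sets until they differ clearly.
indices = [0, 2, 4]
samples = data.loc[indices].reset_index(drop=True)

# Bar plot of each sample next to the dataset mean for easy comparison.
comparison = samples.copy()
comparison.loc['mean'] = data.mean()
comparison.T.plot(kind='bar', figsize=(10, 4))
plt.ylabel('Annual spending')
plt.tight_layout()
plt.savefig('samples.png')
```

Plotting the dataset mean alongside the samples gives an immediate visual baseline for judging whether a sample customer over- or under-spends in each category.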

This will make comparing the three different sample points with each other much easier. Consider the total purchase cost of each product category and the statistical description of the dataset above for your sample customers. What kind of establishment (customer) could each of the three samples you've chosen represent?

Hint: Examples of establishments include places like markets, cafes, and retailers, among many others. Avoid using names for establishments, such as saying "McDonalds" when describing a sample customer as a restaurant. One interesting thought to consider is whether one or more of the six product categories is actually relevant for understanding customer purchasing.

That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.
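The relevance check described above can be sketched as below. This is an assumption-laden illustration: the random stand-in data and the variable names (`new_data`, `target`) are not from the notebook; substitute the real wholesale customers DataFrame.

```python
# Hedged sketch: remove one feature, fit a DecisionTreeRegressor on the
# remaining five, and report the R^2 score on a held-out split.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
data = pd.DataFrame(                      # stand-in for the real data
    rng.randint(100, 10000, size=(200, 6)),
    columns=['Fresh', 'Milk', 'Grocery', 'Frozen',
             'Detergents_Paper', 'Delicatessen'])

target = data['Delicatessen']                    # feature to predict
new_data = data.drop(columns=['Delicatessen'])   # remaining features

X_train, X_test, y_train, y_test = train_test_split(
    new_data, target, test_size=0.25, random_state=42)

regressor = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
# R^2 near or below zero means the feature is not predictable from the rest.
score = regressor.score(X_test, y_test)
print(f'R^2 for Delicatessen: {score:.3f}')
```

A low or negative R² is the signal discussed next: the removed feature carries information the other five cannot reproduce, so it should be kept.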

Which feature did you attempt to predict? What was the reported prediction score? Is this feature necessary for identifying customers' spending habits? I attempted to predict the Delicatessen feature, because I thought it was heavily dependent on the other features. Based on the reported score, Delicatessen cannot be predicted from the other features (at least by a DecisionTreeRegressor), so this feature is necessary for identifying customers' spending habits. To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data.

If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others.

Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data. Run the code block below to produce a scatter matrix. Are there any pairs of features which exhibit some degree of correlation? Does this confirm or deny your suspicions about the relevance of the feature you attempted to predict?
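A minimal sketch of that scatter matrix, assuming the six product features live in a DataFrame named `data`; the lognormal stand-in values below are an assumption chosen only to mimic the skew of spending data.

```python
# Sketch of the scatter matrix over the six product features.
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
data = pd.DataFrame(                      # stand-in for the real data
    rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6)),
    columns=['Fresh', 'Milk', 'Grocery', 'Frozen',
             'Detergents_Paper', 'Delicatessen'])

# diagonal='kde' puts a density plot on the diagonal instead of a
# degenerate column-vs-itself scatter.
axes = pd.plotting.scatter_matrix(data, alpha=0.3, figsize=(14, 8),
                                  diagonal='kde')
```

Correlated pairs show up as point clouds stretched along a diagonal line; uncorrelated pairs look like shapeless blobs.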

How is the data for those features distributed? Hint: Is the data normally distributed? Where do most of the data points lie? The scatter matrix confirms my suspicions about Delicatessen: the plots make it clear that it has no correlation with any of the other features.

It is easy to detect correlated features, because their data points are distributed along a diagonal line. It is worth noting that the scatter matrix plots each of the specified columns against every other column. In this format, the diagonal would show a plot of a column against itself; since that is always a straight line and carries no information, Pandas instead plots the density (KDE) of just that column's data.

What is also clearly noticeable is that all of these distributions are positively (right) skewed, with most of the data points lying toward the lower end of each feature's range. Reference: Understanding the diagonal in Pandas' scatter matrix plot.

In this section, you will preprocess the data to create a better representation of customers by scaling the data and by detecting (and optionally removing) outliers.

Preprocessing data is often a critical step in assuring that the results you obtain from your analysis are significant and meaningful. If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is the Box-Cox test, which calculates the best power transformation of the data that reduces skewness. A simpler approach which can work in most cases is applying the natural logarithm.

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present and whether it is now stronger or weaker than before. Run the code below to see how the sample data has changed after having the natural logarithm applied to it. Detecting outliers in the data is extremely important in the data preprocessing step of any analysis.
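The log-scaling step can be sketched as below; the variable names (`data`, `log_data`) are assumed to match the notebook, and the lognormal stand-in data is for illustration only.

```python
# Sketch of natural-log scaling over the six product features.
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
data = pd.DataFrame(                      # stand-in for the real data
    rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6)),
    columns=['Fresh', 'Milk', 'Grocery', 'Frozen',
             'Detergents_Paper', 'Delicatessen'])

log_data = np.log(data)  # natural log tames the right skew

# Correlations between feature pairs can be re-checked after scaling.
print(log_data.corr().round(2))
```

Because the transform is invertible (`np.exp` recovers the original values), nothing is lost; the data is simply re-expressed on a scale where the distributions are closer to normal.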

The presence of outliers can often skew results which take these data points into consideration. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature value beyond an outlier step outside of the IQR for that feature is considered abnormal. NOTE: If you choose to remove any outliers, ensure that the sample data does not contain any of these points!

Are there any data points considered outliers for more than one feature based on the definition above? Should these data points be removed from the dataset? If any data points were added to the outliers list to be removed, explain why.
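Tukey's method, and the count of how many features flag each point, can be sketched as follows. This assumes log-scaled data in `log_data`; the random stand-in values and the injected extreme row are hypothetical, used only to make the mechanics visible.

```python
# Hedged sketch of Tukey's method with per-point outlier counting.
from collections import Counter

import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
log_data = pd.DataFrame(                  # stand-in for the real log data
    rng.normal(loc=8.0, scale=1.0, size=(200, 6)),
    columns=['Fresh', 'Milk', 'Grocery', 'Frozen',
             'Detergents_Paper', 'Delicatessen'])
log_data.iloc[0] = 20.0  # hypothetical extreme customer

counts = Counter()
for feature in log_data.columns:
    Q1, Q3 = np.percentile(log_data[feature], [25, 75])
    step = 1.5 * (Q3 - Q1)          # Tukey's outlier step
    mask = ((log_data[feature] < Q1 - step) |
            (log_data[feature] > Q3 + step))
    counts.update(log_data.index[mask])

# Points flagged in more than one feature are the strongest candidates.
multi_feature_outliers = sorted(i for i, c in counts.items() if c > 1)
print(multi_feature_outliers)
```

Counting flags per index directly answers the question above: a point that is abnormal in several features at once is far more suspect than one that is extreme in a single category.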

I suspect that these outliers should not be removed, because their distance from the typical values may reveal a particular kind of customer. In this section you will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension, that is, how much variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space; however, it is a composition of the original features present in the data.

In this course, you'll learn how to use an advanced analytical method called clustering to create useful segments for business contexts, whether it's stores, customers, geographies, etc.
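A minimal sketch of the PCA step described above, assuming the log-scaled features live in a DataFrame named `log_data`; the random stand-in values are an assumption.

```python
# Sketch: fit PCA and read the explained variance ratio per component.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
log_data = pd.DataFrame(                  # stand-in for the real log data
    rng.normal(size=(200, 6)),
    columns=['Fresh', 'Milk', 'Grocery', 'Frozen',
             'Detergents_Paper', 'Delicatessen'])

pca = PCA(n_components=6).fit(log_data)

# Each ratio is the share of total variance that component explains;
# each component is a weighted mix of the original six features.
ratios = pca.explained_variance_ratio_
print(ratios.round(3), 'sum =', round(float(ratios.sum()), 3))
```

With as many components as features, the ratios sum to 1; in practice you keep only the first few components that together explain most of the variance.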

You'll learn this through improving your fluency in Alteryx, a data analytics tool that enables you to prepare, blend, and analyze data quickly. This course is ideal for anyone who is interested in pursuing a career in business analysis but lacks programming experience.

Related Nanodegree Program Introduction to Programming. About this Course The Segmentation and Clustering course provides students with the foundational knowledge to build and apply clustering models to develop more sophisticated segmentation in business contexts.

You will learn the key concepts of segmentation and clustering, such as standardization vs. This course is part of the Business Analyst Nanodegree Program.

Course Cost: Free. Skill Level: Intermediate. Included in Product: Rich Learning Content.

Free Course: Segmentation and Clustering. Enhance your skill set and boost your hirability through innovative, independent learning.
