What I learned from the 10-hour data science course - Data Analysis with Python Course - NumPy, Pandas, Data Visualization

I will be watching the 10-hour course from freeCodeCamp, youtube.com/watch?v=GPVsHOlRBBI, so I want to record what I learn from it.

Conda

Conda is an open-source package manager made for Python. If you have two Python projects, each requiring a different version of Python and of its packages, Conda is for you. It can create, save, load, and switch between environments on your local computer, so you can work on both projects without having to reinstall Python every time you switch.

Jupyter Notebook

Jupyter Notebook is a notebook environment where you can write notes and run code. It is useful for data science because code is organized in cells, so you can run individual cells one at a time. The contents of variables are also kept after you run a cell, so you can reuse them repeatedly without having to restart the entire program. Without a Jupyter notebook, whenever I made a change I had to restart the entire program because variables were not stored, which was very inefficient. From now on I will use a Jupyter notebook and be more efficient.

It also helps me with writing my blog, since its structure is very similar.

Common keyboard shortcuts in Jupyter Notebook

Keyboard shortcuts are very useful for speeding things up, so I went ahead and found some useful ones from this article before starting:

  • Shift + Enter: run the current cell, select below
  • Alt + Enter: run the current cell, insert below
  • Ctrl + S: save and checkpoint

There are two modes. The first is the default mode (what Jupyter calls command mode), which is active when you have just loaded the document and are not editing anything. In it you can press:

  • A: insert cell above
  • B: insert cell below
  • D, D (press the key twice): delete selected cells
  • Z: undo cell deletion
  • Y: change the cell type to Code
  • M: change the cell type to Markdown
  • Enter: go into edit mode

Inside edit mode, where you edit code cells:

  • Esc: go back to default mode
  • Tab: code completion or indent
  • Shift + Tab: show a tooltip

NumPy

  • np.genfromtxt() can read a CSV file and return a NumPy array
  • np.savetxt() can write a NumPy array to a CSV file
  • commonly used functions include:
  • Mathematics: np.sum, np.exp, np.round, arithmetic operators
  • Array manipulation: np.reshape, np.stack, np.concatenate, np.split
  • Linear algebra: np.matmul, np.dot, np.transpose, np.linalg.eigvals
  • Statistics: np.mean, np.median, np.std, np.max
  • NumPy supports array broadcasting, which allows arithmetic operations between two arrays with different numbers of dimensions but compatible shapes (see the sketch below)
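
Here is a minimal sketch of these operations; the file name and values are made up for illustration:

```python
import numpy as np

# A small hypothetical array standing in for data loaded from a CSV
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

print(np.mean(data))         # 3.5
print(data.reshape(2, 3))    # same six values in a 2x3 layout

# Broadcasting: the 1-D array has shape (2,), which is compatible with
# data's shape (3, 2), so it is applied to every row of data.
scaled = data * np.array([10.0, 100.0])
print(scaled)                # [[ 10. 200.] [ 30. 400.] [ 50. 600.]]

# Round trip through CSV: savetxt writes the array, genfromtxt reads it back
np.savetxt('data.csv', data, delimiter=',')
restored = np.genfromtxt('data.csv', delimiter=',')
```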

Pandas

  • pandas' main data type is DataFrame, which is like an Excel spreadsheet
  • we can use the .info() method to view the basic information about the dataframe
  • we can use the .describe() method to see statistical information about the numeric data within the dataframe
  • another important data type is Series, which is like an array. You can use the .index attribute to get the indexes, which is very useful when you want to plot graphs of a series
  • we can pass in a list of columns to select part of a dataframe, like reduced_df = df[['column1', 'column3']]; the resulting dataframe will only have those two columns. Note that pandas may give you a view of the original data rather than an independent copy, so modifying values here can change the original dataframe as well; use .copy() to create a new dataframe
  • to view data in a dataframe, we can use .head() to show the first few rows, .tail() to show the last few rows, and .sample() to show a random row
  • we can sort the dataframe by value using covid_df.sort_values(), passing in the column name
  • we can convert data to date using the pd.to_datetime() function
  • we can use the .groupby() function, passing in a column name, to group the data; then we can select some columns and use .sum() or .mean() to calculate the value for the different groupings
  • we can also merge dataframes by calling .merge() on one dataframe, passing in the other dataframe and the column to join on
  • to write back to CSV, we use .to_csv(), passing in the file name (see the sketch below)
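
A minimal sketch tying these calls together, with a made-up mini dataset standing in for the course's covid_df:

```python
import pandas as pd

# Hypothetical data, just enough rows to exercise each method
df = pd.DataFrame({
    'date': ['2020-03-01', '2020-03-02', '2020-03-01', '2020-03-02'],
    'region': ['A', 'A', 'B', 'B'],
    'new_cases': [10, 15, 3, 7],
})

df.info()                    # column types and non-null counts
print(df.describe())         # statistics for the numeric columns

df['date'] = pd.to_datetime(df['date'])    # convert strings to datetimes

# .copy() makes reduced_df independent, so edits won't touch df
reduced_df = df[['region', 'new_cases']].copy()

print(df.sort_values('new_cases', ascending=False).head())

# Group by region, then aggregate the numeric column per group
print(df.groupby('region')['new_cases'].sum())

df.to_csv('cases.csv', index=False)        # write back to CSV
```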

Matplotlib and Seaborn

  • in a Jupyter notebook, we can use %matplotlib inline after importing matplotlib to ensure that our plots are shown and embedded within the Jupyter notebook itself
  • the most basic chart is a line chart, which can be plotted with plt.plot()
  • we can add labels to the chart with plt.xlabel() and plt.ylabel(), and legend with plt.legend()
  • plots are only rendered at the end of a cell's execution; that's why we can call plt.plot(), then change things like labels afterwards, and the plot is still shown with the correct data and labels
  • we can even plot multiple lines on the same graph
  • seaborn is a statistical graphics library built on matplotlib, and is commonly imported as sns. According to this StackOverflow post, that's because Samuel Norman "Sam" Seaborn is a fictional character portrayed by Rob Lowe on the television drama 'The West Wing'; my guess is the creator loved this TV show
  • we can plot a scatterplot using sns.scatterplot() passing in the x and y, and optionally a hue
  • we can plot a histogram with plt.hist(), and we can customize the bins, even making them uneven widths if we want
  • we can stack multiple histograms by passing stacked=True
  • we can plot a bar chart with plt.bar()
  • to stop labels from overlapping, we can tilt them by plt.xticks(rotation=75)
  • we can plot a heat map using sns.heatmap(); a heat map is a good way to visualize 2D data (see the sketch below)
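
A minimal sketch of the plot calls from this section, using made-up data so it runs anywhere:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Two lines on the same axes, with labels and a legend
x = np.arange(10)
plt.plot(x, x ** 2)
plt.plot(x, x ** 1.5)
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['x squared', 'x to the 1.5'])
plt.show()

# Scatterplot with a hue: color encodes a third, categorical variable
rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2 * xs + rng.normal(size=100)
group = np.where(xs > 0, 'high', 'low')
sns.scatterplot(x=xs, y=ys, hue=group)
plt.show()

# Histogram with deliberately uneven bin edges, tilted tick labels
plt.hist(xs, bins=[-3, -1, 0, 0.5, 1, 3])
plt.xticks(rotation=75)
plt.show()

# Heat map of a random 5x5 matrix
sns.heatmap(rng.random((5, 5)))
plt.show()
```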

Basic principles

  • when starting data analysis, it is important to understand the data collected: how it was collected and how accurate it is
  • then we want to do some data preparation and cleaning, to get the data ready for analysis and to remove unwanted data
  • then we can do some exploratory analysis and visualization; this is when we don't have any specific question in mind, but just poke around and see what the data looks like
  • then at last we try to answer questions with the data

Conclusion

I would like to thank the platform jovian.ai for partnering with freecodecamp.org to give us this free course and a platform to learn by doing, where we can use the same Jupyter notebooks the instructor used. I also want to thank the instructor Aakash N S for the teaching and the material; he included practical examples so we could better understand the content.

After this course, I feel confident continuing my data analysis project, which can be found on my Hashnode blog.