from <module name> import *
It’s finally time for us to revisit our notions of descriptive statistics (from Week 1 of the course), now in the context of Python!
Modules, Revisited
Before we talk about plotting, we will need to quickly talk about modules again. Recall from Lab01 that modules are Python files containing definitions for functions and classes. Up until now, we’ve been importing all functions and classes from a module using the command
There is another way to import modules, which is the following:
import <module name> as <abbreviation>
For example,
import numpy np
not only imports the numpy
module but imports it with the abbreviation (i.e. nickname) np
so that we can simply write np
in place of numpy
.
The reason this is particularly useful is because module names can sometimes be quite long, so being able to refer to the module with a shortened nickname will save a lot of time!
In general, if we import a module using
import <module name> as <abbreviation>
we reference functions from <module name>
using the syntax
<abbreviation>.<function name>()
For example, after having imported the numpy
module with the nickname np
, we access the sin()
function contained in the numpy
module by calling
np.sin()
Import the
numpy
module asnp
, and check thatnp.sin(0)
returns a value of0
.Import the
datascience
module asds
, and check that
ds.Table().with_columns("Col1", [1, 2, 3],
"Col2", [2, 3, 4]
)
correctly displays as
Col1 | Col2 |
---|---|
1 | 2 |
2 | 3 |
3 | 4 |
If you import a module with an abbreviation <abbreviation>
, you must always use the abbreviation when referencing the module; not the original module name.
For example, after importing numpy
as np
, running numpy.sin()
would return an error.
Numerical Summaries
Measures of Central Tendency
Recall that for a list of numbers \(X = \{x_i\}_{i=1}^{n}\), the mean is defined to be \[ \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} (x_1 + \cdots + x_n) \] Computing the mean of a list or array of numbers in Python is relatively simple, using the np.mean()
function [recall that we imported the numpy
module with the abbreviation np
, meaning np.mean()
is a shorthand for numpy.mean()
]. Similarly, to compute the median of a list or array we can use np.median()
.
Let x_list
be a list containing the elements 1
, 2
, and 3
, and let x_array
be an array containing the elements 1
, 2
, and 3
. Compute the mean and median of x_list
and x_array
using the appropriate functions from the numpy
module.
Measures of Spread
Recall that we also discussed several measures of spread:
- Standard deviation
- IQR (Interquartile Range)
- Range
Sure enough, the numpy
module contains several functions which help us compute these measures. Let’s examine each separately.
- Look up the help file on the function
np.ptp()
, and describe what it does. Also, answer the question: what doesptp
actually stand for? - Now, apply the
np.ptp()
function on yourx_list
andx_array
variables from Task 1 above and check that it functions like you expect.
Next, we tackle a slightly peculiar function: np.std()
. We expect this to compute the standard deviation of a list/array, but…
- Compute the standard deviation of the
x_list
variable from Task 1 by hand, and write down the answer using a comment or Markdown cell. - Now, run
np.std(x_list)
. Does this answer agree with what you found in part (a) above? - Now, recompute the standard deviation of
x_list
by hand but this time use \((1/n)\) instead of \((1 / n - 1)\) in the formula. How does this answer compare with the result ofnp.std(x_list)
?
The result of the previous Task is the following: given a list x = [x1, x2, ..., xn]
, running np.std(x)
actually computes \[ \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2 } \] as opposed to our usual definition of standard deviation \[ s_X = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \overline{x})^2} \] We can actually fix this issue by passing in an additional argument to the np.std()
function:
- Run
np.std(x_list, ddof = 1)
and check whether this matches the result of part (a) above.
To compute the standard deviation of a list x
, we run np.std(x, ddof = 1)
.
Finally, we turn to the IQR: to compute the IQR of a list/array x
, we use (after importing numpy
as np
)
25,75]))[0] np.diff(np.percentile(x, [
Visualizations
It’s finally time to make pretty pictures! The module we will use to generate visualizations in this class is the matplotlib
module (though there are quite a few other modules that work for visualizations as well). The official website for matplotlib
can be found at https://matplotlib.org/.
Before we generate any plots, we will need to run the following code once:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
'seaborn-v0_8-whitegrid') plt.style.use(
Here’s what these lines of code are doing:
%matplotlib inline
tells Jupyter to actually display our plots in our notebook (if we didn’t include this line, our plots wouldn’t display)import matplotlib
imports thematplotlib
moduleimport matplotlib.pyplot as plt
imports thepyplot
submodule (a submodule is just a module contained within another larger module) with the abbreviationplt
.plt.style.use('seaborn-v0_8-whitegrid')
tells Jupyter to use a specific theme (calledseaborn-v0_8-whitegrid
) when generating plots.
Again, notice the beauty of the import <module> as <abbreviation>
syntax- after running the third line above, we no longer need to write matplotlib.pyplot
, just plt
! Also, there are lots of other themes you can use when generating your plots: after completing this lab, I encourage you to consult this reference guide for a list of a few other pyplot
themes.
Boxplots and Histograms
Now, let’s proceed on to make some plots. The first two types of plots we will look at are the two we used to describe numerical data: namely, boxplots and histograms. The functions we will use are the plt.boxplot()
and plt.his()
functions, respectively.
Make a list called
y
that contains the following elements:[1, 2, 3, 4, 5, 4, 3, 5, 4, 1, 2]
.Run
plt.boxplot(y);
(be sure to include the semicolon!). With any luck, your plot should look like:
- Let’s make our boxplot horizontal, as opposed to vertical. Consult the help file on the
matplotlib.pyplot.boxplot()
function here and figure out how to position your boxplot horizontally. Your new plot should look like:
- Next, let’s add some color to our plot. Within your call to
plt.boxplot()
, add the following:patch_artist=True, boxprops = dict(facecolor = "aquamarine")
(don’t worry too much about what exactly this code is doing). Your boxplot should now look like this:
- Finally, let’s add a Title! Right below your call to
plt.boxplot()
, add the following:plt.title("My First Python Boxplot");
(again, note the semicolons). Your final plot should look like this:
- Time for a review: based on the boxplot we just generated, what is the IQR of
y
? Write your answer in a Markdown cell. Then, use the syntax discussed in the previous section of this Lab to use Python to compute the IQR ofy
, and comment on the result.
Of course, boxplots are not the only way to summarize numerical variables: we also have histograms!
Call the plt.hist()
function on the y
list defined in Task 3, and use the help file to add arguments to your call to plt.hist()
function to generate the following plot:
Pay attention to the number of bins!
Scatterplots
We should also quickly discuss how to generate scatterplots in Python.
- Copy-paste the following code into a code cell, and then run that cell (don’t worry about what this code is doing- we’ll discuss that in a future lab).
5)
np.random.seed(
= np.random.normal(0, 1, 100)
x1 = x1 + np.random.normal(0, 1, 100)
x2
; plt.scatter(x1, x2)
Your plot should look like this:
- Add an x-axis label that says
"x1"
and a y-axis label that says"x2"
, along with the title “My First Python Scatterplot”. Your final plot should look like:
- Does there appear to be an association between the variables
x1
andx2
? If so, is the association positive or negative? Linear or nonlinear? Answer using a comment or a Markdown Cell.
Plotting a Function
Finally, I’d like to take a quick detour from descriptive statistics and talk about how to plot a function using Python. As a concrete example, let’s try and plot a sine curve from \(0\) to \(2\pi\).
If you recall, on Lab01 we used the sin()
function from the math
module- it turns out that the numpy
module (which, recall, we have imported as np
) also has a sin()
function, so let’s use that one today:
np.sin()
Next, we create a set of finely-spaced points between our two desired endpoints (in this case, \(0\) and \(2\pi\), respectively). We will do so using the np.linspace()
function, which works as follows:
np.linspace(start, stop, num)
creates a set of num
evenly-spaced values between start
and stop
, respectively. For instance:
0, 1, 10) np.linspace(
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])
In the context of plotting, the more points we generate the smoother our plot will seem (you will see what this means in a minute). As such, let’s start with 150
points between 0
and 2 * pi
:
= np.linspace(0, 2 * np.pi, 150) x
Finally, we call the plt.plot()
function on x
and np.sin(x)
to generate our plot:
=(4.5, 2.25))
plt.figure(figsize plt.plot(x, np.sin(x))
Let’s see what would have happened if we used fewer values in our np.linspace()
call:
= np.linspace(0, 2 * np.pi, 10)
xnew plt.plot(xnew, np.sin(xnew))
So, the more points we include in our call to np.linspace()
, the smoother our final function will look!
So, to summarize, here is the general “recipe” to plot a function f()
between two values a
and b
in Python:
- Let
x = np.linspace(a, b, <some large value>)
- Call
plt.plot(x, f(x))
- Add labels/titles as necessary
Generate a plot of the function \(f(x) = x - x^2 \sin(x)\) between \(x = -10\) and \(x = 10\). Experiment around with the number of values generated by np.linspace()
to ensure your plot is relatively smooth. Be sure to include axis labels; also, change the color of the graph to red. Your final plot should look something like this:
You only need to complete up to here during Lab, but you should complete the following tasks on your own as they are fair game for quizzes and exams (and are also very useful for your own edification!)
Overlaying Plots
Sometimes it will be useful to overlay two plots on top of each other. Recall that, for a function f()
and a variable x
that has been assigned a value resulting from a call to numpy.linspace()
, we generate a graph of f()
using (assuming matplotlib.pyplot
has been imported as plt
)
; plt.plot(x, f(x))
It stands to reason, then that given another function g()
we should be able to superimpose the graph of g()
onto the graph of f()
by simply adding another call to plt.plot()
:
;
plt.plot(x, f(x)); plt.plot(x, g(x))
Generate a graph of sin()
; on top of this graph, superimpose the graph of cos()
. Restrict the x
values on your graph to be between \(-4\pi\) and \(4\pi\). Your final graph should look like the following (pay attention to the axis labels and title!):
Now, as it stands, it’s a bit difficult to determine which curve corresponds to the sine curve and which corresponds to the cosine curve. As such, we should add some labels!
Copy your code from Task 9 above into a new code cell, and
- add
label = "sine"
to your call toplt.plot()
containing the sine curve - add
label = "cosine"
to your call toplt.plot()
containing the cosine curve.
Does this new plot look any different than the plot you generated in Task 3?
Hm, doesn’t look like anything changed… That’s because we didn’t add a legend to our plot! To add a legend, we simply tack on a call to plt.legend()
after our code from above.
Copy your code from Task 10 above into a new code cell, and add a line underneath it containing a call to plt.legend()
. Look up the help file to figure out what arguments you need to pass in to obtain the following graph (note the position of the legend):
Okay, we’re almost there! The only issue is that now the legend is covered up by the actual graphs. One way we can fix this is by extending the \(y-\)axis further, using the function plt.ylim()
:
Copy your code from Task 11 above into a new code cell, and add a line underneath it containing a call to plt.ylim()
. Look up the help file to figure out what arguments you need to pass in to obtain a lower \(y-\)limit of \(-1.5\) and an upper \(y-\)limit of \(2.0\). Your final graph should look like this:
Finally, it is sometimes considered bad form to rely too heavily on colors in plots. This is because doing so alienates readers who are colorblind. One way around this is to rely on different line types; e.g. used dashed lines for one graph and dotted lines for another.
Copy your code from Task 12 above into a new code cell. Read the following help file and figure out how to pass in a value to the linestyle
argument to your two calls to plt.plot()
to generate the following plot:
Note that the sine curve is now dotted, and the cosine curve is now dashed.