from datascience import *
Section 1: Markdown Syntax
Subsection 1.1
The quick brown fox jumps over the lazy
dog.
That was a really cool sentence!
Subsection 1.2: Itemized and Enumerated Lists
- It is very important to check the Binomial Conditions before using the Binomial Distribution!
- Failure to check the necessary conditions can lead to incorrect results.
- Incorrect results are not good!
Subsection 1.3: Typesetting Equations
The Pythagorean Theorem states that \(a^2 + b^2 = c^2\)
\[ f_X(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} \]
- Pythagorean Theorem: \(a^2 + b^2 = c^2\)
- Euler’s Identity: \(e^{i \pi} + 1 = 0\)
\[ f_X(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \leq x \leq b\\ 0 & \text{otherwise} \end{cases} \]
\[\begin{align*} \overline{x} & = \frac{1}{n} \sum_{i=1}^{n} x_i \\ n \overline{x} & = \sum_{i=1}^{n} x_i \end{align*}\]
Subsection 1.4: Hyperlinks
Section 2: Importing and Manipulating Data
air = Table.read_table("https://pstat5a.github.io/Files/Datasets/air22.csv")
To find the number of observational units, we simply count the rows in the table using the .num_rows attribute:
air.num_rows
20345
To find the number of variables (i.e. the number of columns in the data matrix), we use the .num_columns attribute:
air.num_columns
21
Following the hint, we can use the .labels attribute to list the column labels (i.e. variable names) of the data matrix:
air.labels
('year',
'month',
'carrier',
'carrier_name',
'airport',
'airport_name',
'arr_flights',
'arr_del15',
'carrier_ct',
'weather_ct',
'nas_ct',
'security_ct',
'late_aircraft_ct',
'arr_cancelled',
'arr_diverted',
'arr_delay',
'carrier_delay',
'weather_delay',
'nas_delay',
'security_delay',
'late_aircraft_delay')
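To make the three quantities above concrete without downloading the dataset, here is a minimal sketch that uses only Python's standard csv module. The three-row CSV text is made up for illustration, and the names csv_text, labels, num_rows, and num_columns are assumptions of this sketch rather than part of the lab.

```python
import csv
import io

# Made-up CSV text standing in for the real air22.csv file
csv_text = "year,month,carrier\n2022,1,AS\n2022,1,DL\n2022,2,AS\n"

reader = csv.reader(io.StringIO(csv_text))
labels = tuple(next(reader))   # the first line holds the variable names
rows = list(reader)            # every remaining line is one observational unit

num_rows = len(rows)           # number of observational units
num_columns = len(labels)      # number of variables

print(num_rows, num_columns)
print(labels)
```

The header line is counted as labels rather than as a row, which is the same convention the table output above follows.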
To display only the arr_del15 column we can use the command:
air.column("arr_del15")
array([ 7., 3., 14., ..., 3., 1., 17.])
To find which years were included in the dataset, we display the year column of the data matrix:
air.column("year") # seems like only 2022 is included in the dataset
array([2022, 2022, 2022, ..., 2022, 2022, 2022])
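Because the printed array is truncated, the display alone cannot rule out other years hidden in the middle of the column. One way to check, sketched here on a made-up list standing in for the year column, is to list the distinct values. The names years and distinct_years are assumptions of this sketch.

```python
# Hypothetical values standing in for air.column("year")
years = [2022, 2022, 2022, 2022, 2022]

# A set keeps one copy of each value, so this lists the distinct years
distinct_years = sorted(set(years))
print(distinct_years)
```

If the result has a single entry, only one year appears in the column.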
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
plt.hist(air.column("weather_delay"),
         bins = 100,
         edgecolor = "white");
plt.xlabel("delay duration (in mins)");
plt.ylabel("frequency");
plt.title("Histogram of Delays");
Here’s how we can easily find the answer to the question “how many observational units were recorded from Alaska Airlines?”
As mentioned in the lab handout, air.column("carrier") == "AS" returns a boolean vector whose True elements correspond to the entries of carrier that are equal to AS. Since True is encoded as 1 and False is encoded as 0 (as was discussed in a previous lab), summing up the elements in the array air.column("carrier") == "AS" will result in the total number of True elements; i.e. the number of observational units recorded from Alaska Airlines.
sum(air.column("carrier") == "AS")
975
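The counting trick can be seen on a small scale with plain Python. This is only a sketch: the list carriers is made up and stands in for air.column("carrier"), and a list comprehension plays the role of the element-wise comparison.

```python
# Hypothetical carrier codes standing in for air.column("carrier")
carriers = ["AS", "DL", "AS", "UA", "AS", "DL"]

# Element-wise comparison: one boolean per observational unit
is_alaska = [code == "AS" for code in carriers]

# True counts as 1 and False as 0, so the sum is the number of matches
alaska_count = sum(is_alaska)

print(is_alaska)
print(alaska_count)
```

Three of the six made-up codes are AS, so the sum is 3.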
The code air.row(air.column(1) == 1) selects the rows of the air data matrix whose second column entry (i.e. month) is equal to 1; i.e. it returns the portion of the data matrix corresponding to flights taking place in January.
The code air.row(air.column(1) == 2)[6] returns the durations of flights that took place in February (i.e. the second month).
Thus, putting these two facts together, we can create a for-loop to give us the average duration of flights per month:
import numpy as np
means = []
for k in np.arange(1, 13):
    means.append(np.nanmean(air.row(air.column(1) == k)[6]))
plt.plot(np.arange(1, 13),
         means);
plt.xlabel("month");
plt.ylabel("avg. duration of flights (in mins)");
plt.title("Average Duration of Flights Per Month");
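The same filter-then-average idea can be sketched in plain Python. The lists months and values are made up for illustration, and the helper monthly_means is hypothetical (it is not part of the lab or of any library). Missing values are skipped explicitly, which is the role np.nanmean plays in the loop above.

```python
import math

# Hypothetical data standing in for the month column and a numeric column
months = [1, 1, 2, 2, 2, 3]
values = [10.0, float("nan"), 4.0, 6.0, 8.0, 5.0]

def monthly_means(months, values):
    """Return {month: mean of the non-missing values recorded in that month}."""
    means = {}
    for k in sorted(set(months)):
        # Keep the values from month k, skipping NaN (missing) entries
        kept = [v for m, v in zip(months, values)
                if m == k and not math.isnan(v)]
        means[k] = sum(kept) / len(kept) if kept else float("nan")
    return means

result = monthly_means(months, values)
print(result)
```

Month 1 has one missing entry, so its mean is computed from the single remaining value rather than becoming NaN.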