Download the file elements.csv and load it into a pandas DataFrame.
There are online test question corresponding to some of the subtasks below.
Work out how many columns there are in the dataframe.
Each element has a type: lithium is an alkali metal, chlorine is a halogen and so on.
Work out how to get a list of all the types, in alphabetical order, without repetitions.
(You could search for "pandas unique" if necessary.)
Use the
groupby()
method to count how many elements there are of each type.
Make a smaller dataframe which just lists the names and symbols of elements that
are gasses at standard temperature and pressure. (Use the Phase column
to determine this.) Read the documentation for
reset_index()
and use it to tidy up the index of the new dataframe without introducing a useless extra column.
Extract the row for the artificial element Darmstadtium. You should see that
many values are listed as NaN, because they have not been measured.
Use the
dropna()
method to remove these rows.
Use the
argmax()
method to extract the row corresponding to the element with the highest boiling point.
This will add a column to your dataframe which gives a colour for each type of element.
Use matplotlib to make a picture of the periodic table. For each element,
you should use ax.text() to put the symbol in the correct place, as determined
by the Display Row and Display Column columns in the dataframe,
with the colour determined by the Colour column. When I did this, I found it
necessary to draw a rectangular box around the outside, otherwise the positions got
messed up for some reason that I have not understood.
Task 2: Whitby tides
Download the file whitby_tides.txt, which contains a record
of the tide levels at Whitby (in North Yorkshire) every 15 minutes in 2023. We will want
to convert this dataset into a pandas DataFrame, but this will require some auxiliary steps.
If the above file has been saved in the same folder as this notebook, we can read
all the lines in the file as follows:
with open(f'whitby_tides.txt') as f:
lines = f.readlines()
The first few lines of the file are headers, which you should ignore. The remaining lines
are like this:
For each line, you should extract the time and the tide level. You can either use
line.split() or slices like line[10:25] to get parts
of the string. You should represent the time as the number of hours since the
start of the year. Given a suitable string s, the function
pd.Timestamp(s).timestamp()
will give you the number of seconds since
the start of 1970, and you can work from there. The tide level is the first
number after the time. Some of the tide levels are followed by an M,
which you should ignore. If the tide level is given as -99, that means that the
data is missing, so you should skip that line. You should also skip any lines
that are empty. It will probably be simplest to start by creating a list, but
when the list is complete you should convert it into a pandas dataframe with
columns time and level.
Now plot the tide levels against time. If you do this for the whole dataset
then the plot will be very crowded, so you should experiment with plotting
a few days or weeks at a time. The time will be given in hours, but you
could add vertical dotted lines every 24 hours to make the daily variation
easier to see.
Next, we will try to fit a curve to the data. Start by copying the following
code:
tau = [
12.4206012, # Principal lunar semidiurnal
12, # Principal solar semidiurnal
12.65834751, # Larger lunar elliptic semidiurnal
23.93447213, # Lunar diurnal
6.210300601, # Shallow water overtides of principal lunar
25.81933871, # Lunar diurnal
4.140200401, # Shallow water overtides of principal lunar
8.177140247, # Shallow water terdiurnal
6 # Shallow water overtides of principal solar
]
omega = 2*np.pi/np.array(tau)
This defines an array $(\omega_0,\dotsc,\omega_8)$. Theory
(which you can read about on
Wikipedia)
predicts that the tide level at time $t$ should be given by
$$ p(t,a0,\dotsc,a18) =
a0 + a1\cos(\omega_0 t) + a2\sin(\omega_0 t) + \dotsb +
a17\cos(\omega_8 t) + a18\sin(\omega_8 t).
$$
Enter this as a Python definition. One approach is to just write out the
twenty function arguments and nineteen terms in the sum explicitly.
Alternatively, you can start the definition with def p(t,*a):,
then a will be a tuple and you can write a loop or use
numpy array methods.
Now use scipy.optimize.curve_fit to fit the curve to the data.
You can see the Mauna Loa notebook (linked at the top of this page) for an
example of how to do this. Plot the actual tide level and the
modelled tide level on the same graph, for $t$ from 1000 to 2000 hours.
The second row is the number of bachelors degrees in psychology awarded in each year.
The third row is the number of people employed as groundkeepers in Utah.
The fourth row is the number of babies named Georgia born in each year.
The fifth row os the number of terawatt hours of fossil fuel energy used in Somalia in each year.
Create a pandas dataframe df from this data. The years should
give the index, and the column names should be 'Psychology', 'Groundkeepers',
'Georgia' and 'Somalia'. Then use df.corr() to calculate the
correlations between the columns. You should find that the correlations
are all quite high, suggesting that the time variation of the columns is similar.
To visualise this, make a scatter plot of the psychology column against the
groundkeepers column. You should see that the points lie very close to a
straight line. For other pairs of columns the relationship is still
strong but not as clear.
Now use df.plot() to plot all the columns against time. The
plot will not be very informative, because the scales are so different:
the number of bachelors degrees is in the hundreds of thousands, while the
energy usage numbers are all less than one. To fix this, we can normalise
each column by subtracting the mean and dividing by the standard deviation
(which can be calculated by the mean() and std()
methods). Create a new dataset dfn with the same headers
but with normalised columns. Now dfn.plot() should show
four graphs that are fairly similar.
Task 4: Fruit and vegetables
Download and unzip the file food.zip, which contains
48 images of fruit and vegetables. There are six different types
(apples, bananas, carrots, oranges, plums and green peppers) with
eight images of each type. We will try to classify the images by type,
just using the overall colour. We will need the following imports:
from PIL import Image, ImageDraw
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
Now enter files = list(Path('fruit').glob('*.jpg')) to get
a list of all the image files. (This assumes that your notebook is
in a certain folder, and that the images are in a subfolder of that
folder called fruit.) Now enter
img = Image.open(files[0]) and display(img)
to load and display the first image. Then enter
A = np.asarray(img) to get an array of shape
(500,500,3) in which A[i,j,0] is the
amount of red at position (i,j), and
A[i,j,1] is the amount of green, and
A[i,j,2] is the amount of blue. All of these numbers
are integers between 0 and 255.
It will be more convenient to work with a matrix a of
shape (250000,3); use the reshape() method
to convert A to this form. We then want to take the
average of each column in a, giving a vector of length
three containing the red, green and blue values for the average colour
of the image. If we just do a.mean(), we will get a
single number which is the average of all numbers in a,
which is not what we want. Work out how to modify this to give
the required result. Also, round the resulting numbers to the nearest
integer, and use the astype() method to ensure that the
result has integer data type.
Now make a loop to apply all the above steps to each of the 48 images,
giving an array average_colour of shape
(48, 3). Then enter
This will give an array of dots, showing the average colours of all the
images. You will see that the result is not very satisfactory: the
average includes all the white border around each image so the
resulting colour is very pale. To fix this, go back to your
loop and insert the line a = a[a.min(axis=1) < 240]
(which will discard the white pixels) before taking the average,
then regenerate the scatter plot.
Now enter the following lines to divide the images into six groups
with similar colours:
This will display an array kmeans.labels_ of 48 integers,
each in the set $\{0,\dotsc,5\}$. You can display all the images
with label zero as follows:
for i in range(48):
if kmeans.labels_[i] == 1:
display(Image.open(files[i]))
You can display the other groups in the same way, or (better)
work out how to make a single picture showing thumbnails of
all the images arranged in groups. You should find that the
method used here is moderately successful: the groups are
roughly right but the oranges are not separated from the carrots
and there are various other errors as well.