mattconners.github.io

Doing data science - some technical learning resources

Subsetting Data in Pandas

import pandas as pd
titanic = pd.read_csv('https://web.stanford.edu/class/cs102/datasets/Titanic.csv')
 titanic.head()
last first gender age class fare embarked survived
0 Braund Mr. Owen Harris M 22.0 3 7.2500 Southampton no
1 Cumings Mrs. John Bradley (Florence Briggs Thayer) F 38.0 1 71.2833 Cherbourg yes
2 Heikkinen Miss Laina F 26.0 3 7.9250 Southampton yes
3 Futrelle Mrs. Jacques Heath (Lily May Peel) F 35.0 1 53.1000 Southampton yes
4 Allen Mr. William Henry M 35.0 3 8.0500 Southampton no

Select Specific Columns

To select a single column, use square brackets [] with the column name of the column of interest.

ages = titanic["age"]
ages.head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64
age_sex = titanic[["age", "gender"]]
age_sex.head()
age gender
0 22.0 M
1 38.0 F
2 26.0 F
3 35.0 F
4 35.0 M

Understand the dataframe

type(titanic[["age", "gender"]])
pandas.core.frame.DataFrame
titanic[["age", "gender"]].shape
(891, 2)

Select Specific Rows

above_35 = titanic[titanic["age"] > 35]
above_35.head()
last first gender age class fare embarked survived
1 Cumings Mrs. John Bradley (Florence Briggs Thayer) F 38.0 1 71.2833 Cherbourg yes
6 McCarthy Mr. Timothy J M 54.0 1 51.8625 Southampton no
11 Bonnell Miss Elizabeth F 58.0 1 26.5500 Southampton yes
13 Andersson Mr. Anders Johan M 39.0 3 31.2750 Southampton no
15 Hewlett Mrs. (Mary D Kingcome) F 55.0 2 16.0000 Southampton yes
class_23 = titanic[titanic["class"].isin([2, 3])]
class_23.head()
last first gender age class fare embarked survived
0 Braund Mr. Owen Harris M 22.0 3 7.2500 Southampton no
2 Heikkinen Miss Laina F 26.0 3 7.9250 Southampton yes
4 Allen Mr. William Henry M 35.0 3 8.0500 Southampton no
5 Moran Mr. James M NaN 3 8.4583 Queenstown no
7 Palsson Master Gosta Leonard M 2.0 3 21.0750 Southampton no
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an (or) operator
class_23 = titanic[(titanic["class"] == 2) | (titanic["class"] == 3)]

I want to work with passenger data for which the age is known.

age_no_na = titanic[titanic["age"].notna()]
age_no_na.head()
last first gender age class fare embarked survived
0 Braund Mr. Owen Harris M 22.0 3 7.2500 Southampton no
1 Cumings Mrs. John Bradley (Florence Briggs Thayer) F 38.0 1 71.2833 Cherbourg yes
2 Heikkinen Miss Laina F 26.0 3 7.9250 Southampton yes
3 Futrelle Mrs. Jacques Heath (Lily May Peel) F 35.0 1 53.1000 Southampton yes
4 Allen Mr. William Henry M 35.0 3 8.0500 Southampton no

How do I select specific rows and columns from a DataFrame?

I’m interested in the names of the passengers older than 35 years.

adult_names = titanic.loc[titanic["age"] > 35, "last"]
adult_names.head()
1       Cumings
6      McCarthy
11      Bonnell
13    Andersson
15      Hewlett
Name: last, dtype: object

I’m interested in rows 10 till 25 and columns 3 to 5.

titanic.iloc[9:25, 2:5]
gender age class
9 F 14.0 2
10 F 4.0 3
11 F 58.0 1
12 M 20.0 3
13 M 39.0 3
14 F 14.0 3
15 F 55.0 2
16 M 2.0 3
17 M NaN 2
18 F 31.0 3
19 F NaN 3
20 M 35.0 2
21 M 34.0 2
22 F 15.0 3
23 M 28.0 1
24 F 8.0 3

When selecting subsets of data, square brackets [] are used.

Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

Select specific rows and/or columns using loc when using the row and column names

Select specific rows and/or columns using iloc when using the positions in the table

You can assign new values to a selection based on loc/iloc.

# comparson between two columns
#df.query('value_1 < value_2')