Subsetting Data in Pandas
import pandas as pd
titanic = pd.read_csv('https://web.stanford.edu/class/cs102/datasets/Titanic.csv')
titanic.head()
last | first | gender | age | class | fare | embarked | survived | |
---|---|---|---|---|---|---|---|---|
0 | Braund | Mr. Owen Harris | M | 22.0 | 3 | 7.2500 | Southampton | no |
1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) | F | 38.0 | 1 | 71.2833 | Cherbourg | yes |
2 | Heikkinen | Miss Laina | F | 26.0 | 3 | 7.9250 | Southampton | yes |
3 | Futrelle | Mrs. Jacques Heath (Lily May Peel) | F | 35.0 | 1 | 53.1000 | Southampton | yes |
4 | Allen | Mr. William Henry | M | 35.0 | 3 | 8.0500 | Southampton | no |
Select Specific Columns
To select a single column, use square brackets [] with the column name of the column of interest.
ages = titanic["age"]
ages.head()
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: age, dtype: float64
age_sex = titanic[["age", "gender"]]
age_sex.head()
age | gender | |
---|---|---|
0 | 22.0 | M |
1 | 38.0 | F |
2 | 26.0 | F |
3 | 35.0 | F |
4 | 35.0 | M |
Understand the dataframe
type(titanic[["age", "gender"]])
pandas.core.frame.DataFrame
titanic[["age", "gender"]].shape
(891, 2)
Select Specific Rows
above_35 = titanic[titanic["age"] > 35]
above_35.head()
last | first | gender | age | class | fare | embarked | survived | |
---|---|---|---|---|---|---|---|---|
1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) | F | 38.0 | 1 | 71.2833 | Cherbourg | yes |
6 | McCarthy | Mr. Timothy J | M | 54.0 | 1 | 51.8625 | Southampton | no |
11 | Bonnell | Miss Elizabeth | F | 58.0 | 1 | 26.5500 | Southampton | yes |
13 | Andersson | Mr. Anders Johan | M | 39.0 | 3 | 31.2750 | Southampton | no |
15 | Hewlett | Mrs. (Mary D Kingcome) | F | 55.0 | 2 | 16.0000 | Southampton | yes |
class_23 = titanic[titanic["class"].isin([2, 3])]
class_23.head()
last | first | gender | age | class | fare | embarked | survived | |
---|---|---|---|---|---|---|---|---|
0 | Braund | Mr. Owen Harris | M | 22.0 | 3 | 7.2500 | Southampton | no |
2 | Heikkinen | Miss Laina | F | 26.0 | 3 | 7.9250 | Southampton | yes |
4 | Allen | Mr. William Henry | M | 35.0 | 3 | 8.0500 | Southampton | no |
5 | Moran | Mr. James | M | NaN | 3 | 8.4583 | Queenstown | no |
7 | Palsson | Master Gosta Leonard | M | 2.0 | 3 | 21.0750 | Southampton | no |
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator |
class_23 = titanic[(titanic["class"] == 2) | (titanic["class"] == 3)]
I want to work with passenger data for which the age is known.
age_no_na = titanic[titanic["age"].notna()]
age_no_na.head()
last | first | gender | age | class | fare | embarked | survived | |
---|---|---|---|---|---|---|---|---|
0 | Braund | Mr. Owen Harris | M | 22.0 | 3 | 7.2500 | Southampton | no |
1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) | F | 38.0 | 1 | 71.2833 | Cherbourg | yes |
2 | Heikkinen | Miss Laina | F | 26.0 | 3 | 7.9250 | Southampton | yes |
3 | Futrelle | Mrs. Jacques Heath (Lily May Peel) | F | 35.0 | 1 | 53.1000 | Southampton | yes |
4 | Allen | Mr. William Henry | M | 35.0 | 3 | 8.0500 | Southampton | no |
How do I select specific rows and columns from a DataFrame?
I’m interested in the names of the passengers older than 35 years.
adult_names = titanic.loc[titanic["age"] > 35, "last"]
adult_names.head()
1 Cumings
6 McCarthy
11 Bonnell
13 Andersson
15 Hewlett
Name: last, dtype: object
I’m interested in rows 10 till 25 and columns 3 to 5.
titanic.iloc[9:25, 2:5]
gender | age | class | |
---|---|---|---|
9 | F | 14.0 | 2 |
10 | F | 4.0 | 3 |
11 | F | 58.0 | 1 |
12 | M | 20.0 | 3 |
13 | M | 39.0 | 3 |
14 | F | 14.0 | 3 |
15 | F | 55.0 | 2 |
16 | M | 2.0 | 3 |
17 | M | NaN | 2 |
18 | F | 31.0 | 3 |
19 | F | NaN | 3 |
20 | M | 35.0 | 2 |
21 | M | 34.0 | 2 |
22 | F | 15.0 | 3 |
23 | M | 28.0 | 1 |
24 | F | 8.0 | 3 |
When selecting subsets of data, square brackets [] are used.
Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
Select specific rows and/or columns using loc when using the row and column names
Select specific rows and/or columns using iloc when using the positions in the table
You can assign new values to a selection based on loc/iloc.
# comparson between two columns
#df.query('value_1 < value_2')