mattconners.github.io

Doing data science - some technical learning resources

import pandas as pd

titanic = pd.read_csv('https://web.stanford.edu/class/cs102/datasets/Titanic.csv')

 titanic.head()

	last	first	gender	age	class	fare	embarked	survived
0	Braund	Mr. Owen Harris	M	22.0	3	7.2500	Southampton	no
1	Cumings	Mrs. John Bradley (Florence Briggs Thayer)	F	38.0	1	71.2833	Cherbourg	yes
2	Heikkinen	Miss Laina	F	26.0	3	7.9250	Southampton	yes
3	Futrelle	Mrs. Jacques Heath (Lily May Peel)	F	35.0	1	53.1000	Southampton	yes
4	Allen	Mr. William Henry	M	35.0	3	8.0500	Southampton	no

To select a single column, use square brackets [] with the column name of the column of interest.

ages = titanic["age"]

ages.head()

  22.0
  38.0
  26.0
  35.0
  35.0
Name: age, dtype: float64

age_sex = titanic[["age", "gender"]]

age_sex.head()

type(titanic[["age", "gender"]])

pandas.core.frame.DataFrame

titanic[["age", "gender"]].shape

(891, 2)

above_35 = titanic[titanic["age"] > 35]

above_35.head()

	last	first	gender	age	class	fare	embarked	survived
1	Cumings	Mrs. John Bradley (Florence Briggs Thayer)	F	38.0	1	71.2833	Cherbourg	yes
6	McCarthy	Mr. Timothy J	M	54.0	1	51.8625	Southampton	no
11	Bonnell	Miss Elizabeth	F	58.0	1	26.5500	Southampton	yes
13	Andersson	Mr. Anders Johan	M	39.0	3	31.2750	Southampton	no
15	Hewlett	Mrs. (Mary D Kingcome)	F	55.0	2	16.0000	Southampton	yes

class_23 = titanic[titanic["class"].isin([2, 3])]

class_23.head()

	last	first	gender	age	class	fare	embarked	survived
0	Braund	Mr. Owen Harris	M	22.0	3	7.2500	Southampton	no
2	Heikkinen	Miss Laina	F	26.0	3	7.9250	Southampton	yes
4	Allen	Mr. William Henry	M	35.0	3	8.0500	Southampton	no
5	Moran	Mr. James	M	NaN	3	8.4583	Queenstown	no
7	Palsson	Master Gosta Leonard	M	2.0	3	21.0750	Southampton	no

The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an

(or) operator

class_23 = titanic[(titanic["class"] == 2) | (titanic["class"] == 3)]

I want to work with passenger data for which the age is known.

age_no_na = titanic[titanic["age"].notna()]

age_no_na.head()

	last	first	gender	age	class	fare	embarked	survived
0	Braund	Mr. Owen Harris	M	22.0	3	7.2500	Southampton	no
1	Cumings	Mrs. John Bradley (Florence Briggs Thayer)	F	38.0	1	71.2833	Cherbourg	yes
2	Heikkinen	Miss Laina	F	26.0	3	7.9250	Southampton	yes
3	Futrelle	Mrs. Jacques Heath (Lily May Peel)	F	35.0	1	53.1000	Southampton	yes
4	Allen	Mr. William Henry	M	35.0	3	8.0500	Southampton	no

I’m interested in the names of the passengers older than 35 years.

adult_names = titanic.loc[titanic["age"] > 35, "last"]

adult_names.head()

     Cumings
    McCarthy
    Bonnell
  Andersson
    Hewlett
Name: last, dtype: object

I’m interested in rows 10 till 25 and columns 3 to 5.

titanic.iloc[9:25, 2:5]

When selecting subsets of data, square brackets [] are used.

Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

Select specific rows and/or columns using loc when using the row and column names

Select specific rows and/or columns using iloc when using the positions in the table

You can assign new values to a selection based on loc/iloc.

# comparson between two columns
#df.query('value_1 < value_2')