This blog post contains a number of useful code snippets, functions, and classes that will help with machine learning in Jupyter Notebooks. Specific usage instructions as well as dependencies

Modifying and Selecting Information from a DataFrame

Random Sample of Rows

Choose a random sample of rows from a large dataframe. This is very useful when trying to reduce your training set for debugging purposes.

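import numpy as np

# Pick 2 distinct row positions at random, then select those rows by position
indices = np.random.choice(len(df), size=2, replace=False)
smallerDF = df.iloc[indices]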
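
As a hedged sketch of the same idea, pandas also offers `DataFrame.sample`, which draws rows without replacement by default. The frame `df` and its columns below are made up for illustration; swap in your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a large training set
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# Draw 10 distinct rows; random_state makes the draw repeatable
smallerDF = df.sample(n=10, random_state=0)

# Alternatively, keep a fixed fraction of the rows
tenPercentDF = df.sample(frac=0.1, random_state=0)

# A positional alternative: pick distinct row positions, then select with iloc
rng = np.random.default_rng(0)
positions = rng.choice(len(df), size=10, replace=False)
alsoSmallerDF = df.iloc[positions]
```

Fixing the seed is handy while debugging, since every run of the notebook then trains on the same reduced set.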

Selecting and Removing Columns

It’s very easy to extract or remove columns from a pandas dataframe using its built-in indexing operations.

# Extract these keys into their own data frame
preserveKeys = ['x', 'y']
smallDF = df[preserveKeys]

# Create a new dataframe without certain columns
newDF = df.drop(columns=['z'])
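
To make the two operations concrete, here is a minimal sketch on a made-up three-column frame. The column name `not_a_column` is hypothetical and only there to show what `errors="ignore"` does.

```python
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

# Keep only x and y
smallDF = df[["x", "y"]]

# Drop z and keep everything else; df itself is left unchanged
newDF = df.drop(columns=["z"])

# errors="ignore" skips names that are not present instead of raising KeyError
safeDF = df.drop(columns=["z", "not_a_column"], errors="ignore")
```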

Converting All Non-Null Entries to 1

This is helpful when converting a dataframe into a boolean dataframe in which a 1 indicates the presence of a value and a null indicates there was no value. It turns any dataframe into a sort of “checkbox” grid, which is useful for data processing such as collaborative filtering, where only the existence of a value matters and not the value itself.

# notnull() returns a new DataFrame, so the original df is not modified
# Convert 'np.nan' to 0's and everything else to 1's
booleanDF = df.notnull().astype('int')

# Replace all 0's with 'np.nan'
booleanDF = booleanDF.replace(0, np.nan)
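
Here is a small sketch of the conversion on a made-up ratings table (the column names `movieA` and `movieB` are assumptions for illustration). It also shows a one-step alternative using `DataFrame.where`, which keeps the nulls and overwrites everything else with 1.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings table: NaN means the user never rated the movie
df = pd.DataFrame({
    "movieA": [4.5, np.nan, 2.0],
    "movieB": [np.nan, np.nan, 5.0],
})

# Two-step version: non-null -> 1, null -> 0, then 0 -> NaN
booleanDF = df.notnull().astype("int").replace(0, np.nan)

# One-step alternative: keep cells where the value is null, set the rest to 1
altDF = df.where(df.isnull(), 1)
```

Both results have a 1 exactly where the original frame had a value and a null everywhere else.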

Merging DataFrames

# Merge 2 dataframes with the same rows (i.e. add new columns)
finalDF = pd.concat([df1, df2], sort=True, axis=1)

# Merge 2 dataframes with the same columns (i.e. add more rows)
finalDF = pd.concat([df1, df2], sort=True, axis=0)

# Stack the rows of 2 dataframes whose columns differ: the result has the
# union of the columns (NaN where a value is missing) and a fresh 0..n-1 index
finalDF = pd.concat([df1, df2], sort=True, axis=0, ignore_index=True)
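
To see how the `axis` argument changes the result, here is a minimal sketch with made-up frames (`df1` through `df4` and their columns are assumptions for illustration). Checking `.shape` after each call is a quick way to confirm you merged in the direction you intended.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"c": [5, 6], "d": [7, 8]})  # same rows, new columns
df3 = pd.DataFrame({"a": [9], "b": [10]})       # same columns, one more row
df4 = pd.DataFrame({"a": [11], "e": [12]})      # only partly overlapping columns

# axis=1 places the frames side by side: shape (2, 4)
wideDF = pd.concat([df1, df2], axis=1)

# axis=0 stacks the rows; ignore_index renumbers them 0..n-1: shape (3, 2)
tallDF = pd.concat([df1, df3], axis=0, ignore_index=True)

# Differing columns are unioned and gaps are filled with NaN: shape (3, 3)
unionDF = pd.concat([df1, df4], axis=0, ignore_index=True, sort=True)
```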

Getting an Overview of a DataFrame

It’s often difficult to deal with abstract, seemingly black-box machine learning algorithms. Knowing what your data really looks like can help alleviate some of this stress. Here are a few examples that will help you understand what’s going on in your dataset.

Printing Basic Excerpts


df.describe() # Summary statistics for each column
df.head(5) # First 5 rows

Running df.describe() will print a table of the columns and their respective count (how many non-null values are in the column), mean, std (standard deviation), min, 25%, 50%, 75%, and max. By default only numeric columns are summarized, so some of your columns may be missing from the table. This can be fixed by passing in the argument include='all' as follows: df.describe(include='all').

Running df.head(n) will print the first n rows of your dataset and can give you a good understanding of the form of your data. While the describe() function is good at showing you basic distributions, the head() function will show you what your data actually looks like quickly and easily.
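
As a quick sketch of the difference between the two calls, the made-up frame below mixes a numeric column with a text column (both names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical dataset with one numeric and one text column
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
})

# Default describe(): numeric columns only, so just "age" appears
numericSummary = df.describe()

# include="all" also summarizes "city" with count, unique, top, and freq
fullSummary = df.describe(include="all")

# head(2) returns the first 2 rows exactly as they are stored
firstRows = df.head(2)
```

In a notebook, putting any of these names on the last line of a cell renders the table directly.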
