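split() breaks each string around a delimiter, with options to limit the number of splits ( n ), expand the result into columns ( expand ) and choose the direction of splitting ( rsplit() ), using the Pandas .str methods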

import pandas as pd 
my_dict={'name':['Ravi King','Raju Queen','Alex Jack']}
df = pd.DataFrame(data=my_dict)
print(df.name.str.split()) # without delimiter

Output

0     [Ravi, King]
1    [Raju, Queen]
2     [Alex, Jack]

Returns a Series, Index or DataFrame, depending on the input and the expand option.
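
As a rough illustration of the return types, the sketch below calls split() on a Series and on an Index. The sample values and variable names are chosen only for this example.

```python
import pandas as pd

s = pd.Series(['Ravi King', 'Raju Queen'])    # sample data for illustration
idx = pd.Index(['Ravi King', 'Raju Queen'])

as_series = s.str.split()               # Series, each element is a list
as_frame = s.str.split(expand=True)     # DataFrame, one column per piece
as_index = idx.str.split()              # Index, each element is a list

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(type(as_index).__name__)
```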

Options

Split each email address at @ to separate the userid from the domain part.

import pandas as pd 
my_dict={'email':['Ravi@example.com','Raju@example.com','Alex@example.com']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.split('@'))

Output

0    [Ravi, example.com]
1    [Raju, example.com]
2    [Alex, example.com]

With the option expand=True, the pieces are returned as columns of a DataFrame instead of lists. We can then select a column to get the part we need.

print(df.email.str.split('@',expand=True))

Output

      0            1
0  Ravi  example.com
1  Raju  example.com
2  Alex  example.com

The userid part is in column 0 and can be collected like this

print(df.email.str.split('@',expand=True)[0])

Change the column label to 1 ( [1] ) to get the domain part.
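
Instead of reading the numbered columns one at a time, both parts can be stored under readable names in one step. This is a minimal sketch; the column names userid and domain are assumptions made for the example.

```python
import pandas as pd

my_dict = {'email': ['Ravi@example.com', 'Raju@example.com', 'Alex@example.com']}
df = pd.DataFrame(data=my_dict)

# 'userid' and 'domain' are names chosen for this example
df[['userid', 'domain']] = df.email.str.split('@', expand=True)
print(df)
```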

Handling NaN

import numpy as np
import pandas as pd 
my_dict={'email':['Ravi@example.com','Raju@example.com',np.nan,'Alex@example.com']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.split('@'))

Output

0    [Ravi, example.com]
1    [Raju, example.com]
2                    NaN
3    [Alex, example.com]

Using str.get() to pick one element from each list. NaN rows stay NaN.

print(df.email.str.split('@').str.get(0))

Output

0    Ravi
1    Raju
2     NaN
3    Alex
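
The same approach should work for the domain part by asking for position 1. The sketch below also shows the indexing form .str[1], which behaves like .str.get(1) here; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

my_dict = {'email': ['Ravi@example.com', 'Raju@example.com', np.nan, 'Alex@example.com']}
df = pd.DataFrame(data=my_dict)

parts = df.email.str.split('@')   # Series of lists, NaN is kept as NaN
userid = parts.str.get(0)         # first element of each list
domain = parts.str.get(1)         # second element of each list
domain_alt = parts.str[1]         # indexing form of the same lookup
print(domain)
```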

n = int ( default -1, all splits )

We can limit the number of splits with n. By default every occurrence of the delimiter is used ( n=-1 ). The sample data below has been changed to include more than one delimiter.

import numpy as np
import pandas as pd 
my_dict={'email':['id.Ravi@example.co.in','id.Raju@example.co.in',np.nan,'id.Alex@example.co.in']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.split('.',expand=True,n=1))

Output

     0                   1
0   id  Ravi@example.co.in
1   id  Raju@example.co.in
2  NaN                 NaN
3   id  Alex@example.co.in
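
To see the effect of n more directly, the sketch below splits one of the sample addresses with different values of n and counts the pieces. The dictionary name counts is an assumption made for this example.

```python
import pandas as pd

s = pd.Series(['id.Ravi@example.co.in'])

counts = {}
for n in (-1, 1, 2):
    pieces = s.str.split('.', n=n)[0]   # list of pieces for the first row
    counts[n] = len(pieces)
    print(n, pieces)
```

With n splits we get at most n+1 pieces; the remainder of the string is left untouched in the last piece.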

rsplit()

We can split the string starting from the right side ( the end ) by using rsplit(). This matters only when n limits the number of splits.

import numpy as np
import pandas as pd 
my_dict={'email':['id.Ravi@example.co.in','id.Raju@example.co.in',np.nan,'id.Alex@example.co.in']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.rsplit('.',expand=True,n=1))

Output

                    0    1
0  id.Ravi@example.co   in
1  id.Raju@example.co   in
2                 NaN  NaN
3  id.Alex@example.co   in
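
A short side-by-side comparison, as a sketch: with n=1 the two methods cut at opposite ends of the string, while without n they return the same pieces. Variable names are chosen for illustration.

```python
import pandas as pd

s = pd.Series(['id.Ravi@example.co.in'])

left = s.str.split('.', n=1)[0]     # cut at the first dot
right = s.str.rsplit('.', n=1)[0]   # cut at the last dot
same = s.str.split('.')[0] == s.str.rsplit('.')[0]   # no limit: same pieces

print(left)
print(right)
print(same)
```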

Uses of split()

A common requirement is to separate the directory and the file name from a path. Here is some sample data with page addresses ( URLs ). Let us collect the directory name and the file name from each one.

import pandas as pd 
my_dict={'Page':['https://www.plus2net.com/html_tutorial/button-linking.php',
                 'https://www.plus2net.com/c-tutorial/grade.php',
                 'https://www.plus2net.com/sql_tutorial/between-date.php',
                 'https://www.plus2net.com/php_tutorial/variables2.php',
                 'https://www.plus2net.com/sql_tutorial/sql_like.php',
                 'https://www.plus2net.com/sql_tutorial/sql_sum-multiple.php',
                 'https://www.plus2net.com/sql_tutorial/date-lastweek.php',
                 'https://www.plus2net.com/sql_tutorial/sql_max.php',
                 'https://www.plus2net.com/sql_tutorial/sql_count.php',
                 'https://www.plus2net.com/html_tutorial/html_marquee_behvr.php',
                 'https://www.plus2net.com/javascript_tutorial/clock.php',
                 'https://www.plus2net.com/php_tutorial/php_drop_down_list.php'
]}
df = pd.DataFrame(data=my_dict)
print(df.Page.str.split('/',expand=True)[3])

Output is here

0           html_tutorial
1              c-tutorial
2            sql_tutorial
3            php_tutorial
4            sql_tutorial
5            sql_tutorial
6            sql_tutorial
7            sql_tutorial
8            sql_tutorial
9           html_tutorial
10    javascript_tutorial
11           php_tutorial

To get the file name, select column 4 instead

print(df.Page.str.split('/',expand=True)[4])
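
If the addresses do not all have the same depth, a fixed column number may point at the wrong piece. One possible alternative, sketched here with two of the URLs from the sample data, is to split once from the right and take the last element.

```python
import pandas as pd

my_dict = {'Page': ['https://www.plus2net.com/html_tutorial/button-linking.php',
                    'https://www.plus2net.com/c-tutorial/grade.php']}
df = pd.DataFrame(data=my_dict)

# one split from the right, then the last piece is the file name
file_name = df.Page.str.rsplit('/', n=1).str.get(-1)
print(file_name)
```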

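After the split we can add the pieces to the DataFrame as new columns

# pieces: 'https:', '' ( between the two slashes ), domain, directory, file
df[['protocol','blank','domain','dir','file']]=df.Page.str.split('/',expand=True)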

When we are not sure how many columns the split will return ( sometimes 2, sometimes more ), we can generate the column names from the result and then join it to the original DataFrame.

df3 = df['Page'].str.split('/', expand=True)
df3.columns = ['page_id{}'.format(x+1) for x in df3.columns]
df = df.join(df3)
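
The sketch below runs the same three steps on made-up paths of different depth ( not taken from the sample data above ) to show what happens to the shorter rows: the extra columns are filled with missing values.

```python
import pandas as pd

# made-up paths with a different number of pieces, for illustration only
df = pd.DataFrame({'Page': ['html_tutorial/button-linking.php',
                            'sql_tutorial/date/between-date.php']})

df3 = df['Page'].str.split('/', expand=True)
# default column labels are 0, 1, 2 ... so x+1 gives page_id1, page_id2 ...
df3.columns = ['page_id{}'.format(x + 1) for x in df3.columns]
df = df.join(df3)
print(df)
```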
