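str.split() breaks each string in a Series or Index at a delimiter. If no delimiter is given, the string is split on whitespace.
Returns a Series or Index of lists, or a DataFrame (MultiIndex for an Index) when expand=True.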
No delimiter
import pandas as pd
my_dict={'name':['Ravi King','Raju Queen','Alex Jack']}
df = pd.DataFrame(data=my_dict)
print(df.name.str.split())
Output
0 [Ravi, King]
1 [Raju, Queen]
2 [Alex, Jack]
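When no delimiter is passed, runs of consecutive whitespace are treated as a single separator, as with Python's own str.split(). A minimal sketch of this, using made-up sample names with extra spaces and a tab that are not part of the data above:

```python
import pandas as pd

# Illustrative data: two spaces in the first name, a tab in the second
s = pd.Series(['Ravi  King', 'Raju\tQueen'])

# No delimiter: split on any run of whitespace
words = s.str.split()
print(words.tolist())
```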
Options
Split each email address at @ to separate the userid from the domain.
import pandas as pd
my_dict={'email':['Ravi@example.com','Raju@example.com','Alex@example.com']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.split('@'))
Output
0 [Ravi, example.com]
1 [Raju, example.com]
2 [Alex, example.com]
With the option expand=True, each part is returned in its own column of a DataFrame instead of in a list. The columns are labelled 0, 1, ... and can be selected by these labels.
print(df.email.str.split('@',expand=True))
Output
      0            1
0  Ravi  example.com
1  Raju  example.com
2  Alex  example.com
The userid part is in column 0 and can be collected like this
print(df.email.str.split('@',expand=True)[0])
Change the column label to 1 ( [1] ) to get the domain part.
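The two parts can also be stored back in the DataFrame under readable names. A minimal sketch, where the column names userid and domain are chosen here for illustration only:

```python
import pandas as pd

my_dict = {'email': ['Ravi@example.com', 'Raju@example.com', 'Alex@example.com']}
df = pd.DataFrame(data=my_dict)

# expand=True returns a DataFrame with columns 0 and 1;
# assign them to two new named columns (the names are illustrative)
df[['userid', 'domain']] = df['email'].str.split('@', expand=True)
print(df)
```

This keeps the original email column and adds the two parts beside it.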
Handling NaN
import numpy as np
import pandas as pd
my_dict={'email':['Ravi@example.com','Raju@example.com',np.nan,'Alex@example.com']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.split('@'))
Output
0 [Ravi, example.com]
1 [Raju, example.com]
2 NaN
3 [Alex, example.com]
Using str.get() to pick one element from each list
print(df.email.str.split('@').str.get(0))
Output
0 Ravi
1 Raju
2 NaN
3 Alex
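The same approach returns the domain part, and missing values stay as NaN. A short sketch using the data above; the indexing form .str[1] is shown next to .str.get(1) because both select the same element from each list:

```python
import numpy as np
import pandas as pd

my_dict = {'email': ['Ravi@example.com', 'Raju@example.com', np.nan, 'Alex@example.com']}
df = pd.DataFrame(data=my_dict)

parts = df['email'].str.split('@')
domain = parts.str.get(1)  # second element of each list; NaN stays NaN
same = parts.str[1]        # indexing form of the same operation
print(domain)
print(domain.equals(same))
```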
n : int, default -1 (all splits)
The option n limits the number of splits. By default ( n=-1 ) the string is split at every occurrence of the delimiter. The sample data is changed here so that each string contains the delimiter more than once.
import numpy as np
import pandas as pd
my_dict={'email':['id.Ravi@example.co.in','id.Raju@example.co.in',np.nan,'id.Alex@example.co.in']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.split('.',expand=True,n=1))
Output
     0                   1
0   id  Ravi@example.co.in
1   id  Raju@example.co.in
2  NaN                 NaN
3   id  Alex@example.co.in
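To see how n changes the result, the same string can be split with different limits. A small sketch on a single value taken from the sample data; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(['id.Ravi@example.co.in'])

one = s.str.split('.', n=1).tolist()    # at most one split
two = s.str.split('.', n=2).tolist()    # at most two splits
all_parts = s.str.split('.').tolist()   # default n=-1, split everywhere

print(one)
print(two)
print(all_parts)
```

Each split adds one element, so n splits give at most n+1 parts.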
rsplit()
rsplit() splits the string starting from the right (the end). The difference from split() shows only when n limits the number of splits.
import numpy as np
import pandas as pd
my_dict={'email':['id.Ravi@example.co.in','id.Raju@example.co.in',np.nan,'id.Alex@example.co.in']}
df = pd.DataFrame(data=my_dict)
print(df.email.str.rsplit('.',expand=True,n=1))
Output
                    0    1
0  id.Ravi@example.co   in
1  id.Raju@example.co   in
2                 NaN  NaN
3  id.Alex@example.co   in
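Placing both methods side by side on one value makes the direction of the split visible. A minimal sketch; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(['id.Ravi@example.co.in'])

from_left = s.str.split('.', n=1).tolist()    # cuts at the first '.'
from_right = s.str.rsplit('.', n=1).tolist()  # cuts at the last '.'

# Last element after a right split: the text after the final '.'
suffix = s.str.rsplit('.', n=1).str.get(-1)

print(from_left)
print(from_right)
print(suffix.tolist())
```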
Uses of split()
A common requirement is to separate the directory and the file name from a path. The sample data below holds some addresses ( URLs ). Splitting each URL at / gives https: , an empty string, the domain, the directory name and the file name, so the directory name is in column 3.
import pandas as pd
my_dict={'Page':['https://www.plus2net.com/html_tutorial/button-linking.php',
'https://www.plus2net.com/c-tutorial/grade.php',
'https://www.plus2net.com/sql_tutorial/between-date.php',
'https://www.plus2net.com/php_tutorial/variables2.php',
'https://www.plus2net.com/sql_tutorial/sql_like.php',
'https://www.plus2net.com/sql_tutorial/sql_sum-multiple.php',
'https://www.plus2net.com/sql_tutorial/date-lastweek.php',
'https://www.plus2net.com/sql_tutorial/sql_max.php',
'https://www.plus2net.com/sql_tutorial/sql_count.php',
'https://www.plus2net.com/html_tutorial/html_marquee_behvr.php',
'https://www.plus2net.com/javascript_tutorial/clock.php',
'https://www.plus2net.com/php_tutorial/php_drop_down_list.php'
]}
df = pd.DataFrame(data=my_dict)
print(df.Page.str.split('/',expand=True)[3])
Output
0 html_tutorial
1 c-tutorial
2 sql_tutorial
3 php_tutorial
4 sql_tutorial
5 sql_tutorial
6 sql_tutorial
7 sql_tutorial
8 sql_tutorial
9 html_tutorial
10 javascript_tutorial
11 php_tutorial
The file name is the next part, in column 4
print(df.Page.str.split('/',expand=True)[4])
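The fixed positions 3 and 4 work because every URL here has exactly one directory level. As an alternative sketch that does not depend on counting from the left, rsplit() can cut at the last / to get the file name. Only two of the URLs above are used, and the column names base, file and directory are chosen for illustration:

```python
import pandas as pd

my_dict = {'Page': ['https://www.plus2net.com/html_tutorial/button-linking.php',
                    'https://www.plus2net.com/c-tutorial/grade.php']}
df = pd.DataFrame(data=my_dict)

# Split once from the right: text before the last '/' and the file name
parts = df['Page'].str.rsplit('/', n=1, expand=True)
parts.columns = ['base', 'file']  # illustrative names

# The directory is the last segment of the remaining base part
parts['directory'] = parts['base'].str.rsplit('/', n=1).str.get(-1)
print(parts[['directory', 'file']])
```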