FUN WITH PYTHON
AGENDA
▸ Using Python to Access Web Data
▸ Using Databases with Python
▸ Processing and Visualizing Data with Python
USING PYTHON TO ACCESS WEB DATA
Access Web Data
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
▸ Web Parser
▸ Web Services
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Requests Library
import requests
requests.get('http://www.facebook.com').text
pip install requests #install library
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Make a Request
#GET Request



import requests
r = requests.get('http://www.facebook.com')

if r.status_code == 200:
    print("Success")
Success
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Make a Request
#POST Request



import requests
r = requests.post('http://httpbin.org/post', data = {'key':'value'})

if r.status_code == 200:
    print("Success")
Success
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Make a Request
#Other Types of Request



import requests
r = requests.put('http://httpbin.org/put', data = {'key':'value'})

r = requests.delete('http://httpbin.org/delete')

r = requests.head('http://httpbin.org/get') 

r = requests.options('http://httpbin.org/get')
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Passing Parameters In URLs
#GET Request with parameter



import requests
r = requests.get('https://www.google.co.th/?hl=th')

if r.status_code == 200:
    print("Success")
Success
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Passing Parameters In URLs
#GET Request with parameter
import requests
r = requests.get('https://www.google.co.th', params={"hl": "en"})

if r.status_code == 200:
    print("Success")
Success
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Passing Parameters In URLs
#POST Request with parameter
import requests
r = requests.post("https://m.facebook.com",data={"key":"value"})

if r.status_code == 200:
    print("Success")
Success
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Response Content
#Text Response
import requests



data = {"email": "…..", "pass": "……"}

r = requests.post("https://m.facebook.com", data=data)

if r.status_code == 200:
    print(r.text)
'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML
Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd"><html xmlns="http://
www.w3.org/1999/xhtml"><head><title>Facebook</title><meta name="referrer"
content="default" id="meta_referrer" /><style type=“text/css”>/*<!………………..
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Response Content
#Response encoding
import requests
r = requests.get('https://www.google.co.th/logos/doodles/2016/king-
bhumibol-adulyadej-1927-2016-5148101410029568.2-hp.png') 

r.encoding = 'tis-620'

if r.status_code == 200:
    print(r.text)
'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage"
lang="th"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta
content="/logos/doodles/2016/king-bhumibol-adulyadej-1927-2016-5148101410029568.2-
hp.png" itemprop="image"><meta content="ปวงข้าพระพุทธเจ้า ขอน้อมเกล้าน้อมกระหม่อมรำลึกใน...
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Response Content
#Binary Response




import requests
r = requests.get('https://www.google.co.th/logos/doodles/2016/king-
bhumibol-adulyadej-1927-2016-5148101410029568.2-hp.png') 

if r.status_code == 200:
    open("img.png", "wb").write(r.content)
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Response Status Codes
#200 Response (OK)



import requests
r = requests.get('https://api.github.com/events')

if r.status_code == requests.codes.ok:
    data = r.json()
    print(data[0]['actor'])


{'url': 'https://api.github.com/users/ShaolinSarg', 'display_login': 'ShaolinSarg', 'avatar_url': 'https://
avatars.githubusercontent.com/u/6948796?', 'id': 6948796, 'login': 'ShaolinSarg', 'gravatar_id': ''}
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Response Status Codes
#200 Response (OK)



import requests
r = requests.get('https://api.github.com/events')

print(r.status_code)
200
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Response Status Codes
#404



import requests
r = requests.get('https://api.github.com/events/404')

print(r.status_code)

404
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Response Headers
#Response headers



import requests
r = requests.get('http://www.sanook.com')

print(r.headers)

print(r.headers['Date'])

{'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 08 Nov 2016 14:38:41 GMT', 'Cache-
Control': 'private, max-age=0', 'Age': '16', 'Content-Encoding': 'gzip', 'Content-Length': '38089',
'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Accept-Ranges': 'bytes'}



Tue, 08 Nov 2016 14:38:41 GMT
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Timeouts
#Timeout



import requests
r = requests.get('http://www.sanook.com', timeout=0.001)

ReadTimeout: HTTPConnectionPool(host='www.sanook.com', port=80): Read timed out. (read
timeout=0.001)
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Authentication
#Basic Authentication



import requests
r = requests.get('https://api.github.com/user', auth=('user', 'pass'))

print(r.status_code)

200
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
read more : http://docs.python-requests.org/en/master/
USING PYTHON TO ACCESS WEB DATA
▸ Web Requests
Quiz#1 : Tag Monitoring
1. Get webpage : http://pantip.com/tags
2. Save to file every 5 minutes (time.sleep(300))
3. Use the current date and time as the filename
(How do you get the current date and time in Python? Look it up on Google.)
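A minimal sketch of one possible solution, assuming the page is reachable and using a timestamped filename:

import time
import requests
from datetime import datetime

# Sketch for Quiz#1: fetch the tag page every 5 minutes and save each
# snapshot under a timestamped filename.
while True:
    r = requests.get('http://pantip.com/tags')
    if r.status_code == 200:
        filename = datetime.now().strftime('%Y-%m-%d_%H-%M-%S') + '.html'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(r.text)
    time.sleep(300)  # wait 5 minutes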
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
HTML Parser : beautifulsoup
from bs4 import BeautifulSoup



soup = BeautifulSoup(open("file.html"), "html.parser") #parse from file

soup = BeautifulSoup("<html>data</html>", "html.parser") #parse from text
pip install beautifulsoup4 #install library
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "html.parser")

print(soup)
<html>data</html>
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
#Navigating using tag names
from bs4 import BeautifulSoup



html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p></body>"""
soup = BeautifulSoup(html_doc,"html.parser")

soup.head 

soup.title

soup.body.p
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
<p class="title"><b>The Dormouse's story</b></p>
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
#Access string
from bs4 import BeautifulSoup



html_doc = """<h1>hello</h1>"""
soup = BeautifulSoup(html_doc,"html.parser")

print(soup.h1.string)
hello
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
#Access attribute
from bs4 import BeautifulSoup



html_doc = '<a href="http://example.com/elsie" >Elsie</a>'
soup = BeautifulSoup(html_doc,"html.parser")

print(soup.a['href'])
http://example.com/elsie
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
#Get all text in the page
from bs4 import BeautifulSoup



html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p></body>"""
soup = BeautifulSoup(html_doc,"html.parser")

print(soup.get_text())
The Dormouse's storyThe Dormouse's story
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
# find_all()
from bs4 import BeautifulSoup



html_doc = """<a href="http://example.com/elsie" class="sister"
id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister"
id="link2">Lacie</a> and <a href="http://example.com/tillie"
class="sister" id="link3">Tillie</a>;"""
soup = BeautifulSoup(html_doc,"html.parser")

for a in soup.find_all('a'):
    print(a)
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
<a class="sister" href="http://example.com/elsie"
id="link1">Elsie</a>

<a class="sister" href="http://example.com/lacie"
id="link2">Lacie</a>

<a class="sister" href="http://example.com/tillie"
id="link3">Tillie</a>
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
#find_all()

soup.find_all(id='link2')



soup.find_all(href=re.compile("elsie"))



soup.find_all(id=True) 



soup.find_all(attrs={"data-foo": "value"})



soup.find_all("a", class_="sister")



soup.find_all("a", recursive=False)

soup.p.find_all("a", recursive=False)
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
re.compile(…..)
<a href="http://192.x.x.x" class="c1">hello</a>

<a href="https://192.x.x.x" class="c1">hello</a>

<a href="https://www.com" class="c1">hello</a>
find_all(href=re.compile('(https|http)://[0-9.]'))
https://docs.python.org/2/howto/regex.html
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Parse a document
read more : https://www.crummy.com/software/BeautifulSoup/bs4/doc/
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Quiz#2 : Tag Extraction
1. Get webpage : http://pantip.com/tags
2. Extract the tag name, tag link, and number of topics in

the first 10 pages
3. Save to a file in this format:

tag name, tag link, number of topics, current datetime
4. Run every 5 minutes
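A rough sketch of the extraction loop; the selectors below are hypothetical placeholders, since the real pantip.com markup has to be inspected first:

import csv
import requests
from datetime import datetime
from bs4 import BeautifulSoup

# Sketch for Quiz#2: 'tag-item' and the following <span> are placeholder
# selectors, not the real pantip.com markup.
r = requests.get('http://pantip.com/tags')
soup = BeautifulSoup(r.text, 'html.parser')
now = datetime.now().isoformat()
with open('tags.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for tag in soup.find_all('a', class_='tag-item'):
        name = tag.get_text(strip=True)
        link = tag['href']
        topics = tag.find_next('span').get_text(strip=True)
        writer.writerow([name, link, topics, now])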
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
JSON Parser : json
import json



json_doc = json.loads('{"key": "value"}')
built-in module
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
JSON Parser : json
#JSON string



json_doc = """{"employees":[

{"firstName":"John", "lastName":"Doe"},

{"firstName":"Anna", "lastName":"Smith"},

{"firstName":"Peter", "lastName":"Jones"}

]}"""
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
JSON Parser : json
#Parse string to object



import json
json_obj = json.loads(json_doc)

print(json_obj)
{'employees': [{'firstName': 'John', 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'},
{'firstName': 'Peter', 'lastName': 'Jones'}]}
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
JSON Parser : json
#Access json object



import json
json_obj = json.loads(json_doc)

print(json_obj['employees'][0]['firstName'])

print(json_obj['employees'][0]['lastName'])
John

Doe
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
JSON Parser : json
#Create json doc



import json
json_obj = {"firstName" : "name", "lastName" : "last"} #Dictionary

print(json.dumps(json_obj,indent=1))
{

"firstName": "name",

"lastName": "last"

}
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Quiz#3 : Post Monitoring
1. Register as Facebook Developer on
developers.facebook.com
2. Get information on the posts from the last 10 hours on the page

https://www.facebook.com/MorningNewsTV3

3. Save to a file in this format:

post id, post datetime, number of likes, current datetime
USING PYTHON TO ACCESS WEB DATA
▸ Web Parser
Quiz#3 : Post Monitoring
URL





https://graph.facebook.com/v2.8/<PageID>?fields=posts.limit(100)%7Blikes.limit(1).summary(true)%2Ccreated_time%7D&access_token=
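A hedged sketch of the fetch, reusing the URL template above; <PageID> and ACCESS_TOKEN are placeholders, and the exact response shape should be checked in the Graph API Explorer:

import requests

# Sketch for Quiz#3: fill in the page ID and a valid access token.
url = ('https://graph.facebook.com/v2.8/<PageID>'
       '?fields=posts.limit(100)%7Blikes.limit(1).summary(true)%2Ccreated_time%7D'
       '&access_token=ACCESS_TOKEN')
r = requests.get(url)
if r.status_code == 200:
    for post in r.json()['posts']['data']:
        print(post['id'], post['created_time'],
              post['likes']['summary']['total_count'])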
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
Web Service Type
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
SOAP Example
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
SOAP Request
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
REST
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
REST Request
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
JSON Web Service
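As a quick illustration of consuming a JSON web service, requests parses the response body straight into Python objects with r.json() (a small sketch using the GitHub events endpoint shown earlier):

import requests

# Call a JSON web service and work with the parsed result directly.
r = requests.get('https://api.github.com/events')
if r.status_code == 200:
    events = r.json()          # a list of dicts
    print(events[0]['type'])   # access fields like any Python dict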
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
Application
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
JSON


{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}
list

dict

key

value
read more : http://www.json.org/
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
Create Simple Web Service
from flask_api import FlaskAPI
app = FlaskAPI(__name__)



@app.route('/example/')

def example():

    return {'hello': 'world'}



app.run(debug=False,port=5555)
pip install Flask-API
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
Create Simple Web Service
#receive input



from flask_api import FlaskAPI
app = FlaskAPI(__name__)



@app.route('/hello/<name>/<lastName>')

def example(name, lastName):

    return {'hello': name}



app.run(debug=False,port=5555)
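With the service running, it can be called from another script or terminal; a quick check with requests, assuming the server above is listening on localhost:5555:

import requests

# Call the local web service started above.
r = requests.get('http://127.0.0.1:5555/hello/John/Doe')
print(r.json())   # {'hello': 'John'}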
USING PYTHON TO ACCESS WEB DATA
▸ Web Service
Quiz#4 : Top Tag Service
1. Build a getTopTagInfo web service.
2. Input: number of top topics
3. Output: tag names and topic counts in JSON

format
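One possible shape for the service, assuming a helper that reuses the Quiz#2 scraper (the stub below only returns placeholder data):

from flask_api import FlaskAPI

app = FlaskAPI(__name__)

def get_top_tags(n):
    # Placeholder: in the real quiz this would reuse the Quiz#2 scraper.
    return [{'tag': 'example', 'topics': 0}] * n

@app.route('/getTopTagInfo/<int:n>')
def get_top_tag_info(n):
    return {'topTags': get_top_tags(n)}

app.run(debug=False, port=5555)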
USING DATABASES WITH PYTHON
Databases
USING DATABASES WITH PYTHON
Zero configuration 

– SQLite does not need to be installed; there is no setup procedure to use it.
Serverless

– SQLite is not implemented as a separate server process. With SQLite, the process that wants to access the
database reads and writes directly from the database files on disk as there is no intermediary server process.
Stable Cross-Platform Database File 

– The SQLite file format is cross-platform. A database file written on one machine can be copied to and used
on a different machine with a different architecture.
Single Database File 

– An SQLite database is a single ordinary disk file that can be located anywhere in the directory hierarchy.
Compact 

– When optimized for size, the whole SQLite library with everything enabled is less than 400 KB in size.
USING DATABASES WITH PYTHON
SQLite


import sqlite3
conn = sqlite3.connect('my.db')
built-in library : sqlite3
USING DATABASES WITH PYTHON
SQLite
1. Connect to db
2. Get cursor
3. Execute command
4. Commit (insert / update/delete) / Fetch result (select)
5. Close database
Workflow
USING DATABASES WITH PYTHON
SQLite
import sqlite3

conn = sqlite3.connect('example.db') # connect db

c = conn.cursor() # get cursor
# execute1

c.execute('''CREATE TABLE stocks

(date text, trans text, symbol text, qty real, price real)''')
# execute2

c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
conn.commit() # commit

conn.close() # close
Workflow Example
USING DATABASES WITH PYTHON
SQLite
Data Type
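The original slide showed the type-mapping table as an image; in short, the sqlite3 module maps None → NULL, int → INTEGER, float → REAL, str → TEXT and bytes → BLOB. A quick check:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a, b, c, d, e)')
conn.execute('INSERT INTO t VALUES (?,?,?,?,?)', (None, 1, 1.5, 'text', b'blob'))
for row in conn.execute('SELECT typeof(a), typeof(b), typeof(c), typeof(d), typeof(e) FROM t'):
    print(row)   # ('null', 'integer', 'real', 'text', 'blob')
conn.close()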
USING DATABASES WITH PYTHON
Database Storage
import sqlite3
conn = sqlite3.connect('example.db') #store on disk
conn = sqlite3.connect(':memory:') #store in memory
USING DATABASES WITH PYTHON
Execute
#execute



import sqlite3
conn = sqlite3.connect('example.db')

c = conn.cursor()

t = ('RHAT',)

c.execute('SELECT * FROM stocks WHERE symbol=?', t)
USING DATABASES WITH PYTHON
Execute
#executemany



import sqlite3
conn = sqlite3.connect('example.db')

c = conn.cursor()

purchases = [('2006-03-28', 'BUY', 'IBM', 1000, 45.00),

('2006-04-05', 'BUY', 'MSFT', 1000, 72.00),

('2006-04-06', 'SELL', 'IBM', 500, 53.00),]
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)
USING DATABASES WITH PYTHON
fetch
#fetchone



import sqlite3
conn = sqlite3.connect('example.db')

c = conn.cursor()

c.execute('SELECT * FROM stocks')
c.fetchone()
('2006-01-05', 'BUY', 'RHAT', 100.0, 35.14)
USING DATABASES WITH PYTHON
fetch
#fetchall



import sqlite3
conn = sqlite3.connect('example.db')

c = conn.cursor()

c.execute('SELECT * FROM stocks')
for d in c.fetchall():

    print(d)
('2006-01-05', 'BUY', 'RHAT', 100.0, 35.14)

('2006-03-28', 'BUY', 'IBM', 1000.0, 45.0)

('2006-04-05', 'BUY', 'MSFT', 1000.0, 72.0)

('2006-04-06', 'SELL', 'IBM', 500.0, 53.0)
USING DATABASES WITH PYTHON
Context manager
import sqlite3
con = sqlite3.connect(":memory:")
con.execute("create table person (id integer primary key, firstname
varchar unique)")
#con.commit() is called automatically afterwards

with con:

    con.execute("insert into person(firstname) values (?)", ("Joe",))
USING DATABASES WITH PYTHON
Read more : 

https://docs.python.org/2/library/sqlite3.html

https://www.tutorialspoint.com/python/python_database_access.htm
USING DATABASES WITH PYTHON
Quiz#5 : Post DB
1. Register as Facebook Developer on
developers.facebook.com
2. Get information on the posts from the last 10 hours on the page

https://www.facebook.com/MorningNewsTV3

(post id, post datetime, number of likes, current datetime)
3. Design and create a table to store the posts
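A minimal sketch of the storage side, assuming the post fields listed above; the rows would come from the Graph API call in Quiz#3, and the single row here is a placeholder:

import sqlite3
from datetime import datetime

conn = sqlite3.connect('posts.db')
conn.execute('''CREATE TABLE IF NOT EXISTS posts
                (post_id TEXT, post_datetime TEXT,
                 like_count INTEGER, fetched_at TEXT)''')

# Placeholder row standing in for data fetched from the Graph API.
post_rows = [('123_456', '2016-11-08T14:00:00+0000', 42, datetime.now().isoformat())]
conn.executemany('INSERT INTO posts VALUES (?,?,?,?)', post_rows)
conn.commit()
conn.close()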

PROCESSING AND VISUALIZING DATA WITH PYTHON
Processing and Visualizing
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Processing : pandas
pip install pandas
high-performance, easy-to-use data structures and
data analysis tools
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : Series
#create series with Array-like



import pandas as pd

from numpy.random import rand
s = pd.Series(rand(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
a 0.690232

b 0.738294

c 0.153817

d 0.619822

e 0.4347
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : Series
#create series with dictionary



import pandas as pd

from numpy.random import rand



d = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(d) #with dictionary
print(s)
a 0

b 1

c 2
dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : Series
#create series with Scalar



import pandas as pd

from numpy.random import rand
s = pd.Series(5., index=['a', 'b', 'a', 'd', 'a']) #index can duplicate
print(s['a'])
a 5

a 5

a 5
dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : Series
#access series data



import pandas as pd

from numpy.random import rand
s = pd.Series(5., index=['a', 'b', 'a', 'd', 'a']) #index can duplicate
print(s[0])

print(s[:3])
5.0

a 5

b 5

a 5
dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : Series
#series operations



import pandas as pd

from numpy.random import rand

import numpy as np
s = pd.Series(rand(10))
s = s + 2

s = s * s

s = np.exp(s)

print(s)

0 187.735606

1 691.660752

2 60.129741

3 595.438606

4 769.479456

5 397.052123

6 4691.926483

7 1427.593520

8 180.001824

9 410.994395
dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : Series
#series filtering



import pandas as pd

from numpy.random import rand

import numpy as np
s = pd.Series(rand(10))
s = s[s > 0.1]

print(s)

1 0.708700

2 0.910090

3 0.380613

6 0.692324

7 0.508440

8 0.763977

9 0.470675
dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : Series
#series incomplete data



import pandas as pd

from numpy.random import rand

import numpy as np
s1 = pd.Series(rand(10))

s2 = pd.Series(rand(8))
s = s1 + s2

print(s)

0 0.813747

1 1.373839

2 1.569716

3 1.624887

4 1.515665

5 0.526779

6 1.544327

7 0.740962

8 NaN

9 NaN
dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
A 2-dimensional labeled data structure with columns of potentially different types
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#create dataframe with dict



d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),

'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}



df = pd.DataFrame(d)

print(df)
one two

a 1 1

b 2 2

c 3 3

d NaN 4
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#create dataframe with dict list



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}



df = pd.DataFrame(d)

print(df)
one two

0 1 4

1 2 3

2 3 2

3 4 1
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#access dataframe column



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}



df = pd.DataFrame(d)

print(df['one'])
0 1

1 2

2 3

3 4
Name: one, dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#access dataframe row



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}



df = pd.DataFrame(d)

print(df.iloc[:3])
one two

0 1 4

1 2 3

2 3 2
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#add new column



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)

df['three'] = [1,2,3,2]

print(df)
one two three

0 1 4 1

1 2 3 2

2 3 2 3

3 4 1 2
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#show data : head() and tail()



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)

df['three'] = [1,2,3,2]

print(df.head())

print(df.tail())
one two three

0 1 4 1

1 2 3 2

2 3 2 3

3 4 1 2
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe summary



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)

print(df.describe())
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe function



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)
print(df.mean())
one 2.5

two 2.5

dtype: float64
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe function



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)

print(df.corr()) #calculate correlation
one two

one 1 -1

two -1 1
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe filtering



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)

print(df[(df['one'] > 1) & (df['one'] < 3)])
one two
1 2 3
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe filtering with isin



d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)

print(df[df['one'].isin([2,4])])
one two
1 2 3

3 4 1
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe with row data



d = [ [1., 2., 3., 4.], [4., 3., 2., 1.]]

df = pd.DataFrame(d)

df.columns = ["one","two","three","four"]

print(df)
one two three four

0 1 2 3 4

1 4 3 2 1
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe sort values



d = [ [2., 1., 3., 4.], [1., 3., 2., 4.]]

df = pd.DataFrame(d)

df.columns = ["one","two","three","four"]

df = df.sort_values(["one","two"], ascending=[1,0])

print(df)
one two three four

1 1 3 2 4

0 2 1 3 4
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe from csv file



df = pd.read_csv('file.csv')

print(df)
one two three
0 1 2 3

1 1 2 3

2 1 2 3
file.csv



one,two,three

1,2,3

1,2,3

1,2,3
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe from csv file, without header



df = pd.read_csv('file.csv', header=None)

print(df)
0 1 2
0 1 2 3

1 1 2 3

2 1 2 3
file.csv



1,2,3

1,2,3

1,2,3
PROCESSING AND VISUALIZING DATA WITH PYTHON
Pandas : DataFrame
#dataframe from html, need to install lxml first (pip install lxml)



df = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')



print(df[0])
Abbreviation State Name Capital Became a State
1 AL Alabama Montgomery December 14, 1819
2 AK Alaska Juneau January 3, 1959
3 AZ Arizona Phoenix February 14, 1912
PROCESSING AND VISUALIZING DATA WITH PYTHON
Quiz#6 : Data Exploration
1. Go to https://archive.ics.uci.edu/ml/datasets/Adult

to read the data description
2. Parse the data into pandas using read_csv() and set the
column names
3. Explore the data to answer the following questions:

- Find the number of persons in each education level.

- Find the correlation and covariance between the continuous
fields.

- Find the average age of the United-States population where
income is >50K.
PROCESSING AND VISUALIZING DATA WITH PYTHON
Quiz#6 : Data Exploration
df[3].value_counts()
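A hedged sketch of the exploration, assuming adult.data has been downloaded locally; the column names follow the UCI dataset description:

import pandas as pd

cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week',
        'native-country', 'income']
df = pd.read_csv('adult.data', header=None, names=cols, skipinitialspace=True)

print(df['education'].value_counts())   # persons per education level
num = df.select_dtypes(include='number')
print(num.corr())                       # correlation of the continuous fields
print(num.cov())                        # covariance of the continuous fields
mask = (df['native-country'] == 'United-States') & (df['income'] == '>50K')
print(df[mask]['age'].mean())           # average age, US rows with income >50K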
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
pip install seaborn
visualization library based on matplotlib
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : set inline plot for jupyter
%matplotlib inline

import numpy as np

import seaborn as sns



# Generate some sequential data
x = np.array(list("ABCDEFGHI"))
y1 = np.arange(1, 10)
sns.barplot(x=x, y=y1)
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : plot result
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : set layout
%matplotlib inline

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt
f,ax = plt.subplots(1,1,figsize=(10, 10))

sns.barplot(x=[1,2,3,4,5],y=[3,2,3,4,2])
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : set layout
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : set layout
%matplotlib inline

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt
f,ax = plt.subplots(2,2,figsize=(10, 10))

sns.barplot(x=[1,2,3,4,5],y=[3,2,3,4,2],ax=ax[0,0])

sns.distplot([3,2,3,4,2],ax=ax[0,1])
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : set layout
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : axis setting
%matplotlib inline

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt
f,ax = plt.subplots(figsize=(10, 5))

sns.barplot(x=[1,2,3,4,5],y=[3,2,3,4,2])
ax.set_xlabel("number")

ax.set_ylabel("value")
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : axis setting
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : with pandas dataframe
%matplotlib inline

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt



import pandas as pd

d = {'x' : [1., 2., 3., 4.], 'y' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)
f,ax = plt.subplots(figsize=(10, 5))

sns.barplot(x='x', y='y', data=df)
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : with pandas dataframe
PROCESSING AND VISUALIZING DATA WITH PYTHON
▸ Visualizing : seaborn
seaborn : plot types
http://seaborn.pydata.org/examples/index.html
PROCESSING AND VISUALIZING DATA WITH PYTHON
Quiz#7 : Adult Plot
1. Go to https://archive.ics.uci.edu/ml/datasets/Adult

to read the data description
2. Parse the data into pandas using read_csv() and set the
column names
3. Plot five charts.
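A short sketch of two of the five charts, re-loading the Adult data as in Quiz#6 (the remaining charts follow the same pattern; the choice of plots is up to the reader):

%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week',
        'native-country', 'income']
df = pd.read_csv('adult.data', header=None, names=cols, skipinitialspace=True)

f, ax = plt.subplots(2, 1, figsize=(10, 10))
sns.countplot(x='education', data=df, ax=ax[0])   # persons per education level
sns.distplot(df['age'], ax=ax[1])                 # age distribution
ax[0].tick_params(axis='x', rotation=90)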


