
When it comes to plotting libraries for Python, matplotlib and seaborn are the best known.

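In my view, seaborn is designed mainly to complement matplotlib, and its interface feels inconsistent (at least once you are used to something like ggplot), so you end up spending a lot of time poring over the API manual.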

You probably get used to it eventually, but if you want to draw plots the same way in both Python and R, I recommend plotnine, a Python port of ggplot.

In [1]:
from plotnine import *
import pandas as pd  # needed below for pd.crosstab and pd.cut
import matplotlib as mpl

Diamonds 4C

Grades run from good (left) to poor (right):
Color: D, E, F, G, H, I, J
Clarity: FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
Cut: Ideal, Excellent (Premium), Very Good, Good, Fair, Poor
Carat: weight

Sample data on the 4Cs and price of diamonds. It ships with plotnine and is a standard dataset in R as well.

In [2]:
# Only color has its factor order reversed relative to the others, so flip it
from plotnine.data import diamonds as diamonds_
diamonds = diamonds_.assign(color = diamonds_.color.cat.reorder_categories(diamonds_.color.cat.categories[::-1]))
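
To see what the reversal above does, here is a minimal sketch on a toy Series standing in for `diamonds.color`. The toy values, and the assumption that the column is an ordered categorical listed D through J, are mine. Only the category order changes; the values stay put.

```python
import pandas as pd

# Toy stand-in for diamonds.color (assumed: ordered categorical, D ... J)
color = pd.Series(
    pd.Categorical(["E", "D", "J", "E"], categories=list("DEFGHIJ"), ordered=True)
)

# Reverse the category order; the stored values are unchanged
flipped = color.cat.reorder_categories(color.cat.categories[::-1])

print(list(color.cat.categories))    # original order
print(list(flipped.cat.categories))  # reversed order
print(flipped.tolist())              # same values as before
```

Because plotnine lays out a categorical axis in category order, flipping the categories is enough to flip the axis.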

categorical data

In [3]:
ggplot(diamonds, aes(x='color',y='stat(count)',fill='cut')) + geom_bar(position=position_dodge()) \
 + xlab('Color of diamonds') + ylab('Number count') + ggtitle('Cut/Color of diamonds')

# Equivalent, with the count stat given explicitly:
# ggplot(diamonds, aes(x='color',fill='cut')) + geom_bar(stat='count', position=position_dodge())
Out[3]:
<ggplot: (8792673502717)>
In [4]:
# size
display(ggplot(diamonds, aes(x='color',y='clarity')) + geom_count(aes(size='stat(n)'))
       + ggtitle('n of diamonds')
       )
display(ggplot(diamonds, aes(x='color',y='clarity')) + geom_count(aes(size='stat(prop)',group='clarity'))
        + ggtitle('proportion of color/clarity'))
display(ggplot(diamonds, aes(x='color',y='clarity')) + geom_count(aes(size='stat(prop)',group='color'))
        + ggtitle('proportion of clarity/color'))
<ggplot: (8792673400342)>
<ggplot: (8792673400426)>
<ggplot: (-9223363244192160782)>
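
The three plots above differ only in how the point size is computed. As a rough pandas sketch of my understanding (not plotnine's actual code): `n` is the row count at each position, and `prop` is that count divided by the total of its group. The toy data and column names are made up for illustration.

```python
import pandas as pd

# Toy data: 4 D-colored and 3 E-colored stones
df = pd.DataFrame({
    "color":   ["D", "D", "D", "D", "E", "E", "E"],
    "clarity": ["IF", "IF", "IF", "SI1", "IF", "SI1", "SI1"],
})

# n: number of rows at each (color, clarity) position
counts = df.groupby(["color", "clarity"]).size().rename("n").reset_index()

# prop with group='clarity': share of each color within one clarity level
counts["prop_within_clarity"] = (
    counts["n"] / counts.groupby("clarity")["n"].transform("sum")
)
# prop with group='color': share of each clarity level within one color
counts["prop_within_color"] = (
    counts["n"] / counts.groupby("color")["n"].transform("sum")
)

print(counts)
```

The two proportion columns disagree as soon as the group totals differ, which is why the second and third plots look different.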
In [5]:
ct = pd.crosstab(diamonds.cut, diamonds.color)
# Calling ct.reset_index() directly fails (the column labels are categorical),
# so convert the labels to str first.
# https://github.com/pandas-dev/pandas/issues/19136
ct = ct.rename(columns=str).reset_index()
display(ct)
ggplot(ct, aes(x='cut',y='D'))+geom_bar(stat='identity')
color        cut    J     I     H     G     F     E     D
0           Fair  119   175   303   314   312   224   163
1           Good  307   522   702   871   909   933   662
2      Very Good  678  1204  1824  2299  2164  2400  1513
3        Premium  808  1428  2360  2924  2331  2337  1603
4          Ideal  896  2093  3115  4884  3826  3903  2834
Out[5]:
<ggplot: (-9223363244181349930)>
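
The crosstab-then-flatten step can be tried on a small frame. This is a sketch with made-up data; the only calls used are `pd.crosstab`, `rename(columns=str)` and `reset_index()`, as in the cell above. Whether a plain `reset_index()` still fails depends on the pandas version, so I only show the workaround.

```python
import pandas as pd

# Toy categorical columns standing in for diamonds.cut and diamonds.color
cut = pd.Categorical(["Fair", "Ideal", "Ideal", "Fair", "Ideal"],
                     categories=["Fair", "Ideal"])
color = pd.Categorical(["D", "D", "E", "E", "E"], categories=["D", "E"])
df = pd.DataFrame({"cut": cut, "color": color})

# Rows are cut, columns are color, cells are counts
ct = pd.crosstab(df["cut"], df["color"])

# Turn the column labels into plain strings, then move 'cut' out of the index
flat = ct.rename(columns=str).reset_index()

print(flat)
```

After this, `flat` is an ordinary wide table with one row per cut, so a single color column such as `'D'` can be mapped to `y` with `geom_bar(stat='identity')`.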

metric variable

In [6]:
(ggplot(diamonds, aes(x='carat',y='price',color='color')) + geom_point(size=0.1)
  + ggtitle('scatter plot'))
Out[6]:
<ggplot: (8792678876729)>
In [7]:
# plotnine does not have geom_contour, use geom_density_2d instead
# https://github.com/has2k1/plotnine/issues/110
display(ggplot(diamonds.sample(1000), aes(x='carat',y='price')) + geom_point(size=0.2) + geom_density_2d(color='red')
       + ggtitle('contour') )
<ggplot: (-9223363244192513038)>
In [8]:
display(ggplot(diamonds, aes(x='carat',y='price')) + geom_bin2d(binwidth=(.05,400))
       + ggtitle('heatmap-like') )
display(ggplot(diamonds[diamonds.carat<=.5], aes(x='carat',y='price')) + geom_bin2d(binwidth=(.01,100))
       + ggtitle('heatmap-like: carat<=0.5'))
<ggplot: (-9223363244192597465)>
<ggplot: (-9223363244192126236)>
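
For intuition about what the heatmap-like plots count, here is a pure-Python sketch of rectangular binning with `binwidth=(.05, 400)`. The helper `bin2d_counts` is hypothetical. Aligning bins to multiples of the bin width is my assumption, and plotnine may place its bin boundaries differently.

```python
import math
from collections import Counter

def bin2d_counts(points, binwidth):
    """Count points per rectangular bin (hypothetical helper).

    Bins are assumed to start at multiples of the bin width.
    """
    wx, wy = binwidth
    counts = Counter()
    for x, y in points:
        # Index of the bin that contains (x, y)
        counts[(math.floor(x / wx), math.floor(y / wy))] += 1
    return counts

# A few made-up (carat, price) pairs
points = [(0.23, 326), (0.21, 390), (0.29, 334), (0.31, 335), (0.32, 780)]
counts = bin2d_counts(points, binwidth=(0.05, 400))

for (ix, iy), n in sorted(counts.items()):
    print(f"carat bin {ix}, price bin {iy}: {n} point(s)")
```

The fill color in the plot then corresponds to the count in each bin, so a smaller `binwidth` gives a finer but noisier picture.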
In [9]:
max(diamonds.carat)//0.25
Out[9]:
20.0
In [10]:
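# R-style grouping with cut_width, kept for reference:
#(ggplot(diamonds, aes('carat', 'price')) + geom_boxplot(aes(group = 'cut_width(carat, 0.25)')))
# Build the 0.25-carat bin edges by hand instead: 0, 0.25, ..., 5.25
bins = [0.25*i for i in range(2+int(max(diamonds.carat)//0.25))]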
dd = diamonds.assign(caratbin=pd.cut(diamonds.carat, bins))
dd.caratbin.value_counts(dropna=False)
display(ggplot(dd, aes(x='caratbin',y='price')) + geom_boxplot() )
#display(ggplot(dd, aes(x='caratbin',y='price')) + geom_violin() )
/opt/conda/lib/python3.6/site-packages/plotnine/stats/stat.py:315: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  stats = pd.concat(stats, axis=0, ignore_index=True)
<ggplot: (-9223363244192305271)>
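
The bin construction above can be checked in isolation. In this sketch the maximum carat of 5.01 is an assumption, chosen to be consistent with the `20.0` shown earlier; the sample values are made up. The `2 +` is there so that the last edge lands above the maximum.

```python
import pandas as pd

max_carat = 5.01                    # assumed maximum (5.01 // 0.25 == 20.0)
n_full = int(max_carat // 0.25)     # number of full 0.25-wide steps below the max
bins = [0.25 * i for i in range(2 + n_full)]   # edges 0, 0.25, ..., 5.25

# pd.cut uses right-closed intervals by default: (0, 0.25], (0.25, 0.5], ...
sample = pd.Series([0.0, 0.2, 0.25, 0.3, 5.01])
binned = pd.cut(sample, bins)

print(len(bins), bins[-1])
print(binned.tolist())
print(binned.cat.codes.tolist())    # -1 marks a value outside every interval
```

Note that a value of exactly 0 falls outside the first interval because the intervals are open on the left; passing `include_lowest=True` to `pd.cut` would include it.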
In [11]:
# geom_hex is not implemented.
display(ggplot(diamonds, aes(x='carat',y='price',color='color')) + geom_point(size=.1)
       + facet_grid(('cut','color'))
       + geom_hline(yintercept=10000)
       + ggtitle('facet_grid')
       + theme(figure_size=(20,20)))
<ggplot: (8792662494058)>

categorical vs metric

In [12]:
my_candidates = diamonds[(diamonds.carat<1.1)&(diamonds.carat>0.9)]
display(ggplot(my_candidates, aes(x='cut',y='price',color='color')) + geom_boxplot() )
<ggplot: (-9223363244192507872)>
In [13]:
display(ggplot(my_candidates, aes(x='cut',y='price',color='color')) + geom_violin() )
<ggplot: (8792662090472)>
