4.4 reshape / reshape2
4.4.1 melt.array
Suppose a is a 3-dimensional array in R that you want to melt into a data.frame:
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))
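In Python, a is a NumPy array, so you can build one row per element with a list comprehension over np.ndenumerate(), which yields each index tuple together with its value.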
In [1]: a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)
In [2]: pd.DataFrame([tuple(list(x)+[val]) for x, val in np.ndenumerate(a)])
Out[2]:
0 1 2 3
0 0 0 0 1.0
1 0 0 1 2.0
2 0 0 2 3.0
3 0 0 3 4.0
.. .. .. .. ...
20 1 2 0 21.0
21 1 2 1 22.0
22 1 2 2 23.0
23 1 2 3 NaN
[24 rows x 4 columns]
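The comprehension above leaves the columns unnamed. As a rough sketch of one alternative, a MultiIndex built from the array's shape can carry the axis positions, and reset_index() then turns them into named columns. The names dim0, dim1, dim2 and value are illustrative choices, not anything reshape2 or pandas prescribes.

```python
import numpy as np
import pandas as pd

# Same 2 x 3 x 4 array as above; np.nan stands in for R's NA.
a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

# One index level per array axis (level names are illustrative).
index = pd.MultiIndex.from_product(
    [range(n) for n in a.shape], names=["dim0", "dim1", "dim2"]
)

# ravel() walks the array in C (row-major) order, which is the same
# order in which from_product() enumerates the index tuples.
melted = pd.Series(a.ravel(), index=index, name="value").reset_index()
print(melted.head())
```

Indices here are zero-based, whereas R's melt reports one-based positions. R also fills arrays in column-major order, so a given value does not sit at the same position as in the R example.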
4.4.2 melt.list
Suppose a is a list in R that you want to melt into a data.frame:
a <- as.list(c(1:4, NA))
data.frame(melt(a))
In Python, the equivalent is a list of (position, value) tuples, which the DataFrame() constructor converts directly into a DataFrame.
In [3]: a = list(enumerate(list(range(1, 5)) + [np.nan]))
In [4]: pd.DataFrame(a)
Out[4]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN
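A possible variation, shown only as a sketch: a Series already numbers its elements, so rename_axis() followed by reset_index() gives the same two columns with names attached. The column names position and value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

values = list(range(1, 5)) + [np.nan]

# The Series index supplies the element positions; reset_index()
# turns that index into an ordinary column.
melted = pd.Series(values, name="value").rename_axis("position").reset_index()
print(melted)
```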
For more details and examples see the Intro to Data Structures documentation.
4.4.3 melt.data.frame
Suppose cheese is a data.frame in R that you want to reshape from wide to long format:
cheese <- data.frame(
first = c('John', 'Mary'),
last = c('Doe', 'Bo'),
height = c(5.5, 6.0),
weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))
In Python, the pandas melt() function is the equivalent of R's melt:
In [5]: cheese = pd.DataFrame({'first' : ['John', 'Mary'],
...: 'last' : ['Doe', 'Bo'],
...: 'height' : [5.5, 6.0],
...: 'weight' : [130, 150]})
...:
In [6]: pd.melt(cheese, id_vars=['first', 'last'])
Out[6]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [7]: cheese.set_index(['first', 'last']).stack() # alternative way
Out[7]:
first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64
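The default column names variable and value can be overridden, and the long table can be turned back into the wide one. The sketch below assumes the names measure and reading purely for illustration, and uses set_index() with unstack() for the return trip, mirroring the stack() alternative above.

```python
import pandas as pd

cheese = pd.DataFrame({
    "first": ["John", "Mary"],
    "last": ["Doe", "Bo"],
    "height": [5.5, 6.0],
    "weight": [130, 150],
})

# var_name / value_name replace the default "variable" / "value" headers.
long = cheese.melt(
    id_vars=["first", "last"], var_name="measure", value_name="reading"
)

# Reverse the melt: index by the id columns plus the measure name,
# then unstack the measure level back into columns.
wide = (
    long.set_index(["first", "last", "measure"])["reading"]
    .unstack("measure")
    .reset_index()
)
print(wide)
```

Because height and weight share a single reading column in the long form, weight comes back as floating point rather than integer.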
For more details and examples see the reshaping documentation.
4.4.4 cast
In R, acast casts a molten data.frame (here built from a data.frame called df) into a higher-dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)
In Python the best way is to make use of pivot_table():
In [8]: df = pd.DataFrame({
...: 'x': np.random.uniform(1., 168., 12),
...: 'y': np.random.uniform(7., 334., 12),
...: 'z': np.random.uniform(1.7, 20.7, 12),
...: 'month': [5,6,7]*4,
...: 'week': [1,2]*6
...: })
...:
In [9]: mdf = pd.melt(df, id_vars=['month', 'week'])
In [10]: pd.pivot_table(mdf, values='value', index=['variable','week'],
....: columns=['month'], aggfunc='mean')
....:
Out[10]:
month 5 6 7
variable week
x 1 93.888747 98.762034 55.219673
2 94.391427 38.112932 83.942781
y 1 94.306912 279.454811 227.840449
2 87.392662 193.028166 173.899260
z 1 11.016009 10.079307 16.170549
2 8.476111 17.638509 19.003494
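pivot_table() returns a two-dimensional table with week nested under variable, whereas acast returns a true three-dimensional array. If an actual array is needed, one option is to aggregate with groupby() and reshape the result with NumPy. This is only a sketch: the seed, the variable names and the reindex step are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame({
    "x": rng.uniform(1.0, 168.0, 12),
    "y": rng.uniform(7.0, 334.0, 12),
    "z": rng.uniform(1.7, 20.7, 12),
    "month": [5, 6, 7] * 4,
    "week": [1, 2] * 6,
})
mdf = pd.melt(df, id_vars=["month", "week"])

# Mean per (week, month, variable) cell, as a Series with a 3-level index.
means = mdf.groupby(["week", "month", "variable"])["value"].mean()

# Reindex onto the full grid so that any missing combination becomes NaN
# instead of silently shifting the reshape.
levels = means.index.levels
full = pd.MultiIndex.from_product(levels, names=means.index.names)
cube = means.reindex(full).to_numpy().reshape([len(level) for level in levels])

print(cube.shape)  # axes are week, month, variable
```

The axis order follows the order of the groupby keys, matching the week ~ month ~ variable formula in the R call.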
Similarly, dcast can aggregate a data.frame called df in R by Animal and FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)
dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))
Python can approach this in two different ways. The first, similar to the above, uses pivot_table():
In [11]: df = pd.DataFrame({
....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
....: 'Animal2', 'Animal3'],
....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
....: })
....:
In [12]: df.pivot_table(values='Amount', index='Animal', columns='FeedType', aggfunc='sum')
Out[12]:
FeedType A B
Animal
Animal1 10.0 5.0
Animal2 2.0 13.0
Animal3 6.0 NaN
The second approach is to use the groupby() method:
In [13]: df.groupby(['Animal','FeedType'])['Amount'].sum()
Out[13]:
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
Name: Amount, dtype: int64
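The groupby() result is a long Series rather than the wide Animal by FeedType table that dcast produces. As a brief sketch, unstack() moves the FeedType level into the columns and recovers the same layout as the pivot_table() call. Filling the missing cell with 0 is an illustrative choice, not part of the R example, which fills with NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Animal1", "Animal2", "Animal3", "Animal2", "Animal1",
               "Animal2", "Animal3"],
    "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
    "Amount": [10, 7, 4, 2, 5, 6, 2],
})

totals = df.groupby(["Animal", "FeedType"])["Amount"].sum()

# unstack() pivots the innermost index level (FeedType) into columns;
# the missing Animal3/B combination becomes NaN.
wide = totals.unstack()

# Optionally supply a fill value so the column stays integer-typed.
wide_filled = totals.unstack(fill_value=0)
print(wide)
print(wide_filled)
```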
For more details and examples see the reshaping documentation or the groupby documentation.
4.4.5 factor
New in version 0.15.
pandas has a dedicated data type for categorical data, which plays the role of R's factor. The corresponding R calls are:
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
In pandas this is accomplished with pd.cut and astype("category"):
In [14]: pd.cut(pd.Series([1,2,3,4,5,6]), 3)
Out[14]:
0 (0.995, 2.667]
1 (0.995, 2.667]
2 (2.667, 4.333]
3 (2.667, 4.333]
4 (4.333, 6]
5 (4.333, 6]
dtype: category
Categories (3, object): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6]]
In [15]: pd.Series([1,2,3,2,2,3]).astype("category")
Out[15]:
0 1
1 2
2 3
3 2
4 2
5 3
dtype: category
Categories (3, int64): [1, 2, 3]
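R's factor() also accepts levels, labels and an ordered flag. A rough pandas counterpart, sketched here under the assumption that the labels low, mid and high are wanted, combines CategoricalDtype with rename_categories(); pd.cut takes a labels argument directly.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 2, 2, 3])

# Fix the set of categories and their order up front, similar in spirit
# to factor(..., levels=c(1, 2, 3), ordered=TRUE) in R.
ordered_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
f = s.astype(ordered_type)

# Relabel the categories (the label strings are illustrative).
labelled = f.cat.rename_categories({1: "low", 2: "mid", 3: "high"})

# Integer codes behind the categories; these are zero-based,
# whereas R's factor codes are one-based.
codes = labelled.cat.codes

# pd.cut can attach labels to the bins in one step.
binned = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3, labels=["low", "mid", "high"])
print(labelled)
print(binned)
```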
For more details and examples see the categorical introduction and the API documentation. There is also documentation on the differences from R's factor.