1 Series
Warning
In 0.13.0, Series has internally been refactored to no longer subclass ndarray but instead subclass NDFrame, similarly to the rest of the pandas containers. This should be a transparent change with only very limited API implications (see Internal Refactoring).
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
>>> s = pd.Series(data, index=index)
Here, data can be many different things:
- a Python dict
- an ndarray
- a scalar value (like 5)
The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:
From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].
In [1]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [2]: s
Out[2]:
a 0.4047
b 0.5770
c -1.7150
d -1.0393
e -0.3706
dtype: float64
In [3]: s.index
Out[3]: Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
In [4]: pd.Series(np.random.randn(5))
Out[4]:
0 -1.1579
1 -1.3443
2 0.8449
3 1.0758
4 -0.1090
dtype: float64
Note
Starting in v0.8.0, pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).
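As a rough illustration of this lazy behaviour, the sketch below builds a Series with a repeated label and only hits an error when an operation that needs unique labels is attempted. The variable names (dup, matches, reindex_failed) are made up for this example, and reindex is used here as one operation that rejects duplicate labels; the exact error message varies between pandas versions.

```python
import pandas as pd

# A Series whose index contains the label 'a' twice.
dup = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])

# Construction succeeds; the index simply reports that it is not unique.
index_is_unique = dup.index.is_unique

# Label lookup on a duplicated label returns every matching row.
matches = dup['a']

# reindex requires unique labels, so the error surfaces only at this point.
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError:
    reindex_failed = True
```

Checking index.is_unique up front is one way to find out early whether later operations might fail.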
From dict
If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the sorted keys of the dict, if possible.
In [5]: d = {'a' : 0., 'b' : 1., 'c' : 2.}
In [6]: pd.Series(d)
Out[6]:
a 0.0
b 1.0
c 2.0
dtype: float64
In [7]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[7]:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Note
NaN (not a number) is the standard missing data marker used in pandas.
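Because NaN does not compare equal to itself, missing entries are usually located with a dedicated helper rather than with ==. A minimal sketch using pd.isnull follows; the names s_missing, mask and nan_equals_itself are chosen only for this example.

```python
import pandas as pd

d = {'a': 0.0, 'b': 1.0, 'c': 2.0}

# 'd' is not a key of the dict, so its value is filled with NaN.
s_missing = pd.Series(d, index=['b', 'c', 'd', 'a'])

# Boolean mask that is True wherever a value is missing.
mask = pd.isnull(s_missing)

# Comparing NaN with itself yields False, which is why == cannot detect it.
nan_equals_itself = bool(s_missing['d'] == s_missing['d'])
```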
From scalar value
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
In [8]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[8]:
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
1.1 Series is ndarray-like
Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions. However, things like slicing also slice the index.
In [9]: s[0]
Out[9]: 0.40470521868023651
In [10]: s[:3]
Out[10]:
a 0.4047
b 0.5770
c -1.7150
dtype: float64
In [11]: s[s > s.median()]
Out[11]:
a 0.4047
b 0.5770
dtype: float64
In [12]: s[[4, 3, 1]]
Out[12]:
e -0.3706
d -1.0393
b 0.5770
dtype: float64
In [13]: np.exp(s)
Out[13]:
a 1.4989
b 1.7808
c 0.1800
d 0.3537
e 0.6903
dtype: float64
We will address array-based indexing in a separate section.
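The integer lookups in the session above, such as s[0] and s[[4, 3, 1]], rely on [] falling back to positions when the index holds string labels. More recent pandas releases deprecate that fallback, so, assuming a version that provides the .iloc indexer, the same selections can be sketched with explicit positional access. The Series s_pos and the other names below are illustrative only.

```python
import numpy as np
import pandas as pd

s_pos = pd.Series(np.arange(5.0), index=['a', 'b', 'c', 'd', 'e'])

first = s_pos.iloc[0]            # value at position 0 (label 'a')
head = s_pos.iloc[:3]            # slicing keeps the matching labels
picked = s_pos.iloc[[4, 3, 1]]   # positions 4, 3, 1 -> labels e, d, b

# Boolean masks are still applied with plain [].
above_median = s_pos[s_pos > s_pos.median()]
```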
1.2 Series is dict-like
A Series is like a fixed-size dict in that you can get and set values by index label:
In [14]: s['a']
Out[14]: 0.40470521868023651
In [15]: s['e'] = 12.
In [16]: s
Out[16]:
a 0.4047
b 0.5770
c -1.7150
d -1.0393
e 12.0000
dtype: float64
In [17]: 'e' in s
Out[17]: True
In [18]: 'f' in s
Out[18]: False
If a label is not contained, an exception is raised:
>>> s['f']
KeyError: 'f'
Using the get method, a missing label will return None or the specified default:
In [19]: s.get('f')
In [20]: s.get('f', np.nan)
Out[20]: nan
See also the section on attribute access.
1.3 Vectorized operations and label alignment with Series
When doing data analysis, as with raw NumPy arrays, looping through Series value-by-value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray.
In [21]: s + s
Out[21]:
a 0.8094
b 1.1541
c -3.4300
d -2.0785
e 24.0000
dtype: float64
In [22]: s * 2
Out[22]:
a 0.8094
b 1.1541
c -3.4300
d -2.0785
e 24.0000
dtype: float64
In [23]: np.exp(s)
Out[23]:
a 1.4989
b 1.7808
c 0.1800
d 0.3537
e 162754.7914
dtype: float64
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.
In [24]: s[1:] + s[:-1]
Out[24]:
a NaN
b 1.1541
c -3.4300
d -2.0785
e NaN
dtype: float64
The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
Note
In general, we chose to make the default result of operations between differently indexed objects yield the union of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the dropna function.
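To make the union behaviour concrete, here is a small sketch with two Series that only partly share labels. The names left, right, total, overlap and filled are invented for the example; the use of Series.add with fill_value is shown as one alternative to dropping labels, not as the recommended approach.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Values are matched by label, not by position; the result is indexed
# by the union of both indexes, with NaN where one side has no value.
total = left + right

# Keep only the labels present on both sides.
overlap = total.dropna()

# Alternatively, treat a value missing on one side as 0.
filled = left.add(right, fill_value=0)
```

Here total has NaN at 'a' and 'd', overlap keeps only 'b' and 'c', and filled carries a value for all four labels.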
1.4 Name attribute
Series can also have a name attribute:
In [25]: s = pd.Series(np.random.randn(5), name='something')
In [26]: s
Out[26]:
0 1.6436
1 -1.4694
2 0.3570
3 -0.6746
4 -1.7769
Name: something, dtype: float64
In [27]: s.name
Out[27]: 'something'
The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as you will see below.
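As a brief sketch of the automatic naming, selecting a single column or a single row from a DataFrame yields a Series whose name is the corresponding label. The DataFrame df and its column labels are hypothetical and exist only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

col = df['x']         # a single column comes back as a Series
col_name = col.name   # its name is the column label

row = df.iloc[0]      # a single row is also a Series
row_name = row.name   # its name is the row's index label
```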
New in version 0.18.0.
You can rename a Series with the pandas.Series.rename() method.
In [28]: s2 = s.rename("different")
In [29]: s2.name
Out[29]: 'different'
Note that s and s2 refer to different objects.
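One way to check this, sketched under the assumption that rename is called with a scalar name as above, is to compare object identity and confirm that the original name is untouched:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(3.0), name='something')
s2 = s.rename('different')

# rename returned a new Series object; the original keeps its name.
same_object = s2 is s
original_name = s.name
new_name = s2.name
```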