Let's take a pass through all of the built-in classes, functions, submodules, and other members of Python's pandas library, and then pick out a few key ones to study. The version I have installed is 1.3.5:
>>> import pandas as pd
>>> pd.__version__
'1.3.5'
>>> print(pd.__doc__)

pandas - a powerful data analysis and manipulation library for Python
=====================================================================

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, date shifting and lagging.
The pandas library contains several thousand classes, functions, submodules, and other members. As the saying goes, "though the river holds three thousand leagues of water, one ladleful is enough to drink." I will first list them all, then pick a few important ones to study.
The 119 top-level names in pandas (classes, functions, submodules, and more):
>>> import pandas as pd
>>> funcs = [_ for _ in dir(pd) if not _.startswith('_')]
>>> len(funcs)
119
>>> for i, f in enumerate(funcs, 1):
...     print(f'{f:18}', end='' if i % 5 else '\n')
...
BooleanDtype       Categorical        CategoricalDtype   CategoricalIndex   DataFrame
DateOffset         DatetimeIndex      DatetimeTZDtype    ExcelFile          ExcelWriter
Flags              Float32Dtype       Float64Dtype       Float64Index       Grouper
HDFStore           Index              IndexSlice         Int16Dtype         Int32Dtype
Int64Dtype         Int64Index         Int8Dtype          Interval           IntervalDtype
IntervalIndex      MultiIndex         NA                 NaT                NamedAgg
Period             PeriodDtype        PeriodIndex        RangeIndex         Series
SparseDtype        StringDtype        Timedelta          TimedeltaIndex     Timestamp
UInt16Dtype        UInt32Dtype        UInt64Dtype        UInt64Index        UInt8Dtype
api                array              arrays             bdate_range        compat
concat             core               crosstab           cut                date_range
describe_option    errors             eval               factorize          get_dummies
get_option         infer_freq         interval_range     io                 isna
isnull             json_normalize     lreshape           melt               merge
merge_asof         merge_ordered      notna              notnull            offsets
option_context     options            pandas             period_range       pivot
pivot_table        plotting           qcut               read_clipboard     read_csv
read_excel         read_feather       read_fwf           read_gbq           read_hdf
read_html          read_json          read_orc           read_parquet       read_pickle
read_sas           read_spss          read_sql           read_sql_query     read_sql_table
read_stata         read_table         read_xml           reset_option       set_eng_float_format
set_option         show_versions      test               testing            timedelta_range
to_datetime        to_numeric         to_pickle          to_timedelta       tseries
unique             util               value_counts       wide_to_long
First, let's sort them by kind: classes, functions, submodules, and everything else:
import pandas as pd

funcs = [name for name in dir(pd) if not name.startswith('_')]

# Reference types: a class is an instance of `type`, a plain function has the
# same type as pd.array, and a submodule has the same type as pd itself.
types = type(pd.DataFrame), type(pd.array), type(pd)
Names = 'Type', 'Function', 'Module', 'Other'

Types = {}
for f in funcs:
    t = type(getattr(pd, f))        # getattr(pd, f) avoids eval("pd." + f)
    label = Names[types.index(t) if t in types else -1]
    Types[label] = Types.get(label, []) + [f]

for n in Names:
    print(f"\n\n【{n}】:{len(Types[n])}")
    for i, f in enumerate(Types[n], 1):
        print(f'{f:18} ', end='' if i % 5 or i == len(Types[n]) else '\n')
The grouped results:
【Type】:40
BooleanDtype       CategoricalDtype   CategoricalIndex   DataFrame          DatetimeIndex
DatetimeTZDtype    ExcelFile          Flags              Float32Dtype       Float64Dtype
Float64Index       Grouper            HDFStore           Index              Int16Dtype
Int32Dtype         Int64Dtype         Int64Index         Int8Dtype          Interval
IntervalDtype      IntervalIndex      MultiIndex         NamedAgg           Period
PeriodDtype        PeriodIndex        RangeIndex         Series             SparseDtype
StringDtype        Timedelta          TimedeltaIndex     Timestamp          UInt16Dtype
UInt32Dtype        UInt64Dtype        UInt64Index        UInt8Dtype         option_context

【Function】:56
array              bdate_range        concat             crosstab           cut
date_range         eval               factorize          get_dummies        infer_freq
interval_range     isna               isnull             json_normalize     lreshape
melt               merge              merge_asof         merge_ordered      notna
notnull            period_range      pivot              pivot_table        qcut
read_clipboard     read_csv           read_excel         read_feather       read_fwf
read_gbq           read_hdf           read_html          read_json          read_orc
read_parquet       read_pickle        read_sas           read_spss          read_sql
read_sql_query     read_sql_table     read_stata         read_table         read_xml
set_eng_float_format show_versions    test               timedelta_range    to_datetime
to_numeric         to_pickle          to_timedelta       unique             value_counts
wide_to_long

【Module】:12
api                arrays             compat             core               errors
io                 offsets            pandas             plotting           testing
tseries            util

【Other】:11
Categorical        DateOffset         ExcelWriter        IndexSlice         NA
NaT                describe_option    get_option         options            reset_option
set_option
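As a sanity check on this type()-based bucketing, the standard library's inspect module can run a similar census. This is my own sketch, not part of the enumeration above, and its counts need not match exactly: inspect.isclass also catches classes built with custom metaclasses (Categorical, DateOffset, ExcelWriter), which landed in 【Other】 above.

import inspect
import pandas as pd

counts = {'class': 0, 'function': 0, 'module': 0, 'other': 0}
for name in dir(pd):
    if name.startswith('_'):
        continue                      # skip private names, as before
    obj = getattr(pd, name)
    if inspect.isclass(obj):
        counts['class'] += 1
    elif inspect.isfunction(obj):
        counts['function'] += 1
    elif inspect.ismodule(obj):
        counts['module'] += 1
    else:
        counts['other'] += 1          # instances such as NA, NaT, options
print(counts)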
Here is the original help text for the 56 functions. At some 252 KB it is far too long for a single post, so it will run as a serial:
A~G: Function01~09
>>> Types['Function'][:9]
['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'eval', 'factorize', 'get_dummies']
Function01
array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'
Help on function array in module pandas.core.construction:
array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'
Create an array.
Parameters
----------
data : Sequence of objects
    The scalars inside `data` should be instances of the scalar type for `dtype`. It's expected that `data` represents a 1-dimensional array of data.

    When `data` is an Index or Series, the underlying array will be extracted from `data`.

dtype : str, np.dtype, or ExtensionDtype, optional
    The dtype to use for the array. This may be a NumPy dtype or an extension type registered with pandas using :meth:`pandas.api.extensions.register_extension_dtype`.

    If not specified, there are two possibilities:

    1. When `data` is a :class:`Series`, :class:`Index`, or :class:`ExtensionArray`, the `dtype` will be taken from the data.
    2. Otherwise, pandas will attempt to infer the `dtype` from the data.

    Note that when `data` is a NumPy array, ``data.dtype`` is *not* used for inferring the array type. This is because NumPy cannot represent all the types of data that can be held in extension arrays.

    Currently, pandas will infer an extension dtype for sequences of

    ============================== =======================================
    Scalar Type                    Array Type
    ============================== =======================================
    :class:`pandas.Interval`       :class:`pandas.arrays.IntervalArray`
    :class:`pandas.Period`         :class:`pandas.arrays.PeriodArray`
    :class:`datetime.datetime`     :class:`pandas.arrays.DatetimeArray`
    :class:`datetime.timedelta`    :class:`pandas.arrays.TimedeltaArray`
    :class:`int`                   :class:`pandas.arrays.IntegerArray`
    :class:`float`                 :class:`pandas.arrays.FloatingArray`
    :class:`str`                   :class:`pandas.arrays.StringArray` or
                                   :class:`pandas.arrays.ArrowStringArray`
    :class:`bool`                  :class:`pandas.arrays.BooleanArray`
    ============================== =======================================

    The ExtensionArray created when the scalar type is :class:`str` is determined by ``pd.options.mode.string_storage`` if the dtype is not explicitly given.

    For all other cases, NumPy's usual inference rules will be used.

    .. versionchanged:: 1.0.0

       Pandas infers nullable-integer dtype for integer data, string dtype for string data, and nullable-boolean dtype for boolean data.

    .. versionchanged:: 1.2.0

       Pandas now also infers nullable-floating dtype for float-like input data

copy : bool, default True
    Whether to copy the data, even if not necessary. Depending on the type of `data`, creating the new array may require copying data, even if ``copy=False``.

Returns
-------
ExtensionArray
    The newly created array.

Raises
------
ValueError
    When `data` is not 1-dimensional.

See Also
--------
numpy.array : Construct a NumPy array.
Series : Construct a pandas Series.
Index : Construct a pandas Index.
arrays.PandasArray : ExtensionArray wrapping a NumPy array.
Series.array : Extract the array stored within a Series.

Notes
-----
Omitting the `dtype` argument means pandas will attempt to infer the best array type from the values in the data. As new array types are added by pandas and 3rd party libraries, the "best" array type may change. We recommend specifying `dtype` to ensure that

1. the correct array type for the data is returned
2. the returned array type doesn't change as new extension types are added by pandas and third-party libraries

Additionally, if the underlying memory representation of the returned array matters, we recommend specifying the `dtype` as a concrete object rather than a string alias or allowing it to be inferred. For example, a future version of pandas or a 3rd-party library may include a dedicated ExtensionArray for string data. In this event, the following would no longer return a :class:`arrays.PandasArray` backed by a NumPy array.

>>> pd.array(['a', 'b'], dtype=str)
<PandasArray>
['a', 'b']
Length: 2, dtype: str32

This would instead return the new ExtensionArray dedicated for string data. If you really need the new array to be backed by a NumPy array, specify that in the dtype.

>>> pd.array(['a', 'b'], dtype=np.dtype("<U1"))
<PandasArray>
['a', 'b']
Length: 2, dtype: str32

Finally, Pandas has arrays that mostly overlap with NumPy

  * :class:`arrays.DatetimeArray`
  * :class:`arrays.TimedeltaArray`

When data with a ``datetime64[ns]`` or ``timedelta64[ns]`` dtype is passed, pandas will always return a ``DatetimeArray`` or ``TimedeltaArray`` rather than a ``PandasArray``. This is for symmetry with the case of timezone-aware data, which NumPy does not natively support.

>>> pd.array(['2015', '2016'], dtype='datetime64[ns]')
<DatetimeArray>
['2015-01-01 00:00:00', '2016-01-01 00:00:00']
Length: 2, dtype: datetime64[ns]

>>> pd.array(["1H", "2H"], dtype='timedelta64[ns]')
<TimedeltaArray>
['0 days 01:00:00', '0 days 02:00:00']
Length: 2, dtype: timedelta64[ns]

Examples
--------
If a dtype is not specified, pandas will infer the best dtype from the values. See the description of `dtype` for the types pandas infers for.

>>> pd.array([1, 2])
<IntegerArray>
[1, 2]
Length: 2, dtype: Int64

>>> pd.array([1, 2, np.nan])
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

>>> pd.array([1.1, 2.2])
<FloatingArray>
[1.1, 2.2]
Length: 2, dtype: Float64

>>> pd.array(["a", None, "c"])
<StringArray>
['a', <NA>, 'c']
Length: 3, dtype: string

>>> with pd.option_context("string_storage", "pyarrow"):
...     arr = pd.array(["a", None, "c"])
...
>>> arr
<ArrowStringArray>
['a', <NA>, 'c']
Length: 3, dtype: string

>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
<PeriodArray>
['2000-01-01', '2000-01-01']
Length: 2, dtype: period[D]

You can use the string alias for `dtype`

>>> pd.array(['a', 'b', 'a'], dtype='category')
['a', 'b', 'a']
Categories (2, object): ['a', 'b']

Or specify the actual dtype

>>> pd.array(['a', 'b', 'a'],
...          dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
['a', 'b', 'a']
Categories (3, object): ['a' < 'b' < 'c']

If pandas does not infer a dedicated extension type a :class:`arrays.PandasArray` is returned.

>>> pd.array([1 + 1j, 3 + 2j])
<PandasArray>
[(1+1j), (3+2j)]
Length: 2, dtype: complex128

As mentioned in the "Notes" section, new extension types may be added in the future (by pandas or 3rd party libraries), causing the return value to no longer be a :class:`arrays.PandasArray`. Specify the `dtype` as a NumPy dtype if you need to ensure there's no future change in behavior.

>>> pd.array([1, 2], dtype=np.dtype("int32"))
<PandasArray>
[1, 2]
Length: 2, dtype: int32

`data` must be 1-dimensional. A ValueError is raised when the input has the wrong dimensionality.

>>> pd.array(1)
Traceback (most recent call last):
    ...
ValueError: Cannot pass scalar '1' to 'pandas.array'.
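Before moving on, a minimal hands-on sketch of the inference rules above (my own example against pandas 1.3.5, not from the help text):

import numpy as np
import pandas as pd

# A missing value pushes integer data to the nullable Int64 extension dtype.
arr = pd.array([1, 2, np.nan])
print(arr.dtype)    # Int64

# Pinning a concrete NumPy dtype keeps the backing array type stable.
arr32 = pd.array([1, 2], dtype=np.dtype("int32"))
print(arr32.dtype)  # int32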
Function02
bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'
Help on function bdate_range in module pandas.core.indexes.datetimes:

bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'

Return a fixed frequency DatetimeIndex, with business day as the default frequency.

Parameters
----------
start : str or datetime-like, default None
    Left bound for generating dates.
end : str or datetime-like, default None
    Right bound for generating dates.
periods : int, default None
    Number of periods to generate.
freq : str or DateOffset, default 'B' (business daily)
    Frequency strings can have multiples, e.g. '5H'.
tz : str or None
    Time zone name for returning localized DatetimeIndex, for example Asia/Beijing.
normalize : bool, default False
    Normalize start/end dates to midnight before generating date range.
name : str, default None
    Name of the resulting DatetimeIndex.
weekmask : str or None, default None
    Weekmask of valid business days, passed to ``numpy.busdaycalendar``, only used when custom frequency strings are passed. The default value None is equivalent to 'Mon Tue Wed Thu Fri'.
holidays : list-like or None, default None
    Dates to exclude from the set of valid business days, passed to ``numpy.busdaycalendar``, only used when custom frequency strings are passed.
closed : str, default None
    Make the interval closed with respect to the given frequency to the 'left', 'right', or both sides (None).
**kwargs
    For compatibility. Has no effect on the result.

Returns
-------
DatetimeIndex

Notes
-----
Of the four parameters: ``start``, ``end``, ``periods``, and ``freq``, exactly three must be specified. Specifying ``freq`` is a requirement for ``bdate_range``. Use ``date_range`` if specifying ``freq`` is not desired.

To learn more about the frequency strings, please see `this link <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.

Examples
--------
Note how the two weekend days are skipped in the result.

>>> pd.bdate_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='B')
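A quick sketch of the two typical calling patterns (my own example; the dates are arbitrary): the default 'B' frequency, and a custom calendar, where weekmask/holidays only take effect with the custom frequency string 'C'.

import pandas as pd

# Default business-day frequency: the weekend of Jan 6-7, 2018 is skipped.
print(pd.bdate_range(start='2018-01-01', end='2018-01-08'))

# weekmask/holidays apply only with a custom business-day frequency ('C').
print(pd.bdate_range(start='2018-01-01', periods=4, freq='C',
                     weekmask='Mon Wed Fri', holidays=['2018-01-03']))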
Function03
concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'
Help on function concat in module pandas.core.reshape.concat:

concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'

Concatenate pandas objects along a particular axis with optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Parameters
----------
objs : a sequence or mapping of Series or DataFrame objects
    If a mapping is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
axis : {0/'index', 1/'columns'}, default 0
    The axis to concatenate along.
join : {'inner', 'outer'}, default 'outer'
    How to handle indexes on other axis (or axes).
ignore_index : bool, default False
    If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
keys : sequence, default None
    If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
levels : list of sequences, default None
    Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
names : list, default None
    Names for the levels in the resulting hierarchical index.
verify_integrity : bool, default False
    Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
sort : bool, default False
    Sort non-concatenation axis if it is not already aligned when `join` is 'outer'. This has no effect when ``join='inner'``, which already preserves the order of the non-concatenation axis.

    .. versionchanged:: 1.0.0

       Changed to not sort by default.

copy : bool, default True
    If False, do not copy data unnecessarily.

Returns
-------
object, type of objs
    When concatenating all ``Series`` along the index (axis=0), a ``Series`` is returned. When ``objs`` contains at least one ``DataFrame``, a ``DataFrame`` is returned. When concatenating along the columns (axis=1), a ``DataFrame`` is returned.

See Also
--------
Series.append : Concatenate Series.
DataFrame.append : Concatenate DataFrames.
DataFrame.join : Join DataFrames using indexes.
DataFrame.merge : Merge DataFrames by indexes or columns.

Notes
-----
The keys, levels, and names arguments are all optional.

A walkthrough of how this method fits in with other tools for combining pandas objects can be found `here <https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html>`__.

Examples
--------
Combine two ``Series``.

>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: object

Clear the existing index and reset it in the result by setting the ``ignore_index`` option to ``True``.

>>> pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object

Add a hierarchical index at the outermost level of the data with the ``keys`` option.

>>> pd.concat([s1, s2], keys=['s1', 's2'])
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object

Label the index keys you create with the ``names`` option.

>>> pd.concat([s1, s2], keys=['s1', 's2'],
...           names=['Series name', 'Row ID'])
Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object

Combine two ``DataFrame`` objects with identical columns.

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])
>>> df2
  letter  number
0      c       3
1      d       4
>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine ``DataFrame`` objects with overlapping columns and return everything. Columns outside the intersection will be filled with ``NaN`` values.

>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
>>> df3
  letter  number animal
0      c       3    cat
1      d       4    dog
>>> pd.concat([df1, df3], sort=False)
  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      c       3    cat
1      d       4    dog

Combine ``DataFrame`` objects with overlapping columns and return only those that are shared by passing ``inner`` to the ``join`` keyword argument.

>>> pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine ``DataFrame`` objects horizontally along the x axis by passing in ``axis=1``.

>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
...                    columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george

Prevent the result from including duplicate index values with the ``verify_integrity`` option.

>>> df5 = pd.DataFrame([1], index=['a'])
>>> df5
   0
a  1
>>> df6 = pd.DataFrame([2], index=['a'])
>>> df6
   0
a  2
>>> pd.concat([df5, df6], verify_integrity=True)
Traceback (most recent call last):
    ...
ValueError: Indexes have overlapping values: ['a']
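A compact sketch of the two axes (my own example): stacking rows with an outer key level, and aligning frames side by side.

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

# Stack vertically; `keys` tags each block on an outer index level.
print(pd.concat([df1, df2], keys=['first', 'second']))

# Place side by side along the columns; rows align on the index.
print(pd.concat([df1, df2], axis=1))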
Function04
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'
Help on function crosstab in module pandas.core.reshape.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'

Compute a simple cross tabulation of two (or more) factors.

By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.

Parameters
----------
index : array-like, Series, or list of arrays/Series
    Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series
    Values to group by in the columns.
values : array-like, optional
    Array of values to aggregate according to the factors. Requires `aggfunc` be specified.
rownames : sequence, default None
    If passed, must match number of row arrays passed.
colnames : sequence, default None
    If passed, must match number of column arrays passed.
aggfunc : function, optional
    If specified, requires `values` be specified as well.
margins : bool, default False
    Add row/column margins (subtotals).
margins_name : str, default 'All'
    Name of the row/column that will contain the totals when margins is True.
dropna : bool, default True
    Do not include columns whose entries are all NaN.
normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
    Normalize by dividing all values by the sum of values.

    - If passed 'all' or `True`, will normalize over all values.
    - If passed 'index' will normalize over each row.
    - If passed 'columns' will normalize over each column.
    - If margins is `True`, will also normalize margin values.

Returns
-------
DataFrame
    Cross tabulation of the data.

See Also
--------
DataFrame.pivot : Reshape data based on column values.
pivot_table : Create a pivot table as a DataFrame.

Notes
-----
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified.

Any input passed containing Categorical data will have **all** of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

In the event that there aren't overlapping indexes an empty DataFrame will be returned.

Examples
--------
>>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> b = np.array(["one", "one", "one", "two", "one", "one",
...               "one", "two", "two", "two", "one"], dtype=object)
>>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
...               "shiny", "dull", "shiny", "shiny", "shiny"],
...              dtype=object)
>>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b   one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2

Here 'c' and 'f' are not represented in the data and will not be shown in the output because dropna is True by default. Set dropna=False to preserve categories with no data.

>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0
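A small sketch with made-up survey data (my own example): raw counts with margins, then row-normalized proportions.

import pandas as pd

df = pd.DataFrame({'sex': ['M', 'F', 'M', 'F', 'M'],
                   'handed': ['R', 'R', 'L', 'R', 'R']})

# Frequency table of the two factors, plus row/column subtotals.
print(pd.crosstab(df['sex'], df['handed'], margins=True))

# normalize='index' turns each row into proportions that sum to 1.
print(pd.crosstab(df['sex'], df['handed'], normalize='index'))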
Function05
cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)
Help on function cut in module pandas.core.reshape.tile:

cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)

Bin values into discrete intervals.

Use `cut` when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, `cut` could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

Parameters
----------
x : array-like
    The input array to be binned. Must be 1-dimensional.
bins : int, sequence of scalars, or IntervalIndex
    The criteria to bin by.

    * int : Defines the number of equal-width bins in the range of `x`. The range of `x` is extended by .1% on each side to include the minimum and maximum values of `x`.
    * sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of `x` is done.
    * IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for `bins` must be non-overlapping.

right : bool, default True
    Indicates whether `bins` includes the rightmost edge or not. If ``right == True`` (the default), then the `bins` ``[1, 2, 3, 4]`` indicate (1,2], (2,3], (3,4]. This argument is ignored when `bins` is an IntervalIndex.
labels : array or False, default None
    Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when `bins` is an IntervalIndex. If True, raises an error. When `ordered=False`, labels must be provided.
retbins : bool, default False
    Whether to return the bins or not. Useful when bins is provided as a scalar.
precision : int, default 3
    The precision at which to store and display the bins labels.
include_lowest : bool, default False
    Whether the first interval should be left-inclusive or not.
duplicates : {default 'raise', 'drop'}, optional
    If bin edges are not unique, raise ValueError or drop non-uniques.
ordered : bool, default True
    Whether the labels are ordered or not. Applies to returned types Categorical and Series (with Categorical dtype). If True, the resulting categorical will be ordered. If False, the resulting categorical will be unordered (labels must be provided).

    .. versionadded:: 1.1.0

Returns
-------
out : Categorical, Series, or ndarray
    An array-like object representing the respective bin for each value of `x`. The type depends on the value of `labels`.

    * True (default) : returns a Series for Series `x` or a Categorical for all other inputs. The values stored within are Interval dtype.
    * sequence of scalars : returns a Series for Series `x` or a Categorical for all other inputs. The values stored within are whatever the type in the sequence is.
    * False : returns an ndarray of integers.

bins : numpy.ndarray or IntervalIndex.
    The computed or specified bins. Only returned when `retbins=True`. For scalar or sequence `bins`, this is an ndarray with the computed bins. If set `duplicates=drop`, `bins` will drop non-unique bin. For an IntervalIndex `bins`, this is equal to `bins`.

See Also
--------
qcut : Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
Categorical : Array type for storing data that come from a fixed set of values.
Series : One-dimensional array with axis labels (including time series).
IntervalIndex : Immutable Index implementing an ordered, sliceable set.

Notes
-----
Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Series or Categorical object.

Examples
--------
Discretize into three equal-sized bins.

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
... # doctest: +ELLIPSIS
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
... # doctest: +ELLIPSIS
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3.   , 5.   , 7.   ]))

Discovers the same bins, but assign them specific labels. Notice that the returned Categorical's categories are `labels` and is ordered.

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
...        3, labels=["bad", "medium", "good"])
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']

``ordered=False`` will result in unordered categories when labels are passed. This parameter can be used to allow non-unique labels:

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
...        labels=["B", "A", "B"], ordered=False)
['B', 'B', 'A', 'A', 'B', 'B']
Categories (2, object): ['A', 'B']

``labels=False`` implies you just want the bins back.

>>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
array([0, 1, 1, 3])

Passing a Series as an input returns a Series with categorical dtype:

>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
...               index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
... # doctest: +ELLIPSIS
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...

Passing a Series as an input returns a Series with mapping value. It is used to map numerically to intervals based on bins.

>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
...               index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False)
... # doctest: +ELLIPSIS
(a    1.0
 b    2.0
 c    3.0
 d    4.0
 e    NaN
 dtype: float64,
 array([ 0,  2,  4,  6,  8, 10]))

Use `drop` optional when bins is not unique

>>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True,
...        right=False, duplicates='drop')
... # doctest: +ELLIPSIS
(a    1.0
 b    2.0
 c    3.0
 d    3.0
 e    NaN
 dtype: float64,
 array([ 0,  2,  4,  6, 10]))

Passing an IntervalIndex for `bins` results in those categories exactly. Notice that values not covered by the IntervalIndex are set to NaN. 0 is to the left of the first bin (which is closed on the right), and 1.5 falls between two bins.

>>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
>>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
Categories (3, interval[int64, right]): [(0, 1] < (2, 3] < (4, 5]]
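A short sketch (my own example) of the most common pattern: explicit bin edges plus readable labels.

import pandas as pd

ages = pd.Series([6, 19, 35, 52, 71])

# Explicit edges give the bins (0, 18], (18, 65], (65, 100].
groups = pd.cut(ages, bins=[0, 18, 65, 100],
                labels=['minor', 'adult', 'senior'])
print(groups)

# The categorical result counts bin occupancy directly.
print(groups.value_counts())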
Function06
date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'
Help on function date_range in module pandas.core.indexes.datetimes:

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'

Return a fixed frequency DatetimeIndex.

Returns the range of equally spaced time points (where the difference between any two adjacent points is specified by the given frequency) such that they all satisfy `start <[=] x <[=] end`, where the first one and the last one are, resp., the first and last time points in that range that fall on the boundary of ``freq`` (if given as a frequency string) or that are valid for ``freq`` (if given as a :class:`pandas.tseries.offsets.DateOffset`). (If exactly one of ``start``, ``end``, or ``freq`` is *not* specified, this missing parameter can be computed given ``periods``, the number of timesteps in the range. See the note below.)

Parameters
----------
start : str or datetime-like, optional
    Left bound for generating dates.
end : str or datetime-like, optional
    Right bound for generating dates.
periods : int, optional
    Number of periods to generate.
freq : str or DateOffset, default 'D'
    Frequency strings can have multiples, e.g. '5H'. See :ref:`here <timeseries.offset_aliases>` for a list of frequency aliases.
tz : str or tzinfo, optional
    Time zone name for returning localized DatetimeIndex, for example 'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is timezone-naive.
normalize : bool, default False
    Normalize start/end dates to midnight before generating date range.
name : str, default None
    Name of the resulting DatetimeIndex.
closed : {None, 'left', 'right'}, optional
    Make the interval closed with respect to the given frequency to the 'left', 'right', or both sides (None, the default).
**kwargs
    For compatibility. Has no effect on the result.

Returns
-------
rng : DatetimeIndex

See Also
--------
DatetimeIndex : An immutable container for datetimes.
timedelta_range : Return a fixed frequency TimedeltaIndex.
period_range : Return a fixed frequency PeriodIndex.
interval_range : Return a fixed frequency IntervalIndex.

Notes
-----
Of the four parameters ``start``, ``end``, ``periods``, and ``freq``, exactly three must be specified. If ``freq`` is omitted, the resulting ``DatetimeIndex`` will have ``periods`` linearly spaced elements between ``start`` and ``end`` (closed on both sides).

To learn more about the frequency strings, please see `this link <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.

Examples
--------
**Specifying the values**

The next four examples generate the same `DatetimeIndex`, but vary the combination of `start`, `end` and `periods`.

Specify `start` and `end`, with the default daily frequency.

>>> pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')

Specify `start` and `periods`, the number of periods (days).

>>> pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')

Specify `end` and `periods`, the number of periods (days).

>>> pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
               '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
              dtype='datetime64[ns]', freq='D')

Specify `start`, `end`, and `periods`; the frequency is generated automatically (linearly spaced).

>>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
               '2018-04-27 00:00:00'],
              dtype='datetime64[ns]', freq=None)

**Other Parameters**

Changed the `freq` (frequency) to ``'M'`` (month end frequency).

>>> pd.date_range(start='1/1/2018', periods=5, freq='M')
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31'],
              dtype='datetime64[ns]', freq='M')

Multiples are allowed

>>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
               '2019-01-31'],
              dtype='datetime64[ns]', freq='3M')

`freq` can also be specified as an Offset object.

>>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
               '2019-01-31'],
              dtype='datetime64[ns]', freq='3M')

Specify `tz` to set the timezone.

>>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
               '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
               '2018-01-05 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Tokyo]', freq='D')

`closed` controls whether to include `start` and `end` that are on the boundary. The default includes boundary points on either end.

>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
              dtype='datetime64[ns]', freq='D')

Use ``closed='left'`` to exclude `end` if it falls on the boundary.

>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
              dtype='datetime64[ns]', freq='D')

Use ``closed='right'`` to exclude `start` if it falls on the boundary.

>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
              dtype='datetime64[ns]', freq='D')
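A minimal sketch (my own example) of the "exactly three of start/end/periods/freq" rule:

import pandas as pd

# start + periods + freq: one week of daily timestamps.
print(pd.date_range(start='2018-01-01', periods=7, freq='D'))

# Month-end frequency with a timezone attached.
print(pd.date_range(start='2018-01-01', periods=3, freq='M', tz='UTC'))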
Function07
eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)
Help on function eval in module pandas.core.computation.eval:

eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)

Evaluate a Python expression as a string using various backends.

The following arithmetic operations are supported: ``+``, ``-``, ``*``, ``/``, ``**``, ``%``, ``//`` (python engine only) along with the following boolean operations: ``|`` (or), ``&`` (and), and ``~`` (not). Additionally, the ``'pandas'`` parser allows the use of :keyword:`and`, :keyword:`or`, and :keyword:`not` with the same semantics as the corresponding bitwise operators. :class:`~pandas.Series` and :class:`~pandas.DataFrame` objects are supported and behave as they would with plain ol' Python evaluation.

Parameters
----------
expr : str
    The expression to evaluate. This string cannot contain any Python `statements <https://docs.python.org/3/reference/simple_stmts.html#simple-statements>`__, only Python `expressions <https://docs.python.org/3/reference/simple_stmts.html#expression-statements>`__.
parser : {'pandas', 'python'}, default 'pandas'
    The parser to use to construct the syntax tree from the expression. The default of ``'pandas'`` parses code slightly different than standard Python. Alternatively, you can parse an expression using the ``'python'`` parser to retain strict Python semantics. See the :ref:`enhancing performance <enhancingperf.eval>` documentation for more details.
engine : {'python', 'numexpr'}, default 'numexpr'
    The engine used to evaluate the expression. Supported engines are

    - None : tries to use ``numexpr``, falls back to ``python``
    - ``'numexpr'``: This default engine evaluates pandas objects using numexpr for large speed ups in complex expressions with large frames.
    - ``'python'``: Performs operations as if you had ``eval``'d in top level python. This engine is generally not that useful.

    More backends may be available in the future.
truediv : bool, optional
    Whether to use true division, like in Python >= 3.

    .. deprecated:: 1.0.0

local_dict : dict or None, optional
    A dictionary of local variables, taken from locals() by default.
global_dict : dict or None, optional
    A dictionary of global variables, taken from globals() by default.
resolvers : list of dict-like or None, optional
    A list of objects implementing the ``__getitem__`` special method that you can use to inject an additional collection of namespaces to use for variable lookup. For example, this is used in the :meth:`~DataFrame.query` method to inject the ``DataFrame.index`` and ``DataFrame.columns`` variables that refer to their respective :class:`~pandas.DataFrame` instance attributes.
level : int, optional
    The number of prior stack frames to traverse and add to the current scope. Most users will **not** need to change this parameter.
target : object, optional, default None
    This is the target object for assignment. It is used when there is variable assignment in the expression. If so, then `target` must support item assignment with string keys, and if a copy is being returned, it must also support `.copy()`.
inplace : bool, default False
    If `target` is provided, and the expression mutates `target`, whether to modify `target` inplace. Otherwise, return a copy of `target` with the mutation.

Returns
-------
ndarray, numeric scalar, DataFrame, Series, or None
    The completion value of evaluating the given code or None if ``inplace=True``.

Raises
------
ValueError
    There are many instances where such an error can be raised:

    - `target=None`, but the expression is multiline.
    - The expression is multiline, but not all them have item assignment. An example of such an arrangement is this:

      a = b + 1
      a + 2

      Here, there are expressions on different lines, making it multiline, but the last line has no variable assigned to the output of `a + 2`.
    - `inplace=True`, but the expression is missing item assignment.
    - Item assignment is provided, but the `target` does not support string item assignment.
    - Item assignment is provided and `inplace=False`, but the `target` does not support the `.copy()` method

See Also
--------
DataFrame.query : Evaluates a boolean expression to query the columns of a frame.
DataFrame.eval : Evaluate a string describing operations on DataFrame columns.

Notes
-----
The ``dtype`` of any objects involved in an arithmetic ``%`` operation are recursively cast to ``float64``.

See the :ref:`enhancing performance <enhancingperf.eval>` documentation for more details.

Examples
--------
>>> df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
>>> df
  animal  age
0    dog   10
1    pig   20

We can add a new column using ``pd.eval``:

>>> pd.eval("double_age = df.age * 2", target=df)
  animal  age  double_age
0    dog   10          20
1    pig   20          40
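A small sketch (my own example) of both calling styles: a plain expression, and the assignment form with target:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Evaluate an arithmetic expression over the frame's columns.
print(pd.eval("df.a + df.b"))

# Assignment form: returns a copy of `target` with the new column added.
print(pd.eval("c = df.a * df.b", target=df))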
Function08
factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'
Help on function factorize in module pandas.core.algorithms:

factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'

Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. `factorize` is available as both a top-level function :func:`pandas.factorize`, and as a method :meth:`Series.factorize` and :meth:`Index.factorize`.

Parameters
----------
values : sequence
    A 1-D sequence. Sequences that aren't pandas objects are coerced to ndarrays before factorization.
sort : bool, default False
    Sort `uniques` and shuffle `codes` to maintain the relationship.
na_sentinel : int or None, default -1
    Value to mark "not found". If None, will not drop the NaN from the uniques of the values.

    .. versionchanged:: 1.1.2

size_hint : int, optional
    Hint to the hashtable sizer.

Returns
-------
codes : ndarray
    An integer ndarray that's an indexer into `uniques`. ``uniques.take(codes)`` will have the same values as `values`.
uniques : ndarray, Index, or Categorical
    The unique valid values. When `values` is Categorical, `uniques` is a Categorical. When `values` is some other pandas object, an `Index` is returned. Otherwise, a 1-D ndarray is returned.

    .. note ::

       Even if there's a missing value in `values`, `uniques` will *not* contain an entry for it.

See Also
--------
cut : Discretize continuous-valued array.
unique : Find the unique value in an array.

Examples
--------
These examples all show factorize as a top-level method like ``pd.factorize(values)``. The results are identical for methods like :meth:`Series.factorize`.

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0]...)
>>> uniques
array(['b', 'a', 'c'], dtype=object)

With ``sort=True``, the `uniques` will be sorted, and `codes` will be shuffled so that the relationship is maintained.

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> codes
array([1, 1, 0, 2, 1]...)
>>> uniques
array(['a', 'b', 'c'], dtype=object)

Missing values are indicated in `codes` with `na_sentinel` (``-1`` by default). Note that missing values are never included in `uniques`.

>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 0, -1,  1,  2,  0]...)
>>> uniques
array(['b', 'a', 'c'], dtype=object)

Thus far, we've only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of `uniques` will differ. For Categoricals, a `Categorical` is returned.

>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1]...)
>>> uniques
['a', 'c']
Categories (3, object): ['a', 'b', 'c']

Notice that ``'b'`` is in ``uniques.categories``, despite not being present in ``cat.values``.

For all other pandas objects, an Index of the appropriate type is returned.

>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1]...)
>>> uniques
Index(['a', 'c'], dtype='object')

If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by setting ``na_sentinel=None``.

>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = pd.factorize(values)  # default: na_sentinel=-1
>>> codes
array([ 0,  1,  0, -1])
>>> uniques
array([1., 2.])

>>> codes, uniques = pd.factorize(values, na_sentinel=None)
>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1.,  2., nan])
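The round trip in three lines (my own example):

import pandas as pd

# Encode values as integer codes plus a table of unique values.
codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
print(codes)               # [1 1 0 2 1]
print(uniques)             # ['a' 'b' 'c']

# Taking the codes from `uniques` reconstructs the original values.
print(uniques.take(codes))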
Function09
get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'

Convert categorical variable into dummy/indicator variables.

Parameters
----------
data : array-like, Series, or DataFrame
    Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
    String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, `prefix` can be a dictionary mapping column names to prefixes.
prefix_sep : str, default '_'
    If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with `prefix`.
dummy_na : bool, default False
    Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
    Column names in the DataFrame to be encoded. If `columns` is None then all the columns with `object` or `category` dtype will be converted.
sparse : bool, default False
    Whether the dummy-encoded columns should be backed by a :class:`SparseArray` (True) or a regular NumPy array (False).
drop_first : bool, default False
    Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype : dtype, default np.uint8
    Data type for new columns. Only a single dtype is allowed.

Returns
-------
DataFrame
    Dummy-coded data.

See Also
--------
Series.str.get_dummies : Convert Series to dummy codes.

Examples
--------
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
   a  b
0  1  0
1  0  1
2  0  0

>>> pd.get_dummies(s1, dummy_na=True)
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1

>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1

>>> pd.get_dummies(pd.Series(list('abcaa')))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0

>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0

>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
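To close out this installment, a minimal sketch (my own example) on a mixed-dtype frame:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red'], 'size': [1, 2, 3]})

# Only the object-dtype column is one-hot encoded; numeric columns pass through.
print(pd.get_dummies(df))

# drop_first=True keeps k-1 of k indicator columns (avoids collinearity).
print(pd.get_dummies(df, drop_first=True))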