pandas is a Python library providing data structures and tools for data analysis.
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
With no index given, pandas creates a default integer index:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
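The selection and setting examples further down reference a `dates` index and a `df` built from it. A minimal sketch of that assumed setup (the values are random, so actual numbers will differ from the sample output below):

```python
import numpy as np
import pandas as pd

# Assumed setup: a 6-day DatetimeIndex and a 6x4 DataFrame of random values.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

print(df.shape)  # (6, 4)
```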
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
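In this dict-of-columns construction, scalars like the `1.0` and `"foo"` are broadcast to the length implied by the sequence columns, and each column keeps its own dtype. A quick check:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,                      # scalar, broadcast to 4 rows
        "B": pd.Timestamp("20130102"), # scalar timestamp, broadcast
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",                    # scalar string, broadcast
    }
)

# One dtype per column: float64, datetime64[ns], float32, int32, category, object
print(df2.dtypes)
```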
df.head(num) - first num rows (optional, default 5)
df.tail(num) - last num rows (optional, default 5)
df.index - get the index
df.columns - get the column names
df.to_numpy() - return a NumPy representation (expensive for mixed dtypes)
df.describe() - quick stats (count, mean, std, min, etc.)
df.T - transpose
df.sort_index(axis=1, ascending=False) - sort by index
df.sort_values(by="B") - sort by the values of a column
df["A"] - select column "A", same as df.A
df[0:3] - select the first 3 rows
df.loc[dates[0]] - cross section by label
df.loc[:, ["A", "B"]] - multi-axis selection by label
df.loc["20130102":"20130104", ["A", "B"]] - label slicing; both endpoints are inclusive
df.at[dates[0], "A"] - get a single scalar value by label
df.iloc[3] - the row at position 3, all columns, returned as a Series
df.iloc[3:5, 0:2] - rows 3-4, columns 0-1; end-exclusive, like NumPy/Python slicing
df.iloc[[1, 2, 4], [0, 2]] - explicit lists of row/column positions to include
df.iloc[1:3, :] - explicit rows, all columns
df.iloc[:, 1:3] - all rows, explicit columns
df.iat[1, 1] - single scalar value by position
df[df["A"] > 0] - select rows matching a column criterion
df[df > 0] - select all values meeting a criterion (non-matching values become NaN)
df[df["E"].isin(["two", "four"])] - select rows whose column value is in a list
df["F"] = s1 - add a new column (where s1 is a Series, aligned by index)
df.at[dates[0], "A"] - set a value by label (with assignment)
df.iat[0, 1] - set a value by position (with assignment)
df.loc[:, "D"] = np.array([5] * len(df)) - set a column with a NumPy array
df[df > 0] = -df - turn all values greater than 0 negative
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 -5 NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
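A self-contained run of a few of the selection and setting idioms above, on a small deterministic frame so the results are checkable:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=dates, columns=["A", "B"])
# column A is [0, 2, 4, 6], column B is [1, 3, 5, 7]

# label-based slicing: both endpoints inclusive -> 2 rows
by_label = df.loc["20130101":"20130102", ["A"]]
assert len(by_label) == 2

# position-based slicing: end-exclusive, like NumPy -> 2 rows
by_pos = df.iloc[1:3, 0:1]
assert len(by_pos) == 2

# boolean indexing keeps only matching rows
positive_a = df[df["A"] > 2]
assert list(positive_a["A"]) == [4, 6]

# setting a single value by label, reading it back by position
df.at[dates[0], "A"] = 100
assert df.iat[0, 0] == 100
```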
df1.dropna(how="any") - drop rows with any NaN values
df1.fillna(value=5) - fill missing data
df.mean() - mean per column
df.mean(axis=1) - mean per row
df.apply(np.cumsum) - apply a function to each column
df.apply(lambda x: x.max() - x.min()) - apply a function that returns one value per column
s.value_counts() - histogram (value frequencies) for a Series
s.str.lower() - lowercase a string Series (more string methods exist too)
pd.concat(objs) - concatenate a list of pandas objects
pd.merge(left, right, on="key") - SQL-style join (left and right are DataFrames)
df.groupby("A").sum() - group rows by column "A", then aggregate each group
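The merge and groupby one-liners above, run on small hand-made frames (the column and key names here are illustrative):

```python
import pandas as pd

# SQL-style join on a shared key column
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
merged = pd.merge(left, right, on="key")
# rows: (foo, 1, 4) and (bar, 2, 5)

# split-apply-combine: group rows by "A", then sum each group
df = pd.DataFrame({"A": ["x", "y", "x", "y"], "B": [1, 2, 3, 4]})
sums = df.groupby("A").sum()
print(sums.loc["x", "B"], sums.loc["y", "B"])  # 4 6
```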
ts.resample("5Min").sum() - resampling (frequency conversion)
ts_utc.tz_convert("US/Eastern") - time zone conversion
ts.to_period() - convert the index to a different period representation
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
df["grade"] = df["raw_grade"].astype("category") - convert raw grades to the category dtype
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"]) - rename the categories (direct assignment to .cat.categories was removed in pandas 2.0)
df.sort_values(by="grade") - sort by categorical order, not lexical order
df.groupby("grade").size() - group by category and count
import matplotlib.pyplot as plt
plt.close("all")
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot()
plt.figure()
df.plot()
plt.legend(loc='best')
df.to_csv("foo.csv") - write to CSV
pd.read_csv("foo.csv") - read from CSV
df.to_excel("foo.xlsx", sheet_name="Sheet1") - write to Excel
pd.read_excel("foo.xlsx", sheet_name="Sheet1", index_col=None, na_values=["NA"]) - read from Excel
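One gotcha with the CSV round trip: to_csv writes the index as an unnamed first column by default, so reading the file back adds an extra column unless you pass index=False when writing (or index_col=0 when reading). A sketch, writing to a temporary directory:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")

    # index=False keeps the default integer index out of the file
    df.to_csv(path, index=False)
    back = pd.read_csv(path)

assert list(back.columns) == ["a", "b"]
assert back.equals(df)
```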