飞道's Blog

Jupyter Notebook for Beginners: Course Summary (Part 1)


Study hard, and put what you learn to use.

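I shamelessly talked my way into a company course. I'm the only participant in an Asian time zone, so I follow the internal training late at night. And so my Jupyter Notebook journey begins.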

Module 1 Markdown in Jupyter

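The first session was the most basic introduction. What was covered in two and a half hours I could have learned in half an hour from a Markdown cheatsheet. My long-standing impression of Western instructors is that their courses aren't very gradual: they often jump straight from beginner level to advanced material. As a result, in past courses I tended to coast at the start and fall behind later. For this course I'm taking notes in real time, reviewing and practising, and I hope to learn it properly.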

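A nice thing about CSDN is that it also uses Markdown for editing, so writing blog posts here is a very effective way to consolidate and practise what I've learned.
Focus on the key points, review often, and reflect!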

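  1. Get familiar with the keyboard shortcuts; if you forget one, click the keyboard icon to look it up;
  2. A cell can be switched between Code and Markdown, but only in command mode (blue cell border): press M for Markdown or Y for Code;
  3. Build Wikipedia-style hyperlinks for navigating within the document or jumping quickly back to a particular paragraph;
  4. LaTeX was mentioned; a great many people abroad really do use it;
  5. Insert images so they display in the document. You need to be clear on whether the image is referenced directly from the web or uploaded from a local file;
  6. I think the teacher got monospace text wrong: text like this, for example, doesn't seem to be what's called a monospace font, does it? The teacher also pointed to the wrong key for the symbol; the correct one is the backtick ` below ~ on the keyboard.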

Homework: build a TOC and add the link to Module 1 to it
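
The TOC homework can also be generated instead of typed by hand. Below is a minimal Python sketch that turns a list of heading titles into Markdown links. It assumes the notebook derives a heading's anchor by replacing spaces with hyphens (which is how the classic Jupyter Notebook appears to behave); the function names and the heading list are hypothetical, not from the course.

```python
# Sketch: build a Markdown table of contents with in-notebook anchor links.
# Assumption: a heading's anchor is its title with spaces replaced by hyphens.

def heading_anchor(title: str) -> str:
    """Return the assumed anchor id for a heading title."""
    return title.strip().replace(" ", "-")


def make_toc(titles):
    """Return one Markdown list line per heading, each linking to its anchor."""
    return [f"- [{t}](#{heading_anchor(t)})" for t in titles]


# Hypothetical headings for this course notebook.
titles = ["Module 1 Markdown in Jupyter", "Module 2"]
toc_lines = make_toc(titles)
print("\n".join(toc_lines))
```

Pasting the printed lines into a Markdown cell should give a clickable TOC; if the links don't jump, the anchor rule in your Jupyter version probably differs from the assumption above.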

Some resources for reference:

Assignment:

Create a small Jupyter report
Create a new Jupyter Notebook “report.ipynb” in your Azure Project

List the features, pros and cons of Jupyter in table form

  • Write a math formula
$y=2x^2+\sqrt3$

$$y = 2x^2 + \sqrt{3}$$

  • Set the font color
<font color="#FF0000">red</font>

red

Pros and cons of Jupyter Notebook

| Jupyter Notebook | Description |
| --- | --- |
| Features | A web-based interactive Python computational environment (formerly IPython Notebooks) that can create notebooks composed of input and output cells. These cells can contain code, text (using Markdown), mathematics, plots and rich media. The notebook file usually ends with the “.ipynb” extension. |
| Pros | - Users can write code and notes together in the browser, alongside the result of each run, which directly forms a report.<br>- Supports many programming languages, such as Python, R, MATLAB, C++ and Ruby. You only need to install the corresponding kernels.<br>- Visualized content can be made interactive with widgets, so users can zoom a map or rotate 3D models.<br>- Very convenient for small-scale data analysis and code validation.<br>- All files are uploaded to the cloud, so they are easy to access and share with collaborators. |
| Cons | 1. Relies on the internet speed.<br>2. The layout is not as easy to arrange as in Microsoft Word. |

Jupyter Logo

Discussion on Yammer:
Hi all, now that Jupyter is becoming more popular, I wonder if anyone has exported a notebook as a PDF report. The question here is: is there a LaTeX class for reports?
Hi all,
Even though I haven’t yet taken a close look at publishing scientific papers/reports, and hence can’t really estimate the effort required to create a template, I have seen some impressive work already. For now all I can do is share the resources I stumbled upon and found interesting 😃
1. UltimateIpythonNotebook
2. WritingAcademicPapers
3. Ipypublish

Maybe it helps a bit.

There is also a nice website I found myself, with a fun name: Brilliant Wrong. Worth a look when you have time.

Module 2

Prereading for module 2:
Quickly answer the following questions

  • What is a kernel? And which kernels exist?
    To answer this question read the Notebook beginners guide
  • Read about Numpy. What is it and what is its purpose?
  • Read about Pandas (in the Python context, not the mammal). What is it and what is its purpose?

My answer:

  1. A kernel is the “computational engine” that executes the code contained in a Notebook document. Many kernels exist; common ones are IPython, IRkernel and IJulia. Others are listed here (community-maintained kernels).

  2. NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

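NumPy is an open-source numerical computing extension for Python. It can store and process large matrices far more efficiently than Python's own nested list structure (which can also be used to represent a matrix).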
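
To make the comparison with nested lists concrete, here is a small sketch with made-up numbers (not from the course material). It shows the same 2x3 matrix as a nested list and as a NumPy array, and how element-wise arithmetic and column sums look in each case.

```python
import numpy as np

# A 2x3 matrix as a nested Python list and as a NumPy array.
nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# Doubling every element: the list needs explicit loops,
# the array does it in one vectorized expression.
doubled_list = [[2 * v for v in row] for row in nested]
doubled_arr = arr * 2

# Matrix-style operations such as column sums and transpose are built in.
col_sums = arr.sum(axis=0)
transposed_shape = arr.T.shape

print(doubled_arr)
print(col_sums, transposed_shape)
```

The efficiency gain only shows on large arrays; this toy example is just about how much shorter the array code reads.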

  3. Pandas: In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term “panel data”, an econometrics term for datasets that include observations over multiple time periods for the same individuals.

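pandas is a tool built on NumPy, created for data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large data sets. pandas offers a large number of functions and methods that let us process data quickly and conveniently. You will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.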
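
As a rough illustration of what "data structures and operations for manipulating numerical tables" means in practice, here is a minimal sketch. The values are invented, and the column names ID, X, Y and Curve are only assumed to resemble a typical course data set.

```python
import io

import pandas as pd

# Hypothetical measurements in a small table (DataFrame).
df_demo = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "X": [0.0, 1.0, 0.0, 1.0],
    "Y": [500.0, 498.5, 510.0, 507.0],
    "Curve": ["well1", "well1", "well2", "well2"],
})

# Descriptive statistics for the numeric columns.
stats = df_demo.describe()

# Mean of Y per curve: a one-line grouped aggregation.
mean_per_curve = df_demo.groupby("Curve")["Y"].mean()

# Round trip through CSV text with a semicolon separator;
# the same separator must be given when reading the text back.
buffer = io.StringIO()
df_demo.to_csv(buffer, sep=";", index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer, sep=";")

print(stats)
print(mean_per_curve)
```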

ArcGIS Pro supports Python 3, whereas ArcGIS Desktop (ArcMap) supports Python 2.

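The second session was mainly about processing and transforming tabular data in code. The teacher spent another two hours on a simple csv table. The various coding techniques do cover a lot of functionality, but I'm rather puzzled: isn't this something you could do in an Excel sheet in a few minutes? It can be done in code too, but is it really necessary?

pandas is imported.

Various pivot transformations of the DataFrame

  • Read an Excel file: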
import pandas as pd
df = pd.read_excel("../../data/drawdown_curves.xlsx")
df
  • The describe method shows descriptive statistics for each numeric column: count, mean, std, min, 25%, 50%, 75%, max
df.describe()
  • info shows a summary of the data set (index, columns, dtypes, memory usage):
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4968 entries, 0 to 4967
Data columns (total 4 columns):
ID       4968 non-null int64
X        4968 non-null float64
Y        4968 non-null float64
Curve    4968 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 155.3+ KB
  • Save to a csv file
df.to_csv("Dataframe_output.csv",sep=";")
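  • Pivot the stacked (long) table into a tidy wide one: one row per X value, one column per Curve, with Y as the cell values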
df_pivot = df.pivot(index="X",values="Y",columns="Curve") 
  • Show only the first three or last three rows of the table
df_pivot.head(3)
df_pivot.tail(3)
  • Date and time arithmetic
from datetime import datetime
from datetime import timedelta
datetime(2018,1,1)+timedelta(days=1.2)
>>>datetime.datetime(2018, 1, 2, 4, 48)
  • Add a date column "Calendar" to the table, starting at 1 January 2018.
# specify a start date and create a new column of type DateTime 
# create a new column "Calendar"
df_pivot["Calendar"] = pd.to_datetime(df_pivot.index, unit="D", origin = datetime(2018,1,1)) 
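  • Reset the row index to the default 0, 1, 2, 3… (this returns a new DataFrame; df_pivot itself is unchanged unless the result is assigned back)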
df_pivot.reset_index()
  • Use the Calendar column as the row index
df_pivot.set_index("Calendar",inplace=True)
df_pivot.head()
  • Decimal places:
  • Dropping certain values:
# resample and remove gaps by dropping all rows that contain NaN values
df_pivot.resample("D").mean().dropna().head()
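
Since the course file drawdown_curves.xlsx isn't available here, the following sketch replays the pivot, Calendar, set_index and resample steps from the list above on a tiny invented data set. The numbers are hypothetical; only the column names X, Y and Curve follow the notes.

```python
from datetime import datetime

import pandas as pd

# Hypothetical long-format data: X is days since the start, Y a water level.
df = pd.DataFrame({
    "X": [0, 1, 3, 0, 1, 3],
    "Y": [500.0, 499.0, 497.0, 510.0, 508.0, 505.0],
    "Curve": ["well1"] * 3 + ["well2"] * 3,
})

# One row per X value, one column per curve.
df_pivot = df.pivot(index="X", columns="Curve", values="Y")

# Turn the day offsets into calendar dates starting on 2018-01-01.
df_pivot["Calendar"] = pd.to_datetime(
    df_pivot.index, unit="D", origin=datetime(2018, 1, 1)
)
df_pivot.set_index("Calendar", inplace=True)

# Daily resampling inserts the missing 2018-01-03 as a row of NaN ...
daily = df_pivot.resample("D").mean()
# ... and dropna() removes that row again.
daily_clean = daily.dropna()

print(daily)
print(daily_clean)
```

Note that resample only works once the index holds datetimes, which is why set_index("Calendar") has to come first.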

How to drop certain values of a dataframe instead of NaNs?

Quite easy: use the .drop method (how intuitive 😉) on the DataFrame if you want to drop rows or columns based on a certain condition.

As an alternative you can specify a condition for selecting data directly and thereby create a new slice.

Let’s say we want to select all data where the well1 values in our training Dataframe df_pivot are larger than 495.37.

Then we would use the code below

# select all values from df_pivot where the well1 column values are larger than 495.37
df_pivot[df_pivot["well1"] > 495.37]
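
To compare the two approaches from this answer side by side, here is a small sketch on an invented stand-in for df_pivot (the values are hypothetical). It selects rows with a boolean condition and then reaches the same result with .drop.

```python
import pandas as pd

# Hypothetical stand-in for df_pivot with two wells.
df_pivot = pd.DataFrame(
    {"well1": [495.0, 495.5, 496.2], "well2": [510.0, 508.0, 505.0]},
    index=pd.date_range("2018-01-01", periods=3, freq="D"),
)

# Keep only rows where well1 exceeds the threshold (boolean mask).
selected = df_pivot[df_pivot["well1"] > 495.37]

# Same result with .drop: remove the rows that fail the condition.
to_drop = df_pivot.index[df_pivot["well1"] <= 495.37]
dropped = df_pivot.drop(index=to_drop)

print(selected)
print(dropped)
```

Both calls return new DataFrames and leave df_pivot untouched.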

One of my favorite resources for looking stuff up also has some good practical examples

  • Interpolation
df_pivot.resample("D").mean().interpolate().head()
  • Show only certain columns
df_pivot[["well2","well5"]].head()
  • Select certain rows (note that row positions are counted from 0)
df_pivot.iloc[3]
df_pivot.iloc[0:6]
  • Plotting
df_pivot.plot()
df_pivot[["well1","well3"]].plot()
  • inplace=True

  • %matplotlib inline

  • Variable inspector
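
The interpolation, column selection, iloc and inplace=True items above can be tried on a small invented table; the sketch below uses hypothetical values and a stand-in df_pivot. It assumes nothing beyond standard pandas behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data with one missing value for well1.
df_pivot = pd.DataFrame(
    {
        "well1": [500.0, np.nan, 496.0, 495.0],
        "well2": [510.0, 508.0, 506.0, 505.0],
    },
    index=pd.date_range("2018-01-01", periods=4, freq="D"),
)

# Linear interpolation fills the gap with the midpoint of its neighbours.
filled = df_pivot.interpolate()

# Double brackets return a DataFrame containing just the listed columns.
only_well2 = df_pivot[["well2"]]

# iloc is position based and counts from 0; the end of a slice is excluded.
fourth_row = df_pivot.iloc[3]
first_two = df_pivot.iloc[0:2]

# Without inplace=True a method returns a new object and leaves the original
# alone; with inplace=True the object itself is modified.
work = df_pivot.copy()
work.dropna(inplace=True)

print(filled)
print(len(df_pivot), len(work))
```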

  • How to select a certain datetime range from a dataframe by solely giving one date and a time increment?

Well, pandas DataFrames can also handle the datetime format directly instead of using strings (e.g. “2018-01-01”).

The example below selects only those rows of the DataFrame df_pivot from the guided coding session of module 2 that fall between 5 January 2018 and the start date plus 50 days (24 February 2018).

# specify start in datetime format
start_datetime = datetime(2018, 1, 5)

# specify timespan with timedelta
timespan = timedelta(days=50)

# specify end in datetime format
end_datetime = start_datetime + timespan

df_pivot[start_datetime:end_datetime]
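
For a version that runs without the course data, here is a sketch of the same range selection on a hypothetical daily series covering the first 100 days of 2018. It also shows the equivalent .loc form, which makes it explicit that rows are being selected.

```python
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical daily series; the values are just a running counter.
df_pivot = pd.DataFrame(
    {"well1": list(range(100))},
    index=pd.date_range("2018-01-01", periods=100, freq="D"),
)

start_datetime = datetime(2018, 1, 5)
end_datetime = start_datetime + timedelta(days=50)

# Slicing a DatetimeIndex by label includes both end points.
window = df_pivot[start_datetime:end_datetime]

# .loc selects the same rows and states the intent more clearly.
window_loc = df_pivot.loc[start_datetime:end_datetime]

print(window.index[0], window.index[-1], len(window))
```

Unlike iloc slices, label-based slices include the end point, so 50 days after the start gives 51 rows.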

https://openpyxl.readthedocs.io/en/stable/
https://stackoverflow.com/questions/21892570/ipython-notebook-align-table-to-the-left-of-cell

Things to watch out for:
The same code sometimes raises an error under one kernel and runs fine under another.

PythonForDataAnalysis.pdf


Reposted from: https://blog.csdn.net/github_37280613/article/details/105050565