Data Science

A Beginner's Guide to Python

January 4, 2020

Hello there! Today we will discuss Python as a programming language for data science applications. We will also contrast it with other available options to better understand the pros and cons of using Python for data science. The article also includes a Python cheat sheet and a guide to building your first data science project using Python.

Content:

  • What is Python
  • Python Vs. Others
  • Python and You
  • Getting Started with Python
    • Setting up Your Computer
    • Installing Python
    • Libraries in Python
  • Building Your First Data Science Project With Python
  • Sources to Study Python
  • Python Cheat Sheet
  • Endnote

What is Python

Released in 1991, Python is a high-level language widely used in machine learning and many general-purpose applications. It is an interpreted, object-oriented language with dynamic semantics.

Where is it used?

Python is highly favored for rapid application development because it combines high-level built-in data structures with dynamic typing and dynamic binding.

It is also used as a glue or scripting language to connect pre-existing components.
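As a minimal sketch of what built-in data structures and dynamic typing look like in practice (the variable names and sample data are purely illustrative):

```python
# Dynamic typing: a name is bound to an object, not to a fixed type.
value = 42            # refers to an int
value = "forty-two"   # the same name can later refer to a str

# Built-in data structures need no imports or type declarations.
readings = [3.0, 4.0, 5.0]                    # list
point = (10, 20)                              # tuple
unique_tags = {"python", "data", "python"}    # set: duplicates collapse
word_counts = {}                              # dict

# Count words with nothing but a dict and a loop.
for word in "the quick brown fox jumps over the lazy dog the end".split():
    word_counts[word] = word_counts.get(word, 0) + 1

average = sum(readings) / len(readings)

print(type(value).__name__)
print(sorted(unique_tags))
print(word_counts["the"], average)
```
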

Coding With Python:

  • Python has a simple, easy-to-learn programming syntax which is intuitive and emphasizes readability, hence reducing the program maintenance cost.
  • Python extensively supports modules and code blocks which lead to code reusability and program modularity. 
  • The freely distributable Python interpreter and broad standard libraries are available in both binary format and as source code.
  • The spike in productivity that Python provides has resulted in its widespread use in the programming community.
  • The edit-test-debug cycle is pretty fast in Python.

Debugging Python

It’s easy to debug Python programs:

  • A bug or bad input in Python code will never cause a segmentation fault. Instead, the interpreter raises an exception when it notices an error.
  • If the program doesn’t have an exception handler or doesn’t catch the error, the interpreter prints a stack trace. 
  • A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. 
  • The debugger is written in Python itself, testifying to Python's introspective power. 
  • On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.
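The points above can be sketched in a few lines. The helper names parse_ratio and safe_parse_ratio are hypothetical, chosen only for this illustration; the sketch shows bad input raising an exception, a handler catching it, and the stack trace the interpreter would otherwise print.

```python
import traceback


def parse_ratio(text):
    """Convert a string such as '3/4' into a float; bad input raises."""
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


def safe_parse_ratio(text):
    """Return the ratio, or None if the input cannot be parsed."""
    try:
        return parse_ratio(text)
    except (ValueError, ZeroDivisionError) as error:
        # A quick print is often all the debugging that is needed.
        print(f"could not parse {text!r}: {type(error).__name__}: {error}")
        return None


good = safe_parse_ratio("3/4")
bad_number = safe_parse_ratio("3/x")   # int("x") raises ValueError
bad_zero = safe_parse_ratio("3/0")     # raises ZeroDivisionError

# Without a handler, the interpreter would print a stack trace like this one.
try:
    parse_ratio("1/0")
except ZeroDivisionError:
    trace_text = traceback.format_exc()
    print(trace_text)
```

For line-by-line stepping, the standard library debugger can be started by running python -m pdb followed by the script name.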

Current Version: 

Python 3.x is the current version of the language and is available for all major platforms. However, many existing systems still run Python 2.7, the legacy version, because it is already the system default, so a programmer or developer may need to work with it. It’s always a good idea to know how to port code from Python 2.7 to Python 3.x. Here you will find a comprehensive guide on how to port this code.
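To give a feel for what porting involves, here is a small sketch of a few well-known differences, written for Python 3. It is not a complete porting checklist, and the sample values are illustrative.

```python
# print is a function in Python 3 (it was a statement in Python 2).
print("ported", "code")

# "/" is true division in Python 3; use "//" for the old integer behaviour.
true_quotient = 7 / 2     # 3.5 (Python 2 would give 3 for two ints)
floor_quotient = 7 // 2   # 3 in both versions

# dict.items() returns a view in Python 3; wrap it if a real list is needed.
prices = {"apple": 3, "pear": 5}
items = sorted(prices.items())

# Text is Unicode by default in Python 3; bytes must be decoded explicitly.
raw = b"caf\xc3\xa9"
text = raw.decode("utf-8")

# range() is lazy in Python 3 (like xrange in Python 2).
first_three = list(range(3))

print(true_quotient, floor_quotient, items, len(raw), len(text), first_three)
```
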

Python Vs. Others

Feature | Python | R | SAS
--- | --- | --- | ---
Availability | Free distribution | Free distribution | Expensive and available in bigger organizations only. The University edition is free but has limited facilities.
Ease of Learning | Known for its simplicity and ease of use, and quite efficient in data science applications. There is no GUI like SAS, but it can be used on top of other tools. | Steeper learning curve; even simple programs may require a lot of code. | Easy to learn. PROC SQL is easy for people who know SQL. Comprehensive documentation but costly certifications.
Data Handling Capabilities | Good data handling capacity; supports parallel computing | Good data handling capacity; supports parallel computing | Good data handling capacity; supports parallel computing
Graphical Capabilities | Highly advanced graphical capabilities; supports Plotly and Seaborn | Highly advanced graphical capabilities; supports Plotly | Difficult to alter graphics and only functional support. You need to know how to use the Graph package of SAS.
Advancements in Tools | Advanced and up to date, but new releases can be prone to bugs and instability because it is open source | Most advanced and up to date with the latest technologies; higher margin of instability and bugs in new releases because it is open source | Updates roll out with official versions only, so it is updated in a controlled environment and is generally stable
Job Scenario | Gradually gaining traction in industry; used for context-centric applications | Gradually gaining traction; used more widely in statistics-oriented applications | The most widely used tool so far
Deep Learning Support | Extensive support through multiple libraries, e.g. Keras, TensorFlow, Theano | Built-in packages, plus the KerasR package, an equivalent of Keras for Python | Still in its infancy, mostly under development
Customer Service Support and Communication | Huge community; no customer support, but online help is available on forums | Huge community; no customer service, but online help is available through forums | Dedicated customer support

Python and You

Python is a general-purpose language that can be adapted to many environments, and this guide links to resources for several of them. The following are a few of the many groups who use it.

  • Web and Internet Developers

Python's standard library supports many Internet formats and protocols: HTML and XML, JSON, email processing, FTP, IMAP and other Internet protocols, plus an easy-to-use socket interface.

The Package Index has yet more libraries: Twisted Python, a framework for asynchronous network programming; Requests, a powerful HTTP client library; BeautifulSoup, an HTML parser that can handle all kinds of HTML; Feedparser for parsing RSS/Atom feeds; and Paramiko, which implements the SSH2 protocol.
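As a rough illustration of that standard library support, the sketch below parses a fragment of HTML, a URL and a JSON payload using only built-in modules. The LinkCollector class and the hard-coded inputs are hypothetical, and nothing is fetched from the network.

```python
import json
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical inputs, hard-coded so the example needs no network access.
page = '<p>See the <a href="https://example.com/docs?page=2">docs</a>.</p>'
payload = '{"name": "python", "tags": ["web", "data"]}'

collector = LinkCollector()
collector.feed(page)

url = urlparse(collector.links[0])   # split the URL into its components
query = parse_qs(url.query)          # query string -> dict of lists
record = json.loads(payload)         # JSON text -> Python dict

print(url.netloc, query["page"], record["tags"])
```
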

Frameworks such as Django, Pyramid, Flask and Bottle also offer a wide array of options to the developers. Ask a developer herself and she will tell you how Python is making her life easier!

  • Deep Learning and Machine Learning Engineers

Python is widely used for scientific and numerical projects, many of which deal with the modelling, architecture and training of neural networks. SciPy, Pandas and IPython are just a handful of the tools available.

Python is at its best here when Keras is used as a high-level layer on top of TensorFlow or Theano.

  • Desktop GUI Developers

Most binary distributions of Python are equipped with Tk, the GUI library. Other toolkits like Qt via PyQt, wxWidgets, Kivy (specifically for multitouch applications) are also among the most widely used options.

  • Software Developers

Generally, Python is used as a support language in software development, for build control and management, and for testing. Other uses keep emerging, but testing still takes the biggest chunk of this usage.

  • ERP and E-commerce Application Developers

Tools like Odoo are examples of Python’s applicability in ERP and e-commerce application development. Considering the modular structure and simple syntax of Python, its usage here is likely to keep rising.

Getting Started with Python

There are multiple options available to download and use Python. 

As mentioned above, Python has many uses, and depending on your end use, you will need to choose the appropriate package.

  • Python for Programming Enthusiasts (Novices):

Those who are learning to code in Python, or have just learnt it and want to become proficient in the language, can download either the legacy version (Python 2.7) or the latest version (Python 3.x), whichever they prefer.

Here you will find self-explanatory information on how to install Python on whichever operating system you are using.

If you still have doubts, here you will find all the help you might need to take you through the process.

  • Python for Data Science 

If you plan on using Python for loading data, exploring this data set, visualizing and creating statistical models from this data, the Anaconda distribution for Python is just what you need. 

This package contains the Python core language, an improved REPL environment called Jupyter, numeric computing libraries (NumPy, pandas), plotting libraries (seaborn, matplotlib), and statistics and machine learning libraries (SciPy, scikit-learn, statsmodels).

Here you will find steps to set up your environment and get it ready to start working with your data.

  • Python for Deep Learning

Installing Python and setting up the environment for deep learning can be confusing for anyone. As with data science applications, Anaconda is the one-stop solution for installing Python, as it brings all the associated packages with it. To start developing deep learning solutions, however, you also need to download and install a scientific computing library like TensorFlow, Theano or Caffe. Development is even easier and more efficient with Keras on top of these more complex libraries, as Keras makes the code simpler and portable across platforms.

Here you will find a descriptive guide on how to install Python, Keras and the back-end scientific computing library (TensorFlow, Theano etc).

Libraries in Python

  • NumPy: 

An abbreviation for Numerical Python, NumPy’s strongest feature is the n-dimensional array. Along with tools for integration with multiple low level languages like Fortran, C and C++, NumPy also offers support for advanced random number capabilities, basic linear algebra functions and Fourier Transforms etc. 
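A brief sketch of the features just listed, assuming NumPy 1.17 or later is installed (needed for default_rng). The arrays and the seed are arbitrary illustrative values.

```python
import numpy as np

# An n-dimensional array: here a 2 x 3 matrix of floats.
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print(matrix.shape)                    # (2, 3)
column_means = matrix.mean(axis=0)     # mean of each column

# Basic linear algebra: solve the system A x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Fourier transform: a sine with 5 cycles sampled at 64 points
# should show its peak in frequency bin 5.
t = np.arange(64) / 64
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
dominant_bin = int(np.argmax(spectrum))

# Seeded random numbers for reproducible experiments.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print(column_means, x, dominant_bin, samples.shape)
```
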

  • SciPy:

Built on top of NumPy, SciPy stands for Scientific Python. It provides high-level science and engineering modules such as optimization, sparse matrices, linear algebra and the discrete Fourier transform.

  • Matplotlib

To plot graphs such as histograms and heat maps in your Python environment, Matplotlib will come to your rescue. The Pylab feature in IPython notebook (ipython notebook --pylab=inline) can be used to show these plots inline.

If you forgo the inline option, then pylab converts the IPython environment to a MATLAB-like environment. You can also use LaTeX commands to add math to your plot.

  • Pandas

This library is specifically for structured manipulations and data operations, i.e. for data wrangling and preparation. You can find a detailed Pandas cheat sheet and methods to use it here.
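A small sketch of typical wrangling steps, assuming pandas is installed. The sales records and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records with a missing value and inconsistent casing.
raw = pd.DataFrame({
    "region": ["north", "South", "north", "south", "East"],
    "units": [10, 4, None, 6, 8],
    "price": [2.5, 3.0, 2.5, 3.0, 4.0],
})

clean = raw.copy()
clean["region"] = clean["region"].str.lower()        # normalise labels
clean["units"] = clean["units"].fillna(0)            # treat missing as zero
clean["revenue"] = clean["units"] * clean["price"]   # derived column

# Total revenue per region, largest first.
summary = (
    clean.groupby("region")["revenue"]
    .sum()
    .sort_values(ascending=False)
)
print(summary)
```
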

  • Scikit Learn

Built on NumPy, SciPy and matplotlib, this library is mainly useful for statistical modeling and machine learning.

  • StatsModels

This module is used to explore data, estimate statistical models, and carry out statistical tests. 

  • Seaborn

Built on matplotlib, it is best suited for statistical data visualization, and it has helped make data visualization more fun.

  • Bokeh

Bokeh helps create interactive plots, dashboards and data applications for web browsers. The graphics it generates are similar in nature to those developed using D3.js.

  • Blaze

This library extends the capabilities of NumPy and Pandas to streaming and distributed datasets. It can not only help build impressive visualizations from huge datasets but also import data from various sources like MongoDB, BColz, SQLAlchemy etc.

  • Scrapy

Generally used for web crawling, this library is useful for extracting specific patterns of data. It provides functions that let a program begin at the home URL of a website and crawl through the website’s pages to gather data and information.

  • SymPy

Short for Symbolic Python, this library lets you perform symbolic computation ranging from simple arithmetic to calculus, discrete mathematics, algebra and quantum physics. It can also output the results as LaTeX code.

  • Requests

Quite similar to what urllib2 does for Python, this library helps in accessing the web. However, it is much easier to code with this library.

Building Your First Data Science Project With Python

An end-to-end project is the best way to start your journey of developing data science solutions using Python.

Now, such a project is not linear in nature but generally follows these steps: 

  1. Define Problem
  2. Prepare Data
  3. Evaluate Algorithms
  4. Fine-Tune Results
  5. Present Results

Working through the framework of a solution in this manner gives you a template, or skeleton, which you can improve upon on a case-by-case basis. As we all know, practice makes a developer perfect.
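The five steps can be sketched as a tiny skeleton. This is only an illustration under stated assumptions: the data is synthetic (two well-separated clusters), and a hand-written k-nearest-neighbours function stands in for what a library such as scikit-learn would provide in a real project. All function and variable names are hypothetical.

```python
import random
from collections import Counter

# 1. Define the problem: predict a binary label from two numeric features.
rng = random.Random(0)


def make_point(label):
    """Synthetic sample: class 0 clusters near (0, 0), class 1 near (5, 5)."""
    centre = 0.0 if label == 0 else 5.0
    return ([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)], label)


# 2. Prepare the data: build, shuffle and split into train/validation/test.
data = [make_point(i % 2) for i in range(90)]
rng.shuffle(data)
train, validation, test = data[:50], data[50:70], data[70:]


def predict_majority(train_rows, features):
    """Baseline: always answer with the most common training label."""
    return Counter(label for _, label in train_rows).most_common(1)[0][0]


def predict_knn(train_rows, features, k):
    """k-nearest neighbours using squared Euclidean distance."""
    nearest = sorted(
        train_rows,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(features) == label for features, label in rows)
    return hits / len(rows)


# 3. Evaluate algorithms: score the baseline on the validation split.
baseline_accuracy = accuracy(lambda f: predict_majority(train, f), validation)

# 4. Fine-tune results: pick the k that scores best on the validation split.
candidates = [1, 3, 5]
best_k = max(
    candidates,
    key=lambda k: accuracy(lambda f: predict_knn(train, f, k), validation),
)

# 5. Present results: report the final score on the untouched test split.
knn_accuracy = accuracy(lambda f: predict_knn(train, f, best_k), test)
print(f"baseline (validation): {baseline_accuracy:.2f}")
print(f"k-NN with k={best_k} (test): {knn_accuracy:.2f}")
```

Keeping the test split out of the tuning step is the main design choice here: the final number is then measured on data the model selection never saw.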

Here you will find a step by step guide to build such a framework and various datasets to test and improve this structure. 

Sources to Study Python

While you can join a certificate course at websites like udemy.com, coursera.org or udacity.com, websites like codecademy.com offer a free course for learning and mastering Python.

If online studying is not exactly your comfort zone, ‘Learn Python 3 the Hard Way’ by Zed Shaw is a good place to begin. It will take you from the basics to the advanced level of Python programming. However, you would need to be consistent and persistent in your efforts.

Here you will find out how to import various libraries while working with Python to develop your data science or machine learning project.  

Cheat Sheet- Python:

Here you will find a cheat sheet for performing the basic functions and operations in Python. These code blocks can be used directly to perform the intended function.

Endnote:

The primary objective of this guide to Python is to introduce beginners to the fundamentals of Python, but it is also clear that Python will see a rise in its use in various fields. As discussed already, its applications in data science and machine learning are multi-faceted and will require cross-domain expertise.

We hope this article helps you get a clear idea of how to begin developing data science projects using Python.