web crawling or scraping using scrapy in python

Scrapy is a very popular web scraping and crawling framework, and I have been using it for quite some time now. In this post, I will demonstrate how to create a very basic web crawler.

Install Scrapy

Installation is via pip:

    pip install scrapy

Minimalistic Code

A very simple scraper is created like this. To run it, simply type:

    scrapy runspider scraper.py

Running the above code will output something like this:

    2018-12-02 14:01:18 [scrapy.utils.log] INFO: Scrapy 1.
Read more →

integration testing with apache beam using pubsub and bigtable emulators and direct runner

Summary

Recently I have been looking into ways to test my Apache Beam pipelines at work. The most common use cases of Beam generally involve either batch reading data from GCS and writing to analytical platforms such as BigQuery, or stream reading data from Pubsub and writing to perhaps Bigtable. A pipeline consists of transforms, and it's generally easy to test them in isolation with an independent unit test per stage. However, I am personally a big fan of “end-to-end” or “integration” testing, and this is where things can sometimes get tricky.
Read more →

gdal 2 on mac with homebrew

GDAL is one of the most frequently used utilities in my toolkit. I am writing this post to make it easier for others to install it from scratch on their Macs.

Setting up GDAL

The traditional way has always been to visit the dear old kyngchaos.com and install the “GDAL Complete” Framework via deb installer. Do make sure that the GDAL Framework is in your path; otherwise, something like this always helps
Read more →

how to print bar chart in chrome browser console

This post doesn’t really have anything valuable to contribute, just a cool console trick. Have you ever wanted to plot a chart very quickly? Did you ever have an urge to visualise a bunch of numbers without having to use a charting API or copy-pasting the data into a spreadsheet? If you did, then you might even learn something today :) Here is a simple yet neat way to plot a bunch of numbers in the Chrome console as a horizontal bar chart.
Read more →

how to make https requests with python httplib2 ssl

Here are a few snippets for making secure HTTP requests using various Python libraries.

httplib2

    import httplib2
    link = "https://example.com"
    h = httplib2.Http(".cache")
    r, content = h.request(link, "GET")

Another example:

    import httplib2
    h = httplib2.Http(".cache")
    h.add_credentials('user', 'pass')
    r, content = h.request("https://api.github.com", "GET")
    print r['status']
    print r['content-type']

urllib2

Here is a similar example using urllib2, for comparison of the lines of code.

    import urllib2
    gh_url = 'https://example.com'
    auth_handler = urllib2.HTTPBasicAuthHandler()
    auth_handler.add_password(None, gh_url, 'user', 'password')
    opener = urllib2.
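
The snippets above are written for Python 2. For readers on Python 3, where urllib2 was folded into urllib.request, a rough equivalent of the urllib2 basic-auth setup might look like the sketch below. It is not taken from the original post: the helper name build_basic_auth_opener is hypothetical, and the URL and credentials are the same placeholders used above.

```python
import urllib.request


def build_basic_auth_opener(url, user, password):
    """Return an opener that answers HTTP Basic auth challenges for `url`.

    Hypothetical helper for illustration; not part of any library.
    """
    # A password manager with a default realm lets us register credentials
    # with realm=None, so they apply whatever realm the server announces.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, user, password)
    auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(auth_handler)


if __name__ == "__main__":
    # Placeholder URL and credentials; replace with real values before use.
    opener = build_basic_auth_opener("https://example.com", "user", "password")
    with opener.open("https://example.com", timeout=10) as response:
        print(response.status)
        print(response.headers.get("Content-Type"))
```

Certificate verification is on by default for https URLs in current Python 3 releases, so no extra SSL setup is needed for the common case.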
Read more →