Jun 18, 2019

JSON Tools - jq

Compared with Python's json.tool, jq seems much more lightweight and powerful :)

Sample time diff:

With jq:
real 0m1.666s
user 0m0.262s
sys 0m0.290s 


With Python json.tool:
real 0m2.217s
user 0m0.492s
sys 0m0.539s
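The exact commands that were timed are not shown above. For context, here is a minimal sketch of roughly what `python -m json.tool` does with its default options (parse, then re-serialize with a 4-space indent); the jq counterpart would be `jq .`. The function name `pretty_json` is made up for illustration.

```python
import json


def pretty_json(text):
    """Parse a JSON string and re-serialize it with a 4-space indent.

    Hypothetical helper: it mirrors the default behaviour of
    `python -m json.tool`, not its actual implementation.
    """
    data = json.loads(text)  # raises json.JSONDecodeError on invalid input
    return json.dumps(data, indent=4)


# Example input is made up for illustration.
print(pretty_json('{"name": "jq", "tags": ["json", "cli"]}'))
```

Since both tools spend most of a short run on startup and I/O, timings like the ones above will vary with input size and machine.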



Useful links:

HTML parsing with Python


Scraping Data with Python and XPath:

Sample code from the reference link, which tells the whole story :)

import requests
from lxml import html

# Fetch the page and parse it into an element tree.
pageContent = requests.get('https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo')
tree = html.fromstring(pageContent.content)

# The XPaths below depend on the page's table markup.
goldWinners = tree.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a[1]/text()')
silverWinners = tree.xpath('//*[@id="mw-content-text"]/table/tr/td[3]/a[1]/text()')
# bronzeWinners: we need cells where there's no rowspan=2 (note the XPath predicate)
bronzeWinners = tree.xpath('//*[@id="mw-content-text"]/table/tr/td[not(@rowspan=2)]/a[1]/text()')
medalWinners = goldWinners + silverWinners + bronzeWinners

# Count medals per athlete (Python 3: use "in" instead of dict.has_key).
medalTotals = {}
for name in medalWinners:
    if name in medalTotals:
        medalTotals[name] = medalTotals[name] + 1
    else:
        medalTotals[name] = 1

# Print athletes sorted by medal count, highest first.
for result in sorted(medalTotals.items(), key=lambda x: x[1], reverse=True):
    print('%s:%s' % result)
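The tallying loop above can also be written with `collections.Counter` from the standard library. This is only a sketch of that alternative, not the reference article's code; the function name `tally_medals` and the sample names are made up for illustration.

```python
from collections import Counter


def tally_medals(medal_winners):
    """Return (name, count) pairs sorted by count, highest first.

    Hypothetical helper: Counter replaces the manual dict bookkeeping,
    and most_common() replaces the sorted(..., reverse=True) call.
    """
    return Counter(medal_winners).most_common()


# Made-up sample data standing in for the scraped name lists.
sample = ['Athlete A', 'Athlete B', 'Athlete A', 'Athlete C', 'Athlete A']
for name, total in tally_medals(sample):
    print('%s:%s' % (name, total))
```

Names with equal counts come out in the order they were first seen, which matches the stable sort used in the original loop.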


BeautifulSoup is another option, but its style differs from XPath.

Jun 13, 2019

HTML page parsing with xmllint xpath in BASH


Per HTML_parsers, there is no better HTML page parsing option for BASH. Inspired by Retrieve web using xpath, here is a summary of using xmllint xpath:


xpath='' # sample: '//div[@class = "tides"]'

# All helpers read the page from $HTML_PAGE and the expression from $xpath.
get_element_by_xpath() {
    echo "$HTML_PAGE" | xmllint --html --xpath "$xpath" - 2>/dev/null
}

get_element_text_by_xpath() {
    # Build the expression inline so the global $xpath is not modified.
    echo "$HTML_PAGE" | xmllint --html --xpath "${xpath}/text()" - 2>/dev/null
}

get_elements_count_by_xpath() {
    echo "$HTML_PAGE" | xmllint --html --xpath "count(${xpath})" - 2>/dev/null
}
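The three helpers above differ only in the XPath expression they hand to xmllint. As a rough sketch, the same idea can be driven from Python through `subprocess`, assuming `xmllint` is installed and on the PATH. The names `xpath_expression`, `xmllint_command`, and `query_html` are hypothetical, chosen for this example.

```python
import subprocess


def xpath_expression(xpath, mode="element"):
    """Build the expression for the element, text, or count variant."""
    if mode == "text":
        return xpath + "/text()"
    if mode == "count":
        return "count(" + xpath + ")"
    return xpath


def xmllint_command(xpath, mode="element"):
    """Return the argument list; the trailing "-" makes xmllint read stdin."""
    return ["xmllint", "--html", "--xpath", xpath_expression(xpath, mode), "-"]


def query_html(html_page, xpath, mode="element"):
    """Pipe html_page into xmllint and return its standard output.

    Assumes xmllint is available; stderr is captured and discarded,
    like the 2>/dev/null in the shell version.
    """
    result = subprocess.run(
        xmllint_command(xpath, mode),
        input=html_page,
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For instance, `query_html(page, '//div[@class = "tides"]', "count")` would run the count variant against the HTML held in `page`. Passing the arguments as a list avoids the shell quoting issues that an unquoted `$xpath` can cause.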