User:EmericusPetro/sandbox/Poor mans OpenStreetMap Data Items dumper


Poor man's OpenStreetMap Data Items dumper

This is a poor man's (for lack of a better expression in English) Wikibase RDF exporter, tested with Data Items. It recreates the RDF from Data Items by making request calls only to /wiki/Special:EntityData/.
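
For illustration, each item is fetched with a single plain HTTP request like the one below (a sketch using curl; the item Q2 and the user-agent string are placeholders, not the script's real defaults):

curl \
  --user-agent "my-data-items-dumper/0.1 (https://example.org; you@example.org)" \
  --silent \
  --fail \
  --output "Q2.ttl" \
  "https://wiki.openstreetmap.org/wiki/Special:EntityData/Q2.ttl?flavor=dump"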

In theory, with < 50 P properties and 20,000 Q items, a traditional server-side dump would take a few seconds (and the final .ttl would be less than 100 MB). But wikibase-wiki-dump-items.sh, with the default delay of 5 seconds, needs roughly (see also the shell sketch after this list):

  • 5 × 20,060 = 100,300 seconds
  • ≈ 1,672 minutes
  • ≈ 28 hours
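
The same estimate can be reproduced in the shell (a sketch; the item counts follow the script defaults, and integer division rounds the results down):

DELAY=5
ITEMS=$((60 + 20000))      # P1..P60 plus Q1..Q20000
TOTAL=$((DELAY * ITEMS))   # 100300 seconds
echo "${TOTAL} s = $((TOTAL / 60)) min = $((TOTAL / 3600)) h"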

So, yes, it takes more time. But it works, just not as fast. I do recommend looking for alternatives.

How the final content is merged

The script will download and cache the items (including errors) in local directories, and at the end will use rdfpipe from RDFLib (a Python package) to concatenate the results. The final Turtle files will already be in a normalized, pretty-printed form (ideal for git and diffs).
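
In practice the merge step boils down to one rdfpipe call per item type, roughly like this (a sketch; the paths are the script's defaults, and rdfpipe ships with the rdflib Python package):

pip install rdflib   # provides the rdfpipe command
rdfpipe \
  --input-format=ttl \
  --output-format=longturtle \
  data/cache-wiki-item-dump/Q*.ttl \
  >data/cache/Q.ttl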

Script

wikibase-wiki-dump-items.sh

#!/bin/bash
#===============================================================================
#
#          FILE:  wikibase-wiki-dump-items.sh
#
#         USAGE:  ./scripts/wikibase-wiki-dump-items.sh
#                 DUMP_LOG=dump.log.tsv ./scripts/wikibase-wiki-dump-items.sh
#                 DELAY=10 ./scripts/wikibase-wiki-dump-items.sh
#                 Q_START=1 Q_END=2 ./scripts/wikibase-wiki-dump-items.sh
#                 OPERATION=merge_p ./scripts/wikibase-wiki-dump-items.sh
#
#   DESCRIPTION:  This shell script will download Wikibase Ps and Qs in a
#                 less efficient way, one by one. It caches individual
#                 results on disk (including errors). At the end it will
#                 merge the output into single files, already well formatted
#                 in a predictable way (e.g. to allow diffs). The merging
#                 mechanism may be replaced with other tools (see
#                 REQUIREMENTS).
#
#       OPTIONS:  env WIKI_URL_ENTITYDATA=
#                     http://example.org/wiki/Special:EntityData/
#                 env DELAY
#                 env P_START
#                 env P_END
#                 env Q_START
#                 env Q_END
#                 env CACHE_ITEMS
#                 env CACHE_ITEMS_404
#                 env OUTPUT_DIR
#                 env OPERATION
#                 env USERAGENT
#                 env DUMP_LOG
#
#  REQUIREMENTS:  - curl
#                 - rdfpipe (pip install rdflib)
#                   - Used to merge results. Tested with rdflib 6.1.1. Feel
#                     free to use other tools to concatenate.
#
#          BUGS:  ---
#         NOTES:  ---
#        AUTHOR:  Emerson Rocha <rocha[at]ieee.org>
#       COMPANY:  EticaAI
#       LICENSE:  Public Domain dedication
#                 SPDX-License-Identifier: Unlicense
#       VERSION:  v1.0
#       CREATED:  2022-11-14 10:38 UTC
#      REVISION:  ---
#===============================================================================
set -e

ROOTDIR="$(pwd)"

#### Customizable environment variables _______________________________________
# User agent: https://meta.wikimedia.org/wiki/User-Agent_policy
USERAGENT="${USERAGENT:-"wikibase-wiki-dump-itemsbot/0.1 (https://github.com/fititnt/openstreetmap-wiki-rdf-exporter; rocha(at)ieee.org)"}"
WIKI_URL_ENTITYDATA="${WIKI_URL_ENTITYDATA:-"https://wiki.openstreetmap.org/wiki/Special:EntityData/"}"
P_START="${P_START:-"1"}"
P_END="${P_END:-"60"}"
Q_START="${Q_START:-"1"}"
Q_END="${Q_END:-"20000"}"
DELAY="${DELAY:-"5"}" # delay in seconds (after download success or error)
CACHE_ITEMS="${CACHE_ITEMS:-"$ROOTDIR/data/cache-wiki-item-dump"}"
CACHE_ITEMS_404="${CACHE_ITEMS_404:-"$ROOTDIR/data/cache-wiki-item-dump-404"}"
OUTPUT_DIR="${OUTPUT_DIR:-"$ROOTDIR/data/cache"}"
OPERATION="${OPERATION:-""}"
DUMP_LOG="${DUMP_LOG:-""}"

#### internal variables ________________________________________________________
#### Fancy colors constants - - - - - - - - - - - - - - - - - - - - - - - - - -
tty_blue=$(tput setaf 4)
tty_green=$(tput setaf 2)
tty_red=$(tput setaf 1)
tty_normal=$(tput sgr0)

## Example
# printf "\n\t%40s\n" "${tty_blue}${FUNCNAME[0]} STARTED ${tty_normal}"
# printf "\t%40s\n" "${tty_green}${FUNCNAME[0]} FINISHED OKAY ${tty_normal}"
# printf "\t%40s\n" "${tty_blue} INFO: [] ${tty_normal}"
# printf "\t%40s\n" "${tty_red} ERROR: [] ${tty_normal}"
#### Fancy colors constants - - - - - - - - - - - - - - - - - - - - - - - - - -

#### functions _________________________________________________________________

#######################################
# Main loop. The output to screen will be a valid .tsv format. Example:
#   item<tab>result
#   Q1<tab>error cached
#   Q2<tab>cached
#   Q3<tab>downloaded
#
# Globals:
#    CACHE_ITEMS
#    CACHE_ITEMS_404
#    DUMP_LOG
#
# Arguments:
#
# Outputs:
#
#######################################
main_loop_items() {
  printf "\n\t%40s\n" "${tty_blue}${FUNCNAME[0]} STARTED ${tty_normal}"

  if [ ! -d "$CACHE_ITEMS" ]; then
    printf "%s\n" "${tty_red} ERROR: env CACHE_ITEMS \
[$CACHE_ITEMS]? ${tty_normal}"
    exit 1
  fi

  if [ ! -d "$CACHE_ITEMS_404" ]; then
    printf "%s\n" "${tty_red} ERROR: env CACHE_ITEMS_404 \
[$CACHE_ITEMS_404]? ${tty_normal}"
    exit 1
  fi

  # tab-separated output, START
  printf "\n%s\t%s" "item" "result"
  if [ -n "$DUMP_LOG" ]; then
    printf "%s\t%s\n" "item" "result" >"$DUMP_LOG"
  fi

  for ((c = P_START; c <= P_END; c++)); do
    download_wiki_item "P${c}" ""
  done
  for ((c = Q_START; c <= Q_END; c++)); do
    download_wiki_item "Q${c}" "?flavor=dump"
  done
  echo ""
  # tab-separated output, END

  printf "\t%40s\n" "${tty_green}${FUNCNAME[0]} FINISHED OKAY ${tty_normal}"
}

#######################################
# Download an item if not already cached on disk
#
# Globals:
#   CACHE_ITEMS
#   CACHE_ITEMS_404
#   RDF_INPUT_EXT
#   WIKI_URL_ENTITYDATA
#   DELAY
#   DUMP_LOG
# Arguments:
#   item        string   (required) Examples: P2 , Q3, (...)
#   urlsuffix   string   (optional) Example:  ?flavor=dump
# Outputs:
#
#######################################
download_wiki_item() {
  item="$1"
  urlsuffix="${2-""}"
  # suffix=".nt"
  # printf "\n\t%40s\n" "${tty_blue}${FUNCNAME[0]} STARTED [$WIKI_URL_ENTITYDATA] [$item] ${tty_normal}"

  # https://www.wikidata.org/wiki/Wikidata:Data_access/pt-br#Less_verbose_RDF_output
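  # The "?flavor=dump" suffix (used for the Q items) asks Special:EntityData
  # for the less verbose, dump-style RDF output described in the link above.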

  if [ -f "${CACHE_ITEMS_404}/${item}.ttl" ]; then
    printf "\n%s\t%s" "${item}" "error cached"
    if [ -n "$DUMP_LOG" ]; then
      printf "%s\t%s\n" "${item}" "error cached" >>"$DUMP_LOG"
    fi
  elif [ -f "${CACHE_ITEMS}/${item}.ttl" ]; then
    printf "\n%s\t%s" "${item}" "cached"
    if [ -n "$DUMP_LOG" ]; then
      printf "%s\t%s\n" "${item}" "cached" >>"$DUMP_LOG"
    fi
  else
    EXIT_CODE="0"
    # set -x
    curl \
      --user-agent "$USERAGENT" \
      --silent \
      --fail \
      --output "${CACHE_ITEMS}/${item}.ttl" \
      "${WIKI_URL_ENTITYDATA}${item}.ttl${urlsuffix}" || EXIT_CODE=$?
    # set +x
    if [ "$EXIT_CODE" != "0" ]; then
      printf "\n%s\t%s" "${item}" "error"
      if [ -n "$DUMP_LOG" ]; then
        printf "%s\t%s\n" "${item}" "error" >>"$DUMP_LOG"
      fi
      touch "$CACHE_ITEMS_404/${item}.ttl"
    else
      # printf "\n%s" "${tty_green}${item}${tty_normal}"
      printf "\n%s\t%s" "${item}" "downloaded"
      if [ -n "$DUMP_LOG" ]; then
        printf "%s\t%s\n" "${item}" "downloaded" >>"$DUMP_LOG"
      fi
    fi
    # echo "before delay $DELAY"
    sleep "$DELAY"
    # echo "after delay"
  fi
  # printf "\t%40s\n" "${tty_green}${FUNCNAME[0]} FINISHED OKAY ${tty_normal}"
}

#######################################
# Merge the cached .ttl files of a given item type into a single normalized
# Turtle file, then apply small prefix hotfixes with sed.
#
# Globals:
#    CACHE_ITEMS
#    OUTPUT_DIR
#
# Arguments:
#    itemtype    string    Type of item. Values: Q , P
# Outputs:
#
#######################################
rdf_merge_items() {
  itemtype="$1"
  printf "\n\t%40s\n" "${tty_blue}${FUNCNAME[0]} STARTED ${tty_normal}"

  if [ ! -d "$CACHE_ITEMS" ]; then
    printf "%s\n" "${tty_red} ERROR: env CACHE_ITEMS \
[$CACHE_ITEMS]? ${tty_normal}"
    exit 1
  fi

  # set -x
  rdfpipe \
    --input-format=ttl \
    --output-format=longturtle \
    "${CACHE_ITEMS}/${itemtype}"*.ttl \
    >"${OUTPUT_DIR}/${itemtype}.ttl"
  # set +x

  printf "\t%40s\n" "${tty_blue} INFO: [hotfixes after formating] ${tty_normal}"
  set -x
  # Trying to be very specific, so unlikely edit text contents
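  # rdflib may serialize the schema.org namespace with an auto-numbered prefix
  # such as "schema1" (e.g. when its default "schema" binding points at a
  # different variant of the namespace); the rules below restore "schema:".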
  sed -i 's/^PREFIX schema1: /PREFIX schema: /' "${OUTPUT_DIR}/${itemtype}.ttl"
  sed -i 's/^    a schema1:/    a schema:/g' "${OUTPUT_DIR}/${itemtype}.ttl"
  sed -i 's/^    schema1:/    schema:/g' "${OUTPUT_DIR}/${itemtype}.ttl"
  # Input:  PREFIX p: <file://wiki.openstreetmap.org/prop/>
  # Output: PREFIX p: <https://wiki.openstreetmap.org/prop/>
  sed -i 's/^PREFIX p: <file:\/\//PREFIX p: <https:\/\//g' "${OUTPUT_DIR}/${itemtype}.ttl"

  # sed -r works on GNU sed (Not tested on OSX which may need sed -E instead)
  sed -i -r 's/^PREFIX ([a-z]*): <file:\/\//PREFIX \1: <https:\/\//g' "${OUTPUT_DIR}/${itemtype}.ttl"
  set +x

  printf "\t%40s\n" "${tty_green}${FUNCNAME[0]} FINISHED OKAY ${tty_normal}"
}

#### main ______________________________________________________________________

if [ -z "${OPERATION}" ] || [ "${OPERATION}" = "download" ]; then
  main_loop_items
fi

if [ -z "${OPERATION}" ] || [ "${OPERATION}" = "merge_p" ]; then
  # echo "TODO merge_p"
  rdf_merge_items "P"
fi

if [ -z "${OPERATION}" ] || [ "${OPERATION}" = "merge_q" ]; then
  # echo "TODO merge_q"
  rdf_merge_items "Q"
fi

References

Dump example