using wget to get information from a web page, then using grep and sed to remove formatting

process: download files (script: base-get-book.sh)
#####################################
wget -O INFO-1 "http://www...page 1
wget -O INFO-2 "http://www...page 2
wget -O INFO-3 "http://www...page 3
#####################################

process: strip out formatting to get specific content (script: extract-text-book.sh)
#####################################
#!/bin/bash
#####################################
# INFO holds the raw downloads once they have been processed
mkdir -p INFO
for x in INFO-*
do
#####################################
# keep only the lines carrying the text, strip the markup,
# and start a new line at each number
grep "text $x" "$x" | sed 's/New American/ New American/g' | \
sed 's/[^<]*<\/span>/ /g' | \
sed 's/<[^>]*>//g' | sed 's/\ [[^]*\]]//g' | sed 's/ \([1-9]\)/\n\1/g' > "$x.txt"
#####################################
mv "$x" INFO
done
#####################################

process: make sure all files have proper names for sorting
ls > fix
#####################################
vi fix
remove the lines whose numbers already sort correctly, e.g. filename-10.txt,
and any line that is not a chapter file (the listing includes fix itself).
leave the lines that need to have a zero added, then:
:%s/.*/mv & &/g
results:
mv filename-1.txt filename-1.txt
then move the cursor to the number on the right and insert a 0 before the 1, to get:
mv filename-1.txt filename-01.txt
repeat (usually 1-9) until done.
:wq
#####################################
sh ./fix ; rm -f fix
#####################################
ls -al (should show all files sorted nicely)
#####################################

process: remove remaining internal brackets
use script to remove brackets: sh ./remove-brackets.sh *
#####################################
script: remove-brackets.sh
#!/bin/bash
echo "remove bracketed footnotes"
# one-character footnote markers, e.g. [a]
perl -pi -e 's/\[.\]//g' *.txt
# two-character footnote markers, e.g. [ab]
perl -pi -e 's/\[..\]//g' *.txt
#####################################

process: edit all the files to remove errors and put each entry on one line
vi INFO-*
after making sure there is a number on the left and all the text is on one line, then:
:%s/.*/INFO 9:&/g
:wn
:%s/.*/INFO 10:&/g
:wn
:%s/.*/INFO 11:&/g
:wn
# where INFO is the book or file name, and 9 is the chapter.
# The ampersand stands for the original line (line number plus text),
# which ends up to the right of the prefix.
# :wn writes the file and moves on to the next one.
repeat until done editing all files
#####################################

process: copy files to proper directories and other servers:
cp INFO-*.txt ../SORTED-INFO
scp -r ../SORTED-INFO/ 192.168.11.11:/home/luser/FILES/SORTED-INFO
or
rsync -r ../SORTED-INFO/ 192.168.11.11:/home/luser/FILES/SORTED-INFO
#####################################

repeat until all 1,189 chapters are cleaned up and sorted...
then test to see that there are 31,102 lines and, depending on the version, 781,621 words:
cat INFO-* > total-info.txt ; wc total-info.txt
(note: as of 3/7/2017 I'm on 14/66)
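
The renaming step above (adding a leading zero to chapters 1-9 by hand in vi) could probably be scripted instead. What follows is only a sketch, not one of the original scripts: the function name pad_names is made up for illustration, and it assumes the chapter files are named like filename-1.txt, with the chapter number after the last hyphen.

```shell
#!/bin/bash
# Hypothetical helper (not one of the original scripts): zero-pad
# single-digit chapter numbers so INFO-1.txt sorts before INFO-10.txt.
# Assumes file names end in -<number>.txt, as in the vi example above.
pad_names() {
    local dir="${1:-.}"      # directory to work in; defaults to the current one
    local f base digit
    for f in "$dir"/*-[1-9].txt; do
        [ -e "$f" ] || continue       # the glob matched nothing
        base="${f%.txt}"              # e.g. ./INFO-1
        digit="${base##*-}"           # e.g. 1
        # ./INFO-1.txt becomes ./INFO-01.txt
        mv -- "$f" "${base%-*}-0${digit}.txt"
    done
}
```

To try it, paste the function into a shell (or source the file), run pad_names . in the directory holding the chapter files, and check the result with ls -al. Files that already have two or more digits, such as filename-10.txt, are left alone.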