using find, grep, perl and awk to find and replace a URL
30 January 2020
problem: - the website contained links to the sold-out DVD on Amazon; the links needed to point to O'Reilly instead
solution: - find the html files containing the Amazon URL and replace it with the O'Reilly Media URL
the simple test:
grep -i Amazon *.html
(that only works within one directory, and it shows every line that mentions Amazon, not just the DVD link)
a little more involved "test" (to make sure the special characters are escaped):
grep "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd" HEADER.html
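an alternative to escaping by hand, sketched on a scratch file: grep -F treats the pattern as a fixed string, so ? + . and & need no backslashes. (with GNU grep's default basic regex, \? and \+ are repetition operators rather than literal characters, so a hand-escaped pattern may not match the way it reads.) the scratch directory and the one-line HEADER.html below are made up for illustration:

```shell
# The URL to look for, single-quoted so the shell leaves ? & + alone.
old='https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd'

# Scratch file standing in for a real HEADER.html (hypothetical content).
tmp=$(mktemp -d)
printf '<a href="%s">LPIC-2 DVD</a>\n' "$old" > "$tmp/HEADER.html"

# -F: fixed-string match, no regex interpretation at all.
# -c: print the number of matching lines instead of the lines themselves.
count=$(grep -c -F "$old" "$tmp/HEADER.html")
echo "$count"
```

keeping the URL in a variable also means it is typed once and reused.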
with 4TB worth of files, find was needed to descend through the directories; however, a few problems resulted.
the command produced a lot of errors on stderr about a file being a directory, mixed in with the matches.
didn't take the time to send stderr to the bit bucket (2>/dev/null), or to use -type f, or even better, -name "*.html".
the main goal at first was simply to find the files, and simple always works.
this worked, well enough:
find . -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \;
this worked better (used later):
find jeep -type f -name "*.html" -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \;
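the refinements mentioned above (stderr to the bit bucket, -type f, -name) can be combined in one command. a sketch on a made-up scratch tree, not the command actually run; it also swaps \; for +, which hands grep many files per invocation instead of one:

```shell
old='https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd'

# Hypothetical scratch tree: one html file with the link, one without,
# a text file with the link, and a directory whose name ends in .html.
tmp=$(mktemp -d)
mkdir -p "$tmp/jeep/sub" "$tmp/jeep/old.html"
printf '<a href="%s">DVD</a>\n' "$old" > "$tmp/jeep/sub/HEADER.html"
printf 'no link here\n'              > "$tmp/jeep/index.html"
printf '%s\n' "$old"                 > "$tmp/jeep/notes.txt"

# -type f        skip directories (no "Is a directory" noise)
# -name '*.html' only look inside html files
# {} +           batch file names into as few grep calls as possible
# 2>/dev/null    discard any remaining errors (unreadable files, etc.)
hits=$(find "$tmp/jeep" -type f -name '*.html' \
         -exec grep -H -n -F "$old" {} + 2>/dev/null)
echo "$hits"
```

only sub/HEADER.html is reported; notes.txt and the old.html directory are never handed to grep.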
AT THIS POINT, the string that needed to be replaced was found...
Could see that it was primarily HEADER.html files
the command to FIND the pages works; next, test the PERL command on a copy of a HEADER.html file named test.html:
perl -pi -e 's$https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd$http://shop.oreilly.com/product/0636920050209.do$g' test.html
that was successful and tested with grep: grep -i amazon test.html
the next step was to run the perl substitution through find, starting with individual directories:
find /web/info/ -type f -name 'HEADER.html' -exec perl -pi -e 's$https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd$http://shop.oreilly.com/product/0636920050209.do$g' {} \;
find /web/linux/ -type f -name 'HEADER.html' -exec perl -pi -e 's$https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd$http://shop.oreilly.com/product/0636920050209.do$g' {} \;
once a few directories were tested successfully, the entire web tree was searched and updated:
find /web/ -type f -name 'HEADER.html' -exec \
perl -pi -e 's$https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd$http://shop.oreilly.com/product/0636920050209.do$g' {} \;
find . -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \; | awk '{print $1}'
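the default (whitespace) separator did not isolate the file name, so the field separator was added (-F :) to print only the text before the first colon... that worked... mostly...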
find . -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \; | awk -F : '{print $1}'
now that this find was working, its output was sent to a file that could be reviewed and used to create a script to update or delete the links; tee -a appended to the file while still showing the output.
find . -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \; | awk -F : '{print $1}' | tee -a LIST.txt
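when only the file names are wanted, grep -l prints each matching file once, so the awk step can be left out. a sketch under assumptions: the scratch tree is made up, and sort is added only to make the order predictable:

```shell
old='https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd'

# Hypothetical scratch tree: two files with the link (one has it twice), one without.
tmp=$(mktemp -d)
mkdir -p "$tmp/web/info" "$tmp/web/linux"
printf '<a href="%s">DVD</a>\n' "$old" > "$tmp/web/info/HEADER.html"
printf '<a href="%s">DVD</a>\n<a href="%s">again</a>\n' "$old" "$old" \
  > "$tmp/web/linux/HEADER.html"
printf 'nothing to see\n' > "$tmp/web/linux/index.html"

# -l: print just the name of each file that contains a match, once per file.
# tee -a: show the names on screen and append them to the list file.
find "$tmp/web" -type f -name '*.html' -exec grep -l -F "$old" {} + \
  | sort | tee -a "$tmp/LIST.txt"
```

linux/HEADER.html appears once in LIST.txt even though it contains the link twice, and file names containing a colon come through intact.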
while the find was running and producing the list, a few tweaks to awk were made, and then that output was sent to a file that was edited and turned into a script.
tail -f LIST.txt
cat LIST.txt | awk '{print $1}'
cat LIST.txt | awk '{print $1$2}'
cat LIST.txt | awk -F : '{print $1}'
cat LIST.txt | awk -F : '{print $1}' > edit-files
vi edit-files
find jeep -type f -name "*.html" -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \;
find info -type f -name "*.html" -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \;
find linux -type f -name "*.html" -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \;
found 3 non-HEADER files; fixed them with the commands in the quick script below:
--> cat edit-files
#!/bin/bash
perl -pi -e 's$https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd$http://shop.oreilly.com/product/0636920050209.do$g' linux/Intro-to-Linux/One-Hour-Linux-Sessions-2019.html
perl -pi -e 's$https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd$http://shop.oreilly.com/product/0636920050209.do$g' linux/Intro-to-Linux/One-Hour-Linux-Sessions-2018.html
perl -pi -e 's$https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref\=sr_1_1\?ie=UTF8\&qid\=1484728403\&sr\=8-1\&keywords\=lpic-2\+dvd$http://shop.oreilly.com/product/0636920050209.do$g' linux/LinuxMeister.net-Books-Videos-n-links.html
-------------------
testing:
grep -i amazon linux/LinuxMeister.net-Books-Videos-n-links.html
find linux -type f -name "*.html" -exec grep -H -n "https://www.amazon.com/Study-Guide-LPIC-2-Exams-201/dp/B01I25VO9A/ref=sr_1_1?ie=UTF8&qid=1484728403&sr=8-1&keywords=lpic-2+dvd" {} \;
JohnMeister.com