I have a shell snippet that finds all external JavaScript files referenced from thousands of arbitrary HTML pages that include them via <script src="…" with absolute URLs:
find ./ -type f -print0 | xargs -0 \
perl -nle 'print $1 \
while (m%<script[^>]+((https?:)?//[-./0-9A-Z_a-z]+)%ig);'
Since scripts could also be loaded dynamically within JavaScript itself, I'd like to expand my snippet to match any absolute URL-like string which ends in .js, and preferably appears within the script tags. (This won't be 100% accurate, but would probably be good enough to find a few extra cases of external scripts.)
I'm thinking of something like <script[^>]*>.*["']((((https?)?:)?//)?[-.0-9A-Za-z]+\.[A-Za-z]{2,}/[-./0-9A-Z_a-z]+\.js), and maybe also with .*</script> at the end.
A tricky part is ensuring that multiple mentions of .js within a single script result in multiple matches (which the regex above won't do by itself), while also making sure that the two expressions don't both match the same mention and print a given $1 string twice.
What would be a good way to add this new regex to the perl snippet I have?