You've got a few issues, though only the first is really nasty:
- (By far the biggest cost, in all likelihood) Checking for membership in a list is O(n); for a 70K element list, that's a lot of work on every one of your 10M+ lines. Make it a set/frozenset and lookup is typically O(1), saving tens of thousands of comparisons per line. If the elements are unhashable, you can pre-sort selected_list and use the bisect module to do lookups in O(log n) time instead (see the sketch after this list), which would still be a multiple-orders-of-magnitude speedup for a list this large.
- If your lines are long, with several runs of whitespace, splitting at every one of them wastes time; pass maxsplit to str.split so it only splits far enough to extract the ID.
- If the IDs are always integer values, it may be worth making selected_id store int instead of str and converting each ID on read, so the lookup comparisons run a little faster (this would require benchmarking). It probably won't make a major difference, so I'll omit it from the example.
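For the unhashable fallback, a minimal sketch of the bisect approach might look like the following (selected_list is the list from the question; in_selected is a hypothetical helper name, and this assumes the elements are mutually comparable so they can be sorted):

import bisect

# Only relevant if the elements can't go in a set/frozenset
selected_sorted = sorted(selected_list)

def in_selected(value):
    # Binary search: find the leftmost insertion point, then check for a match
    i = bisect.bisect_left(selected_sorted, value)
    return i < len(selected_sorted) and selected_sorted[i] == value

Each call does O(log n) comparisons, roughly 17 for a 70K element list, instead of up to 70,000 for the list scan.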
Combining all suggestions:
selected_id = frozenset(... Your original list of 70k+ str elements ...)

for line in in_fp:  # input file: 10M+ lines
    # maxsplit=1 stops splitting after the first field; assumes no blank lines
    line_id = line.split(None, 1)[0]  # the ID as str, e.g. '10000872820081804'
    if line_id in selected_id:
        out_fp.write(line)
You could even collapse the for loop into a single writelines call with a generator expression (though it gets a little overly compact), which pushes more of the per-line work to the C layer in CPython, reducing Python bytecode execution overhead:

out_fp.writelines(line for line in in_fp if line.split(None, 1)[0] in selected_id)

Like the loop above, this assumes every line has at least one field; a blank line would make split(None, 1)[0] raise IndexError.
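If you want to confirm the lookup cost on your own data before committing, a rough timeit comparison along these lines should make the gap obvious (the IDs here are synthetic stand-ins sized like the question's):

import timeit

ids_list = [str(n) for n in range(10_000_000_000, 10_000_070_000)]  # 70K fake IDs
ids_set = frozenset(ids_list)
probe = ids_list[-1]  # worst case for the list: the last element

# 100 lookups each; the list scans ~70K elements per lookup,
# the frozenset does a single hash probe per lookup
print(timeit.timeit('probe in ids_list', globals=globals(), number=100))
print(timeit.timeit('probe in ids_set', globals=globals(), number=100))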